research-article

CiteSeerX data: semanticizing scholarly papers

Published: 26 June 2016

ABSTRACT

Scholarly big data is, for many, an important instance of Big Data. Digital library search engines have been built to acquire, extract, and ingest large volumes of scholarly papers. This paper provides an overview of the scholarly big data released by CiteSeerX as of the end of 2015 and discusses how the data is acquired, its size, general quality, data management, and accessibility. Preliminary results on extracting semantic entities from the body text of scholarly papers with Wikifier show a bias towards general terms that appear in Wikipedia and against domain-specific terms. We argue that the latter will play a more important role in extracting important facts from scholarly papers.


Published in

SBD '16: Proceedings of the International Workshop on Semantic Big Data
June 2016
83 pages
ISBN: 9781450342995
DOI: 10.1145/2928294

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 30 of 54 submissions, 56%
