ABSTRACT
Scholarly big data is, for many, an important instance of Big Data. Digital library search engines have been built to acquire, extract, and ingest large volumes of scholarly papers. This paper provides an overview of the scholarly big data released by CiteSeerX as of the end of 2015 and discusses how the data is acquired, its size and general quality, and how it is managed and accessed. Preliminary results on extracting semantic entities from the body text of scholarly papers with Wikifier show a bias towards general terms that appear in Wikipedia and against domain-specific terms. We argue that domain-specific terms will play a more important role in extracting important facts from scholarly papers.
Index Terms
- CiteSeerX data: semanticizing scholarly papers