ABSTRACT
Keeping up with the ever-expanding flow of data and publications is untenable and poses a fundamental bottleneck to scientific progress. Current search technologies typically find many relevant documents, but they do not extract and organize the information content of these documents or suggest new scientific hypotheses based on this organized content. We present an initial case study on KnIT, a prototype system that mines the information contained in the scientific literature, represents it explicitly in a queryable network, and then further reasons upon these data to generate novel and experimentally testable hypotheses. KnIT combines entity detection with neighbor-text feature analysis and with graph-based diffusion of information to identify potential new properties of entities that are strongly implied by existing relationships. We discuss a successful application of our approach that mines the published literature to identify new protein kinases that phosphorylate the tumor suppressor protein p53. Retrospective analysis demonstrates the accuracy of this approach, and ongoing laboratory experiments suggest that kinases identified by our system may indeed phosphorylate p53. These results establish proof of principle for automated hypothesis generation and discovery based on text mining of the scientific literature.
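The graph-based diffusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the algorithm, so this assumes a generic label-propagation scheme in which known scores are repeatedly blended with symmetrically normalized neighbor scores. The function name `diffuse_labels`, the parameters `alpha` and `iterations`, and the node names `KIN_A` through `KIN_D` are all hypothetical and chosen only for illustration; this is not presented as KnIT's actual implementation.

```python
import math


def diffuse_labels(weights, seeds, alpha=0.5, iterations=50):
    """Spread seed scores over a weighted, undirected similarity graph.

    weights: dict mapping node -> {neighbor: similarity weight}; assumed symmetric.
    seeds:   dict mapping node -> initial score (e.g. 1.0 for entities with a known property).
    Returns a dict mapping every node to its diffused score.
    """
    nodes = list(weights)
    degree = {n: sum(weights[n].values()) for n in nodes}
    initial = {n: seeds.get(n, 0.0) for n in nodes}
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Neighbor contribution, normalized as w_ij / sqrt(d_i * d_j).
            spread = sum(
                w * scores[m] / math.sqrt(degree[n] * degree[m])
                for m, w in weights[n].items()
            )
            # Blend diffused evidence with the node's original seed score.
            updated[n] = alpha * spread + (1 - alpha) * initial[n]
        scores = updated
    return scores


# Toy graph (hypothetical): KIN_A and KIN_B are treated as known positives;
# KIN_C resembles both strongly, KIN_D is only weakly linked to KIN_C.
weights = {
    "KIN_A": {"KIN_C": 1.0},
    "KIN_B": {"KIN_C": 1.0},
    "KIN_C": {"KIN_A": 1.0, "KIN_B": 1.0, "KIN_D": 0.2},
    "KIN_D": {"KIN_C": 0.2},
}
seeds = {"KIN_A": 1.0, "KIN_B": 1.0}

scores = diffuse_labels(weights, seeds)

# Rank the unlabeled candidates by how much evidence reached them.
ranked = sorted((n for n in weights if n not in seeds), key=scores.get, reverse=True)
print(ranked)
```

In this sketch, candidates that sit close to many known positives in the similarity graph accumulate higher scores, which is the general intuition behind using diffusion to propose properties that existing relationships imply.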
Index Terms
- Automated hypothesis generation based on mining scientific literature