DOI: 10.1145/3097983.3098073
research-article
Public Access

GELL: Automatic Extraction of Epidemiological Line Lists from Open Sources

Published: 13 August 2017

ABSTRACT

Real-time monitoring of and response to emerging public health threats rely on the availability of timely surveillance data. During the early stages of an epidemic, readily available line lists, which hold detailed tabular information about laboratory-confirmed cases, can help epidemiologists make reliable inferences and forecasts. Such inferences are crucial to understanding the epidemiology of a specific disease early enough to stop or control the outbreak. However, constructing such line lists requires considerable human supervision, and they are therefore difficult to generate in real time. In this paper, we motivate Guided Epidemiological Line List (GELL), the first tool for automatically building line lists (in near real time) from open source reports of emerging disease outbreaks. Specifically, we focus on deriving epidemiological characteristics of an emerging disease and the affected population from reports of illness. GELL uses distributed vector representations (à la word2vec) to discover a set of indicators for each line list feature. This discovery of indicators is followed by dependency-parsing-based techniques that perform the final extraction in tabular form. We evaluate the performance of GELL against a human-annotated line list, provided by HealthMap, corresponding to MERS outbreaks in Saudi Arabia. We demonstrate that GELL extracts line list features with higher accuracy than a baseline method. We further show how these automatically extracted line list features can be used to make epidemiological inferences, such as inferring the demographics and the symptoms-to-hospitalization period of affected individuals.
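The indicator-discovery step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the seed words, the three-dimensional toy vectors, and the helper names `cosine` and `discover_indicators` are all assumptions made for illustration. In practice the vectors would presumably be learned from a corpus of outbreak reports with a word2vec-style model. The sketch only shows the general idea of ranking vocabulary words by cosine similarity to a seed word for a line list feature.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only (assumption);
# a real system would learn vectors from a corpus of outbreak reports.
TOY_VECTORS = {
    "onset": [0.9, 0.1, 0.0],
    "symptoms": [0.8, 0.2, 0.1],
    "fever": [0.7, 0.3, 0.0],
    "hospitalized": [0.1, 0.9, 0.1],
    "admitted": [0.2, 0.8, 0.0],
    "camel": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def discover_indicators(seed, vectors, top_k=2):
    """Return the top_k words most similar to the seed word (hypothetical helper)."""
    seed_vec = vectors[seed]
    scored = [
        (cosine(seed_vec, vec), word)
        for word, vec in vectors.items()
        if word != seed  # the seed itself is not a new indicator
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_k]]


# Candidate indicators for two hypothetical line list features.
onset_indicators = discover_indicators("onset", TOY_VECTORS)
hospital_indicators = discover_indicators("hospitalized", TOY_VECTORS, top_k=1)
print(onset_indicators, hospital_indicators)
```

With these toy vectors, the words closest to "onset" are "symptoms" and "fever", and the word closest to "hospitalized" is "admitted". A dependency parser would then be applied to sentences containing such indicators to pull out the associated values (dates, ages, and so on); that second stage is not shown here.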

    • Published in

      KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
      August 2017
      2240 pages
      ISBN: 9781450348874
      DOI: 10.1145/3097983

      Copyright © 2017 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      KDD '17 paper acceptance rate: 64 of 748 submissions, 9%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
