research-article

DeepDive: declarative knowledge base construction

Published: 24 April 2017

Abstract

The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is to frame traditional extract-transform-load (ETL) style data management problems as a single large statistical inference task that is declaratively defined by the user. DeepDive leverages the effectiveness and efficiency of statistical inference and machine learning for difficult extraction tasks without requiring users to write any probabilistic inference algorithms directly. Instead, domain experts interact with DeepDive by defining features or rules about the domain. DeepDive has been successfully applied to domains such as pharmacogenomics, paleobiology, and anti-human-trafficking enforcement, achieving human-caliber quality at machine-caliber scale. We present the applications, abstractions, and techniques used in DeepDive to accelerate the construction of such dark data extraction systems.

Published in

Communications of the ACM, Volume 60, Issue 5
May 2017
101 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3084186
Editor: Moshe Y. Vardi

      Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article, Research, Refereed
