DeepDive: declarative knowledge base construction

Authors:
Ce Zhang (ETH Zurich, Zurich, Switzerland),
Christopher Ré (Stanford University, Stanford, CA),
Michael Cafarella (Lattice Data, Inc., Palo Alto, CA),
Christopher De Sa (Stanford University, Stanford, CA),
Alex Ratner (Stanford University, Stanford, CA),
Jaeho Shin (Lattice Data, Inc., Palo Alto, CA),
Feiran Wang (Stanford University, Stanford, CA),
Sen Wu (Stanford University, Stanford, CA)

Communications of the ACM, Volume 60, Issue 5, May 2017, pp. 93–102. https://doi.org/10.1145/3060586

Published: 24 April 2017

Abstract

The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is to frame traditional extract-transform-load (ETL) style data management problems as a single large statistical inference task that is declaratively defined by the user. DeepDive leverages the effectiveness and efficiency of statistical inference and machine learning for difficult extraction tasks, while not requiring users to directly write any probabilistic inference algorithms. Instead, domain experts interact with DeepDive by defining features or rules about the domain. DeepDive has been successfully applied to domains such as pharmacogenomics, paleobiology, and anti-human-trafficking enforcement, achieving human-caliber quality at machine-caliber scale. We present the applications, abstractions, and techniques used in DeepDive to accelerate the construction of such dark data extraction systems.
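
To make the idea of "features or rules" feeding "a single large statistical inference task" more concrete, the following is a minimal sketch and not DeepDive's actual language, API, or implementation. It assumes two hypothetical candidate facts, treats each as a Boolean random variable, attaches hand-picked weights to two features and one rule (in a real system such weights would normally be learned rather than set by hand), and estimates the marginal probability of each fact with a plain Gibbs sampler. All names (`VARIABLES`, `FACTORS`, `gibbs_marginals`, `exact_marginals`) and all weights are assumptions made for illustration.

```python
import itertools
import math
import random

# Hypothetical candidate facts; each one is a Boolean random variable.
VARIABLES = ["spouse(a, b)", "spouse(a, c)"]

# Hypothetical (weight, predicate) pairs. A factor contributes its weight
# to the score of an assignment whenever its predicate holds.
FACTORS = [
    # Feature: supporting textual evidence was seen for spouse(a, b).
    (2.0, lambda x: x["spouse(a, b)"]),
    # Feature: weak, slightly negative evidence for spouse(a, c).
    (-1.0, lambda x: x["spouse(a, c)"]),
    # Rule: a person is unlikely to have two spouses at once.
    (-3.0, lambda x: x["spouse(a, b)"] and x["spouse(a, c)"]),
]


def score(assignment):
    """Unnormalized log-probability of a full truth assignment."""
    return sum(weight for weight, holds in FACTORS if holds(assignment))


def gibbs_marginals(num_samples=20000, burn_in=1000, seed=0):
    """Estimate P(variable is true) for every variable by Gibbs sampling."""
    rng = random.Random(seed)
    state = {v: False for v in VARIABLES}
    counts = {v: 0 for v in VARIABLES}
    for step in range(burn_in + num_samples):
        for v in VARIABLES:
            # Resample v conditioned on the current values of the others.
            state[v] = True
            score_true = score(state)
            state[v] = False
            score_false = score(state)
            p_true = 1.0 / (1.0 + math.exp(score_false - score_true))
            state[v] = rng.random() < p_true
        if step >= burn_in:
            for v in VARIABLES:
                counts[v] += state[v]
    return {v: counts[v] / num_samples for v in VARIABLES}


def exact_marginals():
    """Brute-force marginals; feasible only for a toy model like this one."""
    totals = {v: 0.0 for v in VARIABLES}
    partition = 0.0
    for values in itertools.product([False, True], repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        mass = math.exp(score(assignment))
        partition += mass
        for v in VARIABLES:
            if assignment[v]:
                totals[v] += mass
    return {v: totals[v] / partition for v in VARIABLES}


if __name__ == "__main__":
    print("Gibbs estimate:", gibbs_marginals())
    print("Exact marginals:", exact_marginals())
```

In this sketch the user supplies only the declarative part (variables, features, and a rule); the sampler is generic and never changes when the rules do. The brute-force check is included only to show that the sampled marginals approach the exact ones on a model small enough to enumerate.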

Index Terms

Software and its engineering > Software notations and tools > General programming languages > Language features

Published in

Communications of the ACM, Volume 60, Issue 5, May 2017, 101 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3084186
Editor: Moshe Y. Vardi
Publisher: Association for Computing Machinery, New York, NY
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Qualifiers
- research-article
- Research
- Refereed
