Abstract
The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is to frame traditional extract-transform-load (ETL) style data management problems as a single large statistical inference task that is declaratively defined by the user. DeepDive leverages the effectiveness and efficiency of statistical inference and machine learning for difficult extraction tasks, without requiring users to write any probabilistic inference algorithms directly. Instead, domain experts interact with DeepDive by defining features or rules about the domain. DeepDive has been successfully applied to domains such as pharmacogenomics, paleobiology, and anti-human-trafficking enforcement, achieving human-caliber quality at machine-caliber scale. We present the applications, abstractions, and techniques used in DeepDive to accelerate the construction of such dark data extraction systems.