DOI: 10.1145/3034786.3056124
research-article

Data Integration: After the Teenage Years

Published: 09 May 2017

ABSTRACT

The field of data integration has expanded significantly over the years, from providing a uniform query and update interface to structured databases within an enterprise to the ability to search, exchange, and even update structured or unstructured data that are within or external to the enterprise. This paper describes the evolution in the landscape of data integration since the work on rewriting queries using views in the mid-1990s. In addition, we describe two important challenges for the field going forward. The first challenge is to develop good open-source tools for different components of data integration pipelines. The second challenge is to provide practitioners with viable solutions for the long-standing problem of systematically combining structured and unstructured data.

      Published in

        PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
        May 2017
        458 pages
        ISBN:9781450341981
        DOI:10.1145/3034786

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        PODS '17 paper acceptance rate: 29 of 101 submissions, 29%. Overall acceptance rate: 642 of 2,707 submissions, 24%.
