ABSTRACT
The field of data integration has expanded significantly over the years, from providing a uniform query and update interface to structured databases within an enterprise to the ability to search, exchange, and even update structured or unstructured data that are within or external to the enterprise. This paper describes the evolution in the landscape of data integration since the work on rewriting queries using views in the mid-1990s. In addition, we describe two important challenges for the field going forward. The first challenge is to develop good open-source tools for different components of data integration pipelines. The second challenge is to provide practitioners with viable solutions for the long-standing problem of systematically combining structured and unstructured data.
Index Terms
- Data Integration: After the Teenage Years