Abstract
In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activity around the WebTables project itself, as well as the broader topic of informal online structured data. In this paper, we review the WebTables project and try to place it in the context of the decade of work that followed. We also show how the progress of the past ten years sets up an exciting agenda for the future, one that will draw upon many corners of the data management community.