research-article

Ten years of webtables

Published: 01 August 2018

Abstract

In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activities around the WebTables project itself, as well as the broad topic of informal online structured data. In this paper, we will review the WebTables project, and try to place it in the broader context of the decade of work that followed. We will also show how the progress over the past ten years sets up an exciting agenda for the future, and will draw upon many corners of the data management community.

Published in

Proceedings of the VLDB Endowment, Volume 11, Issue 12
August 2018
426 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
