DOI: 10.1145/2932194.2932198
research-article

Web table column categorisation and profiling

Published: 26 June 2016

ABSTRACT

Relational tables collected from HTML pages ("web tables") are used for a variety of tasks including table extension, knowledge base completion, and data transformation. Most of the existing algorithms for these tasks assume that the data in the tables has the form of binary relations, i.e., relates a single entity to a value or to another entity. Our exploration of a large public corpus of web tables, however, shows that web tables contain a large fraction of non-binary relations which will likely be misinterpreted by the state-of-the-art algorithms. In this paper, we propose a categorisation scheme for web table columns which distinguishes the different types of relations that appear in tables on the Web and may help to design algorithms which better deal with these different types. Designing an automated classifier that can distinguish between different types of relations is non-trivial, because web tables are relatively small, contain a high level of noise, and often miss partial key values. In order to be able to perform this distinction, we propose a set of features which goes beyond probabilistic functional dependencies by using the union of multiple tables from the same web site and from different web sites to overcome the problem that single web tables are too small for the reliable calculation of functional dependencies.
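The point about table size can be illustrated with a small sketch. This is not the feature set proposed in the paper: the function `fd_strength`, the helper `union_tables`, and the toy rows are all hypothetical, and the score is a simple per-value approximation of how strongly one column determines another. It only shows why a dependency that appears to hold in one small table may stop holding once several tables are unioned.

```python
from collections import Counter, defaultdict


def fd_strength(rows, lhs, rhs):
    """Approximate how well column `lhs` determines column `rhs`.

    For each distinct lhs value, take the share of its rows that carry the
    most frequent rhs value, then average over the distinct lhs values.
    Returns 1.0 when lhs -> rhs holds exactly, and None for empty input.
    (Hypothetical helper, not the paper's definition.)
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    if not groups:
        return None
    shares = [max(c.values()) / sum(c.values()) for c in groups.values()]
    return sum(shares) / len(shares)


def union_tables(tables):
    """Concatenate the rows of several tables that share a schema."""
    return [row for table in tables for row in table]


# Made-up toy data: two small tables with the schema (team, season, coach).
table_a = [
    {"team": "A", "season": "2014", "coach": "X"},
    {"team": "B", "season": "2014", "coach": "Y"},
]
table_b = [
    {"team": "A", "season": "2016", "coach": "Z"},
    {"team": "B", "season": "2016", "coach": "Y"},
]

# In the single table, team -> coach looks like an exact dependency.
single = fd_strength(table_a, "team", "coach")

# In the union, team "A" maps to two coaches, so the score drops.
combined = fd_strength(union_tables([table_a, table_b]), "team", "coach")

print(single, combined)
```

In this toy case the single table suggests a binary relation between team and coach, while the union reveals that the coach also depends on the season, which is the kind of non-binary relation the abstract describes.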

Published in

    WebDB '16: Proceedings of the 19th International Workshop on Web and Databases
    June 2016
    59 pages
    ISBN: 9781450343107
    DOI: 10.1145/2932194

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    WebDB '16 Paper Acceptance Rate: 9 of 29 submissions, 31%
    Overall Acceptance Rate: 30 of 100 submissions, 30%