research-article

Web table column categorisation and profiling

Authors:
Oliver Lehmberg, University of Mannheim, Germany
Christian Bizer, University of Mannheim, Germany

WebDB '16: Proceedings of the 19th International Workshop on Web and Databases, June 2016, Article No. 4, Pages 1–7, https://doi.org/10.1145/2932194.2932198

Published: 26 June 2016

ABSTRACT

Relational tables collected from HTML pages ("web tables") are used for a variety of tasks, including table extension, knowledge base completion, and data transformation. Most existing algorithms for these tasks assume that the data in the tables has the form of binary relations, i.e., that it relates a single entity to a value or to another entity. Our exploration of a large public corpus of web tables, however, shows that web tables contain a large fraction of non-binary relations, which will likely be misinterpreted by state-of-the-art algorithms. In this paper, we propose a categorisation scheme for web table columns that distinguishes the different types of relations appearing in tables on the Web and may help to design algorithms that deal better with these types. Designing an automated classifier that can distinguish between different types of relations is non-trivial, because web tables are relatively small, contain a high level of noise, and are often missing parts of their key values. To make this distinction possible, we propose a set of features that goes beyond probabilistic functional dependencies: the features use the union of multiple tables from the same web site and from different web sites, which overcomes the problem that single web tables are too small for the reliable calculation of functional dependencies.
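
The small-table problem named in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's features. It assumes a simple majority-based score for how well a set of columns determines another column, and the helper `fd_strength`, the column names, and the table contents are all hypothetical, made-up illustrations.

```python
from collections import Counter, defaultdict


def fd_strength(rows, lhs, rhs):
    """Share of rows that agree with the most frequent rhs value
    among all rows having the same lhs values (1.0 = dependency holds)."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        groups[key][row[rhs]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 0.0
    agreeing = sum(max(counts.values()) for counts in groups.values())
    return agreeing / total


# Two hypothetical tables from the same web site (invented data).
table_a = [
    {"team": "Team A", "season": "2014", "points": 71},
    {"team": "Team B", "season": "2014", "points": 88},
]
table_b = [
    {"team": "Team A", "season": "2015", "points": 82},
    {"team": "Team B", "season": "2015", "points": 84},
]
union = table_a + table_b

# Within one small table, team alone seems to determine points.
single_score = fd_strength(table_a, ["team"], "points")
# On the union, the same dependency breaks down ...
union_score = fd_strength(union, ["team"], "points")
# ... and only holds once the season is part of the key.
union_score_with_season = fd_strength(union, ["team", "season"], "points")
```

In this toy setting, each single table suggests a binary relation between team and points, while the union shows that the relation is non-binary because the value also depends on the season.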

Published in

WebDB '16: Proceedings of the 19th International Workshop on Web and Databases
June 2016
59 pages
ISBN: 9781450343107
DOI: 10.1145/2932194

Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
WebDB '16 Paper Acceptance Rate: 9 of 29 submissions, 31%. Overall Acceptance Rate: 30 of 100 submissions, 30%.