ABSTRACT
Relational tables collected from HTML pages ("web tables") are used for a variety of tasks, including table extension, knowledge base completion, and data transformation. Most existing algorithms for these tasks assume that the data in the tables takes the form of binary relations, i.e., each relation relates a single entity to a value or to another entity. Our exploration of a large public corpus of web tables, however, shows that web tables contain a large fraction of non-binary relations, which state-of-the-art algorithms are likely to misinterpret. In this paper, we propose a categorisation scheme for web table columns that distinguishes the different types of relations appearing in tables on the Web and may help in designing algorithms that deal better with these types. Designing an automated classifier that can distinguish between different types of relations is non-trivial, because web tables are relatively small, contain a high level of noise, and often lack part of their key values. To make this distinction possible, we propose a set of features that goes beyond probabilistic functional dependencies: the features are computed on the union of multiple tables from the same web site and from different web sites, which overcomes the problem that single web tables are too small for the reliable calculation of functional dependencies.
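To illustrate why a single small table can be misleading, the following minimal sketch scores how well a candidate functional dependency holds, first on one table and then on the union of two tables. It is an assumption-laden illustration, not the feature set proposed in the paper: the majority-agreement score, the function name `fd_strength`, and the toy player/goals tables (where the season, part of the key, is missing from the table itself) are all hypothetical.

```python
from collections import Counter, defaultdict


def fd_strength(rows, lhs, rhs):
    """Fraction of rows whose rhs value matches the most common rhs value
    among rows sharing the same lhs values (1.0 means the FD holds exactly).

    This majority-agreement score is an illustrative assumption, not the
    measure used in the paper.
    """
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        groups[key][row[rhs]] += 1

    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 1.0  # an empty table violates nothing
    agreeing = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return agreeing / total


# Hypothetical tables from two pages of the same site. Each page covers one
# season, but the season appears only in the page title, not as a column.
table_season_1 = [
    {"player": "A", "goals": 10},
    {"player": "B", "goals": 7},
]
table_season_2 = [
    {"player": "A", "goals": 12},
    {"player": "B", "goals": 7},
]

# On a single table, player -> goals looks like a clean binary relation.
single = fd_strength(table_season_1, ["player"], "goals")

# On the union, player A has two different goal counts, so the dependency
# is violated, hinting that a key component (the season) is missing.
combined = fd_strength(table_season_1 + table_season_2, ["player"], "goals")

print(single, combined)
```

In this toy setting the dependency holds perfectly on the single table but only partially on the union, which is the kind of signal that a classifier for relation types could use.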