Simultaneous record detection and attribute labeling in web data extraction

Authors:
Jun Zhu (Tsinghua University, Beijing, China)
Zaiqing Nie (Microsoft Research Asia, Beijing, China)
Ji-Rong Wen (Microsoft Research Asia, Beijing, China)
Bo Zhang (Tsinghua University, Beijing, China)
Wei-Ying Ma (Microsoft Research Asia, Beijing, China)

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2006, Pages 494–503
https://doi.org/10.1145/1150402.1150457

Published: 20 August 2006

ABSTRACT

Recent work has shown the feasibility and promise of template-independent Web data extraction. However, existing approaches use decoupled strategies that attempt data record detection and attribute labeling in two separate phases. In this paper, we show that separately extracting data records and attributes is highly ineffective and propose a probabilistic model to perform these two tasks simultaneously. In our approach, record detection can benefit from the availability of semantics required in attribute labeling and, at the same time, the accuracy of attribute labeling can be improved when data records are labeled in a collective manner. The proposed model is called Hierarchical Conditional Random Fields. It can efficiently integrate all useful features by learning their importance, and it can also incorporate hierarchical interactions which are very important for Web data extraction. We empirically compare the proposed model with existing decoupled approaches for product information extraction, and the results show significant improvements in both record detection and attribute labeling.
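
The contrast between decoupled and simultaneous labeling can be illustrated with a toy example. The sketch below is not the authors' Hierarchical Conditional Random Fields model: it uses a two-level tree (one block with two leaves), hand-picked weights, and brute-force maximization of an unnormalized log-linear score. All label names, scores, and function names are hypothetical and chosen only to show how parent-child interactions can change the result.

```python
from itertools import product

# Hypothetical label sets for a two-level page tree (not taken from the paper).
BLOCK_LABELS = ("record", "noise")
LEAF_LABELS = ("name", "price", "other")

# Hypothetical local scores: how well each leaf's text fits each attribute label.
LEAF_SCORES = [
    {"name": 1.0, "price": 0.0, "other": 1.2},  # leaf 0: ambiguous text
    {"name": 0.0, "price": 2.0, "other": 0.3},  # leaf 1: looks like a price
]

# Hypothetical local scores for the parent block; layout evidence alone is weak.
BLOCK_SCORES = {"record": 0.4, "noise": 0.5}


def compat(block, leaf):
    """Hypothetical parent-child score: records prefer attribute-bearing leaves."""
    if block == "record":
        return 1.0 if leaf in ("name", "price") else -0.5
    return 0.5 if leaf == "other" else -1.0


def joint_score(block, leaves):
    """Unnormalized score of one full labeling of the tree."""
    score = BLOCK_SCORES[block]
    for local, leaf in zip(LEAF_SCORES, leaves):
        score += local[leaf] + compat(block, leaf)
    return score


def joint_map():
    """Label the block and its leaves together by exhaustive search."""
    leaf_assignments = product(LEAF_LABELS, repeat=len(LEAF_SCORES))
    candidates = product(BLOCK_LABELS, leaf_assignments)
    return max(candidates, key=lambda c: joint_score(*c))


def decoupled():
    """Fix the block label from its local score first, then label the leaves."""
    block = max(BLOCK_LABELS, key=BLOCK_SCORES.get)
    leaves = tuple(
        max(LEAF_LABELS, key=lambda label: local[label] + compat(block, label))
        for local in LEAF_SCORES
    )
    return block, leaves


if __name__ == "__main__":
    print("joint:    ", joint_map())
    print("decoupled:", decoupled())
```

With these made-up weights, the decoupled pipeline commits to "noise" for the block before looking at its leaves, while the joint search lets the strong "price" evidence in a leaf pull the block toward "record". Exhaustive search is used only because the tree is tiny; it does not scale to real pages.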

Published in
KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2006
986 pages
ISBN: 1595933395
DOI: 10.1145/1150402
Conference Chair: Tina Eliassi-Rad (LLNL)
General Chair: Lyle Ungar (University of Pennsylvania)
Program Chairs: Mark Craven (University of Wisconsin), Dimitrios Gunopulos (University of California, Riverside)
Copyright © 2006 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
attribute labeling
conditional random fields
data record detection
hierarchical conditional random fields
web page segmentation
