DOI: 10.1145/1150402.1150457
Article

Simultaneous record detection and attribute labeling in web data extraction

Published: 20 August 2006

ABSTRACT

Recent work has shown the feasibility and promise of template-independent Web data extraction. However, existing approaches use decoupled strategies, attempting to do data record detection and attribute labeling in two separate phases. In this paper, we show that separately extracting data records and attributes is highly ineffective, and we propose a probabilistic model to perform these two tasks simultaneously. In our approach, record detection can benefit from the availability of semantics required in attribute labeling and, at the same time, the accuracy of attribute labeling can be improved when data records are labeled in a collective manner. The proposed model is called Hierarchical Conditional Random Fields. It can efficiently integrate all useful features by learning their importance, and it can also incorporate hierarchical interactions, which are very important for Web data extraction. We empirically compare the proposed model with existing decoupled approaches for product information extraction, and the results show significant improvements in both record detection and attribute labeling.
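
The abstract does not give the model's equations. As a rough illustration of what "integrating features by learning their importance" usually means for a conditional random field, the standard log-linear CRF form is sketched below. This is the generic textbook form, not the paper's exact hierarchical model; the symbols x (observed page elements), y (labels), C (cliques of the graph), f_k (feature functions) and \lambda_k (learned weights) are notation assumed here for illustration.

```latex
% Generic CRF form (illustrative only; not taken from the paper)
% x: observed page elements, y: label assignment, C: cliques of the graph
% f_k: feature functions, \lambda_k: learned weights (feature "importance")
p(y \mid x) = \frac{1}{Z(x)}
  \exp\Big( \sum_{c \in C} \sum_{k} \lambda_k \, f_k(y_c, x) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{c \in C} \sum_{k} \lambda_k \, f_k(y'_c, x) \Big)
```

Under this reading, a hierarchical variant would presumably include cliques that connect parent and child nodes, so that record-level and attribute-level labels are scored in one joint distribution rather than in two separate phases.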


  • Published in

    KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2006
    986 pages
    ISBN: 1595933395
    DOI: 10.1145/1150402

    Copyright © 2006 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States


