ABSTRACT
Recent work has shown the feasibility and promise of template-independent Web data extraction. However, existing approaches use decoupled strategies, performing data record detection and attribute labeling in two separate phases. In this paper, we show that separately extracting data records and attributes is highly ineffective and propose a probabilistic model to perform these two tasks simultaneously. In our approach, record detection can benefit from the availability of semantics required in attribute labeling and, at the same time, the accuracy of attribute labeling can be improved when data records are labeled in a collective manner. The proposed model is called Hierarchical Conditional Random Fields. It can efficiently integrate all useful features by learning their importance, and it can also incorporate hierarchical interactions, which are very important for Web data extraction. We empirically compare the proposed model with existing decoupled approaches for product information extraction, and the results show significant improvements in both record detection and attribute labeling.