DOI: 10.5555/2145432.2145598

Bootstrapped named entity recognition for product attribute extraction

Published: 27 July 2011

ABSTRACT

We present a named entity recognition (NER) system for extracting product attributes and values from listing titles. Information extraction from short listing titles presents a unique challenge because such titles lack informative context and grammatical structure. In this work, we combine supervised NER with bootstrapping to expand the seed list, and we output normalized results. Focusing on listings from eBay's clothing and shoes categories, our bootstrapped NER system identifies new brands that correspond to spelling variants and typographical errors of known brands, as well as novel brands. Among the top 300 new brands predicted, our system achieves 90.33% precision. To output normalized attribute values, we explore several string comparison algorithms and find that n-gram substring matching works well in practice.
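The normalization step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring function, so the Dice coefficient over character bigrams, the boundary padding, the 0.6 threshold, and the names `char_ngrams`, `ngram_similarity`, and `normalize_brand` are all assumptions chosen for illustration.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string.

    The '^' and '$' boundary markers are an assumed convention so that
    the first and last characters also contribute n-grams.
    """
    padded = f"^{text.lower().strip()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets (assumed scoring choice)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def normalize_brand(candidate, known_brands, threshold=0.6):
    """Map a candidate string to its closest known brand.

    Returns None when no known brand reaches the (hypothetical) threshold,
    which would correspond to treating the candidate as a novel brand.
    """
    best = max(
        known_brands,
        key=lambda brand: ngram_similarity(candidate, brand),
        default=None,
    )
    if best is not None and ngram_similarity(candidate, best) >= threshold:
        return best
    return None


# Hypothetical seed list and extracted candidates.
known = ["Nike", "Adidas", "Calvin Klein"]
print(normalize_brand("Addidas", known))       # spelling variant of a known brand
print(normalize_brand("calvin klien", known))  # typographical error
print(normalize_brand("Zzyzx", known))         # no close match in the seed list
```

Under these assumptions, a misspelled candidate shares most of its bigrams with the intended brand and is mapped to it, while a string with no overlapping bigrams falls below the threshold and is left unnormalized.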

Published in

EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing
July 2011
1647 pages
ISBN: 9781937284114

Publisher: Association for Computational Linguistics, United States

Qualifier: research-article

Overall Acceptance Rate: 73 of 234 submissions, 31%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader