Flexible text segmentation with structured multilabel classification

Authors:
Ryan McDonald, Koby Crammer, Fernando Pereira
University of Pennsylvania, Philadelphia, PA
HLT '05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, October 2005, Pages 987–994
https://doi.org/10.3115/1220575.1220699

Published: 06 October 2005

ABSTRACT

Many language processing tasks can be reduced to breaking the text into segments with prescribed properties. Such tasks include sentence splitting, tokenization, named-entity extraction, and chunking. We present a new model of text segmentation based on ideas from multilabel classification. Using this model, we can naturally represent segmentation problems involving overlapping and non-contiguous segments. We evaluate the model on entity extraction and noun-phrase chunking and show that it is more accurate for overlapping and non-contiguous segments, but it still performs well on simpler data sets for which sequential tagging has been the best method.
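
To illustrate why overlapping and non-contiguous segments are awkward for sequential tagging, here is a minimal sketch. It is not the authors' model: the segment representation (a set of token indices), the function names, and the example sentence are all assumptions made for illustration. It only shows that a one-tag-per-token B/I/O encoding cannot express such segments, whereas a set-of-segments output can.

```python
from typing import FrozenSet, List, Set

# Hypothetical representation: a segment is a non-empty set of token indices.
Segment = FrozenSet[int]


def is_contiguous(segment: Segment) -> bool:
    """True if the segment covers an unbroken run of token positions."""
    return max(segment) - min(segment) + 1 == len(segment)


def has_overlap(segments: Set[Segment]) -> bool:
    """True if any token belongs to more than one segment."""
    seen: Set[int] = set()
    for seg in segments:
        if seen & seg:
            return True
        seen |= seg
    return False


def to_bio(n_tokens: int, segments: Set[Segment]) -> List[str]:
    """Encode segments as a single B/I/O tag per token.

    Raises ValueError for overlapping or non-contiguous segments,
    because one tag per token cannot express either case.
    """
    if has_overlap(segments):
        raise ValueError("overlapping segments need more than one tag per token")
    tags = ["O"] * n_tokens
    for seg in segments:
        if not is_contiguous(seg):
            raise ValueError("non-contiguous segment cannot be tagged as B/I/O")
        first = min(seg)
        for i in seg:
            tags[i] = "B" if i == first else "I"
    return tags


# Invented example sentence (not from the paper).
tokens = ["mouse", "and", "human", "genes"]

# A simple case that sequential tagging handles: "human genes".
simple = {frozenset({2, 3})}

# A harder case: "mouse ... genes" is non-contiguous and shares
# the token "genes" with "human genes".
hard = {frozenset({0, 3}), frozenset({2, 3})}

simple_tags = to_bio(len(tokens), simple)
```

In this sketch the set `hard` is a perfectly valid output, but calling `to_bio` on it raises `ValueError`, which is the representational gap that a multilabel formulation is meant to close.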

Published in
HLT '05 proceedings, October 2005, 1054 pages
Conference Chair: Raymond J. Mooney, The University of Texas at Austin
Publisher: Association for Computational Linguistics, United States

Acceptance Rates
HLT '05 paper acceptance rate: 127 of 402 submissions, 32%. Overall acceptance rate: 240 of 768 submissions, 31%.