ABSTRACT
Many language processing tasks can be reduced to breaking the text into segments with prescribed properties. Such tasks include sentence splitting, tokenization, named-entity extraction, and chunking. We present a new model of text segmentation based on ideas from multilabel classification. Using this model, we can naturally represent segmentation problems involving overlapping and non-contiguous segments. We evaluate the model on entity extraction and noun-phrase chunking and show that it is more accurate for overlapping and non-contiguous segments, but it still performs well on simpler data sets for which sequential tagging has been the best method.
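To make the contrast with sequential tagging concrete, here is a minimal sketch of why overlapping and non-contiguous segments are awkward for a one-tag-per-token encoding. It is an illustration under stated assumptions, not the model from the paper: the `Segment` representation (a set of token indices), the helper names, and the example phrase are all hypothetical.

```python
from typing import FrozenSet, List, Set

# Hypothetical representation: a segment is the set of token indices it covers.
Segment = FrozenSet[int]


def is_contiguous(segment: Segment) -> bool:
    """True if the segment covers an unbroken run of token positions."""
    return max(segment) - min(segment) + 1 == len(segment)


def has_overlap(segments: Set[Segment]) -> bool:
    """True if any token index belongs to more than one segment."""
    seen: Set[int] = set()
    for seg in segments:
        if seen & seg:
            return True
        seen |= seg
    return False


def to_bio(n_tokens: int, segments: Set[Segment]) -> List[str]:
    """Encode segments with one B/I/O tag per token, or raise if that is impossible."""
    if has_overlap(segments) or not all(is_contiguous(s) for s in segments):
        raise ValueError(
            "one tag per token cannot encode overlapping or non-contiguous segments"
        )
    tags = ["O"] * n_tokens
    for seg in segments:
        first = min(seg)
        for i in seg:
            tags[i] = "B" if i == first else "I"
    return tags


# Invented example phrase, not taken from the paper's data.
tokens = ["left", "and", "right", "ventricle"]

# Disjoint, contiguous segments fit a single tag sequence.
simple = {frozenset({0}), frozenset({2, 3})}
simple_tags = to_bio(len(tokens), simple)

# "left ... ventricle" and "right ventricle" share token 3, and the first is
# non-contiguous. A set of segments holds both; one tag sequence cannot.
flexible = {frozenset({0, 3}), frozenset({2, 3})}
```

Treating the output as a set of segments, each of which can be accepted or rejected on its own, is what lets a multilabel view hold both `flexible` segments at once. How the paper scores and selects segments is not described in the abstract and is not modeled here.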
- Flexible text segmentation with structured multilabel classification