ABSTRACT
We address the e-rulemaking problem of reducing the manual labor required to analyze public comment sets. In current and previous work, text categorization techniques have been used to speed up the comment analysis phase of e-rulemaking by automatically classifying sentences according to the rule-specific issues [2] or general topics [7, 8] that they address. Manually annotated data, however, is still required to train the supervised inductive learning algorithms that perform the categorization. This paper therefore investigates active learning methods for public comment categorization: we develop two new, general-purpose active learning techniques that selectively sample from the available training data for human labeling when building the sentence-level classifiers used in public comment categorization. Using an e-rulemaking corpus developed for our purposes [2], we compare our methods to the well-known query by committee (QBC) active learning algorithm [5] and to a baseline that randomly selects instances for labeling in each round of active learning. Our methods outperform both the random-selection learner and the QBC variation by a statistically significant margin, requiring many fewer training examples to reach the same levels of accuracy on a held-out test set. This provides promising evidence that automated text categorization methods might be used effectively to support public comment analysis.
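The two comparison points named in the abstract, query by committee and random selection, can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce their two new techniques, which the abstract does not describe. It assumes vote entropy as the disagreement measure (one common QBC choice), and the function names, the keyword-based committee members, and the example sentences are all hypothetical.

```python
import math
import random
from collections import Counter


def vote_entropy(votes):
    """Disagreement among committee votes; 0.0 when the vote is unanimous."""
    total = len(votes)
    counts = Counter(votes)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def select_qbc(pool, committee):
    """Index of the unlabeled item on which the committee disagrees most."""
    scores = [vote_entropy([member(x) for member in committee]) for x in pool]
    return max(range(len(pool)), key=scores.__getitem__)


def select_random(pool, rng):
    """Baseline: index of a uniformly random unlabeled item."""
    return rng.randrange(len(pool))


def label_one(pool, labeled, index, oracle):
    """Move pool[index] into the labeled set, asking the human oracle for its label."""
    x = pool.pop(index)
    labeled.append((x, oracle(x)))
    return x


# Hypothetical committee: three keyword rules standing in for trained classifiers.
committee = [
    lambda s: "cost" if "price" in s else "safety",
    lambda s: "cost" if "expensive" in s else "safety",
    lambda s: "cost" if "fee" in s else "safety",
]

# Hypothetical unlabeled comment sentences.
pool = [
    "the fee is too expensive",
    "helmets save lives",
    "the price and the fee are expensive",
]

qbc_choice = select_qbc(pool, committee)
random_choice = select_random(pool, random.Random(0))
```

In a full active learning loop, the chosen sentence would be labeled by a human annotator (the `oracle` argument of `label_one`), the committee retrained on the enlarged labeled set, and the selection repeated each round.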
References
- K. Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of ICML-03, 20th International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, US, 2003.
- C. Cardie, C. Farina, M. Rawding, A. Aijaz, and S. Purpura. A Study in Rule-Specific Issue Categorization for e-Rulemaking. In Proceedings of the 9th Annual International Conference on Digital Government Research, 2008.
- C. Coglianese. Weak democracy, strong information: The role of information technology in the rulemaking process. In V. Mayer-Schoenberger and D. Lazer, editors, Electronic Government to Information Government: Governing in the 21st Century, 2007.
- D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
- Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133-168, 1997.
- C. Kerwin. The state of rulemaking in the federal government. Technical report, Transcript Panel 1, 2005.
- N. Kwon and E. Hovy. Information acquisition using multiple classifications. In Proceedings of the Fourth International Conference on Knowledge Capture (K-CAP 2007), 2007.
- N. Kwon, E. Hovy, and S. Shulman. Multidimensional text analysis for erulemaking. In Proceedings of the 7th Annual International Conference on Digital Government Research, 2006.
- D. D. Lewis and J. Catlett. Heterogeneous Uncertainty Sampling for Supervised Learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 148-156, Rutgers University, New Brunswick, NJ, 1994. Morgan Kaufmann.
- P. Melville and R. Mooney. Diverse ensembles for active learning. In Proceedings of ICML-04, 21st International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, US, 2004.
- I. Muslea, S. Minton, and C. Knoblock. Selective sampling with redundant views. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 621-626, 2000.
- K. Papineni. Why inverse document frequency? In Proceedings of the North American Association for Computational Linguistics, NAACL, pages 25-32, 2001.
- M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
- S. Purpura and D. Hillard. Automated Classification of Congressional Legislation. In Proceedings of the 7th Annual International Conference on Digital Government Research, 2006.
- G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
- B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). MIT Press, Cambridge, MA, 2002.
- H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Computational Learning Theory, pages 287-294, 1992.
- S. Shulman. Perverse incentives: The case against mass e-mail campaigns. In Proceedings of the Annual Meeting of the American Political Science Association, 2008.
- P. Strauss, T. Rakoff, and C. Farina. Administrative Law. 10th edition, 2003.
- V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
- H. Yang and J. Callan. Near-duplicate detection for erulemaking. In Proceedings of the Fifth National Conference on Digital Government Research, 2005.
- H. Yang and J. Callan. Near-duplicate detection by instance-level constrained clustering. In Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.
Index Terms
- Active learning for e-rulemaking: public comment categorization