Active learning for e-rulemaking: public comment categorization

Authors:
Stephen Purpura, Cornell University, Ithaca, NY
Claire Cardie, Cornell University, Ithaca, NY
Jesse Simons, Cornell University, Ithaca, NY

dg.o '08: Proceedings of the 2008 international conference on Digital government research, May 2008, Pages 234–243

Published: 18 May 2008

ABSTRACT

We address the e-rulemaking problem of reducing the manual labor required to analyze public comment sets. In current and previous work, for example, text categorization techniques have been used to speed up the comment analysis phase of e-rulemaking by automatically classifying sentences according to the rule-specific issues [2] or general topics [7, 8] that they address. Manually annotated data, however, is still required to train the supervised inductive learning algorithms that perform the categorization. This paper therefore investigates the application of active learning methods to public comment categorization: we develop two new, general-purpose active learning techniques that selectively sample from the available training data for human labeling when building the sentence-level classifiers employed in public comment categorization. Using an e-rulemaking corpus developed for our purposes [2], we compare our methods to the well-known query by committee (QBC) active learning algorithm [5] and to a baseline that randomly selects instances for labeling in each round of active learning. We show that our methods outperform both the random-selection active learner and the QBC variation by a statistically significant margin, requiring many fewer training examples to reach the same levels of accuracy on a held-out test set. This provides promising evidence that automated text categorization methods might be used effectively to support public comment analysis.
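
The abstract does not describe the two new techniques, so they are not reproduced here. As a rough illustration of the two comparison points only, the following minimal Python sketch shows one selection round of query by committee next to the random baseline. Vote entropy as the disagreement measure, the keyword-rule committee, the example sentences, and all function names are assumptions made for illustration, not details taken from the paper.

```python
import math
import random
from collections import Counter


def vote_entropy(votes):
    """Disagreement among committee members on one instance (0 = unanimous)."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def qbc_select(committee, pool, batch_size):
    """Return indices of the pool items the committee disagrees on most."""
    scored = []
    for index, sentence in enumerate(pool):
        votes = [member(sentence) for member in committee]
        scored.append((vote_entropy(votes), index))
    # Highest disagreement first; ties are broken by pool order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:batch_size]]


def random_select(pool, batch_size, seed=0):
    """Baseline: pick instances for labeling uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(pool)), batch_size))


# Hypothetical committee: three keyword rules standing in for trained classifiers.
committee = [
    lambda s: "cost" if "cost" in s else "other",
    lambda s: "cost" if "price" in s or "cost" in s else "other",
    lambda s: "cost" if "expensive" in s or "cost" in s else "other",
]

# Hypothetical unlabeled pool of comment sentences.
pool = [
    "the cost of compliance is too high",    # committee is unanimous
    "the price will rise for small firms",   # committee splits
    "please extend the comment period",      # committee is unanimous
    "this rule is expensive to implement",   # committee splits
]

to_label = qbc_select(committee, pool, batch_size=2)
baseline = random_select(pool, batch_size=2)
```

In a full active learning loop, the selected sentences would be sent to a human annotator, added to the training set, and the committee retrained before the next round; the sketch covers only the selection step.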

References

[1] K. Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of ICML-03, 20th International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, US, 2003.
[2] C. Cardie, C. Farina, M. Rawding, A. Aijaz, and S. Purpura. A Study in Rule-Specific Issue Categorization for e-Rulemaking. In Proceedings of the 9th Annual International Conference on Digital Government Research, 2008.
[3] C. Coglianese. Weak democracy, strong information: The role of information technology in the rulemaking process. In V. Mayer-Schoenberger and D. Lazer, editors, Electronic Government to Information Government: Governing in the 21st Century, 2007.
[4] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[5] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
[6] C. Kerwin. The state of rulemaking in the federal government. Technical report, Transcript Panel 1, 2005.
[7] N. Kwon and E. Hovy. Information acquisition using multiple classifications. In Proceedings of the Fourth International Conference on Knowledge Capture (K-CAP 2007), 2007.
[8] N. Kwon, E. Hovy, and S. Shulman. Multidimensional text analysis for erulemaking. In Proceedings of the 7th Annual International Conference on Digital Government Research, 2006.
[9] D. D. Lewis and J. Catlett. Heterogeneous Uncertainty Sampling for Supervised Learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 148–156, Rutgers University, New Brunswick, NJ, 1994. Morgan Kaufmann.
[10] P. Melville and R. Mooney. Diverse ensembles for active learning. In Proceedings of ICML-04, 21st International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, US, 2004.
[11] I. Muslea, S. Minton, and C. Knoblock. Selective sampling with redundant views. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 621–626, 2000.
[12] K. Papineni. Why inverse document frequency? In Proceedings of the North American Association for Computational Linguistics, NAACL, pages 25–32, 2001.
[13] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[14] S. Purpura and D. Hillard. Automated Classification of Congressional Legislation. In Proceedings of the 7th Annual International Conference on Digital Government Research, 2006.
[15] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[16] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). MIT Press, Cambridge, MA, 2002.
[17] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Computational Learning Theory, pages 287–294, 1992.
[18] S. Shulman. Perverse incentives: The case against mass e-mail campaigns. In Proceedings of the Annual Meeting of the American Political Science Association, 2008.
[19] P. Strauss, T. Rakoff, and C. Farina. Administrative Law. 10th edition, 2003.
[20] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[21] H. Yang and J. Callan. Near-duplicate detection for erulemaking. In Proceedings of the Fifth National Conference on Digital Government Research, 2005.
[22] H. Yang and J. Callan. Near-duplicate detection by instance-level constrained clustering. In Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.

Published in
dg.o '08: Proceedings of the 2008 international conference on Digital government research
May 2008
488 pages
ISBN: 9781605580999
Conference Chairs: Monique Charbonneau (CEFRIO), Lester Diamond (US Social Security Administration), Stuart Shulman (University of Pittsburgh)
Program Chairs: Soon Ae Chun, Marijn Janssen, J. Ramon Gil-Garcia
Publisher: Digital Government Society of North America
Author Tags
active learning
e-rulemaking
machine learning
public comment
text categorization
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 150 of 271 submissions, 55%