ABSTRACT
Computational resources for research in legal environments have historically meant remote access to large databases of legal documents such as case law, statutes, law reviews and administrative materials. Today, by contrast, lawyers' electronic work product is growing enormously within these environments, specifically within law firms. With this growth has come the need for accelerated knowledge management: automated assistance in organizing, analyzing, retrieving and presenting this content in a useful and distributed manner.

In cases where a relevant legal taxonomy is available, together with representative labeled data, automated text classification tools can be applied. In the absence of these resources, document clustering offers an alternative approach to organizing collections, and an adjunct to search.

To explore this approach further, we conducted sets of successively more complex clustering experiments using primary and secondary law documents as well as actual law firm data. Tests were run to determine the efficiency and effectiveness of a number of essential clustering functions. After examining the performance of traditional, or hard, clustering applications, we investigate soft clustering (multiple cluster assignments) as well as hierarchical clustering. We show that these latter clustering approaches are effective, in terms of both internal and external quality measures, and useful to legal researchers. Moreover, such techniques can ultimately assist in the automatic or semi-automatic generation of taxonomies for subsequent use by classification programs.
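The distinction between hard clustering (one cluster per document) and soft clustering (multiple cluster assignments) can be illustrated with a minimal sketch. It is not the method used in the experiments: the whitespace tokenizer, the three toy centroids, the sample memo, and the `ratio` threshold are all hypothetical choices made for illustration, and cosine similarity over raw term counts stands in for whatever similarity measure a real system would use.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts (hypothetical whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hard_assign(doc, centroids):
    """Hard clustering: return the single most similar cluster label."""
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


def soft_assign(doc, centroids, ratio=0.5):
    """Soft clustering: return every cluster whose similarity is at least
    `ratio` times the best similarity (the threshold is an assumption)."""
    sims = {label: cosine(doc, c) for label, c in centroids.items()}
    best = max(sims.values())
    if best == 0.0:
        return []  # the document shares no terms with any centroid
    return sorted(label for label, s in sims.items() if s >= ratio * best)


# Toy cluster centroids; in practice these would come from a clustering run.
centroids = {
    "tax": tf_vector("tax income deduction revenue tax"),
    "employment": tf_vector("employee employer wage termination employee"),
    "patent": tf_vector("patent invention claim infringement patent"),
}

# A memo that plausibly belongs to two practice areas at once.
memo = tf_vector("employer payroll tax withholding on employee wage income")

print(hard_assign(memo, centroids))
print(soft_assign(memo, centroids))
```

Under these assumptions the hard assignment places the memo in a single cluster, while the soft assignment also retains the second practice area, which is the behavior that makes multiple cluster assignments attractive for heterogeneous law firm documents.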
Effective document clustering for large heterogeneous law firm collections