DOI: 10.1145/1165485.1165513
Article

Effective document clustering for large heterogeneous law firm collections

Published: 06 June 2005

ABSTRACT

Computational resources for research in legal environments have historically implied remote access to large databases of legal documents such as case law, statutes, law reviews and administrative materials. Today, by contrast, there exists enormous growth in lawyers' electronic work product within these environments, specifically within law firms. Along with this growth has come the need for accelerated knowledge management: automated assistance in organizing, analyzing, retrieving and presenting this content in a useful and distributed manner.

In cases where a relevant legal taxonomy is available, together with representative labeled data, automated text classification tools can be applied. In the absence of these resources, document clustering offers an alternative approach to organizing collections, and an adjunct to search.

To explore this approach further, we have conducted sets of successively more complex clustering experiments using primary and secondary law documents as well as actual law firm data. Tests were run to determine the efficiency and effectiveness of a number of essential clustering functions. After examining the performance of traditional or hard clustering applications, we investigate soft clustering (multiple cluster assignments) as well as hierarchical clustering. We show how these latter clustering approaches are effective, in terms of both internal and external quality measures, and useful to legal researchers. Moreover, such techniques can ultimately assist in the automatic or semi-automatic generation of taxonomies for subsequent use by classification programs.
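The distinction the abstract draws between hard clustering (one cluster per document) and soft clustering (multiple cluster assignments), and the idea of an external quality measure, can be illustrated with a minimal sketch. It is not the authors' method or software: the cosine-similarity assignment rule, the `ratio` threshold, the purity measure, and the toy term weights below are all assumptions chosen only to make the concepts concrete.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hard_assign(doc, centroids):
    """Hard clustering: return the single most similar cluster id."""
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))


def soft_assign(doc, centroids, ratio=0.6):
    """Soft clustering: return every cluster whose similarity is at least
    `ratio` times the best similarity (hypothetical threshold rule)."""
    sims = {c: cosine(doc, v) for c, v in centroids.items()}
    best = max(sims.values())
    return sorted(c for c, s in sims.items() if best > 0 and s >= ratio * best)


def purity(assignments, labels):
    """External quality: fraction of documents that fall in the majority
    class of the cluster they were assigned to."""
    by_cluster = {}
    for cid, label in zip(assignments, labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)


# Toy data (invented for illustration): two cluster centroids, three documents.
centroids = {
    "tax": {"tax": 1.0, "income": 1.0},
    "ip": {"patent": 1.0, "license": 1.0},
}
docs = [
    {"tax": 2, "income": 1},                            # clearly tax
    {"patent": 3, "license": 1},                        # clearly IP
    {"tax": 2, "income": 1, "patent": 1, "license": 1}, # touches both topics
]

hard = [hard_assign(d, centroids) for d in docs]
soft = [soft_assign(d, centroids) for d in docs]
```

In this sketch the third document receives a single label under hard assignment but is placed in both clusters under soft assignment, which is the behavior that makes soft clustering attractive for documents spanning several practice areas. Calling `purity(hard, labels)` with a list of known class labels gives one simple external measure; internal measures (for example, average within-cluster similarity) need no labels at all.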


Published in

ICAIL '05: Proceedings of the 10th international conference on Artificial intelligence and law
June 2005
270 pages
ISBN: 1595930817
DOI: 10.1145/1165485

Copyright © 2005 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 69 of 169 submissions, 41%
