DOI: 10.1145/1555400.1555406
research-article

Automatically generating high quality metadata by analyzing the document code of common file types

Published: 15 June 2009

ABSTRACT

A major challenge for content management in intranets and other large-scale document storage and retrieval services is the generation of high-quality metadata. Manual generation of metadata is resource-demanding and is often viewed by collection managers and document authors as an inefficient use of their time, so there is a desire for other ways to create the needed metadata. Automatic Metadata Generation (AMG) refers to methods for generating metadata without manual interaction, using computer programs to interpret the document and possibly the document context. Current AMG research has been limited to collections of similarly formatted documents. The research presented in this paper expands the field of AMG by presenting an approach that is independent of a common visualization scheme: AMG based on document code analysis. This is done by showing AMG possibilities for LaTeX, Word and PowerPoint documents and how this approach can significantly increase the quality of the generated metadata. By giving AMG algorithms direct access to the user-specified intellectual content and the file formatting, the approach avoids common quality-reducing factors such as incompleteness, low accuracy, and poor logical consistency, coherence and timeliness. This research also shows how this AMG approach can be combined with other AMG approaches, drawing on their strengths in order to achieve the desired high-quality metadata entities.
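To make the idea of document code analysis concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a LaTeX source in which the title, author and date are declared with the standard `\title`, `\author` and `\date` commands, and it reads those values directly from the document code instead of from the rendered layout. The function name `latex_metadata` and the sample document are hypothetical, and the simple pattern does not handle nested braces.

```python
import re

# Hypothetical helper, not taken from the paper: read metadata fields
# straight from the LaTeX document code rather than the rendered page.
# Assumption: field values contain no nested braces.
_FIELD = re.compile(r"\\(title|author|date)\s*\{([^{}]*)\}")


def latex_metadata(source: str) -> dict:
    """Return the first title/author/date declared in the LaTeX source."""
    metadata = {}
    for name, value in _FIELD.findall(source):
        # Keep the first occurrence of each field and collapse whitespace.
        metadata.setdefault(name, " ".join(value.split()))
    return metadata


sample = r"""
\documentclass{article}
\title{Automatic Metadata   Generation}
\author{A. Author}
\begin{document}
\maketitle
\end{document}
"""

print(latex_metadata(sample))
```

Because the values come from commands the author wrote explicitly, a sketch like this does not depend on how the document happens to be typeset, which is the property the abstract attributes to the approach. Comparable fields in Word and PowerPoint files would need a different reader and are not covered here.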


Published in

JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
June 2009
502 pages
ISBN: 9781605583228
DOI: 10.1145/1555400

Copyright © 2009 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 415 of 1,482 submissions, 28%
