ABSTRACT
A major challenge for content management in intranets and other large-scale document storage and retrieval services is the generation of high-quality metadata. Manual generation of metadata is resource-demanding and is often viewed by collection managers and document authors as an inefficient use of their time, so there is a desire for other ways to create the needed metadata. Automatic Metadata Generation (AMG) refers to methods that generate metadata without manual interaction, using computer programs to interpret the document and possibly the document context. Current AMG research has been limited to collections of similarly formatted documents. The research presented in this paper expands the field of AMG by presenting an approach that is independent of a common visualization scheme: AMG based on document code analysis. This is done by showing AMG possibilities for LaTeX, Word and PowerPoint documents and how this approach can significantly increase the quality of the generated metadata. By giving AMG algorithms direct access to the user-specified intellectual content and the file formatting, the approach avoids common quality-reducing factors such as incompleteness, low accuracy, and poor logical consistency, coherence and timeliness. This research also shows how this AMG approach can be combined with other AMG approaches, drawing on their strengths in order to achieve the desired high-quality metadata entities.
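As a rough illustration of what document code analysis can mean for a LaTeX file, the sketch below reads the title, author and date directly from the source markup instead of inferring them from the rendered layout. It is a minimal sketch under stated assumptions, not the method used in the paper: the function names `read_braced` and `extract_latex_metadata` and the sample document are hypothetical, and only the standard `\title`, `\author` and `\date` commands are handled.

```python
import re


def read_braced(source: str, start: int) -> str:
    """Return the text inside the brace group that opens at source[start]."""
    depth = 0
    for i in range(start, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                return source[start + 1:i]
    # Unbalanced input: fall back to everything after the opening brace.
    return source[start + 1:]


def extract_latex_metadata(source: str) -> dict:
    """Collect title, author and date from LaTeX source (hypothetical helper)."""
    metadata = {}
    for field in ("title", "author", "date"):
        # Match the command name followed by its opening brace, e.g. \title{
        match = re.search(r"\\" + field + r"\s*\{", source)
        if match:
            value = read_braced(source, match.end() - 1)
            # Collapse line breaks and repeated spaces inside the value.
            metadata[field] = " ".join(value.split())
    return metadata


# Hypothetical sample document used only for illustration.
sample = r"""
\documentclass{article}
\title{Automatic \emph{Metadata} Generation}
\author{A. Author}
\begin{document}
\maketitle
\end{document}
"""

print(extract_latex_metadata(sample))
```

Because the values come from the author's own markup, nested commands such as `\emph{...}` are preserved in the extracted text; a fuller implementation would need to decide whether to strip or interpret them.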
Index Terms
- Automatically generating high quality metadata by analyzing the document code of common file types