Automatically generating high quality metadata by analyzing the document code of common file types

Authors:
Lars Fredrik Høimyr Edvardsen
Intelligent Communication AS/The Norwegian University of Science and Technology, Oslo, Norway

Ingeborg Torvik Sølvberg
The Norwegian University of Science and Technology, Trondheim, Norway

Trond Aalberg
The Norwegian University of Science and Technology, Trondheim, Norway

Hallvard Trætteberg
The Norwegian University of Science and Technology, Trondheim, Norway

JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
June 2009
Pages 29–38
https://doi.org/10.1145/1555400.1555406

Published: 15 June 2009

ABSTRACT

A major challenge for content management in intranets and other large-scale document storage and retrieval services is the generation of high-quality metadata. Manual metadata generation is resource-demanding, and collection managers and document authors often view it as an inefficient use of their time, so there is a desire for other ways to create the needed metadata. Automatic Metadata Generation (AMG) comprises methods for generating metadata without manual interaction, using computer programs to interpret the document and possibly the document context. Current AMG research has been limited to collections of similarly formatted documents. The research presented in this paper expands the field of AMG by presenting an approach that is independent of a common visualization scheme: AMG based on document code analysis. The paper shows the AMG possibilities offered by LaTeX, Word and PowerPoint documents and how this approach can significantly increase the quality of the generated metadata. Giving AMG algorithms direct access to the user-specified intellectual content and the file formatting avoids common quality-reducing factors such as incompleteness, low accuracy, and shortcomings in logical consistency, coherence and timeliness. The research also shows how this AMG approach can be combined with other AMG approaches, drawing on their strengths to achieve the desired high-quality metadata entities.
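
To make the idea of document code analysis concrete, the following minimal Python sketch reads title, author and date directly from LaTeX source instead of from the rendered layout. It is an illustration only and not the authors' algorithm: the function names, the `SAMPLE` document, and the mapping of LaTeX commands to the Dublin Core style element names `title`, `creator` and `date` are assumptions made for this example. The sketch also ignores LaTeX comments, escaped braces and optional arguments.

```python
import re


def braced_argument(source, command):
    """Return the brace-balanced argument of the first use of a LaTeX command.

    `command` is given without its leading backslash, e.g. "title".
    Returns None when the command is absent or its braces never close.
    """
    match = re.search(r"\\" + command + r"\s*\{", source)
    if match is None:
        return None
    depth = 1
    start = match.end()
    for pos in range(start, len(source)):
        char = source[pos]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                # Collapse line breaks and repeated spaces inside the argument.
                return " ".join(source[start:pos].split())
    return None


def latex_metadata(source):
    """Map selected LaTeX preamble commands to metadata elements.

    The element names are an assumed, Dublin Core style naming.
    """
    command_for_element = {"title": "title", "creator": "author", "date": "date"}
    metadata = {}
    for element, command in command_for_element.items():
        value = braced_argument(source, command)
        if value:
            metadata[element] = value
    return metadata


# Hypothetical input document used only for illustration.
SAMPLE = r"""
\documentclass{article}
\title{Metadata from \emph{document code}}
\author{A. Author}
\begin{document}
\maketitle
\end{document}
"""

if __name__ == "__main__":
    print(latex_metadata(SAMPLE))
```

Because the values come from explicit markup rather than from font sizes or page positions, the same function works regardless of the document class that controls the visual layout, which is the property the abstract describes as independence from a common visualization scheme.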

Index Terms

1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Published in
JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
June 2009
502 pages
ISBN:9781605583228
DOI:10.1145/1555400
General Chairs: Fred Heath (University of Texas Libraries, USA), Mary Lynn Rice-Lively (University of Texas at Austin, USA)
Program Chair: Richard Furuta (Texas A&M University, USA)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
PDF
automatic metadata generation
document code
extraction
harvesting
latex
metadata quality
openXML
powerpoint
word
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 415 of 1,482 submissions (28%)