Measuring and coding language change: An evolving study in a multilayer corpus architecture

Authors:
Hagen Hirschmann, Humboldt-Universität zu Berlin, Germany
Anke Lüdeling, Humboldt-Universität zu Berlin, Germany
Amir Zeldes, Humboldt-Universität zu Berlin, Germany

Journal on Computing and Cultural Heritage, Volume 5, Issue 1, Article No. 4, pp. 1–16. https://doi.org/10.1145/2160165.2160169

Published: 27 April 2012

Abstract

Our article explores the possibilities of using deeply annotated, incrementally evolving comparable corpora for the study of language change, in this case for different stages from Old High German to New High German. Using the example of the evolution of German past tenses, we show how a variety of categories ranging from low to high complexity interact with the choice between competing linguistic variants. To adequately explore the influence of these categories, we use a multilayer corpus architecture that develops together with our study. We show that a combination of quantitative and qualitative analyses can recognize relevant contextual factors, which feed into the addition of new annotation layers applying to the same data. By making our categorizations explicit as corpus annotations and our data available to other researchers, we promote an open, extensible, and transparent mode of research, where both raw data and the inferential process are exposed to other researchers.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Published in

Journal on Computing and Cultural Heritage Volume 5, Issue 1
April 2012
53 pages
ISSN:1556-4673
EISSN:1556-4711
DOI:10.1145/2160165

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 April 2012
- Accepted: 1 June 2011
- Received: 1 February 2011
Author Tags
Corpus linguistics
German
historical linguistics
multilayer corpora
perfect
preterit
tense
variation
Qualifiers
- research-article
- Research
- Refereed