The Central Word Register of the Danish language

Authors

  • Thomas Widmann

Keywords:

lexical database, orthography, Danish language, historical lexicography

Abstract

Det centrale Ordregister (“The Central Word Register”, COR) is a unique and innovative lexical database for the Danish language. Developed by the Danish Language Council, the Society for Danish Language and Literature and the Centre for Language Technology at the University of Copenhagen, with funding from the Agency for Digital Government, the COR assigns a unique identification number to every lemma and form of the Danish language. At the heart of the COR lies Retskrivningsordbogen, the official orthographical dictionary of Danish, which provides the foundation for these identification numbers. The Danish Language Council will update this basis whenever the orthography changes and publish the changes relative to the previous version, so that the COR always reflects the orthography of the day while existing resources continue to function. The COR is divided into three levels: Level 1 corresponds to the orthographical dictionary, Level 2 encompasses additional resources from professional language bodies, and Level 3 comprises all other resources, with no restrictions on who can contribute. Version 1.0 of Level 1 was released by the Danish Language Council in September 2022. The Society for Danish Language and Literature and the Centre for Language Technology are currently working on adding a semantic component on Level 2. The primary goal of the COR is to create a common key that enables more efficient reuse of language resources, similar to the way Denmark’s Central Person Register (CPR) allows different databases containing information about the inhabitants of Denmark to communicate with one another. The COR database can be accessed through a downloadable CSV file or an API, allowing developers to retrieve ID numbers, lemmas and forms in either CSV or JSON format, which makes it a good example of invisible lexicography.
The project also opens up new possibilities for historical lexicography, as the Danish Language Council intends to make its previous orthographical dictionaries available in COR format, enabling users to track the evolution of the language over time, to study historical texts more accurately and to adapt NLP software to work on such texts. Another topic is the development of COR linkers (programs that assign the correct COR number to every word in a text) and how these effectively solve the problems of part-of-speech tagging and homograph resolution at once. An example of a COR linker is the Danish Language Council’s CLINK project. Another aspect of the COR is the ability to use crowdsourcing in lexicography. Users can contribute their own data and insights simply by publishing their data with added COR ID numbers. This fosters greater collaboration and enables the creation of a plethora of rich, dynamic resources for the Danish language. Finally, the article will explore the benefits and potential applications of the COR and discuss the possibilities this creates for the future of Danish NLP and language research.
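As a rough illustration of why a COR linker addresses part-of-speech tagging and homograph resolution together, the sketch below looks up word forms in a small table and returns the candidate ID numbers for each token. Everything in it is an assumption made for the example: the column names, the sample rows and the placeholder IDs (the `X.…` values) are invented and do not reproduce the actual COR file layout or numbering, and the lookup is not how CLINK works.

```python
import csv
import io

# Hypothetical sample data in a COR-like layout. The column names and the
# "X...." identifiers are invented placeholders, not real COR numbers.
SAMPLE_CSV = """id,lemma,pos,form
X.00001.01,hus,noun,hus
X.00001.02,hus,noun,huset
X.00002.01,løbe,verb,løber
X.00003.01,løber,noun,løber
"""


def load_index(text):
    """Map each lower-cased word form to the rows that list it."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        index.setdefault(row["form"].lower(), []).append(row)
    return index


def link(tokens, index):
    """Return (token, candidate IDs) pairs; unknown tokens get an empty list."""
    return [(t, [r["id"] for r in index.get(t.lower(), [])]) for t in tokens]


index = load_index(SAMPLE_CSV)
linked = link(["Huset", "løber"], index)
for token, ids in linked:
    print(token, ids)
```

In this toy data, "Huset" maps to a single ID, while the homograph "løber" (a form of the verb "løbe" and also a noun) returns two candidates. Choosing between them requires context, and once one ID is chosen, both the part of speech and the lemma follow from it.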

Published

2023-06-29