Trawling the corpus for the overlooked lemmas

Authors

  • Nathalie Hau Sørensen
  • Nicolai Hartvig Sørensen
  • Kirsten Lundholm Appel
  • Sanni Nimb

Keywords:

Neology detection, lemma selection, low-frequency words

Abstract

Lemma selection is a significant part of lexicographic work. This also holds for the online Danish Dictionary (DDO), a corpus-based monolingual dictionary that is updated twice a year with lemma candidates identified beforehand through statistical corpus methods and introspection. Until now, all low-frequency word forms have been discarded in the statistical process; in this paper, we present a method for identifying lemma candidates among these as well. Our hypothesis is that some words are too inconspicuous and mundane to be noticed by introspection and, at the same time, too infrequent to be picked up by statistical measures. The method combines several automatic measures of “lemmaness”, drawing on language models, character n-grams, and statistical calculations, together with a compound splitter developed from information in the DDO. We evaluate the method by comparing the generated list with the lemmas included in the online DDO since 2005. In addition, two trained DDO lexicographers evaluate words from both the top and the bottom of the list. Although there is room for improvement, we find that the method identifies a large number of lemma candidates that would otherwise have been overlooked.

Published

2023-06-29