Combining a rule-based approach and machine learning in a good-example extraction task for the purpose of lexicographic work on contemporary standard German

Authors: Lothar Lemnitzer, Christian Pölitz, Jörg Didakowski, Alexander Geyken

The work presented in this paper is part of a dictionary project at the Berlin-Brandenburg Academy of Sciences and Humanities. For a large number of headwords, example sentences for the respective lexicographic descriptions have to be retrieved from a corpus of contemporary German. Lexicographers are typically faced with a huge number of corpus citations. A tool that selects only good examples (those that are considered for inclusion in the dictionary) and dismisses the others would therefore save time and effort. A rule-based good-example extractor proved to be a good starting point, but it still delivers too many unacceptable citations. We have therefore tried to combine this tool with a machine learner trained on the decisions of an experienced lexicographer. The learner has been optimized to reject a large share of the example sentences. We present the machine learning results on a test data set with various combinations of linguistic features and quantify the gain in time and effort for the lexicographers. We also discuss the shortcomings of our approach and suggest some measures to counter them.

Keywords: example extraction; machine learning; corpus linguistics; German

Reference: In Kosem, I., Jakubiček, M., Kallas, J., Krek, S. (eds.) Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd., pp. 21-31.


Published: 2015