Extracting terms and their relations from German texts: NLP tools for the preparation of raw material for specialized e-dictionaries
Authors: Ina Rösiger, Johannes Schäfer, Tanja George, Simon Tannert, Ulrich Heid, Michael Dorna
We report on ongoing experiments in data extraction from German texts in the domain of do-it-yourself (DIY) instructions. The objectives are (i) to extract high-quality nominal term candidates; (ii) to extract predicate-argument structures involving these term candidates; and (iii) to relate German word formation products to their syntactic paraphrases. For (iii), we focus on the analysis of compounds and relate them to their syntactic paraphrases in order to provide evidence for the (semantic) relationship between compound heads and non-heads (Holzbohrer (wood drill) <–> Holz[Object] bohren ([to] drill wood)). The extracted material is collected to provide structured data input for the creation of specialized dictionaries that are richer than standard terminological glossaries. To create taxonomic knowledge (Bandsäge -is-a-> Säge (bandsaw -> saw)), we analyze subtypes of compounds.
Keywords: terminology extraction; raw material for specialized dictionary creation; lexical resources; German language; parsing
Reference: In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. (eds.) Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd., pp. 486-503.
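
Illustration: the taxonomic relation mentioned in the abstract (Bandsäge -is-a-> Säge) can be sketched as follows. This is a minimal sketch and not the authors' pipeline: it assumes that the head of a German noun compound is its rightmost constituent and that a set of already extracted simplex terms is available. The function names `head_candidate` and `is_a_pairs`, the `min_len` parameter, and the naive suffix matching are hypothetical choices made for this example only; a real system would rely on a proper compound splitter.

```python
from typing import Iterable, List, Optional, Tuple


def head_candidate(compound: str, known_terms: Iterable[str],
                   min_len: int = 3) -> Optional[str]:
    """Return the longest known term that is a proper suffix of `compound`.

    Assumption (not from the paper): naive, case-insensitive suffix
    matching stands in for a real compound splitter.
    """
    lookup = {term.lower(): term for term in known_terms}
    lower = compound.lower()
    # Start at 1 so the compound never matches itself; smaller start
    # positions give longer suffixes, so the longest match wins.
    for start in range(1, len(lower) - min_len + 1):
        suffix = lower[start:]
        if suffix in lookup:
            return lookup[suffix]
    return None


def is_a_pairs(terms: Iterable[str]) -> List[Tuple[str, str]]:
    """Build (compound, head) pairs, read as 'compound is-a head'."""
    term_list = list(terms)
    pairs = []
    for term in term_list:
        head = head_candidate(term, term_list)
        if head is not None:
            pairs.append((term, head))
    return pairs


if __name__ == "__main__":
    # Toy term list built from the examples in the abstract.
    for compound, head in is_a_pairs(["Säge", "Bandsäge", "Bohrer", "Holzbohrer"]):
        print(f"{compound} -is-a-> {head}")
```

Suffix matching of this kind overgenerates on real data (a known term can be an accidental suffix of an unrelated word), which is one reason to analyze subtypes of compounds before deriving taxonomic relations.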