Towards the enrichment of terminological resources by scientific corpora analysis

Authors: Izabella Thomas, Iana Atanassova

The research presented in this paper explores the possibility of enriching terminological databases through the analysis of recent scientific publications. Our main concern is to evaluate how useful automatic term extraction can be to a human expert. To carry out our experiment, we constructed two corpora of recent scientific papers in two different sub-domains of the bio-medical sciences. We then proceeded in three steps: automatic extraction and ranking of terms from a set of corpora of scientific papers; evaluation of the overlap between the candidate terms (CTs) extracted from the corpora and those present in the multidisciplinary terminology portal TermSciences; and evaluation by domain experts of the three sets of the top 200 CTs extracted from the different corpora. To extract terms, we used the Sensunique Platform, a web-based platform for building terminological resources. Our results show that only about 10% of the extracted CTs are present in the TermSciences resource, which means that many of the extracted CTs, if validated, could be used to enrich the terminological database. Furthermore, the expert evaluation of the top 200 terms for each sub-corpus shows that about 75% of these CTs are correct terms in their respective domains, which validates our ranking algorithm.
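The two measures mentioned in the abstract, the overlap with an existing resource and the share of expert-validated terms among the top-ranked CTs, can be illustrated with a minimal Python sketch. It is not the Sensunique Platform's implementation. The function names, the sample data, and the choice of exact string matching after lowercasing and whitespace normalization are all assumptions made for illustration.

```python
def normalize(term: str) -> str:
    """Lowercase a term and collapse runs of whitespace (assumed matching rule)."""
    return " ".join(term.lower().split())


def overlap_ratio(candidates, resource_terms):
    """Fraction of candidate terms that also appear in an existing resource."""
    resource = {normalize(t) for t in resource_terms}
    cands = [normalize(c) for c in candidates]
    if not cands:
        return 0.0
    return sum(1 for c in cands if c in resource) / len(cands)


def precision_at_k(ranked_candidates, validated_terms, k=200):
    """Fraction of the top-k ranked candidates that experts marked as correct."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    valid = {normalize(t) for t in validated_terms}
    return sum(1 for c in top if normalize(c) in valid) / len(top)


# Hypothetical toy data, not taken from the paper.
ranked_cts = ["Gene expression", "stem cell", "cell  line", "protein folding"]
resource = ["gene expression"]
expert_ok = ["stem cell", "gene expression", "cell line"]

overlap = overlap_ratio(ranked_cts, resource)
precision = precision_at_k(ranked_cts, expert_ok, k=4)
```

With real data, `ranked_cts` would hold the ranked extractor output, `resource` the entries of the terminological database, and `expert_ok` the CTs accepted by the domain experts.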

Keywords: terminology; term acquisition; term extraction; term recognition; scientific papers

Reference: In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. (eds.) Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd., pp. 136-151.


Published: 2015