Definition Extraction for Slovene: Patterns, Transformer Classifiers and ChatGPT

Authors

  • Hanh Thi Hong Tran
  • Vid Podpečan
  • Mateja Jemec Tomazin
  • Senja Pollak

Keywords:

Definition Extraction, RSDO-DEFT, Rule-based, Transformers, ChatGPT

Abstract

Definition Extraction is a Natural Language Processing task that automatically identifies definitions in unstructured text sequences. In our research, we frame this problem as a binary classification task that detects whether a given Slovene sentence is a definition or not. The main contributions of our work are two-fold. First, we introduce a novel Slovene corpus for the evaluation of Definition Extraction named RSDO-def. The dataset contains labeled sentences drawn from specialized corpora using two different extraction processes: random sampling and pattern-based extraction. Both sets were manually annotated by linguists with three labels: Definition, Weak definition, and Non-definition. Second, we propose benchmarks for Slovene Definition Extraction systems that use (1) rule-based techniques; (2) Transformer-based models as binary classifiers; and (3) ChatGPT prompting, and evaluate them on both sets of the RSDO-def corpus. When only the small sample RSDO-def-random is considered, the pattern-based rules surpassed the language model classifiers and ChatGPT in terms of F1 on the definition class in the strict evaluation setting (considering Weak definition as Non-definition). Meanwhile, language models (classifiers and ChatGPT) outperformed rule-based approaches when applied to the data with a higher number of definitions and in more relaxed evaluation scenarios (considering Weak definition as Definition). Comparing ChatGPT and language model classifiers on the definition class of RSDO-def-random and RSDO-def-large, we observe that higher precision was obtained with classifiers, but higher recall with ChatGPT.
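The strict and relaxed evaluation settings described in the abstract can be illustrated with a minimal sketch. It assumes the three annotation labels are available as plain strings and that a system outputs 1 for "definition" and 0 otherwise. The function names (to_binary, prf_definition, evaluate) are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def to_binary(label: str, relaxed: bool) -> int:
    """Map a three-way annotation to a binary target (1 = definition).

    Strict setting: Weak definition counts as Non-definition.
    Relaxed setting: Weak definition counts as Definition.
    """
    if label == "Definition":
        return 1
    if label == "Weak definition":
        return 1 if relaxed else 0
    return 0  # Non-definition


def prf_definition(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive (definition) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(annotations: List[str], predictions: List[int],
             relaxed: bool) -> Tuple[float, float, float]:
    """Score binary predictions against annotations under one setting."""
    gold = [to_binary(a, relaxed) for a in annotations]
    return prf_definition(gold, predictions)


# Toy data, invented purely for illustration (not from RSDO-def).
annotations = ["Definition", "Weak definition", "Non-definition", "Weak definition"]
predictions = [1, 1, 0, 0]

strict_scores = evaluate(annotations, predictions, relaxed=False)
relaxed_scores = evaluate(annotations, predictions, relaxed=True)
print("strict  P/R/F1:", strict_scores)
print("relaxed P/R/F1:", relaxed_scores)
```

The same predictions can score quite differently under the two settings, because every sentence annotated as Weak definition changes sides in the gold standard. This is why the abstract reports the two scenarios separately.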

Published

2023-06-29