Word-sense Induction on a Corpus of Buddhist Sanskrit Literature

Authors

  • Matej Martinc
  • Andraž Pelicon
  • Senja Pollak
  • Ligeia Lugli

Keywords:

Buddhist Sanskrit, Word sense induction, Transformer language models

Abstract

We report on a series of word sense induction (WSI) experiments conducted on a corpus of Buddhist Sanskrit literature, with the aim of partially automating the labour-intensive lexicographic task of matching citations for a lemma to the corresponding sense of that lemma. For this purpose, we construct a Buddhist Sanskrit WSI dataset consisting of 3,108 sentences with manually labeled sense annotations for 39 distinct lemmas. The dataset is used to train and evaluate three transformer-based language models fine-tuned to identify the intended meaning of lemmas in different contexts. The models' predictions are then used to cluster the example sentences of each lemma into distinct senses using a novel graph-based clustering solution. We evaluate the clusters in two scenarios. In the first, we measure how well they represent the true sense distribution of unseen lemmas not used for model training, and report a best Adjusted Rand Index (ARI) score of 0.208. In the second, we measure how well they represent the true sense distribution when the classifier is tested on unseen example sentences of lemmas used for model training, and report a best ARI score of 0.3. In both scenarios, we outperform the baseline by a large margin.
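
For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how an Adjusted Rand Index can be computed from a gold sense labelling and a predicted clustering. It uses the standard pair-counting definition of ARI and only the Python standard library. It is not the authors' evaluation code; the function name and the toy sense labels are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Pair-counting ARI between a gold labelling and a predicted clustering.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")

    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two items: the two partitions agree trivially.
        return 1.0

    # Contingency counts: items sharing both a gold sense and a predicted cluster.
    cell_counts = Counter(zip(true_labels, predicted_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)

    # Number of item pairs grouped together under each view.
    pairs_both = sum(comb(c, 2) for c in cell_counts.values())
    pairs_true = sum(comb(c, 2) for c in true_counts.values())
    pairs_pred = sum(comb(c, 2) for c in pred_counts.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = pairs_true * pairs_pred / total_pairs
    maximum = (pairs_true + pairs_pred) / 2

    if maximum == expected:
        # Degenerate case: both partitions are trivial, so they agree.
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Toy example with made-up sense labels for four sentences of one lemma.
gold = ["sense_a", "sense_a", "sense_b", "sense_b"]
clusters = [1, 1, 0, 0]  # same grouping, different cluster names
print(adjusted_rand_index(gold, clusters))
```

ARI is invariant to how clusters are named, equals 1 for a perfect match, and is close to 0 for a clustering that agrees with the gold senses no more than chance would predict, which is why scores such as 0.208 and 0.3 should be read against the baseline rather than on an absolute scale.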

Published

2023-06-29