Word-sense Induction on a Corpus of Buddhist Sanskrit Literature

Matej Martinc; Andraž Pelicon; Senja Pollak; Ligeia Lugli

Keywords:

Buddhist Sanskrit, Word sense induction, Transformer language models

Abstract

We report on a series of word sense induction (WSI) experiments conducted on a corpus of Buddhist Sanskrit literature, with the aim of introducing a degree of automation into the labour-intensive lexicographic task of matching citations for a lemma to the corresponding sense of that lemma. For this purpose, we construct a Buddhist Sanskrit WSI dataset consisting of 3,108 sentences with manually labeled sense annotations for 39 distinct lemmas. The dataset is used to train and evaluate three transformer-based language models fine-tuned on the task of identifying the intended meaning of lemmas in different contexts. The predictions produced by the models are then used to cluster the example sentences of each lemma into distinct lemma senses using a novel graph-based clustering solution. We evaluate the clusters in two scenarios. In the first, we measure how well the obtained clusters represent the true sense distribution of new, unseen lemmas not used for model training, and report a best Adjusted Rand Index (ARI) score of 0.208. In the second, we measure how well the clusters represent the true sense distribution when the classifier is tested on new, unseen example sentences of lemmas that were used for model training, and report a best ARI score of 0.3. In both scenarios, we outperform the baseline by a large margin.
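For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how predicted clusters can be scored against gold sense labels with ARI. It is not the clustering method of the paper, whose details are not given in the abstract. It assumes, purely for illustration, a classifier that outputs pairs of sentences judged to share a sense, and it uses plain connected components as a stand-in graph-based clustering step. The function names `adjusted_rand_index` and `cluster_by_components` and the toy data are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    n = len(true_labels)
    # Contingency counts: how many items share each (true, predicted) label pair.
    pair_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total = comb(n, 2)
    if total == 0:
        return 1.0  # fewer than two items: the labelings trivially agree
    expected = sum_true * sum_pred / total
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_pairs - expected) / (max_index - expected)


def cluster_by_components(n_items, same_sense_pairs):
    """Assumed stand-in clustering: connected components of a same-sense graph."""
    parent = list(range(n_items))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in same_sense_pairs:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n_items)]


# Toy example (invented data): six sentences for one lemma, two gold senses.
gold = [0, 0, 0, 1, 1, 1]
# Hypothetical classifier output: sentence pairs judged to share a sense.
predicted_pairs = [(0, 1), (1, 2), (3, 4)]
clusters = cluster_by_components(len(gold), predicted_pairs)
ari = adjusted_rand_index(gold, clusters)
print(f"ARI = {ari:.3f}")
```

An ARI of 1 indicates clusters identical to the gold senses up to relabeling, while values near 0 indicate agreement no better than chance, which is the scale on which the reported scores of 0.208 and 0.3 should be read.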
