Corpora vs. LLMs: a hands-on workshop on using CLASSLA and AI tools for phraseology and lexicography in South Slavic languages
Organizers:
- Slobodan Beliga
- Mija Bon
- Ivana Filipović Petrović
- Apolonija Gantar
- Taja Kuzman
- Nikola Ljubešić
- Petya Osenova
- Jelena Parizoska
Date: 17 November 2025 (day before the eLex 2025 conference)
Length: half-day (14:00–19:00)
Format: introductory talks followed by guided practical exercises for workshop participants
Audience size: 35 participants
No fee will be charged. Registration will be possible together with the eLex conference registration.
Agenda
Part I: Opening and introductory talks (14:00–15:30)
14:00–14:15 | Opening
14:15–14:40 | Slobodan Beliga: Large language models and generative artificial intelligence: introduction
14:40–15:05 | Apolonija Gantar: Some examples of using AI tools in Slovenian lexicography
15:05–15:30 | Ivana Filipović Petrović: Applying ChatGPT to Croatian phraseology and related lexicographic tasks
15:30–16:00 | Coffee break
Part II: Hands-on exercises (16:00–19:00)
16:00–17:00 | Corpora and AI-driven interfaces: Extracting linguistic data
17:00–18:00 | Corpora and AI-driven interfaces: Creating definitions and providing usage examples of phraseological units for dictionaries
18:00–19:00 | Corpora and AI-driven interfaces: Distinguishing literal and figurative uses of phraseological units
Rationale
After the initial strong reactions from the broader community, including linguists, to the release of ChatGPT as an easily accessible chat interface, several studies have examined how this AI tool performs in lexicography-related tasks (De Schryver 2023; Rundell 2023; Jakubíček and Rundell 2023; Fuertes-Olivera 2024; Lew 2024). These studies focus primarily on its success rates in executing linguistic tasks, its applicability in applied fields such as lexicography, and insights gained from prompt engineering to optimise results. Lexicographers have long relied on corpus data, valuing its reliability and the transparency of sources that provide insights into actual language use. Corpus search tools have in turn become increasingly sophisticated, enabling greater automation in lexicographic work by generating high-quality dictionary examples, identifying typical collocations, extracting lemma lists, and much more. These tools also ensure direct interlinking between lexicographic resources and corpora.
Both developments highlight a persistent problem with digital technologies: the under-representation of low-resource languages. AI tools, as anticipated, are less effective with these languages. This challenge also affects South Slavic languages, though some have recently made progress toward mid-resource status. A major step forward in this regard was the release of the CLASSLA web corpora for South Slavic languages in 2024 (Ljubešić and Kuzman 2024). AI tools have also been used in studies on South Slavic languages (Beliga and Filipović Petrović 2024; Gantar 2024; Gapsa et al. 2024; Kosem et al. 2024). In 2024, CLARIN Slovenia organised seven workshops across five countries (Ljubešić et al. 2024), with the main goal of sharing knowledge about using corpus querying tools.
Based on insights into which linguistic and lexicographic tasks corpora handle effectively and which still require manual work, we have chosen to launch a new series of workshops that combines the strengths of corpora with AI tools. The aim is to test the capabilities of corpora and LLMs in performing various tasks relevant to lexicographers and linguists. The linguistic data to be examined consists of phraseological units, i.e., multi-word expressions, in South Slavic languages. In doing so, we aim to contribute to two challenging topics: examining how language technologies perform on low- and mid-resource languages such as the South Slavic languages, and exploring how far they have advanced in handling the ever-challenging multi-word expressions and the ambiguity they carry.
This workshop, associated with the eLex conference, will focus specifically on tasks and discussions related to the dictionary-making process. Searches will be conducted on the CLASSLA corpora using the NoSketch Engine concordancer and AI-driven interfaces such as ChatGPT 3.5 or Gemini. Each participant will work with the tools in one of the CLASSLA languages (Slovenian, Croatian, Serbian, Macedonian, or Bulgarian), indicating their choice when registering for the workshop. Participants will be divided into groups based on their language preference, with each group completing the exercises and tracking results in its selected language. The workshop aims to establish a methodologically sound approach to integrating two major tool sets, corpora and LLMs, in language research and lexicography.
References
Beliga, Slobodan, and Filipović Petrović, Ivana. 2024. Large Language Models Supporting Lexicography: Conceptual Organization of Croatian Idioms. In Proceedings of the Conference on Language Technologies and Digital Humanities, edited by Špela Arhar Holdt and Tomaž Erjavec, 23-46. Ljubljana: Institute of Contemporary History.
De Schryver, Gilles-Maurice. 2023. Generative AI and Lexicography: The Current State of the Art Using ChatGPT. International Journal of Lexicography, 36(4), 355-387.
Fuertes-Olivera, Pedro. 2024. Making Lexicography Sustainable: Using ChatGPT and Reusing Data for Lexicographic Purposes. Lexikos, 34, 123-140.
Gantar, Apolonija. 2024. Formulisanje rečničkih definicija pomoću veštačke inteligencije na primeru slovenačkih frazeoloških jedinica. In Leksikografski susreti, edited by Saša Marjanović, 151-158. Beograd: Filološki fakultet.
Gapsa, Magdalena, Arhar Holdt, Špela, and Kosem, Iztok. 2024. Kako dober je ChatGPT pri umeščanju sopomenk pod pomene. In Konferenca Jezikovne tehnologije in digitalna humanistika, edited by Špela Arhar Holdt and Tomaž Erjavec, 144-162. Ljubljana: Inštitut za novejšo zgodovino.
Jakubíček, Miloš, and Rundell, Michael. 2023. The End of Lexicography? Can ChatGPT Outperform Current Tools for Post-Editing Lexicography? In Proceedings of the eLex 2023 Conference: Electronic Lexicography in the 21st Century, edited by Marek Medveď et al., 508-523. Brno: Lexical Computing.
Kosem, Iztok, et al. 2024. Examining the Potential of AI in the Annotation of Corpus Examples for Language Learning. 15th International Corpus Linguistics Conference, 93-95. Las Palmas de Gran Canaria, Spain, 22-24 May 2024. [Book of abstracts].
Lew, Robert. 2024. Dictionaries and Lexicography in the AI Era. Humanities and Social Sciences Communications, 11, 426.
Ljubešić, Nikola, and Kuzman, Taja. 2024. CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 3271-3282.
Ljubešić, Nikola, Kuzman, Taja, Filipović Petrović, Ivana, Parizoska, Jelena, and Osenova, Petya. 2024. CLASSLA-Express: A Train of CLARIN.SI Workshops on Language Resources and Tools with Easily Expanding Route. In CLARIN Annual Conference Proceedings 2024, edited by Vincent Vandeghinste and Thalassia Kontino, 31-35. Barcelona: CLARIN.
Rundell, Michael. 2023. Automating the Creation of Dictionaries: Are We Nearly There? In Proceedings of the 16th International Conference of the Asian Association for Lexicography: ‘Lexicography, Artificial Intelligence, and Dictionary Users’, 1-9. Seoul: Yonsei University.