Carole Tiberius & Jesse de Does

Dutch Language Institute, Netherlands

Carole Tiberius is Professor of Computational Linguistics at the Leiden University Centre for Linguistics (LUCL) and a senior computational linguist at the Instituut voor de Nederlandse Taal (INT, Dutch Language Institute).

She holds degrees from the Higher Institute for Translators and Interpreters in Antwerp and the University of Nijmegen (MA in Language, Speech and Computer Science), as well as a PhD from the University of Brighton for research into ‘multilingual lexical knowledge representation’.

Her research interests lie in the domains of computational lexicography and corpus linguistics. At the Dutch Language Institute, she is primarily involved in contemporary lexicographic projects such as the Vertaalwoordenschat, an online platform for bilingual dictionaries, and Woordcombinaties, a project combining collocations and pattern analysis for Dutch. She is one of the authors of A Frequency Dictionary of Dutch.

Jesse de Does is a senior computational linguist at the Instituut voor de Nederlandse Taal (INT, Dutch Language Institute). 

He holds degrees in mathematics and Slavic linguistics. From 1986 to 1990, he worked as a research assistant in the Department of Slavic Language and Literature at Leiden University. In 1995, he obtained his PhD in applied mathematics.

His professional interests are historical language processing, linguistic annotation, corpus retrieval and language resource development. Since 2008, he has been closely involved in various international and national projects, such as IMPACT, SUCCEED, tranScriptorium, CLARIN-NL, CLARIAH and SSHOC-NL. He is currently the SSHOC-NL project leader for the institute and a member of the senior team.

The Dutch Language Institute (INT) has a long tradition of compiling historical and contemporary dictionaries and other types of lexicographic databases, mainly for Dutch but also for some other languages related to Dutch. Lexicographic work at the institute is computer-supported, but a great deal of manual work is still involved. INT is therefore exploring how new technologies (including LLMs) can be used to optimise different parts of the lexicographic workflow without compromising data quality and reliability. After a brief overview of various pilot studies conducted at the institute, we will take a closer look at how the implementation of Hanks’ Corpus Pattern Analysis procedure (as it is used in the context of the Woordcombinaties project) can be made more intelligent. In this way, we hope to ultimately realise Patrick Hanks’ vision that “it seems likely that a large part of the work that is currently being carried out by hand will be automated in the not-too-distant future” (Hanks 2013: 247).

Marko Robnik-Šikonja

Faculty of Computer and Information Science, University of Ljubljana, Slovenia

Marko Robnik-Šikonja is a Professor of Computer Science and Informatics at the University of Ljubljana, Faculty of Computer and Information Science, and head of the Machine Learning and Language Technology Lab. His research interests span machine learning, data mining, natural language processing, and explainable artificial intelligence. His most notable scientific results concern deep learning, natural language analysis, feature evaluation, ensemble learning, predictive model explanation, information network analysis, and data generation. He is the (co)author of over 250 scientific publications, cited more than 9,500 times. He has contributed to numerous national and EU projects and has authored several data mining software packages and language resources.

Currently, large language models (LLMs) are redefining methodological approaches in many scientific areas, including linguistics and lexicography. LLMs are pretrained on huge text corpora by predicting the next token and are then adapted for human interaction using instruction-following datasets. This does not make them immune to hallucinations and biases, so a human-in-the-loop approach remains necessary. In the context of lexicography, LLMs can be used to support several tasks. We will present how the information contained in language databases can be utilized to improve LLMs on lexicographic tasks. Our current methodology is based on knowledge graph extraction, continued pretraining of LLMs, prompt engineering, and semi-automatic evaluation.
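
Purely as an illustration (and not the pipeline presented in this talk), a prompt-engineering step of this kind might look roughly as follows. Here complete() is a hypothetical stand-in for whatever LLM backend is used, and every suggestion is assumed to be reviewed by a lexicographer before it enters the language database.

    # Illustrative sketch only, not the presenters' actual methodology.
    # `complete` is a hypothetical placeholder for an LLM call.
    def complete(prompt: str) -> str:
        raise NotImplementedError("plug in the LLM backend of your choice")

    def suggest_triples(headword: str, examples: list[str]) -> list[str]:
        """Ask the model for knowledge-graph triples for one headword.

        The output is treated as a suggestion only; a lexicographer checks
        every triple before it is stored (human-in-the-loop).
        """
        prompt = (
            "You assist a lexicographer.\n"
            f"Headword: {headword}\n"
            "Corpus examples:\n"
            + "\n".join(f"- {s}" for s in examples) + "\n"
            "List subject|relation|object triples for the senses attested "
            "above, one per line. Do not invent senses that are not attested."
        )
        return [line for line in complete(prompt).splitlines() if line.strip()]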

Michal Měchura

Lexical Computing and Dublin City University

Michal Měchura is a language technologist with two decades of experience building IT solutions for lexicography, terminology and onomastics. He has worked on projects such as the National Terminology Database for Irish, the Placenames Database of Ireland and the New English–Irish Dictionary. He is the founder of the open-source dictionary writing system Lexonomy and the author of Terminologue, an open-source terminology management platform. Recently, Michal has been chairing the LEXIDMA technical committee in OASIS, which has created DMLex, a modern data model for lexicography.

It has been almost half a century since we started “doing” lexicography on computers. Let’s stop for a minute now and take a critical look at the data models we have been using to represent the structure of dictionaries in dictionary writing systems and other software.

In this talk, I will trace the history of lexicographic data modelling from its beginnings as text markup for retro-digitised dictionaries, to the present day when most dictionaries are born-digital. I will show that, regardless of which notation we use (XML, JSON or other), the underlying design pattern is almost always a tree structure in which the various content items (headwords, senses, definitions…) are arranged in a parent-child hierarchy.
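
As a minimal, hypothetical illustration (not taken from any particular dictionary writing system), such a tree-shaped entry might be modelled like this, with every content item sitting under exactly one parent:

    # Hypothetical tree-structured entry: each item has exactly one parent.
    entry = {
        "headword": "bank",
        "partOfSpeech": "noun",
        "senses": [
            {"definition": "a financial institution",
             "examples": ["she works at a bank"]},
            {"definition": "the land alongside a river",
             "examples": ["they sat on the river bank"]},
        ],
    }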

I will argue that the tree-structured pattern is not expressive enough to handle some phenomena that occur in dictionaries, such as entry-to-entry cross-references, the placement of multiword subentries, and complex hierarchies of subsenses. These things would be easier to manage in a graph-based data structure, such as a relational database or a Semantic Web-style knowledge graph.
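
An equally hypothetical sketch makes the contrast concrete: in a graph-based model, entries and senses become nodes with identifiers, and a cross-reference is just another edge between two nodes, something a strict parent-child tree cannot express directly.

    # Hypothetical graph-style model: nodes plus labelled edges.
    nodes = {
        "bank-n": {"type": "entry", "headword": "bank"},
        "bank-n-1": {"type": "sense", "definition": "a financial institution"},
        "building_society-n": {"type": "entry", "headword": "building society"},
    }
    edges = [
        ("bank-n", "hasSense", "bank-n-1"),
        ("bank-n-1", "seeAlso", "building_society-n"),  # cross-reference to another entry
    ]
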

Dictionary projects which insist on a purely tree-structured data model are failing to make full use of the digital medium. But upgrading to a graph-based data model is difficult because tree-structured thinking is entrenched in the minds of lexicographers and dictionary users alike. This talk will conclude with an introduction to DMLex, a recently standardised “Data Model for Lexicography” which aims to ease this transition by being a hybrid model, combining tree structures where possible with graph structures where necessary.