Václav Cvrček

Institute of the Czech National Corpus, Czech Republic

Václav Cvrček is a full professor at the Institute of the Czech National Corpus, Charles University, Czech Republic. He focuses on corpus linguistics, quantitative analysis of language, and corpus-based discourse analysis. His main interests involve corpus-based and corpus-driven methodologies, the descriptive grammar of Czech, language and register variation, and the study of anti-system media discourse. He is the lead author of the first corpus-based grammar of Czech (Mluvnice současné češtiny, published in 2010) and has co-authored several software tools that broaden the possibilities for analyzing corpus data (e.g. Calc, QuitaUp).

This talk investigates the potential of register analysis of text corpora for defining style labels or usage markers in dictionaries. In contemporary lexicography, large language corpora are commonly used to extract data on lexemes, their meanings, and their frequencies, employing advanced methods such as collocation extraction or sophisticated frequency measures that account for dispersion. However, when it comes to style markers, we mostly rely on the mere presence (or absence) of a word in a particular genre or text type, neglecting the potential offered by corpus-linguistic methods that deal with the variability of texts and their functional classification.
To address this issue, this talk proposes the use of the multi-dimensional analysis (MDA) method developed by Douglas Biber (1988, 1995) for register classification. MDA is known for effectively charting the space of variation by identifying major dimensions and delimiting registers within language. By exploring the associations between words and dimensions of variability or text registers in Czech, this talk will attempt to establish style markers that are at the same time practical for the dictionary user, empirically sound, and allow for semi-automatic extraction.

Marco C. Passarotti

Università Cattolica del Sacro Cuore, Italy

Marco Passarotti is a Full Professor of Computational Linguistics at Università Cattolica del Sacro Cuore (Milan, Italy), where he is Director of the CIRCSE Research Centre, which he co-founded in 2009. His main research interests deal with building, using and disseminating linguistic resources and natural language processing tools for Latin. A former pupil of one of the pioneers of humanities computing, Father Roberto Busa SJ, he has headed the Index Thomisticus Treebank project since 2006; the project continues the legacy of Busa’s work on the opera omnia of Thomas Aquinas. He is the principal investigator of the LiLa project, an ERC-Consolidator Grant (2018-2023), which aims to build a Linked Data Knowledge Base of linguistic resources and natural language processing tools for Latin.

In this talk, I will discuss the issue of interoperability between linguistic resources and how to address it by applying the principles of the Linked Data paradigm to describe several kinds of (meta)data provided by resources published on the web. In particular, I will focus on lexical resources, presenting how a few dictionaries and lexica for Latin interact with each other (and with textual corpora, too) in the LiLa Knowledge Base, i.e., a collection of multifarious resources made interoperable by adopting the same vocabulary for knowledge description, through common data categories and ontologies widely used in the Linguistic Linked Open Data community.

Wendalyn Nichols

Cambridge University Press & Assessment, United States

Wendalyn Nichols leads the Cambridge Dictionary unit as Publishing Manager in the English division of Cambridge University Press & Assessment. After more than a decade teaching academic and business English to speakers of other languages, she transitioned to publishing, joining Longman Dictionaries to train as a lexicographer. From those early days of the “corpus revolution” to the present, digital workflows and data-driven decision-making have together been a through line connecting her varied leadership roles in trade reference publishing, specialized information publishing and content marketing, and educational publishing. Based in New York, Wendalyn is an enthusiastic member of the Dictionary Society of North America, having served as a board member, reviews editor of its journal Dictionaries, and (currently) the publications committee chair.

Artificial intelligence is seen as an existential threat by publishers of nonfiction, most particularly the producers of reference content. What does it mean for content to be authoritative if anyone can type a question into a search box and get an answer from an AI chatbot without ever visiting a publisher’s website or buying a publisher’s books? Has the ubiquitous use of huge, widely available sets of lexical data to train AI algorithms hastened the end of original lexicography, and therefore of lexicographers? AI is already disrupting the market and is not going away, but both its strengths and shortcomings can be exploited by dictionary producers to turn the threat to our advantage.

Elena Álvarez Mellado

NLP research at UNED, Spain

Elena is a computational linguist: she is interested in Linguistics, technology and the intersection between them. She holds a BA in Linguistics from UCM and a master's degree in Computational Linguistics from Brandeis University. She currently works in the NLP&IR research group at UNED, where she is pursuing her PhD on computational approaches to borrowing detection. Her research has led to the creation of Observatorio Lázaro, an observatory that automatically monitors anglicism usage in the Spanish press. Prior to that, she spent a decade working on different language technology projects at various organizations, such as the Information Sciences Institute at the University of Southern California, Fundéu, and the UNED Digital Humanities Lab. She is also highly involved in dissemination activities that seek to bridge the gap between Linguistics and the general public, and she frequently writes about language for the Spanish national newspaper ElDiario.es and Archiletras magazine. She is also the winner of the Adam Kilgarriff Prize 2022.

Anglicisms are words from English that are borrowed into another language. Anglicisms are a common source of new words in Spanish, which makes them an interesting phenomenon to observe for linguists and lexicographers. In this session we will present Observatorio Lázaro, a machine learning pipeline that monitors the Spanish press of the day and detects new anglicisms automatically.