CELGA-ILTEC, University of Coimbra / University of Lisbon
VOC, a Spelling Dictionary for the Portuguese Language – role and characteristics
Portuguese is a pluricentric language, now spoken in eight countries on four continents. Since 1911, it has had two different spelling norms, one in Brazil and one in Portugal. During the 20th century, the authorities of these two countries struggled to agree on a set of spelling rules that could account for the variation between the two national varieties while using a spelling system that is mostly phonemic. Since 1990, all Portuguese-speaking countries have been bound by the Orthographic Agreement for the Portuguese Language (AOLP90). However, for more than two decades this set of spelling rules could not be shaped into a common spelling dictionary to be official for all these countries.
VOC (Vocabulário Ortográfico Comum da Língua Portuguesa – Common Spelling Dictionary for the Portuguese Language) was conceived and carried out within this context. It is a digital platform that hosts the spelling dictionaries of each Portuguese-speaking country, all compiled following common orthographic and lexicographic principles.
In this presentation, we will introduce VOC, highlighting its lexicographic challenges and issues, but also its political role and impact.
Margarita Correia has a PhD in Portuguese Linguistics from the University of Lisbon, and a Post-Doc in Computational Lexicography at the Federal University of São Carlos (UFSCar – Brazil). She has been a professor at the Department of General and Romance Languages of the Faculty of Letters of the University of Lisbon since 1990, where she has taught several courses (including Lexicology, Lexicography and Terminology) at undergraduate and graduate levels.
She is a member of the Direction Board of the Centre for General and Applied Linguistics Studies (CELGA-ILTEC, University of Coimbra), where she coordinates the research group Lexicon and Computational Modeling.
She works mainly in Applied Linguistics, with a focus on Lexicography, Terminology, Neology and Language Policy. With José Pedro Ferreira, she directed the projects VOP – Vocabulário Ortográfico do Português [the Spelling Dictionary of Portuguese] (1st and 2nd edition) and Lince – Conversor para a Nova Ortografia [Spelling Converter] (2010), which are official instruments for the implementation of the 1990 spelling reform in Portugal. With José Pedro Ferreira and Gladis Maria de Barcellos Almeida, she coordinated the VOC – Vocabulário Ortográfico Comum da Língua Portuguesa [the Common Spelling Dictionary of the Portuguese Language; Ferreira, Correia, & Almeida (Orgs.), 2017], under the supervision of the Instituto Internacional da Língua Portuguesa (IILP) [International Institute of the Portuguese Language]. Since 2018, she has been the president of the Scientific Board of the IILP.
The Right Rhymes: Smart Lexicography in Full Effect
This talk will introduce The Right Rhymes, an evidence-based dictionary of hip-hop language, and survey methodologies used to build the underlying corpus, develop the dictionary data, and publish to the web. The talk will open with a demo of the rap lyrics corpus; this will include an overview of the tools used for assembly and maintenance, then dive into metadata integration and contextual data enrichment, explaining why investment in these areas improved the dictionary. Following that will be a brief examination of the dictionary data, touching on content modelling, and how that can enable flexibility and evolution in the final product. The talk will conclude with a backend-to-frontend tour of The Right Rhymes website, including API exposure, data visualization, and lessons learned in designing an interface for the modern user.
Matt Kohl began his career at the Oxford English Dictionary (http://www.oed.com/). He then continued working at the Oxford University Press in the field of language technology, where he led the development of LEAP (Lexical Engine and Platform), a platform to store, optimise and deliver lexical data for projects such as Oxford Global Languages. This work also laid foundations for the Oxford dictionaries API program. He has since transitioned into software and knowledge engineering, and is currently helping to build out the data architecture at GeoPhy (https://geophy.com/). Matt is the creator of The Right Rhymes (https://therightrhymes.com/), a hip-hop dictionary based on rap lyrics. He lives and works in London.
Matt Kohl is the winner of the Adam Kilgarriff Prize.
Berlin-Brandenburg Academy of Sciences and the Humanities
The Center for digital lexicography of the German Language: new perspectives for smart lexicography
The Zentrum für digitale Lexikographie der deutschen Sprache (ZDL, Center for digital lexicography of the German Language) aims to provide a comprehensive and empirically reliable description of the German language from its origins to the present. To this end, four German academies in Berlin (BBAW, coordinator), Göttingen (AdGW), Leipzig (SAW), and Mainz (AdWL) have joined forces. The academies have a rich tradition of dictionary projects, encompassing historical as well as modern dictionaries and including the Grimmsches Wörterbuch, the dictionaries of Old High German, Middle High German, Early New High German and the Digital Dictionary (DWDS) of contemporary German. In addition, the center is cooperating with the Leibniz Institute for the German Language (IDS) for neologisms and contemporary text corpora. In order to provide a ubiquitous search interface to these diverse dictionary sources, a considerable amount of integration work will be necessary in the coming years, including work on common formats, lemma lists, as well as cross-linking references from dictionaries to corpora.
Alexander Geyken has worked at the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW) since 1999, where he directs the long-term research project “Digital Dictionary of the German Language” (DWDS) as well as the Berlin part of the “Zentrum für digitale Lexikographie der deutschen Sprache” (ZDL). He received his Ph.D. in “Computational Linguistics” at the University of Munich in 1998, and obtained his habilitation (post-doctoral degree) in 2017 in the field of “Linguistics” at the University of Potsdam, where he has also held a teaching position since May 2018. His main research interests are computational lexicography, corpus linguistics as well as the use of syntactic and semantic resources for the mining of large textual data.
SIL’s language data collection
SIL linguists have studied minority languages since 1934. This talk will describe the extent of SIL’s language data and give a brief history of its data collection methods and tools.
The translation of the Bible into many languages represents a multilingual parallel corpus. Complete translations of the New and Old Testaments exist in 690 languages. New Testament translations exist in an additional 1550 languages. SIL is considering how to give academic linguists greater access to those translations for which it holds the copyright.
SIL has also published lexicons for 660 languages and vocabulary lists in an additional 200 languages and is considering possibilities for sharing that data more widely.
SIL’s FieldWorks software has been used to manage lexical data and to create many of the more recent dictionaries.
Keywords: multilingual corpus, lexicon, FieldWorks, Rapid word collection.
FieldWorks: Open-source dictionary editing software. https://software.sil.org/fieldworks/
FLEx Tools: Programs for manipulating FLEx data. https://github.com/cdfarrow/flextools
LanguageDepot: FieldWorks data hosting. https://public.languagedepot.org/
Language Forge: Online dictionary creation and collaboration. https://languageforge.org/
Rapid Word Collection: Create dictionaries in minority languages. http://rapidwords.net/
David Baines began working with SIL in the Philippines in 2000 and later worked with SIL in Chad. He joined the Language Software Development department of SIL International in 2007 as a software tester for FieldWorks. Many of his roles at SIL have included liaison between linguists and developers. For the past couple of years he has been importing dictionaries from Shoebox/Toolbox into FieldWorks prior to publication on Webonary and as mobile apps. Part of his current role is to design interactions between translators’ software and FieldWorks so that the translators can make the fullest use of linguistic data. He has a particular interest in finding beneficial partnerships between SIL and other individuals or organisations and has encouraged SIL International to apply for Observer status with ELEXIS.
Wroclaw University of Technology
Wordnet as a Relational Semantic Dictionary Built on Corpus Data
Princeton WordNet – the prototypical wordnet (“the mother of all wordnets”) – started off as a psycholinguistic experiment on language acquisition by children. Later, it developed into a lexico-semantic database. Thus, WordNet was not originally meant to be a dictionary, but at some point began to be treated as one. It is usually presented as a network of lexicalised concepts (represented as synsets – synonym sets). In addition, many people call it, and use it as, a kind of ontology. Contrary to such claims, we will argue that a wordnet can be modelled (and constructed) as a relational semantic dictionary in which lexical meanings are the basic building blocks, defined by a dense network of lexico-semantic relations as the primary means of their description.
In this perspective, synsets are construed as sets of lexical meanings that share lexico-semantic relations of certain types. Thus, there is no need to assign them a special ontological status. Relations between synsets are just notational abbreviations for bundles of relations between lexical meanings. The whole construction of a wordnet is based on the Minimal Commitment Principle: minimising the number of assumptions and maximising the freedom of further interpretation of the wordnet structure.
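The idea that a relation between synsets only abbreviates relations between their member meanings can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the plWordNet data model: the meaning identifiers, the relation names `hyponym_of` and `has_part`, the synset labels and the function `synset_relations` are all hypothetical, and the rule used here (a synset-level relation holds only if it holds for every pair of members) is one possible reading of "notational abbreviation".

```python
from itertools import product

# Hypothetical toy data. Relations are stored between lexical meanings only,
# as (source meaning, relation type, target meaning) triples.
RELATIONS = {
    ("car.1", "hyponym_of", "vehicle.1"),
    ("automobile.1", "hyponym_of", "vehicle.1"),
    ("car.1", "has_part", "wheel.1"),
    ("automobile.1", "has_part", "wheel.1"),
    ("truck.1", "hyponym_of", "vehicle.1"),
}

# Hypothetical synsets: plain sets of meanings, with no status of their own.
SYNSETS = {
    "S_car": {"car.1", "automobile.1"},
    "S_vehicle": {"vehicle.1"},
    "S_wheel": {"wheel.1"},
    "S_truck": {"truck.1"},
}


def synset_relations(synsets, relations):
    """Derive synset-level relations from meaning-level relations.

    Assumption of this sketch: a relation holds between synsets A and B
    exactly when it holds between every member of A and every member of B.
    """
    relation_types = {rel for _, rel, _ in relations}
    derived = set()
    for (name_a, members_a), (name_b, members_b) in product(synsets.items(), repeat=2):
        if name_a == name_b:
            continue
        for rel in relation_types:
            if all((m, rel, n) in relations for m, n in product(members_a, members_b)):
                derived.add((name_a, rel, name_b))
    return derived


if __name__ == "__main__":
    for triple in sorted(synset_relations(SYNSETS, RELATIONS)):
        print(triple)
```

Nothing is stored at the synset level in this sketch: every synset relation is recomputed from the meaning-level triples, so the synsets carry no information beyond their membership.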
As is typical for dictionaries, all lexical properties are assigned to lexical meanings, especially non-relational elements of description such as usage examples, textual definitions or attributes like stylistic register. These properties, as well as the lexico-semantic relations, can be grounded in language data in a straightforward way, e.g. by various linguistic tests verified against usage examples, not only by the intuitions of linguists.
In order to show the consequences of the model, we will refer to plWordNet – a wordnet of Polish – which has been consistently built on this basis. A corpus-based wordnet development process has been applied in the construction of plWordNet, i.e. large text corpora were used as a source of lexical knowledge supporting the work of lexicographers to extract, e.g., lemmas, clusters of usage examples suggesting potential meanings, multi-word expressions, distributional models revealing semantic relatedness or instances of lexico-semantic relations. The talk will be illustrated with examples and statistics zooming in on several details of the solution.
Maciej Piasecki is an Assistant Professor at the Wroclaw University of Science and Technology (Department of Computational Intelligence, Faculty of Computer Science and Management), Poland, the Polish National Coordinator of CLARIN (clarin.eu) (European language technology research infrastructure), the Chair of the CLARIN ERIC National Coordinators Forum (since April 2018) and the coordinator of CLARIN-PL (clarin-pl.eu) (Polish consortium, a part of CLARIN). He is the leader of the G4.19 Research Group: Computational Linguistics and Language Technology (nlp.pwr.edu.pl) – one of the largest Polish research teams in these areas. The main mission of G4.19 is the development of open, robust language technology for Polish, in both monolingual and bilingual settings.
Since 2008 he has coordinated 14 large projects or work packages (national and funded from EU structural funds, including 3 projects in cooperation with companies) on language technology and its applications. He is also a member of the DARIAH-PL Board (dariah.pl) and the Global WordNet Association Board.
His main research areas include Computational Linguistics, Natural Language Engineering and Human Language Technology. The main research topics are: automated extraction of the lexico-semantic knowledge from text, semi-automated wordnet expansion, Distributional Semantics, relational lexical semantics and shallow semantic processing of text. He has also been working on morpho-syntactic processing of Polish (a co-author of the first publicly available morpho-syntactic tagger of Polish, with many applications), Information Extraction, Question Answering, Formal Semantics and Machine Translation. He has been the leader of the Polish wordnet project: plWordNet (plwordnet.pwr.edu.pl) – the largest language resource of this type in the world.