Humanitarian reports on ReliefWeb as a domain-specific corpus
Keywords:
humanitarian domain, corpus creation, ReliefWeb, information extraction
Abstract
This paper assesses the content available through ReliefWeb’s API for its suitability as a domain-specific corpus. ReliefWeb’s position as a primary information resource for humanitarian response, with a database of nearly a million reports, lends it considerable value for the corpus-based study of humanitarian discourse. However, the service’s content remains under-explored in this regard. To address this gap, a Python package is introduced to manage the creation of ReliefWeb corpora. The composition of ReliefWeb’s English-language HTML reports is examined and compared with a corpus from the Humanitarian Encyclopedia. The comparison includes a keyness analysis of the Encyclopedia’s 129 concepts, an assessment of diachronic trends for six concepts (HUMANITARIAN REFORM, SUSTAINABILITY, RESILIENCE, GENDER-BASED VIOLENCE, SETTLEMENT, and SOVEREIGNTY), and an analysis of hypernymic and definitional knowledge-rich contexts. Results indicate that ReliefWeb reports, which are mostly brief news and press release items, have much lower relative frequencies for humanitarian concepts than the reference corpus. Still, the two datasets overlap considerably, and the breadth of the HTML content contributes important thematic diversity for some concepts. The paper concludes with a discussion of how the management of ReliefWeb corpora could be improved in future iterations.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.