Information Extraction for the Study of Humanities and Social Sciences Corpora: Opportunities and Challenges in Perspective

Abstract

Information extraction methods, such as named entity recognition, coreference resolution, and relation extraction, aim to produce structured data from unstructured or semi-structured sources like text. These structured outputs can then be used for various purposes, including database creation, ontology population, or automated reasoning. In humanities and social sciences (HSS) research, these methods are particularly valuable because they enable large-scale semantic annotation of corpora, which in turn facilitates their critical study.

In this presentation, I will discuss the application of information extraction methods to a range of HSS corpora (19th- and 20th-century periodicals, scientific articles), within the frameworks of the SpaceWars, EMONTAL, and InSciM projects. I will first focus on the approaches we employed to extract different types of information (named entities, relations, geographic information, etc.), as well as how we exploited this information—particularly through visualization techniques—to alternate between close and distant reading of the corpora.

In the second part, I will address the challenges involved in applying these methods to large-scale HSS corpora. I will specifically discuss the essential preprocessing steps, as well as difficulties related to the lack of dedicated resources for information extraction, such as annotated datasets or rule sets. Finally, I will address the comparative evaluation of different approaches (rule-based, machine learning, deep learning, small and large language models), considering not only their task-specific performance but also criteria such as scalability, interpretability, and resource efficiency.

Speaker

Nicolas Guterhlé is a postdoctoral researcher in Natural Language Processing (NLP) within the ANR project InSciM "Modelling Uncertainty in Science", at the CRIT laboratory.

Additional Information

The seminar will be given in 🇫🇷 French 🇫🇷. You can take part online.