Information Extraction for Humanities and Social Science Corpora: Exploring Possibilities and Challenges

Abstract

Information extraction methods, such as named entity recognition, coreference resolution, and relation extraction, aim to generate structured data from unstructured or semi-structured texts. The resulting data can then be used for tasks such as database creation, ontology population, or automated reasoning. In humanities and social sciences (HSS) research, these methods are particularly valuable because they make large-scale semantic annotation of corpora possible and support critical analysis.

In this presentation, I will discuss the application of information extraction techniques to various HSS corpora (including 19th- and 20th-century periodicals and scientific articles) as part of the SpaceWars, EMONTAL, and InSciM projects. In the first part, I will focus on the approaches we have employed to extract different types of information (named entities, relations, geographic data, etc.) and on how the extracted information has been used, especially through visualization techniques, to alternate between close and distant reading of the corpora.

In the second part, I will address the challenges and issues involved in applying these methods to large-scale HSS corpora. These include the necessary data preparation steps, the lack of domain-specific resources such as annotated datasets or rule sets, and the comparative evaluation of different approaches (symbolic, machine learning, deep learning, and small and large language models) in terms of performance as well as scalability, interpretability, and resource efficiency.

Additional Information

The seminar will be given in French.