History and Digital Humanities

Digital humanities is a research field that spans all disciplines within the humanities and social sciences. The team focuses in particular on contemporary history, in collaboration with members of the Image Processing and Pattern Recognition team, drawing on their expertise in the processing of historical documents. One of the main challenges of this research lies in extracting knowledge from textual sources, notably through quantitative tools that enable the semi-automated analysis of texts and the ideas they convey, including through cross-referencing. The extraction of data and knowledge from large volumes of complex historical sources is thus at the core of this research area.

The team is particularly interested in identifying and organizing relevant topics using natural language processing (NLP) and Formal Concept Analysis (FCA), as well as other techniques from data science such as classification. Another important focus is the visualization of both intermediate and final data in order to assess the validity and effectiveness of the developed methods.
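
As a rough illustration of how Formal Concept Analysis can group documents by shared topics, the sketch below enumerates the formal concepts of a tiny context. The speech identifiers and topic labels are hypothetical, and the brute-force enumeration is only one simple way to compute concepts; it is not the team's actual pipeline.

```python
def formal_concepts(context):
    """Return all (extent, intent) pairs of a small formal context.

    `context` maps each object (here: a speech) to its set of attributes
    (here: topic labels). Suitable for toy examples only.
    """
    all_attributes = frozenset().union(*context.values())
    # Every concept intent is an intersection of object attribute sets;
    # the full attribute set is the intersection of the empty family.
    intents = {all_attributes}
    for attrs in context.values():
        intents |= {intent & attrs for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is the set of objects carrying every attribute of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    return sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0])))


# Hypothetical context: three speeches tagged with topics.
context = {
    "speech_1": {"budget", "army"},
    "speech_2": {"budget", "schools"},
    "speech_3": {"budget", "army", "colonies"},
}

concepts = formal_concepts(context)
for extent, intent in concepts:
    print(sorted(extent), sorted(intent))
```

Each printed pair links a group of speeches to exactly the topics they all share, which is the kind of structure a concept lattice organizes.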

More recently, the integration of large language models (LLMs) into the document processing pipeline, from OCR to information extraction, has opened new possibilities for historical analysis. LLMs now make it possible to extract and analyze information from digitized historical documents at scale. However, their known limitations, such as hallucinations, outdated knowledge, and opaque reasoning, raise concerns about their use in historical research, where source verifiability is essential. To address these challenges, the team has initiated collaborative work with the AI team on Retrieval-Augmented Generation (RAG), a promising approach that grounds LLM output in passages retrieved from verified sources in order to improve both accuracy and traceability.
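
The retrieval side of RAG can be sketched in a few lines, assuming a simple bag-of-words similarity in place of a real embedding model. The passage identifiers, the passage texts, and the prompt wording below are all hypothetical, and the call to an LLM is deliberately left out; the point is only that each retrieved passage keeps its identifier so that an answer can be traced back to its source.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Return the identifiers of the k passages most similar to the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), pid) for pid, text in passages.items()]
    scored.sort(reverse=True)
    return [pid for score, pid in scored[:k] if score > 0]


def build_prompt(question, passages, k=2):
    """Build a prompt that quotes retrieved passages with their identifiers."""
    ids = retrieve(question, passages, k)
    context = "\n".join(f"[{pid}] {passages[pid]}" for pid in ids)
    prompt = (
        "Answer using only the passages below and cite their identifiers.\n"
        f"{context}\nQuestion: {question}"
    )
    return prompt, ids


# Hypothetical passages standing in for verified, digitized sources.
passages = {
    "debate-001": "The chamber discussed the budget for primary schools.",
    "debate-002": "Deputies debated the tariff on imported wheat.",
    "debate-003": "The minister defended the schools budget.",
}

prompt, ids = build_prompt("What was said about the schools budget?", passages)
print(prompt)
```

The prompt would then be passed to a language model of choice; because the identifiers travel with the passages, a historian can check every cited claim against the original document.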

Through specific case studies, such as the analysis of French parliamentary debates published between 1870 and 1940, the team evaluates the relevance of these methods for historical inquiry. In collaboration with a member of the AI team, it also analyzes the extracted data using NLP and text mining techniques.

Related Projects

AGODA

AGODA is a digital humanities project funded by the DataLab of the Bibliothèque nationale de France. It aims to make the parliamentary debates of the French Third Republic from 1881 to 1940 more accessible and usable by creating a structured, semantically enriched XML-TEI corpus.

Related Publications

[1]

Marie Puren • Fanny Lebreton • Aurélien Pellet • Pierre Vernus. "From parliamentary history to digital and computational history: a NLP-friendly TEI model for historical parliamentary proceedings". Digital Scholarship in the Humanities. 2024. https://doi.org/10.1093/llc/fqae071.

[2]

Marie Puren • Pierre Vernus. "Conceptual Modeling of European Silk Heritage with the SILKNOW Data Model and Extension". Digital Humanities Quarterly. 2024.

[3]

Marie Puren • Florian Cafiero. "InTEIrviews: An ODD for Qualitative Interviews in the Humanities". Journal of the Text Encoding Initiative. 2024. https://doi.org/10.4000/jtei.5007.

[4]

Marie Puren • Aurélien Pellet. "Explorer les débats parlementaires français de la Troisième République par leurs sujets". Humanistica 2023. 2023.

[5]

Jorge Sebastián Lozano • Ester Alba Pagán • Eliseo Martínez Roig • Mar Gaitán Salvatella • Arabella León Muñoz • Javier Sevilla Peris • Pierre Vernus • Marie Puren • Luis Rei • Dunja Mladenič. "Open Access to Data about Silk Heritage: A Case Study in Digital Information Sustainability". Sustainability. 2023. https://doi.org/10.3390/su151914340.