Semantic Analysis and Relational Graphs for Opening and Studying Parliamentary Debates at the French National Assembly
AGODA is a digital humanities project that aims to make the parliamentary debates of the French Third Republic (1881–1940) accessible and usable through an online platform. Although these debates are fully digitized and available via the Gallica digital library of the BnF, they remain underused because of OCR quality issues and the lack of structured, interoperable formats. AGODA brings together historians and natural language processing (NLP) specialists to produce enriched corpora in XML-TEI and to facilitate large-scale exploration and analysis.
Objectives
- Build an online platform for consulting historical parliamentary debates.
- Produce semantically enriched and structured textual data in XML-TEI format.
- Enable the creation of sub-corpora by speaker, date, or topic.
- Apply and integrate distant reading methods (e.g. topic modeling, word embeddings).
- Develop a reproducible and scalable workflow for the preparation and analysis of large-scale historical corpora.
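The sub-corpus objective above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the `Speech` record, the `select_subcorpus` function, and the sample speakers, dates, and topics are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Iterable, List, Optional, Set


@dataclass
class Speech:
    """Hypothetical record for one speech; not the project's data model."""
    speaker: str
    day: date
    topics: Set[str] = field(default_factory=set)
    text: str = ""


def select_subcorpus(
    speeches: Iterable[Speech],
    speaker: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
    topic: Optional[str] = None,
) -> List[Speech]:
    """Keep the speeches that match every criterion that was provided."""
    selected = []
    for speech in speeches:
        if speaker is not None and speech.speaker != speaker:
            continue
        if start is not None and speech.day < start:
            continue
        if end is not None and speech.day > end:
            continue
        if topic is not None and topic not in speech.topics:
            continue
        selected.append(speech)
    return selected


# Invented placeholder data, not taken from the actual debates.
SAMPLE_SPEECHES = [
    Speech("Speaker A", date(1889, 11, 19), {"budget"}, "placeholder text"),
    Speech("Speaker B", date(1890, 3, 4), {"tariffs"}, "placeholder text"),
    Speech("Speaker A", date(1892, 1, 21), {"tariffs", "agriculture"}, "placeholder text"),
]

if __name__ == "__main__":
    for speech in select_subcorpus(SAMPLE_SPEECHES, speaker="Speaker A", topic="tariffs"):
        print(speech.speaker, speech.day.isoformat(), sorted(speech.topics))
```

Criteria left as `None` are ignored, so the same function covers selection by speaker, by date range, by topic, or by any combination of the three.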
Methodology
As a proof of concept, the project focuses on the 1889–1893 legislative cycle, comprising over 10,000 pages of parliamentary debates. The digitized documents are reprocessed using the PERO OCR engine, developed by the ANR SODUCO project, which is particularly well-suited to historical printed materials from the 19th century.
The texts are then encoded using a custom XML-TEI scheme, inspired by the ParlaClarin and ParlaMint models but adapted to the specific characteristics of historical French sources. The encoding includes detailed structural annotations (such as sessions, speeches, and votes) as well as semantic annotations (notably named entities and topics derived from computational analyses).
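To give a sense of how structural and semantic annotations can be read back from such an encoding, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The fragment is an assumption made for illustration: it borrows the `<u who="…">` utterance element and `<placeName>` from general TEI and ParlaMint practice, omits the header a complete TEI file would need, and does not reproduce the project's actual scheme. The speaker identifiers and sentences are invented.

```python
import xml.etree.ElementTree as ET

# TEI namespace, bound to the prefix "tei" for path queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative fragment only; element choices are assumptions, not AGODA's scheme.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="debateSection">
        <u who="#speakerA">The budget for <placeName>Paris</placeName> is under discussion.</u>
        <u who="#speakerB">I ask for a vote.</u>
      </div>
    </body>
  </text>
</TEI>"""


def read_utterances(xml_text):
    """Return one dict per <u> element: speaker reference, plain text, place names."""
    root = ET.fromstring(xml_text)
    rows = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        rows.append({
            # Drop the leading "#" of the local pointer.
            "who": u.get("who", "").lstrip("#"),
            # Flatten inline markup and normalize whitespace.
            "text": " ".join("".join(u.itertext()).split()),
            # Collect named places annotated inside the utterance.
            "places": [p.text for p in u.iterfind(".//tei:placeName", TEI_NS)],
        })
    return rows


if __name__ == "__main__":
    for row in read_utterances(SAMPLE):
        print(row)
```

Keeping the speaker reference, the flattened text, and the entity annotations side by side is what makes the encoded debates convenient input for downstream NLP tools.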
The encoded data are also designed for linked data integration: entities such as members of parliament are associated with external authority files such as Sycomore.
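The linking step can be sketched as a lookup from local speaker identifiers to authority records. The example below is a simplified assumption, not the project's pipeline: the identifiers are invented, and the `https://example.org/authority/…` addresses are placeholders that do not correspond to real Sycomore records.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical index: local identifier -> authority record address (placeholders).
AUTHORITY_INDEX: Dict[str, str] = {
    "speakerA": "https://example.org/authority/0001",
    "speakerB": "https://example.org/authority/0002",
}


def link_speakers(
    refs: Iterable[str], index: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Resolve local references such as "#speakerA" against an authority index.

    Returns the resolved links and the identifiers that could not be matched,
    so that unmatched speakers can be reviewed by hand.
    """
    linked: Dict[str, str] = {}
    unresolved: List[str] = []
    for ref in refs:
        local_id = ref.lstrip("#")
        if local_id in index:
            linked[local_id] = index[local_id]
        elif local_id not in unresolved:
            unresolved.append(local_id)
    return linked, unresolved


if __name__ == "__main__":
    linked, unresolved = link_speakers(["#speakerA", "#speakerC"], AUTHORITY_INDEX)
    print(linked)
    print(unresolved)
```

Returning the unmatched identifiers separately reflects a common need with historical sources, where name variants and OCR errors mean that some entities require manual reconciliation.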
Expected Outcomes
- A web platform for historians, social scientists, and the general public to consult and explore the debates.
- A reusable TEI encoding model for historical parliamentary corpora.
- A methodological contribution to digital history, with workflows integrating NLP tools and TEI.
- Greater visibility and reuse of French political heritage, encouraging new interdisciplinary research.
Related Publications
[1] Marie Puren, Fanny Lebreton, Aurélien Pellet, Pierre Vernus. "From parliamentary history to digital and computational history: a NLP-friendly TEI model for historical parliamentary proceedings". Digital Scholarship in the Humanities. 2024. https://doi.org/10.1093/llc/fqae071.
[2] Marie Puren, Aurélien Pellet. "Explorer les débats parlementaires français de la Troisième République par leurs sujets". Humanistica 2023. 2023.