The Parallel Corpus of the Parable of the Prodigal Son (PCPPS) is a historical parallel corpus of languages spoken in mainland France. It consists of versions of the Parable of the Prodigal Son collected during the 19th century for linguistic surveys.
- Source identification: Although more than 500 translations of the Parable were produced on French territory, they were never compiled into a single corpus. A selection of 90 versions was compiled and edited during the 19th century and a few other texts appeared in contemporary journals, but most of the corpus is only available in manuscript form.
- Data retrieval: scraping of the data using an API, transcription (1_source)
- Data structuring: encoding in TEI following the COLaF Schema (available here), mostly using Python scripts (2_ProductionTEI)
- Data exploitation: word-level alignment using [CollateX](https://interedition.github.io/collatex/pythonport.html) (3_collation), and mapping of selected words onto the corresponding maps of the Atlas linguistique de la France using QGIS to assess the quality of the data (4_visualisation_carte)
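The word-level alignment step can be illustrated with a minimal sketch. It does not use CollateX: it relies on Python's standard `difflib` as a stand-in, only to show what an alignment table between two versions looks like. The function names (`tokenize`, `align`) and the two sample sentences are invented for illustration and are not taken from the corpus or from the repository scripts.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def tokenize(text):
    """Split a version into lowercase, whitespace-separated tokens."""
    return text.lower().split()


def align(tokens_a, tokens_b):
    """Return alignment rows (token_a, token_b); None marks a gap.

    Stand-in for a real collation tool: difflib finds matching runs of
    tokens, and the non-matching runs are paired position by position.
    """
    rows = []
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # zip_longest pads the shorter side with None (a gap in the table).
        rows.extend(zip_longest(tokens_a[i1:i2], tokens_b[j1:j2]))
    return rows


# Invented sample sentences, for illustration only.
version_a = tokenize("Un homme avait deux fils")
version_b = tokenize("Un homme avait deux garçons")

for token_a, token_b in align(version_a, version_b):
    print(f"{token_a or '-':<10}{token_b or '-'}")
```

Each row pairs one token of the first version with the token aligned to it in the second, which is the kind of table that can then be exported to CSV or encoded in TEI.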
A first corpus has been created with this pipeline from the 1879 edition of 89 selected translations, available on Gallica. The OCR output was extracted in XML/ALTO using the Gallica API and then manually corrected to ensure data quality. It was automatically encoded in XML/TEI using a Python script, and sociolinguistic metadata (language, locality and collector for each translation) were then added manually. The word-level alignment is available in both CSV and XML/TEI (3_collation). Lastly, twenty maps were created.
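As a rough illustration of the step between the ALTO export and the TEI encoding, the sketch below reads the text lines out of an ALTO document with the Python standard library. It is not the repository's script. The function names (`local_name`, `alto_lines`) and the sample document are assumptions made for the example. Only the general ALTO layout is relied on: `TextLine` elements containing `String` elements whose `CONTENT` attribute holds a word. Namespaces are ignored on purpose, since the namespace URI differs between ALTO versions and providers.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml):
    """Return the text of each TextLine in an ALTO document, in order."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each word of the line is a String element with a CONTENT attribute.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


# Invented placeholder document, not actual OCR output from Gallica.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="placeholder"/><SP/><String CONTENT="line"/><SP/><String CONTENT="one"/>
    </TextLine>
    <TextLine>
      <String CONTENT="placeholder"/><SP/><String CONTENT="line"/><SP/><String CONTENT="two"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

for line in alto_lines(SAMPLE_ALTO):
    print(line)
```

The extracted lines would still need manual correction and paragraph segmentation before being wrapped in TEI elements.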
The PCPPS corpus contains 89 dialectal versions of the same 22 paragraphs, for a total of 100,000 tokens. It represents 22 linguistic varieties (inferred from the place of collection) from areas outside the Paris Basin, such as Oïl varieties (e.g. Wallon, Picard, Franc-Comtois), Occitan varieties (e.g. Vivaro-Alpin, Languedocien), Platt, Poitevin-Saintongeais and Romanche. The following map of the word jeune in the corpus shows the distribution of the selected texts across mainland France.
Other XML/TEI files are also available in the repository (2_ProductionTEI):
- The 1831 edition of selected translations transcribed by Sven Ködel
- Various translations that appeared in contemporary journals (Le Brigant 1779 (bret.); Champollion-Figeac 1809 (vivaro-alpin); Société liégeoise de littérature wallonne 1864 (wall.); Favrat 1866 (suisse romand); Chambure 1878 (bourg.); Loth 1889 (bret.))
- In progress: the Bourciez Parables (4444 translations from localities in south-west France), in collaboration with the Bourciez Project
This work has been mainly funded by the Inria “Défi”-type project COLaF (Corpus et Outils pour les Langues de France).
All documents (source, encoded documents and code) are CC-BY.

Lucence Ing, Juliette Janès, Sven Ködel, David Escarpit, Alexandre Génadot, Quentin Peyras, Javier Martínez, Julien Buziol, Gérard Judet de la Combe, Aurélie Puig, Benoît Sagot, The Parallel Corpus of the Parable of the Prodigal Son, 2026, Paris: INRIA https://github.com/DEFI-COLaF/Parabole/
If you have any questions or remarks, please contact colaf@inria.fr.
