| title | author | date | lang | documentclass | tags | abstract |
|---|---|---|---|---|---|---|
| Tutorial Main File | | \today | en | article | | This is the root of the tutorial. |
This tutorial aims to help you build a text processing workflow that suits your needs and your resources.
It is modular, in order to be useful for a variety of:
- research goals;
- starting points;
- technical levels and resources.
It is written with medieval Western languages in mind, especially Romance languages and Latin, but it can be useful for historical languages more generally.
This is a tutorial on building a text processing workflow, not a collection of tutorials on each specific tool. We include a short description card for each tool, and point to relevant external resources when needed.
This is a tutorial on corpus production and text processing workflows, so we will only allude to what can be done with the corpus once it has been produced (data mining, stemmatology, critical editing, etc.). Please note that this doesn't mean that you shouldn't be very clear about your research goals when you start building the corpus and selecting a path.
Before selecting a path, you need to establish clearly the resources you have or are willing to invest and the goals you are trying to attain, and to choose, regarding software, between user-friendliness and ease of use on the one hand and flexibility and tailor-made solutions on the other.
The following yes/no statements are intended to help you select a starting point. For each one, decide whether it is true in your case:
- my needs are not covered by out-of-the-box tools;
- I have access to an IT infrastructure, or, at least, a Linux computer or server;
- I know at least one programming language, or someone from my team does;
- I know what an API is;
- resources are scarce or non-existent for the language I study (models, corpora, …);
- I prefer to use only open software;
- I wish to control every aspect of the processing workflow (data model and formats, algorithms, etc.);
- I am ready to invest significant time and effort in producing this corpus.
If all these statements are false in your case, you can safely select path 1, the user-friendly path. Otherwise, you might need to select path 2, the do-it-yourself path [or even path 3, the pioneer path].
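The selection rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated: the names `STATEMENTS` and `select_path` are hypothetical, and since the text gives no criterion for choosing between path 2 and path 3, the sketch does not try to separate them.

```python
# Hypothetical helper illustrating the path-selection rule described above.
STATEMENTS = [
    "My needs are not covered by out-of-the-box tools",
    "I have access to an IT infrastructure, or at least a Linux computer or server",
    "I know at least one programming language, or someone from my team does",
    "I know what an API is",
    "Resources are scarce or non-existent for the language I study",
    "I prefer to use only open software",
    "I wish to control every aspect of the processing workflow",
    "I am ready to invest significant time and effort in producing this corpus",
]


def select_path(answers):
    """Return a suggested path from one True/False answer per statement."""
    if len(answers) != len(STATEMENTS):
        raise ValueError("expected one answer per statement")
    # Rule from the text: path 1 only if every statement is false.
    if not any(answers):
        return "path 1 (user-friendly)"
    # The text does not say how to choose between paths 2 and 3.
    return "path 2 (do-it-yourself), or possibly path 3 (pioneer)"


if __name__ == "__main__":
    print(select_path([False] * len(STATEMENTS)))
    print(select_path([True, False, True, True, False, False, False, True]))
```

Treat the result as a suggestion, not a verdict: a single true statement is enough to leave path 1 under this rule, and you may judge that some statements weigh more than others for your project.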
Orienting toward a step: where do I start?
- What is my goal, and what data do I expect from the output?
- What data do I already have, if any?
For each path: where you can cut corners and where you cannot; what is a deal-breaker and what is mandatory; which steps are skippable and which are not.
For each tool: ratio of time investment to improvement in the results (difficulty).
If you already have the digital text of your corpus, start at step 2 (text enrichment).
If you already have structured and annotated data, start at step 3 (text interrogation).
1. Text acquisition with Transkribus: OCR, HTR.
2. Text enrichment (skippable): select one or several (or none) of the following, according to your goals:
   - lemmatisation with Pyrrha;
   - named entity annotation with (Recogito?);
   - …
3. Text interrogation:
   - TXM.
4. Text alignment: are you working on texts that exist in multiple versions? If not, you can skip this step.
   - Tracer;
   - Iteal;
   - Stemmaweb.
5. See common final steps below.
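The starting-point rule for this path can be sketched as follows. This is a minimal illustration, not part of any tool: the function names are hypothetical, the step numbers assume the order of the list above (acquisition, enrichment, interrogation, alignment), and starting at step 1 when you have no data yet is an assumption.

```python
# Hypothetical sketch: pick a starting step from the data already available.
# Step order is assumed from the list above.
STEPS = [
    "text acquisition",
    "text enrichment",
    "text interrogation",
    "text alignment",
]


def starting_step(has_digital_text, has_annotated_data):
    """Return the 1-based number of the step to start at."""
    if has_annotated_data:
        return 3  # structured and annotated data: go to interrogation
    if has_digital_text:
        return 2  # digital text only: go to enrichment
    return 1  # assumption: nothing yet, so begin with acquisition


def remaining_steps(start):
    """List the steps from the starting step onwards."""
    return STEPS[start - 1:]


if __name__ == "__main__":
    start = starting_step(has_digital_text=True, has_annotated_data=False)
    print(start, remaining_steps(start))
```

Enrichment and alignment remain skippable whatever the starting step, so the returned list is an upper bound on the work ahead rather than a fixed plan.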
You can refer to the following documents to learn more about each conceptual step and to identify the software best suited to your needs. For each piece of software, we give a short 'software card' specifying installation, inputs and outputs, and existing models when relevant, and we provide data samples with minimal executable code.
The goal is to help you connect the tools, linking and chaining them.
- Project conception: goals, expected output, canonical data model.
- Getting the text: Text Acquisition
- Adding relevant information and structure: Text Enrichment
- Aligning multiple versions: Text Alignment and Collation
- See common final steps below.
Starting from scratch: data, models, etc.
What data are useful? For what?
Robinson's main stages of digital critical edition: transcription, collation, stemmatological analysis, edition, publication.
- legal questions
- How do I preserve the data?
- License
- Archiving
- edition / publishing
