This repository contains code used to scrape and preprocess textual data, train models, and run thorough evaluations and experiments. It uses the following tech stack:
- Python
- PyTorch + PyTorch Lightning: model building and training
- MLflow, TensorBoard: experiment tracking, reproducibility
- `scripts/`: Entrypoints used directly to perform elements of the scraping/processing/training/evaluation pipeline
- `cfg/`: Hydra configuration files used by the scripts
- `src/`: Source code
  - `data/`: Resources related to data operations
    - `eda/`: Creating visualizations & statistics for raw or processed data
    - `processing/`: Extracting numerical features from raw datasets
    - `scraping/`: Scraping raw data
    - `structs/`: Classes used to access directories of pre-defined structure
  - `model/`: Definition of the models and model components
  - `utils/`: Utilities used globally
- Follow the Google Python Style Guide (https://google.github.io/styleguide/pyguide.html).
- Avoid overly long functions. Where possible, split them into smaller ones.
- Avoid using typedefs to define complex data structures. Make the most of `Pydantic` or `dataclasses` instead.
- Use docstrings for modules, classes, and functions. For public functions, use the Google docstring format; for private functions, a short description is enough. Always write in the 3rd person.
- Avoid unnecessary `try`-`except` blocks and, in general, deeply nested code.
- Every module should have its own `_logger()` function, defining a logger as `logging.getLogger(__name__)`.
- Do not use exceptions to handle unrecoverable problems. Use a critical log and `sys.exit()` instead.
- Do not use comments, unless absolutely necessary.
- Do not use redundant variables unless they contribute to the readability of the code. As a rule of thumb, if a variable would be assigned a short expression, just use the expression directly.
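A minimal sketch of the data-structure and docstring guidelines above, using a `dataclass` with Google-format docstrings (the `Sample` class and `normalize` function are hypothetical, not part of this repository):

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """A single labelled text sample.

    Attributes:
        text: Raw text content.
        label: Integer class label.
    """

    text: str
    label: int


def normalize(sample: Sample) -> Sample:
    """Returns a copy of the sample with lowercased text.

    Args:
        sample: The sample to normalize.

    Returns:
        A new sample with normalized text and the original label.
    """
    return Sample(text=sample.text.lower(), label=sample.label)
```

The public function uses the full Google docstring format; a private helper would carry only a one-line description.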
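The logging and unrecoverable-error guidelines can be sketched as follows (the `read_corpus` function and its error message are illustrative assumptions):

```python
import logging
import sys


def _logger() -> logging.Logger:
    """Returns the module logger."""
    return logging.getLogger(__name__)


def read_corpus(path: str) -> str:
    """Reads a raw text corpus from disk.

    Args:
        path: Path to the corpus file.

    Returns:
        The file contents.
    """
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        # Unrecoverable: log critically and exit instead of propagating
        # the exception up the stack.
        _logger().critical("Cannot read corpus file: %s", path)
        sys.exit(1)
```

Note that the single `try`-`except` here guards one genuinely fallible operation; it is not wrapped around unrelated logic.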
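The redundant-variable rule can be illustrated with a short hypothetical helper: a short expression is used directly rather than bound to a throwaway name first.

```python
def count_tokens(text: str) -> int:
    """Returns the number of whitespace-separated tokens in the text."""
    # Avoid:
    #   n_tokens = len(text.split())
    #   return n_tokens
    # Prefer using the short expression directly:
    return len(text.split())
```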