This project is a machine learning and data science pipeline that predicts outcomes from input features and uses MLflow to track experiments. Its main purpose is to establish a flexible, reusable workflow for personal ML/DS projects, one that can be adapted and extended to different datasets and machine learning tasks.
The project directory structure is as follows:
mlflow-tracking/
├── data/
│   ├── processed/
│   └── raw/
│       └── placeholder_data.csv
├── notebooks/
├── src/
│   ├── data_cleaner.py
│   ├── custom_data_cleaner/
│   │   ├── __init__.py
│   │   └── AccidentDataCleaner.py
│   ├── data_loader.py
│   ├── custom_data_loader/
│   │   ├── __init__.py
│   │   └── CSVDataLoader.py
│   ├── feature_engineer.py
│   ├── custom_feature_engineer/
│   │   ├── __init__.py
│   │   └── AccidentFeatureEngineer.py
│   ├── model_trainer.py
│   ├── custom_model_trainer/
│   │   ├── __init__.py
│   │   └── XGBoostModelTrainer.py
│   ├── custom_plotting/
│   │   ├── __init__.py
│   │   └── Plotting.py
│   ├── run.py
│   └── config.yaml
├── plots/
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md
To get started, clone the repository and install the necessary dependencies:
git clone https://github.com/jparrent/mlflow-tracking.git
cd mlflow-tracking
pip install -r requirements.txt
To run the project, execute the main script run.py:
python src/run.py
This will load the data, clean it, engineer features, train a model, and save the processed data and model, with tracking in MLflow.
src/data_loader.py
This script contains the DataLoader class, which is responsible for loading raw data from the specified directory. This serves as a base class for any custom data loading logic you might want to implement.
src/custom_data_loader/CSVDataLoader.py
This script contains the CSVDataLoader class, which inherits from DataLoader and implements the logic to load data specifically from CSV files. It demonstrates how to extend the base data loading functionality for specific file types.
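The base-class/subclass split described above could look roughly like the sketch below. It is not the repository's code: the method name `load`, the constructor argument `data_dir`, and the use of the standard-library `csv` module (instead of a DataFrame library) are assumptions chosen to keep the example dependency-free.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path


class DataLoader(ABC):
    """Base loader: knows where raw data lives, leaves parsing to subclasses."""

    def __init__(self, data_dir):
        # `data_dir` is a hypothetical parameter name for the raw-data directory.
        self.data_dir = Path(data_dir)

    @abstractmethod
    def load(self, filename):
        """Return the contents of `filename` as a list of row dicts."""


class CSVDataLoader(DataLoader):
    """Loads a CSV file from the data directory, one dict per row."""

    def load(self, filename):
        path = self.data_dir / filename
        if not path.exists():
            raise FileNotFoundError(f"No such data file: {path}")
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
```

A loader for another file type would subclass `DataLoader` in the same way and override only `load`.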
src/data_cleaner.py
This script defines a base DataCleaner class. The AccidentDataCleaner class inherits from this base class and can be extended for more specific cleaning operations. This approach allows for a flexible and modular data cleaning pipeline.
src/custom_data_cleaner/AccidentDataCleaner.py
This script contains the AccidentDataCleaner class, which extends the data cleaning functionalities specific to accident data. It handles missing values, removes duplicates, and filters invalid data.
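As an illustration of the three cleaning steps named above (missing values, duplicates, invalid data), here is a minimal sketch that works on plain row dicts. The `required` argument, the `is_valid` hook, and the `latitude`/`longitude` column names and range check are assumptions for the example, not details taken from the repository.

```python
class DataCleaner:
    """Base cleaner: drops incomplete rows and exact duplicates."""

    def __init__(self, required):
        # Columns that must be present and non-blank (hypothetical parameter).
        self.required = list(required)

    def is_valid(self, row):
        """Hook for dataset-specific checks; the base class accepts every row."""
        return True

    def clean(self, rows):
        seen, cleaned = set(), []
        for row in rows:
            # Missing values: skip rows where a required field is absent or blank.
            if any(row.get(col) in (None, "") for col in self.required):
                continue
            # Invalid data: skip rows rejected by the subclass hook.
            if not self.is_valid(row):
                continue
            # Duplicates: keep only the first occurrence of each row.
            key = tuple(sorted(row.items()))
            if key in seen:
                continue
            seen.add(key)
            cleaned.append(row)
        return cleaned


class AccidentDataCleaner(DataCleaner):
    """Example rule (assumed): coordinates must parse and lie in a valid range."""

    def is_valid(self, row):
        try:
            lat, lon = float(row["latitude"]), float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            return False
        return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```

Keeping the dataset-specific rule in a single overridable hook means the shared missing-value and duplicate handling stays in the base class.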
src/feature_engineer.py
This script defines the FeatureEngineer class, which is responsible for feature engineering tasks such as calculating distances from a fixed point, encoding cyclic features, and clustering based on latitude and longitude. This class serves as a base class that can be customized for different feature engineering needs.
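Two of the feature types mentioned above can be sketched with the standard library alone. The example assumes that "distance from a fixed point" means great-circle (haversine) distance and that cyclic features are encoded as a sine/cosine pair; the function names are hypothetical, and clustering is left out because it would need an external library.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def encode_cyclic(value, period):
    """Map a periodic value (e.g. hour of day) to a (sin, cos) pair on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)
```

The point of the sine/cosine encoding is that values at the wrap-around boundary end up close together: hour 23 is as near to hour 0 as hour 1 is, which a raw 0 to 23 integer would not express.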
src/custom_feature_engineer/AccidentFeatureEngineer.py
This script contains the AccidentFeatureEngineer class, which inherits from FeatureEngineer and implements additional feature engineering steps specific to accident data. It demonstrates how to extend the base feature engineering functionality for specific datasets.
src/model_trainer.py
This script contains the ModelTrainer class, which trains machine learning models. This class includes methods for handling the training process, hyperparameter tuning, model evaluation, and logging with MLflow. It serves as a flexible base for implementing specific training algorithms.
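One way to structure such a base class is sketched below; it is an illustration, not the repository's implementation. The `run`, `fit`, and `predict` method names, the `tracker` argument, the toy `MeanBaselineTrainer`, and the choice of mean absolute error as the metric are all assumptions, and hyperparameter tuning is omitted for brevity.

```python
from abc import ABC, abstractmethod


class ModelTrainer(ABC):
    """Base trainer: subclasses supply fit/predict; run() evaluates and logs."""

    def __init__(self, params, tracker):
        self.params = dict(params)
        # Any object exposing log_params(dict) and log_metric(key, value).
        self.tracker = tracker

    @abstractmethod
    def fit(self, X, y):
        """Train the model on features X and targets y."""

    @abstractmethod
    def predict(self, X):
        """Return one prediction per row of X."""

    def run(self, X_train, y_train, X_test, y_test):
        self.tracker.log_params(self.params)
        self.fit(X_train, y_train)
        preds = self.predict(X_test)
        # Mean absolute error on the held-out set.
        mae = sum(abs(p - t) for p, t in zip(preds, y_test)) / len(y_test)
        self.tracker.log_metric("mae", mae)
        return mae


class MeanBaselineTrainer(ModelTrainer):
    """Toy model for illustration: always predicts the training-set mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)

    def predict(self, X):
        return [self.mean_] * len(X)


class InMemoryTracker:
    """Stand-in tracker that records params and metrics in dictionaries."""

    def __init__(self):
        self.params, self.metrics = {}, {}

    def log_params(self, params):
        self.params.update(params)

    def log_metric(self, key, value):
        self.metrics[key] = value
```

The `mlflow` module exposes `log_params` and `log_metric` functions with the same call shape, so passing `tracker=mlflow` inside a `with mlflow.start_run():` block should route the same calls to MLflow, while the in-memory tracker keeps the sketch runnable without it.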
src/custom_model_trainer/XGBoostModelTrainer.py
This script contains the XGBoostModelTrainer class, which inherits from ModelTrainer and implements the logic to train a machine learning model using XGBoost. It includes hyperparameter tuning, model evaluation, and logging with MLflow. This class is designed to be adaptable for different models and evaluation metrics.
src/custom_plotting/Plotting.py
This script contains custom plotting utilities that can be used to visualize various aspects of the data, model performance, or any other relevant metrics. It enhances the project's capability to generate insightful visualizations as part of the data exploration or model evaluation process.
src/run.py
The main script that ties together the data loading, cleaning, feature engineering, and model training steps. It orchestrates the entire workflow and ensures that each step is executed in the correct order. It saves the processed data to disk and loads it if it already exists to avoid redundant processing.
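The "load if it already exists" behaviour can be sketched as a small caching helper. The function name `load_or_build`, the `build` callback, and the JSON file format are assumptions made so the example runs with the standard library only; the project may store processed data in a different format.

```python
import json
from pathlib import Path


def load_or_build(cache_path, build):
    """Return cached data if `cache_path` exists; otherwise build, save, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the expensive processing step entirely.
        with cache_path.open(encoding="utf-8") as handle:
            return json.load(handle)
    data = build()
    # Create the target directory (e.g. data/processed/) if it does not exist yet.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle)
    return data
```

In a pipeline like this one, `build` would wrap the loading, cleaning, and feature engineering steps, so a second run goes straight to model training. Deleting the cached file forces the data to be reprocessed.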
The data used in this project is placeholder data from a recent interview, used here to set up a larger personal ML/DS workflow. The goal is to create a flexible framework that can be easily adapted to different datasets and machine learning tasks. The current data cleaning and feature engineering steps are basic and meant to be expanded upon for more complex projects.
- Custom Algorithms: Make it easier to swap out algorithms and customize the ModelTrainer.
- Modularization: Further modularize the codebase to support easier integration of new components.
Feel free to contribute to this project by opening issues or submitting pull requests. Your feedback and suggestions are welcome!
This project is licensed under the MIT License. See the LICENSE file for details.