GitHub - EofK/Benchmark-16-ML-Models-Using-the-Dry-Bean-Dataset: This project analyzed and compared the performance of 16 machine learning models on a supervised classification task using the Dry Bean dataset. This project pursued two objectives: (1) Measure how accurately each model classifies unseen bean samples, and (2) Determine each model’s runtime for its classification training and testing process.

Performance Evaluation of 16 Machine Learning Models

Using the Dry Bean Dataset

Overview

This GitHub repository contains the final report and supporting Python modules for the analysis and comparison of 16 machine learning models applied to the Dry Bean dataset, which contains 13,611 records. This project focused on a supervised classification task and pursued two objectives:

Measure how accurately each model classifies unseen bean samples.
Determine each model’s runtime for its classification training and testing process.

The project’s final report documents Dry Bean dataset characteristics, including standard summary statistics, feature-level diagnostics such as the Confusion Susceptibility Score (CSS), and feature correlation and redundancy analysis. The report also describes the tuning of selected hyperparameters for each model, presents model-specific accuracy and runtime results, and explores tradeoffs between performance and efficiency. It concludes with a summary of model performance and feature-level insights.

Project File Structure

ml_benchmark/

LICENSE
Dry_Bean_Benchmark_Final_Report.pdf
README.md – Project documentation
requirements.txt – Python package dependencies
config.py – Centralized directory and file paths for project modules
datasets/ – Raw Dry Bean dataset
notebooks/ – Jupyter notebooks for each pipeline stage
outputs/ – All output files from the pipeline
- baseline_results/ – Results for baseline models without hyperparameter tuning
- clean_data/ – Preprocessing outputs
- curated_data/ – Final model-ready datasets
- feature_subsets/ – Feature analysis outputs
- figures/ – Visualizations and plots
- results/ – Benchmark results for tuned models
- summary/ – Consolidated pipeline summaries
- tuned_models/ – Serialized tuned model objects
- tuning/ – Hyperparameter tuning logs

Key Features

Feature diagnosis and exploratory analyses, including feature importance, correlation, and redundancy assessments
Evaluation of 16 classification models, spanning seven algorithm families
Stratified 5-fold cross-validation training and testing of each model
Accuracy and runtime results, before and after tuning hyperparameters
Accuracy of each model at predicting each bean type, and potential causes for performance anomalies
Accuracy vs. runtime tradeoff analysis
Potential reasons for accuracy and runtime results for specific machine learning models
Summary of tuning effects on model performance
Model performance evaluation and feature relevance summaries.
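The stratified 5-fold cross-validation listed above can be illustrated with a small standard-library sketch. It is not the notebooks' actual code, and in practice scikit-learn's StratifiedKFold provides this behavior. The function name and the toy labels are hypothetical. The point it shows is that each fold keeps roughly the same class mix as the full dataset.

```python
from collections import Counter, defaultdict
import random


def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class mix mirrors the full data."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled samples round-robin across the folds,
    # so every fold receives a near-equal share of every class.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Hypothetical, imbalanced toy labels (not the real Dry Bean class counts).
toy_labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
splits = list(stratified_kfold_indices(toy_labels, n_splits=5))

for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    class_counts = Counter(toy_labels[i] for i in test_idx)
    print(fold_number, len(train_idx), len(test_idx), dict(sorted(class_counts.items())))
```

With stratification, a rare bean class cannot end up missing from a test fold, which keeps per-class accuracy comparable across folds.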

Understanding the Code

The project uses one central configuration module and ten Jupyter notebooks:

config.py: Centralizes and defines all directory and file paths, ensuring consistent, portable, and maintainable path management across all project modules
01_load_and_explore.ipynb: Loads the Dry Bean dataset and performs initial data exploration and visualization
02_diagnose_feature_quality.ipynb: Analyzes feature distributions, missing values, and correlations to assess feature quality
03_select_feature_subset.ipynb: Selects the feature subsets for analysis (this project selected all 16 features)
04_curate_and_export.ipynb: Scales the 16 numeric features, encodes the single class label (bean type), and exports the curated dataset for subsequent modeling
04b_baseline_compare_models.ipynb: Benchmarks baseline performance of a selected model on the curated dataset without hyperparameter tuning, and stores the selected model’s results in an Excel file
05_tune_hyperparameters.ipynb: Performs model-specific hyperparameter tuning using GridSearchCV
06_build_model_dict.ipynb: Dynamically builds and saves a dictionary of trained model instances with their optimal hyperparameters
07_compare_models.ipynb: Evaluates the performance of a single user-selected tuned model across 137 combinations of dry bean features, using a runtime measurement approach designed to mitigate Windows management interruptions. It populates an Excel file for the selected model and adds the model’s confusion matrix to a separate Excel file that accumulates the confusion matrices of all models
08_consolidate_model_comparison_results.ipynb: Merges results from all models into a single summary Excel file
09_display_results.ipynb: Presents results and key findings for all models in several combined summary charts and tables
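The description of 07_compare_models.ipynb mentions a runtime measurement approach that mitigates operating-system interruptions but does not say how it works. One common way to reduce that kind of noise, shown here purely as an assumption and not as the notebook's method, is to repeat the timed call and report the median. The helper name and the stand-in workload are hypothetical.

```python
import statistics
import time


def time_callable(func, repeats=5):
    """Run func several times and return (median_seconds, all_timings)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution, monotonic clock
        func()
        timings.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to one run that was
    # slowed down by an unrelated background task.
    return statistics.median(timings), timings


def fake_train_and_test():
    # Hypothetical stand-in for one model's fit-and-predict cycle.
    return sum(i * i for i in range(50_000))


median_seconds, timings = time_callable(fake_train_and_test, repeats=5)
print(f"median: {median_seconds:.6f} s over {len(timings)} runs")
```

Reporting the minimum instead of the median is another reasonable choice when the goal is a best-case runtime.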

Running the Code

To run the pipeline:

Download the Dry Bean dataset from the source cited below, in either .csv or .xlsx format.
Clone or download this repository from GitHub.
Ensure Python 3.10+ and pip are installed.
(Optional but recommended) Create and activate a virtual environment.
Run “pip install -r requirements.txt” from the project root directory, to install all required packages.
Open a terminal and navigate to the project directory (e.g., cd C:/Misc/ml_benchmark).
Launch JupyterLab or Jupyter Notebook.
In the Jupyter interface, open the notebooks folder.
Update the PROJECT_BASE variable in config.py to match your machine’s file path (e.g., Path("C:/Your/Path/ml_benchmark")).
Ensure the expected subfolders exist (e.g., datasets/ for the Dry Bean dataset and notebooks/ for the 10 .ipynb files), or create them.
Run the notebooks in order, starting with 01_load_and_explore.ipynb and proceeding through 09_display_results.ipynb.
In each notebook:
- Read the introductory markdown cell(s) for context and instructions.
- Run all code cells sequentially from top to bottom (use "Run All" or "Run All Above/Below" as needed).
- Confirm that dependencies from earlier stages have executed successfully (most notebooks depend on outputs from a previous stage).
Review the outputs, figures, and exported files in the outputs directory as you progress.
If making changes to config.py, restart the notebook kernel and re-import the config module to ensure path changes take effect.
For troubleshooting, consult the README or review error messages in the notebook output cells.
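As a rough illustration of the config.py and subfolder steps above, the sketch below shows how centralized paths might be defined with pathlib and how missing folders could be created. Only PROJECT_BASE and the folder names come from this README. The other constant names and the helper function are hypothetical, and the real config.py may be organized differently.

```python
from pathlib import Path
import tempfile

# PROJECT_BASE is the variable named in the steps above. A temporary directory
# is used here only so the sketch runs anywhere; on a real machine it would be
# something like Path("C:/Your/Path/ml_benchmark").
PROJECT_BASE = Path(tempfile.mkdtemp()) / "ml_benchmark"

# Folder names follow the project file structure; constant names are hypothetical.
DATASETS_DIR = PROJECT_BASE / "datasets"
NOTEBOOKS_DIR = PROJECT_BASE / "notebooks"
OUTPUTS_DIR = PROJECT_BASE / "outputs"
OUTPUT_SUBDIRS = [
    "baseline_results", "clean_data", "curated_data", "feature_subsets",
    "figures", "results", "summary", "tuned_models", "tuning",
]


def ensure_directories():
    """Create any missing project folders and return the full list of paths."""
    targets = [DATASETS_DIR, NOTEBOOKS_DIR]
    targets += [OUTPUTS_DIR / name for name in OUTPUT_SUBDIRS]
    for path in targets:
        # parents=True also creates PROJECT_BASE and outputs/ when absent;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
    return targets


created = ensure_directories()
print(len(created), "folders ready under", PROJECT_BASE)
```

Keeping every path in one module means a move to another machine only requires changing PROJECT_BASE.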

Citation

Dataset Source: University of California, Irvine (UCI) Machine Learning Repository: https://doi.org/10.24432/C50S4B .

Reference Study: Koklu, M. and Ozkan, I.A. (2020). “Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques.” Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507 .

Note: Dry bean abbreviations used by this benchmarking project follow the naming conventions established by Koklu and Ozkan.

License

This project is licensed under the MIT License. Refer to the “LICENSE” file for details.

Contact Information

For questions or suggestions, feel free to contact:

Name: Ed Kaempf
Email: edkaempf@gmail.com
GitHub: github.com/EofK
LinkedIn: https://www.linkedin.com/in/ed-kaempf-4887839b/
