This project explores the relationship between water temperature and the reproductive cycles of octopuses in a controlled lab environment. The primary goal is to develop an interpretable machine learning model that can predict the likelihood of an octopus laying eggs based on historical temperature data.
This analysis serves as a preliminary investigation to identify potential temperature patterns that could inform future biological experiments. The final model, a Decision Tree, suggests a specific signal—a rise in temperature approximately 1.5 months prior to an event—that correlates with egg-laying.
- Temperature Signal: The model's most significant predictive path indicates that a rise in temperature from below 20.7°C (8 weeks prior) to above 24.2°C (6 weeks prior) is a strong indicator of an upcoming egg-laying week.
- Model Performance: The correlation between predictions and actual events is informative, successfully identifying key egg-laying periods:

- Interpretability: By using a simple Decision Tree with a limited depth (
max_depth=3), the model's logic is fully transparent, making it a useful tool for generating hypotheses for biologists.
├── data/ # Raw data files
├── notebooks/ # analysis.ipynb (original exploratory notebook)
├── src/ # Source code for the project
│ ├── data_processing.py # Script for data cleaning and feature engineering
│ ├── modeling.py # Script for model training and visualization
│ └── main.py # Main script to run the pipeline
├── .gitignore # Files to ignore in Git
├── README.md # Project documentation
└── requirements.txt # Python dependencies
- The raw hourly temperature data and daily egg-laying logs were loaded and merged.
- To manage the severe class imbalance (only 125 laying days out of ~4000), the data was resampled to a weekly frequency. The weekly temperature was averaged, and a week was marked as an "event week" if at least one laying occurred.
- Lag Features were created for the temperature data, representing the average temperature from 2, 4, 6, and 8 weeks prior. These features allow the model to make predictions based on past conditions, not current ones.
- A Decision Tree Classifier was chosen for its high interpretability.
- To account for the remaining class imbalance, the
class_weight='balanced'parameter was used. - The data was split into training and test sets using a time-based split to preserve the temporal order.
- The model was trained on the first half of the data and evaluated on the second half.
The trained Decision Tree was visualized to understand the rules it learned for predicting an egg-laying event. The primary signal identified provides a clear, testable hypothesis for marine biologists.
-
Clone the repository:
git clone https://github.com/davidspn/octopus-reproduction-prediction.git cd octopus-reproduction-prediction -
Create a virtual environment and install dependencies:
python -m venv venv source venv/bin/activate # On Windows, use `venv\Scripts\activate` pip install -r requirements.txt
-
Run the main analysis script:
python src/main.py
This will process the data, train the model, print the evaluation metrics, and display the prediction and decision tree plots.
