- Start the virtual environment:
  source /venv_02035622_06038567/bin/activate
- Install the required Python modules:
  python -m pip install -r requirements.txt
- Verify the installation by running the main program:
  python main.py
This project aims to predict midprice differences over 87 steps in a limit order book (LOB). It integrates handcrafted financial features, signature transforms, and optimized machine learning techniques, primarily using LightGBM.
The data goes through the following pipeline before being passed to the LightGBM model:
- Financial Feature Extraction: Extracts financial features from LOB data to provide dimensionality reduction and better insights into the data. Forward-fills missing rate values and sets missing sizes to zero.
- Feature Scaling & Normalization: Normalizes the rows for consistency.
- Rolling Windows: Converts these features into structured 87-step windows.
- Data Augmentation & Signature Transform: Augments each window and encodes it with the signature transform.
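The windowing step can be sketched with NumPy's stride tricks (a minimal illustration assuming the scaled features sit in a 2-D array of one row per LOB snapshot; the array and variable names are hypothetical):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

WINDOW = 87  # window length used throughout the project

# Hypothetical scaled feature matrix: one row per LOB snapshot.
features = np.random.default_rng(0).normal(size=(1000, 8))

# Each window is a contiguous 87-step slice of the feature rows;
# windows.shape == (n_snapshots - WINDOW + 1, WINDOW, n_features).
windows = sliding_window_view(features, WINDOW, axis=0).transpose(0, 2, 1)

print(windows.shape)  # (914, 87, 8)
```

Because `sliding_window_view` returns a view, this avoids copying the feature matrix once per window.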
The model employs financial indicators to capture key market microstructure patterns:
- Midprice & Spread: Identifies price movement trends.
- Order Flow Imbalance (OFI): Measures net liquidity changes and price direction shifts.
- Depth Imbalance: Assesses buy/sell pressure at different LOB levels.
- Market Depth & VWAP: Evaluates liquidity and weighted price levels.
- Momentum Indicators: Includes Stochastic Oscillator and price derivatives.
- Bid-Ask Spread & Relative Spread: Captures execution costs and market liquidity.
- Order Book Pressure (OBP): Gauges the likely direction of the next price move from the relative weight of bid-side and ask-side liquidity.
- Price and Size Derivatives: Estimates the rate of change in Bid and Ask Sizes and Prices at the first level.
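As an illustration of the first three indicators, the midprice, spread, and best-level OFI can be computed along these lines (a sketch with illustrative arrays; the OFI follows Cont, Kukanov & Stoikov's best-level formulation, which may differ in detail from the project's):

```python
import numpy as np

def midprice_and_spread(bid_px, ask_px):
    """Midprice and absolute spread from best bid/ask prices."""
    return (bid_px + ask_px) / 2.0, ask_px - bid_px

def order_flow_imbalance(bid_px, bid_sz, ask_px, ask_sz):
    """Best-level OFI: positive values indicate net buying pressure."""
    e_bid = (np.where(bid_px[1:] >= bid_px[:-1], bid_sz[1:], 0.0)
             - np.where(bid_px[1:] <= bid_px[:-1], bid_sz[:-1], 0.0))
    e_ask = (np.where(ask_px[1:] <= ask_px[:-1], ask_sz[1:], 0.0)
             - np.where(ask_px[1:] >= ask_px[:-1], ask_sz[:-1], 0.0))
    return e_bid - e_ask

bid_px = np.array([100.0, 100.0, 100.1])
ask_px = np.array([100.2, 100.1, 100.2])
bid_sz = np.array([5.0, 7.0, 4.0])
ask_sz = np.array([6.0, 3.0, 2.0])

mid, spread = midprice_and_spread(bid_px, ask_px)
ofi = order_flow_imbalance(bid_px, bid_sz, ask_px, ask_sz)
print(mid, spread, ofi)  # [100.1 100.05 100.15] [0.2 0.1 0.1] [-1. 7.]
```

The same pattern extends to deeper LOB levels for the depth-imbalance and market-depth features.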
The signature transform encodes sequential LOB movements and captures temporal dependencies effectively:
- Normalization of the Path: Normalizes the path using a StandardScaler.
- Lead-Lag Augmentation: Preserves order book variations.
- Time Augmentation & Basepoint: Breaks the signature's invariance to time reparameterization and to translation, so traversal speed and absolute level are captured.
- Invisibility Reset: Provides information about the starting point of the path.
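The lead-lag, time, and basepoint augmentations are standard path transforms and can be sketched in plain NumPy (the ordering and the signature library used in the project may differ; shapes here are illustrative):

```python
import numpy as np

def lead_lag(path):
    """Lead-lag augmentation: interleave each point with its predecessor,
    roughly doubling both the length and the dimension of the path."""
    doubled = np.repeat(path, 2, axis=0)
    return np.hstack([doubled[1:], doubled[:-1]])

def add_time(path):
    """Time augmentation: prepend a normalized time coordinate."""
    t = np.linspace(0.0, 1.0, len(path))[:, None]
    return np.hstack([t, path])

def add_basepoint(path):
    """Basepoint augmentation: start the path at the origin."""
    return np.vstack([np.zeros((1, path.shape[1])), path])

# One 87-step window with 2 features, augmented before the signature.
window = np.random.default_rng(0).normal(size=(87, 2))
augmented = add_basepoint(add_time(lead_lag(window)))
print(augmented.shape)  # (174, 5)
```

A signature library such as iisignature or esig would then compute the truncated signature of `augmented`.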
The core predictive model is LightGBM (LGBMRegressor), chosen for its efficiency and scalability with large datasets.
Other designs were considered, such as Random Forest and a deep-learning feature extractor, but they proved too slow and too inaccurate to be useful for the prediction task.
Bayesian optimization (Hyperopt’s TPE) is used to fine-tune key hyperparameters:
- Learning Rate, Max Depth, Subsample, Drop Rate: Control overfitting and improve generalization.
- Min Child Samples: Regularizes model training.
- Number of Estimators: Optimized for better predictive accuracy.
Other parameters, such as the window size and the amount of training data, were selected empirically by running the model multiple times and keeping the best-performing configuration.
To efficiently handle large-scale data, the model uses incremental training with keep_training_booster=True, allowing for:
- Memory-efficient updates without retraining from scratch.
- Gradual model refinement as more data is processed.
- Train-Validation Split (80-20): Ensures generalization without overfitting.
- Sequential Ordering: Maintains time-series integrity and prevents data leakage.
- Validation Set for Hyperparameter Tuning: Ensures unbiased performance assessment.
- Metric Used: R² Score – assesses the model’s ability to explain variance in midprice changes, achieving an R² score of 0.088.
- Incremental Model Updates: Enhances stability over long training periods.
- Noise Robustness: Signature features and OFI-based indicators reduce volatility.
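The R² metric reported above is the standard coefficient of determination; for reference (the sample values here are illustrative, not the project's):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([0.1, -0.2, 0.05, 0.0])
y_pred = np.array([0.05, -0.1, 0.0, 0.02])
print(r2_score(y_true, y_pred))
```

An R² of 0.088 is modest in absolute terms but is typical for high-frequency midprice prediction, where most of the variance is noise.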
- Open the report_training.ipynb notebook.
- Run all the cells.
- Open the report_evaluation.ipynb notebook.
- Run all the cells.
This approach integrates financial expertise, signature-based mathematical encoding, and optimized machine learning to develop a highly accurate and computationally efficient model for midprice prediction in high-frequency trading environments.