bobbomania/rough_paths_ml_model

Rough Paths Coursework: Signature ML Model

Environment Setup

  1. Start the virtual environment:
    source /venv_02035622_06038567/bin/activate
  2. Install the required Python modules:
    python -m pip install -r requirements.txt
  3. Verify installation by running the main program:
    python main.py

Project Overview

Enhancing Midprice Prediction Using Financial and Signature Features

This project aims to predict midprice differences over 87 steps in a limit order book (LOB). It integrates handcrafted financial features, signature transforms, and optimized machine learning techniques, primarily using LightGBM.

Data Preprocessing

The data goes through the following pipeline before being passed to the LightGBM model:

  • Financial Feature Extraction: Derives financial features from the raw LOB data, which reduces dimensionality and exposes market structure more directly. Missing rate values are forward-filled and missing sizes are set to zero.
  • Feature Scaling & Normalization: Normalizes the rows for consistency.
  • Rolling Windows: Converts these features into structured 87-step windows.
  • Data Augmentation & Signature Transform: Augments each window and then applies the signature transform to it.
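The cleaning and windowing steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the column-name convention (names containing "rate" or "size") and the helper names clean_lob and rolling_windows are assumptions made for the example.

```python
import numpy as np
import pandas as pd

WINDOW = 87  # window length stated in this README


def clean_lob(df: pd.DataFrame) -> pd.DataFrame:
    """Forward-fill rate columns and zero-fill size columns.

    Assumes (hypothetically) that column names contain "rate" or "size".
    """
    out = df.copy()
    rate_cols = [c for c in out.columns if "rate" in c]
    size_cols = [c for c in out.columns if "size" in c]
    out[rate_cols] = out[rate_cols].ffill()
    out[size_cols] = out[size_cols].fillna(0.0)
    return out


def rolling_windows(features: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """Turn a (T, n_features) array into (T - window + 1, window, n_features)."""
    # sliding_window_view puts the window axis last, so swap it into the middle.
    view = np.lib.stride_tricks.sliding_window_view(features, window, axis=0)
    return np.swapaxes(view, 1, 2)
```

Consecutive windows produced this way overlap in all but one row, which matters when the data is later split into training and validation sets.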

Key Features

1. Handcrafted Financial Features

The model employs financial indicators to capture key market microstructure patterns:

  • Midprice & Spread: Identifies price movement trends.
  • Order Flow Imbalance (OFI): Measures net liquidity changes and price direction shifts.
  • Depth Imbalance: Assesses buy/sell pressure at different LOB levels.
  • Market Depth & VWAP: Evaluates liquidity and weighted price levels.
  • Momentum Indicators: Includes Stochastic Oscillator and price derivatives.
  • Bid-Ask Spread & Relative Spread: Captures execution costs and market liquidity.
  • Order Book Pressure (OBP): Measures the probability of price movement direction.
  • Price and Size Derivatives: Estimates the rate of change in Bid and Ask Sizes and Prices at the first level.
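A few of the level-one indicators listed above can be sketched as follows. The column names (bid_price, ask_price, bid_size, ask_size) are hypothetical, and the OFI formula is one common definition from the market microstructure literature; the repository may compute these features differently.

```python
import pandas as pd


def level_one_features(lob: pd.DataFrame) -> pd.DataFrame:
    """Compute midprice, spread, depth imbalance and OFI at the best quotes."""
    bid, ask = lob["bid_price"], lob["ask_price"]
    bid_sz, ask_sz = lob["bid_size"], lob["ask_size"]

    feats = pd.DataFrame(index=lob.index)
    feats["midprice"] = (bid + ask) / 2
    feats["spread"] = ask - bid
    feats["relative_spread"] = feats["spread"] / feats["midprice"]

    # Depth imbalance in [-1, 1]; defined as 0 when both sides are empty.
    total = bid_sz + ask_sz
    feats["depth_imbalance"] = ((bid_sz - ask_sz) / total.where(total != 0)).fillna(0.0)

    # Order flow imbalance: signed change in resting size at the best quotes,
    # conditioned on whether each best price moved up, down, or stayed put.
    prev_bid, prev_ask = bid.shift(1), ask.shift(1)
    prev_bid_sz, prev_ask_sz = bid_sz.shift(1), ask_sz.shift(1)
    ofi = (
        (bid >= prev_bid) * bid_sz
        - (bid <= prev_bid) * prev_bid_sz
        - (ask <= prev_ask) * ask_sz
        + (ask >= prev_ask) * prev_ask_sz
    )
    feats["ofi"] = ofi.fillna(0.0)  # the first row has no previous quote
    return feats
```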

2. Signature Features

The signature transform encodes sequential LOB movements and captures temporal dependencies effectively:

  • Normalisation of the path: Normalise the path using a Standard Scaler.
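  • Lead-Lag Augmentation: Pairs the path with a lagged copy of itself so that the signature captures the path's quadratic variation.
  • Time Augmentation & Basepoint: Time augmentation adds a time channel, making the signature sensitive to the speed at which the path is traversed (the plain signature is invariant to time reparameterization). The basepoint removes the signature's invariance to translation.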
  • Invisibility Reset: Provides information about the starting point of the path.
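To make the augmentations above concrete, here is a from-scratch NumPy sketch of three of them together with a signature truncated at depth 2. The README does not name the signature library or truncation depth used in the project, so the function names and the depth are assumptions for illustration; the invisibility-reset step is omitted.

```python
import numpy as np


def standardise(path: np.ndarray) -> np.ndarray:
    """Z-score each channel, as a standard scaler does per feature."""
    std = path.std(axis=0)
    return (path - path.mean(axis=0)) / np.where(std == 0, 1.0, std)


def add_time(path: np.ndarray) -> np.ndarray:
    """Prepend a time channel running from 0 to 1."""
    t = np.linspace(0.0, 1.0, len(path))[:, None]
    return np.hstack([t, path])


def add_basepoint(path: np.ndarray) -> np.ndarray:
    """Prepend the origin so the signature sees the path's starting value."""
    return np.vstack([np.zeros((1, path.shape[1])), path])


def lead_lag(path: np.ndarray) -> np.ndarray:
    """Return the (lead, lag) path of length 2n - 1 with doubled channels."""
    rep = np.repeat(path, 2, axis=0)
    return np.hstack([rep[1:], rep[:-1]])


def signature_depth2(path: np.ndarray):
    """Levels 1 and 2 of the signature of a piecewise-linear path."""
    dx = np.diff(path, axis=0)
    level1 = dx.sum(axis=0)
    offset = path[:-1] - path[0]  # position at the start of each segment
    # Iterated integral of (x_i - x_i(0)) dx_j, exact on linear segments.
    level2 = offset.T @ dx + 0.5 * dx.T @ dx
    return level1, level2
```

A feature vector for one window could then be built as `np.concatenate([l1, l2.ravel()])` from `l1, l2 = signature_depth2(lead_lag(add_time(add_basepoint(standardise(window)))))`; the order of augmentations here is a choice made for the example.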

Model Training

1. Machine Learning Model: LightGBM

The core predictive model is LightGBM (LGBMRegressor), chosen for its efficiency and scalability with large datasets.

Other designs were considered, such as RandomForest or a deep learning model for feature extraction, but they were too slow and too inaccurate to be useful for the prediction task.

2. Hyperparameter Optimization

Bayesian optimization (Hyperopt’s TPE) is used to fine-tune key hyperparameters:

  • Learning Rate, Max Depth, Subsample, Drop Rate: Control overfitting and improve generalization.
  • Min Child Samples: Regularizes model training.
  • Number of Estimators: Optimized for better predictive accuracy.

Other parameters, such as the window size and the amount of data used for training, were chosen by running the model multiple times and keeping the configuration with the best results.

3. Incremental Training for Large Datasets

To efficiently handle large-scale data, the model uses incremental training with keep_training_booster=True, allowing for:

  • Memory-efficient updates without retraining from scratch.
  • Gradual model refinement as more data is processed.

Data Splitting Strategy

  • Train-Validation Split (80-20): Holds out 20% of the data so that overfitting can be detected on samples the model has not seen.
  • Sequential Ordering: Maintains time-series integrity and prevents data leakage.
  • Validation Set for Hyperparameter Tuning: Hyperparameters are selected on the held-out validation set rather than on the training data.
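A sequential 80-20 split takes the first 80% of samples for training and the remainder for validation, with no shuffling. The sketch below also offers an optional gap parameter, which is an assumption added for the example: it drops the last few training samples so that overlapping rolling windows do not straddle the boundary.

```python
import numpy as np


def sequential_split(X, y, train_frac=0.8, gap=0):
    """Split in time order: earliest samples train, latest samples validate."""
    cut = int(len(X) * train_frac)
    train_end = max(cut - gap, 0)  # optionally leave a gap before validation
    return X[:train_end], X[cut:], y[:train_end], y[cut:]
```

With 87-step windows, a gap of 86 samples would keep every training window disjoint from every validation window.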

Model Evaluation & Performance

  • Metric Used: R² Score, which assesses the model's ability to explain variance in midprice changes. The model achieves an R² score of 0.088.
  • Incremental Model Updates: Enhances stability over long training periods.
  • Noise Robustness: Signature features and OFI-based indicators reduce the model's sensitivity to noisy LOB updates.
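For reference, the R² score compares the model's squared error with that of always predicting the mean of the targets. A minimal NumPy version (the function name r_squared is chosen for the example):

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot; 1 is perfect, 0 matches a constant-mean predictor."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Under this definition, a score of 0.088 means the squared prediction error is 8.8% lower than that of a constant prediction equal to the mean target.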

Training the Model

  1. Open the report_training.ipynb notebook.
  2. Run all the cells.

Testing the Model

  1. Open the report_evaluation.ipynb notebook.
  2. Run all the cells.

Conclusion

This approach integrates financial expertise, signature-based mathematical encoding, and optimized machine learning to build a computationally efficient model for midprice prediction in high-frequency trading environments, reaching an R² score of 0.088.
