- Start the virtual environment:
  source /venv_02035622_06038567/bin/activate
- Install the required Python modules:
  python -m pip install -r requirements.txt
- Verify the installation by running the main program:
  python main.py
This project aims to predict midprice differences over 87 steps in a limit order book (LOB). It integrates handcrafted financial features, signature transforms, and optimized machine learning techniques, primarily using LightGBM.
The data goes through the following pipeline before being passed to the LightGBM model:
- Financial Feature Extraction: Extracts financial features from LOB data to provide dimensionality reduction and better insights into the data. Forward-fills missing rate values and sets missing sizes to zero.
- Feature Scaling & Normalization: Normalizes the rows for consistency.
- Rolling Windows: Converts these features into structured 87-step windows.
- Data Augmentation & Signature Transform: Augments each window and encodes it with the signature transform.
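The windowing step can be sketched with NumPy's stride tricks (a minimal illustration assuming the scaled features sit in a 2-D array of one row per LOB snapshot; the array and variable names are hypothetical):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

WINDOW = 87  # window length used throughout the project

# Hypothetical scaled feature matrix: one row per LOB snapshot.
features = np.random.default_rng(0).normal(size=(1000, 8))

# Each window is a contiguous 87-step slice of the feature rows;
# windows.shape == (n_snapshots - WINDOW + 1, WINDOW, n_features).
windows = sliding_window_view(features, WINDOW, axis=0).transpose(0, 2, 1)

print(windows.shape)  # (914, 87, 8)
```

Because `sliding_window_view` returns a view, this avoids copying the feature matrix once per window.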
The model employs financial indicators to capture key market microstructure patterns:
- Midprice & Spread: Identifies price movement trends.
- Order Flow Imbalance (OFI): Measures net liquidity changes and price direction shifts.
- Depth Imbalance: Assesses buy/sell pressure at different LOB levels.
- Market Depth & VWAP: Evaluates liquidity and weighted price levels.
- Momentum Indicators: Includes Stochastic Oscillator and price derivatives.
- Bid-Ask Spread & Relative Spread: Captures execution costs and market liquidity.
- Order Book Pressure (OBP): Gauges the likely direction of the next price move from the relative weight of bid-side and ask-side liquidity.
- Price and Size Derivatives: Estimates the rate of change in Bid and Ask Sizes and Prices at the first level.
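As an illustration of the first three indicators, the midprice, spread, and best-level OFI can be computed along these lines (a sketch with illustrative arrays; the OFI follows Cont, Kukanov & Stoikov's best-level formulation, which may differ in detail from the project's):

```python
import numpy as np

def midprice_and_spread(bid_px, ask_px):
    """Midprice and absolute spread from best bid/ask prices."""
    return (bid_px + ask_px) / 2.0, ask_px - bid_px

def order_flow_imbalance(bid_px, bid_sz, ask_px, ask_sz):
    """Best-level OFI: positive values indicate net buying pressure."""
    e_bid = (np.where(bid_px[1:] >= bid_px[:-1], bid_sz[1:], 0.0)
             - np.where(bid_px[1:] <= bid_px[:-1], bid_sz[:-1], 0.0))
    e_ask = (np.where(ask_px[1:] <= ask_px[:-1], ask_sz[1:], 0.0)
             - np.where(ask_px[1:] >= ask_px[:-1], ask_sz[:-1], 0.0))
    return e_bid - e_ask

bid_px = np.array([100.0, 100.0, 100.1])
ask_px = np.array([100.2, 100.1, 100.2])
bid_sz = np.array([5.0, 7.0, 4.0])
ask_sz = np.array([6.0, 3.0, 2.0])

mid, spread = midprice_and_spread(bid_px, ask_px)
ofi = order_flow_imbalance(bid_px, bid_sz, ask_px, ask_sz)
print(mid, spread, ofi)  # [100.1 100.05 100.15] [0.2 0.1 0.1] [-1. 7.]
```

The same pattern extends to deeper LOB levels for the depth-imbalance and market-depth features.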
The signature transform encodes sequential LOB movements and captures temporal dependencies effectively:
- Normalization of the Path: Normalizes the path using a StandardScaler.
- Lead-Lag Augmentation: Preserves order book variations.
- Time Augmentation & Basepoint: Breaks the signature's invariance to time reparameterization and to translation, so traversal speed and absolute level are captured.
- Invisibility Reset: Provides information about the starting point of the path.
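The lead-lag, time, and basepoint augmentations are standard path transforms and can be sketched in plain NumPy (the ordering and the signature library used in the project may differ; shapes here are illustrative):

```python
import numpy as np

def lead_lag(path):
    """Lead-lag augmentation: interleave each point with its predecessor,
    roughly doubling both the length and the dimension of the path."""
    doubled = np.repeat(path, 2, axis=0)
    return np.hstack([doubled[1:], doubled[:-1]])

def add_time(path):
    """Time augmentation: prepend a normalized time coordinate."""
    t = np.linspace(0.0, 1.0, len(path))[:, None]
    return np.hstack([t, path])

def add_basepoint(path):
    """Basepoint augmentation: start the path at the origin."""
    return np.vstack([np.zeros((1, path.shape[1])), path])

# One 87-step window with 2 features, augmented before the signature.
window = np.random.default_rng(0).normal(size=(87, 2))
augmented = add_basepoint(add_time(lead_lag(window)))
print(augmented.shape)  # (174, 5)
```

A signature library such as iisignature or esig would then compute the truncated signature of `augmented`.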
The core predictive model is LightGBM (LGBMRegressor), chosen for its efficiency and scalability with large datasets.
Other designs were considered, such as Random Forest and a deep-learning feature extractor, but they proved too slow and too inaccurate to be useful for the prediction task.
Bayesian optimization (Hyperopt’s TPE) is used to fine-tune key hyperparameters:
- Learning Rate, Max Depth, Subsample, Drop Rate: Control overfitting and improve generalization.
- Min Child Samples: Regularizes model training.
- Number of Estimators: Optimized for better predictive accuracy.
Other parameters, such as the window size and the amount of training data, were selected empirically by running the model multiple times and keeping the best-performing configuration.
To efficiently handle large-scale data, the model uses incremental training with keep_training_booster=True, allowing for:
- Memory-efficient updates without retraining from scratch.
- Gradual model refinement as more data is processed.
- Train-Validation Split (80-20): Ensures generalization without overfitting.
- Sequential Ordering: Maintains time-series integrity and prevents data leakage.
- Validation Set for Hyperparameter Tuning: Ensures unbiased performance assessment.
- Metric Used: R² Score – assesses the model’s ability to explain variance in midprice changes, achieving an R² score of 0.088.
- Incremental Model Updates: Enhances stability over long training periods.
- Noise Robustness: Signature features and OFI-based indicators reduce volatility.
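The R² metric reported above is the standard coefficient of determination; for reference (the sample values here are illustrative, not the project's):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([0.1, -0.2, 0.05, 0.0])
y_pred = np.array([0.05, -0.1, 0.0, 0.02])
print(r2_score(y_true, y_pred))
```

An R² of 0.088 is modest in absolute terms but is typical for high-frequency midprice prediction, where most of the variance is noise.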
- Open the report_training.ipynb notebook.
- Run all the cells.
- Open the report_evaluation.ipynb notebook.
- Run all the cells.
This approach integrates financial expertise, signature-based mathematical encoding, and optimized machine learning to develop a highly accurate and computationally efficient model for midprice prediction in high-frequency trading environments.