This project builds a short-horizon alpha model using limit order book (LOB) and trade data.
The goal is to predict next-second mid-price movements in a high-frequency setting using market microstructure features.
The modelling approach is split into two stages:
- Move model: predicts whether the mid-price will change
- Sign model: predicts direction conditional on a move
These are combined into a final trading signal and evaluated through a simple backtest.
Data is sourced from the Binance exchange for the BTCUSDT trading pair.
- Order book snapshots (top 100 levels)
- Recent trades (~1000 per snapshot)
- Sampling frequency: ~1 second
Trades and order book data are aligned using a backward time join:
- Each trade is matched to the most recent past order book snapshot
- This ensures no forward-looking information is used
Timestamps are adjusted to account for observed latency.
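The backward join described above can be sketched with pandas `merge_asof`. This is a minimal illustration on toy data: the column names (`ts_ms`, `mid`, `qty`) and the values are assumptions, not the project's schema, and the latency adjustment is left out.

```python
import pandas as pd

# Hypothetical toy data; timestamps are integer milliseconds.
book = pd.DataFrame({"ts_ms": [0, 1000, 2000], "mid": [100.0, 100.5, 100.2]})
trades = pd.DataFrame({"ts_ms": [400, 1000, 2900], "qty": [0.1, 0.3, 0.2]})

# Both frames must be sorted by the join key before calling merge_asof.
joined = pd.merge_asof(
    trades.sort_values("ts_ms"),
    book.sort_values("ts_ms"),
    on="ts_ms",
    direction="backward",  # latest snapshot at or before each trade
)
print(joined)
```

With `direction="backward"`, a trade can only pick up a snapshot whose timestamp is at or before its own, so no later book state leaks into the trade's row.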
Features are constructed from both the order book and trade flow; a selection is listed below.
- Mid-price
- Bid–ask spread
- Relative spread
- Volume imbalance at multiple depths (levels 1, 5, 10)
- Depth ratios
- Total bid/ask volume
- Total book volume
- Volume concentration
- Trade count
- Buy/sell volume
- Trade size statistics (mean, max, std)
- Trade volume imbalance (buy vs. sell)
- Log returns (multiple horizons)
- Rolling volatility
- Realised volatility
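As a rough illustration of a few of the order book features listed above, the sketch below computes mid-price, spread, relative spread, and volume imbalance from one snapshot. The function name, the input layout, and the imbalance definition `(bid - ask) / (bid + ask)` are assumptions; the project may define them differently.

```python
import numpy as np

def book_features(bids, asks, depth=5):
    """bids/asks: (levels, 2) arrays of (price, size), best level first."""
    bids = np.asarray(bids, dtype=float)
    asks = np.asarray(asks, dtype=float)
    best_bid, best_ask = bids[0, 0], asks[0, 0]
    mid = 0.5 * (best_bid + best_ask)
    spread = best_ask - best_bid
    # Sum resting size over the first `depth` levels on each side.
    bid_vol = bids[:depth, 1].sum()
    ask_vol = asks[:depth, 1].sum()
    return {
        "mid": mid,
        "spread": spread,
        "rel_spread": spread / mid,
        "imbalance": (bid_vol - ask_vol) / (bid_vol + ask_vol),
    }

feats = book_features(
    bids=[[99.0, 3.0], [98.0, 1.0]],
    asks=[[101.0, 1.0], [102.0, 1.0]],
    depth=1,
)
```

Calling the same function with `depth` set to 1, 5, and 10 would give the multi-depth imbalance variants.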
Heavy-tailed features are transformed using signed log scaling where appropriate.
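One common form of signed log scaling is `sign(x) * log(1 + |x|)`. The sketch below assumes that form; the exact transform used in the project is not stated.

```python
import numpy as np

def signed_log(x):
    """Compress large magnitudes while preserving sign and mapping 0 to 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

scaled = signed_log(np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0]))
```

The transform is odd-symmetric and monotone, so the ordering of observations is unchanged while extreme values are pulled in.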
Two prediction targets are defined:
- Move: binary variable indicating whether the mid-price changes in the next second
- Sign: direction of the price change (conditional on a move)
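A minimal sketch of how the two targets could be derived from a mid-price series sampled once per second. The encoding (1 for up, 0 for down) and the toy prices are assumptions.

```python
import numpy as np

# Hypothetical mid-prices at consecutive one-second timestamps.
mid = np.array([100.0, 100.0, 100.5, 100.2, 100.2])

# Change from t to t+1, defined for every t except the last.
delta = mid[1:] - mid[:-1]

# Move target: 1 if the mid-price changes over the next step.
move = (delta != 0).astype(int)

# Sign target: defined only on rows where a move occurred (1 = up, 0 = down).
sign = (delta[move == 1] > 0).astype(int)
```

The sign model is trained only on the rows where `move == 1`, which is what makes its output a conditional probability.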
Models are trained using a time-series cross-validation framework:
- `TimeSeriesSplit` with gap to prevent leakage
- Rolling training method as a very rough way to account for regime shifts (prior to explicit regime modelling)
- Out-of-fold (OOF) predictions used for evaluation
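The cross-validation setup above might look like the following sketch using scikit-learn's `TimeSeriesSplit`. The data is synthetic, and the values of `n_splits`, `gap`, and `max_train_size` are placeholders rather than the project's settings; `max_train_size` is used here as one way to get a rolling training window.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic features and binary target, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# gap drops samples between train and test; max_train_size caps the window.
cv = TimeSeriesSplit(n_splits=5, gap=5, max_train_size=200)

# NaN marks rows that never appear in a test fold.
oof = np.full(len(y), np.nan)
for train_idx, test_idx in cv.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    oof[test_idx] = model.predict_proba(X[test_idx])[:, 1]

covered = ~np.isnan(oof)
```

Evaluation metrics are then computed on `oof[covered]` against `y[covered]`, so every prediction comes from a model that never saw that row or anything after it.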
- Logistic Regression (L1, L2, Elastic Net)
- Random Forest
- Extra Trees
- XGBoost
Logistic regression is used as a baseline and for feature selection. Tree-based models capture non-linear interactions.
Models can be calibrated to improve probability estimates; calibration is currently not applied because of the time-series nature of the data.
The final signal combines both models:
P(up) = P(move) × P(up | move)
A trading signal is constructed as:
- Long if high probability of upward move
- Short if high probability of downward move
- No trade otherwise
Thresholding is applied:
- Move probability filter (e.g. top quantiles)
- Sign confidence threshold
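The combination and thresholding steps can be sketched as follows. The function name, the default thresholds, and the choice to apply the sign threshold to P(up | move) are assumptions made for illustration.

```python
import numpy as np

def build_signal(p_move, p_up_given_move, move_quantile=0.5, sign_threshold=0.6):
    """Return (signal, p_up, p_down); signal is +1 long, -1 short, 0 no trade."""
    p_move = np.asarray(p_move, dtype=float)
    p_up_given_move = np.asarray(p_up_given_move, dtype=float)

    # Unconditional direction probabilities.
    p_up = p_move * p_up_given_move
    p_down = p_move * (1.0 - p_up_given_move)

    # Move filter: keep only observations in the upper quantiles of P(move).
    active = p_move >= np.quantile(p_move, move_quantile)

    signal = np.zeros(len(p_move), dtype=int)
    signal[active & (p_up_given_move >= sign_threshold)] = 1
    signal[active & (p_up_given_move <= 1.0 - sign_threshold)] = -1
    return signal, p_up, p_down

signal, p_up, p_down = build_signal(
    p_move=[0.1, 0.8, 0.9, 0.7],
    p_up_given_move=[0.9, 0.7, 0.2, 0.5],
)
```

In this toy input the first observation has a confident direction but a low move probability, so the move filter removes it.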
To avoid lookahead bias, and because the data granularity is ~1 s (many trades could not realistically be executed within that time frame):
- Signals are shifted forward by one time step before applying returns
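A minimal sketch of the one-step shift, assuming `ret[t]` is the mid-price return from `t` to `t+1` and that a signal computed at `t` can only be acted on one step later. The arrays are toy values.

```python
import numpy as np

signal = np.array([1, 1, 0, -1, -1])
ret = np.array([0.01, -0.02, 0.03, 0.01, -0.01])

# Position held during step t is the signal from step t-1; flat at the start.
position = np.zeros_like(signal)
position[1:] = signal[:-1]

pnl = position * ret
```

Multiplying `signal * ret` directly would assume execution at the same instant the signal is computed, which is the optimistic case the shift is meant to rule out.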
Strategy performance is evaluated using:
- Sharpe ratio
- Total PnL
- Maximum drawdown
- Hit rate
- Information Coefficient (IC)
Transaction costs are approximated using bid–ask spread.
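The sketch below shows one way some of these metrics could be computed from a position and return series. The cost model (half the relative spread per unit of position change) and the per-step, non-annualised Sharpe ratio are assumptions; IC is omitted because it needs the raw predictions rather than positions.

```python
import numpy as np

def evaluate(position, ret, rel_spread):
    position = np.asarray(position, dtype=float)
    ret = np.asarray(ret, dtype=float)
    rel_spread = np.asarray(rel_spread, dtype=float)

    # Assumed cost model: pay half the relative spread per unit traded.
    turnover = np.abs(np.diff(position, prepend=0.0))
    pnl = position * ret - turnover * 0.5 * rel_spread

    # Drawdown is measured on cumulative PnL, starting from zero.
    equity = np.concatenate(([0.0], np.cumsum(pnl)))
    drawdown = np.maximum.accumulate(equity) - equity

    traded = position != 0
    sd = pnl.std()
    return {
        "total_pnl": pnl.sum(),
        "sharpe": pnl.mean() / sd if sd > 0 else 0.0,
        "max_drawdown": drawdown.max(),
        "hit_rate": (pnl[traded] > 0).mean() if traded.any() else 0.0,
    }

stats = evaluate(
    position=[0, 1, 1, 0, -1],
    ret=[0.01, -0.02, 0.03, 0.01, -0.01],
    rel_spread=[0.002] * 5,
)
```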
Move Model
- Driven by:
  - volatility
  - liquidity
  - trade intensity
- Captures when the market is active
Sign Model
- Still contains signal:
  - order book imbalance
  - microprice
  - trade flow
- Captures directional pressure
Combined Signal
- Improves trading performance vs individual models
- Acts as:
  - move: trade filter
  - sign: direction selector
- Reduces low-quality trades
Move Model (quantile thresholding)
- Optimising for Sharpe per trade pushes the move threshold to low quantiles
- This means:
  - the model assigns similar probabilities across observations
  - filtering aggressively removes too many trades without improving quality
- Interpretation:
  - move model is better at detecting general activity regimes than ranking high-conviction events
Sign Model (probability thresholding)
- Optimising for Sharpe per trade pushes thresholds towards higher probabilities
- This means:
  - only high-confidence directional predictions are useful
  - weak predictions add noise and reduce performance per trade
- Interpretation:
  - directional signal is sparse but stronger when present
  - requires selective trading
At a 1-second horizon, mid-price returns exhibit near-zero autocorrelation, consistent with martingale-like behaviour.
- Past returns contain little predictive information about future returns
- Using lagged returns as signals does not generate excess returns
- Any predictive power must therefore come from:
  - order book state
  - trade flow
  - microstructure features
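The autocorrelation check can be sketched as below. The series here is a synthetic random walk, used only to show the computation; it is not the project's data and says nothing about the actual BTCUSDT result.

```python
import numpy as np

def lag_autocorr(x, lag=1):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Synthetic mid-price: a geometric random walk with independent increments.
rng = np.random.default_rng(0)
mid = 100.0 * np.exp(np.cumsum(rng.normal(scale=1e-4, size=5000)))

log_ret = np.diff(np.log(mid))
rho1 = lag_autocorr(log_ret, lag=1)
```

For a series with independent increments, `rho1` should sit within a few multiples of `1 / sqrt(n)` of zero, which gives a rough yardstick for judging whether a measured autocorrelation is distinguishable from noise.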
Markets are non-stationary, with changing volatility, liquidity, and microstructure dynamics. To model this explicitly, we fit a Gaussian Hidden Markov Model (HMM) on slow-moving structural features:
- volatility (vol_5, vol_20)
- bid–ask spread
- order book imbalance
- trade flow imbalance
The HMM infers regime probabilities at each timestamp, representing latent market conditions.
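To show what regime probabilities at each timestamp can mean in practice, the sketch below runs the HMM forward (filtering) recursion for a one-dimensional Gaussian HMM with known parameters. It is a simplification: the project fits a multi-feature HMM, whereas here the parameters are hand-picked and hypothetical, and fitting would normally be delegated to a library.

```python
import numpy as np

def forward_filter(obs, means, stds, trans, init):
    """Filtered P(state_t = k | obs up to t) for a 1-D Gaussian HMM."""
    obs = np.asarray(obs, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    trans = np.asarray(trans, dtype=float)  # trans[i, j] = P(j at t | i at t-1)
    init = np.asarray(init, dtype=float)

    # Gaussian emission densities, shape (T, K).
    z = (obs[:, None] - means[None, :]) / stds[None, :]
    emis = np.exp(-0.5 * z**2) / (stds[None, :] * np.sqrt(2.0 * np.pi))

    probs = np.zeros_like(emis)
    alpha = init * emis[0]
    probs[0] = alpha / alpha.sum()
    for t in range(1, len(obs)):
        # Predict with the transition matrix, then weight by the emission.
        alpha = (probs[t - 1] @ trans) * emis[t]
        probs[t] = alpha / alpha.sum()  # normalise each step to avoid underflow
    return probs

# Hypothetical two-regime setup on a volatility-like feature.
regime_probs = forward_filter(
    obs=[0.1, 0.12, 0.09, 0.9, 1.1, 1.0],
    means=[0.1, 1.0],
    stds=[0.05, 0.2],
    trans=[[0.95, 0.05], [0.05, 0.95]],
    init=[0.5, 0.5],
)
```

Filtered probabilities use only observations up to the current timestamp, which matters when the regime estimate feeds a trading decision. Smoothed probabilities condition on the whole sample and would introduce lookahead.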
The HMM is used as a filtering layer on top of the alpha model.
- Select a regime of interest (e.g. low-volatility regime)
- Apply a probability threshold: P(regime = k) ≥ θ
- Only allow trades when this condition is satisfied
This gates the strategy, restricting trading to favourable market conditions.
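A minimal sketch of the gating step, assuming the regime probabilities are available as an array with one column per regime. The function name and the threshold value are illustrative.

```python
import numpy as np

def gate_signal(signal, regime_probs, regime=0, theta=0.7):
    """Zero out the signal wherever P(regime) falls below theta."""
    signal = np.asarray(signal)
    regime_probs = np.asarray(regime_probs, dtype=float)
    allowed = regime_probs[:, regime] >= theta
    return np.where(allowed, signal, 0)

gated = gate_signal(
    signal=[1, -1, 1, -1],
    regime_probs=[[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.75, 0.25]],
    regime=0,
    theta=0.7,
)
```

The alpha model's output is left untouched; the gate only decides whether a signal is allowed through.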
The strategy is optimised in two stages for Sharpe ratio:
- Alpha optimisation (within regime)
  - Fix regime k
  - Optimise:
    - move threshold (quantile)
    - sign threshold
  - Store optimal parameters
- Regime threshold optimisation
  - Fix alpha thresholds
  - Vary regime probability threshold θ
  - Select value that maximises Sharpe
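The two-stage procedure can be sketched as a pair of grid searches. Everything here is illustrative: the inputs are random numbers, the grids are arbitrary, and holding the regime threshold at a fixed `base_theta` during stage 1 is an assumption about how "within regime" is implemented.

```python
import itertools
import numpy as np

def sharpe(pnl):
    sd = pnl.std()
    return pnl.mean() / sd if sd > 0 else -np.inf

def strategy_pnl(p_move, p_up, p_regime, ret, move_q, sign_thr, theta):
    # Trade only when the move filter and the regime gate both pass.
    active = (p_move >= np.quantile(p_move, move_q)) & (p_regime >= theta)
    direction = np.where(p_up >= sign_thr, 1, np.where(p_up <= 1 - sign_thr, -1, 0))
    signal = direction * active
    position = np.concatenate(([0], signal[:-1]))  # one-step execution delay
    return position * ret

def two_stage_search(p_move, p_up, p_regime, ret,
                     move_qs, sign_thrs, thetas, base_theta=0.5):
    # Stage 1: search alpha thresholds with the regime threshold held fixed.
    best_alpha = max(
        itertools.product(move_qs, sign_thrs),
        key=lambda a: sharpe(
            strategy_pnl(p_move, p_up, p_regime, ret, a[0], a[1], base_theta)
        ),
    )
    # Stage 2: search the regime threshold with alpha thresholds fixed.
    best_theta = max(
        thetas,
        key=lambda th: sharpe(
            strategy_pnl(p_move, p_up, p_regime, ret, *best_alpha, th)
        ),
    )
    return best_alpha, best_theta

# Random placeholder inputs: P(move), P(up | move), P(regime k), returns.
rng = np.random.default_rng(0)
n = 2000
p_move, p_up, p_regime = rng.uniform(size=(3, n))
ret = rng.normal(scale=1e-4, size=n)

best_alpha, best_theta = two_stage_search(
    p_move, p_up, p_regime, ret,
    move_qs=[0.0, 0.5, 0.8], sign_thrs=[0.55, 0.7], thetas=[0.3, 0.5, 0.7],
)
```

Searching the two groups of parameters in sequence keeps the grid small compared with a joint search. Thresholds selected this way are in-sample, so they would need to be chosen on training folds and checked on held-out data.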
- Regimes correspond primarily to volatility and liquidity states
- Alpha performance is regime-dependent
- Filtering trades using regime probabilities:
  - reduces low-quality trades
  - improves Sharpe ratio and other metrics
The HMM captures persistent market structure that is not directly observable from raw features. Using it as a gating mechanism improves performance without modifying the underlying alpha model. It is also a big improvement in reproducibility across other datasets compared with ad hoc time-series splits with rolling training.