A reinforcement learning-based portfolio optimization system using Stable-Baselines3, designed to showcase modern RL techniques applied to quantitative finance.
This project implements a multi-asset portfolio allocation agent using deep reinforcement learning. The agent learns to dynamically rebalance a portfolio to maximize risk-adjusted returns while accounting for transaction costs and market dynamics.
- Modular Architecture: Extensible design with registry patterns for features, rewards, and baseline strategies
- Multiple RL Algorithms: Support for PPO, SAC, and A2C from Stable-Baselines3
- Rich Feature Engineering: Technical indicators (RSI, MACD, Bollinger Bands, etc.) using pandas-ta
- Flexible Reward Functions: Sharpe ratio, Sortino ratio, risk-adjusted returns, and more
- Comprehensive Backtesting: Compare RL agents against traditional baselines (equal weight, momentum, min variance, etc.)
- Walk-Forward Evaluation: Expanding-window walk-forward harness (library + CLI) for out-of-sample assessment across regimes
- Professional Evaluation: Complete metrics suite and visualization tools
# Clone the repository
git clone <repository-url>
cd rlportfolio
# Create conda environment
conda env create -f environment.yml
# Activate environment
conda activate rlportfolio

# Clone the repository
git clone <repository-url>
cd rlportfolio
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Why conda? Better dependency resolution for scientific computing packages (numpy, scipy, matplotlib) and easier management of platform-specific binaries.
Market data flows through finbase
(PyPI), a sibling project that
manages a SQLite store at ~/.finbase/timeseries.db. The Python package
is installed from PyPI by pip install -r requirements.txt (or the
conda env), but it ships only the read API — not the data, and not the
download / setup script (scripts/setup_database.py lives in finbase's
repo, not in the wheel). To populate the database from scratch:
# One-time: clone finbase and run its setup script
git clone https://github.com/shoom1/finbase.git
cd finbase
python scripts/setup_database.py --init # creates ~/.finbase/timeseries.db
python scripts/setup_database.py --update-all-indices # SP500, DOW30, NASDAQ-100, FTSE100, DAX
python scripts/setup_database.py --load-index-data SP500 \
--index-start-date 2005-01-01 # OHLCV history via YFinance

After that, this repo's data.fetcher.DataFetcher (a thin wrapper over
finbase.DataClient) will read from the populated store automatically.
See the finbase quick-start
for more options.
You can skip data setup entirely if you only want to explore the
analysis. This repo ships the canonical walk-forward output at
results/walk_forward_tech5.csv and an executed notebook at
notebooks/walk_forward_analysis.ipynb; both work with no finbase
access, no SB3, and no training.
Python version note. finbase requires Python ≥ 3.12. rlportfolio core supports 3.8+, but regenerating the walk-forward CSV requires the 3.12 finbase install.
To work on the code and run the test suite, install the project in editable mode:
pip install -e .

This registers the data, environment, evaluation, experiments, and training packages on sys.path so imports resolve without needing a working directory hack. Run the tests with:
pytest tests/

# Train with default configuration (PPO on 5 tech stocks)
python training/train.py
# Train with custom configuration
python training/train.py --config configs/sac_config.yaml
# Resume from checkpoint
python training/train.py --resume training/models/portfolio_agent_50000_steps.zip

# Evaluate a trained model
python training/train.py --eval training/models/best/best_model.zip

from data.fetcher import DataFetcher
from data.features import FeatureEngineer
from environment import PortfolioEnv
from evaluation import Backtester, plot_strategy_comparison
from stable_baselines3 import PPO
# Prepare data
fetcher = DataFetcher()
engineer = FeatureEngineer()
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'NVDA']
data = fetcher.get_latest_data(tickers, days=500)
features = engineer.compute_features(data)
env_data = engineer.prepare_for_environment(features)
# Create environment
feature_cols = engineer.create_observation_columns()
env = PortfolioEnv(
    data=env_data,
    feature_columns=feature_cols,
    tickers=tickers
)
# Load trained agent
agent = PPO.load('training/models/best/best_model.zip')
# Run backtests
backtester = Backtester()
backtester.run_agent(agent, env, name='PPO')
backtester.run_baseline(env, 'equal_weight')
backtester.run_baseline(env, 'momentum_20')
backtester.run_baseline(env, 'min_variance_60')
# Compare results
backtester.print_comparison()
# Visualize
histories = backtester.get_histories()
plot_strategy_comparison(histories, save_path='results/comparison.png')

For a rigorous out-of-sample (OOS) assessment across market regimes, the walk-forward harness trains a fresh agent on each fold's expanding window and backtests it on the next disjoint slice:
# Quarterly walk-forward from 2005 to today, ~70 folds
conda run -n rlportfolio python -m evaluation.walk_forward \
--config configs/opt_c_div19.yaml \
--t-min-days 756 --stride-days 63 \
--seeds 42 43 44 \
--output results/walk_forward.csv

This produces a per-fold/per-seed CSV (agent + baselines on each disjoint test window), records failed runs for investigation, and prints aggregate run-level and fold-mean Sharpe / hit-rate summaries. The last 6 months of each fold's train window are reserved for model selection via EvalCallback, so the test window is never seen during training. The programmatic API is evaluation.walk_forward.WalkForwardEvaluator; see examples/walk_forward.py for a minimal driver.
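The fold geometry described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in evaluation/walk_forward.py: the `Fold` class and `expanding_folds` function are hypothetical names, positions are row indices into a trading-day calendar, and 126 trading days stands in for "the last 6 months".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fold:
    train_end: int  # exclusive end of the fit slice (start of the selection slice)
    val_end: int    # exclusive end of the selection slice (start of the test slice)
    test_end: int   # exclusive end of the test slice


def expanding_folds(n_days: int, t_min_days: int = 756,
                    stride_days: int = 63, val_days: int = 126) -> List[Fold]:
    """Enumerate expanding-window folds over row positions 0..n_days-1.

    Every fold trains from row 0; val_days=126 is an assumed stand-in
    for the 6-month selection slice.
    """
    folds = []
    test_start = t_min_days
    while test_start + stride_days <= n_days:
        folds.append(Fold(train_end=test_start - val_days,
                          val_end=test_start,
                          test_end=test_start + stride_days))
        test_start += stride_days  # next test window starts where this one ends
    return folds


folds = expanding_folds(n_days=1008)
for fold in folds:
    print(fold)
```

Because each test window starts exactly where the previous one ends, the test slices are disjoint, while the train slice grows by one stride per fold.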
rlportfolio/
├── data/ # Data fetching and feature engineering
│ ├── fetcher.py # Thin adapter around finbase.DataClient
│ └── features.py # Technical indicators with registry pattern
├── environment/ # Custom Gymnasium environment
│ ├── portfolio_env.py # Multi-asset portfolio env (precomputed feature cube)
│ ├── rewards.py # Reward function implementations
│ ├── transaction_costs.py # Pluggable cost/slippage models
│ └── constants.py # Shared numeric constants
├── training/ # Training infrastructure
│ ├── train.py # Training script + PortfolioTrainer
│ ├── config.py # Typed TrainingConfig dataclasses
│ └── models/ # Saved models and checkpoints
├── evaluation/ # Backtesting and analysis
│ ├── metrics.py # Performance metrics (Sharpe, Sortino, etc.)
│ ├── backtest.py # Stateful Backtester facade
│ ├── backtest_strategies.py # Sequential / walk-forward / Monte Carlo execution
│ ├── baselines.py # Baseline strategy implementations
│ ├── walk_forward.py # Walk-forward training + OOS harness (library + CLI)
│ ├── visualization.py # Plotting functions
│ └── visualize_network.py # NN architecture visualization
├── experiments/ # Experiment tracking (W&B, MLflow, SQLite)
├── configs/ # YAML configuration files (see directory for full list)
├── examples/ # Thin demo scripts (walk_forward, seed_variance, ...)
└── tests/ # Unit tests (320 passing)
- DataFetcher: Thin adapter over finbase.DataClient; see github.com/shoom1/finbase (also on PyPI). It reads from a shared SQLite database at ~/.finbase/timeseries.db. Data is populated and refreshed by the finbase project; this repo only reads it. See the Installation section for the one-time DB-population steps.
- FeatureEngineer: Computes technical indicators using a registry pattern for extensibility
- Features are normalized and prepared for the RL environment
- State Space: Market features (prices, indicators) + current portfolio state (weights, cash)
- Action Space: Continuous portfolio weights (normalized via softmax)
- Reward: Configurable (Sharpe ratio, returns, risk-adjusted, etc.)
- Transaction Costs: Proportional costs are applied during rebalancing.
  Slippage defaults to zero for backward compatibility; custom cost models can
  use fixed, volume-based, or spread-based slippage, with
  volume, bid_ask_spread, or spread columns passed into trade records when present.
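The action normalization and the proportional cost from the list above can be sketched as follows. This is a minimal illustration, not the logic in portfolio_env.py or transaction_costs.py: the function names are hypothetical, cash is assumed to be the last action dimension, and the cost is assumed to be charged on the traded notional of the risky assets only.

```python
import numpy as np


def softmax_weights(action: np.ndarray) -> np.ndarray:
    """Map raw logits (n_assets + 1, cash last) to long-only weights summing to 1."""
    shifted = action - np.max(action)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def proportional_cost(old_w: np.ndarray, new_w: np.ndarray,
                      portfolio_value: float, rate: float = 0.001) -> float:
    """Cost on traded notional in the risky assets (cash leg excluded, by assumption)."""
    turnover = np.abs(new_w[:-1] - old_w[:-1]).sum()
    return float(rate * turnover * portfolio_value)


old = np.array([0.0, 0.0, 1.0])                   # start fully in cash
new = softmax_weights(np.array([1.0, 1.0, 1.0]))  # equal logits give equal weights
cost = proportional_cost(old, new, portfolio_value=10_000.0)
print(new, cost)
```

Softmax output is strictly positive and sums to one, which is why the agent can never short or lever up under this parameterization.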
- Uses Stable-Baselines3 for RL algorithms
- Supports PPO (default), SAC, and A2C
- Configuration via YAML files
- Tensorboard logging and model checkpointing
- Evaluation callback for validation
- Disjoint train/val: PortfolioTrainer.prepare_data fetches a single combined window of train_days + val_days and slices on date, so val is strictly after train (no in-sample leakage into the eval callback).
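The date-based split in the last item can be illustrated with plain pandas. This is a sketch, not PortfolioTrainer.prepare_data itself: `split_train_val` is a hypothetical helper, and `val_days` here counts trading dates in the index, which may differ from how the repo interprets its day counts.

```python
import numpy as np
import pandas as pd


def split_train_val(df: pd.DataFrame, val_days: int):
    """Split a date-indexed frame so every validation date is after every train date."""
    dates = df.index.unique().sort_values()
    cutoff = dates[-val_days]       # first validation date
    train = df[df.index < cutoff]
    val = df[df.index >= cutoff]
    return train, val


idx = pd.bdate_range("2024-01-01", periods=10)
frame = pd.DataFrame({"Close": np.arange(10.0)}, index=idx)
train, val = split_train_val(frame, val_days=3)
print(train.index.max(), val.index.min())
```

Slicing one combined window on a single cutoff date guarantees the two sets cannot overlap, which two independent fetches would not.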
- Metrics: Total return, Sharpe ratio, Sortino ratio, max drawdown, volatility, win rate, etc.
- Baselines: Equal weight, buy-and-hold, momentum, minimum variance, inverse volatility
- Walk-Forward: Expanding-window protocol that retrains per fold; last 6 months of each fold's train reserved for model selection; disjoint quarterly test windows. Library (evaluation.walk_forward.WalkForwardEvaluator) and CLI (python -m evaluation.walk_forward).
- Visualization: Performance comparison, drawdown, weights evolution, risk-return scatter
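For readers unfamiliar with the metrics named above, here is a minimal sketch of three of them. It is not the code in evaluation/metrics.py: the function names, the 252-day annualization factor, and the sample-standard-deviation convention are all assumptions.

```python
import numpy as np

TRADING_DAYS = 252  # assumed annualization factor for daily returns


def sharpe_ratio(returns: np.ndarray, risk_free: float = 0.0) -> float:
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = returns - risk_free / TRADING_DAYS
    std = excess.std(ddof=1)
    return 0.0 if std == 0 else float(np.sqrt(TRADING_DAYS) * excess.mean() / std)


def sortino_ratio(returns: np.ndarray, risk_free: float = 0.0) -> float:
    """Like Sharpe, but only below-target returns count toward risk."""
    excess = returns - risk_free / TRADING_DAYS
    downside = np.sqrt(np.mean(np.minimum(excess, 0.0) ** 2))
    return 0.0 if downside == 0 else float(np.sqrt(TRADING_DAYS) * excess.mean() / downside)


def max_drawdown(values: np.ndarray) -> float:
    """Largest peak-to-trough loss of a value series, as a negative fraction."""
    running_peak = np.maximum.accumulate(values)
    return float(np.min(values / running_peak - 1.0))


values = np.array([100.0, 110.0, 99.0, 120.0])
returns = values[1:] / values[:-1] - 1.0
print(sharpe_ratio(returns), sortino_ratio(returns), max_drawdown(values))
```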
Configurations are stored in configs/ as YAML files. Key parameters:
data:
  tickers: [AAPL, MSFT, GOOGL, AMZN, NVDA]
  universe:
    mode: static_current
    survivorship_bias: known
  train_days: 730
environment:
  initial_balance: 10000.0
  transaction_cost: 0.001
  reward_function: sharpe
agent:
  algorithm: PPO
  learning_rate: 0.0003
  policy_kwargs:
    net_arch: [256, 256, 128]
training:
  total_timesteps: 100000

data.universe.mode: static_current is the only supported universe policy
today. It reuses the configured ticker list across every period and marks
outputs with survivorship_bias: known; it does not reconstruct historical
index membership. point_in_time_index is reserved for future support and
fails validation instead of silently behaving like a static universe.
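The "fail loudly instead of silently falling back" behaviour described above can be sketched with a small dataclass. This is an illustration only: `UniverseConfig` and its error types are hypothetical and may not match the typed TrainingConfig dataclasses in training/config.py.

```python
from dataclasses import dataclass

SUPPORTED_MODES = {"static_current"}
RESERVED_MODES = {"point_in_time_index"}  # named in the config schema, not implemented


@dataclass(frozen=True)
class UniverseConfig:
    mode: str = "static_current"
    survivorship_bias: str = "known"

    def __post_init__(self):
        # Reject reserved modes explicitly so they never behave like a static universe.
        if self.mode in RESERVED_MODES:
            raise NotImplementedError(
                f"universe mode {self.mode!r} is reserved but not implemented")
        if self.mode not in SUPPORTED_MODES:
            raise ValueError(f"unknown universe mode {self.mode!r}")


cfg = UniverseConfig()
print(cfg)
```

Validating at construction time means a misconfigured run stops before any data is fetched or any training budget is spent.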
from data.features import Feature, FeatureEngineer

class MyCustomFeature(Feature):
    def __init__(self):
        super().__init__('my_feature')

    def compute(self, df):
        # 10-bar simple moving average of the close
        df['my_indicator'] = df['Close'].rolling(10).mean()
        return df

    def get_column_names(self):
        return ['my_indicator']

# Use it
engineer = FeatureEngineer(custom_features=[MyCustomFeature()])

from environment.rewards import RewardFunction
class MyReward(RewardFunction):
    def compute(self, portfolio_return, portfolio_value, previous_value, **kwargs):
        # Your custom logic
        return portfolio_return * 2  # Example

import numpy as np
from environment.constants import CASH_SOFTMAX_BIAS
from evaluation.backtest import Backtester
from evaluation.baselines import BaselineStrategy
class MyStrategy(BaselineStrategy):
    def __init__(self):
        super().__init__('my_strategy')

    def get_action(self, env, step, **kwargs):
        action = np.ones(env.n_assets + 1)
        action[-1] = CASH_SOFTMAX_BIAS
        return action

# Register it
backtester = Backtester()
backtester.baseline_registry.register(MyStrategy())

Market data is sourced through finbase.DataClient, a sibling
project at github.com/shoom1/finbase
(also on PyPI as finbase). It
maintains a SQLite store at ~/.finbase/timeseries.db, populated from
YFinance with full point-in-time index-constituent tracking for SP500,
DOW30, NASDAQ-100, FTSE 100, and DAX. This repo is a read-only consumer
of that database; see the Installation section
for the one-time DB-population steps.
See data/fetcher.py for the thin adapter layer.
Numbers below are reproducible from the configs and seeds in this repo (the exact command is in the Reproduce block further down). See the Limitations section for the assumptions behind them.
Walk-forward, expanding-window protocol. 73 folds × 3 seeds = 219/219 successful runs (0 failed). Quarterly stride, 3-year minimum train, 6-month in-sample selection slice. Test windows 2008-01-04 → 2026-04-16.
| Strategy | Mean Sharpe | Median Sharpe | Sharpe σ | Mean total return | Mean max DD | Hit rate vs agent |
|---|---|---|---|---|---|---|
| RL agent (PPO) | +1.273 | +1.360 | 1.963 | +5.09% | -9.26% | — |
| buy_and_hold | +1.344 | +1.509 | 1.976 | +6.32% | — | 40% |
| equal_weight | +1.346 | +1.346 | 1.969 | +6.27% | — | 40% |
| S&P 500 (^GSPC, buy & hold) | +1.039 | +1.185 | 1.788 | +2.32% | — | 54% |
Hit rate = fraction of (fold, seed) runs where the agent's Sharpe strictly beat the baseline's on the same disjoint test window.
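The hit-rate definition above amounts to a strict element-wise comparison over paired runs. A minimal sketch follows; `hit_rate` is a hypothetical helper and the numbers are made up for illustration, not taken from the results CSV.

```python
import numpy as np


def hit_rate(agent_sharpe, baseline_sharpe) -> float:
    """Fraction of paired runs where the agent's Sharpe strictly exceeds the baseline's."""
    agent = np.asarray(agent_sharpe, dtype=float)
    base = np.asarray(baseline_sharpe, dtype=float)
    if agent.shape != base.shape:
        raise ValueError("agent and baseline must be paired run-for-run")
    return float(np.mean(agent > base))  # ties count as misses


# Made-up Sharpe values for five (fold, seed) runs.
agent = [1.2, 0.5, -0.3, 2.0, 0.9]
baseline = [1.0, 0.5, 0.1, 1.5, 1.1]
rate = hit_rate(agent, baseline)
print(rate)
```

Because the comparison is strict, an exact tie (the second run here) does not count as a win for the agent.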
The figures below are committed to the repo so they render directly on
GitHub. For an interactive version that runs against the same
committed CSV (no finbase / SB3 / training required), open
notebooks/walk_forward_analysis.ipynb
— it ships with executed outputs and adds extra cells (seed dispersion,
sortable per-fold table) that don't fit cleanly into static PNGs.
The PNGs are generated by scripts/plot_walk_forward.py from the same
CSV; the notebook is generated by scripts/build_notebook.py and
re-executed with jupyter nbconvert --execute --inplace.
Concatenated quarterly returns — per-fold quarterly returns chained end-to-end. The PPO agent (blue) underperforms the in-universe baselines (green/orange) from ~2017 onward and outperforms the S&P 500 (grey) across the sample.
Per-window Sharpe distribution — boxplot + jittered points across all 219 (fold, seed) runs. The four distributions overlap; the agent's median Sharpe is about 0.15 below the in-universe baselines and ~0.15 above the S&P 500.
Rolling 4-fold mean Sharpe — the agent (blue) and in-universe baselines (green/orange) track closely across the sample. The S&P 500 (grey) sits below them on average.
Rolling 8-fold hit rate — fraction of recent runs where the agent's Sharpe exceeds each baseline. The 50% line is parity. Vs the S&P 500 the agent is mostly above parity; vs the in-universe baselines the agent is below parity through 2014–2022 and near parity elsewhere.
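The three transformations behind these figures (chaining per-fold returns, a rolling mean Sharpe, and a rolling hit rate) can be sketched with pandas. The column names and values below are invented for illustration and are not the schema or contents of results/walk_forward_tech5.csv; a 4-fold window is used for both rolling series only because the toy frame has six rows.

```python
import pandas as pd

# Made-up per-fold values purely for illustration; not results from this repo.
folds = pd.DataFrame({
    "fold": range(6),
    "total_return": [0.05, -0.02, 0.03, 0.04, -0.01, 0.02],
    "agent_sharpe": [1.5, -0.4, 1.1, 1.8, -0.2, 0.9],
    "baseline_sharpe": [1.2, -0.1, 1.3, 1.6, 0.1, 0.7],
}).set_index("fold")

# Chain per-fold returns end-to-end into one cumulative growth curve.
growth = (1.0 + folds["total_return"]).cumprod()

# Rolling mean Sharpe over the most recent 4 folds (NaN until 4 folds exist).
rolling_sharpe = folds["agent_sharpe"].rolling(4).mean()

# Rolling hit rate: share of recent folds where the agent strictly beat the baseline.
wins = (folds["agent_sharpe"] > folds["baseline_sharpe"]).astype(float)
rolling_hit = wins.rolling(4).mean()

print(pd.DataFrame({"growth": growth, "sharpe_4": rolling_sharpe, "hit_4": rolling_hit}))
```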
Reading the table.
- Vs in-universe baselines (buy_and_hold, equal_weight). Mean Sharpe trails by ~0.07, mean total return trails by ~120 bps per quarter, hit rate is 40% of (fold, seed) runs.
- Vs the S&P 500 (^GSPC buy-and-hold, net of one initial transaction cost). Mean Sharpe leads by ~+0.23, hit rate is 54%. The in-universe baselines also lead the S&P 500 (mean Sharpe ~+0.30), so the gap vs the broad market is attributable to universe composition rather than the policy.
Reproduce:
conda run -n rlportfolio python -m evaluation.walk_forward \
--config configs/opt_c_tech5.yaml \
--t-min-days 756 --stride-days 63 \
--seeds 42 43 44 \
--output results/walk_forward_tech5.csv
conda run -n rlportfolio python scripts/results_table.py \
results/walk_forward_tech5.csv \
--update-readme README.md --marker WF_TECH5_TABLE

Each (fold, seed) trains a fresh PPO agent on the expanding window of all
data prior to the test slice, with the last 6 months of that train window
held out for in-sample model selection (EvalCallback picks the best
checkpoint). The selected model is then backtested on the disjoint
quarterly test window. The default WalkForwardConfig.baselines runs
buy_and_hold and equal_weight on the same test window for direct
comparison; richer baselines (momentum, min_variance,
inverse_vol) are available via --baselines. Per-fold metrics land
in results/walk_forward_tech5.csv; the table above is generated by
scripts/results_table.py.
The protocol is quarterly: a 3-year minimum train window, a 1-quarter stride
with disjoint test windows, and a 6-month in-sample selection slice. See
evaluation/walk_forward.py for all knobs. Each CSV row records
universe_mode, survivorship_bias, n_assets, and tickers_hash so
downstream analysis keeps the universe assumption attached to the result.
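One way to produce a stable universe fingerprint like the tickers_hash column is to hash a canonical form of the ticker list. The scheme below (uppercase, sort, SHA-256, truncate to 12 hex characters) is an assumption for illustration; the actual hashing used by the harness is not documented here and may differ.

```python
import hashlib


def tickers_hash(tickers, length: int = 12) -> str:
    """Order- and case-insensitive fingerprint of a ticker universe (hypothetical scheme)."""
    canonical = ",".join(sorted(t.upper() for t in tickers))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


tech5 = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA"]
print(tickers_hash(tech5))
```

Canonicalizing before hashing means two runs over the same universe get the same fingerprint regardless of the order in which tickers appear in the config.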
The harness addresses some sources of bias and noise but not all. The items below describe what is and is not handled.
Not addressed by this code:
- Survivorship bias. data.universe.mode: static_current reuses today's ticker list across every historical fold. Any "AAPL was already in the universe in 2015" backtest implicitly excludes the names that were in the index in 2015 but later got delisted, acquired, or removed (Lehman, Sears, GE-pre-spinoffs, …). Index-style claims should be treated as survivorship-biased unless you pipe in a real point-in-time membership source. point_in_time_index mode is reserved but not implemented.
- Non-stationarity. Equity dynamics are regime-dependent. Training on 2015-2021 (low-rate, post-GFC bull) is not the same problem as trading 2022-2024 (rate hikes, war, AI mania). Walk-forward exposes this by re-training each fold, but if your tickers, features, or hyperparameters were chosen by staring at the full sample first, you have already leaked future information into your model selection.
- Hyperparameter overfitting. Reported numbers come from the configs checked into configs/. They have not been searched against the walk-forward test set. However, if you tune anything against walk-forward output and re-report, that is also leakage.
- Backtest ≠ live trading. Order fills assume your trade gets the closing price, less a proportional transaction cost. There is no execution latency, no liquidity constraint, no borrow cost on shorts (and the env is long-only anyway), no minimum-tick rounding, and no overnight gap risk modeled separately from intraday. transaction_costs.py provides pluggable cost / slippage models; using anything richer than the default proportional cost is on the user.
- Normalization is point-in-time, not "fit on the sample". The current prepare_for_environment only applies causal transforms: pct_change(lookback_window) (purely backward-looking), rsi / 100.0 (constant scaling), atr / close (point-in-time). There is no global mean/std fit, so changing to a per-fold scaler would be a no-op on the current feature set. If you add z-score / min-max features later, the harness would need a per-fold .fit() then.
- Action-space inductive bias. Continuous weights via softmax with an explicit cash dimension means the agent can never short, never lever beyond 1.0×, and always maintains a valid simplex. That is a strong prior: useful for learning stability, but it rules out long/short and market-neutral strategies a priori.
- Reward-function dependence. Sharpe / Sortino / drawdown-penalised rewards each shape behaviour differently and the right choice is not obvious. None of them solves the deeper problem that test-set Sharpe is what you actually care about and you cannot use it as a training signal.
- Data source. Market data flows through finbase.DataClient, which reads adjusted close from a shared SQLite store. Adjustments (splits, dividends) are applied historically; if finbase corrects an old bar retroactively, your saved walk-forward CSVs will not match a fresh re-run. There is no guarantee of point-in-time-as-of integrity.
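The per-fold .fit() mentioned in the normalization item above would look roughly like this if z-score features were ever added. It is a sketch under that hypothetical: `FoldScaler` does not exist in this repo, and the current feature set does not need it.

```python
import numpy as np


class FoldScaler:
    """Z-score scaler whose statistics come from one fold's train slice only."""

    def fit(self, train: np.ndarray) -> "FoldScaler":
        self.mean_ = train.mean(axis=0)
        std = train.std(axis=0)
        self.std_ = np.where(std == 0, 1.0, std)  # avoid dividing by zero on flat columns
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        return (x - self.mean_) / self.std_


train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0], [5.0]])

scaler = FoldScaler().fit(train)       # statistics never see the test slice
test_scaled = scaler.transform(test)
print(test_scaled.ravel())
```

The point of fitting per fold is that the test slice is transformed with statistics it did not contribute to, so scaled test values can legitimately fall far outside the train range.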
Addressed by this code:
- Look-ahead in indicators. All rolling features use trailing (backward-looking) windows; prepare_for_environment applies per-ticker ffill (no cross-ticker pollution); the env zero-fills any remaining warm-up NaN.
- Train/val leakage. PortfolioTrainer.prepare_data fetches one combined window and slices by date so val is strictly after train.
- Multi-window evaluation. The walk_forward.py harness retrains per fold on disjoint test windows. The results table reports fold-mean Sharpe, hit rate vs baselines, and Sharpe σ rather than a single-window summary.
- Reproducibility. Every CSV row records universe_mode, survivorship_bias, n_assets, tickers_hash, seed, and fold geometry. The exact command that produced the table above is in the Results section.
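The look-ahead item above combines two ideas: filling gaps within a ticker only, and computing rolling features from past bars only. A toy pandas sketch follows; the long-format frame and the `sma_2` column are assumptions for illustration and do not mirror the layout used by prepare_for_environment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "Close": [1.0, np.nan, 3.0, np.nan, 20.0, 30.0],
})

# Forward-fill within each ticker only, so A's last price never leaks into B.
df["Close"] = df.groupby("ticker")["Close"].ffill()

# Trailing 2-bar mean: the value at row t uses rows t-1 and t of the same ticker.
df["sma_2"] = df.groupby("ticker")["Close"].transform(lambda s: s.rolling(2).mean())

print(df)
```

B's first bar stays NaN after the fill because there is no earlier B price to carry forward; a plain `df["Close"].ffill()` would have wrongly copied A's 3.0 into it.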
Hit rate definition. Hit rate is the fraction of (fold, seed) runs where the agent's Sharpe strictly exceeds the baseline's on the same disjoint test window. The CSV does not include statistical tests (bootstrap CIs, paired tests) on the per-fold Sharpe deltas — those would be needed to call any deviation from 50% a real edge.
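A percentile bootstrap on the paired Sharpe deltas is one of the simpler tests alluded to above. The sketch below uses made-up deltas and a hypothetical `bootstrap_mean_ci` helper; it also treats runs as independent, which is optimistic because seeds within a fold share the same test window.

```python
import numpy as np


def bootstrap_mean_ci(deltas, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap confidence interval for the mean of paired deltas."""
    deltas = np.asarray(deltas, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample runs with replacement, n_boot times, and record each resample's mean.
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Made-up (agent Sharpe - baseline Sharpe) values, one per run.
deltas = np.array([0.2, -0.1, 0.05, -0.3, 0.15, -0.05, 0.1, -0.2])
lo, hi = bootstrap_mean_ci(deltas)
print(f"mean delta {deltas.mean():+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

If the interval straddles zero, the data do not support calling the mean Sharpe gap an edge in either direction.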
- Python 3.8 - 3.12 (recommended: 3.11 or 3.12)
- stable-baselines3 >= 2.7.0
- gymnasium >= 1.0.0
- See requirements.txt or environment.yml for full dependencies
MIT License
- Built with Stable-Baselines3
- Technical indicators from pandas-ta
- Market data via finbase, a sibling SQLite-backed data client (also on PyPI)



