## Summary
Add directional bias and directional accuracy metrics to the metrics module. These metrics are particularly valuable for time series forecasting and financial modeling, where the direction of change is often more important than the absolute prediction accuracy.
## Motivation
While traditional regression metrics (MAE, MSE, RMSE) focus on prediction magnitude, they don't capture whether a model correctly predicts the direction of change. This is critical for:
- Financial forecasting: Predicting whether a stock will go up or down is often more important than the exact price
- Trend prediction: Understanding if a model captures directional trends correctly
- Model diagnostics: Identifying systematic over-prediction or under-prediction biases
- Risk assessment: Evaluating if errors are balanced or skewed in one direction
## Background

### Directional Accuracy
Measures the proportion of times a model correctly predicts the direction of change relative to a baseline. For time series, this is typically whether the value increased or decreased from the previous time point.
Formula:

```
DA = (1/n) * Σ I(sign(y_true - baseline) == sign(y_pred - baseline))
```

where I is the indicator function that equals 1 when the condition is true.
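As an illustration of the formula, a minimal NumPy sketch (just the raw computation on a toy series, not the proposed API) using the previous true value as the baseline:

```python
import numpy as np

# Toy series: compute DA with the previous observed value as the baseline.
y_true = np.array([100.0, 102.0, 98.0, 101.0])
y_pred = np.array([100.5, 103.0, 97.0, 102.0])

baseline = y_true[:-1]                     # previous observed value
true_dir = np.sign(y_true[1:] - baseline)  # actual direction of change
pred_dir = np.sign(y_pred[1:] - baseline)  # predicted direction of change
da = np.mean(true_dir == pred_dir)
print(da)  # all three predicted directions match -> 1.0
```

The first sample drops out because it has no previous value to compare against, matching the proposed time-series default.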
### Directional Bias
Measures systematic tendency to over-predict or under-predict. A positive bias indicates over-prediction, negative indicates under-prediction, and zero indicates no systematic bias.
Formula:

```
DB = (n_over - n_under) / n
```

where n_over is the number of over-predictions and n_under is the number of under-predictions.
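As an illustration of the formula, a minimal NumPy sketch of the raw computation (not the proposed API):

```python
import numpy as np

# Toy example: 4 over-predictions and 1 under-prediction.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.2, 1.9, 3.1, 4.2, 5.1])

errors = y_pred - y_true
n_over = np.sum(errors > 0)    # count of over-predictions
n_under = np.sum(errors < 0)   # count of under-predictions
db = (n_over - n_under) / errors.size
print(db)  # (4 - 1) / 5 = 0.6
```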
## Proposed API

### 1. Directional Accuracy Function
```python
def directional_accuracy_score(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    baseline: Optional[np.ndarray] = None,
    sample_weight: Optional[np.ndarray] = None,
    handle_equal: str = 'exclude',
) -> float:
    """Calculate directional accuracy of predictions.

    Measures the proportion of times the model correctly predicts the direction
    of change from a baseline (typically the previous value in a time series).

    :param y_true: array-like of shape (n_samples,). True target values.
    :param y_pred: array-like of shape (n_samples,). Predicted target values.
    :param baseline: array-like of shape (n_samples,), optional. Baseline values
        for comparison. If None, uses the previous true value (y_true[i-1]) for
        time series data.
    :param sample_weight: array-like of shape (n_samples,), default=None. Sample weights.
    :param handle_equal: {'exclude', 'correct', 'incorrect'}, default='exclude'.
        How to handle samples where y_true == baseline:
        - 'exclude': remove these samples from the calculation
        - 'correct': keep them; they count as correct only when y_pred == baseline
        - 'incorrect': keep them but always count them as incorrect
    :return: Directional accuracy score in the range [0, 1]. Higher is better:
        1.0 = perfect directional prediction, 0.5 = random, 0.0 = always wrong.
    :raises ValueError: If baseline is None and y_true has fewer than 2 samples.
    :raises ValueError: If handle_equal is not in {'exclude', 'correct', 'incorrect'}.
    :raises ValueError: If shapes don't match.

    Example:
        >>> # Time series example
        >>> y_true = np.array([100, 102, 98, 101, 99])
        >>> y_pred = np.array([101, 103, 97, 102, 98])
        >>> da = directional_accuracy_score(y_true, y_pred)  # Uses y_true[i-1] as baseline
        >>> print(f"Directional Accuracy: {da:.2%}")
        Directional Accuracy: 100.00%
        >>> # Custom baseline example
        >>> baseline = np.array([100, 100, 100, 100, 100])
        >>> da = directional_accuracy_score(y_true, y_pred, baseline=baseline)

    Notes:
        - For time series (baseline=None), the first sample is excluded because
          it has no prior value.
        - A score of 0.5 indicates performance no better than random guessing.
        - This metric is particularly useful for financial forecasting and
          trend prediction.
    """
```
### 2. Directional Bias Function
```python
def directional_bias_score(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    sample_weight: Optional[np.ndarray] = None,
    handle_equal: str = 'exclude',
) -> float:
    """Calculate directional bias of predictions.

    Measures the systematic tendency to over-predict or under-predict. Returns
    the proportion of over-predictions minus the proportion of under-predictions.

    :param y_true: array-like of shape (n_samples,). True target values.
    :param y_pred: array-like of shape (n_samples,). Predicted target values.
    :param sample_weight: array-like of shape (n_samples,), default=None. Sample weights.
    :param handle_equal: {'exclude', 'neutral'}, default='exclude'.
        How to handle samples where y_pred == y_true:
        - 'exclude': remove these samples from the calculation
        - 'neutral': include them, but count them as neither over nor under
    :return: Directional bias score in the range [-1, 1].
        - Positive values indicate a tendency to over-predict
        - Negative values indicate a tendency to under-predict
        - 0 indicates no systematic bias (balanced errors)
        - ±1 indicates complete bias in one direction
    :raises ValueError: If handle_equal is not in {'exclude', 'neutral'}.
    :raises ValueError: If shapes don't match.

    Example:
        >>> y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        >>> y_pred = np.array([1.2, 2.3, 3.1, 4.2, 5.1])  # Consistently over-predicting
        >>> bias = directional_bias_score(y_true, y_pred)
        >>> print(f"Directional Bias: {bias:.2f}")
        Directional Bias: 1.00
        >>> y_pred_balanced = np.array([0.9, 2.1, 2.9, 4.1, 5.0])  # Balanced errors
        >>> bias = directional_bias_score(y_true, y_pred_balanced)
        >>> print(f"Directional Bias: {bias:.2f}")
        Directional Bias: 0.00

    Notes:
        - This metric complements traditional error metrics by revealing systematic biases.
        - Useful for calibrating models and identifying consistent over/under-estimation.
        - A model with low RMSE but high bias may need recalibration.
    """
```
## Suggested Implementation

### Implementation: directional_accuracy_score
```python
import numpy as np
from typing import Optional


def directional_accuracy_score(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    baseline: Optional[np.ndarray] = None,
    sample_weight: Optional[np.ndarray] = None,
    handle_equal: str = 'exclude',
) -> float:
    """Calculate directional accuracy of predictions."""
    # Validate handle_equal parameter
    if handle_equal not in {'exclude', 'correct', 'incorrect'}:
        raise ValueError(
            f"handle_equal must be 'exclude', 'correct', or 'incorrect', "
            f"got '{handle_equal}'"
        )

    # Convert to 1-D numpy arrays
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    # Validate shapes
    if y_true.shape != y_pred.shape:
        raise ValueError(
            f"Shape mismatch: y_true {y_true.shape} and y_pred {y_pred.shape}"
        )

    # Handle baseline
    if baseline is None:
        # Use the previous true value as baseline (time series default)
        if len(y_true) < 2:
            raise ValueError(
                "y_true must have at least 2 samples when baseline is None "
                "(time series mode requires previous values)"
            )
        baseline = y_true[:-1]
        y_true = y_true[1:]
        y_pred = y_pred[1:]
        if sample_weight is not None:
            sample_weight = np.asarray(sample_weight, dtype=float)[1:]
    else:
        baseline = np.asarray(baseline, dtype=float).ravel()
        if baseline.shape != y_true.shape:
            raise ValueError(
                f"Shape mismatch: baseline {baseline.shape} and y_true {y_true.shape}"
            )

    # Calculate directions
    true_direction = np.sign(y_true - baseline)
    pred_direction = np.sign(y_pred - baseline)

    # Handle samples where the true value equals the baseline
    if handle_equal == 'exclude':
        # Drop samples where the true value equals the baseline
        mask = true_direction != 0
        true_direction = true_direction[mask]
        pred_direction = pred_direction[mask]
        if sample_weight is not None:
            sample_weight = np.asarray(sample_weight, dtype=float)[mask]
    elif handle_equal == 'correct':
        # Keep ties: the sign comparison below already counts a tie as
        # correct exactly when the prediction also equals the baseline
        pass
    elif handle_equal == 'incorrect':
        # Force ties to count as incorrect by giving the true direction a
        # sentinel value that no prediction sign (-1, 0, 1) can match
        true_direction = true_direction.copy()
        true_direction[true_direction == 0] = 2

    # Check that samples remain
    if len(true_direction) == 0:
        raise ValueError("No valid samples remain after filtering")

    # Calculate directional accuracy
    correct_direction = true_direction == pred_direction
    if sample_weight is not None:
        sample_weight = np.asarray(sample_weight, dtype=float)
        if len(sample_weight) != len(correct_direction):
            raise ValueError(
                f"Sample weight length {len(sample_weight)} doesn't match "
                f"number of samples {len(correct_direction)}"
            )
        # Normalize weights so the weighted accuracy stays in [0, 1]
        sample_weight = sample_weight / sample_weight.sum()
        accuracy = (correct_direction * sample_weight).sum()
    else:
        accuracy = correct_direction.mean()
    return float(accuracy)
```
### Implementation: directional_bias_score
```python
def directional_bias_score(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    sample_weight: Optional[np.ndarray] = None,
    handle_equal: str = 'exclude',
) -> float:
    """Calculate directional bias of predictions."""
    # Validate handle_equal parameter
    if handle_equal not in {'exclude', 'neutral'}:
        raise ValueError(
            f"handle_equal must be 'exclude' or 'neutral', got '{handle_equal}'"
        )

    # Convert to 1-D numpy arrays
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    # Validate shapes
    if y_true.shape != y_pred.shape:
        raise ValueError(
            f"Shape mismatch: y_true {y_true.shape} and y_pred {y_pred.shape}"
        )

    # Calculate prediction errors
    errors = y_pred - y_true

    # Handle equal cases
    if handle_equal == 'exclude':
        # Drop samples where the prediction equals the true value
        mask = errors != 0
        errors = errors[mask]
        if sample_weight is not None:
            sample_weight = np.asarray(sample_weight, dtype=float)[mask]
    # With 'neutral', errors of 0 contribute to neither the over nor the
    # under count, pulling the bias toward 0

    # Check that samples remain
    if len(errors) == 0:
        raise ValueError("No valid samples remain after filtering")

    # Calculate over- and under-prediction indicators
    over_predictions = errors > 0
    under_predictions = errors < 0
    if sample_weight is not None:
        sample_weight = np.asarray(sample_weight, dtype=float)
        if len(sample_weight) != len(errors):
            raise ValueError(
                f"Sample weight length {len(sample_weight)} doesn't match "
                f"number of samples {len(errors)}"
            )
        # Normalize weights so the bias stays in [-1, 1]
        sample_weight = sample_weight / sample_weight.sum()
        prop_over = (over_predictions * sample_weight).sum()
        prop_under = (under_predictions * sample_weight).sum()
    else:
        prop_over = over_predictions.mean()
        prop_under = under_predictions.mean()

    # Positive = over-prediction tendency, negative = under-prediction
    bias = prop_over - prop_under
    return float(bias)
```
## Testing Requirements

### Unit Tests for Directional Accuracy
```python
import numpy as np
import pytest


def test_directional_accuracy_perfect_prediction():
    """Test that perfect directional predictions give a score of 1.0."""
    y_true = np.array([100, 102, 98, 101, 99, 103])
    y_pred = np.array([100.5, 102.5, 97.5, 101.5, 98.5, 103.5])
    # All predicted changes have the same sign as the true changes
    da = directional_accuracy_score(y_true, y_pred)
    assert da == pytest.approx(1.0)


def test_directional_accuracy_random_prediction():
    """Test directional accuracy with 50% correct predictions."""
    y_true = np.array([100, 102, 98, 101, 99])
    baseline = np.array([100, 100, 100, 100, 100])
    # Sample 0 is a tie (y_true == baseline) and is excluded by default;
    # of the remaining four, two directions are correct and two are wrong
    y_pred = np.array([101, 103, 97, 99, 101])
    da = directional_accuracy_score(y_true, y_pred, baseline=baseline)
    assert da == pytest.approx(0.5)


def test_directional_accuracy_all_wrong():
    """Test that completely wrong directional predictions give a score of 0.0."""
    y_true = np.array([100, 102, 98, 101, 99])
    baseline = np.array([100, 100, 100, 100, 100])
    # Every non-tie prediction goes in the opposite direction
    y_pred = np.array([99, 97, 101, 98, 102])
    da = directional_accuracy_score(y_true, y_pred, baseline=baseline)
    assert da == pytest.approx(0.0)


def test_directional_accuracy_time_series_default():
    """Test directional accuracy with the default time series behavior."""
    y_true = np.array([100, 102, 98, 101, 99])
    # Uses the previous true value as the baseline automatically
    # True changes: +2, -4, +3, -2
    y_pred = np.array([100.5, 103, 97, 102, 98])
    # Predicted changes vs the previous true value: +3, -5, +4, -3
    # All four directions match
    da = directional_accuracy_score(y_true, y_pred)
    assert da == pytest.approx(1.0)


def test_directional_accuracy_insufficient_samples():
    """Test that too few samples raises ValueError."""
    y_true = np.array([100])
    y_pred = np.array([101])
    with pytest.raises(ValueError, match="at least 2 samples"):
        directional_accuracy_score(y_true, y_pred)


def test_directional_accuracy_shape_mismatch():
    """Test that a shape mismatch raises ValueError."""
    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.0, 2.0])
    with pytest.raises(ValueError, match="Shape mismatch"):
        directional_accuracy_score(y_true, y_pred)


def test_directional_accuracy_invalid_handle_equal():
    """Test that an invalid handle_equal raises ValueError."""
    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.0, 2.0, 3.0])
    baseline = np.array([1.0, 1.0, 1.0])
    with pytest.raises(ValueError, match="handle_equal must be"):
        directional_accuracy_score(y_true, y_pred, baseline=baseline, handle_equal='invalid')


@pytest.mark.parametrize("handle_equal", ['exclude', 'correct', 'incorrect'])
def test_directional_accuracy_handle_equal_modes(handle_equal):
    """Test the different handle_equal modes."""
    y_true = np.array([100, 100, 102, 100, 98])
    baseline = np.array([100, 100, 100, 100, 100])
    y_pred = np.array([100, 101, 103, 99, 97])
    # Ties (y_true == baseline) at positions 0, 1, and 3
    # Positions 2 and 4 are predicted in the correct direction
    da = directional_accuracy_score(y_true, y_pred, baseline=baseline, handle_equal=handle_equal)
    if handle_equal == 'exclude':
        # Only positions 2 and 4 are evaluated: 2/2 = 1.0
        assert da == pytest.approx(1.0)
    elif handle_equal == 'correct':
        # The tie at position 0 also has y_pred == baseline and counts as
        # correct; the ties at positions 1 and 3 do not: 3/5 = 0.6
        assert da == pytest.approx(0.6)
    elif handle_equal == 'incorrect':
        # All three ties count as incorrect: 2/5 = 0.4
        assert da == pytest.approx(0.4)


def test_directional_accuracy_with_weights():
    """Test directional accuracy with sample weights."""
    y_true = np.array([100, 102, 98, 101])
    baseline = np.array([100, 100, 100, 100])
    y_pred = np.array([101, 103, 97, 102])
    # Sample 0 is a tie and is excluded; all remaining directions are correct
    # Equal weights
    da_equal = directional_accuracy_score(y_true, y_pred, baseline=baseline)
    # Weighted (should still be 1.0 since all evaluated samples are correct)
    weights = np.array([2.0, 1.0, 1.0, 1.0])
    da_weighted = directional_accuracy_score(y_true, y_pred, baseline=baseline, sample_weight=weights)
    assert da_equal == pytest.approx(1.0)
    assert da_weighted == pytest.approx(1.0)
```
### Unit Tests for Directional Bias
```python
def test_directional_bias_no_bias():
    """Test that balanced errors give a bias of 0.0."""
    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y_pred = np.array([1.1, 1.9, 3.1, 3.9, 5.0])
    # 2 over, 2 under, 1 equal -> (2 - 2) / 4 = 0.0 in exclude mode
    bias = directional_bias_score(y_true, y_pred, handle_equal='exclude')
    assert bias == pytest.approx(0.0)


def test_directional_bias_complete_over_prediction():
    """Test that complete over-prediction gives a bias of 1.0."""
    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y_pred = np.array([1.1, 2.1, 3.1, 4.1, 5.1])
    # All over-predictions
    bias = directional_bias_score(y_true, y_pred)
    assert bias == pytest.approx(1.0)


def test_directional_bias_complete_under_prediction():
    """Test that complete under-prediction gives a bias of -1.0."""
    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y_pred = np.array([0.9, 1.9, 2.9, 3.9, 4.9])
    # All under-predictions
    bias = directional_bias_score(y_true, y_pred)
    assert bias == pytest.approx(-1.0)


def test_directional_bias_mostly_over():
    """Test partial over-prediction bias."""
    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y_pred = np.array([1.1, 2.1, 3.1, 3.9, 4.9])
    # 3 over, 2 under -> (3 - 2) / 5 = 0.2
    bias = directional_bias_score(y_true, y_pred)
    assert bias == pytest.approx(0.2)


def test_directional_bias_shape_mismatch():
    """Test that a shape mismatch raises ValueError."""
    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.0, 2.0])
    with pytest.raises(ValueError, match="Shape mismatch"):
        directional_bias_score(y_true, y_pred)


def test_directional_bias_invalid_handle_equal():
    """Test that an invalid handle_equal raises ValueError."""
    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.0, 2.0, 3.0])
    with pytest.raises(ValueError, match="handle_equal must be"):
        directional_bias_score(y_true, y_pred, handle_equal='invalid')


@pytest.mark.parametrize("handle_equal", ['exclude', 'neutral'])
def test_directional_bias_handle_equal_modes(handle_equal):
    """Test the different handle_equal modes."""
    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y_pred = np.array([1.1, 2.0, 3.1, 4.0, 5.1])
    # 3 over, 0 under, 2 equal
    bias = directional_bias_score(y_true, y_pred, handle_equal=handle_equal)
    if handle_equal == 'exclude':
        # Only the 3 non-equal samples: (3 - 0) / 3 = 1.0
        assert bias == pytest.approx(1.0)
    elif handle_equal == 'neutral':
        # All 5 samples: (3 - 0) / 5 = 0.6
        assert bias == pytest.approx(0.6)


def test_directional_bias_with_weights():
    """Test directional bias with sample weights."""
    y_true = np.array([1.0, 2.0, 3.0, 4.0])
    y_pred = np.array([1.1, 2.1, 2.9, 3.9])
    # 2 over, 2 under with equal weights -> bias = 0
    bias_equal = directional_bias_score(y_true, y_pred)
    assert bias_equal == pytest.approx(0.0)
    # Weight the over-predictions more heavily
    weights = np.array([2.0, 2.0, 1.0, 1.0])
    bias_weighted = directional_bias_score(y_true, y_pred, sample_weight=weights)
    # Weighted: over = 4/6, under = 2/6 -> bias = 2/6 ≈ 0.333
    assert bias_weighted == pytest.approx(1 / 3)


def test_directional_bias_all_equal():
    """Test that all-equal predictions raise an error in exclude mode."""
    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.0, 2.0, 3.0])
    with pytest.raises(ValueError, match="No valid samples"):
        directional_bias_score(y_true, y_pred, handle_equal='exclude')
```
### Integration Tests
```python
def test_directional_metrics_on_real_data():
    """Test directional metrics on realistic time series data."""
    # Simulate stock price predictions
    np.random.seed(42)
    n_samples = 100
    # True prices with trend
    true_prices = 100 + np.cumsum(np.random.randn(n_samples) * 2)
    # Good model: follows the true series but with some error
    good_pred = true_prices + np.random.randn(n_samples) * 5
    # Bad model: an independent random walk
    bad_pred = 100 + np.cumsum(np.random.randn(n_samples) * 2)
    # Calculate directional accuracy
    da_good = directional_accuracy_score(true_prices, good_pred)
    da_bad = directional_accuracy_score(true_prices, bad_pred)
    # The good model should have higher directional accuracy
    assert da_good > da_bad
    assert da_good > 0.5  # Better than random
    # The good model adds symmetric noise, so its bias should be near 0.
    # (No tight bound is asserted for the bad model: an independent random
    # walk can drift far above or below the true series.)
    bias_good = directional_bias_score(true_prices, good_pred)
    assert abs(bias_good) < 0.3


def test_directional_metrics_consistency():
    """Test that the directional metrics are consistent with each other."""
    y_true = np.array([100, 102, 98, 101, 99, 103])
    # Consistently over-predicting model; the offset (+1) is smaller than
    # every true change, so the predicted directions are unaffected
    y_pred_over = y_true + 1
    bias_over = directional_bias_score(y_true, y_pred_over)
    assert bias_over == pytest.approx(1.0)  # Complete over-prediction
    # Directional accuracy stays perfect: the bias shifts the level of the
    # predictions, not the direction of change
    da_over = directional_accuracy_score(y_true, y_pred_over)
    assert da_over == pytest.approx(1.0)
```
## Documentation Requirements
1. Docstring examples: Add comprehensive examples for both functions showing:
   - Time series usage
   - Custom baseline usage
   - Interpretation of results
   - Use with sample weights
2. Module documentation: Update the module-level docstring with:
   - A brief explanation of directional metrics
   - When to use them vs. traditional metrics
   - Links to references
3. README: Add a section on directional metrics with:
   - Use case examples (finance, forecasting)
   - An interpretation guide
   - Code snippets
4. Tutorial notebook (optional): Create a notebook showing:
   - Comparison with traditional metrics
   - A financial forecasting example
   - Model calibration using directional bias
## Additional Notes

### Design Decisions
- handle_equal parameter: Provides flexibility in how to treat edge cases (no change, perfect predictions)
- Time series default: When baseline=None, automatically uses previous value for intuitive time series usage
- Sample weights: Supports weighted metrics for imbalanced or importance-weighted datasets
- Range normalization: Both metrics return values in intuitive ranges ([-1, 1] for bias, [0, 1] for accuracy)
### Comparison with Existing Metrics

| Metric | What it measures | When to use |
|---|---|---|
| MAE/MSE | Magnitude of error | When exact values matter |
| Directional Accuracy | Correct trend prediction | When direction matters more than magnitude |
| Directional Bias | Systematic over/under-prediction | For model calibration and diagnostics |
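To make the contrast concrete, here is a small hypothetical sketch of two models with identical RMSE but very different directional bias (the inline `rmse` and `bias` helpers are stand-ins for illustration, not the proposed API):

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])
pred_biased = y_true + 1.0                                  # always over-predicts
pred_balanced = y_true + np.array([1.0, -1.0, 1.0, -1.0])   # balanced errors

def rmse(y_t, y_p):
    return float(np.sqrt(np.mean((y_p - y_t) ** 2)))

def bias(y_t, y_p):
    errors = y_p - y_t
    return float((np.sum(errors > 0) - np.sum(errors < 0)) / errors.size)

print(rmse(y_true, pred_biased), bias(y_true, pred_biased))      # 1.0 1.0
print(rmse(y_true, pred_balanced), bias(y_true, pred_balanced))  # 1.0 0.0
```

RMSE alone cannot distinguish these two models; the bias metric immediately flags the first as needing recalibration.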
### Use Cases
- Financial Forecasting: Stock price direction is often more valuable than exact price
- Energy Demand: Predicting increase/decrease helps with resource allocation
- Sales Forecasting: Understanding trend direction aids business planning
- Weather Forecasting: Temperature trend prediction for planning
### Performance Considerations
- Both functions use vectorized NumPy operations for efficiency
- O(n) time complexity where n is number of samples
- Minimal memory overhead (only creates mask arrays)
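As a rough illustration of the O(n) claim, a minimal timing sketch (the `bias` helper below is a hypothetical stand-in for the proposed function, reduced to its core vectorized passes):

```python
import numpy as np
import timeit

def bias(y_true, y_pred):
    # Core of the bias computation: a few vectorized passes over the data
    errors = y_pred - y_true
    return float((np.sum(errors > 0) - np.sum(errors < 0)) / errors.size)

# Runtime should grow roughly linearly with the number of samples
for n in (10_000, 100_000, 1_000_000):
    y_true = np.random.randn(n)
    y_pred = y_true + np.random.randn(n)
    t = timeit.timeit(lambda: bias(y_true, y_pred), number=10)
    print(f"n={n:>9,}: {t / 10:.6f} s per call")
```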
## References
- Pesaran, M. H., & Timmermann, A. (1992). A simple nonparametric test of predictive performance. Journal of Business & Economic Statistics, 10(4), 461-465.
- Blaskowitz, O., & Herwartz, H. (2011). On economic evaluation of directional forecasts. International Journal of Forecasting, 27(4), 1058-1065.
- Nyberg, H. (2011). Forecasting the direction of the US stock market with dynamic binary probit models. International Journal of Forecasting, 27(2), 561-578.