Summary
Update the test data used in test_plot_precision_recall_curve_with_thresholds_annotations to generate Precision-Recall curves that look like actual PR curves (decreasing from top-left to bottom-right) rather than resembling inverted ROC curves.
Problem Description
Current Issue
The current test generates baseline images (test_plot_precision_recall_curve_with_thresholds_annotations_default.png and test_plot_precision_recall_curve_with_thresholds_annotations_with_chance_level.png) where the plotted curves appear to increase from left to right, resembling an inverted ROC curve rather than a typical Precision-Recall curve.
Why This Is Incorrect
Precision-Recall curves have distinct characteristics:
- General shape: PR curves typically decrease as recall increases (moving left to right on the x-axis)
  - They start at high precision/low recall (top-left)
  - They end at low precision/high recall (bottom-right)
  - Good classifiers "hug" the top-right corner (high precision AND high recall)
- Not a mirror of ROC curves:
  - ROC curves increase from bottom-left to top-right
  - PR curves generally decrease from top-left to bottom-right
  - They have fundamentally different shapes because their y-axes measure different things (precision vs. true positive rate)
- Why they decrease:
  - As you lower the classification threshold, recall increases (you catch more positives)
  - But precision typically decreases (you also let more false positives through)
  - This creates the characteristic downward trend
- Characteristic features of real PR curves:
  - Often have a "staircase" or "zigzag" pattern
  - May have plateaus where lowering the threshold doesn't change recall but precision fluctuates
  - Non-linear interpolation between points (unlike ROC curves)
  - Can cross each other more frequently than ROC curves
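The threshold mechanics behind this can be seen in a tiny worked example (the scores below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy data: 4 positives and 6 negatives with hypothetical classifier scores
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.70, 0.40, 0.60, 0.50, 0.30, 0.20, 0.15, 0.10])

for threshold in (0.8, 0.45):
    y_pred = (y_scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

# Lowering the threshold from 0.8 to 0.45 raises recall (0.50 -> 0.75)
# but drops precision (1.00 -> 0.60): the downward PR-curve trend.
```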
Visual Comparison
ROC Curve (for reference):
True Positive Rate
^
|            ___________
|        ___/
|    ___/   (Increases left to right)
|   /
|__/_____________> False Positive Rate
Typical Precision-Recall Curve:
Precision
^
|\
| \___
|     \___  (Decreases left to right)
|         \___
|             \___
|_________________\__> Recall
0                    1
Current Test Data Issue
Looking at tests/resources/plotly_models.json, the test data appears to have been generated in a way that creates curves that increase from left to right, which is the opposite of what typical PR curves look like.
This is problematic because:
- Misleading visuals: Developers and users looking at the test images might develop incorrect expectations
- Reduced confidence: Baseline images that don't look like real PR curves reduce confidence in the library
- Educational value: Tests should serve as documentation and examples - current images teach the wrong pattern
Desired Outcome
Generate test data that produces realistic Precision-Recall curves with these characteristics:
- Decreasing trend: Curves should generally decrease from top-left to bottom-right
- Start high: High precision at low recall (top-left area)
- End lower: Lower precision at high recall (bottom-right area)
- Good classifier appearance: Curve should stay relatively high (close to top-right corner) to demonstrate a well-performing classifier
- Realistic variability: Include the characteristic "staircase" pattern that real PR curves exhibit
Visual Goal
The ideal curve should look something like:
Precision
1.0 |●────●
    |      ●────●
    |            ●───●
    |                 ●──●
    |                     ●─●
0.0 |________________________●__> Recall
    0.0                        1.0
Proposed Solution
Option 1: Generate Synthetic Data with Correct Characteristics
Create a function to generate realistic classifier scores that will produce proper PR curves:
import numpy as np
from typing import Tuple

def generate_realistic_pr_curve_data(
    n_samples: int = 2239,
    positive_ratio: float = 0.3,
    classifier_quality: str = "good",
    random_state: int = 42,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate synthetic data that produces realistic Precision-Recall curves.

    The key insight: for PR curves to decrease left-to-right, high-scoring
    predictions need to be predominantly positive (high precision), while
    lower-scoring predictions have more false positives mixed in (lower precision).

    :param n_samples: Total number of samples
    :param positive_ratio: Proportion of positive samples
    :param classifier_quality: 'excellent', 'good', 'fair', or 'poor'
    :param random_state: Random seed for reproducibility
    :return: (y_true, y_scores)
    """
    np.random.seed(random_state)
    n_positives = int(n_samples * positive_ratio)
    n_negatives = n_samples - n_positives

    # Create true labels
    y_true = np.concatenate([
        np.ones(n_positives),
        np.zeros(n_negatives),
    ])

    # Quality parameters: (pos_mean, pos_std, neg_mean, neg_std)
    # The key: the positive class should have HIGHER scores on average
    quality_params = {
        'excellent': (0.85, 0.10, 0.20, 0.10),  # Clear separation
        'good': (0.75, 0.15, 0.30, 0.15),       # Good separation
        'fair': (0.65, 0.20, 0.40, 0.15),       # Moderate separation
        'poor': (0.55, 0.20, 0.45, 0.20),       # Poor separation
    }
    pos_mean, pos_std, neg_mean, neg_std = quality_params[classifier_quality]

    # Positive class: higher scores (concentrated toward 1)
    positive_scores = np.clip(np.random.normal(pos_mean, pos_std, n_positives), 0, 1)
    # Negative class: lower scores (concentrated toward 0)
    negative_scores = np.clip(np.random.normal(neg_mean, neg_std, n_negatives), 0, 1)
    y_scores = np.concatenate([positive_scores, negative_scores])

    # Shuffle labels and scores together
    indices = np.random.permutation(n_samples)
    return y_true[indices], y_scores[indices]
# Example usage for test data generation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

np.random.seed(42)

# Generate data for multiple classifiers with different quality levels
classifiers_data = {
    "Decision Tree": generate_realistic_pr_curve_data(
        n_samples=2239, positive_ratio=0.3, classifier_quality="good", random_state=42
    ),
    "Random Forest": generate_realistic_pr_curve_data(
        n_samples=2239, positive_ratio=0.3, classifier_quality="excellent", random_state=123
    ),
    "Logistic Regression": generate_realistic_pr_curve_data(
        n_samples=2239, positive_ratio=0.3, classifier_quality="fair", random_state=456
    ),
}

# Verify the data produces decreasing PR curves
fig, ax = plt.subplots(figsize=(10, 6))
for name, (y_true, y_scores) in classifiers_data.items():
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    ax.plot(recall, precision, marker='o', label=name, markersize=3)

    # precision_recall_curve returns arrays ordered by DECREASING recall,
    # so reverse before checking that precision falls as recall rises
    p = precision[::-1]
    is_mostly_decreasing = np.sum(np.diff(p) < 0) > (len(p) - 1) * 0.6
    print(f"{name}: Mostly decreasing = {is_mostly_decreasing}")

ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Realistic Precision-Recall Curves')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('realistic_pr_curves.png', dpi=150, bbox_inches='tight')
Option 2: Train Real Classifiers
Train actual classifiers on imbalanced classification data (synthetic here via make_classification, though a real dataset works too) so that their predicted probabilities naturally produce proper PR curves:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Generate imbalanced classification data
X, y = make_classification(
    n_samples=2239,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],  # Imbalanced
    flip_y=0.05,         # Some noise
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train classifiers
classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=42, max_depth=10),
    "Random Forest": RandomForestClassifier(random_state=42, n_estimators=50),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
}

classifiers_scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # Probability of the positive class
    classifiers_scores[name] = clf.predict_proba(X_test)[:, 1]

# y_test and classifiers_scores now hold realistic data
Implementation Steps
- Create a data generation script (or use real data):
  # Create a script to generate the test data
  touch tests/scripts/generate_pr_curve_test_data.py
- Generate new test data:
  - Use one of the approaches above to create realistic y_true and y_scores
  - Ensure the data produces PR curves that:
    - Start high-left (high precision, low recall)
    - End low-right (lower precision, high recall)
    - Generally decrease from left to right
- Update plotly_models.json:
  - Replace the current y_true array with new realistic data
  - For each classifier, update the y_scores arrays
  - Recalculate the precision_recall_curve data (precision_array, recall_array, thresholds)
  - Recalculate the roc_curve data if needed
- Regenerate baseline images:
  pytest tests/test_metrics.py::test_plot_precision_recall_curve_with_thresholds_annotations --mpl-generate-path=tests/baseline_images/test_metrics
- Verify the results:
  - Visually inspect the new baseline images
  - Confirm curves decrease from left to right
  - Confirm they look like typical PR curves
  - Ensure the chance level (horizontal line at prevalence) looks correct
- Update documentation (if needed):
  - Add comments in the test explaining why this data was chosen
  - Document the characteristics of good PR curves
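The plotly_models.json recomputation can be sketched as a small helper. The field names precision_array, recall_array, and thresholds come from this issue; the roc_curve field names (fpr, tpr) and the overall JSON layout are assumptions about the file, so adapt them to the actual schema:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

def pr_curve_entry(y_true, y_scores):
    """Recompute the PR-curve fields for one classifier entry."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    return {
        "precision_array": precision.tolist(),
        "recall_array": recall.tolist(),
        "thresholds": thresholds.tolist(),
    }

def roc_curve_entry(y_true, y_scores):
    """Recompute the ROC-curve fields for one classifier entry (if needed).

    The key names here are placeholders, not confirmed by the issue.
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    return {
        "fpr": fpr.tolist(),
        "tpr": tpr.tolist(),
        "thresholds": thresholds.tolist(),
    }
```

The .tolist() calls matter because NumPy arrays are not JSON-serializable; the resulting dicts can be dropped straight into json.dump.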
Example: What the Fix Should Look Like
Before (Current - INCORRECT):
The curve increases from left to right, looking like an inverted ROC curve:
Precision
^                    ____
|                ___/
|            ___/
|        ___/
|    ___/
|___/_______________> Recall
After (Fixed - CORRECT):
The curve decreases from left to right, like a proper PR curve:
Precision
^
|●───●
|     ●───●
|          ●───●
|               ●───●
|                    ●──●
|________________________●> Recall
Testing Validation
After implementing the fix, validate that:
- Visual inspection: Baseline images show decreasing curves
- Numerical validation: For each curve, confirm:
  precision, recall, _ = precision_recall_curve(y_true, y_scores)
  # precision_recall_curve orders its output by decreasing recall, so
  # reverse before checking that precision mostly falls as recall rises
  p = precision[::-1]
  n_decreasing = np.sum(np.diff(p) < 0)
  n_total = len(p) - 1
  assert n_decreasing / n_total > 0.5, "PR curve should mostly decrease"
- Curve starts high: Precision at the low-recall end should be relatively high (> 0.6 for a good classifier)
- Curve ends lower: Precision at the high-recall end should approach the prevalence
- Prevalence line: The chance level (horizontal line) should be at the correct y-value
Additional Context
Why This Matters
- Educational value: Tests serve as documentation and examples
- Visual validation: Developers can spot issues by looking at the curves
- Confidence: Realistic test data increases confidence in the library
- Correctness: Ensures the plotting function works correctly with real-world-like data
Related Issues
This fix should be implemented after or alongside the issue for updating the plot_chance_level parameter, since:
- The new baseline images will need to show the horizontal chance level line correctly
- Both changes affect the same baseline image files
- It makes sense to regenerate baseline images once with both fixes applied
Notes
Important: The key insight is that PR curves decrease because as you lower the classification threshold:
- You include more predictions as "positive" (recall increases)
- But you also include more false positives (precision decreases)
- This creates the characteristic top-left to bottom-right pattern
The test data should be generated such that:
- Positive class instances tend to have higher scores
- Negative class instances tend to have lower scores
- There's some overlap (realistic classifier)
- This naturally produces a decreasing PR curve