Improve Test Data for Precision-Recall Curve to Show Realistic Shape #100

@idanmoradarthas

Description

Summary

Update the test data used in test_plot_precision_recall_curve_with_thresholds_annotations to generate Precision-Recall curves that look like actual PR curves (decreasing from top-left to bottom-right) rather than resembling inverted ROC curves.

Problem Description

Current Issue

The current test generates baseline images (test_plot_precision_recall_curve_with_thresholds_annotations_default.png and test_plot_precision_recall_curve_with_thresholds_annotations_with_chance_level.png) where the plotted curves appear to increase from left to right, resembling an inverted ROC curve rather than a typical Precision-Recall curve.

Why This Is Incorrect

Precision-Recall curves have distinct characteristics:

  1. General shape: PR curves typically decrease as recall increases (moving left to right on the x-axis)

    • They start at high precision/low recall (top-left)
    • They end at low precision/high recall (bottom-right)
    • Good classifiers "hug" the top-right corner (high precision AND high recall)
  2. Not a mirror of ROC curves:

    • ROC curves increase from bottom-left to top-right
    • PR curves generally decrease from top-left to bottom-right
    • They have fundamentally different shapes due to different y-axis definitions
  3. Why they decrease:

    • As you lower the classification threshold, recall increases (you catch more positives)
    • But precision typically decreases (you also get more false positives mixed in)
    • This creates the characteristic downward trend
  4. Characteristic features of real PR curves:

    • Often have a "staircase" or "zigzag" pattern
    • May have plateaus where lowering threshold doesn't change recall but precision fluctuates
    • Non-linear interpolation between points (unlike ROC curves)
    • Can cross each other more frequently than ROC curves
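These properties show up even on a tiny hand-made example (the scores below are illustrative, not taken from the test data). Note that scikit-learn's precision_recall_curve returns points in decreasing-recall order, so the loop reverses the arrays to walk the curve left to right:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative toy data: the highest scores are mostly true positives,
# while lower scores mix in more and more false positives.
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
y_scores = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Print in increasing-recall order: precision steps down in the
# characteristic staircase as recall grows.
for p, r in zip(precision[::-1], recall[::-1]):
    print(f"recall={r:.2f}  precision={p:.2f}")
```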

Visual Comparison

ROC Curve (for reference):

True Positive Rate
    ^
    |     ___________
    |   /
    |  /              (Increases left to right)
    | /
    |/_______________> False Positive Rate

Typical Precision-Recall Curve:

Precision
    ^
    |\
    | \___
    |     \___        (Decreases left to right)
    |        \___
    |            \___
    |________________> Recall
    0               1

Current Test Data Issue

Looking at tests/resources/plotly_models.json, the test data appears to have been generated in a way that creates curves that increase from left to right, which is the opposite of what typical PR curves look like.

This is problematic because:

  1. Misleading visuals: Developers and users looking at the test images might develop incorrect expectations
  2. Reduced confidence: Baseline images that don't look like real PR curves reduce confidence in the library
  3. Educational value: Tests should serve as documentation and examples - current images teach the wrong pattern

Desired Outcome

Generate test data that produces realistic Precision-Recall curves with these characteristics:

  1. Decreasing trend: Curves should generally decrease from top-left to bottom-right
  2. Start high: High precision at low recall (top-left area)
  3. End lower: Lower precision at high recall (bottom-right area)
  4. Good classifier appearance: Curve should stay relatively high (close to top-right corner) to demonstrate a well-performing classifier
  5. Realistic variability: Include the characteristic "staircase" pattern that real PR curves exhibit

Visual Goal

The ideal curve should look something like:

Precision
1.0 |●────●
    |      ●────●
    |           ●───●
    |               ●──●
    |                  ●─●
0.0 |___________________●__> Recall
   0.0                  1.0

Proposed Solution

Option 1: Generate Synthetic Data with Correct Characteristics

Create a function to generate realistic classifier scores that will produce proper PR curves:

from typing import Tuple

import numpy as np


def generate_realistic_pr_curve_data(
    n_samples: int = 2239,
    positive_ratio: float = 0.3,
    classifier_quality: str = "good",
    random_state: int = 42
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate synthetic data that produces realistic Precision-Recall curves.
    
    The key insight: For PR curves to decrease left-to-right, high-scoring 
    predictions need to be predominantly positive (high precision), while 
    lower-scoring predictions have more false positives mixed in (lower precision).
    
    :param n_samples: Total number of samples
    :param positive_ratio: Proportion of positive samples
    :param classifier_quality: 'excellent', 'good', 'fair', or 'poor'
    :param random_state: Random seed for reproducibility
    :return: (y_true, y_scores)
    """
    np.random.seed(random_state)
    
    n_positives = int(n_samples * positive_ratio)
    n_negatives = n_samples - n_positives
    
    # Create true labels
    y_true = np.concatenate([
        np.ones(n_positives),
        np.zeros(n_negatives)
    ])
    
    # Quality parameters: (pos_mean, pos_std, neg_mean, neg_std)
    # The key: positive class should have HIGHER scores on average
    quality_params = {
        'excellent': (0.85, 0.10, 0.20, 0.10),  # Clear separation
        'good': (0.75, 0.15, 0.30, 0.15),       # Good separation
        'fair': (0.65, 0.20, 0.40, 0.15),       # Moderate separation
        'poor': (0.55, 0.20, 0.45, 0.20),       # Poor separation
    }
    
    pos_mean, pos_std, neg_mean, neg_std = quality_params[classifier_quality]
    
    # Generate scores
    # Positive class: higher scores (concentrated toward 1)
    positive_scores = np.random.normal(pos_mean, pos_std, n_positives)
    positive_scores = np.clip(positive_scores, 0, 1)
    
    # Negative class: lower scores (concentrated toward 0)
    negative_scores = np.random.normal(neg_mean, neg_std, n_negatives)
    negative_scores = np.clip(negative_scores, 0, 1)
    
    y_scores = np.concatenate([positive_scores, negative_scores])
    
    # Shuffle together
    indices = np.random.permutation(n_samples)
    y_true = y_true[indices]
    y_scores = y_scores[indices]
    
    return y_true, y_scores


# Example usage for test data generation
np.random.seed(42)

# Generate data for multiple classifiers with different quality levels
classifiers_data = {
    "Decision Tree": generate_realistic_pr_curve_data(
        n_samples=2239, 
        positive_ratio=0.3, 
        classifier_quality="good",
        random_state=42
    ),
    "Random Forest": generate_realistic_pr_curve_data(
        n_samples=2239, 
        positive_ratio=0.3, 
        classifier_quality="excellent",
        random_state=123
    ),
    "Logistic Regression": generate_realistic_pr_curve_data(
        n_samples=2239, 
        positive_ratio=0.3, 
        classifier_quality="fair",
        random_state=456
    ),
}

# Verify the data produces decreasing PR curves
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
for name, (y_true, y_scores) in classifiers_data.items():
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    ax.plot(recall, precision, marker='o', label=name, markersize=3)
    
    # Verify precision generally decreases as recall increases.
    # Note: precision_recall_curve orders points by *decreasing* recall,
    # so along the returned arrays precision should mostly rise.
    diffs = np.diff(precision)
    is_mostly_decreasing = np.sum(diffs > 0) > len(diffs) * 0.6
    print(f"{name}: Mostly decreasing = {is_mostly_decreasing}")

ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Realistic Precision-Recall Curves')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('realistic_pr_curves.png', dpi=150, bbox_inches='tight')

Option 2: Train Real Classifiers on Synthetic Data

Train actual classifiers on an imbalanced make_classification dataset; their predicted probabilities naturally produce proper PR curves:

from sklearn.datasets import make_classification

# Generate imbalanced classification data
X, y = make_classification(
    n_samples=2239,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],  # Imbalanced
    flip_y=0.05,  # Some noise
    random_state=42
)

# Train classifiers
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=42, max_depth=10),
    "Random Forest": RandomForestClassifier(random_state=42, n_estimators=50),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
}

classifiers_scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_scores = clf.predict_proba(X_test)[:, 1]  # Probability of positive class
    classifiers_scores[name] = y_scores

# y_test and classifiers_scores now have realistic data

Implementation Steps

  1. Create data generation script (or use real data):

    # Create a script to generate the test data
    touch tests/scripts/generate_pr_curve_test_data.py
  2. Generate new test data:

    • Use one of the approaches above to create realistic y_true and y_scores
    • Ensure the data produces PR curves that:
      • Start high-left (high precision, low recall)
      • End low-right (lower precision, high recall)
      • Generally decrease from left to right
  3. Update plotly_models.json:

    • Replace the current y_true array with new realistic data
    • For each classifier, update y_scores arrays
    • Recalculate the precision_recall_curve data (precision_array, recall_array, thresholds)
    • Recalculate roc_curve data if needed
  4. Regenerate baseline images:

    pytest tests/test_metrics.py::test_plot_precision_recall_curve_with_thresholds_annotations --mpl-generate-path=tests/baseline_images/test_metrics
  5. Verify the results:

    • Visually inspect the new baseline images
    • Confirm curves decrease from left to right
    • Confirm they look like typical PR curves
    • Ensure the chance level (horizontal line at prevalence) looks correct
  6. Update documentation (if needed):

    • Add comments in the test explaining why this data was chosen
    • Document the characteristics of good PR curves
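Step 3 could be scripted roughly as follows. This is a sketch: the field names (y_true, y_scores, precision_array, recall_array, thresholds) are taken from the issue text, and the actual schema of tests/resources/plotly_models.json may differ.

```python
import json

import numpy as np
from sklearn.metrics import precision_recall_curve


def rebuild_pr_entries(y_true, classifiers_scores):
    """Recompute the PR-curve fields for each classifier from new scores.

    Field names are assumed from the issue text; adapt to the real schema
    of tests/resources/plotly_models.json.
    """
    data = {"y_true": np.asarray(y_true).tolist(), "classifiers": {}}
    for name, y_scores in classifiers_scores.items():
        precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
        data["classifiers"][name] = {
            "y_scores": np.asarray(y_scores).tolist(),
            "precision_array": precision.tolist(),
            "recall_array": recall.tolist(),
            "thresholds": thresholds.tolist(),
        }
    return data


# Usage sketch (adjust the path and structure to the real file):
# with open("tests/resources/plotly_models.json", "w") as fh:
#     json.dump(rebuild_pr_entries(y_true, classifiers_scores), fh, indent=2)
```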

Example: What the Fix Should Look Like

Before (Current - INCORRECT):

The curve increases from left to right, looking like an inverted ROC curve:

Precision
    ^                    ____
    |                ___/
    |            ___/
    |        ___/
    |    ___/
    |___/_______________> Recall

After (Fixed - CORRECT):

The curve decreases from left to right, like a proper PR curve:

Precision
    ^
    |●───●
    |     ●───●
    |         ●───●
    |             ●───●
    |                 ●──●
    |____________________●> Recall

Testing Validation

After implementing the fix, validate that:

  1. Visual inspection: Baseline images show decreasing curves
  2. Numerical validation: For each curve, confirm:
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    # sklearn returns points in decreasing-recall order; reverse so the
    # trend is checked in increasing-recall order
    precision_by_recall = precision[::-1]
    n_decreasing = np.sum(np.diff(precision_by_recall) < 0)
    n_total = len(precision_by_recall) - 1
    assert n_decreasing / n_total > 0.5, "PR curve should mostly decrease"
  3. Curve starts high: Precision at low recall should be relatively high (> 0.6 for a good classifier)
  4. Curve ends lower: Precision at recall ≈ 1 should be close to the prevalence
  5. Prevalence line: The chance level (horizontal line) should be at the correct y-value
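The checks above can be bundled into one helper. This is a sketch: validate_pr_data and its 0.2/0.8 recall cutoffs are illustrative choices, not part of the library.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def validate_pr_data(y_true, y_scores, min_decreasing_frac=0.5):
    """Check that (y_true, y_scores) produces a typical decreasing PR curve."""
    y_true = np.asarray(y_true, dtype=float)
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    # Reverse into increasing-recall order (sklearn returns decreasing recall)
    p, r = precision[::-1], recall[::-1]
    diffs = np.diff(p)
    assert np.sum(diffs < 0) / len(diffs) > min_decreasing_frac, \
        "PR curve should mostly decrease as recall increases"
    # Starts high, ends lower: compare precision near recall=0 vs. recall=1
    assert p[r <= 0.2].mean() > p[r >= 0.8].mean(), \
        "curve should start high (low recall) and end lower (high recall)"
    return True
```

Running it on each (y_true, y_scores) pair before committing the new plotly_models.json catches regressions to the inverted-ROC shape early.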

Additional Context

Why This Matters

  1. Educational value: Tests serve as documentation and examples
  2. Visual validation: Developers can spot issues by looking at the curves
  3. Confidence: Realistic test data increases confidence in the library
  4. Correctness: Ensures the plotting function works correctly with real-world-like data

References

Related Issues

This fix should be implemented after or alongside the issue for updating plot_chance_level parameter, since:

  • The new baseline images will need to show the horizontal chance level line correctly
  • Both changes affect the same baseline image files
  • It makes sense to regenerate baseline images once with both fixes applied

Acceptance Criteria

  • Test data generates PR curves that decrease from top-left to bottom-right
  • Curves show realistic "staircase" pattern
  • High-scoring predictions are predominantly positive (creating high initial precision)
  • Lower-scoring predictions have more false positives mixed in (creating lower final precision)
  • Baseline images visually match typical PR curve examples from ML literature
  • All existing tests pass with new baseline images
  • Code includes comments explaining the data generation approach
  • The prevalence/chance level appears at the correct position

Notes

Important: The key insight is that PR curves decrease because as you lower the classification threshold:

  • You include more predictions as "positive" (recall increases)
  • But you also include more false positives (precision decreases)
  • This creates the characteristic top-left to bottom-right pattern

The test data should be generated such that:

  • Positive class instances tend to have higher scores
  • Negative class instances tend to have lower scores
  • There's some overlap (realistic classifier)
  • This naturally produces a decreasing PR curve

Metadata

Labels

bug (Something isn't working)
