Improve Test Data for Precision-Recall Curve to Show Realistic Shape #100

@idanmoradarthas

Description

Summary

Update the test data used in test_plot_precision_recall_curve_with_thresholds_annotations to generate Precision-Recall curves that look like actual PR curves (decreasing from top-left to bottom-right) rather than resembling inverted ROC curves.

Problem Description

Current Issue

The current test generates baseline images (test_plot_precision_recall_curve_with_thresholds_annotations_default.png and test_plot_precision_recall_curve_with_thresholds_annotations_with_chance_level.png) where the plotted curves appear to increase from left to right, resembling an inverted ROC curve rather than a typical Precision-Recall curve.

Why This Is Incorrect

Precision-Recall curves have distinct characteristics:

  1. General shape: PR curves typically decrease as recall increases (moving left to right on the x-axis)

    • They start at high precision/low recall (top-left)
    • They end at low precision/high recall (bottom-right)
    • Good classifiers "hug" the top-right corner (high precision AND high recall)
  2. Not a mirror of ROC curves:

    • ROC curves increase from bottom-left to top-right
    • PR curves generally decrease from top-left to bottom-right
    • They have fundamentally different shapes due to different y-axis definitions
  3. Why they decrease:

    • As you lower the classification threshold, recall increases (you catch more positives)
    • But precision typically decreases (you also get more false positives mixed in)
    • This creates the characteristic downward trend
  4. Characteristic features of real PR curves:

    • Often have a "staircase" or "zigzag" pattern
    • May have plateaus where lowering threshold doesn't change recall but precision fluctuates
    • Non-linear interpolation between points (unlike ROC curves)
    • Can cross each other more frequently than ROC curves
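These properties show up even on a tiny hand-made example (the scores below are illustrative, not taken from the test data). Note that scikit-learn's precision_recall_curve returns points in decreasing-recall order, so the loop reverses the arrays to walk the curve left to right:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative toy data: the highest scores are mostly true positives,
# while lower scores mix in more and more false positives.
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
y_scores = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Print in increasing-recall order: precision steps down in the
# characteristic staircase as recall grows.
for p, r in zip(precision[::-1], recall[::-1]):
    print(f"recall={r:.2f}  precision={p:.2f}")
```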

Visual Comparison

ROC Curve (for reference):

True Positive Rate
    ^
    |     ___________
    |   /
    |  /              (Increases left to right)
    | /
    |/_______________> False Positive Rate

Typical Precision-Recall Curve:

Precision
    ^
    |\
    | \___
    |     \___        (Decreases left to right)
    |        \___
    |            \___
    |________________> Recall
    0               1

Current Test Data Issue

Looking at tests/resources/plotly_models.json, the test data appears to have been generated in a way that creates curves that increase from left to right, which is the opposite of what typical PR curves look like.

This is problematic because:

  1. Misleading visuals: Developers and users looking at the test images might develop incorrect expectations
  2. Reduced confidence: Baseline images that don't look like real PR curves reduce confidence in the library
  3. Educational value: Tests should serve as documentation and examples - current images teach the wrong pattern

Desired Outcome

Generate test data that produces realistic Precision-Recall curves with these characteristics:

  1. Decreasing trend: Curves should generally decrease from top-left to bottom-right
  2. Start high: High precision at low recall (top-left area)
  3. End lower: Lower precision at high recall (bottom-right area)
  4. Good classifier appearance: Curve should stay relatively high (close to top-right corner) to demonstrate a well-performing classifier
  5. Realistic variability: Include the characteristic "staircase" pattern that real PR curves exhibit

Visual Goal

The ideal curve should look something like:

Precision
1.0 |●────●
    |      ●────●
    |           ●───●
    |               ●──●
    |                  ●─●
0.0 |___________________●__> Recall
   0.0                  1.0

Proposed Solution

Option 1: Generate Synthetic Data with Correct Characteristics

Create a function to generate realistic classifier scores that will produce proper PR curves:

from typing import Tuple

import numpy as np


def generate_realistic_pr_curve_data(
    n_samples: int = 2239,
    positive_ratio: float = 0.3,
    classifier_quality: str = "good",
    random_state: int = 42
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate synthetic data that produces realistic Precision-Recall curves.
    
    The key insight: For PR curves to decrease left-to-right, high-scoring 
    predictions need to be predominantly positive (high precision), while 
    lower-scoring predictions have more false positives mixed in (lower precision).
    
    :param n_samples: Total number of samples
    :param positive_ratio: Proportion of positive samples
    :param classifier_quality: 'excellent', 'good', 'fair', or 'poor'
    :param random_state: Random seed for reproducibility
    :return: (y_true, y_scores)
    """
    np.random.seed(random_state)
    
    n_positives = int(n_samples * positive_ratio)
    n_negatives = n_samples - n_positives
    
    # Create true labels
    y_true = np.concatenate([
        np.ones(n_positives),
        np.zeros(n_negatives)
    ])
    
    # Quality parameters: (pos_mean, pos_std, neg_mean, neg_std)
    # The key: positive class should have HIGHER scores on average
    quality_params = {
        'excellent': (0.85, 0.10, 0.20, 0.10),  # Clear separation
        'good': (0.75, 0.15, 0.30, 0.15),       # Good separation
        'fair': (0.65, 0.20, 0.40, 0.15),       # Moderate separation
        'poor': (0.55, 0.20, 0.45, 0.20),       # Poor separation
    }
    
    pos_mean, pos_std, neg_mean, neg_std = quality_params[classifier_quality]
    
    # Generate scores
    # Positive class: higher scores (concentrated toward 1)
    positive_scores = np.random.normal(pos_mean, pos_std, n_positives)
    positive_scores = np.clip(positive_scores, 0, 1)
    
    # Negative class: lower scores (concentrated toward 0)
    negative_scores = np.random.normal(neg_mean, neg_std, n_negatives)
    negative_scores = np.clip(negative_scores, 0, 1)
    
    y_scores = np.concatenate([positive_scores, negative_scores])
    
    # Shuffle together
    indices = np.random.permutation(n_samples)
    y_true = y_true[indices]
    y_scores = y_scores[indices]
    
    return y_true, y_scores


# Example usage for test data generation
np.random.seed(42)

# Generate data for multiple classifiers with different quality levels
classifiers_data = {
    "Decision Tree": generate_realistic_pr_curve_data(
        n_samples=2239, 
        positive_ratio=0.3, 
        classifier_quality="good",
        random_state=42
    ),
    "Random Forest": generate_realistic_pr_curve_data(
        n_samples=2239, 
        positive_ratio=0.3, 
        classifier_quality="excellent",
        random_state=123
    ),
    "Logistic Regression": generate_realistic_pr_curve_data(
        n_samples=2239, 
        positive_ratio=0.3, 
        classifier_quality="fair",
        random_state=456
    ),
}

# Verify the data produces decreasing PR curves
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
for name, (y_true, y_scores) in classifiers_data.items():
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    ax.plot(recall, precision, marker='o', label=name, markersize=3)
    
    # Verify precision generally decreases as recall increases.
    # Note: precision_recall_curve orders points by *decreasing* recall,
    # so along the returned arrays precision should mostly rise.
    diffs = np.diff(precision)
    is_mostly_decreasing = np.sum(diffs > 0) > len(diffs) * 0.6
    print(f"{name}: Mostly decreasing = {is_mostly_decreasing}")

ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Realistic Precision-Recall Curves')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('realistic_pr_curves.png', dpi=150, bbox_inches='tight')

Option 2: Train Real Classifiers on Synthetic Data

Train actual classifiers on an imbalanced make_classification dataset; their predicted probabilities naturally produce proper PR curves:

from sklearn.datasets import make_classification

# Generate imbalanced classification data
X, y = make_classification(
    n_samples=2239,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],  # Imbalanced
    flip_y=0.05,  # Some noise
    random_state=42
)

# Train classifiers
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=42, max_depth=10),
    "Random Forest": RandomForestClassifier(random_state=42, n_estimators=50),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
}

classifiers_scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_scores = clf.predict_proba(X_test)[:, 1]  # Probability of positive class
    classifiers_scores[name] = y_scores

# y_test and classifiers_scores now have realistic data

Implementation Steps

  1. Create data generation script (or use real data):

    # Create a script to generate the test data
    touch tests/scripts/generate_pr_curve_test_data.py
  2. Generate new test data:

    • Use one of the approaches above to create realistic y_true and y_scores
    • Ensure the data produces PR curves that:
      • Start high-left (high precision, low recall)
      • End low-right (lower precision, high recall)
      • Generally decrease from left to right
  3. Update plotly_models.json:

    • Replace the current y_true array with new realistic data
    • For each classifier, update y_scores arrays
    • Recalculate the precision_recall_curve data (precision_array, recall_array, thresholds)
    • Recalculate roc_curve data if needed
  4. Regenerate baseline images:

    pytest tests/test_metrics.py::test_plot_precision_recall_curve_with_thresholds_annotations --mpl-generate-path=tests/baseline_images/test_metrics
  5. Verify the results:

    • Visually inspect the new baseline images
    • Confirm curves decrease from left to right
    • Confirm they look like typical PR curves
    • Ensure the chance level (horizontal line at prevalence) looks correct
  6. Update documentation (if needed):

    • Add comments in the test explaining why this data was chosen
    • Document the characteristics of good PR curves
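Step 3 could be scripted roughly as follows. This is a sketch: the field names (y_true, y_scores, precision_array, recall_array, thresholds) are taken from the issue text, and the actual schema of tests/resources/plotly_models.json may differ.

```python
import json

import numpy as np
from sklearn.metrics import precision_recall_curve


def rebuild_pr_entries(y_true, classifiers_scores):
    """Recompute the PR-curve fields for each classifier from new scores.

    Field names are assumed from the issue text; adapt to the real schema
    of tests/resources/plotly_models.json.
    """
    data = {"y_true": np.asarray(y_true).tolist(), "classifiers": {}}
    for name, y_scores in classifiers_scores.items():
        precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
        data["classifiers"][name] = {
            "y_scores": np.asarray(y_scores).tolist(),
            "precision_array": precision.tolist(),
            "recall_array": recall.tolist(),
            "thresholds": thresholds.tolist(),
        }
    return data


# Usage sketch (adjust the path and structure to the real file):
# with open("tests/resources/plotly_models.json", "w") as fh:
#     json.dump(rebuild_pr_entries(y_true, classifiers_scores), fh, indent=2)
```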

Example: What the Fix Should Look Like

Before (Current - INCORRECT):

The curve increases from left to right, looking like an inverted ROC curve:

Precision
    ^                    ____
    |                ___/
    |            ___/
    |        ___/
    |    ___/
    |___/_______________> Recall

After (Fixed - CORRECT):

The curve decreases from left to right, like a proper PR curve:

Precision
    ^
    |●───●
    |     ●───●
    |         ●───●
    |             ●───●
    |                 ●──●
    |____________________●> Recall

Testing Validation

After implementing the fix, validate that:

  1. Visual inspection: Baseline images show decreasing curves
  2. Numerical validation: For each curve, confirm:
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    # sklearn returns points in decreasing-recall order; reverse so the
    # trend is checked in increasing-recall order
    precision_by_recall = precision[::-1]
    n_decreasing = np.sum(np.diff(precision_by_recall) < 0)
    n_total = len(precision_by_recall) - 1
    assert n_decreasing / n_total > 0.5, "PR curve should mostly decrease"
  3. Curve starts high: Precision at low recall should be relatively high (> 0.6 for a good classifier)
  4. Curve ends lower: Precision at recall ≈ 1 should be close to the prevalence
  5. Prevalence line: The chance level (horizontal line) should be at the correct y-value
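The checks above can be bundled into one helper. This is a sketch: validate_pr_data and its 0.2/0.8 recall cutoffs are illustrative choices, not part of the library.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def validate_pr_data(y_true, y_scores, min_decreasing_frac=0.5):
    """Check that (y_true, y_scores) produces a typical decreasing PR curve."""
    y_true = np.asarray(y_true, dtype=float)
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    # Reverse into increasing-recall order (sklearn returns decreasing recall)
    p, r = precision[::-1], recall[::-1]
    diffs = np.diff(p)
    assert np.sum(diffs < 0) / len(diffs) > min_decreasing_frac, \
        "PR curve should mostly decrease as recall increases"
    # Starts high, ends lower: compare precision near recall=0 vs. recall=1
    assert p[r <= 0.2].mean() > p[r >= 0.8].mean(), \
        "curve should start high (low recall) and end lower (high recall)"
    return True
```

Running it on each (y_true, y_scores) pair before committing the new plotly_models.json catches regressions to the inverted-ROC shape early.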

Additional Context

Why This Matters

  1. Educational value: Tests serve as documentation and examples
  2. Visual validation: Developers can spot issues by looking at the curves
  3. Confidence: Realistic test data increases confidence in the library
  4. Correctness: Ensures the plotting function works correctly with real-world-like data

References

Related Issues

This fix should be implemented after or alongside the issue for updating plot_chance_level parameter, since:

  • The new baseline images will need to show the horizontal chance level line correctly
  • Both changes affect the same baseline image files
  • It makes sense to regenerate baseline images once with both fixes applied

Acceptance Criteria

  • Test data generates PR curves that decrease from top-left to bottom-right
  • Curves show realistic "staircase" pattern
  • High-scoring predictions are predominantly positive (creating high initial precision)
  • Lower-scoring predictions have more false positives mixed in (creating lower final precision)
  • Baseline images visually match typical PR curve examples from ML literature
  • All existing tests pass with new baseline images
  • Code includes comments explaining the data generation approach
  • The prevalence/chance level appears at the correct position

Notes

Important: The key insight is that PR curves decrease because as you lower the classification threshold:

  • You include more predictions as "positive" (recall increases)
  • But you also include more false positives (precision decreases)
  • This creates the characteristic top-left to bottom-right pattern

The test data should be generated such that:

  • Positive class instances tend to have higher scores
  • Negative class instances tend to have lower scores
  • There's some overlap (realistic classifier)
  • This naturally produces a decreasing PR curve

Metadata

Labels

bug (Something isn't working)
