
Plugin Suggestion: Test Quality Analysis for AI-Generated Tests #1

@SpiGAndromeda

Description

A test quality analysis plugin for Claude Code that addresses the quality issues in AI-generated tests. Current AI coding tools produce tests at scale, but frequently generate low-quality output: weak assertions that inflate coverage metrics without validating correctness, tests that verify implementation rather than requirements, and redundant test cases. This plugin identifies these issues before they degrade test suite quality.

Follows the same architecture as the comment-review plugin: systematic analysis, preservation of valuable tests, and flagging of issues requiring human judgment.

Problem Space

AI-generated tests exhibit predictable quality issues:

  • Weak Assertions - High coverage metrics achieved through superficial validation. Tests verify field existence rather than correctness, providing false confidence in test effectiveness.
  • Implementation Verification - Tests that validate current behavior rather than specified requirements. Bug fixes break these tests because they've codified the defect.
  • Redundant Test Cases - Multiple tests exercising identical code paths without additional documentation or fault detection value. Example: test_price_zero and test_price_edge_case_zero with identical execution traces.
  • Non-Descriptive Naming - Generic identifiers (test1(), test2(), testMethod()) that provide no information about test intent or failure diagnosis.
  • Incomplete Coverage - Overrepresentation of happy path and CRUD operations with insufficient boundary condition and error scenario testing.
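The weak-assertion pattern is easiest to see side by side. A minimal sketch in Python (the `apply_discount` function and its behavior are hypothetical, chosen only to illustrate the contrast):

```python
def apply_discount(price: float, percent: float) -> dict:
    """Hypothetical function under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {"original": price, "discounted": round(price * (1 - percent / 100), 2)}

# Weak: only checks field existence -- passes even if the math is wrong
def test_discount_weak():
    result = apply_discount(100.0, 25.0)
    assert "discounted" in result

# Strong: validates the actual computation and the error contract
def test_discount_strong():
    result = apply_discount(100.0, 25.0)
    assert result["discounted"] == 75.0

def test_discount_rejects_invalid_percent():
    try:
        apply_discount(100.0, 150.0)
        assert False, "expected ValueError"
    except ValueError:
        pass
```

Both weak and strong variants contribute identical line coverage, which is exactly why coverage metrics alone cannot distinguish them.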

Research data supports these observations: Meta's TestGen-LLM reports that 73% of its recommended test improvements were accepted by software engineers for production deployment, leaving roughly a quarter to be caught in human review. This plugin provides that kind of automated quality analysis for AI-generated test code.

Analysis Dimensions

1. Coverage Analysis

Evaluates test distribution across functional categories and identifies gaps in test methodology.

Test categorization:

  • Happy Path - Primary successful execution flow with valid inputs
  • Standard Variations - Common alternative execution paths
  • Configuration Options - Configurable behavior (feature flags, settings, optional parameters)
  • Edge Cases - Boundary conditions, null handling, atypical valid inputs
  • Error Cases - Exception handling, invalid inputs, failure scenarios
  • Integration Tests - Multi-component interactions

Analysis identifies missing or under-represented categories, untested code paths, and excessive testing of trivial code.

Coverage metrics evaluated:

  • Line/Statement Coverage - Execution of individual statements
  • Branch Coverage - Both true and false path evaluation
  • Path Coverage - All possible execution paths (often impractical)
  • Condition Coverage - Each boolean condition evaluated in both states
  • MC/DC - Modified Condition/Decision Coverage for safety-critical systems
  • Mutation Testing - Fault detection via artificial defect injection
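The difference between these metrics shows up even in trivial code. A sketch with a hypothetical `shipping_cost` function containing one compound condition:

```python
def shipping_cost(weight: float, express: bool) -> float:
    """Hypothetical function: flat rate plus a surcharge."""
    cost = 5.0
    if weight > 10 or express:
        cost += 4.0
    return cost

# 100% line coverage with a single test -- every statement executes:
assert shipping_cost(12.0, True) == 9.0

# Branch coverage additionally requires the False branch:
assert shipping_cost(2.0, False) == 5.0

# Condition coverage requires each sub-condition in both states
# (short-circuiting means the first test never even evaluated `express`):
assert shipping_cost(2.0, True) == 9.0    # weight False, express True
assert shipping_cost(12.0, False) == 9.0  # weight True, express skipped
```

One test satisfies line coverage; four are needed before condition coverage holds, which is the gap the weak-assertion pattern hides behind.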

Analysis references the testing pyramid (Mike Cohn): high volume of fast unit tests at the foundation, moderate integration testing, minimal E2E tests. Also applies the Agile Testing Quadrants framework (originally created by Brian Marick in 2003, adapted by Crispin & Gregory): technology-facing tests supporting development (unit, component), business-facing tests supporting development (acceptance), business-facing tests critiquing product (exploratory, usability), technology-facing tests critiquing product (performance, security).

Zhu et al. define coverage adequacy: "A test suite is 100% code coverage-adequate with respect to a coverage criterion if all instances of the criterion are exercised in a program by at least one test case." In practical terms: complete coverage requires tests that exercise every instance defined by the chosen metric.

2. Test Organization

Validates test structure and ordering within test files and classes. Uses the 6-category pattern: Happy Path, Standard Variations, Configuration Options, Edge Cases, Error Cases, Integration Tests. This ordering improves test suite navigability and maintainability.

Detection capabilities:

  • Ordering violations within test files
  • Inter-test dependencies
  • Organizational anti-patterns (execution order dependencies, shared mutable state)

How individual tests should be structured:

AAA Pattern (Arrange-Act-Assert):

test('creates user with default settings', () => {
    // Arrange - set up what you need
    const userService = new UserService();
    const userData = { name: 'John' };

    // Act - do the thing
    const result = userService.create(userData);

    // Assert - check it worked
    expect(result.id).toBeDefined();
    expect(result.name).toBe('John');
});

Given-When-Then Pattern (same idea, BDD style):

def test_user_creation_with_default_settings():
    # Given
    user_service = UserService()
    user_data = {"name": "John"}

    # When
    result = user_service.create(user_data)

    # Then
    assert result["id"] is not None
    assert result["name"] == "John"

Identifies dependency issues including execution order requirements, shared state between tests, and incorrect setup/teardown implementation.

3. Test Naming

Evaluates test method names for descriptiveness and documentation value. Follows the TestDox principle: test names should form readable documentation when listed.

Requirements for effective test names:

  • Describe behavior and expected outcomes
  • Avoid implementation details
  • Explicitly identify edge cases and error conditions
  • Reject generic identifiers (test1(), testMethod())

Common patterns that work:

// Given-When-Then
Given_UserExists_When_PasswordInvalid_Then_ThrowsAuthError
testGiven_EmptyCart_When_Checkout_Then_ReturnsValidationError

// When-Then (shorter)
WhenDivisorIsZero_ExpectMathError
testWhenUserNotFound_ExpectNotFoundException

// Should-When
Should_ReturnError_When_EmailInvalid
testShould_CreateUser_When_DataValid

// Plain behavior description
testCreatesUserWithDefaultSettings
testThrowsExceptionWhenEmailAlreadyExists
test_returns_empty_list_for_user_with_no_orders

Anti-patterns:

  • test1(), testMethod(), testSuccess() - No semantic information
  • testCallsDatabaseTwice() - Implementation-focused rather than behavior-focused
  • testEdgeCase(), testError() - Lacks specificity
  • testInvalid() - Missing context
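The anti-patterns above are regular enough to catch mechanically. A sketch of a naming check (the regex and the length threshold are illustrative heuristics, not a fixed rule set):

```python
import re

# Mirrors the anti-pattern list: test1(), testMethod(), testSuccess(),
# testEdgeCase(), testError(), testInvalid()
NON_DESCRIPTIVE = re.compile(
    r"^test_?(\d+|method|success|error|invalid|edge_?case)?$", re.IGNORECASE
)

def name_issue(test_name: str):
    """Return a diagnostic string for a non-descriptive test name, else None."""
    if NON_DESCRIPTIVE.match(test_name):
        return f"{test_name}: no semantic information about behavior"
    if len(test_name.replace("test", "").strip("_")) < 8:
        return f"{test_name}: likely too short to describe a behavior"
    return None

assert name_issue("test1") is not None
assert name_issue("test_returns_empty_list_for_user_with_no_orders") is None
```

Pattern-based checks like this only gate the obvious cases; whether a longer name actually describes behavior still needs the model-driven analysis.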

PHPUnit's TestDox shows why this matters. It turns test names into documentation:

UserService
 - Creates user with default settings
 - Creates user with custom role
 - Throws exception when email already exists
 - Throws exception when email format invalid
 - Returns null when user not found

Good, descriptive test names turn the test list into an overview that reads as a behaviour specification of the code, letting teams quickly understand coverage and intent without diving into implementation details.

4. Redundancy Detection

Identifies redundant tests while preserving those with documentation or regression protection value.

Redundancy classification:

Syntactic - Copy-paste duplicates, identical structure and assertions, tests that should be parameterized

Semantic - Different inputs hitting the exact same code path, tests that completely subsume other tests

Path Equivalence - Multiple tests exercising identical execution paths with no additional fault detection or documentation value

Subsumption criterion: Test A subsumes Test B when A covers all paths and conditions of B and B provides no additional value.
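The subsumption check can be sketched over coverage footprints, where each test maps to the set of (file, line) pairs it executes. Test names and line data below are illustrative:

```python
def subsumes(cov_a: set, cov_b: set) -> bool:
    """Test A subsumes Test B when B's coverage footprint is a subset of A's,
    i.e. B exercises nothing that A does not already exercise."""
    return cov_b <= cov_a

# Hypothetical coverage footprints collected from a tracing run
coverage = {
    "testUserCreation":             {("user.py", 10), ("user.py", 11)},
    "testUserCreationWithDefaults": {("user.py", 10), ("user.py", 11), ("user.py", 14)},
    "testRegressionBug4521":        {("user.py", 10), ("user.py", 30)},
}

a = coverage["testUserCreationWithDefaults"]
print(subsumes(a, coverage["testUserCreation"]))       # True: redundancy candidate
print(subsumes(a, coverage["testRegressionBug4521"]))  # False: unique line 30
```

Note that a subset footprint is only a removal candidate, not a removal verdict; the preservation criteria below still apply.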

Preservation criteria (tests to keep despite redundancy):

  • Explicit documentation of edge cases or business rules
  • Regression protection for specific historical defects
  • Intentional redundancy from multiple testing perspectives (black box/white box)
  • Clarity value in making complex scenarios explicit

Removal candidates:

  • Unintentional duplication without clear purpose
  • Zero marginal information value
  • Tests created without checking for existing coverage

Detection methodology:

Execution trace comparison to identify identical coverage footprints. Coverage contribution analysis to quantify unique coverage per test. Equivalence class partitioning to group tests by input domains and detect intra-class redundancy.

Research findings inform the approach: Noemmer and Haas (2020) show that test suite minimization can reduce suite size and execution time by over 70%, but at the cost of roughly 12.5% of fault detection capability on average. That trade-off motivates the conservative flagging strategy here.

"A Tester-Assisted Methodology for Test Redundancy Detection" (Koochakzadeh and Garousi, 2010) indicates coverage-based detection alone produces excessive false positives. Analysis must include fault detection effectiveness evaluation.

Testing Fundamentals

FIRST Principles (Robert C. Martin)

Validation against FIRST principles:

F - Fast - Millisecond-level execution time. Fast feedback loops encourage frequent test execution. Slow tests reduce execution frequency.

I - Independent - Order-independent execution. No shared mutable state. Each test establishes its own context.

R - Repeatable - Deterministic outcomes. No flaky tests with non-deterministic failures. No mutable external dependencies.

S - Self-Validating - Binary pass/fail outcome. No manual result inspection required.

T - Timely - Written concurrently with production code (TDD methodology). Alternative interpretation: Thorough coverage of both success and failure scenarios.

Violation patterns:

import requests
from datetime import datetime

# Violates FAST - network I/O
def test_api_integration():
    response = requests.get("https://api.example.com/users")  # network latency
    assert response.status_code == 200

# Violates INDEPENDENT - shared mutable state
class TestUserService:
    user_id = None

    def test_create_user(self):
        self.user_id = service.create({"name": "John"})  # state mutation

    def test_get_user(self):
        user = service.get(self.user_id)  # execution order dependency

# Violates REPEATABLE - time dependency
def test_generate_report():
    result = generate_report(date=datetime.now())  # non-deterministic input
    assert "2024" in result.title  # year-specific assertion
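The REPEATABLE violation is conventionally fixed by injecting the clock instead of reading it inside the test. A minimal sketch, assuming a hypothetical, simplified `generate_report` that returns the report title as a string:

```python
from datetime import datetime

def generate_report(date: datetime) -> str:
    """Hypothetical report generator; the date is an explicit input."""
    return f"Report {date.year}"

# Repeatable: the time dependency is injected, so the outcome is deterministic
def test_generate_report_fixed_clock():
    fixed = datetime(2024, 6, 1)
    assert "2024" in generate_report(fixed)
```

The same injection pattern applies to randomness and environment reads: any non-deterministic input becomes a parameter the test controls.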

Test Smells (Gerard Meszaros - xUnit Test Patterns)

Assertion Roulette - Multiple assertions without diagnostic messages. Failure provides insufficient information about which assertion failed and why.

// Anti-pattern
public function testUserCreation(): void
{
    $user = $this->userService->create($data);
    $this->assertNotNull($user);
    $this->assertEquals('John', $user->getName());
    $this->assertEquals('john@example.com', $user->getEmail());
    $this->assertTrue($user->isActive());
    // Failure lacks diagnostic context
}

// Improved - diagnostic messages
public function testUserCreation(): void
{
    $user = $this->userService->create($data);
    $this->assertNotNull($user, 'User should be created');
    $this->assertEquals('John', $user->getName(), 'Name should match input');
    $this->assertEquals('john@example.com', $user->getEmail(), 'Email should match');
    $this->assertTrue($user->isActive(), 'New users should be active');
}

Mystery Guest - External resource dependencies (files, databases, APIs) that obscure test intent and require external context to understand.

// Anti-pattern - external dependency obscures test intent
test('loads user configuration', () => {
    const config = loadConfig('./fixtures/user-config.json');  // external dependency
    expect(config.theme).toBe('dark');  // unclear test scope
});

// Improved - explicit test data
test('loads user configuration', () => {
    const config = { theme: 'dark', language: 'en', notifications: true };
    const result = processConfig(config);
    expect(result.theme).toBe('dark');
});

Additional test smells:

  • Fragile Tests - Implementation coupling causes failures despite unchanged behavior
  • Obscure Test - Unclear verification intent
  • Slow Tests - Excessive execution time (FIRST violation)
  • Test Code Duplication - Repeated setup/teardown code

Language Support

Initial implementation targets PHP/PHPUnit, JavaScript/TypeScript (Jest/Mocha/Vitest), and Python (pytest/unittest).

PHP/PHPUnit - TestCase classes, #[Test] attributes, DataProvider methods, setUp/tearDown lifecycle methods

JavaScript/TypeScript - test()/it() function calls, describe() block structure, lifecycle hooks (beforeEach/afterEach/beforeAll/afterAll), test.each() data-driven patterns

Python - Module-level test_* functions, Test* class patterns, pytest fixtures and parametrize decorators, unittest.TestCase inheritance

Implementation Architecture

Follows comment-review plugin design patterns:

Command interface:

  • /test-review [scope] - Interactive analysis with improvement suggestions
  • /test-check [scope] - Read-only analysis without modifications

Scope support: individual files, directory trees, git working changes, commit references, commit ranges.

Progressive disclosure architecture:

  • On-demand reference file loading for minimal context consumption
  • Two-stage uncertainty evaluation: lightweight heuristics followed by detailed pattern analysis when needed
  • Adaptive output verbosity
  • Confidence classification (High/Medium/Low) for ambiguous cases
  • Configuration via .testreviewrc.md
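The two-stage evaluation can be sketched as a cheap score that is decisive at the extremes, with detailed analysis run only for ambiguous cases. Thresholds here are illustrative, not part of the plugin specification:

```python
def classify_confidence(score: float) -> str:
    """Map a heuristic score in [0, 1] to the High/Medium/Low labels."""
    if score >= 0.8:
        return "High"
    if score >= 0.5:
        return "Medium"
    return "Low"

def evaluate(test_name: str, quick_score: float, detailed_check=None) -> str:
    # Stage 1: the lightweight heuristic is decisive at the extremes
    if quick_score >= 0.9 or quick_score <= 0.2:
        return classify_confidence(quick_score)
    # Stage 2: run the expensive analysis only for ambiguous cases
    refined = detailed_check(test_name) if detailed_check else quick_score
    return classify_confidence(refined)

print(evaluate("test1", 0.95))                     # High, stage 1 only
print(evaluate("testMaybe", 0.6, lambda n: 0.55))  # Medium, via stage 2
```

Keeping stage 2 optional is what bounds context consumption: most findings never load the detailed reference files.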

Reference files (on-demand loading):

  • references/coverage-patterns.md - Test category detection heuristics
  • references/naming-conventions.md - Language-specific naming conventions
  • references/redundancy-algorithms.md - Subsumption and equivalence class analysis
  • references/first-violations.md - FIRST principle violation detection
  • references/test-smells-catalog.md - Comprehensive test smell patterns
  • references/uncertainty-evaluation.md - Confidence scoring methodology

Technical constraints:

  • Static analysis limitations: code path determination is heuristic-based without execution
  • Naming pattern recognition is language-dependent
  • Redundancy detection produces false positives requiring human review
  • Edge case identification uses boundary value and null check heuristics with imperfect accuracy
  • Coverage tool integration is optional but beneficial

Output format example:

## Test Quality Analysis - tests/UserServiceTest.php

**Issues identified:**
- Line 45: test1() - Non-descriptive identifier, lacks behavioral context
- Line 67: FIRST violation (Independent) - Shared mutable state via user_id
- Line 120: Redundancy candidate with testUserCreation() - Identical execution path

**Preserved tests:**
- Line 102: testRegressionBug4521() - Regression protection value

**Human review required (High Priority):**
1. UserServiceTest.php:120-135 - testUserCreationWithDefaults()
   - Subsumes testUserCreation()
   - Potential documentation value for developer onboarding
   - Verification needed: distinct purposes for both tests

**Analysis summary:**
- 12 tests analyzed
- 2 coverage gaps identified (payment failure error cases)
- 3 FIRST violations (2 Independence, 1 Fast)
- 4 test smells detected (2 Assertion Roulette, 1 Mystery Guest, 1 Obscure)
- 5 naming quality issues
- 3 redundancy candidates (1 removal recommended)

**Coverage distribution:**
- Happy Path: 5 tests (adequate)
- Standard Variations: 3 tests (adequate)
- Configuration Options: 2 tests (expansion recommended)
- Edge Cases: 4 tests (adequate)
- Error Cases: 1 test (insufficient coverage)
- Integration: 2 tests (adequate)

Open Questions

  • Test generation scope: suggest specific test additions vs. gap identification only
  • Redundancy handling: conservative flagging mode vs. aggressive removal suggestions
  • Coverage tool integration: direct tool integration vs. log parsing
  • Test category customization: support for user-defined categories beyond standard 6
  • Confidence threshold configuration: auto-flagging criteria vs. mandatory human review
  • Interactive workflow: diff preview vs. per-file confirmation

References

Books:

  • Gerard Meszaros - xUnit Test Patterns: Refactoring Test Code (2007)
  • Lisa Crispin, Janet Gregory - Agile Testing (2009)
  • Mike Cohn - Succeeding with Agile (2009) - testing pyramid
  • Robert C. Martin - Clean Code (2008) - FIRST principles

Research:

  • Zhu, Hall, May - "Software Unit Test Coverage and Adequacy", ACM Computing Surveys (1997)
  • Koochakzadeh, Garousi - "A Tester-Assisted Methodology for Test Redundancy Detection" (2010)
  • Noemmer, Haas - evaluation of test suite minimization trade-offs (2020)
  • Alshahwan et al. - "Automated Unit Test Improvement using Large Language Models at Meta" (2024)

Industry practices:

  • PHPUnit TestDox documentation
