A test quality analysis plugin for Claude Code that addresses the quality issues in AI-generated tests. Current AI coding tools produce tests at scale, but frequently generate low-quality output: weak assertions that inflate coverage metrics without validating correctness, tests that verify implementation rather than requirements, and redundant test cases. This plugin identifies these issues before they degrade test suite quality.
Follows the same architecture as the comment-review plugin: systematic analysis, preservation of valuable tests, and flagging of issues requiring human judgment.
Problem Space
AI-generated tests exhibit predictable quality issues:
- Weak Assertions - High coverage metrics achieved through superficial validation. Tests verify field existence rather than correctness, providing false confidence in test effectiveness.
- Implementation Verification - Tests that validate current behavior rather than specified requirements. Bug fixes break these tests because they've codified the defect.
- Redundant Test Cases - Multiple tests exercising identical code paths without additional documentation or fault detection value. Example:
test_price_zero and test_price_edge_case_zero with identical execution traces.
- Non-Descriptive Naming - Generic identifiers (test1(), test2(), testMethod()) that provide no information about test intent or failure diagnosis.
- Incomplete Coverage - Overrepresentation of happy path and CRUD operations with insufficient boundary condition and error scenario testing.
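The weak-assertion problem above can be made concrete. The sketch below uses a hypothetical `create_order` function (not part of the plugin spec): the first test achieves line coverage while validating nothing about correctness; the second asserts the actual computed values.

```python
# Hypothetical order-creation function used to illustrate assertion strength.
def create_order(price, quantity):
    return {"id": 1, "total": price * quantity, "status": "pending"}

# Weak: only checks that fields exist -- passes even if total is miscomputed.
def test_create_order_weak():
    order = create_order(10.0, 3)
    assert "total" in order
    assert order["id"] is not None

# Strong: validates computed values against the requirement.
def test_create_order_strong():
    order = create_order(10.0, 3)
    assert order["total"] == 30.0
    assert order["status"] == "pending"
```

Both tests cover the same lines; only the second would catch a bug in the total calculation.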
Research data underscores the need for review: even Meta's TestGen-LLM, which filters candidate tests through automated build, pass, and coverage checks, saw software engineers accept 73% of recommended tests for production deployment — the remainder were rejected in human review. This plugin provides automated quality analysis for AI-generated test code.
Analysis Dimensions
1. Coverage Analysis
Evaluates test distribution across functional categories and identifies gaps in test methodology.
Test categorization:
- Happy Path - Primary successful execution flow with valid inputs
- Standard Variations - Common alternative execution paths
- Configuration Options - Configurable behavior (feature flags, settings, optional parameters)
- Edge Cases - Boundary conditions, null handling, atypical valid inputs
- Error Cases - Exception handling, invalid inputs, failure scenarios
- Integration Tests - Multi-component interactions
Analysis identifies missing or under-represented categories, untested code paths, and excessive testing of trivial code.
Coverage metrics evaluated:
- Line/Statement Coverage - Execution of individual statements
- Branch Coverage - Both true and false path evaluation
- Path Coverage - All possible execution paths (often impractical)
- Condition Coverage - Each boolean condition evaluated in both states
- MC/DC - Modified Condition/Decision Coverage for safety-critical systems
- Mutation Testing - Fault detection via artificial defect injection
Analysis references the testing pyramid (Mike Cohn): high volume of fast unit tests at the foundation, moderate integration testing, minimal E2E tests. Also applies the Agile Testing Quadrants framework (originally created by Brian Marick in 2003, adapted by Crispin & Gregory): technology-facing tests supporting development (unit, component), business-facing tests supporting development (acceptance), business-facing tests critiquing product (exploratory, usability), technology-facing tests critiquing product (performance, security).
Zhu et al. define coverage adequacy: "A test suite is 100% code coverage-adequate with respect to a coverage criterion if all instances of the criterion are exercised in a program by at least one test case." In practical terms: complete coverage requires tests that exercise every instance defined by the chosen metric.
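The distinction between branch and condition coverage can be illustrated with a hypothetical discount function containing one decision built from two conditions. The function and thresholds below are illustrative only:

```python
# One decision ("if is_member and total > 100") built from two conditions.
def discount(is_member, total):
    if is_member and total > 100:
        return total * 0.9
    return total

# Branch coverage: both decision outcomes must execute.
def test_discount_applied():
    assert discount(True, 200) == 180.0   # decision evaluates True

def test_discount_not_applied():
    assert discount(False, 200) == 200    # decision evaluates False

# Condition coverage additionally requires each condition in both states;
# this test toggles "total > 100" to False while is_member stays True.
def test_member_below_threshold():
    assert discount(True, 50) == 50
```

The first two tests are branch-adequate but never evaluate `total > 100` as False with `is_member` True; the third closes that condition-coverage gap.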
2. Test Organization
Validates test structure and ordering within test files and classes. Uses the 6-category pattern: Happy Path, Standard Variations, Configuration Options, Edge Cases, Error Cases, Integration Tests. This ordering improves test suite navigability and maintainability.
Detection capabilities:
- Ordering violations within test files
- Inter-test dependencies
- Organizational anti-patterns (execution order dependencies, shared mutable state)
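A test file following the category ordering might look like the sketch below. The `Cart` class and test names are illustrative; only four of the six categories are shown for brevity:

```python
# Minimal illustrative class under test.
class Cart:
    def __init__(self):
        self.items = []

    def add(self, name, qty=1):
        if qty <= 0:
            raise ValueError("quantity must be positive")
        self.items.append((name, qty))

    def count(self):
        return sum(qty for _, qty in self.items)

# --- Happy Path ---
def test_adds_single_item():
    cart = Cart()
    cart.add("apple")
    assert cart.count() == 1

# --- Standard Variations ---
def test_adds_multiple_quantities():
    cart = Cart()
    cart.add("apple", qty=3)
    assert cart.count() == 3

# --- Edge Cases ---
def test_empty_cart_has_zero_count():
    assert Cart().count() == 0

# --- Error Cases ---
def test_rejects_non_positive_quantity():
    try:
        Cart().add("apple", qty=0)
        assert False, "expected ValueError"
    except ValueError:
        pass
```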
How individual tests should be structured:
AAA Pattern (Arrange-Act-Assert):
```javascript
test("creates user with default settings", () => {
  // Arrange - set up what you need
  const userService = new UserService();
  const userData = { name: "John" };

  // Act - do the thing
  const result = userService.create(userData);

  // Assert - check it worked
  expect(result.id).toBeDefined();
  expect(result.name).toBe("John");
});
```
Given-When-Then Pattern (same idea, BDD style):
```python
def test_user_creation_with_default_settings():
    # Given
    user_service = UserService()
    user_data = {"name": "John"}

    # When
    result = user_service.create(user_data)

    # Then
    assert result["id"] is not None
    assert result["name"] == "John"
```
Identifies dependency issues including execution order requirements, shared state between tests, and incorrect setup/teardown implementation.
3. Test Naming
Evaluates test method names for descriptiveness and documentation value. Follows the TestDox principle: test names should form readable documentation when listed.
Requirements for effective test names:
- Describe behavior and expected outcomes
- Avoid implementation details
- Explicitly identify edge cases and error conditions
- Reject generic identifiers (test1(), testMethod())
Common patterns that work:
```
// Given-When-Then
Given_UserExists_When_PasswordInvalid_Then_ThrowsAuthError
testGiven_EmptyCart_When_Checkout_Then_ReturnsValidationError

// When-Then (shorter)
WhenDivisorIsZero_ExpectMathError
testWhenUserNotFound_ExpectNotFoundException

// Should-When
Should_ReturnError_When_EmailInvalid
testShould_CreateUser_When_DataValid

// Plain behavior description
testCreatesUserWithDefaultSettings
testThrowsExceptionWhenEmailAlreadyExists
test_returns_empty_list_for_user_with_no_orders
```
Anti-patterns:
- test1(), testMethod(), testSuccess() - No semantic information
- testCallsDatabaseTwice() - Implementation-focused rather than behavior-focused
- testEdgeCase(), testError() - Lacks specificity
- testInvalid() - Missing context
PHPUnit's TestDox shows why this matters. It turns test names into documentation:
```
UserService
 - Creates user with default settings
 - Creates user with custom role
 - Throws exception when email already exists
 - Throws exception when email format invalid
 - Returns null when user not found
```
Good, descriptive test names combine into an overview of the whole suite that reads as a behavior specification of the code, enabling teams to quickly understand test coverage and intent without reading implementation details.
4. Redundancy Detection
Identifies redundant tests while preserving those with documentation or regression protection value.
Redundancy classification:
Syntactic - Copy-paste duplicates, identical structure and assertions, tests that should be parameterized
Semantic - Different inputs hitting the exact same code path, tests that completely subsume other tests
Path Equivalence - Multiple tests exercising identical execution paths with no additional fault detection or documentation value
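Syntactic redundancy is typically resolved by parameterization rather than deletion. A possible consolidation, using a hypothetical `is_valid_price` function, collapses near-identical tests (such as the `test_price_zero` / `test_price_edge_case_zero` pair above) into one parameterized test:

```python
import pytest

# Hypothetical price validator; three copy-paste tests become one.
def is_valid_price(price):
    return price >= 0

@pytest.mark.parametrize("price,expected", [
    (0, True),        # boundary: zero
    (19.99, True),    # typical valid price
    (-1, False),      # invalid negative price
])
def test_is_valid_price(price, expected):
    assert is_valid_price(price) == expected
```

Each parameter row remains an individually reported test case, so the boundary value stays visible in test output while the duplicated structure disappears.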
Subsumption criterion: Test A subsumes Test B when A covers all paths and conditions of B and B contributes no additional value.
Preservation criteria (tests to keep despite redundancy):
- Explicit documentation of edge cases or business rules
- Regression protection for specific historical defects
- Intentional redundancy from multiple testing perspectives (black box/white box)
- Clarity value in making complex scenarios explicit
Removal candidates:
- Unintentional duplication without clear purpose
- Zero marginal information value
- Tests created without checking for existing coverage
Detection methodology:
Execution trace comparison to identify identical coverage footprints. Coverage contribution analysis to quantify unique coverage per test. Equivalence class partitioning to group tests by input domains and detect intra-class redundancy.
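A minimal sketch of the footprint-comparison step, assuming each test has been mapped to the set of (file, line) pairs it executed (the footprints below are illustrative data, not real measurements):

```python
# Each test maps to the set of source locations it executed; a test whose
# footprint is a subset of another's adds no coverage and is a redundancy
# candidate. Mutual pairs indicate identical execution traces.
footprints = {
    "test_price_zero":           {("pricing.py", 10), ("pricing.py", 11)},
    "test_price_edge_case_zero": {("pricing.py", 10), ("pricing.py", 11)},
    "test_price_negative":       {("pricing.py", 10), ("pricing.py", 14)},
}

def redundancy_candidates(footprints):
    """Return (subsumed, subsuming) pairs based on coverage footprints."""
    names = sorted(footprints)
    return [
        (a, b)
        for a in names
        for b in names
        if a != b and footprints[a] <= footprints[b]
    ]
```

Per the research caveat below, a subset footprint alone is only a candidate signal; the flagged pair still needs fault-detection and documentation-value evaluation before removal.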
Research findings inform our approach: Noemmer and Haas (2020) demonstrate that test suite minimization can achieve over 70% reduction in suite size and execution time, though with a substantial loss in fault detection capability of around 12.5% on average. This informs our conservative flagging strategy.
"A Tester-Assisted Methodology for Test Redundancy Detection" (Koochakzadeh and Garousi, 2010) indicates coverage-based detection alone produces excessive false positives. Analysis must include fault detection effectiveness evaluation.
Testing Fundamentals
FIRST Principles (Robert C. Martin)
Validation against FIRST principles:
F - Fast - Millisecond-level execution time. Fast feedback loops encourage frequent test execution. Slow tests reduce execution frequency.
I - Independent - Order-independent execution. No shared mutable state. Each test establishes its own context.
R - Repeatable - Deterministic outcomes. No flaky tests with non-deterministic failures. No mutable external dependencies.
S - Self-Validating - Binary pass/fail outcome. No manual result inspection required.
T - Timely - Written concurrently with production code (TDD methodology). Alternative interpretation: Thorough coverage of both success and failure scenarios.
Violation patterns:
```python
import requests
from datetime import datetime

# Violates FAST - network I/O
def test_api_integration():
    response = requests.get("https://api.example.com/users")  # network latency
    assert response.status_code == 200

# Violates INDEPENDENT - shared mutable state
class TestUserService:
    user_id = None

    def test_create_user(self):
        self.user_id = service.create({"name": "John"})  # state mutation

    def test_get_user(self):
        user = service.get(self.user_id)  # execution order dependency

# Violates REPEATABLE - time dependency
def test_generate_report():
    result = generate_report(date=datetime.now())  # non-deterministic input
    assert "2024" in result.title  # year-specific assertion
```
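The Repeatable violation is typically fixed by injecting the date instead of reading the system clock. A minimal sketch, using a hypothetical stand-in for `generate_report`:

```python
from datetime import datetime

# Hypothetical stand-in: the report date is a parameter, not datetime.now().
def generate_report(date):
    class Report:
        pass
    report = Report()
    report.title = f"Annual report {date.year}"
    return report

def test_generate_report_is_deterministic():
    result = generate_report(date=datetime(2024, 6, 1))  # fixed, injected date
    assert "2024" in result.title  # now deterministic on any machine, any year
```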
Test Smells (Gerard Meszaros - xUnit Test Patterns)
Assertion Roulette - Multiple assertions without diagnostic messages. Failure provides insufficient information about which assertion failed and why.
```php
// Anti-pattern
public function testUserCreation(): void
{
    $user = $this->userService->create($data);
    $this->assertNotNull($user);
    $this->assertEquals('John', $user->getName());
    $this->assertEquals('john@example.com', $user->getEmail());
    $this->assertTrue($user->isActive());
    // Failure lacks diagnostic context
}

// Improved - diagnostic messages
public function testUserCreation(): void
{
    $user = $this->userService->create($data);
    $this->assertNotNull($user, 'User should be created');
    $this->assertEquals('John', $user->getName(), 'Name should match input');
    $this->assertEquals('john@example.com', $user->getEmail(), 'Email should match');
    $this->assertTrue($user->isActive(), 'New users should be active');
}
```
Mystery Guest - External resource dependencies (files, databases, APIs) that obscure test intent and require external context to understand.
```javascript
// Anti-pattern - external dependency obscures test intent
test('loads user configuration', () => {
  const config = loadConfig('./fixtures/user-config.json'); // external dependency
  expect(config.theme).toBe('dark'); // unclear test scope
});

// Improved - explicit test data
test('loads user configuration', () => {
  const config = { theme: 'dark', language: 'en', notifications: true };
  const result = processConfig(config);
  expect(result.theme).toBe('dark');
});
```
Additional test smells:
- Fragile Tests - Implementation coupling causes failures despite unchanged behavior
- Obscure Test - Unclear verification intent
- Slow Tests - Excessive execution time (FIRST violation)
- Test Code Duplication - Repeated setup/teardown code
Language Support
Initial implementation targets PHP/PHPUnit, JavaScript/TypeScript (Jest/Mocha/Vitest), and Python (pytest/unittest).
PHP/PHPUnit - TestCase classes, #[Test] attributes, DataProvider methods, setUp/tearDown lifecycle methods
JavaScript/TypeScript - test()/it() function calls, describe() block structure, lifecycle hooks (beforeEach/afterEach/beforeAll/afterAll), test.each() data-driven patterns
Python - Module-level test_* functions, Test* class patterns, pytest fixtures and parametrize decorators, unittest.TestCase inheritance
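For the Python target, static detection of the patterns above could start from the `ast` module. This is a sketch of one possible approach, not the plugin's actual implementation; it finds `test_*` functions whether module-level or inside `Test*` classes:

```python
import ast

# Illustrative source: two tests and one helper.
source = '''
def test_creates_user():
    pass

class TestUserService:
    def test_returns_none_when_missing(self):
        pass

def helper():
    pass
'''

def find_tests(code):
    """Return names of functions matching the pytest test_* convention."""
    tree = ast.parse(code)
    return [
        node.name
        for node in ast.walk(tree)  # walk visits nested class bodies too
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
    ]
```

A real detector would additionally recognize `unittest.TestCase` inheritance and `@pytest.mark.parametrize` decorators, which are visible on the same AST nodes.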
Implementation Architecture
Follows comment-review plugin design patterns:
Command interface:
/test-review [scope] - Interactive analysis with improvement suggestions
/test-check [scope] - Read-only analysis without modifications
Scope support: individual files, directory trees, git working changes, commit references, commit ranges.
Progressive disclosure architecture - On-demand reference file loading for minimal context consumption. Two-stage uncertainty evaluation: lightweight heuristics followed by detailed pattern analysis when needed. Adaptive output verbosity. Confidence classification (High/Medium/Low) for ambiguous cases. Configuration via .testreviewrc.md.
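The two-stage uncertainty evaluation might be sketched as follows. All signal names and thresholds here are hypothetical illustrations of the High/Medium/Low scheme, not the plugin's actual scoring:

```python
# Stage 1: cheap heuristic signals produce a score; only ambiguous scores
# would trigger stage-2 detailed pattern analysis.
def classify_confidence(signals):
    score = 0
    if signals.get("identical_trace"):
        score += 2          # strong redundancy signal
    if signals.get("generic_name"):
        score += 1          # weak signal (e.g. test1, testMethod)
    if signals.get("has_regression_marker"):
        score -= 2          # preservation signal lowers removal confidence
    if score >= 2:
        return "High"
    if score == 1:
        return "Medium"     # ambiguous: escalate to detailed analysis
    return "Low"
```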
Reference files (on-demand loading):
references/coverage-patterns.md - Test category detection heuristics
references/naming-conventions.md - Language-specific naming conventions
references/redundancy-algorithms.md - Subsumption and equivalence class analysis
references/first-violations.md - FIRST principle violation detection
references/test-smells-catalog.md - Comprehensive test smell patterns
references/uncertainty-evaluation.md - Confidence scoring methodology
Technical constraints:
Static analysis limitations: code path determination is heuristic-based without execution. Language-dependent naming pattern recognition. Redundancy detection produces false positives requiring human review. Edge case identification uses boundary value and null check heuristics with imperfect accuracy. Coverage tool integration optional but beneficial.
Output format example:
```markdown
## Test Quality Analysis - tests/UserServiceTest.php

**Issues identified:**
- Line 45: test1() - Non-descriptive identifier, lacks behavioral context
- Line 67: FIRST violation (Independent) - Shared mutable state via user_id
- Line 120: Redundancy candidate with testUserCreation() - Identical execution path

**Preserved tests:**
- Line 102: testRegressionBug4521() - Regression protection value

**Human review required (High Priority):**
1. UserServiceTest.php:120-135 - testUserCreationWithDefaults()
   - Subsumes testUserCreation()
   - Potential documentation value for developer onboarding
   - Verification needed: distinct purposes for both tests

**Analysis summary:**
- 12 tests analyzed
- 2 coverage gaps identified (payment failure error cases)
- 3 FIRST violations (2 Independence, 1 Fast)
- 4 test smells detected (2 Assertion Roulette, 1 Mystery Guest, 1 Obscure)
- 5 naming quality issues
- 3 redundancy candidates (1 removal recommended)

**Coverage distribution:**
- Happy Path: 5 tests (adequate)
- Standard Variations: 3 tests (adequate)
- Configuration Options: 2 tests (expansion recommended)
- Edge Cases: 4 tests (adequate)
- Error Cases: 1 test (insufficient coverage)
- Integration: 2 tests (adequate)
```
Open Questions
- Test generation scope: suggest specific test additions vs. gap identification only
- Redundancy handling: conservative flagging mode vs. aggressive removal suggestions
- Coverage tool integration: direct tool integration vs. log parsing
- Test category customization: support for user-defined categories beyond standard 6
- Confidence threshold configuration: auto-flagging criteria vs. mandatory human review
- Interactive workflow: diff preview vs. per-file confirmation
References
Books:
- Gerard Meszaros - xUnit Test Patterns: Refactoring Test Code (2007)
- Kent Beck - Test Driven Development: By Example (2002)
- Lisa Crispin & Janet Gregory - Agile Testing: A Practical Guide for Testers and Agile Teams (2009)
- Robert C. Martin - Clean Code: A Handbook of Agile Software Craftsmanship (2008)
- Source of FIRST Principles
Research:
- Hong Zhu, Patrick A. V. Hall, John H. R. May - "Software Unit Test Coverage and Adequacy" (ACM Computing Surveys, Vol. 29, No. 4, December 1997, pp. 366-427)
- Noemmer, R., and Haas, R. - "An Evaluation of Test Suite Minimization Techniques" (Software Quality: Quality Intelligence in Software and Systems Engineering, SWQD 2020, Lecture Notes in Business Information Processing, vol 371, Springer, 2020)
- Koochakzadeh, N. and Garousi, V. - "A Tester-Assisted Methodology for Test Redundancy Detection" (Advances in Software Engineering, Special Issue on Software Test Automation, January 2010)
- Meta AI - "Automated Unit Test Improvement using Large Language Models at Meta" (arXiv:2402.09171, February 2024)
Industry practices:
- Robert C. Martin - FIRST Principles
- Source: Clean Code (2008)
- Mike Cohn - Test Automation Pyramid
- Source: Succeeding with Agile (2009)
- Brian Marick - Agile Testing Matrix (2003)
- PHPUnit TestDox - Self-documenting test output
- Daniel Terhorst-North, Chris Matts - Given-When-Then (BDD, 2003)
- Bill Wake - AAA Pattern (Arrange-Act-Assert, 2001)