This document clarifies what each project requires to execute, helping evaluators set realistic expectations and avoid "this is too hard to reproduce" objections.

| Project | Mode | Environment | Headless CI? | Notes |
|---|---|---|---|---|
| Python | | | | |
| static_structural | Compile | Python 3.8+ | ✅ | Import/syntax errors |
| runtime_failure | Runtime | Python 3.8+ | ✅ | Needs `config.json` |
| test_failure | Test | Python 3.8+ + pytest | ✅ | `pytest test_calculator.py` |
| multi_file_bug | Runtime + Test | Python 3.8+ + Flask | ✅ | API server + integration tests |
| hypothesis_debugging | Runtime | Python 3.8+ | ✅ | Data pipeline analysis |
| JavaScript | | | | |
| static_structural | Compile | Node 18+ | ✅ | Module/export errors |
| runtime_failure | Runtime | Node 18+ | ✅ | Async/null issues |
| test_failure | Test | Node 18+ + Jest | ✅ | `npm test` |
| frontend_backend_mismatch | Runtime | Node 18+ | ✅ | Two processes (client + server) |
| config_failure | Runtime | Node 18+ | ✅ | Environment/webpack issues |
| TypeScript | | | | |
| type_errors | Compile | Node 18+ + tsc | ✅ | `npm run build` |
| async_failures | Runtime | Node 18+ + ts-node | ✅ | Promise/race conditions |
| Java | | | | |
| dependency_issue | Compile | JDK 17+ + Maven | ✅ | `mvn compile` |
| logic_error | Runtime | JDK 17+ + Maven | ✅ | `mvn exec:java` |
| test_failure | Test | JDK 17+ + Maven | ✅ | `mvn test` |
| Go | | | | |
| runtime_panic | Runtime | Go 1.21+ | ✅ | `go run main.go` |
| concurrency_bug | Runtime | Go 1.21+ | ✅ | `go run -race main.go` |
| C++ | | | | |
| header_missing | Compile | CMake + GCC/Clang | ✅ | `cmake .. && make` |
| linker_error | Link | CMake + GCC/Clang | ✅ | Undefined references |
| memory_issue | Runtime | CMake + GCC/Clang | ✅ | Safe memory bugs (no UB exploits) |
| Kotlin/Android | | | | |
| android_lifecycle | Compile + Runtime | Android Studio / Gradle | ⚠️ | Emulator or `./gradlew build` |
| Swift/iOS | | | | |
| build_error | Compile | Xcode / Swift 5.9+ | ⚠️ | macOS only, `swift build` |
| optionals_crash | Runtime | Xcode / Swift 5.9+ | ⚠️ | Force-unwrap failures |
| ui_thread | Runtime | Xcode + Simulator | ❌ | Requires iOS Simulator |
| Unity/C# | | | | |
| lifecycle_bug | Runtime | Unity 2022+ | ❌ | Editor required |
| serialization_error | Runtime | Unity 2022+ | ❌ | Inspector + Play mode |
| scene_mismatch | Runtime | Unity 2022+ | ❌ | Asset-code desync |

| Symbol | Meaning |
|---|---|
| ✅ | Fully headless: runs in CI/terminal with no GUI |
| ⚠️ | Compile-only headless; runtime needs device/emulator |
| ❌ | Requires IDE or platform-specific tooling |

Tier 1 projects are fully headless. Run them in any CI environment (GitHub Actions, Azure Pipelines, or a local terminal):
- All Python projects
- All JavaScript/TypeScript projects
- All Java projects
- All Go projects
- All C++ projects (with CMake)
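
Tier 2 projects compile headlessly, but runtime testing needs an emulator or simulator:
- Kotlin/Android: `./gradlew assembleDebug` on any OS; full test needs an emulator
- Swift/iOS: `swift build` on macOS; runtime tests need Simulator

Tier 3 projects require an IDE or platform-specific tooling:
- Unity: Must open in Unity Editor to test MonoBehaviour lifecycle, serialization, and scene issues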
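
Before starting a run, it can help to check which toolchains are actually present on the machine. The following is a minimal sketch and not part of the suite: the `TIER1_TOOLS` mapping and both helper functions are hypothetical, and the executable names are assumptions that may differ on your system (for example, `python` rather than `python3` on Windows).

```python
import shutil

# Assumed executable names for each Tier 1 group; adjust for your setup.
# A C++ compiler check is omitted here for brevity.
TIER1_TOOLS = {
    "Python": ["python3"],
    "JavaScript/TypeScript": ["node", "npm"],
    "Java": ["java", "mvn"],
    "Go": ["go"],
    "C++": ["cmake"],
}


def missing_tools(required, which=shutil.which):
    """Return the executables from `required` that are not found on PATH."""
    return [tool for tool in required if which(tool) is None]


def runnable_groups(tools_by_group, which=shutil.which):
    """Map each project group to True when all of its tools are available."""
    return {
        group: not missing_tools(tools, which)
        for group, tools in tools_by_group.items()
    }


if __name__ == "__main__":
    for group, ok in runnable_groups(TIER1_TOOLS).items():
        print(f"{group}: {'ready' if ok else 'missing tools'}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.
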

Each project folder contains (or should contain):
- `README.md` – Symptoms, expected behavior, verification steps
- Build/run commands documented in README

For CI automation, a future `scripts/` folder will provide:
- `run_headless.ps1` / `run_headless.sh` – Execute all Tier 1 projects
- `report_results.py` – Aggregate pass/fail/partial scores

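
Since `report_results.py` does not exist yet, the following is only a sketch of what the aggregation step might look like. The `(project, status)` input shape, the `aggregate` function, and the example project paths are all assumptions made for illustration.

```python
# Hypothetical status vocabulary, mirroring "pass/fail/partial".
VALID_STATUSES = ("pass", "fail", "partial")


def aggregate(results):
    """Count how many projects ended in each status.

    `results` is an iterable of (project, status) pairs.
    """
    counts = {status: 0 for status in VALID_STATUSES}
    for project, status in results:
        if status not in VALID_STATUSES:
            raise ValueError(f"{project}: unknown status {status!r}")
        counts[status] += 1
    return counts


if __name__ == "__main__":
    # Example data only; these paths are not taken from the suite.
    sample = [
        ("python/static_structural", "pass"),
        ("python/runtime_failure", "partial"),
        ("go/runtime_panic", "pass"),
        ("java/logic_error", "fail"),
    ]
    print(aggregate(sample))
```

Rejecting unknown statuses early keeps a typo in one result from silently shrinking the totals.
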
This suite is intentionally diverse. Not every evaluator will have every environment. The recommended approach:
- Start with Tier 1 – 100% headless, no excuses
- Report partial coverage – "Evaluated 18/24 projects (Tier 1 + 2)"
- Use static analysis for Tier 3 – Even without Unity Editor, an agent can identify obvious issues in C# scripts

The goal is not to gatekeep but to measure how far a system can reason across domains.
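
A partial-coverage line in the format quoted above can be generated from per-tier project counts. This sketch assumes the tier sizes implied by the matrix (20 fully headless projects, 4 Kotlin/Swift projects, 3 Unity projects); `TIER_SIZES` and `coverage_line` are hypothetical names.

```python
# Project counts per tier, as counted from the requirements matrix (assumed).
TIER_SIZES = {1: 20, 2: 4, 3: 3}


def coverage_line(evaluated, tiers):
    """Format a coverage summary for the tiers that were attempted."""
    total = sum(TIER_SIZES[tier] for tier in tiers)
    if not 0 <= evaluated <= total:
        raise ValueError(f"evaluated must be between 0 and {total}")
    label = " + ".join(str(tier) for tier in sorted(tiers))
    return f"Evaluated {evaluated}/{total} projects (Tier {label})"


if __name__ == "__main__":
    print(coverage_line(18, [1, 2]))
```
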