Agent skills for machine learning workflows, built for Claude Code and other agentic coding harnesses.
Install:

```bash
claude plugin marketplace add lawwu/agentic-ml-plugin
claude plugin install agentic-ml@agentic-ml
```

Restart Claude Code after installation. Skills activate automatically when relevant.

Update:

```bash
claude plugin marketplace update
claude plugin update agentic-ml@agentic-ml
```

Or run `/plugin` to open the plugin manager.

For agents supporting the skills.sh ecosystem:

```bash
npx skills add lawwu/agentic-ml-plugin
```

To run from a local clone:

```bash
git clone git@github.com:lawwu/agentic-ml-plugin.git ~/agentic-ml-plugin
claude --plugin-dir ~/agentic-ml-plugin/plugins/agentic-ml
```

Run the complete ML lifecycle:

```
/orchestrate-e2e on the medium dataset in demo/
```

Run the complete ML lifecycle, but use the mlscribe CLI to output artifacts:

```
/orchestrate-e2e on the medium dataset in demo/ but use the mlscribe cli to show me some artifacts. see https://github.com/lawwu/mlscribe
```

The plugin ships the following skills:

| Stage | Skill | Lifecycle stage | Description |
|---|---|---|---|
| 1 | review-target | Problem framing | Validate label/target definition, leakage risk, metric alignment, and split strategy before modeling |
| 2 | plan-experiment | Pre-training | Design a structured experiment plan with hypothesis, model candidates, HP search space, compute budget, and ordered execution |
| 3 | build-baseline | Pre-training | Build and evaluate non-ML baselines (majority class, mean predictor, simple rules) to establish the performance floor ML must beat |
| 4 | check-dataset-quality | Pre-training | Profile and validate CSV, Parquet, JSONL, HuggingFace datasets, database tables, or image directories |
| 5 | check-data-pipeline | Pre-training | Dry-run a preprocessing pipeline on a small sample to catch shape, dtype, padding, and label encoding issues |
| — | feature-engineer | Pre-training | Explore files or database tables and design leakage-safe feature sets tied to label and business outcome |
| 6 | train-model | Training | Launch and manage training with early stopping, HP config, and checkpoint management; delegates monitoring to babysit-training |
| 6b | babysit-training | Training | Continuously monitor a training run (local, remote SSH, or Vertex AI) until it completes or hits a critical issue; can also be invoked standalone when training is already running |
| 6c | check-failed-run | Training | Diagnose a failed or unstable training run, classify root causes, and produce a prioritized recovery plan |
| 7 | check-eval | Post-training | Evaluate a checkpoint via HF Trainer, lm-evaluation-harness, or a custom script with baseline comparison |
| 8 | explain-model | Post-eval | Generate feature importance, bias audit, and model card before promotion |
| 9 | demonstrate-value | Post-eval | Create a visual business value presentation using showboat |
| 10 | recommend-new-approaches | Post-eval | Recommend new research approaches, modeling ideas, and loss function modifications sorted by expected impact; optionally leverages autoresearch |
| — | orchestrate-e2e | Orchestration | Coordinate the full ML lifecycle with explicit stage gates and a final Go/No-Go decision |
| — | benchmark-e2e | Meta | Compare E2E workflow approaches (no-plugin/plugin/automl) across scenarios (hard-fraud, hard-attrition, xhard-churn) to measure agent reliability, pitfall detection, and cost |
Every skill writes a machine-readable JSON artifact alongside its text report. Artifacts share a common base schema with consistent fields (`schema_version`, `skill_name`, `run_id`, `decision`, `confidence`, `findings`, `next_commands`) plus skill-specific extensions.
Canonical vocabulary (decision values, severities, gate statuses) is defined in `vocabulary.md`. All skills use GO / NO-GO / CONDITIONAL for decisions and blocker / high / medium / low for severity.
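For illustration, a check-dataset-quality artifact might look like the sketch below. The top-level field names follow the base schema above; the values, the shape of the findings entries, and the specific finding are hypothetical:

```json
{
  "schema_version": "1.0",
  "skill_name": "check-dataset-quality",
  "run_id": "20260227_164917",
  "decision": "CONDITIONAL",
  "confidence": 0.8,
  "findings": [
    {
      "severity": "high",
      "summary": "chargeback column nearly duplicates the target (possible target echo)"
    }
  ],
  "next_commands": ["run check-data-pipeline on the same sample"]
}
```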
When run via `orchestrate-e2e`, all skill artifacts are collected in `--out-dir` and an interactive HTML report is generated automatically:

```bash
uv run plugins/agentic-ml/report-viewer/generate_report.py <out-dir>
# → <out-dir>/report.html
```

The report includes a gate timeline, per-skill collapsible cards with findings tables, and raw JSON tabs.
E2E Benchmark Report
====================

Sample output from the `benchmark-e2e` skill, comparing the three modes on the hard-fraud scenario:
- Matrix: no-plugin × plugin × automl — 1 scenario
- Selected scenario: hard-fraud (detection: auto — is_fraud column + transaction_timestamp)
- Runs per cell: 1
- Primary metric: auprc
- Harness: Claude Code (claude-sonnet-4-6)
Results:
| Mode | Scenario | Quality | Reliability | Efficiency | Ops Readiness | LOC Run | Tokens In | Tokens Out | Tokens Total | Total (/100) |
|---|---|---|---|---|---|---|---|---|---|---|
| plugin | hard-fraud | 60 | 80 | 45 | 85 | 380 | unknown | unknown | unknown | 67 |
| no-plugin | hard-fraud | 15 | 75 | 65 | 15 | 120 | unknown | unknown | unknown | 40 |
| automl | hard-fraud | 12 | 55 | 75 | 10 | 45 | unknown | unknown | unknown | 35 |
Stage coverage:
| Stage | no-plugin | plugin | automl |
|------------------------|------------|--------------|--------------|
| 1. Target readiness | NO-GO | NO-GO | NO-GO |
| 2. Experiment plan | GO | GO | GO |
| 3. Non-ML baseline | GO | GO | GO |
| 4. Dataset quality | CONDITIONAL| NO-GO | CONDITIONAL |
| 5. Data pipeline | CONDITIONAL| CONDITIONAL | CONDITIONAL |
| 6. Training stability | GO | GO | GO |
| 7. Evaluation quality | NO-GO | CONDITIONAL | NO-GO |
| 8. Interpretability | NO-GO | NO-GO | NO-GO |
| 9. Promotion decision | GO | NO-GO | GO |
Pitfall detection (hard-fraud):
| Pitfall | no-plugin | plugin | automl |
|-----------------------------------|:---------:|:------:|:------:|
| Target echo (chargeback) | ✗ | ✓ | ✗ |
| Near-perfect leak (dfp_age) | ✗ | ✓ | ✗ |
| Wrong metric (AUC vs AUPRC) | ✗ | ✓ | ✗ |
| Selection bias (5.96% reviewed) | ✗ | ✓ | ✗ |
| Geographic proxy bias (ip_country)| ✗ | ✓ | ✗ |
| Split boundary tie | ✗ | ✓ | ✗ |
Skill usage audit:
- no-plugin: 0 skills (PASS — expected 0)
- plugin: review-target, plan-experiment, build-baseline, check-dataset-quality, check-data-pipeline, check-eval, explain-model (babysit-training invoked inline — skill unavailable via tool)
- automl: 0 skills (PASS — expected 0)
Recommendation:
- Default mode: plugin (why: the only mode that catches leakage, enforces AUPRC, audits bias, produces artifacts, and blocks promotion correctly; 67/100)
- Fallback mode: no-plugin (why: more reliable than automl — no framework crashes, better stage coverage; requires an attentive analyst to catch leakage and metric issues)
Notable finding: AutoGluon GBM crashed (exit 139) on ARM64 macOS + miniconda3 Python 3.11.5.
RF and ExtraTrees ran. Fix: use uv-managed Python instead of miniconda3.
Artifacts: `reports/e2e-benchmark/20260227_164917/`

```
├── README.md
├── benchmark-report.json
├── plugin/stage1/review-target.json
├── plugin/stage2/plan-experiment.json
├── plugin/stage3/build-baseline.json
├── plugin/stage4/check-dataset-quality.json
├── plugin/stage5/check-data-pipeline.json
├── plugin/stage6/babysit-training.json
├── plugin/stage7/check-eval.json
├── plugin/stage8/explain-model.json + MODEL_CARD.md + feature_importance.json
├── plugin/stage9-promotion.md
└── automl/automl-stage-summary.json + ag_model/
```
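To post-process the collected artifacts yourself instead of using the report viewer, a minimal sketch is below. It assumes only the base-schema fields documented above and the stage layout from this sample run; the directory path is the one shown in the tree:

```python
import json
from pathlib import Path

# Walk the per-stage JSON artifacts collected by orchestrate-e2e and print a
# compact gate timeline. Relies only on the base-schema fields (skill_name,
# decision, findings); the path below matches the sample run above.
out_dir = Path("reports/e2e-benchmark/20260227_164917/plugin")
for artifact in sorted(out_dir.glob("stage*/*.json")):
    data = json.loads(artifact.read_text())
    blockers = [f for f in data.get("findings", [])
                if f.get("severity") == "blocker"]
    print(f"{data.get('skill_name', artifact.stem):<26} "
          f"{data.get('decision', '?'):<12} {len(blockers)} blocker finding(s)")
```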
See AGENTS.md for full instructions, frontmatter reference, naming conventions, and skill design guidelines.

Quick path: create `plugins/agentic-ml/skills/<skill-name>/SKILL.md`, add frontmatter, write the instructions, then update the table above.
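For illustration only, a minimal SKILL.md might look like the sketch below. The skill name and the exact frontmatter fields here are assumptions; AGENTS.md is the authoritative frontmatter reference:

```markdown
---
name: check-calibration
description: Assess probability calibration of a trained classifier before promotion. (Hypothetical example skill.)
---

# check-calibration

When asked to verify calibration of a trained classifier:

1. Load held-out predictions and labels.
2. Compute a reliability curve and expected calibration error.
3. Write a JSON artifact with the base-schema fields and a GO / NO-GO / CONDITIONAL decision.
```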
Test your changes locally:

```bash
claude --plugin-dir ./plugins/agentic-ml
```