llm-evals

Here are 43 public repositories matching this topic...

ALucek / evaluizer

Visualize LLM outputs against datasets, manually annotate results, and run automated evaluations to algorithmically optimize prompts.

llm-optimizer llm-evals prompt-annotation prompt-optimizer

Updated Nov 22, 2025
TypeScript

The-Swarm-Corporation / StatisticalModelEvaluator

An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"

ai ml multiagent agents llms evals llm-evals agent-evals multi-agent-eval

Updated Oct 6, 2025
Python

pyladiesams / eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

workshop llm llms llmops llm-eval llm-test llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-testing llm-evals

Updated May 6, 2025
Jupyter Notebook

LLMSystems / llm-evals

A framework for evaluating large language models (LLMs) across a variety of tasks.

nlp benchmark ai evaluation-framework ai-evaluation llm llm-evaluation llm-as-a-judge g-eval llm-evals

Updated Mar 18, 2026
Python

tpertner / squeeze

Squeeze your model with pressure prompts to see if its behavior leaks.

reliability evaluation calibration alignment quality-assurance metamorphic-testing ai-safety trustworthiness hallucinations prompt-engineering llm-eval llm-evals

Updated Mar 1, 2026
Python

kevinschaul / llm-evals

Because we should all have our own set of LLM evals.

llm llm-evals

Updated Apr 17, 2026
Python

aelaguiz / codex-autoresearcher

Codex-native autoresearch harness with structured worker/judge turns for optimizing anything you can measure.

python research optimization codex ai-agents llm-evals experiment-runner autoresearch

Updated Mar 21, 2026
Python

tpertner / confess

Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.

python yaml calibration alignment metamorphic-testing model-evaluation ai-safety red-teaming prompt-injection hallucination-detection llm-evals evaluation-harness

Updated Feb 22, 2026
Python

spences10 / ralph-town

Disposable Daytona sandboxes for LLM evals and isolated command execution

cli typescript mcp sandbox daytona llm evals llm-evals sandbox-orchestration

Updated Apr 24, 2026
TypeScript

dhirajxai / llm-evals-and-anti-hallucination

Evaluation patterns, release gates, and anti-hallucination techniques for developer-focused AI workflows.

evaluation llmops prompt-testing promptfoo llm-evals ai-reliability anti-hallucination groundedness

Updated Mar 27, 2026
Python

Pavansomisetty21 / GEval-Metrics-Analyzing-the-Reliability-of-LLM-Responses

Evaluates LLM responses and measures their accuracy.

llm-evaluation-metrics llm-evals geval

Updated Jul 8, 2025
Python

abhijeetnardele24-hash / dev-eval-innovator

Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.

python ci developer-tools prompt-engineering llm-testing llm-evals openai-compatible eval-framework

Updated Apr 13, 2026
Python

kai-linux / proof

Evaluation and reliability harness for agentic LLM systems, with task success, latency, cost, retries, fallback routing, and failure taxonomy.

benchmarking intelligence dashboard latency reliability evaluation observability ai-agents failure-analysis cost-tracking llm-evals fallback-routing

Updated Apr 1, 2026
Python

ygyzys83 / AI_evaluation_app

Demonstrates a production-grade evaluation (evals) framework that benchmarks multiple large language models (LLMs) against a "Source of Truth" NBA dataset.

nba-data rag model-benchmarking ai-product-management llm-evals

Updated Apr 27, 2026
Python

dicnunz / agent-marketplace-evals

Synthetic marketplace benchmark harness with deterministic demo and Codex subagent pilot

codex local-first market-simulation llm-evals synthetic-benchmark

Updated Apr 25, 2026
Python

ZoaGrad / blackglass-dojo

Sovereign Adversarial Simulation & Interdiction Engine for the 0.05V Standard.

simulation ai-safety guardrails llm-evals runtime-assurance adversarial-testing

Updated Mar 21, 2026
Python

SiddhantaShrestha / autonomous-research-eval-agent

Agentic research pipeline with local retrieval, structured evaluation, conditional revision, and traceable outputs using Groq.

python cli retrieval evaluation ai-agents groq llm prompt-engineering research-agent agentic-workflow llm-evals

Updated Mar 23, 2026
TypeScript

scasella / Proofgrade

Reproducible LLM proof grading benchmark + API for Olympiad-style math.

python benchmarking fastapi math-education llm-evaluation llm-evals proof-grading rubric-aware olympiad-math math-ed

Updated Apr 24, 2026
Python

scienceaditya / agentic-research-kit

Evaluation-first research kit for biology. Structured logs, reproducible specs, and lightweight validation for agentic workflows.

benchmarking reliability computational-biology crispr llm-evals membrane-trafficking

Updated Dec 25, 2025
Python

SproutSeeds / dormant-behavior-audit

Public benchmark and reference bundle for dormant behavior audit

benchmarking machine-learning reproducibility ai-safety interpretability llm-evals

Updated Apr 11, 2026
Python

