feat: Add QA evaluation structured outputs for Starlight (Brent Council)#5

Open
roshan-vapi wants to merge 1 commit into main from tasker/PRO-846-qa-structured-outputs

Conversation

@roshan-vapi (Contributor)

Summary

Adds 5 structured output YAML files for automated post-call QA evaluation of Brent Council Housing Benefits calls (Starlight project).

  • 4 QA category structured outputs that evaluate call transcripts against Brent Council's manual QA criteria
  • 1 wrap-up code structured output that classifies calls into 19 predefined categories

Linear Issue

PRO-846

Files Created

| File | Category | Questions | Auto-Fail |
| --- | --- | --- | --- |
| resources/structuredOutputs/starlight-qa-engagement.yml | Engagement | 7 (1.1-1.7) | 1.3, 1.4, 1.5 |
| resources/structuredOutputs/starlight-qa-right-first-time.yml | Right First Time | 8 (2.1-2.8) | 2.3, 2.4, 2.5 |
| resources/structuredOutputs/starlight-qa-signposting.yml | Signposting | 2 (3.1-3.2) | None |
| resources/structuredOutputs/starlight-qa-explaining.yml | Explaining | 2 (4.1-4.2) | None |
| resources/structuredOutputs/starlight-wrap-up-code.yml | Call Classification | N/A | N/A |

Schema Design

Each QA structured output produces per-question evaluations with:

  • result: yes / no / not_applicable
  • reasoning: explanation referencing the conversation
  • evidence: array of { message_text, timestamp } excerpts

Top-level fields:

  • auto_fail: true if ANY auto-fail question received no
  • overall_pass: true only if auto_fail is false
  • category_score: fraction string, e.g. "5/7"
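For concreteness, one category's parsed output might look like the following. This is an illustrative sketch only: the field names (result, reasoning, evidence, auto_fail, overall_pass, category_score) come from the schema above, but the top-level questions wrapper, the question keys, and all values are invented, not copied from the YAML files.

```javascript
// Illustrative output for the Engagement category. Field names follow the
// schema described above; the "questions" wrapper and all values are invented.
const engagementResult = {
  questions: {
    "1.1": {
      result: "yes", // yes / no / not_applicable
      reasoning: "The agent greeted the caller and confirmed the enquiry.",
      evidence: [
        {
          message_text: "Good morning, Housing Benefits, how can I help?",
          timestamp: "00:00:03",
        },
      ],
    },
    "1.3": {
      result: "no", // 1.3 is one of this category's auto-fail questions
      reasoning: "The agent did not complete identity verification.",
      evidence: [],
    },
  },
  auto_fail: true,     // an auto-fail question (1.3) received "no"
  overall_pass: false, // overall_pass is true only if auto_fail is false
  category_score: "5/7",
};

console.log(engagementResult.overall_pass); // false
```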

Auto-fail logic: If any auto-fail question in ANY of the 4 categories receives no, the ENTIRE call evaluation fails. Each structured output sets its own auto_fail flag; the consuming application must check across all 4.
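Since each file only sets its own flag, the consuming application has to combine them. A minimal sketch of that cross-category check; the function name and input shape are hypothetical, not defined anywhere in this PR:

```javascript
// Hypothetical helper: the call passes only if none of the four category
// outputs tripped its auto_fail flag. This PR does not define this function;
// it sketches the check the consuming application must perform.
function callPasses(categoryResults) {
  return categoryResults.every((r) => r.auto_fail === false);
}

const results = [
  { category: "engagement", auto_fail: false },
  { category: "right_first_time", auto_fail: true }, // e.g. 2.3 received "no"
  { category: "signposting", auto_fail: false },
  { category: "explaining", auto_fail: false },
];

console.log(callPasses(results)); // false: one category auto-failed
```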

Key Design Decisions

  • Model: gpt-4.1 at temperature: 0 for deterministic, accurate QA evaluation
  • Multilingual support: All outputs include explicit instructions to evaluate in transcript language
  • AI agent adaptation: Questions that don't apply to AI agents (ACW, system logging, hold time) have not_applicable guidance
  • Glossary: Full Brent Council Housing Benefits terminology embedded in each output's description
  • assistant_ids: []: Empty because Starlight assistant configs are not yet in the gitops repo; will be populated when they are added
  • Wrap-up code second-tier: Placeholder secondary_classification_notes field for pending tier definitions
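Putting those decisions together, the top of one of these files might look roughly like this. This is a guess at the layout, assuming the required-field list from the Validation section below; the exact key nesting, type value, and description wording are illustrative, not copied from the PR:

```yaml
# starlight-qa-signposting.yml -- illustrative layout only, not the real file
name: starlight_qa_signposting        # snake_case per AGENTS.md
type: structuredOutput                # guessed value
target: call
description: >
  Evaluates Brent Council Housing Benefits calls against the Signposting
  QA criteria (questions 3.1-3.2). Evaluate in the transcript language.
model:
  model: gpt-4.1
  temperature: 0                      # deterministic QA evaluation
schema:
  type: object                        # must stay a simple string, not an array
assistant_ids: []                     # populated once Starlight configs land in gitops
workflow_ids: []
```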

Line Count Note

This PR is 778 lines, which exceeds the 500-line guideline. However, all additions are declarative YAML data files with repetitive per-question schema structure. The 5 files are logically atomic units that cannot be meaningfully split -- each represents a single structured output definition. No code was modified.

How to Test

  1. Verify YAML validity: each file parses correctly with the yaml npm package
  2. Verify schema.type is always a simple string (not an array) per AGENTS.md warning
  3. After push to Vapi (npm run push:dev), verify structured outputs appear in the dashboard
  4. Run a test call and verify the structured outputs produce expected evaluation results

Validation

  • All 5 files validated as correct YAML with required fields (name, type, target, description, model, schema, assistant_ids, workflow_ids)
  • schema.type confirmed as simple string "object" in all files (avoids .toLowerCase() crash)
  • All question properties validated to have result, reasoning, and evidence sub-properties
  • name fields follow snake_case convention per AGENTS.md
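The checks above can be scripted. A hedged sketch, assuming each file has already been parsed into a plain object (e.g. with YAML.parse from the yaml npm package mentioned in How to Test); the function name is made up, and the required-field list is taken from the first bullet:

```javascript
// Sketch of the validation checks listed above, applied to one parsed file.
// Assumes the YAML has already been parsed into a plain object (e.g. via
// YAML.parse from the "yaml" npm package). The function name is hypothetical.
const REQUIRED_FIELDS = [
  "name", "type", "target", "description",
  "model", "schema", "assistant_ids", "workflow_ids",
];

function validateStructuredOutput(doc) {
  for (const field of REQUIRED_FIELDS) {
    if (!(field in doc)) throw new Error(`missing required field: ${field}`);
  }
  // Per the AGENTS.md warning: schema.type must be a simple string like
  // "object", never an array, or a downstream .toLowerCase() call crashes.
  if (typeof doc.schema.type !== "string") {
    throw new Error("schema.type must be a simple string, not an array");
  }
  // snake_case name convention per AGENTS.md.
  if (!/^[a-z0-9]+(_[a-z0-9]+)*$/.test(doc.name)) {
    throw new Error(`name is not snake_case: ${doc.name}`);
  }
  return true;
}
```

Running this over every file in resources/structuredOutputs/ before npm run push:dev would catch the failure modes listed above without a test call.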

Add 5 structured output YAML files for automated post-call QA evaluation
of Brent Council Housing Benefits calls:

- starlight-qa-engagement.yml: 7 questions (3 auto-fail: 1.3, 1.4, 1.5)
- starlight-qa-right-first-time.yml: 8 questions (3 auto-fail: 2.3, 2.4, 2.5)
- starlight-qa-signposting.yml: 2 questions (no auto-fail)
- starlight-qa-explaining.yml: 2 questions (no auto-fail)
- starlight-wrap-up-code.yml: call classification into 19 wrap-up codes

Each QA structured output evaluates per-question with result (yes/no/not_applicable),
reasoning, and transcript evidence. Auto-fail logic: if ANY auto-fail question
receives "no", the entire evaluation fails across all categories.

All outputs include multilingual transcript support, AI agent adaptation notes,
and the full Brent Council Housing Benefits glossary.

Closes PRO-846

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
