Improve OpenAI evaluator telemetry coverage with gpt-5.5 xhigh (plan …#237

Open
mltseva wants to merge 1 commit into amaltseva/codex-gpt-5.5-high-test from amaltseva/codex-gpt-5.5-high-test-2

Conversation

@mltseva
Collaborator

@mltseva mltseva commented May 12, 2026

artifact_path: experiments/tracy-openai/gpt-5.5-xhigh-medium/
Eval id: 294156bb-e4d9-4f1e-9c9f-f211e9b9ae9a
Status: succeeded
Score: 98
Scenarios: 67/86 passed, 19 partially failed.

Main remaining gaps:

  • openai/chat/lifecycle: largest loss, score 31. create passes, but retrieve returns 404, so later update/messages/list/delete spans are missing.
  • openai/chat/vision: API returns 400 invalid_image_url; also missing request-side tracy.request.input_image.count.
  • openai/chat/streaming: missing gen_ai.request.stream=true, response id/model, and finish reasons.
  • openai/images/streaming and openai/images/edit_streaming: gen_ai.request.stream is emitted as string "true" instead of Boolean true; missing tracy.response.created_at.
  • openai/images/variation: API returns 404, the operation name is emitted as generate_content, and image request/response attrs are missing.
  • Smaller gaps: embeddings dimension count, conversations item retrieve operation name, Responses file/image input attrs, tracy.request.include, invalid input error type/code, and videos delete response attrs.

Highest-value fixes: boolean serialization for gen_ai.request.stream, embeddings dimension count, correct conversations retrieve operation, request-side multimodal/file attrs, and image variation route handling.
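
For the first fix, here is a minimal sketch of coercing the stream flag to a real boolean before it is set as a span attribute. The helper names `normalize_stream_flag` and `build_request_attributes` are hypothetical and not taken from this branch. The sketch assumes the flag can arrive as a string (for example from a form-encoded request body), which is one plausible cause of the `"true"` value and is not confirmed by the eval report.

```python
from typing import Any, Dict


def normalize_stream_flag(value: Any) -> bool:
    """Coerce a request-side stream flag to a real bool (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    if value is None:
        # Assumption: an absent flag means the request is not streaming.
        return False
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "1"):
            return True
        if lowered in ("false", "0", ""):
            return False
        raise ValueError(f"unrecognized stream flag: {value!r}")
    raise TypeError(f"unsupported stream flag type: {type(value).__name__}")


def build_request_attributes(params: Dict[str, Any]) -> Dict[str, Any]:
    """Build request attributes with gen_ai.request.stream as a bool, not a string."""
    return {"gen_ai.request.stream": normalize_stream_flag(params.get("stream"))}
```

Raising on unrecognized strings rather than falling back to `False` is a deliberate choice in this sketch, so that a malformed flag shows up as an error instead of a silently wrong attribute.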

@mltseva mltseva requested a review from georgiizorabov May 12, 2026 13:39