Improve OpenAI evaluator telemetry coverage with gpt-5.5 high #234

Open
mltseva wants to merge 1 commit into main from amaltseva/codex-gpt-5.5-high-test

@mltseva (Collaborator) commented May 11, 2026

id: 3a3e581c-f07f-4ea6-96a0-fb0f23cd31ea
artifact_path: experiments/tracy-openai/gpt-5.5-high/
providers: ["openai"]
status: succeeded
score: 91
checks: 1608 / 1794
passed scenarios: 37 / 86
provider_errors: 4
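
For context on the headline numbers, the two reported ratios can be turned into percentages. This is only arithmetic on the counts above; `pass_rate` is a hypothetical helper, and the snippet assumes nothing about how the overall score of 91 is derived.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the percentage of passed items, rounded to one decimal place."""
    return round(100 * passed / total, 1)


# Counts copied from the run summary above.
check_rate = pass_rate(1608, 1794)   # checks: 1608 / 1794
scenario_rate = pass_rate(37, 86)    # passed scenarios: 37 / 86

print(f"checks passed: {check_rate}%")
print(f"scenarios passed: {scenario_rate}%")
```

That works out to roughly 89.6% of checks and 43.0% of scenarios, so most scenarios that do not pass are failing only a small share of their checks.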

Category summary:

openai/batches avg 100.0
openai/audio avg 98.4
openai/conversations avg 98.5
openai/videos avg 96.7
openai/files avg 99.0
openai/embeddings avg 94.3
openai/images avg 86.9
openai/responses avg 85.9
openai/chat avg 81.3

Worst scenarios:

30 openai/chat/lifecycle provider_error=true
40 openai/chat/vision provider_error=true
46 openai/responses/streaming
60 openai/responses/background_cancel
61 openai/images/variation provider_error=true
63 openai/chat/streaming
70 openai/models/retrieve

Main remaining gaps:

  • chat/lifecycle still fails: retrieve returns 404, and the subsequent update/messages/list/delete spans are missing.
  • chat/vision fails with a provider-side 400 on image download.
  • images/variation returns 404.
  • videos/lifecycle mostly passes, but delete returns 400 because the video is still processing.
  • A few operation-name mismatches remain for Responses, for example responses.input_items.list where response.input_items.list is expected.
  • Some checks still miss status, usage, or response metadata in error and streaming paths.
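
One possible way to handle the operation-name mismatches listed above is an explicit alias table applied before span names are compared. The sketch below is an assumption, not code from this PR: `OPERATION_ALIASES` and `normalize_operation` are hypothetical names, and only the single mismatch reported above is included. Whether the fix belongs on the emitting side or in the expected names is left open.

```python
# Hypothetical alias table: emitted operation name -> name the checks expect.
# Only the mismatch reported in this run is listed here.
OPERATION_ALIASES: dict[str, str] = {
    "responses.input_items.list": "response.input_items.list",
}


def normalize_operation(name: str) -> str:
    """Map a known alias to its expected name; pass other names through unchanged."""
    return OPERATION_ALIASES.get(name, name)
```

An explicit table keeps each rename visible and reviewable, which a blanket rule such as stripping a trailing "s" would not.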

@mltseva requested a review from georgiizorabov May 11, 2026 20:19