Improve OpenAI evaluator telemetry coverage with gpt-5.5 medium #235

Open: mltseva wants to merge 1 commit into main from amaltseva/codex-gpt-5.5-medium-test-2

mltseva (Collaborator) commented May 11, 2026

id: 589e6fac-7d92-4f60-aad6-6c9d559e7a67
artifact_path: experiments/tracy-openai/gpt-5.5-medium-2/
providers: ["openai"]
status: succeeded
score: 87
checks: 1522 / 1794
passed scenarios: 15 / 86
provider_errors: 4

Category summary:

openai/audio avg 96.9
openai/batches avg 95.0
openai/conversations avg 94.2
openai/files avg 93.5
openai/moderations avg 92.3
openai/videos avg 96.0
openai/embeddings avg 86.3
openai/responses avg 83.4
openai/images avg 78.0
openai/chat avg 66.7

Worst scenarios:

26 openai/chat/lifecycle provider_error=true
26 openai/chat/vision provider_error=true
40 openai/responses/streaming
45 openai/chat/streaming
60 openai/chat/tools
63 openai/responses/invalid_empty_input
68 openai/images/streaming
69 openai/images/variation provider_error=true

Main remaining gaps:

  • Operation/API type regressions are the biggest issue in this artifact:
    • gen_ai.operation.name expected='generate_content' actual='responses' 25 times.
    • openai.api.type expected='chat_completions' actual='chat' 12 times.
    • gen_ai.operation.name expected='chat' actual='chat.completions' 10 times.
  • Numeric request values are still emitted as strings in some places, e.g. tracy.request.n actual='1' where a numeric 1 is expected.
  • Streaming and lifecycle scenarios are still missing some usage, status, and id fields.
  • Provider errors remain in:
    • openai/chat/lifecycle
    • openai/chat/vision
    • openai/images/variation
    • openai/videos/lifecycle
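
For reference, the naming and numeric-type gaps listed above could be handled by a small normalization step before attributes are emitted. The sketch below is illustrative only and is not code from this PR: `normalize_attributes`, `coerce_number`, `CANONICAL_VALUES`, and `NUMERIC_ATTRIBUTES` are hypothetical names, and the rewrite table is taken only from the expected/actual pairs reported above. Only `tracy.request.n` is named in the report, so the set of numeric attributes is an assumption that would need to be extended.

```python
from typing import Any, Dict

# Per-attribute value rewrites, copied from the expected/actual pairs in the
# report. The mapping is keyed by attribute because the same raw value
# ("chat") is correct for one attribute and wrong for another.
CANONICAL_VALUES: Dict[str, Dict[str, str]] = {
    "gen_ai.operation.name": {
        "responses": "generate_content",
        "chat.completions": "chat",
    },
    "openai.api.type": {
        "chat": "chat_completions",
    },
}

# Assumption: only tracy.request.n is named in the report; other numeric
# attributes would have to be added here.
NUMERIC_ATTRIBUTES = {"tracy.request.n"}


def coerce_number(value: Any) -> Any:
    """Turn a numeric string into int or float; leave anything else as-is."""
    if not isinstance(value, str):
        return value
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value


def normalize_attributes(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of attrs with canonical names and numeric types."""
    normalized = dict(attrs)
    for key, rewrites in CANONICAL_VALUES.items():
        value = normalized.get(key)
        if isinstance(value, str) and value in rewrites:
            normalized[key] = rewrites[value]
    for key in NUMERIC_ATTRIBUTES:
        if key in normalized:
            normalized[key] = coerce_number(normalized[key])
    return normalized
```

Keying the rewrites by attribute name avoids rewriting `gen_ai.operation.name='chat'`, which is already the expected value, while still mapping `openai.api.type='chat'` to `chat_completions`.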

mltseva requested a review from georgiizorabov on May 11, 2026, 21:26