Three server-side bugs in gym Docker images affecting 19 benchmark tasks #4

@anchit

We found three server-side bugs in the MCP gym Docker images that make 19 benchmark tasks impossible to complete regardless of model capability. These were discovered while running the full 649-task benchmark (enterprise_ops_gym_oracle.parquet) against the official Docker images.


Bug 1: create_virtual_event_townhall crashes with Python TypeError (8 tasks)

Domain: Teams, Hybrid
Image: enterpriseops-gym-mcp-teams

Error:

Error calling create_virtual_event_townhall: Failed to create townhall:
schemas.virtual_event_townhall.VirtualEventTownhallResponse() argument after ** must be a mapping, not NoneType

Root cause: The townhall creation handler returns None internally, and the response constructor then tries to unpack that value with **, which raises the TypeError shown above.

Affected tasks:

  • task_20251125_113335_636_7ebc1127_a4ebffe8 (teams)
  • task_20251205_150335_071_464ee3e0_19d263cd (teams)
  • task_20251205_192805_864_464ee3e0_91f2c02f (teams)
  • task_20260106_104903_639_0154326e_81f6a68a (teams)
  • task_20260108_172650_003_8e9e30d7_a1ff8588 (teams)
  • task_20260108_194921_186_8e9e30d7_a087d6f8 (teams)
  • task_20260109_004703_994_7ebc1127_6f97f3e9 (teams)
  • task_20260114_164939_471_4d9df647_2aaff95f (hybrid)

The model calls the tool with valid arguments, but the server crashes before producing any response.
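
For reference, a minimal sketch of the failure mode described under Root cause, plus one possible guard. The class fields and the helper names (`create_townhall_record`, `build_response_unguarded`, `build_response_guarded`) are hypothetical stand-ins, not the gym's actual code; only the `**None` behaviour is standard Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VirtualEventTownhallResponse:
    # Hypothetical stand-in for the real schema; the fields are assumptions.
    id: str
    subject: str


def create_townhall_record(subject: str) -> Optional[dict]:
    # Hypothetical handler that, as in this report, yields None instead of a dict.
    return None


def build_response_unguarded(subject: str) -> VirtualEventTownhallResponse:
    record = create_townhall_record(subject)
    # Raises TypeError when record is None, because ** requires a mapping.
    return VirtualEventTownhallResponse(**record)


def build_response_guarded(subject: str) -> VirtualEventTownhallResponse:
    record = create_townhall_record(subject)
    if record is None:
        # Surface a descriptive error instead of an opaque TypeError.
        raise RuntimeError(f"townhall creation returned no record for {subject!r}")
    return VirtualEventTownhallResponse(**record)


def unguarded_error(subject: str) -> str:
    # Capture the TypeError text so it can be compared with the reported message.
    try:
        build_response_unguarded(subject)
    except TypeError as exc:
        return str(exc)
    return ""


if __name__ == "__main__":
    print(unguarded_error("All hands"))
```

A guard like this would not fix the underlying reason the handler returns None, but it would at least make the server-side error actionable.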


Bug 2: create_send_as_alias returns HTTP 500 (5 tasks)

Domain: Email, Hybrid
Image: enterpriseops-gym-mcp-email

Error:

Error calling create_send_as_alias: ❌ ❌ HTTP 500: Internal Server Error

No error details are returned: the server crashes with an unhandled exception.

Affected tasks:

  • task_20251218_102205_211_1628b966_06687c79 (hybrid)
  • task_20260107_131200_705_1628b966_84e87e0f (email)
  • task_20260107_141130_029_1628b966_8e264839 (email)
  • task_20260109_160851_122_911d75d7_3ece8e5e (email)
  • task_20260116_064331_915_d8f93f2d_854d09b8 (hybrid)

Bug 3: create_draft / send_message FOREIGN KEY constraint failures (6 tasks)

Domain: Email, Hybrid
Image: enterpriseops-gym-mcp-email

Error (two variants):

Variant A — threads table FK (4 tasks, all hybrid):

Error creating draft: (sqlite3.IntegrityError) FOREIGN KEY constraint failed
[SQL: INSERT INTO threads (id, user_id, snippet, ...) VALUES (?, ?, ?, ?, ?, ?)]

Occurs when userId="me" in hybrid tasks. The email gym resolves "me" to a user_id that does not exist in the seeded database's users table.
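
A small sketch of how Variant A can be reproduced in isolation, assuming a simplified schema. The table layout, the IDs, and the helpers `make_db`, `user_exists`, and `create_thread` are hypothetical; the real gym tables have more columns. It shows that inserting a thread for a user_id missing from users produces the same IntegrityError, and that a pre-check on the resolved ID would catch it earlier.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # Hypothetical, simplified schema (assumption, not the gym's real DDL).
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE threads ("
        "id TEXT PRIMARY KEY, "
        "user_id TEXT NOT NULL REFERENCES users(id), "
        "snippet TEXT)"
    )
    conn.execute("INSERT INTO users (id) VALUES ('user_seeded')")
    conn.commit()
    return conn


def user_exists(conn: sqlite3.Connection, user_id: str) -> bool:
    # Check that the ID "me" was resolved to is actually present in users.
    row = conn.execute("SELECT 1 FROM users WHERE id = ?", (user_id,)).fetchone()
    return row is not None


def create_thread(conn: sqlite3.Connection, thread_id: str, user_id: str, snippet: str) -> None:
    # Fails with "FOREIGN KEY constraint failed" if user_id is not in users.
    conn.execute(
        "INSERT INTO threads (id, user_id, snippet) VALUES (?, ?, ?)",
        (thread_id, user_id, snippet),
    )


if __name__ == "__main__":
    db = make_db()
    print(user_exists(db, "user_missing"))
    try:
        create_thread(db, "t1", "user_missing", "hello")
    except sqlite3.IntegrityError as exc:
        print(exc)
```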

Variant B — message_labels table FK (3 tasks):

Error sending message: (sqlite3.IntegrityError) FOREIGN KEY constraint failed
[SQL: INSERT INTO message_labels (message_id, label_id) VALUES (?, ?)]

The label exists: modify_message succeeds with the same label_id immediately afterwards. The send_message endpoint nevertheless fails whenever labelIds is provided in the request body. This is likely a transaction ordering issue in which the message_labels row is inserted before the parent message row (and its thread) exists.
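
To illustrate the suspected ordering problem, here is a sketch under assumptions: a simplified three-table schema and two hypothetical functions, `send_labels_first` and `send_message_first`. We have not seen the server code, so this only shows that the reported error is consistent with inserting into message_labels before the parent message row exists, even when the label itself is valid.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    # Hypothetical, simplified schema (assumption, not the gym's real DDL).
    conn.executescript(
        """
        CREATE TABLE messages (id TEXT PRIMARY KEY);
        CREATE TABLE labels (id TEXT PRIMARY KEY);
        CREATE TABLE message_labels (
            message_id TEXT NOT NULL REFERENCES messages(id),
            label_id   TEXT NOT NULL REFERENCES labels(id),
            PRIMARY KEY (message_id, label_id)
        );
        INSERT INTO labels (id) VALUES ('Label_1');
        """
    )
    return conn


def send_labels_first(conn: sqlite3.Connection, message_id: str, label_ids: list) -> None:
    # Suspected buggy order: the child row is written before its parent message,
    # so the immediate FK check fails although the label exists.
    for label_id in label_ids:
        conn.execute(
            "INSERT INTO message_labels (message_id, label_id) VALUES (?, ?)",
            (message_id, label_id),
        )
    conn.execute("INSERT INTO messages (id) VALUES (?)", (message_id,))


def send_message_first(conn: sqlite3.Connection, message_id: str, label_ids: list) -> None:
    # Parent row first, then the association rows, all in one transaction.
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO messages (id) VALUES (?)", (message_id,))
        conn.executemany(
            "INSERT INTO message_labels (message_id, label_id) VALUES (?, ?)",
            [(message_id, label_id) for label_id in label_ids],
        )


if __name__ == "__main__":
    db = make_db()
    try:
        send_labels_first(db, "m1", ["Label_1"])
    except sqlite3.IntegrityError as exc:
        print(exc)
        db.rollback()
    send_message_first(db, "m2", ["Label_1"])
```

If the server uses an ORM, flushing the message before attaching labels would be the equivalent of `send_message_first`.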

Affected tasks:

  • task_20251211_100706_985_701c5774_8d658f71 (email)
  • task_20251218_103430_472_701c5774_42c99459 (hybrid)
  • task_20251218_120340_971_701c5774_de18aa54 (hybrid)
  • task_20251219_115208_288_701c5774_66a236f8 (hybrid)
  • task_20251224_104537_197_701c5774_5c104832 (hybrid)
  • task_20251225_064512_901_701c5774_964bf3bf (hybrid)

Impact

These 19 tasks (2.9% of the benchmark) are guaranteed failures regardless of the model: the server crashes or rejects valid requests before the model can complete the task. This deflates reported scores by up to 2.9 percentage points for any model evaluated on the full benchmark.
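
The arithmetic behind these figures, as a short sketch. The per-bug counts and the 649-task total come from this report; the function names and the idea of reporting an adjusted pass rate over the remaining tasks are our own suggestion, not part of the benchmark tooling.

```python
TOTAL_TASKS = 649

# Affected task counts per bug, as listed in this report.
AFFECTED = {
    "create_virtual_event_townhall": 8,
    "create_send_as_alias": 5,
    "create_draft / send_message": 6,
}


def blocked_tasks() -> int:
    return sum(AFFECTED.values())


def blocked_share() -> float:
    # Fraction of the benchmark that cannot be completed.
    return blocked_tasks() / TOTAL_TASKS


def max_pass_rate() -> float:
    # Upper bound on the pass rate any model can reach on the full benchmark.
    return (TOTAL_TASKS - blocked_tasks()) / TOTAL_TASKS


def adjusted_pass_rate(passed: int) -> float:
    # Pass rate over completable tasks only. Blocked tasks can never be among
    # the passed ones, so only the denominator changes.
    return passed / (TOTAL_TASKS - blocked_tasks())


if __name__ == "__main__":
    print(f"blocked: {blocked_tasks()} tasks ({blocked_share():.1%})")
    print(f"max achievable pass rate: {max_pass_rate():.1%}")
```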
