We found three server-side bugs in the MCP gym Docker images that make 19 benchmark tasks impossible to complete regardless of model capability. These were discovered while running the full 649-task benchmark (enterprise_ops_gym_oracle.parquet) against the official Docker images.
Bug 1: create_virtual_event_townhall crashes with Python TypeError (8 tasks)
Domain: Teams, Hybrid
Image: enterpriseops-gym-mcp-teams
Error:
Error calling create_virtual_event_townhall: Failed to create townhall:
schemas.virtual_event_townhall.VirtualEventTownhallResponse() argument after ** must be a mapping, not NoneType
Root cause: The townhall creation handler returns None internally, then the response constructor tries to unpack it via **None, which crashes.
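The failure mode described above can be reproduced in isolation. This is a minimal sketch, not the gym's actual code: TownhallResponse, create_townhall_record, and both handlers are hypothetical stand-ins, and the assumption that a data-layer call silently returns None is inferred from the error message only.

```python
from dataclasses import dataclass


@dataclass
class TownhallResponse:
    """Hypothetical stand-in for VirtualEventTownhallResponse."""
    id: str
    subject: str


def create_townhall_record(subject):
    """Hypothetical data-layer call that fails silently and returns None."""
    return None


def handler_unguarded(subject):
    # Mirrors the suspected bug: the result is unpacked without a None check.
    record = create_townhall_record(subject)
    return TownhallResponse(**record)  # raises TypeError when record is None


def handler_guarded(subject):
    # One possible fix: fail with an explicit error before unpacking.
    record = create_townhall_record(subject)
    if record is None:
        raise RuntimeError("townhall was not created: data layer returned None")
    return TownhallResponse(**record)
```

Calling handler_unguarded raises a TypeError whose message ends in "argument after ** must be a mapping, not NoneType", matching the error above. The guarded variant does not fix the underlying creation failure, but it would surface the real cause instead of a secondary crash.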
Affected tasks:
task_20251125_113335_636_7ebc1127_a4ebffe8 (teams)
task_20251205_150335_071_464ee3e0_19d263cd (teams)
task_20251205_192805_864_464ee3e0_91f2c02f (teams)
task_20260106_104903_639_0154326e_81f6a68a (teams)
task_20260108_172650_003_8e9e30d7_a1ff8588 (teams)
task_20260108_194921_186_8e9e30d7_a087d6f8 (teams)
task_20260109_004703_994_7ebc1127_6f97f3e9 (teams)
task_20260114_164939_471_4d9df647_2aaff95f (hybrid)
The model calls the tool with valid arguments, but the server raises this error instead of creating the townhall, so the task cannot proceed.
Bug 2: create_send_as_alias returns HTTP 500 (5 tasks)
Domain: Email, Hybrid
Image: enterpriseops-gym-mcp-email
Error:
Error calling create_send_as_alias: ❌ ❌ HTTP 500: Internal Server Error
No error details are returned. The server appears to hit an unhandled exception.
Affected tasks:
task_20251218_102205_211_1628b966_06687c79 (hybrid)
task_20260107_131200_705_1628b966_84e87e0f (email)
task_20260107_141130_029_1628b966_8e264839 (email)
task_20260109_160851_122_911d75d7_3ece8e5e (email)
task_20260116_064331_915_d8f93f2d_854d09b8 (hybrid)
Bug 3: create_draft / send_message FOREIGN KEY constraint failures (6 tasks)
Domain: Email, Hybrid
Image: enterpriseops-gym-mcp-email
Error (two variants):
Variant A — threads table FK (4 tasks, all hybrid):
Error creating draft: (sqlite3.IntegrityError) FOREIGN KEY constraint failed
[SQL: INSERT INTO threads (id, user_id, snippet, ...) VALUES (?, ?, ?, ?, ?, ?)]
Occurs when userId="me" in hybrid tasks. The email gym resolves "me" to a user_id that does not exist in the seeded database's users table.
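Variant A is consistent with plain SQLite behavior when a child row references a missing parent. The sketch below uses an in-memory database with a simplified, assumed schema (the real threads table has more columns) and hypothetical user IDs. It only illustrates the constraint and is not the gym's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE threads ("
    " id TEXT PRIMARY KEY,"
    " user_id TEXT NOT NULL REFERENCES users(id),"
    " snippet TEXT)"
)
# Only this user is seeded (hypothetical ID).
conn.execute("INSERT INTO users (id) VALUES ('user_001')")


def create_thread(conn, thread_id, user_id):
    """Insert a thread; return 'ok' or the IntegrityError message."""
    try:
        conn.execute(
            "INSERT INTO threads (id, user_id, snippet) VALUES (?, ?, ?)",
            (thread_id, user_id, "draft"),
        )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)


seeded_result = create_thread(conn, "t1", "user_001")
# Stands in for "me" resolving to a user that is not in the users table.
missing_result = create_thread(conn, "t2", "user_999")
```

If "me" resolves to an ID like the second one, the insert fails with the same FOREIGN KEY message as in the report, regardless of what the model sends.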
Variant B — message_labels table FK (3 tasks):
Error sending message: (sqlite3.IntegrityError) FOREIGN KEY constraint failed
[SQL: INSERT INTO message_labels (message_id, label_id) VALUES (?, ?)]
The label exists (verified by modify_message succeeding with the same label_id immediately afterward), but the send_message endpoint fails when labelIds is provided in the request body. This is likely an insert-ordering issue: the message_labels row is written before the message row it references (and that message's thread) has been inserted.
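The ordering hypothesis for Variant B can be illustrated with a small SQLite sketch. The schema, IDs, and function names here are assumptions made for the example. The actual send_message implementation was not inspected, so this shows only that the suspected ordering would produce the reported error.

```python
import sqlite3


def make_db():
    """Build an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE messages (id TEXT PRIMARY KEY);
        CREATE TABLE labels (id TEXT PRIMARY KEY);
        CREATE TABLE message_labels (
            message_id TEXT NOT NULL REFERENCES messages(id),
            label_id   TEXT NOT NULL REFERENCES labels(id)
        );
        INSERT INTO labels (id) VALUES ('Label_1');
        """
    )
    return conn


def send_labels_first(conn, message_id, label_id):
    # Suspected ordering: the join row is written before its parent message.
    conn.execute(
        "INSERT INTO message_labels (message_id, label_id) VALUES (?, ?)",
        (message_id, label_id),
    )
    conn.execute("INSERT INTO messages (id) VALUES (?)", (message_id,))


def send_message_first(conn, message_id, label_id):
    # Parent row first, then the join row that references it.
    conn.execute("INSERT INTO messages (id) VALUES (?)", (message_id,))
    conn.execute(
        "INSERT INTO message_labels (message_id, label_id) VALUES (?, ?)",
        (message_id, label_id),
    )
```

With an existing label, send_labels_first raises sqlite3.IntegrityError while send_message_first succeeds. That matches the observation that modify_message works with the same label_id once the message exists.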
Affected tasks:
task_20251211_100706_985_701c5774_8d658f71 (email)
task_20251218_103430_472_701c5774_42c99459 (hybrid)
task_20251218_120340_971_701c5774_de18aa54 (hybrid)
task_20251219_115208_288_701c5774_66a236f8 (hybrid)
task_20251224_104537_197_701c5774_5c104832 (hybrid)
task_20251225_064512_901_701c5774_964bf3bf (hybrid)
Impact
These 19 tasks (2.9% of the benchmark) are guaranteed failures regardless of the model — the server crashes or rejects valid requests before the model can complete the task. This deflates reported scores for any model evaluated on the full benchmark.
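The arithmetic behind these figures, as a short sketch. The per-bug counts and the 649-task total come from this report. The adjusted_pass_rate helper is a hypothetical convenience for rescoring with the blocked tasks excluded, not part of the benchmark tooling.

```python
TOTAL_TASKS = 649

# Blocked task counts per bug, as listed in this report.
BLOCKED_BY_BUG = {
    "create_virtual_event_townhall": 8,
    "create_send_as_alias": 5,
    "create_draft / send_message": 6,
}

blocked = sum(BLOCKED_BY_BUG.values())
blocked_share = blocked / TOTAL_TASKS
# Best score any model can reach while these tasks count as failures.
score_ceiling = (TOTAL_TASKS - blocked) / TOTAL_TASKS


def adjusted_pass_rate(passed, total=TOTAL_TASKS, blocked=blocked):
    """Pass rate over solvable tasks only (hypothetical helper)."""
    return passed / (total - blocked)


print(f"blocked: {blocked} tasks ({blocked_share:.1%})")
print(f"maximum achievable score: {score_ceiling:.1%}")
```

For example, a model that passes 400 tasks scores 400/649 on the full benchmark but 400/630 once the 19 blocked tasks are excluded.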