fix(docker): pin pnpm version to honor package.json packageManager field#86
Merged
Conversation
The Dockerfile called `corepack prepare pnpm@latest --activate`, which
silently follows whatever pnpm publishes as "latest" each time the image
is rebuilt. This broke the build on 2026-05-12 when pnpm 11.1.1 shipped
(released the same day) with a stricter interpretation of PNPM_HOME:
```
[ERROR] The configured global bin directory "/pnpm/bin" is not in PATH
```
pnpm 10.x (which the repo's pnpm-lock.yaml requires) accepted
`PATH=$PNPM_HOME:$PATH` because it treated $PNPM_HOME itself as the bin
dir. pnpm 11.x splits PNPM_HOME and its bin subdirectory, so the same
PATH no longer satisfies `pnpm config set --global`.
Two changes:
1. Pin to pnpm@10.16.1, the version package.json's `packageManager` field
already declared as the single source of truth. Corepack honors that
field automatically when you don't pass an explicit override, so this
is now the canonical pin. Future pnpm upgrades happen by editing
package.json (and ideally regenerating pnpm-lock.yaml), not by
whatever Docker Hub serves that day.
2. Add `$PNPM_HOME/bin` to PATH as well. Belt-and-suspenders — when the
pin is intentionally bumped to pnpm 11.x in the future, the build
keeps working without another Dockerfile edit.
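Together the two changes reduce to a few Dockerfile lines. A minimal sketch — the PR doesn't show the repo's actual Dockerfile, so the base image and surrounding layout here are assumptions:

```dockerfile
FROM node:22-slim

# pnpm puts its global bin under PNPM_HOME; include both the home dir
# (pnpm 10.x convention) and its bin subdirectory (pnpm 11.x convention)
# so an intentional future bump to 11.x needs no Dockerfile edit.
ENV PNPM_HOME=/pnpm
ENV PATH=$PNPM_HOME:$PNPM_HOME/bin:$PATH

# Pin the exact version that package.json's packageManager field declares;
# never `pnpm@latest`, which floats with each rebuild.
RUN corepack enable && corepack prepare pnpm@10.16.1 --activate
```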
Tested:
- docker compose build --no-cache scripthammer succeeds
- docker compose up -d → container reaches health: starting
- Verified pnpm 10.16.1 activates inside built image (corepack pulls
from package.json packageManager field)
- Sanity test in isolation (note the `$` signs are unescaped: the outer single quotes already defer expansion to the inner shell; the `\$` escapes in the originally pasted command would have put the literal string `$PNPM_HOME` on PATH):

```shell
docker run --rm node:22-slim sh -c 'corepack enable &&
  corepack prepare pnpm@10.16.1 --activate &&
  export PNPM_HOME=/pnpm && export PATH=$PNPM_HOME:$PNPM_HOME/bin:$PATH &&
  pnpm config set store-dir /pnpm/store --global && echo SUCCESS'
```

  → SUCCESS
Root cause:
A floating `@latest` tag in infrastructure code is a time bomb; this is
the first time it went off but won't be the last unless pinned. The fix
also removes the duplication where two different files (Dockerfile and
package.json) both claimed authority over the pnpm version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 13, 2026
TortoiseWolfe added a commit that referenced this pull request on May 13, 2026:
* fix(ci): serialize E2E runs via repo-wide concurrency mutex

  The E2E suite runs against the project's single Supabase Cloud instance (huvitqubafsrazpjxsax) with shared test users, shared conversation rows, and beforeAll cleanupOldMessages hooks in 11+ messaging spec files. Concurrent runs against the same backend race each other and dogpile shared state, causing the 60-min GitHub Actions job timeout cap to bite.

  Verified on 2026-05-13:
  - Run 25769955636 (main, post-#86 merge, started 00:14:46Z) had its `E2E (firefox-msg 1/1)` job cancelled at exactly 60 min during step "Run E2E tests". Last operation in the captured log: a Realtime subscription waitForSelector on `body[data-messages-subscribed]`.
  - Run 25770068787 (PR #88 docs cleanup, started 00:18:07Z, only 3 minutes later) ran on the same Supabase project. Direct DB queries confirm conversation `4105492f-1047-40ef-bd6e-411788850547` (the one the cancelled main test was using at "Step 8: Conversation ID:") was created by PR #88's run at 00:21:55Z, not by main's run.
  - The cancelled run's tests detected "Connection already exists / Conversation already exists" and reused PR #88's data, then watched PR #88's beforeAll hooks wipe it at each of 11 spec boundaries.
  - Progressive per-spec slowdown across the cancel window (35s, 35s, 2m45s, 1m, 6m48s, 5m6s, 8m36s, 12m16s, 10m11s, 10m46s, cancelled at 10m46s) is the cumulative effect of cross-run data races, not Supabase rate limiting (verified zero rate-limit signals in the Supabase health endpoint, zero auth.audit_log_entries during the window — Supabase internal audit is disabled on this project).

  Three earlier diagnostic theories were wrong because they all assumed single-actor causation:
  1. Retry-heavy specific test (complete-user-workflow.spec.ts retry loop) — disproved: that test passed first-attempt at 00:49:26Z.
  2. Rate limiting — disproved: ACTIVE_HEALTHY across all Supabase services, no 429s.
  3. Cumulative latency from one run's request volume — disproved: two runs were in flight, not one.

  The fix is structural: GitHub Actions' native `concurrency:` group with a repo-wide key. All E2E runs across all refs (main, PRs, schedule) share one mutex; concurrent runs queue rather than race.

  Changes to .github/workflows/e2e.yml — before:

  ```yaml
  concurrency:
    group: e2e-${{ github.ref }} # per-branch
    cancel-in-progress: true # kill running on new push
  ```

  After:

  ```yaml
  concurrency:
    group: e2e-supabase-${{ github.repository }} # repo-wide
    cancel-in-progress: false # queue, don't cancel
  ```

  Why both fields had to change:
  - Just flipping cancel-in-progress wouldn't fix anything, because the race is cross-branch (main vs PR), not within one branch.
  - Just changing the group key without `cancel-in-progress: false` would let pushes mid-run abort each other and leave Supabase in a half-cleaned state.
  - `${{ github.repository }}` keeps forks isolated from upstream — each fork gets its own mutex, preserving the template story.

  Trade-off accepted: PRs land slightly slower when 2+ are pushed close together (queue wait up to ~45 min worst case). At this repo's PR volume (~5/day max), the queue is rarely hit. Slower-by-design beats 60-min cancellation + re-run + human debug time.

  What this fix does NOT do (and why each is correct):
  - Does NOT bump `timeout-minutes: 60` — masking the budget is exactly what 9 prior rounds of flake mitigation did, and it got us here.
  - Does NOT shard messaging tests N-way — the current 1/1 shard works fine when not contended (28 min on chromium-msg when alone).
  - Does NOT drop retries — the retries weren't the problem.
  - Does NOT replace the hand-rolled retry loops in complete-user-workflow.spec.ts:437-462 or encrypted-messaging.spec.ts:311,580 — those loops were doing the right thing; they polled while a different CI run was deleting their data. With the mutex in place, they'll succeed on attempt 1.

  Lessons captured in memory:
  - ~/.claude/projects/-home-TurtleWolfe-repos-ScriptHammer/memory/lesson_concurrent_ci_shared_backend.md
  - ~/.claude/projects/-home-TurtleWolfe-repos-ScriptHammer/memory/feedback_branch_hygiene.md

  This closes round 10 of the E2E flake pattern (STATUS.md:127-129) by addressing the actual underlying invariant: one writer at a time against the shared Supabase backend.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(e2e): dispatch scroll event after programmatic scrollTop in messaging tests

  The 4 messaging E2E specs that set `el.scrollTop = N` programmatically were relying on the browser to auto-fire a `scroll` event so the React component's handleScroll listener could update state. Chromium and Firefox do this reliably; WebKit does not. Result: tests that depend on the listener firing (jump-to-bottom button visibility, auto-scroll behavior) pass on Chromium/Firefox and fail intermittently on WebKit.

  This was surfaced by run 25774750382 (PR #89), where `webkit-msg` failed on `messaging-scroll.spec.ts:261` "T007-T008: Jump button appears when scrolled and does not overlap input." The test set scrollTop, the React listener at src/components/molecular/MessageThread/MessageThread.tsx:194-212 never ran, showScrollButton stayed false, and the assert on `[data-testid="jump-to-bottom"]` timed out. All 3 retries failed identically. The same pattern existed at three other test sites that hid the bug behind either flaky-retry (PR #88's webkit-msg passed with a "1 flaky" tag on a different test) or an assert that didn't depend on the listener firing immediately.

  Fixing all 4 to prevent future round-N rediscovery:
  - tests/e2e/messaging/messaging-scroll.spec.ts:237 (test setup scroll-to-top)
  - tests/e2e/messaging/messaging-scroll.spec.ts:276 (THE failing test T007-T008)
  - tests/e2e/messaging/messaging-scroll.spec.ts:322 (test setup scroll-to-top)
  - tests/e2e/messaging/encrypted-messaging.spec.ts:665 (test setup scroll-to-top)

  The change is additive — the same scrollTop assignment, plus an explicit `el.dispatchEvent(new Event('scroll', { bubbles: true }))`. handleScroll is idempotent (it just reads the current scroll position and sets state), so the event firing twice in browsers that auto-fire is harmless.

  Rationale for bundling with the concurrency mutex fix in this PR:
  - Both changes are E2E reliability fixes.
  - The webkit failure surfaced *because* the concurrency mutex removed the 60-min cancellation bottleneck, letting webkit-msg run to completion and reveal pre-existing flake that the prior cancel had been masking.
  - Splitting into two PRs would require coordinating their merge order and re-running CI twice; bundled is one merge cycle.

  Verification:
  - All 4 sites now dispatch the scroll event
  - Pre-push hooks (lint, type-check, build) green
  - CI re-run will confirm webkit-msg passes

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: TurtleWolfe <TurtleWolfe@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
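The fire-the-event-yourself pattern from that commit can be sketched outside the browser with Node's built-in `EventTarget` as a hypothetical stand-in for the scroll container (the real specs run this inside Playwright's `page.evaluate`, which is not reproduced here):

```typescript
// A plain property write fires no 'scroll' event, so an attached
// listener never runs — the WebKit behavior the tests tripped over.
// Dispatching the event explicitly after the write makes it run.
const el = new EventTarget() as EventTarget & { scrollTop: number };
let handlerRuns = 0;
el.addEventListener('scroll', () => { handlerRuns += 1; });

// What the old tests did: assign and hope the browser auto-fires.
el.scrollTop = 0;
console.log(handlerRuns); // 0 — no event was dispatched

// The additive fix: same assignment, plus an explicit dispatch.
el.scrollTop = 0;
el.dispatchEvent(new Event('scroll', { bubbles: true }));
console.log(handlerRuns); // 1 — listener ran
```

Because the listener is idempotent, a browser that also auto-fires the event merely runs it twice with the same result.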
TortoiseWolfe added a commit that referenced this pull request on May 14, 2026:
* …ry (#93)

  The PRP-STATUS.md dashboard was last fully refreshed 2026-04-25. Six PRs have landed since then (2026-05-12 to 2026-05-14), closing the long-running E2E flake pattern and the #31 GA4 ticket, and improving fork onboarding.

  Three targeted updates:
  1. Header — bump "Last Updated" to 2026-05-14, "Previous Update" to 2026-04-25. Shipped count 17 -> 18 (019 GA moved from Mostly Shipped to Shipped after #31 closed on 2026-05-13). Updated the "Current Phase" line to reflect round 10 closure.
  2. New "v0.4.x updates since 2026-04-25 audit" section between the header and the full feature table — a one-paragraph summary of each merged PR (#86, #88, #89, #90, #91, #92) plus the issue closures (#31 GA4, #85 OAuth), with a link to the closure comment for #85's outstanding dashboard work.
  3. Stability hotspots note — added a callout indicating the E2E flake row in the hotspot table is resolved at round 10. Rounds 1-9 attacked symptoms; round 10 found the underlying cause (concurrent CI runs racing against a shared Supabase project) and fixed it structurally via the concurrency mutex. The other 9 hotspots remain open.

  Per-feature audit data in the lower sections is left untouched — the 2026-04-25 sweep is still the canonical detail. This refresh is purely the top-of-document changes needed to reflect 19 days of activity.

  Verification:
  - grep "Last Updated" docs/prp-docs/PRP-STATUS.md -> "2026-05-14"
  - Pre-commit hooks pass (prettier + gitleaks)

  Co-authored-by: TurtleWolfe <TurtleWolfe@users.noreply.github.com>
  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

`corepack prepare pnpm@latest` in docker/Dockerfile started failing 2026-05-12 when pnpm 11.1.1 shipped (released the same day) with a stricter PNPM_HOME/PATH check. The build broke wherever the image was rebuilt that day.

- Pin to `pnpm@10.16.1` — the version `package.json`'s `packageManager` field already declares as the single source of truth. Corepack honors that field automatically, so this is now the canonical pin.
- Add `$PNPM_HOME/bin` to PATH as belt-and-suspenders, so future intentional bumps to pnpm 11.x keep the build working without another Dockerfile edit.

Root cause

A floating `@latest` tag in infrastructure code is a time bomb. This is the first time it went off, but it won't be the last unless pinned. The fix also removes the duplication where two files (Dockerfile and `package.json`) both claimed authority over the pnpm version.

Test plan

- `docker compose build --no-cache scripthammer` succeeds
- `docker compose up -d` reaches `health: starting`
- pnpm 10.16.1 activates inside the built image (corepack pulls from the `package.json` packageManager field)
- Sanity test in isolation prints `SUCCESS`

🤖 Generated with Claude Code