test: update upgrade tests by itsjustriley · Pull Request #3472 · Shopify/hydrogen

itsjustriley · 2026-02-17T12:20:35Z

WHY are these changes introduced?

Fixes https://github.com/Shopify/developer-tools-team/issues/1045

The existing upgrade flow tests were unreliable. Tests would intermittently fail in CI due to dev server flakiness (port conflicts, spawn/kill race conditions, HTTP timeouts), making it difficult to validate changelog.json entries/upgrade flow. This should reduce toil for future releases.

WHAT is this pull request doing?

This PR refactors the upgrade flow tests to be more focused, and fail faster to ensure that new changelog.json entries enable successful upgrades.

The test validates:

Scaffolds a project from the previous release
Runs actual h2 upgrade command to latest version
Dependencies and devDependencies are upgraded correctly
Dependencies are removed when specified (e.g., removeDependencies array)
Upgrade guide markdown is generated when release has manual steps
npm install succeeds without conflicts
npm run build succeeds (when no manual steps required)

Key Decisions:

Remove dev server validation and validate build instead, which catches same errors aside from runtime (which we can trust other tests to catch)
Testing only previous previous → latest
Uses current skeleton with simulated old version (could investigate git history/tags, but it didn't seem functional in CI)
Simplified schema validation to fail fast before integration tests
Use actual h2 upgrade command
Gracefully skips when packages aren't published yet
Skip build when manual steps are required, but validate dependencies install/markdown is generated
Only trigger on changelog.json changes

HOW to test your changes?

Run test locally:

cd packages/cli
npm test upgrade-flow.test.ts

Expected: Both tests pass

Verify CI workflow:
- Modify docs/changelog.json (add a comma or whitespace)
- Push to a branch
- The test-upgrade-flow.yml workflow should trigger automatically
- Check that it passes

Checklist

I've read the Contributing Guidelines
I've considered possible cross-platform impacts (Mac, Linux, Windows)
I've added a changeset if this PR contains user-facing or noteworthy changes
I've added tests to cover my changes
I've added or updated the documentation (workflow includes detailed comments)

shopify · 2026-02-17T12:22:38Z

Oxygen deployed a preview of your e2e-upgrade-tests branch. Details:

Storefront	Status	Preview link	Deployment details	Last update (UTC)
Skeleton (skeleton.hydrogen.shop)	✅ Successful (Logs)	Preview deployment	Inspect deployment	February 23, 2026 6:26 PM

Learn more about Hydrogen's GitHub integration.

github-actions · 2026-02-17T15:17:05Z

We detected some changes in packages/*/package.json or packages/*/src, and there are no updates in the .changeset.
If the changes are user-facing and should cause a version bump, run npm run changeset add to track your changes and include them in the next release CHANGELOG.
If you are making simple updates to examples or documentation, you do not need to add a changeset.

kdaviduik · 2026-02-19T03:52:17Z

 import {exec} from '@shopify/cli-kit/node/system';
 import {execAsync} from '../../lib/process.js';
 import * as upgradeModule from './upgrade.js';
-import {spawn, ChildProcess} from 'node:child_process';


(threading)

Issue/bug: scaffolding only downgrades @shopify/hydrogen, not other dependencies

What: scaffoldHistoricalProject() copies the current skeleton and only changes version and dependencies['@shopify/hydrogen']. All other dependencies remain at HEAD
versions.

Why it matters: The upgrade command thinks it's upgrading from the previous version, but most dependencies are already at the latest version. This means the test can pass without going through the full upgrade flow that merchants experience. Also, this may cause issues down the line (eg: Remix --> RR7 is an example where this bug would've caused issues)

Fix: Use previousRelease.dependencies and previousRelease.devDependencies from the changelog to set ALL starting dependency versions, not just hydrogen:

if (previousRelease.dependencies) { for (const [dep, ver] of Object.entries(previousRelease.dependencies)) { packageJson.dependencies[dep] = ver; } } if (previousRelease.devDependencies) { for (const [dep, ver] of Object.entries(previousRelease.devDependencies)) { packageJson.devDependencies[dep] = ver; } }

kdaviduik · 2026-02-19T03:54:34Z

-    renderConfirmationPrompt: vi.fn(() => Promise.resolve(true)), // Always confirm
-    renderInfo: vi.fn(() => {}), // Mock renderInfo to silence output
-    renderSuccess: vi.fn(() => {}), // Mock renderSuccess to silence output
+    renderConfirmationPrompt: vi.fn(() => Promise.resolve(true)),


(threading)

This also eliminates (without replacement) a few other validations/checks of the changelog.json file:

I definitely think we should have these validations, though this test doesn't feel like the place to put it imo. I think we should just have a CI step that ensures the JSON formatting is correct.

Or even better but less trivial would be to use a Zod schema to validate this JSON file

agree on the schema validation for the json file

but i think our tests were too flaky to begin with

what do we want to test?

my educated guess is:

whether changelog.json matches a specific format

whether we can read the package.json from a project

whether we can detect what "hydrogen packages" version a package.json has

whether we can detect if they are on the latest

whether we can modify package versions in a package.json

whether we can call the install command on behalf of the user

for none of those we need to fetch an external changelog from a backend

but our functions in upgrade.ts are so bound together and filesystem dependant, that makes testing really hard

pseudocode for what i imagine would be easy to test for all those cases:

function upgrade(packageJsonService, changelog) { const versions = getHydrogenPackageVersions(packageJsonService); const needUpgrading = getOutdatedPackageVersions(versions, changelog); if (needUpgrading.length === 0) return; await updatePackageVersions(packageJsonService); await installPackages(); return; }

(extra points if we use Effect, or at least errors as values)

maybe we should push a refactor of upgrade as tech debt?

but i think our tests were too flaky to begin with

definitely, though i think it's also important to consider the specific flaky factors. is there a world where we could make those tests NOT flaky without compromising on the safety/validation of our changes? i think there is!

for none of those we need to fetch an external changelog from a backend

this IS what happens though when users actually run h2 upgrade, so it is important to have E2E validation that this is working imo.

this is a long comment and there are discussions to be had - I booked a meeting tomorrow with all of us to discuss/align synchronously :)

TLDR: cleanup of test infra = good, but I am very concerned with this PR's approach to validating the upgrade command. There was a huge issue in the Remix --> RR7 upgrade last year (before any upgrade tests existed) that the old/current approach would've caught. This new approach wouldn't have caught that, which is why I want to discuss/maybe rethink this.

The cleanup of the test infrastructure (removing process.env mutation, using inTemporaryDirectory, better error messages when packages aren't published) is genuinely good!

But I want to step back and ensure we're aligned on the overall direction before we merge this.

I think this PR is attacking the problem at the wrong level.

The previous tests were flaky, yes - but the flakiness was primarily in the test infrastructure (parsing/validation logic, dev server lifecycle, port conflicts), not in the upgrade simulation itself. The approach of scaffolding a real historical project and running the upgrade command against it was the right idea. It was the implementation details around that approach that needed fixing.

This PR solves the flakiness by fundamentally changing what we're testing. Instead of "scaffold a real old project → upgrade it → verify it works," we now have "copy the current skeleton → change some version numbers → upgrade it → verify it works." These are very different tests with very different confidence levels.

Here's the concrete problem. The new scaffoldProjectAtVersion copies the current skeleton template and overlays dependency versions from the previous changelog entry. But changelog entries are sparse - they only list packages that changed in that specific release. For example, the 2025.4.2 entry only specifies @shopify/hydrogen and @shopify/cli. It doesn't list react-router, @remix-run/dev, vite, or any of the ~30 other ecosystem dependencies.

This means for a 2025.4.2 → 2025.5.1 upgrade test (Remix → React Router 7):

The test copies the current skeleton, which already has react-router@7.x, @react-router/dev, and no @remix-run/* packages

It overlays the 2025.4.2 changelog deps, which only overwrites @shopify/hydrogen and @shopify/cli

The simulated "old project" ends up with the current RR7 file structure, current RR7 dependencies, and just an older hydrogen version

The upgrade command runs against this hybrid project and "succeeds" - but it never had to handle the actual Remix → RR7 migration

The build validation at the end (npm run build) is only meaningful if the starting project actually represents what a merchant has. Building a project that was already structurally at the latest version proves very little.

We have a real-world example of why this matters. During the 2025.4.0 → 2025.5.0 upgrade (Remix → React Router 7), the old/current upgrade tests (which didn't yet exist) would've caught that the upgrade command was failing because React Router was already a peer dependency in the 2025.4.0 project, and the upgrade command's install steps didn't use --force - npm couldn't resolve the dependency conflict. That signal told us we needed to add an uninstall step to the Hydrogen CLI before the upgrade could work. This PR's approach would have never caught that, because the simulated "old project" would have already had the RR7 dependency structure - @remix-run/dev would have been absent and react-router would have already been at v7. The peer dependency conflict would never have surfaced. We would have shipped a broken upgrade path to merchants if we were relying on this PR's approach.

What I think we should do instead:

The changelog validation cleanup and the test infrastructure improvements (vi.stubEnv, inTemporaryDirectory, fail-instead-of-skip) are good - let's keep those. But I think we should fix the scaffolding approach rather than replace it with something that doesn't actually test upgrades. The old findCommitForVersion with 3 fallback strategies and git archive was over-engineered, but the goal of getting a real historical skeleton was correct.

Some high-level ideas:

Simplify the git-based scaffolding - we already version the skeleton template. we could probably just find the right commit based on the tag.

Store skeleton snapshots - as part of the release process, archive the skeleton's package.json (or the full skeleton) so tests can reference it without git archaeology.

I don't want us to merge this and consider "upgrade command is validated" done, because what we'd actually have is "upgrade command can modify version numbers in a package.json that's already structurally current." That's a much weaker guarantee than what we had before, even accounting for the flakiness.

Thoughts?

this IS what happens though when users actually run h2 upgrade, so it is important to have E2E validation that this is working imo.

yes, but they are different things to test

this unit test can quickly and efficiently ensure our piping is working

then a CDC (not a unit test) can call the external endpoint and validate if it matches the schema

this organizational change makes it easy to now test independent parts without testing the entire system at once

in summary:

write unit tests to ensure the bullet points i mentioned

write one CDC test to fetch changelog and validate

i dont want us to block moving forward, but i also dont want us to have hard to work with tests and bad examples for LLM!

Edit: I read the next comment (your long one) and I agree now! I think we are trying to test everything in one and not testing what matters

my new suggestion is: make units testable, write CDC, but also add an e2e for real world scenario of scaffolding an old app and upgrading it, maybe even as a matrix if we want to account for the past 1 year of updates

Yeah I'm still happy to have "fast" tests in addition to the proper E2E test(s), and I think that would be helpful. I just don't want to replace the existing E2E test with "fast" tests

kdaviduik · 2026-02-19T03:57:45Z

-    );
-  }
-}
+    process.env.FORCE_CHANGELOG_SOURCE = 'local';


these lines set FORCE_CHANGELOG_SOURCE, SHOPIFY_HYDROGEN_FLAG_FORCE, and CI on process.env but never restore them.

Why it matters: These leak to subsequent tests in the same worker. The sibling upgrade.test.ts handles this correctly with save/restore.

Fix: Use vi.stubEnv() (auto-restores on teardown) or save/restore in afterEach.

kdaviduik · 2026-02-19T03:59:05Z

-async function scaffoldHistoricalProject(commit: string): Promise<string> {
-  const tempDir = await mkdtemp(join(tmpdir(), 'hydrogen-upgrade-test-'));
-  const projectDir = join(tempDir, 'test-project');
+    const projectDir = await scaffoldHistoricalProject(fromVersion);


The function no longer scaffolds a historical project — it copies the current skeleton and changes the version. I think we should rename this to something like scaffoldProjectAtVersion(version) or createTestProjectWithVersion(version).

kdaviduik · 2026-02-19T03:59:57Z

-    }
-  }
+async function scaffoldHistoricalProject(version: string): Promise<string> {
+  const tempDir = await mkdtemp(join(tmpdir(), 'hydrogen-upgrade-test-'));


from Claude, seems reasonable

mkdtemp creates directories that are never cleaned up. Use inTemporaryDirectory from @shopify/cli-kit/node/fs (like the sibling test does) or add cleanup in afterAll.

kdaviduik · 2026-02-19T04:00:58Z

-      if (!reinstalledDeps[dep]) {
-        expect(packageJson.dependencies?.[dep]).toBeUndefined();
-      }
+    if (!hasUpgradeSteps) {


Let's add a comment here (explaining that projects with manual upgrade steps may not build until those steps are applied) about why we are doing this since it's not necessarily obvious. .

kdaviduik · 2026-02-19T04:04:51Z

+      });
    }
-  }
+  }, 180000);


(threading)

If someone submitted a changelog.json PR before the package(s) being upgraded have been released, the upgrade tests would skip and CI would appear successful, but we would have zero guarantees the change actually works.

Imo this test should fail if any packages specified haven't been released yet. This could be due to a genuine bug/typo, or because our packages haven't been released yet. Either way, I think this test failing if the package version doesn't exist would be helpful

kdaviduik · 2026-02-27T05:28:34Z

-
-async function scaffoldHistoricalProject(commit: string): Promise<string> {
-  const tempDir = await mkdtemp(join(tmpdir(), 'hydrogen-upgrade-test-'));
+async function scaffoldProjectAtVersion(


non-blocking, out of scope for this PR: scaffolding approach creates an unrealistic starting project.

scaffoldProjectAtVersion() copies the current skeleton and overlays dependency versions from the previous changelog entry. But changelog entries are sparse - the latest release (2025.10.0) lists only 2 of the skeleton's 40 total dependencies. Even the most comprehensive release (2025.7.0) covers only ~22%.

This means the test can pass even when the actual upgrade path is broken for merchants.

Concrete example: During the upgrade to 2025.7.0 (Remix -> React Router 7), the old approach would have caught that the upgrade command failed because React Router was already a peer dependency in the old project, and npm install couldn't resolve the conflict. The new approach would miss this because the simulated "old project" would already have the RR7 file structure.

Additional edge case: The dependency overlay only writes deps that exist in the release entry. If a dependency in the previous release doesn't exist in the current skeleton's package.json (because it was removed in a later version), the overlay silently does nothing.

I think we should fix the scaffolding approach rather than replace it:

Simplify the git-based scaffolding - find the right commit based on the release tag

Store skeleton snapshots - as part of the release process, archive the skeleton's package.json

At minimum, apply ALL dependency versions from the previous release entry, not just those in the current skeleton

kdaviduik · 2026-02-27T05:28:34Z

-    /^[\^~]?0\.0\.0-next-[a-f0-9]+-\d+$/,
-  );
+describe('upgrade flow', () => {
+  it('validates changelog.json structure', async () => {


non-blocking, out of scope for this PR: changelog schema validation significantly weakened.

The old test validated: allowed fields per release entry (no rogue fields), required fields (title, version, hash, commit, etc.), URL format validation, calver version format, feature/fix step structure, base64-encoded code blocks, dependenciesMeta structure, and removeDependencies/removeDevDependencies as string arrays.

The new test only checks that releases exist, the latest version matches a calver regex, dependency values look version-like, and features/fixes are arrays. A malformed changelog entry with rogue fields, invalid commit URLs, or broken base64 code blocks would now pass.

The changelog schema validation is orthogonal to the integration test. I think it should be extracted as an independent concern - either a separate test file or a standalone validation script. Ideally using Zod for the schema, which would serve double duty: validating changelog entries in tests AND providing a reusable schema that h2 upgrade itself could use to validate fetched data at runtime.

kdaviduik · 2026-02-27T05:28:34Z

      }
    });
-  });
+  }, 600000);


non-blocking: I think 300,000ms (5 minutes) would be more appropriate here. The old test had 180,000ms while doing MORE work (dev server, typecheck). This test does less work but has a 3.3x longer timeout. If npm install hangs due to a registry outage, this burns 10 minutes of CI compute per run before producing a failure signal.

kdaviduik · 2026-02-27T05:28:34Z

-          latestRelease.devDependencies?.['@shopify/mini-oxygen'],
-        ],
-      ]);
+  it('upgrades from previous version to latest', async () => {


non-blocking: this integration test is a single ~170-line function with ~8 distinct verification concerns: checking package publication, scaffolding, verifying initial state, stubbing env, running upgrade, verifying deps, checking removals, verifying upgrade guide, running install, and conditionally running build.

Extracting named helpers like verifyDependenciesUpgraded(packageJson, expectedRelease) would make each concern independently comprehensible without sacrificing the linear flow.

kdaviduik · 2026-02-27T05:28:34Z

 });

-// Checks if all specified packages are published to npm
+afterEach(() => {


nice improvement! The vi.stubEnv() + afterEach(() => vi.unstubAllEnvs()) pattern is much cleaner than the old direct process.env mutation that leaked to sibling tests.

itsjustriley · 2026-03-06T18:32:07Z

closing in favour of #3527

itsjustriley force-pushed the e2e-upgrade-tests branch 10 times, most recently from 8df9450 to eed6a44 Compare February 17, 2026 15:01

test: update upgrade tests

db4b831

itsjustriley force-pushed the e2e-upgrade-tests branch from eed6a44 to db4b831 Compare February 17, 2026 15:12

itsjustriley requested review from andguy95, fredericoo and kdaviduik February 17, 2026 15:16

itsjustriley marked this pull request as ready for review February 17, 2026 15:16

itsjustriley requested a review from a team as a code owner February 17, 2026 15:16

kdaviduik requested changes Feb 19, 2026

View reviewed changes

address pr feedback

6a0564d

kdaviduik approved these changes Feb 27, 2026

View reviewed changes

itsjustriley closed this Mar 6, 2026

Conversation

itsjustriley commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

WHY are these changes introduced?

WHAT is this pull request doing?

HOW to test your changes?

Uh oh!

shopify Bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Feb 17, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fredericoo Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fredericoo Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

itsjustriley commented Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

itsjustriley commented Feb 17, 2026 •

edited

Loading

shopify Bot commented Feb 17, 2026 •

edited

Loading

fredericoo Feb 19, 2026 •

edited

Loading

fredericoo Feb 24, 2026 •

edited

Loading