test: update upgrade tests#3472
Conversation
|
Oxygen deployed a preview of your
Learn more about Hydrogen's GitHub integration. |
8df9450 to
eed6a44
Compare
eed6a44 to
db4b831
Compare
|
We detected some changes in |
| import {exec} from '@shopify/cli-kit/node/system'; | ||
| import {execAsync} from '../../lib/process.js'; | ||
| import * as upgradeModule from './upgrade.js'; | ||
| import {spawn, ChildProcess} from 'node:child_process'; |
There was a problem hiding this comment.
(threading)
Issue/bug: scaffolding only downgrades @shopify/hydrogen, not other dependencies
What: scaffoldHistoricalProject() copies the current skeleton and only changes version and dependencies['@shopify/hydrogen']. All other dependencies remain at HEAD
versions.
Why it matters: The upgrade command thinks it's upgrading from the previous version, but most dependencies are already at the latest version. This means the test can pass without going through the full upgrade flow that merchants experience. Also, this may cause issues down the line (eg: Remix --> RR7 is an example where this bug would've caused issues)
Fix: Use previousRelease.dependencies and previousRelease.devDependencies from the changelog to set ALL starting dependency versions, not just hydrogen:
if (previousRelease.dependencies) {
for (const [dep, ver] of Object.entries(previousRelease.dependencies)) {
packageJson.dependencies[dep] = ver;
}
}
if (previousRelease.devDependencies) {
for (const [dep, ver] of Object.entries(previousRelease.devDependencies)) {
packageJson.devDependencies[dep] = ver;
}
}
| renderConfirmationPrompt: vi.fn(() => Promise.resolve(true)), // Always confirm | ||
| renderInfo: vi.fn(() => {}), // Mock renderInfo to silence output | ||
| renderSuccess: vi.fn(() => {}), // Mock renderSuccess to silence output | ||
| renderConfirmationPrompt: vi.fn(() => Promise.resolve(true)), |
There was a problem hiding this comment.
(threading)
This also eliminates (without replacement) a few other validations/checks of the changelog.json file:
I definitely think we should have these validations, though this test doesn't feel like the place to put it imo. I think we should just have a CI step that ensures the JSON formatting is correct.
Or even better but less trivial would be to use a Zod schema to validate this JSON file
There was a problem hiding this comment.
agree on the schema validation for the json file
but i think our tests were too flaky to begin with
what do we want to test?
my educated guess is:
- whether changelog.json matches a specific format
- whether we can read the package.json from a project
- whether we can detect what "hydrogen packages" version a package.json has
- whether we can detect if they are on the latest
- whether we can modify package versions in a package.json
- whether we can call the install command on behalf of the user
for none of those we need to fetch an external changelog from a backend
but our functions in upgrade.ts are so bound together and filesystem dependant, that makes testing really hard
pseudocode for what i imagine would be easy to test for all those cases:
function upgrade(packageJsonService, changelog) {
const versions = getHydrogenPackageVersions(packageJsonService);
const needUpgrading = getOutdatedPackageVersions(versions, changelog);
if (needUpgrading.length === 0) return;
await updatePackageVersions(packageJsonService);
await installPackages();
return;
}(extra points if we use Effect, or at least errors as values)
maybe we should push a refactor of upgrade as tech debt?
There was a problem hiding this comment.
but i think our tests were too flaky to begin with
definitely, though i think it's also important to consider the specific flaky factors. is there a world where we could make those tests NOT flaky without compromising on the safety/validation of our changes? i think there is!
for none of those we need to fetch an external changelog from a backend
this IS what happens though when users actually run h2 upgrade, so it is important to have E2E validation that this is working imo.
There was a problem hiding this comment.
this is a long comment and there are discussions to be had - I booked a meeting tomorrow with all of us to discuss/align synchronously :)
TLDR: cleanup of test infra = good, but I am very concerned with this PR's approach to validating the upgrade command. There was a huge issue in the Remix --> RR7 upgrade last year (before any upgrade tests existed) that the old/current approach would've caught. This new approach wouldn't have caught that, which is why I want to discuss/maybe rethink this.
The cleanup of the test infrastructure (removing process.env mutation, using inTemporaryDirectory, better error messages when packages aren't published) is genuinely good!
But I want to step back and ensure we're aligned on the overall direction before we merge this.
I think this PR is attacking the problem at the wrong level.
The previous tests were flaky, yes - but the flakiness was primarily in the test infrastructure (parsing/validation logic, dev server lifecycle, port conflicts), not in the upgrade simulation itself. The approach of scaffolding a real historical project and running the upgrade command against it was the right idea. It was the implementation details around that approach that needed fixing.
This PR solves the flakiness by fundamentally changing what we're testing. Instead of "scaffold a real old project → upgrade it → verify it works," we now have "copy the current skeleton → change some version numbers → upgrade it → verify it works." These are very different tests with very different confidence levels.
Here's the concrete problem. The new scaffoldProjectAtVersion copies the current skeleton template and overlays dependency versions from the previous changelog entry. But changelog entries are sparse - they only list packages that changed in that specific release. For example, the 2025.4.2 entry only specifies @shopify/hydrogen and @shopify/cli. It doesn't list react-router, @remix-run/dev, vite, or any of the ~30 other ecosystem dependencies.
This means for a 2025.4.2 → 2025.5.1 upgrade test (Remix → React Router 7):
- The test copies the current skeleton, which already has
react-router@7.x,@react-router/dev, and no@remix-run/*packages - It overlays the
2025.4.2changelog deps, which only overwrites@shopify/hydrogenand@shopify/cli - The simulated "old project" ends up with the current RR7 file structure, current RR7 dependencies, and just an older hydrogen version
- The upgrade command runs against this hybrid project and "succeeds" - but it never had to handle the actual Remix → RR7 migration
The build validation at the end (npm run build) is only meaningful if the starting project actually represents what a merchant has. Building a project that was already structurally at the latest version proves very little.
We have a real-world example of why this matters. During the 2025.4.0 → 2025.5.0 upgrade (Remix → React Router 7), the old/current upgrade tests (which didn't yet exist) would've caught that the upgrade command was failing because React Router was already a peer dependency in the 2025.4.0 project, and the upgrade command's install steps didn't use --force - npm couldn't resolve the dependency conflict. That signal told us we needed to add an uninstall step to the Hydrogen CLI before the upgrade could work. This PR's approach would have never caught that, because the simulated "old project" would have already had the RR7 dependency structure - @remix-run/dev would have been absent and react-router would have already been at v7. The peer dependency conflict would never have surfaced. We would have shipped a broken upgrade path to merchants if we were relying on this PR's approach.
What I think we should do instead:
The changelog validation cleanup and the test infrastructure improvements (vi.stubEnv, inTemporaryDirectory, fail-instead-of-skip) are good - let's keep those. But I think we should fix the scaffolding approach rather than replace it with something that doesn't actually test upgrades. The old findCommitForVersion with 3 fallback strategies and git archive was over-engineered, but the goal of getting a real historical skeleton was correct.
Some high-level ideas:
- Simplify the git-based scaffolding - we already version the skeleton template. we could probably just find the right commit based on the tag.
- Store skeleton snapshots - as part of the release process, archive the skeleton's
package.json(or the full skeleton) so tests can reference it without git archaeology.
I don't want us to merge this and consider "upgrade command is validated" done, because what we'd actually have is "upgrade command can modify version numbers in a package.json that's already structurally current." That's a much weaker guarantee than what we had before, even accounting for the flakiness.
Thoughts?
There was a problem hiding this comment.
this IS what happens though when users actually run
h2 upgrade, so it is important to have E2E validation that this is working imo.
yes, but they are different things to test
this unit test can quickly and efficiently ensure our piping is working
then a CDC (not a unit test) can call the external endpoint and validate if it matches the schema
this organizational change makes it easy to now test independent parts without testing the entire system at once
in summary:
- write unit tests to ensure the bullet points i mentioned
- write one CDC test to fetch changelog and validate
i dont want us to block moving forward, but i also dont want us to have hard to work with tests and bad examples for LLM!
Edit: I read the next comment (your long one) and I agree now! I think we are trying to test everything in one and not testing what matters
my new suggestion is: make units testable, write CDC, but also add an e2e for real world scenario of scaffolding an old app and upgrading it, maybe even as a matrix if we want to account for the past 1 year of updates
There was a problem hiding this comment.
Yeah I'm still happy to have "fast" tests in addition to the proper E2E test(s), and I think that would be helpful. I just don't want to replace the existing E2E test with "fast" tests
| ); | ||
| } | ||
| } | ||
| process.env.FORCE_CHANGELOG_SOURCE = 'local'; |
There was a problem hiding this comment.
these lines set FORCE_CHANGELOG_SOURCE, SHOPIFY_HYDROGEN_FLAG_FORCE, and CI on process.env but never restore them.
Why it matters: These leak to subsequent tests in the same worker. The sibling upgrade.test.ts handles this correctly with save/restore.
Fix: Use vi.stubEnv() (auto-restores on teardown) or save/restore in afterEach.
| async function scaffoldHistoricalProject(commit: string): Promise<string> { | ||
| const tempDir = await mkdtemp(join(tmpdir(), 'hydrogen-upgrade-test-')); | ||
| const projectDir = join(tempDir, 'test-project'); | ||
| const projectDir = await scaffoldHistoricalProject(fromVersion); |
There was a problem hiding this comment.
The function no longer scaffolds a historical project — it copies the current skeleton and changes the version. I think we should rename this to something like scaffoldProjectAtVersion(version) or createTestProjectWithVersion(version).
| } | ||
| } | ||
| async function scaffoldHistoricalProject(version: string): Promise<string> { | ||
| const tempDir = await mkdtemp(join(tmpdir(), 'hydrogen-upgrade-test-')); |
There was a problem hiding this comment.
from Claude, seems reasonable
mkdtempcreates directories that are never cleaned up. UseinTemporaryDirectoryfrom@shopify/cli-kit/node/fs(like the sibling test does) or add cleanup inafterAll.
| if (!reinstalledDeps[dep]) { | ||
| expect(packageJson.dependencies?.[dep]).toBeUndefined(); | ||
| } | ||
| if (!hasUpgradeSteps) { |
There was a problem hiding this comment.
Let's add a comment here (explaining that projects with manual upgrade steps may not build until those steps are applied) about why we are doing this since it's not necessarily obvious. .
| }); | ||
| } | ||
| } | ||
| }, 180000); |
There was a problem hiding this comment.
(threading)
If someone submitted a changelog.json PR before the package(s) being upgraded have been released, the upgrade tests would skip and CI would appear successful, but we would have zero guarantees the change actually works.
Imo this test should fail if any packages specified haven't been released yet. This could be due to a genuine bug/typo, or because our packages haven't been released yet. Either way, I think this test failing if the package version doesn't exist would be helpful
|
|
||
| async function scaffoldHistoricalProject(commit: string): Promise<string> { | ||
| const tempDir = await mkdtemp(join(tmpdir(), 'hydrogen-upgrade-test-')); | ||
| async function scaffoldProjectAtVersion( |
There was a problem hiding this comment.
non-blocking, out of scope for this PR: scaffolding approach creates an unrealistic starting project.
scaffoldProjectAtVersion() copies the current skeleton and overlays dependency versions from the previous changelog entry. But changelog entries are sparse - the latest release (2025.10.0) lists only 2 of the skeleton's 40 total dependencies. Even the most comprehensive release (2025.7.0) covers only ~22%.
This means the test can pass even when the actual upgrade path is broken for merchants.
Concrete example: During the upgrade to 2025.7.0 (Remix -> React Router 7), the old approach would have caught that the upgrade command failed because React Router was already a peer dependency in the old project, and npm install couldn't resolve the conflict. The new approach would miss this because the simulated "old project" would already have the RR7 file structure.
Additional edge case: The dependency overlay only writes deps that exist in the release entry. If a dependency in the previous release doesn't exist in the current skeleton's package.json (because it was removed in a later version), the overlay silently does nothing.
I think we should fix the scaffolding approach rather than replace it:
- Simplify the git-based scaffolding - find the right commit based on the release tag
- Store skeleton snapshots - as part of the release process, archive the skeleton's package.json
- At minimum, apply ALL dependency versions from the previous release entry, not just those in the current skeleton
| /^[\^~]?0\.0\.0-next-[a-f0-9]+-\d+$/, | ||
| ); | ||
| describe('upgrade flow', () => { | ||
| it('validates changelog.json structure', async () => { |
There was a problem hiding this comment.
non-blocking, out of scope for this PR: changelog schema validation significantly weakened.
The old test validated: allowed fields per release entry (no rogue fields), required fields (title, version, hash, commit, etc.), URL format validation, calver version format, feature/fix step structure, base64-encoded code blocks, dependenciesMeta structure, and removeDependencies/removeDevDependencies as string arrays.
The new test only checks that releases exist, the latest version matches a calver regex, dependency values look version-like, and features/fixes are arrays. A malformed changelog entry with rogue fields, invalid commit URLs, or broken base64 code blocks would now pass.
The changelog schema validation is orthogonal to the integration test. I think it should be extracted as an independent concern - either a separate test file or a standalone validation script. Ideally using Zod for the schema, which would serve double duty: validating changelog entries in tests AND providing a reusable schema that h2 upgrade itself could use to validate fetched data at runtime.
| } | ||
| }); | ||
| }); | ||
| }, 600000); |
There was a problem hiding this comment.
non-blocking: I think 300,000ms (5 minutes) would be more appropriate here. The old test had 180,000ms while doing MORE work (dev server, typecheck). This test does less work but has a 3.3x longer timeout. If npm install hangs due to a registry outage, this burns 10 minutes of CI compute per run before producing a failure signal.
| latestRelease.devDependencies?.['@shopify/mini-oxygen'], | ||
| ], | ||
| ]); | ||
| it('upgrades from previous version to latest', async () => { |
There was a problem hiding this comment.
non-blocking: this integration test is a single ~170-line function with ~8 distinct verification concerns: checking package publication, scaffolding, verifying initial state, stubbing env, running upgrade, verifying deps, checking removals, verifying upgrade guide, running install, and conditionally running build.
Extracting named helpers like verifyDependenciesUpgraded(packageJson, expectedRelease) would make each concern independently comprehensible without sacrificing the linear flow.
| }); | ||
|
|
||
| // Checks if all specified packages are published to npm | ||
| afterEach(() => { |
There was a problem hiding this comment.
nice improvement! The vi.stubEnv() + afterEach(() => vi.unstubAllEnvs()) pattern is much cleaner than the old direct process.env mutation that leaked to sibling tests.
|
closing in favour of #3527 |
WHY are these changes introduced?
Fixes https://github.com/Shopify/developer-tools-team/issues/1045
The existing upgrade flow tests were unreliable. Tests would intermittently fail in CI due to dev server flakiness (port conflicts, spawn/kill race conditions, HTTP timeouts), making it difficult to validate changelog.json entries/upgrade flow. This should reduce toil for future releases.
WHAT is this pull request doing?
This PR refactors the upgrade flow tests to be more focused, and fail faster to ensure that new changelog.json entries enable successful upgrades.
The test validates:
Key Decisions:
h2 upgradecommandHOW to test your changes?
Run test locally:
Expected: Both tests pass
Verify CI workflow:
- Modify docs/changelog.json (add a comma or whitespace)
- Push to a branch
- The test-upgrade-flow.yml workflow should trigger automatically
- Check that it passes
Checklist