Drop gitlab dynamic pipelines and refactor imgtestlib [HMS-9712] #2359

Draft
achilleas-k wants to merge 18 commits into osbuild:main from achilleas-k:ci/no-dynamic-pipelines

Conversation

@achilleas-k
Member

@achilleas-k achilleas-k commented May 21, 2026

This PR simplifies image building and testing in GitLab CI by removing the dynamic pipeline generation. Instead, all images for a given distribution and architecture are built and tested on the same runner.

The imgtestlib has been refactored into a module with multiple files for easier navigation, as it was getting too big for a single file.

Some further improvements I'd like to do after this is merged:

  • Async "touch" for S3 objects.
  • Async boot tests.
  • Return errors from build and boot functions. Currently, a build only fails because the underlying sp.run() shell command fails. I'd like to capture those errors and handle them gracefully instead, so that we can generate clean failure messages. It would also make it possible to continue with other image builds when a build or boot test fails.
  • Merge vmtest into imgtestlib.
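
The third item above could look roughly like the following. This is a minimal sketch and not code from imgtestlib: `BuildResult`, `run_build`, `build_all`, and `failure_report` are hypothetical names. Each command runs through `subprocess.run()` without `check=True`, the failure is recorded instead of raised, and the loop carries on with the next image.

```python
import subprocess as sp
import sys
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BuildResult:
    """Outcome of one build command (hypothetical type, for illustration)."""
    name: str
    returncode: int
    stderr: str

    @property
    def ok(self) -> bool:
        return self.returncode == 0


def run_build(name: str, cmd: List[str]) -> BuildResult:
    """Run one command and record a failure instead of raising."""
    try:
        proc = sp.run(cmd, stdout=sp.PIPE, stderr=sp.PIPE, text=True, check=False)
    except OSError as exc:  # for example, the executable does not exist
        return BuildResult(name, 127, str(exc))
    return BuildResult(name, proc.returncode, proc.stderr.strip())


def build_all(builds: Dict[str, List[str]]) -> List[BuildResult]:
    """Run every build, even if an earlier one failed."""
    return [run_build(name, cmd) for name, cmd in builds.items()]


def failure_report(results: List[BuildResult]) -> List[str]:
    """One clean message per failed build."""
    return [f"{r.name}: failed with exit code {r.returncode}"
            for r in results if not r.ok]
```

A caller would exit non-zero only after all builds have run, if `failure_report()` returns anything.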

Closes #1703

@achilleas-k achilleas-k requested review from a team and thozza as code owners May 21, 2026 15:12
@achilleas-k achilleas-k requested review from lzap and supakeen May 21, 2026 15:12
@achilleas-k
Member Author

The PR moves the core parts of the test scripts into the imgtestlib module. The boot-image script uses Python's match statement, which isn't available on EL9. This wasn't an issue before because boot-image was only ever run on the CI runners, which are Fedora 42. Now that the core functionality is part of the importable module though, we need to rewrite it to run on older Python versions.

We should be testing builds on EL9 as well, so I should do this regardless.
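
For illustration, a `match` statement (which needs Python 3.10 or newer) can usually be rewritten as an `if`/`elif` chain that also runs on Python 3.9. The function and the image type names below are hypothetical and not taken from boot-image.

```python
def boot_method(image_type: str) -> str:
    """Pick a boot test method for an image type (hypothetical names).

    Python 3.10+ version, shown only for comparison:

        match image_type:
            case "ami" | "ec2":
                return "aws"
            case "qcow2":
                return "qemu"
            case _:
                raise ValueError(f"unsupported image type: {image_type}")
    """
    # Equivalent logic that also works on Python 3.9.
    if image_type in ("ami", "ec2"):
        return "aws"
    elif image_type == "qcow2":
        return "qemu"
    else:
        raise ValueError(f"unsupported image type: {image_type}")
```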

@achilleas-k achilleas-k force-pushed the ci/no-dynamic-pipelines branch 2 times, most recently from f000c79 to 3781ada Compare May 21, 2026 16:08
@supakeen
Member

So; since there are a lot of failures I went through them:

  1. 7 jobs succeeded.
  2. 20+ jobs got their instance killed.
  3. Jobs fail when testing installers, as they need access to KVM and it isn't available.
  4. A few failures due to: time="2026-05-21T16:30:35Z" level=fatal msg="Error parsing image name \"docker://None\": invalid reference format: repository name must be lowercase"
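
The `docker://None` in item 4 is what Python produces when a `None` value is formatted into a string, so one plausible cause (an assumption, not confirmed in this thread) is a missing container reference being interpolated into the image name. A minimal sketch of a guard, with a hypothetical function name and example reference:

```python
from typing import Optional


def container_url(ref: Optional[str]) -> str:
    """Build a docker:// URL, refusing a missing reference."""
    if not ref:
        raise ValueError("no container reference found for this image")
    return f"docker://{ref}"


# Without a guard, formatting None silently yields the string seen in the log.
assert f"docker://{None}" == "docker://None"
```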

@achilleas-k
Member Author

> So; since there are a lot of failures I went through them:

Thanks for going through them!!

> 1. 7 jobs succeeded.

Not great.

> 2. 20+ jobs got their instance killed.

I suspect this will be the biggest issue with this change.

> 3. Jobs fail when testing installers, as they need access to KVM and it isn't available.

Ugh, right, yeah. I guess we're going to need to run everything on KVM-enabled runners since every distro has an installer.

> 4. A few failures due to: `time="2026-05-21T16:30:35Z" level=fatal msg="Error parsing image name \"docker://None\": invalid reference format: repository name must be lowercase"`

I think I fixed that? Anyway, definitely fixable.

@lzap
Contributor

lzap commented May 25, 2026

Observation: average job time was 1 hour and the slowest one was 4 hours.

> 2. 20+ jobs got their instance killed.

We must start tracking these. I wonder whether we actually pay more than we would without spot instances, because when AWS kills a spot instance for capacity reasons, we still pay for the time on the clock. AWS sends a signal 2 minutes before the termination, so we can mark those jobs for later inspection and statistics.
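
A minimal sketch of how a job could detect the interruption notice, assuming the runner can reach the EC2 instance metadata service (IMDSv2). The `spot/instance-action` path returns 404 until an interruption is scheduled and a small JSON document afterwards. The function names are hypothetical, and how a job would actually be marked is left open.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw spot/instance-action document, or None if there is none."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as exc:
        if exc.code == 404:  # no interruption scheduled
            return None
        raise
    except OSError:  # not on EC2, or the metadata service is unreachable
        return None


def parse_interruption(document: Optional[str]) -> Optional[dict]:
    """Extract the action and time from an instance-action document."""
    if not document:
        return None
    data = json.loads(document)
    return {"action": data.get("action"), "time": data.get("time")}
```

A job could poll `fetch_instance_action()` every few seconds in the background and print a recognisable marker line when `parse_interruption()` returns something, so that killed jobs can be counted later.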

- E252 missing whitespace around parameter equals
- E713 test for membership should be 'not in'
- E302 expected 2 blank lines, found 1

Make the image building scaffolding reusable.

Builds all modified images for a specific distro and (host)
architecture.

This script is essentially the same as the generate-build-config script,
only instead of generating a gitlab-ci file with the images that need to
be rebuilt, it runs any required builds in sequence.

Print a log section start and end line when building an image or
performing other operations.  If running in GitLab CI, use their custom
collapsible sections escape sequences [1] to create collapsible build
logs for each image.
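
A minimal sketch of such section markers, assuming GitLab CI is detected through the predefined `GITLAB_CI` variable. The helper names and the plain-text fallback format are hypothetical and not the ones used in imgtestlib.

```python
import os
import re
import time
from contextlib import contextmanager


def _section_id(name: str) -> str:
    """GitLab section names may only contain letters, digits, '_', '.', '-'."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", name)


def section_start(name: str, header: str, timestamp: int, gitlab: bool) -> str:
    if not gitlab:
        return f"--- start: {header} ---"
    return f"\x1b[0Ksection_start:{timestamp}:{_section_id(name)}\r\x1b[0K{header}"


def section_end(name: str, header: str, timestamp: int, gitlab: bool) -> str:
    if not gitlab:
        return f"--- end: {header} ---"
    return f"\x1b[0Ksection_end:{timestamp}:{_section_id(name)}\r\x1b[0K"


@contextmanager
def log_section(name: str, header: str):
    """Wrap a block of work in start and end lines."""
    gitlab = os.environ.get("GITLAB_CI") == "true"
    print(section_start(name, header, int(time.time()), gitlab), flush=True)
    try:
        yield
    finally:
        # The end line is printed even if the wrapped work raises.
        print(section_end(name, header, int(time.time()), gitlab), flush=True)
```

Usage would be `with log_section("build fedora", "Building image"): ...` around each build.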

[1] https://docs.gitlab.com/ci/jobs/job_logs/#custom-collapsible-sections

Move imgtestlib.py to an importable module directory by the same
(logical) name.

The module got too big for one file.

This commit only moves code around and adjusts imports.  Functions are
grouped into submodules by logical functionality.  Groups (submodule
boundaries) were chosen to avoid circular imports.

The .core module could still use some splitting or tidying up, but the
current state is already an improvement.

Importing everything into __init__.py might be unnecessary—we only
really need to import the functions that are used externally—but it's
not very important to reduce the API surface of this internal testing
library.

Make them easier to see in the log.

Builds all modified images that depend on an ostree commit for a
specific distro and (host) architecture.

The script is essentially the same as the generate-ostree-build-config
script, only instead of generating a gitlab-ci file with the images that
need to be rebuilt, it runs any required builds in sequence.

This is very similar to the test-new-manifests script, but it also
handles discovering, downloading, and running ostree containers to
serve the payload ostree commits for derived images (ostree disk images
and installers).

The boot-image script is now a thin wrapper around the new
imgtestlib.boot_image() function, so the general functionality is
reusable.  The boot-image script now simply handles argument parsing and
calls into the function.

The can_boot_test() function has been moved to imgtestlib/core to avoid
circular imports.  We might need to reorganise the module at a later
date.

The upload-results script is now a thin wrapper around the new
imgtestlib.upload_results() function, so the general functionality is
reusable.  The upload-results script now simply handles argument parsing
and calls into the function.

Update the gitlab-ci.yml generator to run the new tests.

Generate the new config.

Let's test everything!

boto3 is only needed when interfacing with aws.  vmtest is imported by
imgtestlib, which we often use for a lot of other smaller tasks, like
setting up the osbuild repo.  We could install boto3 whenever we need
it, but it's simpler to allow importing without it.

Since we moved core parts of the test scripts into the imgtestlib
module, the boot_image function now needs to be importable by all the
distros we support and test on, including EL9 which only has Python 3.9.
This wasn't an issue before because boot-image was only ever run on the
CI runners, which are currently Fedora 42.  Now that the core
functionality is part of the importable module though, we need to
rewrite it to run on older Python versions.

@achilleas-k achilleas-k force-pushed the ci/no-dynamic-pipelines branch from 3781ada to 57fe49f Compare May 27, 2026 18:12
@achilleas-k
Member Author

Rebased on main but deleted .gitlab-ci.yml. I want to try a few things before rerunning the pipelines. Setting to draft.

@achilleas-k achilleas-k marked this pull request as draft May 27, 2026 18:13
Development

Successfully merging this pull request may close these issues.

Reduce runner resource usage when doing full rebuilds in CI

3 participants