Skip to content

workflows: Run image refreshes in Testing Farm#8758

Closed
martinpitt wants to merge 6 commits intomainfrom
image-refresh-tf
Closed

workflows: Run image refreshes in Testing Farm#8758
martinpitt wants to merge 6 commits intomainfrom
image-refresh-tf

Conversation

@martinpitt
Copy link
Copy Markdown
Member

@martinpitt martinpitt commented Feb 25, 2026

Building RHEL images is the only bit in our CI that actually requires
the internal RH network. Everything else could (and probably soon will)
move to some public infrastructure.

Testing Farm supports nested virtualization in principle, even a bit
better on the RHEL ranch than the public one. It does not currently
scale very well (until it moves to EC2's brand new nested
virtualization). But our image builds are not very resource hungry, just
a dozen 30-minute runs a week. So let's move them there!

Add an "issue-scan" GitHub workflow which replaces our webhook/AMQP/bot
worker logic. On appropriate changes to an issue/PR, run issue-scan,
and if there is a resulting task, schedule a plans/job-runner.fmf run
on Testing Farm. That creates a suitable job-runner configuration, and
then runs the job inside a local podman tasks container.

https://issues.redhat.com/browse/COCKPIT-1772


I developed this on my fork. I kept pushing to my main branch, and doing the refreshes in martinpitt#36 then martinpitt#37 then martinpitt#38 -- that's because each actually successful run then turns the issue into a PR, and conflicts appear. The last one is a clean run how it's actually supposed to look like. The two "image-refresh cirros done" and "Success. Log:.." are the same as we always had -- they normally come from cockpituous, but for my fork that of course is a token from myself.

The third, and new, comment is this one that links to the Testing Farm run. We don't strictly need that -- we don't have a counterpart on our current CI, other than "pitti ssh's into a bot and reads the journal", but I think it's useful for at least a little while until this stabilizes. Note that this is running the RHEL ranch, i.e. you need VPN to access the logs (again, just like with our current CI).

I tested this from both an issue and a PR, and with both editing the description as well as (un)labelling "bot", all good.

This requires a new "image-build" GitHub environment for the required secrets. I committed "Add GitHub env for image refreshes" to our internal ci-secrets repo, and ran it (with the obvious cockpit-project martinpitt change) to deploy the env into my fork.

After review, but before landing:

After landing:

@martinpitt
Copy link
Copy Markdown
Member Author

martinpitt commented Feb 25, 2026

ah yes, unit tests now fail test_mock_image_refresh -- that's actually intended. I had a look, and this just needs to be dropped, there's nothing to salvage there; that is all about AMQP and deployment.

But in the meantime I'd appreciate some reviews/comments, thanks!

Copy link
Copy Markdown
Member Author

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self-review with fresh eyes on the Morgen danach

Comment thread issue-scan Outdated
assert pika is not None, "pika module is required for --amqp"
assert distributed_queue is not None, "distributed_queue module is required for --amqp"

for result in scan(opts.issues_data, opts.repo):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting the loop between AMQP and non-AMQP was an intermediate step when I tried to move the imports into the if:. But that didn't work, so I made them global. However, given that this AMQP code disappears entirely after landing this PR, I'm actually in favor of keeping it split that way.

Comment thread .github/workflows/issue-scan.yml Outdated
Comment thread .github/workflows/issue-scan.yml Outdated
Comment thread .github/workflows/issue-scan.yml Outdated
Comment thread .github/workflows/issue-scan.yml Outdated
Comment thread .github/workflows/issue-scan.yml Outdated
Comment thread plans/job-runner.sh Outdated
Comment thread plans/job-runner.sh Outdated
@martinpitt martinpitt force-pushed the image-refresh-tf branch 2 times, most recently from 04454a9 to 589d1c5 Compare February 26, 2026 08:19
@martinpitt
Copy link
Copy Markdown
Member Author

Pushed fixes for my self-review, tested again in martinpitt#40

@jelly
Copy link
Copy Markdown
Member

jelly commented Feb 26, 2026

So first question is regarding logging:

  • Do all image-refreshes run on the private ranch? It is very useful to point external contributors such as SUSE to image refresh issues of their image.

@jelly
Copy link
Copy Markdown
Member

jelly commented Feb 26, 2026

So first question is regarding logging:

* Do all image-refreshes run on the private ranch? It is very useful to point external contributors such as SUSE to image refresh issues of their image.

Re-reading this, I guess this ain't true as the logs do end up in s3 so will show up but not commented on this issue?

Copy link
Copy Markdown
Member

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like using GitHub actions for stuff like this, but there might be a silver lining here...

Image refreshes were the last of the classical "tasks". Will this let us kill all the "infra" in lib/task.py?

Comment thread issue-scan
@martinpitt
Copy link
Copy Markdown
Member Author

So first question is regarding logging:

* Do all image-refreshes run on the private ranch? It is very useful to point external contributors such as SUSE to image refresh issues of their image.

As you see on e.g. martinpitt#38 the S3 logs stay as they have always been, and they are public. The bit that happens on the private ranch is the actual job-runner invocation, which currently runs in our PSI OpenStack (which is also private to RHEL). Logs from infra failures have never been available to the public.

As soon as TF starts supporting the new EC2 nested virtualization, we could move non-RHEL image builds out to the public ranch, but right now that fails far too often (it simply isn't scalable to run all these on bare-metal EC2 instances). But this is more of a bonus/cost efficiency issue.

martinpitt added a commit to cockpit-project/cockpituous that referenced this pull request Feb 26, 2026
We will soon move image refreshes out of our own CI into GitHub/Testing
Farm [1]. That PR [2] disables the `issue-scan` portion of `run-tests`
to avoid building images twice.

[1] https://issues.redhat.com/browse/COCKPIT-1772
[2] cockpit-project/bots#8758
@jelly
Copy link
Copy Markdown
Member

jelly commented Feb 26, 2026

So first question is regarding logging:

* Do all image-refreshes run on the private ranch? It is very useful to point external contributors such as SUSE to image refresh issues of their image.

As you see on e.g. martinpitt#38 the S3 logs stay as they have always been, and they are public. The bit that happens on the private ranch is the actual job-runner invocation, which currently runs in our PSI OpenStack (which is also private to RHEL). Logs from infra failures have never been available to the public.

As soon as TF starts supporting the new EC2 nested virtualization, we could move non-RHEL image builds out to the public ranch, but right now that fails far too often (it simply isn't scalable to run all these on bare-metal EC2 instances). But this is more of a bonus/cost efficiency issue.

Yup, realized that when reading the code!

Comment thread .github/workflows/issue-scan.yml Outdated
TF_RESPONSE=$(curl -s --json @tf-request.json https://api.dev.testing-farm.io/v0.1/requests)

# Sadly, the response does not include the artifacts URL, the only
# useful thing is the ID; so we have to hardcode it
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we have a bug report for?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just filed one: https://issues.redhat.com/browse/TFT-4379 . I added that to the comment (pushes are cheap here)

Copy link
Copy Markdown
Member

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another look at this today and once I started seeing the quoting problems I started to see them everywhere....

pull-requests: write
steps:
- name: Clone repository
uses: actions/checkout@v5
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like a comment here about what this step is doing because it seems like "boilerplate" but we have to think about it very carefully.

For the "issue" side, it obviously checks out the main branch (because what would it do otherwise), but what's going on for PRs? Is your intent to check out the HEAD of the proposed branch, the merge commit, main, or something else? Of course this is pull_request and not pull_request_target so we're getting the proposed branch, but it's also doing the merge. That's probably kinda right because if this was in job-runner I think we'd rebase onto main, right? And does it even matter? Is this all going to get checked out again by testing farm and/or job-runner? Is this just about tests-scan? Could we ask 🤖 to rewrite that in JS instead to simplify this and alleviate the quoting issues we have? We should definitely think about this explicitly and document our intent here...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more, issues-scan checks the allowlist using the list from the checked out git repository, which in this case would be the proposed branch, no? I think this is a backdoor to allow any unauthorized user to run any image refresh (or any code at all) inside of the RH network... I think this is the usual reason you avoid giving credentials to workflows running on pull_request (and not pull_request_target)...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my intent was to check out the proposed branch, so that changes to this workflow are self-validating. I.e. "what we do with all workflows", unless we have an explicit reason not to.

But this isn't a backdoor - in order to run the workflow in a meaningful way (i.e. with access to the secrets env), it has to come from origin, and only "trusted" people can push there. That's roughly (or perhaps even exactly?) the same privilege as for setting the "bots" label.

You really don't want pull_request_target here, that would be a backdoor.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true even if we directly configure an environment:?

Comment thread .github/workflows/issue-scan.yml Outdated
Comment on lines +39 to +40
GITHUB_BASE=${{ github.repository }} \
SCAN_OUT=$(./issue-scan --issues-data '${{ toJSON(github.event) }}')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quoting please on the SCAN_OUT assignment...

also, I think quoting toJSON output like that in single quotes breaks if any string in the JSON contains a ', doesn't it? Can we use $GITHUB_EVENT_PATH here instead of trying to feed this in through the shell's parser?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYM? x=$(...) doesn't need quoting, but if it gives you a warm fuzzy feeling I'm happy to add it 😁

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about $GITHUB_EVENT_PATH, that definitively sounds interesting and preferable. We can teach issue-scan to read it from a file.

Note that we have passed it through a shell forever, we know the structure. But happy to improve here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about a=$(cmd). Funny thing is, I think I knew that before but forgot it.

Comment thread .github/workflows/issue-scan.yml Outdated
Comment on lines +68 to +76
# Build Testing Farm API request using jq for proper escaping
echo '${{ steps.scan.outputs.scan_output }}' | jq \
--arg api_key '${{ secrets.TESTING_FARM_RH_TOKEN }}' \
--arg git_url '${{ steps.gitref.outputs.git_url }}' \
--arg git_ref '${{ steps.gitref.outputs.git_ref }}' \
--arg gh_token '${{ secrets.COCKPITUOUS_TOKEN }}' \
--arg s3_eu '${{ secrets.S3_KEY_EU }}' \
--arg s3_us '${{ secrets.S3_KEY_US }}' \
--arg s3_logs '${{ secrets.S3_KEY_LOGS }}' \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the quoting is scaring me again.... I wonder if we could use env: to get these into the command instead...

See https://docs.github.com/en/actions/reference/security/secure-use#good-practices-for-mitigating-script-injection-attacks for the suggested approaches...

Comment thread .github/workflows/issue-scan.yml Outdated
Comment on lines +120 to +121
HUMAN=$(echo '${{ steps.scan.outputs.scan_output }}' | jq -r '.human')
JOB_JSON=$(echo '${{ steps.scan.outputs.scan_output }}' | jq '.job')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quotes quotes quotes...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same same same --- if you prefer, I add them, but shell doesn't need them.

Comment thread .github/workflows/issue-scan.yml Outdated
<summary>Job JSON</summary>

\`\`\`json
$JOB_JSON
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... probably there's not \nCOMMENT_EOF\n in here...

@allisonkarlitskaya allisonkarlitskaya marked this pull request as draft March 9, 2026 15:15
@allisonkarlitskaya
Copy link
Copy Markdown
Member

allisonkarlitskaya commented Mar 9, 2026

This is back to the drawing board. I just had a conversation with Martin based on my "burn it all down and rewrite in JS" idea. This idea is definitely an improvement but maybe an even bigger one (in the name of local testability) would be to write it as one big Python script, run as a single step, which does the JSON/REST stuff internally, instead of this weird bouncing around of JSON through various shell expansions in the workflow script.

We also discussed that the allowlist stuff is redundant if the script only works for non-fork PRs since the people who can propose bots PRs from origin branches is strictly less than the list of people on the allow list. We might want to allow proposing from forks in the future but for the time being we'll drop the allowlist change to keep things simple.

What happens now: I'm gonna take a look at the "rewrite in JS" vs "rewrite in Python" options over the next couple of days and hopefully we find a nice way forward.

Building RHEL images is the only bit in our CI that actually requires
the internal RH network. Everything else could (and probably soon will)
move to some public infrastructure.

Testing Farm supports nested virtualization in principle, even a bit
better on the RHEL ranch than the public one. It does not currently
scale very well (until it moves to EC2's brand new nested
virtualization). But our image builds are not very resource hungry, just
a dozen 30-minute runs a week. So let's move them there!

Add an "issue-scan" GitHub workflow which replaces our webhook/AMQP/bot
worker logic. On appropriate changes to an issue/PR, run `issue-scan`,
and if there is a resulting task, schedule a `plans/job-runner.fmf` run
on Testing Farm. That creates a suitable `job-runner` configuration, and
then runs the job inside a local podman tasks container.

https://issues.redhat.com/browse/COCKPIT-1772

Co-authored-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Copy link
Copy Markdown
Member Author

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! The unit tests fail on some ruff import ordering nag, but otherwise looks great!

Comment thread issue-scan
return url, ref


async def submit_job_runner(event: JsonObject, result: QueueEntry, config_file: str | None) -> int:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only ever returns 0 or crashes. Can just as well be -> None. Or does asyncio.run care about an int?

martinpitt added a commit that referenced this pull request Mar 17, 2026
@martinpitt
Copy link
Copy Markdown
Member Author

  • I deployed the env with the command from ci-secrets.git's README, it exists now and I checked the box in the description.
  • I created images: enable test.thing support on RHEL 8/9 #8221 as a PR against this branch. It did build the image (log), but that was actually from our own CI. So (obviously) that didn't take the new run-queue into account, as our own CI uses that from main.
  • The workflow failed on missing yarl, it has to install this.

So I pushed this branch to my fork's main again, and opened martinpitt#41 . The initial version failed as well on yarl, and on httpx -- why do we have to use these new-fangled libraries to do a single URL call that urllib is perfectly able to do? 🤔

But after that it fails with

/home/runner/.config/cockpit-dev/job-runner.toml: Invalid initial character for a key part (at line 3, column 12)

I think it doesn't like the dictionary syntax here?

          s3-keys = {
               'eu-central-1.linodeobjects.com'='${{ secrets.S3_KEY_EU }}',

@martinpitt
Copy link
Copy Markdown
Member Author

Now the unit tests fail with

issue-scan:228: error: Argument 2 to "submit_to_testing_farm" has incompatible type "JobSpecification"; expected "JsonObject" [arg-type]

but I leave that to you @allisonkarlitskaya . More importantly, after the series of fixups I am now getting this API post permission error. I think issue-scan ought to use COCKPITUOUS_TOKEN, which on my fork's env is a token for my fork's project. I even just refreshed it to make sure. It seems it did trigger TF, as that status posting happens afterwards, but there's no way to get the TF URL.

I even gave it public_repo permissions (which I don't really want to do), but it still does not work. It would be really nice to use the default GITHUB_TOKEN for that status posting, but that's difficult.

At this point I need to call "timeout", sorry.

@martinpitt
Copy link
Copy Markdown
Member Author

@allisonkarlitskaya
Copy link
Copy Markdown
Member

(test comment)

@jelly
Copy link
Copy Markdown
Member

jelly commented Mar 23, 2026

As far as I know this runs in the private ranch? So this would run into the new repository allowlist

@allisonkarlitskaya
Copy link
Copy Markdown
Member

Superseded by #8804, #8836, #8848, #8909

@martinpitt martinpitt deleted the image-refresh-tf branch April 15, 2026 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants