workflows: Run image refreshes in Testing Farm by martinpitt · Pull Request #8758 · cockpit-project/bots

martinpitt · 2026-02-25T15:05:33Z

Building RHEL images is the only bit in our CI that actually requires
the internal RH network. Everything else could (and probably soon will)
move to some public infrastructure.

Testing Farm supports nested virtualization in principle, even a bit
better on the RHEL ranch than the public one. It does not currently
scale very well (until it moves to EC2's brand new nested
virtualization). But our image builds are not very resource hungry, just
a dozen 30-minute runs a week. So let's move them there!

Add an "issue-scan" GitHub workflow which replaces our webhook/AMQP/bot
worker logic. On appropriate changes to an issue/PR, run issue-scan,
and if there is a resulting task, schedule a plans/job-runner.fmf run
on Testing Farm. That creates a suitable job-runner configuration, and
then runs the job inside a local podman tasks container.

https://issues.redhat.com/browse/COCKPIT-1772

I developed this on my fork. I kept pushing to my main branch and ran the refreshes in martinpitt#36, then martinpitt#37, then martinpitt#38. Several issues were needed because each successful run turns the issue into a PR, and conflicts appear. The last one is a clean run that shows how it is supposed to look. The two comments "image-refresh cirros done" and "Success. Log:.." are the same as we always had. They normally come from cockpituous, but on my fork they are of course posted with my own token.

The third, and new, comment is the one that links to the Testing Farm run. We don't strictly need that: our current CI has no counterpart other than "pitti ssh's into a bot and reads the journal". But I think it's useful for at least a little while until this stabilizes. Note that this is running on the RHEL ranch, i.e. you need VPN to access the logs (again, just like with our current CI).

I tested this from both an issue and a PR, and with both editing the description as well as (un)labelling "bot", all good.

This requires a new "image-build" GitHub environment for the required secrets. I committed "Add GitHub env for image refreshes" to our internal ci-secrets repo, and ran it (with the obvious cockpit-project → martinpitt change) to deploy the env into my fork.

After review, but before landing:

Deploy the image-build env
test: Drop image-refresh integration test cockpituous#686

After landing:

Drop the "issues" webhook trigger in bots, we don't need it any more
Drop my test-image-build token and image-build env on my fork
Drop the AMQP imports/code path from issue-scan (not necessary, but cleanup, and git remembers..)

martinpitt · 2026-02-25T16:23:29Z

ah yes, unit tests now fail test_mock_image_refresh -- that's actually intended. I had a look, and this just needs to be dropped, there's nothing to salvage there; that is all about AMQP and deployment.

But in the meantime I'd appreciate some reviews/comments, thanks!

martinpitt

Self-review with fresh eyes on the morning after

martinpitt · 2026-02-26T06:29:16Z

+        assert pika is not None, "pika module is required for --amqp"
+        assert distributed_queue is not None, "distributed_queue module is required for --amqp"
+
+        for result in scan(opts.issues_data, opts.repo):


Splitting the loop between AMQP and non-AMQP was an intermediate step when I tried to move the imports into the if:. But that didn't work, so I made them global. However, given that this AMQP code disappears entirely after landing this PR, I'm actually in favor of keeping it split that way.
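The pattern described here, module-level optional imports that are only checked on the `--amqp` path, might look roughly like the following. This is a minimal sketch and not the actual `issue-scan` code: `require_amqp` and `main` are hypothetical names, and only `pika` is shown (the diff also checks `distributed_queue`).

```python
# Optional dependency: import at module level, but tolerate its absence.
try:
    import pika  # only needed for the legacy --amqp code path
except ImportError:
    pika = None


def require_amqp() -> None:
    """Fail early with a clear message if --amqp is used without pika (hypothetical helper)."""
    if pika is None:
        raise RuntimeError("pika module is required for --amqp")


def main(use_amqp: bool) -> str:
    if use_amqp:
        require_amqp()
        return "amqp"
    # The non-AMQP path never touches pika.
    return "direct"
```

One difference from the diff: an explicit `raise` still fires under `python -O`, whereas `assert` statements are stripped in optimized mode. For code that is about to be deleted anyway, that distinction hardly matters.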

martinpitt · 2026-02-26T08:19:56Z

Pushed fixes for my self-review, tested again in martinpitt#40

jelly · 2026-02-26T10:16:50Z

So first question is regarding logging:

Do all image-refreshes run on the private ranch? It is very useful to point external contributors such as SUSE to image refresh issues of their image.

jelly · 2026-02-26T10:33:04Z

> So first question is regarding logging:
>
> * Do all image-refreshes run on the private ranch? It is very useful to point external contributors such as SUSE to image refresh issues of their image.

Re-reading this, I guess this ain't true as the logs do end up in s3 so will show up but not commented on this issue?

allisonkarlitskaya

I don't like using GitHub actions for stuff like this, but there might be a silver lining here...

Image refreshes were the last of the classical "tasks". Will this let us kill all the "infra" in lib/task.py?

martinpitt · 2026-02-26T12:21:40Z

> So first question is regarding logging:
>
> * Do all image-refreshes run on the private ranch? It is very useful to point external contributors such as SUSE to image refresh issues of their image.

As you see on e.g. martinpitt#38 the S3 logs stay as they have always been, and they are public. The bit that happens on the private ranch is the actual job-runner invocation, which currently runs in our PSI OpenStack (which is also private to RHEL). Logs from infra failures have never been available to the public.

As soon as TF starts supporting the new EC2 nested virtualization, we could move non-RHEL image builds out to the public ranch, but right now that fails far too often (it simply isn't scalable to run all these on bare-metal EC2 instances). But this is more of a bonus/cost efficiency issue.

We will soon move image refreshes out of our own CI into GitHub/Testing Farm [1]. That PR [2] disables the `issue-scan` portion of `run-tests` to avoid building images twice. [1] https://issues.redhat.com/browse/COCKPIT-1772 [2] cockpit-project/bots#8758

jelly · 2026-02-26T14:47:40Z

> > So first question is regarding logging:
> > * Do all image-refreshes run on the private ranch? It is very useful to point external contributors such as SUSE to image refresh issues of their image.
>
> As you see on e.g. martinpitt#38 the S3 logs stay as they have always been, and they are public. The bit that happens on the private ranch is the actual job-runner invocation, which currently runs in our PSI OpenStack (which is also private to RHEL). Logs from infra failures have never been available to the public.
>
> As soon as TF starts supporting the new EC2 nested virtualization, we could move non-RHEL image builds out to the public ranch, but right now that fails far too often (it simply isn't scalable to run all these on bare-metal EC2 instances). But this is more of a bonus/cost efficiency issue.

Yup, realized that when reading the code!

jelly · 2026-02-26T10:20:27Z

+          TF_RESPONSE=$(curl -s --json @tf-request.json https://api.dev.testing-farm.io/v0.1/requests)
+
+          # Sadly, the response does not include the artifacts URL, the only
+          # useful thing is the ID; so we have to hardcode it


Is this something we have a bug report for?

I just filed one: https://issues.redhat.com/browse/TFT-4379 . I added that to the comment (pushes are cheap here)
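Deriving the log location from the request ID, as the code comment describes, could be sketched like this. Everything here is an assumption for illustration: the `id` field name, the helper `artifacts_url`, and the placeholder base URL (the real artifacts host depends on the ranch and is not stated in this thread).

```python
import json

# Assumption: placeholder base URL, not the real Testing Farm artifacts host.
ARTIFACTS_BASE = "https://artifacts.example.invalid"


def artifacts_url(tf_response: str, base: str = ARTIFACTS_BASE) -> str:
    """Build a log URL from the request ID, since the response carries no artifacts link."""
    request_id = json.loads(tf_response)["id"]  # assumed field name
    return f"{base.rstrip('/')}/{request_id}"
```

Keeping the base URL in one named constant at least makes the hardcoded part easy to find once the API returns the link itself.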

allisonkarlitskaya

I took another look at this today and once I started seeing the quoting problems I started to see them everywhere....

allisonkarlitskaya · 2026-03-09T13:23:13Z

+      pull-requests: write
+    steps:
+      - name: Clone repository
+        uses: actions/checkout@v5


I think I'd like a comment here about what this step is doing because it seems like "boilerplate" but we have to think about it very carefully.

For the "issue" side, it obviously checks out the main branch (because what would it do otherwise), but what's going on for PRs? Is your intent to check out the HEAD of the proposed branch, the merge commit, main, or something else? Of course this is pull_request and not pull_request_target so we're getting the proposed branch, but it's also doing the merge. That's probably kinda right because if this was in job-runner I think we'd rebase onto main, right? And does it even matter? Is this all going to get checked out again by testing farm and/or job-runner? Is this just about tests-scan? Could we ask 🤖 to rewrite that in JS instead to simplify this and alleviate the quoting issues we have? We should definitely think about this explicitly and document our intent here...

Thinking about this a bit more, issue-scan checks the allowlist using the list from the checked out git repository, which in this case would be the proposed branch, no? I think this is a backdoor to allow any unauthorized user to run any image refresh (or any code at all) inside of the RH network... I think this is the usual reason you avoid giving credentials to workflows running on pull_request (and not pull_request_target)...

Yes, my intent was to check out the proposed branch, so that changes to this workflow are self-validating. I.e. "what we do with all workflows", unless we have an explicit reason not to.

But this isn't a backdoor - in order to run the workflow in a meaningful way (i.e. with access to the secrets env), it has to come from origin, and only "trusted" people can push there. That's roughly (or perhaps even exactly?) the same privilege as for setting the "bots" label.

You really don't want pull_request_target here, that would be a backdoor.

Is this true even if we directly configure an environment:?

allisonkarlitskaya · 2026-03-09T13:24:44Z

+          GITHUB_BASE=${{ github.repository }} \
+          SCAN_OUT=$(./issue-scan --issues-data '${{ toJSON(github.event) }}')


quoting please on the SCAN_OUT assignment...

also, I think quoting toJSON output like that in single quotes breaks if any string in the JSON contains a ', doesn't it? Can we use $GITHUB_EVENT_PATH here instead of trying to feed this in through the shell's parser?

WDYM? x=$(...) doesn't need quoting, but if it gives you a warm fuzzy feeling I'm happy to add it 😁

TIL about $GITHUB_EVENT_PATH, that definitely sounds interesting and preferable. We can teach issue-scan to read it from a file.

Note that we have passed it through a shell forever, we know the structure. But happy to improve here.

TIL about a=$(cmd). Funny thing is, I think I knew that before but forgot it.

allisonkarlitskaya · 2026-03-09T13:34:54Z

+          # Build Testing Farm API request using jq for proper escaping
+          echo '${{ steps.scan.outputs.scan_output }}' | jq \
+            --arg api_key '${{ secrets.TESTING_FARM_RH_TOKEN }}' \
+            --arg git_url '${{ steps.gitref.outputs.git_url }}' \
+            --arg git_ref '${{ steps.gitref.outputs.git_ref }}' \
+            --arg gh_token '${{ secrets.COCKPITUOUS_TOKEN }}' \
+            --arg s3_eu '${{ secrets.S3_KEY_EU }}' \
+            --arg s3_us '${{ secrets.S3_KEY_US }}' \
+            --arg s3_logs '${{ secrets.S3_KEY_LOGS }}' \


the quoting is scaring me again.... I wonder if we could use env: to get these into the command instead...

See https://docs.github.com/en/actions/reference/security/secure-use#good-practices-for-mitigating-script-injection-attacks for the suggested approaches...
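The `env:` approach means the workflow exports each value as an environment variable and the script reads it from there, so no `${{ }}` expression is pasted into shell source. A minimal sketch, assuming hypothetical variable names `GIT_URL` and `GIT_REF` and an illustrative payload shape (the keys mirror the `jq --arg` names in the diff, not the real Testing Farm schema):

```python
import json
import os
from collections.abc import Mapping


def build_request(env: Mapping[str, str] = os.environ) -> str:
    """Serialize values taken from the environment; json.dumps handles all escaping."""
    payload = {
        # Illustrative structure only, not the Testing Farm request schema.
        "api_key": env["TESTING_FARM_RH_TOKEN"],
        "git_url": env["GIT_URL"],
        "git_ref": env["GIT_REF"],
    }
    return json.dumps(payload)
```

In the workflow, the step would declare these under `env:` (for example `TESTING_FARM_RH_TOKEN: ${{ secrets.TESTING_FARM_RH_TOKEN }}`), which is the pattern the linked GitHub documentation recommends.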

allisonkarlitskaya · 2026-03-09T13:35:10Z

+          HUMAN=$(echo '${{ steps.scan.outputs.scan_output }}' | jq -r '.human')
+          JOB_JSON=$(echo '${{ steps.scan.outputs.scan_output }}' | jq '.job')


quotes quotes quotes...

same same same --- if you prefer, I add them, but shell doesn't need them.
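Separately from whether `x=$(...)` needs outer quotes (it does not), the `echo '${{ ... }}'` part of these lines is exposed to the single-quote problem raised earlier in the review. The sketch below uses Python's `shlex` as a stand-in for POSIX shell tokenizing; the payload is made up for illustration.

```python
import json
import shlex

# Made-up scan output containing an apostrophe.
payload = json.dumps({"human": "don't panic"})

# What the shell would see after ${{ }} has been substituted into the script text.
naive_command = f"echo '{payload}'"
# The same command with the payload quoted for the shell.
safe_command = f"echo {shlex.quote(payload)}"


def shell_can_parse(cmd: str) -> bool:
    """Return True if cmd tokenizes cleanly under POSIX quoting rules."""
    try:
        shlex.split(cmd)
        return True
    except ValueError:  # e.g. "No closing quotation"
        return False
```

Here `shell_can_parse(naive_command)` is false because the apostrophe closes the single-quoted string early, while `safe_command` round-trips the payload unchanged.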

allisonkarlitskaya · 2026-03-09T13:35:58Z

+          <summary>Job JSON</summary>
+
+          \`\`\`json
+          $JOB_JSON


... probably there's not \nCOMMENT_EOF\n in here...
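The heredoc concern is that content containing a line equal to the terminator ends the block early. One way to rule that out, sketched here with a hypothetical helper `pick_delimiter`, is to choose a terminator that provably does not occur in the content:

```python
import secrets


def pick_delimiter(content: str, prefix: str = "COMMENT_EOF") -> str:
    """Return a heredoc-style delimiter that does not occur as a line in content."""
    lines = set(content.splitlines())
    delimiter = prefix
    while delimiter in lines:
        # Collision: append random hex and check again.
        delimiter = f"{prefix}_{secrets.token_hex(8)}"
    return delimiter
```

This only becomes necessary while the comment body is assembled in shell; building it inside a single script and sending it over the API avoids delimiters altogether.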

allisonkarlitskaya · 2026-03-09T15:18:55Z

This is back to the drawing board. I just had a conversation with Martin based on my "burn it all down and rewrite in JS" idea. This idea is definitely an improvement but maybe an even bigger one (in the name of local testability) would be to write it as one big Python script, run as a single step, which does the JSON/REST stuff internally, instead of this weird bouncing around of JSON through various shell expansions in the workflow script.

We also discussed that the allowlist stuff is redundant if the script only works for non-fork PRs, since the set of people who can propose bots PRs from origin branches is strictly smaller than the set of people on the allowlist. We might want to allow proposing from forks in the future, but for the time being we'll drop the allowlist change to keep things simple.
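Restricting the script to non-fork PRs can be checked from the event payload itself. A minimal sketch, assuming a hypothetical helper `is_same_repo_pr`; the field names (`pull_request.head.repo.full_name`, `pull_request.base.repo.full_name`) are those of GitHub's `pull_request` webhook payload.

```python
from typing import Any


def is_same_repo_pr(event: dict[str, Any]) -> bool:
    """True if the PR head branch lives in the base repository (i.e. not a fork)."""
    pr = event.get("pull_request")
    if pr is None:
        return True  # plain issue events have no head branch to distrust
    head_repo = pr["head"]["repo"]
    if head_repo is None:
        return False  # head repository no longer exists, treat as untrusted
    return head_repo["full_name"] == pr["base"]["repo"]["full_name"]
```

GitHub already withholds secrets from `pull_request` workflows triggered from forks, so a check like this would mainly serve to skip the job early with a clear message.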

What happens now: I'm gonna take a look at the "rewrite in JS" vs "rewrite in Python" options over the next couple of days and hopefully we find a nice way forward.

Building RHEL images is the only bit in our CI that actually requires the internal RH network. Everything else could (and probably soon will) move to some public infrastructure.

Testing Farm supports nested virtualization in principle, even a bit better on the RHEL ranch than the public one. It does not currently scale very well (until it moves to EC2's brand new nested virtualization). But our image builds are not very resource hungry, just a dozen 30-minute runs a week. So let's move them there!

Add an "issue-scan" GitHub workflow which replaces our webhook/AMQP/bot worker logic. On appropriate changes to an issue/PR, run `issue-scan`, and if there is a resulting task, schedule a `plans/job-runner.fmf` run on Testing Farm. That creates a suitable `job-runner` configuration, and then runs the job inside a local podman tasks container.

https://issues.redhat.com/browse/COCKPIT-1772

Co-authored-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>

martinpitt

Very cool! The unit tests fail on some ruff import ordering nag, but otherwise looks great!

martinpitt · 2026-03-17T06:24:04Z

+    return url, ref
+
+
+async def submit_job_runner(event: JsonObject, result: QueueEntry, config_file: str | None) -> int:


This only ever returns 0 or crashes. Can just as well be -> None. Or does asyncio.run care about an int?
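On the `asyncio.run` question: it simply returns whatever the coroutine returns and attaches no meaning to an `int`. A small sketch with stand-in names (`submit` is not the real `submit_job_runner`):

```python
import asyncio
import sys


async def submit() -> None:
    await asyncio.sleep(0)  # stand-in for the real network call


def main() -> None:
    result = asyncio.run(submit())  # returns the coroutine's result, here None
    sys.exit(result)  # sys.exit(None) exits with status 0
```

So `-> None` works as long as failures surface as exceptions, which already produce a non-zero exit status.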


martinpitt · 2026-03-17T06:44:56Z

* I deployed the env with the command from ci-secrets.git's README, it exists now and I checked the box in the description.
* I created images: enable test.thing support on RHEL 8/9 #8221 as a PR against this branch. It did build the image (log), but that was actually from our own CI. So (obviously) that didn't take the new run-queue into account, as our own CI uses that from main.
* The workflow failed on missing yarl, it has to install this.

So I pushed this branch to my fork's main again, and opened martinpitt#41 . The initial version failed as well on yarl, and on httpx -- why do we have to use these new-fangled libraries to do a single URL call that urllib is perfectly able to do? 🤔
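For comparison, a single JSON POST with only the standard library could be sketched as follows. The endpoint is the one from the earlier workflow diff; the payload shape and the helper names `make_request` and `submit` are assumptions, and this is a blocking call, unlike the async code in the PR.

```python
import json
import urllib.request

API_URL = "https://api.dev.testing-farm.io/v0.1/requests"  # endpoint from the earlier diff


def make_request(payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def submit(payload: dict) -> dict:
    """Send the request and decode the JSON response (raises on HTTP errors)."""
    with urllib.request.urlopen(make_request(payload), timeout=30) as response:
        return json.load(response)
```

The trade-off is that `urllib` blocks the event loop if called directly from a coroutine, so it would need `asyncio.to_thread` or a synchronous caller.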

But after that it fails with

/home/runner/.config/cockpit-dev/job-runner.toml: Invalid initial character for a key part (at line 3, column 12)

I think it doesn't like the dictionary syntax here?

          s3-keys = {
               'eu-central-1.linodeobjects.com'='${{ secrets.S3_KEY_EU }}',


martinpitt · 2026-03-17T07:10:16Z

Now the unit tests fail with

issue-scan:228: error: Argument 2 to "submit_to_testing_farm" has incompatible type "JobSpecification"; expected "JsonObject" [arg-type]

but I leave that to you @allisonkarlitskaya . More importantly, after the series of fixups I am now getting this API post permission error. I think issue-scan ought to use COCKPITUOUS_TOKEN, which on my fork's env is a token for my fork's project. I even just refreshed it to make sure. It seems it did trigger TF, as that status posting happens afterwards, but there's no way to get the TF URL.

I even gave it public_repo permissions (which I don't really want to do), but it still does not work. It would be really nice to use the default GITHUB_TOKEN for that status posting, but that's difficult.

At this point I need to call "timeout", sorry.

martinpitt · 2026-03-17T07:23:10Z

Pushed the fixup, now works. See martinpitt#42 and https://github.com/martinpitt/bots/actions/runs/23183023242/job/67359893649

allisonkarlitskaya · 2026-03-17T11:52:02Z

(test comment)

jelly · 2026-03-23T09:52:31Z

As far as I know this runs in the private ranch? So this would run into the new repository allowlist

allisonkarlitskaya · 2026-04-15T11:23:56Z

Superseded by #8804, #8836, #8848, #8909

martinpitt requested review from allisonkarlitskaya and jelly February 25, 2026 15:05

martinpitt commented Feb 26, 2026


martinpitt mentioned this pull request Feb 26, 2026

test TF RHEL image build martinpitt/bots#34

Closed

martinpitt force-pushed the image-refresh-tf branch 2 times, most recently from 04454a9 to 589d1c5 Compare February 26, 2026 08:19

allisonkarlitskaya reviewed Feb 26, 2026


Comment thread issue-scan

martinpitt mentioned this pull request Feb 26, 2026

test: Drop image-refresh integration test cockpit-project/cockpituous#686

Open

jelly reviewed Feb 26, 2026


martinpitt force-pushed the image-refresh-tf branch from 589d1c5 to c20e91f Compare February 26, 2026 16:10

martinpitt requested review from allisonkarlitskaya and jelly March 2, 2026 10:13

allisonkarlitskaya requested changes Mar 9, 2026


allisonkarlitskaya marked this pull request as draft March 9, 2026 15:15

allisonkarlitskaya self-assigned this Mar 11, 2026

allisonkarlitskaya mentioned this pull request Mar 11, 2026

Preparation work for #8758 #8798

Merged

allisonkarlitskaya force-pushed the image-refresh-tf branch from 906046a to abf68b4 Compare March 16, 2026 13:59

allisonkarlitskaya force-pushed the image-refresh-tf branch from abf68b4 to bca29ca Compare March 16, 2026 15:26

martinpitt commented Mar 17, 2026


martinpitt added a commit that referenced this pull request Mar 17, 2026

No-change PR to test image refresh

7bdb616

Test #8758

martinpitt mentioned this pull request Mar 17, 2026

No-change PR to test image refresh #8821

Closed

1 task

martinpitt mentioned this pull request Mar 17, 2026

test image refresh, take #4 martinpitt/bots#41

Closed

1 task

martinpitt force-pushed the image-refresh-tf branch from bca29ca to 797bd8b Compare March 17, 2026 06:46

martinpitt added 4 commits March 17, 2026 07:52

run-queue: Remove issue-scan

8e64996

issue-scan moved to GitHub/TF, stop them from being scheduled on our own CI.

FIXUP ruff import order

8e6f1be

FIXUP install dependencies

b1e76c1

FIXUP toml key syntax

cef0f82

martinpitt force-pushed the image-refresh-tf branch from 797bd8b to cef0f82 Compare March 17, 2026 07:10

FIXUP set up standard github token for issue-scan comment posting

8be71f6

allisonkarlitskaya mentioned this pull request Mar 20, 2026

New issue-comment command and workflow #8836

Merged

allisonkarlitskaya closed this Apr 15, 2026

martinpitt deleted the image-refresh-tf branch April 15, 2026 12:37
