workflows: Run image refreshes in Testing Farm #8758
Conversation
Ah yes, unit tests now fail. But in the meantime I'd appreciate some reviews/comments, thanks!
martinpitt left a comment:
Self-review with fresh eyes on the morning after.
```python
assert pika is not None, "pika module is required for --amqp"
assert distributed_queue is not None, "distributed_queue module is required for --amqp"
```

```python
for result in scan(opts.issues_data, opts.repo):
```
Splitting the loop between AMQP and non-AMQP was an intermediate step when I tried to move the imports into the `if:` block. But that didn't work, so I made them global. However, given that this AMQP code disappears entirely after landing this PR, I'm actually in favor of keeping it split that way.
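A minimal sketch of the pattern in the diff above (not the actual bots code; names and details may differ): the optional AMQP dependencies are imported globally as None-able names, and only asserted when `--amqp` is actually requested.

```python
# Hypothetical sketch, not the actual bots code: optional AMQP dependencies
# are imported at module level as None-able names, then asserted only when
# --amqp is requested.
try:
    import pika  # optional, only needed for --amqp
except ImportError:
    pika = None

def require_amqp_deps(amqp: bool) -> None:
    # Fail early with a clear message instead of a NameError later on
    if amqp:
        assert pika is not None, "pika module is required for --amqp"
```

This keeps the non-AMQP path working on hosts that don't have pika installed at all.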
Force-pushed 04454a9 to 589d1c5
Pushed fixes for my self-review, tested again in martinpitt#40
So the first question is regarding logging: re-reading this, I guess this isn't true, as the logs do end up in S3 and will show up -- but won't be commented on this issue?
allisonkarlitskaya left a comment:
I don't like using GitHub actions for stuff like this, but there might be a silver lining here...
Image refreshes were the last of the classical "tasks". Will this let us kill all the "infra" in lib/task.py?
As you see on e.g. martinpitt#38, the S3 logs stay as they have always been, and they are public. The bit that happens on the private ranch is the actual image build. As soon as TF starts supporting the new EC2 nested virtualization, we could move non-RHEL image builds out to the public ranch, but right now that fails far too often (it simply isn't scalable to run all these on bare-metal EC2 instances). But this is more of a bonus/cost-efficiency issue.
We will soon move image refreshes out of our own CI into GitHub/Testing Farm [1]. That PR [2] disables the `issue-scan` portion of `run-tests` to avoid building images twice. [1] https://issues.redhat.com/browse/COCKPIT-1772 [2] cockpit-project/bots#8758
Yup, realized that when reading the code!
```shell
TF_RESPONSE=$(curl -s --json @tf-request.json https://api.dev.testing-farm.io/v0.1/requests)
```

```shell
# Sadly, the response does not include the artifacts URL, the only
# useful thing is the ID; so we have to hardcode it
```
Is this something we have a bug report for?
I just filed one: https://issues.redhat.com/browse/TFT-4379. I added that to the comment (pushes are cheap here).
Force-pushed 589d1c5 to c20e91f
allisonkarlitskaya left a comment:
I took another look at this today, and once I started seeing the quoting problems I started to see them everywhere...
```yaml
pull-requests: write
steps:
- name: Clone repository
  uses: actions/checkout@v5
```
I think I'd like a comment here about what this step is doing, because it seems like "boilerplate", but we have to think about it very carefully.
For the "issue" side, it obviously checks out the main branch (because what would it do otherwise), but what's going on for PRs? Is your intent to check out the HEAD of the proposed branch, the merge commit, main, or something else? Of course this is `pull_request` and not `pull_request_target`, so we're getting the proposed branch, but it's also doing the merge. That's probably kinda right, because if this was in job-runner I think we'd rebase onto main, right?

And does it even matter? Is this all going to get checked out again by Testing Farm and/or job-runner? Is this just about tests-scan? Could we ask 🤖 to rewrite that in JS instead to simplify this and alleviate the quoting issues we have?

We should definitely think about this explicitly and document our intent here...
Thinking about this a bit more: issue-scan checks the allowlist using the list from the checked-out git repository, which in this case would be the proposed branch, no? I think this is a backdoor to allow any unauthorized user to run any image refresh (or any code at all) inside of the RH network... I think this is the usual reason you avoid giving credentials to workflows running on `pull_request` (and not `pull_request_target`)...
Yes, my intent was to check out the proposed branch, so that changes to this workflow are self-validating. I.e. "what we do with all workflows", unless we have an explicit reason not to.
But this isn't a backdoor - in order to run the workflow in a meaningful way (i.e. with access to the secrets env), it has to come from origin, and only "trusted" people can push there. That's roughly (or perhaps even exactly?) the same privilege as for setting the "bots" label.
You really don't want pull_request_target here, that would be a backdoor.
Is this true even if we directly configure an `environment:`?
```shell
GITHUB_BASE=${{ github.repository }} \
SCAN_OUT=$(./issue-scan --issues-data '${{ toJSON(github.event) }}')
```
Quoting please on the `SCAN_OUT` assignment...

Also, I think quoting the `toJSON` output like that in single quotes breaks if any string in the JSON contains a `'`, doesn't it? Can we use `$GITHUB_EVENT_PATH` here instead of trying to feed this in through the shell's parser?
WDYM? `x=$(...)` doesn't need quoting, but if it gives you a warm fuzzy feeling I'm happy to add it 😁
TIL about `$GITHUB_EVENT_PATH`, that definitely sounds interesting and preferable. We can teach issue-scan to read it from a file.

Note that we have passed it through a shell forever; we know the structure. But happy to improve here.
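A hedged sketch of that improvement, with a hypothetical `load_event()` helper (issue-scan's real interface may differ): read the payload from the file the Actions runner provides, instead of interpolating `${{ toJSON(github.event) }}` into the shell.

```python
import json
import os

# Hypothetical helper: read the webhook event from $GITHUB_EVENT_PATH, the
# file the Actions runner writes. Quotes in e.g. issue titles then go through
# the JSON parser and never through the shell.
def load_event(path=None):
    event_path = path or os.environ["GITHUB_EVENT_PATH"]
    with open(event_path, encoding="utf-8") as f:
        return json.load(f)
```

A single quote anywhere in the event (a common case for issue titles) is then completely harmless.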
TIL about `a=$(cmd)`. Funny thing is, I think I knew that before but forgot it.
```shell
# Build Testing Farm API request using jq for proper escaping
echo '${{ steps.scan.outputs.scan_output }}' | jq \
    --arg api_key '${{ secrets.TESTING_FARM_RH_TOKEN }}' \
    --arg git_url '${{ steps.gitref.outputs.git_url }}' \
    --arg git_ref '${{ steps.gitref.outputs.git_ref }}' \
    --arg gh_token '${{ secrets.COCKPITUOUS_TOKEN }}' \
    --arg s3_eu '${{ secrets.S3_KEY_EU }}' \
    --arg s3_us '${{ secrets.S3_KEY_US }}' \
    --arg s3_logs '${{ secrets.S3_KEY_LOGS }}' \
```
The quoting is scaring me again... I wonder if we could use `env:` to get these into the command instead. See https://docs.github.com/en/actions/reference/security/secure-use#good-practices-for-mitigating-script-injection-attacks for the suggested approaches.
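A sketch of what that could look like on the script side, assuming the workflow maps each `${{ secrets.* }}` into an environment variable via `env:` (the variable and function names below are illustrative): the script reads `os.environ`, so secret values never appear in the script text and need no shell quoting at all.

```python
import os

# Illustrative names only -- the workflow's env: block would map
# ${{ secrets.* }} onto these environment variables.
def collect_secrets(env=None):
    env = os.environ if env is None else env
    names = ["TESTING_FARM_RH_TOKEN", "S3_KEY_EU", "S3_KEY_US", "S3_KEY_LOGS"]
    # Missing keys fall back to "" just to keep the sketch total.
    return {name: env.get(name, "") for name in names}
```

Even a secret containing quote characters arrives intact, because it never passes through a shell parser.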
```shell
HUMAN=$(echo '${{ steps.scan.outputs.scan_output }}' | jq -r '.human')
JOB_JSON=$(echo '${{ steps.scan.outputs.scan_output }}' | jq '.job')
```
quotes quotes quotes...
same same same -- if you prefer, I'll add them, but shell doesn't need them.
```
<summary>Job JSON</summary>

\`\`\`json
$JOB_JSON
```
... probably there's no `\nCOMMENT_EOF\n` in here...
This is back to the drawing board. I just had a conversation with Martin based on my "burn it all down and rewrite in JS" idea. This idea is definitely an improvement, but maybe an even bigger one (in the name of local testability) would be to write it as one big Python script, run as a single step, which does the JSON/REST stuff internally, instead of this weird bouncing around of JSON through various shell expansions in the workflow script.

We also discussed that the allowlist stuff is redundant if the script only works for non-fork PRs, since the set of people who can propose bots PRs from origin branches is strictly smaller than the allowlist. We might want to allow proposing from forks in the future, but for the time being we'll drop the allowlist change to keep things simple.

What happens now: I'm gonna take a look at the "rewrite in JS" vs. "rewrite in Python" options over the next couple of days, and hopefully we find a nice way forward.
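A rough sketch of the single-Python-script direction, assuming the request body is built as a plain dict (the field names below are illustrative, not the real Testing Farm v0.1 schema): `json.dumps()` does all the escaping, so secrets and scan output never pass through a shell parser.

```python
import json

# Hypothetical helper; the real Testing Farm request schema differs.
def build_tf_request(api_key, git_url, git_ref):
    request = {
        "api_key": api_key,
        "test": {"fmf": {"url": git_url, "ref": git_ref}},
    }
    return json.dumps(request)
```

A secret containing quote characters round-trips untouched -- exactly the case where the shell-interpolation variants above break.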
Force-pushed 906046a to abf68b4
Building RHEL images is the only bit in our CI that actually requires the internal RH network. Everything else could (and probably soon will) move to some public infrastructure.

Testing Farm supports nested virtualization in principle, even a bit better on the RHEL ranch than the public one. It does not currently scale very well (until it moves to EC2's brand new nested virtualization). But our image builds are not very resource hungry, just a dozen 30-minute runs a week. So let's move them there!

Add an "issue-scan" GitHub workflow which replaces our webhook/AMQP/bot worker logic. On appropriate changes to an issue/PR, run `issue-scan`, and if there is a resulting task, schedule a `plans/job-runner.fmf` run on Testing Farm. That creates a suitable `job-runner` configuration, and then runs the job inside a local podman tasks container.

https://issues.redhat.com/browse/COCKPIT-1772

Co-authored-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Force-pushed abf68b4 to bca29ca
martinpitt left a comment:
Very cool! The unit tests fail on some ruff import-ordering nag, but otherwise this looks great!
```python
    return url, ref


async def submit_job_runner(event: JsonObject, result: QueueEntry, config_file: str | None) -> int:
```
This only ever returns 0 or crashes. It can just as well be `-> None`. Or does `asyncio.run()` care about an int?
So I pushed this branch to my fork's main again, and opened martinpitt#41. The initial version failed as well on yarl, and on httpx -- why do we have to use these new-fangled libraries to do a single URL call that
But after that it fails with
I think it doesn't like the dictionary syntax here?

```
s3-keys = {
    'eu-central-1.linodeobjects.com'='${{ secrets.S3_KEY_EU }}',
```
Force-pushed bca29ca to 797bd8b
issue-scan moved to GitHub/TF; stop these tasks from being scheduled on our own CI.
Now the unit tests fail with
but I leave that to you @allisonkarlitskaya.

More importantly, after the series of fixups I am now getting this API POST permission error. I think issue-scan ought to use
I even gave it
At this point I need to call "timeout", sorry.
Force-pushed 797bd8b to cef0f82
Pushed the fixup, now works. See martinpitt#42 and https://github.com/martinpitt/bots/actions/runs/23183023242/job/67359893649
(test comment)
As far as I know this runs on the private ranch? So this would run into the new repository allowlist.
Building RHEL images is the only bit in our CI that actually requires
the internal RH network. Everything else could (and probably soon will)
move to some public infrastructure.
Testing Farm supports nested virtualization in principle, even a bit
better on the RHEL ranch than the public one. It does not currently
scale very well (until it moves to EC2's brand new nested
virtualization). But our image builds are not very resource hungry, just
a dozen 30-minute runs a week. So let's move them there!
Add an "issue-scan" GitHub workflow which replaces our webhook/AMQP/bot
worker logic. On appropriate changes to an issue/PR, run
`issue-scan`, and if there is a resulting task, schedule a
`plans/job-runner.fmf` run on Testing Farm. That creates a suitable
`job-runner` configuration, and then runs the job inside a local podman tasks container.
https://issues.redhat.com/browse/COCKPIT-1772
I developed this on my fork. I kept pushing to my main branch, and doing the refreshes in martinpitt#36, then martinpitt#37, then martinpitt#38 -- that's because each actually successful run then turns the issue into a PR, and conflicts appear. The last one is a clean run, how it's actually supposed to look. The two "image-refresh cirros done" and "Success. Log:.." comments are the same as we always had -- they normally come from cockpituous, but for my fork that of course is a token from myself.
The third, and new, comment is this one that links to the Testing Farm run. We don't strictly need that -- we don't have a counterpart on our current CI, other than "pitti ssh's into a bot and reads the journal", but I think it's useful for at least a little while until this stabilizes. Note that this is running the RHEL ranch, i.e. you need VPN to access the logs (again, just like with our current CI).
I tested this from both an issue and a PR, and with both editing the description as well as (un)labelling "bot", all good.
This requires a new "image-build" GitHub environment for the required secrets. I committed "Add GitHub env for image refreshes" to our internal ci-secrets repo, and ran it (with the obvious cockpit-project → martinpitt change) to deploy the env into my fork.

After review, but before landing:
After landing:
`issue-scan` (not necessary, but cleanup, and git remembers..)