OCPBUGS-64841: Ensure 'containers' user & group are part of the image#1917

Closed
travier wants to merge 1 commit into openshift:master from travier:master-fix-containers-user-group

Conversation

@travier
Member

@travier travier commented Mar 24, 2026

This is a combination of multiple things:

  • In [1], the cri-o package has been updated to use systemd-sysusers config instead of using useradd/usermod commands directly.

  • Starting with OCP 4.19, we've split the OCP packages (here cri-o) from the base RHEL image into the Node image layer. This means that the sysusers scriptlet in %pre now runs during the node layer build and adds the user/group to the /etc/passwd|group files instead of the /usr/lib/passwd|group ones. As it does not take into account the existing users & groups from /usr/lib/passwd|group, the new containers user/group gets a UID/GID that collides with an existing user/group. Changes to the /etc/passwd|group files are also not propagated to deployed systems on updates: those files are modified on first boot when the core user is created, and thus ostree does not update them anymore. See [2] & [3].

  • Starting with OCP 4.19, new nodes start with no containers user/group defined (either in /usr/ or /etc) and those are thus created in /etc after the switch to the node image, so everything appears to be OK when you create a fresh cluster. Clusters updating to OCP 4.19 with older nodes that used to have the containers user/group defined in /usr/lib/passwd|group no longer have them there, and thus systemd-sysusers attempts to create them on the system. This fails, however, because entries for that user/group are left in the /etc/shadow and /etc/gshadow files. This is [4] but "reversed".
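
For illustration, the failure mode in the last bullet (a name that still has a shadow entry but no passwd entry anywhere) could be spotted with a small check like the one below. This is only a sketch: `stale_shadow_entries` and its `root` argument are hypothetical names, not something from this PR, and it assumes account names contain no regex metacharacters.

```shell
# Hypothetical helper: list names that have an entry in ROOT/etc/shadow but no
# passwd entry in either ROOT/etc/passwd or ROOT/usr/lib/passwd.
# ROOT defaults to the empty string, i.e. the live system.
stale_shadow_entries() {
    local root="${1:-}" name rest
    while IFS=: read -r name rest; do
        [ -n "$name" ] || continue
        # -q: quiet, -s: ignore missing files; any non-zero status means the
        # name was not found in either passwd file.
        if ! grep -qs "^${name}:" "$root/etc/passwd" "$root/usr/lib/passwd"; then
            printf '%s\n' "$name"
        fi
    done < "$root/etc/shadow"
}
```

Reading `/etc/shadow` requires root on a real node, so something like `sudo sh -c '. ./check.sh; stale_shadow_entries'` would be needed there (the file name is an assumption as well).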

The proposed solution here is to keep the containers user/group properly defined in the container image in the /usr/lib/passwd|group files. Older nodes will thus use that user/group like they used to. New nodes will stop trying to create them. They will, however, have missing shadow|gshadow entries until we fix [4], but that should not be an issue as those are not used for interactive/login session users.
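
To make the UID/GID collision from the second bullet concrete, here is a rough sketch of how one might compare the two passwd databases inside the built image. `uid_collisions` is a hypothetical helper, not part of this PR. Because the GID is also the third field in group files, the same function should work for `/etc/group` against `/usr/lib/group`.

```shell
# Hypothetical helper: for every ID (field 3) in FILE_A that is owned by a
# different name in FILE_B, print "id name_in_a name_in_b".
uid_collisions() {
    awk -F: '
        FILENAME == ARGV[1] { owner[$3] = $1; next }   # first file: id -> name
        ($3 in owner) && owner[$3] != $1 {             # second file: same id, other name
            print $3, owner[$3], $1
        }
    ' "$1" "$2"
}
```

A possible invocation in the node image build would be `uid_collisions /etc/passwd /usr/lib/passwd`; any output means an ID is claimed twice.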

The medium/longer term fix is to complete the transition away from nss-altfiles for all Bootable Container systems.

[1] https://pkgs.devel.redhat.com/cgit/rpms/cri-o/commit/?h=rhaos-4.18-rhel-9&id=240a1e3db29a1d1c1b58dfae1325a9f19c663b91
[2] https://bootc-dev.github.io/bootc/building/users-and-groups.html#system-users-and-groups-added-via-packages-etc
[3] https://bootc-dev.github.io/bootc/building/users-and-groups.html#nss-altfiles
[4] bootc-dev/bootc#1179

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-64841
Related: https://gitlab.com/fedora/bootc/tracker/-/work_items/76

@openshift-ci-robot added the jira/severity-moderate, jira/valid-reference, and jira/invalid-bug labels Mar 24, 2026
@openshift-ci-robot

@travier: This pull request references Jira Issue OCPBUGS-64841, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested review from cgwalters and jschintag March 24, 2026 11:36
@openshift-ci
Contributor

openshift-ci bot commented Mar 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: travier

@openshift-ci bot added the approved label Mar 24, 2026
@travier
Member Author

travier commented Mar 24, 2026

The proposed solution here is to keep the containers user/group properly defined in the container image in the /usr/lib/passwd|group files. Older nodes will thus use that user/group like they used to. New nodes will stop trying to create them. They will, however, have missing shadow|gshadow entries until we fix [4], but that should not be an issue as those are not used for interactive/login session users.

Another option here is to instead go all in and drop those from the image. This means that we need to manually remove the shadow & gshadow entries from older nodes and then make sure that sysusers gets triggered again to re-create the user/group. This might also change the UID/GID allocated for those users.

As this user was mostly used to allocate a subuid/subgid range for UID/GID namespaced containers, it's not clear what impact a UID/GID change would have.
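
As a rough idea of what the manual cleanup in that alternative could look like, here is a sketch. `remove_shadow_entries` and its `root` argument are hypothetical; the `sed` expression simply mirrors the one used in this PR's script, restricted to shadow and gshadow.

```shell
# Hypothetical helper: drop the NAME entry from ROOT/etc/shadow and
# ROOT/etc/gshadow. ROOT defaults to the empty string (the live system).
remove_shadow_entries() {
    local name="$1" root="${2:-}" f
    for f in "$root/etc/shadow" "$root/etc/gshadow"; do
        [ -f "$f" ] || continue
        # GNU sed in-place edit: delete the line that starts with "NAME:"
        sed -i "/^${name}:/d" "$f"
    done
}

# On a real node this would need root and would be followed by re-running
# sysusers, e.g.:
#   remove_shadow_entries containers
#   systemd-sysusers crio.conf
```

Note that this does nothing about the UID/GID that sysusers would then allocate, which is the open question above.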

This is a combination of multiple things:

- In [1], the cri-o package has been updated to use systemd-sysusers
  config instead of using useradd/usermod commands directly.

- Starting with OCP 4.19, we've split the OCP packages (here cri-o) from
  the base RHEL image into the Node image layer. This means that the
  sysusers scriptlet in `%pre` now runs during the node layer build and
  adds the user/group to the `/etc/passwd|group` files instead of the
  `/usr/lib/passwd|group` ones. As it does not take into account the
  existing users & groups from `/usr/lib/passwd|group`, the new
  `containers` user/group gets a UID/GID that collides with an existing
  user/group. Changes to the `/etc/passwd|group` files are also not
  propagated to deployed systems on updates: those files are modified on
  first boot when the `core` user is created, and thus ostree does not
  update them anymore. See [2] & [3].

- Starting with OCP 4.19, new nodes start with no `containers`
  user/group defined (either in `/usr/` or `/etc`) and those are thus
  created in `/etc` after the switch to the node image, so everything
  appears to be OK when you create a fresh cluster.
  Clusters updating to OCP 4.19 with older nodes that used to have the
  `containers` user/group defined in `/usr/lib/passwd|group` will now no
  longer have them there and thus systemd-sysusers will attempt to
  create them on the system. This will however fail as entries for those
  user/group are left in the `/etc/shadow` and `/etc/gshadow` files.
  This is [4] but "reversed".

The proposed solution here is to keep the `containers` user/group
properly defined in the container image in the `/usr/lib/passwd|group` files.
Older nodes will thus use those user/group like they used to.
New nodes will stop trying to create them. They will, however, have
missing `shadow|gshadow` entries until we fix [4], but that should not
be an issue as those are not used for interactive/login session users.

The medium/longer term fix is to complete the transition away from
nss-altfiles for all Bootable Container systems.

[1] https://pkgs.devel.redhat.com/cgit/rpms/cri-o/commit/?h=rhaos-4.18-rhel-9&id=240a1e3db29a1d1c1b58dfae1325a9f19c663b91
[2] https://bootc-dev.github.io/bootc/building/users-and-groups.html#system-users-and-groups-added-via-packages-etc
[3] https://bootc-dev.github.io/bootc/building/users-and-groups.html#nss-altfiles
[4] bootc-dev/bootc#1179

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-64841
Related: https://gitlab.com/fedora/bootc/tracker/-/work_items/76
@travier force-pushed the master-fix-containers-user-group branch from a2e7ffd to 005e8a5 on March 24, 2026 11:59
@travier
Member Author

travier commented Mar 24, 2026

I think we should have added this user/group to our passwd/group files in #1661.

@openshift-ci
Contributor

openshift-ci bot commented Mar 24, 2026

@travier: all tests passed!

@haircommander
Member

LGTM, when this merges should I update the crio rpm to remove the sysusers addition or would it be a noop?

@travier
Member Author

travier commented Mar 24, 2026

LGTM, when this merges should I update the crio rpm to remove the sysusers addition or would it be a noop?

Don't remove the sysusers config, we are using it here. The %pre will be redundant but I think it should still be there (it's the correct way to do it in a package mode setup).

@travier
Member Author

travier commented Mar 24, 2026

There has been movement related to this issue recently. See:

(from https://gitlab.com/fedora/bootc/tracker/-/work_items/76#note_3187445643)

But I've just realized that this would be for RPM reading the "right" passwd & group files, not for sysusers to use them, so this will not help here.

@travier travier requested review from dustymabe and jlebon and removed request for cgwalters and jschintag March 24, 2026 16:15
@travier
Member Author

travier commented Mar 24, 2026

(I still need to fully test this fix but we can still have the discussion about the approach now)

@dustymabe
Member

  • This will however fail as entries for those user/group are left in the /etc/shadow and /etc/gshadow files. This is [4] but "reversed".

From what I understand new and upgrading systems have been working fine even without this PR. If this hadn't failed would systems have stopped working?

# for the full details.

# Only do that when doing a container build
if [[ -f /run/.containerenv ]] && [[ -f /usr/lib/sysusers.d/crio.conf ]]; then
Member

In what situation would we ever not be running in a container env?

Suggested change
- if [[ -f /run/.containerenv ]] && [[ -f /usr/lib/sysusers.d/crio.conf ]]; then
+ if [[ -f /usr/lib/sysusers.d/crio.conf ]]; then

Member Author

I was conservative and copied the checks from above. I'll update both.

Member

Yeah the check above dates back from when we supported both layered and base composes for the node image.

Comment on lines +144 to +145
# First, cleanup the broken entries from /etc/passwd|group|shadow|gshadow
sed -i "/^containers:/d" /etc/{passwd,group,shadow,gshadow}
Member

This contradicts your statement in the description of this PR:

Starting with OCP 4.19, new nodes start with no containers user/group defined (either in /usr/ or /etc)

If that is true then containers: should never be defined here.

Member Author

New nodes start with a pure RHEL boot image that does not have CRI-O installed. This is the container image with CRI-O installed.

Member

just for my understanding, since there are a lot of pieces here. Today we have:

  1. RHEL CoreOS Base Image build (no crio)
  2. OpenShift Node Image build FROM:rhel-coreos-base where crio gets installed
    a. RPM transaction where crio gets installed
    b. postprocess scripts (not rpm scriptlets) where this sed statement runs

I think what you are saying is that in a. the containers: user gets added to /etc/{passwd,group,shadow,gshadow} and that's what we are cleaning up?

Member Author

Yes, it gets added to the node layer's /etc/{passwd,group,shadow,gshadow}, not to /usr, as part of the node layer build.

Comment on lines +142 to +143
# Only do that when doing a container build
if [[ -f /run/.containerenv ]] && [[ -f /usr/lib/sysusers.d/crio.conf ]]; then
Member

We only ever run in a container environment.

Suggested change
- # Only do that when doing a container build
- if [[ -f /run/.containerenv ]] && [[ -f /usr/lib/sysusers.d/crio.conf ]]; then
+ if [[ -f /usr/lib/sysusers.d/crio.conf ]]; then

Am I missing something?

Member

Also, in what situations would crio.conf not exist? I'm thinking we should fail if crio.conf doesn't exist.

Member Author

Probably should fail indeed.
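
A guard along those lines might look like the sketch below. The function name and the error message are made up for illustration; only the default path comes from the script under review.

```shell
# Hypothetical guard: return non-zero (so the build can abort) when the
# sysusers config is missing, instead of silently skipping the fixup.
require_sysusers_conf() {
    local conf="${1:-/usr/lib/sysusers.d/crio.conf}"
    if [ ! -f "$conf" ]; then
        echo "error: $conf not found; is cri-o installed in this layer?" >&2
        return 1
    fi
}
```

In the build script it would presumably be called as `require_sysusers_conf || exit 1` before touching any passwd or group file.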

mv /etc/passwd /usr/lib/passwd
mv /etc/group /usr/lib/group
mv /etc/passwd.bak /etc/passwd
mv /etc/group.bak /etc/group
Member

@dustymabe commented Mar 25, 2026

I'd like to keep this "hack" targeted.

Can we add a check here that verifies the only entry created was the containers: user, that nothing else changed in /usr/lib/passwd* and /etc/passwd* relative to the RHEL CoreOS Base image we are deriving from, and that fails if any other change was made?
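
One way such a targeted check could be written is sketched below. It assumes a copy of the base image's file is kept around to compare against; `only_added_entry` and its arguments are hypothetical names.

```shell
# Hypothetical check: succeed only if NEW differs from BASE solely by added
# lines starting with "NAME:". Any removed or modified line makes it fail.
only_added_entry() {
    local base="$1" new="$2" name="$3" removed unexpected
    # Lines of BASE that no longer appear verbatim in NEW.
    removed=$(grep -Fxv -f "$new" "$base")
    # Lines of NEW that are not in BASE, ignoring the expected entry.
    unexpected=$(grep -Fxv -f "$base" "$new" | grep -v "^${name}:")
    [ -z "$removed" ] && [ -z "$unexpected" ]
}
```

For example, `only_added_entry /etc/passwd.bak /etc/passwd containers || exit 1` would abort the build on any other change, assuming `/etc/passwd.bak` holds the base image's version.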

@travier
Member Author

travier commented Mar 25, 2026

  • This will however fail as entries for those user/group are left in the /etc/shadow and /etc/gshadow files. This is [4] but "reversed".

From what I understand new and upgrading systems have been working fine even without this PR. If this hadn't failed would systems have stopped working?

I don't think anyone is using the functionality that this bug breaks (user-namespaced containers). I'm also not fully sure the user is needed for it to work; it may only be needed for security, to make sure those IDs are not reused by something else.

mv /usr/lib/group /etc/group

# Re-create the user/group/shadow/gshadow entries
systemd-sysusers crio.conf
Member

I think we need to go further and fixate the UID/GID to whatever it was in the base composes on 4.18. Basically same rationale as f202927.

I guess we could do that here, or just add it to r-c-c which already carries a bunch of other fixated users/groups the node image needs.

I'm not sure why containers wasn't part of that commit I did back then. I'm pretty sure the way I came up with this is that I booted a live ~4.18 RHCOS and added the dynamic entries, so perhaps containers wasn't added back then for some reason.

@travier
Member Author

travier commented Mar 26, 2026

RHCOS 4.18-9.4 (2026-03-23):

[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
● ostree-image-signed:oci-archive:/rhcos-418.94.202603230850-0-ostree.x86_64.ociarchive
                   Digest: sha256:8fa6b14c8a76a0c2a75efdfa1a10aae846933ac3b2c50c684caaead09b707983
                  Version: 418.94.202603230850-0 (2026-03-23T08:55:14Z)

[core@cosa-devsh ~]$ sudo grep containers /usr/lib/{passwd,group} /usr/etc/{shadow,gshadow}
/usr/lib/passwd:containers:x:795:792:User for rootless containers:/nonexistent:/sbin/nologin
/usr/lib/group:containers:x:792:
/usr/etc/shadow:containers:!!:::::::
/usr/etc/gshadow:containers:!::

This UID is now used by dnsmasq: https://github.com/coreos/rhel-coreos-config/blob/main/passwd#L25

RHCOS 4.19-9.6 (2025-05-02) (FYI, I think we never shipped this one as we moved to the node image):

[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
● ostree-image-signed:oci-archive:/rhcos-419.96.202505021444-0-ostree.x86_64.ociarchive
                   Digest: sha256:1d84640fa51563d5902fde825cf9158d1b4761f8db56f7dfa1ebf42c7288ffc2
                  Version: 419.96.202505021444-0 (2025-05-02T14:49:19Z)

[core@cosa-devsh ~]$ sudo grep containers /usr/lib/{passwd,group} /usr/etc/{shadow,gshadow}
/usr/lib/passwd:containers:x:793:790:User for rootless containers:/nonexistent:/sbin/nologin
/usr/lib/group:containers:x:790:
/usr/etc/shadow:containers:!!:::::::
/usr/etc/gshadow:containers:!::

So the UID/GID will change from what we had in 4.18.
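
The same comparison can be scripted. The sketch below uses the two passwd entries from the transcripts above (without the file name prefix that `grep` adds); `ids_of` is a hypothetical helper.

```shell
# Print "uid:gid" for account NAME from passwd-format lines on stdin.
ids_of() {
    awk -F: -v n="$1" '$1 == n { print $3 ":" $4; exit }'
}

# passwd entries copied from the RHCOS 4.18 and 4.19 transcripts
rhcos_418='containers:x:795:792:User for rootless containers:/nonexistent:/sbin/nologin'
rhcos_419='containers:x:793:790:User for rootless containers:/nonexistent:/sbin/nologin'

old=$(printf '%s\n' "$rhcos_418" | ids_of containers)
new=$(printf '%s\n' "$rhcos_419" | ids_of containers)
if [ "$old" != "$new" ]; then
    echo "containers uid:gid changed: $old -> $new"
fi
# prints: containers uid:gid changed: 795:792 -> 793:790
```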

travier added a commit to travier/rhel-coreos-config that referenced this pull request Mar 26, 2026
The openvswitch user and group have been part of the passwd & group
files for, at least, as long as we've published RHCOS sources publicly:
- https://github.com/openshift/os/blame/bdb5b8153ed68c88e2485d9e7bd66ea6eb54d6c1/passwd#L27
- https://github.com/openshift/os/blame/release-4.19/group#L47

We did not remove them when we re-visited our fixed UIDs/GID in the
split between the RHEL boot image and the new OCP node image ([1], [2] &
[3]). Thus they are now part of the base RHEL boot image, even though
the openvswitch package is not included there.

Although technically unnecessary, this is fine and simplifies things a bit
as we do not have to update the user & group entries during the node
image build, which is currently a problematic topic (see [4]).

Thus instead of adding openvswitch to hugetlbfs group in the node image
build, we add it here directly to simplify the logic.

[1] openshift/os#1661
[2] coreos#29
[3] coreos#31
[4] openshift/os#1917
travier added a commit to travier/rhel-coreos-config that referenced this pull request Mar 26, 2026
Adding users and groups during a container image layered build is
currently non-ergonomic with bootable containers. Thus instead of doing
that in openshift/os for the node layer, we directly include the user &
group here, which also guarantees us that the UID/GID remain stable.

See openshift/os#1917 for the original version
of this change and the full details about what makes adding user/group
in the node layer non-ergonomic.

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-64841
travier added a commit to travier/rhel-coreos-config that referenced this pull request Mar 26, 2026
Adding users and groups during a container image layered build is
currently non-ergonomic with bootable containers. Thus instead of doing
that in openshift/os for the node layer, we directly include the user &
group here, which also guarantees us that the UID/GID remain stable.

See openshift/os#1917 for the original version
of this change and the full details about what makes adding user/group
in the node layer non-ergonomic.

Unfortunately we cannot use the UID/GID that were used in the last
"full" RHCOS image (4.18) as those are now used for dnsmasq (see [1]).
Thus use the first UID & GID available for both user and group, going
downward.

[1] openshift/os#1917 (comment)

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-64841
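
The "first available ID, going downward" selection described in this commit message could be approximated as follows. This is only a sketch of the idea, not how the IDs were actually picked: the helper name, the starting ID, and the lower bound are all assumptions.

```shell
# Hypothetical helper: counting down from START to MIN (default 1), print the
# first ID that does not appear in field 3 of FILE (passwd or group format).
first_free_id_downward() {
    local file="$1" id="$2" min="${3:-1}"
    while [ "$id" -ge "$min" ]; do
        # awk exits 0 when the ID is already taken, 1 when it is free
        if ! awk -F: -v id="$id" '$3 == id { found = 1 } END { exit !found }' "$file"; then
            printf '%s\n' "$id"
            return 0
        fi
        id=$((id - 1))
    done
    return 1   # no free ID in the requested range
}
```

It would be run once against the passwd file and once against the group file, since the UID and GID are chosen independently.
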
@travier
Member Author

travier commented Mar 26, 2026

OK I've made coreos/rhel-coreos-config#224 instead. I'll make another PR here to remove the script for openvswitch.

travier added a commit to travier/os that referenced this pull request Mar 26, 2026
We are moving the group inclusion directly to the RHEL base image
instead of working around it here in the OCP node layer.

See: openshift#1917
See: coreos/rhel-coreos-config#224
See: https://redhat.atlassian.net/browse/OCPBUGS-64841
@travier
Member Author

travier commented Mar 26, 2026

Workaround removal for the node layer: #1918

dustymabe pushed a commit to coreos/rhel-coreos-config that referenced this pull request Mar 28, 2026
The openvswitch user and group have been part of the passwd & group
files for, at least, as long as we've published RHCOS sources publicly:
- https://github.com/openshift/os/blame/bdb5b8153ed68c88e2485d9e7bd66ea6eb54d6c1/passwd#L27
- https://github.com/openshift/os/blame/release-4.19/group#L47

We did not remove them when we re-visited our fixed UIDs/GID in the
split between the RHEL boot image and the new OCP node image ([1], [2] &
[3]). Thus they are now part of the base RHEL boot image, even though
the openvswitch package is not included there.

Although technically unnecessary, this is fine and simplifies things a bit
as we do not have to update the user & group entries during the node
image build, which is currently a problematic topic (see [4]).

Thus instead of adding openvswitch to hugetlbfs group in the node image
build, we add it here directly to simplify the logic.

[1] openshift/os#1661
[2] #29
[3] #31
[4] openshift/os#1917
dustymabe pushed a commit to coreos/rhel-coreos-config that referenced this pull request Mar 28, 2026
Adding users and groups during a container image layered build is
currently non-ergonomic with bootable containers. Thus instead of doing
that in openshift/os for the node layer, we directly include the user &
group here, which also guarantees us that the UID/GID remain stable.

See openshift/os#1917 for the original version
of this change and the full details about what makes adding user/group
in the node layer non-ergonomic.

Unfortunately we cannot use the UID/GID that were used in the last
"full" RHCOS image (4.18) as those are now used for dnsmasq (see [1]).
Thus use the first UID & GID available for both user and group, going
downward.

[1] openshift/os#1917 (comment)

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-64841
@travier
Member Author

travier commented Mar 30, 2026

Closing this one as we are doing coreos/rhel-coreos-config#224 & #1918 instead.

@travier travier closed this Mar 30, 2026
@openshift-ci-robot

@travier: This pull request references Jira Issue OCPBUGS-64841. The bug has been updated to no longer refer to the pull request using the external bug tracker.
