CASMTRIAGE-9128: Update IUF docs for procedure to follow after each batch of worker nodes rollout w.r.t. iSCSI #6552
…atch of worker nodes rollout w.r.t iSCSI SBPS - initial placeholder commit
Update management_rollout.md for iSCSI SBPS Signed-off-by: ravikanth-nalla-hpe <140072234+ravikanth-nalla-hpe@users.noreply.github.com>
| 1. Invoke `iuf run` with `-r` to execute the [`management-nodes-rollout`](../stages/management_nodes_rollout.md) stage on `ncn-m001`. This will rebuild `ncn-m001` with the new CFS configuration and image built in | ||
| previous steps of the workflow. | ||
| 1. Invoke `iuf run` with `-r` to execute the [`management-nodes-rollout`](../stages/management_nodes_rollout.md) stage on `ncn-m001`. This will rebuild `ncn-m001` with the new CFS configuration and image built in previous steps of the workflow. |
Some spacing appears to have been realigned here; can you please check?
| 1. (`ncn-nid#`) Verify whether the following messages are observed on any compute or UAN node. | ||
| ```bash |
Can this be a script that runs across all compute nodes and UANs?
| TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000001 from iqn.2023-06.csm.iscsi:x 1000c1s1b1n0 | ||
| ``` | ||
| If above respective messages are encountered on the canary worker and any compute/UAN nodes, ensure the following procedure is followed before continuing. |
Should this be seen on both worker nodes and compute nodes?
| TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000001 from iqn.2023-06.csm.iscsi:x 1000c1s1b1n0 | ||
| ``` | ||
| If above respective messages are encountered on the canary worker and any compute/UAN nodes, ensure the following procedure is followed before continuing. |
This needs rewording based on the answer to the question above.
| ```bash | ||
| dmesg | grep "Detected NON_EXISTENT_LUN Access" | ||
| ``` |
This message is initially seen at boot time on the worker node, so we can't rely on it unless it persists, right?
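As a rough illustration of the persistence concern, the sketch below counts `Detected NON_EXISTENT_LUN Access` lines in two `dmesg` samples taken some time apart and reports whether the count grew. The function names `count_lun_errors` and `lun_errors_persist` are hypothetical and not part of CSM or IUF, and the approach assumes the kernel ring buffer does not wrap between the two samples; treat it as a sketch, not the documented procedure.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of CSM or IUF), shown for illustration only.

# Count "Detected NON_EXISTENT_LUN Access" lines in dmesg text read from stdin.
count_lun_errors() {
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
    grep -c "Detected NON_EXISTENT_LUN Access" || true
}

# Succeed (exit 0) only if the second count ($2) is greater than the first ($1),
# i.e. new messages were logged between the two samples.
lun_errors_persist() {
    [ "$2" -gt "$1" ]
}
```

On a worker node this might be used as `before=$(dmesg | count_lun_errors)`, followed by a wait of a few minutes, a second sample taken the same way into `after`, and then `lun_errors_persist "$before" "$after"`. Running it across several nodes would need a wrapper over a site-specific node list, which is not shown here.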
| sd 2:0:0:12: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments. | ||
| sd 2:0:0:7: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments. | ||
| sd 2:0:0:9: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments. | ||
| ``` |
This is not right. These messages are seen on iSCSI initiator nodes, not on the iSCSI target. They also appear after any asynchronous scan or an `iscsiadm` rescan, so we can't rely on them. We need to check the `multipath -ll` output on the initiator nodes to see whether paths are lost, right? Please mention the symptom seen in CASMTRIAGE-9122. Alternatively, this can be documented as a prerequisite step to avoid this issue as well as the other iSCSI issues we have seen during upgrades.
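The `multipath -ll` check suggested here could be sketched roughly as follows. The helper reads `multipath -ll` output on stdin and counts path lines whose state is reported as `failed` or `faulty`. The function name `count_failed_paths` is hypothetical, and the matching assumes the usual `H:C:T:L device major:minor state ...` path-line layout, which may differ between multipath-tools versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of CSM or IUF): count failed paths in
# `multipath -ll` output read from stdin.
count_failed_paths() {
    # Keep only path lines (those containing a host:channel:target:lun tuple),
    # then count the ones marked failed or faulty. grep -c prints 0 and exits
    # non-zero when nothing matches, so that status is ignored.
    grep -E '[0-9]+:[0-9]+:[0-9]+:[0-9]+ ' | grep -cE 'failed|faulty' || true
}
```

On an initiator node (compute or UAN) this might be run as `multipath -ll | count_failed_paths`. A non-zero result would only suggest lost paths; it would still need to be compared against the symptom described in CASMTRIAGE-9122.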
| TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000001 from iqn.2023-06.csm.iscsi:x 1000c1s1b1n0 | ||
| TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000001 from iqn.2023-06.csm.iscsi:x 1000c1s1b1n0 | ||
| TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000001 from iqn.2023-06.csm.iscsi:x 1000c1s1b1n0 | ||
| TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000001 from iqn.2023-06.csm.iscsi:x 1000c1s1b1n0 |
Same as my first comment: we can't rely on this. Please mention this as a prerequisite step during rollouts.
This pull request has not had activity in over 20 days and is being marked as stale.
Description
During the upgrade from CSM 25.9.0 (1.7.0) to CSM 26.3.0 (1.7.1) on the Creek system:
After the management rollouts of all worker nodes completed, SQUASHFS errors and "LUN assignments on this target have changed" messages were observed in the compute nodes' dmesg logs, and "Detected NON_EXISTENT_LUN Access" messages were observed on the worker nodes. The compute nodes appear to freeze because of the flood of messages, and commands fail.
Resolution
Update the IUF management rollouts section (for workers) to check whether the issue can occur and to apply the CASMTRIAGE-9129 procedure as a preventive action against it (compute nodes freezing or becoming unresponsive).
Relates to:
CASMTRIAGE-9122 - Parent JIRA
CASMTRIAGE-9128 - Current PR ref JIRA
CASMTRIAGE-9129 - iSCSI SBPS procedure