zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions#5875
zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions#5875naiming-zededa wants to merge 1 commit intolf-edge:masterfrom
Conversation
b49eec0 to
b32e0c1
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5875 +/- ##
==========================================
- Coverage 19.52% 17.12% -2.41%
==========================================
Files 19 475 +456
Lines 3021 85683 +82662
==========================================
+ Hits 590 14669 +14079
- Misses 2310 69496 +67186
- Partials 121 1518 +1397 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
b32e0c1 to
a2833f4
Compare
| logmsg "monitor_caller_pid: Caller PID $caller_pid is gone (crashed?), cleaning up" | ||
|
|
||
| # Re-check CallerPID in the file before acting, in case a new edgeview | ||
| # process with a recycled PID already wrote a new session file. |
There was a problem hiding this comment.
Is this really for "recycled PID" if we are checking for changed CallerPID?
There was a problem hiding this comment.
updated comments.
| │ ┌─────────┐ AppInstanceConfig ┌──────────┐ │ | ||
| │ │zedagent │─────────────────────►│ zedkube │ │ | ||
| │ └─────────┘ (RemoteConsole) └────┬─────┘ │ | ||
| │ │ runAppVNC() │ |
There was a problem hiding this comment.
Add space to fix the arrow
| │ │edgeview │◄────────────────────│ edgeview │ │ | ||
| │ │ client │ │ (server) │ │ | ||
| │ └──────────┘ └────┬─────┘ │ | ||
| │ │ setupEveKVNC() │ |
There was a problem hiding this comment.
Is edgeview used also for Controller-triggered VNC access? Based on this diagram it seems that zedkube talks to edgeview to setup VNC.
There was a problem hiding this comment.
redo the drawing to reflect the schemes properly.
|
|
||
| Rules: | ||
|
|
||
| - `CallerPID > 0` → edgeview owns this session; liveness via `/proc/<pid>/comm` |
There was a problem hiding this comment.
I believe that in the edgeview-case, the liveness is both PID-check and port-listen check (?)
But the port-listen check is not mentioned for CallerPID > 0
There was a problem hiding this comment.
yes, we actually missing the code in checking the vnc port. updated the code and the doc.
| │ │ │ go runAppVNC() │ │ │ │ | ||
| │ │ │ getVMIdomainName() │ │ │ │ | ||
| │ │ │ (retry up to 6×) │ │ │ │ | ||
| │ │ │ canClaimVNCFile() │ │ │ │ |
There was a problem hiding this comment.
Shoule be here also the "evict stale" step (as shown for the edgeview case below)?
|
|
||
| Both `removeStaleVNCFile` (edgeview) and `canClaimVNCFile` (zedkube) probe | ||
| whether a virtctl proxy is actually running before declaring a remote-console | ||
| file stale. They do this differently: |
There was a problem hiding this comment.
Do they really do it differently? They use the same helper function after all
| │ | ||
| ▼ no | ||
| parsable & | ||
| VNCPort > 0? ──no──► remove file ──► return true |
There was a problem hiding this comment.
I feel like there can be a race condition between edgeview and zedkube. These file/PID/port checks are not atomic.
But I guess this is up to the user to avoid opening multiple VNC sessions from these two places in parallel.
There was a problem hiding this comment.
i have added the error messages in place in case this happens.
a2833f4 to
b257790
Compare
b257790 to
79515e0
Compare
Problem: vmiVNC.run was not being cleaned up when an edgeview instance
crashed or was killed.
Fixes:
- clustertypes.go: Add AppUUID field to VmiVNCConfig to identify file
ownership per-app. Add OwnerAlive() which reads /proc/<pid>/comm and
checks the name is edge-view, guarding against PID reuse by
unrelated processes.
- zedkube/vnc.go: Replace the blind os.Stat block with
canClaimVNCFile(). It evicts stale files (dead edgeview PID, reused
PID, or idle virtctl port) and blocks only on a genuinely live
session — a running edgeview process or a virtctl proxy still
listening on the port.
- edgeview/network.go: Replace the blind os.Stat block with
removeStaleVNCFile(), the edgeview-side equivalent. Also call it at
edgeview startup (runOnServer) with evictIdle=false to clear files
left from a previous crash without disturbing a live remote-console
session. Rewrite isPortListening to read /proc/net/tcp directly
instead of spawning ss (avoids disrupting the virtctl WebSocket
connection).
- vnc-proxy.sh: Move monitor_caller_pid start to the top of handle_vnc
before the virtctl retry loop
Signed-off-by: naiming-zededa <naiming@zededa.com>
79515e0 to
6f0c5c2
Compare
Description
Problem: vmiVNC.run was not being cleaned up when an edgeview instance crashed or was killed.
Fixes:
PR dependencies
How to test and validate this PR
run the edgeview VNC in multiple times, and verify they are not blocked by the stale config file on the device.
Changelog notes
zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions
PR Backports
Checklist
For backport PRs (remove it if it's not a backport):
And the last but not least:
check them.