Skip to content

zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions#5875

Open
naiming-zededa wants to merge 1 commit intolf-edge:masterfrom
naiming-zededa:naiming-vnc-cleanup
Open

zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions#5875
naiming-zededa wants to merge 1 commit intolf-edge:masterfrom
naiming-zededa:naiming-vnc-cleanup

Conversation

@naiming-zededa
Copy link
Copy Markdown
Contributor

Description

Problem: vmiVNC.run was not being cleaned up when an edgeview instance crashed or was killed.

Fixes:

  • clustertypes.go: Add AppUUID field to VmiVNCConfig to identify file ownership per-app. Add OwnerAlive() which reads /proc//comm and checks the name is edge-view, guarding against PID reuse by unrelated processes.
  • zedkube/vnc.go: Replace the blind os.Stat block with canClaimVNCFile(). It evicts stale files (dead edgeview PID, reused PID, or idle virtctl port) and blocks only on a genuinely live session — a running edgeview process or a virtctl proxy still listening on the port.
  • edgeview/network.go: Replace the blind os.Stat block with removeStaleVNCFile(), the edgeview-side equivalent. Also call it at edgeview startup (runOnServer) with evictIdle=false to clear files left from a previous crash without disturbing a live remote-console session. Rewrite isPortListening to read /proc/net/tcp directly instead of spawning ss (avoids disrupting the virtctl WebSocket connection).
  • vnc-proxy.sh: Move monitor_caller_pid start to the top of handle_vnc before the virtctl retry loop

PR dependencies

How to test and validate this PR

run the edgeview VNC in multiple times, and verify they are not blocked by the stale config file on the device.

Changelog notes

zedkube/edgeview: fix stale vmiVNC.run blocking new VNC sessions

PR Backports

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

For backport PRs (remove it if it's not a backport):

  • I've added a reference link to the original PR
  • PR's title follows the template

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

@naiming-zededa naiming-zededa force-pushed the naiming-vnc-cleanup branch 2 times, most recently from b49eec0 to b32e0c1 Compare April 28, 2026 04:13
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 31.81818% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.12%. Comparing base (2281599) to head (6f0c5c2).
⚠️ Report is 633 commits behind head on master.

Files with missing lines Patch % Lines
...f-edge/eve/pkg/pillar/utils/netutils/portlisten.go 0.00% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5875      +/-   ##
==========================================
- Coverage   19.52%   17.12%   -2.41%     
==========================================
  Files          19      475     +456     
  Lines        3021    85683   +82662     
==========================================
+ Hits          590    14669   +14079     
- Misses       2310    69496   +67186     
- Partials      121     1518    +1397     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread pkg/edgeview/src/network.go
Comment thread pkg/kube/vnc-proxy.sh Outdated
logmsg "monitor_caller_pid: Caller PID $caller_pid is gone (crashed?), cleaning up"

# Re-check CallerPID in the file before acting, in case a new edgeview
# process with a recycled PID already wrote a new session file.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really for "recycled PID" if we are checking for changed CallerPID?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated comments.

Comment thread pkg/pillar/docs/vnc-workflows.md Outdated
│ ┌─────────┐ AppInstanceConfig ┌──────────┐ │
│ │zedagent │─────────────────────►│ zedkube │ │
│ └─────────┘ (RemoteConsole) └────┬─────┘ │
│ │ runAppVNC() │
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add space to fix the arrow

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

Comment thread pkg/pillar/docs/vnc-workflows.md Outdated
│ │edgeview │◄────────────────────│ edgeview │ │
│ │ client │ │ (server) │ │
│ └──────────┘ └────┬─────┘ │
│ │ setupEveKVNC() │
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is edgeview used also for Controller-triggered VNC access? Based on this diagram it seems that zedkube talks to edgeview to setup VNC.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redo the drawing to reflect the schemes properly.

Comment thread pkg/pillar/docs/vnc-workflows.md Outdated

Rules:

- `CallerPID > 0` → edgeview owns this session; liveness via `/proc/<pid>/comm`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that in the edgeview-case, the liveness is both PID-check and port-listen check (?)
But the port-listen check is not mentioned for CallerPID > 0

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we actually missing the code in checking the vnc port. updated the code and the doc.

│ │ │ go runAppVNC() │ │ │ │
│ │ │ getVMIdomainName() │ │ │ │
│ │ │ (retry up to 6×) │ │ │ │
│ │ │ canClaimVNCFile() │ │ │ │
Copy link
Copy Markdown
Contributor

@milan-zededa milan-zededa Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoule be here also the "evict stale" step (as shown for the edgeview case below)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

Comment thread pkg/pillar/docs/vnc-workflows.md Outdated

Both `removeStaleVNCFile` (edgeview) and `canClaimVNCFile` (zedkube) probe
whether a virtctl proxy is actually running before declaring a remote-console
file stale. They do this differently:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do they really do it differently? They use the same helper function after all

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

▼ no
parsable &
VNCPort > 0? ──no──► remove file ──► return true
Copy link
Copy Markdown
Contributor

@milan-zededa milan-zededa Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there can be a race condition between edgeview and zedkube. These file/PID/port checks are not atomic.
But I guess this is up to the user to avoid opening multiple VNC sessions from these two places in parallel.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have added the error messages in place in case this happens.

Problem: vmiVNC.run was not being cleaned up when an edgeview instance
crashed or was killed.

  Fixes:
  - clustertypes.go: Add AppUUID field to VmiVNCConfig to identify file
    ownership per-app. Add OwnerAlive() which reads /proc/<pid>/comm and
    checks the name is edge-view, guarding against PID reuse by
    unrelated processes.
  - zedkube/vnc.go: Replace the blind os.Stat block with
    canClaimVNCFile(). It evicts stale files (dead edgeview PID, reused
    PID, or idle virtctl port) and blocks only on a genuinely live
    session — a running edgeview process or a virtctl proxy still
    listening on the port.
  - edgeview/network.go: Replace the blind os.Stat block with
    removeStaleVNCFile(), the edgeview-side equivalent. Also call it at
    edgeview startup (runOnServer) with evictIdle=false to clear files
    left from a previous crash without disturbing a live remote-console
    session. Rewrite isPortListening to read /proc/net/tcp directly
    instead of spawning ss (avoids disrupting the virtctl WebSocket
    connection).
  - vnc-proxy.sh: Move monitor_caller_pid start to the top of handle_vnc
    before the virtctl retry loop

Signed-off-by: naiming-zededa <naiming@zededa.com>
@naiming-zededa naiming-zededa force-pushed the naiming-vnc-cleanup branch from 79515e0 to 6f0c5c2 Compare May 1, 2026 03:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants