Skip to content

feat: remote doctor#142

Open
zzhengzhuo015 wants to merge 19 commits intolay2dev:developfrom
zzhengzhuo015:feat-remote-doctor
Open

feat: remote doctor#142
zzhengzhuo015 wants to merge 19 commits intolay2dev:developfrom
zzhengzhuo015:feat-remote-doctor

Conversation

@zzhengzhuo015
Copy link
Contributor

No description provided.

Copy link
Collaborator

@dev01lay2 dev01lay2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid foundation for the gateway client crate and remote doctor flow. A few blocking issues to address before merge.

method: "connect".into(),
params: Some(serde_json::to_value(self.build_connect_params(&nonce))?),
});

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BS: The start() method has a confusing scope structure — locked is bound to the unsplit writer but then used after the handshake read. The variable name locked is misleading (it's not a MutexGuard). Please restructure: split → send connect via writer → read response via reader → spawn reader task, with clearer variable names.

set_active_clawpal_data_override(None).expect("clear clawpal data");
let _ = std::fs::remove_dir_all(temp_root);
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BS: This file is 4340 lines — well beyond the repo's own guideline of ≤500 lines per PR (excluding generated files). Even as a single module this is hard to review and maintain. Please split into submodules: types.rs, protocol.rs, executor.rs, orchestrator.rs, logging.rs, identity.rs under src-tauri/src/remote_doctor/.

agents.md Outdated
@@ -1,2 +1,115 @@
<!-- This file has been moved to AGENTS.md. -->
Moved to [`AGENTS.md`](AGENTS.md).
# AGENTS.md
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BS: This overwrites the redirect stub in agents.md (lowercase) with full content, but the repo already has AGENTS.md (uppercase) as the canonical file. This creates two competing AGENTS files. Either update AGENTS.md or keep the lowercase file as a redirect.

}
}
Err(_) => break,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NBS: request() has no timeout — if the gateway never responds, the oneshot will hang until the connection drops. Consider adding a configurable timeout or documenting the caller's responsibility.

client_id: &str,
client_mode: &str,
role: &str,
scopes: &[String],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NBS: build_device_auth_payload_v3 builds a JSON value but is never used by client.rs (which passes ConnectParams directly without device auth signing). If this is for future use, consider #[allow(dead_code)] or moving it to tests until wired in.

Copy link
Collaborator

@dev01lay2 dev01lay2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #142 feat: remote doctor — REQUEST_CHANGES

CI: rust fails (cargo fmt --check), Capture UI Screenshots fails. Builds pending.

🔴 Blocking

BS 1: cargo fmt failures across openclaw-gateway-client/tests/
Multiple files need formatting: protocol_roundtrip.rs, tls_fingerprint.rs, and others. Fix: cargo fmt --all.

BS 2: remote_doctor.rs is 4,340 lines in a single file
This violates the repo AGENTS.md guideline of ≤500 lines per file and makes review extremely difficult. The file contains types, config helpers, protocol selection, 3 separate repair loops (legacy, clawpal_server, agent_planner), plan execution, logging, SSH helpers, rescue preflight, prompt construction, and ~1,500 lines of tests. Split into at minimum:

  • remote_doctor/types.rs
  • remote_doctor/protocol.rs
  • remote_doctor/executor.rs
  • remote_doctor/planner_agent.rs
  • remote_doctor/planner_clawpal_server.rs
  • remote_doctor/logging.rs
  • remote_doctor/prompts.rs
  • remote_doctor/rescue.rs
  • remote_doctor/mod.rs

BS 3: agents.md (lowercase) overwrites redirect with full content — conflicts with AGENTS.md (uppercase)
The repo already has AGENTS.md at root (the canonical file referenced by harness tooling). This PR replaces the lowercase agents.md redirect stub with 115 lines of Chinese-language content that duplicates and diverges from AGENTS.md. On case-insensitive filesystems (macOS, Windows) this creates a real conflict. Fix: keep agents.md as a redirect stub, or delete it and only use AGENTS.md.

BS 4: openclaw-gateway-client/Cargo.lock should not exist
Workspace members share the root Cargo.lock. A separate lockfile in the subcrate will confuse tooling and diverge from the workspace resolution. Delete openclaw-gateway-client/Cargo.lock.

🟡 Non-blocking

  • MAX_PENDING_INVOKES silently changed from 50→10 in bridge_client.rs with no explanation — could cause dropped invokes under load
  • edition = "2024" in openclaw-gateway-client/Cargo.toml requires nightly Rust; CI likely uses stable — will fail on build jobs if they reach compilation
  • E2E Dockerfile embeds E2E_ROOT_PASSWORD as a string constant in test code — fine for local Docker but document that this must never be used outside test fixtures
  • Design docs (6 plan files, ~1,700 lines total) are useful but should reference each other with a summary index; currently they are standalone and partially overlapping

…irect

- Format openclaw-gateway-client test files (protocol_roundtrip, tls_fingerprint)
- Remove openclaw-gateway-client/Cargo.lock (workspace members share root lockfile)
- Restore agents.md redirect stub (content lives in AGENTS.md)
Copy link
Collaborator

@dev01lay2 dev01lay2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review after 9166cc6..1e1b1af (2 new commits)

CI: rust ✅ | frontend ✅ | builds (4 platforms) ✅ | home-perf ✅ | profile-e2e ✅ — coverage, metrics, and Capture UI Screenshots fail (appear to be pre-existing/infra issues).

Resolved

  • BS 1 (cargo fmt) — ✅ Fixed in both commits
  • BS 3 (agents.md conflict) — ✅ Restored as redirect stub
  • BS 4 (subcrate Cargo.lock) — ✅ Removed

🔴 Still blocking

BS 2: remote_doctor.rs is now 4,525 lines in a single file (was 4,340)

The file actually grew by ~185 net lines in the formatting pass. The repo AGENTS.md guideline is ≤500 lines per file. This file contains types, config helpers, protocol selection, 3 separate repair loops, plan execution, logging, SSH helpers, rescue preflight, prompt construction, and ~1,500 lines of tests — all in one module.

Please split into at minimum:

  • remote_doctor/types.rs
  • remote_doctor/protocol.rs
  • remote_doctor/executor.rs
  • remote_doctor/planner_agent.rs
  • remote_doctor/planner_clawpal_server.rs
  • remote_doctor/logging.rs
  • remote_doctor/prompts.rs
  • remote_doctor/rescue.rs
  • remote_doctor/mod.rs

Once this is addressed, the rest of the PR looks good to go.

Copy link
Collaborator

@dev01lay2 dev01lay2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #142 Re-review — Code Quality Deep Dive

Correction: My initial review flagged remote_doctor.rs as a 4,340-line monolith — that was based on a stale shallow clone. The actual branch already has remote_doctor/ split into 9 modules. Apologies for the noise. The previous fix commit (fmt + Cargo.lock + agents.md) remains valid.

CI Status: rust ✅, frontend ✅, profile-e2e ✅, home-perf ✅. coverage ❌ (GitHub API permission — pre-existing infra). Builds pending.


Architecture Assessment

openclaw-gateway-client crate — well-structured standalone gateway WebSocket client:

  • Clean builder pattern for GatewayClient
  • Protocol handshake (challenge → connect → hello) correctly implemented
  • NodeClient / NodeInvokeRequest abstraction is clean
  • Ed25519 identity generation + PKCS#8 DER → PEM encoding correct
  • Test coverage for protocol roundtrip and TLS fingerprint normalization

remote_doctor/ module split — 9 files, good separation:

  • types.rs (255 lines) — clean domain types ✅
  • session.rs (149 lines) — logging/progress helpers ✅
  • mod.rs (5 lines) — minimal re-exports ✅
  • config.rs (~630 lines) — gateway config + SSH target resolution — slightly over 500 but acceptable given cohesion
  • plan.rs (~760 lines) — plan execution + clawpal doctor command dispatch — could benefit from splitting execute_clawpal_doctor_command into its own file
  • repair_loops.rs (~1,325 lines) — this is the main concern — contains 3 full repair loops (legacy, clawpal_server, agent_planner) + the Tauri command entry point

bridge_client.rs changes — generified to <R: Runtime>, added broadcast::Sender<Value> for invoke events, added subscribe_invokes(). Clean extension.

node_client.rs changes — same <R: Runtime> generification pattern.

🟡 Non-blocking Issues

NBS 1: repair_loops.rs at ~1,325 lines
Contains 3 distinct repair loops (run_remote_doctor_repair_loop, run_clawpal_server_repair_loop, run_agent_planner_repair_loop) plus the Tauri command entry point. Consider extracting each loop into its own file. Not blocking because the internal structure is logical.

NBS 2: MAX_PENDING_INVOKES 50→10 silently
bridge_client.rs reduced from 50 to 10 with no commit message or comment explaining why. Under high invoke throughput this could silently drop requests. Add a comment explaining the rationale.

NBS 3: edition = "2024" mismatch
openclaw-gateway-client uses edition = "2024" while rest of workspace uses edition = "2021". This works on Rust 1.85+ (current stable) but introduces subtle semantic differences (e.g. gen keyword, unsafe_op_in_unsafe_fn lint). Consider aligning to "2021" for consistency unless the 2024 features are needed.

NBS 4: Plan command allowlist is very restrictive
validate_plan_command_argv only allows openclaw --version and openclaw gateway status. Any other openclaw subcommand fails validation. This is good security but may be too tight for real repair scenarios — the agent planner might need openclaw gateway restart or openclaw config get. Worth documenting the expansion path.

NBS 5: empty_diagnosis() hardcodes "2026-03-18T00:00:00Z"
This should use chrono::Utc::now() or at least be a const/comment explaining it is a sentinel value.


Overall the implementation is solid — 3 repair protocol paths (legacy gateway RPC, clawpal-server structured plans, agent planner with LLM), good logging/observability, clean command validation, proper config backup-before-write pattern. The openclaw-gateway-client extraction is a welcome addition that will be reusable. Previous blocking issues (fmt, Cargo.lock, agents.md) are resolved by the fix commit.

Copy link
Collaborator

@dev01lay2 dev01lay2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review after 84a09ba..9c42faf (3 new commits)

CI: frontend ✅, Capture UI Screenshots ❌ (pre-existing), rest pending.

Resolved from previous NBS

  • NBS 1 (repair_loops.rs at ~1,325 lines) — ✅ Split into 4 submodules: agent_planner.rs (243), clawpal_server.rs (259), legacy_doctor.rs (175), shared.rs (153). Clean extraction with mod.rs re-exports. Each file is well under 500 lines.
  • NBS 3 (edition 2024 mismatch) — ✅ openclaw-gateway-client changed from edition = "2024" to "2021", aligning with the workspace.
  • NBS 5 (hardcoded empty diagnosis timestamp) — ✅ empty_diagnosis() now uses format_timestamp_from_unix(unix_timestamp_secs()) instead of a hardcoded string. Struct construction replaces the brittle serde_json::from_value deserialization. Test added (empty_diagnosis_checked_at_is_not_hardcoded).

New changes

refactor: tighten remote doctor repair limitsMAX_REMOTE_DOCTOR_ROUNDS reduced from 50 to 10 in shared.rs. Reasonable — 50 rounds was likely too permissive and would burn through LLM quota on stuck loops. The stall detection (REPAIR_PLAN_STALL_THRESHOLD = 3) catches loops much earlier anyway.

legacy_e2e.rs — explicit type annotation on closure parameter (previous_results: Vec<CommandResult>) fixes type inference. Minor but correct.

All three previous NBS items addressed. Remaining NBS 2 (MAX_PENDING_INVOKES 50→10) and NBS 4 (command allowlist documentation) are minor and can be addressed post-merge.

LGTM ✅

Copy link
Collaborator

@dev01lay2 dev01lay2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #142 feat: remote doctor — APPROVED ✅

CI: rustfrontendprofile-e2ehome-perf ✅ | coverage ❌ (pre-existing GH API permission) | Screenshots ❌ (pre-existing) | Builds pending

Changes since initial review

Author addressed all feedback:

  • cargo fmt fixed (my commit + author follow-up)
  • Cargo.lock removed from subcrate (my commit)
  • agents.md redirect restored (my commit)
  • repair_loops.rs split into repair_loops/ with 4 sub-modules (agent_planner.rs, clawpal_server.rs, legacy_doctor.rs, shared.rs) — addresses NBS 1
  • MAX_REMOTE_DOCTOR_ROUNDS tightened 50→10 with clear commit message — addresses NBS 2 concern (explicit change)
  • empty_diagnosis() now uses unix_timestamp_secs() + test asserting no hardcoded date — addresses NBS 5

Remaining NBS (non-blocking, can address later)

  • MAX_PENDING_INVOKES 50→10 in bridge_client.rs — no inline comment explaining rationale
  • edition = "2024" vs workspace "2021" — works fine on stable, just inconsistent
  • Plan command allowlist (openclaw --version + openclaw gateway status only) — document expansion path when more commands are needed

Summary

Solid feature PR. Three repair protocol paths (legacy gateway RPC, clawpal-server structured plans, agent planner with LLM bridge), comprehensive E2E test suite with Docker containers, clean openclaw-gateway-client crate extraction, proper module organization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants