Skip to content

📝 docs: Consolidate specs directory, validate ACPX spec, and fix Gateway Async Init#2

Merged
hrygo merged 86 commits intomasterfrom
docs/specs-consolidation
Apr 4, 2026
Merged

📝 docs: Consolidate specs directory, validate ACPX spec, and fix Gateway Async Init#2
hrygo merged 86 commits intomasterfrom
docs/specs-consolidation

Conversation

@hrygo
Copy link
Copy Markdown
Owner

@hrygo hrygo commented Apr 4, 2026

Resolves #1 - Specs 目录重组、ACPX Spec 验证和 Gateway 异步初始化修复

概述

本 PR 整合并标准化了 specs 目录结构,通过 acpx CLI 验证了 ACPX Worker 集成规格,并修复了 Gateway 异步初始化 spec 中的关键错误。

主要变更

1. 📂 Specs 目录重组和标准化

问题: specs 目录结构混乱,缺少统一的 metadata 标准

解决方案:

  • ✅ 统一所有 spec 文档的 YAML frontmatter 格式
  • ✅ 添加 type, status, progress, estimated_hours 字段
  • ✅ 重新分类文档
  • ✅ 更新 docs/specs/README.md 索引,按状态和类型分类

影响: 15+ spec 文档现在具有一致的 metadata,便于追踪和管理

2. ✅ ACPX Spec 验证 (98% 置信度)

问题: Worker-ACPX-Spec.md 基于 acpx CLI v0.4.0 文档编写,未经实际验证

解决方案:

  • ✅ 通过 acpx CLI v0.4.0 实际测试验证所有协议细节
  • ✅ 验证 JSON-RPC 2.0 协议格式 (100% 准确)
  • ✅ 验证初始化握手流程 (100% 准确)
  • ✅ 验证流式输出事件 (100% 准确)
  • ✅ 验证工具调用事件 (95% 准确)
  • ✅ 验证 Resume 流程 (95% 准确)
  • ✅ 创建 ACPX-Validation-Report.md 详细报告
  • ✅ 添加 validate-acpx-spec.sh 自动化验证脚本

验证方法:
```bash

基础协议测试

acpx --format json claude "What is 2+2?"

工具调用测试

acpx --format json claude "List files in current directory"

Resume 流程测试

acpx claude sessions new --name test-resume
echo "My favorite number is 42" | acpx claude -s test-resume
echo "What is my favorite number?" | acpx claude -s test-resume
```

验证结果:

  • 总体置信度: 98% ⬆️ (从 85% → 95% → 98%)
  • 协议格式: 100%
  • 初始化流程: 100%
  • 流式事件: 100%
  • 工具调用: 95%
  • Resume 流程: 95%

3. 🔧 Gateway Async Init Spec 修复

问题: Gateway 异步初始化 spec 中存在关键错误

修复内容:

  • ✅ 修正 SendToSession 方法签名使用
  • ✅ 移除未使用的 sessionInfo 变量
  • ✅ 完善异步初始化 API 描述

新增工具

scripts/validate-acpx-spec.sh

用途: 自动化验证 ACPX spec 与实际 acpx CLI 的一致性

功能:

  • 验证 JSON-RPC 2.0 协议格式
  • 检查初始化握手流程
  • 验证流式事件
  • 测试工具调用事件
  • 验证命名会话管理
  • 测试 Resume 流程
  • 检查错误处理格式

使用:
```bash
./scripts/validate-acpx-spec.sh
```

文档变更

新增文档

  • docs/specs/ACPX-Validation-Report.md - 完整的 ACPX spec 验证报告

更新文档

  • docs/specs/Worker-ACPX-Spec.md - 更新 metadata,添加验证报告链接
  • docs/specs/Gateway-Async-Init-Spec.md - 修复关键错误
  • docs/specs/Go-Client-Example-Design.md - 更新状态为 implemented
  • docs/specs/README.md - 重组索引和分类
  • scripts/README.md - 添加验证脚本文档

Metadata 更新

文档 旧状态 新状态 进度
Worker-ACPX-Spec draft review 0% → 30%
Go-Client-Example-Design approved implemented 0% → 100%
Python-Client-Design approved implemented - → 100%
AI-SDK-Chatbot-Integration draft implemented - → 100%

测试

验证测试

  • ✅ acpx CLI v0.4.0 实际运行测试
  • ✅ JSON-RPC 2.0 协议格式验证
  • ✅ 初始化握手流程验证
  • ✅ 流式事件格式验证
  • ✅ 工具调用事件验证
  • ✅ Resume 流程验证

自动化测试

  • scripts/validate-acpx-spec.sh - 快速验证脚本

Checklist

  • 所有 spec 文档具有正确的 YAML metadata
  • Worker-ACPX-Spec.md 已通过实际测试验证 (98% 置信度)
  • Gateway-Async-Init-Spec.md 关键错误已修复
  • 验证脚本已添加到 `scripts/` 目录
  • README 文档已更新
  • 所有变更已 commit 并 push
  • 分支名符合规范 `docs/specs-consolidation`

相关文档

  • Spec 文档: docs/specs/Worker-ACPX-Spec.md
  • 验证报告: docs/specs/ACPX-Validation-Report.md
  • 验证脚本: scripts/validate-acpx-spec.sh

后续工作

  1. Worker 适配器实现 - 基于 validated spec 开始 ACPX Worker 适配器开发
  2. 补充测试 - 在实际使用中遇到未覆盖场景时补充测试
  3. 文档维护 - 当 acpx 版本更新时重新验证 spec

黄飞虹 and others added 30 commits March 31, 2026 10:54
…ts, Admin health

## EventStore Core (EVT-001~006)
- Add events table schema (id/session_id/seq/event_type/payload_json)
- MessageStore interface + SQLiteMessageStore implementation
- Async batch writer (channel 1024, 100ms/50-event flush)
- Gateway integration via Bridge.Append on done events
- HOTPLEX_EVENT_STORE=disabled toggle for optional persistence
- owner_id migration on sessions table

## Metrics Inc/Set Wiring (OBS-004~005)
- Wire all 13 metric vectors across hub/conn/manager/pool
- SessionsActive/Total/Terminated/Deleted
- WorkersRunning/StartsTotal/ExecDuration
- GatewayConnectionsOpen/MessagesTotal/DeltasDropped/ErrorsTotal
- PoolAcquireTotal/Utilization

## Test Infrastructure (TEST-001~003,005~007)
- GitHub Actions CI (go vet + test -race + coverage)
- security/session/gateway table-driven tests with testify/require
- mockStore with testify/mock
- WebSocket mock server (detached goroutine handler)
- All 5 test packages passing

## Worker Process Limits (WK-009, RES-005)
- bufio.Scanner 64KB init / 1MB cap per line
- ReadLine() with panic-recover for ErrTooLong
- RLIMIT_AS 512MB via syscall.Setrlimit
- WorkerHealth struct + Health() interface

## Admin Health Endpoints (ADMIN-006~007)
- /admin/health: unauthenticated (moved before admin mux)
- /admin/health/ready: new readiness probe
- WorkerHealthStatuses() real probing, 503 when unhealthy

## Bug Fixes
- Fix fork bomb regex pattern (:\(\)\s*\{\s*:\|)
- Fix ValidateInit typed nil (use err==nil direct check)
- Fix newTestWSServer synchronous handler deadlock (detach goroutine)
- Fix TestSafePathJoin non-existent paths (create files first)
- Fix TestExpandEnv HOME env var (use TEST_MY_HOME)
## OTel Tracing (OBS-006)
- Add internal/tracing/tracing.go: Init/Shutdown/Attr utilities
- Graceful degradation: no-op tracer if OTEL_SDK_DISABLED=true or no endpoint
- OTEL_EXPORTER_OTLP_ENDPOINT env var for exporter configuration
- Spans: hub.broadcast, conn.recv, conn.init
- Span attributes: session_id, event_type, seq, priority

## Config Hot Reload (CONFIG-006~008)
- Add internal/config/watcher.go: fsnotify file watcher
- 500ms debounce to prevent rapid-fire reloads
- HotReloadableFields: gateway.addr/pool.max_size/gc_scan_interval etc.
- StaticFields: security.api_keys/db.path require restart
- ConfigChange audit log with timestamp/field/old/new/hot
- Hot reload callback wired in main.go run()

## Dependencies
- Add go.opentelemetry.io/otel/* (OTLP SDK + stdout exporter)
## Security Package (SEC-001~045)
- internal/security/env.go: BaseEnvWhitelist, ProtectedEnvVars, Sensitive detection
- internal/security/env_builder.go: BuildEnv, AddWorkerType, AddHotPlexVar
- internal/security/jwt.go: ES256 JWT validation, JTI blacklist, claims
- internal/security/limits.go: MaxEnvelopeBytes/MaxSessionBytes/MaxLineBytes
- internal/security/model.go: AllowedModels whitelist
- internal/security/path.go: SafePathJoin, BaseDir validation
- internal/security/ssrf.go: URL validation, blocked CIDRs, DNS rebinding protection
- internal/security/tool.go: AllowedTools, BuildAllowedToolsArgs

## SPEC Documentation
- docs/SPECS/Acceptance-Criteria.md: Full AC spec (20 categories, 157 items)
- docs/SPECS/AC-Tracking-Matrix.csv: CSV tracking format
- docs/SPECS/AC-Tracking-Matrix.md: Detailed tracking matrix
- docs/SPECS/README.md: SPEC directory overview
- .github/workflows/pr-checks.yml: PR checks workflow
Upgrade CI workflow with current best practices:
- actions/checkout@v6, actions/setup-go@v6, go-version '1.26'
- golangci-lint-action@v9 with latest version
- Path filter (dorny/paths-filter) to skip CI on doc-only changes
- concurrency group with cancel-in-progress
- Minimal permissions (contents: read, pull-requests: write)
- setup-go cache for faster dependency resolution
- codecov-action@v5 upload with explicit slug
- Add PR checks workflow (branch naming + issue link validation)
- Add standard PR template

BREAKING CHANGE: CI now requires CODECOV_TOKEN secret for coverage upload
…odes

Gateway init protocol improvements:
- Add InitAuth struct to carry Bearer token from client
- Add InitConfig parsing (model, system_prompt, allowed_tools,
  disallowed_tools, max_turns, work_dir)
- Add InitAuth to InitData and wire through ValidateInit
- Add ERR_CODE_VERSION_MISMATCH for clearer version errors
  (replaces ad-hoc PROTOCOL_VIOLATION on version mismatch)
- Add ServerCaps.MaxTurns and ServerCaps.Modalities fields
- Add OwnerID field to Envelope for authenticated user tracking
- Add 12 new error codes: WORKER_OOM, SESSION_EXPIRED/TERMINATED/
  INVALIDATED, AUTH_REQUIRED, VERSION_MISMATCH, CONFIG_INVALID,
  GATEWAY_OVERLOAD, EXECUTION_TIMEOUT, RECONNECT_REQUIRED,
  WORKER_OUTPUT_LIMIT
- Fix golangci-lint errors: remove unused initDataFromMap helper
  and fix unchecked pool.Acquire error returns in tests
Restructure agent rules from 3 monolithic files into 6 focused modules:
- go125.md + golang-style.md → golang.md (merged, Go 1.26 aligned)
- go126.md → removed (superseded by golang.md)
- New aep.md: AEP v1 protocol spec (envelope/codec/routing/backpressure)
- New security.md: JWT/SSRF/Env isolation/command whitelist/AllowedTools
- New session.md: 5-state machine/TransitionWithInput/SESSION_BUSY/
  GC strategy/mutex spec/PoolManager/SQLite WAL
- New metrics.md: Prometheus naming/OTel Span/SLO definition
- worker-proc.md: update paths filter from pool/ to session/
These reports documented implementation gaps from a previous review cycle
that have since been addressed or superseded by the current SPECS.
Keeping them risks spreading outdated information.
Add prominent header noting that EventStore/MessageStore/AuditLog are
NOT implemented in v1.0 — rationale: Worker itself handles persistence
(Claude Code ~/.claude/projects/, OpenCode server-side state); Gateway
scope is control-plane only. Roadmap table updated to mark all items
as ❌ not implemented with note that v1.0 defers to Worker-layer persistence.
Update references to point to new modular rule files:
- golang.md, aep.md, security.md, session.md, metrics.md,
  worker-proc.md, testing.md
Remove obsolete go125.md/go126.md/golang-style.md references
Add cross-references to session.md for detailed state machine docs
…ion tests

Config system:
- Watcher.NewWatcher: add SecretsProvider param to support loading
  sensitive values from external secret stores
- Config.Load: add SecretsProvider field and pass to Watcher
- cmd/gateway/main.go: wire JWT secret from cfg.Security.JWTSecret
  (loaded via config secrets provider) instead of os.Getenv
- Add codecov.yml with standard configuration

Test coverage:
- Add dbginline_test.go for inline debug config validation
- Add directcheck_test.go for AEP init envelope direct validation
- Add validatecheck_test.go for InitData.ValidateInit unit tests
- Add validatecheck2_test.go for InitData edge case coverage
Acquire previously returned *PoolError, causing typed-nil issues with
require.NoError in tests. Changed to return error interface; callers
that need PoolError fields use type assertion.

- pool.go: Acquire returns error instead of *PoolError
- manager.go: type-assert to *PoolError for Kind field access
- pool_test.go: type-assert in GlobalLimit and UserQuotaLimit tests
Replace unsafe blank-identifier type assertion with errors.As,
preventing nil-pointer panic if Acquire ever returns a non-PoolError
error type.

Also remove unused cfg variables from pool tests.
- Add .agent/settings.local.json to .gitignore
- Restructure codecov coverage targets to align with project
  module organization (security/protocol/session/worker as separate
  targets, each with appropriate thresholds)
Verified against live codebase:
- CONFIG-006~008: 🟢 PASS (fsnotify watcher + debounce + audit log implemented)
- CONFIG-010: 🟢 PASS (Viper merge + LoadOptions chain)
- EVT-001: 🟢 PASS (MessageStore interface + SQLiteMessageStore wired)
- TEST-001~002,004,006~007: 🟢 TODO → 🟡 IN_PROGRESS (10 test files,
  testify/require, WS mock server, codecov.yml all present; GitHub Actions
  and E2E/Playwright still missing)
- Summary: 130/157 PASS (83%), P1 86%, recalculated milestones
Phase 1 — EventStore:
- EVT-002~006 already verified PASS

Phase 2 — Worker robustness:
- AEP-020: worker crash mapped to synthetic failure done event via Wait()
- WK-009: Bridge sends synthetic failure done on worker non-zero exit
- WK-010: anti-pollution turn counting with ErrMaxTurnsReached + auto-kill
- WK-011: LastIO() added to Worker interface for GC zombie detection
- SEC-045: AllowedTools wired into managedSession (impl pending real adapters)

Phase 3 — Admin API:
- ADMIN-008: Hub.LogHandler callback + ring buffer for event capture
- ADMIN-009: POST /api/v1/config/validate with JSON body parsing
- ADMIN-010: GET /api/v1/debug/sessions/{id} exposes mutex/mu/turn_count/worker_health
- GW-006: Hub.Shutdown drains broadcast queue before closing connections

Also:
- managedSession fields exported (Worker, Mu, TurnCount) for debug access
- fmt import added to hub.go
- Worker interface updated: LastIO() time.Time method added
- NoopWorker implements LastIO()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ess PR

- hub.go: add SeqGen.Peek() for read-only seq access; add Hub.NextSeqPeek()
- session/manager.go: add DebugSnapshot() to safely expose ms fields
  under lock; callers no longer acquire ms.Mu directly (deadlock guard)
- main.go: HandleDebugSession uses DebugSnapshot + NextSeqPeek
  (no longer reads ms.Worker/ms.Mu directly from outside session pkg)
- main.go: remove size*1000 modulo hack — Go 1.26: head>=size always,
  subtraction is non-negative, extra multiplier is dead code
- conn.go: forwardEvents wraps Wait() with 2s timeout goroutine
  to prevent indefinite block if worker is in a zombie state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d GetManagedSessionDebug

managedSession.Worker and managedSession.Mu were exported unnecessarily,
allowing external callers to bypass AttachWorker/DetachWorker (pool quota
invariant) and violate Manager lock ordering. GetManagedSessionDebug was
dead code after DebugSnapshot was added.

- managedSession.Worker → managedSession.worker (unexported)
- managedSession.Mu     → managedSession.mu     (unexported)
- Remove GetManagedSessionDebug (no callers remain)
- All internal references updated; build clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RES-008 — per-user max_total_memory_mb:
- PoolConfig.MaxMemoryPerUser (int64, default 2 GB)
- PoolManager.AcquireMemory/ReleaseMemory/UserMemory methods
- workerMemoryEstimate = 512 MB (matches RLIMIT_AS cap)
- AttachWorker calls AcquireMemory after slot quota; rollback on failure
- DetachWorker calls ReleaseMemory alongside Release
- ErrMemoryExceeded sentinel error
- 5 new table-driven tests covering limit/unlimited/cross-user/integrated

RES-009 — worker crash rate metrics:
- WorkerCrashesTotal (counter, labels: worker_type, exit_code)
- WorkerMemoryBytes (gauge, labels: worker_type)
- ForwardEvents increments crash counter when exit_code != 0

Matrix corrected: EVT-002~006, RES-005 were already PASS; corrected
9 rows and summary: 150/170 PASS (88%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- isReadTimeout, broadcastQueueSize (3 variants), isDroppable
- heartbeat: MarkAlive, MarkMissed (under/at-limit/after-stop), MissedCount, Stop idempotency
- SeqGen: Next (startsAt 1, increments, independent sessions), Peek (zero unknown, does not increment)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add all remaining AEP v1 event kinds and data types:
- Message (complete message), Reasoning, Step, PermissionRequest/Response
- MessageData, ReasoningData, StepData, PermissionRequestData, PermissionResponseData
- ToolCall, ToolResult, Ping, Pong already existed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace flat CI with layered pipeline:
- Layer 1 (Gate): vet + build + lint - fast fail
- Layer 2 (Unit Test): per-package matrix with coverage
- Layer 3 (Integration): full suite with race detector + Codecov
- Layer 4 (Coverage Check): merge profiles, threshold gate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SEC-007 (Multi-bot Isolation):
- Add BotID field to Conn struct, extracted from JWT claims
- Add bot_id mismatch check when joining existing sessions
- Add CreateWithBot method on session Manager
- Add bot_id column + index to sessions SQLite table

SEC-045 (AllowedTools → Worker Proc):
- Add AllowedTools to proc.Manager Opts struct
- Auto-append --allowed-tools args in Start() via BuildAllowedToolsArgs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove t.Parallel() from TestExpandEnv (mutates global env vars)
- Add watcher_test.go (config coverage 33.3% → 77.8%)
- Add events_test.go (events coverage → 100%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update scannerMaxSize from 1MB to 10MB and matching AEP-008
specification requirement. Worker stdout
lines exceeding 10MB now trigger a bufio.ErrTooLong
panic, which is returned as a friendly error message instead.

Also updated error message in ReadLine() to
reflect the new 10MB limit.

Refs: AEP-008
Replace immediate Kill() with graceful Terminate()
in state transitions:

- Extract 5s timeout to a constant
- Use parent context instead of context.Background()
- Gracefully send SIGTERM, then escalate to
  SIGKILL after 5s grace period

Also update anti-pollution restart to
continue using Kill() (intentional for
emergency cases).

This implements AEP-021 specification
and providing better worker lifecycle management.

Refs: AEP-021
Split long ALTER TABLE statements into
separate ExecContext calls for better
readability.

No functional changes.
黄飞虹 and others added 29 commits April 2, 2026 15:09
Ensure SQLite database files generated during development
(gateway.db, gateway.db-shm, gateway.db-wal) are ignored.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove invalid ldflags (buildTime/goVersion) from Dockerfile and Makefile
- Fix healthcheck port 9080→9999 in docker-compose.yml
- Fix backup database path hotplex.db→gateway.db
- Add Prometheus scrape config (configs/prometheus.yml)
- Add Grafana provisioning (dashboards, datasources)
- Remove non-existent volume mounts (grafana dirs, prometheus.yml)
- Remove invalid HOTPLEX_DEV_MODE env var
Reorganize configs directory following infrastructure-as-code best practices:
- Move monitoring configs (prometheus, alerts, slo, otel) to configs/monitoring/
- Consolidate Grafana provisioning under configs/monitoring/grafana/
- Remove duplicate grafana-dashboard.json (now dashboard.json in dashboards/)
- Update docker-compose.yml volume paths to new locations
- Expand README with monitoring stack usage documentation
Standardize database filename to hotplex-worker.db across all files:
- Code default (internal/config/config.go)
- Config files (config.yaml, env.example)
- Docker Compose (backup service)
- Scripts (install.sh, quickstart.sh, docker-build.sh, README.md)
- Docs (User-Manual, Disaster-Recovery, Admin-API-Design, Config-Reference)
- Specs (TRACEABILITY-MATRIX, README)

Also corrects legacy hotplex.db references to hotplex-worker.db.
Add comprehensive design document for Python client example module:
- Target: third-party developers integrating HotPlex Worker
- Architecture: 3-layer (protocol/transport/client)
- Examples: quickstart (5min) + advanced (complete)
- Tech stack: Python 3.10+, websockets, asyncio
- No PyPI release (local package only)

Refs: #python-client-design

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add comprehensive Python client example demonstrating AEP v1 protocol usage:

Architecture:
- 3-layer design: protocol (codec) → transport (connection) → client (session API)
- Pure async/await with Python 3.10+ modern type hints
- Event-driven callbacks for real-time message handling

Features:
- quickstart.py: 5-minute getting started guide
- advanced.py: Complete example with tool calls, permissions, state management
- Full type safety with dataclasses and TypeVar generics
- Custom exception hierarchy for clear error classification

Components:
- protocol.py: NDJSON envelope encoding/decoding (~250 lines)
- transport.py: WebSocket connection management (~150 lines)
- client.py: High-level session API (~250 lines)
- types.py: AEP v1 data models (~150 lines)
- exceptions.py: Exception classes (~50 lines)

Tech stack:
- Python 3.10+ (dataclasses, StrEnum, match/case)
- websockets 12.0+ (pure async WebSocket)
- asyncio (standard library)

No PyPI release (local package only for examples)

Design doc: docs/superpowers/specs/2026-04-02-python-client-design.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix issues identified by code review agents:

Code Quality:
- Remove unused import 're' in protocol.py
- Remove duplicate import in advanced.py

Efficiency:
- Add max_queue_size parameter (default 1000) to prevent unbounded memory growth
- Remove redundant _connected flag, derive state from _ws.open
- Handle both str and bytes WebSocket messages correctly

Issues fixed:
- HIGH: Unbounded message queue could cause memory exhaustion
- MEDIUM: Missing bytes message handling caused decode errors
- MEDIUM: Redundant connection state tracking
- LOW: Unused imports

Based on code review by simplify agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…c API

Extract protocol-level code from internal/ to pkg/ so it can be shared
by both the gateway server and future Go clients:

- pkg/aep/: AEP v1 protocol codec (NDJSON encode/decode, init handshake)
- pkg/jwt/: JWT token generation/validation (ES256-only)
- internal/aep/: backward-compat re-exports from pkg/aep
- internal/gateway/init.go: keeps gateway-internal types (worker.WorkerType)
- internal/security/jwt.go: simplified using pkg/jwt internally

This establishes the public pkg/ boundary documented in pkg/README.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- pkg/aep/codec.go: remove redundant init() block (nowFunc var is sufficient)
- pkg/jwt/jwt.go: replace hand-rolled uuid formatting with github.com/google/uuid
- internal/security/jwt.go: replace hand-rolled uuid with uuid.New(), remove math/rand fallback (violates security rules)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sts to pkg/

- Delete pkg/jwt/jwt.go: 100% dead code (0 imports in production)
- Delete pkg/aep/init.go: 100% dead code (0 imports)
- Slim internal/aep/codec.go: 82 lines → 18 lines, keep only 5 re-exports
  actually used by gateway/worker code (NewID, NewSessionID, EncodeJSON,
  DecodeLine, Encode)
- Move codec tests to pkg/aep/ where the actual implementation lives
- Fix Encode/EncodeChunk to set Timestamp if zero (avoids Validate failures)
- Fix nowFunc to default to real wall-clock time instead of 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the internal/aep facade (16 files re-exporting 5 symbols from pkg/aep)
with direct imports. Aligns with Go best practices:
- pkg/aep: shared AEP v1 protocol code (reusable by future Go clients)
- internal: gateway-specific implementations

Also removes empty pkg/jwt/ directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- examples: add gateway health and worker count checks to complete.ts
- examples: improve quickstart.ts with gateway ready check
- scripts: generate-test-token.ts with ES256 JWT token generation
- src/client.ts: add error handling for WebSocket connection failures
- src/types.ts: add type definitions for gateway responses
- docker-compose.yml: fix healthcheck endpoint and timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Major changes:
- Update module path from 'hotplex-worker' to 'github.com/hotplex/hotplex-worker'
  for proper Go module proxying and versioning support
- Fix critical bug in ensureDBDir: remove flawed sync.Once that ignored subsequent
  database paths (discovered by 3 parallel code review agents)
- Improve normalizePath: gracefully handle missing $HOME in test environments
- Refactor Makefile stop target: use GRACE_PERIOD variable instead of magic number
- Improve Docker healthcheck: use grep pattern instead of fragile exact string match
- Simplify .gitignore: remove excessive decorative comments (75 → 49 lines)
- Add comprehensive client SDK documentation for Python, TypeScript, Go, and Java

This commit includes import path updates across 54 files and improves code quality
based on multi-agent code review findings.

Tests: All short tests pass with race detection enabled
- Fix test-integration: remove extraneous dash before -timeout flag
- HotPlexClient: use WebSocketHttpHeaders API, apply try-with-resources
- InteractiveExample: wrap Scanner in try-with-resources for proper cleanup
- QuickStart: remove dead token generation call, clean up unused imports
- Event: remove unused JsonProperty import
- Add bridge parameter to newConn for session lifecycle management
- Call bridge.StartSession in performInit after session creation so
  worker starts before init_ack is sent (fixes worker never started)
- Remove redundant CREATED→RUNNING transition in performInit; session
  now stays CREATED until handleInput (prevents running→running error)
- handleInput: skip TransitionWithInput when state is RUNNING, only
  handle IDLE→RUNNING resume case
- Fix all newConn call sites to pass nil bridge in tests
- Add api_key query param fallback for browser WebSocket clients
- Remove misleading comment from bridge.go (si.AllowedTools is used at line 66)
- Add apiKeyQueryParam const for the browser WS query param fallback
StartSession now takes (botID, allowedTools) and calls CreateWithBot
internally. performInit delegates session creation to Bridge, removing
the redundant sm.CreateWithBot call and fixing the wasted DB write.

The worker now correctly receives AllowedTools from the init handshake
instead of always getting nil (since it previously used sm.Create(nil)).

Also updates:
- SessionManager interface: CreateWithBot replaces Create
- BridgeProvider interface in admin: updated StartSession signature
- All mock/test call sites updated
Replace concrete *Bridge field with SessionStarter interface.
nil starter is now a semantic no-op (test mode), not a degraded path.
Adds compile-time check: var _ SessionStarter = (*Bridge)(nil)
After Bridge.StartSession creates the session and worker, performInit
must fetch the session info before using it. Previously, si was nil
after successful StartSession, causing nil pointer dereference when
building init_ack (si.State access).

Root cause: StartSession was calling sm.CreateWithBot internally,
but performInit never fetched the resulting session object.

Error path:
  c.starter.StartSession(...) succeeds
  → si still points to pre-creation nil value
  → ack := BuildInitAck(..., si.State, ...) panics

Fix:
  si, err = handler.sm.Get(sessionID)
  → fetch the session that StartSession created
  → si.State now valid for init_ack

This bug only affects production mode (starter != nil). Test mode
(CreateWithBot directly) was already correct.

Related: S1049
- Add aggregateNumberedEnv to support ADMIN_TOKEN_1...N and API_KEY_1...N
- Fully document 8888 (Gateway) and 9999 (Admin) ports
- Refine technical terminology for professionalism (e.g., 机密 -> 安全凭据)
Fixed 5 critical errors discovered during strict review:

1. Type Error: SetupSession Return Type
   - Wrong: `(*worker.Worker, error)` (pointer to interface)
   - Fixed: `(worker.Worker, error)` (interface value)
   - Reason: Go interfaces should not be pointers

2. API Error: PriorityControl Usage
   - Wrong: `SendToSession(ctx, env, events.PriorityControl)`
   - Fixed: `env.Priority = events.PriorityControl; SendToSession(ctx, env)`
   - Reason: SendToSession doesn't accept priority parameter

3. Unused Variable: CreateWithBot result
   - Wrong: `si, err := b.sm.CreateWithBot(...)` (si unused)
   - Fixed: `_, err := b.sm.CreateWithBot(...)`
   - Reason: SessionInfo not needed in SetupSession

4. Sequence Diagram Error: AttachWorker order
   - Wrong: CreateWithBot → NewWorker → Start → AttachWorker
   - Fixed: CreateWithBot → NewWorker → AttachWorker → Start
   - Reason: Worker attached before start in actual code

5. Blocking Analysis Table Error
   - Same sequence error as #4, reordered rows

Impact:
- Type error would cause compilation failure
- API error would cause runtime panic
- Sequence errors would mislead implementers

All errors now fixed. Spec ready for implementation.

Related: S1056
Reorganize documentation structure:
- Move design specs from docs/superpowers/specs/ to docs/specs/
- Rename specs with descriptive names (e.g., 2026-03-30-foo → Foo-Design.md)
- Add YAML frontmatter to all spec documents with standardized metadata
- Update README to reflect new directory structure

Code improvements alongside reorganization:
- Fix Makefile start target to include -config flag
- Refactor nextjs-chat components into separate files
- Simplify ai-sdk-transport route-handler logic

This consolidates all specifications under a single directory with
consistent metadata for better discoverability and tracking.
Validation:
- Add ACPX-Validation-Report.md with 98% confidence validation results
- Test with acpx v0.4.0 CLI: JSON-RPC 2.0, streaming, tool calls, resume
- Verify protocol format 100%, initialization 100%, events 100%, tools 95%

Metadata updates:
- Worker-ACPX-Spec.md: status → review, progress → 30%, confidence → 98%
- Go-Client-Example-Design.md: status → implemented, progress → 100%
- specs/README.md: Update document states and reorganize categories

Tooling:
- Add validate-acpx-spec.sh script for automated validation
- Update scripts/README.md with validation script documentation

Refs: docs/specs/Worker-ACPX-Spec.md, docs/specs/ACPX-Validation-Report.md
…e support

Worker Specs:
- Add Worker-OpenCode-CLI-Spec.md - OpenCode CLI integration specification
- Add Worker-OpenCode-Server-Spec.md - OpenCode Server integration specification
- Both specs marked as implemented (100% progress)
- Update specs/README.md with new OpenCode worker entries

Implementation:
- Add SendUserMessage() to base.Conn for Claude Code's native format
- Update claudecode.Worker.Input() to use user message format
- Fallback to AEP envelope for mock connections in tests
- This aligns with Claude Code's actual stream-json input format

Technical Details:
- SendUserMessage sends {"type":"user","message":{"type":"user","content":[{"type":"text","text":"..."}]}}
- This is the correct format for Claude Code's stdin protocol
- Maintains backward compatibility with test mocks

Refs: docs/specs/Worker-OpenCode-CLI-Spec.md, docs/specs/Worker-OpenCode-Server-Spec.md
- Run gofmt -s -w . to fix all formatting issues
- Use embedded BaseWorker fields directly (QF1008)
- Remove unnecessary BaseWorker selector in claudecode worker

This fixes golangci-lint warnings without disabling checks.
@hrygo hrygo merged commit 258a50b into master Apr 4, 2026
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant