feat(flight-sql): stabilize auth/session lifecycle, enable USE persistence, and harden Flight SQL runtime #17207

Open
CritasWang wants to merge 2 commits into ai-code/flight-sql from ai-code/wx_flight-sql

CritasWang commented Feb 14, 2026

Background

Compared with base branch ai-code/flight-sql, this branch addresses several Flight SQL stability and correctness issues:

  1. USE <database> context was not reliably preserved across client reconnections.
  2. Auth/session flow was fragile under direct executor + DataNode Netty runtime constraints.
  3. Query/session lifecycle had potential blocking/resource-leak risks.
  4. Integration coverage was missing for client isolation and invalid clientId security behavior.

What’s Changed

1) Auth and session model refactor

  • Replaced BasicCallHeaderAuthenticator + GeneratedBearerTokenAuthenticator with a custom CallHeaderAuthenticator (FlightSqlAuthHandler).
  • Removed FlightSqlAuthMiddleware; auth is now fully centralized in header auth flow.
  • Implemented a Bearer-first flow with Basic fallback; the Bearer token is written back explicitly so the client can reuse it on later calls.
  • FlightSqlSessionManager now:
    • validates credentials via AuthorityChecker.checkUser(),
    • uses dual caches:
      • tokenCache (token -> session)
      • clientSessionCache (client-key -> token),
    • builds client-key with username + '\0' + clientId to avoid key ambiguity,
    • validates clientId in fail-closed mode (non-empty values that fail the length or character-set check are rejected instead of falling back to a shared session),
    • enforces cache cap via maximumSize(1000).
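
The client-key and fail-closed rules above can be illustrated with a minimal sketch. The class and method names are hypothetical and not taken from the PR. The character rule (alphanumeric plus dash, at most 64 characters) and the SecurityException follow the commit message further down. Using the bare username as the key when clientId is missing is an assumption made for this example.

```java
import java.util.regex.Pattern;

/** Illustrative helper only; names are hypothetical, not the PR's actual code. */
final class ClientKeySketch {
  // Rule from the commit message: alphanumeric + dash only, max 64 chars.
  private static final Pattern VALID_CLIENT_ID = Pattern.compile("[A-Za-z0-9-]{1,64}");

  private ClientKeySketch() {}

  /** Returns null for a missing id, the id itself when valid, and throws otherwise. */
  static String validateClientId(String clientId) {
    if (clientId == null || clientId.isEmpty()) {
      return null; // missing id: caller keeps the username-scoped behavior
    }
    if (!VALID_CLIENT_ID.matcher(clientId).matches()) {
      // Fail closed: never silently downgrade to the shared session.
      throw new SecurityException("Invalid client id");
    }
    return clientId;
  }

  /** Joins username and clientId with a NUL character so the two parts cannot run together. */
  static String buildClientKey(String username, String clientId) {
    String validated = validateClientId(clientId);
    // Assumption: a missing clientId maps to the bare username.
    return validated == null ? username : username + '\0' + validated;
  }

  public static void main(String[] args) {
    // With plain concatenation both pairs would produce "abc"; with NUL they differ.
    System.out.println(buildClientKey("ab", "c").equals(buildClientKey("a", "bc")));
  }
}
```

Because the validation pattern excludes the NUL character, a clientId cannot contain the delimiter, which keeps the key unambiguous as long as usernames do not contain it either.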

2) USE <database> persistence and per-client isolation

  • Added support for x-flight-sql-client-id (in tests via client middleware injection).
  • Same clientId reuses the same session, preserving USE context across connections.
  • Different clientIds are isolated, preventing cross-client context leakage.
  • Added invalid clientId rejection test to avoid silent fallback to shared sessions.
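
As a rough sketch of how the two caches could give per-client USE persistence, the example below replaces the real caches with plain concurrent maps and omits credential checks, TTL and size limits. All names are hypothetical, and the clientId is assumed to be already validated and non-null.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the session manager; not the PR's actual implementation. */
final class SessionRegistrySketch {

  /** Minimal session state: only the database selected by USE. */
  static final class Session {
    final String username;
    volatile String currentDatabase; // stays null until a USE statement runs

    Session(String username) {
      this.username = username;
    }
  }

  // token -> session
  private final Map<String, Session> tokenCache = new ConcurrentHashMap<>();
  // client-key -> token
  private final Map<String, String> clientSessionCache = new ConcurrentHashMap<>();

  /** Called after credentials were verified; returns the Bearer token for this client. */
  String login(String username, String clientId) {
    String clientKey = username + '\0' + clientId;
    // The same client-key always resolves to the same token, hence the same session.
    return clientSessionCache.computeIfAbsent(clientKey, key -> {
      String token = UUID.randomUUID().toString();
      tokenCache.put(token, new Session(username));
      return token;
    });
  }

  /** Resolves a Bearer token to its session, or null if the token is unknown. */
  Session sessionForToken(String token) {
    return tokenCache.get(token);
  }

  public static void main(String[] args) {
    SessionRegistrySketch registry = new SessionRegistrySketch();
    registry.sessionForToken(registry.login("root", "client-a")).currentDatabase = "db1";
    // A second connection with the same clientId sees the earlier USE.
    System.out.println(registry.sessionForToken(registry.login("root", "client-a")).currentDatabase);
  }
}
```

A reconnecting client that sends the same clientId lands on the session that already holds its USE context, while a different clientId gets a fresh session with no database selected.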

3) Query execution and resource hardening (IoTDBFlightSqlProducer)

  • Migrated activeQueries from a ConcurrentHashMap to a TTL-based Caffeine cache that cleans up query state on eviction.
  • Added SQL validation (empty query / max length guard).
  • Non-query statements (USE/CREATE/INSERT/...) now return empty FlightInfo instead of failing query-stream logic.
  • Switched ticket encoding to Any.pack(TicketStatementQuery) for Flight SQL dispatch compatibility.
  • Unified cleanup paths for normal/exceptional stream completion.
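
The SQL validation guard might look roughly like the sketch below. The PR does not state the actual length limit or the exception type, so the limit is passed in as a parameter and IllegalArgumentException is used as a stand-in. The class name is hypothetical.

```java
/** Illustrative guard only; names, limit and exception type are assumptions. */
final class SqlGuardSketch {
  private SqlGuardSketch() {}

  /**
   * Rejects null, blank and over-long statements before they reach the planner.
   *
   * @param sql raw statement text from the client
   * @param maxLength maximum accepted length in characters (hypothetical parameter)
   * @return the trimmed statement
   */
  static String validate(String sql, int maxLength) {
    if (sql == null || sql.trim().isEmpty()) {
      throw new IllegalArgumentException("SQL query must not be empty");
    }
    if (sql.length() > maxLength) {
      throw new IllegalArgumentException(
          "SQL query exceeds maximum length of " + maxLength + " characters");
    }
    return sql.trim();
  }

  public static void main(String[] args) {
    System.out.println(validate("  SHOW DATABASES  ", 1024));
  }
}
```

Checking the length before any parsing keeps the cost of rejecting an oversized request small.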

4) Flight SQL service lifecycle/runtime improvements

  • Added lifecycle locking in FlightSqlService to prevent start/stop race/reentry issues.
  • Bounded allocator memory with new config support and JVM-aware upper bound.
  • Added gRPC transport hints (directExecutor + flow-control window tuning) to reduce end-of-stream mid-frame issues.
  • Added arrow_flight_sql_max_allocator_memory config loading in IoTDBConfig/IoTDBDescriptor.
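
A minimal sketch of the lifecycle locking and the JVM-aware allocator bound, under stated assumptions: the class and method names are hypothetical, the real service starts and stops a Flight server where this sketch only flips a flag, and capping at one quarter of the JVM maximum is an illustrative choice rather than the PR's rule.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only; not the PR's FlightSqlService. */
final class LifecycleSketch {
  private final ReentrantLock lifecycleLock = new ReentrantLock();
  private boolean running; // guarded by lifecycleLock

  /** Returns true if this call started the service, false if it was already running. */
  boolean start() {
    lifecycleLock.lock();
    try {
      if (running) {
        return false; // repeated start is a no-op
      }
      running = true; // real code would build and start the server here
      return true;
    } finally {
      lifecycleLock.unlock();
    }
  }

  /** Returns true if this call stopped the service, false if it was not running. */
  boolean stop() {
    lifecycleLock.lock();
    try {
      if (!running) {
        return false; // repeated stop is a no-op
      }
      running = false; // real code would close the server and allocator here
      return true;
    } finally {
      lifecycleLock.unlock();
    }
  }

  /** Clamps the configured allocator limit to a fraction of the JVM maximum memory. */
  static long boundAllocatorMemory(long configuredBytes, long jvmMaxBytes) {
    long jvmCap = jvmMaxBytes / 4; // hypothetical fraction
    if (configuredBytes <= 0) {
      return jvmCap; // unset or invalid config falls back to the JVM-derived cap
    }
    return Math.min(configuredBytes, jvmCap);
  }

  public static void main(String[] args) {
    long limit = boundAllocatorMemory(4L << 30, Runtime.getRuntime().maxMemory());
    System.out.println("allocator limit (bytes): " + limit);
  }
}
```

Holding one lock across both start and stop means a stop that races with a start either sees the fully started service or none at all.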

5) Dependency and packaging updates

  • Switched Flight SQL runtime memory backend to arrow-memory-unsafe; excluded conflicting/redundant Netty transitive deps.
  • Added flight-sql-jar-with-dependencies into integration-test assembly.
  • Updated information_schema.services expectation to include FLIGHT_SQL.
  • Added dependency management entry for arrow-memory-unsafe.

6) Test coverage expansion

Enhanced IoTDBArrowFlightSqlIT with:

  • testUseDbSessionPersistence
  • testUseDbWithFullyQualifiedFallback
  • testUseDbIsolationAcrossClients
  • testInvalidClientIdRejected
  • plus regression coverage for existing query paths.

Validation

  • Spotless checks passed:
    • mvn -pl external-service-impl/flight-sql -DskipTests spotless:check
    • mvn -pl integration-test -P with-integration-tests -DskipTests spotless:check
  • Flight SQL regression suite status on this branch: 5/5 passed.

Known Limitations / Follow-ups

  • Due to direct-executor constraints, the full SessionManager.login() path is not used yet.
    Password-expiration checks and login-lock behavior still need follow-up (async auth or execution model adjustments).
  • Requests without a clientId remain backward-compatible: sessions are shared per username.

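Background

Relative to the base branch ai-code/flight-sql, this branch mainly resolves several core Flight SQL problems in the DataNode environment:

  1. USE <database> context did not take effect reliably when connections were reused.
  2. The Arrow Flight auth/session model was unstable when the direct executor and Netty coexist.
  3. Some call chains carried blocking or resource-leak risks (query context, session cache, memory allocation).
  4. Integration test coverage was insufficient: multi-client isolation and invalid clientId security checks were missing.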

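Main Changes

1) Auth and session model refactor (Flight SQL)

  • Changed auth from the BasicCallHeaderAuthenticator + GeneratedBearerTokenAuthenticator pattern to a custom CallHeaderAuthenticator (FlightSqlAuthHandler).
  • Removed FlightSqlAuthMiddleware; Basic/Bearer is now handled uniformly by the header authenticator.
  • Supports a Bearer-first, Basic-fallback flow and writes the Bearer token back so the client can reuse it.
  • FlightSqlSessionManager now:
    • validates username and password via AuthorityChecker.checkUser(),
    • maintains dual caches:
      • tokenCache (token -> session)
      • clientSessionCache (client-key -> token),
    • builds client-key as username + '\0' + clientId to avoid ambiguous concatenation collisions,
    • validates clientId fail-closed: non-empty invalid values are rejected outright (length, character set),
    • adds a cache cap of maximumSize(1000) to prevent unbounded growth.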

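2) Fix USE <database> persistence and multi-client isolation

  • Introduced the x-flight-sql-client-id header on the client side (injected via middleware in the IT).
  • The same clientId reuses the same session, so USE context persists across connections.
  • Different clientIds get isolated sessions, avoiding cross-client database context interference.
  • Added an invalid clientId rejection test to prevent silent downgrade to a shared session.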

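3) Query execution and resource management hardening (IoTDBFlightSqlProducer)

  • Changed activeQueries from a ConcurrentHashMap to a TTL-based Caffeine cache with automatic cleanup on eviction.
  • Added SQL length and empty-SQL validation to improve robustness.
  • Non-query statements (such as USE/CREATE/INSERT) return an empty FlightInfo instead of being handled incorrectly as query streams.
  • Ticket construction now uses Any.pack(TicketStatementQuery) for compatibility with Flight SQL protocol dispatch.
  • Unified cleanup when streaming reads finish or fail, reducing query context leaks.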

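4) Flight SQL service startup and runtime parameter tuning

  • Added locking to the FlightSqlService lifecycle to prevent concurrency problems from repeated start/stop.
  • Set an allocator limit at startup (backed by a new config option), constrained by the memory available to the JVM.
  • Added gRPC Netty transport hints (directExecutor + flow control window) to reduce end-of-stream mid-frame issues.
  • Introduced config loading for arrow_flight_sql_max_allocator_memory (IoTDBConfig/IoTDBDescriptor).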

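5) Dependency and packaging adjustments

  • Switched the flight-sql module to arrow-memory-unsafe and excluded conflicting/duplicate Netty dependencies.
  • Added flight-sql-jar-with-dependencies to the integration-test assembly.
  • Added the FLIGHT_SQL service entry to the expected information_schema.services result.
  • Added a dependency management entry for arrow-memory-unsafe to the pom.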

6) Test coverage expansion

IoTDBArrowFlightSqlIT adds and strengthens the following scenarios:

  • testUseDbSessionPersistence
  • testUseDbWithFullyQualifiedFallback
  • testUseDbIsolationAcrossClients
  • testInvalidClientIdRejected
  • plus regression coverage for existing query scenarios (show/filter/aggregation/empty result)

Test Results

  • Code style checks passed:
    • mvn -pl external-service-impl/flight-sql -DskipTests spotless:check
    • mvn -pl integration-test -P with-integration-tests -DskipTests spotless:check
  • Flight SQL regression tests: 5/5 passed (completed on this branch).

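Known Limitations / Follow-up Plans

  • Due to direct executor constraints, the full SessionManager.login() path is not used yet;
    password-expiration checks and the login-lock policy still need to be added later through async auth or execution model adjustments.
  • When clientId is missing, the backward-compatible behavior is kept (sessions shared per username).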

…integration test

Fix the "end-of-stream mid-frame" HTTP/2 frame truncation error in the Flight SQL integration tests.

Root cause:
The gRPC default thread pool executor fails to properly handle subsequent
RPCs on the same HTTP/2 connection in the DataNode JVM environment, where
standalone Netty JARs on the classpath conflict with the grpc-netty bundled
in the fat jar.

Fix:
1. directExecutor(): run gRPC handlers in the Netty event loop thread,
   bypassing the default executor's thread scheduling issues (the key fix)
2. flowControlWindow(1MB): explicit HTTP/2 flow control prevents framing
   errors when duplicate Netty JARs coexist on the classpath
3. Exclude io.netty from fat jar POM: use standalone Netty JARs already
   on the DataNode classpath instead of bundling duplicates

Additional bug fixes:
- TsBlockToArrowConverter: fix NPE when getColumnNameIndexMap() returns
  null for SHOW DATABASES queries (fall back to the column index)
- FlightSqlAuthHandler: add null guards in authenticate() and
  appendToOutgoingHeaders() for CallHeaders with null internal maps
- FlightSqlAuthHandler: rewrite as CallHeaderAuthenticator with Bearer
  token reuse and Basic auth fallback
- FlightSqlSessionManager: add user token cache for session reuse
- IoTDBFlightSqlProducer: handle non-query statements (USE, CREATE, etc.)
  by returning empty FlightInfo, use TicketStatementQuery protobuf format

Test changes:
- Use fully qualified table names (database.table) instead of USE statement
  to keep each test to one GetFlightInfo + one DoGet RPC per connection
- All 5 integration tests pass: testShowDatabases, testQueryWithAllDataTypes,
  testQueryWithFilter, testQueryWithAggregation, testEmptyResult
CritasWang force-pushed the ai-code/wx_flight-sql branch from 22d7342 to 430f9c7 on February 24, 2026 03:37
…ning

- Add x-flight-sql-client-id header support for per-client USE database
  isolation via FlightSqlAuthHandler and ClientIdMiddlewareFactory
- Use \0 (null byte) delimiter in clientSessionCache key to prevent
  username/clientId collision attacks
- Validate clientId: alphanumeric + dash only, max 64 chars, fail-closed
  for non-empty invalid values (SecurityException)
- Add maximumSize(1000) to tokenCache and clientSessionCache to prevent
  resource exhaustion from arbitrary clientIds
- Remove LoginLockManager (userId=-1L caused cross-user lock collision;
  getUserId() is a blocking RPC and is incompatible with directExecutor())
- Remove unused flightClient field from IT
- Add directExecutor() + HTTP/2 flow control window tuning (1MB) on
  NettyServerBuilder to fix end-of-stream mid-frame errors
- Document all functional gaps vs SessionManager.login() (password
  expiration, login lock, checkUser cache-miss risk)

Tests (9/9 pass):
- 5 original Flight SQL query tests
- testUseDbSessionPersistence: USE context persists across connections
- testUseDbWithFullyQualifiedFallback: USE + qualified/unqualified queries
- testUseDbIsolationAcrossClients: Client B fails without USE context
- testInvalidClientIdRejected: non-empty invalid clientId rejected
CritasWang changed the title from "fix(flight-sql): resolve end-of-stream mid-frame error in Flight SQL integration test" to "feat(flight-sql): stabilize auth/session lifecycle, enable USE persistence, and harden Flight SQL runtime" on Feb 24, 2026