Skip to content

Fix/graphql client token refresh#8791

Open
Ke-vin-S wants to merge 7 commits intoapache:mainfrom
Ke-vin-S:fix/graphql-client-token-refresh
Open

Fix/graphql client token refresh#8791
Ke-vin-S wants to merge 7 commits intoapache:mainfrom
Ke-vin-S:fix/graphql-client-token-refresh

Conversation

@Ke-vin-S
Copy link
Copy Markdown
Contributor

@Ke-vin-S Ke-vin-S commented Mar 22, 2026

Summary

This PR fixes two issues in the GitHub GraphQL client related to stability and authentication during long-running pipelines:

  1. Prevent process crash on rate-limit polling errors

    • Replaces panic in GraphqlAsyncClient background goroutine with graceful error handling.
    • On failure, errors are logged and the system retries in the next polling cycle while retaining the last known rate limit.
  2. Enable automatic token refresh for GraphQL requests

    • Replaces oauth2.StaticTokenSource-based HTTP client with the underlying http.Client from ApiAsyncClient.
    • This allows GraphQL requests to reuse the RefreshRoundTripper, enabling automatic token refresh for GitHub App installation tokens.
  3. Refactor authentication into shared HTTP client

    • Extracts auth and transport setup into CreateAuthenticatedHttpClient.
    • Enables reuse across REST and GraphQL clients.
  4. Introduce GraphQL client factory

    • Adds CreateGraphqlClient using shared authenticated HTTP client.
    • Replaces manual client setup while preserving existing behavior.
  5. Support static and refreshable tokens

    • Adds StaticRoundTripper for PAT authentication.
    • Ensures correct handling for both token types.

Does this close any open issues?

Closes #8788


Screenshots

N/A


Other Information

Design decisions

  • Avoid panic in background goroutines
    Rate-limit polling is non-critical and should not crash the entire pipeline. Errors are now handled gracefully.

  • Separate handling of initial vs runtime failures

    • Initial rate-limit fetch failure → fallback to default (5000) to prevent deadlock.
    • Runtime failure → retain last known rateRemaining to preserve correctness.
  • Use eventual consistency for rate limiting
    Rate-limit polling is best-effort; failures should not block or terminate execution.

  • Centralize transport layer for consistency
    Both GraphQL and REST clients now use a shared authenticated HTTP client, ensuring consistent authentication, proxy, and retry behavior.

  • Avoid StaticTokenSource
    Static tokens break long-running pipelines due to expiration. Using RefreshRoundTripper ensures seamless token rotation.

  • Handle static vs refreshable tokens explicitly
    Shared transport distinguishes between PAT (static) and installation (refreshable) tokens to avoid incorrect refresh/retry behavior.


…utine

Replace panic in GraphqlAsyncClient rate-limit polling goroutine with graceful error handling.

Previously, any error while fetching rate limit (e.g., transient network issues or 401 responses) would trigger a panic inside a background goroutine, crashing the entire DevLake process.

Now, errors are logged and the client retries in the next cycle while retaining the last known rate limit.

Design decisions:
- Avoid panic in background goroutines: rate-limit polling is non-critical and should not bring down the entire pipeline.
- Use last known rateRemaining on runtime failures instead of resetting or blocking, ensuring continued progress with eventual consistency.
- Retry via existing polling mechanism instead of immediate retry to prevent tight retry loops and unnecessary API pressure.
- Introduce a default fallback (5000) only for initial rate-limit fetch failures, since no prior state exists at startup.
- Separate handling of initial vs runtime failures:
  - Initial failure → fallback to default (5000)
  - Runtime failure → retain previous value

Fixes apache#8788 (bug 1)
…token refresh

Replace oauth2.StaticTokenSource-based HTTP client with the underlying http.Client from ApiAsyncClient.

Previously, the GraphQL client constructed its own HTTP client using StaticTokenSource, which froze the access token at task start time. This caused GitHub App installation tokens (which expire after ~1 hour) to become invalid during long-running pipelines, leading to persistent 401 errors.

Now, the GraphQL client reuses apiClient.GetClient(), which is already configured with RefreshRoundTripper and TokenProvider. This enables automatic token refresh on 401 responses, aligning GraphQL behavior with the REST client.

Design decisions:
- Reuse transport layer instead of duplicating authentication logic to ensure consistency across REST and GraphQL clients.
- Avoid StaticTokenSource, as it prevents token refresh and breaks long-running pipelines.
- Leverage existing RefreshRoundTripper for transparent token rotation without modifying GraphQL query logic.
- Keep protocol-specific logic (GraphQL vs REST) separate while sharing the underlying HTTP transport.

This ensures GraphQL pipelines using GitHub App authentication can run beyond token expiry without failure.

Fixes apache#8788 (bug 2)
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. component/framework This issue or PR relates to the framework pr-type/bug-fix This PR fixes a bug priority/high This issue is very important labels Mar 22, 2026
@Ke-vin-S Ke-vin-S marked this pull request as draft March 22, 2026 05:40
…lient

- moved token provider and refresh round tripper setup into a reusable helper
- introduced CreateAuthenticatedHttpClient to centralize auth + transport logic
- updated CreateApiClient to use shared http client instead of inline setup

Rationale:
- decouples authentication (transport layer) from REST-specific client logic
- enables reuse for GraphQL client without duplicating token refresh logic
- aligns architecture with separation of concerns (http transport vs api clients)
…ntegrate into task flow

- added CreateGraphqlClient to encapsulate graphql client construction
- reused CreateAuthenticatedHttpClient from github/tasks to inject token refresh via RoundTripper
- replaced manual graphql client setup in PrepareTaskData with new factory function
- preserved existing rate limit handling via getRateRemaining callback
- preserved query cost calculation using SetGetRateCost

Technical details:
- graphql client now uses http transport with TokenProvider and RefreshRoundTripper
- removes dependency on oauth2 client and avoids token expiration issues
- decouples graphql client from REST ApiClient by avoiding reuse of apiClient.GetClient()
- maintains compatibility with github.com and enterprise graphql endpoints

Note:
- shared auth logic remains in github/tasks and is imported with alias to avoid package name collision
- introduces cross-plugin dependency (github_graphql → github/tasks) as a pragmatic tradeoff to avoid duplication
…ents

add StaticRoundTripper for PAT authentication and use it in the shared http client.

since the same client is used by both REST and GraphQL, auth handling must distinguish
between refreshable tokens and static tokens. avoid applying refresh/retry logic to PAT.

ensures correct behavior across clients and prevents unnecessary retries for static auth.
@Ke-vin-S Ke-vin-S marked this pull request as ready for review March 22, 2026 18:47
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. component/plugins This issue or PR relates to plugins improvement pr-type/refactor This PR refactors existing features and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 22, 2026
getRateCost func(q interface{}) int
}

const defaultRateLimit = 5000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this constant come from? Could you add a source link for reference, or should we make it configurable if it varies?

Copy link
Copy Markdown
Contributor Author

@Ke-vin-S Ke-vin-S Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it falls back to 5000 (GitHub’s standard limit) when rate info isn’t available. I agree this shouldn’t be hardcoded—I'll make it configurable so it can adapt to different scenarios.

…e limit

Implement a layered fallback mechanism for GraphQL rate limiting:

1. Dynamic rate limit from provider (getRateRemaining)
2. Per-client override (WithFallbackRateLimit)
3. Config override (GRAPHQL_RATE_LIMIT)
4. Default fallback (1000)

Also moved GitHub-specific fallback (5000) via WithFallbackRateLimit
to the Graphql client.
@Ke-vin-S
Copy link
Copy Markdown
Contributor Author

@klesh I’ve pushed an update for this—could you please take a look? Happy to make changes if needed.

Copy link
Copy Markdown

@danielemoraschi danielemoraschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments for your review. Thanks!

) (*helper.GraphqlAsyncClient, errors.Error) {

// inject the shared auth layer
httpClient, err := githubTasks.CreateAuthenticatedHttpClient(taskCtx, connection, nil)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this with nil as the base client I believe will default to an empty http.Client. This ignores the proxy configuration (connection.GetProxy()) and abandons the HTTP/SOCKS5 Proxy logic associated with the connection. It might break enterprise proxy setups as the proxy configuration is never attached, while the previous implementation explicitly injected a custom proxy http.Transport in backend/plugins/github_graphql/impl/impl.go into the GraphQL HTTP client. This refactoring drops that setup for corporate proxy users.

connection.AuthMethod,
)

} else if connection.Token != "" {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connection.Token can contain multiple comma-separated tokens (e.g. tok1,tok2,tok3). The old code in impl.go split on comma and used only the first token (tokens[0]) via oauth2.StaticTokenSource. This passes connection.Token directly to StaticRoundTripper, which injects the entire unsplit string as Authorization: Bearer tok1,tok2,tok3. On the REST path, this also overwrites the correctly-rotated single token set by SetupAuthentication at the request level.

logger.Info("Persisted initial token for connection %d", connection.ID)
}
}
println("http client HIT3")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover development debug print statement deployed in production code.

}

// resolveRateLimit determines the rate limit for GraphQL requests using task configuration -> else default constant.
func resolveRateLimit(taskCtx plugin.TaskContext, logger log.Logger) int {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveRateLimit reads env config at line 69, then opts (including WithFallbackRateLimit(5000)) are applied at lines 82-84, unconditionally overwriting the env value. Wouldn't a user setting GRAPHQL_RATE_LIMIT=2000 be silently overridden by the code-level "fallback" of 5000 from graphql_client.go:69?

logger.Info(`github graphql rate limit are disabled, fallback to 5000req/hour`)
return 5000, nil, nil
logger.Info(`github graphql rate limit unavailable, using fallback rate limit`)
return 0, nil, errors.Default.New("rate limit unavailable")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code returned 5000, nil, nil for GitHub Enterprise with rate limiting disabled. This now returns an error instead of the previous 5000. The error propagates so the functionality is preserved, but every task run against such servers now produces a warning for what was previously a silently handled normal case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/framework This issue or PR relates to the framework component/plugins This issue or PR relates to plugins improvement pr-type/bug-fix This PR fixes a bug pr-type/refactor This PR refactors existing features priority/high This issue is very important size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug][github_graphql] GitHub App installation token not refreshed for GraphQL client, panic and pipeline crash on large repos

3 participants