Skip to content

CXP-116 Containerization and refactor of caches#101

Open
JavierCarnelli-ConductorOne wants to merge 15 commits intomainfrom
jallers/cxp-116-containerization
Open

CXP-116 Containerization and refactor of caches#101
JavierCarnelli-ConductorOne wants to merge 15 commits intomainfrom
jallers/cxp-116-containerization

Conversation

@JavierCarnelli-ConductorOne
Copy link
Copy Markdown

@JavierCarnelli-ConductorOne JavierCarnelli-ConductorOne commented Feb 9, 2026

Updates the connector for it to be able to containerize.

This applies changes to the structure, the workflows and refactors the use of caches on the connector. Also implements ResourceSyncerV2.
Will require extensive testing and validations.

Summary by CodeRabbit

  • Refactor
    • Migrated connectors to a unified SyncOpAttrs/SyncOpResults pagination model and updated connector initialization/signatures.
    • Removed legacy external-ID validation and related configuration constants.
  • Performance
    • Added session-backed caching and stateful, throttle-aware paging to reduce repeated API calls and improve sync resilience.
  • Tests
    • Updated tests to exercise the new sync/pagination and session-caching behavior.

@linear
Copy link
Copy Markdown

linear Bot commented Feb 9, 2026

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces legacy pagination/token+annotations APIs with SyncOpAttrs/SyncOpResults across connector resource types, moves external-id/config validation into the connector, and adds session-backed caching and paginated state encoding for AWS SSO permission-set lists, details, grants, and entitlements.

Changes

Cohort / File(s) Summary
Config surface & tests
pkg/config/config.go, pkg/config/config_test.go
Removed local external ID length constants and config validation helpers; tests switched to package config and now reference public config/connector validation (connector.ValidateExternalID).
Connector bootstrap & validation
pkg/connector/connector.go
New constructor signature returning ConnectorBuilderV2 + opts; added exported ValidateExternalID and internal validateConfig; integrated pre-build validation and migrated ResourceSyncers to V2.
Session-backed SSO permission-set caching & paging
pkg/connector/account.go
Introduced session-store cache keys, page-state encode/decode, throttle-aware error wrapping, per-account/per-permission-set caches, and helpers to fetch/cache permission sets and paged grants/entitlements; replaced prior in-memory mutex/map caches.
Pagination API migration (resource types)
pkg/connector/account.go, pkg/connector/account_iam.go, pkg/connector/iam_group.go, pkg/connector/iam_user.go, pkg/connector/role.go, pkg/connector/secrets.go, pkg/connector/sso_group.go, pkg/connector/sso_user.go
Migrated List/Entitlements/Grants signatures from pagination.Token + (string, annotations.Annotations) to resourceSdk.SyncOpAttrs input and *resourceSdk.SyncOpResults output; moved token handling to opts.PageToken.Token and updated all error/return shapes.
Grants/Entitlements multi-phase paging
pkg/connector/account.go, pkg/connector/account_iam.go
Reworked grants/entitlements to be session-aware multi-phase flows: collect permission-set IDs, fetch details (session/API), fetch per-permission-set paged grants, and assemble results with SyncOpResults-based continuation.
Tests & interface updates
pkg/connector/sso_group_test.go, pkg/connector/..., pkg/config/config_test.go
Tests updated to use resourceSdk.SyncOpAttrs/SyncOpResults; interface implementations updated (e.g., IAM user account manager → V2).
Imports & cleanup
pkg/connector/*.go, pkg/config/*
Removed unused imports (annotations, sync, context in config.go), reordered/added imports for sessions and resource types, and updated log/error prefixes.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Connector
    participant Session as sessions.SessionStore
    participant SSO as AWS SSO Admin API
    rect rgba(200,200,255,0.5)
    Client->>Session: GetJSON("permission-sets")
    alt cached list present
        Session-->>Client: permissionSetIDs
    else cache miss
        Session-->>Client: nil
        Client->>SSO: ListPermissionSets()
        SSO-->>Client: permissionSetIDs
        Client->>Session: SetJSON("permission-sets", permissionSetIDs)
    end
    Client->>Session: GetManyJSON(["permission-set:<id>...", ...])
    alt some details missing
        Client->>SSO: DescribePermissionSet(id)
        SSO-->>Client: permissionSetDetail
        Client->>Session: SetJSON("permission-set:<id>", permissionSetDetail)
    end
    Client-->>Client: build entitlements/grants (encode page state if truncated)
    Client-->>Client: return []*Entitlement / []*Grant + SyncOpResults
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through tokens, caches, and keys,

Swapped old returns for SyncOp ease,
I moved validation close to the start,
Cached SSO sets to spare the API's heart,
A rabbit's patch — tidy, nimble, and smart.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'CXP-116 Containerization and refactor of caches' accurately reflects the main objectives: containerization updates, cache refactoring, and session storage interactions mentioned in the PR description.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch jallers/cxp-116-containerization

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/config/config_test.go (1)

21-33: ⚠️ Potential issue | 🟡 Minor

Align test validation with single‑hop behavior.

The test currently requires an external ID whenever use-assume is set, but the production validator only enforces it when a global role ARN is configured. This makes the tests reject valid single‑hop configs.

Suggested fix
 func validateConfig(ctx context.Context, v *viper.Viper) error {
 	if v.GetBool(config.UseAssumeField.FieldName) {
-		err := connector.ValidateExternalID(v.GetString(config.ExternalIdField.FieldName))
-		if err != nil {
-			return err
-		}
-		err = connector.IsValidRoleARN(v.GetString(config.RoleArnField.FieldName))
-		if err != nil {
-			return err
-		}
+		if err := connector.IsValidRoleARN(v.GetString(config.RoleArnField.FieldName)); err != nil {
+			return err
+		}
+		if v.GetString(config.GlobalRoleArnField.FieldName) != "" {
+			if err := connector.ValidateExternalID(v.GetString(config.ExternalIdField.FieldName)); err != nil {
+				return err
+			}
+		}
 	}
 	return nil
 }
pkg/connector/iam_user.go (1)

233-255: ⚠️ Potential issue | 🟡 Minor

Bug: Inverted error-check logic in CreateAccount.

The condition if result, err := o.iamClient.GetUser(...); err == nil means the user was found successfully. Inside this block, checking errors.As(err, &noSuchEntity) will always be false since err is nil. The current logic returns the existing user as a success regardless, which may be intentional for idempotency, but the dead NoSuchEntityException check is misleading.

If the intent is to return an error when GetUser fails (for reasons other than "not found"), the logic should be inverted:

Suggested fix
-	if result, err := o.iamClient.GetUser(ctx, &iam.GetUserInput{UserName: awsSdk.String(username)}); err == nil {
-		var noSuchEntity *iamTypes.NoSuchEntityException
-		if errors.As(err, &noSuchEntity) {
-			return nil, nil, nil, fmt.Errorf("aws-connector: iam.GetUser failed: %w", err)
-		}
+	result, err := o.iamClient.GetUser(ctx, &iam.GetUserInput{UserName: awsSdk.String(username)})
+	if err != nil {
+		var noSuchEntity *iamTypes.NoSuchEntityException
+		if !errors.As(err, &noSuchEntity) {
+			return nil, nil, nil, fmt.Errorf("aws-connector: iam.GetUser failed: %w", err)
+		}
+		// User doesn't exist, proceed to create
+	} else {
+		// User already exists, return it
 		annos := &v2.V1Identifier{
 			Id: awsSdk.ToString(result.User.Arn),
 		}
🤖 Fix all issues with AI agents
In `@pkg/connector/account.go`:
- Around line 33-37: The comment above the cache key constants (the
"SessionStore cache keys" comment) is missing a terminating period; update that
comment to end with a period so the linter is satisfied. Locate the comment
above permissionSetsCacheKey and permissionSetDetailKeyPrefix in the file
(pkg/connector/account.go) and add the period to the end of the sentence.

In `@pkg/connector/connector.go`:
- Around line 36-38: The externalIDLengthMaximum constant
(externalIDLengthMaximum) is off-by-one compared to the ValidateExternalID logic
and error text; update externalIDLengthMaximum to 64 (to match
ValidateExternalID and the error message) or alternatively make
ValidateExternalID and its error message consistent with the constant—ensure the
symbols externalIDLengthMaximum, externalIDLengthMinimum and the
ValidateExternalID function all use the same max value so invalid IDs aren’t
allowed through.
🧹 Nitpick comments (1)
pkg/connector/account.go (1)

644-711: Consider logging cache errors for observability.

The caching implementation silently ignores errors from session.GetJSON, session.SetJSON, and session.SetManyJSON. While this is a common pattern for optional caching, logging these errors at debug level would help with troubleshooting cache issues in production.

Example: Add debug logging for cache errors
 func (o *accountResourceType) getPermissionSets(ctx context.Context, ss sessions.SessionStore) ([]*awsSsoAdminTypes.PermissionSet, error) {
+	l := ctxzap.Extract(ctx)
 	if ss != nil {
 		cached, found, err := session.GetJSON[[]*awsSsoAdminTypes.PermissionSet](ctx, ss, permissionSetsCacheKey)
-		if err == nil && found {
+		if err != nil {
+			l.Debug("aws-connector: failed to get permission sets from cache", zap.Error(err))
+		} else if found {
 			return cached, nil
 		}
 	}

Comment thread pkg/connector/account.go
Comment thread pkg/connector/connector.go
Comment thread cmd/baton-aws/main.go Outdated
Comment thread pkg/connector/account.go Outdated
Comment thread pkg/connector/account.go Outdated
Copy link
Copy Markdown
Contributor

@mateovespConductor mateovespConductor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the comments left by @agustin-conductor . I think these changes are necessary

Comment thread pkg/connector/account.go Outdated
Comment thread pkg/connector/account.go Outdated
Comment thread pkg/connector/account.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/connector/account.go (2)

293-293: Dead code: assignmentsInput.NextToken assignment is unused.

The AWS SDK paginator manages pagination internally. Setting assignmentsInput.NextToken manually has no effect since the paginator doesn't use this input after initialization. This line can be removed.

🧹 Proposed fix
 			}
 
-			assignmentsInput.NextToken = assignmentsResp.NextToken
 			if !assignmentsPaginator.HasMorePages() {
 				break
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/connector/account.go` at line 293, Remove the dead assignment to
assignmentsInput.NextToken (it is set from assignmentsResp.NextToken) since the
AWS SDK paginator manages pagination and ignores changes to the input after
initialization; locate the pagination loop where assignmentsInput and
assignmentsResp are used (the block that iterates with the paginator) and delete
the line setting assignmentsInput.NextToken so the code relies solely on the
paginator for advancing pages.

720-723: Inconsistent error handling for session operations.

In getPermissionSets, errors from session.GetManyJSON are silently ignored (line 722), while in Grants (lines 183-184), similar errors cause the function to return an error. The same inconsistency exists for SetManyJSON.

If session errors should be fatal in Grants, they should likely be consistent here. Alternatively, if ignoring cache errors is intentional (fall back to API), consider applying the same lenient approach in Grants for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/connector/account.go` around lines 720 - 723, getPermissionSets currently
swallows errors from session.GetManyJSON while Grants returns on similar session
errors, causing inconsistent behavior; update getPermissionSets (and mirror in
any SetManyJSON usage) to handle session errors consistently with Grants by
checking the error returned from session.GetManyJSON/SetManyJSON (when ss !=
nil) and either return the error up the stack (make it fatal like in Grants) or
instead log the error and continue (make Grants lenient) — pick one approach and
apply it consistently, referencing the session variable ss, the cachedDetails
variable, and the GetManyJSON/SetManyJSON calls so the change is applied in the
correct functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/connector/account.go`:
- Line 304: Fix the typo in the error message string by replacing "writting"
with "writing" in the fmt.Errorf call that returns "aws-connector: Session
Storage writting operation failed: %w" (found in pkg/connector/account.go);
update that message so it reads "aws-connector: Session Storage writing
operation failed: %w" to correct the spelling while preserving the error wrap.

---

Nitpick comments:
In `@pkg/connector/account.go`:
- Line 293: Remove the dead assignment to assignmentsInput.NextToken (it is set
from assignmentsResp.NextToken) since the AWS SDK paginator manages pagination
and ignores changes to the input after initialization; locate the pagination
loop where assignmentsInput and assignmentsResp are used (the block that
iterates with the paginator) and delete the line setting
assignmentsInput.NextToken so the code relies solely on the paginator for
advancing pages.
- Around line 720-723: getPermissionSets currently swallows errors from
session.GetManyJSON while Grants returns on similar session errors, causing
inconsistent behavior; update getPermissionSets (and mirror in any SetManyJSON
usage) to handle session errors consistently with Grants by checking the error
returned from session.GetManyJSON/SetManyJSON (when ss != nil) and either return
the error up the stack (make it fatal like in Grants) or instead log the error
and continue (make Grants lenient) — pick one approach and apply it
consistently, referencing the session variable ss, the cachedDetails variable,
and the GetManyJSON/SetManyJSON calls so the change is applied in the correct
functions.

Comment thread pkg/connector/account.go Outdated
Comment thread pkg/connector/account.go Outdated
Comment thread pkg/connector/account.go
Comment thread cmd/baton-aws/main.go
Comment thread pkg/connector/connector.go Outdated
Comment thread pkg/connector/connector.go Outdated
Copy link
Copy Markdown
Contributor

@agustin-conductor agustin-conductor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be able to process accounts entitlements more than 1 permission set at the time, other than that, LGTM

Comment thread pkg/connector/account.go
Comment thread pkg/connector/account.go
Comment thread pkg/connector/connector.go Outdated
Comment thread pkg/connector/connector.go Outdated
Comment thread pkg/connector/account.go Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 11, 2026

Claude finished @JavierCarnelli-ConductorOne's task in 2m 18s —— View job


PR Review: CXP-116 Containerization and Refactor of Caches

  • Read repository guidelines
  • Analyze PR changes (20 files, +842/-522)
  • Perform comprehensive code review
  • Post review findings

Overview

This PR makes three major changes:

  1. Containerization — Simplifies main.go to use config.RunConnector with connectorrunner.WithSessionStoreEnabled()
  2. ResourceSyncerV2 migration — Replaces pagination.Token + (string, annotations.Annotations) with SyncOpAttrs/SyncOpResults across all resource types
  3. Session-backed caching — Replaces in-memory mutex/map caches with sessions.SessionStore for permission set data, with per-permission-set pagination for throttle resilience

The overall architecture is solid. The getOrSetCache[T] generic helper, generic encodePageToken/decodePageToken, batched entitlements (entitlementsBatchSize = 25), and comprehensive wrapAWSError with throttle code detection are all well-implemented improvements that address prior review feedback. Below are the remaining issues I found.


Issues

1. Bug (pre-existing): Inverted error-check logic in CreateAccount

pkg/connector/iam_user.go:233-255

Inside the if result, err := o.iamClient.GetUser(...); err == nil block, err is guaranteed to be nil, so errors.As(err, &noSuchEntity) is always false — dead code. The current behavior is that if GetUser fails for any reason (permissions, network, user-not-found), it falls through to CreateUser. Non-"not found" errors are silently swallowed.

if result, err := o.iamClient.GetUser(ctx, &iam.GetUserInput{...}); err == nil {
    var noSuchEntity *iamTypes.NoSuchEntityException
    if errors.As(err, &noSuchEntity) {  // err is always nil here
        return nil, nil, nil, fmt.Errorf(...)
    }
    // ... returns existing user
}
// Falls through to CreateUser for ALL error cases

Fix this →

2. Test/production validation divergence

pkg/config/config_test.go:22-33

The test validateConfig always requires external-id when use-assume is set (line 24), but the production validateConfig in connector.go:206-221 only requires it for two-hop mode (when global-role-arn is set). Valid single-hop configs (IRSA → target role with --use-assume --role-arn <arn> but no --external-id) pass in production but fail the test (line 60-63 expects failure for "external ID missing").

The test should mirror the production logic:

if v.GetBool(config.UseAssumeField.FieldName) {
    err := connector.IsValidRoleARN(v.GetString(config.RoleArnField.FieldName))
    if err != nil { return err }
    if v.GetString(config.GlobalRoleArnField.FieldName) != "" {
        err = connector.ValidateExternalID(v.GetString(config.ExternalIdField.FieldName))
        if err != nil { return err }
    }
}

Fix this →

3. Missing ctx.Done() checks in pagination loops

pkg/connector/account.go:383-394, 406-414

Both getOrFetchPermissionSetIDs and getOrFetchAllPermissionSetIDs contain inner pagination loops that do not check ctx.Done(). If the context is cancelled (timeout or shutdown), the loop will continue iterating until the next API call happens to fail.

for {
    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    default:
    }
    psBindingsResp, err := psBindingPaginator.NextPage(ctx)
    // ...
}

4. secrets.go:73-77 — Error swallowed on ListAccessKeys (pre-existing)

pkg/connector/secrets.go:73-77

ListAccessKeys errors are logged then silently skipped with continue, violating the "never swallow errors" guideline. This can produce incomplete sync output with no indication that some secrets were missed.

if err != nil {
    logger.Error("Error listing access keys", zap.Error(err))
    continue  // WRONG — should return error
}

5. role.go:151-160 — Error swallowed on GetRole (pre-existing)

pkg/connector/role.go:154-159

The GetRole error is logged but returns nil, nil, nil, silently skipping grants for this role. While this may be intentional for resilience, it violates the error handling guidelines and could mask permission issues.


Minor / Nits

  • pkg/connector/connector.go:196: fieldLength <= 0 — since len() returns a non-negative int, this is equivalent to fieldLength == 0. Using == 0 is more idiomatic.

  • pkg/connector/account.go:463-466: The nextToken variable and if check could be simplified to return rv, awsSdk.ToString(assignmentsResp.NextToken), nil since awsSdk.ToString handles nil.

  • Several files (account_iam.go:36, sso_group.go:41, sso_user.go:34, secrets.go:38): bag.Unmarshal errors are returned without connector-prefix wrapping. Minor inconsistency with the wrapped errors elsewhere.


What Looks Good

  • main.go: Clean containerization setup with connectorrunner.WithSessionStoreEnabled() and WithDefaultCapabilitiesConnectorBuilderV2
  • helpers.go: wrapAWSError with comprehensive throttle code detection is well-designed. The comment documenting intentional error chain breakage (lines 336-340) addresses prior review feedback perfectly
  • account.go: getOrSetCache[T] generic helper eliminates prior duplication. Generic encodePageToken/decodePageToken pair is clean
  • account.go:41-45: entitlementsBatchSize = 25 with good comment explaining the tradeoff — addresses prior feedback about one-at-a-time processing
  • connector.go:36: TODO comment preserved on the off-by-one constant — good decision given the risk assessment
  • connector.go:224: Unused opts parameter correctly renamed to _
  • All builders: Consistent migration to SyncOpAttrs/SyncOpResults pattern
  • wrapAWSError applied across all builders: iam_user.go:62, iam_group.go:61,137, role.go:64, sso_group.go:60,165, sso_user.go:54, secrets.go:63 — addresses prior feedback about throttle wrapping only being in account.go
  • Error prefixes: All new code consistently uses baton-aws: prefix

Summary

The PR is in good shape after addressing multiple rounds of review feedback. The remaining issues are:

  • 1 real bug (pre-existing CreateAccount inverted error check)
  • 1 test/production divergence on validation logic
  • 1 missing context cancellation check in pagination loops
  • 2 pre-existing error swallowing issues in secrets.go and role.go

The pre-existing issues don't need to block this PR but should be tracked for follow-up.

Comment thread pkg/connector/connector.go Outdated
Comment thread pkg/connector/connector.go Outdated
Comment thread pkg/connector/connector.go
Comment thread pkg/config/config.go
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 12, 2026

Claude finished @JavierCarnelli-ConductorOne's task in 2m 19s —— View job


PR Review: CXP-116 Containerization and Refactor of Caches

  • Read repository guidelines
  • Analyze PR changes (20 files, +843/-522)
  • Perform comprehensive code review
  • Post review findings

Overview

This PR makes three major changes:

  1. Containerization - Simplifies main.go to use config.RunConnector with connectorrunner.WithSessionStoreEnabled() and WithDefaultCapabilitiesConnectorBuilderV2
  2. ResourceSyncerV2 migration - Replaces pagination.Token + (string, annotations.Annotations) patterns with SyncOpAttrs/SyncOpResults across all resource types
  3. Session-backed caching - Replaces in-memory mutex/map caches with sessions.SessionStore via a clean getOrSetCache[T] generic helper, and implements per-permission-set pagination for throttle resilience

The PR has addressed the majority of prior review feedback well. The generic encodePageToken/decodePageToken, getOrSetCache[T], entitlementsBatchSize = 25, wrapAWSError across all builders, and the TODO comment preservation are all solid improvements. Below are the remaining issues.


Issues

1. Bug (pre-existing): Inverted error-check logic in CreateAccount

pkg/connector/iam_user.go:233-255

Inside if result, err := o.iamClient.GetUser(...); err == nil, err is guaranteed to be nil, so errors.As(err, &noSuchEntity) on line 235 is always false — dead code. The current behavior: if GetUser succeeds, return existing user (idempotent); if GetUser fails for any reason (permissions, network, user-not-found), silently fall through to CreateUser. Non-"not found" errors are swallowed.

// Current (broken):
if result, err := o.iamClient.GetUser(ctx, &iam.GetUserInput{...}); err == nil {
    var noSuchEntity *iamTypes.NoSuchEntityException
    if errors.As(err, &noSuchEntity) {  // err is always nil here - dead code
        return nil, nil, nil, fmt.Errorf("baton-aws: iam.GetUser failed: %w", err)
    }
    // returns existing user...
}
// Falls through to CreateUser for ALL error cases

Fix this →

2. Test/production validation divergence

pkg/config/config_test.go:22-33

The test validateConfig always requires external-id when use-assume is set (line 24), but the production validateConfig in connector.go:206-221 only requires it for two-hop mode (when global-role-arn is set). This means:

  • Test case at line 59-63 ("--use-assume --role-arn <arn>" → expects failure for "external ID missing") would pass in production (single-hop mode doesn't need external-id)
  • Valid single-hop configs are incorrectly marked as failures in tests

The test should mirror the production logic by only requiring external-id when global-role-arn is also set.

Fix this →

3. Missing ctx.Done() checks in pagination loops

pkg/connector/account.go:383-394, 406-414

Both getOrFetchPermissionSetIDs and getOrFetchAllPermissionSetIDs contain inner pagination loops that don't check ctx.Done(). If the context is cancelled (timeout or shutdown), the loop will continue iterating until the next API call happens to fail.

for {
    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    default:
    }
    psBindingsResp, err := psBindingPaginator.NextPage(ctx)
    // ...
}

4. account_iam.go missing wrapAWSError and error wrapping

pkg/connector/account_iam.go:50-63

While wrapAWSError was applied to all other builders, account_iam.go still has unwrapped AWS calls:

  • Line 52: o.orgClient.ListAccounts — no wrapAWSError
  • Line 60: stsClient.GetCallerIdentity — no wrapAWSError
  • Line 62: raw err return without connector-prefix wrapping
identity, err := stsClient.GetCallerIdentity(ctx, &sts.GetCallerIdentityInput{})
if err != nil {
    return nil, nil, err  // No prefix, no wrapAWSError
}

5. secrets.go:73-77 — Error swallowed on ListAccessKeys (pre-existing)

pkg/connector/secrets.go:73-77

ListAccessKeys errors are logged then silently skipped with continue, violating the "never swallow errors" guideline. This can produce incomplete sync output with no indication that some secrets were missed. Per CLAUDE.md: "Never swallow errors."

if err != nil {
    logger.Error("Error listing access keys", zap.Error(err))
    continue  // Should return error
}

6. role.go:154-159 — Error swallowed on GetRole (pre-existing)

pkg/connector/role.go:154-159

GetRole error is logged but returns nil, nil, nil, silently skipping grants for this role. While this may be intentional for resilience, it can mask permission issues and produces incomplete sync output.


Minor / Nits

  • pkg/connector/connector.go:196: fieldLength <= 0 — since len() returns a non-negative int, this is equivalent to fieldLength == 0. Using == 0 is more idiomatic.

  • pkg/connector/account.go:463-466: The nextToken variable and if check could be simplified to return rv, awsSdk.ToString(assignmentsResp.NextToken), nil since awsSdk.ToString handles nil pointers by returning empty string.

  • pkg/connector/connector.go:262: Error prefix "aws connector:" is inconsistent with the standard "baton-aws:" prefix used everywhere else.


What Looks Good

  • main.go: Clean containerization with connectorrunner.WithSessionStoreEnabled() and WithDefaultCapabilitiesConnectorBuilderV2
  • helpers.go: wrapAWSError with comprehensive throttle code detection is well-designed; the comment documenting intentional error chain breakage (lines 336-340) addresses prior feedback perfectly
  • account.go: getOrSetCache[T] generic helper and generic encodePageToken/decodePageToken eliminate prior duplication cleanly
  • account.go:41-45: entitlementsBatchSize = 25 with good comment explaining the tradeoff addresses prior feedback about one-at-a-time processing
  • connector.go:36: TODO comment preserved on the off-by-one constant
  • connector.go:224: Unused opts parameter correctly renamed to _
  • All builders: Consistent migration to SyncOpAttrs/SyncOpResults and wrapAWSError applied across iam_user.go, iam_group.go, role.go, sso_group.go, sso_user.go, secrets.go
  • Error prefixes: All new code consistently uses baton-aws: prefix
  • connector.go:206-221: Production validateConfig correctly distinguishes single-hop vs two-hop mode

Summary

The PR is in good shape after addressing multiple rounds of review feedback. The remaining actionable issues are:

  • 1 real bug (pre-existing CreateAccount inverted error check) — worth fixing in this PR or a follow-up
  • 1 test/production validation divergence — should be fixed to avoid false test failures
  • 1 missing ctx.Done() check in pagination loops
  • 1 builder file (account_iam.go) missing wrapAWSError wrapping
  • 2 pre-existing error swallowing issues (secrets.go, role.go) — can be tracked separately

The pre-existing issues (1, 5, 6) don't need to block this PR but should be tracked for follow-up.


@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 16, 2026

Claude finished @JavierCarnelli-ConductorOne's task in 3m 22s —— View job


PR Review: CXP-116 Containerization and Refactor of Caches

  • Read repository guidelines
  • Analyze PR changes (21 files, +1072/-522)
  • Perform comprehensive code review
  • Post review findings

Overview

This PR makes three major changes:

  1. Containerization — Simplifies main.go to use config.RunConnector with connectorrunner.WithSessionStoreEnabled() and WithDefaultCapabilitiesConnectorBuilderV2
  2. ResourceSyncerV2 migration — Replaces pagination.Token + (string, annotations.Annotations) patterns with SyncOpAttrs/SyncOpResults across all resource types
  3. Session-backed caching — Replaces in-memory mutex/map caches with sessions.SessionStore via getOrSetCache[T] generic helper, with per-permission-set pagination for throttle resilience

The PR has addressed the vast majority of prior review feedback (7+ rounds). Generic encodePageToken/decodePageToken, getOrSetCache[T], entitlementsBatchSize = 25, wrapAWSError across all builders, TODO comment preservation, unused _ parameter, and error prefix standardization are all solid improvements.


Remaining Issues

1. Bug (pre-existing): Inverted error-check logic in CreateAccount

pkg/connector/iam_user.go:233-255

This has been flagged in every prior review and is still present. Inside if result, err := o.iamClient.GetUser(...); err == nil, err is guaranteed nil, so errors.As(err, &noSuchEntity) on line 235 is always false — dead code. Non-"not found" errors from GetUser (permissions, network) are silently swallowed and trigger an unnecessary CreateUser call.

// Current (broken):
if result, err := o.iamClient.GetUser(ctx, &iam.GetUserInput{...}); err == nil {
    var noSuchEntity *iamTypes.NoSuchEntityException
    if errors.As(err, &noSuchEntity) {  // err is always nil here - dead code
        return nil, nil, nil, fmt.Errorf("baton-aws: iam.GetUser failed: %w", err)
    }
    // returns existing user...
}
// Falls through to CreateUser for ALL error cases

Fix this →

2. Test/production validation divergence

pkg/config/config_test.go:22-33

Still present. The test validateConfig (line 24) always requires external-id when use-assume is set, but the production validateConfig in connector.go:206-221 only requires it for two-hop mode (when global-role-arn is set). This means test case at line 59-63 (--use-assume --role-arn <arn> → expects failure for "external ID missing") would pass in production since single-hop mode doesn't require external-id.

Fix this →

3. Missing ctx.Done() checks in pagination loops

pkg/connector/account.go:383-394, 406-417

Both getOrFetchPermissionSetIDs and getOrFetchAllPermissionSetIDs contain inner pagination loops that do not check ctx.Done(). If the context is cancelled (timeout/shutdown), the loop continues until the next API call happens to fail.

for {
    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    default:
    }
    psBindingsResp, err := psBindingPaginator.NextPage(ctx)
    // ...
}

4. account_iam.go missing wrapAWSError

pkg/connector/account_iam.go:50-62

While wrapAWSError was applied to all other builders, account_iam.go still has unwrapped AWS calls:

  • Line 50: o.orgClient.ListAccounts — error wrapped with fmt.Errorf but not wrapAWSError
  • Line 60: stsClient.GetCallerIdentity — raw error returned without prefix or wrapAWSError
identity, err := stsClient.GetCallerIdentity(ctx, &sts.GetCallerIdentityInput{})
if err != nil {
    return nil, nil, err  // No prefix, no wrapAWSError
}

5. secrets.go:73-77 — Error swallowed on ListAccessKeys (pre-existing)

pkg/connector/secrets.go:73-77

ListAccessKeys errors are logged then silently skipped with continue. Per CLAUDE.md: "Never swallow errors." This can produce incomplete sync output with no indication that secrets were missed.

6. connector.go:262 — Inconsistent error prefix

pkg/connector/connector.go:262

return nil, nil, fmt.Errorf("aws connector: config load failure: %w", err)

Uses "aws connector:" instead of the standard "baton-aws:" prefix used everywhere else.


Minor / Nits

  • pkg/connector/connector.go:196: fieldLength <= 0 — since len() returns a non-negative int, == 0 is more idiomatic.
  • pkg/connector/account.go:463-466: The nextToken variable and if check could be simplified to return rv, awsSdk.ToString(assignmentsResp.NextToken), nil since awsSdk.ToString handles nil.
  • pkg/connector/account_iam.go:36: bag.Unmarshal error returned without connector-prefix wrapping.

What Looks Good

  • main.go: Clean containerization with connectorrunner.WithSessionStoreEnabled() and WithDefaultCapabilitiesConnectorBuilderV2
  • helpers.go:316-354: wrapAWSError with comprehensive throttle code detection and well-documented comment about intentional error chain breakage
  • account.go:48-76: getOrSetCache[T] generic helper cleanly eliminates prior duplication
  • account.go:88-110: Generic encodePageToken/decodePageToken pair — clean and reusable
  • account.go:41-45: entitlementsBatchSize = 25 with good comment explaining the tradeoff — properly addresses prior feedback
  • connector.go:36: TODO comment preserved on the off-by-one constant
  • connector.go:224: Unused opts parameter correctly renamed to _
  • All builders: Consistent migration to SyncOpAttrs/SyncOpResults and wrapAWSError applied across iam_user.go, iam_group.go, role.go, sso_group.go, sso_user.go, secrets.go
  • Error prefixes: All new code consistently uses baton-aws: prefix (with the one exception noted above)
  • connector.go:206-221: Production validateConfig correctly distinguishes single-hop vs two-hop mode

Summary

The PR is in good shape after addressing 7+ rounds of review feedback. The remaining actionable items are:

# Severity Description Blocking?
1 Bug (pre-existing) CreateAccount inverted error check — dead code + silent error swallowing No (pre-existing)
2 Medium Test/production validation divergence on external-id Should fix
3 Low Missing ctx.Done() in pagination loops Nice to have
4 Low account_iam.go missing wrapAWSError Should fix for consistency
5 Low (pre-existing) secrets.go error swallowing No (pre-existing)
6 Nit One inconsistent error prefix Nice to have

Issues 1 and 5 are pre-existing and don't need to block this PR. Issues 2 and 4 are worth fixing in this PR or a fast follow-up.

johnallers and others added 15 commits March 19, 2026 13:02
Migrate to ConnectorBuilderV2 and ResourceSyncerV2 interfaces to enable
containerized Lambda deployment following the SDK v0.5.0+ pattern.

- Update main.go to use config.RunConnector() pattern
- Migrate all resource syncers to V2 interface signatures
- Move validation functions to connector package to resolve import cycle
- Add lambda support build tags to Makefile
- Remove outdated Dockerfile

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

refactor: converts in memory caches to session storage cached data

chore: adds final period to comment

refactor: optimize session storage with batch operations

Replace individual GetJSON/SetJSON calls with batch GetManyJSON/SetManyJSON
operations in the Grants() method to reduce session storage round-trips
from O(2N) to O(2) per account, improving Lambda performance.

- Refactor Grants() into 5 phases: collect IDs, batch cache read,
  fetch API misses, build grants, batch cache write
- Rename getPermissionSet to fetchPermissionSetFromAPI (pure API fetcher)
- Enable SessionStore in connector runner
- Add early return when no permission sets provisioned

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: handles Session Storage operation errors

refactor: use SDK pagination in Grants() for Lambda compatibility

Replace internal for loops with SDK pagination via NextPageToken to:
- Save progress per permission set, reducing blast radius on 429 errors
- Enable retries to resume from last successful page instead of restarting
- Cache permission set IDs per account across paginated calls

Also adds wrapAWSError() to convert AWS throttling errors to codes.Unavailable
for proper handling by baton-sdk retry mechanism.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

refactor: use SDK pagination in Entitlements() for Lambda compatibility

Apply the same pagination pattern from Grants() to Entitlements():
- Process one permission set per call instead of all at once
- Cache permission set IDs globally per sync
- Return NextPageToken to save progress between calls
- Remove unused getPermissionSets() function and permissionSetsCacheKey

This reduces blast radius on 429 errors - only current permission set's
work is lost instead of all progress.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…dling

Move awsThrottleErrorCodes and wrapAWSError from account.go to helpers.go
so the utility is available package-wide. Apply wrapAWSError to every AWS
API call error return in role.go, iam_group.go, iam_user.go, sso_user.go,
sso_group.go, and secrets.go so throttling errors from any builder surface
as codes.Unavailable for the baton-sdk sync engine to retry. Add a comment
documenting the intentional status.Error chain-breaking trade-off.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…enerics

Consolidates encodeGrantsPageToken, decodeGrantsPageToken,
encodeEntitlementsPageToken, and decodeEntitlementsPageToken into two
generic functions encodePageToken[T] and decodePageToken[T] that share
the identical JSON↔base64 logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@laurenleach laurenleach force-pushed the jallers/cxp-116-containerization branch from 828e4c7 to b7b5cc7 Compare March 19, 2026 20:05
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 19, 2026

Claude finished @laurenleach's task in 2m 23s —— View job


PR Review: CXP-116 Containerization and Refactor of Caches

  • Read repository guidelines
  • Analyze PR changes (21 files, +881/-525)
  • Perform comprehensive code review
  • Post review findings

Overview

This PR makes three major changes:

  1. Containerization — Simplifies main.go to use config.RunConnector with connectorrunner.WithSessionStoreEnabled() and WithDefaultCapabilitiesConnectorBuilderV2
  2. ResourceSyncerV2 migration — Replaces pagination.Token + (string, annotations.Annotations) with SyncOpAttrs/SyncOpResults across all resource types
  3. Session-backed caching — Replaces in-memory mutex/map caches with sessions.SessionStore via getOrSetCache[T] generic helper, with per-permission-set pagination for throttle resilience

The PR has addressed the vast majority of feedback from 7+ prior review rounds. The generic encodePageToken/decodePageToken, getOrSetCache[T], entitlementsBatchSize = 25, wrapAWSError across all builders, TODO comment preservation, unused _ parameter, and error prefix standardization are all solid improvements.


Remaining Issues

1. Bug (pre-existing): Inverted error-check logic in CreateAccount

pkg/connector/iam_user.go:233-255

This has been flagged in every prior review and is still present. Inside if result, err := o.iamClient.GetUser(...); err == nil, err is guaranteed nil, so errors.As(err, &noSuchEntity) on line 235 is always false — dead code. Non-"not found" errors from GetUser (permissions, network) are silently swallowed and trigger an unnecessary CreateUser call.

// Current (broken):
if result, err := o.iamClient.GetUser(ctx, &iam.GetUserInput{...}); err == nil {
    var noSuchEntity *iamTypes.NoSuchEntityException
    if errors.As(err, &noSuchEntity) {  // err is always nil here - dead code
        return nil, nil, nil, fmt.Errorf("baton-aws: iam.GetUser failed: %w", err)
    }
    // returns existing user...
}
// Falls through to CreateUser for ALL error cases

Fix this →

2. Test/production validation divergence

pkg/config/config_test.go:22-33

Still present. The test validateConfig (line 24) always requires external-id when use-assume is set, but the production validateConfig in connector.go:206-221 only requires it for two-hop mode (when global-role-arn is set). Test case at line 59-63 (--use-assume --role-arn <arn> → expects failure for "external ID missing") would pass in production since single-hop mode doesn't require external-id. The test should mirror the production logic.

Fix this →

3. account_iam.go missing wrapAWSError and error prefix

pkg/connector/account_iam.go:50-62

While wrapAWSError was applied to all other builders, account_iam.go still has unwrapped AWS calls:

  • Line 50: o.orgClient.ListAccounts — wrapped with fmt.Errorf but not wrapAWSError
  • Line 60: stsClient.GetCallerIdentity — raw error returned without prefix or wrapAWSError
identity, err := stsClient.GetCallerIdentity(ctx, &sts.GetCallerIdentityInput{})
if err != nil {
    return nil, nil, err  // No prefix, no wrapAWSError
}

Fix this →

4. connector.go:262 — Inconsistent error prefix

pkg/connector/connector.go:262

return nil, nil, fmt.Errorf("aws connector: config load failure: %w", err)

Uses "aws connector:" instead of the standard "baton-aws:" prefix used everywhere else.

5. secrets.go:73-77 — Error swallowed on ListAccessKeys (pre-existing)

pkg/connector/secrets.go:73-77

ListAccessKeys errors are logged then silently skipped with continue. Per CLAUDE.md: "Never swallow errors." This can produce incomplete sync output with no indication that some secrets were missed.

if err != nil {
    logger.Error("Error listing access keys", zap.Error(err))
    continue  // Should return error
}

Minor / Nits

  • pkg/connector/connector.go:196: fieldLength <= 0 — since len() returns a non-negative int, == 0 is more idiomatic.
  • pkg/connector/account.go:463-466: The nextToken variable and if check could be simplified to return rv, awsSdk.ToString(assignmentsResp.NextToken), nil since awsSdk.ToString handles nil pointers by returning empty string.
  • pkg/connector/account_iam.go:36: bag.Unmarshal error returned without connector-prefix wrapping.

What Looks Good

  • main.go: Clean containerization with connectorrunner.WithSessionStoreEnabled() and WithDefaultCapabilitiesConnectorBuilderV2
  • helpers.go:316-354: wrapAWSError with comprehensive throttle code detection and well-documented comment about intentional error chain breakage — addresses prior feedback perfectly
  • account.go:48-76: getOrSetCache[T] generic helper cleanly eliminates prior duplication
  • account.go:88-110: Generic encodePageToken/decodePageToken pair — clean and reusable
  • account.go:41-45: entitlementsBatchSize = 25 with good comment explaining the tradeoff — properly addresses prior feedback about one-at-a-time processing
  • connector.go:36: TODO comment preserved on the off-by-one constant
  • connector.go:224: Unused opts parameter correctly renamed to _
  • All builders: Consistent migration to SyncOpAttrs/SyncOpResults and wrapAWSError applied across iam_user.go, iam_group.go, role.go, sso_group.go, sso_user.go, secrets.go
  • Error prefixes: All new code consistently uses baton-aws: prefix (with the two exceptions noted above)
  • connector.go:206-221: Production validateConfig correctly distinguishes single-hop vs two-hop mode
  • account.go Grant/Revoke: Thorough account status verification, ConflictException handling, and async operation polling with AccessDeniedException fallbacks

Summary

The PR is in good shape after addressing 7+ rounds of review feedback. The remaining actionable items are:

# Severity Description Blocking?
1 Bug (pre-existing) CreateAccount inverted error check — dead code + silent error swallowing No (pre-existing)
2 Medium Test/production validation divergence on external-id Should fix
3 Low account_iam.go missing wrapAWSError wrapping Should fix for consistency
4 Nit One inconsistent error prefix in connector.go:262 Nice to have
5 Low (pre-existing) secrets.go error swallowing No (pre-existing)

Issues 1 and 5 are pre-existing and don't need to block this PR. Issues 2 and 3 are worth fixing in this PR or a fast follow-up.


@carolinaroncaglia
Copy link
Copy Markdown

carolinaroncaglia commented Apr 14, 2026

Documentation gaps found during containerization validation

The following items were surfaced during the CXP-116 QA plan review and have been addressed in a separate PR:

PR #111: docs: add --sync-sso-user-last-login to README, remove orphaned SCIM fields

#111

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants