[RHIDP-12470] Add API endpoints for dynamically registering and unregistering MCP Servers#1330
maysunfaisal wants to merge 2 commits into lightspeed-core:main
Conversation
Assisted-by: Claude Opus 4.6 Generated-by: Cursor Signed-off-by: Maysun J Faisal <maysunaneek@gmail.com>
Walkthrough

This change introduces dynamic MCP server management via three new FastAPI endpoints. It adds configuration support for tracking runtime-registered servers, new authorization actions, and comprehensive request/response models. The implementation integrates with existing authentication and Llama Stack client infrastructure.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Client
    participant Endpoint as MCP Endpoint
    participant Config as AppConfig
    participant LlamaStack as Llama Stack Client
    Client->>Endpoint: POST /mcp-servers (registration request)
    Endpoint->>Endpoint: Validate auth & request
    Endpoint->>Config: add_mcp_server()
    activate Config
    Config->>Config: Check duplicate & load status
    Config->>Config: Update internal state
    deactivate Config
    Endpoint->>LlamaStack: Register toolgroup
    alt Success
        LlamaStack-->>Endpoint: OK
        Endpoint-->>Client: 200 MCPServerRegistrationResponse
    else Llama Stack Error
        LlamaStack-->>Endpoint: Error
        Endpoint->>Config: remove_mcp_server() (rollback)
        activate Config
        Config->>Config: Restore state
        deactivate Config
        Endpoint-->>Client: 503 Service Unavailable
    else Duplicate Name
        Endpoint-->>Client: 409 Conflict
    end
```
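The rollback behavior in the diagram can be sketched in Python. This is an illustrative stand-in, not the actual implementation: `Config`, `register`, and the bare status-code returns below are assumptions that mirror the diagram's `add_mcp_server()`/`remove_mcp_server()` calls and its 200/409/503 branches.

```python
class DuplicateServerError(Exception):
    """Raised when a server name is already registered locally."""


class Config:
    """Minimal stand-in for AppConfig tracking dynamic MCP servers."""

    def __init__(self) -> None:
        self._servers: set[str] = set()

    def add_mcp_server(self, name: str) -> None:
        if name in self._servers:
            raise DuplicateServerError(name)
        self._servers.add(name)

    def remove_mcp_server(self, name: str) -> None:
        self._servers.discard(name)


def register(config: Config, register_toolgroup, name: str) -> int:
    """Mirror the diagram: add locally, register remotely, roll back on failure."""
    try:
        config.add_mcp_server(name)
    except DuplicateServerError:
        return 409  # "Duplicate Name" branch
    try:
        register_toolgroup(name)  # Llama Stack toolgroup registration
    except Exception:
        config.remove_mcp_server(name)  # rollback branch
        return 503
    return 200
```

The key invariant the review comments below circle around is visible here: the local state change happens first, so any remote failure must undo it before the error is surfaced.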
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 5
🧹 Nitpick comments (3)
tests/unit/app/endpoints/test_mcp_servers.py (1)
326-350: Assert rollback invariants after delete failure.

This test currently checks only the 503. It should also verify the dynamic server remains registered locally when `toolgroups.unregister` fails.

✅ Suggested test hardening

```diff
 @pytest.mark.asyncio
 async def test_delete_mcp_server_llama_stack_failure(
@@
-    _make_app_config(mocker, mock_configuration)
+    app_config = _make_app_config(mocker, mock_configuration)
@@
     with pytest.raises(HTTPException) as exc_info:
         await mcp_servers.delete_mcp_server_handler(
             request=mocker.Mock(), name="to-delete-fail", auth=MOCK_AUTH
         )
     assert exc_info.value.status_code == 503
+    assert app_config.is_dynamic_mcp_server("to-delete-fail")
+    assert any(s.name == "to-delete-fail" for s in app_config.mcp_servers)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `tests/unit/app/endpoints/test_mcp_servers.py` around lines 326-350: update `test_delete_mcp_server_llama_stack_failure` to assert rollback invariants after `toolgroups.unregister` raises `APIConnectionError`. In addition to asserting the `HTTPException` status_code == 503, verify that the dynamic server entry is still present in the local registration state (the same registry that `register_mcp_server_handler` updates), so the server remains registered locally when `client.toolgroups.unregister` fails.

src/models/requests.py (1)
837-842: Use `Optional[PositiveInt]` for `timeout` to follow model typing conventions.

The `timeout` field currently uses `Optional[int]` with a `gt=0` constraint, but the codebase guideline requires `PositiveInt` for positive integer fields in Pydantic models. Using `Optional[PositiveInt]` makes the constraint explicit at the type level, matching the pattern already established in `src/models/config.py` (e.g., line 553).

♻️ Proposed refactor

```diff
-from pydantic import BaseModel, Field, field_validator, model_validator
+from pydantic import BaseModel, Field, PositiveInt, field_validator, model_validator
 ...
-    timeout: Optional[int] = Field(
+    timeout: Optional[PositiveInt] = Field(
         default=None,
         description="Request timeout in seconds for the MCP server",
-        gt=0,
         examples=[30],
     )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/models/requests.py` around lines 837-842: the `timeout` field is typed as `Optional[int]` with `gt=0`; change the annotation to `Optional[PositiveInt]` so the positivity constraint is enforced at the type level, remove the now-redundant `gt=0` argument, and add the `PositiveInt` import if missing.

src/app/endpoints/mcp_servers.py (1)
57-68: Align the handler docstrings with the repo's Google-style format.

These docstrings include Returns/Raises, but they still omit Parameters, which makes the request/body/auth contract harder to scan in an API module. As per coding guidelines, "Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with Parameters, Returns, Raises, and Attributes sections."

Also applies to: 136-146, 186-198

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/app/endpoints/mcp_servers.py` around lines 57-68: convert the handler docstrings (the registration handler, plus the handlers around lines 136-146 and 186-198) to Google style: add a Parameters section describing the request body and auth context, a Returns section stating the return type (e.g. MCPServerRegistrationResponse) and content, and a Raises section listing the HTTPException cases (duplicate name, connection error, registration failure) with brief conditions.
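As an illustration of the requested format, a Google-style docstring for the registration handler could look like the sketch below. The signature and parameter names are assumptions inferred from the review context, not the actual code.

```python
from typing import Any


async def register_mcp_server_handler(request: Any, body: Any, auth: Any) -> Any:
    """Register an MCP server dynamically at runtime.

    Parameters:
        request: The incoming FastAPI request object.
        body: Parsed registration payload (name, url, provider_id, ...).
        auth: Resolved authentication/authorization context.

    Returns:
        MCPServerRegistrationResponse: Details of the newly registered server.

    Raises:
        HTTPException: 409 if the name is already registered, 503 if
            Llama Stack is unreachable, 500 on registration failure.
    """
```

Keeping all three sections in every handler lets consumers scan the request/auth contract without reading the body of the function.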
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3ae38ff2-4149-475c-8e9d-ea70eb28845e
📒 Files selected for processing (9)
- src/app/endpoints/mcp_servers.py
- src/app/routers.py
- src/configuration.py
- src/models/config.py
- src/models/requests.py
- src/models/responses.py
- tests/unit/app/endpoints/test_mcp_servers.py
- tests/unit/app/test_routers.py
- tests/unit/models/responses/test_error_responses.py
```python
mcp_server = ModelContextProtocolServer(
    name=body.name,
    url=body.url,
    provider_id=body.provider_id,
    authorization_headers=body.authorization_headers or {},
    headers=body.headers or [],
    timeout=body.timeout,
)
```
🧩 Analysis chain

🌐 Web query:

For llama-stack-client 0.5.2, what fields are supported inside `AsyncLlamaStackClient.toolgroups.register(..., mcp_endpoint=...)` for MCP endpoints? Does `mcp_endpoint` support headers, authorization headers, or timeout values in addition to the URI?
💡 Result:
In llama-stack-client 0.5.2, AsyncLlamaStackClient.toolgroups.register(..., mcp_endpoint=...) uses the URL schema, and the only supported field inside mcp_endpoint is:
`uri: str` (required)
This comes from the RegisterToolGroupRequest schema (mcp_endpoint: URL | null) and the URL schema definition ({ "uri": string }). [1][2]
So, no—in 0.5.2, mcp_endpoint itself does not support additional structured fields like headers, authorization/Authorization headers, or timeouts beyond the URI. [1][2]
If you need headers/auth, note that headers and authorization are supported elsewhere in the API (e.g., the OpenAI “Responses” MCP tool config includes headers and authorization), but that is not part of toolgroups.register(..., mcp_endpoint=...). [3]
Sources
[1] Llama Stack OpenAPI (v0.5.2): RegisterToolGroupRequest schema (shows mcp_endpoint is URL)
[2] Llama Stack OpenAPI (v0.5.2): URL schema (uri only)
[3] Llama Stack OpenAPI (v0.5.2): OpenAIResponseInputToolMCP schema (has headers, authorization)
Forward the request's connection settings to Llama Stack or reject them explicitly.
The handler accepts and stores authorization_headers, headers, and timeout (lines 74-81), but the call to toolgroups.register(...) (lines 91-95) only forwards the URI via mcp_endpoint={"uri": mcp_server.url}. In llama-stack-client 0.5.2, the mcp_endpoint parameter supports only the uri field; additional connection metadata is not propagated to the remote toolgroup. This means the endpoint returns 201 while potentially creating a toolgroup that cannot authenticate to or properly connect with the MCP server if those settings are required. Either pass these fields through to Llama Stack if a future version supports them, or reject the request with a clear error if authorization_headers, headers, or a non-default timeout are provided.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/endpoints/mcp_servers.py` around lines 74 - 81, The handler currently
constructs a ModelContextProtocolServer (mcp_server) with authorization_headers,
headers, and timeout but only passes mcp_endpoint={"uri": mcp_server.url} into
toolgroups.register, which can create an unusable toolgroup; update the endpoint
handler to either forward these connection fields to Llama Stack when supported
or explicitly reject requests that supply them—specifically, before calling
toolgroups.register check ModelContextProtocolServer.authorization_headers,
.headers, and .timeout against defaults and if any non-empty/non-default values
are present raise a clear HTTP error (or ValidationError) explaining that
authorization_headers, headers, and custom timeout are not supported by the
current llama-stack-client; reference ModelContextProtocolServer,
toolgroups.register, and mcp_endpoint in the change so the rejection/forwarding
logic is colocated with where mcp_server is created and toolgroups.register is
called.
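The explicit-rejection option described above could be sketched as a pre-flight check before `toolgroups.register` is called. This is an illustrative helper, not the actual code; the function name and the idea of returning field names for a 422 detail are assumptions.

```python
from typing import Optional


def check_unsupported_fields(
    authorization_headers: Optional[dict],
    headers: Optional[list],
    timeout: Optional[int],
) -> list[str]:
    """Collect connection fields that mcp_endpoint cannot forward.

    llama-stack-client 0.5.2 accepts only {"uri": ...} inside
    mcp_endpoint, so any of these fields would be silently dropped;
    the handler should reject the request instead.
    """
    unsupported = []
    if authorization_headers:
        unsupported.append("authorization_headers")
    if headers:
        unsupported.append("headers")
    if timeout is not None:
        unsupported.append("timeout")
    return unsupported
```

A handler would raise a clear HTTP error listing these field names before ever calling `toolgroups.register`, rather than returning 201 for a toolgroup that cannot authenticate.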
```python
        response = ServiceUnavailableResponse(backend_name="Llama Stack", cause=str(e))
        raise HTTPException(**response.model_dump()) from e
    except Exception as e:  # pylint: disable=broad-exception-caught
        configuration.remove_mcp_server(body.name)
        logger.error("Failed to register MCP toolgroup: %s", e)
        error_response = InternalServerErrorResponse(
            response="Failed to register MCP server",
            cause=str(e),
        )
```
Do not return raw upstream exception text to API clients.
These branches serialize str(e) into the public error payload. That can leak internal Llama Stack details and makes this API contract depend on upstream exception wording. Keep the full exception in logs, but return a fixed cause string or opaque error id to callers.
Also applies to: 223-225
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/endpoints/mcp_servers.py` around lines 99 - 107, The code currently
includes raw upstream exception text (str(e)) in API responses (see
ServiceUnavailableResponse and InternalServerErrorResponse in the except blocks
around configuration.remove_mcp_server and HTTPException raising); change these
to log the full exception via logger.exception or logger.error with exception
info, but replace the response cause with a fixed opaque string (e.g.
"upstream_error" or a generated error_id) or a generic message like "internal
server error" before raising HTTPException or returning the response; apply the
same change to the other occurrence referenced (lines ~223-225) so no raw
exception text is serialized to clients while full details remain in logs for
debugging.
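One way to implement the suggestion above — full details in logs, only an opaque token to clients — is a small helper like this sketch. The logger name and the correlation-id format are assumptions, not the repository's conventions.

```python
import logging
import uuid

logger = logging.getLogger("mcp_servers")


def opaque_error_cause(exc: Exception) -> str:
    """Log the full exception, return only a correlation id to the client.

    The returned string is safe to serialize into the public error
    payload; an operator can grep the id in the logs to recover the
    original traceback.
    """
    error_id = uuid.uuid4().hex[:12]
    logger.error("Upstream failure [%s]", error_id, exc_info=exc)
    return f"upstream_error:{error_id}"
```

The `cause=str(e)` call sites would then become `cause=opaque_error_cause(e)`, so the API contract no longer depends on upstream exception wording.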
```python
    except Exception as e:  # pylint: disable=broad-exception-caught
        logger.warning(
            "Llama Stack toolgroup unregister failed for '%s', "
            "proceeding with local removal: %s",
            name,
            e,
        )

    try:
        configuration.remove_mcp_server(name)
    except ValueError as e:
        logger.error("Failed to remove MCP server from configuration: %s", e)
        response = NotFoundResponse(resource="MCP server", resource_id=name)
        raise HTTPException(**response.model_dump()) from e
```
Do not report a successful delete after remote unregister fails.
The broad except Exception at Line 227 logs and continues, so the local config can be removed even though the Llama Stack toolgroup is still present. That leaves /v1/mcp-servers and the actual toolgroup state out of sync. Except for a verified remote "not found" case, this should fail the request and preserve the local registration so the delete can be retried safely.
💡 Example change

```diff
-    except Exception as e:  # pylint: disable=broad-exception-caught
-        logger.warning(
-            "Llama Stack toolgroup unregister failed for '%s', "
-            "proceeding with local removal: %s",
-            name,
-            e,
-        )
+    except Exception as e:  # pylint: disable=broad-exception-caught
+        logger.error("Failed to unregister MCP toolgroup from Llama Stack: %s", e)
+        error_response = InternalServerErrorResponse(
+            response="Failed to unregister MCP server",
+            cause="Toolgroup unregistration failed",
+        )
+        raise HTTPException(**error_response.model_dump()) from e
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/app/endpoints/mcp_servers.py` around lines 227 - 240, The code currently
swallows all exceptions from the Llama Stack toolgroup unregister and proceeds
to call configuration.remove_mcp_server(name); change this so that only a
verified remote "not found" outcome allows local removal. Specifically, replace
the broad except Exception around the remote unregister with logic that (1)
detects the remote-not-found case and continues to call
configuration.remove_mcp_server(name), and (2) for any other exception from the
Llama Stack toolgroup unregister, log the error and raise an HTTPException (do
not call configuration.remove_mcp_server). Reference the Llama Stack toolgroup
unregister call and configuration.remove_mcp_server(name) when making the change
so the local config is preserved unless the remote toolgroup is confirmed
missing.
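The gating logic the prompt describes can be reduced to a small sketch. The use of `KeyError` as the "remote not found" signal is an assumption for illustration; real code would inspect the llama-stack-client exception type or HTTP status.

```python
def delete_with_remote_check(unregister_remote, remove_local, name: str) -> int:
    """Remove locally only when the remote unregister succeeds or the
    toolgroup is confirmed missing remotely."""
    try:
        unregister_remote(name)
    except KeyError:
        # Remote toolgroup already gone: safe to drop local state too.
        pass
    except Exception:
        # Any other failure: keep the local registration so the delete
        # can be retried safely, and surface a 503 to the caller.
        return 503
    remove_local(name)
    return 200
```

This keeps `/v1/mcp-servers` and the actual toolgroup state in sync: local removal happens only after the remote side is known to be clean.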
```python
        self._conversation_cache: Optional[Cache] = None
        self._quota_limiters: list[QuotaLimiter] = []
        self._token_usage_history: Optional[TokenUsageHistory] = None
        self._dynamic_mcp_server_names: set[str] = set()
```
Dynamic MCP registry is process-local and will diverge across workers.
Line 67 stores dynamic registration state only in local memory. In multi-worker deployments, requests can hit different workers and observe different MCP server sets (e.g., register succeeds but list/delete on another worker returns inconsistent results).
Also applies to: 169-228
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/configuration.py` at line 67, The dynamic MCP registry is currently
stored in the in-memory set _dynamic_mcp_server_names (and other related code
around lines 169-228), which will diverge across workers; change the
implementation so the registry is kept in a shared, persistent store (e.g.,
Redis, database table, or central config service) instead of the process-local
set. Replace direct reads/writes of _dynamic_mcp_server_names with calls to new
helper methods (e.g., get_dynamic_mcp_servers(), add_dynamic_mcp_server(name),
remove_dynamic_mcp_server(name)) that operate on the shared backend and provide
atomic operations; update any code paths that reference
_dynamic_mcp_server_names to use these helpers and ensure worker-safe
concurrency (locks or atomic commands) and consistent
serialization/deserialization. Ensure migration/initialization code populates
the shared store from any existing local state and add tests verifying
consistency across simulated multiple-worker access.
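The helper methods the prompt calls for could be shaped like this sketch. The class, the key name, and the use of Redis-style `sadd`/`srem`/`smembers` set commands are all assumptions (those commands do exist on redis-py clients); `FakeStore` stands in for the shared backend so the sketch is self-contained.

```python
class SharedMcpRegistry:
    """Worker-safe dynamic MCP registry backed by a shared set store.

    `store` must provide atomic sadd/srem/smembers (e.g. a redis.Redis
    client); the set commands are atomic on the server, so concurrent
    workers see a consistent registry.
    """

    KEY = "dynamic_mcp_servers"  # assumed key name

    def __init__(self, store) -> None:
        self._store = store

    def add_dynamic_mcp_server(self, name: str) -> bool:
        return self._store.sadd(self.KEY, name) == 1

    def remove_dynamic_mcp_server(self, name: str) -> bool:
        return self._store.srem(self.KEY, name) == 1

    def get_dynamic_mcp_servers(self) -> set:
        return set(self._store.smembers(self.KEY))


class FakeStore:
    """In-memory stand-in exposing the set commands used above."""

    def __init__(self) -> None:
        self._sets: dict[str, set] = {}

    def sadd(self, key: str, member: str) -> int:
        s = self._sets.setdefault(key, set())
        if member in s:
            return 0
        s.add(member)
        return 1

    def srem(self, key: str, member: str) -> int:
        s = self._sets.get(key, set())
        if member in s:
            s.remove(member)
            return 1
        return 0

    def smembers(self, key: str) -> set:
        return set(self._sets.get(key, set()))
```

Because `sadd`/`srem` report whether the member changed, the duplicate-name (409) and not-found (404) checks fall out of the return values without a separate read-then-write race.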
Reset dynamic MCP registry on config re-initialization.

`init_from_dict()` clears other cached runtime state but not `_dynamic_mcp_server_names`. Stale names can survive reloads and incorrectly mark servers as API-managed.

🛠️ Proposed fix

```diff
 def init_from_dict(self, config_dict: dict[Any, Any]) -> None:
     """Initialize configuration from a dictionary.
@@
     # clear cached values when configuration changes
     self._conversation_cache = None
     self._quota_limiters = []
     self._token_usage_history = None
+    self._dynamic_mcp_server_names.clear()
     # now it is possible to re-read configuration
     self._configuration = Configuration(**config_dict)
```

🤖 Prompt for AI Agents
self._configuration = Configuration(**config_dict)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/configuration.py` at line 67, init_from_dict() currently clears cached
runtime state but doesn't reset the _dynamic_mcp_server_names set, allowing
stale server names to persist across reinitialization; update init_from_dict()
to clear or reinitialize self._dynamic_mcp_server_names (e.g. set it to an empty
set) as part of the re-initialization path so dynamic MCP registry state is
reset when configuration is reloaded.
Description
Currently if you want MCP Servers to be configured in Lightspeed Stack, you will have to do it via lightspeed-stack.yaml:
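A static configuration of that kind might look roughly like the sketch below. The field names mirror the `ModelContextProtocolServer` model reviewed above (name, url, provider_id), but the exact YAML schema and the `model-context-protocol` provider id are assumptions, not copied from the repository.

```yaml
mcp_servers:
  - name: mcp-integration-tools
    provider_id: model-context-protocol
    url: http://localhost:9000/sse
```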
However, there is a need for Lightspeed Stack clients to dynamically register and unregister MCP Servers. For example, Developer Lightspeed has a UI (think of Cursor MCP settings) where users can enter their MCP Server information. This lets users add and remove MCP Servers on the fly rather than configuring them statically via the lightspeed-stack.yaml file (see Feature RHDHPLAN-45).
Three new APIs:
- POST /v1/mcp-servers — register an MCP server
- GET /v1/mcp-servers — list registered MCP servers
- DELETE /v1/mcp-servers/{name} — unregister an MCP server
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
POST http://localhost:8080/v1/mcp-servers
Body
Response
LCS Logs
Verifying via Llama Stack toolgroups API
GET http://localhost:8080/v1/mcp-servers
Response
LCS Logs
DELETE http://localhost:8080/v1/mcp-servers/mcp-integration-tools-not-present
Response
LCS Logs
DELETE http://localhost:8080/v1/mcp-servers/mcp-integration-tools
Response
LCS Logs
Verifying via Llama Stack toolgroups API
DELETE http://localhost:8080/v1/mcp-servers/test-mcp-server
Response
LCS Logs
Summary by CodeRabbit
Release Notes
New Features
Tests