From 4bd9688e9bc2224375e9986d29b88c60187dd524 Mon Sep 17 00:00:00 2001
From: Emi
Date: Mon, 23 Jun 2025 10:46:20 -0700
Subject: [PATCH 1/4] add Sourcegraph v6.5 AWS Bedrock latency documentation

Signed-off-by: Emi
---
 .../enterprise/completions-configuration.mdx | 100 ++++++++++++++++--
 1 file changed, 92 insertions(+), 8 deletions(-)

diff --git a/docs/cody/enterprise/completions-configuration.mdx b/docs/cody/enterprise/completions-configuration.mdx
index cdab972ab..5ee4e7372 100644
--- a/docs/cody/enterprise/completions-configuration.mdx
+++ b/docs/cody/enterprise/completions-configuration.mdx
@@ -91,16 +91,61 @@ For `accessToken`, you can either:

 - Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>` if directly configuring the credentials
 - Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>:<SESSION_TOKEN>` if a session token is also required

-
- We only recommend configuring AWS Bedrock to use an accessToken for
- authentication. Specifying no accessToken (e.g. to use [IAM roles for EC2 /
- instance role
- binding](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html))
- is not currently recommended (there is a known performance bug with this
- method which will prevent autocomplete from working correctly. (internal
- issue: PRIME-662)
+#### AWS Bedrock: Latency Optimization
+
+This feature is available in Sourcegraph v6.5+
+
+AWS Bedrock supports Latency Optimized Inference which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.
+
+To use Bedrock's latency optimized inference feature for a specific model with Cody, configure the `"latencyOptimization": "optimized"` setting under the `serverSideConfig` of any model in `modelOverrides`. For example:
+
+```json
+"modelOverrides": [
+  {
+    "modelRef": "aws-bedrock::v1::claude-3-5-haiku-latency-optimized",
+    "modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
+    "displayName": "Claude 3.5 Haiku (latency optimized)",
+    "capabilities": [
+      "chat",
+      "autocomplete"
+    ],
+    "category": "speed",
+    "status": "stable",
+    "contextWindow": {
+      "maxInputTokens": 200000,
+      "maxOutputTokens": 4096
+    },
+    "serverSideConfig": {
+      "type": "awsBedrock",
+      "latencyOptimization": "optimized"
+    }
+  },
+  {
+    "modelRef": "aws-bedrock::v1::claude-3-5-haiku",
+    "modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
+    "displayName": "Claude 3.5 Haiku",
+    "capabilities": [
+      "chat",
+      "autocomplete"
+    ],
+    "category": "speed",
+    "status": "stable",
+    "contextWindow": {
+      "maxInputTokens": 200000,
+      "maxOutputTokens": 4096
+    },
+    "serverSideConfig": {
+      "type": "awsBedrock",
+      "latencyOptimization": "standard"
+    }
+  }
+]
+```
+
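+Note that the `modelOverrides` array is not a standalone setting. If your instance manages models through a `modelConfiguration`-based site configuration, the overrides above sit next to your provider settings, roughly as in the sketch below (the surrounding fields are illustrative, and the comments stand in for whatever `aws-bedrock` provider configuration and model entries your instance already uses):
+
+```json
+{
+  "cody.enabled": true,
+  "modelConfiguration": {
+    "providerOverrides": [
+      // your existing "aws-bedrock" provider configuration (endpoint, region, credentials)
+    ],
+    "modelOverrides": [
+      // the two Claude 3.5 Haiku entries from the example above
+    ]
+  }
+}
+```
+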
+See also [Debugging: running a latency test](#debugging-running-a-latency-test).
+
 ### Example: Using GCP Vertex AI

 On [GCP Vertex](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude), we only support Anthropic Claude models.
@@ -194,3 +239,42 @@ To enable StarCoder, go to **Site admin > Site configuration** (`/site-admin/con
 ```

 Users of the Cody extensions will automatically pick up this change when connected to your Enterprise instance.
+
+## Debugging: running a latency test
+
+This feature is available in Sourcegraph v6.5+
+
+Site administrators can test completions latency by sending a special debug command in any Cody chat window (in the web, in the editor, etc.):
+
+```
+cody_debug:::{"latencytest": 100}
+```
+
+Cody will then perform `100` quick `Hello, please respond with a short message.` requests to the LLM model selected in the dropdown, and measure the time taken to get the first streaming event back (e.g. first token from the model.) It records all of these requests timing information, and then responds with a report indicating the latency between the Sourcegraph `frontend` container and the LLM API:
+
+```
+Starting latency test with 10 requests...
+
+Individual timings:
+
+[... how long each request took ...]
+
+Summary:
+
+* Requests: 10/10 successful
+* Average: 882ms
+* Minimum: 435ms
+* Maximum: 1.3s
+```
+
+This can be helpful to get a feel for the latency of particular models, or models with different configurations - such as when using the AWS Bedrock Latency Optimized Inference feature.
+
+Debug commands are only available to site administrators and have no effect when used by regular users.
+
+Sourcegraph's builtin Grafana monitoring also has a full `Completions` dashboard for monitoring LLM requests, performance, etc.

From 4d12b0d3db749f8da2c7a9410f36d881bf0eba19 Mon Sep 17 00:00:00 2001
From: Emi
Date: Mon, 23 Jun 2025 11:29:28 -0700
Subject: [PATCH 2/4] remove callout which is invalid

Signed-off-by: Emi
---
 docs/cody/enterprise/model-config-examples.mdx | 10 ----------
 public/llms.txt                                | 20 -------------------
 2 files changed, 30 deletions(-)

diff --git a/docs/cody/enterprise/model-config-examples.mdx b/docs/cody/enterprise/model-config-examples.mdx
index 99671fc94..a138bcc8a 100644
--- a/docs/cody/enterprise/model-config-examples.mdx
+++ b/docs/cody/enterprise/model-config-examples.mdx
@@ -792,14 +792,4 @@ Provisioned throughput for Amazon Bedrock models can be configured using the `"a
 ](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html#:~:text=HttpPutResponseHopLimit) instance metadata option to a higher value (e.g., 2) to ensure that the metadata service can be accessed from the frontend container running in the EC2 instance. See [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html) for instructions.

-
- We only recommend configuring AWS Bedrock to use an accessToken for
- authentication. Specifying no accessToken (e.g. to use [IAM roles for EC2 /
- instance role
- binding](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html))
- is not currently recommended. There is a known performance bug with this
- method which will prevent autocomplete from working correctly (internal
- issue: CORE-819)
-
-
diff --git a/public/llms.txt b/public/llms.txt
index 96fc55723..89c54c5e5 100644
--- a/public/llms.txt
+++ b/public/llms.txt
@@ -15668,16 +15668,6 @@ Provisioned throughput for Amazon Bedrock models can be configured using the `"a
 ](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html#:~:text=HttpPutResponseHopLimit) instance metadata option to a higher value (e.g., 2) to ensure that the metadata service can be accessed from the frontend container running in the EC2 instance. See [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html) for instructions.

-
- We only recommend configuring AWS Bedrock to use an accessToken for
- authentication. Specifying no accessToken (e.g. to use [IAM roles for EC2 /
- instance role
- binding](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html))
- is not currently recommended. There is a known performance bug with this
- method which will prevent autocomplete from working correctly (internal
- issue: CORE-819)
-
-
@@ -15897,16 +15887,6 @@ For `accessToken`, you can either:

 - Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>` if directly configuring the credentials
 - Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>:<SESSION_TOKEN>` if a session token is also required

-
- We only recommend configuring AWS Bedrock to use an accessToken for
- authentication. Specifying no accessToken (e.g. to use [IAM roles for EC2 /
- instance role
- binding](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html))
- is not currently recommended (there is a known performance bug with this
- method which will prevent autocomplete from working correctly. (internal
- issue: PRIME-662)
-
-
 ### Example: Using GCP Vertex AI

 On [GCP Vertex](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude), we only support Anthropic Claude models.

From f95352f2e7eaa0776851c68eb95fd710c4652184 Mon Sep 17 00:00:00 2001
From: Emi
Date: Mon, 23 Jun 2025 11:34:59 -0700
Subject: [PATCH 3/4] link AWS docs

Signed-off-by: Emi
---
 docs/cody/enterprise/completions-configuration.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/cody/enterprise/completions-configuration.mdx b/docs/cody/enterprise/completions-configuration.mdx
index 5ee4e7372..c30466a7c 100644
--- a/docs/cody/enterprise/completions-configuration.mdx
+++ b/docs/cody/enterprise/completions-configuration.mdx
@@ -97,7 +97,7 @@ For `accessToken`, you can either:

 This feature is available in Sourcegraph v6.5+

-AWS Bedrock supports Latency Optimized Inference which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.
+AWS Bedrock supports [Latency Optimized Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html) which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.

From 6f62424f404da81222a47c6d387fb5b71b661c10 Mon Sep 17 00:00:00 2001
From: Maedah Batool
Date: Mon, 23 Jun 2025 18:33:47 -0700
Subject: [PATCH 4/4] Add some tweaks

---
 .../enterprise/completions-configuration.mdx | 27 +++++++------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/docs/cody/enterprise/completions-configuration.mdx b/docs/cody/enterprise/completions-configuration.mdx
index c30466a7c..de561e1c1 100644
--- a/docs/cody/enterprise/completions-configuration.mdx
+++ b/docs/cody/enterprise/completions-configuration.mdx
@@ -91,11 +91,9 @@ For `accessToken`, you can either:

 - Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>` if directly configuring the credentials
 - Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>:<SESSION_TOKEN>` if a session token is also required

-#### AWS Bedrock: Latency Optimization
+#### AWS Bedrock: Latency optimization

-
-This feature is available in Sourcegraph v6.5+
-
+Latency optimization for AWS Bedrock is available in Sourcegraph v6.5 and later.

 AWS Bedrock supports [Latency Optimized Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html) which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.
@@ -240,21 +238,19 @@ To enable StarCoder, go to **Site admin > Site configuration** (`/site-admin/con
 ```

 Users of the Cody extensions will automatically pick up this change when connected to your Enterprise instance.
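+
+The number after `latencytest` is the request count, so the test size can be tuned. For a quicker (if noisier) check you might run a smaller batch, for example (the sample report below shows a 10-request run):
+
+```shell
+cody_debug:::{"latencytest": 10}
+```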
-## Debugging: running a latency test
+## Debugging: Running a latency test

-
-This feature is available in Sourcegraph v6.5+
-
+Debugging latency-optimized inference is supported in Sourcegraph v6.5 and later.

 Site administrators can test completions latency by sending a special debug command in any Cody chat window (in the web, in the editor, etc.):

-```
+```shell
 cody_debug:::{"latencytest": 100}
 ```
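
-Cody will then perform `100` quick `Hello, please respond with a short message.` requests to the LLM model selected in the dropdown, and measure the time taken to get the first streaming event back (e.g. first token from the model.) It records all of these requests timing information, and then responds with a report indicating the latency between the Sourcegraph `frontend` container and the LLM API:
+Cody will then perform `100` quick `Hello, please respond with a short message.` requests to the LLM model selected in the dropdown, and measure the time taken to get the first streaming event back (for example, the first token from the model). It records the timing information for all of these requests, and then responds with a report indicating the latency between the Sourcegraph `frontend` container and the LLM API:

-```
+```shell
 Starting latency test with 10 requests...

 Individual timings:

 [... how long each request took ...]

 Summary:

 * Requests: 10/10 successful
 * Average: 882ms
 * Minimum: 435ms
 * Maximum: 1.3s
 ```

 This can be helpful to get a feel for the latency of particular models, or models with different configurations - such as when using the AWS Bedrock Latency Optimized Inference feature.

-
-Debug commands are only available to site administrators and have no effect when used by regular users.
-
+A few important considerations:

-
-Sourcegraph's builtin Grafana monitoring also has a full `Completions` dashboard for monitoring LLM requests, performance, etc.
-
+- Debug commands are only available to site administrators and have no effect when used by regular users.
+- Sourcegraph's built-in Grafana monitoring also has a full `Completions` dashboard for monitoring LLM requests, performance, etc.
+
+For example, to compare the two Bedrock overrides defined above, you could run the same test once with `Claude 3.5 Haiku (latency optimized)` selected in the model dropdown and once with the standard `Claude 3.5 Haiku`, then compare the reported averages (the request count of `50` here is arbitrary):
+
+```shell
+# With "Claude 3.5 Haiku (latency optimized)" selected in the chat model dropdown:
+cody_debug:::{"latencytest": 50}
+
+# Then switch the dropdown to "Claude 3.5 Haiku" and repeat:
+cody_debug:::{"latencytest": 50}
+```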