docs(router): metrics #7696
**Summary of Changes** (Gemini Code Assist): This pull request significantly enhances the documentation for OpenTelemetry metrics in the Hive Router by introducing dedicated UI components for presenting metric and label information clearly. It provides users with a comprehensive guide on how to configure, export, and interpret the various metrics available, improving observability and troubleshooting capabilities for router operations.
🚀 Snapshot Release

| Package | Version | Info |
|---|---|---|
| `@graphql-hive/cli` | `0.58.1-alpha-20260216151046-a7dc6de7cd0da593a8b7038d6ad153b0e21220a6` | npm ↗︎ unpkg ↗︎ |
| `hive` | `9.4.1-alpha-20260216151046-a7dc6de7cd0da593a8b7038d6ad153b0e21220a6` | npm ↗︎ unpkg ↗︎ |
🐋 This PR was built and pushed to the following Docker images: Targets: Platforms: Image Tag:
Code Review
This pull request introduces comprehensive documentation for OpenTelemetry metrics, complete with new React components for a structured and interactive presentation. The implementation is solid, and the new documentation page is well-organized. I have a few suggestions to enhance semantic correctness, accessibility, and component reusability, primarily concerning the use of an <a> tag for a copy-link action and refactoring the MetricsSection component for better flexibility.
💻 Website Preview

The latest changes are available as a preview at: https://pr-7696.hive-landing-page.pages.dev
This PR introduces a full metrics pipeline based on OpenTelemetry. It adds support for emitting histograms, counters and gauges for HTTP and GraphQL operations, exports them through OTLP or Prometheus, and allows fine-grained control over metrics.

- [Documentation PR](graphql-hive/console#7696)
- [Documentation preview](https://pr-7696.hive-landing-page.pages.dev/docs/router/observability/metrics)

**New dependencies**

The router now depends on the `opentelemetry-prometheus` and `prometheus` crates, and on `humantime` for duration parsing.

**Performance considerations**

Observed metrics are collected asynchronously and should not block request processing.

**`Metrics` struct**

The `Metrics` struct aggregates metrics for five domains: the HTTP server, the HTTP client, the GraphQL pipeline, the supergraph loader and the internal caches. Each sub-component exposes counters and histograms and can be disabled via configuration.

**Meter provider setup**

The `Telemetry` struct now holds separate `traces_provider` and `metrics_provider` fields and an optional `PrometheusRuntime`. During initialisation the router builds an OpenTelemetry meter provider based on the `telemetry.metrics.*` configuration and sets it as global via `opentelemetry::global::set_meter_provider`. If a Prometheus exporter is configured, a `PrometheusRuntime` is created, either attached to the existing HTTP server (reusing the router port) or detached on a dedicated port.

**Prometheus runtime**

The `PrometheusRuntime` enum encapsulates two modes:

- `Attached` - metrics are served from the same HTTP server on a configurable path (default `/metrics`).
- `Detached` - if a distinct port is configured, the runtime spawns a dedicated `ntex::HttpServer` with its own lifecycle. I limited this server to a **single worker**.

**Endpoint conflict detection**

Because metrics may live under `/metrics` or another path, the router now encapsulates paths in a `RouterPaths` struct.
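The two serving modes above can be sketched as a small enum. This is illustrative only: the variant names follow the PR, but the `from_config` constructor and its fields are assumptions, and the real router wires these modes into an `ntex` HTTP server rather than plain structs.

```rust
/// Illustrative model of the two Prometheus serving modes.
/// The real implementation attaches handlers to an ntex HTTP server;
/// this sketch only captures the configuration shape and the routing decision.
#[derive(Debug, PartialEq)]
enum PrometheusRuntime {
    /// Metrics served from the existing router HTTP server.
    Attached { path: String },
    /// Metrics served from a dedicated single-worker HTTP server.
    Detached { path: String, port: u16 },
}

impl PrometheusRuntime {
    /// Hypothetical constructor: no port (or the router's own port)
    /// means "reuse the router HTTP server".
    fn from_config(path: &str, port: Option<u16>, router_port: u16) -> Self {
        match port {
            Some(p) if p != router_port => PrometheusRuntime::Detached {
                path: path.to_string(),
                port: p,
            },
            _ => PrometheusRuntime::Attached {
                path: path.to_string(),
            },
        }
    }
}

fn main() {
    // No dedicated port configured: serve from the router itself.
    let attached = PrometheusRuntime::from_config("/metrics", None, 4000);
    // A distinct port spawns a dedicated server.
    let detached = PrometheusRuntime::from_config("/metrics", Some(6969), 4000);
    println!("{attached:?}");
    println!("{detached:?}");
}
```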
On start-up it verifies that the GraphQL, health, readiness and Prometheus endpoints do not collide, and returns a `RouterInitError::EndpointConflict` if they do.

**Request context storage**

A new `pipeline::request_extensions` module defines small structs stored in `HttpRequest` extensions. These hold the raw body size, the GraphQL operation name and type, and the final response status. It also exposes helper functions that different pipeline stages use to read and write this data.

## Configuration

Metrics are disabled by default. They are enabled when at least one exporter is present and marked as enabled. Each exporter entry has a `kind` field (e.g. `otlp` or `prometheus`) and optional settings such as `interval`, `temporality` and `protocol`. A Prometheus exporter can specify a `path` and a `port` ("empty" means reuse the router port).

The `telemetry.metrics.instrumentation` section allows tuning histogram aggregation. A common histogram configuration defines explicit buckets for `bytes` (request/response body sizes) and `seconds` (durations). The default buckets are covered in the docs. We define two bucket sets, one per unit, because their values and ranges are completely different. A per-instrument override lets you disable individual metrics or strip certain attributes.

## Noteworthy changes

- Caches are now stored in `CacheState`. This centralisation makes it easier to manage cache invalidation on supergraph reloads and to register metrics observers.
- Cache reads and writes are now standardised. I used moka's `.entry().or_try_insert_with()` API everywhere, to guarantee that concurrent calls for the same missing entry are coalesced into a single evaluation. We used to do `get().await` followed by `insert().await` in some cases; now we perform the minimum number of cache writes and maximise reuse of just-inserted entries.
- I also created an `EntryResultHitMissExt` trait that extends moka's `Entry` API, to give us insight into cache hits, misses and errors.
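The hit/miss classification can be sketched as an extension trait over the `Result` returned by `entry().or_try_insert_with()`. This is a minimal sketch, not the PR's actual code: the `Entry` stand-in mimics moka's `Entry::is_fresh()` (true when the value was just inserted, i.e. a miss), and the `outcome` method name and `CacheOutcome` enum are assumptions.

```rust
/// Outcome of a coalesced cache lookup, as one might record it for metrics.
#[derive(Debug, PartialEq)]
enum CacheOutcome {
    Hit,
    Miss,
    Error,
}

/// Stand-in for moka's `Entry`: `is_fresh()` returns true when the value
/// was just inserted by this call (a miss), false when an existing entry
/// was reused (a hit).
struct Entry {
    fresh: bool,
}

impl Entry {
    fn is_fresh(&self) -> bool {
        self.fresh
    }
}

/// Hypothetical shape of the `EntryResultHitMissExt` trait from the PR:
/// classify the result of `entry().or_try_insert_with()` for metrics.
trait EntryResultHitMissExt {
    fn outcome(&self) -> CacheOutcome;
}

impl<E> EntryResultHitMissExt for Result<Entry, E> {
    fn outcome(&self) -> CacheOutcome {
        match self {
            Ok(e) if e.is_fresh() => CacheOutcome::Miss,
            Ok(_) => CacheOutcome::Hit,
            Err(_) => CacheOutcome::Error,
        }
    }
}

fn main() {
    let hit: Result<Entry, ()> = Ok(Entry { fresh: false });
    let miss: Result<Entry, ()> = Ok(Entry { fresh: true });
    let err: Result<Entry, ()> = Err(());
    println!("{:?} {:?} {:?}", hit.outcome(), miss.outcome(), err.outcome());
}
```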
It is useful for metrics and spans that record cache hits and misses.

## Example config

```yaml
telemetry:
  metrics:
    exporters:
      - kind: prometheus # enable the Prometheus exporter
        enabled: true
        path: /metrics # override the default path
        port: 6969 # spins up a dedicated http server
      - kind: otlp
        endpoint: http://otel-collector:4318
        protocol: http # or grpc
        interval: 30s # push every 30s
        temporality: cumulative
    instrumentation:
      common:
        histogram:
          aggregation: explicit # or exponential
          bytes:
            buckets:
              [128, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072,
               262144, 524288, 1048576, 2097152, 3145728, 4194304, 5242880]
            record_min_max: false
          seconds:
            buckets: [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10]
            record_min_max: false
      instruments:
        http_server_request_duration_seconds:
          attributes:
            graphql.operation.name: false # drop operation name label, as it has high cardinality
```
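The endpoint conflict detection described above can be sketched as a uniqueness check over the configured paths. This is a minimal sketch assuming plain string paths: the field names of `RouterPaths` and the `check_conflicts` helper are illustrative, and the real router returns a `RouterInitError::EndpointConflict` variant rather than this standalone error type.

```rust
use std::collections::HashSet;

/// Illustrative version of the router's path registry. Field names
/// are assumptions; the real `RouterPaths` lives in the router crate.
struct RouterPaths {
    graphql: String,
    health: String,
    readiness: String,
    prometheus: Option<String>,
}

/// Stand-in for `RouterInitError::EndpointConflict`, carrying the
/// colliding path.
#[derive(Debug, PartialEq)]
struct EndpointConflict(String);

impl RouterPaths {
    /// Returns an error if any two endpoints share the same path.
    fn check_conflicts(&self) -> Result<(), EndpointConflict> {
        let mut seen = HashSet::new();
        let mut all = vec![&self.graphql, &self.health, &self.readiness];
        if let Some(p) = &self.prometheus {
            all.push(p);
        }
        for path in all {
            // `insert` returns false when the path was already present.
            if !seen.insert(path.as_str()) {
                return Err(EndpointConflict(path.clone()));
            }
        }
        Ok(())
    }
}

fn main() {
    let paths = RouterPaths {
        graphql: "/graphql".into(),
        health: "/health".into(),
        readiness: "/readiness".into(),
        prometheus: Some("/graphql".into()), // collides with the GraphQL endpoint
    };
    println!("{:?}", paths.check_conflicts());
}
```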
graphql-hive/router#770