diff --git a/proposals/0071-Entity/01-context.md b/proposals/0071-Entity/01-context.md new file mode 100644 index 00000000..7a3ef710 --- /dev/null +++ b/proposals/0071-Entity/01-context.md @@ -0,0 +1,444 @@ +# Supporting Entities in Prometheus + +## Abstract + +This proposal introduces native support for **Entities** in Prometheus—a first-class concept representing **the "things" that produce telemetry**. + +A Kubernetes pod, a service instance, a physical host—these are not metrics themselves, but rather the *sources* of metrics. They have their own identity, lifecycle, and attributes that provide context for understanding the telemetry they produce. Today, Prometheus lacks a native way to represent these "things". While the ecosystem has developed conventions (like info metrics) to work around this gap, Prometheus itself doesn't understand what these conventions represent. + +**This proposal establishes Entities as a foundational concept in Prometheus.** An Entity represents a distinct object of interest in your infrastructure or application—something that has an identity, produces telemetry, and whose metadata helps you understand that telemetry. + +By making Entities first-class, this proposal enables Prometheus to support them consistently across all layers. Exposition formats gain semantics to declare entity information; SDKs provide clean abstractions for instrumenting entities; storage optimizes for entity metadata and relationships; the query language automatically correlates entity context with metrics; and alerting maintains stable alert identity as entity attributes change. + +This proposal also aligns with Prometheus's commitment to being the default store for OpenTelemetry metrics, which has a well-defined Entity model. Native Entity support enables seamless integration between OpenTelemetry's view of the world and Prometheus's. 
+ +--- + +## Terminology + +Before diving into the problem and proposed solution, let's establish a shared vocabulary: + +#### Info Metric + +A metric that exposes metadata about a monitored entity rather than a measurement. In current Prometheus convention, these are gauges with a constant value of `1` and labels containing the metadata. Examples include `node_uname_info`, `kube_pod_info`, and `target_info`. + +``` +build_info{version="1.2.3", revision="abc123", goversion="go1.21"} 1 +``` + +#### Entity + +An **Entity** represents a distinct object of interest that produces or is associated with telemetry. Unlike Info metrics, Entities are not metrics—they are first-class objects with their own identity, labels, and +lifecycle. + +Examples: a Kubernetes pod, a physical host, a service instance, a database table. + +Each entity has: +- A **type** (e.g., `k8s.pod`, `host`, `service`) +- **Identifying labels** that uniquely define it (immutable for the entity's lifetime) +- **Descriptive labels** that provide additional context (may change over time) +- **Lifecycle boundaries** (creation time, end time) + +In OpenTelemetry, an entity is an object of interest that produces telemetry data. This proposal adopts a compatible Entity concept as Prometheus's native representation for what was previously expressed only through info metric conventions. + +**The relationship:** Entities are the concept; info metrics are how they're serialized in the exposition format. + +#### Resource Attributes + +In OpenTelemetry, **resource attributes** are key-value pairs that describe the entity producing telemetry. These attributes are attached to all telemetry (metrics, logs, traces) from that entity. When OTel metrics are exported to Prometheus, resource attributes typically become labels on a `target_info` metric. + +#### Identifying Labels + +**Identifying labels** uniquely distinguish one entity from another of the same type. 
These labels: +- Must remain constant for the lifetime of the entity +- Together form a unique identifier for the entity +- Are required to identify which entity produced the telemetry + +Examples: +- `k8s.pod.uid` or (`k8s.pod.name`, `k8s.namespace.name`) for a Kubernetes pod +- `host.id` for a host +- `service.instance.id` for a service instance + +#### Descriptive Labels + +**Descriptive labels** provide additional context about an entity but do not serve to uniquely identify it. These labels: +- May change during the entity's lifetime +- Provide useful metadata for querying and visualization +- Are optional and supplementary + +Examples: +- `k8s.pod.label.app_name` (pod labels can change) +- `host.name` (hostnames can change) +- `service.version` (versions change with deployments) + +--- + +## Problem Statement + +### Prometheus Is Missing the Entity Concept + +Prometheus has a powerful data model for representing **metrics**: time series of numeric measurements identified by labels. But it lacks a native representation for the "things" that produce metrics. + +Consider a Kubernetes pod. It has an identity (namespace, UID), labels that describe it (name, node, status), a lifecycle (creation time, termination), and it produces telemetry (CPU usage, memory consumption, request counts). The pod is the *source* of metrics; it is conceptually distinct from the metrics it produces. + +Today, the Prometheus ecosystem uses **info metrics** to represent entity metadata: + +```promql +kube_pod_info{namespace="production", pod="api-server-7b9f5", uid="550e8400", node="worker-2"} 1 +``` + +Info metrics have served the community well as a **pragmatic convention** for representing entity information. They work, and thousands of dashboards and exporters rely on them. However, because Prometheus treats them as regular metrics rather than recognizing them as entity representations, several limitations emerge: + +1. 
**The value is a placeholder**: The `1` carries no information—it exists only because Prometheus's storage requires a numeric value for every series. +2. **Identity is conflated with description**: All labels are treated equally. There's no way to declare that `uid` uniquely identifies the pod while `node` is descriptive metadata that may change. +3. **Lifecycle is implicit**: When a pod is deleted and recreated, Prometheus sees label churn. There's no first-class representation of "this entity ended; a new one began." +4. **Correlation is manual**: To associate entity metadata with metrics, users must write complex `group_left` joins—reconstructing a relationship that should be understood by the system. + +What Prometheus needs is not a replacement for info metrics, but rather **recognition of Entities as a first-class concept**. Info metrics are already representing entities—this proposal gives Prometheus the semantics to understand what they represent. + +### Joining Info Metrics Requires `group_left` + +The most common use case for info metrics is attaching their labels to other metrics. For example, adding Kubernetes pod metadata to container CPU metrics: + +```promql +container_cpu_usage_seconds_total + * on(namespace, pod) group_left(node, created_by_kind, created_by_name) + kube_pod_info +``` + +This pattern has several problems: + +1. **Verbose**: Every query that needs pod metadata must include the full `group_left` clause. Dashboards with dozens of panels repeat this join logic everywhere. +2. **Error-Prone**: The `on()` clause must list exactly the right matching labels. Miss one, and the join fails silently or produces incorrect results. List too many, and you get "many-to-many matching not allowed" errors. +3. **Confusing Semantics**: The `group_left` modifier is one of the most confusing aspects of PromQL for new users. "Many-to-one matching" and "group modifiers" require significant mental overhead to understand and use correctly. +4. 
**Fragile to label changes**: If `kube_pod_info` adds a new label, existing queries may break. If a label is removed, dashboards silently lose data. There's no contract about which labels are stable identifiers vs. which are descriptive metadata. + +### No Distinction Between Identifying and Descriptive Labels + +Current info metrics treat all labels equally. There's no way to express that some labels are stable identifiers while others are mutable metadata: + +```promql +kube_pod_info{ + namespace="production", # Identifying: part of pod identity + pod="api-server-7b9f5", # Identifying: part of pod identity + uid="abc-123-def", # Identifying: globally unique + node="worker-2", # Descriptive: can change if rescheduled + created_by_kind="Deployment", # Descriptive: additional context + created_by_name="api-server" # Descriptive: additional context +} 1 +``` + +This lack of distinction causes problems: +- Queries cannot reliably join on "the identity" of an entity +- OTel Entities cannot be accurately translated (OTel's identifying vs descriptive attributes map to our identifying vs descriptive labels) + +### Storage and Lifecycle Are Not Optimized + +Info metrics are stored like any other time series, despite their unique characteristics: +- The value is always `1`—storing it repeatedly wastes space +- Metadata changes infrequently, but samples are scraped every interval +- Staleness handling treats info metrics like measurements, not metadata + +--- + +## Motivation + +### Prometheus's Commitment to OpenTelemetry + +In March 2024, Prometheus announced its commitment to being the default store for OpenTelemetry metrics. This includes: +- Native OTLP ingestion +- UTF-8 support for metric and label names +- Native support for resource attributes + +OpenTelemetry's data model distinguishes between **metric attributes** (dimensions on individual metrics) and **resource attributes** (properties of the entity producing metrics). 
Currently, Prometheus flattens resource attributes into `target_info` labels, losing the semantic distinction. + +Native Entity support is an important step toward proper resource attribute handling. + +### The Entity Model + +OpenTelemetry's Entity model provides a structured way to represent monitored objects: + +``` +Entity { + type: "k8s.pod" + identifying_attributes: { + "k8s.namespace.name": "production", + "k8s.pod.uid": "abc-123-def" + } + descriptive_attributes: { + "k8s.pod.name": "api-server-7b9f5", + "k8s.node.name": "worker-2", + "k8s.deployment.name": "api-server" + } +} +``` + +This model enables: +- Clear semantics about what identifies an entity +- Lifecycle management (entities can be created, updated, deleted) +- Correlation across telemetry signals (metrics, logs, traces) + +Prometheus can benefit from similar semantics. In this proposal, OTel's "identifying attributes" map to Prometheus identifying labels, and OTel's "descriptive attributes" map to descriptive labels. + +### Users Already Rely on Info Metrics + +Info metrics are a well-established pattern in the Prometheus ecosystem: + +| Metric | Source | Labels | +|--------|--------|--------| +| `node_uname_info` | Node Exporter | `nodename`, `release`, `version`, `machine`, `sysname` | +| `kube_pod_info` | kube-state-metrics | `namespace`, `pod`, `uid`, `node`, `created_by_*`, etc. | +| `kube_node_info` | kube-state-metrics | `node`, `kernel_version`, `os_image`, `container_runtime_version` | +| `target_info` | OTel SDK | All resource attributes | +| `build_info` | Various | `version`, `revision`, `branch`, `goversion` | + +These metrics are used in thousands of dashboards and alerts. Introducing native Entities improves the ergonomics and semantics while maintaining the utility users depend on. + +--- + +## Use Cases + +### Enriching Metrics with Producer Metadata + +A common need in observability is to enrich metrics with information about what produced them. 
When analyzing CPU usage, you often want to know which version of the software is running, what node a container is scheduled on, or what deployment owns a pod. This context transforms raw numbers into actionable insights. + +**The Problem:** + +Today, this requires complex `group_left` joins between metrics and info metrics: + +```promql +sum by (namespace, pod, node) ( + rate(container_cpu_usage_seconds_total{namespace="production"}[5m]) + * on(namespace, pod) group_left(node) + kube_pod_info +) +``` + +This pattern appears everywhere: adding `build_info` labels to application metrics, enriching host metrics with `node_uname_info`, correlating service metrics with `target_info` from OTel. Every query must: + +- Know which labels to match on (`namespace`, `pod`, `job`, `instance`, etc.) +- Explicitly list which metadata labels to bring in +- Handle edge cases when labels change (pod rescheduling, version upgrades) + + +Users should be able to say "give me this metric, enriched with information about its producer" without writing complex joins. The query engine should understand the relationship between metrics and the entities that produced them. + +With native Entity support, the query engine knows which labels identify an entity and which describe it. Enrichment becomes automatic or requires minimal syntax—no need to manually specify join keys or enumerate which labels to include. 
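
To make the idea concrete, here is a minimal Python sketch of what automatic enrichment could look like once identifying and descriptive labels are known. It is only an illustration: the `Entity` class and the `enrich` function are hypothetical names, labels are modelled as plain dictionaries, and the behaviour shown (attach the descriptive labels of every entity whose identifying labels all appear on the series, never overwriting a label the series already has) is an assumption rather than the query semantics defined elsewhere in this proposal.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Entity:
    """Hypothetical entity record: a type plus two kinds of labels."""
    type: str
    identifying: Dict[str, str]
    descriptive: Dict[str, str]


def enrich(series: Dict[str, str], entities: Iterable[Entity]) -> Dict[str, str]:
    """Return a copy of the series labels plus matching descriptive labels."""
    enriched = dict(series)
    for entity in entities:
        # The join key is the entity's identifying labels, so the user
        # does not have to spell it out as with on(...) group_left(...).
        if all(series.get(k) == v for k, v in entity.identifying.items()):
            for name, value in entity.descriptive.items():
                # Assumption: never overwrite a label the series already has.
                enriched.setdefault(name, value)
    return enriched


pod = Entity(
    type="k8s.pod",
    identifying={"namespace": "production", "pod": "api-server-7b9f5"},
    descriptive={"node": "worker-2", "created_by_kind": "Deployment"},
)
cpu = {
    "__name__": "container_cpu_usage_seconds_total",
    "namespace": "production",
    "pod": "api-server-7b9f5",
}

print(enrich(cpu, [pod])["node"])  # worker-2
```

The point of the sketch is that the caller passes only the series and the known entities; which labels to match on and which to bring in follow from the entity definition instead of being repeated in every query.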
+ +### OpenTelemetry Resource Translation + +**Current State:** + +When OTel metrics are exported to Prometheus, resource attributes become labels on `target_info`: + +```promql +target_info{ + job="otel-collector", + instance="collector-1:8888", + service_name="payment-service", + service_version="2.1.0", + service_instance_id="i-abc123", + deployment_environment="production", + host_name="prod-vm-42", + host_id="550e8400-e29b-41d4-a716-446655440000" +} 1 +``` + +To use these attributes with application metrics: + +```promql +http_request_duration_seconds_bucket + * on(job, instance) group_left(service_name, service_version, deployment_environment) + target_info +``` + +**Pain Points:** +- OTel distinguishes identifying vs. descriptive attributes; Prometheus loses this +- Entity lifecycle (creation, updates) is not represented +- Every query must know the OTel schema to write correct joins + +**Desired State:** + +Native translation of OTel Entities to Prometheus Entities, where OTel's identifying attributes (like `k8s_pod_uid`) become identifying labels, and OTel's descriptive attributes (like `k8s_pod_annotation_created_by`, `k8s_pod_status`) become descriptive labels. This would preserve the semantic richness of the OTel data model and enable better query ergonomics. + +### Collection Architectures: Direct Scraping vs. Gateways + +Prometheus deployments follow two main patterns for collecting metrics, and this proposal must support both. + +**Direct Scraping** + +In direct scraping, Prometheus discovers and scrapes each target individually. Service Discovery provides accurate metadata about each target, because the target *is* the entity producing metrics. 
+ +``` +┌─────────────┐ +│ Service A │◀────┐ +│ (pod-xyz) │ │ +└─────────────┘ │ + │ scrape ┌───────────┐ +┌─────────────┐ ├──────────▶│ │ +│ Service B │◀────┤ │Prometheus │ +│ (pod-abc) │ │ │ │ +└─────────────┘ │ └───────────┘ + │ +┌─────────────┐ │ +│ Service C │◀────┘ +│ (pod-def) │ +└─────────────┘ +``` + +Here, Kubernetes SD knows that `pod-xyz` runs Service A with specific labels, resource limits, and node placement. This metadata accurately describes the entity producing metrics—SD-derived entities work well. + +**Gateway and Federation** + +In gateway architectures, metrics flow through an intermediary before reaching Prometheus. The intermediary aggregates metrics from multiple sources. + +``` +┌───────────┐ ┌───────────┐ ┌───────────┐ +│ Service A │────▶│ │ │ │ +│ │push │ OTel │──────▶│Prometheus │ +├───────────┤ │ Collector │scrape │ │ +│ Service B │────▶│ │ │ │ +│ │ │(gateway) │ │ │ +├───────────┤ │ │ │ │ +│ Service C │────▶│ │ │ │ +└───────────┘ └───────────┘ └───────────┘ +``` + +Here, SD only sees the OTel Collector—not Services A, B, or C. Any SD-derived metadata would describe the collector, not the actual metric producers. The same applies to Prometheus federation and pushgateway patterns. + +| What SD Sees | What Actually Produced Telemetry | +|--------------|----------------------------------| +| `otel-collector-pod-xyz` | `payment-service`, `auth-service`, `user-service` | +| `prometheus-federation-1` | Hundreds of scraped targets from regional Prometheus | +| `pushgateway-xyz` | Various batch jobs and short-lived processes | +| `kube-state-metrics-0` | Workloads running in K8s and K8s API itself | + +**Supporting Both Models** + +This proposal must support both architectures: + +1. **Direct scraping**: Entity information can be derived from Service Discovery metadata, since SD accurately describes each target. +2. 
**Gateway/federation**: Entity information must be embedded in the exposition format to travel with the metrics through intermediaries. + +Users choose the appropriate approach for their architecture. See [Service Discovery](./04-service-discovery.md) for configuration details. + +--- + +## Goals + +This proposal aims to achieve the following: + +### 1. Define Entity as a Native Concept + +Prometheus should recognize Entities as a distinct concept with their own semantics, separate from metrics. Entities represent the things that produce telemetry, not the telemetry itself. + +### 2. Support Identifying and Descriptive Label Semantics + +Entities should allow declaring which labels are identifying (forming the entity's identity) and which are descriptive (providing additional context that may change over time). + +### 3. Improve Query Ergonomics + +Reduce or eliminate the need for `group_left` when attaching entity labels to related metrics. The common case should be simple. + +### 4. Optimize Storage for Metadata + +Entities store string labels and change infrequently. Storage and ingestion should be optimized for this pattern, rather than treating them as time series with constant values. + +### 5. Enable OTel Entity Translation + +Provide a natural mapping between OpenTelemetry Entities and Prometheus Entities, translating OTel's identifying and descriptive attributes to Prometheus's identifying and descriptive labels. + +### 6. Support Both Direct and Gateway Collection Models + +Entity information must work correctly whether Prometheus scrapes targets directly (where SD metadata is accurate) or through intermediaries like OTel Collector or federation. + +--- + +## Non-Goals + +The following are explicitly out of scope for this proposal: + +### Changing behavior for existing `*_info` Gauges + +This proposal defines new semantics for Entities. Existing **gauges** with `_info` suffix will continue to work as gauges and joins will continue to work. 
Migration or automatic conversion is not in scope. + +### Complete OTel Data Model Parity + +This proposal focuses on Entities. Full parity with OTel's data model (exemplars, exponential histograms, etc.) is addressed elsewhere. + +--- + +## Related Work + +### OpenMetrics Specification + +OpenMetrics 1.0 (November 2020) formally defines the Info metric type. The specification describes Info as "used to expose textual information which SHOULD NOT change during process lifetime." + +- [OpenMetrics 1.0 Specification](https://prometheus.io/docs/specs/om/open_metrics_spec/) +- [OpenMetrics 2.0 Draft](https://prometheus.io/docs/specs/om/open_metrics_spec_2_0/) + +### The `info()` PromQL Function + +Prometheus 2.x introduced an experimental `info()` function in PromQL to simplify joins between metrics and info metrics. Instead of writing verbose `group_left` queries, users can write: + +```promql +info(rate(http_requests_total[5m])) +``` + +This automatically enriches the result with labels from `target_info`. The function reduces boilerplate and makes queries more readable. + +However, the current implementation hardcodes `job` and `instance` as identifying labels—the labels used to correlate metrics with their info series. This works for `target_info` but fails for other entity types like `kube_pod_info` (which uses `namespace` and `pod`) or `kube_node_info` (which uses `node`). The community is actively discussing improvements to make the function more flexible. + +More fundamentally, `info()` still operates on info metrics—it makes joins easier but doesn't change the underlying model where entity information is encoded as a metric with a constant value. Native Entity support would allow the query engine to understand entity relationships directly, making enrichment automatic without needing explicit function calls or hardcoded identifying labels. 
+ +- [PromQL info() function documentation](https://prometheus.io/docs/prometheus/latest/querying/functions/#info) + +### OpenTelemetry Entity Data Model + +OpenTelemetry defines Entities as "objects of interest associated with produced telemetry." The data model specifies: +- Entity types and their schemas +- Identifying vs. descriptive attributes +- Entity lifecycle events + +- [OTel Entities Data Model](https://opentelemetry.io/docs/specs/otel/entities/data-model/) +- [Resource and Entity Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/how-to-write-conventions/resource-and-entities/) + +### OpenTelemetry Prometheus Compatibility + +OpenTelemetry provides specifications for bidirectional conversion between OTel and Prometheus formats: +- Resource attributes → `target_info` labels +- Metric attributes → metric labels +- Handling of Info and StateSet types + +- [Prometheus and OpenMetrics Compatibility](https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/) +- [Prometheus Exporter Specification](https://opentelemetry.io/docs/specs/otel/metrics/sdk_exporters/prometheus/) + +### Prometheus Commitment to OpenTelemetry + +In March 2024, Prometheus announced plans to be the default store for OpenTelemetry metrics: +- OTLP ingestion +- UTF-8 metric and label name support +- Native resource attribute support + +As of late 2024, most of this work has been implemented: OTLP ingestion is generally available in Prometheus 3.0 and UTF-8 support for metric and label names is complete. The notable exception is **native support for resource attributes**—which is precisely what this proposal aims to address through proper Entity semantics. + +- [Prometheus Commitment to OpenTelemetry](https://prometheus.io/blog/2024/03/14/commitment-to-opentelemetry/) + +--- + +## What's Next + +This document establishes the context and motivation for native Entity support in Prometheus. 
The following documents detail the implementation: + +- **[Exposition Formats](./02-exposition-formats.md)**: How entities are represented in text and protobuf formats +- **[SDK](./03-sdk.md)**: How Prometheus client libraries support entities +- **[Service Discovery](./04-service-discovery.md)**: How entities relate to Prometheus targets and discovered metadata +- **[Storage](./05-storage.md)**: How entities are stored efficiently in the TSDB +- **[Querying](./06-querying.md)**: PromQL extensions for working with entities +- **[Web UI and APIs](./07-web-ui-and-apis.md)**: How entities are displayed and accessed +- **[Alerting](./08-alerting.md)**: How entities interact with alerting rules and Alertmanager +- **Remote Write (TBD)**: Protocol changes for transmitting entities over remote write + +--- + +*This proposal is a work in progress. Feedback from Prometheus maintainers, users, and the broader observability community is welcome.* diff --git a/proposals/0071-Entity/02-exposition-formats.md b/proposals/0071-Entity/02-exposition-formats.md new file mode 100644 index 00000000..2aaf2343 --- /dev/null +++ b/proposals/0071-Entity/02-exposition-formats.md @@ -0,0 +1,324 @@ +# Exposition Formats + +## Abstract + +This document specifies how Prometheus exposition formats represent **Entities** using info metrics. As established in [01-context.md](./01-context.md), Entities are the first-class concept representing things that produce telemetry. This document defines how they are serialized in the wire format. + +Info metrics have long been used to represent entity metadata in the Prometheus ecosystem. This proposal enhances them with markers that allow Prometheus to recognize them as entity representations rather than ordinary metrics. The key addition is the `# IDENTIFYING_LABELS` declaration, which distinguishes which labels uniquely identify the entity from which labels describe it. + +--- + +## Entities vs. 
Info Metrics: Concepts and Representation + +Before diving into syntax, it's important to clarify the relationship between two terms used in this proposal: + +### Entity (Concept) + +An **Entity** is the conceptual abstraction—the "thing" that produces telemetry: +- A Kubernetes pod +- A physical host +- A service instance +- A database table + +Entities have: +- **Type** (e.g., `k8s.pod`, `service`, `host`) +- **Identifying labels** (immutable, define unique identity) +- **Descriptive labels** (mutable, provide context) +- **Lifecycle** (creation time, end time) + +### Info Metric (Wire Format) + +An **info metric** is how entities are represented in the exposition format: +- Uses the familiar `*_info` naming convention +- Declares `# TYPE ... info` +- Now includes `# IDENTIFYING_LABELS` to mark which labels are identifying +- Has a placeholder value of `1` + +Throughout this proposal: +- When we say **"Entity,"** we mean the conceptual abstraction +- When we say **"info metric,"** we mean the wire format representation +- The two are closely related: info metrics *represent* entities + +--- + +## Text Format + +### New Syntax Elements + +| Element | Syntax | Description | +|---------|--------|-------------| +| Identifying labels declaration | `# IDENTIFYING_LABELS ...` | Declares which labels uniquely identify the info metric instance | +| Info section delimiter | `---` | Marks the end of the info metrics section | + +### Complete Example + +``` +# HELP kube_pod_info Information about pods +# TYPE kube_pod_info info +# IDENTIFYING_LABELS namespace pod_uid +kube_pod_info{namespace="default",pod_uid="550e8400-e29b-41d4-a716-446655440000",pod="nginx-7b9f5"} 1 +kube_pod_info{namespace="default",pod_uid="660e8400-e29b-41d4-a716-446655440001",pod="redis-cache-0"} 1 +kube_pod_info{namespace="kube-system",pod_uid="770e8400-e29b-41d4-a716-446655440002",pod="coredns-5dd5756b68-abcde"} 1 + +# HELP kube_node_info Information about nodes +# TYPE kube_node_info info +# 
IDENTIFYING_LABELS node_uid +kube_node_info{node_uid="node-uid-001",node="worker-1",os="linux",kernel_version="5.15.0"} 1 +kube_node_info{node_uid="node-uid-002",node="worker-2",os="linux",kernel_version="5.15.0"} 1 + +# HELP target_info Target metadata from OpenTelemetry +# TYPE target_info info +# IDENTIFYING_LABELS job instance +target_info{job="payment-service",instance="10.0.1.5:8080",service_version="2.1.0",deployment_environment="production"} 1 + +--- + +# HELP container_cpu_usage_seconds_total Total CPU usage in seconds +# TYPE container_cpu_usage_seconds_total counter +container_cpu_usage_seconds_total{namespace="default",pod_uid="550e8400-e29b-41d4-a716-446655440000",node_uid="node-uid-001",container="nginx"} 1234.5 +container_cpu_usage_seconds_total{namespace="default",pod_uid="660e8400-e29b-41d4-a716-446655440001",node_uid="node-uid-002",container="redis"} 567.8 + +# HELP http_requests_total Total HTTP requests +# TYPE http_requests_total counter +http_requests_total{job="payment-service",instance="10.0.1.5:8080",method="GET",status="200"} 9999 + +# EOF +``` + +### Parsing Rules + +1. `# TYPE ... info` MUST be followed by `# IDENTIFYING_LABELS` before any metric instances +2. `# IDENTIFYING_LABELS` applies to the info metric family declared by the preceding `# TYPE` +3. All labels listed in `# IDENTIFYING_LABELS` must be present on every instance of that info metric +4. Labels not listed in `# IDENTIFYING_LABELS` are considered descriptive labels +5. The info metrics section ends with a `---` delimiter on its own line +6. After the `---` delimiter, any info metric declarations are a parse error + +### Ordering + +**All info metrics MUST appear at the beginning of the scrape response, before any regular metrics.** The info metrics section ends with a `---` delimiter. + +This ordering requirement exists for practical reasons: when Prometheus parses a metric, it needs to immediately correlate that metric with any relevant info metrics. 
If info metrics could appear anywhere in the response, Prometheus would need to either buffer all metrics until the entire response is parsed, or make a second pass through the data. Both approaches add complexity and memory overhead. + +By requiring info metrics first, the parser can process the exposition in a single pass. When it encounters a regular metric, all potentially correlated info metrics are already in memory and correlation can happen immediately. + +If no info metrics are present, the `---` delimiter may be omitted. + +#### Breaking Change + +**This ordering requirement is a breaking change.** Currently, Prometheus parses info metrics as regular gauges, allowing them to appear anywhere in the scrape response. Applications that expose info metrics after regular metrics will need to be updated to comply with this ordering requirement. + +This trade-off was accepted because the benefits of single-pass parsing and immediate correlation outweigh the migration cost. See [99-alternatives.md](./99-alternatives.md#alternative-introduce-a-new-entity-concept) for an alternative approach that would not have this breaking change. + +--- + +## Protobuf Format + +While the text format uses info metrics to represent entities (for familiarity), the protobuf format uses a dedicated `EntityFamily` structure. This provides a cleaner representation without the need for placeholder values. 
+ +### New Message Definitions + +```protobuf +syntax = "proto2"; + +package io.prometheus.client; + +// EntityFamily groups entities of the same type +message EntityFamily { + // Entity type name + required string type = 1; + + // Names of labels that form the unique identity + repeated string identifying_label_names = 2; + + // Entity instances of this type + repeated Entity entity = 3; +} + +// Entity represents a single entity instance +message Entity { + // All labels (both identifying and descriptive) + repeated LabelPair label = 1; +} +``` + +### Integration with Existing Messages + +The existing `MetricFamily` structure remains unchanged. A new top-level message wraps both: + +```protobuf +// MetricPayload is the top-level message for scrape responses +// that include both entities and metrics +message MetricPayload { + // Entity families (must come before metric families) + repeated EntityFamily entity_family = 1; + + // Metric families + repeated MetricFamily metric_family = 2; +} +``` + +### Content-Type + +For protobuf with entity support: + +``` +application/vnd.google.protobuf;proto=io.prometheus.client.MetricPayload;encoding=delimited +``` + +The `proto` parameter changes from `MetricFamily` to `MetricPayload` to indicate the new top-level message type. + +### Translation Between Formats + +Entities can be losslessly translated between text and protobuf formats: + +| Text Format | Protobuf | +|-------------|----------| +| `# TYPE kube_pod_info info` | `EntityFamily.type = "kube_pod"` | +| `# IDENTIFYING_LABELS namespace pod_uid` | `EntityFamily.identifying_label_names = ["namespace", "pod_uid"]` | +| `kube_pod_info{namespace="default",pod_uid="550e8400",pod="nginx"} 1` | `Entity.label = [{name: "namespace", value: "default"}, {name: "pod_uid", value: "550e8400"}, {name: "pod", value: "nginx"}]` | + +Note that the placeholder value `1` from the text format is not stored in protobuf; it is implicit for entities. 
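
As a rough illustration of the mapping in the table above, the Python sketch below converts one parsed info metric family into a dictionary shaped like `EntityFamily`. It is not part of the proposal: the function name `info_family_to_entity_family` is hypothetical, plain dictionaries stand in for generated protobuf classes, and deriving the entity type by stripping the `_info` suffix is an assumption read off the `kube_pod_info` to `kube_pod` row.

```python
from typing import Dict, List

INFO_SUFFIX = "_info"


def info_family_to_entity_family(
    family_name: str,
    identifying_label_names: List[str],
    samples: List[Dict[str, str]],
) -> dict:
    """Build an EntityFamily-shaped dict from a parsed info metric family.

    Hypothetical helper: plain dicts stand in for the protobuf messages.
    """
    # Assumption: the entity type is the family name minus "_info",
    # as suggested by the kube_pod_info -> kube_pod table row.
    if family_name.endswith(INFO_SUFFIX):
        entity_type = family_name[: -len(INFO_SUFFIX)]
    else:
        entity_type = family_name

    entities = []
    for labels in samples:
        # Parsing rule 3: every identifying label must be present.
        missing = [n for n in identifying_label_names if n not in labels]
        if missing:
            raise ValueError(f"{family_name}: missing identifying labels {missing}")
        # Entity.label carries all labels; the placeholder value 1 is dropped.
        entities.append(
            {"label": [{"name": k, "value": v} for k, v in sorted(labels.items())]}
        )

    return {
        "type": entity_type,
        "identifying_label_names": list(identifying_label_names),
        "entity": entities,
    }


family = info_family_to_entity_family(
    "kube_pod_info",
    ["namespace", "pod_uid"],
    [{"namespace": "default", "pod_uid": "550e8400", "pod": "nginx"}],
)
print(family["type"])  # kube_pod
```

Because the identifying label names live on the family and every entity carries its full label set, the reverse direction (emitting `# TYPE`, `# IDENTIFYING_LABELS`, and one sample line per entity with value `1`) needs no extra information, which is what makes the translation lossless.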
+ +--- + +## Info Metric to Regular Metric Correlation + +### How Correlation Works + +Info metrics correlate with regular metrics through **shared identifying labels**: + +- If a metric has labels that match ALL identifying labels of an info metric (same names, same values), that metric is associated with that info metric. +- A single metric can correlate with multiple info metrics if it contains the identifying labels of each. + +**Example:** + +``` +# TYPE kube_pod_info info +# IDENTIFYING_LABELS namespace pod_uid +kube_pod_info{namespace="default",pod_uid="550e8400",pod="nginx",node="worker-1"} 1 +--- +# This metric correlates with kube_pod_info above (has both identifying labels) +container_cpu_usage_seconds_total{namespace="default",pod_uid="550e8400",container="app"} 1234.5 +``` + +Correlation is computed at ingestion time when Prometheus parses the exposition format. See [05-storage.md](./05-storage.md#correlation-index) for how Prometheus builds and maintains these correlations in storage. + +### Conflict Detection + +When a metric correlates with an info metric, the query engine enriches the metric's labels with the info metric's descriptive labels (see [06-querying.md](./06-querying.md)). This creates the possibility of label conflicts. 
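
The matching rule from "How Correlation Works" and the conflict condition this section describes can be sketched in a few lines of Python. This is an illustrative sketch rather than the proposed implementation: `InfoMetric`, `correlates`, and `find_conflicts` are hypothetical names, and labels are modelled as plain dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InfoMetric:
    """Hypothetical in-memory form of one parsed info metric sample."""
    name: str
    identifying: Dict[str, str]
    descriptive: Dict[str, str] = field(default_factory=dict)


def correlates(metric_labels: Dict[str, str], info: InfoMetric) -> bool:
    # A metric correlates only if it carries ALL identifying labels
    # with the same names and the same values.
    return all(metric_labels.get(k) == v for k, v in info.identifying.items())


def find_conflicts(metric_labels: Dict[str, str], info: InfoMetric) -> List[str]:
    """Return descriptive label names whose values clash with the metric."""
    if not correlates(metric_labels, info):
        return []  # no correlation, so no enrichment and no conflict
    return sorted(
        name
        for name, value in info.descriptive.items()
        if name in metric_labels and metric_labels[name] != value
    )


pod = InfoMetric(
    name="kube_pod_info",
    identifying={"namespace": "default", "pod_uid": "abc-123"},
    descriptive={"version": "2.0", "pod": "nginx"},
)
ok = {"namespace": "default", "pod_uid": "abc-123", "container": "app"}
clash = {"namespace": "default", "pod_uid": "abc-123", "version": "1.0"}
other = {"namespace": "default", "pod_uid": "zzz-999", "version": "1.0"}

print(find_conflicts(ok, pod))     # []
print(find_conflicts(clash, pod))  # ['version']
print(find_conflicts(other, pod))  # []
```

The third case shows why identifying labels themselves never conflict: a different `pod_uid` simply means the metric does not correlate with that info metric, so its `version` label is never compared.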
+ +A conflict occurs when: +- A metric correlates with an info metric (has all identifying labels) +- The metric has a label with the same name as one of the info metric's descriptive labels +- The values differ + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Label Conflict Detection │ +└─────────────────────────────────────────────────────────────────────────────┘ + +Info Metric (kube_pod_info) Regular Metric (my_metric) +┌─────────────────────────────────┐ ┌─────────────────────────────────┐ +│ Identifying Labels: │ │ Labels: │ +│ namespace = "default" │◄─────────►│ namespace = "default" │ ✓ Match +│ pod_uid = "abc-123" │◄─────────►│ pod_uid = "abc-123" │ ✓ Match +├─────────────────────────────────┤ ├─────────────────────────────────┤ +│ Descriptive Labels: │ │ │ +│ version = "2.0" │◄────╳────►│ version = "1.0" │ ✗ CONFLICT! +│ pod = "nginx" │ │ │ +└─────────────────────────────────┘ │ Value: 42 │ + └─────────────────────────────────┘ + +Correlation established via matching identifying labels, +but "version" exists in both with different values → Scrape fails! +``` + +**Example conflict in exposition format:** + +``` +# TYPE kube_pod_info info +# IDENTIFYING_LABELS namespace pod_uid +kube_pod_info{namespace="default",pod_uid="abc-123",version="2.0",pod="nginx"} 1 +--- +# This metric has kube_pod_info identifying labels, so it correlates. +# But it also has a "version" label that conflicts! +my_metric{namespace="default",pod_uid="abc-123",version="1.0"} 42 +``` + +When a conflict is detected during scrape, **the scrape fails with an error**. + +Note that **identifying labels cannot conflict** because they must be present on the metric for correlation to occur—if the metric has the same label name with a different value, it simply won't correlate with that info metric. + +--- + +## Implementation Overview + +This section summarizes the key implementation changes required to support the exposition format extensions. 
Detailed implementation specifics are deferred until the fundamental direction is agreed upon. + +### Parser Changes + +The text parser requires two new entry types: + +1. **`EntryIdentifyingLabels`** — Returned when the parser encounters `# IDENTIFYING_LABELS` +2. **`EntryInfoDelimiter`** — Returned when the parser encounters `---` + +A new method `IdentifyingLabels()` returns the list of label names declared in the `# IDENTIFYING_LABELS` line. + +### Scrape Loop Changes + +The scrape loop needs to: + +1. **Track info metric state during parsing** — Remember which info type is being parsed and its identifying label names +2. **Enforce ordering** — Reject info metrics that appear after the `---` delimiter +3. **Split labels** — Separate identifying labels from descriptive labels based on the declaration +4. **Build correlations** — When processing regular metrics, check if they contain identifying labels that match any parsed info metric +5. **Detect conflicts** — Fail the scrape if a metric's label conflicts with an info metric's descriptive label + +### Data Flow + +``` +┌───────────────────────────────────────────────────────────────────────────────┐ +│ Scrape Data Flow │ +└───────────────────────────────────────────────────────────────────────────────┘ + + Target /metrics Prometheus Scrape Loop + ┌─────────────────┐ ┌─────────────────────────────────────────┐ + │ # TYPE pod info │ │ │ + │ # IDENT_LABELS │ ──HTTP GET──► │ 1. Parse info metrics first │ + │ pod_info{...} 1 │ │ - Extract identifying labels │ + │ --- │ │ - Store for correlation │ + │ # TYPE metric │ │ │ + │ metric{...} 123 │ │ 2. Parse regular metrics │ + │ # EOF │ │ - Check for correlation matches │ + └─────────────────┘ │ - Detect label conflicts │ + │ │ + │ 3. 
Commit to storage │ + │ - Write info metric metadata │ + │ - Write series samples │ + │ - Build correlation index │ + └─────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────┐ + │ Storage (TSDB) │ + │ - Info metric metadata │ + │ - Series data │ + │ - Correlation index │ + └─────────────────────────────────────────┘ +``` + +--- + +## Related Documents + +- [01-context.md](./01-context.md) - Problem statement and motivation +- [03-sdk.md](./03-sdk.md) - How Prometheus client libraries support info metrics with identifying labels +- [04-service-discovery.md](./04-service-discovery.md) - How info metrics relate to Prometheus targets +- [05-storage.md](./05-storage.md) - How info metric metadata is stored in the TSDB +- [06-querying.md](./06-querying.md) - PromQL extensions for working with info metrics +- [07-web-ui-and-apis.md](./07-web-ui-and-apis.md) - How info metrics are displayed and accessed + +--- + +*This proposal is a work in progress. Feedback is welcome.* diff --git a/proposals/0071-Entity/03-sdk.md b/proposals/0071-Entity/03-sdk.md new file mode 100644 index 00000000..9e3341a4 --- /dev/null +++ b/proposals/0071-Entity/03-sdk.md @@ -0,0 +1,375 @@ +# SDK Support for Entities + +## Abstract + +This document specifies how Prometheus client libraries should be extended to support the Entity concept. Using client_golang as the reference implementation, we define new types, interfaces, and patterns that enable applications to declare entities alongside metrics while maintaining backward compatibility with existing instrumentation code. + +The design prioritizes simplicity for the common case—an application instrumenting itself as a single entity—while providing flexibility for advanced scenarios like exporters that expose metrics for multiple entities. + +--- + +## Design Principles + +Before diving into implementation details, it's worth understanding the key design decisions that shaped this proposal. 
+ +**Entities are not collectors.** In client_golang, metrics are managed through the Collector interface, which combines description and collection into a single abstraction. We considered making entities follow this pattern, but entities have fundamentally different characteristics: they represent the "things" that produce telemetry, not the telemetry itself. An entity like "this Kubernetes pod" cuts across multiple collectors (process metrics, Go runtime metrics, application metrics). Tying entities to collectors would create awkward ownership questions and unnecessary coupling. + +**The EntityRegistry is global and separate from the metric Registry.** This separation reflects the conceptual difference between "what is producing telemetry" (entities) and "what telemetry is being produced" (metrics). Making the EntityRegistry global (via `DefaultEntityRegistry`) enables validation at metric registration time—if a metric references a non-existent entity ref, registration fails immediately rather than silently producing invalid output at scrape time. + +**Descriptive labels are mutable, identifying labels are not.** An entity's identity (its type plus identifying labels) is immutable—changing it would make it a different entity. But descriptive labels like version numbers or human-readable names can change during the entity's lifetime. The API reflects this: `SetDescriptiveLabels()` atomically replaces all descriptive labels, while identifying labels are set only at construction. 
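The mutable/immutable split can be sketched in a few lines. This is an illustrative stand-in, not the proposed client_golang type: the positional constructor and the `clone` helper are assumptions, and the real API would take `EntityOpts`.

```go
package main

import (
	"fmt"
	"sync"
)

// Labels mirrors the map-based label type used by client_golang.
type Labels map[string]string

// clone returns an independent copy so callers cannot mutate internal state.
func (l Labels) clone() Labels {
	out := make(Labels, len(l))
	for k, v := range l {
		out[k] = v
	}
	return out
}

// Entity is an illustrative stand-in for the proposed type.
type Entity struct {
	entityType  string
	identifying Labels // set once in the constructor, never written afterwards

	mtx         sync.RWMutex
	descriptive Labels // replaced as a whole under mtx
}

// NewEntity copies both label sets so later changes by the caller have no effect.
func NewEntity(entityType string, identifying, descriptive Labels) *Entity {
	return &Entity{
		entityType:  entityType,
		identifying: identifying.clone(),
		descriptive: descriptive.clone(),
	}
}

// IdentifyingLabels needs no lock because the map is never modified.
func (e *Entity) IdentifyingLabels() Labels { return e.identifying.clone() }

// DescriptiveLabels returns a copy of the current descriptive labels.
func (e *Entity) DescriptiveLabels() Labels {
	e.mtx.RLock()
	defer e.mtx.RUnlock()
	return e.descriptive.clone()
}

// SetDescriptiveLabels swaps in a complete new label set in one step.
func (e *Entity) SetDescriptiveLabels(labels Labels) {
	next := labels.clone() // copy outside the lock to keep the critical section short
	e.mtx.Lock()
	e.descriptive = next
	e.mtx.Unlock()
}

func main() {
	e := NewEntity("service",
		Labels{"service.name": "payment-api"},
		Labels{"service.version": "1.0.0"})
	e.SetDescriptiveLabels(Labels{"service.version": "2.0.0"})
	fmt.Println(e.IdentifyingLabels(), e.DescriptiveLabels())
}
```

Because readers always receive a copy and writers replace the whole map, a concurrent scrape sees either the old label set or the new one, never a mixture.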
+ +--- + +## Entity Types + +### Entity + +The `Entity` type represents a single entity instance: + +```go +type Entity struct { + ref uint64 // Assigned by EntityRegistry + entityType string // e.g., "service", "k8s.pod" + identifyingLabels Labels // Immutable after creation + descriptiveLabels Labels // Mutable via SetDescriptiveLabels + mtx sync.RWMutex // Protects descriptiveLabels +} + +// EntityOpts configures a new Entity +type EntityOpts struct { + Type string // Required: entity type name + Identifying Labels // Required: labels that uniquely identify this instance + Descriptive Labels // Optional: additional context labels +} + +// NewEntity creates an entity. +func NewEntity(opts EntityOpts) *Entity + +// Ref returns the entity's reference (0 if not yet registered) +func (e *Entity) Ref() uint64 + +// Type returns the entity type +func (e *Entity) Type() string + +// IdentifyingLabels returns a copy of the identifying labels +func (e *Entity) IdentifyingLabels() Labels + +// DescriptiveLabels returns a copy of the current descriptive labels +func (e *Entity) DescriptiveLabels() Labels + +// SetDescriptiveLabels atomically replaces all descriptive labels +func (e *Entity) SetDescriptiveLabels(labels Labels) +``` + +### EntityRegistry + +The `EntityRegistry` is a **global singleton**, similar to `prometheus.DefaultRegisterer`. This ensures that metrics can validate entity refs at registration time—if a metric references a non-existent entity, registration fails immediately rather than at scrape time. + +```go +// Global EntityRegistry instance +var DefaultEntityRegistry = NewEntityRegistry() + +type EntityRegistry struct { + mtx sync.RWMutex + byHash map[uint64]*Entity // hash(type+identifying) → Entity + byRef map[uint64]*Entity // ref → Entity + refCounter uint64 // Auto-increments on Register +} + + +// Register adds an entity and assigns its ref. +// Returns error if an entity with the same type+identifying labels exists. 
+func (er *EntityRegistry) Register(e *Entity) error + +// Unregister removes an entity by ref +func (er *EntityRegistry) Unregister(ref uint64) bool + +// Lookup finds an entity by type and identifying labels, returns its ref +func (er *EntityRegistry) Lookup(entityType string, identifying Labels) (ref uint64, found bool) + +// Get retrieves an entity by ref +func (er *EntityRegistry) Get(ref uint64) *Entity + +// Gather collects entities and metrics together into a MetricPayload. +// Only entities referenced by the gathered metrics are included. +func (er *EntityRegistry) Gather(gatherers ...Gatherer) (*dto.MetricPayload, error) +``` + +--- + +## Metric Integration + +Metrics declare their entity associations through the `EntityRefs` field in their options. This field contains the refs of entities that the metric correlates with. + +### Updated Metric Options + +```go +type CounterOpts struct { + Namespace string + Subsystem string + Name string + Help string + ConstLabels Labels + + // EntityRefs lists the refs of entities this metric correlates with. + // Obtain refs via Entity.Ref() after registering with EntityRegistry. + EntityRefs []uint64 +} + +// Same pattern for GaugeOpts, HistogramOpts, SummaryOpts, etc. +``` + +### Validation at Registration + +When a metric with `EntityRefs` is registered, the metric registry validates that all referenced entity refs exist in the global `DefaultEntityRegistry`. 
This catches configuration errors immediately: + +```go +// This works: entity is registered first +serviceEntity := prometheus.NewEntity(prometheus.EntityOpts{...}) +prometheus.RegisterEntity(serviceEntity) // Uses DefaultEntityRegistry + +counter := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "requests_total", + EntityRefs: []uint64{serviceEntity.Ref()}, +}) +prometheus.MustRegister(counter) // Validates that serviceEntity.Ref() exists + +// This fails: entity ref doesn't exist +badCounter := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "bad_counter", + EntityRefs: []uint64{999}, // No entity with this ref +}) +prometheus.MustRegister(badCounter) // PANIC: unknown entity ref 999 +``` + +### Usage Example + +```go +// Create and register entity +serviceEntity := prometheus.NewEntity(prometheus.EntityOpts{ + Type: "service", + Identifying: prometheus.Labels{ + "service.namespace": "production", + "service.name": "payment-api", + "service.instance.id": os.Getenv("INSTANCE_ID"), + }, + Descriptive: prometheus.Labels{ + "service.version": "1.0.0", + }, +}) +prometheus.RegisterEntity(serviceEntity) + +// Create metric that correlates with the entity +requestDuration := prometheus.NewHistogram(prometheus.HistogramOpts{ + Name: "http_request_duration_seconds", + Help: "HTTP request latency", + Buckets: prometheus.DefBuckets, + EntityRefs: []uint64{serviceEntity.Ref()}, +}) +prometheus.MustRegister(requestDuration) + +// Later: update descriptive labels during rolling deploy +serviceEntity.SetDescriptiveLabels(prometheus.Labels{ + "service.version": "2.0.0", +}) +``` + +### Multiple Entity Correlations + +A single metric can correlate with multiple entities. 
This is useful when a metric describes something that spans entity boundaries: + +```go +// Register both pod and node entities +podEntity := prometheus.NewEntity(prometheus.EntityOpts{ + Type: "k8s.pod", + Identifying: prometheus.Labels{ + "k8s.namespace.name": "default", + "k8s.pod.uid": "abc-123", + }, +}) +nodeEntity := prometheus.NewEntity(prometheus.EntityOpts{ + Type: "k8s.node", + Identifying: prometheus.Labels{ + "k8s.node.uid": "node-456", + }, +}) +entityRegistry.Register(podEntity) +entityRegistry.Register(nodeEntity) + +// Container CPU correlates with both pod AND node +containerCPU := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "container_cpu_usage_seconds_total", + Help: "Total CPU usage by container", + EntityRefs: []uint64{podEntity.Ref(), nodeEntity.Ref()}, +}) +``` + +--- + +## Gathering and Exposition + +The `EntityRegistry.Gather()` method is the central coordination point. It accepts metric gatherers as arguments and returns a complete `dto.MetricPayload` containing both entities and metrics. This design enforces that entities are never gathered in isolation—they only make sense alongside their correlated metrics. + +### How Gather Works + +The `Gather()` method coordinates metric and entity collection: + +1. **Collect metrics** from all provided gatherers +2. **Track entity references** — identify which entity refs are used by the gathered metrics +3. **Filter entities** — include only entities that are actually referenced by at least one metric +4. **Return payload** — combine entity families and metric families into a single `MetricPayload` + +This filtering ensures that: +- **Metrics without entities** are still exposed +- **Entities without metrics** are excluded +- **Only the entities actually needed** are transmitted, reducing payload size + +### HTTP Handler Updates + +The promhttp package provides `HandlerFor()` that accepts an `EntityRegistry` and metric gatherers, returning an HTTP handler that: + +1. 
Calls `EntityRegistry.Gather()` with the provided gatherers +2. Negotiates content type (text or protobuf) +3. Encodes the combined `MetricPayload` to the response + +### Usage Example + +```go +func main() { + // Register entity (uses global DefaultEntityRegistry) + serviceEntity := prometheus.NewEntity(prometheus.EntityOpts{...}) + prometheus.RegisterEntity(serviceEntity) + + // Register metrics + counter := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "requests_total", + EntityRefs: []uint64{serviceEntity.Ref()}, + }) + prometheus.MustRegister(counter) + + // Expose via HTTP - uses global registries + http.Handle("/metrics", promhttp.Handler()) // Enhanced to use DefaultEntityRegistry + http.ListenAndServe(":8080", nil) +} +``` + +For custom registries, pass them explicitly: + +```go +entityReg := prometheus.NewEntityRegistry() +metricReg := prometheus.NewRegistry() + +http.Handle("/metrics", promhttp.HandlerFor(entityReg, []prometheus.Gatherer{metricReg}, promhttp.HandlerOpts{})) +``` + +--- + +## Changes to Supporting Libraries + +Implementing entity support requires coordinated changes across multiple repositories. 
+ +### client_model + +The protobuf definitions need new message types: + +```protobuf +// EntityFamily groups entities of the same type +message EntityFamily { + required string type = 1; + repeated string identifying_label_names = 2; + repeated Entity entity = 3; +} + +// Entity represents a single entity instance +message Entity { + repeated LabelPair label = 1; // All labels (identifying + descriptive) +} + +// MetricPayload is the top-level message for combined exposition +message MetricPayload { + repeated EntityFamily entity_family = 1; + repeated MetricFamily metric_family = 2; +} +``` + +### common/expfmt + +The exposition format library needs encoder support for `MetricPayload`: + +```go +// PayloadEncoder encodes a complete MetricPayload +type PayloadEncoder interface { + EncodePayload(payload *dto.MetricPayload) error +} + +// NewPayloadEncoder creates an encoder for the combined format +func NewPayloadEncoder(w io.Writer, format Format) PayloadEncoder +``` + +For the text format, the encoder writes the payload in order: entity declarations first, then the `---` delimiter, then metric families. For the protobuf format, the encoder marshals the `MetricPayload` message directly. + +### client_golang + +The changes described in this document: +- New `Entity` and `EntityRegistry` types +- `EntityRegistry.Gather()` that accepts metric gatherers and returns `*dto.MetricPayload` +- Updated metric options with `EntityRefs` field +- Updated promhttp handlers + +--- + +## Backward Compatibility + +The design maintains full backward compatibility: + +**Existing metrics continue to work.** The `EntityRefs` field is optional. Metrics without entity associations work exactly as before—they simply don't correlate with any entity. + +**Existing registries are unaffected.** The metric `Registry` type is unchanged. Entity support is additive through the separate `EntityRegistry`. 
+ +**Existing HTTP handlers work.** The standard `promhttp.Handler()` continues to expose metrics without entities. Applications opt into entity support by using the new `HandlerFor()` that accepts an `EntityRegistry`. + +**Gradual adoption is possible.** Applications can add entity support incrementally—register an entity, update a few metrics to reference it, and the rest continue working unchanged. + +--- + +## Advanced: Dynamic Entity Associations + +The design presented above works well for applications that instrument themselves, where entities are known at startup and metrics have fixed entity associations. However, some use cases require dynamic associations. + +### Exporters with Many Entities + +Exporters like kube-state-metrics expose metrics for thousands of entities (pods, nodes, deployments). Each metric sample correlates with a different entity based on its label values. For these cases, we propose a per-sample entity association: + +```go +// GaugeVec with per-sample entity support +podInfo := prometheus.NewGaugeVec(prometheus.GaugeVecOpts{ + Name: "kube_pod_info", + VariableLabels: []string{"pod_name", "node"}, +}) + +// When recording, specify which entity this sample correlates with +podInfo.WithEntityRef(podEntities[pod.UID].Ref()). + WithLabelValues("nginx", "node-1"). + Set(1) +``` + +This API extension is optional and can be added in a future iteration once the core entity support is stable. + +--- + +## Open Questions + +Several aspects of this design warrant community feedback: + +**promauto integration.** How should the promauto convenience package handle entities? + +**Entity unregistration and metrics.** If an entity is unregistered while metrics still reference it, what should happen? Options: prevent unregistration while referenced, allow it and have Gather skip the missing entity, or error at gather time. 
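To make two of these options concrete, the sketch below reuses the filtering step from "How Gather Works" and adds a policy for refs that no longer resolve. All names (`MissingRefPolicy`, `referencedEntities`) and the map-based inputs are hypothetical simplifications. The third option, refusing unregistration while a ref is in use, would live in `Unregister` and is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

// MissingRefPolicy selects what happens when a metric references an entity
// ref that is no longer registered. Both values are hypothetical.
type MissingRefPolicy int

const (
	SkipMissing    MissingRefPolicy = iota // drop the dangling ref silently
	ErrorOnMissing                         // fail the whole gather
)

// referencedEntities returns the sorted refs of registered entities that at
// least one metric references. registered maps ref to entity type;
// metricRefs maps metric name to the refs declared in its options.
func referencedEntities(registered map[uint64]string, metricRefs map[string][]uint64, policy MissingRefPolicy) ([]uint64, error) {
	seen := make(map[uint64]bool)
	var refs []uint64
	for name, declared := range metricRefs {
		for _, ref := range declared {
			if _, ok := registered[ref]; !ok {
				if policy == ErrorOnMissing {
					return nil, fmt.Errorf("metric %q references unknown entity ref %d", name, ref)
				}
				continue
			}
			if !seen[ref] {
				seen[ref] = true
				refs = append(refs, ref)
			}
		}
	}
	sort.Slice(refs, func(i, j int) bool { return refs[i] < refs[j] })
	return refs, nil
}

func main() {
	registered := map[uint64]string{1: "service", 2: "k8s.pod"}
	metricRefs := map[string][]uint64{
		"requests_total": {1},
		"stale_metric":   {1, 3}, // ref 3 was unregistered
	}
	refs, err := referencedEntities(registered, metricRefs, SkipMissing)
	fmt.Println(refs, err)
	_, err = referencedEntities(registered, metricRefs, ErrorOnMissing)
	fmt.Println(err)
}
```

Under `SkipMissing` the scrape keeps working but the stale metric loses its correlation without any signal; `ErrorOnMissing` surfaces the problem at the cost of failing every scrape until the metric is fixed.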
+ +--- + +## Related Documents + +- [01-context.md](./01-context.md) — Problem statement and entity concept +- [02-exposition-formats.md](./02-exposition-formats.md) — Wire format for entities +- [05-storage.md](./05-storage.md) — How Prometheus stores entities + diff --git a/proposals/0071-Entity/04-service-discovery.md b/proposals/0071-Entity/04-service-discovery.md new file mode 100644 index 00000000..31442e44 --- /dev/null +++ b/proposals/0071-Entity/04-service-discovery.md @@ -0,0 +1,638 @@ +# Service Discovery and Entities + +## Abstract + +This document specifies how Prometheus Service Discovery (SD) integrates with the Entity concept introduced in this proposal. SD already collects rich metadata about scrape targets—metadata that naturally maps to entity labels. This document provides a comprehensive technical specification for deriving entities from SD metadata, including implementation details and resolution of the interaction between relabeling, entity generation, and metric correlation. + +The document also addresses **attribute mapping standards**—how `__meta_*` labels translate to entity type names and attribute names. Rather than prescribing a specific convention, this document presents the available options (OpenTelemetry semantic conventions, Prometheus-native conventions, etc.) and their trade-offs. Standardized, non-customizable mappings are essential for enabling ecosystem-wide interoperability; the specific convention choice is left as an open decision for the Prometheus community. + +Entities can come from two sources: the **exposition format** (embedded in scraped data) or **Service Discovery** (derived from target metadata). Each approach has trade-offs, and users choose based on their architecture. 
+ +--- + +## Background: How Service Discovery Works + +### Discovery Manager Architecture + +The Discovery Manager (`discovery/manager.go`) coordinates all service discovery mechanisms: + +```go +type Manager struct { + // providers keeps track of SD providers + providers []*Provider + + // targets maps (setName, providerName) -> source -> TargetGroup + targets map[poolKey]map[string]*targetgroup.Group + + // syncCh sends updates to the scrape manager + syncCh chan map[string][]*targetgroup.Group +} +``` + +Each `Provider` wraps a `Discoverer` that implements: + +```go +type Discoverer interface { + // Run sends TargetGroups through the channel when changes occur + Run(ctx context.Context, up chan<- []*targetgroup.Group) +} +``` + +### Target Group Structure + +The fundamental unit of discovery is the `targetgroup.Group`: + +```go +// From discovery/targetgroup/targetgroup.go +type Group struct { + // Targets is a list of targets identified by a label set. + // Each target is uniquely identifiable by its address label. + Targets []model.LabelSet + + // Labels is a set of labels common across all targets in the group. + Labels model.LabelSet + + // Source is an identifier that describes this group of targets. + Source string +} +``` + +**Key insight**: SD mechanisms populate `__meta_*` labels into these `LabelSet` objects. These labels contain the raw metadata that will become entity attributes. + +### Label Flow: Discovery to Scrape + +The complete flow from discovery to metric labels: + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Service Discovery Flow │ +└─────────────────────────────────────────────────────────────────────────────┘ + + 1. DISCOVERY PHASE + ┌─────────────────────────────────────────────────────────────────────────┐ + │ Kubernetes API / AWS API / Consul / etc. 
│ + │ │ │ + │ ▼ │ + │ ┌─────────────────────────────────────────────────────────────────────┐ │ + │ │ Discoverer.Run() builds targetgroup.Group with: │ │ + │ │ │ │ + │ │ Targets[0] = { │ │ + │ │ __address__: "10.0.0.1:8080" │ │ + │ │ __meta_kubernetes_namespace: "production" │ │ + │ │ __meta_kubernetes_pod_name: "nginx-7b9f5" │ │ + │ │ __meta_kubernetes_pod_uid: "550e8400-e29b-..." │ │ + │ │ __meta_kubernetes_pod_node_name: "worker-1" │ │ + │ │ __meta_kubernetes_pod_phase: "Running" │ │ + │ │ ... │ │ + │ │ } │ │ + │ │ │ │ + │ │ Labels = { │ │ + │ │ __meta_kubernetes_namespace: "production" (group-level) │ │ + │ │ } │ │ + │ └─────────────────────────────────────────────────────────────────────┘ │ + └─────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + 2. SCRAPE MANAGER RECEIVES TARGET GROUPS + ┌─────────────────────────────────────────────────────────────────────────┐ + │ scrapePool.Sync(tgs []*targetgroup.Group) │ + │ │ │ + │ ▼ │ + │ TargetsFromGroup() → PopulateLabels() │ + └─────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + 3. LABEL POPULATION (scrape/target.go:PopulateLabels) + ┌─────────────────────────────────────────────────────────────────────────┐ + │ a) Merge target labels + group labels │ + │ b) Add scrape config defaults (job, __scheme__, __metrics_path__, etc.) │ + │ c) Apply relabel_configs │ + │ d) Delete all __meta_* labels │ + │ e) Default instance to __address__ │ + │ │ + │ Result: Target with final label set │ + │ {job="kubernetes-pods", instance="10.0.0.1:8080", namespace="prod"} │ + └─────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + 4. 
SCRAPE LOOP + ┌─────────────────────────────────────────────────────────────────────────┐ + │ HTTP GET target → Parse metrics → Apply metric_relabel_configs │ + │ → Append to storage with final labels │ + └─────────────────────────────────────────────────────────────────────────┘ +``` + +**Critical observation**: The `__meta_*` labels are deleted in step 3d. With entity support, we intercept these labels *before* deletion to generate entities. + +--- + +## Entity Sources + +Entities can originate from two sources, each suited to different deployment patterns: + +### Source 1: Service Discovery + +When Prometheus scrapes targets directly, SD metadata accurately describes the entity producing metrics: + +| SD Mechanism | What It Discovers | Entity It Can Generate | +|--------------|-------------------|------------------------| +| Kubernetes pod SD | Pods | `k8s.pod` | +| Kubernetes node SD | Nodes | `k8s.node` | +| Kubernetes service SD | Services | `k8s.service` | +| EC2 SD | EC2 instances | `host`, `cloud.instance` | +| Azure VM SD | Azure VMs | `host`, `cloud.instance` | +| GCE SD | GCE instances | `host`, `cloud.instance` | +| Consul SD | Services | `service` | + +**When to use**: Direct scraping where the target IS the entity. + +### Source 2: Exposition Format + +When metrics flow through intermediaries, SD sees the intermediary, not the actual sources: + +``` +┌───────────┐ ┌───────────┐ ┌───────────┐ +│ Service A │────▶│ OTel │◀─────▶│Prometheus │ +│ (pod-xyz) │push │ Collector │scrape │ │ +└───────────┘ │ │ │ SD sees: │ +┌───────────┐ │ (pod-abc) │ │ pod-abc │ +│ Service B │────▶│ │ │ │ +└───────────┘ └───────────┘ └───────────┘ + │ + Entity info must travel │ + WITH the metrics ─────────┘ +``` + +**When to use**: Gateways, federation, pushgateway, kube-state-metrics. + +See [01-context.md](./01-context.md#collection-architectures-direct-scraping-vs-gateways) for detailed use cases. 
+ +--- + +## Configuration + +### New Configuration Options + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `entity_from_sd` | bool | `false` | When true, generates entities from `__meta_*` labels using built-in mappings | +| `entity_limit` | int | `0` | Maximum distinct entities per target (0 = no limit) | + +### Configuration Examples + +```yaml +scrape_configs: + # Direct scraping with entity generation enabled + - job_name: 'kubernetes-pods' + kubernetes_sd_configs: + - role: pod + entity_from_sd: true + + # Gateway pattern - entities come from exposition format + - job_name: 'otel-collector' + static_configs: + - targets: ['otel-collector:8889'] + entity_from_sd: false # Default + + # Federation - entities flow through metrics + - job_name: 'federate' + honor_labels: true + metrics_path: '/federate' + static_configs: + - targets: ['prometheus-regional:9090'] + entity_from_sd: false +``` + +--- + +## Attribute Mapping Standards + +A critical design decision for SD-derived entities is how `__meta_*` labels translate to entity type names and attribute names. This section outlines the requirements, available options, and trade-offs for establishing a mapping standard. + +### The Problem + +Service Discovery mechanisms produce `__meta_*` labels with provider-specific naming: + +``` +__meta_kubernetes_pod_uid +__meta_kubernetes_namespace +__meta_ec2_instance_id +__meta_azure_machine_id +``` + +These must be transformed into entity attributes. The key questions are: + +1. **Entity type names**: What should we call the entity? (`k8s.pod`? `kubernetes_pod`? `pod`?) +2. **Attribute names**: How should attributes be named? (`k8s.pod.uid`? `pod_uid`? `uid`?) +3. **Which labels become identifying vs. 
descriptive?** + +The answers to these questions affect: +- **Correlation**: Metrics and entities must share the same identifying label names and values +- **Interoperability**: Other systems querying Prometheus data need predictable attribute names +- **Ecosystem alignment**: Conventions should facilitate integration with dashboards, alerting, and other tools + +### Design Requirements + +Whatever convention is chosen, the mapping must satisfy these requirements: + +1. **Deterministic**: Given the same `__meta_*` labels, the resulting entity attributes must always be identical +2. **Complete**: All meaningful metadata should be captured—useful information should not be silently dropped +3. **Unambiguous**: Each `__meta_*` label maps to exactly one attribute; no conflicts or overlaps +4. **Stable**: Once established, mappings should not change without a clear migration path + +### Available Options + +#### Option 1: OpenTelemetry Semantic Conventions + +Adopt attribute names from [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/), which define standardized names for resource attributes across the industry. + +**Example mappings:** + +| SD Label | OTel-style Entity Attribute | +|----------|----------------------------| +| `__meta_kubernetes_pod_uid` | `k8s.pod.uid` | +| `__meta_kubernetes_namespace` | `k8s.namespace.name` | +| `__meta_ec2_instance_id` | `host.id` | +| `__meta_ec2_instance_type` | `host.type` | +| `__meta_azure_machine_id` | `host.id` | +| `__meta_gce_project` | `cloud.account.id` | + +**Advantages:** +- Industry-wide standardization enables correlation across tools (Grafana, OTel Collector, etc.) 
+- Reduces cognitive load for teams already using OTel conventions +- Future-proofs Prometheus for deeper OTel integration +- Extensive documentation and community support + +**Disadvantages:** +- Not all conventions are stable; Kubernetes conventions are currently "Experimental" and may change +- Introduces dot-separated names (e.g., `k8s.pod.uid`) which differ from Prometheus's traditional underscore convention +- Requires Prometheus to track and potentially adapt to external convention changes + +**Stability considerations:** + +If OTel conventions are adopted, Prometheus should consider: +- Only adopting conventions that have reached **Stable** status +- For widely-used Experimental conventions (like Kubernetes), accepting the risk with clear user documentation +- Establishing a migration strategy for when conventions change + +#### Option 2: Prometheus-Native Conventions + +Define Prometheus-specific conventions that align with existing Prometheus naming patterns (lowercase, underscore-separated). 
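The practical difference between the first two options shows up in the mapping code. In this sketch, the OTel-style names need an explicit lookup table, while the Prometheus-native style is a mechanical prefix strip. The function names are hypothetical, and the table is a small excerpt of the Option 1 example mappings, not a complete or agreed standard.

```go
package main

import (
	"fmt"
	"strings"
)

// otelNames is a hand-written excerpt of an OTel-style lookup table.
var otelNames = map[string]string{
	"__meta_kubernetes_pod_uid":   "k8s.pod.uid",
	"__meta_kubernetes_namespace": "k8s.namespace.name",
	"__meta_ec2_instance_id":      "host.id",
}

// otelAttribute (Option 1) resolves a name through the table; labels without
// an entry report false instead of guessing a name.
func otelAttribute(sdLabel string) (string, bool) {
	name, ok := otelNames[sdLabel]
	return name, ok
}

// nativeAttribute (Option 2) strips the "__meta_" prefix and keeps the rest.
func nativeAttribute(sdLabel string) (string, bool) {
	const prefix = "__meta_"
	if !strings.HasPrefix(sdLabel, prefix) || len(sdLabel) == len(prefix) {
		return "", false
	}
	return strings.TrimPrefix(sdLabel, prefix), true
}

func main() {
	for _, l := range []string{"__meta_kubernetes_pod_uid", "__meta_ec2_instance_id"} {
		otel, _ := otelAttribute(l)
		native, _ := nativeAttribute(l)
		fmt.Printf("%s: otel=%s native=%s\n", l, otel, native)
	}
}
```

Both functions are deterministic, which satisfies the first design requirement. Only the table-based variant can merge provider-specific labels such as EC2 and Azure instance IDs into a shared attribute like `host.id`.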
+ +**Example mappings:** + +| SD Label | Prometheus-style Entity Attribute | +|----------|----------------------------------| +| `__meta_kubernetes_pod_uid` | `kubernetes_pod_uid` | +| `__meta_kubernetes_namespace` | `kubernetes_namespace` | +| `__meta_ec2_instance_id` | `ec2_instance_id` | +| `__meta_ec2_instance_type` | `ec2_instance_type` | +| `__meta_azure_machine_id` | `azure_machine_id` | +| `__meta_gce_project` | `gce_project` | + +**Advantages:** +- Consistent with existing Prometheus label naming conventions +- Full control over naming without external dependencies +- No risk of upstream convention changes +- Simpler—direct transformation from `__meta_*` labels + +**Disadvantages:** +- No industry standardization; correlation with OTel-based systems requires translation +- Prometheus would need to define and maintain its own convention documentation +- May diverge from where the broader observability ecosystem is heading +- Less intuitive for teams already using OTel conventions + +#### Option 3: Minimal Transformation + +Strip the `__meta_` prefix and SD-type prefix, keeping attribute names close to the original. + +**Example mappings:** + +| SD Label | Minimal Entity Attribute | +|----------|-------------------------| +| `__meta_kubernetes_pod_uid` | `pod_uid` | +| `__meta_kubernetes_namespace` | `namespace` | +| `__meta_ec2_instance_id` | `instance_id` | +| `__meta_ec2_instance_type` | `instance_type` | +| `__meta_azure_machine_id` | `machine_id` | +| `__meta_gce_project` | `project` | + +**Advantages:** +- Simplest transformation logic +- Shortest attribute names +- Easy to understand and predict + +**Disadvantages:** +- No namespace to distinguish provider-specific attributes +- Poor interoperability with any external standard + +### Identifying vs. Descriptive Label Classification + +Beyond naming, each mapping must classify labels as **identifying** (immutable, define identity) or **descriptive** (mutable, provide context). 
This classification must be: + +1. **Consistent with the data source**: If the underlying resource uses a UID for identity, so should the entity +2. **Globally unique when combined**: Identifying labels together must uniquely identify one entity +3. **Stable over the entity's lifetime**: Identifying label values must not change + +### SD Mechanisms Without Entity Mappings + +The following SD mechanisms do not generate entities automatically because they lack sufficient metadata to construct meaningful entities: + +| SD Mechanism | Reason | +|--------------|--------| +| `static_configs` | No metadata—just addresses | +| `file_sd_configs` | User-defined, no standard schema | +| `http_sd_configs` | User-defined, no standard schema | +| `dns_sd_configs` | Only provides addresses | + +Users requiring entities from these sources should embed entity information in the exposition format (see [02-exposition-formats.md](./02-exposition-formats.md)). + +### Non-Customizable by Design + +**Attribute mappings are not user-configurable.** This is intentional: + +1. **Standardization requires consistency**: If every deployment uses different attribute names, the benefits of entities (correlation, interoperability, ecosystem tooling) are lost +2. **Ecosystem tooling depends on predictability**: Dashboards, alerting rules, and integrations assume specific attribute names +3. **Reduced cognitive load**: Users don't need to understand or maintain mapping configurations +4. **Simpler implementation**: No configuration parsing, validation, or per-scrape-config mapping logic + +Users who need different attribute names can transform data downstream (e.g., in recording rules or remote write pipelines), but the source of truth in Prometheus uses the standard mappings. + +### Open Decision + +This proposal does not prescribe which naming convention Prometheus should adopt. 
The choice between OTel alignment, Prometheus-native conventions, or another approach should be made by the Prometheus community based on: + +- Strategic direction for OTel integration +- Compatibility requirements with existing tooling +- Long-term maintenance considerations +- Community feedback + +The implementation will be straightforward once a convention is chosen—the technical complexity is in the entity infrastructure, not the naming. + +--- + +## Implementation Overview + +### Where Entity Generation Happens + +Entity generation occurs during target creation in `PopulateLabels()`, **before** `__meta_*` labels are discarded. This timing is critical—once relabeling deletes the meta labels, the raw SD metadata is lost. + +When `entity_from_sd: true`: + +1. **Detect SD type** — Examine `__meta_*` label prefixes to determine which SD mechanism provided the target +2. **Apply built-in mappings** — Use the standard mappings for that SD type to extract entity attributes +3. **Classify labels** — Separate identifying labels (for identity) from descriptive labels (for context) +4. **Create entities** — Build entity structures with type, identifying labels, and descriptive labels +5. **Associate with target** — Store the generated entities alongside the target for transmission during scrape + +### Data Flow Diagram + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Entity Generation Data Flow │ +└─────────────────────────────────────────────────────────────────────────────┘ + +┌───────────────┐ ┌───────────────┐ ┌───────────────┐ +│ Kubernetes │ │ EC2 │ │ Consul │ +│ API │ │ API │ │ API │ +└───────┬───────┘ └───────┬───────┘ └───────┬───────┘ + │ │ │ + ▼ ▼ ▼ +┌───────────────────────────────────────────────────────────────────────────┐ +│ Discovery Manager │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ targetgroup.Group │ │ +│ │ Targets: [ { __meta_kubernetes_pod_uid: "abc", ... 
} ] │ │ +│ │ Labels: { __meta_kubernetes_namespace: "prod" } │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +└───────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────────────┐ +│ Scrape Manager │ +│ │ +│ scrapePool.Sync(tgs) → TargetsFromGroup() → PopulateLabels() │ +│ │ │ +│ ┌────────────────────┴────────────────────┐ │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌─────────────────────────┐ ┌─────────────────────────┐ │ +│ │ Entity Generation │ │ Label Processing │ │ +│ │ (from __meta_* labels)│ │ (relabel_configs) │ │ +│ │ │ │ │ │ +│ │ IF entity_from_sd: │ │ 1. Apply relabel rules │ │ +│ │ Extract identifying │ │ 2. Delete __meta_* │ │ +│ │ Extract descriptive │ │ 3. Set instance default│ │ +│ │ Create Entity struct │ │ │ │ +│ └───────────┬─────────────┘ └──────────┬──────────────┘ │ +│ │ │ │ +│ │ ┌──────────────────────────────┘ │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ Target │ │ +│ │ │ │ +│ │ labels: { job="k8s-pods", instance="10.0.0.1:8080", ns="prod" } │ │ +│ │ │ │ +│ │ sdEntities: [ │ │ +│ │ Entity{ │ │ +│ │ type: "k8s.pod", │ │ +│ │ identifyingLabels: {namespace="prod", pod_uid="abc-123"} │ │ +│ │ descriptiveLabels: {pod_name="nginx", node_name="worker-1"} │ │ +│ │ } │ │ +│ │ ] │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +└───────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────────────┐ +│ Scrape Loop │ +│ │ +│ For each scrape: │ +│ 1. HTTP GET target │ +│ 2. Parse exposition format │ +│ 3. Extract exposition-format entities (if any) │ +│ 4. Merge SD entities + exposition entities │ +│ 5. app.AppendEntity() for each entity │ +│ 6. app.Append() for each metric (with correlation via shared labels) │ +│ 7. 
app.Commit() │ +└───────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────────────┐ +│ Storage (TSDB) │ +│ │ +│ ┌─────────────────────┐ ┌─────────────────────┐ │ +│ │ Entity Storage │ │ Series Storage │ │ +│ │ │ │ │ │ +│ │ memEntity │◄──►│ memSeries │ │ +│ │ stripeEntities │ │ stripeSeries │ │ +│ │ EntityMemPostings │ │ postings │ │ +│ │ │ │ │ │ +│ │ Correlation Index │────┤ │ │ +│ └─────────────────────┘ └─────────────────────┘ │ +└───────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Relabeling and Entities + +This section specifies how relabeling interacts with entity generation. + +### Principle: Entities Are Generated Before Relabeling + +Entity generation uses the **raw** `__meta_*` labels before any relabeling is applied. This ensures: + +1. **Predictability**: Entity structure is consistent regardless of user relabeling rules +2. **Correctness**: Identifying labels match the actual resource identity +3. **Simplicity**: Users don't need to coordinate relabeling with entity generation + +### relabel_configs Do Not Affect Entity Labels + +```yaml +scrape_configs: + - job_name: 'kubernetes-pods' + kubernetes_sd_configs: + - role: pod + entity_from_sd: true + relabel_configs: + # This ONLY affects metric labels, NOT entity labels + - source_labels: [__meta_kubernetes_namespace] + target_label: ns # Metric label becomes "ns" + # Entity attribute uses the standard mapping (unchanged) +``` + +**Rationale**: Entity identifying labels are derived from `__meta_*` labels using the standard mapping, independent of `relabel_configs`. This ensures entity structure is predictable regardless of user relabeling rules. + +### metric_relabel_configs and Entity Labels + +`metric_relabel_configs` operates on metrics **after** they're scraped but **before** correlation happens. 
Entity-enriched labels (descriptive labels added during query) are **not** subject to `metric_relabel_configs`. + +```yaml +scrape_configs: + - job_name: 'kubernetes-pods' + entity_from_sd: true + metric_relabel_configs: + # This drops metrics, but entities remain + - source_labels: [__name__] + regex: 'go_.*' + action: drop +``` + +### honor_labels Interaction + +When `honor_labels: true`, labels from the scraped payload take precedence over target labels. This affects correlation: + +```yaml +scrape_configs: + - job_name: 'federate' + honor_labels: true + entity_from_sd: false # Entities come from federated metrics +``` + +If `entity_from_sd: true` with `honor_labels: true`: +- SD-derived entities are still generated +- Correlation uses the **final** metric labels (which may come from the payload) +- This could cause correlation mismatches if payload labels differ from SD labels + +**Recommendation**: When using `honor_labels: true`, set `entity_from_sd: false` and rely on exposition-format entities. + +--- + +## Conflict Resolution + +> **TODO**: This section needs further design work. When entities come from both SD and the exposition format for the same scrape, we need to define: +> - How to detect that two entities refer to the same resource +> - Whether to merge, prefer one source, or treat them as distinct +> - How to handle conflicting descriptive labels +> - Edge cases around timing and ordering +> +> This interacts with the exposition format design in [02-exposition-formats.md](./02-exposition-formats.md) and needs to be addressed holistically. + +--- + +## Entity Lifecycle with SD + +### Entity Creation + +An SD-derived entity is created when a target with matching `__meta_*` labels first appears in discovery. + +### Entity Updates + +When a target is re-discovered (on each SD refresh) and `entity_from_sd: true`: +1. Entity identifying labels are checked against existing entities +2. If entity exists, descriptive labels are compared +3. 
If descriptive labels changed, a new snapshot is recorded (see [05-storage.md](./05-storage.md)) + +### Entity Staleness + +When a target disappears from SD: + +1. **Immediate behavior**: The target's scrape loop is stopped +2. **Reference counting**: The scrape pool tracks how many targets reference each entity +3. **Entity marking**: When the last target referencing an entity disappears, the entity's `endTime` is set +4. **Grace period**: Entities remain queryable for historical analysis until retention removes them + +### Entity Deduplication + +Multiple targets may correlate with the same entity (e.g., multiple containers in a pod). Entity identity is determined by type + identifying labels—if two targets generate entities with the same identity, only one entity is stored. + +When the same entity is discovered from multiple targets: +- First discovery creates the entity +- Subsequent discoveries update `lastSeen` timestamp +- Descriptive labels are merged (last write wins for conflicts) + +--- + +## Open Questions Resolved + +### Q: Entity deduplication across multiple discovery mechanisms + +**Answer**: Entities are deduplicated by their identifying labels. If Kubernetes pod SD and endpoints SD both discover the same pod, only one entity is stored. The entity's descriptive labels are updated from whichever source provides the most recent data. + +### Q: SD entity lifecycle when target disappears + +**Answer**: When the last target referencing an entity disappears from SD, the entity's `endTime` is set to the current timestamp. The entity remains in storage for historical queries until retention deletes it. + +## Open Questions + +### Q: Which naming convention should Prometheus adopt for entity attributes? + +This proposal presents the available options (OTel semantic conventions, Prometheus-native, minimal transformation) and their trade-offs, but does not prescribe a specific choice. 
The decision should be made by the Prometheus community considering: + +- Strategic alignment with OpenTelemetry +- Existing ecosystem tooling and dashboards +- Long-term maintenance burden +- Community preferences + +### Q: How should Prometheus handle OTel conventions that are not yet stable? + +If OTel semantic conventions are chosen, Prometheus must decide how to handle conventions that haven't reached "Stable" status (e.g., Kubernetes conventions are currently "Experimental"). Options include: + +1. **Strict stability requirement**: Only adopt stable conventions; define Prometheus-specific names for unstable areas +2. **Pragmatic adoption**: Adopt widely-used experimental conventions with clear documentation about potential future changes +3. **Hybrid approach**: Use stable OTel conventions where available, Prometheus-native names elsewhere + +### Q: Should entity types be namespaced by SD mechanism? + +When multiple SD mechanisms can discover similar resources (e.g., EC2, Azure, GCE all discover "hosts"), should entity types be: + +- **Generic**: `host` (requires merging semantics across providers) +- **Provider-specific**: `ec2.instance`, `azure.vm`, `gce.instance` (clearer provenance, no collision risk) +- **Hierarchical**: `host` with `cloud.provider` as an identifying label + +--- + +## Related Documents + +- [01-context.md](./01-context.md) - Problem statement, motivation, and use cases +- [02-exposition-formats.md](./02-exposition-formats.md) - How entities are represented in wire formats +- [05-storage.md](./05-storage.md) - How entities are stored in the TSDB +- [06-querying.md](./06-querying.md) - PromQL extensions for working with entities +- [07-web-ui-and-apis.md](./07-web-ui-and-apis.md) - How entities are displayed and accessed + +--- + +*This proposal is a work in progress. 
Feedback is welcome.* + diff --git a/proposals/0071-Entity/05-storage.md b/proposals/0071-Entity/05-storage.md new file mode 100644 index 00000000..97b5c234 --- /dev/null +++ b/proposals/0071-Entity/05-storage.md @@ -0,0 +1,472 @@ +# Entity Storage + +## Abstract + +This document specifies how Prometheus stores entities reliably and efficiently. Entities represent the things that produce telemetry (pods, nodes, services) and need different storage semantics than traditional time series: they have immutable identifying labels, mutable descriptive labels that change over time, and lifecycle boundaries (creation and deletion). This document covers the in-memory structures, Write-Ahead Log integration, block persistence, and the correlation index that links entities to their associated metrics. + +## Background + +### Current Prometheus Storage Architecture + +Prometheus uses a time series database (TSDB) optimized for append-heavy workloads with the following key components: + +**Head Block**: The in-memory component that stores the most recent data. New samples are appended here first. The Head contains: +- `memSeries`: In-memory representation of each time series, holding recent samples in chunks +- `stripeSeries`: A sharded map for concurrent access to series by ID or label hash +- `MemPostings`: An inverted index mapping label name/value pairs to series references + +**Write-Ahead Log (WAL)**: Ensures durability by writing all incoming data to disk before acknowledging. On crash recovery, the WAL is replayed to reconstruct the Head. WAL records include: +- Series records (new series with their labels) +- Sample records (timestamp + value for a series) +- Metadata records (type, unit, help for metrics) +- Exemplar and histogram records + +**Persistent Blocks**: Periodically, the Head is compacted into immutable blocks stored on disk. 
Each block contains: +- Chunk files (compressed time series data) +- Index file (label index, postings lists, series metadata) +- Meta file (time range, stats) + +**Appender Interface**: The primary interface for writing data to storage: + +```go +type Appender interface { + Append(ref SeriesRef, l labels.Labels, t int64, v float64) (SeriesRef, error) + Commit() error + Rollback() error + // ... other methods for histograms, exemplars, metadata +} +``` + +The scrape loop uses Appender to write scraped metrics. Each scrape creates an Appender, appends all samples, then calls Commit() to atomically persist everything to the WAL. + +### Why Entities Need Different Storage + +Entities differ from time series in fundamental ways: + +| Aspect | Time Series | Entities | +|--------|-------------|----------| +| Identity | Labels (all mutable in theory) | Identifying labels (immutable) | +| Values | Numeric samples over time | String labels (descriptive) | +| Cardinality | High (many series per entity) | Lower (one entity, many series) | +| Lifecycle | Implicit (staleness) | Explicit (start/end timestamps) | +| Correlation | Self-contained | Links to multiple series | + +These differences motivate a dedicated storage approach rather than trying to fit entities into the existing series model. 
+ +## Entity Data Model + +### The memEntity Structure + +Each entity in memory is represented by the following structure: + +```go +type memEntity struct { + // Immutable after creation + ref EntityRef // Unique identifier (uint64, auto-incrementing) + entityType string // e.g., "k8s.pod", "service", "k8s.node" + identifyingLabels labels.Labels // Immutable labels that define identity + + // Lifecycle timestamps + startTime int64 // When this entity incarnation was created + endTime int64 // When deleted (0 if still alive) + + // Mutable + sync.Mutex + descriptiveSnapshots []labelSnapshot // Historical descriptive labels + lastSeen int64 // Last scrape timestamp (for staleness checking) +} + +type labelSnapshot struct { + timestamp int64 + labels labels.Labels +} +``` + +### Identifying vs Descriptive Labels + +**Identifying Labels** define what an entity *is*. They are immutable for the lifetime of an entity incarnation: + +``` +Entity Type: k8s.pod +Identifying Labels: + - k8s.namespace.name = "production" + - k8s.pod.uid = "550e8400-e29b-41d4-a716-446655440000" +``` + +Two entities with the same identifying labels are considered the same entity (within their lifecycle bounds). + +**Descriptive Labels** provide additional context that may change over time: + +``` +Descriptive Labels (at t1): + - k8s.pod.name = "nginx-7b9f5" + - k8s.node.name = "worker-1" + - k8s.pod.status = "Running" + +Descriptive Labels (at t2, pod migrated): + - k8s.pod.name = "nginx-7b9f5" + - k8s.node.name = "worker-2" ← changed + - k8s.pod.status = "Running" +``` + +### Snapshot Storage for Descriptive Labels + +Descriptive labels are stored as complete snapshots at each change point. When new descriptive labels arrive: + +1. Compare with the most recent snapshot +2. If different, append a new snapshot with current timestamp +3. 
If identical, update `lastSeen` but don't create new snapshot + +``` +descriptiveSnapshots: [ + { t1, {name="nginx-7b9f5", node="worker-1", status="Running"} }, + { t5, {name="nginx-7b9f5", node="worker-2", status="Running"} }, // node changed + { t9, {name="nginx-7b9f5", node="worker-2", status="Terminating"} }, // status changed +] +``` + +### Entity Lifecycle + +Each entity has explicit lifecycle boundaries: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Entity Lifecycle │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ startTime endTime │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ Entity is "alive" │ │ +│ │ - Correlates with metrics in this time range │ │ +│ │ - Descriptive labels tracked │ │ +│ └──────────────────────────────────────────────────────┘ │ +│ │ +│ Before startTime: Entity doesn't exist │ +│ After endTime: Entity is "dead" (historical only) │ +│ endTime == 0: Entity is currently alive │ +│ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +**Entity Staleness** + +An entity's `endTime` is determined by staleness, similar to series staleness: +- Each scrape updates `lastSeen` timestamp +- If `now - lastSeen > staleness_threshold`, entity is marked dead +- `endTime` is set to `lastSeen + staleness_threshold` + +**Entity Reincarnation** + +The same identifying labels can appear again after an entity ends: + +``` +Timeline: + t1: Entity A created (ref=1, identifying={pod.uid="abc"}, startTime=t1) + t5: Entity A deleted (ref=1, endTime=t5) + t10: Entity B created (ref=2, identifying={pod.uid="abc"}, startTime=t10) +``` + +Entity A and Entity B have the same identifying labels but different EntityRefs and non-overlapping lifecycles. At any point in time, at most one entity with a given set of identifying labels should be alive. 
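The snapshot and staleness rules above can be sketched as follows. This is an illustration under simplifying assumptions, not the proposed implementation: labels are plain `map[string]string` values rather than `labels.Labels`, locking is omitted, and the names `entity`, `observe`, and `markStale` are hypothetical.

```go
package main

import "fmt"

// labelSnapshot mirrors the structure described in this section,
// using a plain map instead of labels.Labels to stay self-contained.
type labelSnapshot struct {
	timestamp int64
	labels    map[string]string
}

// entity is a simplified, hypothetical stand-in for memEntity.
type entity struct {
	startTime int64
	endTime   int64 // 0 while the entity is alive
	lastSeen  int64
	snapshots []labelSnapshot
}

func sameLabels(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// observe records a scrape at time t. lastSeen is always updated;
// a new snapshot is appended only when the descriptive labels differ
// from the most recent snapshot.
func (e *entity) observe(t int64, descriptive map[string]string) {
	e.lastSeen = t
	if n := len(e.snapshots); n > 0 && sameLabels(e.snapshots[n-1].labels, descriptive) {
		return
	}
	cp := make(map[string]string, len(descriptive))
	for k, v := range descriptive {
		cp[k] = v
	}
	e.snapshots = append(e.snapshots, labelSnapshot{timestamp: t, labels: cp})
}

// markStale applies the staleness rule: if the entity has not been seen
// for longer than threshold, endTime becomes lastSeen + threshold.
// It reports whether the entity is dead after the check.
func (e *entity) markStale(now, threshold int64) bool {
	if e.endTime != 0 {
		return true
	}
	if now-e.lastSeen > threshold {
		e.endTime = e.lastSeen + threshold
		return true
	}
	return false
}

func main() {
	e := &entity{startTime: 1}
	e.observe(1, map[string]string{"node": "worker-1"})
	e.observe(3, map[string]string{"node": "worker-1"}) // unchanged, no new snapshot
	e.observe(5, map[string]string{"node": "worker-2"}) // changed, new snapshot
	fmt.Println("snapshots:", len(e.snapshots), "lastSeen:", e.lastSeen)

	dead := e.markStale(11, 5)
	fmt.Println("dead:", dead, "endTime:", e.endTime)
}
```

Reincarnation then falls out naturally: once `endTime` is set, a later appearance of the same identifying labels would create a fresh `entity` value with a new ref and a new `startTime`, rather than reviving this one.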
+ +## Storage Components + +### In-Memory Structures + +#### Entity Storage in Head + +The Head block is extended with: + +| Component | Purpose | +|-----------|---------| +| **Entity storage** | Sharded map (like `stripeSeries`) storing `memEntity` by ref or identifying labels hash | +| **Entity postings** | Inverted index mapping `(label_name, label_value)` → entity refs | +| **Correlation index** | Bidirectional maps: `series_ref ↔ entity_refs` | + +The entity storage and postings follow the same sharding patterns as the existing series storage to support concurrent access. + +#### Correlation Index + +The correlation index maintains the many-to-many relationship between series and entities as two bidirectional maps: +- **Series → Entities**: "which entities does this series correlate with?" +- **Entities → Series**: "which series are associated with this entity?" + +**Building correlations at ingestion time:** + +When a **new series** is created, Prometheus checks each registered entity type. If the series labels contain all of an entity type's identifying labels, it looks up the corresponding entity and adds the correlation. + +When a **new entity** is created, Prometheus uses the postings index to find all series whose labels contain all of the entity's identifying labels, then adds correlations for each match. + +**Correlation and Entity Lifecycle** + +When an entity becomes stale (endTime set), it remains in the correlation index. This preserves historical correlations for queries over past time ranges. The query layer filters based on timestamp overlap between the query range and entity lifecycle. + +### Write-Ahead Log + +#### New WAL Record Type + +A single new record type captures all entity state: + +```go +const ( + // ... existing types ... 
+ Entity Type = 11 // Entity record +) + +type RefEntity struct { + Ref EntityRef + EntityType string + IdentifyingLabels []labels.Label + DescriptiveLabels []labels.Label + StartTime int64 + EndTime int64 // 0 if alive + Timestamp int64 // When this record was written +} +``` + +#### Record Encoding + +Entity records follow the same encoding pattern as other WAL records: + +``` +┌───────────┬──────────┬────────────┬──────────────┐ +│ type <1b> │ len <2b> │ CRC32 <4b> │ data │ +└───────────┴──────────┴────────────┴──────────────┘ +``` + +The data section for an Entity record: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Entity Record Data │ +├─────────────────────────────────────────────────────────────────────┤ +│ ref <8b, big-endian> │ +│ entityType │ +│ numIdentifyingLabels │ +│ ┌─ name │ +│ └─ value │ +│ ... repeated for each identifying label │ +│ numDescriptiveLabels │ +│ ┌─ name │ +│ └─ value │ +│ ... repeated for each descriptive label │ +│ startTime │ +│ endTime │ +│ timestamp │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +#### When Entity Records Are Written + +Entity records are written to WAL in these situations: + +1. **New entity created**: Full record with startTime set, endTime=0 +2. **Descriptive labels changed**: Full record with updated labels and new timestamp +3. **Entity marked dead**: Full record with endTime set + +Writing full records (not deltas) simplifies replay and allows any single record to fully describe entity state at that point. + +### Block Persistence + +When the Head is compacted into a persistent block, entities must also be persisted. 
+ +#### Entity Index in Blocks + +Each block includes an entity index alongside the existing series index: + +``` +Block Directory Structure: + block-ulid/ + ├── chunks/ # Chunk files (existing) + ├── index # Series index (existing) + ├── entities # Entity index (new) + ├── meta.json # Block metadata (extended) + └── tombstones # Deletion markers (existing) +``` + +The entity index file structure: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Entity Index File │ +├─────────────────────────────────────────────────────────────────────┤ +│ Magic Number (4 bytes) │ +│ Version (1 byte) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Symbol Table │ +│ - All unique strings (entity types, attr names, attr values) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Entity Table │ +│ For each entity: │ +│ - EntityRef │ +│ - EntityType (symbol ref) │ +│ - IdentifyingLabels (symbol ref pairs) │ +│ - StartTime, EndTime │ +│ - DescriptiveSnapshots offset (pointer to snapshots section) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Descriptive Snapshots Section │ +│ For each entity's snapshots: │ +│ - Number of snapshots │ +│ - For each snapshot: timestamp, labels (symbol ref pairs) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Entity Postings │ +│ - Inverted index: (label_name, label_value) -> [EntityRefs] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Table of Contents │ +│ CRC32 │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +#### Entity Retention + +Entities follow the same retention policy as series data. Prometheus deletes blocks based on `RetentionDuration` (time-based) or `MaxBytes` (size-based). When blocks are deleted, entities are handled as follows: + +**Retention Rule**: An entity persists as long as **any block overlapping its lifecycle** exists. 
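As a rough sketch of this rule, the check reduces to an interval-overlap test between the entity lifecycle and each remaining block. The types and function names here (`blockRange`, `lifecycle`, `retained`) are hypothetical, block bounds are treated as inclusive to match the `[t0, t1]` notation in this document, and the Head is assumed to be passed in as just another range.

```go
package main

import "fmt"

// blockRange is the time range covered by a block (or the Head).
type blockRange struct{ minT, maxT int64 }

// lifecycle holds an entity's boundaries; endTime == 0 means still alive.
type lifecycle struct{ startTime, endTime int64 }

// overlaps reports whether the lifecycle intersects the block range.
func (l lifecycle) overlaps(b blockRange) bool {
	if l.startTime > b.maxT {
		return false // entity was created after the block ends
	}
	return l.endTime == 0 || l.endTime >= b.minT
}

// retained reports whether at least one remaining block overlaps the lifecycle.
func retained(l lifecycle, remaining []blockRange) bool {
	for _, b := range remaining {
		if l.overlaps(b) {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative timestamps only: blocks 3 and 4 remain after retention.
	remaining := []blockRange{{minT: 20, maxT: 30}, {minT: 30, maxT: 40}}

	entityA := lifecycle{startTime: 0, endTime: 15} // ended before block 3
	entityB := lifecycle{startTime: 12, endTime: 0} // still alive

	fmt.Println("A retained:", retained(entityA, remaining))
	fmt.Println("B retained:", retained(entityB, remaining))
}
```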
+ +``` +Block Timeline: + Block 1 Block 2 Block 3 Block 4 + [t0, t1] [t1, t2] [t2, t3] [t3, t4] + +Entity A: ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ + startTime=t0 endTime=t1.5 + (lifecycle spans Block 1 and Block 2) + +Entity B: ░░░░░░░░░░░░████████████████████████████████ + startTime=t1.2 endTime=0 (still alive) + (lifecycle spans Block 2, Block 3, Block 4, Head) + +When Block 1 and Block 2 are deleted due to retention: +- Entity A is deleted (no remaining blocks contain its lifecycle) +- Entity B persists (Block 3, Block 4, Head still overlap its lifecycle) +``` + +This ensures historical queries can always resolve entity correlations for the data that remains. + +## Ingestion Flow + +### Extended Appender Interface + +The Appender interface is extended to support entity ingestion: + +```go +type Appender interface { + // ... existing methods ... + + // AppendEntity adds or updates an entity. + // Returns the EntityRef (existing or newly assigned). + AppendEntity( + entityType string, + identifyingAttrs labels.Labels, + descriptiveAttrs labels.Labels, + timestamp int64, + ) (EntityRef, error) +} +``` + +### AppendEntity Behavior + +When `AppendEntity` is called: + +1. **Validate** — Entity type and identifying labels must be non-empty +2. **Lookup** — Search for an existing alive entity with the same identifying labels +3. **If not found** — Create a new entity with a fresh EntityRef, set `startTime` to now, stage for WAL write +4. **If found** — Update `lastSeen` timestamp; if descriptive labels changed, append a new snapshot and stage a WAL record + +New entities and WAL records are staged (not committed) until `Commit()` is called, following the same transactional pattern as sample appends. + +## Query Support + +This section provides an overview of how storage exposes entities for queries. Detailed query semantics are covered in the Querying document. 
+ +### Storage Query Interface + +```go +type EntityQuerier interface { + // Get entity by ref + Entity(ref EntityRef) (*Entity, error) + + // Find entities by type and/or labels + Entities(ctx context.Context, entityType string, matchers ...*labels.Matcher) (EntitySet, error) + + // Get entities correlated with a series at a specific time + EntitiesForSeries(seriesRef SeriesRef, timestamp int64) ([]EntityRef, error) + + // Get series correlated with an entity + SeriesForEntity(entityRef EntityRef) ([]SeriesRef, error) + + // Get descriptive labels at a point in time + DescriptiveLabelsAt(entityRef EntityRef, timestamp int64) (labels.Labels, error) +} +``` + +### Time-Range Filtering + +Queries specify a time range `[mint, maxt]`. Entity results are filtered by lifecycle: +- **isAliveAt(t)**: True if `startTime <= t` and (`endTime == 0` or `endTime > t`) +- **overlapsRange(mint, maxt)**: True if the entity's lifecycle overlaps the query range + +## Open Questions / Future Work + +### Retention Alignment + +How exactly should entity retention align with block retention? +- Current proposal: entities persist while any block containing their lifecycle exists +- May need refinement based on operational experience + +### Memory Management + +Long-running Prometheus instances may accumulate many historical entities: +- Consider memory-mapped entity storage for historical entities +- Investigate entity compaction/summarization for very old data + +### Federation and Multi-Prometheus + +When multiple Prometheus instances scrape the same entities: +- Entity deduplication across instances +- Consistent EntityRef assignment (or ref translation) +- Correlation index consistency + +### Entity Type Registry + +Should Prometheus maintain a registry of known entity types with their identifying label schemas? +- Would enable validation at ingestion time +- Could optimize correlation index building +- Trade-off: flexibility vs. 
consistency + +--- + +## TODO: Memory and WAL Replay Performance + +This section requires further investigation and benchmarking: + +### Memory Concerns + +- **Entity memory footprint estimation**: We need to quantify the memory cost per entity, including the `memEntity` struct, descriptive snapshots, and correlation index entries. This will help users estimate memory requirements based on expected entity counts. + +- **Impact on existing memory settings**: How do entity storage requirements interact with `--storage.tsdb.head-chunks-*` and other memory-related flags? Should there be dedicated entity memory limits? + +- **Memory-mapped entity storage**: For Prometheus instances with very long uptimes and high entity churn, historical entities may accumulate. Investigate whether memory-mapping historical entities (similar to mmapped chunks) could reduce memory pressure. + +- **Correlation index memory scaling**: The bidirectional correlation maps (`seriesToEntities` and `entitiesToSeries`) could become large with high series and entity counts. Consider more memory-efficient data structures (e.g., roaring bitmaps) if benchmarks show this is a bottleneck. + +### WAL Replay Performance + +- **Correlation index rebuild time**: The current proposal rebuilds the correlation index after WAL replay by iterating all entities and series. For large Prometheus instances (millions of series, thousands of entities), this could significantly increase startup time. + +- **Incremental correlation during replay**: Instead of rebuilding correlations after replay, could we store correlation state in the WAL or maintain it incrementally during replay? This would trade WAL size for faster startup. + +- **Checkpointing correlation state**: Consider extending WAL checkpointing to include entity and correlation state, reducing the amount of replay needed on restart. 
+ +- **Benchmark targets**: We should establish performance targets (e.g., "WAL replay should not increase by more than 10% with 10,000 entities") and validate them through benchmarks. + +These topics need benchmarking with realistic workloads before finalizing the implementation approach. + +--- + +## What's Next + +- [Querying](06-querying.md): How PromQL is extended to query entities and correlations +- [Web UI and APIs](07-web-ui-and-apis.md): HTTP API endpoints and UI for entity exploration + diff --git a/proposals/0071-Entity/06-querying.md b/proposals/0071-Entity/06-querying.md new file mode 100644 index 00000000..3f01798f --- /dev/null +++ b/proposals/0071-Entity/06-querying.md @@ -0,0 +1,435 @@ +# Querying: Entity-Aware PromQL + +## Abstract + +This document specifies how Prometheus's query engine extends to support native entity awareness. The core principle is **automatic enrichment**: when querying metrics, correlated entity labels (both identifying and descriptive) are automatically included in results without requiring explicit join operations. A new **pipe operator** (`|`) enables filtering metrics by entity correlation using familiar syntax consistent with the exposition format. 
+ +## Background + +### Current PromQL Value Types + +PromQL expressions evaluate to one of four value types: + +| Type | Description | Example | +|------|-------------|---------| +| **Scalar** | Single floating-point number | `42`, `3.14` | +| **String** | Simple string literal | `"hello"` | +| **Instant Vector** | Set of time series, each with one sample at a single timestamp | `http_requests_total{job="api"}` | +| **Range Vector (Matrix)** | Set of time series, each with multiple samples over a time range | `http_requests_total{job="api"}[5m]` | + +Functions have specific type signatures: + +``` +rate(Matrix) → Vector +sum(Vector) → Vector +scalar(Vector) → Scalar (single-element vector only) +``` + +### Current Query Execution Model + +When Prometheus executes a PromQL query: + +1. **Parsing**: Query string → Abstract Syntax Tree (AST) +2. **Preparation**: For each VectorSelector, call `querier.Select()` with label matchers +3. **Evaluation**: Traverse AST, evaluate functions and operators +4. **Result**: Return typed value (Scalar, Vector, or Matrix) + +The query engine interacts with storage through the `Querier` interface: + +```go +type Querier interface { + Select(ctx context.Context, sortSeries bool, hints *SelectHints, + matchers ...*labels.Matcher) SeriesSet + LabelValues(ctx context.Context, name string, ...) ([]string, error) + LabelNames(ctx context.Context, ...) ([]string, error) + Close() error +} +``` + +--- + +## Automatic Enrichment + +### How It Works + +When the query engine evaluates a VectorSelector or MatrixSelector, it automatically enriches each series with labels from correlated entities. 
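
The enrichment step can be sketched in Go as follows. This is a minimal illustration, not the proposed implementation: the `Entity` struct, the `correlated` and `enrich` helpers, and the rule that existing series labels are never overwritten are all assumptions made for the example. It also matches entities by comparing identifying labels directly, whereas the storage design uses a correlation index.

```go
package main

import (
	"fmt"
	"sort"
)

// Entity is a hypothetical, simplified stand-in for a stored entity.
type Entity struct {
	Type        string
	Identifying map[string]string
	Descriptive map[string]string
}

// correlated reports whether every identifying label of e is present on the
// series with the same value. Entities without identifying labels never match.
func correlated(series map[string]string, e Entity) bool {
	for k, v := range e.Identifying {
		if series[k] != v {
			return false
		}
	}
	return len(e.Identifying) > 0
}

// enrich returns a copy of the series labels with the descriptive labels of
// every correlated entity added. Existing series labels are never overwritten
// (an assumption of this sketch).
func enrich(series map[string]string, entities []Entity) map[string]string {
	out := make(map[string]string, len(series))
	for k, v := range series {
		out[k] = v
	}
	for _, e := range entities {
		if !correlated(series, e) {
			continue
		}
		for k, v := range e.Descriptive {
			if _, exists := out[k]; !exists {
				out[k] = v
			}
		}
	}
	return out
}

func main() {
	series := map[string]string{
		"container":          "nginx",
		"k8s.namespace.name": "production",
		"k8s.pod.uid":        "abc-123",
		"k8s.node.uid":       "node-001",
	}
	entities := []Entity{
		{
			Type:        "k8s.pod",
			Identifying: map[string]string{"k8s.namespace.name": "production", "k8s.pod.uid": "abc-123"},
			Descriptive: map[string]string{"k8s.pod.name": "nginx-7b9f5"},
		},
		{
			Type:        "k8s.node",
			Identifying: map[string]string{"k8s.node.uid": "node-001"},
			Descriptive: map[string]string{"k8s.node.name": "worker-1"},
		},
	}
	enriched := enrich(series, entities)
	names := make([]string, 0, len(enriched))
	for n := range enriched {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Printf("%s=%q\n", n, enriched[n])
	}
}
```

The input series is copied rather than mutated, so the stored labels stay untouched and enrichment remains a query-time concern.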
+ +**Query:** +```promql +container_cpu_usage_seconds_total{k8s.namespace.name="production"} +``` + +**Before enrichment (raw series from storage):** +``` +container_cpu_usage_seconds_total{ + container="nginx", + k8s.namespace.name="production", + k8s.pod.uid="abc-123", + k8s.node.uid="node-001" +} 1234.5 +``` + +**After enrichment (returned to user):** +``` +container_cpu_usage_seconds_total{ + # Original metric labels + container="nginx", + + # Identifying labels (correlation keys, already on series) + k8s.namespace.name="production", + k8s.pod.uid="abc-123", + k8s.node.uid="node-001", + + # Descriptive labels from k8s.pod entity + k8s.pod.name="nginx-7b9f5", + k8s.pod.status.phase="Running", + k8s.pod.start_time="2024-01-15T10:30:00Z", + + # Descriptive labels from k8s.node entity + k8s.node.name="worker-1", + k8s.node.os="linux", + k8s.node.kernel.version="5.15.0" +} 1234.5 +``` + +--- + +## Filtering by Entity Labels + +Since entity labels appear as labels in query results, standard PromQL label matchers work: + +### By Identifying Labels + +```promql +# Filter by pod UID (identifying) +container_cpu_usage_seconds_total{k8s.pod.uid="abc-123"} +``` + +This is efficient because identifying labels are stored on the series and indexed. + +### By Descriptive Labels + +```promql +# Filter by pod name (descriptive) +container_cpu_usage_seconds_total{k8s.pod.name="nginx-7b9f5"} + +# Filter by node OS (descriptive) +container_memory_usage_bytes{k8s.node.os="linux"} + +# Regex matching on descriptive labels +http_requests_total{service.version=~"2\\..*"} +``` + +**Query Execution for Descriptive Label Filters:** + +1. Select all series that might match (based on metric name and any indexed labels) +2. For each series, look up correlated entities +3. Get descriptive labels at evaluation timestamp +4. 
Apply the filter: keep series where enriched labels match + +## Aggregation by Entity Labels + +Standard PromQL aggregation works with entity labels: + +```promql +# Sum CPU by node name (descriptive label) +sum by (k8s.node.name) (container_cpu_usage_seconds_total) + +# Average memory by service version +avg by (service.version) (process_resident_memory_bytes) + +# Count requests by pod status +count by (k8s.pod.status.phase) (rate(http_requests_total[5m])) +``` + +### Aggregation Semantics + +Aggregation happens **after** enrichment: + +``` +1. Select series matching the selector +2. Enrich each series with entity labels +3. Group by the specified labels (which may include entity labels) +4. Apply aggregation function +``` + +**Example:** + +```promql +sum by (k8s.node.name) (container_cpu_usage_seconds_total) +``` + +``` +Step 1 - Select series: + container_cpu{pod_uid="a", node_uid="n1"} 10 + container_cpu{pod_uid="b", node_uid="n1"} 20 + container_cpu{pod_uid="c", node_uid="n2"} 30 + +Step 2 - Enrich with entity labels: + container_cpu{..., k8s.node.name="worker-1"} 10 + container_cpu{..., k8s.node.name="worker-1"} 20 + container_cpu{..., k8s.node.name="worker-2"} 30 + +Step 3 - Group by k8s.node.name: + Group "worker-1": [10, 20] + Group "worker-2": [30] + +Step 4 - Sum: + {k8s.node.name="worker-1"} 30 + {k8s.node.name="worker-2"} 30 +``` + +--- + +## Range Queries and Temporal Semantics + +### The Challenge + +Descriptive labels can change over time. When querying a range, which label values should be used? + +**Example scenario:** +- Pod `abc-123` runs on `worker-1` from T0 to T5 +- Pod migrates to `worker-2` at T5 +- Query: `container_cpu_usage_seconds_total{k8s.pod.uid="abc-123"}[10m]` + +### Solution: Point-in-Time Label Resolution + +Each sample is enriched with the descriptive labels **that were valid at that sample's timestamp**. 
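
One way to resolve labels at a sample's timestamp is a binary search over an entity's descriptive-label history. The sketch below assumes a hypothetical `snapshot` type holding the labels valid from a given timestamp until the next snapshot; neither the type nor the `labelsAt` helper is part of the proposal, and timestamps are plain integers standing in for T0, T1, and so on.

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot is a hypothetical record of an entity's descriptive labels,
// valid from Timestamp until the next snapshot.
type snapshot struct {
	Timestamp int64
	Labels    map[string]string
}

// labelsAt returns the descriptive labels valid at ts, or nil if ts precedes
// the first snapshot. snapshots must be sorted by Timestamp, ascending.
func labelsAt(snapshots []snapshot, ts int64) map[string]string {
	// Find the index of the first snapshot that starts strictly after ts.
	i := sort.Search(len(snapshots), func(i int) bool {
		return snapshots[i].Timestamp > ts
	})
	if i == 0 {
		return nil
	}
	return snapshots[i-1].Labels
}

func main() {
	// The pod runs on worker-1 from T0 and migrates to worker-2 at T5.
	history := []snapshot{
		{Timestamp: 0, Labels: map[string]string{"k8s.node.name": "worker-1"}},
		{Timestamp: 5, Labels: map[string]string{"k8s.node.name": "worker-2"}},
	}
	for ts := int64(3); ts <= 6; ts++ {
		fmt.Printf("T%d -> %s\n", ts, labelsAt(history, ts)["k8s.node.name"])
	}
}
```

With this history, samples at T3 and T4 resolve to `worker-1` and samples at T5 and T6 resolve to `worker-2`, which matches the migration scenario above.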
+ +```promql +container_cpu_usage_seconds_total{k8s.pod.uid="abc-123"}[10m] +``` + +**Returns:** +``` +# Samples before migration (T0-T4) have worker-1 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 100 @T0 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 110 @T1 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 120 @T2 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 130 @T3 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 140 @T4 + +# Samples after migration (T5+) have worker-2 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-2"} 150 @T5 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-2"} 160 @T6 +... +``` + +### Implications for Range Functions + +Functions like `rate()` operate on the raw sample values, but the returned instant vector has enriched labels: + +```promql +rate(container_cpu_usage_seconds_total{k8s.pod.uid="abc-123"}[5m]) +``` + +For rate calculation: +- Uses sample values regardless of label changes +- The result is enriched with labels **at the evaluation timestamp** + +### Series Identity Across Label Changes + +**Important:** Descriptive label changes do NOT create new series. The series identity is defined by: +- Metric name +- Original metric labels +- Entity identifying labels (correlation keys) + +Descriptive labels are metadata that "rides along" with samples, not part of series identity. + +--- + +## The Entity Type Filter Operator + +Automatic enrichment means entity labels appear as labels in query results, so standard label matchers handle most filtering needs: + +```promql +# Filter by entity label - just use label matchers +container_cpu_usage_seconds_total{k8s.pod.name="nginx"} +container_cpu_usage_seconds_total{k8s.pod.status.phase="Running"} +``` + +However, there's one thing label matchers **cannot** do: filter by entity type existence. The pipe operator (`|`) fills this gap. 
+ +### Syntax + +```promql +vector_expr | entity_type_expr +``` + +Where `entity_type_expr` can be: +- A single entity type: `k8s.pod` +- Negated: `!k8s.pod` +- Combined with `and`: `k8s.pod and k8s.node` +- Combined with `or`: `k8s.pod or service` +- Grouped: `(k8s.pod and k8s.node) or service` + +### When to Use + +The pipe operator answers the question: **"Is this metric correlated with an entity of this type?"** + +```promql +# Metrics that ARE correlated with any pod entity +container_cpu_usage_seconds_total | k8s.pod + +# Metrics that ARE correlated with any node entity +container_memory_usage_bytes | k8s.node + +# Metrics that ARE correlated with any service entity +http_requests_total | service +``` + +### Negation with `!` + +Use `!` before an entity type to negate it: + +```promql +# Metrics NOT correlated with any pod +container_cpu_usage_seconds_total | !k8s.pod + +# Metrics NOT correlated with any service +http_requests_total | !service +``` + +### Combining Entity Type Filters + +Use `and`/`or` keywords to combine entity type filters: + +```promql +# Metrics correlated with BOTH a pod AND a node +container_cpu_usage_seconds_total | k8s.pod and k8s.node + +# Metrics correlated with a pod OR a service +container_cpu_usage_seconds_total | k8s.pod or service + +# Metrics correlated with a pod but NOT a node +container_cpu_usage_seconds_total | k8s.pod and !k8s.node +``` + +Operator precedence follows standard rules: `!` (not) binds tightest, then `and`, then `or`. Use parentheses for clarity: + +```promql +# Explicit grouping +container_cpu | (k8s.pod and k8s.node) or service +``` + +### All Metrics for an Entity Type + +To get all metrics correlated with a specific entity type, omit the metric selector: + +```promql +# All metrics correlated with any pod + | k8s.pod + +# Equivalent to: +{__name__=~".+"} | k8s.pod +``` + +This is useful for exploring what metrics are available for a given entity type. 
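
The boolean semantics of entity type expressions can be sketched as a small expression tree evaluated against the set of entity types correlated with each series. The `TypeExpr` interface, the `Type`, `Not`, `And`, and `Or` nodes, and the `filter` helper are hypothetical names chosen for this illustration and are not taken from the Prometheus parser.

```go
package main

import "fmt"

// TypeExpr is a hypothetical AST node for the right-hand side of the pipe
// operator. Eval receives the set of entity types correlated with one series.
type TypeExpr interface {
	Eval(types map[string]bool) bool
}

// Type matches when the series is correlated with an entity of that type.
type Type string

// Not, And and Or combine entity type expressions.
type Not struct{ X TypeExpr }
type And struct{ L, R TypeExpr }
type Or struct{ L, R TypeExpr }

func (t Type) Eval(types map[string]bool) bool { return types[string(t)] }
func (n Not) Eval(types map[string]bool) bool  { return !n.X.Eval(types) }
func (a And) Eval(types map[string]bool) bool  { return a.L.Eval(types) && a.R.Eval(types) }
func (o Or) Eval(types map[string]bool) bool   { return o.L.Eval(types) || o.R.Eval(types) }

// Series pairs a series name with the entity types it is correlated with.
type Series struct {
	Name  string
	Types map[string]bool
}

// filter keeps the series for which expr evaluates to true.
func filter(in []Series, expr TypeExpr) []Series {
	var out []Series
	for _, s := range in {
		if expr.Eval(s.Types) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	series := []Series{
		{Name: "container_cpu_usage_seconds_total", Types: map[string]bool{"k8s.pod": true, "k8s.node": true}},
		{Name: "http_requests_total", Types: map[string]bool{"service": true}},
		{Name: "up", Types: map[string]bool{}},
	}
	// Equivalent to: (k8s.pod and k8s.node) or service
	expr := Or{L: And{L: Type("k8s.pod"), R: Type("k8s.node")}, R: Type("service")}
	for _, s := range filter(series, expr) {
		fmt.Println(s.Name)
	}
}
```

Precedence (`!` before `and` before `or`) is a parsing concern. Once the tree is built, evaluation is a plain recursive walk, so the filter costs one set lookup per entity type mentioned in the expression for each series.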
+ +### Combining with Label Matchers + +For label filtering, use label matchers (simpler and familiar). Use the pipe operator only when you need entity type filtering: + +```promql +# Filter by label: use label matcher +container_cpu_usage_seconds_total{k8s.pod.name="nginx"} + +# Filter by entity type existence: use pipe +container_cpu_usage_seconds_total | k8s.pod + +# Both: label matcher for label, pipe for type +container_cpu_usage_seconds_total{k8s.namespace.name="production"} | k8s.pod | k8s.node +``` + +--- + +## Query Engine Implementation + +### Extended Querier Interface + +```go +// EntityQuerier provides entity lookup capabilities +type EntityQuerier interface { + // Get entities correlated with a series + EntitiesForSeries(ref storage.SeriesRef) []EntityRef + + // Get entity by reference + GetEntity(ref EntityRef) Entity + + Close() error +} + +// Entity represents a single entity +type Entity interface { + Ref() EntityRef + Type() string + IdentifyingLabels() labels.Labels + DescriptiveLabelsAt(timestamp int64) labels.Labels + StartTime() int64 + EndTime() int64 +} +``` + +### Query Execution Flow + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ Query Execution Flow │ +└─────────────────────────────────────────────────────────────────────────┘ + + ┌─────────────────────────────┐ + │ PromQL String │ + │ │ + │ cpu | k8s.pod and k8s.node │ + └─────────────┬───────────────┘ + │ + ▼ + ┌─────────────────────────────┐ + │ Parser │ + │ │ + │ - VectorSelector │ + │ - EntityTypeFilter │◄── NEW + │ - EntityTypeExpr (and/or/!) │ + └─────────────┬───────────────┘ + │ + ▼ + ┌─────────────────────────────┐ + │ AST │ + │ │ + │ EntityTypeFilter { │ + │ Expr: cpu │ + │ TypeExpr: And { │ + │ Left: "k8s.pod" │ + │ Right: "k8s.node" │ + │ } │ + │ } │ + └─────────────┬───────────────┘ + │ + ▼ +┌────────────────────────────────────────────────────────────────────────┐ +│ Evaluator │ +│ │ +│ 1. 
Evaluate left side (VectorSelector) │ +│ - querier.Select() → SeriesSet │ +│ - Enrich with entity labels │ +│ - Result: enriched Vector │ +│ │ +│ 2. Evaluate EntityTypeFilter │ +│ - For each series, get correlated entity types │ +│ - Evaluate boolean expression against those types │ +│ - Keep series where expression evaluates to true │ +│ - Result: filtered Vector │ +│ │ +└────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────┐ + │ Result │ + │ │ + │ Vector/Matrix │ + └─────────────────────────────┘ +``` + +--- + +The next document will cover [Web UI and APIs](./07-web-ui-and-apis.md), detailing how these capabilities are exposed in Prometheus's user interface and HTTP APIs. diff --git a/proposals/0071-Entity/07-web-ui-and-apis.md b/proposals/0071-Entity/07-web-ui-and-apis.md new file mode 100644 index 00000000..ca05e962 --- /dev/null +++ b/proposals/0071-Entity/07-web-ui-and-apis.md @@ -0,0 +1,566 @@ +# Web UI and APIs + +## Abstract + +This document specifies how Prometheus's HTTP API and Web UI should be extended to support entity-aware querying. The key principle is **progressive disclosure**: query results display entity context prominently while keeping the interface familiar for users who don't need entity details. + +The wireframe below illustrates the concept—entity labels are displayed separately from metric labels, making it easy to understand the context of each time series. 
+ +![Wireframe showing query results with entity labels separated from metric labels](./wireframes/Wireframe%20-%20Simple%20idea%20-%20Complete%20flow.png) + +--- + +## Background + +### Current API Response Structure + +Today, the `/api/v1/query` endpoint returns results like: + +```json +{ + "status": "success", + "data": { + "resultType": "vector", + "result": [ + { + "metric": { + "__name__": "container_cpu_usage_seconds_total", + "container": "nginx", + "namespace": "production", + "pod": "nginx-7b9f5" + }, + "value": [1234567890, "1234.5"] + } + ] + } +} +``` + +All labels are in a flat `metric` object. There's no distinction between: +- Labels that identify the metric itself (e.g., `container`, `method`) +- Labels that identify the entity producing the metric (e.g., `k8s.pod.uid`, `k8s.node.uid`) +- Labels that describe the entity (e.g., `k8s.pod.name`, `k8s.node.os`) + +### Current UI Display + +The Prometheus UI displays all labels together: + +``` +container_cpu_usage_seconds_total{container="nginx", namespace="production", pod="nginx-7b9f5", ...} +``` + +This becomes unwieldy when entity labels are added through enrichment—users see a long list of labels without understanding which provide entity context. 
+ +--- + +## API Changes + +### Query Response Enhancement + +The query endpoints (`/api/v1/query`, `/api/v1/query_range`) should return entity context alongside the metric: + +```json +{ + "status": "success", + "data": { + "resultType": "vector", + "result": [ + { + "metric": { + "__name__": "container_cpu_usage_seconds_total", + "container": "nginx" + }, + "entities": [ + { + "type": "k8s.pod", + "identifyingLabels": { + "k8s.namespace.name": "production", + "k8s.pod.uid": "abc-123" + }, + "descriptiveLabels": { + "k8s.pod.name": "nginx-7b9f5", + "k8s.pod.status.phase": "Running" + } + }, + { + "type": "k8s.node", + "identifyingLabels": { + "k8s.node.uid": "node-001" + }, + "descriptiveLabels": { + "k8s.node.name": "worker-1", + "k8s.node.os": "linux" + } + } + ], + "value": [1234567890, "1234.5"] + } + ] + } +} +``` + +**Key changes:** + +| Field | Description | +|-------|-------------| +| `metric` | Only the original metric labels (not entity labels) | +| `entities` | Array of correlated entities with their labels | +| `entities[].type` | Entity type (e.g., "k8s.pod", "service") | +| `entities[].identifyingLabels` | Immutable labels that identify the entity | +| `entities[].descriptiveLabels` | Mutable labels describing the entity | + +### Backward Compatibility + +For backward compatibility, a query parameter controls the response format: + +``` +GET /api/v1/query?query=...&entity_info=true +``` + +| Parameter | Behavior | +|-----------|----------| +| `entity_info=true` | Returns structured entity information | +| `entity_info=false` (default) | Returns flat labels (current behavior, entity labels merged in) | + +When `entity_info=false` (default), all entity labels appear in the `metric` object as they do today with automatic enrichment. This ensures existing tooling continues to work. 
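
A client that opts in to `entity_info=true` but still needs the flat label set can reconstruct it from the structured response. The following sketch assumes the JSON field names shown in the example response above. The `Sample` and `EntityContext` Go types and the `flatten` helper are hypothetical, and letting metric labels win on a name collision is an assumption, since the proposal does not specify collision handling here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EntityContext mirrors one element of the proposed "entities" array.
type EntityContext struct {
	Type              string            `json:"type"`
	IdentifyingLabels map[string]string `json:"identifyingLabels"`
	DescriptiveLabels map[string]string `json:"descriptiveLabels"`
}

// Sample mirrors the label-related fields of one "result" element when
// entity_info=true. The sample value is omitted for brevity.
type Sample struct {
	Metric   map[string]string `json:"metric"`
	Entities []EntityContext   `json:"entities"`
}

// flatten merges metric labels and all entity labels into a single label set,
// the shape returned when entity_info=false.
func flatten(s Sample) map[string]string {
	out := make(map[string]string)
	for _, e := range s.Entities {
		for k, v := range e.IdentifyingLabels {
			out[k] = v
		}
		for k, v := range e.DescriptiveLabels {
			out[k] = v
		}
	}
	// Metric labels are applied last so they win on a name collision
	// (an assumption of this sketch).
	for k, v := range s.Metric {
		out[k] = v
	}
	return out
}

func main() {
	raw := `{
	  "metric": {"__name__": "container_cpu_usage_seconds_total", "container": "nginx"},
	  "entities": [{
	    "type": "k8s.pod",
	    "identifyingLabels": {"k8s.namespace.name": "production", "k8s.pod.uid": "abc-123"},
	    "descriptiveLabels": {"k8s.pod.name": "nginx-7b9f5", "k8s.pod.status.phase": "Running"}
	  }]
	}`
	var s Sample
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		panic(err)
	}
	flat := flatten(s)
	fmt.Println(len(flat), flat["k8s.pod.name"])
}
```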
+ +### Response Type Definitions + +```typescript +// Enhanced query result with entity context +interface EnhancedInstantSample { + metric: Record<string, string>; // Original metric labels only + entities?: EntityContext[]; // Correlated entities (if entity_info=true) + value?: [number, string]; + histogram?: [number, Histogram]; +} + +interface EntityContext { + type: string; // e.g., "k8s.pod" + identifyingLabels: Record<string, string>; + descriptiveLabels: Record<string, string>; +} + +// When entity_info=false (default), use existing format +interface LegacyInstantSample { + metric: Record<string, string>; // All labels merged (metric + entity labels) + value?: [number, string]; + histogram?: [number, Histogram]; +} +``` + +--- + +## New Entity Endpoints + +### List Entity Types + +``` +GET /api/v1/entities/types +``` + +Returns all known entity types in the system: + +```json +{ + "status": "success", + "data": [ + { + "type": "k8s.pod", + "identifyingLabels": ["k8s.namespace.name", "k8s.pod.uid"], + "count": 1523 + }, + { + "type": "k8s.node", + "identifyingLabels": ["k8s.node.uid"], + "count": 12 + }, + { + "type": "service", + "identifyingLabels": ["service.namespace", "service.name", "service.instance.id"], + "count": 89 + } + ] +} +``` + +### Get Entity Type Schema + +``` +GET /api/v1/entities/types/{type} +``` + +Returns detailed schema for an entity type: + +```json +{ + "status": "success", + "data": { + "type": "k8s.pod", + "identifyingLabels": ["k8s.namespace.name", "k8s.pod.uid"], + "knownDescriptiveLabels": [ + "k8s.pod.name", + "k8s.pod.status.phase", + "k8s.pod.start_time", + "k8s.pod.ip", + "k8s.pod.owner.kind", + "k8s.pod.owner.name" + ], + "activeEntityCount": 1523, + "correlatedSeriesCount": 45230 + } +} +``` + +### List Entities + +``` +GET /api/v1/entities?type=k8s.pod&match[]={k8s.namespace.name="production"} +``` + +Returns entities matching the criteria: + +```json +{ + "status": "success", + "data": [ + { + "type": "k8s.pod", + "identifyingLabels": { + "k8s.namespace.name": "production", 
"k8s.pod.uid": "abc-123" + }, + "descriptiveLabels": { + "k8s.pod.name": "nginx-7b9f5", + "k8s.pod.status.phase": "Running" + }, + "startTime": 1700000000, + "endTime": 0, + "correlatedSeriesCount": 42 + } + ] +} +``` + +**Query parameters:** + +| Parameter | Description | +|-----------|-------------| +| `type` | Entity type to query (required) | +| `match[]` | Label matchers for filtering entity labels (can specify multiple) | +| `start` | Start of time range (for historical queries) | +| `end` | End of time range | +| `limit` | Maximum entities to return | + +### Get Entity Details + +``` +GET /api/v1/entities/{type}/{encoded_identifying_attrs} +``` + +The identifying labels are URL-encoded as a label set: + +``` +GET /api/v1/entities/k8s.pod/k8s.namespace.name%3D%22production%22%2Ck8s.pod.uid%3D%22abc-123%22 +``` + +Returns detailed information about a specific entity: + +```json +{ + "status": "success", + "data": { + "type": "k8s.pod", + "identifyingLabels": { + "k8s.namespace.name": "production", + "k8s.pod.uid": "abc-123" + }, + "descriptiveLabels": { + "k8s.pod.name": "nginx-7b9f5", + "k8s.pod.status.phase": "Running" + }, + "startTime": 1700000000, + "endTime": 0, + "descriptiveHistory": [ + { + "timestamp": 1700000000, + "labels": { + "k8s.pod.name": "nginx-7b9f5", + "k8s.pod.status.phase": "Pending" + } + }, + { + "timestamp": 1700000030, + "labels": { + "k8s.pod.name": "nginx-7b9f5", + "k8s.pod.status.phase": "Running" + } + } + ], + "correlatedSeries": [ + "container_cpu_usage_seconds_total", + "container_memory_usage_bytes", + "container_network_receive_bytes_total" + ] + } +} +``` + +### Get Correlated Metrics for Entity + +``` +GET /api/v1/entities/{type}/{encoded_identifying_attrs}/metrics +``` + +Returns all metric names correlated with a specific entity: + +```json +{ + "status": "success", + "data": [ + { + "name": "container_cpu_usage_seconds_total", + "seriesCount": 3, + "labels": ["container"] + }, + { + "name": 
"container_memory_usage_bytes", + "seriesCount": 3, + "labels": ["container"] + } + ] +} +``` + +--- + +## Web UI Changes + +### Query Results Display + +Based on the wireframe concept, query results should display entity context prominently but separately from metric labels. + +**Current display:** +``` +container_cpu_usage_seconds_total{container="nginx", k8s.namespace.name="production", k8s.pod.uid="abc-123", k8s.pod.name="nginx-7b9f5", k8s.node.uid="node-001", k8s.node.name="worker-1", ...} 1234.5 +``` + +**Enhanced display:** + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ container_cpu_usage_seconds_total{container="nginx"} 1234.5 │ +│ │ +│ Entities: │ +│ k8s.pod │ +│ k8s.namespace.name="production", k8s.pod.uid="abc-123" │ +│ k8s.pod.name="nginx-7b9f5", k8s.pod.status.phase="Running" │ +│ │ +│ k8s.node │ +│ k8s.node.uid="node-001" │ +│ k8s.node.name="worker-1", k8s.node.os="linux" │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### UI Components + +**1. SeriesName Enhancement** + +The `SeriesName` component should accept entity context: + +```typescript +interface SeriesNameProps { + labels: Record<string, string>; + entities?: EntityContext[]; + format: boolean; + showEntities?: boolean; // Toggle entity display +} +``` + +**2. EntityBadge Component** + +A new component for displaying entity information: + +```typescript +interface EntityBadgeProps { + entity: EntityContext; + expanded?: boolean; + onToggle?: () => void; +} +``` + +Displays entity type with expandable labels: + +``` +┌─────────────────────────────────────────────┐ +│ 📦 k8s.pod [▼] │ +│ k8s.namespace.name="production" │ +│ k8s.pod.uid="abc-123" │ +│ ───────────────────────────── │ +│ k8s.pod.name="nginx-7b9f5" │ +│ k8s.pod.status.phase="Running" │ +└─────────────────────────────────────────────┘ +``` + +**3. 
Collapsible Entity Section** + +For tables with many results, entities can be collapsed by default: + +```typescript +interface DataTableProps { + data: InstantQueryResult; + showEntities: boolean; + entityDisplayMode: 'collapsed' | 'expanded' | 'inline'; +} +``` + +### New Pages + +**1. Entity Explorer Page** + +A dedicated page for browsing entities: + +``` +/entities + ├── List all entity types + ├── Filter by type + ├── Search by labels + └── Click to see entity details + +/entities/{type} + ├── List all entities of type + ├── Filter by identifying/descriptive labels + └── Click to see entity details + +/entities/{type}/{id} + ├── Entity details + ├── Label history timeline + ├── Correlated metrics list + └── Quick query links +``` + +**2. Entity Type Schema Page** + +Shows the schema for an entity type: + +``` +/entities/types/{type} + ├── Identifying labels list + ├── Known descriptive labels + ├── Entity count statistics + └── Related entity types +``` + +### Graph View Integration + +When viewing graphs, entity context can be shown on hover: + +``` +┌───────────────────────────────────────────────────────────────┐ +│ Graph │ +│ ╱╲ ╱╲ │ +│ ╱ ╲ ╱ ╲ ╱╲ │ +│ ╱ ╲╱ ╲ ╱ ╲ │ +│ ╱ ╲╱ ╲ │ +│ ╱ ╲ │ +├───────────────────────────────────────────────────────────────┤ +│ Hovering: container_cpu_usage_seconds_total{container="nginx"}│ +│ │ +│ 📦 k8s.pod: nginx-7b9f5 (production) │ +│ 🖥️ k8s.node: worker-1 │ +│ │ +│ Value: 1234.5 @ 2024-01-15 10:30:00 │ +└───────────────────────────────────────────────────────────────┘ +``` + +### Settings + +New user preferences for entity display: + +```typescript +interface EntityDisplaySettings { + // Show entity information in query results + showEntitiesInResults: boolean; + + // Default display mode + entityDisplayMode: 'collapsed' | 'expanded' | 'inline'; + + // Show identifying vs descriptive separation + separateIdentifyingLabels: boolean; + + // Entity types to always show + pinnedEntityTypes: string[]; +} +``` + +--- + +## 
Implementation Considerations + +### API Response Size + +Adding entity context increases response size. Mitigations: + +1. **Optional via query parameter**: `entity_info=true` to opt-in +2. **Compression**: gzip reduces impact significantly +3. **Pagination**: Limit results and paginate large responses +4. **Streaming**: Consider streaming for very large result sets + +### Frontend Performance + +With potentially many entities per series: + +- Lazy load entity details on expand +- Virtualize long lists +- Use `entity_info=false` for performance-critical views +- Progressive loading for entity explorer + +--- + +## API Summary + +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/api/v1/query` | GET/POST | Query with optional `entity_info=true` | +| `/api/v1/query_range` | GET/POST | Range query with optional `entity_info=true` | +| `/api/v1/entities/types` | GET | List all entity types | +| `/api/v1/entities/types/{type}` | GET | Get entity type schema | +| `/api/v1/entities` | GET | List entities with filters | +| `/api/v1/entities/{type}/{id}` | GET | Get specific entity details | +| `/api/v1/entities/{type}/{id}/metrics` | GET | Get metrics for entity | + +--- + +## UI Summary + +| Feature | Description | +|---------|-------------| +| Enhanced SeriesName | Shows entities separately from labels | +| EntityBadge | Compact entity display with expand | +| Entity Explorer | Browse and search entities | +| Graph hover | Shows entity context on hover | +| Settings | Control entity display preferences | + +--- + +## Migration Path + +**Phase 1: API additions** +- Add `entity_info` parameter (default false) +- Add new `/api/v1/entities/*` endpoints +- Existing behavior unchanged + +**Phase 2: UI enhancements** +- Add EntityBadge component +- Enhance SeriesName with entity support +- Add Entity Explorer page + +**Phase 3: Default behavior** +- Consider making `entity_info=true` the default +- Deprecation warnings for flat-label-only usage + 
+--- + +*This proposal is a work in progress. Feedback on API design and UI mockups is welcome.* + + diff --git a/proposals/0071-Entity/08-alerting.md b/proposals/0071-Entity/08-alerting.md new file mode 100644 index 00000000..84e2c8eb --- /dev/null +++ b/proposals/0071-Entity/08-alerting.md @@ -0,0 +1,503 @@ +# Alerting: Entity-Aware Alert Evaluation + +## Abstract + +This document specifies how Prometheus alerting rules and Alertmanager interact with the Entity concept introduced in [01-context.md](./01-context.md). The central challenge is ensuring that alerts remain stable when entity descriptive labels change—a pod migrating between nodes or a service being upgraded should not cause alerts to "flap" (appearing to resolve and re-fire). + +We introduce the concept of **Alert Identity**—a stable identifier for an alert that persists even when some labels change. This builds on the existing Fingerprint mechanism but distinguishes between labels that define identity versus labels that provide context. The key insight is that labels explicitly used in an alert expression signal user intent and should contribute to identity, while labels added purely through automatic enrichment (as described in [06-querying.md](./06-querying.md)) are contextual metadata. + +This document explores the implications for both Prometheus and Alertmanager, including trade-offs and the need for Alertmanager API changes to fully realize stable alert identity across the pipeline. + +--- + +## Background + +### Alert Fingerprint + +In current Prometheus, each alert has a **Fingerprint**—a hash computed from all of its labels: + +```go +// rules/alerting.go - current implementation +type Alert struct { + State AlertState + Labels labels.Labels + Annotations labels.Labels + Value float64 + // ... timestamps ... 
+} + +func (a *Alert) Fingerprint() model.Fingerprint { + return a.Labels.Fingerprint() // Hash of ALL labels +} +``` + +The fingerprint determines: +- **State tracking:** Which alerts are currently active (the `active` map in `AlertingRule`) +- **`for` clause:** Whether an alert has been pending long enough to fire +- **Deduplication:** Whether to send an alert again or skip it + +This works well today because labels are stable—they come from the metric's own labels, labels added by the rule configuration, and external labels. When any label changes, it's intentionally a different alert: `{instance="server-1"}` and `{instance="server-2"}` are distinct alerts tracking distinct issues. + +### The Challenge: Enriched Labels Change + +With entity support, query results are automatically enriched with entity labels as described in [06-querying.md](./06-querying.md). This enrichment includes **descriptive labels** that can change during an entity's lifetime: + +- A pod's `k8s.pod.status.phase` changes from `Pending` to `Running` +- A service's `service.version` changes during deployment +- A node's `k8s.node.name` could theoretically change + +If the existing fingerprint mechanism includes all enriched labels, alerts would "flap"—appearing to resolve and re-fire whenever a descriptive label changes, even though the underlying condition persists. + +**Example of the problem:** + +```yaml +alert: HighCPU +expr: container_cpu_usage_seconds_total > 0.9 +for: 5m +``` + +1. T0: Alert becomes Pending with labels `{pod_uid="abc", k8s.node.name="worker-1"}` +2. T3: Pod migrates, `k8s.node.name` changes to `worker-2` +3. With naive fingerprinting, Prometheus sees: + - Alert `{..., k8s.node.name="worker-1"}` disappeared (resolved?) + - Alert `{..., k8s.node.name="worker-2"}` appeared (new!) + - The `for: 5m` timer resets + +This defeats the purpose of the `for` clause and creates confusing behavior. 
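
The flapping effect can be reproduced with a toy fingerprint. The `fingerprint` function below is a simplified stand-in written for this illustration (an FNV-1a hash over sorted label names and values). It is not Prometheus's actual fingerprint algorithm, and the `alertname` label is included only to make the label sets look like real alerts.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint hashes every label name and value in sorted name order.
// It is a simplified stand-in, not Prometheus's actual algorithm.
func fingerprint(lbls map[string]string) uint64 {
	names := make([]string, 0, len(lbls))
	for n := range lbls {
		names = append(names, n)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, n := range names {
		h.Write([]byte(n))
		h.Write([]byte{0xff}) // separator, so "a"+"bc" and "ab"+"c" hash differently
		h.Write([]byte(lbls[n]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	before := map[string]string{"alertname": "HighCPU", "pod_uid": "abc", "k8s.node.name": "worker-1"}
	after := map[string]string{"alertname": "HighCPU", "pod_uid": "abc", "k8s.node.name": "worker-2"}

	fmt.Printf("T0: %016x\n", fingerprint(before))
	fmt.Printf("T3: %016x\n", fingerprint(after))
	fmt.Println("same alert under naive fingerprinting:", fingerprint(before) == fingerprint(after))
}
```

Because the descriptive label participates in the hash, the pod migration yields a different fingerprint (barring a hash collision), so state keyed by that fingerprint, including the `for` timer, starts over.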
+ +--- + +## Introducing Alert Identity + +### Why We Need a New Concept + +The existing Fingerprint mechanism served Prometheus well because label stability was assumed. With entity enrichment, we need a more nuanced concept: **Alert Identity**. + +Alert Identity answers the question: "Is this the same alert as before, or a different one?" While Fingerprint simply hashes all labels, Alert Identity considers which labels are semantically significant for distinguishing alerts. + +The term "Alert Identity" is new to Prometheus—it doesn't exist in the current codebase. We introduce it here to describe the stable identifier we need, which will be implemented as a modified fingerprint computation that excludes certain labels. + +### Identifying Labels vs. Descriptive Labels in Alert Identity + +As established in [01-context.md](./01-context.md), entity labels fall into two categories: **identifying labels** (which uniquely identify an entity, like `k8s.pod.uid`) and **descriptive labels** (which provide additional context that may change, like `k8s.node.name`). + +For Alert Identity, we treat these categories differently: + +- **Entity identifying labels are always part of alert identity.** These labels uniquely identify the entity producing the alert. If two alerts have different `k8s.pod.uid` values, they're fundamentally about different pods and should be distinct alerts. + +- **Entity descriptive labels are only part of identity if explicitly used in the expression.** This is where user intent matters. + +Consider this rule: + +```yaml +alert: PodHighMemory +expr: container_memory_usage_bytes > 1e9 +``` + +When evaluated, the query engine enriches results with entity labels including both `k8s.pod.uid` (identifying) and `k8s.node.name` (descriptive). The identifying label `k8s.pod.uid` is always part of identity—different pods are different alerts. But the descriptive label `k8s.node.name` is NOT part of identity here because the user didn't filter on it. 
If a pod migrates from worker-1 to worker-2, it remains the same alert (same pod UID). + +Now consider a rule that explicitly filters on a descriptive label: + +```yaml +alert: NodeHighCPU +expr: cpu_usage{k8s.node.name="worker-1"} > 80 +``` + +Here, the user explicitly filtered by `k8s.node.name="worker-1"`. They're saying: "I specifically care about worker-1." The descriptive label `k8s.node.name` becomes part of identity because the user declared it significant by including it in their expression. If they wrote another rule for worker-2, those would be separate alerts. + +This leads to our core principle: + +> **Labels explicitly used in the expression signal user intent and contribute to identity. Labels added purely through enrichment are context that doesn't affect identity.** + +### What Constitutes Identity + +Based on this principle, Alert Identity is computed from: + +1. **Metric labels** — Original labels on the time series +2. **Entity identifying labels** — Labels that uniquely identify an entity (e.g., `k8s.pod.uid`) +3. **Explicit descriptive labels** — Descriptive labels the user filtered on in the expression +4. **Rule-defined labels** — Labels added in the rule's `labels:` configuration +5. **External labels** — Prometheus-wide labels from configuration + +Labels **excluded** from identity: + +1. **Enriched descriptive labels** — Descriptive labels added by automatic enrichment that weren't explicitly referenced + +### The Alertmanager Challenge + +Here's where things get complicated. Prometheus computes Alert Identity internally, but **Alertmanager also computes its own fingerprint** from the labels it receives. 
If we send all labels (including enriched descriptive) to Alertmanager: + +``` +Prometheus Alertmanager +┌──────────────────┐ ┌──────────────────┐ +│ Identity = hash( │ │ Fingerprint = │ +│ metric_labels │ sends all │ hash(all │ +│ + identifying │ labels │ received │ +│ + explicit │ ──────────────► │ labels) │ +│ + rule_labels │ │ │ +│ ) │ │ │ +│ │ │ If labels change,│ +│ ✓ Stable │ │ fingerprint │ +│ │ │ changes! │ +└──────────────────┘ └──────────────────┘ +``` + +If descriptive labels change between alert sends: +- Prometheus sees the same alert (stable identity) +- Alertmanager sees a "new" alert (different fingerprint) +- Alertmanager might send duplicate notifications +- Alert might move between groups +- Silences might stop matching + +This is a real problem that we must address explicitly. + +--- + +## Design Options + +We have three main approaches to handle this challenge: + +### Option A: Identity is Prometheus-Internal Only + +Prometheus uses Alert Identity internally for state tracking and the `for` clause. When sending to Alertmanager, it sends all labels. Alertmanager's behavior with changing labels is documented but accepted. + +**Prometheus changes:** +- Compute identity from identity labels for internal state tracking +- Send all labels to Alertmanager + +**Alertmanager changes:** None + +**Trade-offs:** +- ✅ Simple—no Alertmanager API changes +- ✅ Prometheus internal state is stable +- ❌ Alertmanager may re-notify when descriptive labels change +- ❌ Groups may split/merge unexpectedly +- ❌ Silences by descriptive labels may break + +**Mitigation:** Document that users should group/silence by stable labels (identifying labels) for predictable behavior. + +### Option B: Alertmanager API Receives Identity Separately + +Extend the Alertmanager API to receive identity labels separately from all labels. 
+ +**Prometheus changes:** +- Compute identity labels +- Send both `identityLabels` and `labels` to Alertmanager + +**Alertmanager changes:** +- API accepts new `identityLabels` field +- Use `identityLabels` for fingerprinting and deduplication +- Use full `labels` for routing matchers and notification templates + +**Trade-offs:** +- ✅ Full stability across the pipeline +- ✅ Alertmanager can correctly deduplicate +- ❌ Requires API version bump +- ❌ Requires coordinated changes to both systems +- ❌ Breaking change for existing Alertmanager integrations + +### Option C: Only Send Identity Labels + +Prometheus only sends identity labels to Alertmanager. Enriched descriptive labels are either dropped or moved to annotations. + +**Trade-offs:** +- ✅ Simple Alertmanager, stable fingerprints +- ❌ Loses rich context in notification templates +- ❌ Awkward if users want to route by descriptive labels + +### Recommendation + +We recommend **Option B** for full correctness, with **Option A** as an acceptable intermediate step that doesn't require Alertmanager changes. + +Option A is sufficient for ensuring Prometheus's `for` clause works correctly. The Alertmanager "churn" is manageable if users follow best practices (group and silence by stable labels). Option B can be implemented later as an enhancement. + +The rest of this document assumes Option B as the target design, with notes on Option A where relevant. + +--- + +## Prometheus Implementation + +### How Alert Identity Is Computed + +When an alerting rule is created, Prometheus parses the expression AST to extract which labels are explicitly used in matchers. This set of "explicit labels" is stored with the rule. + +During alert evaluation, for each result from the query engine: + +1. 
**Classify each label** — Determine if it's an identity label or not: + - Metric name → always identity + - Entity identifying labels → always identity + - Labels explicitly used in the expression → identity (user signaled intent) + - Original metric labels (not from entity enrichment) → identity + - Enriched descriptive labels → NOT identity + +2. **Compute identity labels** — Subset of all labels that constitute identity + +3. **Track state by identity** — Use `IdentityLabels.Fingerprint()` for the `active` map and `for` clause timing + +4. **Send full labels** — When sending to Alertmanager, include all labels (identity + enriched) + +### The Alert Struct + +We rename the fields to make the distinction clear: + +```go +type Alert struct { + State AlertState + + // IdentityLabels are used for fingerprinting and state tracking. + // These labels are stable even when descriptive labels change. + IdentityLabels labels.Labels + + // Labels includes all labels: identity + enriched descriptive. + // This is what gets sent to Alertmanager for routing and templates. + Labels labels.Labels + + Annotations labels.Labels + Value float64 + + ActiveAt time.Time + FiredAt time.Time + ResolvedAt time.Time + LastSentAt time.Time + ValidUntil time.Time + KeepFiringSince time.Time +} + +// Fingerprint uses identity labels for stability +func (a *Alert) Fingerprint() model.Fingerprint { + return a.IdentityLabels.Fingerprint() +} +``` + +Note that `Labels` (full labels) is **not** redundantly storing identity labels—it's the complete set. We could optimize storage by only storing the "extra" descriptive labels and computing full labels on demand, but this complicates the code for minimal gain. + +--- + +## Alertmanager Changes + +### Option A: No Changes (Intermediate) + +If we proceed with Option A (Prometheus-internal identity only), Alertmanager receives alerts as today with `labels` containing all labels. 
Users must be aware: + +- Grouping by descriptive labels may cause groups to change over time +- Silences by descriptive labels may stop matching if labels change +- Notification deduplication may re-notify on label changes + +**Best practices for Option A:** +- Group by identifying labels: `group_by: [alertname, k8s.pod.uid]` not `k8s.pod.name` +- Silence by identifying labels for stability +- Accept that notifications may include different descriptive label values over time + +### Option B: API Extension (Target Design) + +Extend the Alertmanager API to accept identity label references. To avoid duplicating label strings, `identityLabelRefs` contains indices into the `labels` array: + +```json +// POST /api/v2/alerts - Extended payload +[ + { + "labels": [ + { "name": "alertname", "value": "HighCPU" }, + { "name": "k8s.pod.uid", "value": "abc-123" }, + { "name": "k8s.pod.name", "value": "nginx-7b9f5" }, + { "name": "k8s.node.name", "value": "worker-1" }, + { "name": "severity", "value": "warning" } + ], + "identityLabelRefs": [0, 1, 4], + "annotations": { ... }, + "startsAt": "2024-01-15T10:30:00Z", + "endsAt": "0001-01-01T00:00:00Z", + "generatorURL": "..." + } +] +``` + +Here, `identityLabelRefs: [0, 1, 4]` indicates that labels at positions 0 (`alertname`), 1 (`k8s.pod.uid`), and 4 (`severity`) constitute the alert's identity. Alertmanager reconstructs identity labels by indexing into the `labels` array. + +Alertmanager changes: +1. Accept `identityLabelRefs` field (optional for backward compatibility) +2. If present, construct identity labels from the referenced indices in `labels` +3. Use identity labels for fingerprinting and deduplication +4. Use full `labels` for routing matchers and notification templates +5. Groups are keyed by identity labels, not full labels + +This ensures the entire pipeline respects Alert Identity. 
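The index-based reconstruction described above can be sketched in a few lines of Go. This is a minimal illustration, not Alertmanager code: the `Label` type, `resolveIdentityLabels`, and `fingerprint` are hypothetical names, FNV-1a stands in for whatever fingerprint function Alertmanager would actually use, and treating an absent `identityLabelRefs` as "all labels are identity labels" is an assumption about the backward-compatible behavior.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Label mirrors one entry of the proposed "labels" array (hypothetical type).
type Label struct {
	Name  string
	Value string
}

// resolveIdentityLabels returns the labels selected by refs.
// Assumption: nil refs (field absent) means every label is an identity label.
func resolveIdentityLabels(all []Label, refs []int) ([]Label, error) {
	if refs == nil {
		return append([]Label(nil), all...), nil
	}
	seen := make(map[int]bool, len(refs))
	out := make([]Label, 0, len(refs))
	for _, r := range refs {
		if r < 0 || r >= len(all) {
			return nil, fmt.Errorf("identityLabelRefs: index %d out of range (have %d labels)", r, len(all))
		}
		if !seen[r] { // ignore duplicate references
			seen[r] = true
			out = append(out, all[r])
		}
	}
	return out, nil
}

// fingerprint hashes labels in name order. FNV-1a is a stand-in here,
// not Alertmanager's actual fingerprint algorithm.
func fingerprint(ls []Label) uint64 {
	sorted := append([]Label(nil), ls...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	h := fnv.New64a()
	for _, l := range sorted {
		h.Write([]byte(l.Name))
		h.Write([]byte{0xff}) // separator so name/value boundaries are unambiguous
		h.Write([]byte(l.Value))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	before := []Label{
		{"alertname", "HighCPU"}, {"k8s.pod.uid", "abc-123"},
		{"k8s.pod.name", "nginx-7b9f5"}, {"k8s.node.name", "worker-1"},
		{"severity", "warning"},
	}
	after := append([]Label(nil), before...)
	after[3].Value = "worker-2" // the pod migrated to another node

	refs := []int{0, 1, 4}
	idBefore, _ := resolveIdentityLabels(before, refs)
	idAfter, _ := resolveIdentityLabels(after, refs)
	fmt.Println("identity stable:", fingerprint(idBefore) == fingerprint(idAfter))
	fmt.Println("full labels stable:", fingerprint(before) == fingerprint(after))
}
```

Rejecting out-of-range indices rather than skipping them is a deliberate choice in this sketch: a payload whose references do not line up with its `labels` array most likely indicates a sender bug, and silently dropping a reference would change the alert's identity.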
+ +### Routing and Grouping + +With either option, routing matchers operate on full `labels`: + +```yaml +route: + group_by: [alertname, k8s.namespace.name] # Both are identity labels + routes: + - matchers: + - k8s.node.name=~"worker-.*" # Can match descriptive labels + receiver: node-team +``` + +With Option B, even though `k8s.node.name` changes, the alert stays in the same group because grouping uses identity labels internally. + +### Silencing and Inhibition + +Silences match against full `labels`: + +```yaml +matchers: + - k8s.pod.uid="abc-123" # Identifying - stable match + - k8s.node.name="worker-1" # Descriptive - may stop matching if pod migrates +``` + +With Option B, the silence correctly continues matching because the alert's identity hasn't changed, even if `k8s.node.name` changed. + +--- + +## Temporal Semantics + +### Which Label Values Are Sent? + +When Prometheus evaluates an alerting rule at time T, the query engine enriches results with descriptive labels as they exist at time T (see [06-querying.md](./06-querying.md) for details on point-in-time label resolution). These are the values included in `Labels` when sending to Alertmanager. + +If an alert persists across multiple evaluation cycles: +- T1: Labels include `{service.version="1.0.0"}` +- T2: Service upgrades +- T3: Labels include `{service.version="2.0.0"}` + +With stable Alert Identity, this is still the same alert. Notifications at T3 reflect the current state. + +### The `for` Clause + +The `for` clause requires an alert to be continuously active for a duration before firing: + +```yaml +alert: HighCPU +expr: cpu_usage > 0.9 +for: 5m +``` + +This works correctly because Prometheus tracks alerts by `IdentityLabels.Fingerprint()`. Descriptive label changes don't reset the timer: + +1. T0: Alert becomes Pending, identity `{instance="server-1", k8s.pod.uid="abc"}` +2. T1-T4: Entity's `k8s.node.name` changes multiple times +3. 
T5: Alert fires (5 minutes elapsed, same identity throughout) + +--- + +## Examples + +### Basic Alert with Enrichment + +```yaml +alert: PodHighMemory +expr: container_memory_usage_bytes > 1e9 +for: 2m +labels: + severity: warning +annotations: + summary: "Pod {{ $labels.k8s.pod.name }} high memory on {{ $labels.k8s.node.name }}" +``` + +**Identity labels:** `{__name__, container, k8s.pod.uid, alertname, severity}` + +**Full labels (sent to Alertmanager):** Identity + `k8s.pod.name`, `k8s.node.name`, `k8s.pod.status.phase`, etc. + +The annotation templates can reference enriched descriptive labels. + +### Alert with Explicit Descriptive Filter + +```yaml +alert: CriticalPodPending +expr: kube_pod_status_phase{k8s.pod.status.phase="Pending"} == 1 +for: 10m +labels: + severity: critical +``` + +**Identity labels:** `{__name__, k8s.pod.uid, k8s.pod.status.phase, alertname, severity}` + +Here `k8s.pod.status.phase` IS part of identity because the user explicitly filtered on it. This alert resolves when the pod transitions to `Running`. + +### Two Alerts Distinguished by Descriptive Labels + +```yaml +# Alert 1 +alert: WorkerOneHighCPU +expr: cpu_usage{k8s.node.name="worker-1"} > 80 + +# Alert 2 +alert: WorkerTwoHighCPU +expr: cpu_usage{k8s.node.name="worker-2"} > 80 +``` + +These have different `alertname` values, so they're distinct regardless of whether `k8s.node.name` is considered identity. But even with the same alert name, the explicit filter makes `k8s.node.name` part of identity for each rule. + +--- + +## Backward Compatibility + +For metrics without entity correlation: +- `explicitLabels` contains labels from the expression +- `entityStore.IsIdentifyingLabel()` and `IsDescriptiveLabel()` return false +- All labels are treated as identity labels +- Behavior matches current Prometheus exactly + +Existing alerting rules work unchanged. Entity-aware behavior only activates for metrics that have entity correlations. 
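The classification rules, including the fallback for metrics without entity correlation, can be sketched as follows. This is an illustration under stated assumptions rather than the proposed implementation: `EntityStore` is a hypothetical interface whose method names come from the text above but whose signatures are guessed, `staticStore` is a toy stand-in, and labels are modeled as a plain map instead of `labels.Labels`.

```go
package main

import "fmt"

// EntityStore is hypothetical. The method names appear in the proposal;
// the signatures are assumed for this sketch.
type EntityStore interface {
	IsIdentifyingLabel(name string) bool
	IsDescriptiveLabel(name string) bool
}

// staticStore is a toy EntityStore backed by fixed sets.
type staticStore struct {
	identifying map[string]bool
	descriptive map[string]bool
}

func (s staticStore) IsIdentifyingLabel(name string) bool { return s.identifying[name] }
func (s staticStore) IsDescriptiveLabel(name string) bool { return s.descriptive[name] }

// identityLabels keeps every label except descriptive entity labels that the
// rule expression did not reference explicitly. Metric labels, identifying
// labels, and explicit labels all pass through unchanged.
func identityLabels(all map[string]string, explicit map[string]bool, store EntityStore) map[string]string {
	out := make(map[string]string, len(all))
	for name, value := range all {
		if store.IsDescriptiveLabel(name) && !explicit[name] {
			continue // enriched descriptive label: context only
		}
		out[name] = value
	}
	return out
}

func main() {
	all := map[string]string{
		"__name__":      "container_memory_usage_bytes",
		"container":     "app",
		"k8s.pod.uid":   "abc-123",
		"k8s.node.name": "worker-1",
	}
	entityAware := staticStore{
		identifying: map[string]bool{"k8s.pod.uid": true},
		descriptive: map[string]bool{"k8s.node.name": true},
	}
	noEntities := staticStore{} // no correlation: both checks return false

	fmt.Println("enriched only:", len(identityLabels(all, nil, entityAware)))
	fmt.Println("explicit filter:", len(identityLabels(all, map[string]bool{"k8s.node.name": true}, entityAware)))
	fmt.Println("no entity correlation:", len(identityLabels(all, nil, noEntities)))
}
```

Only the descriptive check is needed to decide exclusion in this sketch; `IsIdentifyingLabel` is kept on the interface because the text names it. When the store knows nothing about a label, both checks return false and the label stays in the identity set, which is exactly the current fingerprint behavior.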
+ +--- + +## Open Questions + +### Migration Path for Alertmanager + +If we implement Option B, what's the migration path? +- New API version with `identityLabelRefs` field? +- Backward compatible: if `identityLabelRefs` is absent, use `labels`? +- How do we handle mixed Prometheus/Alertmanager versions during rollout? + +### Recording Rules + +If a recording rule aggregates entity-correlated metrics: + +```yaml +record: job:requests:rate5m +expr: sum by (job) (rate(http_requests_total[5m])) +``` + +The recorded metric loses entity correlation (aggregated away). Alerting on this metric behaves as today (no entity enrichment). Is this acceptable, or should we track "derived" correlations? + +### Relabeling Entity Labels + +Should alert relabeling be able to manipulate entity labels? + +```yaml +alerting: + alert_relabel_configs: + - regex: k8s\.node\.name + action: labeldrop +``` + +Relabeling operates on the full `labels` set, so this rule removes the descriptive label before the alert is sent. Should there be restrictions or warnings when dropping identity labels? + +--- + +## Summary + +Entity-aware alerting introduces Alert Identity as a concept built on top of the existing Fingerprint mechanism. The core principle is that **explicit labels signal user intent**, while **enriched labels provide context**. + +| Component | Change | +|-----------|--------| +| Prometheus alerting rules | Track explicit labels, compute identity separately | +| Prometheus `Alert` struct | Split into `IdentityLabels` and `Labels` | +| Alertmanager (Option A) | None—document behavioral implications | +| Alertmanager (Option B) | Accept `identityLabelRefs` for fingerprinting | + +The key insight: **"If you mentioned it, you meant it."** Labels in the expression contribute to identity. Labels from enrichment provide context without affecting identity.
+ +--- + +## Related Documents + +- [01-context.md](./01-context.md) — Problem statement and entity concept +- [05-storage.md](./05-storage.md) — How entities and correlations are stored +- [06-querying.md](./06-querying.md) — Entity-aware PromQL and automatic enrichment +- [07-web-ui-and-apis.md](./07-web-ui-and-apis.md) — UI and API exposure of alerts diff --git a/proposals/0071-Entity/99-alternatives.md b/proposals/0071-Entity/99-alternatives.md new file mode 100644 index 00000000..d41cc124 --- /dev/null +++ b/proposals/0071-Entity/99-alternatives.md @@ -0,0 +1,52 @@ +# Alternatives Considered + +This document captures alternative approaches that were evaluated during the design of native Entity support in Prometheus. For each alternative, we describe what was considered, why it was appealing, and ultimately why it was not chosen. + +The goal is to preserve institutional knowledge about design decisions and help future contributors understand the reasoning behind the current proposal. + +--- + +## Exposition Formats + +*See [Exposition Formats](./02-exposition-formats.md) for the chosen approach.* + +### Alternative: Introduce a New "Entity" Concept for OpenMetrics-text + +#### Description + +Instead of extending info metrics, introduce a completely new "Entity" concept with dedicated syntax: + +``` +# ENTITY_TYPE k8s.pod +# ENTITY_IDENTIFYING namespace pod_uid +k8s.pod{namespace="default",pod_uid="abc-123",pod="nginx"} + +--- + +# TYPE container_cpu_usage_seconds_total counter +container_cpu_usage_seconds_total{namespace="default",pod_uid="abc-123",container="app"} 1234.5 +``` + +Key differences from the chosen approach: +- New `# ENTITY_TYPE` declaration instead of `# TYPE ...
info` +- New `# ENTITY_IDENTIFYING` declaration instead of `# IDENTIFYING_LABELS` +- Entity instances have **no value** (no `1` placeholder) +- Entity type is explicit in the declaration, not derived from the metric name + +#### Motivation + +- **Semantic clarity**: Entities truly aren't metrics—they don't have values because they represent the *producers* of telemetry, not telemetry itself +- **Cleaner data model**: No meaningless `1` value wasting storage +- **Better alignment with OpenTelemetry**: OTel's Entity model treats entities as first-class objects, not metrics +- **No breaking change for existing info metrics**: Since this introduces completely new syntax, existing applications exposing info metrics in any order would continue to work unchanged. The new Entity syntax would be opt-in for applications that want correlation features. + +#### Concerns / Reasons for Rejection + +- **Cognitive load**: Users must learn a new concept ("Entity") rather than building on the familiar info metric pattern they already understand +- **Larger syntax change**: Three new declarations vs. one (`# ENTITY_TYPE`, `# ENTITY_IDENTIFYING`, value-less lines vs. just `# IDENTIFYING_LABELS`) +- **Community familiarity**: The `*_info` metric pattern is well-established across the ecosystem (kube-state-metrics, node_exporter, OTel SDK). Extending it is less disruptive than replacing it. +- **Incremental evolution**: Prometheus has historically evolved through incremental changes rather than whole new concepts + +The extended info metrics approach achieves the same functional goals (identifying vs. descriptive labels, automatic enrichment, correlation) while requiring less conceptual overhead. 
+ +--- \ No newline at end of file diff --git a/proposals/0071-Entity/wireframes/Wireframe - Simple idea - Complete flow.png b/proposals/0071-Entity/wireframes/Wireframe - Simple idea - Complete flow.png new file mode 100644 index 00000000..2496463e Binary files /dev/null and b/proposals/0071-Entity/wireframes/Wireframe - Simple idea - Complete flow.png differ