web/api: add fuzzy metadata search endpoints#18293
Add search endpoints for metric names, label names, and label values, backed by new storage search interfaces and TSDB implementations.

The search API supports substring, subsequence, and Jaro-Winkler-based matching, optional relevance sorting, and NDJSON streaming responses. This also wires the new endpoints into the query UI and adds backend and utility tests.

Known limitation: frequency/cardinality enrichment currently relies on per-result query-time scans, so those paths may need follow-up work before they scale well.

Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
216e79a to 9ae9c1b
| - `match[]=<series_selector>`: Repeated series selector used to scope the | ||
| search. Optional. | ||
| - `search=<string>`: Search string matched against names or values. Optional. |
In the latest version of the proposal, the search accepts multiple search strings, which are combined with OR.
We could also let the request specify an AND or OR operator to control how multiple search values are combined.
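To make the AND/OR idea concrete, here is a minimal sketch of combining several search terms under a selectable operator. It assumes plain substring matching; `matchMode`, `matchAny`, `matchAll`, and `matchTerms` are hypothetical names, not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// matchMode is a hypothetical operator selecting how multiple terms combine.
type matchMode int

const (
	matchAny matchMode = iota // OR: at least one term must match
	matchAll                  // AND: every term must match
)

// matchTerms reports whether value contains the search terms under the given
// mode. An empty term list accepts every value.
func matchTerms(value string, terms []string, mode matchMode) bool {
	if len(terms) == 0 {
		return true
	}
	for _, t := range terms {
		hit := strings.Contains(value, t)
		if mode == matchAny && hit {
			return true // OR short-circuits on the first hit
		}
		if mode == matchAll && !hit {
			return false // AND short-circuits on the first miss
		}
	}
	// Reaching this point means no hit (OR) or no miss (AND).
	return mode == matchAll
}

func main() {
	terms := []string{"cpu", "seconds"}
	fmt.Println(matchTerms("node_cpu_seconds_total", terms, matchAll))
	fmt.Println(matchTerms("node_memory_bytes", terms, matchAny))
}
```

Keeping the operator as a single enum-like value would make it straightforward to default to OR, which is the behaviour the proposal currently describes.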
| - `fuzz_threshold=<number>`: Fuzzy threshold from 0 to 100. Optional. | ||
| - `fuzz_alg=<subsequence | jarowinkler>`: Matching algorithm. Optional. | ||
| - `case_sensitive=<bool>`: Toggle case-sensitive matching. Optional. | ||
| - `sort_by=<string>`: Sort mode. Supported values depend on the endpoint. |
In the proposal, sort_by is optional, and the default when it is not set is no sort order. It might be worth highlighting that if no sort is set, no sort is applied. This differs from the existing endpoints, which always return results in alphabetical order.
| - `sort_by=<string>`: Sort mode. Supported values depend on the endpoint. | ||
| - `sort_dir=<asc | dsc>`: Sort direction. Optional. Only valid when `sort_by` | ||
| is set. | ||
| - `start=<rfc3339 | unix_timestamp>`: Start timestamp. Optional. |
I think we discussed having a sane default start/end time if none is set.
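A minimal sketch of how such a default could be resolved. The 1-hour lookback comes from the follow-up commit message in this PR; everything else (the `resolveRange` and `unixSec` helpers, using nil pointers for unset parameters, and anchoring the default start to the end time) is an assumption for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// defaultLookback is the window applied when start is not specified.
const defaultLookback = time.Hour

// unixSec is a small helper for building timestamps from Unix seconds.
func unixSec(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

// resolveRange fills in missing bounds. A nil end defaults to now, and a nil
// start defaults to one lookback window before the resolved end.
func resolveRange(start, end *time.Time, now time.Time) (time.Time, time.Time) {
	e := now
	if end != nil {
		e = *end
	}
	s := e.Add(-defaultLookback)
	if start != nil {
		s = *start
	}
	return s, e
}

func main() {
	s, e := resolveRange(nil, nil, time.Now())
	fmt.Println(s.Format(time.RFC3339), e.Format(time.RFC3339))
}
```

Passing `now` in explicitly keeps the function deterministic, which makes the defaulting rule easy to cover in a unit test.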
| `application/x-ndjson`. Each response contains one or more result batches, | ||
| followed by a final trailer line: | ||
| ```json |
An example with a warning annotation could be useful.
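As a starting point for such an example, the sketch below produces an NDJSON stream whose trailer carries a warning. The field names (`results`, `status`, `warnings`) and the `encodeStream` helper are hypothetical; the actual wire format is whatever the API documentation in this PR defines.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// batchLine and trailerLine use hypothetical field names for illustration.
type batchLine struct {
	Results []string `json:"results"`
}

type trailerLine struct {
	Status   string   `json:"status"`
	Warnings []string `json:"warnings,omitempty"`
}

// encodeStream writes one JSON object per line: every batch first, then a
// single trailer that carries any warnings collected during the search.
func encodeStream(batches [][]string, warnings []string) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends a newline after each value
	for _, b := range batches {
		if err := enc.Encode(batchLine{Results: b}); err != nil {
			return "", err
		}
	}
	if err := enc.Encode(trailerLine{Status: "success", Warnings: warnings}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := encodeStream(
		[][]string{{"node_cpu_seconds_total"}, {"process_cpu_seconds_total"}},
		[]string{"results may be incomplete"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Putting warnings in the trailer rather than in each batch means clients only need to inspect the last line to learn whether the result set is complete.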
| type Searcher interface { | ||
| // SearchLabelNames returns label names matching the search criteria. | ||
| // Results include relevance scores based on the Filter. | ||
| SearchLabelNames(ctx context.Context, hints *SearchHints, matchers ...*labels.Matcher) ([]SearchResult, error) |
Can we do the iterator style return type?
This then unlocks more options for fully streaming results.
```go
type SearcherValueSet interface {
	Next() bool
	At() SearchResult
	Warnings() annotations.Annotations
	Err() error
	Close()
}
```
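For existing code paths that already hold a slice, a thin adapter can serve it through this interface. The sketch below is one possible shape, not the PR's implementation: `sliceResultSet` and `collectValues` are hypothetical names, and `Annotations` is a local stand-in for `annotations.Annotations` so the example compiles without importing Prometheus.

```go
package main

import "fmt"

// SearchResult mirrors the two fields visible in the diff.
type SearchResult struct {
	Value string
	Score float64
}

// Annotations is a local stand-in for annotations.Annotations.
type Annotations map[string]error

type SearcherValueSet interface {
	Next() bool
	At() SearchResult
	Warnings() Annotations
	Err() error
	Close()
}

// sliceResultSet is a hypothetical adapter that serves a pre-computed slice
// through the iterator interface.
type sliceResultSet struct {
	results  []SearchResult
	warnings Annotations
	cur      int // number of results consumed so far
}

// Compile-time check that the adapter satisfies the interface.
var _ SearcherValueSet = (*sliceResultSet)(nil)

func (s *sliceResultSet) Next() bool {
	if s.cur >= len(s.results) {
		return false
	}
	s.cur++
	return true
}

func (s *sliceResultSet) At() SearchResult      { return s.results[s.cur-1] }
func (s *sliceResultSet) Warnings() Annotations { return s.warnings }
func (s *sliceResultSet) Err() error            { return nil }
func (s *sliceResultSet) Close()                {}

// collectValues shows the consumption pattern: iterate, then check Err.
func collectValues(set SearcherValueSet) ([]string, error) {
	defer set.Close()
	var out []string
	for set.Next() {
		out = append(out, set.At().Value)
	}
	return out, set.Err()
}

func main() {
	set := &sliceResultSet{results: []SearchResult{{"instance", 1}, {"job", 0.7}}}
	vals, err := collectValues(set)
	fmt.Println(vals, err)
}
```

An adapter like this lets callers migrate to the iterator contract first, with truly streaming implementations swapped in later.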
| - `match[]=<series_selector>`: Repeated series selector used to scope the | ||
| search. Optional. | ||
| - `search=<string>`: Search string matched against names or values. Optional. |
I'm updating the proposal to rename this param to search[] so it better matches match[].
| - `match[]=<series_selector>`: Repeated series selector used to scope the | ||
| search. Optional. | ||
| - `search=<string>`: Search string matched against names or values. Optional. | ||
| - `fuzz_threshold=<number>`: Fuzzy threshold from 0 to 100. Optional. |
Note: the default is 0 (no fuzzy matching).
| search. Optional. | ||
| - `search=<string>`: Search string matched against names or values. Optional. | ||
| - `fuzz_threshold=<number>`: Fuzzy threshold from 0 to 100. Optional. | ||
| - `fuzz_alg=<subsequence | jarowinkler>`: Matching algorithm. Optional. |
Note: the default is jarowinkler.
| Additional parameters for `/api/v1/search/metric_names`: | ||
| - `include_cardinality=<bool>`: Include metric cardinality in each result. |
I have updated the proposal to remove cardinality and frequency. I have left include_metadata in for now.
| - `include_cardinality=<bool>`: Include metric cardinality in each result. | ||
| - `include_metadata=<bool>`: Include metric metadata in each result. | ||
| - `sort_by=<alpha | cardinality | score>` |
For score, it might be worth noting that sorting by score only makes sense when search[] and fuzz_threshold are both set.
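One way to surface this to users is to reject inconsistent combinations at request-parsing time. The sketch below is illustrative only: `validateSort` is a hypothetical helper, the accepted sort_by values (`alpha`, `score`) are an assumption, and the sort_dir values (`asc`, `dsc`) are taken from the documented parameter list.

```go
package main

import (
	"errors"
	"fmt"
)

// validateSort checks the sort parameters of a search request.
// An empty sortBy means "no sort", which is always valid on its own.
func validateSort(sortBy, sortDir string, searches []string) error {
	switch sortBy {
	case "":
		if sortDir != "" {
			return errors.New("sort_dir is only valid when sort_by is set")
		}
		return nil
	case "alpha":
		// No extra requirements.
	case "score":
		// Without a search term there is nothing to score against.
		if len(searches) == 0 {
			return errors.New("sort_by=score requires at least one search[] value")
		}
	default:
		return fmt.Errorf("unsupported sort_by %q", sortBy)
	}
	switch sortDir {
	case "", "asc", "dsc":
		return nil
	default:
		return fmt.Errorf("unsupported sort_dir %q", sortDir)
	}
}

func main() {
	fmt.Println(validateSort("score", "dsc", nil))
	fmt.Println(validateSort("score", "dsc", []string{"cpu"}))
}
```

Failing fast here avoids returning results in an order that looks sorted by score when every score is zero.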
| // mergeSearchResults merges search results from multiple calls to fn, deduplicating | ||
| // by value and taking the maximum score for duplicates. | ||
| func mergeSearchResults(hints *SearchHints, fn func(Searcher) ([]SearchResult, error), searchers []Searcher) ([]SearchResult, error) { |
Per the comments on the interfaces, can we have the Searcher return a ValueSet-style iterator?
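A rough sketch of what the merge could look like on top of iterators, keeping the documented behaviour of deduplicating by value and taking the maximum score. The `resultSet` interface is a trimmed-down stand-in (Warnings() is omitted to stay dependency-free), and `sliceSet` and `mergeMaxScore` are hypothetical names.

```go
package main

import "fmt"

type SearchResult struct {
	Value string
	Score float64
}

// resultSet is a trimmed-down stand-in for the proposed iterator.
type resultSet interface {
	Next() bool
	At() SearchResult
	Err() error
	Close()
}

// sliceSet is a minimal slice-backed resultSet used for the demo.
type sliceSet struct {
	items []SearchResult
	pos   int
}

func (s *sliceSet) Next() bool       { s.pos++; return s.pos <= len(s.items) }
func (s *sliceSet) At() SearchResult { return s.items[s.pos-1] }
func (s *sliceSet) Err() error       { return nil }
func (s *sliceSet) Close()           {}

// mergeMaxScore drains every set, keeps one entry per value with the highest
// score seen, and preserves first-seen order (no sort is applied).
func mergeMaxScore(sets ...resultSet) ([]SearchResult, error) {
	defer func() {
		for _, set := range sets {
			set.Close()
		}
	}()
	var merged []SearchResult
	index := map[string]int{} // value -> position in merged
	for _, set := range sets {
		for set.Next() {
			r := set.At()
			if i, ok := index[r.Value]; ok {
				if r.Score > merged[i].Score {
					merged[i].Score = r.Score
				}
				continue
			}
			index[r.Value] = len(merged)
			merged = append(merged, r)
		}
		if err := set.Err(); err != nil {
			return nil, err
		}
	}
	return merged, nil
}

func main() {
	a := &sliceSet{items: []SearchResult{{"cpu", 0.5}, {"mem", 0.9}}}
	b := &sliceSet{items: []SearchResult{{"cpu", 0.8}, {"disk", 0.1}}}
	fmt.Println(mergeMaxScore(a, b))
}
```

This version still buffers everything because the maximum score for a value is only known once all inputs are drained; a fully streaming merge would need the inputs to arrive in a shared order.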
| for value, score := range scores { | ||
| merged = append(merged, SearchResult{Value: value, Score: score}) | ||
| } | ||
| if hints != nil && hints.CompareFunc != nil { |
Make sure that we support the no-sort case. If no sort_by is set, we should not apply any sort, i.e. do not default to alpha.
| // Returns (accepted, score) where score is used for relevance ranking. | ||
| // Score should be in range [0.0, 1.0] where 1.0 is perfect match. | ||
| type Filter interface { | ||
| Accept(value string) (accepted bool, score float64) |
One messy flaw we still have with this interface: consider a search where
- search[] = cpu
- fuzz_threshold = not set, so no fuzz is applied and no score is calculated in the Filter
- sort_by = score
This is a legitimate use case: autocomplete on labels/values containing cpu.
We either need to detect this early and require the Filter to still calculate the score, or lazily calculate the score before the comparator is applied. If we do the former, it's a different filter chain than applying a search + fuzz filter.
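The "detect this early" option could look roughly like the sketch below: the filter is built with a flag that says whether scores are needed, so the no-fuzz path still ranks results when sort_by=score. `substringFilter` and `buildFilter` are hypothetical names, and the score formula (term length divided by value length) is only a placeholder heuristic, not what the PR computes.

```go
package main

import (
	"fmt"
	"strings"
)

// substringFilter accepts values that contain term. When withScore is false
// it skips scoring entirely, the cheap path for unsorted searches.
type substringFilter struct {
	term      string
	withScore bool
}

// Accept follows the Filter signature from the diff.
func (f substringFilter) Accept(value string) (accepted bool, score float64) {
	if !strings.Contains(value, f.term) {
		return false, 0
	}
	if !f.withScore || len(value) == 0 {
		return true, 0
	}
	// Placeholder heuristic: the more of the value the term covers, the
	// higher the score, with 1.0 for an exact match.
	return true, float64(len(f.term)) / float64(len(value))
}

// buildFilter covers the path where fuzz_threshold is unset. It asks the
// filter to compute scores only when the request sorts by score.
func buildFilter(term, sortBy string) substringFilter {
	return substringFilter{term: term, withScore: sortBy == "score"}
}

func main() {
	f := buildFilter("cpu", "score")
	for _, v := range []string{"cpu", "node_cpu", "memory"} {
		ok, score := f.Accept(v)
		fmt.Println(v, ok, score)
	}
}
```

The alternative, scoring lazily just before the comparator runs, would keep a single filter chain at the cost of a second pass over the accepted values.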
| // CompareFunc is used for ordering results. | ||
| // It receives full SearchResult values, allowing comparison by value, | ||
| // score, or any combination. | ||
| // A nil value means alphabetical ordering by value. |
Per above, I think a nil CompareFunc should imply no sorting is applied.
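A small sketch of the nil-means-unsorted contract. The comparator signature is an assumption (the diff only says it receives full SearchResult values), and `orderResults` and `byScoreDesc` are hypothetical names.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

type SearchResult struct {
	Value string
	Score float64
}

// byScoreDesc orders by score, highest first, breaking ties alphabetically.
func byScoreDesc(a, b SearchResult) int {
	if c := cmp.Compare(b.Score, a.Score); c != 0 {
		return c
	}
	return cmp.Compare(a.Value, b.Value)
}

// orderResults sorts in place only when a comparator is supplied. A nil
// comparator leaves the slice in the order the index produced it.
func orderResults(results []SearchResult, compare func(a, b SearchResult) int) {
	if compare == nil {
		return
	}
	slices.SortStableFunc(results, compare)
}

func main() {
	rs := []SearchResult{{"b", 0.2}, {"a", 0.9}, {"c", 0.2}}
	orderResults(rs, nil) // untouched
	fmt.Println(rs)
	orderResults(rs, byScoreDesc)
	fmt.Println(rs)
}
```

Skipping the sort entirely in the nil case also keeps the door open for streaming results without buffering them first.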
| labelHints.Limit = hints.Limit | ||
| } | ||
| names, err := q.index.LabelNames(ctx, matchers...) |
Are we able to implement the Searcher interface inside q.index? That would get the filtering as low as possible and then return the ValueSet-style iterator, which better facilitates streaming the results back.
| // Apply filter and collect scores. | ||
| var results []storage.SearchResult | ||
| for _, name := range names { | ||
| if hints.Filter != nil { |
Per the comment above, we need to consider the use case where search[] is set, fuzz_threshold is not set, and sort_by=score is selected.
| // Winkler modification: boost for common prefix up to 4 characters. | ||
| prefixLen := 0 | ||
| maxPrefix := min(4, l1, l2) |
Does this prefix boosting value need to be pulled out into a configurable option?
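If it were made configurable, the two Winkler constants could be grouped into a small options struct. The sketch below is a self-contained Jaro-Winkler written for illustration, not the PR's implementation; `winklerOptions`, `defaultWinkler`, `jaro`, and `jaroWinkler` are hypothetical names. The defaults (prefix length 4, scale 0.1) are the conventional values for the algorithm.

```go
package main

import "fmt"

// winklerOptions exposes the two constants of the Winkler adjustment.
type winklerOptions struct {
	MaxPrefix int     // longest common prefix that earns a boost
	Scale     float64 // boost per prefix character; MaxPrefix*Scale must stay <= 1
}

var defaultWinkler = winklerOptions{MaxPrefix: 4, Scale: 0.1}

// jaro returns the Jaro similarity of a and b in [0, 1].
func jaro(a, b []rune) float64 {
	la, lb := len(a), len(b)
	if la == 0 && lb == 0 {
		return 1
	}
	if la == 0 || lb == 0 {
		return 0
	}
	window := max(max(la, lb)/2-1, 0)
	ma, mb := make([]bool, la), make([]bool, lb)
	matches := 0
	for i := range a {
		lo, hi := max(0, i-window), min(lb, i+window+1)
		for j := lo; j < hi; j++ {
			if !mb[j] && a[i] == b[j] {
				ma[i], mb[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched characters that appear in a different order.
	t, j := 0, 0
	for i := range a {
		if !ma[i] {
			continue
		}
		for !mb[j] {
			j++
		}
		if a[i] != b[j] {
			t++
		}
		j++
	}
	m := float64(matches)
	return (m/float64(la) + m/float64(lb) + (m-float64(t)/2)/m) / 3
}

// jaroWinkler boosts the Jaro score for strings sharing a common prefix.
func jaroWinkler(s1, s2 string, opts winklerOptions) float64 {
	a, b := []rune(s1), []rune(s2)
	score := jaro(a, b)
	maxPrefix := min(opts.MaxPrefix, len(a), len(b))
	prefix := 0
	for prefix < maxPrefix && a[prefix] == b[prefix] {
		prefix++
	}
	return score + float64(prefix)*opts.Scale*(1-score)
}

func main() {
	fmt.Printf("%.4f\n", jaroWinkler("MARTHA", "MARHTA", defaultWinkler))
	fmt.Printf("%.4f\n", jaroWinkler("MARTHA", "MARHTA", winklerOptions{}))
}
```

Whether this belongs in the public API is a separate question; keeping it as an internal option with fixed defaults would already make the behaviour testable without committing to new query parameters.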
Address review comments on PR 18293:

- Switch Searcher interface to iterator-based SearchResultSet return type for streaming; add sliceSearchResultSet, EmptySearchResultSet, and ErrSearchResultSet helpers
- Rename search= to search[] and support multiple values with OR logic via orSearchesFilter and buildSearchFilter
- Change nil CompareFunc to mean no sorting (natural index order)
- Default fuzz_alg to jarowinkler, fuzz_threshold=0 disables fuzzy matching
- Remove cardinality and frequency enrichment from all three endpoints
- Default start/end to a 1-hour lookback window when not specified
- Add include_score parameter to return relevance scores per result
- Enforce sort_by=score requires search[] to be set
- Update OpenAPI schemas and golden files accordingly
- Replace local design doc with upstream proposal 74 content

Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
prometheus/proposals#74
Which issue(s) does the PR fix:
Does this PR introduce a user-facing change?