Skip to content

[Feature] PPL Highlight Command#5234

Draft
RyanL1997 wants to merge 6 commits intoopensearch-project:mainfrom
RyanL1997:ppl-highlight-cmd
Draft

[Feature] PPL Highlight Command#5234
RyanL1997 wants to merge 6 commits intoopensearch-project:mainfrom
RyanL1997:ppl-highlight-cmd

Conversation

@RyanL1997
Copy link
Collaborator

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 5891541.

PathLineSeverityDescription
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java516lowUser-supplied highlight terms are interpolated directly into a Lucene query_string expression with only double-quote wrapping and no further sanitization. A crafted term containing backslash escapes or unbalanced quotes could malform or expand the generated query. Impact is limited to highlighting behavior (not document access control), so this is likely an oversight rather than malicious intent.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 14, 2026

PR Reviewer Guide 🔍

(Review updated until commit f0dacc0)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 Security concerns

Term Injection:
In applyHighlightPushDown (AbstractCalciteIndexScan.java), user-supplied highlight terms are embedded into an OpenSearch query string using simple double-quote wrapping: "\"" + term + "\"". A term containing special characters (e.g., ", AND, OR, NOT, wildcards) could manipulate the query string logic. For example, a term like foo" OR password:* could expand the highlight query beyond the intended scope. Input terms should be escaped before being used in a query string query.

✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: PPL Highlight AST and Parser

Relevant files:

  • ppl/src/main/antlr/OpenSearchPPLLexer.g4
  • ppl/src/main/antlr/OpenSearchPPLParser.g4
  • core/src/main/java/org/opensearch/sql/ast/tree/Highlight.java
  • core/src/main/java/org/opensearch/sql/ast/AbstractNodeVisitor.java
  • core/src/main/java/org/opensearch/sql/ast/dsl/AstDSL.java
  • core/src/main/java/org/opensearch/sql/analysis/Analyzer.java
  • ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java
  • ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java
  • ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java

Sub-PR theme: Highlight Calcite Planner and OpenSearch Pushdown

Relevant files:

  • core/src/main/java/org/opensearch/sql/calcite/plan/rel/LogicalHighlight.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/planner/rules/HighlightIndexScanRule.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/planner/rules/OpenSearchIndexRules.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteEnumerableIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/PushDownContext.java
  • core/src/main/java/org/opensearch/sql/expression/HighlightExpression.java

Sub-PR theme: Highlight Response Format and Documentation

Relevant files:

  • protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java
  • protocol/src/main/java/org/opensearch/sql/protocol/response/format/SimpleJsonResponseFormatter.java
  • docs/user/ppl/cmd/highlight.md
  • docs/category.json

⚡ Recommended focus areas for review

Term Injection Risk

In applyHighlightPushDown, highlight terms are interpolated directly into a query string using string concatenation with only basic double-quote wrapping. A term containing a double-quote character could break the query string syntax or produce unexpected behavior. Input sanitization or escaping should be applied before building the query string.

String queryStr =
    highlightArgs.stream()
        .map(term -> "\"" + term + "\"")
        .collect(java.util.stream.Collectors.joining(" OR "));
org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field field =
    new org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field("*")
        .highlightQuery(
            org.opensearch.index.query.QueryBuilders.queryStringQuery(queryStr)
                .defaultField("*"));
hb.field(field);
Duplicate _highlight Column

In pushDownHighlight, the _highlight field is unconditionally added to the schema without checking if it already exists (unlike LogicalHighlight.create which does check). If pushDownHighlight is called more than once, or if the field already exists, this could result in a duplicate column in the row type.

public RelNode pushDownHighlight(List<String> highlightArgs) {
  RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
  schemaBuilder.addAll(getRowType().getFieldList());
  schemaBuilder.add(
      HighlightExpression.HIGHLIGHT_FIELD,
      getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
  CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
  newScan.getPushDownContext().setHighlightArgs(highlightArgs);
  return newScan;
}
Unsafe Cast

In the highlights() method, there is an unchecked cast (Map<String, Object>) hlMap which is unnecessary since hlMap is already declared as Map<String, Object>. While harmless here, it suppresses compiler warnings and should be removed for clarity.

return (Map<String, Object>) hlMap;
Iterator Change

The iterator() method was refactored to filter out _highlight and use e.getValue().value() instead of the previous ExprValue::value. This changes behavior for all queries, not just highlight queries. The previous implementation used convertExprValuesToValues which called ExprValue::value. The new implementation filters by key and calls .value() — this should be validated to ensure no regression for non-highlight queries, especially for complex/nested ExprValue types.

public Iterator<Object[]> iterator() {
  return exprValues.stream()
      .map(ExprValueUtils::getTupleValue)
      .map(
          tuple ->
              tuple.entrySet().stream()
                  .filter(e -> !HIGHLIGHT_FIELD.equals(e.getKey()))
                  .map(e -> e.getValue().value())
                  .toArray(Object[]::new))
      .iterator();
Implicit _highlight Projection

In visitProject, the _highlight field is automatically appended to all projections whenever it exists in the schema. This means any query that uses highlight followed by a fields command will silently include _highlight in the projection, even if the user did not request it. This could cause unexpected behavior if the _highlight column is not properly handled downstream in all code paths.

// Include _highlight in projections when the highlight column is present in the schema
int hlIndex = currentFields.indexOf(HighlightExpression.HIGHLIGHT_FIELD);
if (hlIndex >= 0) {
  expandedFields.add(context.relBuilder.field(hlIndex));
}

@github-actions
Copy link
Contributor

github-actions bot commented Mar 14, 2026

PR Code Suggestions ✨

Latest suggestions up to f0dacc0
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Prevent duplicate highlight column in schema

If _highlight is already present in the row type (e.g., when pushDownHighlight is
called more than once), the method unconditionally adds a duplicate _highlight
column, which will cause a schema conflict. Add a guard to only add the column when
it is not already present, mirroring the check done in LogicalHighlight.create.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [88-97]

 public RelNode pushDownHighlight(List<String> highlightArgs) {
     RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
     schemaBuilder.addAll(getRowType().getFieldList());
-    schemaBuilder.add(
-        HighlightExpression.HIGHLIGHT_FIELD,
-        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+        schemaBuilder.add(
+            HighlightExpression.HIGHLIGHT_FIELD,
+            getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    }
     CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
     newScan.getPushDownContext().setHighlightArgs(highlightArgs);
     return newScan;
 }
Suggestion importance[1-10]: 7

__

Why: The pushDownHighlight method unconditionally adds _highlight to the schema without checking if it already exists, which could cause a schema conflict if called multiple times. The fix mirrors the guard already present in LogicalHighlight.create.

Medium
Preserve schema column order in row iterator

The new iterator() implementation filters out _highlight but also changes the column
ordering guarantee: it now iterates over the map's entry set order rather than the
schema column order. This can cause column values to be misaligned with the schema
columns returned by columnNameTypes(). The original code used Map::values which
relied on insertion order; the new code should explicitly follow the schema column
order to be safe.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [89-97]

 return exprValues.stream()
     .map(ExprValueUtils::getTupleValue)
     .map(
         tuple ->
-            tuple.entrySet().stream()
-                .filter(e -> !HIGHLIGHT_FIELD.equals(e.getKey()))
-                .map(e -> e.getValue().value())
+            schema.getColumns().stream()
+                .filter(col -> !HIGHLIGHT_FIELD.equals(getColumnName(col)))
+                .map(col -> tuple.getOrDefault(getColumnName(col),
+                    org.opensearch.sql.data.model.ExprMissingValue.of()).value())
                 .toArray(Object[]::new))
     .iterator();
Suggestion importance[1-10]: 7

__

Why: The new iterator() implementation iterates over the map's entry set order rather than the schema column order, which could cause column values to be misaligned with the schema. Iterating over schema columns explicitly ensures correct alignment between datarows and schema.

Medium
Security
Escape special characters in highlight query terms

The terms are interpolated directly into a query string using """ + term + """
without any escaping. A term containing a double-quote character (e.g., he said
"hello") will break the query string syntax and may cause an OpenSearch request
failure or unexpected behavior. The terms should be escaped before being embedded in
the query string.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [511-522]

 } else {
   // Highlight specific terms across all fields
   String queryStr =
       highlightArgs.stream()
-          .map(term -> "\"" + term + "\"")
+          .map(term -> "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
           .collect(java.util.stream.Collectors.joining(" OR "));
   org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field field =
       new org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field("*")
           .highlightQuery(
               org.opensearch.index.query.QueryBuilders.queryStringQuery(queryStr)
                   .defaultField("*"));
   hb.field(field);
 }
Suggestion importance[1-10]: 6

__

Why: Terms containing double-quote characters are interpolated directly into the query string without escaping, which could break the query string syntax. However, since the grammar restricts highlightArg to string literals parsed by the PPL parser, the risk of unescaped quotes reaching this code is limited in practice.

Low

Previous suggestions

Suggestions up to commit 5891541
CategorySuggestion                                                                                                                                    Impact
Possible issue
Prevent duplicate highlight column in schema

The _highlight column is unconditionally added to the schema even if it already
exists (e.g., if pushDownHighlight is called twice). This can cause duplicate column
errors. Add a guard to only add the column when it is not already present, similar
to how LogicalHighlight.create handles it.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [90-99]

 public RelNode pushDownHighlight(List<String> highlightArgs) {
     RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
     schemaBuilder.addAll(getRowType().getFieldList());
-    schemaBuilder.add(
-        HighlightExpression.HIGHLIGHT_FIELD,
-        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+        schemaBuilder.add(
+            HighlightExpression.HIGHLIGHT_FIELD,
+            getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    }
     CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
     newScan.getPushDownContext().setHighlightArgs(highlightArgs);
     return newScan;
 }
Suggestion importance[1-10]: 6

__

Why: The pushDownHighlight method unconditionally adds _highlight to the schema without checking if it already exists, which could cause duplicate column errors if called multiple times. The fix mirrors the guard already present in LogicalHighlight.create.

Low
Security
Escape highlight terms to prevent query injection

The term values are interpolated directly into a query string using """ + term +
""" without any escaping. If a term contains special query string characters (e.g.,
", </code>, :, *), this could produce a malformed or injected query. Terms should be
escaped before being embedded in the query string.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [511-522]

 } else {
     // Highlight specific terms across all fields
     String queryStr =
         highlightArgs.stream()
-            .map(term -> "\"" + term + "\"")
+            .map(term -> "\"" + org.apache.lucene.queryparser.classic.QueryParser.escape(term) + "\"")
             .collect(java.util.stream.Collectors.joining(" OR "));
     org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field field =
         new org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field("*")
             .highlightQuery(
                 org.opensearch.index.query.QueryBuilders.queryStringQuery(queryStr)
                     .defaultField("*"));
     hb.field(field);
 }
Suggestion importance[1-10]: 6

__

Why: User-supplied terms are interpolated directly into a Lucene query string without escaping, which could produce malformed queries or allow injection of special characters. Using QueryParser.escape would mitigate this risk.

Low
General
Guard against null highlight field values

Calling entry.getValue().collectionValue() will throw a NullPointerException or
ExpressionEvaluationException if the highlight field value for a particular key is
not a collection type. Add a null/type check before calling collectionValue() to
handle unexpected value shapes gracefully.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [110-124]

 map(
     tuple -> {
       ExprValue hl = tuple.get(HIGHLIGHT_FIELD);
       if (hl == null || hl.isMissing() || hl.isNull()) {
         return null;
       }
       Map<String, Object> hlMap = new LinkedHashMap<>();
       for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
+        ExprValue val = entry.getValue();
+        if (val == null || val.isMissing() || val.isNull()) continue;
         hlMap.put(
             entry.getKey(),
-            entry.getValue().collectionValue().stream()
+            val.collectionValue().stream()
                 .map(ExprValue::stringValue)
                 .collect(Collectors.toList()));
       }
       return (Map<String, Object>) hlMap;
     })
Suggestion importance[1-10]: 4

__

Why: Calling collectionValue() on a highlight entry without checking if the value is null or missing could throw an exception for unexpected value shapes. Adding a null/missing check improves robustness.

Low
Verify child plan attachment for highlight command

The visitHighlightCommand returns a new Highlight(highlightArgs) without attaching
it to the child plan. All other command visitors (e.g., visitTrendlineCommand)
return the node and rely on the caller to call .attach(child), but AstDSL.highlight
calls .attach(input). Since visitHighlightCommand is called by the AST builder
framework which chains nodes, the returned Highlight node will have a null child,
causing a NullPointerException when the plan is traversed. The child should be
attached here or the framework must handle it — verify the attachment mechanism is
consistent with other commands.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java [1290-1305]

 public UnresolvedPlan visitHighlightCommand(OpenSearchPPLParser.HighlightCommandContext ctx) {
     List<String> highlightArgs =
         ctx.highlightArg().stream()
             .map(
                 arg -> {
                   if (arg.STAR() != null) {
                     return "*";
                   } else if (arg.stringLiteral() != null) {
                     return StringUtils.unquoteText(arg.stringLiteral().getText());
                   } else {
                     return arg.fieldExpression().getText();
                   }
                 })
             .collect(Collectors.toList());
+    // Return without attach; the framework (visitPipeCommands) attaches the child.
+    // Ensure this is consistent with how other commands are handled.
     return new Highlight(highlightArgs);
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion asks to verify that the framework attaches the child, but the improved_code is essentially identical to the existing_code with only a comment added. Other command visitors in the same file follow the same pattern of returning without attaching, so this is consistent and not a real bug.

Low

@RyanL1997 RyanL1997 added PPL Piped processing language v3.6.0 Issues targeting release v3.6.0 feature labels Mar 14, 2026
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit f0dacc0.

PathLineSeverityDescription
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java519lowUser-supplied highlight terms are wrapped in double quotes but not escaped before being passed to queryStringQuery(). A term containing a literal '"' character could partially break out of the phrase context and alter the Lucene query structure. Because PPL users are already authenticated and can issue arbitrary search queries, the practical exploitability is limited, but proper escaping of the term string (e.g., via QueryParser.escape) would be safer.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

Persistent review updated to latest commit f0dacc0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature PPL Piped processing language v3.6.0 Issues targeting release v3.6.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant