Skip to content

Fail retention app when the columnPattern mismatch partition spec for…#552

Merged
jiang95-dev merged 3 commits into
linkedin:mainfrom
jiang95-dev:lejiang/hcr-spec-mismatch
May 19, 2026
Merged

Fail retention app when the columnPattern mismatch partition spec for…#552
jiang95-dev merged 3 commits into
linkedin:mainfrom
jiang95-dev:lejiang/hcr-spec-mismatch

Conversation

@jiang95-dev
Copy link
Copy Markdown
Collaborator

@jiang95-dev jiang95-dev commented Apr 17, 2026

Summary

Fail retention app when the columnPattern mismatch partition spec for HCR tables. Sometimes users use a non-partitioned column for retention policy, and this will cause rewrite data files in the retention job. If we enable rename, this will cause data loss. So we decided to fail the operation when the delete is not a metdata-only operation. We detect it by checking if all the planned files have no residual from the the patition data.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Comment thread apps/spark/src/main/java/com/linkedin/openhouse/jobs/spark/Operations.java Outdated
Comment thread apps/spark/src/main/java/com/linkedin/openhouse/jobs/spark/Operations.java Outdated
Copy link
Copy Markdown
Collaborator

@mkuchenbecker mkuchenbecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. How do we know when this is failing?
  2. When this is failing, what is the impact on the customer's limited retention? What is the AI for the customer / us?

Copy link
Copy Markdown
Collaborator

@cbb330 cbb330 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will add toil to operations, the operations is already at peak so we can't choose to add toil.

why toilsome?

  1. user calls SET POLICY ddl
  2. 1 out of 60,000 async jobs starts failing. impossible to alert on
  3. many weeks go by
  4. user raises ticket about unmatched expectation between their policy, and data
  5. this problem also is obfuscated when the retention policy comes from UI instead of table properties.
  6. oncall needs to dig into retention app logs and triage to dig out this exception in the spark driver logs.

multiply that^ ticket burden across N tables which have retention policy on non-partitioned column.

Instead this needs to be fixed at the root cause:

Sometimes users use a non-partitioned column for retention policy

Somewhere there is a bug:
a) either explicitly disallow DDL to SET POLICY ... retention on a non-partition column
b) if that^ is already done, how are these tables leaking through? it must be an edge case we're not tolerating somehow e.g. maybe a gap in server validation.

@jiang95-dev
Copy link
Copy Markdown
Collaborator Author

this will add toil to operations, the operations is already at peak so we can't choose to add toil.

It won't add much toil. This fail model only applies to the HCR-enabled tables, which is the tracking tables. The HDFS team is enabling hcr in a batch, and they will monitor the effects. This fail model is the safety catch so that they won't archive the wrong data.

Somewhere there is a bug: a) either explicitly disallow DDL to SET POLICY ... retention on a non-partition column b) if that^ is already done, how are these tables leaking through? it must be an edge case we're not tolerating somehow e.g. maybe a gap in server validation.

Tracking tables uses CPS to config retention policies, so we can't prevent it from the SET POLICY command. But agree that we should add the validations for other tables where currently we are missing. I'll send another PR for this.

@jiang95-dev
Copy link
Copy Markdown
Collaborator Author

  1. How do we know when this is failing?

We don't know it is failing since we don't set up alerts on individual tables.

  1. When this is failing, what is the impact on the customer's limited retention? What is the AI for the customer / us?

This fail model only apply to HCR-enabled tables, which is the tracking tables. The HDFS team is enabling hcr in a batch, so they will contact us if the job fails and we will tell them the config is wrong. So no impact on LR for other customers.

teamurko
teamurko previously approved these changes May 18, 2026
Copy link
Copy Markdown
Collaborator

@teamurko teamurko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. The trade-off here is data loss of back-ed up files vs partition retention job failure. This code runs on backup enabled tables. It's a protection against enabling backup on table with misaligned retention policy. Since now users can directly update retention policy on CPS, we can't enforce it in the catalog. This is a trade-off we have to make to enable further storage cost reduction. PTAL @mkuchenbecker @cbb330, discuss any further ideas.

Comment thread apps/spark/src/main/java/com/linkedin/openhouse/jobs/spark/Operations.java Outdated
cbb330
cbb330 previously approved these changes May 18, 2026
Copy link
Copy Markdown
Collaborator

@cbb330 cbb330 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed with Stas offline,

  • we unblock this solution knowing the cons, and mitigating with future fixes
  • we will decide whether to block merging this PR until retention policy service implements a write-path fix, or else we merge this and deploy it
  • we will ask policy service to write through to OpenHouse table properties, so that we can block retention policies on columns that aren't partitioned
  • this aligns with our strategy that OH implements the chokepoint for datalake, and policy service can be a reflection of OH as the SOT. allows OH set technical constraints for downstreams/upstreams in the API in one place.

@jiang95-dev jiang95-dev dismissed stale reviews from cbb330 and teamurko via 0bf7de2 May 19, 2026 00:33
@jiang95-dev jiang95-dev merged commit 6d6ff0d into linkedin:main May 19, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants