Fail retention app when the columnPattern mismatch partition spec for…#552
Conversation
mkuchenbecker
left a comment
There was a problem hiding this comment.
- How do we know when this is failing?
- When this is failing, what is the impact on the customer's limited retention? What is the AI for the customer / us?
There was a problem hiding this comment.
this will add toil to operations, the operations is already at peak so we can't choose to add toil.
why toilsome?
- user calls SET POLICY ddl
- 1 out of 60,000 async jobs starts failing. impossible to alert on
- many weeks go by
- user raises ticket about unmatched expectation between their policy, and data
- this problem also is obfuscated when the retention policy comes from UI instead of table properties.
- oncall needs to dig into retention app logs and triage to dig out this exception in the spark driver logs.
multiply that^ ticket burden across N tables which have retention policy on non-partitioned column.
Instead this needs to be fixed at the root cause:
Sometimes users use a non-partitioned column for retention policy
Somewhere there is a bug:
a) either explicitly disallow DDL to SET POLICY ... retention on a non-partition column
b) if that^ is already done, how are these tables leaking through? it must be an edge case we're not tolerating somehow e.g. maybe a gap in server validation.
It won't add much toil. This fail model only applies to the HCR-enabled tables, which is the tracking tables. The HDFS team is enabling hcr in a batch, and they will monitor the effects. This fail model is the safety catch so that they won't archive the wrong data.
Tracking tables uses CPS to config retention policies, so we can't prevent it from the |
We don't know it is failing since we don't set up alerts on individual tables.
This fail model only apply to HCR-enabled tables, which is the tracking tables. The HDFS team is enabling hcr in a batch, so they will contact us if the job fails and we will tell them the config is wrong. So no impact on LR for other customers. |
teamurko
left a comment
There was a problem hiding this comment.
Overall looks good. The trade-off here is data loss of back-ed up files vs partition retention job failure. This code runs on backup enabled tables. It's a protection against enabling backup on table with misaligned retention policy. Since now users can directly update retention policy on CPS, we can't enforce it in the catalog. This is a trade-off we have to make to enable further storage cost reduction. PTAL @mkuchenbecker @cbb330, discuss any further ideas.
cbb330
left a comment
There was a problem hiding this comment.
discussed with Stas offline,
- we unblock this solution knowing the cons, and mitigating with future fixes
- we will decide whether to block merging this PR until retention policy service implements a write-path fix, or else we merge this and deploy it
- we will ask policy service to write through to OpenHouse table properties, so that we can block retention policies on columns that aren't partitioned
- this aligns with our strategy that OH implements the chokepoint for datalake, and policy service can be a reflection of OH as the SOT. allows OH set technical constraints for downstreams/upstreams in the API in one place.
Summary
Fail retention app when the columnPattern mismatch partition spec for HCR tables. Sometimes users use a non-partitioned column for retention policy, and this will cause rewrite data files in the retention job. If we enable rename, this will cause data loss. So we decided to fail the operation when the delete is not a metdata-only operation. We detect it by checking if all the planned files have no residual from the the patition data.
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.