Background
CHT Sync currently requires a full copy of all synced CouchDB databases in PostgreSQL. This leads to high storage usage, consuming a significant fraction of the live database size.
Once data is moved from the source table into dbt model tables, some of it is no longer needed in the source table. Deleting this unnecessary data could free up substantial storage space.
However, ensuring accuracy and avoiding disruptions to model development while deleting data poses challenges.
Key Questions
-
How can we determine which data is safe to delete?
Once data_records and contacts are moved to the form and contact models, all downstream models (dashboards, aggregates) are expected to query from these models rather than the source table.
- If this behavior can be guaranteed and
form and contact models change infrequently, records in the source table could be deleted once these models are populated.
- A similar approach might work for
tasks, where only a small amount of metadata is needed while most source data is operational and unnecessary in PostgreSQL.
-
How do we handle changes to models that depend on the source table?
couch2pg lacks a mechanism to selectively re-add deleted records. If models that rely on the source table need to be rebuilt, but the original data has been deleted, there must be a way to restore it.
-
How can we prevent users from querying the source table directly?
While aggregates and dashboard models are expected to query from form and contact models, nothing technically prevents developers from querying the source table directly.
- If source data is deleted, such queries would produce incorrect results silently.
- How can we enforce best practices and prevent accidental reliance on deleted source data?
Background
CHT Sync currently requires a full copy of all synced CouchDB databases in PostgreSQL. This leads to high storage usage, consuming a significant fraction of the live database size.
Once data is moved from the source table into dbt model tables, some of it is no longer needed in the source table. Deleting this unnecessary data could free up substantial storage space.
However, ensuring accuracy and avoiding disruptions to model development while deleting data poses challenges.
Key Questions
How can we determine which data is safe to delete?
Once
data_recordsandcontactsare moved to theformandcontactmodels, all downstream models (dashboards, aggregates) are expected to query from these models rather than the source table.formandcontactmodels change infrequently, records in the source table could be deleted once these models are populated.tasks, where only a small amount of metadata is needed while most source data is operational and unnecessary in PostgreSQL.How do we handle changes to models that depend on the source table?
couch2pglacks a mechanism to selectively re-add deleted records. If models that rely on the source table need to be rebuilt, but the original data has been deleted, there must be a way to restore it.How can we prevent users from querying the source table directly?
While aggregates and dashboard models are expected to query from
formandcontactmodels, nothing technically prevents developers from querying the source table directly.