Feature Request
Connector-based ingestion relies on primary keys to determine whether records should be inserted or updated. When source data provides a stable identifier, this works well. However, many raw datasets (e.g. GeoJSON uploads via the Dataset Importer) do not include feature IDs.
As of #136, the system generates random UUIDs for such records. This avoids baking in brittle assumptions about record equality and prevents subtle duplication bugs caused by small, hard-to-reason-about differences in feature content (e.g. null vs omitted fields, minor attribute edits). A TODO was intentionally left in the code to revisit this problem once append workflows are in scope.
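The current behavior can be sketched roughly as follows. This is an illustrative helper, not the actual code from #136; `assign_id` and the dict-based feature shape are assumptions for the example:

```python
import uuid

def assign_id(feature: dict) -> str:
    """Use the source-provided feature ID when present;
    otherwise mint a random UUID, making no claim about
    record equality across re-imports."""
    fid = feature.get("id")
    return str(fid) if fid is not None else str(uuid.uuid4())
```

Because the fallback is random, re-importing the same file always produces fresh IDs, which is safe for create-only workflows but is exactly what makes appends awkward.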
However, supporting append-to-existing-dataset workflows in the future will require a more deliberate and explicit strategy for record identity. Naïve deterministic hashing of feature content is likely insufficient and may lead to surprising behavior or silent duplication. Equality semantics, scope of comparison, and user intent all need to be carefully defined before changing ID behavior at a low level.
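To make the fragility concrete, here is a minimal sketch of naive content hashing, showing how two features a user would consider "the same record" hash differently when a property is null in one and omitted in the other. The `feature_hash` helper and the sample features are hypothetical:

```python
import hashlib
import json

def feature_hash(feature: dict) -> str:
    """Naive content hash: serialize deterministically, then hash."""
    payload = json.dumps(feature, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Semantically equivalent features from a user's point of view:
a = {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [0, 0]},
     "properties": {"name": "Depot", "note": None}}
b = {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [0, 0]},
     "properties": {"name": "Depot"}}  # "note" omitted, not null

print(feature_hash(a) == feature_hash(b))  # False: null vs omitted changes the hash
```

Under this scheme an append would treat `a` and `b` as distinct records and silently duplicate them, which is why the hashing strategy needs explicit equality semantics rather than raw serialization.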
This issue exists to scope and design a deterministic primary key strategy suitable for append workflows, without prematurely committing to a specific hashing or deduplication mechanism.
User Stories & Acceptance Criteria
- As a user appending data to an existing dataset, I want record identity to behave predictably, so that minor, unintentional changes in the source data do not create silent duplicates.