Define deterministic primary key strategy for append uploads across connectors #196

@rudokemper

Description

Feature Request

Connector-based ingestion relies on primary keys to determine whether records should be inserted or updated. When source data provides a stable identifier, this works well. However, many raw datasets (e.g. GeoJSON uploads via the Dataset Importer) do not include feature IDs.

As of #136, the system generates random UUIDs for such records. This avoids baking in brittle assumptions about record equality and prevents subtle duplication bugs caused by small, hard-to-reason-about differences in feature content (e.g. null vs omitted fields, minor attribute edits). A TODO was intentionally left in the code to revisit this problem once append workflows are in scope.
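A minimal sketch of the fallback behavior described above. The function and field names here are illustrative, not the actual connector code; it assumes features arrive as GeoJSON-style dicts:

```python
import uuid


def assign_record_id(feature: dict) -> str:
    """Return the source-provided ID if present, else a random UUID.

    Illustrative sketch of the #136 behavior: when the source data
    carries no stable identifier, a random UUID is minted, so
    re-uploading the same feature yields a new identity.
    """
    props = feature.get("properties") or {}
    source_id = feature.get("id") or props.get("id")
    if source_id is not None:
        return str(source_id)
    # No stable identifier in the source: mint a random UUID.
    return str(uuid.uuid4())
```

This is exactly why append workflows need a deliberate strategy: with random UUIDs, appends can never match existing records, but that is at least predictable rather than subtly wrong.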

However, supporting append-to-existing-dataset workflows in the future will require a more deliberate and explicit strategy for record identity. Naïve deterministic hashing of feature content is likely insufficient and may lead to surprising behavior or silent duplication. Equality semantics, scope of comparison, and user intent all need to be carefully defined before changing ID behavior at a low level.
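To make the brittleness concrete, here is a hypothetical naive content hash (not a proposed mechanism): two features that most users would consider "the same" hash differently because one serializes a property as null and the other omits it:

```python
import hashlib
import json


def content_hash(feature: dict) -> str:
    """Naively hash a feature's full content (illustrative only)."""
    canonical = json.dumps(feature, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Semantically equivalent features, serialized two ways:
a = {"type": "Feature", "properties": {"name": "Well 1", "notes": None}}
b = {"type": "Feature", "properties": {"name": "Well 1"}}  # "notes" omitted

content_hash(a) == content_hash(b)  # False: null vs omitted changes the hash
```

On append, such a scheme would silently duplicate feature `a` as a "new" record `b`, which is the surprising behavior this issue aims to design around.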

This issue exists to scope and design a deterministic primary key strategy suitable for append workflows, without prematurely committing to a specific hashing or deduplication mechanism.

User Stories & Acceptance Criteria

  • As a user appending data to an existing dataset, I want record identity to behave predictably and not create silent duplicates due to minor, unintentional changes in source data.

Metadata

Assignees

No one assigned

    Labels

    connectors (Connector scripts for ETL from upstream data sources)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests