ingest: spatial matching for DIC and future datasets (site, grid, segment) #47

@bbest

Problem

The CalCOFI DIC ingest (ingest_calcofi_dic.qmd) matches only 24.7% of samples to casts via sta_key + datetime_utc (±3-day window). The remaining ~75% of samples have valid lat/lon but no link to an existing cast because:

  • Many DIC cruises pre-date or differ from the calcofi.org bottle database coverage
  • Date offsets between NCEI and calcofi.org records sometimes exceed 3 days
  • Some stations may use slightly different line/station designations

This will be a recurring problem for every new dataset ingested (euphausiids, phytoplankton, zooplankton, etc.).
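To make the failure modes above concrete, here is a minimal sketch of a key-plus-time-window match. It is not the code in ingest_calcofi_dic.qmd: the function name `match_cast`, the record layout, the `sta_key` string format, and the toy dates are all assumptions for illustration. Only the matching rule (same sta_key, datetime_utc within ±3 days) comes from the description above.

```python
from datetime import datetime, timedelta

def match_cast(sample, casts, window_days=3):
    """Return the cast_id of the closest-in-time cast sharing the sample's
    sta_key, or None when no cast falls inside the +/- window (hypothetical helper)."""
    window = timedelta(days=window_days)
    candidates = [
        c for c in casts
        if c["sta_key"] == sample["sta_key"]
        and abs(c["datetime_utc"] - sample["datetime_utc"]) <= window
    ]
    if not candidates:
        return None
    best = min(candidates,
               key=lambda c: abs(c["datetime_utc"] - sample["datetime_utc"]))
    return best["cast_id"]

# Toy records; the sta_key format is assumed, not taken from the database.
casts = [
    {"cast_id": 1, "sta_key": "090.0 060.0", "datetime_utc": datetime(2010, 1, 15, 12)},
    {"cast_id": 2, "sta_key": "093.3 030.0", "datetime_utc": datetime(2010, 1, 20, 6)},
]
samples = [
    {"sta_key": "090.0 060.0", "datetime_utc": datetime(2010, 1, 16, 0)},  # inside window
    {"sta_key": "093.3 030.0", "datetime_utc": datetime(2010, 1, 25, 6)},  # 5 days off
    {"sta_key": "093.4 030.0", "datetime_utc": datetime(2010, 1, 20, 6)},  # designation differs
]

matched = [match_cast(s, casts) for s in samples]
match_rate = sum(m is not None for m in matched) / len(matched)
```

The second and third samples carry usable positions in the real data but drop out here, which is the gap a spatial match is meant to close.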

Proposed Solution

Implement a spatial matching pipeline that, given lon + lat, can:

  1. Project to CalCOFI coordinate system → derive line + station → match to site table
  2. Fall back to grid when site-level precision is insufficient (using ST_Intersects with grid polygons, as in ingest_swfsc_ichthyo.qmd)
  3. Populate segment table with cruise track observations — each dataset ingested adds observations to segments, building up the track over time
  4. QA/QC checks: identify unrealistic deviations in space and time along the track, corroborated by other datasets
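Steps 1 and 2 could be prototyped along these lines. This is a sketch under stated assumptions, not the calcofi4db implementation: it swaps the CalCOFI line/station projection for a plain nearest-site search by great-circle distance, and it stands in for ST_Intersects with a planar ray-casting test on lon/lat. The names `assign_spatial_key`, `haversine_km`, `point_in_polygon`, the 5 km threshold, and all coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points (degrees)."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def point_in_polygon(lon, lat, ring):
    """Planar ray-casting test; ring is a list of (lon, lat) vertices.
    A stand-in for ST_Intersects, adequate only for small, simple cells."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_spatial_key(lon, lat, sites, grid, max_dist_km):
    """Try a site match first, then a grid cell, else flag as out-of-grid."""
    nearest = min(sites, key=lambda s: haversine_km(lon, lat, s["lon"], s["lat"]))
    dist = haversine_km(lon, lat, nearest["lon"], nearest["lat"])
    if dist <= max_dist_km:
        return {"level": "site", "key": nearest["site_key"], "dist_km": dist}
    for cell in grid:
        if point_in_polygon(lon, lat, cell["ring"]):
            return {"level": "grid", "key": cell["grid_key"], "dist_km": None}
    return {"level": "out_of_grid", "key": None, "dist_km": None}

# Illustrative coordinates only; not real CalCOFI site or grid positions.
sites = [
    {"site_key": "A", "lon": -120.0, "lat": 33.0},
    {"site_key": "B", "lon": -121.0, "lat": 32.5},
]
grid = [
    {"grid_key": "cell-1",
     "ring": [(-121.5, 32.0), (-119.5, 32.0), (-119.5, 33.5), (-121.5, 33.5)]},
]
points = [(-120.01, 33.0), (-120.5, 33.3), (-125.0, 30.0)]
results = [assign_spatial_key(lon, lat, sites, grid, max_dist_km=5.0)
           for lon, lat in points]
levels = [r["level"] for r in results]
```

Returning the match level alongside the key keeps the "fall back or flag" decision visible downstream, so out-of-grid rows can be reviewed rather than silently dropped.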

Key Questions

  • What distance threshold to use for site matching? (Currently ingest_calcofi_ctd-cast.qmd uses max_dist_dec_lnst_km)
  • Should we always fall back to grid if site match fails, or flag as out-of-grid?
  • How to define segment boundaries along cruise tracks?
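One possible answer to the segment question, offered as a sketch rather than a settled design: treat each pair of time-consecutive observations on a cruise as a segment, and flag segments whose implied vessel speed is implausible. The function names, the record layout, and the 30 km/h ceiling are assumptions chosen for illustration; a real threshold would need to be set from known vessel speeds.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points (degrees)."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def build_segments(observations, max_speed_kmh=30.0):
    """Link time-ordered observations into segments and flag any segment
    whose implied speed exceeds max_speed_kmh (hypothetical threshold)."""
    obs = sorted(observations, key=lambda o: o["datetime_utc"])
    segments = []
    for a, b in zip(obs, obs[1:]):
        dist_km = haversine_km(a["lon"], a["lat"], b["lon"], b["lat"])
        hours = (b["datetime_utc"] - a["datetime_utc"]).total_seconds() / 3600
        if hours > 0:
            speed = dist_km / hours
        else:
            # Same timestamp: fine if co-located, impossible otherwise.
            speed = 0.0 if dist_km == 0 else math.inf
        segments.append({
            "from": a["obs_id"], "to": b["obs_id"],
            "dist_km": dist_km, "speed_kmh": speed,
            "flag": speed > max_speed_kmh,
        })
    return segments

# Toy track; positions and times are invented for the example.
track = [
    {"obs_id": "dic-1", "lon": -120.0, "lat": 33.0, "datetime_utc": datetime(2010, 1, 15, 0)},
    {"obs_id": "ctd-7", "lon": -120.5, "lat": 33.0, "datetime_utc": datetime(2010, 1, 15, 6)},
    {"obs_id": "dic-2", "lon": -123.0, "lat": 33.0, "datetime_utc": datetime(2010, 1, 15, 7)},
]
segments = build_segments(track)
flags = [s["flag"] for s in segments]
```

Because observations from different datasets share one time-ordered track, a flagged segment points at whichever record disagrees with its neighbors, which is the cross-dataset corroboration described in step 4.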

Datasets Affected

  • calcofi_dic (current — 75% unmatched)
  • All future ingest workflows (euphausiids, phytoplankton, zooplankton, etc.)

References

  • ingest_swfsc_ichthyo.qmd — site/grid creation and assignment via add_point_geom() + assign_grid_key()
  • ingest_calcofi_ctd-cast.qmd — distance-based filtering to line/station with max_dist_dec_lnst_km
  • calcofi4db::assign_grid_key() — ST_Intersects spatial join
  • calcofi4r::cc_grid / cc_grid_ctrs — CalCOFI grid polygons and centers

Acceptance Criteria

  • Function to project lon/lat → CalCOFI line/station (or nearest site)
  • Reusable assign_site_key() function in calcofi4db
  • Grid fallback for out-of-site observations
  • Segment table populated incrementally by each ingest
  • DIC ingest match rate improves from the current 24.7% to the target of >80%
  • Pattern documented in /ingest-new skill template

Metadata

    Labels: ingest (Data ingestion workflow), must-complete (Contract deliverable - must be completed)
