Problem
The CalCOFI DIC ingest (ingest_calcofi_dic.qmd) matches only 24.7% of samples to casts via sta_key + datetime_utc (±3 day window). The remaining ~75% of samples have valid lat/lon but no link to existing casts because:
- Many DIC cruises pre-date or differ from the calcofi.org bottle database coverage
- Date offsets between NCEI and calcofi.org records sometimes exceed 3 days
- Some stations may use slightly different line/station designations
This will be a recurring problem for every new dataset ingested (euphausiids, phytoplankton, zooplankton, etc.).
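To make the failure modes concrete, here is a minimal Python sketch of a key-plus-time-window match. It is not the code in ingest_calcofi_dic.qmd: the Cast record, the match_cast() helper, the sta_key format and the sample values are all hypothetical. Only sta_key, datetime_utc and the ±3 day window come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Cast:
    cast_id: int            # hypothetical identifier
    sta_key: str            # station key; the format used here is made up
    datetime_utc: datetime


def match_cast(sta_key: str, when: datetime, casts: List[Cast],
               window: timedelta = timedelta(days=3)) -> Optional[Cast]:
    """Return the cast at the same station that is closest in time,
    or None if no cast falls within the window."""
    candidates = [
        c for c in casts
        if c.sta_key == sta_key and abs(c.datetime_utc - when) <= window
    ]
    return min(candidates, key=lambda c: abs(c.datetime_utc - when), default=None)


casts = [Cast(1, "093.3 030.0", datetime(2010, 1, 15, 6, 0))]

# Same station, under 2 days apart: matched.
hit = match_cast("093.3 030.0", datetime(2010, 1, 17), casts)
# Same station, nearly 5 days apart: outside the window, so unmatched.
late = match_cast("093.3 030.0", datetime(2010, 1, 20), casts)
# Same place and time, but a slightly different designation: unmatched.
renamed = match_cast("093.30 030.00", datetime(2010, 1, 15, 6, 0), casts)
```

Neither unmatched sample is wrong in space. Each fails on the key or the date alone, which is why a match driven by lon + lat is worth adding.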
Proposed Solution
Implement a spatial matching pipeline that, given lon + lat, can:
- Project to CalCOFI coordinate system → derive line + station → match to the site table
- Fall back to the grid when site-level precision is insufficient (using ST_Intersects with grid polygons, as in ingest_swfsc_ichthyo.qmd)
- Populate the segment table with cruise track observations: each dataset ingested adds observations to segments, building up the track over time
- Run QA/QC checks: identify unrealistic deviations in space and time along the track, corroborated by other datasets
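The site-then-grid cascade could look roughly like the Python sketch below. It rests on several assumptions: it swaps the CalCOFI line/station projection for a plain great-circle (haversine) distance to site coordinates, it stands in for ST_Intersects with a planar ray-casting test, and assign_keys(), the sites and grid inputs and the 5 km threshold are hypothetical placeholders rather than anything in calcofi4db.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in km between two lon/lat points (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def point_in_polygon(lon: float, lat: float, ring: List[Tuple[float, float]]) -> bool:
    """Planar ray-casting test. Unlike ST_Intersects, points exactly on the
    boundary are not guaranteed to count as inside."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_keys(lon: float, lat: float,
                sites: Dict[str, Tuple[float, float]],
                grid: Dict[str, List[Tuple[float, float]]],
                max_dist_km: float) -> dict:
    """Try the nearest site first, then a grid cell, else flag as out-of-grid."""
    if sites:
        key, (slon, slat) = min(
            sites.items(),
            key=lambda kv: haversine_km(lon, lat, kv[1][0], kv[1][1]),
        )
        dist = haversine_km(lon, lat, slon, slat)
        if dist <= max_dist_km:
            return {"level": "site", "key": key, "dist_km": dist}
    for grid_key, ring in grid.items():
        if point_in_polygon(lon, lat, ring):
            return {"level": "grid", "key": grid_key, "dist_km": None}
    return {"level": "out_of_grid", "key": None, "dist_km": None}


# Toy inputs, not real CalCOFI sites or grid cells.
sites = {"A": (-120.0, 34.0)}
grid = {"G1": [(-121.0, 33.0), (-119.0, 33.0), (-119.0, 35.0), (-121.0, 35.0)]}

near = assign_keys(-120.01, 34.0, sites, grid, max_dist_km=5.0)  # within 5 km of site A
far = assign_keys(-120.5, 34.5, sites, grid, max_dist_km=5.0)    # too far, but inside G1
outside = assign_keys(-125.0, 40.0, sites, grid, max_dist_km=5.0)
```

Returning the match level with the key keeps site-level and grid-level links distinguishable downstream, and leaves out-of-grid points flagged rather than silently dropped.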
Key Questions
- What distance threshold to use for site matching? (Currently ingest_calcofi_ctd-cast.qmd uses max_dist_dec_lnst_km)
- Should we always fall back to grid if site match fails, or flag as out-of-grid?
- How to define segment boundaries along cruise tracks?
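On segment boundaries, one possible convention (an assumption, not a decision) is that a segment joins two consecutive observations on the same cruise in time order, regardless of which dataset contributed them. The Python sketch below builds segments that way and flags any whose implied speed is implausible. TrackPoint, build_segments(), the field names and the 30 km/h limit are hypothetical placeholders.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass(frozen=True)
class TrackPoint:
    cruise_key: str
    datetime_utc: datetime
    lon: float
    lat: float
    dataset: str  # which ingest contributed this observation


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in km between two lon/lat points (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def build_segments(points: List[TrackPoint], max_speed_kmh: float = 30.0) -> List[dict]:
    """Link consecutive observations per cruise and flag implausible speeds."""
    by_cruise: Dict[str, List[TrackPoint]] = {}
    for p in points:
        by_cruise.setdefault(p.cruise_key, []).append(p)

    segments = []
    for cruise_key, pts in by_cruise.items():
        pts.sort(key=lambda p: p.datetime_utc)
        for a, b in zip(pts, pts[1:]):
            hours = (b.datetime_utc - a.datetime_utc).total_seconds() / 3600
            km = haversine_km(a.lon, a.lat, b.lon, b.lat)
            if hours > 0:
                speed = km / hours
            else:
                # Same timestamp: any displacement at all is implausible.
                speed = math.inf if km > 0 else 0.0
            segments.append({
                "cruise_key": cruise_key,
                "from": a, "to": b,
                "km": km, "speed_kmh": speed,
                "suspect": speed > max_speed_kmh,
            })
    return segments


# Toy track: two CTD observations plus one DIC sample with a bad position.
t0 = datetime(2010, 1, 15, 0, 0)
track = [
    TrackPoint("C1", t0, -120.0, 34.0, "ctd"),
    TrackPoint("C1", t0 + timedelta(hours=6), -120.5, 34.0, "ctd"),
    TrackPoint("C1", t0 + timedelta(hours=7), -123.0, 34.0, "dic"),
]
segments = build_segments(track)
flags = [s["suspect"] for s in segments]
```

Because every dataset adds points to the same per-cruise track, a position or timestamp error in one dataset shows up as a suspect segment against its neighbours from the others.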
Datasets Affected
- calcofi_dic (current, 75% unmatched)
- All future ingest workflows (euphausiids, phytoplankton, zooplankton, etc.)
References
- ingest_swfsc_ichthyo.qmd: site/grid creation and assignment via add_point_geom() + assign_grid_key()
- ingest_calcofi_ctd-cast.qmd: distance-based filtering to line/station with max_dist_dec_lnst_km
- calcofi4db::assign_grid_key(): ST_Intersects spatial join
- calcofi4r::cc_grid / cc_grid_ctrs: CalCOFI grid polygons and centers
Acceptance Criteria
- assign_site_key() function in calcofi4db
- /ingest-new skill template