@@ -59,11 +59,11 @@ Based on the dataset characteristics (from `/explore-dataset` output or user inp
 - Full spatial/grid assignment
 - Example: euphausiids, zooplankton
 
-**Pattern B: Merge into existing table** (like DIC → bottle_measurement)
-- Appends rows to an existing table
-- Uses existing PKs from the target table
-- Joins via FK to existing casts/bottles
-- Example: DIC measurements → bottle_measurement
+**Pattern B: Supplementary measurements** (like DIC)
+- Creates own `{dataset}_sample` (position-only) and `{dataset}_measurement` tables
+- Matches to existing casts/bottles via station + date window
+- Keeps tables separate from bottle_measurement (different QC pipelines)
+- Example: DIC/TA → dic_sample + dic_measurement + dic_measurement_summary
 
 **Pattern C: Multi-source ingest** (like phytoplankton)
 - Reads from multiple source formats (CSV, API, etc.)
@@ -89,25 +89,65 @@ The notebook includes these sections (customize based on pattern):
 6. **Show source files** — `show_source_files()`
 7. **Show tables/fields** — Redefinition display
 8. **Load into database** — `ingest_dataset()` or custom load
-9. **Schema documentation** — dm visualization
+9. **Schema documentation** — Define PKs/FKs via `dm_add_pk()`/`dm_add_fk()`,
+   color-code tables (`lightblue` = new tables, `lightyellow` = amended
+   reference tables like measurement_type, `white` = shared metadata like
+   dataset), draw with `dm_draw()`, then write `relationships.json`
+   sidecar via `build_relationships_json()` for use in release_database.qmd
 10. **Validate** — `validate_for_release()`
 11. **Enforce column types** — `enforce_column_types()`
-12. **Data preview** — `preview_tables()`
+12. **Data preview** — Individual `datatable()` calls per table (NOT
+    `preview_tables()` in a loop, which has DT rendering issues)
 13. **Write parquet** — `write_parquet_outputs()`
-14. **Write metadata** — `build_metadata_json()`, `build_relationships_json()`
+14. **Write metadata** — `build_metadata_json()`
 15. **Upload to GCS** — `sync_to_gcs()`
 16. **Cleanup** — Close DuckDB connection
 
 #### Conditional sections:
 - **Cross-dataset loading** — `load_prior_tables()` (if depends on prior ingest)
 - **Primary key setup** — `assign_deterministic_uuids()` or `assign_sequential_ids()`
 - **Pivot measurements** — Wide→long transformation (if `--has-pivot`)
+- **Measurement summary** — Aggregate replicates with avg/stddev per unique
+  position (station + date + depth + measurement_type). Filter out invalid
+  values: `WHERE NOT isnan(measurement_value) AND isfinite(measurement_value)`.
+  Use `STDDEV_SAMP()` with `CASE WHEN COUNT(*) = 1 THEN 0` for single
+  observations. See `ctd_summary` in `ingest_calcofi_ctd-cast.qmd` and
+  `dic_measurement_summary` in `ingest_calcofi_dic.qmd` for examples.
 - **Taxonomy** — `standardize_species_local()`, `build_taxon_hierarchy()` (if `--has-taxonomy`)
-- **Spatial** — `add_point_geom()`, `assign_grid_key()` (if has lat/lon)
+- **Spatial** — `add_point_geom()`, `assign_grid_key()` (if has lat/lon).
+  For datasets without direct cast_id/bottle_id FKs, match via station +
+  date window (±3 days) or lat/lon spatial join. See issue #47 for the
+  site/grid/segment matching roadmap.
 - **Lookup tables** — `create_lookup_table()` (if categorical vocabularies exist)
 - **Ship/cruise matching** — `derive_cruise_key_on_casts()` (if cross-dataset bridge needed)
 
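The measurement-summary aggregation the diff describes runs in DuckDB SQL in the notebooks; as a sketch of the same logic, here is a base-R illustration with invented data (station values and replicate numbers are made up for the example):

```r
# Hypothetical replicates: two values at one position, one singleton,
# plus a NaN that should be dropped (mirrors the SQL filter
# `WHERE NOT isnan(measurement_value) AND isfinite(measurement_value)`)
d <- data.frame(
  station_id        = c("93.3 30.0", "93.3 30.0", "90.0 45.0", "90.0 45.0"),
  measurement_type  = c("dic", "dic", "dic", "dic"),
  measurement_value = c(2010.1, 2011.9, 1995.0, NaN)
)

# Drop NaN/Inf values before aggregating
d <- d[is.finite(d$measurement_value), ]

# Aggregate replicates per position: count, mean, and sample SD,
# with SD forced to 0 for single observations (like the SQL
# `CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDDEV_SAMP(...) END`)
smry <- do.call(rbind, lapply(
  split(d, d[c("station_id", "measurement_type")], drop = TRUE),
  function(g) data.frame(
    station_id       = g$station_id[1],
    measurement_type = g$measurement_type[1],
    n_obs            = nrow(g),
    value_avg        = mean(g$measurement_value),
    value_sd         = if (nrow(g) == 1) 0 else sd(g$measurement_value)
  )
))
```

The singleton guard matters because `STDDEV_SAMP()` (and R's `sd()`) is undefined for n = 1 and would otherwise yield NULL/NA in the summary table.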
-### 5. Mark dataset-specific sections
+### 5. Coding conventions
+
+**Tidy data**: Apply tidy data principles throughout:
+- The base `{dataset}_sample` table has only position/time/FK columns —
+  NO measurement values as separate columns
+- ALL measurements (including ancillary ones like temp, salinity) are
+  pivoted into `{dataset}_measurement` with columns:
+  `measurement_type`, `measurement_value`, `measurement_qual`
+- Each row = one measurement at one position. Never mix different
+  measured quantities on the same row.
+- Example: DIC dataset pivots 4 types (dic, alkalinity, ctdtemp_its90,
+  salinity_pss78) into `dic_measurement` — `dic_sample` has zero
+  measurement columns.
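The wide-to-long pivot this convention requires can be sketched in base R with invented column names (a real notebook would more likely use `tidyr::pivot_longer()`; the sample IDs and values below are illustrative):

```r
# Hypothetical wide table: one row per sample, one column per quantity
wide <- data.frame(
  sample_id  = c("s1", "s2"),
  dic        = c(2010.1, 1995.0),
  alkalinity = c(2250.3, 2240.8)
)

# Pivot every measurement column into long form:
# one row = one measurement at one position
meas_cols <- c("dic", "alkalinity")
long <- do.call(rbind, lapply(meas_cols, function(m) data.frame(
  sample_id         = wide$sample_id,
  measurement_type  = m,
  measurement_value = wide[[m]]
)))
```

After the pivot, `wide` would be reduced to position/FK columns only and `long` becomes the `{dataset}_measurement` table.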
+
+**Status output**: Use `cat()` (not `message()`) for user-facing status
+output in chunks. `message()` sends to stderr, which Quarto may not
+render visibly with `code-fold: true`. Pattern:
+```r
+cat(glue("label: {value}"), "\n")
+```
+
+**Data preview**: Use individual `datatable()` calls per table in
+separate chunks (one chunk per table). Do NOT use `preview_tables()`
+in a loop — it has DT widget rendering issues where only the first
+table displays.
+
+### 6. Mark dataset-specific sections
 
 In the generated notebook, mark sections requiring manual implementation with:
 
@@ -117,7 +157,59 @@ In the generated notebook, mark sections requiring manual implementation with:
 # - {specific guidance based on dataset characteristics}
 ```
 
-### 6. Update `_targets.R`
+### 6. Include dataset metadata and release_database update
+
+Every ingest notebook MUST include the following standard steps:
+
+**a. Load Dataset Metadata** — Load `metadata/dataset.csv` into the
+wrangling DB so it's included in the parquet output and flows into
+`release_database.qmd`:
+
+```r
+d_dataset <- read_csv(here("metadata/dataset.csv"))
+dbWriteTable(con, "dataset", d_dataset, overwrite = TRUE)
+```
+
+**b. CalCOFI.org page check** — Before ingesting, scrape the CalCOFI.org
+landing page for the dataset (from `link_calcofi_org` in `dataset.csv`)
+to check for updated data, new download links, or changed metadata.
+
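One possible shape for that check, sketched in base R: extract data-file links from the fetched page HTML and compare them against the links used at the last ingest. The helper name and file-extension pattern are illustrative, not part of the repo; a real implementation might use rvest instead of regex:

```r
# Illustrative helper: pull data-file links out of a landing page's HTML.
# `html` is the page source as a single string (e.g. pasted readLines() output).
get_download_links <- function(html) {
  hrefs <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  hrefs <- sub('^href="', "", sub('"$', "", hrefs))
  # keep only links that look like downloadable data files
  hrefs[grepl("\\.(csv|zip|xlsx?)$", hrefs, ignore.case = TRUE)]
}

# usage on an invented snippet:
page  <- '<a href="bottle.csv">data</a> <a href="/about">about</a>'
links <- get_download_links(page)
```

A diff of `links` against the previously ingested set flags new or renamed source files before the notebook runs.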
+**c. Update `release_database.qmd`** — Add the new dataset's parquet
+directory and relationships.json path to the release workflow:
+
+```r
+# in release_database.qmd, add to parquet_dirs:
+parquet_dirs <- c(
+  ...,
+  here("data/parquet/{provider}_{dataset}")
+)
+
+# and to rels_paths:
+rels_paths <- c(
+  ...,
+  here("data/parquet/{provider}_{dataset}/relationships.json")
+)
+```
+
+Also add the dataset's tables to the color grouping section and update
+the release notes data sources list.
+
+### 7. Provider naming convention
+
+The `provider` value represents the **organization curating the data**,
+not the data portal where it's hosted:
+
+| Provider | Organization | Example datasets |
+|----------|--------------|------------------|
+| `calcofi` | CalCOFI program | bottle, ctd-cast, dic |
+| `swfsc` | NOAA SWFSC | ichthyo |
+| `pic` | SIO Pelagic Invertebrates Collection | zooplankton |
+| `sccoos` | SCCOOS | underway |
+
+Data portals (NCEI, EDI, ERDDAP) are recorded in `link_data_source`
+in `metadata/dataset.csv`, not in the provider name.
+
+### 8. Update `_targets.R`
 
 Add a new target entry for the ingest workflow:
 
@@ -136,7 +228,7 @@ tar_target(
 
 Insert it in the correct dependency order (after its `depends_on` target, before `release_database`).
 
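A sketch of the shape of such an entry; the target name, render command, and remaining arguments are illustrative, so copy the exact argument pattern from an existing ingest target in `_targets.R`:

```r
# in _targets.R — illustrative placement, not the repo's exact entry
tar_target(
  ingest_provider_dataset,                                  # hypothetical target name
  quarto::quarto_render("ingest_{provider}_{dataset}.qmd"), # render the ingest notebook
  ...                                                       # remaining args per existing ingest targets
)
```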
-### 7. Present results
+### 9. Present results
 
 Show the user:
 - Created file path