
Commit 32401f0

+ ingest_calcofi_dic; + .claude/skills
1 parent 3b1352b commit 32401f0

94 files changed: 28,088 additions & 5,942 deletions


.claude/skills/explore-dataset.md

Lines changed: 18 additions & 2 deletions
@@ -48,7 +48,23 @@ When the user invokes this skill, run the R script `scripts/explore_dataset.R` a
    - Measurement columns that could map to `measurement_type.csv`
    - Data quality flags (duplicates, outliers, encoding issues)
 
-4. **Generate recommendations**:
+4. **Scrape CalCOFI.org landing page**:
+   - Use `WebFetch` on the CalCOFI.org page for the dataset (e.g.,
+     `https://calcofi.org/data/oceanographic-data/{dataset}/`) to check
+     for updated data, download links, methodology notes, and citations.
+   - If not available, check the data portal landing page (NCEI, EDI, ERDDAP).
+   - Extract: citation, DOI, PI names, temporal/spatial coverage, license.
+
+5. **Determine provider**:
+   - The `provider` is the **organization curating the data**, not the
+     data portal where it's hosted. For example:
+     - Data from CalCOFI → `provider = "calcofi"` (even if hosted on NCEI or EDI)
+     - Data from SWFSC → `provider = "swfsc"`
+     - Data from SIO/PIC → `provider = "pic"`
+   - The data portal (NCEI, EDI, ERDDAP) is recorded in `link_data_source`
+     in the `dataset` metadata table, not in the provider name.
+
+6. **Generate recommendations**:
    - Suggest whether this is an **ingest** (new data) or **publish** (subset of existing data)
    - Recommend table naming following `{dataset}_{table}` convention
    - Identify which existing tables to join against
@@ -59,7 +75,7 @@ When the user invokes this skill, run the R script `scripts/explore_dataset.R` a
    - Taxonomy standardization needed
    - Spatial matching complexity
 
-5. **Output**: Display the markdown report directly in the conversation.
+7. **Output**: Display the markdown report directly in the conversation.
 
 ## Example Output
 
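The provider rule introduced in this diff (provider = curating organization, while the hosting portal goes in `link_data_source`) can be sketched as a tiny lookup. This is an illustrative sketch only, not part of the skill; `resolve_provider` and the organization-name keys are hypothetical:

```python
# Hypothetical sketch: provider comes from the curating organization,
# while the hosting portal is kept separately as link_data_source.
ORG_TO_PROVIDER = {
    "CalCOFI": "calcofi",
    "SWFSC": "swfsc",
    "SIO/PIC": "pic",
}

def resolve_provider(curating_org: str, portal_url: str) -> dict:
    """Return the provider code plus the portal URL as link_data_source."""
    return {
        "provider": ORG_TO_PROVIDER[curating_org],
        # NCEI/EDI/ERDDAP URL goes here, never into the provider name
        "link_data_source": portal_url,
    }

print(resolve_provider("CalCOFI", "https://www.ncei.noaa.gov/"))
```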

.claude/skills/generate-metadata.md

Lines changed: 24 additions & 3 deletions
@@ -131,13 +131,34 @@ cat(paste(mt$measurement_type, collapse = "\n"))
 
 Report which measurements already exist and which need to be added.
 
-### 7. Present results to user
+### 7. Register in `metadata/dataset.csv`
+
+Add a row to the unified `dataset` reference table with:
+
+```csv
+provider,dataset,dataset_name,description,citation_main,citation_others,link_calcofi_org,link_data_source,link_others,tables,coverage_temporal,coverage_spatial,license,pi_names
+```
+
+Fields:
+- `provider`: Organization curating the data (e.g., `calcofi`, `swfsc`, `pic`)
+  — NOT the data portal (NCEI, EDI, ERDDAP)
+- `citation_main`: Primary dataset citation (from DOI or data portal)
+- `link_calcofi_org`: CalCOFI.org landing page for the dataset
+- `link_data_source`: Data portal URL (NCEI accession, EDI package, ERDDAP endpoint)
+- `link_others`: Semicolon-delimited additional links (DOI, publications)
+- `tables`: Semicolon-delimited list of tables contributed to the database
+
+Scrape the CalCOFI.org page and data portal landing page for citation,
+DOI, PI names, and other metadata before filling in this row.
+
+### 8. Present results to user
 
 Show:
 - Created file paths
 - Table mapping summary
 - Field mapping summary with any that need manual review
 - Measurement types to add (if any)
+- Dataset metadata row added to `metadata/dataset.csv`
 - Instructions for next steps:
   1. Review and edit `flds_redefine.csv` (rename decisions, type overrides, include/exclude)
   2. Add new entries to `metadata/measurement_type.csv` if needed
@@ -146,12 +167,12 @@ Show:
 ## Example
 
 ```
-/generate-metadata ncei dic ~/My\ Drive/projects/calcofi/data-public/ncei/dic
+/generate-metadata calcofi dic ~/My\ Drive/projects/calcofi/data-public/calcofi/dic
 ```
 
 Creates:
 ```
-metadata/ncei/dic/
+metadata/calcofi/dic/
 ├── tbls_redefine.csv    # Maps CalCOFI_DIC_data → dic_measurement
 ├── flds_redefine.csv    # Maps DIC, TA, pH → standard names
 └── metadata_derived.csv # (empty, no derived columns needed)
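The `dataset.csv` row described in this diff can be written programmatically. A minimal sketch (Python for illustration; the row values are hypothetical placeholders, only the column order comes from the diff's header):

```python
import csv
import io

# Column order from the dataset.csv header in the diff above.
FIELDS = ["provider", "dataset", "dataset_name", "description",
          "citation_main", "citation_others", "link_calcofi_org",
          "link_data_source", "link_others", "tables",
          "coverage_temporal", "coverage_spatial", "license", "pi_names"]

# Illustrative row: placeholder values, semicolon-delimited tables list.
row = {f: "" for f in FIELDS}
row.update(
    provider="calcofi",
    dataset="dic",
    tables="dic_sample;dic_measurement;dic_measurement_summary")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```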

.claude/skills/ingest-new.md

Lines changed: 104 additions & 12 deletions
@@ -59,11 +59,11 @@ Based on the dataset characteristics (from `/explore-dataset` output or user inp
 - Full spatial/grid assignment
 - Example: euphausiids, zooplankton
 
-**Pattern B: Merge into existing table** (like DIC → bottle_measurement)
-- Appends rows to an existing table
-- Uses existing PKs from the target table
-- Joins via FK to existing casts/bottles
-- Example: DIC measurements → bottle_measurement
+**Pattern B: Supplementary measurements** (like DIC)
+- Creates own `{dataset}_sample` (position-only) and `{dataset}_measurement` tables
+- Matches to existing casts/bottles via station + date window
+- Keeps tables separate from bottle_measurement (different QC pipelines)
+- Example: DIC/TA → dic_sample + dic_measurement + dic_measurement_summary
 
 **Pattern C: Multi-source ingest** (like phytoplankton)
 - Reads from multiple source formats (CSV, API, etc.)
@@ -89,25 +89,65 @@ The notebook includes these sections (customize based on pattern):
 6. **Show source files** — `show_source_files()`
 7. **Show tables/fields** — Redefinition display
 8. **Load into database** — `ingest_dataset()` or custom load
-9. **Schema documentation** — dm visualization
+9. **Schema documentation** — Define PKs/FKs via `dm_add_pk()`/`dm_add_fk()`,
+   color-code tables (`lightblue` = new tables, `lightyellow` = amended
+   reference tables like measurement_type, `white` = shared metadata like
+   dataset), draw with `dm_draw()`, then write the `relationships.json`
+   sidecar via `build_relationships_json()` for use in release_database.qmd
 10. **Validate** — `validate_for_release()`
 11. **Enforce column types** — `enforce_column_types()`
-12. **Data preview** — `preview_tables()`
+12. **Data preview** — Individual `datatable()` calls per table (NOT
+    `preview_tables()` in a loop, which has DT rendering issues)
 13. **Write parquet** — `write_parquet_outputs()`
-14. **Write metadata** — `build_metadata_json()`, `build_relationships_json()`
+14. **Write metadata** — `build_metadata_json()`
 15. **Upload to GCS** — `sync_to_gcs()`
 16. **Cleanup** — Close DuckDB connection
 
 #### Conditional sections:
 - **Cross-dataset loading** — `load_prior_tables()` (if depends on prior ingest)
 - **Primary key setup** — `assign_deterministic_uuids()` or `assign_sequential_ids()`
 - **Pivot measurements** — Wide→long transformation (if `--has-pivot`)
+- **Measurement summary** — Aggregate replicates with avg/stddev per unique
+  position (station + date + depth + measurement_type). Filter out invalid
+  values: `WHERE NOT isnan(measurement_value) AND isfinite(measurement_value)`.
+  Use `STDDEV_SAMP()` with `CASE WHEN COUNT(*) = 1 THEN 0` for single
+  observations. See `ctd_summary` in `ingest_calcofi_ctd-cast.qmd` and
+  `dic_measurement_summary` in `ingest_calcofi_dic.qmd` for examples.
 - **Taxonomy** — `standardize_species_local()`, `build_taxon_hierarchy()` (if `--has-taxonomy`)
-- **Spatial** — `add_point_geom()`, `assign_grid_key()` (if has lat/lon)
+- **Spatial** — `add_point_geom()`, `assign_grid_key()` (if has lat/lon).
+  For datasets without direct cast_id/bottle_id FKs, match via station +
+  date window (±3 days) or lat/lon spatial join. See issue #47 for the
+  site/grid/segment matching roadmap.
 - **Lookup tables** — `create_lookup_table()` (if categorical vocabularies exist)
 - **Ship/cruise matching** — `derive_cruise_key_on_casts()` (if cross-dataset bridge needed)
 
-### 5. Mark dataset-specific sections
+### 5. Coding conventions
+
+**Tidy data**: Apply tidy data principles throughout:
+- The base `{dataset}_sample` table has only position/time/FK columns —
+  NO measurement values as separate columns
+- ALL measurements (including ancillary ones like temp, salinity) are
+  pivoted into `{dataset}_measurement` with columns:
+  `measurement_type`, `measurement_value`, `measurement_qual`
+- Each row = one measurement at one position. Never mix different
+  measured quantities on the same row.
+- Example: DIC dataset pivots 4 types (dic, alkalinity, ctdtemp_its90,
+  salinity_pss78) into `dic_measurement` — `dic_sample` has zero
+  measurement columns.
+
+**Status output**: Use `cat()` (not `message()`) for user-facing status
+output in chunks. `message()` sends to stderr, which Quarto may not
+render visibly with `code-fold: true`. Pattern:
+```r
+cat(glue("label: {value}"), "\n")
+```
+
+**Data preview**: Use individual `datatable()` calls per table in
+separate chunks (one chunk per table). Do NOT use `preview_tables()`
+in a loop — it has DT widget rendering issues where only the first
+table displays.
+
+### 6. Mark dataset-specific sections
 
 In the generated notebook, mark sections requiring manual implementation with:
 
@@ -117,7 +157,59 @@ In the generated notebook, mark sections requiring manual implementation with:
 # - {specific guidance based on dataset characteristics}
 ```
 
-### 6. Update `_targets.R`
+### 6. Include dataset metadata and release_database update
+
+Every ingest notebook MUST include the following standard sections:
+
+**a. Load Dataset Metadata** — Load `metadata/dataset.csv` into the
+wrangling DB so it's included in the parquet output and flows into
+`release_database.qmd`:
+
+```r
+d_dataset <- read_csv(here("metadata/dataset.csv"))
+dbWriteTable(con, "dataset", d_dataset, overwrite = TRUE)
+```
+
+**b. CalCOFI.org page check** — Before ingesting, scrape the CalCOFI.org
+landing page for the dataset (from `link_calcofi_org` in `dataset.csv`)
+to check for updated data, new download links, or changed metadata.
+
+**c. Update `release_database.qmd`** — Add the new dataset's parquet
+directory and relationships.json path to the release workflow:
+
+```r
+# in release_database.qmd, add to parquet_dirs:
+parquet_dirs <- c(
+  ...,
+  here("data/parquet/{provider}_{dataset}")
+)
+
+# and to rels_paths:
+rels_paths <- c(
+  ...,
+  here("data/parquet/{provider}_{dataset}/relationships.json")
+)
+```
+
+Also add the dataset's tables to the color grouping section and update
+the release notes data sources list.
+
+### 7. Provider naming convention
+
+The `provider` value represents the **organization curating the data**,
+not the data portal where it's hosted:
+
+| Provider | Organization | Example datasets |
+|----------|-------------|------------------|
+| `calcofi` | CalCOFI program | bottle, ctd-cast, dic |
+| `swfsc` | NOAA SWFSC | ichthyo |
+| `pic` | SIO Pelagic Invertebrates Collection | zooplankton |
+| `sccoos` | SCCOOS | underway |
+
+Data portals (NCEI, EDI, ERDDAP) are recorded in `link_data_source`
+in `metadata/dataset.csv`, not in the provider name.
+
+### 8. Update `_targets.R`
 
 Add a new target entry for the ingest workflow:
 
@@ -136,7 +228,7 @@ tar_target(
 
 Insert it in the correct dependency order (after its `depends_on` target, before `release_database`).
 
-### 7. Present results
+### 9. Present results
 
 Show the user:
 - Created file path
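The station + date-window matching described for Pattern B and the Spatial section can be sketched as follows (Python for illustration; `match_cast`, the ±3-day window constant, and the sample records are hypothetical stand-ins for the actual SQL join):

```python
from datetime import date, timedelta

# ±3-day window per the Spatial bullet in the diff above.
WINDOW = timedelta(days=3)

def match_cast(sample, casts, window=WINDOW):
    """Return cast_ids whose station matches the sample and whose
    date falls within the given window of the sample date."""
    return [
        c["cast_id"] for c in casts
        if c["sta_key"] == sample["sta_key"]
        and abs(c["date"] - sample["date"]) <= window
    ]

casts = [
    {"cast_id": 1, "sta_key": "090.0_060.0", "date": date(2021, 4, 10)},
    {"cast_id": 2, "sta_key": "090.0_060.0", "date": date(2021, 4, 20)},
]
sample = {"sta_key": "090.0_060.0", "date": date(2021, 4, 12)}
print(match_cast(sample, casts))  # only cast 1 is within ±3 days
```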

.claude/skills/templates/ingest_template.qmd

Lines changed: 94 additions & 4 deletions
@@ -169,6 +169,41 @@ ingest_dataset(con, d)
 ```
 <!-- {{pivot_section_end}} -->
 
+<!-- {{measurement_summary_section_start}} -->
+## Summarize Replicate Measurements
+
+Aggregate replicate measurements at each unique position in time and
+space (station + date + depth) into mean and standard deviation.
+Filters out `NaN`, `-Inf`, `Inf` values. See `ctd_summary` in
+`ingest_calcofi_ctd-cast.qmd` and `dic_measurement_summary` in
+`ingest_calcofi_dic.qmd` for production examples.
+
+```{r}
+#| label: TODO-measurement-summary
+
+# TODO: adjust table names and grouping columns for this dataset
+# dbExecute(
+#   con,
+#   "CREATE OR REPLACE TABLE {{dataset}}_measurement_summary AS
+#    SELECT
+#      sta_key,
+#      datetime_utc,
+#      depth_m,
+#      measurement_type,
+#      AVG(measurement_value) AS avg,
+#      CASE
+#        WHEN COUNT(*) = 1 THEN 0
+#        ELSE COALESCE(STDDEV_SAMP(measurement_value), 0)
+#      END AS stddev,
+#      COUNT(*) AS n_obs
+#    FROM {{dataset}}_measurement
+#    WHERE NOT isnan(measurement_value)
+#      AND isfinite(measurement_value)
+#    GROUP BY sta_key, datetime_utc, depth_m, measurement_type"
+# )
+```
+<!-- {{measurement_summary_section_end}} -->
+
 <!-- {{cross_dataset_section_start}} -->
 ## Cross-Dataset Integration
 
@@ -193,13 +228,53 @@ load_prior_tables(
 
 ## Schema Documentation
 
+Define PKs, FKs, and color-code tables for the ER diagram.
+Color scheme: lightblue = new dataset tables, lightyellow = amended
+reference tables (e.g. measurement_type), white = shared metadata.
+
 ```{r}
 #| label: schema
 
-# build dm object for visualization
-tables <- dbListTables(con)
-dm_obj <- dm_from_con(con, tables, learn_keys = FALSE)
-dm_draw(dm_obj, rankdir = "LR", view_type = "all")
+# TODO: define PK/FK relationships for this dataset
+add_{{dataset}}_keys <- function(dm) {
+  dm |>
+    dm_add_pk({{dataset}}_measurement, {{dataset}}_measurement_id) |>
+    dm_add_pk(measurement_type, measurement_type) |>
+    dm_add_fk({{dataset}}_measurement, measurement_type, measurement_type)
+  # add more FKs as needed
+}
+
+# build dm from dataset-specific tables (exclude loaded reference tables)
+{{dataset}}_tables <- c(
+  "{{dataset}}_sample", "{{dataset}}_measurement",
+  "{{dataset}}_measurement_summary",
+  "measurement_type", "dataset")
+dm_{{dataset}} <- dm_from_con(
+  con, table_names = {{dataset}}_tables, learn_keys = FALSE) |>
+  add_{{dataset}}_keys() |>
+  dm_set_colors(
+    lightblue = c({{dataset}}_sample, {{dataset}}_measurement,
+                  {{dataset}}_measurement_summary),
+    lightyellow = measurement_type,
+    white = dataset
+  )
+
+dm_draw(dm_{{dataset}}, rankdir = "LR", view_type = "all")
+```
+
+### Write relationships.json
+
+Write PK/FK sidecar for use in `release_database.qmd` schema merging.
+
+```{r}
+#| label: write-relationships
+
+build_relationships_json(
+  dm = dm_{{dataset}},
+  output_dir = dir_parquet,
+  provider = provider,
+  dataset = dataset
+)
 ```
 
 ## Add Spatial
@@ -212,6 +287,21 @@ dm_draw(dm_obj, rankdir = "LR", view_type = "all")
 # assign_grid_key(con, "{{spatial_table}}")
 ```
 
+## Load Dataset Metadata
+
+Register this dataset in the `dataset` reference table, which tracks
+citations, source URLs, and CalCOFI.org landing pages for all ingested
+datasets. Ensure `metadata/dataset.csv` has a row for this dataset
+before running.
+
+```{r}
+#| label: load-dataset-metadata
+
+d_dataset <- read_csv(here("metadata/dataset.csv"))
+dbWriteTable(con, "dataset", d_dataset, overwrite = TRUE)
+cat(glue("dataset: {nrow(d_dataset)} datasets registered"), "\n")
+```
+
 ## Validate
 
 ```{r}
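The replicate-summary SQL added in this template can be mirrored in plain code to check the aggregation logic (mean, sample stddev with a single observation forced to 0, NaN/Inf filtered). A sketch, Python for illustration; the field names follow the template and the `summarize` helper is hypothetical:

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def summarize(rows):
    """Group replicates by position + measurement_type, drop non-finite
    values, then compute mean and sample stddev (0 when n_obs == 1,
    matching the CASE WHEN in the template's SQL)."""
    groups = defaultdict(list)
    for r in rows:
        v = r["measurement_value"]
        if math.isfinite(v):  # excludes NaN, Inf, -Inf
            key = (r["sta_key"], r["datetime_utc"], r["depth_m"],
                   r["measurement_type"])
            groups[key].append(v)
    return {
        k: {"avg": mean(vals),
            "stddev": stdev(vals) if len(vals) > 1 else 0.0,
            "n_obs": len(vals)}
        for k, vals in groups.items()
    }

# two valid replicates plus one NaN that must be filtered out
rows = [
    {"sta_key": "s1", "datetime_utc": "t", "depth_m": 10,
     "measurement_type": "dic", "measurement_value": v}
    for v in (2000.0, 2002.0, float("nan"))
]
print(summarize(rows))
```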

.claude/skills/validate-ingest.md

Lines changed: 25 additions & 0 deletions
@@ -150,6 +150,31 @@ date_cols <- c("datetime_utc", "date", "cruise_date")
 # check for near-duplicates (same key columns, different values)
 ```
 
+#### I. Measurement Summary Consistency (`summary`)
+```r
+# if a *_measurement_summary table exists, validate:
+# - all summary rows have n_obs >= 1
+# - stddev == 0 when n_obs == 1
+# - no NaN or Inf in avg or stddev columns
+# - summary row count <= measurement row count
+# - measurement types in summary match measurement table
+for (tbl in tables) {
+  if (grepl("_summary$", tbl)) {
+    # check for NaN/Inf in summary values
+    bad_vals <- dbGetQuery(con, glue(
+      "SELECT COUNT(*) FROM {tbl}
+       WHERE isnan(avg) OR NOT isfinite(avg)
+          OR isnan(stddev) OR NOT isfinite(stddev)"))[[1]]
+    if (bad_vals > 0) report_error("summary NaN/Inf", tbl, glue("{bad_vals} rows"))
+    # check stddev = 0 when n_obs = 1
+    bad_stddev <- dbGetQuery(con, glue(
+      "SELECT COUNT(*) FROM {tbl}
+       WHERE n_obs = 1 AND stddev != 0"))[[1]]
+    if (bad_stddev > 0) report_warning("stddev != 0 for n_obs=1", tbl, bad_stddev)
+  }
+}
+```
+
 ### 3. Cross-dataset validation
 
 If prior ingest parquet exists, also validate cross-dataset integrity:
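The summary-consistency checks above can be exercised outside the database as well. A sketch, Python for illustration; `validate_summary` is a hypothetical stand-in for the R loop's `report_error`/`report_warning` flow:

```python
import math

def validate_summary(rows):
    """Collect errors/warnings per the checks in the skill:
    finite avg/stddev, n_obs >= 1, stddev == 0 when n_obs == 1."""
    errors, warnings = [], []
    for i, r in enumerate(rows):
        if not (math.isfinite(r["avg"]) and math.isfinite(r["stddev"])):
            errors.append(f"row {i}: NaN/Inf in avg or stddev")
        if r["n_obs"] < 1:
            errors.append(f"row {i}: n_obs < 1")
        if r["n_obs"] == 1 and r["stddev"] != 0:
            warnings.append(f"row {i}: stddev != 0 for n_obs=1")
    return errors, warnings

rows = [
    {"avg": 2001.0, "stddev": 1.41, "n_obs": 2},      # ok
    {"avg": float("nan"), "stddev": 0.0, "n_obs": 1}, # error: NaN avg
    {"avg": 33.5, "stddev": 0.2, "n_obs": 1},         # warning: stddev != 0
]
errs, warns = validate_summary(rows)
print(errs, warns)
```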
