
Commit 32401f0

+ ingest_calcofi_dic; + .claude/skills
1 parent 3b1352b commit 32401f0

94 files changed: 28,088 additions & 5,942 deletions


.claude/skills/explore-dataset.md

Lines changed: 18 additions & 2 deletions
@@ -48,7 +48,23 @@ When the user invokes this skill, run the R script `scripts/explore_dataset.R` a
    - Measurement columns that could map to `measurement_type.csv`
    - Data quality flags (duplicates, outliers, encoding issues)
 
-4. **Generate recommendations**:
+4. **Scrape CalCOFI.org landing page**:
+   - Use `WebFetch` on the CalCOFI.org page for the dataset (e.g.,
+     `https://calcofi.org/data/oceanographic-data/{dataset}/`) to check
+     for updated data, download links, methodology notes, and citations.
+   - If not available, check the data portal landing page (NCEI, EDI, ERDDAP).
+   - Extract: citation, DOI, PI names, temporal/spatial coverage, license.
+
+5. **Determine provider**:
+   - The `provider` is the **organization curating the data**, not the
+     data portal where it's hosted. For example:
+     - Data from CalCOFI → `provider = "calcofi"` (even if hosted on NCEI or EDI)
+     - Data from SWFSC → `provider = "swfsc"`
+     - Data from SIO/PIC → `provider = "pic"`
+   - The data portal (NCEI, EDI, ERDDAP) is recorded in `link_data_source`
+     in the `dataset` metadata table, not in the provider name.
+
+6. **Generate recommendations**:
    - Suggest whether this is an **ingest** (new data) or **publish** (subset of existing data)
    - Recommend table naming following `{dataset}_{table}` convention
    - Identify which existing tables to join against
@@ -59,7 +75,7 @@ When the user invokes this skill, run the R script `scripts/explore_dataset.R` a
    - Taxonomy standardization needed
    - Spatial matching complexity
 
-5. **Output**: Display the markdown report directly in the conversation.
+7. **Output**: Display the markdown report directly in the conversation.
 
 ## Example Output
 
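The provider rule introduced in this diff (provider = curating organization, while the hosting portal goes in `link_data_source`) can be sketched as a tiny lookup. This is an illustrative sketch only, not part of the skill; `resolve_provider` and the organization-name keys are hypothetical:

```python
# Hypothetical sketch: provider comes from the curating organization,
# while the hosting portal is kept separately as link_data_source.
ORG_TO_PROVIDER = {
    "CalCOFI": "calcofi",
    "SWFSC": "swfsc",
    "SIO/PIC": "pic",
}

def resolve_provider(curating_org: str, portal_url: str) -> dict:
    """Return the provider code plus the portal URL as link_data_source."""
    return {
        "provider": ORG_TO_PROVIDER[curating_org],
        # NCEI/EDI/ERDDAP URL goes here, never into the provider name
        "link_data_source": portal_url,
    }

print(resolve_provider("CalCOFI", "https://www.ncei.noaa.gov/"))
```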

.claude/skills/generate-metadata.md

Lines changed: 24 additions & 3 deletions
@@ -131,13 +131,34 @@ cat(paste(mt$measurement_type, collapse = "\n"))
 
 Report which measurements already exist and which need to be added.
 
-### 7. Present results to user
+### 7. Register in `metadata/dataset.csv`
+
+Add a row to the unified `dataset` reference table with:
+
+```csv
+provider,dataset,dataset_name,description,citation_main,citation_others,link_calcofi_org,link_data_source,link_others,tables,coverage_temporal,coverage_spatial,license,pi_names
+```
+
+Fields:
+- `provider`: Organization curating the data (e.g., `calcofi`, `swfsc`, `pic`)
+  — NOT the data portal (NCEI, EDI, ERDDAP)
+- `citation_main`: Primary dataset citation (from DOI or data portal)
+- `link_calcofi_org`: CalCOFI.org landing page for the dataset
+- `link_data_source`: Data portal URL (NCEI accession, EDI package, ERDDAP endpoint)
+- `link_others`: Semicolon-delimited additional links (DOI, publications)
+- `tables`: Semicolon-delimited list of tables contributed to the database
+
+Scrape the CalCOFI.org page and data portal landing page for citation,
+DOI, PI names, and other metadata before filling in this row.
+
+### 8. Present results to user
 
 Show:
 - Created file paths
 - Table mapping summary
 - Field mapping summary with any that need manual review
 - Measurement types to add (if any)
+- Dataset metadata row added to `metadata/dataset.csv`
 - Instructions for next steps:
   1. Review and edit `flds_redefine.csv` (rename decisions, type overrides, include/exclude)
   2. Add new entries to `metadata/measurement_type.csv` if needed
@@ -146,12 +167,12 @@ Show:
 ## Example
 
 ```
-/generate-metadata ncei dic ~/My\ Drive/projects/calcofi/data-public/ncei/dic
+/generate-metadata calcofi dic ~/My\ Drive/projects/calcofi/data-public/calcofi/dic
 ```
 
 Creates:
 ```
-metadata/ncei/dic/
+metadata/calcofi/dic/
 ├── tbls_redefine.csv    # Maps CalCOFI_DIC_data → dic_measurement
 ├── flds_redefine.csv    # Maps DIC, TA, pH → standard names
 └── metadata_derived.csv # (empty, no derived columns needed)
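The `dataset.csv` row described in this diff can be written programmatically. A minimal sketch (Python for illustration; the row values are hypothetical placeholders, only the column order comes from the diff's header):

```python
import csv
import io

# Column order from the dataset.csv header in the diff above.
FIELDS = ["provider", "dataset", "dataset_name", "description",
          "citation_main", "citation_others", "link_calcofi_org",
          "link_data_source", "link_others", "tables",
          "coverage_temporal", "coverage_spatial", "license", "pi_names"]

# Illustrative row: placeholder values, semicolon-delimited tables list.
row = {f: "" for f in FIELDS}
row.update(
    provider="calcofi",
    dataset="dic",
    tables="dic_sample;dic_measurement;dic_measurement_summary")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```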

.claude/skills/ingest-new.md

Lines changed: 104 additions & 12 deletions
@@ -59,11 +59,11 @@ Based on the dataset characteristics (from `/explore-dataset` output or user inp
 - Full spatial/grid assignment
 - Example: euphausiids, zooplankton
 
-**Pattern B: Merge into existing table** (like DIC → bottle_measurement)
-- Appends rows to an existing table
-- Uses existing PKs from the target table
-- Joins via FK to existing casts/bottles
-- Example: DIC measurements → bottle_measurement
+**Pattern B: Supplementary measurements** (like DIC)
+- Creates own `{dataset}_sample` (position-only) and `{dataset}_measurement` tables
+- Matches to existing casts/bottles via station + date window
+- Keeps tables separate from bottle_measurement (different QC pipelines)
+- Example: DIC/TA → dic_sample + dic_measurement + dic_measurement_summary
 
 **Pattern C: Multi-source ingest** (like phytoplankton)
 - Reads from multiple source formats (CSV, API, etc.)
@@ -89,25 +89,65 @@ The notebook includes these sections (customize based on pattern):
 6. **Show source files** — `show_source_files()`
 7. **Show tables/fields** — Redefinition display
 8. **Load into database** — `ingest_dataset()` or custom load
-9. **Schema documentation** — dm visualization
+9. **Schema documentation** — Define PKs/FKs via `dm_add_pk()`/`dm_add_fk()`,
+   color-code tables (`lightblue` = new tables, `lightyellow` = amended
+   reference tables like measurement_type, `white` = shared metadata like
+   dataset), draw with `dm_draw()`, then write the `relationships.json`
+   sidecar via `build_relationships_json()` for use in release_database.qmd
 10. **Validate** — `validate_for_release()`
 11. **Enforce column types** — `enforce_column_types()`
-12. **Data preview** — `preview_tables()`
+12. **Data preview** — Individual `datatable()` calls per table (NOT
+    `preview_tables()` in a loop, which has DT rendering issues)
 13. **Write parquet** — `write_parquet_outputs()`
-14. **Write metadata** — `build_metadata_json()`, `build_relationships_json()`
+14. **Write metadata** — `build_metadata_json()`
 15. **Upload to GCS** — `sync_to_gcs()`
 16. **Cleanup** — Close DuckDB connection
 
 #### Conditional sections:
 - **Cross-dataset loading** — `load_prior_tables()` (if depends on prior ingest)
 - **Primary key setup** — `assign_deterministic_uuids()` or `assign_sequential_ids()`
 - **Pivot measurements** — Wide→long transformation (if `--has-pivot`)
+- **Measurement summary** — Aggregate replicates with avg/stddev per unique
+  position (station + date + depth + measurement_type). Filter out invalid
+  values: `WHERE NOT isnan(measurement_value) AND isfinite(measurement_value)`.
+  Use `STDDEV_SAMP()` with `CASE WHEN COUNT(*) = 1 THEN 0` for single
+  observations. See `ctd_summary` in `ingest_calcofi_ctd-cast.qmd` and
+  `dic_measurement_summary` in `ingest_calcofi_dic.qmd` for examples.
 - **Taxonomy** — `standardize_species_local()`, `build_taxon_hierarchy()` (if `--has-taxonomy`)
-- **Spatial** — `add_point_geom()`, `assign_grid_key()` (if has lat/lon)
+- **Spatial** — `add_point_geom()`, `assign_grid_key()` (if has lat/lon).
+  For datasets without direct cast_id/bottle_id FKs, match via station +
+  date window (±3 days) or lat/lon spatial join. See issue #47 for the
+  site/grid/segment matching roadmap.
 - **Lookup tables** — `create_lookup_table()` (if categorical vocabularies exist)
 - **Ship/cruise matching** — `derive_cruise_key_on_casts()` (if cross-dataset bridge needed)
 
-### 5. Mark dataset-specific sections
+### 5. Coding conventions
+
+**Tidy data**: Apply tidy data principles throughout:
+- The base `{dataset}_sample` table has only position/time/FK columns —
+  NO measurement values as separate columns
+- ALL measurements (including ancillary ones like temp, salinity) are
+  pivoted into `{dataset}_measurement` with columns:
+  `measurement_type`, `measurement_value`, `measurement_qual`
+- Each row = one measurement at one position. Never mix different
+  measured quantities on the same row.
+- Example: DIC dataset pivots 4 types (dic, alkalinity, ctdtemp_its90,
+  salinity_pss78) into `dic_measurement` — `dic_sample` has zero
+  measurement columns.
+
+**Status output**: Use `cat()` (not `message()`) for user-facing status
+output in chunks. `message()` sends to stderr, which Quarto may not
+render visibly with `code-fold: true`. Pattern:
+```r
+cat(glue("label: {value}"), "\n")
+```
+
+**Data preview**: Use individual `datatable()` calls per table in
+separate chunks (one chunk per table). Do NOT use `preview_tables()`
+in a loop — it has DT widget rendering issues where only the first
+table displays.
+
+### 6. Mark dataset-specific sections
 
 In the generated notebook, mark sections requiring manual implementation with:
 
@@ -117,7 +157,59 @@ In the generated notebook, mark sections requiring manual implementation with:
 # - {specific guidance based on dataset characteristics}
 ```
 
-### 6. Update `_targets.R`
+### 6. Include dataset metadata and release_database update
+
+Every ingest notebook MUST include the following standard sections:
+
+**a. Load Dataset Metadata** — Load `metadata/dataset.csv` into the
+wrangling DB so it's included in the parquet output and flows into
+`release_database.qmd`:
+
+```r
+d_dataset <- read_csv(here("metadata/dataset.csv"))
+dbWriteTable(con, "dataset", d_dataset, overwrite = TRUE)
+```
+
+**b. CalCOFI.org page check** — Before ingesting, scrape the CalCOFI.org
+landing page for the dataset (from `link_calcofi_org` in `dataset.csv`)
+to check for updated data, new download links, or changed metadata.
+
+**c. Update `release_database.qmd`** — Add the new dataset's parquet
+directory and relationships.json path to the release workflow:
+
+```r
+# in release_database.qmd, add to parquet_dirs:
+parquet_dirs <- c(
+  ...,
+  here("data/parquet/{provider}_{dataset}")
+)
+
+# and to rels_paths:
+rels_paths <- c(
+  ...,
+  here("data/parquet/{provider}_{dataset}/relationships.json")
+)
+```
+
+Also add the dataset's tables to the color grouping section and update
+the release notes data sources list.
+
+### 7. Provider naming convention
+
+The `provider` value represents the **organization curating the data**,
+not the data portal where it's hosted:
+
+| Provider | Organization | Example datasets |
+|----------|-------------|------------------|
+| `calcofi` | CalCOFI program | bottle, ctd-cast, dic |
+| `swfsc` | NOAA SWFSC | ichthyo |
+| `pic` | SIO Pelagic Invertebrates Collection | zooplankton |
+| `sccoos` | SCCOOS | underway |
+
+Data portals (NCEI, EDI, ERDDAP) are recorded in `link_data_source`
+in `metadata/dataset.csv`, not in the provider name.
+
+### 8. Update `_targets.R`
 
 Add a new target entry for the ingest workflow:
 
@@ -136,7 +228,7 @@ tar_target(
 
 Insert it in the correct dependency order (after its `depends_on` target, before `release_database`).
 
-### 7. Present results
+### 9. Present results
 
 Show the user:
 - Created file path
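The station + date-window matching described for Pattern B and the Spatial section can be sketched as follows (Python for illustration; `match_cast`, the ±3-day window constant, and the sample records are hypothetical stand-ins for the actual SQL join):

```python
from datetime import date, timedelta

# ±3-day window per the Spatial bullet in the diff above.
WINDOW = timedelta(days=3)

def match_cast(sample, casts, window=WINDOW):
    """Return cast_ids whose station matches the sample and whose
    date falls within the given window of the sample date."""
    return [
        c["cast_id"] for c in casts
        if c["sta_key"] == sample["sta_key"]
        and abs(c["date"] - sample["date"]) <= window
    ]

casts = [
    {"cast_id": 1, "sta_key": "090.0_060.0", "date": date(2021, 4, 10)},
    {"cast_id": 2, "sta_key": "090.0_060.0", "date": date(2021, 4, 20)},
]
sample = {"sta_key": "090.0_060.0", "date": date(2021, 4, 12)}
print(match_cast(sample, casts))  # only cast 1 is within ±3 days
```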

.claude/skills/templates/ingest_template.qmd

Lines changed: 94 additions & 4 deletions
@@ -169,6 +169,41 @@ ingest_dataset(con, d)
 ```
 <!-- {{pivot_section_end}} -->
 
+<!-- {{measurement_summary_section_start}} -->
+## Summarize Replicate Measurements
+
+Aggregate replicate measurements at each unique position in time and
+space (station + date + depth) into mean and standard deviation.
+Filters out `NaN`, `-Inf`, `Inf` values. See `ctd_summary` in
+`ingest_calcofi_ctd-cast.qmd` and `dic_measurement_summary` in
+`ingest_calcofi_dic.qmd` for production examples.
+
+```{r}
+#| label: TODO-measurement-summary
+
+# TODO: adjust table names and grouping columns for this dataset
+# dbExecute(
+#   con,
+#   "CREATE OR REPLACE TABLE {{dataset}}_measurement_summary AS
+#    SELECT
+#      sta_key,
+#      datetime_utc,
+#      depth_m,
+#      measurement_type,
+#      AVG(measurement_value) AS avg,
+#      CASE
+#        WHEN COUNT(*) = 1 THEN 0
+#        ELSE COALESCE(STDDEV_SAMP(measurement_value), 0)
+#      END AS stddev,
+#      COUNT(*) AS n_obs
+#    FROM {{dataset}}_measurement
+#    WHERE NOT isnan(measurement_value)
+#      AND isfinite(measurement_value)
+#    GROUP BY sta_key, datetime_utc, depth_m, measurement_type"
+# )
+```
+<!-- {{measurement_summary_section_end}} -->
+
 <!-- {{cross_dataset_section_start}} -->
 ## Cross-Dataset Integration
 
@@ -193,13 +228,53 @@ load_prior_tables(
 
 ## Schema Documentation
 
+Define PKs, FKs, and color-code tables for the ER diagram.
+Color scheme: lightblue = new dataset tables, lightyellow = amended
+reference tables (e.g. measurement_type), white = shared metadata.
+
 ```{r}
 #| label: schema
 
-# build dm object for visualization
-tables <- dbListTables(con)
-dm_obj <- dm_from_con(con, tables, learn_keys = FALSE)
-dm_draw(dm_obj, rankdir = "LR", view_type = "all")
+# TODO: define PK/FK relationships for this dataset
+add_{{dataset}}_keys <- function(dm) {
+  dm |>
+    dm_add_pk({{dataset}}_measurement, {{dataset}}_measurement_id) |>
+    dm_add_pk(measurement_type, measurement_type) |>
+    dm_add_fk({{dataset}}_measurement, measurement_type, measurement_type)
+  # add more FKs as needed
+}
+
+# build dm from dataset-specific tables (exclude loaded reference tables)
+{{dataset}}_tables <- c(
+  "{{dataset}}_sample", "{{dataset}}_measurement",
+  "{{dataset}}_measurement_summary",
+  "measurement_type", "dataset")
+dm_{{dataset}} <- dm_from_con(
+  con, table_names = {{dataset}}_tables, learn_keys = FALSE) |>
+  add_{{dataset}}_keys() |>
+  dm_set_colors(
+    lightblue = c({{dataset}}_sample, {{dataset}}_measurement,
+                  {{dataset}}_measurement_summary),
+    lightyellow = measurement_type,
+    white = dataset
+  )
+
+dm_draw(dm_{{dataset}}, rankdir = "LR", view_type = "all")
+```
+
+### Write relationships.json
+
+Write PK/FK sidecar for use in `release_database.qmd` schema merging.
+
+```{r}
+#| label: write-relationships
+
+build_relationships_json(
+  dm = dm_{{dataset}},
+  output_dir = dir_parquet,
+  provider = provider,
+  dataset = dataset
+)
 ```
 
 ## Add Spatial
@@ -212,6 +287,21 @@ dm_draw(dm_obj, rankdir = "LR", view_type = "all")
 # assign_grid_key(con, "{{spatial_table}}")
 ```
 
+## Load Dataset Metadata
+
+Register this dataset in the `dataset` reference table, which tracks
+citations, source URLs, and CalCOFI.org landing pages for all ingested
+datasets. Ensure `metadata/dataset.csv` has a row for this dataset
+before running.
+
+```{r}
+#| label: load-dataset-metadata
+
+d_dataset <- read_csv(here("metadata/dataset.csv"))
+dbWriteTable(con, "dataset", d_dataset, overwrite = TRUE)
+cat(glue("dataset: {nrow(d_dataset)} datasets registered"), "\n")
+```
+
 ## Validate
 
 ```{r}
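The replicate-summary SQL added in this template can be mirrored in plain code to check the aggregation logic (mean, sample stddev with a single observation forced to 0, NaN/Inf filtered). A sketch, Python for illustration; the field names follow the template and the `summarize` helper is hypothetical:

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def summarize(rows):
    """Group replicates by position + measurement_type, drop non-finite
    values, then compute mean and sample stddev (0 when n_obs == 1,
    matching the CASE WHEN in the template's SQL)."""
    groups = defaultdict(list)
    for r in rows:
        v = r["measurement_value"]
        if math.isfinite(v):  # excludes NaN, Inf, -Inf
            key = (r["sta_key"], r["datetime_utc"], r["depth_m"],
                   r["measurement_type"])
            groups[key].append(v)
    return {
        k: {"avg": mean(vals),
            "stddev": stdev(vals) if len(vals) > 1 else 0.0,
            "n_obs": len(vals)}
        for k, vals in groups.items()
    }

# two valid replicates plus one NaN that must be filtered out
rows = [
    {"sta_key": "s1", "datetime_utc": "t", "depth_m": 10,
     "measurement_type": "dic", "measurement_value": v}
    for v in (2000.0, 2002.0, float("nan"))
]
print(summarize(rows))
```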

.claude/skills/validate-ingest.md

Lines changed: 25 additions & 0 deletions
@@ -150,6 +150,31 @@ date_cols <- c("datetime_utc", "date", "cruise_date")
 # check for near-duplicates (same key columns, different values)
 ```
 
+#### I. Measurement Summary Consistency (`summary`)
+```r
+# if a *_measurement_summary table exists, validate:
+# - all summary rows have n_obs >= 1
+# - stddev == 0 when n_obs == 1
+# - no NaN or Inf in avg or stddev columns
+# - summary row count <= measurement row count
+# - measurement types in summary match measurement table
+for (tbl in tables) {
+  if (grepl("_summary$", tbl)) {
+    # check for NaN/Inf in summary values
+    bad_vals <- dbGetQuery(con, glue(
+      "SELECT COUNT(*) FROM {tbl}
+       WHERE isnan(avg) OR NOT isfinite(avg)
+          OR isnan(stddev) OR NOT isfinite(stddev)"))[[1]]
+    if (bad_vals > 0) report_error("summary NaN/Inf", tbl, glue("{bad_vals} rows"))
+    # check stddev = 0 when n_obs = 1
+    bad_stddev <- dbGetQuery(con, glue(
+      "SELECT COUNT(*) FROM {tbl}
+       WHERE n_obs = 1 AND stddev != 0"))[[1]]
+    if (bad_stddev > 0) report_warning("stddev != 0 for n_obs=1", tbl, bad_stddev)
+  }
+}
+```
+
 ### 3. Cross-dataset validation
 
 If prior ingest parquet exists, also validate cross-dataset integrity:
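The summary-consistency checks above can be exercised outside the database as well. A sketch, Python for illustration; `validate_summary` is a hypothetical stand-in for the R loop's `report_error`/`report_warning` flow:

```python
import math

def validate_summary(rows):
    """Collect errors/warnings per the checks in the skill:
    finite avg/stddev, n_obs >= 1, stddev == 0 when n_obs == 1."""
    errors, warnings = [], []
    for i, r in enumerate(rows):
        if not (math.isfinite(r["avg"]) and math.isfinite(r["stddev"])):
            errors.append(f"row {i}: NaN/Inf in avg or stddev")
        if r["n_obs"] < 1:
            errors.append(f"row {i}: n_obs < 1")
        if r["n_obs"] == 1 and r["stddev"] != 0:
            warnings.append(f"row {i}: stddev != 0 for n_obs=1")
    return errors, warnings

rows = [
    {"avg": 2001.0, "stddev": 1.41, "n_obs": 2},      # ok
    {"avg": float("nan"), "stddev": 0.0, "n_obs": 1}, # error: NaN avg
    {"avg": 33.5, "stddev": 0.2, "n_obs": 1},         # warning: stddev != 0
]
errs, warns = validate_summary(rows)
print(errs, warns)
```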
