db.qmd: 66 additions & 0 deletions
@@ -113,6 +113,50 @@ Use Quarto documents with chunks of R code in the [workflows](https://github.com
%%| file: diagrams/db_doc.mmd
```

### Using the calcofi4db package

The [calcofi4db](https://github.com/CalCOFI/calcofi4db) package provides functions to streamline dataset ingestion, metadata generation, and change detection. The standard workflow is:

1. **Load data files**: `load_csv_files()` reads CSV files from a directory and prepares them for ingestion
2. **Transform data**: `transform_data()` applies transformations according to the redefinition files
3. **Detect changes**: `detect_csv_changes()` compares the data with existing database tables
4. **Ingest data**: `ingest_csv_to_db()` writes the data to the database along with its metadata

For convenience, the high-level `ingest_dataset()` function combines these steps:

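
As a rough illustration of how a wrapper might chain the four steps, here is a minimal base-R sketch. Every function in it is a hypothetical stand-in written for this example; none of them is the `calcofi4db` API, whose signatures are not shown in this section. A named list of data frames stands in for the database, and the file and field names are made up.

```r
# Hypothetical stand-ins for the four steps (NOT the calcofi4db functions).

# Step 1: read every CSV in a directory into a named list of data frames
load_step <- function(dir) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  tbls  <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  names(tbls) <- tools::file_path_sans_ext(basename(paths))
  tbls
}

# Step 2: rename fields; flds_rename is a named vector, old name -> new name
transform_step <- function(tbls, flds_rename) {
  lapply(tbls, function(d) {
    hit <- names(d) %in% names(flds_rename)
    names(d)[hit] <- flds_rename[names(d)[hit]]
    d
  })
}

# Step 3: compare table names against the stand-in database
detect_step <- function(tbls, db) {
  list(
    new      = setdiff(names(tbls), names(db)),
    existing = intersect(names(tbls), names(db))
  )
}

# Step 4: write the tables into the stand-in database (a named list)
ingest_step <- function(tbls, db) {
  db[names(tbls)] <- tbls
  db
}

# Wrapper chaining the steps, analogous in spirit to a high-level helper
ingest_sketch <- function(dir, flds_rename, db = list()) {
  tbls    <- transform_step(load_step(dir), flds_rename)
  changes <- detect_step(tbls, db)
  list(changes = changes, db = ingest_step(tbls, db))
}

# Usage with a temporary directory and a made-up CSV file
csv_dir <- tempfile("csv_")
dir.create(csv_dir)
write.csv(
  data.frame(Sta_ID = 1:2, lat = c(32.5, 33.1)),
  file.path(csv_dir, "station.csv"),
  row.names = FALSE
)
res <- ingest_sketch(csv_dir, c(Sta_ID = "station_id"))
```

In this sketch the change report is returned alongside the updated stand-in database, so a caller can inspect `res$changes` before deciding what to do with `res$db`.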
Google Drive \*.csv files get ingested with a **workflow** per **dataset** (in GitHub repository [calcofi/workflows](https://github.com/calcofi/workflows) as a Quarto document). Data definition CSV files (`tbls_redefine.csv`, `flds_redefine.csv`) are auto-generated (if missing) and manually updated to rename and describe tables and fields. After ingesting the data for each of the tables, extra metadata is added to the `COMMENT`s of each table as JSON elements (links in markdown), including at the ***table*** level:

- **description**: general description of the table's contents and how each row is unique
@@ -128,6 +172,28 @@ And at the ***field*** level:
These comments are then exposed by the API [db_tables](https://api.calcofi.io/db_tables) endpoint, which can be consumed and rendered into a tabular searchable catalog with [calcofi4r::cc_db_catalog](https://calcofi.io/calcofi4r/reference/cc_db_catalog.html).
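
To make the idea of JSON metadata stored in a table `COMMENT` concrete, here is a minimal base-R sketch that builds such a statement. It assumes a PostgreSQL-style `COMMENT ON TABLE` syntax; the helper name `table_comment_sql()` and the table name are hypothetical, and only the `description` key named above is included. Base R's `encodeString()` is used as a lightweight stand-in for a JSON encoder, which is adequate for plain text but not a full JSON implementation.

```r
# Hypothetical helper (not part of calcofi4db): build a COMMENT ON TABLE
# statement whose text is a small JSON object with a "description" key.
table_comment_sql <- function(table, description) {
  # encodeString() wraps the text in double quotes and escapes embedded
  # double quotes and backslashes; an approximation of JSON string encoding
  json <- sprintf('{"description": %s}', encodeString(description, quote = '"'))
  # Double any single quotes so the JSON fits inside a SQL string literal
  sql_literal <- gsub("'", "''", json, fixed = TRUE)
  # Note: the table identifier is inserted as-is, without quoting or validation
  sprintf("COMMENT ON TABLE %s IS '%s';", table, sql_literal)
}

sql <- table_comment_sql("station", "One row per station")
cat(sql, "\n")
```

A real workflow would more likely use a dedicated JSON package and a database driver to run the statement; this sketch only shows the shape of the resulting comment.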

### Change detection strategy

The `calcofi4db` package detects changes at three levels:

1. **Table changes**:
   - New tables are identified for initial creation
   - Existing tables are identified for potential updates

2. **Field changes**:
   - Added fields: new columns in the CSV that are not present in the database
   - Removed fields: columns in the database that are not present in the CSV
   - Type changes: fields whose data types differ between the CSV and the database

3. **Data changes**:
   - Row counts are compared between source and destination
   - Data contents are compared using checksums

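
The field-level and data-level checks listed above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, two in-memory data frames stand in for the CSV and the database table, column classes stand in for database types, and an MD5 of a canonical CSV dump stands in for whatever checksum the package uses.

```r
# Hypothetical sketch (not the calcofi4db implementation).

# Field changes: added, removed, and type-changed columns
detect_field_changes <- function(csv, db) {
  csv_types <- vapply(csv, function(x) class(x)[1], character(1))
  db_types  <- vapply(db,  function(x) class(x)[1], character(1))
  shared    <- intersect(names(csv), names(db))
  list(
    added        = setdiff(names(csv), names(db)),
    removed      = setdiff(names(db), names(csv)),
    type_changed = shared[csv_types[shared] != db_types[shared]]
  )
}

# Checksum of a table, insensitive to row and column order.
# Assumes the table has at least one column.
table_checksum <- function(d) {
  d <- d[, sort(names(d)), drop = FALSE]
  d <- d[do.call(order, unname(as.list(d))), , drop = FALSE]
  f <- tempfile(fileext = ".csv")
  on.exit(unlink(f))
  write.csv(d, f, row.names = FALSE)
  unname(tools::md5sum(f))
}

# Data changes: row counts first, then checksums
detect_data_changes <- function(csv, db) {
  list(
    rows_csv      = nrow(csv),
    rows_db       = nrow(db),
    same_rows     = nrow(csv) == nrow(db),
    same_checksum = identical(table_checksum(csv), table_checksum(db))
  )
}

# Usage with made-up tables
csv_tbl <- data.frame(id = 1:3, depth = c(10.5, 20, 30),
                      note = c("a", "b", "c"), stringsAsFactors = FALSE)
db_tbl  <- data.frame(id = c("1", "2", "3"), depth = c(10.5, 20, 30),
                      old = c(TRUE, FALSE, TRUE), stringsAsFactors = FALSE)
field_changes <- detect_field_changes(csv_tbl, db_tbl)

a <- data.frame(id = 1:3, v = c("x", "y", "z"), stringsAsFactors = FALSE)
b <- a[3:1, ]                 # same rows, different order
d <- a
d$v[2] <- "q"                 # one changed value
same_ab <- detect_data_changes(a, b)
same_ad <- detect_data_changes(a, d)
```

Comparing row counts before checksums is a cheap first filter: a count mismatch already signals a change, while equal counts still need the checksum to catch edited values.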

If changes are detected, they are displayed to the user, who can then decide whether to:

- Create new tables
- Modify existing table schemas
- Update data with an appropriate strategy (append, replace, or merge)
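
The three update strategies can be illustrated with a small base-R sketch. The semantics shown are one common interpretation and an assumption on my part, since this section does not define them: append adds all incoming rows, replace discards the existing rows, and merge overwrites rows that share a key and adds the rest. The function name `apply_update()` and the single-column `key` argument are hypothetical.

```r
# Hypothetical sketch of update strategies on data frames
# (not the calcofi4db implementation). Both tables must share the same columns.
apply_update <- function(db, csv,
                         strategy = c("append", "replace", "merge"),
                         key = NULL) {
  strategy <- match.arg(strategy)
  switch(strategy,
    append  = rbind(db, csv),   # keep existing rows, add all incoming rows
    replace = csv,              # drop existing rows entirely
    merge   = {
      stopifnot(!is.null(key))
      # keep existing rows whose key is absent from the incoming data,
      # then add every incoming row (an "upsert" on a single key column)
      keep <- !(db[[key]] %in% csv[[key]])
      out  <- rbind(db[keep, , drop = FALSE], csv)
      out  <- out[order(out[[key]]), , drop = FALSE]
      rownames(out) <- NULL
      out
    }
  )
}

# Usage with made-up tables
db_rows  <- data.frame(id = 1:3, v = c("a", "b", "c"), stringsAsFactors = FALSE)
csv_rows <- data.frame(id = 3:4, v = c("C", "d"), stringsAsFactors = FALSE)

appended <- apply_update(db_rows, csv_rows, "append")
replaced <- apply_update(db_rows, csv_rows, "replace")
merged   <- apply_update(db_rows, csv_rows, "merge", key = "id")
```

Note that append can duplicate keys (here `id` 3 appears twice), which is why a merge on a key is usually the safer choice when incoming files overlap with existing data.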

Additional workflows will publish the data to the various [Portals](https://calcofi.io/docs/portals.html) (ERDDAP, EDI, OBIS, NCEI) using Ecological Metadata Language (EML) and the [EML](https://docs.ropensci.org/EML/) R package, pulling directly from the structured metadata in the database (the table and field definitions).