Commit b350cec

+ calcofi4db
1 parent 0ff21ba commit b350cec

1 file changed

Lines changed: 66 additions & 0 deletions

db.qmd

@@ -113,6 +113,50 @@ Use Quarto documents with chunks of R code in the [workflows](https://github.com
%%| file: diagrams/db_doc.mmd
```

### Using the calcofi4db package

The [calcofi4db](https://github.com/CalCOFI/calcofi4db) package provides functions to streamline dataset ingestion, metadata generation, and change detection. The standard workflow is:

1. **Load data files**: `load_csv_files()` reads CSV files from a directory and prepares them for ingestion
2. **Transform data**: `transform_data()` applies transformations according to the redefinition files
3. **Detect changes**: `detect_csv_changes()` compares the CSV data with existing database tables
4. **Ingest data**: `ingest_csv_to_db()` writes the data to the database along with its metadata

For convenience, the high-level `ingest_dataset()` function combines these steps:

```r
library(calcofi4db)
library(DBI)
library(RPostgres)

# Connect to the database (placeholder credentials for a local instance;
# avoid hard-coding real passwords in a shared document)
con <- dbConnect(
  Postgres(),
  dbname   = "gis",
  host     = "localhost",
  port     = 5432,
  user     = "admin",
  password = "postgres"
)

# Ingest a dataset
result <- ingest_dataset(
  con            = con,
  provider       = "swfsc.noaa.gov",
  dataset        = "calcofi-db",
  dir_data       = "/path/to/data",
  schema         = "public",
  dir_googledata = "https://drive.google.com/drive/folders/your-folder-id",
  email          = "your.email@example.com"
)

# Examine changes and results
result$changes
result$stats

# Close the connection when finished
dbDisconnect(con)
```
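
To make step 2 of the workflow more concrete, the following base-R sketch shows how a redefinition table could drive column renaming. It does not call `transform_data()`, whose arguments are not shown here. The column names `fld_old` and `fld_new`, the helper `rename_fields()`, and the sample values are all assumptions made for illustration, not the documented layout of `flds_redefine.csv`.

```r
# Hypothetical redefinition table: one row per field to rename.
# The column names fld_old / fld_new are assumed for illustration.
flds_redefine <- data.frame(
  fld_old = c("Sta_ID", "Depthm"),
  fld_new = c("station_id", "depth_m"),
  stringsAsFactors = FALSE
)

# Raw data as it might arrive from a CSV file (made-up values).
raw <- data.frame(
  Sta_ID = c("A01", "A02"),
  Depthm = c(10, 20),
  Notes  = c("first", "second"),
  stringsAsFactors = FALSE
)

# Rename the columns listed in the lookup; leave all others untouched.
rename_fields <- function(d, lookup) {
  idx <- match(names(d), lookup$fld_old)  # NA where no rename is defined
  has_new <- !is.na(idx)
  names(d)[has_new] <- lookup$fld_new[idx[has_new]]
  d
}

renamed <- rename_fields(raw, flds_redefine)
names(renamed)
```

Keeping the renames in a lookup table rather than in code means the mapping can be reviewed and edited as a CSV, which matches the "auto-generated, then manually updated" approach described for the redefinition files.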

### Workflow details

Google Drive \*.csv files are ingested with one **workflow** per **dataset** (a Quarto document in the GitHub repository [calcofi/workflows](https://github.com/calcofi/workflows)). Data definition CSV files (`tbls_redefine.csv`, `flds_redefine.csv`) are auto-generated (if missing) and manually updated to rename and describe tables and fields. After ingesting the data for each of the tables, extra metadata is added to the `COMMENT`s of each table as JSON elements (links in markdown), including at the ***table*** level:

- **description**: general description of the contents and of how each row is unique
@@ -128,6 +172,28 @@ And at the ***field*** level:

These comments are then exposed by the API [db_tables](https://api.calcofi.io/db_tables) endpoint, which can be consumed and rendered into a tabular searchable catalog with [calcofi4r::cc_db_catalog](https://calcofi.io/calcofi4r/reference/cc_db_catalog.html).
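
As a rough illustration of how JSON metadata can be stored in a table `COMMENT`, the sketch below builds a PostgreSQL `COMMENT ON TABLE` statement with `jsonlite` and `DBI`. The table name `cruise`, the `workflow` key, and the example URL are hypothetical; only the `description` key comes from the text above, and this is not necessarily how `calcofi4db` writes its comments.

```r
library(DBI)
library(jsonlite)

# Hypothetical table-level metadata; only `description` is named in the text.
meta <- list(
  description = "One row per cruise, unique by cruise identifier",
  workflow    = "[ingest_example](https://example.org/ingest_example.html)"
)

# Serialize to a single JSON object (auto_unbox avoids one-element arrays).
json <- toJSON(meta, auto_unbox = TRUE)

# ANSI() is a dummy DBI connection that applies standard SQL quoting rules,
# so the statement can be built without a live database.
quoter <- ANSI()
sql <- paste0(
  "COMMENT ON TABLE ",
  as.character(dbQuoteIdentifier(quoter, "cruise")),
  " IS ",
  as.character(dbQuoteString(quoter, as.character(json))),
  ";"
)
cat(sql, "\n")

# With a live connection `con`, the statement could be run as:
# dbExecute(con, sql)
```

Quoting through `dbQuoteString()` rather than pasting the JSON between single quotes keeps apostrophes in descriptions from breaking the statement.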

### Change detection strategy

The `calcofi4db` package detects changes at three levels:

1. **Table changes**:
   - New tables are identified for initial creation
   - Existing tables are identified for potential updates

2. **Field changes**:
   - Added fields: new columns in the CSV that are not present in the database
   - Removed fields: columns in the database that are not present in the CSV
   - Type changes: fields whose data types differ between the CSV and the database

3. **Data changes**:
   - Row counts are compared between source and destination
   - Data contents are compared using checksums
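
The field-level and data-level checks listed above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper names `col_types()`, `checksum()`, and `detect_changes()` are hypothetical, both tables are held in memory as data frames, and the checksum is an MD5 of a temporary CSV export rather than whatever `detect_csv_changes()` computes internally.

```r
# Made-up "database" and "CSV" versions of the same table.
db_tbl <- data.frame(
  id     = 1:3,
  temp_c = c(12.1, 13.4, 11.9),
  depth  = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)
csv_tbl <- data.frame(
  id       = 1:4,
  temp_c   = c("12.1", "13.4", "11.9", "12.7"),  # read as text, not numeric
  salinity = c(33.1, 33.4, 33.2, 33.5),
  stringsAsFactors = FALSE
)

# First class of each column, as a named character vector.
col_types <- function(d) vapply(d, function(x) class(x)[1], character(1))

# MD5 checksum of a data frame, computed on a temporary CSV export.
checksum <- function(d) {
  f <- tempfile(fileext = ".csv")
  on.exit(unlink(f))
  write.csv(d, f, row.names = FALSE)
  unname(tools::md5sum(f))
}

detect_changes <- function(csv, db) {
  shared <- intersect(names(csv), names(db))
  list(
    fields_added   = setdiff(names(csv), names(db)),
    fields_removed = setdiff(names(db), names(csv)),
    type_changes   = shared[col_types(csv)[shared] != col_types(db)[shared]],
    rows_csv       = nrow(csv),
    rows_db        = nrow(db),
    # Compare contents of shared columns only, in the same column order.
    data_differs   = checksum(csv[shared]) != checksum(db[shared])
  )
}

changes <- detect_changes(csv_tbl, db_tbl)
str(changes)
```

Comparing checksums avoids a row-by-row comparison when the data are unchanged, at the cost of not saying which rows differ when they are not.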

If changes are detected, they are displayed to the user, who can decide whether to:
- Create new tables
- Modify existing table schemas
- Update data with an appropriate strategy (append, replace, or merge)
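
The three update strategies are named above but not defined, so the following base-R sketch shows one plausible reading of them on in-memory data frames. The helper `apply_update()`, its `key` argument, and the interpretation of "merge" as an upsert by key are assumptions, not the package's documented behavior.

```r
# Made-up existing table contents and incoming rows.
existing <- data.frame(id = 1:3, value = c("a", "b", "c"),
                       stringsAsFactors = FALSE)
incoming <- data.frame(id = 3:4, value = c("C", "d"),
                       stringsAsFactors = FALSE)

# Hypothetical helper illustrating the three strategies.
apply_update <- function(existing, incoming,
                         strategy = c("append", "replace", "merge"),
                         key = "id") {
  strategy <- match.arg(strategy)
  switch(
    strategy,
    # append: add incoming rows without checking for duplicates
    append = rbind(existing, incoming),
    # replace: discard existing rows entirely
    replace = incoming,
    # merge: incoming rows win where keys collide, others are kept
    merge = {
      keep <- !(existing[[key]] %in% incoming[[key]])
      out <- rbind(existing[keep, , drop = FALSE], incoming)
      out <- out[order(out[[key]]), , drop = FALSE]
      rownames(out) <- NULL
      out
    }
  )
}

appended <- apply_update(existing, incoming, "append")
replaced <- apply_update(existing, incoming, "replace")
merged   <- apply_update(existing, incoming, "merge")
merged
```

Against a live connection, `DBI::dbAppendTable()` and `DBI::dbWriteTable(..., overwrite = TRUE)` cover the append and replace cases; a merge generally needs a key-aware statement on the database side.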

Additional workflows will publish the data to the various [Portals](https://calcofi.io/docs/portals.html) (ERDDAP, EDI, OBIS, NCEI) using Ecological Metadata Language (EML) and the [EML](https://docs.ropensci.org/EML/) R package, pulling directly from the structured metadata in the database (the table and field definitions).

### OR Describe tables and columns directly
