# Database
## Database naming conventions
> There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton (Netscape architect)
We're still converging on the best conventions for naming. Here are some useful references:
- [Learn SQL: Naming Conventions](https://www.sqlshack.com/learn-sql-naming-conventions/)
- [Best Practices for Database Naming Conventions - Drygast.NET](https://drygast.net/blog/post/database_naming_conventions)
### Name tables
- Table names are **singular** and use all lower case.
- Example: `cruise`, `site`, `species`, `lookup` (not `cruises`, `sites`, `lookups`)
### Name columns
- To name columns, use [**snake_case**](https://cran.r-project.org/web/packages/snakecase/vignettes/introducing-the-snakecase-package.html) (i.e., lower case with underscores) to avoid having to quote identifiers in SQL statements. (TIP: Use [`janitor::clean_names()`](https://sfirke.github.io/janitor/reference/clean_names.html) to convert a table.)
- Unique **identifiers** are suffixed with:
  - `*_id` for unique integer keys;
  - `*_uuid` for universally unique identifiers as defined by [RFC 4122](https://datatracker.ietf.org/doc/html/rfc4122) and stored in Postgres as the [UUID type](https://www.postgresql.org/docs/current/datatype-uuid.html);
  - `*_key` for unique string keys;
  - `*_seq` for auto-incrementing sequence integer keys.
- Suffix with **units** where applicable (e.g., `*_m` for meters, `*_km` for kilometers, `*_degc` for degrees Celsius). See the [units vignette](https://cran.r-project.org/web/packages/units/vignettes/measurement_units_in_R.html).
- Set geometry column to **`geom`** (used by [PostGIS](https://postgis.net) spatial extension). If the table has multiple geometry columns, use `geom` for the default geometry column and `geom_{type}` for additional geometry columns (e.g., `geom_point`, `geom_line`, `geom_polygon`).
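Putting these conventions together, here is a minimal sketch of a table definition, assuming a DBI connection `con` like the ones used later in this document (table and column names are illustrative, not the actual CalCOFI schema):

```r
library(DBI)
# hypothetical sketch: a table following the column naming conventions above
dbExecute(con, "
  CREATE TABLE tow (
    tow_id       integer PRIMARY KEY,            -- unique integer key
    tow_uuid     uuid DEFAULT gen_random_uuid(), -- RFC 4122 universally unique id
    tow_type_key text,                           -- unique string key
    depth_m      numeric,                        -- units suffix: meters
    temp_degc    numeric,                        -- units suffix: degrees Celsius
    geom         geometry(Point, 4326)           -- default PostGIS geometry column
  )")
```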
### Primary key conventions
**Prefer natural keys** (meaningful domain identifiers) over surrogate keys where they are stable:
- `cruise_key` = 'YYMMKK' (e.g., `2401NH` = January 2024, New Horizon)
- `ship_key` = 2-letter ship code (e.g., `NH`)
- `tow_type_key` = tow type code (e.g., `CB`)
**Sequential integer keys** should have an explicitly documented sort order for reproducibility:
- `site_id` sorted by `cruise_key, orderocc`
- `tow_id` sorted by `site_id, time_start`
- `net_id` sorted by `tow_id, side`
- `ichthyo_id` sorted by `net_id, species_id, life_stage, measurement_type, measurement_value`
**Avoid UUIDs in output tables**: Use `_source_uuid` in Working DuckLake for provenance tracking only (stripped in frozen releases).
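A minimal sketch of these two key styles (assuming the same DBI connection `con`; the DDL below is illustrative, not the actual CalCOFI schema):

```r
library(DBI)
# hypothetical sketch: natural key on cruise, documented sequential key on site
dbExecute(con, "
  CREATE TABLE cruise (
    cruise_key text PRIMARY KEY  -- 'YYMMKK', e.g. '2401NH'
  )")
dbExecute(con, "
  CREATE TABLE site (
    site_id    integer PRIMARY KEY,  -- assigned in sort order of (cruise_key, orderocc)
    cruise_key text REFERENCES cruise (cruise_key),
    orderocc   integer
  )")
```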
## Use Unicode for text
The [default character encoding for PostgreSQL](https://www.postgresql.org/docs/current/multibyte.html#MULTIBYTE-SETTING) is Unicode (`UTF8`), which allows for international characters, accents and special characters. Improper encoding can royally mess up basic text.
After logging into the server, we can verify this with the following command:
```bash
docker exec -it postgis psql -l
```
```
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
--------------------+-------+----------+------------+------------+-------------------
gis | admin | UTF8 | en_US.utf8 | en_US.utf8 | =Tc/admin +
| | | | | admin=CTc/admin +
| | | | | ro_user=c/admin
lter_core_metabase | admin | UTF8 | en_US.utf8 | en_US.utf8 | =Tc/admin +
| | | | | admin=CTc/admin +
| | | | | rw_user=c/admin
postgres | admin | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | admin | UTF8 | en_US.utf8 | en_US.utf8 | =c/admin +
| | | | | admin=CTc/admin
template1 | admin | UTF8 | en_US.utf8 | en_US.utf8 | =c/admin +
| | | | | admin=CTc/admin
template_postgis | admin | UTF8 | en_US.utf8 | en_US.utf8 |
(6 rows)
```
Use Unicode (`utf-8` in Python or `UTF8` in PostgreSQL) encoding for all database text values to support international characters and documentation formatting (e.g., tabs for markdown conversion).
- In **Python**, use [**`pandas`**](https://pandas.pydata.org/docs/index.html) to read ([`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) and write ([`to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)) CSV files with UTF-8 encoding (i.e., `encoding='utf-8'`); for the database round trip, the connection's client encoding handles UTF-8:
```python
import pandas as pd
from sqlalchemy import create_engine
# client_encoding ensures the connection exchanges UTF-8 with PostgreSQL
engine = create_engine('postgresql://user:password@localhost:5432/dbname?client_encoding=utf8')
# read from a csv file
df = pd.read_csv('file.csv', encoding='utf-8')
# write to PostgreSQL (encoding is handled by the connection, not to_sql())
df.to_sql('table_name', engine, if_exists='replace', index=False, method='multi', chunksize=1000)
# read from PostgreSQL
df = pd.read_sql('SELECT * FROM table_name', engine)
# write to a csv file with UTF-8 encoding
df.to_csv('file.csv', index=False, encoding='utf-8')
```
- In **R**, use [**`readr`**](https://readr.tidyverse.org/index.html), which reads ([`read_csv()`](https://readr.tidyverse.org/reference/read_delim.html)) UTF-8 by default and writes ([`write_excel_csv()`](https://readr.tidyverse.org/reference/write_delim.html)) UTF-8 with a byte-order mark so Excel detects the encoding:
```r
library(readr)
library(DBI)
library(RPostgres)
# connect to PostgreSQL
con <- dbConnect(RPostgres::Postgres(), dbname = "dbname", host = "localhost", port = 5432, user = "user", password = "password")
# read from a csv file
df <- read_csv('file.csv', locale = locale(encoding = 'UTF-8')) # explicit
df <- read_csv('file.csv') # implicit: UTF-8 is the readr default
# write to PostgreSQL
dbWriteTable(con, 'table_name', df, overwrite = TRUE)
# read from PostgreSQL
df <- dbReadTable(con, 'table_name')
# write to a csv file (readr always writes UTF-8; write_excel_csv() adds a
# byte-order mark so Excel detects the encoding)
write_excel_csv(df, 'file.csv')
```
## Integrated database ingestion strategy
### Overview
The CalCOFI database uses a two-schema strategy for development and production:
- **`dev` schema**: Development schema where new datasets, tables, fields, and relationships are ingested and QA/QC'd. This schema is recreated fresh with each ingestion run using the master ingestion script.
- **`prod` schema**: Production schema for stable, versioned data used by public APIs, apps, and data portals (OBIS, EDI, ERDDAP). Once `dev` is validated, it's copied to `prod` with a version number.
### Master ingestion workflow
All datasets are ingested using a single master Quarto script [`calcofi4db/inst/create_db.qmd`](https://github.com/CalCOFI/calcofi4db/blob/main/inst/create_db.qmd) that:
1. **Drops and recreates** the `dev` schema (fresh start each run)
2. **Ingests multiple datasets** from Google Drive source files (CSV; potentially SHP/NetCDF in the future)
3. **Applies transformations** using redefinition files (`tbls_redefine.csv`, `flds_redefine.csv`)
4. **Creates relationships** (primary keys, foreign keys, indexes)
5. **Records schema version** with metadata in `schema_version` table
Each dataset section in the master script handles:
- Reading CSV files from Google Drive
- Transforming data according to redefinition rules
- Loading into database tables
- Adding table/field comments with metadata
```{mermaid}
%%| label: fig-db_doc
%%| fig-cap: "Integrated database ingestion scheme."
%%| file: diagrams/db_doc.mmd
```
### Using calcofi4db package
The [calcofi4db](https://github.com/CalCOFI/calcofi4db) package provides streamlined functions for dataset ingestion.
#### DuckLake workflow (recommended)
The preferred approach uses DuckDB with provenance tracking:
```r
library(calcofi4db)
# connect to working DuckLake (downloads from GCS if needed)
con <- get_working_ducklake()
# read CSV files from local directory or GCS archive
# - syncs to GCS archive for immutable provenance tracking
# - loads redefinition metadata for column renaming/typing
d <- read_csv_files(
provider = "swfsc.noaa.gov",
dataset = "calcofi-db",
dir_data = "~/My Drive/projects/calcofi/data-public",
metadata_dir = "metadata",
sync_archive = TRUE)
# ingest dataset with automatic provenance tracking
# - transforms data using redefinition files
# - adds _source_file, _source_row, _source_uuid, _ingested_at columns
# - handles uuid column detection automatically
tbl_stats <- ingest_dataset(
con = con,
d = d,
mode = "replace")
# save to GCS and close
save_working_ducklake(con)
close_duckdb(con)
```
The `ingest_dataset()` function handles:
- Calling `transform_data()` to apply table/field redefinitions
- Detecting UUID columns automatically for provenance tracking
- Calling `ingest_to_working()` for each table with proper source file tracking
- Returning ingestion statistics (rows input, rows after, timestamps)
#### PostgreSQL workflow (legacy)
For PostgreSQL ingestion (being gradually deprecated):
```r
library(calcofi4db)
library(DBI)
library(RPostgres)
# connect to database
con <- dbConnect(
Postgres(),
dbname = "gis",
host = "localhost",
port = 5432,
user = "admin",
password = "postgres")
# read CSV files and metadata
d <- read_csv_files(
provider = "swfsc.noaa.gov",
dataset = "calcofi-db")
# transform data according to redefinitions
transformed_data <- transform_data(d)
# ingest into dev schema
ingest_csv_to_db(
con = con,
schema = "dev",
transformed_data = transformed_data,
d_flds_rd = d$d_flds_rd,
d_gdata = d$d_gdata,
workflow_info = d$workflow_info)
# record schema version
record_schema_version(
con = con,
schema = "dev",
version = "1.0.0",
description = "Initial ingestion of NOAA CalCOFI Database",
script_permalink = "https://github.com/CalCOFI/calcofi4db/blob/main/inst/create_db.qmd")
```
### Schema versioning
Each successful ingestion creates a new schema version recorded in the `schema_version` table with:
- **version**: Semantic version number (e.g., "1.0.0", "1.1.0")
- **description**: Changes introduced in this version
- **date_created**: Timestamp of ingestion
- **script_permalink**: GitHub permalink to the versioned ingestion script
Versions are also archived as SQL dumps in Google Drive for reproducibility.
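One plausible shape for this table, sketched from the fields listed above (assuming a DBI connection `con`; the actual definition lives in the calcofi4db ingestion scripts):

```r
library(DBI)
# hypothetical sketch of the schema_version table described above
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS dev.schema_version (
    version          text PRIMARY KEY,          -- semantic version, e.g. '1.0.0'
    description      text,                      -- changes introduced in this version
    date_created     timestamptz DEFAULT now(), -- timestamp of ingestion
    script_permalink text                       -- GitHub permalink to ingestion script
  )")
```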
### Metadata and documentation
After ingestion, metadata is stored in PostgreSQL `COMMENT`s as JSON at the **table** level:
- **description**: General description and row uniqueness
- **source**: CSV file link to Google Drive
- **source_created**: Source file creation timestamp
- **workflow**: Link to rendered ingestion script
- **workflow_ingested**: Ingestion timestamp
And at the **field** level:
- **description**: Field description
- **units**: SI units where applicable
These comments are exposed via the API [db_tables](https://api.calcofi.io/db_tables) endpoint and rendered with [calcofi4r::cc_db_catalog](https://calcofi.io/calcofi4r/reference/cc_db_catalog.html).
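For example, a table-level comment might be set like this (a hedged sketch: the keys follow the list above, the values are placeholders, and `dev.cruise` is illustrative):

```r
library(DBI)
library(jsonlite)
# hypothetical sketch: store table-level metadata as JSON in a COMMENT
meta <- toJSON(list(
  description       = "one row per CalCOFI cruise",
  source            = "https://drive.google.com/...",
  source_created    = "2024-01-15",
  workflow          = "https://calcofi.io/workflows/...",
  workflow_ingested = "2024-02-01"), auto_unbox = TRUE)
dbExecute(con, sprintf("COMMENT ON TABLE dev.cruise IS '%s'", meta))
```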
### Publishing to portals
After the `prod` schema is versioned, additional workflows publish data to [Portals](https://calcofi.io/docs/portals.html) (ERDDAP, EDI, OBIS, NCEI) using Ecological Metadata Language (EML) via the [EML](https://docs.ropensci.org/EML/) R package, pulling metadata directly from database comments.
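A rough sketch of that publishing step (hedged: the title, people, and identifier below are placeholders, and the real workflows assemble these fields from the database comments):

```r
library(EML)
# hypothetical sketch: build and validate a minimal EML document
me <- list(individualName = list(givenName = "CalCOFI", surName = "Data Team"))
doc <- list(
  packageId = "...",  # placeholder identifier assigned by the portal
  system    = "uuid",
  dataset   = list(
    title   = "CalCOFI Database, prod schema v1.0.0",
    creator = me,
    contact = me))
write_eml(doc, "eml.xml")
eml_validate("eml.xml")
```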
### Alternative: describe tables and columns directly
- Use the `COMMENT` clause to add descriptions to tables and columns, either through the GUI [pgadmin.calcofi.io](https://pgadmin.calcofi.io/) (by right-clicking on the table or column and selecting `Properties`) or with SQL. For example:
```sql
COMMENT ON TABLE public.aoi_fed_sanctuaries IS 'areas of interest (`aoi`) polygons for federal **National Marine Sanctuaries**; loaded by _workflow_ [load_sanctuaries](https://calcofi.io/workflows/load_sanctuaries.html)';
```
- Note the use of [**markdown**](https://www.markdownguide.org/cheat-sheet/) for including links and formatting (e.g., bold, code, italics), such that the above SQL will render like so:
> areas of interest (`aoi`) polygons for federal **National Marine Sanctuaries**; loaded by _workflow_ [load_sanctuaries](https://calcofi.io/workflows/load_sanctuaries.html)
- It is especially helpful to link to any _**workflows**_ that are responsible for the ingesting or updating of the input data.
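The same can be done for columns, and from R via DBI (a sketch; the table and column here are illustrative):

```r
library(DBI)
# hypothetical sketch: comment on a column, including units
dbExecute(con, "
  COMMENT ON COLUMN public.site.depth_m IS
    'bottom depth at the site, in meters (m)'")
```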
### Display tables and columns with metadata
- These descriptions can be viewed in the CalCOFI **API** [api.calcofi.io](https://api.calcofi.io) as CSV tables (see code in [calcofi/api: `plumber.R`](https://github.com/CalCOFI/api/blob/8ad9d9ad62fd526d4b8da23357759f1ad196cb88/plumber.R#L916-L990)):
- [api.calcofi.io`/db_tables`](https://api.calcofi.io/db_tables)\
fields:\
- `schema`: (only "public" so far)
- `table_type`: "table", "view", or "materialized view" (none yet)
- `table`: name of table
- `table_description`: description of table (possibly in markdown)
- [api.calcofi.io`/db_columns`](https://api.calcofi.io/db_columns)\
fields:\
- `schema`: (only "public" so far)
- `table_type`: "table", "view", or "materialized view" (none yet)
- `table`: name of table
- `column`: name of column
- `column_type`: data type of column
- `column_description`: description of column (possibly in markdown)
- Fetch and display these descriptions into an interactive table with [`calcofi4r::`**`cc_db_catalog()`**](https://calcofi.io/calcofi4r/reference/cc_db_catalog.html).
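For example, since the endpoints return CSV, they can be read directly (a sketch; see the `calcofi4r` reference for `cc_db_catalog()` arguments):

```r
library(readr)
# fetch the table and column catalogs from the CalCOFI API
d_tables  <- read_csv("https://api.calcofi.io/db_tables")
d_columns <- read_csv("https://api.calcofi.io/db_columns")
# or render an interactive catalog directly
calcofi4r::cc_db_catalog()
```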
## Relationships between tables
- See [calcofi/workflows: **clean_db**](https://calcofi.io/workflows/clean_db.html)\
- `TODO:` add calcofi/apps: db to show latest tables, columns and relationships
## Spatial tips
- Use [`ST_Subdivide()`](https://postgis.net/docs/ST_Subdivide.html) to break large polygons into smaller pieces before running spatial joins, so spatial indexes stay effective; see the sketch below.
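A minimal sketch (assuming a DBI connection `con`; the `sanctuary` column and `site` table are illustrative):

```r
library(DBI)
# hypothetical sketch: subdivide large sanctuary polygons before a spatial join
dbGetQuery(con, "
  WITH aoi AS (
    SELECT sanctuary, ST_Subdivide(geom, 255) AS geom  -- <= 255 vertices per piece
    FROM public.aoi_fed_sanctuaries)
  SELECT s.site_id, a.sanctuary
  FROM site s
  JOIN aoi a ON ST_Intersects(a.geom, s.geom)")
```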