Merged
17 changes: 17 additions & 0 deletions .github/workflows/ci.yml
@@ -46,6 +46,19 @@ jobs:
     steps:
       - uses: actions/checkout@v4
       - uses: dtolnay/rust-toolchain@stable
+      - name: Install wxyc_unaccent rules into the PG service container
+        # Required by migrations/0003_wxyc_identity_match_functions.sql.
+        # Alpine layout: /usr/local/share/postgresql/tsearch_data.
+        run: |
+          PG_CID=$(docker ps -q --filter ancestor=postgres:16-alpine)
+          if [ -z "$PG_CID" ]; then
+            echo "no postgres:16-alpine service container running" >&2
+            exit 1
+          fi
+          docker cp vendor/wxyc-etl/wxyc_unaccent.rules \
+            "$PG_CID:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.rules"
+          docker cp vendor/wxyc-etl/wxyc_unaccent.version \
+            "$PG_CID:/usr/local/share/postgresql/tsearch_data/wxyc_unaccent.version"
       - name: Run PostgreSQL integration tests
         env:
           TEST_DATABASE_URL: postgresql://wikidata:wikidata@localhost:5435/wikidata_test
@@ -54,3 +67,7 @@ jobs:
         env:
           TEST_DATABASE_URL: postgresql://wikidata:wikidata@localhost:5435/wikidata_test
         run: cargo test --test charset_torture -- --test-threads=1 --include-ignored
+      - name: Run wxyc_identity_match_* parity test
+        env:
+          TEST_DATABASE_URL: postgresql://wikidata:wikidata@localhost:5435/wikidata_test
+        run: cargo test --test wxyc_identity_match_parity_test -- --ignored --test-threads=1
4 changes: 4 additions & 0 deletions CLAUDE.md
@@ -94,6 +94,10 @@ The full rebuild (`build` then `import --fresh`) is scheduled via `.github/workf

**Runner-capacity caveat:** the Wikidata JSON dump is roughly 130GB gzipped and a full rebuild can take many hours. GitHub-hosted `ubuntu-latest` runners have a 6-hour job timeout and only ~14GB of free disk, so the scheduled run will likely fail on disk or timeout. The workflow is intentionally a scheduling skeleton — the actual rebuild needs to migrate to a self-hosted runner, a Railway job, or a dedicated EC2 box. Until then, treat the `workflow_dispatch` trigger as the supported path (e.g., for small-dump smoke tests) and run real rebuilds out-of-band.

## wxyc-etl vendoring (cross-cache-identity)

This repo deploys the `wxyc_identity_match_*` PL/pgSQL function family for cross-cache identity matching. The canonical sources live in WXYC/wxyc-etl@v0.4.0 (`data/`); we vendor byte-for-byte copies under `vendor/wxyc-etl/`, because a top-level `data/` directory would conflict with runtime use of `--data-dir`. The vendored files are SHA-pinned in `wxyc-etl-pin.txt`. Migration `migrations/0003_wxyc_identity_match_functions.sql` is a sqlx-cli-friendly wrapper that sets up the `wxyc_unaccent` text-search dictionary and then inlines the canonical SQL. The parity test in `tests/wxyc_identity_match_parity_test.rs` enforces three things: pin freshness, migration-vs-canonical byte-equality, and PG-side fixture parity (252 rows + idempotence).
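
The migration-vs-canonical byte-equality check described above could look roughly like the following. This is a minimal sketch, not the code in `tests/wxyc_identity_match_parity_test.rs` (which is not shown in this diff). The names `find_subslice` and `migration_inlines_canonical` are hypothetical, and the sketch assumes the migration embeds the canonical SQL as one contiguous, unmodified byte run after its wrapper preamble.

```rust
/// Returns the byte offset at which `needle` first occurs in `haystack`, if any.
/// (Hypothetical helper; `slice::windows` panics on a zero size, so an empty
/// needle is handled up front.)
fn find_subslice(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|window| window == needle)
}

/// Hypothetical check: succeeds with the offset of the canonical SQL inside the
/// migration, or fails if the migration has drifted from the vendored copy.
fn migration_inlines_canonical(migration: &[u8], canonical: &[u8]) -> Result<usize, String> {
    find_subslice(migration, canonical)
        .ok_or_else(|| "migration does not contain the canonical SQL byte-for-byte".to_string())
}
```

In a real test the two byte slices would presumably come from reading the migration file and the vendored SQL file with `std::fs::read`. Comparing raw bytes rather than strings means that even whitespace or line-ending drift is reported as a failure.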

## Migrations

Schema changes ship as numbered SQL files under `migrations/`, applied with [sqlx-cli](https://crates.io/crates/sqlx-cli). The baseline `migrations/0001_initial.sql` mirrors `schema/create_database.sql`.
80 changes: 70 additions & 10 deletions Cargo.lock

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion Cargo.toml
@@ -16,7 +16,7 @@ csv = "1.3"
anyhow = "1"
log = "0.4"
tracing = "0.1"
-wxyc-etl = "0.3.0"
+wxyc-etl = "0.4.0"
postgres = "0.19"
rusqlite = { version = "0.31", features = ["bundled"] }

@@ -27,6 +27,7 @@ tempfile = "3"
postgres = "0.19"
pretty_assertions = "1"
csv = "1.3"
+sha2 = "0.10"

[profile.release]
lto = true