test: bump charset-torture corpus pin to v0.12.0 #29
Open
jakebromberg wants to merge 1 commit into
Cascade WX-2.2.1 (`to_storage_form`) corpus updates from WXYC/wxyc-shared#100 + WXYC/wxyc-etl#89. Corpus moves from v0.10.0 (sha 75a3395b…) to v0.12.0 (sha 41a18c5c0a92d129ec4b575827b6874196bfb7591e4bdf237a918a5da2de7b66).

The new corpus formalizes `expected_match_form` for the bidi_marks category (`HelloWorld` → `helloworld`, `HelloWorld` → `helloworld`, `Reversed` → `reversed`) and adds NFC `expected_storage` for two NFD `normalization` entries.

This repo's `tests/charset_torture.rs` detector is a CSV → PG COPY byte round-trip: it asserts `actual == input`, not `actual == expected_match_form` or `actual == expected_storage`. The bidi_marks bytes round-trip cleanly (LRM/RLM/RLO/PDF are all valid PG TEXT) so no new `[wjf:...]` xfails are needed; only the existing `[wjf:pg-null-byte]` xfail remains. Verified with `cargo test -- --ignored` against the docker-compose PG service (1 passed, 0 failed).

Drift-guard CI (`.github/workflows/charset-corpus-drift.yml`) is repointed to the new sha + `'0.12.0'` package version. It will fail-until-publish until `@wxyc/shared` v0.12.0 is on GitHub Packages, which is expected.
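To make the `expected_match_form` change concrete, here is a minimal Rust sketch of the kind of transformation the corpus now pins down. It is an illustration only, not the `to_match_form` implementation from wxyc-etl: the function name `match_form_sketch` is hypothetical, it strips only the four bidi marks named in this PR rather than the whole Unicode Cf category, and the position of each mark inside the sample strings is an assumption.

```rust
/// The four bidi marks discussed in this PR: LRM, RLM, RLO, PDF.
/// Assumption: a real implementation would consult the full Unicode Cf
/// category (minus U+200D ZWJ) instead of this hard-coded list.
const BIDI_MARKS: [char; 4] = ['\u{200E}', '\u{200F}', '\u{202E}', '\u{202C}'];

/// Hypothetical stand-in for `to_match_form`: drop the bidi marks, then
/// lowercase what is left. U+200D ZWJ is not in the list, so it survives.
fn match_form_sketch(input: &str) -> String {
    input
        .chars()
        .filter(|c| !BIDI_MARKS.contains(c))
        .flat_map(char::to_lowercase)
        .collect()
}

fn main() {
    // Mark placement inside each string is assumed for illustration.
    assert_eq!(match_form_sketch("Hello\u{200E}World"), "helloworld");
    assert_eq!(match_form_sketch("Hello\u{200F}World"), "helloworld");
    assert_eq!(match_form_sketch("\u{202E}Reversed\u{202C}"), "reversed");
    // ZWJ is kept, matching the charter rule's exception.
    assert_eq!(match_form_sketch("a\u{200D}b"), "a\u{200D}b");
    println!("match-form sketch ok");
}
```

None of this runs in this repo's detector, which compares raw bytes and never computes a match form.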
## Summary

Cascades the WX-2.2.1 (`to_storage_form`) corpus updates from WXYC/wxyc-shared#100 + WXYC/wxyc-etl#89 into the wikidata-cache CSV → PG COPY round-trip detector.

- Corpus pin: v0.10.0 / sha `75a3395b…` → v0.12.0 / sha `41a18c5c0a92d129ec4b575827b6874196bfb7591e4bdf237a918a5da2de7b66`
- `tests/fixtures/charset-torture.json` re-vendored from `@wxyc/shared` v0.12.0
- `tests/fixtures/charset-torture.json.sha256` updated
- `.github/workflows/charset-corpus-drift.yml` repointed to the new sha + `package-version: '0.12.0'`

## What changed in v0.12.0

- `bidi_marks`: `expected_match_form` is now `"helloworld"` / `"helloworld"` / `"reversed"` for the LRM, RLM, and RLO+PDF entries (was `null` in v0.10.0). Drives the WX-2 charter rule that `to_match_form` strips Cf characters except U+200D ZWJ.
- `normalization`: the two NFD entries (`café`, `ñ`) gain explicit NFC `expected_storage` for `to_storage_form` consumers.

## Why no new `[wjf:...]` xfails

`tests/charset_torture.rs` is a byte round-trip detector: it asserts `actual == input` after a CSV → PG COPY round-trip via `import::import_all`. It does not exercise `to_match_form` (the field that changed for bidi_marks) and does not run `to_storage_form` over the input before the round-trip (the pipeline writes raw CSV bytes).

The bidi marks (U+200E LRM, U+200F RLM, U+202E RLO, U+202C PDF) are all valid bytes in PostgreSQL `TEXT` and survive CSV → COPY losslessly. Only the existing `[wjf:pg-null-byte]` xfail (U+0000, rejected by PG TEXT per SQL standard) remains.

Local verification: `cargo test -- --ignored` against the docker-compose PG service: `corpus_csv_pg_roundtrip` passes (1 passed, 0 failed). Full `cargo test` is also green.

## Drift guard

`.github/workflows/charset-corpus-drift.yml` will fail-until-publish until `@wxyc/shared` v0.12.0 is on GitHub Packages; this is expected per the bump-or-freeze procedure documented in the wxyc-shared README.

## Test plan

- `cargo test`: all 87 unit/integration tests pass
- `cargo test -- --ignored`: `corpus_csv_pg_roundtrip` passes against PG (1 passed, 0 failed)
- CI: drift guard fails until `@wxyc/shared` v0.12.0 is published; all other jobs green
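The argument for adding no new xfails can be sketched in a few lines of std-only Rust. This is not code from `tests/charset_torture.rs`: the helpers `can_roundtrip_pg_text` and `bytes_equal` are hypothetical names, and the sketch touches no database. It only shows that the bidi marks are ordinary NUL-free UTF-8, and that a byte-exact comparison treats NFD and NFC spellings as different strings.

```rust
/// Hypothetical pre-check: PostgreSQL TEXT cannot store U+0000, so only
/// NUL-free input can survive the round-trip unchanged.
fn can_roundtrip_pg_text(input: &str) -> bool {
    !input.contains('\0')
}

/// The detector's comparison is byte-exact: no normalization, no stripping.
fn bytes_equal(actual: &str, input: &str) -> bool {
    actual.as_bytes() == input.as_bytes()
}

fn main() {
    // LRM, RLM, RLO, PDF: each encodes as a plain three-byte UTF-8 sequence.
    for mark in ['\u{200E}', '\u{200F}', '\u{202E}', '\u{202C}'] {
        let sample = format!("Hello{mark}World");
        assert_eq!(mark.len_utf8(), 3);
        assert!(can_roundtrip_pg_text(&sample));
    }

    // The one case that cannot round-trip (covered by the existing xfail).
    assert!(!can_roundtrip_pg_text("nul\0byte"));

    // NFD and NFC spellings of "café" differ byte-wise, so a byte round-trip
    // must return exactly the spelling it was given.
    let nfd = "cafe\u{0301}";
    let nfc = "caf\u{00E9}";
    assert!(!bytes_equal(nfd, nfc));
    assert!(bytes_equal(nfd, nfd));
    println!("round-trip sketch ok");
}
```

Because the comparison is on raw bytes, the new `expected_match_form` and `expected_storage` values never enter the assertion, so the corpus bump leaves the detector's pass/fail set unchanged.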