test: bump charset-torture corpus pin to v0.12.0#29

Open
jakebromberg wants to merge 1 commit into main from wx-2-corpus-bump-v0.12.0

@jakebromberg
Member

Summary

Cascades the WX-2.2.1 (to_storage_form) corpus updates from WXYC/wxyc-shared#100 + WXYC/wxyc-etl#89 into the wikidata-cache CSV → PG COPY round-trip detector.

  • Corpus pin: v0.10.0 / sha 75a3395b… → v0.12.0 / sha 41a18c5c0a92d129ec4b575827b6874196bfb7591e4bdf237a918a5da2de7b66
  • tests/fixtures/charset-torture.json re-vendored from @wxyc/shared v0.12.0
  • tests/fixtures/charset-torture.json.sha256 updated
  • .github/workflows/charset-corpus-drift.yml repointed to the new sha + package-version: '0.12.0'

What changed in v0.12.0

  • bidi_marks: expected_match_form is now "helloworld" / "helloworld" / "reversed" for the LRM, RLM, and RLO+PDF entries respectively (was null in v0.10.0). These entries exercise the WX-2 charter rule that to_match_form strips Cf characters except U+200D ZWJ.
  • normalization: two NFD entries (including café) gain an explicit NFC expected_storage value for to_storage_form consumers.
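
To make the bidi_marks expectations concrete, here is a minimal sketch of the "strip Cf except ZWJ" rule. It is not the real to_match_form from @wxyc/shared or wxyc-etl: `sketch_match_form` and `is_stripped_format_char` are hypothetical names, the Cf list is a hand-picked subset rather than the full Unicode category, and the lowercasing step is an assumption inferred from the expected values above.

```rust
/// Hypothetical stand-in for `to_match_form` (illustration only):
/// drops a hard-coded subset of Unicode format (Cf) characters,
/// keeps U+200D ZWJ, then lowercases what remains.
fn sketch_match_form(input: &str) -> String {
    input
        .chars()
        .filter(|&c| !is_stripped_format_char(c))
        .flat_map(char::to_lowercase)
        .collect()
}

/// Partial list of Cf code points. A real implementation would consult
/// the full Unicode general category table instead of these ranges.
fn is_stripped_format_char(c: char) -> bool {
    if c == '\u{200D}' {
        return false; // ZWJ is the one Cf character the rule preserves
    }
    matches!(
        c,
        '\u{00AD}'                    // soft hyphen
        | '\u{200B}'..='\u{200F}'     // ZWSP, ZWNJ, (ZWJ handled above), LRM, RLM
        | '\u{202A}'..='\u{202E}'     // LRE, RLE, PDF, LRO, RLO
        | '\u{2060}'..='\u{2064}'     // word joiner and invisible operators
        | '\u{2066}'..='\u{2069}'     // LRI, RLI, FSI, PDI
        | '\u{FEFF}'                  // zero width no-break space / BOM
    )
}
```

Under these assumptions, the three corpus inputs (Hello + LRM + World, Hello + RLM + World, RLO + Reversed + PDF) map to "helloworld", "helloworld", and "reversed", while a string containing ZWJ keeps it.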

Why no new [wjf:...] xfails

tests/charset_torture.rs is a byte round-trip detector — it asserts actual == input after a CSV → PG COPY round-trip via import::import_all. It does not exercise to_match_form (the field that changed for bidi_marks) and does not run to_storage_form over the input before the round-trip (the pipeline writes raw CSV bytes).

The bidi marks (U+200E LRM, U+200F RLM, U+202E RLO, U+202C PDF) are all valid characters in PostgreSQL TEXT and survive CSV → COPY losslessly. Only the existing [wjf:pg-null-byte] xfail (U+0000, rejected by PG TEXT per SQL standard) remains.
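
The shape of that argument can be sketched with an in-memory model of the round trip. This is not the code in tests/charset_torture.rs and does not call import::import_all or PostgreSQL: `csv_quote`, `csv_unquote`, `store_as_pg_text`, and `round_trip` are hypothetical helpers that only mimic CSV quoting and the NUL rejection described above.

```rust
/// Quote one field CSV-style: wrap in double quotes, double any embedded quote.
fn csv_quote(field: &str) -> String {
    format!("\"{}\"", field.replace('"', "\"\""))
}

/// Inverse of `csv_quote`. Assumes the input was produced by `csv_quote`;
/// returns None if the surrounding quotes are missing.
fn csv_unquote(quoted: &str) -> Option<String> {
    let inner = quoted.strip_prefix('"')?.strip_suffix('"')?;
    Some(inner.replace("\"\"", "\""))
}

/// Model of the storage step: the only rule mimicked here is that
/// U+0000 is rejected, as PG TEXT does.
fn store_as_pg_text(value: &str) -> Result<String, String> {
    if value.contains('\0') {
        return Err("U+0000 cannot be stored in PG TEXT".to_string());
    }
    Ok(value.to_string())
}

/// Quote, parse back, store. A detector of this kind then asserts that
/// the returned value equals the original input.
fn round_trip(input: &str) -> Result<String, String> {
    let line = csv_quote(input);
    let parsed = csv_unquote(&line).ok_or_else(|| "malformed CSV field".to_string())?;
    store_as_pg_text(&parsed)
}
```

In this model the bidi_marks inputs come back unchanged, so an `actual == input` assertion holds for them, and only an input containing U+0000 yields an error, which corresponds to the single remaining xfail.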

Local verification: cargo test -- --ignored against the docker-compose PG service — corpus_csv_pg_roundtrip passes (1 passed, 0 failed). Full cargo test is also green.

Drift guard

.github/workflows/charset-corpus-drift.yml will fail until @wxyc/shared v0.12.0 is published to GitHub Packages. This is expected per the bump-or-freeze procedure documented in the wxyc-shared README.

Test plan

  • cargo test — all 87 unit/integration tests pass
  • cargo test -- --ignored: corpus_csv_pg_roundtrip passes against PG (1 passed, 0 failed)
  • CI: drift guard expected to fail until @wxyc/shared v0.12.0 is published; all other jobs green

Cascade WX-2.2.1 (`to_storage_form`) corpus updates from WXYC/wxyc-shared#100 + WXYC/wxyc-etl#89. Corpus moves from v0.10.0 (sha 75a3395b…) to v0.12.0 (sha 41a18c5c0a92d129ec4b575827b6874196bfb7591e4bdf237a918a5da2de7b66). The new corpus formalizes `expected_match_form` for the bidi_marks category (`Hello‎World` → `helloworld`, `Hello‏World` → `helloworld`, `‮Reversed‬` → `reversed`) and adds NFC `expected_storage` for two NFD `normalization` entries.

This repo's `tests/charset_torture.rs` detector is a CSV → PG COPY byte round-trip — it asserts `actual == input`, not `actual == expected_match_form` or `actual == expected_storage`. The bidi_marks bytes round-trip cleanly (LRM/RLM/RLO/PDF are all valid PG TEXT) so no new `[wjf:...]` xfails are needed; only the existing `[wjf:pg-null-byte]` xfail remains. Verified with `cargo test -- --ignored` against the docker-compose PG service (1 passed, 0 failed).

Drift-guard CI (`.github/workflows/charset-corpus-drift.yml`) is repointed to the new sha + `'0.12.0'` package version. It will fail until `@wxyc/shared` v0.12.0 is published to GitHub Packages, which is expected.