
Convert datasets to parquet (un-shard USPTO) and rework validation CI #241

Open

skearnes wants to merge 7 commits into main from parquet-conversion

Conversation

@skearnes
Member

Summary

Adds parquet copies of every dataset alongside the existing pb.gz files,
consolidates two shard groups into single un-sharded parquet files, and
restructures CI to validate the new format efficiently.

Conversion (commit ca97f95)

  • 489 monthly uspto-grants-YYYY_MM pb.gz files → one un-sharded uspto-grants parquet (1.77M reactions, ~1.0 GB). Per-month CML filename provenance dropped from the description; per-reaction patent provenance is preserved.
  • 10 (N/10) Training-data shards from doi.org/10.1039/C8SC04228D → one parquet (409,035 reactions).
  • 47 other datasets converted 1:1, keeping name/description/dataset_id.
  • pb.gz inputs kept in place.
  • Merged-output dataset_ids are derived deterministically from the sorted source ids, so re-runs are idempotent (see the sketch after this list).
  • The conversion was done by the new scripts/convert_to_parquet.py, left in-tree as an audit trail and for future re-conversions.
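
On the deterministic ids: a minimal sketch of one way such a derivation could work, assuming the ORD `ord_dataset-<32 hex chars>` naming convention. `merged_dataset_id` and the SHA-256 construction are illustrative; the actual logic lives in scripts/convert_to_parquet.py and may differ.

```python
# Illustrative only: derive a stable id for a merged dataset from its
# source ids. Sorting first makes the result independent of input order,
# so re-running the conversion always yields the same id.
import hashlib


def merged_dataset_id(source_ids: list[str]) -> str:  # hypothetical helper
    digest = hashlib.sha256("\n".join(sorted(source_ids)).encode()).hexdigest()
    return f"ord_dataset-{digest[:32]}"
```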

CI restructure (commit ca97f95)

  • Renames validate_database → validate_pb (the input glob has always matched all *.pb* extensions). Adds fail-fast: false so one shard failure does not cancel the others. The old 9-way hex-prefix matrix is otherwise unchanged.
  • Removes the parquet pass that briefly lived inside that job (added in #240, "Bump ORD_SCHEMA_TAG to v0.6.1 and add parquet support").
  • Adds a new validate_parquet job with a 2-element matrix:
    • uspto: filters to the un-sharded USPTO parquet only. Row-group parallelism in validate_dataset.py saturates the 4-CPU runner on this single ~1.77M-reaction file.
    • other: negative lookahead on the USPTO id; the remaining 48 parquet files validate in parallel at file + row-group level (sketched after this list).
  • Local validation of all 49 parquet files (n_jobs=8 on an 8-CPU machine) completed in 19m58s with zero errors.
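
To make the two-shard split and the fan-out concrete, here is a minimal sketch assuming pyarrow. `validate_row_group()` is a placeholder for ord-schema's real per-row-group validation in validate_dataset.py, and the `uspto-grants` pattern stands in for the actual merged dataset id.

```python
# Sketch of the uspto/other split plus file + row-group parallelism.
import re
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pyarrow.parquet as pq

USPTO = re.compile(r"uspto-grants")  # placeholder pattern, not the real id


def validate_row_group(path: str, index: int) -> int:
    """Reads one row group; returns the number of validation errors found."""
    pq.ParquetFile(path).read_row_group(index)  # real code checks each Reaction
    return 0


def validate(paths: list[Path], n_jobs: int = 4) -> int:
    # Fan out over every (file, row group) pair so a single large file
    # (the un-sharded USPTO parquet) still saturates all CPUs.
    jobs = [
        (str(path), index)
        for path in paths
        for index in range(pq.ParquetFile(path).num_row_groups)
    ]
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        return sum(pool.map(validate_row_group, *zip(*jobs)))


if __name__ == "__main__":
    parquets = sorted(Path("data").rglob("*.parquet"))
    uspto = [p for p in parquets if USPTO.search(p.name)]
    other = [p for p in parquets if not USPTO.search(p.name)]
    print(validate(uspto), validate(other))  # the two matrix entries
```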

Tag bump (commit e0523b6)

  • Bumps ORD_SCHEMA_TAG to pick up row-group parallelism in ord-schema's validate_dataset.py (landed in v0.6.2 via open-reaction-database/ord-schema#812); v0.6.3 adds an unrelated urllib3 dependency bump.

HF mirror scope (commit ff91745)

  • scripts/upload_to_huggingface.py and huggingface_mirror.yml widened to mirror data/, .gitattributes, README.md, LICENSE, CITATION.cff, CONTRIBUTING.md, CONTRIBUTORS.md — matching what already exists on the HF dataset (see the sketch after this list). GitHub-side infrastructure (.github/, scripts/, badges/) stays excluded.
  • One-time bootstrap already done: .gitattributes was pushed to the HF dataset out-of-band so future parquet uploads preupload as LFS regardless of size. (Without this, small singleton parquets would have landed as regular blobs.)
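
A sketch of the widened scope, assuming huggingface_hub's upload_folder with an allow-list; the real behavior lives in scripts/upload_to_huggingface.py and its details may differ.

```python
# Sketch only: mirror the allow-listed paths to the HF dataset repo.
from huggingface_hub import HfApi

MIRRORED = [
    "data/**",
    ".gitattributes",
    "README.md",
    "LICENSE",
    "CITATION.cff",
    "CONTRIBUTING.md",
    "CONTRIBUTORS.md",
]

HfApi().upload_folder(
    repo_id="open-reaction-database/ord-data",
    repo_type="dataset",
    folder_path=".",
    allow_patterns=MIRRORED,  # .github/, scripts/, badges/ stay excluded
    commit_message="Mirror from GitHub",
)
```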

Test plan

  • validate_pb (9 shards) all pass.
  • validate_parquet (uspto) passes.
  • validate_parquet (other) passes.
  • process_submission validates the 49 added parquet files cleanly.
  • huggingface_mirror dry-run shows 49 parquet uploads, 0 deletions.

Notes for reviewers

  • This PR will trigger 11 LFS-heavy CI checkouts (9 pb shards + 2 parquet shards) plus process_submission and the HF mirror dry-run.
  • validate_pb is a renamed job, so required checks in branch protection (if any) will need updating after merge.

🤖 Generated with Claude Code

skearnes and others added 4 commits May 12, 2026 18:29
Adds 49 parquet outputs to data/ alongside the existing 546 pb.gz inputs,
plus scripts/convert_to_parquet.py used to generate them:

- 489 monthly USPTO grants (uspto-grants-YYYY_MM) consolidated into one
  un-sharded parquet (1.77M reactions). Per-month CML filename provenance
  is dropped from the description; per-reaction patent provenance remains.
- 10 (N/10) Training-data shards from doi.org/10.1039/C8SC04228D merged
  into a single parquet (409035 reactions).
- 47 other datasets converted 1:1, keeping name/description/dataset_id.
- pb.gz inputs are kept in place; parquet IDs for merged outputs are
  derived deterministically from the sorted source dataset_ids so re-runs
  are idempotent.

Validation workflow restructured:

- Renames validate_database -> validate_pb (it has always validated all
  pb* extensions, not just pb.gz; the rename matches reality). fail-fast
  now disabled so one shard's failure does not cancel the others.
- New validate_parquet job with a 2-element matrix:
    * uspto: filters to the un-sharded USPTO parquet only; row-group
      parallelism inside validate_dataset.py saturates the 4-CPU runner.
    * other: negative lookahead on the USPTO id; the remaining 48
      parquet files validate in parallel at file + row-group level.
  Together they replace the parquet pass we briefly added inside the
  9-shard pb matrix.

Note: validate_parquet depends on row-group parallelism in
ord-schema's validate_dataset.py (open in PR #812 there); a new
ord-schema release tag past v0.6.1 must land before bumping
ORD_SCHEMA_TAG here and merging this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required by the new validate_parquet job: row-group parallelism in
ord-schema's validate_dataset.py landed in v0.6.2 (open-reaction-database/ord-schema#812)
and saturates the 4-CPU runner on the un-sharded USPTO parquet. v0.6.3
adds an unrelated urllib3 dependency bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror now covers data/, .gitattributes, README.md, LICENSE,
CITATION.cff, CONTRIBUTING.md, CONTRIBUTORS.md -- matching what is
already on huggingface.co/datasets/open-reaction-database/ord-data and
keeping LFS rules in sync between the two remotes. GitHub-side
infrastructure (.github/, scripts/, badges/) remains excluded.

The PR-trigger paths filter on huggingface_mirror.yml is widened to
match, so dry-runs fire whenever any mirrored file changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Maintainer PRs that touch dataset files but should not be reprocessed
by process_dataset.py --update --cleanup (e.g., format conversions of
already-finalized data, mass migrations) can carry the label to skip
the rewrite step. Validation still runs. Adds labeled/unlabeled to the
workflow trigger so toggling the label re-fires the workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
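
A hypothetical sketch of how the label gate could be checked from the Actions event payload; the real workflow probably expresses this as a YAML if: condition on the step, so the names here are illustrative.

```python
# Hypothetical: decide whether to skip the rewrite step based on PR labels.
import json
import os

SKIP_LABEL = "skip-update-submission"

with open(os.environ["GITHUB_EVENT_PATH"]) as f:
    event = json.load(f)

labels = {label["name"] for label in event.get("pull_request", {}).get("labels", [])}
if SKIP_LABEL in labels:
    print("skipping process_dataset.py --update --cleanup")  # validation still runs
else:
    print("running rewrite step")
```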
@skearnes skearnes added the skip-update-submission Skip Update step in Submission workflow; validation still runs. label May 13, 2026
skearnes and others added 3 commits May 12, 2026 20:33
One in-flight run per PR. Pushes and label toggles arriving in quick
succession (e.g., apply skip-update-submission and push a follow-up
commit) now cancel the prior run instead of doubling the LFS checkout
cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same intent as the submission-workflow concurrency group: cancel any
prior in-flight matrix when a new event (PR push or push to main)
supersedes it, so we are not paying twice for the LFS-heavy
validate_pb / validate_parquet shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
actions/checkout v2 -> v4 and actions/setup-python v4 -> v5. Brings the
two older workflows in line with huggingface_mirror.yml and clears the
Node.js 20 deprecation warning visible in recent runs (v4/v5 ship on
Node 24).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skearnes skearnes requested a review from bdeadman May 13, 2026 00:40
@bdeadman
Collaborator

Git LFS usage in the current period is at 207.3 GB as of 13 May.

Our monthly LFS bandwidth budget is 250 GB (according to the billing portal), but in March we used 554 GB and in April 354 GB, so we may have additional allowance that isn't shown in the portal. The meter resets in 19 days, so if we are worried about bandwidth we can either delay until the end of the month and use up the remaining May allowance, or get this job in first at the start of June.

