Convert datasets to parquet (un-shard USPTO) and rework validation CI #241
Open
skearnes wants to merge 7 commits into
Conversation
Adds 49 parquet outputs to data/ alongside the existing 546 pb.gz inputs,
plus scripts/convert_to_parquet.py used to generate them:
- 489 monthly USPTO grants (uspto-grants-YYYY_MM) consolidated into one
un-sharded parquet (1.77M reactions). Per-month CML filename provenance
is dropped from the description; per-reaction patent provenance remains.
- 10 (N/10) Training-data shards from doi.org/10.1039/C8SC04228D merged
into a single parquet (409,035 reactions).
- 47 other datasets converted 1:1, keeping name/description/dataset_id.
- pb.gz inputs are kept in place; parquet IDs for merged outputs are
derived deterministically from the sorted source dataset_ids (see the
sketch below) so re-runs are idempotent.
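For illustration, a minimal sketch of how a merged ID can be derived deterministically. The exact scheme in scripts/convert_to_parquet.py is not shown here; this assumes a SHA-256 over the newline-joined sorted source IDs, truncated to the 32-hex-character suffix that ORD dataset IDs use.

```python
# Hypothetical sketch; not the actual convert_to_parquet.py logic.
import hashlib

def merged_dataset_id(source_ids: list[str]) -> str:
    # Sorting first makes the digest independent of input order,
    # so re-running the conversion yields the same ID (idempotent).
    digest = hashlib.sha256("\n".join(sorted(source_ids)).encode()).hexdigest()
    return f"ord_dataset-{digest[:32]}"
```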
Validation workflow restructured:
- Renames validate_database -> validate_pb (it has always validated all
pb* extensions, not just pb.gz; the rename matches reality). fail-fast
is now disabled so one shard's failure does not cancel the others.
- New validate_parquet job with a 2-element matrix (sketched below):
* uspto: filters to the un-sharded USPTO parquet only; row-group
parallelism inside validate_dataset.py saturates the 4-CPU runner.
* other: negative lookahead on the USPTO id; the remaining 48
parquet files validate in parallel at file + row-group level.
Together they replace the parquet pass we briefly added inside the
9-shard pb matrix.
Note: validate_parquet depends on row-group parallelism in
ord-schema's validate_dataset.py (open in PR #812 there); a new
ord-schema release tag past v0.6.1 must land before bumping
ORD_SCHEMA_TAG here and merging this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required by the new validate_parquet job: row-group parallelism in ord-schema's validate_dataset.py landed in v0.6.2 (open-reaction-database/ord-schema#812) and saturates the 4-CPU runner on the un-sharded USPTO parquet. v0.6.3 adds an unrelated urllib3 dependency bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
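An illustrative sketch of row-group parallelism in the spirit of ord-schema#812; this is not the actual validate_dataset.py implementation, and the per-row-group check is a placeholder.

```python
# Sketch: parallel validation over parquet row groups.
from concurrent.futures import ProcessPoolExecutor

import pyarrow.parquet as pq


def check_row_group(path: str, index: int) -> int:
    # Read one row group and return the number of rows checked;
    # real validation would parse each reaction and collect errors.
    table = pq.ParquetFile(path).read_row_group(index)
    return table.num_rows


def validate_parquet(path: str, max_workers: int = 4) -> int:
    num_groups = pq.ParquetFile(path).num_row_groups
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(check_row_group, path, i) for i in range(num_groups)]
        return sum(f.result() for f in futures)
```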
Mirror now covers data/, .gitattributes, README.md, LICENSE, CITATION.cff, CONTRIBUTING.md, CONTRIBUTORS.md -- matching what is already on huggingface.co/datasets/open-reaction-database/ord-data and keeping LFS rules in sync between the two remotes. GitHub-side infrastructure (.github/, scripts/, badges/) remains excluded. The PR-trigger paths filter on huggingface_mirror.yml is widened to match, so dry-runs fire whenever any mirrored file changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
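The widened PR-trigger filter plausibly looks like the following; the file list comes from the commit message above, but the surrounding trigger block is an assumption.

```yaml
on:
  pull_request:
    paths:
      - "data/**"
      - ".gitattributes"
      - "README.md"
      - "LICENSE"
      - "CITATION.cff"
      - "CONTRIBUTING.md"
      - "CONTRIBUTORS.md"
```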
Maintainer PRs that touch dataset files but should not be reprocessed by process_dataset.py --update --cleanup (e.g., format conversions of already-finalized data, mass migrations) can carry the skip-update-submission label to skip the rewrite step. Validation still runs. Adds labeled/unlabeled to the workflow trigger so toggling the label re-fires the workflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
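A sketch of the label gate: the added trigger types and the label name are from the commit messages, but the job/step layout here is an assumption.

```yaml
on:
  pull_request:
    # labeled/unlabeled are the newly added types; the base types
    # shown here are assumed, not confirmed.
    types: [opened, synchronize, reopened, labeled, unlabeled]

jobs:
  process_submission:
    runs-on: ubuntu-latest
    steps:
      - name: Rewrite datasets
        # Skipped when the PR carries the skip-update-submission label;
        # validation steps run unconditionally.
        if: ${{ !contains(github.event.pull_request.labels.*.name, 'skip-update-submission') }}
        run: echo "process_dataset.py --update --cleanup ..."
```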
One in-flight run per PR. Pushes and label toggles arriving in quick succession (e.g., apply skip-update-submission and push a follow-up commit) now cancel the prior run instead of doubling the LFS checkout cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same intent as the submission-workflow concurrency group: cancel any prior in-flight matrix when a new event (PR push or push to main) supersedes it, so we are not paying twice for the LFS-heavy validate_pb / validate_parquet shards. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
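Both workflows can use the standard concurrency idiom; the group expression below is a representative sketch rather than necessarily the exact one committed.

```yaml
concurrency:
  # One in-flight run per PR (or per ref on pushes to main); a newer
  # event cancels the superseded run before its LFS checkout repeats.
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true
```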
actions/checkout v2 -> v4 and actions/setup-python v4 -> v5. Brings the two older workflows in line with huggingface_mirror.yml and clears the Node.js 20 deprecation warning visible in recent runs (v4/v5 ship on Node 24). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
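The bumps themselves are one-line pin changes per step, e.g.:

```yaml
- uses: actions/checkout@v4      # was actions/checkout@v2
- uses: actions/setup-python@v5  # was actions/setup-python@v4
```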
Collaborator
Git LFS usage in the current period is at 207.3 GB as of 13 May. Our monthly LFS bandwidth budget is 250 GB (according to the billing portal), but in March we used 554 GB and in April 354 GB, so we may have additional allowance that isn't shown in the portal. The meter resets in 19 days, so if we are worried about bandwidth we can delay until the end of the month and decide whether to use up the remaining May bandwidth, or get this job in first at the start of June.
Summary
Adds parquet copies of every dataset alongside the existing pb.gz files,
consolidates two shard groups into single un-sharded parquet files, and
restructures CI to validate the new format efficiently.
Conversion (commit ca97f95)

- uspto-grants-YYYY_MM pb.gz files → one un-sharded uspto-grants parquet (1.77M reactions, ~1.0 GB). Per-month CML filename provenance dropped from the description; per-reaction patent provenance is preserved.
- (N/10) Training-data shards from doi.org/10.1039/C8SC04228D → one parquet (409,035 reactions).
- dataset_ids derived deterministically from the sorted source ids, so re-runs are idempotent.
- scripts/convert_to_parquet.py is the script that did this; left in-tree as the audit trail and for future re-conversions.

CI restructure (commit ca97f95)

- validate_database → validate_pb (the input glob has always matched all *.pb* extensions). Adds fail-fast: false so one shard failure does not cancel the others. The old 9-way hex-prefix matrix is otherwise unchanged.
- New validate_parquet job with a 2-element matrix:
  - uspto: filters to the un-sharded USPTO parquet only. Row-group parallelism in validate_dataset.py saturates the 4-CPU runner on this single ~1.77M-reaction file.
  - other: negative lookahead on the USPTO id; the remaining 48 parquet files validate in parallel at file + row-group level.

Tag bump (commit e0523b6)

- ORD_SCHEMA_TAG v0.6.1 → v0.6.3. Required by the new validate_parquet job: row-group parallelism in validate_dataset.py landed in v0.6.2 (Parallelize parquet dataset validation over row groups, ord-schema#812). v0.6.3 is the latest release.
HF mirror scope (commit ff91745)

- scripts/upload_to_huggingface.py and huggingface_mirror.yml widened to mirror data/, .gitattributes, README.md, LICENSE, CITATION.cff, CONTRIBUTING.md, CONTRIBUTORS.md — matches what already exists on the HF dataset. GitHub-side infrastructure (.github/, scripts/, badges/) stays excluded.
- .gitattributes was pushed to the HF dataset out-of-band so future parquet uploads preupload as LFS regardless of size. (Without this, small singleton parquets would have landed as regular blobs.)
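The relevant .gitattributes rule is the conventional LFS pattern; the committed file may carry additional rules beyond this one.

```
*.parquet filter=lfs diff=lfs merge=lfs -text
```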
Test plan

- validate_pb (9 shards): all pass.
- validate_parquet (uspto): passes.
- validate_parquet (other): passes.
- process_submission validates the 49 added parquet files cleanly.
- huggingface_mirror dry-run shows 49 parquet uploads, 0 deletions.
Notes for reviewers

- process_submission and the HF mirror dry-run.
- validate_pb is renamed, so required checks in branch protection (if any) will need updating after merge.

🤖 Generated with Claude Code