
Convert datasets to parquet (un-shard USPTO) and rework validation CI #241

Open

skearnes wants to merge 7 commits into main from parquet-conversion

Conversation

@skearnes
Member

Summary

Adds parquet copies of every dataset alongside the existing pb.gz files,
consolidates two shard groups into single un-sharded parquet files, and
restructures CI to validate the new format efficiently.

Conversion (commit ca97f95)

  • 489 monthly uspto-grants-YYYY_MM pb.gz files → one un-sharded uspto-grants parquet (1.77M reactions, ~1.0 GB). Per-month CML filename provenance dropped from the description; per-reaction patent provenance is preserved.
  • 10 (N/10) Training-data shards from doi.org/10.1039/C8SC04228D → one parquet (409,035 reactions).
  • 47 other datasets converted 1:1, keeping name/description/dataset_id.
  • pb.gz inputs kept in place.
  • Merged-output dataset_ids are derived deterministically from the sorted source ids, so re-runs are idempotent (see the sketch after this list).
  • The conversion was done by the new scripts/convert_to_parquet.py, left in-tree as an audit trail and for future re-conversions.
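
On the deterministic ids: a minimal sketch of one way such a derivation could work, assuming the ORD `ord_dataset-<32 hex chars>` naming convention. `merged_dataset_id` and the SHA-256 construction are illustrative; the actual logic lives in scripts/convert_to_parquet.py and may differ.

```python
# Illustrative only: derive a stable id for a merged dataset from its
# source ids. Sorting first makes the result independent of input order,
# so re-running the conversion always yields the same id.
import hashlib


def merged_dataset_id(source_ids: list[str]) -> str:  # hypothetical helper
    digest = hashlib.sha256("\n".join(sorted(source_ids)).encode()).hexdigest()
    return f"ord_dataset-{digest[:32]}"
```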

CI restructure (commit ca97f95)

  • Renames validate_database → validate_pb (the input glob has always matched all *.pb* extensions). Adds fail-fast: false so one shard failure does not cancel the others. The old 9-way hex-prefix matrix is otherwise unchanged.
  • Removes the parquet pass that briefly lived inside that job (added in #240, "Bump ORD_SCHEMA_TAG to v0.6.1 and add parquet support").
  • Adds a new validate_parquet job with a 2-element matrix:
    • uspto: filters to the un-sharded USPTO parquet only. Row-group parallelism in validate_dataset.py saturates the 4-CPU runner on this single ~1.77M-reaction file.
    • other: negative lookahead on the USPTO id; the remaining 48 parquet files validate in parallel at file + row-group level (sketched after this list).
  • Local validation of all 49 parquet files (n_jobs=8 on an 8-CPU machine) completed in 19m58s with zero errors.
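
To make the two-shard split and the fan-out concrete, here is a minimal sketch assuming pyarrow. `validate_row_group()` is a placeholder for ord-schema's real per-row-group validation in validate_dataset.py, and the `uspto-grants` pattern stands in for the actual merged dataset id.

```python
# Sketch of the uspto/other split plus file + row-group parallelism.
import re
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pyarrow.parquet as pq

USPTO = re.compile(r"uspto-grants")  # placeholder pattern, not the real id


def validate_row_group(path: str, index: int) -> int:
    """Reads one row group; returns the number of validation errors found."""
    pq.ParquetFile(path).read_row_group(index)  # real code checks each Reaction
    return 0


def validate(paths: list[Path], n_jobs: int = 4) -> int:
    # Fan out over every (file, row group) pair so a single large file
    # (the un-sharded USPTO parquet) still saturates all CPUs.
    jobs = [
        (str(path), index)
        for path in paths
        for index in range(pq.ParquetFile(path).num_row_groups)
    ]
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        return sum(pool.map(validate_row_group, *zip(*jobs)))


if __name__ == "__main__":
    parquets = sorted(Path("data").rglob("*.parquet"))
    uspto = [p for p in parquets if USPTO.search(p.name)]
    other = [p for p in parquets if not USPTO.search(p.name)]
    print(validate(uspto), validate(other))  # the two matrix entries
```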

Tag bump (commit e0523b6)

  • Bumps ORD_SCHEMA_TAG to pick up row-group parallelism in ord-schema's validate_dataset.py (landed in v0.6.2 via open-reaction-database/ord-schema#812); v0.6.3 adds an unrelated urllib3 dependency bump.

HF mirror scope (commit ff91745)

  • scripts/upload_to_huggingface.py and huggingface_mirror.yml widened to mirror data/, .gitattributes, README.md, LICENSE, CITATION.cff, CONTRIBUTING.md, CONTRIBUTORS.md — matching what already exists on the HF dataset (see the sketch after this list). GitHub-side infrastructure (.github/, scripts/, badges/) stays excluded.
  • One-time bootstrap already done: .gitattributes was pushed to the HF dataset out-of-band so future parquet uploads preupload as LFS regardless of size. (Without this, small singleton parquets would have landed as regular blobs.)
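
A sketch of the widened scope, assuming huggingface_hub's upload_folder with an allow-list; the real behavior lives in scripts/upload_to_huggingface.py and its details may differ.

```python
# Sketch only: mirror the allow-listed paths to the HF dataset repo.
from huggingface_hub import HfApi

MIRRORED = [
    "data/**",
    ".gitattributes",
    "README.md",
    "LICENSE",
    "CITATION.cff",
    "CONTRIBUTING.md",
    "CONTRIBUTORS.md",
]

HfApi().upload_folder(
    repo_id="open-reaction-database/ord-data",
    repo_type="dataset",
    folder_path=".",
    allow_patterns=MIRRORED,  # .github/, scripts/, badges/ stay excluded
    commit_message="Mirror from GitHub",
)
```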

Test plan

  • validate_pb (9 shards) all pass.
  • validate_parquet (uspto) passes.
  • validate_parquet (other) passes.
  • process_submission validates the 49 added parquet files cleanly.
  • huggingface_mirror dry-run shows 49 parquet uploads, 0 deletions.

Notes for reviewers

  • This PR will trigger 11 LFS-heavy CI checkouts (9 pb shards + 2 parquet shards) plus process_submission and the HF mirror dry-run.
  • validate_pb is a renamed job, so required checks in branch protection (if any) will need updating after merge.

🤖 Generated with Claude Code

skearnes and others added 4 commits May 12, 2026 18:29
Adds 49 parquet outputs to data/ alongside the existing 546 pb.gz inputs,
plus scripts/convert_to_parquet.py used to generate them:

- 489 monthly USPTO grants (uspto-grants-YYYY_MM) consolidated into one
  un-sharded parquet (1.77M reactions). Per-month CML filename provenance
  is dropped from the description; per-reaction patent provenance remains.
- 10 (N/10) Training-data shards from doi.org/10.1039/C8SC04228D merged
  into a single parquet (409035 reactions).
- 47 other datasets converted 1:1, keeping name/description/dataset_id.
- pb.gz inputs are kept in place; parquet IDs for merged outputs are
  derived deterministically from the sorted source dataset_ids so re-runs
  are idempotent.

Validation workflow restructured:

- Renames validate_database -> validate_pb (it has always validated all
  pb* extensions, not just pb.gz; the rename matches reality). fail-fast
  now disabled so one shard's failure does not cancel the others.
- New validate_parquet job with a 2-element matrix:
    * uspto: filters to the un-sharded USPTO parquet only; row-group
      parallelism inside validate_dataset.py saturates the 4-CPU runner.
    * other: negative lookahead on the USPTO id; the remaining 48
      parquet files validate in parallel at file + row-group level.
  Together they replace the parquet pass we briefly added inside the
  9-shard pb matrix.

Note: validate_parquet depends on row-group parallelism in
ord-schema's validate_dataset.py (open in PR #812 there); a new
ord-schema release tag past v0.6.1 must land before bumping
ORD_SCHEMA_TAG here and merging this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required by the new validate_parquet job: row-group parallelism in
ord-schema's validate_dataset.py landed in v0.6.2 (open-reaction-database/ord-schema#812)
and saturates the 4-CPU runner on the un-sharded USPTO parquet. v0.6.3
adds an unrelated urllib3 dependency bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror now covers data/, .gitattributes, README.md, LICENSE,
CITATION.cff, CONTRIBUTING.md, CONTRIBUTORS.md -- matching what is
already on huggingface.co/datasets/open-reaction-database/ord-data and
keeping LFS rules in sync between the two remotes. GitHub-side
infrastructure (.github/, scripts/, badges/) remains excluded.

The PR-trigger paths filter on huggingface_mirror.yml is widened to
match, so dry-runs fire whenever any mirrored file changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Maintainer PRs that touch dataset files but should not be reprocessed
by process_dataset.py --update --cleanup (e.g., format conversions of
already-finalized data, mass migrations) can carry the label to skip
the rewrite step. Validation still runs. Adds labeled/unlabeled to the
workflow trigger so toggling the label re-fires the workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
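
A hypothetical sketch of how the label gate could be checked from the Actions event payload; the real workflow probably expresses this as a YAML if: condition on the step, so the names here are illustrative.

```python
# Hypothetical: decide whether to skip the rewrite step based on PR labels.
import json
import os

SKIP_LABEL = "skip-update-submission"

with open(os.environ["GITHUB_EVENT_PATH"]) as f:
    event = json.load(f)

labels = {label["name"] for label in event.get("pull_request", {}).get("labels", [])}
if SKIP_LABEL in labels:
    print("skipping process_dataset.py --update --cleanup")  # validation still runs
else:
    print("running rewrite step")
```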
@skearnes skearnes added the skip-update-submission Skip Update step in Submission workflow; validation still runs. label May 13, 2026
skearnes and others added 3 commits May 12, 2026 20:33
One in-flight run per PR. Pushes and label toggles arriving in quick
succession (e.g., apply skip-update-submission and push a follow-up
commit) now cancel the prior run instead of doubling the LFS checkout
cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same intent as the submission-workflow concurrency group: cancel any
prior in-flight matrix when a new event (PR push or push to main)
supersedes it, so we are not paying twice for the LFS-heavy
validate_pb / validate_parquet shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
actions/checkout v2 -> v4 and actions/setup-python v4 -> v5. Brings the
two older workflows in line with huggingface_mirror.yml and clears the
Node.js 20 deprecation warning visible in recent runs (v4/v5 ship on
Node 24).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skearnes skearnes requested a review from bdeadman May 13, 2026 00:40
@bdeadman
Collaborator

Git LFS usage in the current period is at 207.3 GB as of 13 May.

Our monthly LFS bandwidth budget is 250 GB (according to the billing portal), but in March we used 554 GB and in April 354 GB, so we may have additional allowance that isn't shown in the portal. The meter resets in 19 days, so if we are worried about bandwidth we can either delay until the end of the month and use up the remaining May allowance, or get this job in first at the start of June.

