Bug: get_c4 broken with datasets>=2.14 (allenai--c4 config removed) #87

@UriKialy

Bug

get_c4() in lib/data.py fails with recent versions of the datasets library (>=2.14):

ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', ...]

The config name allenai--c4 with explicit data_files no longer works. The fix is to use the standard 'en' config with streaming:

Before (broken)

traindata = load_dataset('allenai/c4', 'allenai--c4',
                         data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')

After (fix)

traindata = list(load_dataset('allenai/c4', 'en', split='train', streaming=True).take(10000))

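The same applies to the validation split. In addition, valdata[:1100]['text'] needs to become [d['text'] for d in valdata] (slice valdata first, e.g. valdata[:1100], to keep the original 1100-example limit), because list(...) over a streaming dataset yields a plain list of dicts, not a Dataset object.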
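
The list-of-dicts handling described above can be sketched as follows. This is only a minimal sketch, not the repository's code: take_texts is a hypothetical helper, and the commented usage assumes the same load_dataset(..., streaming=True) call as in the fix. It uses itertools.islice so that only the first n examples are pulled from the stream.

```python
from itertools import islice


def take_texts(stream, n, field="text"):
    """Return the `field` value of the first `n` examples from an iterable of dicts."""
    # islice stops after n items, so a streaming dataset is never fully consumed.
    return [example[field] for example in islice(stream, n)]


# Hypothetical usage with the streaming fix (needs network access, so left commented out):
# from datasets import load_dataset
# valstream = load_dataset('allenai/c4', 'en', split='validation', streaming=True)
# val_texts = take_texts(valstream, 1100)
```

The resulting list of strings can then be joined and tokenized the same way the old valdata[:1100]['text'] column was.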

Additional: position_embeddings in newer transformers

The layer forward calls in lib/prune.py fail with transformers>=4.45 because LlamaDecoderLayer.forward() now expects position_embeddings (the RoPE cos/sin pair) to be passed explicitly. When it is left out, the attention code tries to unpack None:

TypeError: cannot unpack non-iterable NoneType object

Fix: pre-compute the RoPE embeddings once and pass them as position_embeddings to each layer call.
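
A minimal sketch of that fix, assuming the model exposes its rotary embedding module as model.model.rotary_emb and that calling it with (hidden_states, position_ids) returns a (cos, sin) pair, as recent transformers Llama implementations appear to do. run_layers_with_rope is a hypothetical helper, not the repository's code, and the exact layer call signature may differ between transformers versions.

```python
def run_layers_with_rope(layers, rotary_emb, hidden_states, attention_mask, position_ids):
    """Run decoder layers in sequence, passing precomputed RoPE embeddings to each."""
    # Compute the (cos, sin) pair once; it does not depend on the layer,
    # so every layer call can reuse it.
    position_embeddings = rotary_emb(hidden_states, position_ids)
    for layer in layers:
        out = layer(
            hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            position_embeddings=position_embeddings,
        )
        # Some versions return a tuple whose first element is the hidden states,
        # others return the tensor directly; accept either form.
        hidden_states = out[0] if isinstance(out, tuple) else out
    return hidden_states


# Hypothetical usage (assumes a loaded Llama model and prepared inputs):
# out = run_layers_with_rope(model.model.layers, model.model.rotary_emb,
#                            inps, attention_mask, position_ids)
```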

Environment

  • datasets==3.x
  • transformers==4.50+
  • Python 3.11

Happy to open a PR with the fixes if helpful.
