
Fix: support preformatted text datasets in train_eagle3.py (avoid forcing conversations generator)#498

Open
Seun-Ajayi wants to merge 7 commits into sgl-project:main from Seun-Ajayi:main

Conversation


@Seun-Ajayi Seun-Ajayi commented Mar 9, 2026

Motivation

The SpecForge documentation describes training EAGLE3 draft models on preformatted text datasets with the schema:

{"id": "...", "text": "..."}

when the --is-preformatted flag is used.

However, train_eagle3.py currently loads every dataset through safe_conversations_generator, which assumes the dataset contains a conversations field. As a result, preformatted rows are rewritten to:

{"conversations": []}

which breaks the expected schema and causes training to fail with:

ValueError: Expected 'text' column for is_preformatted=True, but found columns: ['conversations']

This PR fixes the incompatibility so that preformatted datasets documented in SpecForge can be used directly for EAGLE3 training.
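The failure mode above can be reproduced in isolation. The generator below is a simplified stand-in for safe_conversations_generator (an assumption about its behavior for illustration, not the actual SpecForge code): any row lacking a conversations field collapses to an empty list, silently discarding the text column that the preformatted pipeline expects.

```python
# Minimal sketch of the schema mismatch. `safe_conversations_generator`
# here is a simplified stand-in, not the real SpecForge implementation.
import io
import json

def safe_conversations_generator(file_obj):
    for line in file_obj:
        row = json.loads(line)
        # Preformatted rows ({"id", "text"}) have no "conversations" key,
        # so they degrade to an empty conversation list.
        yield {"conversations": row.get("conversations", [])}

preformatted = io.StringIO('{"id": "0", "text": "Hello world"}\n')
rows = list(safe_conversations_generator(preformatted))
print(rows)  # [{'conversations': []}] -- the "text" column is lost
```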

Modifications

  1. Conditional dataset loading for --is-preformatted

When --is-preformatted is enabled, datasets are loaded directly via HuggingFace load_dataset("json") instead of the safe_conversations_generator.

Before:

train_dataset = Dataset.from_generator(
    generator=safe_conversations_generator,
    gen_kwargs={"file_path": args.train_data_path},
)

After:

if args.is_preformatted:
    train_dataset = load_dataset(
        "json",
        data_files=args.train_data_path,
        split="train",
    )
else:
    train_dataset = Dataset.from_generator(
        generator=safe_conversations_generator,
        gen_kwargs={"file_path": args.train_data_path},
    )

The same logic is applied to evaluation datasets.
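With direct JSON loading, the text column survives and the column check that raised the original error now passes. The sketch below mimics that check outside the training script (check_preformatted and load_json_dataset are illustrative helpers, not actual SpecForge functions):

```python
# Illustrative sketch: load a preformatted JSONL file directly and verify
# the column set, mirroring the is_preformatted=True validation.
import json
import os
import tempfile

def load_json_dataset(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

def check_preformatted(rows):
    columns = sorted({key for row in rows for key in row})
    if "text" not in columns:
        raise ValueError(
            "Expected 'text' column for is_preformatted=True, "
            f"but found columns: {columns}"
        )
    return columns

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"id": "0", "text": "Hello"}\n{"id": "1", "text": "World"}\n')
    path = f.name

cols = check_preformatted(load_json_dataset(path))
print(cols)  # ['id', 'text'] -- validation passes
os.remove(path)
```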

  2. Cache key improvement

The dataset preprocessing cache key now includes the is_preformatted flag to avoid collisions between:

  • conversation-format datasets
  • preformatted text datasets

Before:

cache_params_string = (
    f"{args.train_data_path}-"
    f"{args.max_length}-"
    f"{args.chat_template}-"
    f"{args.target_model_path}"
)

After:

cache_params_string = (
    f"{args.train_data_path}-"
    f"{args.max_length}-"
    f"{args.chat_template}-"
    f"{args.target_model_path}-"
    f"is_preformatted={args.is_preformatted}"
)

This ensures dataset preprocessing caches remain consistent across dataset formats.
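To see why the flag matters: two runs that differ only in --is-preformatted would otherwise produce identical cache keys and reuse each other's preprocessed data. A minimal sketch (the md5-based scheme is an assumption for illustration, not necessarily how SpecForge derives cache paths):

```python
# Illustration of cache-key collision avoidance. The hashing scheme is
# hypothetical; only the parameter string mirrors the PR's change.
import hashlib

def cache_key(train_data_path, max_length, chat_template,
              target_model_path, is_preformatted):
    params = (
        f"{train_data_path}-{max_length}-{chat_template}-"
        f"{target_model_path}-is_preformatted={is_preformatted}"
    )
    return hashlib.md5(params.encode()).hexdigest()

a = cache_key("data.jsonl", 2048, "llama3", "meta-llama/Llama-3-8B", True)
b = cache_key("data.jsonl", 2048, "llama3", "meta-llama/Llama-3-8B", False)
print(a != b)  # True: each dataset format gets its own cache entry
```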

Related Issues

No formal issue was opened.

This PR fixes a mismatch between the documented preformatted dataset workflow and the current dataset loading implementation.

Accuracy Test

No model architecture, kernels, or inference logic were modified.

This change only affects dataset loading and preprocessing, so model accuracy behavior remains unchanged.

Training was verified to run correctly with --is-preformatted datasets.

Benchmark & Profiling

No performance-critical code paths were modified.

The change affects only dataset loading before training begins and does not impact runtime performance of EAGLE3 training or inference.

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an incompatibility in the train_eagle3.py script that prevented the use of preformatted text datasets for training. By introducing conditional dataset loading and updating the cache key, the script now correctly processes datasets with the {"id": "...", "text": "..."} schema, aligning with SpecForge documentation and enabling successful training with these formats.

Highlights

  • Conditional Dataset Loading: Implemented conditional logic for loading datasets in train_eagle3.py. When the --is-preformatted flag is enabled, datasets are now loaded directly using HuggingFace's load_dataset("json") instead of the safe_conversations_generator. This change applies to both training and evaluation datasets.
  • Cache Key Improvement: The dataset preprocessing cache key now includes the is_preformatted flag. This prevents cache collisions between conversation-format datasets and preformatted text datasets, ensuring consistent caching across different dataset formats.


Changelog
  • scripts/train_eagle3.py
    • Imported load_dataset from the datasets library to enable direct loading of JSON files.
    • Modified the build_dataloaders function to conditionally load the training dataset: using load_dataset for preformatted data and Dataset.from_generator for conversation-based data.
    • Updated the cache key string to include the is_preformatted argument, ensuring unique cache entries for different dataset formats.
    • Applied the same conditional loading logic to the evaluation dataset, allowing it to correctly handle preformatted data when specified.
Activity
  • No specific activity has been recorded for this pull request yet, as it is a new submission.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly adds support for preformatted text datasets by conditionally using load_dataset instead of safe_conversations_generator. The change to include is_preformatted in the cache key is also a good improvement to prevent cache collisions. However, I've found a critical syntax error due to incorrect indentation that will prevent the code from running. I've also suggested a refactoring to reduce code duplication for better maintainability. Please see my detailed comments.
