
Fix: support preformatted text datasets in train_eagle3.py (avoid forcing conversations generator)#498

Open
Seun-Ajayi wants to merge 7 commits into sgl-project:main from Seun-Ajayi:main

Conversation


@Seun-Ajayi Seun-Ajayi commented Mar 9, 2026

Motivation

The SpecForge documentation describes training EAGLE3 draft models on preformatted text datasets with the schema:

{"id": "...", "text": "..."}

when the --is-preformatted flag is used.

However, train_eagle3.py currently loads every dataset through safe_conversations_generator, which assumes the dataset contains a conversations field. As a result, preformatted rows are rewritten to:

{"conversations": []}

which breaks the expected schema and causes training to fail with:

ValueError: Expected 'text' column for is_preformatted=True, but found columns: ['conversations']

This PR fixes the incompatibility so that preformatted datasets documented in SpecForge can be used directly for EAGLE3 training.
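The failure mode above can be reproduced in isolation. The generator below is a simplified stand-in for safe_conversations_generator (an assumption about its behavior for illustration, not the actual SpecForge code): any row lacking a conversations field collapses to an empty list, silently discarding the text column that the preformatted pipeline expects.

```python
# Minimal sketch of the schema mismatch. `safe_conversations_generator`
# here is a simplified stand-in, not the real SpecForge implementation.
import io
import json

def safe_conversations_generator(file_obj):
    for line in file_obj:
        row = json.loads(line)
        # Preformatted rows ({"id", "text"}) have no "conversations" key,
        # so they degrade to an empty conversation list.
        yield {"conversations": row.get("conversations", [])}

preformatted = io.StringIO('{"id": "0", "text": "Hello world"}\n')
rows = list(safe_conversations_generator(preformatted))
print(rows)  # [{'conversations': []}] -- the "text" column is lost
```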

Modifications

  1. Conditional dataset loading for --is-preformatted

When --is-preformatted is enabled, datasets are loaded directly via HuggingFace load_dataset("json") instead of the safe_conversations_generator.

Before:

train_dataset = Dataset.from_generator(
    generator=safe_conversations_generator,
    gen_kwargs={"file_path": args.train_data_path},
)

After:

if args.is_preformatted:
    train_dataset = load_dataset(
        "json",
        data_files=args.train_data_path,
        split="train",
    )
else:
    train_dataset = Dataset.from_generator(
        generator=safe_conversations_generator,
        gen_kwargs={"file_path": args.train_data_path},
    )

The same logic is applied to evaluation datasets.
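With direct JSON loading, the text column survives and the column check that raised the original error now passes. The sketch below mimics that check outside the training script (check_preformatted and load_json_dataset are illustrative helpers, not actual SpecForge functions):

```python
# Illustrative sketch: load a preformatted JSONL file directly and verify
# the column set, mirroring the is_preformatted=True validation.
import json
import os
import tempfile

def load_json_dataset(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

def check_preformatted(rows):
    columns = sorted({key for row in rows for key in row})
    if "text" not in columns:
        raise ValueError(
            "Expected 'text' column for is_preformatted=True, "
            f"but found columns: {columns}"
        )
    return columns

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"id": "0", "text": "Hello"}\n{"id": "1", "text": "World"}\n')
    path = f.name

cols = check_preformatted(load_json_dataset(path))
print(cols)  # ['id', 'text'] -- validation passes
os.remove(path)
```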

  2. Cache key improvement

The dataset preprocessing cache key now includes the is_preformatted flag to avoid collisions between:

  • conversation-format datasets
  • preformatted text datasets

Before:

cache_params_string = (
    f"{args.train_data_path}-"
    f"{args.max_length}-"
    f"{args.chat_template}-"
    f"{args.target_model_path}"
)

After:

cache_params_string = (
    f"{args.train_data_path}-"
    f"{args.max_length}-"
    f"{args.chat_template}-"
    f"{args.target_model_path}-"
    f"is_preformatted={args.is_preformatted}"
)

This ensures dataset preprocessing caches remain consistent across dataset formats.
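To see why the flag matters: two runs that differ only in --is-preformatted would otherwise produce identical cache keys and reuse each other's preprocessed data. A minimal sketch (the md5-based scheme is an assumption for illustration, not necessarily how SpecForge derives cache paths):

```python
# Illustration of cache-key collision avoidance. The hashing scheme is
# hypothetical; only the parameter string mirrors the PR's change.
import hashlib

def cache_key(train_data_path, max_length, chat_template,
              target_model_path, is_preformatted):
    params = (
        f"{train_data_path}-{max_length}-{chat_template}-"
        f"{target_model_path}-is_preformatted={is_preformatted}"
    )
    return hashlib.md5(params.encode()).hexdigest()

a = cache_key("data.jsonl", 2048, "llama3", "meta-llama/Llama-3-8B", True)
b = cache_key("data.jsonl", 2048, "llama3", "meta-llama/Llama-3-8B", False)
print(a != b)  # True: each dataset format gets its own cache entry
```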

Related Issues

No formal issue was opened.

This PR fixes a mismatch between the documented preformatted dataset workflow and the current dataset loading implementation.

Accuracy Test

No model architecture, kernels, or inference logic were modified.

This change only affects dataset loading and preprocessing, so model accuracy behavior remains unchanged.

Training was verified to run correctly with --is-preformatted datasets.

Benchmark & Profiling

No performance-critical code paths were modified.

The change affects only dataset loading before training begins and does not impact runtime performance of EAGLE3 training or inference.

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an incompatibility in the train_eagle3.py script that prevented the use of preformatted text datasets for training. By introducing conditional dataset loading and updating the cache key, the script now correctly processes datasets with the {"id": "...", "text": "..."} schema, aligning with SpecForge documentation and enabling successful training with these formats.

Highlights

  • Conditional Dataset Loading: Implemented conditional logic for loading datasets in train_eagle3.py. When the --is-preformatted flag is enabled, datasets are now loaded directly using HuggingFace's load_dataset("json") instead of the safe_conversations_generator. This change applies to both training and evaluation datasets.
  • Cache Key Improvement: The dataset preprocessing cache key now includes the is_preformatted flag. This prevents cache collisions between conversation-format datasets and preformatted text datasets, ensuring consistent caching across different dataset formats.


Changelog
  • scripts/train_eagle3.py
    • Imported load_dataset from the datasets library to enable direct loading of JSON files.
    • Modified the build_dataloaders function to conditionally load the training dataset: using load_dataset for preformatted data and Dataset.from_generator for conversation-based data.
    • Updated the cache key string to include the is_preformatted argument, ensuring unique cache entries for different dataset formats.
    • Applied the same conditional loading logic to the evaluation dataset, allowing it to correctly handle preformatted data when specified.
Activity
  • No specific activity has been recorded for this pull request yet, as it is a new submission.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly adds support for preformatted text datasets by conditionally using load_dataset instead of safe_conversations_generator. The change to include is_preformatted in the cache key is also a good improvement to prevent cache collisions. However, I've found a critical syntax error due to incorrect indentation that will prevent the code from running. I've also suggested a refactoring to reduce code duplication for better maintainability. Please see my detailed comments.
