Fix: support preformatted text datasets in train_eagle3.py (avoid forcing conversations generator)#498
Fix: support preformatted text datasets in train_eagle3.py (avoid forcing conversations generator)#498Seun-Ajayi wants to merge 7 commits intosgl-project:mainfrom
Conversation
…orcing conversations generator)
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves an incompatibility in the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly adds support for preformatted text datasets by conditionally using load_dataset instead of safe_conversations_generator. The change to include is_preformatted in the cache key is also a good improvement to prevent cache collisions. However, I've found a critical syntax error due to incorrect indentation that will prevent the code from running. I've also suggested a refactoring to reduce code duplication for better maintainability. Please see my detailed comments.
Motivation
SpecForge documentation supports training EAGLE3 draft models using preformatted text datasets with the schema:
{"id": "...", "text": "..."}when the
--is-preformattedflag is used.However,
train_eagle3.pycurrently loads all datasets throughsafe_conversations_generator, which assumes the dataset contains aconversationsfield. As a result, preformatted datasets are rewritten to:{"conversations": []}which breaks the expected schema and causes training to fail with:
This PR fixes the incompatibility so that preformatted datasets documented in SpecForge can be used directly for EAGLE3 training.
Motivation
-is-preformattedWhen
--is-preformattedis enabled, datasets are loaded directly via HuggingFaceload_dataset("json")instead of thesafe_conversations_generator.Before:
After:
The same logic is applied to evaluation datasets.
The dataset preprocessing cache key now includes the
is_preformattedflag to avoid collisions between:Before:
After:
This ensures dataset preprocessing caches remain consistent across dataset formats.
Related Issues
No formal issue was opened.
This PR fixes a mismatch between the documented preformatted dataset workflow and the current dataset loading implementation.
Accuracy Test
No model architecture, kernels, or inference logic were modified.
This change only affects dataset loading and preprocessing, so model accuracy behavior remains unchanged.
Training was verified to run correctly with
--is-preformatteddatasets.Benchmark & Profiling
No performance-critical code paths were modified.
The change affects only dataset loading before training begins and does not impact runtime performance of EAGLE3 training or inference.
Checklist