A web-based tool for creating, merging, and augmenting datasets for large language models (LLMs).
Features:
- Manual dataset creation with an intuitive UI
- Dataset merging from multiple sources
- Language model API integration for dataset augmentation
- Localization support for multi-language use
- JSON export and import capabilities in Alpaca or NeuralWeights format
- User-friendly interface for easy data management

Requirements:
- Python 3.x
- Flask framework (version 3.1.0)
- Requests library (version 2.32.3)
- Web browser with ES6+ support (for the UI)

Installation:
- Clone this repository:
  git clone https://github.com/neuralweights/llm-dataset-designer.git
- Navigate to the project directory:
  cd llm-dataset-designer
- Create a virtual environment (optional):
  - On Windows:
    python -m venv venv
  - On macOS/Linux:
    python3 -m venv venv
- Install the required Python dependencies:
  pip install flask==3.1.0 requests==2.32.3
- Run the application:
  python app.py
- Open your web browser and visit http://localhost:8080 to access the LLM Dataset Designer UI.

Usage:
- Manual Dataset Creation (Create Mode)
  - Enter instructions, system prompts, and responses for each dataset sample
  - Add or update records as needed
  - Export the created dataset in either Alpaca or NeuralWeights JSON format
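
Create Mode can export either layout. Judging from the sample records in this README, the two layouts appear to correspond field for field: Alpaca `input` to NeuralWeights `instruction`, Alpaca `instruction` to `system_prompt`, and `output` to `response`. The sketch below illustrates that mapping under this assumption; the function names are hypothetical and this is not the application's own export code.

```python
from typing import Dict, List

# Field correspondence inferred from the sample records in this README
# (an assumption, not taken from the application's source code).
ALPACA_TO_NW = {
    "input": "instruction",
    "instruction": "system_prompt",
    "output": "response",
}


def alpaca_to_neuralweights(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Turn a list of Alpaca records into NeuralWeights-style parallel lists."""
    columns: Dict[str, List[str]] = {nw_key: [] for nw_key in ALPACA_TO_NW.values()}
    for record in records:
        for alpaca_key, nw_key in ALPACA_TO_NW.items():
            # Missing fields become empty strings so the lists stay aligned.
            columns[nw_key].append(record.get(alpaca_key, ""))
    return columns


def neuralweights_to_alpaca(columns: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Turn NeuralWeights-style parallel lists back into Alpaca records."""
    lengths = {len(columns[nw_key]) for nw_key in ALPACA_TO_NW.values()}
    if len(lengths) > 1:
        raise ValueError("all NeuralWeights lists must have the same length")
    count = lengths.pop()
    return [
        {alpaca_key: columns[nw_key][i] for alpaca_key, nw_key in ALPACA_TO_NW.items()}
        for i in range(count)
    ]
```

Converting to the parallel-list layout and back should return the original records, which makes a quick sanity check for an exported file.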
- Dataset Merging (Merge Mode)
  - Upload multiple datasets from different sources
  - Merge the uploaded datasets with the current dataset into a single unified dataset
  - Optionally apply optimization techniques to the merged dataset
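
As a rough illustration of what merging involves, the sketch below concatenates the current dataset with several uploaded ones, all assumed to be lists of Alpaca-style records. This README does not say which optimization techniques the application applies; dropping exact duplicate records is used here only as one plausible example, and `merge_datasets` is a hypothetical name.

```python
import json
from typing import Dict, List


def merge_datasets(
    current: List[Dict[str, str]],
    uploaded: List[List[Dict[str, str]]],
    deduplicate: bool = True,
) -> List[Dict[str, str]]:
    """Concatenate datasets in order, optionally dropping exact duplicates."""
    merged: List[Dict[str, str]] = []
    seen = set()
    for dataset in [current, *uploaded]:
        for record in dataset:
            # A canonical JSON string serves as a hashable identity for the record.
            key = json.dumps(record, sort_keys=True)
            if deduplicate and key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged
```

Records keep the order in which they first appear, so entries from the current dataset come before those from the uploads.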
- Dataset Augmentation (Augment Mode)
  - Specify parameters for the Completions API requests (OpenAI-compatible)
  - Generate or modify dataset samples using an external language model API
  - Update existing dataset entries with newly generated responses
  - Works well with META STE model
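
For orientation, the sketch below shows what a request to an OpenAI-compatible Completions endpoint typically looks like and how the generated text is read from the response. The `/v1/completions` path, the `model`, `prompt`, `max_tokens`, and `temperature` fields, and the `choices[0].text` response field follow the OpenAI Completions API; the function names, default values, and base URL are assumptions for illustration, not the application's actual code.

```python
from typing import Any, Dict, Tuple


def build_completion_request(
    base_url: str,
    model: str,
    prompt: str,
    api_key: str = "",
    max_tokens: int = 256,
    temperature: float = 0.7,
) -> Tuple[str, Dict[str, str], Dict[str, Any]]:
    """Return (url, headers, payload) for an OpenAI-compatible completion call."""
    url = base_url.rstrip("/") + "/v1/completions"
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Local servers often need no key; hosted ones expect a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return url, headers, payload


def extract_completion_text(response_json: Dict[str, Any]) -> str:
    """Read the generated text from a Completions-style response body."""
    return response_json["choices"][0]["text"]
```

With the Requests library listed in the requirements, the call itself could be made as `requests.post(url, json=payload, headers=headers, timeout=60).json()`, and the result passed to `extract_completion_text` to obtain the new response for a dataset entry.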
This application supports two primary dataset structures:
- Alpaca Dataset:
  [
    {
      "input": "What is NeuralWeights llm dataset designer?",
      "instruction": "Give detailed, yet concise answer in simple language.",
      "output": "**NeuralWeights LLM Dataset Designer** is a powerful tool for creating customized datasets tailored to **fine-tune** Large Language Models. It offers an intuitive interface and advanced features to efficiently generate high-quality, domain-specific data."
    },
    {
      "input": "How can NeuralWeights llm dataset designer help in fine-tuning?",
      "instruction": "Provide a response in the form of a quatrain.",
      "output": "NeuralWeights Dataset Designer,\nA tool to shape models with care,\nCustomize data, make it fine and rare,\nFor large language tuning, beyond compare!"
    },
    ...
  ]
- NeuralWeights Dataset:
  {
    "instruction": [
      "What is NeuralWeights llm dataset designer?",
      "How can NeuralWeights llm dataset designer help in fine-tuning?",
      ...
    ],
    "system_prompt": [
      "Give detailed, yet concise answer in simple language.",
      "Provide a response in the form of a quatrain.",
      ...
    ],
    "response": [
      "**NeuralWeights LLM Dataset Designer** is a powerful tool for creating customized datasets tailored to **fine-tune** Large Language Models. It offers an intuitive interface and advanced features to efficiently generate high-quality, domain-specific data.",
      "NeuralWeights Dataset Designer,\nA tool to shape models with care,\nCustomize data, make it fine and rare,\nFor large language tuning, beyond compare!",
      ...
    ]
  }

Contributions to this project are welcome and appreciated! If you have suggestions, bug fixes, or new features to propose, please fork the repository and submit a pull request.
- Fork the repository and create a new branch:
  git checkout -b feature/new-feature
- Implement your changes or bug fixes.
- Commit your changes with clear commit messages.
- Push the changes to your forked repository.
- Create a pull request against the main develop branch of this repository.
Please ensure that you test your changes thoroughly and adhere to best practices for coding style and documentation.
The following features are planned for future development:
- Expanding Augmentation Features:
  - Add token-streaming support for OpenAI-compatible API endpoints
  - Implement more modes of augmentation
- Improved Dataset Compatibility:
  - Enhance dataset merging algorithms to handle diverse dataset structures
- User Interface Enhancements:
  - Improve accessibility and usability
  - Provide more customization options for user preferences
Please note that the roadmap is subject to change based on future development priorities and community feedback.
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). For more information, see the LICENSE file or visit https://creativecommons.org/licenses/by-nc/4.0/.
- Creator: Alexander Nester
- Email: neuralweights@proton.me (for commercial licensing inquiries)
- Website: https://neuralweights.com (currently under active development)
- Support: https://patreon.com/NeuralWeights
