LLM Dataset Designer

English | 中文简体

A web-based tool for creating, merging, and augmenting datasets for large language models (LLMs).

Features

Manual dataset creation with intuitive UI
Dataset merging from multiple sources
Language model API integration for dataset augmentation
Localization support for multi-language use
JSON export and import capabilities in Alpaca or NeuralWeights format
User-friendly interface for easy data management

Getting Started

Prerequisites

Python 3.9 or newer (required by Flask 3.1.0)
Flask framework (version 3.1.0)
Requests library (version 2.32.3)
Web browser with ES6+ support (for the UI)

Installation

Clone this repository:

```
git clone https://github.com/neuralweights/llm-dataset-designer.git
```

Navigate to the project directory:
```
cd llm-dataset-designer
```
Create and activate a virtual environment (optional):
- On Windows:
```
python -m venv venv
venv\Scripts\activate
```
- On macOS/Linux:
```
python3 -m venv venv
source venv/bin/activate
```

Install the required Python dependencies:

```
pip install flask==3.1.0 requests==2.32.3
```

Run the application:
```
python app.py
```
Open your web browser and visit http://localhost:8080 to access the LLM Dataset Designer UI.

Usage

Manual Dataset Creation (Create Mode)
- Enter instructions, system prompts, and responses for each dataset sample
- Add or update records as needed
- Export the created dataset in either Alpaca or NeuralWeights JSON format
Dataset Merging (Merge Mode)
- Upload multiple datasets from different sources
- Merge the uploaded datasets with the current dataset into a single unified dataset
- Optionally apply optimization techniques to the merged dataset
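
The merge step above can be pictured roughly as follows. This is a minimal sketch, not the application's actual code: the function name `merge_datasets` is hypothetical, the records are assumed to be in the Alpaca layout, and exact-duplicate removal is only an assumed stand-in for the optional optimization step.

```python
import json


def merge_datasets(current, uploaded):
    """Merge Alpaca-style record lists, dropping exact duplicates.

    De-duplication here is an assumption standing in for the tool's
    optional optimization step; the real behavior may differ.
    """
    merged = []
    seen = set()
    for dataset in [current, *uploaded]:
        for record in dataset:
            # Identify a record by its three text fields; missing keys count as "".
            key = (
                record.get("input", ""),
                record.get("instruction", ""),
                record.get("output", ""),
            )
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged


current = [{"input": "Q1", "instruction": "Be brief.", "output": "A1"}]
uploaded = [
    [{"input": "Q1", "instruction": "Be brief.", "output": "A1"}],  # duplicate of current
    [{"input": "Q2", "instruction": "Be brief.", "output": "A2"}],
]
merged = merge_datasets(current, uploaded)
print(json.dumps(merged, indent=4))
```

Records keep the order in which they are first seen, so entries from the current dataset come before uploaded ones.
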
Dataset Augmentation (Augment Mode)
- Specify parameters for the Completions API requests (OpenAI-compatible)
- Generate or modify dataset samples using an external language model API
- Update existing dataset entries with newly generated responses
- Works well with the META STE model
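
For orientation, an augmentation round trip against an OpenAI-compatible Completions endpoint might look like the sketch below. Everything named here is an assumption for illustration: the helper names, the way the system prompt and instruction are joined into one prompt, the default parameters, and the placeholder URL, key, and model. The sketch uses the standard library's `urllib` only to stay dependency-free; the application itself lists Requests as a dependency.

```python
import json
import urllib.request


def build_completion_payload(instruction, system_prompt, model,
                             max_tokens=256, temperature=0.7):
    """Build a JSON body for an OpenAI-compatible /v1/completions request."""
    # Assumption: the system prompt and instruction are joined into one prompt.
    prompt = f"{system_prompt}\n\n{instruction}".strip()
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def request_completion(base_url, api_key, payload, timeout=60):
    """POST the payload and return the text of the first choice."""
    request = urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


def augment_record(record, generated_text):
    """Return a copy of an Alpaca-style record with its output replaced."""
    updated = dict(record)
    updated["output"] = generated_text.strip()
    return updated


record = {
    "input": "What is fine-tuning?",
    "instruction": "Answer in one sentence.",
    "output": "",
}
payload = build_completion_payload(
    record["input"], record["instruction"], model="my-model"
)
# With a real endpoint and key (both placeholders here), the update would be:
# text = request_completion("https://api.example.com", "sk-placeholder", payload)
# record = augment_record(record, text)
```

Building the payload separately from sending it makes the request parameters easy to inspect before any network call is made.
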

Compatibility

This application supports two primary dataset structures:

Alpaca Dataset:

```
[
    {
        "input": "What is NeuralWeights llm dataset designer?",
        "instruction": "Give detailed, yet concise answer in simple language.",
        "output": "**NeuralWeights LLM Dataset Designer** is a powerful tool for creating customized datasets tailored to **fine-tune** Large Language Models. It offers an intuitive interface and advanced features to efficiently generate high-quality, domain-specific data."
    },
    {
        "input": "How can NeuralWeights llm dataset designer help in fine-tuning?",
        "instruction": "Provide a response in the form of a quatrain.",
        "output": "NeuralWeights Dataset Designer,\nA tool to shape models with care,\nCustomize data, make it fine and rare,\nFor large language tuning, beyond compare!"
    }, ...
]
```
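
Before importing a file in this layout, it can help to check its shape. The sketch below is illustrative only: `validate_alpaca` is a hypothetical helper, and the rule that every record must carry all three keys as strings is an assumption drawn from the example above, not a documented requirement of the tool.

```python
ALPACA_KEYS = ("input", "instruction", "output")


def validate_alpaca(records):
    """Return a list of problems; an empty list means the data looks valid."""
    if not isinstance(records, list):
        return ["top-level value must be a list of records"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: expected an object")
            continue
        for key in ALPACA_KEYS:
            # Assumption: all three fields are required and must be strings.
            if not isinstance(record.get(key), str):
                problems.append(f"record {index}: '{key}' must be a string")
    return problems


good = [{"input": "Q", "instruction": "Be brief.", "output": "A"}]
bad = [{"input": "Q", "output": 42}]
print(validate_alpaca(good))
print(validate_alpaca(bad))
```
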

NeuralWeights Dataset:

```
{
    "instruction": [
        "What is NeuralWeights llm dataset designer?",
        "How can NeuralWeights llm dataset designer help in fine-tuning?",
        ...
    ],
    "system_prompt": [
        "Give detailed, yet concise answer in simple language.",
        "Provide a response in the form of a quatrain.",
        ...
    ],
    "response": [
        "**NeuralWeights LLM Dataset Designer** is a powerful tool for creating customized datasets tailored to **fine-tune** Large Language Models. It offers an intuitive interface and advanced features to efficiently generate high-quality, domain-specific data.",
        "NeuralWeights Dataset Designer,\nA tool to shape models with care,\nCustomize data, make it fine and rare,\nFor large language tuning, beyond compare!",
        ...
    ]
}
```
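
Judging by the two examples, the formats carry the same data: Alpaca `input` corresponds to NeuralWeights `instruction`, Alpaca `instruction` to `system_prompt`, and `output` to `response`. A conversion between them could be sketched as below. The function names are hypothetical, the field mapping is inferred from the examples rather than taken from the application's code, and the equal-length check is an added assumption.

```python
def alpaca_to_neuralweights(records):
    """Convert a list of Alpaca records into the column-oriented layout."""
    return {
        "instruction": [r["input"] for r in records],
        "system_prompt": [r["instruction"] for r in records],
        "response": [r["output"] for r in records],
    }


def neuralweights_to_alpaca(columns):
    """Convert the column-oriented layout back into a list of Alpaca records."""
    keys = ("instruction", "system_prompt", "response")
    # Assumption: the three columns are parallel lists of equal length.
    if len({len(columns[k]) for k in keys}) != 1:
        raise ValueError("all three columns must have the same length")
    return [
        {"input": question, "instruction": system, "output": answer}
        for question, system, answer in zip(*(columns[k] for k in keys))
    ]


alpaca = [
    {"input": "Q1", "instruction": "Be brief.", "output": "A1"},
    {"input": "Q2", "instruction": "Use a quatrain.", "output": "A2"},
]
columns = alpaca_to_neuralweights(alpaca)
round_trip = neuralweights_to_alpaca(columns)
print(columns["system_prompt"])
```
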

Contributions

Contributions to this project are welcome and appreciated! If you have suggestions, bug fixes, or new features to propose, please fork the repository and submit a pull request.

Fork the repository and create a new branch:
```
git checkout -b feature/new-feature
```
Implement your changes or bug fixes.
Commit your changes with clear commit messages.
Push the changes to your forked repository.
Create a pull request against the main develop branch of this repository.

Please ensure that you test your changes thoroughly and adhere to best practices for coding style and documentation.

Roadmap

The following features are planned for future development:

Expanding Augmentation Features:
- Add token-streaming support from OpenAI-compatible API endpoint
- Implement more modes of augmentation
Improved Dataset Compatibility:
- Enhance dataset merging algorithms to handle diverse dataset structures
User Interface Enhancements:
- Improve accessibility and usability
- Provide more customization options for user preferences

Please note that the roadmap is subject to change based on future development priorities and community feedback.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). For more information, see the LICENSE file or visit https://creativecommons.org/licenses/by-nc/4.0/.

Contact

Creator: Alexander Nester
Email: neuralweights@proton.me (for commercial licensing inquiries)
Website: https://neuralweights.com (currently under active development)
Support: https://patreon.com/NeuralWeights

