English | 中文简体

LLM Dataset Designer

A web-based tool for creating, merging, and augmenting datasets for large language models (LLMs).

Demo

Features

  • Manual dataset creation with intuitive UI
  • Dataset merging from multiple sources
  • Language model API integration for dataset augmentation
  • Localization support for multi-language use
  • JSON import and export in Alpaca or NeuralWeights format
  • User-friendly interface for easy data management

Getting Started

Prerequisites

  • Python 3.x
  • Flask framework (version 3.1.0)
  • Requests library (version 2.32.3)
  • Web browser with ES6+ support (for the UI)

Installation

  1. Clone this repository:

    git clone https://github.com/neuralweights/llm-dataset-designer.git
    
  2. Navigate to the project directory:

    cd llm-dataset-designer
    
  3. Create and activate a virtual environment (optional but recommended):

    • On Windows:
      python -m venv venv
      venv\Scripts\activate

    • On macOS/Linux:
      python3 -m venv venv
      source venv/bin/activate

  4. Install the required Python dependencies:

    pip install flask==3.1.0 requests==2.32.3
  5. Run the application:

    python app.py
    
  6. Open your web browser and visit http://localhost:8080 to access the LLM Dataset Designer UI.

Usage

  1. Manual Dataset Creation (Create Mode)

    • Enter instructions, system prompts, and responses for each dataset sample
    • Add or update records as needed
    • Export the created dataset in either Alpaca or NeuralWeights JSON format
  2. Dataset Merging (Merge Mode)

    • Upload multiple datasets from different sources
    • Merge the uploaded datasets with the current dataset into a single unified dataset
    • Optionally apply optimization techniques to the merged dataset
  3. Dataset Augmentation (Augment Mode)

    • Specify parameters for the Completions API requests (OpenAI-compatible)
    • Generate or modify dataset samples using an external language model API
    • Update existing dataset entries with newly generated responses
    • Works well with the META STE model
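The merge step above can be sketched in Python. This is a minimal illustration, assuming duplicates are identified by identical (instruction, input, output) triples in Alpaca-format records; the application's actual merge and optimization logic may differ.

```python
def merge_alpaca(*datasets):
    """Merge any number of Alpaca-format record lists, dropping exact duplicates.

    Each dataset is a list of dicts with "instruction", "input", and
    "output" keys; the first occurrence of a record wins.
    """
    seen = set()
    merged = []
    for dataset in datasets:
        for record in dataset:
            key = (record.get("instruction"), record.get("input"), record.get("output"))
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged
```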
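The augmentation step boils down to posting one record at a time to an OpenAI-compatible completions endpoint. The sketch below is a hypothetical illustration of that request shape, not this project's implementation: the base URL, model name, and API key are placeholders you would supply yourself.

```python
import requests


def build_payload(system_prompt, instruction, model="my-model", temperature=0.7):
    """Build an OpenAI-compatible chat-completions payload from one dataset record."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": instruction},
        ],
    }


def augment_sample(base_url, api_key, record):
    """POST one record to an OpenAI-compatible endpoint and return the generated text.

    `record` is assumed to have "system_prompt" and "instruction" keys,
    matching the NeuralWeights field names used elsewhere in this README.
    """
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_payload(record["system_prompt"], record["instruction"]),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```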

Compatibility

This application supports two primary dataset structures:

  1. Alpaca Dataset:
[
    {
        "input": "What is NeuralWeights llm dataset designer?",
        "instruction": "Give detailed, yet concise answer in simple language.",
        "output": "**NeuralWeights LLM Dataset Designer** is a powerful tool for creating customized datasets tailored to **fine-tune** Large Language Models. It offers an intuitive interface and advanced features to efficiently generate high-quality, domain-specific data."
    },
    {
        "input": "How can NeuralWeights llm dataset designer help in fine-tuning?",
        "instruction": "Provide a response in the form of a quatrain.",
        "output": "NeuralWeights Dataset Designer,\nA tool to shape models with care,\nCustomize data, make it fine and rare,\nFor large language tuning, beyond compare!"
    }, ...
]
  2. NeuralWeights Dataset:
{
    "instruction": [
        "What is NeuralWeights llm dataset designer?",
        "How can NeuralWeights llm dataset designer help in fine-tuning?",
        ...
    ],
    "system_prompt": [
        "Give detailed, yet concise answer in simple language.",
        "Provide a response in the form of a quatrain.",
        ...
    ],
    "response": [
        "**NeuralWeights LLM Dataset Designer** is a powerful tool for creating customized datasets tailored to **fine-tune** Large Language Models. It offers an intuitive interface and advanced features to efficiently generate high-quality, domain-specific data.",
        "NeuralWeights Dataset Designer,\nA tool to shape models with care,\nCustomize data, make it fine and rare,\nFor large language tuning, beyond compare!",
        ...
    ]
}
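The two structures above carry the same fields under different names: Alpaca "input" corresponds to NeuralWeights "instruction", Alpaca "instruction" to "system_prompt", and Alpaca "output" to "response". A conversion can be sketched as follows; this is inferred only from the field correspondence visible in the examples, and the tool's own converter may behave differently.

```python
def alpaca_to_neuralweights(records):
    """Convert a list of Alpaca records to the columnar NeuralWeights structure."""
    return {
        "instruction": [r["input"] for r in records],
        "system_prompt": [r["instruction"] for r in records],
        "response": [r["output"] for r in records],
    }


def neuralweights_to_alpaca(data):
    """Convert a NeuralWeights structure back to a list of Alpaca records."""
    return [
        {"input": i, "instruction": s, "output": o}
        for i, s, o in zip(data["instruction"], data["system_prompt"], data["response"])
    ]
```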

Contributions

Contributions to this project are welcome and appreciated! If you have suggestions, bug fixes, or new features to propose, please fork the repository and submit a pull request.

  • Fork the repository and create a new branch:

    git checkout -b feature/new-feature
  • Implement your changes or bug fixes.

  • Commit your changes with clear commit messages.

  • Push the changes to your forked repository.

  • Create a pull request against the main development branch of this repository.

Please ensure that you test your changes thoroughly and adhere to best practices for coding style and documentation.


Roadmap

The following features are planned for future development:

  1. Expanding Augmentation Features:

    • Add token-streaming support from OpenAI-compatible API endpoint
    • Implement more modes of augmentation
  2. Improved Dataset Compatibility:

    • Enhance dataset merging algorithms to handle diverse dataset structures
  3. User Interface Enhancements:

    • Improve accessibility and usability
    • Provide more customization options for user preferences

Please note that the roadmap is subject to change based on future development priorities and community feedback.


License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). For more information, see the LICENSE file or visit https://creativecommons.org/licenses/by-nc/4.0/.
