Build large-scale, high-quality training datasets for AI models from scratch with minimal system overhead
| Section | Link |
|---|---|
| Overview | Jump to Overview |
| Project Architecture | View Architecture |
| Module Specifications | Explore Modules |
| Dataset Categories | Browse Categories |
| System Requirements | Check Requirements |
| Installation & Setup | Get Started |
| Quick Install Methods (Windows) | Install Options |
| Usage Instructions | Learn How to Use |
| Dataset Output Structure | See Output Format |
| Performance & Benchmarks | View Performance |
| Data Quality Assurance | Quality Metrics |
| Troubleshooting | Fix Issues |
| FAQ | Get Answers |
| Contributing | Contribute |
| License & Citation | License Info |
Run this project file to start generation:
python run_all.py
Use this first. If other methods fail, come back to this method.
git clone https://github.com/asaad123sh/DatasetMaker.git
cd DatasetMaker
python --version
python run_all.py
- Open VS Code.
- Click File > Open Folder and select `DatasetMaker`.
- Open the built-in terminal: Terminal > New Terminal.
- Make sure the terminal path is inside the project folder.
- Run:
python --version
python run_all.py
- Download the project ZIP from GitHub.
- Extract it to a folder (for example: `D:\DatasetMaker`).
- Open that folder in File Explorer.
- In the address bar, type `powershell` and press Enter.
- Run:
python --version
python run_all.py
If `python --version` does not work, Python is not added to PATH.
- Download Python from https://www.python.org/downloads/windows/.
- Start the installer.
- On the first installer screen, check:
Add python.exe to PATH
- Click Install Now.
- Close and reopen the terminal.
- Verify:
python --version
- Find your Python install path (example):
C:\Users\<YourUser>\AppData\Local\Programs\Python\Python312\
- Copy these two paths:
  - Python folder path
  - Scripts folder path (same path + `\Scripts`)
- Open Windows Start and search: `Edit the system environment variables`.
- Click Environment Variables...
- Under User variables (or System variables), select `Path` and click Edit.
- Click New and add each path as its own entry:
C:\Users\<YourUser>\AppData\Local\Programs\Python\Python312\
C:\Users\<YourUser>\AppData\Local\Programs\Python\Python312\Scripts\
- Click OK on all windows.
- Restart the terminal.
- Verify:
python --version
Use only if you understand environment variable risks.
$p = Get-ChildItem "$env:LOCALAPPDATA\Programs\Python" -Directory -ErrorAction SilentlyContinue | Sort-Object Name -Descending | Select-Object -First 1
if ($p) {
  # Read the existing user-level Path so the system Path is not copied into it
  $userPath = [Environment]::GetEnvironmentVariable("Path", "User")
  [Environment]::SetEnvironmentVariable("Path", $userPath + ";" + $p.FullName + ";" + (Join-Path $p.FullName "Scripts"), "User")
  Write-Host "Python paths added. Restart terminal and run: python --version"
} else {
  Write-Host "Python folder not found. Install Python first."
}
for /d %D in ("%LocalAppData%\Programs\Python\Python*") do set PYDIR=%D
if defined PYDIR (
  setx PATH "%PATH%;%PYDIR%;%PYDIR%\Scripts"
  echo Python paths added. Restart terminal and run: python --version
) else (
  echo Python folder not found. Install Python first.
)
The AI Dataset Generator is a Python toolkit for creating large-scale, diverse datasets for training artificial intelligence models from scratch. It uses a modular architecture to produce 93M+ uniquely identified samples across 23 specialized categories, without external dependencies or cloud infrastructure.
- Modular Architecture: Each data category is independently generated and can be used separately or combined
- High Performance: Generates tens of millions of samples in 2–3 hours
- Storage Efficient: Compressed JSON output with unique identifiers
- Zero External Dependencies: Pure Python implementation using only standard library modules
- Highly Customizable: Easy to modify generation logic for domain-specific data
- Production-Ready: Structured output suitable for PyTorch, TensorFlow, and other ML frameworks
The generator follows a modular pipeline architecture:
graph TD
A["run_all.py<br/>Entry Point<br/>Orchestrates Pipeline"] --> B["qa_gen.py<br/>Q&A Generator"]
A --> C["chat_gen.py<br/>Chat Generator"]
A --> D["coding_gen.py<br/>Coding Generator"]
A --> E["math_gen.py<br/>Math Generator"]
A --> F["advanced_categories_gen.py<br/>23 Advanced Categories"]
B --> G["engine.py<br/>Core Processing<br/>UUID Generation<br/>File Writing<br/>Data Formatting"]
C --> G
D --> G
E --> G
F --> G
G --> H["Output Directory<br/>2,300 JSONL Files<br/>93M+ Samples"]
style A fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold
style B fill:#2196F3,stroke:#1565C0,color:#fff
style C fill:#2196F3,stroke:#1565C0,color:#fff
style D fill:#2196F3,stroke:#1565C0,color:#fff
style E fill:#2196F3,stroke:#1565C0,color:#fff
style F fill:#2196F3,stroke:#1565C0,color:#fff
style G fill:#FF9800,stroke:#E65100,color:#fff,font-weight:bold
style H fill:#9C27B0,stroke:#6A1B9A,color:#fff,font-weight:bold
Purpose: Primary entry point that coordinates the entire dataset generation workflow.
Responsibilities:
- Initializes the generation pipeline
- Manages category selection and generation parameters
- Controls batch processing and file handling
- Integrates all generator modules
- Handles optional modules (e.g., `safety_gen.py` if available)
- Handles warnings and errors
Key Functions:
- Loads configuration parameters
- Calls all category-specific generators in sequence
- Manages output directory structure
- Provides progress tracking and logging
Usage:
python run_all.py
Purpose: Low-level data management and file I/O operations.
Responsibilities:
- Unique ID Generation: Uses UUID4 to ensure global uniqueness of samples
- File Management: Handles reading/writing JSON files with proper formatting
- Data Structuring: Ensures consistent schema across all generated samples
- Error Handling: Manages file system errors and data validation
- Batch Processing: Optimizes memory usage for large datasets
Key Functions:
- `write_category_files()`: Writes structured data to JSON with proper formatting
- UUID-based sample identification
- Directory creation and management
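The ID generation, directory handling, and file writing described above could look roughly like the sketch below. It is not the project's `engine.py`: `write_jsonl` is a hypothetical helper written for illustration, and the field names are taken from the Data Schema example in this section.

```python
import json
import tempfile
import uuid
from pathlib import Path


def write_jsonl(path, pairs, category):
    """Write (user, assistant) pairs to `path`, one JSON object per line.

    Hypothetical helper; the project's write_category_files() may differ.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the category folder if needed
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for user, assistant in pairs:
            record = {
                "id": str(uuid.uuid4()),  # random UUID4 per sample
                "category": category,
                "user": user,
                "assistant": assistant,
                "metadata": {},
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "qa" / "qa_001.jsonl"
    written = write_jsonl(out, [("What is 2 + 2?", "2 + 2 equals 4.")], "qa")
    first = json.loads(out.read_text(encoding="utf-8").splitlines()[0])
```

Writing one record at a time keeps memory use flat regardless of how many samples a file holds.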
Data Schema (Example):
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"category": "qa",
"user": "question text",
"assistant": "answer text",
"metadata": {}
}
Purpose: Produces high-quality question-answer pairs for general knowledge and trivia.
Generated Content:
- Factual Q&A pairs covering science, history, geography, and general knowledge
- Educational content designed for basic to intermediate learning levels
- Diverse topic coverage to ensure dataset richness
Characteristics:
- Factually accurate responses
- Clear, concise answer formatting
- Structured question-answer pairs
- Suitable for knowledge-based models
Sample Output:
{
  "user": "What gas do plants absorb from the atmosphere?",
  "assistant": "Plants absorb carbon dioxide from the atmosphere during photosynthesis."
}
Purpose: Creates realistic human-like conversational exchanges for dialogue-based AI training.
Generated Content:
- Multi-turn conversations simulating natural chat interactions
- Contextual exchanges with proper conversation flow
- User-assistant dialogue patterns mimicking real communication
- Emotional and contextual awareness in responses
Characteristics:
- Natural language patterns
- Contextual coherence across turns
- Diverse conversation topics
- Realistic user queries and responses
Use Cases:
- Chatbot training
- Conversational AI models
- Dialog systems
- Virtual assistant development
Purpose: Generates coding problems, solutions, and explanations for AI model training on software engineering tasks.
Generated Content:
- Algorithm problems with varying difficulty levels
- Code snippets in popular programming languages
- Explanations of programming concepts and solutions
- Best practices and optimization techniques
Characteristics:
- Syntactically correct code examples
- Clear problem statements
- Step-by-step solutions
- Multiple programming paradigms
Sample Output:
{
  "user": "Write a function to find the maximum element in an array",
  "assistant": "def find_max(arr):\n return max(arr)\n\n# This uses Python's built-in max() function for optimal performance."
}
Purpose: Creates mathematically sound problems with detailed step-by-step solutions.
Generated Content:
- Arithmetic operations (addition, subtraction, multiplication, division)
- Algebraic equations and their solutions
- Step-by-step explanations of mathematical processes
- Various difficulty levels from basic to advanced
Characteristics:
- Mathematically accurate solutions
- Detailed working shown for each problem
- Clear explanation of mathematical principles
- Suitable for math tutoring AI
Sample Output:
{
  "user": "Solve: 2x + 3 = 7",
  "assistant": "Let's solve the equation step by step:\n\n1. Subtract 3 from both sides:\n 2x = 7 - 3\n 2x = 4\n\n2. Divide both sides by 2:\n x = 4 / 2\n x = 2"
}
Purpose: Produces advanced, contextually rich data for specialized AI applications.
Generated Content:
- Emotional Support Conversations: Empathetic, supportive dialogue
- Technical Advice: Expert-level technical guidance
- Professional Communications: Business and formal interactions
- Complex Problem-Solving: Multi-step reasoning scenarios
- Relationship & Social Advice: Interpersonal guidance
- Conflict Resolution: De-escalation and mediation techniques
Characteristics:
- Context-aware responses
- Multi-turn conversation support
- Sophisticated reasoning patterns
- Real-world applicability
Use Cases:
- Empathetic AI assistants
- Technical support systems
- Professional communication training
- Complex reasoning models
| # | Category | Type | Files | Lines/File | Total Lines | Use Case |
|---|---|---|---|---|---|---|
| 1 | Q&A | Knowledge-based | 100 JSONL | 30K | 3M | Fact retrieval, trivia |
| 2 | General Chat | Conversational | 100 JSONL | 30K | 3M | Dialog systems |
| 3 | Coding | Technical | 100 JSONL | 30K | 3M | Code generation |
| 4 | Most Basics General | Beginner Chat | 100 JSONL | 30K | 3M | Friendly conversations |
| 5 | Emotions | Emotional AI | 100 JSONL | 32K | 3.2M | Empathetic responses |
| 6 | Coding Master | Advanced Coding | 100 JSONL | 50K | 5M | Expert programming |
| 7 | Maths Advanced | Mathematics | 100 JSONL | 50K | 5M | Problem-solving |
| 8 | ML Advanced | ML/AI | 100 JSONL | 44K | 4.4M | ML algorithms |
| 9 | Deep Learning Advanced | Deep Learning | 100 JSONL | 46K | 4.6M | Neural networks |
| 10 | AI Ultra Advanced | Cutting-Edge AI | 100 JSONL | 47K | 4.7M | Advanced AI concepts |
| 11 | Physics Master | Physics | 100 JSONL | 42K | 4.2M | Physics problems |
| 12 | Chemistry Normal | Chemistry | 100 JSONL | 30K | 3M | Chemical reactions |
| 13 | Science Overall Master | Comprehensive Science | 100 JSONL | 43K | 4.3M | Multi-discipline science |
| 14 | Religion Master | Religious Studies | 100 JSONL | 32K | 3.2M | Religious knowledge |
| 15 | Encyclopedia Species | Zoology/Biology | 100 JSONL | 30K | 3M | Animal/species info |
| 16 | Foods & Recipes | Culinary | 100 JSONL | 45K | 4.5M | Recipes & cooking |
| 17 | Recipe Understanding | Culinary Education | 100 JSONL | 32K | 3.2M | Cooking techniques |
| 18 | Unity Master | Game Dev (Unity) | 100 JSONL | 50K | 5M | Game development |
| 19 | Unreal Master | Game Dev (Unreal) | 100 JSONL | 48K | 4.8M | Game development |
| 20 | Godot Ultra | Game Dev (Godot) | 100 JSONL | 43K | 4.3M | Game development |
| 21 | Hacking Basics | Cybersecurity | 100 JSONL | 30K | 3M | Security knowledge |
| 22 | Internet Ultra Master | Web/Internet | 100 JSONL | 45K | 4.5M | Internet technology |
| 23 | Anime Master | Entertainment | 100 JSONL | 33K | 3.3M | Anime knowledge |
| TOTAL | 23 Categories | Mixed | 2300 JSONL | 30K-50K | ~93M | Comprehensive AI Training |
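One way to cross-check the TOTAL row is to sum the Lines/File column and multiply by the 100 files per category. The short sketch below does that with the values copied from the table; the variable names are illustrative only.

```python
# Lines per file (in thousands), copied from the table above in row order.
lines_per_file_k = [30, 30, 30, 30, 32, 50, 50, 44, 46, 47, 42, 30,
                    43, 32, 30, 45, 32, 50, 48, 43, 30, 45, 33]
files_per_category = 100

total_files = len(lines_per_file_k) * files_per_category
total_lines = sum(lines_per_file_k) * 1_000 * files_per_category

print(total_files)  # 2300
print(total_lines)  # 89200000
```

With the per-category values as listed, the sum is 89.2M lines, a little under the ~93M headline figure. Adding the optional 3M-line Safety category listed elsewhere in this README brings it to 92.2M.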
Generated datasets are organized hierarchically, with 100 JSONL files per category. The tree below shows five category folders (500 files); a full run produces 2,300 files across 23 categories:
output/
├── qa/
│   ├── qa_001.jsonl          # 20K-50K samples
│   ├── qa_002.jsonl
│   ├── ...
│   └── qa_100.jsonl
│
├── chat/
│   ├── chat_001.jsonl        # 20K-50K samples
│   ├── chat_002.jsonl
│   ├── ...
│   └── chat_100.jsonl
│
├── coding/
│   ├── coding_001.jsonl      # 25K-50K samples
│   ├── coding_002.jsonl
│   ├── ...
│   └── coding_100.jsonl
│
├── math/
│   ├── math_001.jsonl        # 25K-50K samples
│   ├── math_002.jsonl
│   ├── ...
│   └── math_100.jsonl
│
├── advanced/
│   ├── advanced_001.jsonl    # 30K-50K samples
│   ├── advanced_002.jsonl
│   ├── ...
│   └── advanced_100.jsonl
│
└── metadata.json             # Summary statistics and dataset info
Dataset Statistics:
- Total Files: 2,300 JSONL files (100 per category)
- Total Samples: 93M+ unique data points
- Average Lines per File: 30K-50K (varies by category)
- Total Size: ~100-120 GB (fully generated dataset)
- Format: JSONL (JSON Lines - one JSON object per line)
- Uniqueness: 100% via UUID4 identifiers
- Categories: 23 specialized domains
- Streaming Optimized: Each file can be processed independently
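Because each file holds one JSON object per line, a file can be read lazily rather than loaded whole. The sketch below assumes the `output/<category>/<category>_NNN.jsonl` layout shown in the tree above; `iter_samples` and `iter_category` are hypothetical helper names, not part of the project.

```python
import json
import tempfile
from pathlib import Path


def iter_samples(path):
    """Yield one record at a time from a JSONL file without loading it whole."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def iter_category(category_dir):
    """Yield records from every *.jsonl file in a category folder, in name order."""
    for file_path in sorted(Path(category_dir).glob("*.jsonl")):
        yield from iter_samples(file_path)


# Small demonstration with two tiny files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    qa_dir = Path(tmp) / "qa"
    qa_dir.mkdir()
    (qa_dir / "qa_001.jsonl").write_text('{"user": "a", "assistant": "b"}\n', encoding="utf-8")
    (qa_dir / "qa_002.jsonl").write_text('{"user": "c", "assistant": "d"}\n\n', encoding="utf-8")
    users = [sample["user"] for sample in iter_category(qa_dir)]

print(users)  # ['a', 'c']
```

On a real run, `iter_category("output/qa")` would walk all 100 files while holding only one line in memory at a time.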
The AI Dataset Generator creates a massive, well-organized training dataset with the following specifications:
- 100 files per category (2,300-2,400 files total with optional safety category)
- JSONL format (JSON Lines - optimal for streaming and processing)
- Lines per file: 30,000 - 50,000 samples per JSONL file (varies by category)
- Distributed storage: Each category in its own folder for easy management
| Category | Files | Lines/File | Total Lines | Approx. Size/File | Total Size |
|---|---|---|---|---|---|
| Q&A | 100 | 30K | 3M | 25-35 MB | 2.5-3.5 GB |
| General Chat | 100 | 30K | 3M | 30-40 MB | 3-4 GB |
| Coding | 100 | 30K | 3M | 35-45 MB | 3.5-4.5 GB |
| Most Basics General | 100 | 30K | 3M | 25-35 MB | 2.5-3.5 GB |
| Emotions | 100 | 32K | 3.2M | 28-38 MB | 2.8-3.8 GB |
| Coding Master | 100 | 50K | 5M | 50-65 MB | 5-6.5 GB |
| Maths Advanced | 100 | 50K | 5M | 45-60 MB | 4.5-6 GB |
| ML Advanced | 100 | 44K | 4.4M | 40-55 MB | 4-5.5 GB |
| Deep Learning Advanced | 100 | 46K | 4.6M | 42-58 MB | 4.2-5.8 GB |
| AI Ultra Advanced | 100 | 47K | 4.7M | 43-60 MB | 4.3-6 GB |
| Physics Master | 100 | 42K | 4.2M | 38-52 MB | 3.8-5.2 GB |
| Chemistry Normal | 100 | 30K | 3M | 28-38 MB | 2.8-3.8 GB |
| Science Overall Master | 100 | 43K | 4.3M | 40-55 MB | 4-5.5 GB |
| Religion Master | 100 | 32K | 3.2M | 28-38 MB | 2.8-3.8 GB |
| Encyclopedia Species | 100 | 30K | 3M | 25-35 MB | 2.5-3.5 GB |
| Foods & Recipes | 100 | 45K | 4.5M | 42-58 MB | 4.2-5.8 GB |
| Recipe Understanding | 100 | 32K | 3.2M | 28-38 MB | 2.8-3.8 GB |
| Unity Master | 100 | 50K | 5M | 50-65 MB | 5-6.5 GB |
| Unreal Master | 100 | 48K | 4.8M | 48-63 MB | 4.8-6.3 GB |
| Godot Ultra | 100 | 43K | 4.3M | 40-55 MB | 4-5.5 GB |
| Hacking Basics | 100 | 30K | 3M | 25-35 MB | 2.5-3.5 GB |
| Internet Ultra Master | 100 | 45K | 4.5M | 42-58 MB | 4.2-5.8 GB |
| Anime Master | 100 | 33K | 3.3M | 30-40 MB | 3-4 GB |
| Safety (Optional) | 100 | 30K | 3M | 25-35 MB | 2.5-3.5 GB |
| TOTAL | 2300-2400 | 30K-50K | 93M-100M | Variable | ~93-120 GB |
Key Benefits:
- Parallel Processing: Process 23 categories × 100 files simultaneously
- Memory Efficient: Load one file at a time (~50MB typical)
- Easy Distribution: Share individual files or categories across systems
- Fault Tolerance: One corrupted file doesn't affect the other 2,299 files
- Scalability: Easy to combine multiple dataset runs
- Streaming Support: Well suited to PyTorch and TensorFlow DataLoaders
- Specialized Training: Use specific categories for domain-specific models
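The fault-tolerance point can be illustrated with a small scan that counts valid records per file and notes malformed lines instead of aborting. This is a sketch under the assumption that files live in per-category folders below one root; `count_valid_records` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path


def count_valid_records(root):
    """Return ({relative_file: valid_count}, [(relative_file, line_no), ...]).

    The second element lists lines that failed to parse as JSON.
    """
    root = Path(root)
    counts, bad = {}, []
    for file_path in sorted(root.rglob("*.jsonl")):
        name = file_path.relative_to(root).as_posix()
        valid = 0
        with file_path.open("r", encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # skip blank lines
                try:
                    json.loads(line)
                    valid += 1
                except json.JSONDecodeError:
                    bad.append((name, line_no))  # record the problem and keep going
        counts[name] = valid
    return counts, bad


# Small demonstration: one clean file and one file with a broken line
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "qa").mkdir()
    (root / "chat").mkdir()
    (root / "qa" / "qa_001.jsonl").write_text('{"user": "a"}\n{"user": "b"}\n', encoding="utf-8")
    (root / "chat" / "chat_001.jsonl").write_text('{"user": "c"}\n{not json\n', encoding="utf-8")
    counts, bad = count_valid_records(root)
```

A damaged line or file shows up in `bad` while every other file is still counted.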
For generating small datasets (100K–1M samples):
| Resource | Requirement |
|---|---|
| RAM | 2 GB minimum |
| CPU | Dual-core processor (2GHz+) |
| Storage | 70 GB free disk space |
| OS | Windows 7+, Linux (any distro), macOS 10.13+ |
| Python | Python 3.7 or higher |
For generating large datasets (10M–100M samples):
| Resource | Requirement |
|---|---|
| RAM | 8 GB or more (16 GB preferred) |
| CPU | Quad-core or better processor (3GHz+) |
| Storage | 70 GB free disk space |
| OS | Windows 10+, Linux (modern distro), macOS 10.14+ |
| Python | Python 3.8 or higher |
- ARM Processors: Fully supported (Raspberry Pi, Apple Silicon, etc.)
- SSD Recommended: For optimal I/O performance with 70 GB storage capacity
- Memory Scaling: Generator uses streaming to minimize RAM footprint
- Storage Requirement: 70 GB minimum free disk space required for all system tiers
- Network: Not required (fully offline operation)
| Dataset Scale | Files | Total Samples | RAM | CPU | Storage Needed | Duration |
|---|---|---|---|---|---|---|
| Minimal | 230 | ~10M | 4 GB | Dual-core | 20 GB | 1-1.5 hours |
| Small | 500 | ~23M | 6 GB | Dual-core | 40 GB | 1.5-2 hours |
| Medium | 1150 | ~50M | 8 GB | Quad-core | 70 GB | 2-2.5 hours |
| Large | 2300 | 93M | 16 GB | Quad-core | 100-120 GB | 3-4 hours |
| Extra Large | 2300+ | 93M+ | 16+ GB | Octa-core | 150+ GB | 4-6 hours |
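Before starting a long run it can help to confirm that the target drive has room for the chosen scale. The sketch below uses the standard library's `shutil.disk_usage`; the `STORAGE_NEEDED_GB` mapping and `has_enough_space` are illustrative names, with values taken from the Storage Needed column above (upper bound for the Large row).

```python
import shutil

# Approximate storage needs in GB, taken from the table above.
STORAGE_NEEDED_GB = {"minimal": 20, "small": 40, "medium": 70, "large": 120}


def has_enough_space(path, scale):
    """Return True if the drive holding `path` has enough free space for `scale`."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= STORAGE_NEEDED_GB[scale]


ok = has_enough_space(".", "minimal")
print("Enough space for a minimal run:", ok)
```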
- Python 3.7+ installed on your system
- pip (Python package manager): optional, not required
- Git for cloning the repository
git clone https://github.com/asaad123sh/DatasetMaker.git
cd DatasetMaker
python --version # Should be 3.7 or higher
The project uses only Python standard library modules, so no external package installation is required.
# For Windows
python -m venv venv
venv\Scripts\activate
# For Linux/macOS
python3 -m venv venv
source venv/bin/activate
cd /path/to/ai-dataset-generator
python run_all.py
The script will display progress information as it generates each category of data.
Check the output directory for the generated JSONL files:
ls output/ # On Linux/macOS
dir output # On Windows
Modify `run_all.py` to comment out unwanted categories:
# Example: Generate only Q&A and Chat datasets
generate_qa()
generate_chat()
# generate_coding() # Commented out
# generate_math()
# generate_advanced_categories()
Edit individual generator files to adjust:
- Number of samples per category
- Content diversity and complexity
- Output formatting
- Sample distribution
import json
# Load datasets
with open('output/qa_dataset.json', 'r') as f:
    qa_data = json.load(f)
with open('output/chat_dataset.json', 'r') as f:
    chat_data = json.load(f)
# Combine
combined = qa_data + chat_data
# Save
with open('output/combined_dataset.json', 'w') as f:
    json.dump(combined, f, indent=2)
All datasets are saved in minified JSON format for efficient storage (shown pretty-printed here for readability):
[
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"category": "qa",
"user": "What is the capital of France?",
"assistant": "The capital of France is Paris.",
"timestamp": "2024-01-15T10:30:00Z",
"metadata": {
"source": "qa_gen",
"difficulty": "easy"
}
},
{
"id": "6ba7b810-9dad-11d1-80b4-00c04fd430c8",
"category": "math",
"user": "Solve: x + 5 = 12",
"assistant": "To solve x + 5 = 12:\n1. Subtract 5 from both sides\n2. x = 12 - 5\n3. x = 7",
"timestamp": "2024-01-15T10:30:01Z",
"metadata": {
"source": "math_gen",
"difficulty": "easy"
}
}
]
- ID: Globally unique identifier (UUID4)
- Category: Data type (qa, chat, coding, math, advanced)
- User: Input/question text
- Assistant: Output/answer text
- Timestamp: Generation timestamp (ISO 8601)
- Metadata: Additional context and attributes
| Operation | Samples/Second | Avg. Time per Sample |
|---|---|---|
| Q&A Generation | 5,000–10,000 | 0.1–0.2 ms |
| Chat Generation | 3,000–7,000 | 0.15–0.33 ms |
| Coding Generation | 2,000–5,000 | 0.2–0.5 ms |
| Math Generation | 4,000–8,000 | 0.125–0.25 ms |
| File Writing | 10,000–50,000 | 0.02–0.1 ms |
- Per-Sample Memory: ~500 bytes (average)
- File I/O Buffer: Configurable (typically 50–100 MB)
- Peak Memory Usage: Generally stays below 2 GB on moderate datasets
- Use SSD Storage: Significantly faster I/O operations
- Increase Buffer Size: If the system has sufficient RAM
- Parallelize Generation: Run multiple generators in parallel using threading
- Compress Output: Use gzip for storage efficiency
- Batch Processing: Process multiple samples at once
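For the gzip tip, the standard library can write and read compressed JSONL directly in text mode. The helper names below (`write_jsonl_gz`, `read_jsonl_gz`) are hypothetical and the sketch is independent of how the project itself writes files.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(path, records):
    """Write records as gzip-compressed JSONL, one object per line."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Lazily yield records from a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Round-trip demonstration in a temporary folder
records = [{"user": "Solve: x + 5 = 12", "assistant": "x = 7"}]
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "math_001.jsonl.gz"
    write_jsonl_gz(gz_path, records)
    restored = list(read_jsonl_gz(gz_path))
```

Compressed files can still be streamed line by line, so the memory profile stays the same as with plain JSONL.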
- Uniqueness: 100%; every sample has a globally unique UUID
- Accuracy: Domain-dependent; validate against source materials
- Consistency: Schema validation across all samples
- Diversity: Random sampling ensures variety
- Completeness: All required fields present in every sample
Before using generated datasets for training:
- Sample Inspection
# Randomly inspect up to 100 samples
python -c "import json, random; data = json.load(open('output/qa_dataset.json')); print('\n'.join(str(s) for s in random.sample(data, min(100, len(data)))))"
- Validation Script
import json
with open('output/qa_dataset.json', 'r') as f:
    data = json.load(f)
errors = []
for i, sample in enumerate(data):
    if not sample.get('id'):
        errors.append(f"Sample {i}: Missing ID")
    if not sample.get('user'):
        errors.append(f"Sample {i}: Missing user field")
    if not sample.get('assistant'):
        errors.append(f"Sample {i}: Missing assistant field")
print(f"Total errors: {len(errors)}")
for error in errors[:10]:
    print(error)
- Statistical Analysis
import json
with open('output/qa_dataset.json', 'r') as f:
    data = json.load(f)
print(f"Total samples: {len(data)}")
print(f"Avg user length: {sum(len(s['user']) for s in data) / len(data):.0f} chars")
print(f"Avg assistant length: {sum(len(s['assistant']) for s in data) / len(data):.0f} chars")
- Domain Verification: Check accuracy in specialized domains (math, coding)
- Linguistic Quality: Ensure grammatically correct and natural language
- Semantic Relevance: Verify user-assistant pairs are contextually related
- Distribution Analysis: Ensure balanced representation across categories
Cause: Missing Python standard library (rare on fresh installations)
Solution:
python --version # Verify Python installation
pip install --upgrade pip # Update pip
Cause: Output directory lacks write permissions
Solution:
# Linux/macOS
chmod 755 output/
# Windows (Run as Administrator)
icacls output /grant:r "%USERNAME%":F
Cause: System RAM exhausted
Solution:
- Reduce batch size in generator files
- Generate smaller datasets first
- Close other applications
- Consider upgrading system RAM
Cause: Disk I/O bottleneck or insufficient CPU
Solution:
- Use SSD instead of HDD
- Increase CPU core count
- Optimize file buffering
- Run on faster system
Cause: UUID collision (extremely rare) or script re-run
Solution:
- Clear output directory before re-running
- UUID4 collision probability: < 1 in 5.3 × 10^36
- If duplicates occur, use `set()` to deduplicate
A: Yes, this project and its output are suitable for commercial use. Please review the LICENSE file for specific terms.
A: Load the JSON files and create custom Dataset classes:
import json
import torch
with open('output/qa_dataset.json', 'r') as f:
    data = json.load(f)
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return {
            'input': self.data[idx]['user'],
            'output': self.data[idx]['assistant']
        }
dataset = CustomDataset(data)
A: Yes! The code is designed to be modular and customizable. Edit individual generator files to suit your needs.
A: Python 3.7+. We recommend 3.8 or higher for best performance.
A: No. This is a fully offline tool with no external API dependencies.
A: Yes. Modify run_all.py to use Python's threading or multiprocessing modules.
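A minimal sketch of that idea using `concurrent.futures` from the standard library is shown below. The two generator functions are stand-ins written for this example; the real functions called by `run_all.py` may have different names, arguments, and return values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


# Stand-ins for the project's generators (assumed to return a name and a sample count).
def generate_qa():
    return "qa", 3


def generate_chat():
    return "chat", 2


def run_generators(generators, max_workers=4):
    """Run each generator in its own worker thread and collect sample counts."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(gen) for gen in generators]
        for future in as_completed(futures):
            name, count = future.result()  # re-raises any exception from the generator
            results[name] = count
    return results


summary = run_generators([generate_qa, generate_chat])
print(summary)
```

Threads mainly help when file writing dominates. For CPU-bound generation in pure Python, `ProcessPoolExecutor` offers the same interface and sidesteps the global interpreter lock, provided the generator functions are defined at module top level.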
A: Use standard Python tools:
import json
import random
with open('output/qa_dataset.json', 'r') as f:
    data = json.load(f)
# Random sample (capped at the dataset size)
sample = random.sample(data, min(1000, len(data)))
# Filter by length
filtered = [s for s in data if len(s['user']) < 100]
# Save the filtered result (use `sample` here to save the random sample instead)
with open('output/filtered_dataset.json', 'w') as f:
    json.dump(filtered, f, indent=2)
A: Yes. Concatenate JSON arrays and deduplicate by ID if needed:
import json
with open('output1/qa_dataset.json') as f:
    data1 = json.load(f)
with open('output2/qa_dataset.json') as f:
    data2 = json.load(f)
combined = data1 + data2
# Deduplicate by ID
seen = set()
unique = []
for item in combined:
    if item['id'] not in seen:
        unique.append(item)
        seen.add(item['id'])
with open('output/combined.json', 'w') as f:
    json.dump(unique, f, indent=2)
- Fork the repository
- Create a feature branch (`git checkout -b feature/YourFeature`)
- Commit your changes (`git commit -m 'Add YourFeature'`)
- Push to the branch (`git push origin feature/YourFeature`)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add docstrings to new functions
- Test your code thoroughly
- Update README.md if adding new features
- Keep commits atomic and descriptive
- Add new dataset categories (science, history, etc.)
- Optimize generation speed
- Add data analysis and visualization tools
- Implement data validation and sanitization
- Add multilingual support
- Create web interface for configuration
If you use this dataset generator in research or publication, please cite:
@software{ai_dataset_generator_2024,
title={AI Dataset Generator: Large-Scale Training Data Creation Tool},
author={Muhammad Asaad},
year={2024},
url={https://github.com/asaad123sh/DatasetMaker}
}
- Issues: Open an issue on GitHub for bugs or feature requests
- Discussions: Use GitHub Discussions for questions and ideas
- Email: asaadsaif11223366@gmail.com
- Documentation: Check the wiki for detailed guides
| Aspect | Highlight | Status |
|---|---|---|
| Scale | Generate millions of samples for large-scale AI training | Production Ready |
| Speed | Complete generation in 2–3 hours for very large datasets | Optimized |
| Simplicity | Zero dependencies; pure Python implementation | Verified |
| Quality | Unique IDs ensure no duplicate samples | Guaranteed |
| Flexibility | Modular design allows custom extensions | Extensible |
| Compatibility | Works with small to large systems | Tested |
- GPU-accelerated generation
- Distributed generation across multiple machines
- Web UI for dataset configuration
- Real-time progress monitoring dashboard
- Advanced filtering and sampling tools
- Automated quality scoring system
- Multilingual dataset support
- Integration with popular ML platforms
NOTE: "I made this dataset because I am planning to train a new model. The issue I am facing now is how to train a model properly, because in Pakistan there is no opportunity for free GPUs or free APIs of high-end AIs where I can do testing. I am training a small model for my own training and testing on Google Colab, so further optimization of these things and of this dataset will take time. Still, this dataset is a big source of the basic data a model needs for training."
Last Updated: April 2026
Version: 1.0.0
Maintainer: Muhammad Asaad
Contact: asaadsaif11223366@gmail.com
Built with ❤️ for the AI community. Happy dataset generation!