High-performance, seed-based test data generator for enterprise applications. Generate realistic, reproducible data and send it to Kafka, databases, and files using simple YAML configuration.
- High Performance: Multi-threaded generation, 12–258M records/sec for primitives, 25–33K records/sec for realistic Datafaker data
- Reproducible: Same seed → identical output, byte-for-byte, across machines and thread counts
- Locale-Aware: 62 locales supported via Datafaker (Italian names, US addresses, etc.)
- Multiple Formats: JSON (NDJSON), CSV (RFC 4180), Protobuf (binary), CBEFF (biometric envelope)
- Multiple Destinations: File (NIO, gzip), Kafka (SASL/SSL, async/sync), JDBC databases (HikariCP, FK injection)
- YAML Configuration: Declarative structure and job definitions, no code required
- Extensible Type System: 48+ Datafaker semantic types with runtime registration (`DatafakerRegistry`)
- Secure by Default: File permission validation, `${ENV_VAR}` substitution for credentials
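
The `${ENV_VAR}` substitution mentioned in the last bullet can be pictured with a small sketch. This is not SeedStream's implementation; the class name `EnvSubstitution`, the method `resolve`, and the fail-fast behavior on a missing variable are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of SeedStream: illustrates ${NAME} substitution only.
public final class EnvSubstitution {
    // Matches ${NAME}, where NAME looks like a conventional environment variable name.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    private EnvSubstitution() {}

    /** Replaces each ${NAME} in raw with env.get(NAME); throws if a variable is missing. */
    public static String resolve(String raw, Map<String, String> env) {
        Matcher m = PLACEHOLDER.matcher(raw);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = env.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException(
                        "Missing environment variable: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' inside the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // In real use the map would typically be System.getenv().
        System.out.println(resolve("user=${USER_NAME}", Map.of("USER_NAME", "demo")));
    }
}
```

Passing the environment as a `Map` rather than reading `System.getenv()` directly keeps the sketch testable, and it means credentials never have to appear in the YAML file itself.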
- Java 21+ (Amazon Corretto, OpenJDK, or GraalVM)
- Gradle 9.4+ (wrapper included, no system install needed)
- Docker (optional, for integration tests with Testcontainers)
- JDBC driver (optional, for the database destination; drop it into `extras/`)
Download the release JAR and run immediately. You still need the config files, so clone first:

    git clone https://github.com/mferretti/SeedStream.git && cd SeedStream
    wget https://github.com/mferretti/SeedStream/releases/latest/download/seedstream-0.4.0.jar
    java -jar seedstream-0.4.0.jar execute --job config/jobs/file_address.yaml --count 100

Alternatively, download the CLI distribution ZIP:

    wget https://github.com/mferretti/SeedStream/releases/latest/download/cli-0.4.0.zip
    unzip cli-0.4.0.zip
    # Point to your own job configs or clone the repo for examples
    cli-0.4.0/bin/datagenerator execute --job /path/to/job.yaml --count 100

Or build and run from source:

    git clone https://github.com/mferretti/SeedStream.git && cd SeedStream
    ./gradlew :cli:run --args="execute --job config/jobs/file_address.yaml --count 100"

Common invocations:

    # Generate 10,000 US customers as CSV
    ./gradlew :cli:run --args="execute --job config/jobs/file_customer.yaml --format csv --count 10000"
    # Stream 1M events to Kafka with 8 threads
    ./gradlew :cli:run --args="execute --job config/jobs/kafka_events_env_seed.yaml --count 1000000 --threads 8"
    # Reproducible output: same seed, same data every time
    ./gradlew :cli:run --args="execute --job config/jobs/file_address.yaml --seed 12345 --count 1000"
    # Validate a configuration without running
    ./gradlew :cli:run --args="validate --job config/jobs/file_invoice.yaml"

| Option | Default | Description |
|---|---|---|
| `--job` | required | Path to job YAML |
| `--format` | `json` | `json`, `csv`, `protobuf` |
| `--count` | `100` | Records to generate |
| `--seed` | from config | Override seed for this run |
| `--threads` | CPU cores | Worker threads |
| `--verbose` | off | Detailed logging |
| `--debug` | off | Trace sampling (see `--trace-sample-rate`) |

Validated throughput from JMH benchmarks (March 2026):
| Data type | Throughput |
|---|---|
| Primitive (int, boolean) | 12–258M records/sec |
| Datafaker (names, emails, etc.) | 13–154K records/sec |
| Real-world (10-field customer, E2E) | ~25–33K records/sec |
| File I/O | 600–800 MB/s |

Scaling: 3.7× speedup with 4 workers (92% efficiency). Datafaker workloads are I/O-bound; 4 threads is usually optimal regardless of core count.
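
For readers checking the scaling figure, the efficiency quoted above is consistent with the usual definition of parallel efficiency as speedup divided by worker count. The symbols $S$ (speedup), $N$ (workers), and $E$ (efficiency) are introduced here for illustration only:

$$
E = \frac{S}{N} = \frac{3.7}{4} = 0.925
$$

That is 92.5%, which the line above reports as 92%.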
See PERFORMANCE.md for full benchmarks, tuning guide, and hardware recommendations.
`cli → destinations → formats → generators → schema → core`

Six independent modules with clean one-way dependencies. Each layer is pluggable: add a destination by implementing `DestinationAdapter`, a format by implementing `FormatSerializer`, or a new semantic type by registering it with `DatafakerRegistry`.
See DESIGN.md for architecture decisions, the multi-threading reproducibility model, and extension points.
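
One common way to get identical output regardless of thread count is to derive each record's random seed from the master seed and the record's index, so it does not matter which worker produces which record. The sketch below shows that general technique only; it is an assumption for illustration, not a description of SeedStream's actual model (DESIGN.md is the authority there). The class `IndexSeededDemo` and all its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo, not SeedStream code: per-record seeding by index.
public final class IndexSeededDemo {
    private IndexSeededDemo() {}

    /** SplitMix64-style mix of master seed and record index into a per-record seed. */
    static long recordSeed(long masterSeed, long index) {
        long z = masterSeed + (index + 1) * 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** A stand-in "record": one integer drawn from the record's own generator. */
    static int record(long masterSeed, int index) {
        return new Random(recordSeed(masterSeed, index)).nextInt(1_000_000);
    }

    /** Generates count records with the given number of threads, striding by worker id. */
    static int[] generate(long masterSeed, int count, int threads) {
        int[] out = new int[count];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int worker = t;
                futures.add(pool.submit(() -> {
                    for (int i = worker; i < count; i += threads) {
                        out[i] = record(masterSeed, i); // each slot is written by one worker
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and make the writes visible
            }
        } catch (Exception e) {
            throw new IllegalStateException("generation failed", e);
        } finally {
            pool.shutdown();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] single = generate(12345L, 1000, 1);
        int[] multi = generate(12345L, 1000, 8);
        System.out.println(Arrays.equals(single, multi)); // prints "true"
    }
}
```

Because every record depends only on the pair (master seed, index), the thread count changes how work is split but not what each record contains.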
| Document | Contents |
|---|---|
| config/README.md | Type system reference, job/structure examples, Kafka & database config |
| docs/DESIGN.md | Architecture, threading model, reproducibility, extensibility |
| docs/PERFORMANCE.md | Benchmarks, tuning guide, hardware recommendations |
| docs/TROUBLESHOOTING.md | Common errors, debug mode, FAQ |
| docs/CONTRIBUTING.md | Setup, development workflow, code standards |
| docs/QUALITY.md | Coverage, SpotBugs, Spotless configuration |
| CHANGELOG.md | Release history and roadmap |
Contributions welcome: bug reports, new generators, destinations, or formats.

    git clone https://github.com/mferretti/SeedStream.git
    cd SeedStream
    ./gradlew build test

See CONTRIBUTING.md for setup, workflow, and code standards.

Copyright 2024-2026 Marco Ferretti
Licensed under the Apache License 2.0.