mferretti/SeedStream

SeedStream

High-performance, seed-based test data generator for enterprise applications. Generate realistic, reproducible data to Kafka, databases, and files using simple YAML configuration.


Features

  • 🚀 High Performance: Multi-threaded generation at 12–258M records/sec for primitives and 25–33K records/sec for realistic Datafaker data
  • 🔄 Reproducible: Same seed → identical output, byte-for-byte, across machines and thread counts
  • 🌍 Locale-Aware: 62 locales supported via Datafaker (Italian names, US addresses, etc.)
  • 📝 Multiple Formats: JSON (NDJSON), CSV (RFC 4180), Protobuf (binary), CBEFF (biometric envelope)
  • 💾 Multiple Destinations: File (NIO, gzip), Kafka (SASL/SSL, async/sync), JDBC databases (HikariCP, FK injection)
  • ⚙️ YAML Configuration: Declarative structure and job definitions, no code required
  • 🔌 Extensible Type System: 48+ Datafaker semantic types with runtime registration (DatafakerRegistry)
  • 🔐 Secure by Default: File permission validation, ${ENV_VAR} substitution for credentials
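
One way to get output that does not depend on the thread count is to derive each record's random stream from the master seed and the record index, rather than from whichever worker happens to produce it. The sketch below illustrates that idea only. It is an assumption about how such a guarantee can be met, not SeedStream's actual implementation (see docs/DESIGN.md for the real model), and the names `SeedSketch`, `recordSeed`, `record`, and `generate` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not part of SeedStream.
final class SeedSketch {

    // Mix the master seed with the record index (SplitMix64-style finalizer)
    // so every record gets its own well-scrambled seed.
    static long recordSeed(long masterSeed, long index) {
        long z = masterSeed + (index + 1) * 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // A "record" here is just one int; it depends only on (masterSeed, index).
    static int record(long masterSeed, long index) {
        return new SplittableRandom(recordSeed(masterSeed, index)).nextInt(1_000_000);
    }

    // Workers take interleaved indices; each result lands in its own slot,
    // so scheduling order cannot change the final array.
    static int[] generate(long masterSeed, int count, int threads) {
        int[] out = new int[count];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int worker = t;
                futures.add(pool.submit(() -> {
                    for (int i = worker; i < count; i += threads) {
                        out[i] = record(masterSeed, i);
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for the worker and make its writes visible
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] single = generate(12345L, 1000, 1);
        int[] multi = generate(12345L, 1000, 8);
        System.out.println(java.util.Arrays.equals(single, multi)); // prints true
    }
}
```

Because no generator state is shared between records, the design trades a small per-record setup cost for output that is identical at any thread count.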

Requirements

  • Java 21+ (Amazon Corretto, OpenJDK, or GraalVM)
  • Gradle 9.4+ (wrapper included, no system install needed)
  • Docker (optional, for integration tests with Testcontainers)
  • JDBC driver (optional, for database destination β€” drop into extras/)

Quick Start

Option 1 β€” Fat JAR (no build required)

Download the release JAR and run immediately. You still need the config files, so clone first:

git clone https://github.com/mferretti/SeedStream.git && cd SeedStream
wget https://github.com/mferretti/SeedStream/releases/latest/download/seedstream-0.4.0.jar
java -jar seedstream-0.4.0.jar execute --job config/jobs/file_address.yaml --count 100

Option 2 β€” Distribution zip

wget https://github.com/mferretti/SeedStream/releases/latest/download/cli-0.4.0.zip
unzip cli-0.4.0.zip
# Point to your own job configs or clone the repo for examples
cli-0.4.0/bin/datagenerator execute --job /path/to/job.yaml --count 100

Option 3 β€” Build from source

git clone https://github.com/mferretti/SeedStream.git && cd SeedStream
./gradlew :cli:run --args="execute --job config/jobs/file_address.yaml --count 100"

Common examples

# Generate 10,000 US customers as CSV
./gradlew :cli:run --args="execute --job config/jobs/file_customer.yaml --format csv --count 10000"

# Stream 1M events to Kafka with 8 threads
./gradlew :cli:run --args="execute --job config/jobs/kafka_events_env_seed.yaml --count 1000000 --threads 8"

# Reproducible output: same seed, same data every time
./gradlew :cli:run --args="execute --job config/jobs/file_address.yaml --seed 12345 --count 1000"

# Validate a configuration without running
./gradlew :cli:run --args="validate --job config/jobs/file_invoice.yaml"

CLI options

| Option | Default | Description |
| --- | --- | --- |
| `--job` | required | Path to job YAML |
| `--format` | `json` | Output format: `json`, `csv`, `protobuf` |
| `--count` | `100` | Records to generate |
| `--seed` | from config | Override seed for this run |
| `--threads` | CPU cores | Worker threads |
| `--verbose` | off | Detailed logging |
| `--debug` | off | Trace sampling (see `--trace-sample-rate`) |

Performance

Validated throughput from JMH benchmarks (March 2026):

| Data type | Throughput |
| --- | --- |
| Primitive (int, boolean) | 12–258M records/sec |
| Datafaker (names, emails, etc.) | 13–154K records/sec |
| Real-world (10-field customer, E2E) | ~25–33K records/sec |
| File I/O | 600–800 MB/s |

Scaling: 3.7× speedup with 4 workers (92% efficiency). Datafaker workloads are I/O-bound, so 4 threads is usually optimal regardless of core count.
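
Assuming the efficiency figure uses the usual definition of parallel efficiency (speedup S divided by worker count N), the arithmetic behind it is:

```latex
E = \frac{S}{N} = \frac{3.7}{4} = 0.925
```

That is about 92%, in line with the figure quoted above.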

See PERFORMANCE.md for full benchmarks, tuning guide, and hardware recommendations.


Architecture

cli → destinations → formats → generators → schema → core

Six independent modules with clean one-way dependencies. Each layer is pluggable: add a destination by implementing DestinationAdapter, a format by implementing FormatSerializer, or a new semantic type by registering it with DatafakerRegistry.

See DESIGN.md for architecture decisions, the multi-threading reproducibility model, and extension points.
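
To make the pluggable-layer idea concrete, here is a minimal sketch of a format and a destination wired together. The interfaces `SketchFormat` and `SketchDestination` are hypothetical stand-ins declared only for this example. This README does not show the real `FormatSerializer` and `DestinationAdapter` signatures, and they may well differ, so treat the shape below as an assumption and check DESIGN.md before implementing an extension.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for a format layer; NOT SeedStream's FormatSerializer.
interface SketchFormat {
    String serialize(Map<String, Object> record);
}

// Hypothetical stand-in for a destination layer; NOT SeedStream's DestinationAdapter.
interface SketchDestination {
    void write(String payload);
}

// Joins field values with commas. Deliberately naive: no quoting or escaping,
// so it is not RFC 4180 compliant.
final class CsvLineFormat implements SketchFormat {
    @Override
    public String serialize(Map<String, Object> record) {
        return record.values().stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
    }
}

// Collects payloads in memory instead of writing to a file, Kafka, or a database.
final class InMemoryDestination implements SketchDestination {
    final List<String> lines = new ArrayList<>();

    @Override
    public void write(String payload) {
        lines.add(payload);
    }
}

final class PipelineSketch {
    // The pipeline only knows the two interfaces, so either side can be swapped.
    static void run(SketchFormat format, SketchDestination destination,
                    List<Map<String, Object>> records) {
        for (Map<String, Object> record : records) {
            destination.write(format.serialize(record));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>(); // keeps field order
        record.put("id", 1);
        record.put("city", "Rome");

        InMemoryDestination destination = new InMemoryDestination();
        run(new CsvLineFormat(), destination, List.of(record));
        System.out.println(destination.lines); // prints [1,Rome]
    }
}
```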


Documentation

| Document | Contents |
| --- | --- |
| config/README.md | Type system reference, job/structure examples, Kafka & database config |
| docs/DESIGN.md | Architecture, threading model, reproducibility, extensibility |
| docs/PERFORMANCE.md | Benchmarks, tuning guide, hardware recommendations |
| docs/TROUBLESHOOTING.md | Common errors, debug mode, FAQ |
| docs/CONTRIBUTING.md | Setup, development workflow, code standards |
| docs/QUALITY.md | Coverage, SpotBugs, Spotless configuration |
| CHANGELOG.md | Release history and roadmap |

Contributing

Contributions are welcome: bug reports, new generators, destinations, or formats.

git clone https://github.com/mferretti/SeedStream.git
cd SeedStream
./gradlew build test

See CONTRIBUTING.md for setup, workflow, and code standards.


License

Copyright 2024-2026 Marco Ferretti

Licensed under the Apache License 2.0.
