Variational Autoencoder for music generation — learning a continuous latent space of musical sequences for interpolation, sampling, and controlled generation.
Topics: audio-ml · deep-learning · generative-ai · hierarchical-lstm · latent-space-interpolation · magenta · music-ai · music-generation · variational-autoencoder · sequential-generation
MusicVAE is an implementation of Variational Autoencoders applied to symbolic music sequences — a research project exploring the intersection of deep generative modelling and computational creativity. Inspired by Magenta's MusicVAE paper (Roberts et al., 2018), it trains a hierarchical VAE on MIDI sequences to learn a smooth, continuous latent space where neighbouring points correspond to musically similar sequences and linear interpolation produces coherent musical transitions.
The architecture uses a bidirectional LSTM encoder to compress 2- or 4-bar MIDI sequences into a latent vector of 256 or 512 dimensions, and a hierarchical conductor + decoder LSTM to reconstruct the sequence from the latent code. The hierarchical decoder structure — where a conductor LSTM generates bar-level embeddings that a lower-level decoder uses token by token — is critical for learning coherent long-range musical structure rather than note-by-note prediction.
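The encoder half of this architecture can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the repository's actual code — module names and hyperparameters are assumptions matching the dimensions described above:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sketch: BiLSTM encoder mapping a piano roll to (mu, log_var)."""
    def __init__(self, input_dim=128, hidden_dim=512, latent_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc_mu = nn.Linear(2 * hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(2 * hidden_dim, latent_dim)

    def forward(self, x):
        # x: (batch, time, 128) binary piano roll
        _, (h, _) = self.lstm(x)
        h = torch.cat([h[0], h[1]], dim=-1)  # concat fwd/bwd final states
        return self.fc_mu(h), self.fc_logvar(h)

def reparameterise(mu, logvar):
    # z = mu + sigma * eps, eps ~ N(0, I), keeping sampling differentiable
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

enc = BiLSTMEncoder()
x = torch.zeros(4, 32, 128)      # batch of 2-bar sequences (32 time steps)
mu, logvar = enc(x)
z = reparameterise(mu, logvar)
print(z.shape)                   # torch.Size([4, 256])
```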
The latent space has three musically meaningful operations that make MusicVAE practically useful beyond academic interest: sampling (draw a random point and decode a novel sequence), interpolation (smoothly blend between two musical ideas by interpolating their latent vectors and decoding at intermediate points), and attribute manipulation (identify latent directions corresponding to tempo, density, or rhythmic complexity and move along them for controlled generation).
Standard sequence models (vanilla RNNs, Transformers) generate music autoregressively — each note depends on previous notes but there is no global latent variable representing the 'character' of the sequence. VAEs provide that global latent variable, enabling the musically powerful operations of interpolation and attribute manipulation that are impossible with autoregressive models alone. This project was built to explore whether a relatively small VAE (trained on a single GPU) can learn a musically meaningful latent space from a few thousand MIDI sequences.
```
Input: 2-bar MIDI sequence (piano roll, 32 × 128 binary matrix)
        │
Bidirectional LSTM Encoder
        → μ, log σ² (256-dim each)
        → z ~ N(μ, σ²) via the reparameterisation trick
        │
Hierarchical Decoder
        ├── Conductor LSTM: z → bar embeddings (e₁, e₂)
        └── Token Decoder LSTM: eₜ → note sequence per bar
        │
Loss: Reconstruction (CE) + β-weighted KL divergence
        ELBO = E[log p(x|z)] − β · KL(q(z|x) ‖ p(z)),   loss = −ELBO
        │
Latent Space Operations
        ├── Sample: z ~ N(0, I) → decode a new melody
        ├── Interpolate: z = α·z₁ + (1−α)·z₂ → blend melodies
        └── Attribute edit: z′ = z + λ·direction → modify style
```
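The training objective can be computed directly in PyTorch. A hedged sketch (not the repository's training code): the closed-form KL against the unit-Gaussian prior plus binary cross-entropy over the piano roll, which together give the negative β-weighted ELBO:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon_logits, mu, logvar, beta=0.5):
    """Reconstruction (binary CE over the piano roll) + beta-weighted KL."""
    recon = F.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="sum") / x.size(0)
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl, recon, kl

# Toy check: with mu = 0 and log_var = 0 the posterior equals the prior,
# so the KL term vanishes and only reconstruction error remains.
x = torch.randint(0, 2, (4, 32, 128)).float()
mu, logvar = torch.zeros(4, 256), torch.zeros(4, 256)
loss, recon, kl = beta_vae_loss(x, torch.zeros_like(x), mu, logvar)
print(kl.item())  # 0.0
```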
- Conductor + decoder LSTM hierarchy for learning coherent bar-level structure — addressing the posterior collapse problem that affects standard VAEs on long sequences.
- 128-note × time-step binary piano roll encoding with configurable quantisation (16th-note resolution by default), supporting monophonic and polyphonic MIDI sequences.
- Configurable β parameter for the KL divergence weight, enabling the disentanglement vs reconstruction trade-off to be tuned for the target application (generation vs reconstruction quality).
- Smooth musical interpolation between two MIDI sequences by encoding both to latent vectors and decoding at N evenly spaced intermediate points — producing a musical 'morph'.
- Sampling of random points from the prior N(0, I), decoded to MIDI and played back — demonstrating the diversity and quality of the learned generative model.
- Identification of latent space directions correlated with musical attributes (note density, pitch range, rhythmic complexity) using labelled sequence pairs and linear regression in latent space.
- In-app MIDI playback via FluidSynth or browser-based MIDI.js, allowing immediate audition of generated sequences without an external DAW.
- Real-time training curves for reconstruction loss, KL divergence, ELBO, and per-timestep accuracy — with sample generation at regular intervals to monitor learning progress.
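The attribute-direction idea in the list above can be sketched with ordinary least squares. The data here is synthetic — a planted direction stands in for real latent codes and attribute labels — purely to illustrate the recovery step:

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, n = 256, 500

# Synthetic stand-in for real data: latent codes Z and one scalar attribute
# per sequence (e.g. note density), generated along a planted direction so
# the recovery can be checked.
true_dir = rng.standard_normal(latent_dim)
true_dir /= np.linalg.norm(true_dir)
Z = rng.standard_normal((n, latent_dim))
density = Z @ true_dir + 0.1 * rng.standard_normal(n)

# Least-squares regression of the attribute on z; the normalised weight
# vector is the latent direction to move along for controlled edits.
w, *_ = np.linalg.lstsq(Z, density, rcond=None)
direction = w / np.linalg.norm(w)

print(float(direction @ true_dir))  # close to 1.0: the direction is recovered
```

With real data, `Z` would hold encoder outputs and `density` a measured attribute per sequence; decoding `z + λ·direction` for a few values of λ then gives controlled variation of that attribute.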
| Library / Tool | Role | Why This Choice |
|---|---|---|
| PyTorch | Deep learning framework | BiLSTM encoder, conductor LSTM, decoder LSTM, Adam optimiser |
| pretty_midi | MIDI I/O | MIDI loading, piano roll conversion, MIDI synthesis |
| Magenta (optional) | Reference implementation | Pre-trained MusicVAE weights for comparison |
| NumPy | Array operations | Piano roll manipulation, latent arithmetic |
| Matplotlib | Visualisation | Piano roll plots, loss curves, latent space t-SNE |
| FluidSynth | MIDI playback | Offline MIDI to audio synthesis for playback |
| Streamlit | Generation interface | Latent space explorer, interpolation slider, sample player |
- Python 3.9+
- A virtual environment manager (`venv`, `conda`, or equivalent)
- Environment variables as listed in the Configuration section
```shell
git clone https://github.com/Devanik21/MusicVAE.git
cd MusicVAE
python -m venv venv && source venv/bin/activate
pip install torch pretty_midi numpy matplotlib streamlit

# Optional: FluidSynth for MIDI playback
#   sudo apt-get install fluidsynth   (Linux)
#   brew install fluid-synth          (macOS)

# Download/prepare MIDI dataset (Lakh MIDI Dataset recommended)
# python prepare_data.py --data_dir ./midi_data/

# Train the model
python train.py --config configs/2bar_melody.yaml --epochs 100

# Launch generation interface
streamlit run app.py
```

```shell
# Train MusicVAE
python train.py --epochs 100 --latent_dim 256 --beta 0.5

# Sample new melodies
python sample.py --checkpoint checkpoints/best.pt --n_samples 10

# Interpolate between two MIDI files
python interpolate.py \
    --start melody_a.mid --end melody_b.mid \
    --steps 8 --output interpolations/

# Explore latent space
streamlit run app.py
```

| Variable | Default | Description |
|---|---|---|
| `LATENT_DIM` | 256 | VAE latent space dimensionality |
| `BETA` | 0.5 | KL divergence weight β in the β-VAE loss |
| `N_BARS` | 2 | Sequence length in bars (2 or 4) |
| `QUANTISATION` | 16 | Time steps per bar (16 = 16th-note resolution) |
| `LEARNING_RATE` | 0.001 | Adam optimiser learning rate |
| `BATCH_SIZE` | 64 | Training batch size |
Copy `.env.example` to `.env` and populate required values before running.
```
MusicVAE/
├── README.md
├── requirements.txt
├── app.py
└── ...
```
- Conditional MusicVAE: condition generation on chord progression or style label
- Transformer-based encoder for capturing long-range harmonic dependencies
- Cross-modal latent space: align music latent space with text descriptions for text-to-melody generation
- Real-time latent space exploration in a browser with WebMIDI API
- Few-shot style adaptation: fine-tune the decoder on 10–20 examples of a new musical style
Contributions, issues, and suggestions are welcome.
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-idea`
- Commit your changes: `git commit -m 'feat: add your idea'`
- Push to your branch: `git push origin feature/your-idea`
- Open a Pull Request with a clear description
Please follow conventional commit messages and add documentation for new features.
MusicVAE training on a single GPU (T4/RTX 3060) takes approximately 4–8 hours for 100 epochs on the Lakh MIDI Dataset filtered to monophonic melodies. The model exhibits posterior collapse at high β values — monitor KL divergence during training and reduce β if KL approaches zero. Generated MIDI quality varies significantly with dataset quality and training duration.
Devanik Debnath
B.Tech, Electronics & Communication Engineering
National Institute of Technology Agartala
This project is open source and available under the MIT License.
Built with curiosity, depth, and care — because good projects deserve good documentation.