This repository contains a multimodal deep learning model for classifying emotions in music from audio and lyrics. The model combines a spectrogram-based audio architecture with transformer-based lyrics processing using late fusion, achieving 60.7% accuracy on the MoodyLyrics4Q dataset.
- Audio Module: Processes spectrograms with a Sarkar et al.-inspired model.
- Lyrics Module: Uses fine-tuned transformer models for emotion detection.
- Fusion Approaches: Majority voting and concatenation improve multimodal performance (a majority-voting sketch follows this list).
- Best Accuracy: 60.7% (majority voting).
- Performs well for "Happy" and "Relaxed" emotions, with challenges for "Sad."
- Robust across diverse musical genres and styles.
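Below is a minimal sketch of how late fusion by majority voting over the two modalities' predictions could look. The function signature, array shapes, and tie-breaking rule are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def majority_vote(audio_probs: np.ndarray, lyrics_probs: np.ndarray) -> np.ndarray:
    """Late fusion by majority voting over per-modality predictions.

    audio_probs, lyrics_probs: arrays of shape (n_samples, n_classes) with
    class probabilities from the audio and lyrics classifiers (hypothetical
    outputs; the repository's models may expose different interfaces).
    """
    audio_pred = audio_probs.argmax(axis=1)
    lyrics_pred = lyrics_probs.argmax(axis=1)

    # With only two voters a tie is possible; break ties in favour of the
    # more confident model.
    agree = audio_pred == lyrics_pred
    audio_more_confident = audio_probs.max(axis=1) >= lyrics_probs.max(axis=1)
    fused = np.where(agree | audio_more_confident, audio_pred, lyrics_pred)
    return fused
```

With only two modalities, majority voting effectively reduces to agreement plus a tie-break when the models disagree; the repository may resolve such ties differently.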
Audio module: accuracy of the tested spectrogram-based architectures, compared with values reported in the literature.

| Approach | Accuracy in Literature [%] | Accuracy in This Study [%] |
|---|---|---|
| Classical methods (SVM) | 50 | 31.38 |
| Ravdess-based architecture | 65.96 | 47.99 |
| InceptionV3 | 70–90 | 53.36 |
| ResNet | 77.36 | 56.04 |
| VGG16 | 63.79 | 53.69 |
| Sarkar et al. architecture | 68–78 | 59.06 |
| Inception-ResNet | 84.91–87.24 | 56.23 |
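All of the architectures above take spectrogram inputs. A minimal sketch of this kind of preprocessing with librosa follows; the sample rate, mel-band count, and hop length are illustrative assumptions rather than the repository's settings.

```python
import librosa
import numpy as np

def audio_to_log_mel(path: str, sr: int = 22050, n_mels: int = 128,
                     hop_length: int = 512) -> np.ndarray:
    """Load an audio file and convert it to a log-scaled mel spectrogram.

    Parameter defaults are illustrative; the repository may use different
    settings for its CNN input.
    """
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)
    return log_mel
```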
Lyrics module: accuracy of the tested text-based approaches, compared with values reported in the literature.

| Approach | Accuracy in Literature [%] | Accuracy in This Study [%] |
|---|---|---|
| Feature-based SVM | 58.0 | 55.0 |
| Feature-based ANN | 58.5 | 53.9 |
| Fasttext-based | - | 48.2 |
| Transformer-based (XLNet) | 94.78 | 59.2 |
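The strongest lyrics result comes from a fine-tuned transformer (XLNet). A minimal sketch of loading such a classifier with Hugging Face Transformers is shown below; the checkpoint name, label set, and inference details are assumptions and not the repository's configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint and label set are illustrative assumptions; the repository's
# fine-tuned weights and hyperparameters may differ.
MODEL_NAME = "xlnet-base-cased"
LABELS = ["happy", "angry", "sad", "relaxed"]  # MoodyLyrics4Q quadrants

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))

def classify_lyrics(lyrics: str) -> str:
    """Return the predicted emotion label for a lyrics string."""
    inputs = tokenizer(lyrics, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```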
Fusion: performance of the two ensemble strategies on MoodyLyrics4Q.

| Ensemble Approach | Precision [%] | Recall [%] | F1-score [%] | Accuracy [%] |
|---|---|---|---|---|
| Majority voting | 61.5 | 60.7 | 58.8 | 60.7 |
| Concatenation | 58.5 | 58.4 | 58.2 | 58.4 |
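In the concatenation variant, per-modality embeddings are joined and passed through a small classification head. The sketch below illustrates the idea; the embedding dimensions, hidden size, and dropout rate are illustrative assumptions, not the repository's actual architecture.

```python
import torch
import torch.nn as nn

class ConcatFusionHead(nn.Module):
    """Fuse audio and lyrics embeddings by concatenation.

    Embedding dimensions and hidden size are illustrative; the repository's
    feature extractors and head may be sized differently.
    """
    def __init__(self, audio_dim: int = 256, lyrics_dim: int = 768,
                 hidden: int = 128, n_classes: int = 4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + lyrics_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, audio_emb: torch.Tensor,
                lyrics_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_emb, lyrics_emb], dim=-1)
        return self.classifier(fused)  # logits over the four emotion classes
```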
Clone the repository and install the requirements with `pip install -r requirements.txt`.
To run the code formatting tool, use `autopep8 --global-config .pep8 --in-place --aggressive --aggressive <filepath>`.
The global configuration is located in the `.pep8` file.