This repository contains the implementation of a Speech Emotion Recognition (SER) model utilizing a Convolutional Recurrent Neural Network (CRNN). The goal of this project is to classify female speech into six distinct emotions: neutral, happy, sad, angry, fear, and disgust.
The model is trained on three popular datasets:
- Toronto Emotional Speech Set (TESS)
- Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D)
- Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
Each dataset has been preprocessed to maintain consistency, focusing on audio from female speakers for uniformity.
The notebook `tess_explore.ipynb` was used to explore the TESS dataset and to experiment with preprocessing and data augmentation.
- Raw audio is transformed into Mel-spectrograms using Fast Fourier Transforms to represent energy in the time-frequency domain.
- Two convolutional layers extract spatial features.
- Max-pooling, batch normalization, and dropout are applied to reduce dimensions and prevent overfitting.
- Two Bidirectional LSTM layers capture temporal dependencies.
- Attention mechanisms enhance focus on key temporal features.
- Dense layers consolidate extracted features.
- Final softmax layer outputs probabilities for six emotion classes.
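The architecture above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the repository's exact code: layer sizes, dropout rates, and the additive attention formulation are assumptions.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Hypothetical sketch of the CRNN described above."""
    def __init__(self, n_mels=64, n_classes=6, hidden=128):
        super().__init__()
        # Two convolutional blocks: conv -> batch norm -> max-pool -> dropout
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
        )
        feat = 32 * (n_mels // 4)  # channels * pooled mel bins
        # Two stacked bidirectional LSTM layers over the time axis
        self.lstm = nn.LSTM(feat, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Simple additive attention over the LSTM time steps
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):             # x: (batch, 1, n_mels, time)
        h = self.conv(x)              # (batch, 32, n_mels/4, time/4)
        b, c, m, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * m)
        h, _ = self.lstm(h)           # (batch, time/4, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)      # attention-weighted summary
        return self.fc(ctx)           # logits; softmax is applied in the loss

model = CRNN()
logits = model(torch.randn(2, 1, 64, 120))  # two dummy spectrograms
print(logits.shape)  # torch.Size([2, 6])
```

Returning raw logits and letting `nn.CrossEntropyLoss` apply the softmax is the idiomatic PyTorch equivalent of the final softmax layer described above.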
- Baseline Model (ResNet-based CNN): Achieved 78.99% accuracy.
- CRNN Model: Validation accuracy of 64% and test accuracy of 61%.
The CRNN model underperformed due to limited dataset diversity and overfitting on small datasets. Simpler CNN models showed better generalization.
```
git clone https://github.com/RuthlessRu/vigilant-fishstick.git
cd vigilant-fishstick/notebooks
python3 data_retrieval.py
python3 preprocess_dataset.py
```
Replace `<model>` with any of the following: `baseline_split2`, `baseline`, `pytorch_conv_split2`, `pytorch_conv`, `pytorch_crnn_split2`, `pytorch_crnn`.
- Split2: Combines TESS, CREMAD, and RAVDESS datasets, then splits them into training, validation, and testing sets.
- Non-Split2: Uses TESS and CREMAD for training and validation, while RAVDESS is reserved for testing.
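The two splitting strategies can be sketched as below. The file lists and the split fractions are hypothetical; in the repository the actual files come from `data_retrieval.py` and `preprocess_dataset.py`.

```python
import random

# Hypothetical file lists standing in for the preprocessed datasets.
tess    = [f"tess_{i}.wav" for i in range(10)]
cremad  = [f"cremad_{i}.wav" for i in range(10)]
ravdess = [f"ravdess_{i}.wav" for i in range(10)]

def shuffle_split(files, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a pooled file list, then carve out val and test sets."""
    files = files[:]
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_frac)
    n_val = int(len(files) * val_frac)
    return files[n_test + n_val:], files[n_test:n_test + n_val], files[:n_test]

# Split2: pool all three datasets before splitting.
train, val, test = shuffle_split(tess + cremad + ravdess)

# Non-Split2: split TESS + CREMA-D for train/val; hold out RAVDESS as test.
train2, val2, _ = shuffle_split(tess + cremad, test_frac=0.0)
test2 = ravdess
```

The Non-Split2 setup is the harder benchmark, since the test speakers come from a dataset the model never saw during training.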
```
python3 <model>.py
```
- Adam Kanoun (adam.kanoun@mail.utoronto.ca): Model building and data preprocessing
- David He (davidhe.he@mail.utoronto.ca): Performance analysis and revisions
- Jai Dey (jai.dey@mail.utoronto.ca): Performance analysis and revisions
- Sizhe Fan (sizhe.fan@mail.utoronto.ca): Found dataset and analyzed data
- Incorporating diverse datasets with multilingual and multi-gender speakers.
- Simplifying architecture to reduce overfitting on small datasets.
- Exploring transformer-based models for SER.



