This repository contains the implementation of a Speech Emotion Recognition (SER) model utilizing a Convolutional Recurrent Neural Network (CRNN). The goal of this project is to classify female speech into six distinct emotions: neutral, happy, sad, angry, fear, and disgust.
The model is trained on three popular datasets:
- Toronto Emotional Speech Set (TESS)
- Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D)
- Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
Each dataset has been preprocessed to maintain consistency, focusing on audio from female speakers for uniformity.
The notebook `tess_explore.ipynb` was used to explore the TESS dataset and to experiment with preprocessing and data augmentation.
- Raw audio is transformed into Mel-spectrograms using Fast Fourier Transforms to represent energy in the time-frequency domain.
- Two convolutional layers extract spatial features.
- Max-pooling, batch normalization, and dropout are applied to reduce dimensions and prevent overfitting.
- Two Bidirectional LSTM layers capture temporal dependencies.
- Attention mechanisms enhance focus on key temporal features.
- Dense layers consolidate extracted features.
- Final softmax layer outputs probabilities for six emotion classes.
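The architecture above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the repository's exact code: layer sizes, dropout rates, and the additive attention formulation are assumptions.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Hypothetical sketch of the CRNN described above."""
    def __init__(self, n_mels=64, n_classes=6, hidden=128):
        super().__init__()
        # Two convolutional blocks: conv -> batch norm -> max-pool -> dropout
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
        )
        feat = 32 * (n_mels // 4)  # channels * pooled mel bins
        # Two stacked bidirectional LSTM layers over the time axis
        self.lstm = nn.LSTM(feat, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Simple additive attention over the LSTM time steps
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):             # x: (batch, 1, n_mels, time)
        h = self.conv(x)              # (batch, 32, n_mels/4, time/4)
        b, c, m, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * m)
        h, _ = self.lstm(h)           # (batch, time/4, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)      # attention-weighted summary
        return self.fc(ctx)           # logits; softmax is applied in the loss

model = CRNN()
logits = model(torch.randn(2, 1, 64, 120))  # two dummy spectrograms
print(logits.shape)  # torch.Size([2, 6])
```

Returning raw logits and letting `nn.CrossEntropyLoss` apply the softmax is the idiomatic PyTorch equivalent of the final softmax layer described above.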
- Baseline Model (ResNet-based CNN): Achieved 78.99% accuracy.
- CRNN Model: Validation accuracy of 64% and test accuracy of 61%.
The CRNN model underperformed due to limited dataset diversity and overfitting on small datasets. Simpler CNN models showed better generalization.
```
git clone https://github.com/RuthlessRu/vigilant-fishstick.git
cd vigilant-fishstick/notebooks
python3 data_retrieval.py
python3 preprocess_dataset.py
```
Replace `<model>` with any of the following: `baseline_split2`, `baseline`, `pytorch_conv_split2`, `pytorch_conv`, `pytorch_crnn_split2`, `pytorch_crnn`.
- Split2: Combines TESS, CREMAD, and RAVDESS datasets, then splits them into training, validation, and testing sets.
- Non-Split2: Uses TESS and CREMAD for training and validation, while RAVDESS is reserved for testing.
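The two splitting strategies can be sketched as below. The file lists and the split fractions are hypothetical; in the repository the actual files come from `data_retrieval.py` and `preprocess_dataset.py`.

```python
import random

# Hypothetical file lists standing in for the preprocessed datasets.
tess    = [f"tess_{i}.wav" for i in range(10)]
cremad  = [f"cremad_{i}.wav" for i in range(10)]
ravdess = [f"ravdess_{i}.wav" for i in range(10)]

def shuffle_split(files, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a pooled file list, then carve out val and test sets."""
    files = files[:]
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_frac)
    n_val = int(len(files) * val_frac)
    return files[n_test + n_val:], files[n_test:n_test + n_val], files[:n_test]

# Split2: pool all three datasets before splitting.
train, val, test = shuffle_split(tess + cremad + ravdess)

# Non-Split2: split TESS + CREMA-D for train/val; hold out RAVDESS as test.
train2, val2, _ = shuffle_split(tess + cremad, test_frac=0.0)
test2 = ravdess
```

The Non-Split2 setup is the harder benchmark, since the test speakers come from a dataset the model never saw during training.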
```
python3 <model>.py
```
- Adam Kanoun (adam.kanoun@mail.utoronto.ca): Model building and data preprocessing
- David He (davidhe.he@mail.utoronto.ca): Performance analysis and revisions
- Jai Dey (jai.dey@mail.utoronto.ca): Performance analysis and revisions
- Sizhe Fan (sizhe.fan@mail.utoronto.ca): Found dataset and analyzed data
- Incorporating diverse datasets with multilingual and multi-gender speakers.
- Simplifying architecture to reduce overfitting on small datasets.
- Exploring transformer-based models for SER.



