π§ π² CS 753: Advanced Neural Techniques for Audio Analysis and Processing under the guidance of Professor Preethi Jyothi
Meet the brilliant minds behind this project:
- Anuj Attri (23M0808) π¨βπ
- Arnav Attri (23M0811) π¨βπ
- Pratham Tarjule (20D110029) π¨βπ
This repository contains three advanced Jupyter notebooks that demonstrate innovative uses of neural networks for audio analysis, transcription, and processing.
This notebook demonstrates the application of OpenAI's Whisper models (Tiny, Base, Small) for transcribing audio data from the LibriSpeech dataset. Utilizing advanced audio preprocessing techniques and machine learning models, we achieve progressively lower Word Error Rates (WER).
- Technologies used:
torch,whisper,pandas,torchaudio - Key features:
- Audio data preprocessing ποΈ
- Utilization of different Whisper model sizes π
- WER calculation and comparison π
Dive into the process of playing audio files and converting them into text using the Wav2Vec2 model. This notebook provides a hands-on approach to handling audio data, processing it with sophisticated models, and generating accurate transcripts.
- Technologies used:
torch,IPython,librosa,transformers - Highlights:
- Audio playback in Jupyter notebooks π§
- Detailed transcription process with model insights π
- Analysis of transcription accuracy and model warnings
β οΈ
Explore a sophisticated architecture that combines Temporal U-Net with Squeezeformer blocks for enhanced sequence processing. This notebook introduces a powerful model for handling complex sequential data, demonstrating the integration of convolution and attention mechanisms.
- Technologies used:
torch - Special features:
- Encoder-decoder architecture with Squeezeformer blocks π§
- Depthwise separable convolution subsampling π
- Application in sequential data processing and analysis π
What was your assigned hacker paper and how does what you've implemented relate to it?
Paper Assigned: Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
Our Squeezeformer.ipynb implements several components related to the Temporal U-Net architecture, which is a deep learning model designed for sequence-to-sequence tasks, such as audio source separation or speech enhancement.
-
Squeezeformer Block
-
Squeezeformer:
-
Depthwise Separable Convolution Subsampling:
-
Unified Activations with Squeezeformer Block:
What did you newly implement, where was the code mainly derived from?
We have tried with Open source Whisper Model to calculate WER on test-clean dataset from librespeech as there was no training script available for Squeezeformer.
Our code was mainly derived from multiple sources including Github, but is majorly written by us.
Demo is shown in Wav2Vec.ipynb
Ready to Dive In? These notebooks are fully implemented and ready to run, offering a seamless experience right out of the box. Simply download, open, and execute to explore the cutting-edge techniques in audio processing:
git clone https://github.com/arnavcse/CS753-Hacker.git
cd CS753-Hacker
jupyter lab
With everything set up for you, getting started is as easy as pie! π₯§ Enjoy experimenting with advanced neural network techniques for audio analysis and processing.
License π
This project is licensed under the MIT License - see the LICENSE.md file for details.
Show Your Support π
Give a βοΈ if this project helped you!