🎵 ACE-step: Captioner & Transcriber

A powerful application for music captioning and song transcription using state-of-the-art models: Qwen2.5-Omni (ACE-Step) and Whisper (large-v3 / large-v2). It offers flexible preprocessing (noise reduction, stem separation, segmentation) and batch processing.

Features

Two recognition models: Qwen Omni (for music description / transcription) and Whisper (pure transcription).
Flexible model loading: quantization (4/8-bit), Flash Attention, CPU offload.
Batch processing: process multiple files or an entire folder.
Audio preprocessing:
- Noise reduction (noisereduce)
- Stem separation (Demucs) – extract vocals, instrumental, etc.
- Trim by duration
- Segmented processing for long audio
Save results automatically or manually (next to the audio file or in a specified folder).
User-friendly Gradio interface with native folder selection dialogs.
GPU acceleration for Demucs and models.
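
The "trim by duration" and "segmented processing" options above can be pictured as a small boundary calculation. The sketch below is only an illustration and is not taken from app.py: the function name `segment_bounds` and its parameters (`segment_s`, `overlap_s`, `max_s`) are hypothetical, and the application may split audio differently.

```python
from typing import List, Optional, Tuple


def segment_bounds(
    total_s: float,
    segment_s: float,
    overlap_s: float = 0.0,
    max_s: Optional[float] = None,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds that cover the audio.

    Hypothetical helper: the names and the overlap option are assumptions,
    not the application's actual settings.
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    if not 0 <= overlap_s < segment_s:
        raise ValueError("overlap_s must be in [0, segment_s)")

    # "Trim by duration": ignore everything past max_s, if it is given.
    end_limit = total_s if max_s is None else min(total_s, max_s)

    bounds = []
    start = 0.0
    step = segment_s - overlap_s  # how far each new segment advances
    while start < end_limit:
        end = min(start + segment_s, end_limit)
        bounds.append((start, end))
        if end >= end_limit:
            break  # the last segment reached the end of the audio
        start += step
    return bounds


if __name__ == "__main__":
    # A 100-second track, 30-second segments, trimmed to the first 70 seconds.
    for start, end in segment_bounds(100, 30, max_s=70):
        print(f"{start:.0f}s - {end:.0f}s")
```

Each (start, end) pair would then be passed to the model separately and the partial results joined in order.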

Installation

Requirements

Python 3.10–3.12 (3.12 recommended; tested on 3.12.9)
CUDA (optional, for GPU)

Steps

Clone the repository:

git clone https://github.com/AndyAnttle/ACE-step-Captioner-Transcriber.git
cd ACE-step-Captioner-Transcriber

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate       # Windows

Install dependencies: PyTorch first, then the project requirements. The versions below were tested on Python 3.12.9 and use the CUDA 12.6 wheels:

pip install torch==2.7.0+cu126 torchvision==0.22.0+cu126 torchaudio==2.7.0+cu126 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
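
After installing, a quick environment check can catch a wrong Python version or a missing package before the models are loaded. This is a minimal sketch and is not part of the repository: the helper names (`python_supported`, `installed`, `report`) are hypothetical, and the supported range is taken from the Requirements section.

```python
import importlib.util
import sys

# Inclusive Python range stated in the Requirements section.
SUPPORTED = ((3, 10), (3, 12))


def python_supported(version=sys.version_info) -> bool:
    """True if the (major, minor) version falls inside the supported range."""
    return SUPPORTED[0] <= (version[0], version[1]) <= SUPPORTED[1]


def installed(module_name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def report() -> dict:
    """Collect a small status dictionary for the current environment."""
    status = {
        "python_ok": python_supported(),
        "torch": installed("torch"),
        "flash_attn": installed("flash_attn"),
    }
    if status["torch"]:
        try:
            import torch

            status["cuda"] = torch.cuda.is_available()
        except Exception:
            # A broken install should be reported, not crash the check.
            status["cuda"] = False
    return status


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {ok}")
```

If `cuda` is reported as False, the models and Demucs would be expected to run on the CPU, since CUDA is listed as optional.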

Run the application:

python app.py

The browser will open automatically with the interface.

Usage

Select the recognition model (Qwen Omni or Whisper).
If Qwen is chosen, specify the path to your Qwen models and load the model. The default path can be changed in the code: MODELS_ROOT = r"your\path\folder".
For Whisper, choose the version and language, then load the model.
Adjust preprocessing settings (noise reduction, stems, segmentation, etc.).
Choose input mode: single file, multiple files, or a folder.
Click the run button (its label depends on the selected model).
The result will appear in the text box. You can save it manually or enable auto-save.
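
The input modes and save locations described above can be sketched with pathlib. This is an illustration under stated assumptions, not the code in app.py: the function names, the set of audio extensions, and the `.txt` result extension are all hypothetical.

```python
from pathlib import Path
from typing import List, Optional

# Assumed set of audio extensions; the application may accept others.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}


def collect_audio(source: Path) -> List[Path]:
    """Return audio files for a single file or for a folder (non-recursive)."""
    if source.is_file():
        return [source] if source.suffix.lower() in AUDIO_EXTS else []
    return sorted(
        p for p in source.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    )


def result_path(audio: Path, out_dir: Optional[Path] = None) -> Path:
    """Place the result next to the audio, or in out_dir when one is given.

    The .txt extension is an assumption made for this sketch.
    """
    target_dir = audio.parent if out_dir is None else out_dir
    return target_dir / (audio.stem + ".txt")


if __name__ == "__main__":
    for audio in collect_audio(Path(".")):
        print(audio, "->", result_path(audio))
```

Passing `out_dir` corresponds to saving in a specified folder; leaving it as None corresponds to saving next to the audio file.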

Notes

Qwen Omni requires a specific transformers branch; it is specified in requirements.txt.

Demucs will download its weights (~1 GB) on first use.

Whisper large-v3 will be downloaded on first use (~3 GB).
