
Agent Voice Response - AVR VAD - Silero Voice Activity Detection for Node.js


🎀 A Node.js library for Voice Activity Detection using the Silero VAD model.

✨ Features

  • πŸš€ Based on Silero VAD: Uses the pre-trained Silero ONNX model (v5 and legacy versions) for accurate results
  • 🎯 Real-time processing: Supports real-time frame-by-frame processing
  • ⚑ Non-real-time processing: Batch processing for audio files and streams
  • πŸ”§ Configurable: Customizable thresholds and parameters for different needs
  • 🎡 Audio processing: Includes utilities for resampling and audio manipulation
  • πŸ“Š Multiple models: Support for both Silero VAD v5 and legacy models
  • πŸ’Ύ Bundled models: Models are included in the package, no external downloads required
  • πŸ“ TypeScript: Fully typed with TypeScript

πŸš€ Installation

npm install avr-vad

πŸ“– Quick Start

Real-time Processing

import { RealTimeVAD } from 'avr-vad';

// Initialize the VAD with default options (Silero v5 model)
const vad = await RealTimeVAD.new({
  model: 'v5', // or 'legacy'
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  preSpeechPadFrames: 1,
  redemptionFrames: 8,
  frameSamples: 1536,
  minSpeechFrames: 3
});

// Process audio frames in real-time
const audioFrame = getAudioFrameFromMicrophone(); // Float32Array of 1536 samples at 16kHz
const result = await vad.processFrame(audioFrame);

console.log(`Speech probability: ${result.probability}`);
console.log(`Speech detected: ${result.msg === 'SPEECH_START' || result.msg === 'SPEECH_CONTINUE'}`);

// Clean up when done
vad.destroy();

Non-Real-time Processing

import { NonRealTimeVAD } from 'avr-vad';

// Initialize for batch processing
const vad = await NonRealTimeVAD.new({
  model: 'v5',
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35
});

// Process entire audio buffer
const audioData = loadAudioData(); // Float32Array at 16kHz
const results = await vad.processAudio(audioData);

// Get speech segments
const speechSegments = vad.getSpeechSegments(results);
console.log(`Found ${speechSegments.length} speech segments`);

speechSegments.forEach((segment, i) => {
  console.log(`Segment ${i + 1}: ${segment.start}ms - ${segment.end}ms`);
});

// Clean up
vad.destroy();

βš™οΈ Configuration

Real-time VAD Options

interface RealTimeVADOptions {
  /** Model version to use ('v5' | 'legacy') */
  model?: 'v5' | 'legacy';
  
  /** Threshold for detecting speech start */
  positiveSpeechThreshold?: number;
  
  /** Threshold for detecting speech end */
  negativeSpeechThreshold?: number;
  
  /** Frames to include before speech detection */
  preSpeechPadFrames?: number;
  
  /** Frames to wait before ending speech */
  redemptionFrames?: number;
  
  /** Number of samples per frame (usually 1536 for 16kHz) */
  frameSamples?: number;
  
  /** Minimum frames for valid speech */
  minSpeechFrames?: number;
}

Non-Real-time VAD Options

interface NonRealTimeVADOptions {
  /** Model version to use ('v5' | 'legacy') */
  model?: 'v5' | 'legacy';
  
  /** Threshold for detecting speech start */
  positiveSpeechThreshold?: number;
  
  /** Threshold for detecting speech end */
  negativeSpeechThreshold?: number;
}

Default Values

// Real-time VAD defaults
const defaultRealTimeOptions = {
  model: 'v5',
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  preSpeechPadFrames: 1,
  redemptionFrames: 8,
  frameSamples: 1536,
  minSpeechFrames: 3
};

// Non-real-time VAD defaults
const defaultNonRealTimeOptions = {
  model: 'v5',
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35
};

πŸ“Š Results and Messages

VAD Messages

The VAD returns different message types to indicate speech state changes:

enum Message {
  ERROR = 'ERROR',
  SPEECH_START = 'SPEECH_START',
  SPEECH_CONTINUE = 'SPEECH_CONTINUE', 
  SPEECH_END = 'SPEECH_END',
  SILENCE = 'SILENCE'
}

Processing Results

interface VADResult {
  /** Speech probability (0.0 - 1.0) */
  probability: number;
  
  /** Message indicating speech state */
  msg: Message;
  
  /** Audio data if speech segment ended */
  audio?: Float32Array;
}
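A minimal dispatch sketch over these result types. The enum and interface are restated locally so the snippet stands alone; `describeResult` is an illustrative helper, not part of the avr-vad API:

```typescript
// Local copies of the types above so the sketch is self-contained.
enum Message {
  ERROR = 'ERROR',
  SPEECH_START = 'SPEECH_START',
  SPEECH_CONTINUE = 'SPEECH_CONTINUE',
  SPEECH_END = 'SPEECH_END',
  SILENCE = 'SILENCE',
}

interface VADResult {
  probability: number;
  msg: Message;
  audio?: Float32Array;
}

// Returns a short human-readable summary for a single VAD result.
function describeResult(result: VADResult): string {
  switch (result.msg) {
    case Message.SPEECH_START:
      return `speech started (p=${result.probability.toFixed(2)})`;
    case Message.SPEECH_END:
      // `audio` is only populated when a speech segment ends.
      return `speech ended, ${result.audio?.length ?? 0} samples captured`;
    case Message.SPEECH_CONTINUE:
      return 'speaking';
    case Message.ERROR:
      return 'error';
    default:
      return 'silence';
  }
}
```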

Speech Segments

interface SpeechSegment {
  /** Start time in milliseconds */
  start: number;
  
  /** End time in milliseconds */
  end: number;
  
  /** Speech probability for this segment */
  probability: number;
}
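Because segments carry millisecond timestamps, summary statistics fall out directly. A small sketch (the interface is restated locally; these helpers are illustrative, not library functions):

```typescript
// Local copy of the segment shape above.
interface SpeechSegment {
  start: number;       // ms
  end: number;         // ms
  probability: number;
}

// Total speech time in milliseconds across all segments.
function totalSpeechMs(segments: SpeechSegment[]): number {
  return segments.reduce((sum, s) => sum + (s.end - s.start), 0);
}

// Fraction of an audio clip of `totalDurationMs` that contains speech.
function speechRatio(segments: SpeechSegment[], totalDurationMs: number): number {
  return totalDurationMs > 0 ? totalSpeechMs(segments) / totalDurationMs : 0;
}
```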

πŸ”§ Audio Utilities

The library includes various audio processing utilities:

import { utils, Resampler } from 'avr-vad';

// Resample audio to 16kHz (required for VAD)
const resampler = new Resampler({
  nativeSampleRate: 44100,
  targetSampleRate: 16000,
  targetFrameSize: 1536
});

const resampledFrame = resampler.process(audioFrame);

// Other utilities
const frameSize = utils.frameSize; // Get frame size for current sample rate
const audioBuffer = utils.concatArrays([frame1, frame2]); // Concatenate audio arrays

🎯 Advanced Examples

Real-time Speech Detection with Callbacks

import { RealTimeVAD, Message } from 'avr-vad';

class SpeechDetector {
  private vad!: RealTimeVAD; // definitely assigned in initialize()
  private onSpeechStart?: (audio: Float32Array) => void;
  private onSpeechEnd?: (audio: Float32Array) => void;

  constructor(callbacks: {
    onSpeechStart?: (audio: Float32Array) => void;
    onSpeechEnd?: (audio: Float32Array) => void;
  }) {
    this.onSpeechStart = callbacks.onSpeechStart;
    this.onSpeechEnd = callbacks.onSpeechEnd;
  }

  async initialize() {
    this.vad = await RealTimeVAD.new({
      positiveSpeechThreshold: 0.5,
      negativeSpeechThreshold: 0.35,
      onSpeechStart: this.onSpeechStart,
      onSpeechEnd: this.onSpeechEnd
    });
  }

  async processFrame(audioFrame: Float32Array) {
    const result = await this.vad.processFrame(audioFrame);
    return result;
  }

  destroy() {
    this.vad?.destroy();
  }
}

// Usage
const detector = new SpeechDetector({
  onSpeechStart: (audio) => console.log(`Speech started with ${audio.length} samples`),
  onSpeechEnd: (audio) => console.log(`Speech ended with ${audio.length} samples`)
});

await detector.initialize();

Batch Processing Audio File

import { NonRealTimeVAD, utils } from 'avr-vad';
import * as fs from 'fs';

async function processAudioFile(filePath: string) {
  // Load audio data (you'll need your own audio loading logic)
  const audioData = loadWavFile(filePath); // Float32Array at 16kHz
  
  const vad = await NonRealTimeVAD.new({
    model: 'v5',
    positiveSpeechThreshold: 0.6,
    negativeSpeechThreshold: 0.4
  });

  const results = await vad.processAudio(audioData);
  const segments = vad.getSpeechSegments(results);

  console.log(`Processing ${filePath}:`);
  console.log(`Total audio duration: ${(audioData.length / 16000).toFixed(2)}s`);
  console.log(`Speech segments found: ${segments.length}`);
  
  segments.forEach((segment, i) => {
    const duration = ((segment.end - segment.start) / 1000).toFixed(2);
    console.log(`  Segment ${i + 1}: ${segment.start}ms - ${segment.end}ms (${duration}s)`);
  });

  vad.destroy();
  return segments;
}

πŸ“ Development

Requirements

  • Node.js >= 16.0.0
  • TypeScript >= 5.0.0

Build

npm run build

Test

npm test

Scripts

npm run lint      # Run ESLint
npm run clean     # Clean build directory
npm run prepare   # Build step run automatically by npm (install/publish lifecycle)

πŸ“ Project Structure

avr-vad/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ index.ts                   # Main exports
β”‚   β”œβ”€β”€ real-time-vad.ts           # Real-time VAD implementation
β”‚   └── common/
β”‚       β”œβ”€β”€ index.ts               # Common exports
β”‚       β”œβ”€β”€ frame-processor.ts     # Core ONNX processing
β”‚       β”œβ”€β”€ non-real-time-vad.ts   # Batch processing VAD
β”‚       β”œβ”€β”€ utils.ts               # Utility functions
β”‚       └── resampler.ts           # Audio resampling
β”œβ”€β”€ dist/                          # Compiled JavaScript
β”œβ”€β”€ test/                          # Test files
β”œβ”€β”€ silero_vad_v5.onnx            # Silero VAD v5 model
β”œβ”€β”€ silero_vad_legacy.onnx        # Silero VAD legacy model
└── package.json

πŸ”§ Troubleshooting

Audio Format Requirements

The Silero VAD model requires:

  • Sample rate: 16kHz
  • Channels: Mono (single channel)
  • Format: Float32Array with values between -1.0 and 1.0
  • Frame size: 1536 samples (96ms at 16kHz)
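Microphone and file APIs commonly deliver 16-bit signed PCM rather than floats. A minimal sketch of normalizing such a buffer into the expected [-1.0, 1.0] range (this helper is illustrative, not part of avr-vad):

```typescript
// Convert 16-bit signed PCM samples to Float32Array in [-1.0, 1.0].
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    // Divide by 32768 so the most negative sample maps exactly to -1.0.
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```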

Model Selection

  • v5 model: Latest version with improved accuracy
  • legacy model: Original model for compatibility

Use the Resampler utility to convert audio to the required format:

import { Resampler } from 'avr-vad';

const resampler = new Resampler({
  nativeSampleRate: 44100,    // Your audio sample rate
  targetSampleRate: 16000,    // Required by VAD
  targetFrameSize: 1536       // Required frame size
});

Performance Tips

  • Use appropriate thresholds for your use case
  • Consider using the legacy model for lower resource usage
  • For real-time applications, ensure your audio processing pipeline can handle 16kHz/1536 samples per frame
  • Use redemptionFrames to avoid choppy speech detection
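To see why redemptionFrames smooths detection, here is a toy hangover state machine over per-frame probabilities. The thresholds and counter mirror the options above, but this is an illustration of the idea, not the library's internal logic:

```typescript
// Toy hangover logic: speech only ends after `redemptionFrames`
// consecutive frames fall below the negative threshold, so brief
// dips (pauses between words) do not end the segment.
function detectSpeechFrames(
  probs: number[],
  positiveThreshold = 0.5,
  negativeThreshold = 0.35,
  redemptionFrames = 8,
): boolean[] {
  const active: boolean[] = [];
  let speaking = false;
  let quietStreak = 0;
  for (const p of probs) {
    if (!speaking && p >= positiveThreshold) {
      speaking = true;
      quietStreak = 0;
    } else if (speaking) {
      if (p < negativeThreshold) {
        quietStreak++;
        if (quietStreak >= redemptionFrames) speaking = false;
      } else {
        quietStreak = 0; // any loud frame resets the hangover counter
      }
    }
    active.push(speaking);
  }
  return active;
}
```

With a higher redemptionFrames, a single low-probability frame between two words keeps the segment open instead of splitting it.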

Acknowledgments

  • Silero Models for the excellent VAD model
  • ONNX Runtime for model inference
  • The open source community for supporting libraries

Support AVR

AVR is free and open-source. If you find it valuable, consider supporting its development:

Support us on Ko-fi

License

MIT License - see the LICENSE.md file for details.
