YAMNet's Western training bias causes systematic misdetection of India-specific sounds in educational content #26

## The problem nobody has named yet

The pipeline correctly detects a gunshot or a fire alarm with high confidence. But try it on a video where a harmonium starts playing during a devotional lesson, or where an auto-rickshaw horn cuts through a street scene, or where a call to prayer drifts in from outside a classroom window. The tool either fires the wrong label or produces nothing at all.

This isn't a threshold problem or a label mapping problem. It's a training data problem — and it lives at the very bottom of the stack.

---

## Where the gap comes from

YAMNet was trained on Google's AudioSet V2, roughly 2 million 10-second audio clips sourced from YouTube, and it predicts 521 audio event classes from the AudioSet ontology. AudioSet is impressive in scale, but its geographic distribution mirrors YouTube's global content skew: predominantly English-language, Western-origin material.

For a tool built to serve PlanetRead's Hindi and regional-language educational videos, this is a structural mismatch. YAMNet assigns confidence scores based on acoustic similarity to what it was trained on. Sounds that were underrepresented in AudioSet get lower, noisier confidence values — often falling below the 0.35 detection threshold even when they're clearly audible.

The current `LABEL_MAP` in `src/audio/labels.py` does make a start — Tabla, Dhol, Temple bells, and Diwali firecrackers are mapped. But adding label entries cannot fix what the underlying model does not reliably detect in the first place.

---

## Sounds that fall through — or get misclassified

| Sound | Where it appears in Indian educational content | Dedicated YAMNet class | What the pipeline returns |
|---|---|---|---|
| **Auto-rickshaw horn** | Traffic/street-set shots | None | `Honk` at low confidence, often dropped |
| **Azan (call to prayer)** | Community context, outdoor school scenes | None | `Music` or classified as speech → suppressed |
| **Harmonium** | Devotional lessons, folk music instruction | None | `Musical instrument` (generic, low conf.) |
| **Shehnai** | Wedding and festival scenes | None | `Wind instrument` (generic) |
| **Manjira / finger cymbals** | Devotional content, yoga instruction | None | `Bell` (approximate, inconsistent) |
| **Street vendor chant** | Market scenes, urban documentary content | None | Classified as `Speech` → suppressed entirely |
| **Nagara / war drum** | Folk performance, regional cultural content | None | `Drum` (generic) |
| **Auto-rickshaw engine** | Any urban street content | None | `Motor vehicle` (generic) |
| **Indian train horn** | Journey/transition sequences | None | `Train` (generic, inconsistent timing) |
| **Shehnai at high pitch** | Weddings, temple processions | None | Sometimes fires `Alarm` — false positive |

The **street vendor chant** case is the most frustrating. Because it has speech-like acoustic qualities, YAMNet classifies it under speech categories and the pipeline suppresses it. A distinctly Indian environmental sound that carries real contextual meaning for a Hindi-speaking student disappears silently.

The **azan** case matters for cultural inclusivity. North Indian educational content produced near mosques or community centres will routinely include this sound. Deaf students using PlanetRead content in those communities deserve a caption that tells them what's happening in the scene.

---

## Where this gap sits in the pipeline

```
  Video file (.mp4 / .mkv)
        |
        v
   ffmpeg  -->  16 kHz mono WAV
                      |
                      v
              +-----------------------------------------+
              |  YAMNet (521 AudioSet classes)          |
              |                                         |
              |  Trained on ~2M clips                   |
              |  <- predominantly English YouTube       |
              |  <- Western acoustic profile            |
              |  <- India sounds: sparse coverage       |
              +-----------------+-----------------------+
                                |  confidence scores (0-1)
                                v
              +-----------------------------------------+
              |  LABEL_MAP  (src/audio/labels.py)       |
              |                                         |
              |  Tabla         -> [tabla]           OK  |
              |  Dhol          -> [dhol]            OK  |
              |  Temple bells  -> [temple bells]    OK  |
              |  Fireworks     -> [firecrackers]    OK  |
              |                                         |
              |  Auto-rickshaw ->  (no entry)      MISS |
              |  Azan          ->  (suppressed)    MISS |
              |  Harmonium     ->  (no entry)      MISS |
              |  Street vendor ->  (suppressed)    MISS |
              +-----------------+-----------------------+
                                |
                                v
                        SRT / SLS output
                     (missing India sounds)
```
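
To make the diagram concrete, here is a minimal sketch of the filtering stage as this issue describes it. It is not the project's actual code: `caption_for`, `SUPPRESSED_CLASSES`, and the two-entry `LABEL_MAP` are hypothetical stand-ins, and only the `0.35` threshold comes from the description above. It shows the three separate ways an event can vanish before it reaches the SRT output.

```python
from typing import Optional

GLOBAL_THRESHOLD = 0.35  # the single detection threshold quoted in this issue

# Hypothetical two-entry stand-in; the real map lives in src/audio/labels.py.
LABEL_MAP = {"Tabla": "[tabla]", "Fireworks": "[firecrackers]"}

# Assumption: speech-like classes are dropped so dialogue is not captioned as a sound.
SUPPRESSED_CLASSES = {"Speech"}


def caption_for(yamnet_class: str, score: float) -> Optional[str]:
    """Return a caption tag, or None if the event is lost at this stage."""
    if yamnet_class in SUPPRESSED_CLASSES:
        return None  # e.g. a vendor chant or azan that was scored as speech
    if score < GLOBAL_THRESHOLD:
        return None  # weakly represented sounds tend to land here
    return LABEL_MAP.get(yamnet_class)  # None when the class has no entry


print(caption_for("Tabla", 0.62))                            # mapped and confident
print(caption_for("Vehicle horn, car horn, honking", 0.22))  # below threshold
print(caption_for("Speech", 0.81))                           # suppressed
```

Only the first call produces a caption. The other two return `None` for different reasons, which is why a single fix (threshold, map entry, or suppression rule) does not cover every row in the table above.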

---

## What a fix would look like

### Immediate: per-category confidence thresholds

A single global threshold of `0.35` is too coarse. India-specific sounds that YAMNet has weak representations for legitimately need a lower threshold — say `0.20` — while high-impact events like gunshots can stay at `0.35` or higher. The `AudioEvent` dataclass and `LABEL_MAP` already have the structure to support per-category thresholds; it just has not been implemented.
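
One possible shape for this, sketched under assumptions: `LabelRule`, `LABEL_RULES`, and `accept` are hypothetical names, not the existing `AudioEvent` or `LABEL_MAP` definitions, and the captions and the `0.20` values are placeholders that would need tuning against real clips.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_THRESHOLD = 0.35  # current global value, kept as the fallback


@dataclass(frozen=True)
class LabelRule:
    """Hypothetical per-class rule: caption text plus its own threshold."""
    caption: str
    threshold: float = DEFAULT_THRESHOLD


# Illustrative entries only; the real set would come from src/audio/labels.py.
LABEL_RULES = {
    "Gunshot, gunfire": LabelRule("[gunshot]"),  # stays at 0.35
    "Tabla": LabelRule("[tabla]", threshold=0.20),
    "Vehicle horn, car horn, honking": LabelRule("[horn]", threshold=0.20),
}


def accept(yamnet_class: str, score: float) -> Optional[str]:
    """Return the caption if the score clears that class's own threshold."""
    rule = LABEL_RULES.get(yamnet_class)
    if rule is None or score < rule.threshold:
        return None
    return rule.caption


print(accept("Vehicle horn, car horn, honking", 0.22))  # kept at the lower bar
print(accept("Gunshot, gunfire", 0.22))                 # still rejected
```

Keeping the default on the dataclass means existing entries behave exactly as they do today unless someone lowers a threshold on purpose.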

### Short-term: expand the label map with acoustic proxies

Many Indian sounds have acoustically similar entries in AudioSet that YAMNet will fire on, just with lower confidence or slightly mismatched category names. A dedicated expansion of `LABEL_MAP` with these acoustic proxy mappings would improve recall without any model changes.
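
A rough sketch of how proxy entries could be layered onto the existing map without silently overriding it. The class names are copied from the table above and have not been checked against YAMNet's class map. The captions, `BASE_LABEL_MAP`, `PROXY_MAP`, and `merge_with_proxies` are all hypothetical.

```python
# Hypothetical stand-in for the current map in src/audio/labels.py.
BASE_LABEL_MAP = {"Tabla": "[tabla]", "Bell": "[bell]"}

# Proxy entries: generic classes that the table above reports for Indian sounds.
# Captions stay generic on purpose, since the proxy cannot tell a shehnai
# from any other wind instrument.
PROXY_MAP = {
    "Honk": "[vehicle horn]",
    "Wind instrument": "[wind instrument music]",
    "Bell": "[small cymbals]",
    "Drum": "[drumming]",
}


def merge_with_proxies(base, proxies):
    """Add proxy entries, keeping existing captions and listing any clashes."""
    merged = dict(base)
    conflicts = []
    for yamnet_class, caption in proxies.items():
        if yamnet_class in merged:
            conflicts.append(yamnet_class)  # leave the original, flag for review
        else:
            merged[yamnet_class] = caption
    return merged, conflicts


merged, conflicts = merge_with_proxies(BASE_LABEL_MAP, PROXY_MAP)
print(sorted(merged))
print("needs review:", conflicts)
```

Reporting conflicts instead of overwriting matters here because one generic class such as `Bell` may stand in for several different sounds, and a human should pick the caption.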

### Medium-term: lightweight supplementary classifier

Train a small secondary classifier (for example, a shallow CNN on mel-spectrograms) on 50-100 clips each of the critical India-specific sounds. This classifier runs only on audio windows where YAMNet's top score falls below a confidence floor, so it acts as a second opinion for uncertain events rather than a full model replacement. As a rough, unmeasured target, it could stay under 5 MB and run in about 50 ms per window on CPU.
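
The gating logic itself is simple and can be sketched without any model. Everything below is an assumption for illustration: `classify_window`, the two constants, and `fake_secondary` (a stub standing in for the trained classifier) do not exist in the repository.

```python
from typing import Callable, Optional, Sequence, Tuple

Prediction = Tuple[str, float]  # (label, confidence)

CONFIDENCE_FLOOR = 0.35  # assumption: reuse the current global threshold
SECONDARY_MIN = 0.60     # assumption: placeholder acceptance bar for the small model


def classify_window(
    yamnet_top: Prediction,
    window: Sequence[float],
    secondary: Callable[[Sequence[float]], Prediction],
) -> Optional[Prediction]:
    """Keep YAMNet's answer when it is confident, otherwise ask the second model."""
    if yamnet_top[1] >= CONFIDENCE_FLOOR:
        return yamnet_top  # secondary model is never invoked on this path
    alt = secondary(window)
    return alt if alt[1] >= SECONDARY_MIN else None


def fake_secondary(window: Sequence[float]) -> Prediction:
    """Stub in place of a trained India-specific classifier."""
    return ("harmonium", 0.83)


silence = [0.0] * 16000  # one second of 16 kHz samples, content irrelevant here
print(classify_window(("Musical instrument", 0.18), silence, fake_secondary))
print(classify_window(("Gunshot, gunfire", 0.90), silence, fake_secondary))
```

Because the second model only runs on low-confidence windows, its cost scales with how often YAMNet is unsure rather than with video length.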

### Long-term: India audio dataset contribution

Contribute labeled Indian audio clips upstream: to AudioSet if its maintainers offer a route for community annotations, or otherwise to an openly licensed dataset. This benefits the broader research community, not just this project, and could eventually feed into a future retrained model.

---

## Why this matters for PlanetRead's students specifically

PlanetRead's mission is literacy and accessibility for learners who are often excluded from traditional education. The students who rely on CC annotations are not watching Western content — they are watching Hindi educational films, regional documentaries, community instruction videos. The acoustic landscape of those videos is Indian.

When the CC tool silently skips a harmonium playing behind a maths lesson, or labels a shehnai as `[alarm]`, or drops an azan that gives a student critical context about the scene they are watching — the tool fails the exact people it was built for.

Fixing the detection layer matters more than any downstream improvement. Better evaluation metrics or output validators cannot compensate for a model that structurally underrepresents the content it is analyzing.

---

## Suggested first step

A contributor could start by:

1. Pulling together a list of 20-30 India-specific sounds (with YouTube example clips) that are currently missing or misclassified
2. Running each clip through the current pipeline and documenting what YAMNet actually returns
3. Proposing acoustic proxy mappings for the ones that can be patched in `LABEL_MAP`
4. Opening a follow-up PR with the expanded label map and per-category threshold support

This does not require model training — just careful listening and domain knowledge. If anyone on the team has access to PlanetRead's existing video library, even a 10-video sample would be enough to build a meaningful evidence base.
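
For step 2, a small helper like the following could turn raw model output into something easy to paste into a findings table. It assumes the scores arrive as one row per audio frame and one column per class, which is how YAMNet reports them. `top_k_classes` and the three-class sample data are made up for illustration.

```python
def top_k_classes(frame_scores, class_names, k=3):
    """Average scores over frames and return the k highest (class, mean) pairs."""
    n_frames = len(frame_scores)
    # zip(*rows) walks the matrix column by column, i.e. one class at a time.
    means = [sum(column) / n_frames for column in zip(*frame_scores)]
    ranked = sorted(zip(class_names, means), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up scores for a two-frame clip over three classes.
class_names = ["Speech", "Music", "Vehicle horn, car horn, honking"]
frame_scores = [
    [0.10, 0.30, 0.20],
    [0.05, 0.40, 0.30],
]

for name, mean_score in top_k_classes(frame_scores, class_names, k=2):
    print(f"{name}\t{mean_score:.3f}")
```

Recording the top few classes per clip, not just the winner, makes it easier to spot usable acoustic proxies for step 3.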

---

Refs #2

