The problem nobody has named yet
The pipeline correctly detects a gunshot or a fire alarm with high confidence. But try it on a video where a harmonium starts playing during a devotional lesson, or where an auto-rickshaw horn cuts through a street scene, or where a call to prayer drifts in from outside a classroom window. The tool either fires the wrong label or produces nothing at all.
At its root, this isn't a threshold problem or a label mapping problem. It's a training data problem, and it lives at the very bottom of the stack.
Where the gap comes from
YAMNet was trained on Google's AudioSet V2 — roughly 2 million 10-second audio clips sourced from YouTube and labeled across 521 categories. AudioSet is impressive in scale, but its geographic distribution mirrors YouTube's global content skew: predominantly English-language, Western-origin material.
For a tool built to serve PlanetRead's Hindi and regional-language educational videos, this is a structural mismatch. YAMNet assigns confidence scores based on acoustic similarity to what it was trained on. Sounds that were underrepresented in AudioSet get lower, noisier confidence values — often falling below the 0.35 detection threshold even when they're clearly audible.
The current LABEL_MAP in src/audio/labels.py does make a start — Tabla, Dhol, Temple bells, and Diwali firecrackers are mapped. But adding label entries cannot fix what the underlying model does not reliably detect in the first place.
Sounds that fall through — or get misclassified
| Sound | Where it appears in Indian educational content | YAMNet coverage | What the pipeline returns |
|---|---|---|---|
| Auto-rickshaw horn | Traffic/street-set shots | None | Honk at low confidence, often dropped |
| Azan (call to prayer) | Community context, outdoor school scenes | None | Music or classified as speech → suppressed |
| Harmonium | Devotional lessons, folk music instruction | None | Musical instrument (generic, low conf.) |
| Shehnai | Wedding and festival scenes | None | Wind instrument (generic) |
| Manjira / finger cymbals | Devotional content, yoga instruction | None | Bell (approximate, inconsistent) |
| Street vendor chant | Market scenes, urban documentary content | None | Classified as Speech → suppressed entirely |
| Nagara / war drum | Folk performance, regional cultural content | None | Drum (generic) |
| 3-wheeler autorickshaw engine | Any urban street content | None | Motor vehicle (generic) |
| Indian train horn | Journey/transition sequences | None | Train (generic, inconsistent timing) |
| Shehnai at high pitch | Weddings, temple processions | None | Sometimes fires Alarm (false positive) |
The street vendor chant case is the most frustrating. Because it has speech-like acoustic qualities, YAMNet classifies it under speech categories and the pipeline suppresses it. A distinctly Indian environmental sound that carries real contextual meaning for a Hindi-speaking student disappears silently.
The azan case matters for cultural inclusivity. North Indian educational content produced near mosques or community centres will routinely include this sound. Deaf students using PlanetRead content in those communities deserve a caption that tells them what's happening in the scene.
Where this gap sits in the pipeline
Video file (.mp4 / .mkv)
                  |
                  v
ffmpeg --> 16 kHz mono WAV
                  |
                  v
+-----------------------------------------+
| YAMNet (521 AudioSet classes)           |
|                                         |
| Trained on ~2M clips                    |
|   <- predominantly English YouTube      |
|   <- Western acoustic profile           |
|   <- India sounds: sparse coverage      |
+-----------------+-----------------------+
                  | confidence scores (0-1)
                  v
+-----------------------------------------+
| LABEL_MAP (src/audio/labels.py)         |
|                                         |
| Tabla         -> [tabla]          OK    |
| Dhol          -> [dhol]           OK    |
| Temple bells  -> [temple bells]   OK    |
| Fireworks     -> [firecrackers]   OK    |
|                                         |
| Auto-rickshaw -> (no entry)       MISS  |
| Azan          -> (suppressed)     MISS  |
| Harmonium     -> (no entry)       MISS  |
| Street vendor -> (suppressed)     MISS  |
+-----------------+-----------------------+
                  |
                  v
          SRT / SLS output
       (missing India sounds)
What a fix would look like
Immediate: per-category confidence thresholds
A single global threshold of 0.35 is too coarse. India-specific sounds for which YAMNet has weak representations need a lower threshold, say 0.20, while high-impact events like gunshots can stay at 0.35 or higher. The AudioEvent dataclass and LABEL_MAP already have the structure to support per-category thresholds; the feature just has not been implemented.
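A minimal sketch of what per-category thresholds could look like, assuming each detection carries a label and a confidence. The Detection class is a stand-in for the project's AudioEvent, whose real fields are not shown in this issue. CATEGORY_THRESHOLDS, threshold_for and filter_detections are hypothetical names, and the label strings follow the pipeline-output names used in the table above rather than verified YAMNet class names.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Global fallback: the 0.35 threshold quoted in this issue.
DEFAULT_THRESHOLD = 0.35

# Hypothetical per-label overrides. 0.20 is the example value from the
# text above, not a tuned number.
CATEGORY_THRESHOLDS = {
    "Honk": 0.20,
    "Bell": 0.20,
    "Gunshot": 0.35,
}


@dataclass
class Detection:
    """Stand-in for the project's AudioEvent; field names are assumed."""
    label: str
    confidence: float
    start_s: float


def threshold_for(label: str) -> float:
    """Return the per-category threshold, or the global default."""
    return CATEGORY_THRESHOLDS.get(label, DEFAULT_THRESHOLD)


def filter_detections(detections: Iterable[Detection]) -> List[Detection]:
    """Keep detections whose confidence meets their category's threshold."""
    return [d for d in detections if d.confidence >= threshold_for(d.label)]
```

With these values, a Honk at 0.25 would be kept while a Gunshot at 0.30 would still be dropped. Lowering a threshold trades missed events for more false positives, so each override would need checking against real clips.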
Short-term: expand the label map with acoustic proxies
Many Indian sounds have acoustically similar entries in AudioSet that YAMNet will fire on, just with lower confidence or slightly mismatched category names. A dedicated expansion of LABEL_MAP with these acoustic proxy mappings would improve recall without any model changes.
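One way the proxy idea could be expressed, sketched under the assumption that LABEL_MAP is a plain dict from a YAMNet label to a caption string (the real structure in src/audio/labels.py is not shown here). PROXY_LABEL_MAP, caption_for and the caption wording are all hypothetical. The two LABEL_MAP entries are copied from the diagram in this issue.

```python
from typing import Optional

# Entries of the kind already present, per the pipeline diagram above.
LABEL_MAP = {
    "Tabla": "[tabla]",
    "Dhol": "[dhol]",
}

# Hypothetical proxy entries: generic labels the pipeline already returns
# for Indian sounds, mapped to deliberately generic captions.
PROXY_LABEL_MAP = {
    "Honk": "[vehicle horn]",
    "Wind instrument": "[wind instrument playing]",
    "Bell": "[bells or cymbals]",
    "Drum": "[drumming]",
}


def caption_for(yamnet_label: str) -> Optional[str]:
    """Prefer an exact LABEL_MAP entry, then fall back to a proxy caption."""
    if yamnet_label in LABEL_MAP:
        return LABEL_MAP[yamnet_label]
    return PROXY_LABEL_MAP.get(yamnet_label)
```

The proxy captions stay generic on purpose: a generic Wind instrument hit cannot tell a shehnai from a flute, so a caption that names the specific instrument would overclaim.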
Medium-term: lightweight supplementary classifier
Train a small secondary classifier (a shallow CNN on mel-spectrograms) on 50-100 clips each of the critical India-specific sounds. This classifier runs only on audio windows where YAMNet's top score falls below a confidence floor — a second opinion for uncertain events rather than a full model replacement. It could be under 5 MB and run in about 50 ms per window on CPU.
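The gating logic described above could be sketched as follows. This covers only the routing decision, not the classifier itself: secondary is any callable that returns a (label, confidence) pair, and CONFIDENCE_FLOOR, SECONDARY_MIN and classify_window are hypothetical names with placeholder values, since the issue does not fix a floor.

```python
from typing import Callable, Optional, Sequence, Tuple

Prediction = Tuple[str, float]  # (label, confidence)

CONFIDENCE_FLOOR = 0.35  # assumed: below this, YAMNet counts as uncertain
SECONDARY_MIN = 0.50     # assumed: minimum confidence to trust the fallback


def classify_window(
    yamnet_predictions: Sequence[Prediction],
    window: object,
    secondary: Callable[[object], Prediction],
    floor: float = CONFIDENCE_FLOOR,
    secondary_min: float = SECONDARY_MIN,
) -> Optional[Prediction]:
    """Use YAMNet's top prediction when confident, else ask the fallback."""
    top = max(yamnet_predictions, key=lambda p: p[1], default=("", 0.0))
    if top[1] >= floor:
        return top
    # YAMNet is uncertain: consult the secondary classifier for this window.
    label, confidence = secondary(window)
    if confidence >= secondary_min:
        return (label, confidence)
    return None
```

Because the secondary model is only called for windows below the floor, its cost scales with the number of uncertain windows rather than with the length of the video.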
Long-term: India audio dataset contribution
Contribute labeled Indian audio clips to AudioSet's community annotation framework. This benefits the broader research community, not just this project, and could eventually feed into a future retrained model.
Why this matters for PlanetRead's students specifically
PlanetRead's mission is literacy and accessibility for learners who are often excluded from traditional education. The students who rely on CC annotations are not watching Western content — they are watching Hindi educational films, regional documentaries, community instruction videos. The acoustic landscape of those videos is Indian.
When the CC tool silently skips a harmonium playing behind a maths lesson, or labels a shehnai as [alarm], or drops an azan that gives a student critical context about the scene they are watching — the tool fails the exact people it was built for.
Fixing the detection layer matters more than any downstream improvement. Better evaluation metrics or output validators cannot compensate for a model that structurally underrepresents the content it is analyzing.
Suggested first step
A contributor could start by:
- Pulling together a list of 20-30 India-specific sounds (with YouTube example clips) that are currently missing or misclassified
- Running each clip through the current pipeline and documenting what YAMNet actually returns
- Proposing acoustic proxy mappings for the ones that can be patched in LABEL_MAP
- Opening a follow-up PR with the expanded label map and per-category threshold support
This does not require model training — just careful listening and domain knowledge. If anyone on the team has access to PlanetRead's existing video library, even a 10-video sample would be enough to build a meaningful evidence base.
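For the documentation step, a small helper along these lines could summarise what the model returns for each clip. It assumes the per-frame scores have already been obtained from YAMNet (one row per frame, one column per class) and averages them over frames to get clip-level scores. Loading the model and audio is left out, and top_k_labels is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def top_k_labels(
    frame_scores: Sequence[Sequence[float]],
    class_names: Sequence[str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Average per-frame scores over the clip and return the k best classes."""
    if not frame_scores:
        return []
    n_frames = len(frame_scores)
    # zip(*rows) transposes the frames x classes matrix into per-class columns.
    means = [sum(column) / n_frames for column in zip(*frame_scores)]
    ranked = sorted(zip(class_names, means), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

Recording the top few labels per clip, rather than only the top one, makes it easier to spot usable acoustic proxies that sit just below the winning class.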
Refs #2