Text-to-speech for the Teensy 4.1 + Audio Shield. Converts arbitrary English text to speech by concatenating MBROLA phoneme WAV files, using a two-tier dictionary lookup with eSpeak-ng-derived letter-to-sound rules as the fallback.
say("some text") splits the input on whitespace and punctuation. Commas insert a short pause (80 ms), periods/question marks/exclamation marks insert a longer pause (200 ms), and spaces insert a 50 ms inter-word gap.
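
The splitting step can be sketched as follows. This is a minimal illustration, not the firmware's code: `Token`, `tokenize`, and the constant names are hypothetical, and merging consecutive pauses into the longest one (so that ", " yields 80 ms rather than 80 + 50 ms) is an assumption.

```cpp
#include <cctype>
#include <string>
#include <vector>

// One unit of output: either a word to synthesize or a silence.
struct Token {
    std::string word;  // empty for a pause
    int pauseMs;       // 0 for a word
};

// Pause lengths taken from the description above.
constexpr int kWordGapMs = 50;
constexpr int kCommaPauseMs = 80;
constexpr int kSentencePauseMs = 200;

// Hypothetical helper: splits text into words and pauses.
std::vector<Token> tokenize(const std::string& text) {
    std::vector<Token> out;
    std::string word;
    auto flushWord = [&]() {
        if (!word.empty()) {
            out.push_back({word, 0});
            word.clear();
        }
    };
    auto addPause = [&](int ms) {
        if (out.empty()) return;  // no pause before the first word
        if (out.back().word.empty()) {
            // ASSUMPTION: consecutive pauses collapse to the longest one.
            if (ms > out.back().pauseMs) out.back().pauseMs = ms;
        } else {
            out.push_back({"", ms});
        }
    };
    for (char c : text) {
        if (c == ',') {
            flushWord();
            addPause(kCommaPauseMs);
        } else if (c == '.' || c == '?' || c == '!') {
            flushWord();
            addPause(kSentencePauseMs);
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            flushWord();
            addPause(kWordGapMs);
        } else {
            word += c;  // apostrophes stay in the word for later normalization
        }
    }
    flushWord();
    return out;
}
```

Under these assumptions, `tokenize("Hello, world.")` yields the word "Hello", an 80 ms pause, the word "world", and a 200 ms pause.
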
Each word is normalized: lowercased, leading/trailing punctuation stripped, contractions truncated at the apostrophe.
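
A sketch of the three normalization steps, for illustration only. `normalizeWord` is a hypothetical name, and treating every `std::ispunct` character as strippable punctuation is an assumption about what the firmware considers punctuation.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper mirroring the normalization described above.
std::string normalizeWord(const std::string& raw) {
    auto isPunct = [](char c) {
        return std::ispunct(static_cast<unsigned char>(c)) != 0;
    };
    std::size_t begin = 0;
    std::size_t end = raw.size();
    // Strip leading and trailing punctuation.
    while (begin < end && isPunct(raw[begin])) ++begin;
    while (end > begin && isPunct(raw[end - 1])) --end;

    std::string word;
    for (std::size_t i = begin; i < end; ++i) {
        const char c = raw[i];
        if (c == '\'') break;  // truncate contractions at the apostrophe
        word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return word;
}
```

With this sketch, "Don't" becomes "don" and a quoted "Hello!" becomes "hello".
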
Tier 1 — Flash dictionary (tts_dict.h)
A small hand-curated table compiled into firmware. Checked first. Use this to override or correct any word that the SD dictionary or rules get wrong.
Tier 2 — SD card dictionary (DICT.BIN on SD card)
~125,000 words from the CMU Pronouncing Dictionary, converted from ARPABET to MBROLA us2 phoneme names. Binary-searched in ~17 seeks (~34 ms worst case). Falls back to rules if the word is not found.
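
The binary search over fixed-size records might look like the sketch below. The 128-byte record size comes from this README; the split into a 32-byte NUL-padded word field followed by the phoneme string is an assumption, and `RecordReader` and `dictLookup` are hypothetical names (the firmware's own entry point is `sdDictLookup()`). On the Teensy, the reader callback would wrap a seek and read on DICT.BIN.

```cpp
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>

constexpr std::size_t kRecordSize = 128;  // from the README: 128 bytes per entry
constexpr std::size_t kWordField = 32;    // ASSUMPTION: NUL-padded word field

// Reads record `index` into `out` (kRecordSize bytes); returns false on I/O error.
using RecordReader = std::function<bool(std::uint32_t index, char* out)>;

// Binary search over sorted records. On success, fills `phonemes` and returns true.
bool dictLookup(const RecordReader& readRecord, std::uint32_t entryCount,
                const char* word, std::string& phonemes) {
    std::uint32_t lo = 0;
    std::uint32_t hi = entryCount;  // half-open range [lo, hi)
    char rec[kRecordSize + 1];
    while (lo < hi) {
        const std::uint32_t mid = lo + (hi - lo) / 2;
        if (!readRecord(mid, rec)) return false;
        rec[kRecordSize] = '\0';  // guarantee termination of the phoneme string
        const int cmp = std::strncmp(word, rec, kWordField);
        if (cmp == 0) {
            phonemes.assign(rec + kWordField);
            return true;
        }
        if (cmp < 0) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    return false;  // not found: the caller falls back to the rules
}
```

Each loop iteration costs one record read, and the range halves every time, which is where the figure of about 17 seeks for roughly 125,000 entries comes from.
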
Tier 3 — Letter-to-sound rules (tts_rules.h)
eSpeak-ng-derived context-sensitive rules applied letter by letter when neither dictionary has the word.
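
The priority order of the three tiers can be expressed as a simple chain. This is a sketch under assumptions: `Lookup`, `Tier`, and `pronounce` are hypothetical names, and the real firmware may wire the stages together differently.

```cpp
#include <functional>
#include <string>
#include <vector>

// A lookup stage returns true and fills `phonemes` when it can pronounce `word`.
using Lookup = std::function<bool(const std::string& word, std::string& phonemes)>;

struct Tier {
    const char* name;  // label for debug output, e.g. "dict" or "rules"
    Lookup lookup;
};

// Tries each tier in order and returns the name of the first one that answers,
// or nullptr if none does.
const char* pronounce(const std::vector<Tier>& tiers, const std::string& word,
                      std::string& phonemes) {
    for (const Tier& tier : tiers) {
        if (tier.lookup(word, phonemes)) return tier.name;
    }
    return nullptr;
}
```

Listing the flash dictionary first is what lets a hand-curated entry override both the SD dictionary and the rules.
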
All phoneme WAV files are 16 kHz, 16-bit mono, stored in the SD card root directory. For each word:
- Each phoneme's WAV is read from SD and appended into a RAM buffer.
- Adjacent phonemes are joined with a 128-sample bidirectional crossfade to eliminate click artifacts at boundaries.
- Short word-final phonemes (< 55 ms) are extended by looping their aspiration tail with a linear fade-out.
- The entire word is played as one continuous PCM stream through the Teensy Audio library — no gaps between phonemes.
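
The crossfade step above can be sketched as a linear overlap-add. This is an illustration, not the code in tts_buffer.h: `appendWithCrossfade` is a hypothetical name, and the linear ramp and the interpretation of "bidirectional" (outgoing phoneme fades out while the incoming one fades in over the same 128 samples) are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kFadeSamples = 128;  // 8 ms at 16 kHz

// Appends `next` to `buf`, overlapping the seam by up to kFadeSamples samples.
// ASSUMPTION: linear ramps; the firmware's exact fade curve is not documented here.
void appendWithCrossfade(std::vector<std::int16_t>& buf,
                         const std::vector<std::int16_t>& next) {
    const std::size_t n = std::min({kFadeSamples, buf.size(), next.size()});
    const std::size_t start = buf.size() - n;
    for (std::size_t i = 0; i < n; ++i) {
        // Outgoing weight falls from n to 1 while incoming weight rises from 0 to n-1.
        const std::int32_t wOut = static_cast<std::int32_t>(n - i);
        const std::int32_t wIn = static_cast<std::int32_t>(i);
        const std::int32_t mixed =
            (buf[start + i] * wOut + next[i] * wIn) / static_cast<std::int32_t>(n);
        buf[start + i] = static_cast<std::int16_t>(mixed);
    }
    // Everything after the overlapped region is appended unchanged.
    buf.insert(buf.end(), next.begin() + static_cast<std::ptrdiff_t>(n), next.end());
}
```

Because the two weights always sum to `n`, the mixed sample stays within the 16-bit range, and the joined buffer is shorter than the two inputs combined by the length of the overlap.
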
The I²S sample rate is set to 16000 Hz in setup() to match the phoneme WAV files.
Copy all of the following to the root directory of the SD card (no subdirectory):
All files from TeensyTalkV2/wavs/ — listed below with their MBROLA phoneme mapping:

| WAV filename | MBROLA phoneme | Example |
|---|---|---|
| A.wav | A | father |
| AE.wav | { | cat |
| AI.wav | AI | bite |
| SCH.wav | @ | about (schwa) |
| OW.wav | @U | boat |
| E.wav | E | bet |
| EI.wav | EI | bait |
| ER.wav | r= | bird |
| IH.wav | I | bit |
| i.wav | i | beat |
| O.wav | O | thought |
| OI.wav | OI | boy |
| OR.wav | OR | for |
| UH.wav | V | but |
| UU.wav | U | book |
| aU.wav | aU | how |
| b.wav | b | bat |
| d.wav | d | dog |
| dZ.wav | dZ | jump |
| DH.wav | D | the (voiced) |
| f.wav | f | fat |
| FL.wav | 4 | butter (flap T) |
| g.wav | g | go |
| h.wav | h | hat |
| j.wav | j | yes |
| k.wav | k | (mapped to KH; see below) |
| KH.wav | k, k_h | kite (aspirated, used for all K) |
| l.wav | l | let |
| LS.wav | l= | bottle (syllabic L) |
| m.wav | m | mat |
| n.wav | n | net |
| NG.wav | N | sing |
| p.wav | p | pat |
| PH.wav | p_h | aspirated P |
| r.wav | r | rat |
| s.wav | s | sat |
| SH.wav | S | she |
| t.wav | t | tap |
| TH.wav | T | think (voiceless) |
| TH2.wav | t_h | aspirated T |
| tS.wav | tS | chip |
| u.wav | u | food |
| v.wav | v | vat |
| w.wav | w | wet |
| z.wav | z | zoo |
| ZH.wav | Z | vision |


| File | Description |
|---|---|
| DICT.BIN | CMU Pronouncing Dictionary, ~125,000 words, binary format |

Download DICT.BIN from the latest release and copy it to the SD card root. It is not stored in the repo due to its size (~15 MB).
Important: All files go directly in the SD card root — no subdirectories.
DICT.BIN is available as a pre-built download on the releases page. If you want to rebuild it yourself (e.g. after modifying the ARPABET conversion), use build_dict.py in the repo root:

    python3 build_dict.py            # writes DICT.BIN in current directory
    python3 build_dict.py out.bin    # custom output path

Copy the resulting DICT.BIN to the SD card root. The file is ~15 MB (~125,000 entries × 128 bytes each).
On boot, the firmware prints: SD dict loaded: 125247 entries
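
The reported entry count can be related to the file size and to the search cost with a little arithmetic. The sketch below assumes DICT.BIN is nothing but back-to-back 128-byte records with no header; that layout and both function names are assumptions, not taken from tts_dict_sd.h.

```cpp
#include <cstdint>

constexpr std::uint32_t kDictRecordBytes = 128;  // from the README

// Returns the number of records in a file of the given size, or 0 if the size
// is not a whole number of records. ASSUMPTION: no file header.
std::uint32_t entryCountFromSize(std::uint64_t fileSizeBytes) {
    if (fileSizeBytes == 0 || fileSizeBytes % kDictRecordBytes != 0) return 0;
    return static_cast<std::uint32_t>(fileSizeBytes / kDictRecordBytes);
}

// Worst-case number of record reads for a binary search over n sorted entries:
// floor(log2(n)) + 1, computed here as the bit length of n.
std::uint32_t worstCaseReads(std::uint32_t n) {
    std::uint32_t reads = 0;
    while (n > 0) {
        ++reads;
        n /= 2;
    }
    return reads;
}
```

Under that assumption, 125247 entries correspond to a file of 16,031,616 bytes, and a lookup needs at most 17 record reads.
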
The WAV files are individual phoneme recordings extracted from the MBROLA us2 voice — a diphone synthesis voice for American English developed as part of the MBROLA project. Each file contains a single phoneme sound at 16 kHz, 16-bit mono.
The MBROLA project: https://github.com/numediart/MBROLA
The us2 voice database is available from the MBROLA voices repository.
To regenerate or add phonemes, install MBROLA and the us2 voice database, then synthesize individual phoneme sequences using a .pho input file (phoneme name + duration in ms + pitch points).
Run the sketch and type the word into the Serial Monitor. The debug output shows what phonemes were used and whether they came from the dictionary or rules:

    rocket: r A k I t (dict)
    travel: t r { v @ l (rules)

Determine the correct MBROLA us2 phoneme sequence for the word. Reference the WAV filename table above for phoneme names — note that some phonemes use special characters ({, @, r=, etc.) that map to renamed WAV files.
Open TeensyTalkV2/tts_dict.h and add an entry in alphabetical order:

    { "rocket", "r A k I t" },    // ← new entry, in alphabetical position
    { "travel", "t r { v @ l" },

The list must stay in strict alphabetical order because it is searched with strcmp comparisons. Adding an entry out of order will cause incorrect lookups for nearby words.
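
Since an out-of-order entry fails silently, a small check run once at startup (or in a host-side test) can catch the mistake. The sketch below is hypothetical: `DictEntry` is an assumed layout for the table in tts_dict.h, and `firstOutOfOrder` is not part of the firmware.

```cpp
#include <cstddef>
#include <cstring>

// ASSUMED layout of one flash dictionary entry: word, then phoneme string.
struct DictEntry {
    const char* word;
    const char* phonemes;
};

// Returns the index of the first entry that is not strictly greater than its
// predecessor under strcmp ordering, or -1 if the whole table is in order.
int firstOutOfOrder(const DictEntry* table, std::size_t count) {
    for (std::size_t i = 1; i < count; ++i) {
        if (std::strcmp(table[i - 1].word, table[i].word) >= 0) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

The `>= 0` comparison also flags duplicate words, which would otherwise make one of the two entries unreachable.
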
Recompile and upload. The flash dictionary takes priority over the SD card dictionary, so your correction will always win.

| MBROLA | WAV file | Sound |
|---|---|---|
| A | A.wav | "father" vowel |
| { | AE.wav | "cat" vowel |
| @ | SCH.wav | schwa: "about", unstressed "-er" |
| @U | OW.wav | "boat" diphthong |
| AI | AI.wav | "bite" diphthong |
| aU | aU.wav | "how" diphthong |
| E | E.wav | "bet" vowel |
| EI | EI.wav | "bait" diphthong |
| I | IH.wav | "bit" short vowel |
| i | i.wav | "beat" long vowel |
| O | O.wav | "thought" vowel |
| OI | OI.wav | "boy" diphthong |
| r= | ER.wav | "bird" r-colored vowel |
| U | UU.wav | "book" vowel |
| u | u.wav | "food" vowel |
| V | UH.wav | "but" strut vowel |
| b | b.wav | voiced bilabial stop |
| d | d.wav | voiced alveolar stop |
| D | DH.wav | voiced "th" (the, this) |
| dZ | dZ.wav | "jump" affricate |
| f | f.wav | voiceless labiodental |
| 4 | FL.wav | flapped T (butter, later) |
| g | g.wav | voiced velar stop |
| h | h.wav | glottal fricative |
| j | j.wav | palatal approximant (yes) |
| k | KH.wav | voiceless velar stop (aspirated) |
| l | l.wav | lateral approximant |
| l= | LS.wav | syllabic L (bottle) |
| m | m.wav | bilabial nasal |
| n | n.wav | alveolar nasal |
| N | NG.wav | velar nasal (sing) |
| p | p.wav | voiceless bilabial stop |
| r | r.wav | approximant |
| s | s.wav | voiceless alveolar fricative |
| S | SH.wav | "she" fricative |
| t | t.wav | voiceless alveolar stop |
| T | TH.wav | voiceless "th" (think) |
| tS | tS.wav | "chip" affricate |
| u | u.wav | high back rounded vowel |
| v | v.wav | voiced labiodental |
| w | w.wav | labio-velar approximant |
| z | z.wav | voiced alveolar fricative |
| Z | ZH.wav | "vision" fricative |

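
The phoneme-to-filename mapping in the tables above reduces to a short list of exceptions plus a default rule. The sketch below is an illustration, not the contents of tts_phonemes.h: `PhonemeFile`, `kRenamed`, and `wavFileFor` are hypothetical names, and the function does not check that the phoneme is one of the known set.

```cpp
#include <string>

struct PhonemeFile {
    const char* phoneme;  // MBROLA us2 phoneme name
    const char* file;     // WAV file on the SD card
};

// Only the phonemes whose file name differs from the phoneme name,
// taken from the tables in this README.
static const PhonemeFile kRenamed[] = {
    {"{", "AE.wav"},   {"@", "SCH.wav"},  {"@U", "OW.wav"},  {"r=", "ER.wav"},
    {"I", "IH.wav"},   {"V", "UH.wav"},   {"U", "UU.wav"},   {"D", "DH.wav"},
    {"4", "FL.wav"},   {"k", "KH.wav"},   {"k_h", "KH.wav"}, {"l=", "LS.wav"},
    {"N", "NG.wav"},   {"S", "SH.wav"},   {"T", "TH.wav"},   {"Z", "ZH.wav"},
    {"p_h", "PH.wav"}, {"t_h", "TH2.wav"},
};

// Returns the WAV file name for a phoneme: the renamed file if one is listed,
// otherwise the phoneme name with ".wav" appended.
std::string wavFileFor(const std::string& phoneme) {
    for (const PhonemeFile& entry : kRenamed) {
        if (phoneme == entry.phoneme) return entry.file;
    }
    return phoneme + ".wav";
}
```

For example, the dictionary string "t r { v @ l" would resolve to t.wav, r.wav, AE.wav, v.wav, SCH.wav, and l.wav.
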
- Teensy 4.1
- Teensy Audio Shield (SGTL5000)
- Micro SD card formatted FAT32
Audio output is on the headphone jack of the Audio Shield.

| File | Purpose |
|---|---|
| TeensyTalkV2.ino | Setup, audio graph, serial input loop |
| tts_buffer.h | PCM RAM buffer, WAV loading, crossfade, normalization |
| tts_phonemes.h | MBROLA→WAV filename mapping, loadPhoneme() |
| tts_dict.h | Flash dictionary (hand-curated, highest priority) |
| tts_dict_sd.h | SD card binary dictionary, sdDictInit(), sdDictLookup() |
| tts_rules.h | eSpeak-ng-derived letter-to-sound rules |
| tts_say.h | say(), sayNumber(), word splitting and normalization |
| build_dict.py | Builds DICT.BIN from CMU Pronouncing Dictionary |