LexiPhon is a collection of phonetically transcribed lexicons derived from Wikipedia. Each lexicon includes word frequencies and transcriptions generated with up to four transcription methods:
- XPF
- Epitran
- CharsiuG2P
- WikiPron

To generate lexicons, you will need to install Python and Apptainer.

Download and parse the ParaNames corpus with `make download_paranames`.

To use the pre-built Apptainer image, run `./scripts/get-wiki-g2p [WIKI_CODE] [WIKI_DUMP_DATE]` with the Wikipedia code for the language and the date of the Wikipedia dump.

To add a new language:

- Add a row to `languages.tsv` with the new language's information.
- Run `make` to build the Apptainer images.
- Run `./scripts/get-wiki-g2p [WIKI_CODE] [WIKI_DUMP_DATE]`.

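
As a rough illustration of the `get-wiki-g2p` invocation, the sketch below assembles the command line in Python and checks both arguments before anything is run. The helper names `build_g2p_command` and `run_g2p` are hypothetical and not part of this repository. The accepted shapes of the two arguments are assumptions: lowercase, optionally hyphenated Wikipedia codes, and `YYYYMMDD` dates as used in Wikimedia dump directory names. The script itself may accept other forms.

```python
import re
import subprocess

SCRIPT = "./scripts/get-wiki-g2p"  # script path as given in the instructions


def build_g2p_command(wiki_code: str, dump_date: str) -> list[str]:
    """Return the argument list for get-wiki-g2p (hypothetical helper)."""
    # Assumption: Wikipedia codes are lowercase letters, optionally hyphenated.
    if not re.fullmatch(r"[a-z]+(-[a-z]+)*", wiki_code):
        raise ValueError(f"unexpected Wikipedia code: {wiki_code!r}")
    # Assumption: dump dates are written as YYYYMMDD.
    if not re.fullmatch(r"\d{8}", dump_date):
        raise ValueError(f"expected a YYYYMMDD dump date, got {dump_date!r}")
    return [SCRIPT, wiki_code, dump_date]


def run_g2p(wiki_code: str, dump_date: str) -> None:
    """Run the script from the repository root and fail loudly on errors."""
    subprocess.run(build_g2p_command(wiki_code, dump_date), check=True)
```

For example, `build_g2p_command("af", "20240101")` returns the arguments for Afrikaans. The date here is a placeholder, not a dump known to exist.
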
| Language | Code | N (WikiPron size) | XPF Inv. Size | EPI Inv. Size | CS Inv. Size | WP Inv. Size | XPF Mean Len. | EPI Mean Len. | CS Mean Len. | WP Mean Len. |
|---|---|---|---|---|---|---|---|---|---|---|
| Adyghe | ady | 5729 (369) | -- | -- | 151 | 62 | -- | -- | 7.16 | 4.18 |
| Afrikaans | af | 42290 (1332) | -- | -- | 65 | 53 | -- | -- | 7.66 | 5.57 |
| Albanian | sq | 51599 (758) | 41 | 41 | 71 | 66 | 7.06 | 7.17 | 7.25 | 5.25 |
| Amharic | am | 80968 (292) | -- | 63 | 91 | 61 | -- | 8.27 | 9.34 | 5.01 |
| Arabic | ar | 47027 (3544) | -- | 54 | 76 | 59 | -- | 6.15 | 7.19 | 6.37 |
| Armenian | hy | 56436 (6101) | 37 | -- | 142 | 44 | 7.98 | -- | 8.08 | 7.16 |
| Assamese | as | 52966 (1622) | -- | -- | -- | 66 | -- | -- | -- | 4.65 |
| Asturian | ast | 49037 (573) | 25 | -- | -- | 30 | 7.42 | -- | -- | 6.56 |
| Aymara | ay | 23359 (0) | 31 | -- | -- | -- | 6.93 | -- | -- | -- |
| Azerbaijani | az | 62068 (1869) | 35 | 62 | 60 | 57 | 8.19 | 8.21 | 8.16 | 5.92 |
| Bashkir | ba | 54910 (1627) | 38 | -- | 77 | 113 | 8.43 | -- | 8.14 | 5.34 |
| Basque | eu | 55013 (2051) | 36 | -- | 73 | 34 | 7.8 | -- | 7.95 | 6.09 |
| Belarusian | be | 66717 (2976) | 43 | -- | 115 | 63 | 8.22 | -- | 8.31 | 6.25 |
| Bengali | bn | 46330 (1589) | -- | 69 | 433 | 76 | -- | 7.36 | 6.7 | 5.3 |
| Bulgarian | bg | 58291 (6176) | 29 | -- | 186 | 53 | 7.89 | -- | 7.77 | 6.85 |
| Catalan | ca | 44017 (13881) | -- | 48 | 57 | 40 | -- | 7.24 | 7.03 | 7.34 |
| Cebuano | ceb | 13972 (208) | -- | 32 | -- | 28 | -- | 7.03 | -- | 5.6 |
| Chewa | ny | 12952 (280) | -- | 63 | -- | 92 | -- | 7.47 | -- | 4.81 |
| Crimean Tatar | crh | 36075 (0) | 32 | -- | -- | -- | 7.52 | -- | -- | -- |
| Quechua | qu | 56355 (0) | 30 | -- | -- | -- | 7.21 | -- | -- | -- |
| Czech | cs | 68931 (12623) | 35 | 42 | 84 | 46 | 7.91 | 7.52 | 7.37 | 6.63 |
| Danish | da | 52074 (2554) | -- | -- | 158 | 108 | -- | -- | 7.05 | 5.52 |
| Dutch | nl | 48166 (9255) | -- | 48 | 75 | 50 | -- | 7.64 | 7.22 | 6.95 |
| Egyptian Arabic | arz | 20233 (56) | -- | -- | 80 | 50 | -- | -- | 6.54 | 4.27 |
| English | en | 38288 (15808) | -- | 40 | 52 | 62 | -- | 6.5 | 6.46 | 6.35 |
| Estonian | et | 75719 (992) | -- | -- | 69 | 81 | -- | -- | 8.12 | 6.14 |
| Faroese | fo | 50184 (986) | -- | -- | -- | 96 | -- | -- | -- | 5.53 |
| Finnish | fi | 77653 (12368) | -- | -- | 44 | 77 | -- | -- | 8.94 | 8.18 |
| French | fr | 44265 (13350) | -- | 42 | 63 | 41 | -- | 6.11 | 5.83 | 5.91 |
| Galician | gl | 46453 (1608) | -- | -- | 88 | 58 | -- | -- | 7.45 | 5.73 |
| Georgian | ka | 67918 (7787) | 34 | -- | 121 | 33 | 8.3 | -- | 8.46 | 7.79 |
| German | de | 53276 (3243) | -- | 48 | 110 | 66 | -- | 7.99 | 8.07 | 6.76 |
| Greek | el | 55163 (3070) | 24 | -- | 95 | 34 | 7.2 | -- | 7.35 | 7.03 |
| Gujarati | gu | 49127 (1137) | -- | -- | 317 | 78 | -- | -- | 9.53 | 4.89 |
| Hindi | hi | 34654 (8524) | -- | 74 | 232 | 60 | -- | 6.33 | 5.55 | 5.61 |
| Hungarian | hu | 69381 (19553) | 36 | 65 | 104 | 71 | 8.66 | 7.73 | 7.72 | 7.33 |
| Icelandic | is | 57366 (4211) | -- | -- | 115 | 61 | -- | -- | 8.31 | 6.44 |
| Indonesian | id | 44886 (2729) | -- | 30 | 49 | 61 | -- | 6.92 | 6.84 | 6.3 |
| Irish | ga | 46966 (4111) | -- | -- | 128 | 130 | -- | -- | 5.96 | 4.56 |
| Italian | it | 50399 (16147) | -- | 51 | 59 | 31 | -- | 7.33 | 7.57 | 8.03 |
| Japanese | ja | 23120 (1459) | -- | -- | 129 | 60 | -- | -- | 6.6 | 4.83 |
| Kabardian | kbd | 16357 (410) | 52 | 125 | -- | 64 | 6.99 | 7.18 | -- | 4.4 |
| Kalaallisut | kl | 2813 (184) | 19 | -- | -- | 31 | 9.83 | -- | -- | 7.96 |
| Kannada | kn | 75434 (660) | 40 | -- | -- | 65 | 9.98 | -- | -- | 5.21 |
| Kashubian | csb | 25708 (445) | -- | 64 | -- | 60 | -- | 7.39 | -- | 5.01 |
| Kazakh | kk | 58031 (905) | -- | 44 | 63 | 96 | -- | 8.93 | 8.33 | 7.1 |
| Mongolian | mn | 53895 (1658) | -- | 67 | -- | 116 | -- | 6.97 | -- | 5.24 |
| Kyrgyz | ky | 56482 (368) | 29 | 39 | -- | 53 | 8.28 | 8.71 | -- | 5.44 |
| Korean | ko | 64171 (6050) | 30 | -- | 152 | 60 | 7.73 | -- | 7.62 | 5.88 |
| Lithuanian | lt | 71509 (1926) | -- | -- | 177 | 131 | -- | -- | 8.39 | 7.07 |
| Lower Sorbian | dsb | 29728 (815) | -- | -- | -- | 53 | -- | -- | -- | 4.7 |
| Macedonian | mk | 52134 (14420) | 33 | -- | 72 | 47 | 7.76 | -- | 7.81 | 7.43 |
| Malagasy | mg | 22614 (87) | 34 | -- | -- | 61 | 7.29 | -- | -- | 5.59 |
| Maltese | mt | 58977 (4181) | -- | 62 | 74 | 41 | -- | 8.3 | 8.6 | 6.33 |
| Maori | mi | 10623 (0) | -- | 53 | -- | -- | -- | 6.81 | -- | -- |
| Kurdish | ku | 53554 (1198) | -- | -- | 197 | 54 | -- | -- | 6.73 | 5.43 |
| Northern Sami | se | 25779 (1340) | -- | -- | 132 | 69 | -- | -- | 8.38 | 6.54 |
| Norwegian | no | 52199 (718) | -- | -- | 77 | 65 | -- | -- | 7.22 | 4.95 |
| Oromo | om | 30376 (0) | -- | 104 | -- | -- | -- | 6.55 | -- | -- |
| Pashto | ps | 32625 (486) | -- | -- | -- | 59 | -- | -- | -- | 4.94 |
| Persian | fa | 26577 (3476) | -- | 40 | 65 | 134 | -- | 6.37 | 6.97 | 6.31 |
| Polish | pl | 69225 (19295) | -- | 45 | 97 | 46 | -- | 7.82 | 7.51 | 7.09 |
| Portuguese | pt | 45408 (14125) | -- | 44 | 57 | 44 | -- | 7.37 | 6.79 | 7.27 |
| Romanian | ro | 55662 (2888) | 30 | 34 | 136 | 59 | 7.46 | 7.59 | 7.43 | 6.45 |
| Russian | ru | 71188 (37102) | -- | 54 | 164 | 99 | -- | 8.1 | 8.14 | 8.15 |
| Scottish Gaelic | gd | 49148 (1888) | -- | -- | 194 | 129 | -- | -- | 6.8 | 4.66 |
| Serbo-Croatian | sh | 58923 (6739) | -- | 55 | 92 | 64 | -- | 7.42 | 6.4 | 7.06 |
| Shona | sn | 62943 (0) | -- | 47 | -- | -- | -- | 7.46 | -- | -- |
| Slovene | sl | 68115 (2373) | -- | -- | 113 | 48 | -- | -- | 6.59 | 6.17 |
| Spanish | es | 42948 (14019) | -- | 31 | 59 | 27 | -- | 7.46 | 7.46 | 7.56 |
| Swahili | sw | 39684 (74) | -- | 40 | 60 | 51 | -- | 7.21 | 7.01 | 4.82 |
| Swedish | sv | 50238 (1902) | -- | 41 | 153 | 99 | -- | 7.67 | 7.4 | 5.18 |
| Tagalog | tl | 43825 (5578) | -- | -- | 67 | 27 | -- | -- | 7.4 | 7.07 |
| Tajik | tg | 43740 (448) | 31 | 55 | -- | 41 | 7.22 | 7.21 | -- | 5.33 |
| Tatar | tt | 25463 (0) | 34 | -- | 77 | -- | 7.47 | -- | 7.46 | -- |
| Telugu | te | 48449 (1528) | 38 | 58 | -- | 84 | 9.1 | 8.12 | -- | 5.59 |
| Turkish | tr | 68339 (3507) | 30 | 37 | 82 | 111 | 7.75 | 7.89 | 7.77 | 6.12 |
| Turkmen | tk | 70974 (100) | -- | 65 | 119 | 65 | -- | 10.48 | 8.43 | 4.09 |
| Tuvan | tyv | 60093 (437) | 29 | -- | -- | 89 | 8.19 | -- | -- | 4.75 |
| Ukrainian | uk | 69748 (15715) | 39 | 55 | 177 | 85 | 8 | 7.94 | 7.91 | 7.87 |
| Upper Sorbian | hsb | 68899 (226) | 38 | -- | -- | 50 | 7.76 | -- | -- | 5.02 |
| Urdu | ur | 22350 (1886) | -- | 59 | -- | 103 | -- | 5.49 | -- | 5.33 |
| Uyghur | ug | 73866 (973) | 33 | 161 | -- | 50 | 8.04 | 8.06 | -- | 5.45 |
| Uzbek | uz | 59005 (0) | 14 | 49 | -- | -- | 8.25 | 7.88 | -- | -- |
| Welsh | cy | 34619 (4575) | -- | -- | 71 | 45 | -- | -- | 6.35 | 5.65 |
| West Frisian | fy | 45334 (674) | -- | -- | -- | 52 | -- | -- | -- | 4.94 |
| Zulu | zu | 50325 (789) | -- | 56 | -- | 102 | -- | 8.22 | -- | 6.33 |
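
The rows of the table above can be read programmatically. The following is a minimal sketch, not part of LexiPhon, that parses one Markdown row into a dictionary. The function name `parse_row` and the output keys are hypothetical. It assumes that `--` marks a value that is unavailable for a method, and it takes the first number in the third column as N and the parenthesized number as the WikiPron size, following the column header.

```python
import re

# Method abbreviations in the order used by the table header.
METHODS = ("XPF", "EPI", "CS", "WP")


def _number(cell, cast):
    """Convert a cell to a number, mapping the '--' placeholder to None."""
    return None if cell == "--" else cast(cell)


def parse_row(line):
    """Parse one Markdown table row into a dictionary (hypothetical helper)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != 11:
        raise ValueError(f"expected 11 cells, got {len(cells)}")
    language, code, n_cell = cells[:3]
    # The third column holds two integers, e.g. "51599 (758)".
    match = re.fullmatch(r"(\d+) \((\d+)\)", n_cell)
    if match is None:
        raise ValueError(f"unexpected N cell: {n_cell!r}")
    return {
        "language": language,
        "code": code,
        "n": int(match.group(1)),
        "wikipron_size": int(match.group(2)),
        "inventory_size": {
            m: _number(c, int) for m, c in zip(METHODS, cells[3:7])
        },
        "mean_length": {
            m: _number(c, float) for m, c in zip(METHODS, cells[7:11])
        },
    }


row = parse_row(
    "| Albanian | sq | 51599 (758) | 41 | 41 | 71 | 66 | 7.06 | 7.17 | 7.25 | 5.25 |"
)
print(row["code"], row["n"], row["inventory_size"]["XPF"])
```

Mapping `--` to `None` rather than zero keeps missing values distinct from real measurements when the rows are aggregated later.
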