amnda-d/LexiPhon

LexiPhon

LexiPhon is a collection of phonetically transcribed lexicons derived from Wikipedia. Each lexicon includes word frequencies and transcriptions generated with up to four transcription methods:

  • XPF
  • Epitran
  • CharsiuG2P
  • WikiPron

Generating lexicons for existing languages

To generate lexicons, you will need to install Python and Apptainer.

Download and parse the ParaNames corpus with make download_paranames.

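To use a pre-built Apptainer image, run ./scripts/get-wiki-g2p [WIKI_CODE] [WIKI_DUMP_DATE], where WIKI_CODE is the Wikipedia code for the language (for example, sq for Albanian) and WIKI_DUMP_DATE is the date of the Wikipedia dump to use.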
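
To build lexicons for several languages in one go, the script call can be wrapped in a small Python helper. This is a minimal sketch and not part of the repository: the names build_command and run_all are hypothetical, and the dump date is passed through unchanged because its expected format is not stated here.

```python
import subprocess
from typing import Iterable, List

SCRIPT = "./scripts/get-wiki-g2p"  # script path as given in this README


def build_command(wiki_code: str, dump_date: str) -> List[str]:
    """Return the argument list for one get-wiki-g2p run (hypothetical helper)."""
    if not wiki_code or not dump_date:
        raise ValueError("both a Wikipedia code and a dump date are required")
    return [SCRIPT, wiki_code, dump_date]


def run_all(wiki_codes: Iterable[str], dump_date: str) -> None:
    """Run the script once per language, stopping at the first failure."""
    for code in wiki_codes:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(build_command(code, dump_date), check=True)
```

Called from the repository root, run_all(["sq", "cs"], dump_date) would invoke the script for Albanian and then Czech, with dump_date being the same string you would type on the command line.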

Generating a lexicon for a new language

  1. Add a row to languages.tsv with the new language's information.

  2. Run make to build Apptainer images.

  3. Run ./scripts/get-wiki-g2p [WIKI_CODE] [WIKI_DUMP_DATE] with the new language's Wikipedia code.
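
The first step can also be scripted. The sketch below appends a row to a tab-separated file while checking it against the file's own header, so it does not need to know the column names in advance. It assumes that languages.tsv starts with a header row, which is not confirmed here, and append_language_row is a hypothetical helper rather than something the repository provides.

```python
import csv
from typing import Dict


def append_language_row(tsv_path: str, row: Dict[str, str]) -> None:
    """Append one row to a TSV file, validating keys against its header row."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        text = f.read()
    # Assumption: the first line of the file is a header naming every column.
    header = next(csv.reader(text.splitlines(), delimiter="\t"))

    missing = [col for col in header if col not in row]
    unknown = [key for key in row if key not in header]
    if missing or unknown:
        raise ValueError(f"missing columns: {missing}; unknown columns: {unknown}")

    with open(tsv_path, "a", newline="", encoding="utf-8") as f:
        # Make sure the new row starts on its own line.
        if not text.endswith("\n"):
            f.write("\n")
        writer = csv.DictWriter(
            f, fieldnames=header, delimiter="\t", lineterminator="\n"
        )
        writer.writerow(row)
```

Validating against the existing header means a typo in a column name fails loudly instead of silently producing a malformed row.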

Languages

In the column headers, EPI stands for Epitran, CS for CharsiuG2P, and WP for WikiPron; -- marks a value that is not available.

| Language | Code | N (WikiPron size) | XPF Inv. Size | EPI Inv. Size | CS Inv. Size | WP Inv. Size | XPF Mean Len. | EPI Mean Len. | CS Mean Len. | WP Mean Len. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Adyghe | ady | 5729 (369) | -- | -- | 151 | 62 | -- | -- | 7.16 | 4.18 |
| Afrikaans | af | 42290 (1332) | -- | -- | 65 | 53 | -- | -- | 7.66 | 5.57 |
| Albanian | sq | 51599 (758) | 41 | 41 | 71 | 66 | 7.06 | 7.17 | 7.25 | 5.25 |
| Amharic | am | 80968 (292) | -- | 63 | 91 | 61 | -- | 8.27 | 9.34 | 5.01 |
| Arabic | ar | 47027 (3544) | -- | 54 | 76 | 59 | -- | 6.15 | 7.19 | 6.37 |
| Armenian | hy | 56436 (6101) | 37 | -- | 142 | 44 | 7.98 | -- | 8.08 | 7.16 |
| Assamese | as | 52966 (1622) | -- | -- | -- | 66 | -- | -- | -- | 4.65 |
| Asturian | ast | 49037 (573) | 25 | -- | -- | 30 | 7.42 | -- | -- | 6.56 |
| Aymara | ay | 23359 (0) | 31 | -- | -- | -- | 6.93 | -- | -- | -- |
| Azerbaijani | az | 62068 (1869) | 35 | 62 | 60 | 57 | 8.19 | 8.21 | 8.16 | 5.92 |
| Bashkir | ba | 54910 (1627) | 38 | -- | 77 | 113 | 8.43 | -- | 8.14 | 5.34 |
| Basque | eu | 55013 (2051) | 36 | -- | 73 | 34 | 7.8 | -- | 7.95 | 6.09 |
| Belarusian | be | 66717 (2976) | 43 | -- | 115 | 63 | 8.22 | -- | 8.31 | 6.25 |
| Bengali | bn | 46330 (1589) | -- | 69 | 433 | 76 | -- | 7.36 | 6.7 | 5.3 |
| Bulgarian | bg | 58291 (6176) | 29 | -- | 186 | 53 | 7.89 | -- | 7.77 | 6.85 |
| Catalan | ca | 44017 (13881) | -- | 48 | 57 | 40 | -- | 7.24 | 7.03 | 7.34 |
| Cebuano | ceb | 13972 (208) | -- | 32 | -- | 28 | -- | 7.03 | -- | 5.6 |
| Chewa | ny | 12952 (280) | -- | 63 | -- | 92 | -- | 7.47 | -- | 4.81 |
| Crimean Tatar | crh | 36075 (0) | 32 | -- | -- | -- | 7.52 | -- | -- | -- |
| Quechua | qu | 56355 (0) | 30 | -- | -- | -- | 7.21 | -- | -- | -- |
| Czech | cs | 68931 (12623) | 35 | 42 | 84 | 46 | 7.91 | 7.52 | 7.37 | 6.63 |
| Danish | da | 52074 (2554) | -- | -- | 158 | 108 | -- | -- | 7.05 | 5.52 |
| Dutch | nl | 48166 (9255) | -- | 48 | 75 | 50 | -- | 7.64 | 7.22 | 6.95 |
| Egyptian Arabic | arz | 20233 (56) | -- | -- | 80 | 50 | -- | -- | 6.54 | 4.27 |
| English | en | 38288 (15808) | -- | 40 | 52 | 62 | -- | 6.5 | 6.46 | 6.35 |
| Estonian | et | 75719 (992) | -- | -- | 69 | 81 | -- | -- | 8.12 | 6.14 |
| Faroese | fo | 50184 (986) | -- | -- | -- | 96 | -- | -- | -- | 5.53 |
| Finnish | fi | 77653 (12368) | -- | -- | 44 | 77 | -- | -- | 8.94 | 8.18 |
| French | fr | 44265 (13350) | -- | 42 | 63 | 41 | -- | 6.11 | 5.83 | 5.91 |
| Galician | gl | 46453 (1608) | -- | -- | 88 | 58 | -- | -- | 7.45 | 5.73 |
| Georgian | ka | 67918 (7787) | 34 | -- | 121 | 33 | 8.3 | -- | 8.46 | 7.79 |
| German | de | 53276 (3243) | -- | 48 | 110 | 66 | -- | 7.99 | 8.07 | 6.76 |
| Greek | el | 55163 (3070) | 24 | -- | 95 | 34 | 7.2 | -- | 7.35 | 7.03 |
| Gujarati | gu | 49127 (1137) | -- | -- | 317 | 78 | -- | -- | 9.53 | 4.89 |
| Hindi | hi | 34654 (8524) | -- | 74 | 232 | 60 | -- | 6.33 | 5.55 | 5.61 |
| Hungarian | hu | 69381 (19553) | 36 | 65 | 104 | 71 | 8.66 | 7.73 | 7.72 | 7.33 |
| Icelandic | is | 57366 (4211) | -- | -- | 115 | 61 | -- | -- | 8.31 | 6.44 |
| Indonesian | id | 44886 (2729) | -- | 30 | 49 | 61 | -- | 6.92 | 6.84 | 6.3 |
| Irish | ga | 46966 (4111) | -- | -- | 128 | 130 | -- | -- | 5.96 | 4.56 |
| Italian | it | 50399 (16147) | -- | 51 | 59 | 31 | -- | 7.33 | 7.57 | 8.03 |
| Japanese | ja | 23120 (1459) | -- | -- | 129 | 60 | -- | -- | 6.6 | 4.83 |
| Kabardian | kbd | 16357 (410) | 52 | 125 | -- | 64 | 6.99 | 7.18 | -- | 4.4 |
| Kalaallisut | kl | 2813 (184) | 19 | -- | -- | 31 | 9.83 | -- | -- | 7.96 |
| Kannada | kn | 75434 (660) | 40 | -- | -- | 65 | 9.98 | -- | -- | 5.21 |
| Kashubian | csb | 25708 (445) | -- | 64 | -- | 60 | -- | 7.39 | -- | 5.01 |
| Kazakh | kk | 58031 (905) | -- | 44 | 63 | 96 | -- | 8.93 | 8.33 | 7.1 |
| Mongolian | mn | 53895 (1658) | -- | 67 | -- | 116 | -- | 6.97 | -- | 5.24 |
| Kyrgyz | ky | 56482 (368) | 29 | 39 | -- | 53 | 8.28 | 8.71 | -- | 5.44 |
| Korean | ko | 64171 (6050) | 30 | -- | 152 | 60 | 7.73 | -- | 7.62 | 5.88 |
| Lithuanian | lt | 71509 (1926) | -- | -- | 177 | 131 | -- | -- | 8.39 | 7.07 |
| Lower Sorbian | dsb | 29728 (815) | -- | -- | -- | 53 | -- | -- | -- | 4.7 |
| Macedonian | mk | 52134 (14420) | 33 | -- | 72 | 47 | 7.76 | -- | 7.81 | 7.43 |
| Malagasy | mg | 22614 (87) | 34 | -- | -- | 61 | 7.29 | -- | -- | 5.59 |
| Maltese | mt | 58977 (4181) | -- | 62 | 74 | 41 | -- | 8.3 | 8.6 | 6.33 |
| Maori | mi | 10623 (0) | -- | 53 | -- | -- | -- | 6.81 | -- | -- |
| Kurdish | ku | 53554 (1198) | -- | -- | 197 | 54 | -- | -- | 6.73 | 5.43 |
| Northern Sami | se | 25779 (1340) | -- | -- | 132 | 69 | -- | -- | 8.38 | 6.54 |
| Norwegian | no | 52199 (718) | -- | -- | 77 | 65 | -- | -- | 7.22 | 4.95 |
| Oromo | om | 30376 (0) | -- | 104 | -- | -- | -- | 6.55 | -- | -- |
| Pashto | ps | 32625 (486) | -- | -- | -- | 59 | -- | -- | -- | 4.94 |
| Persian | fa | 26577 (3476) | -- | 40 | 65 | 134 | -- | 6.37 | 6.97 | 6.31 |
| Polish | pl | 69225 (19295) | -- | 45 | 97 | 46 | -- | 7.82 | 7.51 | 7.09 |
| Portuguese | pt | 45408 (14125) | -- | 44 | 57 | 44 | -- | 7.37 | 6.79 | 7.27 |
| Romanian | ro | 55662 (2888) | 30 | 34 | 136 | 59 | 7.46 | 7.59 | 7.43 | 6.45 |
| Russian | ru | 71188 (37102) | -- | 54 | 164 | 99 | -- | 8.1 | 8.14 | 8.15 |
| Scottish Gaelic | gd | 49148 (1888) | -- | -- | 194 | 129 | -- | -- | 6.8 | 4.66 |
| Serbo-Croatian | sh | 58923 (6739) | -- | 55 | 92 | 64 | -- | 7.42 | 6.4 | 7.06 |
| Shona | sn | 62943 (0) | -- | 47 | -- | -- | -- | 7.46 | -- | -- |
| Slovene | sl | 68115 (2373) | -- | -- | 113 | 48 | -- | -- | 6.59 | 6.17 |
| Spanish | es | 42948 (14019) | -- | 31 | 59 | 27 | -- | 7.46 | 7.46 | 7.56 |
| Swahili | sw | 39684 (74) | -- | 40 | 60 | 51 | -- | 7.21 | 7.01 | 4.82 |
| Swedish | sv | 50238 (1902) | -- | 41 | 153 | 99 | -- | 7.67 | 7.4 | 5.18 |
| Tagalog | tl | 43825 (5578) | -- | -- | 67 | 27 | -- | -- | 7.4 | 7.07 |
| Tajik | tg | 43740 (448) | 31 | 55 | -- | 41 | 7.22 | 7.21 | -- | 5.33 |
| Tatar | tt | 25463 (0) | 34 | -- | 77 | -- | 7.47 | -- | 7.46 | -- |
| Telugu | te | 48449 (1528) | 38 | 58 | -- | 84 | 9.1 | 8.12 | -- | 5.59 |
| Turkish | tr | 68339 (3507) | 30 | 37 | 82 | 111 | 7.75 | 7.89 | 7.77 | 6.12 |
| Turkmen | tk | 70974 (100) | -- | 65 | 119 | 65 | -- | 10.48 | 8.43 | 4.09 |
| Tuvan | tyv | 60093 (437) | 29 | -- | -- | 89 | 8.19 | -- | -- | 4.75 |
| Ukrainian | uk | 69748 (15715) | 39 | 55 | 177 | 85 | 8 | 7.94 | 7.91 | 7.87 |
| Upper Sorbian | hsb | 68899 (226) | 38 | -- | -- | 50 | 7.76 | -- | -- | 5.02 |
| Urdu | ur | 22350 (1886) | -- | 59 | -- | 103 | -- | 5.49 | -- | 5.33 |
| Uyghur | ug | 73866 (973) | 33 | 161 | -- | 50 | 8.04 | 8.06 | -- | 5.45 |
| Uzbek | uz | 59005 (0) | 14 | 49 | -- | -- | 8.25 | 7.88 | -- | -- |
| Welsh | cy | 34619 (4575) | -- | -- | 71 | 45 | -- | -- | 6.35 | 5.65 |
| West Frisian | fy | 45334 (674) | -- | -- | -- | 52 | -- | -- | -- | 4.94 |
| Zulu | zu | 50325 (789) | -- | 56 | -- | 102 | -- | 8.22 | -- | 6.33 |
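
The N (WikiPron size) column packs two counts into one cell, for example 51599 (758) for Albanian. A minimal sketch of splitting such a cell and computing the share of the lexicon that has a WikiPron entry might look as follows. It assumes that N is the number of lexicon entries and that the parenthesized number counts the entries with a WikiPron transcription, an interpretation not spelled out here; parse_n_cell and wikipron_share are hypothetical names.

```python
import re
from typing import Tuple

# Matches cells such as "51599 (758)": a count followed by a count in parentheses.
_CELL = re.compile(r"^\s*(\d+)\s*\((\d+)\)\s*$")


def parse_n_cell(cell: str) -> Tuple[int, int]:
    """Split an 'N (WikiPron size)' cell into (n, wikipron_size)."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def wikipron_share(cell: str) -> float:
    """Return wikipron_size / n, or 0.0 when n is zero."""
    n, wikipron_size = parse_n_cell(cell)
    return wikipron_size / n if n else 0.0
```

For the Albanian row, wikipron_share("51599 (758)") divides 758 by 51599, and for languages listed with (0), such as Aymara, the share is zero.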
