orthography2ipa 0.2.1a1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (468)
  1. orthography2ipa-0.2.1a1/.github/dependabot.yml +11 -0
  2. orthography2ipa-0.2.1a1/.github/workflows/build-tests.yml +14 -0
  3. orthography2ipa-0.2.1a1/.github/workflows/conventional-label.yaml +10 -0
  4. orthography2ipa-0.2.1a1/.github/workflows/coverage.yml +16 -0
  5. orthography2ipa-0.2.1a1/.github/workflows/license_check.yml +10 -0
  6. orthography2ipa-0.2.1a1/.github/workflows/publish_stable.yml +23 -0
  7. orthography2ipa-0.2.1a1/.github/workflows/release_workflow.yml +28 -0
  8. orthography2ipa-0.2.1a1/.gitignore +36 -0
  9. orthography2ipa-0.2.1a1/AGENTS.md +64 -0
  10. orthography2ipa-0.2.1a1/CHANGELOG.md +21 -0
  11. orthography2ipa-0.2.1a1/PKG-INFO +227 -0
  12. orthography2ipa-0.2.1a1/README.md +193 -0
  13. orthography2ipa-0.2.1a1/docs/README.md +72 -0
  14. orthography2ipa-0.2.1a1/docs/adding_a_language.md +238 -0
  15. orthography2ipa-0.2.1a1/docs/ancestry.md +215 -0
  16. orthography2ipa-0.2.1a1/docs/architecture.md +234 -0
  17. orthography2ipa-0.2.1a1/docs/bibliography.md +70 -0
  18. orthography2ipa-0.2.1a1/docs/data_model.md +268 -0
  19. orthography2ipa-0.2.1a1/docs/distance.md +379 -0
  20. orthography2ipa-0.2.1a1/docs/getting_started.md +183 -0
  21. orthography2ipa-0.2.1a1/docs/index.md +211 -0
  22. orthography2ipa-0.2.1a1/docs/ipa_reference.md +202 -0
  23. orthography2ipa-0.2.1a1/docs/languages/de-DE.md +111 -0
  24. orthography2ipa-0.2.1a1/docs/languages/en-GB.md +138 -0
  25. orthography2ipa-0.2.1a1/docs/languages/fr-FR.md +124 -0
  26. orthography2ipa-0.2.1a1/docs/languages/germanic.md +119 -0
  27. orthography2ipa-0.2.1a1/docs/languages/hi.md +128 -0
  28. orthography2ipa-0.2.1a1/docs/languages/index.md +54 -0
  29. orthography2ipa-0.2.1a1/docs/languages/it-IT.md +108 -0
  30. orthography2ipa-0.2.1a1/docs/languages/romance.md +137 -0
  31. orthography2ipa-0.2.1a1/docs/languages/ru.md +114 -0
  32. orthography2ipa-0.2.1a1/docs/languages/slavic.md +114 -0
  33. orthography2ipa-0.2.1a1/docs/linguistic_accuracy.md +211 -0
  34. orthography2ipa-0.2.1a1/docs/link-audit.md +105 -0
  35. orthography2ipa-0.2.1a1/docs/positional_graphemes.md +249 -0
  36. orthography2ipa-0.2.1a1/docs/registry.md +288 -0
  37. orthography2ipa-0.2.1a1/docs/tokenizer.md +227 -0
  38. orthography2ipa-0.2.1a1/examples/01_basic_usage.py +53 -0
  39. orthography2ipa-0.2.1a1/examples/02_distance_metrics.py +217 -0
  40. orthography2ipa-0.2.1a1/examples/03_tokenizer.py +124 -0
  41. orthography2ipa-0.2.1a1/examples/04_dialect_transforms.py +121 -0
  42. orthography2ipa-0.2.1a1/examples/05_script_distance.py +94 -0
  43. orthography2ipa-0.2.1a1/examples/06_sandhi.py +141 -0
  44. orthography2ipa-0.2.1a1/orthography2ipa/__init__.py +114 -0
  45. orthography2ipa-0.2.1a1/orthography2ipa/cli.py +291 -0
  46. orthography2ipa-0.2.1a1/orthography2ipa/data/SCHEMA.md +254 -0
  47. orthography2ipa-0.2.1a1/orthography2ipa/data/acy.json +316 -0
  48. orthography2ipa-0.2.1a1/orthography2ipa/data/af.json +351 -0
  49. orthography2ipa-0.2.1a1/orthography2ipa/data/an-x-occidental.json +78 -0
  50. orthography2ipa-0.2.1a1/orthography2ipa/data/an-x-oriental.json +99 -0
  51. orthography2ipa-0.2.1a1/orthography2ipa/data/an.json +517 -0
  52. orthography2ipa-0.2.1a1/orthography2ipa/data/ang.json +370 -0
  53. orthography2ipa-0.2.1a1/orthography2ipa/data/aoa.json +315 -0
  54. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-AE.json +74 -0
  55. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-BH.json +68 -0
  56. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-DZ.json +134 -0
  57. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-IQ-x-qeltu.json +78 -0
  58. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-IQ.json +129 -0
  59. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-KW.json +67 -0
  60. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-LY.json +120 -0
  61. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-MA.json +126 -0
  62. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-MR.json +133 -0
  63. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-NG.json +101 -0
  64. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-OM.json +99 -0
  65. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-QA.json +62 -0
  66. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-SA-x-hejaz.json +118 -0
  67. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-SA-x-najd.json +108 -0
  68. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-TD.json +116 -0
  69. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-TN.json +137 -0
  70. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-YE.json +117 -0
  71. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-x-gulf.json +117 -0
  72. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-x-maghrebi.json +121 -0
  73. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-x-mashriqi.json +119 -0
  74. orthography2ipa-0.2.1a1/orthography2ipa/data/ar-x-peninsular.json +122 -0
  75. orthography2ipa-0.2.1a1/orthography2ipa/data/ar.json +55 -0
  76. orthography2ipa-0.2.1a1/orthography2ipa/data/arb.json +357 -0
  77. orthography2ipa-0.2.1a1/orthography2ipa/data/as.json +384 -0
  78. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-ES-x-leon.json +54 -0
  79. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-PT-x-guadramil.json +78 -0
  80. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-PT-x-medieval.json +208 -0
  81. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-PT-x-rionor.json +318 -0
  82. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-x-cantabrian.json +171 -0
  83. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-x-leon.json +177 -0
  84. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-x-occidental.json +89 -0
  85. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-x-oriental.json +56 -0
  86. orthography2ipa-0.2.1a1/orthography2ipa/data/ast-x-sanabria.json +90 -0
  87. orthography2ipa-0.2.1a1/orthography2ipa/data/ast.json +422 -0
  88. orthography2ipa-0.2.1a1/orthography2ipa/data/be.json +323 -0
  89. orthography2ipa-0.2.1a1/orthography2ipa/data/ber.json +348 -0
  90. orthography2ipa-0.2.1a1/orthography2ipa/data/bg.json +278 -0
  91. orthography2ipa-0.2.1a1/orthography2ipa/data/bho.json +368 -0
  92. orthography2ipa-0.2.1a1/orthography2ipa/data/bn.json +375 -0
  93. orthography2ipa-0.2.1a1/orthography2ipa/data/br.json +287 -0
  94. orthography2ipa-0.2.1a1/orthography2ipa/data/brx-x-proto-boro-garo.json +217 -0
  95. orthography2ipa-0.2.1a1/orthography2ipa/data/brx.json +288 -0
  96. orthography2ipa-0.2.1a1/orthography2ipa/data/ca-x-balear.json +93 -0
  97. orthography2ipa-0.2.1a1/orthography2ipa/data/ca-x-nord.json +110 -0
  98. orthography2ipa-0.2.1a1/orthography2ipa/data/ca-x-occidental.json +79 -0
  99. orthography2ipa-0.2.1a1/orthography2ipa/data/ca-x-valencia.json +110 -0
  100. orthography2ipa-0.2.1a1/orthography2ipa/data/ca.json +442 -0
  101. orthography2ipa-0.2.1a1/orthography2ipa/data/cel-x-gallaecia.json +152 -0
  102. orthography2ipa-0.2.1a1/orthography2ipa/data/cel-x-goidelic.json +215 -0
  103. orthography2ipa-0.2.1a1/orthography2ipa/data/cel.json +218 -0
  104. orthography2ipa-0.2.1a1/orthography2ipa/data/co.json +338 -0
  105. orthography2ipa-0.2.1a1/orthography2ipa/data/cop.json +258 -0
  106. orthography2ipa-0.2.1a1/orthography2ipa/data/cs.json +412 -0
  107. orthography2ipa-0.2.1a1/orthography2ipa/data/csb.json +75 -0
  108. orthography2ipa-0.2.1a1/orthography2ipa/data/cu.json +299 -0
  109. orthography2ipa-0.2.1a1/orthography2ipa/data/cy.json +320 -0
  110. orthography2ipa-0.2.1a1/orthography2ipa/data/da-x-copenhagen.json +64 -0
  111. orthography2ipa-0.2.1a1/orthography2ipa/data/da.json +349 -0
  112. orthography2ipa-0.2.1a1/orthography2ipa/data/de-AT.json +56 -0
  113. orthography2ipa-0.2.1a1/orthography2ipa/data/de-CH.json +66 -0
  114. orthography2ipa-0.2.1a1/orthography2ipa/data/de-DE.json +476 -0
  115. orthography2ipa-0.2.1a1/orthography2ipa/data/de-x-alemannic.json +55 -0
  116. orthography2ipa-0.2.1a1/orthography2ipa/data/de-x-bavarian.json +68 -0
  117. orthography2ipa-0.2.1a1/orthography2ipa/data/dsb.json +76 -0
  118. orthography2ipa-0.2.1a1/orthography2ipa/data/egl.json +185 -0
  119. orthography2ipa-0.2.1a1/orthography2ipa/data/el-CY.json +77 -0
  120. orthography2ipa-0.2.1a1/orthography2ipa/data/el.json +379 -0
  121. orthography2ipa-0.2.1a1/orthography2ipa/data/en-AU.json +70 -0
  122. orthography2ipa-0.2.1a1/orthography2ipa/data/en-CA.json +75 -0
  123. orthography2ipa-0.2.1a1/orthography2ipa/data/en-GB-x-scotland.json +86 -0
  124. orthography2ipa-0.2.1a1/orthography2ipa/data/en-GB.json +566 -0
  125. orthography2ipa-0.2.1a1/orthography2ipa/data/en-IE.json +77 -0
  126. orthography2ipa-0.2.1a1/orthography2ipa/data/en-US.json +107 -0
  127. orthography2ipa-0.2.1a1/orthography2ipa/data/en-ZA.json +69 -0
  128. orthography2ipa-0.2.1a1/orthography2ipa/data/enm.json +334 -0
  129. orthography2ipa-0.2.1a1/orthography2ipa/data/es-419.json +74 -0
  130. orthography2ipa-0.2.1a1/orthography2ipa/data/es-AR.json +85 -0
  131. orthography2ipa-0.2.1a1/orthography2ipa/data/es-BO.json +95 -0
  132. orthography2ipa-0.2.1a1/orthography2ipa/data/es-CL.json +121 -0
  133. orthography2ipa-0.2.1a1/orthography2ipa/data/es-CO-x-costa.json +142 -0
  134. orthography2ipa-0.2.1a1/orthography2ipa/data/es-CO-x-paisa.json +76 -0
  135. orthography2ipa-0.2.1a1/orthography2ipa/data/es-CO.json +76 -0
  136. orthography2ipa-0.2.1a1/orthography2ipa/data/es-CR.json +85 -0
  137. orthography2ipa-0.2.1a1/orthography2ipa/data/es-CU.json +141 -0
  138. orthography2ipa-0.2.1a1/orthography2ipa/data/es-DO.json +138 -0
  139. orthography2ipa-0.2.1a1/orthography2ipa/data/es-EC.json +95 -0
  140. orthography2ipa-0.2.1a1/orthography2ipa/data/es-ES-x-andalusia-e.json +136 -0
  141. orthography2ipa-0.2.1a1/orthography2ipa/data/es-ES-x-andalusia-w.json +177 -0
  142. orthography2ipa-0.2.1a1/orthography2ipa/data/es-ES-x-canarias.json +102 -0
  143. orthography2ipa-0.2.1a1/orthography2ipa/data/es-ES-x-cantabria.json +79 -0
  144. orthography2ipa-0.2.1a1/orthography2ipa/data/es-ES-x-extremadura.json +99 -0
  145. orthography2ipa-0.2.1a1/orthography2ipa/data/es-ES-x-medieval.json +400 -0
  146. orthography2ipa-0.2.1a1/orthography2ipa/data/es-ES-x-murcia.json +125 -0
  147. orthography2ipa-0.2.1a1/orthography2ipa/data/es-ES.json +54 -0
  148. orthography2ipa-0.2.1a1/orthography2ipa/data/es-GQ.json +87 -0
  149. orthography2ipa-0.2.1a1/orthography2ipa/data/es-GT.json +76 -0
  150. orthography2ipa-0.2.1a1/orthography2ipa/data/es-MX-x-costa.json +137 -0
  151. orthography2ipa-0.2.1a1/orthography2ipa/data/es-MX.json +90 -0
  152. orthography2ipa-0.2.1a1/orthography2ipa/data/es-NI.json +126 -0
  153. orthography2ipa-0.2.1a1/orthography2ipa/data/es-PA.json +137 -0
  154. orthography2ipa-0.2.1a1/orthography2ipa/data/es-PE-x-lima.json +75 -0
  155. orthography2ipa-0.2.1a1/orthography2ipa/data/es-PE.json +95 -0
  156. orthography2ipa-0.2.1a1/orthography2ipa/data/es-PR.json +142 -0
  157. orthography2ipa-0.2.1a1/orthography2ipa/data/es-PY.json +92 -0
  158. orthography2ipa-0.2.1a1/orthography2ipa/data/es-UY.json +79 -0
  159. orthography2ipa-0.2.1a1/orthography2ipa/data/es-VE.json +137 -0
  160. orthography2ipa-0.2.1a1/orthography2ipa/data/et.json +75 -0
  161. orthography2ipa-0.2.1a1/orthography2ipa/data/etr.json +168 -0
  162. orthography2ipa-0.2.1a1/orthography2ipa/data/eu-x-bizkaiera.json +92 -0
  163. orthography2ipa-0.2.1a1/orthography2ipa/data/eu-x-gipuzkera.json +72 -0
  164. orthography2ipa-0.2.1a1/orthography2ipa/data/eu-x-lapurtera.json +65 -0
  165. orthography2ipa-0.2.1a1/orthography2ipa/data/eu-x-nafarra-beherea.json +80 -0
  166. orthography2ipa-0.2.1a1/orthography2ipa/data/eu-x-nafarra-garaia.json +82 -0
  167. orthography2ipa-0.2.1a1/orthography2ipa/data/eu-x-zuberera.json +87 -0
  168. orthography2ipa-0.2.1a1/orthography2ipa/data/eu.json +257 -0
  169. orthography2ipa-0.2.1a1/orthography2ipa/data/ext-PT-x-barrancos.json +156 -0
  170. orthography2ipa-0.2.1a1/orthography2ipa/data/ext.json +200 -0
  171. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-AF.json +122 -0
  172. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-early.json +281 -0
  173. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-hazaragi.json +78 -0
  174. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-isfahani.json +71 -0
  175. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-kermani.json +79 -0
  176. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-khorasani.json +81 -0
  177. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-mashhadi.json +68 -0
  178. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-shirazi.json +75 -0
  179. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-tehran.json +89 -0
  180. orthography2ipa-0.2.1a1/orthography2ipa/data/fa-x-yazdi.json +75 -0
  181. orthography2ipa-0.2.1a1/orthography2ipa/data/fa.json +268 -0
  182. orthography2ipa-0.2.1a1/orthography2ipa/data/fax.json +382 -0
  183. orthography2ipa-0.2.1a1/orthography2ipa/data/ff.json +344 -0
  184. orthography2ipa-0.2.1a1/orthography2ipa/data/fi.json +360 -0
  185. orthography2ipa-0.2.1a1/orthography2ipa/data/fo.json +310 -0
  186. orthography2ipa-0.2.1a1/orthography2ipa/data/fr-FR.json +574 -0
  187. orthography2ipa-0.2.1a1/orthography2ipa/data/frp.json +331 -0
  188. orthography2ipa-0.2.1a1/orthography2ipa/data/frr.json +457 -0
  189. orthography2ipa-0.2.1a1/orthography2ipa/data/fur.json +294 -0
  190. orthography2ipa-0.2.1a1/orthography2ipa/data/fy.json +476 -0
  191. orthography2ipa-0.2.1a1/orthography2ipa/data/ga.json +319 -0
  192. orthography2ipa-0.2.1a1/orthography2ipa/data/gd.json +319 -0
  193. orthography2ipa-0.2.1a1/orthography2ipa/data/gem-x-ingvaeonic.json +192 -0
  194. orthography2ipa-0.2.1a1/orthography2ipa/data/gem-x-north.json +259 -0
  195. orthography2ipa-0.2.1a1/orthography2ipa/data/gem-x-northwest.json +190 -0
  196. orthography2ipa-0.2.1a1/orthography2ipa/data/gem.json +232 -0
  197. orthography2ipa-0.2.1a1/orthography2ipa/data/gl-ES.json +424 -0
  198. orthography2ipa-0.2.1a1/orthography2ipa/data/gl-x-central.json +73 -0
  199. orthography2ipa-0.2.1a1/orthography2ipa/data/gl-x-occidental.json +99 -0
  200. orthography2ipa-0.2.1a1/orthography2ipa/data/gl-x-oriental.json +104 -0
  201. orthography2ipa-0.2.1a1/orthography2ipa/data/gl.json +477 -0
  202. orthography2ipa-0.2.1a1/orthography2ipa/data/goh.json +279 -0
  203. orthography2ipa-0.2.1a1/orthography2ipa/data/got.json +50 -0
  204. orthography2ipa-0.2.1a1/orthography2ipa/data/grc.json +381 -0
  205. orthography2ipa-0.2.1a1/orthography2ipa/data/gu.json +389 -0
  206. orthography2ipa-0.2.1a1/orthography2ipa/data/gv.json +280 -0
  207. orthography2ipa-0.2.1a1/orthography2ipa/data/hi.json +468 -0
  208. orthography2ipa-0.2.1a1/orthography2ipa/data/hr.json +250 -0
  209. orthography2ipa-0.2.1a1/orthography2ipa/data/hsb.json +74 -0
  210. orthography2ipa-0.2.1a1/orthography2ipa/data/hu.json +448 -0
  211. orthography2ipa-0.2.1a1/orthography2ipa/data/hy.json +96 -0
  212. orthography2ipa-0.2.1a1/orthography2ipa/data/id.json +310 -0
  213. orthography2ipa-0.2.1a1/orthography2ipa/data/iir.json +257 -0
  214. orthography2ipa-0.2.1a1/orthography2ipa/data/ine-x-italic.json +226 -0
  215. orthography2ipa-0.2.1a1/orthography2ipa/data/ine.json +242 -0
  216. orthography2ipa-0.2.1a1/orthography2ipa/data/ira.json +234 -0
  217. orthography2ipa-0.2.1a1/orthography2ipa/data/is.json +318 -0
  218. orthography2ipa-0.2.1a1/orthography2ipa/data/it-IT-x-abruzzo.json +351 -0
  219. orthography2ipa-0.2.1a1/orthography2ipa/data/it-IT-x-calabria.json +374 -0
  220. orthography2ipa-0.2.1a1/orthography2ipa/data/it-IT-x-marche.json +350 -0
  221. orthography2ipa-0.2.1a1/orthography2ipa/data/it-IT-x-puglia.json +364 -0
  222. orthography2ipa-0.2.1a1/orthography2ipa/data/it-IT-x-roma.json +356 -0
  223. orthography2ipa-0.2.1a1/orthography2ipa/data/it-IT-x-toscana.json +360 -0
  224. orthography2ipa-0.2.1a1/orthography2ipa/data/it-IT-x-umbria.json +350 -0
  225. orthography2ipa-0.2.1a1/orthography2ipa/data/it-IT.json +493 -0
  226. orthography2ipa-0.2.1a1/orthography2ipa/data/ja.json +685 -0
  227. orthography2ipa-0.2.1a1/orthography2ipa/data/ka.json +74 -0
  228. orthography2ipa-0.2.1a1/orthography2ipa/data/kea.json +324 -0
  229. orthography2ipa-0.2.1a1/orthography2ipa/data/kha-x-proto-mon-khmer.json +240 -0
  230. orthography2ipa-0.2.1a1/orthography2ipa/data/kha.json +224 -0
  231. orthography2ipa-0.2.1a1/orthography2ipa/data/kn.json +404 -0
  232. orthography2ipa-0.2.1a1/orthography2ipa/data/ko.json +306 -0
  233. orthography2ipa-0.2.1a1/orthography2ipa/data/kok.json +381 -0
  234. orthography2ipa-0.2.1a1/orthography2ipa/data/ks.json +388 -0
  235. orthography2ipa-0.2.1a1/orthography2ipa/data/kw.json +251 -0
  236. orthography2ipa-0.2.1a1/orthography2ipa/data/la-x-archaic.json +213 -0
  237. orthography2ipa-0.2.1a1/orthography2ipa/data/la-x-balkans.json +280 -0
  238. orthography2ipa-0.2.1a1/orthography2ipa/data/la-x-gallia.json +189 -0
  239. orthography2ipa-0.2.1a1/orthography2ipa/data/la-x-galloitalic.json +224 -0
  240. orthography2ipa-0.2.1a1/orthography2ipa/data/la-x-hispania.json +211 -0
  241. orthography2ipa-0.2.1a1/orthography2ipa/data/la-x-italia.json +355 -0
  242. orthography2ipa-0.2.1a1/orthography2ipa/data/la-x-late.json +123 -0
  243. orthography2ipa-0.2.1a1/orthography2ipa/data/la.json +100 -0
  244. orthography2ipa-0.2.1a1/orthography2ipa/data/lad.json +311 -0
  245. orthography2ipa-0.2.1a1/orthography2ipa/data/lb.json +72 -0
  246. orthography2ipa-0.2.1a1/orthography2ipa/data/lexicons/ast-PT-x-rionor.csv +918 -0
  247. orthography2ipa-0.2.1a1/orthography2ipa/data/lij.json +179 -0
  248. orthography2ipa-0.2.1a1/orthography2ipa/data/lld.json +279 -0
  249. orthography2ipa-0.2.1a1/orthography2ipa/data/lmo.json +180 -0
  250. orthography2ipa-0.2.1a1/orthography2ipa/data/lt.json +82 -0
  251. orthography2ipa-0.2.1a1/orthography2ipa/data/lv.json +81 -0
  252. orthography2ipa-0.2.1a1/orthography2ipa/data/mai.json +362 -0
  253. orthography2ipa-0.2.1a1/orthography2ipa/data/mcm.json +301 -0
  254. orthography2ipa-0.2.1a1/orthography2ipa/data/mk.json +257 -0
  255. orthography2ipa-0.2.1a1/orthography2ipa/data/ml.json +416 -0
  256. orthography2ipa-0.2.1a1/orthography2ipa/data/mni-x-proto-kuki-chin.json +217 -0
  257. orthography2ipa-0.2.1a1/orthography2ipa/data/mni.json +262 -0
  258. orthography2ipa-0.2.1a1/orthography2ipa/data/mr.json +412 -0
  259. orthography2ipa-0.2.1a1/orthography2ipa/data/ms.json +84 -0
  260. orthography2ipa-0.2.1a1/orthography2ipa/data/mt.json +78 -0
  261. orthography2ipa-0.2.1a1/orthography2ipa/data/mwl-x-ifanes.json +49 -0
  262. orthography2ipa-0.2.1a1/orthography2ipa/data/mwl-x-sendim.json +58 -0
  263. orthography2ipa-0.2.1a1/orthography2ipa/data/mwl.json +79 -0
  264. orthography2ipa-0.2.1a1/orthography2ipa/data/mxi.json +49 -0
  265. orthography2ipa-0.2.1a1/orthography2ipa/data/nap.json +366 -0
  266. orthography2ipa-0.2.1a1/orthography2ipa/data/nb.json +65 -0
  267. orthography2ipa-0.2.1a1/orthography2ipa/data/nds.json +360 -0
  268. orthography2ipa-0.2.1a1/orthography2ipa/data/ne.json +372 -0
  269. orthography2ipa-0.2.1a1/orthography2ipa/data/nl-BE.json +44 -0
  270. orthography2ipa-0.2.1a1/orthography2ipa/data/nl-NL.json +376 -0
  271. orthography2ipa-0.2.1a1/orthography2ipa/data/nl.json +450 -0
  272. orthography2ipa-0.2.1a1/orthography2ipa/data/nn.json +62 -0
  273. orthography2ipa-0.2.1a1/orthography2ipa/data/no.json +341 -0
  274. orthography2ipa-0.2.1a1/orthography2ipa/data/non.json +340 -0
  275. orthography2ipa-0.2.1a1/orthography2ipa/data/nrf.json +79 -0
  276. orthography2ipa-0.2.1a1/orthography2ipa/data/ny.json +345 -0
  277. orthography2ipa-0.2.1a1/orthography2ipa/data/oc-x-aranes.json +356 -0
  278. orthography2ipa-0.2.1a1/orthography2ipa/data/oc.json +82 -0
  279. orthography2ipa-0.2.1a1/orthography2ipa/data/ofs.json +397 -0
  280. orthography2ipa-0.2.1a1/orthography2ipa/data/or.json +375 -0
  281. orthography2ipa-0.2.1a1/orthography2ipa/data/osc.json +191 -0
  282. orthography2ipa-0.2.1a1/orthography2ipa/data/osx.json +251 -0
  283. orthography2ipa-0.2.1a1/orthography2ipa/data/pa-PK.json +54 -0
  284. orthography2ipa-0.2.1a1/orthography2ipa/data/pa.json +395 -0
  285. orthography2ipa-0.2.1a1/orthography2ipa/data/pal.json +277 -0
  286. orthography2ipa-0.2.1a1/orthography2ipa/data/pap.json +296 -0
  287. orthography2ipa-0.2.1a1/orthography2ipa/data/pcd.json +75 -0
  288. orthography2ipa-0.2.1a1/orthography2ipa/data/peo.json +244 -0
  289. orthography2ipa-0.2.1a1/orthography2ipa/data/phn.json +255 -0
  290. orthography2ipa-0.2.1a1/orthography2ipa/data/pi.json +311 -0
  291. orthography2ipa-0.2.1a1/orthography2ipa/data/pl.json +426 -0
  292. orthography2ipa-0.2.1a1/orthography2ipa/data/pms.json +176 -0
  293. orthography2ipa-0.2.1a1/orthography2ipa/data/pnt.json +77 -0
  294. orthography2ipa-0.2.1a1/orthography2ipa/data/pov.json +324 -0
  295. orthography2ipa-0.2.1a1/orthography2ipa/data/pre.json +323 -0
  296. orthography2ipa-0.2.1a1/orthography2ipa/data/ps.json +338 -0
  297. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-AO.json +69 -0
  298. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-bahia.json +364 -0
  299. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-brasilia.json +362 -0
  300. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-caipira.json +367 -0
  301. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-ce.json +365 -0
  302. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-fluminense.json +367 -0
  303. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-mg.json +366 -0
  304. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-norte.json +362 -0
  305. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-pr.json +363 -0
  306. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-recife.json +364 -0
  307. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-rj.json +367 -0
  308. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-sp.json +363 -0
  309. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR-x-sul.json +364 -0
  310. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-BR.json +219 -0
  311. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-CV.json +101 -0
  312. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-GW.json +107 -0
  313. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-MO.json +98 -0
  314. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-MZ.json +114 -0
  315. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-acores.json +80 -0
  316. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-alentejo.json +75 -0
  317. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-alfena.json +110 -0
  318. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-algarve.json +74 -0
  319. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-aveiro.json +64 -0
  320. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-beira.json +88 -0
  321. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-lisbon.json +62 -0
  322. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-madeira.json +103 -0
  323. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-medieval.json +214 -0
  324. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-minho.json +108 -0
  325. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-porto.json +82 -0
  326. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-trasosmontes.json +94 -0
  327. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT-x-viana.json +76 -0
  328. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-PT.json +273 -0
  329. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-ST.json +106 -0
  330. orthography2ipa-0.2.1a1/orthography2ipa/data/pt-TL.json +117 -0
  331. orthography2ipa-0.2.1a1/orthography2ipa/data/rm.json +270 -0
  332. orthography2ipa-0.2.1a1/orthography2ipa/data/ro-RO.json +361 -0
  333. orthography2ipa-0.2.1a1/orthography2ipa/data/roa-x-galaicopt.json +437 -0
  334. orthography2ipa-0.2.1a1/orthography2ipa/data/rom.json +80 -0
  335. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-arkhangelsk.json +55 -0
  336. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-don.json +53 -0
  337. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-kursk-orel.json +55 -0
  338. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-moscow.json +53 -0
  339. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-northern.json +72 -0
  340. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-pskov.json +61 -0
  341. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-siberian.json +61 -0
  342. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-southern.json +333 -0
  343. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-ural.json +55 -0
  344. orthography2ipa-0.2.1a1/orthography2ipa/data/ru-x-vologda.json +55 -0
  345. orthography2ipa-0.2.1a1/orthography2ipa/data/ru.json +463 -0
  346. orthography2ipa-0.2.1a1/orthography2ipa/data/rue.json +83 -0
  347. orthography2ipa-0.2.1a1/orthography2ipa/data/rup.json +72 -0
  348. orthography2ipa-0.2.1a1/orthography2ipa/data/sa-x-vedic.json +416 -0
  349. orthography2ipa-0.2.1a1/orthography2ipa/data/sa.json +408 -0
  350. orthography2ipa-0.2.1a1/orthography2ipa/data/sat-x-proto-munda.json +216 -0
  351. orthography2ipa-0.2.1a1/orthography2ipa/data/sat.json +270 -0
  352. orthography2ipa-0.2.1a1/orthography2ipa/data/sc-x-campidanese.json +67 -0
  353. orthography2ipa-0.2.1a1/orthography2ipa/data/sc-x-logudorese.json +59 -0
  354. orthography2ipa-0.2.1a1/orthography2ipa/data/sc.json +282 -0
  355. orthography2ipa-0.2.1a1/orthography2ipa/data/scn.json +369 -0
  356. orthography2ipa-0.2.1a1/orthography2ipa/data/sd.json +371 -0
  357. orthography2ipa-0.2.1a1/orthography2ipa/data/se.json +60 -0
  358. orthography2ipa-0.2.1a1/orthography2ipa/data/sem-x-central.json +50 -0
  359. orthography2ipa-0.2.1a1/orthography2ipa/data/sem-x-west.json +50 -0
  360. orthography2ipa-0.2.1a1/orthography2ipa/data/sem.json +271 -0
  361. orthography2ipa-0.2.1a1/orthography2ipa/data/si.json +439 -0
  362. orthography2ipa-0.2.1a1/orthography2ipa/data/sk.json +347 -0
  363. orthography2ipa-0.2.1a1/orthography2ipa/data/sl.json +271 -0
  364. orthography2ipa-0.2.1a1/orthography2ipa/data/sla.json +331 -0
  365. orthography2ipa-0.2.1a1/orthography2ipa/data/sq.json +84 -0
  366. orthography2ipa-0.2.1a1/orthography2ipa/data/sr.json +339 -0
  367. orthography2ipa-0.2.1a1/orthography2ipa/data/stq.json +462 -0
  368. orthography2ipa-0.2.1a1/orthography2ipa/data/sv-FI.json +85 -0
  369. orthography2ipa-0.2.1a1/orthography2ipa/data/sv-x-rikssvenska.json +70 -0
  370. orthography2ipa-0.2.1a1/orthography2ipa/data/sv-x-skanska.json +83 -0
  371. orthography2ipa-0.2.1a1/orthography2ipa/data/sv.json +434 -0
  372. orthography2ipa-0.2.1a1/orthography2ipa/data/sw.json +320 -0
  373. orthography2ipa-0.2.1a1/orthography2ipa/data/szl.json +78 -0
  374. orthography2ipa-0.2.1a1/orthography2ipa/data/ta-x-proto-dravidian.json +237 -0
  375. orthography2ipa-0.2.1a1/orthography2ipa/data/ta.json +350 -0
  376. orthography2ipa-0.2.1a1/orthography2ipa/data/tcy.json +340 -0
  377. orthography2ipa-0.2.1a1/orthography2ipa/data/te.json +399 -0
  378. orthography2ipa-0.2.1a1/orthography2ipa/data/tet.json +291 -0
  379. orthography2ipa-0.2.1a1/orthography2ipa/data/tg.json +284 -0
  380. orthography2ipa-0.2.1a1/orthography2ipa/data/tr.json +344 -0
  381. orthography2ipa-0.2.1a1/orthography2ipa/data/ts.json +336 -0
  382. orthography2ipa-0.2.1a1/orthography2ipa/data/txr.json +136 -0
  383. orthography2ipa-0.2.1a1/orthography2ipa/data/uk.json +294 -0
  384. orthography2ipa-0.2.1a1/orthography2ipa/data/unr.json +284 -0
  385. orthography2ipa-0.2.1a1/orthography2ipa/data/ur.json +395 -0
  386. orthography2ipa-0.2.1a1/orthography2ipa/data/vec.json +296 -0
  387. orthography2ipa-0.2.1a1/orthography2ipa/data/wa.json +81 -0
  388. orthography2ipa-0.2.1a1/orthography2ipa/data/xaa.json +53 -0
  389. orthography2ipa-0.2.1a1/orthography2ipa/data/xaq.json +169 -0
  390. orthography2ipa-0.2.1a1/orthography2ipa/data/xbr.json +223 -0
  391. orthography2ipa-0.2.1a1/orthography2ipa/data/xce.json +274 -0
  392. orthography2ipa-0.2.1a1/orthography2ipa/data/xcg.json +154 -0
  393. orthography2ipa-0.2.1a1/orthography2ipa/data/xda.json +179 -0
  394. orthography2ipa-0.2.1a1/orthography2ipa/data/xga.json +172 -0
  395. orthography2ipa-0.2.1a1/orthography2ipa/data/xib.json +155 -0
  396. orthography2ipa-0.2.1a1/orthography2ipa/data/xlg.json +157 -0
  397. orthography2ipa-0.2.1a1/orthography2ipa/data/xlp.json +173 -0
  398. orthography2ipa-0.2.1a1/orthography2ipa/data/xpa.json +264 -0
  399. orthography2ipa-0.2.1a1/orthography2ipa/data/xsb.json +50 -0
  400. orthography2ipa-0.2.1a1/orthography2ipa/data/xtg.json +199 -0
  401. orthography2ipa-0.2.1a1/orthography2ipa/data/xum.json +188 -0
  402. orthography2ipa-0.2.1a1/orthography2ipa/data/yi.json +87 -0
  403. orthography2ipa-0.2.1a1/orthography2ipa/data/zh.json +335 -0
  404. orthography2ipa-0.2.1a1/orthography2ipa/distance.py +1057 -0
  405. orthography2ipa-0.2.1a1/orthography2ipa/feats.py +683 -0
  406. orthography2ipa-0.2.1a1/orthography2ipa/g2p_plugin.py +54 -0
  407. orthography2ipa-0.2.1a1/orthography2ipa/json_loader.py +336 -0
  408. orthography2ipa-0.2.1a1/orthography2ipa/lm.py +118 -0
  409. orthography2ipa-0.2.1a1/orthography2ipa/phonetok.py +615 -0
  410. orthography2ipa-0.2.1a1/orthography2ipa/plugins/__init__.py +1 -0
  411. orthography2ipa-0.2.1a1/orthography2ipa/plugins/arabic_g2p.py +148 -0
  412. orthography2ipa-0.2.1a1/orthography2ipa/plugins/arabic_utils.py +65 -0
  413. orthography2ipa-0.2.1a1/orthography2ipa/plugins/tashkeel.py +58 -0
  414. orthography2ipa-0.2.1a1/orthography2ipa/registry.py +133 -0
  415. orthography2ipa-0.2.1a1/orthography2ipa/sandhi.py +77 -0
  416. orthography2ipa-0.2.1a1/orthography2ipa/schema.py +246 -0
  417. orthography2ipa-0.2.1a1/orthography2ipa/script_distance.py +250 -0
  418. orthography2ipa-0.2.1a1/orthography2ipa/transforms.py +1061 -0
  419. orthography2ipa-0.2.1a1/orthography2ipa/types.py +702 -0
  420. orthography2ipa-0.2.1a1/orthography2ipa/version.py +10 -0
  421. orthography2ipa-0.2.1a1/orthography2ipa.egg-info/PKG-INFO +227 -0
  422. orthography2ipa-0.2.1a1/orthography2ipa.egg-info/SOURCES.txt +466 -0
  423. orthography2ipa-0.2.1a1/orthography2ipa.egg-info/dependency_links.txt +1 -0
  424. orthography2ipa-0.2.1a1/orthography2ipa.egg-info/entry_points.txt +5 -0
  425. orthography2ipa-0.2.1a1/orthography2ipa.egg-info/requires.txt +14 -0
  426. orthography2ipa-0.2.1a1/orthography2ipa.egg-info/top_level.txt +1 -0
  427. orthography2ipa-0.2.1a1/pyproject.toml +51 -0
  428. orthography2ipa-0.2.1a1/requirements.txt +2 -0
  429. orthography2ipa-0.2.1a1/setup.cfg +4 -0
  430. orthography2ipa-0.2.1a1/tests/__init__.py +0 -0
  431. orthography2ipa-0.2.1a1/tests/conftest.py +228 -0
  432. orthography2ipa-0.2.1a1/tests/pytest.ini +9 -0
  433. orthography2ipa-0.2.1a1/tests/test_all_languages.py +62 -0
  434. orthography2ipa-0.2.1a1/tests/test_arabic.py +432 -0
  435. orthography2ipa-0.2.1a1/tests/test_arabic_g2p.py +58 -0
  436. orthography2ipa-0.2.1a1/tests/test_celtic.py +941 -0
  437. orthography2ipa-0.2.1a1/tests/test_distance.py +570 -0
  438. orthography2ipa-0.2.1a1/tests/test_feats_accuracy.py +74 -0
  439. orthography2ipa-0.2.1a1/tests/test_g2p_plugin.py +68 -0
  440. orthography2ipa-0.2.1a1/tests/test_germanic.py +1728 -0
  441. orthography2ipa-0.2.1a1/tests/test_guadramil_barrancos.py +325 -0
  442. orthography2ipa-0.2.1a1/tests/test_iberian.py +2350 -0
  443. orthography2ipa-0.2.1a1/tests/test_iberian_extended.py +1637 -0
  444. orthography2ipa-0.2.1a1/tests/test_indo_iranian.py +536 -0
  445. orthography2ipa-0.2.1a1/tests/test_integration.py +273 -0
  446. orthography2ipa-0.2.1a1/tests/test_language_integrity.py +489 -0
  447. orthography2ipa-0.2.1a1/tests/test_lexicon.py +72 -0
  448. orthography2ipa-0.2.1a1/tests/test_linguistic_spotchecks.py +117 -0
  449. orthography2ipa-0.2.1a1/tests/test_new_feats.py +104 -0
  450. orthography2ipa-0.2.1a1/tests/test_new_types.py +129 -0
  451. orthography2ipa-0.2.1a1/tests/test_other_languages.py +812 -0
  452. orthography2ipa-0.2.1a1/tests/test_phonetok.py +284 -0
  453. orthography2ipa-0.2.1a1/tests/test_pos.py +343 -0
  454. orthography2ipa-0.2.1a1/tests/test_py_compat.py +116 -0
  455. orthography2ipa-0.2.1a1/tests/test_pydantic_schema.py +59 -0
  456. orthography2ipa-0.2.1a1/tests/test_registry.py +154 -0
  457. orthography2ipa-0.2.1a1/tests/test_rionorese.py +416 -0
  458. orthography2ipa-0.2.1a1/tests/test_romance_extended2.py +1315 -0
  459. orthography2ipa-0.2.1a1/tests/test_sandhi.py +70 -0
  460. orthography2ipa-0.2.1a1/tests/test_script_distance.py +109 -0
  461. orthography2ipa-0.2.1a1/tests/test_slavic.py +809 -0
  462. orthography2ipa-0.2.1a1/tests/test_sources.py +312 -0
  463. orthography2ipa-0.2.1a1/tests/test_temporal.py +268 -0
  464. orthography2ipa-0.2.1a1/tests/test_tone_distance.py +49 -0
  465. orthography2ipa-0.2.1a1/tests/test_transforms.py +536 -0
  466. orthography2ipa-0.2.1a1/tests/test_types.py +241 -0
  467. orthography2ipa-0.2.1a1/tests/test_typological_distances.py +98 -0
  468. orthography2ipa-0.2.1a1/uv.lock +435 -0
@@ -0,0 +1,11 @@
1
+ # To get started with Dependabot version updates, you'll need to specify which
2
+ # package ecosystems to update and where the package manifests are located.
3
+ # Please see the documentation for all configuration options:
4
+ # https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file
5
+
6
+ version: 2
7
+ updates:
8
+   - package-ecosystem: "pip" # See documentation for possible values
9
+     directory: "/requirements" # Location of package manifests
10
+     schedule:
11
+       interval: "weekly"
@@ -0,0 +1,14 @@
1
+ name: Build Tests
2
+
3
+ on:
4
+   pull_request:
5
+     branches: [dev, master]
6
+   workflow_dispatch:
7
+
8
+ jobs:
9
+   build:
10
+     uses: OpenVoiceOS/gh-automations/.github/workflows/build-tests.yml@dev
11
+     with:
12
+       python_versions: '["3.10", "3.11", "3.12", "3.13"]'
13
+       install_extras: 'test'
14
+       test_path: 'tests'
@@ -0,0 +1,10 @@
1
+ # auto add labels to PRs
2
+ on:
3
+   pull_request_target:
4
+     types: [ opened, edited ]
5
+ name: conventional-release-labels
6
+ jobs:
7
+   label:
8
+     runs-on: ubuntu-latest
9
+     steps:
10
+       - uses: bcoe/conventional-release-labels@v1
@@ -0,0 +1,16 @@
1
+ name: Code Coverage
2
+
3
+ on:
4
+   pull_request:
5
+     branches: [dev]
6
+   workflow_dispatch:
7
+
8
+ jobs:
9
+   coverage:
10
+     uses: OpenVoiceOS/gh-automations/.github/workflows/coverage.yml@dev
11
+     with:
12
+       python_version: '3.11'
13
+       coverage_source: 'orthography2ipa'
14
+       test_path: 'tests/'
15
+       install_extras: 'test'
16
+       min_coverage: 0
@@ -0,0 +1,10 @@
1
+ name: License Check
2
+
3
+ on:
4
+   pull_request:
5
+     branches: [dev]
6
+   workflow_dispatch:
7
+
8
+ jobs:
9
+   license_check:
10
+     uses: OpenVoiceOS/gh-automations/.github/workflows/license-check.yml@dev
@@ -0,0 +1,23 @@
1
+ name: Publish Stable Release
2
+
3
+ on:
4
+   workflow_dispatch:
5
+   push:
6
+     branches: [master]
7
+
8
+ permissions:
9
+   contents: write # required for version bump commit and release tag
10
+
11
+ jobs:
12
+   publish_stable:
13
+     if: github.actor != 'github-actions[bot]'
14
+     uses: OpenVoiceOS/gh-automations/.github/workflows/publish-stable.yml@dev
15
+     secrets:
16
+       PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
17
+       MATRIX_TOKEN: ${{ secrets.MATRIX_TOKEN }}
18
+     with:
19
+       version_file: 'orthography2ipa/version.py'
20
+       publish_pypi: true
21
+       publish_release: true
22
+       sync_dev: true
23
+       notify_matrix: true
@@ -0,0 +1,28 @@
1
+ name: Release Alpha and Propose Stable
2
+
3
+ on:
4
+   workflow_dispatch:
5
+   pull_request:
6
+     types: [closed]
7
+     branches: [dev]
8
+
9
+ permissions:
10
+   contents: write
11
+   pull-requests: write
12
+
13
+ jobs:
14
+   publish_alpha:
15
+     if: github.event.pull_request.merged == true || github.event_name == 'workflow_dispatch'
16
+     uses: OpenVoiceOS/gh-automations/.github/workflows/publish-alpha.yml@dev
17
+     secrets:
18
+       PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
19
+       MATRIX_TOKEN: ${{ secrets.MATRIX_TOKEN }}
20
+     with:
21
+       branch: 'dev'
22
+       version_file: 'orthography2ipa/version.py'
23
+       update_changelog: true
24
+       publish_prerelease: true
25
+       propose_release: true
26
+       changelog_max_issues: 100
27
+       publish_pypi: true
28
+       notify_matrix: true
@@ -0,0 +1,36 @@
1
+ /dump/*
2
+ CLAUDE.md
3
+ .claude
4
+
5
+ # Python
6
+ __pycache__/
7
+ *.py[cod]
8
+ *.pyo
9
+ *.pyd
10
+ .Python
11
+
12
+ # Distribution / packaging
13
+ *.egg-info/
14
+ dist/
15
+ build/
16
+ *.egg
17
+ MANIFEST
18
+
19
+ # Testing
20
+ .pytest_cache/
21
+ .coverage
22
+ coverage.xml
23
+ htmlcov/
24
+
25
+ # IDE
26
+ .idea/
27
+ .vscode/
28
+ *.swp
29
+ *.swo
30
+
31
+ # Virtual environments
32
+ .venv/
33
+ venv/
34
+ env/
35
+ TODO.md
36
+ ROADMAP.md
@@ -0,0 +1,64 @@
1
+ # orthography2ipa — Agent Guide
2
+
3
+ Pure-data Python package: linguistically motivated grapheme→IPA and allophone mappings for 350+ language codes (356 JSON specs), plus a maximal-munch IPA tokenizer, phonological/script distance metrics, dialect transforms, and a pluggable G2P plugin system (e.g. algorithmic Arabic).
4
+
5
+ ## Setup
6
+
7
+ ```bash
8
+ pip install -e .
9
+ # optional algorithmic Arabic G2P (ONNX diacritization):
10
+ pip install -e ".[arabic]"
11
+ ```
12
+
13
+ Runtime deps are minimal: `numpy`, `langcodes` (see `requirements.txt`). `langcodes` is used for ISO 639-3 → BCP-47 normalisation, with a hand-maintained fallback alias table in `registry.py`.
14
+
15
+ ## Test
16
+
17
+ ```bash
18
+ pytest tests
19
+ # with coverage (as CI runs it):
20
+ pytest --cov=orthography2ipa --cov-report xml tests
21
+ ```
22
+
23
+ `tests/pytest.ini` and `tests/conftest.py` configure the suite. There is a broad per-family test layout (`test_iberian.py`, `test_celtic.py`, `test_slavic.py`, `test_germanic.py`, `test_indo_iranian.py`, …) plus `test_all_languages.py` and `test_language_integrity.py` that sweep every data file.
24
+
25
+ ## Lint/Typecheck
26
+
27
+ No linter or type checker is configured. Code uses `from __future__ import annotations` and typed dataclasses but there is no mypy/ruff/flake8 config.
28
+
29
+ ## Layout
30
+
31
+ - `orthography2ipa/types.py` — frozen dataclasses: `LanguageSpec`, `Grapheme2IPA`, `AllophoneMap`, `Ancestor`, `PositionalGrapheme2IPA`, `SandhiRule`; enums `QualityTier`/`ScriptType`/`AncestorRole`.
32
+ - `orthography2ipa/data/*.json` — 356 language/dialect spec files (the actual payload). `data/SCHEMA.md` documents the format; dialects inherit via `graphemes_base`/`allophones_base`. `data/lexicons/*.csv` hold reference word lists.
33
+ - `orthography2ipa/json_loader.py` — loads JSON specs and lexicons, resolves multi-ancestor inheritance.
34
+ - `orthography2ipa/registry.py` — `get()`, `available_codes()`, `available_families()`; lazy cache + plugin discovery + ISO alias table.
35
+ - `orthography2ipa/phonetok.py` — `PhonetokTokenizer`, beam-search IPA expansion (`IPAPath`, `Token`, `TokenKind`).
36
+ - `orthography2ipa/distance.py` + `feats.py` + `script_distance.py` — phonological/inventory/grapheme/tone/script distance metrics and feature vectors.
37
+ - `orthography2ipa/transforms.py` + `sandhi.py` + `lm.py` — dialect transforms, sandhi rules, language-model scoring helpers.
38
+ - `orthography2ipa/g2p_plugin.py` — `G2PPlugin` base; `plugins/arabic_g2p.py`, `plugins/tashkeel.py`, `plugins/arabic_utils.py` implement algorithmic Arabic G2P.
39
+ - `orthography2ipa/cli.py` — `orthography2ipa` console entry point (`list`, `info`, `transcribe`, `distance`; all support `--json`).
40
+ - `examples/` — runnable usage demos; `docs/` — Markdown reference (architecture, data model, tokenizer, distance, adding a language, bibliography).
41
+
42
+ ### Entry-point groups
43
+
44
+ - `[project.scripts]` → `orthography2ipa = orthography2ipa.cli:main` (CLI).
45
+ - `[project.entry-points."orthography2ipa.g2p"]` → `arabic = orthography2ipa.plugins.arabic_g2p:ArabicG2PPlugin`. This is a **package-private** plugin group (not an OVOS/OPM group); third parties register algorithmic G2P backends here.
46
+
47
+ ## Conventions (Org hard rules)
48
+
49
+ - Branches: `dev` for work, `master` for stable. NEVER `main`.
50
+ - Never edit `orthography2ipa/version.py` — gh-automations bumps semver from conventional-commit prefixes (`feat:`, `fix:`, `feat!:`).
51
+ - New repos private by default; do not make source public without asking.
52
+ - Commit identity: `JarbasAi <jarbasai@mailfence.com>`.
53
+ - Reference `TigreGotico`/`OpenVoiceOS` gh-automations reusable workflows at `@dev` (the workflows under `.github/workflows/` pin `@dev`).
54
+ - No Neon / `neon-*` references.
55
+ - No meta-commentary: describe current state only — no history, dates, or "design mistake" framing in docs/commits/PRs/comments.
56
+ - CI is provided by gh-automations reusable workflows.
57
+
58
+ ## Gotchas
59
+
60
+ - This is **pure data + logic, no trained network weights** despite living in the ML cluster — the only model artifact is the optional ONNX Arabic diacritizer, and `plugins/tashkeel.py` still has `# TODO: Load and run ONNX model for diacritization` (the ONNX path is not wired up).
61
+ - `dynamic = ["version", "dependencies"]`: version comes from `orthography2ipa/version.py` attr, deps from `requirements.txt`. The release workflows reference a `setup.py` that is not present in the tree — packaging is `pyproject`-only, so the `setup.py`-based release steps will fail.
62
+ - `QualityTier` ranges from `stub`/`skeleton` through `research`/`production`; not every one of the 356 specs is `production` quality. Check `spec.quality` before relying on a mapping.
63
+ - Graphemes ≠ allophones: `graphemes` maps a spelling to the phonemes it can represent; `allophones` maps a phoneme to its contextual surface forms. Keep them distinct.
64
+ - Many scratch report files (AUDIT.md, MAINTENANCE_REPORT.md, SUGGESTIONS.md, PLAN.md, QUICK_FACTS.md, FAQ.md) and 78 `.pyc` files are committed despite `.gitignore` — do not add more.
@@ -0,0 +1,21 @@
1
+ # Changelog
2
+
3
+ ## [0.2.1a1](https://github.com/TigreGotico/orthography2ipa/tree/0.2.1a1) (2026-06-10)
4
+
5
+ [Full Changelog](https://github.com/TigreGotico/orthography2ipa/compare/73cf93d2bc10be1e32a61e00a380c4ed632a0148...0.2.1a1)
6
+
7
+ **Merged pull requests:**
8
+
9
+ - fix: py3.9 annotation compatibility, plugin-failure logging, public exports [\#17](https://github.com/TigreGotico/orthography2ipa/pull/17) ([JarbasAl](https://github.com/JarbasAl))
10
+ - Update phonetic representation of graphemes in an.json [\#14](https://github.com/TigreGotico/orthography2ipa/pull/14) ([Juanpabl](https://github.com/Juanpabl))
11
+ - feat: ast+gl [\#8](https://github.com/TigreGotico/orthography2ipa/pull/8) ([JarbasAl](https://github.com/JarbasAl))
12
+ - Latin graphemes + portuguese 4way sibilant distinction [\#7](https://github.com/TigreGotico/orthography2ipa/pull/7) ([JarbasAl](https://github.com/JarbasAl))
13
+ - refactor to json [\#6](https://github.com/TigreGotico/orthography2ipa/pull/6) ([JarbasAl](https://github.com/JarbasAl))
14
+ - feat: positional graphemes [\#4](https://github.com/TigreGotico/orthography2ipa/pull/4) ([JarbasAl](https://github.com/JarbasAl))
15
+ - add release automations [\#3](https://github.com/TigreGotico/orthography2ipa/pull/3) ([JarbasAl](https://github.com/JarbasAl))
16
+ - celtic [\#2](https://github.com/TigreGotico/orthography2ipa/pull/2) ([JarbasAl](https://github.com/JarbasAl))
17
+ - add tests [\#1](https://github.com/TigreGotico/orthography2ipa/pull/1) ([JarbasAl](https://github.com/JarbasAl))
18
+
19
+
20
+
21
+ \* *This Changelog was automatically generated by [github_changelog_generator](https://github.com/github-changelog-generator/github-changelog-generator)*
@@ -0,0 +1,227 @@
1
+ Metadata-Version: 2.4
2
+ Name: orthography2ipa
3
+ Version: 0.2.1a1
4
+ Summary: Linguistically motivated grapheme-to-IPA and allophone mappings for 350+ language codes
5
+ License: Apache-2.0
6
+ Project-URL: Homepage, https://github.com/TigreGotico/orthography2ipa
7
+ Project-URL: Issues, https://github.com/TigreGotico/orthography2ipa/issues
8
+ Keywords: ipa,phonetics,phonology,grapheme,allophone,linguistics,nlp
9
+ Classifier: Development Status :: 3 - Alpha
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: Intended Audience :: Science/Research
12
+ Classifier: Topic :: Text Processing :: Linguistic
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3.9
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Programming Language :: Python :: 3.13
19
+ Classifier: License :: OSI Approved :: Apache Software License
20
+ Classifier: Operating System :: OS Independent
21
+ Requires-Python: >=3.9
22
+ Description-Content-Type: text/markdown
23
+ Requires-Dist: numpy
24
+ Requires-Dist: langcodes
25
+ Provides-Extra: arabic
26
+ Requires-Dist: onnxruntime; extra == "arabic"
27
+ Provides-Extra: validation
28
+ Requires-Dist: pydantic>=2; extra == "validation"
29
+ Provides-Extra: test
30
+ Requires-Dist: pytest; extra == "test"
31
+ Requires-Dist: pytest-timeout; extra == "test"
32
+ Requires-Dist: pytest-cov; extra == "test"
33
+ Requires-Dist: pydantic>=2; extra == "test"
34
+
35
+ # orthography2ipa
36
+
37
+ Linguistically motivated **grapheme→IPA** and **allophone** mappings for **350+ language codes** across 20+ language families — pure data, a maximal-munch IPA tokenizer, and a family of phonological/script distance metrics, with no trained weights to ship.
38
+
39
+ Only mappings grounded in official orthography and documented grammar are included. Arbitrary substring rules are excluded.
40
+
41
+ ## Why two maps
42
+
43
+ The central distinction the package enforces:
44
+
45
+ - A **grapheme map** tells you which phonemes a spelling *can* represent. English ⟨th⟩ → `['θ', 'ð']`.
46
+ - An **allophone map** tells you how a phoneme *surfaces* in context. English /t/ → `['t', 'tʰ', 'ɾ', 'ʔ', 't̚']`.
47
+
48
+ Keeping these separate lets you go from text to phoneme candidates (transcription) and from phonemes to surface realisations (pronunciation modelling) without conflating the two.
49
+
50
+ ## What each language carries
51
+
52
+ Every `LanguageSpec` provides:
53
+
54
+ 1. **Graphemes** — orthographic units (characters, digraphs, trigraphs) mapped to canonical IPA phonemes.
55
+ 2. **Allophones** — each phoneme mapped to its positional/contextual surface realisations.
56
+ 3. **Positional graphemes** — context-sensitive overrides (word-initial, intervocalic, before /i/, …).
57
+ 4. **Ancestry** — weighted multi-ancestor lineage (parent, substrate, superstrate, adstrate, …) for dialect trees.
58
+ 5. **Sandhi rules** — cross-word phonological processes.
59
+ 6. **Tone inventory** — tone marks → labels, where applicable.
60
+ 7. **Provenance** — `QualityTier` (stub → skeleton → research → production), `ScriptType`, and bibliographic sources.
61
+
62
+ Regional varieties get their own `LanguageSpec` objects linked through ancestry, and JSON data files support `graphemes_base`/`allophones_base` inheritance so a dialect only declares what differs from its parent.
63
+
64
+ ## Installation
65
+
66
+ ```bash
67
+ pip install orthography2ipa
68
+ ```
69
+
70
+ For the optional Arabic G2P backend:
71
+
72
+ ```bash
73
+ pip install "orthography2ipa[arabic]"
74
+ ```
75
+
76
+ ## Quick start
77
+
78
+ ### Python API
79
+
80
+ ```python
81
+ import orthography2ipa
82
+
83
+ # Get a language spec
84
+ en = orthography2ipa.get("en-GB")
85
+
86
+ # Grapheme → IPA candidates
87
+ en.graphemes["th"] # ['θ', 'ð']
88
+
89
+ # Allophone map: how /t/ surfaces
90
+ en.allophones["t"] # ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
91
+
92
+ # Metadata
93
+ en.name # 'British English (RP)'
94
+ en.family # 'Germanic'
95
+ en.script # 'Latin'
96
+
97
+ # Regional variants share ancestry but diverge where pronunciation does
98
+ pt_br = orthography2ipa.get("pt-BR")
99
+ pt_br.graphemes["t"] # ['t', 't͡ʃ'] — palatalisation before /i/
100
+
101
+ # ISO 639-3 aliases resolve to BCP-47 codes
102
+ orthography2ipa.get("eng").name # 'British English (RP)'
103
+
104
+ # Discover what's available
105
+ orthography2ipa.available_codes()
106
+ orthography2ipa.available_families()
107
+ ```
108
+
109
+ ### IPA tokenizer
110
+
111
+ `PhonetokTokenizer` performs maximal-munch grapheme tokenization with beam-search IPA expansion, ranking candidate transcriptions when a spelling is ambiguous:
112
+
113
+ ```python
114
+ from orthography2ipa import get
115
+ from orthography2ipa.phonetok import PhonetokTokenizer
116
+
117
+ tok = PhonetokTokenizer(get("en-GB"))
118
+
119
+ tok.ipa_best("through") # 'θɹɔː'
120
+ for path in tok.ipa_beam("through", beam_width=8):
121
+     print(path.ipa, path.score) # θɹɔː 0.0, ðɹɔː 1.0, θɹoʊ 1.0, …
122
+ ```
123
+
124
+ ### Distance metrics
125
+
126
+ Compare two languages across inventory, grapheme, allophone, and ancestry dimensions:
127
+
128
+ ```python
129
+ from orthography2ipa import get
130
+ from orthography2ipa.distance import phonological_distance
131
+
132
+ d = phonological_distance(get("pt-BR"), get("pt-PT"))
133
+ d.combined # 0.04 — near-identical
134
+ d.inventory.feature_mean # phoneme-inventory distance
135
+ d.grapheme.mean_ipa_distance # grapheme-mapping divergence
136
+ d.allophone_sim # allophone-overlap similarity
137
+ ```
138
+
139
+ Script-level distance and feature vectors are available via `script_distance.py` and `feats.py`.
140
+
141
+ ## Command-line interface
142
+
143
+ After installation the `orthography2ipa` command is available. Every subcommand accepts `--json` for machine-readable output.
144
+
145
+ ```bash
146
+ # List languages and families
147
+ orthography2ipa list
148
+ orthography2ipa list --families
149
+ orthography2ipa list --family Romance
150
+
151
+ # Inspect a language
152
+ orthography2ipa info pt-BR
153
+ orthography2ipa info pt-BR --graphemes
154
+ orthography2ipa info pt-BR --json
155
+
156
+ # Transcribe text to IPA (beam-ranked candidates)
157
+ orthography2ipa transcribe pt-BR "chuva"
158
+ orthography2ipa transcribe en-GB "through" --beam 8
159
+
160
+ # Phonological distance between two languages
161
+ orthography2ipa distance pt-BR pt-PT
162
+ orthography2ipa distance es-ES it-IT --json
163
+ ```
164
+
165
+ ## Languages
166
+
167
+ | Family | Examples |
168
+ |------------|----------|
169
+ | Romance | `pt-PT`, `pt-BR`, `es-ES`, `es-AR`, `ca`, `fr-FR`, `it-IT`, `ro-RO`, `gl`, `oc`, `sc`, `an` |
170
+ | Germanic | `en-GB`, `de-DE`, `nl-NL`, `sv-SE`, `da-DK`, `no-NO`, `af` |
171
+ | Slavic | `ru-RU`, `uk-UA`, `pl-PL`, `cs-CZ`, `sr-RS`, `hr-HR`, `bg-BG` |
172
+ | Celtic | `cy`, `ga`, `gd`, `br`, `kw`, `gv` |
173
+ | Indo-Aryan | `hi-IN`, `bn-BD`, `ur-PK`, `ne-NP`, `pa`, `gu`, `mr` |
174
+ | Semitic | `arb`, `he-IL`, `mt` |
175
+ | Turkic | `tr-TR`, `az`, `kk`, `uz` |
176
+ | Hellenic | `el-GR` |
177
+ | Uralic | `fi-FI`, `hu-HU`, `et-EE` |
178
+ | Japonic | `ja` |
179
+ | Sinitic | `zh` |
180
+ | Koreanic | `ko` |
181
+
182
+ 350+ codes across 40+ family groupings, including reconstructed proto-languages and fine-grained regional dialects.
183
+
184
+ ## Data structure
185
+
186
+ ```python
187
+ @dataclass(frozen=True)
188
+ class LanguageSpec:
189
+     code: str # 'pt-BR'
190
+     name: str # 'Brazilian Portuguese'
191
+     family: str # 'Romance'
192
+     script: str # 'Latin'
193
+     graphemes: Dict[str, List[str]] # 'th' → ['θ', 'ð']
194
+     allophones: Dict[str, List[str]] # 't' → ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
195
+     positional_graphemes: Dict[...] # context-sensitive overrides
196
+     parent: Optional[str] # primary parent code
197
+     ancestors: Tuple[Ancestor, ...] # weighted multi-ancestor lineage
198
+     quality: QualityTier # stub | skeleton | research | production
199
+     script_type: ScriptType # alphabet | abjad | abugida | ...
200
+     sandhi_rules: Tuple[SandhiRule, ...] # cross-word rules
201
+     tone_inventory: Optional[Dict] # tone marks → labels
202
+     sources: Tuple[LinguisticSource, ...] # bibliographic references
203
+ ```
204
+
205
+ When a spec declares graphemes but no explicit allophone map, a baseline identity allophone map is derived: every phoneme a grapheme can produce is, at minimum, its own surface realisation.
206
+
207
+ ## Design principles
208
+
209
+ - **Linguistically motivated only** — digraphs like English ⟨th⟩, Portuguese ⟨lh⟩, or German ⟨sch⟩ are included because they are standard orthographic units; arbitrary substrings are not.
210
+ - **Graphemes ≠ allophones** — spelling-to-phoneme and phoneme-to-surface are modelled separately.
211
+ - **Regional variants** — where pronunciation diverges systematically, a separate `LanguageSpec` is provided with ancestry links.
212
+ - **Multi-ancestor inheritance** — `graphemes_base`/`allophones_base` let dialect trees declare only their differences.
213
+ - **Pure data, pluggable logic** — mappings are declarative JSON; algorithmic G2P (e.g. Arabic) uses the plugin system.
214
+
215
+ ## Plugins
216
+
217
+ Algorithmic G2P backends register under the `orthography2ipa.g2p` entry-point group. The bundled Arabic plugin (`plugins/arabic_g2p.py`) handles consonant mapping, harakat vowels, sun-letter assimilation, hamzat al-wasl elision, and tanwin forms.
218
+
219
+ A neural Arabic diacritizer (`plugins/tashkeel.py`) is wired as an optional ONNX backend but ships as a documented stub: with no model loaded it returns input unchanged, and the rule-based plugin transcribes whatever diacritics are present. Bundling a tashkeel model is planned future work.
220
+
221
+ ## Contributing
222
+
223
+ To add a language, create `orthography2ipa/data/{code}.json` following `orthography2ipa/data/SCHEMA.md`. For dialects, use `graphemes_base`/`allophones_base` to inherit from the parent.
224
+
225
+ ## License
226
+
227
+ Apache 2.0
@@ -0,0 +1,193 @@
1
+ # orthography2ipa
2
+
3
+ Linguistically motivated **grapheme→IPA** and **allophone** mappings for **350+ language codes** across 20+ language families — pure data, a maximal-munch IPA tokenizer, and a family of phonological/script distance metrics, with no trained weights to ship.
4
+
5
+ Only mappings grounded in official orthography and documented grammar are included. Arbitrary substring rules are excluded.
6
+
7
+ ## Why two maps
8
+
9
+ The central distinction the package enforces:
10
+
11
+ - A **grapheme map** tells you which phonemes a spelling *can* represent. English ⟨th⟩ → `['θ', 'ð']`.
12
+ - An **allophone map** tells you how a phoneme *surfaces* in context. English /t/ → `['t', 'tʰ', 'ɾ', 'ʔ', 't̚']`.
13
+
14
+ Keeping these separate lets you go from text to phoneme candidates (transcription) and from phonemes to surface realisations (pronunciation modelling) without conflating the two.
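The two-step flow described above can be pictured with plain dictionaries. This is only an illustrative sketch, not the package's code: the ⟨th⟩ and /t/ entries are copied from the examples in this README, while the `"t": ["t"]` grapheme entry and the `realisations` helper are assumptions made up for the illustration.

```python
# Toy data: "th" and the /t/ allophones come from the README examples;
# the "t" grapheme entry is an assumption added for illustration.
graphemes = {"th": ["θ", "ð"], "t": ["t"]}
allophones = {"t": ["t", "tʰ", "ɾ", "ʔ", "t̚"]}


def realisations(spelling):
    """Hypothetical helper: spelling -> {phoneme: listed surface forms}."""
    # Step 1 (transcription): which phonemes can this spelling represent?
    candidates = graphemes.get(spelling, [])
    # Step 2 (pronunciation modelling): how does each phoneme surface?
    return {phoneme: allophones.get(phoneme, []) for phoneme in candidates}


print(realisations("t"))
print(realisations("th"))
```

The grapheme lookup never returns a surface form and the allophone lookup never sees a spelling, which is the separation the section argues for.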
15
+
16
+ ## What each language carries
17
+
18
+ Every `LanguageSpec` provides:
19
+
20
+ 1. **Graphemes** — orthographic units (characters, digraphs, trigraphs) mapped to canonical IPA phonemes.
21
+ 2. **Allophones** — each phoneme mapped to its positional/contextual surface realisations.
22
+ 3. **Positional graphemes** — context-sensitive overrides (word-initial, intervocalic, before /i/, …).
23
+ 4. **Ancestry** — weighted multi-ancestor lineage (parent, substrate, superstrate, adstrate, …) for dialect trees.
24
+ 5. **Sandhi rules** — cross-word phonological processes.
25
+ 6. **Tone inventory** — tone marks → labels, where applicable.
26
+ 7. **Provenance** — `QualityTier` (stub → skeleton → research → production), `ScriptType`, and bibliographic sources.
27
+
28
+ Regional varieties get their own `LanguageSpec` objects linked through ancestry, and JSON data files support `graphemes_base`/`allophones_base` inheritance so a dialect only declares what differs from its parent.
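One way to read the inheritance described above is as a recursive dictionary merge. The sketch below makes several assumptions: the shape of the spec dictionaries, the idea that `graphemes_base` holds a single parent code, and the sample entries other than the pt-BR ⟨t⟩ mapping shown in this README. The real JSON format is documented in `orthography2ipa/data/SCHEMA.md`, and `resolve_graphemes` is a hypothetical name.

```python
def resolve_graphemes(code, specs):
    """Hypothetical resolver: merge a dialect's graphemes over its base."""
    spec = specs[code]
    base = spec.get("graphemes_base")
    # Start from the fully resolved parent map, if a base is declared.
    merged = dict(resolve_graphemes(base, specs)) if base else {}
    # The dialect's own entries override or extend the inherited ones.
    merged.update(spec.get("graphemes", {}))
    return merged


# Assumed toy specs; only the pt-BR "t" mapping is taken from the README.
specs = {
    "pt-PT": {"graphemes": {"t": ["t"], "lh": ["ʎ"]}},
    "pt-BR": {"graphemes_base": "pt-PT", "graphemes": {"t": ["t", "t͡ʃ"]}},
}

print(resolve_graphemes("pt-BR", specs))
```

Here the dialect declares one entry and picks up ⟨lh⟩ from its parent unchanged.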
29
+
30
+ ## Installation
31
+
32
+ ```bash
33
+ pip install orthography2ipa
34
+ ```
35
+
36
+ For the optional Arabic G2P backend:
37
+
38
+ ```bash
39
+ pip install "orthography2ipa[arabic]"
40
+ ```
41
+
42
+ ## Quick start
43
+
44
+ ### Python API
45
+
46
+ ```python
47
+ import orthography2ipa
48
+
49
+ # Get a language spec
50
+ en = orthography2ipa.get("en-GB")
51
+
52
+ # Grapheme → IPA candidates
53
+ en.graphemes["th"] # ['θ', 'ð']
54
+
55
+ # Allophone map: how /t/ surfaces
56
+ en.allophones["t"] # ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
57
+
58
+ # Metadata
59
+ en.name # 'British English (RP)'
60
+ en.family # 'Germanic'
61
+ en.script # 'Latin'
62
+
63
+ # Regional variants share ancestry but diverge where pronunciation does
64
+ pt_br = orthography2ipa.get("pt-BR")
65
+ pt_br.graphemes["t"] # ['t', 't͡ʃ'] — palatalisation before /i/
66
+
67
+ # ISO 639-3 aliases resolve to BCP-47 codes
68
+ orthography2ipa.get("eng").name # 'British English (RP)'
69
+
70
+ # Discover what's available
71
+ orthography2ipa.available_codes()
72
+ orthography2ipa.available_families()
73
+ ```
74
+
75
+ ### IPA tokenizer
76
+
77
+ `PhonetokTokenizer` performs maximal-munch grapheme tokenization with beam-search IPA expansion, ranking candidate transcriptions when a spelling is ambiguous:
78
+
79
+ ```python
80
+ from orthography2ipa import get
81
+ from orthography2ipa.phonetok import PhonetokTokenizer
82
+
83
+ tok = PhonetokTokenizer(get("en-GB"))
84
+
85
+ tok.ipa_best("through") # 'θɹɔː'
86
+ for path in tok.ipa_beam("through", beam_width=8):
87
+     print(path.ipa, path.score) # θɹɔː 0.0, ðɹɔː 1.0, θɹoʊ 1.0, …
88
+ ```
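For readers unfamiliar with the term, maximal munch means always taking the longest known grapheme at the current position. The following is a generic sketch of that idea and not the `PhonetokTokenizer` implementation; the function name and the toy grapheme set (including treating ⟨ough⟩ as a unit) are assumptions.

```python
def maximal_munch(word, graphemes):
    """Split `word` greedily into the longest graphemes found in `graphemes`."""
    longest = max(map(len, graphemes), default=1)
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            # A single unknown character is passed through as its own token.
            if chunk in graphemes or size == 1:
                tokens.append(chunk)
                i += size
                break
    return tokens


# Assumed toy inventory, not the real en-GB grapheme table.
toy = {"th", "t", "h", "r", "ough", "o", "u", "g"}
print(maximal_munch("through", toy))
```

Each token can then be looked up in the grapheme map to get its IPA candidates, which is where an ambiguous spelling fans out into several ranked paths.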
89
+
90
+ ### Distance metrics
91
+
92
+ Compare two languages across inventory, grapheme, allophone, and ancestry dimensions:
93
+
94
+ ```python
95
+ from orthography2ipa import get
96
+ from orthography2ipa.distance import phonological_distance
97
+
98
+ d = phonological_distance(get("pt-BR"), get("pt-PT"))
99
+ d.combined # 0.04 — near-identical
100
+ d.inventory.feature_mean # phoneme-inventory distance
101
+ d.grapheme.mean_ipa_distance # grapheme-mapping divergence
102
+ d.allophone_sim # allophone-overlap similarity
103
+ ```
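As a rough intuition for an overlap-based similarity such as `allophone_sim`, a Jaccard index over two sets of surface forms is one common choice. This is an assumption for illustration only: the README does not state which formula the package uses, and `jaccard_similarity` is a hypothetical helper.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union (1.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up allophone sets for /t/ in two hypothetical varieties.
variety_a = ["t", "tʰ", "ɾ"]
variety_b = ["t", "ɾ", "ʔ"]
print(jaccard_similarity(variety_a, variety_b))
```

A value of 1.0 means identical sets and 0.0 means no shared realisations.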
104
+
105
+ Script-level distance and feature vectors are available via `script_distance.py` and `feats.py`.
106
+
107
+ ## Command-line interface
108
+
109
+ After installation the `orthography2ipa` command is available. Every subcommand accepts `--json` for machine-readable output.
110
+
111
+ ```bash
112
+ # List languages and families
113
+ orthography2ipa list
114
+ orthography2ipa list --families
115
+ orthography2ipa list --family Romance
116
+
117
+ # Inspect a language
118
+ orthography2ipa info pt-BR
119
+ orthography2ipa info pt-BR --graphemes
120
+ orthography2ipa info pt-BR --json
121
+
122
+ # Transcribe text to IPA (beam-ranked candidates)
123
+ orthography2ipa transcribe pt-BR "chuva"
124
+ orthography2ipa transcribe en-GB "through" --beam 8
125
+
126
+ # Phonological distance between two languages
127
+ orthography2ipa distance pt-BR pt-PT
128
+ orthography2ipa distance es-ES it-IT --json
129
+ ```
130
+
131
+ ## Languages
132
+
133
+ | Family | Examples |
134
+ |------------|----------|
135
+ | Romance | `pt-PT`, `pt-BR`, `es-ES`, `es-AR`, `ca`, `fr-FR`, `it-IT`, `ro-RO`, `gl`, `oc`, `sc`, `an` |
136
+ | Germanic | `en-GB`, `de-DE`, `nl-NL`, `sv-SE`, `da-DK`, `no-NO`, `af` |
137
+ | Slavic | `ru-RU`, `uk-UA`, `pl-PL`, `cs-CZ`, `sr-RS`, `hr-HR`, `bg-BG` |
138
+ | Celtic | `cy`, `ga`, `gd`, `br`, `kw`, `gv` |
139
+ | Indo-Aryan | `hi-IN`, `bn-BD`, `ur-PK`, `ne-NP`, `pa`, `gu`, `mr` |
140
+ | Semitic | `arb`, `he-IL`, `mt` |
141
+ | Turkic | `tr-TR`, `az`, `kk`, `uz` |
142
+ | Hellenic | `el-GR` |
143
+ | Uralic | `fi-FI`, `hu-HU`, `et-EE` |
144
+ | Japonic | `ja` |
145
+ | Sinitic | `zh` |
146
+ | Koreanic | `ko` |
147
+
148
+ 350+ codes across 40+ family groupings, including reconstructed proto-languages and fine-grained regional dialects.
149
+
150
+ ## Data structure
151
+
152
+ ```python
153
+ @dataclass(frozen=True)
154
+ class LanguageSpec:
155
+     code: str # 'pt-BR'
156
+     name: str # 'Brazilian Portuguese'
157
+     family: str # 'Romance'
158
+     script: str # 'Latin'
159
+     graphemes: Dict[str, List[str]] # 'th' → ['θ', 'ð']
160
+     allophones: Dict[str, List[str]] # 't' → ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
161
+     positional_graphemes: Dict[...] # context-sensitive overrides
162
+     parent: Optional[str] # primary parent code
163
+     ancestors: Tuple[Ancestor, ...] # weighted multi-ancestor lineage
164
+     quality: QualityTier # stub | skeleton | research | production
165
+     script_type: ScriptType # alphabet | abjad | abugida | ...
166
+     sandhi_rules: Tuple[SandhiRule, ...] # cross-word rules
167
+     tone_inventory: Optional[Dict] # tone marks → labels
168
+     sources: Tuple[LinguisticSource, ...] # bibliographic references
169
+ ```
170
+
171
+ When a spec declares graphemes but no explicit allophone map, a baseline identity allophone map is derived: every phoneme a grapheme can produce is, at minimum, its own surface realisation.
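The baseline derivation described above can be sketched in a few lines. The function name is hypothetical and this is not the loader's actual code; it only shows what "every phoneme is its own surface realisation" amounts to.

```python
def identity_allophones(graphemes):
    """Hypothetical helper: map every phoneme reachable from a grapheme to itself."""
    phonemes = {p for candidates in graphemes.values() for p in candidates}
    return {p: [p] for p in sorted(phonemes)}


# Toy grapheme map; the "t" entry is an assumption.
print(identity_allophones({"th": ["θ", "ð"], "t": ["t"]}))
```

A spec built this way still answers every allophone lookup, it just carries no contextual variation until an explicit map is supplied.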
172
+
173
+ ## Design principles
174
+
175
+ - **Linguistically motivated only** — digraphs like English ⟨th⟩, Portuguese ⟨lh⟩, or German ⟨sch⟩ are included because they are standard orthographic units; arbitrary substrings are not.
176
+ - **Graphemes ≠ allophones** — spelling-to-phoneme and phoneme-to-surface are modelled separately.
177
+ - **Regional variants** — where pronunciation diverges systematically, a separate `LanguageSpec` is provided with ancestry links.
178
+ - **Multi-ancestor inheritance** — `graphemes_base`/`allophones_base` let dialect trees declare only their differences.
179
+ - **Pure data, pluggable logic** — mappings are declarative JSON; algorithmic G2P (e.g. Arabic) uses the plugin system.
180
+
181
+ ## Plugins
182
+
183
+ Algorithmic G2P backends register under the `orthography2ipa.g2p` entry-point group. The bundled Arabic plugin (`plugins/arabic_g2p.py`) handles consonant mapping, harakat vowels, sun-letter assimilation, hamzat al-wasl elision, and tanwin forms.
184
+
185
+ A neural Arabic diacritizer (`plugins/tashkeel.py`) is wired as an optional ONNX backend but ships as a documented stub: with no model loaded it returns input unchanged, and the rule-based plugin transcribes whatever diacritics are present. Bundling a tashkeel model is planned future work.
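To see which backends are registered under the `orthography2ipa.g2p` group in a given environment, the standard library's `importlib.metadata` can list them. This is a small sketch using only the standard library; the helper name is made up, and the result depends on what is installed (it is an empty list when nothing registers the group).

```python
import sys
from importlib.metadata import entry_points


def g2p_plugin_names(group="orthography2ipa.g2p"):
    """Return the sorted names of entry points registered under `group`."""
    if sys.version_info >= (3, 10):
        # Python 3.10+ supports selecting by group directly.
        found = entry_points(group=group)
    else:
        # Python 3.9 returns a mapping of group name -> entry points.
        found = entry_points().get(group, [])
    return sorted(ep.name for ep in found)


print(g2p_plugin_names())
```

A third-party package would add itself to this list by declaring an entry point in the same group in its own packaging metadata.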
186
+
187
+ ## Contributing
188
+
189
+ To add a language, create `orthography2ipa/data/{code}.json` following `orthography2ipa/data/SCHEMA.md`. For dialects, use `graphemes_base`/`allophones_base` to inherit from the parent.
190
+
191
+ ## License
192
+
193
+ Apache 2.0