PyPI - orthography2ipa - Versions diffs - 0.2.1a1__tar.gz - Mend

orthography2ipa 0.2.1a1__tar.gz

This diff shows the changes between publicly released versions of the package, as they appear in their public registries. It is provided for informational purposes only.

Files changed (468)

orthography2ipa-0.2.1a1/.github/dependabot.yml ADDED

@@ -0,0 +1,11 @@
+# To get started with Dependabot version updates, you'll need to specify which
+# package ecosystems to update and where the package manifests are located.
+# Please see the documentation for all configuration options:
+# https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file
+version: 2
+updates:
+  - package-ecosystem: "pip" # See documentation for possible values
+    directory: "/requirements" # Location of package manifests
+    schedule:
+      interval: "weekly"

orthography2ipa-0.2.1a1/.github/workflows/build-tests.yml ADDED

@@ -0,0 +1,14 @@
+name: Build Tests
+on:
+  pull_request:
+    branches: [dev, master]
+  workflow_dispatch:
+jobs:
+  build:
+    uses: OpenVoiceOS/gh-automations/.github/workflows/build-tests.yml@dev
+    with:
+      python_versions: '["3.10", "3.11", "3.12", "3.13"]'
+      install_extras: 'test'
+      test_path: 'tests'

orthography2ipa-0.2.1a1/.github/workflows/conventional-label.yaml ADDED

@@ -0,0 +1,10 @@
+# auto add labels to PRs
+on:
+  pull_request_target:
+    types: [ opened, edited ]
+name: conventional-release-labels
+jobs:
+  label:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: bcoe/conventional-release-labels@v1

orthography2ipa-0.2.1a1/.github/workflows/coverage.yml ADDED

@@ -0,0 +1,16 @@
+name: Code Coverage
+on:
+  pull_request:
+    branches: [dev]
+  workflow_dispatch:
+jobs:
+  coverage:
+    uses: OpenVoiceOS/gh-automations/.github/workflows/coverage.yml@dev
+    with:
+      python_version: '3.11'
+      coverage_source: 'orthography2ipa'
+      test_path: 'tests/'
+      install_extras: 'test'
+      min_coverage: 0

orthography2ipa-0.2.1a1/.github/workflows/license_check.yml ADDED

@@ -0,0 +1,10 @@
+name: License Check
+on:
+  pull_request:
+    branches: [dev]
+  workflow_dispatch:
+jobs:
+  license_check:
+    uses: OpenVoiceOS/gh-automations/.github/workflows/license-check.yml@dev

orthography2ipa-0.2.1a1/.github/workflows/publish_stable.yml ADDED

@@ -0,0 +1,23 @@
+name: Publish Stable Release
+on:
+  workflow_dispatch:
+  push:
+    branches: [master]
+permissions:
+  contents: write   # required for version bump commit and release tag
+jobs:
+  publish_stable:
+    if: github.actor != 'github-actions[bot]'
+    uses: OpenVoiceOS/gh-automations/.github/workflows/publish-stable.yml@dev
+    secrets:
+      PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+      MATRIX_TOKEN: ${{ secrets.MATRIX_TOKEN }}
+    with:
+      version_file: 'orthography2ipa/version.py'
+      publish_pypi: true
+      publish_release: true
+      sync_dev: true
+      notify_matrix: true

orthography2ipa-0.2.1a1/.github/workflows/release_workflow.yml ADDED

@@ -0,0 +1,28 @@
+name: Release Alpha and Propose Stable
+on:
+  workflow_dispatch:
+  pull_request:
+    types: [closed]
+    branches: [dev]
+permissions:
+  contents: write
+  pull-requests: write
+jobs:
+  publish_alpha:
+    if: github.event.pull_request.merged == true || github.event_name == 'workflow_dispatch'
+    uses: OpenVoiceOS/gh-automations/.github/workflows/publish-alpha.yml@dev
+    secrets:
+      PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+      MATRIX_TOKEN: ${{ secrets.MATRIX_TOKEN }}
+    with:
+      branch: 'dev'
+      version_file: 'orthography2ipa/version.py'
+      update_changelog: true
+      publish_prerelease: true
+      propose_release: true
+      changelog_max_issues: 100
+      publish_pypi: true
+      notify_matrix: true

orthography2ipa-0.2.1a1/.gitignore ADDED

@@ -0,0 +1,36 @@
+/dump/*
+CLAUDE.md
+.claude
+# Python
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+.Python
+# Distribution / packaging
+*.egg-info/
+dist/
+build/
+*.egg
+MANIFEST
+# Testing
+.pytest_cache/
+.coverage
+coverage.xml
+htmlcov/
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+# Virtual environments
+.venv/
+venv/
+env/
+TODO.md
+ROADMAP.md

orthography2ipa-0.2.1a1/AGENTS.md ADDED

@@ -0,0 +1,64 @@
+# orthography2ipa — Agent Guide
+Pure-data Python package: linguistically motivated grapheme→IPA and allophone mappings for 350+ language codes (356 JSON specs), plus a maximal-munch IPA tokenizer, phonological/script distance metrics, dialect transforms, and a pluggable G2P plugin system (e.g. algorithmic Arabic).
+## Setup
+```bash
+pip install -e .
+# optional algorithmic Arabic G2P (ONNX diacritization):
+pip install -e ".[arabic]"
+```
+Runtime deps are minimal: `numpy`, `langcodes` (see `requirements.txt`). `langcodes` is used for ISO 639-3 → BCP-47 normalisation, with a hand-maintained fallback alias table in `registry.py`.
+## Test
+```bash
+pytest tests
+# with coverage (as CI runs it):
+pytest --cov=orthography2ipa --cov-report xml tests
+```
+`tests/pytest.ini` and `tests/conftest.py` configure the suite. There is a broad per-family test layout (`test_iberian.py`, `test_celtic.py`, `test_slavic.py`, `test_germanic.py`, `test_indo_iranian.py`, …) plus `test_all_languages.py` and `test_language_integrity.py` that sweep every data file.
+## Lint/Typecheck
+No linter or type checker is configured. Code uses `from __future__ import annotations` and typed dataclasses but there is no mypy/ruff/flake8 config.
+## Layout
+- `orthography2ipa/types.py` — frozen dataclasses: `LanguageSpec`, `Grapheme2IPA`, `AllophoneMap`, `Ancestor`, `PositionalGrapheme2IPA`, `SandhiRule`; enums `QualityTier`/`ScriptType`/`AncestorRole`.
+- `orthography2ipa/data/*.json` — 356 language/dialect spec files (the actual payload). `data/SCHEMA.md` documents the format; dialects inherit via `graphemes_base`/`allophones_base`. `data/lexicons/*.csv` hold reference word lists.
+- `orthography2ipa/json_loader.py` — loads JSON specs and lexicons, resolves multi-ancestor inheritance.
+- `orthography2ipa/registry.py` — `get()`, `available_codes()`, `available_families()`; lazy cache + plugin discovery + ISO alias table.
+- `orthography2ipa/phonetok.py` — `PhonetokTokenizer`, beam-search IPA expansion (`IPAPath`, `Token`, `TokenKind`).
+- `orthography2ipa/distance.py` + `feats.py` + `script_distance.py` — phonological/inventory/grapheme/tone/script distance metrics and feature vectors.
+- `orthography2ipa/transforms.py` + `sandhi.py` + `lm.py` — dialect transforms, sandhi rules, language-model scoring helpers.
+- `orthography2ipa/g2p_plugin.py` — `G2PPlugin` base; `plugins/arabic_g2p.py`, `plugins/tashkeel.py`, `plugins/arabic_utils.py` implement algorithmic Arabic G2P.
+- `orthography2ipa/cli.py` — `orthography2ipa` console entry point (`list`, `info`, `transcribe`, `distance`; all support `--json`).
+- `examples/` — runnable usage demos; `docs/` — Markdown reference (architecture, data model, tokenizer, distance, adding a language, bibliography).
+### Entry-point groups
+- `[project.scripts]` → `orthography2ipa = orthography2ipa.cli:main` (CLI).
+- `[project.entry-points."orthography2ipa.g2p"]` → `arabic = orthography2ipa.plugins.arabic_g2p:ArabicG2PPlugin`. This is a **package-private** plugin group (not an OVOS/OPM group); third parties register algorithmic G2P backends here.
+## Conventions (Org hard rules)
+- Branches: `dev` for work, `master` for stable. NEVER `main`.
+- Never edit `orthography2ipa/version.py` — gh-automations bumps semver from conventional-commit prefixes (`feat:`, `fix:`, `feat!:`).
+- New repos private by default; do not make source public without asking.
+- Commit identity: `JarbasAi <jarbasai@mailfence.com>`.
+- Reference `TigreGotico`/`OpenVoiceOS` gh-automations reusable workflows at `@dev` (the workflows under `.github/workflows/` in this repo pin `OpenVoiceOS/gh-automations` at `@dev`).
+- No Neon / `neon-*` references.
+- No meta-commentary: describe current state only — no history, dates, or "design mistake" framing in docs/commits/PRs/comments.
+- CI is provided by gh-automations reusable workflows.
+## Gotchas
+- This is **pure data + logic, no trained network weights** despite living in the ML cluster — the only model artifact is the optional ONNX Arabic diacritizer, and `plugins/tashkeel.py` still has `# TODO: Load and run ONNX model for diacritization` (the ONNX path is not wired up).
+- `dynamic = ["version", "dependencies"]`: version comes from `orthography2ipa/version.py` attr, deps from `requirements.txt`. The release workflows reference a `setup.py` that is not present in the tree — packaging is `pyproject`-only, so the `setup.py`-based release steps will fail.
+- `QualityTier` ranges from `stub`/`skeleton` through `research`/`production`; not every one of the 356 specs is `production` quality. Check `spec.quality` before relying on a mapping.
+- Graphemes ≠ allophones: `graphemes` maps a spelling to the phonemes it can represent; `allophones` maps a phoneme to its contextual surface forms. Keep them distinct.
+- Many scratch report files (AUDIT.md, MAINTENANCE_REPORT.md, SUGGESTIONS.md, PLAN.md, QUICK_FACTS.md, FAQ.md) and 78 `.pyc` files are committed despite `.gitignore` — do not add more.

orthography2ipa-0.2.1a1/CHANGELOG.md ADDED

@@ -0,0 +1,21 @@
+# Changelog
+## [0.2.1a1](https://github.com/TigreGotico/orthography2ipa/tree/0.2.1a1) (2026-06-10)
+[Full Changelog](https://github.com/TigreGotico/orthography2ipa/compare/73cf93d2bc10be1e32a61e00a380c4ed632a0148...0.2.1a1)
+**Merged pull requests:**
+- fix: py3.9 annotation compatibility, plugin-failure logging, public exports [\#17](https://github.com/TigreGotico/orthography2ipa/pull/17) ([JarbasAl](https://github.com/JarbasAl))
+- Update phonetic representation of graphemes in an.json [\#14](https://github.com/TigreGotico/orthography2ipa/pull/14) ([Juanpabl](https://github.com/Juanpabl))
+- feat: ast+gl [\#8](https://github.com/TigreGotico/orthography2ipa/pull/8) ([JarbasAl](https://github.com/JarbasAl))
+- Latin graphemes + portuguese 4way sibilant distinction [\#7](https://github.com/TigreGotico/orthography2ipa/pull/7) ([JarbasAl](https://github.com/JarbasAl))
+- refactor to json [\#6](https://github.com/TigreGotico/orthography2ipa/pull/6) ([JarbasAl](https://github.com/JarbasAl))
+- feat: positional graphemmes [\#4](https://github.com/TigreGotico/orthography2ipa/pull/4) ([JarbasAl](https://github.com/JarbasAl))
+- add release automations [\#3](https://github.com/TigreGotico/orthography2ipa/pull/3) ([JarbasAl](https://github.com/JarbasAl))
+- celtic [\#2](https://github.com/TigreGotico/orthography2ipa/pull/2) ([JarbasAl](https://github.com/JarbasAl))
+- add tests [\#1](https://github.com/TigreGotico/orthography2ipa/pull/1) ([JarbasAl](https://github.com/JarbasAl))
+\* *This Changelog was automatically generated by [github_changelog_generator](https://github.com/github-changelog-generator/github-changelog-generator)*
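Several of the merged pull requests above carry conventional-commit prefixes (`fix:`, `feat:`), which AGENTS.md says drive the version bump. As a rough illustration, the usual conventional-commits-to-semver mapping can be sketched as follows. This is an assumption about the convention in general, not a copy of what the gh-automations workflows do, and `bump_kind` is a hypothetical name.

```python
import re

# Conventional-commit header: type, optional "(scope)", optional "!", then ": ".
_HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<breaking>!)?: ")


def bump_kind(title):
    """Classify a commit or PR title as 'major', 'minor', 'patch', or None."""
    match = _HEADER.match(title)
    if match is None:
        return None                  # no conventional prefix, e.g. "celtic"
    if match.group("breaking"):
        return "major"               # "feat!:" or "fix!:" marks a breaking change
    if match.group("type") == "feat":
        return "minor"
    if match.group("type") == "fix":
        return "patch"
    return None                      # other types do not bump in this sketch


titles = [
    "fix: py3.9 annotation compatibility, plugin-failure logging, public exports",
    "feat: ast+gl",
    "refactor to json",
    "feat!: hypothetical breaking change",  # illustrative, not in the changelog
]
for title in titles:
    print(bump_kind(title), "<-", title)
```

Titles without a recognised prefix, such as "refactor to json", fall through to `None` here; how the real automation treats them is not visible in this diff.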

orthography2ipa-0.2.1a1/PKG-INFO ADDED

@@ -0,0 +1,227 @@
+Metadata-Version: 2.4
+Name: orthography2ipa
+Version: 0.2.1a1
+Summary: Linguistically motivated grapheme-to-IPA and allophone mappings for 350+ language codes
+License: Apache-2.0
+Project-URL: Homepage, https://github.com/TigreGotico/orthography2ipa
+Project-URL: Issues, https://github.com/TigreGotico/orthography2ipa/issues
+Keywords: ipa,phonetics,phonology,grapheme,allophone,linguistics,nlp
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Topic :: Text Processing :: Linguistic
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+Requires-Dist: numpy
+Requires-Dist: langcodes
+Provides-Extra: arabic
+Requires-Dist: onnxruntime; extra == "arabic"
+Provides-Extra: validation
+Requires-Dist: pydantic>=2; extra == "validation"
+Provides-Extra: test
+Requires-Dist: pytest; extra == "test"
+Requires-Dist: pytest-timeout; extra == "test"
+Requires-Dist: pytest-cov; extra == "test"
+Requires-Dist: pydantic>=2; extra == "test"
+# orthography2ipa
+Linguistically motivated **grapheme→IPA** and **allophone** mappings for **350+ language codes** across 20+ language families — pure data, a maximal-munch IPA tokenizer, and a family of phonological/script distance metrics, with no trained weights to ship.
+Only mappings grounded in official orthography and documented grammar are included. Arbitrary substring rules are excluded.
+## Why two maps
+The central distinction the package enforces:
+- A **grapheme map** tells you which phonemes a spelling *can* represent. English ⟨th⟩ → `['θ', 'ð']`.
+- An **allophone map** tells you how a phoneme *surfaces* in context. English /t/ → `['t', 'tʰ', 'ɾ', 'ʔ', 't̚']`.
+Keeping these separate lets you go from text to phoneme candidates (transcription) and from phonemes to surface realisations (pronunciation modelling) without conflating the two.
+## What each language carries
+Every `LanguageSpec` provides:
+1. **Graphemes** — orthographic units (characters, digraphs, trigraphs) mapped to canonical IPA phonemes.
+2. **Allophones** — each phoneme mapped to its positional/contextual surface realisations.
+3. **Positional graphemes** — context-sensitive overrides (word-initial, intervocalic, before /i/, …).
+4. **Ancestry** — weighted multi-ancestor lineage (parent, substrate, superstrate, adstrate, …) for dialect trees.
+5. **Sandhi rules** — cross-word phonological processes.
+6. **Tone inventory** — tone marks → labels, where applicable.
+7. **Provenance** — `QualityTier` (stub → skeleton → research → production), `ScriptType`, and bibliographic sources.
+Regional varieties get their own `LanguageSpec` objects linked through ancestry, and JSON data files support `graphemes_base`/`allophones_base` inheritance so a dialect only declares what differs from its parent.
+## Installation
+```bash
+pip install orthography2ipa
+```
+For the optional Arabic G2P backend:
+```bash
+pip install "orthography2ipa[arabic]"
+```
+## Quick start
+### Python API
+```python
+import orthography2ipa
+# Get a language spec
+en = orthography2ipa.get("en-GB")
+# Grapheme → IPA candidates
+en.graphemes["th"]    # ['θ', 'ð']
+# Allophone map: how /t/ surfaces
+en.allophones["t"]    # ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
+# Metadata
+en.name               # 'British English (RP)'
+en.family             # 'Germanic'
+en.script             # 'Latin'
+# Regional variants share ancestry but diverge where pronunciation does
+pt_br = orthography2ipa.get("pt-BR")
+pt_br.graphemes["t"]  # ['t', 't͡ʃ']   — palatalisation before /i/
+# ISO 639-3 aliases resolve to BCP-47 codes
+orthography2ipa.get("eng").name   # 'British English (RP)'
+# Discover what's available
+orthography2ipa.available_codes()
+orthography2ipa.available_families()
+```
+### IPA tokenizer
+`PhonetokTokenizer` performs maximal-munch grapheme tokenization with beam-search IPA expansion, ranking candidate transcriptions when a spelling is ambiguous:
+```python
+from orthography2ipa import get
+from orthography2ipa.phonetok import PhonetokTokenizer
+tok = PhonetokTokenizer(get("en-GB"))
+tok.ipa_best("through")              # 'θɹɔː'
+for path in tok.ipa_beam("through", beam_width=8):
+    print(path.ipa, path.score)      # θɹɔː 0.0, ðɹɔː 1.0, θɹoʊ 1.0, …
+```
+### Distance metrics
+Compare two languages across inventory, grapheme, allophone, and ancestry dimensions:
+```python
+from orthography2ipa import get
+from orthography2ipa.distance import phonological_distance
+d = phonological_distance(get("pt-BR"), get("pt-PT"))
+d.combined                    # 0.04 — near-identical
+d.inventory.feature_mean      # phoneme-inventory distance
+d.grapheme.mean_ipa_distance  # grapheme-mapping divergence
+d.allophone_sim               # allophone-overlap similarity
+```
+Script-level distance and feature vectors are available via `script_distance.py` and `feats.py`.
+## Command-line interface
+After installation the `orthography2ipa` command is available. Every subcommand accepts `--json` for machine-readable output.
+```bash
+# List languages and families
+orthography2ipa list
+orthography2ipa list --families
+orthography2ipa list --family Romance
+# Inspect a language
+orthography2ipa info pt-BR
+orthography2ipa info pt-BR --graphemes
+orthography2ipa info pt-BR --json
+# Transcribe text to IPA (beam-ranked candidates)
+orthography2ipa transcribe pt-BR "chuva"
+orthography2ipa transcribe en-GB "through" --beam 8
+# Phonological distance between two languages
+orthography2ipa distance pt-BR pt-PT
+orthography2ipa distance es-ES it-IT --json
+```
+## Languages
+| Family     | Examples |
+|------------|----------|
+| Romance    | `pt-PT`, `pt-BR`, `es-ES`, `es-AR`, `ca`, `fr-FR`, `it-IT`, `ro-RO`, `gl`, `oc`, `sc`, `an` |
+| Germanic   | `en-GB`, `de-DE`, `nl-NL`, `sv-SE`, `da-DK`, `no-NO`, `af` |
+| Slavic     | `ru-RU`, `uk-UA`, `pl-PL`, `cs-CZ`, `sr-RS`, `hr-HR`, `bg-BG` |
+| Celtic     | `cy`, `ga`, `gd`, `br`, `kw`, `gv` |
+| Indo-Aryan | `hi-IN`, `bn-BD`, `ur-PK`, `ne-NP`, `pa`, `gu`, `mr` |
+| Semitic    | `arb`, `he-IL`, `mt` |
+| Turkic     | `tr-TR`, `az`, `kk`, `uz` |
+| Hellenic   | `el-GR` |
+| Uralic     | `fi-FI`, `hu-HU`, `et-EE` |
+| Japonic    | `ja` |
+| Sinitic    | `zh` |
+| Koreanic   | `ko` |
+350+ codes across 40+ family groupings, including reconstructed proto-languages and fine-grained regional dialects.
+## Data structure
+```python
+@dataclass(frozen=True)
+class LanguageSpec:
+    code: str                              # 'pt-BR'
+    name: str                              # 'Brazilian Portuguese'
+    family: str                            # 'Romance'
+    script: str                            # 'Latin'
+    graphemes: Dict[str, List[str]]        # 'th' → ['θ', 'ð']
+    allophones: Dict[str, List[str]]       # 't' → ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
+    positional_graphemes: Dict[...]        # context-sensitive overrides
+    parent: Optional[str]                  # primary parent code
+    ancestors: Tuple[Ancestor, ...]        # weighted multi-ancestor lineage
+    quality: QualityTier                   # stub | skeleton | research | production
+    script_type: ScriptType                # alphabet | abjad | abugida | ...
+    sandhi_rules: Tuple[SandhiRule, ...]   # cross-word rules
+    tone_inventory: Optional[Dict]         # tone marks → labels
+    sources: Tuple[LinguisticSource, ...]  # bibliographic references
+```
+When a spec declares graphemes but no explicit allophone map, a baseline identity allophone map is derived: every phoneme a grapheme can produce is, at minimum, its own surface realisation.
+## Design principles
+- **Linguistically motivated only** — digraphs like English ⟨th⟩, Portuguese ⟨lh⟩, or German ⟨sch⟩ are included because they are standard orthographic units; arbitrary substrings are not.
+- **Graphemes ≠ allophones** — spelling-to-phoneme and phoneme-to-surface are modelled separately.
+- **Regional variants** — where pronunciation diverges systematically, a separate `LanguageSpec` is provided with ancestry links.
+- **Multi-ancestor inheritance** — `graphemes_base`/`allophones_base` let dialect trees declare only their differences.
+- **Pure data, pluggable logic** — mappings are declarative JSON; algorithmic G2P (e.g. Arabic) uses the plugin system.
+## Plugins
+Algorithmic G2P backends register under the `orthography2ipa.g2p` entry-point group. The bundled Arabic plugin (`plugins/arabic_g2p.py`) handles consonant mapping, harakat vowels, sun-letter assimilation, hamzat al-wasl elision, and tanwin forms.
+A neural Arabic diacritizer (`plugins/tashkeel.py`) is wired as an optional ONNX backend but ships as a documented stub: with no model loaded it returns input unchanged, and the rule-based plugin transcribes whatever diacritics are present. Bundling a tashkeel model is planned future work.
+## Contributing
+To add a language, create `orthography2ipa/data/{code}.json` following `orthography2ipa/data/SCHEMA.md`. For dialects, use `graphemes_base`/`allophones_base` to inherit from the parent.
+## License
+Apache 2.0
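The package description refers to maximal-munch grapheme tokenization followed by expansion into IPA candidates. The standalone sketch below illustrates that idea on a toy map. It is not `PhonetokTokenizer`: it enumerates every combination instead of running a scored beam search, and only the ⟨th⟩ entry comes from the README; the other entries and both function names are illustrative assumptions.

```python
from itertools import product

# Toy grapheme map: "th" is taken from the README, the rest is illustrative.
GRAPHEMES = {
    "th": ["θ", "ð"],
    "t": ["t"],
    "h": ["h"],
    "i": ["ɪ"],
    "n": ["n"],
}


def maximal_munch(word, graphemes):
    """Split `word` greedily, always taking the longest known grapheme."""
    longest = max(map(len, graphemes))
    units, i = [], 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in graphemes:
                units.append(chunk)
                i += size
                break
        else:
            units.append(word[i])    # unknown character: pass it through
            i += 1
    return units


def candidates(word, graphemes):
    """Every phoneme string obtainable from the greedy segmentation."""
    options = [graphemes.get(u, [u]) for u in maximal_munch(word, graphemes)]
    return ["".join(combo) for combo in product(*options)]


print(maximal_munch("thin", GRAPHEMES))  # ['th', 'i', 'n']
print(candidates("thin", GRAPHEMES))     # ['θɪn', 'ðɪn']
```

Because the longest match wins, ⟨th⟩ is consumed as one unit and never split into ⟨t⟩ + ⟨h⟩. Exhaustive enumeration grows multiplicatively with each ambiguous grapheme, which is the usual motivation for pruning with a beam.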

orthography2ipa-0.2.1a1/README.md ADDED

@@ -0,0 +1,193 @@
+# orthography2ipa
+Linguistically motivated **grapheme→IPA** and **allophone** mappings for **350+ language codes** across 20+ language families — pure data, a maximal-munch IPA tokenizer, and a family of phonological/script distance metrics, with no trained weights to ship.
+Only mappings grounded in official orthography and documented grammar are included. Arbitrary substring rules are excluded.
+## Why two maps
+The central distinction the package enforces:
+- A **grapheme map** tells you which phonemes a spelling *can* represent. English ⟨th⟩ → `['θ', 'ð']`.
+- An **allophone map** tells you how a phoneme *surfaces* in context. English /t/ → `['t', 'tʰ', 'ɾ', 'ʔ', 't̚']`.
+Keeping these separate lets you go from text to phoneme candidates (transcription) and from phonemes to surface realisations (pronunciation modelling) without conflating the two.
+## What each language carries
+Every `LanguageSpec` provides:
+1. **Graphemes** — orthographic units (characters, digraphs, trigraphs) mapped to canonical IPA phonemes.
+2. **Allophones** — each phoneme mapped to its positional/contextual surface realisations.
+3. **Positional graphemes** — context-sensitive overrides (word-initial, intervocalic, before /i/, …).
+4. **Ancestry** — weighted multi-ancestor lineage (parent, substrate, superstrate, adstrate, …) for dialect trees.
+5. **Sandhi rules** — cross-word phonological processes.
+6. **Tone inventory** — tone marks → labels, where applicable.
+7. **Provenance** — `QualityTier` (stub → skeleton → research → production), `ScriptType`, and bibliographic sources.
+Regional varieties get their own `LanguageSpec` objects linked through ancestry, and JSON data files support `graphemes_base`/`allophones_base` inheritance so a dialect only declares what differs from its parent.
+## Installation
+```bash
+pip install orthography2ipa
+```
+For the optional Arabic G2P backend:
+```bash
+pip install "orthography2ipa[arabic]"
+```
+## Quick start
+### Python API
+```python
+import orthography2ipa
+# Get a language spec
+en = orthography2ipa.get("en-GB")
+# Grapheme → IPA candidates
+en.graphemes["th"]    # ['θ', 'ð']
+# Allophone map: how /t/ surfaces
+en.allophones["t"]    # ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
+# Metadata
+en.name               # 'British English (RP)'
+en.family             # 'Germanic'
+en.script             # 'Latin'
+# Regional variants share ancestry but diverge where pronunciation does
+pt_br = orthography2ipa.get("pt-BR")
+pt_br.graphemes["t"]  # ['t', 't͡ʃ']   — palatalisation before /i/
+# ISO 639-3 aliases resolve to BCP-47 codes
+orthography2ipa.get("eng").name   # 'British English (RP)'
+# Discover what's available
+orthography2ipa.available_codes()
+orthography2ipa.available_families()
+```
+### IPA tokenizer
+`PhonetokTokenizer` performs maximal-munch grapheme tokenization with beam-search IPA expansion, ranking candidate transcriptions when a spelling is ambiguous:
+```python
+from orthography2ipa import get
+from orthography2ipa.phonetok import PhonetokTokenizer
+tok = PhonetokTokenizer(get("en-GB"))
+tok.ipa_best("through")              # 'θɹɔː'
+for path in tok.ipa_beam("through", beam_width=8):
+    print(path.ipa, path.score)      # θɹɔː 0.0, ðɹɔː 1.0, θɹoʊ 1.0, …
+```
+### Distance metrics
+Compare two languages across inventory, grapheme, allophone, and ancestry dimensions:
+```python
+from orthography2ipa import get
+from orthography2ipa.distance import phonological_distance
+d = phonological_distance(get("pt-BR"), get("pt-PT"))
+d.combined                    # 0.04 — near-identical
+d.inventory.feature_mean      # phoneme-inventory distance
+d.grapheme.mean_ipa_distance  # grapheme-mapping divergence
+d.allophone_sim               # allophone-overlap similarity
+```
+Script-level distance and feature vectors are available via `script_distance.py` and `feats.py`.
+## Command-line interface
+After installation the `orthography2ipa` command is available. Every subcommand accepts `--json` for machine-readable output.
+```bash
+# List languages and families
+orthography2ipa list
+orthography2ipa list --families
+orthography2ipa list --family Romance
+# Inspect a language
+orthography2ipa info pt-BR
+orthography2ipa info pt-BR --graphemes
+orthography2ipa info pt-BR --json
+# Transcribe text to IPA (beam-ranked candidates)
+orthography2ipa transcribe pt-BR "chuva"
+orthography2ipa transcribe en-GB "through" --beam 8
+# Phonological distance between two languages
+orthography2ipa distance pt-BR pt-PT
+orthography2ipa distance es-ES it-IT --json
+```
+## Languages
+| Family     | Examples |
+|------------|----------|
+| Romance    | `pt-PT`, `pt-BR`, `es-ES`, `es-AR`, `ca`, `fr-FR`, `it-IT`, `ro-RO`, `gl`, `oc`, `sc`, `an` |
+| Germanic   | `en-GB`, `de-DE`, `nl-NL`, `sv-SE`, `da-DK`, `no-NO`, `af` |
+| Slavic     | `ru-RU`, `uk-UA`, `pl-PL`, `cs-CZ`, `sr-RS`, `hr-HR`, `bg-BG` |
+| Celtic     | `cy`, `ga`, `gd`, `br`, `kw`, `gv` |
+| Indo-Aryan | `hi-IN`, `bn-BD`, `ur-PK`, `ne-NP`, `pa`, `gu`, `mr` |
+| Semitic    | `arb`, `he-IL`, `mt` |
+| Turkic     | `tr-TR`, `az`, `kk`, `uz` |
+| Hellenic   | `el-GR` |
+| Uralic     | `fi-FI`, `hu-HU`, `et-EE` |
+| Japonic    | `ja` |
+| Sinitic    | `zh` |
+| Koreanic   | `ko` |
+350+ codes across 40+ family groupings, including reconstructed proto-languages and fine-grained regional dialects.
+## Data structure
+```python
+@dataclass(frozen=True)
+class LanguageSpec:
+    code: str                              # 'pt-BR'
+    name: str                              # 'Brazilian Portuguese'
+    family: str                            # 'Romance'
+    script: str                            # 'Latin'
+    graphemes: Dict[str, List[str]]        # 'th' → ['θ', 'ð']
+    allophones: Dict[str, List[str]]       # 't' → ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
+    positional_graphemes: Dict[...]        # context-sensitive overrides
+    parent: Optional[str]                  # primary parent code
+    ancestors: Tuple[Ancestor, ...]        # weighted multi-ancestor lineage
+    quality: QualityTier                   # stub | skeleton | research | production
+    script_type: ScriptType                # alphabet | abjad | abugida | ...
+    sandhi_rules: Tuple[SandhiRule, ...]   # cross-word rules
+    tone_inventory: Optional[Dict]         # tone marks → labels
+    sources: Tuple[LinguisticSource, ...]  # bibliographic references
+```
+When a spec declares graphemes but no explicit allophone map, a baseline identity allophone map is derived: every phoneme a grapheme can produce is, at minimum, its own surface realisation.
+## Design principles
+- **Linguistically motivated only** — digraphs like English ⟨th⟩, Portuguese ⟨lh⟩, or German ⟨sch⟩ are included because they are standard orthographic units; arbitrary substrings are not.
+- **Graphemes ≠ allophones** — spelling-to-phoneme and phoneme-to-surface are modelled separately.
+- **Regional variants** — where pronunciation diverges systematically, a separate `LanguageSpec` is provided with ancestry links.
+- **Multi-ancestor inheritance** — `graphemes_base`/`allophones_base` let dialect trees declare only their differences.
+- **Pure data, pluggable logic** — mappings are declarative JSON; algorithmic G2P (e.g. Arabic) uses the plugin system.
+## Plugins
+Algorithmic G2P backends register under the `orthography2ipa.g2p` entry-point group. The bundled Arabic plugin (`plugins/arabic_g2p.py`) handles consonant mapping, harakat vowels, sun-letter assimilation, hamzat al-wasl elision, and tanwin forms.
+A neural Arabic diacritizer (`plugins/tashkeel.py`) is wired as an optional ONNX backend but ships as a documented stub: with no model loaded it returns input unchanged, and the rule-based plugin transcribes whatever diacritics are present. Bundling a tashkeel model is planned future work.
+## Contributing
+To add a language, create `orthography2ipa/data/{code}.json` following `orthography2ipa/data/SCHEMA.md`. For dialects, use `graphemes_base`/`allophones_base` to inherit from the parent.
+## License
+Apache 2.0
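The README describes two data-level behaviours: a dialect inheriting its parent's map through `graphemes_base`/`allophones_base`, and a baseline identity allophone map derived when none is declared. The sketch below shows one plausible reading of both on plain dictionaries. It is not the package's `json_loader.py`; the function names and sample data are hypothetical, and it assumes a dialect entry replaces the parent's entry for the same key wholesale, which the README does not specify.

```python
def merge_with_base(base, overrides):
    """Dialect map = the parent's entries with the dialect's entries on top.

    Assumption: an overriding key replaces the parent's list entirely.
    """
    merged = {key: list(values) for key, values in base.items()}
    merged.update({key: list(values) for key, values in overrides.items()})
    return merged


def identity_allophones(graphemes):
    """Map every phoneme that any grapheme can produce to itself."""
    phonemes = []
    for values in graphemes.values():
        for phoneme in values:
            if phoneme not in phonemes:  # keep first-seen order, no duplicates
                phonemes.append(phoneme)
    return {phoneme: [phoneme] for phoneme in phonemes}


# Hypothetical parent and dialect data, for illustration only.
parent = {"t": ["t"], "lh": ["ʎ"]}
dialect = merge_with_base(parent, {"t": ["t", "t͡ʃ"]})

print(dialect)
print(identity_allophones(dialect))
```

With this reading, the dialect only declares the ⟨t⟩ entry that differs, ⟨lh⟩ is inherited unchanged, and each of the three resulting phonemes receives itself as its minimal surface realisation.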