PyPI - normalize-kap-orthography - Versions diffs - 0.1.0__tar.gz - Mend

normalize-kap-orthography 0.1.0__tar.gz

Files changed (10)

normalize_kap_orthography-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Keith Manaloto
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

normalize_kap_orthography-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,127 @@
+Metadata-Version: 2.4
+Name: normalize-kap-orthography
+Version: 0.1.0
+Summary: Normalize Kapampangan words from Spanish-era (1730s) orthography to modern K-based orthography
+Author: Keith Manaloto
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/keithmanaloto/normalize-kap-orthography
+Project-URL: Issues, https://github.com/keithmanaloto/normalize-kap-orthography
+Keywords: kapampangan,pampanga,orthography,nlp,linguistics
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Text Processing :: Linguistic
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Dynamic: license-file
+# normalize-kap-orthography
+A Python utility to normalize Kapampangan words from Spanish-era (1730s) orthography to modern K-based orthography.
+Built to make historical Kapampangan texts — like Bergaño's 1732 *Vocabulario de la Lengua Pampanga* — more accessible to modern readers, researchers, and NLP pipelines.
+## Background
+Before the Spanish conquest, Kapampangans used their own indigenous writing system (Kulitan). Spanish missionaries romanized the language using Spanish orthographic conventions (C, Q, Ñ, LL, etc.). Over the past century, multiple competing romanized orthographies have emerged:
+| System | Also known as | Key features |
+|---|---|---|
+| Spanish-era ("Q & C") | *Súlat Bacúlud*, Old Orthography | Uses QU, C, Ñ, LL; the system used in colonial-era texts |
+| ABAKADA ("K") | *Súlat Wáwâ*, New Orthography | K-based, aligned with the Philippine national orthography |
+| Samson Hybrid | *Ámung Samson* | Retains C before a/o/u, replaces QU→K, adds diacritical marks |
+| Batiáuan Revised | *Súlat Wáwâ a alâng WA* | K-based without W, with diacritical marks |
+This tool converts from the **Spanish-era system** to a **modern K-based form** (closest to ABAKADA). For more on the orthography dispute, see [Pangilinan (2006)](https://sil-philippines-languages.org/ical/papers/pangilinan-Dispute%20on%20Orthography.pdf).
+## Disclaimer
+I am not a linguist; I'm a native Kapampangan speaker who happens to be a computer science graduate. The conversion rules in this tool were identified through patterns I recognized while cleaning historical dictionary data, not through formal linguistic analysis. The script was spot-checked against the dataset and appears accurate, but it has not been exhaustively verified. If you spot errors or have linguistic expertise to contribute, please open an issue or PR.
+## What it does
+The converter applies two phases of transformation:
+**Phase 1 — Spanish letter substitutions:**
+- `QUI` → `KI`, `QUE` → `KE`
+- `C` → `K` (except after `SI`)
+- `Ñ` → `N`, `LL` → `L`
+- Word-initial `V` → `W`
+**Phase 2 — Vowel cluster and diphthong normalization:**
+- `AO` → `O`, `AI`/`AY` → `E` (word-final, non-initial)
+- `UA` → `WA`, `UO` → `WO`
+- Various other diphthong simplifications
+An **exceptions table** handles words that don't follow general patterns, and a **two-pass conversion** catches cascading transformations.
+## Installation
+```shell
+pip install normalize-kap-orthography
+```
+Or just copy `normalize_orthography.py` into your project.
+## Usage
+```python
+from normalize_orthography import convert_orthography
+convert_orthography("QUINANG")   # → "KINANG"
+convert_orthography("VATAUAT")   # → "WATAWAT"
+convert_orthography("QUECAI")    # → "KEKE"
+convert_orthography("KINANG")    # → None (already modern)
+```
+Returns the normalized form, or `None` if no conversion is needed.
+### CLI
+```shell
+python normalize_orthography.py
+```
+Runs a small set of built-in test cases.
+## Limitations
+- **Not linguistically verified.** The rules were identified through pattern recognition by a native speaker, not through formal linguistic analysis. The script was spot-checked against dictionary data but not exhaustively validated.
+- **No diacritical marks.** The script does not handle stress marking, which is important in Kapampangan — e.g., *masakit* (painful) vs. *masákit* (difficult) vs. *másakit* (ill) are three distinct words.
+- **One-directional.** Currently only converts Spanish-era → modern. Reverse conversion is not supported.
+- **Uppercase only.** Input is converted to uppercase internally; output is always uppercase.
+## Origin
+Originally written in Dart as part of v2 of [Learn Kulitan](https://github.com/keithliam/learn-kulitan-app), then rewritten in Python with **Claude Code Opus 4.6**.
+## Real-World Usage
+This script was originally used to normalize ~5,000 words extracted from [*Vocabulario de la Lengua Pampanga*](https://archive.org/details/aqn8189.0001.001.umich.edu/page/1/mode/2up) by Fray Diego Bergaño, originally published in 1732 — one of the earliest known dictionaries of the Kapampángan language. About 40% of entries (1,989 out of 4,971) had their orthography normalized.
+The raw, uncleaned entries and their cleaned, normalized versions are available as part of an open dataset on Hugging Face:
+**[keithmanaloto/kapampangan-dictionary-embeddings](https://huggingface.co/datasets/keithmanaloto/kapampangan-dictionary-embeddings)**
+The dataset also includes LLM-enriched metadata and pre-computed embeddings across multiple models — designed for semantic search, retrieval, and clustering over Kapampángan vocabulary. Both the original 1730s spelling and the normalized modern form are preserved in the dataset.
+For the full story behind the dataset and what I learned building it, see the article:
+[From a 300-Year-Old Dictionary to Hugging Face: I Built Kapampángan's First Embedding Dataset](https://keithmanaloto.medium.com/from-a-300-year-old-dictionary-to-hugging-face-i-built-kapampángans-first-embedding-dataset-dce2b877bd83)
+## Contributing
+Contributions are welcome, especially:
+- Expanding the exceptions table
+- Adding test coverage against known word lists
+- Adding diacritical mark support
+- Supporting additional orthographic target systems
+## License
+MIT
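
The "Uppercase only" limitation listed in this README means callers lose the casing of their input. The following is a minimal sketch of one way to restore it after conversion. `restore_case` is a hypothetical helper that is not part of the package, and it only distinguishes all-lowercase and capitalized inputs; anything else is returned as the converter produced it.

```python
def restore_case(original: str, normalized: str) -> str:
    """Re-apply the casing of `original` to an uppercase `normalized` form.

    Hypothetical helper (not part of normalize-kap-orthography).
    """
    if original.islower():
        # "quinang" -> "kinang"
        return normalized.lower()
    if original[:1].isupper() and original[1:].islower():
        # "Quinang" -> "Kinang"
        return normalized.capitalize()
    # All-caps or mixed-case input: keep the converter's uppercase output.
    return normalized


print(restore_case("quinang", "KINANG"))
print(restore_case("Quinang", "KINANG"))
print(restore_case("QUINANG", "KINANG"))
```

Because the normalized form can differ in length from the input, the sketch maps a casing style rather than copying case letter by letter.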

normalize_kap_orthography-0.1.0/README.md ADDED

@@ -0,0 +1,108 @@
+# normalize-kap-orthography
+A Python utility to normalize Kapampangan words from Spanish-era (1730s) orthography to modern K-based orthography.
+Built to make historical Kapampangan texts — like Bergaño's 1732 *Vocabulario de la Lengua Pampanga* — more accessible to modern readers, researchers, and NLP pipelines.
+## Background
+Before the Spanish conquest, Kapampangans used their own indigenous writing system (Kulitan). Spanish missionaries romanized the language using Spanish orthographic conventions (C, Q, Ñ, LL, etc.). Over the past century, multiple competing romanized orthographies have emerged:
+| System | Also known as | Key features |
+|---|---|---|
+| Spanish-era ("Q & C") | *Súlat Bacúlud*, Old Orthography | Uses QU, C, Ñ, LL; the system used in colonial-era texts |
+| ABAKADA ("K") | *Súlat Wáwâ*, New Orthography | K-based, aligned with the Philippine national orthography |
+| Samson Hybrid | *Ámung Samson* | Retains C before a/o/u, replaces QU→K, adds diacritical marks |
+| Batiáuan Revised | *Súlat Wáwâ a alâng WA* | K-based without W, with diacritical marks |
+This tool converts from the **Spanish-era system** to a **modern K-based form** (closest to ABAKADA). For more on the orthography dispute, see [Pangilinan (2006)](https://sil-philippines-languages.org/ical/papers/pangilinan-Dispute%20on%20Orthography.pdf).
+## Disclaimer
+I am not a linguist; I'm a native Kapampangan speaker who happens to be a computer science graduate. The conversion rules in this tool were identified through patterns I recognized while cleaning historical dictionary data, not through formal linguistic analysis. The script was spot-checked against the dataset and appears accurate, but it has not been exhaustively verified. If you spot errors or have linguistic expertise to contribute, please open an issue or PR.
+## What it does
+The converter applies two phases of transformation:
+**Phase 1 — Spanish letter substitutions:**
+- `QUI` → `KI`, `QUE` → `KE`
+- `C` → `K` (except after `SI`)
+- `Ñ` → `N`, `LL` → `L`
+- Word-initial `V` → `W`
+**Phase 2 — Vowel cluster and diphthong normalization:**
+- `AO` → `O`, `AI`/`AY` → `E` (word-final, non-initial)
+- `UA` → `WA`, `UO` → `WO`
+- Various other diphthong simplifications
+An **exceptions table** handles words that don't follow general patterns, and a **two-pass conversion** catches cascading transformations.
+## Installation
+```shell
+pip install normalize-kap-orthography
+```
+Or just copy `normalize_orthography.py` into your project.
+## Usage
+```python
+from normalize_orthography import convert_orthography
+convert_orthography("QUINANG")   # → "KINANG"
+convert_orthography("VATAUAT")   # → "WATAWAT"
+convert_orthography("QUECAI")    # → "KEKE"
+convert_orthography("KINANG")    # → None (already modern)
+```
+Returns the normalized form, or `None` if no conversion is needed.
+### CLI
+```shell
+python normalize_orthography.py
+```
+Runs a small set of built-in test cases.
+## Limitations
+- **Not linguistically verified.** The rules were identified through pattern recognition by a native speaker, not through formal linguistic analysis. The script was spot-checked against dictionary data but not exhaustively validated.
+- **No diacritical marks.** The script does not handle stress marking, which is important in Kapampangan — e.g., *masakit* (painful) vs. *masákit* (difficult) vs. *másakit* (ill) are three distinct words.
+- **One-directional.** Currently only converts Spanish-era → modern. Reverse conversion is not supported.
+- **Uppercase only.** Input is converted to uppercase internally; output is always uppercase.
+## Origin
+Originally written in Dart as part of v2 of [Learn Kulitan](https://github.com/keithliam/learn-kulitan-app), then rewritten in Python with **Claude Code Opus 4.6**.
+## Real-World Usage
+This script was originally used to normalize ~5,000 words extracted from [*Vocabulario de la Lengua Pampanga*](https://archive.org/details/aqn8189.0001.001.umich.edu/page/1/mode/2up) by Fray Diego Bergaño, originally published in 1732 — one of the earliest known dictionaries of the Kapampángan language. About 40% of entries (1,989 out of 4,971) had their orthography normalized.
+The raw, uncleaned entries and their cleaned, normalized versions are available as part of an open dataset on Hugging Face:
+**[keithmanaloto/kapampangan-dictionary-embeddings](https://huggingface.co/datasets/keithmanaloto/kapampangan-dictionary-embeddings)**
+The dataset also includes LLM-enriched metadata and pre-computed embeddings across multiple models — designed for semantic search, retrieval, and clustering over Kapampángan vocabulary. Both the original 1730s spelling and the normalized modern form are preserved in the dataset.
+For the full story behind the dataset and what I learned building it, see the article:
+[From a 300-Year-Old Dictionary to Hugging Face: I Built Kapampángan's First Embedding Dataset](https://keithmanaloto.medium.com/from-a-300-year-old-dictionary-to-hugging-face-i-built-kapampángans-first-embedding-dataset-dce2b877bd83)
+## Contributing
+Contributions are welcome, especially:
+- Expanding the exceptions table
+- Adding test coverage against known word lists
+- Adding diacritical mark support
+- Supporting additional orthographic target systems
+## License
+MIT

normalize_kap_orthography-0.1.0/normalize_kap_orthography.egg-info/PKG-INFO ADDED

@@ -0,0 +1,127 @@
+Metadata-Version: 2.4
+Name: normalize-kap-orthography
+Version: 0.1.0
+Summary: Normalize Kapampangan words from Spanish-era (1730s) orthography to modern K-based orthography
+Author: Keith Manaloto
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/keithmanaloto/normalize-kap-orthography
+Project-URL: Issues, https://github.com/keithmanaloto/normalize-kap-orthography
+Keywords: kapampangan,pampanga,orthography,nlp,linguistics
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Text Processing :: Linguistic
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Dynamic: license-file
+# normalize-kap-orthography
+A Python utility to normalize Kapampangan words from Spanish-era (1730s) orthography to modern K-based orthography.
+Built to make historical Kapampangan texts — like Bergaño's 1732 *Vocabulario de la Lengua Pampanga* — more accessible to modern readers, researchers, and NLP pipelines.
+## Background
+Before the Spanish conquest, Kapampangans used their own indigenous writing system (Kulitan). Spanish missionaries romanized the language using Spanish orthographic conventions (C, Q, Ñ, LL, etc.). Over the past century, multiple competing romanized orthographies have emerged:
+| System | Also known as | Key features |
+|---|---|---|
+| Spanish-era ("Q & C") | *Súlat Bacúlud*, Old Orthography | Uses QU, C, Ñ, LL; the system used in colonial-era texts |
+| ABAKADA ("K") | *Súlat Wáwâ*, New Orthography | K-based, aligned with the Philippine national orthography |
+| Samson Hybrid | *Ámung Samson* | Retains C before a/o/u, replaces QU→K, adds diacritical marks |
+| Batiáuan Revised | *Súlat Wáwâ a alâng WA* | K-based without W, with diacritical marks |
+This tool converts from the **Spanish-era system** to a **modern K-based form** (closest to ABAKADA). For more on the orthography dispute, see [Pangilinan (2006)](https://sil-philippines-languages.org/ical/papers/pangilinan-Dispute%20on%20Orthography.pdf).
+## Disclaimer
+I am not a linguist; I'm a native Kapampangan speaker who happens to be a computer science graduate. The conversion rules in this tool were identified through patterns I recognized while cleaning historical dictionary data, not through formal linguistic analysis. The script was spot-checked against the dataset and appears accurate, but it has not been exhaustively verified. If you spot errors or have linguistic expertise to contribute, please open an issue or PR.
+## What it does
+The converter applies two phases of transformation:
+**Phase 1 — Spanish letter substitutions:**
+- `QUI` → `KI`, `QUE` → `KE`
+- `C` → `K` (except after `SI`)
+- `Ñ` → `N`, `LL` → `L`
+- Word-initial `V` → `W`
+**Phase 2 — Vowel cluster and diphthong normalization:**
+- `AO` → `O`, `AI`/`AY` → `E` (word-final, non-initial)
+- `UA` → `WA`, `UO` → `WO`
+- Various other diphthong simplifications
+An **exceptions table** handles words that don't follow general patterns, and a **two-pass conversion** catches cascading transformations.
+## Installation
+```shell
+pip install normalize-kap-orthography
+```
+Or just copy `normalize_orthography.py` into your project.
+## Usage
+```python
+from normalize_orthography import convert_orthography
+convert_orthography("QUINANG")   # → "KINANG"
+convert_orthography("VATAUAT")   # → "WATAWAT"
+convert_orthography("QUECAI")    # → "KEKE"
+convert_orthography("KINANG")    # → None (already modern)
+```
+Returns the normalized form, or `None` if no conversion is needed.
+### CLI
+```shell
+python normalize_orthography.py
+```
+Runs a small set of built-in test cases.
+## Limitations
+- **Not linguistically verified.** The rules were identified through pattern recognition by a native speaker, not through formal linguistic analysis. The script was spot-checked against dictionary data but not exhaustively validated.
+- **No diacritical marks.** The script does not handle stress marking, which is important in Kapampangan — e.g., *masakit* (painful) vs. *masákit* (difficult) vs. *másakit* (ill) are three distinct words.
+- **One-directional.** Currently only converts Spanish-era → modern. Reverse conversion is not supported.
+- **Uppercase only.** Input is converted to uppercase internally; output is always uppercase.
+## Origin
+Originally written in Dart as part of v2 of [Learn Kulitan](https://github.com/keithliam/learn-kulitan-app), then rewritten in Python with **Claude Code Opus 4.6**.
+## Real-World Usage
+This script was originally used to normalize ~5,000 words extracted from [*Vocabulario de la Lengua Pampanga*](https://archive.org/details/aqn8189.0001.001.umich.edu/page/1/mode/2up) by Fray Diego Bergaño, originally published in 1732 — one of the earliest known dictionaries of the Kapampángan language. About 40% of entries (1,989 out of 4,971) had their orthography normalized.
+The raw, uncleaned entries and their cleaned, normalized versions are available as part of an open dataset on Hugging Face:
+**[keithmanaloto/kapampangan-dictionary-embeddings](https://huggingface.co/datasets/keithmanaloto/kapampangan-dictionary-embeddings)**
+The dataset also includes LLM-enriched metadata and pre-computed embeddings across multiple models — designed for semantic search, retrieval, and clustering over Kapampángan vocabulary. Both the original 1730s spelling and the normalized modern form are preserved in the dataset.
+For the full story behind the dataset and what I learned building it, see the article:
+[From a 300-Year-Old Dictionary to Hugging Face: I Built Kapampángan's First Embedding Dataset](https://keithmanaloto.medium.com/from-a-300-year-old-dictionary-to-hugging-face-i-built-kapampángans-first-embedding-dataset-dce2b877bd83)
+## Contributing
+Contributions are welcome, especially:
+- Expanding the exceptions table
+- Adding test coverage against known word lists
+- Adding diacritical mark support
+- Supporting additional orthographic target systems
+## License
+MIT

normalize_kap_orthography-0.1.0/normalize_kap_orthography.egg-info/SOURCES.txt ADDED

@@ -0,0 +1,8 @@
+LICENSE
+README.md
+normalize_orthography.py
+pyproject.toml
+normalize_kap_orthography.egg-info/PKG-INFO
+normalize_kap_orthography.egg-info/SOURCES.txt
+normalize_kap_orthography.egg-info/dependency_links.txt
+normalize_kap_orthography.egg-info/top_level.txt

normalize_kap_orthography-0.1.0/normalize_kap_orthography.egg-info/dependency_links.txt ADDED

@@ -0,0 +1 @@
+

normalize_kap_orthography-0.1.0/normalize_kap_orthography.egg-info/top_level.txt ADDED

@@ -0,0 +1 @@
+normalize_orthography

normalize_kap_orthography-0.1.0/normalize_orthography.py ADDED

@@ -0,0 +1,129 @@
+"""
+Normalize Kapampangan orthography from 1730s Spanish-influenced spelling to modern form.
+Ported from learn-kulitan-app-v2/lib/utils/utils.dart (TextUtils.convertOrthography).
+Usage as module:
+    from normalize_orthography import convert_orthography
+    result = convert_orthography("QUINANG")  # -> "KINANG"
+"""
+from __future__ import annotations
+import re
+EXCEPTIONS = {
+    "CAI": "KAYI",
+    "AIA": "AYA",
+    "VATAUAT": "WATAWAT",
+    "VALI": "WALI",
+    "PASIBAIO": "PASIBAYO",
+    "MAIUTPUT": "MAYUTPUT",
+    "OGNAY": "UGNE",
+    "BABAY": "BABAYI",
+    "IUAD": "IWAD",
+    "DALIUAUAT": "DALYAWAT",
+    "GUIUA": "GIWA",
+    "SAGUESAI": "SAGESE",
+    "MAUI": "MAWI",
+    "QUECAI": "KEKE",
+    "CABAYIAN": "KABAYIAN",
+    # Manual edge case
+    "PAYNG": "PAING",
+}
+# Phase 1: Spanish orthography normalization
+REPLACEMENTS_1 = [
+    ("QUI", "KI"),
+    ("QUE", "KE"),
+    (re.compile(r"(?<!SI)C"), "K"),
+    ("Ñ", "N"),
+    ("LL", "L"),
+    ("LLA", "LA"),
+    ("LL", "L"),
+    ("ÑY", "NY"),
+    ("NN", "N"),
+    (re.compile(r"^V"), "W"),
+]
+# Phase 2: Vowel cluster and diphthong normalization
+# Note: Dart supports variable-width lookbehinds, but Python's built-in `re` module requires fixed-width ones.
+# Patterns like (?<=..+) ("preceded by 2+ chars") are rewritten as
+# equivalent fixed-width or capturing-group approaches.
+REPLACEMENTS_2 = [
+    (re.compile(r"^O(?!U)"), "U"),           # O -> U (word-initial, not before U)
+    (re.compile(r"(?<=..)AO$"), "O"),        # AO -> O (word-final, 2+ chars before)
+    (re.compile(r"(?<=.)AI$"), "E"),         # AI -> E (word-final, 1+ char before)
+    (re.compile(r"(?<=.)AY$"), "E"),         # AY -> E (word-final, 1+ char before)
+    (re.compile(r"AU$"), "AW"),              # AU -> AW (word-final)
+    (re.compile(r"(?<!L)UA$"), "WA"),        # UA -> WA (word-final, not after L)
+    (re.compile(r"(?<!B)UO"), "WO"),         # UO -> WO (not after B)
+    (re.compile(r"(?<=..)IA"), "YA"),        # IA -> YA (2+ chars before)
+    (re.compile(r"IU(?=A)"), "IW"),          # IU -> IW (before A)
+    (re.compile(r"IU(?=E)"), "IW"),          # IU -> IW (before E)
+    (re.compile(r"(?<=.)IU(?!A)"), "YU"),    # IU -> YU (1+ char before, not before A)
+    (re.compile(r"(?<=..)UI$"), "I"),        # UI -> I (2+ chars before, word-final)
+    (re.compile(r"(?<=..)IO$"), "YO"),       # IO -> YO (2+ chars before, word-final)
+    (re.compile(r"(?<=..z)IY$"), "I"),       # IY -> I (after 2+ chars ending in z, word-final)
+    (re.compile(r"IE$"), "YE"),              # IE -> YE (word-final)
+    ("AUA", "AWA"),
+    ("AUI", "AWI"),
+    ("EUA", "EWA"),
+    ("UE", "WE"),
+    ("KK", "K"),
+]
+def _apply_replacements(replacements: list, word: str) -> str:
+    for pattern, repl in replacements:
+        if isinstance(pattern, str):
+            word = word.replace(pattern, repl)
+        else:
+            word = pattern.sub(repl, word)
+    return word
+def _convert_orthography(word: str) -> str:
+    new_word = _apply_replacements(REPLACEMENTS_1, word)
+    # Remove gemination across hyphens: K-K -> K
+    new_word = re.sub(r"(\w)-\1", r"\1", new_word)
+    return _apply_replacements(REPLACEMENTS_2, new_word)
+def convert_orthography(word: str) -> str | None:
+    """Convert a word from 1730s orthography to modern form.
+    Returns the normalized form, or None if no conversion is needed
+    (i.e., the word is already in modern orthography).
+    """
+    upper = word.upper()
+    if upper in EXCEPTIONS:
+        return EXCEPTIONS[upper]
+    converted = _convert_orthography(upper)
+    if converted == upper:
+        return None
+    # Second pass
+    converted2 = _convert_orthography(converted)
+    if converted2 == upper:
+        return None
+    return converted2
+if __name__ == "__main__":
+    # Quick test
+    test_cases = [
+        ("QUINANG", "KINANG"),
+        ("CAI", "KAYI"),
+        ("VATAUAT", "WATAWAT"),
+        ("PAYNG", "PAING"),
+        ("QUECAI", "KEKE"),
+    ]
+    for original, expected in test_cases:
+        result = convert_orthography(original)
+        status = "OK" if result == expected else f"FAIL (got {result})"
+        print(f"  {original} -> {result}  {status}")
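
The module above applies an ordered list of rules in which each entry is either a plain substring or a compiled regex, so later rules see the output of earlier ones. A small self-contained sketch of that technique follows, using three of the Phase 1 rules. `DEMO_RULES` and `apply_rules` are illustrative names, and the sample inputs are arbitrary strings chosen to exercise the rules, not verified Kapampangan words.

```python
import re
from typing import Union

Rule = tuple[Union[str, re.Pattern[str]], str]

# A subset of the Phase 1 rules, in the same order as the module.
DEMO_RULES: list[Rule] = [
    ("QUI", "KI"),                    # plain substring replacement
    (re.compile(r"(?<!SI)C"), "K"),   # C -> K unless the two preceding chars are "SI"
    (re.compile(r"^V"), "W"),         # word-initial V -> W
]


def apply_rules(rules: list[Rule], word: str) -> str:
    """Apply each rule in order; every rule sees the previous rule's output."""
    for pattern, repl in rules:
        if isinstance(pattern, str):
            word = word.replace(pattern, repl)
        else:
            word = pattern.sub(repl, word)
    return word


for sample in ("QUICAN", "SICAT", "VACA"):
    print(sample, "->", apply_rules(DEMO_RULES, sample))
```

Order matters here: `QUI` is rewritten before the `C` rule runs, and the fixed-width negative lookbehind `(?<!SI)` is what leaves the `C` in `SICAT` untouched.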

normalize_kap_orthography-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,29 @@
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "normalize-kap-orthography"
+version = "0.1.0"
+description = "Normalize Kapampangan words from Spanish-era (1730s) orthography to modern K-based orthography"
+readme = "README.md"
+license = "MIT"
+requires-python = ">=3.10"
+authors = [
+    { name = "Keith Manaloto" },
+]
+keywords = ["kapampangan", "pampanga", "orthography", "nlp", "linguistics"]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Science/Research",
+    "Programming Language :: Python :: 3",
+    "Topic :: Text Processing :: Linguistic",
+]
+[tool.setuptools]
+py-modules = ["normalize_orthography"]
+[project.urls]
+Homepage = "https://github.com/keithmanaloto/normalize-kap-orthography"
+Issues = "https://github.com/keithmanaloto/normalize-kap-orthography"

normalize_kap_orthography-0.1.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0