gladia-normalization 0.1.0a1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (105) hide show
  1. gladia_normalization-0.1.0a1/.commitlintrc.json +3 -0
  2. gladia_normalization-0.1.0a1/.github/pull_request_template.md +43 -0
  3. gladia_normalization-0.1.0a1/.github/workflows/cd.yml +49 -0
  4. gladia_normalization-0.1.0a1/.github/workflows/ci.yml +53 -0
  5. gladia_normalization-0.1.0a1/.gitignore +19 -0
  6. gladia_normalization-0.1.0a1/.pre-commit-config.yaml +28 -0
  7. gladia_normalization-0.1.0a1/.python-version +1 -0
  8. gladia_normalization-0.1.0a1/AGENTS.md +160 -0
  9. gladia_normalization-0.1.0a1/CLAUDE.md +1 -0
  10. gladia_normalization-0.1.0a1/CONTRIBUTING.md +217 -0
  11. gladia_normalization-0.1.0a1/LICENSE +21 -0
  12. gladia_normalization-0.1.0a1/PKG-INFO +204 -0
  13. gladia_normalization-0.1.0a1/README.md +161 -0
  14. gladia_normalization-0.1.0a1/docs/steps.md +439 -0
  15. gladia_normalization-0.1.0a1/normalization/__init__.py +4 -0
  16. gladia_normalization-0.1.0a1/normalization/constants/__init__.py +3 -0
  17. gladia_normalization-0.1.0a1/normalization/constants/protectors.py +24 -0
  18. gladia_normalization-0.1.0a1/normalization/languages/__init__.py +7 -0
  19. gladia_normalization-0.1.0a1/normalization/languages/base/__init__.py +7 -0
  20. gladia_normalization-0.1.0a1/normalization/languages/base/language_config.py +83 -0
  21. gladia_normalization-0.1.0a1/normalization/languages/base/language_operator.py +65 -0
  22. gladia_normalization-0.1.0a1/normalization/languages/english/__init__.py +7 -0
  23. gladia_normalization-0.1.0a1/normalization/languages/english/number_normalizer.py +433 -0
  24. gladia_normalization-0.1.0a1/normalization/languages/english/operators.py +199 -0
  25. gladia_normalization-0.1.0a1/normalization/languages/english/replacements.py +1774 -0
  26. gladia_normalization-0.1.0a1/normalization/languages/english/sentence_replacements.py +3 -0
  27. gladia_normalization-0.1.0a1/normalization/languages/french/__init__.py +7 -0
  28. gladia_normalization-0.1.0a1/normalization/languages/french/operators.py +38 -0
  29. gladia_normalization-0.1.0a1/normalization/languages/french/replacements.py +1 -0
  30. gladia_normalization-0.1.0a1/normalization/languages/registery.py +17 -0
  31. gladia_normalization-0.1.0a1/normalization/pipeline/__init__.py +0 -0
  32. gladia_normalization-0.1.0a1/normalization/pipeline/base.py +120 -0
  33. gladia_normalization-0.1.0a1/normalization/pipeline/loader.py +70 -0
  34. gladia_normalization-0.1.0a1/normalization/pipeline/replacer.py +38 -0
  35. gladia_normalization-0.1.0a1/normalization/presets/gladia-3.yaml +119 -0
  36. gladia_normalization-0.1.0a1/normalization/steps/__init__.py +4 -0
  37. gladia_normalization-0.1.0a1/normalization/steps/base/__init__.py +6 -0
  38. gladia_normalization-0.1.0a1/normalization/steps/base/protect_step.py +24 -0
  39. gladia_normalization-0.1.0a1/normalization/steps/base/restore_step.py +23 -0
  40. gladia_normalization-0.1.0a1/normalization/steps/base/text_step.py +11 -0
  41. gladia_normalization-0.1.0a1/normalization/steps/base/word_step.py +11 -0
  42. gladia_normalization-0.1.0a1/normalization/steps/registery.py +26 -0
  43. gladia_normalization-0.1.0a1/normalization/steps/text/__init__.py +79 -0
  44. gladia_normalization-0.1.0a1/normalization/steps/text/apply_sentence_level_replacements.py +29 -0
  45. gladia_normalization-0.1.0a1/normalization/steps/text/casefold_text.py +13 -0
  46. gladia_normalization-0.1.0a1/normalization/steps/text/convert_comparison_operators_to_words.py +26 -0
  47. gladia_normalization-0.1.0a1/normalization/steps/text/convert_decimal_periods_to_decimal_word.py +30 -0
  48. gladia_normalization-0.1.0a1/normalization/steps/text/convert_degree_symbols_to_words.py +26 -0
  49. gladia_normalization-0.1.0a1/normalization/steps/text/convert_digit_word_sequences_to_digits.py +41 -0
  50. gladia_normalization-0.1.0a1/normalization/steps/text/convert_dots_to_words_in_technical_contexts.py +33 -0
  51. gladia_normalization-0.1.0a1/normalization/steps/text/convert_oclock_to_numeric_time.py +34 -0
  52. gladia_normalization-0.1.0a1/normalization/steps/text/convert_roman_numerals_to_digits.py +41 -0
  53. gladia_normalization-0.1.0a1/normalization/steps/text/convert_word_based_time_patterns.py +70 -0
  54. gladia_normalization-0.1.0a1/normalization/steps/text/expand_alphanumeric_codes.py +64 -0
  55. gladia_normalization-0.1.0a1/normalization/steps/text/expand_contractions.py +16 -0
  56. gladia_normalization-0.1.0a1/normalization/steps/text/expand_written_numbers_to_digits.py +13 -0
  57. gladia_normalization-0.1.0a1/normalization/steps/text/expand_www_abbreviation.py +15 -0
  58. gladia_normalization-0.1.0a1/normalization/steps/text/fix_ampm_letter_spacing.py +25 -0
  59. gladia_normalization-0.1.0a1/normalization/steps/text/fix_dot_adjacent_number_words.py +37 -0
  60. gladia_normalization-0.1.0a1/normalization/steps/text/fix_one_word_in_numeric_contexts.py +16 -0
  61. gladia_normalization-0.1.0a1/normalization/steps/text/fix_version_number_v_prefix.py +15 -0
  62. gladia_normalization-0.1.0a1/normalization/steps/text/format_time_patterns_with_ampm.py +53 -0
  63. gladia_normalization-0.1.0a1/normalization/steps/text/normalize_numeric_time_formats.py +16 -0
  64. gladia_normalization-0.1.0a1/normalization/steps/text/normalize_punctuation_between_number_words.py +30 -0
  65. gladia_normalization-0.1.0a1/normalization/steps/text/placeholders.py +367 -0
  66. gladia_normalization-0.1.0a1/normalization/steps/text/protect_plus_word_before_digit_words.py +37 -0
  67. gladia_normalization-0.1.0a1/normalization/steps/text/remove_acronym_periods.py +17 -0
  68. gladia_normalization-0.1.0a1/normalization/steps/text/remove_diacritics.py +40 -0
  69. gladia_normalization-0.1.0a1/normalization/steps/text/remove_filler_words.py +19 -0
  70. gladia_normalization-0.1.0a1/normalization/steps/text/remove_hash_before_numbers.py +15 -0
  71. gladia_normalization-0.1.0a1/normalization/steps/text/remove_non_numeric_trailing_dots.py +15 -0
  72. gladia_normalization-0.1.0a1/normalization/steps/text/remove_spaces_between_adjacent_digits.py +54 -0
  73. gladia_normalization-0.1.0a1/normalization/steps/text/remove_standalone_currency_symbols.py +35 -0
  74. gladia_normalization-0.1.0a1/normalization/steps/text/remove_symbols.py +24 -0
  75. gladia_normalization-0.1.0a1/normalization/steps/text/remove_thousand_separators.py +24 -0
  76. gladia_normalization-0.1.0a1/normalization/steps/text/remove_trailing_apostrophe_space.py +15 -0
  77. gladia_normalization-0.1.0a1/normalization/steps/text/remove_trailing_dot_word_from_emails.py +22 -0
  78. gladia_normalization-0.1.0a1/normalization/steps/text/remove_trailing_period.py +15 -0
  79. gladia_normalization-0.1.0a1/normalization/steps/text/remove_zero_minutes_from_time.py +30 -0
  80. gladia_normalization-0.1.0a1/normalization/steps/text/replace_currency.py +31 -0
  81. gladia_normalization-0.1.0a1/normalization/steps/word/__init__.py +3 -0
  82. gladia_normalization-0.1.0a1/normalization/steps/word/apply_word_replacements.py +33 -0
  83. gladia_normalization-0.1.0a1/pyproject.toml +59 -0
  84. gladia_normalization-0.1.0a1/scripts/generate_step_docs.py +78 -0
  85. gladia_normalization-0.1.0a1/tests/__init__.py +0 -0
  86. gladia_normalization-0.1.0a1/tests/e2e/__init__.py +0 -0
  87. gladia_normalization-0.1.0a1/tests/e2e/default_pipeline_test.py +42 -0
  88. gladia_normalization-0.1.0a1/tests/e2e/files/gladia-3.csv +126 -0
  89. gladia_normalization-0.1.0a1/tests/e2e/normalization_test.py +68 -0
  90. gladia_normalization-0.1.0a1/tests/unit/languages/__init__.py +0 -0
  91. gladia_normalization-0.1.0a1/tests/unit/languages/english_registry_test.py +33 -0
  92. gladia_normalization-0.1.0a1/tests/unit/languages/symbols_to_words_test.py +29 -0
  93. gladia_normalization-0.1.0a1/tests/unit/languages/word_replacement_test.py +21 -0
  94. gladia_normalization-0.1.0a1/tests/unit/steps/__init__.py +0 -0
  95. gladia_normalization-0.1.0a1/tests/unit/steps/text/__init__.py +0 -0
  96. gladia_normalization-0.1.0a1/tests/unit/steps/text/conftest.py +22 -0
  97. gladia_normalization-0.1.0a1/tests/unit/steps/text/convert_dots_to_words_in_technical_contexts_test.py +54 -0
  98. gladia_normalization-0.1.0a1/tests/unit/steps/text/convert_oclock_to_numeric_time_test.py +32 -0
  99. gladia_normalization-0.1.0a1/tests/unit/steps/text/convert_roman_numerals_to_digits_test.py +72 -0
  100. gladia_normalization-0.1.0a1/tests/unit/steps/text/protect_decimal_separator_test.py +40 -0
  101. gladia_normalization-0.1.0a1/tests/unit/steps/text/remove_diacritics_test.py +17 -0
  102. gladia_normalization-0.1.0a1/tests/unit/steps/text/remove_zero_minutes_from_time_test.py +21 -0
  103. gladia_normalization-0.1.0a1/tests/unit/steps/text/replace_currency_test.py +28 -0
  104. gladia_normalization-0.1.0a1/tests/unit/steps/text/restore_decimal_separator_with_word_test.py +21 -0
  105. gladia_normalization-0.1.0a1/uv.lock +315 -0
@@ -0,0 +1,3 @@
1
+ {
2
+ "extends": ["@commitlint/config-conventional"]
3
+ }
@@ -0,0 +1,43 @@
1
+ ## What does this PR do?
2
+
3
+ <!-- One-sentence summary of the change. -->
4
+
5
+ ## Type of change
6
+
7
+ - [ ] New language (`languages/{lang}/`)
8
+ - [ ] New step (`steps/text/` or `steps/word/`)
9
+ - [ ] New preset version (`presets/`)
10
+ - [ ] Bug fix
11
+ - [ ] Refactor / internal cleanup
12
+ - [ ] Docs / CI
13
+
14
+ ## Checklist
15
+
16
+ ### New language
17
+
18
+ - [ ] Created `languages/{lang}/` with `operators.py`, `replacements.py`, `__init__.py`
19
+ - [ ] All word-level substitutions are in `replacements.py`, not inline in `operators.py`
20
+ - [ ] Decorated operators class with `@register_language`
21
+ - [ ] Added one import line to `languages/__init__.py`
22
+ - [ ] Added unit tests in `tests/unit/languages/`
23
+ - [ ] Added e2e test rows in `tests/e2e/files/`
24
+
25
+ ### New step
26
+
27
+ - [ ] `name` class attribute is unique and matches the YAML key
28
+ - [ ] Decorated with `@register_step`
29
+ - [ ] Added one import line to `steps/text/__init__.py` or `steps/word/__init__.py`
30
+ - [ ] Algorithm reads data from `operators.config.*`, no hardcoded language-specific values
31
+ - [ ] Optional config fields are guarded with `if operators.config.field is None: return text`
32
+ - [ ] Placeholder protect/restore pairs are both in `steps/text/placeholders.py` and `pipeline/base.py`'s `validate()` is updated
33
+ - [ ] Added unit tests in `tests/unit/steps/`
34
+ - [ ] Added step name to relevant preset YAMLs (new preset file if existing presets are affected)
35
+ - [ ] If the class docstring was added or changed, ran `uv run scripts/generate_step_docs.py` to regenerate `docs/steps.md`
36
+
37
+ ### Preset change
38
+
39
+ - [ ] Existing preset files are not modified — new behavior uses a new preset version file
40
+
41
+ ## Tests
42
+
43
+ <!-- Describe what was tested and how. -->
@@ -0,0 +1,49 @@
1
+ name: CD
2
+
3
+ on:
4
+ push:
5
+ tags:
6
+ # PEP 440 versioning
7
+ - v[0-9]+.[0-9]+.[0-9]+
8
+ - v[0-9]+.[0-9]+.[0-9]+a[0-9]+
9
+ - v[0-9]+.[0-9]+.[0-9]+b[0-9]+
10
+ - v[0-9]+.[0-9]+.[0-9]+rc[0-9]+
11
+
12
+ jobs:
13
+ publish:
14
+ name: Build and publish
15
+ runs-on: ubuntu-latest
16
+ environment: pypi
17
+ permissions:
18
+ contents: read
19
+ id-token: write # required for Trusted Publisher (OIDC)
20
+
21
+ steps:
22
+ - uses: actions/checkout@v6
23
+
24
+ - uses: astral-sh/setup-uv@v7
25
+ with:
26
+ python-version: "3.13"
27
+
28
+ - name: Validate tag format
29
+ run: |
30
+ if [[ ! "${{ github.ref_name }}" =~ ^v[0-9]+\.[0-9]+\.[0-9]+(a|b|rc)?[0-9]*$ ]]; then
31
+ echo "Error: Tag must follow PEP 440 versioning format (vMAJOR.MINOR.PATCH with optional pre-release suffix)"
32
+ echo "Examples: v1.2.3, v1.2.3a1, v1.2.3b2, v1.2.3rc1"
33
+ echo "Got: ${{ github.ref_name }}"
34
+ exit 1
35
+ fi
36
+
37
+ - name: Extract version from tag
38
+ id: version
39
+ run: echo "version=${GITHUB_REF_NAME#v}" >> "$GITHUB_OUTPUT"
40
+
41
+ - name: Update version in pyproject.toml
42
+ run: |
43
+ sed -i 's/^version = ".*"/version = "${{ steps.version.outputs.version }}"/' pyproject.toml
44
+
45
+ - name: Build
46
+ run: uv build
47
+
48
+ - name: Publish to PyPI
49
+ run: uv publish
@@ -0,0 +1,53 @@
1
+ name: CI
2
+
3
+ on:
4
+ pull_request:
5
+ branches: [main]
6
+ types: [opened, synchronize, reopened, labeled]
7
+ workflow_dispatch:
8
+
9
+ jobs:
10
+ commitlint:
11
+ name: Lint commit messages
12
+ runs-on: ubuntu-latest
13
+ permissions:
14
+ contents: read
15
+ pull-requests: read
16
+ steps:
17
+ - uses: actions/checkout@v6
18
+ with:
19
+ fetch-depth: 0
20
+ - uses: wagoid/commitlint-github-action@v6
21
+ with:
22
+ failOnWarnings: false
23
+
24
+ lint:
25
+ name: Lint (ruff)
26
+ runs-on: ubuntu-latest
27
+ steps:
28
+ - uses: actions/checkout@v6
29
+ - uses: astral-sh/setup-uv@v7
30
+ - run: uvx ruff check .
31
+ - run: uvx ruff format --check .
32
+
33
+ typecheck:
34
+ name: Type check (ty)
35
+ runs-on: ubuntu-latest
36
+ steps:
37
+ - uses: actions/checkout@v6
38
+ - uses: astral-sh/setup-uv@v7
39
+ with:
40
+ python-version: "3.13"
41
+ - run: uv sync --group dev
42
+ - run: uv run ty check .
43
+
44
+ test:
45
+ name: Tests (pytest)
46
+ runs-on: ubuntu-latest
47
+ steps:
48
+ - uses: actions/checkout@v6
49
+ - uses: astral-sh/setup-uv@v7
50
+ with:
51
+ python-version: "3.13"
52
+ - run: uv sync --group dev
53
+ - run: uv run pytest
@@ -0,0 +1,19 @@
1
+ # Python-generated files
2
+ __pycache__/
3
+ *.py[oc]
4
+ build/
5
+ dist/
6
+ wheels/
7
+ *.egg-info
8
+ .ruff_cache/
9
+ .pytest_cache/
10
+
11
+
12
+ # Virtual environments
13
+ .venv
14
+
15
+ # IDE
16
+ .vscode/
17
+ .idea/
18
+ .cursor/
19
+ .claude/
@@ -0,0 +1,28 @@
1
+ default_install_hook_types:
2
+ - pre-commit
3
+ - commit-msg
4
+
5
+ repos:
6
+ - repo: https://github.com/alessandrojcm/commitlint-pre-commit-hook
7
+ rev: v9.24.0
8
+ hooks:
9
+ - id: commitlint
10
+ stages: [commit-msg]
11
+ additional_dependencies: ["@commitlint/config-conventional"]
12
+ verbose: true
13
+ - repo: https://github.com/astral-sh/ruff-pre-commit
14
+ rev: v0.15.2
15
+ hooks:
16
+ - id: ruff-check
17
+ args: [--fix]
18
+ - id: ruff-format
19
+
20
+ # Remove this once ty pre-commit hook is released
21
+ - repo: local
22
+ hooks:
23
+ - id: ty
24
+ name: ty check
25
+ entry: uvx ty check .
26
+ language: system
27
+ pass_filenames: false
28
+ always_run: true
@@ -0,0 +1 @@
1
+ 3.13
@@ -0,0 +1,160 @@
1
+ # text_normalizers — Agent Guidelines
2
+
3
+ This document describes the architecture, conventions, and rules for contributing to `normalization`. Read it fully before making any change.
4
+
5
+ ---
6
+
7
+ ## What this project is
8
+
9
+ A Python library for normalizing speech-to-text transcription output and ground truth to enable fair Word Error Rate (WER) comparison across STT engines. It converts surface-form variations (currency symbols, written numbers, abbreviations, punctuation, fillers) into a canonical text representation so that semantically equivalent transcriptions are treated as identical.
10
+ The repository uses uv as its package manager.
11
+
12
+ ---
13
+
14
+ ## Architecture overview
15
+
16
+ The pipeline has exactly three stages, always in this order:
17
+
18
+ 1. **Text pre-processing** — full-text transformations before word splitting (e.g. placeholder protection, symbol conversion, contraction expansion)
19
+ 2. **Word processing** — per-token transformations after splitting on spaces (e.g. replacements, email detection)
20
+ 3. **Text post-processing** — full-text cleanup after rejoining words (e.g. placeholder restoration, digit collapsing)
21
+
22
+ This 3-stage structure is a hard constraint, not a suggestion. Steps have implicit ordering dependencies (a placeholder must be protected before symbols are removed, and restored after). Never flatten stages or allow steps to run out of order.
23
+
24
+ ### Stage responsibilities
25
+
26
+ **text_pre_steps** — full text before word splitting.
27
+ Protect patterns (decimals, email symbols, slashes), expand multi-word forms (contractions, numbers, acronyms), convert symbols to words (currency, degrees, operators), apply character-level transforms (casefold, diacritics, punctuation removal), normalize whitespace.
28
+
29
+ **word_steps** — individual tokens after splitting, no neighbor context.
30
+ Skip special tokens (emails), apply single-word replacements (`vs` → `versus`), remove bracketed noise (`[inaudible]`).
31
+
32
+ **text_post_steps** — full text after word joining.
33
+ Restore placeholders to their final form (characters or words), format multi-word patterns (time, numbers), collapse digit sequences, normalize whitespace.
34
+
35
+ Pipelines are defined in YAML. The YAML lists which steps run in each stage. Step classes register themselves automatically via a decorator — the YAML name maps directly to the registered step.
36
+
37
+ ---
38
+
39
+ ## Project structure — key rules
40
+
41
+ ### `languages/`
42
+
43
+ Each supported language is a **self-contained folder** (e.g. `languages/english/`). Every language folder follows the same structure:
44
+
45
+ - `operators.py` — subclass of `LanguageOperators`, holds the language config instance and any language-specific _behavioral_ method overrides
46
+ - `replacements.py` — a plain `dict[str, str]` of **all** word-level substitutions for this language. Every word replacement goes here — never add inline entries in `operators.py`. An empty dict is valid for languages with no replacements yet.
47
+ - `__init__.py` — exports the operators class and the replacements dict, nothing else. Do not re-export sentence replacements, number normalizers, or any other internal symbols.
48
+
49
+ **`languages/base/`** is a package that defines the full language contract. It contains two files:
50
+
51
+ - `language_config.py` — `LanguageConfig` dataclass: all language-specific _data_ (separators, currency words, filler words, digit words, time word maps, sentence replacements, etc.). Most fields have sensible defaults (empty dicts/lists, `None` for optional fields); steps that read them skip gracefully when `None`.
52
+ - `language_operator.py` — `LanguageOperators`: the base class and language-neutral fallback. Directly instantiable with no arguments — uses a minimal `LanguageConfig(code="default")` with empty symbol/currency mappings and all optional fields set to `None`. Registered in the language registry under `"default"` so it serves as the automatic fallback when no language is specified or the language is unsupported. All methods are no-ops. Only methods where the algorithm itself varies by language should be overridden in subclasses. Methods that are purely data-driven (i.e. the step owns the algorithm and only reads config values) do **not** belong here.
53
+
54
+ Both symbols are re-exported from `languages/base/__init__.py`.
55
+
56
+ Additional files beyond the required three (e.g. `number_normalizer.py`, `sentence_replacements.py`) are allowed when a language needs them, but they must never be empty. Number-related _data_ (digit words, number words) belongs in `LanguageConfig`. Only create a `number_normalizer.py` when the expansion _algorithm_ is complex enough to warrant its own module (see `languages/english/number_normalizer.py`).
57
+
58
+ When adding a new language:
59
+
60
+ 1. Create a new folder under `languages/` with `operators.py`, `replacements.py`, and `__init__.py`
61
+ 2. Decorate the operators class with `@register_language` — registration is automatic
62
+ 3. Add one import line to `languages/__init__.py` to trigger the decorator at import time
63
+
64
+ ### `steps/`
65
+
66
+ Steps are **atomic, stateless, single-responsibility** transformations. Each step class:
67
+
68
+ - Has a `name` class attribute (the string used in YAML)
69
+ - Is decorated with `@register_step` — this auto-registers it, no manual registry update needed
70
+ - Receives `(text, operators)` for text steps, or `(word, operators)` for word steps
71
+ - **Owns the algorithm** — the `__call__` method contains the transformation logic
72
+ - **Reads data from `operators.config.*`** — never hardcodes language-specific values
73
+
74
+ Steps are organized into `steps/text/` and `steps/word/` by stage. Protect/restore placeholder pairs always live in the **same file** (`steps/text/placeholders.py`) to keep their dependency explicit and co-located.
75
+
76
+ When adding a new step:
77
+
78
+ 1. Create or add to the appropriate file under `steps/text/` or `steps/word/`
79
+ 2. Decorate with `@register_step`
80
+ 3. Add one import line to `steps/text/__init__.py` or `steps/word/__init__.py`
81
+ 4. Add the step name to the relevant YAML preset(s) if it should run by default
82
+
83
+ ### `pipeline/`
84
+
85
+ - `base.py` — `NormalizationPipeline`: the orchestrator. Holds the three ordered step lists, runs them, exposes `.describe()` and `.validate()`.
86
+ - `loader.py` — reads a YAML preset, resolves step names from the step registry, instantiates operators from the language registry, returns a ready-to-use pipeline.
87
+ - `replacer.py` — stateful compiled-regex engine used by the word replacement step. Lives here because it is infrastructure, not a step itself.
88
+
89
+ ### `presets/`
90
+
91
+ Versioned YAML files shipped with the library. **Once published, a preset must never be modified** — benchmark reproducibility depends on it. New behavior means a new preset file with a new version name.
92
+
93
+ ---
94
+
95
+ ## Core conventions
96
+
97
+ ### Auto-registration, not manual registries
98
+
99
+ Never manually maintain a dict mapping names to classes. Use the `@register_step` and `@register_language` decorators defined in `steps/registery.py` and `languages/registery.py`. The only manual work is adding an import line to the relevant `__init__.py` so the decorator runs at import time.
100
+
101
+ ### Language data vs. language behavior
102
+
103
+ This is the central design rule. There are two distinct places for language-specific things:
104
+
105
+ **`LanguageConfig` (data)** — everything that can be expressed as a value: strings, lists, dicts. This includes separator characters, currency words, filler words, digit words, number words, and data-driven mappings like `time_words`, `sentence_replacements`, etc. Optional fields use `TypeAlias | None = None`; a `None` value means the step that reads it must skip gracefully. Semantic `TypeAlias` definitions (`TimeWords`, `DigitWords`, `SentenceReplacements`, etc.) are defined in `language_config.py` to make the contract self-documenting.
106
+
107
+ **`LanguageOperators` (behavior)** — only methods where the _algorithm itself_ varies by language. Examples: `expand_contractions` (uses an external library + custom regexes), `expand_written_numbers` (English uses a complex Whisper-derived normalizer), `normalize_numeric_time_formats` (am/pm regex structure), `fix_one_word_in_numeric_contexts` (language-specific digit-adjacent pattern), `get_compound_minutes` (English combines tens+ones with hyphen/space; other languages form these differently or not at all). If the algorithm is generic and only the _data_ differs, the data goes in `LanguageConfig` and the algorithm goes in the step — not in the operator.
108
+
109
+ Decision rule: ask "does the _logic_ change by language, or just the _values_?" If only values change → `LanguageConfig`. If the logic changes → `LanguageOperators` method override.
110
+
111
+ ### Placeholder protection is ordered and paired
112
+
113
+ Any step that protects a character with a placeholder token must have a corresponding restore step. These must always be in `steps/text/placeholders.py`. The protect step must run in Stage 1 before `RemoveSymbolsStep`. The restore step must run in Stage 3. `pipeline.validate()` enforces this — do not bypass it. `loader.py` calls `validate()` automatically after constructing the pipeline.
114
+
115
+ When implementing placeholder steps, use the base classes where they fit:
116
+
117
+ - **`ProtectStep`** — use when the pattern has exactly two capture groups and emits a single placeholder (template: `\1{placeholder}\2`). Implement `_pattern(operators)`.
118
+ - **`RestoreStep`** — use when restoration is a plain string replacement of a single placeholder. Implement `_replacement(operators)`.
119
+ - **`TextStep`** directly — use when neither contract fits (multiple placeholders in one pass, zero-width patterns, per-match fan-out, marker deletion, post-replace logic). In that case, document why in the class docstring.
120
+
121
+ ### Steps are language-agnostic
122
+
123
+ A step must not contain any language-specific logic or string literals. If the algorithm differs by language, add a method to `LanguageOperators` (with a no-op default in the base) and call `operators.that_method(text)` from the step. If only data differs, read it from `operators.config.*`. English-only helpers (e.g. `EnglishNumberNormalizer`) live inside `languages/english/`, not in `steps/`.
124
+
125
+ ### Language folders are self-contained
126
+
127
+ Everything specific to a language lives inside its folder. If you find yourself adding a helper that only one language uses, it goes in that language's folder as an additional file — not in `steps/`, not in `pipeline/`. The English number normalizer (`languages/english/number_normalizer.py`) is the canonical example of this pattern.
128
+
129
+ ### Presets are the reproducibility contract
130
+
131
+ Never modify a published preset YAML. Never let a preset reference a step that has changed its behavior under the same name. If a step's behavior changes, create a new step with a new name and update the relevant presets accordingly.
132
+
133
+ ---
134
+
135
+ ## Adding a new language — checklist
136
+
137
+ - [ ] Create `languages/{lang}/` with `operators.py`, `replacements.py`, `__init__.py`
138
+ - [ ] Put all word-level substitutions in `replacements.py`; do not add inline entries in `operators.py`
139
+ - [ ] Instantiate a `LanguageConfig` in `operators.py`, filling in all required fields and any optional dict fields your language needs (`time_words`, `sentence_replacements`, etc.)
140
+ - [ ] Subclass `LanguageOperators`, overriding only methods where the _algorithm_ differs (not just the data)
141
+ - [ ] If the language has digit words, populate `digit_words` in `LanguageConfig`
142
+ - [ ] If the language uses spoken time patterns, populate `time_words` with all needed word→digit mappings (clock hours 1-12 and minute-worth values up to 50); if it also uses compound minute expressions (e.g. "twenty-one"), override `get_compound_minutes()` to generate them — do **not** put this in config
143
+ - [ ] If number expansion is needed and the algorithm is complex, implement it in a `number_normalizer.py` file and override `expand_written_numbers`; otherwise do not create the file
144
+ - [ ] Decorate the class with `@register_language`
145
+ - [ ] Add one import to `languages/__init__.py`
146
+ - [ ] Add tests in `tests/unit/languages/`
147
+ - [ ] Add test rows to `tests/e2e/files/` for the new language
148
+
149
+ ## Adding a new step — checklist
150
+
151
+ - [ ] Add the class to the appropriate file in `steps/text/` or `steps/word/`
152
+ - [ ] Set a unique `name` class attribute
153
+ - [ ] Decorate with `@register_step`
154
+ - [ ] Add one import to `steps/text/__init__.py` or `steps/word/__init__.py`
155
+ - [ ] Place the algorithm in `__call__`; read language data from `operators.config.*`; call operator methods only for genuinely behavioral differences
156
+ - [ ] If the step reads an optional `LanguageConfig` field, guard with `if operators.config.field is None: return text` and add a TODO comment
157
+ - [ ] Add unit tests in `tests/unit/steps/`
158
+ - [ ] If it involves placeholder protection, add both protect and restore to `steps/text/placeholders.py` and update `pipeline/base.py`'s `validate()` accordingly; use `ProtectStep`/`RestoreStep` base classes where the contract fits, otherwise use `TextStep` directly and document why in the docstring
159
+ - [ ] Add the step name to relevant preset YAMLs if needed (new preset version if existing presets are affected)
160
+ - [ ] If you added or changed the class docstring, run `uv run scripts/generate_step_docs.py` to regenerate `docs/steps.md`
@@ -0,0 +1 @@
1
+ AGENTS.md
@@ -0,0 +1,217 @@
1
+ # Contributing
2
+
3
+ Thanks for your interest in `gladia-normalization`! Here's how to get involved.
4
+
5
+ ## Reporting bugs
6
+
7
+ Open an issue with steps to reproduce, expected vs actual behavior, and your environment (Python version, OS, package version).
8
+
9
+ ## Submitting changes
10
+
11
+ 1. **Fork the repo and create a branch**: `git checkout -b feat/my-feature`
12
+ 2. **Make your changes and add tests**
13
+ 3. **Run the checks**:
14
+ ```bash
15
+ uv run pytest # run tests
16
+ uv run ruff check . # lint
17
+ uv run ruff format . # format
18
+ uv run ty check # type-check
19
+ ```
20
 + 4. **Push your branch**: `git push origin feat/my-feature`
21
+ 5. **Create a PR**: Go to GitHub and create a pull request
22
+ 6. **Fill out the PR template**: Provide clear description of changes
23
+ 7. **Wait for review**: Maintainers will review and provide feedback
24
+ 8. **Address feedback**: Make requested changes and push updates
25
+ 9. **Merge**: Once approved, your PR will be merged!
26
+
27
+ ### Pre-commit hooks
28
+
29
+ The project uses [pre-commit](https://pre-commit.com/) to enforce linting, formatting, and commit message conventions automatically. Install the hooks once after cloning:
30
+
31
+ ```bash
32
+ uv run pre-commit install --install-hooks
33
+ ```
34
+
35
+ This will run Ruff (lint + format) and ty (type-check) on every commit, and validate your commit message on `commit-msg`.
36
+
37
+ ## Commit style
38
+
39
 + We use [Conventional Commits](https://www.conventionalcommits.org/): prefix your commit with `feat:`, `fix:`, `docs:`, `chore:`, etc.
40
+
41
+ ## Architecture at a glance
42
+
43
+ Every pipeline runs exactly **three stages**, always in this order:
44
+
45
+ 1. **Text pre-processing** — full-text transforms before word splitting (placeholder protection, symbol conversion, contraction expansion, …)
46
+ 2. **Word processing** — per-token transforms after splitting on spaces (replacements, filler removal, …)
47
+ 3. **Text post-processing** — full-text cleanup after rejoining words (placeholder restoration, digit collapsing, …)
48
+
49
+ This ordering is a hard constraint — some steps depend on earlier steps having run. See the [README](./README.md) for more detail.
50
+
51
+ ## Adding a new step
52
+
53
+ 1. Create or extend a file under `normalization/steps/text/` or `normalization/steps/word/`.
54
+ 2. Decorate the class with `@register_step` and set a unique `name` attribute.
55
+ 3. Add an import to `steps/text/__init__.py` or `steps/word/__init__.py`.
56
+ 4. Add unit tests under `tests/unit/steps/`.
57
+ 5. Add the step name to the relevant preset YAML, or create a new preset version.
58
+ 6. If you added or changed the class docstring, regenerate `docs/steps.md` by running `uv run scripts/generate_step_docs.py`.
59
+
60
+ ### Choosing a base class
61
+
62
+ There are four base classes. Pick the narrowest one that fits your step.
63
+
64
+ **`WordStep`** — use when your transformation operates on a single token in isolation, with no knowledge of neighboring words. This is the only base class for Stage 2 steps.
65
+
66
+ ```python
67
+ @register_step
68
+ class MyWordStep(WordStep):
69
+ name = "my_word_step"
70
+
71
+ def __call__(self, word: str, operators: LanguageOperators) -> str:
72
+ ...
73
+ ```
74
+
75
+ **`TextStep`** — the general-purpose base for Stage 1 and Stage 3. Use it when your transformation needs to see the full string, or when none of the more specific bases below fit.
76
+
77
+ ```python
78
+ @register_step
79
+ class MyTextStep(TextStep):
80
+ name = "my_text_step"
81
+
82
+ def __call__(self, text: str, operators: LanguageOperators) -> str:
83
+ ...
84
+ ```
85
+
86
+ **`ProtectStep`** — a specialization of `TextStep` for the common case of replacing a character with a placeholder token. You only implement `_pattern`, which returns a compiled regex with **exactly two capture groups** (what comes before and after the character being replaced). The `__call__` is fixed: it applies the pattern as `\1{placeholder}\2`.
87
+
88
+ ```python
89
+ @register_step
90
+ class MyProtectStep(ProtectStep):
91
+ name = "my_protect_step"
92
+ placeholder = ProtectPlaceholder.MY_PLACEHOLDER
93
+
94
+ def _pattern(self, operators: LanguageOperators) -> re.Pattern:
95
+ return re.compile(r"(\d+)X(\d+)") # two capture groups required
96
+ ```
97
+
98
+ Use `ProtectStep` when: one regex pattern maps to exactly one placeholder substitution.
99
+
100
+ Use `TextStep` directly instead when: a single pass must protect two different symbols (like email `@` and `.`), the replacement needs to absorb surrounding whitespace with `\s*`, or the replacement is a per-match function rather than a fixed template.
101
+
102
+ **`RestoreStep`** — a specialization of `TextStep` for restoring a placeholder back to a string. You only implement `_replacement`, which returns the string to substitute in. The `__call__` does a plain `str.replace` of the placeholder (and its case-folded form).
103
+
104
+ ```python
105
+ @register_step
106
+ class MyRestoreStep(RestoreStep):
107
+ name = "my_restore_step"
108
+ placeholder = ProtectPlaceholder.MY_PLACEHOLDER
109
+
110
+ def _replacement(self, operators: LanguageOperators) -> str:
111
+ return operators.config.some_word or " "
112
+ ```
113
+
114
+ Use `RestoreStep` when: restoration is a straight token swap with no surrounding whitespace to absorb and no additional logic needed.
115
+
116
+ Use `TextStep` directly instead when: the placeholder was inserted with spaces around it (requiring `re.sub` with `\s*` to avoid double spaces), the marker should be deleted entirely rather than replaced, or post-replacement cleanup is needed.
117
+
118
+ ## Writing tests
119
+
120
+ ### Unit tests for a step
121
+
122
+ Unit tests live under `tests/unit/steps/text/` or `tests/unit/steps/word/`, mirroring the step file structure.
123
+
124
+ The `tests/unit/steps/text/conftest.py` provides two fixtures and a helper:
125
+
126
+ - `operators` — a bare `LanguageOperators()` instance (language-agnostic)
127
+ - `english_operators` — an `EnglishOperators()` instance
128
+ - `assert_text_step_registered(step_cls)` — verifies the step is in the registry under its name
129
+
130
+ Every test file for a step should at minimum:
131
+
132
+ 1. Assert the step is registered.
133
+ 2. Instantiate the step with `MyStep()` and call it directly: `MyStep()(text, operators)`.
134
+ 3. Mutate `operators.config` fields in-place to cover different language configurations without creating a full language.
135
+
136
+ ```python
137
+ # tests/unit/steps/text/my_step_test.py
138
+ from normalization.languages.base import LanguageOperators
139
+ from normalization.steps.text.my_module import MyStep
140
+
141
+ from .conftest import assert_text_step_registered
142
+
143
+
144
+ def test_step_is_registered():
145
+ assert_text_step_registered(MyStep)
146
+
147
+
148
+ def test_my_step_basic(operators: LanguageOperators):
149
+ result = MyStep()("some input", operators)
150
+ assert result == "expected output"
151
+
152
+
153
+ def test_my_step_with_config(operators: LanguageOperators):
154
+ operators.config.some_field = "custom_value"
155
+ result = MyStep()("some input", operators)
156
+ assert result == "expected output with custom value"
157
+
158
+
159
+ def test_my_step_with_english(english_operators):
160
+ result = MyStep()("some input", english_operators)
161
+ assert result == "english-specific output"
162
+ ```
163
+
164
+ ### E2E tests for a preset
165
+
166
+ E2E tests validate the full pipeline (preset + language) against a CSV fixture. The test runner lives in `tests/e2e/normalization_test.py` and CSV files go in `tests/e2e/files/`.
167
+
168
+ **CSV format** — three columns, no quoting needed unless the value contains a comma:
169
+
170
+ ```
171
+ input,expected,language
172
 + "$1,000,000",1000000 dollars,en
173
+ hello world,hello world,fr
174
+ ```
175
+
176
+ Each row is one test case. The `language` column must match a registered language code (or `default`).
177
+
178
+ **Registering a new CSV** — add a block to `normalization_test.py` following the existing pattern:
179
+
180
+ ```python
181
+ _MY_PRESET_CSV = _FILES_DIR / "my-preset.csv"
182
+ _MY_PRESET_TESTS = _load_tests_from_csv(_MY_PRESET_CSV) if _MY_PRESET_CSV.exists() else []
183
+ _MY_PRESET_PIPELINES: dict[str, NormalizationPipeline] = {}
184
+
185
+
186
+ @pytest.mark.parametrize(
187
+ "test",
188
+ _MY_PRESET_TESTS,
189
+ ids=_case_ids(_MY_PRESET_TESTS),
190
+ )
191
+ def test_my_preset(test: NormalizationTest) -> None:
192
+ pipeline = _load_pipeline("my-preset", test.language)
193
+ result = pipeline.normalize(test.input)
194
+ assert result == test.expected, (
195
+ f"\n input: {test.input!r}"
196
+ f"\n expected: {test.expected!r}"
197
+ f"\n got: {result!r}"
198
+ )
199
+ ```
200
+
201
+ Pipelines are cached per language inside `_MY_PRESET_PIPELINES` to avoid reloading for each parametrized case — follow the `_load_pipeline` helper pattern already in the file.
202
+
203
+ Steps must be **language-agnostic** — delegate all language-specific logic to the `operators` argument or read data from `operators.config.*`.
204
+
205
+ ## Adding a new language
206
+
207
+ 1. Create `normalization/languages/{lang}/` with `operators.py`, `replacements.py`, and `__init__.py`.
208
+ 2. Put all word-level substitutions in `replacements.py`.
209
+ 3. Instantiate a `LanguageConfig` and subclass `LanguageOperators` in `operators.py`.
210
+ 4. Decorate with `@register_language` and add one import to `normalization/languages/__init__.py`.
211
+ 5. Add tests under `tests/unit/languages/` and e2e fixture rows in `tests/e2e/files/`.
212
+
213
+ ## Key design rules
214
+
215
+ - **Data vs. behavior**: if only the _values_ change by language, put them in `LanguageConfig`. If the _algorithm_ changes, override a method in `LanguageOperators`.
216
+ - **Presets are immutable**: never modify a published preset YAML — new behavior means a new preset file.
217
+ - **Placeholder pairs**: every `protect_*` step in Stage 1 must have a matching `restore_*` in Stage 3. The pipeline validates this at load time.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Gladia
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.