PyPI - basa - Versions diffs - 0.1.0a0__tar.gz - Mend

basa 0.1.0a0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (37)

basa-0.1.0a0/.gitignore +220 -0
basa-0.1.0a0/LICENSE +21 -0
basa-0.1.0a0/PKG-INFO +394 -0
basa-0.1.0a0/README.md +368 -0
basa-0.1.0a0/pyproject.toml +37 -0
basa-0.1.0a0/src/basa/__init__.py +25 -0
basa-0.1.0a0/src/basa/augment/__init__.py +0 -0
basa-0.1.0a0/src/basa/augment/noise.py +0 -0
basa-0.1.0a0/src/basa/augment/paraphrase.py +0 -0
basa-0.1.0a0/src/basa/augment/synthetic.py +0 -0
basa-0.1.0a0/src/basa/core/__init__.py +0 -0
basa-0.1.0a0/src/basa/core/normalize.py +214 -0
basa-0.1.0a0/src/basa/core/quick.py +50 -0
basa-0.1.0a0/src/basa/core/slang.py +700 -0
basa-0.1.0a0/src/basa/core/typo.py +433 -0
basa-0.1.0a0/src/basa/dataset/__init__.py +0 -0
basa-0.1.0a0/src/basa/dataset/builder.py +0 -0
basa-0.1.0a0/src/basa/dataset/cleaner.py +0 -0
basa-0.1.0a0/src/basa/dataset/split.py +0 -0
basa-0.1.0a0/src/basa/dataset/validator.py +0 -0
basa-0.1.0a0/src/basa/evaluate/__init__.py +0 -0
basa-0.1.0a0/src/basa/evaluate/factual.py +0 -0
basa-0.1.0a0/src/basa/evaluate/metrics.py +0 -0
basa-0.1.0a0/src/basa/evaluate/similarity.py +0 -0
basa-0.1.0a0/src/basa/tokenize/__init__.py +0 -0
basa-0.1.0a0/src/basa/tokenize/lang_detect.py +0 -0
basa-0.1.0a0/src/basa/tokenize/sentence.py +0 -0
basa-0.1.0a0/src/basa/tokenize/word.py +0 -0
basa-0.1.0a0/src/basa/translate/__init__.py +0 -0
basa-0.1.0a0/src/basa/translate/jv_id.py +0 -0
basa-0.1.0a0/src/basa/translate/router.py +0 -0
basa-0.1.0a0/src/basa/translate/su_id.py +0 -0
basa-0.1.0a0/src/basa/utils/__init__.py +0 -0
basa-0.1.0a0/src/basa/utils/constants.py +0 -0
basa-0.1.0a0/src/basa/utils/regex.py +0 -0
basa-0.1.0a0/src/basa/utils/text_clean.py +0 -0
basa-0.1.0a0/tests/test_normalize.py +272 -0

basa-0.1.0a0/.gitignore ADDED

@@ -0,0 +1,220 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[codz]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.venv
+.Python
+.idea/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+#   Usually these files are written by a python script from a template
+#   before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py.cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+# Pipfile.lock
+# UV
+#   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+# uv.lock
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+# poetry.lock
+# poetry.toml
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#   pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
+#   https://pdm-project.org/en/latest/usage/project/#working-with-version-control
+# pdm.lock
+# pdm.toml
+.pdm-python
+.pdm-build/
+# pixi
+#   Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
+# pixi.lock
+#   Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
+#   in the .venv directory. It is recommended not to include this directory in version control.
+.pixi
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# Redis
+*.rdb
+*.aof
+*.pid
+# RabbitMQ
+mnesia/
+rabbitmq/
+rabbitmq-data/
+# ActiveMQ
+activemq-data/
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.envrc
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/
+# PyCharm
+#   JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#   be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#   and can be added to the global gitignore or merged into this file.  For a more nuclear
+#   option (not recommended) you can uncomment the following to ignore the entire idea folder.
+# .idea/
+# Abstra
+#   Abstra is an AI-powered process automation framework.
+#   Ignore directories containing user credentials, local state, and settings.
+#   Learn more at https://abstra.io/docs
+.abstra/
+# Visual Studio Code
+#   Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
+#   that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
+#   and can be added to the global gitignore or merged into this file. However, if you prefer,
+#   you could uncomment the following to ignore the entire vscode folder
+# .vscode/
+# Temporary file for partial code execution
+tempCodeRunnerFile.py
+# Ruff stuff:
+.ruff_cache/
+# PyPI configuration file
+.pypirc
+# Marimo
+marimo/_static/
+marimo/_lsp/
+__marimo__/
+# Streamlit
+.streamlit/secrets.toml

basa-0.1.0a0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Muanai Khalifah Revindo
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

basa-0.1.0a0/PKG-INFO ADDED

@@ -0,0 +1,394 @@
+Metadata-Version: 2.4
+Name: basa
+Version: 0.1.0a0
+Summary: Modern NLP for Indonesian and regional languages
+Author-email: Muanai Khalifah Revindo <muanaikhalifahr@gmail.com>
+License: MIT
+License-File: LICENSE
+Requires-Python: >=3.10
+Requires-Dist: pydantic>=2.0.0
+Requires-Dist: torch>=2.0.0
+Requires-Dist: transformers>=4.30.0
+Provides-Extra: dev
+Requires-Dist: black>=23.0.0; extra == 'dev'
+Requires-Dist: mypy>=1.0.0; extra == 'dev'
+Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
+Requires-Dist: pytest>=7.0.0; extra == 'dev'
+Requires-Dist: ruff>=0.1.0; extra == 'dev'
+Provides-Extra: evaluation
+Requires-Dist: bert-score>=0.3.13; extra == 'evaluation'
+Requires-Dist: rouge-score>=0.1.2; extra == 'evaluation'
+Requires-Dist: seqeval>=1.2.2; extra == 'evaluation'
+Provides-Extra: serving
+Requires-Dist: fastapi>=0.100.0; extra == 'serving'
+Requires-Dist: uvicorn>=0.23.0; extra == 'serving'
+Description-Content-Type: text/markdown
+# BASA
+**Modern NLP preprocessing for Indonesian and regional languages.**
+BASA is a lightweight, zero-dependency preprocessing library designed for real-world Indonesian text — the kind found on Twitter/X, TikTok, WhatsApp, Shopee reviews, and Discord. It normalizes informal slang, collapses expressive character repetition, reduces punctuation noise, and optionally corrects typos, all through a single clean API.
+```python
+from basa import normalize
+normalize("GW GKKKK NGERTIII BNGTTTT!!!!!")
+# → 'saya tidak mengerti banget!'
+```
+---
+## Why BASA?
+Indonesian social media text is notoriously difficult to process with standard NLP tools:
+| Raw input | After `normalize()` |
+|---|---|
+| `gw gk ngerti bngt sihhhh!!!` | `saya tidak mengerti banget sih!` |
+| `kmrn gamau makan krn baper bgt` | `kemarin tidak mau makan karena bawa perasaan banget` |
+| `otw gan, rekber dlu ya!!!!!` | `dalam perjalanan saudara, rekening bersama dulu ya!` |
+| `GW GKKKK NGERTIII BNGTTTT!!!!!` | `saya tidak mengerti banget!` |
+Standard tokenizers and language models often fail on this kind of input because they see `"gkkkk"`, `"bngtttt"`, and `"ngertiii"` as unknown tokens. BASA normalizes them first.
+---
+## Installation
+```bash
+pip install basa
+```
+> **Requires Python 3.10+**
+---
+## Quick Start
+### One-liner (recommended for most use cases)
+```python
+from basa import normalize
+normalize("gw gk ngerti bngt sihhhh!!!")
+# → 'saya tidak mengerti banget sih!'
+```
+### Zero-config alias
+```python
+from basa import quick
+quick("GW GKKKK NGERTIII BNGTTTT!!!!!")
+# → 'saya tidak mengerti banget!'
+```
+`quick()` is a thin alias for `normalize()` with all defaults applied. Use it when you want the shortest possible call.
+### Batch processing
+```python
+from basa import normalize
+texts = [
+    "gw gk ngerti",
+    "lu udh makan??",
+    "kmrn gamau pergi krn baper bgt",
+]
+normalize(texts)
+# → ['saya tidak mengerti', 'kamu sudah makan?', 'kemarin tidak mau pergi karena bawa perasaan banget']
+```
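Annotation, not part of the released file: the "single string or list of strings" behaviour in the batch example could be achieved with a small dispatch wrapper. This is a minimal sketch under that assumption; `apply_to_text` is a hypothetical helper name and does not appear in the package.

```python
from typing import Callable, List, Union

TextInput = Union[str, List[str]]


def apply_to_text(fn: Callable[[str], str], text: TextInput) -> TextInput:
    """Apply fn to one string, or to each string in a list, keeping the input shape."""
    if isinstance(text, str):
        return fn(text)
    return [fn(item) for item in text]


# Usage with a stand-in transform (str.strip) instead of a real normalizer.
single = apply_to_text(str.strip, "  gw gk ngerti  ")
batch = apply_to_text(str.strip, [" lu udh makan?? ", " gw gk ngerti "])
```

Checking for `str` first matters because a string is itself iterable; without that check a single input would be processed character by character.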
+---
+## API Reference
+### `normalize(text, **options)`
+Normalize informal Indonesian text. Accepts a single string or a list of strings.
+```python
+normalize(
+    text: Union[str, List[str]],
+    apply_slang: bool = True,
+    apply_typo: bool = False,
+    lowercase: bool = True,
+    normalize_punctuation: bool = True,
+    normalize_whitespace: bool = True,
+) -> Union[str, List[str]]
+```
+#### Parameters
+| Parameter | Type | Default | Description |
+|---|---|---|---|
+| `text` | `str` or `List[str]` | — | Input text or list of texts. |
+| `apply_slang` | `bool` | `True` | Expand slang and reduce expressive repeated characters (e.g. `"bngtttt"` → `"banget"`). |
+| `apply_typo` | `bool` | `False` | Correct misspelled words using Levenshtein distance. **Opt-in** — requires a vocabulary to be loaded first. |
+| `lowercase` | `bool` | `True` | Lowercase the text before processing. Set `False` for NER and case-sensitive pipelines. |
+| `normalize_punctuation` | `bool` | `True` | Collapse repeated punctuation marks (`"!!!!!"` → `"!"`). |
+| `normalize_whitespace` | `bool` | `True` | Strip leading/trailing whitespace and collapse internal multiple spaces. |
+#### Processing pipeline (in order)
+```
+1. lowercase            → "GW GK NGERTI" → "gw gk ngerti"
+2. slang normalization  → "gkkkk" → "gk" → "tidak"
+3. typo correction      → "mkan" → "makan"  (opt-in)
+4. punctuation          → "!!!!!" → "!"
+5. whitespace cleanup   → "  a   b  " → "a b"
+```
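Annotation, not part of the released file: the five-step order listed above can be sketched with the standard library alone. The slang table, the helper names (`TOY_SLANG`, `_expand`, `toy_normalize`), and the regular expressions are all assumptions made for illustration; the package's actual code in `core/normalize.py` and `core/slang.py` is not shown in this diff view and may work differently.

```python
import re

# Hypothetical three-entry table for illustration; basa ships its own dictionary.
TOY_SLANG = {"gw": "saya", "gk": "tidak", "bgt": "banget"}


def _expand(match: re.Match) -> str:
    token = match.group(0)
    if token in TOY_SLANG:
        return TOY_SLANG[token]
    # Collapse character runs only for the lookup ("gkkkk" -> "gk"), so that
    # legitimate double letters in unknown words are left untouched.
    collapsed = re.sub(r"(.)\1+", r"\1", token)
    return TOY_SLANG.get(collapsed, token)


def toy_normalize(text: str) -> str:
    text = text.lower()                           # 1. lowercase
    text = re.sub(r"[a-z]+", _expand, text)       # 2. slang normalization
    # 3. typo correction is opt-in and omitted from this sketch
    text = re.sub(r"([!?.,])\1+", r"\1", text)    # 4. punctuation
    return " ".join(text.split())                 # 5. whitespace cleanup


result = toy_normalize("GW GKKKK   bgttt!!!!!")
```

Running lowercase first keeps the lookup table small, since only lowercase keys are needed; running whitespace cleanup last removes any extra spaces left by earlier steps.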
+#### Examples
+```python
+# Preserve case for NER tasks
+normalize("Jokowi pergi ke Jakarta", lowercase=False)
+# → 'Jokowi pergi ke Jakarta'
+# Disable slang (pass through raw tokens)
+normalize("gw gk ngerti", apply_slang=False)
+# → 'gw gk ngerti'
+# Enable typo correction (requires vocab)
+from basa import typo
+typo.add_to_vocab({"makan", "minum", "pergi"})
+normalize("saya mkan dan mnum", apply_typo=True)
+# → 'saya makan dan minum'
+```
+---
+### `quick(text)`
+Zero-config alias for `normalize()` with all default settings.
+```python
+from basa import quick
+quick("gw gamau pergi krn mager")
+# → 'saya tidak mau pergi karena malas bergerak'
+```
+---
+### `typo` — Typo Corrector
+BASA's typo corrector is **vocabulary-driven and opt-in by default**. You supply the vocabulary; BASA finds the closest match using Levenshtein distance.
+```python
+from basa import typo
+# Load your domain vocabulary
+typo.add_to_vocab({"makan", "minum", "masak", "pergi", "datang"})
+typo.correct("mkan")     # → 'makan'
+typo.correct("mnm")      # → 'minum'
+typo.correct("ok")       # → 'ok'  (too short, skipped by default)
+# Correct a full sentence
+typo.correct_text("saya mkan dan mnm")
+# → 'saya makan dan minum'
+# Get multiple suggestions
+typo.suggest("mkan", top_k=3)
+# → ['makan', 'masak', 'minum']
+```
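Annotation, not part of the released file: the README says the corrector picks the closest vocabulary entry by Levenshtein distance. A minimal sketch of that idea follows. The function names `levenshtein` and `closest` are hypothetical, and the alphabetical tie-breaking is an assumption; the package's own ranking rules are not visible here.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest(word: str, vocab: Iterable[str]) -> str:
    """Return the vocabulary entry nearest to word, or word itself if vocab is empty."""
    candidates = sorted(vocab)  # sorted so ties resolve alphabetically
    if not candidates:
        return word
    return min(candidates, key=lambda c: levenshtein(word, c))


best = closest("mkan", {"makan", "minum", "pergi"})
```

This linear scan costs one distance computation per vocabulary entry, which is the scaling concern the roadmap's BK-Tree / SymSpell item refers to.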
+#### Why is `apply_typo=False` by default?
+Typo correction is **destructive** when applied blindly. Without the right vocabulary, domain-specific terms like `xgboost`, `lightgbm`, or `rekber` would be mangled. BASA follows the principle of *conservative by default, destructive features opt-in*.
+#### Vocabulary management
+```python
+from basa import typo
+typo.add_to_vocab({"kata", "lain"})      # add words
+typo.remove_from_vocab({"kata"})         # remove words
+typo.clear_vocab()                       # reset entirely
+len(typo)                                # vocab size
+"makan" in typo                          # membership check
+# Check cache statistics (useful for profiling)
+typo.cache_info()
+# → {'hits': 120, 'misses': 35, 'size': 35}
+```
+#### Typo corrector options
+```python
+from basa.core.typo import TypoCorrector
+corrector = TypoCorrector(
+    vocab={"makan", "minum"},
+    min_word_length=4,    # tokens shorter than this are skipped (default: 4)
+    min_confidence=0.5,   # minimum correction confidence in [0, 1] (default: 0.5)
+)
+```
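Annotation, not part of the released file: the two options above act as guards that leave a token unchanged unless a correction is trustworthy. The sketch below illustrates the guard logic only. It swaps in `difflib.get_close_matches` from the standard library, whose similarity ratio is not Levenshtein distance, and treats that ratio as the "confidence"; how basa actually computes confidence is not stated in the README. `guarded_correct` is a hypothetical name.

```python
import difflib
from typing import Set


def guarded_correct(word: str, vocab: Set[str],
                    min_word_length: int = 4,
                    min_confidence: float = 0.5) -> str:
    """Return a vocabulary match for word, or word unchanged if a guard rejects it."""
    if len(word) < min_word_length or word in vocab:
        return word  # too short to correct safely, or already valid
    # cutoff plays the role of min_confidence: candidates scoring below it are dropped
    matches = difflib.get_close_matches(word, sorted(vocab), n=1, cutoff=min_confidence)
    return matches[0] if matches else word


vocab = {"makan", "minum"}
fixed = guarded_correct("mkan", vocab)       # close to "makan", so corrected
kept = guarded_correct("xgboost", vocab)     # nothing similar, so left alone
short = guarded_correct("ok", vocab)         # below min_word_length, so skipped
```

The `xgboost` case shows why a threshold matters: without one, the nearest vocabulary entry would be returned no matter how dissimilar it is.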
+---
+### `slang` — Slang Normalizer
+Access the underlying slang engine directly for fine-grained control.
+```python
+from basa.core.slang import slang, SlangNormalizer
+# Use the singleton
+slang.normalize("gw gamau pergi krn lg baper bgt")
+# → 'saya tidak mau pergi karena sedang bawa perasaan banget'
+# Custom dictionary (extend or override defaults)
+custom = SlangNormalizer(custom_mapping={
+    "gaskeun": "ayo lakukan",
+    "jancok":  "ekspresi",
+})
+custom.normalize("gaskeun bro!")
+# → 'ayo lakukan bro!'
+# Batch normalize
+slang.normalize_batch(["gw makan", "lu minum"])
+# → ['saya makan', 'kamu minum']
+```
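Annotation, not part of the released file: "extend or override defaults" suggests the custom mapping is merged over a built-in dictionary. A minimal sketch of that merge, assuming whole-word lookup; `ToySlangNormalizer` and `DEFAULT_MAPPING` are hypothetical stand-ins, not the package's `SlangNormalizer`.

```python
import re
from typing import Dict, List, Optional

# Hypothetical defaults for illustration; the real dictionary is far larger.
DEFAULT_MAPPING = {"gw": "saya", "lu": "kamu"}


class ToySlangNormalizer:
    def __init__(self, custom_mapping: Optional[Dict[str, str]] = None):
        # Later keys win, so custom entries both extend and override the defaults.
        self.mapping = {**DEFAULT_MAPPING, **(custom_mapping or {})}

    def normalize(self, text: str) -> str:
        # Replace whole words only; punctuation and spacing pass through unchanged.
        return re.sub(r"\w+", lambda m: self.mapping.get(m.group(0), m.group(0)), text)

    def normalize_batch(self, texts: List[str]) -> List[str]:
        return [self.normalize(t) for t in texts]


extended = ToySlangNormalizer({"gaskeun": "ayo lakukan"}).normalize("gaskeun bro!")
overridden = ToySlangNormalizer({"gw": "aku"}).normalize("gw makan")
```

Building a new merged dictionary per instance keeps the shared defaults untouched, so one caller's overrides cannot leak into another normalizer.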
+#### Slang dictionary categories
+The built-in dictionary covers **250+ entries** across 13 categories:
+| Category | Examples |
+|---|---|
+| Pronouns | `gw` → saya, `lu` → kamu, `dy` → dia |
+| Kinship & address | `kk` → kakak, `klg` → keluarga, `ortu` → orang tua |
+| Negation | `ga`, `gak`, `nggak` → tidak |
+| Compound negation | `gamau` → tidak mau, `gabisa` → tidak bisa |
+| Conjunctions | `yg` → yang, `krn` → karena, `tp` → tapi |
+| Verbs | `udah` → sudah, `blm` → belum, `ngerti` → mengerti |
+| Adjectives & adverbs | `bgt` → banget, `bener` → benar, `dikit` → sedikit |
+| Question words | `gmn` → bagaimana, `knp` → kenapa, `kmn` → kemana |
+| Greetings & responses | `makasih` → terima kasih, `sip` → baik |
+| Temporal & location | `skrg` → sekarang, `kmrn` → kemarin, `ntr` → nanti |
+| Internet slang | `otw` → dalam perjalanan, `btw` → omong-omong, `wkwk` → tertawa |
+| E-commerce & finance | `ongkir` → ongkos kirim, `rekber` → rekening bersama, `cod` → bayar di tempat |
+| Youth / Gen-Z | `mager` → malas bergerak, `baper` → bawa perasaan, `gabut` → tidak ada kegiatan |
+---
+## Real-World Use Cases
+### Preprocessing for sentiment analysis
+```python
+from basa import normalize
+reviews = [
+    "produknya bagus bgt tp ongkirnya mahal bgt!!!",
+    "gw kecewa bngt, barang ga sesuai deskripsi smskali",
+    "rekber dlu gan, takut kena tipu",
+]
+clean = normalize(reviews)
+# Pass clean into your sentiment model
+```
+### Preprocessing for a custom NLP pipeline
+```python
+from basa import normalize, typo
+# Load your domain vocabulary (e.g., from a word list file)
+with open("vocab.txt") as f:
+    domain_vocab = set(f.read().splitlines())
+typo.add_to_vocab(domain_vocab)
+def preprocess(text: str) -> str:
+    return normalize(text, apply_typo=True)
+preprocess("gw mkan siang tdi di wrng padang")
+# → 'saya makan siang tadi di warung padang'
+```
+### NER pipeline (preserve casing)
+```python
+from basa import normalize
+text = "Jokowi blg bhw pemerintah akan bantu UMKM"
+normalize(text, lowercase=False)
+# → 'Jokowi bilang bahwa pemerintah akan bantu UMKM'
+```
+---
+## Design Philosophy
+BASA is built around three principles:
+1. **Conservative by default.** Only safe, lossless transforms are enabled out of the box. Destructive features (like typo correction) require explicit opt-in.
+2. **No bundled vocabularies for correction.** Every domain has different vocabulary needs — fintech, e-commerce, ML, healthcare. Callers supply their own word list via `typo.add_to_vocab()`.
+3. **Zero required dependencies for core preprocessing.** The `normalize()` and `slang` modules use only the Python standard library. The optional `transformers`, `torch`, and `pydantic` dependencies are only required for advanced modules (`basa.translate`, `basa.evaluate`).
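Annotation, not part of the released file: one common way to keep heavy libraries out of the core import path, as principle 3 describes, is to import them lazily inside the modules that need them. The sketch below assumes that pattern; `require_optional` is a hypothetical helper and the error wording is illustrative. Note that the metadata above lists `pydantic`, `torch`, and `transformers` under `Requires-Dist` without an extra marker, so in this release they are installed unconditionally.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import module_name on first use, with a readable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the '{module_name}' package, which is not installed."
        ) from exc


# Usage with a standard-library module standing in for a heavy dependency.
json_module = require_optional("json", "demo feature")
```

With this pattern, importing the core preprocessing functions never touches the heavy library; the cost and the possible failure are deferred to the first call that needs it.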
+---
+## Development
+### Setup
+```bash
+git clone https://github.com/Muanai/basa.git
+cd basa
+python -m venv .venv
+.venv\Scripts\activate       # Windows
+# source .venv/bin/activate  # macOS / Linux
+pip install -e ".[dev]"
+```
+### Running tests
+```bash
+pytest tests/ -v
+```
+### Optional extras
+```bash
+pip install -e ".[serving]"     # FastAPI serving
+pip install -e ".[evaluation]"  # ROUGE, BERTScore, seqeval
+pip install -e ".[dev]"         # pytest, ruff, black, mypy
+```
+---
+## Roadmap
+| Version | Status | Features |
+|---|---|---|
+| **v0.1** | ✅ Current | `normalize()`, `quick()`, slang (250+ entries), typo corrector |
+| **v0.2** | 🔜 Planned | BK-Tree / SymSpell for faster typo correction at large vocab sizes |
+| **v0.3** | 🔜 Planned | Emoji handling, `remove_emoji` flag |
+| **v0.4** | 🔜 Planned | Tokenizer module (`basa.tokenize`) |
+| **v1.0** | 🔜 Planned | Stable API, full docs site, PyPI release |
+---
+## Contributing
+Contributions are welcome! In particular:
+- **Slang dictionary additions** — if you spot a common slang word that's missing, open a PR adding it to the appropriate category in [`src/basa/core/slang.py`](src/basa/core/slang.py).
+- **Bug reports** — please include the exact input string and the unexpected output.
+- **Performance improvements** — especially for the typo correction module.
+Please open an issue before submitting large changes.
+---
+## License
+MIT © 2026 [Muanai Khalifah Revindo](mailto:muanaikhalifahr@gmail.com)