marathi-shabda 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (36)
  1. marathi_shabda-0.1.0/CONTRIBUTING.md +101 -0
  2. marathi_shabda-0.1.0/LICENSE +21 -0
  3. marathi_shabda-0.1.0/MANIFEST.in +4 -0
  4. marathi_shabda-0.1.0/PKG-INFO +329 -0
  5. marathi_shabda-0.1.0/README.md +292 -0
  6. marathi_shabda-0.1.0/pyproject.toml +90 -0
  7. marathi_shabda-0.1.0/setup.cfg +4 -0
  8. marathi_shabda-0.1.0/src/marathi_shabda/__init__.py +38 -0
  9. marathi_shabda-0.1.0/src/marathi_shabda/api.py +208 -0
  10. marathi_shabda-0.1.0/src/marathi_shabda/cli.py +93 -0
  11. marathi_shabda-0.1.0/src/marathi_shabda/data/__init__.py +4 -0
  12. marathi_shabda-0.1.0/src/marathi_shabda/data/dictionary.db +0 -0
  13. marathi_shabda-0.1.0/src/marathi_shabda/dictionary/__init__.py +5 -0
  14. marathi_shabda-0.1.0/src/marathi_shabda/dictionary/adapter.py +169 -0
  15. marathi_shabda-0.1.0/src/marathi_shabda/exceptions.py +21 -0
  16. marathi_shabda-0.1.0/src/marathi_shabda/inference/__init__.py +9 -0
  17. marathi_shabda-0.1.0/src/marathi_shabda/inference/kaal_inference.py +50 -0
  18. marathi_shabda-0.1.0/src/marathi_shabda/inference/pos_inference.py +68 -0
  19. marathi_shabda-0.1.0/src/marathi_shabda/models.py +95 -0
  20. marathi_shabda-0.1.0/src/marathi_shabda/morphology/__init__.py +10 -0
  21. marathi_shabda-0.1.0/src/marathi_shabda/morphology/lemmatizer.py +116 -0
  22. marathi_shabda-0.1.0/src/marathi_shabda/morphology/stem_alternations.py +62 -0
  23. marathi_shabda-0.1.0/src/marathi_shabda/morphology/vibhakti_rules.py +68 -0
  24. marathi_shabda-0.1.0/src/marathi_shabda/normalization/__init__.py +12 -0
  25. marathi_shabda-0.1.0/src/marathi_shabda/normalization/safe_normalizer.py +75 -0
  26. marathi_shabda-0.1.0/src/marathi_shabda/normalization/script_detection.py +55 -0
  27. marathi_shabda-0.1.0/src/marathi_shabda/normalization/transliterator.py +127 -0
  28. marathi_shabda-0.1.0/src/marathi_shabda.egg-info/PKG-INFO +329 -0
  29. marathi_shabda-0.1.0/src/marathi_shabda.egg-info/SOURCES.txt +34 -0
  30. marathi_shabda-0.1.0/src/marathi_shabda.egg-info/dependency_links.txt +1 -0
  31. marathi_shabda-0.1.0/src/marathi_shabda.egg-info/entry_points.txt +2 -0
  32. marathi_shabda-0.1.0/src/marathi_shabda.egg-info/requires.txt +7 -0
  33. marathi_shabda-0.1.0/src/marathi_shabda.egg-info/top_level.txt +1 -0
  34. marathi_shabda-0.1.0/tests/test_inference.py +64 -0
  35. marathi_shabda-0.1.0/tests/test_morphology.py +96 -0
  36. marathi_shabda-0.1.0/tests/test_normalization.py +103 -0
marathi_shabda-0.1.0/CONTRIBUTING.md
@@ -0,0 +1,101 @@
1
+ # Contributing to marathi-shabda
2
+
3
+ Thank you for your interest in contributing to marathi-shabda! This document provides guidelines for contributing to the project.
4
+
5
+ ## How to Contribute
6
+
7
+ ### Reporting Issues
8
+
9
+ - Use the GitHub issue tracker
10
+ - Provide clear description of the problem
11
+ - Include example Marathi words that demonstrate the issue
12
+ - Specify expected vs. actual behavior
13
+
14
+ ### Adding Vibhakti Rules
15
+
16
+ To add new vibhakti detection rules:
17
+
18
+ 1. Edit `src/marathi_shabda/morphology/vibhakti_rules.py`
19
+ 2. Add your rule to `VIBHAKTI_SUFFIXES` list
20
+ 3. Ensure rules are ordered by length (longest first)
21
+ 4. Set appropriate priority (1 = highest)
22
+ 5. Add test cases in `tests/test_morphology.py`
23
+
24
+ Example:
25
+ ```python
26
+ VibhaktiRule("नंतर", VibhaktiType.PANCHAMI, 1),
27
+ ```
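The ordering requirement in steps 3 and 4 can be illustrated with a minimal sketch. The names `VibhaktiRule`, `VibhaktiType`, and the positional shape `(suffix, type, priority)` come from the example above; the field names, the enum values, the `sort_rules` and `match_suffix` helpers, and the vibhakti assigned to each toy suffix are assumptions made for illustration, not the package's actual code.

```python
from enum import Enum
from typing import List, NamedTuple, Optional


class VibhaktiType(Enum):
    # Member names appear in the project docs; the values are placeholders.
    TRUTIYA = "तृतीया"
    PANCHAMI = "पंचमी"
    SAPTAMI = "सप्तमी"


class VibhaktiRule(NamedTuple):
    suffix: str              # assumed field name
    vibhakti: VibhaktiType   # assumed field name
    priority: int            # 1 = highest


# Deliberately listed out of order to show why sorting matters.
# The vibhakti assigned to each suffix is illustrative only.
RULES: List[VibhaktiRule] = [
    VibhaktiRule("ये", VibhaktiType.SAPTAMI, 2),
    VibhaktiRule("मध्ये", VibhaktiType.SAPTAMI, 1),
    VibhaktiRule("ने", VibhaktiType.TRUTIYA, 1),
]


def sort_rules(rules: List[VibhaktiRule]) -> List[VibhaktiRule]:
    """Longest suffix first; ties are broken by priority (1 = highest)."""
    return sorted(rules, key=lambda r: (-len(r.suffix), r.priority))


def match_suffix(word: str, rules: List[VibhaktiRule]) -> Optional[VibhaktiRule]:
    """Return the first rule whose suffix ends the word, leaving a non-empty stem."""
    for rule in sort_rules(rules):
        if len(word) > len(rule.suffix) and word.endswith(rule.suffix):
            return rule
    return None
```

Because `मध्ये` itself ends in `ये`, a list checked shortest-first would strip the wrong suffix from a word such as `घरामध्ये`; sorting by length avoids that.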
28
+
29
+ ### Adding Stem Alternations
30
+
31
+ To add stem alternation patterns:
32
+
33
+ 1. Edit `src/marathi_shabda/morphology/stem_alternations.py`
34
+ 2. Add pattern to `STEM_ALTERNATIONS` list
35
+ 3. Test with real examples
36
+
37
+ ### Improving Transliteration
38
+
39
+ To improve Roman → Devanagari mapping:
40
+
41
+ 1. Edit `src/marathi_shabda/normalization/transliterator.py`
42
+ 2. Add mappings to `TRANSLITERATION_MAP`
43
+ 3. Order by length (longest first)
44
+ 4. Test with common words
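A rough sketch of why step 3 matters, assuming the map is applied greedily from left to right. `TOY_MAP`, `ordered_pairs`, and `transliterate` are hypothetical names; the real `TRANSLITERATION_MAP` is not shown here and may be structured differently. The toy mapping covers only a few consonants and ignores vowel signs entirely.

```python
from typing import Dict, List, Tuple

# Toy mapping for illustration; not the package's TRANSLITERATION_MAP.
TOY_MAP: Dict[str, str] = {"k": "क", "h": "ह", "kh": "ख", "s": "स", "sh": "श"}


def ordered_pairs(mapping: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (roman, devanagari) pairs with the longest roman keys first."""
    return sorted(mapping.items(), key=lambda kv: len(kv[0]), reverse=True)


def transliterate(text: str, mapping: Dict[str, str]) -> str:
    """Greedy longest-match replacement; unmapped characters pass through."""
    pairs = ordered_pairs(mapping)
    out: List[str] = []
    i = 0
    while i < len(text):
        for roman, deva in pairs:
            if text.startswith(roman, i):
                out.append(deva)
                i += len(roman)
                break
        else:
            # No mapping starts at this position: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With the longest keys tried first, `kh` maps to a single letter instead of being split into `k` followed by `h`.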
45
+
46
+ ## Code Style
47
+
48
+ - Follow PEP 8
49
+ - Use type hints
50
+ - Write docstrings for all public functions
51
+ - Keep functions focused and testable
52
+ - Run `ruff check` before committing
53
+ - Run `mypy` for type checking
54
+
55
+ ## Testing
56
+
57
+ ```bash
58
+ # Run all tests
59
+ pytest tests/ -v
60
+
61
+ # Run with coverage
62
+ pytest tests/ --cov=marathi_shabda
63
+
64
+ # Run specific test file
65
+ pytest tests/test_morphology.py -v
66
+ ```
67
+
68
+ ## Development Setup
69
+
70
+ ```bash
71
+ # Clone repository
72
+ git clone https://github.com/yourusername/marathi-shabda.git
73
+ cd marathi-shabda
74
+
75
+ # Create virtual environment
76
+ python -m venv venv
77
+ source venv/bin/activate # On Windows: venv\Scripts\activate
78
+
79
+ # Install in development mode
80
+ pip install -e ".[dev]"
81
+
82
+ # Run tests
83
+ pytest
84
+ ```
85
+
86
+ ## Philosophy
87
+
88
+ Remember the core principle:
89
+
90
+ > **When unsure, defer. When confident, explain why.**
91
+
92
+ - Don't hallucinate meanings
93
+ - Surface ambiguity, don't hide it
94
+ - Conservative inference over aggressive guessing
95
+ - Dictionary-backed truth over heuristics
96
+
97
+ ## Questions?
98
+
99
+ Open an issue on GitHub or start a discussion.
100
+
101
+ Thank you for helping improve Marathi language tooling! 🙏
marathi_shabda-0.1.0/LICENSE
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Marathi Pratham Contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
marathi_shabda-0.1.0/MANIFEST.in
@@ -0,0 +1,4 @@
1
+ include src/marathi_shabda/data/*.db
2
+ include README.md
3
+ include LICENSE
4
+ include CONTRIBUTING.md
marathi_shabda-0.1.0/PKG-INFO
@@ -0,0 +1,329 @@
1
+ Metadata-Version: 2.4
2
+ Name: marathi-shabda
3
+ Version: 0.1.0
4
+ Summary: Deterministic, offline Marathi word analysis library (shabda = word in Marathi)
5
+ Author: Marathi Pratham Contributors
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/yourusername/marathi-shabda
8
+ Project-URL: Documentation, https://github.com/yourusername/marathi-shabda#readme
9
+ Project-URL: Repository, https://github.com/yourusername/marathi-shabda
10
+ Project-URL: Issues, https://github.com/yourusername/marathi-shabda/issues
11
+ Keywords: marathi,nlp,morphology,dictionary,devanagari,lemmatization
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Education
15
+ Classifier: Intended Audience :: Science/Research
16
+ Classifier: License :: OSI Approved :: MIT License
17
+ Classifier: Natural Language :: Marathi
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Programming Language :: Python :: 3
20
+ Classifier: Programming Language :: Python :: 3.8
21
+ Classifier: Programming Language :: Python :: 3.9
22
+ Classifier: Programming Language :: Python :: 3.10
23
+ Classifier: Programming Language :: Python :: 3.11
24
+ Classifier: Programming Language :: Python :: 3.12
25
+ Classifier: Topic :: Text Processing :: Linguistic
26
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
27
+ Requires-Python: >=3.8
28
+ Description-Content-Type: text/markdown
29
+ License-File: LICENSE
30
+ Provides-Extra: dev
31
+ Requires-Dist: pytest>=7.0; extra == "dev"
32
+ Requires-Dist: pytest-cov>=4.0; extra == "dev"
33
+ Requires-Dist: ruff>=0.1.0; extra == "dev"
34
+ Requires-Dist: mypy>=1.0; extra == "dev"
35
+ Requires-Dist: build>=0.10; extra == "dev"
36
+ Dynamic: license-file
37
+
38
+ # marathi-shabda
39
+
40
+ **Deterministic, offline Marathi word analysis library**
41
+
42
+ [![PyPI version](https://badge.fury.io/py/marathi-shabda.svg)](https://badge.fury.io/py/marathi-shabda)
43
+ [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
44
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
45
+
46
+ ---
47
+
48
+ ## What is marathi-shabda?
49
+
50
+ `marathi-shabda` is a production-quality Python library for analyzing Marathi words. It provides:
51
+
52
+ 1. **Lemma (stem) extraction** from inflected Marathi words
53
+ 2. **Dictionary lookup** (Marathi ↔ English) with meanings
54
+ 3. **Morphological analysis** (रूप परिचय) including POS, vibhakti, and kāl detection
55
+
56
+ ### Why "pratham" (प्रथम)?
57
+
58
+ *Pratham* means "first" in Marathi. This library provides the **first step** in Marathi text analysis: understanding individual words before tackling sentences or documents.
59
+
60
+ ---
61
+
62
+ ## Motivation
63
+
64
+ Marathi language tooling lags behind other Indian languages. Existing solutions either:
65
+ - Require network access (API-based)
66
+ - Hallucinate meanings (LLM-based)
67
+ - Lack linguistic grounding (pure ML)
68
+
69
+ **marathi-shabda** is different:
70
+ - ✅ **Offline-first**: No network, no API keys
71
+ - ✅ **Dictionary-backed**: Authoritative meanings, no hallucinations
72
+ - ✅ **Explainable**: Shows reasoning for every decision
73
+ - ✅ **Honest about limitations**: Surfaces ambiguity instead of hiding it
74
+
75
+ ---
76
+
77
+ ## What It Does
78
+
79
+ ### ✅ Supported Features
80
+
81
+ - **Lemma extraction**: `पाण्यावर` → `पाणी` (water)
82
+ - **Vibhakti detection**: Identifies case markers (तृतीया, सप्तमी, संबंध, etc.)
83
+ - **Dictionary lookup**: Marathi → English meanings
84
+ - **POS tagging**: Conservative noun/verb/adjective classification
85
+ - **Kāl inference**: Basic tense detection for verbs
86
+ - **Roman input**: Accepts romanized Marathi (e.g., `pani` → `पाणी`)
87
+ - **Stem alternations**: Handles oblique forms (`पाण्य` → `पाणी`)
88
+
89
+ ### ❌ Explicit Non-Goals
90
+
91
+ This library **does NOT**:
92
+ - Parse sentences or multi-word phrases
93
+ - Claim grammatical correctness in all contexts
94
+ - Infer semantics beyond dictionary meanings
95
+ - Require network access
96
+ - Use machine learning (v0.1.0)
97
+
98
+ ---
99
+
100
+ ## Installation
101
+
102
+ ```bash
103
+ pip install marathi-shabda
104
+ ```
105
+
106
+ **Requirements**: Python 3.8+, no external dependencies
107
+
108
+ ---
109
+
110
+ ## Quick Start
111
+
112
+ ### 1. Lemma Extraction
113
+
114
+ ```python
115
+ from marathi_shabda import get_lemma
116
+
117
+ result = get_lemma("पाण्यावर")
118
+ print(result.lemma) # पाणी
119
+ print(result.confidence) # 0.9
120
+ print(result.detected_vibhakti) # VibhaktiType.SAPTAMI (सप्तमी)
121
+ print(result.explanation) # "Detected सप्तमी vibhakti"
122
+ ```
123
+
124
+ ### 2. Dictionary Lookup
125
+
126
+ ```python
127
+ from marathi_shabda import lookup_word
128
+
129
+ result = lookup_word("पाणी")
130
+ print(result.english_meanings) # ['water']
131
+ print(result.found) # True
132
+
133
+ # Also works with Roman input
134
+ result = lookup_word("pani")
135
+ print(result.lemma) # पाणी
136
+ ```
137
+
138
+ ### 3. Morphological Analysis
139
+
140
+ ```python
141
+ from marathi_shabda import analyze_word
142
+
143
+ result = analyze_word("मुलाने")
144
+ print(result.lemma) # मुल
145
+ print(result.pos) # POSTag.NOUN
146
+ print(result.vibhakti) # VibhaktiType.TRUTIYA (तृतीया)
147
+ print(result.confidence) # 0.9
148
+ print(result.explanation)
149
+ # "Detected तृतीया vibhakti; Inferred noun"
150
+ ```
151
+
152
+ ---
153
+
154
+ ## How It Works
155
+
156
+ ### Architecture
157
+
158
+ ```
159
+ Input Word
160
+
161
+ Normalization (Roman → Devanagari)
162
+
163
+ Dictionary Check (exact match?)
164
+
165
+ Vibhakti Detection (longest-first)
166
+
167
+ Stem Alternations (पाण्य → पाणी)
168
+
169
+ Dictionary Validation (lemma exists?)
170
+
171
+ POS & Kāl Inference
172
+
173
+ Result with Confidence
174
+ ```
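The pipeline above can be sketched end to end with a toy dictionary. This is a simplified illustration under stated assumptions, not the library's implementation: `DICTIONARY`, `SUFFIXES`, `ALTERNATIONS`, `candidate_lemmas`, and `lemmatize` are hypothetical names, the alternation patterns are guesses that happen to cover the two example words, and the normalization and POS/kāl stages are omitted.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Toy data for illustration only.
DICTIONARY: Dict[str, List[str]] = {"पाणी": ["water"], "घर": ["house"]}
SUFFIXES: List[str] = ["मध्ये", "वर", "ने"]
# (stem ending, replacement) pairs; assumed patterns, not the package's list.
ALTERNATIONS: List[Tuple[str, str]] = [("्या", "ी"), ("ा", "")]


def candidate_lemmas(stem: str) -> Iterator[str]:
    """Yield the bare stem, then each form produced by a stem alternation."""
    yield stem
    for ending, replacement in ALTERNATIONS:
        if stem.endswith(ending):
            yield stem[: -len(ending)] + replacement


def lemmatize(word: str) -> Tuple[Optional[str], float]:
    """Return (lemma, confidence); rules propose, the dictionary decides."""
    # Dictionary check: an exact match needs no further analysis.
    if word in DICTIONARY:
        return word, 1.0
    # Vibhakti detection, longest suffix first.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if len(word) > len(suffix) and word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Stem alternations, each validated against the dictionary.
            for candidate in candidate_lemmas(stem):
                if candidate in DICTIONARY:
                    return candidate, 0.9
    # Nothing validated: report "not found" rather than guess.
    return None, 0.0
```

In this sketch `lemmatize("पाण्यावर")` strips `वर`, rewrites the oblique stem, and accepts `पाणी` only because the dictionary contains it.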
175
+
176
+ ### Key Principles
177
+
178
+ 1. **Dictionary-first validation**: Rules generate candidates, dictionary decides truth
179
+ 2. **Longest-match-first**: Detects `मध्ये` before `ये`
180
+ 3. **Conservative inference**: Returns `UNKNOWN` when uncertain
181
+ 4. **Explainable decisions**: Every result includes reasoning
182
+
183
+ ---
184
+
185
+ ## Confidence & Ambiguity
186
+
187
+ ### Confidence Scores
188
+
189
+ - **1.0**: Exact dictionary match
190
+ - **0.9**: Vibhakti detected, lemma validated
191
+ - **0.7**: Ambiguous (multiple possible lemmas)
192
+ - **0.0**: Word not in dictionary
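One way to read the four tiers above is as a function of two facts: whether the input matched the dictionary exactly, and how many candidate lemmas the dictionary validated. The sketch below encodes that reading; the function name `score` and its parameters are hypothetical, and the library may derive its scores differently.

```python
from typing import List, Tuple


def score(exact_match: bool, validated_candidates: List[str]) -> Tuple[float, bool]:
    """Map an analysis outcome to (confidence, ambiguous) using the documented tiers."""
    if exact_match:
        return 1.0, False   # exact dictionary match
    if len(validated_candidates) == 1:
        return 0.9, False   # vibhakti detected, single lemma validated
    if len(validated_candidates) > 1:
        return 0.7, True    # several lemmas validated: surface the ambiguity
    return 0.0, False       # word not in dictionary
```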
193
+
194
+ ### Handling Ambiguity
195
+
196
+ ```python
197
+ result = get_lemma("घरात")
198
+ if result.ambiguous:
199
+ print(f"Multiple interpretations: {result.candidates}")
200
+ # ['घर', 'घरात'] # Could be noun or compound
201
+ ```
202
+
203
+ **Philosophy**: We surface ambiguity instead of making false claims.
204
+
205
+ ---
206
+
207
+ ## Offline Guarantee
208
+
209
+ **marathi-shabda** works completely offline:
210
+ - ✅ No network requests
211
+ - ✅ No API keys
212
+ - ✅ No telemetry
213
+ - ✅ Bundled SQLite database
214
+ - ✅ Pure Python (stdlib only)
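As an illustration of how a bundled SQLite file can be queried with the standard library alone, the sketch below builds a throwaway database and reopens it read-only. The table name `entries`, its columns, and the helper names are assumptions; the schema of the shipped `dictionary.db` is not shown in this document.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import List, Tuple


def build_demo_db(path: Path) -> None:
    """Create a tiny stand-in database with a hypothetical schema."""
    conn = sqlite3.connect(str(path))
    try:
        conn.execute("CREATE TABLE entries (marathi TEXT, english TEXT)")
        conn.execute("INSERT INTO entries VALUES (?, ?)", ("पाणी", "water"))
        conn.commit()
    finally:
        conn.close()


def lookup(path: Path, word: str) -> List[str]:
    """Open the file read-only via a SQLite URI and return English meanings."""
    conn = sqlite3.connect(f"{path.resolve().as_uri()}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT english FROM entries WHERE marathi = ?", (word,)
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


def demo() -> Tuple[List[str], List[str]]:
    """Build the demo database in a temporary directory and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "demo.db"
        build_demo_db(db_path)
        return lookup(db_path, "पाणी"), lookup(db_path, "xyz")
```

Opening with `mode=ro` means a lookup cannot modify the data file, which suits a dictionary shipped inside an installed package.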
215
+
216
+ Perfect for:
217
+ - Privacy-sensitive applications
218
+ - Offline environments
219
+ - Embedded systems
220
+ - Research reproducibility
221
+
222
+ ---
223
+
224
+ ## Limitations
225
+
226
+ ### Current Limitations (v0.1.0)
227
+
228
+ - **Single words only**: No sentence parsing
229
+ - **Conservative POS tagging**: Limited to obvious cases
230
+ - **Basic kāl detection**: Only common verb patterns
231
+ - **No semantic analysis**: Dictionary meanings only
232
+ - **Limited verb conjugation**: Focus on nouns/vibhakti
233
+
234
+ ### Known Edge Cases
235
+
236
+ - Compound words may not split correctly
237
+ - Rare vibhaktis may not be detected
238
+ - Ambiguous forms return multiple candidates
239
+ - Roman transliteration is approximate
240
+
241
+ **We document limitations honestly.** If you encounter issues, please report them!
242
+
243
+ ---
244
+
245
+ ## Future Roadmap
246
+
247
+ ### v0.2.0 (Planned)
248
+ - [ ] Extended database schema (POS, gender, number)
249
+ - [ ] Improved verb conjugation analysis
250
+ - [ ] Compound word splitting
251
+ - [ ] Performance optimizations
252
+
253
+ ### v0.3.0 (Planned)
254
+ - [ ] Optional SLM integration for ambiguity resolution
255
+ - [ ] Sentence-level analysis (experimental)
256
+ - [ ] Batch processing API
257
+
258
+ ### Long-term
259
+ - [ ] Hybrid rule-based + ML approach
260
+ - [ ] Community-contributed dictionary expansions
261
+ - [ ] Web API (optional deployment)
262
+
263
+ ---
264
+
265
+ ## Command-Line Interface
266
+
267
+ ```bash
268
+ # Extract lemma
269
+ marathi-shabda lemma पाण्यावर
270
+
271
+ # Dictionary lookup
272
+ marathi-shabda lookup पाणी
273
+
274
+ # Full analysis
275
+ marathi-shabda analyze मुलाने
276
+ ```
277
+
278
+ ---
279
+
280
+ ## Contributing
281
+
282
+ We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
283
+ - How to add vibhakti rules
284
+ - How to improve transliteration
285
+ - Code style guidelines
286
+ - Testing requirements
287
+
288
+ ---
289
+
290
+ ## License
291
+
292
+ MIT License - see [LICENSE](LICENSE) for details
293
+
294
+ ---
295
+
296
+ ## Acknowledgments
297
+
298
+ - Marathi language scholars and grammarians
299
+ - Open-source NLP community
300
+ - Contributors and testers
301
+
302
+ ---
303
+
304
+ ## Citation
305
+
306
+ If you use marathi-shabda in research, please cite:
307
+
308
+ ```bibtex
309
+ @software{marathi_shabda,
310
+ title = {marathi-shabda: Deterministic Marathi Word Analysis},
311
+ author = {Marathi Pratham Contributors},
312
+ year = {2026},
313
+ url = {https://github.com/yourusername/marathi-shabda}
314
+ }
315
+ ```
316
+
317
+ ---
318
+
319
+ ## Support
320
+
321
+ - **Issues**: [GitHub Issues](https://github.com/yourusername/marathi-shabda/issues)
322
+ - **Discussions**: [GitHub Discussions](https://github.com/yourusername/marathi-shabda/discussions)
323
+ - **Email**: [your-email@example.com]
324
+
325
+ ---
326
+
327
+ **Philosophy**: *When unsure, defer. When confident, explain why.*
328
+
329
+ Built with respect for the Marathi language and its speakers. 🙏