sonatoki 0.1.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- sonatoki/Cleaners.py +42 -0
- sonatoki/Filters.py +159 -0
- sonatoki/Preprocessors.py +131 -0
- sonatoki/Scorers.py +123 -0
- sonatoki/Tokenizers.py +64 -0
- sonatoki/__init__.py +0 -0
- sonatoki/__main__.py +9 -0
- sonatoki/constants.py +57 -0
- sonatoki/ilo.py +101 -0
- sonatoki/linku.json +1 -0
- sonatoki-0.1.0.dist-info/METADATA +84 -0
- sonatoki-0.1.0.dist-info/RECORD +14 -0
- sonatoki-0.1.0.dist-info/WHEEL +4 -0
- sonatoki-0.1.0.dist-info/licenses/LICENSE +661 -0

sonatoki-0.1.0.dist-info/METADATA
@@ -0,0 +1,84 @@
Metadata-Version: 2.1
Name: sonatoki
Version: 0.1.0
Summary: ilo li moku e toki li pana e sona ni: ni li toki ala toki pona?
Author-Email: "jan Kekan San (@gregdan3)" <gregory.danielson3@gmail.com>
License: AGPL-3.0-or-later
Requires-Python: >=3.8
Requires-Dist: unidecode>=1.3.6
Requires-Dist: regex>=2023.12.25
Requires-Dist: typing-extensions>=4.11.0
Requires-Dist: nltk>=3.8.1; extra == "nltk"
Provides-Extra: nltk
Description-Content-Type: text/markdown

# sona toki

## What is **sona toki**?

This library, "Language Knowledge," helps you identify whether a message is in Toki Pona. There is no grammar checking yet, so for now it primarily checks whether a given message contains enough Toki Pona words.

I wrote it with a variety of scraps and lessons learned from a prior project, [ilo pi toki pona taso, "toki-pona-only tool"](https://github.com/gregdan3/ilo-pi-toki-pona-taso). That tool will be rewritten to use this library shortly.

If you've ever worked on a similar project, you know the question "is this message in [language]?" does not have a consistent answer: the environment, the time, the speaker's preferences, and much more can all alter whether a given message counts as "in Toki Pona," and this applies to essentially any language.

This project "solves" that complex problem by offering a highly configurable and incredibly lazy parser.

## Quick Start

Install with your preferred Python package manager. For example, using pdm:

```sh
pdm init         # if your pyproject.toml doesn't exist yet
pdm add sonatoki
```
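
The metadata above also declares an optional `nltk` extra (`Requires-Dist: nltk>=3.8.1; extra == "nltk"`). If you want that optional NLTK dependency, the standard extras syntax should work; a minimal sketch, assuming your package manager handles PEP 508 extras:

```sh
# optional: install with the "nltk" extra declared in the package metadata
pdm add "sonatoki[nltk]"
# or, equivalently, with pip:
pip install "sonatoki[nltk]"
```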

Then get started with a script along these lines:

```py
from sonatoki.ilo import Ilo
from sonatoki.Filters import (
    Numerics,
    Syllabic,
    NimiLinku,
    Alphabetic,
    ProperName,
    Punctuations,
)
from sonatoki.Scorers import Scaling
from sonatoki.Cleaners import ConsecutiveDuplicates
from sonatoki.Tokenizers import word_tokenize_tok
from sonatoki.Preprocessors import URLs, DiscordEmotes

def main():
    ilo = Ilo(
        preprocessors=[URLs, DiscordEmotes],
        ignoring_filters=[Numerics, Punctuations],
        scoring_filters=[NimiLinku, Syllabic, ProperName, Alphabetic],
        cleaners=[ConsecutiveDuplicates],
        scorer=Scaling,
        tokenizer=word_tokenize_tok,
    )
    ilo.is_toki_pona("imagine how is touch the sky")  # False
    ilo.is_toki_pona("o pilin insa e ni: sina pilin e sewi")  # True

if __name__ == "__main__":
    main()
```

`Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I highly recommend using. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the default `word_tokenize_tok`.
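
As one illustration, here is a stricter variant that counts only dictionary words and proper names toward the score, reusing only the names imported in the Quick Start above. Treat it as a sketch, assuming the `Ilo` constructor parameters behave as in the Quick Start; the exact results will depend on the `Scaling` scorer's thresholds:

```py
from sonatoki.ilo import Ilo
from sonatoki.Filters import Numerics, Punctuations, NimiLinku, ProperName
from sonatoki.Scorers import Scaling
from sonatoki.Cleaners import ConsecutiveDuplicates
from sonatoki.Tokenizers import word_tokenize_tok
from sonatoki.Preprocessors import URLs, DiscordEmotes

# Stricter variant: omit Syllabic and Alphabetic from the scoring filters so
# that words merely matching Toki Pona's letters or syllable shapes no longer
# count; only dictionary words (NimiLinku) and proper names do.
strict_ilo = Ilo(
    preprocessors=[URLs, DiscordEmotes],
    ignoring_filters=[Numerics, Punctuations],
    scoring_filters=[NimiLinku, ProperName],
    cleaners=[ConsecutiveDuplicates],
    scorer=Scaling,
    tokenizer=word_tokenize_tok,
)
strict_ilo.is_toki_pona("sitelen tawa li pona tawa mi")  # likely True
```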
## Development

1. Install [pdm](https://github.com/pdm-project/pdm)
1. `pdm sync --dev`
1. Open any file you like!

## FAQ

### Why isn't this README/library written in Toki Pona?

The intent is to show our methodology to the Unicode Consortium, particularly to the Script Encoding Working Group (previously the Script Ad Hoc Group). As far as we're aware, zero members of the committee know Toki Pona, which unfortunately means we fall back on English.

After our proposal has been examined and a result given by the committee, I will translate this file and library into Toki Pona, with a note left behind for those who do not understand it.

### Why aren't any of the specific

sonatoki-0.1.0.dist-info/RECORD
@@ -0,0 +1,14 @@
sonatoki-0.1.0.dist-info/METADATA,sha256=EQaB5tsicEQ4wYn5curehbhkzGF0qHqC1bnUbOVDCu0,3332
sonatoki-0.1.0.dist-info/WHEEL,sha256=vnE8JVcI2Wz7GRKorsPArnBdnW2SWKWGow5gu5tHlRU,90
sonatoki-0.1.0.dist-info/licenses/LICENSE,sha256=DZak_2itbUtvHzD3E7GNUYSRK6jdOJ-GqncQ2weavLA,34523
sonatoki/Cleaners.py,sha256=gTZ9dSsnvKVUtxM_ECSZ-_2heh--nD5A9dCQR1ATb1c,1160
sonatoki/Filters.py,sha256=yzhYF79GX03cOwlR_-B8SPMQPZv4UpAPytH0fQwBE70,4093
sonatoki/Preprocessors.py,sha256=G2up2jKKSrHQtTQWYNWH_fkjgroL45ZeajVn1KUECt8,3431
sonatoki/Scorers.py,sha256=X1vo-eIPbtl0IC5suIX6hu-4VG7NSzR90rkrLpep8WY,3690
sonatoki/Tokenizers.py,sha256=epOG3jZHI3MSO_L_6Z3zsSkexDEMLVzA2ARg6EnPMO0,1628
sonatoki/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
sonatoki/__main__.py,sha256=6xc-wIrrFo9wTyn4zRQNAmqwmJBtVvCMwV-CrM-hueA,82
sonatoki/constants.py,sha256=h5rbCfu9YF76BsjQYud5d2wq1HODY05zOaw0Ir1cwjo,1320
sonatoki/ilo.py,sha256=Uu0zipAF-L-5Wxw_EBB7-EMc40PM4WBa59Atq0zmYYE,3482
sonatoki/linku.json,sha256=MdFuFRIHniPDUVxKEKuUg1KyzPVgcCj4ZeyvburCwD0,270928
sonatoki-0.1.0.dist-info/RECORD,,