sonatoki 0.1.0__tar.gz → 0.1.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sonatoki-0.1.0 → sonatoki-0.1.1}/PKG-INFO +29 -6
- {sonatoki-0.1.0 → sonatoki-0.1.1}/README.md +28 -5
- {sonatoki-0.1.0 → sonatoki-0.1.1}/pyproject.toml +1 -1
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/Scorers.py +16 -16
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/ilo.py +30 -11
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/test_ilo.py +27 -7
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/tokenize_cases/tokenize_words_tok.yml +15 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/LICENSE +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/Cleaners.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/Filters.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/Preprocessors.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/Tokenizers.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/__init__.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/__main__.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/constants.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/linku.json +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/__init__.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/test_cleaners.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/test_filters.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/test_preprocessors.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/test_scorers.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/test_tokenize.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/test_utils.py +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/tokenize_cases/tokenize_sentences.yml +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/tokenize_cases/tokenize_sentences_tok.yml +0 -0
- {sonatoki-0.1.0 → sonatoki-0.1.1}/tests/tokenize_cases/tokenize_words.yml +0 -0
{sonatoki-0.1.0 → sonatoki-0.1.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sonatoki
-Version: 0.1.0
+Version: 0.1.1
 Summary: ilo li moku e toki li pana e sona ni: ni li toki ala toki pona?
 Author-Email: "jan Kekan San (@gregdan3)" <gregory.danielson3@gmail.com>
 License: AGPL-3.0-or-later
@@ -44,7 +44,7 @@ from sonatoki.Filters import (
     ProperName,
     Punctuations,
 )
-from sonatoki.Scorers import
+from sonatoki.Scorers import SoftScaling
 from sonatoki.Cleaners import ConsecutiveDuplicates
 from sonatoki.Tokenizers import word_tokenize_tok
 from sonatoki.Preprocessors import URLs, DiscordEmotes
@@ -55,22 +55,23 @@ def main():
         ignoring_filters=[Numerics, Punctuations],
         scoring_filters=[NimiLinku, Syllabic, ProperName, Alphabetic],
         cleaners=[ConsecutiveDuplicates],
-        scorer=
+        scorer=SoftScaling,
         tokenizer=word_tokenize_tok,
     )
     ilo.is_toki_pona("imagine how is touch the sky") # False
     ilo.is_toki_pona("o pilin insa e ni: sina pilin e sewi") # True
+    ilo.is_toki_pona("I Think I Can Evade Detection") # False
 
 if __name__ == "__main__":
     main()
 ```
 
-`Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I
+`Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I recommend using. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the dedicated Toki Pona tokenizer `word_tokenize_tok`.
 
 ## Development
 
 1. Install [pdm](https://github.com/pdm-project/pdm)
-1. `pdm
+1. `pdm install --dev`
 1. Open any file you like!
 
 ## FAQ
@@ -81,4 +82,26 @@ The intent is to show our methodology to the Unicode Consortium, particularly to
 
 After our proposal has been examined and a result given by the committee, I will translate this file and library into Toki Pona, with a note left behind for those who do not understand it.
 
-###
+### What's the deal with the tokenizers?
+
+The Toki Pona tokenizer `word_tokenize_tok` is very specific in always separating writing characters from punctuation, and leaving contiguous punctuation as contiguous- this is a level of precision that NLTK's English tokenizer does not want for several reasons, such as that English words can have "punctuation" characters in them.
+
+Toki Pona doesn't have any mid-word symbols when rendered in the Latin alphabet, so a more aggressive tokenizer is highly desirable.
+
+The other tokenizers are provided as a comparison case more than anything. I do not recommend their use.
+
+### Aren't there a lot of false positives?
+
+Yes. It's up to you to use this tool responsibly on input you've done your best to clean, and better, use stronger filters before weaker ones. For now though, here's a list of relevant false positives:
+
+- `ProperName` will errantly match text in languages without a capital/lowercase distinction, artificially inflating the scores.
+- `Alphabetic` will match a _lot_ of undesirable text- it essentially allows 14 letters of the English alphabet.
+
+### Don't some of the cleaners/filters conflict?
+
+Yes. Some do so
+
+- `ConsecutiveDuplicates` may errantly change a word's validity. For example, "manna" is phonotactically invalid in Toki Pona, but would become "mana" which is valid.
+- `ConsecutiveDuplicates` will not work correctly with syllabaries (alphabets, but representing a pair of consonant and vowel).
+
+You'll notice a _lot_ of these are troubles regarding the application of latin alphabet filters to non-latin text. Working on it!
{sonatoki-0.1.0 → sonatoki-0.1.1}/README.md

@@ -30,7 +30,7 @@ from sonatoki.Filters import (
     ProperName,
     Punctuations,
 )
-from sonatoki.Scorers import
+from sonatoki.Scorers import SoftScaling
 from sonatoki.Cleaners import ConsecutiveDuplicates
 from sonatoki.Tokenizers import word_tokenize_tok
 from sonatoki.Preprocessors import URLs, DiscordEmotes
@@ -41,22 +41,23 @@ def main():
         ignoring_filters=[Numerics, Punctuations],
         scoring_filters=[NimiLinku, Syllabic, ProperName, Alphabetic],
         cleaners=[ConsecutiveDuplicates],
-        scorer=
+        scorer=SoftScaling,
         tokenizer=word_tokenize_tok,
     )
     ilo.is_toki_pona("imagine how is touch the sky") # False
     ilo.is_toki_pona("o pilin insa e ni: sina pilin e sewi") # True
+    ilo.is_toki_pona("I Think I Can Evade Detection") # False
 
 if __name__ == "__main__":
     main()
 ```
 
-`Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I
+`Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I recommend using. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the dedicated Toki Pona tokenizer `word_tokenize_tok`.
 
 ## Development
 
 1. Install [pdm](https://github.com/pdm-project/pdm)
-1. `pdm
+1. `pdm install --dev`
 1. Open any file you like!
 
 ## FAQ
@@ -67,4 +68,26 @@ The intent is to show our methodology to the Unicode Consortium, particularly to
 
 After our proposal has been examined and a result given by the committee, I will translate this file and library into Toki Pona, with a note left behind for those who do not understand it.
 
-###
+### What's the deal with the tokenizers?
+
+The Toki Pona tokenizer `word_tokenize_tok` is very specific in always separating writing characters from punctuation, and leaving contiguous punctuation as contiguous- this is a level of precision that NLTK's English tokenizer does not want for several reasons, such as that English words can have "punctuation" characters in them.
+
+Toki Pona doesn't have any mid-word symbols when rendered in the Latin alphabet, so a more aggressive tokenizer is highly desirable.
+
+The other tokenizers are provided as a comparison case more than anything. I do not recommend their use.
+
+### Aren't there a lot of false positives?
+
+Yes. It's up to you to use this tool responsibly on input you've done your best to clean, and better, use stronger filters before weaker ones. For now though, here's a list of relevant false positives:
+
+- `ProperName` will errantly match text in languages without a capital/lowercase distinction, artificially inflating the scores.
+- `Alphabetic` will match a _lot_ of undesirable text- it essentially allows 14 letters of the English alphabet.
+
+### Don't some of the cleaners/filters conflict?
+
+Yes. Some do so
+
+- `ConsecutiveDuplicates` may errantly change a word's validity. For example, "manna" is phonotactically invalid in Toki Pona, but would become "mana" which is valid.
+- `ConsecutiveDuplicates` will not work correctly with syllabaries (alphabets, but representing a pair of consonant and vowel).
+
+You'll notice a _lot_ of these are troubles regarding the application of latin alphabet filters to non-latin text. Working on it!
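For a concrete view of the cleaner conflict the FAQ describes, here is an illustration-only snippet with the same effect on Latin text as the collapsing behavior attributed to `ConsecutiveDuplicates`. This is not the library's implementation, just a regex sketch of the "manna" → "mana" example above:

```python
import re

# Illustration only: collapse runs of a repeated character, the behavior the
# FAQ attributes to ConsecutiveDuplicates for Latin-alphabet input.
def collapse_duplicates(token: str) -> str:
    return re.sub(r"(.)\1+", r"\1", token)

print(collapse_duplicates("manna"))    # "mana"  -> now phonotactically valid, as the FAQ warns
print(collapse_duplicates("mooooku"))  # "moku"  -> repeated letters collapse to one
```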
{sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/Scorers.py

@@ -1,5 +1,6 @@
 # STL
 import math
+import logging
 from abc import ABC, abstractmethod
 from typing import Dict, List, Type, Union
 
@@ -9,24 +10,13 @@ from typing_extensions import override
 # LOCAL
 from sonatoki.Filters import Filter
 
+LOG = logging.getLogger(__name__)
+
 Number = Union[int, float]
 Weights = Dict[str, Number]
 
 
 class Scorer(ABC):
-    weights: Weights
-
-    # @classmethod
-    # def __score(cls, token: str, filters: List[Type[Filter]]) -> Tuple[int, Number]:
-    #     for filter in filters:
-    #         if not filter.filter(token):
-    #             continue
-    #         # NOTE: We assume the filters are ordered by their score
-    #         # Thus the first match is also the highest scoring
-    #         return filter.counts, cls.weights[filter.__name__]
-    #     # TODO: override weight if count is 0?
-    #     return 1, 0
-
     @classmethod
     @abstractmethod
     def score(cls, tokens: List[str], filters: List[Type[Filter]]) -> Number:
@@ -40,7 +30,12 @@ class PassFail(Scorer):
     def __score(cls, token: str, filters: List[Type[Filter]]) -> Number:
         for f in filters:
             if f.filter(token):
-
+                score = 1
+                LOG.debug(
+                    "%12s.%s('%s') = %.2f", cls.__name__, f.__name__, token, score
+                )
+                return score
+        LOG.debug("%12s('%s') = 0.00", cls.__name__, token)
         return 0
 
     @classmethod
@@ -67,7 +62,12 @@ class Scaling(Scorer):
     def score_token(cls, token: str, filters: List[Type[Filter]], scale: int):
         for i, f in enumerate(filters):
             if f.filter(token):
-
+                score = scale - i
+                LOG.debug(
+                    "%12s.%s('%s') = %.2f", cls.__name__, f.__name__, token, score
+                )
+                return score
+        LOG.debug("%12s('%s') = 0.00", cls.__name__, token)
         return 0
 
     @classmethod
@@ -95,7 +95,7 @@ class SoftScaling(Scaling):
     def sigmoid(n: int) -> Number:
         return 1 / (1 + math.exp(-(0.30 * (n - 1))))
         # n-1 makes sigmoid(1) == 0.5
-        # 0.30 softens scaling
+        # 0.30 softens scaling in favor of short input
        # return n / (1+abs(n)) # too weak in 0.7+
 
     @classmethod
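The updated comment on the 0.30 coefficient is easy to sanity-check. Here is a standalone sketch that mirrors the formula from the hunk above (reproduced locally rather than imported, so the numbers can be verified in isolation):

```python
import math

# Local copy of SoftScaling's sigmoid from the hunk above, used only to show
# how gently it grows: 0.30 is the "softening" coefficient and n - 1 pins
# sigmoid(1) at exactly 0.5.
def sigmoid(n: int) -> float:
    return 1 / (1 + math.exp(-(0.30 * (n - 1))))

for n in (1, 2, 5, 11):
    print(n, round(sigmoid(n), 3))
# 1  0.5
# 2  0.574
# 5  0.769
# 11 0.953
```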
{sonatoki-0.1.0 → sonatoki-0.1.1}/src/sonatoki/ilo.py

@@ -1,5 +1,6 @@
 # STL
-
+import logging
+from typing import List, Type, Tuple
 
 # LOCAL
 from sonatoki.Filters import Filter
@@ -8,6 +9,8 @@ from sonatoki.Cleaners import Cleaner
 from sonatoki.Tokenizers import Tokenizer
 from sonatoki.Preprocessors import Preprocessor
 
+LOG = logging.getLogger(__name__)
+
 
 class Ilo:
     __preprocessors: List[Type[Preprocessor]]
@@ -17,7 +20,7 @@ class Ilo:
     __scorer: Type[Scorer]
     __tokenize: Tokenizer
     __passing_score: Number
-
+    logging_threshold: Number = 1.0
 
     def __init__(
         self,
@@ -83,19 +86,35 @@ class Ilo:
     def __score_tokens(self, tokens: List[str]) -> float:
         return self.__scorer.score(tokens, self.__scoring_filters)
 
-    def
+    def _is_toki_pona(
+        self, message: str
+    ) -> Tuple[str, List[str], List[str], List[str], Number, bool]:
+        """Returns all components of the processing algorithm:
+        - Preprocessed message (str)
+        - Tokenized message (list[str])
+        - Filtered message (list[str])
+        - Cleaned message (list[str])
+        - Score (float)
+        - Result (bool)
+        """
         preprocessed = self.__preprocess(message)
         tokenized = self.__tokenize(preprocessed)
         filtered = self.__filter_tokens(tokenized)
         cleaned = self.__clean_tokens(filtered)
         score = self.__score_tokens(cleaned)
+        result = score >= self.__passing_score
 
-        if
-
-
-
-
-
-
+        # NOTE: this method may break if above funcs start sharing a list
+        if score <= self.logging_threshold:
+            LOG.debug("Msg: %.2f %s", score, repr(message))
+            LOG.debug("Preproc: %s", repr(preprocessed))
+            LOG.debug("Tokenized: %s", tokenized)
+            LOG.debug("Filtered: %s", filtered)
+            LOG.debug("Cleaned: %s", cleaned)
+            # TODO: Move to each function? Loses ability to control when logging occurs by threshold
 
-        return score
+        return preprocessed, tokenized, filtered, cleaned, score, result
+
+    def is_toki_pona(self, message: str) -> bool:
+        *_, result = self._is_toki_pona(message)
+        return result
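The new `_is_toki_pona` return shape and the `logging_threshold` attribute can be exercised together. A minimal sketch, assuming an `Ilo` built like the README example; the `preprocessors=` parameter name and the `from sonatoki.ilo import Ilo` path are assumptions inferred from the imports and file layout, not confirmed by this diff:

```python
import logging

from sonatoki.ilo import Ilo
from sonatoki.Filters import (
    Numerics,
    Syllabic,
    NimiLinku,
    Alphabetic,
    ProperName,
    Punctuations,
)
from sonatoki.Scorers import SoftScaling
from sonatoki.Cleaners import ConsecutiveDuplicates
from sonatoki.Tokenizers import word_tokenize_tok
from sonatoki.Preprocessors import URLs, DiscordEmotes

logging.basicConfig(level=logging.DEBUG)  # surfaces the LOG.debug calls added in this version

ilo = Ilo(
    preprocessors=[URLs, DiscordEmotes],  # parameter name assumed, not shown in the diff
    ignoring_filters=[Numerics, Punctuations],
    scoring_filters=[NimiLinku, Syllabic, ProperName, Alphabetic],
    cleaners=[ConsecutiveDuplicates],
    scorer=SoftScaling,
    tokenizer=word_tokenize_tok,
    passing_score=0.8,
)
ilo.logging_threshold = 0.8  # per the diff, only messages scoring at or below this get logged

# Unpacking order follows the docstring added in this version.
preprocessed, tokenized, filtered, cleaned, score, result = ilo._is_toki_pona(
    "o pilin insa e ni: sina pilin e sewi"
)
print(score, result)  # is_toki_pona() returns just the final boolean
```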
{sonatoki-0.1.0 → sonatoki-0.1.1}/tests/test_ilo.py

@@ -36,18 +36,38 @@ def test_constructor():
         tokenizer=word_tokenize_tok,
         passing_score=0.8,
     )
-    ilo.
-    assert not ilo.is_toki_pona("super bruh moment 64")
+    # ilo._logging_threshold = 0.8
     assert ilo.is_toki_pona("mi unpa e mama sina")
-
-    assert ilo.is_toki_pona("
+    # toki pona
+    assert ilo.is_toki_pona("mama sina li lon seme? mi wile toki tawa ona")
+    assert ilo.is_toki_pona("sina sike pakala")
+    # names
     assert ilo.is_toki_pona("musi Homestuck li ike tawa mi")
+    # typoes
     assert ilo.is_toki_pona("mi mtue o kama sona")
+    assert ilo.is_toki_pona("mi mute o kma son")
+    # phonotactically valid
     assert ilo.is_toki_pona("ni li tenpo penpo")
+    # alphabetically valid
     assert ilo.is_toki_pona("ni li tptpt")
+    # a single
+    assert ilo.is_toki_pona("sipisi")
+
+    # soft scaling with syllablic filter at 2/4 will pass up to 5 syllablic words
+    assert ilo.is_toki_pona("walawa malama walama malama mupi")
+    # but fail 6 or more
+    assert not ilo.is_toki_pona("manama manama namana namana majani makala")
 
-
+    # TODO: should soft scaling save an alphabetically valid single word?
+    assert not ilo.is_toki_pona("tok")
+    assert not ilo.is_toki_pona("mtue")
+
+    # just english
+    assert not ilo.is_toki_pona("bong")
+    assert not ilo.is_toki_pona("super bruh moment 64")
+    # all names
+    assert not ilo.is_toki_pona("I Want To Evade The Filter")
+    # all alphabetic
     assert not ilo.is_toki_pona(
-        "
-        it's as asinine as in `e`-less speak"""
+        "aaa i non-saw usa's most multiple element-set. it's as asinine as in `e`-less speak"
     )