sonatoki 0.1.2__tar.gz → 0.1.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sonatoki-0.1.2 → sonatoki-0.1.3}/PKG-INFO +30 -24
- {sonatoki-0.1.2 → sonatoki-0.1.3}/README.md +29 -23
- {sonatoki-0.1.2 → sonatoki-0.1.3}/pyproject.toml +1 -1
- sonatoki-0.1.3/src/sonatoki/Configs.py +80 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/Filters.py +5 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/Preprocessors.py +1 -4
- sonatoki-0.1.3/src/sonatoki/Tokenizers.py +76 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/constants.py +10 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/ilo.py +30 -30
- sonatoki-0.1.3/src/sonatoki/linku.json +1 -0
- sonatoki-0.1.3/src/sonatoki/sandbox.json +1 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/test_ilo.py +3 -30
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/test_tokenize.py +11 -11
- sonatoki-0.1.2/src/sonatoki/Tokenizers.py +0 -58
- sonatoki-0.1.2/src/sonatoki/linku.json +0 -1
- {sonatoki-0.1.2 → sonatoki-0.1.3}/LICENSE +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/Cleaners.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/Scorers.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/__init__.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/__main__.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/__init__.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/test_cleaners.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/test_filters.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/test_preprocessors.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/test_scorers.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/test_utils.py +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/tokenize_cases/tokenize_sentences.yml +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/tokenize_cases/tokenize_sentences_tok.yml +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/tokenize_cases/tokenize_words.yml +0 -0
- {sonatoki-0.1.2 → sonatoki-0.1.3}/tests/tokenize_cases/tokenize_words_tok.yml +0 -0
{sonatoki-0.1.2 → sonatoki-0.1.3}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sonatoki
-Version: 0.1.2
+Version: 0.1.3
 Summary: ilo li moku e toki li pana e sona ni: ni li toki ala toki pona?
 Author-Email: "jan Kekan San (@gregdan3)" <gregory.danielson3@gmail.com>
 License: AGPL-3.0-or-later
@@ -20,9 +20,9 @@ This library, "Language Knowledge," helps you identify whether a message is in T

 I wrote it with a variety of scraps and lessons learned from a prior project, [ilo pi toki pona taso, "toki-pona-only tool"](https://github.com/gregdan3/ilo-pi-toki-pona-taso). That tool will be rewritten to use this library shortly.

-If you've ever worked on a similar project, you know the question "is this message in [language]" is not a consistent one- the environment, time, preferences of the speaker, and much more, can all alter whether a given message is "in
+If you've ever worked on a similar project, you know the question "is this message in [language]" is not a consistent one- the environment, time, preferences of the speaker, and much more, can all alter whether a given message is "in" any specific language, and this question applies to Toki Pona too.

-This project "solves" that complex problem by offering a highly configurable
+This project "solves" that complex problem by offering a highly configurable parser, so you can tune it to your preferences and goals.

 ## Quick Start

@@ -36,28 +36,11 @@ pdm add sonatoki
 Then get started with a script along these lines:

 ```py
-from sonatoki.Filters import (
-    Numerics,
-    Syllabic,
-    NimiLinku,
-    Alphabetic,
-    ProperName,
-    Punctuations,
-)
-from sonatoki.Scorers import SoftScaling
-from sonatoki.Cleaners import ConsecutiveDuplicates
-from sonatoki.Tokenizers import word_tokenize_tok
-from sonatoki.Preprocessors import URLs, DiscordEmotes
+from sonatoki.ilo import Ilo
+from sonatoki.Configs import PrefConfig

 def main():
-    ilo = Ilo(
-        preprocessors=[URLs, DiscordEmotes],
-        ignoring_filters=[Numerics, Punctuations],
-        scoring_filters=[NimiLinku, Syllabic, ProperName, Alphabetic],
-        cleaners=[ConsecutiveDuplicates],
-        scorer=SoftScaling,
-        tokenizer=word_tokenize_tok,
-    )
+    ilo = Ilo(**PrefConfig)
     ilo.is_toki_pona("imagine how is touch the sky") # False
     ilo.is_toki_pona("o pilin insa e ni: sina pilin e sewi") # True
     ilo.is_toki_pona("I Think I Can Evade Detection") # False
@@ -66,7 +49,30 @@ if __name__ == "__main__":
     main()
 ```

-
+Or if you'd prefer to configure on your own:
+
+```py
+from copy import deepcopy
+from sonatoki.ilo import Ilo
+from sonatoki.Configs import BaseConfig
+from sonatoki.Filters import NimiPuAle, Phonotactic, ProperName
+from sonatoki.Scorers import SoftPassFail
+
+def main():
+    config = deepcopy(BaseConfig)
+    config["scoring_filters"].extend([NimiPuAle, Phonotactic, ProperName])
+    config["scorer"] = SoftPassFail
+
+    ilo = Ilo(**config)
+    ilo.is_toki_pona("mu mu!") # True
+    ilo.is_toki_pona("mi namako e moku mi") # True
+    ilo.is_toki_pona("ma wulin") # False
+
+if __name__ == "__main__":
+    main()
+```
+
+`Ilo` is highly configurable by necessity, so I recommend looking through the premade configs in `Configs` as well as the individual `Preprocessors`, `Filters`, and `Scorers`. The `Cleaners` module only contains one cleaner, which I recommend always using. Similarly, the `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the dedicated Toki Pona tokenizer `WordTokenizerTok`.

 ## Development

{sonatoki-0.1.2 → sonatoki-0.1.3}/README.md

@@ -6,9 +6,9 @@ This library, "Language Knowledge," helps you identify whether a message is in T

 I wrote it with a variety of scraps and lessons learned from a prior project, [ilo pi toki pona taso, "toki-pona-only tool"](https://github.com/gregdan3/ilo-pi-toki-pona-taso). That tool will be rewritten to use this library shortly.

-If you've ever worked on a similar project, you know the question "is this message in [language]" is not a consistent one- the environment, time, preferences of the speaker, and much more, can all alter whether a given message is "in
+If you've ever worked on a similar project, you know the question "is this message in [language]" is not a consistent one- the environment, time, preferences of the speaker, and much more, can all alter whether a given message is "in" any specific language, and this question applies to Toki Pona too.

-This project "solves" that complex problem by offering a highly configurable
+This project "solves" that complex problem by offering a highly configurable parser, so you can tune it to your preferences and goals.

 ## Quick Start

@@ -22,28 +22,11 @@ pdm add sonatoki
 Then get started with a script along these lines:

 ```py
-from sonatoki.Filters import (
-    Numerics,
-    Syllabic,
-    NimiLinku,
-    Alphabetic,
-    ProperName,
-    Punctuations,
-)
-from sonatoki.Scorers import SoftScaling
-from sonatoki.Cleaners import ConsecutiveDuplicates
-from sonatoki.Tokenizers import word_tokenize_tok
-from sonatoki.Preprocessors import URLs, DiscordEmotes
+from sonatoki.ilo import Ilo
+from sonatoki.Configs import PrefConfig

 def main():
-    ilo = Ilo(
-        preprocessors=[URLs, DiscordEmotes],
-        ignoring_filters=[Numerics, Punctuations],
-        scoring_filters=[NimiLinku, Syllabic, ProperName, Alphabetic],
-        cleaners=[ConsecutiveDuplicates],
-        scorer=SoftScaling,
-        tokenizer=word_tokenize_tok,
-    )
+    ilo = Ilo(**PrefConfig)
     ilo.is_toki_pona("imagine how is touch the sky") # False
     ilo.is_toki_pona("o pilin insa e ni: sina pilin e sewi") # True
     ilo.is_toki_pona("I Think I Can Evade Detection") # False
@@ -52,7 +35,30 @@ if __name__ == "__main__":
     main()
 ```

-
+Or if you'd prefer to configure on your own:
+
+```py
+from copy import deepcopy
+from sonatoki.ilo import Ilo
+from sonatoki.Configs import BaseConfig
+from sonatoki.Filters import NimiPuAle, Phonotactic, ProperName
+from sonatoki.Scorers import SoftPassFail
+
+def main():
+    config = deepcopy(BaseConfig)
+    config["scoring_filters"].extend([NimiPuAle, Phonotactic, ProperName])
+    config["scorer"] = SoftPassFail
+
+    ilo = Ilo(**config)
+    ilo.is_toki_pona("mu mu!") # True
+    ilo.is_toki_pona("mi namako e moku mi") # True
+    ilo.is_toki_pona("ma wulin") # False
+
+if __name__ == "__main__":
+    main()
+```
+
+`Ilo` is highly configurable by necessity, so I recommend looking through the premade configs in `Configs` as well as the individual `Preprocessors`, `Filters`, and `Scorers`. The `Cleaners` module only contains one cleaner, which I recommend always using. Similarly, the `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the dedicated Toki Pona tokenizer `WordTokenizerTok`.

 ## Development

sonatoki-0.1.3/src/sonatoki/Configs.py (new file)

@@ -0,0 +1,80 @@
+# STL
+from copy import deepcopy
+from typing import List, Type, TypedDict
+
+# PDM
+from typing_extensions import NotRequired
+
+# LOCAL
+from sonatoki.Filters import (
+    Filter,
+    NimiPu,
+    Numerics,
+    Syllabic,
+    NimiLinku,
+    NimiPuAle,
+    Alphabetic,
+    ProperName,
+    Phonotactic,
+    NimiLinkuAle,
+    Punctuations,
+)
+from sonatoki.Scorers import Number, Scorer, PassFail, SoftScaling, SoftPassFail
+from sonatoki.Cleaners import Cleaner, ConsecutiveDuplicates
+from sonatoki.Tokenizers import Tokenizer, WordTokenizerTok
+from sonatoki.Preprocessors import (
+    URLs,
+    Preprocessor,
+    DiscordEmotes,
+    DiscordSpecial,
+    DiscordChannels,
+    DiscordMentions,
+)
+
+
+class IloConfig(TypedDict):
+    preprocessors: List[Type[Preprocessor]]
+    word_tokenizer: Type[Tokenizer]
+    cleaners: List[Type[Cleaner]]
+    ignoring_filters: List[Type[Filter]]
+    scoring_filters: List[Type[Filter]]
+    scorer: Type[Scorer]
+    passing_score: Number
+
+
+BaseConfig: IloConfig = {
+    "preprocessors": [URLs],
+    "cleaners": [ConsecutiveDuplicates],
+    "ignoring_filters": [Numerics, Punctuations],
+    "scoring_filters": [],
+    "scorer": PassFail,
+    "passing_score": 0.8,
+    "word_tokenizer": WordTokenizerTok,
+}
+
+
+PrefConfig: IloConfig = deepcopy(BaseConfig)
+PrefConfig["scoring_filters"].extend([NimiLinku, Syllabic, ProperName, Alphabetic])
+PrefConfig["scorer"] = SoftScaling
+
+
+LazyConfig: IloConfig = deepcopy(BaseConfig)
+LazyConfig["scoring_filters"].extend([Alphabetic, ProperName])
+LazyConfig["scorer"] = SoftPassFail
+
+DiscordConfig: IloConfig = deepcopy(PrefConfig)
+DiscordConfig["preprocessors"].extend(
+    [DiscordEmotes, DiscordMentions, DiscordChannels, DiscordSpecial]
+)
+TelegramConfig: IloConfig = deepcopy(PrefConfig)
+ForumConfig: IloConfig = deepcopy(PrefConfig)
+
+__all__ = [
+    "IloConfig",
+    "BaseConfig",
+    "PrefConfig",
+    "LazyConfig",
+    "DiscordConfig",
+    "TelegramConfig",
+    "ForumConfig",
+]
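The premade configs above drop straight into `Ilo` the same way the README uses `PrefConfig`. A minimal sketch of picking one of the new ones (the sample message and emote markup are illustrative, not taken from the package's docs):

```py
from sonatoki.ilo import Ilo
from sonatoki.Configs import DiscordConfig

# DiscordConfig is PrefConfig plus the Discord-specific preprocessors,
# so emotes, mentions, channels, and special markup are stripped before scoring
ilo = Ilo(**DiscordConfig)
ilo.is_toki_pona("sina pona a <:wave:1234567890>")
```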
{sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/Filters.py

@@ -17,6 +17,7 @@ from sonatoki.constants import (
     NIMI_LINKU_SET,
     NIMI_PU_ALE_SET,
     NIMI_LINKU_ALE_SET,
+    NIMI_LINKU_SANDBOX_SET,
 )

 re.DEFAULT_VERSION = re.VERSION1

@@ -87,6 +88,10 @@ class NimiLinkuAle(SetFilter):
     tokens = NIMI_LINKU_ALE_SET


+class NimiLinkuSandbox(SetFilter):
+    tokens = NIMI_LINKU_SANDBOX_SET
+
+
 class Phonotactic(RegexFilter):
     """Determines if a given token is phonotactically valid Toki Pona (or `n`).
     Excludes both consecutive nasals and the illegal syllables:
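The new `NimiLinkuSandbox` filter is a `SetFilter` like its neighbors, so it can be appended to a config's `scoring_filters` using the copy-and-extend pattern the README shows. A sketch, assuming you want sandbox words counted as recognized vocabulary:

```py
from copy import deepcopy

from sonatoki.ilo import Ilo
from sonatoki.Configs import PrefConfig
from sonatoki.Filters import NimiLinkuSandbox

config = deepcopy(PrefConfig)
config["scoring_filters"].append(NimiLinkuSandbox)  # sandbox words now count as scoring matches
ilo = Ilo(**config)
```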
{sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/Preprocessors.py

@@ -13,7 +13,7 @@ There are currently two distinct types of Preprocessor:
 - ArrowQuote

 Order does not generally matter, but if there were two overlapping containers such as in the string "|| spoiler ` monospace || `", order would matter.
-
+It is up to the user to order them appropriately.
 """

 # STL

@@ -27,8 +27,6 @@ re.DEFAULT_VERSION = re.VERSION1


 class Preprocessor(ABC):
-    precedence: int = 0
-
     @classmethod # order matters
     @abstractmethod
     def process(cls, msg: str) -> str:

@@ -104,7 +102,6 @@ class DoubleQuotes(RegexPreprocessor):
 class Backticks(RegexPreprocessor):
     """Remove paired backticks and their contents `like this`"""

-    precedence = -10
     pattern = re.compile(r"`[^`]+`", flags=re.S)


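With `precedence` removed, ordering comes entirely from the order in which preprocessors are applied, as the updated docstring notes. A small sketch of running them by hand in a caller-chosen order (the sample string is illustrative):

```py
from sonatoki.Preprocessors import URLs, Backticks

msg = "o lukin e ni: `https://example.com/` anu https://example.com/ante"
for p in [Backticks, URLs]:  # Backticks first removes the backticked URL along with its quotes
    msg = p.process(msg)
```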
sonatoki-0.1.3/src/sonatoki/Tokenizers.py (new file)

@@ -0,0 +1,76 @@
+# STL
+from abc import ABC, abstractmethod
+from typing import List
+
+# PDM
+import regex as re
+from typing_extensions import override
+
+try:
+    # PDM
+    import nltk
+    from nltk.tokenize import sent_tokenize as __sent_tokenize_nltk
+    from nltk.tokenize import word_tokenize as __word_tokenize_nltk
+except ImportError as e:
+    nltk = e
+
+
+LANGUAGE = "english" # for NLTK
+
+
+class Tokenizer(ABC):
+    @classmethod
+    @abstractmethod
+    def tokenize(cls, s: str) -> List[str]: ...
+
+
+class NoOpTokenizer(Tokenizer):
+    """This is a special case that you do not want or need."""
+
+    @classmethod
+    @override
+    def tokenize(cls, s: str) -> List[str]:
+        return [s]
+
+
+class RegexTokenizer(Tokenizer):
+    pattern: "re.Pattern[str]"
+
+    @classmethod
+    @override
+    def tokenize(cls, s: str) -> List[str]:
+        return [clean for word in re.split(cls.pattern, s) if (clean := word.strip())]
+
+
+class WordTokenizerTok(RegexTokenizer):
+    pattern = re.compile(r"""([\p{Punctuation}\p{posix_punct}]+|\s+)""")
+    # TODO: are <> or {} that common as *sentence* delims? [] are already a stretch
+    # TODO: do the typography characters matter?
+    # NOTE: | / and , are *not* sentence delimiters for my purpose
+
+
+class SentTokenizerTok(RegexTokenizer):
+    pattern = re.compile(r"""(?<=[.?!:;·…“”"'()\[\]\-]|$)""")
+
+
+class WordTokenizerRe(RegexTokenizer):
+    pattern = re.compile(r"""(?<=[.?!;:'"-])""")
+
+
+class SentTokenizerRe(RegexTokenizer):
+    pattern = re.compile(r"""(.*?[.?!;:])|(.+?$)""")
+
+
+if not isinstance(nltk, ImportError):
+
+    class WordTokenizerNLTK(Tokenizer):
+        @classmethod
+        @override
+        def tokenize(cls, s: str) -> List[str]:
+            return __word_tokenize_nltk(text=s, language=LANGUAGE)
+
+    class SentTokenizerNLTK(Tokenizer):
+        @classmethod
+        @override
+        def tokenize(cls, s: str) -> List[str]:
+            return __sent_tokenize_nltk(text=s, language=LANGUAGE)
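The class-based tokenizers are used through their `tokenize` classmethod; `WordTokenizerTok` splits on runs of punctuation or whitespace and keeps punctuation runs as their own tokens. A sketch (the expected output is my reading of the regex, not quoted from the package's tests):

```py
from sonatoki.Tokenizers import WordTokenizerTok

tokens = WordTokenizerTok.tokenize("mi moku, li pona!")
# roughly ['mi', 'moku', ',', 'li', 'pona', '!']: whitespace is dropped, punctuation survives
```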
{sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/constants.py

@@ -4,6 +4,7 @@ from typing import Dict, List
 from pathlib import Path

 LINKU = Path(__file__).resolve().parent / Path("linku.json")
+SANDBOX = Path(__file__).resolve().parent / Path("sandbox.json")

 VOWELS = "aeiou"
 CONSONANTS = "jklmnpstw"

@@ -29,10 +30,16 @@ with open(LINKU) as f:
 ]
 NIMI_LINKU_ALE: List[str] = [d["word"] for d in r.values()]

+with open(SANDBOX) as f:
+    r: Dict[str, Dict[str, str]] = json.loads(f.read())
+    NIMI_LINKU_SANDBOX: List[str] = [d["word"] for d in r.values()]
+
+
 NIMI_PU_SET = set(NIMI_PU)
 NIMI_PU_ALE_SET = set(NIMI_PU_ALE)
 NIMI_LINKU_SET = set(NIMI_LINKU)
 NIMI_LINKU_ALE_SET = set(NIMI_LINKU_ALE)
+NIMI_LINKU_SANDBOX_SET = set(NIMI_LINKU_SANDBOX)
 ALLOWABLES_SET = set(ALLOWABLES)

 __all__ = [

@@ -54,4 +61,7 @@ __all__ = [
     #
     "NIMI_LINKU_ALE",
     "NIMI_LINKU_ALE_SET",
+    #
+    "NIMI_LINKU_SANDBOX",
+    "NIMI_LINKU_SANDBOX_SET",
 ]
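The sandbox list is loaded the same way as the main Linku list and exposed as both a list and a set. A quick membership check as a sketch (the word here is a hypothetical sandbox entry; substitute any word from `sandbox.json`):

```py
from sonatoki.constants import NIMI_LINKU_SET, NIMI_LINKU_SANDBOX_SET

word = "linluwi"  # hypothetical example of a sandbox word
print(word in NIMI_LINKU_SET, word in NIMI_LINKU_SANDBOX_SET)
```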
{sonatoki-0.1.2 → sonatoki-0.1.3}/src/sonatoki/ilo.py

@@ -14,13 +14,13 @@ LOG = logging.getLogger(__name__)

 class Ilo:
     __preprocessors: List[Type[Preprocessor]]
+    __word_tokenizer: Type[Tokenizer]
     __cleaners: List[Type[Cleaner]]
     __ignoring_filters: List[Type[Filter]]
     __scoring_filters: List[Type[Filter]]
     __scorer: Type[Scorer]
-    __tokenize: Tokenizer
     __passing_score: Number
-    logging_threshold: Number = 1
+    logging_threshold: Number = -1

     def __init__(
         self,
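Note `logging_threshold` changing from 1 to -1: scores fall between 0 and 1, so the per-message debug breakdown is now off by default. A sketch of opting back in by setting the attribute on an instance (my assumption about the intended knob, based on it being a plain class attribute):

```py
import logging

from sonatoki.ilo import Ilo
from sonatoki.Configs import PrefConfig

logging.basicConfig(level=logging.DEBUG)
ilo = Ilo(**PrefConfig)
ilo.logging_threshold = 1  # any score <= 1 triggers the LOG.debug breakdown again
ilo.is_toki_pona("ni li pona mute")
```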
@@ -29,61 +29,62 @@ class Ilo:
         ignoring_filters: List[Type[Filter]],
         scoring_filters: List[Type[Filter]],
         scorer: Type[Scorer],
-        tokenizer: Tokenizer, # NOTE: no wrapper needed?
         passing_score: Number,
+        word_tokenizer: Type[Tokenizer],
     ):
         super().__init__()
         # avoid keeping a ref to user's list just in case
         self.__preprocessors = [*preprocessors]
+        self.__word_tokenizer = word_tokenizer
         self.__cleaners = [*cleaners]
         self.__ignoring_filters = [*ignoring_filters]
         self.__scoring_filters = [*scoring_filters]
         self.__scorer = scorer
-        self.__tokenize = tokenizer
         self.__passing_score = passing_score

-    def
+    def preprocess(self, msg: str) -> str:
         for p in self.__preprocessors:
             msg = p.process(msg)
         return msg

-    def
+    def word_tokenize(self, msg: str) -> List[str]:
+        """It is *highly* recommended that you run `ilo.preprocess` first."""
+        return self.__word_tokenizer.tokenize(msg)
+
+    def clean_token(self, token: str) -> str:
         for c in self.__cleaners:
             token = c.clean(token)
         return token

-    def
-        # NOTE: tested, making a new list with a for loop *is* faster than
-        # -
-        # - generator comps
-        # - in-place replacement/removal
-        # - in place replacement with result of generator comp
+    def clean_tokens(self, tokens: List[str]) -> List[str]:
+        # NOTE: tested, making a new list with a for loop *is* faster than:
+        # list comp, generator comp, in-place replacement
         cleaned_tokens: List[str] = list()
         for token in tokens:
-            cleaned_token = self.
+            cleaned_token = self.clean_token(token)
             if not cleaned_token:
                 # TODO: warn user?
                 continue
             cleaned_tokens.append(cleaned_token)
         return cleaned_tokens

-    def
+    def _filter_token(self, token: str) -> bool:
         for f in self.__ignoring_filters:
             if f.filter(token):
                 return True
         return False

-    def
+    def filter_tokens(self, tokens: List[str]) -> List[str]:
         filtered_tokens: List[str] = []
         for token in tokens:
-            if self.
+            if self._filter_token(token):
                 continue
             # the ignoring filter is true if the token matches
             # the user wants to ignore these so keep non-matching tokens
             filtered_tokens.append(token)
         return filtered_tokens

-    def
+    def score_tokens(self, tokens: List[str]) -> float:
         return self.__scorer.score(tokens, self.__scoring_filters)

     def _is_toki_pona(
@@ -95,26 +96,25 @@
         - Filtered message (list[str])
         - Cleaned message (list[str])
         - Score (float)
-        - Result (bool)
-
-
-
-
-
-        score = self.__score_tokens(cleaned)
+        - Result (bool)"""
+        preprocessed = self.preprocess(message)
+        tokenized = self.word_tokenize(preprocessed)
+        filtered = self.filter_tokens(tokenized)
+        cleaned = self.clean_tokens(filtered)
+        score = self.score_tokens(cleaned)
         result = score >= self.__passing_score

-        # NOTE: this method may break if above funcs start sharing a list
         if score <= self.logging_threshold:
-            LOG.debug("
-            LOG.debug("
-            LOG.debug("
-            LOG.debug("
-            LOG.debug("
+            LOG.debug("msg: %.2f %s", score, repr(message))
+            LOG.debug("preproc: %s", repr(preprocessed))
+            LOG.debug("tokenized: %s", tokenized)
+            LOG.debug("filtered: %s", filtered)
+            LOG.debug("cleaned: %s", cleaned)
         # TODO: Move to each function? Loses ability to control when logging occurs by threshold

         return preprocessed, tokenized, filtered, cleaned, score, result

     def is_toki_pona(self, message: str) -> bool:
+        """Determines whether a single statement is or is not Toki Pona."""
         *_, result = self._is_toki_pona(message)
         return result
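Since the pipeline helpers are now public (`preprocess`, `word_tokenize`, `filter_tokens`, `clean_tokens`, `score_tokens`), the stages of `_is_toki_pona` can also be run one at a time. A sketch in the same order the method itself uses (the sample message is illustrative):

```py
from sonatoki.ilo import Ilo
from sonatoki.Configs import PrefConfig

ilo = Ilo(**PrefConfig)
msg = "ona li pona tawa mi https://example.com/"

preprocessed = ilo.preprocess(msg)           # PrefConfig's URLs preprocessor strips the link
tokenized = ilo.word_tokenize(preprocessed)  # run preprocess first, as the docstring recommends
filtered = ilo.filter_tokens(tokenized)      # drop tokens matched by the ignoring filters
cleaned = ilo.clean_tokens(filtered)         # apply the config's cleaners (ConsecutiveDuplicates)
score = ilo.score_tokens(cleaned)            # SoftScaling under PrefConfig
print(score >= 0.8)                          # PrefConfig keeps BaseConfig's passing_score of 0.8
```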