PyPI - SqueakyCleanText - Versions diffs - 0.2.4__tar.gz → 0.2.5__tar.gz - Mend

SqueakyCleanText 0.2.4__tar.gz → 0.2.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

{SqueakyCleanText-0.2.4/SqueakyCleanText.egg-info → SqueakyCleanText-0.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: SqueakyCleanText
-Version: 0.2.4
+Version: 0.2.5
 Summary: A comprehensive text cleaning and preprocessing pipeline.
 Home-page: https://github.com/rhnfzl/SqueakyCleanText
 Author: Rehan Fazal
@@ -156,50 +156,53 @@ You can modify the package’s functionality by changing settings in the configu
     Similarly, other aspects of the configuration can be changed. Simply modify the settings before initializing TextCleaner(). Below is the full list of configurable settings:
     ```python
+        from sct import sct, config
         # In case Language detection is not required as well as No NER and No Statistical Model stopwords are needed,
         # then only CHECK_DETECT_LANGUAGE will be considered False.
-        CHECK_DETECT_LANGUAGE = True
-        CHECK_FIX_BAD_UNICODE = True
-        CHECK_TO_ASCII_UNICODE = True
-        CHECK_REPLACE_HTML = True
-        CHECK_REPLACE_URLS = True
-        CHECK_REPLACE_EMAILS = True
-        CHECK_REPLACE_YEARS = True
-        CHECK_REPLACE_PHONE_NUMBERS = True
-        CHECK_REPLACE_NUMBERS = True
-        CHECK_REPLACE_CURRENCY_SYMBOLS = True
-        CHECK_NER_PROCESS = True
-        CHECK_REMOVE_ISOLATED_LETTERS = True
-        CHECK_REMOVE_ISOLATED_SPECIAL_SYMBOLS = True
-        CHECK_NORMALIZE_WHITESPACE = True
-        CHECK_STATISTICAL_MODEL_PROCESSING = True
-        CHECK_CASEFOLD = True
-        CHECK_REMOVE_STOPWORDS = True
-        CHECK_REMOVE_PUNCTUATION = True
-        CHECK_REMOVE_SCT_CUSTOM_STOP_WORDS = True
+        config.CHECK_DETECT_LANGUAGE = True
+        config.CHECK_FIX_BAD_UNICODE = True
+        config.CHECK_TO_ASCII_UNICODE = True
+        config.CHECK_REPLACE_HTML = True
+        config.CHECK_REPLACE_URLS = True
+        config.CHECK_REPLACE_EMAILS = True
+        config.CHECK_REPLACE_YEARS = True
+        config.CHECK_REPLACE_PHONE_NUMBERS = True
+        config.CHECK_REPLACE_NUMBERS = True
+        config.CHECK_REPLACE_CURRENCY_SYMBOLS = True
+        config.CHECK_NER_PROCESS = True
+        config.CHECK_REMOVE_ISOLATED_LETTERS = True
+        config.CHECK_REMOVE_ISOLATED_SPECIAL_SYMBOLS = True
+        config.CHECK_NORMALIZE_WHITESPACE = True
+        config.CHECK_STATISTICAL_MODEL_PROCESSING = True
+        config.CHECK_CASEFOLD = True
+        config.CHECK_REMOVE_STOPWORDS = True
+        config.CHECK_REMOVE_PUNCTUATION = True
+        config.CHECK_REMOVE_SCT_CUSTOM_STOP_WORDS = True
         # Tags can be replaced if needed, like if no special tags are necessary "" can be passed
-        REPLACE_WITH_URL = "<URL>"
-        REPLACE_WITH_HTML = "<HTML>"
-        REPLACE_WITH_EMAIL = "<EMAIL>"
-        REPLACE_WITH_YEARS = "<YEAR>"
-        REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"
-        REPLACE_WITH_NUMBERS = "<NUMBER>"
-        REPLACE_WITH_CURRENCY_SYMBOLS = None
+        config.REPLACE_WITH_URL = "<URL>"
+        config.REPLACE_WITH_HTML = "<HTML>"
+        config.REPLACE_WITH_EMAIL = "<EMAIL>"
+        config.REPLACE_WITH_YEARS = "<YEAR>"
+        config.REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"
+        config.REPLACE_WITH_NUMBERS = "<NUMBER>"
+        config.REPLACE_WITH_CURRENCY_SYMBOLS = None
         # You can remove any of the tags
-        POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
-        NER_CONFIDENCE_THRESHOLD = 0.85
+        config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
+        config.NER_CONFIDENCE_THRESHOLD = 0.85
         # Pass it as ENGLISH, DUTCH, GERMAN etc. if you know the language of text beforehand.
-        LANGUAGE = None
+        config.LANGUAGE = None
         # Order of the model is Important: English Model, Dutch Model, German Model, Spanish Model, MULTILINGUAL Model
         # All models passed need to support transformers AutoModel
-        NER_MODELS_LIST = [
+        config.NER_MODELS_LIST = [
             "FacebookAI/xlm-roberta-large-finetuned-conll03-english",
             "FacebookAI/xlm-roberta-large-finetuned-conll02-dutch",
             "FacebookAI/xlm-roberta-large-finetuned-conll03-german",
             "FacebookAI/xlm-roberta-large-finetuned-conll02-spanish",
             "Babelscape/wikineural-multilingual-ner"
         ]
+        sx = sct.TextCleaner()
     ```
 ## API

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/README.md RENAMED

@@ -132,50 +132,53 @@ You can modify the package’s functionality by changing settings in the configu
     Similarly, other aspects of the configuration can be changed. Simply modify the settings before initializing TextCleaner(). Below is the full list of configurable settings:
     ```python
+        from sct import sct, config
         # In case Language detection is not required as well as No NER and No Statistical Model stopwords are needed,
         # then only CHECK_DETECT_LANGUAGE will be considered False.
-        CHECK_DETECT_LANGUAGE = True
-        CHECK_FIX_BAD_UNICODE = True
-        CHECK_TO_ASCII_UNICODE = True
-        CHECK_REPLACE_HTML = True
-        CHECK_REPLACE_URLS = True
-        CHECK_REPLACE_EMAILS = True
-        CHECK_REPLACE_YEARS = True
-        CHECK_REPLACE_PHONE_NUMBERS = True
-        CHECK_REPLACE_NUMBERS = True
-        CHECK_REPLACE_CURRENCY_SYMBOLS = True
-        CHECK_NER_PROCESS = True
-        CHECK_REMOVE_ISOLATED_LETTERS = True
-        CHECK_REMOVE_ISOLATED_SPECIAL_SYMBOLS = True
-        CHECK_NORMALIZE_WHITESPACE = True
-        CHECK_STATISTICAL_MODEL_PROCESSING = True
-        CHECK_CASEFOLD = True
-        CHECK_REMOVE_STOPWORDS = True
-        CHECK_REMOVE_PUNCTUATION = True
-        CHECK_REMOVE_SCT_CUSTOM_STOP_WORDS = True
+        config.CHECK_DETECT_LANGUAGE = True
+        config.CHECK_FIX_BAD_UNICODE = True
+        config.CHECK_TO_ASCII_UNICODE = True
+        config.CHECK_REPLACE_HTML = True
+        config.CHECK_REPLACE_URLS = True
+        config.CHECK_REPLACE_EMAILS = True
+        config.CHECK_REPLACE_YEARS = True
+        config.CHECK_REPLACE_PHONE_NUMBERS = True
+        config.CHECK_REPLACE_NUMBERS = True
+        config.CHECK_REPLACE_CURRENCY_SYMBOLS = True
+        config.CHECK_NER_PROCESS = True
+        config.CHECK_REMOVE_ISOLATED_LETTERS = True
+        config.CHECK_REMOVE_ISOLATED_SPECIAL_SYMBOLS = True
+        config.CHECK_NORMALIZE_WHITESPACE = True
+        config.CHECK_STATISTICAL_MODEL_PROCESSING = True
+        config.CHECK_CASEFOLD = True
+        config.CHECK_REMOVE_STOPWORDS = True
+        config.CHECK_REMOVE_PUNCTUATION = True
+        config.CHECK_REMOVE_SCT_CUSTOM_STOP_WORDS = True
         # Tags can be replaced if needed, like if no special tags are necessary "" can be passed
-        REPLACE_WITH_URL = "<URL>"
-        REPLACE_WITH_HTML = "<HTML>"
-        REPLACE_WITH_EMAIL = "<EMAIL>"
-        REPLACE_WITH_YEARS = "<YEAR>"
-        REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"
-        REPLACE_WITH_NUMBERS = "<NUMBER>"
-        REPLACE_WITH_CURRENCY_SYMBOLS = None
+        config.REPLACE_WITH_URL = "<URL>"
+        config.REPLACE_WITH_HTML = "<HTML>"
+        config.REPLACE_WITH_EMAIL = "<EMAIL>"
+        config.REPLACE_WITH_YEARS = "<YEAR>"
+        config.REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"
+        config.REPLACE_WITH_NUMBERS = "<NUMBER>"
+        config.REPLACE_WITH_CURRENCY_SYMBOLS = None
         # You can remove any of the tags
-        POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
-        NER_CONFIDENCE_THRESHOLD = 0.85
+        config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
+        config.NER_CONFIDENCE_THRESHOLD = 0.85
         # Pass it as ENGLISH, DUTCH, GERMAN etc. if you know the language of text beforehand.
-        LANGUAGE = None
+        config.LANGUAGE = None
         # Order of the model is Important: English Model, Dutch Model, German Model, Spanish Model, MULTILINGUAL Model
         # All models passed need to support transformers AutoModel
-        NER_MODELS_LIST = [
+        config.NER_MODELS_LIST = [
             "FacebookAI/xlm-roberta-large-finetuned-conll03-english",
             "FacebookAI/xlm-roberta-large-finetuned-conll02-dutch",
             "FacebookAI/xlm-roberta-large-finetuned-conll03-german",
             "FacebookAI/xlm-roberta-large-finetuned-conll02-spanish",
             "Babelscape/wikineural-multilingual-ner"
         ]
+        sx = sct.TextCleaner()
     ```
 ## API
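
Note on the README change above: the 0.2.4 snippet assigned bare names such as `CHECK_NER_PROCESS = True`, which only creates a variable in the caller's own script, while the 0.2.5 snippet assigns through the imported `config` module. The sketch below illustrates that difference with stand-ins. `Cleaner` and the stand-in `config` object are hypothetical and are not the package's real code, and the sketch assumes settings are read when the cleaner is constructed, as the README's "before initializing" wording suggests.

```python
import types

# Hypothetical stand-in for the package's `sct.config` module (not the real one).
config = types.ModuleType("config")
config.CHECK_NER_PROCESS = True
config.REPLACE_WITH_URL = "<URL>"


class Cleaner:
    """Hypothetical stand-in for TextCleaner: reads settings at construction."""

    def __init__(self):
        self.ner_enabled = config.CHECK_NER_PROCESS
        self.url_tag = config.REPLACE_WITH_URL


# 0.2.4-style snippet: a bare assignment only creates a name in this script,
# so the shared setting on the module is left untouched.
CHECK_NER_PROCESS = False
before = Cleaner()

# 0.2.5-style snippet: assigning through the module changes the shared setting.
config.CHECK_NER_PROCESS = False
config.REPLACE_WITH_URL = ""
after = Cleaner()

print(before.ner_enabled, after.ner_enabled)  # True False
```

With the real package, the equivalent is the pattern the updated README shows: `from sct import sct, config`, set `config.<SETTING>` values, then call `sct.TextCleaner()`.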

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5/SqueakyCleanText.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: SqueakyCleanText
-Version: 0.2.4
+Version: 0.2.5
 Summary: A comprehensive text cleaning and preprocessing pipeline.
 Home-page: https://github.com/rhnfzl/SqueakyCleanText
 Author: Rehan Fazal
@@ -156,50 +156,53 @@ You can modify the package’s functionality by changing settings in the configu
     Similarly, other aspects of the configuration can be changed. Simply modify the settings before initializing TextCleaner(). Below is the full list of configurable settings:
     ```python
+        from sct import sct, config
         # In case Language detection is not required as well as No NER and No Statistical Model stopwords are needed,
         # then only CHECK_DETECT_LANGUAGE will be considered False.
-        CHECK_DETECT_LANGUAGE = True
-        CHECK_FIX_BAD_UNICODE = True
-        CHECK_TO_ASCII_UNICODE = True
-        CHECK_REPLACE_HTML = True
-        CHECK_REPLACE_URLS = True
-        CHECK_REPLACE_EMAILS = True
-        CHECK_REPLACE_YEARS = True
-        CHECK_REPLACE_PHONE_NUMBERS = True
-        CHECK_REPLACE_NUMBERS = True
-        CHECK_REPLACE_CURRENCY_SYMBOLS = True
-        CHECK_NER_PROCESS = True
-        CHECK_REMOVE_ISOLATED_LETTERS = True
-        CHECK_REMOVE_ISOLATED_SPECIAL_SYMBOLS = True
-        CHECK_NORMALIZE_WHITESPACE = True
-        CHECK_STATISTICAL_MODEL_PROCESSING = True
-        CHECK_CASEFOLD = True
-        CHECK_REMOVE_STOPWORDS = True
-        CHECK_REMOVE_PUNCTUATION = True
-        CHECK_REMOVE_SCT_CUSTOM_STOP_WORDS = True
+        config.CHECK_DETECT_LANGUAGE = True
+        config.CHECK_FIX_BAD_UNICODE = True
+        config.CHECK_TO_ASCII_UNICODE = True
+        config.CHECK_REPLACE_HTML = True
+        config.CHECK_REPLACE_URLS = True
+        config.CHECK_REPLACE_EMAILS = True
+        config.CHECK_REPLACE_YEARS = True
+        config.CHECK_REPLACE_PHONE_NUMBERS = True
+        config.CHECK_REPLACE_NUMBERS = True
+        config.CHECK_REPLACE_CURRENCY_SYMBOLS = True
+        config.CHECK_NER_PROCESS = True
+        config.CHECK_REMOVE_ISOLATED_LETTERS = True
+        config.CHECK_REMOVE_ISOLATED_SPECIAL_SYMBOLS = True
+        config.CHECK_NORMALIZE_WHITESPACE = True
+        config.CHECK_STATISTICAL_MODEL_PROCESSING = True
+        config.CHECK_CASEFOLD = True
+        config.CHECK_REMOVE_STOPWORDS = True
+        config.CHECK_REMOVE_PUNCTUATION = True
+        config.CHECK_REMOVE_SCT_CUSTOM_STOP_WORDS = True
         # Tags can be replaced if needed, like if no special tags are necessary "" can be passed
-        REPLACE_WITH_URL = "<URL>"
-        REPLACE_WITH_HTML = "<HTML>"
-        REPLACE_WITH_EMAIL = "<EMAIL>"
-        REPLACE_WITH_YEARS = "<YEAR>"
-        REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"
-        REPLACE_WITH_NUMBERS = "<NUMBER>"
-        REPLACE_WITH_CURRENCY_SYMBOLS = None
+        config.REPLACE_WITH_URL = "<URL>"
+        config.REPLACE_WITH_HTML = "<HTML>"
+        config.REPLACE_WITH_EMAIL = "<EMAIL>"
+        config.REPLACE_WITH_YEARS = "<YEAR>"
+        config.REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"
+        config.REPLACE_WITH_NUMBERS = "<NUMBER>"
+        config.REPLACE_WITH_CURRENCY_SYMBOLS = None
         # You can remove any of the tags
-        POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
-        NER_CONFIDENCE_THRESHOLD = 0.85
+        config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
+        config.NER_CONFIDENCE_THRESHOLD = 0.85
         # Pass it as ENGLISH, DUTCH, GERMAN etc. if you know the language of text beforehand.
-        LANGUAGE = None
+        config.LANGUAGE = None
         # Order of the model is Important: English Model, Dutch Model, German Model, Spanish Model, MULTILINGUAL Model
         # All models passed need to support transformers AutoModel
-        NER_MODELS_LIST = [
+        config.NER_MODELS_LIST = [
             "FacebookAI/xlm-roberta-large-finetuned-conll03-english",
             "FacebookAI/xlm-roberta-large-finetuned-conll02-dutch",
             "FacebookAI/xlm-roberta-large-finetuned-conll03-german",
             "FacebookAI/xlm-roberta-large-finetuned-conll02-spanish",
             "Babelscape/wikineural-multilingual-ner"
         ]
+        sx = sct.TextCleaner()
     ```
 ## API

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/ner.py RENAMED

@@ -11,6 +11,7 @@ from presidio_anonymizer.entities import RecognizerResult
 from sct.utils import constants
 from sct import config
+from typing import List, Dict, Any  # Add this line
 transformers.logging.set_verbosity_error()
 class GeneralNER:

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/setup.py RENAMED

@@ -2,7 +2,7 @@ from setuptools import setup, find_packages
 setup(
     name='SqueakyCleanText',
-    version='0.2.4',
+    version='0.2.5',
     author='Rehan Fazal',
     description='A comprehensive text cleaning and preprocessing pipeline.',
     long_description=open('README.md', encoding='utf-8').read(),

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/LICENSE RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/MANIFEST.in RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/SqueakyCleanText.egg-info/SOURCES.txt RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/SqueakyCleanText.egg-info/dependency_links.txt RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/SqueakyCleanText.egg-info/entry_points.txt RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/SqueakyCleanText.egg-info/requires.txt RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/SqueakyCleanText.egg-info/top_level.txt RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/__init__.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/config.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/scripts/__init__.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/scripts/download_nltk_stopwords.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/sct.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/__init__.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/constants.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/contact.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/datetime.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/normtext.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/resources.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/special.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/sct/utils/stopwords.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/setup.cfg RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/tests/__init__.py RENAMED

File without changes

{SqueakyCleanText-0.2.4 → SqueakyCleanText-0.2.5}/tests/test_sct.py RENAMED

File without changes
