purrfectkit 0.2.1__py3-none-any.whl → 0.2.2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {purrfectkit-0.2.1.dist-info → purrfectkit-0.2.2.dist-info}/METADATA +20 -8
- purrfectkit-0.2.2.dist-info/RECORD +25 -0
- {purrfectkit-0.2.1.dist-info → purrfectkit-0.2.2.dist-info}/WHEEL +1 -1
- purrfectmeow/__init__.py +1 -1
- purrfectmeow/meow/chaus.py +20 -0
- purrfectmeow/meow/felis.py +33 -45
- purrfectmeow/meow/kitty.py +7 -7
- purrfectmeow/tc01_spl/base.py +7 -4
- purrfectmeow/tc01_spl/markdown.py +18 -11
- purrfectmeow/tc01_spl/ocr.py +36 -28
- purrfectmeow/tc01_spl/simple.py +30 -19
- purrfectmeow/tc02_mlt/base.py +19 -10
- purrfectmeow/tc02_mlt/separate.py +7 -4
- purrfectmeow/tc02_mlt/token.py +11 -19
- purrfectmeow/tc03_wcm/local.py +9 -7
- purrfectmeow/tc04_kmn/base.py +15 -5
- purrfectmeow/tc04_kmn/cosine.py +13 -14
- purrfectmeow/tc05_knj/base.py +2 -4
- purrfectkit-0.2.1.dist-info/RECORD +0 -24
- {purrfectkit-0.2.1.dist-info → purrfectkit-0.2.2.dist-info}/licenses/LICENSE +0 -0
{purrfectkit-0.2.1.dist-info → purrfectkit-0.2.2.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
Metadata-Version: 2.4
Name: purrfectkit
-Version: 0.2.1
+Version: 0.2.2
Summary: **PurrfectKit** is a Python library for effortless Retrieval-Augmented Generation (RAG) workflows.
Keywords: rag,nlp,llms,python,ai,ocr,document-processing,multilingual,text-extraction
Author: SUWALUTIONS
@@ -25,7 +25,7 @@ Classifier: Natural Language :: English
Classifier: Natural Language :: Thai
Requires-Dist: python-magic<=0.4.27
Requires-Dist: sentence-transformers<=5.1.0
-Requires-Dist: transformers<=4.
+Requires-Dist: transformers<=4.53.0
Requires-Dist: docling<=2.31.1
Requires-Dist: markitdown<=0.1.1
Requires-Dist: pymupdf4llm<=0.0.27
@@ -37,9 +37,15 @@ Requires-Dist: python-doctr<=1.0.0
Requires-Dist: pandas<=2.3.2
Requires-Dist: langchain-text-splitters<=1.0.0
Requires-Dist: tiktoken<=0.12.0
+Requires-Dist: ruff<=0.6.0 ; extra == 'dev'
+Requires-Dist: mypy<=1.11.0 ; extra == 'dev'
+Requires-Dist: pre-commit<=3.8.0 ; extra == 'dev'
+Requires-Dist: detect-secrets<=1.5.0 ; extra == 'dev'
+Requires-Dist: codecov-cli<=11.2.4 ; extra == 'dev'
Requires-Dist: sphinx<=8.2.3 ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme<=3.0.2 ; extra == 'docs'
Requires-Dist: pytest<=8.4.2 ; extra == 'test'
+Requires-Dist: pytest-cov<=7.0.0 ; extra == 'test'
Requires-Dist: pytest-mock<=3.15.1 ; extra == 'test'
Maintainer: KHARAPSY
Maintainer-email: KHARAPSY <kharapsy@suwalutions.com>
@@ -52,11 +58,17 @@ Provides-Extra: docs
Provides-Extra: test
Description-Content-Type: text/markdown

-
+

# PurrfectKit

-[](https://www.python.org)
+[](https://pypi.org/project/purrfectkit/)
+[](https://pypistats.org/packages/purrfectkit)
+[](https://codecov.io/github/suwalutions/PurrfectKit)
+[](https://github.com/astral-sh/ruff)
+[](https://ghcr.io/suwalutions/purrfectkit)
+[](LICENSE)

**PurrfectKit** is a toolkit that simplifies Retrieval-Augmented Generation (RAG) into 5 easy steps:
1. Suphalak - read content from files
@@ -72,12 +84,11 @@ Description-Content-Type: text/markdown
### Prerequisites
- python
- tesseract
-- git


### Installation
```bash
-pip install
+pip install purrfectkit

```

@@ -88,7 +99,8 @@ from purrfectmeow import Suphalak, Malet, WichienMaat, KhaoManee

file_path = 'test/test.pdf'
metadata = MetaFile.get_metadata(file_path)
-
+with open(file_path, 'rb') as f:
+    content = Suphalak.reading(f, 'test.pdf')
chunks = Malet.chunking(content, chunk_method='token', chunk_size='500', chunk_overlap='25')
docs = DocTemplate.create_template(chunks, metadata)
embedding = WichienMaat.embedding(chunks)
@@ -97,6 +109,6 @@ KhaoManee.searching(query, embedding, docs, 2)

```

-##
+## License

PurrfectKit is released under the [MIT License](LICENSE).
purrfectkit-0.2.2.dist-info/RECORD
ADDED
@@ -0,0 +1,25 @@
+purrfectmeow/__init__.py,sha256=t6Bq_9cB4lsYL3gBoAnSMubeITCqqoYzcC1I8wkD8QY,271
+purrfectmeow/meow/chaus.py,sha256=PG95kQaMgIqQdd2MFUxCGUpz1-8yq7FrY1G9Imz7BEA,402
+purrfectmeow/meow/felis.py,sha256=sIz4kjyH-Y1Qfzy-NLcUkSx5tiairFJmWjbmutGq8YM,5844
+purrfectmeow/meow/kitty.py,sha256=ygbG8L29XwzC9CCz5BoZg5wKuWEWENRPHUPEhRwYSMY,2047
+purrfectmeow/tc01_spl/__init__.py,sha256=7ENCidvXhj9YhMQvBcv_mm4XIr3Mwzc1USQxgzLO0Nw,51
+purrfectmeow/tc01_spl/base.py,sha256=OUZy8u7avz1nlJ9hKVyFYeVkloSagGPW01O_zxyiLwI,3333
+purrfectmeow/tc01_spl/markdown.py,sha256=WEyO8zjXgNJnb082dmvb9lpzJ8cqyOhdJV-Tos8SzPA,2027
+purrfectmeow/tc01_spl/ocr.py,sha256=pWRd3C5K53SyTS8J1QXBAAg_ldtJSVDReyD3nKPkcCQ,4878
+purrfectmeow/tc01_spl/simple.py,sha256=Am1lnuj9QLu5g78HU1QpJk1OjYg-cvetpeaebLJd8z4,2744
+purrfectmeow/tc02_mlt/__init__.py,sha256=qB2Eyc_wFDVELwj0L7ttG_YOL3IISaqPBRj0zqSJcPo,45
+purrfectmeow/tc02_mlt/base.py,sha256=FC_0FiVYd7D8MkpCYdEHlDlOuxEqiOr1T8xdULWGhL4,1635
+purrfectmeow/tc02_mlt/separate.py,sha256=xUM2-qGF9psgJBbbJTLgFsiXuE3ftd4dTNhcVdBuXHE,1134
+purrfectmeow/tc02_mlt/token.py,sha256=siGciepOFgHtaUzTQbaGPmd-Bl7fN2B6Cp6zlOai8Sk,1931
+purrfectmeow/tc03_wcm/__init__.py,sha256=8pXGo04Z5KUNGkhSTONLBlqwVc43LicDGSuQiQDIKIM,57
+purrfectmeow/tc03_wcm/base.py,sha256=pXaaiU8JMLIjI5uJRxMLRnQ1Wmwv3U6EEkQ_IwhPLwg,473
+purrfectmeow/tc03_wcm/local.py,sha256=gfqXqAEDoozhi0EHnDXNLOlWZPzFE9RTeaHjGNVAFQI,1109
+purrfectmeow/tc04_kmn/__init__.py,sha256=FBHZKVu4agf6-p1MdMx0jIgQuKbAy9rsOu7MRIQVwXg,53
+purrfectmeow/tc04_kmn/base.py,sha256=InNetlSjwP9Need94IYvNrSmRYWgcD59KWb6NBrQCkk,482
+purrfectmeow/tc04_kmn/cosine.py,sha256=_zAvnnDH6N0Urz-rScRHxM7umandMODbddzCfTfIwh4,1225
+purrfectmeow/tc05_knj/__init__.py,sha256=XKwISvOAznPdTUWoTUnFDMBmxZF9Qd6FAi711W6bvZY,47
+purrfectmeow/tc05_knj/base.py,sha256=9itMmUvSYAI7G8DdM2H7GyTRC2LEXOsBc1QZf6HiImU,77
+purrfectkit-0.2.2.dist-info/licenses/LICENSE,sha256=9WlLgfJwKDGb71B1NwKYKKg6uL5u_knAr7ovGwIWvD4,1078
+purrfectkit-0.2.2.dist-info/WHEEL,sha256=DpNsHFUm_gffZe1FgzmqwuqiuPC6Y-uBCzibcJcdupM,78
+purrfectkit-0.2.2.dist-info/METADATA,sha256=4CJdWyS8cZVy7M3BdE7_XDW2kOz5OzT9OTSfg3eSh8c,4721
+purrfectkit-0.2.2.dist-info/RECORD,,
purrfectmeow/__init__.py
CHANGED
purrfectmeow/meow/chaus.py
ADDED
@@ -0,0 +1,20 @@
+from typing import TypedDict
+
+from .felis import Document
+
+
+class FileMetadata(TypedDict, total=False):
+    file_name: str
+    file_size: int
+    file_created_date: str
+    file_modified_date: str
+    file_extension: str
+    file_type: str
+    description: str
+    total_pages: int | str
+    file_md5: str
+
+
+class SimilarityResult(TypedDict, total=False):
+    score: float | str
+    document: Document
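For orientation, a minimal sketch of how the two new TypedDicts compose with `Document`; it is not part of the diff, and the literal values are invented for illustration.

```python
from purrfectmeow.meow.chaus import FileMetadata, SimilarityResult
from purrfectmeow.meow.felis import Document

# Both TypedDicts use total=False, so any subset of keys is acceptable.
meta: FileMetadata = {
    "file_name": "test.pdf",
    "file_size": 10_240,
    "file_type": "application/pdf",
    "total_pages": 3,
}

hit: SimilarityResult = {
    "score": 0.87,
    "document": Document(page_content="hello", metadata={"source_info": meta}),
}
print(hit["score"], hit["document"].page_content)
```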
purrfectmeow/meow/felis.py
CHANGED
@@ -1,15 +1,18 @@
-from typing import Any, Dict, List, Union
from io import BytesIO
+from typing import Any
+
+from .chaus import FileMetadata
+

class Document:
-    def __init__(self, page_content: str, metadata:
+    def __init__(self, page_content: str, metadata: dict[str, Any]) -> None:
        self.page_content = page_content
        self.metadata = metadata or {}

-    def __repr__(self):
+    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(page_content={self.page_content!r}, metadata={self.metadata!r})"

-    def __getitem__(self, key):
+    def __getitem__(self, key: str) -> Any:
        if key == "page_content":
            return self.page_content
        elif key == "metadata":
@@ -17,31 +20,29 @@ class Document:
        else:
            raise KeyError(f"{key} is not a valid key. Use 'page_content' or 'metadata'.")

-    def to_dict(self):
-        return {
-
-            "metadata": self.metadata
-        }
+    def to_dict(self) -> dict[str, Any]:
+        return {"page_content": self.page_content, "metadata": self.metadata}
+

class DocTemplate:
    @staticmethod
-    def create_template(chunks:
+    def create_template(chunks: list[str], metadata: dict[str, Any]) -> list[Document]:
        if not isinstance(chunks, list):
            raise TypeError(f"Expected 'chunks' to be a list, but got {type(chunks).__name__}.")

        if not isinstance(metadata, dict):
            raise TypeError(f"Expected 'metadata' to be a dict, but got {type(metadata).__name__}.")
-
+
        if not all(isinstance(c, str) for c in chunks):
            raise ValueError("All elements in 'chunks' must be strings.")

        docs = []
        chunk_hashes = []

-        import uuid
        import hashlib
+        import uuid

-        for
+        for _, chunk in enumerate(chunks):
            hash_val = hashlib.md5(chunk.encode()).hexdigest()
            chunk_hashes.append(hash_val)

@@ -62,22 +63,17 @@ class DocTemplate:
                "chunk_size": chunk_size,
            }

-            doc_metadata = {
-                "chunk_info": chunk_info,
-                "source_info": metadata
-            }
+            doc_metadata = {"chunk_info": chunk_info, "source_info": metadata}

-            doc = Document(
-                page_content=chunk,
-                metadata=doc_metadata
-            )
+            doc = Document(page_content=chunk, metadata=doc_metadata)
            docs.append(doc)

        return docs

+
class MetaFile:
    @staticmethod
-    def get_metadata(file:
+    def get_metadata(file: str | BytesIO, **kwargs: Any) -> FileMetadata:
        if isinstance(file, bytes):
            file = BytesIO(file)

@@ -85,13 +81,13 @@ class MetaFile:
            import os

            os.makedirs(".cache/tmp", exist_ok=True)
-            file_name = kwargs.get(
+            file_name = kwargs.get("file_name")

            if not file_name:
                raise ValueError("file_name must be provided when using BytesIO.")
-
+
            file_path = os.path.join(".cache/tmp", file_name)
-            with open(file_path,
+            with open(file_path, "wb") as f:
                f.write(file.getvalue())

            try:
@@ -101,21 +97,22 @@ class MetaFile:

        elif isinstance(file, str):
            return MetaFile._get_metadata_from_path(file)
-
+
        else:
            raise TypeError(f"Unsupported file type: {type(file).__name__}. Expected str, bytes, or BytesIO.")

    @staticmethod
-    def _get_metadata_from_path(file_path: str) ->
-        metadata = {}
-
+    def _get_metadata_from_path(file_path: str) -> FileMetadata:
+        metadata: FileMetadata = {}
+
+        import hashlib
        import os
        import re
+        import subprocess
        import time
+
        import magic
-
-        import subprocess
-
+
        try:
            if not os.path.exists(file_path):
                raise FileNotFoundError(f"File {file_path} does not exist")
@@ -123,12 +120,8 @@ class MetaFile:
            stats = os.stat(file_path)
            metadata["file_name"] = os.path.basename(file_path)
            metadata["file_size"] = stats.st_size
-            metadata["file_created_date"] = time.strftime(
-
-            )
-            metadata["file_modified_date"] = time.strftime(
-                '%Y-%m-%d %H:%M:%S', time.localtime(stats.st_mtime)
-            )
+            metadata["file_created_date"] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stats.st_ctime))
+            metadata["file_modified_date"] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stats.st_mtime))
            metadata["file_extension"] = os.path.splitext(file_path)[1] or "none"

            try:
@@ -143,12 +136,7 @@ class MetaFile:
                metadata["total_pages"] = 1
            elif metadata["file_type"].startswith("application/pdf"):
                try:
-                    result = subprocess.run(
-                        ['pdfinfo', file_path],
-                        stdout=subprocess.PIPE,
-                        text=True,
-                        check=True
-                    )
+                    result = subprocess.run(["pdfinfo", file_path], stdout=subprocess.PIPE, text=True, check=True)
                    pages_match = re.search(r"Pages:\s*(\d+)", result.stdout)
                    if pages_match:
                        metadata["total_pages"] = int(pages_match.group(1))
@@ -168,4 +156,4 @@ class MetaFile:
            return metadata

        except Exception as e:
-            raise RuntimeError(f"Failed to extract metadata: {
+            raise RuntimeError(f"Failed to extract metadata: {e}") from e
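A hedged usage sketch of the reworked `felis` API shown above (`MetaFile.get_metadata` on a path, then `DocTemplate.create_template`); the file path is illustrative, and metadata extraction still relies on `python-magic` and `pdfinfo` at runtime.

```python
from purrfectmeow.meow.felis import DocTemplate, MetaFile

metadata = MetaFile.get_metadata("test/test.pdf")  # path branch; BytesIO input needs file_name=...

chunks = ["first chunk of text", "second chunk of text"]
docs = DocTemplate.create_template(chunks, metadata)

for doc in docs:
    # __getitem__ accepts "page_content" / "metadata"; chunk_info is attached per chunk.
    print(doc["page_content"], doc.metadata["chunk_info"]["chunk_size"])
```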
purrfectmeow/meow/kitty.py
CHANGED
@@ -2,17 +2,19 @@ import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

+
class LevelBasedFormatter(logging.Formatter):
-    def __init__(self, default_fmt, info_fmt, datefmt=None):
+    def __init__(self, default_fmt: str, info_fmt: str, datefmt: str | None = None) -> None:
        super().__init__(datefmt=datefmt)
-        self.default_fmt = logging.Formatter(default_fmt, datefmt)
-        self.info_fmt = logging.Formatter(info_fmt, datefmt)
+        self.default_fmt: logging.Formatter = logging.Formatter(default_fmt, datefmt)
+        self.info_fmt: logging.Formatter = logging.Formatter(info_fmt, datefmt)

-    def format(self, record):
+    def format(self, record: logging.LogRecord) -> str:
        if record.levelno == logging.INFO:
            return self.info_fmt.format(record)
        return self.default_fmt.format(record)

+
def kitty_logger(name: str, log_file: str = "kitty.log", log_level: str = "INFO") -> logging.Logger:
    """
    Sets up a logger with console and rotating file handlers.
@@ -43,9 +45,7 @@ def kitty_logger(name: str, log_file: str = "kitty.log", log_level: str = "INFO"
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / log_file

-    file_handler = RotatingFileHandler(
-        log_path, maxBytes=5 * 1024 * 1024, backupCount=3
-    )
+    file_handler = RotatingFileHandler(log_path, maxBytes=5 * 1024 * 1024, backupCount=3)
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
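A small sketch of the `kitty_logger` factory whose handlers were reformatted above; the file name is illustrative.

```python
from purrfectmeow.meow.kitty import kitty_logger

# Console plus rotating file handler (5 MB, 3 backups per the diff above).
logger = kitty_logger(__name__, log_file="demo.log", log_level="DEBUG")
logger.info("pipeline started")          # INFO records use the info format
logger.debug("verbose diagnostic line")  # other levels use the default format
```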
purrfectmeow/tc01_spl/base.py
CHANGED
@@ -1,14 +1,15 @@
-from typing import
+from typing import Any, BinaryIO

from .markdown import Markdown
from .ocr import Ocr
from .simple import Simple

+
class Suphalak:
-    tmp_dir =
+    tmp_dir = ".cache/tmp"
    DEFAULT_LOADER = "PYMUPDF4LLM"

-    _LOADERS:
+    _LOADERS: dict[str, dict[str, Any]] = {
        "MARKITDOWN": {
            "func": Markdown.markitdown_convert,
            "ext": ("csv", "docx", "md", "pdf", "pptx", "txt", "xls", "xlsx"),
@@ -67,8 +68,9 @@ class Suphalak:
            return cls.DEFAULT_LOADER

    @classmethod
-    def reading(cls, file: BinaryIO, file_name: str, loader: str = None, **kwargs: Any) -> str:
+    def reading(cls, file: BinaryIO, file_name: str, loader: str | None = None, **kwargs: Any) -> str:
        import os
+
        file_ext = file_name.split(".")[-1].lower()

        if not loader:
@@ -87,6 +89,7 @@ class Suphalak:
        file_path = os.path.join(cls.tmp_dir, file_name)

        try:
+            text: str
            with open(file_path, "wb") as f:
                f.write(file.read())
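A usage sketch for `Suphalak.reading` matching the new `loader: str | None = None` signature; the path is illustrative, and the loader falls back to `DEFAULT_LOADER` ("PYMUPDF4LLM") when none is given.

```python
from purrfectmeow import Suphalak

with open("test/test.pdf", "rb") as f:
    # Override the inferred backend with loader="MARKITDOWN" if desired.
    content = Suphalak.reading(f, "test.pdf")

print(content[:200])
```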
purrfectmeow/tc01_spl/markdown.py
CHANGED
@@ -1,24 +1,25 @@
import time
-from
+from collections.abc import Callable
+from typing import Any

from purrfectmeow.meow.kitty import kitty_logger

+
class Markdown:
-
    _logger = kitty_logger(__name__)

    @classmethod
-    def _convert(cls, file_path: str, converter: Callable, extractor: Callable) -> str:
+    def _convert(cls, file_path: str, converter: Callable[[str], Any], extractor: Callable[[Any], str]) -> str:
        cls._logger.debug(f"Starting conversion for '{file_path}'")
        start = time.time()
        try:
-
-            result = extractor(
+            raw_content: Any = converter(file_path)
+            result: str = extractor(raw_content)

            cls._logger.debug(f"Succesfully converted '{file_path}'")

            return result
-
+
        finally:
            elapsed = time.time() - start
            cls._logger.debug(f"Conversion time spent '{elapsed:.2f}' seconds.")
@@ -28,16 +29,22 @@ class Markdown:
        cls._logger.debug("Using MarkItDown for Conversion")

        from markitdown import MarkItDown
-
-
+
+        mid = MarkItDown()
+
+        return cls._convert(file_path, lambda path: mid.convert(path), lambda content: content.text_content)

    @classmethod
    def docling_convert(cls, file_path: str) -> str:
        cls._logger.debug("Using Docling for Conversion")
-
+
        from docling.document_converter import DocumentConverter

-
+        dcl = DocumentConverter()
+
+        return cls._convert(
+            file_path, lambda path: dcl.convert(path).document, lambda content: content.document.export_to_markdown()
+        )

    @classmethod
    def pymupdf4llm_convert(cls, file_path: str) -> str:
@@ -48,7 +55,7 @@ class Markdown:
        import pymupdf4llm

        try:
-            res = pymupdf4llm.to_markdown(file_path)
+            res: str = pymupdf4llm.to_markdown(file_path)
            cls._logger.debug(f"Succesfully converted '{file_path}'")

            return res
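The converter/extractor pair passed to `Markdown._convert` is what changed most above; a hedged sketch of the public classmethods that wrap it (paths illustrative):

```python
from purrfectmeow.tc01_spl.markdown import Markdown

md_text = Markdown.pymupdf4llm_convert("test/test.pdf")
# md_text = Markdown.markitdown_convert("test/test.docx")
# md_text = Markdown.docling_convert("test/test.pdf")
print(md_text[:200])
```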
purrfectmeow/tc01_spl/ocr.py
CHANGED
@@ -1,26 +1,34 @@
import time
-from
+from collections.abc import Callable
+from typing import Any

from purrfectmeow.meow.kitty import kitty_logger

-class Ocr:

+class Ocr:
    _logger = kitty_logger(__name__)
    _image_type = [
-        ".apng",
+        ".apng",
+        ".png",
        ".avif",
        ".gif",
-        ".jpg",
+        ".jpg",
+        ".jpeg",
+        ".jfif",
+        ".pjpeg",
+        ".pjp",
        ".png",
        ".svg",
        ".webp",
        ".bmp",
-        ".ico",
-        ".
+        ".ico",
+        ".cur",
+        ".tif",
+        ".tiff",
    ]

    @classmethod
-    def _convert(cls, file_path: str, converter: Callable) -> str:
+    def _convert(cls, file_path: str, converter: Callable[[str], Any]) -> str:
        cls._logger.debug(f"Starting conversion for '{file_path}'")
        start = time.time()

@@ -28,7 +36,6 @@ class Ocr:
            content = []
            match file_path.lower():
                case path if path.endswith(".pdf"):
-
                    from pdf2image import convert_from_path

                    images = convert_from_path(file_path, fmt="png")
@@ -37,12 +44,11 @@ class Ocr:
                            text = converter(image)
                            cls._logger.debug(f"Text: {text}")
                            content.append(text)
-                            cls._logger.debug(f"Page {idx+1} processed")
+                            cls._logger.debug(f"Page {idx + 1} processed")
                        except Exception as e:
-                            cls._logger.exception(f"Page {idx+1} failed: {e}")
+                            cls._logger.exception(f"Page {idx + 1} failed: {e}")
                            raise
                case path if path.endswith(tuple(cls._image_type)):
-
                    from PIL import Image

                    image = Image.open(file_path)
@@ -61,41 +67,39 @@ class Ocr:
        finally:
            elasped = time.time() - start
            cls._logger.debug(f"Conversion time spent '{elasped:.2f}' seconds.")
-
+
    @classmethod
    def pytesseract_convert(cls, file_path: str) -> str:
        cls._logger.debug("Using PyTesseract for Conversion")

-        def converter(image):
+        def converter(image: str) -> Any:
            import pytesseract

            return pytesseract.image_to_string(image, lang="tha+eng")
-
+
        return cls._convert(file_path, converter)

    @classmethod
    def easyocr_convert(cls, file_path: str) -> str:
        cls._logger.debug("Using EasyOCR for Conversion")
-
-        def converter(image):
+
+        def converter(image: str) -> Any:
            import easyocr
            import numpy

-            reader = easyocr.Reader(
-                ['th', 'en'],
-                gpu=False
-            )
+            reader = easyocr.Reader(["th", "en"], gpu=False)
            res = reader.readtext(numpy.array(image))
            return "\n".join(text for _, text, _ in res)
+
        return cls._convert(file_path, converter)

    @classmethod
    def suryaocr_convert(cls, file_path: str) -> str:
        cls._logger.debug("Using SuryaOCR for Conversion")
-
-        def converter(image):
-            from surya.recognition import RecognitionPredictor
+
+        def converter(image: str) -> Any:
            from surya.detection import DetectionPredictor
+            from surya.recognition import RecognitionPredictor

            rec_pred = RecognitionPredictor()
            det_pred = DetectionPredictor()
@@ -107,20 +111,23 @@ class Ocr:
                recognition_batch_size=1,
            )
            return "\n".join(line.text for line in prediction[0].text_lines)
+
        return cls._convert(file_path, converter)

    @classmethod
    def doctr_convert(cls, file_path: str) -> str:
        cls._logger.debug("Using docTR for Conversion")

-        def converter(image):
+        def converter(image: str) -> str:
            import os
+            import shutil
            import tempfile
+
            from doctr.io import DocumentFile
            from doctr.models import ocr_predictor

            with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
-
+                shutil.copy(image, tmp.name)
                temp_image_path = tmp.name

            model = ocr_predictor(pretrained=True)
@@ -130,12 +137,13 @@ class Ocr:
            combined_text = "\n".join(
                word["value"]
                for page in data["pages"]
-                for block in page.get(
-                for line in block.get(
-                for word in line.get(
+                for block in page.get("blocks", [])
+                for line in block.get("lines", [])
+                for word in line.get("words", [])
                if "value" in word
            )
            if os.path.exists(temp_image_path):
                os.remove(temp_image_path)
            return combined_text
+
        return cls._convert(file_path, converter)
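A sketch of the OCR entry points reformatted above; each one wraps `Ocr._convert`, which rasterises PDFs with pdf2image and opens images with PIL. The path is illustrative, and `pytesseract` requires a local tesseract install (a stated prerequisite of the package).

```python
from purrfectmeow.tc01_spl.ocr import Ocr

text = Ocr.pytesseract_convert("scan/page.png")
# text = Ocr.easyocr_convert("scan/page.png")
print(text)
```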
purrfectmeow/tc01_spl/simple.py
CHANGED
@@ -1,20 +1,21 @@
import time
-from
+from collections.abc import Callable
+from typing import Any

from purrfectmeow.meow.kitty import kitty_logger

-class Simple:

+class Simple:
    _logger = kitty_logger(__name__)

    @classmethod
-    def _convert(cls, file_path: str, converter: Callable) -> str:
+    def _convert(cls, file_path: str, converter: Callable[[str], Any]) -> str | Any:
        cls._logger.debug(f"Starting conversion for '{file_path}'")
        start = time.time()

        try:
            res = converter(file_path)
-
+
            cls._logger.debug(f"Successfully converted '{file_path}'")
            return res

@@ -26,39 +27,49 @@ class Simple:
    def encoding_convert(cls, file_path: str) -> str:
        cls._logger.debug("Using Encoding for Conversion")

-        def reader(file_path):
-            with open(file_path,
+        def reader(file_path: str) -> str:
+            with open(file_path, encoding="utf-8") as f:
                return f.read()
+
        return cls._convert(file_path, lambda file_path: reader(file_path))

    @classmethod
    def pymupdf_convert(cls, file_path: str) -> str:
        cls._logger.debug("Using PyMuPDF for Conversion")

-        def reader(file_path):
+        def reader(file_path: str) -> str:
            import pymupdf

-            if file_path.endswith((
+            if file_path.endswith((".txt", ".md", ".json", ".html", ".xml")):
                return "".join(page.get_text() for page in pymupdf.open(file_path, filetype="txt"))
            else:
                return "".join(page.get_text() for page in pymupdf.open(file_path))
+
        return cls._convert(file_path, lambda file_path: reader(file_path))

    @classmethod
    def pandas_convert(cls, file_path: str) -> str:
        cls._logger.debug("Using Pandas for Conversion")

-        def reader(file_path):
+        def reader(file_path: str) -> Any:
            import pandas

-            if file_path.endswith((
-
-
-
-
-            return
-            elif file_path.endswith(
-
-
-
+            if file_path.endswith((".xls", ".xlsx")):
+                df_x: pandas.DataFrame = pandas.read_excel(file_path)
+                return df_x.to_string(index=False)
+            elif file_path.endswith(".csv"):
+                df_c: pandas.DataFrame = pandas.read_csv(file_path)
+                return df_c.to_string(index=False)
+            elif file_path.endswith(".json"):
+                df_j: pandas.DataFrame = pandas.read_json(file_path)
+                return df_j.to_string(index=False)
+            elif file_path.endswith(".html"):
+                df_h: list[pandas.DataFrame] = pandas.read_html(file_path)
+                return "".join(df.to_string(index=False) for df in df_h)
+            elif file_path.endswith(".xml"):
+                df_m: pandas.DataFrame = pandas.read_xml(file_path)
+                return df_m.to_string(index=False)
+            else:
+                return ""
+
        return cls._convert(file_path, lambda file_path: reader(file_path))
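A sketch of the `Simple` readers whose pandas branch was filled out above; `pandas_convert` dispatches on the extension (.xls/.xlsx, .csv, .json, .html, .xml) and falls back to an empty string. Paths are illustrative.

```python
from purrfectmeow.tc01_spl.simple import Simple

table_text = Simple.pandas_convert("data/table.csv")
plain_text = Simple.encoding_convert("notes/readme.txt")
print(table_text.splitlines()[:3])
```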
purrfectmeow/tc02_mlt/base.py
CHANGED
@@ -1,34 +1,43 @@
-from typing import Any,
+from typing import Any, Literal
+
+from langchain_text_splitters import TokenTextSplitter

-from .token import TokenSplit
from .separate import SeparateSplit
+from .token import TokenSplit
+

class Malet:
-    DEFAULT_MODEL_NAME =
+    DEFAULT_MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    DEFAULT_CHUNK_SIZE = 500
    DEFAULT_CHUNK_OVERLAP = 0
-    DEFAULT_CHUNK_SEPARATOR =
+    DEFAULT_CHUNK_SEPARATOR = "\n\n"

    @staticmethod
-    def _get_kwarg(kwargs: dict, keys:
+    def _get_kwarg(kwargs: dict[str, Any], keys: list[str], default: Any = None) -> Any:
        for key in keys:
            if key in kwargs:
                return kwargs[key]
        return default

    @classmethod
-    def chunking(
+    def chunking(
+        cls, text: str, chunk_method: Literal["token", "separate"] | None = "token", **kwargs: Any
+    ) -> TokenTextSplitter | SeparateSplit.CharacterSeparator:
        match chunk_method:
            case "token":
                model_name = cls._get_kwarg(kwargs, ["model_name", "ModelName", "modelName"], cls.DEFAULT_MODEL_NAME)
                chunk_size = cls._get_kwarg(kwargs, ["chunk_size", "ChunkSize", "chunkSize"], cls.DEFAULT_CHUNK_SIZE)
-                chunk_overlap = cls._get_kwarg(
+                chunk_overlap = cls._get_kwarg(
+                    kwargs, ["chunk_overlap", "ChunkOverlap", "chunkOverlap"], cls.DEFAULT_CHUNK_OVERLAP
+                )

                method = TokenSplit.splitter(model_name, chunk_size, chunk_overlap)
-                return method.split_text(text)

            case "separate":
-                chunk_separator = cls._get_kwarg(
+                chunk_separator = cls._get_kwarg(
+                    kwargs, ["chunk_separator", "ChunkSeparator", "chunkSeparator"], cls.DEFAULT_CHUNK_SEPARATOR
+                )

                method = SeparateSplit.splitter(chunk_separator)
-
+
+        return method.split_text(text)
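A hedged sketch of `Malet.chunking` after the refactor above (the return now happens once, after the `match`); the chunk sizes are illustrative, and the CamelCase kwarg aliases remain accepted via `_get_kwarg`.

```python
from purrfectmeow import Malet

text = "PurrfectKit simplifies RAG.\n\nIt reads, chunks, embeds and searches."

token_chunks = Malet.chunking(text, chunk_method="token", chunk_size=50, chunk_overlap=5)
para_chunks = Malet.chunking(text, chunk_method="separate")  # splits on the default "\n\n"

print(len(token_chunks), len(para_chunks))
```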
purrfectmeow/tc02_mlt/separate.py
CHANGED
@@ -1,12 +1,15 @@
+from __future__ import annotations
+
import time

from purrfectmeow.meow.kitty import kitty_logger

+
class SeparateSplit:
    _logger = kitty_logger(__name__)
-
+
    @classmethod
-    def splitter(cls, chunk_separator: str):
+    def splitter(cls, chunk_separator: str) -> CharacterSeparator:
        cls._logger.debug("Initializing separate splitter")
        start = time.time()

@@ -25,8 +28,8 @@ class SeparateSplit:
    class CharacterSeparator:
        def __init__(self, separator: str):
            self.separator = separator
-
-        def split_text(self, text: str):
+
+        def split_text(self, text: str) -> list[str]:
            chunks = [chunk + self.separator for chunk in text.split(self.separator)]
            chunks[-1] = chunks[-1].rstrip(self.separator)
            return chunks
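A short sketch of the separator splitter whose return type was annotated above:

```python
from purrfectmeow.tc02_mlt.separate import SeparateSplit

splitter = SeparateSplit.splitter("\n\n")            # returns a CharacterSeparator
chunks = splitter.split_text("intro\n\nbody\n\noutro")
# Every chunk keeps its trailing separator except the last one.
print(chunks)
```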
purrfectmeow/tc02_mlt/token.py
CHANGED
@@ -1,45 +1,38 @@
import time

+from langchain_text_splitters import TokenTextSplitter
+
from purrfectmeow.meow.kitty import kitty_logger

+
class TokenSplit:
    _logger = kitty_logger(__name__)

-    _OPENAI_EMBED_MODEL = {
-
-
-        'text-embedding-3-large'
-    }
-    _OPENAI_HF_MODEL = {
-        'Xenova/text-embedding-ada-002'
-    }
-    _HF_MODEL_DIR = '.cache/huggingface/hub/'
+    _OPENAI_EMBED_MODEL = {"text-embedding-ada-002", "text-embedding-3-small", "text-embedding-3-large"}
+    _OPENAI_HF_MODEL = {"Xenova/text-embedding-ada-002"}
+    _HF_MODEL_DIR = ".cache/huggingface/hub/"

    @classmethod
-    def splitter(cls, model_name: str, chunk_size: int, chunk_overlap: int):
+    def splitter(cls, model_name: str, chunk_size: int, chunk_overlap: int) -> TokenTextSplitter:
        cls._logger.debug("Initializing token splitter")
        start = time.time()
-
+
        try:
            cls._logger.debug(f"Using OpenAI model tokenizer: {model_name}")
-            from langchain_text_splitters import TokenTextSplitter
            if model_name in cls._OPENAI_EMBED_MODEL:
                splitter = TokenTextSplitter.from_tiktoken_encoder(
-                    model_name=model_name,
-                    chunk_size=chunk_size,
-                    chunk_overlap=chunk_overlap
+                    model_name=model_name, chunk_size=chunk_size, chunk_overlap=chunk_overlap
                )
            else:
                cls._logger.debug(f"Using HuggingFace tokenizer: {model_name}")
                from transformers import AutoTokenizer, GPT2TokenizerFast
+
                if model_name in cls._OPENAI_HF_MODEL:
                    tokenizer = GPT2TokenizerFast.from_pretrained(model_name, cache_dir=cls._HF_MODEL_DIR)
                else:
                    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cls._HF_MODEL_DIR)
                splitter = TokenTextSplitter.from_huggingface_tokenizer(
-                    tokenizer=tokenizer,
-                    chunk_size=chunk_size,
-                    chunk_overlap=chunk_overlap
+                    tokenizer=tokenizer, chunk_size=chunk_size, chunk_overlap=chunk_overlap
                )

            cls._logger.debug("Token splitter successfully initialized.")
@@ -52,4 +45,3 @@ class TokenSplit:
        finally:
            elapsed = time.time() - start
            cls._logger.debug(f"Token splitting completed in {elapsed:.2f} seconds.")
-
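A sketch of `TokenSplit.splitter`, which now returns a `TokenTextSplitter` imported at module level; OpenAI embedding model names go through tiktoken, while anything else is loaded as a HuggingFace tokenizer cached under `.cache/huggingface/hub/`. The model name and sizes here are illustrative.

```python
from purrfectmeow.tc02_mlt.token import TokenSplit

splitter = TokenSplit.splitter(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", 500, 25
)
chunks = splitter.split_text("some long multilingual document text ...")
print(len(chunks))
```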
purrfectmeow/tc03_wcm/local.py
CHANGED
@@ -1,24 +1,26 @@
import time
-from typing import
+from typing import Any
+
+import numpy

from purrfectmeow.meow.kitty import kitty_logger

-class Local:

+class Local:
    _logger = kitty_logger(__name__)
-
-    _HF_MODEL_DIR = '.cache/huggingface/hub/'
+    _HF_MODEL_DIR = ".cache/huggingface/hub/"

    @classmethod
-    def model_encode(cls, sentence: str |
+    def model_encode(cls, sentence: str | list[str], model_name: str, **kwargs: Any) -> numpy.ndarray:
        cls._logger.debug("Initializing local model encode")
        start = time.time()
        try:
            from sentence_transformers import SentenceTransformer
+
            model = SentenceTransformer(
-                model_name,
+                model_name,
                cache_folder=cls._HF_MODEL_DIR,
-                #local_files_only=True
+                # local_files_only=True
            )

            embed = model.encode(sentence, convert_to_numpy=True)
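A sketch of `Local.model_encode` with the new `numpy.ndarray` return annotation; the model name matches the default used elsewhere in the package, and the sentences are illustrative.

```python
from purrfectmeow.tc03_wcm.local import Local

vectors = Local.model_encode(
    ["hello world", "สวัสดีครับ"],
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
)
print(vectors.shape)  # one embedding row per input sentence
```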
purrfectmeow/tc04_kmn/base.py
CHANGED
@@ -1,8 +1,18 @@
+import numpy

-from .
-
+from purrfectmeow.meow.chaus import SimilarityResult
+from purrfectmeow.meow.felis import Document
+
+from .cosine import CosineSim

-@classmethod
-def searching(cls, query_embed, sentence_embed, document, top_k):

-
+class KhaoManee:
+    @classmethod
+    def searching(
+        cls,
+        query_embed: numpy.ndarray,
+        sentence_embed: numpy.ndarray | list[numpy.ndarray],
+        documents: list[Document],
+        top_k: int,
+    ) -> list[SimilarityResult]:
+        return CosineSim.vector_search(query_embed, sentence_embed, documents, top_k)
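A hedged end-to-end sketch of the new `KhaoManee.searching` signature. It follows the README quick-start and assumes `WichienMaat.embedding` also accepts a single query string, which this diff does not show.

```python
from purrfectmeow import KhaoManee, WichienMaat
from purrfectmeow.meow.felis import DocTemplate

chunks = ["cats are great", "dogs are loyal", "parrots can talk"]
docs = DocTemplate.create_template(chunks, {"file_name": "pets.txt"})

sentence_embed = WichienMaat.embedding(chunks)
query_embed = WichienMaat.embedding("which pet talks?")   # assumption: str input is supported

for hit in KhaoManee.searching(query_embed, sentence_embed, docs, top_k=2):
    print(hit["score"], hit["document"].page_content)
```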
purrfectmeow/tc04_kmn/cosine.py
CHANGED
@@ -1,22 +1,23 @@
import time
-from typing import List

import numpy
-from purrfectmeow.meow.felis import Document

+from purrfectmeow.meow.chaus import SimilarityResult
+from purrfectmeow.meow.felis import Document
from purrfectmeow.meow.kitty import kitty_logger

-
+
+class CosineSim:
    _logger = kitty_logger(__name__)

    @classmethod
    def vector_search(
-        cls,
-        embed_query: numpy.ndarray,
-        embed_sentence: numpy.ndarray |
-
-        top_k: int
-    ):
+        cls,
+        embed_query: numpy.ndarray,
+        embed_sentence: numpy.ndarray | list[numpy.ndarray],
+        documents: list[Document],
+        top_k: int,
+    ) -> list[SimilarityResult]:
        cls._logger.debug("Initializing vector search")
        start = time.time()
        try:
@@ -25,10 +26,9 @@ class ConsineSim:
            sims = cosine_similarity([embed_query], embed_sentence)[0]
            top_indices = numpy.argsort(sims)[::-1][:top_k]

-            results = [
-
-
-            } for i in top_indices]
+            results: list[SimilarityResult] = [
+                SimilarityResult(score=float(sims[i]), document=documents[i]) for i in top_indices
+            ]

            return results
        except Exception as e:
@@ -37,4 +37,3 @@ class ConsineSim:
        finally:
            elapsed = time.time() - start
            cls._logger.debug(f"Vector search completed in {elapsed:.2f} seconds.")
-
purrfectmeow/tc05_knj/base.py
CHANGED
purrfectkit-0.2.1.dist-info/RECORD
REMOVED
@@ -1,24 +0,0 @@
-purrfectmeow/__init__.py,sha256=XEej-s0VH-Up9aob3XcDQqgS55Ftk_qNoXezdcedFJQ,271
-purrfectmeow/meow/felis.py,sha256=8d1kaizsEisr7dW-MKw8HqsYfOkLBGy-sYTv-4kClQ8,6149
-purrfectmeow/meow/kitty.py,sha256=WaLuh2t1PnigWYDNZlbNfCA_uqXnPYc-xxDuZlFfNNY,1971
-purrfectmeow/tc01_spl/__init__.py,sha256=7ENCidvXhj9YhMQvBcv_mm4XIr3Mwzc1USQxgzLO0Nw,51
-purrfectmeow/tc01_spl/base.py,sha256=iuIZiPUe-ofeF_PmknnCg-4NsJxDoH7rj-SMsqNBTAQ,3308
-purrfectmeow/tc01_spl/markdown.py,sha256=AUCSZ-6W0sXbZwGgZfe6utidbEemQGoi6c4rsLiH928,1861
-purrfectmeow/tc01_spl/ocr.py,sha256=A3orLTIVmu2WYJTi4joWlTmV27IDh3MTa7qc7IRAQkE,4784
-purrfectmeow/tc01_spl/simple.py,sha256=dwecYL2sviKz4BoJcOQntAprXACvaEig-ZbDiwTW-cU,2347
-purrfectmeow/tc02_mlt/__init__.py,sha256=qB2Eyc_wFDVELwj0L7ttG_YOL3IISaqPBRj0zqSJcPo,45
-purrfectmeow/tc02_mlt/base.py,sha256=cz1qFo1AdL-I2wnBPO06MhcYSQh90tcLCN99phIUKWw,1508
-purrfectmeow/tc02_mlt/separate.py,sha256=YQSnC5BODg1cJh4JrPkT_-tO1CbwgpxuCjMvHQwRUNE,1074
-purrfectmeow/tc02_mlt/token.py,sha256=qULVySiTAbDoBQrtWWuvPkO5Zqf5hjRutN1Q7foCwUU,2052
-purrfectmeow/tc03_wcm/__init__.py,sha256=8pXGo04Z5KUNGkhSTONLBlqwVc43LicDGSuQiQDIKIM,57
-purrfectmeow/tc03_wcm/base.py,sha256=pXaaiU8JMLIjI5uJRxMLRnQ1Wmwv3U6EEkQ_IwhPLwg,473
-purrfectmeow/tc03_wcm/local.py,sha256=5AfVSftW_cfaZBZBe-joSMJRRJ55G0g5lf9Qtcl0LUw,1074
-purrfectmeow/tc04_kmn/__init__.py,sha256=FBHZKVu4agf6-p1MdMx0jIgQuKbAy9rsOu7MRIQVwXg,53
-purrfectmeow/tc04_kmn/base.py,sha256=rj3Ar2Pv8VOL7vKvPB-snif8SRwBbGaLbWIpHFpd5b8,224
-purrfectmeow/tc04_kmn/cosine.py,sha256=DaDXVcy6YyNc5jwtPXeQg040FT7607phyt5Ub74E9aw,1147
-purrfectmeow/tc05_knj/__init__.py,sha256=XKwISvOAznPdTUWoTUnFDMBmxZF9Qd6FAi711W6bvZY,47
-purrfectmeow/tc05_knj/base.py,sha256=qN1VCx20G5H7YHcVzmg0YNXMLZM7TPkiD_UMEZykfjE,70
-purrfectkit-0.2.1.dist-info/licenses/LICENSE,sha256=9WlLgfJwKDGb71B1NwKYKKg6uL5u_knAr7ovGwIWvD4,1078
-purrfectkit-0.2.1.dist-info/WHEEL,sha256=5w2T7AS2mz1-rW9CNagNYWRCaB0iQqBMYLwKdlgiR4Q,78
-purrfectkit-0.2.1.dist-info/METADATA,sha256=cSe3NLmt6D8LaZSpilNU1c3G9k0P5XGThncqp6K2Crk,3765
-purrfectkit-0.2.1.dist-info/RECORD,,
{purrfectkit-0.2.1.dist-info → purrfectkit-0.2.2.dist-info}/licenses/LICENSE
File without changes