hamtaa-texttools 1.3.0__py3-none-any.whl → 1.3.2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. It is provided for informational purposes only and reflects the changes between the two versions as they appear in their public registries.
- {hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/METADATA +8 -38
- {hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/RECORD +11 -14
- texttools/__init__.py +1 -3
- texttools/core/engine.py +21 -23
- texttools/core/internal_models.py +7 -3
- texttools/core/operators/async_operator.py +1 -3
- texttools/core/operators/sync_operator.py +1 -3
- texttools/batch/config.py +0 -40
- texttools/batch/manager.py +0 -228
- texttools/batch/runner.py +0 -228
- {hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/WHEEL +0 -0
- {hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/licenses/LICENSE +0 -0
- {hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/top_level.txt +0 -0
- /texttools/{batch → core/operators}/__init__.py +0 -0
{hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hamtaa-texttools
-Version: 1.3.0
+Version: 1.3.2
 Summary: A high-level NLP toolkit built on top of modern LLMs.
 Author-email: Tohidi <the.mohammad.tohidi@gmail.com>, Erfan Moosavi <erfanmoosavi84@gmail.com>, Montazer <montazerh82@gmail.com>, Givechi <mohamad.m.givechi@gmail.com>, Zareshahi <a.zareshahi1377@gmail.com>
 Maintainer-email: Erfan Moosavi <erfanmoosavi84@gmail.com>, Tohidi <the.mohammad.tohidi@gmail.com>
@@ -21,6 +21,9 @@ Dynamic: license-file

 # TextTools

+
+
+
 ## 📌 Overview

 **TextTools** is a high-level **NLP toolkit** built on top of **LLMs**.
@@ -44,11 +47,11 @@ Each tool is designed to work with structured outputs.
 - **`is_question()`** - Binary question detection
 - **`text_to_question()`** - Generates questions from text
 - **`merge_questions()`** - Merges multiple questions into one
-- **`rewrite()`** - Rewrites text in a
-- **`subject_to_question()`** - Generates questions about a
+- **`rewrite()`** - Rewrites text in a different way
+- **`subject_to_question()`** - Generates questions about a given subject
 - **`summarize()`** - Text summarization
 - **`translate()`** - Text translation
-- **`propositionize()`** - Convert text to atomic
+- **`propositionize()`** - Convert text to atomic independent meaningful sentences
 - **`check_fact()`** - Check whether a statement is relevant to the source text
 - **`run_custom()`** - Allows users to define a custom tool with an arbitrary BaseModel
@@ -66,7 +69,7 @@ pip install -U hamtaa-texttools

 ## 📊 Tool Quality Tiers

-| Status | Meaning | Tools |
+| Status | Meaning | Tools | Safe for Production? |
 |--------|---------|----------|-------------------|
 | **✅ Production** | Evaluated, tested, stable. | `categorize()` (list mode), `extract_keywords()`, `extract_entities()`, `is_question()`, `text_to_question()`, `merge_questions()`, `rewrite()`, `subject_to_question()`, `summarize()`, `run_custom()` | **Yes** - ready for reliable use. |
 | **🧪 Experimental** | Added to the package but **not fully evaluated**. Functional, but quality may vary. | `categorize()` (tree mode), `translate()`, `propositionize()`, `check_fact()` | **Use with caution** - outputs not yet validated. |
@@ -177,40 +180,7 @@ Use **TextTools** when you need to:

 ---

-## 📚 Batch Processing
-
-Process large datasets efficiently using OpenAI's batch API.
-
-## ⚡ Quick Start (Batch Runner)
-
-```python
-from pydantic import BaseModel
-from texttools import BatchRunner, BatchConfig
-
-config = BatchConfig(
-    system_prompt="Extract entities from the text",
-    job_name="entity_extraction",
-    input_data_path="data.json",
-    output_data_filename="results.json",
-    model="gpt-4o-mini"
-)
-
-class Output(BaseModel):
-    entities: list[str]
-
-runner = BatchRunner(config, output_model=Output)
-runner.run()
-```
-
----
-
 ## 🤝 Contributing

 Contributions are welcome!
 Feel free to **open issues, suggest new features, or submit pull requests**.
-
----
-
-## 🌿 License
-
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
{hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/RECORD
CHANGED
@@ -1,17 +1,14 @@
-hamtaa_texttools-1.3.
-texttools/__init__.py,sha256=
+hamtaa_texttools-1.3.2.dist-info/licenses/LICENSE,sha256=Hb2YOBKy2MJQLnyLrX37B4ZVuac8eaIcE71SvVIMOLg,1082
+texttools/__init__.py,sha256=RK1GAU6pq2lGwFtHdrCX5JkPRHmOLGcmGH67hd_7VAQ,175
 texttools/models.py,sha256=5eT2cSrFq8Xa38kANznV7gbi7lwB2PoDxciLKTpsd6c,2516
 texttools/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-texttools/batch/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-texttools/batch/config.py,sha256=GDDXuhRZ_bOGVwSIlU4tWP247tx1_A7qzLJn7VqDyLU,1050
-texttools/batch/manager.py,sha256=XZtf8UkdClfQlnRKne4nWEcFvdSKE67EamEePKy7jwI,8730
-texttools/batch/runner.py,sha256=9qxXIMfYRXW5SXDqqKtRr61rnQdYZkbCGqKImhSrY6I,9923
 texttools/core/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-texttools/core/engine.py,sha256=
+texttools/core/engine.py,sha256=AjifrcJl6PeRu1W6nu9zcxySn-1439Ef2La4d7GpNKY,9481
 texttools/core/exceptions.py,sha256=6SDjUL1rmd3ngzD3ytF4LyTRj3bQMSFR9ECrLoqXXHw,395
-texttools/core/internal_models.py,sha256=
-texttools/core/operators/
-texttools/core/operators/
+texttools/core/internal_models.py,sha256=J1qGEO8V0OoX6_-1yxbSmZSR79tJF0ExAIG1QuvH0L0,1734
+texttools/core/operators/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+texttools/core/operators/async_operator.py,sha256=-72YQEGFkbk2uYW6PHkLT4wGxhj2p6Uqy3sJtVa9-rk,6386
+texttools/core/operators/sync_operator.py,sha256=mfXtEOlIAhHo4SHaHRKjGb0Z1T894clv-toUzUcbfpo,6291
 texttools/prompts/categorize.yaml,sha256=42Rp3SgVHaDLKrJ27_uK788LiQud0pOXJthz4r0a40Y,1214
 texttools/prompts/check_fact.yaml,sha256=zWFQDRhEE1ij9wSeeenS9YSTM-bY5zzUaG390zUgmcs,714
 texttools/prompts/extract_entities.yaml,sha256=_zYKHNJDIzVDI_-TnwFCKyMs-XLM5igvmWhvSTc3INQ,637
@@ -28,7 +25,7 @@ texttools/prompts/translate.yaml,sha256=Dd5bs3O8SI-FlVSwHMYGeEjMmdOWeRlcfBHkhixC
 texttools/tools/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 texttools/tools/async_tools.py,sha256=2suwx8N0aRnowaSOpV6C57AqPlmQe5Z0Yx4E5QIMkmU,46939
 texttools/tools/sync_tools.py,sha256=mEuL-nlbxVW30dPE3hGkAUnYXbul-3gN2Le4CMVFCgU,42528
-hamtaa_texttools-1.3.
-hamtaa_texttools-1.3.
-hamtaa_texttools-1.3.
-hamtaa_texttools-1.3.
+hamtaa_texttools-1.3.2.dist-info/METADATA,sha256=LjhXLwovneW5Ii1DvAYhFT4JR64ar23UyptCvCO6Hpc,7448
+hamtaa_texttools-1.3.2.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+hamtaa_texttools-1.3.2.dist-info/top_level.txt,sha256=5Mh0jIxxZ5rOXHGJ6Mp-JPKviywwN0MYuH0xk5bEWqE,10
+hamtaa_texttools-1.3.2.dist-info/RECORD,,
texttools/__init__.py
CHANGED
@@ -1,7 +1,5 @@
-from .batch.config import BatchConfig
-from .batch.runner import BatchRunner
 from .models import CategoryTree
 from .tools.async_tools import AsyncTheTool
 from .tools.sync_tools import TheTool

-__all__ = ["TheTool", "AsyncTheTool", "CategoryTree"
+__all__ = ["TheTool", "AsyncTheTool", "CategoryTree"]
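The practical upshot of this change for downstream code, as a minimal sketch (the removed names were still importable up to 1.3.0):

```python
# Import surface after 1.3.2: the batch entry points are gone from the package root.
from texttools import TheTool, AsyncTheTool, CategoryTree  # still exported

# from texttools import BatchRunner, BatchConfig  # raises ImportError in 1.3.2
```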
texttools/core/engine.py
CHANGED
@@ -4,6 +4,7 @@ import random
 import re
 from functools import lru_cache
 from pathlib import Path
+from typing import Any

 import yaml

@@ -20,9 +21,6 @@ class PromptLoader:

     @lru_cache(maxsize=32)
     def _load_templates(self, prompt_file: str, mode: str | None) -> dict[str, str]:
-        """
-        Loads prompt templates from YAML file with optional mode selection.
-        """
         try:
             base_dir = Path(__file__).parent.parent / Path("prompts")
             prompt_path = base_dir / prompt_file
@@ -73,13 +71,12 @@
         self, prompt_file: str, text: str, mode: str, **extra_kwargs
     ) -> dict[str, str]:
         try:
-            template_configs = self._load_templates(prompt_file, mode)
             format_args = {"text": text}
             format_args.update(extra_kwargs)

-
-            for key in template_configs.
-            template_configs[key] =
+            template_configs = self._load_templates(prompt_file, mode)
+            for key, value in template_configs.items():
+                template_configs[key] = value.format(**format_args)

             return template_configs

@@ -97,30 +94,27 @@ class OperatorUtils:
         output_lang: str | None,
         user_prompt: str | None,
     ) -> str:
-
+        parts = []

         if analysis:
-
-
+            parts.append(f"Based on this analysis: {analysis}")
         if output_lang:
-
-
+            parts.append(f"Respond only in the {output_lang} language.")
         if user_prompt:
-
+            parts.append(f"Consider this instruction: {user_prompt}")

-
-
-        return main_prompt
+        parts.append(main_template)
+        return "\n".join(parts)

     @staticmethod
     def build_message(prompt: str) -> list[dict[str, str]]:
         return [{"role": "user", "content": prompt}]

     @staticmethod
-    def extract_logprobs(completion:
+    def extract_logprobs(completion: Any) -> list[dict]:
         """
-        Extracts and filters
-        Skips punctuation and structural tokens
+        Extracts and filters logprobs from completion.
+        Skips punctuation and structural tokens.
         """
         logprobs_data = []

@@ -153,16 +147,17 @@ class OperatorUtils:

     @staticmethod
     def get_retry_temp(base_temp: float) -> float:
-
-        new_temp = base_temp + delta_temp
-
+        new_temp = base_temp + random.choice([-1, 1]) * random.uniform(0.1, 0.9)
         return max(0.0, min(new_temp, 1.5))


 def text_to_chunks(text: str, size: int, overlap: int) -> list[str]:
+    """
+    Utility for chunking large texts. Used for translation tool
+    """
     separators = ["\n\n", "\n", " ", ""]
     is_separator_regex = False
-    keep_separator = True
+    keep_separator = True
     length_function = len
     strip_whitespace = True
     chunk_size = size
@@ -256,6 +251,9 @@ def text_to_chunks(text: str, size: int, overlap: int) -> list[str]:


 async def run_with_timeout(coro, timeout: float | None):
+    """
+    Utility for timeout logic defined in AsyncTheTool
+    """
     if timeout is None:
         return await coro
     try:
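The two functional fixes in this file are easiest to see in isolation. A standalone sketch (names mirror the diff; the YAML loading and OpenAI plumbing are stripped out):

```python
import random

# 1. The template-formatting loop: templates are now loaded *after* the format
#    arguments are assembled, and each template actually gets formatted.
def fill_templates(template_configs: dict[str, str], **format_args) -> dict[str, str]:
    for key, value in template_configs.items():
        template_configs[key] = value.format(**format_args)
    return template_configs

# 2. The retry temperature: jitter the base temperature by ±(0.1..0.9) and
#    clamp to [0.0, 1.5], replacing the previous delta_temp computation.
def get_retry_temp(base_temp: float) -> float:
    new_temp = base_temp + random.choice([-1, 1]) * random.uniform(0.1, 0.9)
    return max(0.0, min(new_temp, 1.5))

print(fill_templates({"user": "Text: {text}"}, text="hello"))  # {'user': 'Text: hello'}
print(get_retry_temp(0.7))  # e.g. 0.23 or 1.41, never outside the clamp
```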
texttools/core/internal_models.py
CHANGED
@@ -21,7 +21,9 @@ class Bool(BaseModel):

 class ListStr(BaseModel):
     result: list[str] = Field(
-        ...,
+        ...,
+        description="The output list of strings",
+        example=["text_1", "text_2", "text_3"],
     )


@@ -36,11 +38,13 @@ class ListDictStrStr(BaseModel):
 class ReasonListStr(BaseModel):
     reason: str = Field(..., description="Thinking process that led to the output")
     result: list[str] = Field(
-        ...,
+        ...,
+        description="The output list of strings",
+        example=["text_1", "text_2", "text_3"],
     )


-#
+# Create CategorizerOutput with dynamic categories
 def create_dynamic_model(allowed_values: list[str]) -> Type[BaseModel]:
     literal_type = Literal[*allowed_values]

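The `create_dynamic_model` helper whose comment was restored here builds a categorizer output model constrained to the caller's labels. A hedged sketch of the idea — the `Literal[*allowed_values]` line is verbatim from the diff (Python 3.11+); the `create_model` call and the list-typed `result` field are an assumed completion, not shown in this diff:

```python
from typing import Literal, Type
from pydantic import BaseModel, create_model

def create_dynamic_model(allowed_values: list[str]) -> Type[BaseModel]:
    literal_type = Literal[*allowed_values]  # taken verbatim from the diff
    # Assumed completion: a model whose `result` only admits the allowed labels.
    return create_model("CategorizerOutput", result=(list[literal_type], ...))

Categories = create_dynamic_model(["sports", "politics"])
print(Categories(result=["sports"]))   # result=['sports']
# Categories(result=["cooking"]) raises a ValidationError
```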
texttools/core/operators/async_operator.py
CHANGED
@@ -54,7 +54,7 @@ class AsyncOperator:
     ) -> tuple[T, Any]:
         """
         Parses a chat completion using OpenAI's structured output format.
-        Returns both the parsed
+        Returns both the parsed and the completion for logprobs.
         """
         try:
             request_kwargs = {
@@ -92,7 +92,6 @@

     async def run(
         self,
-        # User parameters
         text: str,
         with_analysis: bool,
         output_lang: str | None,
@@ -103,7 +102,6 @@
         validator: Callable[[Any], bool] | None,
         max_validation_retries: int | None,
         priority: int | None,
-        # Internal parameters
         tool_name: str,
         output_model: Type[T],
         mode: str | None,
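The docstring completed here (and identically in sync_operator.py below) describes one pattern: request a parsed Pydantic object but keep the raw completion so token logprobs survive. A sketch of that pattern with the openai SDK's structured-output helper — the method and field names are the SDK's, not this package's, so treat the exact call as an assumption:

```python
from openai import AsyncOpenAI
from pydantic import BaseModel

class Entities(BaseModel):
    result: list[str]

async def parse_with_logprobs(client: AsyncOpenAI, messages: list[dict]) -> tuple[Entities, object]:
    # beta.chat.completions.parse validates the JSON against the model;
    # logprobs=True keeps per-token log probabilities on the raw completion.
    completion = await client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=messages,
        response_format=Entities,
        logprobs=True,
    )
    parsed = completion.choices[0].message.parsed
    return parsed, completion  # "both the parsed and the completion for logprobs"
```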
texttools/core/operators/sync_operator.py
CHANGED
@@ -54,7 +54,7 @@ class Operator:
     ) -> tuple[T, Any]:
         """
         Parses a chat completion using OpenAI's structured output format.
-        Returns both the parsed
+        Returns both the parsed and the completion for logprobs.
         """
         try:
             request_kwargs = {
@@ -90,7 +90,6 @@

     def run(
         self,
-        # User parameters
         text: str,
         with_analysis: bool,
         output_lang: str | None,
@@ -101,7 +100,6 @@
         validator: Callable[[Any], bool] | None,
         max_validation_retries: int | None,
         priority: int | None,
-        # Internal parameters
         tool_name: str,
         output_model: Type[T],
         mode: str | None,
texttools/batch/config.py
DELETED
@@ -1,40 +0,0 @@
-from collections.abc import Callable
-from dataclasses import dataclass
-from typing import Any
-
-
-def export_data(data) -> list[dict[str, str]]:
-    """
-    Produces a structure of the following form from an initial data structure:
-    [{"id": str, "text": str},...]
-    """
-    return data
-
-
-def import_data(data) -> Any:
-    """
-    Takes the output and adds and aggregates it to the original structure.
-    """
-    return data
-
-
-@dataclass
-class BatchConfig:
-    """
-    Configuration for batch job runner.
-    """
-
-    system_prompt: str = ""
-    job_name: str = ""
-    input_data_path: str = ""
-    output_data_filename: str = ""
-    model: str = "gpt-4.1-mini"
-    MAX_BATCH_SIZE: int = 100
-    MAX_TOTAL_TOKENS: int = 2_000_000
-    CHARS_PER_TOKEN: float = 2.7
-    PROMPT_TOKEN_MULTIPLIER: int = 1_000
-    BASE_OUTPUT_DIR: str = "Data/batch_entity_result"
-    import_function: Callable = import_data
-    export_function: Callable = export_data
-    poll_interval_seconds: int = 30
-    max_retries: int = 3
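Before 1.3.2 this dataclass was the single knob for the batch pipeline, with the two hooks adapting caller data to and from the pipeline's shape. A minimal usage sketch against the removed API (field values are illustrative; works only on releases that still ship `texttools.batch`):

```python
# texttools <= 1.3.0 only: these symbols were removed in 1.3.2.
from texttools import BatchConfig

def export_rows(data) -> list[dict[str, str]]:
    # BatchConfig's documented contract: produce [{"id": str, "text": str}, ...]
    return [{"id": str(i), "text": row["body"]} for i, row in enumerate(data)]

config = BatchConfig(
    system_prompt="Extract entities from the text",
    job_name="entity_extraction",
    input_data_path="data.json",
    output_data_filename="results.json",
    export_function=export_rows,   # replaces the identity default
    poll_interval_seconds=60,      # poll the Batch API once a minute
)
```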
texttools/batch/manager.py
DELETED
@@ -1,228 +0,0 @@
-import json
-import logging
-import uuid
-from pathlib import Path
-from typing import Any, Type, TypeVar
-
-from openai import OpenAI
-from openai.lib._pydantic import to_strict_json_schema
-from pydantic import BaseModel
-
-# Base Model type for output models
-T = TypeVar("T", bound=BaseModel)
-
-logger = logging.getLogger("texttools.batch_manager")
-
-
-class BatchManager:
-    """
-    Manages batch processing jobs for OpenAI's chat completions with structured outputs.
-
-    Handles the full lifecycle of a batch job: creating tasks from input texts,
-    starting the job, monitoring status, and fetching results. Results are automatically
-    parsed into the specified Pydantic output model. Job state is persisted to disk.
-    """
-
-    def __init__(
-        self,
-        client: OpenAI,
-        model: str,
-        output_model: Type[T],
-        prompt_template: str,
-        state_dir: Path = Path(".batch_jobs"),
-        custom_json_schema_obj_str: dict | None = None,
-        **client_kwargs: Any,
-    ):
-        self._client = client
-        self._model = model
-        self._output_model = output_model
-        self._prompt_template = prompt_template
-        self._state_dir = state_dir
-        self._custom_json_schema_obj_str = custom_json_schema_obj_str
-        self._client_kwargs = client_kwargs
-        self._dict_input = False
-        self._state_dir.mkdir(parents=True, exist_ok=True)
-
-        if custom_json_schema_obj_str and not isinstance(
-            custom_json_schema_obj_str, dict
-        ):
-            raise ValueError("Schema should be a dict")
-
-    def _state_file(self, job_name: str) -> Path:
-        return self._state_dir / f"{job_name}.json"
-
-    def _load_state(self, job_name: str) -> list[dict[str, Any]]:
-        """
-        Loads the state (job information) from the state file for the given job name.
-        Returns an empty list if the state file does not exist.
-        """
-        path = self._state_file(job_name)
-        if path.exists():
-            with open(path, "r", encoding="utf-8") as f:
-                return json.load(f)
-        return []
-
-    def _save_state(self, job_name: str, jobs: list[dict[str, Any]]) -> None:
-        """
-        Saves the job state to the state file for the given job name.
-        """
-        with open(self._state_file(job_name), "w", encoding="utf-8") as f:
-            json.dump(jobs, f)
-
-    def _clear_state(self, job_name: str) -> None:
-        """
-        Deletes the state file for the given job name if it exists.
-        """
-        path = self._state_file(job_name)
-        if path.exists():
-            path.unlink()
-
-    def _build_task(self, text: str, idx: str) -> dict[str, Any]:
-        """
-        Builds a single task dictionary for the batch job, including the prompt, model, and response format configuration.
-        """
-        response_format_config: dict[str, Any]
-
-        if self._custom_json_schema_obj_str:
-            response_format_config = {
-                "type": "json_schema",
-                "json_schema": self._custom_json_schema_obj_str,
-            }
-        else:
-            raw_schema = to_strict_json_schema(self._output_model)
-            response_format_config = {
-                "type": "json_schema",
-                "json_schema": {
-                    "name": self._output_model.__name__,
-                    "schema": raw_schema,
-                },
-            }
-
-        return {
-            "custom_id": str(idx),
-            "method": "POST",
-            "url": "/v1/chat/completions",
-            "body": {
-                "model": self.model,
-                "messages": [
-                    {"role": "system", "content": self._prompt_template},
-                    {"role": "user", "content": text},
-                ],
-                "response_format": response_format_config,
-                **self._client_kwargs,
-            },
-        }
-
-    def _prepare_file(self, payload: list[str] | list[dict[str, str]]) -> Path:
-        """
-        Prepares a JSONL file containing all tasks for the batch job, based on the input payload.
-        Returns the path to the created file.
-        """
-        if not payload:
-            raise ValueError("Payload must not be empty")
-        if isinstance(payload[0], str):
-            tasks = [self._build_task(text, uuid.uuid4().hex) for text in payload]
-        elif isinstance(payload[0], dict):
-            tasks = [self._build_task(dic["text"], dic["id"]) for dic in payload]
-
-        else:
-            raise TypeError(
-                "The input must be either a list of texts or a dictionary in the form {'id': str, 'text': str}"
-            )
-
-        file_path = self._state_dir / f"batch_{uuid.uuid4().hex}.jsonl"
-        with open(file_path, "w", encoding="utf-8") as f:
-            for task in tasks:
-                f.write(json.dumps(task) + "\n")
-        return file_path
-
-    def start(self, payload: list[str | dict[str, str]], job_name: str):
-        """
-        Starts a new batch job by uploading the prepared file and creating a batch job on the server.
-        If a job with the same name already exists, it does nothing.
-        """
-        if self._load_state(job_name):
-            return
-
-        path = self._prepare_file(payload)
-        upload = self._client.files.create(file=open(path, "rb"), purpose="batch")
-        job = self._client.batches.create(
-            input_file_id=upload.id,
-            endpoint="/v1/chat/completions",
-            completion_window="24h",
-        ).to_dict()
-        self._save_state(job_name, [job])
-
-    def check_status(self, job_name: str) -> str:
-        """
-        Checks and returns the current status of the batch job with the given job name.
-        Updates the job state with the latest information from the server.
-        """
-        job = self._load_state(job_name)[0]
-        if not job:
-            return "completed"
-
-        info = self._client.batches.retrieve(job["id"])
-        job = info.to_dict()
-        self._save_state(job_name, [job])
-        logger.info("Batch job status: %s", job)
-        return job["status"]
-
-    def fetch_results(
-        self, job_name: str, remove_cache: bool = True
-    ) -> tuple[dict[str, str], list]:
-        """
-        Fetches the results of a completed batch job. Optionally saves the results to a file and/or removes the job cache.
-        Returns a tuple containing the parsed results and a log of errors (if any).
-        """
-        job = self._load_state(job_name)[0]
-        if not job:
-            return {}
-        batch_id = job["id"]
-
-        info = self._client.batches.retrieve(batch_id)
-        out_file_id = info.output_file_id
-        if not out_file_id:
-            error_file_id = info.error_file_id
-            if error_file_id:
-                err_content = (
-                    self._client.files.content(error_file_id).read().decode("utf-8")
-                )
-                logger.error("Error file content:", err_content)
-            return {}
-
-        content = self._client.files.content(out_file_id).read().decode("utf-8")
-        lines = content.splitlines()
-        results = {}
-        log = []
-        for line in lines:
-            result = json.loads(line)
-            custom_id = result["custom_id"]
-            if result["response"]["status_code"] == 200:
-                content = result["response"]["body"]["choices"][0]["message"]["content"]
-                try:
-                    parsed_content = json.loads(content)
-                    model_instance = self._output_model(**parsed_content)
-                    results[custom_id] = model_instance.model_dump(mode="json")
-                except json.JSONDecodeError:
-                    results[custom_id] = {"error": "Failed to parse content as JSON"}
-                    error_d = {custom_id: results[custom_id]}
-                    log.append(error_d)
-                except Exception as e:
-                    results[custom_id] = {"error": str(e)}
-                    error_d = {custom_id: results[custom_id]}
-                    log.append(error_d)
-            else:
-                error_message = (
-                    result["response"]["body"]
-                    .get("error", {})
-                    .get("message", "Unknown error")
-                )
-                results[custom_id] = {"error": error_message}
-                error_d = {custom_id: results[custom_id]}
-                log.append(error_d)
-
-        if remove_cache:
-            self._clear_state(job_name)
-
-        return results, log
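The heart of the deleted manager is `_build_task`, which serialized one request per input text into the JSONL format the /v1/batches endpoint consumes. A sketch of a single emitted line (values illustrative; `to_strict_json_schema` is the same private helper the deleted code imported):

```python
import json
from pydantic import BaseModel
from openai.lib._pydantic import to_strict_json_schema  # private helper used above

class Entities(BaseModel):
    entities: list[str]

# Shape mirrors the deleted _build_task; one such dict per input text.
task = {
    "custom_id": "doc-42",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4.1-mini",
        "messages": [
            {"role": "system", "content": "Extract entities from the text"},
            {"role": "user", "content": "OpenAI was founded in 2015."},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": Entities.__name__, "schema": to_strict_json_schema(Entities)},
        },
    },
}
print(json.dumps(task))  # one line of the uploaded .jsonl file
```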
texttools/batch/runner.py
DELETED
@@ -1,228 +0,0 @@
-import json
-import logging
-import os
-import time
-from pathlib import Path
-from typing import Any, Type, TypeVar
-
-from dotenv import load_dotenv
-from openai import OpenAI
-from pydantic import BaseModel
-
-from ..core.exceptions import TextToolsError
-from ..core.internal_models import Str
-from .config import BatchConfig
-from .manager import BatchManager
-
-# Base Model type for output models
-T = TypeVar("T", bound=BaseModel)
-
-logger = logging.getLogger("texttools.batch_runner")
-
-
-class BatchRunner:
-    """
-    Handles running batch jobs using a batch manager and configuration.
-    """
-
-    def __init__(
-        self, config: BatchConfig = BatchConfig(), output_model: Type[T] = Str
-    ):
-        try:
-            self._config = config
-            self._system_prompt = config.system_prompt
-            self._job_name = config.job_name
-            self._input_data_path = config.input_data_path
-            self._output_data_filename = config.output_data_filename
-            self._model = config.model
-            self._output_model = output_model
-            self._manager = self._init_manager()
-            self._data = self._load_data()
-            self._parts: list[list[dict[str, Any]]] = []
-            # Map part index to job name
-            self._part_idx_to_job_name: dict[int, str] = {}
-            # Track retry attempts per part
-            self._part_attempts: dict[int, int] = {}
-            self._partition_data()
-            Path(self._config.BASE_OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
-
-        except Exception as e:
-            raise TextToolsError(f"Batch runner initialization failed: {e}")
-
-    def _init_manager(self) -> BatchManager:
-        load_dotenv()
-        api_key = os.getenv("OPENAI_API_KEY")
-        client = OpenAI(api_key=api_key)
-        return BatchManager(
-            client=client,
-            model=self._model,
-            prompt_template=self._system_prompt,
-            output_model=self._output_model,
-        )
-
-    def _load_data(self):
-        with open(self._input_data_path, "r", encoding="utf-8") as f:
-            data = json.load(f)
-        data = self._config.export_function(data)
-
-        # Ensure data is a list of dicts with 'id' and 'content' as strings
-        if not isinstance(data, list):
-            raise ValueError(
-                "Exported data must be a list of dicts with 'id' and 'content' keys"
-            )
-        for item in data:
-            if not (isinstance(item, dict) and "id" in item and "content" in item):
-                raise ValueError(
-                    f"Item must be a dict with 'id' and 'content' keys. Got: {type(item)}"
-                )
-            if not (isinstance(item["id"], str) and isinstance(item["content"], str)):
-                raise ValueError("'id' and 'content' must be strings.")
-        return data
-
-    def _partition_data(self):
-        total_length = sum(len(item["content"]) for item in self._data)
-        prompt_length = len(self._system_prompt)
-        total = total_length + (prompt_length * len(self._data))
-        calculation = total / self._config.CHARS_PER_TOKEN
-        logger.info(
-            f"Total chars: {total_length}, Prompt chars: {prompt_length}, Total: {total}, Tokens: {calculation}"
-        )
-        if calculation < self._config.MAX_TOTAL_TOKENS:
-            self._parts = [self._data]
-        else:
-            # Partition into chunks of MAX_BATCH_SIZE
-            self._parts = [
-                self._data[i : i + self._config.MAX_BATCH_SIZE]
-                for i in range(0, len(self._data), self._config.MAX_BATCH_SIZE)
-            ]
-        logger.info(f"Data split into {len(self._parts)} part(s)")
-
-    def _submit_all_jobs(self) -> None:
-        for idx, part in enumerate(self._parts):
-            if self._result_exists(idx):
-                logger.info(f"Skipping part {idx + 1}: result already exists.")
-                continue
-            part_job_name = (
-                f"{self._job_name}_part_{idx + 1}"
-                if len(self._parts) > 1
-                else self._job_name
-            )
-            # If a job with this name already exists, register and skip submitting
-            existing_job = self._manager._load_state(part_job_name)
-            if existing_job:
-                logger.info(
-                    f"Skipping part {idx + 1}: job already exists ({part_job_name})."
-                )
-                self._part_idx_to_job_name[idx] = part_job_name
-                self._part_attempts.setdefault(idx, 0)
-                continue
-
-            payload = part
-            logger.info(
-                f"Submitting job for part {idx + 1}/{len(self._parts)}: {part_job_name}"
-            )
-            self._manager.start(payload, job_name=part_job_name)
-            self._part_idx_to_job_name[idx] = part_job_name
-            self._part_attempts.setdefault(idx, 0)
-            # This is added for letting file get uploaded, before starting the next part.
-            logger.info("Uploading...")
-            time.sleep(30)
-
-    def _save_results(
-        self,
-        output_data: list[dict[str, Any]] | dict[str, Any],
-        log: list[Any],
-        part_idx: int,
-    ):
-        part_suffix = f"_part_{part_idx + 1}" if len(self._parts) > 1 else ""
-        result_path = (
-            Path(self._config.BASE_OUTPUT_DIR)
-            / f"{Path(self._output_data_filename).stem}{part_suffix}.json"
-        )
-        if not output_data:
-            logger.info("No output data to save. Skipping this part.")
-            return
-        else:
-            with open(result_path, "w", encoding="utf-8") as f:
-                json.dump(output_data, f, ensure_ascii=False, indent=4)
-        if log:
-            log_path = (
-                Path(self._config.BASE_OUTPUT_DIR)
-                / f"{Path(self._output_data_filename).stem}{part_suffix}_log.json"
-            )
-            with open(log_path, "w", encoding="utf-8") as f:
-                json.dump(log, f, ensure_ascii=False, indent=4)
-
-    def _result_exists(self, part_idx: int) -> bool:
-        part_suffix = f"_part_{part_idx + 1}" if len(self._parts) > 1 else ""
-        result_path = (
-            Path(self._config.BASE_OUTPUT_DIR)
-            / f"{Path(self._output_data_filename).stem}{part_suffix}.json"
-        )
-        return result_path.exists()
-
-    def run(self):
-        """
-        Execute the batch job processing pipeline.
-
-        Submits jobs, monitors progress, handles retries, and saves results.
-        """
-        try:
-            # Submit all jobs up-front for concurrent execution
-            self._submit_all_jobs()
-            pending_parts: set[int] = set(self._part_idx_to_job_name.keys())
-            logger.info(f"Pending parts: {sorted(pending_parts)}")
-            # Polling loop
-            while pending_parts:
-                finished_this_round: list[int] = []
-                for part_idx in list(pending_parts):
-                    job_name = self._part_idx_to_job_name[part_idx]
-                    status = self._manager.check_status(job_name=job_name)
-                    logger.info(f"Status for {job_name}: {status}")
-                    if status == "completed":
-                        logger.info(
-                            f"Job completed. Fetching results for part {part_idx + 1}..."
-                        )
-                        output_data, log = self._manager.fetch_results(
-                            job_name=job_name, remove_cache=False
-                        )
-                        output_data = self._config.import_function(output_data)
-                        self._save_results(output_data, log, part_idx)
-                        logger.info(
-                            f"Fetched and saved results for part {part_idx + 1}."
-                        )
-                        finished_this_round.append(part_idx)
-                    elif status == "failed":
-                        attempt = self._part_attempts.get(part_idx, 0) + 1
-                        self._part_attempts[part_idx] = attempt
-                        if attempt <= self._config.max_retries:
-                            logger.info(
-                                f"Job {job_name} failed (attempt {attempt}). Retrying after short backoff..."
-                            )
-                            self._manager._clear_state(job_name)
-                            time.sleep(10)
-                            payload = self._to_manager_payload(self._parts[part_idx])
-                            new_job_name = (
-                                f"{self._job_name}_part_{part_idx + 1}_retry_{attempt}"
-                            )
-                            self._manager.start(payload, job_name=new_job_name)
-                            self._part_idx_to_job_name[part_idx] = new_job_name
-                        else:
-                            logger.info(
-                                f"Job {job_name} failed after {attempt - 1} retries. Marking as failed."
-                            )
-                            finished_this_round.append(part_idx)
-                    else:
-                        # Still running or queued
-                        continue
-                # Remove finished parts
-                for part_idx in finished_this_round:
-                    pending_parts.discard(part_idx)
-                if pending_parts:
-                    logger.info(
-                        f"Waiting {self._config.poll_interval_seconds}s before next status check for parts: {sorted(pending_parts)}"
-                    )
-                    time.sleep(self._config.poll_interval_seconds)
-
-        except Exception as e:
-            raise TextToolsError(f"Batch job execution failed: {e}")
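Two details of the deleted runner are easy to miss in a straight read. First, `_partition_data` sizes jobs with a character-based token estimate; a worked example of that arithmetic (corpus numbers invented for illustration, constants from the deleted `BatchConfig`):

```python
# Defaults from the deleted BatchConfig.
CHARS_PER_TOKEN = 2.7
MAX_TOTAL_TOKENS = 2_000_000
MAX_BATCH_SIZE = 100

n_items, avg_chars, prompt_chars = 5_000, 1_200, 300   # illustrative corpus
total = n_items * avg_chars + prompt_chars * n_items   # 7,500,000 chars
est_tokens = total / CHARS_PER_TOKEN                   # ~2,777,778 tokens
if est_tokens < MAX_TOTAL_TOKENS:
    n_parts = 1
else:
    n_parts = -(-n_items // MAX_BATCH_SIZE)            # ceil division -> 50 parts
print(round(est_tokens), n_parts)                      # 2777778 50
```

Second, the deleted modules disagreed on their own payload contract: `config.export_data` documents `{"id", "text"}` items and `manager._prepare_file` reads `dic["text"]`, while `runner._load_data` validates `"id"`/`"content"` keys; the retry path also calls a `self._to_manager_payload` helper that is not defined anywhere in the file.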
{hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/WHEEL
File without changes
{hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/licenses/LICENSE
File without changes
{hamtaa_texttools-1.3.0.dist-info → hamtaa_texttools-1.3.2.dist-info}/top_level.txt
File without changes
/texttools/{batch → core/operators}/__init__.py
File without changes