hamtaa-texttools 1.1.18__py3-none-any.whl → 1.1.20__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {hamtaa_texttools-1.1.18.dist-info → hamtaa_texttools-1.1.20.dist-info}/METADATA +38 -8
- hamtaa_texttools-1.1.20.dist-info/RECORD +33 -0
- texttools/batch/batch_runner.py +6 -6
- texttools/batch/internals/batch_manager.py +6 -6
- texttools/batch/internals/utils.py +1 -4
- texttools/internals/async_operator.py +4 -6
- texttools/internals/models.py +8 -17
- texttools/internals/operator_utils.py +24 -0
- texttools/internals/prompt_loader.py +34 -6
- texttools/internals/sync_operator.py +4 -6
- texttools/internals/text_to_chunks.py +97 -0
- texttools/prompts/check_fact.yaml +19 -0
- texttools/prompts/extract_entities.yaml +1 -1
- texttools/prompts/propositionize.yaml +13 -6
- texttools/prompts/run_custom.yaml +1 -1
- texttools/prompts/text_to_question.yaml +6 -4
- texttools/tools/async_tools.py +169 -81
- texttools/tools/sync_tools.py +169 -81
- hamtaa_texttools-1.1.18.dist-info/RECORD +0 -33
- texttools/internals/formatters.py +0 -24
- texttools/prompts/detect_entity.yaml +0 -22
- {hamtaa_texttools-1.1.18.dist-info → hamtaa_texttools-1.1.20.dist-info}/WHEEL +0 -0
- {hamtaa_texttools-1.1.18.dist-info → hamtaa_texttools-1.1.20.dist-info}/licenses/LICENSE +0 -0
- {hamtaa_texttools-1.1.18.dist-info → hamtaa_texttools-1.1.20.dist-info}/top_level.txt +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hamtaa-texttools
-Version: 1.1.18
+Version: 1.1.20
 Summary: A high-level NLP toolkit built on top of modern LLMs.
 Author-email: Tohidi <the.mohammad.tohidi@gmail.com>, Montazer <montazerh82@gmail.com>, Givechi <mohamad.m.givechi@gmail.com>, MoosaviNejad <erfanmoosavi84@gmail.com>, Zareshahi <a.zareshahi1377@gmail.com>
 License: MIT License
@@ -61,29 +61,59 @@ Each tool is designed to work with structured outputs (JSON / Pydantic).
 - **`summarize()`** - Text summarization
 - **`translate()`** - Text translation between languages
 - **`propositionize()`** - Convert text into atomic, independently meaningful sentences
+- **`check_fact()`** - Check whether a statement can be concluded from a source text
 - **`run_custom()`** - Allows users to define a custom tool with an arbitrary BaseModel
 
 ---
 
+## 📊 Tool Quality Tiers
+
+| Status | Meaning | Use in Production? |
+|--------|---------|-------------------|
+| **✅ Production** | Evaluated, tested, stable. | **Yes** - ready for reliable use. |
+| **🧪 Experimental** | Added to the package but **not fully evaluated**. Functional, but quality may vary. | **Use with caution** - outputs not yet validated. |
+
+### Current Status
+**Production Tools:**
+- `categorize()` (list mode)
+- `extract_keywords()`
+- `extract_entities()`
+- `is_question()`
+- `text_to_question()`
+- `merge_questions()`
+- `rewrite()`
+- `subject_to_question()`
+- `summarize()`
+- `run_custom()` (fine in most cases)
+
+**Experimental Tools:**
+- `categorize()` (tree mode)
+- `translate()`
+- `propositionize()`
+- `check_fact()`
+- `run_custom()` (not evaluated in all scenarios)
+
+---
+
 ## ⚙️ `with_analysis`, `logprobs`, `output_lang`, `user_prompt`, `temperature`, `validator` and `priority` parameters
 
 TextTools provides several optional flags to customize LLM behavior:
 
-- **`with_analysis
+- **`with_analysis: bool`** → Adds a reasoning step before generating the final output.
 **Note:** This doubles token usage per call because it triggers an additional LLM request.
 
-- **`logprobs
+- **`logprobs: bool`** → Returns token-level probabilities for the generated output. You can also specify `top_logprobs=<N>` to get the top N alternative tokens and their probabilities.
 **Note:** This feature works only if the model supports it.
 
-- **`output_lang
+- **`output_lang: str`** → Forces the model to respond in a specific language. The model will ignore other instructions about language and respond strictly in the requested language.
 
-- **`user_prompt
+- **`user_prompt: str`** → Lets you inject a custom instruction or prompt alongside the main template. This gives you fine-grained control over how the model interprets or modifies the input text.
 
-- **`temperature
+- **`temperature: float`** → Determines how creatively the model responds. Takes a float from `0.0` to `2.0`.
 
-- **`validator (
+- **`validator: Callable (Experimental)`** → Makes TheTool validate the output with your custom validator. The validator should return a bool (True if the output is valid, False if validation fails). If validation fails, TheTool retries with a modified `temperature`. You can set `max_validation_retries=<N>` to change the number of retries.
 
-- **`priority (
+- **`priority: int (Experimental)`** → Task execution priority level. Higher values = higher priority. Affects processing order in queues.
 **Note:** This feature works only if the model and vLLM support it.
 
 **Note:** Some tools may not support all of the parameters above.
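The `validator` retry behavior described above can be sketched in plain Python. This is a hypothetical illustration only: `run_with_validation`, `generate`, and `my_validator` are made-up names, not part of the package API; the sketch just shows a bool-returning validator and a retry loop that nudges `temperature` between attempts.

```python
def my_validator(result) -> bool:
    # Custom check: True when the output is acceptable.
    return isinstance(result, str) and len(result) > 0


def run_with_validation(generate, validator, max_validation_retries=3, temperature=0.2):
    # Hypothetical retry loop: on validation failure, retry with a
    # modified temperature, up to max_validation_retries extra attempts.
    for _ in range(max_validation_retries + 1):
        result = generate(temperature)
        if validator(result):
            return result
        temperature = min(1.5, temperature + 0.3)
    raise ValueError("validation failed after retries")


# Toy "model" that only produces valid output once temperature exceeds 0.5.
print(run_with_validation(lambda t: "ok" if t > 0.5 else "", my_validator))  # ok
```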
hamtaa_texttools-1.1.20.dist-info/RECORD
ADDED
@@ -0,0 +1,33 @@
+hamtaa_texttools-1.1.20.dist-info/licenses/LICENSE,sha256=Hb2YOBKy2MJQLnyLrX37B4ZVuac8eaIcE71SvVIMOLg,1082
+texttools/__init__.py,sha256=CmCS9dEvO6061GiJ8A7gD3UAhCWHTkaID9q3Krlyq_o,311
+texttools/batch/batch_config.py,sha256=m1UgILVKjNdWE6laNbfbG4vgi4o2fEegGZbeoam6pnY,749
+texttools/batch/batch_runner.py,sha256=fVqgFDOyv8DqaNmRJQjt75wOxXTKPstisHycpt0LwcM,10026
+texttools/batch/internals/batch_manager.py,sha256=6HfsexU0PHGGBH7HKReZ-CQxaQI9DXYKAPsFXxovb_I,8740
+texttools/batch/internals/utils.py,sha256=8uNqvPHkEDFpiPp2Nyu-1nP4R-Tq8FwuSGMNSjcBogY,348
+texttools/internals/async_operator.py,sha256=rgQuvT-hl53stClVojso9FgmKhK98nm_2Cdl5WROMoc,8399
+texttools/internals/exceptions.py,sha256=h_yp_5i_5IfmqTBQ4S6ZOISrrliJBQ3HTEAjwJXrplk,495
+texttools/internals/models.py,sha256=Ro8d875_xjMHwwJIz3D_-VWxQ2WOcLLIKxsleSvqPDE,5659
+texttools/internals/operator_utils.py,sha256=jphe1DvYgdpRprZFv23ghh3bNYETvVTQ8ZZ8HPwwRVo,2759
+texttools/internals/prompt_loader.py,sha256=i4OxcVJTjHFKPSoC-DWZUM3Vf8ye_vbD7b6t3N2qB08,3972
+texttools/internals/sync_operator.py,sha256=Q0I2LlGF8MEPUCnwOOErOlxF3gjNxEvNS3oM9V6jazE,8296
+texttools/internals/text_to_chunks.py,sha256=vY3odhgCZK4E44k_SGlLoSiKkdN0ib6-lQAsPcplAHA,3843
+texttools/prompts/README.md,sha256=-5YO93CN93QLifqZpUeUnCOCBbDiOTV-cFQeJ7Gg0I4,1377
+texttools/prompts/categorize.yaml,sha256=F7VezB25B_sT5yoC25ezODBddkuDD5lUHKetSpx9FKI,2743
+texttools/prompts/check_fact.yaml,sha256=5kpBjmfZxgp81Owc8-Pd0U8-cZowFGRdYlGTFQLYQ9o,702
+texttools/prompts/extract_entities.yaml,sha256=56N3iFH1KbGLqloYBLOW-12SchVayqbHgaQ5-8JTbeY,610
+texttools/prompts/extract_keywords.yaml,sha256=Vj4Tt3vT6LtpOo_iBZPo9oWI50oVdPGXe5i8yDR8ex4,3177
+texttools/prompts/is_question.yaml,sha256=d0-vKRbXWkxvO64ikvxRjEmpAXGpCYIPGhgexvPPjws,471
+texttools/prompts/merge_questions.yaml,sha256=0J85GvTirZB4ELwH3sk8ub_WcqqpYf6PrMKr3djlZeo,1792
+texttools/prompts/propositionize.yaml,sha256=kdj-UxPOYcLSTLF7cWARDxxTxSFB0qRBaRujdThPDxw,1380
+texttools/prompts/rewrite.yaml,sha256=LO7He_IA3MZKz8a-LxH9DHJpOjpYwaYN1pbjp1Y0tFo,5392
+texttools/prompts/run_custom.yaml,sha256=6oiMYOo_WctVbOmE01wZzI1ra7nFDMJzceTTtnGdmOA,126
+texttools/prompts/subject_to_question.yaml,sha256=C7x7rNNm6U_ZG9HOn6zuzYOtvJUZ2skuWbL1-aYdd3E,1147
+texttools/prompts/summarize.yaml,sha256=o6rxGPfWtZd61Duvm8NVvCJqfq73b-wAuMSKR6UYUqY,459
+texttools/prompts/text_to_question.yaml,sha256=fnxDpUnnOEmK0yyFU5F4ItBqCnegoUCXhSTFjiTy18Y,1005
+texttools/prompts/translate.yaml,sha256=mGT2uBCei6uucWqVbs4silk-UV060v3G0jnt0P6sr50,634
+texttools/tools/async_tools.py,sha256=0ubBHyx3ii1MOwo8X_ZMfqaH43y07DaF_rk7s7DPjVA,52331
+texttools/tools/sync_tools.py,sha256=7ZY2Cs7W0y3m8D9I70Tqk9pMisIQeIzdAWwOh1A1hRc,52137
+hamtaa_texttools-1.1.20.dist-info/METADATA,sha256=s52LtRd3UQOcmNgOJ_riHjMgMC9pS2mmks2Ys0zWhXk,10610
+hamtaa_texttools-1.1.20.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+hamtaa_texttools-1.1.20.dist-info/top_level.txt,sha256=5Mh0jIxxZ5rOXHGJ6Mp-JPKviywwN0MYuH0xk5bEWqE,10
+hamtaa_texttools-1.1.20.dist-info/RECORD,,
texttools/batch/batch_runner.py
CHANGED
@@ -2,7 +2,7 @@ import json
 import os
 import time
 from pathlib import Path
-from typing import
+from typing import Type, TypeVar
 import logging
 
 from dotenv import load_dotenv
@@ -11,7 +11,7 @@ from pydantic import BaseModel
 
 from texttools.batch.internals.batch_manager import BatchManager
 from texttools.batch.batch_config import BatchConfig
-from texttools.internals.models import
+from texttools.internals.models import Str
 from texttools.internals.exceptions import TextToolsError, ConfigurationError
 
 # Base Model type for output models
@@ -26,7 +26,7 @@ class BatchJobRunner:
     """
 
     def __init__(
-        self, config: BatchConfig = BatchConfig(), output_model: Type[T] =
+        self, config: BatchConfig = BatchConfig(), output_model: Type[T] = Str
    ):
         try:
             self._config = config
@@ -38,7 +38,7 @@ class BatchJobRunner:
         self._output_model = output_model
         self._manager = self._init_manager()
         self._data = self._load_data()
-        self._parts: list[list[dict[str,
+        self._parts: list[list[dict[str, object]]] = []
         # Map part index to job name
         self._part_idx_to_job_name: dict[int, str] = {}
         # Track retry attempts per part
@@ -130,8 +130,8 @@ class BatchJobRunner:
 
     def _save_results(
         self,
-        output_data: list[dict[str,
-        log: list[
+        output_data: list[dict[str, object]] | dict[str, object],
+        log: list[object],
         part_idx: int,
     ):
         part_suffix = f"_part_{part_idx + 1}" if len(self._parts) > 1 else ""
texttools/batch/internals/batch_manager.py
CHANGED
@@ -1,7 +1,7 @@
 import json
 import uuid
 from pathlib import Path
-from typing import
+from typing import Type, TypeVar
 import logging
 
 from pydantic import BaseModel
@@ -31,7 +31,7 @@ class BatchManager:
         prompt_template: str,
         state_dir: Path = Path(".batch_jobs"),
         custom_json_schema_obj_str: dict | None = None,
-        **client_kwargs:
+        **client_kwargs: object,
     ):
         self._client = client
         self._model = model
@@ -51,7 +51,7 @@ class BatchManager:
     def _state_file(self, job_name: str) -> Path:
         return self._state_dir / f"{job_name}.json"
 
-    def _load_state(self, job_name: str) -> list[dict[str,
+    def _load_state(self, job_name: str) -> list[dict[str, object]]:
         """
         Loads the state (job information) from the state file for the given job name.
         Returns an empty list if the state file does not exist.
@@ -62,7 +62,7 @@ class BatchManager:
                 return json.load(f)
         return []
 
-    def _save_state(self, job_name: str, jobs: list[dict[str,
+    def _save_state(self, job_name: str, jobs: list[dict[str, object]]) -> None:
         """
         Saves the job state to the state file for the given job name.
         """
@@ -77,11 +77,11 @@ class BatchManager:
         if path.exists():
             path.unlink()
 
-    def _build_task(self, text: str, idx: str) -> dict[str,
+    def _build_task(self, text: str, idx: str) -> dict[str, object]:
         """
         Builds a single task dictionary for the batch job, including the prompt, model, and response format configuration.
         """
-        response_format_config: dict[str,
+        response_format_config: dict[str, object]
 
         if self._custom_json_schema_obj_str:
             response_format_config = {
texttools/batch/internals/utils.py
CHANGED
@@ -1,6 +1,3 @@
-from typing import Any
-
-
 def export_data(data) -> list[dict[str, str]]:
     """
     Produces a structure of the following form from an initial data structure:
@@ -9,7 +6,7 @@ def export_data(data) -> list[dict[str, str]]:
     return data
 
 
-def import_data(data) ->
+def import_data(data) -> object:
     """
     Takes the output and adds and aggregates it to the original structure.
     """
texttools/internals/async_operator.py
CHANGED
@@ -1,4 +1,4 @@
-from typing import
+from typing import TypeVar, Type
 from collections.abc import Callable
 import logging
 
@@ -7,7 +7,6 @@ from pydantic import BaseModel
 
 from texttools.internals.models import ToolOutput
 from texttools.internals.operator_utils import OperatorUtils
-from texttools.internals.formatters import Formatter
 from texttools.internals.prompt_loader import PromptLoader
 from texttools.internals.exceptions import (
     TextToolsError,
@@ -77,7 +76,7 @@ class AsyncOperator:
         logprobs: bool = False,
         top_logprobs: int = 3,
         priority: int | None = 0,
-    ) -> tuple[T,
+    ) -> tuple[T, object]:
         """
         Parses a chat completion using OpenAI's structured output format.
         Returns both the parsed object and the raw completion for logprobs.
@@ -124,7 +123,7 @@ class AsyncOperator:
         temperature: float,
         logprobs: bool,
         top_logprobs: int | None,
-        validator: Callable[[
+        validator: Callable[[object], bool] | None,
         max_validation_retries: int | None,
         # Internal parameters
         prompt_file: str,
@@ -138,7 +137,6 @@ class AsyncOperator:
         """
         try:
             prompt_loader = PromptLoader()
-            formatter = Formatter()
             output = ToolOutput()
 
             # Prompt configs contain two keys: main_template and analyze template, both are string
@@ -177,7 +175,7 @@ class AsyncOperator:
                 OperatorUtils.build_user_message(prompt_configs["main_template"])
             )
 
-            messages =
+            messages = OperatorUtils.user_merge_format(messages)
 
             if logprobs and (not isinstance(top_logprobs, int) or top_logprobs < 2):
                 raise ValueError("top_logprobs should be an integer greater than 1")
texttools/internals/models.py
CHANGED
@@ -1,12 +1,12 @@
 from datetime import datetime
-from typing import Type,
+from typing import Type, Literal
 
 from pydantic import BaseModel, Field, create_model
 
 
 class ToolOutput(BaseModel):
-    result:
-    logprobs: list[dict[str,
+    result: object = None
+    logprobs: list[dict[str, object]] = []
     analysis: str = ""
     process: str | None = None
     processed_at: datetime = datetime.now()
@@ -22,23 +22,23 @@ class ToolOutput(BaseModel):
     """
 
 
-class
+class Str(BaseModel):
     result: str = Field(..., description="The output string", example="text")
 
 
-class
+class Bool(BaseModel):
     result: bool = Field(
         ..., description="Boolean indicating the output state", example=True
     )
 
 
-class
+class ListStr(BaseModel):
     result: list[str] = Field(
         ..., description="The output list of strings", example=["text_1", "text_2"]
     )
 
 
-class
+class ListDictStrStr(BaseModel):
     result: list[dict[str, str]] = Field(
         ...,
         description="List of dictionaries containing string key-value pairs",
@@ -46,7 +46,7 @@ class ListDictStrStrOutput(BaseModel):
     )
 
 
-class
+class ReasonListStr(BaseModel):
     reason: str = Field(..., description="Thinking process that led to the output")
     result: list[str] = Field(
         ..., description="The output list of strings", example=["text_1", "text_2"]
@@ -179,12 +179,3 @@ def create_dynamic_model(allowed_values: list[str]) -> Type[BaseModel]:
     )
 
     return CategorizerOutput
-
-
-class Entity(BaseModel):
-    text: str = Field(description="The exact text of the entity")
-    entity_type: str = Field(description="The type of the entity")
-
-
-class EntityDetectorOutput(BaseModel):
-    result: list[Entity] = Field(description="List of all extracted entities")
texttools/internals/operator_utils.py
CHANGED
@@ -52,3 +52,27 @@ class OperatorUtils:
         new_temp = base_temp + delta_temp
 
         return max(0.0, min(new_temp, 1.5))
+
+    @staticmethod
+    def user_merge_format(messages: list[dict[str, str]]) -> list[dict[str, str]]:
+        """
+        Merges consecutive user messages into a single message, separated by newlines.
+
+        This is useful for condensing a multi-turn user input into a single
+        message for the LLM. Assistant and system messages are left unchanged and
+        act as separators between user message groups.
+        """
+        merged = []
+
+        for message in messages:
+            role, content = message["role"], message["content"].strip()
+
+            # Merge with previous user turn
+            if merged and role == "user" and merged[-1]["role"] == "user":
+                merged[-1]["content"] += "\n" + content
+
+            # Otherwise, start a new turn
+            else:
+                merged.append({"role": role, "content": content})
+
+        return merged
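As a quick illustration of the new helper's behavior, here is the same merging logic run standalone (the function body is a copy of the new `user_merge_format` method, lifted out of the class for demonstration):

```python
def user_merge_format(messages: list[dict[str, str]]) -> list[dict[str, str]]:
    # Standalone copy of the new OperatorUtils.user_merge_format logic:
    # consecutive "user" messages are merged into one, joined by newlines;
    # other roles are passed through and act as separators.
    merged = []
    for message in messages:
        role, content = message["role"], message["content"].strip()
        if merged and role == "user" and merged[-1]["role"] == "user":
            merged[-1]["content"] += "\n" + content
        else:
            merged.append({"role": role, "content": content})
    return merged


msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Summarize this text. "},
    {"role": "user", "content": "Keep it short."},
]
print(user_merge_format(msgs))
```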
texttools/internals/prompt_loader.py
CHANGED
@@ -44,16 +44,44 @@ class PromptLoader:
             if self.MAIN_TEMPLATE not in data:
                 raise PromptError(f"Missing 'main_template' in {prompt_file}")
 
+            if self.ANALYZE_TEMPLATE not in data:
+                raise PromptError(f"Missing 'analyze_template' in {prompt_file}")
+
             if mode and mode not in data.get(self.MAIN_TEMPLATE, {}):
                 raise PromptError(f"Mode '{mode}' not found in {prompt_file}")
 
-
-
+            # Extract templates based on mode
+            main_template = (
+                data[self.MAIN_TEMPLATE][mode]
                 if mode and isinstance(data[self.MAIN_TEMPLATE], dict)
-                else data[self.MAIN_TEMPLATE]
-
-
-
+                else data[self.MAIN_TEMPLATE]
+            )
+
+            analyze_template = (
+                data[self.ANALYZE_TEMPLATE][mode]
+                if mode and isinstance(data[self.ANALYZE_TEMPLATE], dict)
+                else data[self.ANALYZE_TEMPLATE]
+            )
+
+            if not main_template or not main_template.strip():
+                raise PromptError(
+                    f"Empty main_template in {prompt_file}"
+                    + (f" for mode '{mode}'" if mode else "")
+                )
+
+            if (
+                not analyze_template
+                or not analyze_template.strip()
+                or analyze_template.strip() in ["{analyze_template}", "{}"]
+            ):
+                raise PromptError(
+                    "analyze_template cannot be empty"
+                    + (f" for mode '{mode}'" if mode else "")
+                )
+
+            return {
+                self.MAIN_TEMPLATE: main_template,
+                self.ANALYZE_TEMPLATE: analyze_template,
            }
 
        except yaml.YAMLError as e:
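The mode handling added above follows a simple rule: a template entry may be a plain string or a dict keyed by mode. A minimal standalone sketch of that selection (the template data here is made up for illustration, not the package's actual YAML):

```python
def pick_template(entry, mode=None):
    # A template entry is either a plain string or a dict keyed by mode,
    # mirroring the mode check in the updated PromptLoader.
    return entry[mode] if mode and isinstance(entry, dict) else entry


data = {
    "main_template": {
        "list": "Categorize into a flat list: {input}",
        "tree": "Categorize into a tree: {input}",
    },
    "analyze_template": "Analyze the input: {input}",
}

print(pick_template(data["main_template"], "tree"))
print(pick_template(data["analyze_template"], "tree"))
```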
texttools/internals/sync_operator.py
CHANGED
@@ -1,4 +1,4 @@
-from typing import
+from typing import TypeVar, Type
 from collections.abc import Callable
 import logging
 
@@ -7,7 +7,6 @@ from pydantic import BaseModel
 
 from texttools.internals.models import ToolOutput
 from texttools.internals.operator_utils import OperatorUtils
-from texttools.internals.formatters import Formatter
 from texttools.internals.prompt_loader import PromptLoader
 from texttools.internals.exceptions import (
     TextToolsError,
@@ -77,7 +76,7 @@ class Operator:
         logprobs: bool = False,
         top_logprobs: int = 3,
         priority: int | None = 0,
-    ) -> tuple[T,
+    ) -> tuple[T, object]:
         """
         Parses a chat completion using OpenAI's structured output format.
         Returns both the parsed object and the raw completion for logprobs.
@@ -122,7 +121,7 @@ class Operator:
         temperature: float,
         logprobs: bool,
         top_logprobs: int | None,
-        validator: Callable[[
+        validator: Callable[[object], bool] | None,
         max_validation_retries: int | None,
         # Internal parameters
         prompt_file: str,
@@ -136,7 +135,6 @@ class Operator:
         """
         try:
             prompt_loader = PromptLoader()
-            formatter = Formatter()
             output = ToolOutput()
 
             # Prompt configs contain two keys: main_template and analyze template, both are string
@@ -175,7 +173,7 @@ class Operator:
                 OperatorUtils.build_user_message(prompt_configs["main_template"])
             )
 
-            messages =
+            messages = OperatorUtils.user_merge_format(messages)
 
             if logprobs and (not isinstance(top_logprobs, int) or top_logprobs < 2):
                 raise ValueError("top_logprobs should be an integer greater than 1")
texttools/internals/text_to_chunks.py
ADDED
@@ -0,0 +1,97 @@
+import re
+
+
+def text_to_chunks(text: str, size: int, overlap: int) -> list[str]:
+    separators = ["\n\n", "\n", " ", ""]
+    is_separator_regex = False
+    keep_separator = True  # Equivalent to 'start'
+    length_function = len
+    strip_whitespace = True
+    chunk_size = size
+    chunk_overlap = overlap
+
+    def _split_text_with_regex(
+        text: str, separator: str, keep_separator: bool
+    ) -> list[str]:
+        if not separator:
+            return [text]
+        if not keep_separator:
+            return re.split(separator, text)
+        _splits = re.split(f"({separator})", text)
+        splits = [_splits[i] + _splits[i + 1] for i in range(1, len(_splits), 2)]
+        if len(_splits) % 2 == 0:
+            splits += [_splits[-1]]
+        return [_splits[0]] + splits if _splits[0] else splits
+
+    def _join_docs(docs: list[str], separator: str) -> str | None:
+        text = separator.join(docs)
+        if strip_whitespace:
+            text = text.strip()
+        return text if text else None
+
+    def _merge_splits(splits: list[str], separator: str) -> list[str]:
+        separator_len = length_function(separator)
+        docs = []
+        current_doc = []
+        total = 0
+        for d in splits:
+            len_ = length_function(d)
+            if total + len_ + (separator_len if current_doc else 0) > chunk_size:
+                if total > chunk_size:
+                    pass
+                if current_doc:
+                    doc = _join_docs(current_doc, separator)
+                    if doc is not None:
+                        docs.append(doc)
+                    while total > chunk_overlap or (
+                        total + len_ + (separator_len if current_doc else 0)
+                        > chunk_size
+                        and total > 0
+                    ):
+                        total -= length_function(current_doc[0]) + (
+                            separator_len if len(current_doc) > 1 else 0
+                        )
+                        current_doc = current_doc[1:]
+            current_doc.append(d)
+            total += len_ + (separator_len if len(current_doc) > 1 else 0)
+        doc = _join_docs(current_doc, separator)
+        if doc is not None:
+            docs.append(doc)
+        return docs
+
+    def _split_text(text: str, separators: list[str]) -> list[str]:
+        final_chunks = []
+        separator = separators[-1]
+        new_separators = []
+        for i, _s in enumerate(separators):
+            separator_ = _s if is_separator_regex else re.escape(_s)
+            if not _s:
+                separator = _s
+                break
+            if re.search(separator_, text):
+                separator = _s
+                new_separators = separators[i + 1 :]
+                break
+        separator_ = separator if is_separator_regex else re.escape(separator)
+        splits = _split_text_with_regex(text, separator_, keep_separator)
+        _separator = "" if keep_separator else separator
+        good_splits = []
+        for s in splits:
+            if length_function(s) < chunk_size:
+                good_splits.append(s)
+            else:
+                if good_splits:
+                    merged_text = _merge_splits(good_splits, _separator)
+                    final_chunks.extend(merged_text)
+                    good_splits = []
+                if not new_separators:
+                    final_chunks.append(s)
+                else:
+                    other_info = _split_text(s, new_separators)
+                    final_chunks.extend(other_info)
+        if good_splits:
+            merged_text = _merge_splits(good_splits, _separator)
+            final_chunks.extend(merged_text)
+        return final_chunks
+
+    return _split_text(text, separators)
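The new `text_to_chunks` module is a recursive character splitter with a size/overlap contract. The core idea of that contract can be illustrated with a much simpler sliding-window sketch; `simple_chunks` below is illustrative only and is not the package's recursive implementation:

```python
def simple_chunks(text: str, size: int, overlap: int) -> list[str]:
    # Illustrative sliding-window chunker: windows of `size` characters,
    # advancing by `size - overlap` so consecutive chunks share `overlap` chars.
    step = max(1, size - overlap)
    return [text[i:i + size] for i in range(0, len(text), step) if text[i:i + size]]


print(simple_chunks("abcdefghij", size=4, overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The real implementation differs in that it prefers to split on paragraph, line, and word boundaries before falling back to raw characters.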
texttools/prompts/check_fact.yaml
ADDED
@@ -0,0 +1,19 @@
+main_template: |
+  You are an expert in determining whether a statement can be concluded from the source text or not.
+  You must return a boolean value: True or False.
+  Return True if the statement can be concluded from the source, and False otherwise.
+  Respond only in JSON format (the output should be a boolean):
+  {{"result": True/False}}
+  The statement is:
+  {input}
+  The source text is:
+  {source_text}
+
+analyze_template: |
+  You should analyze a statement and a source text and provide a brief,
+  summarized analysis that could help in determining whether the statement
+  can be concluded from the source or not.
+  The statement is:
+  {input}
+  The source text is:
+  {source_text}
texttools/prompts/extract_entities.yaml
CHANGED
@@ -1,6 +1,6 @@
 main_template: |
   You are a Named Entity Recognition (NER) extractor.
-  Identify and extract
+  Identify and extract {entities} from the given text.
   For each entity, provide its text and a clear type.
   Respond only in JSON format:
   {{
texttools/prompts/propositionize.yaml
CHANGED
@@ -1,10 +1,17 @@
 main_template: |
-  You are an expert
-
-
-
-
-
+  You are an expert data analyst specializing in Information Extraction.
+  Your task is to extract a list of "Atomic Propositions" from the provided text.
+
+  Definition of Atomic Proposition:
+  A single, self-contained statement of fact that is concise and verifiable.
+
+  Strict Guidelines:
+  1. Remove Meta-Data: STRICTLY EXCLUDE all citations, references, URLs, source attributions (e.g., "Source: makarem.ir"), and conversational fillers (e.g., "Based on the documents...", "In conclusion...").
+  2. Resolve Context: Replace pronouns ("it", "this", "they") with the specific nouns they refer to. Each proposition must make sense in isolation.
+  3. Preserve Logic: Keep conditions attached to their facts. Do not split a rule from its condition (e.g., "If X, then Y" should be one proposition).
+  4. No Redundancy: Do not extract summary statements that merely repeat facts already listed.
+
+  Extract the atomic propositions from the following text:
   {input}
 
 analyze_template: |
texttools/prompts/text_to_question.yaml
CHANGED
@@ -1,11 +1,13 @@
 main_template: |
   You are a question generator.
-  Given the following answer, generate
-  appropriate question that this answer would directly respond to.
+  Given the following answer, generate {number_of_questions} appropriate questions that this answer would directly respond to.
   Each generated question should be independently meaningful,
   and should not use pronouns like this, that, he, or she in the question.
+  There is a `reason` key; fill it with a summarized version of your thoughts.
+  The `reason` must be less than 20 words.
+  Don't forget to fill the reason.
   Respond only in JSON format:
-  {{"result": "string"}}
+  {{"result": ["question1", "question2", ...], "reason": "string"}}
   Here is the answer:
   {input}
 
@@ -13,7 +15,7 @@ analyze_template: |
   Analyze the following answer to identify its key facts,
   main subject, and what kind of information it provides.
   Provide a brief, summarized understanding of the answer's content that will
-  help in formulating
+  help in formulating relevant and direct questions.
   Just mention the key points that were provided in the answer.
   Here is the answer:
   {input}