hamtaa-texttools 2.1.0__py3-none-any.whl → 2.3.0__py3-none-any.whl
- {hamtaa_texttools-2.1.0.dist-info → hamtaa_texttools-2.3.0.dist-info}/METADATA +75 -11
- hamtaa_texttools-2.3.0.dist-info/RECORD +31 -0
- texttools/__init__.py +2 -3
- texttools/core/__init__.py +34 -0
- texttools/core/internal_models.py +52 -0
- texttools/core/operators/__init__.py +4 -0
- texttools/core/operators/async_operator.py +11 -3
- texttools/core/operators/sync_operator.py +9 -3
- texttools/core/utils.py +33 -0
- texttools/models.py +4 -0
- texttools/prompts/augment.yaml +15 -15
- texttools/prompts/to_question.yaml +0 -2
- texttools/prompts/translate.yaml +2 -2
- texttools/tools/__init__.py +5 -0
- texttools/tools/async_tools.py +69 -19
- texttools/tools/batch_tools.py +688 -0
- texttools/tools/sync_tools.py +69 -19
- hamtaa_texttools-2.1.0.dist-info/RECORD +0 -30
- {hamtaa_texttools-2.1.0.dist-info → hamtaa_texttools-2.3.0.dist-info}/WHEEL +0 -0
- {hamtaa_texttools-2.1.0.dist-info → hamtaa_texttools-2.3.0.dist-info}/licenses/LICENSE +0 -0
- {hamtaa_texttools-2.1.0.dist-info → hamtaa_texttools-2.3.0.dist-info}/top_level.txt +0 -0
|
{hamtaa_texttools-2.1.0.dist-info → hamtaa_texttools-2.3.0.dist-info}/METADATA
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: hamtaa-texttools
|
|
3
|
-
Version: 2.
|
|
3
|
+
Version: 2.3.0
|
|
4
4
|
Summary: A high-level NLP toolkit built on top of modern LLMs.
|
|
5
5
|
Author-email: Tohidi <the.mohammad.tohidi@gmail.com>, Erfan Moosavi <erfanmoosavi84@gmail.com>, Montazer <montazerh82@gmail.com>, Givechi <mohamad.m.givechi@gmail.com>, Zareshahi <a.zareshahi1377@gmail.com>
|
|
6
6
|
Maintainer-email: Erfan Moosavi <erfanmoosavi84@gmail.com>, Tohidi <the.mohammad.tohidi@gmail.com>
|
|
@@ -17,6 +17,7 @@ License-File: LICENSE
|
|
|
17
17
|
Requires-Dist: dotenv>=0.9.9
|
|
18
18
|
Requires-Dist: openai>=1.97.1
|
|
19
19
|
Requires-Dist: pydantic>=2.0.0
|
|
20
|
+
Requires-Dist: pytest>=9.0.2
|
|
20
21
|
Requires-Dist: pyyaml>=6.0
|
|
21
22
|
Dynamic: license-file
|
|
22
23
|
|
|
@@ -29,7 +30,10 @@ Dynamic: license-file
|
|
|
29
30
|
|
|
30
31
|
**TextTools** is a high-level **NLP toolkit** built on top of **LLMs**.
|
|
31
32
|
|
|
32
|
-
It provides
|
|
33
|
+
It provides three API styles for maximum flexibility:
|
|
34
|
+
- Sync API (`TheTool`) - Simple, sequential operations
|
|
35
|
+
- Async API (`AsyncTheTool`) - High-performance async operations
|
|
36
|
+
- Batch API (`BatchTheTool`) - Process multiple texts in parallel with built-in concurrency control
|
|
33
37
|
|
|
34
38
|
It provides ready-to-use utilities for **translation, question detection, categorization, NER extraction, and more** - designed to help you integrate AI-powered text processing into your applications with minimal effort.
|
|
35
39
|
|
|
@@ -76,8 +80,6 @@ pip install -U hamtaa-texttools
|
|
|
76
80
|
|
|
77
81
|
## ⚙️ Additional Parameters
|
|
78
82
|
|
|
79
|
-
- **`raise_on_error: bool`** → (`TheTool/AsyncTheTool` parameter) Raise errors (True) or return them in output (False). Default is True.
|
|
80
|
-
|
|
81
83
|
- **`with_analysis: bool`** → Adds a reasoning step before generating the final output.
|
|
82
84
|
**Note:** This doubles token usage per call.
|
|
83
85
|
|
|
@@ -98,32 +100,49 @@ pip install -U hamtaa-texttools
|
|
|
98
100
|
- **`timeout: float`** → Maximum time in seconds to wait for the response before raising a timeout error.
|
|
99
101
|
**Note:** This feature is only available in `AsyncTheTool`.
|
|
100
102
|
|
|
103
|
+
- **`raise_on_error: bool`** → (`TheTool/AsyncTheTool`) Raise errors (True) or return them in output (False). Default is True.
|
|
104
|
+
|
|
105
|
+
- **`max_concurrency: int`** → (`BatchTheTool` only) Maximum number of concurrent API calls. Default is 5.
|
|
101
106
|
|
|
102
107
|
---
|
|
103
108
|
|
|
104
109
|
## 🧩 ToolOutput
|
|
105
110
|
|
|
106
111
|
Every tool of `TextTools` returns a `ToolOutput` object which is a BaseModel with attributes:
|
|
112
|
+
|
|
107
113
|
- **`result: Any`**
|
|
108
114
|
- **`analysis: str`**
|
|
109
115
|
- **`logprobs: list`**
|
|
110
116
|
- **`errors: list[str]`**
|
|
111
|
-
- **`ToolOutputMetadata`**
|
|
117
|
+
- **`ToolOutputMetadata`**
|
|
112
118
|
- **`tool_name: str`**
|
|
119
|
+
- **`processed_by: str`**
|
|
113
120
|
- **`processed_at: datetime`**
|
|
114
121
|
- **`execution_time: float`**
|
|
122
|
+
- **`token_usage: TokenUsage`**
|
|
123
|
+
- **`completion_usage: CompletionUsage`**
|
|
124
|
+
- **`prompt_tokens: int`**
|
|
125
|
+
- **`completion_tokens: int`**
|
|
126
|
+
- **`total_tokens: int`**
|
|
127
|
+
- **`analyze_usage: AnalyzeUsage`**
|
|
128
|
+
- **`prompt_tokens: int`**
|
|
129
|
+
- **`completion_tokens: int`**
|
|
130
|
+
- **`total_tokens: int`**
|
|
115
131
|
|
|
116
132
|
- Serialize output to JSON using the `to_json()` method.
|
|
117
133
|
- Verify operation success with the `is_successful()` method.
|
|
118
134
|
- Convert output to a dictionary with the `to_dict()` method.
|
|
119
135
|
|
|
136
|
+
**Note:** For `BatchTheTool`, each method returns a `list[ToolOutput]` containing one result per input text.
|
|
137
|
+
|
|
120
138
|
---
|
|
121
139
|
|
|
122
|
-
## 🧨 Sync vs Async
|
|
123
|
-
| Tool
|
|
124
|
-
|
|
125
|
-
| `TheTool`
|
|
126
|
-
| `AsyncTheTool` | Async | High-throughput
|
|
140
|
+
## 🧨 Sync vs Async vs Batch
|
|
141
|
+
| Tool | Style | Use Case | Best For |
|
|
142
|
+
|------|-------|----------|----------|
|
|
143
|
+
| `TheTool` | **Sync** | Simple scripts, sequential workflows | • Quick prototyping<br>• Simple scripts<br>• Sequential processing<br>• Debugging |
|
|
144
|
+
| `AsyncTheTool` | **Async** | High-throughput applications, APIs, concurrent tasks | • Web APIs<br>• Concurrent operations<br>• High-performance apps<br>• Real-time processing |
|
|
145
|
+
| `BatchTheTool` | **Batch** | Process multiple texts efficiently with controlled concurrency | • Bulk processing<br>• Large datasets<br>• Parallel execution<br>• Resource optimization |
|
|
127
146
|
|
|
128
147
|
---
|
|
129
148
|
|
|
@@ -168,6 +187,35 @@ async def main():
|
|
|
168
187
|
asyncio.run(main())
|
|
169
188
|
```
|
|
170
189
|
|
|
190
|
+
## ⚡ Quick Start (Batch)
|
|
191
|
+
|
|
192
|
+
```python
|
|
193
|
+
import asyncio
|
|
194
|
+
from openai import AsyncOpenAI
|
|
195
|
+
from texttools import BatchTheTool
|
|
196
|
+
|
|
197
|
+
async def main():
|
|
198
|
+
async_client = AsyncOpenAI(base_url="your_url", api_key="your_api_key")
|
|
199
|
+
model = "model_name"
|
|
200
|
+
|
|
201
|
+
batch_the_tool = BatchTheTool(client=async_client, model=model, max_concurrency=3)
|
|
202
|
+
|
|
203
|
+
categories = await batch_the_tool.categorize(
|
|
204
|
+
texts=[
|
|
205
|
+
"Climate change impacts on agriculture",
|
|
206
|
+
"Artificial intelligence in healthcare",
|
|
207
|
+
"Economic effects of remote work",
|
|
208
|
+
"Advancements in quantum computing",
|
|
209
|
+
],
|
|
210
|
+
categories=["Science", "Technology", "Economics", "Environment"],
|
|
211
|
+
)
|
|
212
|
+
|
|
213
|
+
for i, result in enumerate(categories):
|
|
214
|
+
print(f"Text {i+1}: {result.result}")
|
|
215
|
+
|
|
216
|
+
asyncio.run(main())
|
|
217
|
+
```
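
The `max_concurrency` parameter above caps how many API calls run at once. The body of `batch_tools.py` is not shown in this diff, so the following is only a minimal sketch of how such a cap is commonly built with `asyncio.Semaphore`. The names `run_bounded` and `fake_categorize` are hypothetical and not part of TextTools.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency=5):
    """Run worker(item) for every item, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # Each task waits here until one of the max_concurrency slots is free.
        async with semaphore:
            return await worker(item)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


# Hypothetical stand-in for a single LLM call; it also records peak concurrency.
state = {"active": 0, "peak": 0}


async def fake_categorize(text):
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # simulate network latency
    state["active"] -= 1
    return text.upper()


results = asyncio.run(
    run_bounded(["a", "b", "c", "d"], fake_categorize, max_concurrency=3)
)
print(results, "peak concurrency:", state["peak"])
```

A semaphore keeps result order stable and bounds the load on the API without splitting the input into fixed-size chunks. Whether `BatchTheTool` takes this approach is an assumption.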
|
|
218
|
+
|
|
171
219
|
---
|
|
172
220
|
|
|
173
221
|
## ✅ Use Cases
|
|
@@ -176,4 +224,20 @@ Use **TextTools** when you need to:
|
|
|
176
224
|
|
|
177
225
|
- 🔍 **Classify** large datasets quickly without model training
|
|
178
226
|
- 🧩 **Integrate** LLMs into production pipelines (structured outputs)
|
|
179
|
-
- 📊 **Analyze** large text collections using embeddings and categorization
|
|
227
|
+
- 📊 **Analyze** large text collections using embeddings and categorization
|
|
228
|
+
|
|
229
|
+
---
|
|
230
|
+
|
|
231
|
+
## 📄 License
|
|
232
|
+
|
|
233
|
+
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
|
234
|
+
|
|
235
|
+
---
|
|
236
|
+
|
|
237
|
+
## 🤝 Contributing
|
|
238
|
+
|
|
239
|
+
We welcome contributions from the community! See the [CONTRIBUTING](CONTRIBUTING.md) file for details.
|
|
240
|
+
|
|
241
|
+
## 📚 Documentation
|
|
242
|
+
|
|
243
|
+
For detailed documentation, architecture overview, and implementation details, please visit the [docs](docs) directory.
|
|
@@ -0,0 +1,31 @@
|
|
|
1
|
+
hamtaa_texttools-2.3.0.dist-info/licenses/LICENSE,sha256=gqxbR8wqI3utd__l3Yn6_dQ3Pou1a17W4KmydbvZGok,1084
|
|
2
|
+
texttools/__init__.py,sha256=c7L4bv0vxwpkZW2XHAnhlK_aVdM6CVZLAA0rmkSfIao,163
|
|
3
|
+
texttools/models.py,sha256=_CLKOij2XvKPzQK2wyfNoBaZo0FEkF584Hdv2BHznLU,4746
|
|
4
|
+
texttools/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
|
|
5
|
+
texttools/core/__init__.py,sha256=9ymBg2lGljm6wriYt4_-T5HOK8-2hHUwsC_bGdlOCsk,720
|
|
6
|
+
texttools/core/exceptions.py,sha256=6SDjUL1rmd3ngzD3ytF4LyTRj3bQMSFR9ECrLoqXXHw,395
|
|
7
|
+
texttools/core/internal_models.py,sha256=ZauSJWWILnpDEPtAPLvrq49TMKs1BS2znZHD3FeKsRM,3774
|
|
8
|
+
texttools/core/utils.py,sha256=sQTOJmW_6EISEaUI-NWLRymAP1fvA7pZ-RzJRyQveu8,11169
|
|
9
|
+
texttools/core/operators/__init__.py,sha256=PFYEVwqT1mgVc5wok9d_jj_A-3EyOg3-Ssyi7kwkuPk,123
|
|
10
|
+
texttools/core/operators/async_operator.py,sha256=kBiQgu8TEG46AhjlEPysG7EBm4F3xanbBpYIyZqtdC8,6501
|
|
11
|
+
texttools/core/operators/sync_operator.py,sha256=5puFQrBErk4biZ2sG2ivq2AYCqP4tSdppTJUEKEOwUo,6366
|
|
12
|
+
texttools/prompts/augment.yaml,sha256=uJnnP-uEafiATdBx74LiOQWX6spvwcC0J-yfhySfoAM,5423
|
|
13
|
+
texttools/prompts/categorize.yaml,sha256=kN4uRPOC7q6A13bdCIox60vZZ8sgRiTtquv-kqIvTsk,1133
|
|
14
|
+
texttools/prompts/extract_entities.yaml,sha256=-qe1eEvN-8nJ2_GLjeoFAPVORCPYUzsIt7UGXD485bE,648
|
|
15
|
+
texttools/prompts/extract_keywords.yaml,sha256=jP74HFa4Dka01d1COStEBbdzW5onqwocwyyVsmNpECs,3276
|
|
16
|
+
texttools/prompts/is_fact.yaml,sha256=kqF527DEdnlL3MG5tF1Z3ci_sRxmGv7dgNR2SuElq4Y,719
|
|
17
|
+
texttools/prompts/is_question.yaml,sha256=C-ynlt0qHpUM4BAIh0oI7UJ5BxCNU9-GR9T5864jeto,496
|
|
18
|
+
texttools/prompts/merge_questions.yaml,sha256=zgZs8BcwseZy1GsD_DvVGtw0yuCCc6xsK8VDmuHI2V0,1844
|
|
19
|
+
texttools/prompts/propositionize.yaml,sha256=xTw3HQrxtxoMpkf8a9is0uZZ0AG4IDNfh7XE0aVlNso,1441
|
|
20
|
+
texttools/prompts/run_custom.yaml,sha256=hSfR4BMJNUo9nP_AodPU7YTnhR-X_G-W7Pz0ROQzoI0,133
|
|
21
|
+
texttools/prompts/summarize.yaml,sha256=0aKYFRDxODqOOEhSexi-hn3twLwkMFVmi7rtAifnCuA,464
|
|
22
|
+
texttools/prompts/to_question.yaml,sha256=lv4YouHE6yv594-RbO8fysH9aAz_mSms3PpbHzYzQmc,2118
|
|
23
|
+
texttools/prompts/translate.yaml,sha256=FVYhx9GamYyCf7QGDmmhRi_N-H0SwIma8jx_yT6TKNY,659
|
|
24
|
+
texttools/tools/__init__.py,sha256=xsFiKJq66HrgPebssYm5fxV2AyEjbz5urfAuNUsz8oE,168
|
|
25
|
+
texttools/tools/async_tools.py,sha256=2xKlzs--V01JguNNYY9GDcuc43oCJHUNYXTkuEOzo1M,47941
|
|
26
|
+
texttools/tools/batch_tools.py,sha256=hwWutcSWc2k79vZX5Urft1arTgHpDnnxztHZba54xtg,29899
|
|
27
|
+
texttools/tools/sync_tools.py,sha256=tyD-aq7x1yHvfLDnClNUiwqT_ZxHHeq2SNnfv5zVoBo,43688
|
|
28
|
+
hamtaa_texttools-2.3.0.dist-info/METADATA,sha256=ghPfIzci1KJpjKnOcTNQTKo5YvmU65a0bTLHXAhHmYw,9370
|
|
29
|
+
hamtaa_texttools-2.3.0.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
|
|
30
|
+
hamtaa_texttools-2.3.0.dist-info/top_level.txt,sha256=5Mh0jIxxZ5rOXHGJ6Mp-JPKviywwN0MYuH0xk5bEWqE,10
|
|
31
|
+
hamtaa_texttools-2.3.0.dist-info/RECORD,,
|
texttools/__init__.py
CHANGED
|
@@ -1,5 +1,4 @@
|
|
|
1
1
|
from .models import CategoryTree
|
|
2
|
-
from .tools
|
|
3
|
-
from .tools.sync_tools import TheTool
|
|
2
|
+
from .tools import AsyncTheTool, BatchTheTool, TheTool
|
|
4
3
|
|
|
5
|
-
__all__ = ["CategoryTree", "AsyncTheTool", "TheTool"]
|
|
4
|
+
__all__ = ["CategoryTree", "AsyncTheTool", "BatchTheTool", "TheTool"]
|
texttools/core/__init__.py
CHANGED
|
@@ -0,0 +1,34 @@
|
|
|
1
|
+
from .exceptions import LLMError, PromptError, TextToolsError, ValidationError
|
|
2
|
+
from .internal_models import (
|
|
3
|
+
Bool,
|
|
4
|
+
ListDictStrStr,
|
|
5
|
+
ListStr,
|
|
6
|
+
ReasonListStr,
|
|
7
|
+
Str,
|
|
8
|
+
TokenUsage,
|
|
9
|
+
create_dynamic_model,
|
|
10
|
+
)
|
|
11
|
+
from .operators import AsyncOperator, Operator
|
|
12
|
+
from .utils import OperatorUtils, TheToolUtils
|
|
13
|
+
|
|
14
|
+
__all__ = [
|
|
15
|
+
# Exceptions
|
|
16
|
+
"LLMError",
|
|
17
|
+
"PromptError",
|
|
18
|
+
"TextToolsError",
|
|
19
|
+
"ValidationError",
|
|
20
|
+
# Internal models
|
|
21
|
+
"Bool",
|
|
22
|
+
"ListDictStrStr",
|
|
23
|
+
"ListStr",
|
|
24
|
+
"ReasonListStr",
|
|
25
|
+
"Str",
|
|
26
|
+
"TokenUsage",
|
|
27
|
+
"create_dynamic_model",
|
|
28
|
+
# Operators
|
|
29
|
+
"AsyncOperator",
|
|
30
|
+
"Operator",
|
|
31
|
+
# Utils
|
|
32
|
+
"OperatorUtils",
|
|
33
|
+
"TheToolUtils",
|
|
34
|
+
]
|
texttools/core/internal_models.py
CHANGED
|
@@ -1,12 +1,64 @@
|
|
|
1
|
+
from __future__ import annotations
|
|
2
|
+
|
|
1
3
|
from typing import Any, Literal
|
|
2
4
|
|
|
3
5
|
from pydantic import BaseModel, Field, create_model
|
|
4
6
|
|
|
5
7
|
|
|
8
|
+
class CompletionUsage(BaseModel):
|
|
9
|
+
prompt_tokens: int = 0
|
|
10
|
+
completion_tokens: int = 0
|
|
11
|
+
total_tokens: int = 0
|
|
12
|
+
|
|
13
|
+
|
|
14
|
+
class AnalyzeUsage(BaseModel):
|
|
15
|
+
prompt_tokens: int = 0
|
|
16
|
+
completion_tokens: int = 0
|
|
17
|
+
total_tokens: int = 0
|
|
18
|
+
|
|
19
|
+
|
|
20
|
+
class TokenUsage(BaseModel):
|
|
21
|
+
completion_usage: CompletionUsage = CompletionUsage()
|
|
22
|
+
analyze_usage: AnalyzeUsage = AnalyzeUsage()
|
|
23
|
+
total_tokens: int = 0
|
|
24
|
+
|
|
25
|
+
def __add__(self, other: TokenUsage) -> TokenUsage:
|
|
26
|
+
new_completion_usage = CompletionUsage(
|
|
27
|
+
prompt_tokens=self.completion_usage.prompt_tokens
|
|
28
|
+
+ other.completion_usage.prompt_tokens,
|
|
29
|
+
completion_tokens=self.completion_usage.completion_tokens
|
|
30
|
+
+ other.completion_usage.completion_tokens,
|
|
31
|
+
total_tokens=self.completion_usage.total_tokens
|
|
32
|
+
+ other.completion_usage.total_tokens,
|
|
33
|
+
)
|
|
34
|
+
new_analyze_usage = AnalyzeUsage(
|
|
35
|
+
prompt_tokens=self.analyze_usage.prompt_tokens
|
|
36
|
+
+ other.analyze_usage.prompt_tokens,
|
|
37
|
+
completion_tokens=self.analyze_usage.completion_tokens
|
|
38
|
+
+ other.analyze_usage.completion_tokens,
|
|
39
|
+
total_tokens=self.analyze_usage.total_tokens
|
|
40
|
+
+ other.analyze_usage.total_tokens,
|
|
41
|
+
)
|
|
42
|
+
total_tokens = (
|
|
43
|
+
new_completion_usage.total_tokens + new_analyze_usage.total_tokens
|
|
44
|
+
)
|
|
45
|
+
|
|
46
|
+
return TokenUsage(
|
|
47
|
+
completion_usage=new_completion_usage,
|
|
48
|
+
analyze_usage=new_analyze_usage,
|
|
49
|
+
total_tokens=total_tokens,
|
|
50
|
+
)
|
|
51
|
+
|
|
52
|
+
|
|
6
53
|
class OperatorOutput(BaseModel):
|
|
7
54
|
result: Any
|
|
8
55
|
analysis: str | None
|
|
9
56
|
logprobs: list[dict[str, Any]] | None
|
|
57
|
+
token_usage: TokenUsage | None = None
|
|
58
|
+
prompt_tokens: int | None = None
|
|
59
|
+
completion_tokens: int | None = None
|
|
60
|
+
analysis_tokens: int | None = None
|
|
61
|
+
total_tokens: int | None = None
|
|
10
62
|
|
|
11
63
|
|
|
12
64
|
class Str(BaseModel):
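
The `TokenUsage.__add__` method added in this hunk lets usage records be summed across calls. The sketch below mirrors that logic with stdlib dataclasses instead of pydantic so it runs without dependencies. The single `Usage` class stands in for both `CompletionUsage` and `AnalyzeUsage`, which is a simplification made here and not the package's layout.

```python
from dataclasses import dataclass, field


@dataclass
class Usage:
    # Stand-in for both CompletionUsage and AnalyzeUsage (same three fields).
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    def __add__(self, other: "Usage") -> "Usage":
        return Usage(
            self.prompt_tokens + other.prompt_tokens,
            self.completion_tokens + other.completion_tokens,
            self.total_tokens + other.total_tokens,
        )


@dataclass
class TokenUsage:
    completion_usage: Usage = field(default_factory=Usage)
    analyze_usage: Usage = field(default_factory=Usage)
    total_tokens: int = 0

    def __add__(self, other: "TokenUsage") -> "TokenUsage":
        completion = self.completion_usage + other.completion_usage
        analyze = self.analyze_usage + other.analyze_usage
        # As in the diff, the grand total is recomputed from the two sub-totals.
        return TokenUsage(
            completion, analyze, completion.total_tokens + analyze.total_tokens
        )


first = TokenUsage(Usage(10, 5, 15), Usage(20, 8, 28), 43)
second = TokenUsage(Usage(7, 3, 10), Usage(), 10)

# sum() works because the start value is itself a TokenUsage.
combined = sum([first, second], TokenUsage())
print(combined.total_tokens)
```

Summing this way is convenient for totalling the per-text usage of a batch run, assuming each output's metadata carries a populated `token_usage`.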
|
texttools/core/operators/async_operator.py
CHANGED
|
@@ -18,7 +18,9 @@ class AsyncOperator:
|
|
|
18
18
|
self._client = client
|
|
19
19
|
self._model = model
|
|
20
20
|
|
|
21
|
-
async def _analyze_completion(
|
|
21
|
+
async def _analyze_completion(
|
|
22
|
+
self, analyze_message: list[dict[str, str]]
|
|
23
|
+
) -> tuple[str, Any]:
|
|
22
24
|
try:
|
|
23
25
|
completion = await self._client.chat.completions.create(
|
|
24
26
|
model=self._model,
|
|
@@ -33,7 +35,7 @@ class AsyncOperator:
|
|
|
33
35
|
if not analysis:
|
|
34
36
|
raise LLMError("Empty analysis response")
|
|
35
37
|
|
|
36
|
-
return analysis
|
|
38
|
+
return analysis, completion
|
|
37
39
|
|
|
38
40
|
except Exception as e:
|
|
39
41
|
if isinstance(e, (PromptError, LLMError)):
|
|
@@ -116,12 +118,15 @@ class AsyncOperator:
|
|
|
116
118
|
)
|
|
117
119
|
|
|
118
120
|
analysis: str | None = None
|
|
121
|
+
analyze_completion: Any = None
|
|
119
122
|
|
|
120
123
|
if with_analysis:
|
|
121
124
|
analyze_message = OperatorUtils.build_message(
|
|
122
125
|
prompt_configs["analyze_template"]
|
|
123
126
|
)
|
|
124
|
-
analysis = await self._analyze_completion(
|
|
127
|
+
analysis, analyze_completion = await self._analyze_completion(
|
|
128
|
+
analyze_message
|
|
129
|
+
)
|
|
125
130
|
|
|
126
131
|
main_prompt = OperatorUtils.build_main_prompt(
|
|
127
132
|
prompt_configs["main_template"], analysis, output_lang, user_prompt
|
|
@@ -176,6 +181,9 @@ class AsyncOperator:
|
|
|
176
181
|
logprobs=OperatorUtils.extract_logprobs(completion)
|
|
177
182
|
if logprobs
|
|
178
183
|
else None,
|
|
184
|
+
token_usage=OperatorUtils.extract_token_usage(
|
|
185
|
+
completion, analyze_completion
|
|
186
|
+
),
|
|
179
187
|
)
|
|
180
188
|
|
|
181
189
|
return operator_output
|
texttools/core/operators/sync_operator.py
CHANGED
|
@@ -18,7 +18,9 @@ class Operator:
|
|
|
18
18
|
self._client = client
|
|
19
19
|
self._model = model
|
|
20
20
|
|
|
21
|
-
def _analyze_completion(
|
|
21
|
+
def _analyze_completion(
|
|
22
|
+
self, analyze_message: list[dict[str, str]]
|
|
23
|
+
) -> tuple[str, Any]:
|
|
22
24
|
try:
|
|
23
25
|
completion = self._client.chat.completions.create(
|
|
24
26
|
model=self._model,
|
|
@@ -33,7 +35,7 @@ class Operator:
|
|
|
33
35
|
if not analysis:
|
|
34
36
|
raise LLMError("Empty analysis response")
|
|
35
37
|
|
|
36
|
-
return analysis
|
|
38
|
+
return analysis, completion
|
|
37
39
|
|
|
38
40
|
except Exception as e:
|
|
39
41
|
if isinstance(e, (PromptError, LLMError)):
|
|
@@ -114,12 +116,13 @@ class Operator:
|
|
|
114
116
|
)
|
|
115
117
|
|
|
116
118
|
analysis: str | None = None
|
|
119
|
+
analyze_completion: Any = None
|
|
117
120
|
|
|
118
121
|
if with_analysis:
|
|
119
122
|
analyze_message = OperatorUtils.build_message(
|
|
120
123
|
prompt_configs["analyze_template"]
|
|
121
124
|
)
|
|
122
|
-
analysis = self._analyze_completion(analyze_message)
|
|
125
|
+
analysis, analyze_completion = self._analyze_completion(analyze_message)
|
|
123
126
|
|
|
124
127
|
main_prompt = OperatorUtils.build_main_prompt(
|
|
125
128
|
prompt_configs["main_template"], analysis, output_lang, user_prompt
|
|
@@ -174,6 +177,9 @@ class Operator:
|
|
|
174
177
|
logprobs=OperatorUtils.extract_logprobs(completion)
|
|
175
178
|
if logprobs
|
|
176
179
|
else None,
|
|
180
|
+
token_usage=OperatorUtils.extract_token_usage(
|
|
181
|
+
completion, analyze_completion
|
|
182
|
+
),
|
|
177
183
|
)
|
|
178
184
|
|
|
179
185
|
return operator_output
|
texttools/core/utils.py
CHANGED
|
@@ -9,6 +9,7 @@ from typing import Any
|
|
|
9
9
|
import yaml
|
|
10
10
|
|
|
11
11
|
from .exceptions import PromptError
|
|
12
|
+
from .internal_models import AnalyzeUsage, CompletionUsage, TokenUsage
|
|
12
13
|
|
|
13
14
|
|
|
14
15
|
class OperatorUtils:
|
|
@@ -148,6 +149,38 @@ class OperatorUtils:
|
|
|
148
149
|
new_temp = base_temp + random.choice([-1, 1]) * random.uniform(0.1, 0.9)
|
|
149
150
|
return max(0.0, min(new_temp, 1.5))
|
|
150
151
|
|
|
152
|
+
@staticmethod
|
|
153
|
+
def extract_token_usage(completion: Any, analyze_completion: Any) -> TokenUsage:
|
|
154
|
+
completion_usage = completion.usage
|
|
155
|
+
analyze_usage = analyze_completion.usage if analyze_completion else None
|
|
156
|
+
|
|
157
|
+
completion_usage_model = CompletionUsage(
|
|
158
|
+
prompt_tokens=getattr(completion_usage, "prompt_tokens", 0),
|
|
159
|
+
completion_tokens=getattr(completion_usage, "completion_tokens", 0),
|
|
160
|
+
total_tokens=getattr(completion_usage, "total_tokens", 0),
|
|
161
|
+
)
|
|
162
|
+
analyze_usage_model = AnalyzeUsage(
|
|
163
|
+
prompt_tokens=getattr(analyze_usage, "prompt_tokens", 0),
|
|
164
|
+
completion_tokens=getattr(analyze_usage, "completion_tokens", 0),
|
|
165
|
+
total_tokens=getattr(analyze_usage, "total_tokens", 0),
|
|
166
|
+
)
|
|
167
|
+
total_analyze_tokens = (
|
|
168
|
+
analyze_usage_model.prompt_tokens + analyze_usage_model.completion_tokens
|
|
169
|
+
if analyze_completion
|
|
170
|
+
else 0
|
|
171
|
+
)
|
|
172
|
+
total_tokens = (
|
|
173
|
+
completion_usage_model.prompt_tokens
|
|
174
|
+
+ completion_usage_model.completion_tokens
|
|
175
|
+
+ total_analyze_tokens
|
|
176
|
+
)
|
|
177
|
+
|
|
178
|
+
return TokenUsage(
|
|
179
|
+
completion_usage=completion_usage_model,
|
|
180
|
+
analyze_usage=analyze_usage_model,
|
|
181
|
+
total_tokens=total_tokens,
|
|
182
|
+
)
|
|
183
|
+
|
|
151
184
|
|
|
152
185
|
class TheToolUtils:
|
|
153
186
|
"""
|
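
The `extract_token_usage` helper added above combines the usage of the main completion with that of the optional analysis completion. The sketch below reproduces that aggregation with plain dictionaries, using `types.SimpleNamespace` objects as hypothetical stand-ins for OpenAI completion responses. The function name `summarize_usage` is illustrative and not part of the package.

```python
from types import SimpleNamespace


def summarize_usage(completion, analyze_completion=None):
    """Aggregate token counts the way the diff's extract_token_usage does."""
    usage = completion.usage
    analyze_usage = analyze_completion.usage if analyze_completion else None

    def as_dict(u):
        # getattr with a default also covers u being None (no usage reported).
        return {
            "prompt_tokens": getattr(u, "prompt_tokens", 0),
            "completion_tokens": getattr(u, "completion_tokens", 0),
            "total_tokens": getattr(u, "total_tokens", 0),
        }

    completion_part = as_dict(usage)
    analyze_part = as_dict(analyze_usage)

    # The grand total is rebuilt from prompt + completion counts rather than
    # read from the API's own total_tokens field.
    total = (
        completion_part["prompt_tokens"]
        + completion_part["completion_tokens"]
        + analyze_part["prompt_tokens"]
        + analyze_part["completion_tokens"]
    )
    return {
        "completion_usage": completion_part,
        "analyze_usage": analyze_part,
        "total_tokens": total,
    }


main_call = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=120, completion_tokens=30, total_tokens=150)
)
analysis_call = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=100, completion_tokens=40, total_tokens=140)
)

with_analysis = summarize_usage(main_call, analysis_call)
without_analysis = summarize_usage(main_call)
print(with_analysis["total_tokens"], without_analysis["total_tokens"])
```

The comparison makes the cost of `with_analysis=True` visible: the analysis call adds its own prompt and completion tokens on top of the main call.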
texttools/models.py
CHANGED
|
@@ -5,11 +5,15 @@ from typing import Any
|
|
|
5
5
|
|
|
6
6
|
from pydantic import BaseModel, Field
|
|
7
7
|
|
|
8
|
+
from .core import TokenUsage
|
|
9
|
+
|
|
8
10
|
|
|
9
11
|
class ToolOutputMetadata(BaseModel):
|
|
10
12
|
tool_name: str
|
|
13
|
+
processed_by: str | None = None
|
|
11
14
|
processed_at: datetime = Field(default_factory=datetime.now)
|
|
12
15
|
execution_time: float | None = None
|
|
16
|
+
token_usage: TokenUsage | None = None
|
|
13
17
|
|
|
14
18
|
|
|
15
19
|
class ToolOutput(BaseModel):
|
texttools/prompts/augment.yaml
CHANGED
|
@@ -38,25 +38,25 @@ main_template:
|
|
|
38
38
|
"{text}"
|
|
39
39
|
|
|
40
40
|
hard_negative: |
|
|
41
|
-
|
|
42
|
-
|
|
41
|
+
You are an AI assistant designed to generate high-quality training data for semantic text embedding models.
|
|
42
|
+
Your task is to create a hard-negative sample for a given "Anchor" text.
|
|
43
43
|
|
|
44
|
-
|
|
45
|
-
|
|
44
|
+
A high-quality hard-negative sample is a sentence that is topically related but semantically distinct from the Anchor.
|
|
45
|
+
It should share some context (e.g., same domain, same entities) but differ in a crucial piece of information, action, conclusion, or specific detail.
|
|
46
46
|
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
47
|
+
Instructions:
|
|
48
|
+
- Stay in General Domain: Remain in the same broad domain (e.g., religious topics), but choose a completely different subject matter.
|
|
49
|
+
- Maintain Topical Overlap: Keep the same domain, subject, or entities (e.g., people, products, concepts) as the Anchor.
|
|
50
|
+
- Alter a Key Semantic Element: Reverse a key word or condition or place or proper name that completely reverses the meaning of the sentence.
|
|
51
|
+
- Avoid Being a Paraphrase: The sentence must NOT be semantically equivalent. The core factual claim or intent must be different.
|
|
52
|
+
- Make it Challenging: The difference should be subtle enough that it requires a deep understanding of the text to identify, not just a simple keyword mismatch.
|
|
53
|
+
- Maintain Similar Length: The generated sentence should be of roughly the same length and level of detail as the Anchor.
|
|
54
54
|
|
|
55
|
-
|
|
56
|
-
|
|
55
|
+
Respond only in JSON format:
|
|
56
|
+
{{"result": "rewritten_text"}}
|
|
57
57
|
|
|
58
|
-
|
|
59
|
-
|
|
58
|
+
Anchor Text:
|
|
59
|
+
"{text}"
|
|
60
60
|
|
|
61
61
|
|
|
62
62
|
analyze_template:
|
texttools/prompts/to_question.yaml
CHANGED
|
@@ -7,7 +7,6 @@ main_template:
|
|
|
7
7
|
and must not mention any pronouns like this, that, he or she in the question.
|
|
8
8
|
|
|
9
9
|
There is a `reason` key, fill that up with a summarized version of your thoughts.
|
|
10
|
-
The `reason` must be less than 20 words.
|
|
11
10
|
Don't forget to fill the reason.
|
|
12
11
|
|
|
13
12
|
Respond only in JSON format:
|
|
@@ -23,7 +22,6 @@ main_template:
|
|
|
23
22
|
and must not mention any pronouns like this, that, he or she in the question.
|
|
24
23
|
|
|
25
24
|
There is a `reason` key, fill that up with a summarized version of your thoughts.
|
|
26
|
-
The `reason` must be less than 20 words.
|
|
27
25
|
Don't forget to fill the reason.
|
|
28
26
|
|
|
29
27
|
Respond only in JSON format:
|
texttools/prompts/translate.yaml
CHANGED
|
@@ -3,9 +3,9 @@ main_template: |
|
|
|
3
3
|
Output only the translated text.
|
|
4
4
|
|
|
5
5
|
Respond only in JSON format:
|
|
6
|
-
{{"result": "
|
|
6
|
+
{{"result": "translated_text"}}
|
|
7
7
|
|
|
8
|
-
Don't translate proper
|
|
8
|
+
Don't translate proper names, only transliterate them to {target_lang}
|
|
9
9
|
|
|
10
10
|
Translate the following text to {target_lang}:
|
|
11
11
|
{text}
|