hamtaa-texttools 2.0.0__py3-none-any.whl → 2.2.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/METADATA +60 -12
- {hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/RECORD +10 -9
- {hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/WHEEL +1 -1
- texttools/__init__.py +2 -1
- texttools/prompts/augment.yaml +15 -15
- texttools/tools/async_tools.py +64 -3
- texttools/tools/batch_tools.py +688 -0
- texttools/tools/sync_tools.py +64 -3
- {hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/licenses/LICENSE +0 -0
- {hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/top_level.txt +0 -0
{hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: hamtaa-texttools
-Version: 2.0.0
+Version: 2.2.0
 Summary: A high-level NLP toolkit built on top of modern LLMs.
 Author-email: Tohidi <the.mohammad.tohidi@gmail.com>, Erfan Moosavi <erfanmoosavi84@gmail.com>, Montazer <montazerh82@gmail.com>, Givechi <mohamad.m.givechi@gmail.com>, Zareshahi <a.zareshahi1377@gmail.com>
 Maintainer-email: Erfan Moosavi <erfanmoosavi84@gmail.com>, Tohidi <the.mohammad.tohidi@gmail.com>
@@ -14,6 +14,7 @@ Classifier: Operating System :: OS Independent
 Requires-Python: >=3.11
 Description-Content-Type: text/markdown
 License-File: LICENSE
+Requires-Dist: dotenv>=0.9.9
 Requires-Dist: openai>=1.97.1
 Requires-Dist: pydantic>=2.0.0
 Requires-Dist: pyyaml>=6.0
@@ -28,7 +29,10 @@ Dynamic: license-file
 
 **TextTools** is a high-level **NLP toolkit** built on top of **LLMs**.
 
-It provides
+It provides three API styles for maximum flexibility:
+- Sync API (`TheTool`) - Simple, sequential operations
+- Async API (`AsyncTheTool`) - High-performance async operations
+- Batch API (`BatchTheTool`) - Process multiple texts in parallel with built-in concurrency control
 
 It provides ready-to-use utilities for **translation, question detection, categorization, NER extraction, and more** - designed to help you integrate AI-powered text processing into your applications with minimal effort.
 
@@ -68,8 +72,8 @@ pip install -U hamtaa-texttools
 
 | Status | Meaning | Tools | Safe for Production? |
 |--------|---------|----------|-------------------|
-| **✅ Production** | Evaluated and tested. | `categorize()
-| **🧪 Experimental** | Added to the package but **not fully evaluated**. |
+| **✅ Production** | Evaluated and tested. | `categorize()`, `extract_keywords()`, `extract_entities()`, `is_question()`, `to_question()`, `merge_questions()`, `augment()`, `summarize()`, `run_custom()` | **Yes** - ready for reliable use. |
+| **🧪 Experimental** | Added to the package but **not fully evaluated**. | `translate()`, `propositionize()`, `is_fact()` | **Use with caution** |
 
 ---
 
@@ -95,6 +99,9 @@ pip install -U hamtaa-texttools
 - **`timeout: float`** → Maximum time in seconds to wait for the response before raising a timeout error.
 **Note:** This feature is only available in `AsyncTheTool`.
 
+- **`raise_on_error: bool`** → (`TheTool`/`AsyncTheTool`) Raise errors (True) or return them in the output (False). Default is True.
+
+- **`max_concurrency: int`** → (`BatchTheTool` only) Maximum number of concurrent API calls. Default is 5.
 
 ---
 
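For context, a minimal sketch of how the two new options are wired together; the endpoint URL and model name are placeholders, not values from this package:

```python
# Minimal sketch (placeholder endpoint and model) of the 2.2.0 options.
from openai import AsyncOpenAI
from texttools import BatchTheTool

client = AsyncOpenAI(base_url="your_url", api_key="your_api_key")

batch_the_tool = BatchTheTool(
    client=client,
    model="model_name",
    raise_on_error=False,  # errors are logged and returned inside ToolOutput instead of raised
    max_concurrency=10,    # at most 10 concurrent API calls
)
```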
@@ -114,13 +121,16 @@ Every tool of `TextTools` returns a `ToolOutput` object which is a BaseModel wit
 - Verify operation success with the `is_successful()` method.
 - Convert output to a dictionary with the `to_dict()` method.
 
+**Note:** For `BatchTheTool`, each method returns a `list[ToolOutput]` containing results for all input texts.
+
 ---
 
-## 🧨 Sync vs Async
-| Tool
-
-| `TheTool`
-| `AsyncTheTool` | Async | High-throughput
+## 🧨 Sync vs Async vs Batch
+| Tool | Style | Use Case | Best For |
+|------|-------|----------|----------|
+| `TheTool` | **Sync** | Simple scripts, sequential workflows | • Quick prototyping<br>• Simple scripts<br>• Sequential processing<br>• Debugging |
+| `AsyncTheTool` | **Async** | High-throughput applications, APIs, concurrent tasks | • Web APIs<br>• Concurrent operations<br>• High-performance apps<br>• Real-time processing |
+| `BatchTheTool` | **Batch** | Process multiple texts efficiently with controlled concurrency | • Bulk processing<br>• Large datasets<br>• Parallel execution<br>• Resource optimization |
 
 ---
 
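An illustrative sketch of consuming a `BatchTheTool` result list inside an async function, using only the `ToolOutput` methods described above (the texts and tool choice are placeholders):

```python
# Illustrative sketch: inspecting a list[ToolOutput] from BatchTheTool.
# Assumes `batch_the_tool` was built as in the Quick Start below.
async def report(batch_the_tool) -> None:
    results = await batch_the_tool.summarize(texts=["first text", "second text"])
    for output in results:
        if output.is_successful():   # success check from the ToolOutput docs
            print(output.to_dict())  # dictionary form of the output
        else:
            print(output.errors)     # populated when raise_on_error=False
```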
@@ -165,6 +175,35 @@ async def main():
 asyncio.run(main())
 ```
 
+## ⚡ Quick Start (Batch)
+
+```python
+import asyncio
+from openai import AsyncOpenAI
+from texttools import BatchTheTool
+
+async def main():
+    async_client = AsyncOpenAI(base_url="your_url", api_key="your_api_key")
+    model = "model_name"
+
+    batch_the_tool = BatchTheTool(client=async_client, model=model, max_concurrency=3)
+
+    categories = await batch_the_tool.categorize(
+        texts=[
+            "Climate change impacts on agriculture",
+            "Artificial intelligence in healthcare",
+            "Economic effects of remote work",
+            "Advancements in quantum computing",
+        ],
+        categories=["Science", "Technology", "Economics", "Environment"],
+    )
+
+    for i, result in enumerate(categories):
+        print(f"Text {i+1}: {result.result}")
+
+asyncio.run(main())
+```
+
 ---
 
 ## ✅ Use Cases
@@ -173,11 +212,20 @@ Use **TextTools** when you need to:
 
 - 🔍 **Classify** large datasets quickly without model training
 - 🧩 **Integrate** LLMs into production pipelines (structured outputs)
-- 📊 **Analyze** large text collections using embeddings and categorization
+- 📊 **Analyze** large text collections using embeddings and categorization
+
+---
+
+## 📄 License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 
 ---
 
 ## 🤝 Contributing
 
-
-
+We welcome contributions from the community! - see the [CONTRIBUTING](CONTRIBUTING.md) file for details.
+
+## 📚 Documentation
+
+For detailed documentation, architecture overview, and implementation details, please visit the [docs](docs) directory.
{hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/RECORD
CHANGED
@@ -1,5 +1,5 @@
-hamtaa_texttools-2.
-texttools/__init__.py,sha256=
+hamtaa_texttools-2.2.0.dist-info/licenses/LICENSE,sha256=gqxbR8wqI3utd__l3Yn6_dQ3Pou1a17W4KmydbvZGok,1084
+texttools/__init__.py,sha256=2bIFP0BdsDeOC7aQNTQjSX6OBmWQEweltUPRowwrhmg,236
 texttools/models.py,sha256=CQnO1zkKHFyqeMWrYGA4IyXQ7YYLVc3Xz1WaXbXzDLw,4634
 texttools/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 texttools/core/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
@@ -9,7 +9,7 @@ texttools/core/utils.py,sha256=jqXHXU1DWDKWhK0HHSjnjq4_TLg3FMcnRzrwTF1eqqc,9744
 texttools/core/operators/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 texttools/core/operators/async_operator.py,sha256=HOi9gUwIffJUtyp8WLNbMpxI8jnafNDrbtLl6vyPcUs,6221
 texttools/core/operators/sync_operator.py,sha256=yM14fsku-4Nf60lPUVePaB9Lu8HbGKb4ubwoizVWuYQ,6126
-texttools/prompts/augment.yaml,sha256=
+texttools/prompts/augment.yaml,sha256=uJnnP-uEafiATdBx74LiOQWX6spvwcC0J-yfhySfoAM,5423
 texttools/prompts/categorize.yaml,sha256=kN4uRPOC7q6A13bdCIox60vZZ8sgRiTtquv-kqIvTsk,1133
 texttools/prompts/extract_entities.yaml,sha256=-qe1eEvN-8nJ2_GLjeoFAPVORCPYUzsIt7UGXD485bE,648
 texttools/prompts/extract_keywords.yaml,sha256=jP74HFa4Dka01d1COStEBbdzW5onqwocwyyVsmNpECs,3276
@@ -22,9 +22,10 @@ texttools/prompts/summarize.yaml,sha256=0aKYFRDxODqOOEhSexi-hn3twLwkMFVmi7rtAifn
 texttools/prompts/to_question.yaml,sha256=n8Bn28QjvSHwPHQLwRYpZ2IsaaBsq4pK9Dp_i0xk8eg,2210
 texttools/prompts/translate.yaml,sha256=omtC-TlFYMidy8WqRe7idUtKNiK4g3IhEl-iyufOwjk,649
 texttools/tools/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-texttools/tools/async_tools.py,sha256=
-texttools/tools/
-
-hamtaa_texttools-2.
-hamtaa_texttools-2.
-hamtaa_texttools-2.
+texttools/tools/async_tools.py,sha256=_Dr5bo7RFp4f6eGNgNr549YIv5VoVpUq_ex_R5vsD2M,46087
+texttools/tools/batch_tools.py,sha256=hwWutcSWc2k79vZX5Urft1arTgHpDnnxztHZba54xtg,29899
+texttools/tools/sync_tools.py,sha256=UxXKUhnALoTCw2wpzfoBZVmhOZIGi6qv8tZAVXGIqFI,41833
+hamtaa_texttools-2.2.0.dist-info/METADATA,sha256=qnmDDJ24KJ6BI-kJ31vxaigEET0-gVM0DBBnOlL9B-M,8928
+hamtaa_texttools-2.2.0.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
+hamtaa_texttools-2.2.0.dist-info/top_level.txt,sha256=5Mh0jIxxZ5rOXHGJ6Mp-JPKviywwN0MYuH0xk5bEWqE,10
+hamtaa_texttools-2.2.0.dist-info/RECORD,,
texttools/__init__.py
CHANGED
@@ -1,5 +1,6 @@
 from .models import CategoryTree
 from .tools.async_tools import AsyncTheTool
+from .tools.batch_tools import BatchTheTool
 from .tools.sync_tools import TheTool
 
-__all__ = ["CategoryTree", "AsyncTheTool", "TheTool"]
+__all__ = ["CategoryTree", "AsyncTheTool", "TheTool", "BatchTheTool"]
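With this change the batch client is exported at the package root alongside the existing tools:

```python
# New top-level import enabled by this __init__.py change.
from texttools import AsyncTheTool, BatchTheTool, TheTool
```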
texttools/prompts/augment.yaml
CHANGED
@@ -38,25 +38,25 @@ main_template:
     "{text}"
 
   hard_negative: |
-
-
+    You are an AI assistant designed to generate high-quality training data for semantic text embedding models.
+    Your task is to create a hard-negative sample for a given "Anchor" text.
 
-
-
+    A high-quality hard-negative sample is a sentence that is topically related but semantically distinct from the Anchor.
+    It should share some context (e.g., same domain, same entities) but differ in a crucial piece of information, action, conclusion, or specific detail.
 
-
-
-
-
-
-
-
+    Instructions:
+    - Stay in General Domain: Remain in the same broad domain (e.g., religious topics), but choose a completely different subject matter.
+    - Maintain Topical Overlap: Keep the same domain, subject, or entities (e.g., people, products, concepts) as the Anchor.
+    - Alter a Key Semantic Element: Reverse a key word or condition or place or proper name that completely reverses the meaning of the sentence.
+    - Avoid Being a Paraphrase: The sentence must NOT be semantically equivalent. The core factual claim or intent must be different.
+    - Make it Challenging: The difference should be subtle enough that it requires a deep understanding of the text to identify, not just a simple keyword mismatch.
+    - Maintain Similar Length: The generated sentence should be of roughly the same length and level of detail as the Anchor.
 
-
-
+    Respond only in JSON format:
+    {{"result": "rewriteen_text"}}
 
-
-
+    Anchor Text:
+    "{text}"
 
 
 analyze_template:
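The filled-in `hard_negative` template is consumed by the `augment()` tool; a sketch of invoking it, where `the_tool` stands for an assumed `TheTool` instance built as in the README Quick Start:

```python
# Hypothetical call exercising the rewritten hard_negative template;
# `the_tool` is an assumed TheTool instance, not defined in this diff.
output = the_tool.augment(
    text="The anchor sentence to perturb.",
    mode="hard_negative",  # selects the hard_negative prompt above
)
print(output.result)  # a topically related but semantically different sentence
```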
texttools/tools/async_tools.py
CHANGED
@@ -1,3 +1,4 @@
+import logging
 from collections.abc import Callable
 from time import perf_counter
 from typing import Any, Literal
@@ -23,8 +24,11 @@ class AsyncTheTool:
         self,
         client: AsyncOpenAI,
         model: str,
+        raise_on_error: bool = True,
     ):
         self._operator = AsyncOperator(client=client, model=model)
+        self.logger = logging.getLogger(self.__class__.__name__)
+        self.raise_on_error = raise_on_error
 
     async def categorize(
         self,
@@ -43,8 +47,6 @@ class AsyncTheTool:
         """
         Classify text into given categories
 
-        Important Note: category_tree mode is EXPERIMENTAL, you can use it but it isn't reliable.
-
         Arguments:
             text: The input text
             categories: The category list / category tree
@@ -60,7 +62,6 @@ class AsyncTheTool:
 
         Returns:
             ToolOutput
-
         """
         tool_name = "categorize"
         start = perf_counter()
@@ -160,6 +161,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -241,6 +247,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -320,6 +331,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -394,6 +410,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -475,6 +496,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -554,6 +580,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -632,6 +663,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -708,6 +744,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -836,6 +877,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -914,6 +960,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -995,6 +1046,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -1075,6 +1131,11 @@
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
texttools/tools/batch_tools.py
ADDED
@@ -0,0 +1,688 @@
+import asyncio
+from typing import Any, Callable, Literal
+
+from openai import AsyncOpenAI
+
+from ..models import CategoryTree, ToolOutput
+from .async_tools import AsyncTheTool
+
+
+class BatchTheTool:
+    def __init__(
+        self,
+        client: AsyncOpenAI,
+        model: str,
+        raise_on_error: bool = True,
+        max_concurrency: int = 5,
+    ):
+        self.tool = AsyncTheTool(client, model, raise_on_error)
+        self.semaphore = asyncio.Semaphore(max_concurrency)
+
+    async def categorize(
+        self,
+        texts: list[str],
+        categories: list[str] | CategoryTree,
+        with_analysis: bool = False,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Classify texts into given categories
+
+        Arguments:
+            texts: The input texts
+            categories: The category list / category tree
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.categorize(
+                    text=text,
+                    categories=categories,
+                    with_analysis=with_analysis,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def extract_keywords(
+        self,
+        texts: list[str],
+        mode: Literal["auto", "threshold", "count"],
+        number_of_keywords: int | None = None,
+        with_analysis: bool = False,
+        output_lang: str | None = None,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Extract keywords from the texts
+
+        Arguments:
+            texts: The input texts
+            mode: auto -> decide n of keywords automatically, threshold -> decide n of keywords by a threshold, count -> takes number of keywords as the parameter
+            number_of_keywords: Must be set only when using "count" mode
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            output_lang: Forces the model to respond in a specific language
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.extract_keywords(
+                    text=text,
+                    mode=mode,
+                    number_of_keywords=number_of_keywords,
+                    with_analysis=with_analysis,
+                    output_lang=output_lang,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def extract_entities(
+        self,
+        texts: list[str],
+        entities: list[str] = ["all named entities"],
+        with_analysis: bool = False,
+        output_lang: str | None = None,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Perform Named Entity Recognition (NER) on texts
+
+        Arguments:
+            texts: The input texts
+            entities: List of entities
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            output_lang: Forces the model to respond in a specific language
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.extract_entities(
+                    text=text,
+                    entities=entities,
+                    with_analysis=with_analysis,
+                    output_lang=output_lang,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def is_question(
+        self,
+        texts: list[str],
+        with_analysis: bool = False,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Detect if the inputs are phrased as questions.
+
+        Arguments:
+            texts: The input texts
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.is_question(
+                    text=text,
+                    with_analysis=with_analysis,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def to_question(
+        self,
+        texts: list[str],
+        number_of_questions: int,
+        mode: Literal["from_text", "from_subject"],
+        with_analysis: bool = False,
+        output_lang: str | None = None,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Generate questions from the given texts / subjects
+
+        Arguments:
+            texts: The input texts
+            mode: from_text -> generate questions from an answer, from_subject -> generate questions from a subject
+            number_of_questions: Number of questions to generate
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            output_lang: Forces the model to respond in a specific language
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.to_question(
+                    text=text,
+                    number_of_questions=number_of_questions,
+                    mode=mode,
+                    with_analysis=with_analysis,
+                    output_lang=output_lang,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def merge_questions(
+        self,
+        texts_list: list[list[str]],
+        mode: Literal["simple", "stepwise"],
+        with_analysis: bool = False,
+        output_lang: str | None = None,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Merge multiple questions into a single unified question for each group
+
+        Arguments:
+            texts_list: List of groups of questions to merge
+            mode: simple -> regular question merging, stepwise -> merge questions in two steps
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            output_lang: Forces the model to respond in a specific language
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(texts: list[str]) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.merge_questions(
+                    text=texts,
+                    mode=mode,
+                    with_analysis=with_analysis,
+                    output_lang=output_lang,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts_list]
+        return await asyncio.gather(*tasks)
+
+    async def augment(
+        self,
+        texts: list[str],
+        mode: Literal["positive", "negative", "hard_negative"],
+        with_analysis: bool = False,
+        output_lang: str | None = None,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Rewrite texts in different augmentations
+
+        Arguments:
+            texts: The input texts
+            mode: positive -> positive augmentation, negative -> negative augmentation, hard_negative -> hard negative augmentation
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            output_lang: Forces the model to respond in a specific language
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.augment(
+                    text=text,
+                    mode=mode,
+                    with_analysis=with_analysis,
+                    output_lang=output_lang,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def summarize(
+        self,
+        texts: list[str],
+        with_analysis: bool = False,
+        output_lang: str | None = None,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Summarize the given texts
+
+        Arguments:
+            texts: The input texts
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            output_lang: Forces the model to respond in a specific language
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.summarize(
+                    text=text,
+                    with_analysis=with_analysis,
+                    output_lang=output_lang,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def translate(
+        self,
+        texts: list[str],
+        target_lang: str,
+        use_chunker: bool = True,
+        with_analysis: bool = False,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Translate texts between languages
+
+        Important Note: This tool is EXPERIMENTAL, you can use it but it isn't reliable.
+
+        Arguments:
+            texts: The input texts
+            target_lang: The target language for translation
+            use_chunker: Whether to use text chunker for large texts
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.translate(
+                    text=text,
+                    target_lang=target_lang,
+                    use_chunker=use_chunker,
+                    with_analysis=with_analysis,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def propositionize(
+        self,
+        texts: list[str],
+        with_analysis: bool = False,
+        output_lang: str | None = None,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Convert texts into atomic, independent, meaningful sentences
+
+        Important Note: This tool is EXPERIMENTAL, you can use it but it isn't reliable.
+
+        Arguments:
+            texts: The input texts
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            output_lang: Forces the model to respond in a specific language
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.propositionize(
+                    text=text,
+                    with_analysis=with_analysis,
+                    output_lang=output_lang,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t) for t in texts]
+        return await asyncio.gather(*tasks)
+
+    async def is_fact(
+        self,
+        texts: list[str],
+        source_texts: list[str],
+        with_analysis: bool = False,
+        output_lang: str | None = None,
+        user_prompt: str | None = None,
+        temperature: float | None = 0.0,
+        logprobs: bool = False,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Check whether statements are facts based on source texts
+
+        Important Note: This tool is EXPERIMENTAL, you can use it but it isn't reliable.
+
+        Arguments:
+            texts: The input texts (statements to check)
+            source_texts: The source texts
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            output_lang: Forces the model to respond in a specific language
+            user_prompt: Additional instructions
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(text: str, source_text: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.is_fact(
+                    text=text,
+                    source_text=source_text,
+                    with_analysis=with_analysis,
+                    output_lang=output_lang,
+                    user_prompt=user_prompt,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(t, s) for t, s in zip(texts, source_texts)]
+        return await asyncio.gather(*tasks)
+
+    async def run_custom(
+        self,
+        prompts: list[str],
+        output_model: Any,
+        with_analysis: bool = False,
+        analyze_template: str | None = None,
+        output_lang: str | None = None,
+        temperature: float | None = None,
+        logprobs: bool | None = None,
+        top_logprobs: int = 3,
+        validator: Callable[[Any], bool] | None = None,
+        max_validation_retries: int | None = None,
+        priority: int | None = None,
+        timeout: float | None = None,
+    ) -> list[ToolOutput]:
+        """
+        Custom tool that can do almost anything for multiple prompts
+
+        Arguments:
+            prompts: The user prompts
+            output_model: Pydantic BaseModel used for structured output
+            with_analysis: Adds a reasoning step before generating the final output. Note: This doubles token usage per call
+            analyze_template: The analyze template used for reasoning analysis
+            output_lang: Forces the model to respond in a specific language
+            temperature: Controls randomness
+            logprobs: Whether to return token probability information
+            top_logprobs: Number of top token alternatives to return if logprobs enabled
+            validator: Custom validation function to validate the output
+            max_validation_retries: Maximum number of retry attempts if validation fails
+            priority: Task execution priority (if enabled by vLLM and the model)
+            timeout: Maximum time in seconds to wait for the response before raising a timeout error
+
+        Returns:
+            list[ToolOutput]
+        """
+
+        async def _throttled_task(prompt: str) -> ToolOutput:
+            async with self.semaphore:
+                return await self.tool.run_custom(
+                    prompt=prompt,
+                    output_model=output_model,
+                    with_analysis=with_analysis,
+                    analyze_template=analyze_template,
+                    output_lang=output_lang,
+                    temperature=temperature,
+                    logprobs=logprobs,
+                    top_logprobs=top_logprobs,
+                    validator=validator,
+                    max_validation_retries=max_validation_retries,
+                    priority=priority,
+                    timeout=timeout,
+                )
+
+        tasks = [_throttled_task(p) for p in prompts]
+        return await asyncio.gather(*tasks)
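Every method in the new `batch_tools.py` repeats the same throttling pattern: a shared `asyncio.Semaphore` caps in-flight calls while `asyncio.gather` preserves input order. A self-contained sketch of that pattern (names here are illustrative, not part of the package):

```python
import asyncio

# Standalone sketch of the throttling pattern every BatchTheTool method uses:
# a shared semaphore bounds concurrency; gather() keeps results in input order.
async def process_all(items, worker, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _throttled(item):
        async with semaphore:  # at most max_concurrency coroutines run at once
            return await worker(item)

    return await asyncio.gather(*(_throttled(i) for i in items))


async def demo():
    async def fake_worker(item):
        await asyncio.sleep(0.1)  # stands in for an LLM API call
        return item.upper()

    print(await process_all(["a", "b", "c"], fake_worker, max_concurrency=2))


asyncio.run(demo())
```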
texttools/tools/sync_tools.py
CHANGED
@@ -1,3 +1,4 @@
+import logging
 from collections.abc import Callable
 from time import perf_counter
 from typing import Any, Literal
@@ -23,8 +24,11 @@ class TheTool:
         self,
         client: OpenAI,
         model: str,
+        raise_on_error: bool = True,
     ):
         self._operator = Operator(client=client, model=model)
+        self.logger = logging.getLogger(self.__class__.__name__)
+        self.raise_on_error = raise_on_error
 
     def categorize(
         self,
@@ -42,8 +46,6 @@ class TheTool:
         """
         Classify text into given categories
 
-        Important Note: category_tree mode is EXPERIMENTAL, you can use it but it isn't reliable.
-
         Arguments:
             text: The input text
             categories: The category list / category tree
@@ -58,7 +60,6 @@ class TheTool:
 
         Returns:
             ToolOutput
-
         """
        tool_name = "categorize"
         start = perf_counter()
@@ -152,6 +153,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -228,6 +234,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -302,6 +313,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -371,6 +387,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -447,6 +468,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -521,6 +547,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -594,6 +625,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -665,6 +701,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -785,6 +826,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -858,6 +904,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -934,6 +985,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
@@ -1009,6 +1065,11 @@ class TheTool:
             )
 
         except (PromptError, LLMError, ValidationError, TextToolsError, Exception) as e:
+            self.logger.error(str(e))
+
+            if self.raise_on_error:
+                raise
+
             metadata = ToolOutputMetadata(tool_name=tool_name)
             tool_output = ToolOutput(
                 errors=[f"{type(e).__name__}: {e}"], metadata=metadata
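Both `TheTool` and `AsyncTheTool` now log caught exceptions through `logging.getLogger(self.__class__.__name__)`, so the loggers are named after the classes. A minimal sketch of surfacing them (the handler configuration is an assumption, not prescribed by the package):

```python
import logging

# Assumed setup: route the error logs that TheTool/AsyncTheTool now emit
# to stderr. Because loggers are named after the class, each tool's output
# can be tuned individually.
logging.basicConfig(level=logging.ERROR)
logging.getLogger("TheTool").setLevel(logging.ERROR)
logging.getLogger("AsyncTheTool").setLevel(logging.ERROR)
```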
{hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/licenses/LICENSE
File without changes
{hamtaa_texttools-2.0.0.dist-info → hamtaa_texttools-2.2.0.dist-info}/top_level.txt
File without changes