hamtaa-texttools 1.1.19__py3-none-any.whl → 1.1.20__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

hamtaa_texttools-1.1.20.dist-info/METADATA
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: hamtaa-texttools
- Version: 1.1.19
+ Version: 1.1.20
  Summary: A high-level NLP toolkit built on top of modern LLMs.
  Author-email: Tohidi <the.mohammad.tohidi@gmail.com>, Montazer <montazerh82@gmail.com>, Givechi <mohamad.m.givechi@gmail.com>, MoosaviNejad <erfanmoosavi84@gmail.com>, Zareshahi <a.zareshahi1377@gmail.com>
  License: MIT License
@@ -99,21 +99,21 @@ Each tool is designed to work with structured outputs (JSON / Pydantic).

  TextTools provides several optional flags to customize LLM behavior:

- - **`with_analysis (bool)`** → Adds a reasoning step before generating the final output.
+ - **`with_analysis: bool`** → Adds a reasoning step before generating the final output.
  **Note:** This doubles token usage per call because it triggers an additional LLM request.

- - **`logprobs (bool)`** → Returns token-level probabilities for the generated output. You can also specify `top_logprobs=<N>` to get the top N alternative tokens and their probabilities.
+ - **`logprobs: bool`** → Returns token-level probabilities for the generated output. You can also specify `top_logprobs=<N>` to get the top N alternative tokens and their probabilities.
  **Note:** This feature works if it's supported by the model.

- - **`output_lang (str)`** → Forces the model to respond in a specific language. The model will ignore other instructions about language and respond strictly in the requested language.
+ - **`output_lang: str`** → Forces the model to respond in a specific language. The model will ignore other instructions about language and respond strictly in the requested language.

- - **`user_prompt (str)`** → Allows you to inject a custom instruction or prompt into the model alongside the main template. This gives you fine-grained control over how the model interprets or modifies the input text.
+ - **`user_prompt: str`** → Allows you to inject a custom instruction or prompt into the model alongside the main template. This gives you fine-grained control over how the model interprets or modifies the input text.

- - **`temperature (float)`** → Determines how creative the model should respond. Takes a float number from `0.0` to `2.0`.
+ - **`temperature: float`** → Determines how creative the model should respond. Takes a float number from `0.0` to `2.0`.

- - **`validator (Callable)`** → Forces TheTool to validate the output result based on your custom validator. Validator should return a bool (True if there were no problem, False if the validation fails.) If the validator fails, TheTool will retry to get another output by modifying `temperature`. You can specify `max_validation_retries=<N>` to change the number of retries.
+ - **`validator: Callable (Experimental)`** → Forces TheTool to validate the output result based on your custom validator. Validator should return a bool (True if there were no problem, False if the validation fails.) If the validator fails, TheTool will retry to get another output by modifying `temperature`. You can specify `max_validation_retries=<N>` to change the number of retries.

- - **`priority (int)`** → Task execution priority level. Higher values = higher priority. Affects processing order in queues.
+ - **`priority: int (Experimental)`** → Task execution priority level. Higher values = higher priority. Affects processing order in queues.
  **Note:** This feature works if it's supported by the model and vLLM.

  **Note:** There might be some tools that don't support some of the parameters above.
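
The validator retry loop described for `validator` and `max_validation_retries` can be sketched in plain Python. This is a hypothetical illustration of the documented behavior only, not the package's actual internals — the `run_with_validator` name and the fixed 0.3 temperature bump are assumptions:

```python
from collections.abc import Callable


def run_with_validator(
    generate: Callable[[float], str],
    validator: Callable[[str], bool],
    temperature: float = 0.0,
    max_validation_retries: int = 3,
) -> str:
    """Regenerate with a modified temperature until the validator passes."""
    result = generate(temperature)
    for _ in range(max_validation_retries):
        if validator(result):
            return result
        # Nudge temperature (clamped) to sample a different output,
        # as the documentation above describes; 0.3 is an assumed step.
        temperature = min(temperature + 0.3, 1.5)
        result = generate(temperature)
    return result


# Usage: a fake "model" whose third sample satisfies the validator
outputs = iter(["bad", "bad", "GOOD"])
result = run_with_validator(
    generate=lambda t: next(outputs),
    validator=lambda text: text.isupper(),
)
print(result)  # GOOD
```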

hamtaa_texttools-1.1.20.dist-info/RECORD
@@ -1,20 +1,20 @@
- hamtaa_texttools-1.1.19.dist-info/licenses/LICENSE,sha256=Hb2YOBKy2MJQLnyLrX37B4ZVuac8eaIcE71SvVIMOLg,1082
+ hamtaa_texttools-1.1.20.dist-info/licenses/LICENSE,sha256=Hb2YOBKy2MJQLnyLrX37B4ZVuac8eaIcE71SvVIMOLg,1082
  texttools/__init__.py,sha256=CmCS9dEvO6061GiJ8A7gD3UAhCWHTkaID9q3Krlyq_o,311
  texttools/batch/batch_config.py,sha256=m1UgILVKjNdWE6laNbfbG4vgi4o2fEegGZbeoam6pnY,749
- texttools/batch/batch_runner.py,sha256=Tz-jec27UZBSZAXc0sxitc5XycDfzvOYl47Yqzq6Myw,10031
- texttools/batch/internals/batch_manager.py,sha256=UoBe76vmFG72qrSaGKDZf4HzkykFBkkkbL9TLfV8TuQ,8730
- texttools/batch/internals/utils.py,sha256=F1_7YlVFKhjUROAFX4m0SaP8KiZVZyHRMIIB87VUGQc,373
- texttools/internals/async_operator.py,sha256=_RfYSm_66RJ6nppzorJ4r3BHdhr8xr404QjeVvsvX4Q,8485
+ texttools/batch/batch_runner.py,sha256=fVqgFDOyv8DqaNmRJQjt75wOxXTKPstisHycpt0LwcM,10026
+ texttools/batch/internals/batch_manager.py,sha256=6HfsexU0PHGGBH7HKReZ-CQxaQI9DXYKAPsFXxovb_I,8740
+ texttools/batch/internals/utils.py,sha256=8uNqvPHkEDFpiPp2Nyu-1nP4R-Tq8FwuSGMNSjcBogY,348
+ texttools/internals/async_operator.py,sha256=rgQuvT-hl53stClVojso9FgmKhK98nm_2Cdl5WROMoc,8399
  texttools/internals/exceptions.py,sha256=h_yp_5i_5IfmqTBQ4S6ZOISrrliJBQ3HTEAjwJXrplk,495
- texttools/internals/formatters.py,sha256=tACNLP6PeoqaRpNudVxBaHA25zyWqWYPZQuYysIu88g,941
- texttools/internals/models.py,sha256=zmgdFhMCNyfc-5dtSE4jwulhltVgxYzITZRMDJBUF0A,5977
- texttools/internals/operator_utils.py,sha256=w1k0RJ_W_CRbVc_J2w337VuL-opHpHiCxfhEOwtyuOo,1856
+ texttools/internals/models.py,sha256=Ro8d875_xjMHwwJIz3D_-VWxQ2WOcLLIKxsleSvqPDE,5659
+ texttools/internals/operator_utils.py,sha256=jphe1DvYgdpRprZFv23ghh3bNYETvVTQ8ZZ8HPwwRVo,2759
  texttools/internals/prompt_loader.py,sha256=i4OxcVJTjHFKPSoC-DWZUM3Vf8ye_vbD7b6t3N2qB08,3972
- texttools/internals/sync_operator.py,sha256=7SdsNoFQxgmMrSZbUUw7SJVqyO5Xhu8dui9lm64RKsk,8382
+ texttools/internals/sync_operator.py,sha256=Q0I2LlGF8MEPUCnwOOErOlxF3gjNxEvNS3oM9V6jazE,8296
+ texttools/internals/text_to_chunks.py,sha256=vY3odhgCZK4E44k_SGlLoSiKkdN0ib6-lQAsPcplAHA,3843
  texttools/prompts/README.md,sha256=-5YO93CN93QLifqZpUeUnCOCBbDiOTV-cFQeJ7Gg0I4,1377
  texttools/prompts/categorize.yaml,sha256=F7VezB25B_sT5yoC25ezODBddkuDD5lUHKetSpx9FKI,2743
  texttools/prompts/check_fact.yaml,sha256=5kpBjmfZxgp81Owc8-Pd0U8-cZowFGRdYlGTFQLYQ9o,702
- texttools/prompts/extract_entities.yaml,sha256=KiKjeDpHaeh3JVtZ6q1pa3k4DYucUIU9WnEcRTCA-SE,651
+ texttools/prompts/extract_entities.yaml,sha256=56N3iFH1KbGLqloYBLOW-12SchVayqbHgaQ5-8JTbeY,610
  texttools/prompts/extract_keywords.yaml,sha256=Vj4Tt3vT6LtpOo_iBZPo9oWI50oVdPGXe5i8yDR8ex4,3177
  texttools/prompts/is_question.yaml,sha256=d0-vKRbXWkxvO64ikvxRjEmpAXGpCYIPGhgexvPPjws,471
  texttools/prompts/merge_questions.yaml,sha256=0J85GvTirZB4ELwH3sk8ub_WcqqpYf6PrMKr3djlZeo,1792
@@ -23,11 +23,11 @@ texttools/prompts/rewrite.yaml,sha256=LO7He_IA3MZKz8a-LxH9DHJpOjpYwaYN1pbjp1Y0tF
  texttools/prompts/run_custom.yaml,sha256=6oiMYOo_WctVbOmE01wZzI1ra7nFDMJzceTTtnGdmOA,126
  texttools/prompts/subject_to_question.yaml,sha256=C7x7rNNm6U_ZG9HOn6zuzYOtvJUZ2skuWbL1-aYdd3E,1147
  texttools/prompts/summarize.yaml,sha256=o6rxGPfWtZd61Duvm8NVvCJqfq73b-wAuMSKR6UYUqY,459
- texttools/prompts/text_to_question.yaml,sha256=UheKYpDn6iyKI8NxunHZtFpNyfCLZZe5cvkuXpurUJY,783
+ texttools/prompts/text_to_question.yaml,sha256=fnxDpUnnOEmK0yyFU5F4ItBqCnegoUCXhSTFjiTy18Y,1005
  texttools/prompts/translate.yaml,sha256=mGT2uBCei6uucWqVbs4silk-UV060v3G0jnt0P6sr50,634
- texttools/tools/async_tools.py,sha256=eNsKJqpTNL1AIM_enHvqUJYxov1Mb5ErnShZuX7oqRQ,49532
- texttools/tools/sync_tools.py,sha256=DbY7smzYnEgd1H0r-5sVW-NExJwhL23TtQ_n5ACqbBc,49344
- hamtaa_texttools-1.1.19.dist-info/METADATA,sha256=egHahU5ec3bSpRB0DK0CrT21kfmhbM6-xxJjoMp1eDU,10587
- hamtaa_texttools-1.1.19.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- hamtaa_texttools-1.1.19.dist-info/top_level.txt,sha256=5Mh0jIxxZ5rOXHGJ6Mp-JPKviywwN0MYuH0xk5bEWqE,10
- hamtaa_texttools-1.1.19.dist-info/RECORD,,
+ texttools/tools/async_tools.py,sha256=0ubBHyx3ii1MOwo8X_ZMfqaH43y07DaF_rk7s7DPjVA,52331
+ texttools/tools/sync_tools.py,sha256=7ZY2Cs7W0y3m8D9I70Tqk9pMisIQeIzdAWwOh1A1hRc,52137
+ hamtaa_texttools-1.1.20.dist-info/METADATA,sha256=s52LtRd3UQOcmNgOJ_riHjMgMC9pS2mmks2Ys0zWhXk,10610
+ hamtaa_texttools-1.1.20.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ hamtaa_texttools-1.1.20.dist-info/top_level.txt,sha256=5Mh0jIxxZ5rOXHGJ6Mp-JPKviywwN0MYuH0xk5bEWqE,10
+ hamtaa_texttools-1.1.20.dist-info/RECORD,,

texttools/batch/batch_runner.py
@@ -2,7 +2,7 @@ import json
  import os
  import time
  from pathlib import Path
- from typing import Any, Type, TypeVar
+ from typing import Type, TypeVar
  import logging

  from dotenv import load_dotenv
@@ -11,7 +11,7 @@ from pydantic import BaseModel

  from texttools.batch.internals.batch_manager import BatchManager
  from texttools.batch.batch_config import BatchConfig
- from texttools.internals.models import StrOutput
+ from texttools.internals.models import Str
  from texttools.internals.exceptions import TextToolsError, ConfigurationError

  # Base Model type for output models
@@ -26,7 +26,7 @@ class BatchJobRunner:
  """

  def __init__(
- self, config: BatchConfig = BatchConfig(), output_model: Type[T] = StrOutput
+ self, config: BatchConfig = BatchConfig(), output_model: Type[T] = Str
  ):
  try:
  self._config = config
@@ -38,7 +38,7 @@ class BatchJobRunner:
  self._output_model = output_model
  self._manager = self._init_manager()
  self._data = self._load_data()
- self._parts: list[list[dict[str, Any]]] = []
+ self._parts: list[list[dict[str, object]]] = []
  # Map part index to job name
  self._part_idx_to_job_name: dict[int, str] = {}
  # Track retry attempts per part
@@ -130,8 +130,8 @@ class BatchJobRunner:

  def _save_results(
  self,
- output_data: list[dict[str, Any]] | dict[str, Any],
- log: list[Any],
+ output_data: list[dict[str, object]] | dict[str, object],
+ log: list[object],
  part_idx: int,
  ):
  part_suffix = f"_part_{part_idx + 1}" if len(self._parts) > 1 else ""

texttools/batch/internals/batch_manager.py
@@ -1,7 +1,7 @@
  import json
  import uuid
  from pathlib import Path
- from typing import Any, Type, TypeVar
+ from typing import Type, TypeVar
  import logging

  from pydantic import BaseModel
@@ -31,7 +31,7 @@ class BatchManager:
  prompt_template: str,
  state_dir: Path = Path(".batch_jobs"),
  custom_json_schema_obj_str: dict | None = None,
- **client_kwargs: Any,
+ **client_kwargs: object,
  ):
  self._client = client
  self._model = model
@@ -51,7 +51,7 @@ class BatchManager:
  def _state_file(self, job_name: str) -> Path:
  return self._state_dir / f"{job_name}.json"

- def _load_state(self, job_name: str) -> list[dict[str, Any]]:
+ def _load_state(self, job_name: str) -> list[dict[str, object]]:
  """
  Loads the state (job information) from the state file for the given job name.
  Returns an empty list if the state file does not exist.
@@ -62,7 +62,7 @@ class BatchManager:
  return json.load(f)
  return []

- def _save_state(self, job_name: str, jobs: list[dict[str, Any]]) -> None:
+ def _save_state(self, job_name: str, jobs: list[dict[str, object]]) -> None:
  """
  Saves the job state to the state file for the given job name.
  """
@@ -77,11 +77,11 @@ class BatchManager:
  if path.exists():
  path.unlink()

- def _build_task(self, text: str, idx: str) -> dict[str, Any]:
+ def _build_task(self, text: str, idx: str) -> dict[str, object]:
  """
  Builds a single task dictionary for the batch job, including the prompt, model, and response format configuration.
  """
- response_format_config: dict[str, Any]
+ response_format_config: dict[str, object]

  if self._custom_json_schema_obj_str:
  response_format_config = {

texttools/batch/internals/utils.py
@@ -1,6 +1,3 @@
- from typing import Any
-
-
  def export_data(data) -> list[dict[str, str]]:
  """
  Produces a structure of the following form from an initial data structure:
@@ -9,7 +6,7 @@ def export_data(data) -> list[dict[str, str]]:
  return data


- def import_data(data) -> Any:
+ def import_data(data) -> object:
  """
  Takes the output and adds and aggregates it to the original structure.
  """

texttools/internals/async_operator.py
@@ -1,4 +1,4 @@
- from typing import Any, TypeVar, Type
+ from typing import TypeVar, Type
  from collections.abc import Callable
  import logging

@@ -7,7 +7,6 @@ from pydantic import BaseModel

  from texttools.internals.models import ToolOutput
  from texttools.internals.operator_utils import OperatorUtils
- from texttools.internals.formatters import Formatter
  from texttools.internals.prompt_loader import PromptLoader
  from texttools.internals.exceptions import (
  TextToolsError,
@@ -77,7 +76,7 @@ class AsyncOperator:
  logprobs: bool = False,
  top_logprobs: int = 3,
  priority: int | None = 0,
- ) -> tuple[T, Any]:
+ ) -> tuple[T, object]:
  """
  Parses a chat completion using OpenAI's structured output format.
  Returns both the parsed object and the raw completion for logprobs.
@@ -124,7 +123,7 @@ class AsyncOperator:
  temperature: float,
  logprobs: bool,
  top_logprobs: int | None,
- validator: Callable[[Any], bool] | None,
+ validator: Callable[[object], bool] | None,
  max_validation_retries: int | None,
  # Internal parameters
  prompt_file: str,
@@ -138,7 +137,6 @@ class AsyncOperator:
  """
  try:
  prompt_loader = PromptLoader()
- formatter = Formatter()
  output = ToolOutput()

  # Prompt configs contain two keys: main_template and analyze template, both are string
@@ -177,7 +175,7 @@ class AsyncOperator:
  OperatorUtils.build_user_message(prompt_configs["main_template"])
  )

- messages = formatter.user_merge_format(messages)
+ messages = OperatorUtils.user_merge_format(messages)

  if logprobs and (not isinstance(top_logprobs, int) or top_logprobs < 2):
  raise ValueError("top_logprobs should be an integer greater than 1")

texttools/internals/models.py
@@ -1,12 +1,12 @@
  from datetime import datetime
- from typing import Type, Any, Literal
+ from typing import Type, Literal

  from pydantic import BaseModel, Field, create_model


  class ToolOutput(BaseModel):
- result: Any = None
- logprobs: list[dict[str, Any]] = []
+ result: object = None
+ logprobs: list[dict[str, object]] = []
  analysis: str = ""
  process: str | None = None
  processed_at: datetime = datetime.now()
@@ -22,23 +22,23 @@ class ToolOutput(BaseModel):
  """


- class StrOutput(BaseModel):
+ class Str(BaseModel):
  result: str = Field(..., description="The output string", example="text")


- class BoolOutput(BaseModel):
+ class Bool(BaseModel):
  result: bool = Field(
  ..., description="Boolean indicating the output state", example=True
  )


- class ListStrOutput(BaseModel):
+ class ListStr(BaseModel):
  result: list[str] = Field(
  ..., description="The output list of strings", example=["text_1", "text_2"]
  )


- class ListDictStrStrOutput(BaseModel):
+ class ListDictStrStr(BaseModel):
  result: list[dict[str, str]] = Field(
  ...,
  description="List of dictionaries containing string key-value pairs",
@@ -46,7 +46,7 @@ class ListDictStrStrOutput(BaseModel):
  )


- class ReasonListStrOutput(BaseModel):
+ class ReasonListStr(BaseModel):
  reason: str = Field(..., description="Thinking process that led to the output")
  result: list[str] = Field(
  ..., description="The output list of strings", example=["text_1", "text_2"]
@@ -179,12 +179,3 @@ def create_dynamic_model(allowed_values: list[str]) -> Type[BaseModel]:
  )

  return CategorizerOutput
-
-
- class Entity(BaseModel):
- text: str = Field(description="The exact text of the entity")
- entity_type: str = Field(description="The type of the entity")
-
-
- class EntityDetectorOutput(BaseModel):
- result: list[Entity] = Field(description="List of all extracted entities")

texttools/internals/operator_utils.py
@@ -52,3 +52,27 @@ class OperatorUtils:
  new_temp = base_temp + delta_temp

  return max(0.0, min(new_temp, 1.5))
+
+ @staticmethod
+ def user_merge_format(messages: list[dict[str, str]]) -> list[dict[str, str]]:
+ """
+ Merges consecutive user messages into a single message, separated by newlines.
+
+ This is useful for condensing a multi-turn user input into a single
+ message for the LLM. Assistant and system messages are left unchanged and
+ act as separators between user message groups.
+ """
+ merged = []
+
+ for message in messages:
+ role, content = message["role"], message["content"].strip()
+
+ # Merge with previous user turn
+ if merged and role == "user" and merged[-1]["role"] == "user":
+ merged[-1]["content"] += "\n" + content
+
+ # Otherwise, start a new turn
+ else:
+ merged.append({"role": role, "content": content})
+
+ return merged
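
For reference, the merge behavior of the new `user_merge_format` helper can be exercised standalone; the function below mirrors the logic of the hunk above as a plain function:

```python
def user_merge_format(messages: list[dict[str, str]]) -> list[dict[str, str]]:
    """Merge consecutive user messages into one message, separated by newlines."""
    merged: list[dict[str, str]] = []
    for message in messages:
        role, content = message["role"], message["content"].strip()
        if merged and role == "user" and merged[-1]["role"] == "user":
            # Extend the previous user turn instead of starting a new one
            merged[-1]["content"] += "\n" + content
        else:
            merged.append({"role": role, "content": content})
    return merged


msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "First part. "},
    {"role": "user", "content": "Second part."},
]
print(user_merge_format(msgs))
# [{'role': 'system', 'content': 'You are helpful.'},
#  {'role': 'user', 'content': 'First part.\nSecond part.'}]
```

Note that non-user messages act as separators: user turns on either side of an assistant message stay distinct.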

texttools/internals/sync_operator.py
@@ -1,4 +1,4 @@
- from typing import Any, TypeVar, Type
+ from typing import TypeVar, Type
  from collections.abc import Callable
  import logging

@@ -7,7 +7,6 @@ from pydantic import BaseModel

  from texttools.internals.models import ToolOutput
  from texttools.internals.operator_utils import OperatorUtils
- from texttools.internals.formatters import Formatter
  from texttools.internals.prompt_loader import PromptLoader
  from texttools.internals.exceptions import (
  TextToolsError,
@@ -77,7 +76,7 @@ class Operator:
  logprobs: bool = False,
  top_logprobs: int = 3,
  priority: int | None = 0,
- ) -> tuple[T, Any]:
+ ) -> tuple[T, object]:
  """
  Parses a chat completion using OpenAI's structured output format.
  Returns both the parsed object and the raw completion for logprobs.
@@ -122,7 +121,7 @@ class Operator:
  temperature: float,
  logprobs: bool,
  top_logprobs: int | None,
- validator: Callable[[Any], bool] | None,
+ validator: Callable[[object], bool] | None,
  max_validation_retries: int | None,
  # Internal parameters
  prompt_file: str,
@@ -136,7 +135,6 @@ class Operator:
  """
  try:
  prompt_loader = PromptLoader()
- formatter = Formatter()
  output = ToolOutput()

  # Prompt configs contain two keys: main_template and analyze template, both are string
@@ -175,7 +173,7 @@ class Operator:
  OperatorUtils.build_user_message(prompt_configs["main_template"])
  )

- messages = formatter.user_merge_format(messages)
+ messages = OperatorUtils.user_merge_format(messages)

  if logprobs and (not isinstance(top_logprobs, int) or top_logprobs < 2):
  raise ValueError("top_logprobs should be an integer greater than 1")

texttools/internals/text_to_chunks.py
@@ -0,0 +1,97 @@
+ import re
+
+
+ def text_to_chunks(text: str, size: int, overlap: int) -> list[str]:
+ separators = ["\n\n", "\n", " ", ""]
+ is_separator_regex = False
+ keep_separator = True  # Equivalent to 'start'
+ length_function = len
+ strip_whitespace = True
+ chunk_size = size
+ chunk_overlap = overlap
+
+ def _split_text_with_regex(
+ text: str, separator: str, keep_separator: bool
+ ) -> list[str]:
+ if not separator:
+ return [text]
+ if not keep_separator:
+ return re.split(separator, text)
+ _splits = re.split(f"({separator})", text)
+ splits = [_splits[i] + _splits[i + 1] for i in range(1, len(_splits), 2)]
+ if len(_splits) % 2 == 0:
+ splits += [_splits[-1]]
+ return [_splits[0]] + splits if _splits[0] else splits
+
+ def _join_docs(docs: list[str], separator: str) -> str | None:
+ text = separator.join(docs)
+ if strip_whitespace:
+ text = text.strip()
+ return text if text else None
+
+ def _merge_splits(splits: list[str], separator: str) -> list[str]:
+ separator_len = length_function(separator)
+ docs = []
+ current_doc = []
+ total = 0
+ for d in splits:
+ len_ = length_function(d)
+ if total + len_ + (separator_len if current_doc else 0) > chunk_size:
+ if total > chunk_size:
+ pass
+ if current_doc:
+ doc = _join_docs(current_doc, separator)
+ if doc is not None:
+ docs.append(doc)
+ while total > chunk_overlap or (
+ total + len_ + (separator_len if current_doc else 0)
+ > chunk_size
+ and total > 0
+ ):
+ total -= length_function(current_doc[0]) + (
+ separator_len if len(current_doc) > 1 else 0
+ )
+ current_doc = current_doc[1:]
+ current_doc.append(d)
+ total += len_ + (separator_len if len(current_doc) > 1 else 0)
+ doc = _join_docs(current_doc, separator)
+ if doc is not None:
+ docs.append(doc)
+ return docs
+
+ def _split_text(text: str, separators: list[str]) -> list[str]:
+ final_chunks = []
+ separator = separators[-1]
+ new_separators = []
+ for i, _s in enumerate(separators):
+ separator_ = _s if is_separator_regex else re.escape(_s)
+ if not _s:
+ separator = _s
+ break
+ if re.search(separator_, text):
+ separator = _s
+ new_separators = separators[i + 1 :]
+ break
+ separator_ = separator if is_separator_regex else re.escape(separator)
+ splits = _split_text_with_regex(text, separator_, keep_separator)
+ _separator = "" if keep_separator else separator
+ good_splits = []
+ for s in splits:
+ if length_function(s) < chunk_size:
+ good_splits.append(s)
+ else:
+ if good_splits:
+ merged_text = _merge_splits(good_splits, _separator)
+ final_chunks.extend(merged_text)
+ good_splits = []
+ if not new_separators:
+ final_chunks.append(s)
+ else:
+ other_info = _split_text(s, new_separators)
+ final_chunks.extend(other_info)
+ if good_splits:
+ merged_text = _merge_splits(good_splits, _separator)
+ final_chunks.extend(merged_text)
+ return final_chunks
+
+ return _split_text(text, separators)
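
The new `text_to_chunks` module implements a recursive, separator-aware splitter. As a much simpler sketch of the `size`/`overlap` semantics only (a fixed-width window, not the recursive separator logic above — `fixed_window_chunks` is an illustrative name, not part of the package):

```python
def fixed_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Slide a window of `size` characters forward by `size - overlap` per step,
    so each chunk repeats the last `overlap` characters of the previous one."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]


print(fixed_window_chunks("abcdefghij", size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The separator-aware version above prefers to break on paragraph and line boundaries before falling back to spaces and raw characters, which keeps chunks semantically coherent.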

texttools/prompts/extract_entities.yaml
@@ -1,6 +1,6 @@
  main_template: |
  You are a Named Entity Recognition (NER) extractor.
- Identify and extract all named entities (e.g., PER, ORG, LOC, DAT, etc.) from the given text.
+ Identify and extract {entities} from the given text.
  For each entity, provide its text and a clear type.
  Respond only in JSON format:
  {{

texttools/prompts/text_to_question.yaml
@@ -1,11 +1,13 @@
  main_template: |
  You are a question generator.
- Given the following answer, generate one
- appropriate question that this answer would directly respond to.
+ Given the following answer, generate {number_of_questions} appropriate questions that this answer would directly respond to.
  The generated answer should be independently meaningful,
  and not mentioning any verbs like, this, that, he or she on the question.
+ There is a `reason` key, fill that up with a summerized version of your thoughts.
+ The `reason` must be less than 20 words.
+ Don't forget to fill the reason.
  Respond only in JSON format:
- {{"result": "string"}}
+ {{"result": ["question1", "question2", ...], "reason": "string"}}
  Here is the answer:
  {input}

@@ -13,7 +15,7 @@ analyze_template: |
  Analyze the following answer to identify its key facts,
  main subject, and what kind of information it provides.
  Provide a brief, summarized understanding of the answer's content that will
- help in formulating a relevant and direct question.
+ help in formulating relevant and direct questions.
  Just mention the keypoints that was provided in the answer
  Here is the answer:
  {input}