deepeval 3.4.9__py3-none-any.whl → 3.5.1__py3-none-any.whl
- deepeval/_version.py +1 -1
- deepeval/benchmarks/drop/drop.py +2 -3
- deepeval/benchmarks/hellaswag/hellaswag.py +2 -2
- deepeval/benchmarks/logi_qa/logi_qa.py +2 -2
- deepeval/benchmarks/math_qa/math_qa.py +2 -2
- deepeval/benchmarks/mmlu/mmlu.py +2 -2
- deepeval/benchmarks/truthful_qa/truthful_qa.py +2 -2
- deepeval/confident/api.py +3 -0
- deepeval/integrations/langchain/callback.py +21 -0
- deepeval/integrations/pydantic_ai/__init__.py +2 -4
- deepeval/integrations/pydantic_ai/{setup.py → otel.py} +0 -8
- deepeval/integrations/pydantic_ai/patcher.py +376 -0
- deepeval/metrics/__init__.py +1 -1
- deepeval/metrics/answer_relevancy/template.py +13 -38
- deepeval/metrics/faithfulness/template.py +17 -27
- deepeval/models/llms/grok_model.py +1 -1
- deepeval/models/llms/kimi_model.py +1 -1
- deepeval/prompt/api.py +22 -4
- deepeval/prompt/prompt.py +131 -17
- deepeval/synthesizer/synthesizer.py +17 -9
- deepeval/tracing/api.py +3 -0
- deepeval/tracing/context.py +3 -1
- deepeval/tracing/perf_epoch_bridge.py +4 -4
- deepeval/tracing/tracing.py +12 -2
- deepeval/tracing/types.py +3 -0
- deepeval/tracing/utils.py +6 -2
- deepeval/utils.py +2 -2
- {deepeval-3.4.9.dist-info → deepeval-3.5.1.dist-info}/METADATA +14 -13
- {deepeval-3.4.9.dist-info → deepeval-3.5.1.dist-info}/RECORD +32 -32
- deepeval/integrations/pydantic_ai/agent.py +0 -364
- {deepeval-3.4.9.dist-info → deepeval-3.5.1.dist-info}/LICENSE.md +0 -0
- {deepeval-3.4.9.dist-info → deepeval-3.5.1.dist-info}/WHEEL +0 -0
- {deepeval-3.4.9.dist-info → deepeval-3.5.1.dist-info}/entry_points.txt +0 -0
deepeval/metrics/faithfulness/template.py
CHANGED

@@ -76,42 +76,31 @@ The 'verdict' key should STRICTLY be either 'yes', 'no', or 'idk', which states
 Provide a 'reason' ONLY if the answer is 'no' or 'idk'.
 The provided claim is drawn from the actual output. Try to provide a correction in the reason using the facts in the retrieval context.

-
-IMPORTANT: Please make sure to only return in JSON format, with the 'verdicts' key as a list of JSON objects.
-Example retrieval contexts: "Einstein won the Nobel Prize for his discovery of the photoelectric effect. Einstein won the Nobel Prize in 1968. Einstein is a German Scientist."
-Example claims: ["Barack Obama is a caucasian male.", "Zurich is a city in London", "Einstein won the Nobel Prize for the discovery of the photoelectric effect which may have contributed to his fame.", "Einstein won the Nobel Prize in 1969 for his discovery of the photoelectric effect.", "Einstein was a German chef."]
-
-Example:
+Expected JSON format:
 {{
 "verdicts": [
-{{
-"verdict": "idk",
-"reason": "The claim about Barack Obama is although incorrect, it is not directly addressed in the retrieval context, and so poses no contradiction."
-}},
-{{
-"verdict": "idk",
-"reason": "The claim about Zurich being a city in London is incorrect but does not pose a contradiction to the retrieval context."
-}},
 {{
 "verdict": "yes"
 }},
 {{
 "verdict": "no",
-"reason":
+"reason": <explanation_for_contradiction>
 }},
 {{
-"verdict": "
-"reason":
-}}
+"verdict": "idk",
+"reason": <explanation_for_uncertainty>
+}}
 ]
 }}
-===== END OF EXAMPLE ======

-
-
-
-
-
+Generate ONE verdict per claim - length of 'verdicts' MUST equal number of claims.
+No 'reason' needed for 'yes' verdicts.
+Only use 'no' if retrieval context DIRECTLY CONTRADICTS the claim - never use prior knowledge.
+Use 'idk' for claims not backed up by context OR factually incorrect but non-contradictory - do not assume your knowledge.
+Vague/speculative language in claims (e.g. 'may have', 'possibility') does NOT count as contradiction.
+
+**
+IMPORTANT: Please make sure to only return in JSON format, with the 'verdicts' key as a list of JSON objects.
 **

 Retrieval Contexts:
@@ -128,13 +117,14 @@ JSON:
 return f"""Below is a list of Contradictions. It is a list of strings explaining why the 'actual output' does not align with the information presented in the 'retrieval context'. Contradictions happen in the 'actual output', NOT the 'retrieval context'.
 Given the faithfulness score, which is a 0-1 score indicating how faithful the `actual output` is to the retrieval context (higher the better), CONCISELY summarize the contradictions to justify the score.

-
-IMPORTANT: Please make sure to only return in JSON format, with the 'reason' key providing the reason.
-Example JSON:
+Expected JSON format:
 {{
 "reason": "The score is <faithfulness_score> because <your_reason>."
 }}

+**
+IMPORTANT: Please make sure to only return in JSON format, with the 'reason' key providing the reason.
+
 If there are no contradictions, just say something positive with an upbeat encouraging tone (but don't overdo it otherwise it gets annoying).
 Your reason MUST use information in `contradiction` in your reason.
 Be sure in your reason, as if you know what the actual output is from the contradictions.
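For context, the rewritten faithfulness template drops the worked Einstein example and instead constrains the judge model with explicit rules plus a bare schema. A conforming response under the new format might look like the sketch below (the claims and reasons are made up purely for illustration):

```python
# Hypothetical judge-model output for three claims: one supported, one
# contradicted, and one not covered by the retrieval context.
verdicts_response = {
    "verdicts": [
        {"verdict": "yes"},
        {"verdict": "no", "reason": "The context says the prize was for the photoelectric effect, not relativity."},
        {"verdict": "idk", "reason": "The claim is not addressed by the retrieval context, so it poses no contradiction."},
    ]
}

# The new template requires exactly one verdict per claim.
assert len(verdicts_response["verdicts"]) == 3
```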
deepeval/prompt/api.py
CHANGED

@@ -1,4 +1,4 @@
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, AliasChoices
 from enum import Enum
 from typing import List, Optional

@@ -19,9 +19,28 @@ class PromptType(Enum):
     TEXT = "TEXT"
     LIST = "LIST"

+class PromptVersion(BaseModel):
+    id: str
+    version: str
+    commit_message: str = Field(
+        serialization_alias="commitMessage",
+        validation_alias=AliasChoices("commit_message", "commitMessage")
+    )
+
+class PromptVersionsHttpResponse(BaseModel):
+    text_versions: Optional[List[PromptVersion]] = Field(
+        None,
+        serialization_alias="textVersions",
+        validation_alias=AliasChoices("text_versions", "textVersions")
+    )
+    messages_versions: Optional[List[PromptVersion]] = Field(
+        None,
+        serialization_alias="messagesVersions",
+        validation_alias=AliasChoices("messages_versions", "messagesVersions")
+    )

 class PromptHttpResponse(BaseModel):
-
+    id: str
     text: Optional[str] = None
     messages: Optional[List[PromptMessage]] = None
     interpolation_type: PromptInterpolationType = Field(
@@ -29,7 +48,6 @@ class PromptHttpResponse(BaseModel):
     )
     type: PromptType

-
 class PromptPushRequest(BaseModel):
     alias: str
     text: Optional[str] = None
@@ -44,4 +62,4 @@ class PromptPushRequest(BaseModel):

 class PromptApi(BaseModel):
     id: str
-    type: PromptType
+    type: PromptType
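A minimal sketch of how the new version models behave, assuming they are importable from `deepeval.prompt.api` as defined above (the field values are invented). `AliasChoices` lets validation accept either the snake_case field name or the camelCase key the API returns, while `serialization_alias` restores camelCase when dumping by alias:

```python
from deepeval.prompt.api import PromptVersionsHttpResponse

payload = {  # hypothetical API response body
    "textVersions": [
        {"id": "ver_123", "version": "00.00.01", "commitMessage": "initial draft"}
    ]
}

response = PromptVersionsHttpResponse(**payload)
print(response.text_versions[0].commit_message)            # "initial draft"
print(response.model_dump(by_alias=True)["textVersions"])  # camelCase keys again
```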
deepeval/prompt/prompt.py
CHANGED

@@ -1,11 +1,12 @@
 from enum import Enum
-from typing import Optional, List
+from typing import Optional, List, Dict
 from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn
 from rich.console import Console
 import time
 import json
 import os
 from pydantic import BaseModel
+import asyncio

 from deepeval.prompt.api import (
     PromptHttpResponse,
@@ -13,11 +14,12 @@ from deepeval.prompt.api import (
     PromptType,
     PromptInterpolationType,
     PromptPushRequest,
+    PromptVersionsHttpResponse,
 )
 from deepeval.prompt.utils import interpolate_text
 from deepeval.confident.api import Api, Endpoints, HttpMethods
-
 from deepeval.constants import HIDDEN_DIR
+from deepeval.utils import get_or_create_event_loop

 CACHE_FILE_NAME = f"{HIDDEN_DIR}/.deepeval-prompt-cache.json"

@@ -63,7 +65,23 @@ class Prompt:
         self.alias = alias
         self._text_template = template
         self._messages_template = messages_template
-        self.
+        self._version = None
+        self._polling_tasks: Dict[str, asyncio.Task] = {}
+        self._refresh_map: Dict[str, int] = {}
+
+    @property
+    def version(self):
+        if self._version is not None and self._version != "latest":
+            return self._version
+        versions = self._get_versions()
+        if len(versions) == 0:
+            return "latest"
+        else:
+            return versions[-1].version
+
+    @version.setter
+    def version(self, value):
+        self._version = value

     def interpolate(self, **kwargs):
         if self._type == PromptType.TEXT:
@@ -93,6 +111,20 @@ class Prompt:
             return interpolated_messages
         else:
             raise ValueError(f"Unsupported prompt type: {self._type}")
+
+    def _get_versions(self) -> List:
+        if self.alias is None:
+            raise ValueError(
+                "Prompt alias is not set. Please set an alias to continue."
+            )
+        api = Api()
+        data, _ = api.send_request(
+            method=HttpMethods.GET,
+            endpoint=Endpoints.PROMPTS_VERSIONS_ENDPOINT,
+            url_params={"alias": self.alias},
+        )
+        versions = PromptVersionsHttpResponse(**data)
+        return versions.text_versions or versions.messages_versions or []

     def _read_from_cache(
         self, alias: str, version: Optional[str] = None
@@ -123,8 +155,16 @@ class Prompt:
         except Exception as e:
             raise Exception(f"Error reading Prompt cache from disk: {e}")

-    def _write_to_cache(
-
+    def _write_to_cache(
+        self,
+        version: Optional[str] = None,
+        text_template: Optional[str] = None,
+        messages_template: Optional[List[PromptMessage]] = None,
+        prompt_version_id: Optional[str] = None,
+        type: Optional[PromptType] = None,
+        interpolation_type: Optional[PromptInterpolationType] = None,
+    ):
+        if not self.alias or not version:
             return

         cache_data = {}
@@ -140,14 +180,14 @@ class Prompt:
             cache_data[self.alias] = {}

         # Cache the prompt
-        cache_data[self.alias][
+        cache_data[self.alias][version] = {
             "alias": self.alias,
-            "version":
-            "template":
-            "messages_template":
-            "prompt_version_id":
-            "type":
-            "interpolation_type":
+            "version": version,
+            "template": text_template,
+            "messages_template": messages_template,
+            "prompt_version_id": prompt_version_id,
+            "type": type,
+            "interpolation_type": interpolation_type,
         }

         # Ensure directory exists
@@ -163,12 +203,22 @@ class Prompt:
         fallback_to_cache: bool = True,
         write_to_cache: bool = True,
         default_to_cache: bool = True,
+        refresh: Optional[int] = 60,
     ):
+        if refresh:
+            default_to_cache = True
+            write_to_cache = False
         if self.alias is None:
             raise TypeError(
                 "Unable to pull prompt from Confident AI when no alias is provided."
             )

+        # Manage background prompt polling
+        loop = get_or_create_event_loop()
+        loop.run_until_complete(
+            self.create_polling_task(version, refresh)
+        )
+
         if default_to_cache:
             try:
                 cached_prompt = self._read_from_cache(self.alias, version)
@@ -200,11 +250,11 @@ class Prompt:
             try:
                 data, _ = api.send_request(
                     method=HttpMethods.GET,
-                    endpoint=Endpoints.
-
+                    endpoint=Endpoints.PROMPTS_VERSION_ID_ENDPOINT,
+                    url_params={"alias": self.alias, "versionId": version or "latest"},
                 )
                 response = PromptHttpResponse(
-
+                    id=data["id"],
                     text=data.get("text", None),
                     messages=data.get("messages", None),
                     type=data["type"],
@@ -243,7 +293,7 @@ class Prompt:
             self.version = version or "latest"
             self._text_template = response.text
             self._messages_template = response.messages
-            self._prompt_version_id = response.
+            self._prompt_version_id = response.id
             self._type = response.type
             self._interpolation_type = response.interpolation_type

@@ -254,7 +304,14 @@ class Prompt:
                 description=f"{progress.tasks[task_id].description}[rgb(25,227,160)]Done! ({time_taken}s)",
             )
             if write_to_cache:
-                self._write_to_cache(
+                self._write_to_cache(
+                    version=version or "latest",
+                    text_template=response.text,
+                    messages_template=response.messages,
+                    prompt_version_id=response.id,
+                    type=response.type,
+                    interpolation_type=response.interpolation_type,
+                )

     def push(
         self,
@@ -300,3 +357,60 @@ class Prompt:
             "✅ Prompt successfully pushed to Confident AI! View at "
             f"[link={link}]{link}[/link]"
         )
+
+    ############################################
+    ### Polling
+    ############################################
+
+    async def create_polling_task(
+        self,
+        version: Optional[str],
+        refresh: Optional[int] = 60,
+    ):
+        if version is None:
+            return
+
+        # If polling task doesn't exist, start it
+        polling_task: Optional[asyncio.Task] = self._polling_tasks.get(version)
+        if refresh:
+            self._refresh_map[version] = refresh
+            if not polling_task:
+                self._polling_tasks[version] = asyncio.create_task(
+                    self.poll(version)
+                )
+
+        # If invalid `refresh`, stop the task
+        else:
+            if polling_task:
+                polling_task.cancel()
+                self._polling_tasks.pop(version)
+                self._refresh_map.pop(version)
+
+    async def poll(self, version: Optional[str] = None):
+        api = Api()
+        while True:
+            try:
+                data, _ = api.send_request(
+                    method=HttpMethods.GET,
+                    endpoint=Endpoints.PROMPTS_VERSION_ID_ENDPOINT,
+                    url_params={"alias": self.alias, "versionId": version or "latest"},
+                )
+                response = PromptHttpResponse(
+                    id=data["id"],
+                    text=data.get("text", None),
+                    messages=data.get("messages", None),
+                    type=data["type"],
+                    interpolation_type=data["interpolationType"],
+                )
+                self._write_to_cache(
+                    version=version or "latest",
+                    text_template=response.text,
+                    messages_template=response.messages,
+                    prompt_version_id=response.id,
+                    type=response.type,
+                    interpolation_type=response.interpolation_type,
+                )
+            except Exception as e:
+                pass

+            await asyncio.sleep(self._refresh_map[version])
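A hedged usage sketch of the new polling behaviour, based only on the signatures in the diff above; the alias, version string, and interpolation kwarg are hypothetical, and pulling requires a configured Confident AI API key:

```python
from deepeval.prompt.prompt import Prompt

prompt = Prompt(alias="my-prompt-alias")  # hypothetical alias

# `refresh` (new in 3.5.1, default 60) starts a background task that
# re-fetches this version every `refresh` seconds and rewrites the local
# cache; per the diff it also forces default_to_cache=True and
# write_to_cache=False for the pull itself.
prompt.pull(version="00.00.02", refresh=60)

# `version` is now a property: it returns the explicitly set version, or
# falls back to the latest version reported by the versions endpoint.
print(prompt.version)

# Kwarg name is illustrative; interpolate(**kwargs) fills the template.
print(prompt.interpolate(question="How do I reset my password?"))
```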
deepeval/synthesizer/synthesizer.py
CHANGED

@@ -361,7 +361,7 @@ class Synthesizer:
             progress if _progress is None else nullcontext()
         ):

-            for
+            for context_index, context in enumerate(contexts):
                 # Calculate pbar lengths
                 should_style = (
                     self.styling_config.input_format
@@ -381,7 +381,7 @@ class Synthesizer:
                 # Add pbars
                 pbar_generate_goldens_id = add_pbar(
                     progress,
-                    f"\t⚡ Generating goldens from context #{
+                    f"\t⚡ Generating goldens from context #{context_index}",
                     total=1 + max_goldens_per_context,
                 )
                 pbar_generate_inputs_id = add_pbar(
@@ -421,7 +421,9 @@ class Synthesizer:
                     progress, pbar_generate_goldens_id, remove=False
                 )

-                for
+                for input_index, data in enumerate(
+                    qualified_synthetic_inputs
+                ):
                     # Evolve input
                     evolved_input, evolutions_used = self._evolve_input(
                         input=data.input,
@@ -429,7 +431,9 @@ class Synthesizer:
                         num_evolutions=self.evolution_config.num_evolutions,
                         evolutions=self.evolution_config.evolutions,
                         progress=progress,
-                        pbar_evolve_input_id=pbar_evolve_input_ids[
+                        pbar_evolve_input_id=pbar_evolve_input_ids[
+                            input_index
+                        ],
                         remove_pbar=False,
                     )

@@ -441,7 +445,9 @@ class Synthesizer:
                         task=self.styling_config.task,
                     )
                     update_pbar(
-                        progress,
+                        progress,
+                        pbar_evolve_input_ids[input_index],
+                        remove=False,
                     )
                     res: SyntheticData = self._generate_schema(
                         prompt,
@@ -455,15 +461,15 @@ class Synthesizer:
                         input=evolved_input,
                         context=context,
                         source_file=(
-                            source_files[
+                            source_files[context_index]
                             if source_files is not None
                             else None
                         ),
                         additional_metadata={
                             "evolutions": evolutions_used,
-                            "synthetic_input_quality": scores[
+                            "synthetic_input_quality": scores[input_index],
                             "context_quality": (
-                                _context_scores[
+                                _context_scores[context_index]
                                 if _context_scores is not None
                                 else None
                             ),
@@ -480,7 +486,9 @@ class Synthesizer:
                     res = self._generate(prompt)
                     golden.expected_output = res
                     update_pbar(
-                        progress,
+                        progress,
+                        pbar_evolve_input_ids[input_index],
+                        remove=False,
                     )

                     goldens.append(golden)
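The synthesizer changes are internal renames (explicit `context_index` / `input_index` loop variables), but they also show which quality scores land in each golden's `additional_metadata`. A rough sketch of reading them back, assuming the usual `generate_goldens_from_contexts` entry point and an LLM provider already configured:

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[["Einstein won the Nobel Prize for the photoelectric effect."]],
    max_goldens_per_context=2,
)

for golden in goldens:
    # Keys match the additional_metadata dict built in the loop above.
    meta = golden.additional_metadata or {}
    print(meta.get("synthetic_input_quality"), meta.get("context_quality"))
```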
deepeval/tracing/api.py
CHANGED

@@ -86,6 +86,9 @@ class BaseApiSpan(BaseModel):
     cost_per_output_token: Optional[float] = Field(
         None, alias="costPerOutputToken"
     )
+    token_intervals: Optional[Dict[str, str]] = Field(
+        None, alias="tokenIntervals"
+    )

     ## evals
     metric_collection: Optional[str] = Field(None, alias="metricCollection")
deepeval/tracing/context.py
CHANGED

@@ -4,7 +4,6 @@ from contextvars import ContextVar
 from deepeval.tracing.types import BaseSpan, Trace
 from deepeval.test_case.llm_test_case import ToolCall, LLMTestCase
 from deepeval.tracing.types import LlmSpan, RetrieverSpan
-from deepeval.metrics import BaseMetric
 from deepeval.prompt.prompt import Prompt

 current_span_context: ContextVar[Optional[BaseSpan]] = ContextVar(
@@ -117,6 +116,7 @@ def update_llm_span(
     output_token_count: Optional[float] = None,
     cost_per_input_token: Optional[float] = None,
     cost_per_output_token: Optional[float] = None,
+    token_intervals: Optional[Dict[float, str]] = None,
     prompt: Optional[Prompt] = None,
 ):
     current_span = current_span_context.get()
@@ -132,6 +132,8 @@ def update_llm_span(
         current_span.cost_per_input_token = cost_per_input_token
     if cost_per_output_token:
         current_span.cost_per_output_token = cost_per_output_token
+    if token_intervals:
+        current_span.token_intervals = token_intervals
     if prompt:
         current_span.prompt = prompt

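A sketch of how the new `token_intervals` argument might be used while streaming inside an active LLM span (the streamed tokens are invented; outside a deepeval-traced LLM span the call has nothing to update):

```python
import time
from deepeval.tracing.context import update_llm_span

token_intervals = {}
for token in ["Hello", ",", " world"]:  # hypothetical streamed tokens
    # Keys are perf_counter floats, values are token text, matching the
    # Dict[float, str] annotation added above.
    token_intervals[time.perf_counter()] = token

update_llm_span(
    output_token_count=len(token_intervals),
    token_intervals=token_intervals,
)
```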
deepeval/tracing/perf_epoch_bridge.py
CHANGED

@@ -15,12 +15,12 @@ Usage:

 from __future__ import annotations
 import time
-from typing import Final
+from typing import Final, Union

 # Module globals are initialised exactly once.
-_anchor_perf_ns: int
-_anchor_wall_ns: int
-_offset_ns: int
+_anchor_perf_ns: Union[int, None] = None
+_anchor_wall_ns: Union[int, None] = None
+_offset_ns: Union[int, None] = None


 def init_clock_bridge() -> None:
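For readers unfamiliar with the module, the globals above anchor `time.perf_counter_ns()` against wall-clock time once so that later perf-counter readings (such as the `token_intervals` keys) can be mapped to datetimes. The snippet below is only a sketch of that idea, not deepeval's actual implementation; the helper name mirrors the `perf_counter_to_datetime` call used in tracing.py:

```python
import time
from datetime import datetime, timezone

# Anchor both clocks once at startup.
_anchor_perf_ns = time.perf_counter_ns()
_anchor_wall_ns = time.time_ns()

def perf_counter_to_datetime(perf_seconds: float) -> datetime:
    # Illustrative body: shift the perf-counter reading by the anchored offset.
    delta_ns = int(perf_seconds * 1e9) - _anchor_perf_ns
    return datetime.fromtimestamp((_anchor_wall_ns + delta_ns) / 1e9, tz=timezone.utc)

print(perf_counter_to_datetime(time.perf_counter()))
```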
deepeval/tracing/tracing.py
CHANGED

@@ -114,7 +114,7 @@ class TraceManager:
         self._print_trace_status(
             message=f"WARNING: Exiting with {queue_size + in_flight} abaonded trace(s).",
             trace_worker_status=TraceWorkerStatus.WARNING,
-            description=f"Set {CONFIDENT_TRACE_FLUSH}=
+            description=f"Set {CONFIDENT_TRACE_FLUSH}=1 as an environment variable to flush remaining traces to Confident AI.",
         )

     def mask(self, data: Any):
@@ -314,7 +314,7 @@ class TraceManager:
             env_text,
             message + ":",
             description,
-            f"\nTo disable dev logging, set {CONFIDENT_TRACE_VERBOSE}=
+            f"\nTo disable dev logging, set {CONFIDENT_TRACE_VERBOSE}=0 as an environment variable.",
         )
     else:
         console.print(message_prefix, env_text, message)
@@ -717,6 +717,16 @@ class TraceManager:
         api_span.input_token_count = span.input_token_count
         api_span.output_token_count = span.output_token_count

+        processed_token_intervals = {}
+        if span.token_intervals:
+            for key, value in span.token_intervals.items():
+                time = to_zod_compatible_iso(
+                    perf_counter_to_datetime(key),
+                    microsecond_precision=True,
+                )
+                processed_token_intervals[time] = value
+            api_span.token_intervals = processed_token_intervals
+
         return api_span

deepeval/tracing/types.py
CHANGED

@@ -102,6 +102,9 @@ class LlmSpan(BaseSpan):
     cost_per_output_token: Optional[float] = Field(
         None, serialization_alias="costPerOutputToken"
     )
+    token_intervals: Optional[Dict[float, str]] = Field(
+        None, serialization_alias="tokenTimes"
+    )

     # for serializing `prompt`
     model_config = {"arbitrary_types_allowed": True}
deepeval/tracing/utils.py
CHANGED

@@ -100,10 +100,14 @@ def make_json_serializable(obj):
     return _serialize(obj)


-def to_zod_compatible_iso(
+def to_zod_compatible_iso(
+    dt: datetime, microsecond_precision: bool = False
+) -> str:
     return (
         dt.astimezone(timezone.utc)
-        .isoformat(
+        .isoformat(
+            timespec="microseconds" if microsecond_precision else "milliseconds"
+        )
         .replace("+00:00", "Z")
     )

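The new `microsecond_precision` flag only changes the `timespec` passed to `isoformat`; with a fixed datetime the difference looks like this:

```python
from datetime import datetime, timezone
from deepeval.tracing.utils import to_zod_compatible_iso

dt = datetime(2025, 1, 2, 3, 4, 5, 123456, tzinfo=timezone.utc)

print(to_zod_compatible_iso(dt))                              # 2025-01-02T03:04:05.123Z
print(to_zod_compatible_iso(dt, microsecond_precision=True))  # 2025-01-02T03:04:05.123456Z
```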
deepeval/utils.py
CHANGED

@@ -516,7 +516,7 @@ def remove_pbars(


 def read_env_int(
-    name: str, default: int, *, min_value: int
+    name: str, default: int, *, min_value: Union[int, None] = None
 ) -> int:
     """Read an integer from an environment variable with safe fallback.

@@ -545,7 +545,7 @@ def read_env_int(


 def read_env_float(
-    name: str, default: float, *, min_value: float
+    name: str, default: float, *, min_value: Union[float, None] = None
 ) -> float:
     """Read a float from an environment variable with safe fallback.

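With `min_value` now optional, the helpers can be called with just a name and a default. The environment variable names below are hypothetical, and the fallback behaviour shown is only what the "safe fallback" docstring implies:

```python
import os
from deepeval.utils import read_env_int, read_env_float

os.environ["MY_FAKE_RETRY_LIMIT"] = "5"  # hypothetical variable

print(read_env_int("MY_FAKE_RETRY_LIMIT", 3, min_value=1))  # 5
print(read_env_int("MY_FAKE_UNSET_VAR", 3))                 # 3 (unset -> default)
print(read_env_float("MY_FAKE_TIMEOUT", 1.5))               # 1.5 (unset -> default)
```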
{deepeval-3.4.9.dist-info → deepeval-3.5.1.dist-info}/METADATA
CHANGED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: deepeval
-Version: 3.4.9
+Version: 3.5.1
 Summary: The LLM Evaluation Framework
 Home-page: https://github.com/confident-ai/deepeval
 License: Apache-2.0
@@ -189,16 +189,6 @@ Let's pretend your LLM application is a RAG based customer support chatbot; here
 ```
 pip install -U deepeval
 ```
-### Environment variables (.env / .env.local)
-
-DeepEval auto-loads `.env.local` then `.env` from the current working directory **at import time**.
-**Precedence:** process env -> `.env.local` -> `.env`.
-Opt out with `DEEPEVAL_DISABLE_DOTENV=1`.
-
-```bash
-cp .env.example .env.local
-# then edit .env.local (ignored by git)
-```

 ## Create an account (highly recommended)

@@ -391,9 +381,20 @@ evaluate(dataset, [answer_relevancy_metric])
 dataset.evaluate([answer_relevancy_metric])
 ```

-
+## A Note on Env Variables (.env / .env.local)
+
+DeepEval auto-loads `.env.local` then `.env` from the current working directory **at import time**.
+**Precedence:** process env -> `.env.local` -> `.env`.
+Opt out with `DEEPEVAL_DISABLE_DOTENV=1`.
+
+```bash
+cp .env.example .env.local
+# then edit .env.local (ignored by git)
+```
+
+# DeepEval With Confident AI

-
+DeepEval's cloud platform, [Confident AI](https://confident-ai.com?utm_source=Github), allows you to:

 1. Curate/annotate evaluation datasets on the cloud
 2. Benchmark LLM app using dataset, and compare with previous iterations to experiment which models/prompts works best