judgeval 0.0.31__py3-none-any.whl → 0.0.33__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. judgeval/__init__.py +3 -1
  2. judgeval/common/s3_storage.py +93 -0
  3. judgeval/common/tracer.py +869 -183
  4. judgeval/constants.py +1 -1
  5. judgeval/data/datasets/dataset.py +5 -1
  6. judgeval/data/datasets/eval_dataset_client.py +2 -2
  7. judgeval/data/sequence.py +16 -26
  8. judgeval/data/sequence_run.py +2 -0
  9. judgeval/judgment_client.py +44 -166
  10. judgeval/rules.py +4 -7
  11. judgeval/run_evaluation.py +2 -2
  12. judgeval/scorers/__init__.py +4 -4
  13. judgeval/scorers/judgeval_scorers/__init__.py +0 -176
  14. judgeval/version_check.py +22 -0
  15. {judgeval-0.0.31.dist-info → judgeval-0.0.33.dist-info}/METADATA +15 -2
  16. judgeval-0.0.33.dist-info/RECORD +63 -0
  17. judgeval/scorers/base_scorer.py +0 -58
  18. judgeval/scorers/judgeval_scorers/local_implementations/__init__.py +0 -27
  19. judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/__init__.py +0 -4
  20. judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/answer_correctness_scorer.py +0 -276
  21. judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/prompts.py +0 -169
  22. judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/__init__.py +0 -4
  23. judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/answer_relevancy_scorer.py +0 -298
  24. judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/prompts.py +0 -174
  25. judgeval/scorers/judgeval_scorers/local_implementations/comparison/__init__.py +0 -0
  26. judgeval/scorers/judgeval_scorers/local_implementations/comparison/comparison_scorer.py +0 -161
  27. judgeval/scorers/judgeval_scorers/local_implementations/comparison/prompts.py +0 -222
  28. judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/__init__.py +0 -3
  29. judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/contextual_precision_scorer.py +0 -264
  30. judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/prompts.py +0 -106
  31. judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/__init__.py +0 -3
  32. judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/contextual_recall_scorer.py +0 -254
  33. judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/prompts.py +0 -142
  34. judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/__init__.py +0 -3
  35. judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/contextual_relevancy_scorer.py +0 -245
  36. judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/prompts.py +0 -121
  37. judgeval/scorers/judgeval_scorers/local_implementations/execution_order/__init__.py +0 -3
  38. judgeval/scorers/judgeval_scorers/local_implementations/execution_order/execution_order.py +0 -156
  39. judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/__init__.py +0 -3
  40. judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/faithfulness_scorer.py +0 -318
  41. judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/prompts.py +0 -268
  42. judgeval/scorers/judgeval_scorers/local_implementations/hallucination/__init__.py +0 -3
  43. judgeval/scorers/judgeval_scorers/local_implementations/hallucination/hallucination_scorer.py +0 -264
  44. judgeval/scorers/judgeval_scorers/local_implementations/hallucination/prompts.py +0 -104
  45. judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/instruction_adherence.py +0 -232
  46. judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/prompt.py +0 -102
  47. judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/__init__.py +0 -5
  48. judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/json_correctness_scorer.py +0 -134
  49. judgeval/scorers/judgeval_scorers/local_implementations/summarization/__init__.py +0 -3
  50. judgeval/scorers/judgeval_scorers/local_implementations/summarization/prompts.py +0 -247
  51. judgeval/scorers/judgeval_scorers/local_implementations/summarization/summarization_scorer.py +0 -551
  52. judgeval-0.0.31.dist-info/RECORD +0 -96
  53. {judgeval-0.0.31.dist-info → judgeval-0.0.33.dist-info}/WHEEL +0 -0
  54. {judgeval-0.0.31.dist-info → judgeval-0.0.33.dist-info}/licenses/LICENSE.md +0 -0
judgeval/scorers/judgeval_scorers/local_implementations/hallucination/hallucination_scorer.py
@@ -1,264 +0,0 @@
- """
- Metric that evaluates hallucinations in model outputs
-
- The hallucination metric determines whether your LLM generates factually correct information by comparing
- the actual_output to the provided context.
-
- If you're looking to evaluate hallucination for a RAG system, refer to the faithfulness metric instead.
-
- The HallucinationMetric uses an LLM to determine, for each context in contexts, whether there are any
- contradictions to the actual_output.
-
- Although extremely similar to the FaithfulnessMetric, the HallucinationMetric is calculated differently
- since it uses contexts as the source of truth instead. Since contexts is the ideal segment of your
- knowledge base relevant to a specific input, the degree of hallucination can be measured by the degree
- of which the contexts is disagreed upon.
-
- Faithfulness is measuring the number of statements in output that agree with contexts.
- Hallucination is measuring the fraction of contexts that agree with output (do not contradict == agree)
- """
-
- from typing import Optional, Union, List
-
- from judgeval.constants import APIScorer
- from judgeval.scorers.utils import (
-     get_or_create_event_loop,
-     scorer_progress_meter,
-     create_verbose_logs,
-     parse_response_json,
-     check_example_params,
- )
- from judgeval.scorers import JudgevalScorer
- from judgeval.judges import JudgevalJudge
- from judgeval.judges.utils import create_judge
- from judgeval.data import Example, ExampleParams
- from judgeval.scorers.judgeval_scorers.local_implementations.hallucination.prompts import *
-
-
- required_params = [
-     ExampleParams.INPUT,
-     ExampleParams.ACTUAL_OUTPUT,
-     ExampleParams.CONTEXT,
- ]
-
-
- class HallucinationScorer(JudgevalScorer):
-     def __init__(
-         self,
-         threshold: float = 0.5,
-         model: Optional[Union[str, JudgevalJudge]] = None,
-         include_reason: bool = True,
-         async_mode: bool = False,
-         strict_mode: bool = False,
-         verbose_mode: bool = False,
-     ):
-         super().__init__(
-             score_type=APIScorer.HALLUCINATION,
-             threshold=1 if strict_mode else threshold,
-             evaluation_model=None,
-             include_reason=include_reason,
-             async_mode=async_mode,
-             strict_mode=strict_mode,
-             verbose_mode=verbose_mode
-         )
-         self.model, self.using_native_model = create_judge(model)
-         self.evaluation_model = self.model.get_model_name()
-
-     def score_example(
-         self,
-         example: Example,
-         _show_indicator: bool = True,
-     ) -> float:
-         check_example_params(example, required_params, self)
-
-         with scorer_progress_meter(self, display_meter=_show_indicator):
-             if self.async_mode:
-                 loop = get_or_create_event_loop()
-                 loop.run_until_complete(
-                     self.a_score_example(example, _show_indicator=False)
-                 )
-             else:
-                 self.verdicts: List[HallucinationVerdict] = (
-                     self._generate_verdicts(
-                         example.actual_output, example.context
-                     )
-                 )
-                 self.score = self._calculate_score()
-                 self.reason = self._generate_reason()
-                 self.success = self.score <= self.threshold
-                 self.verbose_logs = create_verbose_logs(
-                     self,
-                     steps=[
-                         f"Verdicts:\n{[v.model_dump() for v in self.verdicts]}",
-                         f"Score: {self.score}\nReason: {self.reason}",
-                     ],
-                 )
-
-             return self.score
-
-     async def a_score_example(
-         self,
-         example: Example,
-         _show_indicator: bool = True,
-     ) -> float:
-         check_example_params(example, required_params, self)
-
-         with scorer_progress_meter(
-             self, async_mode=True, display_meter=_show_indicator
-         ):
-             self.verdicts: List[HallucinationVerdict] = (
-                 await self._a_generate_verdicts(
-                     example.actual_output, example.context
-                 )
-             )
-             self.score = self._calculate_score()
-             self.reason = await self._a_generate_reason()
-             self.success = self.score <= self.threshold
-             self.verbose_logs = create_verbose_logs(
-                 self,
-                 steps=[
-                     f"Verdicts:\n{[v.model_dump() for v in self.verdicts]}",
-                     f"Score: {self.score}\nReason: {self.reason}",
-                 ],
-             )
-
-             return self.score
-
-     async def _a_generate_reason(self):
-         if self.include_reason is False:
-             return None
-
-         contradictions = []
-         for verdict in self.verdicts:
-             if verdict.verdict.strip().lower() == "no":
-                 contradictions.append(verdict.reason)
-
-         prompt: dict = HallucinationTemplate.generate_reason(
-             contradictions=contradictions,
-             score=format(self.score, ".2f"),
-         )
-
-         if self.using_native_model:
-             res = await self.model.a_generate(prompt)
-             data = parse_response_json(res, self)
-             return data["reason"]
-         else:
-             try:
-                 res: Reason = await self.model.a_generate(prompt, schema=Reason)
-                 return res.reason
-             except TypeError:
-                 res = await self.model.a_generate(prompt)
-                 data = parse_response_json(res, self)
-                 return data["reason"]
-
-     def _generate_reason(self):
-         if self.include_reason is False:
-             return None
-
-         factual_alignments = []
-         contradictions = []
-         for verdict in self.verdicts:
-             if verdict.verdict.strip().lower() == "no":
-                 contradictions.append(verdict.reason)
-
-         prompt: dict = HallucinationTemplate.generate_reason(
-             factual_alignments=factual_alignments,
-             contradictions=contradictions,
-             score=format(self.score, ".2f"),
-         )
-
-         if self.using_native_model:
-             res = self.model.generate(prompt)
-             data = parse_response_json(res, self)
-             return data["reason"]
-         else:
-             try:
-                 res: Reason = self.model.generate(prompt, schema=Reason)
-                 return res.reason
-             except TypeError:
-                 res = self.model.generate(prompt)
-                 data = parse_response_json(res, self)
-                 return data["reason"]
-
-     async def _a_generate_verdicts(
-         self, actual_output: str, contexts: List[str]
-     ) -> List[HallucinationVerdict]:
-         verdicts: List[HallucinationVerdict] = []
-         prompt = HallucinationTemplate.generate_verdicts(
-             actual_output=actual_output, contexts=contexts
-         )
-         if self.using_native_model:
-             res = await self.model.a_generate(prompt)
-             data = parse_response_json(res, self)
-             verdicts = [
-                 HallucinationVerdict(**item) for item in data["verdicts"]
-             ]
-             return verdicts
-         else:
-             try:
-                 res: Verdicts = await self.model.a_generate(
-                     prompt, schema=Verdicts
-                 )
-                 verdicts = [item for item in res.verdicts]
-                 return verdicts
-             except TypeError:
-                 res = await self.model.a_generate(prompt)
-                 data = parse_response_json(res, self)
-                 verdicts = [
-                     HallucinationVerdict(**item) for item in data["verdicts"]
-                 ]
-                 return verdicts
-
-     def _generate_verdicts(
-         self, actual_output: str, contexts: List[str]
-     ) -> List[HallucinationVerdict]:
-         verdicts: List[HallucinationVerdict] = []
-         prompt = HallucinationTemplate.generate_verdicts(
-             actual_output=actual_output, contexts=contexts
-         )
-         if self.using_native_model:
-             res = self.model.generate(prompt)
-             data = parse_response_json(res, self)
-             verdicts = [
-                 HallucinationVerdict(**item) for item in data["verdicts"]
-             ]
-             return verdicts
-         else:
-             try:
-                 res: Verdicts = self.model.generate(prompt, schema=Verdicts)
-                 verdicts = [item for item in res.verdicts]
-                 return verdicts
-             except TypeError:
-                 res = self.model.generate(prompt)
-                 data = parse_response_json(res, self)
-                 verdicts = [
-                     HallucinationVerdict(**item) for item in data["verdicts"]
-                 ]
-                 return verdicts
-
-     def _calculate_score(self) -> float:
-         number_of_verdicts = len(self.verdicts)
-         if number_of_verdicts == 0:
-             return 0
-
-         hallucination_count = 0
-         for verdict in self.verdicts:
-             if verdict.verdict.strip().lower() == "no":
-                 hallucination_count += 1
-
-         score = hallucination_count / number_of_verdicts
-         return 1 if self.strict_mode and score > self.threshold else score
-
-     def _success_check(self) -> bool:
-         if self.error is not None:
-             self.success = False
-         else:
-             try:
-                 self.success = self.score <= self.threshold
-             except:
-                 self.success = False
-         return self.success
-
-     @property
-     def __name__(self):
-         return "Hallucination"
judgeval/scorers/judgeval_scorers/local_implementations/hallucination/prompts.py
@@ -1,104 +0,0 @@
- from typing import List
- from pydantic import BaseModel
-
-
- class HallucinationVerdict(BaseModel):
-     verdict: str
-     reason: str
-
-
- class Verdicts(BaseModel):
-     verdicts: List[HallucinationVerdict]
-
-
- class Reason(BaseModel):
-     reason: str
-
-
- class HallucinationTemplate:
-     @staticmethod
-     def generate_verdicts(actual_output, contexts):
-         return f"""==== TASK INSTRUCTIONS ====
- You will be provided with an `actual output` (the response of an LLM to a particular query) and `contexts` (ground truth contextual information from a knowledge base).
- Your task is to take each context in contexts and determine whether the `actual output` factually agrees with the context.
-
- Additional notes:
- You should NOT use any prior knowledge you have in your decision making process; take each context at face value.
- Since you will determine a verdict for EACH context, the number of 'verdicts' is EXACTLY EQUAL TO the number of contexts.
- You should be lenient in your judgment when the actual output lacks detail with respect to the context segment; you should ONLY provide a 'no' answer if the context contradicts the actual output.
-
- ==== FORMATTING INSTRUCTIONS ====
- You should return a JSON object with a key 'verdicts', which is a list of JSON objects. Each JSON object corresponds to a context in `contexts`, and should have 2 fields: 'verdict' and 'reason'.
- The 'verdict' key should be EXACTLY one of 'yes' or 'no', representing whether the `actual output` factually agrees with the context segment.
- The 'reason' is the justification for the verdict. If your verdict is 'no', try to provide a correction in the reason.
-
- ==== EXAMPLE ====
- Example contexts: ["Einstein won the Nobel Prize for his discovery of the photoelectric effect.", "Einstein won the Nobel Prize in 1968."]
- Example actual output: "Einstein won the Nobel Prize in 1969 for his discovery of the photoelectric effect."
-
- Example:
- {{
- "verdicts": [
- {{
- "verdict": "yes",
- "reason": "The actual output agrees with the provided context which states that Einstein won the Nobel Prize for his discovery of the photoelectric effect."
- }},
- {{
- "verdict": "no",
- "reason": "The actual output contradicts the provided context which states that Einstein won the Nobel Prize in 1968, not 1969."
- }}
- ]
- }}
-
- ==== YOUR TURN ====
- Contexts:
- {contexts}
-
- Actual Output:
- {actual_output}
-
- JSON:
- """
-
-     @staticmethod
-     def generate_reason(contradictions, score):
-         return f"""==== TASK INSTRUCTIONS ====
- An LLM has been provided with a list of `contexts` (ground truth contextual information from a knowledge base) and `actual output` (the response of an LLM to a particular query).
- You will be provided with a list of `contradictions`, which are factual discrepancies between the context segments and the actual output.
- Additionally, you will be provided with a hallucination score, which is a float (0 - 1, where 0 is the best score) indicating the fraction of context segments that contradict the actual output.
-
- Your task is to provide a CLEAR and CONCISE reason for the hallucination score.
- If the hallucination score is 0 (no contradictions), you should instead respond with a positive remark with an upbeat encouraging tone (but don't overblow the kind attitude).
-
- ==== FORMATTING INSTRUCTIONS ====
- Please make sure to only return in JSON format, with the 'reason' key providing the reason.
- Example JSON:
- {{
- "reason": "The score is <hallucination_score> because <your_reason>."
- }}
-
- ==== EXAMPLE ====
- Example Contradictions:
- [
- "The actual output claims Einstein won the Nobel Prize in 1969, which contradicts the context stating he won it in 1968.",
- "The actual output states Einstein was a chemist, but the context indicates he was a physicist.",
- "The actual output claims Einstein was born in Switzerland, while the context states he was born in Germany."
- ]
-
- Example Hallucination Score:
- 0.75
-
- Example Response:
- {{
- "reason": "The score is 0.75 because the actual output made multiple factual errors: incorrectly stating Einstein's Nobel Prize year (1969 vs 1968), his profession (chemist vs physicist), and birthplace (Switzerland vs Germany)."
- }}
-
- ==== YOUR TURN ====
- Contradictions:
- {contradictions}
-
- Hallucination Score:
- {score}
-
- JSON:
- """
judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/instruction_adherence.py
@@ -1,232 +0,0 @@
- from typing import Optional, List, Union, Tuple
- from pydantic import BaseModel
-
- from judgeval.constants import APIScorer
- from judgeval.scorers.utils import (get_or_create_event_loop,
-                                     scorer_progress_meter,
-                                     create_verbose_logs,
-                                     parse_response_json,
-                                     check_example_params
-                                     )
- from judgeval.scorers import JudgevalScorer
- from judgeval.judges import JudgevalJudge
- from judgeval.judges.utils import create_judge
- from judgeval.data import Example, ExampleParams
- from judgeval.scorers.judgeval_scorers.local_implementations.instruction_adherence.prompt import (
-     InstructionAdherenceTemplate,
- )
- required_params = [
-     ExampleParams.INPUT,
-     ExampleParams.ACTUAL_OUTPUT,
- ]
-
- class Instructions(BaseModel):
-     instructions: List[str]
-
- class Verdict(BaseModel):
-     instruction: str
-     score: float
-     reason: str
-
- class ListOfVerdicts(BaseModel):
-     verdicts: List[Verdict]
-
- class InstructionAdherenceScorer(JudgevalScorer):
-     def __init__(
-         self,
-         threshold: float = 0.5,
-         model: Optional[Union[str, JudgevalJudge]] = None,
-         include_reason: bool = True,
-         async_mode: bool = True,
-         strict_mode: bool = False,
-         verbose_mode: bool = False,
-     ):
-         super().__init__(
-             score_type=APIScorer.INSTRUCTION_ADHERENCE,
-             threshold=1 if strict_mode else threshold,
-             evaluation_model=None,
-             include_reason=include_reason,
-             async_mode=async_mode,
-             strict_mode=strict_mode,
-             verbose_mode=verbose_mode
-         )
-         self.model, self.using_native_model = create_judge(model)
-         self.evaluation_model = self.model.get_model_name()
-
-     def score_example(
-         self,
-         example: Example,
-         _show_indicator: bool = True,
-     ) -> float:
-         check_example_params(example, required_params, self)
-
-         with scorer_progress_meter(self, display_meter=_show_indicator):
-             try:
-                 if self.async_mode:
-                     loop = get_or_create_event_loop()
-                     loop.run_until_complete(
-                         self.a_score_example(example, _show_indicator=False)
-                     )
-                 else:
-                     self.instructions: List[str] = self._get_instructions(example.input)
-                     self.verdicts: List[Verdict] = (
-                         self._get_verdicts(self.instructions, example.actual_output)
-                     )
-                     self.score = self._compute_score()
-                     self.reason = str(self.verdicts)
-                     self.success = self.score >= self.threshold
-                     self.verbose_logs = create_verbose_logs(
-                         self,
-                         steps=[
-                             f"Instructions:\n{self.instructions}",
-                             f"Score: {self.score}\nReason: {self.reason}",
-                         ],
-                     )
-                 return self.score
-             except Exception as e:
-                 raise
-
-     async def a_score_example(
-         self,
-         example: Example,
-         _show_indicator: bool = True,
-     ) -> float:
-         check_example_params(example, required_params, self)
-         try:
-             with scorer_progress_meter(
-                 self, async_mode=True, display_meter=_show_indicator
-             ):
-                 self.instructions: List[str] = await self._a_get_instructions(example.input)
-                 self.verdicts: List[Verdict] = (
-                     await self._a_get_verdicts(self.instructions, example.actual_output)
-                 )
-                 self.score = self._compute_score()
-                 self.reason = str(self.verdicts)
-                 self.success = self.score >= self.threshold
-                 self.verbose_logs = create_verbose_logs(
-                     self,
-                     steps=[
-                         f"Instructions:\n{self.instructions}",
-                         f"Score: {self.score}\nReason: {self.reason}",
-                     ],
-                 )
-                 return self.score
-         except Exception as e:
-             raise e
-
-
-     async def _a_get_verdicts(
-         self, instructions: List[str], actual_output: str
-     ) -> List[Verdict]:
-         if len(instructions) == 0:
-             return []
-
-         prompt = InstructionAdherenceTemplate.generate_verdicts(
-             instructions=instructions,
-             actual_output=actual_output,
-         )
-         if self.using_native_model:
-             res = await self.model.a_generate(prompt)
-             data = parse_response_json(res, self)
-             return [
-                 Verdict(**item) for item in data["verdicts"]
-             ]
-         else:
-             try:
-                 res: List[Verdict] = await self.model.a_generate(
-                     prompt, schema=List[Verdict]
-                 )
-                 return res
-             except TypeError:
-                 res = await self.model.a_generate(prompt)
-                 data = parse_response_json(res, self)
-                 return [
-                     Verdict(**item) for item in data["verdicts"]
-                 ]
-
-     def _get_verdicts(self, instructions: List[str], actual_output: str) -> List[Verdict]:
-         if len(instructions) == 0:
-             return []
-
-         prompt = InstructionAdherenceTemplate.generate_verdicts(
-             instructions=instructions,
-             actual_output=actual_output,
-         )
-         if self.using_native_model:
-             res = self.model.generate(prompt)
-             data = parse_response_json(res, self)
-             return [Verdict(**item) for item in data["verdicts"]]
-         else:
-             try:
-                 res: List[Verdict] = self.model.generate(prompt, schema=List[Verdict])
-                 return res
-             except TypeError:
-                 res = self.model.generate(prompt)
-                 data = parse_response_json(res, self)
-                 return [
-                     Verdict(**item) for item in data["verdicts"]
-                 ]
-
-     async def _a_get_instructions(
-         self,
-         input: str,
-     ) -> List[str]:
-         prompt = InstructionAdherenceTemplate.get_instructions(
-             input=input,
-         )
-         if self.using_native_model:
-             res = await self.model.a_generate(prompt)
-             data = parse_response_json(res, self)
-             return data["instructions"]
-         else:
-             try:
-                 res: List[str] = await self.model.a_generate(
-                     prompt, schema=List[str]
-                 )
-                 return res
-             except TypeError:
-                 res = await self.model.a_generate(prompt)
-                 data = parse_response_json(res, self)
-                 return data["instructions"]
-
-     def _get_instructions(
-         self,
-         input: str,
-     ) -> List[str]:
-         prompt = InstructionAdherenceTemplate.get_instructions(
-             input=input,
-         )
-         if self.using_native_model:
-             res = self.model.generate(prompt)
-             data = parse_response_json(res, self)
-             return data["instructions"]
-         else:
-             try:
-                 res: List[str] = self.model.generate(prompt, schema=List[str])
-                 return res
-             except TypeError:
-                 res = self.model.generate(prompt)
-                 data = parse_response_json(res, self)
-                 return data["instructions"]
-
-     def _compute_score(self):
-         if len(self.verdicts) == 0:
-             return 1
-         score = 0
-         for verdict in self.verdicts:
-             score += verdict.score
-         return score / len(self.verdicts)
-
-     def success_check(self) -> bool:
-         if self.error is not None:
-             self.success = False
-         else:
-             try:
-                 self.success = self.score >= self.threshold
-             except:
-                 self.success = False
-         return self.success
-
-     @property
-     def __name__(self):
-         return "Instruction Adherence"
judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/prompt.py
@@ -1,102 +0,0 @@
- """
- Util prompts for InstructionAdherenceScorer
- """
-
- from typing import List, Optional, Tuple
- from pydantic import BaseModel, Field
-
-
- class InstructionAdherenceTemplate:
-     @staticmethod
-     def get_instructions(input):
-         return f"""You will be presented with a piece of text. Your task is to break down the text and generate a list of the instructions contained within the text.
-
- ===== START OF EXAMPLES =====
- Example 1:
- Example text: Hello my name is John Doe. I like cars. Write two poems about the weather and create a joke. Also what is 5 + 5?
-
- Output:
- {{
- "instructions": ["Write two poem about the weather", "Create a joke", "What is 5 + 5?"]
- }}
- ===== END OF EXAMPLES =====
-
-
- **
- IMPORTANT: Please return your answer in valid JSON format, with the "instructions" key mapping to a list of strings. No words or explanation is needed.
- **
-
- ==== START OF INPUT ====
- Text:
- {input}
- ==== END OF INPUT ====
-
- ==== YOUR ANSWER ====
- JSON:
- """
-
-     @staticmethod
-     def generate_verdicts(instructions, actual_output):
-         return f"""
- You will be presented with a list of instructions and a piece of text. For each instruction, determine if the instruction was completed in the text. There are 3 categories: either completed, partially completed, or not completed. The scores for these will be 1, 0.5, and 0 respectively.
- Go through each instruction and provide score for each instruction as well as the reasoning for that score.
-
- ==== FORMATTING YOUR ANSWER ====
- Please return your answer in JSON format, with a list of JSON objects with keys "instruction", "score", and "reason". No words or explanation beyond the output JSON is needed.
-
-
- ===== START OF EXAMPLES =====
- Example 1:
- instructions: ["Write two poems about the weather", "Create a joke", "What is 5 + 5?"]
- output: Poem 1: The Sun's Embrace
- The sun climbs high, a golden flame,
- It whispers warmth, it calls my name.
- The sky, a canvas, blue and clear,
- A perfect day for cars, my dear.
-
- The asphalt hums beneath the wheels,
- A symphony of speed it feels.
- The weather smiles, no clouds in sight,
- A driver's joy, pure delight.
-
- Poem 2: The Storm's Dance
- A sunlit meadow, alive with whispers of wind, where daisies dance and hope begins again. Each petal holds a promise—bright, unbruised— a symphony of light that cannot be refused.
-
- Joke
- Why dont cars ever get cold in the winter?
- Because they have radiators!
-
- Math Answer
- 5 + 5 = 10
-
- YOUR JSON OUTPUT:
- {{
- [
- {{
- "instruction": "Write two poem about the weather",
- "score": 0.5,
- "reason": "The output contained one poem about the weather, but the other poem was not about the weather."
- }},
- {{
- "instruction": "Create a joke",
- "score": 1,
- "reason": "There was a joke created in the output."
- }},
- {{
- "instruction": "What is 5 + 5?",
- "score": 1,
- "reason": "The answer to the math question was provided in the output."
- }}
- ]
- }}
- ===== END OF EXAMPLES =====
-
- ==== START OF INPUT ====
- instructions: {instructions}
- output: {actual_output}
- ==== END OF INPUT ====
-
- ==== YOUR ANSWER ====
- JSON:
- """
-
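
The first prompt above (get_instructions) expects the judge to return {"instructions": [...]}; that list is then fed to generate_verdicts for per-instruction scoring. Below is a minimal sketch of parsing that first-stage payload, assuming pydantic v2 and mirroring the removed Instructions model; the raw string reuses the example from the prompt and is not real model output.

from typing import List
from pydantic import BaseModel


class Instructions(BaseModel):
    instructions: List[str]


raw = '{"instructions": ["Write two poems about the weather", "Create a joke", "What is 5 + 5?"]}'
parsed = Instructions.model_validate_json(raw)
print(parsed.instructions)  # this list is what generate_verdicts is asked to score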