agentme 0.21.0 → 0.22.1
@@ -171,6 +171,46 @@ prompt = PromptTemplate.from_file(
 )
 ```
 
+#### 06-output-length-constraints
+
+Every free-text field generated by an LLM MUST have an explicit word or token limit defined wherever the content is specified — in prompt text, output schema definitions, or both. This prevents runaway verbosity, reduces token costs, and makes quality evaluation deterministic.
+
+**Rules:**
+
+- Append `[max N words]` (or `[max N tokens]` when appropriate) directly inside the instruction or field description that requests the generated content.
+- Apply the constraint to every level: the top-level prompt instruction AND any nested schema field that contains free text.
+- When a field is an enumeration or a short code (e.g., `"APPROVE"` / `"REJECT"`), no word limit is needed.
+- Inline examples are encouraged whenever the expected output style is non-obvious.
+
+**Prompt example:**
+
+```text
+Generate a summary of this text. [max 40 words]
+Identify the three main topics covered. [max 10 words each]
+```
+
+**Output schema example:**
+
+```python
+class EvaluationResult(BaseModel):
+    evaluation: str = Field(description="Most important aspects of the text [max 100 words]")
+    verdict: Literal["PASS", "FAIL"]
+    improvement_suggestion: str = Field(description="Concrete next step for the author [max 30 words]")
+```
+
+**Combined prompt + schema example:**
+
+```python
+prompt = """
+Evaluate the following document against our quality criteria. [max 200 words total]
+
+Return a JSON object with:
+- "evaluation": overall assessment [max 100 words]
+- "verdict": "PASS" or "FAIL"
+- "improvement_suggestion": one concrete improvement [max 30 words]
+"""
+```
+
 ## References
 
 - [agentme-edr-019](019-ai-agents-development-standards.md) — Agent implementation standards (deepagents, tool-invocation loops)
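
Rule `06-output-length-constraints` only says where the `[max N words]` marker must appear; it does not say how to check a response against it. The following is a minimal sketch of one way to do that with the standard library. The names `word_limit` and `over_limit_fields` are hypothetical and not part of the package, and the sketch assumes words are whitespace-separated. `[max N tokens]` markers are ignored because counting tokens would need a model-specific tokenizer.

```python
import re
from typing import Dict, List, Optional

# Hypothetical helpers; only the "[max N words]" marker comes from the rule text.
LIMIT_MARKER = re.compile(r"\[max (\d+) words\b[^\]]*\]")


def word_limit(description: str) -> Optional[int]:
    """Return N from a '[max N words ...]' marker, or None if no marker is present."""
    match = LIMIT_MARKER.search(description)
    return int(match.group(1)) if match else None


def over_limit_fields(descriptions: Dict[str, str], output: Dict[str, str]) -> List[str]:
    """List the fields whose generated text has more words than its description allows."""
    violations = []
    for name, description in descriptions.items():
        limit = word_limit(description)
        if limit is None:
            continue  # enumerations and short codes carry no word limit
        if len(output.get(name, "").split()) > limit:
            violations.append(name)
    return violations
```

With Pydantic models such as `EvaluationResult`, the `descriptions` mapping could be built from the field descriptions, so the limit stays defined in one place.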
@@ -197,6 +197,58 @@ The current OS is: [operating system name].
 | `<OUTPUT_FORMAT>` | Required | MUST include a concrete schema or templated example; do not leave it vague. When multiple output formats are possible, MUST use mandatory language to specify exactly which one to use and explicitly exclude the others. |
 | `<WORKFLOW_CONTEXT>` | Conditional | MUST be omitted for standalone agents. MUST be present when the agent runs as a node inside a LangGraph workflow. |
 
+#### 07-agent-output-format
+
+The format of an agent's final output MUST be chosen based on who or what consumes it:
+
+| Consumer | Required format |
+|---|---|
+| Code (workflow node, downstream function, API caller) | JSON object matching a declared schema |
+| Human (user-facing summary, review comment, prose report) | Natural language; JSON is NOT required |
+
+**When output is consumed by code:**
+
+- The agent's `<OUTPUT_FORMAT>` system prompt section MUST specify a JSON schema or a concrete typed example.
+- The `<OUTPUT_FORMAT>` section MUST include the instruction: `ALWAYS return valid JSON. NEVER include prose outside the JSON block.`
+- The corresponding Python return type MUST be a typed dataclass or Pydantic model that mirrors the schema, not a plain `dict` or `str`.
+- The caller MUST parse and validate the JSON against the declared type before passing it downstream.
+
+**Example — code-consumed agent output:**
+
+```python
+from dataclasses import dataclass
+from typing import List
+import json
+
+@dataclass
+class FileAnalysisResult:
+    status: str  # "success" | "failure"
+    files_found: int
+    issues: List[str]
+
+# System prompt OUTPUT_FORMAT section:
+# <OUTPUT_FORMAT>
+# Respond with a JSON object matching this schema:
+# {
+#   "status": "success" | "failure",
+#   "files_found": <integer>,
+#   "issues": ["<string>", ...],
+#   "summary": "summary of the work performed. max 30 words",
+# }
+# ALWAYS return valid JSON. NEVER include prose outside the JSON block.
+# </OUTPUT_FORMAT>
+
+def parse_agent_output(raw: str) -> FileAnalysisResult:
+    data = json.loads(raw)
+    return FileAnalysisResult(**data)
+```
+
+**When output is consumed by a human:**
+
+- Natural language is correct and preferred.
+- Do NOT wrap prose in JSON — it adds noise without value.
+- The `<OUTPUT_FORMAT>` section MUST still describe the expected structure (e.g., a list of findings, a summary paragraph) using a prose template.
+
 ## References
 
 - [agentme-edr-018](018-ai-llm-development-standards.md) — LLM development standards (LangChain configuration, mocking patterns)
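
Two details in the `07-agent-output-format` example are worth noting. The `<OUTPUT_FORMAT>` schema lists a `summary` key that `FileAnalysisResult` does not declare, so `FileAnalysisResult(**data)` would raise `TypeError` on a conforming response. Also, a dataclass constructor does not check field types, so it does not satisfy the "parse and validate" rule on its own. The sketch below is one possible stricter variant using only the standard library. The added `summary` field and every check are assumptions about the intent, not the package's implementation.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class FileAnalysisResult:
    status: str  # "success" | "failure"
    files_found: int
    issues: List[str]
    summary: str  # assumption: added to mirror the "summary" key in the prompt schema


def parse_agent_output(raw: str) -> FileAnalysisResult:
    """Parse raw agent output and reject anything that does not match the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {"status", "files_found", "issues", "summary"}
    if set(data) != expected:
        raise ValueError(f"unexpected or missing keys: {sorted(set(data) ^ expected)}")
    if data["status"] not in ("success", "failure"):
        raise ValueError("status must be 'success' or 'failure'")
    # bool is a subclass of int, so it is excluded explicitly
    if isinstance(data["files_found"], bool) or not isinstance(data["files_found"], int):
        raise ValueError("files_found must be an integer")
    if not isinstance(data["issues"], list) or not all(isinstance(i, str) for i in data["issues"]):
        raise ValueError("issues must be a list of strings")
    if not isinstance(data["summary"], str):
        raise ValueError("summary must be a string")
    return FileAnalysisResult(**data)
```

A Pydantic model would give the same guarantees with less code. The hand-written checks are used here only to keep the sketch dependency-free.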
@@ -30,8 +30,8 @@ evals/
   <component>/  # the component being evaluated (e.g., workflow-x, agent-y, model-z)
     eval-<name>/
       dataset/  # EDR-024 compliant dataset (README.md, dataset.schema.json, data/)
-      eval
-
+      eval.py  # evaluation script
+      report.md  # generated report (overwritten on each run — see rule 03)
       Makefile  # eval and run targets
     eval-<name2>/
       ...
@@ -62,13 +62,13 @@ eval:
 
 #### 02-eval-script-requirements
 
-Each `eval
+Each `eval.py` script MUST:
 
 - Load the dataset from `dataset/` in the same eval folder, following [agentme-edr-024](024-ml-dataset-structure.md). For input/output pairs, use the JSONL format per `agentme-edr-024.04-complex-structured-datasets-must-use-jsonl`.
 - Run every input through the live component against **real LLM providers** (not mocked responses), to capture model drift.
 - Log per-sample and aggregate metrics to an MLflow experiment that runs **locally** — a remote MLflow server MUST NOT be required.
 - Compare outputs to expected values using project-defined quality thresholds. Thresholds MUST be declared explicitly (e.g., in a Makefile variable or README).
-- Write `
+- Write `report.md` in the same folder per rule `03-eval-report-file`.
 - Exit with a non-zero status when any metric falls below its defined threshold, consistent with [agentme-edr-007](../principles/007-project-quality-standards.md) rule `07-statistical-models-must-have-eval-targets`.
 
 **Example:**
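
The last requirement of `02-eval-script-requirements` (exit non-zero when any metric falls below its threshold) can be isolated from the MLflow logging. The sketch below shows one possible shape for that gate. `failed_metrics` and `exit_code` are hypothetical names, and treating a missing metric as a failure is an assumption rather than something the rule states.

```python
from typing import Dict, List


def failed_metrics(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or below their threshold."""
    return [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]


def exit_code(metrics: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """Return 0 when every threshold is met and 1 otherwise."""
    return 1 if failed_metrics(metrics, thresholds) else 0
```

An `eval.py` script could end with `sys.exit(exit_code(metrics, thresholds))` so that `make eval` fails in CI when quality regresses.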
@@ -96,7 +96,7 @@ with mlflow.start_run() as run:
 
 #### 03-eval-report-file
 
-Each eval script MUST produce `
+Each eval script MUST produce `report.md` in the same `evals/<component>/eval-<name>/` folder and overwrite it on every run.
 
 **Generation constraint:** The report MUST be produced programmatically, reading raw metric values directly from MLflow. No LLM or generative model may write, summarize, or paraphrase any section of the report, to prevent hallucinated metric values.
 
@@ -107,7 +107,7 @@ The report MUST follow this template:
 
 **Date:** <ISO date>
 **Dataset:** dataset/
-**Script:** eval
+**Script:** eval.py
 **Thresholds:** accuracy ≥ <value>, F1 ≥ <value>
 
 ## Overall Results
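
Because `03-eval-report-file` forbids any generative model from writing the report, the template has to be filled with plain string formatting. The sketch below renders only the header fields visible in this diff (title, date, dataset, script, thresholds). The function name `render_report_header` is hypothetical, the `# Eval Report:` title line is taken from the filled-in example, and the values are assumed to have already been read from MLflow.

```python
from datetime import date
from typing import Dict


def render_report_header(eval_name: str, run_date: date, thresholds: Dict[str, float]) -> str:
    """Render the header lines of report.md with string formatting only (no LLM)."""
    threshold_text = ", ".join(f"{name} ≥ {value:.2f}" for name, value in thresholds.items())
    lines = [
        f"# Eval Report: {eval_name}",
        "",
        f"**Date:** {run_date.isoformat()}",
        "**Dataset:** dataset/",
        "**Script:** eval.py",
        f"**Thresholds:** {threshold_text}",
        "",
    ]
    return "\n".join(lines)
```

Opening `report.md` in write mode (`"w"`) and writing this string, followed by the results sections, would also satisfy the overwrite-on-every-run requirement.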
@@ -142,14 +142,14 @@ $$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac
 
 Where $\hat{p}$ is observed accuracy and $n$ is sample count. Accuracy and F1 are required; precision and recall are recommended.
 
-**Filled-in example** (`evals/workflow-document-review/eval-basic/
+**Filled-in example** (`evals/workflow-document-review/eval-basic/report.md` for a document review workflow):
 
 ```markdown
 # Eval Report: eval-basic
 
 **Date:** 2026-06-12
 **Dataset:** dataset/
-**Script:** eval
+**Script:** eval.py
 **Thresholds:** accuracy ≥ 0.85, F1 ≥ 0.80
 
 ## Overall Results
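
The formula in the hunk header above is cut off by the diff viewer. Its visible terms match the standard Wilson score interval, so the sketch below assumes that is the intended formula. The default `z = 1.96` (a 95% interval) is also an assumption, since the confidence level is not visible in this diff.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an observed proportion p_hat over n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    denominator = 1 + z**2 / n
    center = p_hat + z**2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (center - margin) / denominator, (center + margin) / denominator
```

For an observed accuracy of 0.90 on 100 samples this gives an interval of roughly 0.83 to 0.94. With that sample size, an accuracy threshold of 0.85 lies inside the interval, so a single run cannot cleanly confirm it.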
@@ -182,6 +182,12 @@ Where $\hat{p}$ is observed accuracy and $n$ is sample count. Accuracy and F1 ar
 - MLflow run: experiment `workflow-document-review/eval-basic` — view with `mlflow ui`
 ```
 
+#### 04-eval-mlflow-unique-port
+
+Each `evals/<component>/eval-<name>/Makefile` MUST start its MLflow tracking server on a **unique port** to prevent conflicts when multiple eval Makefiles are run concurrently or in parallel (e.g., in CI or across multiple terminal sessions).
+
+Ports MUST be statically assigned per eval scenario and MUST NOT reuse the default `5000` port (reserved for `dev-mlflow` per [agentme-edr-008](../devops/008-common-targets.md) rule `09-ai-project-dev-targets`). Assign ports starting at `5100` and incrementing by 1 for each additional eval scenario across the entire project.
+
 ## References
 
 - [agentme-edr-007](../principles/007-project-quality-standards.md) — Project quality standards: when evals are required per AI tier (rule `09-ai-project-testing-requirements`) and statistical model eval targets (rule `07-statistical-models-must-have-eval-targets`)
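
The numbering scheme in `04-eval-mlflow-unique-port` (start at `5100`, add 1 per eval scenario, never use `5000`) is simple enough to check mechanically. The sketch below is a hypothetical helper pair, not part of the package: `assign_ports` proposes ports for a list of scenarios, and `port_conflicts` audits an existing mapping. The scenario names are illustrative.

```python
from typing import Dict, List

RESERVED_PORT = 5000    # default MLflow port, reserved for dev-mlflow
FIRST_EVAL_PORT = 5100  # first port handed out to eval scenarios


def assign_ports(eval_scenarios: List[str]) -> Dict[str, int]:
    """Map each eval scenario to a port, counting up from 5100 in list order."""
    return {name: FIRST_EVAL_PORT + offset for offset, name in enumerate(eval_scenarios)}


def port_conflicts(ports: Dict[str, int]) -> List[str]:
    """Return scenarios that use the reserved port or repeat an earlier scenario's port."""
    seen = set()
    conflicts = []
    for name, port in ports.items():
        if port == RESERVED_PORT or port in seen:
            conflicts.append(name)
        seen.add(port)
    return conflicts
```

Since the rule requires static assignment, the ports would still be written into each Makefile by hand. The audit function is the part that could plausibly run in CI against the values collected from those Makefiles.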