agentme 0.20.0 → 0.21.0
@@ -178,7 +178,7 @@ e.g.: Respond with a JSON object matching this schema: ... ALWAYS return valid J
 </WORKFLOW_CONTEXT>
 
 <SYSTEM_CONTEXT>
-The current date
+The current date is [today in YYYY-MM-DD format].
 The current OS is: [operating system name].
 </SYSTEM_CONTEXT>
 ```
@@ -187,7 +187,7 @@ The current OS is: [operating system name].
 
 | Section | Required? | Notes |
 |---|---|---|
-| `<SYSTEM_CONTEXT>` | Optional | Runtime environment context injected at invocation time (e.g., current date
+| `<SYSTEM_CONTEXT>` | Optional | Runtime environment context injected at invocation time (e.g., current date in YYYY-MM-DD, OS). Include whenever the agent may need temporal or environment awareness. Time MUST NOT be included — it changes every second and breaks prompt caching. |
 | `<OBJECTIVE>` | Required | One or two sentences summarising the agent's main deliverable. |
 | `<ROLE>` | Required | Agent persona and expertise. When inside a workflow, MUST reference its node name from `<WORKFLOW_CONTEXT>`. |
 | `<INPUT>` | Required | List ALL inputs. For workflow agents: workflow-level inputs first, then agent-specific inputs. |
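The date-only rule in the `<SYSTEM_CONTEXT>` row above can be illustrated with a minimal Python sketch. The function name `build_system_context` and its parameters are hypothetical and not part of agentme; the sketch only assumes the two-line block format shown in the first hunk.

```python
import datetime
import platform


def build_system_context(today=None, os_name=None):
    """Render a <SYSTEM_CONTEXT> block that carries the date but no time."""
    # Date only (YYYY-MM-DD): the block stays byte-identical for a whole day,
    # so a cached prompt prefix is not invalidated on every call.
    today = today or datetime.date.today()
    os_name = os_name or platform.system()
    return (
        "<SYSTEM_CONTEXT>\n"
        f"The current date is {today.isoformat()}.\n"
        f"The current OS is: {os_name}.\n"
        "</SYSTEM_CONTEXT>"
    )


if __name__ == "__main__":
    print(build_system_context())
```

Passing a `datetime.date` rather than a `datetime.datetime` matters here: `isoformat()` on a `datetime.datetime` would append the time of day, which is exactly what the rule forbids.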
@@ -23,36 +23,41 @@ For when evals are required per AI tier, see [agentme-edr-007](../principles/007
 
 #### 01-eval-folder-structure
 
-
+Evals are grouped first by the component being evaluated, then by the specific evaluation scenario. Create one directory per component under `evals/`, and one directory per eval scenario inside it. Place `evals/` at the same level as `lib/` and `examples/`:
 
 ```text
 evals/
-
-
-
-
-
-
+  <component>/ # the component being evaluated (e.g., workflow-x, agent-y, model-z)
+    eval-<name>/
+      dataset/ # EDR-024 compliant dataset (README.md, dataset.schema.json, data/)
+      eval-<name>.py # evaluation script
+      eval-<name>-report.md # generated report (overwritten on each run — see rule 03)
+      Makefile # eval and run targets
+    eval-<name2>/
+      ...
+  <component2>/
     ...
 ```
 
-
+`<component>` MUST match the name of the component under evaluation and use lowercase hyphen-separated words (e.g., `workflow-document-review`, `agent-support`, `model-classifier`).
+
+`<name>` identifies the specific evaluation scenario using lowercase hyphen-separated words (e.g., `eval-basic`, `eval-complex`, `eval-edge-cases`, `eval-bias-test`).
 
 The `dataset/` subfolder MUST be a valid [agentme-edr-024](024-ml-dataset-structure.md) dataset — it MUST include `README.md` and `dataset.schema.json` at its root. For input/output pairs, use JSONL files per `agentme-edr-024.04-complex-structured-datasets-must-use-jsonl`.
 
-Each `evals
+Each `evals/<component>/eval-<name>/Makefile` MUST define:
 
 | Target | Behaviour |
 |---|---|
 | `eval` | Runs the eval with threshold enforcement; exits non-zero on failure (CI-safe) |
 | `run` | Runs the eval without threshold enforcement (exploration / debugging) |
 
-The module root Makefile MUST expose a `make eval` target that delegates to `eval` in every `evals
+The module root Makefile MUST expose a `make eval` target that delegates to `eval` in every `evals/<component>/eval-<name>/Makefile`:
 
 ```makefile
 eval:
-	$(MAKE) -C evals/eval-basic eval
-	$(MAKE) -C evals/eval-complex eval
+	$(MAKE) -C evals/workflow-document-review/eval-basic eval
+	$(MAKE) -C evals/workflow-document-review/eval-complex eval
 ```
 
 #### 02-eval-script-requirements
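The layout rules added in the hunk above (one directory per component, one `eval-<name>/` directory per scenario, a fixed set of files in each) could be checked mechanically. The following Python sketch is one possible way to do that; `find_layout_problems` and the `KEBAB` pattern are hypothetical names, and allowing digits inside "lowercase hyphen-separated words" is an assumption. The generated report file is deliberately not required, since it only exists after a run.

```python
import re
from pathlib import Path

# Assumption: "lowercase hyphen-separated words" may also contain digits.
KEBAB = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def find_layout_problems(evals_root):
    """Return human-readable problems found under an evals/ directory."""
    problems = []
    for component in sorted(p for p in Path(evals_root).iterdir() if p.is_dir()):
        if not KEBAB.fullmatch(component.name):
            problems.append(
                f"{component.name}: component name is not lowercase hyphen-separated"
            )
        for scenario in sorted(p for p in component.iterdir() if p.is_dir()):
            label = f"{component.name}/{scenario.name}"
            if not (scenario.name.startswith("eval-") and KEBAB.fullmatch(scenario.name)):
                problems.append(f"{label}: scenario name must look like eval-<name>")
            # Files every scenario folder is expected to contain before a run.
            required = [
                "Makefile",
                f"{scenario.name}.py",
                "dataset/README.md",
                "dataset/dataset.schema.json",
            ]
            for rel in required:
                if not (scenario / rel).is_file():
                    problems.append(f"{label}: missing {rel}")
    return problems
```

Calling `find_layout_problems("evals")` from the module root returns an empty list when every scenario folder follows the convention, so the result could gate a CI step.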
@@ -63,7 +68,7 @@ Each `eval-<name>.py` script MUST:
 - Run every input through the live component against **real LLM providers** (not mocked responses), to capture model drift.
 - Log per-sample and aggregate metrics to an MLflow experiment that runs **locally** — a remote MLflow server MUST NOT be required.
 - Compare outputs to expected values using project-defined quality thresholds. Thresholds MUST be declared explicitly (e.g., in a Makefile variable or README).
-- Write `eval
+- Write `eval-<name>-report.md` in the same folder per rule `03-eval-report-file`.
 - Exit with a non-zero status when any metric falls below its defined threshold, consistent with [agentme-edr-007](../principles/007-project-quality-standards.md) rule `07-statistical-models-must-have-eval-targets`.
 
 **Example:**
@@ -91,7 +96,7 @@ with mlflow.start_run() as run:
 
 #### 03-eval-report-file
 
-Each eval script MUST produce `eval
+Each eval script MUST produce `eval-<name>-report.md` in the same `evals/<component>/eval-<name>/` folder and overwrite it on every run.
 
 **Generation constraint:** The report MUST be produced programmatically, reading raw metric values directly from MLflow. No LLM or generative model may write, summarize, or paraphrase any section of the report, to prevent hallucinated metric values.
 
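The generation constraint in the hunk above amounts to plain string formatting over raw numbers. A minimal Python sketch follows, assuming the metrics have already been read into a dict (for example from `mlflow.get_run(run_id).data.metrics`). The function name `render_report` and the table layout are hypothetical; the actual report template is not visible in this diff.

```python
def render_report(name, metrics, thresholds):
    """Render a Markdown report from raw metric values, with no generated prose."""
    lines = [
        f"# Eval Report: {name}",
        "",
        "| Metric | Value | Threshold | Status |",
        "|---|---|---|---|",
    ]
    # Sorted keys keep the output deterministic across runs.
    for metric in sorted(metrics):
        value = metrics[metric]
        threshold = thresholds.get(metric)
        if threshold is None:
            limit, status = "n/a", "n/a"
        else:
            limit = f"{threshold:.4f}"
            status = "PASS" if value >= threshold else "FAIL"
        lines.append(f"| {metric} | {value:.4f} | {limit} | {status} |")
    return "\n".join(lines) + "\n"
```

Writing the returned string with `Path("eval-basic-report.md").write_text(...)` replaces any previous file, which matches the overwrite-on-every-run requirement.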
@@ -137,7 +142,7 @@ $$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac
 
 Where $\hat{p}$ is observed accuracy and $n$ is sample count. Accuracy and F1 are required; precision and recall are recommended.
 
-**Filled-in example** (`evals/eval-basic/eval-report.md` for a document review workflow):
+**Filled-in example** (`evals/workflow-document-review/eval-basic/eval-basic-report.md` for a document review workflow):
 
 ```markdown
 # Eval Report: eval-basic
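The formula truncated in the hunk header above has the shape of the standard Wilson score interval, with $\hat{p}$ as observed accuracy and $n$ as sample count. A minimal Python sketch of that interval is shown below, on the assumption that this is the interval the report uses; the function name `wilson_interval` and the default `z=1.96` (roughly 95% confidence) are illustrative choices, not taken from the package.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion such as accuracy."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / denom, (centre + margin) / denom


if __name__ == "__main__":
    low, high = wilson_interval(successes=9, n=10)
    print(f"accuracy 0.90, 95% CI [{low:.3f}, {high:.3f}]")
```

With small eval datasets the interval is wide (9 correct out of 10 gives roughly 0.60 to 0.98), which is a useful reminder when setting thresholds on a handful of samples.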
@@ -174,7 +179,7 @@ Where $\hat{p}$ is observed accuracy and $n$ is sample count. Accuracy and F1 ar
 ## Notes
 
 - Sample 005 misclassified: redlined IP clause not flagged as escalation trigger. Possible model drift.
-- MLflow run: experiment `
+- MLflow run: experiment `workflow-document-review/eval-basic` — view with `mlflow ui`
 ```
 
 ## References
@@ -118,13 +118,14 @@ Directory and file layout must be self-explanatory: source code, tests, configur
 
 #### 06-libraries-must-have-runnable-examples
 
-Projects that are libraries or shared utilities must include an `examples/` directory. Each subdirectory represents a usage scenario and must be independently runnable. Examples are executed as part of `make test`.
+Projects that are libraries or shared utilities must include an `examples/` directory. Each subdirectory represents a usage scenario and must be independently runnable. Examples that are "offline" (require no external credentials, no running servers, no paid APIs, and no environment-specific configuration outside the repository) must be executed as part of `make test`. Examples that depend on external entities may be left out of `make test`.
 
 **Requirements:**
 - `examples/` must contain at least one subdirectory per major usage scenario
 - Each scenario subdirectory must have a `Makefile` with a `run` target
 - Examples must import the library as an external consumer (not via relative `../src` imports)
-- `make test` in the root must run all examples; failures block CI and releases
+- `make test` in the root must run all offline examples; failures block CI and releases
+- Examples that depend on external entities must not be included in `make test`
 
 **Directory layout:**
 