agentme 0.18.0 → 0.19.0

@@ -107,12 +107,13 @@ Eval folder structure and script requirements are defined in [agentme-edr-028](0
 
  LangGraph node names MUST follow a suffix convention that communicates the node's role at a glance. Names MUST be action-oriented and descriptive.
 
- | Suffix | Node type | When to use |
+ | Convention | Node type | When to use |
  |---|---|---|
- | `_llm` | LLM call | Any node whose primary action is a direct LLM inference call (see [agentme-edr-018](018-ai-llm-development-standards.md)) |
- | `_step` | Algorithmic step | Deterministic logic with no LLM involvement (transformation, validation, routing) |
- | `_tool` | Tool/API call | A node that wraps a single external tool or API (e.g. a REST endpoint, DB query) |
- | `_agent` | Subgraph agent | A node that invokes a nested subgraph containing its own tool-invocation cycle and LLM calls; use the **deepagents** library for these nodes (see [agentme-edr-019](019-ai-agents-development-standards.md)) |
+ | suffix `_llm` | LLM call | Any node whose primary action is a direct LLM inference call (see [agentme-edr-018](018-ai-llm-development-standards.md)) |
+ | suffix `_step` | Algorithmic step | Deterministic logic with no LLM involvement (transformation, validation, routing) |
+ | suffix `_tool` | Tool/API call | A node that wraps a single external tool or API (e.g. a REST endpoint, DB query) |
+ | suffix `_agent` | Subgraph agent | A node that invokes a nested subgraph containing its own tool-invocation cycle and LLM calls; use the **deepagents** library for these nodes (see [agentme-edr-019](019-ai-agents-development-standards.md)) |
+ | prefix `evaluate_` | Judge node | A node that evaluates the quality, correctness, completeness, or progress of prior outputs and returns a structured verdict; MUST follow rule `13-judge-node-output-format` |
 
  The Python function implementing the node SHOULD share the same name as the node alias passed to `add_node`, so that graph definitions and stack traces remain unambiguous:
 
@@ -131,6 +132,8 @@ graph.add_node("code_reviewer_agent", code_reviewer_agent)
 
  Names MUST NOT use generic labels such as `node1`, `process`, or `run`. Each name must clearly express what action the node performs.
 
+ Judge nodes use a **prefix** convention instead of a suffix: the name MUST start with `evaluate_` followed by the subject being judged (e.g. `evaluate_progress`, `evaluate_quality`, `evaluate_completeness`, `evaluate_relevance`). This makes judge nodes immediately distinguishable from all other node types at a glance.
+
  #### 10-workflow-unit-testing
 
  All LLM calls within workflow nodes are external API calls and MUST be mocked in unit tests per [agentme-edr-018](018-ai-llm-development-standards.md) rule `04-unit-test-mocking`. Workflow unit tests must run fully offline with no real LLM provider calls.
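
Reviewer sketch (not part of the package): the naming conventions in the two hunks above (suffixes `_llm`, `_step`, `_tool`, `_agent` and the new prefix `evaluate_`) can be checked mechanically. The helper name `classify_node_name` and the constant names are hypothetical. The rule text does not say which convention wins when a name matches both (for example `evaluate_quality_llm`), so this sketch assumes the judge prefix takes precedence.

```python
# Hypothetical helper, not part of agentme: maps a node name to the node type
# listed in the convention table.
SUFFIX_TYPES = {
    "_llm": "LLM call",
    "_step": "Algorithmic step",
    "_tool": "Tool/API call",
    "_agent": "Subgraph agent",
}
JUDGE_PREFIX = "evaluate_"
# The generic labels the rule names as forbidden examples.
GENERIC_NAMES = {"node1", "process", "run"}


def classify_node_name(name: str) -> str:
    """Return the node type for `name`; raise ValueError if no convention matches."""
    if name in GENERIC_NAMES:
        raise ValueError(f"generic node name is not allowed: {name!r}")
    # Assumption: the judge prefix is checked first and needs a non-empty subject.
    if name.startswith(JUDGE_PREFIX) and len(name) > len(JUDGE_PREFIX):
        return "Judge node"
    for suffix, node_type in SUFFIX_TYPES.items():
        # Require some descriptive text before the suffix.
        if name.endswith(suffix) and len(name) > len(suffix):
            return node_type
    raise ValueError(f"node name matches no convention: {name!r}")


if __name__ == "__main__":
    for node in ("code_reviewer_agent", "evaluate_progress", "revise_draft_llm"):
        print(node, "->", classify_node_name(node))
```

A check like this could run in a unit test over the names passed to `add_node`, which keeps it offline and consistent with rule `10-workflow-unit-testing`.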
@@ -222,6 +225,72 @@ Choose a name that summarises what the workflow consumes, processes, and produce
 
  **Bad names** (FORBIDDEN): `MainWorkflow`, `AgentGraph`, `ProcessFlow`, `Workflow1`, `RunGraph`.
 
+ #### 13-judge-node-output-format
+
+ Every node whose name starts with `evaluate_` (a judge node) MUST return a structured verdict object as its output. This ensures all judge nodes are interchangeable and their results can be uniformly consumed by downstream routing logic, logged, and compared across runs.
+
+ **Required output schema:**
+
+ ```python
+ from typing import Literal, Optional
+ from dataclasses import dataclass, field
+
+ FindingLevel = Literal["OK", "INFO", "WARNING", "ERROR"]
+
+ @dataclass
+ class JudgeFinding:
+     level: FindingLevel
+     # MUST: short action-oriented label; < 10 words
+     title: str
+     # MUST when level != "OK": why this is an issue; < 30 words
+     reason: Optional[str] = None
+     # MUST when level != "OK": notes/findings using mandatory (MUST) or advisory (SHOULD) language; < 400 words
+     details: Optional[str] = None
+     # OPTIONAL: possible fixes, only when directly inferrable from the finding without further analysis; < 200 words
+     fix: Optional[str] = None
+
+ @dataclass
+ class JudgeVerdict:
+     # MUST: highest severity level across all findings; "OK" only when every finding is "OK"
+     verdict: FindingLevel
+     # MUST: at least one finding present
+     findings: list[JudgeFinding] = field(default_factory=list)
+ ```
+
+ Example (for logging, state storage, and inter-node communication):
+
+ ```json
+ {
+   "verdict": "WARNING",
+   "findings": [
+     {
+       "level": "OK",
+       "title": "All required sections present"
+     },
+     {
+       "level": "WARNING",
+       "title": "Code coverage below threshold",
+       "reason": "Current coverage is 62%, minimum required is 80%.",
+       "details": "The following modules have no test coverage: auth.py, payments.py. SHOULD add unit tests for all public methods in these modules.",
+       "fix": "Add unit tests for auth.py and payments.py. Run `make test-coverage` to verify the threshold is met."
+     }
+   ]
+ }
+ ```
+
+ **Routing from judge nodes:**
+
+ Downstream conditional edges MUST route on `verdict` only:
+
+ ```python
+ def route_after_evaluate_quality(state) -> str:
+     if state["evaluate_quality_result"].verdict in ("ERROR", "WARNING"):
+         return "revise_draft_llm"
+     return "publish_step"
+ ```
+
+ **Logging:** Log `verdict` and the count of each level as MLflow metrics on the current run per rule `03-observability-and-experiment-tracking`.
+
  #### 15-workflow-state-persistence
 
  For long-running workflows that may need to be paused and resumed:
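
Reviewer sketch (not part of the package): rule `13-judge-node-output-format` in the hunk above requires `verdict` to be the highest severity across all findings, at least one finding, and `reason` plus `details` whenever a finding is not `OK`. The rule does not spell out the severity order, so the code below assumes `OK < INFO < WARNING < ERROR`, following the order in `FindingLevel`. The helpers `build_verdict` and `level_counts` are hypothetical names; the dataclasses are repeated only so the block runs on its own.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Literal, Optional

FindingLevel = Literal["OK", "INFO", "WARNING", "ERROR"]

# Assumption: severity follows the order in which FindingLevel lists the levels.
SEVERITY = {"OK": 0, "INFO": 1, "WARNING": 2, "ERROR": 3}


@dataclass
class JudgeFinding:
    level: FindingLevel
    title: str
    reason: Optional[str] = None
    details: Optional[str] = None
    fix: Optional[str] = None


@dataclass
class JudgeVerdict:
    verdict: FindingLevel
    findings: list[JudgeFinding] = field(default_factory=list)


def build_verdict(findings: list[JudgeFinding]) -> JudgeVerdict:
    """Hypothetical helper: derive `verdict` from the findings instead of setting it by hand."""
    if not findings:
        raise ValueError("a judge verdict needs at least one finding")
    for finding in findings:
        # `reason` and `details` are mandatory for every non-OK finding.
        if finding.level != "OK" and not (finding.reason and finding.details):
            raise ValueError(f"finding {finding.title!r} needs reason and details")
    worst = max(findings, key=lambda f: SEVERITY[f.level]).level
    return JudgeVerdict(verdict=worst, findings=list(findings))


def level_counts(verdict: JudgeVerdict) -> dict[str, int]:
    """Hypothetical helper: count findings per level, including levels with zero findings."""
    counts = Counter(f.level for f in verdict.findings)
    return {level: counts.get(level, 0) for level in SEVERITY}


if __name__ == "__main__":
    result = build_verdict([
        JudgeFinding(level="OK", title="All required sections present"),
        JudgeFinding(
            level="WARNING",
            title="Code coverage below threshold",
            reason="Current coverage is 62%, minimum required is 80%.",
            details="SHOULD add unit tests for auth.py and payments.py.",
        ),
    ])
    print(result.verdict, level_counts(result))
```

The per-level counts are plain integers, so each could be passed to `mlflow.log_metric` on the current run to satisfy the logging requirement.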
@@ -23,41 +23,47 @@ For when evals are required per AI tier, see [agentme-edr-007](../principles/007
 
  #### 01-eval-folder-structure
 
- For each AI component being evaluated (an LLM chain, agent, or workflow), create a corresponding directory under `evals/` at the same level as `lib/` and `examples/`:
+ Each named eval is a self-contained unit. Create one directory per eval under `evals/` at the same level as `lib/` and `examples/`:
 
  ```text
  evals/
-   <component>/
-     Makefile # eval targets for this component
-     dataset_<group>/ # one folder per eval group (see agentme-edr-024)
-     eval_<group>.py # evaluation script for each group
+   eval-<name>/
+     dataset/ # EDR-024 compliant dataset (README.md, dataset.schema.json, data/)
+     eval-<name>.py # evaluation script
+     eval-report.md # generated report (overwritten on each run — see rule 03)
+     Makefile # eval and run targets
+   eval-<name2>/
+     ...
  ```
 
- Where `<component>` is the name of the LLM chain, agent, or workflow being evaluated (e.g., `summarizer`, `file_analyzer_agent`, `document_review_workflow`).
+ Where `<name>` identifies the specific evaluation scenario (e.g., `eval-basic`, `eval-complex`, `eval-edge-cases`).
 
- The per-component `evals/<component>/Makefile` MUST define:
+ The `dataset/` subfolder MUST be a valid [agentme-edr-024](024-ml-dataset-structure.md) dataset — it MUST include `README.md` and `dataset.schema.json` at its root. For input/output pairs, use JSONL files per `agentme-edr-024.04-complex-structured-datasets-must-use-jsonl`.
+
+ Each `evals/eval-<name>/Makefile` MUST define:
 
  | Target | Behaviour |
  |---|---|
- | `eval` | Runs all eval groups for the component |
- | `eval-<group>` | Runs one named group (e.g. `eval-simple`, `eval-complex`) |
+ | `eval` | Runs the eval with threshold enforcement; exits non-zero on failure (CI-safe) |
+ | `run` | Runs the eval without threshold enforcement (exploration / debugging) |
 
- The module root Makefile MUST expose a `make eval` target that delegates to `eval` in every `evals/<component>/Makefile`:
+ The module root Makefile MUST expose a `make eval` target that delegates to `eval` in every `evals/eval-<name>/Makefile`:
 
  ```makefile
  eval:
- 	$(MAKE) -C evals/summarizer eval
- 	$(MAKE) -C evals/document_review_workflow eval
+ 	$(MAKE) -C evals/eval-basic eval
+ 	$(MAKE) -C evals/eval-complex eval
  ```
 
  #### 02-eval-script-requirements
 
- Each `eval_<group>.py` script MUST:
+ Each `eval-<name>.py` script MUST:
 
- - Load the dataset from `evals/<component>/dataset_<group>/` following [agentme-edr-024](024-ml-dataset-structure.md). For input/output pairs, use the JSONL format per `agentme-edr-024.04-complex-structured-datasets-must-use-jsonl`.
+ - Load the dataset from `dataset/` in the same eval folder, following [agentme-edr-024](024-ml-dataset-structure.md). For input/output pairs, use the JSONL format per `agentme-edr-024.04-complex-structured-datasets-must-use-jsonl`.
  - Run every input through the live component against **real LLM providers** (not mocked responses), to capture model drift.
  - Log per-sample and aggregate metrics to an MLflow experiment that runs **locally** — a remote MLflow server MUST NOT be required.
  - Compare outputs to expected values using project-defined quality thresholds. Thresholds MUST be declared explicitly (e.g., in a Makefile variable or README).
+ - Write `eval-report.md` in the same folder per rule `03-eval-report-file`.
  - Exit with a non-zero status when any metric falls below its defined threshold, consistent with [agentme-edr-007](../principles/007-project-quality-standards.md) rule `07-statistical-models-must-have-eval-targets`.
 
  **Example:**
@@ -68,19 +74,109 @@ from my_package.app.workflows.document_review_workflow.graph import graph
 
  EVAL_MIN_ACCURACY = 0.85
 
- with mlflow.start_run():
+ with mlflow.start_run() as run:
      results = []
-     for sample in load_dataset("evals/document_review_workflow/dataset_basic/"):
+     for sample in load_dataset("dataset/"):
          output = graph.invoke({"document": sample["input"]})
          results.append(output["label"] == sample["expected_label"])
 
      accuracy = sum(results) / len(results)
      mlflow.log_metric("accuracy", accuracy)
 
+     write_eval_report(run, results, thresholds={"accuracy": EVAL_MIN_ACCURACY})
+
      if accuracy < EVAL_MIN_ACCURACY:
          raise SystemExit(f"Eval failed: accuracy {accuracy:.2f} < {EVAL_MIN_ACCURACY}")
  ```
 
+ #### 03-eval-report-file
+
+ Each eval script MUST produce `eval-report.md` in the same `evals/eval-<name>/` folder and overwrite it on every run.
+
+ **Generation constraint:** The report MUST be produced programmatically, reading raw metric values directly from MLflow. No LLM or generative model may write, summarize, or paraphrase any section of the report, to prevent hallucinated metric values.
+
+ The report MUST follow this template:
+
+ ```markdown
+ # Eval Report: <name>
+
+ **Date:** <ISO date>
+ **Dataset:** dataset/
+ **Script:** eval-<name>.py
+ **Thresholds:** accuracy ≥ <value>, F1 ≥ <value>
+
+ ## Overall Results
+
+ | Metric | Value | 95% CI | Threshold | Status |
+ |-----------|--------|----------------|-----------|---------|
+ | Accuracy | <val> | [<low>, <high>]| ≥ <thr> | ✓/✗ PASS/FAIL |
+ | F1 Score | <val> | — | ≥ <thr> | ✓/✗ PASS/FAIL |
+ | Precision | <val> | — | — | — |
+ | Recall | <val> | — | — | — |
+ | Samples | <n> | — | — | — |
+
+ **Overall: PASS / FAIL**
+
+ ## Per-item Results
+
+ | ID | Input Summary | Expected | Actual | Correct |
+ |-----|---------------|----------|--------|---------|
+ | 001 | <summary> | <label> | <label>| ✓ |
+ | 002 | <summary> | <label> | <label>| ✗ |
+
+ ## Notes
+
+ - <observations, failure patterns, MLflow run link>
+ ```
+
+ **Confidence interval:** The 95% CI for accuracy MUST be computed using the **Wilson score interval** (preferred over the normal approximation for small $n$). A wide interval signals that the dataset is too small to support confident conclusions and the sample count should be increased.
+
+ The Wilson score bounds at 95% confidence ($z = 1.96$) are:
+
+ $$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$
+
+ Where $\hat{p}$ is observed accuracy and $n$ is sample count. Accuracy and F1 are required; precision and recall are recommended.
+
+ **Filled-in example** (`evals/eval-basic/eval-report.md` for a document review workflow):
+
+ ```markdown
+ # Eval Report: eval-basic
+
+ **Date:** 2026-06-12
+ **Dataset:** dataset/
+ **Script:** eval-basic.py
+ **Thresholds:** accuracy ≥ 0.85, F1 ≥ 0.80
+
+ ## Overall Results
+
+ | Metric | Value | 95% CI | Threshold | Status |
+ |-----------|-------|--------------|-----------|-------------|
+ | Accuracy | 0.88 | [0.69, 0.97] | ≥ 0.85 | ✓ PASS |
+ | F1 Score | 0.86 | — | ≥ 0.80 | ✓ PASS |
+ | Precision | 0.89 | — | — | — |
+ | Recall | 0.84 | — | — | — |
+ | Samples | 25 | — | — | — |
+
+ **Overall: PASS**
+
+ > Note: CI [0.69, 0.97] is wide — 25 samples may be insufficient for high confidence. Consider expanding the dataset.
+
+ ## Per-item Results
+
+ | ID | Input Summary | Expected | Actual | Correct |
+ |-----|-------------------------------------|----------|----------|---------|
+ | 001 | Contract renewal, 3 pages, standard | approve | approve | ✓ |
+ | 002 | NDA with unusual liability clause | escalate | escalate | ✓ |
+ | 003 | Vendor invoice, missing PO number | reject | reject | ✓ |
+ | 004 | Employment agreement, standard terms| approve | approve | ✓ |
+ | 005 | Amendment with redlined IP clause | escalate | approve | ✗ |
+
+ ## Notes
+
+ - Sample 005 misclassified: redlined IP clause not flagged as escalation trigger. Possible model drift.
+ - MLflow run: experiment `eval_basic` — view with `mlflow ui`
+ ```
+
  ## References
 
  - [agentme-edr-007](../principles/007-project-quality-standards.md) — Project quality standards: when evals are required per AI tier (rule `09-ai-project-testing-requirements`) and statistical model eval targets (rule `07-statistical-models-must-have-eval-targets`)
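
Reviewer sketch (not part of the package): the Wilson score formula that rule `03-eval-report-file` adds in the hunk above can be evaluated with the standard library alone. The function name `wilson_interval` is hypothetical, and the clamping to [0, 1] is an addition to guard against floating-point drift at the extremes. The inputs 22 correct out of 25 are inferred from the filled-in example (accuracy 0.88, 25 samples). For those inputs this sketch gives approximately [0.70, 0.96], which differs slightly from the `[0.69, 0.97]` printed in the example report; the diff does not say how that value was obtained.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (hypothetical helper).

    `correct` is the number of correct samples, `n` the sample count,
    and `z` the normal quantile (1.96 for 95% confidence).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= correct <= n:
        raise ValueError("correct must be between 0 and n")
    p_hat = correct / n
    z2 = z * z
    denominator = 1 + z2 / n
    centre = p_hat + z2 / (2 * n)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    low = (centre - half_width) / denominator
    high = (centre + half_width) / denominator
    # Clamp to [0, 1] to absorb rounding error when p_hat is 0 or 1.
    return max(0.0, low), min(1.0, high)


if __name__ == "__main__":
    low, high = wilson_interval(22, 25)
    print(f"[{low:.2f}, {high:.2f}]")  # prints "[0.70, 0.96]"
```

Computing the bounds in code rather than by hand also fits the generation constraint, since every number in the report then comes from the raw counts.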
package/.xdrs/index.md CHANGED
@@ -25,4 +25,6 @@ Opiniated set of decisions and skills for common development tasks
 
  ### _local (reserved)
 
- Project-local XDRs that must not be shared with other contexts. Always keep this scope last so its decisions override or extend all scopes listed above. Keep `_local` canonical indexes in the workspace tree only; do not link them from this shared index. Readers and tools should still try to discover existing `_local` indexes in the current workspace by default.
+ _local scope is the default scope for new xdrs and might override other scope decisions. These decisions are local and are not supposed to be shared in other contexts.
+
+ Read _local scope index at `_local/index.md` when it exists.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "agentme",
-   "version": "0.18.0",
+   "version": "0.19.0",
    "description": "",
    "dependencies": {
      "filedist": "^0.35.0"