agentme 0.20.1 → 0.22.0

@@ -171,6 +171,46 @@ prompt = PromptTemplate.from_file(
  )
  ```

+ #### 06-output-length-constraints
+
+ Every free-text field generated by an LLM MUST have an explicit word or token limit defined wherever the content is specified — in prompt text, output schema definitions, or both. This prevents runaway verbosity, reduces token costs, and makes quality evaluation deterministic.
+
+ **Rules:**
+
+ - Append `[max N words]` (or `[max N tokens]` when appropriate) directly inside the instruction or field description that requests the generated content.
+ - Apply the constraint to every level: the top-level prompt instruction AND any nested schema field that contains free text.
+ - When a field is an enumeration or a short code (e.g., `"APPROVE"` / `"REJECT"`), no word limit is needed.
+ - Inline examples are encouraged whenever the expected output style is non-obvious.
+
+ **Prompt example:**
+
+ ```text
+ Generate a summary of this text. [max 40 words]
+ Identify the three main topics covered. [max 10 words each]
+ ```
+
+ **Output schema example:**
+
+ ```python
+ from typing import Literal
+ from pydantic import BaseModel, Field
+
+ class EvaluationResult(BaseModel):
+     evaluation: str = Field(description="Most important aspects of the text [max 100 words]")
+     verdict: Literal["PASS", "FAIL"]  # enumeration: no word limit needed
+     improvement_suggestion: str = Field(description="Concrete next step for the author [max 30 words]")
+ ```
+
+ **Combined prompt + schema example:**
+
+ ```python
+ prompt = """
+ Evaluate the following document against our quality criteria. [max 200 words total]
+
+ Return a JSON object with:
+ - "evaluation": overall assessment [max 100 words]
+ - "verdict": "PASS" or "FAIL"
+ - "improvement_suggestion": one concrete improvement [max 30 words]
+ """
+ ```
+

  ## References

  - [agentme-edr-019](019-ai-agents-development-standards.md) — Agent implementation standards (deepagents, tool-invocation loops)
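
One way to make the `[max N words]` markers from rule 06 checkable after generation is to parse them out of the field descriptions and count the words in the model output. The sketch below is illustrative only: `parse_word_limit` and `check_word_limits` are hypothetical helper names, not part of agentme, and splitting on whitespace is an assumed word-counting rule.

```python
import re
from typing import Dict, List, Optional

# Matches markers such as "[max 40 words]" or "[max 10 words each]".
# Token limits are deliberately ignored in this sketch.
_LIMIT_RE = re.compile(r"\[max (\d+) words?(?: each| total)?\]")


def parse_word_limit(description: str) -> Optional[int]:
    """Return N from a "[max N words]" marker, or None if no marker is present."""
    match = _LIMIT_RE.search(description)
    return int(match.group(1)) if match else None


def check_word_limits(descriptions: Dict[str, str], output: Dict[str, str]) -> List[str]:
    """Return the names of fields whose text exceeds the declared word limit."""
    violations = []
    for field, description in descriptions.items():
        limit = parse_word_limit(description)
        if limit is None:
            continue  # enumerations and short codes carry no limit
        if len(output.get(field, "").split()) > limit:
            violations.append(field)
    return violations


descriptions = {
    "evaluation": "Most important aspects of the text [max 100 words]",
    "verdict": "PASS or FAIL",
    "improvement_suggestion": "Concrete next step for the author [max 30 words]",
}
output = {
    "evaluation": "Clear structure and accurate claims.",
    "verdict": "PASS",
    "improvement_suggestion": "word " * 31,  # 31 words, one over the limit
}
print(check_word_limits(descriptions, output))  # ['improvement_suggestion']
```

Keeping the limit in the description text means the prompt and the check read from the same source, so the two cannot drift apart.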
@@ -197,6 +197,58 @@ The current OS is: [operating system name].
  | `<OUTPUT_FORMAT>` | Required | MUST include a concrete schema or templated example; do not leave it vague. When multiple output formats are possible, MUST use mandatory language to specify exactly which one to use and explicitly exclude the others. |
  | `<WORKFLOW_CONTEXT>` | Conditional | MUST be omitted for standalone agents. MUST be present when the agent runs as a node inside a LangGraph workflow. |

+ #### 07-agent-output-format
+
+ The format of an agent's final output MUST be chosen based on who or what consumes it:
+
+ | Consumer | Required format |
+ |---|---|
+ | Code (workflow node, downstream function, API caller) | JSON object matching a declared schema |
+ | Human (user-facing summary, review comment, prose report) | Natural language; JSON is NOT required |
+
+ **When output is consumed by code:**
+
+ - The agent's `<OUTPUT_FORMAT>` system prompt section MUST specify a JSON schema or a concrete typed example.
+ - The `<OUTPUT_FORMAT>` section MUST include the instruction: `ALWAYS return valid JSON. NEVER include prose outside the JSON block.`
+ - The corresponding Python return type MUST be a typed dataclass or Pydantic model that mirrors the schema, not a plain `dict` or `str`.
+ - The caller MUST parse and validate the JSON against the declared type before passing it downstream.
+
+ **Example — code-consumed agent output:**
+
+ ```python
+ from dataclasses import dataclass
+ from typing import List
+ import json
+
+ @dataclass
+ class FileAnalysisResult:
+     status: str  # "success" | "failure"
+     files_found: int
+     issues: List[str]
+     summary: str  # mirrors the "summary" key in the schema below
+
+ # System prompt OUTPUT_FORMAT section:
+ # <OUTPUT_FORMAT>
+ # Respond with a JSON object matching this schema:
+ # {
+ #   "status": "success" | "failure",
+ #   "files_found": <integer>,
+ #   "issues": ["<string>", ...],
+ #   "summary": "summary of the work performed. max 30 words"
+ # }
+ # ALWAYS return valid JSON. NEVER include prose outside the JSON block.
+ # </OUTPUT_FORMAT>
+
+ def parse_agent_output(raw: str) -> FileAnalysisResult:
+     data = json.loads(raw)
+     # Raises TypeError if the keys do not match the declared fields.
+     return FileAnalysisResult(**data)
+ ```
+
+ **When output is consumed by a human:**
+
+ - Natural language is correct and preferred.
+ - Do NOT wrap prose in JSON — it adds noise without value.
+ - The `<OUTPUT_FORMAT>` section MUST still describe the expected structure (e.g., a list of findings, a summary paragraph) using a prose template.
+

  ## References

  - [agentme-edr-018](018-ai-llm-development-standards.md) — LLM development standards (LangChain configuration, mocking patterns)
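
The rule that the caller parses and validates the JSON before passing it downstream could be sketched as follows. This is a minimal, standard-library-only illustration: `AgentOutputError` and the individual checks are assumptions made for the example, and a Pydantic model would be an equally reasonable way to get the same guarantees.

```python
import json
from dataclasses import dataclass, fields
from typing import List


class AgentOutputError(ValueError):
    """Hypothetical error type raised when agent output fails validation."""


@dataclass
class FileAnalysisResult:
    status: str  # "success" | "failure"
    files_found: int
    issues: List[str]
    summary: str


def parse_agent_output(raw: str) -> FileAnalysisResult:
    """Parse raw agent output and validate it against FileAnalysisResult."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise AgentOutputError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise AgentOutputError("expected a JSON object")
    expected = {f.name for f in fields(FileAnalysisResult)}
    if set(data) != expected:
        raise AgentOutputError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    if data["status"] not in ("success", "failure"):
        raise AgentOutputError("status must be 'success' or 'failure'")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(data["files_found"], bool) or not isinstance(data["files_found"], int):
        raise AgentOutputError("files_found must be an integer")
    if not isinstance(data["issues"], list) or not all(isinstance(i, str) for i in data["issues"]):
        raise AgentOutputError("issues must be a list of strings")
    if not isinstance(data["summary"], str):
        raise AgentOutputError("summary must be a string")
    return FileAnalysisResult(**data)


result = parse_agent_output(
    '{"status": "success", "files_found": 2, "issues": [], "summary": "Scanned two files."}'
)
print(result.files_found)  # 2
```

A plain `FileAnalysisResult(**data)` only checks key names; the explicit type checks are what catch a model returning `"files_found": "2"`.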
@@ -23,47 +23,52 @@ For when evals are required per AI tier, see [agentme-edr-007](../principles/007

  #### 01-eval-folder-structure

- Each named eval is a self-contained unit. Create one directory per eval under `evals/` at the same level as `lib/` and `examples/`:
+ Evals are grouped first by the component being evaluated, then by the specific evaluation scenario. Create one directory per component under `evals/`, and one directory per eval scenario inside it. Place `evals/` at the same level as `lib/` and `examples/`:

  ```text
  evals/
-   eval-<name>/
-     dataset/            # EDR-024 compliant dataset (README.md, dataset.schema.json, data/)
-     eval-<name>.py      # evaluation script
-     eval-report.md      # generated report (overwritten on each run — see rule 03)
-     Makefile            # eval and run targets
-   eval-<name2>/
+   <component>/          # the component being evaluated (e.g., workflow-x, agent-y, model-z)
+     eval-<name>/
+       dataset/          # EDR-024 compliant dataset (README.md, dataset.schema.json, data/)
+       eval.py           # evaluation script
+       report.md         # generated report (overwritten on each run — see rule 03)
+       Makefile          # eval and run targets
+     eval-<name2>/
+       ...
+   <component2>/
      ...
  ```

- Where `<name>` identifies the specific evaluation scenario (e.g., `eval-basic`, `eval-complex`, `eval-edge-cases`).
+ `<component>` MUST match the name of the component under evaluation and use lowercase hyphen-separated words (e.g., `workflow-document-review`, `agent-support`, `model-classifier`).
+
+ `<name>` identifies the specific evaluation scenario using lowercase hyphen-separated words (e.g., `eval-basic`, `eval-complex`, `eval-edge-cases`, `eval-bias-test`).

  The `dataset/` subfolder MUST be a valid [agentme-edr-024](024-ml-dataset-structure.md) dataset — it MUST include `README.md` and `dataset.schema.json` at its root. For input/output pairs, use JSONL files per `agentme-edr-024.04-complex-structured-datasets-must-use-jsonl`.

- Each `evals/eval-<name>/Makefile` MUST define:
+ Each `evals/<component>/eval-<name>/Makefile` MUST define:

  | Target | Behaviour |
  |---|---|
  | `eval` | Runs the eval with threshold enforcement; exits non-zero on failure (CI-safe) |
  | `run` | Runs the eval without threshold enforcement (exploration / debugging) |

- The module root Makefile MUST expose a `make eval` target that delegates to `eval` in every `evals/eval-<name>/Makefile`:
+ The module root Makefile MUST expose a `make eval` target that delegates to `eval` in every `evals/<component>/eval-<name>/Makefile`:

  ```makefile
  eval:
- 	$(MAKE) -C evals/eval-basic eval
- 	$(MAKE) -C evals/eval-complex eval
+ 	$(MAKE) -C evals/workflow-document-review/eval-basic eval
+ 	$(MAKE) -C evals/workflow-document-review/eval-complex eval
  ```

  #### 02-eval-script-requirements

- Each `eval-<name>.py` script MUST:
+ Each `eval.py` script MUST:

  - Load the dataset from `dataset/` in the same eval folder, following [agentme-edr-024](024-ml-dataset-structure.md). For input/output pairs, use the JSONL format per `agentme-edr-024.04-complex-structured-datasets-must-use-jsonl`.
  - Run every input through the live component against **real LLM providers** (not mocked responses), to capture model drift.
  - Log per-sample and aggregate metrics to an MLflow experiment that runs **locally** — a remote MLflow server MUST NOT be required.
  - Compare outputs to expected values using project-defined quality thresholds. Thresholds MUST be declared explicitly (e.g., in a Makefile variable or README).
- - Write `eval-report.md` in the same folder per rule `03-eval-report-file`.
+ - Write `report.md` in the same folder per rule `03-eval-report-file`.
  - Exit with a non-zero status when any metric falls below its defined threshold, consistent with [agentme-edr-007](../principles/007-project-quality-standards.md) rule `07-statistical-models-must-have-eval-targets`.

  **Example:**
@@ -91,7 +96,7 @@ with mlflow.start_run() as run:

  #### 03-eval-report-file

- Each eval script MUST produce `eval-report.md` in the same `evals/eval-<name>/` folder and overwrite it on every run.
+ Each eval script MUST produce `report.md` in the same `evals/<component>/eval-<name>/` folder and overwrite it on every run.

  **Generation constraint:** The report MUST be produced programmatically, reading raw metric values directly from MLflow. No LLM or generative model may write, summarize, or paraphrase any section of the report, to prevent hallucinated metric values.

@@ -102,7 +107,7 @@ The report MUST follow this template:

  **Date:** <ISO date>
  **Dataset:** dataset/
- **Script:** eval-<name>.py
+ **Script:** eval.py
  **Thresholds:** accuracy ≥ <value>, F1 ≥ <value>

  ## Overall Results
@@ -137,14 +142,14 @@ $$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac

  Where $\hat{p}$ is observed accuracy and $n$ is sample count. Accuracy and F1 are required; precision and recall are recommended.

- **Filled-in example** (`evals/eval-basic/eval-report.md` for a document review workflow):
+ **Filled-in example** (`evals/workflow-document-review/eval-basic/report.md` for a document review workflow):

  ```markdown
  # Eval Report: eval-basic

  **Date:** 2026-06-12
  **Dataset:** dataset/
- **Script:** eval-basic.py
+ **Script:** eval.py
  **Thresholds:** accuracy ≥ 0.85, F1 ≥ 0.80

  ## Overall Results
@@ -174,7 +179,7 @@ Where $\hat{p}$ is observed accuracy and $n$ is sample count. Accuracy and F1 ar
  ## Notes

  - Sample 005 misclassified: redlined IP clause not flagged as escalation trigger. Possible model drift.
- - MLflow run: experiment `eval_basic` — view with `mlflow ui`
+ - MLflow run: experiment `workflow-document-review/eval-basic` — view with `mlflow ui`
  ```

  ## References
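
The confidence-interval formula quoted in the eval report hunk (in terms of $\hat{p}$, $n$, and $z$) has the shape of the Wilson score interval. Assuming that is the intended formula, it can be computed as sketched below. The helper name `wilson_interval` and the default `z = 1.96` (a 95% level) are assumptions for illustration and are not prescribed by the diff.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed over n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Example: 18 of 20 samples correct.
low, high = wilson_interval(0.9, 20)
print(f"accuracy 0.90, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the simpler normal approximation, this interval stays inside [0, 1] and behaves sensibly for the small sample counts typical of hand-curated eval datasets.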
@@ -118,13 +118,14 @@ Directory and file layout must be self-explanatory: source code, tests, configur

  #### 06-libraries-must-have-runnable-examples

- Projects that are libraries or shared utilities must include an `examples/` directory. Each subdirectory represents a usage scenario and must be independently runnable. Examples are executed as part of `make test`.
+ Projects that are libraries or shared utilities must include an `examples/` directory. Each subdirectory represents a usage scenario and must be independently runnable. Examples that are "offline" (require no external credentials, no running servers, no paid APIs, and no environment-specific configuration outside the repository) must be executed as part of `make test`. Examples that depend on external entities must be left out of `make test`.

  **Requirements:**
  - `examples/` must contain at least one subdirectory per major usage scenario
  - Each scenario subdirectory must have a `Makefile` with a `run` target
  - Examples must import the library as an external consumer (not via relative `../src` imports)
- - `make test` in the root must run all examples; failures block CI and releases
+ - `make test` in the root must run all offline examples; failures block CI and releases
+ - Examples that depend on external entities must not be included in `make test`

  **Directory layout:**

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "agentme",
-   "version": "0.20.1",
+   "version": "0.22.0",
    "description": "",
    "dependencies": {
      "filedist": "^0.36.0"