agentme 0.14.0 → 0.15.0
- package/.filedist-package.yml +1 -1
- package/.xdrs/agentme/edrs/application/003-javascript-project-tooling.md +3 -3
- package/.xdrs/agentme/edrs/application/010-golang-project-tooling.md +3 -3
- package/.xdrs/agentme/edrs/application/014-python-project-tooling.md +2 -2
- package/.xdrs/agentme/edrs/application/015-cli-tool-standards.md +2 -2
- package/.xdrs/agentme/edrs/application/018-ai-llm-development-standards.md +180 -0
- package/.xdrs/agentme/edrs/application/019-ai-agents-development-standards.md +284 -0
- package/.xdrs/agentme/edrs/application/020-ai-workflow-development-standards.md +261 -0
- package/.xdrs/agentme/edrs/application/021-ai-eval-standards.md +90 -0
- package/.xdrs/agentme/edrs/application/{019-ml-dataset-structure.md → 024-ml-dataset-structure.md} +2 -2
- package/.xdrs/agentme/edrs/application/{020-ai-agent-xdrs-knowledge-layer.md → 025-ai-agent-xdrs-knowledge-layer.md} +4 -4
- package/.xdrs/agentme/edrs/application/{021-pragmatic-hexagonal-architecture.md → 026-pragmatic-hexagonal-architecture.md} +2 -2
- package/.xdrs/agentme/edrs/application/skills/001-create-javascript-project/SKILL.md +2 -2
- package/.xdrs/agentme/edrs/application/skills/003-create-golang-project/SKILL.md +2 -2
- package/.xdrs/agentme/edrs/application/skills/005-create-python-project/SKILL.md +3 -3
- package/.xdrs/agentme/edrs/index.md +7 -5
- package/.xdrs/agentme/edrs/principles/007-project-quality-standards.md +29 -1
- package/package.json +3 -3
- package/.xdrs/agentme/edrs/application/018-ai-agent-development-standards.md +0 -309
- package/.xdrs/agentme/edrs/application/024-llm-development-standards.md +0 -116
|
@@ -0,0 +1,261 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: agentme-edr-policy-020-ai-workflow-development-standards
|
|
3
|
+
description: Defines the standard toolchain, framework, observability, and workflow patterns for building LangGraph workflows in Python. Use when scaffolding, reviewing, or extending AI workflow projects that orchestrate LLM calls, agents, and algorithmic nodes. For simple LLM calls see agentme-edr-018, for agentic patterns see agentme-edr-019.
|
|
4
|
+
apply-to: AI workflow projects using LangGraph StateGraph built with Python
|
|
5
|
+
valid-from: 2026-06-05
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# agentme-edr-policy-020: AI workflow development standards
|
|
9
|
+
|
|
10
|
+
## Context and Problem Statement
|
|
11
|
+
|
|
12
|
+
AI workflow projects vary widely in how they structure directed graphs, manage state, evaluate outputs, and test execution paths. Without a shared baseline, projects accumulate incompatible patterns for flow design, state management, and dataset-driven testing.
|
|
13
|
+
|
|
14
|
+
Which tools, frameworks, and design patterns should AI workflow projects follow to ensure reproducibility, testability, and maintainability?
|
|
15
|
+
|
|
16
|
+
## Decision Outcome
|
|
17
|
+
|
|
18
|
+
**Use Python with LangGraph for flow orchestration and MLflow for experiment tracking and local evaluation.**
|
|
19
|
+
|
|
20
|
+
### Details
|
|
21
|
+
|
|
22
|
+
#### 01-language-and-framework
|
|
23
|
+
|
|
24
|
+
Workflows MUST be built with **LangGraph**. Use LangGraph `StateGraph` to model each distinct workflow as an explicit directed graph with typed state.
|
|
25
|
+
|
|
26
|
+
For all direct LLM calls within workflow nodes, use LangChain per [agentme-edr-018](018-ai-llm-development-standards.md). For agent nodes with tool-invocation loops, use deepagents per [agentme-edr-019](019-ai-agents-development-standards.md).
|
|
27
|
+
|
|
28
|
+
#### 03-observability-and-experiment-tracking
|
|
29
|
+
|
|
30
|
+
Use **MLflow** for all workflow observability and evaluation:
|
|
31
|
+
|
|
32
|
+
- **Workflow-level tracking:** Wrap each workflow run with `mlflow.start_run()` to capture traces, parameters, and metrics locally.
|
|
33
|
+
- **LLM-level auto-tracing:** Enable LangChain auto-tracing per [agentme-edr-018](018-ai-llm-development-standards.md) rule `03-llm-observability` by calling `mlflow.langchain.autolog()` during application startup. This captures inputs, outputs, token counts, and latency for every LangChain call within workflow nodes.
|
|
34
|
+
- Log run parameters (model name, temperature, prompt version) and output metrics (accuracy, latency, token counts) using `mlflow.log_param` / `mlflow.log_metric`.
|
|
35
|
+
- Run a local MLflow tracking server with `mlflow ui` to inspect runs during development. Do not require a remote MLflow server for local development.
|
|
36
|
+
|
|
37
|
+
#### 04-dataset-driven-accuracy-measurement
|
|
38
|
+
|
|
39
|
+
Eval dataset and implementation requirements are defined in [agentme-edr-021](021-ai-eval-standards.md). Testing requirements (when evals are required, release gates) are defined in [agentme-edr-007](../principles/007-project-quality-standards.md) rule `09-ai-project-testing-requirements`.
|
|
40
|
+
|
|
41
|
+
#### 05-flow-documentation
|
|
42
|
+
|
|
43
|
+
Each workflow MUST be documented as a **Mermaid graph** in a `README.md`. The diagram must match the LangGraph `StateGraph` definition:
|
|
44
|
+
|
|
45
|
+
- Use `graph TD` or `graph LR` direction.
|
|
46
|
+
- Label each node with its Python function name.
|
|
47
|
+
- Label conditional edges with the condition expression.
|
|
48
|
+
- Update the diagram whenever the graph topology changes.
|
|
49
|
+
|
|
50
|
+
Example minimal diagram block:
|
|
51
|
+
|
|
52
|
+
```mermaid
graph TD
    A[fetch_context] --> B[draft_response]
    B --> C{verify}
    C -->|pass| D[output]
    C -->|fail| B
```
|
|
59
|
+
|
|
60
|
+
#### 06-verification-steps
|
|
61
|
+
|
|
62
|
+
Workflows MUST include at least one explicit verification node before producing final output:
|
|
63
|
+
|
|
64
|
+
- Model the verification step as a dedicated LangGraph node (e.g. `verify_output`).
|
|
65
|
+
- The node checks the draft output against defined acceptance criteria (schema validation, factual consistency check, rubric scoring, or LLM-as-judge call).
|
|
66
|
+
- On failure, the verification node MUST route back to the relevant generation node, not silently pass through.
|
|
67
|
+
- Log verification results (pass/fail, score, reason) as MLflow metrics on the current run.
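The routing requirement above can be sketched as a plain conditional edge function. This is a minimal sketch, not the policy's prescribed implementation: the state fields (`verification_passed`, `verification_reason`) and the node names `draft_response_llm` and `format_output_step` are assumptions made for illustration.

```python
from typing import TypedDict


class document_workflow_state(TypedDict):
    """Workflow state; the field names are illustrative assumptions."""

    draft: str
    verification_passed: bool
    verification_reason: str


def route_after_verify_step(state: document_workflow_state) -> str:
    """Return the name of the next node based on the verification result."""
    if state["verification_passed"]:
        return "format_output_step"
    # A failed check loops back to generation instead of passing through.
    return "draft_response_llm"
```

In LangGraph, a function of this shape is typically registered on the verification node with `add_conditional_edges`, which keeps the pass/fail decision testable without running the graph.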
|
|
68
|
+
|
|
69
|
+
#### 07-workflow-structure
|
|
70
|
+
|
|
71
|
+
Workflow logic MUST be organized as named workflows following [agentme-edr-026](026-pragmatic-hexagonal-architecture.md). Each workflow is an independent LangGraph `StateGraph` with a defined start node and end node, connecting LLM nodes, agent nodes, algorithmic nodes, states, routes, and decision nodes.
|
|
72
|
+
|
|
73
|
+
Workflows live inside `app/workflows/` (the application layer), while external integrations such as LLM providers, vector stores, and third-party APIs live under `adapters/connectors/` (the outbound adapter layer). Inbound interfaces (HTTP API, CLI) live under `adapters/` as inbound adapters.
|
|
74
|
+
|
|
75
|
+
For each workflow named `<workflow>`, the full project layout is:
|
|
76
|
+
|
|
77
|
+
```text
lib/src/<package_name>/
  adapters/
    http/                 # inbound: API server that triggers workflows
    cli/                  # inbound: CLI entry point (if applicable)
    connectors/           # outbound: external resource integrations
      openai/             # LLM provider connector
      azure-openai/       # alternative LLM provider connector
      postgres/           # database connector (if applicable)
      vector-store/       # vector DB connector (if applicable)
  app/
    workflows/
      <workflow>/
        graph.py          # StateGraph definition; entry point for the workflow
        agents.py         # deepagents agent definitions used by this workflow
        states.py         # Typed state dataclasses / TypedDicts
        routes.py         # Conditional edge functions
  shared/                 # infrastructure-agnostic utilities
```
|
|
96
|
+
|
|
97
|
+
- `app/workflows/<workflow>/graph.py` MUST define and compile the `StateGraph` and expose a `graph` object that callers invoke.
|
|
98
|
+
- Tool calls within workflow nodes that interact with external systems MUST use connectors from `adapters/connectors/`, not inline API calls.
|
|
99
|
+
- Additional modules (prompts, schemas) MAY be added inside `app/workflows/<workflow>/` when they are specific to that workflow. Shared utilities belong in `shared/`.
|
|
100
|
+
|
|
101
|
+
#### 08-workflow-evals
|
|
102
|
+
|
|
103
|
+
Eval folder structure and script requirements are defined in [agentme-edr-021](021-ai-eval-standards.md).
|
|
104
|
+
|
|
105
|
+
#### 09-node-naming-conventions
|
|
106
|
+
|
|
107
|
+
LangGraph node names MUST follow a suffix convention that communicates the node's role at a glance. Names MUST be action-oriented and descriptive.
|
|
108
|
+
|
|
109
|
+
| Suffix | Node type | When to use |
|---|---|---|
| `_llm` | LLM call | Any node whose primary action is a direct LLM inference call (see [agentme-edr-018](018-ai-llm-development-standards.md)) |
| `_step` | Algorithmic step | Deterministic logic with no LLM involvement (transformation, validation, routing) |
| `_tool` | Tool/API call | A node that wraps a single external tool or API (e.g. a REST endpoint, DB query) |
| `_agent` | Subgraph agent | A node that invokes a nested subgraph containing its own tool-invocation cycle and LLM calls; use the **deepagents** library for these nodes (see [agentme-edr-019](019-ai-agents-development-standards.md)) |
|
|
115
|
+
|
|
116
|
+
The Python function implementing the node SHOULD share the same name as the node alias passed to `add_node`, so that graph definitions and stack traces remain unambiguous:
|
|
117
|
+
|
|
118
|
+
```python
# LLM node: direct LLM inference call
def draft_doc_llm(state): ...
graph.add_node("draft_doc_llm", draft_doc_llm)

# Tool node: calls the Stripe API
def stripe_api_tool(state): ...
graph.add_node("stripe_api_tool", stripe_api_tool)

# Agent node: uses deepagents for tool-invocation loop
def code_reviewer_agent(state): ...
graph.add_node("code_reviewer_agent", code_reviewer_agent)
```
|
|
130
|
+
|
|
131
|
+
Names MUST NOT use generic labels such as `node1`, `process`, or `run`. Each name must clearly express what action the node performs.
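The naming rule can be checked mechanically, for example in a lint step or unit test. The following is a minimal sketch under stated assumptions: the helper `is_valid_node_name` is hypothetical, and the set of generic stems contains only the three examples the rule names, so a real project would likely extend it.

```python
ROLE_SUFFIXES = ("_llm", "_step", "_tool", "_agent")
GENERIC_STEMS = {"node1", "process", "run"}  # only the examples named in the rule


def is_valid_node_name(name: str) -> bool:
    """Check a node name against the suffix and no-generic-label rules."""
    for suffix in ROLE_SUFFIXES:
        if name.endswith(suffix):
            stem = name[: -len(suffix)]
            # Require a descriptive, non-generic stem before the role suffix.
            return bool(stem) and stem not in GENERIC_STEMS
    return False
```

Treating a generic stem followed by a valid suffix (such as `run_step`) as invalid is a design choice of this sketch, not something the rule states explicitly.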
|
|
132
|
+
|
|
133
|
+
#### 10-workflow-unit-testing
|
|
134
|
+
|
|
135
|
+
All LLM calls within workflow nodes are external API calls and MUST be mocked in unit tests per [agentme-edr-018](018-ai-llm-development-standards.md) rule `04-unit-test-mocking`. Workflow unit tests must run fully offline with no real LLM provider calls.
|
|
136
|
+
|
|
137
|
+
Choose the mock utility based on what the node under test expects from the model:
|
|
138
|
+
|
|
139
|
+
- Use **`FakeListChatModel`** when nodes only read `AIMessage.content` (e.g. a routing node that checks a text label).
|
|
140
|
+
- Use **`GenericFakeChatModel`** when any node in the workflow expects tool calls, structured outputs, or when the workflow contains `_agent` nodes that drive a tool-invocation loop.
|
|
141
|
+
|
|
142
|
+
**Example — workflow with plain-text LLM nodes:**
|
|
143
|
+
|
|
144
|
+
```python
from langchain_core.language_models.fake_chat_models import FakeListChatModel

# DocumentWorkflow and input_doc are project-specific; their import and fixture are omitted here.
def test_document_workflow_approve_path():
    # Responses consumed in node execution order
    fake_model = FakeListChatModel(responses=["APPROVE", "Meets all criteria."])

    workflow = DocumentWorkflow(model=fake_model)
    result = workflow.run(input_doc)

    assert result.status == "approved"
```
|
|
156
|
+
|
|
157
|
+
**Example — workflow containing an agent node (`_agent` suffix):**
|
|
158
|
+
|
|
159
|
+
```python
from langchain_core.language_models.fake_chat_models import GenericFakeChatModel
from langchain_core.messages import AIMessage

# DocumentWorkflow and input_doc are project-specific; their import and fixture are omitted here.
def test_document_workflow_with_agent_node():
    # First agent turn: the model requests a tool call
    tool_call_msg = AIMessage(
        content="",
        tool_calls=[{"name": "fetch_context", "args": {"doc_id": "42"}, "id": "c1"}],
    )
    # Second agent turn: the model ends the tool-invocation loop
    agent_final_msg = AIMessage(content="Context retrieved successfully.")
    # Reply consumed by the routing node after the agent finishes
    routing_msg = AIMessage(content="APPROVE")

    fake_model = GenericFakeChatModel(
        messages=iter([tool_call_msg, agent_final_msg, routing_msg])
    )

    workflow = DocumentWorkflow(model=fake_model)
    result = workflow.run(input_doc)

    assert result.status == "approved"
```
|
|
180
|
+
|
|
181
|
+
Workflows MUST accept the LLM instance as a constructor parameter so that unit tests can inject a fake. See the injectable LLM pattern in [agentme-edr-018](018-ai-llm-development-standards.md) rule `04-unit-test-mocking`.
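Constructor injection can be illustrated without any LLM library. This is a minimal sketch, assuming a model object that exposes `invoke(prompt)` and returns something with a `content` attribute; `DocumentReviewWorkflow`, `ModelReply`, `ReviewResult`, and `StubChatModel` are hypothetical names for illustration, and real tests would normally inject `FakeListChatModel` or `GenericFakeChatModel` instead of the hand-rolled stub.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ModelReply:
    content: str


@dataclass
class ReviewResult:
    status: str


class ChatModelLike(Protocol):
    """Minimal model interface assumed for this sketch."""

    def invoke(self, prompt: str) -> ModelReply: ...


class DocumentReviewWorkflow:
    def __init__(self, model: ChatModelLike) -> None:
        # The model is injected, never constructed inside the workflow.
        self._model = model

    def run(self, document: str) -> ReviewResult:
        reply = self._model.invoke(f"Review this document: {document}")
        approved = reply.content.strip().upper() == "APPROVE"
        return ReviewResult(status="approved" if approved else "rejected")


class StubChatModel:
    """Hand-rolled offline fake that replays canned replies in order."""

    def __init__(self, replies: list[str]) -> None:
        self._replies = iter(replies)

    def invoke(self, prompt: str) -> ModelReply:
        return ModelReply(content=next(self._replies))
```

Because the workflow depends only on the injected object, a test can swap in any fake with the same interface and run fully offline.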
|
|
182
|
+
|
|
183
|
+
#### 11-state-type-conventions
|
|
184
|
+
|
|
185
|
+
All TypedDict and dataclass types that represent LangGraph node or workflow state MUST end with `_state` in their name. This suffix signals at a glance that the type is a state boundary, not a plain data model.
|
|
186
|
+
|
|
187
|
+
**Naming reference:**
|
|
188
|
+
|
|
189
|
+
| Owner | Naming pattern | Example |
|---|---|---|
| Single agent / agent subgraph | `<agent_name>_agent_state` | `reviewer_agent_state` |
| Full workflow (`StateGraph`) | `<workflow_name>_workflow_state` | `document_workflow_state` |
| Named group of nodes sharing state | `<group_responsibility>_state` | `retrieval_pipeline_state` |
|
|
194
|
+
|
|
195
|
+
**Boundary rules:**
|
|
196
|
+
|
|
197
|
+
- Each agent or agent subgraph MUST define its own dedicated state type. Do not reuse or extend a generic state across unrelated agents.
|
|
198
|
+
- Each workflow (`StateGraph`) MUST define its own top-level state type. The workflow state is the authoritative boundary for that graph's inputs and outputs.
|
|
199
|
+
- When a group of nodes (not a full workflow and not a single agent) shares a state type, the type name MUST clearly reflect the shared responsibility. Generic names such as `shared_state`, `common_state`, or `global_state` are FORBIDDEN.
|
|
200
|
+
- Large workflows MUST NOT use a single monolithic state that all nodes read and write. Split the state into per-phase or per-agent state types scoped to the subgraph or set of nodes that produce or consume each field.
|
|
201
|
+
|
|
202
|
+
State type names SHOULD align with the agent or node names defined in rule `09-node-naming-conventions` (e.g., an agent node named `draft_doc_agent` has a state type named `draft_doc_agent_state`).
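The state naming and boundary rules above might look like this in `states.py`. It is a sketch under stated assumptions: the field names are invented for illustration, and `is_valid_state_type_name` is a hypothetical helper that covers only the `_state` suffix and the three forbidden generic names listed in the rule.

```python
from typing import TypedDict

FORBIDDEN_STATE_NAMES = {"shared_state", "common_state", "global_state"}


class reviewer_agent_state(TypedDict):
    """State owned by a single agent subgraph (fields are illustrative)."""

    document: str
    review_notes: list[str]


class document_workflow_state(TypedDict):
    """Top-level state for one workflow graph (fields are illustrative)."""

    document: str
    status: str


def is_valid_state_type_name(name: str) -> bool:
    """True when the name ends with `_state` and is not a forbidden generic."""
    has_stem = name.endswith("_state") and name != "_state"
    return has_stem and name not in FORBIDDEN_STATE_NAMES
```

Keeping one state type per agent and per workflow, as here, avoids the monolithic shared state that the boundary rules prohibit.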
|
|
203
|
+
|
|
204
|
+
#### 12-workflow-naming-conventions
|
|
205
|
+
|
|
206
|
+
LangGraph `StateGraph` instances and their enclosing classes MUST be given a meaningful name that conveys the workflow's input, output, and/or behavior. The name MUST end with `Workflow` (PascalCase class) or `_workflow` (snake_case variable or directory).
|
|
207
|
+
|
|
208
|
+
Choose a name that summarises what the workflow consumes, processes, and produces — avoid generic labels such as `Pipeline`, `Flow`, `Graph`, or `Process`.
|
|
209
|
+
|
|
210
|
+
| Context | Pattern | Example |
|---|---|---|
| Python class | `<DescriptiveName>Workflow` | `FileMapJudgeReduceWorkflow` |
| Python variable / instance | `<descriptive_name>_workflow` | `file_map_judge_reduce_workflow` |
| Directory under `app/workflows/` | `<descriptive_name>_workflow` | `financial_report_analysis_workflow/` |
|
|
215
|
+
|
|
216
|
+
**Good names** communicate purpose at a glance:
|
|
217
|
+
|
|
218
|
+
- `FileMapJudgeReduceWorkflow` — maps files, judges each, then reduces results
|
|
219
|
+
- `FinancialReportAnalysisWorkflow` — analyses financial report inputs
|
|
220
|
+
- `MarketingCampaignExecutorWorkflow` — executes a marketing campaign end-to-end
|
|
221
|
+
|
|
222
|
+
**Bad names** (FORBIDDEN): `MainWorkflow`, `AgentGraph`, `ProcessFlow`, `Workflow1`, `RunGraph`.
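The class, variable, and directory forms in the table are the same name in different casings, so one can be derived from the other. A minimal sketch, assuming a hypothetical helper `workflow_dir_name`; it inserts an underscore before every capital letter, so it would split acronyms letter by letter.

```python
import re


def workflow_dir_name(class_name: str) -> str:
    """Convert a PascalCase workflow class name to its snake_case form."""
    if not class_name.endswith("Workflow"):
        raise ValueError(f"{class_name!r} must end with 'Workflow'")
    # Insert "_" before each capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
```

For example, `workflow_dir_name("FileMapJudgeReduceWorkflow")` yields `file_map_judge_reduce_workflow`, matching the table's variable form.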
|
|
223
|
+
|
|
224
|
+
#### 15-workflow-state-persistence
|
|
225
|
+
|
|
226
|
+
For long-running workflows that may need to be paused and resumed:
|
|
227
|
+
|
|
228
|
+
- Use LangGraph's built-in checkpointing with `MemorySaver` for development and testing.
|
|
229
|
+
- Use persistent checkpointers (e.g., `PostgresSaver`, or Redis-based checkpointers) for production workflows that need durability.
|
|
230
|
+
- Checkpoint state MUST be serializable (use TypedDict or dataclasses with JSON-compatible fields).
|
|
231
|
+
- Document the checkpoint strategy in the workflow's README.md.
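The serializability requirement can be guarded with a small check in unit tests. This is a conservative sketch rather than a description of how LangGraph serializes checkpoints: `assert_json_serializable` is a hypothetical helper that simply attempts a standard-library JSON round of the state.

```python
import json
from typing import Any, Mapping


def assert_json_serializable(state: Mapping[str, Any]) -> None:
    """Raise TypeError when the state holds values JSON cannot encode."""
    try:
        json.dumps(dict(state))
    except (TypeError, ValueError) as exc:
        raise TypeError(f"State is not JSON-serializable: {exc}") from exc
```

Calling it on a sample state in each workflow's tests surfaces fields such as open connections or custom objects before they reach a persistent checkpointer.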
|
|
232
|
+
|
|
233
|
+
**Example with MemorySaver (development):**
|
|
234
|
+
|
|
235
|
+
```python
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
graph = workflow.compile(checkpointer=checkpointer)

# Runs that share a thread_id resume from that thread's latest checkpoint.
# The thread_id must be nested under the "configurable" key.
result = graph.invoke(input_state, config={"configurable": {"thread_id": "session-123"}})
```
|
|
244
|
+
|
|
245
|
+
**When to use checkpointing:**
|
|
246
|
+
|
|
247
|
+
- Workflows that take > 30 seconds to complete
|
|
248
|
+
- Workflows that require human-in-the-loop approval or input
|
|
249
|
+
- Workflows that are non-idempotent
|
|
250
|
+
- Workflows that may fail mid-execution and need to be retried from the last successful node
|
|
251
|
+
- Multi-session workflows where state persists across user interactions
|
|
252
|
+
|
|
253
|
+
## References
|
|
254
|
+
|
|
255
|
+
- [agentme-edr-018](018-ai-llm-development-standards.md) — LLM development standards: LangChain framework, provider configuration, LLM observability, and unit test mocking
|
|
256
|
+
- [agentme-edr-019](019-ai-agents-development-standards.md) — Agent development standards: deepagents framework, tool-invocation loops, and agent patterns
|
|
257
|
+
- [agentme-edr-026](026-pragmatic-hexagonal-architecture.md) — Adapter/application layer separation that defines the project layout
|
|
258
|
+
- [agentme-edr-014](014-python-project-tooling.md) — Python project tooling and structure
|
|
259
|
+
- [agentme-edr-024](024-ml-dataset-structure.md) — ML dataset structure for eval datasets
|
|
260
|
+
- [agentme-edr-021](021-ai-eval-standards.md) — AI eval standards: folder structure, script requirements, and MLflow tracking
|
|
261
|
+
- [agentme-edr-007](../principles/007-project-quality-standards.md) — Project quality standards including AI-tier testing requirements (rule `09-ai-project-testing-requirements`)
|
|
@@ -0,0 +1,90 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: agentme-edr-policy-021-ai-eval-standards
|
|
3
|
+
description: Defines how to structure, write, and run eval tests for AI projects — folder layout, script requirements, and MLflow tracking. Use when implementing evals for LLM, Agent, or Workflow projects. For when evals are required see agentme-edr-007 rule 09-ai-project-testing-requirements.
|
|
4
|
+
apply-to: Python AI projects (LLM, Agent, or Workflow tier) that implement eval testing
|
|
5
|
+
valid-from: 2026-06-05
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# agentme-edr-policy-021: AI eval standards
|
|
9
|
+
|
|
10
|
+
## Context and Problem Statement
|
|
11
|
+
|
|
12
|
+
Eval tests measure AI component accuracy against expected outputs using real LLM providers. Without a shared folder layout and script convention, eval setups diverge across LLM, Agent, and Workflow projects, making them hard to run, compare, and integrate into CI/CD pipelines.
|
|
13
|
+
|
|
14
|
+
How should eval tests be structured and run across all AI tiers?
|
|
15
|
+
|
|
16
|
+
## Decision Outcome
|
|
17
|
+
|
|
18
|
+
**Use a per-component folder structure under `evals/` with a standardized Makefile interface and MLflow-backed scripts, applicable to LLM, Agent, and Workflow components.**
|
|
19
|
+
|
|
20
|
+
For when evals are required per AI tier, see [agentme-edr-007](../principles/007-project-quality-standards.md) rule `09-ai-project-testing-requirements`.
|
|
21
|
+
|
|
22
|
+
### Details
|
|
23
|
+
|
|
24
|
+
#### 01-eval-folder-structure
|
|
25
|
+
|
|
26
|
+
For each AI component being evaluated (an LLM chain, agent, or workflow), create a corresponding directory under `evals/` at the same level as `lib/` and `examples/`:
|
|
27
|
+
|
|
28
|
+
```text
evals/
  <component>/
    Makefile              # eval targets for this component
    dataset_<group>/      # one folder per eval group (see agentme-edr-024)
    eval_<group>.py       # evaluation script for each group
```
|
|
35
|
+
|
|
36
|
+
Where `<component>` is the name of the LLM chain, agent, or workflow being evaluated (e.g., `summarizer`, `file_analyzer_agent`, `document_review_workflow`).
|
|
37
|
+
|
|
38
|
+
The per-component `evals/<component>/Makefile` MUST define:
|
|
39
|
+
|
|
40
|
+
| Target | Behaviour |
|---|---|
| `eval` | Runs all eval groups for the component |
| `eval-<group>` | Runs one named group (e.g. `eval-simple`, `eval-complex`) |
|
|
44
|
+
|
|
45
|
+
The module root Makefile MUST expose a `make eval` target that delegates to `eval` in every `evals/<component>/Makefile`:
|
|
46
|
+
|
|
47
|
+
```makefile
eval:
	$(MAKE) -C evals/summarizer eval
	$(MAKE) -C evals/document_review_workflow eval
```
|
|
52
|
+
|
|
53
|
+
#### 02-eval-script-requirements
|
|
54
|
+
|
|
55
|
+
Each `eval_<group>.py` script MUST:
|
|
56
|
+
|
|
57
|
+
- Load the dataset from `evals/<component>/dataset_<group>/` following [agentme-edr-024](024-ml-dataset-structure.md). For input/output pairs, use the JSONL format per `agentme-edr-024.04-complex-structured-datasets-must-use-jsonl`.
|
|
58
|
+
- Run every input through the live component against **real LLM providers** (not mocked responses), to capture model drift.
|
|
59
|
+
- Log per-sample and aggregate metrics to an MLflow experiment that runs **locally** — a remote MLflow server MUST NOT be required.
|
|
60
|
+
- Compare outputs to expected values using project-defined quality thresholds. Thresholds MUST be declared explicitly (e.g., in a Makefile variable or README).
|
|
61
|
+
- Exit with a non-zero status when any metric falls below its defined threshold, consistent with [agentme-edr-007](../principles/007-project-quality-standards.md) rule `07-statistical-models-must-have-eval-targets`.
|
|
62
|
+
|
|
63
|
+
**Example:**
|
|
64
|
+
|
|
65
|
+
```python
import mlflow
from my_package.app.workflows.document_review_workflow.graph import graph

EVAL_MIN_ACCURACY = 0.85

# load_dataset is a project-defined helper (not shown) that yields dataset samples.
with mlflow.start_run():
    results = []
    for sample in load_dataset("evals/document_review_workflow/dataset_basic/"):
        output = graph.invoke({"document": sample["input"]})
        results.append(output["label"] == sample["expected_label"])

    accuracy = sum(results) / len(results)
    mlflow.log_metric("accuracy", accuracy)

# A non-zero exit status fails the eval when the threshold is not met.
if accuracy < EVAL_MIN_ACCURACY:
    raise SystemExit(f"Eval failed: accuracy {accuracy:.2f} < {EVAL_MIN_ACCURACY}")
```
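The example above calls `load_dataset` without defining it. One possible shape for that helper is sketched below, assuming the dataset folder holds one or more `.jsonl` files whose records carry `input` and `expected_label` fields; the actual layout is governed by agentme-edr-024 and may differ.

```python
import json
from pathlib import Path
from typing import Iterator


def load_dataset(dataset_dir: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of every .jsonl file in the folder."""
    for path in sorted(Path(dataset_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)
```

Sorting the file names keeps the sample order stable between runs, which makes per-sample metrics easier to compare across MLflow runs.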
|
|
83
|
+
|
|
84
|
+
## References
|
|
85
|
+
|
|
86
|
+
- [agentme-edr-007](../principles/007-project-quality-standards.md) — Project quality standards: when evals are required per AI tier (rule `09-ai-project-testing-requirements`) and statistical model eval targets (rule `07-statistical-models-must-have-eval-targets`)
|
|
87
|
+
- [agentme-edr-018](018-ai-llm-development-standards.md) — LLM development standards: LangChain framework and observability
|
|
88
|
+
- [agentme-edr-019](019-ai-agents-development-standards.md) — Agent development standards
|
|
89
|
+
- [agentme-edr-020](020-ai-workflow-development-standards.md) — Workflow development standards
|
|
90
|
+
- [agentme-edr-024](024-ml-dataset-structure.md) — ML dataset structure for eval datasets
|
package/.xdrs/agentme/edrs/application/{019-ml-dataset-structure.md → 024-ml-dataset-structure.md}
RENAMED
|
@@ -1,11 +1,11 @@
|
|
|
1
1
|
---
|
|
2
|
-
name: agentme-edr-policy-
|
|
2
|
+
name: agentme-edr-policy-024-ml-dataset-structure
|
|
3
3
|
description: Defines the standard folder layout and file conventions for ML datasets used in AI/ML projects. Use when creating, organizing, or consuming datasets for machine learning tasks such as image labeling, document extraction, tabular data, LLM evaluation, and Q&A sets.
|
|
4
4
|
apply-to: ML and AI projects that produce or consume datasets
|
|
5
5
|
valid-from: 2026-05-27
|
|
6
6
|
---
|
|
7
7
|
|
|
8
|
-
# agentme-edr-policy-
|
|
8
|
+
# agentme-edr-policy-024: ML dataset structure
|
|
9
9
|
|
|
10
10
|
## Context and Problem Statement
|
|
11
11
|
|
|
@@ -1,11 +1,11 @@
|
|
|
1
1
|
---
|
|
2
|
-
name: agentme-edr-policy-
|
|
2
|
+
name: agentme-edr-policy-025-ai-agent-xdrs-knowledge-layer
|
|
3
3
|
description: Defines how to integrate XDRS as the runtime knowledge source of truth for AI agents — covering document placement, AGENTS.md setup, file tools, and local sandbox configuration. Apply only when the project explicitly uses XDRS to govern agent behavior.
|
|
4
4
|
apply-to: AI agent projects that use XDRS as the source of truth for policies and skills
|
|
5
5
|
valid-from: 2026-05-27
|
|
6
6
|
---
|
|
7
7
|
|
|
8
|
-
# agentme-edr-policy-
|
|
8
|
+
# agentme-edr-policy-025: AI agent XDRS knowledge layer
|
|
9
9
|
|
|
10
10
|
## Context and Problem Statement
|
|
11
11
|
|
|
@@ -17,7 +17,7 @@ How should an AI agent project integrate XDRS as its runtime source of truth for
|
|
|
17
17
|
|
|
18
18
|
**Embed XDRS documents in `lib/data/.xdrs/`, instruct the agent to consult them via `AGENTS.md`, equip the agent with sandboxed file tools, and use the deepagents framework when a local sandbox is required.**
|
|
19
19
|
|
|
20
|
-
This policy MUST only be applied when the project explicitly chooses XDRS as its knowledge governance layer. It is not required by [agentme-edr-
|
|
20
|
+
This policy MUST only be applied when the project explicitly chooses XDRS as its knowledge governance layer. It is not required by [agentme-edr-019](019-ai-agents-development-standards.md) or [agentme-edr-020](020-ai-workflow-development-standards.md) in general.
|
|
21
21
|
|
|
22
22
|
### Details
|
|
23
23
|
|
|
@@ -91,7 +91,7 @@ data_root = str(files("myagent").joinpath("data"))
|
|
|
91
91
|
agents_md = Path(temp_root) / "AGENTS.md"
|
|
92
92
|
agents_md.write_text(_AGENTS_MD) # content from xdrs-core AGENTS.md template; see rule 01-xdrs-knowledge-layer
|
|
93
93
|
|
|
94
|
-
# Add these mounts alongside the base mounts from agentme-edr-
|
|
94
|
+
# Add these mounts alongside the base mounts from agentme-edr-019 rule 02-local-sandbox:
|
|
95
95
|
xdrs_mounts = [
|
|
96
96
|
{"src": f"{data_root}/.xdrs", "dst": "/.xdrs", "readonly": True},
|
|
97
97
|
{"src": str(agents_md), "dst": "/AGENTS.md", "readonly": True},
|
|
@@ -1,11 +1,11 @@
|
|
|
1
1
|
---
|
|
2
|
-
name: agentme-edr-policy-
|
|
2
|
+
name: agentme-edr-policy-026-pragmatic-hexagonal-architecture
|
|
3
3
|
description: Defines a pragmatic variant of Hexagonal Architecture for organizing application source code into Adapters (inbound/outbound I/O boundaries) and Application (business logic) layers, with explicit naming conventions and folder structure. Use when designing or reviewing the internal layout of application modules.
|
|
4
4
|
apply-to: All application projects
|
|
5
5
|
valid-from: 2026-05-28
|
|
6
6
|
---
|
|
7
7
|
|
|
8
|
-
# agentme-edr-policy-
|
|
8
|
+
# agentme-edr-policy-026: Pragmatic hexagonal architecture
|
|
9
9
|
|
|
10
10
|
## Context and Problem Statement
|
|
11
11
|
|
|
@@ -15,12 +15,12 @@ compatibility: JavaScript/TypeScript, Node.js 18+
|
|
|
15
15
|
|
|
16
16
|
Creates a complete JavaScript/TypeScript project from scratch. The layout keeps the
|
|
17
17
|
package self-contained in its module root (`lib/`), organizes internal code following
|
|
18
|
-
[agentme-edr-
|
|
18
|
+
[agentme-edr-026](../../026-pragmatic-hexagonal-architecture.md) (`adapters/`, `app/`, `shared/`),
|
|
19
19
|
places runnable consumer examples in the sibling `examples/` folder, redirects persistent caches
|
|
20
20
|
into `.cache/`, and uses Makefiles as the only entry points. Boilerplate is derived from the
|
|
21
21
|
[filedist](https://github.com/flaviostutz/filedist) project.
|
|
22
22
|
|
|
23
|
-
Related EDRs: [agentme-edr-003](../../003-javascript-project-tooling.md), [agentme-edr-016](../../../principles/016-cross-language-module-structure.md), [agentme-edr-
|
|
23
|
+
Related EDRs: [agentme-edr-003](../../003-javascript-project-tooling.md), [agentme-edr-016](../../../principles/016-cross-language-module-structure.md), [agentme-edr-026](../../026-pragmatic-hexagonal-architecture.md)
|
|
24
24
|
|
|
25
25
|
## Instructions
|
|
26
26
|
|
|
@@ -12,9 +12,9 @@ compatibility: Go 1.21+
|
|
|
12
12
|
|
|
13
13
|
## Overview
|
|
14
14
|
|
|
15
|
-
Creates a complete Go project from scratch, following the layout from [agentme-edr-010](../../010-golang-project-tooling.md) and [agentme-edr-
|
|
15
|
+
Creates a complete Go project from scratch, following the layout from [agentme-edr-010](../../010-golang-project-tooling.md) and [agentme-edr-026](../../026-pragmatic-hexagonal-architecture.md). Business logic lives in `app/<feature>/` packages; CLI wiring lives in `adapters/cli/`; outbound integrations live in `adapters/connectors/`; `main.go` is a thin dispatcher. The module root owns its `Makefile`, `README.md`, `dist/`, and `.cache/` folders.
|
|
16
16
|
|
|
17
|
-
Related EDRs: [agentme-edr-010](../../010-golang-project-tooling.md), [agentme-edr-016](../../../principles/016-cross-language-module-structure.md), [agentme-edr-
|
|
17
|
+
Related EDRs: [agentme-edr-010](../../010-golang-project-tooling.md), [agentme-edr-016](../../../principles/016-cross-language-module-structure.md), [agentme-edr-026](../../026-pragmatic-hexagonal-architecture.md)
|
|
18
18
|
|
|
19
19
|
## Instructions
|
|
20
20
|
|
|
@@ -14,11 +14,11 @@ compatibility: Python 3.12+
|
|
|
14
14
|
|
|
15
15
|
Creates a complete Python project from scratch using Mise, `uv`, `pyproject.toml`, Ruff,
|
|
16
16
|
ty, Pytest, and Makefiles. The layout keeps the package self-contained under `lib/`,
|
|
17
|
-
organizes internal code following [agentme-edr-
|
|
17
|
+
organizes internal code following [agentme-edr-026](../../026-pragmatic-hexagonal-architecture.md)
|
|
18
18
|
(`adapters/`, `app/`, `shared/`), uses a shared root `.venv/`, redirects persistent caches into
|
|
19
19
|
`.cache/`, and places runnable consumer projects under the sibling `examples/` folder.
|
|
20
20
|
|
|
21
|
-
Related EDRs: [agentme-edr-014](../../014-python-project-tooling.md), [agentme-edr-016](../../../principles/016-cross-language-module-structure.md), [agentme-edr-
|
|
21
|
+
Related EDRs: [agentme-edr-014](../../014-python-project-tooling.md), [agentme-edr-016](../../../principles/016-cross-language-module-structure.md), [agentme-edr-026](../../026-pragmatic-hexagonal-architecture.md)
|
|
22
22
|
|
|
23
23
|
## Instructions
|
|
24
24
|
|
|
@@ -282,7 +282,7 @@ make test
|
|
|
282
282
|
|
|
283
283
|
### Phase 4: Create the package and tests inside `lib/`
|
|
284
284
|
|
|
285
|
-
Create this baseline structure following [agentme-edr-
|
|
285
|
+
Create this baseline structure following [agentme-edr-026](../../026-pragmatic-hexagonal-architecture.md).
|
|
286
286
|
|
|
287
287
|
**`lib/src/[package_name]/__init__.py`**
|
|
288
288
|
|
|
@@ -31,11 +31,13 @@ Language and framework-specific tooling and project structure.
|
|
|
31
31
|
- [agentme-edr-010](application/010-golang-project-tooling.md) - **Go project tooling and structure** - Scaffold Go CLIs and libraries with the standard layout *(includes skill: [003-create-golang-project](application/skills/003-create-golang-project/SKILL.md))*
|
|
32
32
|
- [agentme-edr-014](application/014-python-project-tooling.md) - **Python project tooling and structure** - Scaffold Python packages and CLIs with the standard layout *(includes skill: [005-create-python-project](application/skills/005-create-python-project/SKILL.md))*
|
|
33
33
|
- [agentme-edr-015](application/015-cli-tool-standards.md) - **CLI tool standards** - Define command UX and behavior for CLI tools
|
|
34
|
-
- [agentme-edr-018](application/018-ai-
|
|
35
|
-
- [agentme-edr-019](application/019-
|
|
36
|
-
- [agentme-edr-
|
|
37
|
-
- [agentme-edr-
|
|
38
|
-
- [agentme-edr-
|
|
34
|
+
- [agentme-edr-018](application/018-ai-llm-development-standards.md) - **AI LLM development standards** - Standard framework (LangChain) and patterns for simple LLM calls with explicit configuration (no environment variables)
|
|
35
|
+
- [agentme-edr-019](application/019-ai-agents-development-standards.md) - **AI agents development standards** - Standard framework (deepagents) and patterns for agentic tool-invocation loops
|
|
36
|
+
- [agentme-edr-020](application/020-ai-workflow-development-standards.md) - **AI workflow development standards** - Standard toolchain (LangGraph), evaluation, and testing patterns for workflow projects
|
|
37
|
+
- [agentme-edr-021](application/021-ai-eval-standards.md) - **AI eval standards** - Folder structure, script requirements, and MLflow tracking for eval tests across LLM, Agent, and Workflow tiers
|
|
38
|
+
- [agentme-edr-024](application/024-ml-dataset-structure.md) - **ML dataset structure** - Standard folder layout and file conventions for ML datasets
|
|
39
|
+
- [agentme-edr-025](application/025-ai-agent-xdrs-knowledge-layer.md) - **AI agent XDRS knowledge layer** - How to integrate XDRS as the runtime source of truth for policies and skills in AI agents (apply only when the project explicitly uses XDRS)
|
|
40
|
+
- [agentme-edr-026](application/026-pragmatic-hexagonal-architecture.md) - **Pragmatic hexagonal architecture** - Organize application layers as External/Adapters/Application with practical coupling rules
|
|
39
41
|
- [004-select-relevant-xdrs](application/skills/004-select-relevant-xdrs/SKILL.md) - **Select relevant XDRs**
|
|
40
42
|
|
|
41
43
|
## Devops
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: agentme-edr-policy-007-project-quality-standards
|
|
3
|
-
description: Defines minimum project quality standards for README onboarding, testing (unit and
|
|
3
|
+
description: Defines minimum project quality standards for README onboarding, testing (unit, integration, and AI-tier evals), linting, XDR compliance, and runnable examples. Use when scaffolding or reviewing projects.
|
|
4
4
|
apply-to: All projects
|
|
5
5
|
valid-from: 2026-05-25
|
|
6
6
|
---
|
|
@@ -230,3 +230,31 @@ test-integration:
|
|
|
230
230
|
```
|
|
231
231
|
|
|
232
232
|
Projects are not required to implement integration tests, but when present, they SHOULD follow these conventions for consistency across the codebase.
|
|
233
|
+
|
|
234
|
+
---
|
|
235
|
+
|
|
236
|
+
#### 09-ai-project-testing-requirements
|
|
237
|
+
|
|
238
|
+
AI projects are classified into three tiers — LLM, Agent, and Workflow — defined in [agentme-edr-018](../application/018-ai-llm-development-standards.md). Testing requirements differ per tier:
|
|
239
|
+
|
|
240
|
+
| Tier | Unit tests | Evals | Integration tests |
|
|
241
|
+
|---|---|---|---|
|
|
242
|
+
| **LLM** ([agentme-edr-018](../application/018-ai-llm-development-standards.md)) | Not required | Not required; SHOULD be used when critical prompts are in use to measure accuracy and detect model drift | Not required |
|
|
243
|
+
| **Agent** ([agentme-edr-019](../application/019-ai-agents-development-standards.md)) | Not required | Not required; MAY be used | Not required |
|
|
244
|
+
| **Workflow** ([agentme-edr-020](../application/020-ai-workflow-development-standards.md)) | **Required** — see below | **Required** before every release; failed evals block release | Advised |
|
|
245
|
+
|
|
246
|
+
**Workflow unit test requirements:**
|
|
247
|
+
|
|
248
|
+
- MUST use mocked LLM providers. See [agentme-edr-018](../application/018-ai-llm-development-standards.md) rule `04-unit-test-mocking` for the mocking pattern.
|
|
249
|
+
- MUST run offline with no external dependencies per [agentme-edr-004](004-unit-test-requirements.md) rule `02-must-run-offline`.
|
|
250
|
+
- MUST achieve 80% code coverage per [agentme-edr-004](004-unit-test-requirements.md) rule `03-must-maintain-80-percent-coverage`.
|
|
251
|
+
- MUST test workflow routing logic, conditional edges, state transformations, and error handling.
|
|
252
|
+
- MUST achieve **80% coverage of LangGraph graph edges and branches**: every conditional edge MUST have test cases covering each possible branch, and every node→node transition MUST be exercised by at least one test.
|
|
253
|
+
- Files MUST be named `<name>_test.py` and placed alongside the source file per [agentme-edr-004](004-unit-test-requirements.md) rule `04-must-place-test-files-alongside-source`.
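As an illustration of the branch-coverage and mocked-provider rules above, here is a minimal sketch of an offline unit test for one conditional edge. It uses plain Python rather than the LangGraph API, and every name in it (`FakeLLM`, `classify`, `route_after_classify`, the node names, the labels) is hypothetical and not taken from the EDR. Under the naming rule above, such a test would live in a file like `router_test.py` next to the source.

```python
from typing import TypedDict


class State(TypedDict):
    question: str
    label: str


class FakeLLM:
    """Hypothetical offline stand-in for a real provider: returns a canned reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def invoke(self, prompt: str) -> str:
        return self.reply


def classify(state: State, llm: FakeLLM) -> State:
    """Node: ask the model for a label and store the normalized label in the state."""
    label = llm.invoke(f"Classify: {state['question']}").strip().lower()
    return {**state, "label": label}


def route_after_classify(state: State) -> str:
    """Conditional edge: choose the next node name from the label."""
    if state["label"] == "billing":
        return "billing_node"
    if state["label"] == "technical":
        return "technical_node"
    return "fallback_node"


def test_every_branch_is_covered() -> None:
    # One case per possible branch of the conditional edge,
    # including the fallback taken on unexpected model output.
    cases = {
        "billing": "billing_node",
        "Technical\n": "technical_node",
        "gibberish": "fallback_node",
    }
    for reply, expected in cases.items():
        state = classify({"question": "q", "label": ""}, FakeLLM(reply))
        assert route_after_classify(state) == expected
```

Because the fake provider is injected as an argument, the test needs no network access and no credentials. In a real workflow, the same routing function could presumably be registered as the conditional edge of the graph and tested in isolation like this.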
|
|
254
|
+
|
|
255
|
+
**Workflow eval requirements:**
|
|
256
|
+
|
|
257
|
+
- Evals MUST be executed before every release.
|
|
258
|
+
- Accuracy below project-defined thresholds MUST block the release. Thresholds MUST be documented in the eval Makefile or README.
|
|
259
|
+
- Evals MUST run against real LLM providers (not mocks) to capture model drift.
|
|
260
|
+
- For eval folder structure and script requirements, see [agentme-edr-021](../application/021-ai-eval-standards.md).
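The release-blocking rule above can be sketched as a small gate that turns eval results into a process exit code. This covers only the threshold check. It assumes the per-case pass/fail results were already produced by running against a real provider, and the function names and the 0.70 threshold are hypothetical, not values defined by the EDR.

```python
def accuracy(results: list[bool]) -> float:
    """Fraction of eval cases that passed; an empty run counts as 0.0."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def gate(results: list[bool], threshold: float) -> int:
    """Return an exit code: 0 when accuracy meets the threshold, 1 otherwise."""
    score = accuracy(results)
    print(f"accuracy={score:.2%} threshold={threshold:.2%}")
    return 0 if score >= threshold else 1


# Hypothetical usage from an eval script:
#   import sys
#   sys.exit(gate(results, threshold=0.70))
```

Returning a non-zero exit code is one way to let a Makefile target or CI job fail, and therefore block the release, when accuracy drops below the documented threshold.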
|
package/package.json
CHANGED
|
@@ -1,9 +1,9 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "agentme",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.15.0",
|
|
4
4
|
"description": "",
|
|
5
5
|
"dependencies": {
|
|
6
|
-
"filedist": "^0.34.
|
|
6
|
+
"filedist": "^0.34.2"
|
|
7
7
|
},
|
|
8
8
|
"bin": "bin/filedist.js",
|
|
9
9
|
"files": [
|
|
@@ -22,6 +22,6 @@
|
|
|
22
22
|
"url": "https://github.com/flaviostutz/agentme.git"
|
|
23
23
|
},
|
|
24
24
|
"devDependencies": {
|
|
25
|
-
"xdrs-core": "^0.28.
|
|
25
|
+
"xdrs-core": "^0.28.3"
|
|
26
26
|
}
|
|
27
27
|
}
|