agentme 0.13.0 → 0.14.0
- package/.xdrs/agentme/edrs/application/018-ai-agent-development-standards.md +159 -24
- package/.xdrs/agentme/edrs/application/024-llm-development-standards.md +116 -0
- package/.xdrs/agentme/edrs/index.md +2 -1
- package/.xdrs/agentme/edrs/principles/007-project-quality-standards.md +43 -2
- package/package.json +1 -1
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: agentme-edr-policy-018-ai-agent-development-standards
|
|
3
|
-
description: Defines the standard toolchain, framework, evaluation approach, and workflow patterns for building AI agents
|
|
4
|
-
apply-to: AI agent projects built with Python
|
|
3
|
+
description: Defines the standard toolchain, framework, evaluation approach, testing strategy, and workflow patterns for building AI agents (tool-invocation loops) and LangGraph workflows in Python. Use when scaffolding, reviewing, or extending AI agent or workflow projects. For raw LLM calls and provider configuration, see agentme-edr-024.
|
|
4
|
+
apply-to: AI agent and LangGraph workflow projects built with Python
|
|
5
5
|
valid-from: 2026-05-26
|
|
6
6
|
---
|
|
7
7
|
|
|
@@ -17,49 +17,41 @@ Which tools, frameworks, and design patterns should AI agent projects follow to
|
|
|
17
17
|
|
|
18
18
|
**Use Python with LangGraph for flow orchestration and MLflow for experiment tracking and local evaluation.**
|
|
19
19
|
|
|
20
|
+
This policy covers the **Agent** and **Workflow** tiers of the three-tier conceptual model defined in [agentme-edr-024](024-llm-development-standards.md). For the definition of LLM, Agent, and Workflow, and for the LangChain framework rules that govern direct LLM calls, see [agentme-edr-024](024-llm-development-standards.md).
|
|
21
|
+
|
|
20
22
|
### Details
|
|
21
23
|
|
|
22
24
|
#### 01-language-and-framework
|
|
23
25
|
|
|
24
|
-
All agent projects MUST be implemented in Python, following [agentme-edr-014](014-python-project-tooling.md) for project structure, tooling, and Makefile conventions.
|
|
26
|
+
All agent and workflow projects MUST be implemented in Python, following [agentme-edr-014](014-python-project-tooling.md) for project structure, tooling, and Makefile conventions.
|
|
25
27
|
|
|
26
28
|
Agent flows MUST be built with **LangGraph**. Use LangGraph `StateGraph` to model each distinct workflow as an explicit directed graph with typed state.
|
|
27
29
|
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
Agent code MUST be compatible with both **OpenAI** and **Azure OpenAI** providers without code changes. Achieve this by:
|
|
31
|
-
|
|
32
|
-
- Using the `langchain-openai` package which supports both providers through environment variables.
|
|
33
|
-
- Selecting the provider by setting `OPENAI_API_TYPE=azure` (Azure OpenAI) or omitting it (OpenAI).
|
|
34
|
-
- Never hardcoding provider-specific URLs, deployment names, or API versions in code; inject them through environment variables or a configuration object.
|
|
35
|
-
|
|
36
|
-
Minimum required environment variable surface:
|
|
37
|
-
|
|
38
|
-
| Variable | Purpose |
|
|
39
|
-
|---|---|
|
|
40
|
-
| `OPENAI_API_KEY` | API key (both providers) |
|
|
41
|
-
| `OPENAI_API_BASE` / `AZURE_OPENAI_ENDPOINT` | Endpoint (Azure only) |
|
|
42
|
-
| `OPENAI_API_VERSION` | API version (Azure only) |
|
|
43
|
-
| `AZURE_OPENAI_DEPLOYMENT` | Deployment/model name (Azure only) |
|
|
44
|
-
| `OPENAI_MODEL` | Model name (OpenAI only) |
|
|
30
|
+
For all direct LLM calls within agent and workflow nodes, use LangChain per [agentme-edr-024](024-llm-development-standards.md).
|
|
45
31
|
|
|
46
32
|
#### 03-observability-and-experiment-tracking
|
|
47
33
|
|
|
48
|
-
Use **MLflow** for all agent observability and evaluation:
|
|
34
|
+
Use **MLflow** for all agent and workflow observability and evaluation:
|
|
49
35
|
|
|
50
|
-
- Wrap each agent run with `mlflow.start_run()` to capture traces, parameters, and metrics locally.
|
|
51
|
-
- Enable LangChain auto-tracing via `mlflow.langchain.autolog()` at entry point startup.
|
|
36
|
+
- Wrap each agent or workflow run with `mlflow.start_run()` to capture traces, parameters, and metrics locally.
|
|
52
37
|
- Log run parameters (model name, temperature, prompt version) and output metrics (accuracy, latency, token counts) using `mlflow.log_param` / `mlflow.log_metric`.
|
|
53
38
|
- Run a local MLflow tracking server with `mlflow ui` to inspect runs during development. Do not require a remote MLflow server for local development.
|
|
39
|
+
- For LangChain-level auto-tracing of individual LLM calls, see [agentme-edr-024](024-llm-development-standards.md) rule `03-llm-observability`.
|
|
54
40
|
|
|
55
41
|
#### 04-dataset-driven-accuracy-measurement
|
|
56
42
|
|
|
57
43
|
Every agent pipeline MUST have a companion evaluation dataset and an MLflow experiment that measures accuracy against it. Datasets and evals are organized per-workflow following rule `07-workflow-structure` and rule `08-workflow-evals`.
|
|
58
44
|
|
|
45
|
+
- **Evals** measure model accuracy and performance against expected outputs. They are REQUIRED before every release to verify the workflow meets specified accuracy thresholds. They run against real LLM providers to capture model drift. They log metrics to MLflow and MUST have project-defined quality thresholds that block releases when not met.
|
|
46
|
+
- **Integration tests** verify that workflows execute end-to-end with real connectors and real models, using pass/fail assertions. They are ADVISED but not required. They validate wiring, error handling, and system integration, not model accuracy. See rule `13-three-tier-testing-strategy` for integration test guidelines.
|
|
47
|
+
|
|
48
|
+
**Eval requirements:**
|
|
49
|
+
|
|
59
50
|
- Store evaluation datasets under `evals/<workflow>/` (sibling of `lib/` and `examples/`), following [agentme-edr-019](019-ml-dataset-structure.md) for structure and format. For MLflow input/output pairs, use the JSONL format described in `agentme-edr-019.04-complex-structured-datasets-must-use-jsonl`.
|
|
60
51
|
- Write evaluation scripts under `evals/<workflow>/` that load the dataset, run each input through the live agent (against real LLMs, not mocks), compare outputs to expected values, and log per-sample and aggregate metrics to an MLflow experiment.
|
|
61
52
|
- Add a `make eval` Makefile target in the module root Makefile (the same Makefile that sits alongside `lib/` and `examples/`) that delegates to all per-workflow eval targets.
|
|
62
53
|
- Evaluation MUST run against real LLM providers, not recorded responses, to capture model drift. MLflow tracking MUST work locally without a remote server.
|
|
54
|
+
- Evals MUST be executed before every release. An eval run whose accuracy falls below the project-defined threshold MUST block the release.
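The release gate in the eval requirements above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's eval script: the `EvalSample` type and the `accuracy` and `gate_release` helpers are hypothetical names, exact string match is only one possible comparison, and MLflow logging is left out so the sketch has no dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSample:
    """One dataset row paired with the live agent's output (hypothetical shape)."""

    expected: str
    actual: str


def accuracy(samples: list[EvalSample]) -> float:
    """Return the fraction of samples whose output exactly matches the expected value."""
    if not samples:
        raise ValueError("eval dataset is empty")
    hits = sum(1 for sample in samples if sample.actual == sample.expected)
    return hits / len(samples)


def gate_release(samples: list[EvalSample], min_accuracy: float) -> bool:
    """Return True only when the run meets the project-defined accuracy threshold."""
    return accuracy(samples) >= min_accuracy
```

An `eval_<slice>.py` script might call `gate_release` at the end of a run and exit with a non-zero status when it returns `False`, so that `make eval` fails and the release is blocked.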
|
|
63
55
|
|
|
64
56
|
#### 05-flow-documentation
|
|
65
57
|
|
|
@@ -149,7 +141,31 @@ Each `eval_<slice>.py` script MUST:
|
|
|
149
141
|
|
|
150
142
|
The module root Makefile `make eval` target MUST delegate to `eval` in every `evals/<workflow>/Makefile`.
|
|
151
143
|
|
|
152
|
-
#### 09-
|
|
144
|
+
#### 09-node-naming-conventions
|
|
145
|
+
|
|
146
|
+
LangGraph node names MUST follow a suffix convention that communicates the node's role at a glance. Names MUST be action-oriented and descriptive.
|
|
147
|
+
|
|
148
|
+
| Suffix | Node type | When to use |
|
|
149
|
+
|---|---|---|
|
|
150
|
+
| `_llm` | LLM call | Any node whose primary action is a direct LLM inference call |
|
|
151
|
+
| `_step` | Algorithmic step | Deterministic logic with no LLM involvement (transformation, validation, routing) |
|
|
152
|
+
| `_tool` | Tool/API call | A node that wraps a single external tool or API (e.g. a REST endpoint, DB query) |
|
|
153
|
+
| `_agent` | Subgraph agent | A node that invokes a nested subgraph containing its own tool-invocation cycle and LLM calls; prefer the **deepagents** library for these nodes |
|
|
154
|
+
|
|
155
|
+
The Python function implementing the node SHOULD share the same name as the node alias passed to `add_node`, so that graph definitions and stack traces remain unambiguous:
|
|
156
|
+
|
|
157
|
+
```python
|
|
158
|
+
def draft_doc_llm(state): ...
|
|
159
|
+
graph.add_node("draft_doc_llm", draft_doc_llm)
|
|
160
|
+
|
|
161
|
+
# Tool node — calls the Stripe API
|
|
162
|
+
def stripe_api_tool(state): ...
|
|
163
|
+
graph.add_node("stripe_api_tool", stripe_api_tool)
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
Names MUST NOT use generic labels such as `node1`, `process`, or `run`. Each name must clearly express what action the node performs.
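A naming check along these lines could run in a unit test or lint step. It is a sketch only: `is_valid_node_name` is a hypothetical helper, the generic-name set simply mirrors the three examples in the rule, and the snake_case requirement is an assumption that the rule itself does not state.

```python
import re

# Suffixes from the table in this rule.
ALLOWED_SUFFIXES = ("_llm", "_step", "_tool", "_agent")
# Mirrors the examples in the rule; not an exhaustive list.
GENERIC_NAMES = {"node1", "process", "run"}
# Assumption: node names are snake_case with at least two segments.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def is_valid_node_name(name: str) -> bool:
    """Check a LangGraph node name against the suffix convention."""
    if name in GENERIC_NAMES:
        return False
    if not _SNAKE_CASE.match(name):
        return False
    return name.endswith(ALLOWED_SUFFIXES)
```

A check like this cannot tell whether a name is descriptive; it only catches missing suffixes and the listed generic labels.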
|
|
167
|
+
|
|
168
|
+
#### 10-local-sandbox
|
|
153
169
|
|
|
154
170
|
When a workflow node or tool requires a **local sandbox** — an isolated environment where the agent can read files, glob-search directories, and execute shell commands — use the **[deepagents](https://github.com/deepagents/deepagents) framework** to provide that sandbox.
|
|
155
171
|
|
|
@@ -167,8 +183,127 @@ Use deepagents whenever ANY of the following is true for a workflow or tool:
|
|
|
167
183
|
- If the host-side code needs to pass files into the sandbox (e.g. generated config or input data), create a temporary directory with `tempfile.mkdtemp()`, write the files there, and mount it into the sandbox. Clean it up in the `finally` block.
|
|
168
184
|
- Replace hand-rolled `read_file`, `search_files`, and `grep_file` tool implementations with the equivalent tools provided by deepagents.
|
|
169
185
|
|
|
186
|
+
#### 11-state-type-conventions
|
|
187
|
+
|
|
188
|
+
All TypedDict and dataclass types that represent LangGraph node or workflow state MUST end with `_state` in their name. This suffix signals at a glance that the type is a state boundary, not a plain data model.
|
|
189
|
+
|
|
190
|
+
**Naming reference:**
|
|
191
|
+
|
|
192
|
+
| Owner | Naming pattern | Example |
|
|
193
|
+
|---|---|---|
|
|
194
|
+
| Single agent / agent subgraph | `<agent_name>_agent_state` | `reviewer_agent_state` |
|
|
195
|
+
| Full workflow (`StateGraph`) | `<workflow_name>_workflow_state` | `document_workflow_state` |
|
|
196
|
+
| Named group of nodes sharing state | `<group_responsibility>_state` | `retrieval_pipeline_state` |
|
|
197
|
+
|
|
198
|
+
**Boundary rules:**
|
|
199
|
+
|
|
200
|
+
- Each agent or agent subgraph MUST define its own dedicated state type. Do not reuse or extend a generic state across unrelated agents.
|
|
201
|
+
- Each workflow (`StateGraph`) MUST define its own top-level state type. The workflow state is the authoritative boundary for that graph's inputs and outputs.
|
|
202
|
+
- When a group of nodes (not a full workflow and not a single agent) shares a state type, the type name MUST clearly reflect the shared responsibility. Generic names such as `shared_state`, `common_state`, or `global_state` are FORBIDDEN.
|
|
203
|
+
- Large workflows MUST NOT use a single monolithic state that all nodes read and write. Split the state into per-phase or per-agent state types scoped to the subgraph or set of nodes that produce or consume each field.
|
|
204
|
+
|
|
205
|
+
State type names SHOULD align with the agent or node names defined in rule `09-node-naming-conventions` (e.g., an agent node named `draft_doc_agent` has a state type named `draft_doc_agent_state`).
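As an illustration of the naming and boundary rules, the state types for one agent node and its enclosing workflow might look like the sketch below. The field names and the `state_name_for` helper are hypothetical; the lowercase class names follow the examples in the naming table rather than PEP 8 CapWords.

```python
from typing import TypedDict


class draft_doc_agent_state(TypedDict):
    """State owned by the `draft_doc_agent` node (fields are illustrative)."""

    outline: str
    draft: str


class document_workflow_state(TypedDict):
    """Top-level state for the enclosing workflow graph (fields are illustrative)."""

    document: str
    status: str


def state_name_for(node_name: str) -> str:
    """Derive the state type name that pairs with an agent node name."""
    return f"{node_name}_state"
```

Keeping the agent state separate from the workflow state means the agent subgraph only sees the fields it produces or consumes.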
|
|
206
|
+
|
|
207
|
+
#### 12-workflow-naming-conventions
|
|
208
|
+
|
|
209
|
+
LangGraph `StateGraph` instances and their enclosing classes MUST be given a meaningful name that conveys the workflow's input, output, and/or behavior. The name MUST end with `Workflow` (PascalCase class) or `_workflow` (snake_case variable or directory).
|
|
210
|
+
|
|
211
|
+
Choose a name that summarises what the workflow consumes, processes, and produces — avoid generic labels such as `Pipeline`, `Flow`, `Graph`, or `Process`.
|
|
212
|
+
|
|
213
|
+
| Context | Pattern | Example |
|
|
214
|
+
|---|---|---|
|
|
215
|
+
| Python class | `<DescriptiveName>Workflow` | `FileMapJudgeReduceWorkflow` |
|
|
216
|
+
| Python variable / instance | `<descriptive_name>_workflow` | `file_map_judge_reduce_workflow` |
|
|
217
|
+
| Directory under `app/workflows/` | `<descriptive_name>_workflow` | `financial_report_analysis_workflow/` |
|
|
218
|
+
|
|
219
|
+
**Good names** communicate purpose at a glance:
|
|
220
|
+
|
|
221
|
+
- `FileMapJudgeReduceWorkflow` — maps files, judges each, then reduces results
|
|
222
|
+
- `FinancialReportAnalysisWorkflow` — analyses financial report inputs
|
|
223
|
+
- `MarketingCampaignExecutorWorkflow` — executes a marketing campaign end-to-end
|
|
224
|
+
|
|
225
|
+
**Bad names** (FORBIDDEN): `MainWorkflow`, `AgentGraph`, `ProcessFlow`, `Workflow1`, `RunGraph`.
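The class, variable, and directory patterns above can be derived from one another mechanically. The sketch below assumes two hypothetical helpers, `is_valid_workflow_class_name` and `workflow_variable_name`; the forbidden set contains only the names listed in this rule, and the conversion does not handle acronyms (for example `LLMWorkflow`) gracefully.

```python
import re

# Names listed as FORBIDDEN in this rule; a project may extend the set.
FORBIDDEN_CLASS_NAMES = {
    "MainWorkflow",
    "AgentGraph",
    "ProcessFlow",
    "Workflow1",
    "RunGraph",
}


def is_valid_workflow_class_name(name: str) -> bool:
    """Require the `Workflow` suffix and reject the bare or forbidden names."""
    if name == "Workflow" or name in FORBIDDEN_CLASS_NAMES:
        return False
    return name.endswith("Workflow")


def workflow_variable_name(class_name: str) -> str:
    """Convert a PascalCase class name to the snake_case variable or directory name."""
    # Insert an underscore before every capital letter except the first.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
```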
|
|
226
|
+
|
|
227
|
+
#### 13-three-tier-testing-strategy
|
|
228
|
+
|
|
229
|
+
AI agent and workflow projects recognize three distinct testing tiers, each with its own purpose, tooling, and execution model:
|
|
230
|
+
|
|
231
|
+
| Tier | Purpose | Model source | External APIs | File naming | When to run | Required |
|
|
232
|
+
|---|---|---|---|---|---|---|
|
|
233
|
+
| **Unit tests** | Test workflow logic, routing, and state management in isolation | Mocked (FakeListChatModel) | Mocked or faked | `<name>_test.py` | On every commit | **Required** |
|
|
234
|
+
| **Integration tests** | Verify end-to-end wiring with real models and real external connectors | Real LLM providers | Real connectors | `<name>_integration_test.py` | Before releases or on a schedule | Advised |
|
|
235
|
+
| **Evals** | Measure model accuracy and performance against expected outputs | Real LLM providers | Real connectors | `eval_<slice>.py` (see rule `08-workflow-evals`) | Before every release | **Required** |
|
|
236
|
+
|
|
237
|
+
**Unit test requirements (REQUIRED):**
|
|
238
|
+
|
|
239
|
+
- MUST use mocked LLM providers (see [agentme-edr-024](024-llm-development-standards.md) rule `04-unit-test-mocking`).
|
|
240
|
+
- MUST run offline with no external dependencies per [agentme-edr-004](../principles/004-unit-test-requirements.md) rule `02-must-run-offline`.
|
|
241
|
+
- MUST achieve 80% code coverage per [agentme-edr-004](../principles/004-unit-test-requirements.md) rule `03-must-maintain-80-percent-coverage`.
|
|
242
|
+
- MUST test workflow routing logic, conditional edges, state transformations, and error handling.
|
|
243
|
+
- Files MUST be named `<name>_test.py` and placed alongside the source file per [agentme-edr-004](../principles/004-unit-test-requirements.md) rule `04-must-place-test-files-alongside-source`.
|
|
244
|
+
- MUST achieve 80% coverage of workflow graph edges and branches per rule `14-workflow-graph-coverage`.
|
|
245
|
+
|
|
246
|
+
**Integration test guidelines (ADVISED):**
|
|
247
|
+
|
|
248
|
+
Integration tests are advised but not required. See [agentme-edr-007](../principles/007-project-quality-standards.md) rule `08-integration-tests-are-advised` for general integration testing guidelines.
|
|
249
|
+
|
|
250
|
+
For AI agent and workflow projects specifically, when integration tests are implemented:
|
|
251
|
+
|
|
252
|
+
- SHOULD use real LLM providers configured via environment variables
|
|
253
|
+
- MAY use smaller, faster models (e.g., `gpt-3.5-turbo`) instead of production models to reduce cost and latency
|
|
254
|
+
- SHOULD verify workflow execution with pass/fail assertions, not accuracy metrics (accuracy is measured by evals)
|
|
255
|
+
|
|
256
|
+
**Eval requirements (REQUIRED):**
|
|
257
|
+
|
|
258
|
+
Evals are REQUIRED for all agent and workflow projects. See rule `04-dataset-driven-accuracy-measurement` and rule `08-workflow-evals` for full requirements.
|
|
259
|
+
|
|
260
|
+
- Evals MUST run against real LLM providers to capture model drift and log metrics to MLflow for tracking model performance over time.
|
|
261
|
+
- Evals MUST be executed before every release to verify the workflow meets specified accuracy thresholds.
|
|
262
|
+
- An eval run whose accuracy falls below the project-defined threshold MUST block the release.
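The file naming column in the tier table is enough to sort files into tiers, which can help when wiring separate CI jobs. The `classify_tier` helper below is a hypothetical sketch of that mapping and uses only the three patterns from the table.

```python
from pathlib import PurePath


def classify_tier(path: str) -> str:
    """Map a file path to its testing tier using the naming patterns in the table."""
    name = PurePath(path).name
    # Check the integration suffix first: those files also end with `_test.py`.
    if name.endswith("_integration_test.py"):
        return "integration"
    if name.endswith("_test.py"):
        return "unit"
    if name.startswith("eval_") and name.endswith(".py"):
        return "eval"
    return "source"
```

The order of the checks matters, because every integration test file name also matches the unit test pattern.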
|
|
263
|
+
|
|
264
|
+
#### 14-workflow-graph-coverage
|
|
265
|
+
|
|
266
|
+
LangGraph workflows MUST achieve **80% coverage of workflow graph edges and branches** in unit tests.
|
|
267
|
+
|
|
268
|
+
**Graph edge coverage** measures whether each transition (edge) in the `StateGraph` is exercised by at least one test. This ensures that routing logic, conditional edges, and error paths are tested.
|
|
269
|
+
|
|
270
|
+
**Requirements:**
|
|
271
|
+
|
|
272
|
+
- Every conditional edge (e.g., `add_conditional_edges`) MUST have test cases covering each possible branch.
|
|
273
|
+
- Every node→node transition MUST be exercised by at least one test.
|
|
274
|
+
- Error handling paths (e.g., nodes that route to error states or retry nodes) MUST be tested with inputs that trigger those paths.
|
|
275
|
+
- Use mocked LLM responses in unit tests to control which branches are taken (see [agentme-edr-024](024-llm-development-standards.md) rule `04-unit-test-mocking`).
|
|
276
|
+
|
|
277
|
+
**Example:**
|
|
278
|
+
|
|
279
|
+
```python
from langchain_core.language_models.fake_chat_models import FakeListChatModel

# DocumentWorkflow and sample_doc are project-defined and not shown here.


def test_workflow_approval_path():
    """Test the workflow takes the approval path when LLM returns APPROVE."""
    fake_llm = FakeListChatModel(responses=["APPROVE"])
    workflow = DocumentWorkflow(llm=fake_llm)

    result = workflow.invoke({"document": sample_doc})

    assert result["status"] == "approved"
    assert "verify_approved" in result["_visited_nodes"]


def test_workflow_rejection_path():
    """Test the workflow takes the rejection path when LLM returns REJECT."""
    fake_llm = FakeListChatModel(responses=["REJECT"])
    workflow = DocumentWorkflow(llm=fake_llm)

    result = workflow.invoke({"document": sample_doc})

    assert result["status"] == "rejected"
    assert "handle_rejection" in result["_visited_nodes"]
```
|
|
300
|
+
|
|
301
|
+
When measuring graph coverage, apply the same 80% threshold used for code coverage: unit tests must exercise at least 80% of the graph's edges.
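One way to compute the edge coverage figure is sketched below. It assumes each test records the ordered list of node names it visited (as the `_visited_nodes` field in the example suggests) and that the declared edges are available as pairs of node names; `edge_coverage` and the `Edge` alias are hypothetical names, not part of LangGraph.

```python
Edge = tuple[str, str]


def edge_coverage(declared: set[Edge], visited_paths: list[list[str]]) -> float:
    """Return the fraction of declared edges traversed by at least one recorded path."""
    if not declared:
        raise ValueError("graph declares no edges")
    exercised: set[Edge] = set()
    for path in visited_paths:
        # Consecutive nodes in a recorded path form the traversed edges.
        exercised.update(zip(path, path[1:]))
    # Ignore traversed pairs that are not declared edges of the graph.
    return len(exercised & declared) / len(declared)
```

A unit test suite could collect the visited paths from every test and assert that `edge_coverage(...) >= 0.8` once at the end of the run.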
|
|
302
|
+
|
|
170
303
|
## References
|
|
171
304
|
|
|
305
|
+
- [agentme-edr-024](024-llm-development-standards.md) — LLM development standards: LangChain framework, provider compatibility, LLM observability, unit test mocking, and the LLM / Agent / Workflow conceptual model
|
|
306
|
+
- [agentme-edr-004](../principles/004-unit-test-requirements.md) — Unit test requirements including coverage, offline execution, and test file placement
|
|
172
307
|
- [agentme-edr-021](021-pragmatic-hexagonal-architecture.md) — Adapter/application layer separation that defines the project layout
|
|
173
308
|
- [agentme-edr-014](014-python-project-tooling.md) — Python project tooling and structure
|
|
174
309
|
- [agentme-edr-019](019-ml-dataset-structure.md) — ML dataset structure for eval datasets
|
|
@@ -0,0 +1,116 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: agentme-edr-policy-024-llm-development-standards
|
|
3
|
+
description: Defines the standard framework, provider compatibility, observability approach, and unit testing patterns for LLM-based applications in Python. Use when building, reviewing, or scaffolding any code that makes direct LLM calls, manages prompt context, or handles conversation history.
|
|
4
|
+
apply-to: Python projects that make LLM calls, manage prompt context, or handle conversation threads
|
|
5
|
+
valid-from: 2026-06-03
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# agentme-edr-policy-024: LLM development standards
|
|
9
|
+
|
|
10
|
+
## Context and Problem Statement
|
|
11
|
+
|
|
12
|
+
LLM-based applications can be built at different levels of abstraction — from a single prompt call to a full autonomous agent or a complex multi-step workflow. Without a shared vocabulary and a prescribed framework, projects mix incompatible patterns for invoking models, managing context, and tracing requests.
|
|
13
|
+
|
|
14
|
+
Which framework should be used for LLM calls, how should providers be configured, and what is the canonical meaning of "LLM", "agent", and "workflow" in this codebase?
|
|
15
|
+
|
|
16
|
+
## Decision Outcome
|
|
17
|
+
|
|
18
|
+
**Use LangChain as the standard framework for all direct LLM interactions. Adopt a strict three-tier conceptual model — LLM / Agent / Workflow — that maps each tier to a specific library.**
|
|
19
|
+
|
|
20
|
+
### Conceptual model
|
|
21
|
+
|
|
22
|
+
Three distinct tiers of LLM-based computation are recognized in this project. Every component MUST be classified into exactly one tier:
|
|
23
|
+
|
|
24
|
+
| Tier | What it is | Library |
|
|
25
|
+
|---|---|---|
|
|
26
|
+
| **LLM** | A request → response prompt exchange with a model. May include a conversation history or thread. No autonomous decision-making. | `langchain` / `langchain-openai` |
|
|
27
|
+
| **Agent** | An LLM-based flow driven by a tool-invocation loop that the LLM itself plans and executes. The LLM decides which tools to call and when to stop. | `deepagents` |
|
|
28
|
+
| **Workflow** | A directed graph of nodes that mixes LLM-based nodes (simple LLM calls or agentic loops) with deterministic algorithmic nodes. The graph topology is defined in code, not chosen by the LLM at runtime. | `langgraph` |
|
|
29
|
+
|
|
30
|
+
These tiers nest: a Workflow may contain Agent nodes; an Agent uses LLM calls internally. The tier of a component is determined by its outermost controlling structure.
|
|
31
|
+
|
|
32
|
+
See [agentme-edr-018](018-ai-agent-development-standards.md) for Agent and Workflow implementation standards.
|
|
33
|
+
|
|
34
|
+
### Details
|
|
35
|
+
|
|
36
|
+
#### 01-conceptual-model
|
|
37
|
+
|
|
38
|
+
Every component that interacts with an LLM MUST be classified as exactly one of the three tiers defined in the conceptual model table above: **LLM**, **Agent**, or **Workflow**.
|
|
39
|
+
|
|
40
|
+
- Do NOT use the word "agent" to describe a component that only makes a single LLM call without a tool-invocation loop.
|
|
41
|
+
- Do NOT use the word "workflow" to describe a component that is purely an LLM call with no graph structure.
|
|
42
|
+
- When designing a new feature, identify the correct tier first. The tier determines which library and patterns apply (LangChain, deepagents, or LangGraph).
|
|
43
|
+
|
|
44
|
+
#### 02-llm-framework
|
|
45
|
+
|
|
46
|
+
All direct LLM calls MUST use **LangChain** via the `langchain` and `langchain-openai` packages.
|
|
47
|
+
|
|
48
|
+
- Use `langchain-openai` as the provider integration layer. It supports both OpenAI and Azure OpenAI through environment variables, with no code changes required to switch.
|
|
49
|
+
- Select the provider by setting `OPENAI_API_TYPE=azure` for Azure OpenAI or omitting it for OpenAI.
|
|
50
|
+
- Never hardcode provider-specific URLs, deployment names, or API versions in code; inject them through environment variables or a configuration object.
|
|
51
|
+
|
|
52
|
+
Minimum required environment variable surface:
|
|
53
|
+
|
|
54
|
+
| Variable | Purpose |
|
|
55
|
+
|---|---|
|
|
56
|
+
| `OPENAI_API_KEY` | API key (both providers) |
|
|
57
|
+
| `OPENAI_API_BASE` / `AZURE_OPENAI_ENDPOINT` | Endpoint (Azure only) |
|
|
58
|
+
| `OPENAI_API_VERSION` | API version (Azure only) |
|
|
59
|
+
| `AZURE_OPENAI_DEPLOYMENT` | Deployment/model name (Azure only) |
|
|
60
|
+
| `OPENAI_MODEL` | Model name (OpenAI only) |
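The variable surface in the table can be read into a single configuration object, so that no provider detail is hardcoded elsewhere. The sketch below is an assumption about how a project might do this: `LLMProviderConfig` and `load_provider_config` are hypothetical names, and the object would still need to be passed to whichever `langchain-openai` chat model class the project constructs.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMProviderConfig:
    """Provider settings resolved from environment variables (hypothetical type)."""

    provider: str  # "azure" or "openai"
    api_key: str
    model: str  # deployment name on Azure, model name on OpenAI
    endpoint: Optional[str] = None
    api_version: Optional[str] = None


def load_provider_config(env: Mapping[str, str] = os.environ) -> LLMProviderConfig:
    """Build the config from the variables in the table; raise KeyError if one is missing."""
    api_key = env["OPENAI_API_KEY"]
    if env.get("OPENAI_API_TYPE", "").lower() == "azure":
        return LLMProviderConfig(
            provider="azure",
            api_key=api_key,
            model=env["AZURE_OPENAI_DEPLOYMENT"],
            # Either endpoint variable from the table is accepted.
            endpoint=env.get("AZURE_OPENAI_ENDPOINT") or env["OPENAI_API_BASE"],
            api_version=env["OPENAI_API_VERSION"],
        )
    return LLMProviderConfig(provider="openai", api_key=api_key, model=env["OPENAI_MODEL"])
```

Accepting the environment as a parameter keeps the function testable offline: a unit test can pass a plain dictionary instead of mutating `os.environ`.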
|
|
61
|
+
|
|
62
|
+
LangChain chain or runnable definitions MUST be placed in `app/workflows/<workflow>/agents.py` (for workflow-scoped LLM calls) or in the appropriate application layer module. Do not inline LLM construction in adapter or CLI code.
|
|
63
|
+
|
|
64
|
+
#### 03-llm-observability
|
|
65
|
+
|
|
66
|
+
Enable LangChain auto-tracing at every application entry point by calling `mlflow.langchain.autolog()` during startup, before any LLM call is made.
|
|
67
|
+
|
|
68
|
+
- This captures inputs, outputs, token counts, and latency for every LangChain chain or runnable automatically.
|
|
69
|
+
- Pair with `mlflow.start_run()` at the workflow or agent level to group LLM traces under a named experiment run (see [agentme-edr-018](018-ai-agent-development-standards.md) for run-level MLflow instrumentation).
|
|
70
|
+
- Do not disable autolog in production unless there is an explicit performance justification documented in the codebase.
|
|
71
|
+
|
|
72
|
+
#### 04-unit-test-mocking
|
|
73
|
+
|
|
74
|
+
LLM provider calls are external API calls and MUST be mocked in unit tests per [agentme-edr-004](../principles/004-unit-test-requirements.md) rule `06-should-avoid-mocks`. Mocking LLM providers enables offline test execution while testing workflow logic, routing, and state management.
|
|
75
|
+
|
|
76
|
+
Use **LangChain's `FakeListChatModel`** or a custom `GenericFakeChatModel` wrapper for unit testing LLM calls:
|
|
77
|
+
|
|
78
|
+
```python
from langchain_core.language_models.fake_chat_models import FakeListChatModel

# Unit test with mocked LLM responses
def test_document_workflow_routing():
    fake_llm = FakeListChatModel(responses=[
        "APPROVE",
        "The document meets all quality criteria."
    ])

    workflow = DocumentWorkflow(llm=fake_llm)
    result = workflow.run(input_doc)

    assert result.status == "approved"
    assert "quality criteria" in result.reasoning
```
|
|
94
|
+
|
|
95
|
+
**Mocking boundaries:**
|
|
96
|
+
|
|
97
|
+
- Mocks are ONLY for unit tests (required). Integration tests SHOULD use real LLM providers when implemented (see [agentme-edr-018](018-ai-agent-development-standards.md) rule `13-three-tier-testing-strategy`).
|
|
98
|
+
- Evals MUST use real LLM providers to capture model drift and are REQUIRED before every release (see [agentme-edr-018](018-ai-agent-development-standards.md) rule `04-dataset-driven-accuracy-measurement`).
|
|
99
|
+
- Mock the LLM provider at the construction boundary (dependency injection), not by patching internal LangChain methods.
|
|
100
|
+
- Test files MUST follow the naming convention `<name>_test.py` for unit tests (see [agentme-edr-018](018-ai-agent-development-standards.md) rule `13-three-tier-testing-strategy` for integration test naming).
|
|
101
|
+
|
|
102
|
+
When workflows or agents require injectable LLM instances, accept the LLM as a constructor parameter or configuration field:
|
|
103
|
+
|
|
104
|
+
```python
from typing import Optional

from langchain_core.language_models import BaseChatModel
from langchain_openai import ChatOpenAI


class DocumentWorkflow:
    def __init__(self, llm: Optional[BaseChatModel] = None):
        self.llm = llm or ChatOpenAI(model="gpt-4")
```
|
|
109
|
+
|
|
110
|
+
This allows unit tests to inject `FakeListChatModel` while production code uses the real provider.
|
|
111
|
+
|
|
112
|
+
## References
|
|
113
|
+
|
|
114
|
+
- [agentme-edr-018](018-ai-agent-development-standards.md) — Agent and Workflow implementation standards (LangGraph, deepagents, MLflow run-level tracking)
|
|
115
|
+
- [agentme-edr-004](../principles/004-unit-test-requirements.md) — Unit test requirements including external API mocking guidance
|
|
116
|
+
- [agentme-edr-014](014-python-project-tooling.md) — Python project tooling and structure
|
|
@@ -31,8 +31,9 @@ Language and framework-specific tooling and project structure.
|
|
|
31
31
|
- [agentme-edr-010](application/010-golang-project-tooling.md) - **Go project tooling and structure** - Scaffold Go CLIs and libraries with the standard layout *(includes skill: [003-create-golang-project](application/skills/003-create-golang-project/SKILL.md))*
|
|
32
32
|
- [agentme-edr-014](application/014-python-project-tooling.md) - **Python project tooling and structure** - Scaffold Python packages and CLIs with the standard layout *(includes skill: [005-create-python-project](application/skills/005-create-python-project/SKILL.md))*
|
|
33
33
|
- [agentme-edr-015](application/015-cli-tool-standards.md) - **CLI tool standards** - Define command UX and behavior for CLI tools
|
|
34
|
-
- [agentme-edr-018](application/018-ai-agent-development-standards.md) - **AI agent development standards** - Standard toolchain, framework, evaluation, and workflow patterns for AI agent projects built with Python
|
|
34
|
+
- [agentme-edr-018](application/018-ai-agent-development-standards.md) - **AI agent development standards** - Standard toolchain, framework, evaluation, and workflow patterns for AI agent and LangGraph workflow projects built with Python
|
|
35
35
|
- [agentme-edr-019](application/019-ml-dataset-structure.md) - **ML dataset structure** - Standard folder layout and file conventions for ML datasets
|
|
36
|
+
- [agentme-edr-024](application/024-llm-development-standards.md) - **LLM development standards** - Default framework (LangChain), provider compatibility, observability, and conceptual model (LLM / Agent / Workflow) for LLM-based applications
|
|
36
37
|
- [agentme-edr-020](application/020-ai-agent-xdrs-knowledge-layer.md) - **AI agent XDRS knowledge layer** - How to integrate XDRS as the runtime source of truth for policies and skills in AI agents (apply only when the project explicitly uses XDRS)
|
|
37
38
|
- [agentme-edr-021](application/021-pragmatic-hexagonal-architecture.md) - **Pragmatic hexagonal architecture** - Organize application layers as External/Adapters/Application with practical coupling rules
|
|
38
39
|
- [004-select-relevant-xdrs](application/skills/004-select-relevant-xdrs/SKILL.md) - **Select relevant XDRs**
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: agentme-edr-policy-007-project-quality-standards
|
|
3
|
-
description: Defines minimum project quality standards for README onboarding, testing, linting, XDR compliance, and runnable examples. Use when scaffolding or reviewing projects.
|
|
3
|
+
description: Defines minimum project quality standards for README onboarding, testing (unit and integration), linting, XDR compliance, and runnable examples. Use when scaffolding or reviewing projects.
|
|
4
4
|
apply-to: All projects
|
|
5
5
|
valid-from: 2026-05-25
|
|
6
6
|
---
|
|
@@ -15,7 +15,7 @@ What minimum quality standards must every project in the organization meet to en
|
|
|
15
15
|
|
|
16
16
|
## Decision Outcome
|
|
17
17
|
|
|
18
|
-
Every project must meet
|
|
18
|
+
Every project must meet the minimum quality standards: a Getting Started section in its README, unit tests that run on every release, compliance with workspace XDRs, active linting enforcement, a structure that is clear to new developers, and — for libraries and utilities — a runnable examples folder verified on every test run. Integration tests are advised but not required. Projects with statistical models must have evaluation targets with performance thresholds.
|
|
19
19
|
|
|
20
20
|
These standards form a non-negotiable baseline. Individual projects may raise the bar but must never fall below it.
|
|
21
21
|
|
|
@@ -189,3 +189,44 @@ eval:
|
|
|
189
189
|
--min-f1 $(EVAL_MIN_F1) \
|
|
190
190
|
|| (echo "Evaluation failed: metrics below threshold"; exit 1)
|
|
191
191
|
```
|
|
192
|
+
|
|
193
|
+
---
|
|
194
|
+
|
|
195
|
+
#### 08-integration-tests-are-advised
|
|
196
|
+
|
|
197
|
+
Integration tests verify end-to-end system behavior with real external dependencies (databases, APIs, message queues, file systems, cloud services). While not required, integration tests are strongly advised for projects that interact with external systems.
|
|
198
|
+
|
|
199
|
+
**When to implement integration tests:**
|
|
200
|
+
|
|
201
|
+
- The project makes calls to external APIs or services
|
|
202
|
+
- The project reads from or writes to databases
|
|
203
|
+
- The project integrates with third-party systems (payment processors, authentication providers, etc.)
|
|
204
|
+
- The project's behavior depends on the interaction between multiple components or services
|
|
205
|
+
- Unit tests alone cannot adequately verify system integration points
|
|
206
|
+
|
|
207
|
+
**Integration test guidelines (when implemented):**
|
|
208
|
+
|
|
209
|
+
- SHOULD verify end-to-end execution with real external dependencies or containerized equivalents (e.g., postgres in Docker, localstack for AWS services)
|
|
210
|
+
- SHOULD use pass/fail assertions to validate expected behavior and error handling
|
|
211
|
+
- SHOULD be isolated from unit tests to allow independent execution
|
|
212
|
+
- Files SHOULD be named with a clear integration test suffix (e.g., `<name>_integration_test.py`, `<name>.integration.test.ts`)
|
|
213
|
+
- MAY be run less frequently than unit tests (e.g., nightly, before releases) to manage execution time and external dependency costs
|
|
214
|
+
- MAY use smaller or cheaper configurations of external services when available (e.g., smaller database instances, development-tier API keys)
|
|
215
|
+
|
|
216
|
+
**Makefile target:**
|
|
217
|
+
|
|
218
|
+
When integration tests exist, provide a dedicated `make test-integration` target:
|
|
219
|
+
|
|
220
|
+
```makefile
test: test-unit

test-unit:
	# Run fast offline unit tests
	pytest lib/src/

test-integration:
	# Run integration tests with real dependencies
	pytest lib/src/ -m integration
```
|
|
231
|
+
|
|
232
|
+
Projects are not required to implement integration tests, but when present, they SHOULD follow these conventions for consistency across the codebase.
|