agentme 0.14.0 → 0.15.0

Files changed (20)
  1. package/.filedist-package.yml +1 -1
  2. package/.xdrs/agentme/edrs/application/003-javascript-project-tooling.md +3 -3
  3. package/.xdrs/agentme/edrs/application/010-golang-project-tooling.md +3 -3
  4. package/.xdrs/agentme/edrs/application/014-python-project-tooling.md +2 -2
  5. package/.xdrs/agentme/edrs/application/015-cli-tool-standards.md +2 -2
  6. package/.xdrs/agentme/edrs/application/018-ai-llm-development-standards.md +180 -0
  7. package/.xdrs/agentme/edrs/application/019-ai-agents-development-standards.md +284 -0
  8. package/.xdrs/agentme/edrs/application/020-ai-workflow-development-standards.md +261 -0
  9. package/.xdrs/agentme/edrs/application/021-ai-eval-standards.md +90 -0
  10. package/.xdrs/agentme/edrs/application/{019-ml-dataset-structure.md → 024-ml-dataset-structure.md} +2 -2
  11. package/.xdrs/agentme/edrs/application/{020-ai-agent-xdrs-knowledge-layer.md → 025-ai-agent-xdrs-knowledge-layer.md} +4 -4
  12. package/.xdrs/agentme/edrs/application/{021-pragmatic-hexagonal-architecture.md → 026-pragmatic-hexagonal-architecture.md} +2 -2
  13. package/.xdrs/agentme/edrs/application/skills/001-create-javascript-project/SKILL.md +2 -2
  14. package/.xdrs/agentme/edrs/application/skills/003-create-golang-project/SKILL.md +2 -2
  15. package/.xdrs/agentme/edrs/application/skills/005-create-python-project/SKILL.md +3 -3
  16. package/.xdrs/agentme/edrs/index.md +7 -5
  17. package/.xdrs/agentme/edrs/principles/007-project-quality-standards.md +29 -1
  18. package/package.json +3 -3
  19. package/.xdrs/agentme/edrs/application/018-ai-agent-development-standards.md +0 -309
  20. package/.xdrs/agentme/edrs/application/024-llm-development-standards.md +0 -116
package/.xdrs/agentme/edrs/application/018-ai-agent-development-standards.md +0 -309
@@ -1,309 +0,0 @@
1
- ---
2
- name: agentme-edr-policy-018-ai-agent-development-standards
3
- description: Defines the standard toolchain, framework, evaluation approach, testing strategy, and workflow patterns for building AI agents (tool-invocation loops) and LangGraph workflows in Python. Use when scaffolding, reviewing, or extending AI agent or workflow projects. For raw LLM calls and provider configuration, see agentme-edr-024.
4
- apply-to: AI agent and LangGraph workflow projects built with Python
5
- valid-from: 2026-05-26
6
- ---
7
-
8
- # agentme-edr-policy-018: AI agent development standards
9
-
10
- ## Context and Problem Statement
11
-
12
- AI agent projects vary widely in how they choose frameworks, manage context, evaluate outputs, and structure workflows. Without a shared baseline, projects accumulate incompatible patterns for LLM provider abstraction, flow design, and dataset-driven testing.
13
-
14
- Which tools, frameworks, and design patterns should AI agent projects follow to ensure reproducibility, testability, and maintainability?
15
-
16
- ## Decision Outcome
17
-
18
- **Use Python with LangGraph for flow orchestration and MLflow for experiment tracking and local evaluation.**
19
-
20
- This policy covers the **Agent** and **Workflow** tiers of the three-tier conceptual model defined in [agentme-edr-024](024-llm-development-standards.md). For the definition of LLM, Agent, and Workflow, and for the LangChain framework rules that govern direct LLM calls, see [agentme-edr-024](024-llm-development-standards.md).
21
-
22
- ### Details
23
-
24
- #### 01-language-and-framework
25
-
26
- All agent and workflow projects MUST be implemented in Python, following [agentme-edr-014](014-python-project-tooling.md) for project structure, tooling, and Makefile conventions.
27
-
28
- Agent flows MUST be built with **LangGraph**. Use LangGraph `StateGraph` to model each distinct workflow as an explicit directed graph with typed state.
29
-
30
- For all direct LLM calls within agent and workflow nodes, use LangChain per [agentme-edr-024](024-llm-development-standards.md).
31
-
32
- #### 03-observability-and-experiment-tracking
33
-
34
- Use **MLflow** for all agent and workflow observability and evaluation:
35
-
36
- - Wrap each agent or workflow run with `mlflow.start_run()` to capture traces, parameters, and metrics locally.
37
- - Log run parameters (model name, temperature, prompt version) and output metrics (accuracy, latency, token counts) using `mlflow.log_param` / `mlflow.log_metric`.
38
- - Run a local MLflow tracking server with `mlflow ui` to inspect runs during development. Do not require a remote MLflow server for local development.
39
- - For LangChain-level auto-tracing of individual LLM calls, see [agentme-edr-024](024-llm-development-standards.md) rule `03-llm-observability`.
40
-
41
- #### 04-dataset-driven-accuracy-measurement
42
-
43
- Every agent pipeline MUST have a companion evaluation dataset and an MLflow experiment that measures accuracy against it. Datasets and evals are organized per-workflow following rule `07-workflow-structure` and rule `08-workflow-evals`.
44
-
45
- - **Evals** measure model accuracy and performance against expected outputs. They are REQUIRED before every release to verify the workflow meets specified accuracy thresholds. They run against real LLM providers to capture model drift. They log metrics to MLflow and MUST have project-defined quality thresholds that block releases when not met.
46
- - **Integration tests** verify that workflows execute end-to-end with real connectors and real models, using pass/fail assertions. They are ADVISED but not required. They validate wiring, error handling, and system integration, not model accuracy. See rule `13-three-tier-testing-strategy` for integration test guidelines.
47
-
48
- **Eval requirements:**
49
-
50
- - Store evaluation datasets under `evals/<workflow>/` (sibling of `lib/` and `examples/`), following [agentme-edr-019](019-ml-dataset-structure.md) for structure and format. For MLflow input/output pairs, use the JSONL format described in `agentme-edr-019.04-complex-structured-datasets-must-use-jsonl`.
51
- - Write evaluation scripts under `evals/<workflow>/` that load the dataset, run each input through the live agent (against real LLMs, not mocks), compare outputs to expected values, and log per-sample and aggregate metrics to an MLflow experiment.
52
- - Add a `make eval` Makefile target in the module root Makefile (the same Makefile that sits alongside `lib/` and `examples/`) that delegates to all per-workflow eval targets.
53
- - Evaluation MUST run against real LLM providers, not recorded responses, to capture model drift. MLflow tracking MUST work locally without a remote server.
54
- - Evals MUST be executed before every release. Failed eval runs with accuracy below project-defined thresholds MUST block the release.
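The release gate described in the eval requirements above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `ACCURACY_THRESHOLD` value, the `exact_match_accuracy` and `release_gate` helpers, and the sample records are hypothetical, and both the live LLM calls and the MLflow logging are left out so the sketch stays self-contained.

```python
import json

ACCURACY_THRESHOLD = 0.8  # hypothetical project-defined threshold


def exact_match_accuracy(records):
    """Return the fraction of records whose output equals the expected value."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["output"] == r["expected"])
    return hits / len(records)


def release_gate(records, threshold=ACCURACY_THRESHOLD):
    """Return a process exit code: 0 when the threshold is met, 1 otherwise."""
    return 0 if exact_match_accuracy(records) >= threshold else 1


# JSONL-style sample, one input/expected/output record per line (made-up data).
sample_jsonl = """\
{"input": "doc-1", "expected": "APPROVE", "output": "APPROVE"}
{"input": "doc-2", "expected": "REJECT", "output": "APPROVE"}
"""
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
```

An eval script could pass the gate's return value to `sys.exit`, so that a `make eval` run fails when accuracy falls below the threshold.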
55
-
56
- #### 05-flow-documentation
57
-
58
- Each agent flow MUST be documented as a **Mermaid graph** in the project `README.md`. The diagram must match the LangGraph `StateGraph` definition:
59
-
60
- - Use `graph TD` or `graph LR` direction.
61
- - Label each node with its Python function name.
62
- - Label conditional edges with the condition expression.
63
- - Update the diagram whenever the graph topology changes.
64
-
65
- Example minimal diagram block:
66
-
67
- ```mermaid
68
- graph TD
69
- A[fetch_context] --> B[draft_response]
70
- B --> C{verify}
71
- C -->|pass| D[output]
72
- C -->|fail| B
73
- ```
74
-
75
- #### 06-verification-steps
76
-
77
- Agent flows MUST include at least one explicit verification node before producing final output:
78
-
79
- - Model the verification step as a dedicated LangGraph node (e.g. `verify_output`).
80
- - The node checks the draft output against defined acceptance criteria (schema validation, factual consistency check, rubric scoring, or LLM-as-judge call).
81
- - On failure, the verification node MUST route back to the relevant generation node, not silently pass through.
82
- - Log verification results (pass/fail, score, reason) as MLflow metrics on the current run.
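The verification rule above could look roughly like the sketch below, written as plain Python functions. The acceptance criterion (non-empty, at most 500 characters), the `MAX_ATTEMPTS` cap, and the `fail` target are assumptions made for illustration; the node names `draft_response` and `output` are borrowed from the example Mermaid diagram. In a LangGraph project, a function like `route_after_verify` would typically be the routing callable handed to `add_conditional_edges`.

```python
MAX_ATTEMPTS = 3  # hypothetical retry cap, not mandated by the policy


def verify_output(state):
    """Check the draft against a simple acceptance criterion and record the result."""
    draft = state.get("draft", "")
    passed = bool(draft.strip()) and len(draft) <= 500
    return {
        **state,
        "verified": passed,
        "attempts": state.get("attempts", 0) + 1,
        "reason": "ok" if passed else "draft empty or too long",
    }


def route_after_verify(state):
    """Conditional edge: return the name of the next node."""
    if state["verified"]:
        return "output"
    if state["attempts"] >= MAX_ATTEMPTS:
        return "fail"  # give up explicitly instead of passing through silently
    return "draft_response"  # route back to the generation node
```

The `verified` and `reason` fields are what a real node would also log as MLflow metrics on the current run.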
83
-
84
- #### 07-workflow-structure
85
-
86
- Agent logic MUST be organized as named workflows following [agentme-edr-021](021-pragmatic-hexagonal-architecture.md). Each workflow is an independent LangGraph `StateGraph` with a defined start node and end node, connecting agents, states, routes, and decision nodes.
87
-
88
- Workflows live inside `app/workflows/` (the application layer), while external integrations such as LLM providers, vector stores, and third-party APIs live under `adapters/connectors/` (the outbound adapter layer). Inbound interfaces (HTTP API, CLI) live under `adapters/` as inbound adapters.
89
-
90
- For each workflow named `<workflow>`, the full project layout is:
91
-
92
- ```text
93
- lib/src/<package_name>/
94
-   adapters/
95
-     http/              # inbound: API server that triggers workflows
96
-     cli/               # inbound: CLI entry point (if applicable)
97
-     connectors/        # outbound: external resource integrations
98
-       openai/          # LLM provider connector
99
-       azure-openai/    # alternative LLM provider connector
100
-       postgres/        # database connector (if applicable)
101
-       vector-store/    # vector DB connector (if applicable)
102
-   app/
103
-     workflows/
104
-       <workflow>/
105
-         graph.py       # StateGraph definition; entry point for the workflow
106
-         agents.py      # LangChain agent definitions used by this workflow
107
-         states.py      # Typed state dataclasses / TypedDicts
108
-         routes.py      # Conditional edge functions
109
-   shared/              # infrastructure-agnostic utilities
110
- ```
111
-
112
- - `app/workflows/<workflow>/graph.py` MUST define and compile the `StateGraph` and expose a `graph` object that callers invoke.
113
- - Tool calls within workflow nodes that interact with external systems MUST use connectors from `adapters/connectors/`, not inline API calls.
114
- - Additional modules (prompts, schemas) MAY be added inside `app/workflows/<workflow>/` when they are specific to that workflow. Shared utilities belong in `shared/`.
115
- - Each workflow MUST be documented with a Mermaid diagram in the project `README.md` following rule `05-flow-documentation`.
116
-
117
- #### 08-workflow-evals
118
-
119
- For each workflow `<workflow>` there MUST be a corresponding eval directory:
120
-
121
- ```text
122
- evals/
123
-   <workflow>/
124
-     Makefile            # eval targets for this workflow
125
-     dataset_<slice>/    # one folder per eval slice (see agentme-edr-019)
126
-     eval_<slice>.py     # evaluation script for each slice
127
- ```
128
-
129
- The `evals/<workflow>/Makefile` MUST define:
130
-
131
- | Target | Behaviour |
132
- |---|---|
133
- | `eval` | Runs all eval slices for the workflow |
134
- | `eval-<slice>` | Runs one named slice (e.g. `eval-simple`, `eval-complex`) |
135
-
136
- Each `eval_<slice>.py` script MUST:
137
-
138
- - Load the dataset from `evals/<workflow>/dataset_<slice>/` following [agentme-edr-019](019-ml-dataset-structure.md).
139
- - Run every input through the live workflow against real LLMs.
140
- - Log per-sample and aggregate metrics to an MLflow experiment that runs locally.
141
-
142
- The module root Makefile `make eval` target MUST delegate to `eval` in every `evals/<workflow>/Makefile`.
143
-
144
- #### 09-node-naming-conventions
145
-
146
- LangGraph node names MUST follow a suffix convention that communicates the node's role at a glance. Names MUST be action-oriented and descriptive.
147
-
148
- | Suffix | Node type | When to use |
149
- |---|---|---|
150
- | `_llm` | LLM call | Any node whose primary action is a direct LLM inference call |
151
- | `_step` | Algorithmic step | Deterministic logic with no LLM involvement (transformation, validation, routing) |
152
- | `_tool` | Tool/API call | A node that wraps a single external tool or API (e.g. a REST endpoint, DB query) |
153
- | `_agent` | Subgraph agent | A node that invokes a nested subgraph containing its own tool-invocation cycle and LLM calls; prefer the **deepagents** library for these nodes |
154
-
155
- The Python function implementing the node SHOULD share the same name as the node alias passed to `add_node`, so that graph definitions and stack traces remain unambiguous:
156
-
157
- ```python
158
- def draft_doc_llm(state): ...
159
- graph.add_node("draft_doc_llm", draft_doc_llm)
160
-
161
- # Tool node — calls the Stripe API
162
- def stripe_api_tool(state): ...
163
- graph.add_node("stripe_api_tool", stripe_api_tool)
164
- ```
165
-
166
- Names MUST NOT use generic labels such as `node1`, `process`, or `run`. Each name must clearly express what action the node performs.
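One way to enforce the suffix convention above is a small check run in a unit test or lint step. The sketch below is an assumption-laden illustration: `node_name_problems` is a hypothetical helper, and the generic-name list contains only the three examples the rule itself names.

```python
ALLOWED_SUFFIXES = ("_llm", "_step", "_tool", "_agent")
GENERIC_NAMES = {"node1", "process", "run"}  # examples named in the rule


def node_name_problems(name):
    """Return a list of convention violations for a node name (empty if none)."""
    problems = []
    suffix = next((s for s in ALLOWED_SUFFIXES if name.endswith(s)), None)
    if suffix is None:
        problems.append("missing role suffix")
        stem = name
    else:
        stem = name[: -len(suffix)]
    # The part before the suffix must describe an action, not a generic label.
    if not stem or stem in GENERIC_NAMES:
        problems.append("generic or empty action name")
    return problems
```

For example, `node_name_problems("draft_doc_llm")` returns an empty list, while `node_name_problems("run_step")` reports a generic action name.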
167
-
168
- #### 10-local-sandbox
169
-
170
- When a workflow node or tool requires a **local sandbox** — an isolated environment where the agent can read files, glob-search directories, and execute shell commands — use the **[deepagents](https://github.com/deepagents/deepagents) framework** to provide that sandbox.
171
-
172
- **When to apply this rule**
173
-
174
- Use deepagents whenever ANY of the following is true for a workflow or tool:
175
- - The agent needs to execute shell commands or scripts in a controlled environment.
176
- - The agent needs to list, read, or search files across multiple directories at runtime.
177
- - The agent operates on user-supplied or generated file trees that must not escape a sandboxed boundary.
178
-
179
- **Integration requirements**
180
-
181
- - Initialize the sandbox at the start of the workflow run and shut it down in the same `try/finally` block.
182
- - Pass the sandbox handle into the LangGraph workflow state so all nodes share the same sandbox instance.
183
- - If the host-side code needs to pass files into the sandbox (e.g. generated config or input data), create a temporary directory with `tempfile.mkdtemp()`, write the files there, and mount it into the sandbox. Clean it up in the `finally` block.
184
- - Replace hand-rolled `read_file`, `search_files`, and `grep_file` tool implementations with the equivalent tools provided by deepagents.
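The lifecycle requirements above (create, share through state, always clean up) can be sketched with the standard library alone. `FakeSandbox` below is a stand-in written for this illustration and does NOT reflect the deepagents API; `run_with_sandbox` and its parameters are likewise hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


class FakeSandbox:
    """Stand-in for a real sandbox handle; NOT the deepagents API."""

    def __init__(self, mount_dir):
        self.mount_dir = mount_dir
        self.closed = False

    def read_file(self, name):
        return (Path(self.mount_dir) / name).read_text()

    def close(self):
        self.closed = True


def run_with_sandbox(input_files, workflow):
    """Create the temp dir and sandbox, run the workflow, and always clean up."""
    tmp_dir = tempfile.mkdtemp()
    sandbox = None
    try:
        # Host-side files are written to the temp dir that the sandbox mounts.
        for name, content in input_files.items():
            (Path(tmp_dir) / name).write_text(content)
        sandbox = FakeSandbox(mount_dir=tmp_dir)
        # The handle travels in the workflow state so every node shares it.
        return workflow({"sandbox": sandbox})
    finally:
        if sandbox is not None:
            sandbox.close()
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Keeping shutdown and directory removal in one `finally` block means a failing node cannot leak either the sandbox or the temporary files.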
185
-
186
- #### 11-state-type-conventions
187
-
188
- All TypedDict and dataclass types that represent LangGraph node or workflow state MUST end with `_state` in their name. This suffix signals at a glance that the type is a state boundary, not a plain data model.
189
-
190
- **Naming reference:**
191
-
192
- | Owner | Naming pattern | Example |
193
- |---|---|---|
194
- | Single agent / agent subgraph | `<agent_name>_agent_state` | `reviewer_agent_state` |
195
- | Full workflow (`StateGraph`) | `<workflow_name>_workflow_state` | `document_workflow_state` |
196
- | Named group of nodes sharing state | `<group_responsibility>_state` | `retrieval_pipeline_state` |
197
-
198
- **Boundary rules:**
199
-
200
- - Each agent or agent subgraph MUST define its own dedicated state type. Do not reuse or extend a generic state across unrelated agents.
201
- - Each workflow (`StateGraph`) MUST define its own top-level state type. The workflow state is the authoritative boundary for that graph's inputs and outputs.
202
- - When a group of nodes (not a full workflow and not a single agent) shares a state type, the type name MUST clearly reflect the shared responsibility. Generic names such as `shared_state`, `common_state`, or `global_state` are FORBIDDEN.
203
- - Large workflows MUST NOT use a single monolithic state that all nodes read and write. Split the state into per-phase or per-agent state types scoped to the subgraph or set of nodes that produce or consume each field.
204
-
205
- State type names SHOULD align with the agent or node names defined in rule `09-node-naming-conventions` (e.g., an agent node named `draft_doc_agent` has a state type named `draft_doc_agent_state`).
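As a rough illustration of the naming and boundary rules above, the sketch below defines one agent-level and one workflow-level state type. The field names and the `is_valid_state_name` helper are hypothetical; only the `_state` suffix and the forbidden generic names come from the rule text.

```python
from typing import List, TypedDict

FORBIDDEN_STATE_NAMES = {"shared_state", "common_state", "global_state"}


class draft_doc_agent_state(TypedDict):
    """State owned by a hypothetical `draft_doc_agent` node."""

    draft: str
    revision_count: int


class document_workflow_state(TypedDict):
    """Top-level state for a hypothetical document workflow graph."""

    document: str
    review_notes: List[str]


def is_valid_state_name(name):
    """Check the `_state` suffix and reject the forbidden generic names."""
    return name.endswith("_state") and name not in FORBIDDEN_STATE_NAMES
```

Each type is scoped to the agent or graph that owns it, so no single monolithic state is shared across unrelated nodes.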
206
-
207
- #### 12-workflow-naming-conventions
208
-
209
- LangGraph `StateGraph` instances and their enclosing classes MUST be given a meaningful name that conveys the workflow's input, output, and/or behavior. The name MUST end with `Workflow` (PascalCase class) or `_workflow` (snake_case variable or directory).
210
-
211
- Choose a name that summarises what the workflow consumes, processes, and produces — avoid generic labels such as `Pipeline`, `Flow`, `Graph`, or `Process`.
212
-
213
- | Context | Pattern | Example |
214
- |---|---|---|
215
- | Python class | `<DescriptiveName>Workflow` | `FileMapJudgeReduceWorkflow` |
216
- | Python variable / instance | `<descriptive_name>_workflow` | `file_map_judge_reduce_workflow` |
217
- | Directory under `app/workflows/` | `<descriptive_name>_workflow` | `financial_report_analysis_workflow/` |
218
-
219
- **Good names** communicate purpose at a glance:
220
-
221
- - `FileMapJudgeReduceWorkflow` — maps files, judges each, then reduces results
222
- - `FinancialReportAnalysisWorkflow` — analyses financial report inputs
223
- - `MarketingCampaignExecutorWorkflow` — executes a marketing campaign end-to-end
224
-
225
- **Bad names** (FORBIDDEN): `MainWorkflow`, `AgentGraph`, `ProcessFlow`, `Workflow1`, `RunGraph`.
226
-
227
- #### 13-three-tier-testing-strategy
228
-
229
- AI agent and workflow projects recognize three distinct testing tiers, each with its own purpose, tooling, and execution model:
230
-
231
- | Tier | Purpose | Model source | External APIs | File naming | When to run | Required |
232
- |---|---|---|---|---|---|---|
233
- | **Unit tests** | Test workflow logic, routing, and state management in isolation | Mocked (FakeListChatModel) | Mocked or faked | `<name>_test.py` | On every commit | **Required** |
234
- | **Integration tests** | Verify end-to-end wiring with real models and real external connectors | Real LLM providers | Real connectors | `<name>_integration_test.py` | Before releases or on a schedule | Advised |
235
- | **Evals** | Measure model accuracy and performance against expected outputs | Real LLM providers | Real connectors | `eval_<slice>.py` (see rule 08) | Before every release | **Required** |
236
-
237
- **Unit test requirements (REQUIRED):**
238
-
239
- - MUST use mocked LLM providers (see [agentme-edr-024](024-llm-development-standards.md) rule `04-unit-test-mocking`).
240
- - MUST run offline with no external dependencies per [agentme-edr-004](../principles/004-unit-test-requirements.md) rule `02-must-run-offline`.
241
- - MUST achieve 80% code coverage per [agentme-edr-004](../principles/004-unit-test-requirements.md) rule `03-must-maintain-80-percent-coverage`.
242
- - MUST test workflow routing logic, conditional edges, state transformations, and error handling.
243
- - Files MUST be named `<name>_test.py` and placed alongside the source file per [agentme-edr-004](../principles/004-unit-test-requirements.md) rule `04-must-place-test-files-alongside-source`.
244
- - MUST achieve 80% coverage of workflow graph edges and branches per rule `14-workflow-graph-coverage`.
245
-
246
- **Integration test guidelines (ADVISED):**
247
-
248
- Integration tests are advised but not required. See [agentme-edr-007](../principles/007-project-quality-standards.md) rule `08-integration-tests-are-advised` for general integration testing guidelines.
249
-
250
- For AI agent and workflow projects specifically, when integration tests are implemented:
251
-
252
- - SHOULD use real LLM providers configured via environment variables
253
- - MAY use smaller, faster models (e.g., `gpt-3.5-turbo`) instead of production models to reduce cost and latency
254
- - SHOULD verify workflow execution with pass/fail assertions, not accuracy metrics (accuracy is measured by evals)
255
-
256
- **Eval requirements (REQUIRED):**
257
-
258
- Evals are REQUIRED for all agent and workflow projects. See rule `04-dataset-driven-accuracy-measurement` and rule `08-workflow-evals` for full requirements.
259
-
260
- - Evals MUST run against real LLM providers to capture model drift and log metrics to MLflow for tracking model performance over time.
261
- - Evals MUST be executed before every release to verify the workflow meets specified accuracy thresholds.
262
- - Failed eval runs with accuracy below project-defined thresholds MUST block the release.
263
-
264
- #### 14-workflow-graph-coverage
265
-
266
- LangGraph workflows MUST achieve **80% coverage of workflow graph edges and branches** in unit tests.
267
-
268
- **Graph edge coverage** measures whether each transition (edge) in the `StateGraph` is exercised by at least one test. This ensures that routing logic, conditional edges, and error paths are tested.
269
-
270
- **Requirements:**
271
-
272
- - Every conditional edge (e.g., `add_conditional_edges`) MUST have test cases covering each possible branch.
273
- - Every node→node transition MUST be exercised by at least one test.
274
- - Error handling paths (e.g., nodes that route to error states or retry nodes) MUST be tested with inputs that trigger those paths.
275
- - Use mocked LLM responses in unit tests to control which branches are taken (see [agentme-edr-024](024-llm-development-standards.md) rule `04-unit-test-mocking`).
276
-
277
- **Example:**
278
-
279
- ```python
280
- def test_workflow_approval_path():
281
-     """Test the workflow takes the approval path when LLM returns APPROVE."""
282
-     fake_llm = FakeListChatModel(responses=["APPROVE"])
283
-     workflow = DocumentWorkflow(llm=fake_llm)
284
-
285
-     result = workflow.invoke({"document": sample_doc})
286
-
287
-     assert result["status"] == "approved"
288
-     assert "verify_approved" in result["_visited_nodes"]
289
-
290
- def test_workflow_rejection_path():
291
-     """Test the workflow takes the rejection path when LLM returns REJECT."""
292
-     fake_llm = FakeListChatModel(responses=["REJECT"])
293
-     workflow = DocumentWorkflow(llm=fake_llm)
294
-
295
-     result = workflow.invoke({"document": sample_doc})
296
-
297
-     assert result["status"] == "rejected"
298
-     assert "handle_rejection" in result["_visited_nodes"]
299
- ```
300
-
301
- When measuring coverage, use the same 80% threshold: at least 80% of graph edges must be covered by unit tests.
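The 80% edge-coverage threshold can be computed from the node paths that tests actually traversed. The sketch below is a minimal illustration: `edge_coverage` is a hypothetical helper, the edge list mirrors the example Mermaid diagram, and it assumes each test records the ordered list of visited nodes (for instance through a field such as `_visited_nodes`).

```python
def edge_coverage(graph_edges, visited_paths):
    """Return (coverage ratio, uncovered edges) for the given node paths."""
    all_edges = set(graph_edges)
    covered = set()
    for path in visited_paths:
        # Consecutive node pairs in a path are the transitions it exercised.
        covered.update(edge for edge in zip(path, path[1:]) if edge in all_edges)
    ratio = len(covered) / len(all_edges) if all_edges else 1.0
    return ratio, all_edges - covered


EDGES = [
    ("fetch_context", "draft_response"),
    ("draft_response", "verify"),
    ("verify", "output"),
    ("verify", "draft_response"),
]
# Only the happy path was tested, so the retry edge stays uncovered.
paths = [["fetch_context", "draft_response", "verify", "output"]]
ratio, missing = edge_coverage(EDGES, paths)
```

Here three of the four edges are covered (75%), so the suite would fall short of the threshold until a test exercises the `verify` to `draft_response` retry edge.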
302
-
303
- ## References
304
-
305
- - [agentme-edr-024](024-llm-development-standards.md) — LLM development standards: LangChain framework, provider compatibility, LLM observability, unit test mocking, and the LLM / Agent / Workflow conceptual model
306
- - [agentme-edr-004](../principles/004-unit-test-requirements.md) — Unit test requirements including coverage, offline execution, and test file placement
307
- - [agentme-edr-021](021-pragmatic-hexagonal-architecture.md) — Adapter/application layer separation that defines the project layout
308
- - [agentme-edr-014](014-python-project-tooling.md) — Python project tooling and structure
309
- - [agentme-edr-019](019-ml-dataset-structure.md) — ML dataset structure for eval datasets
package/.xdrs/agentme/edrs/application/024-llm-development-standards.md +0 -116
@@ -1,116 +0,0 @@
1
- ---
2
- name: agentme-edr-policy-024-llm-development-standards
3
- description: Defines the standard framework, provider compatibility, observability approach, and unit testing patterns for LLM-based applications in Python. Use when building, reviewing, or scaffolding any code that makes direct LLM calls, manages prompt context, or handles conversation history.
4
- apply-to: Python projects that make LLM calls, manage prompt context, or handle conversation threads
5
- valid-from: 2026-06-03
6
- ---
7
-
8
- # agentme-edr-policy-024: LLM development standards
9
-
10
- ## Context and Problem Statement
11
-
12
- LLM-based applications can be built at different levels of abstraction — from a single prompt call to a full autonomous agent or a complex multi-step workflow. Without a shared vocabulary and a prescribed framework, projects mix incompatible patterns for invoking models, managing context, and tracing requests.
13
-
14
- Which framework should be used for LLM calls, how should providers be configured, and what is the canonical meaning of "LLM", "agent", and "workflow" in this codebase?
15
-
16
- ## Decision Outcome
17
-
18
- **Use LangChain as the standard framework for all direct LLM interactions. Adopt a strict three-tier conceptual model — LLM / Agent / Workflow — that maps each tier to a specific library.**
19
-
20
- ### Conceptual model
21
-
22
- Three distinct tiers of LLM-based computation are recognized in this project. Every component MUST be classified into exactly one tier:
23
-
24
- | Tier | What it is | Library |
25
- |---|---|---|
26
- | **LLM** | A request → response prompt exchange with a model. May include a conversation history or thread. No autonomous decision-making. | `langchain` / `langchain-openai` |
27
- | **Agent** | An LLM-based flow driven by a tool-invocation loop that the LLM itself plans and executes. The LLM decides which tools to call and when to stop. | `deepagents` |
28
- | **Workflow** | A directed graph of nodes that mixes LLM-based nodes (simple LLM calls or agentic loops) with deterministic algorithmic nodes. The graph topology is defined in code, not chosen by the LLM at runtime. | `langgraph` |
29
-
30
- These tiers nest: a Workflow may contain Agent nodes; an Agent uses LLM calls internally. The tier of a component is determined by its outermost controlling structure.
31
-
32
- See [agentme-edr-018](018-ai-agent-development-standards.md) for Agent and Workflow implementation standards.
33
-
34
- ### Details
35
-
36
- #### 01-conceptual-model
37
-
38
- Every component that interacts with an LLM MUST be classified as exactly one of the three tiers defined in the conceptual model table above: **LLM**, **Agent**, or **Workflow**.
39
-
40
- - Do NOT use the word "agent" to describe a component that only makes a single LLM call without a tool-invocation loop.
41
- - Do NOT use the word "workflow" to describe a component that is purely an LLM call with no graph structure.
42
- - When designing a new feature, identify the correct tier first. The tier determines which library and patterns apply (LangChain, deepagents, or LangGraph).
43
-
44
- #### 02-llm-framework
45
-
46
- All direct LLM calls MUST use **LangChain** via the `langchain` and `langchain-openai` packages.
47
-
48
- - Use `langchain-openai` as the provider integration layer. It supports both OpenAI and Azure OpenAI through environment variables, with no code changes required to switch.
49
- - Select the provider by setting `OPENAI_API_TYPE=azure` for Azure OpenAI or omitting it for OpenAI.
50
- - Never hardcode provider-specific URLs, deployment names, or API versions in code; inject them through environment variables or a configuration object.
51
-
52
- Minimum required environment variable surface:
53
-
54
- | Variable | Purpose |
55
- |---|---|
56
- | `OPENAI_API_KEY` | API key (both providers) |
57
- | `OPENAI_API_BASE` / `AZURE_OPENAI_ENDPOINT` | Endpoint (Azure only) |
58
- | `OPENAI_API_VERSION` | API version (Azure only) |
59
- | `AZURE_OPENAI_DEPLOYMENT` | Deployment/model name (Azure only) |
60
- | `OPENAI_MODEL` | Model name (OpenAI only) |
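The environment variable surface in the table above could be read into one configuration object along these lines. This is a sketch under assumptions: `llm_settings` and the keys of the returned dictionary are hypothetical, and the result would still need to be passed to whatever client constructor the project uses.

```python
import os


def llm_settings(env=None):
    """Collect provider settings from environment variables (or a mapping)."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_TYPE", "").lower() == "azure":
        return {
            "provider": "azure",
            "api_key": env["OPENAI_API_KEY"],
            # Either endpoint variable from the table is accepted.
            "endpoint": env.get("AZURE_OPENAI_ENDPOINT") or env["OPENAI_API_BASE"],
            "api_version": env["OPENAI_API_VERSION"],
            "deployment": env["AZURE_OPENAI_DEPLOYMENT"],
        }
    return {
        "provider": "openai",
        "api_key": env["OPENAI_API_KEY"],
        "model": env["OPENAI_MODEL"],
    }
```

Accepting an optional mapping keeps the function testable offline, and a missing required variable raises `KeyError` early instead of failing on the first LLM call.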
61
-
62
- LangChain chain or runnable definitions MUST be placed in `app/workflows/<workflow>/agents.py` (for workflow-scoped LLM calls) or in the appropriate application layer module. Do not inline LLM construction in adapter or CLI code.
63
-
64
- #### 03-llm-observability
65
-
66
- Enable LangChain auto-tracing at every application entry point by calling `mlflow.langchain.autolog()` during startup, before any LLM call is made.
67
-
68
- - This captures inputs, outputs, token counts, and latency for every LangChain chain or runnable automatically.
69
- - Pair with `mlflow.start_run()` at the workflow or agent level to group LLM traces under a named experiment run (see [agentme-edr-018](018-ai-agent-development-standards.md) for run-level MLflow instrumentation).
70
- - Do not disable autolog in production unless there is an explicit performance justification documented in the codebase.
71
-
72
- #### 04-unit-test-mocking
73
-
74
- LLM provider calls are external API calls and MUST be mocked in unit tests per [agentme-edr-004](../principles/004-unit-test-requirements.md) rule `06-should-avoid-mocks`. Mocking LLM providers enables offline test execution while testing workflow logic, routing, and state management.
75
-
76
- Use **LangChain's `FakeListChatModel`** or a custom `GenericFakeChatModel` wrapper for unit testing LLM calls:
77
-
78
- ```python
79
- from langchain_core.language_models.fake_chat_models import FakeListChatModel
80
-
81
- # Unit test with mocked LLM responses
82
- def test_document_workflow_routing():
83
- fake_llm = FakeListChatModel(responses=[
84
- "APPROVE",
85
- "The document meets all quality criteria."
86
- ])
87
-
88
- workflow = DocumentWorkflow(llm=fake_llm)
89
- result = workflow.run(input_doc)
90
-
91
- assert result.status == "approved"
92
- assert "quality criteria" in result.reasoning
93
- ```
94
-
95
- **Mocking boundaries:**
96
-
97
- - Mocks are ONLY for unit tests (required). Integration tests SHOULD use real LLM providers when implemented (see [agentme-edr-018](018-ai-agent-development-standards.md) rule `13-three-tier-testing-strategy`).
98
- - Evals MUST use real LLM providers to capture model drift and are REQUIRED before every release (see [agentme-edr-018](018-ai-agent-development-standards.md) rule `04-dataset-driven-accuracy-measurement`).
99
- - Mock the LLM provider at the construction boundary (dependency injection), not by patching internal LangChain methods.
100
- - Test files MUST follow the naming convention `<name>_test.py` for unit tests (see [agentme-edr-018](018-ai-agent-development-standards.md) rule `13-three-tier-testing-strategy` for integration test naming).
101
-
102
- When workflows or agents require injectable LLM instances, accept the LLM as a constructor parameter or configuration field:
103
-
104
- ```python
105
- class DocumentWorkflow:
106
- def __init__(self, llm: Optional[BaseChatModel] = None):
107
- self.llm = llm or ChatOpenAI(model="gpt-4")
108
- ```
109
-
110
- This allows unit tests to inject `FakeListChatModel` while production code uses the real provider.
111
-
112
- ## References
113
-
114
- - [agentme-edr-018](018-ai-agent-development-standards.md) — Agent and Workflow implementation standards (LangGraph, deepagents, MLflow run-level tracking)
115
- - [agentme-edr-004](../principles/004-unit-test-requirements.md) — Unit test requirements including external API mocking guidance
116
- - [agentme-edr-014](014-python-project-tooling.md) — Python project tooling and structure