maestro-bundle 1.3.1 → 1.4.0
- package/package.json +1 -1
- package/templates/bundle-ai-agents/skills/agent-orchestration/SKILL.md +107 -41
- package/templates/bundle-ai-agents/skills/agent-orchestration/references/graph-patterns.md +50 -0
- package/templates/bundle-ai-agents/skills/agent-orchestration/references/routing-strategies.md +47 -0
- package/templates/bundle-ai-agents/skills/api-design/SKILL.md +125 -16
- package/templates/bundle-ai-agents/skills/api-design/references/pydantic-patterns.md +72 -0
- package/templates/bundle-ai-agents/skills/api-design/references/rest-conventions.md +51 -0
- package/templates/bundle-ai-agents/skills/clean-architecture/SKILL.md +113 -21
- package/templates/bundle-ai-agents/skills/clean-architecture/references/dependency-injection.md +60 -0
- package/templates/bundle-ai-agents/skills/clean-architecture/references/layer-rules.md +56 -0
- package/templates/bundle-ai-agents/skills/context-engineering/SKILL.md +104 -36
- package/templates/bundle-ai-agents/skills/context-engineering/references/compression-techniques.md +76 -0
- package/templates/bundle-ai-agents/skills/context-engineering/references/context-budget-calculator.md +45 -0
- package/templates/bundle-ai-agents/skills/database-modeling/SKILL.md +146 -19
- package/templates/bundle-ai-agents/skills/database-modeling/references/index-strategies.md +48 -0
- package/templates/bundle-ai-agents/skills/database-modeling/references/naming-conventions.md +27 -0
- package/templates/bundle-ai-agents/skills/docker-containerization/SKILL.md +124 -15
- package/templates/bundle-ai-agents/skills/docker-containerization/references/compose-patterns.md +97 -0
- package/templates/bundle-ai-agents/skills/docker-containerization/references/dockerfile-checklist.md +37 -0
- package/templates/bundle-ai-agents/skills/eval-testing/SKILL.md +113 -25
- package/templates/bundle-ai-agents/skills/eval-testing/references/eval-types.md +52 -0
- package/templates/bundle-ai-agents/skills/eval-testing/references/golden-dataset-template.md +59 -0
- package/templates/bundle-ai-agents/skills/memory-management/SKILL.md +112 -28
- package/templates/bundle-ai-agents/skills/memory-management/references/memory-tiers.md +41 -0
- package/templates/bundle-ai-agents/skills/memory-management/references/namespace-conventions.md +41 -0
- package/templates/bundle-ai-agents/skills/prompt-engineering/SKILL.md +139 -47
- package/templates/bundle-ai-agents/skills/prompt-engineering/references/anti-patterns.md +59 -0
- package/templates/bundle-ai-agents/skills/prompt-engineering/references/prompt-templates.md +75 -0
- package/templates/bundle-ai-agents/skills/rag-pipeline/SKILL.md +104 -27
- package/templates/bundle-ai-agents/skills/rag-pipeline/references/chunking-strategies.md +27 -0
- package/templates/bundle-ai-agents/skills/rag-pipeline/references/embedding-models.md +31 -0
- package/templates/bundle-ai-agents/skills/rag-pipeline/references/rag-evaluation.md +39 -0
- package/templates/bundle-ai-agents/skills/testing-strategy/SKILL.md +127 -18
- package/templates/bundle-ai-agents/skills/testing-strategy/references/fixture-patterns.md +81 -0
- package/templates/bundle-ai-agents/skills/testing-strategy/references/naming-conventions.md +69 -0
- package/templates/bundle-base/skills/branch-strategy/SKILL.md +134 -21
- package/templates/bundle-base/skills/branch-strategy/references/branch-rules.md +40 -0
- package/templates/bundle-base/skills/code-review/SKILL.md +123 -38
- package/templates/bundle-base/skills/code-review/references/review-checklist.md +45 -0
- package/templates/bundle-base/skills/commit-pattern/SKILL.md +98 -39
- package/templates/bundle-base/skills/commit-pattern/references/conventional-commits.md +40 -0
- package/templates/bundle-data-pipeline/skills/data-preprocessing/SKILL.md +110 -19
- package/templates/bundle-data-pipeline/skills/data-preprocessing/references/pandas-cheatsheet.md +63 -0
- package/templates/bundle-data-pipeline/skills/data-preprocessing/references/pandera-schemas.md +44 -0
- package/templates/bundle-data-pipeline/skills/docker-containerization/SKILL.md +132 -16
- package/templates/bundle-data-pipeline/skills/docker-containerization/references/compose-patterns.md +82 -0
- package/templates/bundle-data-pipeline/skills/docker-containerization/references/dockerfile-best-practices.md +57 -0
- package/templates/bundle-data-pipeline/skills/feature-engineering/SKILL.md +143 -45
- package/templates/bundle-data-pipeline/skills/feature-engineering/references/encoding-guide.md +41 -0
- package/templates/bundle-data-pipeline/skills/feature-engineering/references/scaling-guide.md +38 -0
- package/templates/bundle-data-pipeline/skills/mlops-pipeline/SKILL.md +156 -37
- package/templates/bundle-data-pipeline/skills/mlops-pipeline/references/mlflow-commands.md +69 -0
- package/templates/bundle-data-pipeline/skills/model-training/SKILL.md +152 -33
- package/templates/bundle-data-pipeline/skills/model-training/references/evaluation-metrics.md +52 -0
- package/templates/bundle-data-pipeline/skills/model-training/references/model-selection-guide.md +41 -0
- package/templates/bundle-data-pipeline/skills/rag-pipeline/SKILL.md +127 -39
- package/templates/bundle-data-pipeline/skills/rag-pipeline/references/chunking-strategies.md +51 -0
- package/templates/bundle-data-pipeline/skills/rag-pipeline/references/embedding-models.md +49 -0
- package/templates/bundle-frontend-spa/skills/authentication/SKILL.md +196 -13
- package/templates/bundle-frontend-spa/skills/authentication/references/jwt-security.md +41 -0
- package/templates/bundle-frontend-spa/skills/component-design/SKILL.md +191 -41
- package/templates/bundle-frontend-spa/skills/component-design/references/accessibility-checklist.md +41 -0
- package/templates/bundle-frontend-spa/skills/component-design/references/tailwind-patterns.md +65 -0
- package/templates/bundle-frontend-spa/skills/e2e-testing/SKILL.md +241 -79
- package/templates/bundle-frontend-spa/skills/e2e-testing/references/playwright-selectors.md +66 -0
- package/templates/bundle-frontend-spa/skills/e2e-testing/references/test-patterns.md +82 -0
- package/templates/bundle-frontend-spa/skills/integration-api/SKILL.md +221 -31
- package/templates/bundle-frontend-spa/skills/integration-api/references/api-patterns.md +81 -0
- package/templates/bundle-frontend-spa/skills/react-patterns/SKILL.md +195 -70
- package/templates/bundle-frontend-spa/skills/react-patterns/references/component-checklist.md +22 -0
- package/templates/bundle-frontend-spa/skills/react-patterns/references/hook-patterns.md +63 -0
- package/templates/bundle-frontend-spa/skills/responsive-layout/SKILL.md +162 -22
- package/templates/bundle-frontend-spa/skills/responsive-layout/references/breakpoint-guide.md +63 -0
- package/templates/bundle-frontend-spa/skills/state-management/SKILL.md +158 -30
- package/templates/bundle-frontend-spa/skills/state-management/references/react-query-config.md +64 -0
- package/templates/bundle-frontend-spa/skills/state-management/references/state-patterns.md +78 -0
- package/templates/bundle-jhipster-microservices/skills/ci-cd-pipeline/SKILL.md +135 -45
- package/templates/bundle-jhipster-microservices/skills/ci-cd-pipeline/references/gitlab-ci-templates.md +93 -0
- package/templates/bundle-jhipster-microservices/skills/clean-architecture/SKILL.md +87 -21
- package/templates/bundle-jhipster-microservices/skills/clean-architecture/references/layer-rules.md +78 -0
- package/templates/bundle-jhipster-microservices/skills/ddd-tactical/SKILL.md +94 -25
- package/templates/bundle-jhipster-microservices/skills/ddd-tactical/references/ddd-patterns.md +48 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-angular/SKILL.md +63 -21
- package/templates/bundle-jhipster-microservices/skills/jhipster-angular/references/angular-microservices.md +40 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-angular/references/angular-structure.md +59 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-docker-k8s/SKILL.md +125 -91
- package/templates/bundle-jhipster-microservices/skills/jhipster-docker-k8s/references/docker-k8s-commands.md +68 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-entities/SKILL.md +72 -20
- package/templates/bundle-jhipster-microservices/skills/jhipster-entities/references/cross-service-entities.md +36 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-entities/references/jdl-types.md +56 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-gateway/SKILL.md +80 -8
- package/templates/bundle-jhipster-microservices/skills/jhipster-gateway/references/gateway-config.md +43 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-kafka/SKILL.md +115 -22
- package/templates/bundle-jhipster-microservices/skills/jhipster-kafka/references/kafka-events.md +39 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-registry/SKILL.md +92 -23
- package/templates/bundle-jhipster-microservices/skills/jhipster-registry/references/consul-config.md +61 -0
- package/templates/bundle-jhipster-microservices/skills/jhipster-service/SKILL.md +81 -18
- package/templates/bundle-jhipster-microservices/skills/jhipster-service/references/service-patterns.md +40 -0
- package/templates/bundle-jhipster-microservices/skills/testing-strategy/SKILL.md +101 -20
- package/templates/bundle-jhipster-microservices/skills/testing-strategy/references/test-naming.md +55 -0
- package/templates/bundle-jhipster-monorepo/skills/clean-architecture/SKILL.md +87 -21
- package/templates/bundle-jhipster-monorepo/skills/clean-architecture/references/layer-rules.md +78 -0
- package/templates/bundle-jhipster-monorepo/skills/ddd-tactical/SKILL.md +94 -25
- package/templates/bundle-jhipster-monorepo/skills/ddd-tactical/references/ddd-patterns.md +48 -0
- package/templates/bundle-jhipster-monorepo/skills/jhipster-angular/SKILL.md +99 -52
- package/templates/bundle-jhipster-monorepo/skills/jhipster-angular/references/angular-structure.md +59 -0
- package/templates/bundle-jhipster-monorepo/skills/jhipster-entities/SKILL.md +89 -36
- package/templates/bundle-jhipster-monorepo/skills/jhipster-entities/references/jdl-types.md +56 -0
- package/templates/bundle-jhipster-monorepo/skills/jhipster-liquibase/SKILL.md +123 -23
- package/templates/bundle-jhipster-monorepo/skills/jhipster-liquibase/references/liquibase-operations.md +95 -0
- package/templates/bundle-jhipster-monorepo/skills/jhipster-security/SKILL.md +106 -19
- package/templates/bundle-jhipster-monorepo/skills/jhipster-security/references/security-checklist.md +47 -0
- package/templates/bundle-jhipster-monorepo/skills/jhipster-spring/SKILL.md +84 -16
- package/templates/bundle-jhipster-monorepo/skills/jhipster-spring/references/spring-layers.md +41 -0
- package/templates/bundle-jhipster-monorepo/skills/testing-strategy/SKILL.md +101 -20
- package/templates/bundle-jhipster-monorepo/skills/testing-strategy/references/test-naming.md +55 -0
|
@@ -1,24 +1,38 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: eval-testing
|
|
3
|
-
description:
|
|
3
|
+
description: Build evaluation frameworks for AI agents with LLM-as-judge, rule-based evals, and golden datasets. Use when testing agents, evaluating RAG quality, or creating compliance benchmarks.
|
|
4
|
+
version: 1.0.0
|
|
5
|
+
author: Maestro
|
|
4
6
|
---
|
|
5
7
|
|
|
6
|
-
#
|
|
8
|
+
# Eval Testing
|
|
7
9
|
|
|
8
|
-
|
|
10
|
+
Build comprehensive evaluation pipelines for AI agents using rule-based checks, LLM-as-judge scoring, golden datasets, and CI/CD integration.
|
|
9
11
|
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
12
|
+
## When to Use
|
|
13
|
+
- Testing whether an agent produces correct, compliant output
|
|
14
|
+
- Evaluating RAG retrieval quality with RAGAS metrics
|
|
15
|
+
- Building a golden dataset for regression testing
|
|
16
|
+
- Setting up automated eval pipelines in CI/CD
|
|
17
|
+
- Comparing agent performance across prompt versions
|
|
16
18
|
|
|
17
|
-
##
|
|
19
|
+
## Available Operations
|
|
20
|
+
1. Create rule-based evaluators for compliance checks
|
|
21
|
+
2. Build LLM-as-judge scoring prompts
|
|
22
|
+
3. Design golden datasets with expected outcomes
|
|
23
|
+
4. Run evaluation benchmarks
|
|
24
|
+
5. Integrate evals into CI/CD pipelines
|
|
25
|
+
6. Analyze and compare eval results
|
|
26
|
+
|
|
27
|
+
## Multi-Step Workflow
|
|
28
|
+
|
|
29
|
+
### Step 1: Create Rule-Based Evaluators
|
|
30
|
+
|
|
31
|
+
Start with deterministic checks -- they are fast, free, and reliable.
|
|
18
32
|
|
|
19
33
|
```python
|
|
20
34
|
class ComplianceEvaluator:
|
|
21
|
-
"""
|
|
35
|
+
"""Verify code follows bundle standards."""
|
|
22
36
|
|
|
23
37
|
def evaluate(self, code: str, bundle: Bundle) -> EvalResult:
|
|
24
38
|
checks = []
|
|
@@ -36,25 +50,33 @@ class ComplianceEvaluator:
|
|
|
36
50
|
return Check(
|
|
37
51
|
name="max_lines",
|
|
38
52
|
passed=lines <= max,
|
|
39
|
-
detail=f"{lines}/{max}
|
|
53
|
+
detail=f"{lines}/{max} lines"
|
|
40
54
|
)
|
|
41
55
|
```
|
|
42
56
|
|
|
43
|
-
|
|
57
|
+
Run rule-based checks:
|
|
58
|
+
|
|
59
|
+
```bash
|
|
60
|
+
python -m evals.rules --code-dir src/ --bundle bundles/backend.json
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
### Step 2: Build LLM-as-Judge Evaluator
|
|
64
|
+
|
|
65
|
+
For subjective quality assessments, use an LLM to score agent output.
|
|
44
66
|
|
|
45
67
|
```python
|
|
46
68
|
JUDGE_PROMPT = """
|
|
47
|
-
|
|
48
|
-
1.
|
|
49
|
-
2.
|
|
50
|
-
3.
|
|
51
|
-
4.
|
|
69
|
+
Evaluate the code below on these criteria:
|
|
70
|
+
1. Clarity (1-5): Is the code easy to understand?
|
|
71
|
+
2. Correctness (1-5): Does the code do what it should?
|
|
72
|
+
3. Patterns (1-5): Does it follow Clean Architecture and DDD?
|
|
73
|
+
4. Tests (1-5): Are tests adequate?
|
|
52
74
|
|
|
53
|
-
|
|
75
|
+
Code:
|
|
54
76
|
{code}
|
|
55
77
|
|
|
56
|
-
|
|
57
|
-
{{"
|
|
78
|
+
Respond in JSON:
|
|
79
|
+
{{"clarity": X, "correctness": X, "patterns": X, "tests": X, "justification": "..."}}
|
|
58
80
|
"""
|
|
59
81
|
|
|
60
82
|
async def llm_judge(code: str) -> dict:
|
|
@@ -62,14 +84,16 @@ async def llm_judge(code: str) -> dict:
|
|
|
62
84
|
return json.loads(response.content)
|
|
63
85
|
```
|
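The prompt above doubles its braces so that `str.format` leaves the JSON example intact, and the judge reply is parsed with `json.loads`. A hedged sketch of both steps follows; `PROMPT_TEMPLATE` is a shortened stand-in for `JUDGE_PROMPT`, and `parse_judge_response` and `CRITERIA` are hypothetical names that do not appear in the package.

```python
import json

# Shortened stand-in for JUDGE_PROMPT; doubled braces come out of str.format as literal braces.
PROMPT_TEMPLATE = 'Code:\n{code}\n\nRespond in JSON:\n{{"clarity": X, "correctness": X}}'

# Assumed criterion names, taken from the JSON shape the prompt asks for.
CRITERIA = ("clarity", "correctness", "patterns", "tests")


def parse_judge_response(raw: str) -> dict:
    """Parse a judge reply and verify every criterion is an integer from 1 to 5."""
    data = json.loads(raw)
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer in 1-5, got {value!r}")
    return data
```

Validating the reply before aggregating scores means a malformed or out-of-range answer fails loudly instead of skewing the benchmark.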
|
64
86
|
|
|
65
|
-
|
|
87
|
+
### Step 3: Build a Golden Dataset
|
|
88
|
+
|
|
89
|
+
Create a set of input/expected-output pairs for regression testing.
|
|
66
90
|
|
|
67
91
|
```json
|
|
68
92
|
{
|
|
69
93
|
"evals": [
|
|
70
94
|
{
|
|
71
95
|
"id": "eval-001",
|
|
72
|
-
"prompt": "
|
|
96
|
+
"prompt": "Create a GET /api/v1/demands endpoint with pagination",
|
|
73
97
|
"expected": {
|
|
74
98
|
"has_controller": true,
|
|
75
99
|
"has_use_case": true,
|
|
@@ -82,7 +106,11 @@ async def llm_judge(code: str) -> dict:
|
|
|
82
106
|
}
|
|
83
107
|
```
|
|
84
108
|
|
|
85
|
-
|
|
109
|
+
Store golden datasets in `evals/golden_dataset.json`.
|
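Before a golden dataset is used for regression testing, it can help to sanity-check the file. The sketch below assumes the `evals` / `id` / `prompt` / `expected` layout shown in the JSON above; the function names and the specific rules (unique ids, boolean expectations) are illustrative assumptions, not part of the package.

```python
import json


def validate_golden_dataset(dataset: dict) -> list[str]:
    """Return a list of problems found; an empty list means the dataset looks usable."""
    problems = []
    seen_ids = set()
    for i, case in enumerate(dataset.get("evals", [])):
        case_id = case.get("id")
        if not case_id:
            problems.append(f"case {i}: missing id")
        elif case_id in seen_ids:
            problems.append(f"case {i}: duplicate id {case_id}")
        seen_ids.add(case_id)
        if not case.get("prompt"):
            problems.append(f"case {i}: missing prompt")
        expected = case.get("expected", {})
        if not expected:
            problems.append(f"case {i}: no expected outcomes")
        for key, value in expected.items():
            # Rule-based expectations are meant to be strictly true/false.
            if not isinstance(value, bool):
                problems.append(f"case {i}: expected.{key} is not a boolean")
    if not dataset.get("evals"):
        problems.append("dataset has no eval cases")
    return problems


def load_golden_dataset(path: str) -> dict:
    """Load a dataset file and refuse to continue if validation finds problems."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    problems = validate_golden_dataset(dataset)
    if problems:
        raise ValueError("; ".join(problems))
    return dataset
```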
|
110
|
+
|
|
111
|
+
### Step 4: Build the Eval Runner
|
|
112
|
+
|
|
113
|
+
Combine rule-based and LLM-as-judge evaluations into a single runner.
|
|
86
114
|
|
|
87
115
|
```python
|
|
88
116
|
async def run_evals(agent, eval_set: list[dict]) -> BenchmarkResult:
|
|
@@ -103,7 +131,15 @@ async def run_evals(agent, eval_set: list[dict]) -> BenchmarkResult:
|
|
|
103
131
|
return BenchmarkResult(results=results, aggregate=aggregate(results))
|
|
104
132
|
```
|
|
105
133
|
|
|
106
|
-
|
|
134
|
+
Run the eval suite:
|
|
135
|
+
|
|
136
|
+
```bash
|
|
137
|
+
python -m evals.run_evals --dataset evals/golden_dataset.json --output results/eval_run_$(date +%Y%m%d).json
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
### Step 5: Integrate into CI/CD
|
|
141
|
+
|
|
142
|
+
Add eval checks to your pipeline so regressions are caught automatically.
|
|
107
143
|
|
|
108
144
|
```yaml
|
|
109
145
|
eval:
|
|
@@ -113,3 +149,55 @@ eval:
|
|
|
113
149
|
- python -m evals.check_threshold --min-score 0.8
|
|
114
150
|
allow_failure: false
|
|
115
151
|
```
|
|
152
|
+
|
|
153
|
+
Run locally before pushing:
|
|
154
|
+
|
|
155
|
+
```bash
|
|
156
|
+
python -m evals.run_evals --dataset evals/golden_dataset.json && python -m evals.check_threshold --min-score 0.8
|
|
157
|
+
```
|
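The diff references `evals.check_threshold` without showing its body. As a rough sketch of what such a gate might do, the code below averages per-case scores and maps the outcome to an exit code. The results-file layout (`{"results": [{"score": ...}]}`) and every function name here are assumptions for illustration only.

```python
import json


def mean_score(results: list[dict]) -> float:
    """Average the per-case scores; an empty run scores 0.0 so it cannot pass silently."""
    if not results:
        return 0.0
    return sum(r["score"] for r in results) / len(results)


def check_threshold(results: list[dict], min_score: float) -> bool:
    return mean_score(results) >= min_score


def main(path: str, min_score: float) -> int:
    """Return a process exit code: 0 when the run meets the threshold, 1 otherwise."""
    with open(path, encoding="utf-8") as f:
        results = json.load(f)["results"]  # assumed file layout
    print(f"mean score {mean_score(results):.2f} (minimum {min_score:.2f})")
    # A non-zero exit code is what makes the CI job fail.
    return 0 if check_threshold(results, min_score) else 1
```

A command-line wrapper would pass the return value of `main` to `sys.exit`, so that a pipeline configured with `allow_failure: false` stops on a regression.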
|
158
|
+
|
|
159
|
+
### Step 6: Compare Results Across Versions
|
|
160
|
+
|
|
161
|
+
```bash
|
|
162
|
+
python -m evals.compare --baseline results/eval_v1.json --current results/eval_v2.json
|
|
163
|
+
```
|
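The comparison step is shown only as a command. One plausible shape for the underlying logic, assuming each run can be reduced to a mapping from eval id to score, is sketched below; `compare_runs` and its report keys are hypothetical and not taken from the package.

```python
def compare_runs(baseline: dict[str, float], current: dict[str, float]) -> dict:
    """Compare per-case scores keyed by eval id and list the cases that got worse."""
    shared = sorted(set(baseline) & set(current))
    deltas = {case_id: current[case_id] - baseline[case_id] for case_id in shared}
    return {
        "deltas": deltas,
        # Any negative delta counts as a regression in this sketch.
        "regressions": [case_id for case_id in shared if deltas[case_id] < 0],
        # Cases present in the baseline but absent from the current run.
        "missing": sorted(set(baseline) - set(current)),
    }
```

Reporting per-case deltas rather than a single aggregate makes it visible when an overall improvement hides a regression in one specific case.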
|
164
|
+
|
|
165
|
+
## Resources
|
|
166
|
+
- `references/eval-types.md` - Detailed comparison of evaluation types and when to use each
|
|
167
|
+
- `references/golden-dataset-template.md` - Template for building golden datasets
|
|
168
|
+
|
|
169
|
+
## Examples
|
|
170
|
+
|
|
171
|
+
### Example 1: Evaluate a Backend Agent
|
|
172
|
+
User asks: "Set up evals for our backend agent that generates FastAPI endpoints."
|
|
173
|
+
Response approach:
|
|
174
|
+
1. Create rule-based checks: has controller, has use case, has repository, has tests
|
|
175
|
+
2. Create LLM-as-judge prompt that scores code clarity, correctness, and patterns
|
|
176
|
+
3. Build a golden dataset with 10 endpoint-generation scenarios
|
|
177
|
+
4. Run the eval suite and establish a baseline score
|
|
178
|
+
5. Add to CI/CD with a minimum score threshold of 0.8
|
|
179
|
+
|
|
180
|
+
### Example 2: Evaluate RAG Quality
|
|
181
|
+
User asks: "Our RAG agent gives wrong answers sometimes. Set up evals to catch this."
|
|
182
|
+
Response approach:
|
|
183
|
+
1. Build a golden dataset of 20 question/expected-answer pairs
|
|
184
|
+
2. Create rule-based checks for source citation and answer format
|
|
185
|
+
3. Create LLM-as-judge that scores faithfulness (is the answer grounded in context?)
|
|
186
|
+
4. Run evals and identify patterns in failures
|
|
187
|
+
5. Use results to guide chunking and retrieval improvements
|
|
188
|
+
|
|
189
|
+
### Example 3: A/B Test Prompt Changes
|
|
190
|
+
User asks: "I changed the system prompt. How do I know if it's better?"
|
|
191
|
+
Response approach:
|
|
192
|
+
1. Run the golden dataset against the old prompt (baseline)
|
|
193
|
+
2. Run the same dataset against the new prompt
|
|
194
|
+
3. Compare scores with `evals.compare` to see deltas
|
|
195
|
+
4. Check for regressions in specific eval cases
|
|
196
|
+
5. Only deploy if the new prompt scores >= baseline on all categories
|
|
197
|
+
|
|
198
|
+
## Notes
|
|
199
|
+
- Rule-based evals run first (fast, free) -- fail early before spending on LLM-as-judge
|
|
200
|
+
- Golden datasets should have at least 10 cases for meaningful results
|
|
201
|
+
- Track eval scores over time to detect gradual drift
|
|
202
|
+
- LLM-as-judge is non-deterministic -- run each case 3 times and average the scores to reduce run-to-run variance
|
|
203
|
+
- Store all eval results as JSON for historical comparison
|
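The note about averaging repeated judge runs could be implemented along these lines. This is a sketch under the assumption that each run returns a dict with the four numeric criteria from the judge prompt; `average_judge_scores` is a hypothetical helper.

```python
from statistics import mean


def average_judge_scores(
    runs: list[dict],
    criteria: tuple = ("clarity", "correctness", "patterns", "tests"),
) -> dict:
    """Average each numeric criterion across repeated judge runs of the same case."""
    if not runs:
        raise ValueError("at least one judge run is required")
    return {name: mean(run[name] for run in runs) for name in criteria}
```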
|
@@ -0,0 +1,52 @@
|
|
|
1
|
+
# Eval Types Reference
|
|
2
|
+
|
|
3
|
+
## Comparison Matrix
|
|
4
|
+
|
|
5
|
+
| Type | When to Use | Speed | Cost | Deterministic |
|
|
6
|
+
|---|---|---|---|---|
|
|
7
|
+
| Rule-based | Validate format, structure, compliance | Fast | Free | Yes |
|
|
8
|
+
| LLM-as-judge | Evaluate quality, coherence, usefulness | Slow | $$ | No |
|
|
9
|
+
| Golden dataset | Compare against expected outcomes | Medium | Free | Yes |
|
|
10
|
+
| RAGAS | RAG metrics (faithfulness, relevancy) | Medium | $ | No |
|
|
11
|
+
|
|
12
|
+
## Rule-Based Evals
|
|
13
|
+
|
|
14
|
+
Best for: Structure validation, compliance checks, format verification.
|
|
15
|
+
|
|
16
|
+
```python
|
|
17
|
+
import re

# Examples of rule-based checks
|
|
18
|
+
def check_has_tests(output: str) -> bool:
|
|
19
|
+
return "def test_" in output or "class Test" in output
|
|
20
|
+
|
|
21
|
+
def check_no_secrets(output: str) -> bool:
|
|
22
|
+
return not re.search(r'(password|secret|api_key)\s*=\s*["\'][^"\']+["\']', output)
|
|
23
|
+
|
|
24
|
+
def check_clean_arch_layers(output: str) -> bool:
|
|
25
|
+
return all(layer in output for layer in ["controller", "use_case", "repository"])
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
## LLM-as-Judge
|
|
29
|
+
|
|
30
|
+
Best for: Subjective quality assessment, code review, explanation quality.
|
|
31
|
+
|
|
32
|
+
Scoring rubric should be explicit:
|
|
33
|
+
- 1: Completely wrong or missing
|
|
34
|
+
- 2: Present but seriously flawed
|
|
35
|
+
- 3: Acceptable with notable issues
|
|
36
|
+
- 4: Good with minor issues
|
|
37
|
+
- 5: Excellent, production-ready
|
|
38
|
+
|
|
39
|
+
## Golden Datasets
|
|
40
|
+
|
|
41
|
+
Best for: Regression testing, baseline comparison, prompt A/B testing.
|
|
42
|
+
|
|
43
|
+
Minimum size: 10 cases for initial validation, 50+ for production confidence.
|
|
44
|
+
|
|
45
|
+
## RAGAS Metrics
|
|
46
|
+
|
|
47
|
+
Best for: RAG pipeline evaluation.
|
|
48
|
+
|
|
49
|
+
- **Faithfulness**: Is the answer supported by the retrieved context?
|
|
50
|
+
- **Answer Relevancy**: Does the answer address the question?
|
|
51
|
+
- **Context Precision**: Are the retrieved documents relevant?
|
|
52
|
+
- **Context Recall**: Did retrieval find all relevant documents?
|
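To make the two retrieval metrics concrete, here is a deliberately simplified, set-based approximation. It assumes the relevant document ids are known in advance, which is not how RAGAS itself computes these metrics (RAGAS relies on LLM judgments), so treat it as an illustration of the definitions only.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved documents that are relevant (0.0 when nothing was retrieved)."""
    if not retrieved:
        return 0.0
    return sum(doc in relevant for doc in retrieved) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant documents that were retrieved (1.0 when nothing is relevant)."""
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)
```

Low precision with high recall suggests retrieval returns too much noise, while the reverse suggests relevant chunks are being missed.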
|
@@ -0,0 +1,59 @@
|
|
|
1
|
+
# Golden Dataset Template
|
|
2
|
+
|
|
3
|
+
## Structure
|
|
4
|
+
|
|
5
|
+
```json
|
|
6
|
+
{
|
|
7
|
+
"metadata": {
|
|
8
|
+
"name": "Backend Agent Eval Suite",
|
|
9
|
+
"version": "1.0.0",
|
|
10
|
+
"created_at": "2026-03-27",
|
|
11
|
+
"min_pass_score": 0.8
|
|
12
|
+
},
|
|
13
|
+
"evals": [
|
|
14
|
+
{
|
|
15
|
+
"id": "eval-001",
|
|
16
|
+
"category": "endpoint-creation",
|
|
17
|
+
"difficulty": "basic",
|
|
18
|
+
"prompt": "Create a GET /api/v1/demands endpoint that returns a paginated list",
|
|
19
|
+
"expected": {
|
|
20
|
+
"has_controller": true,
|
|
21
|
+
"has_use_case": true,
|
|
22
|
+
"has_repository": true,
|
|
23
|
+
"has_pagination": true,
|
|
24
|
+
"follows_clean_arch": true,
|
|
25
|
+
"has_error_handling": true,
|
|
26
|
+
"has_tests": true
|
|
27
|
+
},
|
|
28
|
+
"judge_criteria": {
|
|
29
|
+
"clarity": ">= 4",
|
|
30
|
+
"correctness": ">= 4",
|
|
31
|
+
"patterns": ">= 3",
|
|
32
|
+
"tests": ">= 3"
|
|
33
|
+
}
|
|
34
|
+
},
|
|
35
|
+
{
|
|
36
|
+
"id": "eval-002",
|
|
37
|
+
"category": "error-handling",
|
|
38
|
+
"difficulty": "intermediate",
|
|
39
|
+
"prompt": "Add proper error handling to the demands API: 404 for not found, 422 for validation errors, 500 for server errors",
|
|
40
|
+
"expected": {
|
|
41
|
+
"has_exception_handler": true,
|
|
42
|
+
"has_error_response_model": true,
|
|
43
|
+
"handles_404": true,
|
|
44
|
+
"handles_422": true,
|
|
45
|
+
"handles_500": true
|
|
46
|
+
}
|
|
47
|
+
}
|
|
48
|
+
]
|
|
49
|
+
}
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
## Guidelines for Building Golden Datasets
|
|
53
|
+
|
|
54
|
+
1. Cover basic, intermediate, and advanced scenarios.
|
|
55
|
+
2. Include both happy path and error path cases.
|
|
56
|
+
3. Make expected outcomes binary (true/false) for rule-based checks.
|
|
57
|
+
4. Add judge criteria for subjective quality scoring.
|
|
58
|
+
5. Update the dataset as the agent's capabilities evolve.
|
|
59
|
+
6. Minimum 10 cases for initial validation, 50+ for production.
|
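The template stores judge criteria as strings such as `">= 4"`. The sketch below shows one way those strings might be interpreted against judge scores. The set of supported operators and both helper names are assumptions, since the package does not show how `judge_criteria` is consumed.

```python
import operator
import re

_OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt, "==": operator.eq}
# Two-character operators are listed first so ">=" is not read as ">".
_PATTERN = re.compile(r"^\s*(>=|<=|==|>|<)\s*(\d+(?:\.\d+)?)\s*$")


def meets_criterion(score: float, criterion: str) -> bool:
    """Check a score against a criterion string such as '>= 4'."""
    match = _PATTERN.match(criterion)
    if match is None:
        raise ValueError(f"unsupported criterion: {criterion!r}")
    op, threshold = match.groups()
    return _OPS[op](score, float(threshold))


def failed_criteria(scores: dict, judge_criteria: dict) -> list[str]:
    """Return the names of criteria whose score is missing or does not meet the bar."""
    return [
        name
        for name, criterion in judge_criteria.items()
        if name not in scores or not meets_criterion(scores[name], criterion)
    ]
```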
|
@@ -1,78 +1,112 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: memory-management
|
|
3
|
-
description:
|
|
3
|
+
description: Implement short-term, medium-term, and long-term memory for AI agents using LangGraph Store and checkpointers. Use when agents need to remember past interactions, persist state, or learn from previous executions.
|
|
4
|
+
version: 1.0.0
|
|
5
|
+
author: Maestro
|
|
4
6
|
---
|
|
5
7
|
|
|
6
|
-
#
|
|
8
|
+
# Memory Management
|
|
7
9
|
|
|
8
|
-
|
|
10
|
+
Implement three tiers of agent memory -- short-term (context window), medium-term (checkpointer), and long-term (store) -- to enable persistent learning and state management.
|
|
9
11
|
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
12
|
+
## When to Use
|
|
13
|
+
- Agent needs to resume work after interruption
|
|
14
|
+
- Agent should learn from past executions and avoid repeating mistakes
|
|
15
|
+
- Persisting state between nodes in a LangGraph workflow
|
|
16
|
+
- Storing and retrieving patterns learned across multiple demands
|
|
17
|
+
- Implementing memory decay to remove stale or low-confidence knowledge
|
|
15
18
|
|
|
16
|
-
##
|
|
19
|
+
## Available Operations
|
|
20
|
+
1. Configure short-term memory via context window
|
|
21
|
+
2. Set up medium-term memory with LangGraph checkpointer
|
|
22
|
+
3. Implement long-term memory with LangGraph Store
|
|
23
|
+
4. Integrate memory into Deep Agent configuration
|
|
24
|
+
5. Implement memory cleanup and decay policies
|
|
17
25
|
|
|
18
|
-
|
|
26
|
+
## Multi-Step Workflow
|
|
27
|
+
|
|
28
|
+
### Step 1: Set Up Short-Term Memory (Context Window)
|
|
29
|
+
|
|
30
|
+
Short-term memory is automatic -- LangGraph accumulates messages within a session.
|
|
19
31
|
|
|
20
32
|
```python
|
|
33
|
+
from typing import TypedDict, Annotated
|
|
34
|
+
from langgraph.graph.message import add_messages
|
|
35
|
+
|
|
21
36
|
class AgentState(TypedDict):
|
|
22
|
-
messages: Annotated[list, add_messages] #
|
|
37
|
+
messages: Annotated[list, add_messages] # Accumulates automatically
|
|
23
38
|
```
|
|
24
39
|
|
|
25
|
-
|
|
40
|
+
No additional setup needed. Messages persist for the duration of a single invocation chain.
|
|
26
41
|
|
|
27
|
-
|
|
42
|
+
### Step 2: Set Up Medium-Term Memory (Checkpointer)
|
|
43
|
+
|
|
44
|
+
The checkpointer persists graph state between invocations of the same demand, which lets a run resume after a failure.
|
|
45
|
+
|
|
46
|
+
```bash
|
|
47
|
+
pip install langgraph-checkpoint-postgres psycopg
|
|
48
|
+
```
|
|
28
49
|
|
|
29
50
|
```python
|
|
30
51
|
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
|
|
52
|
+
from langgraph.graph import StateGraph
|
|
31
53
|
|
|
32
54
|
checkpointer = AsyncPostgresSaver.from_conn_string(DATABASE_URL)  # returns an async context manager; enter it with "async with" before compiling
|
|
33
55
|
|
|
34
56
|
graph = StateGraph(OrchestratorState)
|
|
35
|
-
# ...
|
|
57
|
+
# ... define nodes and edges ...
|
|
36
58
|
app = graph.compile(checkpointer=checkpointer)
|
|
37
59
|
|
|
38
|
-
#
|
|
60
|
+
# Use consistent thread_id per demand
|
|
39
61
|
config = {"configurable": {"thread_id": f"demand-{demand_id}"}}
|
|
40
62
|
result = await app.ainvoke({"messages": [...]}, config=config)
|
|
41
63
|
|
|
42
|
-
#
|
|
43
|
-
result2 = await app.ainvoke({"messages": [
|
|
64
|
+
# Next invocation with same thread_id resumes from saved state
|
|
65
|
+
result2 = await app.ainvoke({"messages": [new_msg]}, config=config)
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
Verify checkpointer is working:
|
|
69
|
+
|
|
70
|
+
```bash
|
|
71
|
+
psql $DATABASE_URL -c "SELECT thread_id, created_at FROM checkpoints ORDER BY created_at DESC LIMIT 5;"
|
|
44
72
|
```
|
|
45
73
|
|
|
46
|
-
|
|
74
|
+
### Step 3: Set Up Long-Term Memory (Store)
|
|
47
75
|
|
|
48
|
-
|
|
76
|
+
The store persists knowledge across different demands, so the agent can recall patterns, preferences, and learnings.
|
|
77
|
+
|
|
78
|
+
```bash
|
|
79
|
+
pip install langgraph-checkpoint-postgres
|
|
80
|
+
```
|
|
49
81
|
|
|
50
82
|
```python
|
|
51
83
|
from langgraph.store.postgres import AsyncPostgresStore
|
|
52
84
|
|
|
53
85
|
store = AsyncPostgresStore.from_conn_string(DATABASE_URL)
|
|
54
86
|
|
|
55
|
-
#
|
|
87
|
+
# Save a learned pattern
|
|
56
88
|
await store.aput(
|
|
57
89
|
namespace=("agent", "backend", "patterns"),
|
|
58
90
|
key="spring-crud-pattern",
|
|
59
91
|
value={
|
|
60
|
-
"pattern": "
|
|
92
|
+
"pattern": "Use record for DTO, entity for domain",
|
|
61
93
|
"learned_from": "demand-123",
|
|
62
94
|
"confidence": 0.95,
|
|
63
95
|
"created_at": "2026-03-27"
|
|
64
96
|
}
|
|
65
97
|
)
|
|
66
98
|
|
|
67
|
-
#
|
|
99
|
+
# Search for relevant learnings
|
|
68
100
|
results = await store.asearch(
|
|
69
101
|
("agent", "backend", "patterns"),  # the namespace prefix is positional-only in asearch
|
|
70
|
-
query="
|
|
102
|
+
query="how to create DTO for REST API",
|
|
71
103
|
limit=5
|
|
72
104
|
)
|
|
73
105
|
```
|
|
74
106
|
|
|
75
|
-
|
|
107
|
+
### Step 4: Integrate Memory into Deep Agent
|
|
108
|
+
|
|
109
|
+
Wire all three memory tiers into a Deep Agent configuration.
|
|
76
110
|
|
|
77
111
|
```python
|
|
78
112
|
from deepagents import create_deep_agent
|
|
@@ -85,22 +119,72 @@ agent = create_deep_agent(
|
|
|
85
119
|
backend=FilesystemBackend(root_dir=".", virtual_mode=True),
|
|
86
120
|
checkpointer=PostgresSaver(conn_string=DATABASE_URL),
|
|
87
121
|
store=PostgresStore(conn_string=DATABASE_URL),
|
|
88
|
-
system_prompt="
|
|
122
|
+
system_prompt="You are a backend agent..."
|
|
89
123
|
)
|
|
90
124
|
```
|
|
91
125
|
|
|
92
|
-
|
|
126
|
+
### Step 5: Implement Memory Cleanup and Decay
|
|
93
127
|
|
|
94
|
-
|
|
128
|
+
Memories age. Remove stale or low-confidence entries to keep the store relevant.
|
|
95
129
|
|
|
96
130
|
```python
|
|
131
|
+
from datetime import datetime, timedelta
|
|
132
|
+
|
|
97
133
|
async def cleanup_stale_memories(store, max_age_days: int = 90):
|
|
98
|
-
"""Remove
|
|
134
|
+
"""Remove old or low-confidence memories."""
|
|
99
135
|
cutoff = datetime.now() - timedelta(days=max_age_days)
|
|
100
136
|
memories = await store.asearch(("agent",), limit=1000)  # BaseStore has no alist(); asearch defaults to limit=10
|
|
137
|
+
removed = 0
|
|
101
138
|
for mem in memories:
|
|
102
139
|
if mem.value.get("created_at", "") < cutoff.isoformat():
|
|
103
140
|
await store.adelete(namespace=mem.namespace, key=mem.key)
|
|
141
|
+
removed += 1
|
|
104
142
|
elif mem.value.get("confidence", 1.0) < 0.3:
|
|
105
143
|
await store.adelete(namespace=mem.namespace, key=mem.key)
|
|
144
|
+
removed += 1
|
|
145
|
+
return removed
|
|
106
146
|
```
|
|
147
|
+
|
|
148
|
+
Run cleanup:
|
|
149
|
+
|
|
150
|
+
```bash
|
|
151
|
+
python -m memory.cleanup --max-age-days 90 --min-confidence 0.3
|
|
152
|
+
```
|
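The deletion rule inside `cleanup_stale_memories` can be isolated as a pure function, which makes it testable without a database. The sketch below is an illustration with two stated differences from the code above: it parses `created_at` instead of comparing strings, and it keeps entries that lack a `created_at` field. It also assumes timestamps are timezone-naive ISO strings; `should_delete` is a hypothetical name.

```python
from datetime import datetime, timedelta


def should_delete(
    value: dict,
    now: datetime,
    max_age_days: int = 90,
    min_confidence: float = 0.3,
) -> bool:
    """Decide whether a stored memory is stale or too uncertain to keep."""
    created_raw = value.get("created_at")
    if created_raw:
        # Parsing handles both date-only strings and full ISO timestamps.
        created = datetime.fromisoformat(created_raw)
        if created < now - timedelta(days=max_age_days):
            return True
    # Entries without a confidence value are treated as fully trusted.
    return value.get("confidence", 1.0) < min_confidence
```

Passing `now` in as a parameter keeps the function deterministic, so the age rule can be unit-tested with fixed dates.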
|
153
|
+
|
|
154
|
+
## Resources
|
|
155
|
+
- `references/memory-tiers.md` - Detailed comparison of memory tiers with use cases
|
|
156
|
+
- `references/namespace-conventions.md` - Naming conventions for store namespaces
|
|
157
|
+
|
|
158
|
+
## Examples
|
|
159
|
+
|
|
160
|
+
### Example 1: Enable Resume After Failure
|
|
161
|
+
User asks: "Our agent crashes mid-task and loses all progress. Fix it."
|
|
162
|
+
Response approach:
|
|
163
|
+
1. Add PostgresSaver checkpointer to the agent's graph compilation
|
|
164
|
+
2. Use `thread_id` based on the demand ID for consistent state
|
|
165
|
+
3. On restart, invoke with the same thread_id -- LangGraph automatically resumes
|
|
166
|
+
4. Verify with: `psql $DATABASE_URL -c "SELECT * FROM checkpoints WHERE thread_id='demand-xyz';"`
|
|
167
|
+
|
|
168
|
+
### Example 2: Agent Should Learn From Past Work
|
|
169
|
+
User asks: "The backend agent keeps making the same pagination mistake. Make it learn."
|
|
170
|
+
Response approach:
|
|
171
|
+
1. After each successful demand, save patterns to the Store
|
|
172
|
+
2. Before each new task, search the Store for relevant learnings
|
|
173
|
+
3. Inject top-3 relevant learnings into the agent's context
|
|
174
|
+
4. Track confidence scores -- boost on positive feedback, decay on negative
|
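Steps 3 and 4 of this example leave the mechanics open. A minimal sketch of one possible approach follows, assuming each memory value carries a `confidence` field and that the candidates were already filtered for relevance by a store search. The boost and decay step sizes are arbitrary placeholders, and both function names are hypothetical.

```python
def update_confidence(
    confidence: float,
    positive: bool,
    boost: float = 0.05,
    decay: float = 0.15,
) -> float:
    """Nudge a confidence score up or down and clamp it to the range [0.0, 1.0]."""
    adjusted = confidence + boost if positive else confidence - decay
    return max(0.0, min(1.0, adjusted))


def top_learnings(memories: list[dict], k: int = 3) -> list[dict]:
    """Pick the k highest-confidence memories to inject into the agent's context."""
    return sorted(memories, key=lambda m: m.get("confidence", 0.0), reverse=True)[:k]
```

Making decay larger than boost is a design choice in this sketch: a pattern that led to a mistake loses trust faster than a successful one gains it.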
|
175
|
+
|
|
176
|
+
### Example 3: Clean Up Old Memories
|
|
177
|
+
User asks: "The memory store has grown too large. Clean it up."
|
|
178
|
+
Response approach:
|
|
179
|
+
1. Run `memory.cleanup --max-age-days 90` to remove entries older than 90 days
|
|
180
|
+
2. Remove entries with confidence below 0.3
|
|
181
|
+
3. Audit remaining entries for duplicates
|
|
182
|
+
4. Set up a weekly cron job for automatic cleanup
|
|
183
|
+
## Notes
- Always derive `thread_id` from a stable identifier (`demand_id`, `session_id`)
- The checkpointer handles resume automatically -- no custom logic needed
- Store namespaces should follow the convention: `("agent", agent_type, category)`
- Memory cleanup should run on a schedule (weekly recommended)
- Include `confidence` and `created_at` in all store entries for decay management
- Long-term memories should be surfaced via retrieval, not dumped into the prompt
@@ -0,0 +1,41 @@
# Memory Tiers Reference

## Comparison

| Tier | Duration | Mechanism | Storage | Cost | Use Case |
|---|---|---|---|---|---|
| Short-term | 1 session | Context window | In-memory | Free | Current conversation messages |
| Medium-term | 1 demand | Checkpointer | PostgreSQL | Low | Graph state between invocations |
| Long-term | Until cleaned up | Store | PostgreSQL | Low | Patterns, preferences, learnings |

## Short-Term Memory

- **What**: Messages in the current invocation chain.
- **How**: `Annotated[list, add_messages]` in state schema.
- **Limit**: Bounded by context window size.
- **When it resets**: End of the invocation chain.

## Medium-Term Memory

- **What**: Full graph state (all state fields) at each node execution.
- **How**: `graph.compile(checkpointer=PostgresSaver(...))`.
- **Limit**: Bounded by database storage.
- **When it resets**: When the demand is completed or explicitly cleared.
- **Key feature**: Enables resume after crash.

## Long-Term Memory

- **What**: Extracted patterns, preferences, and learnings.
- **How**: `store.aput(namespace=..., key=..., value=...)`.
- **Limit**: Bounded by database storage and relevance decay.
- **When it resets**: Only when explicitly cleaned up.
- **Key feature**: Enables cross-demand learning.

## Decision Guide

| Question | Answer | Use |
|---|---|---|
| Does the agent need to remember within this conversation? | Yes | Short-term |
| Does the agent need to resume after a crash? | Yes | Medium-term (checkpointer) |
| Should the agent learn from past demands? | Yes | Long-term (store) |
| Does the agent need to share knowledge with other agents? | Yes | Long-term (store with shared namespace) |

package/templates/bundle-ai-agents/skills/memory-management/references/namespace-conventions.md
ADDED
@@ -0,0 +1,41 @@
# Namespace Conventions Reference

## Store Namespace Structure

```
("agent", agent_type, category)
```

Project-wide entries that do not belong to a single agent use the two-part form `("project", category)`.

## Standard Namespaces

| Namespace | Purpose | Example Key |
|---|---|---|
| `("agent", "backend", "patterns")` | Code patterns learned by backend agent | `"fastapi-crud-pattern"` |
| `("agent", "frontend", "patterns")` | UI patterns learned by frontend agent | `"react-form-pattern"` |
| `("agent", "backend", "errors")` | Common errors and their fixes | `"alembic-migration-conflict"` |
| `("agent", "backend", "preferences")` | Team preferences for code style | `"prefer-dataclass-over-dict"` |
| `("project", "decisions")` | Architectural decisions | `"chose-fastapi-over-flask"` |
| `("project", "standards")` | Project-wide coding standards | `"naming-conventions"` |

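
The namespaces in the table can be built through small helpers so that typos do not silently create new namespaces. This is a sketch under assumptions: the helper names and the `ALLOWED_CATEGORIES` set are hypothetical, derived only from the categories listed in the table.

```python
# Hypothetical allow-list, taken from the agent categories in the table above.
ALLOWED_CATEGORIES = {"patterns", "errors", "preferences"}


def agent_namespace(agent_type: str, category: str) -> tuple[str, str, str]:
    """Build an ("agent", agent_type, category) namespace tuple."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return ("agent", agent_type, category)


def project_namespace(category: str) -> tuple[str, str]:
    """Build a ("project", category) namespace tuple for project-wide entries."""
    return ("project", category)


print(agent_namespace("backend", "patterns"))
print(project_namespace("decisions"))
```

Rejecting unknown categories up front keeps entries from scattering across near-duplicate namespaces such as `"pattern"` and `"patterns"`.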
## Value Schema

Every store entry should include these fields:

```python
{
    "pattern": str,           # The actual knowledge
    "learned_from": str,      # Which demand/task this came from
    "confidence": float,      # 0.0 to 1.0 (boost on positive feedback, decay on negative)
    "created_at": str,        # ISO timestamp
    "updated_at": str,        # ISO timestamp (updated on reinforcement)
    "usage_count": int,       # How many times this memory was retrieved
}
```

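
One way to keep every entry consistent with this schema is a typed factory. The sketch below is illustrative: `MemoryEntry` and `new_entry` are hypothetical names, and the starting confidence of 0.7 follows the "Initial" value listed under Confidence Management.

```python
from datetime import datetime, timezone
from typing import Optional, TypedDict


class MemoryEntry(TypedDict):
    pattern: str
    learned_from: str
    confidence: float
    created_at: str
    updated_at: str
    usage_count: int


def new_entry(pattern: str, learned_from: str,
              now: Optional[datetime] = None) -> MemoryEntry:
    """Create a store value with every required field filled in."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return MemoryEntry(
        pattern=pattern,
        learned_from=learned_from,
        confidence=0.7,        # initial confidence for a new, unvalidated learning
        created_at=timestamp,
        updated_at=timestamp,  # equals created_at until the entry is reinforced
        usage_count=0,
    )


entry = new_entry(
    "Return a cursor instead of an offset for paginated lists",
    "demand-xyz",
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(entry["created_at"])
```

The resulting dictionary is plain JSON-compatible data, so it can be passed as the `value` of a store entry.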
## Confidence Management

- **Initial**: 0.7 (new learning, not yet validated)
- **Reinforced**: +0.1 per positive use (max 1.0)
- **Contradicted**: -0.2 per negative feedback (min 0.0)
- **Cleanup threshold**: < 0.3 (remove on next cleanup run)
- **Time decay**: -0.05 per month without usage
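
The rules above translate into a few small functions. This is a minimal sketch: the function names are hypothetical, and rounding to two decimals is an added assumption to keep floating-point drift from pushing a score across the 0.3 threshold.

```python
def clamp(value: float) -> float:
    """Keep a confidence score inside [0.0, 1.0], rounded to two decimals (assumption)."""
    return max(0.0, min(1.0, round(value, 2)))


def reinforce(confidence: float) -> float:
    """+0.1 per positive use, capped at 1.0."""
    return clamp(confidence + 0.1)


def contradict(confidence: float) -> float:
    """-0.2 per negative feedback, floored at 0.0."""
    return clamp(confidence - 0.2)


def time_decay(confidence: float, months_unused: int) -> float:
    """-0.05 for every month without usage."""
    return clamp(confidence - 0.05 * months_unused)


def is_expired(confidence: float) -> bool:
    """True when the entry falls below the cleanup threshold."""
    return confidence < 0.3


score = reinforce(0.7)          # one positive use of a new learning
score = contradict(score)       # followed by one negative feedback
print(score, is_expired(score))
```

Under these rules a new learning (0.7) that is never used again stays at the threshold after eight months and drops below it after nine.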