maestro-bundle 1.3.1 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (116)
  1. package/package.json +1 -1
  2. package/templates/bundle-ai-agents/skills/agent-orchestration/SKILL.md +107 -41
  3. package/templates/bundle-ai-agents/skills/agent-orchestration/references/graph-patterns.md +50 -0
  4. package/templates/bundle-ai-agents/skills/agent-orchestration/references/routing-strategies.md +47 -0
  5. package/templates/bundle-ai-agents/skills/api-design/SKILL.md +125 -16
  6. package/templates/bundle-ai-agents/skills/api-design/references/pydantic-patterns.md +72 -0
  7. package/templates/bundle-ai-agents/skills/api-design/references/rest-conventions.md +51 -0
  8. package/templates/bundle-ai-agents/skills/clean-architecture/SKILL.md +113 -21
  9. package/templates/bundle-ai-agents/skills/clean-architecture/references/dependency-injection.md +60 -0
  10. package/templates/bundle-ai-agents/skills/clean-architecture/references/layer-rules.md +56 -0
  11. package/templates/bundle-ai-agents/skills/context-engineering/SKILL.md +104 -36
  12. package/templates/bundle-ai-agents/skills/context-engineering/references/compression-techniques.md +76 -0
  13. package/templates/bundle-ai-agents/skills/context-engineering/references/context-budget-calculator.md +45 -0
  14. package/templates/bundle-ai-agents/skills/database-modeling/SKILL.md +146 -19
  15. package/templates/bundle-ai-agents/skills/database-modeling/references/index-strategies.md +48 -0
  16. package/templates/bundle-ai-agents/skills/database-modeling/references/naming-conventions.md +27 -0
  17. package/templates/bundle-ai-agents/skills/docker-containerization/SKILL.md +124 -15
  18. package/templates/bundle-ai-agents/skills/docker-containerization/references/compose-patterns.md +97 -0
  19. package/templates/bundle-ai-agents/skills/docker-containerization/references/dockerfile-checklist.md +37 -0
  20. package/templates/bundle-ai-agents/skills/eval-testing/SKILL.md +113 -25
  21. package/templates/bundle-ai-agents/skills/eval-testing/references/eval-types.md +52 -0
  22. package/templates/bundle-ai-agents/skills/eval-testing/references/golden-dataset-template.md +59 -0
  23. package/templates/bundle-ai-agents/skills/memory-management/SKILL.md +112 -28
  24. package/templates/bundle-ai-agents/skills/memory-management/references/memory-tiers.md +41 -0
  25. package/templates/bundle-ai-agents/skills/memory-management/references/namespace-conventions.md +41 -0
  26. package/templates/bundle-ai-agents/skills/prompt-engineering/SKILL.md +139 -47
  27. package/templates/bundle-ai-agents/skills/prompt-engineering/references/anti-patterns.md +59 -0
  28. package/templates/bundle-ai-agents/skills/prompt-engineering/references/prompt-templates.md +75 -0
  29. package/templates/bundle-ai-agents/skills/rag-pipeline/SKILL.md +104 -27
  30. package/templates/bundle-ai-agents/skills/rag-pipeline/references/chunking-strategies.md +27 -0
  31. package/templates/bundle-ai-agents/skills/rag-pipeline/references/embedding-models.md +31 -0
  32. package/templates/bundle-ai-agents/skills/rag-pipeline/references/rag-evaluation.md +39 -0
  33. package/templates/bundle-ai-agents/skills/testing-strategy/SKILL.md +127 -18
  34. package/templates/bundle-ai-agents/skills/testing-strategy/references/fixture-patterns.md +81 -0
  35. package/templates/bundle-ai-agents/skills/testing-strategy/references/naming-conventions.md +69 -0
  36. package/templates/bundle-base/skills/branch-strategy/SKILL.md +134 -21
  37. package/templates/bundle-base/skills/branch-strategy/references/branch-rules.md +40 -0
  38. package/templates/bundle-base/skills/code-review/SKILL.md +123 -38
  39. package/templates/bundle-base/skills/code-review/references/review-checklist.md +45 -0
  40. package/templates/bundle-base/skills/commit-pattern/SKILL.md +98 -39
  41. package/templates/bundle-base/skills/commit-pattern/references/conventional-commits.md +40 -0
  42. package/templates/bundle-data-pipeline/skills/data-preprocessing/SKILL.md +110 -19
  43. package/templates/bundle-data-pipeline/skills/data-preprocessing/references/pandas-cheatsheet.md +63 -0
  44. package/templates/bundle-data-pipeline/skills/data-preprocessing/references/pandera-schemas.md +44 -0
  45. package/templates/bundle-data-pipeline/skills/docker-containerization/SKILL.md +132 -16
  46. package/templates/bundle-data-pipeline/skills/docker-containerization/references/compose-patterns.md +82 -0
  47. package/templates/bundle-data-pipeline/skills/docker-containerization/references/dockerfile-best-practices.md +57 -0
  48. package/templates/bundle-data-pipeline/skills/feature-engineering/SKILL.md +143 -45
  49. package/templates/bundle-data-pipeline/skills/feature-engineering/references/encoding-guide.md +41 -0
  50. package/templates/bundle-data-pipeline/skills/feature-engineering/references/scaling-guide.md +38 -0
  51. package/templates/bundle-data-pipeline/skills/mlops-pipeline/SKILL.md +156 -37
  52. package/templates/bundle-data-pipeline/skills/mlops-pipeline/references/mlflow-commands.md +69 -0
  53. package/templates/bundle-data-pipeline/skills/model-training/SKILL.md +152 -33
  54. package/templates/bundle-data-pipeline/skills/model-training/references/evaluation-metrics.md +52 -0
  55. package/templates/bundle-data-pipeline/skills/model-training/references/model-selection-guide.md +41 -0
  56. package/templates/bundle-data-pipeline/skills/rag-pipeline/SKILL.md +127 -39
  57. package/templates/bundle-data-pipeline/skills/rag-pipeline/references/chunking-strategies.md +51 -0
  58. package/templates/bundle-data-pipeline/skills/rag-pipeline/references/embedding-models.md +49 -0
  59. package/templates/bundle-frontend-spa/skills/authentication/SKILL.md +196 -13
  60. package/templates/bundle-frontend-spa/skills/authentication/references/jwt-security.md +41 -0
  61. package/templates/bundle-frontend-spa/skills/component-design/SKILL.md +191 -41
  62. package/templates/bundle-frontend-spa/skills/component-design/references/accessibility-checklist.md +41 -0
  63. package/templates/bundle-frontend-spa/skills/component-design/references/tailwind-patterns.md +65 -0
  64. package/templates/bundle-frontend-spa/skills/e2e-testing/SKILL.md +241 -79
  65. package/templates/bundle-frontend-spa/skills/e2e-testing/references/playwright-selectors.md +66 -0
  66. package/templates/bundle-frontend-spa/skills/e2e-testing/references/test-patterns.md +82 -0
  67. package/templates/bundle-frontend-spa/skills/integration-api/SKILL.md +221 -31
  68. package/templates/bundle-frontend-spa/skills/integration-api/references/api-patterns.md +81 -0
  69. package/templates/bundle-frontend-spa/skills/react-patterns/SKILL.md +195 -70
  70. package/templates/bundle-frontend-spa/skills/react-patterns/references/component-checklist.md +22 -0
  71. package/templates/bundle-frontend-spa/skills/react-patterns/references/hook-patterns.md +63 -0
  72. package/templates/bundle-frontend-spa/skills/responsive-layout/SKILL.md +162 -22
  73. package/templates/bundle-frontend-spa/skills/responsive-layout/references/breakpoint-guide.md +63 -0
  74. package/templates/bundle-frontend-spa/skills/state-management/SKILL.md +158 -30
  75. package/templates/bundle-frontend-spa/skills/state-management/references/react-query-config.md +64 -0
  76. package/templates/bundle-frontend-spa/skills/state-management/references/state-patterns.md +78 -0
  77. package/templates/bundle-jhipster-microservices/skills/ci-cd-pipeline/SKILL.md +135 -45
  78. package/templates/bundle-jhipster-microservices/skills/ci-cd-pipeline/references/gitlab-ci-templates.md +93 -0
  79. package/templates/bundle-jhipster-microservices/skills/clean-architecture/SKILL.md +87 -21
  80. package/templates/bundle-jhipster-microservices/skills/clean-architecture/references/layer-rules.md +78 -0
  81. package/templates/bundle-jhipster-microservices/skills/ddd-tactical/SKILL.md +94 -25
  82. package/templates/bundle-jhipster-microservices/skills/ddd-tactical/references/ddd-patterns.md +48 -0
  83. package/templates/bundle-jhipster-microservices/skills/jhipster-angular/SKILL.md +63 -21
  84. package/templates/bundle-jhipster-microservices/skills/jhipster-angular/references/angular-microservices.md +40 -0
  85. package/templates/bundle-jhipster-microservices/skills/jhipster-angular/references/angular-structure.md +59 -0
  86. package/templates/bundle-jhipster-microservices/skills/jhipster-docker-k8s/SKILL.md +125 -91
  87. package/templates/bundle-jhipster-microservices/skills/jhipster-docker-k8s/references/docker-k8s-commands.md +68 -0
  88. package/templates/bundle-jhipster-microservices/skills/jhipster-entities/SKILL.md +72 -20
  89. package/templates/bundle-jhipster-microservices/skills/jhipster-entities/references/cross-service-entities.md +36 -0
  90. package/templates/bundle-jhipster-microservices/skills/jhipster-entities/references/jdl-types.md +56 -0
  91. package/templates/bundle-jhipster-microservices/skills/jhipster-gateway/SKILL.md +80 -8
  92. package/templates/bundle-jhipster-microservices/skills/jhipster-gateway/references/gateway-config.md +43 -0
  93. package/templates/bundle-jhipster-microservices/skills/jhipster-kafka/SKILL.md +115 -22
  94. package/templates/bundle-jhipster-microservices/skills/jhipster-kafka/references/kafka-events.md +39 -0
  95. package/templates/bundle-jhipster-microservices/skills/jhipster-registry/SKILL.md +92 -23
  96. package/templates/bundle-jhipster-microservices/skills/jhipster-registry/references/consul-config.md +61 -0
  97. package/templates/bundle-jhipster-microservices/skills/jhipster-service/SKILL.md +81 -18
  98. package/templates/bundle-jhipster-microservices/skills/jhipster-service/references/service-patterns.md +40 -0
  99. package/templates/bundle-jhipster-microservices/skills/testing-strategy/SKILL.md +101 -20
  100. package/templates/bundle-jhipster-microservices/skills/testing-strategy/references/test-naming.md +55 -0
  101. package/templates/bundle-jhipster-monorepo/skills/clean-architecture/SKILL.md +87 -21
  102. package/templates/bundle-jhipster-monorepo/skills/clean-architecture/references/layer-rules.md +78 -0
  103. package/templates/bundle-jhipster-monorepo/skills/ddd-tactical/SKILL.md +94 -25
  104. package/templates/bundle-jhipster-monorepo/skills/ddd-tactical/references/ddd-patterns.md +48 -0
  105. package/templates/bundle-jhipster-monorepo/skills/jhipster-angular/SKILL.md +99 -52
  106. package/templates/bundle-jhipster-monorepo/skills/jhipster-angular/references/angular-structure.md +59 -0
  107. package/templates/bundle-jhipster-monorepo/skills/jhipster-entities/SKILL.md +89 -36
  108. package/templates/bundle-jhipster-monorepo/skills/jhipster-entities/references/jdl-types.md +56 -0
  109. package/templates/bundle-jhipster-monorepo/skills/jhipster-liquibase/SKILL.md +123 -23
  110. package/templates/bundle-jhipster-monorepo/skills/jhipster-liquibase/references/liquibase-operations.md +95 -0
  111. package/templates/bundle-jhipster-monorepo/skills/jhipster-security/SKILL.md +106 -19
  112. package/templates/bundle-jhipster-monorepo/skills/jhipster-security/references/security-checklist.md +47 -0
  113. package/templates/bundle-jhipster-monorepo/skills/jhipster-spring/SKILL.md +84 -16
  114. package/templates/bundle-jhipster-monorepo/skills/jhipster-spring/references/spring-layers.md +41 -0
  115. package/templates/bundle-jhipster-monorepo/skills/testing-strategy/SKILL.md +101 -20
  116. package/templates/bundle-jhipster-monorepo/skills/testing-strategy/references/test-naming.md +55 -0
@@ -1,24 +1,38 @@
1
1
  ---
2
2
  name: eval-testing
3
- description: Criar framework de avaliação para agentes de AI com LLM-as-judge, rule-based evals e golden datasets. Use quando precisar testar agentes, avaliar qualidade de RAG, ou criar benchmarks de compliance.
3
+ description: Build evaluation frameworks for AI agents with LLM-as-judge, rule-based evals, and golden datasets. Use when testing agents, evaluating RAG quality, or creating compliance benchmarks.
4
+ version: 1.0.0
5
+ author: Maestro
4
6
  ---
5
7
 
6
- # Avaliação de Agentes
8
+ # Eval Testing
7
9
 
8
- ## Tipos de eval
10
+ Build comprehensive evaluation pipelines for AI agents using rule-based checks, LLM-as-judge scoring, golden datasets, and CI/CD integration.
9
11
 
10
- | Tipo | Quando usar | Velocidade |
11
- |---|---|---|
12
- | Rule-based | Validar formato, estrutura, compliance | Rápido |
13
- | LLM-as-judge | Avaliar qualidade, coerência, utilidade | Lento |
14
- | Golden dataset | Comparar com respostas esperadas | Médio |
15
- | RAGAS | Métricas de RAG (faithfulness, relevancy) | Médio |
12
+ ## When to Use
13
+ - Testing whether an agent produces correct, compliant output
14
+ - Evaluating RAG retrieval quality with RAGAS metrics
15
+ - Building a golden dataset for regression testing
16
+ - Setting up automated eval pipelines in CI/CD
17
+ - Comparing agent performance across prompt versions
16
18
 
17
- ## 1. Rule-based Evals
19
+ ## Available Operations
20
+ 1. Create rule-based evaluators for compliance checks
21
+ 2. Build LLM-as-judge scoring prompts
22
+ 3. Design golden datasets with expected outcomes
23
+ 4. Run evaluation benchmarks
24
+ 5. Integrate evals into CI/CD pipelines
25
+ 6. Analyze and compare eval results
26
+
27
+ ## Multi-Step Workflow
28
+
29
+ ### Step 1: Create Rule-Based Evaluators
30
+
31
+ Start with deterministic checks -- they are fast, free, and reliable.
18
32
 
19
33
  ```python
20
34
  class ComplianceEvaluator:
21
- """Verifica se o código segue o bundle"""
35
+ """Verify code follows bundle standards."""
22
36
 
23
37
  def evaluate(self, code: str, bundle: Bundle) -> EvalResult:
24
38
  checks = []
@@ -36,25 +50,33 @@ class ComplianceEvaluator:
36
50
  return Check(
37
51
  name="max_lines",
38
52
  passed=lines <= max,
39
- detail=f"{lines}/{max} linhas"
53
+ detail=f"{lines}/{max} lines"
40
54
  )
41
55
  ```
42
56
 
43
- ## 2. LLM-as-Judge
57
+ Run rule-based checks:
58
+
59
+ ```bash
60
+ python -m evals.rules --code-dir src/ --bundle bundles/backend.json
61
+ ```
62
+
63
+ ### Step 2: Build LLM-as-Judge Evaluator
64
+
65
+ For subjective quality assessments, use an LLM to score agent output.
44
66
 
45
67
  ```python
46
68
  JUDGE_PROMPT = """
47
- Avalie o código abaixo nos critérios:
48
- 1. Clareza (1-5): O código é fácil de entender?
49
- 2. Correção (1-5): O código faz o que deveria?
50
- 3. Padrões (1-5): Segue Clean Architecture e DDD?
51
- 4. Testes (1-5): Os testes são adequados?
69
+ Evaluate the code below on these criteria:
70
+ 1. Clarity (1-5): Is the code easy to understand?
71
+ 2. Correctness (1-5): Does the code do what it should?
72
+ 3. Patterns (1-5): Does it follow Clean Architecture and DDD?
73
+ 4. Tests (1-5): Are tests adequate?
52
74
 
53
- Código:
75
+ Code:
54
76
  {code}
55
77
 
56
- Responda em JSON:
57
- {{"clareza": X, "correcao": X, "padroes": X, "testes": X, "justificativa": "..."}}
78
+ Respond in JSON:
79
+ {{"clarity": X, "correctness": X, "patterns": X, "tests": X, "justification": "..."}}
58
80
  """
59
81
 
60
82
  async def llm_judge(code: str) -> dict:
@@ -62,14 +84,16 @@ async def llm_judge(code: str) -> dict:
62
84
  return json.loads(response.content)
63
85
  ```
64
86
 
65
- ## 3. Golden Dataset
87
+ ### Step 3: Build a Golden Dataset
88
+
89
+ Create a set of input/expected-output pairs for regression testing.
66
90
 
67
91
  ```json
68
92
  {
69
93
  "evals": [
70
94
  {
71
95
  "id": "eval-001",
72
- "prompt": "Crie um endpoint GET /api/v1/demands que retorna lista paginada",
96
+ "prompt": "Create a GET /api/v1/demands endpoint with pagination",
73
97
  "expected": {
74
98
  "has_controller": true,
75
99
  "has_use_case": true,
@@ -82,7 +106,11 @@ async def llm_judge(code: str) -> dict:
82
106
  }
83
107
  ```
84
108
 
85
- ## 4. Runner de avaliação
109
+ Store golden datasets in `evals/golden_dataset.json`.
110
+
111
+ ### Step 4: Build the Eval Runner
112
+
113
+ Combine rule-based and LLM-as-judge evaluations into a single runner.
86
114
 
87
115
  ```python
88
116
  async def run_evals(agent, eval_set: list[dict]) -> BenchmarkResult:
@@ -103,7 +131,15 @@ async def run_evals(agent, eval_set: list[dict]) -> BenchmarkResult:
103
131
  return BenchmarkResult(results=results, aggregate=aggregate(results))
104
132
  ```
105
133
 
106
- ## No CI/CD
134
+ Run the eval suite:
135
+
136
+ ```bash
137
+ python -m evals.run_evals --dataset evals/golden_dataset.json --output results/eval_run_$(date +%Y%m%d).json
138
+ ```
139
+
140
+ ### Step 5: Integrate into CI/CD
141
+
142
+ Add eval checks to your pipeline so regressions are caught automatically.
107
143
 
108
144
  ```yaml
109
145
  eval:
@@ -113,3 +149,55 @@ eval:
113
149
  - python -m evals.check_threshold --min-score 0.8
114
150
  allow_failure: false
115
151
  ```
152
+
153
+ Run locally before pushing:
154
+
155
+ ```bash
156
+ python -m evals.run_evals --dataset evals/golden_dataset.json && python -m evals.check_threshold --min-score 0.8
157
+ ```
158
+
159
+ ### Step 6: Compare Results Across Versions
160
+
161
+ ```bash
162
+ python -m evals.compare --baseline results/eval_v1.json --current results/eval_v2.json
163
+ ```
164
+
165
+ ## Resources
166
+ - `references/eval-types.md` - Detailed comparison of evaluation types and when to use each
167
+ - `references/golden-dataset-template.md` - Template for building golden datasets
168
+
169
+ ## Examples
170
+
171
+ ### Example 1: Evaluate a Backend Agent
172
+ User asks: "Set up evals for our backend agent that generates FastAPI endpoints."
173
+ Response approach:
174
+ 1. Create rule-based checks: has controller, has use case, has repository, has tests
175
+ 2. Create LLM-as-judge prompt that scores code clarity, correctness, and patterns
176
+ 3. Build a golden dataset with 10 endpoint-generation scenarios
177
+ 4. Run the eval suite and establish a baseline score
178
+ 5. Add to CI/CD with a minimum score threshold of 0.8
179
+
180
+ ### Example 2: Evaluate RAG Quality
181
+ User asks: "Our RAG agent gives wrong answers sometimes. Set up evals to catch this."
182
+ Response approach:
183
+ 1. Build a golden dataset of 20 question/expected-answer pairs
184
+ 2. Create rule-based checks for source citation and answer format
185
+ 3. Create LLM-as-judge that scores faithfulness (is the answer grounded in context?)
186
+ 4. Run evals and identify patterns in failures
187
+ 5. Use results to guide chunking and retrieval improvements
188
+
189
+ ### Example 3: A/B Test Prompt Changes
190
+ User asks: "I changed the system prompt. How do I know if it's better?"
191
+ Response approach:
192
+ 1. Run the golden dataset against the old prompt (baseline)
193
+ 2. Run the same dataset against the new prompt
194
+ 3. Compare scores with `evals.compare` to see deltas
195
+ 4. Check for regressions in specific eval cases
196
+ 5. Only deploy if the new prompt scores >= baseline on all categories
197
+
198
+ ## Notes
199
+ - Rule-based evals run first (fast, free) -- fail early before spending on LLM-as-judge
200
+ - Golden datasets should have at least 10 cases for meaningful results
201
+ - Track eval scores over time to detect gradual drift
202
+ - LLM-as-judge is non-deterministic -- run 3 times and average for reliable scores
203
+ - Store all eval results as JSON for historical comparison
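Note on `eval-testing/SKILL.md` (an illustration, not part of the package contents): the Notes section recommends running LLM-as-judge three times and averaging, and the CI step checks `--min-score 0.8`. A minimal sketch of how the two could fit together follows. The names `averaged_judge` and `normalized_score` are hypothetical, the judge is any async callable that returns the JSON fields requested by `JUDGE_PROMPT`, and the linear mapping from the 1-5 scale to 0-1 is an assumption, since the diff does not show how `evals.check_threshold` computes its score.

```python
import asyncio
from statistics import mean
from typing import Awaitable, Callable

# Score fields requested by JUDGE_PROMPT in the skill.
CRITERIA = ("clarity", "correctness", "patterns", "tests")

# Hypothetical judge signature: takes code, returns the parsed JSON scores.
Judge = Callable[[str], Awaitable[dict]]


async def averaged_judge(judge: Judge, code: str, runs: int = 3) -> dict:
    """Call the judge `runs` times concurrently and average each criterion."""
    results = await asyncio.gather(*(judge(code) for _ in range(runs)))
    return {name: mean(r[name] for r in results) for name in CRITERIA}


def normalized_score(scores: dict) -> float:
    """Map the mean 1-5 score onto 0-1 (assumed mapping, not from the package)."""
    return (mean(scores[name] for name in CRITERIA) - 1) / 4
```

Under this assumed mapping, a threshold of 0.8 corresponds to a mean judge score of 4.2 on the 1-5 scale.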
@@ -0,0 +1,52 @@
1
+ # Eval Types Reference
2
+
3
+ ## Comparison Matrix
4
+
5
+ | Type | When to Use | Speed | Cost | Deterministic |
6
+ |---|---|---|---|---|
7
+ | Rule-based | Validate format, structure, compliance | Fast | Free | Yes |
8
+ | LLM-as-judge | Evaluate quality, coherence, usefulness | Slow | $$ | No |
9
+ | Golden dataset | Compare against expected outcomes | Medium | Free | Yes |
10
+ | RAGAS | RAG metrics (faithfulness, relevancy) | Medium | $ | No |
11
+
12
+ ## Rule-Based Evals
13
+
14
+ Best for: Structure validation, compliance checks, format verification.
15
+
16
+ ```python
17
+ # Examples of rule-based checks
18
+ def check_has_tests(output: str) -> bool:
19
+ return "def test_" in output or "class Test" in output
20
+
21
+ def check_no_secrets(output: str) -> bool:
22
+ return not re.search(r'(password|secret|api_key)\s*=\s*["\'][^"\']+["\']', output)
23
+
24
+ def check_clean_arch_layers(output: str) -> bool:
25
+ return all(layer in output for layer in ["controller", "use_case", "repository"])
26
+ ```
27
+
28
+ ## LLM-as-Judge
29
+
30
+ Best for: Subjective quality assessment, code review, explanation quality.
31
+
32
+ Scoring rubric should be explicit:
33
+ - 1: Completely wrong or missing
34
+ - 2: Present but seriously flawed
35
+ - 3: Acceptable with notable issues
36
+ - 4: Good with minor issues
37
+ - 5: Excellent, production-ready
38
+
39
+ ## Golden Datasets
40
+
41
+ Best for: Regression testing, baseline comparison, prompt A/B testing.
42
+
43
+ Minimum size: 10 cases for initial validation, 50+ for production confidence.
44
+
45
+ ## RAGAS Metrics
46
+
47
+ Best for: RAG pipeline evaluation.
48
+
49
+ - **Faithfulness**: Is the answer supported by the retrieved context?
50
+ - **Answer Relevancy**: Does the answer address the question?
51
+ - **Context Precision**: Are the retrieved documents relevant?
52
+ - **Context Recall**: Did retrieval find all relevant documents?
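Note on `references/eval-types.md` (an illustration, not part of the package contents): the rule-based checks call `re.search` without showing an import. A self-contained sketch with the same three checks follows. The `RULES` table and the `run_rule_checks` helper are hypothetical additions that collect results by name.

```python
import re

# Same pattern as in the reference file: a hardcoded, quoted value assigned
# to a name containing password, secret, or api_key.
SECRET_PATTERN = re.compile(r'(password|secret|api_key)\s*=\s*["\'][^"\']+["\']')


def check_has_tests(output: str) -> bool:
    return "def test_" in output or "class Test" in output


def check_no_secrets(output: str) -> bool:
    return SECRET_PATTERN.search(output) is None


def check_clean_arch_layers(output: str) -> bool:
    return all(layer in output for layer in ("controller", "use_case", "repository"))


# Hypothetical registry mapping a check name to its function.
RULES = {
    "has_tests": check_has_tests,
    "no_secrets": check_no_secrets,
    "clean_arch_layers": check_clean_arch_layers,
}


def run_rule_checks(output: str) -> dict[str, bool]:
    """Run every rule and return pass/fail per rule name."""
    return {name: check(output) for name, check in RULES.items()}
```

These are substring and regex checks, so they are cheap but coarse: the layer check passes if the three words appear anywhere in the output, including in comments.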
@@ -0,0 +1,59 @@
1
+ # Golden Dataset Template
2
+
3
+ ## Structure
4
+
5
+ ```json
6
+ {
7
+ "metadata": {
8
+ "name": "Backend Agent Eval Suite",
9
+ "version": "1.0.0",
10
+ "created_at": "2026-03-27",
11
+ "min_pass_score": 0.8
12
+ },
13
+ "evals": [
14
+ {
15
+ "id": "eval-001",
16
+ "category": "endpoint-creation",
17
+ "difficulty": "basic",
18
+ "prompt": "Create a GET /api/v1/demands endpoint that returns a paginated list",
19
+ "expected": {
20
+ "has_controller": true,
21
+ "has_use_case": true,
22
+ "has_repository": true,
23
+ "has_pagination": true,
24
+ "follows_clean_arch": true,
25
+ "has_error_handling": true,
26
+ "has_tests": true
27
+ },
28
+ "judge_criteria": {
29
+ "clarity": ">= 4",
30
+ "correctness": ">= 4",
31
+ "patterns": ">= 3",
32
+ "tests": ">= 3"
33
+ }
34
+ },
35
+ {
36
+ "id": "eval-002",
37
+ "category": "error-handling",
38
+ "difficulty": "intermediate",
39
+ "prompt": "Add proper error handling to the demands API: 404 for not found, 422 for validation errors, 500 for server errors",
40
+ "expected": {
41
+ "has_exception_handler": true,
42
+ "has_error_response_model": true,
43
+ "handles_404": true,
44
+ "handles_422": true,
45
+ "handles_500": true
46
+ }
47
+ }
48
+ ]
49
+ }
50
+ ```
51
+
52
+ ## Guidelines for Building Golden Datasets
53
+
54
+ 1. Cover basic, intermediate, and advanced scenarios.
55
+ 2. Include both happy path and error path cases.
56
+ 3. Make expected outcomes binary (true/false) for rule-based checks.
57
+ 4. Add judge criteria for subjective quality scoring.
58
+ 5. Update the dataset as the agent's capabilities evolve.
59
+ 6. Minimum 10 cases for initial validation, 50+ for production.
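Note on `references/golden-dataset-template.md` (an illustration, not part of the package contents): the template stores judge thresholds as strings such as `">= 4"`, but the diff does not show how they are evaluated. One possible reading is sketched below. The helpers `meets_criterion` and `failed_criteria` are hypothetical, and treating a missing score as 0 is an assumption.

```python
import operator
import re

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# Matches strings like ">= 4" or "<3.5".
_CRITERION = re.compile(r"^\s*(>=|<=|==|>|<)\s*(\d+(?:\.\d+)?)\s*$")


def meets_criterion(score: float, criterion: str) -> bool:
    """Evaluate one judge score against a criterion string such as '>= 4'."""
    match = _CRITERION.match(criterion)
    if match is None:
        raise ValueError(f"Unrecognized criterion: {criterion!r}")
    op, threshold = match.groups()
    return _OPS[op](score, float(threshold))


def failed_criteria(scores: dict, judge_criteria: dict) -> list[str]:
    """Return the criteria names whose scores miss their thresholds."""
    return [
        name
        for name, criterion in judge_criteria.items()
        if not meets_criterion(scores.get(name, 0), criterion)
    ]
```

An eval case would then pass the judge stage only when `failed_criteria` returns an empty list.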
@@ -1,78 +1,112 @@
1
1
  ---
2
2
  name: memory-management
3
- description: Implementar memória de curto, médio e longo prazo para agentes usando LangGraph Store e checkpointers. Use quando precisar que agentes lembrem de interações anteriores, persistam estado, ou aprendam com execuções passadas.
3
+ description: Implement short-term, medium-term, and long-term memory for AI agents using LangGraph Store and checkpointers. Use when agents need to remember past interactions, persist state, or learn from previous executions.
4
+ version: 1.0.0
5
+ author: Maestro
4
6
  ---
5
7
 
6
- # Gerenciamento de Memória
8
+ # Memory Management
7
9
 
8
- ## 3 Níveis de Memória
10
+ Implement three tiers of agent memory -- short-term (context window), medium-term (checkpointer), and long-term (store) -- to enable persistent learning and state management.
9
11
 
10
- | Nível | Duração | Mecanismo | Exemplo |
11
- |---|---|---|---|
12
- | Curto prazo | 1 sessão | Context window | Mensagens da conversa atual |
13
- | Médio prazo | 1 demanda | Checkpointer | Estado entre nós do grafo |
14
- | Longo prazo | Permanente | Store | Padrões aprendidos, preferências |
12
+ ## When to Use
13
+ - Agent needs to resume work after interruption
14
+ - Agent should learn from past executions and avoid repeating mistakes
15
+ - Persisting state between nodes in a LangGraph workflow
16
+ - Storing and retrieving patterns learned across multiple demands
17
+ - Implementing memory decay to remove stale or low-confidence knowledge
15
18
 
16
- ## Curto Prazo — Context Window
19
+ ## Available Operations
20
+ 1. Configure short-term memory via context window
21
+ 2. Set up medium-term memory with LangGraph checkpointer
22
+ 3. Implement long-term memory with LangGraph Store
23
+ 4. Integrate memory into Deep Agent configuration
24
+ 5. Implement memory cleanup and decay policies
17
25
 
18
- Gerenciado automaticamente pelo LangGraph. Usar `add_messages` para acumular.
26
+ ## Multi-Step Workflow
27
+
28
+ ### Step 1: Set Up Short-Term Memory (Context Window)
29
+
30
+ Short-term memory is automatic -- LangGraph accumulates messages within a session.
19
31
 
20
32
  ```python
33
+ from typing import TypedDict, Annotated
34
+ from langgraph.graph.message import add_messages
35
+
21
36
  class AgentState(TypedDict):
22
- messages: Annotated[list, add_messages] # Acumula automaticamente
37
+ messages: Annotated[list, add_messages] # Accumulates automatically
23
38
  ```
24
39
 
25
- ## Médio Prazo Checkpointer
40
+ No additional setup needed. Messages persist for the duration of a single invocation chain.
26
41
 
27
- Persiste o estado do grafo entre invocações da mesma demanda.
42
+ ### Step 2: Set Up Medium-Term Memory (Checkpointer)
43
+
44
+ Persists graph state between invocations of the same demand. Enables resume after failure.
45
+
46
+ ```bash
47
+ pip install langgraph-checkpoint-postgres psycopg
48
+ ```
28
49
 
29
50
  ```python
30
51
  from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
52
+ from langgraph.graph import StateGraph
31
53
 
32
54
  checkpointer = AsyncPostgresSaver.from_conn_string(DATABASE_URL)
33
55
 
34
56
  graph = StateGraph(OrchestratorState)
35
- # ... definir nós e edges ...
57
+ # ... define nodes and edges ...
36
58
  app = graph.compile(checkpointer=checkpointer)
37
59
 
38
- # Usar thread_id consistente por demanda
60
+ # Use consistent thread_id per demand
39
61
  config = {"configurable": {"thread_id": f"demand-{demand_id}"}}
40
62
  result = await app.ainvoke({"messages": [...]}, config=config)
41
63
 
42
- # Próxima invocação com mesmo thread_id retoma do estado salvo
43
- result2 = await app.ainvoke({"messages": [nova_msg]}, config=config)
64
+ # Next invocation with same thread_id resumes from saved state
65
+ result2 = await app.ainvoke({"messages": [new_msg]}, config=config)
66
+ ```
67
+
68
+ Verify checkpointer is working:
69
+
70
+ ```bash
71
+ psql $DATABASE_URL -c "SELECT thread_id, created_at FROM checkpoints ORDER BY created_at DESC LIMIT 5;"
44
72
  ```
45
73
 
46
- ## Longo Prazo Store
74
+ ### Step 3: Set Up Long-Term Memory (Store)
47
75
 
48
- Persiste conhecimento entre demandas diferentes.
76
+ Persists knowledge across different demands. The agent remembers patterns, preferences, and learnings.
77
+
78
+ ```bash
79
+ pip install langgraph-store-postgres
80
+ ```
49
81
 
50
82
  ```python
51
83
  from langgraph.store.postgres import AsyncPostgresStore
52
84
 
53
85
  store = AsyncPostgresStore.from_conn_string(DATABASE_URL)
54
86
 
55
- # Salvar aprendizado
87
+ # Save a learned pattern
56
88
  await store.aput(
57
89
  namespace=("agent", "backend", "patterns"),
58
90
  key="spring-crud-pattern",
59
91
  value={
60
- "pattern": "Usar record para DTO, entity para domínio",
92
+ "pattern": "Use record for DTO, entity for domain",
61
93
  "learned_from": "demand-123",
62
94
  "confidence": 0.95,
63
95
  "created_at": "2026-03-27"
64
96
  }
65
97
  )
66
98
 
67
- # Buscar aprendizados relevantes
99
+ # Search for relevant learnings
68
100
  results = await store.asearch(
69
101
  namespace=("agent", "backend", "patterns"),
70
- query="como criar DTO para API REST",
102
+ query="how to create DTO for REST API",
71
103
  limit=5
72
104
  )
73
105
  ```
74
106
 
75
- ## Memória no Deep Agent
107
+ ### Step 4: Integrate Memory into Deep Agent
108
+
109
+ Wire all three memory tiers into a Deep Agent configuration.
76
110
 
77
111
  ```python
78
112
  from deepagents import create_deep_agent
@@ -85,22 +119,72 @@ agent = create_deep_agent(
85
119
  backend=FilesystemBackend(root_dir=".", virtual_mode=True),
86
120
  checkpointer=PostgresSaver(conn_string=DATABASE_URL),
87
121
  store=PostgresStore(conn_string=DATABASE_URL),
88
- system_prompt="Você é um agente backend..."
122
+ system_prompt="You are a backend agent..."
89
123
  )
90
124
  ```
91
125
 
92
- ## Limpeza de memória
126
+ ### Step 5: Implement Memory Cleanup and Decay
93
127
 
94
- Memórias envelhecem. Implementar decay:
128
+ Memories age. Remove stale or low-confidence entries to keep the store relevant.
95
129
 
96
130
  ```python
131
+ from datetime import datetime, timedelta
132
+
97
133
  async def cleanup_stale_memories(store, max_age_days: int = 90):
98
- """Remove memórias antigas ou com baixa confiança"""
134
+ """Remove old or low-confidence memories."""
99
135
  cutoff = datetime.now() - timedelta(days=max_age_days)
100
136
  memories = await store.alist(namespace=("agent",))
137
+ removed = 0
101
138
  for mem in memories:
102
139
  if mem.value.get("created_at", "") < cutoff.isoformat():
103
140
  await store.adelete(namespace=mem.namespace, key=mem.key)
141
+ removed += 1
104
142
  elif mem.value.get("confidence", 1.0) < 0.3:
105
143
  await store.adelete(namespace=mem.namespace, key=mem.key)
144
+ removed += 1
145
+ return removed
106
146
  ```
147
+
148
+ Run cleanup:
149
+
150
+ ```bash
151
+ python -m memory.cleanup --max-age-days 90 --min-confidence 0.3
152
+ ```
153
+
154
+ ## Resources
155
+ - `references/memory-tiers.md` - Detailed comparison of memory tiers with use cases
156
+ - `references/namespace-conventions.md` - Naming conventions for store namespaces
157
+
158
+ ## Examples
159
+
160
+ ### Example 1: Enable Resume After Failure
161
+ User asks: "Our agent crashes mid-task and loses all progress. Fix it."
162
+ Response approach:
163
+ 1. Add PostgresSaver checkpointer to the agent's graph compilation
164
+ 2. Use `thread_id` based on the demand ID for consistent state
165
+ 3. On restart, invoke with the same thread_id -- LangGraph automatically resumes
166
+ 4. Verify with: `psql $DATABASE_URL -c "SELECT * FROM checkpoints WHERE thread_id='demand-xyz';"`
167
+
168
+ ### Example 2: Agent Should Learn From Past Work
169
+ User asks: "The backend agent keeps making the same pagination mistake. Make it learn."
170
+ Response approach:
171
+ 1. After each successful demand, save patterns to the Store
172
+ 2. Before each new task, search the Store for relevant learnings
173
+ 3. Inject top-3 relevant learnings into the agent's context
174
+ 4. Track confidence scores -- boost on positive feedback, decay on negative
175
+
176
+ ### Example 3: Clean Up Old Memories
177
+ User asks: "The memory store has grown too large. Clean it up."
178
+ Response approach:
179
+ 1. Run `memory.cleanup --max-age-days 90` to remove entries older than 90 days
180
+ 2. Remove entries with confidence below 0.3
181
+ 3. Audit remaining entries for duplicates
182
+ 4. Set up a weekly cron job for automatic cleanup
183
+
184
+ ## Notes
185
+ - Always use `thread_id` based on a stable identifier (demand_id, session_id)
186
+ - Checkpointer handles resume automatically -- no custom logic needed
187
+ - Store namespaces should follow the convention: `("agent", agent_type, category)`
188
+ - Memory cleanup should run on a schedule (weekly recommended)
189
+ - Include `confidence` and `created_at` in all store entries for decay management
190
+ - Long-term memories should be surfaced via retrieval, not dumped into the prompt
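Note on `memory-management/SKILL.md` (an illustration, not part of the package contents): the cleanup function in Step 5 compares `created_at` to the cutoff as strings and hardcodes the 0.3 confidence threshold, while the CLI example passes `--min-confidence`. A minimal sketch of the same decision as a pure function follows, so it can be tested without a store. The name `is_stale` is hypothetical, and timestamps are assumed to be naive ISO strings, as in the skill's examples.

```python
from datetime import datetime, timedelta


def is_stale(
    value: dict,
    now: datetime,
    max_age_days: int = 90,
    min_confidence: float = 0.3,
) -> bool:
    """Decide whether a store entry should be removed."""
    created_raw = value.get("created_at")
    if created_raw is None:
        # Mirrors the skill's behavior: entries without created_at are removed.
        return True
    # Parse the timestamp instead of comparing strings lexicographically.
    created = datetime.fromisoformat(created_raw)
    if created < now - timedelta(days=max_age_days):
        return True
    # Entries without a confidence field default to 1.0, as in the skill.
    return value.get("confidence", 1.0) < min_confidence
```

The cleanup loop would call `is_stale(mem.value, datetime.now())` for each entry and delete those for which it returns `True`.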
@@ -0,0 +1,41 @@
1
+ # Memory Tiers Reference
2
+
3
+ ## Comparison
4
+
5
+ | Tier | Duration | Mechanism | Storage | Cost | Use Case |
6
+ |---|---|---|---|---|---|
7
+ | Short-term | 1 session | Context window | In-memory | Free | Current conversation messages |
8
+ | Medium-term | 1 demand | Checkpointer | PostgreSQL | Low | Graph state between invocations |
9
+ | Long-term | Permanent | Store | PostgreSQL | Low | Patterns, preferences, learnings |
10
+
11
+ ## Short-Term Memory
12
+
13
+ - **What**: Messages in the current invocation chain.
14
+ - **How**: `Annotated[list, add_messages]` in state schema.
15
+ - **Limit**: Bounded by context window size.
16
+ - **When it resets**: End of the invocation chain.
17
+
18
+ ## Medium-Term Memory
19
+
20
+ - **What**: Full graph state (all state fields) at each node execution.
21
+ - **How**: `graph.compile(checkpointer=PostgresSaver(...))`.
22
+ - **Limit**: Bounded by database storage.
23
+ - **When it resets**: When the demand is completed or explicitly cleared.
24
+ - **Key feature**: Enables resume after crash.
25
+
26
+ ## Long-Term Memory
27
+
28
+ - **What**: Extracted patterns, preferences, and learnings.
29
+ - **How**: `store.aput(namespace=..., key=..., value=...)`.
30
+ - **Limit**: Bounded by database storage and relevance decay.
31
+ - **When it resets**: Only when explicitly cleaned up.
32
+ - **Key feature**: Enables cross-demand learning.
33
+
34
+ ## Decision Guide
35
+
36
+ | Question | Answer | Use |
37
+ |---|---|---|
38
+ | Does the agent need to remember within this conversation? | Yes | Short-term |
39
+ | Does the agent need to resume after a crash? | Yes | Medium-term (checkpointer) |
40
+ | Should the agent learn from past demands? | Yes | Long-term (store) |
41
+ | Does the agent need to share knowledge with other agents? | Yes | Long-term (store with shared namespace) |
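Note on `references/memory-tiers.md` (an illustration, not part of the package contents): the decision guide can be read as a small lookup from requirements to tiers. The function below is a sketch of that reading. Its name and its keyword parameters are hypothetical.

```python
def select_memory_tiers(
    *,
    within_conversation: bool = False,
    resume_after_crash: bool = False,
    learn_from_past: bool = False,
    share_across_agents: bool = False,
) -> list[str]:
    """Translate the decision-guide questions into the tiers to configure."""
    tiers = []
    if within_conversation:
        tiers.append("short-term")
    if resume_after_crash:
        tiers.append("medium-term (checkpointer)")
    if learn_from_past or share_across_agents:
        # Sharing across agents uses the same store with a shared namespace.
        tiers.append("long-term (store)")
    return tiers
```

The tiers are not mutually exclusive: an agent that must resume after a crash and also learn from past demands would configure both a checkpointer and a store.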
@@ -0,0 +1,41 @@
1
+ # Namespace Conventions Reference
2
+
3
+ ## Store Namespace Structure
4
+
5
+ ```
6
+ ("agent", agent_type, category)
7
+ ```
8
+
9
+ ## Standard Namespaces
10
+
11
+ | Namespace | Purpose | Example Key |
12
+ |---|---|---|
13
+ | `("agent", "backend", "patterns")` | Code patterns learned by backend agent | `"fastapi-crud-pattern"` |
14
+ | `("agent", "frontend", "patterns")` | UI patterns learned by frontend agent | `"react-form-pattern"` |
15
+ | `("agent", "backend", "errors")` | Common errors and their fixes | `"alembic-migration-conflict"` |
16
+ | `("agent", "backend", "preferences")` | Team preferences for code style | `"prefer-dataclass-over-dict"` |
17
+ | `("project", "decisions")` | Architectural decisions | `"chose-fastapi-over-flask"` |
18
+ | `("project", "standards")` | Project-wide coding standards | `"naming-conventions"` |
19
+
20
+ ## Value Schema
21
+
22
+ Every store entry should include these fields:
23
+
24
+ ```python
25
+ {
26
+ "pattern": str, # The actual knowledge
27
+ "learned_from": str, # Which demand/task this came from
28
+ "confidence": float, # 0.0 to 1.0 (boost on positive feedback, decay on negative)
29
+ "created_at": str, # ISO timestamp
30
+ "updated_at": str, # ISO timestamp (updated on reinforcement)
31
+ "usage_count": int, # How many times this memory was retrieved
32
+ }
33
+ ```
34
+
35
+ ## Confidence Management
36
+
37
+ - **Initial**: 0.7 (new learning, not yet validated)
38
+ - **Reinforced**: +0.1 per positive use (max 1.0)
39
+ - **Contradicted**: -0.2 per negative feedback (min 0.0)
40
+ - **Cleanup threshold**: < 0.3 (remove on next cleanup run)
41
+ - **Time decay**: -0.05 per month without usage
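Note on `references/namespace-conventions.md` (an illustration, not part of the package contents): the confidence rules above are simple arithmetic with clamping. A minimal sketch follows. The function names are hypothetical, and rounding to two decimals is an added assumption to keep floating-point results such as 0.7 + 0.1 readable.

```python
INITIAL_CONFIDENCE = 0.7   # new learning, not yet validated
CLEANUP_THRESHOLD = 0.3    # entries below this are removed on cleanup


def _clamp(value: float) -> float:
    """Keep confidence within [0.0, 1.0], rounded to two decimals."""
    return round(min(1.0, max(0.0, value)), 2)


def reinforce(confidence: float) -> float:
    """+0.1 per positive use, capped at 1.0."""
    return _clamp(confidence + 0.1)


def contradict(confidence: float) -> float:
    """-0.2 per negative feedback, floored at 0.0."""
    return _clamp(confidence - 0.2)


def decay(confidence: float, idle_months: int) -> float:
    """-0.05 per month without usage."""
    return _clamp(confidence - 0.05 * idle_months)


def should_remove(confidence: float) -> bool:
    """True when the entry falls below the cleanup threshold."""
    return confidence < CLEANUP_THRESHOLD
```

With these rules, a new entry that starts at 0.7 and is never retrieved sits exactly at the threshold after eight idle months and becomes eligible for removal after nine.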