maestro-bundle 1.3.1 → 1.4.0

Files changed (116)

package/templates/bundle-ai-agents/skills/eval-testing/SKILL.md CHANGED

@@ -1,24 +1,38 @@
 ---
 name: eval-testing
-description: Criar framework de avaliação para agentes de AI com LLM-as-judge, rule-based evals e golden datasets. Use quando precisar testar agentes, avaliar qualidade de RAG, ou criar benchmarks de compliance.
+description: Build evaluation frameworks for AI agents with LLM-as-judge, rule-based evals, and golden datasets. Use when testing agents, evaluating RAG quality, or creating compliance benchmarks.
+version: 1.0.0
+author: Maestro
 ---
-# Avaliação de Agentes
+# Eval Testing
-## Tipos de eval
+Build comprehensive evaluation pipelines for AI agents using rule-based checks, LLM-as-judge scoring, golden datasets, and CI/CD integration.
-| Tipo | Quando usar | Velocidade |
-|---|---|---|
-| Rule-based | Validar formato, estrutura, compliance | Rápido |
-| LLM-as-judge | Avaliar qualidade, coerência, utilidade | Lento |
-| Golden dataset | Comparar com respostas esperadas | Médio |
-| RAGAS | Métricas de RAG (faithfulness, relevancy) | Médio |
+## When to Use
+- Testing whether an agent produces correct, compliant output
+- Evaluating RAG retrieval quality with RAGAS metrics
+- Building a golden dataset for regression testing
+- Setting up automated eval pipelines in CI/CD
+- Comparing agent performance across prompt versions
-## 1. Rule-based Evals
+## Available Operations
+1. Create rule-based evaluators for compliance checks
+2. Build LLM-as-judge scoring prompts
+3. Design golden datasets with expected outcomes
+4. Run evaluation benchmarks
+5. Integrate evals into CI/CD pipelines
+6. Analyze and compare eval results
+## Multi-Step Workflow
+### Step 1: Create Rule-Based Evaluators
+Start with deterministic checks -- they are fast, free, and reliable.
 ```python
 class ComplianceEvaluator:
-    """Verifica se o código segue o bundle"""
+    """Verify code follows bundle standards."""
     def evaluate(self, code: str, bundle: Bundle) -> EvalResult:
         checks = []
@@ -36,25 +50,33 @@ class ComplianceEvaluator:
         return Check(
             name="max_lines",
             passed=lines <= max,
-            detail=f"{lines}/{max} linhas"
+            detail=f"{lines}/{max} lines"
         )
 ```
-## 2. LLM-as-Judge
+Run rule-based checks:
+```bash
+python -m evals.rules --code-dir src/ --bundle bundles/backend.json
+```
+### Step 2: Build LLM-as-Judge Evaluator
+For subjective quality assessments, use an LLM to score agent output.
 ```python
 JUDGE_PROMPT = """
-Avalie o código abaixo nos critérios:
-1. Clareza (1-5): O código é fácil de entender?
-2. Correção (1-5): O código faz o que deveria?
-3. Padrões (1-5): Segue Clean Architecture e DDD?
-4. Testes (1-5): Os testes são adequados?
+Evaluate the code below on these criteria:
+1. Clarity (1-5): Is the code easy to understand?
+2. Correctness (1-5): Does the code do what it should?
+3. Patterns (1-5): Does it follow Clean Architecture and DDD?
+4. Tests (1-5): Are tests adequate?
-Código:
+Code:
 {code}
-Responda em JSON:
-{{"clareza": X, "correcao": X, "padroes": X, "testes": X, "justificativa": "..."}}
+Respond in JSON:
+{{"clarity": X, "correctness": X, "patterns": X, "tests": X, "justification": "..."}}
 """
 async def llm_judge(code: str) -> dict:
@@ -62,14 +84,16 @@ async def llm_judge(code: str) -> dict:
     return json.loads(response.content)
 ```
-## 3. Golden Dataset
+### Step 3: Build a Golden Dataset
+Create a set of input/expected-output pairs for regression testing.
 ```json
 {
   "evals": [
     {
       "id": "eval-001",
-      "prompt": "Crie um endpoint GET /api/v1/demands que retorna lista paginada",
+      "prompt": "Create a GET /api/v1/demands endpoint with pagination",
       "expected": {
         "has_controller": true,
         "has_use_case": true,
@@ -82,7 +106,11 @@ async def llm_judge(code: str) -> dict:
 }
 ```
-## 4. Runner de avaliação
+Store golden datasets in `evals/golden_dataset.json`.
+### Step 4: Build the Eval Runner
+Combine rule-based and LLM-as-judge evaluations into a single runner.
 ```python
 async def run_evals(agent, eval_set: list[dict]) -> BenchmarkResult:
@@ -103,7 +131,15 @@ async def run_evals(agent, eval_set: list[dict]) -> BenchmarkResult:
     return BenchmarkResult(results=results, aggregate=aggregate(results))
 ```
-## No CI/CD
+Run the eval suite:
+```bash
+python -m evals.run_evals --dataset evals/golden_dataset.json --output results/eval_run_$(date +%Y%m%d).json
+```
+### Step 5: Integrate into CI/CD
+Add eval checks to your pipeline so regressions are caught automatically.
 ```yaml
 eval:
@@ -113,3 +149,55 @@ eval:
     - python -m evals.check_threshold --min-score 0.8
   allow_failure: false
 ```
+Run locally before pushing:
+```bash
+python -m evals.run_evals --dataset evals/golden_dataset.json && python -m evals.check_threshold --min-score 0.8
+```
+### Step 6: Compare Results Across Versions
+```bash
+python -m evals.compare --baseline results/eval_v1.json --current results/eval_v2.json
+```
+## Resources
+- `references/eval-types.md` - Detailed comparison of evaluation types and when to use each
+- `references/golden-dataset-template.md` - Template for building golden datasets
+## Examples
+### Example 1: Evaluate a Backend Agent
+User asks: "Set up evals for our backend agent that generates FastAPI endpoints."
+Response approach:
+1. Create rule-based checks: has controller, has use case, has repository, has tests
+2. Create LLM-as-judge prompt that scores code clarity, correctness, and patterns
+3. Build a golden dataset with 10 endpoint-generation scenarios
+4. Run the eval suite and establish a baseline score
+5. Add to CI/CD with a minimum score threshold of 0.8
+### Example 2: Evaluate RAG Quality
+User asks: "Our RAG agent gives wrong answers sometimes. Set up evals to catch this."
+Response approach:
+1. Build a golden dataset of 20 question/expected-answer pairs
+2. Create rule-based checks for source citation and answer format
+3. Create LLM-as-judge that scores faithfulness (is the answer grounded in context?)
+4. Run evals and identify patterns in failures
+5. Use results to guide chunking and retrieval improvements
+### Example 3: A/B Test Prompt Changes
+User asks: "I changed the system prompt. How do I know if it's better?"
+Response approach:
+1. Run the golden dataset against the old prompt (baseline)
+2. Run the same dataset against the new prompt
+3. Compare scores with `evals.compare` to see deltas
+4. Check for regressions in specific eval cases
+5. Only deploy if the new prompt scores >= baseline on all categories
+## Notes
+- Rule-based evals run first (fast, free) -- fail early before spending on LLM-as-judge
+- Golden datasets should have at least 10 cases for meaningful results
+- Track eval scores over time to detect gradual drift
+- LLM-as-judge is non-deterministic -- run 3 times and average for reliable scores
+- Store all eval results as JSON for historical comparison
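
The notes in the updated eval-testing skill recommend averaging three LLM-as-judge runs, and Example 3 only deploys a new prompt when it scores at least the baseline in every category. The diff does not show how `evals.compare` works internally, so the following is only a minimal sketch of that logic. The helper names `average_judge_runs` and `find_regressions` are hypothetical, and the score keys are assumed to match the JSON shape requested in `JUDGE_PROMPT`.

```python
from statistics import mean

# Score keys taken from the JSON shape requested in JUDGE_PROMPT.
CRITERIA = ("clarity", "correctness", "patterns", "tests")


def average_judge_runs(runs: list[dict]) -> dict[str, float]:
    """Average each criterion across repeated judge runs (hypothetical helper)."""
    if not runs:
        raise ValueError("need at least one judge run")
    return {c: mean(run[c] for run in runs) for c in CRITERIA}


def find_regressions(baseline: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Return the negative delta for every criterion where current < baseline."""
    return {
        c: current[c] - baseline[c]
        for c in CRITERIA
        if current[c] < baseline[c]
    }


if __name__ == "__main__":
    old_prompt = average_judge_runs([
        {"clarity": 4, "correctness": 4, "patterns": 3, "tests": 3},
        {"clarity": 4, "correctness": 5, "patterns": 3, "tests": 3},
        {"clarity": 4, "correctness": 3, "patterns": 3, "tests": 3},
    ])
    new_prompt = average_judge_runs([
        {"clarity": 5, "correctness": 4, "patterns": 2, "tests": 3},
        {"clarity": 5, "correctness": 4, "patterns": 2, "tests": 3},
        {"clarity": 5, "correctness": 4, "patterns": 2, "tests": 3},
    ])
    # An empty dict would mean the new prompt passes the Example 3 rule.
    print("regressions:", find_regressions(old_prompt, new_prompt))
```

Reporting per-criterion deltas rather than a single aggregate keeps a gain in one category from hiding a loss in another.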

package/templates/bundle-ai-agents/skills/eval-testing/references/eval-types.md ADDED

@@ -0,0 +1,52 @@
+# Eval Types Reference
+## Comparison Matrix
+| Type | When to Use | Speed | Cost | Deterministic |
+|---|---|---|---|---|
+| Rule-based | Validate format, structure, compliance | Fast | Free | Yes |
+| LLM-as-judge | Evaluate quality, coherence, usefulness | Slow | $$ | No |
+| Golden dataset | Compare against expected outcomes | Medium | Free | Yes |
+| RAGAS | RAG metrics (faithfulness, relevancy) | Medium | $ | No |
+## Rule-Based Evals
+Best for: Structure validation, compliance checks, format verification.
+```python
+# Examples of rule-based checks
+def check_has_tests(output: str) -> bool:
+    return "def test_" in output or "class Test" in output
+def check_no_secrets(output: str) -> bool:
+    return not re.search(r'(password|secret|api_key)\s*=\s*["\'][^"\']+["\']', output)
+def check_clean_arch_layers(output: str) -> bool:
+    return all(layer in output for layer in ["controller", "use_case", "repository"])
+```
+## LLM-as-Judge
+Best for: Subjective quality assessment, code review, explanation quality.
+Scoring rubric should be explicit:
+- 1: Completely wrong or missing
+- 2: Present but seriously flawed
+- 3: Acceptable with notable issues
+- 4: Good with minor issues
+- 5: Excellent, production-ready
+## Golden Datasets
+Best for: Regression testing, baseline comparison, prompt A/B testing.
+Minimum size: 10 cases for initial validation, 50+ for production confidence.
+## RAGAS Metrics
+Best for: RAG pipeline evaluation.
+- **Faithfulness**: Is the answer supported by the retrieved context?
+- **Answer Relevancy**: Does the answer address the question?
+- **Context Precision**: Are the retrieved documents relevant?
+- **Context Recall**: Did retrieval find all relevant documents?
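
The rule-based checks in `eval-types.md` above call `re.search` without importing `re`, and the file does not show how the individual checks are combined. As a rough sketch, the three checks can be registered under names and run together. The `RULE_CHECKS` registry and `run_rule_checks` function are assumptions made for illustration and are not part of the package.

```python
import re
from typing import Callable


def check_has_tests(output: str) -> bool:
    """True when the output contains a pytest-style function or test class."""
    return "def test_" in output or "class Test" in output


def check_no_secrets(output: str) -> bool:
    """True when no hard-coded password/secret/api_key assignment is found."""
    pattern = r'(password|secret|api_key)\s*=\s*["\'][^"\']+["\']'
    return re.search(pattern, output) is None


def check_clean_arch_layers(output: str) -> bool:
    """True when all three layer names appear somewhere in the output."""
    return all(layer in output for layer in ["controller", "use_case", "repository"])


# Hypothetical registry mapping a check name to its function.
RULE_CHECKS: dict[str, Callable[[str], bool]] = {
    "has_tests": check_has_tests,
    "no_secrets": check_no_secrets,
    "clean_arch_layers": check_clean_arch_layers,
}


def run_rule_checks(output: str) -> dict[str, bool]:
    """Run every registered check and report pass/fail per check name."""
    return {name: check(output) for name, check in RULE_CHECKS.items()}


if __name__ == "__main__":
    sample = 'controller -> use_case -> repository\ndef test_list(): ...\napi_key = "sk-123"\n'
    print(run_rule_checks(sample))
```

These are substring and regex heuristics, so they will miss secrets assigned with other syntax (for example `api_key: "..."` in YAML) and will count a layer name that only appears in a comment.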

package/templates/bundle-ai-agents/skills/eval-testing/references/golden-dataset-template.md ADDED

@@ -0,0 +1,59 @@
+# Golden Dataset Template
+## Structure
+```json
+{
+  "metadata": {
+    "name": "Backend Agent Eval Suite",
+    "version": "1.0.0",
+    "created_at": "2026-03-27",
+    "min_pass_score": 0.8
+  },
+  "evals": [
+    {
+      "id": "eval-001",
+      "category": "endpoint-creation",
+      "difficulty": "basic",
+      "prompt": "Create a GET /api/v1/demands endpoint that returns a paginated list",
+      "expected": {
+        "has_controller": true,
+        "has_use_case": true,
+        "has_repository": true,
+        "has_pagination": true,
+        "follows_clean_arch": true,
+        "has_error_handling": true,
+        "has_tests": true
+      },
+      "judge_criteria": {
+        "clarity": ">= 4",
+        "correctness": ">= 4",
+        "patterns": ">= 3",
+        "tests": ">= 3"
+      }
+    },
+    {
+      "id": "eval-002",
+      "category": "error-handling",
+      "difficulty": "intermediate",
+      "prompt": "Add proper error handling to the demands API: 404 for not found, 422 for validation errors, 500 for server errors",
+      "expected": {
+        "has_exception_handler": true,
+        "has_error_response_model": true,
+        "handles_404": true,
+        "handles_422": true,
+        "handles_500": true
+      }
+    }
+  ]
+}
+```
+## Guidelines for Building Golden Datasets
+1. Cover basic, intermediate, and advanced scenarios.
+2. Include both happy path and error path cases.
+3. Make expected outcomes binary (true/false) for rule-based checks.
+4. Add judge criteria for subjective quality scoring.
+5. Update the dataset as the agent's capabilities evolve.
+6. Minimum 10 cases for initial validation, 50+ for production.
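
The template above stores `judge_criteria` as strings such as `">= 4"` and a suite-level `min_pass_score`, but it does not show how a runner would interpret them. One possible reading is sketched below. The functions `meets_criterion`, `case_passes`, and `suite_passes` are hypothetical, and the criterion format is assumed to be `"<operator> <number>"` as in the two sample cases.

```python
import operator

# Comparison operators the criterion strings are assumed to use.
_OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def meets_criterion(score: float, criterion: str) -> bool:
    """Evaluate a criterion string such as '>= 4' against a judge score."""
    op, threshold = criterion.split()
    return _OPS[op](score, float(threshold))


def case_passes(case: dict, observed: dict, judge_scores: dict) -> bool:
    """A case passes when every expected flag matches and every judge criterion holds."""
    expected_ok = all(observed.get(key) == want for key, want in case["expected"].items())
    judge_ok = all(
        meets_criterion(judge_scores[name], criterion)
        for name, criterion in case.get("judge_criteria", {}).items()
    )
    return expected_ok and judge_ok


def suite_passes(case_results: list[bool], min_pass_score: float) -> bool:
    """Compare the fraction of passing cases with metadata.min_pass_score."""
    return sum(case_results) / len(case_results) >= min_pass_score


if __name__ == "__main__":
    case = {
        "expected": {"has_controller": True, "has_tests": True},
        "judge_criteria": {"clarity": ">= 4", "patterns": ">= 3"},
    }
    observed = {"has_controller": True, "has_tests": True}
    print(case_passes(case, observed, {"clarity": 4, "patterns": 3}))
```

Cases without `judge_criteria`, like `eval-002`, are decided by the boolean `expected` flags alone in this sketch.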

package/templates/bundle-ai-agents/skills/memory-management/SKILL.md CHANGED

@@ -1,78 +1,112 @@
 ---
 name: memory-management
-description: Implementar memória de curto, médio e longo prazo para agentes usando LangGraph Store e checkpointers. Use quando precisar que agentes lembrem de interações anteriores, persistam estado, ou aprendam com execuções passadas.
+description: Implement short-term, medium-term, and long-term memory for AI agents using LangGraph Store and checkpointers. Use when agents need to remember past interactions, persist state, or learn from previous executions.
+version: 1.0.0
+author: Maestro
 ---
-# Gerenciamento de Memória
+# Memory Management
-## 3 Níveis de Memória
+Implement three tiers of agent memory -- short-term (context window), medium-term (checkpointer), and long-term (store) -- to enable persistent learning and state management.
-| Nível | Duração | Mecanismo | Exemplo |
-|---|---|---|---|
-| Curto prazo | 1 sessão | Context window | Mensagens da conversa atual |
-| Médio prazo | 1 demanda | Checkpointer | Estado entre nós do grafo |
-| Longo prazo | Permanente | Store | Padrões aprendidos, preferências |
+## When to Use
+- Agent needs to resume work after interruption
+- Agent should learn from past executions and avoid repeating mistakes
+- Persisting state between nodes in a LangGraph workflow
+- Storing and retrieving patterns learned across multiple demands
+- Implementing memory decay to remove stale or low-confidence knowledge
-## Curto Prazo — Context Window
+## Available Operations
+1. Configure short-term memory via context window
+2. Set up medium-term memory with LangGraph checkpointer
+3. Implement long-term memory with LangGraph Store
+4. Integrate memory into Deep Agent configuration
+5. Implement memory cleanup and decay policies
-Gerenciado automaticamente pelo LangGraph. Usar `add_messages` para acumular.
+## Multi-Step Workflow
+### Step 1: Set Up Short-Term Memory (Context Window)
+Short-term memory is automatic -- LangGraph accumulates messages within a session.
 ```python
+from typing import TypedDict, Annotated
+from langgraph.graph.message import add_messages
 class AgentState(TypedDict):
-    messages: Annotated[list, add_messages]  # Acumula automaticamente
+    messages: Annotated[list, add_messages]  # Accumulates automatically
 ```
-## Médio Prazo — Checkpointer
+No additional setup needed. Messages persist for the duration of a single invocation chain.
-Persiste o estado do grafo entre invocações da mesma demanda.
+### Step 2: Set Up Medium-Term Memory (Checkpointer)
+Persists graph state between invocations of the same demand. Enables resume after failure.
+```bash
+pip install langgraph-checkpoint-postgres psycopg
+```
 ```python
 from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
+from langgraph.graph import StateGraph
 checkpointer = AsyncPostgresSaver.from_conn_string(DATABASE_URL)
 graph = StateGraph(OrchestratorState)
-# ... definir nós e edges ...
+# ... define nodes and edges ...
 app = graph.compile(checkpointer=checkpointer)
-# Usar thread_id consistente por demanda
+# Use consistent thread_id per demand
 config = {"configurable": {"thread_id": f"demand-{demand_id}"}}
 result = await app.ainvoke({"messages": [...]}, config=config)
-# Próxima invocação com mesmo thread_id retoma do estado salvo
-result2 = await app.ainvoke({"messages": [nova_msg]}, config=config)
+# Next invocation with same thread_id resumes from saved state
+result2 = await app.ainvoke({"messages": [new_msg]}, config=config)
+```
+Verify checkpointer is working:
+```bash
+psql $DATABASE_URL -c "SELECT thread_id, created_at FROM checkpoints ORDER BY created_at DESC LIMIT 5;"
 ```
-## Longo Prazo — Store
+### Step 3: Set Up Long-Term Memory (Store)
-Persiste conhecimento entre demandas diferentes.
+Persists knowledge across different demands. The agent remembers patterns, preferences, and learnings.
+```bash
+pip install langgraph-store-postgres
+```
 ```python
 from langgraph.store.postgres import AsyncPostgresStore
 store = AsyncPostgresStore.from_conn_string(DATABASE_URL)
-# Salvar aprendizado
+# Save a learned pattern
 await store.aput(
     namespace=("agent", "backend", "patterns"),
     key="spring-crud-pattern",
     value={
-        "pattern": "Usar record para DTO, entity para domínio",
+        "pattern": "Use record for DTO, entity for domain",
         "learned_from": "demand-123",
         "confidence": 0.95,
         "created_at": "2026-03-27"
     }
 )
-# Buscar aprendizados relevantes
+# Search for relevant learnings
 results = await store.asearch(
     namespace=("agent", "backend", "patterns"),
-    query="como criar DTO para API REST",
+    query="how to create DTO for REST API",
     limit=5
 )
 ```
-## Memória no Deep Agent
+### Step 4: Integrate Memory into Deep Agent
+Wire all three memory tiers into a Deep Agent configuration.
 ```python
 from deepagents import create_deep_agent
@@ -85,22 +119,72 @@ agent = create_deep_agent(
     backend=FilesystemBackend(root_dir=".", virtual_mode=True),
     checkpointer=PostgresSaver(conn_string=DATABASE_URL),
     store=PostgresStore(conn_string=DATABASE_URL),
-    system_prompt="Você é um agente backend..."
+    system_prompt="You are a backend agent..."
 )
 ```
-## Limpeza de memória
+### Step 5: Implement Memory Cleanup and Decay
-Memórias envelhecem. Implementar decay:
+Memories age. Remove stale or low-confidence entries to keep the store relevant.
 ```python
+from datetime import datetime, timedelta
 async def cleanup_stale_memories(store, max_age_days: int = 90):
-    """Remove memórias antigas ou com baixa confiança"""
+    """Remove old or low-confidence memories."""
     cutoff = datetime.now() - timedelta(days=max_age_days)
     memories = await store.alist(namespace=("agent",))
+    removed = 0
     for mem in memories:
         if mem.value.get("created_at", "") < cutoff.isoformat():
             await store.adelete(namespace=mem.namespace, key=mem.key)
+            removed += 1
         elif mem.value.get("confidence", 1.0) < 0.3:
             await store.adelete(namespace=mem.namespace, key=mem.key)
+            removed += 1
+    return removed
 ```
+Run cleanup:
+```bash
+python -m memory.cleanup --max-age-days 90 --min-confidence 0.3
+```
+## Resources
+- `references/memory-tiers.md` - Detailed comparison of memory tiers with use cases
+- `references/namespace-conventions.md` - Naming conventions for store namespaces
+## Examples
+### Example 1: Enable Resume After Failure
+User asks: "Our agent crashes mid-task and loses all progress. Fix it."
+Response approach:
+1. Add PostgresSaver checkpointer to the agent's graph compilation
+2. Use `thread_id` based on the demand ID for consistent state
+3. On restart, invoke with the same thread_id -- LangGraph automatically resumes
+4. Verify with: `psql $DATABASE_URL -c "SELECT * FROM checkpoints WHERE thread_id='demand-xyz';"`
+### Example 2: Agent Should Learn From Past Work
+User asks: "The backend agent keeps making the same pagination mistake. Make it learn."
+Response approach:
+1. After each successful demand, save patterns to the Store
+2. Before each new task, search the Store for relevant learnings
+3. Inject top-3 relevant learnings into the agent's context
+4. Track confidence scores -- boost on positive feedback, decay on negative
+### Example 3: Clean Up Old Memories
+User asks: "The memory store has grown too large. Clean it up."
+Response approach:
+1. Run `memory.cleanup --max-age-days 90` to remove entries older than 90 days
+2. Remove entries with confidence below 0.3
+3. Audit remaining entries for duplicates
+4. Set up a weekly cron job for automatic cleanup
+## Notes
+- Always use `thread_id` based on a stable identifier (demand_id, session_id)
+- Checkpointer handles resume automatically -- no custom logic needed
+- Store namespaces should follow the convention: `("agent", agent_type, category)`
+- Memory cleanup should run on a schedule (weekly recommended)
+- Include `confidence` and `created_at` in all store entries for decay management
+- Long-term memories should be surfaced via retrieval, not dumped into the prompt
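
The `cleanup_stale_memories` function in Step 5 compares the `created_at` string with `cutoff.isoformat()`, so an entry with no `created_at` (defaulting to `""`) sorts before every cutoff and is deleted. The sketch below isolates the removal decision into a pure function that parses the timestamp instead. The name `should_remove` is hypothetical. Keeping entries that lack a timestamp is an assumption that differs from the shipped code, and timestamps are assumed to be timezone-naive ISO strings like the `"2026-03-27"` used in Step 3.

```python
from datetime import datetime, timedelta


def should_remove(
    value: dict,
    now: datetime,
    max_age_days: int = 90,
    min_confidence: float = 0.3,
) -> bool:
    """Decide whether a store entry is stale or low-confidence (hypothetical helper)."""
    # Low-confidence entries are removed regardless of age.
    if value.get("confidence", 1.0) < min_confidence:
        return True

    created_raw = value.get("created_at")
    if not created_raw:
        # Assumption: keep entries without a timestamp instead of deleting them.
        return False

    # Accepts date-only ("2026-03-27") and full ISO datetime strings.
    created = datetime.fromisoformat(created_raw)
    return created < now - timedelta(days=max_age_days)


if __name__ == "__main__":
    entry = {"pattern": "Use record for DTO", "confidence": 0.95, "created_at": "2026-03-27"}
    print(should_remove(entry, now=datetime(2026, 6, 30)))
```

Passing `now` in explicitly keeps the decision deterministic and easy to unit test. A cleanup loop would call this on `mem.value` before `store.adelete`.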

package/templates/bundle-ai-agents/skills/memory-management/references/memory-tiers.md ADDED

@@ -0,0 +1,41 @@
+# Memory Tiers Reference
+## Comparison
+| Tier | Duration | Mechanism | Storage | Cost | Use Case |
+|---|---|---|---|---|---|
+| Short-term | 1 session | Context window | In-memory | Free | Current conversation messages |
+| Medium-term | 1 demand | Checkpointer | PostgreSQL | Low | Graph state between invocations |
+| Long-term | Permanent | Store | PostgreSQL | Low | Patterns, preferences, learnings |
+## Short-Term Memory
+- **What**: Messages in the current invocation chain.
+- **How**: `Annotated[list, add_messages]` in state schema.
+- **Limit**: Bounded by context window size.
+- **When it resets**: End of the invocation chain.
+## Medium-Term Memory
+- **What**: Full graph state (all state fields) at each node execution.
+- **How**: `graph.compile(checkpointer=PostgresSaver(...))`.
+- **Limit**: Bounded by database storage.
+- **When it resets**: When the demand is completed or explicitly cleared.
+- **Key feature**: Enables resume after crash.
+## Long-Term Memory
+- **What**: Extracted patterns, preferences, and learnings.
+- **How**: `store.aput(namespace=..., key=..., value=...)`.
+- **Limit**: Bounded by database storage and relevance decay.
+- **When it resets**: Only when explicitly cleaned up.
+- **Key feature**: Enables cross-demand learning.
+## Decision Guide
+| Question | Answer | Use |
+|---|---|---|
+| Does the agent need to remember within this conversation? | Yes | Short-term |
+| Does the agent need to resume after a crash? | Yes | Medium-term (checkpointer) |
+| Should the agent learn from past demands? | Yes | Long-term (store) |
+| Does the agent need to share knowledge with other agents? | Yes | Long-term (store with shared namespace) |
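
The decision guide at the end of `memory-tiers.md` maps four yes/no questions to tiers. Expressed as code, it might look like the sketch below. `MemoryTier` and `choose_tiers` are illustrative names only and do not come from the package or from LangGraph.

```python
from enum import Enum


class MemoryTier(Enum):
    """The three tiers from the comparison table, labelled by mechanism."""
    SHORT_TERM = "context window"
    MEDIUM_TERM = "checkpointer"
    LONG_TERM = "store"


def choose_tiers(
    *,
    within_conversation: bool = False,
    resume_after_crash: bool = False,
    learn_from_past: bool = False,
    share_across_agents: bool = False,
) -> set[MemoryTier]:
    """Translate the decision-guide answers into the set of tiers to configure."""
    tiers: set[MemoryTier] = set()
    if within_conversation:
        tiers.add(MemoryTier.SHORT_TERM)
    if resume_after_crash:
        tiers.add(MemoryTier.MEDIUM_TERM)
    # Cross-demand learning and cross-agent sharing both use the store.
    if learn_from_past or share_across_agents:
        tiers.add(MemoryTier.LONG_TERM)
    return tiers


if __name__ == "__main__":
    needed = choose_tiers(within_conversation=True, resume_after_crash=True)
    print(sorted(tier.name for tier in needed))
```

The tiers are not mutually exclusive, which is why the function returns a set: an agent that must both resume after a crash and learn across demands needs the checkpointer and the store.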

package/templates/bundle-ai-agents/skills/memory-management/references/namespace-conventions.md ADDED

@@ -0,0 +1,41 @@
+# Namespace Conventions Reference
+## Store Namespace Structure
+```
+("agent", agent_type, category)
+```
+## Standard Namespaces
+| Namespace | Purpose | Example Key |
+|---|---|---|
+| `("agent", "backend", "patterns")` | Code patterns learned by backend agent | `"fastapi-crud-pattern"` |
+| `("agent", "frontend", "patterns")` | UI patterns learned by frontend agent | `"react-form-pattern"` |
+| `("agent", "backend", "errors")` | Common errors and their fixes | `"alembic-migration-conflict"` |
+| `("agent", "backend", "preferences")` | Team preferences for code style | `"prefer-dataclass-over-dict"` |
+| `("project", "decisions")` | Architectural decisions | `"chose-fastapi-over-flask"` |
+| `("project", "standards")` | Project-wide coding standards | `"naming-conventions"` |
+## Value Schema
+Every store entry should include these fields:
+```python
+{
+    "pattern": str,        # The actual knowledge
+    "learned_from": str,   # Which demand/task this came from
+    "confidence": float,   # 0.0 to 1.0 (boost on positive feedback, decay on negative)
+    "created_at": str,     # ISO timestamp
+    "updated_at": str,     # ISO timestamp (updated on reinforcement)
+    "usage_count": int,    # How many times this memory was retrieved
+}
+```
+## Confidence Management
+- **Initial**: 0.7 (new learning, not yet validated)
+- **Reinforced**: +0.1 per positive use (max 1.0)
+- **Contradicted**: -0.2 per negative feedback (min 0.0)
+- **Cleanup threshold**: < 0.3 (remove on next cleanup run)
+- **Time decay**: -0.05 per month without usage
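
The confidence rules listed above are plain arithmetic with clamping, so they can be written down directly. The following is a sketch under assumptions: the function names are hypothetical, the constants are copied from the list, and results are rounded to two decimals to avoid floating-point noise such as `0.7999999999999999`.

```python
INITIAL_CONFIDENCE = 0.7   # new learning, not yet validated
REINFORCE_STEP = 0.1       # per positive use, capped at 1.0
CONTRADICT_STEP = 0.2      # per negative feedback, floored at 0.0
MONTHLY_DECAY = 0.05       # per month without usage
CLEANUP_THRESHOLD = 0.3    # entries below this are removed on cleanup


def _clamp(confidence: float) -> float:
    """Keep confidence in [0.0, 1.0] and round away float noise."""
    return round(min(1.0, max(0.0, confidence)), 2)


def reinforce(confidence: float) -> float:
    """Apply one positive use."""
    return _clamp(confidence + REINFORCE_STEP)


def contradict(confidence: float) -> float:
    """Apply one piece of negative feedback."""
    return _clamp(confidence - CONTRADICT_STEP)


def decay(confidence: float, idle_months: int) -> float:
    """Apply time decay for a number of whole months without usage."""
    return _clamp(confidence - MONTHLY_DECAY * idle_months)


def below_cleanup_threshold(confidence: float) -> bool:
    """True when the next cleanup run should remove the entry."""
    return confidence < CLEANUP_THRESHOLD


if __name__ == "__main__":
    c = INITIAL_CONFIDENCE
    c = contradict(contradict(c))  # two negative signals in a row
    print(c, below_cleanup_threshold(c))
```

With these constants, a fresh memory at 0.7 falls below the 0.3 threshold after three contradictions, or after nine idle months with no reinforcement.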