npm - maestro-bundle - Versions diffs - 1.3.1 → 1.4.0 - Mend

maestro-bundle 1.3.1 → 1.4.0


Files changed (116)

package/templates/bundle-ai-agents/skills/prompt-engineering/SKILL.md CHANGED

@@ -1,66 +1,158 @@
 ---
 name: prompt-engineering
-description: Criar e otimizar system prompts para agentes seguindo melhores práticas de context engineering. Use quando precisar escrever prompts, melhorar prompts existentes, ou criar instruções para agentes.
+description: Create and optimize system prompts for AI agents following context engineering best practices. Use when writing prompts, improving existing prompts, or creating agent instructions.
+version: 1.0.0
+author: Maestro
 ---
-# Prompt Engineering para Agentes
+# Prompt Engineering
-## Estrutura de System Prompt
+Craft effective system prompts for AI agents using structured templates, best practices, and iterative refinement.
+## When to Use
+- Writing a new system prompt for an agent
+- Improving an underperforming agent's instructions
+- Creating role-specific prompts for multi-agent systems
+- Reviewing prompts for anti-patterns and clarity issues
+- Optimizing prompts to reduce token usage without losing quality
+## Available Operations
+1. Write a structured system prompt from scratch
+2. Audit an existing prompt for anti-patterns
+3. Refine a prompt based on agent evaluation results
+4. Create few-shot examples for a prompt
+5. Optimize prompt token count
+## Multi-Step Workflow
+### Step 1: Define the Prompt Structure
+Every agent system prompt should follow this 6-part structure:
 ```
-1. IDENTIDADE — Quem o agente é
-2. OBJETIVO — O que ele deve alcançar
-3. FERRAMENTAS — O que tem disponível
-4. REGRAS — Limites inegociáveis
-5. FORMATO — Como estruturar a saída
-6. EXEMPLOS — Demonstrações concretas
+1. IDENTITY   -- Who the agent is
+2. OBJECTIVE  -- What it must achieve
+3. TOOLS      -- What it has available
+4. RULES      -- Non-negotiable constraints
+5. FORMAT     -- How to structure output
+6. EXAMPLES   -- Concrete demonstrations
 ```
-## Template
+### Step 2: Write the System Prompt
+Use the template below, filling in each section with specific details.
 ```python
 SYSTEM_PROMPT = """
-## Identidade
-Você é {role}, especializado em {especialidade}.
-## Objetivo
-Sua missão é {objetivo_principal}. Você trabalha dentro do Maestro,
-uma plataforma de governança de desenvolvimento.
-## Ferramentas disponíveis
-{lista_de_tools_com_descrição}
-## Regras
-1. Sempre seguir o bundle {bundle_name} para padrões de código
-2. Todo commit deve referenciar a task: {task_id}
-3. Trabalhar apenas na worktree designada: {worktree_path}
-4. Reportar progresso a cada etapa significativa
-5. Solicitar human review para operações destrutivas
-## Formato de resposta
-- Para código: blocos com linguagem especificada
-- Para decisões: justificar com "porquê"
-- Para erros: incluir contexto e sugestão de fix
-## Exemplo
-Task: "Criar endpoint GET /api/v1/demands"
-Ação: Criar controller, use case, repository seguindo Clean Architecture
+## Identity
+You are {role}, specialized in {specialty}.
+## Objective
+Your mission is {primary_objective}. You work within Maestro,
+a development governance platform.
+## Available Tools
+{list_of_tools_with_descriptions}
+## Rules
+1. Always follow the {bundle_name} bundle for code standards
+2. Every commit must reference the task: {task_id}
+3. Work only in the designated worktree: {worktree_path}
+4. Report progress at every significant step
+5. Request human review for destructive operations
+## Response Format
+- For code: use fenced code blocks with language specified
+- For decisions: justify with "why"
+- For errors: include context and suggested fix
+## Example
+Task: "Create endpoint GET /api/v1/demands"
+Action: Create controller, use case, repository following Clean Architecture
 Branch: feature/backend-{task_id}
 """
 ```
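A minimal sketch of how a template like the one above might be filled in, assuming plain `str.format` substitution. The `render_prompt` helper, the shortened `TEMPLATE`, and all field values are hypothetical and not part of the package.

```python
import string

# Shortened copy of a few placeholder-bearing lines from the template
# (assumption: the real SYSTEM_PROMPT is filled via str.format-style fields).
TEMPLATE = (
    "## Identity\n"
    "You are {role}, specialized in {specialty}.\n"
    "## Rules\n"
    "1. Always follow the {bundle_name} bundle for code standards\n"
    "2. Every commit must reference the task: {task_id}\n"
)


def render_prompt(template: str, **values: str) -> str:
    """Fill every placeholder, failing loudly if one is missing."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion).
    fields = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = sorted(fields - set(values))
    if missing:
        raise ValueError(f"unfilled placeholders: {missing}")
    return template.format(**values)


# Hypothetical values, for illustration only.
prompt = render_prompt(
    TEMPLATE,
    role="a Backend Engineer Agent",
    specialty="FastAPI",
    bundle_name="bundle-ai-agents",
    task_id="TASK-123",
)
print(prompt)
```

Checking for missing fields up front gives one error listing every unfilled placeholder, rather than a `KeyError` on the first one.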
-## Boas práticas
+### Step 3: Apply Best Practices
+Review the prompt against these rules:
+1. **Be specific** -- "Create a REST API with FastAPI" > "Create an API"
+2. **Explain why** -- "Use Value Objects because they enforce validation at construction" > "Use Value Objects"
+3. **Give examples** -- One good example is worth 10 lines of instruction
+4. **Avoid negatives** -- "Keep functions under 20 lines" > "Don't write long functions"
+5. **Prioritize** -- Put the most important rules first (models tend to weight early content more heavily)
+6. **Test** -- Run the prompt with real scenarios before deploying
+### Step 4: Check for Anti-Patterns
+Audit the prompt for these common problems:
+```bash
+# Count NEVER/ALWAYS occurrences (too many weaken their impact)
+grep -o -i -w -E "never|always" prompt.md | wc -l
+# Check prompt length (over 5000 words causes focus loss)
+wc -w prompt.md
+```
+Anti-patterns to fix:
+- Excessive NEVER/ALWAYS (loses emphasis when overused)
+- Contradictory instructions (e.g., "be concise" + "explain everything in detail")
+- Prompts over 5000 words (agent loses focus on key instructions)
+- Rules without justification (agent cannot reason about when to flex)
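The two mechanical checks in this step can also be sketched in Python, which makes them easy to run inside a test suite. The `audit_prompt` helper is hypothetical. `MAX_WORDS` comes from the list above, while `MAX_EMPHATIC` is an assumed cut-off, since the source gives no number for "excessive".

```python
import re

MAX_WORDS = 5000      # from the anti-pattern list
MAX_EMPHATIC = 5      # assumption: the source does not define "excessive"


def audit_prompt(text: str) -> dict:
    """Count NEVER/ALWAYS occurrences and words, and flag threshold breaches."""
    emphatic = len(re.findall(r"\b(?:never|always)\b", text, flags=re.IGNORECASE))
    words = len(text.split())
    warnings = []
    if emphatic > MAX_EMPHATIC:
        warnings.append("excessive NEVER/ALWAYS")
    if words > MAX_WORDS:
        warnings.append("prompt over 5000 words")
    return {"emphatic": emphatic, "words": words, "warnings": warnings}


report = audit_prompt("NEVER use var. ALWAYS use const. Never skip tests.")
print(report)
```

Contradictory instructions and unjustified rules are not detectable this way and still need a human or model review.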
+### Step 5: Test and Iterate
+Run the prompt through evaluation scenarios to measure effectiveness.
+```bash
+python -m evals.run_prompt_eval --prompt prompts/agent_backend.md --scenarios evals/prompt_scenarios.json
+```
+Compare scores across prompt versions:
+```bash
+python -m evals.compare_prompts --v1 prompts/v1.md --v2 prompts/v2.md --scenarios evals/prompt_scenarios.json
+```
+## Resources
+- `references/prompt-templates.md` - Ready-to-use prompt templates for common agent roles
+- `references/anti-patterns.md` - Detailed anti-pattern catalog with fix examples
+## Examples
+### Example 1: Write a Backend Agent Prompt
+User asks: "Create a system prompt for our backend agent that builds FastAPI endpoints."
+Response approach:
+1. Set identity to "Backend Engineer Agent, specialized in FastAPI and Clean Architecture"
+2. Define objective: "Build production-ready REST API endpoints following bundle standards"
+3. List tools: file operations, git commands, test runner, linter
+4. Set rules: follow clean-architecture skill, enforce test coverage > 80%, use typed DTOs
+5. Add format section for code blocks and error reporting
+6. Include a concrete example of building a CRUD endpoint
-1. **Seja específico** — "Crie uma API REST com FastAPI" > "Crie uma API"
-2. **Explique o porquê** — "Usar Value Objects porque garantem validação no construtor" > "Usar Value Objects"
-3. **Dê exemplos** — Um bom exemplo vale mais que 10 linhas de instrução
-4. **Evite negativos** — "Mantenha funções com até 20 linhas" > "Não crie funções longas"
-5. **Priorize** — Coloque as regras mais importantes primeiro
-6. **Teste** — Rode o prompt com cenários reais antes de deployar
+### Example 2: Fix an Underperforming Prompt
+User asks: "Our agent keeps ignoring the coding standards. Fix the prompt."
+Response approach:
+1. Read the current prompt and check for vague instructions
+2. Look for missing justifications on rules (agent doesn't understand importance)
+3. Move coding standards rules higher in the prompt (priority by position)
+4. Add a concrete example showing compliant vs non-compliant code
+5. Add a "Common Mistakes" section with specific things to avoid
-## Anti-patterns
+### Example 3: Reduce Prompt Token Count
+User asks: "The system prompt is too long, cut it down without losing quality."
+Response approach:
+1. Count current tokens with `tiktoken`
+2. Move detailed reference material to skill files loaded on-demand
+3. Replace verbose explanations with concise bullet points
+4. Keep examples (they're high-value) but trim redundant ones
+5. Target: system prompt under 2000 tokens, details in skills
-- NEVER/ALWAYS em excesso (perde a força)
-- Instruções contraditórias
-- Prompts com 5000+ palavras (o agente se perde)
-- Regras sem justificativa (o agente não sabe quando flexibilizar)
+## Notes
+- System prompts should stay under 2000 tokens; move details to on-demand skills
+- Test every prompt change with at least 5 real-world scenarios
+- Version your prompts (v1, v2, etc.) and track performance across versions
+- Content near the start of a prompt tends to get the most attention from the model, so put critical rules within roughly the first 500 tokens
+- Few-shot examples are among the most effective prompting techniques

package/templates/bundle-ai-agents/skills/prompt-engineering/references/anti-patterns.md ADDED

@@ -0,0 +1,59 @@
+# Prompt Anti-Patterns Reference
+## 1. NEVER/ALWAYS Overuse
+**Problem**: Using NEVER and ALWAYS too frequently dilutes their impact.
+**Bad**:
+```
+NEVER use var. ALWAYS use const. NEVER use any. ALWAYS type everything.
+NEVER skip tests. ALWAYS write docs. NEVER commit without review.
+```
+**Good**:
+```
+Use const by default, let when reassignment is needed.
+Type all function parameters and return values.
+Critical: NEVER commit secrets or credentials to the repository.
+```
+**Fix**: Reserve NEVER/ALWAYS for truly critical rules (security, data integrity). Use softer language for preferences.
+## 2. Contradictory Instructions
+**Problem**: Instructions that conflict cause unpredictable behavior.
+**Bad**:
+```
+Be concise in your responses.
+Explain every decision in detail with full justification.
+```
+**Good**:
+```
+Be concise by default. When making architectural decisions, explain the reasoning.
+```
+**Fix**: Qualify when each instruction applies.
+## 3. Excessive Length (> 5000 words)
+**Problem**: The agent loses focus on key instructions when the prompt is too long.
+**Fix**: Move reference material to skill files. Keep the system prompt under 2000 tokens. Load details on-demand.
+## 4. Rules Without Justification
+**Problem**: Without knowing why, the agent cannot reason about edge cases.
+**Bad**: "Use Value Objects for all domain primitives."
+**Good**: "Use Value Objects for domain primitives because they enforce validation at construction time and make the domain model self-documenting."
+## 5. Vague Instructions
+**Problem**: Ambiguity leads to inconsistent agent behavior.
+**Bad**: "Write good code."
+**Good**: "Write code that follows Clean Architecture: separate controllers, use cases, and repositories. Keep functions under 20 lines. Include type hints on all function signatures."

package/templates/bundle-ai-agents/skills/prompt-engineering/references/prompt-templates.md ADDED

@@ -0,0 +1,75 @@
+# Prompt Templates Reference
+## Backend Agent Template
+```
+## Identity
+You are a Backend Engineer Agent, specialized in building REST APIs with FastAPI and Clean Architecture.
+## Objective
+Build production-ready API endpoints that follow the project's bundle standards, including proper error handling, validation, pagination, and test coverage.
+## Tools
+- File read/write operations
+- Git commands (commit, branch, push)
+- pytest for running tests
+- ruff for linting
+## Rules
+1. Follow Clean Architecture: controller -> use case -> repository
+2. Every endpoint must have input validation with Pydantic models
+3. Test coverage must be >= 80% for new code
+4. Use typed DTOs for all API responses
+5. Handle errors with standardized ErrorResponse format
+## Response Format
+- Code in fenced blocks with language specified
+- Explain architectural decisions with "why"
+- Report test results after implementation
+```
+## Frontend Agent Template
+```
+## Identity
+You are a Frontend Engineer Agent, specialized in React with TypeScript.
+## Objective
+Build responsive, accessible UI components following the project's design system and bundle standards.
+## Tools
+- File read/write operations
+- npm/pnpm commands
+- Jest/Vitest for testing
+- ESLint for linting
+## Rules
+1. Use functional components with hooks
+2. All props must be typed with TypeScript interfaces
+3. Components must be accessible (ARIA labels, keyboard navigation)
+4. Write unit tests for business logic, integration tests for user flows
+5. Use the project's design tokens for spacing, colors, typography
+```
+## DevOps Agent Template
+```
+## Identity
+You are a DevOps Engineer Agent, specialized in Docker, CI/CD, and infrastructure.
+## Objective
+Containerize applications, configure CI/CD pipelines, and manage deployment infrastructure.
+## Tools
+- Docker CLI (build, push, compose)
+- Git commands
+- kubectl for Kubernetes
+- Terraform for infrastructure
+## Rules
+1. All Dockerfiles must use multi-stage builds
+2. Never run containers as root
+3. Include health checks in all service containers
+4. Secrets must come from environment variables, never hardcoded
+5. CI pipelines must include lint, test, build, and security scan stages
+```

package/templates/bundle-ai-agents/skills/rag-pipeline/SKILL.md CHANGED

@@ -1,23 +1,54 @@
 ---
 name: rag-pipeline
-description: Construir pipeline RAG completo com ingestão, chunking, embedding, indexação e retrieval usando LangChain + pgvector. Use sempre que precisar implementar busca semântica, responder perguntas sobre documentos, ou criar um sistema de retrieval.
+description: Build complete RAG pipelines with ingestion, chunking, embedding, indexing, and retrieval using LangChain + pgvector. Use when implementing semantic search, answering questions over documents, or creating retrieval systems.
+version: 1.0.0
+author: Maestro
 ---
 # RAG Pipeline
-## Pipeline completo
+Build production-ready Retrieval-Augmented Generation pipelines with hybrid search, re-ranking, and quality evaluation.
+## When to Use
+- Building a semantic search system over documents
+- Answering questions from a knowledge base (PDFs, Markdown, code)
+- Creating a retrieval layer for an AI agent
+- Indexing project documentation, skills, or bundles into a vector store
+- Improving an existing RAG pipeline's accuracy or performance
+## Available Operations
+1. Ingest documents (load, split, enrich with metadata)
+2. Generate embeddings and index into pgvector
+3. Configure hybrid retrieval (semantic + keyword BM25)
+4. Add re-ranking for precision
+5. Build a query chain with LLM
+6. Evaluate retrieval quality with golden datasets
+## Multi-Step Workflow
+### Step 1: Set Up Environment
+Install required dependencies and verify database connectivity.
+```bash
+pip install langchain langchain-openai langchain-postgres langchain-community langchain-cohere pgvector rank-bm25
 ```
-Documentos → Loader → Splitter → Embeddings → pgvector → Retriever → Re-ranker → LLM
+Verify pgvector is available:
+```bash
+psql "$DATABASE_URL" -c "CREATE EXTENSION IF NOT EXISTS vector;"
 ```
-## 1. Ingestão
+### Step 2: Ingest Documents
+Load documents from the target directory and split into chunks with appropriate overlap.
 ```python
 from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
 from langchain.text_splitter import RecursiveCharacterTextSplitter
-# Loader por tipo de documento
+# Load documents by type
 loader = DirectoryLoader(
     "./documents/",
     glob="**/*.md",
@@ -25,7 +56,7 @@ loader = DirectoryLoader(
 )
 docs = loader.load()
-# Splitter com separadores Markdown
+# Split with Markdown-aware separators
 splitter = RecursiveCharacterTextSplitter(
     chunk_size=1000,
     chunk_overlap=200,
@@ -34,10 +65,13 @@ splitter = RecursiveCharacterTextSplitter(
 chunks = splitter.split_documents(docs)
 ```
-## 2. Metadados obrigatórios
+### Step 3: Enrich Chunks with Metadata
+Every chunk must carry metadata for filtering and traceability.
-Cada chunk deve ter:
 ```python
+from datetime import datetime
 for chunk in chunks:
     chunk.metadata.update({
         "source": chunk.metadata.get("source", "unknown"),
@@ -47,7 +81,7 @@ for chunk in chunks:
     })
 ```
-## 3. Embedding + Indexação
+### Step 4: Embed and Index into pgvector
 ```python
 from langchain_openai import OpenAIEmbeddings
@@ -63,26 +97,26 @@ vectorstore = PGVector(
 vectorstore.add_documents(chunks)
 ```
-## 4. Retrieval Híbrido
+### Step 5: Configure Hybrid Retrieval
+Combine semantic search with keyword-based BM25 using Reciprocal Rank Fusion.
 ```python
 from langchain.retrievers import EnsembleRetriever
 from langchain_community.retrievers import BM25Retriever
-# Semântico
 semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
-# Keyword
 bm25_retriever = BM25Retriever.from_documents(chunks, k=20)
-# Ensemble com RRF
 hybrid_retriever = EnsembleRetriever(
     retrievers=[semantic_retriever, bm25_retriever],
     weights=[0.6, 0.4]
 )
 ```
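To make the fusion step concrete, here is a plain-Python sketch of weighted Reciprocal Rank Fusion. It is an illustration under stated assumptions, not LangChain's actual implementation: the constant `c=60` is a commonly used default, ranks are 1-based, and the document IDs are made up.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs; earlier ranks contribute more."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["doc_a", "doc_b", "doc_c"]  # hypothetical semantic ranking
keyword = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 ranking
fused = weighted_rrf([semantic, keyword], weights=[0.6, 0.4])
print(fused)
```

A document that appears in both lists (`doc_a`, `doc_c`) outranks one that appears in only one, which is the point of combining the two retrievers.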
-## 5. Re-ranking
+### Step 6: Add Re-Ranking
+Use Cohere re-ranker to refine top-k results for higher precision.
 ```python
 from langchain.retrievers import ContextualCompressionRetriever
@@ -95,18 +129,19 @@ final_retriever = ContextualCompressionRetriever(
 )
 ```
-## 6. Query Chain
+### Step 7: Build Query Chain
 ```python
 from langchain_core.prompts import ChatPromptTemplate
 from langchain_core.output_parsers import StrOutputParser
+from langchain_core.runnables import RunnablePassthrough
 prompt = ChatPromptTemplate.from_template("""
-Responda a pergunta baseado apenas no contexto fornecido.
-Se a resposta não estiver no contexto, diga "Não encontrei essa informação".
+Answer the question based only on the provided context.
+If the answer is not in the context, say "I could not find that information."
-Contexto: {context}
-Pergunta: {question}
+Context: {context}
+Question: {question}
 """)
 chain = (
@@ -116,13 +151,55 @@ chain = (
     | StrOutputParser()
 )
-result = chain.invoke("Qual skill usar para criar componentes React?")
+result = chain.invoke("Which skill should I use to create React components?")
 ```
-## Checklist de qualidade
+### Step 8: Evaluate Retrieval Quality
+Run evaluation against a golden dataset to measure retrieval accuracy.
+```bash
+python -m evals.run_rag_eval --dataset evals/rag_golden_dataset.json --min-score 0.8
+```
-- [ ] Chunks testados com perguntas reais
-- [ ] Metadados completos em todos os chunks
-- [ ] Retrieval quality medido com golden dataset
-- [ ] Re-ranking ativo para refinar top-k
-- [ ] Fallback para quando retrieval não encontra nada
+## Resources
+- `references/chunking-strategies.md` - Guidance on chunk sizes and overlap for different document types
+- `references/embedding-models.md` - Comparison of embedding models and their trade-offs
+- `references/rag-evaluation.md` - How to build golden datasets and measure RAG quality
+## Examples
+### Example 1: Index Project Documentation
+User asks: "Set up RAG for our project docs so the agent can answer questions about our architecture."
+Response approach:
+1. Scan the `docs/` directory for Markdown files
+2. Split with Markdown-aware separators (chunk_size=1000, overlap=200)
+3. Enrich metadata with doc_type and source path
+4. Embed with text-embedding-3-large and index into pgvector
+5. Configure hybrid retrieval (semantic search combined with BM25 keyword search)
+6. Wire up a query chain and test with sample questions
+### Example 2: Improve Retrieval Accuracy
+User asks: "Our RAG is returning irrelevant results for technical queries."
+Response approach:
+1. Check current chunk sizes -- may be too large or too small
+2. Verify metadata filtering is applied for doc_type
+3. Add Cohere re-ranking to refine top-k
+4. Adjust ensemble weights (increase semantic weight for technical content)
+5. Build a golden dataset of 10-20 question/answer pairs and run evals
+### Example 3: Add a New Document Source
+User asks: "Add our API specs (OpenAPI YAML) to the RAG pipeline."
+Response approach:
+1. Add a YAML loader to the ingestion pipeline
+2. Configure appropriate splitter for structured YAML content
+3. Set doc_type metadata to "api_spec"
+4. Re-index and test retrieval with API-related queries
+## Notes
+- Always test chunks with real questions before deploying
+- Keep metadata complete on all chunks for filtering and traceability
+- Measure retrieval quality with a golden dataset (minimum 10 question/answer pairs)
+- Re-ranking is critical for precision -- always enable it in production
+- Implement a fallback response when retrieval returns no relevant results
+- Monitor token costs: embedding large document sets can be expensive
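The fallback note can be sketched as a small guard in front of the generation step. This assumes the retriever or re-ranker exposes a relevance score per document. The `answer_or_fallback` helper, the `min_score` threshold, and the sample data are hypothetical.

```python
FALLBACK = "I could not find that information."


def answer_or_fallback(scored_docs, min_score, generate):
    """Call `generate` only when at least one document clears `min_score`."""
    relevant = [doc for doc, score in scored_docs if score >= min_score]
    if not relevant:
        return FALLBACK
    return generate(relevant)


# Hypothetical (document text, relevance score) pairs from a re-ranker.
hits = [("Use react-patterns for components.", 0.91), ("Unrelated note.", 0.12)]
reply = answer_or_fallback(hits, min_score=0.5, generate=lambda docs: " ".join(docs))
print(reply)
```

Short-circuiting before the LLM call avoids paying for a generation that could only be ungrounded.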

package/templates/bundle-ai-agents/skills/rag-pipeline/references/chunking-strategies.md ADDED

@@ -0,0 +1,27 @@
+# Chunking Strategies Reference
+## Recommended Chunk Sizes by Document Type
+| Document Type | Chunk Size (chars) | Overlap (chars) | Separators |
+|---|---|---|---|
+| Markdown docs | 1000 | 200 | `\n## `, `\n### `, `\n\n`, `\n` |
+| Source code | 1500 | 300 | `\nclass `, `\ndef `, `\n\n`, `\n` |
+| API specs (YAML/JSON) | 800 | 100 | `\n- `, `\n  `, `\n\n` |
+| PDF documents | 1200 | 250 | `\n\n`, `\n`, `. ` |
+| Plain text | 1000 | 200 | `\n\n`, `\n`, `. `, ` ` |
+## Guidelines
+- **Too small** (< 500 characters): Loses context, retrieval finds fragments without meaning.
+- **Too large** (> 2000 characters): Dilutes relevance, wastes context window space.
+- **Overlap**: 12-25% of chunk_size, as in the table above, reduces information loss at boundaries.
+- **Separators**: Order matters -- RecursiveCharacterTextSplitter tries separators in order, falling back to the next one.
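The separator fallback can be illustrated with a deliberately simplified splitter. This is a sketch of the idea only, not the library's code: `recursive_split` is a hypothetical helper that drops the separators and neither merges small pieces back together nor applies overlap, both of which the real splitter does. Sizes are measured in characters.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; retry oversized pieces with the next one."""
    if len(text) <= chunk_size or not separators:
        # Small enough, or nothing left to try (the piece may stay oversized).
        return [text]
    head, *rest = separators
    chunks = []
    for piece in text.split(head):
        if piece:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = "alpha beta\n\ngamma delta epsilon zeta"
chunks = recursive_split(sample, ["\n\n", " "], chunk_size=12)
print(chunks)
```

The first paragraph fits after the `\n\n` split and is kept whole. The second is still too long, so only that piece falls back to the next separator.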
+## Metadata to Attach
+Every chunk should carry:
+- `source` -- file path or URL
+- `doc_type` -- classification (skill, prd, code, api_spec, etc.)
+- `language` -- content language for multilingual pipelines
+- `created_at` -- timestamp for freshness filtering
+- `section_title` -- nearest heading for context

package/templates/bundle-ai-agents/skills/rag-pipeline/references/embedding-models.md ADDED

@@ -0,0 +1,31 @@
+# Embedding Models Reference
+## Model Comparison
+| Model | Dimensions | Max Tokens | Cost | Quality |
+|---|---|---|---|---|
+| text-embedding-3-large | 3072 (reducible, e.g. 1536 via `dimensions`) | 8191 | $0.13/1M tokens | Best |
+| text-embedding-3-small | 1536 | 8191 | $0.02/1M tokens | Good |
+| text-embedding-ada-002 | 1536 | 8191 | $0.10/1M tokens | Legacy |
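A rough cost estimate follows directly from the per-token prices in the table. The sketch below copies those prices as given, so they may be out of date. The corpus size is a made-up example and `embedding_cost` is a hypothetical helper.

```python
# USD per 1M tokens, copied from the table above (may be outdated).
PRICE_PER_MILLION = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "text-embedding-ada-002": 0.10,
}


def embedding_cost(total_tokens: int, model: str) -> float:
    """Estimated one-off cost in USD of embedding `total_tokens` tokens."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION[model]


# Hypothetical corpus: 20,000 chunks of roughly 250 tokens each.
corpus_tokens = 20_000 * 250
for name in PRICE_PER_MILLION:
    print(f"{name}: ${embedding_cost(corpus_tokens, name):.2f}")
```

Re-embedding after a model switch incurs the full cost again, so the estimate applies per re-index rather than once overall.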
+## Recommendations
+- **Production**: Use `text-embedding-3-large` with `dimensions=1536` for best quality/cost balance.
+- **Development/Prototyping**: Use `text-embedding-3-small` to reduce costs.
+- **Consistency**: Never mix embedding models in the same collection -- re-embed everything if you switch.
+## pgvector Index Types
+| Index | Build Speed | Query Speed | Recall | Use Case |
+|---|---|---|---|---|
+| HNSW | Slow | Fast | High | Production (< 10M vectors) |
+| IVFFlat | Fast | Medium | Medium | Large datasets, quick setup |
+| None (brute force) | N/A | Slow | Perfect | Small datasets (< 50k) |
+### Recommended HNSW Settings
+```sql
+CREATE INDEX idx_embeddings ON documents
+  USING hnsw (embedding vector_cosine_ops)
+  WITH (m = 16, ef_construction = 200);
+```

package/templates/bundle-ai-agents/skills/rag-pipeline/references/rag-evaluation.md ADDED

@@ -0,0 +1,39 @@
+# RAG Evaluation Reference
+## Golden Dataset Format
+```json
+{
+  "evals": [
+    {
+      "id": "rag-eval-001",
+      "question": "Which skill handles React component creation?",
+      "expected_answer": "The react-patterns skill covers component creation.",
+      "expected_sources": ["skills/react-patterns/SKILL.md"],
+      "relevance_threshold": 0.8
+    }
+  ]
+}
+```
+## Key Metrics
+| Metric | What It Measures | Target |
+|---|---|---|
+| Retrieval Precision | % of retrieved docs that are relevant | > 0.7 |
+| Retrieval Recall | % of relevant docs that are retrieved | > 0.8 |
+| Faithfulness | Answer grounded in retrieved context | > 0.9 |
+| Answer Relevancy | Answer addresses the question | > 0.85 |
+## Evaluation Workflow
+1. Build golden dataset with 20+ question/answer/source triples.
+2. Run retrieval for each question and compare against expected sources.
+3. Generate answers and evaluate faithfulness with LLM-as-judge.
+4. Track metrics over time -- regression means something changed in ingestion or indexing.
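Step 2 of the workflow, comparing retrieved documents against `expected_sources`, can be sketched as below. This is a minimal set-based illustration. The `retrieval_metrics` helper and the `retrieved` list are hypothetical, and the entry reuses the golden-dataset shape shown earlier in this file.

```python
def retrieval_metrics(retrieved, expected):
    """Precision and recall of retrieved source paths against expected ones."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = len(retrieved_set & expected_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    return precision, recall


# One golden-dataset entry; `retrieved` stands in for real retriever output.
entry = {"id": "rag-eval-001", "expected_sources": ["skills/react-patterns/SKILL.md"]}
retrieved = ["skills/react-patterns/SKILL.md", "skills/rag-pipeline/SKILL.md"]
precision, recall = retrieval_metrics(retrieved, entry["expected_sources"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Averaging these per-question values over the whole dataset gives figures comparable to the targets in the metrics table.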
+## Common Failure Modes
+- **Low precision**: Chunks too large, no re-ranking, or poor metadata filtering.
+- **Low recall**: Chunks too small, missing document sources, or embedding model mismatch.
+- **Low faithfulness**: LLM hallucinating beyond context -- tighten the system prompt.