maestro-bundle 1.3.1 → 1.5.0 (npm)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (128)

package/templates/bundle-ai-agents/skills/rag-pipeline/SKILL.md CHANGED

@@ -1,23 +1,54 @@
 ---
 name: rag-pipeline
-description: Construir pipeline RAG completo com ingestão, chunking, embedding, indexação e retrieval usando LangChain + pgvector. Use sempre que precisar implementar busca semântica, responder perguntas sobre documentos, ou criar um sistema de retrieval.
+description: Build complete RAG pipelines with ingestion, chunking, embedding, indexing, and retrieval using LangChain + pgvector. Use when implementing semantic search, answering questions over documents, or creating retrieval systems.
+version: 1.0.0
+author: Maestro
 ---
 # RAG Pipeline
-## Pipeline completo
+Build production-ready Retrieval-Augmented Generation pipelines with hybrid search, re-ranking, and quality evaluation.
+## When to Use
+- Building a semantic search system over documents
+- Answering questions from a knowledge base (PDFs, Markdown, code)
+- Creating a retrieval layer for an AI agent
+- Indexing project documentation, skills, or bundles into a vector store
+- Improving an existing RAG pipeline's accuracy or performance
+## Available Operations
+1. Ingest documents (load, split, enrich with metadata)
+2. Generate embeddings and index into pgvector
+3. Configure hybrid retrieval (semantic + keyword BM25)
+4. Add re-ranking for precision
+5. Build a query chain with LLM
+6. Evaluate retrieval quality with golden datasets
+## Multi-Step Workflow
+### Step 1: Set Up Environment
+Install required dependencies and verify database connectivity.
+```bash
+pip install langchain langchain-openai langchain-postgres langchain-community langchain-cohere pgvector rank-bm25
 ```
-Documentos → Loader → Splitter → Embeddings → pgvector → Retriever → Re-ranker → LLM
+Verify pgvector is available:
+```bash
+psql $DATABASE_URL -c "CREATE EXTENSION IF NOT EXISTS vector;"
 ```
-## 1. Ingestão
+### Step 2: Ingest Documents
+Load documents from the target directory and split into chunks with appropriate overlap.
 ```python
 from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
 from langchain.text_splitter import RecursiveCharacterTextSplitter
-# Loader por tipo de documento
+# Load documents by type
 loader = DirectoryLoader(
     "./documents/",
     glob="**/*.md",
@@ -25,7 +56,7 @@ loader = DirectoryLoader(
 )
 docs = loader.load()
-# Splitter com separadores Markdown
+# Split with Markdown-aware separators
 splitter = RecursiveCharacterTextSplitter(
     chunk_size=1000,
     chunk_overlap=200,
@@ -34,10 +65,13 @@ splitter = RecursiveCharacterTextSplitter(
 chunks = splitter.split_documents(docs)
 ```
-## 2. Metadados obrigatórios
+### Step 3: Enrich Chunks with Metadata
+Every chunk must carry metadata for filtering and traceability.
-Cada chunk deve ter:
 ```python
+from datetime import datetime
 for chunk in chunks:
     chunk.metadata.update({
         "source": chunk.metadata.get("source", "unknown"),
@@ -47,7 +81,7 @@ for chunk in chunks:
     })
 ```
-## 3. Embedding + Indexação
+### Step 4: Embed and Index into pgvector
 ```python
 from langchain_openai import OpenAIEmbeddings
@@ -63,26 +97,26 @@ vectorstore = PGVector(
 vectorstore.add_documents(chunks)
 ```
-## 4. Retrieval Híbrido
+### Step 5: Configure Hybrid Retrieval
+Combine semantic search with keyword-based BM25 using Reciprocal Rank Fusion.
 ```python
 from langchain.retrievers import EnsembleRetriever
 from langchain_community.retrievers import BM25Retriever
-# Semântico
 semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
-# Keyword
 bm25_retriever = BM25Retriever.from_documents(chunks, k=20)
-# Ensemble com RRF
 hybrid_retriever = EnsembleRetriever(
     retrievers=[semantic_retriever, bm25_retriever],
     weights=[0.6, 0.4]
 )
 ```
-## 5. Re-ranking
+### Step 6: Add Re-Ranking
+Use Cohere re-ranker to refine top-k results for higher precision.
 ```python
 from langchain.retrievers import ContextualCompressionRetriever
@@ -95,18 +129,19 @@ final_retriever = ContextualCompressionRetriever(
 )
 ```
-## 6. Query Chain
+### Step 7: Build Query Chain
 ```python
 from langchain_core.prompts import ChatPromptTemplate
 from langchain_core.output_parsers import StrOutputParser
+from langchain_core.runnables import RunnablePassthrough
 prompt = ChatPromptTemplate.from_template("""
-Responda a pergunta baseado apenas no contexto fornecido.
-Se a resposta não estiver no contexto, diga "Não encontrei essa informação".
+Answer the question based only on the provided context.
+If the answer is not in the context, say "I could not find that information."
-Contexto: {context}
-Pergunta: {question}
+Context: {context}
+Question: {question}
 """)
 chain = (
@@ -116,13 +151,55 @@ chain = (
     | StrOutputParser()
 )
-result = chain.invoke("Qual skill usar para criar componentes React?")
+result = chain.invoke("Which skill should I use to create React components?")
 ```
-## Checklist de qualidade
+### Step 8: Evaluate Retrieval Quality
+Run evaluation against a golden dataset to measure retrieval accuracy.
+```bash
+python -m evals.run_rag_eval --dataset evals/rag_golden_dataset.json --min-score 0.8
+```
-- [ ] Chunks testados com perguntas reais
-- [ ] Metadados completos em todos os chunks
-- [ ] Retrieval quality medido com golden dataset
-- [ ] Re-ranking ativo para refinar top-k
-- [ ] Fallback para quando retrieval não encontra nada
+## Resources
+- `references/chunking-strategies.md` - Guidance on chunk sizes and overlap for different document types
+- `references/embedding-models.md` - Comparison of embedding models and their trade-offs
+- `references/rag-evaluation.md` - How to build golden datasets and measure RAG quality
+## Examples
+### Example 1: Index Project Documentation
+User asks: "Set up RAG for our project docs so the agent can answer questions about our architecture."
+Response approach:
+1. Scan the `docs/` directory for Markdown files
+2. Split with Markdown-aware separators (chunk_size=1000, overlap=200)
+3. Enrich metadata with doc_type and source path
+4. Embed with text-embedding-3-large and index into pgvector
+5. Configure hybrid retrieval with BM25 fallback
+6. Wire up a query chain and test with sample questions
+### Example 2: Improve Retrieval Accuracy
+User asks: "Our RAG is returning irrelevant results for technical queries."
+Response approach:
+1. Check current chunk sizes -- may be too large or too small
+2. Verify metadata filtering is applied for doc_type
+3. Add Cohere re-ranking to refine top-k
+4. Adjust ensemble weights (increase semantic weight for technical content)
+5. Build a golden dataset of 10-20 question/answer pairs and run evals
+### Example 3: Add a New Document Source
+User asks: "Add our API specs (OpenAPI YAML) to the RAG pipeline."
+Response approach:
+1. Add a YAML loader to the ingestion pipeline
+2. Configure appropriate splitter for structured YAML content
+3. Set doc_type metadata to "api_spec"
+4. Re-index and test retrieval with API-related queries
+## Notes
+- Always test chunks with real questions before deploying
+- Keep metadata complete on all chunks for filtering and traceability
+- Measure retrieval quality with a golden dataset (minimum 10 question/answer pairs)
+- Re-ranking is critical for precision -- always enable it in production
+- Implement a fallback response when retrieval returns no relevant results
+- Monitor token costs: embedding large document sets can be expensive
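Step 5 of this skill combines the semantic and BM25 rankings with Reciprocal Rank Fusion and the weights `[0.6, 0.4]`. The fusion idea can be sketched in plain Python as below. This illustrates the general technique and is not LangChain's code: the function name `weighted_rrf`, the smoothing constant `k=60`, and the document IDs are assumptions made for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    Each document scores sum(weight / (k + rank)) over the lists it appears in,
    with ranks starting at 1. `k` dampens the advantage of top positions.
    """
    if len(rankings) != len(weights):
        raise ValueError("one weight is required per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical semantic ranking
keyword_hits = ["doc_b", "doc_c", "doc_d"]   # hypothetical BM25 ranking
fused = weighted_rrf([semantic_hits, keyword_hits], weights=[0.6, 0.4])
print(fused)
```

Here `doc_b` and `doc_c` outrank `doc_a` even though `doc_a` is the top semantic hit, because documents found by both retrievers accumulate score from each list.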

package/templates/bundle-ai-agents/skills/rag-pipeline/references/chunking-strategies.md ADDED

@@ -0,0 +1,27 @@
+# Chunking Strategies Reference
+## Recommended Chunk Sizes by Document Type
+| Document Type | Chunk Size (chars) | Overlap (chars) | Separators |
+|---|---|---|---|
+| Markdown docs | 1000 | 200 | `\n## `, `\n### `, `\n\n`, `\n` |
+| Source code | 1500 | 300 | `\nclass `, `\ndef `, `\n\n`, `\n` |
+| API specs (YAML/JSON) | 800 | 100 | `\n- `, `\n  `, `\n\n` |
+| PDF documents | 1200 | 250 | `\n\n`, `\n`, `. ` |
+| Plain text | 1000 | 200 | `\n\n`, `\n`, `. `, ` ` |
+## Guidelines
+- **Too small** (< 500 characters): Loses context, retrieval finds fragments without meaning.
+- **Too large** (> 2000 characters): Dilutes relevance, wastes context window space.
+- **Overlap**: 15-25% of chunk_size prevents information loss at boundaries.
+- **Separators**: Order matters -- RecursiveCharacterTextSplitter tries separators in order, falling back to the next one.
+## Metadata to Attach
+Every chunk should carry:
+- `source` -- file path or URL
+- `doc_type` -- classification (skill, prd, code, api_spec, etc.)
+- `language` -- content language for multilingual pipelines
+- `created_at` -- timestamp for freshness filtering
+- `section_title` -- nearest heading for context
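The guideline that separator order matters can be illustrated with a simplified recursive splitter. This is a sketch of the fallback idea only, under stated assumptions: it measures size in characters, drops the separators it splits on, and neither merges small pieces nor applies overlap, all of which a real splitter such as `RecursiveCharacterTextSplitter` handles differently. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, separators, chunk_size):
    """Split text so every piece is at most chunk_size characters.

    Separators are tried in order: the first one found in the text is used,
    and any piece that is still too long is split again with the remaining,
    finer-grained separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    if separator not in text:
        return recursive_split(text, remaining, chunk_size)

    pieces = []
    for part in text.split(separator):
        pieces.extend(recursive_split(part, remaining, chunk_size))
    return pieces


sample = "intro\n## A\nalpha\n\nbeta\n## B\ngamma"
chunks = recursive_split(sample, ["\n## ", "\n\n", "\n"], chunk_size=12)
print(chunks)
```

The heading separator is tried first, so section boundaries are preferred; only the oversized section `A` is split further at the blank line.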

package/templates/bundle-ai-agents/skills/rag-pipeline/references/embedding-models.md ADDED

@@ -0,0 +1,31 @@
+# Embedding Models Reference
+## Model Comparison
+| Model | Dimensions | Max Tokens | Cost | Quality |
+|---|---|---|---|---|
+| text-embedding-3-large | 3072 (or 1536) | 8191 | $0.13/1M tokens | Best |
+| text-embedding-3-small | 1536 | 8191 | $0.02/1M tokens | Good |
+| text-embedding-ada-002 | 1536 | 8191 | $0.10/1M tokens | Legacy |
+## Recommendations
+- **Production**: Use `text-embedding-3-large` with `dimensions=1536` for best quality/cost balance.
+- **Development/Prototyping**: Use `text-embedding-3-small` to reduce costs.
+- **Consistency**: Never mix embedding models in the same collection -- re-embed everything if you switch.
+## pgvector Index Types
+| Index | Build Speed | Query Speed | Recall | Use Case |
+|---|---|---|---|---|
+| HNSW | Slow | Fast | High | Production (< 10M vectors) |
+| IVFFlat | Fast | Medium | Medium | Large datasets, quick setup |
+| None (brute force) | N/A | Slow | Perfect | Small datasets (< 50k) |
+### Recommended HNSW Settings
+```sql
+CREATE INDEX idx_embeddings ON documents
+  USING hnsw (embedding vector_cosine_ops)
+  WITH (m = 16, ef_construction = 200);
+```
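The cost column in the model comparison table can be turned into a quick estimate before indexing a large corpus. The sketch below uses the per-million-token prices exactly as listed in that table, which may not match current pricing; the function name and the corpus size in the example are assumptions.

```python
# Prices in USD per 1M tokens, copied from the comparison table above.
PRICE_PER_MILLION_TOKENS_USD = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "text-embedding-ada-002": 0.10,
}


def estimate_embedding_cost(total_tokens, model):
    """Return the estimated USD cost of embedding total_tokens with model."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD[model]


# Hypothetical corpus: 5,000 chunks averaging 250 tokens each.
corpus_tokens = 5_000 * 250
for model_name in PRICE_PER_MILLION_TOKENS_USD:
    cost = estimate_embedding_cost(corpus_tokens, model_name)
    print(f"{model_name}: ${cost:.4f}")
```

Because switching models requires re-embedding the whole collection, the same estimate applies again to every migration.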

package/templates/bundle-ai-agents/skills/rag-pipeline/references/rag-evaluation.md ADDED

@@ -0,0 +1,39 @@
+# RAG Evaluation Reference
+## Golden Dataset Format
+```json
+{
+  "evals": [
+    {
+      "id": "rag-eval-001",
+      "question": "Which skill handles React component creation?",
+      "expected_answer": "The react-patterns skill covers component creation.",
+      "expected_sources": ["skills/react-patterns/SKILL.md"],
+      "relevance_threshold": 0.8
+    }
+  ]
+}
+```
+## Key Metrics
+| Metric | What It Measures | Target |
+|---|---|---|
+| Retrieval Precision | % of retrieved docs that are relevant | > 0.7 |
+| Retrieval Recall | % of relevant docs that are retrieved | > 0.8 |
+| Faithfulness | Answer grounded in retrieved context | > 0.9 |
+| Answer Relevancy | Answer addresses the question | > 0.85 |
+## Evaluation Workflow
+1. Build golden dataset with 20+ question/answer/source triples.
+2. Run retrieval for each question and compare against expected sources.
+3. Generate answers and evaluate faithfulness with LLM-as-judge.
+4. Track metrics over time -- regression means something changed in ingestion or indexing.
+## Common Failure Modes
+- **Low precision**: Chunks too large, no re-ranking, or poor metadata filtering.
+- **Low recall**: Chunks too small, missing document sources, or embedding model mismatch.
+- **Low faithfulness**: LLM hallucinating beyond context -- tighten the system prompt.
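Step 2 of the evaluation workflow, comparing retrieved sources against `expected_sources`, reduces to set arithmetic. The following is a minimal sketch of retrieval precision and recall for one golden-dataset entry; the function name and the retrieved list are hypothetical, and source paths are compared by exact string match.

```python
def retrieval_precision_recall(retrieved_sources, expected_sources):
    """Compute (precision, recall) for one question.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all expected
    """
    retrieved = set(retrieved_sources)
    expected = set(expected_sources)
    hits = retrieved & expected
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall


golden_entry = {
    "id": "rag-eval-001",
    "expected_sources": ["skills/react-patterns/SKILL.md"],
}
# Hypothetical retriever output for the question in rag-eval-001.
retrieved = [
    "skills/react-patterns/SKILL.md",
    "skills/testing-strategy/SKILL.md",
]
precision, recall = retrieval_precision_recall(
    retrieved, golden_entry["expected_sources"]
)
print(precision, recall)
```

In this example recall meets the 0.8 target but precision falls below 0.7, the pattern the failure-mode list attributes to missing re-ranking or weak metadata filtering.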

package/templates/bundle-ai-agents/skills/testing-strategy/SKILL.md CHANGED

@@ -1,27 +1,54 @@
 ---
 name: testing-strategy
-description: Implementar estratégia de testes com unitários, integração e e2e usando Pytest ou JUnit. Use quando for escrever testes, definir estratégia de testes, ou melhorar cobertura.
+description: Implement testing strategies with unit, integration, and e2e tests using Pytest or JUnit. Use when writing tests, defining test strategy, or improving test coverage.
+version: 1.0.0
+author: Maestro
 ---
-# Estratégia de Testes
+# Testing Strategy
-## Pirâmide
+Implement the testing pyramid with domain unit tests, integration tests for repositories, and e2e tests for API endpoints, following consistent naming and fixture patterns.
+## When to Use
+- Writing tests for new features or bug fixes
+- Setting up test infrastructure for a new project
+- Improving test coverage on existing code
+- Reviewing test quality and naming conventions
+- Configuring CI/CD test pipelines
+## Available Operations
+1. Write unit tests for domain entities and value objects
+2. Write integration tests for repositories and external services
+3. Write e2e tests for API endpoints
+4. Configure test fixtures and factories
+5. Run tests and measure coverage
+6. Set up test pipelines in CI/CD
+## Multi-Step Workflow
+### Step 1: Understand the Testing Pyramid
 ```
-        /  E2E  \          Poucos, lentos, caros
-       /  Integr. \        Moderados
-      / Unitários   \      Muitos, rápidos, baratos
+        /  E2E  \          Few, slow, expensive
+       / Integr. \         Moderate count
+      / Unit Tests \       Many, fast, cheap
 ```
-## Testes Unitários — Domínio
+- **Unit tests**: Test domain logic in isolation. No database, no HTTP, no external services.
+- **Integration tests**: Test adapters (repositories, clients) against real infrastructure.
+- **E2E tests**: Test full API request/response cycles.
-Testar regras de negócio sem infraestrutura.
+### Step 2: Write Unit Tests for Domain Entities
+Test business rules without any infrastructure dependency.
 ```python
 # tests/domain/test_demand.py
+import pytest
 class TestDemand:
     def test_should_decompose_new_demand(self):
-        demand = Demand(id=DemandId.generate(), description="Criar CRUD")
+        demand = Demand(id=DemandId.generate(), description="Create CRUD")
         planner = FakePlanner(tasks=[Task(...), Task(...)])
         tasks = demand.decompose(planner)
@@ -30,14 +57,14 @@ class TestDemand:
         assert demand.status == DemandStatus.PLANNED
     def test_should_reject_decompose_if_already_planned(self):
-        demand = Demand(id=DemandId.generate(), description="Criar CRUD")
+        demand = Demand(id=DemandId.generate(), description="Create CRUD")
         demand.decompose(FakePlanner(tasks=[Task(...)]))
         with pytest.raises(DemandAlreadyDecomposedException):
             demand.decompose(FakePlanner(tasks=[]))
     def test_should_not_allow_more_than_20_tasks(self):
-        demand = Demand(id=DemandId.generate(), description="Mega projeto")
+        demand = Demand(id=DemandId.generate(), description="Mega project")
         for i in range(20):
             demand.add_task(Task(...))
@@ -45,9 +72,10 @@ class TestDemand:
             demand.add_task(Task(...))
 ```
-## Testes Unitários — Value Objects
+### Step 3: Write Unit Tests for Value Objects
 ```python
+# tests/domain/test_compliance_score.py
 class TestComplianceScore:
     def test_passing_score(self):
         score = ComplianceScore(85.0)
@@ -62,10 +90,16 @@ class TestComplianceScore:
             ComplianceScore(150.0)
 ```
-## Testes de Integração — Repositórios
+### Step 4: Write Integration Tests for Repositories
+Test against a real database using fixtures with rollback.
 ```python
 # tests/infrastructure/test_pg_demand_repository.py
+import pytest
+from sqlalchemy import create_engine
+from sqlalchemy.orm import Session
 @pytest.fixture
 def db_session():
     engine = create_engine(TEST_DATABASE_URL)
@@ -84,12 +118,87 @@ class TestPgDemandRepository:
         assert found.description == "Test"
 ```
-## Naming
+### Step 5: Run Tests and Measure Coverage
+```bash
+# Run all tests
+pytest tests/ -v
+# Run only unit tests
+pytest tests/domain/ -v
+# Run only integration tests
+pytest tests/infrastructure/ -v --timeout=30
+# Run with coverage report
+pytest tests/ --cov=src --cov-report=term-missing --cov-fail-under=80
+# Run specific test file
+pytest tests/domain/test_demand.py -v
+# Run tests matching a pattern
+pytest tests/ -k "test_should_decompose" -v
 ```
-test_should_<resultado>_when_<condição>
-test_should_return_error_when_email_is_invalid
-test_should_decompose_demand_when_status_is_created
-test_should_reject_merge_when_conflicts_exist
+### Step 6: Set Up CI/CD Test Pipeline
+```yaml
+test:
+  stage: test
+  script:
+    - pip install -r requirements-test.txt
+    - pytest tests/domain/ -v --junitxml=reports/unit.xml
+    - pytest tests/infrastructure/ -v --junitxml=reports/integration.xml --timeout=60
+    - pytest tests/ --cov=src --cov-report=xml --cov-fail-under=80
+  artifacts:
+    reports:
+      junit:
+        - reports/*.xml
+      coverage_report:
+        coverage_format: cobertura
+        path: coverage.xml
 ```
+## Resources
+- `references/naming-conventions.md` - Test naming patterns and organization guidelines
+- `references/fixture-patterns.md` - Common pytest fixture patterns and factories
+## Examples
+### Example 1: Write Tests for a New Feature
+User asks: "Write tests for the new user registration feature."
+Response approach:
+1. Unit test the `User` entity: valid creation, email validation, password rules
+2. Unit test the `RegisterUser` use case: happy path, duplicate email, invalid data
+3. Integration test the `PgUserRepository`: save and find
+4. E2e test the `POST /api/v1/users` endpoint: 201 on success, 422 on invalid data
+5. Run: `pytest tests/ -v --cov=src`
+### Example 2: Improve Test Coverage
+User asks: "Our coverage is at 45%. Get it to 80%."
+Response approach:
+1. Run `pytest --cov=src --cov-report=term-missing` to find uncovered lines
+2. Prioritize domain layer (business rules are the most valuable to test)
+3. Add edge case tests for existing entities (error paths, boundary values)
+4. Add integration tests for untested repositories
+5. Skip e2e tests for now (lowest ROI per coverage point)
+6. Target: domain 95%, application 85%, infrastructure 70%
+### Example 3: Set Up Testing for a New Project
+User asks: "Set up the test infrastructure for our new Python project."
+Response approach:
+1. Install dependencies: `pip install pytest pytest-cov pytest-asyncio pytest-timeout` (`pytest-timeout` provides the `--timeout` flag used in the commands above)
+2. Create `conftest.py` with database session fixtures
+3. Create directory structure: `tests/domain/`, `tests/application/`, `tests/infrastructure/`, `tests/api/`
+4. Add `pytest.ini` or `pyproject.toml` with test configuration
+5. Create a sample test to verify the setup
+6. Add test commands to Makefile or scripts
+## Notes
+- Follow the naming convention: `test_should_<result>_when_<condition>`
+- Unit tests must run without any external dependencies (no DB, no HTTP)
+- Use fakes/stubs for dependencies in unit tests, not mocks (fakes are more maintainable)
+- Integration tests should roll back database changes after each test
+- Coverage threshold: 80% minimum, but focus on testing business rules, not getters/setters
+- Run unit tests first in CI (fast feedback), integration tests second, e2e last
+- Each test should test one behavior -- avoid testing multiple things in a single test
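The notes on dependency-free unit tests and on preferring fakes can be made concrete with a self-contained sketch. The domain model below is hypothetical: the diff shows only the tests, so the `Demand` fields, the `plan` method on the planner, and the `Task` shape are assumptions chosen to satisfy the two behaviors the tests in Step 2 describe.

```python
from dataclasses import dataclass, field
from enum import Enum


class DemandStatus(Enum):
    CREATED = "created"
    PLANNED = "planned"


class DemandAlreadyDecomposedException(Exception):
    """Raised when decompose() is called on a demand that is already planned."""


@dataclass(frozen=True)
class Task:
    title: str


class FakePlanner:
    """In-memory stand-in for a planner: returns the tasks it was given."""

    def __init__(self, tasks):
        self._tasks = list(tasks)

    def plan(self, description):  # `plan` is a hypothetical method name
        return list(self._tasks)


@dataclass
class Demand:
    description: str
    status: DemandStatus = DemandStatus.CREATED
    tasks: list = field(default_factory=list)

    def decompose(self, planner):
        if self.status is not DemandStatus.CREATED:
            raise DemandAlreadyDecomposedException(self.description)
        self.tasks = planner.plan(self.description)
        self.status = DemandStatus.PLANNED
        return self.tasks


def test_should_plan_demand_when_status_is_created():
    demand = Demand(description="Create CRUD")
    tasks = demand.decompose(FakePlanner(tasks=[Task("model"), Task("api")]))
    assert len(tasks) == 2
    assert demand.status is DemandStatus.PLANNED


def test_should_reject_decompose_when_already_planned():
    demand = Demand(description="Create CRUD")
    demand.decompose(FakePlanner(tasks=[Task("model")]))
    try:
        demand.decompose(FakePlanner(tasks=[]))
    except DemandAlreadyDecomposedException:
        return
    raise AssertionError("expected DemandAlreadyDecomposedException")
```

The fake holds real, inspectable state instead of recorded call expectations, so the tests assert on outcomes (returned tasks, resulting status) rather than on how the planner was invoked.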

package/templates/bundle-ai-agents/skills/testing-strategy/references/fixture-patterns.md ADDED

@@ -0,0 +1,81 @@
+# Fixture Patterns Reference
+## Database Session Fixture (with rollback)
+```python
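+import pytest
+from sqlalchemy import create_engine
+from sqlalchemy.orm import Session
+# TEST_DATABASE_URL is assumed to be defined in the project's test settings.
+@pytest.fixture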
+def db_session():
+    engine = create_engine(TEST_DATABASE_URL)
+    connection = engine.connect()
+    transaction = connection.begin()
+    session = Session(bind=connection)
+    yield session
+    session.close()
+    transaction.rollback()
+    connection.close()
+```
+## Factory Fixtures
+```python
+@pytest.fixture
+def make_demand():
+    def _make(
+        description: str = "Test demand",
+        status: DemandStatus = DemandStatus.CREATED,
+    ) -> Demand:
+        return Demand(
+            id=DemandId.generate(),
+            description=description,
+            status=status,
+        )
+    return _make
+# Usage in test
+def test_something(make_demand):
+    demand = make_demand(description="Custom demand")
+```
+## Async Fixtures
+```python
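+import pytest_asyncio
+from httpx import ASGITransport, AsyncClient
+# Async fixtures need pytest-asyncio's decorator in strict mode; httpx 0.27+ takes the app via a transport.
+@pytest_asyncio.fixture
+async def async_client(app):
+    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client: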
+        yield client
+```
+## Fake Implementations
+```python
+class FakeDemandRepository(DemandRepository):
+    def __init__(self):
+        self._demands: dict[str, Demand] = {}
+    def find_by_id(self, id: DemandId) -> Demand:
+        demand = self._demands.get(str(id))
+        if not demand:
+            raise DemandNotFoundException(id)
+        return demand
+    def save(self, demand: Demand) -> None:
+        self._demands[str(demand.id)] = demand
+class FakeEventBus(EventBus):
+    def __init__(self):
+        self.published_events: list[DomainEvent] = []
+    def publish(self, event: DomainEvent) -> None:
+        self.published_events.append(event)
+```
+## Fixture Scoping
+| Scope | Lifecycle | Use For |
+|---|---|---|
+| `function` (default) | Each test | Most fixtures |
+| `class` | Each test class | Shared setup within a class |
+| `module` | Each test file | Expensive setup (DB schema) |
+| `session` | Entire test run | One-time setup (create DB) |
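The rollback pattern behind the database session fixture can be demonstrated without SQLAlchemy or a running PostgreSQL. The sketch below swaps in the standard library's `sqlite3` with an in-memory database purely for illustration; the `demands` table and the helper names are assumptions, and the context manager plays the role that the `yield` fixture plays under pytest.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rollback_session(connection):
    """Yield the connection, then discard whatever the caller wrote."""
    try:
        yield connection
    finally:
        connection.rollback()


def count_demands(connection):
    return connection.execute("SELECT COUNT(*) FROM demands").fetchone()[0]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE demands (id TEXT PRIMARY KEY, description TEXT)")
connection.commit()  # schema is committed once, outside any test

with rollback_session(connection) as session:
    session.execute("INSERT INTO demands VALUES (?, ?)", ("d-1", "Test"))
    rows_during_test = count_demands(session)

rows_after_test = count_demands(connection)
print(rows_during_test, rows_after_test)
```

The row is visible inside the block and gone afterwards, which is what lets each integration test start from the same database state.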

package/templates/bundle-ai-agents/skills/testing-strategy/references/naming-conventions.md ADDED

@@ -0,0 +1,69 @@
+# Test Naming Conventions Reference
+## Test Method Naming
+```
+test_should_<expected_result>_when_<condition>
+```
+### Examples
+```python
+# Good names
+test_should_return_error_when_email_is_invalid
+test_should_decompose_demand_when_status_is_created
+test_should_reject_merge_when_conflicts_exist
+test_should_create_user_when_data_is_valid
+test_should_raise_not_found_when_id_does_not_exist
+# Bad names
+test_demand()          # What about demand?
+test_error()           # What error? When?
+test_create()          # Create what? Under what condition?
+test_1()               # Meaningless
+```
+## Test Class Naming
+```python
+class TestDemand:           # Domain entity tests
+class TestDecomposeDemand:  # Use case tests
+class TestPgDemandRepository:  # Repository integration tests
+class TestDemandEndpoints:  # API e2e tests
+```
+## File Organization
+```
+tests/
+  domain/
+    test_demand.py
+    test_task.py
+    test_compliance_score.py
+  application/
+    test_decompose_demand.py
+    test_create_demand.py
+  infrastructure/
+    test_pg_demand_repository.py
+    test_pg_task_repository.py
+  api/
+    test_demand_endpoints.py
+    test_task_endpoints.py
+  conftest.py              # Shared fixtures
+```
+## Test Structure (AAA Pattern)
+```python
+def test_should_decompose_demand_when_status_is_created(self):
+    # Arrange
+    demand = Demand(id=DemandId.generate(), description="Create CRUD")
+    planner = FakePlanner(tasks=[Task(...)])
+    # Act
+    tasks = demand.decompose(planner)
+    # Assert
+    assert len(tasks) == 1
+    assert demand.status == DemandStatus.PLANNED
+```
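The `test_should_<expected_result>_when_<condition>` pattern is regular enough to check mechanically, for example in a lint step. The sketch below is one possible check, not part of the package: the regular expression and the function name are assumptions, and it accepts only lowercase snake_case names.

```python
import re

# test_should_<expected_result>_when_<condition>, each part in snake_case.
NAME_PATTERN = re.compile(
    r"^test_should_[a-z0-9]+(?:_[a-z0-9]+)*_when_[a-z0-9]+(?:_[a-z0-9]+)*$"
)


def follows_convention(test_name):
    """Return True if test_name matches the naming convention."""
    return NAME_PATTERN.fullmatch(test_name) is not None


good_names = [
    "test_should_return_error_when_email_is_invalid",
    "test_should_decompose_demand_when_status_is_created",
    "test_should_raise_not_found_when_id_does_not_exist",
]
bad_names = ["test_demand", "test_error", "test_create", "test_1"]

for name in good_names + bad_names:
    print(f"{name}: {'ok' if follows_convention(name) else 'rename'}")
```

A check like this only enforces the shape of the name; whether the result and condition are described accurately still needs review.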