maestro-bundle 1.0.0

Files changed (60)
  1. package/README.md +91 -0
  2. package/package.json +25 -0
  3. package/src/cli.mjs +212 -0
  4. package/templates/bundle-ai-agents/.spec/constitution.md +33 -0
  5. package/templates/bundle-ai-agents/AGENTS.md +140 -0
  6. package/templates/bundle-ai-agents/skills/agent-orchestration/SKILL.md +132 -0
  7. package/templates/bundle-ai-agents/skills/api-design/SKILL.md +100 -0
  8. package/templates/bundle-ai-agents/skills/clean-architecture/SKILL.md +99 -0
  9. package/templates/bundle-ai-agents/skills/context-engineering/SKILL.md +98 -0
  10. package/templates/bundle-ai-agents/skills/database-modeling/SKILL.md +59 -0
  11. package/templates/bundle-ai-agents/skills/docker-containerization/SKILL.md +114 -0
  12. package/templates/bundle-ai-agents/skills/eval-testing/SKILL.md +115 -0
  13. package/templates/bundle-ai-agents/skills/memory-management/SKILL.md +106 -0
  14. package/templates/bundle-ai-agents/skills/prompt-engineering/SKILL.md +66 -0
  15. package/templates/bundle-ai-agents/skills/rag-pipeline/SKILL.md +128 -0
  16. package/templates/bundle-ai-agents/skills/testing-strategy/SKILL.md +95 -0
  17. package/templates/bundle-base/AGENTS.md +118 -0
  18. package/templates/bundle-base/skills/branch-strategy/SKILL.md +42 -0
  19. package/templates/bundle-base/skills/code-review/SKILL.md +54 -0
  20. package/templates/bundle-base/skills/commit-pattern/SKILL.md +58 -0
  21. package/templates/bundle-data-pipeline/.spec/constitution.md +32 -0
  22. package/templates/bundle-data-pipeline/AGENTS.md +115 -0
  23. package/templates/bundle-data-pipeline/skills/data-preprocessing/SKILL.md +75 -0
  24. package/templates/bundle-data-pipeline/skills/docker-containerization/SKILL.md +114 -0
  25. package/templates/bundle-data-pipeline/skills/feature-engineering/SKILL.md +76 -0
  26. package/templates/bundle-data-pipeline/skills/mlops-pipeline/SKILL.md +77 -0
  27. package/templates/bundle-data-pipeline/skills/model-training/SKILL.md +68 -0
  28. package/templates/bundle-data-pipeline/skills/rag-pipeline/SKILL.md +128 -0
  29. package/templates/bundle-frontend-spa/.spec/constitution.md +32 -0
  30. package/templates/bundle-frontend-spa/AGENTS.md +107 -0
  31. package/templates/bundle-frontend-spa/skills/authentication/SKILL.md +90 -0
  32. package/templates/bundle-frontend-spa/skills/component-design/SKILL.md +115 -0
  33. package/templates/bundle-frontend-spa/skills/e2e-testing/SKILL.md +101 -0
  34. package/templates/bundle-frontend-spa/skills/integration-api/SKILL.md +95 -0
  35. package/templates/bundle-frontend-spa/skills/react-patterns/SKILL.md +130 -0
  36. package/templates/bundle-frontend-spa/skills/responsive-layout/SKILL.md +65 -0
  37. package/templates/bundle-frontend-spa/skills/state-management/SKILL.md +86 -0
  38. package/templates/bundle-jhipster-microservices/.spec/constitution.md +37 -0
  39. package/templates/bundle-jhipster-microservices/AGENTS.md +307 -0
  40. package/templates/bundle-jhipster-microservices/skills/ci-cd-pipeline/SKILL.md +112 -0
  41. package/templates/bundle-jhipster-microservices/skills/clean-architecture/SKILL.md +99 -0
  42. package/templates/bundle-jhipster-microservices/skills/ddd-tactical/SKILL.md +138 -0
  43. package/templates/bundle-jhipster-microservices/skills/jhipster-angular/SKILL.md +97 -0
  44. package/templates/bundle-jhipster-microservices/skills/jhipster-docker-k8s/SKILL.md +183 -0
  45. package/templates/bundle-jhipster-microservices/skills/jhipster-entities/SKILL.md +87 -0
  46. package/templates/bundle-jhipster-microservices/skills/jhipster-gateway/SKILL.md +96 -0
  47. package/templates/bundle-jhipster-microservices/skills/jhipster-kafka/SKILL.md +145 -0
  48. package/templates/bundle-jhipster-microservices/skills/jhipster-registry/SKILL.md +83 -0
  49. package/templates/bundle-jhipster-microservices/skills/jhipster-service/SKILL.md +131 -0
  50. package/templates/bundle-jhipster-microservices/skills/testing-strategy/SKILL.md +95 -0
  51. package/templates/bundle-jhipster-monorepo/.spec/constitution.md +32 -0
  52. package/templates/bundle-jhipster-monorepo/AGENTS.md +227 -0
  53. package/templates/bundle-jhipster-monorepo/skills/clean-architecture/SKILL.md +99 -0
  54. package/templates/bundle-jhipster-monorepo/skills/ddd-tactical/SKILL.md +138 -0
  55. package/templates/bundle-jhipster-monorepo/skills/jhipster-angular/SKILL.md +166 -0
  56. package/templates/bundle-jhipster-monorepo/skills/jhipster-entities/SKILL.md +141 -0
  57. package/templates/bundle-jhipster-monorepo/skills/jhipster-liquibase/SKILL.md +95 -0
  58. package/templates/bundle-jhipster-monorepo/skills/jhipster-security/SKILL.md +89 -0
  59. package/templates/bundle-jhipster-monorepo/skills/jhipster-spring/SKILL.md +155 -0
  60. package/templates/bundle-jhipster-monorepo/skills/testing-strategy/SKILL.md +95 -0
@@ -0,0 +1,115 @@
# Project: Data & ML Pipeline

You are building a data pipeline covering ingestion, processing, model training, and serving. The project uses Python with a focus on data engineering and machine learning.

## Specification-Driven Development (SDD)

This project uses **GitHub Spec Kit** for governance. Before implementing any request:

1. Run `/speckit.constitution` — if `.spec/constitution.md` does not exist
2. Run `/speckit.specify` — describe WHAT and WHY (not how)
3. Run `/speckit.plan` — architecture and technical decisions
4. Run `/speckit.tasks` — break the work into atomic tasks
5. Run `/speckit.implement` — execute the tasks

Never jump straight to code. Spec first, code second.

## References

Reference documents the agent should consult when needed:

- `references/pandas-patterns.md` — Transformation patterns with Pandas
- `references/mlflow-guide.md` — Experiment tracking guide
- `references/data-validation.md` — Validation with Pandera/Great Expectations

## Project stack

- **Language:** Python 3.11+
- **Data:** Pandas, Polars, NumPy
- **ML:** Scikit-learn, XGBoost, LightGBM
- **Deep Learning:** PyTorch (when needed)
- **Pipeline:** Apache Airflow or Prefect
- **Experiment Tracking:** MLflow
- **RAG (if applicable):** LangChain + pgvector
- **Database:** PostgreSQL
- **Containers:** Docker
- **Validation:** Pandera, Great Expectations

## Project structure

```
src/
├── data/
│   ├── raw/                  # Original data (immutable, never edit)
│   ├── processed/            # Transformed data
│   └── features/             # Feature store
├── pipelines/
│   ├── ingestion/            # Ingestion from external sources
│   ├── preprocessing/        # Cleaning and transformation
│   ├── feature_engineering/  # Feature creation
│   └── training/             # Training pipeline
├── models/
│   ├── training/             # Training scripts
│   ├── evaluation/           # Evaluation and metrics
│   └── serving/              # Inference API (FastAPI)
├── rag/                      # If applicable
│   ├── ingest.py
│   ├── retriever.py
│   └── embeddings.py
├── notebooks/                # Exploration ONLY (never shipped to prod)
├── tests/
│   ├── test_preprocessing.py
│   ├── test_features.py
│   └── test_model.py
└── config/
    ├── settings.py
    └── models_config.yaml
```

## Code standards

- Maximum 500 lines per file, 20 lines per function
- Type hints on public functions
- Docstrings on data-transformation functions (input/output)
- Black + Ruff for formatting
- Notebook → Python script before anything reaches production

## Data standards

- Original data is IMMUTABLE — never edit `raw/`
- Each transformation is a pure function (input → output, no side effects)
- Validate the schema at the entry of every pipeline step (Pandera)
- Dataset versioning with DVC
- Log every transformation
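A minimal sketch of the pure-function rule: copy the input, derive, return — never mutate the caller's frame. Column names and band thresholds here are hypothetical:

```python
import pandas as pd

def add_salary_band(df: pd.DataFrame) -> pd.DataFrame:
    """Pure transformation: same input always yields same output, input left untouched."""
    out = df.copy()  # never mutate the caller's frame
    out["salary_band"] = pd.cut(
        out["salary"],
        bins=[0, 3000, 8000, float("inf")],
        labels=["junior", "mid", "senior"],
    )
    return out
```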
## ML standards

- Every model needs a baseline (majority class, mean, linear regression)
- Cross-validation with k=5 minimum
- Documented metrics: accuracy, precision, recall, F1, AUC
- Feature importance logged in MLflow
- Serialized models carry a version
- A/B test before replacing a model in production
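The baseline and k=5 rules in one minimal sketch, on synthetic data — the candidate model must beat the majority-class baseline under the same 5-fold protocol:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a real signal in the first feature
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Baseline: majority class, evaluated with the same 5-fold CV as the model
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(LogisticRegression(), X, y, cv=5)

print(f"baseline={baseline.mean():.3f} model={model.mean():.3f}")
```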
## Git

- Commits: `feat(preprocessing): add salary normalization`
- Branches: `feature/<pipeline>-<description>`
- Never commit data (use .gitignore; DVC for data)
- Never commit binary models (use the MLflow registry)

## Tests

- Schema tests (Pandera) for every transformation
- Unit tests for feature-engineering functions
- Regression tests for model metrics
- Minimum coverage: 80% on transformation pipelines
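A schema test for a transformation step might look like this (plain assertions shown for brevity; in this project such checks would normally go through Pandera). The function and column names are illustrative:

```python
import pandas as pd

def normalize_salaries(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation under test: z-score the salary column."""
    out = df.copy()
    out["salary"] = (out["salary"] - out["salary"].mean()) / out["salary"].std()
    return out

def test_normalize_salaries_schema():
    df = pd.DataFrame({"salary": [1000.0, 2000.0, 3000.0]})
    result = normalize_salaries(df)
    # Schema: same columns, float dtype, mean ~0 after normalization
    assert list(result.columns) == ["salary"]
    assert result["salary"].dtype == "float64"
    assert abs(result["salary"].mean()) < 1e-9
```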
## What NOT to do

- Do not put a notebook into production without refactoring it
- Do not train without a baseline
- Do not ignore data drift
- Do not use inconsistent random seeds
- Do not hardcode paths — use config
- Do not run SELECT * against large datasets
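The seed rule is easiest to keep when one helper pins every RNG and is called at each entry point. A sketch; extend with framework-specific RNGs (e.g. PyTorch) when needed:

```python
import os
import random
import numpy as np

def set_seed(seed: int = 42) -> None:
    """Pin every RNG in one place so runs are reproducible."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

# Same seed, same draws
set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
```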
@@ -0,0 +1,75 @@
---
name: data-preprocessing
description: Preprocess data with Pandas and NumPy, including cleaning, transformation, and exploratory analysis. Use when you need to clean data, run EDA, or prepare datasets.
---

# Data Preprocessing

## EDA (Exploratory Data Analysis)

```python
import pandas as pd
import numpy as np

def eda_report(df: pd.DataFrame) -> dict:
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.to_dict(),
        "nulls": df.isnull().sum().to_dict(),
        "null_pct": (df.isnull().sum() / len(df) * 100).to_dict(),
        "duplicates": df.duplicated().sum(),
        "numeric_stats": df.describe().to_dict(),
        "categorical_counts": {
            col: df[col].value_counts().head(10).to_dict()
            for col in df.select_dtypes(include='object').columns
        }
    }
```

## Cleaning pipeline

```python
def clean_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. Drop duplicates
    df = df.drop_duplicates()

    # 2. Fix types (match explicit date-like names only, to avoid
    #    false positives such as 'status' or 'category')
    date_cols = [c for c in df.columns if 'date' in c.lower() or c.lower().endswith('_at')]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')

    # 3. Handle numeric nulls
    for col in df.select_dtypes(include=[np.number]).columns:
        if df[col].isnull().sum() / len(df) < 0.05:
            df[col] = df[col].fillna(df[col].median())
        else:
            df = df.drop(columns=[col])  # >5% nulls: drop the column

    # 4. Handle categorical nulls
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].fillna('unknown')

    # 5. Normalize strings
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].str.strip().str.lower()

    return df
```

## Validation with Pandera

```python
import pandera as pa

schema = pa.DataFrameSchema({
    "demand_id": pa.Column(str, nullable=False, unique=True),
    "description": pa.Column(str, nullable=False),
    "status": pa.Column(str, pa.Check.isin(["created", "planned", "completed"])),
    "compliance_score": pa.Column(float, pa.Check.between(0, 100), nullable=True),
    "created_at": pa.Column("datetime64[ns]", nullable=False),
})

validated_df = schema.validate(df)
```
@@ -0,0 +1,114 @@
---
name: docker-containerization
description: Create optimized Dockerfiles with multi-stage builds, security hardening, and docker-compose for development. Use when containerizing applications, writing Dockerfiles, or setting up a dev environment.
---

# Docker Containerization

## Python Dockerfile — Multi-stage

```dockerfile
# === Build stage ===
FROM python:3.11-slim AS builder
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends gcc && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# === Runtime stage ===
FROM python:3.11-slim
WORKDIR /app
RUN groupadd -r appuser && useradd -r -g appuser appuser
COPY --from=builder /install /usr/local
COPY src/ ./src/
USER appuser
EXPOSE 8000
# slim images do not ship curl; probe the health endpoint with Python instead
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

## React Dockerfile — Multi-stage

```dockerfile
FROM node:20-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80
```

## Docker Compose — Dev

```yaml
# docker-compose.dev.yml
services:
  api:
    build:
      context: .
      dockerfile: docker/Dockerfile.api
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://maestro:maestro@postgres/maestro
      - REDIS_URL=redis://redis:6379
    volumes:
      - ./src:/app/src   # Hot reload
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: maestro
      POSTGRES_USER: maestro
      POSTGRES_PASSWORD: maestro
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U maestro"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin

volumes:
  pgdata:
```

## .dockerignore

```
.git
node_modules
__pycache__
*.pyc
.env
.venv
dist
build
coverage
.pytest_cache
```
@@ -0,0 +1,76 @@
---
name: feature-engineering
description: Create and transform features for ML models, including encoding, scaling, and feature selection. Use when you need to prepare data, create features, or select relevant variables.
---

# Feature Engineering

## Flow

```
Raw data → Cleaning → Encoding → Scaling → Feature Selection → Ready data
```

## Cleaning

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicates
    df = df.drop_duplicates()

    # Handle nulls
    df['age'] = df['age'].fillna(df['age'].median())
    df['name'] = df['name'].fillna('Unknown')

    # Remove outliers (IQR)
    Q1, Q3 = df['salary'].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    df = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]

    # Typing
    df['created_at'] = pd.to_datetime(df['created_at'])

    return df
```

## Encoding

```python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder

# Unordered categories → OneHotEncoder
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['department', 'city']])

# Ordered categories → OrdinalEncoder
oe = OrdinalEncoder(categories=[['junior', 'pleno', 'senior']])
df['level_encoded'] = oe.fit_transform(df[['level']])

# Target → LabelEncoder
le = LabelEncoder()
y = le.fit_transform(df['target'])
```

## Feature Selection

```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Statistical filter
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Model feature importance
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=feature_names)
top_features = importances.nlargest(10)
```

## Rules

1. Never fit the scaler/encoder on test-set data
2. Save transformers together with the model (pickle/joblib)
3. Document every created feature (name, type, origin)
4. Check correlation between features (remove redundant ones)
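Rules 1 and 2 combined in a minimal sketch: fit the scaler on the training split only, then persist it next to the model:

```python
import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # transform (never fit) on test

joblib.dump(scaler, "scaler.joblib")  # ship alongside the model
```

At inference time, load the very same transformer instead of refitting, so production features match training features exactly.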
@@ -0,0 +1,77 @@
---
name: mlops-pipeline
description: Build MLOps pipelines with MLflow for tracking, model registry, and automated deployment. Use when you need to version models, automate training, or set up a model registry.
---

# MLOps Pipeline

## MLflow Tracking

```python
import mlflow
from sklearn.metrics import accuracy_score, f1_score, precision_score

mlflow.set_tracking_uri("http://mlflow.maestro.local")
mlflow.set_experiment("compliance-classifier")

with mlflow.start_run(run_name="rf-v1"):
    mlflow.log_params({
        "n_estimators": 200,
        "max_depth": 20,
        "cv_folds": 5
    })

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, average='weighted'),
        "precision": precision_score(y_test, y_pred, average='weighted'),
    })

    mlflow.sklearn.log_model(model, "model")
```

## Model Registry

```python
# Register the model
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "compliance-classifier")

# Promote to production
client = mlflow.MlflowClient()
client.transition_model_version_stage(
    name="compliance-classifier",
    version=2,
    stage="Production"
)
```

## Automated pipeline

```python
# pipelines/training.py
from sklearn.model_selection import train_test_split

def training_pipeline():
    """Full pipeline: data → training → evaluation → registration"""

    # 1. Load data (load_latest_data etc. are project-local helpers)
    df = load_latest_data()

    # 2. Preprocess
    X, y = preprocess(df)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # 3. Train with tracking
    with mlflow.start_run():
        model = train_model(X_train, y_train)
        metrics = evaluate_model(model, X_test, y_test)
        mlflow.log_metrics(metrics)

        # 4. Register only if better than the production model
        prod_metrics = get_production_metrics()
        if metrics['f1'] > prod_metrics.get('f1', 0):
            mlflow.sklearn.log_model(model, "model")
            register_as_candidate(model)
            notify_team("New candidate model available")
```
@@ -0,0 +1,68 @@
---
name: model-training
description: Train ML models with Scikit-learn, including a preprocessing pipeline, cross-validation, and hyperparameter tuning. Use when training models, running cross-validation, or optimizing hyperparameters.
---

# Model Training

## Full pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import joblib

# 1. Preprocessing
numeric_features = ['age', 'salary', 'experience']
categorical_features = ['department', 'role']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# 2. Pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 3. Cross-validation
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"F1 Score: {scores.mean():.3f} (+/- {scores.std():.3f})")

# 4. Hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [100, 200, 500],
    'classifier__max_depth': [10, 20, None],
    'classifier__min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# 5. Final evaluation
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))

# 6. Save the model
joblib.dump(grid_search.best_estimator_, 'models/model_v1.pkl')
```

## Always compare against a baseline

```python
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_score:.3f}")
print(f"Model accuracy: {grid_search.score(X_test, y_test):.3f}")
```
@@ -0,0 +1,128 @@
---
name: rag-pipeline
description: Build a full RAG pipeline with ingestion, chunking, embedding, indexing, and retrieval using LangChain + pgvector. Use whenever you need to implement semantic search, answer questions about documents, or build a retrieval system.
---

# RAG Pipeline

## Full pipeline

```
Documents → Loader → Splitter → Embeddings → pgvector → Retriever → Re-ranker → LLM
```

## 1. Ingestion

```python
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Loader per document type
loader = DirectoryLoader(
    "./documents/",
    glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader
)
docs = loader.load()

# Splitter with Markdown separators
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(docs)
```

## 2. Required metadata

Every chunk must carry:
```python
from datetime import datetime

# classify_document / detect_language are project-local helpers
for chunk in chunks:
    chunk.metadata.update({
        "source": chunk.metadata.get("source", "unknown"),
        "doc_type": classify_document(chunk),  # skill, agent_md, prd, code
        "language": detect_language(chunk),
        "created_at": datetime.now().isoformat(),
    })
```

## 3. Embedding + Indexing

```python
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1536)

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=DATABASE_URL,
)
vectorstore.add_documents(chunks)
```

## 4. Hybrid Retrieval

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Semantic
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Keyword
bm25_retriever = BM25Retriever.from_documents(chunks, k=20)

# Ensemble with RRF
hybrid_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
```

## 5. Re-ranking

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(top_n=5)
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever
)
```

## 6. Query Chain

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template("""
Answer the question using only the provided context.
If the answer is not in the context, say "I could not find that information".

Context: {context}
Question: {question}
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# llm: any chat model instance (e.g. ChatOpenAI), configured elsewhere
chain = (
    {"context": final_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

result = chain.invoke("Which skill should I use to create React components?")
```

## Quality checklist

- [ ] Chunks tested against real questions
- [ ] Complete metadata on every chunk
- [ ] Retrieval quality measured against a golden dataset
- [ ] Re-ranking active to refine the top-k
- [ ] Fallback for when retrieval finds nothing
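The fallback item can be a plain guard around retrieval. A framework-agnostic sketch, where `retriever` and `chain` are assumed to expose an `invoke` method (as LangChain runnables do):

```python
FALLBACK_ANSWER = "I could not find that information in the indexed documents."

def answer_with_fallback(question: str, retriever, chain, min_docs: int = 1) -> str:
    """Run retrieval first; only call the LLM chain when there is real context."""
    docs = retriever.invoke(question)
    if len(docs) < min_docs:
        return FALLBACK_ANSWER
    return chain.invoke(question)
```

The guard keeps the LLM from answering on an empty context, which is where most hallucinated answers come from.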
@@ -0,0 +1,32 @@
# Constitution — Frontend SPA Project

## Principles

1. **Spec first, code second** — Every request goes through the SDD flow before implementation
2. **Component = 1 responsibility** — Small, focused components
3. **Server state in React Query** — Never duplicate API data in global state
4. **TypeScript strict** — Zero `any`, types for everything
5. **Mobile-first** — Write for mobile, add breakpoints for desktop

## Development standards

- React 18+, TypeScript strict mode
- Tailwind CSS + Shadcn/UI
- Feature-based folder structure
- Custom hooks for reusable logic
- React Hook Form + Zod for forms

## Component standards

- Composition over configuration
- Loading/Error/Empty states for everything async
- Accessibility: semantic HTML, aria-labels
- Lazy loading per route
- Maximum 200 lines per component

## Quality standards

- Vitest + Testing Library for components
- Playwright for E2E
- Minimum coverage: 70%
- Commits follow Conventional Commits