npm - maestro-bundle - Versions diffs - 1.3.1 → 1.4.0 - Mend

maestro-bundle 1.3.1 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (116)

package/templates/bundle-data-pipeline/skills/data-preprocessing/references/pandera-schemas.md ADDED

@@ -0,0 +1,44 @@
+# Pandera Schema Validation Reference
+## Basic Schema
+```python
+import pandera as pa
+schema = pa.DataFrameSchema({
+    "id": pa.Column(int, nullable=False, unique=True),
+    "name": pa.Column(str, nullable=False),
+    "score": pa.Column(float, pa.Check.between(0, 100)),
+    "status": pa.Column(str, pa.Check.isin(["active", "inactive"])),
+})
+validated = schema.validate(df)
+```
+## Common Checks
+```python
+pa.Check.between(0, 100)           # range check
+pa.Check.isin(["a", "b", "c"])     # allowed values
+pa.Check.str_matches(r"^\d{3}$")   # regex match
+pa.Check.gt(0)                     # greater than
+pa.Check.le(1.0)                   # less than or equal
+pa.Check(lambda s: s.str.len() > 3) # custom check
+```
+## Schema-Level Checks
+```python
+schema = pa.DataFrameSchema(
+    columns={...},
+    checks=[
+        pa.Check(lambda df: df["end_date"] > df["start_date"]),
+    ],
+    index=pa.Index(int, name="idx"),
+    coerce=True,  # auto-coerce types
+)
+```
+## Decorator Validation
+```python
+import pandas as pd  # needed for the type hints below
+@pa.check_input(schema)
+def process_data(df: pd.DataFrame) -> pd.DataFrame:
+    return df.assign(processed=True)
+```

package/templates/bundle-data-pipeline/skills/docker-containerization/SKILL.md CHANGED

@@ -1,12 +1,50 @@
 ---
 name: docker-containerization
-description: Criar Dockerfiles otimizados com multi-stage build, security hardening e docker-compose para desenvolvimento. Use quando for containerizar aplicações, criar Dockerfiles, ou configurar ambiente de dev.
+description: Create optimized Dockerfiles with multi-stage builds, security hardening, and docker-compose for development environments. Use when you need to containerize an application, write Dockerfiles, set up docker-compose, or debug container issues.
+version: 1.0.0
+author: Maestro
 ---
 # Docker Containerization
-## Dockerfile Python — Multi-stage
+Build production-ready containers with multi-stage builds, security best practices, and full docker-compose dev environments.
+## When to Use
+- User needs to containerize a Python/Node.js application
+- User wants to create a docker-compose setup for local development
+- User needs to optimize Docker image size with multi-stage builds
+- User wants to add health checks, non-root users, or security hardening
+- User needs to debug container build or runtime issues
+## Available Operations
+1. Create a multi-stage Dockerfile for Python (FastAPI/Flask)
+2. Create a multi-stage Dockerfile for React/Node.js
+3. Set up docker-compose with PostgreSQL, Redis, MinIO
+4. Configure .dockerignore for optimal build context
+5. Debug build failures and optimize image size
+## Multi-Step Workflow
+### Step 1: Create .dockerignore
+```bash
+cat > .dockerignore << 'EOF'
+.git
+node_modules
+__pycache__
+*.pyc
+.env
+.venv
+dist
+build
+coverage
+.pytest_cache
+.mypy_cache
+*.egg-info
+.DS_Store
+EOF
+```
+### Step 2: Write Dockerfile for Python API (Multi-Stage)
 ```dockerfile
 # === Build stage ===
 FROM python:3.11-slim AS builder
@@ -18,18 +56,25 @@ RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
 # === Runtime stage ===
 FROM python:3.11-slim
 WORKDIR /app
+# Security: non-root user
 RUN groupadd -r appuser && useradd -r -g appuser appuser
+# Copy only installed packages from builder
 COPY --from=builder /install /usr/local
 COPY src/ ./src/
+# Switch to non-root
 USER appuser
 EXPOSE 8000
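 # python:3.11-slim does not ship curl, so probe with the Python standard library
 HEALTHCHECK --interval=30s --timeout=5s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1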
 CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
 ```
-## Dockerfile React — Multi-stage
+### Step 3: Write Dockerfile for React Frontend (Multi-Stage)
 ```dockerfile
+# === Build stage ===
 FROM node:20-slim AS builder
 WORKDIR /app
 COPY package*.json ./
@@ -37,14 +82,33 @@ RUN npm ci
 COPY . .
 RUN npm run build
+# === Runtime stage ===
 FROM nginx:alpine
 COPY --from=builder /app/dist /usr/share/nginx/html
 COPY nginx.conf /etc/nginx/conf.d/default.conf
 EXPOSE 80
+HEALTHCHECK --interval=30s --timeout=5s CMD wget -q --spider http://localhost/ || exit 1
 ```
-## Docker Compose — Dev
+### Step 4: Build and Test Locally
+```bash
+# Build Python API image
+docker build -t myapp-api -f docker/Dockerfile.api .
+# Build React frontend image
+docker build -t myapp-frontend -f docker/Dockerfile.frontend .
+# Verify image sizes
+docker images | grep myapp
+# Test API container
+docker run --rm -p 8000:8000 myapp-api
+# Test frontend container
+docker run --rm -p 3000:80 myapp-frontend
+```
+### Step 5: Set Up docker-compose for Development
 ```yaml
 # docker-compose.dev.yml
 services:
@@ -98,17 +162,69 @@ volumes:
   pgdata:
 ```
-## .dockerignore
+### Step 6: Start and Manage the Stack
+```bash
+# Start all services
+docker compose -f docker-compose.dev.yml up -d
+# Check service health
+docker compose -f docker-compose.dev.yml ps
+# View logs
+docker compose -f docker-compose.dev.yml logs -f api
+# Stop everything
+docker compose -f docker-compose.dev.yml down
+# Stop and remove volumes (clean slate)
+docker compose -f docker-compose.dev.yml down -v
 ```
-.git
-node_modules
-__pycache__
-*.pyc
-.env
-.venv
-dist
-build
-coverage
-.pytest_cache
+### Step 7: Debug Common Issues
+```bash
+# Check why a container exited
+docker compose -f docker-compose.dev.yml logs api
+# Shell into a running container
+docker compose -f docker-compose.dev.yml exec api bash
+# Check container resource usage
+docker stats
+# Rebuild a single service after code changes
+docker compose -f docker-compose.dev.yml up -d --build api
+# Prune unused images to free disk space
+docker image prune -f
 ```
+## Resources
+- `references/dockerfile-best-practices.md` - Security and optimization patterns
+- `references/compose-patterns.md` - Common docker-compose service configurations
+## Examples
+### Example 1: Containerize a FastAPI App
+User asks: "Create a Dockerfile for our Python API"
+Response approach:
+1. Create .dockerignore to exclude unnecessary files
+2. Write multi-stage Dockerfile (builder + runtime)
+3. Add non-root user and health check
+4. Build and test with `docker build` and `docker run`
+5. Verify image size with `docker images`
+### Example 2: Set Up Local Dev Environment
+User asks: "Set up docker-compose with Postgres and Redis for development"
+Response approach:
+1. Create docker-compose.dev.yml with all services
+2. Add health checks and dependency ordering
+3. Mount source code as volume for hot reload
+4. Start with `docker compose up -d`
+5. Verify services with `docker compose ps`
+## Notes
+- Always use multi-stage builds to keep production images small
+- Never run containers as root -- create a dedicated appuser
+- Add HEALTHCHECK to every Dockerfile for orchestrator integration
+- Use `.dockerignore` to keep build context small and fast
+- Pin base image versions (python:3.11-slim, not python:latest)
+- Mount source code as volumes only in dev, not in production builds
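
Reviewer sketch (not part of the package): Steps 4 and 6 of the skill above start containers and then check them by hand. A small readiness probe can make that check scriptable. The helper below is a minimal sketch using only the Python standard library; the function name `wait_for_port` and its default timings are assumptions for illustration, not something the skill defines.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

After `docker run --rm -p 8000:8000 myapp-api`, calling `wait_for_port("localhost", 8000)` would tell you whether the published port accepts connections. It only proves the port is open, not that the application behind it is healthy.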

package/templates/bundle-data-pipeline/skills/docker-containerization/references/compose-patterns.md ADDED

@@ -0,0 +1,82 @@
+# Docker Compose Patterns
+## PostgreSQL with pgvector
+```yaml
+postgres:
+  image: pgvector/pgvector:pg16
+  environment:
+    POSTGRES_DB: mydb
+    POSTGRES_USER: myuser
+    POSTGRES_PASSWORD: mypassword
+  ports:
+    - "5432:5432"
+  volumes:
+    - pgdata:/var/lib/postgresql/data
+    - ./init.sql:/docker-entrypoint-initdb.d/init.sql
+  healthcheck:
+    test: ["CMD-SHELL", "pg_isready -U myuser"]
+    interval: 5s
+    timeout: 5s
+    retries: 5
+```
+## Redis
+```yaml
+redis:
+  image: redis:7-alpine
+  ports:
+    - "6379:6379"
+  volumes:
+    - redisdata:/data
+  healthcheck:
+    test: ["CMD", "redis-cli", "ping"]
+    interval: 5s
+    timeout: 3s
+```
+## MinIO (S3-compatible storage)
+```yaml
+minio:
+  image: minio/minio
+  command: server /data --console-address ":9001"
+  ports:
+    - "9000:9000"
+    - "9001:9001"
+  environment:
+    MINIO_ROOT_USER: minioadmin
+    MINIO_ROOT_PASSWORD: minioadmin
+  volumes:
+    - miniodata:/data
+```
+## MLflow Server
+```yaml
+mlflow:
+  image: ghcr.io/mlflow/mlflow:latest
+  command: mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri postgresql://mlflow:mlflow@postgres/mlflow --default-artifact-root s3://mlflow/
+  ports:
+    - "5000:5000"
+  depends_on:
+    postgres:
+      condition: service_healthy
+```
+## Dependency Ordering
+```yaml
+services:
+  api:
+    depends_on:
+      postgres:
+        condition: service_healthy  # wait for healthy, not just started
+      redis:
+        condition: service_started
+```
+## Development Volumes (hot reload)
+```yaml
+services:
+  api:
+    volumes:
+      - ./src:/app/src          # mount source code
+      - /app/node_modules       # exclude node_modules from mount
+```

package/templates/bundle-data-pipeline/skills/docker-containerization/references/dockerfile-best-practices.md ADDED

@@ -0,0 +1,57 @@
+# Dockerfile Best Practices
+## Multi-Stage Build Pattern
+```dockerfile
+# Stage 1: Build
+FROM python:3.11-slim AS builder
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
+# Stage 2: Runtime (smaller image)
+FROM python:3.11-slim
+COPY --from=builder /install /usr/local
+COPY src/ ./src/
+CMD ["python", "-m", "src.main"]
+```
+## Security Checklist
+- [ ] Non-root user: `RUN groupadd -r app && useradd -r -g app app` then `USER app`
+- [ ] Pin base image versions: `python:3.11-slim` not `python:latest`
+- [ ] No secrets in build args or ENV: use runtime environment variables
+- [ ] Minimal packages: `--no-install-recommends` for apt-get
+- [ ] Clean up apt cache: `rm -rf /var/lib/apt/lists/*`
+- [ ] Scan images: `docker scout cves myimage`
+## Layer Optimization
+```dockerfile
+# BAD: creates unnecessary layer cache misses
+COPY . .
+RUN pip install -r requirements.txt
+# GOOD: dependencies cached separately from code
+COPY requirements.txt .
+RUN pip install -r requirements.txt
+COPY src/ ./src/
+```
+## Health Checks
+```dockerfile
+# HTTP health check (requires curl in the image; python:3.11-slim does not ship it)
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
+  CMD curl -f http://localhost:8000/health || exit 1
+# HTTP health check via the Python standard library (no curl needed)
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
+  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+```
+## Image Size Comparison
+| Base Image | Size | Use Case |
+|---|---|---|
+| python:3.11 | ~900MB | Avoid |
+| python:3.11-slim | ~120MB | Default choice |
+| python:3.11-alpine | ~50MB | If musl-compatible |
+| node:20 | ~1GB | Avoid |
+| node:20-slim | ~200MB | Default choice |
+| node:20-alpine | ~130MB | If no native deps |
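
Reviewer sketch (not part of the package): the curl-free health check in the "Health Checks" section above packs its logic into a `python -c` one-liner. The same idea can be written as a small function that is easier to read and test. This is a minimal sketch assuming the `/health` endpoint from the Dockerfiles above; the function name `check_health` is hypothetical.

```python
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 5.0) -> int:
    """Return 0 if the URL answers with a 2xx status, 1 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return 1
```

To use it from a `HEALTHCHECK` instruction, one option would be a script that ends with `sys.exit(check_health("http://localhost:8000/health"))`. The return values follow the same convention as the `|| exit 1` in the snippets above: 0 for healthy, 1 for unhealthy.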

package/templates/bundle-data-pipeline/skills/feature-engineering/SKILL.md CHANGED

@@ -1,76 +1,174 @@
 ---
 name: feature-engineering
-description: Criar e transformar features para modelos de ML incluindo encoding, scaling, e feature selection. Use quando precisar preparar dados, criar features, ou selecionar variáveis relevantes.
+description: Create and transform features for ML models including encoding, scaling, and feature selection. Use when you need to prepare data for training, create new features, encode categoricals, or select the most relevant variables.
+version: 1.0.0
+author: Maestro
 ---
 # Feature Engineering
-## Fluxo
+Build feature pipelines that transform raw data into model-ready inputs using scikit-learn.
-```
-Dados brutos → Limpeza → Encoding → Scaling → Feature Selection → Dados prontos
-```
+## When to Use
+- User needs to encode categorical variables (one-hot, ordinal, label)
+- User needs to scale or normalize numeric features
+- User wants to select the best features for a model
+- User needs to create derived features (interactions, aggregations, date parts)
+- User needs to remove outliers from a dataset
+## Available Operations
+1. Clean data and remove outliers (IQR method)
+2. Encode categorical features (OneHot, Ordinal, Label)
+3. Scale numeric features (Standard, MinMax, Robust)
+4. Create derived features (date parts, interactions, aggregations)
+5. Select top features (statistical tests, model importance)
+6. Build a reusable sklearn ColumnTransformer pipeline
-## Limpeza
+## Multi-Step Workflow
+### Step 1: Install Dependencies
+```bash
+pip install pandas numpy scikit-learn joblib
+```
+### Step 2: Load and Split Data
 ```python
 import pandas as pd
+import numpy as np
+from sklearn.model_selection import train_test_split
-def clean_data(df: pd.DataFrame) -> pd.DataFrame:
-    # Remover duplicatas
-    df = df.drop_duplicates()
-    # Tratar nulos
-    df['age'] = df['age'].fillna(df['age'].median())
-    df['name'] = df['name'].fillna('Unknown')
+df = pd.read_parquet("data/processed/dataset_clean.parquet")
-    # Remover outliers (IQR)
-    Q1, Q3 = df['salary'].quantile([0.25, 0.75])
-    IQR = Q3 - Q1
-    df = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]
+# Separate target
+X = df.drop(columns=["target"])
+y = df["target"]
-    # Tipagem
-    df['created_at'] = pd.to_datetime(df['created_at'])
+# Split BEFORE any fitting -- prevents data leakage
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+print(f"Train: {X_train.shape}, Test: {X_test.shape}")
+```
+### Step 3: Remove Outliers (IQR Method)
+```python
+def remove_outliers_iqr(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
+    df = df.copy()
+    for col in columns:
+        Q1, Q3 = df[col].quantile([0.25, 0.75])
+        IQR = Q3 - Q1
+        mask = (df[col] >= Q1 - 1.5 * IQR) & (df[col] <= Q3 + 1.5 * IQR)
+        before = len(df)
+        df = df[mask]
+        print(f"  {col}: removed {before - len(df)} outliers")
     return df
-```
-## Encoding
+numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
+X_train = remove_outliers_iqr(X_train, numeric_cols)
+y_train = y_train.loc[X_train.index]
+```
+### Step 4: Build Encoding and Scaling Pipeline
 ```python
-from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
+from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
+from sklearn.compose import ColumnTransformer
+from sklearn.pipeline import Pipeline
+numeric_features = ["age", "salary", "experience"]
+categorical_features = ["department", "city"]
+ordinal_features = ["level"]
+preprocessor = ColumnTransformer(
+    transformers=[
+        ("num", StandardScaler(), numeric_features),
+        ("cat", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), categorical_features),
+        ("ord", OrdinalEncoder(categories=[["junior", "mid", "senior"]]), ordinal_features),
+    ],
+    remainder="drop"
+)
+# Fit on train only, transform both
+X_train_transformed = preprocessor.fit_transform(X_train)
+X_test_transformed = preprocessor.transform(X_test)
+print(f"Features after transform: {X_train_transformed.shape[1]}")
+```
-# Categorias sem ordem → OneHotEncoder
-ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
-encoded = ohe.fit_transform(df[['department', 'city']])
+### Step 5: Feature Selection
+```python
+from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
-# Categorias com ordem → OrdinalEncoder
-oe = OrdinalEncoder(categories=[['junior', 'pleno', 'senior']])
-df['level_encoded'] = oe.fit_transform(df[['level']])
+# Statistical filter
+selector = SelectKBest(score_func=f_classif, k=10)
+X_selected = selector.fit_transform(X_train_transformed, y_train)
-# Target → LabelEncoder
-le = LabelEncoder()
-y = le.fit_transform(df['target'])
+# Get selected feature names
+feature_names = preprocessor.get_feature_names_out()
+selected_mask = selector.get_support()
+selected_features = feature_names[selected_mask]
+print(f"Selected features: {list(selected_features)}")
 ```
-## Feature Selection
+Alternatively, use model-based importance:
+```python
+from sklearn.ensemble import RandomForestClassifier
+rf = RandomForestClassifier(n_estimators=100, random_state=42)
+rf.fit(X_train_transformed, y_train)
+importances = pd.Series(rf.feature_importances_, index=feature_names)
+top_features = importances.nlargest(10)
+print(top_features)
+```
+### Step 6: Save Transformer for Reuse
+```bash
+mkdir -p models
+```
 ```python
-from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
+import joblib
-# Filtro estatístico
-selector = SelectKBest(score_func=f_classif, k=10)
-X_selected = selector.fit_transform(X, y)
+joblib.dump(preprocessor, "models/preprocessor_v1.pkl")
+print("Saved preprocessor to models/preprocessor_v1.pkl")
-# Feature importance do modelo
-model.fit(X, y)
-importances = pd.Series(model.feature_importances_, index=feature_names)
-top_features = importances.nlargest(10)
+# To reload later:
+# preprocessor = joblib.load("models/preprocessor_v1.pkl")
 ```
-## Regras
+### Step 7: Verify Pipeline End-to-End
+```bash
+python -c "
+import joblib, pandas as pd
+p = joblib.load('models/preprocessor_v1.pkl')
+df = pd.read_parquet('data/processed/dataset_clean.parquet').head(5)
+X = df.drop(columns=['target'])
+result = p.transform(X)
+print(f'Input: {X.shape} -> Output: {result.shape}')
+"
+```
-1. Nunca usar dados do test set para fit do scaler/encoder
-2. Salvar transformers junto com o modelo (pickle/joblib)
-3. Documentar cada feature criada (nome, tipo, origem)
-4. Verificar correlação entre features (remover redundantes)
+## Resources
+- `references/encoding-guide.md` - When to use which encoder
+- `references/scaling-guide.md` - Scaler comparison and selection
+## Examples
+### Example 1: Encode Categoricals
+User asks: "Encode the department and city columns for my classifier"
+Response approach:
+1. Identify column cardinality with `df['col'].nunique()`
+2. Use OneHotEncoder for low-cardinality unordered categoricals
+3. Use OrdinalEncoder for ordered categoricals
+4. Build a ColumnTransformer and fit on training data only
+5. Save the fitted transformer with joblib
+### Example 2: Select Best Features
+User asks: "Which features matter most for predicting churn?"
+Response approach:
+1. Preprocess all features through the ColumnTransformer
+2. Run SelectKBest with f_classif to rank features
+3. Also run RandomForest feature_importances_ for comparison
+4. Report the top 10 features from both methods
+5. Recommend dropping low-importance features
+## Notes
+- Never fit encoders/scalers on test data -- fit on train, transform both
+- Save transformers alongside the model with joblib for reproducibility
+- Document each created feature: name, type, source column, transformation
+- Check the correlation between features and remove redundant ones (absolute correlation > 0.95)
+- For high-cardinality categoricals (>50 values), consider target encoding
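
Reviewer sketch (not part of the package): the notes above recommend dropping highly correlated features, but the skill shows no code for it. The following is a minimal sketch in plain Python so that it runs without extra dependencies; in the pipeline above you would more likely start from pandas' `DataFrame.corr()`. The helper names `pearson` and `drop_redundant`, and the greedy "keep the first column seen" rule, are assumptions for illustration.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns: dict[str, list[float]], threshold: float = 0.95) -> list[str]:
    """Keep a column unless its absolute correlation with an already kept
    column exceeds `threshold`. Columns are visited in insertion order."""
    kept: list[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


features = {
    "age": [20, 30, 40, 50],
    "age_months": [240, 360, 480, 600],  # exact multiple of age
    "salary": [50, 20, 70, 30],
}
print(drop_redundant(features))  # ['age', 'salary']
```

As with the encoders and scalers, the correlations should be computed on the training split only, so the choice of dropped columns does not leak information from the test set.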

package/templates/bundle-data-pipeline/skills/feature-engineering/references/encoding-guide.md ADDED

@@ -0,0 +1,41 @@
+# Encoding Guide
+## Decision Matrix
+| Scenario | Encoder | Example |
+|---|---|---|
+| Unordered, low cardinality (<15) | OneHotEncoder | department, color |
+| Ordered categories | OrdinalEncoder | level (junior/mid/senior) |
+| Binary target variable | LabelEncoder | yes/no, churn/retain |
+| High cardinality (>50) | TargetEncoder | zip_code, product_id |
+| Text-like categories | HashingEncoder (from the `category_encoders` package, not scikit-learn) | free-text categories |
+## OneHotEncoder
+```python
+from sklearn.preprocessing import OneHotEncoder
+ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
+encoded = ohe.fit_transform(df[['department', 'city']])
+feature_names = ohe.get_feature_names_out()
+```
+## OrdinalEncoder
+```python
+from sklearn.preprocessing import OrdinalEncoder
+oe = OrdinalEncoder(categories=[['junior', 'mid', 'senior']])
+df['level_encoded'] = oe.fit_transform(df[['level']])
+```
+## LabelEncoder (target only)
+```python
+from sklearn.preprocessing import LabelEncoder
+le = LabelEncoder()
+y = le.fit_transform(df['target'])
+# Decode: le.inverse_transform(y)
+```
+## TargetEncoder (high cardinality)
+```python
+from sklearn.preprocessing import TargetEncoder  # added in scikit-learn 1.3
+te = TargetEncoder(smooth="auto")
+df['zip_encoded'] = te.fit_transform(df[['zip_code']], y)
+```
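
Reviewer sketch (not part of the package): the decision matrix in the encoding guide above can be read as a small rule chain. The function below is one possible encoding of those rules, returning the encoder name as a string. The name `choose_encoder` is hypothetical, the order of the checks is an assumption, and the matrix says nothing about unordered columns with 15 to 50 unique values, so the sketch flags that range instead of guessing. The "text-like categories" row is left out because it cannot be detected from cardinality alone.

```python
def choose_encoder(n_unique: int, ordered: bool = False, is_target: bool = False) -> str:
    """Map column properties to an encoder name, following the decision matrix."""
    if is_target:
        return "LabelEncoder"       # target variable only
    if ordered:
        return "OrdinalEncoder"     # e.g. junior < mid < senior
    if n_unique > 50:
        return "TargetEncoder"      # high cardinality
    if n_unique < 15:
        return "OneHotEncoder"      # unordered, low cardinality
    # 15 to 50 unique values: not covered by the matrix, needs a judgment call
    return "undecided"
```

A typical call would pass `df["col"].nunique()` as `n_unique`, as suggested in Example 1 of the skill.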

package/templates/bundle-data-pipeline/skills/feature-engineering/references/scaling-guide.md ADDED

@@ -0,0 +1,38 @@
+# Scaling Guide
+## Decision Matrix
+| Scenario | Scaler | When to Use |
+|---|---|---|
+| Normal distribution, no outliers | StandardScaler | Default choice for most models |
+| Need 0-1 range | MinMaxScaler | Neural networks, image data |
+| Data has outliers | RobustScaler | Uses median/IQR, outlier-resistant |
+| Sparse data | MaxAbsScaler | Preserves sparsity |
+## StandardScaler (z-score normalization)
+```python
+from sklearn.preprocessing import StandardScaler
+scaler = StandardScaler()
+X_scaled = scaler.fit_transform(X_train)
+X_test_scaled = scaler.transform(X_test)
+```
+## MinMaxScaler (0-1 range)
+```python
+from sklearn.preprocessing import MinMaxScaler
+scaler = MinMaxScaler(feature_range=(0, 1))
+X_scaled = scaler.fit_transform(X_train)
+```
+## RobustScaler (outlier-resistant)
+```python
+from sklearn.preprocessing import RobustScaler
+scaler = RobustScaler()
+X_scaled = scaler.fit_transform(X_train)
+```
+## Important Rules
+1. Always fit on training data only
+2. Save the scaler with joblib alongside the model
+3. Tree-based models (RF, XGBoost) do NOT need scaling
+4. Linear models, SVM, KNN, neural nets DO need scaling
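
Reviewer sketch (not part of the package): rule 1 above ("always fit on training data only") is easier to see when the fitted statistics are explicit. The class below is a minimal single-column z-score scaler in plain Python, meant only to illustrate the fit/transform split; `ZScoreScaler` is a hypothetical name, and real pipelines should keep using scikit-learn's `StandardScaler`. It uses the population standard deviation and falls back to a scale of 1.0 for constant columns.

```python
from statistics import fmean, pstdev


class ZScoreScaler:
    """Single-column z-score scaler that learns its statistics in fit()."""

    def fit(self, values: list[float]) -> "ZScoreScaler":
        self.mean_ = fmean(values)
        # Population standard deviation; use 1.0 for constant columns
        self.scale_ = pstdev(values) or 1.0
        return self

    def transform(self, values: list[float]) -> list[float]:
        # Always reuse the statistics learned in fit(); never recompute them here
        return [(v - self.mean_) / self.scale_ for v in values]


train, test = [1.0, 2.0, 3.0], [4.0]
scaler = ZScoreScaler().fit(train)   # statistics come from the training split only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # test data is transformed, never fitted
```

Because the test value is scaled with the training mean and deviation, it can fall well outside the range of the scaled training data, which is expected and is the point of avoiding leakage.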