npm - @groupby/ai-dev - Versions diffs - 0.5.7 → 0.5.8 - Mend

@groupby/ai-dev 0.5.7 → 0.5.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (40)

package/teams/fhr-ai-team/skills/e2e-testing/SKILL.md ADDED

@@ -0,0 +1,163 @@
+---
+name: e2e-testing
+description: >
+  Use when the user wants to run end-to-end tests of ML code, launch pipeline test runs,
+  or verify model outputs. Covers local pytest execution, Kubeflow pipeline launches,
+  MLflow metric validation, and pod-level debugging of failures.
+---
+# End-to-End ML Testing
+## Overview
+This skill handles all testing workflows for ML code, from local unit tests to full
+Kubeflow pipeline end-to-end runs and model validation.
+## Step 1: Determine Test Scope
+Use AskUserQuestion to ask:
+**Question:** "What type of testing do you want to run?"
+| Option | Description |
+|--------|-------------|
+| Local tests | Run pytest (unit and integration tests) |
+| Pipeline end-to-end | Launch a Kubeflow pipeline run and monitor it |
+| Model validation | Check MLflow metrics and model outputs |
+---
+## Local Tests
+### Identify the target
+- Determine which repo and test directory to run
+- Check for `pytest.ini`, `setup.cfg`, or `pyproject.toml` for test configuration
+- Check for test markers (e.g., `@pytest.mark.integration`, `@pytest.mark.slow`)
+### Run tests
+```bash
+pytest tests/ -v --tb=short
+```
+For specific test files or functions:
+```bash
+pytest tests/test_specific.py::test_function -v
+```
+With coverage:
+```bash
+pytest tests/ --cov=<package_name> --cov-report=term-missing
+```
+### Analyze failures
+- Read the full traceback
+- Check if it is a test environment issue (missing deps, wrong Python version, missing `.env`)
+- Check if it is a real code bug
+- Suggest fixes with exact code changes
+---
+## Pipeline End-to-End
+### Prerequisites check
+Before launching, verify:
+1. **`.env` file exists** in `attraqt-kubeflow-configs`:
+   ```bash
+   ls /Users/mehdi/dev/projects/attraqt-kubeflow-configs/.env
+   ```
+   If missing: `cp .env.dev .env`
+2. **`version_name` is valid**:
+   ```bash
+   python3 scripts/kf_query.py --pipeline-versions <pipeline_name>
+   ```
+3. **MongoDB config is correct** (for learning pipelines):
+   ```bash
+   mongosh "mongodb://10.11.96.21:27017/earlybirds" --quiet --eval '
+   const doc = db.predictors.findOne({"_id": ObjectId("<PREDICTOR_ID>")});
+   print(JSON.stringify(doc.config.batch, null, 2));
+   '
+   ```
+### Launch
+```bash
+cd /Users/mehdi/dev/projects/attraqt-kubeflow-configs/scripts
+python -m run -c <absolute_path_to_config>
+```
+### Monitor
+Poll the run status:
+```bash
+python3 scripts/kf_query.py <run_id>
+```
+Check for failures:
+```bash
+python3 scripts/kf_query.py <run_id> --failed
+```
+### Verify
+- All steps should show status "Succeeded"
+- Check output artifacts exist in GCS
+- For learning pipelines: verify MLflow run was created
+- For encoding pipelines: verify output encodings exist at expected GCS path
+### Debug Failures
+When a step fails:
+1. **Find the pod:**
+   ```bash
+   kubectl get pods -n kubeflow | grep <workflow-name>
+   ```
+2. **Read logs:**
+   ```bash
+   kubectl logs -n kubeflow <pod-name> --tail=200
+   kubectl logs -n kubeflow <pod-name> --previous  # if crashed
+   ```
+3. **Check events (OOM, scheduling, image pull):**
+   ```bash
+   kubectl describe pod -n kubeflow <pod-name>
+   ```
+4. **Common failure patterns:**
+   - OOM: increase memory in config, or reduce batch size in MongoDB
+   - Image pull error: wrong image version; verify with `kf_query.py --pipeline-versions`
+   - Config error: wrong arguments format (check `arguments` vs `custom_params`)
+   - GPU scheduling: check node availability with `kubectl get nodes`
+See `skills/ml-tooling-dev/references/kubectl-debug.md` for the full debugging reference.
+---
+## Model Validation
+### Fetch metrics
+```bash
+python3 scripts/mlflow_query.py run <run_id>
+```
+### Compare against baseline
+- Check key metrics (loss, accuracy, recall, precision) against previous runs
+- Use `mlflow_query.py runs <experiment_name>` to list recent runs for comparison
+### Check registered models
+```bash
+python3 scripts/mlflow_query.py model-for-predictor <predictor_id>
+python3 scripts/mlflow_query.py model <model_name>
+```
+Verify:
+- Model version was registered
+- Aliases are set correctly (e.g., "champion", "challenger")
+- Model artifact exists in the registry
+---
+## Skill Dependencies
+This skill invokes `ai.pierre:ml-tooling-dev` for all Kubeflow, MLflow, and MongoDB operations.
+For pipeline config generation, use `/plan-algo-tests` (invokes `ai.pierre:algo-test-planning`).

package/teams/fhr-ai-team/skills/grill-me/SKILL.md ADDED

@@ -0,0 +1,10 @@
+---
+name: grill-me
+description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
+---
+Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
+Ask the questions one at a time.
+If a question can be answered by exploring the codebase, explore the codebase instead.

package/teams/fhr-ai-team/skills/ml-tooling-dev/SKILL.md ADDED

@@ -0,0 +1,313 @@
+---
+name: ml-tooling-dev
+description: >
+  Manage, monitor, debug, create, and launch job configurations for the dev Kubeflow, MLflow,
+  and MongoDB environments. Use this skill when the user asks to:
+  - Check the status of a Kubeflow pipeline run or step
+  - Debug a failed Kubeflow job (logs, pod errors, OOM, config issues)
+  - Create or fill in a Kubeflow job config (python_batch_pipeline, scala_batch_pipeline, full pipelines)
+  - Launch / submit a pipeline run from a config file using run.py
+  - Look up MLflow runs, metrics, model versions, or registered model aliases
+  - Find pipeline step output paths (GCS paths for dataset preprocessing, query datasets, etc.)
+  - Understand what configs are needed to manually re-run a failed pipeline step
+  - Ask about predictor_id, strategy_id, image versions, GCS paths, pipeline version_names, or MLflow run IDs
+  - Read or update training hyperparameters in MongoDB (predictor config, learning config, model config)
+  - Run hyperparameter experiments by updating MongoDB config and launching Kubeflow runs
+  - Manage sequential experiment launches sharing the same predictor MongoDB document
+  Environments: DEV ONLY. Kubeflow: http://10.11.96.10/ | MLflow: http://10.11.96.16:5000/ |
+  MongoDB: mongodb://10.11.96.21:27017/earlybirds
+  Config/pipeline sources: attraqt-kubeflow-configs, attraqt-kubeflow-pipelines
+---
+# Kubeflow & MLflow Dev Management
+## Endpoints
+| Service | URL |
+|---------|-----|
+| Kubeflow UI | http://10.11.96.10/ |
+| MLflow UI | http://10.11.96.16:5000/ |
+| MongoDB | `mongodb://10.11.96.21:27017/earlybirds` (collection: `predictors`) |
+| Config files | `attraqt-kubeflow-configs/configs/development/` |
+| Pipeline code | `attraqt-kubeflow-pipelines/kubeflow_pipelines/pipelines/` |
+---
+## Available Scripts
+### `scripts/kf_query.py`: Query Kubeflow runs
+```bash
+python3 scripts/kf_query.py <run_id>                        # Run status + step overview
+python3 scripts/kf_query.py <run_id> --failed               # Only failed/running steps + pods
+python3 scripts/kf_query.py <run_id> --step <name>          # Filter by step name
+python3 scripts/kf_query.py <run_id> --all                  # Include internal KFP steps
+python3 scripts/kf_query.py --list --experiment <name>      # List runs in experiment
+python3 scripts/kf_query.py --experiments                   # List all experiments
+python3 scripts/kf_query.py --pipelines                     # List all pipelines + latest version_name
+python3 scripts/kf_query.py --pipeline-versions <name>      # List all versions of a pipeline
+```
+### `scripts/mlflow_query.py`: Query MLflow
+```bash
+python3 scripts/mlflow_query.py model-for-predictor <predictor_id>  # Find model by predictor_id
+python3 scripts/mlflow_query.py run <run_id>                         # Run metrics + params
+python3 scripts/mlflow_query.py runs <experiment_name>               # List runs in experiment
+python3 scripts/mlflow_query.py models                               # List registered models
+python3 scripts/mlflow_query.py model <model_name>                   # Versions + aliases
+```
+### `scripts/mongo_predictor.py`: Manage predictor training config in MongoDB
+Prefer this over raw `mongosh --eval` for any read/update/replace of `config.batch.<strategy>`. Strategy defaults to `semantic-search-learning`.
+```bash
+python3 scripts/mongo_predictor.py read   <predictor_id>                              # Print current strategy config
+python3 scripts/mongo_predictor.py update <predictor_id> --set k=v [--set k=v ...]    # Patch fields (dot-notation $set)
+python3 scripts/mongo_predictor.py apply  <predictor_id> --file <config.json>         # Replace whole strategy config
+python3 scripts/mongo_predictor.py diff   <predictor_id> --file <config.json>         # Show what apply would change
+```
+`update` keys are dot-paths *under* the strategy (e.g. `learningConfig.trainingArguments.perDeviceTrainBatchSize`). Values are auto-coerced (`true`/`false`/`null`/int/float/JSON literal/string). `apply` stores ints as `Int32` (uses `NumberInt` via extended JSON) so full-replace does not silently downgrade types to Double.
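A minimal sketch, not taken from the package, of how `--set k=v` pairs could become a dot-notation `$set` document under `config.batch.<strategy>`. The function names `coerce_value` and `build_set_document` are hypothetical, and using `json.loads` for the coercion is an assumption about how `true`/`false`/`null`/int/float/JSON literals might be recognised; the real `mongo_predictor.py` may behave differently.

```python
from __future__ import annotations

import json
from typing import Any

DEFAULT_STRATEGY = "semantic-search-learning"  # default strategy named in the skill text


def coerce_value(raw: str) -> Any:
    """Coerce a CLI string: JSON literals (true/false/null, numbers, objects) or plain string."""
    try:
        # json.loads handles true/false/null, ints, floats, and JSON objects/arrays.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Anything that is not valid JSON is kept as a plain string.
        return raw


def build_set_document(pairs: list[str], strategy: str = DEFAULT_STRATEGY) -> dict[str, Any]:
    """Turn ["a.b=1", ...] into {"$set": {"config.batch.<strategy>.a.b": 1, ...}}."""
    prefix = f"config.batch.{strategy}"
    fields: dict[str, Any] = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        # Keys are dot-paths under the strategy, so prepend the strategy prefix.
        fields[f"{prefix}.{key}"] = coerce_value(raw)
    return {"$set": fields}
```

The sketch only covers `update`. It does not attempt the `Int32` preservation that the text attributes to `apply`.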
+### `scripts/kf_logs.py`: Tail Kubeflow step pod logs
+Resolves run -> step -> impl pod -> `kubectl logs`. No more grepping `kubectl get pods` for pod names.
+```bash
+python3 scripts/kf_logs.py <run_id> --step <name>                # Tail logs from matching step pod(s)
+python3 scripts/kf_logs.py <run_id> --step <name> --previous     # Logs from previous (crashed) container
+python3 scripts/kf_logs.py <run_id> --step <name> -f             # Stream live (one matching pod only)
+python3 scripts/kf_logs.py <run_id> --step <name> --tail 500     # Tail N lines (default 200)
+python3 scripts/kf_logs.py <run_id> --list                       # List matching pods only
+python3 scripts/kf_logs.py <run_id> --all                        # Include KFP internal step pods
+```
+Driver / DAG steps (`display_name` ending in `-driver`) and KFP-internal plumbing steps are skipped by default; pass `--all` to include them. Pods are deduped by `pod_name` in case the KFP v2 API returns the same pod under multiple templated task entries.
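The filtering and deduplication described above can be sketched as follows. This is an illustration only, not the code of `kf_logs.py`: it assumes each task is a flat dict with `display_name` and `pod_name` keys, and it assumes `--step` does substring matching. The helper name `select_step_pods` is hypothetical, and the skipping of other KFP-internal plumbing steps is left out because the source does not say how those are identified.

```python
from __future__ import annotations

from typing import Iterable


def select_step_pods(
    tasks: Iterable[dict],
    step: str | None = None,
    include_all: bool = False,
) -> list[dict]:
    """Return unique step pods, skipping `-driver` steps unless include_all is set."""
    seen: set[str] = set()
    selected: list[dict] = []
    for task in tasks:
        name = task.get("display_name", "")
        pod = task.get("pod_name")
        if not pod:
            continue  # task entry without a pod (nothing to tail)
        if not include_all and name.endswith("-driver"):
            continue  # driver / DAG orchestration step
        if step is not None and step not in name:
            continue  # assumed: --step matches by substring
        if pod in seen:
            continue  # same pod reported under several task entries
        seen.add(pod)
        selected.append({"display_name": name, "pod_name": pod})
    return selected
```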
+### `scripts/kf_wait.py`: Wait for a step to reach a state
+```bash
+python3 scripts/kf_wait.py <run_id> --step <name> --state RUNNING                 # Default 600s timeout, 10s interval
+python3 scripts/kf_wait.py <run_id> --step <name> --state SUCCEEDED --timeout 3600
+python3 scripts/kf_wait.py <run_id> --step <name> --state RUNNING --interval 5
+python3 scripts/kf_wait.py <run_id> --step <name> --state RUNNING --quiet         # No progress output
+```
+Exit codes: `0` the step reached the requested state; `1` the run was not found, or the step ended in a terminal state other than the requested one; `2` the API was unreachable for the full wait window; `124` the wait timed out. Composable with the sequential-launch loop (apply config -> launch run -> `kf_wait` -> apply next).
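A minimal sketch of a polling loop that returns the exit codes listed above. It is not the implementation of `kf_wait.py`. The `get_state` callable, the `TERMINAL_STATES` set, and the reading of exit code `1` (run not found, or a terminal state other than the target) are assumptions made for illustration. The 600 s timeout and 10 s interval defaults come from the usage comments.

```python
from __future__ import annotations

import time
from typing import Callable, Optional

EXIT_REACHED = 0
EXIT_MISMATCH = 1       # run not found, or terminal state other than the target
EXIT_UNREACHABLE = 2    # API never answered during the whole window
EXIT_TIMEOUT = 124

# Assumed terminal state names; the real script may use a different set.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "SKIPPED", "CANCELED"}


def wait_for_state(
    get_state: Callable[[], Optional[str]],
    target: str,
    timeout: float = 600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll get_state() until it returns target, a terminal state, or time runs out."""
    deadline = clock() + timeout
    api_answered = False
    while True:
        try:
            state = get_state()
        except ConnectionError:
            pass  # unreachable on this attempt; keep polling until the deadline
        else:
            api_answered = True
            if state is None:
                return EXIT_MISMATCH  # run or step not found
            if state == target:
                return EXIT_REACHED
            if state in TERMINAL_STATES:
                return EXIT_MISMATCH  # ended in a different terminal state
        if clock() >= deadline:
            return EXIT_TIMEOUT if api_answered else EXIT_UNREACHABLE
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. A caller in a sequential-launch loop would apply the next config only when the return value is `0`.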
+---
+## Core Workflow: Launching a Pipeline Run
+The launcher lives in `attraqt-kubeflow-configs`. It requires a `.env` file at the root of that project.
+### Setup (one-time)
+```bash
+cd attraqt-kubeflow-configs
+# Check if .env exists already
+ls .env
+# If not, create it from the dev template:
+cp .env.dev .env
+```
+The `.env.dev` sets:
+```
+KUBEFLOW_HOST=http://10.11.96.10:80
+PIPELINES_BUCKET_NAME=xo-dev-ai-eu-kubeflow-artifacts
+MONGO_CONF_URI=10.11.96.19:8080
+```
+### Submitting a run
+```bash
+cd attraqt-kubeflow-configs/scripts
+python -m run -c <absolute_path_to_config_file>
+```
+Example, using an existing config:
+```bash
+cd attraqt-kubeflow-configs/scripts
+python -m run -c /Users/mehdi/dev/projects/attraqt-kubeflow-configs/configs/development/search/myer-learning.json
+```
+Example, using a newly generated config saved to a temp file:
+```bash
+# Save generated config
+cat > /tmp/my-job-config.json << 'EOF'
+{ ... }
+EOF
+cd attraqt-kubeflow-configs/scripts
+python -m run -c /tmp/my-job-config.json
+```
+### Before submitting: verify the `version_name`
+The `version_name` in the config must match an **existing** pipeline version in Kubeflow. Always check:
+```bash
+python3 scripts/kf_query.py --pipeline-versions <pipeline_name>
+# e.g.: --pipeline-versions python_batch_pipeline
+#       --pipeline-versions scala_batch_pipeline
+#       --pipeline-versions semantic_search_item_encoding
+```
+Use the most recent `version_name` from the output (e.g. `"0.1.271"`).
+---
+## Core Workflow: Debugging a Failed Run
+1. **Identify the failed step:**
+   ```bash
+   python3 scripts/kf_query.py <run_id> --failed
+   ```
+2. **Read the failing pod's logs** (no manual pod-name lookup; `kf_logs.py` resolves it):
+   ```bash
+   python3 scripts/kf_logs.py <run_id> --step <step_name> --tail 200
+   python3 scripts/kf_logs.py <run_id> --step <step_name> --previous   # if container crashed
+   python3 scripts/kf_logs.py <run_id> --step <step_name> -f           # stream live
+   ```
+3. **Inspect pod events** (OOM, scheduling, image pull, etc.). Get the pod name from `--list`, then `kubectl describe`:
+   ```bash
+   python3 scripts/kf_logs.py <run_id> --step <step_name> --list
+   kubectl describe pod -n kubeflow <pod-name>
+   ```
+4. **For resource/GPU issues:**
+   ```bash
+   kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable."nvidia\.com/gpu"
+   ```
+> See `references/kubectl-debug.md` for the full kubectl command reference and failure-pattern table.
+---
+## Core Workflow: Updating MongoDB Training Config
+Training hyperparameters are stored in MongoDB predictor documents, not in the Kubeflow config JSON. The Kubeflow config specifies *what* to run (pipeline, image, dataset paths); MongoDB specifies *how* to train (learning rate, batch size, epochs, etc.).
+### Quick read
+```bash
+python3 scripts/mongo_predictor.py read <PREDICTOR_ID>
+```
+### Quick update (dot notation for individual fields)
+```bash
+python3 scripts/mongo_predictor.py update <PREDICTOR_ID> \
+  --set pipelineConfig.maxSequenceLength=64 \
+  --set learningConfig.trainingArguments.perDeviceTrainBatchSize=128 \
+  --set learningConfig.trainingArguments.gradientAccumulationSteps=4 \
+  --set learningConfig.trainingArguments.numTrainEpochs=3 \
+  --set learningConfig.trainingArguments.gradientCheckpointing=true
+```
+### Stage and apply experiment configs (preferred for sequential launches)
+```bash
+# Preview the change before applying
+python3 scripts/mongo_predictor.py diff  <PREDICTOR_ID> --file experiment-A.json
+python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-A.json
+```
+`apply` replaces the whole `config.batch.<strategy>` document and preserves `Int32` types. Use `update --set` when patching a few fields; use `apply --file` when the experiment config is checked in or generated.
+> Raw `mongosh --eval` examples are kept in `references/mongodb-config.md` as a fallback.
+### Sequential launches (multiple experiments, same predictor)
+All runs for the same predictor share one MongoDB document. The full loop:
+```bash
+# Run A
+python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-A.json
+# launch run A from attraqt-kubeflow-configs (capture <run_id_A>)
+python3 scripts/kf_wait.py <run_id_A> --step get-container-base-task-2 --state RUNNING --timeout 600
+# Run B (only after A's pod has read the config)
+python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-B.json
+# launch run B (capture <run_id_B>)
+python3 scripts/kf_wait.py <run_id_B> --step get-container-base-task-2 --state RUNNING --timeout 600
+```
+Verify each run actually picked up its config:
+```bash
+python3 scripts/kf_logs.py <run_id> --step get-container-base-task-2 --tail 500 \
+  | grep -E 'maxSequenceLength:|perDeviceTrainBatchSize:|gradientAccumulationSteps:'
+```
+> See `references/mongodb-config.md` for full config schema, complete update templates, and gotchas.
+---
+## Core Workflow: Building a Job Config Interactively
+When the user asks to create a config for a Kubeflow job, ask only the questions needed, in this order:
+1. **What pipeline step?** (learning / evaluation / items-encoding / full pipeline / other)
+2. **Predictor ID?** (MongoDB ObjectId, e.g. `64f0a12b5856b11b7aa4e71e`)
+3. **Experiment name?** (check Kubeflow: `python3 scripts/kf_query.py --experiments`)
+4. **Image version?** Ask the user, or check `attraqt-kubeflow-pipelines/kubeflow_pipelines/pipelines/utils/versions.py` for defaults
+5. **Dataset paths?** GCS paths from previous completed steps; check Kubeflow UI or `kf_query.py <run_id>`
+6. **MLflow run_id?** For evaluation/encoding steps; check `mlflow_query.py model-for-predictor <predictor_id>`
+7. **Resource overrides?** Use defaults from `references/pipeline-configs.md` unless user specifies
+Then generate the config and offer to either save it and launch it, or just show it.
+**Key rules:**
+- `python_batch_pipeline` → uses `batch_config.arguments` for custom job params
+- `scala_batch_pipeline` → uses `batch_config.custom_params` (NOT `arguments`)
+- Always verify `version_name` with `kf_query.py --pipeline-versions <name>` before submitting
+- GPU jobs: always include `gpu_vendor: "nvidia.com/gpu"` and `gpu_accelerator_name: "nvidia-l4"` for L4 nodes
+> See `references/pipeline-configs.md` for full schemas, strategy IDs, image names, and GCS path patterns.
+> See `references/pipeline-steps.md` for step-by-step configs for the semantic search learning workflow.
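The key rules above can be expressed as a small pre-submit check. This is a hedged sketch: the helper name `check_batch_config` is hypothetical, and placing `gpu_vendor` and `gpu_accelerator_name` inside `batch_config` is an assumption, since the full schema lives in `references/pipeline-configs.md` and is not shown here.

```python
from __future__ import annotations


def check_batch_config(pipeline: str, batch_config: dict, uses_gpu: bool = False) -> list[str]:
    """Return a list of rule violations; an empty list means no rule was broken."""
    problems: list[str] = []
    if pipeline == "python_batch_pipeline" and "custom_params" in batch_config:
        problems.append("python_batch_pipeline expects batch_config.arguments, not custom_params")
    if pipeline == "scala_batch_pipeline" and "arguments" in batch_config:
        problems.append("scala_batch_pipeline expects batch_config.custom_params, not arguments")
    if uses_gpu:
        # Assumed location of the GPU fields: directly inside batch_config.
        if batch_config.get("gpu_vendor") != "nvidia.com/gpu":
            problems.append('GPU job is missing gpu_vendor: "nvidia.com/gpu"')
        if not batch_config.get("gpu_accelerator_name"):
            problems.append('GPU job is missing gpu_accelerator_name (e.g. "nvidia-l4" for L4 nodes)')
    return problems
```

The check does not verify `version_name`; that still requires `kf_query.py --pipeline-versions <name>` against the live cluster.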
+---
+## Core Workflow: Re-running Failed Pipeline Steps
+For **semantic search learning** (`semantic_search_learning_with_generated_analytics_pipeline`):
+| Step | Pipeline type | Key inputs needed |
+|------|--------------|-------------------|
+| Learning | `python_batch_pipeline` | Dataset paths from steps 2+3 |
+| Evaluation | `python_batch_pipeline` | MLflow `run_id` from learning; evaluation dataset from step 3 |
+| Items encoding | `scala_batch_pipeline` | MLflow `run_id` from learning |
+Get completed step output paths:
+```bash
+python3 scripts/kf_query.py <original_run_id>
+# Then check Kubeflow UI → that run → succeeded steps → Output artifacts tab
+```
+Get the MLflow `run_id` from a completed training:
+```bash
+python3 scripts/mlflow_query.py model-for-predictor <predictor_id>
+```
+> Full config templates for each step are in `references/pipeline-steps.md`.
+---
+## Quick Reference: Finding Things
+| What | Where |
+|------|-------|
+| Predictor ID | Kubeflow run config params (shown by `kf_query.py`) |
+| Training hyperparameters | MongoDB: `db.predictors.findOne({_id: ObjectId("<id>")}).config.batch` |
+| Image versions | `attraqt-kubeflow-pipelines/.../utils/versions.py` |
+| Dataset GCS paths | Kubeflow UI → run → step → Output artifacts; or `kf_query.py` |
+| MLflow run_id | `mlflow_query.py model-for-predictor <predictor_id>` |
+| Pipeline version_name | `kf_query.py --pipeline-versions <name>` |
+| Pipeline configs | `attraqt-kubeflow-configs/configs/development/` |
+| Strategy IDs & image names | `references/pipeline-configs.md` |
+| MongoDB config schema | `references/mongodb-config.md` |
+| Existing tenant configs | `attraqt-kubeflow-configs/configs/development/ai/<pipeline>/` |

package/teams/fhr-ai-team/skills/ml-tooling-dev/references/kubectl-debug.md ADDED

@@ -0,0 +1,165 @@
+# kubectl Debugging Reference for Kubeflow Jobs
+Kubeflow jobs run in the `kubeflow` namespace. Each pipeline step spawns pods.
+---
+## Quick Diagnostics
+On this dev cluster, KFP v2 pod names follow `<workflow-name>-<numeric-id>` (see "Getting Pod Name from Kubeflow UI" below). For ad-hoc lookups, grep by workflow name prefix. For scripted resolution, `scripts/kf_logs.py` uses the KFP API to map run + step to pods directly.
+```bash
+# Find all pods for a run by workflow name prefix (ALWAYS do this first)
+kubectl get pods -n kubeflow | grep <workflow-name>
+# List recent pods (sort by creation time)
+kubectl get pods -n kubeflow --sort-by=.metadata.creationTimestamp | tail -30
+# Find pods by predictor_id label
+kubectl get pods -n kubeflow -l predictor-id=<predictor_id>
+# Find pods by strategy
+kubectl get pods -n kubeflow -l strategy-id=<strategy_id>
+# Find pods by image (algo-search-batch, semantic-search, etc.)
+kubectl get pods -n kubeflow | grep <image_name>
+```
+---
+## Pod Logs
+```bash
+# Get logs from a pod (last 200 lines)
+kubectl logs -n kubeflow <pod-name> --tail=200
+# Stream live logs
+kubectl logs -n kubeflow <pod-name> -f
+# Get logs from previous crashed container
+kubectl logs -n kubeflow <pod-name> --previous
+# Get logs for a specific container in a pod (e.g. main container)
+kubectl logs -n kubeflow <pod-name> -c main
+# Get logs from all pods with a label
+kubectl logs -n kubeflow -l predictor-id=<predictor_id> --tail=100
+```
+---
+## Pod Status & Events
+```bash
+# Describe a pod (shows events, exit codes, OOM kills, node assignment)
+kubectl describe pod -n kubeflow <pod-name>
+# Check pod status quickly
+kubectl get pod -n kubeflow <pod-name> -o wide
+# Get exit code from a completed/failed pod
+kubectl get pod -n kubeflow <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
+# Get reason for failure
+kubectl get pod -n kubeflow <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
+# List all events in kubeflow namespace (helpful for scheduling failures)
+kubectl get events -n kubeflow --sort-by=.lastTimestamp | tail -30
+# Events for a specific pod
+kubectl get events -n kubeflow --field-selector involvedObject.name=<pod-name>
+```
+---
+## GPU / Resource Issues
+```bash
+# Check GPU node availability (all GPU types)
+kubectl get nodes -l cloud.google.com/gke-accelerator -o custom-columns=NAME:.metadata.name,ACCELERATOR:.metadata.labels.cloud\\.google\\.com/gke-accelerator,GPU:.status.allocatable.nvidia\\.com/gpu
+# Check if GPU is allocated (look for Limits in describe)
+kubectl describe pod -n kubeflow <pod-name> | grep -A5 "Limits:"
+# Check if pod is stuck Pending (resource pressure)
+kubectl describe pod -n kubeflow <pod-name> | grep -A10 "Events:"
+```
+---
+## Argo Workflow (Kubeflow underlying engine)
+```bash
+# List workflows (each pipeline run = 1 workflow)
+kubectl get workflows -n kubeflow | tail -20
+# Get workflow status
+kubectl get workflow -n kubeflow <workflow-name> -o jsonpath='{.status.phase}'
+# Describe workflow (shows all steps and their status)
+kubectl describe workflow -n kubeflow <workflow-name>
+# Get workflow name from Kubeflow run_id
+# The workflow name is shown in the Kubeflow UI run details page URL/title
+```
+---
+## Common Failure Patterns
+| Symptom | What to check |
+|---------|---------------|
+| Pod stuck `Pending` | `kubectl describe pod` → Events: check "Insufficient nvidia.com/gpu" or "Insufficient memory" |
+| Pod `OOMKilled` | Increase `memory` in batch_config; `describe pod` shows `OOMKilled` in state |
+| Pod `Error` exit code 1 | Check `kubectl logs` for the application error (CUDA OOM, missing dataset path, etc.) |
+| Pod `Error` exit code 137 | Container received SIGKILL (128 + 9), most often from the Linux kernel OOM killer; increase `memory` in batch_config |
+| Workflow stuck | Check if a previous step pod is still running: `kubectl get pods -n kubeflow` |
+| `ImagePullBackOff` | Wrong image version in config; check version exists in registry |
+| `CrashLoopBackOff` | App crashes on startup; check `--previous` logs |
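The table above can be read as a lookup from pod phase, termination reason, and exit code to a next action. The sketch below encodes only the rows of the table. The function name `diagnose` is hypothetical, and the inputs are assumed to come from the `kubectl get pod ... -o jsonpath` commands shown earlier in this reference.

```python
from __future__ import annotations


def diagnose(phase: str = "", reason: str = "", exit_code: int | None = None) -> str:
    """Map pod phase / reason / exit code to the hint from the failure-pattern table."""
    if reason == "OOMKilled" or exit_code == 137:
        return "out of memory: increase `memory` in batch_config"
    if reason == "ImagePullBackOff":
        return "wrong image version in config: check the version exists in the registry"
    if reason == "CrashLoopBackOff":
        return "app crashes on startup: check `--previous` logs"
    if phase == "Pending":
        return "scheduling: check `kubectl describe pod` events for insufficient GPU or memory"
    if exit_code == 1:
        return "application error: check `kubectl logs`"
    return "no known pattern: start with `kubectl describe pod`"
```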
+---
+## Getting Pod Name from Kubeflow UI
+On this dev cluster, the KFP v2 API's `task_details[].child_tasks[].pod_name` field is the kubectl pod name. Pods follow the pattern:
+```
+<workflow-name>-<numeric-id>
+```
+Example: workflow `python-batch-pipeline-26dsn` produces pods like:
+```
+python-batch-pipeline-26dsn-1314600408   # workflow root
+python-batch-pipeline-26dsn-2177690451   # a step pod
+python-batch-pipeline-26dsn-2721071988   # a driver pod
+```
+The step's role is encoded in the `display_name` field of `task_details`, NOT in the pod name. Steps ending in `-driver` are KFP infrastructure (driver / DAG orchestration); user code runs in steps without that suffix. `scripts/kf_logs.py` uses this distinction to filter.
+If you ever encounter a KFP configuration that produces longer pod names with step-name infixes (an older KFP v2 pattern was `<workflow>-<step>-system-container-impl-<hash>`), grep is the fallback:
+```bash
+kubectl get pods -n kubeflow | grep <workflow-name>
+```
+To find pods for a run:
+1. Get the workflow name from Kubeflow UI (Runs -> click run -> title/URL), or use `python3 scripts/kf_logs.py <run_id> --list` to print step + pod_name pairs.
+2. Use `kubectl logs -n kubeflow <pod-name>` directly, or let `kf_logs.py --step <name>` resolve and tail in one call.
+---
+## Useful One-liners
+```bash
+# Get all failed pods in kubeflow namespace
+kubectl get pods -n kubeflow --field-selector=status.phase=Failed
+# Delete a completed/failed pod (cleanup)
+kubectl delete pod -n kubeflow <pod-name>
+# Check resource quotas
+kubectl describe resourcequota -n kubeflow
+# Exec into a running pod for debugging
+kubectl exec -it -n kubeflow <pod-name> -- /bin/bash
+```