@groupby/ai-dev 0.5.7 → 0.5.8

Files changed (40)
  1. package/package.json +1 -1
  2. package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/full.md +31 -0
  3. package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/light.md +7 -0
  4. package/teams/fhr-ai-team/github/copilot-instructions.md +24 -0
  5. package/teams/fhr-ai-team/github/instructions/python.instructions.md +23 -0
  6. package/teams/fhr-ai-team/github/pull_request_template.md +21 -0
  7. package/teams/fhr-ai-team/prompts/brainstorm.md +7 -0
  8. package/teams/fhr-ai-team/prompts/plan-algo-tests.md +7 -0
  9. package/teams/fhr-ai-team/prompts/plan.md +7 -0
  10. package/teams/fhr-ai-team/prompts/pr-description.md +7 -0
  11. package/teams/fhr-ai-team/prompts/test.md +7 -0
  12. package/teams/fhr-ai-team/resources/AGENTS.md +55 -0
  13. package/teams/fhr-ai-team/resources/CLAUDE.md +52 -0
  14. package/teams/fhr-ai-team/resources/README.md +51 -0
  15. package/teams/fhr-ai-team/resources/claude-code-setup.md +60 -0
  16. package/teams/fhr-ai-team/resources/copilot-setup.md +64 -0
  17. package/teams/fhr-ai-team/resources/onboarding.md +179 -0
  18. package/teams/fhr-ai-team/resources/opencode-install.md +29 -0
  19. package/teams/fhr-ai-team/resources/opencode-setup.md +43 -0
  20. package/teams/fhr-ai-team/skills/algo-test-planning/SKILL.md +192 -0
  21. package/teams/fhr-ai-team/skills/algo-test-planning/references/pipeline-registry.md +280 -0
  22. package/teams/fhr-ai-team/skills/brainstorming/SKILL.md +111 -0
  23. package/teams/fhr-ai-team/skills/e2e-testing/SKILL.md +163 -0
  24. package/teams/fhr-ai-team/skills/grill-me/SKILL.md +10 -0
  25. package/teams/fhr-ai-team/skills/ml-tooling-dev/SKILL.md +313 -0
  26. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/kubectl-debug.md +165 -0
  27. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/mongodb-config.md +218 -0
  28. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-configs.md +190 -0
  29. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-steps.md +182 -0
  30. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_logs.py +203 -0
  31. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_query.py +233 -0
  32. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_wait.py +195 -0
  33. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mlflow_query.py +252 -0
  34. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mongo_predictor.py +352 -0
  35. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/SKILL.md +230 -0
  36. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/dataset-naming.md +190 -0
  37. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/domain-vocabulary.md +447 -0
  38. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/repo-dependency-graph.md +264 -0
  39. package/teams/fhr-ai-team/skills/planning/SKILL.md +138 -0
  40. package/teams/fhr-ai-team/skills/pr-description/SKILL.md +94 -0
@@ -0,0 +1,163 @@
1
+ ---
2
+ name: e2e-testing
3
+ description: >
4
+ Use when the user wants to run end-to-end tests of ML code, launch pipeline test runs,
5
+ or verify model outputs. Covers local pytest execution, Kubeflow pipeline launches,
6
+ MLflow metric validation, and pod-level debugging of failures.
7
+ ---
8
+
9
+ # End-to-End ML Testing
10
+
11
+ ## Overview
12
+
13
+ This skill handles all testing workflows for ML code, from local unit tests to full
14
+ Kubeflow pipeline end-to-end runs and model validation.
15
+
16
+ ## Step 1: Determine Test Scope
17
+
18
+ Use AskUserQuestion to ask:
19
+
20
+ **Question:** "What type of testing do you want to run?"
21
+
22
+ | Option | Description |
23
+ |--------|-------------|
24
+ | Local tests | Run pytest (unit and integration tests) |
25
+ | Pipeline end-to-end | Launch a Kubeflow pipeline run and monitor it |
26
+ | Model validation | Check MLflow metrics and model outputs |
27
+
28
+ ---
29
+
30
+ ## Local Tests
31
+
32
+ ### Identify the target
33
+ - Determine which repo and test directory to run
34
+ - Check for `pytest.ini`, `setup.cfg`, or `pyproject.toml` for test configuration
35
+ - Check for test markers (e.g., `@pytest.mark.integration`, `@pytest.mark.slow`)
36
+
37
+ ### Run tests
38
+ ```bash
39
+ pytest tests/ -v --tb=short
40
+ ```
41
+
42
+ For specific test files or functions:
43
+ ```bash
44
+ pytest tests/test_specific.py::test_function -v
45
+ ```
46
+
47
+ With coverage:
48
+ ```bash
49
+ pytest tests/ --cov=<package_name> --cov-report=term-missing
50
+ ```
51
+
52
+ ### Analyze failures
53
+ - Read the full traceback
54
+ - Check if it is a test environment issue (missing deps, wrong Python version, missing .env)
55
+ - Check if it is a real code bug
56
+ - Suggest fixes with exact code changes
57
+
58
+ ---
59
+
60
+ ## Pipeline End-to-End
61
+
62
+ ### Prerequisites check
63
+ Before launching, verify:
64
+
65
+ 1. **`.env` file exists** in `attraqt-kubeflow-configs`:
66
+ ```bash
67
+ ls /Users/mehdi/dev/projects/attraqt-kubeflow-configs/.env
68
+ ```
69
+ If missing: `cp .env.dev .env`
70
+
71
+ 2. **`version_name` is valid**:
72
+ ```bash
73
+ python3 scripts/kf_query.py --pipeline-versions <pipeline_name>
74
+ ```
75
+
76
+ 3. **MongoDB config is correct** (for learning pipelines):
77
+ ```bash
78
+ mongosh "mongodb://10.11.96.21:27017/earlybirds" --quiet --eval '
79
+ const doc = db.predictors.findOne({"_id": ObjectId("<PREDICTOR_ID>")});
80
+ print(JSON.stringify(doc.config.batch, null, 2));
81
+ '
82
+ ```
83
+
84
+ ### Launch
85
+ ```bash
86
+ cd /Users/mehdi/dev/projects/attraqt-kubeflow-configs/scripts
87
+ python -m run -c <absolute_path_to_config>
88
+ ```
89
+
90
+ ### Monitor
91
+ Poll the run status:
92
+ ```bash
93
+ python3 scripts/kf_query.py <run_id>
94
+ ```
95
+
96
+ Check for failures:
97
+ ```bash
98
+ python3 scripts/kf_query.py <run_id> --failed
99
+ ```
100
+
101
+ ### Verify
102
+ - All steps should show status "Succeeded"
103
+ - Check output artifacts exist in GCS
104
+ - For learning pipelines: verify MLflow run was created
105
+ - For encoding pipelines: verify output encodings exist at expected GCS path
106
+
107
+ ### Debug Failures
108
+ When a step fails:
109
+
110
+ 1. **Find the pod:**
111
+ ```bash
112
+ kubectl get pods -n kubeflow | grep <workflow-name>
113
+ ```
114
+
115
+ 2. **Read logs:**
116
+ ```bash
117
+ kubectl logs -n kubeflow <pod-name> --tail=200
118
+ kubectl logs -n kubeflow <pod-name> --previous # if crashed
119
+ ```
120
+
121
+ 3. **Check events (OOM, scheduling, image pull):**
122
+ ```bash
123
+ kubectl describe pod -n kubeflow <pod-name>
124
+ ```
125
+
126
+ 4. **Common failure patterns:**
127
+ - OOM: increase memory in config, or reduce batch size in MongoDB
128
+ - Image pull error: wrong image version; verify with `kf_query.py --pipeline-versions`
129
+ - Config error: wrong arguments format (check `arguments` vs `custom_params`)
130
+ - GPU scheduling: check node availability with `kubectl get nodes`
131
+
132
+ See `skills/ml-tooling-dev/references/kubectl-debug.md` for the full debugging reference.
133
+
134
+ ---
135
+
136
+ ## Model Validation
137
+
138
+ ### Fetch metrics
139
+ ```bash
140
+ python3 scripts/mlflow_query.py run <run_id>
141
+ ```
142
+
143
+ ### Compare against baseline
144
+ - Check key metrics (loss, accuracy, recall, precision) against previous runs
145
+ - Use `mlflow_query.py runs <experiment_name>` to list recent runs for comparison
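The baseline comparison above can be sketched as a small check. This is an illustration only: `find_regressions`, the `LOWER_IS_BETTER` set, and the absolute `tolerance` are hypothetical names and choices, not part of `mlflow_query.py`, and the metric dictionaries are assumed to have been copied by hand from its output.

```python
# Hypothetical helper: flag metrics that got worse than a baseline run.
# Assumption: "loss" is better when lower; every other metric is better when higher.
LOWER_IS_BETTER = {"loss"}


def find_regressions(baseline, candidate, tolerance=0.01):
    """Return {metric: (baseline_value, candidate_value)} for metrics that regressed."""
    regressions = {}
    for name, base in baseline.items():
        if name not in candidate:
            continue  # metric not logged for the candidate run; nothing to compare
        new = candidate[name]
        # Positive worse_by means the candidate is worse than the baseline.
        worse_by = (new - base) if name in LOWER_IS_BETTER else (base - new)
        if worse_by > tolerance:
            regressions[name] = (base, new)
    return regressions
```

An empty result means no metric moved in the wrong direction by more than `tolerance`; a relative threshold may suit metrics with very different scales better.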
146
+
147
+ ### Check registered models
148
+ ```bash
149
+ python3 scripts/mlflow_query.py model-for-predictor <predictor_id>
150
+ python3 scripts/mlflow_query.py model <model_name>
151
+ ```
152
+
153
+ Verify:
154
+ - Model version was registered
155
+ - Aliases are set correctly (e.g., "champion", "challenger")
156
+ - Model artifact exists in the registry
157
+
158
+ ---
159
+
160
+ ## Skill Dependencies
161
+
162
+ This skill invokes `ai.pierre:ml-tooling-dev` for all Kubeflow, MLflow, and MongoDB operations.
163
+ For pipeline config generation, use `/plan-algo-tests` (invokes `ai.pierre:algo-test-planning`).
@@ -0,0 +1,10 @@
1
+ ---
2
+ name: grill-me
3
+ description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
4
+ ---
5
+
6
+ Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
7
+
8
+ Ask the questions one at a time.
9
+
10
+ If a question can be answered by exploring the codebase, explore the codebase instead.
@@ -0,0 +1,313 @@
1
+ ---
2
+ name: ml-tooling-dev
3
+ description: >
4
+ Manage, monitor, debug, create, and launch job configurations for the dev Kubeflow, MLflow,
5
+ and MongoDB environments. Use this skill when the user asks to:
6
+ - Check the status of a Kubeflow pipeline run or step
7
+ - Debug a failed Kubeflow job (logs, pod errors, OOM, config issues)
8
+ - Create or fill in a Kubeflow job config (python_batch_pipeline, scala_batch_pipeline, full pipelines)
9
+ - Launch / submit a pipeline run from a config file using run.py
10
+ - Look up MLflow runs, metrics, model versions, or registered model aliases
11
+ - Find pipeline step output paths (GCS paths for dataset preprocessing, query datasets, etc.)
12
+ - Understand what configs are needed to manually re-run a failed pipeline step
13
+ - Ask about predictor_id, strategy_id, image versions, GCS paths, pipeline version_names, or MLflow run IDs
14
+ - Read or update training hyperparameters in MongoDB (predictor config, learning config, model config)
15
+ - Run hyperparameter experiments by updating MongoDB config and launching Kubeflow runs
16
+ - Manage sequential experiment launches sharing the same predictor MongoDB document
17
+ Environments: DEV ONLY. Kubeflow: http://10.11.96.10/ | MLflow: http://10.11.96.16:5000/ |
18
+ MongoDB: mongodb://10.11.96.21:27017/earlybirds
19
+ Config/pipeline sources: attraqt-kubeflow-configs, attraqt-kubeflow-pipelines
20
+ ---
21
+
22
+ # Kubeflow & MLflow Dev Management
23
+
24
+ ## Endpoints
25
+
26
+ | Service | URL |
27
+ |---------|-----|
28
+ | Kubeflow UI | http://10.11.96.10/ |
29
+ | MLflow UI | http://10.11.96.16:5000/ |
30
+ | MongoDB | `mongodb://10.11.96.21:27017/earlybirds` (collection: `predictors`) |
31
+ | Config files | `attraqt-kubeflow-configs/configs/development/` |
32
+ | Pipeline code | `attraqt-kubeflow-pipelines/kubeflow_pipelines/pipelines/` |
33
+
34
+ ---
35
+
36
+ ## Available Scripts
37
+
38
+ ### `scripts/kf_query.py`: Query Kubeflow runs
39
+
40
+ ```bash
41
+ python3 scripts/kf_query.py <run_id> # Run status + step overview
42
+ python3 scripts/kf_query.py <run_id> --failed # Only failed/running steps + pods
43
+ python3 scripts/kf_query.py <run_id> --step <name> # Filter by step name
44
+ python3 scripts/kf_query.py <run_id> --all # Include internal KFP steps
45
+ python3 scripts/kf_query.py --list --experiment <name> # List runs in experiment
46
+ python3 scripts/kf_query.py --experiments # List all experiments
47
+ python3 scripts/kf_query.py --pipelines # List all pipelines + latest version_name
48
+ python3 scripts/kf_query.py --pipeline-versions <name> # List all versions of a pipeline
49
+ ```
50
+
51
+ ### `scripts/mlflow_query.py`: Query MLflow
52
+
53
+ ```bash
54
+ python3 scripts/mlflow_query.py model-for-predictor <predictor_id> # Find model by predictor_id
55
+ python3 scripts/mlflow_query.py run <run_id> # Run metrics + params
56
+ python3 scripts/mlflow_query.py runs <experiment_name> # List runs in experiment
57
+ python3 scripts/mlflow_query.py models # List registered models
58
+ python3 scripts/mlflow_query.py model <model_name> # Versions + aliases
59
+ ```
60
+
61
+ ### `scripts/mongo_predictor.py`: Manage predictor training config in MongoDB
62
+
63
+ Prefer this over raw `mongosh --eval` for any read/update/replace of `config.batch.<strategy>`. Strategy defaults to `semantic-search-learning`.
64
+
65
+ ```bash
66
+ python3 scripts/mongo_predictor.py read <predictor_id> # Print current strategy config
67
+ python3 scripts/mongo_predictor.py update <predictor_id> --set k=v [--set k=v ...] # Patch fields (dot-notation $set)
68
+ python3 scripts/mongo_predictor.py apply <predictor_id> --file <config.json> # Replace whole strategy config
69
+ python3 scripts/mongo_predictor.py diff <predictor_id> --file <config.json> # Show what apply would change
70
+ ```
71
+
72
+ `update` keys are dot-paths *under* the strategy (e.g. `learningConfig.trainingArguments.perDeviceTrainBatchSize`). Values are auto-coerced (`true`/`false`/`null`/int/float/JSON literal/string). `apply` stores ints as `Int32` (uses `NumberInt` via extended JSON) so full-replace does not silently downgrade types to Double.
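To make the coercion order and the dot-path mapping concrete, here is a minimal sketch. It is not the code of `mongo_predictor.py`: `coerce_value` and `build_set_document` are hypothetical names, and the precedence shown (keywords, then int, then float, then JSON literal, then string) is only one plausible reading of the description above.

```python
import json


def coerce_value(raw: str):
    """Turn a CLI string into a typed value (assumed precedence, see lead-in)."""
    lowered = raw.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    if lowered == "null":
        return None
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    try:
        return json.loads(raw)  # JSON literal such as [1, 2] or {"a": 1}
    except json.JSONDecodeError:
        return raw  # anything else stays a plain string


def build_set_document(strategy: str, assignments: list[str]) -> dict:
    """Map k=v pairs to a MongoDB $set document rooted at config.batch.<strategy>."""
    fields = {}
    for assignment in assignments:
        key, _, raw = assignment.partition("=")
        fields[f"config.batch.{strategy}.{key}"] = coerce_value(raw)
    return {"$set": fields}
```

For example, `build_set_document("semantic-search-learning", ["learningConfig.trainingArguments.numTrainEpochs=3"])` yields a `$set` on the full path with the integer `3` as its value.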
73
+
74
+ ### `scripts/kf_logs.py`: Tail Kubeflow step pod logs
75
+
76
+ Resolves run -> step -> impl pod -> `kubectl logs`, so there is no need to grep `kubectl get pods` for pod names.
77
+
78
+ ```bash
79
+ python3 scripts/kf_logs.py <run_id> --step <name> # Tail logs from matching step pod(s)
80
+ python3 scripts/kf_logs.py <run_id> --step <name> --previous # Logs from previous (crashed) container
81
+ python3 scripts/kf_logs.py <run_id> --step <name> -f # Stream live (one matching pod only)
82
+ python3 scripts/kf_logs.py <run_id> --step <name> --tail 500 # Tail N lines (default 200)
83
+ python3 scripts/kf_logs.py <run_id> --list # List matching pods only
84
+ python3 scripts/kf_logs.py <run_id> --all # Include KFP internal step pods
85
+ ```
86
+
87
+ Driver / DAG steps (`display_name` ending in `-driver`) and KFP-internal plumbing steps are skipped by default; pass `--all` to include them. Pods are deduped by `pod_name` in case the KFP v2 API returns the same pod under multiple templated task entries.
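The filtering and deduplication described above might look roughly like this. The sketch assumes the task entries have already been flattened into dictionaries with `display_name` and `pod_name` keys; `select_step_pods` and `include_internal` are hypothetical names, and the real script may also filter other KFP-internal steps that this example ignores.

```python
def select_step_pods(tasks, include_internal=False):
    """Return unique (display_name, pod_name) pairs, skipping driver steps by default."""
    seen = set()
    selected = []
    for task in tasks:
        name = task.get("display_name", "")
        pod = task.get("pod_name")
        if not pod or pod in seen:
            continue  # no pod yet, or the same pod listed under another task entry
        if name.endswith("-driver") and not include_internal:
            continue  # driver / DAG orchestration step
        seen.add(pod)
        selected.append((name, pod))
    return selected
```

Passing `include_internal=True` corresponds to the `--all` behaviour described above.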
88
+
89
+ ### `scripts/kf_wait.py`: Wait for a step to reach a state
90
+
91
+ ```bash
92
+ python3 scripts/kf_wait.py <run_id> --step <name> --state RUNNING # Default 600s timeout, 10s interval
93
+ python3 scripts/kf_wait.py <run_id> --step <name> --state SUCCEEDED --timeout 3600
94
+ python3 scripts/kf_wait.py <run_id> --step <name> --state RUNNING --interval 5
95
+ python3 scripts/kf_wait.py <run_id> --step <name> --state RUNNING --quiet # No progress output
96
+ ```
97
+
98
+ Exit codes: `0` reached, `1` run not found or terminal-mismatch, `2` API unreachable for the full window, `124` timeout. Composable with the sequential-launch loop (apply config -> launch run -> `kf_wait` -> apply next).
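When `kf_wait.py` is driven from another script, the exit codes listed above can be turned into readable outcomes. The wrapper below is a sketch: `describe_kf_wait_exit`, `wait_for_step`, and the injectable `runner` parameter are hypothetical, and the relative path `scripts/kf_wait.py` assumes the command runs from the skill directory.

```python
import subprocess
import sys

# Exit codes as documented in the skill text.
KF_WAIT_OUTCOMES = {
    0: "reached",
    1: "run not found or terminal-mismatch",
    2: "API unreachable for the full window",
    124: "timeout",
}


def describe_kf_wait_exit(code: int) -> str:
    """Translate a kf_wait.py exit code into a short description."""
    return KF_WAIT_OUTCOMES.get(code, f"unexpected exit code {code}")


def wait_for_step(run_id, step, state="RUNNING", timeout=600, runner=subprocess.run):
    """Run kf_wait.py and return (reached, description)."""
    cmd = [
        sys.executable, "scripts/kf_wait.py", run_id,
        "--step", step, "--state", state, "--timeout", str(timeout),
    ]
    result = runner(cmd)
    return result.returncode == 0, describe_kf_wait_exit(result.returncode)
```

Keeping `runner` injectable makes it possible to exercise the wrapper without a reachable Kubeflow API.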
99
+
100
+ ---
101
+
102
+ ## Core Workflow: Launching a Pipeline Run
103
+
104
+ The launcher lives in `attraqt-kubeflow-configs`. It requires a `.env` file at the root of that project.
105
+
106
+ ### Setup (one-time)
107
+
108
+ ```bash
109
+ cd attraqt-kubeflow-configs
110
+ # Check if .env exists already
111
+ ls .env
112
+ # If not, create it from the dev template:
113
+ cp .env.dev .env
114
+ ```
115
+
116
+ The `.env.dev` sets:
117
+ ```
118
+ KUBEFLOW_HOST=http://10.11.96.10:80
119
+ PIPELINES_BUCKET_NAME=xo-dev-ai-eu-kubeflow-artifacts
120
+ MONGO_CONF_URI=10.11.96.19:8080
121
+ ```
122
+
123
+ ### Submitting a run
124
+
125
+ ```bash
126
+ cd attraqt-kubeflow-configs/scripts
127
+ python -m run -c <absolute_path_to_config_file>
128
+ ```
129
+
130
+ Example, using an existing config:
131
+ ```bash
132
+ cd attraqt-kubeflow-configs/scripts
133
+ python -m run -c /Users/mehdi/dev/projects/attraqt-kubeflow-configs/configs/development/search/myer-learning.json
134
+ ```
135
+
136
+ Example, using a newly generated config saved to a temp file:
137
+ ```bash
138
+ # Save generated config
139
+ cat > /tmp/my-job-config.json << 'EOF'
140
+ { ... }
141
+ EOF
142
+
143
+ cd attraqt-kubeflow-configs/scripts
144
+ python -m run -c /tmp/my-job-config.json
145
+ ```
146
+
147
+ ### Before submitting: verify the `version_name`
148
+
149
+ The `version_name` in the config must match an **existing** pipeline version in Kubeflow. Always check:
150
+
151
+ ```bash
152
+ python3 scripts/kf_query.py --pipeline-versions <pipeline_name>
153
+ # e.g.: --pipeline-versions python_batch_pipeline
154
+ # --pipeline-versions scala_batch_pipeline
155
+ # --pipeline-versions semantic_search_item_encoding
156
+ ```
157
+
158
+ Use the most recent `version_name` from the output (e.g. `"0.1.271"`).
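Picking the most recent version by eye is easy to get wrong because string ordering and numeric ordering disagree. The sketch below assumes every `version_name` is a purely numeric dotted string such as `0.1.271`; `latest_version_name` is a hypothetical helper, not part of `kf_query.py`.

```python
def latest_version_name(version_names):
    """Return the highest version, comparing dotted components as integers."""
    def as_numbers(name):
        return tuple(int(part) for part in name.split("."))
    return max(version_names, key=as_numbers)
```

With plain string comparison, `0.1.99` would sort above `0.1.271`, which is why the components are compared as integers here.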
159
+
160
+ ---
161
+
162
+ ## Core Workflow: Debugging a Failed Run
163
+
164
+ 1. **Identify the failed step:**
165
+ ```bash
166
+ python3 scripts/kf_query.py <run_id> --failed
167
+ ```
168
+
169
+ 2. **Read the failing pod's logs** (no manual pod-name lookup; `kf_logs.py` resolves it):
170
+ ```bash
171
+ python3 scripts/kf_logs.py <run_id> --step <step_name> --tail 200
172
+ python3 scripts/kf_logs.py <run_id> --step <step_name> --previous # if container crashed
173
+ python3 scripts/kf_logs.py <run_id> --step <step_name> -f # stream live
174
+ ```
175
+
176
+ 3. **Inspect pod events** (OOM, scheduling, image pull, etc.). Get the pod name from `--list`, then `kubectl describe`:
177
+ ```bash
178
+ python3 scripts/kf_logs.py <run_id> --step <step_name> --list
179
+ kubectl describe pod -n kubeflow <pod-name>
180
+ ```
181
+
182
+ 4. **For resource/GPU issues:**
183
+ ```bash
184
+ kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable."nvidia\.com/gpu"
185
+ ```
186
+
187
+ > See `references/kubectl-debug.md` for the full kubectl command reference and failure-pattern table.
188
+
189
+ ---
190
+
191
+ ## Core Workflow: Updating MongoDB Training Config
192
+
193
+ Training hyperparameters are stored in MongoDB predictor documents, not in the Kubeflow config JSON. The Kubeflow config specifies *what* to run (pipeline, image, dataset paths); MongoDB specifies *how* to train (learning rate, batch size, epochs, etc.).
194
+
195
+ ### Quick read
196
+
197
+ ```bash
198
+ python3 scripts/mongo_predictor.py read <PREDICTOR_ID>
199
+ ```
200
+
201
+ ### Quick update (dot notation for individual fields)
202
+
203
+ ```bash
204
+ python3 scripts/mongo_predictor.py update <PREDICTOR_ID> \
205
+ --set pipelineConfig.maxSequenceLength=64 \
206
+ --set learningConfig.trainingArguments.perDeviceTrainBatchSize=128 \
207
+ --set learningConfig.trainingArguments.gradientAccumulationSteps=4 \
208
+ --set learningConfig.trainingArguments.numTrainEpochs=3 \
209
+ --set learningConfig.trainingArguments.gradientCheckpointing=true
210
+ ```
211
+
212
+ ### Stage and apply experiment configs (preferred for sequential launches)
213
+
214
+ ```bash
215
+ # Preview the change before applying
216
+ python3 scripts/mongo_predictor.py diff <PREDICTOR_ID> --file experiment-A.json
217
+ python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-A.json
218
+ ```
219
+
220
+ `apply` replaces the whole `config.batch.<strategy>` document and preserves `Int32` types. Use `update --set` when patching a few fields; use `apply --file` when the experiment config is checked in or generated.
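The type-preservation point can be illustrated with MongoDB Extended JSON, where an integer is wrapped as `{"$numberInt": "64"}` so it is not read back as a Double. This is a sketch of the idea only: `to_extended_json` is a hypothetical helper, and the exact representation used inside `mongo_predictor.py` is not shown in the source.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def to_extended_json(value):
    """Recursively wrap Python ints in canonical Extended JSON type tags."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int in Python; keep it as a boolean
    if isinstance(value, int):
        tag = "$numberInt" if INT32_MIN <= value <= INT32_MAX else "$numberLong"
        return {tag: str(value)}
    if isinstance(value, dict):
        return {key: to_extended_json(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_extended_json(item) for item in value]
    return value  # floats, strings, and None pass through unchanged
```

Floats are left untouched, so a value that is meant to be a Double stays a Double.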
221
+
222
+ > Raw `mongosh --eval` examples are kept in `references/mongodb-config.md` as a fallback.
223
+
224
+ ### Sequential launches (multiple experiments, same predictor)
225
+
226
+ All runs for the same predictor share one MongoDB document. The full loop:
227
+
228
+ ```bash
229
+ # Run A
230
+ python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-A.json
231
+ # launch run A from attraqt-kubeflow-configs (capture <run_id_A>)
232
+ python3 scripts/kf_wait.py <run_id_A> --step get-container-base-task-2 --state RUNNING --timeout 600
233
+
234
+ # Run B (only after A's pod has read the config)
235
+ python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-B.json
236
+ # launch run B (capture <run_id_B>)
237
+ python3 scripts/kf_wait.py <run_id_B> --step get-container-base-task-2 --state RUNNING --timeout 600
238
+ ```
239
+
240
+ Verify each run actually picked up its config:
241
+ ```bash
242
+ python3 scripts/kf_logs.py <run_id> --step get-container-base-task-2 --tail 500 \
243
+ | grep -E 'maxSequenceLength:|perDeviceTrainBatchSize:|gradientAccumulationSteps:'
244
+ ```
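The ordering constraint in the loop above (never apply the next config until the previous run has read its own) can be expressed as a small driver. Everything here is hypothetical scaffolding: `run_experiments_sequentially` and its three callables stand in for `mongo_predictor.py apply`, the launcher, and `kf_wait.py`, whose wiring is left to the caller.

```python
def run_experiments_sequentially(predictor_id, config_files,
                                 apply_config, launch_run, wait_until_running):
    """Apply each config, launch its run, and wait for RUNNING before moving on."""
    launched = []
    for config_file in config_files:
        apply_config(predictor_id, config_file)   # e.g. mongo_predictor.py apply
        run_id = launch_run()                     # caller returns the new run_id
        if not wait_until_running(run_id):        # e.g. kf_wait.py --state RUNNING
            raise RuntimeError(
                f"run {run_id} never reached RUNNING; not applying the next config"
            )
        launched.append((config_file, run_id))
    return launched
```

Stopping on the first failed wait avoids overwriting the shared predictor document while an earlier run may still need it.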
245
+
246
+ > See `references/mongodb-config.md` for full config schema, complete update templates, and gotchas.
247
+
248
+ ---
249
+
250
+ ## Core Workflow: Building a Job Config Interactively
251
+
252
+ When the user asks to create a config for a Kubeflow job, ask only the questions needed, in this order:
253
+
254
+ 1. **What pipeline step?** (learning / evaluation / items-encoding / full pipeline / other)
255
+ 2. **Predictor ID?** (MongoDB ObjectId, e.g. `64f0a12b5856b11b7aa4e71e`)
256
+ 3. **Experiment name?** (check Kubeflow: `python3 scripts/kf_query.py --experiments`)
257
+ 4. **Image version?** Ask the user, or check `attraqt-kubeflow-pipelines/kubeflow_pipelines/pipelines/utils/versions.py` for defaults
258
+ 5. **Dataset paths?** GCS paths from previous completed steps; check Kubeflow UI or `kf_query.py <run_id>`
259
+ 6. **MLflow run_id?** For evaluation/encoding steps; check `mlflow_query.py model-for-predictor <predictor_id>`
260
+ 7. **Resource overrides?** Use defaults from `references/pipeline-configs.md` unless user specifies
261
+
262
+ Then generate the config and offer to either save it and launch it, or just show it.
263
+
264
+ **Key rules:**
265
+ - `python_batch_pipeline` → uses `batch_config.arguments` for custom job params
266
+ - `scala_batch_pipeline` → uses `batch_config.custom_params` (NOT `arguments`)
267
+ - Always verify `version_name` with `kf_query.py --pipeline-versions <name>` before submitting
268
+ - GPU jobs: always include `gpu_vendor: "nvidia.com/gpu"` and `gpu_accelerator_name: "nvidia-l4"` for L4 nodes
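The key rules above lend themselves to a pre-submission check. The sketch below is an assumption-heavy illustration: `check_batch_config` is a hypothetical helper, and it assumes the GPU keys live directly inside `batch_config`, which the source does not state.

```python
# Which batch_config key carries custom job params, per the key rules.
PARAM_KEY_BY_PIPELINE = {
    "python_batch_pipeline": "arguments",
    "scala_batch_pipeline": "custom_params",
}


def check_batch_config(pipeline_name, batch_config, uses_gpu=False):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    expected = PARAM_KEY_BY_PIPELINE.get(pipeline_name)
    if expected is not None:
        for key in PARAM_KEY_BY_PIPELINE.values():
            if key != expected and key in batch_config:
                problems.append(
                    f"{pipeline_name} expects batch_config.{expected}, "
                    f"found batch_config.{key}"
                )
    if uses_gpu:
        # Assumed location of the GPU settings (see lead-in).
        for key in ("gpu_vendor", "gpu_accelerator_name"):
            if key not in batch_config:
                problems.append(f"GPU job is missing {key}")
    return problems
```

Pipelines not listed in the mapping are only checked for the GPU keys.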
269
+
270
+ > See `references/pipeline-configs.md` for full schemas, strategy IDs, image names, and GCS path patterns.
271
+ > See `references/pipeline-steps.md` for step-by-step configs for the semantic search learning workflow.
272
+
273
+ ---
274
+
275
+ ## Core Workflow: Re-running Failed Pipeline Steps
276
+
277
+ For **semantic search learning** (`semantic_search_learning_with_generated_analytics_pipeline`):
278
+
279
+ | Step | Pipeline type | Key inputs needed |
280
+ |------|--------------|-------------------|
281
+ | Learning | `python_batch_pipeline` | Dataset paths from steps 2+3 |
282
+ | Evaluation | `python_batch_pipeline` | MLflow `run_id` from learning; evaluation dataset from step 3 |
283
+ | Items encoding | `scala_batch_pipeline` | MLflow `run_id` from learning |
284
+
285
+ Get completed step output paths:
286
+ ```bash
287
+ python3 scripts/kf_query.py <original_run_id>
288
+ # Then check Kubeflow UI → that run → succeeded steps → Output artifacts tab
289
+ ```
290
+
291
+ Get the MLflow `run_id` from a completed training:
292
+ ```bash
293
+ python3 scripts/mlflow_query.py model-for-predictor <predictor_id>
294
+ ```
295
+
296
+ > Full config templates for each step are in `references/pipeline-steps.md`.
297
+
298
+ ---
299
+
300
+ ## Quick Reference: Finding Things
301
+
302
+ | What | Where |
303
+ |------|-------|
304
+ | Predictor ID | Kubeflow run config params (shown by `kf_query.py`) |
305
+ | Training hyperparameters | MongoDB: `db.predictors.findOne({_id: ObjectId("<id>")}).config.batch` |
306
+ | Image versions | `attraqt-kubeflow-pipelines/.../utils/versions.py` |
307
+ | Dataset GCS paths | Kubeflow UI → run → step → Output artifacts; or `kf_query.py` |
308
+ | MLflow run_id | `mlflow_query.py model-for-predictor <predictor_id>` |
309
+ | Pipeline version_name | `kf_query.py --pipeline-versions <name>` |
310
+ | Pipeline configs | `attraqt-kubeflow-configs/configs/development/` |
311
+ | Strategy IDs & image names | `references/pipeline-configs.md` |
312
+ | MongoDB config schema | `references/mongodb-config.md` |
313
+ | Existing tenant configs | `attraqt-kubeflow-configs/configs/development/ai/<pipeline>/` |
@@ -0,0 +1,165 @@
1
+ # kubectl Debugging Reference for Kubeflow Jobs
2
+
3
+ Kubeflow jobs run in the `kubeflow` namespace. Each pipeline step spawns pods.
4
+
5
+ ---
6
+
7
+ ## Quick Diagnostics
8
+
9
+ On this dev cluster, KFP v2 pod names follow `<workflow-name>-<numeric-id>` (see "Getting Pod Name from Kubeflow UI" below). For ad-hoc lookups, grep by workflow name prefix. For scripted resolution, `scripts/kf_logs.py` uses the KFP API to map run + step to pods directly.
10
+
11
+ ```bash
12
+ # Find all pods for a run by workflow name prefix (ALWAYS do this first)
13
+ kubectl get pods -n kubeflow | grep <workflow-name>
14
+
15
+ # List recent pods (sort by creation time)
16
+ kubectl get pods -n kubeflow --sort-by=.metadata.creationTimestamp | tail -30
17
+
18
+ # Find pods by predictor_id label
19
+ kubectl get pods -n kubeflow -l predictor-id=<predictor_id>
20
+
21
+ # Find pods by strategy
22
+ kubectl get pods -n kubeflow -l strategy-id=<strategy_id>
23
+
24
+ # Find pods by image (algo-search-batch, semantic-search, etc.)
25
+ kubectl get pods -n kubeflow | grep <image_name>
26
+ ```
27
+
28
+ ---
29
+
30
+ ## Pod Logs
31
+
32
+ ```bash
33
+ # Get logs from a pod (last 200 lines)
34
+ kubectl logs -n kubeflow <pod-name> --tail=200
35
+
36
+ # Stream live logs
37
+ kubectl logs -n kubeflow <pod-name> -f
38
+
39
+ # Get logs from previous crashed container
40
+ kubectl logs -n kubeflow <pod-name> --previous
41
+
42
+ # Get logs for a specific container in a pod (e.g. main container)
43
+ kubectl logs -n kubeflow <pod-name> -c main
44
+
45
+ # Get logs from all pods with a label
46
+ kubectl logs -n kubeflow -l predictor-id=<predictor_id> --tail=100
47
+ ```
48
+
49
+ ---
50
+
51
+ ## Pod Status & Events
52
+
53
+ ```bash
54
+ # Describe a pod (shows events, exit codes, OOM kills, node assignment)
55
+ kubectl describe pod -n kubeflow <pod-name>
56
+
57
+ # Check pod status quickly
58
+ kubectl get pod -n kubeflow <pod-name> -o wide
59
+
60
+ # Get exit code from a completed/failed pod
61
+ kubectl get pod -n kubeflow <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
62
+
63
+ # Get reason for failure
64
+ kubectl get pod -n kubeflow <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
65
+
66
+ # List all events in kubeflow namespace (helpful for scheduling failures)
67
+ kubectl get events -n kubeflow --sort-by=.lastTimestamp | tail -30
68
+
69
+ # Events for a specific pod
70
+ kubectl get events -n kubeflow --field-selector involvedObject.name=<pod-name>
71
+ ```
72
+
73
+ ---
74
+
75
+ ## GPU / Resource Issues
76
+
77
+ ```bash
78
+ # Check GPU node availability (all GPU types)
79
+ kubectl get nodes -l cloud.google.com/gke-accelerator -o custom-columns=NAME:.metadata.name,ACCELERATOR:.metadata.labels.cloud\\.google\\.com/gke-accelerator,GPU:.status.allocatable.nvidia\\.com/gpu
80
+
81
+ # Check if GPU is allocated (look for Limits in describe)
82
+ kubectl describe pod -n kubeflow <pod-name> | grep -A5 "Limits:"
83
+
84
+ # Check if pod is stuck Pending (resource pressure)
85
+ kubectl describe pod -n kubeflow <pod-name> | grep -A10 "Events:"
86
+ ```
87
+
88
+ ---
89
+
90
+ ## Argo Workflow (Kubeflow underlying engine)
91
+
92
+ ```bash
93
+ # List workflows (each pipeline run = 1 workflow)
94
+ kubectl get workflows -n kubeflow | tail -20
95
+
96
+ # Get workflow status
97
+ kubectl get workflow -n kubeflow <workflow-name> -o jsonpath='{.status.phase}'
98
+
99
+ # Describe workflow (shows all steps and their status)
100
+ kubectl describe workflow -n kubeflow <workflow-name>
101
+
102
+ # Get workflow name from Kubeflow run_id
103
+ # The workflow name is shown in the Kubeflow UI run details page URL/title
104
+ ```
105
+
106
+ ---
107
+
108
+ ## Common Failure Patterns
109
+
110
+ | Symptom | What to check |
111
+ |---------|---------------|
112
+ | Pod stuck `Pending` | `kubectl describe pod` → Events: check "Insufficient nvidia.com/gpu" or "Insufficient memory" |
113
+ | Pod `OOMKilled` | Increase `memory` in batch_config; `describe pod` shows `OOMKilled` in state |
114
+ | Pod `Error` exit code 1 | Check `kubectl logs` for the application error (CUDA OOM, missing dataset path, etc.) |
115
+ | Pod `Error` exit code 137 | Container was killed with SIGKILL (128 + 9), most often by the kernel OOM killer |
116
+ | Workflow stuck | Check if a previous step pod is still running: `kubectl get pods -n kubeflow` |
117
+ | `ImagePullBackOff` | Wrong image version in config; check version exists in registry |
118
+ | `CrashLoopBackOff` | App crashes on startup; check `--previous` logs |
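Exit codes above 128 in the table follow the usual shell convention of 128 plus the signal number, which is why 137 points at SIGKILL. A small decoder makes this explicit; `describe_exit_code` is a hypothetical helper, and signal names are resolved with the standard library on the machine running it (a POSIX system is assumed).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code == 0:
        return "success"
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"  # number not known on this platform
        return f"terminated by {name}"
    return f"application error (exit code {code})"
```

A SIGKILL result does not prove an OOM kill by itself; `kubectl describe pod` still needs to show `OOMKilled` to confirm it.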
119
+
120
+ ---
121
+
122
+ ## Getting Pod Name from Kubeflow UI
123
+
124
+ On this dev cluster, the KFP v2 API's `task_details[].child_tasks[].pod_name` field is the kubectl pod name. Pods follow the pattern:
125
+
126
+ ```
127
+ <workflow-name>-<numeric-id>
128
+ ```
129
+
130
+ Example: workflow `python-batch-pipeline-26dsn` produces pods like:
131
+ ```
132
+ python-batch-pipeline-26dsn-1314600408 # workflow root
133
+ python-batch-pipeline-26dsn-2177690451 # a step pod
134
+ python-batch-pipeline-26dsn-2721071988 # a driver pod
135
+ ```
136
+
137
+ The step's role is encoded in the `display_name` field of `task_details`, NOT in the pod name. Steps ending in `-driver` are KFP infrastructure (driver / DAG orchestration); user code runs in steps without that suffix. `scripts/kf_logs.py` uses this distinction to filter.
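Given the `<workflow-name>-<numeric-id>` pattern described above, the workflow name can be recovered from a pod name with a regular expression. `split_pod_name` is a hypothetical helper, and it deliberately returns `None` for names that do not end in a numeric id, such as the older step-infix pattern.

```python
import re

# Greedy workflow part, then a trailing all-digit id.
POD_NAME = re.compile(r"^(?P<workflow>.+)-(?P<node_id>\d+)$")


def split_pod_name(pod_name):
    """Return (workflow_name, numeric_id), or None if the name does not fit the pattern."""
    match = POD_NAME.match(pod_name)
    if match is None:
        return None
    return match.group("workflow"), match.group("node_id")
```

The numeric id carries no step information, so the step still has to be looked up through `display_name`.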
138
+
139
+ If you ever encounter a KFP configuration that produces longer pod names with step-name infixes (an older KFP v2 pattern was `<workflow>-<step>-system-container-impl-<hash>`), grep is the fallback:
140
+
141
+ ```bash
142
+ kubectl get pods -n kubeflow | grep <workflow-name>
143
+ ```
144
+
145
+ To find pods for a run:
146
+ 1. Get the workflow name from Kubeflow UI (Runs -> click run -> title/URL), or use `python3 scripts/kf_logs.py <run_id> --list` to print step + pod_name pairs.
147
+ 2. Use `kubectl logs -n kubeflow <pod-name>` directly, or let `kf_logs.py --step <name>` resolve and tail in one call.
148
+
149
+ ---
150
+
151
+ ## Useful One-liners
152
+
153
+ ```bash
154
+ # Get all failed pods in kubeflow namespace
155
+ kubectl get pods -n kubeflow --field-selector=status.phase=Failed
156
+
157
+ # Delete a completed/failed pod (cleanup)
158
+ kubectl delete pod -n kubeflow <pod-name>
159
+
160
+ # Check resource quotas
161
+ kubectl describe resourcequota -n kubeflow
162
+
163
+ # Exec into a running pod for debugging
164
+ kubectl exec -it -n kubeflow <pod-name> -- /bin/bash
165
+ ```