@groupby/ai-dev 0.5.5 → 0.5.8

Files changed (43)
  1. package/package.json +1 -1
  2. package/teams/OOF/skills/jira-ticket-creator/README.md +22 -0
  3. package/teams/OOF/skills/jira-ticket-creator/SKILL.md +266 -0
  4. package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/full.md +31 -0
  5. package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/light.md +7 -0
  6. package/teams/fhr-ai-team/github/copilot-instructions.md +24 -0
  7. package/teams/fhr-ai-team/github/instructions/python.instructions.md +23 -0
  8. package/teams/fhr-ai-team/github/pull_request_template.md +21 -0
  9. package/teams/fhr-ai-team/prompts/brainstorm.md +7 -0
  10. package/teams/fhr-ai-team/prompts/plan-algo-tests.md +7 -0
  11. package/teams/fhr-ai-team/prompts/plan.md +7 -0
  12. package/teams/fhr-ai-team/prompts/pr-description.md +7 -0
  13. package/teams/fhr-ai-team/prompts/test.md +7 -0
  14. package/teams/fhr-ai-team/resources/AGENTS.md +55 -0
  15. package/teams/fhr-ai-team/resources/CLAUDE.md +52 -0
  16. package/teams/fhr-ai-team/resources/README.md +51 -0
  17. package/teams/fhr-ai-team/resources/claude-code-setup.md +60 -0
  18. package/teams/fhr-ai-team/resources/copilot-setup.md +64 -0
  19. package/teams/fhr-ai-team/resources/onboarding.md +179 -0
  20. package/teams/fhr-ai-team/resources/opencode-install.md +29 -0
  21. package/teams/fhr-ai-team/resources/opencode-setup.md +43 -0
  22. package/teams/fhr-ai-team/skills/algo-test-planning/SKILL.md +192 -0
  23. package/teams/fhr-ai-team/skills/algo-test-planning/references/pipeline-registry.md +280 -0
  24. package/teams/fhr-ai-team/skills/brainstorming/SKILL.md +111 -0
  25. package/teams/fhr-ai-team/skills/e2e-testing/SKILL.md +163 -0
  26. package/teams/fhr-ai-team/skills/grill-me/SKILL.md +10 -0
  27. package/teams/fhr-ai-team/skills/ml-tooling-dev/SKILL.md +313 -0
  28. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/kubectl-debug.md +165 -0
  29. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/mongodb-config.md +218 -0
  30. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-configs.md +190 -0
  31. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-steps.md +182 -0
  32. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_logs.py +203 -0
  33. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_query.py +233 -0
  34. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_wait.py +195 -0
  35. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mlflow_query.py +252 -0
  36. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mongo_predictor.py +352 -0
  37. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/SKILL.md +230 -0
  38. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/dataset-naming.md +190 -0
  39. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/domain-vocabulary.md +447 -0
  40. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/repo-dependency-graph.md +264 -0
  41. package/teams/fhr-ai-team/skills/planning/SKILL.md +138 -0
  42. package/teams/fhr-ai-team/skills/pr-description/SKILL.md +94 -0
  43. package/teams/snpd/skills/code-review-github/SKILL.md +475 -0
@@ -0,0 +1,280 @@
1
+ # Pipeline Registry
2
+
3
+ Complete registry of all AI pipelines from `attraqt-kubeflow-pipelines`.
4
+ Source: `kubeflow_pipelines/pipelines/ai/__init__.py`
5
+
6
+ ---
7
+
8
+ ## Base Pipelines (Single-Step Templates)
9
+
10
+ These are generic pipeline wrappers for running a single batch job step.
11
+
12
+ | Pipeline | Type | Use case |
13
+ |----------|------|----------|
14
+ | `python_batch_pipeline` | Python | Standard Python batch jobs |
15
+ | `large_python_batch_pipeline` | Python | GPU/high-memory Python batch jobs |
16
+ | `scala_batch_pipeline` | Scala | Scala-based batch jobs |
17
+ | `spark_scala_batch_pipeline` | Spark Scala | Spark Scala batch jobs |
18
+
19
+ **Config differences:**
20
+ - Python pipelines use `batch_config.arguments` for custom params
21
+ - Scala pipelines use `batch_config.custom_params` (NOT `arguments`)
22
+ - `large_python_batch_pipeline` supports GPU: set `gpu_vendor: "nvidia.com/gpu"` and `gpu_accelerator_name: "nvidia-l4"`
23
+
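A minimal Python sketch of the config difference described above. Only the key names `arguments`, `custom_params`, `gpu_vendor`, and `gpu_accelerator_name` come from the registry notes. The helper name `build_batch_config`, the overall dict layout, and the placement of the GPU keys inside `batch_config` are assumptions made for illustration.

```python
# Pipeline names are taken from the base pipelines table.
SCALA_PIPELINES = {"scala_batch_pipeline", "spark_scala_batch_pipeline"}
GPU_PIPELINES = {"large_python_batch_pipeline"}


def build_batch_config(pipeline, params, use_gpu=False):
    """Return a config dict using the params key that matches the pipeline type.

    Hypothetical helper: the real config schema may nest these keys differently.
    """
    # Scala pipelines read custom_params; Python pipelines read arguments.
    params_key = "custom_params" if pipeline in SCALA_PIPELINES else "arguments"
    batch_config = {params_key: dict(params)}

    if use_gpu:
        if pipeline not in GPU_PIPELINES:
            raise ValueError(f"{pipeline} is not listed as GPU-capable")
        # Values quoted from the registry notes; their location is assumed.
        batch_config["gpu_vendor"] = "nvidia.com/gpu"
        batch_config["gpu_accelerator_name"] = "nvidia-l4"

    return {"batch_config": batch_config}
```

Keeping the key choice in one place guards against the mix-up the notes warn about, where Scala params are passed under `arguments`.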
24
+ ---
25
+
26
+ ## Semantic Search
27
+
28
+ | Pipeline | Type | Description |
29
+ |----------|------|-------------|
30
+ | `semantic_search_learning_pipeline` | Full | Learning only |
31
+ | `semantic_search_learning_with_generated_analytics_pipeline` | Full | Learning + analytics generation (most common) |
32
+ | `semantic_search_item_encoding_pipeline` | Full | Item encoding after training |
33
+ | `export_huggingface_sentence_transformer_model_pipeline` | Script | Export HuggingFace sentence transformer model |
34
+
35
+ ---
36
+
37
+ ## Search (Text-only, no images)
38
+
39
+ | Pipeline | Type | Description |
40
+ |----------|------|-------------|
41
+ | `search_rnn_learning_pipeline_without_images` | Full | RNN-based search learning |
42
+ | `search_llm_learning_pipeline_without_images` | Full | LLM-based search learning |
43
+ | `search_item_encoding_pipeline_without_images` | Full | Item encoding (text only) |
44
+
45
+ ---
46
+
47
+ ## Search + CLIP (Text + Images)
48
+
49
+ | Pipeline | Type | Description |
50
+ |----------|------|-------------|
51
+ | `clip_search_rnn_learning_pipeline` | Full | CLIP + RNN search learning |
52
+ | `clip_search_llm_learning_pipeline` | Full | CLIP + LLM search learning |
53
+ | `clip_search_llm_learning_pipeline_with_data_augmentation` | Full | CLIP + LLM with data augmentation |
54
+ | `clip_search_vertical_rnn_learning_pipeline` | Full | Vertical (per-tenant) CLIP + RNN learning |
55
+ | `clip_search_vertical_llm_learning_pipeline` | Full | Vertical (per-tenant) CLIP + LLM learning |
56
+ | `clip_large_search_vertical_rnn_learning_pipeline` | Full | Large vertical CLIP + RNN (GPU) |
57
+ | `clip_large_search_vertical_llm_learning_pipeline` | Full | Large vertical CLIP + LLM (GPU) |
58
+ | `clip_search_item_encoding_pipeline` | Full | CLIP search item encoding |
59
+
60
+ ---
61
+
62
+ ## Search + Image Encoder
63
+
64
+ | Pipeline | Type | Description |
65
+ |----------|------|-------------|
66
+ | `image_encoder_search_item_encoding_pipeline_with_images` | Full | Image encoder search encoding (with images) |
67
+ | `image_encoder_search_item_encoding_pipeline_without_images` | Full | Image encoder search encoding (without images) |
68
+
69
+ ---
70
+
71
+ ## Search Evaluation
72
+
73
+ | Pipeline | Type | Description |
74
+ |----------|------|-------------|
75
+ | `search_evaluation_pipeline` | Full | Standard search evaluation |
76
+ | `search_llm_evaluation_pipeline` | Full | LLM-based search evaluation |
77
+
78
+ ---
79
+
80
+ ## Computer Vision - CLIP
81
+
82
+ | Pipeline | Type | Description |
83
+ |----------|------|-------------|
84
+ | `clip_learning_pipeline` | Full | CLIP model learning |
85
+ | `clip_vertical_learning_pipeline` | Full | Per-tenant CLIP learning |
86
+ | `large_clip_learning_pipeline` | Full | Large CLIP learning (GPU) |
87
+ | `clip_item_images_single_encoding_pipeline` | Full | CLIP single image encoding |
88
+ | `export_huggingface_clip_model_pipeline` | Script | Export HuggingFace CLIP model |
89
+
90
+ ---
91
+
92
+ ## Computer Vision - Image Encoder
93
+
94
+ | Pipeline | Type | Description |
95
+ |----------|------|-------------|
96
+ | `computer_vision_learning_pipeline` | Full | Image encoder learning |
97
+ | `computer_vision_vertical_learning_pipeline` | Full | Per-tenant image encoder learning |
98
+ | `large_computer_vision_vertical_learning_pipeline` | Full | Large vertical learning (GPU) |
99
+ | `computer_vision_item_images_single_encoding_pipeline` | Full | Image encoding |
100
+
101
+ ---
102
+
103
+ ## Computer Vision - SAM
104
+
105
+ | Pipeline | Type | Description |
106
+ |----------|------|-------------|
107
+ | `export_huggingface_sam_model_pipeline` | Script | Export HuggingFace SAM model |
108
+
109
+ ---
110
+
111
+ ## FM (Factorization Machines / Recommendations)
112
+
113
+ | Pipeline | Type | Description |
114
+ |----------|------|-------------|
115
+ | `fm_global_initialization_pipeline` | Full | FM global model initialization |
116
+ | `fm_global_incremental_pipeline` | Full | FM global incremental update |
117
+ | `fm_complementarity_initialization_pipeline` | Full | FM complementarity initialization |
118
+ | `fm_complementarity_incremental_pipeline` | Full | FM complementarity incremental update |
119
+
120
+ ---
121
+
122
+ ## GPT (Generative)
123
+
124
+ | Pipeline | Type | Description |
125
+ |----------|------|-------------|
126
+ | `gpt_initialization_pipeline_with_images` | Full | GPT init (with images) |
127
+ | `gpt_initialization_pipeline_without_images` | Full | GPT init (text only) |
128
+ | `gpt_incremental_pipeline_with_images` | Full | GPT incremental (with images) |
129
+ | `gpt_incremental_pipeline_without_images` | Full | GPT incremental (text only) |
130
+ | `gpt_item_encoding_pipeline_with_images` | Full | GPT item encoding (with images) |
131
+ | `gpt_item_encoding_pipeline_without_images` | Full | GPT item encoding (text only) |
132
+
133
+ ---
134
+
135
+ ## Tagging
136
+
137
+ | Pipeline | Type | Description |
138
+ |----------|------|-------------|
139
+ | `tagging_learning_pipeline` | Full | Tagging model learning |
140
+ | `tagging_item_tagging_pipeline` | Full | Apply tagging to items |
141
+ | `tagging_item_macro_tagging_pipeline` | Full | Apply macro tagging to items |
142
+
143
+ ---
144
+
145
+ ## Shop the Look
146
+
147
+ | Pipeline | Type | Description |
148
+ |----------|------|-------------|
149
+ | `shop_the_look_recommendation_pipeline` | Full | STL recommendations |
150
+ | `shop_the_look_recommendation_with_segmentation_pipeline` | Full | STL with image segmentation |
151
+ | `shop_the_look_recommendation_without_segmentation_pipeline` | Full | STL without segmentation |
152
+ | `shop_the_look_recommendation_with_outfit_detection_pipeline` | Full | STL with outfit detection |
153
+ | `outfit_image_classification_learning_pipeline` | Full | Outfit classifier learning |
154
+ | `outfit_image_classification_vertical_learning_pipeline` | Full | Per-tenant outfit classifier |
155
+
156
+ ---
157
+
158
+ ## YOLO (Object Detection)
159
+
160
+ | Pipeline | Type | Description |
161
+ |----------|------|-------------|
162
+ | `yolo_model_fine_tuning_pipeline` | Full | YOLO model fine-tuning |
163
+ | `yolo_model_fine_tuning_vertical_pipeline` | Full | Per-tenant YOLO fine-tuning |
164
+ | `export_ultralytics_yolo_model_pipeline` | Script | Export Ultralytics YOLO model |
165
+
166
+ ---
167
+
168
+ ## Item and Analytic Data
169
+
170
+ | Pipeline | Type | Description |
171
+ |----------|------|-------------|
172
+ | `xo_item_data_pipeline` | Full | XO item data ingestion |
173
+ | `fhr_item_data_pipeline` | Full | FHR item data ingestion |
174
+ | `fhr_item_data_pipeline_legacy` | Full | FHR item data (legacy) |
175
+ | `cidp_item_data_pipeline` | Full | CIDP item data ingestion |
176
+ | `fhr_analytic_incremental_data_pipeline` | Full | FHR analytics incremental |
177
+ | `fhr_analytic_incremental_data_pipeline_legacy` | Full | FHR analytics incremental (legacy) |
178
+ | `fhr_analytic_data_pipeline_legacy` | Full | FHR analytics (legacy) |
179
+
180
+ ---
181
+
182
+ ## NLP
183
+
184
+ | Pipeline | Type | Description |
185
+ |----------|------|-------------|
186
+ | `nlp_word_tokenizer_pipeline` | Full | Word tokenizer training |
187
+ | `nlp_character_tokenizer_pipeline` | Full | Character tokenizer training |
188
+
189
+ ---
190
+
191
+ ## Content-Based
192
+
193
+ | Pipeline | Type | Description |
194
+ |----------|------|-------------|
195
+ | `content_based_word2vec_pipeline` | Full | Word2Vec content-based recommendations |
196
+
197
+ ---
198
+
199
+ ## ALS (Alternating Least Squares)
200
+
201
+ | Pipeline | Type | Description |
202
+ |----------|------|-------------|
203
+ | `als_pipeline` | Full | ALS collaborative filtering |
204
+
205
+ ---
206
+
207
+ ## FP-Growth
208
+
209
+ | Pipeline | Type | Description |
210
+ |----------|------|-------------|
211
+ | `fp_growth_items_pipeline` | Full | FP-Growth item associations |
212
+ | `fp_growth_categories_pipeline` | Full | FP-Growth category associations |
213
+
214
+ ---
215
+
216
+ ## Pass-Through (Graph)
217
+
218
+ | Pipeline | Type | Description |
219
+ |----------|------|-------------|
220
+ | `pass_through_scored_graph_pipeline` | Full | Scored graph pass-through |
221
+ | `pass_through_unscored_graph_1_pipeline` | Full | Unscored graph variant 1 |
222
+ | `pass_through_unscored_graph_2_pipeline` | Full | Unscored graph variant 2 |
223
+ | `pass_through_source_to_items_unscored_graph_pipeline` | Full | Source-to-items unscored graph |
224
+
225
+ ---
226
+
227
+ ## Autocomplete
228
+
229
+ | Pipeline | Type | Description |
230
+ |----------|------|-------------|
231
+ | `autocomplete_pipeline` | Full | Autocomplete model training |
232
+
233
+ ---
234
+
235
+ ## Miscellaneous
236
+
237
+ | Pipeline | Type | Description |
238
+ |----------|------|-------------|
239
+ | `basic_pipeline` | Full | Basic/generic pipeline template |
240
+ | `sessions_pipeline` | Full | Session data processing |
241
+ | `bigquery_cleanup_pipeline` | Full | BigQuery data cleanup |
242
+ | `gibberish_pipeline` | Full | Gibberish detection |
243
+ | `dummy_ai_scores_pipeline` | Full | Dummy AI scores (testing) |
244
+ | `item_tagging_pipeline` | Full | Items enrichment tagging |
245
+ | `merch_agent_data_pipeline` | Full | Merch agent data preparation |
246
+ | `lakefs_garbage_collection_pipeline` | Full | LakeFS garbage collection |
247
+
248
+ ---
249
+
250
+ ## Label Studio
251
+
252
+ | Pipeline | Type | Description |
253
+ |----------|------|-------------|
254
+ | `outfit_tasks_import_pipeline` | Script | Import outfit tasks to Label Studio |
255
+ | `outfit_annotations_export_pipeline` | Script | Export outfit annotations from Label Studio |
256
+ | `yolo_tasks_import_pipeline` | Script | Import YOLO tasks to Label Studio |
257
+ | `yolo_annotations_export_pipeline` | Script | Export YOLO annotations from Label Studio |
258
+
259
+ ---
260
+
261
+ ## Monitoring and Maintenance
262
+
263
+ | Pipeline | Type | Description |
264
+ |----------|------|-------------|
265
+ | `activity_monitoring` | Monitoring | Activity monitoring |
266
+ | `experiments_with_consecutive_failed_runs_monitoring_pipeline` | Monitoring | Failed experiments monitoring |
267
+ | `runs_with_abnormal_duration_cleaning_pipeline` | Monitoring | Abnormal duration cleanup |
268
+ | `gcs_cleaning_pipeline` | Script | GCS storage cleanup |
269
+ | `gcs_activities_copy_pipeline` | Script | GCS activities data copy |
270
+ | `image_download_pipeline` | Script | Image download utility |
271
+ | `inference_data_cleaning_pipeline` | Script | Inference data cleanup |
272
+
273
+ ---
274
+
275
+ ## Total: ~93 pipelines
276
+
277
+ - 4 base pipelines
278
+ - ~70 full (multi-step) pipelines
279
+ - ~12 script/utility pipelines
280
+ - ~7 monitoring/maintenance pipelines
@@ -0,0 +1,111 @@
1
+ ---
2
+ name: brainstorming
3
+ description: >
4
+   Use when the user wants to brainstorm, design, or explore a new feature, improvement,
5
+   or architecture decision. Discovers AI team repos via gh, searches existing code before
6
+   proposing solutions, and gathers requirements interactively via AskUserQuestion.
7
+ ---
8
+
9
+ # Codebase-Aware Brainstorming
10
+
11
+ ## Hard Gate
12
+
13
+ Do NOT invoke any implementation skill, write any code, scaffold any project, or take any
14
+ implementation action until you have presented a design and the user has approved it.
15
+
16
+ ## Process
17
+
18
+ ### Step 1: Discover AI Team Repos
19
+
20
+ Run the following to get the current repo landscape:
21
+
22
+ ```bash
23
+ gh repo list Attraqt --json name,description --limit 200 --no-archived
24
+ ```
25
+
26
+ Filter results for repos matching `ai.*`, `algo.*`, `ebap-*`, `attraqt-kubeflow-*`, and `*-toolbox` patterns.
27
+ Present the user with a summary of the relevant repos grouped by category:
28
+
29
+ | Category | Pattern | Purpose |
30
+ |----------|---------|---------|
31
+ | ML algorithms | `algo.*` | Model training, inference, evaluation |
32
+ | ML training | `algo.*-ml` | Kubeflow-based model training/fine-tuning |
33
+ | AI services | `ai.*` | FastAPI/Streamlit microservices |
34
+ | Toolboxes | `*-toolbox` | Shared Python libraries |
35
+ | Kubeflow infra | `attraqt-kubeflow-*` | Pipeline configs and definitions |
36
+ | Platform infra | `ebap-*` | Early Birds AI Platform |
37
+
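The filtering and grouping step above could be sketched in Python as follows. This assumes the patterns are shell-style globs (the skill does not say whether they are globs or regular expressions) and that the input is the JSON array printed by `gh repo list ... --json name,description`. The function names and the sample repo names in the usage note are hypothetical.

```python
import json
from fnmatch import fnmatchcase

# Ordered most-specific first, so algo.*-ml wins over algo.*.
CATEGORY_PATTERNS = [
    ("ML training", "algo.*-ml"),
    ("ML algorithms", "algo.*"),
    ("AI services", "ai.*"),
    ("Toolboxes", "*-toolbox"),
    ("Kubeflow infra", "attraqt-kubeflow-*"),
    ("Platform infra", "ebap-*"),
]


def categorize(repo_name):
    """Return the first matching category, or None for unrelated repos."""
    for category, pattern in CATEGORY_PATTERNS:
        if fnmatchcase(repo_name, pattern):
            return category
    return None


def group_repos(gh_json):
    """Group repo names by category from the gh JSON output."""
    groups = {}
    for repo in json.loads(gh_json):
        category = categorize(repo["name"])
        if category is not None:
            groups.setdefault(category, []).append(repo["name"])
    return groups
```

For example, a repo named `algo.search-ml` would land in "ML training" rather than "ML algorithms" because the more specific pattern is checked first.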
38
+ ### Step 2: Explore Project Context
39
+
40
+ For repos relevant to the brainstorm topic:
41
+ - Read their `CLAUDE.md` or `README.md` for architecture context
42
+ - Check recent git history (`git log --oneline -20`) for active development areas
43
+ - Scan directory structure to understand component layout
44
+
45
+ ### Step 3: Search Before Proposing
46
+
47
+ **MANDATORY:** Before proposing any solution, search the codebase for existing utilities,
48
+ patterns, and implementations related to the topic.
49
+
50
+ - Use Grep/Glob across relevant repos
51
+ - Check shared libraries: `earlybirds_commons`, `torch_toolbox`, `item-toolbox`, `nlp-toolbox`, `eb_tensorflow`
52
+ - Report findings to the user: "I found X in repo Y that does something similar"
53
+
54
+ If existing code covers part of the need, build on it rather than proposing greenfield work.
55
+
56
+ ### Step 4: Gather Requirements
57
+
58
+ Use the AskUserQuestion tool to gather requirements interactively.
59
+
60
+ Rules:
61
+ - **One question per message.** Do not batch multiple questions.
62
+ - **Prefer multiple-choice** over open-ended questions. Provide 2-4 concrete options based on what you found in the codebase.
63
+ - Cover these dimensions (not all at once; ask only what is relevant):
64
+   - Scope: what is in/out
65
+   - Target repos: which repos are affected
66
+   - Constraints: performance, compatibility, timeline
67
+   - Dependencies: what must exist first
68
+   - Users: who benefits from this
69
+
70
+ ### Step 5: Propose 2-3 Approaches
71
+
72
+ For each approach, include:
73
+ - **Summary:** one-sentence description
74
+ - **Trade-offs:** pros, cons, effort
75
+ - **Repos affected:** which repos need changes
76
+ - **Reuse opportunities:** what existing code can be leveraged
77
+ - **Concrete code references:** point to specific files/functions in real repos
78
+
79
+ ### Step 6: Present Design in Sections
80
+
81
+ Break the design into focused sections, each covering one concern.
82
+ Wait for user feedback between sections. Sections might include:
83
+ - Data model / schema changes
84
+ - API contracts
85
+ - Pipeline configuration
86
+ - Integration points with existing code
87
+ - Testing strategy
88
+
89
+ ### Step 7: Write Design Document
90
+
91
+ After user approval, save the design to `docs/specs/YYYY-MM-DD-<topic>-design.md`
92
+ in the relevant project repo. Include:
93
+ - Problem statement
94
+ - Chosen approach (with rationale)
95
+ - Detailed design per section
96
+ - Open questions (if any remain)
97
+ - References to existing code being reused
98
+
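One way to build the design document path from Step 7, as a small Python sketch. The `docs/specs/YYYY-MM-DD-<topic>-design.md` layout comes from the skill text. The slug rule (lowercase, runs of non-alphanumerics collapsed to `-`) and the function name `design_doc_path` are assumptions.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def design_doc_path(topic, day):
    """Return docs/specs/<ISO date>-<slug>-design.md for a topic and date."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    return PurePosixPath("docs/specs") / f"{day.isoformat()}-{slug}-design.md"
```

Passing the date explicitly rather than calling `date.today()` inside the function keeps the result reproducible.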
99
+ ### Step 8: Self-Review
100
+
101
+ Before presenting the final spec, review it for:
102
+ - Placeholders or vague language ("TBD", "as appropriate", "handle errors")
103
+ - Contradictions between sections
104
+ - Scope creep beyond what was agreed
105
+ - Missing error paths or edge cases
106
+ - Naming convention violations (invoke `ai.pierre:naming-conventions-reviewer` if code is shown)
107
+
108
+ ### Step 9: User Review and Transition
109
+
110
+ Present the spec for final user review. After approval, offer to invoke `/plan` to create
111
+ implementation tasks from the approved design.
@@ -0,0 +1,163 @@
1
+ ---
2
+ name: e2e-testing
3
+ description: >
4
+   Use when the user wants to run end-to-end tests of ML code, launch pipeline test runs,
5
+   or verify model outputs. Covers local pytest execution, Kubeflow pipeline launches,
6
+   MLflow metric validation, and pod-level debugging of failures.
7
+ ---
8
+
9
+ # End-to-End ML Testing
10
+
11
+ ## Overview
12
+
13
+ This skill handles all testing workflows for ML code, from local unit tests to full
14
+ Kubeflow pipeline end-to-end runs and model validation.
15
+
16
+ ## Step 1: Determine Test Scope
17
+
18
+ Use AskUserQuestion to ask:
19
+
20
+ **Question:** "What type of testing do you want to run?"
21
+
22
+ | Option | Description |
23
+ |--------|-------------|
24
+ | Local tests | Run pytest (unit and integration tests) |
25
+ | Pipeline end-to-end | Launch a Kubeflow pipeline run and monitor it |
26
+ | Model validation | Check MLflow metrics and model outputs |
27
+
28
+ ---
29
+
30
+ ## Local Tests
31
+
32
+ ### Identify the target
33
+ - Determine which repo and test directory to run
34
+ - Check for `pytest.ini`, `setup.cfg`, or `pyproject.toml` for test configuration
35
+ - Check for test markers (e.g., `@pytest.mark.integration`, `@pytest.mark.slow`)
36
+
37
+ ### Run tests
38
+ ```bash
39
+ pytest tests/ -v --tb=short
40
+ ```
41
+
42
+ For specific test files or functions:
43
+ ```bash
44
+ pytest tests/test_specific.py::test_function -v
45
+ ```
46
+
47
+ With coverage:
48
+ ```bash
49
+ pytest tests/ --cov=<package_name> --cov-report=term-missing
50
+ ```
51
+
52
+ ### Analyze failures
53
+ - Read the full traceback
54
+ - Check if it is a test environment issue (missing deps, wrong Python version, missing .env)
55
+ - Check if it is a real code bug
56
+ - Suggest fixes with exact code changes
57
+
58
+ ---
59
+
60
+ ## Pipeline End-to-End
61
+
62
+ ### Prerequisites check
63
+ Before launching, verify:
64
+
65
+ 1. **`.env` file exists** in `attraqt-kubeflow-configs`:
66
+ ```bash
67
+ ls <path-to>/attraqt-kubeflow-configs/.env
68
+ ```
69
+ If missing: `cp .env.dev .env`
70
+
71
+ 2. **`version_name` is valid**:
72
+ ```bash
73
+ python3 scripts/kf_query.py --pipeline-versions <pipeline_name>
74
+ ```
75
+
76
+ 3. **MongoDB config is correct** (for learning pipelines):
77
+ ```bash
78
+ mongosh "mongodb://10.11.96.21:27017/earlybirds" --quiet --eval '
79
+ const doc = db.predictors.findOne({"_id": ObjectId("<PREDICTOR_ID>")});
80
+ print(JSON.stringify(doc.config.batch, null, 2));
81
+ '
82
+ ```
83
+
84
+ ### Launch
85
+ ```bash
86
+ cd <path-to>/attraqt-kubeflow-configs/scripts
87
+ python -m run -c <absolute_path_to_config>
88
+ ```
89
+
90
+ ### Monitor
91
+ Poll the run status:
92
+ ```bash
93
+ python3 scripts/kf_query.py <run_id>
94
+ ```
95
+
96
+ Check for failures:
97
+ ```bash
98
+ python3 scripts/kf_query.py <run_id> --failed
99
+ ```
100
+
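The polling step above can be sketched as a generic wait loop in Python. The output format of `kf_query.py` is not shown in this diff, so the sketch takes a caller-supplied `get_status` function instead of parsing it. The status "Succeeded" appears in the skill text; treating "Failed" as the other terminal state is an assumption, as are the function name and default intervals. The package also ships `kf_wait.py`, whose interface is not reproduced here.

```python
import time

# "Succeeded" is from the skill text; "Failed" is an assumed terminal state.
TERMINAL_STATES = {"Succeeded", "Failed"}


def wait_for_run(get_status, interval_s=30.0, timeout_s=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

Injecting `sleep` and `clock` makes the loop testable without real waiting.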
101
+ ### Verify
102
+ - All steps should show status "Succeeded"
103
+ - Check output artifacts exist in GCS
104
+ - For learning pipelines: verify MLflow run was created
105
+ - For encoding pipelines: verify output encodings exist at expected GCS path
106
+
107
+ ### Debug Failures
108
+ When a step fails:
109
+
110
+ 1. **Find the pod:**
111
+ ```bash
112
+ kubectl get pods -n kubeflow | grep <workflow-name>
113
+ ```
114
+
115
+ 2. **Read logs:**
116
+ ```bash
117
+ kubectl logs -n kubeflow <pod-name> --tail=200
118
+ kubectl logs -n kubeflow <pod-name> --previous # if crashed
119
+ ```
120
+
121
+ 3. **Check events (OOM, scheduling, image pull):**
122
+ ```bash
123
+ kubectl describe pod -n kubeflow <pod-name>
124
+ ```
125
+
126
+ 4. **Common failure patterns:**
127
+ - OOM: increase memory in config, or reduce batch size in MongoDB
128
+ - Image pull error: wrong image version; verify with `kf_query.py --pipeline-versions`
129
+ - Config error: wrong arguments format (check `arguments` vs `custom_params`)
130
+ - GPU scheduling: check node availability with `kubectl get nodes`
131
+
132
+ See `skills/ml-tooling-dev/references/kubectl-debug.md` for the full debugging reference.
133
+
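A rough Python sketch for triaging the failure patterns listed above from `kubectl describe pod` output. The markers `OOMKilled`, `ImagePullBackOff`, `ErrImagePull`, and `FailedScheduling` are standard Kubernetes reason strings. The category names and the function name are hypothetical. Config errors are not covered, since they usually show up in application logs rather than pod events.

```python
# Category -> Kubernetes reason strings that indicate it.
FAILURE_MARKERS = {
    "oom": ("OOMKilled",),
    "image-pull": ("ImagePullBackOff", "ErrImagePull"),
    "scheduling": ("FailedScheduling",),
}

# Next steps paraphrased from the common failure patterns list.
NEXT_STEPS = {
    "oom": "increase memory in config, or reduce batch size in MongoDB",
    "image-pull": "verify the image version with kf_query.py --pipeline-versions",
    "scheduling": "check node availability with kubectl get nodes",
}


def classify_failure(describe_output):
    """Return the failure categories whose markers appear in the text."""
    return [
        category
        for category, markers in FAILURE_MARKERS.items()
        if any(marker in describe_output for marker in markers)
    ]
```

A caller could print `NEXT_STEPS[category]` for each returned category as a first hint before reading the full logs.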
134
+ ---
135
+
136
+ ## Model Validation
137
+
138
+ ### Fetch metrics
139
+ ```bash
140
+ python3 scripts/mlflow_query.py run <run_id>
141
+ ```
142
+
143
+ ### Compare against baseline
144
+ - Check key metrics (loss, accuracy, recall, precision) against previous runs
145
+ - Use `mlflow_query.py runs <experiment_name>` to list recent runs for comparison
146
+
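The baseline comparison above could look like the following Python sketch. It assumes metrics have already been fetched into plain dicts (the output format of `mlflow_query.py` is not shown here), that `loss` is the only lower-is-better metric, and that a fixed absolute tolerance is acceptable. All names and thresholds are illustrative.

```python
# Assumption: every metric except these is higher-is-better.
LOWER_IS_BETTER = {"loss"}


def find_regressions(current, baseline, tolerance=0.01):
    """Return {metric: delta} for metrics that got worse by more than tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # metric missing from the new run; not judged here
        delta = current[name] - base_value
        if name in LOWER_IS_BETTER:
            worse = delta > tolerance
        else:
            worse = delta < -tolerance
        if worse:
            regressions[name] = delta
    return regressions
```

An empty result means no tracked metric moved in the wrong direction by more than the tolerance. It does not by itself confirm the model is good.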
147
+ ### Check registered models
148
+ ```bash
149
+ python3 scripts/mlflow_query.py model-for-predictor <predictor_id>
150
+ python3 scripts/mlflow_query.py model <model_name>
151
+ ```
152
+
153
+ Verify:
154
+ - Model version was registered
155
+ - Aliases are set correctly (e.g., "champion", "challenger")
156
+ - Model artifact exists in the registry
157
+
158
+ ---
159
+
160
+ ## Skill Dependencies
161
+
162
+ This skill invokes `ai.pierre:ml-tooling-dev` for all Kubeflow, MLflow, and MongoDB operations.
163
+ For pipeline config generation, use `/plan-algo-tests` (invokes `ai.pierre:algo-test-planning`).
@@ -0,0 +1,10 @@
1
+ ---
2
+ name: grill-me
3
+ description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
4
+ ---
5
+
6
+ Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
7
+
8
+ Ask the questions one at a time.
9
+
10
+ If a question can be answered by exploring the codebase, explore the codebase instead.