@groupby/ai-dev 0.5.7 → 0.5.9
- package/package.json +1 -1
- package/teams/agentic-checkout/prompts/AGENTS.md +103 -0
- package/teams/agentic-checkout/prompts/create-plan.md +103 -0
- package/teams/agentic-checkout/prompts/create-pull-request.md +157 -0
- package/teams/agentic-checkout/prompts/fix-pr-comments.md +170 -0
- package/teams/agentic-checkout/prompts/fix-review-findings.md +1 -12
- package/teams/agentic-checkout/prompts/implement-task.md +62 -0
- package/teams/agentic-checkout/prompts/new-workspace.md +12 -0
- package/teams/agentic-checkout/prompts/orchestrate-component-change.md +25 -0
- package/teams/agentic-checkout/prompts/review-change.md +8 -2
- package/teams/agentic-checkout/scripts/check-secrets +51 -0
- package/teams/agentic-checkout/scripts/install-git-hooks +15 -0
- package/teams/agentic-checkout/scripts/local-fast-report +5 -0
- package/teams/agentic-checkout/scripts/local-report +205 -0
- package/teams/agentic-checkout/scripts/local-summarize +47 -0
- package/teams/agentic-checkout/scripts/logs-deps +9 -0
- package/teams/agentic-checkout/scripts/setup-local-fast-model +20 -0
- package/teams/agentic-checkout/scripts/start-deps +15 -0
- package/teams/agentic-checkout/scripts/status-deps +9 -0
- package/teams/agentic-checkout/scripts/stop-deps +9 -0
- package/teams/agentic-checkout/scripts/sync-components +110 -0
- package/teams/agentic-checkout/skills/approval-gated-task-execution/SKILL.md +57 -0
- package/teams/agentic-checkout/skills/component-verification/SKILL.md +34 -0
- package/teams/agentic-checkout/skills/grill-me/SKILL.md +23 -0
- package/teams/agentic-checkout/skills/karpathy-guidelines/SKILL.md +67 -0
- package/teams/agentic-checkout/skills/secret-safety/SKILL.md +41 -0
- package/teams/agentic-checkout/skills/sync-components/SKILL.md +23 -60
- package/teams/agentic-checkout/skills/tdd/SKILL.md +48 -0
- package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/full.md +31 -0
- package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/light.md +7 -0
- package/teams/fhr-ai-team/github/copilot-instructions.md +24 -0
- package/teams/fhr-ai-team/github/instructions/python.instructions.md +23 -0
- package/teams/fhr-ai-team/github/pull_request_template.md +21 -0
- package/teams/fhr-ai-team/prompts/brainstorm.md +7 -0
- package/teams/fhr-ai-team/prompts/plan-algo-tests.md +7 -0
- package/teams/fhr-ai-team/prompts/plan.md +7 -0
- package/teams/fhr-ai-team/prompts/pr-description.md +7 -0
- package/teams/fhr-ai-team/prompts/test.md +7 -0
- package/teams/fhr-ai-team/resources/AGENTS.md +55 -0
- package/teams/fhr-ai-team/resources/CLAUDE.md +52 -0
- package/teams/fhr-ai-team/resources/README.md +51 -0
- package/teams/fhr-ai-team/resources/claude-code-setup.md +60 -0
- package/teams/fhr-ai-team/resources/copilot-setup.md +64 -0
- package/teams/fhr-ai-team/resources/onboarding.md +179 -0
- package/teams/fhr-ai-team/resources/opencode-install.md +29 -0
- package/teams/fhr-ai-team/resources/opencode-setup.md +43 -0
- package/teams/fhr-ai-team/skills/algo-test-planning/SKILL.md +192 -0
- package/teams/fhr-ai-team/skills/algo-test-planning/references/pipeline-registry.md +280 -0
- package/teams/fhr-ai-team/skills/brainstorming/SKILL.md +111 -0
- package/teams/fhr-ai-team/skills/e2e-testing/SKILL.md +163 -0
- package/teams/fhr-ai-team/skills/grill-me/SKILL.md +10 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/SKILL.md +313 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/references/kubectl-debug.md +165 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/references/mongodb-config.md +218 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-configs.md +190 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-steps.md +182 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_logs.py +203 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_query.py +233 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_wait.py +195 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mlflow_query.py +252 -0
- package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mongo_predictor.py +352 -0
- package/teams/fhr-ai-team/skills/naming-conventions-reviewer/SKILL.md +230 -0
- package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/dataset-naming.md +190 -0
- package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/domain-vocabulary.md +447 -0
- package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/repo-dependency-graph.md +264 -0
- package/teams/fhr-ai-team/skills/planning/SKILL.md +138 -0
- package/teams/fhr-ai-team/skills/pr-description/SKILL.md +94 -0
@@ -0,0 +1,218 @@
# MongoDB Configuration Reference

## Connection

| Parameter | Value |
|-----------|-------|
| Host | `mongodb://10.11.96.21:27017` |
| Database | `earlybirds` |
| Collection | `predictors` |
| Tool | `mongosh` (install via `brew install mongosh` if missing) |

---

## Config Structure

Training hyperparameters live inside each predictor document at:

```
config.batch.<strategy-id>
```

For semantic search learning, the path is:

```
config.batch.semantic-search-learning
├── modelConfig
│   ├── pretrainedModelNameOrPath  (e.g. "intfloat/multilingual-e5-large")
│   ├── poolingType  ("MEAN", "CLS", "LAST_TOKEN")
│   ├── lossType  ("CONTRASTIVE", "TRIPLET", "COSINE")
│   ├── similarityScale  (float, e.g. 20.0)
│   └── contrastiveLossConfig
│       ├── similarityScale  (float, e.g. 20.0)
│       └── temperature  (float, e.g. 1.0)
├── pipelineConfig
│   ├── maxSequenceLength  (int, e.g. 32, 64, 128)
│   ├── pretrainedModelNameOrPath  (same as modelConfig)
│   ├── queryPrefix  (e.g. "query: ")
│   └── itemPrefix  (e.g. "passage: ")
└── learningConfig
    ├── batchSize  (int, dataset loading batch size)
    ├── useEvaluation  (bool)
    ├── useEarlyStopping  (bool)
    ├── earlyStoppingPatience  (int)
    ├── earlyStoppingThreshold  (float)
    ├── evaluationSplitRatio  (float, e.g. 0.1)
    ├── onnxQuantization  { "quantize": true }
    └── trainingArguments
        ├── numTrainEpochs  (int)
        ├── perDeviceTrainBatchSize  (int, in-batch negatives count)
        ├── perDeviceEvalBatchSize  (int)
        ├── gradientAccumulationSteps  (int)
        ├── gradientCheckpointing  (bool, required for large batch on L4)
        ├── evaluationStrategy  ("epoch" | "steps")
        ├── saveStrategy  ("epoch" | "steps")
        ├── loggingStrategy  ("steps")
        ├── loggingSteps  (int, e.g. 50)
        ├── warmupSteps  (int, e.g. 400)
        ├── weightDecay  (float, e.g. 0.01)
        ├── loadBestModelAtEnd  (bool)
        ├── metricForBestModel  ("eval_loss")
        ├── bf16  (bool)
        ├── fp16  (bool)
        ├── dataloaderNumWorkers  (int)
        ├── dataloaderPinMemory  (bool)
        ├── dataloaderPersistentWorkers  (bool)
        ├── dataloaderPrefetchFactor  (int)
        ├── removeUnusedColumns  (bool)
        ├── torchCompile  (bool)
        └── torchCompileBackend  (string | null)
```

---

## Common Operations

All read/update/replace operations on `config.batch.<strategy>` go through `scripts/mongo_predictor.py` (see `SKILL.md` for full options). It validates the ObjectId, escapes input through env-bound JSON (no shell injection surface), and preserves `Int32` types on full-replace.

### Read current config

```bash
python3 scripts/mongo_predictor.py read <PREDICTOR_ID>
```

### Update individual fields (preferred for targeted changes)

```bash
python3 scripts/mongo_predictor.py update <PREDICTOR_ID> \
  --set pipelineConfig.maxSequenceLength=64 \
  --set learningConfig.batchSize=128 \
  --set learningConfig.trainingArguments.perDeviceTrainBatchSize=128 \
  --set learningConfig.trainingArguments.perDeviceEvalBatchSize=128 \
  --set learningConfig.trainingArguments.gradientAccumulationSteps=4 \
  --set learningConfig.trainingArguments.gradientCheckpointing=true
```

Keys are dot-paths *under* the strategy. Values auto-coerce: `true`/`false`/`null`/int/float/JSON literal/string.
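
To make the coercion rule concrete, here is a minimal sketch of how a `--set key=value` argument could be turned into a dot-notation `$set` document. It is an illustration only, not the code in `mongo_predictor.py`: the function names and the exact precedence (keywords, then int, then float, then JSON literal, then plain string) are assumptions.

```python
import json


def coerce_value(raw: str):
    """Coerce a --set value string (assumed precedence, see lead-in)."""
    text = raw.strip()
    if text == "true":
        return True
    if text == "false":
        return False
    if text == "null":
        return None
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            pass
    try:
        # JSON literal such as {"quantize": true} or [1, 2]
        return json.loads(text)
    except json.JSONDecodeError:
        # Anything else is kept as the original, unstripped string
        return raw


def parse_set(arg: str):
    """Split 'dot.path=value' at the first '=' and coerce the value."""
    key, _, value = arg.partition("=")
    return key, coerce_value(value)


def build_set_document(strategy: str, set_args):
    """Prefix each key with config.batch.<strategy>. (hypothetical helper)."""
    prefix = f"config.batch.{strategy}."
    return {prefix + key: value for key, value in map(parse_set, set_args)}
```

Returning the unstripped string in the fallback matters for values such as `"query: "`, where the trailing space is significant.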

### Replace entire strategy config (for full overrides)

Stage the new config as JSON and apply it. Preview first with `diff`:

```bash
python3 scripts/mongo_predictor.py diff <PREDICTOR_ID> --file experiment-A.json
python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-A.json
```

Example `experiment-A.json` (top-level matches `config.batch.<strategy>`):

```json
{
  "modelConfig": {
    "pretrainedModelNameOrPath": "intfloat/multilingual-e5-large",
    "poolingType": "MEAN",
    "similarityScale": 20.0,
    "contrastiveLossConfig": { "similarityScale": 20.0, "temperature": 1.0 }
  },
  "pipelineConfig": {
    "maxSequenceLength": 64,
    "pretrainedModelNameOrPath": "intfloat/multilingual-e5-large",
    "queryPrefix": "query: ",
    "itemPrefix": "passage: "
  },
  "learningConfig": {
    "useEvaluation": true,
    "batchSize": 64,
    "useEarlyStopping": true,
    "earlyStoppingPatience": 3,
    "earlyStoppingThreshold": 0.001,
    "evaluationSplitRatio": 0.1,
    "onnxQuantization": { "quantize": true },
    "trainingArguments": {
      "numTrainEpochs": 3,
      "perDeviceTrainBatchSize": 64,
      "perDeviceEvalBatchSize": 64,
      "gradientAccumulationSteps": 4,
      "evaluationStrategy": "epoch",
      "saveStrategy": "epoch",
      "loggingStrategy": "steps",
      "loggingSteps": 50,
      "warmupSteps": 400,
      "weightDecay": 0.01,
      "loadBestModelAtEnd": true,
      "metricForBestModel": "eval_loss",
      "bf16": true,
      "fp16": false,
      "dataloaderNumWorkers": 4,
      "dataloaderPinMemory": false,
      "dataloaderPersistentWorkers": false,
      "dataloaderPrefetchFactor": 2,
      "removeUnusedColumns": true,
      "torchCompile": false,
      "torchCompileBackend": null,
      "gradientCheckpointing": false
    }
  }
}
```

Bare integers in this JSON are stored as `Int32` (the script wraps them in `$numberInt` before sending to mongosh). Bare floats stay as Double.
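
The wrapping step can be pictured as a recursive walk over the parsed JSON. The sketch below shows one way to do it with MongoDB Extended JSON (`{"$numberInt": "<n>"}`); it is not taken from `mongo_predictor.py`, and the function name and the out-of-range behaviour are assumptions.

```python
import json

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_int32(value):
    """Recursively wrap bare ints as Extended JSON $numberInt (sketch)."""
    # bool is a subclass of int in Python, so it must be checked first
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        if not INT32_MIN <= value <= INT32_MAX:
            # Assumption: refuse values that do not fit in Int32
            raise ValueError(f"{value} does not fit in Int32")
        return {"$numberInt": str(value)}
    if isinstance(value, dict):
        return {key: wrap_int32(item) for key, item in value.items()}
    if isinstance(value, list):
        return [wrap_int32(item) for item in value]
    # floats, strings and None pass through unchanged
    return value


staged = json.loads('{"similarityScale": 20.0, "batchSize": 64, "bf16": true}')
wrapped = wrap_int32(staged)
```

Because `json.loads` parses `20.0` as a Python `float` and `64` as an `int`, writing `20.0` rather than `20` in the staged file is what keeps a field a Double.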

### Verify config was read by a running pod

The training pod reads the MongoDB config at startup. Grep the pod logs to confirm. Use `kf_logs.py` to skip the manual pod-name lookup:

```bash
python3 scripts/kf_logs.py <run_id> --step get-container-base-task-2 --tail 500 \
  | grep -E 'maxSequenceLength:|perDeviceTrainBatchSize:|gradientAccumulationSteps:|gradientCheckpointing:|numTrainEpochs:'
```

Pod naming on this cluster: `<workflow-name>-<numeric-id>` (e.g., `python-batch-pipeline-26dsn-2177690451`). See `references/kubectl-debug.md` for the full kubectl reference.

### Raw mongosh fallback

If `mongo_predictor.py` is unavailable or you need a one-off query that the script does not cover, use `mongosh` directly:

```bash
mongosh "mongodb://10.11.96.21:27017/earlybirds" --quiet --eval '
  const doc = db.predictors.findOne({_id: ObjectId("<PREDICTOR_ID>")});
  print(JSON.stringify(doc.config.batch["semantic-search-learning"], null, 2));
'
```

Remember the `NumberInt()` rule for full-document `$set` (see Gotchas below).

---

## Sequential Launch Pattern (Multiple Experiments Sharing One Predictor)

All runs for the same predictor read config from the **same MongoDB document**. The basic loop:

1. `mongo_predictor.py apply` Config A
2. Launch run A (`python -m run -c ...`)
3. `kf_wait.py <run_id> --step get-container-base-task-2 --state RUNNING` (the training pod has now read the config)
4. `mongo_predictor.py apply` Config B
5. Launch run B
6. Repeat

```bash
python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-A.json
# launch run A, capture <run_id_A> from output
python3 scripts/kf_wait.py <run_id_A> --step get-container-base-task-2 --state RUNNING --timeout 600

python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-B.json
# launch run B
python3 scripts/kf_wait.py <run_id_B> --step get-container-base-task-2 --state RUNNING --timeout 600
```

**Critical:** if you apply the next config before the previous run's pod reaches `RUNNING`, that run gets the wrong config. Always wait, then verify with the `kf_logs.py` grep above.
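
The ordering constraint is easier to see as a loop. The following is a rough sketch, assuming the three actions are supplied as callables (for example thin `subprocess` wrappers around `mongo_predictor.py apply`, `python -m run -c ...`, and `kf_wait.py`); the function and parameter names are hypothetical.

```python
from typing import Callable, Iterable, List


def run_sequentially(
    config_files: Iterable[str],
    apply_config: Callable[[str], None],        # hypothetical wrapper: mongo_predictor.py apply
    launch_run: Callable[[str], str],           # hypothetical wrapper: returns the run id
    wait_until_running: Callable[[str], None],  # hypothetical wrapper: kf_wait.py --state RUNNING
) -> List[str]:
    """Apply, launch, then block until the pod is RUNNING before the next apply."""
    run_ids = []
    for config_file in config_files:
        apply_config(config_file)
        run_id = launch_run(config_file)
        # The shared MongoDB document must not change until this returns
        wait_until_running(run_id)
        run_ids.append(run_id)
    return run_ids
```

Passing the actions in as callables keeps the ordering logic testable without a cluster; the sketch does not perform the log verification step, which would still be done with `kf_logs.py`.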

---

## Key Gotchas

- **Database is `earlybirds`**, not `ebap`. Early attempts used `ebap`, which is wrong.
- **Use `NumberInt()` for integer values** when doing full-document `$set` to avoid MongoDB storing them as doubles. Dot-notation `$set` with bare integers is fine.
- **`batchSize` under `learningConfig`** is the dataset loading batch size; **`perDeviceTrainBatchSize` under `trainingArguments`** is what controls in-batch negatives for contrastive learning. Keep them in sync for consistency.
- **`gradientCheckpointing: true`** is required when `perDeviceTrainBatchSize` is large (128+) on L4 GPUs to avoid OOM.
- **`pretrainedModelNameOrPath`** appears in BOTH `modelConfig` and `pipelineConfig`; update both when changing the base model.
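
Several of these gotchas can be checked mechanically before applying a staged config. The sketch below is a hypothetical pre-flight check, not part of the shipped scripts; the function name, the warning texts, and the default threshold of 128 (taken from the "128+" note above) are assumptions.

```python
def check_gotchas(strategy_config: dict, large_batch: int = 128) -> list:
    """Return human-readable warnings for a config.batch.<strategy> dict (sketch)."""
    warnings = []
    model = strategy_config.get("modelConfig", {})
    pipeline = strategy_config.get("pipelineConfig", {})
    learning = strategy_config.get("learningConfig", {})
    args = learning.get("trainingArguments", {})

    # The base model name must match in both places
    if model.get("pretrainedModelNameOrPath") != pipeline.get("pretrainedModelNameOrPath"):
        warnings.append("pretrainedModelNameOrPath differs between modelConfig and pipelineConfig")

    # Dataset batch size and per-device train batch size should stay in sync
    if learning.get("batchSize") != args.get("perDeviceTrainBatchSize"):
        warnings.append("batchSize and perDeviceTrainBatchSize are out of sync")

    # Large batches on L4 need gradient checkpointing to avoid OOM
    per_device = args.get("perDeviceTrainBatchSize") or 0
    if per_device >= large_batch and not args.get("gradientCheckpointing"):
        warnings.append("gradientCheckpointing should be true for large batches on L4")

    return warnings
```

An empty list means none of the three checks fired; it does not validate the database name or integer types.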
@@ -0,0 +1,190 @@
# Pipeline Configuration Reference

## Endpoints
- **Kubeflow UI**: http://10.11.96.10/
- **MLflow UI**: http://10.11.96.16:5000/
- **Config files location**: `attraqt-kubeflow-configs/configs/development/`
- **Pipeline definitions**: `attraqt-kubeflow-pipelines/kubeflow_pipelines/pipelines/`

---

## Launching a Run from a Config File

```bash
# Must be run from the scripts/ directory of attraqt-kubeflow-configs
cd attraqt-kubeflow-configs/scripts
python -m run -c <absolute_path_to_config.json>
```

**Prerequisites:**
- `attraqt-kubeflow-configs/.env` must exist (copy from `.env.dev` for dev)
- `version_name` must match an existing pipeline version. Check with:
  ```bash
  python3 kf_query.py --pipeline-versions <pipeline_name>
  ```

**What `run.py` does:** Reads the JSON config, calls `KubeflowClient.create_run()` with `pipeline_name`, `version_name`, `experiment_name` (also used as `run_name`), and `params`.
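
As a rough illustration of that mapping, the sketch below turns a parsed config into keyword arguments. It does not reproduce `run.py`: the function name, the required-key check, and the keyword names are assumptions, and `KubeflowClient.create_run()` is not called because its signature is not shown here.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("pipeline_name", "version_name", "experiment_name", "params")


def run_kwargs_from_config(config: dict) -> dict:
    """Map a run config to create_run-style keyword arguments (sketch)."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing required keys: {missing}")
    return {
        "pipeline_name": config["pipeline_name"],
        "version_name": config["version_name"],
        "experiment_name": config["experiment_name"],
        # experiment_name doubles as the run name
        "run_name": config["experiment_name"],
        "params": config["params"],
    }


def load_run_kwargs(config_path: str) -> dict:
    """Read the JSON file given to `-c` and map it (hypothetical helper)."""
    return run_kwargs_from_config(json.loads(Path(config_path).read_text()))
```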

---

## 1. `python_batch_pipeline`: Python GPU/CPU batch job

**Template** (`configs/development/batch/python_batch_pipeline.json`):
```json
{
  "pipeline_name": "python_batch_pipeline",
  "version_name": "0.1.271",
  "experiment_name": "<EXPERIMENT_NAME>",
  "params": {
    "predictor_id": "<PREDICTOR_ID>",
    "strategy_id": "<STRATEGY_ID>",
    "image_name": "<IMAGE_NAME>",
    "cmd_script_path": "<SCRIPT_PATH>",
    "disk_enabled": "False",
    "disk_name": null,
    "batch_config": {
      "arguments": {},
      "cpu": "1000m",
      "memory": "2G",
      "extra_options": "",
      "gpu": "1",
      "gpu_vendor": "nvidia.com/gpu",
      "gpu_accelerator_name": "nvidia-l4",
      "version": "<IMAGE_VERSION>"
    }
  }
}
```

**`batch_config` fields:**
| Field | Description |
|-------|-------------|
| `arguments` | Dict passed as job arguments (strategy-specific) |
| `version` | Docker image version tag |
| `cpu` | CPU request (e.g. `"1000m"`, `"11"`) |
| `memory` | Memory request (e.g. `"2G"`, `"80G"`) |
| `gpu` | GPU count (e.g. `"1"`) |
| `gpu_vendor` | Always `"nvidia.com/gpu"` |
| `gpu_accelerator_name` | GPU type: `"nvidia-l4"` or `"nvidia-tesla-t4"` |
| `extra_options` | JVM/env extra flags string |

---

## 2. `scala_batch_pipeline`: Scala JVM batch job

**Template** (`configs/development/batch/scala_batch_pipeline.json`):
```json
{
  "pipeline_name": "scala_batch_pipeline",
  "version_name": "0.1.271",
  "experiment_name": "<EXPERIMENT_NAME>",
  "params": {
    "predictor_id": "<PREDICTOR_ID>",
    "strategy_id": "<STRATEGY_ID>",
    "image_name": "<IMAGE_NAME>",
    "cmd_script_path": "/opt/start.sh",
    "launcher_class": "<MAIN_CLASS>",
    "disk_enabled": "False",
    "disk_name": null,
    "batch_config": {
      "custom_params": {},
      "version": "<IMAGE_VERSION>",
      "cpu": "1000m",
      "memory": "2G",
      "java_memory": "1G",
      "timeout_s": 3600,
      "extra_options": "",
      "gpu": "1",
      "gpu_vendor": "nvidia.com/gpu"
    }
  }
}
```

**`batch_config` fields:**
| Field | Description |
|-------|-------------|
| `custom_params` | Dict of strategy-specific params (NOT `arguments`) |
| `version` | Docker image version tag |
| `java_memory` | JVM heap max (e.g. `"1G"`, `"36G"`) |
| `timeout_s` | Job timeout in seconds (default `3600`) |

---

## 3. `semantic_search_item_encoding_pipeline`: Full encoding pipeline

**Example** (`configs/development/ai/semantic_search_item_encoding_pipeline/`):
```json
{
  "pipeline_name": "semantic_search_item_encoding_pipeline",
  "version_name": "0.1.269",
  "experiment_name": "<NAME> - Encoding workflow",
  "params": {
    "search_predictor_id": "<PREDICTOR_ID>",
    "model_repository": "mlflow",
    "model_id": "<MLFLOW_RUN_ID>",
    "item_search_encoding_version": "0",
    "item_seo_keyphrases_generator": {
      "cpu": "4",
      "extra_options": "-DmappedProductsVersion=v1",
      "custom_params": {
        "imageDatasetName": "xo",
        "imageBasePath": "/mnt",
        "promptId": "SEOKeyPhrasesFashionPrompt"
      },
      "java_memory": "2G",
      "memory": "4G",
      "timeout_s": 7200,
      "version": "3.45.0"
    },
    "items_encoding": {
      "cpu": "10",
      "extra_options": "-DmappedProductsVersion=v1 -DproductCatalogPastValidityInDays=3",
      "java_memory": "36G",
      "memory": "40G",
      "gpu_vendor": "nvidia.com/gpu",
      "timeout_s": 7200,
      "version": "3.64.0"
    }
  }
}
```

---

## Strategy IDs & Image Names Reference

| Strategy | Image | Script | Launcher class |
|----------|-------|--------|----------------|
| `semantic-search-learning` | `semantic-search` | `/opt/start-semantic-search-batch.sh` | (n/a) |
| `semantic-search-evaluation` | `semantic-search` | `/opt/start-semantic-search-batch.sh` | (n/a) |
| `items-encoding` | `algo-search-batch` | `/opt/start.sh` | `earlybirds.algo.search.batch.SearchSimpleBatchLauncher` |
| `query-dataset-generation` | `algo-search-batch` | `/opt/start.sh` | `earlybirds.algo.search.batch.SearchSimpleBatchLauncher` |
| `seo-keyphrases-generator` | `item-utils` | `/opt/start.sh` | `earlybirds.item.dump.seo_keyphrases.SEOKeyPhrasesGeneratorLauncher` |
| `search-item-data-dataset-preprocessing` | `algo-search-batch` | `/opt/start.sh` | (Spark/Dataproc) |
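
When generating configs programmatically, the table can be encoded as a lookup. The sketch below is one possible shape, with a hypothetical helper name; it covers only the five rows with a definite launcher entry and leaves out `search-item-data-dataset-preprocessing`, whose launcher is listed only as "(Spark/Dataproc)".

```python
# (image_name, cmd_script_path, launcher_class or None), copied from the table above
STRATEGIES = {
    "semantic-search-learning": (
        "semantic-search", "/opt/start-semantic-search-batch.sh", None),
    "semantic-search-evaluation": (
        "semantic-search", "/opt/start-semantic-search-batch.sh", None),
    "items-encoding": (
        "algo-search-batch", "/opt/start.sh",
        "earlybirds.algo.search.batch.SearchSimpleBatchLauncher"),
    "query-dataset-generation": (
        "algo-search-batch", "/opt/start.sh",
        "earlybirds.algo.search.batch.SearchSimpleBatchLauncher"),
    "seo-keyphrases-generator": (
        "item-utils", "/opt/start.sh",
        "earlybirds.item.dump.seo_keyphrases.SEOKeyPhrasesGeneratorLauncher"),
}


def params_for(strategy_id: str, predictor_id: str) -> dict:
    """Build the strategy-dependent part of `params` (hypothetical helper)."""
    image_name, script_path, launcher_class = STRATEGIES[strategy_id]
    params = {
        "predictor_id": predictor_id,
        "strategy_id": strategy_id,
        "image_name": image_name,
        "cmd_script_path": script_path,
    }
    # Only strategies with a launcher class in the table get this key
    if launcher_class is not None:
        params["launcher_class"] = launcher_class
    return params
```

The caller still has to add `batch_config` (with `arguments` or `custom_params`, depending on the pipeline) and the image version.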

---

## Docker Image Version Defaults (from `versions.py`)

| Image | Version constant | Current value |
|-------|-----------------|---------------|
| `semantic-search` | `SEMANTIC_SEARCH_ML_VERSION` | `0.0.15` (check for overrides) |
| `algo-search-batch` | `SEARCH_VERSION` | `3.64.0` |
| `item-utils` | `ITEM_UTILS_VERSION` | `2.23.0` |

> Always check the current `kubeflow_pipelines/pipelines/utils/versions.py` for latest values; these drift.

---

## GCS Path Patterns (outputs from pipeline steps)

| Step | Output key | GCS path pattern |
|------|-----------|-----------------|
| item_data_dataset_preprocessing | `item_data_dataset_directory_path` | `search/search_item_data_dataset_preprocessing/<version>/<predictor_id>/<timestamp>/item-data-dataset-preprocessing-dataframe` |
| item_data_dataset_preprocessing | `item_data_dataset_meta_info_directory_path` | `search/search_item_data_dataset_preprocessing/<version>/<predictor_id>/<timestamp>/item-data-dataset-preprocessing-meta-info` |
| query_dataset_generation | `query_training_dataset_directory_path` | `search/query-dataset-generation/<version>/<predictor_id>/<timestamp>/query-training-dataset-preprocessing-dataframe` |
| query_dataset_generation | `query_evaluation_dataset_directory_path` | `search/query-dataset-generation/<version>/<predictor_id>/<timestamp>/query-evaluation-dataset-preprocessing-dataframe` |
| learning | `run_id` | MLflow run ID (UUID hex string, e.g. `5046f02ec2c146b3b66abcd3b82f15d4`) |

> **Find output paths**: In Kubeflow UI → run → step → "Output artifacts" tab, or via `scripts/kf_query.py`.
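
When a step output path needs to be checked or logged, it can help to split it into its components. The sketch below assumes the path is given exactly in the tabulated form (six `/`-separated segments starting with `search`, with no bucket prefix); the type and function names are hypothetical.

```python
from typing import NamedTuple


class StepOutputPath(NamedTuple):
    step_dir: str
    version: str
    predictor_id: str
    timestamp: str
    leaf: str


def parse_output_path(path: str) -> StepOutputPath:
    """Split search/<step>/<version>/<predictor_id>/<timestamp>/<leaf> (sketch)."""
    parts = path.strip("/").split("/")
    if len(parts) != 6 or parts[0] != "search":
        raise ValueError(f"unexpected output path layout: {path!r}")
    _, step_dir, version, predictor_id, timestamp, leaf = parts
    return StepOutputPath(step_dir, version, predictor_id, timestamp, leaf)


# Hypothetical values, for illustration only
example = parse_output_path(
    "search/query-dataset-generation/3.64.0/abc123/1700000000000/"
    "query-training-dataset-preprocessing-dataframe"
)
```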
@@ -0,0 +1,182 @@
# Semantic Search Pipeline Steps Reference

## `semantic_search_learning_with_generated_analytics_pipeline`

Full pipeline DAG. Steps run in order (with conditional branches):

```
1. item_seo_keyphrases_generator     [Scala, item-utils image]
2. item_data_dataset_preprocessing   [Spark/Dataproc, algo-search-batch]
3. query_dataset_generation          [Scala, algo-search-batch] → outputs: training + evaluation dataset paths
4. learning                          [Python GPU, semantic-search image] → outputs: mlflow run_id
   ├── 5. evaluation (conditional)     [Python GPU, semantic-search image] ; only if evaluation dataset != ""
   ├── 6. items_encoding (conditional) [Scala, algo-search-batch] ; only if items_encoding_enabled=True
   │   └── 7. set_model_alias (conditional) ; only if model_alias_transition_enabled=True
   └── send_pubsub_message
```
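
The conditional branches can be restated as a small function that lists which steps would run for a given set of inputs. This is a reading of the diagram above, not pipeline code: the function name and the `evaluation_dataset_path` parameter name are assumptions, and the nesting is taken to mean that `set_model_alias` runs only when `items_encoding` runs.

```python
def planned_steps(
    evaluation_dataset_path: str,
    items_encoding_enabled: bool,
    model_alias_transition_enabled: bool,
) -> list:
    """List the steps expected to run, per the DAG diagram (sketch)."""
    steps = [
        "item_seo_keyphrases_generator",
        "item_data_dataset_preprocessing",
        "query_dataset_generation",
        "learning",
    ]
    if evaluation_dataset_path != "":
        steps.append("evaluation")
    if items_encoding_enabled:
        steps.append("items_encoding")
        # Assumption: the alias step is nested under items_encoding
        if model_alias_transition_enabled:
            steps.append("set_model_alias")
    steps.append("send_pubsub_message")
    return steps
```

The returned order is only a listing; steps 5 and 6 are siblings in the diagram, so nothing here implies they run one after the other.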

---

## Step-by-Step Configs for Manual Re-runs

### Step 4: Learning (`python_batch_pipeline`)

```json
{
  "pipeline_name": "python_batch_pipeline",
  "version_name": "0.1.271",
  "experiment_name": "<TENANT> - EBAP - Search - Learning workflow",
  "params": {
    "predictor_id": "<PREDICTOR_ID>",
    "strategy_id": "semantic-search-learning",
    "image_name": "semantic-search",
    "cmd_script_path": "/opt/start-semantic-search-batch.sh",
    "batch_config": {
      "arguments": {
        "itemDataDatasetPreprocessingDirectoryPath": "<FROM_STEP_2_item_data_dataset_directory_path>",
        "pretrainedModelKey": {
          "modelId": "<PRETRAINED_RUN_ID_or_huggingface_model_id>",
          "modelRepository": "<mlflow_or_huggingface>"
        },
        "queryDatasetPreprocessingDirectoryPath": "<FROM_STEP_3_query_training_dataset_directory_path>",
        "verticalRunPredictorIds": ["<PREDICTOR_ID>"]
      },
      "cpu": "11",
      "gpu": "1",
      "gpu_accelerator_name": "nvidia-l4",
      "gpu_vendor": "nvidia.com/gpu",
      "memory": "80G",
      "version": "<SEMANTIC_SEARCH_ML_VERSION>"
    }
  }
}
```

**Required inputs:**
- `itemDataDatasetPreprocessingDirectoryPath` → from step 2 output
- `queryDatasetPreprocessingDirectoryPath` → from step 3 `query_training_dataset_directory_path`
- `pretrainedModelKey` → either a HuggingFace model (e.g. `sentence-transformers/all-MiniLM-L6-v2`) or an MLflow run_id from a previous training

**Output:** MLflow `run_id`. Find via MLflow UI or `scripts/mlflow_query.py`.

---

### Step 5: Evaluation (`python_batch_pipeline`)

```json
{
  "pipeline_name": "python_batch_pipeline",
  "version_name": "0.1.271",
  "experiment_name": "<TENANT> - EBAP - Search - Learning workflow",
  "params": {
    "predictor_id": "<PREDICTOR_ID>",
    "strategy_id": "semantic-search-evaluation",
    "image_name": "semantic-search",
    "cmd_script_path": "/opt/start-semantic-search-batch.sh",
    "batch_config": {
      "arguments": {
        "itemDataDatasetPreprocessingDirectoryPath": "<FROM_STEP_2_item_data_dataset_directory_path>",
        "pretrainedModelKey": {
          "modelId": "<MLFLOW_RUN_ID_FROM_STEP_4>",
          "modelRepository": "mlflow"
        },
        "queryDatasetPreprocessingDirectoryPath": "<FROM_STEP_3_query_evaluation_dataset_directory_path>"
      },
      "cpu": "11",
      "gpu": "1",
      "gpu_accelerator_name": "nvidia-l4",
      "gpu_vendor": "nvidia.com/gpu",
      "memory": "80G",
      "version": "<SEMANTIC_SEARCH_ML_VERSION>"
    }
  }
}
```

**Key differences from learning:**
- `strategy_id`: `semantic-search-evaluation`
- `queryDatasetPreprocessingDirectoryPath` → uses the **evaluation** dataset path (step 3's `query_evaluation_dataset_directory_path`), not the training one
- No `verticalRunPredictorIds`
- `pretrainedModelKey.modelId` → MLflow `run_id` from step 4, `modelRepository` = `"mlflow"`

**Skip if:** `query_evaluation_dataset_directory_path` was empty in step 3.
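
Since the evaluation config differs from the learning config in only a few fields, it can be derived rather than written by hand. The sketch below applies the differences listed above to a copy of the step 4 config; the function name is hypothetical, and returning `None` for an empty evaluation path is one possible way to express the skip rule.

```python
import copy


def evaluation_config_from_learning(
    learning_config: dict,
    mlflow_run_id: str,
    evaluation_dataset_path: str,
):
    """Derive the step 5 config from the step 4 config (sketch)."""
    if not evaluation_dataset_path:
        return None  # step 3 produced no evaluation dataset: skip step 5
    config = copy.deepcopy(learning_config)  # leave the learning config untouched
    params = config["params"]
    params["strategy_id"] = "semantic-search-evaluation"
    arguments = params["batch_config"]["arguments"]
    # Evaluate on the evaluation dataset, not the training one
    arguments["queryDatasetPreprocessingDirectoryPath"] = evaluation_dataset_path
    # Evaluate the model produced by step 4
    arguments["pretrainedModelKey"] = {
        "modelId": mlflow_run_id,
        "modelRepository": "mlflow",
    }
    arguments.pop("verticalRunPredictorIds", None)
    return config
```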

---

### Step 6: Items Encoding (`scala_batch_pipeline`)

```json
{
  "pipeline_name": "scala_batch_pipeline",
  "version_name": "0.1.271",
  "experiment_name": "<TENANT> - EBAP - Search - Learning workflow",
  "params": {
    "predictor_id": "<PREDICTOR_ID>",
    "strategy_id": "items-encoding",
    "image_name": "algo-search-batch",
    "cmd_script_path": "/opt/start.sh",
    "launcher_class": "earlybirds.algo.search.batch.SearchSimpleBatchLauncher",
    "batch_config": {
      "custom_params": {
        "itemEncodingInferenceKey": {
          "modelRepository": "mlflow",
          "modelId": "<MLFLOW_RUN_ID_FROM_STEP_4>",
          "version": ""
        }
      },
      "version": "3.64.0",
      "cpu": "10",
      "memory": "40G",
      "java_memory": "36G",
      "timeout_s": 7200,
      "gpu_vendor": "nvidia.com/gpu",
      "extra_options": "-DmappedProductsVersion=v1 -DproductCatalogPastValidityInDays=3"
    }
  }
}
```

**Key notes:**
- Uses `custom_params` (not `arguments`); it's a Scala job
- `version: ""` in `itemEncodingInferenceKey` → auto-generates new encoding version
- `itemEncodingInferenceKey.modelId` = MLflow `run_id` from step 4

---

### Step 7: Set Model Alias (MLflow CLI / UI)

```bash
# Via MLflow Python client
python3 - <<'EOF'
import mlflow
mlflow.set_tracking_uri("http://10.11.96.16:5000/")
client = mlflow.MlflowClient()
client.set_registered_model_alias(
    name="semantic-search-<predictor_id>",
    alias="Production",
    version="<MODEL_VERSION>"
)
EOF
```

Or do it directly in the MLflow UI at http://10.11.96.16:5000/.

---

## Other Common Pipeline Types

### `semantic_search_item_encoding_pipeline` (standalone encoding)

Use when re-running only the encoding step with a new model. Runs SEO keyphrases generator + items encoding. See `pipeline-configs.md` for full config.

### `clip_search_rnn_learning_pipeline`

Learning pipeline for CLIP+RNN search model. Config example in `configs/development/ai/clip_search_rnn_learning_pipeline/`.

---

## How to Find Output Paths from a Completed Step

1. **Kubeflow UI**: http://10.11.96.10/ → Runs → click run → click step → "Output artifacts" tab
2. **Via script**: `python3 scripts/kf_query.py <run_id>` → shows all step outputs
3. **Pattern**: Outputs follow `<strategy>/<version>/<predictor_id>/<epoch_ms>/...` in GCS