@groupby/ai-dev 0.5.5 → 0.5.8

Files changed (43)
  1. package/package.json +1 -1
  2. package/teams/OOF/skills/jira-ticket-creator/README.md +22 -0
  3. package/teams/OOF/skills/jira-ticket-creator/SKILL.md +266 -0
  4. package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/full.md +31 -0
  5. package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/light.md +7 -0
  6. package/teams/fhr-ai-team/github/copilot-instructions.md +24 -0
  7. package/teams/fhr-ai-team/github/instructions/python.instructions.md +23 -0
  8. package/teams/fhr-ai-team/github/pull_request_template.md +21 -0
  9. package/teams/fhr-ai-team/prompts/brainstorm.md +7 -0
  10. package/teams/fhr-ai-team/prompts/plan-algo-tests.md +7 -0
  11. package/teams/fhr-ai-team/prompts/plan.md +7 -0
  12. package/teams/fhr-ai-team/prompts/pr-description.md +7 -0
  13. package/teams/fhr-ai-team/prompts/test.md +7 -0
  14. package/teams/fhr-ai-team/resources/AGENTS.md +55 -0
  15. package/teams/fhr-ai-team/resources/CLAUDE.md +52 -0
  16. package/teams/fhr-ai-team/resources/README.md +51 -0
  17. package/teams/fhr-ai-team/resources/claude-code-setup.md +60 -0
  18. package/teams/fhr-ai-team/resources/copilot-setup.md +64 -0
  19. package/teams/fhr-ai-team/resources/onboarding.md +179 -0
  20. package/teams/fhr-ai-team/resources/opencode-install.md +29 -0
  21. package/teams/fhr-ai-team/resources/opencode-setup.md +43 -0
  22. package/teams/fhr-ai-team/skills/algo-test-planning/SKILL.md +192 -0
  23. package/teams/fhr-ai-team/skills/algo-test-planning/references/pipeline-registry.md +280 -0
  24. package/teams/fhr-ai-team/skills/brainstorming/SKILL.md +111 -0
  25. package/teams/fhr-ai-team/skills/e2e-testing/SKILL.md +163 -0
  26. package/teams/fhr-ai-team/skills/grill-me/SKILL.md +10 -0
  27. package/teams/fhr-ai-team/skills/ml-tooling-dev/SKILL.md +313 -0
  28. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/kubectl-debug.md +165 -0
  29. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/mongodb-config.md +218 -0
  30. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-configs.md +190 -0
  31. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-steps.md +182 -0
  32. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_logs.py +203 -0
  33. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_query.py +233 -0
  34. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_wait.py +195 -0
  35. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mlflow_query.py +252 -0
  36. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mongo_predictor.py +352 -0
  37. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/SKILL.md +230 -0
  38. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/dataset-naming.md +190 -0
  39. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/domain-vocabulary.md +447 -0
  40. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/repo-dependency-graph.md +264 -0
  41. package/teams/fhr-ai-team/skills/planning/SKILL.md +138 -0
  42. package/teams/fhr-ai-team/skills/pr-description/SKILL.md +94 -0
  43. package/teams/snpd/skills/code-review-github/SKILL.md +475 -0
@@ -0,0 +1,218 @@
1
+ # MongoDB Configuration Reference
2
+
3
+ ## Connection
4
+
5
+ | Parameter | Value |
6
+ |-----------|-------|
7
+ | Host | `mongodb://10.11.96.21:27017` |
8
+ | Database | `earlybirds` |
9
+ | Collection | `predictors` |
10
+ | Tool | `mongosh` (install via `brew install mongosh` if missing) |
11
+
12
+ ---
13
+
14
+ ## Config Structure
15
+
16
+ Training hyperparameters live inside each predictor document at:
17
+
18
+ ```
19
+ config.batch.<strategy-id>
20
+ ```
21
+
22
+ For semantic search learning, the path is:
23
+
24
+ ```
25
+ config.batch.semantic-search-learning
26
+ ├── modelConfig
27
+ │   ├── pretrainedModelNameOrPath (e.g. "intfloat/multilingual-e5-large")
28
+ │   ├── poolingType ("MEAN", "CLS", "LAST_TOKEN")
29
+ │   ├── lossType ("CONTRASTIVE", "TRIPLET", "COSINE")
30
+ │   ├── similarityScale (float, e.g. 20.0)
31
+ │   └── contrastiveLossConfig
32
+ │       ├── similarityScale (float, e.g. 20.0)
33
+ │       └── temperature (float, e.g. 1.0)
34
+ ├── pipelineConfig
35
+ │   ├── maxSequenceLength (int, e.g. 32, 64, 128)
36
+ │   ├── pretrainedModelNameOrPath (same as modelConfig)
37
+ │   ├── queryPrefix (e.g. "query: ")
38
+ │   └── itemPrefix (e.g. "passage: ")
39
+ └── learningConfig
40
+     ├── batchSize (int, dataset loading batch size)
41
+     ├── useEvaluation (bool)
42
+     ├── useEarlyStopping (bool)
43
+     ├── earlyStoppingPatience (int)
44
+     ├── earlyStoppingThreshold (float)
45
+     ├── evaluationSplitRatio (float, e.g. 0.1)
46
+     ├── onnxQuantization { "quantize": true }
47
+     └── trainingArguments
48
+         ├── numTrainEpochs (int)
49
+         ├── perDeviceTrainBatchSize (int, in-batch negatives count)
50
+         ├── perDeviceEvalBatchSize (int)
51
+         ├── gradientAccumulationSteps (int)
52
+         ├── gradientCheckpointing (bool, required for large batch on L4)
53
+         ├── evaluationStrategy ("epoch" | "steps")
54
+         ├── saveStrategy ("epoch" | "steps")
55
+         ├── loggingStrategy ("steps")
56
+         ├── loggingSteps (int, e.g. 50)
57
+         ├── warmupSteps (int, e.g. 400)
58
+         ├── weightDecay (float, e.g. 0.01)
59
+         ├── loadBestModelAtEnd (bool)
60
+         ├── metricForBestModel ("eval_loss")
61
+         ├── bf16 (bool)
62
+         ├── fp16 (bool)
63
+         ├── dataloaderNumWorkers (int)
64
+         ├── dataloaderPinMemory (bool)
65
+         ├── dataloaderPersistentWorkers (bool)
66
+         ├── dataloaderPrefetchFactor (int)
67
+         ├── removeUnusedColumns (bool)
68
+         ├── torchCompile (bool)
69
+         └── torchCompileBackend (string | null)
70
+ ```
71
+
72
+ ---
73
+
74
+ ## Common Operations
75
+
76
+ All read/update/replace operations on `config.batch.<strategy>` go through `scripts/mongo_predictor.py` (see `SKILL.md` for full options). It validates the ObjectId, escapes input through env-bound JSON (no shell injection surface), and preserves `Int32` types on full-replace.
77
+
78
+ ### Read current config
79
+
80
+ ```bash
81
+ python3 scripts/mongo_predictor.py read <PREDICTOR_ID>
82
+ ```
83
+
84
+ ### Update individual fields (preferred for targeted changes)
85
+
86
+ ```bash
87
+ python3 scripts/mongo_predictor.py update <PREDICTOR_ID> \
88
+ --set pipelineConfig.maxSequenceLength=64 \
89
+ --set learningConfig.batchSize=128 \
90
+ --set learningConfig.trainingArguments.perDeviceTrainBatchSize=128 \
91
+ --set learningConfig.trainingArguments.perDeviceEvalBatchSize=128 \
92
+ --set learningConfig.trainingArguments.gradientAccumulationSteps=4 \
93
+ --set learningConfig.trainingArguments.gradientCheckpointing=true
94
+ ```
95
+
96
+ Keys are dot-paths *under* the strategy. Values auto-coerce: `true`/`false`/`null`/int/float/JSON literal/string.
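The coercion and dot-path rules described above can be illustrated with a minimal sketch. This is not the code in `mongo_predictor.py`; the names `coerce_value` and `build_set_document` are hypothetical, and trying a JSON parse first with a plain-string fallback is an assumption about how the coercion behaves.

```python
import json


def coerce_value(raw: str):
    """Coerce a --set value string (assumed order: JSON literal, else plain string)."""
    try:
        # json.loads covers true/false/null, ints, floats, and JSON objects/arrays.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Anything that is not valid JSON is kept as a plain string.
        return raw


def build_set_document(strategy: str, assignments: list[str]) -> dict:
    """Build a dot-notation $set update for keys given relative to the strategy."""
    prefix = f"config.batch.{strategy}."
    update = {}
    for item in assignments:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        update[prefix + key] = coerce_value(raw)
    return {"$set": update}
```

Because each key is expanded to a full dot-path, only the named fields are touched and sibling fields in the strategy config are left unchanged.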
97
+
98
+ ### Replace entire strategy config (for full overrides)
99
+
100
+ Stage the new config as JSON and apply it. Preview first with `diff`:
101
+
102
+ ```bash
103
+ python3 scripts/mongo_predictor.py diff <PREDICTOR_ID> --file experiment-A.json
104
+ python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-A.json
105
+ ```
106
+
107
+ Example `experiment-A.json` (top-level matches `config.batch.<strategy>`):
108
+
109
+ ```json
110
+ {
111
+ "modelConfig": {
112
+ "pretrainedModelNameOrPath": "intfloat/multilingual-e5-large",
113
+ "poolingType": "MEAN",
114
+ "similarityScale": 20.0,
115
+ "contrastiveLossConfig": { "similarityScale": 20.0, "temperature": 1.0 }
116
+ },
117
+ "pipelineConfig": {
118
+ "maxSequenceLength": 64,
119
+ "pretrainedModelNameOrPath": "intfloat/multilingual-e5-large",
120
+ "queryPrefix": "query: ",
121
+ "itemPrefix": "passage: "
122
+ },
123
+ "learningConfig": {
124
+ "useEvaluation": true,
125
+ "batchSize": 64,
126
+ "useEarlyStopping": true,
127
+ "earlyStoppingPatience": 3,
128
+ "earlyStoppingThreshold": 0.001,
129
+ "evaluationSplitRatio": 0.1,
130
+ "onnxQuantization": { "quantize": true },
131
+ "trainingArguments": {
132
+ "numTrainEpochs": 3,
133
+ "perDeviceTrainBatchSize": 64,
134
+ "perDeviceEvalBatchSize": 64,
135
+ "gradientAccumulationSteps": 4,
136
+ "evaluationStrategy": "epoch",
137
+ "saveStrategy": "epoch",
138
+ "loggingStrategy": "steps",
139
+ "loggingSteps": 50,
140
+ "warmupSteps": 400,
141
+ "weightDecay": 0.01,
142
+ "loadBestModelAtEnd": true,
143
+ "metricForBestModel": "eval_loss",
144
+ "bf16": true,
145
+ "fp16": false,
146
+ "dataloaderNumWorkers": 4,
147
+ "dataloaderPinMemory": false,
148
+ "dataloaderPersistentWorkers": false,
149
+ "dataloaderPrefetchFactor": 2,
150
+ "removeUnusedColumns": true,
151
+ "torchCompile": false,
152
+ "torchCompileBackend": null,
153
+ "gradientCheckpointing": false
154
+ }
155
+ }
156
+ }
157
+ ```
158
+
159
+ Bare integers in this JSON are stored as `Int32` (the script wraps them in `$numberInt` before sending to mongosh). Bare floats stay as Double.
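As a rough illustration of the `$numberInt` wrapping mentioned above, the sketch below walks a parsed JSON config and rewrites bare integers into MongoDB Extended JSON. The function name `wrap_ints` is hypothetical and this is not the script's actual code; falling back to `$numberLong` for values outside the Int32 range is an assumption added for safety.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def wrap_ints(value):
    """Recursively wrap bare ints as Extended JSON so they are stored as Int32."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; it must stay a boolean.
        return value
    if isinstance(value, int):
        if INT32_MIN <= value <= INT32_MAX:
            return {"$numberInt": str(value)}
        # Assumption: integers that do not fit in Int32 become Int64.
        return {"$numberLong": str(value)}
    if isinstance(value, dict):
        return {key: wrap_ints(item) for key, item in value.items()}
    if isinstance(value, list):
        return [wrap_ints(item) for item in value]
    # Floats, strings, and None pass through unchanged (floats stay Double).
    return value
```

The boolean check has to come before the integer check, otherwise `true` and `false` would be stored as `1` and `0`.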
160
+
161
+ ### Verify config was read by a running pod
162
+
163
+ The training pod reads the MongoDB config at startup. Grep the pod logs to confirm. Use `kf_logs.py` to skip the manual pod-name lookup:
164
+
165
+ ```bash
166
+ python3 scripts/kf_logs.py <run_id> --step get-container-base-task-2 --tail 500 \
167
+ | grep -E 'maxSequenceLength:|perDeviceTrainBatchSize:|gradientAccumulationSteps:|gradientCheckpointing:|numTrainEpochs:'
168
+ ```
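If the grep output needs to be checked against the values that were just applied, a small comparison helper can be sketched as follows. The log line format (`key: value`) is inferred only from the grep pattern above and is an assumption; `extract_logged_values` and `find_mismatches` are hypothetical names, not part of `kf_logs.py`.

```python
import re


def extract_logged_values(log_text: str, keys: list[str]) -> dict[str, str]:
    """Collect the last logged value for each key, assuming 'key: value' lines."""
    if not keys:
        return {}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r"):\s*(\S+)")
    found = {}
    for key, value in pattern.findall(log_text):
        found[key] = value.rstrip(",")
    return found


def find_mismatches(log_text: str, expected: dict) -> dict:
    """Return {key: (expected, logged)} for keys that are missing or differ."""
    found = extract_logged_values(log_text, list(expected))
    problems = {}
    for key, want in expected.items():
        got = found.get(key)
        # Compare case-insensitively so Python's True matches a logged "true".
        if got is None or got.lower() != str(want).lower():
            problems[key] = (want, got)
    return problems
```

An empty result means every expected key was found in the logs with the expected value.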
169
+
170
+ Pod naming on this cluster: `<workflow-name>-<numeric-id>` (e.g., `python-batch-pipeline-26dsn-2177690451`). See `references/kubectl-debug.md` for the full kubectl reference.
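The naming pattern above can be split mechanically. A minimal sketch, assuming the numeric ID is always the final hyphen-separated segment (the helper name `split_pod_name` is hypothetical):

```python
def split_pod_name(pod_name: str) -> tuple[str, str]:
    """Split '<workflow-name>-<numeric-id>' into its two parts."""
    workflow, sep, numeric_id = pod_name.rpartition("-")
    if not sep or not workflow or not numeric_id.isdigit():
        raise ValueError(f"unexpected pod name: {pod_name!r}")
    return workflow, numeric_id
```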
171
+
172
+ ### Raw mongosh fallback
173
+
174
+ If `mongo_predictor.py` is unavailable or you need a one-off query that the script does not cover, use `mongosh` directly:
175
+
176
+ ```bash
177
+ mongosh "mongodb://10.11.96.21:27017/earlybirds" --quiet --eval '
178
+ const doc = db.predictors.findOne({_id: ObjectId("<PREDICTOR_ID>")});
179
+ print(JSON.stringify(doc.config.batch["semantic-search-learning"], null, 2));
180
+ '
181
+ ```
182
+
183
+ Remember the `NumberInt()` rule for full-document `$set` (see Gotchas below).
184
+
185
+ ---
186
+
187
+ ## Sequential Launch Pattern (Multiple Experiments Sharing One Predictor)
188
+
189
+ All runs for the same predictor read config from the **same MongoDB document**. The basic loop:
190
+
191
+ 1. `mongo_predictor.py apply` Config A
192
+ 2. Launch run A (`python -m run -c ...`)
193
+ 3. `kf_wait.py <run_id> --step get-container-base-task-2 --state RUNNING` (the training pod has now read the config)
194
+ 4. `mongo_predictor.py apply` Config B
195
+ 5. Launch run B
196
+ 6. Repeat
197
+
198
+ ```bash
199
+ python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-A.json
200
+ # launch run A, capture <run_id_A> from output
201
+ python3 scripts/kf_wait.py <run_id_A> --step get-container-base-task-2 --state RUNNING --timeout 600
202
+
203
+ python3 scripts/mongo_predictor.py apply <PREDICTOR_ID> --file experiment-B.json
204
+ # launch run B
205
+ python3 scripts/kf_wait.py <run_id_B> --step get-container-base-task-2 --state RUNNING --timeout 600
206
+ ```
207
+
208
+ **Critical:** if you apply the next config before the previous run's pod reaches `RUNNING`, that run gets the wrong config. Always wait, then verify with the `kf_logs.py` grep above.
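The apply, launch, wait ordering can be written down as a small driver. This is a sketch under stated assumptions: `launch` is a caller-supplied function that starts the run and returns its run ID (how the ID is captured from `python -m run` output is not specified here), and `kf_wait.py` is assumed to exit non-zero on timeout so that `check=True` stops the loop. The names `launch_sequentially` and `run_checked` are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

STEP = "get-container-base-task-2"


def run_checked(argv: Sequence[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(list(argv), check=True)


def launch_sequentially(
    predictor_id: str,
    config_files: Sequence[str],
    launch: Callable[[str], str],
    run_cmd: Callable[[Sequence[str]], None] = run_checked,
    timeout_s: int = 600,
) -> list[str]:
    """Apply each config, launch its run, and wait for RUNNING before the next."""
    run_ids = []
    for config_file in config_files:
        run_cmd(["python3", "scripts/mongo_predictor.py", "apply",
                 predictor_id, "--file", config_file])
        run_id = launch(config_file)
        # Do not touch the shared document again until this pod is RUNNING.
        run_cmd(["python3", "scripts/kf_wait.py", run_id, "--step", STEP,
                 "--state", "RUNNING", "--timeout", str(timeout_s)])
        run_ids.append(run_id)
    return run_ids
```

Passing `run_cmd` as a parameter keeps the ordering logic testable without a cluster.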
209
+
210
+ ---
211
+
212
+ ## Key Gotchas
213
+
214
+ - **Database is `earlybirds`**, not `ebap`. Early attempts used `ebap`, which is wrong.
215
+ - **Use `NumberInt()` for integer values** when doing full-document `$set` to avoid MongoDB storing them as doubles. Dot-notation `$set` with bare integers is fine.
216
+ - **`batchSize` under `learningConfig`** is the dataset loading batch size; **`perDeviceTrainBatchSize` under `trainingArguments`** is what controls in-batch negatives for contrastive learning. Keep them in sync for consistency.
217
+ - **`gradientCheckpointing: true`** is required when `perDeviceTrainBatchSize` is large (128+) on L4 GPUs to avoid OOM.
218
+ - **`pretrainedModelNameOrPath`** appears in BOTH `modelConfig` and `pipelineConfig`; update both when changing the base model.
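Several of these gotchas can be checked before a config is applied. The following is a minimal sketch of such a pre-flight check, not an existing tool; `check_strategy_config` is a hypothetical name, and the 128 threshold is taken from the note above.

```python
def check_strategy_config(cfg: dict) -> list[str]:
    """Return warnings for the config gotchas listed above (empty list if none)."""
    warnings = []
    model = cfg.get("modelConfig", {})
    pipeline = cfg.get("pipelineConfig", {})
    learning = cfg.get("learningConfig", {})
    args = learning.get("trainingArguments", {})

    if model.get("pretrainedModelNameOrPath") != pipeline.get("pretrainedModelNameOrPath"):
        warnings.append("pretrainedModelNameOrPath differs between modelConfig and pipelineConfig")

    if learning.get("batchSize") != args.get("perDeviceTrainBatchSize"):
        warnings.append("learningConfig.batchSize and perDeviceTrainBatchSize are out of sync")

    train_batch = args.get("perDeviceTrainBatchSize") or 0
    if train_batch >= 128 and not args.get("gradientCheckpointing", False):
        warnings.append("perDeviceTrainBatchSize >= 128 without gradientCheckpointing (OOM risk on L4)")

    return warnings
```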
@@ -0,0 +1,190 @@
1
+ # Pipeline Configuration Reference
2
+
3
+ ## Endpoints
4
+ - **Kubeflow UI**: http://10.11.96.10/
5
+ - **MLflow UI**: http://10.11.96.16:5000/
6
+ - **Config files location**: `attraqt-kubeflow-configs/configs/development/`
7
+ - **Pipeline definitions**: `attraqt-kubeflow-pipelines/kubeflow_pipelines/pipelines/`
8
+
9
+ ---
10
+
11
+ ## Launching a Run from a Config File
12
+
13
+ ```bash
14
+ # Must be run from the scripts/ directory of attraqt-kubeflow-configs
15
+ cd attraqt-kubeflow-configs/scripts
16
+ python -m run -c <absolute_path_to_config.json>
17
+ ```
18
+
19
+ **Prerequisites:**
20
+ - `attraqt-kubeflow-configs/.env` must exist (copy from `.env.dev` for dev)
21
+ - `version_name` must match an existing pipeline version. Check with:
22
+ ```bash
23
+ python3 kf_query.py --pipeline-versions <pipeline_name>
24
+ ```
25
+
26
+ **What `run.py` does:** Reads the JSON config, calls `KubeflowClient.create_run()` with `pipeline_name`, `version_name`, `experiment_name` (also used as `run_name`), and `params`.
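The mapping from config file to `create_run()` arguments might look roughly like the sketch below. The real keyword names accepted by `KubeflowClient.create_run()` are not shown in this reference, so the dictionary keys are an assumption, and `build_run_kwargs` and `load_run_kwargs` are hypothetical helpers rather than code from `run.py`.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("pipeline_name", "version_name", "experiment_name", "params")


def build_run_kwargs(config: dict) -> dict:
    """Map a parsed run config to the arguments described above."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing: {', '.join(missing)}")
    return {
        "pipeline_name": config["pipeline_name"],
        "version_name": config["version_name"],
        "experiment_name": config["experiment_name"],
        # The experiment name doubles as the run name.
        "run_name": config["experiment_name"],
        "params": config["params"],
    }


def load_run_kwargs(path: str) -> dict:
    """Read a JSON config file and build the run arguments from it."""
    return build_run_kwargs(json.loads(Path(path).read_text()))
```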
27
+
28
+ ---
29
+
30
+ ## 1. `python_batch_pipeline`: Python GPU/CPU batch job
31
+
32
+ **Template** (`configs/development/batch/python_batch_pipeline.json`):
33
+ ```json
34
+ {
35
+ "pipeline_name": "python_batch_pipeline",
36
+ "version_name": "0.1.271",
37
+ "experiment_name": "<EXPERIMENT_NAME>",
38
+ "params": {
39
+ "predictor_id": "<PREDICTOR_ID>",
40
+ "strategy_id": "<STRATEGY_ID>",
41
+ "image_name": "<IMAGE_NAME>",
42
+ "cmd_script_path": "<SCRIPT_PATH>",
43
+ "disk_enabled": "False",
44
+ "disk_name": null,
45
+ "batch_config": {
46
+ "arguments": {},
47
+ "cpu": "1000m",
48
+ "memory": "2G",
49
+ "extra_options": "",
50
+ "gpu": "1",
51
+ "gpu_vendor": "nvidia.com/gpu",
52
+ "gpu_accelerator_name": "nvidia-l4",
53
+ "version": "<IMAGE_VERSION>"
54
+ }
55
+ }
56
+ }
57
+ ```
58
+
59
+ **`batch_config` fields:**
60
+ | Field | Description |
61
+ |-------|-------------|
62
+ | `arguments` | Dict passed as job arguments (strategy-specific) |
63
+ | `version` | Docker image version tag |
64
+ | `cpu` | CPU request (e.g. `"1000m"`, `"11"`) |
65
+ | `memory` | Memory request (e.g. `"2G"`, `"80G"`) |
66
+ | `gpu` | GPU count (e.g. `"1"`) |
67
+ | `gpu_vendor` | Always `"nvidia.com/gpu"` |
68
+ | `gpu_accelerator_name` | GPU type: `"nvidia-l4"` or `"nvidia-tesla-t4"` |
69
+ | `extra_options` | JVM/env extra flags string |
70
+
71
+ ---
72
+
73
+ ## 2. `scala_batch_pipeline`: Scala JVM batch job
74
+
75
+ **Template** (`configs/development/batch/scala_batch_pipeline.json`):
76
+ ```json
77
+ {
78
+ "pipeline_name": "scala_batch_pipeline",
79
+ "version_name": "0.1.271",
80
+ "experiment_name": "<EXPERIMENT_NAME>",
81
+ "params": {
82
+ "predictor_id": "<PREDICTOR_ID>",
83
+ "strategy_id": "<STRATEGY_ID>",
84
+ "image_name": "<IMAGE_NAME>",
85
+ "cmd_script_path": "/opt/start.sh",
86
+ "launcher_class": "<MAIN_CLASS>",
87
+ "disk_enabled": "False",
88
+ "disk_name": null,
89
+ "batch_config": {
90
+ "custom_params": {},
91
+ "version": "<IMAGE_VERSION>",
92
+ "cpu": "1000m",
93
+ "memory": "2G",
94
+ "java_memory": "1G",
95
+ "timeout_s": 3600,
96
+ "extra_options": "",
97
+ "gpu": "1",
98
+ "gpu_vendor": "nvidia.com/gpu"
99
+ }
100
+ }
101
+ }
102
+ ```
103
+
104
+ **`batch_config` fields:**
105
+ | Field | Description |
106
+ |-------|-------------|
107
+ | `custom_params` | Dict of strategy-specific params (NOT `arguments`) |
108
+ | `version` | Docker image version tag |
109
+ | `java_memory` | JVM heap max (e.g. `"1G"`, `"36G"`) |
110
+ | `timeout_s` | Job timeout in seconds (default `3600`) |
111
+
112
+ ---
113
+
114
+ ## 3. `semantic_search_item_encoding_pipeline`: Full encoding pipeline
115
+
116
+ **Example** (`configs/development/ai/semantic_search_item_encoding_pipeline/`):
117
+ ```json
118
+ {
119
+ "pipeline_name": "semantic_search_item_encoding_pipeline",
120
+ "version_name": "0.1.269",
121
+ "experiment_name": "<NAME> - Encoding workflow",
122
+ "params": {
123
+ "search_predictor_id": "<PREDICTOR_ID>",
124
+ "model_repository": "mlflow",
125
+ "model_id": "<MLFLOW_RUN_ID>",
126
+ "item_search_encoding_version": "0",
127
+ "item_seo_keyphrases_generator": {
128
+ "cpu": "4",
129
+ "extra_options": "-DmappedProductsVersion=v1",
130
+ "custom_params": {
131
+ "imageDatasetName": "xo",
132
+ "imageBasePath": "/mnt",
133
+ "promptId": "SEOKeyPhrasesFashionPrompt"
134
+ },
135
+ "java_memory": "2G",
136
+ "memory": "4G",
137
+ "timeout_s": 7200,
138
+ "version": "3.45.0"
139
+ },
140
+ "items_encoding": {
141
+ "cpu": "10",
142
+ "extra_options": "-DmappedProductsVersion=v1 -DproductCatalogPastValidityInDays=3",
143
+ "java_memory": "36G",
144
+ "memory": "40G",
145
+ "gpu_vendor": "nvidia.com/gpu",
146
+ "timeout_s": 7200,
147
+ "version": "3.64.0"
148
+ }
149
+ }
150
+ }
151
+ ```
152
+
153
+ ---
154
+
155
+ ## Strategy IDs & Image Names Reference
156
+
157
+ | Strategy | Image | Script | Launcher class |
158
+ |----------|-------|--------|----------------|
159
+ | `semantic-search-learning` | `semantic-search` | `/opt/start-semantic-search-batch.sh` | (n/a) |
160
+ | `semantic-search-evaluation` | `semantic-search` | `/opt/start-semantic-search-batch.sh` | (n/a) |
161
+ | `items-encoding` | `algo-search-batch` | `/opt/start.sh` | `earlybirds.algo.search.batch.SearchSimpleBatchLauncher` |
162
+ | `query-dataset-generation` | `algo-search-batch` | `/opt/start.sh` | `earlybirds.algo.search.batch.SearchSimpleBatchLauncher` |
163
+ | `seo-keyphrases-generator` | `item-utils` | `/opt/start.sh` | `earlybirds.item.dump.seo_keyphrases.SEOKeyPhrasesGeneratorLauncher` |
164
+ | `search-item-data-dataset-preprocessing` | `algo-search-batch` | `/opt/start.sh` | (Spark/Dataproc) |
165
+
166
+ ---
167
+
168
+ ## Docker Image Version Defaults (from `versions.py`)
169
+
170
+ | Image | Version constant | Current value |
171
+ |-------|-----------------|---------------|
172
+ | `semantic-search` | `SEMANTIC_SEARCH_ML_VERSION` | `0.0.15` (check for overrides) |
173
+ | `algo-search-batch` | `SEARCH_VERSION` | `3.64.0` |
174
+ | `item-utils` | `ITEM_UTILS_VERSION` | `2.23.0` |
175
+
176
+ > Always check the current `kubeflow_pipelines/pipelines/utils/versions.py` for latest values; these drift.
177
+
178
+ ---
179
+
180
+ ## GCS Path Patterns (outputs from pipeline steps)
181
+
182
+ | Step | Output key | GCS path pattern |
183
+ |------|-----------|-----------------|
184
+ | item_data_dataset_preprocessing | `item_data_dataset_directory_path` | `search/search_item_data_dataset_preprocessing/<version>/<predictor_id>/<timestamp>/item-data-dataset-preprocessing-dataframe` |
185
+ | item_data_dataset_preprocessing | `item_data_dataset_meta_info_directory_path` | `search/search_item_data_dataset_preprocessing/<version>/<predictor_id>/<timestamp>/item-data-dataset-preprocessing-meta-info` |
186
+ | query_dataset_generation | `query_training_dataset_directory_path` | `search/query-dataset-generation/<version>/<predictor_id>/<timestamp>/query-training-dataset-preprocessing-dataframe` |
187
+ | query_dataset_generation | `query_evaluation_dataset_directory_path` | `search/query-dataset-generation/<version>/<predictor_id>/<timestamp>/query-evaluation-dataset-preprocessing-dataframe` |
188
+ | learning | `run_id` | MLflow run ID (UUID hex string, e.g. `5046f02ec2c146b3b66abcd3b82f15d4`) |
189
+
190
+ > **Find output paths**: In Kubeflow UI → run → step → "Output artifacts" tab, or via `scripts/kf_query.py`.
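The path patterns in the table share one shape, which makes them easy to take apart. A minimal sketch, assuming the path is given relative to the bucket (no `gs://<bucket>/` prefix) and has exactly the six segments shown in the table; `StepOutputPath` and `parse_output_path` are hypothetical names.

```python
from typing import NamedTuple


class StepOutputPath(NamedTuple):
    step_dir: str
    version: str
    predictor_id: str
    timestamp: str
    leaf: str


def parse_output_path(path: str) -> StepOutputPath:
    """Split 'search/<step>/<version>/<predictor_id>/<timestamp>/<leaf>'."""
    parts = path.strip("/").split("/")
    if len(parts) != 6 or parts[0] != "search":
        raise ValueError(f"unexpected output path: {path!r}")
    return StepOutputPath(*parts[1:])
```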
@@ -0,0 +1,182 @@
1
+ # Semantic Search Pipeline Steps Reference
2
+
3
+ ## `semantic_search_learning_with_generated_analytics_pipeline`
4
+
5
+ Full pipeline DAG. Steps run in order (with conditional branches):
6
+
7
+ ```
8
+ 1. item_seo_keyphrases_generator [Scala, item-utils image]
9
+ 2. item_data_dataset_preprocessing [Spark/Dataproc, algo-search-batch]
10
+ 3. query_dataset_generation [Scala, algo-search-batch] → outputs: training + evaluation dataset paths
11
+ 4. learning [Python GPU, semantic-search image] → outputs: mlflow run_id
12
+    ├── 5. evaluation (conditional) [Python GPU, semantic-search image] ; only if evaluation dataset != ""
13
+    ├── 6. items_encoding (conditional) [Scala, algo-search-batch] ; only if items_encoding_enabled=True
14
+    │   └── 7. set_model_alias (conditional) ; only if model_alias_transition_enabled=True
15
+    └── send_pubsub_message
16
+ ```
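The conditional branches in the DAG above can be expressed as a small function that lists which steps would run for a given set of flags. This is an illustrative sketch, not pipeline code; it assumes `set_model_alias` only runs when `items_encoding` also runs (as the nesting in the tree suggests) and that `send_pubsub_message` always runs. The function name `steps_to_run` is hypothetical.

```python
def steps_to_run(
    evaluation_dataset_path: str,
    items_encoding_enabled: bool,
    model_alias_transition_enabled: bool,
) -> list[str]:
    """List the steps that execute, following the conditions in the DAG."""
    steps = [
        "item_seo_keyphrases_generator",
        "item_data_dataset_preprocessing",
        "query_dataset_generation",
        "learning",
    ]
    if evaluation_dataset_path != "":
        steps.append("evaluation")
    if items_encoding_enabled:
        steps.append("items_encoding")
        # Nested under items_encoding in the DAG.
        if model_alias_transition_enabled:
            steps.append("set_model_alias")
    steps.append("send_pubsub_message")
    return steps
```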
17
+
18
+ ---
19
+
20
+ ## Step-by-Step Configs for Manual Re-runs
21
+
22
+ ### Step 4: Learning (`python_batch_pipeline`)
23
+
24
+ ```json
25
+ {
26
+ "pipeline_name": "python_batch_pipeline",
27
+ "version_name": "0.1.271",
28
+ "experiment_name": "<TENANT> - EBAP - Search - Learning workflow",
29
+ "params": {
30
+ "predictor_id": "<PREDICTOR_ID>",
31
+ "strategy_id": "semantic-search-learning",
32
+ "image_name": "semantic-search",
33
+ "cmd_script_path": "/opt/start-semantic-search-batch.sh",
34
+ "batch_config": {
35
+ "arguments": {
36
+ "itemDataDatasetPreprocessingDirectoryPath": "<FROM_STEP_2_item_data_dataset_directory_path>",
37
+ "pretrainedModelKey": {
38
+ "modelId": "<PRETRAINED_RUN_ID_or_huggingface_model_id>",
39
+ "modelRepository": "<mlflow_or_huggingface>"
40
+ },
41
+ "queryDatasetPreprocessingDirectoryPath": "<FROM_STEP_3_query_training_dataset_directory_path>",
42
+ "verticalRunPredictorIds": ["<PREDICTOR_ID>"]
43
+ },
44
+ "cpu": "11",
45
+ "gpu": "1",
46
+ "gpu_accelerator_name": "nvidia-l4",
47
+ "gpu_vendor": "nvidia.com/gpu",
48
+ "memory": "80G",
49
+ "version": "<SEMANTIC_SEARCH_ML_VERSION>"
50
+ }
51
+ }
52
+ }
53
+ ```
54
+
55
+ **Required inputs:**
56
+ - `itemDataDatasetPreprocessingDirectoryPath` → from step 2 output
57
+ - `queryDatasetPreprocessingDirectoryPath` → from step 3 `query_training_dataset_directory_path`
58
+ - `pretrainedModelKey` → either a HuggingFace model (e.g. `sentence-transformers/all-MiniLM-L6-v2`) or an MLflow run_id from a previous training
59
+
60
+ **Output:** MLflow `run_id`. Find via MLflow UI or `scripts/mlflow_query.py`.
61
+
62
+ ---
63
+
64
+ ### Step 5: Evaluation (`python_batch_pipeline`)
65
+
66
+ ```json
67
+ {
68
+ "pipeline_name": "python_batch_pipeline",
69
+ "version_name": "0.1.271",
70
+ "experiment_name": "<TENANT> - EBAP - Search - Learning workflow",
71
+ "params": {
72
+ "predictor_id": "<PREDICTOR_ID>",
73
+ "strategy_id": "semantic-search-evaluation",
74
+ "image_name": "semantic-search",
75
+ "cmd_script_path": "/opt/start-semantic-search-batch.sh",
76
+ "batch_config": {
77
+ "arguments": {
78
+ "itemDataDatasetPreprocessingDirectoryPath": "<FROM_STEP_2_item_data_dataset_directory_path>",
79
+ "pretrainedModelKey": {
80
+ "modelId": "<MLFLOW_RUN_ID_FROM_STEP_4>",
81
+ "modelRepository": "mlflow"
82
+ },
83
+ "queryDatasetPreprocessingDirectoryPath": "<FROM_STEP_3_query_evaluation_dataset_directory_path>"
84
+ },
85
+ "cpu": "11",
86
+ "gpu": "1",
87
+ "gpu_accelerator_name": "nvidia-l4",
88
+ "gpu_vendor": "nvidia.com/gpu",
89
+ "memory": "80G",
90
+ "version": "<SEMANTIC_SEARCH_ML_VERSION>"
91
+ }
92
+ }
93
+ }
94
+ ```
95
+
96
+ **Key differences from learning:**
97
+ - `strategy_id`: `semantic-search-evaluation`
98
+ - `queryDatasetPreprocessingDirectoryPath` → uses the **evaluation** dataset path (step 3's `query_evaluation_dataset_directory_path`), not the training one
99
+ - No `verticalRunPredictorIds`
100
+ - `pretrainedModelKey.modelId` → MLflow `run_id` from step 4, `modelRepository` = `"mlflow"`
101
+
102
+ **Skip if:** `query_evaluation_dataset_directory_path` was empty in step 3.
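Since the evaluation config differs from the learning config in only a few fields, it can be derived rather than written by hand. The sketch below applies the differences listed above to a copy of a step 4 config; `to_evaluation_config` is a hypothetical helper, and returning `None` to signal the skip case is a design choice of this example.

```python
import copy
from typing import Optional


def to_evaluation_config(
    learning_config: dict,
    mlflow_run_id: str,
    evaluation_dataset_path: str,
) -> Optional[dict]:
    """Derive a step 5 config from a step 4 config, or None if step 5 is skipped."""
    if not evaluation_dataset_path:
        # Step 3 produced no evaluation dataset, so there is nothing to evaluate.
        return None
    cfg = copy.deepcopy(learning_config)
    cfg["params"]["strategy_id"] = "semantic-search-evaluation"
    arguments = cfg["params"]["batch_config"]["arguments"]
    arguments["queryDatasetPreprocessingDirectoryPath"] = evaluation_dataset_path
    arguments["pretrainedModelKey"] = {
        "modelId": mlflow_run_id,
        "modelRepository": "mlflow",
    }
    arguments.pop("verticalRunPredictorIds", None)
    return cfg
```

The deep copy keeps the original learning config untouched, so both files can be written out side by side.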
103
+
104
+ ---
105
+
106
+ ### Step 6: Items Encoding (`scala_batch_pipeline`)
107
+
108
+ ```json
109
+ {
110
+ "pipeline_name": "scala_batch_pipeline",
111
+ "version_name": "0.1.271",
112
+ "experiment_name": "<TENANT> - EBAP - Search - Learning workflow",
113
+ "params": {
114
+ "predictor_id": "<PREDICTOR_ID>",
115
+ "strategy_id": "items-encoding",
116
+ "image_name": "algo-search-batch",
117
+ "cmd_script_path": "/opt/start.sh",
118
+ "launcher_class": "earlybirds.algo.search.batch.SearchSimpleBatchLauncher",
119
+ "batch_config": {
120
+ "custom_params": {
121
+ "itemEncodingInferenceKey": {
122
+ "modelRepository": "mlflow",
123
+ "modelId": "<MLFLOW_RUN_ID_FROM_STEP_4>",
124
+ "version": ""
125
+ }
126
+ },
127
+ "version": "3.64.0",
128
+ "cpu": "10",
129
+ "memory": "40G",
130
+ "java_memory": "36G",
131
+ "timeout_s": 7200,
132
+ "gpu_vendor": "nvidia.com/gpu",
133
+ "extra_options": "-DmappedProductsVersion=v1 -DproductCatalogPastValidityInDays=3"
134
+ }
135
+ }
136
+ }
137
+ ```
138
+
139
+ **Key notes:**
140
+ - Uses `custom_params` (not `arguments`); it's a Scala job
141
+ - `version: ""` in `itemEncodingInferenceKey` → auto-generates new encoding version
142
+ - `modelId` (inside `itemEncodingInferenceKey`) = MLflow `run_id` from step 4
143
+
144
+ ---
145
+
146
+ ### Step 7: Set Model Alias (MLflow Python client / UI)
147
+
148
+ ```bash
149
+ # Via MLflow Python client
150
+ python3 - <<'EOF'
151
+ import mlflow
152
+ mlflow.set_tracking_uri("http://10.11.96.16:5000/")
153
+ client = mlflow.MlflowClient()
154
+ client.set_registered_model_alias(
155
+ name="semantic-search-<predictor_id>",
156
+ alias="Production",
157
+ version="<MODEL_VERSION>"
158
+ )
159
+ EOF
160
+ ```
161
+
162
+ Or do it directly in the MLflow UI at http://10.11.96.16:5000/.
163
+
164
+ ---
165
+
166
+ ## Other Common Pipeline Types
167
+
168
+ ### `semantic_search_item_encoding_pipeline` (standalone encoding)
169
+
170
+ Use when re-running only the encoding step with a new model. Runs SEO keyphrases generator + items encoding. See `pipeline-configs.md` for full config.
171
+
172
+ ### `clip_search_rnn_learning_pipeline`
173
+
174
+ Learning pipeline for CLIP+RNN search model. Config example in `configs/development/ai/clip_search_rnn_learning_pipeline/`.
175
+
176
+ ---
177
+
178
+ ## How to Find Output Paths from a Completed Step
179
+
180
+ 1. **Kubeflow UI**: http://10.11.96.10/ → Runs → click run → click step → "Output artifacts" tab
181
+ 2. **Via script**: `python3 scripts/kf_query.py <run_id>` → shows all step outputs
182
+ 3. **Pattern**: Outputs follow `<strategy>/<version>/<predictor_id>/<epoch_ms>/...` in GCS