@groupby/ai-dev 0.5.7 → 0.5.9

Files changed (67)
  1. package/package.json +1 -1
  2. package/teams/agentic-checkout/prompts/AGENTS.md +103 -0
  3. package/teams/agentic-checkout/prompts/create-plan.md +103 -0
  4. package/teams/agentic-checkout/prompts/create-pull-request.md +157 -0
  5. package/teams/agentic-checkout/prompts/fix-pr-comments.md +170 -0
  6. package/teams/agentic-checkout/prompts/fix-review-findings.md +1 -12
  7. package/teams/agentic-checkout/prompts/implement-task.md +62 -0
  8. package/teams/agentic-checkout/prompts/new-workspace.md +12 -0
  9. package/teams/agentic-checkout/prompts/orchestrate-component-change.md +25 -0
  10. package/teams/agentic-checkout/prompts/review-change.md +8 -2
  11. package/teams/agentic-checkout/scripts/check-secrets +51 -0
  12. package/teams/agentic-checkout/scripts/install-git-hooks +15 -0
  13. package/teams/agentic-checkout/scripts/local-fast-report +5 -0
  14. package/teams/agentic-checkout/scripts/local-report +205 -0
  15. package/teams/agentic-checkout/scripts/local-summarize +47 -0
  16. package/teams/agentic-checkout/scripts/logs-deps +9 -0
  17. package/teams/agentic-checkout/scripts/setup-local-fast-model +20 -0
  18. package/teams/agentic-checkout/scripts/start-deps +15 -0
  19. package/teams/agentic-checkout/scripts/status-deps +9 -0
  20. package/teams/agentic-checkout/scripts/stop-deps +9 -0
  21. package/teams/agentic-checkout/scripts/sync-components +110 -0
  22. package/teams/agentic-checkout/skills/approval-gated-task-execution/SKILL.md +57 -0
  23. package/teams/agentic-checkout/skills/component-verification/SKILL.md +34 -0
  24. package/teams/agentic-checkout/skills/grill-me/SKILL.md +23 -0
  25. package/teams/agentic-checkout/skills/karpathy-guidelines/SKILL.md +67 -0
  26. package/teams/agentic-checkout/skills/secret-safety/SKILL.md +41 -0
  27. package/teams/agentic-checkout/skills/sync-components/SKILL.md +23 -60
  28. package/teams/agentic-checkout/skills/tdd/SKILL.md +48 -0
  29. package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/full.md +31 -0
  30. package/teams/fhr-ai-team/github/PULL_REQUEST_TEMPLATE/light.md +7 -0
  31. package/teams/fhr-ai-team/github/copilot-instructions.md +24 -0
  32. package/teams/fhr-ai-team/github/instructions/python.instructions.md +23 -0
  33. package/teams/fhr-ai-team/github/pull_request_template.md +21 -0
  34. package/teams/fhr-ai-team/prompts/brainstorm.md +7 -0
  35. package/teams/fhr-ai-team/prompts/plan-algo-tests.md +7 -0
  36. package/teams/fhr-ai-team/prompts/plan.md +7 -0
  37. package/teams/fhr-ai-team/prompts/pr-description.md +7 -0
  38. package/teams/fhr-ai-team/prompts/test.md +7 -0
  39. package/teams/fhr-ai-team/resources/AGENTS.md +55 -0
  40. package/teams/fhr-ai-team/resources/CLAUDE.md +52 -0
  41. package/teams/fhr-ai-team/resources/README.md +51 -0
  42. package/teams/fhr-ai-team/resources/claude-code-setup.md +60 -0
  43. package/teams/fhr-ai-team/resources/copilot-setup.md +64 -0
  44. package/teams/fhr-ai-team/resources/onboarding.md +179 -0
  45. package/teams/fhr-ai-team/resources/opencode-install.md +29 -0
  46. package/teams/fhr-ai-team/resources/opencode-setup.md +43 -0
  47. package/teams/fhr-ai-team/skills/algo-test-planning/SKILL.md +192 -0
  48. package/teams/fhr-ai-team/skills/algo-test-planning/references/pipeline-registry.md +280 -0
  49. package/teams/fhr-ai-team/skills/brainstorming/SKILL.md +111 -0
  50. package/teams/fhr-ai-team/skills/e2e-testing/SKILL.md +163 -0
  51. package/teams/fhr-ai-team/skills/grill-me/SKILL.md +10 -0
  52. package/teams/fhr-ai-team/skills/ml-tooling-dev/SKILL.md +313 -0
  53. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/kubectl-debug.md +165 -0
  54. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/mongodb-config.md +218 -0
  55. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-configs.md +190 -0
  56. package/teams/fhr-ai-team/skills/ml-tooling-dev/references/pipeline-steps.md +182 -0
  57. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_logs.py +203 -0
  58. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_query.py +233 -0
  59. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/kf_wait.py +195 -0
  60. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mlflow_query.py +252 -0
  61. package/teams/fhr-ai-team/skills/ml-tooling-dev/scripts/mongo_predictor.py +352 -0
  62. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/SKILL.md +230 -0
  63. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/dataset-naming.md +190 -0
  64. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/domain-vocabulary.md +447 -0
  65. package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/repo-dependency-graph.md +264 -0
  66. package/teams/fhr-ai-team/skills/planning/SKILL.md +138 -0
  67. package/teams/fhr-ai-team/skills/pr-description/SKILL.md +94 -0
@@ -0,0 +1,230 @@
1
+ ---
2
+ name: naming-conventions-reviewer
3
+ description: "Expert reviewer for naming conventions across Crownpeak/Earlybirds ML repositories. Use when: (1) reviewing code for naming consistency, (2) writing new code in any algo.*, toolbox, or pipeline repo, (3) naming new datasets, pipeline steps, config keys, or strategy IDs, (4) checking variable names match cross-repo conventions, (5) creating Kubeflow pipeline configs. Covers: dataset names, parquet columns, GCS paths, LakeFS repos, Docker images, strategy IDs, algorithm names, config class patterns, label constants, and ID naming across Python/Scala/JSON."
4
+ ---
5
+
6
+ # Naming Conventions Reviewer
7
+
8
+ ## Review Workflow
9
+
10
+ 1. Identify the repo and language context (Python, Scala, JSON/YAML)
11
+ 2. Check names against the conventions below and the reference files
12
+ 3. Flag violations with the correct canonical name
13
+ 4. For ambiguous cases, check [domain-vocabulary.md](references/domain-vocabulary.md) for the authoritative pattern
14
+
15
+ ## Core Rules
16
+
17
+ ### Rule 1: Language-Appropriate Casing
18
+ | Context | Convention |
19
+ |---|---|
20
+ | Python variables/functions | `snake_case` |
21
+ | Python constants | `SCREAMING_SNAKE_CASE` |
22
+ | Python classes | `PascalCase` |
23
+ | Scala variables | `camelCase` |
24
+ | Scala classes/case classes | `PascalCase` |
25
+ | JSON/YAML config keys | `camelCase` |
26
+ | Parquet column names | `camelCase` |
27
+ | GCS path segments | `kebab-case` |
28
+ | Docker image names | `kebab-case` |
29
+ | Strategy IDs | `kebab-case` |
30
+ | Environment variables | `SCREAMING_SNAKE_CASE` |
31
+
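The casing table above can be approximated with one regular expression per style. This is a minimal sketch, not code from the repositories: the pattern set and the `matches_casing` helper are assumptions made for illustration. Single-word names such as `query` satisfy both the `snake_case` and `camelCase` patterns, so a reviewer still needs the context column of the table to decide which style applies.

```python
import re

# Assumed regexes, one per casing style named in the Rule 1 table.
CASING_PATTERNS = {
    "snake_case": re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*"),
    "SCREAMING_SNAKE_CASE": re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*"),
    "PascalCase": re.compile(r"[A-Z][A-Za-z0-9]*"),
    "camelCase": re.compile(r"[a-z][A-Za-z0-9]*"),
    "kebab-case": re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*"),
}


def matches_casing(name: str, style: str) -> bool:
    """Return True when the whole of `name` matches the regex for `style`."""
    return CASING_PATTERNS[style].fullmatch(name) is not None


print(matches_casing("item_id", "snake_case"))       # True
print(matches_casing("item_id", "camelCase"))        # False
print(matches_casing("items-encoding", "kebab-case"))  # True
```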
32
+ ### Rule 2: Entity Naming - Use "Item", Not "Product"
33
+ The canonical entity name is **item**, not product. In ML code:
34
+ - `item_id` (Python), `itemId` (Scala/JSON), NOT `product_id`
35
+ - `item_data`, `ItemDataDataset`, NOT `product_data`
36
+ - Exception: LakeFS repo `rawproducts` uses "products" (legacy)
37
+
38
+ ### Rule 3: "Encodings" Over "Embeddings"
39
+ This codebase prefers `encodings` for vector representations:
40
+ - `item_encodings`, `query_encodings`, `image_encodings`
41
+ - `_image_encodings_label = "imageEncodings"` (class-level on DataLoader)
42
+ - `compute_item_encodings()`, `compute_query_encoding()`
43
+ - `items-encoding` (strategy ID), `item-encoding-export`
44
+ - Exception: some newer code uses `embeddings` - prefer `encodings` for consistency
45
+
46
+ ### Rule 4: Parquet Column Labels — Class-Level, Not Module-Level
47
+
48
+ Parquet column name labels MUST be defined as **class-level attributes on the Dataset or DataLoader class** that reads the parquet data. Do NOT define them as standalone module-level SCREAMING_SNAKE constants.
49
+
50
+ **Correct — class-level on Dataset (dominant pattern across 7+ repos):**
51
+ ```python
52
+ class ItemTextDataDataset(TorchDataset):
53
+     _tenant_id_label = "tenantId"
54
+     _item_kind_label = "itemKind"
55
+     _item_id_label = "itemId"
56
+     _description_label = "description"
57
+     _item_seo_key_phrases_label = "itemSEOKeyPhrasesOpt"
58
+ ```
59
+
60
+ **Wrong — standalone module-level constants:**
61
+ ```python
62
+ # DO NOT do this in ML repos
63
+ ITEM_ID_LABEL = "itemId"
64
+ TENANT_ID_LABEL = "tenantId"
65
+ ```
66
+
67
+ **Naming format:** `_<field_name>_label = "<camelCaseParquetColumn>"`
68
+ - Prefix with `_` (class-private)
69
+ - Attribute name in snake_case
70
+ - Value in camelCase (matching parquet column produced by Scala)
71
+ - Value must NEVER be snake_case (`"item_id"` is wrong, `"itemId"` is correct)
72
+
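The naming format above lends itself to a mechanical check. The sketch below is hypothetical (the helper `find_label_violations` and the two sample classes are not part of any toolbox): it inspects the attributes defined directly on a class, and reports label attributes whose name does not follow `_<field_name>_label` or whose value is not camelCase. Inherited labels and module-level constants are outside its scope.

```python
import re

# Assumed patterns derived from the Rule 4 naming format.
CAMEL_CASE_VALUE = re.compile(r"[a-z][A-Za-z0-9]*")
LABEL_ATTR_NAME = re.compile(r"_[a-z][a-z0-9]*(_[a-z0-9]+)*_label")


def find_label_violations(cls: type) -> list[str]:
    """List Rule 4 violations among the label attributes defined on `cls`."""
    violations = []
    for attr, value in vars(cls).items():
        # Only string attributes whose name ends in "_label" are considered.
        if not attr.lower().endswith("_label") or not isinstance(value, str):
            continue
        if not LABEL_ATTR_NAME.fullmatch(attr):
            violations.append(f"{cls.__name__}.{attr}: name should look like _<field_name>_label")
        if not CAMEL_CASE_VALUE.fullmatch(value):
            violations.append(f"{cls.__name__}.{attr}: value {value!r} should be camelCase")
    return violations


class GoodDataset:  # hypothetical example class
    _item_id_label = "itemId"
    _tenant_id_label = "tenantId"


class BadDataset:  # hypothetical example class with a snake_case value
    _item_id_label = "item_id"


print(find_label_violations(GoodDataset))
print(find_label_violations(BadDataset))
```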
73
+ **Inheritance:** Toolbox base classes define shared labels, ML repos inherit or extend:
74
+ ```python
75
+ # item-toolbox: base class defines common labels
76
+ class ItemDataDataLoader(SequenceExampleFeatureDataLoader, ABC):
77
+     _tenant_id_label = "tenantId"
78
+     _item_kind_label = "itemKind"
79
+     _item_id_label = "itemId"
80
+     _locale_label = "locale"
81
+     _image_encodings_label = "imageEncodings"
82
+
83
+ # algo.tagging-ml: extends with domain-specific labels
84
+ class TransformerTaggingDataLoader(SequenceExampleFeatureDataLoader):
85
+     _item_kind_label = "itemKind"
86
+     _targets_label = "targets"
87
+ ```
88
+
89
+ **Cross-referencing:** When another class needs a label value owned by a Dataset class, reference the class attribute directly rather than duplicating:
90
+ ```python
91
+ # Prefer: reference the owning class
92
+ column = ItemTextDataDataset._item_id_label
93
+
94
+ # Avoid: duplicating the string literal
95
+ column = "itemId"
96
+ ```
97
+
98
+ **Dataclass models:** When a dataclass needs labels for `to_dict()`, use `ClassVar` attributes with camelCase values matching the Dataset class:
99
+ ```python
100
+ @dataclass
101
+ class ItemTextData:
102
+     _tenant_id_label: ClassVar[str] = "tenantId" # NOT "tenant_id"
103
+     _item_id_label: ClassVar[str] = "itemId" # NOT "item_id"
104
+ ```
105
+
106
+ ### Rule 5: Config Class Suffixes
107
+ | Suffix | Purpose |
108
+ |---|---|
109
+ | `*Config` | General configuration |
110
+ | `*ModelConfig` | Model hyperparameters |
111
+ | `*LearningAlgorithmConfig` | Training algorithm config |
112
+ | `*EvaluatorConfig` | Evaluation settings |
113
+ | `*BatchConfig` | Kubeflow batch job config |
114
+ | `*PreProcessingPipelineConfig` | Data preprocessing config |
115
+ | `*DatasetMetaInfo` | Dataset structure metadata |
116
+
117
+ ### Rule 6: Parameter Constants
118
+ Pattern: `{NAME}_PARAM = *Param(...)`
119
+ - Optional params: `{NAME}_OPT_PARAM`
120
+ ```python
121
+ BATCH_SIZE_PARAM = IntParam(...)
122
+ LEARNING_RATE_PARAM = FloatParam(...)
123
+ NB_OBSERVATION_BY_TENANT_OPT_PARAM = OptionalIntParam(...) # _OPT_ for optional
124
+ ```
125
+
126
+ ### Rule 7: Dataset Class Naming
127
+ Pattern: `{Domain}{DataType}Dataset`
128
+ - `QueryDataDataset`, NOT `QueryDataset`
129
+ - `ItemTextDataDataset`, NOT `ItemTextDataset`
130
+ - `ItemCutoutDataDataset`
131
+
132
+ MetaInfo pattern: `{Domain}DatasetMetaInfo`
133
+ Config pattern: `{Domain}DatasetConfig`
134
+
135
+ ### Rule 8: Algorithm Name Constants
136
+ Pattern: `{ALGO}_ALGO_NAME = "{kebab-case-value}"`
137
+ ```python
138
+ FM_ALGO_NAME = "fm"
139
+ SEARCH_ALGO_NAME = "search"
140
+ CLIP_ALGO_NAME = "clip"
141
+ SHOP_THE_LOOK_ALGO_NAME = "shop-the-look"
142
+ ```
143
+
144
+ ### Rule 9: Service Model Names
145
+ Pattern: `{TYPE}_SERVICE_MODEL_NAME = "{type}-service-model"`
146
+ ```python
147
+ ITEM_ENCODING_SERVICE_MODEL_NAME = "item-encoding-service-model"
148
+ QUERY_ENCODING_SERVICE_MODEL_NAME = "query-encoding-service-model"
149
+ IMAGE_ENCODING_SERVICE_MODEL_NAME = "image-encoding-service-model"
150
+ ```
151
+
152
+ ### Rule 10: Strategy IDs
153
+ Format: `kebab-case`, structured as `{domain}-{operation}`
154
+ - Learning: `semantic-search-learning`, `clip-learning`, `transformer-tagging-learning`
155
+ - Preprocessing: `query_dataset_preprocessing`, `item_data_dataset_preprocessing` (note: legacy uses underscores)
156
+ - Encoding: `item-images-single-encoding`, `item-cutout-image-encoding`
157
+ - Export: `item-encoding-export`, `ftp-exporter`
158
+
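A reviewer could encode Rule 10 roughly as follows. This is a sketch under two assumptions that the rule text only implies: a kebab-case strategy ID has at least two segments (`{domain}-{operation}`), and the underscore form is tolerated only for legacy IDs ending in `_preprocessing`. The helper name `is_valid_strategy_id` is hypothetical.

```python
import re

# Assumption: kebab-case IDs have at least two hyphen-separated segments.
KEBAB_STRATEGY_ID = re.compile(r"[a-z0-9]+(-[a-z0-9]+)+")
# Assumption: underscores are accepted only for legacy *_preprocessing IDs.
LEGACY_PREPROCESSING_ID = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*_preprocessing")


def is_valid_strategy_id(strategy_id: str) -> bool:
    """Accept kebab-case IDs and legacy underscore preprocessing IDs."""
    return bool(
        KEBAB_STRATEGY_ID.fullmatch(strategy_id)
        or LEGACY_PREPROCESSING_ID.fullmatch(strategy_id)
    )


for candidate in ("clip-learning", "query_dataset_preprocessing", "clip_learning"):
    print(candidate, is_valid_strategy_id(candidate))
```

Mixed separators such as `item-data_preprocessing` match neither pattern and are rejected.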
159
+ ### Rule 11: GCS Path Naming
160
+ Segments use kebab-case:
161
+ ```
162
+ gs://xo-{env}-ai-eu-eb-algo-models/{algo}/{step}/{version}/{predictor_id}/{timestamp}/{artifact-type}
163
+ ```
164
+ Artifact types: `query-training-dataset-preprocessing-dataframe`, `item-data-dataset-preprocessing-dataframe`, `multi-image-item-encodings`
165
+
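The path template above can be filled in with a small helper. The function `build_artifact_path` and the argument values in the usage line (`"v1"`, `"pid-123"`, `"20240101"`) are illustrative assumptions, not real identifiers. Only the artifact type is validated as kebab-case here, because legacy step names such as `query_dataset_preprocessing` still use underscores.

```python
import re

KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def build_artifact_path(
    env: str,
    algo: str,
    step: str,
    version: str,
    predictor_id: str,
    timestamp: str,
    artifact_type: str,
) -> str:
    """Assemble a GCS artifact path following the Rule 11 template."""
    if not KEBAB_CASE.fullmatch(artifact_type):
        raise ValueError(f"artifact type must be kebab-case: {artifact_type!r}")
    return (
        f"gs://xo-{env}-ai-eu-eb-algo-models/"
        f"{algo}/{step}/{version}/{predictor_id}/{timestamp}/{artifact_type}"
    )


# Placeholder values, for illustration only.
print(
    build_artifact_path(
        "dev", "search", "query_dataset_preprocessing", "v1", "pid-123", "20240101",
        "query-training-dataset-preprocessing-dataframe",
    )
)
```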
166
+ ### Rule 12: Pipeline Task Naming
167
+ Pattern: `create_{step_name}_task`
168
+ ```python
169
+ create_query_dataset_generation_task
170
+ create_item_data_dataset_preprocessing_task
171
+ create_items_encoding_task
172
+ create_learning_task
173
+ create_evaluation_task
174
+ ```
175
+
176
+ ### Rule 13: Kubeflow Config Arguments
177
+ Always camelCase in YAML arguments block:
178
+ ```yaml
179
+ arguments:
180
+   preprocessingRootPath: "gs://..."
181
+   modelRootPath: "gs://..."
182
+   itemDataPreprocessingRootPath: "gs://..."
183
+ ```
184
+
185
+ ### Rule 14: Docker Image Names
186
+ | Pattern | Examples |
187
+ |---|---|
188
+ | `algo-{algo}-batch` | `algo-fm-batch`, `algo-stl-batch` |
189
+ | `algo-{algo}-dataproc-batch` | `algo-nlp-dataproc-batch`, `algo-search-dataproc-batch` |
190
+ | `{algo}` (Python ML) | `semantic-search`, `tagging`, `image-encoder` |
191
+ | `ebap-{service}` | `ebap-ftp-exporter`, `ebap-fhr-exporter` |
192
+
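The four image-name patterns in the table can be told apart with ordered regular expressions. The classifier below is a hypothetical sketch: the kind labels and the `classify_docker_image` helper are invented for illustration, and the most specific pattern is listed first so that `algo-*-dataproc-batch` is not reported as a plain batch image.

```python
import re
from typing import Optional

_SEGMENTS = r"[a-z0-9]+(-[a-z0-9]+)*"

# Ordered from most to least specific; dicts preserve insertion order.
DOCKER_IMAGE_PATTERNS = {
    "dataproc-batch": re.compile(rf"algo-{_SEGMENTS}-dataproc-batch"),
    "batch": re.compile(rf"algo-{_SEGMENTS}-batch"),
    "ebap-service": re.compile(rf"ebap-{_SEGMENTS}"),
    "python-ml": re.compile(_SEGMENTS),
}


def classify_docker_image(name: str) -> Optional[str]:
    """Return the first matching pattern kind, or None if no pattern fits."""
    for kind, pattern in DOCKER_IMAGE_PATTERNS.items():
        if pattern.fullmatch(name):
            return kind
    return None


for image in ("algo-fm-batch", "algo-nlp-dataproc-batch", "ebap-ftp-exporter", "Image_Encoder"):
    print(image, classify_docker_image(image))
```

Because the Python ML pattern is plain kebab-case, it acts as a catch-all; a stricter check would compare against the list of known algorithm names.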
193
+ ### Rule 15: LakeFS Repository Names
194
+ Lowercase, no separators: `rawproducts`, `rawanalyticsincremental`, `mappedanalyticsincremental`
195
+
196
+ ### Rule 16: Batch Config File Location
197
+ Always at `config/batch.py` or `config/batches.py` within an ML repo.
198
+
199
+ ### Rule 17: Image Type Column Names
200
+ Parquet columns for image types: `stdImages`, `cropImages`, `cutoutImages`, `topTotalImages`, `otherImages`
201
+
202
+ ### Rule 18: KPI Template Names
203
+ Pattern: `*_KPI_TEMPLATE_NAME = "batch-*-kpi"`
204
+ ```python
205
+ TENANT_ATTRIBUTION_KPI_TEMPLATE_NAME = "batch-tenant-attribution-kpi"
206
+ ```
207
+
208
+ ## Reference Files
209
+
210
+ - **[repo-dependency-graph.md](references/repo-dependency-graph.md)** - Read when reviewing cross-repo dependencies, Docker image references, or understanding the library hierarchy (earlybirds_commons -> toolboxes -> algo repos)
211
+ - **[dataset-naming.md](references/dataset-naming.md)** - Read when reviewing dataset class names, parquet column names, GCS paths, LakeFS repos, or pipeline preprocessing step names
212
+ - **[domain-vocabulary.md](references/domain-vocabulary.md)** - Read when reviewing specific variable names, checking ID patterns, or looking up the canonical constant name for a domain concept
213
+
214
+ ## Common Violations to Flag
215
+
216
+ 1. Using `product` instead of `item` in ML code
217
+ 2. Using `embeddings` where `encodings` is the convention
218
+ 3. snake_case in JSON config keys (should be camelCase)
219
+ 4. camelCase in Python variables (should be snake_case)
220
+ 5. **Module-level SCREAMING_SNAKE label constants** (`ITEM_ID_LABEL = "itemId"`) — should be class-level `_item_id_label = "itemId"` on the Dataset/DataLoader class
221
+ 6. **snake_case values in label constants** (`_item_id_label = "item_id"`) — parquet column values must be camelCase (`"itemId"`)
222
+ 7. **Duplicated label string literals** — reference the owning Dataset class attribute instead of repeating `"itemId"` in multiple places
223
+ 8. **Dataclass labels diverging from Dataset labels** — `models.py` ClassVar labels must use the same camelCase values as `datasets.py` class attributes
224
+ 9. Config class missing standard suffix (`Config`, `BatchConfig`, etc.)
225
+ 10. Parameter constant missing `_PARAM` suffix
226
+ 11. Optional parameter missing `_OPT_` in name
227
+ 12. Strategy ID using underscores instead of kebab-case (or vice versa for legacy preprocessing)
228
+ 13. GCS path segments using camelCase instead of kebab-case
229
+ 14. Algorithm name not matching the canonical registry
230
+ 15. Docker image name not following `algo-{name}-batch` or `{name}` pattern
@@ -0,0 +1,190 @@
1
+ # Dataset Naming Conventions
2
+
3
+ ## Table of Contents
4
+ - [Dataset Class Names](#dataset-class-names)
5
+ - [Dataset MetaInfo Classes](#dataset-metainfo-classes)
6
+ - [Parquet Column Names](#parquet-column-names)
7
+ - [GCS Path Patterns](#gcs-path-patterns)
8
+ - [LakeFS Repository Names](#lakefs-repository-names)
9
+ - [Data Split Names](#data-split-names)
10
+ - [Pipeline Dataset Step Names](#pipeline-dataset-step-names)
11
+
12
+ ## Dataset Class Names
13
+
14
+ ### Naming Pattern
15
+ `{Domain}{DataType}Dataset` - always suffixed with `Dataset`
16
+
17
+ ### PyTorch Datasets (inherit from TorchDataset in pytorch-toolbox)
18
+ | Class Name | Repo | Description |
19
+ |---|---|---|
20
+ | `QueryDataDataset` | algo.semantic-search-ml, algo.semantic-search-bge-m3-ml | Query text data |
21
+ | `ItemTextDataDataset` | algo.semantic-search-ml, algo.semantic-search-bge-m3-ml | Item text/descriptions |
22
+ | `QueryItemTextDataDataset` | algo.semantic-search-ml | Combined query+item text |
23
+ | `ItemDataDataset` | algo.image-generative-tagging | General item data |
24
+ | `ItemCutoutDataDataset` | algo.image-generative-tagging | Item cutout images |
25
+ | `EnrichedItemDataDataset` | algo.image-generative-tagging | Enriched item data |
26
+ | `TorchDataset` | pytorch-toolbox | Abstract base class |
27
+
28
+ ### TensorFlow Datasets (eb_tensorflow patterns)
29
+ Dataset handling done through config-driven loaders, not explicit Dataset subclasses.
30
+
31
+ ### Config Classes
32
+ Pattern: `{Domain}DatasetConfig`
33
+ - `TorchDatasetConfig` (pytorch-toolbox)
34
+ - `ItemDataDatasetConfig` (algo.image-generative-tagging)
35
+ - `ItemCutoutDataDatasetConfig` (algo.image-generative-tagging)
36
+ - `SemanticSearchSamplingConfig` (algo.semantic-search-ml)
37
+
38
+ ## Dataset MetaInfo Classes
39
+
40
+ Pattern: `{Domain}DatasetMetaInfo` - metadata about dataset structure
41
+
42
+ | Class Name | Repo |
43
+ |---|---|
44
+ | `UserIntentDatasetMetaInfo` | algo.user-intent-ml |
45
+ | `ActivitiesDatasetMetaInfo` | algo.gpt-ml |
46
+ | `ItemDataDatasetMetaInfo` | item-toolbox |
47
+ | `ImageSegmentationDatasetMetaInfo` | algo.segmentation |
48
+ | `ImageBoundingBoxesDatasetMetaInfo` | algo.object-detection |
49
+ | `TextEncoderDatasetMetaInfo` | algo.text-encoder-ml |
50
+
51
+ ## Parquet Column Names
52
+
53
+ **CRITICAL**: Column names in parquet files use **camelCase** (matching Scala/JSON conventions).
54
+
55
+ ### Query Data Columns
56
+ | Column Name | Type | Description |
57
+ |---|---|---|
58
+ | `tenantId` | String | Tenant identifier |
59
+ | `query` | String | Search query text |
60
+ | `sortedTenantItemKeys` | List | Sorted item keys for the query |
61
+ | `sortedTenantItemIdLocales` | List | Sorted item IDs with locales |
62
+ | `sortedNbUniqueSearches` | List[Int] | Number of unique searches per item |
63
+
64
+ ### Item Data Columns
65
+ | Column Name | Type | Description |
66
+ |---|---|---|
67
+ | `tenantId` | String | Tenant identifier |
68
+ | `itemKind` | String | Item kind/type |
69
+ | `itemId` | String | Item identifier |
70
+ | `description` | String | Item description |
71
+ | `shortDescription` | String | Short description |
72
+ | `namedAttributes` | Map | Named attribute key-value pairs |
73
+ | `itemSEOKeyPhrasesOpt` | Optional[String] | SEO keyphrases |
74
+ | `locale` | String | Locale code |
75
+ | `variantId` | String | Variant identifier |
76
+ | `categories` | List[String] | Category hierarchy |
77
+ | `attributes` | Map | Item attributes |
78
+
79
+ ### Image Data Columns
80
+ | Column Name | Type | Description |
81
+ |---|---|---|
82
+ | `stdImages` | List | Standard product images |
83
+ | `cropImages` | List | Cropped images |
84
+ | `cutoutImages` | List | Cutout/transparent background images |
85
+ | `topTotalImages` | List | Top/total images |
86
+ | `otherImages` | List | Other image types |
87
+ | `imageEncodings` | Tensor | Image embedding vectors |
88
+ | `closeupImageEncodings` | Tensor | Closeup image embeddings |
89
+
90
+ ### Model Output Columns
91
+ | Column Name | Type | Context |
92
+ |---|---|---|
93
+ | `item_id` / `itemId` | String | Python snake_case / Parquet camelCase |
94
+ | `tenant_id` / `tenantId` | String | Python snake_case / Parquet camelCase |
95
+ | `image_id` / `imageId` | String | Python snake_case / Parquet camelCase |
96
+ | `is_outfit` | Boolean | Outfit detection flag |
97
+ | `is_facing` | Boolean | Facing detection flag |
98
+
99
+ ## GCS Path Patterns
100
+
101
+ ### Bucket Naming
102
+ `gs://xo-{environment}-ai-eu-eb-algo-models/` - Primary model/data bucket
103
+ `gs://xo-{environment}-ai-eu-mlflow/` - MLflow artifacts
104
+ `gs://xo-{environment}-ai-eu-eb-dumps/` - Data dumps
105
+ `gs://xo-{environment}-ai-eu-eb-spark-tmp/` - Spark temp data
106
+
107
+ ### Dataset Path Structure
108
+ ```
109
+ gs://xo-{env}-ai-eu-eb-algo-models/{algo}/{step}/{version}/{predictor_id}/{timestamp}/{artifact_type}
110
+ ```
111
+
112
+ ### Common Dataset Paths
113
+ | Path Pattern | Dataset Type |
114
+ |---|---|
115
+ | `search/query_dataset_preprocessing/{v}/{pid}/{ts}/query-training-dataset-preprocessing-dataframe` | Query training data |
116
+ | `search/search_item_data_dataset_preprocessing/{v}/{pid}/{ts}/item-data-dataset-preprocessing-dataframe` | Item data for search |
117
+ | `clip/clip_item_data_dataset_preprocessing/...` | CLIP item data |
118
+ | `clip/item-image-cutout-images-preprocessing/...` | Cutout images for CLIP |
119
+ | `gpt/preprocessing/{v}/{pid}/{ts}` | GPT preprocessing |
120
+ | `gpt/gpt_item_data_dataset_preprocessing/{v}/{pid}/{ts}` | GPT item data |
121
+ | `dev/als/preprocessing/{pid}/{ts}` | ALS preprocessing |
122
+ | `dev/image-encoder/image/{v}/{pid}/{ts}/multi-image-item-encodings` | Image encodings |
123
+ | `dev/nlp/character-tokenizer/{v}/{pid}` | Character tokenizer |
124
+ | `dev/nlp/word-tokenizer/{v}/{pid}/queries-tokenizer` | Word tokenizer |
125
+
126
+ ### Artifact Type Names (leaf directory)
127
+ - `query-training-dataset-preprocessing-dataframe`
128
+ - `item-data-dataset-preprocessing-dataframe`
129
+ - `multi-image-item-encodings`
130
+ - `queries-tokenizer`
131
+ - `item-image-outfit-classification`
132
+ - `fine-tuned-yolo-bounding-boxes`
133
+
134
+ ## LakeFS Repository Names
135
+
136
+ | Constant | Value | Content |
137
+ |---|---|---|
138
+ | `RAW_PRODUCTS_REPO_NAME` | `"rawproducts"` | Raw product catalog data |
139
+ | `RAW_ANALYTICS_INCREMENTAL_REPO_NAME` | `"rawanalyticsincremental"` | Raw analytics events |
140
+ | `MAPPED_ANALYTICS_INCREMENTAL_REPO_NAME` | `"mappedanalyticsincremental"` | Mapped analytics data |
141
+ | `MAPPED_PRODUCTS_REPO_NAME` | (from earlybirds_commons) | Mapped product data |
142
+
143
+ ### LakeFS Path Pattern
144
+ ```
145
+ lakefs://{REPO_NAME}/{BRANCH_OR_VERSION}/{tenant_uid}/...
146
+ lakefs://rawproducts/default/{tenant_uid}/products
147
+ lakefs://rawanalyticsincremental/default/{tenant_uid}/...
148
+ lakefs://mappedanalyticsincremental/{version}/{tenant_uid}/...
149
+ ```
150
+
151
+ Default branch: `DEFAULT_BRANCH_NAME = "default"`
152
+
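The LakeFS path pattern can be sketched as a small URI builder. The two constants mirror the names and values listed above, while `lakefs_uri` and the tenant value `"tenant-42"` are hypothetical and used only for illustration.

```python
DEFAULT_BRANCH_NAME = "default"
RAW_PRODUCTS_REPO_NAME = "rawproducts"


def lakefs_uri(repo_name: str, tenant_uid: str, *parts: str, ref: str = DEFAULT_BRANCH_NAME) -> str:
    """Build lakefs://{repo}/{branch_or_version}/{tenant_uid}/... from its segments."""
    segments = [repo_name, ref, tenant_uid, *parts]
    return "lakefs://" + "/".join(segments)


# "tenant-42" is a placeholder tenant UID.
print(lakefs_uri(RAW_PRODUCTS_REPO_NAME, "tenant-42", "products"))
# lakefs://rawproducts/default/tenant-42/products
```

Passing `ref` explicitly covers the versioned form used by `mappedanalyticsincremental`.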
153
+ ## Data Split Names
154
+
155
+ | Name | Usage Context |
156
+ |---|---|
157
+ | `training` / `train` | Training split |
158
+ | `validation` / `val` | Validation split |
159
+ | `test` | Test split |
160
+ | `preprocessing` | Intermediate preprocessing outputs |
161
+
162
+ Methods: `_split_dataset()`, `_retrieve_training_and_evaluation_datasets()`
163
+
164
+ ## Pipeline Dataset Step Names
165
+
166
+ ### Preprocessing Steps (strategy_id values)
167
+ | Strategy ID | Description |
168
+ |---|---|
169
+ | `query_dataset_preprocessing` | Query dataset generation |
170
+ | `item_data_dataset_preprocessing` | Item data preprocessing |
171
+ | `search_item_data_dataset_preprocessing` | Search-specific item data |
172
+ | `clip_item_data_dataset_preprocessing` | CLIP-specific item data |
173
+ | `gpt_item_data_dataset_preprocessing` | GPT-specific item data |
174
+ | `character_tokenizer_preprocessing` | NLP character tokenizer |
175
+ | `word_tokenizer_preprocessing` | NLP word tokenizer |
176
+ | `item-image-cutout-images-preprocessing` | Image cutout preprocessing |
177
+ | `item-images-preprocessing` | General image preprocessing |
178
+ | `item-images-single-encoding` | Single image encoding |
179
+ | `cutout-item-images-preprocessing` | Cutout image processing |
180
+ | `visual-search-preprocessing` | Visual search data prep |
181
+ | `global_preprocessing` | FM global preprocessing |
182
+ | `complementarity_preprocessing` | FM complementarity preprocessing |
183
+
184
+ ### Data Formats
185
+ | Format | Usage |
186
+ |---|---|
187
+ | `.parquet` | Standard tabular data (preferred) |
188
+ | `.ipc` | Apache Arrow IPC (cached parquet) |
189
+ | `.tfrecord` | Legacy TensorFlow datasets |
190
+ | `.csv` / `.tsv` | Some data exports (`csv_separator="\t\t"`) |