npm - @groupby/ai-dev - Versions diffs - 0.5.5 → 0.5.8 - Mend

@groupby/ai-dev 0.5.5 → 0.5.8

Files changed (43)

package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/domain-vocabulary.md ADDED

@@ -0,0 +1,447 @@
+# Domain Vocabulary & Variable Naming Conventions
+## Table of Contents
+- [Language-Specific Casing Rules](#language-casing)
+- [E-commerce Domain Terms](#ecommerce-terms)
+- [Search & Recommendation Terms](#search-terms)
+- [ML Model Terms](#ml-terms)
+- [Pipeline & Orchestration Terms](#pipeline-terms)
+- [Image Processing Terms](#image-terms)
+- [User Behavior Terms](#user-terms)
+- [Configuration Naming Patterns](#config-patterns)
+- [Identifier Patterns](#id-patterns)
+- [Algorithm Name Registry](#algorithm-names)
+- [Strategy ID Registry](#strategy-ids)
+- [Service Model Name Registry](#service-model-names)
+- [KPI Template Names](#kpi-templates)
+- [Launcher Class Naming](#launcher-classes)
+## Language-Specific Casing Rules {#language-casing}
+| Context | Convention | Example |
+|---|---|---|
+| Python variables | snake_case | `item_id`, `query_data` |
+| Python constants | SCREAMING_SNAKE_CASE | `ITEM_ID_LABEL`, `FM_ALGO_NAME` |
+| Python classes | PascalCase | `QueryDataDataset`, `SearchModelConfig` |
+| Scala variables | camelCase | `itemId`, `tenantId` |
+| Scala classes | PascalCase | `ItemDataDatasetRecord`, `TenantKey` |
+| JSON/YAML keys | camelCase | `"itemDataDataLoaderConfig"`, `"modelConfig"` |
+| Parquet columns | camelCase | `"tenantId"`, `"itemKind"`, `"sortedNbUniqueSearches"` |
+| GCS paths | kebab-case | `item-data-dataset-preprocessing-dataframe` |
+| Docker images | kebab-case | `algo-fm-batch`, `semantic-search` |
+| Strategy IDs | kebab-case | `semantic-search-learning`, `item-images-single-encoding` |
+| Environment vars | SCREAMING_SNAKE_CASE | `POSTGRES_KPI_DB_HOST` |
+| Algorithm names (constants) | kebab-case values | `FM_ALGO_NAME = "fm"`, `CONTENT_BASED_ALGO_NAME = "content-based"` |
+## E-commerce Domain Terms {#ecommerce-terms}
+### Core Entity Names
+| Canonical Name | Python | Scala | JSON/Parquet | Notes |
+|---|---|---|---|---|
+| Item | `item` | `Item` | `item` | Primary product entity (NOT "product" in ML code) |
+| Item ID | `item_id` | `itemId` / `ItemId` | `"itemId"` | Scala has `case class ItemId(kind, value)` |
+| Item Kind | `item_kind` | `itemKind` / `ItemKind` | `"itemKind"` | Item type classifier |
+| Tenant | `tenant` | `Tenant` | `tenant` | Customer/account |
+| Tenant ID | `tenant_id` | `tenantId` / `TenantId` | `"tenantId"` | |
+| Variant | `variant` | `Variant` | `variant` | Product variant |
+| Variant ID | `variant_id` | `variantId` | `"variantId"` | |
+| Category | `categories` | `categories` | `"categories"` | Always plural |
+| Attribute | `attributes`, `named_attributes` | `attributes`, `namedAttributes` | `"attributes"`, `"namedAttributes"` | |
+| Locale | `locale` | `locale` | `"locale"` | Language/region code |
+### Label Constants (Python)
+```python
+ITEM_ID_LABEL = "itemId"
+ITEM_KIND_LABEL = "itemKind"
+TENANT_ID_LABEL = "tenantId"
+CATEGORIES_LABEL = "categories"
+ATTRIBUTES_LABEL = "attributes"
+LOCALE_LABEL = "locale"
+SEO_KEYPHRASES_OPT_LABEL = "itemSEOKeyPhrasesOpt"
+```
+**Pattern**: `{ENTITY}_{FIELD}_LABEL = "{camelCaseFieldName}"` maps a Python constant to its Parquet/JSON column name.
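+A minimal sketch of how a reviewer might check the pattern above. The helper names `expected_label_value` and `label_matches_pattern` are hypothetical and not part of the codebase; the sketch assumes the value is simply the constant's stem converted to camelCase. Constants such as `SEO_KEYPHRASES_OPT_LABEL` do not follow that derivation, so a mismatch is a prompt for review rather than proof of an error.
+```python
+def expected_label_value(constant_name: str) -> str:
+    """Derive the camelCase value implied by a *_LABEL constant name (heuristic)."""
+    suffix = "_LABEL"
+    # Drop the trailing _LABEL, if present, to get the entity/field stem.
+    stem = constant_name[: -len(suffix)] if constant_name.endswith(suffix) else constant_name
+    head, *tail = stem.lower().split("_")
+    # First word stays lowercase; each following word is capitalized.
+    return head + "".join(part.capitalize() for part in tail)
+
+
+def label_matches_pattern(constant_name: str, value: str) -> bool:
+    """Return True when the value equals the camelCase form of the constant's stem."""
+    return expected_label_value(constant_name) == value
+```
+For example, `label_matches_pattern("ITEM_ID_LABEL", "itemId")` holds, while `SEO_KEYPHRASES_OPT_LABEL = "itemSEOKeyPhrasesOpt"` is reported as a deviation.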
+## Search & Recommendation Terms {#search-terms}
+### Query-Related
+| Canonical Name | Constant | Value |
+|---|---|---|
+| Query | `QUERY_LABEL` | `"query"` |
+| Query Token Indices | `QUERY_TOKEN_INDICES_LABEL` | `"queryTokenIndices"` |
+| Query Encoding Service | `QUERY_ENCODING_SERVICE_MODEL_NAME` | `"query-encoding-service-model"` |
+| Sorted Tenant Item IDs | `TENANT_ITEM_IDS_LABEL` | `"sortedTenantItemIds"` |
+| Sorted Tenant Item ID Locales | `TENANT_ITEM_ID_LOCALES_LABEL` | `"sortedTenantItemIdLocales"` |
+| Unique Searches Count | `NB_UNIQUE_SEARCHES_LABEL` | `"sortedNbUniqueSearches"` |
+### Embedding/Vector Terms
+| Term | Usage |
+|---|---|
+| `embeddings` / `encodings` | Used interchangeably but `encodings` is more common in this codebase |
+| `item_embeddings` | Item vector representations |
+| `query_encodings` | Query vector representations |
+| `input_ids` | Tokenizer output IDs |
+| `attention_mask` | Transformer attention mask |
+| `query_item_similarities` | Similarity scores between queries and items |
+| `embedding_dimension` | Vector dimensionality |
+### Model Types
+| Constant | Value |
+|---|---|
+| `ONNX_MODEL_TYPE` | `"onnx"` |
+| `PYTORCH_MODEL_TYPE` | `"pytorch"` |
+| `MODEL_TYPE_KWARG` | `"model_type"` |
+## ML Model Terms {#ml-terms}
+### Config Class Naming (VERY CONSISTENT)
+Pattern: `{Domain}{Purpose}Config`
+| Pattern | Examples |
+|---|---|
+| `*ModelConfig` | `SearchModelConfig`, `SemanticSearchModelConfig` |
+| `*LearningAlgorithmConfig` | `SemanticSearchLearningAlgorithmConfig`, `LlmSearchLearningAlgorithmConfig` |
+| `*EvaluatorConfig` | `SearchEvaluatorConfig` |
+| `*PreProcessingPipelineConfig` | `SearchPreProcessingPipelineConfig` |
+| `*BatchConfig` | `SearchLearningBatchConfig`, `TransformerTaggingLearningBatchConfig` |
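+As an illustration of the `{Domain}{Purpose}Config` pattern, the sketch below maps a class name to the suffix family it belongs to. The function `config_family` and the tuple `CONFIG_SUFFIXES` are hypothetical names introduced here; only the suffixes themselves come from the table above.
+```python
+from typing import Optional
+
+# Suffix families copied from the table above.
+CONFIG_SUFFIXES = (
+    "PreProcessingPipelineConfig",
+    "LearningAlgorithmConfig",
+    "EvaluatorConfig",
+    "BatchConfig",
+    "ModelConfig",
+)
+
+
+def config_family(class_name: str) -> Optional[str]:
+    """Return the suffix family of a config class name, or None if it fits none."""
+    for suffix in CONFIG_SUFFIXES:
+        # Require a non-empty, capitalized domain prefix before the suffix.
+        has_prefix = len(class_name) > len(suffix)
+        if class_name.endswith(suffix) and has_prefix and class_name[0].isupper():
+            return suffix
+    return None
+```
+A name such as `SearchLearningBatchConfig` resolves to `BatchConfig`, whereas a bare `ModelConfig` or a snake_case name returns `None`.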
+### Parameter Constants
+Pattern: `{NAME}_PARAM = *Param(...)`
+```python
+# Common params
+BATCH_SIZE_PARAM
+LEARNING_RATE_PARAM
+WARMUP_STEPS_PARAM
+OUTPUT_DIR_PARAM
+LOGGING_DIR_PARAM
+PER_DEVICE_TRAIN_BATCH_SIZE_PARAM
+GRADIENT_ACCUMULATION_STEPS_PARAM
+DATALOADER_NUM_WORKERS_PARAM
+# Optional params carry an _OPT marker before the _PARAM suffix
+NB_OBSERVATION_BY_TENANT_OPT_PARAM
+TF32_OPT_PARAM
+EVAL_STEPS_OPT_PARAM
+DATALOADER_PERSISTENT_WORKERS_OPT_PARAM
+# Config-level params
+MODEL_CONFIG_PARAM
+LEARNING_CONFIG_PARAM
+TRAINING_ARGUMENTS_CONFIG_PARAM
+ITEM_DATA_LOADER_CONFIG_PARAM
+EVALUATOR_CONFIG_PARAM
+PIPELINE_CONFIG_PARAM
+```
+### Model Registry
+| Constant | Value |
+|---|---|
+| `SEMANTIC_SEARCH_MODEL_NAME` | `"semantic-search-model-onnx"` |
+| `SEMANTIC_SEARCH_PYTORCH_MODEL_NAME` | `"semantic-search-model"` |
+| `SEMANTIC_SEARCH_QUANTIZED_MODEL_NAME` | `"semantic-search-model-quantized-onnx"` |
+### Predictor/Strategy (Scala Domain)
+```scala
+case class BatchParams(predictorId: String, ...)
+sealed trait Predictor extends WithLogPrefix
+object SubPredictor
+object AggregatorPredictor
+case class TenantKey(id: String, name: String, algorithmName: String)
+case class StandalonePredictorKey(id: String, name: String, algorithmName: String)
+```
+## Pipeline & Orchestration Terms {#pipeline-terms}
+### Pipeline Types
+| Pipeline Name | Description |
+|---|---|
+| `python_batch_pipeline` | Standard Python ML batch |
+| `scala_batch_pipeline` | Scala/Spark batch |
+| `spark_scala_batch_pipeline` | Spark Scala on Dataproc |
+| `large_python_batch_pipeline` | Large-scale Python batch |
+### Pipeline Task Naming
+Pattern: `create_{step_name}_task`
+Common steps:
+- `create_preprocessing_task`
+- `create_query_dataset_generation_task`
+- `create_item_data_dataset_preprocessing_task`
+- `create_items_encoding_task`
+- `create_learning_task`
+- `create_evaluation_task`
+- `create_ftp_exporter_task`
+- `create_fhr_exporter_task`
+### Config Parameters (YAML)
+| Parameter | Description |
+|---|---|
+| `predictor_id` | MongoDB ObjectId for predictor |
+| `strategy_id` | Strategy identifier (kebab-case) |
+| `image_name` | Docker image reference |
+| `pipeline_name` | Pipeline template name |
+| `version_name` | Config version (e.g., "0.1.269") |
+| `experiment_name` | Human-readable experiment name |
+| `algo_name` | Algorithm identifier |
+### Batch Config Arguments (camelCase)
+| Argument | Description |
+|---|---|
+| `preprocessingRootPath` | Root GCS path for preprocessing |
+| `modelRootPath` | Root GCS path for model artifacts |
+| `itemDataPreprocessingRootPath` | Item data preprocessing path |
+| `queryDatasetPreprocessingDirectoryPath` | Query dataset prep path |
+| `itemDataDatasetPreprocessingDirectoryPath` | Item dataset prep path |
+| `itemImageEncodingsDirectoryPath` | Image encoding path |
+| `outputRootPath` | Output directory root |
+| `yoloModelName` | YOLO model identifier |
+| `itemOutfitImagePathsListPath` | Outfit image paths |
+### Resource Config Keys
+| Key | Example Values |
+|---|---|
+| `cpu` | `"11"`, `"1000m"` |
+| `memory` | `"24G"`, `"2G"` |
+| `gpu` | `1`, `2` |
+| `gpu_vendor` | `"nvidia.com/gpu"` |
+| `gpu_accelerator_name` | `"nvidia-l4"` |
+| `java_memory` | `"8g"` |
+| `timeout_ms` / `timeout_s` | Timeout values |
+## Image Processing Terms {#image-terms}
+### Image Type Names
+| Name | Description |
+|---|---|
+| `stdImages` | Standard product images |
+| `cropImages` | Cropped images |
+| `cutoutImages` | Transparent background cutout images |
+| `topTotalImages` | Top/total view images |
+| `otherImages` | Miscellaneous images |
+### Image Processing Constants
+| Constant | Value |
+|---|---|
+| `IMAGE_ENCODINGS_LABEL` | `"imageEncodings"` |
+| `CLOSEUP_IMAGE_ENCODINGS_LABEL` | `"closeupImageEncodings"` |
+| `IMAGE_ENCODER_ALGORITHM_NAME` | `"image-encoder"` |
+| `IMAGE_ENCODING_SERVICE_MODEL_NAME` | `"image-encoding-service-model"` |
+### Cutout/Detection Functions
+- `get_item_image_cutout_bounding_boxes()`
+- `get_item_image_cutout_segmentation_masks()`
+- `item_image_cutout_bounding_box` (module path)
+- `item_image_cutout_segmentation_mask` (module path)
+## User Behavior Terms {#user-terms}
+### Core Identifiers
+| Python | Scala | JSON/Parquet |
+|---|---|---|
+| `user_id` | `userId` | `"userId"` |
+| `tenant_id` | `tenantId` | `"tenantId"` |
+| `session_id` | `sessionId` | `"sessionId"` |
+### Activity Models (Scala)
+- `Activity`, `ItemActivity`, `ProfileActivity`
+- `case class ItemActivity(itemId: ItemId, action: String, ...)`
+### Environment Variables
+Pattern: `POSTGRES_{SERVICE}_DB_{PARAM}`
+- `POSTGRES_KPI_DB_HOST`
+- `POSTGRES_SESSIONS_DB_PORT`
+- `POSTGRES_AI_INFERENCE_DATA_DB_DATABASE_NAME`
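+The `POSTGRES_{SERVICE}_DB_{PARAM}` pattern can be checked mechanically. The following is a sketch only; `parse_postgres_env_var` is a hypothetical helper, and it assumes both the service and the parameter consist of uppercase letters, digits, and underscores.
+```python
+import re
+from typing import Optional, Tuple
+
+# The service group is lazy, so the first "_DB_" separator splits service from parameter.
+POSTGRES_ENV_VAR = re.compile(r"POSTGRES_(?P<service>[A-Z0-9_]+?)_DB_(?P<param>[A-Z0-9_]+)")
+
+
+def parse_postgres_env_var(name: str) -> Optional[Tuple[str, str]]:
+    """Split a variable name into (service, param), or return None if it does not fit."""
+    match = POSTGRES_ENV_VAR.fullmatch(name)
+    if match is None:
+        return None
+    return match.group("service"), match.group("param")
+```
+Applied to the names listed above, `POSTGRES_AI_INFERENCE_DATA_DB_DATABASE_NAME` splits into the service `AI_INFERENCE_DATA` and the parameter `DATABASE_NAME`.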
+## Configuration Naming Patterns {#config-patterns}
+### Config File Locations
+Always in `config/batch.py` or `config/batches.py`:
+- `*LearningBatchConfig` class
+- `*EvaluationBatchConfig` class
+- Related parameter definitions
+### JSON Config Keys
+Always **camelCase**:
+```json
+{
+  "chunkSize": 100,
+  "nbParallelProcesses": 4,
+  "llmConfig": {...},
+  "itemDataDataLoaderConfig": {...},
+  "queryAugmentationBatchSize": 32,
+  "maxNumWorkers": 8,
+  "modelName": "...",
+  "temperature": 0.7,
+  "maxTokens": 512
+}
+```
+### Serialization Label Pattern
+Suffix: `*Label` for data field name constants
+```python
+TenantIdLabel = "tenantId"
+ItemsSamplePercentageLabel = "itemsSamplePercentage"
+LocalesFilterLabel = "localesFilter"
+UseTokenLanguageLabel = "useTokenLanguage"
+```
+## Identifier Patterns {#id-patterns}
+### ID Type Summary
+| Concept | Python | Scala | JSON |
+|---|---|---|---|
+| Item ID | `item_id` | `itemId` / `ItemId(kind, value)` | `"itemId"` |
+| Tenant ID | `tenant_id` | `tenantId` / `TenantId` | `"tenantId"` |
+| User ID | `user_id` | `userId` | `"userId"` |
+| Predictor ID | `predictor_id` | `predictorId` / `PredictorId` | `"predictorId"` |
+| Strategy ID | `strategy_id` | `strategyId` | `"strategyId"` |
+| Session ID | `session_id` | `sessionId` | `"sessionId"` |
+| Image ID | `image_id` | `imageId` | `"imageId"` |
+| Variant ID | `variant_id` | `variantId` | `"variantId"` |
+### Scala Key Classes
+```scala
+case class TenantKey(id: String, name: String, algorithmName: String)
+case class StandalonePredictorKey(id: String, name: String, algorithmName: String)
+case class SubPredictorKey(id: String, ...)
+case class ItemId(kind: String, value: String)
+// Default: DefaultItemId(s"item-$value")
+```
+## Algorithm Name Registry {#algorithm-names}
+Source: `attraqt-kubeflow-pipelines/kubeflow_pipelines/pipelines/utils/constants.py`
+| Constant | Value | Domain |
+|---|---|---|
+| `FM_ALGO_NAME` | `"fm"` | Factorization Machines |
+| `ALS_ALGO_NAME` | `"als"` | Alternating Least Squares |
+| `BASIC_ALGO_NAME` | `"basic"` | Basic scoring (popularity, trendiness) |
+| `CONTENT_BASED_ALGO_NAME` | `"content-based"` | Content-based filtering |
+| `GRAPH_ALGO_NAME` | `"graph"` | Graph algorithms (FP-growth) |
+| `NLP_ALGO_NAME` | `"nlp"` | NLP preprocessing (tokenization) |
+| `AUTOCOMPLETE_ALGO_NAME` | `"autocomplete"` | Query autocompletion |
+| `SEARCH_ALGO_NAME` | `"search"` | Semantic Search |
+| `TAGGING_ALGO_NAME` | `"tagging"` | Item Tagging |
+| `COMPUTER_VISION_ALGO_NAME` | `"computer-vision"` | Image encoding |
+| `CLIP_ALGO_NAME` | `"clip"` | CLIP Vision-Language |
+| `SAM_ALGO_NAME` | `"sam"` | Segment Anything |
+| `YOLO_ALGO_NAME` | `"yolo"` | Object Detection |
+| `GPT_ALGO_NAME` | `"gpt"` | GPT Models |
+| `SHOP_THE_LOOK_ALGO_NAME` | `"shop-the-look"` | Shop The Look |
+| `GIBBERISH_DETECTOR_ALGO_NAME` | `"gibberish-detector"` | Gibberish detection |
+| `MERCH_AGENT_ALGO_NAME` | `"merch-agent"` | Merchandising agent |
+| `PASS_THROUGH_ALGO_NAME` | `"pass-through"` | Model import/pass-through |
+## Strategy ID Registry {#strategy-ids}
+Strategy IDs use **kebab-case** and generally follow the pattern `{domain}-{operation}`; legacy IDs that still use underscores are flagged in the lists below.
+### Learning Strategies
+- `learning` - Generic learning
+- `semantic-search-learning` - Semantic search training
+- `text-encoder-learning` - Text encoder fine-tuning
+- `image-classifier-learning` - Image classifier training
+- `transformer-tagging-learning` - Transformer-based tagging
+- `clip-learning` - CLIP model training
+- `visual-search-learning` - Visual search training
+- `segmentation-learning` - Segmentation learning
+- `sam-item-segmentation` - SAM item segmentation
+- `shopping-graph-learning` - Shopping graph training
+- `global-learning`, `complementarity-learning` - FM model training
+- `macro-tags-learning`, `macro-tags-learning-albert-base` - Macro tag classification
+- `item-images-outfit-detection` - Outfit detection learning
+- `learning-denoise` - Denoising learning
+### Evaluation Strategies
+- `evaluation` - Generic evaluation
+- `search-evaluation` - Search evaluation
+- `image-classifier-evaluation` - Image classifier evaluation
+- `visual-search-evaluation` - Visual search evaluation
+- `transformer-tagging-evaluation` - Tagging evaluation
+- `segmentation-evaluation`, `segmentation-calibration` - Segmentation eval/calibration
+- `item-images-outfit-evaluation` - Outfit detection evaluation
+### Preprocessing Strategies (NOTE: legacy uses underscores)
+- `query_dataset_preprocessing` - Query dataset generation
+- `item_data_dataset_preprocessing` - Item data preprocessing
+- `search_item_data_dataset_preprocessing` - Search-specific item data
+- `clip_item_data_dataset_preprocessing` - CLIP-specific item data
+- `gpt_item_data_dataset_preprocessing` - GPT-specific item data
+- `character_tokenizer_preprocessing`, `word_tokenizer_preprocessing` - NLP tokenization
+- `global_preprocessing`, `complementarity_preprocessing` - FM preprocessing
+- `recommend_to_user_evaluation_preprocessing`, `recommend_to_items_evaluation_preprocessing` - Eval data prep
+- `item-image-cutout-images-preprocessing` - Image cutout preprocessing (kebab-case)
+- `item-images-preprocessing`, `item-images-single-encoding` - Image preprocessing
+- `cutout-item-images-preprocessing` - Cutout image processing
+- `visual-search-preprocessing` - Visual search data prep
+- `item-image-outfit-detection-preprocessing` - Outfit detection preprocessing
+- `item-object-detection` - Object detection preprocessing
+### Encoding/Export Strategies
+- `items-encoding`, `item-cutout-image-encoding` - Item encoding
+- `item-encodings-updater` - Encoding updates
+- `item-encoding-export` - Encoding export
+- `item-ids-diff-dumper` - Item ID diff dumping
+- `ftp-exporter`, `fhr-exporter` - Model exporters
+- `aleph-product-mapper`, `aleph-analytic-incremental-mapper` - Aleph mapping
+- `aleph-product-feed-download`, `aleph-fhr-analytic-feed-download` - Data download
+### Scoring/Recommendation Strategies
+- `most_popular_scorer`, `most_trendy_scorer` - Scoring (underscore, legacy)
+- `scored-graph`, `unscored-graph-format-1`, `unscored-graph-format-2` - Graph formats
+- `shop-the-look-recommendation` - STL recommendation
+- `recommendation-enrichment` - Enrichment
+- `items-tagging`, `tagging`, `image-cutout-image-tagging` - Tagging
+- `bounding-box-computation` - YOLO bounding boxes
+- `outfit-image-enrichment` - Outfit image enrichment
+- `item-segmentation` - Item segmentation
+- `gcs_activities_copy` - GCS activity copying (underscore, legacy)
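+Because the registry mixes kebab-case IDs with legacy underscore IDs, a reviewer may want to tell the two apart automatically. The sketch below is one way to do that; `classify_strategy_id` and its return labels are hypothetical, and the regular expressions assume IDs contain only lowercase letters, digits, and a single kind of separator.
+```python
+import re
+
+# One or more lowercase/digit words joined by hyphens (a single word also counts).
+KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
+# Two or more lowercase/digit words joined by underscores.
+LEGACY_SNAKE_CASE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)+")
+
+
+def classify_strategy_id(strategy_id: str) -> str:
+    """Label a strategy ID as 'kebab-case', 'legacy-underscore', or 'invalid'."""
+    if KEBAB_CASE.fullmatch(strategy_id):
+        return "kebab-case"
+    if LEGACY_SNAKE_CASE.fullmatch(strategy_id):
+        return "legacy-underscore"
+    return "invalid"
+```
+With this rule, `semantic-search-learning` is accepted as kebab-case, `most_popular_scorer` is reported as a legacy ID, and anything with uppercase letters or mixed separators is rejected.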
+## Service Model Name Registry {#service-model-names}
+Pattern: `{DOMAIN}_{TYPE}_SERVICE_MODEL_NAME = "{type}-service-model"`
+| Constant | Value |
+|---|---|
+| `SEARCH_ITEM_ENCODING_SERVICE_MODEL_NAME` | `"item-encoding-service-model"` |
+| `SEARCH_QUERY_ENCODING_SERVICE_MODEL_NAME` | `"query-encoding-service-model"` |
+| `FM_USER_ENCODING_SERVICE_MODEL_NAME` | `"user-encoding-service-model"` |
+| `ITEM_IMAGE_ENCODING_SERVICE_MODEL_NAME` | `"image-encoding-service-model"` |
+| `GPT_ITEM_ENCODING_SERVICE_MODEL_NAME` | `"item-encoding-service-model"` |
+| `GPT_USER_ENCODING_SERVICE_MODEL_NAME` | `"user-encoding-service-model"` |
+| `SEMANTIC_SEARCH_MODEL_NAME` | `"semantic-search-model-onnx"` |
+Note: Some service model names are domain-prefixed (`SEARCH_*`, `FM_*`, `GPT_*`) to disambiguate when the same model type serves different algorithms.
+## KPI Template Names {#kpi-templates}
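+Pattern: `*_KPI_TEMPLATE_NAME = "batch-*-kpi"`. `SESSIONS_TEMPLATE_NAME` and `ITEM_TAGGING_TEMPLATE_NAME` omit the `KPI` segment in both the constant and the value.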
+| Constant | Value |
+|---|---|
+| `TENANT_ATTRIBUTION_KPI_TEMPLATE_NAME` | `"batch-tenant-attribution-kpi"` |
+| `WIDGET_ATTRIBUTION_KPI_TEMPLATE_NAME` | `"batch-widget-attribution-kpi"` |
+| `WIDGET_ABTEST_ATTRIBUTION_KPI_TEMPLATE_NAME` | `"batch-widget-abtest-attribution-kpi"` |
+| `STRATEGY_ABTEST_ATTRIBUTION_KPI_TEMPLATE_NAME` | `"batch-strategy-abtest-attribution-kpi"` |
+| `ITEMS_ACTIVITIES_HISTOGRAM_KPI_TEMPLATE_NAME` | `"batch-items-activities-histogram-kpi"` |
+| `ITEMS_ACTIVITIES_STATS_KPI_TEMPLATE_NAME` | `"batch-items-activities-stats-kpi"` |
+| `ITEMS_WITH_MOST_ACTIVITIES_KPI_TEMPLATE_NAME` | `"batch-items-with-most-activities-kpi"` |
+| `SESSIONS_TEMPLATE_NAME` | `"batch-sessions"` |
+| `ITEM_TAGGING_TEMPLATE_NAME` | `"batch-item-tagging"` |
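+Every row in the table above is consistent with a single derivation: drop `_TEMPLATE_NAME`, convert the remainder to kebab-case, and prefix it with `batch-`. The sketch below encodes that observation; `expected_template_value` is a hypothetical helper, not a function from the codebase.
+```python
+def expected_template_value(constant_name: str) -> str:
+    """Derive the template value implied by a *_TEMPLATE_NAME constant."""
+    suffix = "_TEMPLATE_NAME"
+    if not constant_name.endswith(suffix):
+        raise ValueError(f"{constant_name!r} does not end with {suffix}")
+    stem = constant_name[: -len(suffix)]
+    # SCREAMING_SNAKE_CASE stem -> kebab-case, with the shared "batch-" prefix.
+    return "batch-" + stem.lower().replace("_", "-")
+```
+For instance, `expected_template_value("SESSIONS_TEMPLATE_NAME")` yields `"batch-sessions"`, matching the table.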
+## Launcher Class Naming {#launcher-classes}
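+Scala/Java launcher classes generally follow `earlybirds.algo.{domain}.batch.{Domain}BatchLauncher`, sometimes with a `Dataproc` or `Simple` qualifier before `BatchLauncher`; the last two entries sit outside the `algo.{domain}.batch` package layout: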
+```
+earlybirds.algo.als.batch.AlsBatchLauncher
+earlybirds.algo.graph.batch.GraphBatchLauncher
+earlybirds.algo.nlp.batch.NlpDataprocBatchLauncher
+earlybirds.algo.search.batch.SearchDataprocBatchLauncher
+earlybirds.algo.gpt.batch.GPTDataprocBatchLauncher
+earlybirds.algo.item.batch.ItemBatchLauncher
+earlybirds.algo.tagging.batch.TaggingSimpleBatchLauncher
+earlybirds.algo.gpt.batch.GPTSimpleBatchLauncher
+earlybirds.model.updater.EncodingsUpdaterLauncher
+earlybirds.algo.evaluation.preprocessing.EvaluationPreprocessingLauncher
+```