npm - @groupby/ai-dev - Versions diffs - 0.5.5 → 0.5.8 - Mend

@groupby/ai-dev 0.5.5 → 0.5.8

Files changed (43)

package/teams/fhr-ai-team/skills/naming-conventions-reviewer/references/repo-dependency-graph.md ADDED

@@ -0,0 +1,264 @@
+# Repository Dependency Graph
+## Table of Contents
+- [Shared Libraries (Foundation Layer)](#shared-libraries)
+- [Dependency Map by Repo](#dependency-map)
+- [Cross-Repo Model Output Consumption](#model-output-flow)
+- [Docker Image to Repo Mapping](#docker-images)
+- [Protobuf Service Definitions](#protobuf-services)
+- [Data Flow Graph](#data-flow-graph)
+- [Kubeflow Config Directory Structure](#kubeflow-directories)
+## Shared Libraries
+### Foundation Layer (consumed by almost everything)
+| Package Name | Repo | Published As |
+|---|---|---|
+| `earlybirds_commons` | `algo.early-birds-python` | `earlybirds_commons` on Artifact Registry |
+| `protobuf-ml` | `protobuf-ml` | `protobuf_ml` on Artifact Registry |
+### Toolbox Layer (consumed by ML repos)
+| Package Name | Repo | Published As | Depends On |
+|---|---|---|---|
+| `torch_toolbox` | `pytorch-toolbox` | `torch-toolbox` / `torch_toolbox` | earlybirds_commons |
+| `eb_tensorflow` | `tensorflow-toolbox` | `eb_tensorflow` | earlybirds_commons, protobuf-ml |
+| `item-toolbox` | `item-toolbox` | `item` (import as `item`) | earlybirds_commons, eb_tensorflow, nlp, protobuf-ml |
+| `nlp-toolbox` | `nlp-toolbox` | `nlp` (import as `nlp`) | earlybirds_commons, eb_tensorflow |
+### Dependency Groupings
+**PyTorch-based repos** (use torch_toolbox):
+- algo.clip-ml, algo.semantic-search-ml, algo.semantic-search-bge-m3-ml
+**TensorFlow-based repos** (use eb_tensorflow):
+- algo.autocomplete-ml, algo.fm-ml, algo.gpt-ml, algo.image-classifier, algo.image-encoder,
+  algo.object-detection, algo.search-ml, algo.shop-the-look-ml, algo.shop-the-look-monitoring,
+  algo.tagging-ml, algo.user-intent-ml, algo.visual-search
+## Dependency Map
+```
+earlybirds_commons (algo.early-birds-python)
+  +-- protobuf-ml
+  |
+  +-- eb_tensorflow (tensorflow-toolbox)
+  |     +-- earlybirds_commons
+  |     +-- protobuf-ml
+  |
+  +-- torch_toolbox (pytorch-toolbox)
+  |     +-- earlybirds_commons
+  |
+  +-- item-toolbox (published as "item")
+  |     +-- earlybirds_commons
+  |     +-- eb_tensorflow
+  |     +-- nlp
+  |     +-- protobuf-ml
+  |
+  +-- nlp-toolbox (published as "nlp")
+        +-- earlybirds_commons
+        +-- eb_tensorflow
+```
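As an illustration only (not part of the package), the dependency map above can be written as a plain mapping and queried for transitive dependencies or a safe build order. The edges are transcribed from the Toolbox Layer table; treating `protobuf-ml` and `earlybirds_commons` as having no internal dependencies is an assumption based on the Foundation Layer table, and the function names are hypothetical.

```python
from graphlib import TopologicalSorter

# Direct dependencies, transcribed from the "Toolbox Layer" table.
# "item" and "nlp" are the published names of item-toolbox and nlp-toolbox.
# Assumption: the two foundation packages have no internal dependencies.
DIRECT_DEPS = {
    "earlybirds_commons": set(),
    "protobuf-ml": set(),
    "torch_toolbox": {"earlybirds_commons"},
    "eb_tensorflow": {"earlybirds_commons", "protobuf-ml"},
    "nlp": {"earlybirds_commons", "eb_tensorflow"},
    "item": {"earlybirds_commons", "eb_tensorflow", "nlp", "protobuf-ml"},
}


def transitive_deps(package, graph=DIRECT_DEPS):
    """Return every package reachable from `package`, excluding itself."""
    seen = set()
    stack = list(graph[package])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph[dep])
    return seen


def build_order(graph=DIRECT_DEPS):
    """Order packages so that each one appears after all of its dependencies."""
    return list(TopologicalSorter(graph).static_order())
```

With these edges, `nlp` picks up `protobuf-ml` indirectly through `eb_tensorflow`, which the flat table does not show.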
+### Per-Repo Dependencies
+| Repo | earlybirds_commons | eb_tensorflow | torch_toolbox | item | nlp | protobuf-ml |
+|---|---|---|---|---|---|---|
+| algo.app-ml | Y | - | - | - | - | - |
+| algo.autocomplete-ml | Y | Y | - | - | Y | Y |
+| algo.clip-ml | Y | - | Y | - | - | - |
+| algo.fhr-data-feed | Y | - | - | - | - | - |
+| algo.fm-ml | Y | Y | - | - | - | Y |
+| algo.gpt-ml | Y | Y | - | Y | Y | Y |
+| algo.image-classifier | Y | Y | - | - | - | - |
+| algo.image-encoder | Y | Y | - | - | - | Y |
+| algo.object-detection | Y | Y | - | Y | - | Y |
+| algo.search-ml | Y | Y | - | Y | - | - |
+| algo.semantic-search-ml | Y | - | Y | - | - | - |
+| algo.semantic-search-bge-m3-ml | Y | - | Y | - | - | - |
+| algo.shop-the-look-ml | Y | Y | - | Y | - | Y |
+| algo.shop-the-look-monitoring | Y | Y | - | Y | - | Y |
+| algo.tagging-ml | Y | Y | - | Y | Y | Y |
+| algo.user-intent-ml | Y | Y | - | - | - | Y |
+| algo.visual-search | Y | Y | - | Y | - | - |
+### Integration Hub: algo.shop-the-look-ml
+This repo is the key integration point, consuming outputs from 8 internal packages:
+```
+algo.shop-the-look-ml
+  +-- earlybirds_commons (>=3.0.45)
+  +-- eb_tensorflow (>=3.1.69)
+  +-- item (>=2.0.38)
+  +-- tagging (>=1.1.70)        <-- from algo.tagging-ml
+  +-- object-detection (1.0.28)  <-- from algo.object-detection
+  +-- visual-search (1.1.18)    <-- from algo.visual-search
+  +-- visual_tag (0.0.8)
+  +-- protobuf-ml (>=2.64.0)
+```
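The listing above mixes lower bounds (`>=`) with exact pins. The sketch below, which is not part of the package, shows one way to check an installed version against either form using only the standard library. The pin values are copied from the listing; `parse_version`, `satisfies`, and `exact_pins` are hypothetical helpers, and the parser assumes purely numeric dotted versions.

```python
import re

# Pins copied from the listing: ">=" is a lower bound, a bare version is exact.
PINS = {
    "earlybirds_commons": ">=3.0.45",
    "eb_tensorflow": ">=3.1.69",
    "item": ">=2.0.38",
    "tagging": ">=1.1.70",
    "object-detection": "1.0.28",
    "visual-search": "1.1.18",
    "visual_tag": "0.0.8",
    "protobuf-ml": ">=2.64.0",
}


def parse_version(text):
    """Turn '3.0.45' into (3, 0, 45) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, pin):
    """Check an installed version against a '>=X.Y.Z' or exact 'X.Y.Z' pin."""
    match = re.fullmatch(r"(>=)?(\d+(?:\.\d+)*)", pin)
    if match is None:
        raise ValueError(f"unsupported pin syntax: {pin!r}")
    operator, wanted = match.groups()
    if operator == ">=":
        return parse_version(installed) >= parse_version(wanted)
    return parse_version(installed) == parse_version(wanted)


def exact_pins(pins=PINS):
    """List the packages that are pinned to a single version."""
    return sorted(name for name, pin in pins.items() if not pin.startswith(">="))
```

The exact pins are the ones most likely to need a coordinated bump when the producing repo releases a new version.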
+## Cross-Repo Model Output Consumption {#model-output-flow}
+### Model Output -> Consumer Relationships
+| Producer Repo | Output Type | Consumer Repos |
+|---|---|---|
+| algo.image-encoder | Image feature vectors | algo.clip-ml, algo.fm-ml, algo.search-ml, algo.shop-the-look-ml |
+| algo.tagging-ml | Item tags/categories | algo.shop-the-look-ml |
+| algo.object-detection | Bounding boxes (YOLO) | algo.shop-the-look-ml |
+| algo.visual-search | Visual similarity scores | algo.shop-the-look-ml |
+| algo.item-utils | Item data (images, metadata) | All ML repos needing item data |
+| algo.semantic-search-ml | Query/item encodings | algo.app-ml (serving) |
+| algo.search-ml | Search model artifacts | algo.app-ml (serving) |
+### Shop-the-Look Data Flow
+```
+Raw Item Data
+  |
+  v
+algo.item-utils (fetch images, prepare data)
+  |
+  v
+Parallel Processing:
+  +-> algo.image-encoder (extract image features)
+  +-> algo.tagging-ml (tag items)
+  +-> algo.object-detection (detect objects/YOLO)
+  +-> algo.visual-search (compute visual similarity)
+  +-> algo.clip-ml (CLIP embeddings)
+  |
+  v
+algo.shop-the-look-ml (CONSUMES ALL ABOVE)
+  |
+  v
+algo.shop-the-look-monitoring (tracks predictions)
+```
+## Docker Images
+### Batch Images (Scala/Dataproc)
+| Image Name | Source Repo | Type |
+|---|---|---|
+| `algo-fm-batch` | algo.fm-ml | Scala batch |
+| `algo-nlp-dataproc-batch` | nlp-toolbox | Dataproc |
+| `algo-item-dataproc-batch` | algo.item | Dataproc |
+| `algo-search-dataproc-batch` | algo.search | Dataproc |
+| `algo-gpt-dataproc-batch` | algo.gpt | Dataproc |
+| `algo-autocomplete-dataproc-batch` | algo.autocomplete-ml | Dataproc |
+| `algo-stl-batch` | algo.shop-the-look | Batch |
+| `algo-tagging-batch` | algo.tagging | Batch |
+| `algo-basic-batch` | algo.early-birds | Batch |
+| `algo-content-based-batch` | algo.early-birds | Batch |
+| `algo-cv-batch` | algo.computer-vision | Batch |
+| `algo-graph-batch` | algo.early-birds | Batch |
+| `algo-evaluation-preprocessing-batch` | algo.early-birds | Batch |
+| `algo-item-utils` | algo.item-utils | Batch |
+| `algo-model-utils` | algo.model-utils | Batch |
+| `algo-clip-dataproc-batch` | algo.clip-ml | Dataproc |
+### Python ML Images
+| Image Name | Source Repo | Type |
+|---|---|---|
+| `semantic-search` | algo.semantic-search-ml + algo.semantic-search-bge-m3-ml | Python ML |
+| `search` | algo.search-ml | Python ML |
+| `tagging` | algo.tagging-ml | Python ML |
+| `gpt` | algo.gpt-ml | Python ML |
+| `image-encoder` | algo.image-encoder | Python ML |
+| `image-classifier` | algo.image-classifier | Python ML |
+| `visual-search` | algo.visual-search | Python ML |
+| `text-encoder` | algo.text-encoder-ml | Python ML |
+| `autocomplete` | algo.autocomplete-ml | Python ML |
+| `segmentation` | algo.segmentation | Python ML |
+| `clip_cp` | algo.clip-ml | Python ML |
+| `sam` | algo.sam / algo.sam3 | Python ML |
+| `yolo_cp` | algo.yolo-world | Python ML |
+| `fm` | algo.fm-ml | Python ML |
+### Infrastructure Images
+| Image Name | Purpose |
+|---|---|
+| `ebap-ftp-exporter` | Model FTP export |
+| `ebap-fhr-exporter` | Model FHR export |
+| `ebap-model-importer` | Model import |
+| `ebap-activity-monitoring` | Activity monitoring |
+| `ebap-scripts` | GCS copy and utility scripts |
+| `aleph-mapper-mlp` | Aleph data mapping |
+| `aleph-data-feed` | Aleph data feed |
+| `ai-gcp-cleaner` | GCP resource cleanup |
+| `ai-utils` | AI utility operations |
+| `attraqt-gibberish-detector` | Gibberish detection |
+## Protobuf Services
+Proto definitions in `protobuf-ml/src/main/protobuf/earlybirds/grpc/`:
+| Service | Proto File | Domain |
+|---|---|---|
+| `algorithms_service` | algo/algorithms/ | Algorithm management |
+| `predictors_service` | predictors/ | Predictor CRUD |
+| `predictor_group_service` | algo/serving/predictor/group/ | Predictor groups |
+| `query_encoding_service` | algo/search/ | Search query encoding |
+| `autocomplete_service` | algo/autocomplete/ | Autocomplete |
+| `features_serving` | algo/features/serving/ | Feature serving |
+| `graph_serving` | algo/graph/serving/ | Graph recommendations |
+| `basic_serving` | algo/basic/serving/ | Basic scoring |
+| `activities_kpi_service` | algo/kpi/activities/ | Activity KPIs |
+| `strategies_kpi_service` | algo/kpi/strategies/ | Strategy KPIs |
+| `model_kpi_service` | algo/kpi/models/ | Model KPIs |
+| `datasources_service` | datasources/ | Data source management |
+| `customscores_service` | customscores/ | Custom scoring |
+| `user_intent_score_service` | algo/pui/ | User intent scoring |
+| `internal_serving_resource_prediction_service` | algo/serving/resource/prediction/ | Resource prediction |
+## Data Flow Graph
+```
+LakeFS (rawproducts, rawanalyticsincremental, mappedanalyticsincremental)
+  |
+  v
+[Dataproc preprocessing] --> GCS (xo-{env}-ai-eu-eb-algo-models/{algo}/preprocessing/)
+  |
+  v
+[Python ML training] --> MLflow (model registry) --> GCS (model artifacts)
+  |
+  v
+[Kubeflow pipeline orchestration] --> attraqt-kubeflow-configs (JSON configs)
+  |                                    attraqt-kubeflow-pipelines (Python KFP)
+  v
+[Model export] --> FTP/FHR exporter --> Production serving (algo.early-birds)
+```
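The preprocessing location in the graph above is a template with two placeholders. A minimal sketch (not part of the package) of filling it in follows. The template string is copied from the graph; the `gs://` prefix, the reading of the first path segment as the bucket name, and the example values `dev` and `fm` are assumptions.

```python
# Template copied from the data flow graph.
GCS_PREPROCESSING_TEMPLATE = "xo-{env}-ai-eu-eb-algo-models/{algo}/preprocessing/"


def preprocessing_uri(env, algo):
    """Build the preprocessing output location for one environment and algorithm."""
    if not env or not algo:
        raise ValueError("env and algo must both be non-empty")
    # Assumption: the first path segment is the bucket, so a gs:// URI is valid.
    return "gs://" + GCS_PREPROCESSING_TEMPLATE.format(env=env, algo=algo)


# Hypothetical values, for illustration only.
print(preprocessing_uri("dev", "fm"))
```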
+## Kubeflow Config Directory Structure {#kubeflow-directories}
+31 algorithm categories in `attraqt-kubeflow-configs/configs/development/`:
+| Directory | Algorithm Domain |
+|---|---|
+| `ai/` | Semantic search, item data, image download |
+| `aleph/` | Aleph data feed mapping |
+| `als/` | Alternating Least Squares |
+| `autocomplete/` | Query autocompletion |
+| `basic/` | Basic scoring (popularity, trendiness) |
+| `batch/` | Batch pipeline templates |
+| `clip/` | CLIP vision model pipelines |
+| `computer_vision/` | Image encoding & preprocessing |
+| `content-based/` | Content-based filtering |
+| `fm/` | Factorization Machines |
+| `fp_growth/` | FP-Growth graph algorithms |
+| `gibberish/` | Gibberish detection |
+| `gpt/` | GPT model training/inference |
+| `image_classifier/` | Image classification |
+| `item/` | Item data preprocessing |
+| `item-utils/` | Item utility operations |
+| `lakefs-gc/` | LakeFS garbage collection |
+| `model-utils/` | Model encoding updates/exports |
+| `nlp/` | NLP tokenization |
+| `pass-through/` | Model import/pass-through |
+| `sam3/` | Segment Anything Model |
+| `script/` | Utility scripts (GCS, inference) |
+| `search/` | Semantic search training/encoding |
+| `segmentation/` | Image segmentation |
+| `stl/` | Shop The Look recommendations |
+| `tagging/` | Item tagging/classification |
+| `text-encoder/` | Text encoding models |
+| `visual-search/` | Visual/image search |
+| `yolo/` | YOLO object detection |
+Production configs: `configs/production/` (ai, kpi categories only)

package/teams/fhr-ai-team/skills/planning/SKILL.md ADDED

@@ -0,0 +1,138 @@
+---
+name: planning
+description: >
+  Use when the user wants to plan an implementation, break down a feature into tasks,
+  or create a step-by-step development roadmap. Searches codebase first, enforces code
+  reuse, checks naming conventions, and produces bite-sized red/green TDD tasks with exact
+  file paths and code blocks.
+---
+# Implementation Planning
+## Core Principle
+Every plan must prove that existing code was searched before proposing new code.
+No placeholders. No "TBD." Every task is precise enough for a developer with zero
+context to execute.
+## Process
+### Step 1: Enter Read-Only Exploration
+Do NOT modify any files during planning. This is a research and design phase only.
+### Step 2: Search Codebase First
+**MANDATORY.** Before proposing any implementation:
+1. Search for existing utilities, scripts, and patterns that do similar things across all team repos.
+2. Check shared libraries explicitly:
+   - `earlybirds_commons` - shared utilities and configurations
+   - `torch_toolbox` - PyTorch model training/inference utilities
+   - `item-toolbox` - e-commerce item data utilities
+   - `nlp-toolbox` - text tokenization and NLP utilities
+   - `eb_tensorflow` - TensorFlow-based model utilities
+3. Check `attraqt-kubeflow-configs` for existing config patterns.
+4. Check `attraqt-kubeflow-pipelines` for existing pipeline step implementations.
+Report what you found: "Reusing X from Y" for each piece of existing code that applies.
+### Step 3: Check Naming Conventions
+Invoke `ai.pierre:naming-conventions-reviewer` to validate any proposed names:
+- New classes, functions, variables
+- Config keys, strategy IDs
+- Dataset names, GCS paths
+- Docker image names
+### Step 4: Map Dependencies
+Identify cross-repo dependencies:
+- Which shared libraries are used and at what version
+- Config inheritance patterns (MongoDB -> Kubeflow config -> pipeline code)
+- Data flow: which repo produces data that another consumes
+- Docker image dependencies
+### Step 5: Break Into Tasks
+Each task must be:
+- **2-5 minutes** of work
+- **TDD-based**: write failing test -> implement -> verify -> commit
+- **Vertically sliced**: complete features per task, not horizontal layers
+- **Independently understandable**: a developer can execute this task without reading others
+### Step 6: Specify Precisely
+For each task, include:
+- **Exact file paths** to create or modify
+- **Complete code blocks** (not pseudocode, not "add similar logic")
+- **Precise commands** with expected output (e.g., `pytest tests/test_foo.py -v` should show "3 passed")
+- **Verification step**: how to confirm the task is done correctly
+Forbidden language in tasks:
+- "add appropriate error handling"
+- "implement similar to..."
+- "TBD"
+- "handle edge cases"
+- "add tests as needed"
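The forbidden-language list lends itself to an automated check. The sketch below is not part of the skill; it is one possible way to scan a plan for those phrases. The phrases are copied from the list above, while `find_placeholders` and the case-insensitive whole-word matching are assumptions.

```python
import re

# Phrases copied from the "Forbidden language in tasks" list.
FORBIDDEN_PHRASES = (
    "add appropriate error handling",
    "implement similar to",
    "TBD",
    "handle edge cases",
    "add tests as needed",
)

# One case-insensitive, whole-word pattern per phrase.
_PATTERNS = [
    (phrase, re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE))
    for phrase in FORBIDDEN_PHRASES
]


def find_placeholders(plan_text):
    """Return (line_number, phrase) pairs for each forbidden phrase found."""
    hits = []
    for number, line in enumerate(plan_text.splitlines(), start=1):
        for phrase, pattern in _PATTERNS:
            if pattern.search(line):
                hits.append((number, phrase))
    return hits
```

An empty result does not prove a plan is precise; it only rules out the listed phrases.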
+### Step 7: Multi-Repo Awareness
+If changes span multiple repos:
+- Specify the order of changes (which repo first)
+- Document cross-repo integration points
+- Note version pinning requirements for shared libraries
+- Include integration test steps that verify cross-repo behavior
+### Step 8: Save Plan
+Write the plan to `docs/plans/YYYY-MM-DD-<feature-name>.md` in the relevant project repo.
+Plan document structure:
+```
+# <Feature Name> Implementation Plan
+## Goal
+<One paragraph>
+## Architecture
+<Which repos, which components, data flow>
+## Reuse Summary
+<What existing code is being reused, from which repos>
+## Net-New Code Summary
+<What new code is being written, and why it does not exist yet>
+## Prerequisites
+<What must be true before starting>
+## Tasks
+### Task 1: <title>
+- Files: <exact paths>
+- Test: <failing test to write first>
+- Implementation: <code>
+- Verification: <command + expected output>
+### Task 2: ...
+## Post-Implementation
+<Integration testing, deployment steps, monitoring>
+```
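For illustration (not part of the skill), the plan filename convention from Step 8 can be generated as follows. The `docs/plans/YYYY-MM-DD-<feature-name>.md` pattern comes from the text; the slug rule (lowercase, runs of non-alphanumeric characters collapsed to a hyphen) and the name `plan_path` are assumptions.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def plan_path(feature_name, day=None):
    """Build docs/plans/YYYY-MM-DD-<feature-name>.md for a feature."""
    day = day or date.today()
    # Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", feature_name.lower()).strip("-")
    if not slug:
        raise ValueError("feature name must contain at least one letter or digit")
    return PurePosixPath("docs/plans") / f"{day.isoformat()}-{slug}.md"
```

Passing `day` explicitly keeps the function deterministic in tests; leaving it out uses the current date.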
+### Step 9: Self-Review
+Before presenting the plan, verify:
+- **Spec coverage**: every requirement from the design maps to at least one task
+- **Placeholder scan**: no vague instructions remain
+- **Type consistency**: function signatures match across tasks (caller and callee agree)
+- **Cross-repo alignment**: interface contracts are consistent
+- **Test coverage**: every behavioral change has a corresponding test
+### Step 10: Present for Approval
+Show the plan with:
+- Reuse summary ("reusing X from Y")
+- Net-new code summary ("creating X because nothing equivalent exists")
+- Risk areas and mitigation
+- Estimated task count and scope

package/teams/fhr-ai-team/skills/pr-description/SKILL.md ADDED

@@ -0,0 +1,94 @@
+---
+name: pr-description
+description: >
+  Generate a PR description from the current branch diff. Automatically selects
+  the light template (bug fix, config, deps, typo) or full template (feature,
+  refactor, architecture) based on the change scope. Use when creating a PR or
+  when asked to write/generate a PR description.
+---
+# PR Description Generator
+## Workflow
+1. **Gather context.** Run these commands to understand the change:
+   - `git log main..HEAD --oneline` (or `develop..HEAD` if `main` does not exist) to list commits
+   - `git diff main..HEAD --stat` for file-level summary
+   - `git diff main..HEAD` for the full diff
+   - Check for a ticket ID in branch name or commit messages (patterns: `XO-\d+`, `EB-\d+`)
+2. **Classify the change.** Pick a tier:
+   - **Light**: single-concern fix, dependency bump, config tweak, typo, CI change, doc update, or any change touching fewer than 4 files with no new public API
+   - **Full**: new feature, multi-file refactor, architectural change, new pipeline step, API or data format change, performance optimization with analysis, any change that alters observable system behavior for users or downstream consumers
+3. **Generate the description.** Follow the template for the selected tier exactly. Do not add sections that are not in the template. Delete optional sections (Scope, Wiring, How to run, Backward compatibility) when they do not apply rather than writing "N/A."
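Two parts of the workflow above are mechanical enough to sketch in code (this is not part of the skill). The ticket patterns `XO-\d+` and `EB-\d+` and the "fewer than 4 files with no new public API" rule come from the text; the function names are hypothetical, and `classify_tier` covers only that one rule, since the other Light and Full signals need judgment about the diff.

```python
import re

# Matches XO-<digits> or EB-<digits> when not glued to a preceding letter or digit.
TICKET_PATTERN = re.compile(r"(?<![A-Za-z0-9])(?:XO|EB)-\d+")


def find_ticket_ids(branch_name, commit_messages):
    """Collect ticket IDs from the branch name, then commit messages, without repeats."""
    found = []
    for text in [branch_name, *commit_messages]:
        for ticket in TICKET_PATTERN.findall(text):
            if ticket not in found:
                found.append(ticket)
    return found


def classify_tier(files_changed, adds_public_api):
    """Apply only the file-count rule: fewer than 4 files and no new public API."""
    if len(files_changed) < 4 and not adds_public_api:
        return "light"
    return "full"
```

A small change that alters observable behavior would still be Full under the text's definition, so a "light" result from this helper is a starting point rather than a verdict.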
+## Light template
+```markdown
+## What changed
+{One or two sentences: what changed and why.}
+## Notes for reviewer
+{Anything non-obvious, or "None".}
+```
+## Full template
+```markdown
+## Summary
+{Problem statement and brief summary of the approach. Reference ticket ID if found.
+Explain the motivation and constraints, not the code.}
+## Scope
+{What is included in this PR.
+Delete this section when the scope is self-evident from the title.}
+## Behavior
+{Observable outcomes, not implementation details. Describe what the system
+does differently after this PR in terms a reviewer can verify:
+  - Given X, the system now does Y (previously: Z)
+  - Config key `foo.bar` controls ... ; default is ...
+For design decisions, state what was chosen and why alternatives were rejected.
+For performance work, include before/after measurements.
+For data model changes, show the before/after schema or format.}
+## Wiring
+{How the new code connects to the rest of the system. Which callers, configs,
+or entry points are affected. Delete this section for self-contained changes.}
+## Tests
+{How this was tested. List specific test commands and key scenarios.
+If tests are not yet written, state what needs to be tested.}
+## How to run
+{Steps to run or test this locally, if applicable.
+Delete this section for changes that don't need manual verification.}
+## Backward compatibility
+{Any breaking change to public API, config key, data format, or pipeline interface,
+with migration path. "No breaking changes" if fully compatible.
+Delete this section for internal-only changes.}
+```
+## Writing rules
+- **Summary before Behavior.** The first section must explain the problem or motivation, not describe the code.
+- **Behavior, not implementation.** Write "on non-VNNI pods, spin-wait is disabled," not "reads `/proc/cpuinfo` and sets `allow_spinning=0`." The reviewer should understand what changed from the Behavior section alone; the diff shows how.
+- **No commit log rewrites.** Do not copy-paste the commit list. Synthesize.
+- **Be specific.** Name the config keys, endpoints, or interfaces that changed. Avoid vague summaries like "updated logic" or "improved handling."
+- **Keep it scannable.** Use bullet points for lists of 3+ items. Use inline code for identifiers.
+- **Link, do not repeat.** If a ticket or design doc explains the context, link it and summarize in one sentence rather than reproducing it.
+- **No filler.** Every sentence must carry information. Remove "This PR does..." preambles.
+- **Delete optional sections** (Scope, Wiring, How to run, Backward compatibility) when they do not apply, rather than writing "N/A."
+- **Measurements for perf work.** If the change is a performance optimization, include numbers: before/after, methodology, dataset size.
+- **Scope boundaries matter.** When a PR is part of a larger effort, state what is deliberately left out. This prevents review drift and sets expectations for follow-up work.