npm - @bgicli/bgicli - Versions diffs - 2.2.8 → 2.2.9 - Mend

@bgicli/bgicli 2.2.8 → 2.2.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (113)

package/data/skills/anthropic-algorithmic-art/SKILL.md +405 -0
package/data/skills/anthropic-canvas-design/SKILL.md +130 -0
package/data/skills/anthropic-claude-api/SKILL.md +243 -0
package/data/skills/anthropic-doc-coauthoring/SKILL.md +375 -0
package/data/skills/anthropic-docx/SKILL.md +590 -0
package/data/skills/anthropic-frontend-design/SKILL.md +42 -0
package/data/skills/anthropic-internal-comms/SKILL.md +32 -0
package/data/skills/anthropic-mcp-builder/SKILL.md +236 -0
package/data/skills/anthropic-pdf/SKILL.md +314 -0
package/data/skills/anthropic-pptx/SKILL.md +232 -0
package/data/skills/anthropic-skill-creator/SKILL.md +485 -0
package/data/skills/anthropic-webapp-testing/SKILL.md +96 -0
package/data/skills/anthropic-xlsx/SKILL.md +292 -0
package/data/skills/arxiv-database/SKILL.md +362 -0
package/data/skills/astropy/SKILL.md +329 -0
package/data/skills/ctx-advanced-evaluation/SKILL.md +402 -0
package/data/skills/ctx-bdi-mental-states/SKILL.md +311 -0
package/data/skills/ctx-context-compression/SKILL.md +272 -0
package/data/skills/ctx-context-degradation/SKILL.md +206 -0
package/data/skills/ctx-context-fundamentals/SKILL.md +201 -0
package/data/skills/ctx-context-optimization/SKILL.md +195 -0
package/data/skills/ctx-evaluation/SKILL.md +251 -0
package/data/skills/ctx-filesystem-context/SKILL.md +287 -0
package/data/skills/ctx-hosted-agents/SKILL.md +260 -0
package/data/skills/ctx-memory-systems/SKILL.md +225 -0
package/data/skills/ctx-multi-agent-patterns/SKILL.md +257 -0
package/data/skills/ctx-project-development/SKILL.md +291 -0
package/data/skills/ctx-tool-design/SKILL.md +271 -0
package/data/skills/dhdna-profiler/SKILL.md +162 -0
package/data/skills/generate-image/SKILL.md +183 -0
package/data/skills/geomaster/SKILL.md +365 -0
package/data/skills/get-available-resources/SKILL.md +275 -0
package/data/skills/hamelsmu-build-review-interface/SKILL.md +96 -0
package/data/skills/hamelsmu-error-analysis/SKILL.md +164 -0
package/data/skills/hamelsmu-eval-audit/SKILL.md +183 -0
package/data/skills/hamelsmu-evaluate-rag/SKILL.md +177 -0
package/data/skills/hamelsmu-generate-synthetic-data/SKILL.md +131 -0
package/data/skills/hamelsmu-validate-evaluator/SKILL.md +212 -0
package/data/skills/hamelsmu-write-judge-prompt/SKILL.md +144 -0
package/data/skills/hf-cli/SKILL.md +174 -0
package/data/skills/hf-mcp/SKILL.md +178 -0
package/data/skills/hugging-face-dataset-viewer/SKILL.md +121 -0
package/data/skills/hugging-face-datasets/SKILL.md +542 -0
package/data/skills/hugging-face-evaluation/SKILL.md +651 -0
package/data/skills/hugging-face-jobs/SKILL.md +1042 -0
package/data/skills/hugging-face-model-trainer/SKILL.md +717 -0
package/data/skills/hugging-face-paper-pages/SKILL.md +239 -0
package/data/skills/hugging-face-paper-publisher/SKILL.md +624 -0
package/data/skills/hugging-face-tool-builder/SKILL.md +110 -0
package/data/skills/hugging-face-trackio/SKILL.md +115 -0
package/data/skills/hugging-face-vision-trainer/SKILL.md +593 -0
package/data/skills/huggingface-gradio/SKILL.md +245 -0
package/data/skills/matlab/SKILL.md +376 -0
package/data/skills/modal/SKILL.md +381 -0
package/data/skills/openai-cloudflare-deploy/SKILL.md +224 -0
package/data/skills/openai-develop-web-game/SKILL.md +149 -0
package/data/skills/openai-doc/SKILL.md +80 -0
package/data/skills/openai-figma/SKILL.md +42 -0
package/data/skills/openai-figma-implement-design/SKILL.md +264 -0
package/data/skills/openai-gh-address-comments/SKILL.md +25 -0
package/data/skills/openai-gh-fix-ci/SKILL.md +69 -0
package/data/skills/openai-imagegen/SKILL.md +174 -0
package/data/skills/openai-jupyter-notebook/SKILL.md +107 -0
package/data/skills/openai-linear/SKILL.md +87 -0
package/data/skills/openai-netlify-deploy/SKILL.md +247 -0
package/data/skills/openai-notion-knowledge-capture/SKILL.md +56 -0
package/data/skills/openai-notion-meeting-intelligence/SKILL.md +60 -0
package/data/skills/openai-notion-research-documentation/SKILL.md +59 -0
package/data/skills/openai-notion-spec-to-implementation/SKILL.md +58 -0
package/data/skills/openai-openai-docs/SKILL.md +69 -0
package/data/skills/openai-pdf/SKILL.md +67 -0
package/data/skills/openai-playwright/SKILL.md +147 -0
package/data/skills/openai-render-deploy/SKILL.md +479 -0
package/data/skills/openai-screenshot/SKILL.md +267 -0
package/data/skills/openai-security-best-practices/SKILL.md +86 -0
package/data/skills/openai-security-ownership-map/SKILL.md +206 -0
package/data/skills/openai-security-threat-model/SKILL.md +81 -0
package/data/skills/openai-sentry/SKILL.md +123 -0
package/data/skills/openai-sora/SKILL.md +178 -0
package/data/skills/openai-speech/SKILL.md +144 -0
package/data/skills/openai-spreadsheet/SKILL.md +145 -0
package/data/skills/openai-transcribe/SKILL.md +81 -0
package/data/skills/openai-vercel-deploy/SKILL.md +77 -0
package/data/skills/openai-yeet/SKILL.md +28 -0
package/data/skills/pennylane/SKILL.md +224 -0
package/data/skills/polars-bio/SKILL.md +374 -0
package/data/skills/primekg/SKILL.md +97 -0
package/data/skills/pymatgen/SKILL.md +689 -0
package/data/skills/qiskit/SKILL.md +273 -0
package/data/skills/qutip/SKILL.md +316 -0
package/data/skills/recursive-decomposition/SKILL.md +185 -0
package/data/skills/rowan/SKILL.md +427 -0
package/data/skills/scholar-evaluation/SKILL.md +298 -0
package/data/skills/sentry-create-alert/SKILL.md +210 -0
package/data/skills/sentry-fix-issues/SKILL.md +126 -0
package/data/skills/sentry-pr-code-review/SKILL.md +105 -0
package/data/skills/sentry-python-sdk/SKILL.md +317 -0
package/data/skills/sentry-setup-ai-monitoring/SKILL.md +217 -0
package/data/skills/stable-baselines3/SKILL.md +297 -0
package/data/skills/sympy/SKILL.md +498 -0
package/data/skills/trailofbits-ask-questions-if-underspecified/SKILL.md +85 -0
package/data/skills/trailofbits-audit-context-building/SKILL.md +302 -0
package/data/skills/trailofbits-differential-review/SKILL.md +220 -0
package/data/skills/trailofbits-insecure-defaults/SKILL.md +117 -0
package/data/skills/trailofbits-modern-python/SKILL.md +333 -0
package/data/skills/trailofbits-property-based-testing/SKILL.md +123 -0
package/data/skills/trailofbits-semgrep-rule-creator/SKILL.md +172 -0
package/data/skills/trailofbits-sharp-edges/SKILL.md +292 -0
package/data/skills/trailofbits-variant-analysis/SKILL.md +142 -0
package/data/skills/transformers.js/SKILL.md +637 -0
package/data/skills/writing/SKILL.md +419 -0
package/dist/bgi.js +2 -2
package/package.json +1 -1

package/data/skills/hamelsmu-generate-synthetic-data/SKILL.md ADDED

@@ -0,0 +1,131 @@
+---
+name: generate-synthetic-data
+description: >
+  Create diverse synthetic test inputs for LLM pipeline evaluation using
+  dimension-based tuple generation. Use when bootstrapping an eval dataset,
+  when real user data is sparse, or when stress-testing specific failure
+  hypotheses. Do NOT use when you already have 100+ representative real
+  traces (use stratified sampling instead), or when the task is collecting
+  production logs.
+---
+# Generate Synthetic Data
+Generate diverse, realistic test inputs that cover the failure space of an LLM pipeline.
+## Prerequisites
+Before generating synthetic data, identify where the pipeline is likely to fail. Ask the user about known failure-prone areas, review existing user feedback, or form hypotheses from available traces. Dimensions (Step 1) must target anticipated failures, not arbitrary variation.
+## Core Process
+### Step 1: Define Dimensions
+Dimensions are axes of variation specific to your application. Choose dimensions based on where you expect failures.
+```
+Dimension 1: [Name] — [What it captures]
+  Values: [value_a, value_b, value_c, ...]
+Dimension 2: [Name] — [What it captures]
+  Values: [value_a, value_b, value_c, ...]
+Dimension 3: [Name] — [What it captures]
+  Values: [value_a, value_b, value_c, ...]
+```
+Example for a real estate assistant:
+```
+Feature: what task the user wants
+  Values: [property search, scheduling, email drafting]
+Client Persona: who the user serves
+  Values: [first-time buyer, investor, luxury buyer]
+Scenario Type: query clarity
+  Values: [well-specified, ambiguous, out-of-scope]
+```
+Start with 3 dimensions. Add more only if initial traces reveal failure patterns along new axes.
+### Step 2: Draft 20 Tuples with the User
+A tuple is one combination of dimension values defining a specific test case. Present 20 draft tuples to the user and iterate until they confirm the tuples reflect realistic scenarios. The user's domain knowledge is essential here — they know which combinations actually occur and which are unrealistic.
+```
+(Feature: Property Search, Persona: Investor, Scenario: Ambiguous)
+(Feature: Scheduling, Persona: First-time Buyer, Scenario: Well-specified)
+(Feature: Email Drafting, Persona: Luxury Buyer, Scenario: Out-of-scope)
+```
+### Step 3: Generate More Tuples with an LLM
+```
+Generate 10 random combinations of ({dim1}, {dim2}, {dim3})
+for a {your application description}.
+The dimensions are:
+{dim1}: {description}. Possible values: {values}
+{dim2}: {description}. Possible values: {values}
+{dim3}: {description}. Possible values: {values}
+Output each tuple in the format: ({dim1}, {dim2}, {dim3})
+Avoid duplicates. Vary values across dimensions.
+```
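For comparison with the LLM prompt above, here is a minimal sketch that enumerates the full cross product of dimension values and draws a seeded random sample. This is a swapped-in, non-LLM baseline rather than the skill's own method. The `DIMENSIONS` dict reuses the real estate example, and `sample_tuples` is a hypothetical helper name.

```python
import itertools
import random

# Dimension values copied from the real estate example; adjust for your application.
DIMENSIONS = {
    "Feature": ["property search", "scheduling", "email drafting"],
    "Client Persona": ["first-time buyer", "investor", "luxury buyer"],
    "Scenario Type": ["well-specified", "ambiguous", "out-of-scope"],
}


def sample_tuples(dimensions, k, seed=0):
    """Return up to k distinct tuples drawn from the cross product of all values."""
    all_tuples = list(itertools.product(*dimensions.values()))
    rng = random.Random(seed)  # seeded so the draft set is reproducible
    return rng.sample(all_tuples, min(k, len(all_tuples)))


tuples = sample_tuples(DIMENSIONS, k=10)
for t in tuples:
    print(t)
```

Uniform sampling knows nothing about which combinations actually occur, so the output still needs the user review described in Step 2.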
+### Step 4: Convert Each Tuple to a Natural Language Query
+Use a separate prompt for this step. Single-step generation (tuples + queries together) produces repetitive phrasing.
+```
+We are generating synthetic user queries for a {your application}.
+{Brief description of what it does.}
+Given:
+{dim1}: {value}
+{dim2}: {value}
+{dim3}: {value}
+Write a realistic query that a user might enter. The query should
+reflect the specified persona and scenario characteristics.
+Example: "{one of your hand-written examples}"
+Now generate a new query.
+```
+### Step 5: Filter for Quality
+Review generated queries. Discard and regenerate when:
+- Phrasing is awkward or unrealistic
+- Content doesn't match the tuple's intent
+- Queries are too similar to each other
+Optional: use an LLM to rate each query's realism on a 1-5 scale and discard queries rated below 3.
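The "too similar to each other" check can be partly automated. The sketch below uses character-level similarity from the standard library's `difflib`, which is one possible proxy and not something the skill prescribes. The `drop_near_duplicates` name, the 0.85 threshold, and the sample queries are assumptions for illustration.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(queries, threshold=0.85):
    """Keep a query only if it is not too similar to one already kept."""
    kept = []
    for query in queries:
        is_duplicate = any(
            SequenceMatcher(None, query.lower(), other.lower()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(query)
    return kept


# Hypothetical generated queries; the first two differ by one character.
queries = [
    "Find me a duplex under $500k with good rental yield.",
    "Find me a duplex under $500k with good rental yields.",
    "Can you book a viewing for Saturday at 10am?",
]
filtered = drop_near_duplicates(queries)
print(filtered)
```

Surface similarity misses paraphrases, so a manual read of the surviving queries is still worthwhile.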
+### Step 6: Run Queries Through the Pipeline
+Execute all queries through the full LLM pipeline. Capture complete traces: input, all intermediate steps, tool calls, retrieved docs, final output.
+**Target: ~100 high-quality, diverse traces.** This is a rough heuristic for reaching saturation (where new traces stop revealing new failure categories). The number depends on system complexity.
+## Sampling Real User Data
+When you have real queries available, don't sample randomly. Use stratified sampling:
+1. **Identify high-variance dimensions** — read through queries and find ways they differ (length, topic, complexity, presence of constraints).
+2. **Assign labels** — for small sets, with the user; for large sets, use K-means clustering on query embeddings.
+3. **Sample from each group** — ensures coverage across query types, not just the most common ones.
+When both real and synthetic data are available, use synthetic data to fill gaps in underrepresented query types.
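A minimal sketch of step 3, assuming every query already carries a group label from step 2 (assigned with the user or by clustering). The `stratified_sample` helper and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(queries, labels, per_group, seed=0):
    """Sample up to per_group queries from each label group."""
    groups = defaultdict(list)
    for query, label in zip(queries, labels):
        groups[label].append(query)
    rng = random.Random(seed)  # seeded for reproducibility
    return {
        label: rng.sample(members, min(per_group, len(members)))
        for label, members in groups.items()
    }


# Toy data: "search" dominates, the other groups are rare.
queries = ["q1", "q2", "q3", "q4", "q5", "q6"]
labels = ["search", "search", "search", "search", "scheduling", "email"]
picked = stratified_sample(queries, labels, per_group=2)
print(picked)
```

Groups that come back smaller than `per_group` are the underrepresented query types that synthetic data can fill in.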
+## Anti-Patterns
+- **Unstructured generation.** Prompting "give me test queries" without the dimension/tuple structure produces generic, repetitive, happy-path examples.
+- **Single-step generation.** Generating tuples and queries in one prompt produces less diverse results than the two-step separation.
+- **Arbitrary dimensions.** Dimensions that don't target failure-prone regions waste test budget.
+- **Skipping user review of tuples.** Without the user validating tuples first, you can't judge whether LLM-generated tuples are realistic.
+- **Synthetic data when no one can judge realism.** If no one can judge whether a synthetic trace is realistic, use real data instead.
+- **Synthetic data for complex domain-specific content** (legal filings, medical records) where LLMs miss structural nuance.
+- **Synthetic data for low-resource languages or dialects** where LLM-generated samples are unrealistic.

package/data/skills/hamelsmu-validate-evaluator/SKILL.md ADDED

@@ -0,0 +1,212 @@
+---
+name: validate-evaluator
+description: >
+  Calibrate an LLM judge against human labels using data splits, TPR/TNR, and
+  bias correction. Use after writing a judge prompt (write-judge-prompt) when you
+  need to verify alignment before trusting its outputs. Do NOT use for code-based
+  evaluators (those are deterministic; test with standard unit tests).
+---
+# Validate Evaluator
+Calibrate an LLM judge against human judgment.
+## Overview
+1. Split human-labeled data into train (10-20%), dev (40-45%), test (40-45%)
+2. Run judge on dev set and measure TPR/TNR
+3. Iterate on the judge until TPR and TNR > 90% on dev set
+4. Run once on held-out test set for final TPR/TNR
+5. Apply bias correction formula to production data
+## Prerequisites
+- A built LLM judge prompt (from write-judge-prompt)
+- Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
+  - Aim for ~50 Pass and ~50 Fail (balanced, even if real distribution is skewed)
+  - Labels must come from a domain expert, not outsourced annotators
+- Candidate few-shot examples from your labeled data
+## Core Instructions
+### Step 1: Create Data Splits
+Split human-labeled data into three disjoint sets:
+| Split | Size | Purpose | Rules |
+|-------|------|---------|-------|
+| **Training** | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
+| **Dev** | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
+| **Test** | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |
+Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.
+```python
+from sklearn.model_selection import train_test_split
+# First split: separate test set
+train_dev, test = train_test_split(
+    labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
+)
+# Second split: separate training examples from dev set
+train, dev = train_test_split(
+    train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
+)
+# Result: ~15% train, ~45% dev, ~40% test
+```
+### Step 2: Run Evaluator on Dev Set
+Run the judge on every example in the dev set. Compare predictions to human labels.
+### Step 3: Measure TPR and TNR
+**TPR (True Positive Rate):** When a human says Pass, how often does the judge also say Pass?
+```
+TPR = (judge says Pass AND human says Pass) / (human says Pass)
+```
+**TNR (True Negative Rate):** When a human says Fail, how often does the judge also say Fail?
+```
+TNR = (judge says Fail AND human says Fail) / (human says Fail)
+```
+```python
+from sklearn.metrics import confusion_matrix
+tn, fp, fn, tp = confusion_matrix(human_labels, evaluator_labels,
+                                   labels=['Fail', 'Pass']).ravel()
+tpr = tp / (tp + fn)
+tnr = tn / (tn + fp)
+```
+Use TPR/TNR, not Precision/Recall or raw accuracy. These two metrics directly map to the bias correction formula. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth.
+### Step 4: Inspect Disagreements
+Examine every case where the judge disagrees with human labels:
+| Disagreement Type | Judge | Human | Fix |
+|-------------------|-------|-------|-----|
+| **False Pass** | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
+| **False Fail** | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |
+For each disagreement, determine whether to:
+- Clarify wording in the judge prompt
+- Swap or add few-shot examples from the training set
+- Add explicit rules for the edge case
+- Split the criterion into more specific sub-checks
+### Step 5: Iterate
+Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.
+**Stopping criteria:**
+- **Target:** TPR > 90% AND TNR > 90%
+- **Minimum acceptable:** TPR > 80% AND TNR > 80%
+**If alignment stalls:**
+| Problem | Solution |
+|---------|---------|
+| TPR and TNR both low | Use a more capable LLM for the judge |
+| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
+| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
+| Consistently wrong on certain input types | Add targeted few-shot examples from training set |
+| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |
+### Step 6: Final Measurement on Test Set
+Run the judge **exactly once** on the held-out test set. Record final TPR and TNR.
+Do not iterate on the judge after seeing test set results. If further changes are needed, go back to Step 4 with new dev data.
+### Step 7 (Optional): Estimate True Success Rate (Rogan-Gladen Correction)
+Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:
+```
+theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
+```
+Where:
+- `p_obs` = fraction of unlabeled traces the judge scored as Pass
+- `TPR`, `TNR` = from test set measurement
+- `theta_hat` = corrected estimate of true success rate
+Clip to [0, 1]. Invalid when TPR + TNR - 1 is near 0 (judge is no better than random).
+**Example:**
+- Judge TPR = 0.92, TNR = 0.88
+- 500 production traces: 400 scored Pass -> p_obs = 0.80
+- theta_hat = (0.80 + 0.88 - 1) / (0.92 + 0.88 - 1) = 0.68 / 0.80 = **0.85**
+- True success rate is ~85%, not the raw 80%
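The worked example above can be reproduced with a small function. This is a sketch under the definitions given in this step. The function name `corrected_success_rate` and the `eps` guard value are assumptions, not part of the skill.

```python
def corrected_success_rate(p_obs, tpr, tnr, eps=1e-6):
    """Rogan-Gladen estimate of the true pass rate, clipped to [0, 1]."""
    denom = tpr + tnr - 1
    if abs(denom) < eps:
        # The judge is no better than random, so the correction is undefined.
        raise ValueError("TPR + TNR - 1 is near 0; the correction is invalid.")
    theta_hat = (p_obs + tnr - 1) / denom
    return min(1.0, max(0.0, theta_hat))


# 400 of 500 production traces scored Pass by the judge.
theta_hat = corrected_success_rate(p_obs=400 / 500, tpr=0.92, tnr=0.88)
print(f"{theta_hat:.2f}")  # prints 0.85
```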
+### Step 8: Confidence Interval
+Compute a bootstrap confidence interval. A point estimate alone is not enough.
+```python
+import numpy as np
+def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
+    """Bootstrap 95% CI for corrected success rate."""
+    # Convert once, outside the loop, so each resample is a cheap index lookup.
+    human = np.array(human_labels)
+    judge = np.array(eval_labels)
+    n = len(human)
+    estimates = []
+    for _ in range(n_bootstrap):
+        # Resample the labeled test set with replacement.
+        idx = np.random.choice(n, size=n, replace=True)
+        h = human[idx]
+        e = judge[idx]
+        tp = ((h == 'Pass') & (e == 'Pass')).sum()
+        fn = ((h == 'Pass') & (e == 'Fail')).sum()
+        tn = ((h == 'Fail') & (e == 'Fail')).sum()
+        fp = ((h == 'Fail') & (e == 'Pass')).sum()
+        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
+        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
+        denom = tpr_b + tnr_b - 1
+        # Skip resamples where the correction is undefined (judge no better than random).
+        if abs(denom) < 1e-6:
+            continue
+        theta = (p_obs + tnr_b - 1) / denom
+        estimates.append(np.clip(theta, 0, 1))
+    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)
+lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
+print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
+```
+Or use `judgy` (`pip install judgy`):
+```python
+from judgy import estimate_success_rate
+result = estimate_success_rate(
+    human_labels=test_human_labels,
+    evaluator_labels=test_eval_labels,
+    unlabeled_labels=prod_eval_labels
+)
+print(f"Corrected rate: {result.estimate:.2f}")
+print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
+```
+## Practical Guidance
+- **Pin exact model versions** for LLM judges (e.g., `gpt-4o-2024-05-13`, not `gpt-4o`). Providers update models without notice, causing silent drift.
+- **Re-validate** after changing the judge prompt, switching models, or when production confidence intervals widen unexpectedly.
+- Use ~100 labeled examples (50 Pass, 50 Fail). Below 60, confidence intervals become wide.
+- **One trusted domain expert** is the most efficient labeling path. If not feasible, have two annotators label 20-50 traces independently and resolve disagreements before proceeding.
+- **Improving TPR narrows the confidence interval more than improving TNR.** The correction formula divides by (TPR + TNR - 1), and when most traces pass, the estimate is more sensitive to errors in TPR, so low TPR amplifies estimation errors into wide CIs.
+## Anti-Patterns
+- **Assuming judges "just work" without validation.** A judge may consistently miss failures or flag passing traces.
+- **Using raw accuracy or percent agreement.** Use TPR and TNR. With class imbalance, raw accuracy is misleading.
+- **Dev/test examples as few-shot examples.** This is data leakage.
+- **Reporting dev set performance as final accuracy.** Dev numbers are optimistic. The test set gives the unbiased estimate.
+- **Raw judge scores without bias correction.** If you report an aggregate pass rate, apply the Rogan-Gladen formula (Step 7).
+- **Point estimates without confidence intervals.** A corrected rate of 85% could easily be 78-92% with small test sets. Report the range so stakeholders know how much to trust the number.

package/data/skills/hamelsmu-write-judge-prompt/SKILL.md ADDED

@@ -0,0 +1,144 @@
+---
+name: write-judge-prompt
+description: >
+  Design LLM-as-Judge evaluators for subjective criteria that code-based checks
+  cannot handle. Use when a failure mode requires interpretation (tone,
+  faithfulness, relevance, completeness). Do NOT use when the failure mode can be
+  checked with code (regex, schema validation, execution tests). Do NOT use when
+  you need to validate or calibrate the judge — use validate-evaluator instead.
+---
+# Write LLM-as-Judge Prompt
+Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.
+## Prerequisites
+- Error analysis is complete. The failure mode is identified.
+- You have human-labeled traces for this failure mode (at least 20 Pass and 20 Fail examples).
+- A code-based evaluator cannot check this failure mode. Exhaust code-based options before reaching for a judge — many failure modes that seem subjective reduce to keyword checks, regex, or API calls when you understand the domain. Example: detecting whether an AI interviewing coach suggests "general" questions (asking about typical behavior instead of a specific past event) seems to require semantic understanding, but in practice a keyword check for words like "usually," "typical," and "normally" could work quite well.
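The interviewing-coach example can be sketched as a code-based check. The three keywords come from the example above. The `suggests_general_question` function name and the sample strings are hypothetical, and a real keyword list would need tuning against labeled traces.

```python
import re

# Keywords taken from the example above; extend after reviewing real traces.
GENERAL_QUESTION_PATTERN = re.compile(r"\b(usually|typical|normally)\b", re.IGNORECASE)


def suggests_general_question(coach_output):
    """Return True if the text contains a word that signals a 'general' question."""
    return GENERAL_QUESTION_PATTERN.search(coach_output) is not None


print(suggests_general_question("Ask: how do you usually decide what to buy?"))
print(suggests_general_question("Ask: tell me about the last time you bought a laptop."))
```

The first call flags a question about typical behavior and the second passes a question about a specific past event. If a check like this aligns with human labels, no LLM judge is needed for this failure mode.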
+## The Four Components
+Every judge prompt requires exactly four components:
+### 1. Task and Evaluation Criterion
+State what the judge evaluates. One failure mode per judge.
+```
+You are an evaluator assessing whether a real estate assistant's email
+uses the appropriate tone for the client's persona.
+```
+Not: "Evaluate whether the email is good" or "Rate the email quality from 1-5."
+### 2. Pass/Fail Definitions
+Outcomes are strictly binary: Pass or Fail. No Likert scales, no letter grades, no partial credit. Define exactly what constitutes Pass and Fail. These definitions come from your error analysis failure mode descriptions.
+```
+## Definitions
+PASS: The email matches the expected communication style for the client persona:
+- Luxury Buyers: formal language, emphasis on exclusive features, premium
+  market positioning, no casual slang
+- First-Time Homebuyers: warm and encouraging tone, educational explanations,
+  avoids jargon, patient and supportive
+- Investors: data-driven language, ROI-focused, market analytics, concise
+  and professional
+FAIL: The email uses a tone mismatched to the client persona. Examples:
+- Using casual slang ("hey, check out this pad!") for a luxury buyer
+- Using heavy financial jargon for a first-time homebuyer
+- Using overly emotional language for an investor
+```
+### 3. Few-Shot Examples
+Include labeled Pass and Fail examples from your human-labeled data.
+```
+## Examples
+### Example 1: PASS
+Client Persona: Luxury Buyer
+Email: "Dear Mr. Harrington, I am pleased to present an exclusive listing
+at 1200 Pacific Heights Drive. This distinguished property features..."
+Critique: The email opens with a formal salutation and uses language
+consistent with luxury positioning — "exclusive listing," "distinguished
+property." No casual slang or informal phrasing. The tone matches the
+luxury buyer persona throughout.
+Result: Pass
+### Example 2: FAIL
+Client Persona: Luxury Buyer
+Email: "Hey! Just found this awesome place you might like. It's got a
+pool and stuff, super cool neighborhood..."
+Critique: The greeting "Hey!" is informal. Phrases like "awesome place,"
+"got a pool and stuff," and "super cool" are casual slang inappropriate
+for a luxury buyer. The email reads like a text message, not a
+professional communication for a high-end client.
+Result: Fail
+### Example 3: PASS (borderline)
+Client Persona: First-Time Homebuyer
+Email: "Hi Sarah, I found a property that might be a great fit for your
+first home. The neighborhood has good schools nearby, and the monthly
+payment would be similar to what you're currently paying in rent..."
+Critique: The greeting is warm but not overly casual. The email explains
+the property in relatable terms — comparing mortgage to rent, mentioning
+schools — which is educational without being condescending. It avoids
+jargon like "amortization" or "LTV ratio." While not deeply technical,
+this matches the supportive tone expected for a first-time buyer.
+Result: Pass
+```
+**Rules for selecting examples:**
+- Include at least one clear Pass, one clear Fail, and one borderline case. Borderline examples are the most valuable — they teach nuance.
+- Draw examples from the training split (10-20% of labeled data set aside for this purpose).
+- Any example used in the judge prompt must be excluded from dev and test sets. Using dev/test examples is data leakage.
+- 2-4 examples is typical. Performance plateaus after 4-8.
+### 4. Structured Output Format
+Enforce structured output using your LLM provider's schema enforcement (e.g., `response_format` in OpenAI, tool definitions in Anthropic) or a library like Instructor or Outlines. If the provider doesn't support schema enforcement, specify the JSON schema in the prompt.
+The output must include a critique before the verdict. Placing the critique first forces the judge to articulate its assessment before committing to a decision.
+```json
+{
+  "critique": "string — detailed assessment of the output against the criterion",
+  "result": "Pass or Fail"
+}
+```
+Critiques must be detailed, not terse. A good critique explains what specifically was correct or incorrect and references concrete evidence from the output. The critiques in your few-shot examples set the bar for the level of detail the judge will produce.
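When the provider cannot enforce the schema, the judge's response can be checked after the fact. The sketch below validates the two fields from the JSON shape above using only the standard library. `parse_judge_output` and the sample response are assumptions for illustration.

```python
import json


def parse_judge_output(raw):
    """Parse a judge response and check it has a critique and a binary result."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("Judge output must be a JSON object.")
    critique = data.get("critique")
    result = data.get("result")
    if not isinstance(critique, str) or not critique.strip():
        raise ValueError("Missing or empty 'critique'.")
    if result not in ("Pass", "Fail"):
        raise ValueError(f"'result' must be 'Pass' or 'Fail', got {result!r}.")
    return {"critique": critique, "result": result}


# Hypothetical judge response.
raw = '{"critique": "The greeting is informal for a luxury buyer.", "result": "Fail"}'
verdict = parse_judge_output(raw)
print(verdict["result"])
```

Rejecting anything other than `Pass` or `Fail` keeps ordinal scores or hedged verdicts from leaking into downstream TPR/TNR calculations.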
+## Choosing What to Pass to the Judge
+Feed only what the judge needs for an accurate decision:
+| Failure Mode | What the Judge Needs |
+|-------------|---------------------|
+| Tone mismatch | Client persona + generated email |
+| Answer faithfulness | Retrieved context + generated answer |
+| SQL correctness | User query + generated SQL + schema |
+| Instruction following | System prompt rules + generated response |
+| Tool call justification | Conversation history + tool call + tool result |
+For long documents, feed only the relevant snippet, not the entire document.
+## Model Selection
+Start with the most capable model available. The same model used for the main task works as judge (the judge performs a different, narrower task). Optimize for cost later once alignment is confirmed.
+## Anti-Patterns
+- **Vague criteria like "is this helpful?"** Target a specific, observable failure mode from error analysis.
+- **Holistic judge for the entire trace.** A single judge covering multiple dimensions produces unactionable verdicts.
+- **No few-shot examples.** Without examples, the model won't know what counts as a failure in your application.
+- **Dev/test examples used as few-shot.** This is data leakage. Use only the training split.
+- **Likert scales (1-5, letter grades, etc.).** Binary pass/fail only. Likert scales produce scores that sound precise but can't be calibrated: annotators disagree on the difference between a 3 and a 4, and the judge inherits that noise. Binary forces you to define a clear decision boundary upfront, which makes inter-annotator agreement measurable and the judge's errors actionable. If you need to capture severity, use multiple binary judges (e.g., "factually wrong" and "dangerously wrong") rather than one ordinal scale.
+- **Skipping validation.** Measure alignment with human labels using validate-evaluator before trusting the judge.
+- **Judges for specification failures without fixing the prompt first.** If the prompt never asked for the behavior, add the instruction before building an evaluator. For critical requirements, a judge can still serve as a regression guard.

package/data/skills/hf-cli/SKILL.md ADDED

@@ -0,0 +1,174 @@
+---
+name: hf-cli
+description: "Hugging Face Hub CLI (`hf`) for downloading, uploading, and managing repositories, models, datasets, and Spaces on the Hugging Face Hub. Replaces the now-deprecated `huggingface-cli` command."
+---
+Install: `curl -LsSf https://hf.co/cli/install.sh | bash -s`.
+The Hugging Face Hub CLI tool `hf` is available. IMPORTANT: The `hf` command replaces the deprecated `huggingface-cli` command.
+Use `hf --help` to view available functions. Note that auth commands are now all under `hf auth`, e.g. `hf auth whoami`.
+Generated with `huggingface_hub v1.7.2`. Run `hf skills add --force` to regenerate.
+## Commands
+- `hf download REPO_ID` — Download files from the Hub. `[--type CHOICE --revision TEXT --include TEXT --exclude TEXT --cache-dir TEXT --local-dir TEXT --force-download --dry-run --quiet --max-workers INTEGER]`
+- `hf env` — Print information about the environment.
+- `hf sync` — Sync files between local directory and a bucket. `[--delete --ignore-times --ignore-sizes --plan TEXT --apply TEXT --dry-run --include TEXT --exclude TEXT --filter-from TEXT --existing --ignore-existing --verbose --quiet]`
+- `hf upload REPO_ID` — Upload a file or a folder to the Hub. Recommended for single-commit uploads. `[--type CHOICE --revision TEXT --private --include TEXT --exclude TEXT --delete TEXT --commit-message TEXT --commit-description TEXT --create-pr --every FLOAT --quiet]`
+- `hf upload-large-folder REPO_ID LOCAL_PATH` — Upload a large folder to the Hub. Recommended for resumable uploads. `[--type CHOICE --revision TEXT --private --include TEXT --exclude TEXT --num-workers INTEGER --no-report --no-bars]`
+- `hf version` — Print information about the hf version.
+### `hf auth` — Manage authentication (login, logout, etc.).
+- `hf auth list` — List all stored access tokens.
+- `hf auth login` — Login using a token from huggingface.co/settings/tokens. `[--add-to-git-credential --force]`
+- `hf auth logout` — Logout from a specific token. `[--token-name TEXT]`
+- `hf auth switch` — Switch between access tokens. `[--token-name TEXT --add-to-git-credential]`
+- `hf auth whoami` — Find out which huggingface.co account you are logged in as. `[--format CHOICE]`
+### `hf buckets` — Commands to interact with buckets.
+- `hf buckets cp SRC` — Copy a single file to or from a bucket. `[--quiet]`
+- `hf buckets create BUCKET_ID` — Create a new bucket. `[--private --exist-ok --quiet]`
+- `hf buckets delete BUCKET_ID` — Delete a bucket. `[--yes --missing-ok --quiet]`
+- `hf buckets info BUCKET_ID` — Get info about a bucket. `[--quiet]`
+- `hf buckets list` — List buckets or files in a bucket. `[--human-readable --tree --recursive --format CHOICE --quiet]`
+- `hf buckets move FROM_ID TO_ID` — Move (rename) a bucket to a new name or namespace.
+- `hf buckets remove ARGUMENT` — Remove files from a bucket. `[--recursive --yes --dry-run --include TEXT --exclude TEXT --quiet]`
+- `hf buckets sync` — Sync files between local directory and a bucket. `[--delete --ignore-times --ignore-sizes --plan TEXT --apply TEXT --dry-run --include TEXT --exclude TEXT --filter-from TEXT --existing --ignore-existing --verbose --quiet]`
+### `hf cache` — Manage local cache directory.
+- `hf cache list` — List cached repositories or revisions. `[--cache-dir TEXT --revisions --filter TEXT --format CHOICE --quiet --sort CHOICE --limit INTEGER]`
+- `hf cache prune` — Remove detached revisions from the cache. `[--cache-dir TEXT --yes --dry-run]`
+- `hf cache rm TARGETS` — Remove cached repositories or revisions. `[--cache-dir TEXT --yes --dry-run]`
+- `hf cache verify REPO_ID` — Verify checksums for a single repo revision from cache or a local directory. `[--type CHOICE --revision TEXT --cache-dir TEXT --local-dir TEXT --fail-on-missing-files --fail-on-extra-files]`
+### `hf collections` — Interact with collections on the Hub.
+- `hf collections add-item COLLECTION_SLUG ITEM_ID ITEM_TYPE` — Add an item to a collection. `[--note TEXT --exists-ok]`
+- `hf collections create TITLE` — Create a new collection on the Hub. `[--namespace TEXT --description TEXT --private --exists-ok]`
+- `hf collections delete COLLECTION_SLUG` — Delete a collection from the Hub. `[--missing-ok]`
+- `hf collections delete-item COLLECTION_SLUG ITEM_OBJECT_ID` — Delete an item from a collection. `[--missing-ok]`
+- `hf collections info COLLECTION_SLUG` — Get info about a collection on the Hub. Output is in JSON format.
+- `hf collections list` — List collections on the Hub. `[--owner TEXT --item TEXT --sort CHOICE --limit INTEGER --format CHOICE --quiet]`
+- `hf collections update COLLECTION_SLUG` — Update a collection's metadata on the Hub. `[--title TEXT --description TEXT --position INTEGER --private --theme TEXT]`
+- `hf collections update-item COLLECTION_SLUG ITEM_OBJECT_ID` — Update an item in a collection. `[--note TEXT --position INTEGER]`
+### `hf datasets` — Interact with datasets on the Hub.
+- `hf datasets info DATASET_ID` — Get info about a dataset on the Hub. Output is in JSON format. `[--revision TEXT --expand TEXT]`
+- `hf datasets list` — List datasets on the Hub. `[--search TEXT --author TEXT --filter TEXT --sort CHOICE --limit INTEGER --expand TEXT --format CHOICE --quiet]`
+- `hf datasets parquet DATASET_ID` — List parquet file URLs available for a dataset. `[--subset TEXT --split TEXT --format CHOICE --quiet]`
+- `hf datasets sql SQL` — Execute a raw SQL query with DuckDB against dataset parquet URLs. `[--format CHOICE]`
+### `hf discussions` — Manage discussions and pull requests on the Hub.
+- `hf discussions close REPO_ID NUM` — Close a discussion or pull request. `[--comment TEXT --yes --type CHOICE]`
+- `hf discussions comment REPO_ID NUM` — Comment on a discussion or pull request. `[--body TEXT --body-file PATH --type CHOICE]`
+- `hf discussions create REPO_ID --title TEXT` — Create a new discussion or pull request on a repo. `[--body TEXT --body-file PATH --pull-request --type CHOICE]`
+- `hf discussions diff REPO_ID NUM` — Show the diff of a pull request. `[--type CHOICE]`
+- `hf discussions info REPO_ID NUM` — Get info about a discussion or pull request. `[--comments --diff --no-color --type CHOICE --format CHOICE]`
+- `hf discussions list REPO_ID` — List discussions and pull requests on a repo. `[--status CHOICE --kind CHOICE --author TEXT --limit INTEGER --type CHOICE --format CHOICE --quiet]`
+- `hf discussions merge REPO_ID NUM` — Merge a pull request. `[--comment TEXT --yes --type CHOICE]`
+- `hf discussions rename REPO_ID NUM NEW_TITLE` — Rename a discussion or pull request. `[--type CHOICE]`
+- `hf discussions reopen REPO_ID NUM` — Reopen a closed discussion or pull request. `[--comment TEXT --yes --type CHOICE]`
+### `hf endpoints` — Manage Hugging Face Inference Endpoints.
+- `hf endpoints catalog deploy --repo TEXT` — Deploy an Inference Endpoint from the Model Catalog. `[--name TEXT --accelerator TEXT --namespace TEXT]`
+- `hf endpoints catalog list` — List available Catalog models.
+- `hf endpoints delete NAME` — Delete an Inference Endpoint permanently. `[--namespace TEXT --yes]`
+- `hf endpoints deploy NAME --repo TEXT --framework TEXT --accelerator TEXT --instance-size TEXT --instance-type TEXT --region TEXT --vendor TEXT` — Deploy an Inference Endpoint from a Hub repository. `[--namespace TEXT --task TEXT --min-replica INTEGER --max-replica INTEGER --scale-to-zero-timeout INTEGER --scaling-metric CHOICE --scaling-threshold FLOAT]`
+- `hf endpoints describe NAME` — Get information about an existing endpoint. `[--namespace TEXT]`
+- `hf endpoints list` — List all Inference Endpoints for the given namespace. `[--namespace TEXT --format CHOICE --quiet]`
+- `hf endpoints pause NAME` — Pause an Inference Endpoint. `[--namespace TEXT]`
+- `hf endpoints resume NAME` — Resume an Inference Endpoint. `[--namespace TEXT --fail-if-already-running]`
+- `hf endpoints scale-to-zero NAME` — Scale an Inference Endpoint to zero. `[--namespace TEXT]`
+- `hf endpoints update NAME` — Update an existing endpoint. `[--namespace TEXT --repo TEXT --accelerator TEXT --instance-size TEXT --instance-type TEXT --framework TEXT --revision TEXT --task TEXT --min-replica INTEGER --max-replica INTEGER --scale-to-zero-timeout INTEGER --scaling-metric CHOICE --scaling-threshold FLOAT]`
+### `hf extensions` — Manage hf CLI extensions.
+- `hf extensions exec NAME` — Execute an installed extension.
+- `hf extensions install REPO_ID` — Install an extension from a public GitHub repository. `[--force]`
+- `hf extensions list` — List installed extension commands. `[--format CHOICE --quiet]`
+- `hf extensions remove NAME` — Remove an installed extension.
+- `hf extensions search` — Search extensions available on GitHub (tagged with 'hf-extension' topic). `[--format CHOICE --quiet]`
+### `hf jobs` — Run and manage Jobs on the Hub.
+- `hf jobs cancel JOB_ID` — Cancel a Job. `[--namespace TEXT]`
+- `hf jobs hardware` — List available hardware options for Jobs.
+- `hf jobs inspect JOB_IDS` — Display detailed information on one or more Jobs. `[--namespace TEXT]`
+- `hf jobs logs JOB_ID` — Fetch the logs of a Job. `[--follow --tail INTEGER --namespace TEXT]`
+- `hf jobs ps` — List Jobs. `[--all --namespace TEXT --filter TEXT --format TEXT --quiet]`
+- `hf jobs run IMAGE COMMAND` — Run a Job. `[--env TEXT --secrets TEXT --label TEXT --env-file TEXT --secrets-file TEXT --flavor CHOICE --timeout TEXT --detach --namespace TEXT]`
+- `hf jobs scheduled delete SCHEDULED_JOB_ID` — Delete a scheduled Job. `[--namespace TEXT]`
+- `hf jobs scheduled inspect SCHEDULED_JOB_IDS` — Display detailed information on one or more scheduled Jobs. `[--namespace TEXT]`
+- `hf jobs scheduled ps` — List scheduled Jobs. `[--all --namespace TEXT --filter TEXT --format TEXT --quiet]`
+- `hf jobs scheduled resume SCHEDULED_JOB_ID` — Resume (unpause) a scheduled Job. `[--namespace TEXT]`
+- `hf jobs scheduled run SCHEDULE IMAGE COMMAND` — Schedule a Job. `[--suspend --concurrency --env TEXT --secrets TEXT --label TEXT --env-file TEXT --secrets-file TEXT --flavor CHOICE --timeout TEXT --namespace TEXT]`
+- `hf jobs scheduled suspend SCHEDULED_JOB_ID` — Suspend (pause) a scheduled Job. `[--namespace TEXT]`
+- `hf jobs scheduled uv run SCHEDULE SCRIPT` — Schedule a UV script (local file or URL) to run on HF infrastructure. `[--suspend --concurrency --image TEXT --flavor CHOICE --env TEXT --secrets TEXT --label TEXT --env-file TEXT --secrets-file TEXT --timeout TEXT --namespace TEXT --with TEXT --python TEXT]`
+- `hf jobs stats` — Fetch the resource usage statistics and metrics of Jobs. `[--namespace TEXT]`
+- `hf jobs uv run SCRIPT` — Run a UV script (local file or URL) on HF infrastructure. `[--image TEXT --flavor CHOICE --env TEXT --secrets TEXT --label TEXT --env-file TEXT --secrets-file TEXT --timeout TEXT --detach --namespace TEXT --with TEXT --python TEXT]`
+### `hf models` — Interact with models on the Hub.
+- `hf models info MODEL_ID` — Get info about a model on the Hub. Output is in JSON format. `[--revision TEXT --expand TEXT]`
+- `hf models list` — List models on the Hub. `[--search TEXT --author TEXT --filter TEXT --num-parameters TEXT --sort CHOICE --limit INTEGER --expand TEXT --format CHOICE --quiet]`
+### `hf papers` — Interact with papers on the Hub.
+- `hf papers list` — List daily papers on the Hub. `[--date TEXT --sort CHOICE --limit INTEGER --format CHOICE --quiet]`
+### `hf repos` — Manage repos on the Hub.
+- `hf repos branch create REPO_ID BRANCH` — Create a new branch for a repo on the Hub. `[--revision TEXT --type CHOICE --exist-ok]`
+- `hf repos branch delete REPO_ID BRANCH` — Delete a branch from a repo on the Hub. `[--type CHOICE]`
+- `hf repos create REPO_ID` — Create a new repo on the Hub. `[--type CHOICE --space-sdk TEXT --private --exist-ok --resource-group-id TEXT]`
+- `hf repos delete REPO_ID` — Delete a repo from the Hub. This is an irreversible operation. `[--type CHOICE --missing-ok]`
+- `hf repos delete-files REPO_ID PATTERNS` — Delete files from a repo on the Hub. `[--type CHOICE --revision TEXT --commit-message TEXT --commit-description TEXT --create-pr]`
+- `hf repos duplicate FROM_ID` — Duplicate a repo on the Hub (model, dataset, or Space). `[--type CHOICE --private --exist-ok]`
+- `hf repos move FROM_ID TO_ID` — Move a repository from one namespace to another. `[--type CHOICE]`
+- `hf repos settings REPO_ID` — Update the settings of a repository. `[--gated CHOICE --private --type CHOICE]`
+- `hf repos tag create REPO_ID TAG` — Create a tag for a repo. `[--message TEXT --revision TEXT --type CHOICE]`
+- `hf repos tag delete REPO_ID TAG` — Delete a tag for a repo. `[--yes --type CHOICE]`
+- `hf repos tag list REPO_ID` — List tags for a repo. `[--type CHOICE]`
+### `hf skills` — Manage skills for AI assistants.
+- `hf skills add` — Download a skill and install it for an AI assistant. `[--claude --codex --cursor --opencode --global --dest PATH --force]`
+- `hf skills preview` — Print the generated SKILL.md to stdout.
+### `hf spaces` — Interact with spaces on the Hub.
+- `hf spaces dev-mode SPACE_ID` — Enable or disable dev mode on a Space. `[--stop]`
+- `hf spaces hot-reload SPACE_ID` — Hot-reload any Python file of a Space without a full rebuild + restart. `[--local-file TEXT --skip-checks --skip-summary]`
+- `hf spaces info SPACE_ID` — Get info about a space on the Hub. Output is in JSON format. `[--revision TEXT --expand TEXT]`
+- `hf spaces list` — List spaces on the Hub. `[--search TEXT --author TEXT --filter TEXT --sort CHOICE --limit INTEGER --expand TEXT --format CHOICE --quiet]`
+### `hf webhooks` — Manage webhooks on the Hub.
+- `hf webhooks create --watch TEXT` — Create a new webhook. `[--url TEXT --job-id TEXT --domain CHOICE --secret TEXT]`
+- `hf webhooks delete WEBHOOK_ID` — Delete a webhook permanently. `[--yes]`
+- `hf webhooks disable WEBHOOK_ID` — Disable an active webhook.
+- `hf webhooks enable WEBHOOK_ID` — Enable a disabled webhook.
+- `hf webhooks info WEBHOOK_ID` — Show full details for a single webhook as JSON.
+- `hf webhooks list` — List all webhooks for the current user. `[--format CHOICE --quiet]`
+- `hf webhooks update WEBHOOK_ID` — Update an existing webhook. Only provided options are changed. `[--url TEXT --watch TEXT --domain CHOICE --secret TEXT]`
+## Common options
+- `--format` — Output format: `--format json` (or `--json`) or `--format table` (default).
+- `-q / --quiet` — Minimal output.
+- `--revision` — Git revision ID, which can be a branch name, a tag, or a commit hash.
+- `--token` — Use a User Access Token. Prefer setting `HF_TOKEN` env var instead of passing `--token`.
+- `--type` — The type of repository (model, dataset, or space).
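
The `--format json` option listed above makes list output easier to consume from a script. The following is a minimal sketch, not part of the skill itself: the helper names (`build_list_command`, `run_hf`, `list_as_json`) are hypothetical, and it assumes that the JSON output of `hf <resource> list` can be parsed with `json.loads`. Only flags that appear in the reference above (`--limit`, `--format`) are used.

```python
import json
import subprocess
from typing import Any, Callable, List, Sequence


def build_list_command(resource: str, limit: int) -> List[str]:
    """Build argv for `hf <resource> list` with JSON output.

    `resource` is expected to be a group with a `list` subcommand that
    accepts `--limit` and `--format`, e.g. "models", "datasets", "spaces".
    """
    return ["hf", resource, "list", "--limit", str(limit), "--format", "json"]


def run_hf(argv: Sequence[str]) -> str:
    """Run the command and return stdout; raises if the exit code is non-zero."""
    completed = subprocess.run(
        list(argv), check=True, capture_output=True, text=True
    )
    return completed.stdout


def list_as_json(
    resource: str,
    limit: int,
    runner: Callable[[Sequence[str]], str] = run_hf,
) -> Any:
    """Return the parsed JSON output (assumption: stdout is valid JSON)."""
    return json.loads(runner(build_list_command(resource, limit)))
```

Calling `list_as_json("models", 5)` would invoke the real CLI. The `runner` parameter exists so the parsing step can be exercised without `hf` installed.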
+## Tips
+- Use `hf <command> --help` for full options, descriptions, usage, and real-world examples.
+- Authenticate with the `HF_TOKEN` env var (recommended) or with `--token`.
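
The recommendation to authenticate with `HF_TOKEN` rather than `--token` can be sketched as follows when driving the CLI from Python. This is an illustration under stated assumptions: the helper names (`hf_env`, `download_command`, `run_download`) are hypothetical, and only `hf download` flags listed in the reference above (`--revision`, `--local-dir`, `--dry-run`) are used.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def hf_env(token: str, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with HF_TOKEN set.

    Passing the token through the environment keeps it out of the
    command line, where it could show up in process listings.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HF_TOKEN"] = token
    return env


def download_command(
    repo_id: str,
    revision: Optional[str] = None,
    local_dir: Optional[str] = None,
    dry_run: bool = False,
) -> List[str]:
    """Build argv for `hf download REPO_ID` without any `--token` flag."""
    argv = ["hf", "download", repo_id]
    if revision:
        argv += ["--revision", revision]
    if local_dir:
        argv += ["--local-dir", local_dir]
    if dry_run:
        argv.append("--dry-run")
    return argv


def run_download(repo_id: str, token: str, **options) -> None:
    """Run the download with the token supplied via HF_TOKEN only."""
    subprocess.run(download_command(repo_id, **options), env=hf_env(token), check=True)
```

For example, `run_download("org/model", token, dry_run=True)` would preview a download of the placeholder repo `org/model` without the token ever appearing in the argument list.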