npm - quiver-cli - Versions diffs - 0.1.0 - Mend

quiver-cli 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (281)

package/template/.agents/skills/integrations/langfuse/references/error-analysis.md ADDED

@@ -0,0 +1,100 @@
+---
+name: langfuse-error-analysis
+description: Deep-dive error analysis of an LLM pipeline or AI application using Langfuse traces.
+  Use this skill whenever the user wants to understand why their AI system is producing
+  bad outputs, where their pipeline is failing, how to categorise or label failures,
+  what to prioritise fixing, or how to set up evaluators. Also trigger for "review my
+  traces", "my outputs look wrong", "help me debug my LLM app", "I want to analyse
+  errors", "build a failure taxonomy", "what's going wrong with my pipeline", or any
+  request to systematically inspect, annotate, or score Langfuse traces. If the user
+  is trying to understand or improve the quality of an AI system's outputs, use this skill.
+---
+# Error Analysis
+## Primary Guide
+**1. Fetch the guide in this blogpost**
+https://langfuse.com/guides/cookbook/error-analysis-llm-applications.md
+If fetch is not available, search for the langfuse.com error analysis guide.
+Read it in full. It defines the authoritative 5-step process (sample selection → open coding → clustering → labelling → deciding what to fix).
+**2. Guide the user through this step by step**
+You, as a coding agent, and the user go through this together to perform a full error analysis with their data in Langfuse. Do everything you can via the CLI (look up traces, create annotation queues, ...) for the user. Provide direct links to the UI wherever their action is required. Be proactive and narrate what is going on for the user.
+## Rules (CRITICAL)
+- Use the Langfuse CLI wherever possible.
+- Use charts where possible to display data.
+---
+## Langfuse Implementation Notes
+The guide describes the process. These notes cover the Langfuse-specific API and CLI mechanics required to execute it.
+### Credentials
+```bash
+echo $LANGFUSE_PUBLIC_KEY   # pk-lf-...
+echo $LANGFUSE_SECRET_KEY   # sk-lf-...
+echo $LANGFUSE_BASE_URL     # https://cloud.langfuse.com (EU), https://us.cloud.langfuse.com (US), https://jp.cloud.langfuse.com (JP) or self-hosted
+```
+If these are not set, load them from `.env` in the project root: `export $(grep -v '^#' .env | xargs)`. If `LANGFUSE_HOST` is used instead of `LANGFUSE_BASE_URL`, run `export LANGFUSE_BASE_URL="$LANGFUSE_HOST"`.
+```bash
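+# Strip newlines: some base64 implementations wrap long output, which would corrupt the header
+AUTH=$(printf '%s' "${LANGFUSE_PUBLIC_KEY}:${LANGFUSE_SECRET_KEY}" | base64 | tr -d '\n')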
+# Verify before proceeding
+STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
+  -H "Authorization: Basic $AUTH" \
+  "${LANGFUSE_BASE_URL}/api/public/projects")
+echo "Auth check: $STATUS"
+```
+If status is not `200`, stop and ask the user to check their credentials and host before continuing.
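For scripts that call the API from Python instead of the shell, the same Basic auth header can be built as in the minimal sketch below. `basic_auth_header` is a hypothetical helper name, not part of any Langfuse SDK, and the keys shown are placeholders.

```python
import base64


def basic_auth_header(public_key: str, secret_key: str) -> str:
    """Return the value for an HTTP Basic Authorization header."""
    raw = f"{public_key}:{secret_key}".encode("utf-8")
    # b64encode never inserts line breaks, so the header stays on one line
    return "Basic " + base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # Placeholder keys for illustration only; real keys come from the environment
    print(basic_auth_header("pk-lf-example", "sk-lf-example"))
```

The returned string goes into the `Authorization` header of each request, mirroring `-H "Authorization: Basic $AUTH"` in the shell snippet.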
+### Annotation target: OBSERVATION versus TRACE
+> **CRITICAL:** In OpenTelemetry-instrumented apps, trace-level `input`/`output` can be null, because content often lives in a GENERATION observation. When adding items to an annotation queue, always consider whether the right target is `objectType: OBSERVATION` pointing to the GENERATION observation ID rather than `objectType: TRACE`.
+### Annotation queues
+> **CRITICAL:** Queues cannot be updated or deleted after creation. Create score configs first, then the queue with all config IDs. To add new configs later, create a new queue.
+**Always give the user a direct link immediately after creating a queue:**
+| Host | URL pattern |
+|------|-------------|
+| EU cloud | `https://cloud.langfuse.com/project/<projectId>/annotation-queues/<queueId>` |
+| US cloud | `https://us.cloud.langfuse.com/project/<projectId>/annotation-queues/<queueId>` |
+| Self-hosted | `<LANGFUSE_BASE_URL>/project/<projectId>/annotation-queues/<queueId>` |
+Instruction to give: *"Please open code the first ~50 examples. For each trace, write what you observe in the `open_coding` field (describe behaviour, don't diagnose root causes), then set `pass_fail_assessment` to Pass or Fail."*
+### Prompt fixes
+When a category warrants a prompt fix, always offer the user two options:
+1. Create it as a versioned prompt in Langfuse (tracked, usable via the prompt API)
+2. Draft the specific text change for them to review and apply
+### Set up evaluators
+When a category warrants an evaluator, propose the type of evaluator and offer to set it up for the user via the CLI.
+### Common gotchas
+| Mistake | Fix |
+|---------|-----|
+| `objectType: TRACE` in queue | Use `objectType: OBSERVATION` with GENERATION obs ID |
+| Creating score config without checking existing | Call `GET /api/public/score-configs` first; configs can't be deleted |
+| Queue created before score configs | Create configs → collect IDs → create queue |
+| `--limit` > 100 on traces list | API hard cap; paginate with `--page` |
+| No rate limiting on queue item creation | `sleep 0.4` between calls to avoid 429 |
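The last two rows of the table (the 100-item page cap and the pause between queue item calls) can be sketched in Python as follows. This is an illustration under assumptions: `iter_pages` and `add_with_pause` are hypothetical helpers, `fetch_page` and `add_item` stand in for whatever CLI or HTTP call is actually used, and page numbering starting at 1 is assumed rather than confirmed by the source.

```python
import time
from typing import Callable, Iterable, Iterator, List

PAGE_LIMIT = 100     # per the table above: the traces list is capped at 100 per page
PAUSE_SECONDS = 0.4  # per the table above: pause between queue item calls


def iter_pages(fetch_page: Callable[[int, int], list],
               limit: int = PAGE_LIMIT) -> Iterator:
    """Yield items page by page until a short or empty page signals the end."""
    limit = min(limit, PAGE_LIMIT)  # never request more than the cap
    page = 1                        # assumption: pages are 1-indexed
    while True:
        items = fetch_page(page, limit)
        yield from items
        if len(items) < limit:
            return
        page += 1


def add_with_pause(items: Iterable, add_item: Callable,
                   pause: float = PAUSE_SECONDS,
                   sleep: Callable[[float], None] = time.sleep) -> List:
    """Call add_item for each item, sleeping between calls (not after the last)."""
    results = []
    for index, item in enumerate(items):
        if index:
            sleep(pause)
        results.append(add_item(item))
    return results
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.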

package/template/.agents/skills/integrations/langfuse/references/instrumentation.md ADDED

@@ -0,0 +1,134 @@
+---
+name: langfuse-observability
+description: Instrument LLM applications with Langfuse tracing. Use when setting up Langfuse, adding observability to LLM calls, or auditing existing instrumentation.
+---
+# Langfuse Observability
+Instrument LLM applications with Langfuse tracing, following best practices and tailored to your use case.
+## Workflow
+### 1. Assess Current State
+Check the project:
+- Is Langfuse SDK installed?
+- What LLM frameworks are used? (OpenAI SDK, LangChain, LlamaIndex, Vercel AI SDK, etc.)
+- Is there existing instrumentation?
+**No integration yet:** Set up Langfuse using a framework integration if available. Integrations capture more context automatically and require less code than manual instrumentation.
+**Integration exists:** Audit against baseline requirements below.
+### 2. Verify Baseline Requirements
+Every trace should have these fundamentals:
+| Requirement               | Check                                                                                    | Why                                                    |
+| ------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------ |
+| Model name                | Is the LLM model captured?                                                               | Enables model comparison and filtering                 |
+| Token usage               | Are input/output tokens tracked?                                                         | Enables automatic cost calculation                     |
+| Good trace names          | Are names descriptive? (`chat-response`, not `trace-1`)                                  | Makes traces findable and filterable                   |
+| Span hierarchy            | Are multi-step operations nested properly?                                               | Shows which step is slow or failing                    |
+| Correct observation types | Are generations marked as generations?                                                   | Enables model-specific analytics                       |
+| Sensitive data masked     | Is PII/confidential data excluded or masked?                                             | Prevents data leakage                                  |
+| Trace input/output        | Does the trace capture meaningful input/output? Is input explicitly set to show only relevant data (e.g., user message), not all function args? | Makes traces readable in the UI and avoids leaking sensitive args |
+Framework integrations (OpenAI, LangChain, etc.) handle model name, tokens, and observation types automatically. Prefer integrations over manual instrumentation.
+Docs: https://langfuse.com/docs/tracing
+### 3. Explore Traces First
+Once baseline instrumentation is working, encourage the user to explore their traces in the Langfuse UI before adding more context:
+"Your traces are now appearing in Langfuse. Take a look at a few of them—see what data is being captured, what's useful, and what's missing. This will help us decide what additional context to add."
+This helps the user:
+- Understand what they're already getting
+- Form opinions about what's missing
+- Ask better questions about what they need
+### 4. Discover Additional Context Needs
+Determine what additional instrumentation would be valuable. **Infer from code when possible, only ask when unclear.**
+**Infer from code:**
+| If you see in code...                                | Infer             | Suggest                   |
+| ---------------------------------------------------- | ----------------- | ------------------------- |
+| Conversation history, chat endpoints, message arrays | Multi-turn app    | `session_id`              |
+| User authentication, `user_id` variables             | User-aware app    | `user_id` on traces       |
+| Multiple distinct endpoints/features                 | Multi-feature app | `feature` tag             |
+| Customer/tenant identifiers                          | Multi-tenant app  | `customer_id` or tier tag |
+| Feedback collection, ratings                         | Has user feedback | Capture as scores         |
+**Only ask when not obvious from code:**
+- "How do you know when a response is good vs bad?" → Determines scoring approach
+- "What would you want to filter by in a dashboard?" → Surfaces non-obvious tags
+- "Are there different user segments you'd want to compare?" → Customer tiers, plans, etc.
+**Additions and their value:**
+| Addition            | Why                                         | Docs                                                |
+| ------------------- | ------------------------------------------- | --------------------------------------------------- |
+| `session_id`        | Groups conversations together               | https://langfuse.com/docs/tracing-features/sessions |
+| `user_id`           | Enables user filtering and cost attribution | https://langfuse.com/docs/tracing-features/users    |
+| User feedback score | Enables quality filtering and trends        | https://langfuse.com/docs/scores/overview           |
+| `feature` tag       | Per-feature analytics                       | https://langfuse.com/docs/tracing-features/tags     |
+| `customer_tier` tag | Cost/quality breakdown by segment           | https://langfuse.com/docs/tracing-features/tags     |
+These are NOT baseline requirements—only add what's relevant based on inference or user input.
+### 5. Guide to UI
+After adding context, point users to relevant UI features:
+- Traces view: See individual requests
+- Sessions view: See grouped conversations (if `session_id` was added)
+- Dashboard: Build filtered views using tags
+- Scores: Filter by quality metrics
+## Framework Integrations
+Prefer these over manual instrumentation:
+| Framework     | Integration            | Docs                                                 |
+| ------------- | ---------------------- | ---------------------------------------------------- |
+| OpenAI SDK    | Drop-in replacement    | https://langfuse.com/docs/integrations/openai        |
+| LangChain     | Callback handler       | https://langfuse.com/docs/integrations/langchain     |
+| LlamaIndex    | Callback handler       | https://langfuse.com/docs/integrations/llama-index   |
+| Vercel AI SDK | OpenTelemetry exporter | https://langfuse.com/docs/integrations/vercel-ai-sdk |
+| LiteLLM       | Callback or proxy      | https://langfuse.com/docs/integrations/litellm       |
+Full list: https://langfuse.com/docs/integrations
+## Always Explain Why
+When suggesting additions, explain the user benefit:
+```
+"I recommend adding session_id to your traces.
+Why: This groups messages from the same conversation together.
+You'll be able to see full conversation flows in the Sessions view,
+making it much easier to debug multi-turn interactions.
+Learn more: https://langfuse.com/docs/tracing-features/sessions"
+```
+## Common Mistakes
+| Mistake                                        | Problem                                             | Fix                                                                               |
+| ---------------------------------------------- | --------------------------------------------------- | --------------------------------------------------------------------------------- |
+| No `flush()` in scripts                        | Traces never sent                                   | Call `langfuse.flush()` before exit                                               |
+| Flat traces                                    | Can't see which step failed                         | Use nested spans for distinct steps                                               |
+| Generic trace names                            | Hard to filter                                      | Use descriptive names: `chat-response`, `doc-summary`                             |
+| Logging sensitive data                         | Data leakage risk                                   | Mask PII before tracing                                                           |
+| Not explicitly setting input with `@observe`   | All function args become trace input (including API keys, configs) | Python: use `langfuse.update_current_span(input=...)`. JS/TS: use `updateActiveObservation({ input: ... })`. Set only the relevant input (e.g., user message) |
+| Manual instrumentation when integration exists | More code, less context                             | Use framework integration                                                         |
+| Langfuse import before env vars loaded         | Langfuse initializes with missing/wrong credentials | Import Langfuse AFTER loading environment variables (e.g., after `load_dotenv()`) |
+| Wrong import order with OpenAI                 | Langfuse can't patch the OpenAI client              | Import Langfuse and call its setup BEFORE importing OpenAI client                 |
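The "Mask PII before tracing" and "set only the relevant input" fixes from the table can be illustrated with a small, SDK-independent sketch. The patterns and the helper names `mask` and `trace_input` are hypothetical and deliberately simplistic; real masking rules depend on the data the application handles. The resulting dictionary is what would then be passed as the explicit input to the tracing call.

```python
import re

# Illustrative patterns only; they are assumptions, not an exhaustive PII policy
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"\b(?:sk|pk)-[A-Za-z0-9_-]{8,}")


def mask(text: str) -> str:
    """Replace email addresses and key-like tokens with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def trace_input(user_message: str, **_ignored) -> dict:
    """Build the explicit trace input: only the masked user message.

    Any other keyword arguments (API keys, configs) are dropped on purpose.
    """
    return {"message": mask(user_message)}


if __name__ == "__main__":
    print(trace_input("Contact bob@example.com", api_key="sk-lf-12345678abcd"))
```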

package/template/.agents/skills/integrations/langfuse/references/judge-calibration.md ADDED

@@ -0,0 +1,288 @@
+---
+name: langfuse-judge-calibration
+description: Calibrate and validate LLM-as-a-Judge evaluators against dataset ground truth. Runs the judge prompt as a Langfuse dataset experiment, compares judge outputs with dataset item expected outputs, and reports simple accuracy or advanced confusion-matrix metrics. Use this guide whenever a user asks if their LLM judge is actually useful,
+  aligned with human judgment, or safe to trust for monitoring decisions.
+---
+# Judge Calibration (LLM-as-a-Judge)
+## Goal
+Validate judge outputs against human labels using the smallest reliable workflow
+for the user's goal.
+Default to a **Langfuse dataset experiment** when the user has a Langfuse
+dataset or wants results in the Langfuse Experiments UI.
+Default to **simple calibration** unless the user asks for deeper metrics,
+split-based validation, thresholding, or production automation.
+## 1) Choose the calibration mode
+### Simple calibration
+Use this when the user wants a quick answer like "does this judge basically
+match human labels?" or explicitly asks for accuracy only.
+- No train/dev/test split is required.
+- Compute `exact_match` for each valid row.
+- Report valid sample size, invalid-label count, accuracy, and a short
+  recommendation.
+- Do not include Precision/Recall/F1, TPR/TNR, denominator notes, or top failure
+  direction unless the user asks for advanced metrics.
+### Advanced calibration
+Use this when the user asks for confusion matrix metrics, thresholds,
+production monitoring, high-stakes automation, or train/test-style validation.
+- If split labels exist, keep train/dev/test separate.
+- If no split exists, compute metrics on the provided rows and state that this is
+  not a held-out final quality claim.
+- Compute TP/FP/FN/TN and derived metrics.
+- Use `references/error-analysis.md` for qualitative diagnosis of disagreements.
+## 2) Primary workflow
+1. Confirm the dataset name, ground-truth label location in `expectedOutput`,
+   judge prompt name/version, judge model, and label vocabulary.
+2. Choose simple or advanced mode. If ambiguous, use simple mode.
+3. Run the judge prompt against each dataset item input as a Langfuse experiment.
+4. Compare the judge output to `item.expected_output` in evaluator functions.
+5. Return the matching report format from section 7.
+## 3) Langfuse experiment workflow
+Use the SDK experiment runner as the default implementation. Running an experiment
+on a Langfuse-hosted dataset automatically creates a dataset run that can be
+inspected and compared in the Langfuse UI.
+Before implementing, you **must** retrieve the current experiment SDK
+documentation from the Langfuse docs. Do not rely on memory, because the SDK
+changes frequently. Fetch these pages (see SKILL.md section 2 for retrieval methods):
+- [Experiments via SDK](https://langfuse.com/docs/evaluation/experiments/experiments-via-sdk) — primary reference for `dataset.run_experiment`, task/evaluator signatures, and `Evaluation` return shape
+- [Datasets](https://langfuse.com/docs/evaluation/experiments/datasets) — dataset item structure (`input`, `expectedOutput`) and how to load a hosted dataset
+- [Experiments data model](https://langfuse.com/docs/evaluation/experiments/data-model) — how runs, items, and scores relate in the UI
+High-level shape of a simple-mode calibration experiment:
+```
+load dataset and judge prompt from Langfuse
+define POSITIVE / NEGATIVE label set
+task(item):
+    compile judge prompt with item.input
+    call judge model
+    return normalized label
+    # never read item.expected_output here — that would leak the answer
+item_evaluator(output, expected_output):
+    if either label is outside the allowed set:
+        return invalid (excluded from accuracy denominator)
+    return exact_match score (0 or 1)
+run_evaluator(item_results):
+    accuracy = matches / valid_rows
+    return aggregate score + invalid_label count
+dataset.run_experiment(task, [item_evaluator], [run_evaluator],
+                      metadata={calibration_mode, judge_prompt, labels})
+```
+For advanced calibration, the item evaluator emits one score per confusion-matrix
+cell (`judge-is-tp`, `judge-is-fp`, `judge-is-fn`, `judge-is-tn`), and the run
+evaluator aggregates them into Precision / Recall / F1 / TPR / TNR. See section
+6 for the metric definitions and zero-denominator guardrails.
+Rules:
+- Use a Langfuse-hosted dataset when the user wants a real Langfuse experiment.
+  Local SDK datasets create traces and scores, but not Langfuse dataset runs.
+- The dataset item `input` must contain everything needed to run the judge
+  prompt. The dataset item `expectedOutput` must contain the ground-truth label.
+- Never pass `expectedOutput` into the judge prompt or task. That would leak the
+  answer and invalidate calibration.
+- Return `Evaluation(...)` objects from item and run evaluators for stable SDK
+  formatting and score ingestion.
+- Store prompt name/version, judge model, label vocabulary, dataset version, and
+  calibration mode in run metadata.
+- Use a unique run name so the experiment appears as a separate dataset run.
+## 4) Label validation
+Never silently treat unknown labels as negative.
+- For binary judges, define the positive and negative labels before computing
+  confusion-matrix metrics.
+- Normalize only deterministic differences such as surrounding whitespace or
+  casing, and mention that normalization was applied.
+- Rows where `expected` or `actual` is outside the allowed label set are invalid.
+  Exclude them from metric denominators and report the invalid count.
+- For multi-class judges, use simple `exact_match` / accuracy unless the user
+  defines one positive class for binary metrics.
+Example binary labels:
+- Positive: `ESCALATE`
+- Negative: `RESOLVE`
+## 5) Simple metrics
+For each valid row:
+- `exact_match = 1 if actual == expected else 0`
+Aggregate:
+- `accuracy = sum(exact_match) / valid_rows`
+If `valid_rows == 0`, report that accuracy is undefined and ask for valid
+expected/actual labels.
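A minimal sketch of the simple-mode computation, assuming rows arrive as `(expected, actual)` string pairs. The function names are hypothetical, and whitespace/case normalization follows section 4. Accuracy is returned as `None` when no valid rows exist, matching the "undefined" case above.

```python
from typing import Dict, Iterable, Optional, Tuple


def normalize(label: str) -> str:
    """Apply only deterministic normalization: trim whitespace, uppercase."""
    return label.strip().upper()


def simple_accuracy(rows: Iterable[Tuple[str, str]],
                    allowed: Iterable[str]) -> Dict[str, Optional[float]]:
    """Compute exact-match accuracy over valid rows; count invalid rows separately."""
    allowed_set = {normalize(label) for label in allowed}
    valid = matches = invalid = 0
    for expected, actual in rows:
        e, a = normalize(expected), normalize(actual)
        if e not in allowed_set or a not in allowed_set:
            invalid += 1  # excluded from the accuracy denominator
            continue
        valid += 1
        matches += int(e == a)
    accuracy = matches / valid if valid else None  # undefined without valid rows
    return {"valid_rows": valid, "invalid_labels": invalid, "accuracy": accuracy}


if __name__ == "__main__":
    sample = [("ESCALATE", " escalate "), ("RESOLVE", "ESCALATE"), ("RESOLVE", "maybe")]
    print(simple_accuracy(sample, ["ESCALATE", "RESOLVE"]))
```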
+## 6) Advanced metrics
+### Dataset and split discipline
+Use split discipline only when the user asks for advanced validation or final
+quality claims.
+- **Train**: optional few-shot examples for the judge prompt
+- **Dev**: iterative prompt/model refinement
+- **Test**: single final calibration pass
+Do not tune on the same rows used to claim final quality. Use balanced classes
+in dev/test when possible so both error directions are measurable.
+### Per-row classification mapping
+For each row with `expected` and `actual` labels:
+- **TP**: expected = positive and actual = positive
+- **FP**: expected = negative and actual = positive
+- **FN**: expected = positive and actual = negative
+- **TN**: expected = negative and actual = negative
+Also compute:
+- `exact_match = 1 if actual == expected else 0`
+### Aggregate metrics
+From aggregate counts:
+- `accuracy = (TP + TN) / valid_rows`
+- `precision = TP / (TP + FP)`
+- `recall = TP / (TP + FN)`
+- `f1 = 2 * precision * recall / (precision + recall)`
+- `TPR = TP / (TP + FN)`
+- `TNR = TN / (TN + FP)`
+Guardrails:
+- if `TP + FP == 0`, precision is undefined (report null + note)
+- if `TP + FN == 0`, recall and TPR are undefined (report null + note)
+- if `TN + FP == 0`, TNR is undefined (report null + note)
+- if `precision + recall == 0`, set `f1 = 0`
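The formulas and guardrails above can be sketched as a single function over the four confusion-matrix counts. `aggregate_metrics` is a hypothetical name, and undefined metrics are returned as `None`. One choice here is an assumption not stated in the guide: F1 is reported as `None` when precision or recall is itself undefined.

```python
from typing import Dict, Optional


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def aggregate_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """Derive accuracy, precision, recall, F1, TPR, and TNR from raw counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    if precision is None or recall is None:
        f1 = None  # assumption: F1 is undefined if either input is undefined
    elif precision + recall == 0:
        f1 = 0.0   # guardrail from the list above
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "tpr": recall,  # TPR uses the same formula as recall
        "tnr": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    print(aggregate_metrics(tp=8, fp=2, fn=2, tn=8))
```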
+### Advanced quality gates
+Before trusting the judge on production traffic:
+1. **Split integrity**: no leakage from held-out rows into prompt examples.
+2. **Confusion matrix sanity**: `TP + FP + FN + TN == valid_rows`.
+3. **Metric recomputation check**: recompute aggregate stats from row-level
+   flags and compare.
+4. **TPR/TNR review**: inspect both directions for class-direction bias.
+5. **Threshold**: target `TPR > 0.90` and `TNR > 0.90` before high-stakes
+   automation.
+## 7) Report format
+### Simple report
+Return only:
+- dataset name and dataset run URL when available
+- valid rows / total rows
+- invalid-label count
+- accuracy
+- one-sentence recommendation
+### Advanced report
+Add:
+- confusion matrix: TP, FP, FN, TN
+- accuracy, precision, recall, F1, TPR, TNR
+- denominator notes for undefined metrics
+- top failure direction: false positives or false negatives
+- recommendation: ship, iterate, collect more labels, or do not automate
+## 8) Langfuse implementation notes
+Prefer SDK experiment evaluators for score creation. They attach item-level
+scores to the experiment traces and run-level scores to the dataset run.
+Use manual REST score creation only as a fallback when not using the SDK
+experiment runner, or for local smoke tests. See
+[Scores via SDK](https://langfuse.com/docs/evaluation/evaluation-methods/scores-via-sdk)
+and the [Scores API reference](https://langfuse.com/docs/api) (`POST /api/public/scores`)
+for the current payload shape. Do not use the current `langfuse-cli` score-create
+wrapper unless `--help` shows a usable `value` argument; `langfuse-cli@0.0.10`
+exposes `legacy-score-v1s create` but cannot pass the required score `value`.
+Score names to emit:
+Simple mode:
+- `judge-exact-match`
+- `judge-accuracy`
+Advanced mode:
+- `judge-exact-match`
+- `judge-is-tp`
+- `judge-is-fp`
+- `judge-is-fn`
+- `judge-is-tn`
+Recommended metadata:
+- `expected_label`, `actual_label`
+- calibration mode: `simple` or `advanced`
+- positive/negative labels when binary metrics are used
+- evaluator prompt name+version
+- dataset/split version when used
+- run identifier
+## 9) Classification logic (pseudo-code)
+The per-row classification each evaluator must perform:
+```
+normalize(label) = strip whitespace, uppercase
+ALLOWED = {POSITIVE, NEGATIVE}
+classify(expected, actual):
+    expected, actual = normalize(expected), normalize(actual)
+    if expected ∉ ALLOWED or actual ∉ ALLOWED:
+        mark row invalid → exclude from denominators
+        return  # skip the flags below for invalid rows
+    exact_match = (expected == actual)
+    is_tp = (expected == POSITIVE and actual == POSITIVE)
+    is_fp = (expected == NEGATIVE and actual == POSITIVE)
+    is_fn = (expected == POSITIVE and actual == NEGATIVE)
+    is_tn = (expected == NEGATIVE and actual == NEGATIVE)
+```
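A runnable Python rendering of the pseudo-code, offered as a sketch rather than the SDK evaluator itself. It uses the example labels from section 4 and the score names from section 8. The `classify` return shape (a dict of 0/1 flags, or `None` for an invalid row) is an assumption made for illustration.

```python
from typing import Dict, Optional

POSITIVE = "ESCALATE"  # example labels from section 4; replace with the real vocabulary
NEGATIVE = "RESOLVE"
ALLOWED = {POSITIVE, NEGATIVE}


def normalize(label: str) -> str:
    """Strip surrounding whitespace and uppercase the label."""
    return label.strip().upper()


def classify(expected: str, actual: str) -> Optional[Dict[str, int]]:
    """Return per-row 0/1 flags, or None when either label is invalid."""
    e, a = normalize(expected), normalize(actual)
    if e not in ALLOWED or a not in ALLOWED:
        return None  # invalid row: the caller excludes it from denominators
    return {
        "judge-exact-match": int(e == a),
        "judge-is-tp": int(e == POSITIVE and a == POSITIVE),
        "judge-is-fp": int(e == NEGATIVE and a == POSITIVE),
        "judge-is-fn": int(e == POSITIVE and a == NEGATIVE),
        "judge-is-tn": int(e == NEGATIVE and a == NEGATIVE),
    }


if __name__ == "__main__":
    print(classify("escalate", "ESCALATE "))
    print(classify("RESOLVE", "ESCALATE"))
    print(classify("unknown", "ESCALATE"))
```

Exactly one of the four cell flags is 1 for every valid row, which is what makes the `TP + FP + FN + TN == valid_rows` sanity check in section 6 hold.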
+## 10) Common failure modes
+- label vocabulary not constrained (judge outputs free text instead of strict
+  labels)
+- positive/negative label inversion between annotators and evaluator code
+- leaking `expectedOutput` into the judge task instead of only the evaluator
+- using local SDK data when the user expects a Langfuse dataset run in the UI
+- reporting only accuracy when classes are imbalanced and error direction matters
+- calculating F1 without explicit zero-denominator handling
+- making final quality claims without a held-out split
+## 11) What to do after calibration
+- If simple accuracy is enough: report it and stop.
+- If metrics are weak and advanced validation is needed: iterate prompt and
+  few-shots on dev data only.
+- If metrics pass: freeze the baseline and monitor drift over time.
+- For qualitative diagnosis of disagreements, switch to
+  `references/error-analysis.md`.