@brunosps00/dev-workflow 0.13.0 → 1.0.0
- package/README.md +106 -122
- package/lib/constants.js +16 -36
- package/lib/migrate-skills.js +11 -4
- package/lib/removed-commands.js +30 -0
- package/package.json +1 -1
- package/scaffold/en/agent-instructions.md +27 -16
- package/scaffold/en/commands/dw-adr.md +2 -2
- package/scaffold/en/commands/dw-analyze-project.md +7 -7
- package/scaffold/en/commands/dw-autopilot.md +20 -20
- package/scaffold/en/commands/dw-brainstorm.md +160 -9
- package/scaffold/en/commands/dw-bugfix.md +7 -6
- package/scaffold/en/commands/dw-commit.md +1 -1
- package/scaffold/en/commands/dw-dockerize.md +9 -9
- package/scaffold/en/commands/dw-find-skills.md +4 -4
- package/scaffold/en/commands/dw-functional-doc.md +2 -2
- package/scaffold/en/commands/dw-generate-pr.md +4 -4
- package/scaffold/en/commands/dw-help.md +95 -351
- package/scaffold/en/commands/dw-intel.md +76 -12
- package/scaffold/en/commands/dw-new-project.md +9 -9
- package/scaffold/en/commands/dw-plan.md +175 -0
- package/scaffold/en/commands/dw-qa.md +166 -0
- package/scaffold/en/commands/dw-redesign-ui.md +7 -7
- package/scaffold/en/commands/dw-review.md +198 -0
- package/scaffold/en/commands/dw-run.md +176 -0
- package/scaffold/en/commands/dw-secure-audit.md +222 -0
- package/scaffold/en/commands/dw-update.md +1 -1
- package/scaffold/en/references/playwright-patterns.md +1 -1
- package/scaffold/en/references/refactoring-catalog.md +1 -1
- package/scaffold/en/templates/brainstorm-matrix.md +1 -1
- package/scaffold/en/templates/idea-onepager.md +3 -3
- package/scaffold/en/templates/project-onepager.md +5 -5
- package/scaffold/pt-br/agent-instructions.md +27 -16
- package/scaffold/pt-br/commands/dw-adr.md +2 -2
- package/scaffold/pt-br/commands/dw-analyze-project.md +7 -7
- package/scaffold/pt-br/commands/dw-autopilot.md +20 -20
- package/scaffold/pt-br/commands/dw-brainstorm.md +160 -9
- package/scaffold/pt-br/commands/dw-bugfix.md +10 -9
- package/scaffold/pt-br/commands/dw-commit.md +1 -1
- package/scaffold/pt-br/commands/dw-dockerize.md +9 -9
- package/scaffold/pt-br/commands/dw-find-skills.md +4 -4
- package/scaffold/pt-br/commands/dw-functional-doc.md +2 -2
- package/scaffold/pt-br/commands/dw-generate-pr.md +4 -4
- package/scaffold/pt-br/commands/dw-help.md +97 -300
- package/scaffold/pt-br/commands/dw-intel.md +77 -13
- package/scaffold/pt-br/commands/dw-new-project.md +9 -9
- package/scaffold/pt-br/commands/dw-plan.md +175 -0
- package/scaffold/pt-br/commands/dw-qa.md +166 -0
- package/scaffold/pt-br/commands/dw-redesign-ui.md +7 -7
- package/scaffold/pt-br/commands/dw-review.md +198 -0
- package/scaffold/pt-br/commands/dw-run.md +176 -0
- package/scaffold/pt-br/commands/dw-secure-audit.md +222 -0
- package/scaffold/pt-br/commands/dw-update.md +1 -1
- package/scaffold/pt-br/references/playwright-patterns.md +1 -1
- package/scaffold/pt-br/references/refactoring-catalog.md +1 -1
- package/scaffold/pt-br/templates/brainstorm-matrix.md +1 -1
- package/scaffold/pt-br/templates/idea-onepager.md +3 -3
- package/scaffold/pt-br/templates/project-onepager.md +5 -5
- package/scaffold/pt-br/templates/tasks-template.md +1 -1
- package/scaffold/skills/api-testing-recipes/SKILL.md +6 -6
- package/scaffold/skills/api-testing-recipes/references/auth-patterns.md +1 -1
- package/scaffold/skills/api-testing-recipes/references/matrix-conventions.md +1 -1
- package/scaffold/skills/api-testing-recipes/references/openapi-driven.md +3 -3
- package/scaffold/skills/docker-compose-recipes/SKILL.md +1 -1
- package/scaffold/skills/dw-codebase-intel/SKILL.md +9 -9
- package/scaffold/skills/dw-codebase-intel/agents/intel-updater.md +4 -4
- package/scaffold/skills/dw-codebase-intel/references/api-design-discipline.md +1 -1
- package/scaffold/skills/dw-codebase-intel/references/incremental-update.md +5 -5
- package/scaffold/skills/dw-codebase-intel/references/intel-format.md +1 -1
- package/scaffold/skills/dw-codebase-intel/references/query-patterns.md +3 -3
- package/scaffold/skills/dw-council/SKILL.md +2 -2
- package/scaffold/skills/dw-debug-protocol/SKILL.md +5 -3
- package/scaffold/skills/dw-execute-phase/SKILL.md +16 -16
- package/scaffold/skills/dw-execute-phase/agents/executor.md +5 -5
- package/scaffold/skills/dw-execute-phase/agents/plan-checker.md +4 -4
- package/scaffold/skills/dw-execute-phase/references/atomic-commits.md +1 -1
- package/scaffold/skills/dw-execute-phase/references/plan-verification.md +2 -2
- package/scaffold/skills/dw-execute-phase/references/wave-coordination.md +1 -1
- package/scaffold/skills/dw-git-discipline/SKILL.md +5 -2
- package/scaffold/skills/dw-incident-response/SKILL.md +168 -0
- package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
- package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
- package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
- package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
- package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
- package/scaffold/skills/dw-llm-eval/SKILL.md +150 -0
- package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
- package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
- package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
- package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
- package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
- package/scaffold/skills/dw-memory/SKILL.md +2 -2
- package/scaffold/skills/dw-review-rigor/SKILL.md +5 -5
- package/scaffold/skills/dw-simplification/SKILL.md +4 -4
- package/scaffold/skills/dw-source-grounding/SKILL.md +1 -1
- package/scaffold/skills/dw-testing-discipline/SKILL.md +103 -78
- package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
- package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +7 -7
- package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
- package/scaffold/skills/dw-testing-discipline/references/flaky-discipline.md +3 -3
- package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +1 -1
- package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +3 -3
- package/scaffold/skills/dw-ui-discipline/SKILL.md +103 -79
- package/scaffold/skills/dw-ui-discipline/references/accessibility-floor.md +2 -2
- package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
- package/scaffold/skills/dw-ui-discipline/references/state-matrix.md +1 -1
- package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
- package/scaffold/skills/dw-verify/SKILL.md +4 -4
- package/scaffold/skills/humanizer/SKILL.md +1 -7
- package/scaffold/skills/remotion-best-practices/SKILL.md +3 -1
- package/scaffold/skills/security-review/SKILL.md +1 -1
- package/scaffold/skills/security-review/languages/csharp.md +1 -1
- package/scaffold/skills/security-review/languages/rust.md +1 -1
- package/scaffold/skills/security-review/languages/typescript.md +1 -1
- package/scaffold/skills/vercel-react-best-practices/SKILL.md +3 -1
- package/scaffold/templates-overrides-readme.md +3 -3
- package/scaffold/en/commands/dw-code-review.md +0 -385
- package/scaffold/en/commands/dw-create-prd.md +0 -148
- package/scaffold/en/commands/dw-create-tasks.md +0 -195
- package/scaffold/en/commands/dw-create-techspec.md +0 -210
- package/scaffold/en/commands/dw-deep-research.md +0 -418
- package/scaffold/en/commands/dw-deps-audit.md +0 -327
- package/scaffold/en/commands/dw-fix-qa.md +0 -152
- package/scaffold/en/commands/dw-map-codebase.md +0 -125
- package/scaffold/en/commands/dw-refactoring-analysis.md +0 -340
- package/scaffold/en/commands/dw-revert-task.md +0 -114
- package/scaffold/en/commands/dw-review-implementation.md +0 -349
- package/scaffold/en/commands/dw-run-plan.md +0 -300
- package/scaffold/en/commands/dw-run-qa.md +0 -496
- package/scaffold/en/commands/dw-run-task.md +0 -209
- package/scaffold/en/commands/dw-security-check.md +0 -271
- package/scaffold/pt-br/commands/dw-code-review.md +0 -365
- package/scaffold/pt-br/commands/dw-create-prd.md +0 -148
- package/scaffold/pt-br/commands/dw-create-tasks.md +0 -195
- package/scaffold/pt-br/commands/dw-create-techspec.md +0 -208
- package/scaffold/pt-br/commands/dw-deep-research.md +0 -172
- package/scaffold/pt-br/commands/dw-deps-audit.md +0 -327
- package/scaffold/pt-br/commands/dw-fix-qa.md +0 -152
- package/scaffold/pt-br/commands/dw-map-codebase.md +0 -125
- package/scaffold/pt-br/commands/dw-refactoring-analysis.md +0 -340
- package/scaffold/pt-br/commands/dw-revert-task.md +0 -114
- package/scaffold/pt-br/commands/dw-review-implementation.md +0 -337
- package/scaffold/pt-br/commands/dw-run-plan.md +0 -296
- package/scaffold/pt-br/commands/dw-run-qa.md +0 -494
- package/scaffold/pt-br/commands/dw-run-task.md +0 -208
- package/scaffold/pt-br/commands/dw-security-check.md +0 -271
- package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
- package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
- package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162

package/scaffold/skills/dw-llm-eval/references/rag-metrics.md
@@ -0,0 +1,186 @@
# RAG evaluation — three orthogonal metrics

Retrieval-augmented generation (RAG) has three failure modes, each requiring its own metric. Measure all three; measuring only one creates blind spots.

## The three metrics

### 1. Retrieval precision@k

**What it measures:** of the top-K chunks retrieved, how many were RELEVANT to the user's query?

**How to compute:**

```python
def precision_at_k(retrieved_chunk_ids, relevant_chunk_ids, k=5):
    # Fraction of the top-k retrieved chunks that appear in the labeled relevant set.
    top_k = retrieved_chunk_ids[:k]
    relevant_in_top_k = sum(1 for cid in top_k if cid in relevant_chunk_ids)
    return relevant_in_top_k / k
```

**Reference data needed:** for each test case, the human-labeled set of "chunks that should have been retrieved" — the ground truth.

**Target:** depends on K. For k=5, target precision >0.6 (3 of 5 chunks relevant). For k=10, target >0.5.

**What it catches:** retrieval bringing back junk — chunk embeddings are wrong, the index is stale, or query rewriting is broken.

**What it misses:** the LLM may still produce a great answer even from imperfect retrieval — or a hallucinated answer despite perfect retrieval. Pair with metrics #2 and #3.

### 2. Answer faithfulness

**What it measures:** does the answer make claims that are SUPPORTED by the retrieved context? Or does it fabricate?

**How to compute (rung-4 LLM-as-judge with rubric):**

The judge sees: user question + retrieved context + generated answer. It scores 1-5 per the faithfulness rubric (see `judge-calibration.md` for an example).

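A minimal sketch of the judge call. `call_llm` is a stand-in for whatever chat-completion client you use, and the prompt is an abbreviated stand-in for the real rubric — treat both as assumptions, not part of this package:

```python
FAITHFULNESS_PROMPT = """\
You are grading a RAG answer for faithfulness.
Question: {query}
Context: {context}
Answer: {answer}

Score 1-5: 5 = every claim in the answer is supported by the context,
1 = the answer is mostly unsupported. Reply with the number only."""

def llm_judge_faithfulness(query, context, answer):
    # `context` is the list of retrieved chunks saved from the run;
    # chunks are assumed to expose a `.text` attribute.
    context_text = "\n\n".join(chunk.text for chunk in context)
    reply = call_llm(FAITHFULNESS_PROMPT.format(
        query=query, context=context_text, answer=answer))
    return int(reply.strip())  # 1-5 per the rubric
```
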
**Reference data needed:** the retrieved context (saved from the run) and the answer. No ground-truth answer required — the judge checks claim-by-claim against the context.

**Target:** 80% of cases score ≥4 on the 1-5 scale.

**What it catches:** hallucination — the answer says things the context didn't support. This is the #1 failure mode in production RAG.

**What it misses:** the answer might be faithful to the retrieved context, but the retrieved context itself might be WRONG. Pair with metric #1.

### 3. Context utilization

**What it measures:** did the answer USE the retrieved context, or ignore it and produce a generic, parametric-memory response?

**How to compute (heuristic + LLM-as-judge hybrid):**

Heuristic part — n-gram overlap or semantic similarity:
```python
def ngrams(text, n):
    # Word n-grams; swap in a real tokenizer if you have one.
    tokens = text.lower().split()
    return zip(*(tokens[i:] for i in range(n)))

def context_overlap(answer, context, n=3):
    answer_ngrams = set(ngrams(answer, n))
    context_ngrams = set(ngrams(context, n))
    if not answer_ngrams:
        return 0.0
    return len(answer_ngrams & context_ngrams) / len(answer_ngrams)
```

Judge part — ask if the answer would change materially without the context:
> "If the retrieved context were removed, would the answer be substantially different? 1 = same as without context (didn't use it), 5 = fully context-grounded."

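The utilization judge has the same shape as the faithfulness judge with a different prompt — again a sketch, with `call_llm` as a stand-in:

```python
def llm_judge_utilization(query, context, answer):
    context_text = "\n\n".join(chunk.text for chunk in context)
    prompt = (
        f"Question: {query}\nContext: {context_text}\nAnswer: {answer}\n\n"
        "If the retrieved context were removed, would the answer be "
        "substantially different? Score 1-5: 1 = same as without context "
        "(didn't use it), 5 = fully context-grounded. Reply with the number only."
    )
    return int(call_llm(prompt).strip())
```
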
**Target:** 70%+ overlap on substantive answers; judge score ≥4 on 80% of cases.

**What it catches:** the answer is faithful to the context (metric #2 passes) but ignores it — the model answered from parametric memory instead. That means retrieval is doing nothing.

**What it misses:** the answer might use the context but cite it incorrectly. Pair with metric #2.

## Why all three are needed

| Metric | Detects | Misses |
|--------|---------|--------|
| Retrieval precision@k | Junk in retrieval | Faithfulness; utilization |
| Answer faithfulness | Hallucination | Retrieval quality; whether context was used |
| Context utilization | Ignoring retrieval | Hallucination beyond context; retrieval quality |

A RAG system can fail in all three ways independently. Measuring only one creates blind spots in the other two.

## Combined metric example

```python
def evaluate_rag(case):
    retrieved = retrieve(case.query)
    answer = generate(case.query, retrieved)
    # context_overlap expects text, not chunk objects
    context_text = "\n\n".join(c.text for c in retrieved)

    return {
        'precision_at_5': precision_at_k(
            [c.id for c in retrieved],
            case.relevant_chunk_ids,
            k=5
        ),
        'faithfulness': llm_judge_faithfulness(
            query=case.query,
            context=retrieved,
            answer=answer
        ),
        'context_utilization_overlap': context_overlap(answer, context_text),
        'context_utilization_judge': llm_judge_utilization(
            query=case.query,
            context=retrieved,
            answer=answer
        ),
    }
```

Aggregate per-case scores into the per-run summary:

```
Run 2026-05-12:
  precision@5: 0.68 (target >0.6) ✓
  faithfulness ≥4: 83% (target >80%) ✓
  context utilization: 72% (target >70%) ✓
Overall: PASS
```

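One way to produce that summary from per-case results — a sketch where `results` is the list of dicts returned by `evaluate_rag` over all cases, and the thresholds mirror the targets above:

```python
def summarize_run(results):
    n = len(results)
    checks = [
        ('precision@5', sum(r['precision_at_5'] for r in results) / n, 0.6),
        ('faithfulness ≥4', sum(r['faithfulness'] >= 4 for r in results) / n, 0.8),
        ('context utilization', sum(r['context_utilization_overlap'] for r in results) / n, 0.7),
    ]
    for name, value, target in checks:
        print(f"  {name}: {value:.2f} (target >{target})", '✓' if value > target else '✗')
    print('Overall:', 'PASS' if all(v > t for _, v, t in checks) else 'FAIL')
```
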
## Common RAG failure modes

| Symptom | Likely metric that catches it |
|---------|------------------------------|
| User says "the bot is making stuff up" | Faithfulness |
| User says "the bot didn't see my documents" | Context utilization (or retrieval precision) |
| User says "the bot is bad at finding things" | Retrieval precision@k |
| User says "the answer is correct but ignores recent updates" | Retrieval recall (precision's partner — a different metric) |
| User says "the bot gives the same generic answer no matter what I ask" | Context utilization |
| User says "the bot says the doc says X but it doesn't" | Faithfulness |

The metric points at the layer to fix. Without it, debugging is guesswork.

## Retrieval recall (the fourth metric, conditional)

Precision asks "of what we retrieved, how much was good?" Recall asks "of what was good, how much did we retrieve?"

In production RAG with many candidate chunks, recall is often the limiting factor — the right chunk exists in the index but doesn't surface.

Compute:
```python
def recall_at_k(retrieved_chunk_ids, relevant_chunk_ids, k=5):
    # Fraction of the labeled relevant chunks that made it into the top-k.
    if not relevant_chunk_ids:
        return 1.0  # vacuously perfect: nothing needed retrieving
    top_k = set(retrieved_chunk_ids[:k])
    return len(top_k & set(relevant_chunk_ids)) / len(relevant_chunk_ids)
```

Track recall when:
- The corpus is large (>1000 chunks per query domain).
- Users report "the bot can't find things that exist in our docs."
- You're tuning the retrieval pipeline (chunking strategy, embedding model, search algorithm).

Skip recall when:
- The corpus is small (top-K covers ~10% of the corpus; recall is high by default).
- Precision is the dominant problem.

## Dataset structure for RAG

```json
{
  "id": "rag-case-001",
  "query": "What's our PTO policy for sabbatical years?",
  "expected": {
    "relevant_chunk_ids": ["chunk-policy-pto-2024", "chunk-policy-sabbatical"],
    "expected_answer_themes": ["accrual rate", "carryover limits", "sabbatical exception"],
    "should_cite": true
  },
  "metadata": {
    "source": "production-2026-04-12-support-thread-S-892",
    "difficulty": "medium",
    "tags": ["pto-policy", "sabbatical", "rare-query"]
  }
}
```

The `relevant_chunk_ids` field requires human labeling — a domain expert reviews the corpus and identifies which chunks SHOULD surface for that query.

## Anti-patterns

- **Measuring only one metric** (usually faithfulness via LLM-as-judge) → blind to retrieval and utilization failures.
- **No human-labeled relevance** → can't compute precision/recall.
- **Treating retrieval and generation as one black box** → can't tell which layer regressed.
- **Eval set drawn only from "easy" queries** → metrics look good in test, terrible in production.
- **Ignoring recent-information bias** (RAG must actually use retrieval; parametric memory is stale) → the context-utilization metric catches this.

## Tooling

- **ragas** (open source) implements precision, recall, faithfulness, and other RAG metrics with LLM judges. Use it as a reference implementation.
- **Custom implementation** is straightforward — each of the metrics above is <100 lines of Python.
- **LangSmith / Weights & Biases** wrap eval runs with tracking but don't replace the core metrics.

The discipline isn't tool choice; it's measuring all three orthogonal dimensions on every run.

package/scaffold/skills/dw-llm-eval/references/reference-dataset.md
@@ -0,0 +1,190 @@
# Reference dataset — 20 from failures beats 200 perfect

The dataset is the bedrock. Without one, every "improvement" is anecdote and every regression goes unnoticed until users complain. With one, you can measure change.

## The 20-from-failures principle

> 20 unambiguous cases drawn from real production failures beat 200 perfect synthetic cases.

Why:
- Synthetic cases reflect what the team IMAGINED would happen — they cover the cases the team already knows about.
- Production failures reflect what ACTUALLY happens — they cover blind spots and edge cases.
- 20 well-curated cases at the right level of difficulty discriminate between models better than 200 average-difficulty cases.

A 20-case dataset is enough to:
- Detect regressions of >10% accuracy (with 20 cases, one case is 5 percentage points).
- Calibrate an LLM-as-judge.
- Run cheaply enough to evaluate on every PR.

Scale up only once you've validated that 20 cases produce useful signal.

## Where cases come from

In rough order of priority:

1. **Real production failures.** A user reports a bug, support escalates a case, error logs show an unexpected output. Each becomes a case. Sanitize PII before saving.
2. **Edge cases discovered during development.** "Oh, what if the user asks X?" — add it.
3. **Adversarial examples.** Inputs designed to trip the system (prompt injection, ambiguity, contradiction). Especially important for chat/RAG.
4. **Boundary inputs.** Empty, very long, special characters, mixed languages, unusual encodings.
5. **Synthetic — last resort.** Only when no real input exists yet (e.g., pre-launch). Mark them as synthetic; replace them as production data arrives.

Target distribution: **80% real production-sourced**, 20% adversarial/boundary. Pure-synthetic datasets give pure-synthetic confidence.

## Case structure

Each case is one line in `cases.jsonl`:

```json
{
  "id": "case-001",
  "input": {
    "user_message": "I want to cancel my subscription",
    "user_context": { "tier": "premium", "tenure_months": 18 }
  },
  "expected": {
    "intent": "cancellation_request",
    "should_offer_retention": true,
    "tone_targets": ["empathetic", "non-manipulative"]
  },
  "rubric_criteria": ["faithfulness", "tone", "completeness"],
  "metadata": {
    "source": "production-2026-04-12-ticket-T-1234",
    "added_at": "2026-04-15",
    "added_by": "@bruno",
    "difficulty": "medium",
    "tags": ["cancellation", "retention", "premium-user"]
  }
}
```

Fields:

- **`id`** — stable identifier (you'll reference it in regression reports).
- **`input`** — what the system receives. Match the production input shape exactly.
- **`expected`** — for rungs 1-3, the deterministic expected output or state change. For rung 4, the rubric target (what a 5/5 answer would do).
- **`rubric_criteria`** — which rubric dimensions this case exercises.
- **`metadata.source`** — provenance. Production ticket? Synthetic? Adversarial?
- **`metadata.difficulty`** — easy/medium/hard. Track scores by difficulty bucket.
- **`metadata.tags`** — for filtering ("show me cases that exercise retention logic").

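A minimal loader that enforces this shape before a run — field names follow the structure above; treat the validation rules as a starting point, not part of this package:

```python
import json

REQUIRED_FIELDS = ("id", "input", "expected", "metadata")

def load_cases(path="cases.jsonl"):
    cases = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            if not line.strip():
                continue
            case = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in case]
            if missing:
                raise ValueError(f"{path}:{line_no} missing fields: {missing}")
            cases.append(case)
    return cases
```
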
## Dataset layout

```
.dw/eval/datasets/<feature-name>/
├── README.md             # provenance, sample size, last review, change log
├── cases.jsonl           # the cases themselves
├── rubric.md             # the rubric used for rung-4 scoring
├── runs/
│   ├── 2026-05-01.jsonl  # one line per case with scores from that run
│   ├── 2026-05-08.jsonl
│   └── ...
├── calibration/
│   ├── 2026-05-12-human-scores.jsonl
│   └── spearman-2026-05-12.txt
└── changelog.md          # when cases were added/removed and why
```

Everything is committed. Datasets evolve with the feature; the git history shows when and why.

## README.md template

```markdown
# Reference dataset — <feature name>

**Purpose:** evaluate <feature> for <quality dimensions>.

**Current size:** N cases (X production-sourced, Y adversarial, Z synthetic).

**Difficulty distribution:** easy: A, medium: B, hard: C.

**Last reviewed:** YYYY-MM-DD.

**Maintainers:** @name1, @name2.

## When to expand

Add a case when:
- A new production failure is observed (always — that's the primary signal).
- A new edge case is identified during development.
- A new adversarial pattern is discovered (security review, red-team session).

Do NOT add cases just to inflate the count. The 20-from-failures principle: quality over quantity.

## When to retire a case

Retire (don't delete) when:
- The behavior the case checked is no longer relevant (feature removed).
- The case became trivially passing across all model versions (it no longer discriminates).

Move retired cases to `cases-retired.jsonl` with a `retired_reason`.
```

## Adding cases from production

Process:

1. **Capture the failure** — paste the actual input that failed into `cases-pending.jsonl`.
2. **Sanitize PII** — replace names, emails, IDs, and account numbers with realistic-but-fake equivalents. NEVER commit real user data.
3. **Define expected behavior** — what SHOULD have happened. Get sign-off from a domain expert if it's subjective.
4. **Categorize** — difficulty, tags, rubric criteria.
5. **Promote** to `cases.jsonl` after review.

## Sampling for regression runs

You don't need to re-run the entire dataset every time. Smart sampling (a sketch follows the list):

- **PR-time:** a random 30% sample + all "hard"-difficulty cases + any case added in the last 30 days. Fast feedback.
- **Pre-merge to main:** full dataset.
- **Nightly:** full dataset + judge re-calibration check.
- **Pre-deploy:** full dataset + a manual eyeball of 10 random outputs.

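A sketch of the PR-time sample — the 30% ratio and 30-day window mirror the list above, and field names follow the case structure:

```python
import random
from datetime import date, timedelta

def pr_time_sample(cases, ratio=0.3, recent_days=30):
    cutoff = (date.today() - timedelta(days=recent_days)).isoformat()
    hard = [c for c in cases if c["metadata"]["difficulty"] == "hard"]
    recent = [c for c in cases if c["metadata"]["added_at"] >= cutoff]
    rest = [c for c in cases if c not in hard and c not in recent]
    sampled = random.sample(rest, k=round(len(rest) * ratio))
    # de-dup by id: a case can be both hard and recently added
    picked = {c["id"]: c for c in hard + recent + sampled}
    return list(picked.values())
```
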
## Detecting drift

After each run, compare against the prior run on the SAME cases:

```
Run 2026-05-08 vs 2026-05-01:
  faithfulness: 4.2 → 3.9 (-0.3) ⚠ regression
  completeness: 4.0 → 4.1 (+0.1)
  tone: 4.5 → 4.4 (-0.1)
  outcome accuracy: 95% → 92% ⚠ regression
```

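A sketch of the comparison, assuming each run file is JSONL with one row per case carrying a `case_id` and per-criterion scores (the criteria names mirror the example above):

```python
import json

def load_run(path):
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return {row["case_id"]: row for row in rows}

def compare_runs(prev_path, curr_path, criteria=("faithfulness", "completeness", "tone")):
    prev, curr = load_run(prev_path), load_run(curr_path)
    shared = sorted(prev.keys() & curr.keys())  # compare the SAME cases only
    if not shared:
        raise ValueError("no overlapping case_ids between runs")
    for crit in criteria:
        before = sum(prev[c][crit] for c in shared) / len(shared)
        after = sum(curr[c][crit] for c in shared) / len(shared)
        delta = after - before
        flag = " ⚠ regression" if delta <= -0.2 else ""
        print(f"  {crit}: {before:.1f} → {after:.1f} ({delta:+.1f}){flag}")
```
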
Drift happens in two ways:
1. **A code change degraded quality** — your model swap or prompt tweak hurt something. Bisect.
2. **Judge drift** — the LLM-as-judge itself changed (the vendor updated the model). Re-calibrate; the "regression" may be in the judge, not the system.

## Dataset versioning

When the dataset materially changes (cases added/removed in batch, rubric updated), bump a version in the README:

```
Dataset version: 2.3
- v2.3 (2026-05-12): added 8 cases from production tickets in the last 30 days
- v2.2 (2026-04-15): retired 3 cases that became trivially passing
- v2.1 (2026-03-20): rubric updated to add "completeness" criterion
```

Each run logs which dataset version it ran against. You can't compare a v1 score to a v3 score directly.

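For example, a header row at the top of each run file (the shape is illustrative, not part of this package):

```json
{"run": "2026-05-12", "dataset_version": "2.3", "cases": 28}
```
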
## Cost discipline

- Keep the dataset SMALL on purpose: 20-50 cases for most features; 100+ only if the feature has many categorically different inputs.
- Cheap evaluations (rungs 1-3) run on every case, every time.
- Expensive evaluations (rung 4) run on samples — 20-50 random cases — except for full pre-deploy runs.

## Anti-patterns

- **Synthetic-only dataset.** No connection to real production; the confidence isn't real.
- **A dataset that grew to 500 cases nobody re-reads.** Half are duplicates; half are no longer discriminating. Audit and prune.
- **Cases without expected behavior.** "Just look at the output." No measurement possible.
- **Dataset not committed.** It lives in a notebook — ephemeral, lost when the person leaves.
- **No metadata tracking source.** Can't tell synthetic from real; can't audit dataset quality.
- **Dataset reused across features.** Each feature has its own dataset; one-size-fits-all is one-size-fits-none.

## Cross-reference

- `oracle-ladder.md` — what to assert per case.
- `judge-calibration.md` — how to make rung-4 judgments meaningful.
- `rag-metrics.md` — RAG-specific extras for the dataset structure.
- `agent-eval.md` — agent-specific extras (trajectory matching).

package/scaffold/skills/dw-memory/SKILL.md
@@ -153,8 +153,8 @@ When flagged for compaction, apply inline:
 
 ## Integration With Other dev-workflow Commands
 
-- `/dw-run
-- `/dw-run
+- `/dw-run` — reads memory before coding; updates `<N>_memory.md` during; runs promotion test + updates `MEMORY.md` at the end.
+- `/dw-run` — runs promotion + compaction between tasks, so each task starts with clean shared state.
 - `/dw-autopilot` — threads memory through every phase (brainstorm → PRD → techspec → tasks → execution); on re-invocation reads `MEMORY.md` first to reconstitute cross-session context.
 
 Callers should mention this skill in their "Skills Complementares" section.

package/scaffold/skills/dw-review-rigor/SKILL.md
@@ -14,7 +14,7 @@ A set of rules the caller applies while producing a review report. This skill do
 
 ## When Invoked
 
-By `/dw-code-
+By `/dw-review --code-only`, `/dw-review --coverage-only`, or `/dw-brainstorm --refactor`. The caller has already identified a scope (files, a PR, a codebase area). This skill governs how findings are selected, deduplicated, ordered, and phrased.
 
 ## Required Inputs
 
@@ -122,9 +122,9 @@ The caller emits:
 
 ## Integration With Other dev-workflow Commands
 
-- `/dw-code-
-- `/dw-review-
-- `/dw-
+- `/dw-review --code-only` — applies all five rules to its Level-3 review output; uses prior reports in `.dw/spec/*/reviews/` to dedupe across rounds.
+- `/dw-review --coverage-only` — applies de-dup + severity ordering when listing gaps between PRD requirements and code.
+- `/dw-brainstorm --refactor` — applies rules 1, 2, 4, and 5 when cataloging code smells (rule 3 adapts: a "smell" with a justifying ADR becomes at most a `low` finding).
 
 Callers should mention this skill in their "Skills Complementares" section.
 
@@ -134,6 +134,6 @@ Ported from Compozy's `cy-review-round` skill (`/tmp/compozy/.agents/skills/cy-r
 
 - No `reviews-NNN/` directory convention — dev-workflow reviews already persist in `.dw/spec/*/reviews/` per command's existing contract.
 - The five rules are extracted here so three different dev-workflow review commands can share the discipline without duplicating it.
-- No issue-file frontmatter (Compozy uses it to interoperate with its remediation engine; dev-workflow's remediation is manual or via `/dw-fix
+- No issue-file frontmatter (Compozy uses it to interoperate with its remediation engine; dev-workflow's remediation is manual or via `/dw-qa --fix`).
 
 Credit: Compozy project (https://github.com/compozy/compozy).

package/scaffold/skills/dw-simplification/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: dw-simplification
-description:
+description: Use when simplifying code. Chesterton's Fence (understand WHY first), behavior-preserving refactor, complexity metrics. Triggers from /dw-review --code-only and /dw-brainstorm --refactor.
 allowed-tools:
 - Read
 - Edit

@@ -17,10 +17,10 @@ Behavioral discipline for simplifying code without breaking it. The trap of refa
 
 Read this skill when:
 
-- `/dw-code-
-- `/dw-
+- `/dw-review --code-only` flagged a complexity issue (deep nesting, long function, duplication).
+- `/dw-brainstorm --refactor` proposed a simplification target.
 - The user explicitly asks to "clean this up" / "simplify X".
-- During `/dw-run
+- During `/dw-run`, if the implementation accidentally produced complex code that warrants pre-commit cleanup.
 
 Do NOT use when:
 
package/scaffold/skills/dw-source-grounding/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: dw-source-grounding
-description:
+description: "Use when citing frameworks or libraries. Detect → Fetch → Implement → Cite with [source: url, version, retrieved]. Triggers on every framework decision in techspec, deps audit, research."
 allowed-tools:
 - Read
 - Bash