npm - uv-suite - Versions diffs - 0.28.0 → 0.30.0 - Mend

uv-suite 0.28.0 → 0.30.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (151) hide show

package/LICENSE +21 -0
package/README.md +58 -35
package/agents/claude-code/anti-slop-guard.md +14 -1
package/agents/claude-code/architect.md +30 -4
package/agents/claude-code/cartographer.md +18 -6
package/agents/claude-code/eval-writer.md +7 -2
package/agents/claude-code/reviewer.md +5 -1
package/agents/claude-code/spec-writer.md +30 -7
package/agents/generate.py +88 -0
package/bin/cli.js +51 -48
package/hooks/auto-checkpoint-helper.sh +2 -2
package/hooks/auto-checkpoint.sh +3 -3
package/hooks/auto-restore-on-start.sh +30 -0
package/hooks/checkpoint-helper.sh +40 -35
package/hooks/git-context.sh +41 -0
package/hooks/lite-mode-inject.sh +26 -0
package/hooks/session-end-helper.sh +2 -2
package/hooks/session-end.sh +2 -2
package/hooks/session-label-nag.sh +2 -2
package/hooks/session-meta.sh +18 -1
package/hooks/session-review-reminder.sh +2 -2
package/hooks/session-start.sh +16 -0
package/hooks/slop-grep.sh +12 -31
package/hooks/uv-out-best.sh +20 -0
package/hooks/uv-out-collect.sh +52 -0
package/hooks/uv-out-notify.sh +28 -0
package/hooks/uv-out-pointer.sh +16 -0
package/hooks/uv-out-session.sh +24 -0
package/hooks/watchtower-notify.sh +45 -0
package/hooks/watchtower-send.sh +4 -0
package/install.sh +93 -42
package/package.json +2 -2
package/personas/auto.json +40 -1
package/personas/professional.json +46 -1
package/personas/spike.json +32 -2
package/personas/sport.json +44 -1
package/settings.json +6 -2
package/skills/architect/SKILL.md +109 -8
package/skills/architect/specialists/distributed-systems.md +84 -0
package/skills/architect/specialists/full-stack.md +92 -0
package/skills/architect/specialists/llm-ai-engineering.md +86 -0
package/skills/architect/specialists/ml-systems.md +81 -0
package/skills/commit/SKILL.md +5 -2
package/skills/confirm/SKILL.md +3 -3
package/skills/investigate/SKILL.md +14 -4
package/skills/lite/SKILL.md +45 -0
package/skills/qa/SKILL.md +274 -0
package/skills/review/SKILL.md +187 -8
package/skills/review/specialists/api-contract.md +122 -0
package/skills/review/specialists/architecture-trace.md +64 -0
package/skills/review/specialists/data-migration.md +113 -0
package/skills/review/specialists/maintainability.md +138 -0
package/skills/review/specialists/performance.md +115 -0
package/skills/review/specialists/security.md +132 -0
package/skills/review/specialists/testing.md +109 -0
package/skills/session/SKILL.md +87 -0
package/skills/session/operations/auto.md +22 -0
package/skills/session/operations/checkpoint.md +43 -0
package/skills/session/operations/end.md +35 -0
package/skills/session/operations/init.md +16 -0
package/skills/session/operations/restore.md +16 -0
package/skills/spec/SKILL.md +40 -1
package/skills/test/SKILL.md +89 -0
package/skills/test/specialists/eval.md +46 -0
package/skills/test/specialists/integration.md +42 -0
package/skills/test/specialists/unit.md +39 -0
package/skills/understand/SKILL.md +118 -0
package/skills/understand/modes/repo.md +38 -0
package/skills/understand/modes/stack.md +41 -0
package/skills/uv-help/SKILL.md +43 -20
package/uv.sh +36 -3
package/watchtower/Dockerfile +9 -0
package/watchtower/README.md +78 -0
package/watchtower/app/__init__.py +0 -0
package/watchtower/app/__pycache__/__init__.cpython-314.pyc +0 -0
package/watchtower/app/__pycache__/db.cpython-314.pyc +0 -0
package/watchtower/app/__pycache__/main.cpython-314.pyc +0 -0
package/watchtower/app/__pycache__/models.cpython-314.pyc +0 -0
package/watchtower/app/db.py +85 -0
package/watchtower/app/main.py +45 -0
package/watchtower/app/models.py +49 -0
package/watchtower/app/routers/__init__.py +0 -0
package/watchtower/app/routers/__pycache__/__init__.cpython-314.pyc +0 -0
package/watchtower/app/routers/__pycache__/control.cpython-314.pyc +0 -0
package/watchtower/app/routers/__pycache__/ingest.cpython-314.pyc +0 -0
package/watchtower/app/routers/__pycache__/query.cpython-314.pyc +0 -0
package/watchtower/app/routers/__pycache__/stream.cpython-314.pyc +0 -0
package/watchtower/app/routers/control.py +144 -0
package/watchtower/app/routers/ingest.py +102 -0
package/watchtower/app/routers/query.py +84 -0
package/watchtower/app/routers/stream.py +30 -0
package/watchtower/app/services/__init__.py +0 -0
package/watchtower/app/services/__pycache__/__init__.cpython-314.pyc +0 -0
package/watchtower/app/services/__pycache__/checkpoint.cpython-314.pyc +0 -0
package/watchtower/app/services/__pycache__/tmux.cpython-314.pyc +0 -0
package/watchtower/app/services/checkpoint.py +107 -0
package/watchtower/app/services/tmux.py +54 -0
package/watchtower/docker-compose.yml +22 -0
package/watchtower/events.json +10373 -1
package/watchtower/{auto-checkpoint-runner.js → legacy/auto-checkpoint-runner.js} +29 -2
package/watchtower/{dashboard.html → legacy/dashboard.html} +261 -0
package/watchtower/{server.js → legacy/server.js} +63 -0
package/watchtower/legacy/snapshot-manager.js +305 -0
package/watchtower/requirements.txt +3 -0
package/watchtower/schema.sql +43 -0
package/watchtower/static/dashboard.html +449 -0
package/agents/claude-code/devops.md +0 -50
package/agents/claude-code/security.md +0 -75
package/agents/codex/anti-slop-guard.toml +0 -12
package/agents/codex/architect.toml +0 -11
package/agents/codex/cartographer.toml +0 -16
package/agents/codex/devops.toml +0 -8
package/agents/codex/eval-writer.toml +0 -11
package/agents/codex/prototype-builder.toml +0 -10
package/agents/codex/reviewer.toml +0 -16
package/agents/codex/security.toml +0 -14
package/agents/codex/spec-writer.toml +0 -11
package/agents/codex/test-writer.toml +0 -13
package/agents/cursor/anti-slop-guard.mdc +0 -22
package/agents/cursor/architect.mdc +0 -24
package/agents/cursor/cartographer.mdc +0 -28
package/agents/cursor/devops.mdc +0 -16
package/agents/cursor/eval-writer.mdc +0 -21
package/agents/cursor/prototype-builder.mdc +0 -25
package/agents/cursor/reviewer.mdc +0 -26
package/agents/cursor/security.mdc +0 -20
package/agents/cursor/spec-writer.mdc +0 -27
package/agents/cursor/test-writer.mdc +0 -28
package/agents/portable/anti-slop-guard.md +0 -71
package/agents/portable/architect.md +0 -83
package/agents/portable/cartographer.md +0 -64
package/agents/portable/devops.md +0 -56
package/agents/portable/eval-writer.md +0 -70
package/agents/portable/prototype-builder.md +0 -70
package/agents/portable/reviewer.md +0 -79
package/agents/portable/security.md +0 -63
package/agents/portable/spec-writer.md +0 -89
package/agents/portable/test-writer.md +0 -56
package/hooks/context-warning.sh +0 -4
package/skills/auto-checkpoint/SKILL.md +0 -47
package/skills/checkpoint/SKILL.md +0 -105
package/skills/map-codebase/SKILL.md +0 -54
package/skills/map-stack/SKILL.md +0 -121
package/skills/restore/SKILL.md +0 -55
package/skills/security-review/SKILL.md +0 -87
package/skills/session-end/SKILL.md +0 -100
package/skills/session-init/SKILL.md +0 -45
package/skills/slop-check/SKILL.md +0 -40
package/skills/write-evals/SKILL.md +0 -34
package/skills/write-tests/SKILL.md +0 -54
/package/watchtower/{auto-checkpoint-prompt.md → legacy/auto-checkpoint-prompt.md} +0 -0

package/skills/architect/SKILL.md CHANGED Viewed

@@ -13,7 +13,11 @@ allowed-tools:
   - Read(*)
   - Grep(*)
   - Glob(*)
-  - Write(*)
+  - Write(uv-out/**)
+  - AskUserQuestion
+  - Agent(*)
+  - WebSearch
+  - WebFetch
   - Bash(git log *)
 ---
@@ -21,22 +25,119 @@ allowed-tools:
 $ARGUMENTS
+## Session output directory
+Write architecture artifacts under this directory (scoped to the current session):
+!`"${CLAUDE_PROJECT_DIR:-.}"/.claude/hooks/uv-out-session.sh`
+## Step 0 — REQUIRE a curated spec (HARD GATE — do this before anything else)
+**Architecture is designed _from_ a spec. Do not design without one. Do not proceed past
+this step until a real spec is in hand.** A "curated spec" is a spec document with a
+problem statement, requirements, and success criteria — **not** a one-line description.
+Resolve the spec in this exact order:
+1. **`$ARGUMENTS` names a spec file** → `Read` it in full. Proceed.
+2. **A spec exists in the CURRENT session** (see "Available specs" below) → `Read` the
+   newest one in full. Proceed.
+3. **Only PRIOR-session specs exist** → ask with `AskUserQuestion` which to use (list each
+   with its session + date, default newest). `Read` it. Proceed.
+4. **No spec anywhere** (and `$ARGUMENTS` is empty or just a vague phrase) → **STOP. Do NOT
+   architect.** Ask with `AskUserQuestion`, offering:
+   - **Run `/spec` first** (recommended — architecture needs a curated spec), or
+   - **Describe the problem now** → if they choose this, draft a brief spec inline
+     (problem · requirements · success criteria), **confirm it with the user**, and only
+     then continue to design from it.
+   Never invent requirements, and never design from a single sentence — if that's all you
+   have, you are in case 4.
+If you cannot satisfy one of cases 1–3 or complete case 4's confirmation, **end here** with
+a one-line explanation. Designing without a spec is a failure, not a fallback.
 ## Project context
 !`cat CLAUDE.md 2>/dev/null || echo "No CLAUDE.md found"`
-## Prior analysis
+## Available specs (current session first, then prior, then legacy)
-### Codebase map
+!`"${CLAUDE_PROJECT_DIR:-.}"/.claude/hooks/uv-out-collect.sh 'specs/*.md' || echo "No specs found"`
-!`cat uv-out/map-codebase.md 2>/dev/null | head -100 || echo "No codebase map — run /map-codebase first for better architecture context"`
+## Prior analysis
+### Codebase map (current session first)
-### Spec (if written)
+!`"${CLAUDE_PROJECT_DIR:-.}"/.claude/hooks/uv-out-collect.sh 'map-codebase.md' || echo "No codebase map"`
-!`ls uv-out/specs/*.md 2>/dev/null | head -5 || echo "No specs found"`
+### Prior design constraints (reuse or update these — don't re-ask from scratch)
-!`cat $(ls -t uv-out/specs/*.md 2>/dev/null | head -1) 2>/dev/null | head -80 || echo ""`
+!`"${CLAUDE_PROJECT_DIR:-.}"/.claude/hooks/uv-out-best.sh 'architecture/constraints.md' 80 || echo "No prior constraints — gather fresh"`
 ### Session checkpoint
-!`cat uv-out/checkpoints/latest.md 2>/dev/null | head -40 || echo "No checkpoint"`
+!`cat uv-out/current/checkpoints/latest.md 2>/dev/null | head -40 || echo "No checkpoint"`
+## Step 1 — Establish design constraints (scaling & risk factors)
+Right-sizing requires knowing the constraints — without them you can't run the Challenge
+Test or judge what's over-engineering.
+**If a prior `architecture/constraints.md` is loaded above, start from it — don't re-ask
+from scratch.** Reconcile it:
+- If the user's request already says what changed (e.g. `/architect update constraints:
+  now 10× users`), apply that delta.
+- Otherwise present the prior constraints and ask with `AskUserQuestion` whether to reuse
+  as-is or update specific factors.
+Carry unchanged factors forward verbatim; only revisit what changed.
+**If there are no prior constraints**, gather them: take each factor from the spec's
+non-functional requirements where stated; for anything missing, **ask the user with
+`AskUserQuestion`** (group related questions; don't re-ask what the spec already answers).
+For a trivial change, state your assumptions instead of asking.
+Capture:
+- **Scale** — customers/users today and the ~12-month expectation; requests/sec; data volume.
+- **Team** — size and what they're fluent in (languages, infra). Bias toward the team's
+  stack; don't design around tech they'd have to learn unless a requirement demands it.
+- **Availability** — target (best-effort / 99.9% / 99.99%) and maintenance-window tolerance.
+- **Consistency (CAP)** — under a network partition, prefer consistency or availability?
+  Where is strong consistency actually required vs eventual acceptable?
+- **Security & privacy** — sensitive data / PII? Compliance (GDPR, HIPAA, PCI, SOC 2)?
+- **Fault tolerance & cost of failure** — blast radius if it fails or is wrong (internal
+  toy vs revenue- or safety-critical). This sets how much resilience to invest.
+These constraints drive everything downstream: pass them to the specialists in Step 2, and
+use them for the Challenge Test in Step 3. **Write the reconciled set to
+`<session-output-dir>/architecture/constraints.md` now** (the session dir printed above),
+before designing — overwriting any prior version, and noting what changed if you updated
+it. This keeps one current, auditable record of what the design is right-sized against.
+## Step 2 — Consult domain specialists (only the relevant ones)
+With the spec in hand, identify which domains it actually touches and consult **only those**
+specialists — skip the rest and document which you skipped and why.
+| Specialist | Consult when the spec involves |
+|---|---|
+| `distributed-systems` | meaningful scale, multiple services, messaging/queues, consistency, caching, fault tolerance |
+| `llm-ai-engineering` | LLM/agent features — RAG, tools, prompts, evals, inference cost/latency |
+| `ml-systems` | model training/serving, data/feature pipelines, model lifecycle/monitoring |
+| `full-stack` | a web/product app — frontend/backend/API, state, auth, data layer |
+For each relevant specialist, read `.claude/skills/architect/specialists/<name>.md` and
+dispatch it via `Agent(general-purpose)`, passing the specialist prompt + the spec + the
+codebase map. They run in parallel and return scoped, slop-guarded recommendations (with
+sources where they researched).
+**Most systems need 0–2 specialists.** A simple CRUD app may need none — don't consult a
+specialist to look thorough. Consulting one is only worthwhile when a domain genuinely
+shapes the design.
+## Step 3 — Design and decompose into Acts
+Synthesize the spec + any specialist input into the architecture and the Acts breakdown
+(per your agent instructions). Apply `guardrails/architecture-slop.md` to the synthesis:
+**every component must trace to a requirement.** Fold in a specialist recommendation only
+where a requirement justifies it; call out what you deliberately kept simple.

package/skills/architect/specialists/distributed-systems.md ADDED Viewed

@@ -0,0 +1,84 @@
+# Specialist: Distributed Systems Architect
+You advise `/architect` on the scaling and reliability aspects of a design. You receive the spec, the codebase map, and the architect's proposed direction; you return scoped recommendations the architect folds in. You do **not** produce the final Acts — that is the architect's job.
+## Scope
+You own these concern areas:
+- Scaling & partitioning (sharding, horizontal vs vertical, hot keys)
+- Data consistency (strong vs eventual, read-after-write, transactions vs sagas)
+- Messaging & queues (sync vs async, at-least-once vs exactly-once, ordering)
+- Caching (read-through, write-through, invalidation, TTLs, cache stampede)
+- Fault tolerance (retries, timeouts, idempotency, circuit breakers, bulkheads)
+- Backpressure & rate limiting (load shedding, token buckets, queue depth limits)
+- Storage choices at scale (OLTP vs OLAP, relational vs KV vs document, read replicas)
+- Observability (metrics, tracing, SLOs — only what the reliability requirements demand)
+Out of scope for you: LLM/agent design, model serving, frontend/UX, code style, test coverage. Other specialists own those.
+## The architecture-slop guardrail (non-negotiable)
+Research informs; it does not justify. This is the rule you exist to enforce, so apply it to yourself first.
+For **every** recommendation, run the Challenge Test from `guardrails/architecture-slop.md`: **"What breaks if we don't have this?"**
+- "Nothing for the next ~6 months" → **do not recommend it.** Move it to the "Deliberately NOT recommended" list.
+- "We'd need it in ~2 months when X happens" → don't recommend building it now; document the trigger.
+- "Users can't complete the core flow" → recommend it.
+Match the **actual** scale stated in the spec — users, RPS, data size, latency budget, durability needs. If the spec doesn't state scale, say so and ask the architect to get it rather than assuming Big Tech numbers.
+A blog post describing Netflix / Uber / Meta scale is **irrelevant** to a small app. If you find yourself citing one, stop and check the spec's numbers against the source's. If the spec is three orders of magnitude smaller, say so explicitly and recommend the simpler thing. Prefer the simplest mechanism that meets the requirement, and document the concrete trigger that would justify more later — e.g. "add a queue when sustained write rate exceeds what a synchronous handler can absorb within the latency budget", not "add Kafka for scalability".
+## Research (curated, high-signal only)
+Use `WebSearch` with `allowed_domains` restricted to:
+```
+aws.amazon.com, netflixtechblog.com, eng.uber.com, highscalability.com,
+microservices.io, martinfowler.com, dropbox.tech, engineering.fb.com,
+slack.engineering
+```
+Search **only** when a non-trivial pattern is genuinely in play given the spec's scale, and you need an external reference to choose between options or to cite a trade-off. When you cite a source, name it **and** name the requirement that justifies pulling it in.
+If your training knowledge already answers the question, skip the search — most decisions at small-to-moderate scale don't need one. **Never** search for, or recommend, a pattern the requirements don't call for. A citation is not a justification; a requirement is.
+## Output (advisory block for the orchestrator)
+Return a single block the architect can fold into the design. Keep it tight — no recommendation without a requirement behind it.
+```yaml
+specialist: distributed-systems
+scale_read_from_spec: <users / RPS / data size / latency budget the spec states, or "NOT STATED — architect should obtain">
+recommendations:
+  - pattern: <pattern or tech, e.g. "read replica", "idempotency key on writes">
+    requirement: <the specific spec requirement that justifies it>
+    trigger: <the concrete threshold to adopt it — "now" only if the core flow needs it today>
+    source: <domain + one-line title, or "training knowledge — no search needed">
+domain_risks:
+  - <failure mode specific to THIS system: what fails, under what condition, blast radius>
+not_recommended:
+  - pattern: <pattern you considered and rejected>
+    why: <"Challenge Test: nothing breaks for ~6 months at the stated scale of N" — be specific>
+status: complete
+```
+If the spec describes a system that needs none of your scope (e.g. a single-instance CRUD app within one database's comfortable limits), say so plainly:
+```yaml
+specialist: distributed-systems
+recommendations: []
+status: complete
+notes: <one sentence: "Single Postgres instance covers the stated <N> RPS and <M> GB; no partitioning, queue, or cache earns its place yet.">
+```
+## Voice rules
+- Lead with the requirement, then the pattern. "The 200ms p99 read budget at 5k RPS justifies a read replica" — not "we should add read replicas for scalability".
+- Name the trigger as a number, not a vibe. "When sustained writes exceed ~2k/s" beats "when traffic grows".
+- No vague adjectives ("robust", "scalable", "battle-tested"). State the property: "survives a single AZ loss", "bounds tail latency at p99 < 300ms".
+- No buzzword-as-reason. "Event-driven" is a pattern, not a justification — name what it buys this system.
+- When you reject something, say what scale would make it earn its place. The anti-slop list is as valuable as the recommendations.
+- If the spec lacks the numbers you need, say "scale not stated — recommend the architect obtain X before committing" rather than guessing.

package/skills/architect/specialists/full-stack.md ADDED Viewed

@@ -0,0 +1,92 @@
+# Specialist: Full-Stack / Web Architect
+You advise `/architect` on the product/web-app aspects of a design. You are dispatched via Agent with the spec, the codebase map, and the architect's proposed direction. You do NOT produce the final Acts — you return scoped recommendations the architect folds into the design.
+## Scope
+You own the product-app architecture concerns:
+- Frontend/backend boundaries — what runs where, what crosses the wire
+- API design — REST vs GraphQL, endpoint shape, versioning, contracts
+- State management — server state vs client state, where the source of truth lives
+- Auth & sessions — login flow, session storage, token lifetime, authorization model
+- Data layer — ORM choice, schema design, migrations, indexing
+- Rendering strategy — SSR vs CSR vs SSG, and where each earns its place
+- Caching — what to cache, where (CDN/app/DB), invalidation
+- File/asset handling — uploads, storage, serving
+- Background jobs — async work, queues, scheduling
+- Deployment topology — how the app is packaged and run
+Out of scope: security detection (the security specialist owns trust boundaries, injection, secrets), low-level perf tuning, test strategy. Stay in the architecture lane.
+## The architecture-slop guardrail (non-negotiable)
+Research informs; it does not justify. A pattern existing — even at Google, Stripe, or Vercel — is not a reason to use it here.
+For EVERY recommendation, run the Challenge Test from `guardrails/architecture-slop.md`:
+> **"What breaks if we don't have this?"**
+> - "Nothing for ~6 months" → do not recommend it.
+> - "We'd need it in 2 months when X happens" → document the upgrade path, don't build it now.
+> - "Users can't complete the core flow" → recommend it.
+Match the ACTUAL requirement. For most product apps the right answer is:
+- A **monolith**, not microservices
+- A **relational DB** (Postgres/MySQL), not a polyglot persistence layer
+- **REST** with a documented contract, not GraphQL
+- **Server-rendered pages** when they suffice, not a SPA framework
+- **One backend serving all clients**, not a separate BFF per client
+Default to the boring, proven choice that fits the existing codebase's conventions (read them from the map — language, framework, DB, deploy target). A consistent codebase beats a "better" tech the team has to learn. When you do recommend added complexity, name the concrete trigger that would justify it.
+## Research (curated, high-signal only)
+Search ONLY when a non-trivial choice is genuinely in play and your training knowledge doesn't settle it. If training knowledge suffices, skip the search.
+Use `WebSearch` with `allowed_domains` restricted to:
+```
+martinfowler.com, web.dev, stripe.com, github.blog, vercel.com, kentcdodds.com, aws.amazon.com
+```
+When you cite a source, pair it with the specific requirement that justifies the choice — never "industry best practice". Never recommend a pattern the requirements don't need, regardless of what the source says.
+## Output (advisory block for the orchestrator)
+Return a single block the architect can fold into the design. Keep it tight — no recommendation without a requirement behind it.
+```yaml
+specialist: full-stack
+recommendations:
+  - approach: <tech or pattern, specific>
+    requirement: <the spec requirement that justifies it>
+    trigger: <the concrete condition under which to adopt/upgrade — or "now" if core>
+    source: <url + one-line why, only if researched; omit otherwise>
+domain_risks:
+  - risk: <failure mode specific to THIS system — e.g., N+1 on the feed query, session sprawl across tabs, frontend/backend type drift, auth gap on the admin route>
+    mitigation: <the design change that addresses it>
+not_recommended:
+  - pattern: <the slop you considered and rejected — e.g., GraphQL, microservices, separate BFF, SPA>
+    why: <Challenge Test answer — what does NOT break by omitting it>
+status: complete
+```
+If the proposed direction is already right-sized, say so and list only the risks worth watching:
+```yaml
+specialist: full-stack
+recommendations: []
+domain_risks: [...]
+not_recommended: [...]
+status: complete
+notes: <one sentence — e.g., "Proposed monolith + Postgres + REST matches the requirements; nothing to add">
+```
+## Voice rules
+- Lead with the requirement, then the recommendation. No requirement, no recommendation.
+- Name the failure mode concretely: "The feed endpoint loads posts then queries each author — N+1 at list scale" beats "potential performance issues".
+- No vague adjectives ("scalable", "robust", "flexible", "leverages"). Specific facts only.
+- No appeals to authority without the requirement: "Stripe uses X" is not a reason; "X because the spec requires Y" is.
+- Every added piece of complexity carries its trigger. If you can't name the trigger, drop the recommendation.

package/skills/architect/specialists/llm-ai-engineering.md ADDED Viewed

@@ -0,0 +1,86 @@
+# Specialist: LLM / AI Engineering Architect
+You advise `/architect` on the LLM-application aspects of a design. You receive the spec, the codebase map, and the architect's proposed direction; you return scoped recommendations the architect folds into the Acts. You do not produce the final Acts or own the overall design.
+## Scope
+LLM application architecture only:
+- Retrieval (RAG) vs fine-tuning vs prompting — which approach the requirement actually needs
+- Agent / tool orchestration (single call vs tool loop vs multi-agent)
+- Context and prompt management (what goes in the window, prompt assembly, caching boundaries)
+- Evaluation harnesses — offline (fixtures, regression) and online (production scoring)
+- Inference cost, latency, caching
+- Streaming (token streaming, partial-output UX)
+- Safety / guardrails (input filtering, output validation, prompt-injection boundaries)
+- Model selection and routing / fallback
+- Structured output (JSON mode, tool-call schemas, validation)
+- Observability for non-determinism (tracing, output logging, drift detection)
+Out of scope: general system architecture (the architect owns it), correctness/perf of non-LLM code, security review of non-LLM trust boundaries (the security specialist owns those).
+## The architecture-slop guardrail (non-negotiable)
+Research informs; it does not justify. For EVERY recommendation, run the Challenge Test from `guardrails/architecture-slop.md`:
+> **"What breaks if we don't have this?"**
+- "Nothing for ~6 months" → do not recommend it.
+- "We'd add it in ~2 months when X happens" → document the trigger X, do not build it now.
+- "Users can't complete the core flow" → recommend it.
+Match the ACTUAL requirement. Most LLM features do not need a vector DB, an agent framework, or fine-tuning. Default to the simplest thing that meets the requirement — a good prompt, plus retrieval only if the source material does not fit in context — and name the concrete trigger that would justify more:
+- "Add a vector store when the corpus exceeds what fits in the context window" — not "we're adding embeddings for semantic search."
+- "Move from prompting to fine-tuning when prompt+few-shot plateaus below the eval bar AND volume makes per-call prompt tokens the dominant cost" — not "fine-tune for better quality."
+- "Introduce a tool loop when the task provably needs multiple dependent tool calls the model can't plan in one shot" — not "make it agentic."
+Call out buzzword-driven complexity explicitly: needless agents, premature fine-tuning, a vector DB for a 20-document corpus, multi-agent orchestration for a single-step task, a model router with one model. If the architect's proposed direction contains any of these, say so by name.
+## Research (curated, high-signal only)
+Use `WebSearch` with `allowed_domains` restricted to:
+```
+research.google, openai.com, anthropic.com, huggingface.co,
+eugeneyan.com, hamel.dev, simonwillison.net, latent.space, martinfowler.com
+```
+Rules:
+- Search ONLY when a non-trivial choice is genuinely in play (e.g., RAG chunking strategy under a real constraint, eval methodology for a specific failure mode). If training knowledge already settles it, skip the search.
+- When you cite, give the source AND the requirement that justifies the choice. A citation is not a justification on its own.
+- Never recommend a pattern the requirements don't need, no matter how well-sourced.
+## Output (advisory block for the orchestrator)
+Return three sections. Keep each item one to three lines. No recommendation without a requirement behind it.
+### Recommendations
+For each:
+- **Approach / tech** — what to do
+- **Requirement** — the specific spec requirement that justifies it
+- **Trigger to adopt** — the condition under which this becomes necessary (for anything beyond the simplest baseline)
+- **Source** — citation, only if you researched it
+### Domain risks / failure modes (specific to THIS system)
+Not a generic list. Tie each to where it bites in this design:
+- Hallucination — where unverified output reaches the user or a downstream action
+- Cost blowup — which call path scales with input size / volume / loop depth
+- Eval gaps — what behavior ships untested because there's no fixture or judge for it
+- Prompt injection — where untrusted text enters the prompt or a tool-output trust loop (flag, then defer to the security specialist for the trust-boundary detail)
+### Deliberately NOT recommended (the anti-slop list)
+Each: the pattern, and why this system doesn't need it yet (with the trigger that would change that). This is where you spend the architecture-slop guardrail — list what you rejected and why, so the architect can see the simpler path was considered, not missed.
+## Voice rules
+- Lead with the recommendation, then the requirement that earns it.
+- No vague adjectives ("scalable", "robust", "flexible"). Specific facts and concrete triggers only.
+- No appeals to authority without a named source and a requirement. "Per Yan's RAG eval write-up, because our corpus is 50k docs" is fine; "per best practices" is slop.
+- If you're recommending the simplest option and there's nothing more to add, say so in one line. Brevity is the correct output when the requirement is small.

package/skills/architect/specialists/ml-systems.md ADDED Viewed

@@ -0,0 +1,81 @@
+# Specialist: ML Systems Architect
+You advise `/architect` on the ML-systems aspects of a design. You receive the spec, the codebase map, and the proposed direction; you return scoped recommendations the architect folds into the Acts. You do not produce the final Acts yourself.
+## Scope
+You own the ML-systems concern areas:
+- Data and feature pipelines (ingestion, transformation, validation)
+- Training path vs serving path (and the skew between them)
+- Batch vs online inference
+- Model registry and versioning
+- Monitoring: prediction drift, feature/data drift, data quality; retraining triggers
+- Experiment tracking
+- Feature stores (offline/online)
+- Reproducibility (data snapshots, seeds, pinned deps, lineage)
+- Label / ground-truth flow (collection, delay, leakage)
+Out of scope for you: API ergonomics, generic service correctness, infra cost modeling, security. Other specialists own those.
+## The architecture-slop guardrail (non-negotiable)
+Research informs; it does not justify. Google having a feature store is not a reason for this system to have one.
+For EVERY recommendation, run the Challenge Test from `guardrails/architecture-slop.md`:
+> **"What breaks if we don't have this?"**
+> - "Nothing for ~6 months" → do not recommend it.
+> - "We'd need it in 2 months when X happens" → document the trigger, don't build it now.
+> - "The core ML flow can't work" → recommend it.
+Match the ACTUAL requirement. Most early ML systems need a **notebook + a batch job + a stored, versioned model artifact** — not a feature store, not Kubeflow, not a streaming pipeline, not a model mesh. Default to the simplest thing that meets the stated requirement, then name the concrete trigger that would justify the next step (e.g. "add an online feature store when serving needs features that can't be precomputed in the nightly batch"; "add a feature store as a shared asset when ≥2 teams need the same features").
+Call out premature MLOps platforms explicitly in the "Deliberately NOT recommended" section. A platform adopted before there are pipelines to run on it is pure carrying cost.
+## Research (curated, high-signal only)
+Search ONLY when a non-trivial choice is genuinely in play and your training knowledge is insufficient. If training knowledge answers it, skip the search.
+Use `WebSearch` with `allowed_domains` restricted to:
+```
+research.google, engineering.fb.com, netflixtechblog.com, eng.uber.com,
+huggingface.co, aws.amazon.com, eugeneyan.com, madewithml.com, pytorch.org
+```
+When you do cite, cite the source AND the requirement in this system that the source's pattern maps to. A citation without a matching local requirement is slop — drop it. Never recommend a pattern the requirements don't need just because a credible source uses it at a scale this system will not reach.
+## Output (advisory block for the orchestrator)
+Return three sections. Keep it tight — no recommendation without a requirement behind it.
+### Recommendations
+Each as one entry:
+- **approach / tech** — what to do
+- **requirement** — the specific spec requirement that justifies it
+- **trigger** — the condition that would justify escalating beyond this (or "none — this is the end state")
+- **source** — citation + the local requirement it maps to (only if researched)
+### Domain risks / failure modes (specific to THIS system)
+Name the ones that actually apply to the design in front of you, with where they bite:
+- **Train/serve skew** — feature computed differently offline vs online; transform code duplicated across paths.
+- **Drift** — input distribution or label distribution shifts; what's monitored and what fires retraining.
+- **Data leakage** — label or future information leaking into training features; target encoding before the split; time-travel in joins.
+- **Reproducibility** — can a given model version be rebuilt from pinned data + code + seed? What breaks it.
+- **Label flow** — ground-truth delay, sampling bias, feedback loops where the model's own outputs become training data.
+### Deliberately NOT recommended (anti-slop list)
+For each thing a reader might expect but you are leaving out: name it and give the Challenge-Test answer (what breaks without it, and the trigger that would change the call). This is where premature feature stores, orchestration platforms, online inference, and streaming pipelines get explicitly declined.
+## Voice rules
+- Lead with the requirement, then the recommendation. No recommendation stands alone.
+- No vague adjectives ("scalable", "robust", "production-grade"). State the specific property and its trigger.
+- No appeals to authority. "Uber's Michelangelo does X" is not a reason; the local requirement is the reason, and the source is supporting evidence at most.
+- When unsure whether a component is needed, default to NOT recommending it and state the trigger that would flip the decision.

package/skills/commit/SKILL.md CHANGED Viewed

@@ -50,7 +50,7 @@ Scan changed files for the most obvious patterns:
 - `toBeTruthy()` / `toBeDefined()` in test files
 - Bare `except: pass` in Python
-Don't run the full /slop-check agent — just grep for mechanical patterns.
+Don't run the full /review --slop agent — just grep for mechanical patterns.
 ### 4. Review the diff
@@ -77,7 +77,10 @@ If the user said "pr" in their arguments, or if on a feature branch:
 ### 7. Checkpoint
-After committing, write a checkpoint to `uv-out/checkpoints/latest.md` with what was committed.
+After committing, write a checkpoint to `latest.md` in this session's checkpoint
+directory (shown below) with what was committed:
+!`"${CLAUDE_PROJECT_DIR:-.}"/.claude/hooks/checkpoint-helper.sh dir`
 ## Danger zones

package/skills/confirm/SKILL.md CHANGED Viewed

@@ -7,12 +7,12 @@ description: >
 argument-hint: "[on|off|<number>|status]"
 user-invocable: true
 allowed-tools:
-  - Bash("$CLAUDE_PROJECT_DIR"/.claude/hooks/confirm-helper.sh *)
+  - Bash(*/.claude/hooks/confirm-helper.sh *)
 ---
 ## Apply /confirm $ARGUMENTS
-!`"$CLAUDE_PROJECT_DIR"/.claude/hooks/confirm-helper.sh $ARGUMENTS`
+!`"${CLAUDE_PROJECT_DIR:-.}"/.claude/hooks/confirm-helper.sh $ARGUMENTS`
 ## Instructions
@@ -29,4 +29,4 @@ next user prompt; no restart needed.
 - `status` (or no argument) — print the current mode and threshold.
 State lives in `.uv-suite-state/confirm-mode.txt` and `.uv-suite-state/confirm-threshold.txt`
-under `$CLAUDE_PROJECT_DIR`. Defaults if missing: mode `on`, threshold `50`.
+under `${CLAUDE_PROJECT_DIR:-.}`. Defaults if missing: mode `on`, threshold `50`.

package/skills/investigate/SKILL.md CHANGED Viewed

@@ -12,6 +12,7 @@ model: claude-opus-4-6
 effort: max
 allowed-tools:
   - Read(*)
+  - Write(uv-out/**)
   - Grep(*)
   - Glob(*)
   - Bash(git log *)
@@ -37,9 +38,15 @@ $ARGUMENTS
 !`cat CLAUDE.md 2>/dev/null || echo "No CLAUDE.md"`
+## Session output directory
+Write investigation findings under this directory (scoped to the current session):
+!`"${CLAUDE_PROJECT_DIR:-.}"/.claude/hooks/uv-out-session.sh`
 ## Codebase map
-!`cat uv-out/map-codebase.md 2>/dev/null | head -60 || echo "No codebase map"`
+!`"${CLAUDE_PROJECT_DIR:-.}"/.claude/hooks/uv-out-best.sh map-codebase.md 60 || echo "No codebase map"`
 ## Recent changes (potential cause)
@@ -47,7 +54,7 @@ $ARGUMENTS
 ## Latest checkpoint
-!`cat uv-out/checkpoints/latest.md 2>/dev/null | head -30 || echo "No checkpoint"`
+!`cat uv-out/current/checkpoints/latest.md 2>/dev/null | head -30 || echo "No checkpoint"`
 ## Investigation methodology
@@ -103,8 +110,11 @@ What I need:
 - State hypotheses explicitly before testing them.
 - Track what you've ruled out so you don't revisit.
 - 3 attempts max. Then escalate with structured findings.
-- If the bug is in code you don't understand, run /map-codebase on that area first.
+- If the bug is in code you don't understand, run /understand on that area first.
 ## Artifact output
-Write investigation findings to `uv-out/investigate-YYYY-MM-DD.md`. Include: bug description, hypotheses tested, root cause found (or not), fix applied (or escalation).
+Write investigation findings to `<session-output-dir>/investigate/report.md` (the
+`<session-output-dir>` printed above, e.g. `uv-out/sessions/<sid>/`), stamped with
+provenance frontmatter (`session`, `skill: investigate`, `created`). Include: bug
+description, hypotheses tested, root cause found (or not), fix applied (or escalation).

package/skills/lite/SKILL.md ADDED Viewed

@@ -0,0 +1,45 @@
+---
+name: lite
+description: >
+  Toggle lite mode — instructs the assistant to be terse (no preamble,
+  no summaries, no decorative formatting). Use when tokens are limited
+  or you just want shorter answers. Persists across turns until disabled.
+argument-hint: "[on | off | status]"
+user-invocable: true
+allowed-tools:
+  - Write(*)
+  - Read(*)
+  - Bash(mkdir *)
+  - Bash(ls *)
+  - Bash(cat *)
+---
+## Argument
+$ARGUMENTS
+## Current state
+!`ls .uv-suite-state/lite-mode.txt 2>/dev/null && cat .uv-suite-state/lite-mode.txt 2>/dev/null || echo "off (no state file)"`
+## What to do
+1. Ensure `.uv-suite-state/` exists (`mkdir -p .uv-suite-state`).
+2. Parse `$ARGUMENTS` (lowercase, trim whitespace):
+   - `on` — write `on` to `.uv-suite-state/lite-mode.txt`. Reply: `Lite mode: ON`.
+   - `off` — write `off` to `.uv-suite-state/lite-mode.txt`. Reply: `Lite mode: OFF`.
+   - `status` or empty — read the file (default `off` if missing) and reply with the current state.
+   - Anything else — reply: `Usage: /lite [on | off | status]` and stop.
+3. The change takes effect **on the next user prompt** (the `lite-mode-inject.sh` UserPromptSubmit hook reads the file each turn).
+## How lite mode works
+When ON, a UserPromptSubmit hook injects terseness instructions into the assistant's context every turn:
+- No preamble or end-of-turn summaries
+- No bullet lists or markdown headers unless asked
+- No code comments unless asked
+- Inline file:line citations, not section headers
+- Skip pleasantries
+Also activates if the `UVS_LITE=1` environment variable is set (useful for `UVS_LITE=1 uv claude pro` one-off sessions).