npm - @amityco/social-plus-vise - Versions diffs - 0.8.1 → 0.12.2 - Mend

@amityco/social-plus-vise 0.8.1 → 0.12.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

package/CHANGELOG.md +207 -0
package/README.md +107 -40
package/dist/capabilities.js +447 -0
package/dist/outcomes.js +463 -5
package/dist/server.js +115 -3
package/dist/tools/ast.js +25 -0
package/dist/tools/compliance.js +88 -20
package/dist/tools/debug.js +267 -0
package/dist/tools/design.js +1496 -0
package/dist/tools/docs.js +9 -4
package/dist/tools/harness.js +17 -1
package/dist/tools/integration.js +83 -7
package/dist/tools/project.js +872 -67
package/dist/tools/sdkVersion.js +129 -0
package/dist/types.js +4 -0
package/package.json +27 -6
package/rules/auth.yaml +298 -38
package/rules/comments.yaml +0 -72
package/rules/feed.yaml +1151 -12
package/rules/live-data.yaml +316 -36
package/rules/push.yaml +140 -0
package/rules/sdk-lifecycle.yaml +1428 -138
package/rules/security.yaml +60 -0
package/skills/social-plus-vise/SKILL.md +98 -55
package/skills/social-plus-vise/reference/debugging.md +39 -0
package/skills/social-plus-vise/reference/operations.md +59 -0
package/skills/vise-harness-engineer/SKILL.md +35 -0
package/social.plus-vise.png +0 -0

package/CHANGELOG.md ADDED

@@ -0,0 +1,207 @@
+# Changelog
+All notable changes to `@amityco/social-plus-vise` are documented in this file.
+The format is loosely based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## 0.12.2 — 2026-06-02
+**Maintenance / hygiene release.** No functional change from `0.12.1` — identical rules, validators, and CLI. This release exists to scrub an anonymized customer name from the bundled `CHANGELOG`; `0.12.0` and `0.12.1` (which contained it) were unpublished from npm. Use `0.12.2`.
+## 0.12.1 — 2026-06-02
+**Theme:** False-positive-frontier sweep — cross-platform FP hardening driven by the benchmark factory, plus compiler/surface ground-truth for grading. Patch release: all changes are false-positive corrections to existing rules (no new rules, no new features, no breaking changes). TP detection held at **290/290** across all of the below; every fix is locked with a both-direction regression fixture (`test/run-native-idioms.mjs`).
+### Fixed
+- **iOS/Android/Flutter push rules** no longer fire on a host app's own generic push (APNs/FCM). `push.unregister.present` and `push.payload-contract-respected` now require a social.plus push **registration nexus** (`registerPushNotification`), not the OS primitive (`didRegisterForRemoteNotifications` / a bare `onMessageReceived`). *(brownfield-iOS cell)*
+- **Ban-state** recognizes a `currentUserIsBanned` prop guard — the ban state fetched in a parent and passed down as a prop (a camelCase boolean form the marker missed). *(brownfield-Flutter cell)*
+- **`custom-post-type.dataType-declared`** no longer flags a plain text post (`createPost({ data: { text } })`) as a custom-data post. *(weak-model-feed cell)*
+- **Android brownfield cluster** — `dependency.sdk` scans feature-module Gradle files (not just `app/`); `setup.lifecycle` no longer flags application-scoped setup in a Hilt `@Module`; `auth.no-anonymous-write` recognizes a write gated on `getCurrentUserId()`. *(brownfield-Android cell)*
+- **React Native** — `client.region` recognizes a positionally-passed region arg (`createClient(KEY, AMITY_REGION)`); `logout-on-user-switch` no longer treats a React `setCurrentUser` useState setter as an SDK user-switch (now keys on the real `setActiveUser`). *(react-native-feed cell)*
+### Benchmark & quality infrastructure (`bench/`)
+- **Grading ground truth** (`bench/ground-truth.mjs`) — runs the real toolchain (`flutter analyze`, `tsc --noEmit`) where available; on compiler-less Android, flags `AmityType.member` calls whose member isn't a real surface member. Internal grading aid, **not** a shipped rule (Vise deliberately does not type-check).
+- **Scope-aware finding collection** — `bench/collect.mjs` takes an outcome and excludes out-of-scope rules (the same filter `vise check` uses), so a scoped build isn't graded against the validate-setup everything-sweep.
+- **Marker-boundary audit** (`bench/audit-marker-boundaries.mjs`) — surfaces `\bIDENT\b` markers that would miss a longer camelCase SDK symbol. An occasional eyeball aid, deliberately **not** a CI gate.
+### Docs
+- **ARCHITECTURE.md** — new "Sources of Truth" section (facts vs. opinion, across the SDK / docs / Vise repos), with a pointer from the docs repo's `.docs-ops/README`.
+---
+## 0.12.0 — 2026-06-01
+**Theme:** Beyond correctness — design conformance, feature completeness, and a self-improving false-positive loop. This release folds in everything since `0.11.0` (which shipped without notes): a full design-contract system, an advisory completeness catalog, three new integration outcomes, SDK-version currency, a large cross-platform false-positive hardening pass (corpus **265 → 300 rules**), and the benchmark factory that now drives FP reduction. All changes are backward-compatible additions or corrections — no breaking CLI changes.
+### Added
+- **Design-contract system** — `vise design check` (advisory, non-blocking) verifies a build against a design contract, and `vise design preview` renders a visual conformance report. The contract can be **extracted from an HTML/CSS prototype** (graded extractor) or **derived from the host project's own design system** via `--from-project` (CSS custom properties, Android XML, Flutter named-params, iOS `.colorset` + Swift color extensions). The contract is fed forward into the plan and skill, its digest is recorded in the compliance flow, and undefined CSS-var references are flagged as token-hygiene issues.
+- **Advisory completeness catalog** — feature-completeness nudges with recorded scope opt-outs (the agent can decline a capability with a logged reason; completeness never gates). The capability catalog was deepened to the full social.plus SDK surface, adding stories, events, livestream rooms, pinned posts, post impressions, and message edit/delete.
+- **Three new outcomes** — `add-community` (first-class outcome and the template for future outcomes), `add-follow` (social graph), and `add-notifications` (in-app notification tray).
+- **SDK-version currency** — `vise plan` now surfaces advisory guidance when a project pins an older SDK than the latest published, resolved from the npm registry (TS `@amityco/ts-sdk`, React Native `@amityco/ts-sdk-react-native`; native platforms version-agnostic). Bounded by a tight registry timeout with a graceful offline fallback.
+- **Acme-derived rules** — comment-creation rule + routing fix, mention-rendering guidance, parent/child post-type handling, and several new `add-feed` plan steps, all from real-world gap analysis.
+### Changed
+- **Large cross-platform false-positive hardening pass** across iOS (v8), Android, Flutter, and TypeScript/React Native — the markers now recognize the idiomatic native forms agents actually write: ban-state via `isGlobalBanned`/`isGlobalBan` and agent-derived ban booleans; service/repository (non-UI) layers skipped for ban-state/role-gated/flag-count rules; region via idiomatic TS (`API_REGIONS`) and Flutter (`httpEndpoint`/`AmityRegion`) forms; live collections via Android `LiveData`, Jetpack Compose, and Paging3 (`collectAsLazyPagingItems`); Kotlin sealed-class discriminators; one-shot `await queryX()` no longer mistaken for a subscription.
+- **Session-handler guidance corrected** — rules and findings named `AmitySessionHandler`, a type that exists in no SDK. Corrected to the real symbols: `SessionHandler` (Android/iOS) and a `(AccessTokenRenewal renewal)` callback (Flutter). Detection was unaffected; guidance only.
+- **Corpus grew from 265 → 300 rules.**
+### Fixed
+- iOS live-collection crash on files larger than 32 KB.
+- Flutter region marker keyed on a nonexistent `socketEndpoint`; replaced with the real `httpEndpoint`/`mqttEndpoint`/`uploadEndpoint` + `AmityRegional{Http,Mqtt}Endpoint` variants.
+- Several moderation false positives (flag is a user-level action; delete/edit are owner-or-mod) and post-datatype scoping (a comment renderer is not a post renderer).
+### Benchmark & quality infrastructure (`bench/`)
+- **Benchmark factory** — a synthetic two-track loop. A change counts as an improvement only if **FP↓ AND TP held** (coupled metric). The true-positive detection rate is a static seeded corpus (`bench/tp-dashboard.mjs`, currently 290/290); the false-positive *rate* is measured on fresh blind builds, never a static corpus. An LLM grader classifies findings FP-vs-real and is validated against labeled ground truth before being trusted. A `bench:gate` CI gate enforces the coupled metric on every change.
+- **FP-grader grounded in the authoritative SDK symbol surface** — the grader's symbol-existence judgments are now deterministic lookups against vendored snapshots distilled from the docs repo's machine-extracted surfaces (DocC ABI / Dokka GFM / TypeDoc / dartdoc), not model recall. `bench:symbols-drift` diffs the live surface against the vendored snapshot — the seam for a future SDK-release drift trigger.
+### Docs
+- README + ARCHITECTURE/RULES/TESTING refreshed for the design, completeness, and outcomes capabilities.
+### Honest scope
+The design, completeness, and SDK-version layers are **advisory** — they inform and attest, they do not gate (only correctness gates; 7 hard gates across 300 rules). The benchmark factory measures and narrows the false-positive frontier for the idioms it has seen; it does not close it in general. Symbol-surface grounding settles whether an API exists — not whether it is used idiomatically, which remains the hand-authored opinion layer that is the product.
+---
+## 0.10.0 — 2026-05-29
+**Theme:** Benchmark-driven sensor expansion. The Commune benchmark (9 new SDK domains: chat, push, social graph, moderation, comments) produced the first measured, defensible advantage for vise+skill over pure MCP: **7/9 working features vs 3/9** with the same agent on the same prompts. This release ships the sensors, rules, and findings.json improvements that produced that result.
+### Added
+- `react-native.chat.channel-type-dm` / `typescript.chat.channel-type-dm` (`warning`) — DM channels must use `type: 'conversation'`, not `type: 'community'`. Agents consistently choose `community` for 1-to-1 chats because it sounds plausible but silently creates a group channel with the wrong shape. Sensor requires `userIds` co-occurrence to avoid firing on legitimate community broadcasts.
+- `react-native.follow.status-subscription` (`warning`) — `getFollowStatus` must be wrapped in a live subscription, not a one-shot query. A one-shot call captures state at mount and never updates — follow/unfollow actions are not reflected in the UI until the user navigates away.
+- `rationale` field in `sp-vise/findings.json` — agents see *why* each rule exists, not just *what* it requires. Improves attestation quality on rules that allow it.
+- Compliance.json rule entries now include a `title` field (digest-stable, separate from hashing) so agents and humans can identify rules without grepping definitions.
+- Corpus grew from **262 → 265 rules**.
+### Changed
+- **`vise init` now writes `sp-vise/findings.json` immediately** — agents see current rule violations on startup with no exploration needed. Combined with the `npm run sp-check` script added to scaffolded workspaces, agents follow a directed (read findings → fix → verify) loop instead of an exploratory (search → search → search → implement) loop.
+- **`live-collection.api-mismatch`, `posts.activity-tag-filter`, `posts.reaction-stale-post-ref`, `user.ban-state-respected`** — all now skip `.d.ts` files to eliminate false positives from type stubs.
+- **`user.ban-state-respected`** — `flagComment` and `flagPost` added to the recognised write-pattern list. Flagging is a moderation action and must be ban-guarded.
+- **`react-native.push.unregister.present`** — recommendation generalised; no longer references benchmark-specific state variables. Surfaces the exact `useEffect` cleanup pattern needed.
+- Reactive markers now include `.on('dataUpdated', ...)` — the event-emitter style of subscribing to LiveCollection updates is now recognised as a valid alternative to property-callback subscription.
+- README updated with a step-by-step Quick Start that references `findings.json` directly.
+### Benchmark infrastructure (`benchmarks/`)
+- **Commune benchmark** added — 9-slice React Native scenario (CM-SETUP, CM-PRESENCE, CM-FEED, CM-EVENTS, CM-CHAT, CM-PUSH, CM-PROFILE, CM-MODERATE, CM-COMMENTS) covering chat, push, social graph, and moderation domains absent from TouchTunes. Three seed types per slice (`baseline`, `broken`, `greenfield`) for 27 fixture sets total.
+- **Rules-as-markdown control arm** (`benchmarks/commune/run-commune-rules-arm.sh`) — injects the rule corpus as a static document into the agent prompt. Built to isolate whether vise's measured advantage comes from *information delivery* (the rules) or the *iterative verification loop* (sp-check).
+- **TouchTunes runner improvements** — workspace isolation (`workspaces/broken/` vs `workspaces/baseline/` so agents can't peek at the answer), `< /dev/null` stdin redirect fix that was causing agy/codex to silently skip cells, `|| true` per-cell error isolation, and grader auto-attestation for no-file and `.d.ts`-pointing rules.
+- **agy + codex runners** (`run-agy-cells.sh`, `run-codex-cells.sh`) — production-quality scripts with TTY-detection fixes and workspace isolation.
+### Findings & reports
+- `benchmarks/FINDINGS.html` — engineering-facing summary of the benchmark methodology, results, and what was/wasn't proven.
+- `benchmarks/MARKETING.html` — three-tier marketing-claim framework (safe / concrete / honest / aspirational) with supporting wallclock data and a list of metrics to instrument next.
+### Honest claim
+On 9 new SDK domain implementations with codex gpt-5.4, vise+skill produced 7 working features vs 3 for pure MCP — same agent, same prompts. The cost: +28% wallclock per session. The net: −52% wallclock per *working* feature, because more features ship on the first try. Vise consistently catches five bug classes that capable models otherwise miss: wrong DM channel type, missing push register/unregister lifecycle, one-shot queries where live subscriptions are required, missing ban checks before write operations, and missing flag affordances on user-generated content.
+---
+## 0.9.0 — 2026-05-27
+**Theme:** Business model-grounded gap analysis; Next.js / SSR guard; environment hygiene expanded to all platforms.
+### Added
+- `typescript.client.no-ssr-init` (`error`) — SDK client must not be initialized in a Next.js Server Component, `layout.tsx` without `'use client'`, or inside `getServerSideProps`/`getStaticProps`. The primary demo-invisible failure mode for AI-native Next.js customers: `next dev` recovers from the error gracefully; `next build` + production does not.
+- `react-native.secret.env-gitignore` — React Native env files containing secret-shaped keys must be excluded by `.gitignore`.
+- `react-native.secret.env-example` — A `.env.example` or `.env.sample` must accompany any gitignored React Native env file.
+- `flutter.secret.env-gitignore` — Flutter `.env` or `secrets.dart` files containing secret-shaped keys must be excluded by `.gitignore`.
+- `android.secret.env-gitignore` — `local.properties` containing secret-shaped keys must be excluded by `.gitignore`.
+- `ios.secret.env-gitignore` — `Secrets.plist` or `*.xcconfig` files containing secret-shaped keys must be excluded by `.gitignore`.
+- Corpus grew from **256 → 262 rules**.
+- `benchmarks/SDK_INTEGRATION_GAP_ANALYSIS.md` — business model-grounded gap analysis mapping every SDK-relevant value claim to Vise rule coverage, with a prioritised improvement backlog.
+### Changed
+- **Skill — "Stop Instead Of Guessing":** intake list now asks about Next.js rendering mode (Server Component vs `'use client'` vs Pages Router) before implementing SDK initialization.
+- **Skill — "Session Renewal":** new feedforward: SDK collection queries must not fire before `login()` completes; gate collection setup behind the session-active signal.
+- **Skill — "Live Collection API Mismatch":** new guidance: handle connection-state changes and render a reconnecting indicator when the WebSocket drops.
+- **Skill — "Debugging & Troubleshooting":** compact `--brief` flag documented; `repairBrief` output described.
+---
+## 0.7.0 — 2026-05-23
+**Theme:** SDK-specific rule corpus expansion + measured cross-tool benchmark.
+### Added
+- 17 new SDK-specific rule families across 5 platforms = **85 new compliance rules** (corpus grew from 167 → 252):
+  - **Tier 1 — Silent-failure traps:** `session-handler.retained`, `live-collection.api-mismatch`, `posts.status-filter-applied`, `pagination.cursor-opaque`, `posts.parent-child-rendered`
+  - **Tier 2 — Wrong-target / silent misroute:** `feed.target-type-explicit`, `comment.reference-type-enum`, `channel.type-matches-shape`
+  - **Tier 3 — Moderator-only data leaking to user UI:** `moderation.role-gated-action`, `flag-count.not-leaked-to-non-mods`, `user.ban-state-respected`
+  - **Tier 4 — Notifications & unread state:** `notifications.amity-preferences-configured`, `unread.subscribed-not-counted`
+  - **Tier 5 — Custom config & types:** `reactions.configured-name-used`, `custom-post-type.dataType-declared`
+  - **Tier 6 — File upload & media:** `file-upload.via-amity-file-client`, `image-post.child-resolution-awaited`
+- Multi-outcome measured benchmark (chat / comments / push on React + Flutter) with cross-tool validation (Antigravity / Gemini 3.5 Flash). See `benchmarks/RESULTS.md`.
+- Fixture-foundation gates: `run-happy-path-clean.mjs` (every canonical happy-path must fire zero rules) and `run-fixture-symmetry.mjs` (every rule's positive fixture must not fire the rule).
+- Dedicated React Native canonical happy-path fixture (previously shared with TypeScript).
+- New CI exit code `4` for `contract-drift` (rules in `sp-vise/compliance.json` no longer match current ruleset).
+### Fixed
+- `*.secret.inline-api-key` now catches env-fallback literal leaks: `String.fromEnvironment(..., defaultValue: 'literal')` (Dart), `process.env.X ?? 'literal'` (JS/TS), ternary fallback. Previously these forms slipped past the regex because the literal wasn't directly assigned to `apiKey`.
+- Four web/Flutter rule false-positives on idiomatic guarded code: `typescript.client.region` now accepts env-sourced and positional region declarations; `*.network.error-handling-present` recognizes React error-state idiom; `flutter.design.reuse-detected-tokens` credits `Theme.of(context)` reuse.
+- Pre-existing CLI version assertion in `test/run-cli.mjs` (was pinned to `0.4.0`).
+### Changed
+- Project structure flattened — `packages/foundry/` layer removed; npm package now publishes from the repo root.
+- README consolidated from two files (brand + developer) into a single customer-facing canonical README; internal architecture moved to `docs/`.
+## 0.6.0 — 2026-05-22
+**Theme:** v0.6 compliance expansion + 5-platform measured benchmark.
+### Added
+- Corpus grew to 167 rules across 10 domains.
+- Outcomes: `add-comments`, `add-moderation`, `add-chat`.
+- Five-platform measured benchmark (TypeScript / React Native / Flutter / Android / iOS) with real `vise check` and `vise run-sensors` artifacts.
+### Fixed
+- React Native platform detection priority (previously misdetected as TypeScript when both signals were present).
+## 0.5.0 — 2026-05-21
+**Theme:** AST-based sensors.
+### Added
+- Tree-sitter AST sensors for Kotlin / Swift / Dart literal detection.
+- Phase 1 pilot: `typescript.auth.no-literal-user-id` resolves identifier-via-constant indirections.
+- Phase 4: AST-aware comment stripping for `ui-states-present` and `design-reuse-detected-tokens` rules.
+## 0.4.0 — 2026-05-20
+**Theme:** Compliance harness.
+### Added
+- `vise check --ci`: read-only verification with structured exit codes for CI pipelines.
+- Attestation flow: `vise attest` with rule id, signer, confidence, evidence, and rationale.
+- `vise sync`: persist deterministic-pass attestation files.
+- Engagement tracking: `vise engagement init/show` for tier / customer-id / scope metadata.
+- `sp-vise/` sidecar directory: customer-visible compliance contract (`compliance.json`, `attestations/`, `engagement.json`, `inspection.json`).
+- Cross-platform rule corpus.
+- Native project skill installs.
+## 0.3.0 — 2026-05-19
+**Theme:** Foundry → Vise rename.
+### Changed
+- Renamed npm package to `@amityco/social-plus-vise`.
+- Added `vise` short binary alias; kept `foundry-mcp` as a compatibility binary alias.
+- Added Claude Code skill targets (`--target claude`, `--target claude-project .`).
+- Documented Cursor, Copilot, and VS Code instruction installs.
+## 0.2.1 — 2026-05-18
+### Added
+- `vise install-skill`, `vise print-skill`, and `vise skill-path` for bundled-skill installation.
+## 0.2.0 — 2026-05-17
+### Added
+- Skill-guided CLI commands: `inspect`, `plan`, `validate`, `run-sensors`.
+- The `social-plus-vise` skill guidance shipped as part of the package.
+## 0.1.1 — 2026-05-16
+### Added
+- Initial npm publish.
+- MCP adapter (stdio).
+- Doc search backed by `https://learn.social.plus/llms-full.txt`.
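
The 0.12.0 notes in the changelog above describe a coupled benchmark metric: a change counts as an improvement only if the false-positive rate drops and true-positive detection holds. A minimal sketch of that decision follows. The `BenchSnapshot` shape, its field names, and `isImprovement` are hypothetical illustrations, not the package's actual `bench:gate` implementation.

```typescript
// Hypothetical snapshot of one benchmark run; all field names are assumptions.
interface BenchSnapshot {
  truePositivesDetected: number; // seeded violations the rules caught
  truePositivesSeeded: number;   // total seeded violations in the static corpus
  falsePositives: number;        // findings graded as false positives on fresh builds
  findingsGraded: number;        // total findings graded on those builds
}

// Share of seeded violations that were detected (treated as 1 for an empty corpus).
function tpRate(s: BenchSnapshot): number {
  return s.truePositivesSeeded === 0 ? 1 : s.truePositivesDetected / s.truePositivesSeeded;
}

// Share of graded findings that were false positives (treated as 0 when nothing was graded).
function fpRate(s: BenchSnapshot): number {
  return s.findingsGraded === 0 ? 0 : s.falsePositives / s.findingsGraded;
}

// Coupled metric: the FP rate must strictly drop AND the TP rate must not regress.
function isImprovement(before: BenchSnapshot, after: BenchSnapshot): boolean {
  return fpRate(after) < fpRate(before) && tpRate(after) >= tpRate(before);
}

// Illustrative numbers only; the 290/290 corpus size is the one quoted in the changelog.
const baseline: BenchSnapshot = { truePositivesDetected: 290, truePositivesSeeded: 290, falsePositives: 12, findingsGraded: 100 };
const candidate: BenchSnapshot = { truePositivesDetected: 290, truePositivesSeeded: 290, falsePositives: 8, findingsGraded: 100 };
console.log(isImprovement(baseline, candidate));
```

Requiring both conditions together means a rule change that silences false positives by also silencing real detections is rejected rather than rewarded.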

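The 0.7.0 fix for `*.secret.inline-api-key` covers env-fallback literal leaks such as `process.env.X ?? 'literal'`. The sketch below shows one way such a check could be expressed for the JS/TS nullish-fallback form only. The regex, the function name `hasEnvFallbackLiteral`, and the variable name `AMITY_API_KEY` are assumptions for illustration; the real rule's pattern is not shown in this diff, and the Dart and ternary forms are not handled here.

```typescript
// Matches `process.env.NAME ?? '<non-empty literal>'` with single, double, or backtick quotes.
// An empty-string fallback is ignored because it cannot leak a key.
const ENV_FALLBACK_LITERAL = /process\.env\.[A-Za-z_]\w*\s*\?\?\s*(['"`])(?:(?!\1).)+\1/;

// Returns true when a line of source falls back from an env var to a hardcoded literal.
function hasEnvFallbackLiteral(sourceLine: string): boolean {
  return ENV_FALLBACK_LITERAL.test(sourceLine);
}

// The literal below is a placeholder, not a real key.
console.log(hasEnvFallbackLiteral("const apiKey = process.env.AMITY_API_KEY ?? 'placeholder-key';"));
console.log(hasEnvFallbackLiteral("const apiKey = process.env.AMITY_API_KEY;"));
```

A line-level regex like this is only a rough approximation; a fallback split across lines would be missed.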
package/README.md CHANGED

@@ -1,7 +1,3 @@
-<p align="center">
-  <img src="./social.plus-vise.png" alt="social.plus Vise" width="320" />
-</p>
 <h1 align="center">social.plus Vise</h1>
 <p align="center">
@@ -45,9 +41,11 @@ See [Usage Flow](#usage-flow) for the full step-by-step diagram.
 ---
-## What Vise Does
+## What Vise Does: Agentic Workflow Governance
-Vise is a **CLI + AI skill** that wraps coding agents in deterministic compliance guardrails when they integrate social.plus SDKs. It inspects your project, grounds the agent in hosted docs, enforces 250+ platform-specific compliance rules, and runs your project's own build/lint/typecheck sensors. **Your source code never leaves your machine.**
+Instead of just providing a CLI or AI skills, Vise implements a technique called **Agentic Workflow Governance**. Think of it as building a software factory directly on top of the customer's project.
+Vise acts as the foreman of this factory, wrapping your local coding agents in compliance guardrails when they integrate social.plus SDKs. It inspects your project, grounds the agent in hosted docs, enforces 300 platform-specific compliance rules, checks the generated UI against the customer's design system, surfaces the full SDK feature surface so nothing is silently dropped, and runs your project's own build/lint/typecheck sensors. **Your source code never leaves your machine.**
 | Layer | Purpose |
 |---|---|
@@ -55,59 +53,104 @@ Vise is a **CLI + AI skill** that wraps coding agents in deterministic complianc
 | **CLI** (`vise`) | Deterministic engine: inspects repos, searches docs, validates setup, runs sensors, manages attestations |
 | **MCP adapter** | Optional stdio server for MCP-capable tools (Claude Code, Cursor, Codex, VS Code, Copilot) |
+### What Vise validates: three layers
+Vise validates on three layers, and the layer is set by the *kind of claim* — which keeps it false-positive-free where it gates:
+| Layer | Claim | How | Enforcement |
+|---|---|---|---|
+| **SDK compliance** | "this is **wrong**" | 300 deterministic rules (session renewal, live-collection vs one-shot, no secret in logs, parent-child rendering, ban-state gating…) | **Hard gate** — `vise check` blocks until green or attested |
+| **Design conformance** | "this **looks off**" | extract the customer's design system into a contract, then check token usage | **Advisory** — `vise design check`/`preview`; never fails a build |
+| **Feature completeness** | "this is **missing**" | Vise proposes the full SDK feature surface per outcome; the agent opts out of anything out of scope with a recorded reason | **Advisory** — surfaced in `vise plan`/`check`; never fails a build |
+Only correctness is gated (it can be made FP-free); conformance and completeness are surfaced, because "all post types" and "matches the brand" are legitimately scope-dependent. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
+### Design-conformant UI
+Vise can ingest the customer's aesthetic into a **design contract** and guide generation to match it — from an HTML/CSS prototype (`vise design extract`) or from the host app's own design system across web + Android + Flutter + iOS (`vise design extract --from-project`: CSS vars/Tailwind/token modules, `colors.xml`, Flutter `Color(0x…)`, iOS `.colorset`/Swift). `vise design check` reports token conformance; `vise design preview` writes a visual review. All advisory.
+### Supported integrations (outcomes)
+`vise plan`/`init` classify the request into an outcome and tailor the plan, rules, and feature checklist: **feed** · **comments** · **chat** · **moderation** · **community** · **social graph (follow)** · **in-app notifications** · plus setup (SDK, push, live data).
 ### Why "Vise"
 A bench vise holds the workpiece steady so the craftsman's hands are free to shape it. Without one, the workpiece drifts and cuts wander. Vise does the same for AI agents integrating SDKs: it clamps the integration to a known-good position (the real docs, the real project structure, the real compliance rules) so the agent can focus on creative work instead of guessing.
 ---
-## Benchmark: First-Try Success
+## Benchmark: Phase 1 Results
+> **Every feature delivered correctly — confirmed independently with two different AI coding tools.**
+> With Vise, both agents built all 9 social features with no production gaps. Without Vise, 3 out of 9 features had hidden problems that would only surface after users complained.
+### What "delivered correctly" means
+"Correct" doesn't just mean the code compiles. It means every feature handles the edge cases that matter to real users and real moderation teams:
+- A **banned user** cannot type or submit a post — the send button is hidden, not just disabled-on-submit
+- **Push notification preferences** are wired to the Amity API so users who opt out actually stop receiving notifications
+- **Moderation actions** (report, flag, block) are surfaced in the UI so users can act on them, not buried in a hook
+- **Chat and feed queries** use live, reactive subscriptions — not one-time fetches that go stale
+Without Vise, AI agents frequently implement the primary feature correctly but miss these secondary requirements. They know about them in the abstract — but when building a chat screen, "ban state" feels out of scope and gets skipped. `sp-check` turns that vague awareness into a specific, actionable finding.
-> **100% first-try CI pass with Vise vs 0% without.**
->
-> **76% cheaper · 28% faster · 86% fewer issues**
+### The experiment: three conditions, nine features
-When an AI agent integrates social.plus with only docs access (Pure MCP), it produces code with real problems: hardcoded user IDs, missing authentication, no content moderation, broken reactive patterns. These aren't edge cases — they're the SDK-specific requirements that general AI knowledge reliably misses.
+We ran a controlled experiment — the **Commune Benchmark** — to measure not just *whether* Vise helps, but *why*. Each of the nine features below was built from scratch by an AI agent under three independent conditions:
-### v0.8 Pilot Results (React/Next.js · "add comments")
+**Nine features built:**
+SDK setup · User presence · Social feed · Events · Chat & DMs · Push notifications · User profile · Content moderation · Comments
-| Surface | CI Pass | Issues | Tokens | Cost | Wall-clock |
-|---|---|---|---|---|---|
-| **Pure MCP** (docs only) | ❌ 0/2 | 4–7 | 36,219 | $0.0108 | 619s |
-| **Vise-as-MCP** (rules engine) | ✅ 2/2 | 1 | 21,047 | $0.0061 | 540s |
-| **Vise CLI + Skill** (full workflow) | ✅ 2/2 | 1 | 8,733 | $0.0024 | 447s |
+| Condition | What the agent had | The question it answers |
+|---|---|---|
+| **Pure MCP** | Access to social.plus docs only — no compliance guidance | Baseline: how well does the agent do on its own? |
+| **Rules-as-Markdown** | The full 1,013-line compliance rulebook pasted directly into the prompt | Is the problem just that the agent doesn't know the rules? |
+| **Vise + Skill** | Full Vise CLI — `sp-check` runs automatically, agent reads specific findings, fixes them, repeats until green | Does an active feedback loop change the outcome? |
+The Rules-as-Markdown condition is the key isolation: if the agent already knows all the rules, does giving it the spec document fix the problem? The answer turned out to be **no** — knowing the rules and being forced to act on specific findings are different things.
+### Results — features delivered without production gaps
+| Coding agent (model) | Pure MCP | Rules-as-Markdown | Vise + Skill |
+|---|---|---|---|
+| **Cursor (Composer 2.5)** | 6 out of 9 ✗ | 5 out of 9 ✗ | **9 out of 9 ✅** |
+| **Claude Code (Sonnet 4.6)** | 6 out of 9 ✗ | 7 out of 9 ✗ | **9 out of 9 ✅** |
-<sub>Token/cost data from Antigravity/Gemini Flash 3.5. Copilot CLI does not expose token accounting.</sub>
+The three features that consistently fail without Vise — **Chat**, **Moderation**, and **Push Notifications** — are exactly the ones with secondary compliance requirements (ban-state, report affordances, Amity preference API). Vise's `sp-check` catches these with a specific finding; the rules doc does not.
-**What "Issues" means in plain language:**
+Both agents reached a perfect score with Vise. Neither could reach it with the compliance spec pasted into the prompt. All 9 passes were independently verified by code inspection — no scoring shortcuts.
-Without Vise, both agents produced code with hardcoded user IDs (security vulnerability), no authentication flow (anonymous writes), missing moderation UI, non-reactive queries, and missing SDK initialization. With Vise, those problems are caught or prevented during generation.
+### Efficiency — rework sessions needed
-### Why this matters
+Vise delivers all 9 features correctly in a single session. The other conditions leave failing features that require additional sessions to diagnose (the gap isn't visible without `sp-check`) and fix.
-| Metric | Without Vise | With Vise (CLI + Skill) | Improvement |
+| Coding agent (model) | Condition | Features correct | Rework sessions needed |
 |---|---|---|---|
-| Does it work on first try? | ❌ Fails CI | ✅ Passes CI | 100% pass rate |
-| Security issues? | Hardcoded IDs, no auth | 0 security findings | 100% eliminated |
-| Integration issues | 4–7 per run | 1 per run | **−86%** fewer issues |
-| Token cost | $0.0108 | $0.0024 | **−78%** cheaper |
-| Token usage | 36,219 | 8,733 | **−76%** fewer tokens |
-| Speed (Gemini) | 619s | 447s | **−28%** faster |
-| Manual rework needed? | Yes | No | Zero rework |
+| **Cursor (Composer 2.5)** | Pure MCP | 6 / 9 ✗ | +3 or more |
+| **Cursor (Composer 2.5)** | Rules-as-Markdown | 5 / 9 ✗ | +4 or more |
+| **Cursor (Composer 2.5)** | **Vise + Skill** | **9 / 9 ✅** | **0 ✅** |
+| **Claude Code (Sonnet 4.6)** | Pure MCP | 6 / 9 ✗ | +3 or more |
+| **Claude Code (Sonnet 4.6)** | Rules-as-Markdown | 7 / 9 ✗ | +2 or more |
+| **Claude Code (Sonnet 4.6)** | **Vise + Skill** | **9 / 9 ✅** | **0 ✅** |
-### Cross-model validation
+<sub>Rework sessions are additional developer-initiated prompts needed after the initial session to diagnose and fix the failing features. Each failing feature typically requires at least one session to identify the gap and one to fix it — and that's without the benefit of `sp-check` pointing directly at the problem.</sub>
-The effect holds across **Claude Sonnet 4.6** (Copilot CLI) and **Gemini Flash 3.5** (Antigravity). This is not a prompt trick for one model — it's domain knowledge applied consistently at the social.plus layer.
+### Reproducibility
+- **Gate-checked:** Every pass was verified by code inspection — the Vise workspaces contain an actual UI-level ban gate; the pure-MCP workspaces do not. Zero attestation shortcuts.
+- **Built from scratch** (greenfield seed) — not patching existing code.
+- **Three arms run with separate tooling.** The Rules-as-Markdown arm has no `sp-check` tool available — it cannot "cheat" by running the checker.
+- **N=1 per cell (Phase 1).** Each agent ran each scenario once. Repeatability seeds on the three most discriminating slices (CM-CHAT, CM-MODERATE, CM-PUSH) are pending. These results should be treated as a strong initial signal, not a statistically settled finding.
+- Full per-feature scorecards, agent transcripts, and workspace diffs: [`benchmarks/FINDINGS.html`](benchmarks/FINDINGS.html) · [`benchmarks/RULES_AS_MARKDOWN.html`](benchmarks/RULES_AS_MARKDOWN.html)
 ### Which mode should I use?
-| If you... | Use | Why |
+| If you… | Use | Why |
 |---|---|---|
-| Can install the skill | **CLI + Skill** | Fastest, cheapest, best results |
-| Can't install skill but have MCP | **Vise-as-MCP** | Same compliance, slightly more tokens |
-| Want to validate existing code | `vise check --ci` | Grade any codebase, any time |
-For the full interactive report with charts, see [`benchmarks/report.html`](./benchmarks/report.html). For per-cell scorecards and prior benchmark versions, see [`benchmarks/RESULTS.md`](./benchmarks/RESULTS.md).
+| Building new social features with an AI agent | **Vise CLI + Skill** | The only mode that reliably delivers all features correctly |
+| Auditing existing social.plus code | `vise check --ci` | Grades any codebase against the full ruleset |
+| Enforcing compliance in a CI pipeline | `vise check --ci` | Exits non-zero on failures; structured JSON output for logs |
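A minimal sketch of how the CI row above might be wired into a pipeline step. It relies only on the documented behaviour that `vise check --ci` exits non-zero on failures. That the structured report is written to stdout, and the `vise-report.json` file name, are assumptions made for illustration; `ci_gate` is a hypothetical helper, not part of Vise.

```sh
#!/bin/sh
# Hypothetical helper: run the compliance gate, keep the report, pass the exit status through.
# Assumption: `vise check --ci` prints its structured report on stdout.
ci_gate() {
  project="${1:-.}"
  report="${2:-vise-report.json}"   # illustrative file name, not a Vise default

  status=0
  vise check "$project" --ci > "$report" || status=$?

  if [ "$status" -ne 0 ]; then
    echo "vise check failed (exit $status); see $report" >&2
  fi
  # Return the original status so the CI job fails when the check fails.
  return "$status"
}
```

Calling `ci_gate . build/vise-report.json` from a pipeline step would fail the job when the check fails, while leaving the report on disk for the logs.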
 ---
@@ -121,7 +164,7 @@ For the full interactive report with charts, see [`benchmarks/report.html`](./be
 | **Android (Kotlin)** | ✅ Full | Gradle assemble, unit tests |
 | **iOS (Swift)** | ✅ Full | (static rule checks; runtime sensors WIP) |
-Each platform has 50–55 rules across 10 compliance domains (feed, comments, moderation, chat, secrets, session & auth, notifications, live objects, logging hygiene, design tokens).
+Each platform has 52–54 rules across 10 compliance domains (feed, comments, moderation, chat, secrets, session & auth, notifications, live objects, logging hygiene, design tokens).
 ---
@@ -199,12 +242,24 @@ The flow above is what the skill teaches your AI agent. You — the human — dr
 | `vise plan-harness [path] --request "..."` | (Pre-planning step) Build the harness around the request |
 | `vise init [path] --request "..."` | Write the `sp-vise/` compliance contract for this project |
-### Documentation grounding
+### Design contract (UI generation)
+| Command | Purpose |
+|---|---|
+| `vise design extract <prototypePath> [--repo .] [--no-write]` | Read an HTML/CSS prototype and write a graded `sp-vise/design-contract.json` (declared CSS custom properties become exact tokens; repeated literals become inferred/advisory tokens; single-use literals are dropped) so generated social.plus UI can match the customer's aesthetic |
+| `vise design extract --from-project [path] [--no-write]` | No external prototype? Derive the contract from the host project's **own** design system: CSS custom properties (including shadcn `:root` and Tailwind v4 `@theme`), TS/JS token modules, inline Tailwind configs, **Android** `colors.xml`/`dimens.xml`, **Flutter** `Color(0x…)`, and **iOS** `.xcassets/*.colorset` + Swift `Color(hex:)`/`Color(red:green:blue:)`. Reference values (`var()`/`theme()`/`calc()`) are skipped, so a var-mapped config contributes nothing rather than wrong tokens |
+| `vise design check [path]` | Advisory, **non-blocking** report on how closely the UI code matches the contract (token coverage + on/off-contract color literals). Never fails a build and is **not** a `vise check` gate |
+| `vise design preview [path] [--reference <prototype>]` | Write a self-contained `sp-vise/design-preview.html` containing the contract's tokens as visual swatches, the conformance report, and the HTML reference embedded for side-by-side review. Vise renders the artifact; a human/VLM judges the visual match. Dependency-free; **not** an automated pixel diff |
+The extracted contract is **advisory input for generation**, not an enforcement gate: a token-poor prototype yields a weaker — never wrong — contract, and absence of a prototype simply means no contract (the existing `*.design.reuse-detected-tokens` rules still cover reuse of a host project's own design system).
+### Documentation grounding & troubleshooting
 | Command | Purpose |
 |---|---|
 | `vise search-docs "<query>"` | Search social.plus docs for relevant pages |
 | `vise get-doc-page <path>` | Fetch a specific doc page by path |
+| `vise debug [path] --error "..." [--brief]` | Debug an SDK-specific runtime failure and emit a likely-cause summary plus a minimal repair brief |
 ### Compliance verification
@@ -225,6 +280,18 @@ The flow above is what the skill teaches your AI agent. You — the human — dr
 | `vise run-sensors [path]` | Run detected project commands (npm scripts, Gradle, Flutter, lint, typecheck, SDK import smokes); never executes arbitrary shell |
 | `vise run-sensors [path] --dry-run` | List what would run without executing |
+### Troubleshooting quick loop
+For SDK-specific runtime issues, start with the compact debug flow before broader repo exploration:
+```sh
+vise debug . --error-file logs/crash.log --brief
+vise check . --ci
+vise run-sensors .
+```
+`vise debug --brief` returns the likely rule, minimum patch shape, invariants to preserve, and verification commands for the first repair pass.
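The three commands above can also be chained so that a failing compliance check stops the loop before the sensors run. The following is a sketch under stated assumptions, not the project's own script: `quick_loop` is a hypothetical helper, and because the exit status of `vise debug` is not documented here, the sketch deliberately does not treat it as a gate. Only `vise check --ci` is assumed to signal failure through a non-zero exit.

```sh
#!/bin/sh
# Hypothetical helper wrapping the quick loop; only the three documented commands are used.
quick_loop() {
  project="${1:-.}"
  crash_log="${2:-logs/crash.log}"

  # Step 1: compact repair brief for the first pass.
  # Its exit status is not treated as a gate (assumption: it may be informational).
  vise debug "$project" --error-file "$crash_log" --brief || true

  # Step 2: compliance check; stop here if findings remain.
  if ! vise check "$project" --ci; then
    echo "compliance check failed; fix findings before running sensors" >&2
    return 1
  fi

  # Step 3: run the detected project commands.
  vise run-sensors "$project"
}
```

Running `quick_loop . logs/crash.log` after applying a patch repeats the same verification order each time, which keeps the first repair pass short.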
 ### Skill management
 | Command | Purpose |
@@ -266,7 +333,7 @@ MCP-capable hosts can call Vise as structured tool calls instead of shell comman
 ### Tool names (snake_case per MCP convention)
-`inspect_project`, `plan_harness`, `plan_integration`, `init_compliance`, `check_compliance`, `sync_compliance`, `attest_rule`, `explain_rule`, `init_engagement`, `show_engagement`, `search_docs`, `get_doc_page`, `validate_setup`, `run_sensors`.
+`inspect_project`, `plan_harness`, `plan_integration`, `init_compliance`, `check_compliance`, `sync_compliance`, `attest_rule`, `explain_rule`, `init_engagement`, `show_engagement`, `search_docs`, `get_doc_page`, `debug_issue`, `validate_setup`, `run_sensors`.
 These are the same operations as the CLI commands above, exposed as MCP tools.