npm - @amityco/social-plus-vise - Versions diffs - 1.1.0 → 1.1.1 - Mend

@amityco/social-plus-vise 1.1.0 → 1.1.1

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between the two versions as they appear in the public registry.

Files changed (3)

package/CHANGELOG.md CHANGED

@@ -4,6 +4,13 @@ All notable changes to `@amityco/social-plus-vise` are documented in this file.
 The format is loosely based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## 1.1.1 — 2026-06-16
+Docs-only patch; no CLI, MCP, rule, or sidecar behavior changes.
+### Changed
+- **README Evidence section reworked.** Dropped the pre-1.0 "like-for-like" feed-completeness result (it was measured against an earlier dev build on early-June agent versions); presented the two genuine matched comparisons (SDK compliance, design conformance) as a scannable table; relabeled the former "best case" as **"Feed completeness (capabilities selected)"** and reframed it as *absolute* opt-in completeness rather than a baseline-beating lift; led design conformance with the robust by-name-token signal instead of the fragile one-seed hex anecdote (and dropped the model attribution, given a cross-temporal confound now noted under Boundaries); tightened the limitations and lifted the "stopping condition" framing into the section intro. No benchmark numbers were invented; the SDK-compliance benchmark stays described generically (its internal codename remains denylisted from the shipped surface).
 ## 1.1.0 — 2026-06-16
 Public GA-readiness pass. No change to `vise check` exit codes or sidecar **formats**. The rule corpus gains three corrected rationales (stale/fictional SDK method names) — a deliberate pre-GA contract reset, with the `test:sidecar-compat` baseline re-frozen accordingly (no published 0.14.x consumers to protect; see below).

package/README.md CHANGED

@@ -130,13 +130,20 @@ Vise can ingest your aesthetic from an HTML/CSS prototype (`vise design extract`
 ## Evidence
-Vise's measured effect is on **feed-feature completeness** — does the agent build the whole outcome (pagination, empty/error states, the optional capabilities you asked for) or just the happy path?
+Vise's measured effect is on whether an agent builds the *whole* outcome — pagination, empty and error states, the optional capabilities you asked for — or just the happy path. The durable finding across benchmarks is the **mechanism**, not any single percentage: a checked verification loop beats the same rules pasted into a prompt. What Vise reliably changes is the **stopping condition** — the agent isn't done when the code compiles; it's done when the local contract is green, attested, or blocked on an explicit customer decision.
-- **Like-for-like** (pre-registered, K=3 seeds, one firewalled ground truth): the Vise workflow **roughly doubled** feed-feature completeness over the docs-only baseline — to **58–64%** from **21–30%** — and the effect replicated across all three agents (Cursor/Composer 2.5 30→64%, Claude/Sonnet 4.6 27→61%, Codex/GPT-5.4 21→58%), evidence it's a property of the verification loop rather than one model. (Grading was deterministic-only in two of the three cells, which if anything understates the gap.)
-- **Best case** (the Vise arm explicitly opted into the optional capabilities up front; later build): with image/poll/edit selected, the Vise arm reached **97–100%** of the 11-item feed checklist across Cursor/Composer 2.5, Claude/Sonnet 4.6, and Codex/GPT-5.4. This is an opt-in best case, **not** the like-for-like delta above — the baseline arm was not offered the same selection and ran an earlier build.
-- **SDK compliance:** in a greenfield SDK-compliance benchmark (9 slices, single run per cell) the Vise verification-loop arm scored **9/9 (100%)** vs a docs-only baseline at **6/9 (67%)**. The durable finding is the mechanism — a checked loop beats the same rules pasted into the prompt.
-- **Design drift:** under an ambiguous brief (Sonnet, n=3), Vise design runs held hex-literal variance to **0** while the docs-only arm spiked to 15 on one stochastic seed — variance reduction on bad runs, not pixel perfection (a later pre-registered run saw both arms at 0; the robust signal is by-name token usage).
-- **Boundaries:** these are best-case/opt-in comparisons for greenfield social.plus work, self-reported (no third-party audit), not universal claims. Negative results travel with the claims — e.g. no measured advantage on day-2 bug fixing. What Vise reliably changes is the **stopping condition**: the agent isn't done when code compiles; it's done when the local contract is green, attested, or blocked on explicit customer input.
+**Matched comparisons** — Vise's verification loop vs a docs-only baseline, same task and agent:
+| Benchmark | What's measured | Docs-only | With Vise |
+|---|---|---|---|
+| **SDK compliance** | correctness slices passed (greenfield; 9 independent slices, single run per cell, two agents) | 6/9 (67%) | **9/9 (100%)** |
+| **Design conformance** | the brand's *named* design tokens used vs hardcoded values (ambiguous brief, n=3) | ~72% | **~90%** |
+The mechanism is the durable claim: a checked loop beats the same rules in the prompt.
+**Feed completeness (capabilities selected):** when the optional capabilities — image, poll, edit — are selected up front, the Vise arm completed **97–100%** of an 11-item feed checklist across three agents (Cursor/Composer, Claude/Sonnet, Codex/GPT). This is *absolute* completeness from answering Vise's capability questions — **not** a lift over a baseline.
+**Boundaries:** these are social.plus's own measurements on greenfield work — self-reported, no third-party audit, measured on earlier builds, and not universal claims. Negative results travel with them: no measured advantage on day-2 bug fixing, and the design-conformance arms were measured at different times (the figure is the robust by-name-token signal, not pixel perfection).
 <sub>Cursor, Claude, Codex, GitHub Copilot, VS Code, and other product names are trademarks of their respective owners; social.plus is not affiliated with or endorsed by them. Benchmark figures are from social.plus's own measurements.</sub>
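The README's Evidence intro credits Vise with changing the agent's stopping condition. Under that condition, the agent is not done when the code compiles, only when the local contract is green, attested, or blocked on an explicit customer decision. The TypeScript below is a hypothetical sketch of that control flow, not Vise's implementation or API. `ContractStatus`, `verificationLoop`, the `check`/`fix` callbacks, and the `maxIterations` budget are all invented for illustration.

```typescript
// Hypothetical sketch: none of these names are Vise APIs.
type ContractStatus =
  | { kind: "green" } // every local check passes
  | { kind: "attested"; reason: string } // signed off rather than checked (assumed meaning)
  | { kind: "blocked"; question: string } // waiting on an explicit customer decision
  | { kind: "failing"; failures: string[] };

// Stopping condition: any status other than "failing" ends the loop.
function isDone(status: ContractStatus): boolean {
  return status.kind !== "failing";
}

function verificationLoop(
  check: () => ContractStatus,
  fix: (failures: string[]) => void,
  maxIterations = 10, // assumed budget so a stuck loop still terminates
): ContractStatus {
  let status = check();
  for (let i = 0; i < maxIterations && !isDone(status); i++) {
    // Compiling is not a stopping point: repair the reported failures and re-check.
    if (status.kind === "failing") fix(status.failures);
    status = check();
  }
  return status; // may still be "failing" if the budget ran out
}

// Toy usage: two missing feed features, one repaired per pass.
const missing = new Set<string>(["pagination", "empty state"]);
const result = verificationLoop(
  (): ContractStatus =>
    missing.size === 0 ? { kind: "green" } : { kind: "failing", failures: Array.from(missing) },
  (failures) => {
    missing.delete(failures[0]);
  },
);
console.log(result.kind); // prints: green
```

In the toy run, the check first reports two missing features, and each pass repairs one. The loop therefore returns `green` after two fixes. A `blocked` status ends the loop at once. This mirrors handing an explicit decision back to the customer instead of guessing.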

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@amityco/social-plus-vise",
-  "version": "1.1.0",
+  "version": "1.1.1",
   "description": "Skill-guided deterministic CLI for social.plus SDK integration assistance.",
   "license": "SEE LICENSE IN LICENSE",
   "type": "module",