npm - create-walle - Versions diffs - 0.9.21 → 0.9.23 - Mend

create-walle 0.9.21 → 0.9.23

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (500)

package/template/wall-e/coding-prompts.js CHANGED

@@ -9,6 +9,11 @@ const {
 } = require('./runtime/prompt-envelope');
 const { runtimeModeInstructions } = require('./coding/runtime-mode');
 const { buildCapabilityContext } = require('./coding/prompt-capabilities');
+const {
+  buildArtifactCapabilityContext,
+  hasCapability,
+  routeArtifactCapabilities,
+} = require('./coding/capability-router');
 const { buildResponseLanguagePolicy } = require('./context/response-language');
 /**
@@ -124,7 +129,7 @@ These are calibration prompts, not absolute rules. The user's explicit choices (
 - IMAGE-LED for visual portfolios — when the subject's work IS images, one generous image beats four small grid cards; let the strongest piece bleed full-width; the work is the hero, not the bio. (Inverse: for text-led genres like docs or essays, typography is the hero.)
 - EARN every UI component — no stat cards, timelines, pull-quotes, or testimonial walls by reflex. Each should pass: would this section exist in a thoughtfully-designed analog version (printed monograph, gallery placard, zine, brochure, dashboard)?
 - REFERENCE 2-3 real sites by name before designing; borrow with intent. "I'm drawing on [X]'s editorial scale and [Y]'s color restraint" — then build. Pick references in the right genre.
-- SCREENSHOT what you built — call the \`browser_screenshot\` tool with \`viewport: 'desktop'\` and \`viewport: 'mobile'\` against your file:// or http://localhost URL, then look at the returned image and self-critique. If browser_screenshot returns ok:false (no Chrome installed), say so explicitly — don't pretend you verified.
+- SCREENSHOT + SMOKE what you built — call \`browser_screenshot\` with desktop and mobile viewports, then call \`browser_smoke_test\` against the same file:// or verified localhost URL. Screenshots prove appearance; browser_smoke_test proves the page has no runtime JS errors or broken click handlers. If browser verification returns ok:false, fix the issue or state the exact blocker — don't pretend you verified.
 - VOICE in copy: write specifically for this subject, not in AI-generic prose. If you lack voice material, leave a clear placeholder ("WRITE A REAL BIO HERE") instead of inventing plausible-sounding filler.
 If the user-provided brief explicitly chose a style on this list (e.g., "use Cormorant + Inter for an editorial book site"), follow the brief — the discipline above does not override explicit instructions. Note your choice and the user's source in code comments so the next reader sees the intent.
@@ -195,13 +200,21 @@ function buildAgentSystemPrompt({ resolvedCwd, projectInfo, projectSkills, taskF
   const runtimeCtx = runtimeModeInstructions(runtimeMode);
   const memoryToolsAvailable = runtimeContext.memoryToolsAvailable !== false;
   const memoryProtocolCtx = loadMemoryProtocolBlock({ available: memoryToolsAvailable });
-  const frontendDesignCtx = isFrontendTask(taskFileHints, runtimeContext.userTask)
+  const artifactCapabilities = runtimeContext.artifactCapabilities
+    || routeArtifactCapabilities({
+      prompt: runtimeContext.userTask,
+      taskFileHints,
+      projectInfo,
+    });
+  const frontendDesignCtx = (hasCapability(artifactCapabilities, 'frontend_design') || isFrontendTask(taskFileHints, runtimeContext.userTask))
     ? loadFrontendDesignBlock({ available: true })
     : '';
   const capabilityContext = runtimeContext.capabilityContext
     || buildCapabilityContext(runtimeContext.promptCapabilities);
+  const artifactCapabilityContext = buildArtifactCapabilityContext(artifactCapabilities);
   const channelContext = [
     runtimeContext.channelContext || '',
+    artifactCapabilityContext || '',
     capabilityContext || '',
   ].filter(Boolean).join('\n\n');
   const responseLanguagePolicy = buildResponseLanguagePolicy({
@@ -240,7 +253,7 @@ ${memoryProtocolCtx ? `${memoryProtocolCtx}\n\n` : ''}${frontendDesignCtx ? `${f
 - Prefer dedicated tools over run_shell when one fits: Read for known paths, edit_file/multi_edit for surgical edits, list_directory over \`ls -R\`. Reserve run_shell for things only a shell can do.
 - For writing source files, use write_file/edit_file/multi_edit. These tools can write inside the current project/cwd, including temporary project directories. Do not use run_shell heredocs or redirects just to create source files.
 - run_shell takes a complete shell command string in \`command\`. If you need pipes, redirects, heredocs, or \`cd ... && ...\`, put the whole shell expression in \`command\`, not in \`args\` or an interpreter \`-c\` wrapper.
-- For static HTML/CSS/JS verification, use browser_screenshot with a \`file://\` URL for the local HTML file. Do not start a background HTTP server unless the app truly needs one; background \`&\` processes can keep shell pipes open and waste the turn budget.
+- For static HTML/CSS/JS verification, use browser_screenshot and browser_smoke_test with a \`file://\` URL for the local HTML file. If a static file server is genuinely needed, use start_static_server and check_url. For NON-static long-lived processes (dev servers, watchers, long builds), use run_shell with \`background: true\` and poll with bg_output / stop with bg_kill — never append \`&\` to a command. Never say a localhost/127.0.0.1 preview is live, back up, HTTP 200, or reachable unless the current turn has successful start_static_server/check_url/browser_screenshot/browser_smoke_test evidence for that URL; localhost evidence is Wall-E host loopback only, not phone/remote-browser proof.
 - Multiple INDEPENDENT tool calls can run in parallel — use that to keep the loop fast. SEQUENTIAL calls (each depends on the previous result) must run one at a time.
 - Destructive run_shell operations (git reset --hard, force push, rm -rf, dropping tables) need explicit user authorization unless they're trivially recoverable in this sandbox. Investigate before deleting unfamiliar files.
@@ -351,7 +364,7 @@ ${context.plannerNotes}`);
 Output JSON with the following structure (no other text outside the JSON):
 {
   "subtasks": [
-    { "title": "Short title", "prompt": "Detailed prompt for this subtask" }
+    { "title": "Short title", "prompt": "Detailed prompt for this subtask", "acceptance": { "task_kind": "code|frontend-ui|docs|test", "write_policy": "must-write|may-write|read-only", "validators": ["project.tests"] } }
   ],
   "branch_name": "type/short-description",
   "estimated_scope": "small|medium|large"
@@ -360,9 +373,12 @@ Output JSON with the following structure (no other text outside the JSON):
 Rules:
 - This planning step has no tool access. Do not inspect files, run commands, or ask for tools; use only the supplied context.
 - Maximum ${maxSubtasks} subtasks. Smaller is better — combine where you can.
+- Vague polish prompts such as "improve UX", "make it world class", or "polish the page" are bounded improvement tasks by default, not permission for a full rewrite. Plan 1-3 high-impact changes unless the user explicitly asks for a rebuild.
+- For frontend/UI subtasks, keep HTML, CSS, and JS contracts together. If a subtask adds an inline handler or interactive control in HTML, the same subtask must implement the matching JavaScript and verify it.
 - Each subtask must be independently executable: complete enough context that the next agent doesn't need this plan to do it.
 - Order so dependencies come first. Test-writing subtasks often go LAST (so we test the final shape, not an intermediate one).
 - Preserve explicit verification/tool requirements from the user request inside the relevant subtask prompt, using the same tool names and key arguments when given.
+- For frontend/UI work, include acceptance validators ["frontend.static_contract","frontend.screenshot_evidence","frontend.browser_runtime"] and tell the worker to run both browser_screenshot and browser_smoke_test.
 - branch_name follows conventional naming (feat/, fix/, refactor/, test/, docs/, chore/).
 - Each subtask prompt must be specific: which files, what behavior, how to verify. "Implement X" without context is not a subtask.`);
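
The planner prompt above now asks for an `acceptance` object on each subtask. A minimal sketch of how a consumer might sanity-check that object follows. The allowed values are copied from the prompt text in this diff; `checkAcceptance` and the constant names are hypothetical and do not appear in the package.

```javascript
// Hypothetical checker for the planner's `acceptance` object.
// Enum values come from the planner prompt in the diff; the function is illustrative only.
const TASK_KINDS = ['code', 'frontend-ui', 'docs', 'test'];
const WRITE_POLICIES = ['must-write', 'may-write', 'read-only'];
const FRONTEND_VALIDATORS = [
  'frontend.static_contract',
  'frontend.screenshot_evidence',
  'frontend.browser_runtime',
];

// Returns a list of human-readable problems; an empty list means the object looks valid.
function checkAcceptance(acceptance) {
  if (!acceptance || typeof acceptance !== 'object') {
    return ['acceptance must be an object'];
  }
  const problems = [];
  if (!TASK_KINDS.includes(acceptance.task_kind)) {
    problems.push(`unknown task_kind: ${acceptance.task_kind}`);
  }
  if (!WRITE_POLICIES.includes(acceptance.write_policy)) {
    problems.push(`unknown write_policy: ${acceptance.write_policy}`);
  }
  const validators = Array.isArray(acceptance.validators) ? acceptance.validators : [];
  // The prompt requires all three frontend validators for frontend/UI work.
  if (acceptance.task_kind === 'frontend-ui') {
    for (const name of FRONTEND_VALIDATORS) {
      if (!validators.includes(name)) problems.push(`missing validator: ${name}`);
    }
  }
  return problems;
}
```

Returning a list of problems rather than throwing would let a caller feed the concrete failure text back into a retry, which matches the retry-with-failure-text behavior the acceptance contract doc describes.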

package/template/wall-e/context/context-builder.js CHANGED

@@ -343,16 +343,28 @@ Relevant memories and knowledge are provided above. If they answer the question,
 ### Step 2: SEARCH — only if the context above is insufficient
 Call search_memories to find additional evidence. Batch multiple searches in ONE turn.
-Use different query angles: English keywords, Chinese terms, source filters.
+- **Search by keywords, not whole sentences.** Pull the key nouns/names out of the request; don't paste the full instruction as the query.
+- **People:** when the request is about a person, search their name AND call **lookup_person** to get their aliases/email/handle, then search those too. People are often referenced by email or @handle, not full name.
+- Use different query angles: English keywords, Chinese terms, source filters.
 For private, remembered, or work-context questions, use Wall-E memory before public web_fetch. This includes prior conversations, decisions, preferences, people, teams, projects, tools, Slack/email/calendar context, questions about the user's own writing style/tone/personality/work patterns, and "last time" / "do you know" / "what did we discuss" prompts. Use public web only for public/current facts or after memory misses.
+### Step 2b: ESCALATE — when stored memory comes back empty
+Empty memory is NOT a dead end — it usually means the brain hasn't ingested it yet, not that it doesn't exist. Before concluding you lack context, query the LIVE sources you have:
+- **slack_search** (and slack_read_channel) for live Slack history about a person/topic.
+- **mail_search** / **mail_messages** for email threads.
+- **calendar_events** for meetings together.
+- For a colleague, use **lookup_person** / entity tools and Glean (org directory, role, manager) to anchor who they are.
+Only after BOTH stored memory and the relevant live sources come back empty may you say you couldn't find specific history — and then proceed to Step 4 with what you do know.
 ### Step 3: THINK — reason through the evidence
 Use the **think** tool before responding to:
 - Analyze what the evidence ACTUALLY shows vs what it SEEMS to show
 - Challenge your conclusions: do you have 3+ examples, or are you over-generalizing?
 - Consider if behavior is DELIBERATE and STRATEGIC rather than a gap
-### Step 4: RESPOND — with depth and nuance
+### Step 4: RESPOND — never dead-end a request
+- **If asked to WRITE/DRAFT/COMPOSE something** (a note, message, email, summary): produce the actual draft. "Reads like me / in my voice" does NOT require more lookups — apply ${ownerName}'s writing style (from the profile/memories above; search "writing style"/"tone" if needed) and, if you have past examples of this kind of note, mirror them. When you're missing a specific fact, write the draft anyway and mark the gap inline with a clear \`[placeholder: ...]\` rather than refusing.
+- **Never answer a "do X for me" request with only "I don't have enough context."** Offer the best partial result you can, plus EITHER a draft-with-placeholders OR exactly ONE targeted clarifying question (e.g. "What's one moment with them you'd want to highlight?"). A useful draft beats a polished refusal.
 - Use **bold** for key names, dates, and decisions
 - Use > blockquotes when quoting actual Slack messages
 - Include dates and people for attribution
@@ -376,6 +388,7 @@ function buildToolRefBlock(ownerName, intent) {
   }
   lines.push('- **run_skill / mcp_call / list_mcp_tools**: Actions and external services.');
   lines.push('- **Local tools**: web_fetch, run_shell, read_file, write_file, search_files, calendar_events, calendar_list, calendar_create, reminder_create, notification, applescript, open_url, open_app, screenshot, system_info, clipboard_read/write');
+  lines.push('- **Local preview URLs**: Before saying a localhost/127.0.0.1 preview is live, back up, HTTP 200, or reachable, call check_url or start_static_server in the current turn. These checks prove Wall-E host loopback only; phone or remote-browser reachability needs CTM remote/tunnel evidence.');
   lines.push('- **Email (Google Workspace first)**: mail_messages, mail_read, mail_search, mail_reply, and mail_send search all usable configured Gmail/Google Workspace accounts by default. Read the `accounts`, `mail_account_scope`, and `unavailable_accounts` fields in tool results before concluding an email is missing. If the likely account is unavailable or the result has `needs_clarification`, tell the user exactly which accounts were searched and ask one short clarification/reconnect question. Use `mail_reply` for any reply/respond/original-thread request; it derives recipients and Gmail thread headers from the original message_id. Use `mail_send` only for brand-new emails. macOS Mail is only an explicit fallback after GWS is unavailable or source:"macos" is requested. Outbound external actions are admitted through an action controller: validate sender/account, recipient, payload, and user intent; if the controller stages a draft or requires confirmation, never claim the action was sent.');
   lines.push('- **Slack**: search_memories with source:"slack" for stored messages. slack_search / slack_read_channel for live data. slack_send_message to post. pull_slack to ingest.');
   lines.push("- **Glean**: When using reportsto: queries, \"entities\" = direct reports only. Check manager.email to verify.");

package/template/wall-e/decision/confidence.js CHANGED

@@ -27,7 +27,7 @@ function checkGraduation(domain) {
   if (currentTier === 1) {
     // Tier 1 -> 2: enough memories in domain
-    const memCount = brain.listMemories({ source: domain, limit: 51 }).length;
+    const memCount = brain.countMemories({ source: domain });
     if (memCount > 50) newTier = 2;
   } else if (currentTier === 2) {
     const rate = dc.total_actions > 0 ? dc.approved_actions / dc.total_actions : 0;
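
The hunk above swaps a capped `listMemories` call for `countMemories` without changing the `memCount > 50` threshold. The sketch below uses a hypothetical in-memory stand-in for `brain` to show that both forms reach the same graduation decision. Only the two method names and the threshold come from the diff; the real implementations are not shown there, so any performance benefit of counting instead of listing is an assumption.

```javascript
// Hypothetical in-memory stand-in for `brain`; only the method names come from the diff.
function makeBrain(memories) {
  return {
    listMemories({ source, limit }) {
      return memories.filter((m) => m.source === source).slice(0, limit);
    },
    countMemories({ source }) {
      return memories.filter((m) => m.source === source).length;
    },
  };
}

const GRADUATION_THRESHOLD = 50; // from `memCount > 50` in the diff

// 0.9.21 form: fetch at most threshold + 1 rows and measure the array length.
function graduatesByList(brain, domain) {
  const rows = brain.listMemories({ source: domain, limit: GRADUATION_THRESHOLD + 1 });
  return rows.length > GRADUATION_THRESHOLD;
}

// 0.9.23 form: ask for the count directly.
function graduatesByCount(brain, domain) {
  return brain.countMemories({ source: domain }) > GRADUATION_THRESHOLD;
}
```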

package/template/wall-e/docs/coding-acceptance-contract.md ADDED

@@ -0,0 +1,41 @@
+# Wall-E Coding Acceptance Contract
+Wall-E coding runs must not treat model prose as proof of completion. Every build run now has a typed acceptance contract that maps the task to required validators before a subtask or final completion can succeed.
+## Contract Shape
+Planner subtasks may include an `acceptance` object:
+```json
+{
+  "task_kind": "frontend-ui",
+  "write_policy": "must-write",
+  "validators": [
+    "frontend.static_contract",
+    "frontend.screenshot_evidence",
+    "frontend.browser_runtime"
+  ]
+}
+```
+The orchestrator also derives a contract when the planner omits one. Frontend tasks are detected from changed frontend files or UI/web task language.
+## Frontend Validators
+- `frontend.static_contract` checks local assets and HTML-to-JS inline event handlers. If HTML calls `toggleAudio()` or `openMusicLightbox()`, a loaded inline or local script must define that global function.
+- `frontend.screenshot_evidence` requires successful `browser_screenshot` evidence for frontend changes.
+- `frontend.browser_runtime` uses `browser_smoke_test` or the orchestrator's final auto-smoke path. It loads the page in headless Chrome through CDP, captures runtime exceptions, console errors, failed requests, and safe-clicks interactive elements such as `[onclick]`, buttons, role buttons, and hash links.
+## Enforcement Points
+- Subtask gate: after a worker changes files, Wall-E runs the fast contract checks before tests/review. Static frontend failures retry the subtask with concrete failure text.
+- Final gate: before Wall-E can report success or commit, Wall-E re-runs frontend static checks, requires screenshot evidence, and runs browser runtime smoke on discovered HTML entrypoints.
+- Telemetry/progress: each validator emits structured `acceptance_validator` progress and anonymous `coding_acceptance_validator` telemetry with validator name, status, and failure count.
+## Scope Control
+Vague UI prompts such as "improve UX" or "make it world class" are bounded improvement tasks by default. The planner should produce 1-3 high-impact changes unless the user explicitly asks for a rebuild. For frontend subtasks, HTML, CSS, and JS changes must stay in the same subtask when they form one interaction contract.
+## Why This Exists
+The failure mode that motivated this contract was a successful-looking frontend run that produced HTML with inline handlers whose JavaScript functions did not exist. Screenshots and file-change checks did not catch it. The new contract catches the defect statically and, at final completion, through a real browser runtime check.
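
The HTML-to-JS handler check described for `frontend.static_contract` could look roughly like the sketch below. It is a regex-based approximation written for illustration, not the package's validator: the function names are hypothetical, and it only recognizes handlers that start with a plain function call and globals declared as `function name(` or `window.name =`.

```javascript
// Collect function names called from inline on* handlers, e.g. onclick="toggleAudio()".
// Rough sketch: only handlers that begin with `name(` are recognized.
function inlineHandlerCalls(html) {
  const names = new Set();
  const attr = /\son[a-z]+\s*=\s*(["'])(.*?)\1/gi;
  let match;
  while ((match = attr.exec(html)) !== null) {
    const call = /^\s*([A-Za-z_$][\w$]*)\s*\(/.exec(match[2]);
    if (call) names.add(call[1]);
  }
  return names;
}

// Collect global names declared as `function name(` or assigned as `window.name =`.
function definedGlobals(scriptSource) {
  const names = new Set();
  const decl = /function\s+([A-Za-z_$][\w$]*)\s*\(|window\.([A-Za-z_$][\w$]*)\s*=/g;
  let match;
  while ((match = decl.exec(scriptSource)) !== null) {
    names.add(match[1] || match[2]);
  }
  return names;
}

// Handler names the HTML calls but the script never defines.
function missingHandlers(html, scriptSource) {
  const defined = definedGlobals(scriptSource);
  return [...inlineHandlerCalls(html)].filter((name) => !defined.has(name));
}
```

A static pass like this is cheap enough to run after every subtask, which fits the subtask gate above. Handlers wired up dynamically would escape it, and that gap is what the browser runtime check at the final gate is meant to cover.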

package/template/wall-e/docs/external-action-controller.md CHANGED

@@ -32,6 +32,19 @@ user confirmation, and the exact approved envelope is replayed back to Wall-E.
 Wall-E then executes the original payload directly rather than asking the model
 to recreate it.
+There are two valid CTM approval handoffs:
+- **Exact-payload replay**: the prior assistant turn already produced blocked
+  external-action envelopes. CTM reconstructs those envelopes from either
+  durable `toolCalls` or provider-native `tool_result` blocks and sends exact
+  `actionId`/`payloadHash` approvals.
+- **Current-turn dispatch approval**: the user approves visible drafts with
+  explicit domain language such as `yes, please send both emails` before any
+  envelope exists. CTM sends a one-turn approval scoped to the matching external
+  domain. Wall-E may execute same-turn tool calls only if validation has no
+  issues and the only remaining confirmation is
+  `external_action_confirmation_required`.
 ## Approval Tiers
 Wall-E uses two approval tiers:
@@ -108,10 +121,14 @@ Anthropic, OpenAI, and other providers all use the same envelope replay path.
   account.
 - Retry tracing classifies outbound external actions as unsafe side effects.
 - CTM learned approval rules cannot auto-allow external action tools.
-- CTM only converts the next prompt into an approval when the latest assistant
-  turn contains a staged external action and the user text is a clear external
-  action confirmation. Coding prompts such as `go ahead with the fix` do not
-  approve mail/calendar side effects.
+- CTM converts the next prompt into an exact approval when the latest pending
+  action group contains staged external actions and the user text is a clear
+  external action confirmation. It can also create a one-turn domain-scoped
+  approval for explicit dispatch confirmations such as `yes, please send both
+  emails`. Bare confirmations such as `yes` do not create current-turn
+  approvals without a pending action.
+- Coding prompts such as `go ahead with the fix` do not approve mail/calendar
+  side effects.
 - Approved envelopes are idempotent per Wall-E session and payload hash to avoid
   accidental duplicate sends from double-submit or retry.
 - Calendar approval envelopes preserve `account`, `source`, `calendarId`,
@@ -148,8 +165,11 @@ Focused regressions:
 - `wall-e/tests/coding-orchestrator.test.js`
 - `wall-e/tests/coding-stream-processor.test.js`
 - `wall-e/tests/execution-trace.test.js`
-- `wall-e/tests/chat.test.js` with `stages a draft email`
+- `wall-e/tests/chat.test.js` with `stages a draft email` and
+  `validated same-turn email dispatch`
 For realistic prompt validation, run the Wall-E chat loop with a disposable data
 directory and a mock provider that attempts `mail_send`. The expected tool result
-is `decision: "stage_preview"` and `sent: false`.
+is `decision: "stage_preview"` and `sent: false` for draft-only prompts. For a
+current-turn approval prompt, the expected tool result is a verified executor
+result, not a blocked confirmation envelope.
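
The doc above states that approved envelopes are idempotent per Wall-E session and payload hash. One way to sketch that guard is shown below. The hashing scheme (SHA-256 over key-sorted JSON), the in-memory `Set`, and every name here are assumptions for illustration; the diff does not show how Wall-E computes `payloadHash` or stores executed approvals.

```javascript
const nodeCrypto = require('node:crypto');

// Recursively sort object keys so equal payloads serialize identically.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value).sort().map((key) => [key, canonicalize(value[key])])
    );
  }
  return value;
}

// Hypothetical payload hash: SHA-256 of the canonical JSON form.
function payloadHash(payload) {
  return nodeCrypto
    .createHash('sha256')
    .update(JSON.stringify(canonicalize(payload)))
    .digest('hex');
}

// Hypothetical in-memory ledger keyed by session id plus payload hash.
function createApprovalLedger() {
  const executed = new Set();
  return {
    // Returns true the first time a (session, payload) pair is claimed, false on repeats.
    claim(sessionId, payload) {
      const key = `${sessionId}:${payloadHash(payload)}`;
      if (executed.has(key)) return false;
      executed.add(key);
      return true;
    },
  };
}
```

With a guard of this shape, a double-submitted approval for the same email in the same session would be claimed once and skipped the second time, while the same payload in a different session would still execute.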

package/template/wall-e/docs/telemetry-lifecycle.md CHANGED

@@ -36,7 +36,7 @@ Defaults are intentionally conservative and can be overridden by environment var
 | Install IP | 30 days | IP is operational only for abuse/debugging. |
 | Owner display name | 7 days | Hashes and machine buckets are enough for fleet analysis. |
-Diagnostic event names include `error`, `skill_fallback_attempt`, `initiative_provider_cooldown`, `compat_usage`, `upgrade`, `upgrade_prompt`, `funnel`, and the `ctm_update_` prefix.
+Diagnostic event names include `error`, `skill_fallback_attempt`, `initiative_provider_cooldown`, `compat_usage`, `upgrade`, `upgrade_prompt`, `funnel`, `session_integrity_issue`, `session_integrity_issue_summary`, and the `ctm_update_` prefix.
 Noisy event names include `skill_dispatch_decision`, `skill_exec`, `skill_run`, `skills_run`, `task_run`, `think`, `initiative`, `reflect`, and `ingest`.
@@ -59,7 +59,13 @@ This makes archived event counts exact for deleted rows and lets summary endpoin
 Feedback cleanup does not delete the report row. It writes daily feedback rollups first, then replaces title, description, triage text, evidence, context previews, and attachment metadata with redacted placeholders.
-The cleanup process does not run `VACUUM` automatically. SQLite file compaction can be expensive and should be a separate maintenance action after checking disk pressure and service load. WAL checkpointing is safe enough for the daily cleanup path.
+The cleanup process does not run `VACUUM` automatically. SQLite file compaction can be expensive and should be a separate maintenance action after checking disk pressure and service load. WAL checkpointing defaults to `PASSIVE` to avoid aggressive truncate behavior on storage backends that can return short reads; operators can set `WALLE_TELEMETRY_CLEANUP_CHECKPOINT_MODE=FULL`, `RESTART`, or `TRUNCATE` only when they explicitly need stronger WAL draining.
+Session integrity telemetry is intentionally aggregate-first. CTM emits `session_integrity_issue_summary` for fleet-level counts and only emits per-session `session_integrity_issue` details when `CTM_SESSION_INTEGRITY_DETAIL_TELEMETRY=1` is set for a focused debugging run.
+Upgrade telemetry distinguishes in-product updates from externally observed version changes. Summary fields use `completed`/`completed_after_apply` for updates with a matching apply-start signal, and `external_completed` for version changes detected without that signal.
+Compatibility telemetry reports both current usage and removal blockers. `safe_to_remove` may stay empty even when a compatibility feature has low usage if that feature is not deprecated or still has active installs.
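
The checkpoint-mode override described above can be pictured with the short sketch below. The environment variable name, the `PASSIVE` default, and the four mode names come from the doc, and `PRAGMA wal_checkpoint(MODE)` is standard SQLite. The helper names, the case-insensitive parsing, and the fallback to `PASSIVE` on unrecognized values are assumptions, not the package's confirmed behavior.

```javascript
const CHECKPOINT_MODES = ['PASSIVE', 'FULL', 'RESTART', 'TRUNCATE'];

// Resolve the WAL checkpoint mode from the environment.
// Assumption: unknown or empty values fall back to the documented PASSIVE default.
function resolveCheckpointMode(env = process.env) {
  const raw = String(env.WALLE_TELEMETRY_CLEANUP_CHECKPOINT_MODE || '').trim().toUpperCase();
  return CHECKPOINT_MODES.includes(raw) ? raw : 'PASSIVE';
}

// Build the SQLite statement the cleanup path would run.
// Only allowlisted mode names reach the SQL string, so no untrusted text is interpolated.
function checkpointPragma(env = process.env) {
  return `PRAGMA wal_checkpoint(${resolveCheckpointMode(env)});`;
}
```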
 ## Operations