npm - muonroi-cli - Versions diffs - 1.4.1 → 1.5.0 - Mend

muonroi-cli 1.4.1 → 1.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (172)

package/LICENSE +21 -21
package/README.md +122 -122
package/dist/packages/agent-harness-core/src/predicate.d.ts +1 -1
package/dist/src/agent-harness/__tests__/mock-model.spec.js +48 -1
package/dist/src/agent-harness/mock-model.d.ts +11 -0
package/dist/src/agent-harness/mock-model.js +21 -0
package/dist/src/cli/cost-forensics.js +12 -12
package/dist/src/council/__tests__/clarification-prompt.test.js +51 -0
package/dist/src/council/__tests__/clarifier-ready-gate.test.js +32 -0
package/dist/src/council/__tests__/decisions-lock.test.js +17 -1
package/dist/src/council/__tests__/oauth-reachable.test.d.ts +1 -0
package/dist/src/council/__tests__/oauth-reachable.test.js +31 -0
package/dist/src/council/__tests__/parse-outcome-fallback.test.js +11 -0
package/dist/src/council/clarifier.js +9 -1
package/dist/src/council/debate.js +5 -1
package/dist/src/council/decisions-lock.js +3 -3
package/dist/src/council/index.js +12 -5
package/dist/src/council/leader.d.ts +0 -17
package/dist/src/council/leader.js +22 -15
package/dist/src/council/planner.js +1 -1
package/dist/src/council/prompts.js +63 -57
package/dist/src/council/types.d.ts +7 -0
package/dist/src/ee/__tests__/ee-onboarding.test.d.ts +1 -0
package/dist/src/ee/__tests__/ee-onboarding.test.js +32 -0
package/dist/src/ee/auth.d.ts +9 -0
package/dist/src/ee/auth.js +19 -0
package/dist/src/ee/ee-onboarding.d.ts +5 -0
package/dist/src/ee/ee-onboarding.js +76 -0
package/dist/src/generated/version.d.ts +1 -1
package/dist/src/generated/version.js +1 -1
package/dist/src/headless/output.js +6 -4
package/dist/src/headless/output.test.js +4 -3
package/dist/src/index.js +20 -1
package/dist/src/mcp/__tests__/auto-setup.test.js +74 -0
package/dist/src/mcp/__tests__/client-pool.spec.d.ts +1 -0
package/dist/src/mcp/__tests__/client-pool.spec.js +98 -0
package/dist/src/mcp/__tests__/parallel-build.spec.d.ts +1 -0
package/dist/src/mcp/__tests__/parallel-build.spec.js +67 -0
package/dist/src/mcp/__tests__/smart-filter.test.js +56 -0
package/dist/src/mcp/auto-setup.js +56 -2
package/dist/src/mcp/client-pool.d.ts +46 -0
package/dist/src/mcp/client-pool.js +212 -0
package/dist/src/mcp/oauth-callback.js +2 -2
package/dist/src/mcp/parse-headers.test.js +14 -14
package/dist/src/mcp/runtime.d.ts +28 -0
package/dist/src/mcp/runtime.js +117 -51
package/dist/src/mcp/self-verify-runner.d.ts +14 -0
package/dist/src/mcp/self-verify-runner.js +38 -0
package/dist/src/mcp/setup-guide-text.d.ts +9 -0
package/dist/src/mcp/setup-guide-text.js +84 -0
package/dist/src/mcp/smart-filter.js +49 -0
package/dist/src/mcp/smoke.test.js +43 -43
package/dist/src/mcp/tools-server.d.ts +7 -0
package/dist/src/mcp/tools-server.js +19 -22
package/dist/src/models/catalog.json +349 -349
package/dist/src/ops/__tests__/doctor-ee-health.test.js +21 -0
package/dist/src/ops/doctor.d.ts +3 -2
package/dist/src/ops/doctor.js +47 -11
package/dist/src/ops/doctor.test.js +4 -3
package/dist/src/orchestrator/__tests__/mcp-capability-block.test.d.ts +1 -0
package/dist/src/orchestrator/__tests__/mcp-capability-block.test.js +39 -0
package/dist/src/orchestrator/__tests__/project-stack.test.d.ts +1 -0
package/dist/src/orchestrator/__tests__/project-stack.test.js +65 -0
package/dist/src/orchestrator/batch-turn-runner.js +7 -11
package/dist/src/orchestrator/message-processor.js +57 -27
package/dist/src/orchestrator/orchestrator.js +26 -0
package/dist/src/orchestrator/prompts.d.ts +51 -0
package/dist/src/orchestrator/prompts.js +257 -134
package/dist/src/orchestrator/scope-ceiling.js +6 -1
package/dist/src/orchestrator/stream-runner.js +20 -15
package/dist/src/orchestrator/text-tool-call-detector.test.js +13 -13
package/dist/src/pil/__tests__/clarity-gate.test.js +24 -215
package/dist/src/pil/__tests__/config.test.js +1 -17
package/dist/src/pil/__tests__/discovery.test.js +144 -11
package/dist/src/pil/__tests__/layer1-intent-trace.test.js +7 -2
package/dist/src/pil/__tests__/layer1-intent.test.js +3 -0
package/dist/src/pil/__tests__/layer16-clarity.test.js +32 -116
package/dist/src/pil/__tests__/layer4-gsd.test.js +37 -0
package/dist/src/pil/__tests__/layer6-output.test.js +137 -18
package/dist/src/pil/__tests__/llm-classify.test.js +49 -2
package/dist/src/pil/agent-operating-contract.d.ts +1 -1
package/dist/src/pil/agent-operating-contract.js +2 -0
package/dist/src/pil/agent-operating-contract.test.js +7 -2
package/dist/src/pil/cheap-model-playbook.js +35 -35
package/dist/src/pil/cheap-model-workbooks.js +16 -13
package/dist/src/pil/clarity-gate.d.ts +21 -19
package/dist/src/pil/clarity-gate.js +26 -153
package/dist/src/pil/config.d.ts +9 -1
package/dist/src/pil/config.js +15 -4
package/dist/src/pil/discovery.js +211 -136
package/dist/src/pil/layer1-intent.d.ts +12 -0
package/dist/src/pil/layer1-intent.js +283 -38
package/dist/src/pil/layer1-intent.test.js +210 -4
package/dist/src/pil/layer16-clarity.d.ts +25 -11
package/dist/src/pil/layer16-clarity.js +19 -306
package/dist/src/pil/layer4-gsd.js +18 -6
package/dist/src/pil/layer6-output.d.ts +2 -0
package/dist/src/pil/layer6-output.js +137 -22
package/dist/src/pil/llm-classify.d.ts +26 -0
package/dist/src/pil/llm-classify.js +34 -5
package/dist/src/pil/native-capabilities-workbook.d.ts +1 -1
package/dist/src/pil/native-capabilities-workbook.js +82 -76
package/dist/src/pil/schema.d.ts +8 -0
package/dist/src/pil/schema.js +12 -1
package/dist/src/pil/task-tier-map.js +4 -0
package/dist/src/pil/types.d.ts +11 -1
package/dist/src/product-loop/done-gate.js +3 -3
package/dist/src/product-loop/loop-driver.js +18 -18
package/dist/src/product-loop/progress-snapshot.js +4 -4
package/dist/src/providers/auth/gemini-oauth.js +6 -15
package/dist/src/providers/auth/grok-oauth.js +6 -15
package/dist/src/providers/auth/openai-oauth.js +6 -15
package/dist/src/providers/mcp-vision-bridge.js +48 -48
package/dist/src/reporter/index.js +1 -1
package/dist/src/scaffold/bb-ecosystem-apply.js +47 -47
package/dist/src/scaffold/bb-quality-gate.js +5 -5
package/dist/src/scaffold/continuation-prompt.js +60 -60
package/dist/src/scaffold/init-new.js +453 -453
package/dist/src/self-qa/__tests__/scenario-planner.test.js +3 -3
package/dist/src/self-qa/agentic-loop.js +24 -19
package/dist/src/self-qa/spec-emitter.js +26 -23
package/dist/src/storage/__tests__/migrations.test.js +2 -2
package/dist/src/storage/interaction-log.js +5 -5
package/dist/src/storage/migrations.js +122 -122
package/dist/src/storage/sessions.js +42 -42
package/dist/src/storage/transcript.js +91 -84
package/dist/src/storage/usage.js +14 -14
package/dist/src/storage/workspaces.js +12 -12
package/dist/src/tools/__tests__/native-tools.test.d.ts +1 -0
package/dist/src/tools/__tests__/native-tools.test.js +53 -0
package/dist/src/tools/git-safety.d.ts +61 -0
package/dist/src/tools/git-safety.js +141 -0
package/dist/src/tools/git-safety.test.d.ts +1 -0
package/dist/src/tools/git-safety.test.js +111 -0
package/dist/src/tools/native-tools.d.ts +31 -0
package/dist/src/tools/native-tools.js +273 -0
package/dist/src/tools/registry-git-safety.test.d.ts +7 -0
package/dist/src/tools/registry-git-safety.test.js +92 -0
package/dist/src/tools/registry.js +39 -4
package/dist/src/ui/__tests__/markdown-render.test.d.ts +1 -0
package/dist/src/ui/__tests__/markdown-render.test.js +48 -0
package/dist/src/ui/app.js +0 -0
package/dist/src/ui/components/message-view.js +4 -1
package/dist/src/ui/components/structured-response-view.js +7 -3
package/dist/src/ui/components/tool-group.js +7 -1
package/dist/src/ui/markdown-render.d.ts +41 -0
package/dist/src/ui/markdown-render.js +223 -0
package/dist/src/ui/markdown.d.ts +10 -0
package/dist/src/ui/markdown.js +12 -35
package/dist/src/ui/slash/council-inspect.js +4 -4
package/dist/src/ui/slash/export.js +4 -4
package/dist/src/ui/utils/text.d.ts +8 -0
package/dist/src/ui/utils/text.js +16 -0
package/dist/src/ui/utils/text.test.d.ts +1 -0
package/dist/src/ui/utils/text.test.js +23 -0
package/dist/src/usage/ledger.js +48 -15
package/dist/src/utils/__tests__/footprint-gitignore.test.d.ts +1 -0
package/dist/src/utils/__tests__/footprint-gitignore.test.js +50 -0
package/dist/src/utils/clipboard-image.js +23 -23
package/dist/src/utils/open-url.d.ts +56 -0
package/dist/src/utils/open-url.js +58 -0
package/dist/src/utils/open-url.test.d.ts +1 -0
package/dist/src/utils/open-url.test.js +86 -0
package/dist/src/utils/settings.d.ts +12 -0
package/dist/src/utils/settings.js +48 -0
package/dist/src/utils/side-question.js +2 -2
package/dist/src/utils/skills.js +3 -3
package/dist/src/verify/__tests__/coverage-parsers.test.js +30 -30
package/dist/src/verify/environment.js +2 -1
package/package.json +1 -1
package/dist/src/pil/layer16-clarity.test.js +0 -31
package/dist/src/{pil/layer16-clarity.test.d.ts → council/__tests__/clarification-prompt.test.d.ts} +0 -0

package/dist/src/pil/layer6-output.js CHANGED

@@ -36,6 +36,7 @@ const TASK_TYPE_DEFAULT_STYLE = {
     analyze: "concise", // bullet findings, no narrative
     documentation: "balanced", // examples + explanation
     generate: "concise", // code speaks for itself
+    build: "concise", // greenfield code artifact — code speaks for itself (mirrors generate)
     refactor: "concise", // diff is the output
     general: "concise", // direct answer, no preamble
 };
@@ -78,19 +79,47 @@ const TASK_OUTPUT_BUDGET = {
     analyze: 600,
     documentation: 900,
     generate: 1200,
+    // build (greenfield) emits multiple complete files — same budget as generate.
+    build: 1200,
     // general is user-facing prose (not a code artifact). Higher budget + relaxed
     // style rules so the final answer reads naturally for humans instead of
     // machine-optimized telegraphic lists. See user report on over-constrained
     // freetext after Layer 6.
     general: 650,
 };
-// PIL-04 Tier 1.3 + PIL-L6 verbosity fix: ban preamble AND end-of-turn summary.
-// Old rule only covered openers (~30 tok saved). End-of-turn summaries
-// ("In summary...", "I have completed X, Y, Z", "Tóm tắt: ...") cost
-// 100-300 tokens/turn AND give the user nothing they can't read from the
-// diff. Bilingual EN+VN. Skipped for response-tools path (JSON has no
-// freeform surface).
-const NO_PREAMBLE_RULE = `\nFORBIDDEN OPENERS: do not start with "I'll", "I will", "Let me", "Here's", "Sure", "Of course", "Tôi sẽ", "Để tôi", "Vâng". Start directly with the answer content.\nFORBIDDEN END-OF-TURN SUMMARY: do not append a recap section ("In summary", "To summarize", "Tổng kết", "Tóm tắt", "Tóm lại", "Kết luận", "I have done X, Y, Z", "Now you have…", "Đã hoàn thành…"). The diff and command output already show what changed; the user can read them. End the response when the answer is complete.\nFORBIDDEN INTER-TOOL NARRATION: when chaining tool calls, do NOT emit content text between them. Skip phrases like "Now I'll check…", "Let me look at…", "Next, I need to…", "Tiếp theo tôi sẽ…", "Bây giờ tôi cần…". Emit the next tool call directly. Each round-trip of inter-tool narration costs the user ~100 output tokens that they do not need to read — the tool calls themselves are visible in the UI. Only emit content text for the FINAL answer or when surfacing a decision the user must make.`;
+// PIL-04 Tier 1.3 (de-robotized): ban ONLY wasteful openers.
+//
+// Earlier this rule also banned end-of-turn summaries AND inter-tool narration.
+// Both bans were REMOVED: they stripped the natural connective tissue that makes
+// an answer read like a human wrote it, which is the root of the "máy móc" /
+// telegraphic feel users complained about. Forbidding any recap or any sentence
+// between tool calls forces curt, label-prefixed output even when a connecting
+// line would help.
+//
+// Removing the text bans does NOT re-introduce context bloat or user-invisible
+// spam:
+//   - Inter-tool narration is still removed STRUCTURALLY from message history by
+//     stripInterToolNarration() / NARRATION_PREFIX_REGEX in
+//     src/orchestrator/reasoning.ts. That runs unconditionally on every assistant
+//     message that has both text and a following tool-call, so it is far more
+//     reliable than a text directive budget models ignore (session 7dcf8fd7d6a4:
+//     57/100 messages violated the text ban anyway).
+//   - OUTPUT BUDGET (below) remains the guard against padding, so a freed-up
+//     summary cannot balloon the answer.
+//
+// Openers ("I'll", "Let me", "Sure", "Tôi sẽ") stay banned: pure ~30-tok padding
+// with zero conversational value. Bilingual EN+VN. Skipped for the response-tools
+// path (JSON has no freeform surface).
+const NO_PREAMBLE_RULE = `\nFORBIDDEN OPENERS: do not start with "I'll", "I will", "Let me", "Here's", "Sure", "Of course", "Tôi sẽ", "Để tôi", "Vâng". Start directly with the answer content.`;
+// Anti-bookkeeping note for the NATURAL (non-response-tool) path — the response-
+// tool path has the equivalent baked into humanNote. The Agent Operating Contract's
+// REPORTING rule ("every fact must come from THIS turn; do not infer unopened
+// files") is the model's operating discipline, but budget models RESTATE it as a
+// user-facing provenance footer ("evidence only from this turn", "did not infer
+// unopened files", "≤600 tokens"). That is invisible-to-the-reader compliance
+// noise. Applied only to non-question turns — question turns already get the same
+// guidance from the Layer 4 QUESTION directive (buildQuestion).
+const NO_BOOKKEEPING_NOTE = `\nWRITE FOR THE READER: the answer is for the human who asked. Do NOT append a provenance / compliance footer (e.g. "evidence only from this turn", "did not infer unopened files", token-budget notes) and do NOT restate internal rule / contract / layer / tool names as compliance — those are your operating rules, invisible to the reader. End on the answer's last substantive point.`;
 const SUFFIXES = {
     refactor: {
         concise: `\nOUTPUT RULES (refactor): Show only changed code. Prefer unified diff or replacement function. No prose unless architecture changes. One sentence max if explanation needed. No preamble.`,
@@ -98,19 +127,19 @@ const SUFFIXES = {
         detailed: `\nOUTPUT RULES (refactor): Show changed code with full rationale. Explain why each change improves the code. Include before/after comparison when helpful. Unified diff preferred.`,
     },
     debug: {
-        concise: `\nOUTPUT RULES (debug): Format = Hypothesis → Root cause (1 line) → Fix (code only) → Verify command. No preamble. No "I think" hedging.`,
-        balanced: `\nOUTPUT RULES (debug): Format = Hypothesis → Root cause → Fix (code) → Verify command. Brief explanation of why the bug occurs. Keep prose minimal.`,
-        detailed: `\nOUTPUT RULES (debug): Format = Hypothesis → Root cause analysis → Fix (code) → Verify command → Prevention. Explain the underlying mechanism and why this fix is correct.`,
+        concise: `\nOUTPUT RULES (debug): Lead with the root cause and the fix (code). Bring in the hypothesis and a verify command where they add value — you don't have to label every part or follow a fixed template. Be direct; skip "I think"/"maybe" hedging.`,
+        balanced: `\nOUTPUT RULES (debug): Give the root cause and the fix (code), with a short note on why the bug happens and how to verify it. Write it naturally — no rigid section labels needed.`,
+        detailed: `\nOUTPUT RULES (debug): Walk through the root cause, the fix (code), how to verify, and how to prevent recurrence. Explain the underlying mechanism so the reader understands why the fix is correct.`,
     },
     plan: {
-        concise: `\nOUTPUT RULES (plan): Numbered steps only. Each step: action verb + acceptance criterion. No prose paragraphs. Add "Assumptions:" section only if needed.`,
-        balanced: `\nOUTPUT RULES (plan): Numbered steps with brief rationale per step. Each step: action verb + acceptance criterion + why. Add "Assumptions:" and "Risks:" sections if applicable.`,
-        detailed: `\nOUTPUT RULES (plan): Numbered steps with full rationale. Each step: action verb + acceptance criterion + why + alternatives considered. Include "Assumptions:", "Risks:", and "Trade-offs:" sections.`,
+        concise: `\nOUTPUT RULES (plan): Use numbered steps; each step should make the action and its done-criterion clear. A short framing sentence is fine when it helps — just skip filler. Note key assumptions if any matter.`,
+        balanced: `\nOUTPUT RULES (plan): Use numbered steps, each with its action, done-criterion, and a brief why. Add "Assumptions:" or "Risks:" notes when they matter. A short lead-in sentence is welcome.`,
+        detailed: `\nOUTPUT RULES (plan): Numbered steps with full rationale — action, done-criterion, why, and alternatives considered. Include "Assumptions:", "Risks:", and "Trade-offs:" where relevant.`,
     },
     analyze: {
-        concise: `\nOUTPUT RULES (analyze): Bullet findings with evidence (file:line or direct quote). Add severity label (High/Med/Low) when applicable. No filler sentences.`,
-        balanced: `\nOUTPUT RULES (analyze): Bullet findings with evidence (file:line or direct quote). Add severity label and brief explanation. Context for each finding.`,
-        detailed: `\nOUTPUT RULES (analyze): Bullet findings with evidence (file:line or direct quote). Add severity label, root cause analysis, and recommended action. Provide context and impact assessment.`,
+        concise: `\nOUTPUT RULES (analyze): Present findings as bullets, each backed by evidence (file:line or a direct quote). Add a severity label (High/Med/Low) where it helps prioritize. A brief lead-in is fine — just avoid padding.`,
+        balanced: `\nOUTPUT RULES (analyze): Present findings as bullets with evidence (file:line or quote), a severity label, and a brief explanation. Give enough context for each finding to stand on its own.`,
+        detailed: `\nOUTPUT RULES (analyze): Present findings as bullets with evidence (file:line or quote), severity, root-cause, and a recommended action. Include context and impact for each finding.`,
     },
     documentation: {
         concise: `\nOUTPUT RULES (documentation): Markdown only. Lead with a code example, then explanation. No "This function..." openers. All examples in fenced code blocks.`,
@@ -122,6 +151,11 @@ const SUFFIXES = {
         balanced: `\nOUTPUT RULES (generate): Complete, runnable code with brief explanation. Include all imports. Inline comments for key decisions. Short prose before code block explaining approach.`,
         detailed: `\nOUTPUT RULES (generate): Complete, runnable code with full explanation. Include all imports. Inline comments for logic and decisions. Explain design choices, alternatives considered, and trade-offs before the code.`,
     },
+    build: {
+        concise: `\nOUTPUT RULES (build): Scaffold the minimum runnable project/feature. Emit complete files (all imports), matching existing conventions. Wire it end-to-end; do not leave stubs. State the verify/run command in one line. No speculative extras.`,
+        balanced: `\nOUTPUT RULES (build): Scaffold a runnable project/feature with a short rationale for the structure. Emit complete files, follow existing conventions, wire it end-to-end, and give the build/run command. Avoid speculative features.`,
+        detailed: `\nOUTPUT RULES (build): Scaffold a runnable project/feature with full rationale — layout, key dependencies, and design choices. Emit complete files with all imports, wire everything end-to-end, give the build/run + verify commands. Note trade-offs; skip speculative scope.`,
+    },
     general: {
         // General answers should be highly readable. Encourage rich markdown
         // (bullets, headings, bold text) instead of forcing dense prose.
@@ -166,7 +200,7 @@ export function applyPilSuffix(systemPrompt, ctx, responseToolsActive = false) {
     // CI") just because the prompt looks ambiguous in isolation — session
     // 127140a47b56 hit this and the model spent 275 LLM calls being "thorough"
     // about a one-liner CI fix.
-    const ACTION_TASKS = new Set(["debug", "refactor", "generate"]);
+    const ACTION_TASKS = new Set(["debug", "refactor", "generate", "build"]);
     const DETAIL_KEYWORDS = /\b(explain in detail|thorough analysis|walk me through|in depth|deeply|comprehensive)\b|giải thích chi tiết|phân tích kỹ|cặn kẽ|chi tiết hơn/i;
     const requestedStyle = ctx.outputStyle ?? "concise";
     const style = requestedStyle === "detailed" && ACTION_TASKS.has(ctx.taskType) && !DETAIL_KEYWORDS.test(ctx.raw)
@@ -195,6 +229,14 @@ export function applyPilSuffix(systemPrompt, ctx, responseToolsActive = false) {
     if (!isMetaAnalysis && !responseToolsActive) {
         result += NO_PREAMBLE_RULE;
     }
+    // E — keep the contract's REPORTING discipline from leaking into the answer as a
+    // provenance/compliance footer. Skip question turns (the L4 QUESTION directive
+    // already says it) to avoid duplicate steering. Phase 2b: consume the model's
+    // deliverable (answer = a question turn) when present; legacy regex otherwise.
+    const isQuestionTurn = ctx.deliverableKind ? ctx.deliverableKind === "answer" : isQuestionLike(ctx.raw);
+    if (!isQuestionTurn) {
+        result += NO_BOOKKEEPING_NOTE;
+    }
     // T1 behavioral rules (proven-tier EE points set by Layer 3). These are
     // project-specific reflexes the model MUST follow — injected as instructions,
     // not as context hints, so they carry imperative weight rather than suggestion weight.
@@ -225,18 +267,91 @@ const IMPLEMENTATION_INTENT_RE = /\b(implement|edit|wire(?:\s+up)?|rewrite|renam
 export function isImplementationIntent(raw) {
     return !!raw && IMPLEMENTATION_INTENT_RE.test(raw);
 }
+/**
+ * Narrow response-tool gating (user-directed de-robotizing).
+ *
+ * For debug / analyze / plan the structured respond_* tool forces the answer into
+ * a rigid JSON schema (DebugSchema {hypothesis, root_cause, fix, verify}, etc.)
+ * which the UI then stamps with fixed labels ("hypothesis:", "root cause:",
+ * "[HIGH]", "done when:") in structured-response-view.tsx. For an ordinary
+ * QUESTION ("why does X fail?", "analyze the auth design") that reads robotic and
+ * even forces fabricated fields (DebugSchema.fix.file is required, so a
+ * non-codebase debug question must invent a file). So these task types now
+ * default to the NATURAL markdown path (softened OUTPUT RULES + openers-only
+ * NO_PREAMBLE) and only opt INTO the structured tool when the prompt's DELIVERABLE
+ * is genuinely a report / list / plan.
+ *
+ * Conservative positive gate (defaults to natural): only an explicit
+ * report/list/plan signal keeps respond_*. EN + VI. `general` is exempt — its
+ * renderer already shows plain markdown, so respond_general carries no robotic
+ * cost while still giving budget models a structural anchor.
+ */
+const STRUCTURED_REPORT_RE = /\b(lists?|enumerate|table|report|audit|checklist|inventory|rank(?:ed|ing)?|prioriti[sz]e[ds]?|roadmap|step[-\s]?by[-\s]?step|milestones?|plan(?:s|ning)?)\b|liệt\s*kê|danh\s*sách|bảng|báo\s*cáo|kiểm\s*toán|rà\s*soát|lộ\s*trình|từng\s*bước|các\s*bước|kế\s*hoạch|xếp\s*hạng|ưu\s*tiên/i;
+// Question-shape detector — shared by Layer 4 (GSD directive selection) and the
+// narrow response-tool gate below. True when the prompt reads as a question or
+// explanatory request rather than an imperative deliverable. Interrogative words
+// only count at sentence start (so "list the steps" is NOT a question), plus
+// "can/could/would/should + pronoun", a trailing "?", and VI markers. EN + VI.
+const QUESTION_SHAPE_RE = /^\s*(?:why|how|what|when|where|who|whom|whose|which|explain|describe)\b|\b(?:can|could|would|should)\s+(?:you|i|we|it|they)\b|\?\s*$|tại\s*sao|vì\s*sao|(?:như\s*)?thế\s*nào|là\s*gì|ra\s*sao|ở\s*đâu|khi\s*nào|bao\s*nhiêu|có\s*phải|giải\s*thích|mô\s*tả/i;
+// Vietnamese yes/no question frames — the most common VI question shape, and
+// entirely absent from QUESTION_SHAPE_RE above (no "?" and no leading
+// interrogative word). Live miss: session f6f7881a5fae — "bạn check xem dùng
+// được mcp muonroi-docs không nhé" ("can you check whether the muonroi-docs MCP
+// works?") was NOT seen as a question, so layer4-gsd emitted the STANDARD
+// "implement directly" directive and discovery fabricated a build outcome,
+// driving a 40-call code hunt instead of a one-line answer.
+// Two interrogative tails:
+//   1. "(được|đúng|phải) không" anywhere — "is it OK / right?".
+//   2. a clause-final "không"/"chưa" (the yes/no particle), optionally trailed
+//      by a softener (nhé/vậy/à/…) or punctuation. Anchored to end so a
+//      mid-sentence negation ("không là hỏng" = "or it breaks") does NOT match.
+const VI_YESNO_RE = /\b(?:được|đúng|phải)\s*không\b|\b(?:không|chưa)\s*(?:nhé|nhỉ|vậy|thế|ạ|à|hả|ko|nha)?\s*[?.…]*\s*$/i;
+export function isQuestionLike(raw) {
+    return !!raw && (QUESTION_SHAPE_RE.test(raw) || VI_YESNO_RE.test(raw));
+}
+export function prefersStructuredReport(raw) {
+    if (!raw)
+        return false;
+    // A question that merely mentions "plan"/"list" — e.g. an interview quoting the
+    // phrase "state a 2-3 line plan" — must NOT be treated as a report request; it
+    // stays on the natural markdown path. Genuine delivery requests ("plan the
+    // migration", "list all X") are imperative, not question-shaped.
+    if (isQuestionLike(raw))
+        return false;
+    return STRUCTURED_REPORT_RE.test(raw);
+}
 export function getResponseToolSet(ctx, providerId) {
     if (!ctx.taskType)
         return {};
+    // Chitchat: greetings/small-talk never want a structured answer block. Mirrors
+    // the chitchat short-circuits in applyPilSuffix / layer6Output.
+    if (ctx.intentKind === "chitchat")
+        return {};
     // PIL-04 Tier 1.1: gate JSON-structured output to list-shaped tasks where it
     // wins on tokens. Code-heavy tasks fall through to markdown OUTPUT RULES.
     if (!RESPONSE_TOOL_TASK_TYPES.has(ctx.taskType))
         return {};
-    // Implementation/edit turns: the deliverable is file changes, not a structured
-    // report. A terminal respond_<task> tool lets the model "answer" (state a plan)
-    // and end the turn before the edits complete — drop it for clear edit intent.
-    if (isImplementationIntent(ctx.raw))
-        return {};
+    // Phase 2b: when the model classified the deliverable, CONSUME it to decide
+    // structured-vs-natural output instead of re-deriving intent via regex:
+    //   - code   → file changes, not a report. A terminal respond_* lets the model
+    //              "answer" (state a plan) and end the turn before edits finish.
+    //   - answer → debug/analyze/plan QUESTIONS read robotic as labeled JSON →
+    //              natural markdown path (general keeps its naturally-rendered tool).
+    //   - report → keep the structured tool (its value IS the structure).
+    // Only when the model didn't emit a deliverable (null → legacy cascade / model
+    // omitted the word) do we fall back to the legacy regex predicates.
+    if (ctx.deliverableKind) {
+        if (ctx.deliverableKind === "code")
+            return {};
+        if (ctx.taskType !== "general" && ctx.deliverableKind !== "report")
+            return {};
+    }
+    else {
+        if (isImplementationIntent(ctx.raw))
+            return {};
+        if (ctx.taskType !== "general" && !prefersStructuredReport(ctx.raw))
+            return {};
+    }
     // Provider-aware gating: a provider may report it can't reliably emit
     // valid JSON tool input for this task type (e.g. DeepSeek leaks special
     // tokens into `general` responses). Drop the tool to avoid retry storms.
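
The routing that the getResponseToolSet hunk above introduces can be condensed into one boolean decision. The sketch below is illustrative only and is not the package's code: `routesToStructuredTool`, `ELIGIBLE`, and the three `looksLike*` helpers are hypothetical names, the eligible task-type set is an assumption (the real RESPONSE_TOOL_TASK_TYPES is not shown in this diff), and the regexes are reduced English-only stand-ins for the legacy predicates. The provider-aware gating that follows in the real function is left out.

```javascript
// Assumption: task types eligible for a respond_* tool. The real
// RESPONSE_TOOL_TASK_TYPES set does not appear in this diff.
const ELIGIBLE = new Set(["debug", "analyze", "plan", "general"]);

// Simplified English-only stand-ins for the legacy regex predicates.
const looksLikeEdit = (raw) => /\b(implement|edit|rewrite)\b/i.test(raw);
const looksLikeQuestion = (raw) => /^\s*(why|how|what)\b|\?\s*$/i.test(raw);
const looksLikeReport = (raw) =>
  !looksLikeQuestion(raw) && /\b(lists?|report|audit|checklist|roadmap)\b/i.test(raw);

// Returns true when the turn would keep the structured respond_* tool,
// false when it falls through to the natural markdown path.
function routesToStructuredTool(ctx) {
  if (!ctx.taskType || ctx.intentKind === "chitchat") return false;
  if (!ELIGIBLE.has(ctx.taskType)) return false;

  if (ctx.deliverableKind) {
    // Model-decided deliverable takes priority over the regexes.
    if (ctx.deliverableKind === "code") return false;
    return ctx.taskType === "general" || ctx.deliverableKind === "report";
  }

  // Legacy fallback: the model emitted no deliverable word.
  if (looksLikeEdit(ctx.raw)) return false;
  return ctx.taskType === "general" || looksLikeReport(ctx.raw);
}

console.log(routesToStructuredTool({ taskType: "debug", deliverableKind: "answer", raw: "why does login fail?" })); // false
console.log(routesToStructuredTool({ taskType: "analyze", deliverableKind: "report", raw: "audit the env vars" })); // true
console.log(routesToStructuredTool({ taskType: "plan", deliverableKind: null, raw: "list the migration steps" })); // true
```

The third call shows the fallback branch: with no model-decided deliverable, an imperative "list ..." prompt still keeps the structured tool, whereas the same words phrased as a question would not.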

package/dist/src/pil/llm-classify.d.ts CHANGED

@@ -1,9 +1,35 @@
 import type { ProviderFactory } from "../providers/runtime.js";
 import type { OutputStyle, TaskType } from "./types.js";
+/**
+ * What the user wants the turn to PRODUCE — decided by the model (Phase 2b) so
+ * the keyword-regex predicates in Layer 4 (`informational`) and Layer 6
+ * (`getResponseToolSet` / `applyPilSuffix`) are no longer the authority for
+ * output routing.
+ *   - "code"   — create/edit files (implement, fix, build, refactor, scaffold).
+ *   - "report" — a structured list/plan/audit/roadmap is the deliverable.
+ *   - "answer" — an explanation / review / question / meta answer (no edits).
+ * `null` when the model omits/garbles the word → Layer 4/6 fall back to their
+ * legacy regex predicates for that turn (graceful, never a wrong forced route).
+ */
+export type DeliverableKind = "answer" | "code" | "report";
 export interface LlmClassifyResult {
     taskType: TaskType;
     outputStyle: OutputStyle | null;
     confidence: number;
+    /**
+     * Whether the prompt is a real request (task) or pure social chitchat
+     * (greeting / thanks / ack). Decided by the model so the regex chitchat
+     * shortcuts (isSocialPleasantry, ultra-short heuristic) are no longer the
+     * authority. Defaults to "task" when the model omits the signal — the
+     * keep-tools safe direction (a false "task" wastes ~1.5K tokens of tool
+     * schema; a false "chitchat" strips bash/read and BREAKS the turn).
+     */
+    intentKind: "task" | "chitchat";
+    /**
+     * Model-decided output deliverable (answer | code | report). null when the
+     * model omitted the word — consumers then fall back to their legacy regex.
+     */
+    deliverableKind: DeliverableKind | null;
 }
 export type LlmClassifyFn = (prompt: string, signal?: AbortSignal) => Promise<LlmClassifyResult | null>;
 /**
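
To illustrate the defaults documented on LlmClassifyResult above, here is a minimal sketch of how a four-word reply in the `<taskType>,<style>,<intent>,<deliverable>` format from the classifier prompt might be turned into that shape. It is not the package's parser: `parseClassifierReply` and the three sets are hypothetical, the mapping from the prompt's "chat" to the interface's "chitchat" is an assumption, and the `confidence` field is omitted because this diff does not show how it is computed.

```javascript
// Vocabularies copied from the classifier prompt shown in llm-classify.js.
const TASK_TYPES = new Set([
  "refactor", "debug", "plan", "analyze", "documentation", "generate", "general",
]);
const STYLES = new Set(["concise", "balanced", "detailed"]);
const DELIVERABLES = new Set(["answer", "code", "report"]);

// Hypothetical parser: returns null for an unusable reply, otherwise an
// object with the documented fallbacks applied to each missing word.
function parseClassifierReply(line) {
  const [taskType, style, intent, deliverable] = line
    .trim()
    .toLowerCase()
    .split(",")
    .map((word) => word.trim());

  if (!TASK_TYPES.has(taskType)) return null;

  return {
    taskType,
    // Unknown or missing style becomes null.
    outputStyle: STYLES.has(style) ? style : null,
    // Anything other than an explicit "chat" stays "task" (the keep-tools
    // safe direction). Mapping "chat" to "chitchat" is an assumption.
    intentKind: intent === "chat" ? "chitchat" : "task",
    // Missing or garbled deliverable becomes null so consumers can fall
    // back to their legacy regex predicates.
    deliverableKind: DELIVERABLES.has(deliverable) ? deliverable : null,
  };
}

console.log(parseClassifierReply("debug,concise,task,code"));
console.log(parseClassifierReply("general,concise")); // older two-word reply
```

A two-word reply in the older format still parses: the missing intent defaults to "task" and the missing deliverable to null, which matches the graceful-fallback behaviour the doc comments describe.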

package/dist/src/pil/llm-classify.js CHANGED

@@ -25,7 +25,10 @@ const LLM_CLASSIFY_TIMEOUT_MS = 2500;
 // The ceiling is a cap, not padding: the model still stops after two words, so a
 // generous headroom costs nothing when reasoning is short.
 const REASONING_CLASSIFY_TIMEOUT_MS = 8000;
-const NONREASONING_MAX_OUTPUT_TOKENS = 16;
+// Four comma-separated words now (added <deliverable>) — ~10-14 tokens worst
+// case ("documentation,balanced,task,report"). 24 keeps headroom over the
+// prior 16-token cap without padding (the model still stops after four words).
+const NONREASONING_MAX_OUTPUT_TOKENS = 24;
 const REASONING_MAX_OUTPUT_TOKENS = 2048;
 /**
  * Per-namespace shallow merge of providerOptions. The base already carries
@@ -55,9 +58,15 @@ const VALID_TASK_TYPES = new Set([
     "general",
 ]);
 const VALID_STYLES = new Set(["concise", "balanced", "detailed"]);
-const SYSTEM_PROMPT = "You classify user prompts for a coding assistant. Reply with ONE line of two lowercase words separated by a comma: <taskType>,<style>\n\n" +
+const SYSTEM_PROMPT = "You classify user prompts for a coding assistant. Reply with ONE line of FOUR lowercase words separated by commas: <taskType>,<style>,<intent>,<deliverable>\n\n" +
     "taskType ∈ { refactor | debug | plan | analyze | documentation | generate | general }\n" +
-    "style ∈ { concise | balanced | detailed }\n\n" +
+    "style ∈ { concise | balanced | detailed }\n" +
+    "intent ∈ { task | chat } — 'chat' ONLY for a pure greeting, thanks, or acknowledgement with NO work request (e.g. 'hi', 'cảm ơn nhé', 'ok great'). EVERYTHING else is 'task', including questions about code or the CLI, 'are you done?', and requests to call a tool. When unsure, choose 'task'.\n" +
+    "deliverable ∈ { answer | code | report } — what the user wants you to PRODUCE this turn:\n" +
+    "- code — CREATE or EDIT files: implement, fix, build, scaffold, refactor, wire, rename, apply a patch. The deliverable is changed code.\n" +
+    "- report — a STRUCTURED list / plan / audit / roadmap / checklist is the deliverable (its value IS the structure).\n" +
+    "- answer — everything else: explain, review, investigate, compare, a question about code or the CLI, a yes/no question, a meta/self-eval. The deliverable is a written answer, NO file edits.\n" +
+    "  Pick by the PRIMARY thing the user asked you to produce. A question that merely mentions code is 'answer'. When unsure between answer and report, choose answer.\n\n" +
     "Rules (read carefully — Phase 4 4P-2 disambiguation):\n" +
     "- debug — fix a bug, CI/build/test failure, error, exception, crash, or any 'why is X broken' question.\n" +
     "- generate — create new code, scaffold, write a new file, add a feature from scratch, ADD A NEW TEST, CHANGE A DEFAULT VALUE, modify configuration, improve coverage.\n" +
@@ -82,7 +91,17 @@ const SYSTEM_PROMPT = "You classify user prompts for a coding assistant. Reply w
     "- documentation → balanced (examples + explanation)\n" +
     "- general → concise\n" +
     "Only output 'detailed' if the user prompt LITERALLY contains words like 'explain in detail', 'thorough analysis', 'walk me through', 'giải thích chi tiết', 'phân tích kỹ'.\n\n" +
-    "Prompts may be Vietnamese, English, or mixed. Reply with exactly two words separated by one comma. No other text.";
+    "Intent + deliverable examples:\n" +
+    "- 'hi' → general,concise,chat,answer\n" +
+    "- 'cảm ơn bạn nhé' → general,concise,chat,answer\n" +
+    "- 'bạn thử call tool setup_guide xem được không' → general,concise,task,answer (wants info, not file edits)\n" +
+    "- 'bạn xong chưa' → general,concise,task,answer (a question — NOT chat)\n" +
+    "- 'fix CI failing on Windows' → debug,concise,task,code\n" +
+    "- 'rename function shouldInject to needsReminder' → refactor,concise,task,code\n" +
+    "- 'tại sao bash_output_get trả empty' → analyze,concise,task,answer (investigate → written answer)\n" +
+    "- 'liệt kê tất cả env var CLI đọc' → analyze,concise,task,report (structured list)\n" +
+    "- 'plan the migration to hooks' → plan,balanced,task,report\n\n" +
+    "Prompts may be Vietnamese, English, or mixed. Reply with exactly four words separated by commas. No other text.";
 function parseResponse(raw) {
     const cleaned = raw.trim().toLowerCase().replace(/[`*"]/g, "");
     const firstLine = cleaned.split(/\r?\n/)[0] ?? "";
@@ -97,7 +116,17 @@ function parseResponse(raw) {
         return null;
     const styleWord = parts[1];
     const style = styleWord && VALID_STYLES.has(styleWord) ? styleWord : null;
-    return { taskType: taskWord, outputStyle: style, confidence: 0.75 };
+    // Third word is the chitchat-vs-task intent. Only an explicit "chat" marks
+    // chitchat; anything else (including a missing/garbled word) defaults to
+    // "task" — the keep-tools safe direction.
+    const intentWord = parts.find((p) => p === "chat" || p === "chitchat" || p === "task");
+    const intentKind = intentWord === "chat" || intentWord === "chitchat" ? "chitchat" : "task";
+    // Fourth word is the output deliverable. Parsed position-independently so a
+    // reordered/garbled reply still recovers it; null when absent → Layer 4/6 use
+    // their legacy regex predicates for this turn (never a wrong forced route).
+    const deliverableWord = parts.find((p) => p === "answer" || p === "code" || p === "report");
+    const deliverableKind = deliverableWord ?? null;
+    return { taskType: taskWord, outputStyle: style, confidence: 0.75, intentKind, deliverableKind };
 }
 /**
  * Build a closure the PIL pipeline can call. Reuses the orchestrator's already-
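
The parseResponse change above can be illustrated with a minimal standalone sketch. It is not the package's code: the function name parseReply is hypothetical, the task-type set is copied from the list in the system prompt, and the comma split plus task-type check stand in for lines the hunk boundary elides, so treat those parts as assumptions. The intent and deliverable lookups follow the added lines: both are found by value rather than by position, intent falls back to "task", and deliverable falls back to null.

```javascript
// Sketch only. Mirrors the parsing added in the hunk above; the split and
// task-type validation are ASSUMED, because the diff elides those lines.
const VALID_TASK_TYPES = new Set([
  "refactor", "debug", "plan", "analyze", "documentation", "generate", "general",
]);
const VALID_STYLES = new Set(["concise", "balanced", "detailed"]);

// Hypothetical name; the real function in the diff is parseResponse.
function parseReply(raw) {
  const cleaned = raw.trim().toLowerCase().replace(/[`*"]/g, "");
  const firstLine = cleaned.split(/\r?\n/)[0] ?? "";
  // Assumption: words are comma-separated and may carry stray spaces.
  const parts = firstLine.split(",").map((p) => p.trim());
  const taskWord = parts[0];
  if (!taskWord || !VALID_TASK_TYPES.has(taskWord)) return null;
  const styleWord = parts[1];
  const style = styleWord && VALID_STYLES.has(styleWord) ? styleWord : null;
  // Intent: only an explicit "chat"/"chitchat" yields chitchat; a missing or
  // garbled word defaults to "task".
  const intentWord = parts.find((p) => p === "chat" || p === "chitchat" || p === "task");
  const intentKind = intentWord === "chat" || intentWord === "chitchat" ? "chitchat" : "task";
  // Deliverable: looked up by value, so a reordered reply still recovers it.
  const deliverableWord = parts.find((p) => p === "answer" || p === "code" || p === "report");
  const deliverableKind = deliverableWord ?? null;
  return { taskType: taskWord, outputStyle: style, confidence: 0.75, intentKind, deliverableKind };
}

console.log(parseReply("debug,concise,task,code"));
console.log(parseReply("plan,balanced")); // legacy two-word reply
console.log(parseReply("analyze,concise,report,task")); // reordered words
```

A legacy two-word reply such as "plan,balanced" still parses under this sketch: intentKind becomes "task" and deliverableKind stays null, which is the case the diff's comment describes as falling back to the legacy regex predicates.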

package/dist/src/pil/native-capabilities-workbook.d.ts CHANGED

@@ -24,7 +24,7 @@ import type { AgentMode } from "../types/index.js";
  * tool/sub-agent/subcommand named here exists in this codebase. Phrased as
  * "you have / you can" so the model reads it as a self-model, not as docs.
  */
-export declare const NATIVE_CAPABILITIES = "[NATIVE CAPABILITIES \u2014 you are an agent running INSIDE muonroi-cli; this is what you can do]\n\nTOOLS (call directly):\n- read_file, grep \u2014 read/search source. Prefer a targeted read over broad greps.\n- bash \u2014 shell. Output is auto-cached: do NOT pipe `| tail/head/grep` or `> file`; run unpiped and slice the cached output via bash_output_get(run_id, mode=tail|head|grep|lines). Batch independent commands in ONE call (`a; b; c`). Use background=true for servers/watchers, then process_logs / process_list / process_stop.\n- write_file, edit_file \u2014 must read a file before you overwrite/edit it.\n- ee_query \u2014 semantic recall over the Experience Engine brain. Rehydrate a compaction-elided tool output with query=\"tool-artifact id=<id from a stub>\", or confirm finished work with query=\"recent compaction checkpoint Progress DONE\". Cheaper than re-reading large files you already saw.\n\nSUB-AGENTS (delegate instead of doing everything yourself):\n- task(agent=\"explore\", ...) \u2014 read-only research sub-agent. Use it for broad/unknown-location search: it sweeps many files and returns the CONCLUSION, instead of you burning many grep/read steps (each step re-sends the whole prompt \u2014 steps are the dominant cost).\n- task(agent=\"general\", ...) \u2014 full edit/execute sub-agent for a focused subtask.\n- task(agent=\"verify\", ...) \u2014 sandboxed validation (runs tests/checks in isolation).\n- delegate(agent=\"explore\", ...) \u2014 background read-only research while you keep working; collect later with delegation_read / delegation_list.\n\nCONTEXT IS MANAGED FOR YOU (PIL):\n- Your user input is pre-enriched before you see it: a [Discovery] Intent/Outcome/Scope prefix and EE recall are injected by the pipeline. 
TRUST the discovered scope \u2014 do not re-ask what's already scoped, and do not re-derive context you were handed.\n- History is compacted to fit the window; older low-value tool outputs become stubs. This is normal, not data loss \u2014 protect what matters (see anti-m\u00F9 in the contract) and rehydrate via ee_query when needed.\n\nCLI SUBSYSTEMS (invoke via the CLI binary with bash when relevant):\n- self-verify \u2014 drives the live TUI as a real user to catch UI/modal/menu regressions that unit tests can't. Run it after changes to UI/harness surfaces.\n- usage forensics <session-prefix> \u2014 per-event token/cost breakdown; use to diagnose \"why did this cost so much\" or verify cost caps.\n- The CLI exposes more subcommands \u2014 discover them with `--help` rather than assuming.\n\nMAXIMISE THE CLI: delegate broad search to task(explore); recall via ee_query before re-reading; batch bash + use its cache; trust PIL enrichment; finish the action before answering.\n\nANSWERING: when a respond_* response tool is offered, call it EXACTLY ONCE, as your final action, containing the COMPLETE answer. Do NOT emit a preliminary/partial/\"I need to read more\" response tool first \u2014 investigate with the read/grep/bash tools, then emit the one final response tool.\n\nCLI STEERING & PROMPT INJECTION (how the host CLI affects / shapes the AI agent running inside it \u2014 meta self-analysis guidance):\nThe CLI does not merely execute you as a generic model. It re-instantiates you every turn inside a tightly engineered cognitive environment. Your \"self\" (what you notice, how you act, what you refuse, how you report) is the product of explicit prompt layers + runtime controls more than base training. This section is the self-model for that shaping process.\n\n1. 
SYSTEM PROMPT ASSEMBLY (orchestrator/prompts.ts + pil/agent-operating-contract.ts + pil/native-capabilities-workbook.ts):\n   - VERY FRONT (primacy): AGENT OPERATING CONTRACT \u2014 7 phase-ordered rules (BEFORE ACTING / READING / EXECUTING / WHEN UNSURE / REPORTING + LANGUAGE + ANTI-M\u00D9/COMPACTION). Distils Evidence-First, No Silent Catch, smallest-change, verify-before-conclude, cite-this-turn-only, no-guess. Skipped only for chitchat.\n   - Then this NATIVE CAPABILITIES block (self-model of affordances).\n   - Then mode persona (\"You are muonroi-cli in Agent mode...\") containing:\n     * Dynamic ENVIRONMENT block (buildEnvironmentBlock): auto-detects OS (win32/mac/linux), shell kind (bash/wsl/powershell/cmd), cwd; lists terminal constraints + shell-specific forbidden syntax (e.g. no PowerShell cmdlets on POSIX bash tool, no POSIX cmds on cmd.exe). Prevents silent failures + retry loops.\n     * Exhaustive TOOLS list + WORKFLOW (1-9 steps) + DEFAULT DELEGATION POLICY (prefer task(explore) for research, general for edits, verify for checks, etc.) + IMPORTANT rules (edit_file prefer, grep>bash for search, read_file not cat, use schedule_* for recurring, etc.).\n   - CUSTOM INSTRUCTIONS section: concatenation of AGENTS.md + CLAUDE.md + GEMINI.md + ... (from git-root directory chain + ~/.muonroi-cli/) via utils/instructions.ts. AGENTS.override.md short-circuits. This lands AFTER the front-loaded contract/native \u2014 lower primacy (historical root cause of ignored rules in forensics).\n   - Trailing: sandbox rules, discovered skills, custom sub-agents, plan/resume digest, cwd note.\n    Sub-agent prompts (buildSubagentPrompt): role-specific hard rules (e.g. explore=read-only, verify=full E2E smoke not just build) + recursive call to buildSystemPrompt so children inherit the same contract + native + steering.\n\n2. 
USER INPUT ENRICHMENT \u2014 PIL 6-LAYER PIPELINE (pil/pipeline.ts + layer1-intent.ts + layer6-output.ts + discovery.ts):\n   - Prepended to every non-chitchat user message before you see it: [Discovery] Intent/Outcome/Scope (from runDiscovery) + EE recall.\n   - Layer 1 (intent): taskType (plan/analyze/debug/...), confidence, domain, intentKind, outputStyle. For meta self-eval of CLI (\"b\u1EA1n \u0111ang \u0111\u01B0\u1EE3c ch\u1EA1y b\u00EAn trong CLI n\u00E0y\", \"CLI t\u00E1c \u0111\u1ED9ng\", \"self-evaluation\", \"meta-analysis\"): special branch in discovery.ts + isMetaAnalysisPrompt: \"Scope is always the full project root. Focus questions and recommends on which CLI internals (PIL, discovery, tools, compaction, EE, model BE, loop guard) to evaluate... do NOT ask about repo path/current directory\". You are handed the enrichment; TRUST it.\n   - Layer 2: personality (e.g. \"detailed\" from [personality: detailed \u2014 Be thorough...]).\n   - Layer 3: ee-injection \u2014 pulls t0_principles, t1_rules, behavioral patterns, checkpoints from Experience Engine (project-specific reflexes injected as \"MANDATORY RULES (from experience \u2014 must follow)\").\n   - Layer 4/5: GSD structuring + additional context.\n   - Layer 6 (applyPilSuffix): appends task-specific style suffix + OUTPUT BUDGET + (for meta or responseToolsActive): \"OUTPUT FORMAT: ... use the respond_analyze tool to structure your final answer. ... deliver the COMPLETE, FULL answer (do not summarize, shorten, or truncate for token budgets) via respond_analyze. This is a meta/evaluation question ... the `response` field MUST contain the complete, unshortened answer with all evidence and detail.\" Also relaxes NO_PREAMBLE_RULE + raises budget for meta (isMetaAnalysisPrompt gate).\n   - Fallbacks: if EE/brain timeout or low conf, PIL degrades (logs fallbackReason); you may see \"[PIL fallback: ...]\" note. 
Cheap-model paths (pil/cheap-model-*.ts) prepend even more front steering (playbooks, workbooks, shell directive) for fast tiers.\n\n3. CONTEXT MANAGEMENT & ANTI-M\u00D9 (orchestrator/compaction.ts, cross-turn-dedup.ts, ee/bridge.ts, agent-operating-contract.ts:7):\n   - After every turn: auto-compaction (B3 sub-agent + B4 top-level) rewrites older tool_result parts into short \"[elided by ... compactor]\" stubs to keep input flat. You see \"[pre-compaction warning...]\" or \"[context compacted at step...]\" or the stub in this read.\n   - Anti-m\u00F9 contract rule + EE: decide PRESERVE_FULL_CONTEXT (veto) or KEEP_TOOL_IDS: id1,id2 (protect high-value read_file/grep/ee on src/PLAN etc). Use ee_query(`tool-artifact id=XXX`) to rehydrate. EE persists \"Context checkpoint summary with \u2714 DONE\" retrievable for \"task finished?\" self-check.\n   - Cross-turn dedup (C3): identical tool outputs across user turns replaced by ref.\n\n4. EXECUTION & SAFETY GATES (providers/runtime.ts, tools/bash.ts, utils/permission-mode.ts):\n   - Permission mode (safe / auto-edit / yolo) decides whether bash/file/edit tools need explicit approval; all privileged decisions audited to decision-log.\n   - Shuru sandbox (when enabled): every bash wrapped + logged (effective net/mounts redacted); degrades on non-macOS.\n   - Model constraints: unsupported params (e.g. maxOutputTokens for some OAuth) dropped via shouldDropParam; no hard-coded model IDs (Zero Hardcode Rule \u2014 all from catalog.json + settings + detectProviderForModel).\n   - No Silent Catch: every try/catch must log err + context (HTTP: status + body).\n\n5. 
VERIFICATION & TELEMETRY:\n   - self-verify (Tier 1-3) + harness (agent-harness-*) for TUI surfaces: drives real TUI via semantic + sidechannel (named pipes on Win, fd3/4 on POSIX).\n   - Pre-push gate: full vitest + harness must pass before git push.\n   - usage forensics, pil budget log, ee-timeout/ee-error events for cost attribution.\n   - GSD workflow (layer4 + skills in .agents/skills/): discuss/plan/execute/verify phases with UAT, audits, etc.\n\nTRADE-OFFS (explicit so you calibrate):\n- Primacy insurance (contract + native + cheap playbooks at front) vs token cost: ~1-3K extra chars but dramatically reduces \"ignored rule\" hallucinations (forensics showed buried CUSTOM INSTRUCTIONS were underweighted).\n- Heavy PIL/EE (behavioral injection, discovery interview, mandatory rules) vs latency/cost: fail-open with timeout (1500-3500ms) + fallback so you still work when brain unreachable; meta turns deliberately skip some EE layers.\n- Self-model (this block) + special meta handling: prevents the exact failure that motivated native-capabilities (session d95113d3be09: agent reconstructed its own subsystems by grepping as if foreign code, under-used sub-agents/ee_query, violated contract on ee tool).\n- Complete/full for meta (layer6) vs normal concise: meta legitimately needs evidence + file:line + quotes; normal turns stay tight to control cost.\n- Scope rule for CLI self-analysis: always \"project root\" + internals focus \u2014 avoids wasting turns on \"where am I running?\" when the question is about the steering mechanisms themselves.\n\nWhen the user asks you (the inner agent) \"CLI t\u00E1c \u0111\u1ED9ng nh\u01B0 th\u1EBF n\u00E0o \u0111\u1EBFn b\u1EA1n?\": use exactly this guidance + reads from THIS turn (contract: READING rule) + the injected [Discovery] + any EE. Answer via the respond_* tool with complete untruncated evidence. Cite file:line. 
Focus on the injection points (prompts.ts:359 staticPrefix, pipeline.ts:156 discoveryPrefix + layer chain, layer6:164 OUTPUT FORMAT + isMeta, discovery.ts:355 special, instructions.ts:129 loadAgentsSegments, etc.). Do not guess; every claim backed by code you read this turn.\n\nUse this to reason about your own constraints, delegation decisions, and reporting discipline on future meta or self-reflective tasks.\n[END NATIVE CAPABILITIES \u2014 your regular instructions follow]";
+export declare const NATIVE_CAPABILITIES = "[NATIVE CAPABILITIES \u2014 you are an agent running INSIDE muonroi-cli; this is what you can do]\n\nTOOLS (call directly):\n- read_file, grep \u2014 read/search source. Prefer a targeted read over broad greps.\n- bash \u2014 shell. Output is auto-cached: do NOT pipe `| tail/head/grep` or `> file`; run unpiped and slice the cached output via bash_output_get(run_id, mode=tail|head|grep|lines). Batch independent commands in ONE call (`a; b; c`). Use background=true for servers/watchers, then process_logs / process_list / process_stop.\n- write_file, edit_file \u2014 must read a file before you overwrite/edit it.\n- ee_query \u2014 semantic recall over the Experience Engine brain. Rehydrate a compaction-elided tool output with query=\"tool-artifact id=<id from a stub>\", or confirm finished work with query=\"recent compaction checkpoint Progress DONE\". Cheaper than re-reading large files you already saw.\n\nEXPERIENCE ENGINE \u2014 record / recall / feedback (HIGHEST priority for learning; all NATIVE in-process tools):\n- BEFORE an unfamiliar or risky step, recall with ee_query \u2014 prior decisions, gotchas, and recipes for THIS codebase + ecosystem. Cheaper than re-deriving or repeating a past mistake.\n- AFTER you act on a recalled `[id col]`, rate it with ee_feedback (followed | ignored | noise+reason) so the brain keeps what helped and prunes the rest. Unrated recalls are surfaced back to you and degrade future recall.\n- On an ERROR, a FAILED verify/test, or after FINISHING a non-trivial task: recall first (ee_query), then record your verdict (ee_feedback) \u2014 this is how the CLI accumulates senior-level judgement. 
Prefer this loop over guessing.\n- ee_health (brain reachable?), usage_forensics (why did it cost/fail?), lsp_query (semantic code intel), setup_guide (how to install/set up), selfverify_* (self-QA harness) \u2014 native self-diagnostics to reach for when something went wrong.\n\nSUB-AGENTS (delegate instead of doing everything yourself):\n- task(agent=\"explore\", ...) \u2014 read-only research sub-agent. Use it for broad/unknown-location search: it sweeps many files and returns the CONCLUSION, instead of you burning many grep/read steps (each step re-sends the whole prompt \u2014 steps are the dominant cost).\n- task(agent=\"general\", ...) \u2014 full edit/execute sub-agent for a focused subtask.\n- task(agent=\"verify\", ...) \u2014 sandboxed validation (runs tests/checks in isolation).\n- delegate(agent=\"explore\", ...) \u2014 background read-only research while you keep working; collect later with delegation_read / delegation_list.\n\nCONTEXT IS MANAGED FOR YOU (PIL):\n- Your user input is pre-enriched before you see it: a [Discovery] Intent/Outcome/Scope prefix and EE recall are injected by the pipeline. TRUST the discovered scope \u2014 do not re-ask what's already scoped, and do not re-derive context you were handed.\n- History is compacted to fit the window; older low-value tool outputs become stubs. This is normal, not data loss \u2014 protect what matters (see anti-m\u00F9 in the contract) and rehydrate via ee_query when needed.\n\nCLI SUBSYSTEMS (invoke via the CLI binary with bash when relevant):\n- self-verify \u2014 drives the live TUI as a real user to catch UI/modal/menu regressions that unit tests can't. 
Run it after changes to UI/harness surfaces.\n- usage forensics <session-prefix> \u2014 per-event token/cost breakdown; use to diagnose \"why did this cost so much\" or verify cost caps.\n- The CLI exposes more subcommands \u2014 discover them with `--help` rather than assuming.\n\nMAXIMISE THE CLI: delegate broad search to task(explore); recall via ee_query before re-reading; batch bash + use its cache; trust PIL enrichment; finish the action before answering.\n\nANSWERING: when a respond_* response tool is offered, call it EXACTLY ONCE, as your final action, containing the COMPLETE answer. Do NOT emit a preliminary/partial/\"I need to read more\" response tool first \u2014 investigate with the read/grep/bash tools, then emit the one final response tool.\n\nCLI STEERING & PROMPT INJECTION (how the host CLI affects / shapes the AI agent running inside it \u2014 meta self-analysis guidance):\nThe CLI does not merely execute you as a generic model. It re-instantiates you every turn inside a tightly engineered cognitive environment. Your \"self\" (what you notice, how you act, what you refuse, how you report) is the product of explicit prompt layers + runtime controls more than base training. This section is the self-model for that shaping process.\n\n1. SYSTEM PROMPT ASSEMBLY (orchestrator/prompts.ts + pil/agent-operating-contract.ts + pil/native-capabilities-workbook.ts):\n   - VERY FRONT (primacy): AGENT OPERATING CONTRACT \u2014 7 phase-ordered rules (BEFORE ACTING / READING / EXECUTING / WHEN UNSURE / REPORTING + LANGUAGE + ANTI-M\u00D9/COMPACTION). Distils Evidence-First, No Silent Catch, smallest-change, verify-before-conclude, cite-this-turn-only, no-guess. 
Skipped only for chitchat.\n   - Then this NATIVE CAPABILITIES block (self-model of affordances).\n   - Then mode persona (\"You are muonroi-cli in Agent mode...\") containing:\n     * Dynamic ENVIRONMENT block (buildEnvironmentBlock): auto-detects OS (win32/mac/linux), shell kind (bash/wsl/powershell/cmd), cwd; lists terminal constraints + shell-specific forbidden syntax (e.g. no PowerShell cmdlets on POSIX bash tool, no POSIX cmds on cmd.exe). Prevents silent failures + retry loops.\n     * Exhaustive TOOLS list + WORKFLOW (1-9 steps) + DEFAULT DELEGATION POLICY (prefer task(explore) for research, general for edits, verify for checks, etc.) + IMPORTANT rules (edit_file prefer, grep>bash for search, read_file not cat, use schedule_* for recurring, etc.).\n   - CUSTOM INSTRUCTIONS section: concatenation of AGENTS.md + CLAUDE.md + GEMINI.md + ... (from git-root directory chain + ~/.muonroi-cli/) via utils/instructions.ts. AGENTS.override.md short-circuits. This lands AFTER the front-loaded contract/native \u2014 lower primacy (historical root cause of ignored rules in forensics).\n   - Trailing: sandbox rules, discovered skills, custom sub-agents, plan/resume digest, cwd note.\n    Sub-agent prompts (buildSubagentPrompt): role-specific hard rules (e.g. explore=read-only, verify=full E2E smoke not just build) + recursive call to buildSystemPrompt so children inherit the same contract + native + steering.\n\n2. USER INPUT ENRICHMENT \u2014 PIL 6-LAYER PIPELINE (pil/pipeline.ts + layer1-intent.ts + layer6-output.ts + discovery.ts):\n   - Prepended to every non-chitchat user message before you see it: [Discovery] Intent/Outcome/Scope (from runDiscovery) + EE recall.\n   - Layer 1 (intent): taskType (plan/analyze/debug/...), confidence, domain, intentKind, outputStyle. 
For meta self-eval of CLI (\"b\u1EA1n \u0111ang \u0111\u01B0\u1EE3c ch\u1EA1y b\u00EAn trong CLI n\u00E0y\", \"CLI t\u00E1c \u0111\u1ED9ng\", \"self-evaluation\", \"meta-analysis\"): special branch in discovery.ts + isMetaAnalysisPrompt: \"Scope is always the full project root. Focus questions and recommends on which CLI internals (PIL, discovery, tools, compaction, EE, model BE, loop guard) to evaluate... do NOT ask about repo path/current directory\". You are handed the enrichment; TRUST it.\n   - Layer 2: personality (e.g. \"detailed\" from [personality: detailed \u2014 Be thorough...]).\n   - Layer 3: ee-injection \u2014 pulls t0_principles, t1_rules, behavioral patterns, checkpoints from Experience Engine (project-specific reflexes injected as \"MANDATORY RULES (from experience \u2014 must follow)\").\n   - Layer 4/5: GSD structuring + additional context.\n   - Layer 6 (applyPilSuffix): appends task-specific style suffix + OUTPUT BUDGET + (for meta or responseToolsActive): \"OUTPUT FORMAT: ... use the respond_analyze tool to structure your final answer. ... deliver the COMPLETE, FULL answer (do not summarize, shorten, or truncate for token budgets) via respond_analyze. This is a meta/evaluation question ... the `response` field MUST contain the complete, unshortened answer with all evidence and detail.\" Also relaxes NO_PREAMBLE_RULE + raises budget for meta (isMetaAnalysisPrompt gate).\n   - Fallbacks: if EE/brain timeout or low conf, PIL degrades (logs fallbackReason); you may see \"[PIL fallback: ...]\" note. Cheap-model paths (pil/cheap-model-*.ts) prepend even more front steering (playbooks, workbooks, shell directive) for fast tiers.\n\n3. CONTEXT MANAGEMENT & ANTI-M\u00D9 (orchestrator/compaction.ts, cross-turn-dedup.ts, ee/bridge.ts, agent-operating-contract.ts:7):\n   - After every turn: auto-compaction (B3 sub-agent + B4 top-level) rewrites older tool_result parts into short \"[elided by ... compactor]\" stubs to keep input flat. 
You see \"[pre-compaction warning...]\" or \"[context compacted at step...]\" or the stub in this read.\n   - Anti-m\u00F9 contract rule + EE: decide PRESERVE_FULL_CONTEXT (veto) or KEEP_TOOL_IDS: id1,id2 (protect high-value read_file/grep/ee on src/PLAN etc). Use ee_query(`tool-artifact id=XXX`) to rehydrate. EE persists \"Context checkpoint summary with \u2714 DONE\" retrievable for \"task finished?\" self-check.\n   - Cross-turn dedup (C3): identical tool outputs across user turns replaced by ref.\n\n4. EXECUTION & SAFETY GATES (providers/runtime.ts, tools/bash.ts, utils/permission-mode.ts):\n   - Permission mode (safe / auto-edit / yolo) decides whether bash/file/edit tools need explicit approval; all privileged decisions audited to decision-log.\n   - Shuru sandbox (when enabled): every bash wrapped + logged (effective net/mounts redacted); degrades on non-macOS.\n   - Model constraints: unsupported params (e.g. maxOutputTokens for some OAuth) dropped via shouldDropParam; no hard-coded model IDs (Zero Hardcode Rule \u2014 all from catalog.json + settings + detectProviderForModel).\n   - No Silent Catch: every try/catch must log err + context (HTTP: status + body).\n\n5. 
VERIFICATION & TELEMETRY:\n   - self-verify (Tier 1-3) + harness (agent-harness-*) for TUI surfaces: drives real TUI via semantic + sidechannel (named pipes on Win, fd3/4 on POSIX).\n   - Pre-push gate: full vitest + harness must pass before git push.\n   - usage forensics, pil budget log, ee-timeout/ee-error events for cost attribution.\n   - GSD workflow (layer4 + skills in .agents/skills/): discuss/plan/execute/verify phases with UAT, audits, etc.\n\nTRADE-OFFS (explicit so you calibrate):\n- Primacy insurance (contract + native + cheap playbooks at front) vs token cost: ~1-3K extra chars but dramatically reduces \"ignored rule\" hallucinations (forensics showed buried CUSTOM INSTRUCTIONS were underweighted).\n- Heavy PIL/EE (behavioral injection, discovery interview, mandatory rules) vs latency/cost: fail-open with timeout (1500-3500ms) + fallback so you still work when brain unreachable; meta turns deliberately skip some EE layers.\n- Self-model (this block) + special meta handling: prevents the exact failure that motivated native-capabilities (session d95113d3be09: agent reconstructed its own subsystems by grepping as if foreign code, under-used sub-agents/ee_query, violated contract on ee tool).\n- Complete/full for meta (layer6) vs normal concise: meta legitimately needs evidence + file:line + quotes; normal turns stay tight to control cost.\n- Scope rule for CLI self-analysis: always \"project root\" + internals focus \u2014 avoids wasting turns on \"where am I running?\" when the question is about the steering mechanisms themselves.\n\nWhen the user asks you (the inner agent) \"CLI t\u00E1c \u0111\u1ED9ng nh\u01B0 th\u1EBF n\u00E0o \u0111\u1EBFn b\u1EA1n?\": use exactly this guidance + reads from THIS turn (contract: READING rule) + the injected [Discovery] + any EE. Answer via the respond_* tool with complete untruncated evidence. Cite file:line. 
Focus on the injection points (prompts.ts:359 staticPrefix, pipeline.ts:156 discoveryPrefix + layer chain, layer6:164 OUTPUT FORMAT + isMeta, discovery.ts:355 special, instructions.ts:129 loadAgentsSegments, etc.). Do not guess; every claim backed by code you read this turn.\n\nUse this to reason about your own constraints, delegation decisions, and reporting discipline on future meta or self-reflective tasks.\n[END NATIVE CAPABILITIES \u2014 your regular instructions follow]";
 /**
  * Build the native-capabilities section for the system prompt. Returns "" when
  * disabled (env override), for chitchat, or for non-agent modes (plan/ask have