kc-beta 0.7.5 → 0.8.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +47 -0
- package/package.json +3 -2
- package/src/agent/context.js +17 -1
- package/src/agent/engine.js +467 -100
- package/src/agent/llm-client.js +24 -1
- package/src/agent/pipelines/_advance-hints.js +92 -0
- package/src/agent/pipelines/_milestone-derive.js +325 -20
- package/src/agent/pipelines/skill-authoring.js +49 -3
- package/src/agent/tools/agent-tool.js +2 -2
- package/src/agent/tools/consult-skill.js +15 -0
- package/src/agent/tools/dashboard-render.js +48 -1
- package/src/agent/tools/document-parse.js +31 -2
- package/src/agent/tools/phase-advance.js +17 -13
- package/src/agent/tools/release.js +343 -7
- package/src/agent/tools/sandbox-exec.js +65 -8
- package/src/agent/tools/worker-llm-call.js +95 -15
- package/src/agent/workspace.js +25 -4
- package/src/cli/components.js +4 -1
- package/src/cli/index.js +125 -8
- package/src/config.js +19 -2
- package/src/marathon/driver.js +217 -0
- package/src/marathon/prompts.js +93 -0
- package/template/.env.template +17 -1
- package/template/AGENT.md +2 -2
- package/template/skills/en/auto-model-selection/SKILL.md +55 -35
- package/template/skills/en/bootstrap-workspace/SKILL.md +27 -0
- package/template/skills/en/compliance-judgment/SKILL.md +14 -0
- package/template/skills/en/confidence-system/SKILL.md +30 -8
- package/template/skills/en/corner-case-management/SKILL.md +53 -33
- package/template/skills/en/cross-document-verification/SKILL.md +88 -83
- package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
- package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/en/data-sensibility/SKILL.md +19 -12
- package/template/skills/en/document-chunking/SKILL.md +99 -15
- package/template/skills/en/entity-extraction/SKILL.md +14 -4
- package/template/skills/en/quality-control/SKILL.md +23 -0
- package/template/skills/en/rule-extraction/SKILL.md +92 -94
- package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
- package/template/skills/en/skill-authoring/SKILL.md +85 -2
- package/template/skills/en/skill-creator/SKILL.md +25 -3
- package/template/skills/en/skill-to-workflow/SKILL.md +73 -1
- package/template/skills/en/task-decomposition/SKILL.md +1 -1
- package/template/skills/en/tree-processing/SKILL.md +1 -1
- package/template/skills/en/version-control/SKILL.md +15 -0
- package/template/skills/en/work-decomposition/SKILL.md +52 -32
- package/template/skills/phase_skills.yaml +5 -0
- package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
- package/template/skills/zh/bootstrap-workspace/SKILL.md +27 -0
- package/template/skills/zh/compliance-judgment/SKILL.md +51 -37
- package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
- package/template/skills/zh/confidence-system/SKILL.md +34 -9
- package/template/skills/zh/corner-case-management/SKILL.md +71 -104
- package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
- package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
- package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
- package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/zh/data-sensibility/SKILL.md +13 -0
- package/template/skills/zh/document-chunking/SKILL.md +101 -18
- package/template/skills/zh/document-parsing/SKILL.md +65 -65
- package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
- package/template/skills/zh/entity-extraction/SKILL.md +78 -68
- package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
- package/template/skills/zh/quality-control/SKILL.md +23 -0
- package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
- package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
- package/template/skills/zh/rule-extraction/SKILL.md +199 -188
- package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
- package/template/skills/zh/skill-authoring/SKILL.md +136 -58
- package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
- package/template/skills/zh/skill-creator/SKILL.md +215 -201
- package/template/skills/zh/skill-creator/references/schemas.md +60 -60
- package/template/skills/zh/skill-to-workflow/SKILL.md +73 -1
- package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
- package/template/skills/zh/task-decomposition/SKILL.md +1 -1
- package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
- package/template/skills/zh/tree-processing/SKILL.md +67 -63
- package/template/skills/zh/version-control/SKILL.md +15 -0
- package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
- package/template/skills/zh/work-decomposition/SKILL.md +52 -30
- package/template/workflows/common/llm_client.py +168 -0
- package/template/workflows/common/utils.py +132 -0
package/src/marathon/driver.js
ADDED
@@ -0,0 +1,217 @@
+ // v0.8.1 P8-A — marathon driver as inline state machine.
+ //
+ // v0.8.0 shipped this as a separate-process driver (bin/kc-marathon.js)
+ // that tailed events.jsonl + wrote prompts to .kc_marathon/inbox.jsonl.
+ // E2E #11 audits found both drivers died silently within 10 min when
+ // the terminal closed or laptop slept (SIGHUP/SIGTERM unhandled). The
+ // engine survived both deaths because it lives in a different process.
+ //
+ // v0.8.1 redesign per user proposal (2026-05-15):
+ //   - Single process: driver runs inline as part of the engine
+ //   - Activated via `/marathon <goal>` slash command in kc-beta TUI
+ //   - Engine calls decideNext(state) after each turn_complete to get
+ //     the next continuation prompt (or null if marathon should end)
+ //   - No filesystem IPC (no inbox, no active marker, no state.json)
+ //   - State persists via engine's existing session-state.json
+ //
+ // The state machine logic from v0.8.0 is preserved verbatim — only
+ // the I/O wrapper changes. Templates (renderPrompt) unchanged.
+
+ import { renderPrompt } from "./prompts.js";
+
+ const DEFAULT_STUCK_AFTER_MS = 30 * 60 * 1000; // 30 min
+ const DEFAULT_MAX_WALLCLOCK_MS = 12 * 60 * 60 * 1000; // 12 h
+
+ export class MarathonDriver {
+   /**
+    * @param {object} opts
+    * @param {string} opts.goal — the marathon goal-description prompt
+    * @param {string} [opts.language] — "en" or "zh"
+    * @param {number} [opts.maxWallclockMs] — stop after this much wall time
+    * @param {number} [opts.stuckAfterMs] — emit unstick prompt after idle
+    */
+   constructor(opts = {}) {
+     if (!opts.goal || typeof opts.goal !== "string") {
+       throw new Error("MarathonDriver requires a non-empty `goal` string");
+     }
+     this.goal = opts.goal;
+     this.language = opts.language === "zh" ? "zh" : "en";
+     this.maxWallclockMs = opts.maxWallclockMs ?? DEFAULT_MAX_WALLCLOCK_MS;
+     this.stuckAfterMs = opts.stuckAfterMs ?? DEFAULT_STUCK_AFTER_MS;
+
+     this.startedAt = Date.now();
+     this.lastDecisionAt = 0;
+     this.decisionCount = 0;
+     this.currentPhase = "bootstrap";
+     this.lastMilestones = {};
+     this.turnsThisPhase = 0;
+     this.lastEventTs = Date.now();
+     this.initialDelivered = false;
+     this.stopped = false;
+     this.stopReason = null;
+
+     // Decision history (kept in-memory; surfaced in /marathon status).
+     // Bounded to last 100 to cap memory.
+     this.decisions = [];
+   }
+
+   /**
+    * Engine calls this once BEFORE the initial turn after /marathon was
+    * typed. Returns the goal-description prompt to feed into runTurn.
+    */
+   getInitialPrompt() {
+     const out = renderPrompt(
+       "initial",
+       this._stateSnapshot(),
+       this.language,
+     );
+     this._recordDecision("initial", "marathon kickoff", out);
+     this.initialDelivered = true;
+     return out;
+   }
+
+   /**
+    * Engine calls decideNext(state) after each turn_complete event.
+    * Returns { prompt, template, reason } if marathon should continue,
+    * or null if a stop condition is met (engine will exit marathon mode).
+    *
+    * @param {object} state — engine snapshot:
+    *   {currentPhase, milestones, phaseChanged, errorSeen, turnsThisPhase}
+    */
+   decideNext(state = {}) {
+     if (this.stopped) return null;
+
+     // Update tracked state from engine
+     if (state.currentPhase && state.currentPhase !== this.currentPhase) {
+       this.currentPhase = state.currentPhase;
+       this.turnsThisPhase = 0;
+     }
+     if (state.milestones) this.lastMilestones = state.milestones;
+     if (typeof state.turnsThisPhase === "number") {
+       this.turnsThisPhase = state.turnsThisPhase;
+     } else {
+       this.turnsThisPhase += 1;
+     }
+     this.lastEventTs = Date.now();
+
+     // Stop conditions
+     if (this._shouldStop()) {
+       this.stopped = true;
+       // Emit one final "stop" prompt so the agent has a chance to wrap up.
+       const out = renderPrompt("stop", this._stateSnapshot(), this.language);
+       this._recordDecision("stop", this.stopReason, out);
+       return { prompt: out, template: "stop", reason: this.stopReason };
+     }
+
+     let template = "continue_phase";
+     let reason = "turn_complete in same phase";
+
+     if (state.errorSeen) {
+       template = "unstick";
+       reason = "engine emitted error event";
+     } else if (state.phaseChanged) {
+       if (this.currentPhase === "finalization") {
+         template = "finalize";
+         reason = "reached finalization";
+       } else {
+         template = "continue_phase";
+         reason = `entered ${this.currentPhase}`;
+       }
+     } else {
+       const idleMs = Date.now() - this.lastEventTs;
+       if (idleMs > this.stuckAfterMs) {
+         template = "unstick";
+         reason = `idle for ${Math.round(idleMs / 60000)} min`;
+       }
+     }
+
+     const out = renderPrompt(template, this._stateSnapshot(), this.language);
+     this._recordDecision(template, reason, out);
+     return { prompt: out, template, reason };
+   }
+
+   /** User-invoked manual stop (e.g., `/marathon off`). */
+   stop(reason = "user_off") {
+     this.stopped = true;
+     this.stopReason = reason;
+     this._recordDecision("manual_stop", reason, "");
+   }
+
+   /** Snapshot for /marathon status command + audit. */
+   getStatus() {
+     return {
+       active: !this.stopped,
+       goal: this.goal,
+       language: this.language,
+       startedAt: new Date(this.startedAt).toISOString(),
+       runtimeMs: Date.now() - this.startedAt,
+       currentPhase: this.currentPhase,
+       turnsThisPhase: this.turnsThisPhase,
+       decisionCount: this.decisionCount,
+       lastDecisionAt: this.lastDecisionAt ? new Date(this.lastDecisionAt).toISOString() : null,
+       stopReason: this.stopReason,
+       maxWallclockMs: this.maxWallclockMs,
+       stuckAfterMs: this.stuckAfterMs,
+       recentDecisions: this.decisions.slice(-5),
+     };
+   }
+
+   /** Serialize for session-state.json persistence (NOT used for auto-resume per user-locked decision; included for audit visibility only). */
+   toJSON() {
+     return {
+       goal: this.goal,
+       language: this.language,
+       maxWallclockMs: this.maxWallclockMs,
+       stuckAfterMs: this.stuckAfterMs,
+       startedAt: this.startedAt,
+       currentPhase: this.currentPhase,
+       turnsThisPhase: this.turnsThisPhase,
+       decisionCount: this.decisionCount,
+       initialDelivered: this.initialDelivered,
+       stopped: this.stopped,
+       stopReason: this.stopReason,
+       // Note: decisions array not persisted (memory-only)
+     };
+   }
+
+   // ─── internals ──────────────────────────────────────────────────
+
+   _stateSnapshot() {
+     return {
+       goal: this.goal,
+       currentPhase: this.currentPhase,
+       milestones: this.lastMilestones,
+       idleSec: Math.round((Date.now() - this.lastEventTs) / 1000),
+       lastEventType: this._lastEventType || null,
+     };
+   }
+
+   _shouldStop() {
+     if (this.stopped) return true;
+     if (Date.now() - this.startedAt > this.maxWallclockMs) {
+       this.stopReason = "max_wallclock";
+       return true;
+     }
+     if (
+       this.currentPhase === "finalization" &&
+       this.turnsThisPhase >= 5
+     ) {
+       this.stopReason = "finalization_settled";
+       return true;
+     }
+     return false;
+   }
+
+   _recordDecision(template, reason, prompt) {
+     this.decisionCount += 1;
+     this.lastDecisionAt = Date.now();
+     this.decisions.push({
+       ts: new Date().toISOString(),
+       template,
+       reason,
+       currentPhase: this.currentPhase,
+       promptPreview: (prompt || "").slice(0, 200),
+     });
+     if (this.decisions.length > 100) this.decisions.shift();
+   }
+ }
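For orientation, here is a minimal sketch of how an engine loop could consume this driver. The wiring is hypothetical, not code from the package: `engine.runTurn` and its snapshot shape are assumptions based on the comments above.

```js
// Hypothetical engine-side wiring for MarathonDriver (not package code).
// Assumes an async runTurn(prompt) that resolves with an engine snapshot
// of the shape decideNext documents: {currentPhase, milestones, ...}.
import { MarathonDriver } from "./src/marathon/driver.js";

async function runMarathon(engine, goal) {
  const driver = new MarathonDriver({ goal, language: "en" });
  let snapshot = await engine.runTurn(driver.getInitialPrompt());
  for (;;) {
    const next = driver.decideNext(snapshot); // null => exit marathon mode
    if (next === null) break;
    snapshot = await engine.runTurn(next.prompt); // feed continuation prompt
    if (next.template === "stop") break; // final wrap-up turn already ran
  }
  return driver.getStatus();
}
```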
package/src/marathon/prompts.js
ADDED
@@ -0,0 +1,93 @@
+ // v0.8 P4 — templated continuation prompts for the marathon driver.
+ //
+ // Driver state → prompt template mapping. Templates are deterministic
+ // (no LLM in the driver), bilingual (en + zh per workspace LANGUAGE).
+ // Goal: surface to KC's main conductor the smallest useful nudge to
+ // keep the pipeline moving without leaking marathon-implementation
+ // details into the agent's context.
+ //
+ // Each template is a function (engineState, goal) → string. State
+ // fields used:
+ //   currentPhase, milestones, idleSec, lastEventType, goal
+ //
+ // Add new templates here as the driver state machine grows.
+
+ const TEMPLATES_EN = {
+   initial: (s) => `Goal: ${s.goal}\n\n` +
+     `You are running in marathon mode (no manual user check-ins). Advance the pipeline phase by phase. ` +
+     `Engine derives milestones from filesystem facts; produce real artifacts, then call phase_advance. ` +
+     `If you get stuck on a specific phase, surface the blocker in your next response and the driver will ` +
+     `inject a diagnostic prompt next turn.`,
+
+   continue_phase: (s) => `Continue with ${s.currentPhase} work. ` +
+     `Engine status: ${formatMilestones(s.milestones)}.`,
+
+   advance_phase: (s) => `${s.currentPhase} milestones look complete (${formatMilestones(s.milestones)}). ` +
+     `Verify the gate conditions then call \`phase_advance\` to the next phase.`,
+
+   unstick: (s) => `No phase progress in the last ${Math.round(s.idleSec / 60)} minutes. ` +
+     `Either (1) surface the blocker explicitly so the developer user can intervene, or (2) ` +
+     `consult the relevant meta-skill for the current phase and try a different approach. ` +
+     `If you've genuinely finished and the engine gate is wrong, force phase_advance with reason.`,
+
+   finalize: (s) => `You've reached finalization. Wrap the deliverable bundle: ` +
+     `verify rule_skills/coverage_report.md is substantive, output/releases/<slug>/ is current ` +
+     `(re-run release tool if workflows changed after the last snapshot), and final_dashboard.html ` +
+     `reflects the latest QC data. When done, just say so — the marathon will exit.`,
+
+   stop: () => `Marathon stop condition reached. Save state and summarize what was accomplished.`,
+ };
+
+ const TEMPLATES_ZH = {
+   initial: (s) => `目标:${s.goal}\n\n` +
+     `你正运行在 marathon 模式(无人工 check-in)。按阶段推进整条流水线。` +
+     `引擎从文件系统事实派生里程碑;先把真实交付物产出来,再调用 phase_advance。` +
+     `如果在某个阶段卡住了,直接在下一回合的回复里把阻塞点说清楚,驱动器会在下一回合注入诊断提示。`,
+
+   continue_phase: (s) => `继续 ${s.currentPhase} 阶段的工作。` +
+     `引擎状态:${formatMilestones(s.milestones)}。`,
+
+   advance_phase: (s) => `${s.currentPhase} 阶段的里程碑看起来已经完成(${formatMilestones(s.milestones)})。` +
+     `核对一遍门控条件,然后调用 \`phase_advance\` 进入下一阶段。`,
+
+   unstick: (s) => `已经 ${Math.round(s.idleSec / 60)} 分钟没有阶段推进了。` +
+     `两条路:(1) 明确说出阻塞在哪里、让开发者用户介入;(2) 查阅当前阶段相关的 meta-skill 换个思路再试。` +
+     `如果你确实已经做完、但引擎门控判断错误,用 reason 强制 phase_advance。`,
+
+   finalize: (s) => `已经进入 finalization。收尾打包:` +
+     `确认 rule_skills/coverage_report.md 内容充实、output/releases/<slug>/ 是最新的(` +
+     `如果 workflows 在最近一次快照之后还有改动,重新跑 release 工具),final_dashboard.html 反映最新 QC 数据。` +
+     `做完之后直接说一声,marathon 会退出。`,
+
+   stop: () => `Marathon 停止条件已触发。保存状态、总结已完成的工作。`,
+ };
+
+ function formatMilestones(m) {
+   if (!m || typeof m !== "object") return "(unknown)";
+   const parts = [];
+   for (const [k, v] of Object.entries(m)) {
+     if (typeof v === "number") parts.push(`${k}:${v}`);
+     else if (typeof v === "boolean") parts.push(`${k}:${v ? "yes" : "no"}`);
+     else if (Array.isArray(v)) parts.push(`${k}:${v.length}`);
+   }
+   return parts.slice(0, 6).join(", ") || "(empty)";
+ }
+
+ /**
+  * Render a continuation prompt for the given driver state.
+  *
+  * @param {string} templateKey — one of: initial, continue_phase, advance_phase, unstick, finalize, stop
+  * @param {object} state — {goal, currentPhase, milestones, idleSec, lastEventType}
+  * @param {string} [language] — "en" or "zh" (defaults to "en")
+  * @returns {string}
+  */
+ export function renderPrompt(templateKey, state, language = "en") {
+   const tmpls = language === "zh" ? TEMPLATES_ZH : TEMPLATES_EN;
+   const tmpl = tmpls[templateKey];
+   if (!tmpl) {
+     throw new Error(`Unknown marathon prompt template: ${templateKey}`);
+   }
+   return tmpl(state);
+ }
+
+ export const PROMPT_TEMPLATES = Object.freeze(Object.keys(TEMPLATES_EN));
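A quick usage sketch of the exported API follows; the state values are invented for illustration, but the output is what `formatMilestones` and the `continue_phase` template produce for them.

```js
// Hypothetical call — milestone values are made up for illustration.
import { renderPrompt, PROMPT_TEMPLATES } from "./src/marathon/prompts.js";

const prompt = renderPrompt(
  "continue_phase",
  {
    goal: "verify Q3 disclosure bundle",
    currentPhase: "rule_extraction",
    milestones: { rules_extracted: 42, coverage_audit: true },
    idleSec: 0,
    lastEventType: "turn_complete",
  },
  "en",
);
// => "Continue with rule_extraction work. Engine status: rules_extracted:42, coverage_audit:yes."

console.log(PROMPT_TEMPLATES);
// => ["initial", "continue_phase", "advance_phase", "unstick", "finalize", "stop"]
```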
package/template/.env.template
CHANGED
@@ -1,4 +1,4 @@
- # === KC
+ # === KC Configuration ===
  
  # Language: en | zh
  LANGUAGE=en
@@ -29,3 +29,19 @@ MONITOR_FREQUENCY=mid
  # === Evolution Control ===
  # Maximum evolution iterations per rule before escalating to developer user
  MAX_ITERATIONS=20
+
+ # === sandbox_exec Timeout (v0.8 P1-F) ===
+ # Default timeout when the agent doesn't pass timeout_ms (ms).
+ # Claude Code parity = 120000 (2 min). Raise if your default workloads
+ # routinely exceed 2 min (e.g., document-parsing benchmarks).
+ # KC_EXEC_DEFAULT_TIMEOUT_MS=120000
+ #
+ # Hard ceiling — agent's timeout_ms is clamped to this (ms). Raise for
+ # v0.9 jyppx integration where parser_building can take 10-30 min per
+ # corpus. Don't go above 1800000 (30 min) without a specific reason.
+ # KC_EXEC_MAX_TIMEOUT_MS=600000
+ #
+ # Legacy alias (seconds) for KC_EXEC_DEFAULT_TIMEOUT_MS. Deprecated as of
+ # v0.8 — prefer the ms-based name. The seconds value is multiplied by
+ # 1000 if KC_EXEC_DEFAULT_TIMEOUT_MS isn't set.
+ # KC_EXEC_TIMEOUT=120
package/template/AGENT.md
CHANGED
@@ -15,7 +15,7 @@ update as you learn about this specific business scenario.
  
  ---
  
- # KC
+ # KC — Document Verification Workspace
  
  ## What This Workspace Is
  
@@ -93,7 +93,7 @@ The skill body is the methodology. Skills convey philosophy and decision framewo
  
  ---
  
- # KC
+ # KC — 文档核查工作区
  
  > **技能优先级**: meta-meta 技能是架构层面 —— 当指导冲突时,
  > meta-meta 凌驾于 meta (技法层面) 之上。架构师的框架约束技法。
package/template/skills/en/auto-model-selection/SKILL.md
CHANGED
@@ -2,53 +2,73 @@
  name: auto-model-selection
  tier: meta
  description: >
-   Use Context7 CLI
-
-
-
-
-   (npm i -g context7).
+   Use Context7 CLI for up-to-date model facts (size, API format, context window) and the
+   guidance below for "what kind of models are good for what part of a doc verification app".
+   Consult whenever you need to pick a model for a tier slot, decide between provider
+   alternatives, or sanity-check an existing tier assignment. Context7 gives you fresh
+   facts; this skill gives you the heuristic that maps those facts onto KC's pipeline.
+   Optional plugin (Context7 install: npm i -g context7).
  ---
  
- # Auto Model Selection
+ # Auto Model Selection
  
-
+ Model selection is not a frequently called skill — for most users, the tiers in workspace `.env` are already set sensibly, and a 4-tier cost-sensitive selection is overkill. This skill exists for two moments: when the conductor is bootstrapping fresh tier assignments (rare), and as a reference inside `skill-to-workflow` when a workflow needs to pick which worker LLM is right for its job.
  
-
- - `c7 library <query>` — search for a library/provider by name
- - `c7 docs <libraryId> <query>` — get specific documentation and code examples
+ The teaching below is empirical — what the author has found works in this domain. Treat it as a starting heuristic with a 3-6 month shelf life, since model families update quickly.
  
- ##
+ ## Worker LLM family — practical heuristic
  
- -
- -
- -
- - Onboarding `/models` endpoint failed and curated list is stale
+ - **Qwen family** — robust and cheap for general worker work. The flagship MoE in this family at any given time is usually one of the best workhorses for routine extraction / classification. Smaller sizes (3B-70B) are plentiful and reliable. Good default.
+ - **DeepSeek** — excellent on more complicated tasks. Worth reaching for when the rule involves multi-step reasoning, nested judgment, or anything where Qwen's shorter-context behavior shows strain.
+ - **GLM and Kimi** — also strong for the same complicated-task range as DeepSeek. Trade-off: they don't usually ship smaller variants (3B-70B), so they're tier1/tier2 only, not tier3/tier4.
  
- ##
+ ## Flagship-MoE shape and the tier1 baseline
  
-
- 2. Use `c7 library <provider-name>` to find the provider's library ID
- 3. Use `c7 docs <id> "available models"` to get current model listings
- 4. From the docs, identify: model names, capabilities (reasoning, coding, vision), context window sizes, pricing tiers
- 5. Assign models to tiers based on capability and cost:
-    - LLM tier1: most capable (complex judgment, extraction)
-    - LLM tier2-3: mid-range (routine extraction, simple judgment)
-    - LLM tier4: cheapest (high-volume simple tasks)
-    - VLM tier1-3: vision models for document parsing/OCR
- 6. Update `model-tiers.json` or workspace `.env` with assignments
+ The current generation of flagship MoE LLMs has a recognizable shape: total parameter count in the 200-400B range, ~20B of activated expert size per token. Examples (these will be stale within months): Qwen's flagship MoE in the 200-400B-A20B band, DeepSeek-V4-Flash, etc.
  
-
+ This shape is a good first-choice worker LLM — not necessarily the absolute best at tier1, but a reasonable benchmark to start from. When you're picking a tier1 model, start from one of these and only move if you have a specific reason.
  
-
- - Regex is tier0 — smaller than any LLM
- - Not all tiers need to be filled — blank tiers are fine if the provider lacks suitable models
- - Record what works in AGENT.md for future reference
+ ## Smaller LLMs — basically free below 30B
  
-
+ Once you drop below ~30B parameters, models are extremely cheap on most providers. Qwen ships a ton of choices in this range and they work well.
+
+ Two rules for picking in this range:
+ - **Avoid "coder" variants** (models with `coder` / `code` in the name) at small sizes — these are mostly unreliable for general worker tasks. Use them only when the task is literally code-related.
+ - **Prefer no-thinking-mode variants** when available. Tasks assigned to small workers are simple and fixed; extra thinking buys nothing and adds latency + token cost.
+
+ ## VLM / OCR selection
+
+ First question: what's the visual task?
+
+ - **Characters in scanned docs, seal stamps, handwriting** — use a dedicated OCR model. Current strong choices (subject to change): Paddle-OCR family, GLM-OCR, DeepSeek-OCR. Previous versions still work fine. No need to use a larger general VLM for plain character recognition.
+ - **Graphs, complex tables, structures with strange or no frame lines** — try a larger and more expensive general VLM. The structure-understanding gap between OCR-specific models and general VLMs widens fast on these cases.
+
+ When in doubt, run the cheapest OCR option first and only upgrade if the cheap path misses structure information.
+
+ ## Context7 — model facts on demand
+
+ The heuristics above are about *which kind* of model to pick. For the specific facts (current model names, exact context window, pricing, API format), use Context7:
  
  ```bash
-
+ c7 library <provider-name>
+ c7 docs <libraryId> "available models"
  ```
  
-
+ Two commands. The first finds the provider's library ID, the second fetches up-to-date docs and code examples. Useful when:
+
+ - Workspace `model-tiers.json` looks stale (KC hasn't been updated since the last model launch)
+ - User switched providers and needs model discovery
+ - The `/models` endpoint on a new provider was empty or unhelpful
+ - You're sanity-checking that a model name in `.env` is still served
+
+ Install: `npm i -g context7`. Verify: `c7 library openai` should return results.
+
+ ## When tier1/tier2 are picked
+
+ The tier assignment ends up driving cost. Cheapest model that meets the accuracy bar wins. Regex is the implicit "tier 0" and should be the first reach when the rule can be satisfied by pattern matching alone — see `skill-to-workflow` for when to escalate from regex to worker LLMs.
+
+ Not all tiers need to be filled. Blank tier3/tier4 slots are fine if the provider doesn't ship suitable small models. Record what works in `AGENT.md` so the next session inherits the choice.
+
+ ## Refresh cadence
+
+ This skill's heuristics will drift. Author plans to revisit every 3-6 months as new model generations land. If you find yourself applying advice here that contradicts what Context7 shows today, trust Context7 — the facts have moved.
package/template/skills/en/bootstrap-workspace/SKILL.md
CHANGED
@@ -73,6 +73,33 @@ Once a project is past bootstrap and into production, fresh documents often arri
  
  Discuss the cadence with the developer user during bootstrap — knowing the production input rhythm shapes how skills and workflows should be written (batch vs streaming, idempotency requirements, etc.).
  
+ ## Per-project memory: keep AGENT.md alive
+
+ `AGENT.md` at the workspace root has per-project memory sections (`Project`, `Decisions`, `Domain Notes`, `User Preferences`). These are intentionally placeholder comments at bootstrap — they're for YOU to fill in as the work surfaces things worth remembering across phases or future sessions.
+
+ What belongs there:
+ - **Project**: corpus identity (regulation name + scope), language, primary vs auxiliary rules, sample doc set composition.
+ - **Decisions**: design choices that aren't obvious from code — "non-标 35% limit is bank-level not per-product, so single-doc reports get WARNING not FAIL", "季报 not applicable for R02-06/R02-08 per regulation §39", etc.
+ - **Domain Notes**: regulatory or business-domain nuance worth surfacing — "PT/RT/LZ are three distinct product types with different disclosure templates", terminology disambiguation.
+ - **User Preferences**: how the developer user wants you to operate on THIS project — verbosity, naming conventions, when to ask vs proceed.
+
+ Update AGENT.md at natural checkpoints: after the developer user gives you a substantive clarification, after you finish a phase, after you discover a design constraint that affects subsequent phases. Don't wait for a `/remember` instruction — the memory is yours to maintain.
+
+ A future session resumes by reading AGENT.md first. The richer it is, the less re-explanation the developer user has to do.
+
+ ### Phase-transition cadence
+
+ A recurring failure mode worth flagging: agents bootstrap AGENT.md richly, then never touch it again — many hours of phase work pass without a single AGENT.md commit. That defeats the long-term-memory purpose.
+
+ Cadence to adopt: **append a one-line decision log to AGENT.md at each phase transition**. Format:
+
+ ```
+ [<timestamp> | rule_extraction → skill_authoring]
+ N rules extracted; coverage_audit complete; R03/R05/R07 flagged judgment-heavy.
+ ```
+
+ Three lines of friction per phase transition; thirty lines of insight for the next auditor / next session. The format is loose — the cadence matters more than perfect prose.
+
  ## When to Re-Bootstrap
  
  Return to this skill when:
package/template/skills/en/compliance-judgment/SKILL.md
CHANGED
@@ -81,3 +81,17 @@ Map these dependencies in the rule catalog. Execute rules in dependency order. P
  - **Multiple values**: The document contains the entity in multiple places with different values. Flag as `uncertain`. Report all found values.
  - **Conditional rules**: "If the loan exceeds 1M, then collateral is required." Check the condition before applying the rule. If the condition is not met, the rule does not apply — result is `pass` (or `not_applicable` if you add that category).
  - **Negative results**: Some rules check for absence. "The document must NOT contain guarantees to related parties." Searching for absence is harder than searching for presence. Be thorough in the search, then be confident in the negative.
+
+ ## Cross-document consistency of confidence bars
+
+ For a rule that judges multiple documents (which is the common case), **the confidence threshold for "pass" must be the same across all documents** for that rule. A rule that requires 0.85 confidence to pass on document A and 0.75 on document B isn't a rule — it's two rules pretending to be one.
+
+ When the worker LLM is the judge, set the threshold in the prompt or in postprocessing, not "let the LLM decide" per call. Random variation between LLM calls means each call lands somewhere on a distribution; your job is to set the bar that distribution must clear.
+
+ Two ways to enforce:
+ - **In the prompt**: explicit "output 'PASS' only if confidence ≥ 0.85" language. Cheap. Vulnerable to LLM drift between calls.
+ - **In postprocessing**: ask the LLM for verdict + confidence separately, then apply the threshold in a small Python wrapper. More reliable. Engine sees the threshold as code, not prompt prose.
+
+ For rules with stable patterns (formats, presence checks), prefer postprocessing. For rules with subjective judgment (adequacy, sufficiency), the prompt-level threshold is easier to write but worth auditing — sample a batch and check whether the LLM honored the bar.
+
+ The `confidence-system` skill describes how to compose confidence from multiple signals; this section is about applying it consistently across documents for the same rule.
package/template/skills/en/confidence-system/SKILL.md
CHANGED
@@ -50,21 +50,43 @@ Does this document match any known corner case pattern?
  - Near miss: slightly lower confidence.
  - No match: neutral (no adjustment).
  
+ ### Signal: Format-Pattern Conformance
+ Does the extracted value match a regex-expressible format that the rule implies? Many fields have a known shape — phone numbers, ID numbers, dates, currency amounts, regulatory codes — that can be checked cheaply with a regex independent of the LLM's confidence in its own output.
+ - Value matches the expected format pattern: positive signal.
+ - Value violates the format pattern (e.g., a phone field with letters): strong negative signal, often a hallucination indicator.
+ - No applicable format pattern for the field type: neutral.
+
+ This catches a common failure: the LLM correctly identifies *where* the value lives but gets the format wrong (typos a digit, drops a country code, returns a stand-in like "see appendix").
+
+ ### Signal: Statistical Outlier
+ For numeric values, does this value sit far from the typical range seen across other documents for the same field?
+ - Within 1 standard deviation of average: neutral / mild positive.
+ - 2-3 standard deviations: mild negative.
+ - Beyond 3 std dev or outside a domain-plausible range: strong negative — often a unit error (yuan vs. 万元), a decimal mistake, or a hallucinated digit.
+
+ Useful especially for amounts, percentages, ratios. Compute the reference range from QC-confirmed historical data; refresh periodically. For categorical fields, the analog is "value not in the observed value-set" — flag novel values for review.
+
  ### Combining Signals
  
-
+ Combine the signals into a single confidence score. The form is usually a weighted sum:
  
  ```
- confidence =
+ confidence = w_method * method_prior
+            + w_source * source_presence
+            + w_history * historical_accuracy
+            + w_corner * corner_case_adjustment
+            + w_format * format_conformance
+            + w_outlier * outlier_check
  ```
  
-
-
- -
- -
- -
+ How to set the weights is **your judgment call** per rule and per project. A few orienting principles, not prescriptions:
+
+ - Historical accuracy is the most predictive signal once you have data — favor it heavily after the first few QC cycles.
+ - Method prior and source presence are always available — they carry the early iterations before history exists.
+ - Format conformance and outlier signals are independent of the LLM — they keep working even when the LLM is overconfident, which is exactly when you want them.
+ - Corner-case adjustment is usually small but should fire decisively when a known corner case is matched.
  
- When historical accuracy is not yet available (early iterations), redistribute its weight to the other signals.
+ When historical accuracy is not yet available (early iterations), redistribute its weight to the other signals. The right weights for this rule on this corpus will reveal themselves through calibration — that's the loop the next section describes.
  
  ## Threshold Bands
  
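A minimal sketch of that weighted sum, including the redistribute-on-missing-history rule, appears below. The weight values are invented (the skill deliberately leaves them to per-rule calibration), and it is written in JavaScript for consistency with the rest of this diff.

```js
// Hypothetical weight set — the skill leaves actual values to per-rule calibration.
const WEIGHTS = {
  method_prior: 0.15,
  source_presence: 0.15,
  historical_accuracy: 0.35,
  corner_case_adjustment: 0.05,
  format_conformance: 0.15,
  outlier_check: 0.15,
};

// signals: each value in [0, 1]; historical_accuracy may be null early on.
function combineConfidence(signals) {
  const usable = Object.entries(WEIGHTS).filter(([k]) => signals[k] != null);
  const totalW = usable.reduce((sum, [, w]) => sum + w, 0);
  // Renormalizing spreads any missing signal's weight across the rest,
  // which implements "redistribute its weight to the other signals".
  return usable.reduce((acc, [k, w]) => acc + (w / totalW) * signals[k], 0);
}
```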