@tw93/waza 3.25.0 → 3.28.0
- package/README.md +49 -25
- package/package.json +5 -3
- package/rules/anti-patterns.md +24 -20
- package/rules/durable-context.md +6 -0
- package/rules/waza-routing.md +18 -0
- package/scripts/build_metadata.py +28 -16
- package/scripts/check_routing_drift.py +8 -0
- package/scripts/package-skill.sh +2 -3
- package/scripts/setup-rule.sh +4 -2
- package/scripts/setup-statusline.sh +1 -1
- package/scripts/skill_checks.py +290 -2
- package/scripts/statusline.sh +6 -14
- package/scripts/validate_package.py +1 -1
- package/scripts/verify_skills.py +12 -0
- package/skills/RESOLVER.md +8 -8
- package/skills/check/SKILL.md +78 -28
- package/skills/check/references/project-context.md +14 -6
- package/skills/check/scripts/audit_signals.py +192 -11
- package/skills/design/SKILL.md +39 -2
- package/skills/design/references/design-reference.md +17 -0
- package/skills/design/references/design-tokens.md +3 -11
- package/skills/health/SKILL.md +53 -26
- package/skills/health/agents/inspector-context.md +1 -1
- package/skills/health/scripts/check_agent_context.py +38 -1
- package/skills/health/scripts/check_maintainability.py +6 -0
- package/skills/health/scripts/collect-data.sh +11 -20
- package/skills/hunt/SKILL.md +33 -1
- package/skills/hunt/references/failure-patterns.md +54 -0
- package/skills/learn/SKILL.md +13 -3
- package/skills/read/SKILL.md +40 -9
- package/skills/read/references/read-methods.md +23 -4
- package/skills/read/scripts/fetch.sh +8 -7
- package/skills/read/scripts/fetch_feishu.py +11 -6
- package/skills/think/SKILL.md +33 -8
- package/skills/write/SKILL.md +88 -10
- package/skills/write/references/write-en.md +19 -17
- package/skills/write/references/write-product-localization.md +43 -0
- package/skills/write/references/write-zh-bilingual.md +2 -3
- package/skills/write/references/write-zh-prose.md +2 -0
- package/skills/write/references/write-zh.md +144 -68
- package/skills/read/references/save-paths.md +0 -33
package/skills/design/SKILL.md
CHANGED
|
@@ -11,16 +11,47 @@ Prefix your first line with 🥷 inline, not as its own paragraph.
|
|
|
11
11
|
|
|
12
12
|
If it could have been generated by a default prompt, it is not good enough.
|
|
13
13
|
|
|
14
|
+
## Outcome Contract
|
|
15
|
+
|
|
16
|
+
- Outcome: a usable interface or visual fix with a clear point of view and no incoherent layout, text, or responsive breakage.
|
|
17
|
+
- Done when: the real rendered surface or generated artifact has been checked against the user's visual goal and the relevant viewport states.
|
|
18
|
+
- Evidence: screenshots, rendered UI, source components, design tokens, accessibility constraints, and user-provided references.
|
|
19
|
+
- Output: the implemented visual change or a precise visual review with the remaining verification gap named.
|
|
20
|
+
|
|
14
21
|
**Output language rule:** Never use em-dash (—) in any output from this skill. Use commas, colons, or periods instead.
|
|
15
22
|
|
|
16
23
|
**Chinese gut-feel complaints**: when the user says "很傻", "很怪", "突兀", "不协调", "不和谐" about a visual, treat it as an aesthetic rejection, not a debugging symptom. Route to Screenshot Iteration Mode, not to `/hunt`.
|
|
17
24
|
|
|
25
|
+
**Document & print typography → Kami.** When the deliverable is a shippable document rather than a product UI surface (report, slide deck, resume, long-form or print-oriented page, paged PDF), do not hand-roll an over-designed document layout here. Suggest the user run it through Kami (`tw93/Kami`), a document design system with a fixed constraint language and templates, and let Kami draft the detailed plan. Screen layout and typesetting (app surfaces, components, web pages) stays in this skill.
|
|
26
|
+
|
|
18
27
|
## Durable Context Preflight
|
|
19
28
|
|
|
20
29
|
See [rules/durable-context.md](../../rules/durable-context.md) for when to read durable context, the read-order budget, and the memory-type mapping.
|
|
21
30
|
|
|
22
31
|
For `/design`, visual constraints are `decision`, `preference`, and `principle` entries; reusable product and UI patterns are `pattern` and `learning`. Current screenshots, rendered output, code, design tokens, and user feedback override memory. Reuse durable visual preferences and mature interaction patterns, but still name the current visual problem from the screenshot or source before changing code.
|
|
23
32
|
|
|
33
|
+
## Visual Quick-Fix Mode
|
|
34
|
+
|
|
35
|
+
Activate when the user asks for a narrow visual repair with a concrete symptom: overflow, clipped or wrapped text, misalignment, spacing imbalance, contrast/readability, localized text not fitting, or compact responsive breakage. This is for fixing an existing surface, not redesigning it.
|
|
36
|
+
|
|
37
|
+
Flow:
|
|
38
|
+
|
|
39
|
+
1. Read the current UI evidence: screenshot, rendered page, native view, or responsible component.
|
|
40
|
+
2. Name the exact visual defect in one sentence.
|
|
41
|
+
3. Make the smallest material, geometry, spacing, contrast, typography, or text-fit change that fixes that defect.
|
|
42
|
+
4. Verify the real running surface or generated artifact. Check long words, localized strings, compact states, and at least one narrow viewport when applicable.
|
|
43
|
+
5. If the fix touches three or more components, changes product behavior, or reveals a direction problem, stop and switch to Screenshot Iteration Mode or Lock the Direction First.
|
|
44
|
+
|
|
45
|
+
**Spacing unification rule.** If a magic spacing or sizing value has been adjusted three times and the layout still looks off, stop tuning. Replace the N independent padding / gap / margin / size values with one shared named token (`Spacing.s4`, `--gap-content`, `gap-4`). Outer container padding defaults to the same value as inner element gap. Asymmetry that survives tuning is structural, not numeric, so more rounds of magic numbers will not converge. Reduce the count of independent values first, then argue about the specific value.
|
|
46
|
+
|
|
47
|
+
**Fixed-height action slot, uniform typography.** Any container that swaps children based on state (status bar, action slot, toolbar row, menu item) must use one font size across every state. Vary fill, stroke, opacity, color, or icon, never font size. A 1pt height delta between `secondary 13px` and `primary 14px` becomes visible jitter at the state transition. CTA pill buttons in the same slot use the same size (typically 14px), distinguished by background and border, not by typography.
|
|
48
|
+
|
|
49
|
+
**Completion screen layout.** Operation-complete surfaces show the single result the user came for: the actual reclaimed size / processed count / changed state. Long explanations belong in a details overlay opened from a summary row, not in the primary completion line. Do not add a separate "Review" button next to the summary row when one tap on the row already opens details; do not show an empty "0 skipped" entry point. If there is no skipped or failed item, hide the details affordance entirely.
|
|
50
|
+
|
|
51
|
+
**Safety-bound action design.** For cleanup, deletion, uninstall, reset, or permission-changing surfaces, do not make the UI feel simpler by hiding recoverability. Bulk select, auto-select, one-tap delete, or "recommended" destructive defaults are only appropriate when each row is understandable to the target user and carries enough identity to verify safety (name, source, owner, path, preview, or recovery implication as relevant). If rows are opaque identifiers, inferred leftovers, or machine-only paths, prefer review-first UI, current-target scoping, disabled destructive affordances, or explanatory grouping over faster batch controls. A feature request for fewer clicks is not enough to remove the user's ability to verify what will change.
|
|
52
|
+
|
|
53
|
+
**Quiet product boundary.** Fewer clicks and richer controls are not automatically better. Remove misleading affordances before adding alternate controls, prefer quiet defaults for diagnostics and alerts, and fix unstable motion cadence before changing speed or adding a new motion preference. If the current UI implies an action, state, or promise it cannot support, remove that implication first.
|
|
54
|
+
|
|
24
55
|
## Screenshot Iteration Mode
|
|
25
56
|
|
|
26
57
|
Activate when the user sends a screenshot or image alongside a complaint ("这里很丑", "这个不对", "fix this", "looks wrong"). The existing product is the direction. Skip the five-question direction lock.
|
|
@@ -56,7 +87,7 @@ Before writing any code, ask the user directly, using the environment's native q
|
|
|
56
87
|
2. **What is the aesthetic direction?** Name it precisely: dense editorial, raw terminal, ink-on-paper, brutalist grid, warm analog. "Clean and modern" is not a direction. If the user names a reference site or product ("feels like Linear / Claude.ai / Vercel"), do not accept it as a direction -- extract 3 concrete properties from it: button radius philosophy, surface depth treatment (shadow vs background step vs border), and accent color family. Name those instead.
|
|
57
88
|
|
|
58
89
|
**Shortcut for well-known brands**: see "Brand preset flow" in `references/design-reference.md`. Ask first, run the preset, then decompose against the generated file.
|
|
59
|
-
3. **What is the
|
|
90
|
+
3. **What is the design signature?** A typeface, color system, unexpected motion, asymmetric layout. Pick one and make it obvious.
|
|
60
91
|
4. **What are the hard constraints?** Framework, bundle size, contrast minimums, keyboard accessibility.
|
|
61
92
|
5. **What is the signature micro-interaction?** Scale on press, staggered reveal, or contextual icon animation. Pick one and know exactly how it's implemented.
|
|
62
93
|
|
|
@@ -73,6 +104,10 @@ Lift exact values: hex codes, spacing scale entries, font stacks, border radii.
|
|
|
73
104
|
|
|
74
105
|
Only attach the target component folder or package. Exclude `.git`, `node_modules`, `dist`, and lock files. Dragging in an entire monorepo pollutes the context with irrelevant code and degrades output quality.
|
|
75
106
|
|
|
107
|
+
### Existing-native-app exception (do not propose wholesale platform restyling)
|
|
108
|
+
|
|
109
|
+
When the target is an existing macOS / iOS / Android native app that already has a coherent visual direction, do not propose a wholesale port to a newer platform style (macOS 26 Liquid Glass, iOS 18 frosted material, Material You, Fluent Design, etc.) as the default improvement plan. Wholesale restyling reads as "I do not have a specific design intent, here is the platform's." Default to incremental polish on the existing direction: spacing, alignment, hover and focus states, typography hierarchy, copy tightening, motion timing. Only propose a platform-style migration when the user has explicitly asked for it in this turn, or when the existing direction is broken in a way that incremental polish cannot fix. State the existing direction in one sentence before proposing changes so the user can correct the read.
|
|
110
|
+
|
|
76
111
|
### App shell exception (sidebar + main workspace)
|
|
77
112
|
|
|
78
113
|
If question 1 is an app shell (Slack, Linear, Notion class), load the "App shell rules" section in `references/design-reference.md` and apply those constraints before proceeding.
|
|
@@ -90,7 +125,7 @@ Summarize the direction as three lines before writing any code:
|
|
|
90
125
|
|
|
91
126
|
For production or multi-page UIs, expand the thesis into the 9-section DESIGN.md scaffold in `references/design-reference.md` (theme, palette, typography, components, layout, depth, do/don't, responsive, prompt guide). For a single component, the three lines are sufficient.
|
|
92
127
|
|
|
93
|
-
##
|
|
128
|
+
## Hard Rules
|
|
94
129
|
|
|
95
130
|
`references/design-reference.md` is already loaded during direction lock. It owns the full rules: typography, OKLCH color, motion timings, layout defaults, CSS-pattern bans, accessibility baseline, and complexity matching. Apply them. Do not restate them here.
|
|
96
131
|
|
|
@@ -108,7 +143,9 @@ Give at least 3 variations across genuinely different dimensions (density, typog
|
|
|
108
143
|
| Chose glassmorphism, ignored the mobile constraint | `backdrop-filter` is expensive on low-power devices. Name the tradeoff. |
|
|
109
144
|
| Light-mode app: white panel on white background, visually indistinguishable | Adjacent nested surfaces must differ visually. Either background step (sidebar vs main ≥4% lightness difference) or shadow minimum `0 1px 3px rgba(0,0,0,0.10)`. |
|
|
110
145
|
| Fixed visual polish by redesigning the whole surface | Locate the concrete visual delta first, then make the smallest material, opacity, geometry, or typography change that addresses it. |
|
|
146
|
+
| Added a setting or louder control to solve UI noise | Remove the misleading affordance or choose a quiet default first |
|
|
111
147
|
| English looked fine, localized text overflowed | Test long words and localized strings before handoff, especially inside buttons, tabs, nav, and compact cards. |
|
|
148
|
+
| Relied on `…` truncation to fit text in a fixed-width slot | Guarantee fit instead: compact the format, cap to whole segments, or hard-trim with no glyph. Metric and label footers must never tail-truncate into an ellipsis. |
|
|
112
149
|
|
|
113
150
|
## Aesthetic Review
|
|
114
151
|
|
|
package/skills/design/references/design-reference.md
CHANGED
|
@@ -123,6 +123,13 @@ When extending an existing interface, first spend time understanding its visual
|
|
|
123
123
|
|
|
124
124
|
If swapping in different content would make the new component look out of place, the vocabulary was not matched closely enough.
|
|
125
125
|
|
|
126
|
+
### Responsive & Screen Verification
|
|
127
|
+
- Verify the rendered surface, not a type check or CSS-balance read. Several regressions (early wraps, orphaned separator dots, table overflow) are invisible in source and only show in the render. Screenshot at phone (375px, plus 320px for buttons) and desktop (1280px), in every shipped locale.
|
|
128
|
+
- Line widows: eliminate 1-2 word last lines by trimming the copy so the block rebalances, not by adding a `max-width` cap (a cap narrower than its container wraps early and leaves empty space on the right, which reads as a premature break). Detect objectively: flag any text block whose last line is under ~13% of its widest line; eyeballing misses them, and nested `<code>` hides them from greps.
|
|
129
|
+
- Mobile CTA resting state: natural width, left-aligned to the surrounding text edge, height unchanged. Centering reads as floating; full-width `flex: 1` reads heavy; dropping button height to relieve a "too full" feel treats a width problem as a height one.
|
|
130
|
+
- Spacing is a system, not a per-gap value. Run section spacing as one responsive ladder; when a page reads too airy or too tight, scale the whole set by a single factor across all breakpoints rather than tuning one gap. Asymmetry that survives tuning is structural.
|
|
131
|
+
- Long-form and documentation surfaces stay light: a borderless prev/next text pager (not bordered cards), a sidebar active state as a thin rail rather than a filled block, and build-time zero-runtime-JS code highlighting (bake static spans, plain code stays the source) over a shipped highlighter.
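The widow check described in the list above can be sketched as a small function. This is only an illustration, assuming the per-line widths of a text block have already been measured from the rendered surface (for example in pixels); the name `has_widow` and the input shape are hypothetical and not part of the package.

```python
from typing import Sequence

# The "~13%" threshold quoted in the rule; treat it as a tunable assumption.
WIDOW_RATIO = 0.13


def has_widow(line_widths: Sequence[float], ratio: float = WIDOW_RATIO) -> bool:
    """Return True when the last rendered line is under `ratio` of the widest line."""
    if len(line_widths) < 2:
        return False  # a single-line block cannot have a widow
    widest = max(line_widths)
    if widest <= 0:
        return False  # nothing measurable was rendered
    return line_widths[-1] < ratio * widest
```

Measuring the widths still has to happen against the real render, since the rule itself notes that these breaks are invisible in source.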
|
|
132
|
+
|
|
126
133
|
## Data Visualization Surfaces
|
|
127
134
|
|
|
128
135
|
For dashboards, analytics views, chart-heavy interfaces, or number-dense displays, load `references/design-data-viz.md`. It owns dashboard defaults, chart selection, number alignment, and product-benchmark extraction.
|
|
@@ -140,6 +147,16 @@ Reject: Inter, DM Sans, DM Serif Display, DM Serif Text, Outfit, Plus Jakarta Sa
|
|
|
140
147
|
3. Reject all three.
|
|
141
148
|
4. Pick a typeface from a named foundry (Klim, Commercial Type, Colophon, Grilli Type, OH no Type, Village, etc.) or an open-source option with a clear personality that matches the brand words. Be able to explain why that specific typeface in one sentence.
|
|
142
149
|
|
|
150
|
+
## CJK & Multilingual Type
|
|
151
|
+
|
|
152
|
+
When the interface mixes Chinese, Japanese, or Korean with Latin, Latin-only type rules silently break the CJK text. Apply these before handoff:
|
|
153
|
+
|
|
154
|
+
- **Latin face first, system CJK face after** in the stack, so each script renders with correct glyphs: `font-family: -apple-system, "SF Pro Text", "PingFang SC", "Noto Sans SC", sans-serif;`. Latin runs use the Latin face; Han characters fall through to the CJK face.
|
|
155
|
+
- **Give CJK body text more line-height than Latin**: roughly 1.7–1.8 for reading. Dense Hanzi needs more vertical room than the 1.4–1.5 that suits Latin body copy.
|
|
156
|
+
- **Tag runs with `lang="zh"` / `lang="ja"` / `lang="en"`** so the browser picks the right font and line-breaking. Mixed-language paragraphs break badly without it.
|
|
157
|
+
- **Serif reading modes need an explicit CJK serif fallback.** Most Latin "reading serif" webfonts carry no CJK glyphs, so a serif toggle silently drops Chinese back to a sans and looks broken. Pair them: `"Newsreader", "Songti SC", "Noto Serif SC", serif`.
|
|
158
|
+
- **Do not apply negative letter-spacing to CJK runs.** The display-type tracking rule above is Latin-only; tightening tracking on Hanzi cramps the glyphs and reads as a rendering bug. Scope tracking to `lang="en"` runs.
|
|
159
|
+
|
|
143
160
|
## Color System: OKLCH Rules
|
|
144
161
|
|
|
145
162
|
- Use OKLCH instead of HSL. OKLCH is perceptually uniform: equal numeric changes produce equal perceived changes across the spectrum.
|
|
package/skills/design/references/design-tokens.md
CHANGED
|
@@ -1,4 +1,6 @@
|
|
|
1
|
-
# Design Tokens: Color
|
|
1
|
+
# Design Tokens: Color and Typography
|
|
2
|
+
|
|
3
|
+
Motion rules live in [design-reference.md](./design-reference.md) under Animation and Motion Specifics. This file owns color and typography only.
|
|
2
4
|
|
|
3
5
|
## Color System: OKLCH Rules
|
|
4
6
|
|
|
@@ -41,13 +43,3 @@ Reject: Inter, DM Sans, DM Serif Display, DM Serif Text, Outfit, Plus Jakarta Sa
|
|
|
41
43
|
- `-webkit-font-smoothing: antialiased; -moz-osx-font-smoothing: grayscale` once on root layout (macOS only)
|
|
42
44
|
- `font-variant-numeric: tabular-nums` for counters, timers, prices, number columns
|
|
43
45
|
- Letter-spacing: roughly -0.022em for display sizes (32px+), -0.012em for mid-range (20-28px), normal at 16px and below
|
|
44
|
-
|
|
45
|
-
## Motion Specifics
|
|
46
|
-
|
|
47
|
-
- No bounce or elastic easing. Use exponential ease-out: `cubic-bezier(0.16,1,0.3,1)` for natural deceleration.
|
|
48
|
-
- Animate `transform` and `opacity` only. Every other property triggers layout or paint.
|
|
49
|
-
- For height reveals: `grid-template-rows: 0fr` to `1fr` (avoids `height: auto` animation trap).
|
|
50
|
-
- Icon swaps: 120ms cross-fade with `opacity` and subtle `scale(0.9)` to `scale(1)`.
|
|
51
|
-
- Scale on press: `scale(0.96)` on active/press via CSS transitions.
|
|
52
|
-
- Page-load guard: `initial={false}` on animated presence wrappers for toggles and tabs (prevents enter animations on first render).
|
|
53
|
-
- Honor `prefers-reduced-motion`: disable or reduce animations when set.
|
package/skills/health/SKILL.md
CHANGED
|
@@ -1,11 +1,11 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: health
|
|
3
|
-
description: "Runs a budget-aware
|
|
3
|
+
description: "Runs a budget-aware agent-assisted engineering health audit for instruction/config drift, hooks/MCP, verifier surfaces, and AI maintainability. Use when users ask 检查claude/检查codex/检查pi/配置检查/健康度 or report agents ignoring instructions, missing validation, or code becoming hard to maintain. Not for debugging code or reviewing PRs."
|
|
4
4
|
when_to_use: "检查claude, 检查codex, 检查pi, Codex 配置, Pi 配置, AGENTS.md, config.toml, agent instructions, 健康度, 配置检查, 配置对不对, AI coding 腐化, 代码变烂, 维护性, 上下文混乱, 验证缺失, 验证命令失真, Claude ignoring instructions, Pi coding agent, check config, settings not working, audit config"
|
|
5
5
|
dispatch_intent: "Codex/Claude/Pi ignoring instructions, agent config audit, hooks/MCP broken, health token usage, AI coding code rot, hotspot ownership, unclear context, missing verification, stale verifier output"
|
|
6
6
|
---
|
|
7
7
|
|
|
8
|
-
# Health: Agent
|
|
8
|
+
# Health: Agent-Assisted Engineering Health
|
|
9
9
|
|
|
10
10
|
Prefix your first line with 🥷 inline, not as its own paragraph.
|
|
11
11
|
|
|
@@ -14,6 +14,18 @@ Audit the current project's agent setup and AI coding maintainability against th
|
|
|
14
14
|
|
|
15
15
|
Find violations. Identify the misaligned layer. Calibrate to project complexity only.
|
|
16
16
|
|
|
17
|
+
## Outcome Contract
|
|
18
|
+
|
|
19
|
+
- Outcome: a budget-aware health report that separates agent configuration risk from AI maintainability risk.
|
|
20
|
+
- Done when: each finding names the misaligned layer, the concrete evidence, and a copy-pasteable action or diagnostic command.
|
|
21
|
+
- Evidence: collected health script output, tracked project instructions, runtime config summaries, verifier logs, hooks/MCP surfaces, and live probes when needed.
|
|
22
|
+
- Output: prioritized findings with status, impact, and next action, or a clear clean bill with residual risk.
|
|
23
|
+
|
|
24
|
+
Two lanes share one report:
|
|
25
|
+
|
|
26
|
+
- **Agent config health**: Codex/Claude/Pi instruction drift, permissions, hooks, MCP, skills, and memory supply chain.
|
|
27
|
+
- **AI maintainability health**: project context surface, verifier wrapper, generated-artifact checks, hotspot ownership, and stale or misleading durable docs.
|
|
28
|
+
|
|
17
29
|
**Output language:** Check in order: (1) project agent instructions (`AGENTS.md` before runtime-specific files); (2) global agent instructions; (3) user's recent language; (4) English.
|
|
18
30
|
|
|
19
31
|
**Budget posture:** Start with the summary audit. Escalate automatically when the user asks for a deep, full, complete, thorough, "深入", "完整", "彻底", or "继续跑完" audit, when the user explicitly mentions AI coding code rot, Codex/Claude config drift, unclear context, missing verification, verifier output that points at stale paths, or "代码变烂", when current project instructions or remembered user preference says to run deep health checks by default, when the project is Complex, or when the summary pass exposes a critical ambiguity that cannot be resolved locally. Otherwise do not read full conversation extracts or launch inspector subagents. Tell the user before escalating because deep health audits can consume significant token quota.
|
|
@@ -28,13 +40,11 @@ For `/health`, audit expectations are `decision`, `preference`, and `principle`
|
|
|
28
40
|
|
|
29
41
|
Pick one. Apply only that tier's requirements.
|
|
30
42
|
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
|
34
|
-
| **
|
|
35
|
-
| **
|
|
36
|
-
| **Complex** | >5K files, multi-contributor, active CI | Full six-layer setup required |
|
|
37
|
-
|
|
43
|
+
| Tier | Signal | What's expected |
|
|
44
|
+
|---|---|---|
|
|
45
|
+
| **Simple** | <500 files, 1 contributor, no CI | CLAUDE.md only; 0-1 skills; hooks optional |
|
|
46
|
+
| **Standard** | 500-5K files, small team or CI | CLAUDE.md + 1-2 rules; 2-4 skills; basic hooks |
|
|
47
|
+
| **Complex** | >5K files, multi-contributor, active CI | Full six-layer setup required |
|
|
38
48
|
|
|
39
49
|
## Step 1: Collect data
|
|
40
50
|
|
|
@@ -74,23 +84,27 @@ The collector includes both runtime-specific and agent-agnostic surfaces:
|
|
|
74
84
|
|
|
75
85
|
Test every MCP server: call one harmless tool per server. Record `live=yes/no` with error detail. Respect `enabled: false` (skip without flagging). For API keys, only check if the env var is set (`echo $VAR | head -c 5`), never print full keys.
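A stricter variant of the API-key check reports only whether the variable is set and never echoes any part of its value. The sketch below is an assumption about how such a probe could look; `env_status` is a hypothetical helper, not something the package ships.

```python
import os
from typing import Mapping, Optional


def env_status(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Report "set" or "unset" for an environment variable without exposing its value."""
    source = os.environ if env is None else env
    # An empty string is treated as unset, since an empty key cannot authenticate.
    return "set" if source.get(name) else "unset"
```

Passing an explicit mapping keeps the helper testable without touching the real environment.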
|
|
76
86
|
|
|
77
|
-
##
|
|
87
|
+
## Step 1c: Safety and security checks
|
|
88
|
+
|
|
89
|
+
These run after collection and before the Step 2 analysis. The first two apply to every audit; the third only to projects with long-running or autonomous agents.
|
|
90
|
+
|
|
91
|
+
### Security Baseline Checks
|
|
78
92
|
|
|
79
93
|
Run these on every audit, regardless of tier. They are the floor, not the ceiling.
|
|
80
94
|
|
|
81
|
-
**Deny-list floor.**
|
|
95
|
+
**Deny-list floor.** Apply this only when the project or runtime exposes agent permission settings, hook settings, MCP settings, allowed/denied tools, or a documented autonomous-agent launcher. In that case, the settings should deny, at minimum: credential and key directories (SSH, cloud providers, GPG, gh CLI), secret files (`.env`, `credentials*`, `secrets*`), pipe-to-shell installers (`curl ... | bash`, `wget ... | sh`), and outbound shells (`ssh`, `scp`, `nc`). Report this as one concise WARN with the missing categories and suggested fix; let the reviewer fill in exact local paths from the environment. If no agent settings surface exists, report the deny-list as not applicable rather than a failure.
|
|
82
96
|
|
|
83
97
|
**Environment override surface.** Treat the following as attack surface, report when set in tracked files or shipped settings without a justification comment: API base-URL overrides (redirect all traffic to a third party), auto-trust flags for project-local MCP servers, wildcard tool allowlists (`allowedTools: ["*"]`), and permission-skip flags (`--dangerously-skip-permissions` or equivalents). Print file:line and the key name only; never print secrets.
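The reporting rule above (file:line and key name only) can be sketched as a line scanner. This is a minimal illustration covering just the two literal markers quoted in the paragraph; the function name `scan`, the labels, and the regular expressions are assumptions, and the "justification comment" exemption is deliberately left out.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical pattern table: label -> regex. Only the two markers that the
# text spells out literally are covered here.
RISKY = {
    "permission-skip flag": re.compile(r"--dangerously-skip-permissions"),
    "wildcard tool allowlist": re.compile(r'allowedTools"?\s*:\s*\[\s*"\*"\s*\]'),
}


def scan(path: str, lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (file:line, label) pairs; the matched line text is never returned."""
    findings = []
    for lineno, text in enumerate(lines, start=1):
        for label, pattern in RISKY.items():
            if pattern.search(text):
                findings.append((f"{path}:{lineno}", label))
    return findings
```

Returning only the location and a label means a secret that happens to sit on the same line cannot leak into the report.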
|
|
84
98
|
|
|
85
|
-
|
|
99
|
+
### Memory and Skill Supply Chain
|
|
86
100
|
|
|
87
101
|
Treat agent memory and third-party skills as supply-chain artifacts. They run with the user's privileges.
|
|
88
102
|
|
|
89
103
|
**Memory hygiene.** Audit the project's long-term agent memory store for secrets, tokens, or credentials (Critical), and for entries written by untrusted runs (subagent invoked on attacker-controlled input, /loop iteration over external content); recommend rotation after such runs. For high-risk one-off runs (untrusted PDFs, uncontrolled scraping, third-party scripts), recommend disabling memory persistence for that session entirely.
|
|
90
104
|
|
|
91
|
-
**Skill supply chain.** Third-party skills, plugins, and MCP servers run with the user's privileges. For each one not authored in this repo, check: source pinned to a release tag (not `main
|
|
105
|
+
**Skill supply chain.** Third-party skills, plugins, and MCP servers run with the user's privileges. For each one not authored in this repo, check: source pinned to a release tag or revision (not `main`, a branch, or a remote git marketplace left tracking its latest head), hook handlers do not write to credential directories, MCP servers have explicit user consent (not auto-trusted by wildcard). Report unpinned sources or unreviewed hook handlers as Structural, not Critical, unless an active exploit signal is present.
|
|
92
106
|
|
|
93
|
-
|
|
107
|
+
### Long-Running Agent Stop Conditions
|
|
94
108
|
|
|
95
109
|
For projects that use `/loop`, autonomous agents, or any long-running agent flow, the project must define explicit stop conditions. An agent that never stops is a budget and safety incident waiting to happen.
|
|
96
110
|
|
|
@@ -101,7 +115,7 @@ Audit for these four hard stop signals; flag the absence of each as a Structural
|
|
|
101
115
|
3. **Cost or token budget exceeded.** Project should declare a per-run budget (tokens, API spend, wall-clock minutes). Loop exits when the budget is hit, not when work is done.
|
|
102
116
|
4. **External blockers.** Merge conflict on the target branch, dependency lock the agent cannot resolve, missing credential, network unreachable. Any of these halt the loop and ask the user, not retry forever.
|
|
103
117
|
|
|
104
|
-
The stop conditions should live in tracked project docs (`AGENTS.md`, the loop's launch script, or a dedicated config), not only in the agent's prompt. Prompts are forgettable; tracked config is enforceable. Recommend hooks (PostToolUse on the relevant tools) over prompt instructions when the project supports them: a hook physically cannot be skipped, a prompt instruction can.
|
|
118
|
+
The stop conditions should live in tracked project docs (`AGENTS.md`, the loop's launch script, or a dedicated config), not only in the agent's prompt. Prompts are forgettable; tracked config is enforceable. Recommend hooks (PostToolUse on the relevant tools) over prompt instructions when the project supports them: a hook physically cannot be skipped, a prompt instruction can. Confirm the host's hook coverage before recommending one: some agents only fire PostToolUse for a subset of tools (for example, a runtime may match shell/Bash only), so a fixup that must run after file edits belongs on a Stop or session-end hook there instead.
|
|
105
119
|
|
|
106
120
|
## Step 2: Analyze
|
|
107
121
|
|
|
@@ -155,6 +169,20 @@ bash skills/health/scripts/check-agent-context.sh . summary
|
|
|
155
169
|
|
|
156
170
|
**AI-maintainability gaps.** Use `AI MAINTAINABILITY SUMMARY` in summary mode and `AI MAINTAINABILITY DETAIL` in deep mode. Report `FAIL` when the project has no executable verification command, no agent instruction surface for a non-trivial repo, or broken doc references. Report `WARN` when instructions exist but lack a project map, verification guidance, boundary/non-goal language, when TODO/HACK markers are concentrated, when large source hotspots lack ownership/boundary and verification guidance, or when durable docs contain raw one-off review reports, scorecards, dated line references, or diagnostic dumps instead of stable invariants. Treat missing `docs/`, `specs/`, `.specify/`, `HANDOFF.md`, `CHANGELOG`, issue templates, and PR templates as informational unless project complexity makes them necessary for handoff. The action for stale reports is to extract stable rules into public instructions, rules, references, or verifier scripts, then remove or archive the transient report.
|
|
157
171
|
|
|
172
|
+
**Conversation-derived guidance.** When a health audit reads recent agent conversations, do not recommend copying the conversation or a scorecard into docs. Recommend a candidate-matrix pass instead:
|
|
173
|
+
|
|
174
|
+
| Field | Question |
|
|
175
|
+
|---|---|
|
|
176
|
+
| Repeated failure | Did this recur across fixes, releases, agents, or user reports? |
|
|
177
|
+
| Durable invariant | Can the lesson be stated as a stable rule, not a dated incident summary? |
|
|
178
|
+
| Target layer | Should it live in project instructions, a Waza skill, a global rule, or private memory? |
|
|
179
|
+
| Verifier | Is there a deterministic command, script, artifact check, or runtime smoke that can enforce it? |
|
|
180
|
+
| Redaction risk | Does the lesson require local paths, issue numbers, customer details, machine state, secrets, or unpublished release facts? |
|
|
181
|
+
|
|
182
|
+
Layering rule: project-specific commands, app names, artifact names, and release rituals stay in the project; reusable workflows such as cancelled-release review gates or native-freeze evidence ladders belong in Waza skills; universal honesty and verification rules belong in global CLAUDE/AGENTS; private user preferences and one-machine facts stay in memory. If the lesson cannot pass the redaction-risk field, keep it out of public guidance.
|
|
183
|
+
|
|
184
|
+
**Concentrated fix chains.** Run `git log --oneline --since='2 weeks ago' | grep -i fix` and group by area (the prefix before `:` or `(`). When the same area has 3+ fix commits in a short window, it signals a missing structural invariant: each fix is a guess at a rule that was never written down. Report a Structural `WARN` with the area name, fix count, and recommend adding an explicit rule to `AGENTS.md` / `CLAUDE.md` / project rules that captures the invariant those fixes were converging toward. A concentrated fix chain that touches the same file 4+ times is a stronger signal than scattered fixes across different files.
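The grouping step can be sketched as follows. It assumes the input is the output of `git log --oneline` split into lines, and it takes the rule literally: the area is whatever precedes the first `:` or `(` in the subject. For subjects shaped like `fix(scope): ...` that prefix is the commit type rather than the scope, so the extraction may need adjusting per project. The names `fix_areas` and `concentrated` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Area = the text before the first ":" or "(" in the commit subject.
AREA_RE = re.compile(r"^\s*([^:(]+?)\s*[:(]")


def fix_areas(oneline_log: Iterable[str]) -> Counter:
    """Count fix commits per area from `git log --oneline` lines."""
    counts: Counter = Counter()
    for line in oneline_log:
        # Each line is "<short-sha> <subject>"; drop the sha.
        _sha, _, subject = line.strip().partition(" ")
        if "fix" not in subject.lower():
            continue  # same filter as `grep -i fix`, applied to the subject
        match = AREA_RE.match(subject)
        area = match.group(1).lower() if match else "(no area)"
        counts[area] += 1
    return counts


def concentrated(counts: Counter, threshold: int = 3) -> List[str]:
    """Areas with at least `threshold` fix commits, sorted by name."""
    return sorted(area for area, n in counts.items() if n >= threshold)
```

The default threshold of 3 mirrors the "3+ fix commits" signal in the rule; the per-file count of 4+ would need `git log --name-only` output and is not covered here.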
|
|
185
|
+
|
|
158
186
|
**Hotspot ownership gaps.** In deep mode, read `HOTSPOT OWNERSHIP SURFACE`. If a largest source file exceeds the hotspot threshold and `AGENTS.md` / `CLAUDE.md` / shared instruction files do not name who owns the hotspot, what boundary should stay stable, and which verification command covers it, report a Structural `WARN`. Do not treat documented large files as code rot by size alone; some modules are intentionally large.
|
|
159
187
|
|
|
160
188
|
**Missing stable verifier wrapper.** If the repo exposes multiple verification commands through CI, scripts, or manifests but `Makefile` has no `check`, `test`, or `verify` target, report a Structural `WARN`. This is an AI-maintainability gap because agents need one stable default entrypoint, not because the project is broken.
|
|
@@ -217,15 +245,14 @@ If no issues: `All relevant checks passed. Nothing to fix.`
|
|
|
217
245
|
|
|
218
246
|
## Gotchas
|
|
219
247
|
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
|
223
|
-
|
|
|
224
|
-
|
|
|
225
|
-
|
|
|
226
|
-
| Flagged intentionally noisy hook as broken | Ask before calling a hook "broken" |
|
|
248
|
+
| What happened | Rule |
|
|
249
|
+
|---|---|
|
|
250
|
+
| Missed the local override | Always read `settings.local.json` too; it shadows the committed file |
|
|
251
|
+
| Subagent timeout reported as MCP failure | MCP failures come from the live probe, not data collection |
|
|
252
|
+
| Reported issues in wrong language | Honor CLAUDE.md Communication rule first |
|
|
253
|
+
| Flagged intentionally noisy hook as broken | Ask before calling a hook "broken" |
|
|
227
254
|
| Hook seemed not to fire, but it did -- a later UI element rendered above it | Hook firing order is not visual order. Before re-editing the hook config: (a) confirm with `--debug` or by piping output, (b) check whether a diff dialog, permission prompt, or other UI element rendered on top and pushed the hook output offscreen, (c) only then suspect the hook itself. |
|
|
228
|
-
| `/health` burned too much quota on first run
|
|
229
|
-
| Treated missing specs/docs as a failure
|
|
230
|
-
| Treated an ignored AGENTS/CLAUDE file as durable project truth
|
|
231
|
-
| Treated a review scorecard as maintainability documentation
|
|
255
|
+
| `/health` burned too much quota on first run | Stay in summary mode first. Full conversation extracts and inspector subagents are deep-audit tools, not the default path for Standard projects. |
|
|
256
|
+
| Treated missing specs/docs as a failure | Decision artifacts are optional by default. Escalate missing docs/specs only when the tier, active handoff risk, or user request makes them necessary. |
|
|
257
|
+
| Treated an ignored AGENTS/CLAUDE file as durable project truth | Report whether the rule is tracked and distributed. Local overlays can inform the audit, but durable fixes belong in public repo docs or shipped skill/rule files. |
|
|
258
|
+
| Treated a review scorecard as maintainability documentation | Scorecards are snapshots. Extract the invariant and verification path, then remove or archive the report instead of calling the score itself a durable rule. |
|
|
@@ -22,7 +22,7 @@ rules/ checks:
|
|
|
22
22
|
|
|
23
23
|
Skill checks:
|
|
24
24
|
- SIMPLE: 0–1 skills is fine.
|
|
25
|
-
- ALL tiers: If skills exist, descriptions should be
|
|
25
|
+
- ALL tiers: If skills exist, descriptions should be concise, triggerable, include `Use when`, include `Not for`, and avoid overlapping triggers.
|
|
26
26
|
- STANDARD+: Low-frequency skills may use `disable-model-invocation: true`, but Claude Code plugin skills should not rely on it until upstream invocation bugs are fixed.
|
|
27
27
|
|
|
28
28
|
MEMORY.md checks, STANDARD+:
|
|
@@ -20,6 +20,10 @@ from pathlib import Path
|
|
|
20
20
|
SENSITIVE_RE = re.compile(r"(api[_-]?key|token|secret|password|credential)", re.IGNORECASE)
|
|
21
21
|
PROJECT_RE = re.compile(r'^\[projects\."(.+)"\]\s*$')
|
|
22
22
|
TABLE_RE = re.compile(r'^\[([A-Za-z0-9_.@"\-/]+)\]\s*$')
|
|
23
|
+
OPERATIONAL_RULE_RE = re.compile(
|
|
24
|
+
r"(Git Safety|Public Issue Replies|Investigation Honesty|Verification|Response Style|Commit|Security)",
|
|
25
|
+
re.IGNORECASE,
|
|
26
|
+
)
|
|
23
27
|
|
|
24
28
|
|
|
25
29
|
def rel(path: Path, root: Path) -> str:
|
|
@@ -121,6 +125,20 @@ def claude_delegates_to_agents(path: Path) -> bool:
|
|
|
121
125
|
return any("AGENTS.md" in line for line in meaningful)
|
|
122
126
|
|
|
123
127
|
|
|
128
|
+
def has_operational_rules(path: Path) -> bool:
|
|
129
|
+
text = read(path, 40_000)
|
|
130
|
+
if not text:
|
|
131
|
+
return False
|
|
132
|
+
return len(set(m.group(1).lower() for m in OPERATIONAL_RULE_RE.finditer(text))) >= 2
|
|
133
|
+
|
|
134
|
+
|
|
135
|
+
def looks_identity_only(path: Path) -> bool:
|
|
136
|
+
text = read(path, 40_000)
|
|
137
|
+
if not text:
|
|
138
|
+
return False
|
|
139
|
+
return "nian-identity:start" in text and not has_operational_rules(path)
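To make the "at least two distinct topics" threshold concrete, here is a text-only sketch of the same counting logic. The regex is copied from the hunk above; `distinct_rule_topics` and `text_has_operational_rules` are hypothetical names that skip the file read so the behaviour can be checked in isolation.

```python
import re

# Pattern copied from the diff; matching is case-insensitive.
OPERATIONAL_RULE_RE = re.compile(
    r"(Git Safety|Public Issue Replies|Investigation Honesty|Verification|Response Style|Commit|Security)",
    re.IGNORECASE,
)


def distinct_rule_topics(text: str) -> set[str]:
    """Return the lowercased set of rule topics mentioned anywhere in text."""
    return {m.group(1).lower() for m in OPERATIONAL_RULE_RE.finditer(text)}


def text_has_operational_rules(text: str) -> bool:
    """True when two or more different topics appear, as in the diff."""
    return len(distinct_rule_topics(text)) >= 2
```

Repeating one topic does not count twice. Because the pattern matches anywhere in the text, ordinary prose that mentions "commit" and "security" also satisfies the threshold, so the check is a heuristic rather than a heading parser.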
|
|
140
|
+
|
|
141
|
+
|
|
124
142
|
def parse_codex_config(
|
|
125
143
|
path: Path,
|
|
126
144
|
) -> tuple[dict[str, str], list[str], list[str], list[str], list[str]]:
|
|
@@ -161,7 +179,7 @@ def parse_codex_config(
|
|
|
161
179
|
if "=" not in line:
|
|
162
180
|
continue
|
|
163
181
|
key, value = [part.strip() for part in line.split("=", 1)]
|
|
164
|
-
if section == "features" and value.lower() == "true":
|
|
182
|
+
if section == "features" and value.split("#", 1)[0].strip().strip('"').lower() == "true":
|
|
165
183
|
features.append(key)
|
|
166
184
|
elif section.startswith('projects."') and key == "trust_level":
|
|
167
185
|
project = section[len('projects."'): -1]
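The changed `features` condition strips an inline comment and optional double quotes before comparing against `true`. A small sketch of that normalization, with `is_enabled` as a hypothetical helper name:

```python
def is_enabled(raw_value: str) -> bool:
    """Return True when the right-hand side of `key = value` means true.

    Mirrors the expression in the diff: cut at the first "#", trim
    whitespace, trim surrounding double quotes, compare case-insensitively.
    """
    return raw_value.split("#", 1)[0].strip().strip('"').lower() == "true"
```

With the old comparison, a value such as `true  # experimental` or `"true"` would not have been counted as an enabled feature; both pass here. The cut at `#` is naive, which is acceptable for boolean values but would truncate a quoted string containing `#`.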
|
|
@@ -344,6 +362,25 @@ def main() -> int:
|
|
|
344
362
|
if not global_claude.is_file() and not claude.is_file():
|
|
345
363
|
claude_findings.append("Claude instruction surface not found")
|
|
346
364
|
|
|
365
|
+
if (
|
|
366
|
+
global_claude.is_file()
|
|
367
|
+
and has_operational_rules(global_claude)
|
|
368
|
+
and global_codex_agents.is_file()
|
|
369
|
+
and looks_identity_only(global_codex_agents)
|
|
370
|
+
):
|
|
371
|
+
codex_findings.append(
|
|
372
|
+
"global Codex AGENTS.md has identity/memory context but lacks operational rules present in global Claude CLAUDE.md"
|
|
373
|
+
)
|
|
374
|
+
codex_config_text = read(codex_config) if codex_config.is_file() else ""
|
|
375
|
+
if (
|
|
376
|
+
'sandbox_mode = "danger-full-access"' in codex_config_text
|
|
377
|
+
and 'approval_policy = "never"' in codex_config_text
|
|
378
|
+
and "deny" not in codex_config_text.lower()
|
|
379
|
+
):
|
|
380
|
+
codex_findings.append(
|
|
381
|
+
"Codex high-permission mode lacks a deny floor; add denies for secrets, credentials, pipe-to-shell installers, and outbound shells"
|
|
382
|
+
)
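The deny-floor condition above is three substring checks. Isolated as a pure function (the name `lacks_deny_floor` is an assumption for illustration), it looks like this:

```python
def lacks_deny_floor(config_text: str) -> bool:
    """True when the config enables full access with no approvals and
    never mentions "deny" anywhere (case-insensitive)."""
    return (
        'sandbox_mode = "danger-full-access"' in config_text
        and 'approval_policy = "never"' in config_text
        and "deny" not in config_text.lower()
    )
```

Because the match is on exact substrings, a config written as `sandbox_mode="danger-full-access"` (no spaces around `=`) would not trigger the finding, and any occurrence of the word "deny", even in a comment, suppresses it. Treat the result as a prompt for manual review rather than a guarantee.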
|
|
383
|
+
|
|
347
384
|
conflict_findings: list[str] = []
|
|
348
385
|
if agents.is_file() and claude.is_file() and not claude_delegates:
|
|
349
386
|
conflict_findings.append("AGENTS.md and CLAUDE.md both exist; verify they do not diverge")
|
|
@@ -66,6 +66,12 @@ VERIFICATION_WORD_RE = re.compile(
|
|
|
66
66
|
)
|
|
67
67
|
|
|
68
68
|
|
|
69
|
+
# The file-walk helpers below are deliberately duplicated in
|
|
70
|
+
# skills/check/scripts/audit_signals.py. Both scripts ship standalone
|
|
71
|
+
# (see packaging.allowlist) and run inside an arbitrary target project, so
|
|
72
|
+
# they import only stdlib. Do not hoist them into a shared scripts/
|
|
73
|
+
# module: it is dev-only, not on the ship allowlist, and would couple a
|
|
74
|
+
# standalone tool to the install layout.
|
|
69
75
|
def rel(path: Path, root: Path) -> str:
|
|
70
76
|
try:
|
|
71
77
|
return path.resolve().relative_to(root).as_posix()
|
|
@@ -8,7 +8,7 @@
|
|
|
8
8
|
# python3 not on PATH -> MCP/hooks/allowedTools sections print "(unavailable)"; do not flag those areas
|
|
9
9
|
# settings.local.json absent -> hooks, MCP, allowedTools all show "(unavailable)"; normal for global-settings-only projects
|
|
10
10
|
# MEMORY.md path -> built via sed on pwd; unusual chars produce wrong project key; verify manually if (none) seems wrong
|
|
11
|
-
# Conversation scope ->
|
|
11
|
+
# Conversation scope -> 2 most recent PREVIOUS .jsonl sampled (live session skipped); fewer than 2 = [LOW CONFIDENCE]
|
|
12
12
|
# MCP token estimate -> assumes ~25 tools/server, ~200 tokens/tool; treat as directional, not precise
|
|
13
13
|
# Tier misclassification -> .next/, __pycache__, .turbo/ can inflate file count; recheck manually if tier feels wrong
|
|
14
14
|
set -euo pipefail
|
|
@@ -293,9 +293,10 @@ sample_jsonl_prefix() {
|
|
|
293
293
|
' "$file"
|
|
294
294
|
}
|
|
295
295
|
|
|
296
|
-
|
|
297
|
-
|
|
298
|
-
|
|
296
|
+
# Shared jq filter: collapse one transcript record to a single trimmed text
|
|
297
|
+
# line, dropping meta and tool-result noise. Defined once and prepended to both
|
|
298
|
+
# extract_* programs below so the flattening logic lives in exactly one place.
|
|
299
|
+
JQ_FLATTEN='
|
|
299
300
|
def flatten:
|
|
300
301
|
if (.isMeta // false) or (.toolUseResult? != null) then
|
|
301
302
|
empty
|
|
@@ -311,6 +312,11 @@ extract_messages_from_file() {
|
|
|
311
312
|
| sub("^ "; "")
|
|
312
313
|
| sub(" $"; "")
|
|
313
314
|
end;
|
|
315
|
+
'
|
|
316
|
+
|
|
317
|
+
extract_messages_from_file() {
|
|
318
|
+
local file="$1"
|
|
319
|
+
sample_jsonl_prefix "$file" | jq -r "$JQ_FLATTEN"'
|
|
314
320
|
(.type // .role // "") as $kind
|
|
315
321
|
| (flatten) as $text
|
|
316
322
|
| if ($text | length) == 0 then
|
|
@@ -329,22 +335,7 @@ extract_messages_from_file() {
|
|
|
329
335
|
|
|
330
336
|
extract_signals_from_file() {
|
|
331
337
|
local file="$1"
|
|
332
|
-
sample_jsonl_prefix "$file" | jq -r '
|
|
333
|
-
def flatten:
|
|
334
|
-
if (.isMeta // false) or (.toolUseResult? != null) then
|
|
335
|
-
empty
|
|
336
|
-
else
|
|
337
|
-
(.message.content // .content // .text // "")
|
|
338
|
-
| if type == "array" then
|
|
339
|
-
[ .[] | if type == "object" and .type == "text" then .text elif type == "string" then . else empty end ] | join(" ")
|
|
340
|
-
elif type == "string" then .
|
|
341
|
-
else empty
|
|
342
|
-
end
|
|
343
|
-
| gsub("[\\r\\n]+"; " ")
|
|
344
|
-
| gsub(" +"; " ")
|
|
345
|
-
| sub("^ "; "")
|
|
346
|
-
| sub(" $"; "")
|
|
347
|
-
end;
|
|
338
|
+
sample_jsonl_prefix "$file" | jq -r "$JQ_FLATTEN"'
|
|
348
339
|
def is_correction:
|
|
349
340
|
test("(?i)(\\bdon'\''t\\b|\\bdo not\\b|\\bplease don'\''t\\b|\\binstead\\b|\\bnext time\\b|\\bremember\\b|\\buse\\b.*\\binstead\\b|\\bnot\\b.*\\bbut\\b)")
|
|
350
341
|
or test("(不要再|请不要|不要|别再|下次|记得|改成|改为|而不是|别用|去掉|统一成)");
|
package/skills/hunt/SKILL.md
CHANGED
|
@@ -11,6 +11,13 @@ Prefix your first line with 🥷 inline, not as its own paragraph.
|
|
|
11
11
|
|
|
12
12
|
A patch applied to a symptom creates a new bug somewhere else.
|
|
13
13
|
|
|
14
|
+
## Outcome Contract
|
|
15
|
+
|
|
16
|
+
- Outcome: the root cause is identified before any fix is applied.
|
|
17
|
+
- Done when: one sentence explains the cause, every observed symptom fits it, and the fix or handoff is verified against a reproducible check.
|
|
18
|
+
- Evidence: source trace, repro command or UI path, logs or state, targeted test/build output, and runtime evidence for UI or native defects.
|
|
19
|
+
- Output: root cause, fix or handoff, verification result, and any unswept sibling risks.
|
|
20
|
+
|
|
14
21
|
**Do not touch code until you can state the root cause in one sentence:**
|
|
15
22
|
> "I believe the root cause is [X] because [evidence]."
|
|
16
23
|
|
|
@@ -36,8 +43,11 @@ For `/hunt`, diagnostic constraints are `decision`, `preference`, and `principle
|
|
|
36
43
|
- **After three failed hypotheses, stop.** Use the Handoff format below to surface what was checked, what was ruled out, and what is unknown. Ask how to proceed.
|
|
37
44
|
- **Verify before claiming.** Never state versions, function names, or file locations from memory. Run `sw_vers` / `node --version` / grep first. No results = re-examine the path.
|
|
38
45
|
- **External tool failure: diagnose before switching.** When an MCP tool or API fails, determine why first (server running? API key valid? Config correct?) before trying an alternative.
|
|
46
|
+
- **System/tooling symptoms need a lower-layer baseline.** Before blaming the visible app, generated file, or top-level feature, measure the raw lower layer first: OS capture versus post-processing, runtime service versus UI, compiler/toolchain versus test assertion, network/API versus client handling. Retire hypotheses that the baseline disproves instead of circling them.
|
|
39
47
|
- **Pay attention to deflection.** When someone says "that part doesn't matter," treat it as a signal. The area someone avoids examining is often where the problem lives.
|
|
40
48
|
- **Visual/rendering bugs: static analysis first.** Trace paint layers, stacking contexts, and layer order in DevTools before adding console.log or visual debug overlays. Logs cannot capture what the compositor does. Only add instrumentation after static analysis fails.
|
|
49
|
+
- **Behavioral / lifecycle / async bugs: instrument first, not after failure.** Window lifecycle, event delivery, navigation, focus, timer, state-machine, and async-ordering bugs almost never yield to static reading alone. Do not wait for a failed fix to add logs. The moment your hypothesis involves "this callback fires before/after that one", "this state should be X when Y runs", or "this object should still be alive here", **add the log immediately as part of forming the hypothesis**, before writing any fix. A hypothesis without runtime evidence is a guess; two guesses in a row are the hard-stop signal. Distinguish these from visual-rendering bugs (compositor behavior needs DevTools, not logs) and from pure-logic bugs (wrong formula, off-by-one), where static analysis is sufficient.
|
|
50
|
+
- **Tuning magic numbers past round three: stop, unify.** When a spacing / sizing / threshold value has been adjusted three times and still looks wrong, the bug is structural, not numeric. Replace the N independent values with one named token (`Spacing.s4`, `--gap-content`, etc.) and verify the asymmetry was hiding a missing constraint. Asymmetry that survives tuning is structural; more tuning will not converge.
|
|
41
51
|
- **Fix the cause, not the symptom.** If the fix touches more than 5 files, pause and confirm scope with the user.
|
|
42
52
|
|
|
43
53
|
## Fix Scope Discipline
|
|
@@ -48,6 +58,7 @@ If the bug genuinely needs a refactor first (e.g. the cause cannot be addressed
|
|
|
48
58
|
|
|
49
59
|
Activate when: "以前是好的", "之前是好的", "used to work", "上一次提交还是对的", "broke after update", or the user remembers a specific good commit or version.
|
|
50
60
|
|
|
61
|
+
0. Protect the user's worktree first: run `git status --short --branch -uall`. If modified, staged, or untracked files exist, do not bisect in the current checkout. Create a temporary detached worktree from the same HEAD, run bisect there, then `git bisect reset` and remove the temporary worktree when done. If a temporary worktree is impossible, stop and ask for explicit cleanup/stash approval.
|
|
51
62
|
1. Find candidate good tag: `git tag --sort=-version:refname | head -10` or ask the user for the last known-good commit.
|
|
52
63
|
2. Define a non-interactive pass/fail test command before starting bisect. Bisect is worthless without a reproducible check.
|
|
53
64
|
3. Run: `git bisect start && git bisect bad HEAD && git bisect good <tag-or-hash>`
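The worktree guard in step 0 comes down to reading the output of `git status --short --branch -uall`. A minimal sketch of that decision, assuming the caller has already captured the command output as text (`worktree_is_dirty` is a hypothetical helper, not part of the skill):

```python
def worktree_is_dirty(status_output: str) -> bool:
    """True when short-format status output lists any change.

    With --branch, the first line starts with "## " and names the branch;
    every other non-empty line is a modified, staged, or untracked entry.
    """
    for line in status_output.splitlines():
        if line.startswith("## "):
            continue  # branch header, not a change
        if line.strip():
            return True
    return False
```

When this returns `True`, the step-0 rule applies: bisect in a temporary detached worktree instead of the current checkout, or stop and ask before stashing or cleaning anything.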
|
|
@@ -91,7 +102,7 @@ If the blast surfaces unrelated bugs, list them but do not fix in this PR unless
|
|
|
91
102
|
|
|
92
103
|
## Confirm or Discard
|
|
93
104
|
|
|
94
|
-
|
|
105
|
+
The instrument-first rule lives in Hard Rules (behavioral/async bugs) above; this is what to do with its result. Run the one probe that would fail if the hypothesis were wrong, then read it. If the evidence contradicts the hypothesis, discard it completely and re-orient on what the probe just showed. Do not stack a fix onto a disproven hypothesis, and do not keep one just because the code "looks like" the cause.
|
|
95
106
|
|
|
96
107
|
## Runtime Evidence Ladder
|
|
97
108
|
|
|
@@ -107,6 +118,26 @@ Compile-only is not enough for UI, native-app, visual, rendering, or generated-a
|
|
|
107
118
|
|
|
108
119
|
For recurring classes of failures, load `references/failure-patterns.md` before adding a second fix.
|
|
109
120
|
|
|
121
|
+
## Native App Freeze Mode
|
|
122
|
+
|
|
123
|
+
Activate when a desktop or mobile native app reports beachball, not responding, tab-switch freeze, first-open lag, idle wake stall, overlay lockup, or a screenshot shows a frozen app.
|
|
124
|
+
|
|
125
|
+
Evidence to collect before changing code:
|
|
126
|
+
|
|
127
|
+
1. Exact user path and version: first launch versus warm launch, the tab or window transition, idle duration, permissions, display count, and any setting that makes the freeze disappear.
|
|
128
|
+
2. Runtime capture while frozen: `sample <process>`, recent app logs, CPU and memory footprint, thread count, and whether the main thread is blocked, spinning, or allocating.
|
|
129
|
+
3. First-frame surface: view body work, first `.task`, synchronous icon or metadata lookup, filesystem scans, URL parent walks, notification callbacks, and app/window wake handlers.
|
|
130
|
+
4. Blast search after the fix: grep the same API shape across the repo, especially path parent walks, synchronous icon loading, metadata reads in render paths, and callbacks that run on the main thread.
|
|
131
|
+
|
|
132
|
+
Common native freeze traps:
|
|
133
|
+
|
|
134
|
+
- Launch, terminate, permission, audio, display, or workspace notifications doing path walks, icon lookup, filesystem scans, or process enumeration on the main thread.
|
|
135
|
+
- First paint hydrating a full app list, directory tree, media thumbnail set, or system status table before showing an interactive shell.
|
|
136
|
+
- An input-lock or full-screen overlay without a guaranteed teardown path for Escape, app deactivation, permission denial, process termination, and window close.
|
|
137
|
+
- Timer or sampler work that survives hidden windows, long idle periods, sleep/wake, or app reactivation.
|
|
138
|
+
|
|
139
|
+
Compile-only and source-only checks are insufficient for this mode. The outcome must include the runtime capture, the root-cause frame or state transition, the focused regression guard, and any sibling matches that were fixed or explicitly left safe.
|
|
140
|
+
|
|
110
141
|
## Targeted Logging
|
|
111
142
|
|
|
112
143
|
Use logs as a scalpel, not as noise. Before adding a log, write the question it answers:
|
|
@@ -128,6 +159,7 @@ If adding logs changes the behavior, treat that as evidence of a timing, lifecyc
|
|
|
128
159
|
|---------------|------|
|
|
129
160
|
| Patched client pane instead of local pane | Trace the execution path backward before touching any file |
|
|
130
161
|
| MCP not loading, switched tools instead of diagnosing | Check server status, API key, config before switching methods |
|
|
162
|
+
| Blamed the visible app before measuring the raw system/tooling layer | Measure the lower layer first, then retire ruled-out hypotheses explicitly |
|
|
131
163
|
| Orchestrator said RUNNING but TTS vendor was misconfigured | In multi-stage pipelines, test each stage in isolation |
|
|
132
164
|
| Race condition diagnosed as a stale-state bug | For timing-sensitive issues, inspect event timestamps and ordering before state |
|
|
133
165
|
| Added logs everywhere and still could not explain the bug | Rewrite each log as a yes/no question. Delete logs that do not rule a hypothesis in or out |
|
|
@@ -56,6 +56,51 @@ Checks:
|
|
|
56
56
|
- Reject paths outside the allowed root after symlink resolution.
|
|
57
57
|
- Reproduce from a non-default cwd and through any UI entry point that supplies paths.
|
|
58
58
|
|
|
59
|
+
## CLI Effect Scope Drift
|
|
60
|
+
|
|
61
|
+
Signals: preview, dry-run, size, count, or report output is computed from one predicate, but execution mutates a broader or different set.
|
|
62
|
+
|
|
63
|
+
Checks:
|
|
64
|
+
- Trace display, dry-run, and mutation predicates to the same source of truth.
|
|
65
|
+
- Compare planned paths or records with executor input in a regression test.
|
|
66
|
+
- Assert partial failures report the exact skipped and completed items.
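One way to keep preview and mutation on the same source of truth is to have the executor consume the plan instead of re-deriving its own selection. The sketch below is illustrative only: the `.cache` suffix rule and the function names are assumptions chosen for the example.

```python
from pathlib import PurePosixPath


def is_target(path: str) -> bool:
    """Single predicate shared by preview and execution (assumed rule)."""
    return PurePosixPath(path).suffix == ".cache"


def plan(paths):
    """What a dry-run would display."""
    return [p for p in paths if is_target(p)]


def execute(paths, delete):
    """Mutate exactly the planned set and report per-item outcomes."""
    completed, skipped = [], []
    for path in plan(paths):  # executor input is the plan itself
        try:
            delete(path)
            completed.append(path)
        except OSError:
            skipped.append(path)  # partial failure is reported, not hidden
    return completed, skipped
```

A regression test can then assert that `completed + skipped` equals `plan(paths)` item for item, which is the comparison the checks above ask for.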
|
|
67
|
+
|
|
68
|
+
## CLI Wrapper Or PATH Drift
|
|
69
|
+
|
|
70
|
+
Signals: source-tree invocation works, but the installed command, package wrapper, PATH shim, completion, or package-manager install path runs old code or a different binary.
|
|
71
|
+
|
|
72
|
+
Checks:
|
|
73
|
+
- Inspect built package contents, shebang, executable bit, and wrapper target.
|
|
74
|
+
- Reproduce through a temp prefix or package-manager install path, not only from source.
|
|
75
|
+
- Check PATH order and use absolute system-tool paths where wrappers should not intercept.
|
|
76
|
+
|
|
77
|
+
## Interactive Stdin Or TTY Hang
|
|
78
|
+
|
|
79
|
+
Signals: CI stalls, spinner never finishes, a subprocess reads from the script body, or an auth prompt appears in non-interactive mode.
|
|
80
|
+
|
|
81
|
+
Checks:
|
|
82
|
+
- Reproduce with stdin redirected and with TTY/non-TTY paths separated.
|
|
83
|
+
- Add test-mode or no-auth guards around real prompts and system changes.
|
|
84
|
+
- Stub external prompt tools through PATH when timeout wrappers exec real binaries.
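A hang on stdin is easiest to reproduce by running the command with stdin detached and a hard timeout. The following sketch uses only the standard library; `run_noninteractive` and the 5-second default are illustrative assumptions.

```python
import subprocess
import sys


def run_noninteractive(cmd, timeout=5.0):
    """Run cmd with stdin closed off and a timeout.

    A child that tries to prompt sees immediate EOF instead of blocking;
    a child that still stalls raises subprocess.TimeoutExpired.
    """
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,
    )


if __name__ == "__main__":
    probe = "import sys; print('tty' if sys.stdin.isatty() else 'no-tty')"
    print(run_noninteractive([sys.executable, "-c", probe]).stdout.strip())
```

Running the same command once through this wrapper and once from a real terminal separates the TTY and non-TTY paths that the first check asks for.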
|
|
85
|
+
|
|
86
|
+
## Signal Or Partial-Failure Mapping
|
|
87
|
+
|
|
88
|
+
Signals: cancel, timeout, SIGINT, or SIGTERM is reported as success or as a normal business failure; temp files, locks, or operation logs make retries look complete.
|
|
89
|
+
|
|
90
|
+
Checks:
|
|
91
|
+
- Classify interrupted execution separately from success and expected validation failures.
|
|
92
|
+
- Assert temp cleanup, lock release, and operation-log state after interruption.
|
|
93
|
+
- Test retry and idempotency after a partial write.
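Classifying interruption separately can start from the exit status alone. This sketch relies on two conventions: Python's `subprocess` reports a child killed by signal N as return code `-N` on POSIX, and shells report it as `128 + N`. The set of validation exit codes is a made-up example and would come from the CLI's own contract.

```python
import signal


def classify_exit(returncode: int, validation_codes=frozenset({2})) -> str:
    """Map an exit status to success, interrupted, validation_failure, or error.

    `validation_codes` is a hypothetical set of exit codes the CLI uses
    for expected validation failures.
    """
    if returncode == 0:
        return "success"
    if returncode < 0:
        return "interrupted"  # subprocess convention: killed by a signal
    if returncode in (128 + signal.SIGINT, 128 + signal.SIGTERM):
        return "interrupted"  # shell convention: 130 and 143
    if returncode in validation_codes:
        return "validation_failure"
    return "error"
```

Keeping `interrupted` as its own class gives cleanup assertions (temp files, locks, operation logs) a clear trigger, instead of folding cancellation into either success or ordinary failure.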
|
|
94
|
+
|
|
95
|
+
## CLI Stream Contract Regression
|
|
96
|
+
|
|
97
|
+
Signals: automation breaks after human logs, progress output, JSON shape, stdout/stderr routing, or exit-code behavior changes.
|
|
98
|
+
|
|
99
|
+
Checks:
|
|
100
|
+
- Assert exit code, stdout, and stderr separately in CLI tests.
|
|
101
|
+
- Keep human diagnostics off stdout for machine-readable modes.
|
|
102
|
+
- Snapshot or parse JSON/schema output and include non-interactive coverage.
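A stream-contract test asserts the three channels independently. The sketch below stands in a tiny inline child process for the real CLI; the child script, its JSON shape, and exit code 3 are all invented for the example.

```python
import json
import subprocess
import sys

# Stand-in for a CLI in machine-readable mode: diagnostics on stderr,
# JSON on stdout, a non-zero exit code to signal findings.
CHILD = (
    "import json, sys\n"
    "print('scanning...', file=sys.stderr)\n"
    "print(json.dumps({'count': 2}))\n"
    "sys.exit(3)\n"
)


def run_cli():
    """Run the stand-in CLI non-interactively and return (code, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    code, out, err = run_cli()
    assert code == 3                         # exit code is part of the contract
    assert json.loads(out) == {"count": 2}   # stdout parses as JSON, nothing else
    assert "scanning" in err                 # human diagnostics stay on stderr
```

Parsing stdout with `json.loads` rather than comparing strings means any stray progress line on stdout fails the test immediately.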
|
|
103
|
+
|
|
59
104
|
## Snapshot Rebuild Drops Carried Field
|
|
60
105
|
|
|
61
106
|
Signals: live data shows up at the data source and on the wire but a downstream view sees it empty; the field has a default value (`var x: [T] = []`, `var y: Int? = nil`) that lets memberwise init compile without it; the symptom appears only on the path where the snapshot is rebuilt (icon resolution, decoration, redaction), not on a fresh fetch.
|
|
@@ -73,3 +118,12 @@ Checks:
|
|
|
73
118
|
- Read the tool's man page for cold-start semantics. `top -l 2`, `iostat -d 2`, `vm_stat 1 2`, etc. all share this shape.
|
|
74
119
|
- Slice the output to the latest sample (`.suffix(perSampleSize)` on parsed lines, or look for the second instance of the header row).
|
|
75
120
|
- When in doubt, raise `-l` to 3 and confirm sample 2 and 3 agree; sample 1 stays zero.
|
|
121
|
+
|
|
122
|
+
## Aggregation Key Variant
|
|
123
|
+
|
|
124
|
+
Signals: a count, log roll-up, event tally, or per-category breakdown is short by some entries; the missing items share a trait (a system-derived path, a localized string, a prefixed command name); the base-form key matches but a derived variant (`<base>-system`, a suffix, a prefix) is silently dropped.
|
|
125
|
+
|
|
126
|
+
Checks:
|
|
127
|
+
- Before adding a category, grep every write site that produces this class of key and enumerate the real variants, not just the base form.
|
|
128
|
+
- Match with `hasPrefix` / a regex / an explicit variant list rather than exact equality on the base key.
|
|
129
|
+
- Add a fixture row for each known variant so a future key shape that escapes the matcher fails the test instead of the aggregate.
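The matcher described in these checks can be sketched as a tally that accepts both the base key and `<base>-<suffix>` variants, and surfaces anything it could not place. The function name and the hyphen separator are assumptions; a real codebase would use whatever variant shapes its write sites actually produce.

```python
from collections import Counter


def tally_by_base(keys, bases):
    """Count keys per base category, accepting `<base>-<suffix>` variants.

    Returns (counts, unmatched) so that a new key shape shows up in
    `unmatched` instead of silently shrinking the aggregate.
    """
    ordered = sorted(bases, key=len, reverse=True)  # longest base wins
    counts = Counter()
    unmatched = []
    for key in keys:
        base = next(
            (b for b in ordered if key == b or key.startswith(b + "-")),
            None,
        )
        if base is None:
            unmatched.append(key)
        else:
            counts[base] += 1
    return dict(counts), unmatched
```

A fixture with one row per known variant, plus an assertion that `unmatched` is empty, turns a future escaped key shape into a test failure.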
|