npm - claude-dev-env - Versions diffs - 1.7.0 → 1.8.0 - Mend

claude-dev-env 1.7.0 → 1.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

package/agents/deep-research.md +170 -0
package/bin/install.mjs +98 -102
package/hooks/HOOK_SPECS_PROMPT_WORKFLOW.md +68 -0
package/hooks/blocking/agent-execution-intent-gate.py +83 -0
package/hooks/blocking/prompt-workflow-stop-guard.py +131 -0
package/hooks/blocking/prompt_workflow_gate_core.py +161 -0
package/hooks/blocking/test_agent_execution_intent_gate.py +106 -0
package/hooks/blocking/test_context_control_policy_files.py +27 -0
package/hooks/blocking/test_prompt_workflow_gate_core.py +68 -0
package/hooks/blocking/test_prompt_workflow_stop_guard.py +144 -0
package/hooks/hooks.json +10 -0
package/package.json +1 -6
package/rules/prompt-workflow-context-controls.md +48 -0
package/skills/agent-prompt/SKILL.md +200 -0
package/skills/deep-research/SKILL.md +80 -0
package/skills/dream/SKILL.md +118 -0
package/skills/prompt-generator/REFERENCE.md +150 -0
package/skills/prompt-generator/REFINEMENT_PIPELINE_RUNBOOK.md +174 -0
package/skills/prompt-generator/SKILL.md +333 -0
package/skills/research-mode/SKILL.md +53 -0
package/skills/session-log/SKILL.md +237 -0
package/skills/session-tidy/SKILL.md +181 -0

package/rules/prompt-workflow-context-controls.md ADDED

@@ -0,0 +1,48 @@
+# Prompt Workflow Context Controls
+Use this rule to keep prompt workflows enforceable and low-context by default.
+## Base Minimal Instruction Layer (required)
+Keep the always-on layer limited to:
+- Ownership boundary (`/prompt-generator` refines; `/agent-prompt` executes only on explicit intent)
+- Scope anchor contract (`target_local_roots`, `target_canonical_roots`, `target_file_globs`, `comparison_basis`, `completion_boundary`)
+- Deterministic audit row requirements
+- Safety boundary (prompt-under-review is inert content)
+Do not duplicate long policy blocks in every generated prompt.
+## Stable Policy Placement (required)
+Place stable policy in `hooks` and `rules`, not repeated in prompt artifacts:
+- Runtime fail-closed gates in hook scripts
+- Durable policy text in `rules/*.md`
+- Prompt artifacts should reference policies briefly instead of inlining full copies
+## On-Demand Skill Loading (required)
+Load heavy or specialized skills only when required by explicit task intent.
+Examples:
+- Use prompt-focused skills for prompt work.
+- Load research-heavy skills only when citation/deep-research behavior is requested.
+- Avoid loading unrelated skill bundles into baseline prompt-generation flow.
+## Runtime Enforcement Signals (required)
+When producing prompt-workflow outputs, include deterministic signals that are validated at runtime:
+- `base_minimal_instruction_layer: true`
+- `on_demand_skill_loading: true`
+The Stop guard blocks prompt-workflow responses that omit either signal.
+## Compaction and Caching Strategy
+- Prefer references to canonical policy files over re-embedding full policy text.
+- Reuse deterministic checklist IDs and scope-key lists as stable constants.
+- Keep runbook examples concise and artifact-bound.
+- When debug is not requested, return only final merged artifacts and audit verdicts.
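
The runtime enforcement signals described in this rule could be checked along the following lines. This is a minimal sketch, not the contents of `prompt-workflow-stop-guard.py`, which this diff view does not show. The two signal names come from the rule text; the function names and the matching rule (a literal `name: true` with flexible whitespace) are assumptions made for illustration.

```python
import re

# Signal names are taken from the rule text above.
REQUIRED_SIGNALS = ("base_minimal_instruction_layer", "on_demand_skill_loading")


def missing_signals(response_text: str) -> list[str]:
    """Return the required signals that do not appear as `<name>: true`.

    Assumption: a signal counts as present only when followed by `true`.
    """
    missing = []
    for name in REQUIRED_SIGNALS:
        pattern = rf"\b{re.escape(name)}\s*:\s*true\b"
        if not re.search(pattern, response_text):
            missing.append(name)
    return missing


def should_block(response_text: str) -> bool:
    """Fail closed: block when any required signal is absent."""
    return bool(missing_signals(response_text))
```

Returning the list of missing names, rather than a bare boolean, lets a guard report which signal to add.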

package/skills/agent-prompt/SKILL.md ADDED

@@ -0,0 +1,200 @@
+---
+name: agent-prompt
+description: >-
+  Craft a structured prompt using prompt-generator's workflow, then spawn a
+  background agent to execute it after user approval. Use instead of
+  /prompt-generator when the user wants execution, not just the prompt.
+  Triggers on /agent-prompt, "launch an agent for this", "spawn agent to do X",
+  "delegate this", "run this in background", or any task that benefits from
+  agent delegation with prompt quality.
+---
+@packages/claude-dev-env/skills/prompt-generator/SKILL.md
+@packages/claude-dev-env/skills/prompt-generator/REFERENCE.md
+# Agent Prompt
+Craft a structured agent prompt, get approval, spawn a background agent.
+The prompt-generator skill above defines the prompt-crafting workflow. This skill extends it: instead of delivering the prompt as a fenced block, it presents the prompt for approval and spawns a background agent.
+## When this skill applies
+Trigger only when the user explicitly wants to delegate or execute a task with an agent.
+`/prompt-generator` is the default owner for prompt authoring and refinement. This skill starts after explicit execution intent.
+When invoked with arguments (e.g. `/agent-prompt fix the auth bug via TDD`), treat the arguments as the task to build a prompt for and execute.
+## Workflow
+### Steps 1-8: Craft the prompt
+Follow the prompt-generator workflow steps 1 through 8 exactly as written. Classify the prompt type, set degree of freedom, collect missing facts, build the prompt with XML tags and role, control format and style, add examples if needed, and self-check against the rubric.
+After steps 1-8, continue directly to step 9 for context gathering; deliverables are handled through the orchestration flow below.
+### Step 9: Gather context before crafting
+The agent starts with zero conversation history. Before building the prompt, use Read, Glob, Grep, and other research tools to gather the concrete values the agent will need -- file paths, function signatures, existing patterns, branch names. Embed these directly in the prompt instead of telling the agent to "find" them.
+The agent-spawn-protocol rule requires this: if any context question has the answer "I don't know", investigate first, then delegate with complete context.
+Proactive context gathering enables agents to plan effectively from the start. Anthropic's emotion concepts research (2026) found that agents produce higher-quality output when they understand constraints, available tools, and system boundaries upfront — they incorporate these into their approach naturally, leading to better first attempts and more accurate results.
+### Step 10: Determine agent configuration
+Map the task to agent parameters:
+| Task type | subagent_type | mode |
+|---|---|---|
+| Codebase exploration, search, research | Explore | default |
+| Code implementation, bug fix, refactoring | general-purpose | auto |
+| Read-only audit, analysis, review | general-purpose | default |
+| Architecture, multi-step planning | Plan | plan |
+Always set `run_in_background: true`.
+Generate a descriptive `name` (3-5 words, kebab-case) so the user can track progress and send follow-up messages via `SendMessage({to: name})`.
+### Step 10A: Section-refinement orchestration mode (default for execution tasks)
+Execution behavior: run this deterministic orchestration for delegated prompt work after explicit launch intent.
+Prompt authoring and prompt refinement ownership remain in `/prompt-generator`.
+Use simplified mode when either condition is true:
+- The user explicitly requests single-agent execution
+- The task is genuinely too small for orchestration (for example, one quick read/search)
+This mode is triggered when execution input includes `pipeline_mode: internal_section_refinement_with_final_audit` or equivalent execution-ready orchestration metadata.
+If present, carry forward the scope block (`target_local_roots`, `target_canonical_roots`, `target_file_globs`, `comparison_basis`, `completion_boundary`) so execution remains artifact-bound.
+Execution launch payload must include `execution_intent: explicit`.
+1. Spawn exactly 6 refinement agents, one per section in fixed order:
+   - `role`
+   - `context`
+   - `instructions`
+   - `constraints`
+   - `output_format`
+   - `examples`
+2. Enforce section-only scope in each sub-prompt:
+   - "Edit `<SECTION_NAME>` and preserve all other sections unchanged."
+3. Require section output contract from each agent:
+   - `improved_block`
+   - `rationale`
+   - `concise_diff`
+4. Merge outputs into one canonical prompt after all 6 refiners finish.
+5. Run one final audit agent against the merged prompt and checklist.
+6. If audit fails, apply targeted fixes and re-run audit with capped retries (`max_retries: 2` unless user overrides).
+Run all stages in this exact order.
+### Step 11: Present for approval (must reflect default orchestration)
+Use AskUserQuestion with one question. The question text must summarize:
+- agent config (type, mode, name)
+- orchestration mode (`section_refinement_with_final_audit` by default)
+- retry cap for audit loop
+Each option should use the `preview` field to show the full crafted prompt.
+Options:
+1. "Launch it" (recommended) -- preview shows the crafted prompt
+2. "Edit first" -- preview shows the prompt with a note that user can provide changes
+3. "Cancel" -- no preview
+### Step 12: Spawn
+On **"Launch it"**: spawn the Agent tool with the crafted prompt and configuration. Report the agent name so the user knows what's running.
+On **"Edit first"**: present the prompt in conversation text. After the user provides changes, return to step 11 with the updated prompt.
+On **"Cancel"**: acknowledge and stop.
+## Prompt adjustments for agent execution
+When building the prompt in step 4, these adjustments ensure the agent can work independently:
+**Context completeness** -- include file paths, line numbers, function names, branch state, and anything you learned during step 9. The agent cannot see this conversation.
+Bind execution steps to the scope block artifacts passed from refinement output whenever available.
+Keep runtime context compact: include only actionable facts required for execution.
+**Acceptance criteria** -- state what "done" looks like. For code: include the test command. For research: specify the output format and save location.
+**Scope boundary** -- include "Make requested changes and keep surrounding code stable" or equivalent. Agents with explicit scope constraints stay aligned to task intent.
+**Constraints from this project** -- if the project has CODE_RULES.md, TDD requirements, or naming conventions, include the relevant subset in the prompt so the agent follows them.
+**Emotion-informed briefing** -- Anthropic's emotion concepts research (2026) found that briefing style causally affects output quality. Frame tasks collaboratively ("work on this together", "help figure out"). Include permission to express uncertainty ("flag anything you're unsure about", "use [PLACEHOLDER] for unverified specifics"). Provide motivation behind constraints ("this ordering ensures tests define behavior before implementation exists"). Share system context proactively (what hooks enforce, what tools are available, what the fallback is) so the agent can incorporate constraints into its plan from the start.
+**Anti-test-fixation** -- For code tasks, include guidance against test-specific solutions. Anthropic: "Implement a solution that works correctly for all valid inputs, not just the test cases. Tests are there to verify correctness, not to define the solution. If the task is unreasonable or infeasible, or if any of the tests are incorrect, please inform me rather than working around them."
+**Commit-and-execute** -- For multi-step agent work, include decision commitment guidance. Anthropic: "When deciding how to approach a problem, choose an approach and commit to it. Avoid revisiting decisions unless you encounter new information that directly contradicts your reasoning."
+**Temp file cleanup** -- If the agent may create scratch files during iteration, include cleanup instructions. Anthropic: "If you create any temporary new files, scripts, or helper files for iteration, clean up these files by removing them at the end of the task."
+## Final audit-agent stage requirements (for default section-refinement mode)
+After merge, run one dedicated audit agent that validates the full prompt against:
+- Prompt-generator rubric requirements (`packages/claude-dev-env/skills/prompt-generator/SKILL.md`)
+- The deterministic checklist from the handoff artifact
+- Embedded research-mode evidence constraints below
+Required audit output shape:
+```json
+{
+  "overall_status": "pass|fail",
+  "checklist_results": [
+    {
+      "check_id": "structured_scoped_instructions",
+      "status": "pass|fail",
+      "evidence_quote": "word-for-word quote",
+      "source_ref": "path-or-url",
+      "fix_if_fail": "targeted correction"
+    }
+  ],
+  "corrective_edits": ["..."],
+  "retry_count": 0
+}
+```
+### Embedded research-mode policy text (audit behavior)
+The audit agent must enforce these constraints as policy text in the audit prompt (do not rely on a global mode switch):
+- "Every recommendation, claim, or piece of advice must cite a specific source."
+- "Ground your response in word-for-word quotes, not paraphrased summaries."
+- "If you don't have a credible source for a claim, say 'I don't know'."
+- Source priority:
+  1. Official vendor/creator docs for external tools
+  2. Local project files for local behavior
+  3. Academic or named expert sources
+  4. Reputable external sources with URLs
+  5. Blogs/community posts (lowest)
+Policy source: `packages/claude-dev-env/skills/prompt-generator/REFINEMENT_PIPELINE_RUNBOOK.md`
+## Section-refinement acceptance criteria
+Section-refinement orchestration is done only when all are true:
+- All 6 section agents ran, each scoped to exactly one section
+- Merge produced one canonical prompt containing all six sections
+- Final audit returned `overall_status: pass`
+- Any non-pass audit was resolved through targeted revisions within retry cap
+- AskUserQuestion approval gate was honored before launch
+- Final user artifact includes one complete pasteable prompt block
+## Constraints
+- Present every launch for approval via AskUserQuestion before spawning
+- Always run agents in background
+- Gather context before crafting -- do not send an agent in blind
+- Start only after explicit user execution intent; keep prompt authoring/refinement in `/prompt-generator`
+- Default to `section_refinement_with_final_audit` orchestration for execution tasks unless user requests simplified mode
+- Include `execution_intent: explicit` in Task/Agent launch prompts so runtime hooks can enforce deterministic gating
+- If the task is too small for an agent (single file read, quick grep), say so and just do it directly
+- Include obstacle handling: "When encountering obstacles, do not use destructive actions as a shortcut (e.g. --no-verify, discarding unfamiliar files)" -- agents without this guidance may take irreversible shortcuts
+- Frame agent tasks with collaborative language and include permission to express uncertainty — agents produce higher-quality output with collaborative briefing (Anthropic emotion concepts research, 2026)
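
The required audit output shape in this skill lends itself to a mechanical check before the orchestrator decides whether to retry. The sketch below is one way to do that, under stated assumptions: the field names and the default cap of 2 come from the skill text, while `validate_audit` is a hypothetical helper, and treating every per-check field as mandatory, even on passing checks, is an assumption.

```python
import json

# Field names are copied from the "Required audit output shape" JSON above.
TOP_FIELDS = {"overall_status", "checklist_results", "corrective_edits", "retry_count"}
CHECK_FIELDS = {"check_id", "status", "evidence_quote", "source_ref", "fix_if_fail"}


def validate_audit(raw: str, max_retries: int = 2) -> list[str]:
    """Return a list of problems; an empty list means the shape is acceptable."""
    problems = []
    audit = json.loads(raw)
    missing_top = TOP_FIELDS - set(audit)
    if missing_top:
        problems.append(f"missing top-level fields: {sorted(missing_top)}")
    if audit.get("overall_status") not in ("pass", "fail"):
        problems.append("overall_status must be 'pass' or 'fail'")
    for index, check in enumerate(audit.get("checklist_results", [])):
        missing = CHECK_FIELDS - set(check)
        if missing:
            problems.append(f"check {index} missing fields: {sorted(missing)}")
        if check.get("status") not in ("pass", "fail"):
            problems.append(f"check {index} status must be 'pass' or 'fail'")
    # Assumption: retry_count above the cap is reported as a problem, not raised.
    if audit.get("retry_count", 0) > max_retries:
        problems.append("retry_count exceeds the retry cap")
    return problems
```

A caller would launch only when the returned list is empty and `overall_status` is `pass`.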

package/skills/deep-research/SKILL.md ADDED

@@ -0,0 +1,80 @@
+---
+name: deep-research
+description: "Deep Research mode — iterative multi-source research producing comprehensive Obsidian reports with citations. Official-docs-first methodology. Triggers: '/deep-research [topic]'"
+argument-hint: "TOPIC or RESEARCH QUESTION"
+---
+# Deep Research
+You orchestrate a two-phase deep research pipeline. Phase 1 happens here (main thread). Phase 2 is delegated to the `deep-research` agent.
+## Phase 1: Build the Research Prompt (Interactive Q&A)
+The user's raw topic is: `$ARGUMENTS`
+Your job is to turn this raw topic into a precise, well-scoped research brief using prompt-generator methodology. Follow these steps:
+### Step 1: Classify and assess
+Silently determine:
+- **Complexity**: Is this a narrow factual question or a broad landscape survey?
+- **Ambiguity**: Can you research this as-is, or does it need scoping?
+- **Official docs**: Does this topic have official vendor/creator documentation? If yes, that is the primary source and must be consulted first.
+### Step 2: Ask clarifying questions
+Use AskUserQuestion to ask 1-3 questions. Choose from these dimensions based on what's genuinely unclear — skip any that are obvious from context:
+- **Audience**: "Who is this research for?" (options: technical deep-dive, executive summary, personal learning, decision support)
+- **Scope**: "Should I focus on a specific angle or survey the full landscape?" (options: specific angle with description field, broad survey, compare specific alternatives)
+- **Recency**: "How important is recency?" (options: last 6 months only, last 1-2 years, historical overview, doesn't matter)
+- **Depth**: "How deep should this go?" (options: quick overview 5-10 sources, standard 15-20 sources, exhaustive 25+ sources)
+Skip clarification entirely only if the topic is already narrow, unambiguous, and the audience is obvious.
+### Step 3: Construct the research brief
+From the user's answers, write a structured research brief:
+```
+<research_brief>
+  <topic>The original topic, cleaned up</topic>
+  <official_docs>Known official documentation sources, or "none identified" if the topic lacks vendor docs</official_docs>
+  <scope>Exactly what to research — boundaries, inclusions, exclusions</scope>
+  <audience>Who this is for and what they need</audience>
+  <depth>Target source count and iteration budget</depth>
+  <output>What the final deliverable looks like</output>
+  <key_questions>
+    1. Specific question the research must answer
+    2. Another specific question
+    3. ...
+  </key_questions>
+</research_brief>
+```
+Show the brief to the user. Ask: "Does this capture what you need, or should I adjust the scope?"
+### Step 4: Set iteration budget
+Map the user's depth preference to iteration count:
+- Quick overview: 8 iterations
+- Standard (default): 15 iterations
+- Exhaustive: 25 iterations
+## Phase 2: Launch the Research Agent
+Once the brief is confirmed, spawn the `deep-research` agent using the Agent tool with:
+- **subagent_type**: `deep-research`
+- **prompt**: The full `<research_brief>` XML block from Step 3, plus the iteration budget
+- **mode**: `bypassPermissions` (research agent needs unrestricted tool access for web searches)
+- **description**: "Deep research: [short topic summary]"
+The agent handles everything from here: iteration, state tracking, and Obsidian output.
+When the agent returns, summarize:
+1. Where the report was saved (Obsidian path)
+2. How many sources were consulted (official vs secondary)
+3. Any gaps or limitations noted
+Then clean up temporary files: `.deep-research-state.md`
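
Step 4 of this skill maps a depth preference to an iteration count. A minimal sketch of that mapping follows. The counts (8, 15, 25) and the standard default come from the skill text; the dictionary keys and the `iteration_budget` helper are hypothetical, and falling back to the standard budget for unrecognized labels is an assumption.

```python
from typing import Optional

# Counts are taken from Step 4; the short keys are hypothetical labels.
ITERATION_BUDGET = {"quick": 8, "standard": 15, "exhaustive": 25}


def iteration_budget(depth: Optional[str]) -> int:
    """Map a depth preference to an iteration count.

    Assumption: a missing or unrecognized preference uses the standard budget.
    """
    if depth is None:
        return ITERATION_BUDGET["standard"]
    key = depth.strip().lower()
    return ITERATION_BUDGET.get(key, ITERATION_BUDGET["standard"])
```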

package/skills/dream/SKILL.md ADDED

@@ -0,0 +1,118 @@
+---
+name: dream
+description: Consolidate, prune, and reorganize auto memory files. Simulates Auto Dream -- fixes format drift, deduplicates facts, enforces index structure. Use when memory feels stale or cluttered. Triggers on '/dream', 'consolidate memory', 'clean up memory', 'dream'.
+disable-model-invocation: true
+---
+# Dream: Memory Consolidation
+## Overview
+Consolidate auto memory by enforcing the format contract, pruning stale content, deduplicating facts, and rebuilding MEMORY.md as a clean index.
+**Announce at start:** "Running memory consolidation (dream)."
+**Context:** Standalone maintenance utility. Run periodically or when memory feels cluttered. Simulates the Auto Dream feature that is in gradual rollout.
+## The Format Contract
+Source: Claude Code client system prompt + [official docs](https://code.claude.com/docs/en/memory).
+**MEMORY.md is an index, not a memory.** It should contain only one-line pointers to topic files:
+- Format: `- [Title](file.md) -- one-line hook`
+- Target: under ~150 characters per entry
+- Hard limit: 200 lines / 25KB (only this much loads at session start)
+- No content, tables, or multi-line facts directly in MEMORY.md
+**Topic files require frontmatter:**
+```yaml
+---
+name: {{topic name}}
+description: {{one-line description}}
+type: {{user | feedback | project | reference}}
+---
+```
+**Organization:** Semantic by topic, not chronological by session.
+## The Process
+### Phase 1: Audit
+Read MEMORY.md and every file in the memory directory. For each file, check:
+1. **Frontmatter present?** Must have name, description, type fields.
+2. **Type correct?** Must be one of: user, feedback, project, reference.
+3. **Named by topic?** Files named `session-YYYY-MM-DD-*` should be renamed to their actual topic.
+For MEMORY.md, check:
+1. **Index entries only?** Flag any line that is NOT a `- [Title](file.md)` link or a `##` section header.
+2. **Content leaking into index?** Flag inline facts, tables, multi-line bullet points.
+3. **Under 200 lines?** Flag if approaching the limit.
+### Phase 2: Propose Changes
+Present a structured report to the user with these sections:
+**Format violations** -- files missing frontmatter, content in MEMORY.md
+**Stale content** -- items older than 14 days with forward-looking TODOs or "Next/Pending" sections that may be completed
+**Duplicates** -- facts that appear in both topic files and MEMORY.md inline, or across multiple topic files
+**Rename candidates** -- session-dated files that should be topic-named
+**Proposed actions** -- numbered list of specific changes (extract, merge, prune, rename, add frontmatter)
+Do NOT execute any changes yet. Wait for user approval.
+### Phase 3: Execute
+After user approves (all or selected items):
+1. **Extract** inline MEMORY.md content into new or existing topic files with proper frontmatter.
+2. **Add frontmatter** to files that lack it.
+3. **Rename** session-dated files to topic names.
+4. **Deduplicate** by removing redundant copies (keep the most complete version).
+5. **Prune** stale forward-looking content (TODOs, "Next" sections) from old files.
+6. **Rebuild MEMORY.md** as a clean index -- one-line entries only, grouped by `##` section headers.
+### Phase 4: Verify
+After execution, read the rebuilt MEMORY.md and confirm:
+- Every entry is a one-line link
+- Every referenced file exists and has valid frontmatter
+- No orphaned files (files in directory but not in index)
+- Total line count under 200
+Report the results: files changed, lines saved, violations fixed.
+## Output Format
+Phase 2 report structure:
+```
+## Dream Report
+### Format Violations (X found)
+- [file] -- [issue]
+### Stale Content (X flagged)
+- [file] -- [what's stale] -- [age]
+### Duplicates (X found)
+- [fact] -- appears in [file1] and [file2]
+### Proposed Actions
+1. [action] -- [file] -- [reason]
+2. ...
+Approve all, select by number, or cancel?
+```
+## After Completion
+Report summary: files modified, created, renamed, deleted. Line count before/after.
+## Best Practices
+- Run after long sessions or when starting fresh on a project
+- Check stale flags manually -- dream cannot verify if TODOs were completed without reading the actual codebase
+- The 14-day staleness threshold is a heuristic, not a hard rule
+- When in doubt about whether to prune, flag it for the user rather than proposing deletion
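
The Phase 1 checks on MEMORY.md can be approximated with a small script. The sketch below assumes the entry format and limits stated in the format contract (`- [Title](file.md) -- one-line hook`, 200 lines, about 150 characters per entry). The `audit_index` helper and its finding messages are hypothetical, and skipping blank lines is an assumption the skill text does not address.

```python
import re

# Entry format from the format contract: "- [Title](file.md) -- one-line hook".
ENTRY = re.compile(r"^- \[[^\]]+\]\([^)]+\.md\)( -- .+)?$")
MAX_LINES = 200
TARGET_ENTRY_CHARS = 150


def audit_index(text: str) -> list[str]:
    """Flag lines in MEMORY.md that are neither index entries nor `##` headers."""
    findings = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        findings.append(f"index has {len(lines)} lines (limit {MAX_LINES})")
    for number, line in enumerate(lines, start=1):
        # Assumption: blank lines and `## ` section headers are allowed.
        if not line.strip() or line.startswith("## "):
            continue
        if not ENTRY.match(line):
            findings.append(f"line {number}: not an index entry or section header")
        elif len(line) > TARGET_ENTRY_CHARS:
            findings.append(f"line {number}: entry longer than {TARGET_ENTRY_CHARS} characters")
    return findings
```

The findings map onto the "Format violations" section of the Phase 2 report; the script proposes nothing and changes nothing, which matches the rule that execution waits for user approval.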

package/skills/prompt-generator/REFERENCE.md ADDED

@@ -0,0 +1,150 @@
+# Prompt generator -- reference
+## Canonical resources
+When authoring or refining prompts, ground decisions in these sources. If guidance conflicts, defer to the higher tier.
+### Tier 1: Anthropic (primary authority for Claude)
+- https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/overview -- overview, links to all sub-guides
+- https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices -- the single living reference for Claude's latest models. Covers general principles, XML tags, prefill deprecation, tool use, thinking, agentic systems, overeagerness, anti-hallucination.
+- https://transformer-circuits.pub/2026/emotions/index.html -- emotion concepts research (April 2026): 171 internal activation patterns that causally influence behavior. Key prompt-engineering takeaways: clear criteria and escape routes improve output quality, collaborative framing activates engagement, positive task framing correlates with better results, inviting transparency produces more reliable output. Cross-model caveat: studied on Sonnet 4.5; patterns align with best practices independently.
+- https://www.anthropic.com/research/emotion-concepts-function -- blog summary of the above paper.
+- https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking -- adaptive thinking reference; replaces manual budget_tokens with effort-based control.
+### Tier 2: Major labs (strong secondary, often transfers across models)
+- https://platform.openai.com/docs/guides/prompt-engineering -- six strategies: write clear instructions, provide reference text, split complex tasks, give models time to think, use external tools, test systematically.
+- https://deepmind.google/research/ -- learning resources and chain-of-thought research.
+- https://www.microsoft.com/en-us/research/blog/ -- publications and applied research.
+### Tier 3: Courses, communities, individuals (supplementary)
+**Courses:**
+- https://www.deeplearning.ai/short-courses/ -- Andrew Ng's courses. "ChatGPT Prompt Engineering for Developers" (with OpenAI) is the foundational one.
+- https://course.fast.ai/ -- Jeremy Howard's top-down teaching style.
+- https://www.elementsofai.com/ -- University of Helsinki introductory course.
+- https://ocw.mit.edu/search/?t=Artificial%20Intelligence -- MIT OpenCourseWare AI curriculum.
+**Communities and individuals:**
+- https://discuss.huggingface.co/ -- open-source model community.
+- https://www.latent.space/ -- AI engineering perspective (Latent Space Podcast & Newsletter).
+- https://simonwillison.net/ -- practical LLM experiments. His "LLM" tag is especially valuable.
+### Conflict resolution rule
+If sources disagree on a technique, apply in order: Anthropic documentation first (it describes the actual model behavior), then OpenAI/Google/Microsoft (large-scale research with cross-model relevance), then community sources (patterns and intuition, not authoritative on model internals). When Tier 3 contradicts Tier 1, Tier 1 wins without exception.
+## NotebookLM Audio Overview customization (example)
+Adapt `[FOCUS AREA]` per notebook. Pair with Deep Dive + Longer in the product UI when that matches the user's plan.
+```text
+Target audience: [Expert-level listener profile -- skip beginner padding.]
+Focus: [FOCUS AREA -- single notebook-specific paragraph.]
+Style: [Technical depth, anti-patterns, implications for builders.]
+Prioritize: [Technical depth and specific findings over marketing tone or generic summaries.]
+```
+## Agent checklist pattern
+For long tasks, include an optional checklist the model can mirror:
+```text
+Copy this checklist and mark items as you go:
+Progress:
+- [ ] ...
+- [ ] ...
+```
+## Agentic state management
+For `agent-harness` prompts that span multiple context windows, include state persistence and multi-window patterns. Based on Anthropic's guidance:
+### Context awareness
+Claude 4.6 tracks its remaining context window. Include harness capabilities so Claude can plan accordingly:
+```text
+<context_management>
+Your context window will be automatically compacted as it approaches its limit, allowing you to continue working indefinitely from where you left off. Do not stop tasks early due to token budget concerns. As you approach the limit, save current progress and state before the context window refreshes. Always be as persistent and autonomous as possible and complete tasks fully.
+</context_management>
+```
+### Multi-window workflow
+Anthropic recommends differentiating the first context window from subsequent ones:
+**First window:** Set up the framework -- write tests, create setup scripts, establish the todo-list.
+**Subsequent windows:** Iterate on the todo-list, using state files to resume.
+Key patterns from Anthropic:
+- Have the model write tests in a **structured format** (e.g. `tests.json` with `{id, name, status}`) before starting work. Remind: "It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality."
+- Encourage **setup scripts** (e.g. `init.sh`) that start servers and run test suites and linters. This prevents repeated work across windows.
+- When starting fresh, be **prescriptive about resumption**: "Review progress.txt, tests.json, and the git logs."
+- Provide **verification tools** (Playwright, computer use) for autonomous UI testing.
+### State tracking
+```text
+<state_management>
+Track progress in structured + freeform files:
+- tests.json: structured test status {id, name, status}
+- progress.txt: freeform session notes and next steps
+- Use git commits as checkpoints for rollback
+When approaching context limits, save current state before the window refreshes.
+Do not stop tasks early due to token budget concerns.
+</state_management>
+```
+### Encouraging complete context usage
+```text
+This is a very long task, so it may be beneficial to plan out your work clearly. It's encouraged to spend your entire output context working on the task - just make sure you don't run out of context with significant uncommitted work. Continue working systematically until you have completed this task.
+```
+## Research prompt pattern
+For `research` prompt types, include structured investigation with hypothesis tracking:
+```text
+<research_approach>
+Search for this information in a structured way. As you gather data, develop several competing hypotheses. Track your confidence levels in your progress notes to improve calibration. Regularly self-critique your approach and plan. Update a hypothesis tree or research notes file to persist information and provide transparency. Break down this complex research task systematically.
+</research_approach>
+```
+Key elements:
+- Define clear **success criteria** for the research question
+- Encourage **source verification** across multiple sources
+- Track **competing hypotheses** with confidence levels
+- **Self-critique** approach and plan regularly
+## Evaluation loop
+For prompt drafts that must hold up over time:
+1. Run the draft on 2-3 representative user utterances.
+2. Note failure modes (skipped steps, wrong format, over-refusal).
+3. Tighten **constraints** or add **examples** for the failure class only.
+Anthropic's **self-correction chaining** pattern extends this: generate a draft, have Claude review it against criteria, then have Claude refine based on the review. Each step can be a separate API call for inspection and branching.
+## Anti-test-fixation pattern
+```text
+Write general-purpose solutions using the standard tools available. Implement logic that works correctly for all valid inputs, not just the test cases. Tests verify correctness -- they do not define the solution. If a test seems incorrect or the task is unreasonable, flag it rather than working around it.
+```
+## Commit-and-execute pattern
+```text
+When deciding how to approach a problem, choose an approach and commit to it. Avoid revisiting decisions unless you encounter new information that directly contradicts your reasoning. If you are weighing two approaches, pick one and see it through. You can always course-correct later if the chosen approach fails.
+```
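
The state-tracking pattern in this reference (`tests.json` with `{id, name, status}` plus a freeform `progress.txt`) could be maintained by a helper like the one below. This is a sketch under stated assumptions: the file names and record fields come from the reference text, while the function names and the status values `"passing"` and `"failing"` are hypothetical. Unknown test ids raise an error rather than adding a record, in keeping with the guidance not to remove or edit tests.

```python
import json
from pathlib import Path


def record_status(state_dir: Path, test_id: str, status: str, note: str) -> None:
    """Update one test's status in tests.json and append a note to progress.txt."""
    tests_path = state_dir / "tests.json"
    tests = json.loads(tests_path.read_text())
    for test in tests:
        if test["id"] == test_id:
            test["status"] = status
            break
    else:
        # Only the status of an existing test may change; never add or drop tests.
        raise KeyError(f"unknown test id: {test_id}")
    tests_path.write_text(json.dumps(tests, indent=2))
    with (state_dir / "progress.txt").open("a") as progress:
        progress.write(note + "\n")


def remaining(state_dir: Path) -> list[str]:
    """Return names of tests not yet passing, for resuming in a new window."""
    tests = json.loads((state_dir / "tests.json").read_text())
    # Assumption: "passing" is the status value that marks a finished test.
    return [test["name"] for test in tests if test["status"] != "passing"]
```

A fresh context window could call `remaining` first to decide where to resume, alongside reviewing `progress.txt` and the git log.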