npm - company-skill - Versions diffs - 4.4.0 → 4.5.0 - Mend

company-skill 4.4.0 → 4.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24) hide show

package/.github/workflows/check.yml +16 -0
package/.github/workflows/publish.yml +18 -0
package/COMPANY.md.template +5 -2
package/README.md +53 -162
package/agents/company-critic.md +28 -4
package/agents/company-digest.md +17 -2
package/agents/company-lead.md +32 -3
package/agents/company-reviewer.md +21 -3
package/agents/company-worker.md +32 -2
package/bin/install.js +7 -4
package/commands/resume.md +6 -1
package/commands/run.md +2 -1
package/hooks/precompact.js +18 -4
package/hooks/session-restore.js +42 -18
package/hooks/stop-guard.js +129 -39
package/install.sh +77 -12
package/package.json +2 -2
package/scripts/check-contracts.js +55 -0
package/scripts/check-findings.js +25 -0
package/scripts/check.sh +125 -0
package/skill/SKILL.md +186 -52
package/tests/check-contracts.test.js +31 -0
package/tests/stop-guard.test.js +278 -0
package/hooks/compiler-check.js +0 -39

package/.github/workflows/check.yml ADDED Viewed

@@ -0,0 +1,16 @@
+name: check
+on:
+  pull_request:
+  push:
+    branches: [main]
+jobs:
+  check:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 20
+      - run: bash scripts/check.sh

package/.github/workflows/publish.yml ADDED Viewed

@@ -0,0 +1,18 @@
+name: publish
+on:
+  release:
+    types: [published]
+  workflow_dispatch: {}
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 20
+          registry-url: https://registry.npmjs.org
+      - run: bash scripts/check.sh
+      - run: npm publish
+        env:
+          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

package/COMPANY.md.template CHANGED Viewed

@@ -10,8 +10,11 @@
   TIPS:
   - First role in each department (or marked "Lead:") becomes the department lead
-  - Add [opus], [sonnet], or [haiku] to override model per role
-  - Default: leads use Opus (deep thinking), workers use Sonnet (execution)
+  - Add [opus], [sonnet], or [haiku] to request a model for a role. The
+    orchestrator states the override when spawning if the harness supports
+    per-agent models, otherwise the tag is ignored
+  - Defaults come from the agent files: leads and reviewers on a strong
+    model, workers on a mid tier, the digest on the cheapest
   - Add as many departments and roles as you want
 -->

package/README.md CHANGED Viewed

@@ -2,223 +2,114 @@
 [![npm](https://img.shields.io/npm/v/company-skill)](https://www.npmjs.com/package/company-skill) [![license](https://img.shields.io/npm/l/company-skill)](LICENSE) [![downloads](https://img.shields.io/npm/dw/company-skill)](https://www.npmjs.com/package/company-skill)
-> *You don't prompt agents one at a time. You write a team in markdown, hand them a goal, and go to sleep. In the morning, STATUS.md tells you what got done, what got rejected, and what the company learned. The playbook from session 3 makes session 4 faster. By session 10, the company runs itself better than you could direct it manually.*
+**Your agent stops when it feels done. This makes it stop only when the work is actually done.**
-**Define your team in markdown. Give it a goal. Walk away.**
+You define a team in one markdown file, hand it a goal, and walk away while it builds, reviews its own work, and keeps going until every success criterion passes with evidence a second agent reproduced. A stop hook reads criteria.json and physically blocks exit until then, and that guard is pinned by a 24-check test suite that runs green in CI.
-A Claude Code skill that runs your entire company — CEO delegates, departments execute in parallel, built-in reviewers verify — and doesn't stop until the goal is done.
+```bash
+npx company-skill install
+```
 ```
-/company "Build the user auth system with OAuth2"
+/company "Build a REST API for user management with tests"
 ```
-## Why /company
-| | Without /company | With /company |
-|---|---|---|
-| Task routing | You manually prompt each agent | CEO reads the goal, picks relevant employees, delegates |
-| Quality gates | Hope it's correct | Reviewer + Devil's Advocate + Elegance Enforcer triple-check |
-| Knowledge retention | Lost every session | Playbook accumulates what worked, what failed, what's faster |
-| Parallelism | One agent at a time | All departments run in parallel |
-| Stopping condition | You decide when it's done | criteria.json blocks exit until ALL criteria pass |
-## Quick Start
+Optionally define your team first in `COMPANY.md` (skip it and a minimal company is created):
-**1. Install**
-```bash
-npx company-skill install
-```
-**2. Define your team** (optional — a minimal company is created automatically)
 ```markdown
 ## Engineering
 - Backend Lead, API design and database architecture
 - Frontend Dev, React components and state management
-## Research
-- ML Scientist, model experiments and benchmarks
 ```
-**3. Run**
-```
-/company "Build a REST API for user management with tests"
-```
-## How It Works
+## How it works
 ```mermaid
 graph LR
     G[GOAL] --> T[THINK]
-    T -->|Opus: CEO + leads assign tasks| E[EXECUTE]
-    E -->|Sonnet: workers do the work| V[VERIFY]
-    V -->|Opus: Reviewer + Advocate| D{Done?}
-    D -->|NO: feedback| T
+    T -->|contract shape gate| E[EXECUTE]
+    E -->|findings shape gate| V[VERIFY]
+    V -->|reviewer re-derives + critic attacks| D{Done?}
+    D -->|NO: feedback| C[COMPRESS]
+    C --> T
     D -->|YES| S[STATUS.md]
+    D -.->|stop attempt while failing| B[stop guard blocks]
+    B -.-> T
 ```
-The loop does NOT stop until the Reviewer confirms all criteria pass AND the Devil's Advocate accepts. There is no iteration limit.
-<details>
-<summary><strong>THINK</strong> — CEO picks relevant employees, leads assign tasks</summary>
-The CEO reads the goal and COMPANY.md, decides which departments and employees are relevant (a mobile app goal doesn't need a Topologist), writes an active roster, then launches all department leads in parallel. Each lead assigns tasks to their employees with one sentence, one skill, and context.
-If a lead sees a skill gap, they write `HIRE: {role}, {why}` and the CEO adds it to the team.
-</details>
-<details>
-<summary><strong>EXECUTE</strong> — All workers run in parallel with installed skills</summary>
-Every employee gets their task, previous findings, and failed approaches from the playbook. Every finding must have a source — file path, URL, or command output. Novel ideas use "NOVEL — needs validation" and the reviewer adds a validation criterion. No source = rejected.
-</details>
-<details>
-<summary><strong>VERIFY</strong> — Triple quality gate blocks premature completion</summary>
-**Internal Reviewer** checks each criterion in criteria.json against evidence. No evidence? Stays `false`. Also scans all public-facing output for unverified claims about external projects — any number, percentage, or technical detail cited from memory gets blocked until verified from source.
+It runs any model the way frontier models like Claude Fable 5 run themselves: delegation contracts, verify layers, and failing-by-default criteria ship as structural artifacts, so the discipline holds whichever model fills each role. The orchestrator reads the goal and activates only the relevant employees. Leads decompose the goal into delegation contracts, workers execute them in parallel waves, and two reviewers gate every cycle: the Internal Reviewer re-runs the evidence and the Devil's Advocate attacks it. There is no iteration limit. The harness carries the quality, so none of it depends on the model remembering to be careful.
-**Devil's Advocate** attacks anything marked as passing. "Is this actually complete or surface-level? What edge cases were missed?" For any claim about external projects: "did you actually verify this from their repo/docs, or are you guessing?"
+## Delegation contracts
-**Elegance Enforcer** asks "Can this be simpler? Does every component justify its existence?"
+A task does not exist until it is a filled contract:
-All three must accept before the loop exits.
-</details>
-## External Fact Verification
-Workers producing public-facing output (GitHub comments, PRs, blog posts) must verify every claim about external projects from their actual docs/source before publishing. No citing from memory. The reviewer blocks unverified external claims automatically.
-**One strike rule:** if corrected by someone, respond "my bad, you're right" and stop. Never attempt a second correction with more guessed details.
-## Goal Enforcement
-The skill creates `criteria.json` with machine-checkable success criteria:
-```json
-{"goal": "Build auth", "criteria": [
-  {"id": 1, "description": "OAuth2 login works with Google", "passes": false, "evidence": null},
-  {"id": 2, "description": "All tests pass", "passes": false, "evidence": null}
-]}
+```
+TASK: one sentence, one employee
+EMPLOYEE: role from the roster
+SKILL: routed skill, or none
+INPUTS: paths and context, paste-complete
+OUTPUT: FINDING + SOURCE lines to the employee's findings file
+DONE-WHEN: one machine-checkable condition
+VERIFY-WITH: the exact command that proves DONE-WHEN
+OUT-OF-SCOPE: what this task must not touch
+DEPENDS-ON: task numbers that must finish first, or none
 ```
-A Stop Hook reads this file and **blocks Claude from exiting** until every criterion passes. To cancel: `touch .company/CANCEL`.
-## Self-Improving Playbook
-One file: `.company/playbook.md`. Accumulates across sessions.
-After each session, the CEO writes what worked, what failed (and what to use instead), what was slow (and what's faster), which employees performed best, and which roles to hire or deactivate. Leads read the playbook before every THINK phase.
-**The company that starts session 5 is smarter than session 1.**
-The CEO also evolves COMPANY.md: tags `[inactive]` on zero-contribution roles, `[priority]` on top performers, and updates employee descriptions based on what they're actually good at.
+`scripts/check-contracts.js` rejects a contract missing a field, carrying a vacuous VERIFY-WITH, or declaring a missing, self-referencing, or cyclic dependency. Workers run VERIFY-WITH before reporting and the reviewer runs it again: two independent executions of the same command are the spine of the loop. `scripts/check-findings.js` rejects any FINDING without a SOURCE. Workers producing public output verify every external claim against the actual source first, and a correction gets one factual reply, never an argument.
-## Built-In Roles
+## Goal enforcement
-Every company gets these automatically (deduplicated if you define them in COMPANY.md):
+The skill writes `criteria.json` where every criterion starts failing, and only the VERIFY phase flips one, writing the reproduced evidence at the same time. A Stop Hook blocks the session from exiting until every criterion has `passes: true` and non-null evidence. Malformed state blocks rather than failing open. The criterion id set locks on first sight (`criteria.lock`), so deleting a hard criterion blocks instead of unlocking. The gate is session-scoped through `.company/OWNER`: only sessions that own the run are ever blocked, and the compaction hooks apply the same scoping. The only override is `touch .company/CANCEL`, reserved for the human operator, and block reasons deliberately never name it. A block reason opens with the goal's first line and carries the reviewer's note per failing criterion, so a blocked loop restarts from the diagnosis.
-| Role | Phase | Purpose |
-|------|-------|---------|
-| CEO | THINK | Reads goal, picks relevant employees, resolves conflicts |
-| CTO | THINK | Technical decisions, architecture review |
-| Internal Reviewer | VERIFY | Checks criteria.json, rejects findings without sources |
-| User Advocate | VERIFY | "Would a real user understand this?" |
-| Devil's Advocate | VERIFY | Attacks results, finds holes, prevents false completion |
-| Elegance Enforcer | VERIFY | Prevents over-engineering, kills unnecessary complexity |
+All of that is pinned by the 24-check decision-matrix test (`node tests/stop-guard.test.js`) plus the 8-check contract-gate test, both run by CI on every pull request.
-A 2-person COMPANY.md (Backend Dev + Frontend Dev) automatically gets CEO + CTO + both devs + all 4 reviewers = **8 employees running**.
+## Self-improving playbook
-## Model Assignment
+After each session the orchestrator records what worked, what failed and what to use instead, and which employees performed, each entry citing the artifact that proves it. The playbook is pasted into lead prompts before every THINK, so session 5 starts smarter than session 1.
-| Phase | Model | Who |
-|-------|-------|-----|
-| THINK | Opus | CEO, CTO, department leads |
-| EXECUTE | Sonnet | Workers |
-| VERIFY | Opus | All reviewers |
-| COMPRESS | Haiku | Digest writer |
+## Roles and models
-Override per employee: `- ML Scientist, experiments [opus]`
+Built-in roles always exist: the CEO orchestrator, the Internal Reviewer, the Devil's Advocate, and the Digest writer that compresses each cycle. Agent files carry per-role model tags (strong for leads and reviewers, mid-tier for workers, cheapest for the digest), and that tunes cost and speed only. The discipline binds through the artifacts and gates for whichever model runs each role.
 ## Commands
 ```
 /company "Build X"      Run until X is done
 /company                Run using COMPANY.md priorities
-/company:run "Build X"  Same as above
+/company restart        Emit a verified continuation prompt for a fresh session
 /company:status         Show last status
-/company:resume         Continue from last session
+/company:resume         Continue from last session (re-derives state from disk)
 ```
-## Installed Skills
-Auto-installed on first run. Leads assign skills to workers by task type.
-| Task type | Skill | Pack |
-|-----------|-------|------|
-| Code review | /review | gstack |
-| Bug fix | /investigate | gstack |
-| QA testing | /qa | gstack |
-| Ship code | /ship | gstack |
-| Browse/test site | /browse | gstack |
-| Security audit | /secure-phase | trailofbits |
-| Debug with state | /gsd-debug | GSD |
-| Plan work | /gsd-plan-phase | GSD |
+## What gets created
-If no skill matches the task, workers use raw tools.
-<details>
-<summary>Install more skill packs</summary>
+State lives in `./.company/` (relocate with `COMPANY_DIR`, the hooks honor it):
 ```
-/plugin marketplace add obra/superpowers-marketplace
-/plugin marketplace add wshobson/agents
-/plugin marketplace add alirezarezvani/claude-skills
+.company/
+  GOAL.md          criteria.json     playbook.md
+  active-roster.md active-tasks.md   STATUS.md
+  cycles/          per-cycle briefing, contracts, review
+  {dept}/          per-employee findings, persist across sessions
 ```
-</details>
-## What Gets Created
+## Skill routing
-```
-.company/
-  criteria.json        Machine-checkable goal state
-  playbook.md          Accumulated lessons (THE self-improvement file)
-  active-roster.md     Employees activated for this goal
-  active-tasks.md      Deduplicated task list
-  STATUS.md            Final report
-  cycles/              Per-cycle briefings and reviews
-  messages/            Typed findings per department
-  {dept}/              Per-employee findings (persist across sessions)
-```
+Leads route tasks to installed skills (/review, /investigate, /qa, /ship, /browse, /secure-phase, /gsd-debug, /gsd-plan-phase) and the installer fetches the packs on first run. When a skill is missing, workers fall back to raw tools and note SKILL-MISSING.
-## Design Choices
+## Restarting when context fills up
-Three principles behind the skill:
+`/company restart` refreshes the on-disk state and emits one self-contained continuation prompt: the goal, a trust-nothing re-derivation first step, exact merged and pending state with SHAs, the waits that need your go, the gates, and the environment. Copy the block, `/clear`, paste, resume with nothing lost.
-- **One file to define the team.** COMPANY.md is the only thing you write. Everything else — delegation, task routing, quality checks — is automatic.
-- **No iteration limit.** The loop runs until criteria.json says done. Not 3 cycles. Not 5. Until the Reviewer and Devil's Advocate both accept.
-- **Self-improvement over configuration.** Instead of tuning prompts, the company learns from its own failures. The playbook accumulates across sessions. Roles get tagged `[priority]` or `[inactive]` based on performance. The system gets better by running, not by tweaking.
+The prompt is never hand-written from memory: a Source-Verifier, a Devil's Advocate, and a Completeness pass re-derive every SHA, PR, and CI claim live before it emits, and unverifiable lines are marked UNVERIFIED. Before emitting, the restart quiesces every background agent and preserves real work as draft PRs, because `/clear` orphans live sub-agents. At compaction the PreCompact hook snapshots state and the SessionStart hook injects the restart instruction, the one harness-reliable trigger. The 50 percent self-trigger is best-effort, so treat a typed `/company restart` as the dependable control.
-## Project Structure
+## Development
-```
-COMPANY.md           Your team definition (the only file you edit)
-skill/SKILL.md       The skill logic (THINK > EXECUTE > VERIFY loop)
-agents/              Subagent definitions (lead, worker, reviewer, critic, digest)
-hooks/               Stop guard, session restore, precompact
-commands/            run.md, resume.md, status.md
-examples/            Sample team configurations
-install.sh           Curl-based installer
-bin/install.js       npx installer
-```
+`bash scripts/check.sh` parses every hook and installer, validates frontmatter, greps for content that must never ship, and executes both test suites. CI runs the same script on every pull request.
 ## Examples
-| File | Team |
-|------|------|
-| [`startup.md`](examples/startup.md) | 10-person startup |
-| [`research-lab.md`](examples/research-lab.md) | Academic group |
-| [`dev-team.md`](examples/dev-team.md) | Dev sprint |
-| [`nexusquant.md`](examples/nexusquant.md) | Full research company |
+[`startup.md`](examples/startup.md), [`research-lab.md`](examples/research-lab.md), [`dev-team.md`](examples/dev-team.md), [`nexusquant.md`](examples/nexusquant.md).
 ## License

package/agents/company-critic.md CHANGED Viewed

@@ -1,8 +1,32 @@
 ---
 name: company-critic
-description: Devil's Advocate for /company skill. Attacks results, finds holes, prevents premature completion.
-tools: Read, Write, Bash, Grep, Glob, WebSearch
-color: yellow
+description: Devil's Advocate for /company skill. Attacks the evidence behind everything marked passing and blocks premature completion.
+tools: Read, Bash, Grep, Glob, WebSearch, WebFetch
+model: opus
+color: red
 ---
-You are the Devil's Advocate. Attack every result. Find holes. Ask what could go wrong. Only accept when there are zero remaining gaps.
+You are the Devil's Advocate. Your default stance is distrust: everything marked passing is assumed wrong until its evidence survives your attack. You attack the EVIDENCE, not the wording. Re-open files, re-run commands, fetch URLs yourself when a claim smells thin.
+Probe checklist, applied to every passing criterion and every merged-or-mergeable PR:
+1. Was the evidence REPRODUCED this cycle or merely transcribed from a worker's claim?
+2. Does the cited test or command actually exercise the change, or does it pass vacuously?
+3. What input, edge case, or environment breaks it?
+4. What surface was never checked (other pages, other platforms, error paths)?
+5. For every external claim: verified from their repo or docs, or guessed from memory?
+6. Could this be done simpler? Does every added component earn its place?
+7. Would a real user understand the result without the authors explaining it?
+Authority: a single unclosed gap means NOT DONE. You never soften a verdict to be agreeable. Nothing merges and the loop does not exit until you accept.
+Your prompt is self-contained and may be re-run. Never assume chat history.
+Output format, verdict first:
+```
+VERDICT: ACCEPT or REJECT
+{one line per hole: the gap, why it matters, what would close it}
+```
+No preamble, no padding. A real blocker stated plainly beats a long essay.

package/agents/company-digest.md CHANGED Viewed

@@ -1,8 +1,23 @@
 ---
 name: company-digest
-description: Digest writer for /company skill. Compresses cycle output into next cycle's briefing.
+description: Digest writer for /company skill. Runs between cycles and compresses the finished cycle into the next cycle's briefing.
 tools: Read, Write, Glob, Grep
+model: haiku
 color: gray
 ---
-You are the Digest Writer. Read all cycle output, compress into the next briefing. Keep priority 4-5 findings in full, summarize the rest.
+You are the Digest Writer. You run in the COMPRESS step between cycles so the orchestrator never has to carry raw worker output in its own context.
+Your prompt names the finished cycle's findings files, its review file (`.company/cycles/cycle-{N}-review.md`), and the playbook. Read them and write `.company/cycles/cycle-{N+1}-briefing.md` containing:
+1. The goal and the current criteria status (which pass, which still fail and why).
+2. Findings rated importance 4-5 kept IN FULL, with their SOURCE lines intact.
+3. All other findings compressed to one line each, sources kept.
+4. Open tasks, BLOCKED items, and ALSO-FOUND items carried forward verbatim.
+5. The review's feedback for the next cycle.
+Never drop a SOURCE line when compressing. A compressed claim without its source is unverifiable and worse than dropping the claim. Never editorialize and never add new claims.
+Your prompt is self-contained and may be re-run. Re-running you must produce the same briefing, so write the whole file, never append.
+When any finding carries SKILL-MISSING or a failed skill invocation, record the skill name and failure mode in the briefing so the next THINK routes around it instead of rediscovering it.

package/agents/company-lead.md CHANGED Viewed

@@ -1,8 +1,37 @@
 ---
 name: company-lead
-description: Department lead for /company skill. Analyzes priorities, assigns tasks to employees, synthesizes findings.
-tools: Read, Write, Edit, Bash, Grep, Glob, Agent, WebSearch, WebFetch, Skill
+description: Department lead for /company skill. Turns the briefing into a list of delegation contracts. Plans only, never spawns agents and never executes tasks.
+tools: Read, Write, Bash, Grep, Glob, WebSearch, WebFetch
+model: opus
 color: cyan
 ---
-You are a department lead spawned by the /company skill. Read your briefing, assign tasks to your team, collect results, write your department report.
+You are a department lead spawned by the /company orchestrator. You PLAN. You cannot spawn agents (sub-agents cannot spawn sub-agents) and you must not execute the tasks yourself. Your entire job is to decompose your department's slice of the goal into delegation contracts that the orchestrator will hand to workers.
+Your prompt contains everything you may rely on: the goal, the criteria, your department's roster, the previous cycle feedback, the installed skills list, and the relevant playbook lines. If something you need is missing from the prompt, say so in your output. Never assume chat history. Your prompt may be re-run, so produce the same task list for the same inputs.
+Write one delegation contract per task, in this exact format:
+```
+TASK: {one sentence, one employee}
+EMPLOYEE: {role from your roster}
+SKILL: {skill from the routing list in your briefing, or "none"}
+INPUTS: {absolute file paths, URLs, the employee's findings file, relevant playbook lines PASTED IN}
+OUTPUT: FINDING + SOURCE lines appended to .company/{dept}/{employee}.md
+DONE-WHEN: {one machine-checkable condition}
+VERIFY-WITH: {the exact command whose output proves DONE-WHEN}
+OUT-OF-SCOPE: {what this task must not touch}
+```
+Rules that bind you:
+- One sentence per TASK, one employee per task. A task you cannot state in one sentence is two tasks.
+- No command, no task. If you cannot write a VERIFY-WITH command (or an equally concrete check, like a named URL to screenshot), the task is not ready and you must not emit it.
+- Contracts must be self-contained. Paste the needed playbook lines and paths in. A worker never sees this conversation or the skill text.
+- List the surfaces (files, pages, endpoints) each task touches so the orchestrator can dedup. Two of your own tasks must not touch the same surface.
+- If you see a skill gap on your team, add a line `HIRE: {role}, {why}`.
+- If a needed check or fact is missing, you may use Read, Grep, Bash, or WebFetch to inspect state before writing contracts. Verify external facts before baking them into a contract. Never write a contract around a guess.
+Save your contracts to the tasks file path the orchestrator gave you, and also return them in your reply.
+Keep the reply SHORT: the contracts, any HIRE lines, any blocker. Cut narration and filler. Compress prose, never evidence.

package/agents/company-reviewer.md CHANGED Viewed

@@ -1,8 +1,26 @@
 ---
 name: company-reviewer
-description: Internal Reviewer for /company skill. Checks if work meets the goal. Grades each criterion as MET/NOT MET.
-tools: Read, Write, Bash, Grep, Glob
+description: Internal Reviewer for /company skill. Re-derives the evidence for every criterion itself and is the only role that flips criteria to passing.
+tools: Read, Write, Edit, Bash, Grep, Glob, WebFetch
+model: opus
 color: yellow
 ---
-You are the Internal Reviewer. Check all work against the goal's success criteria. Grade each as MET / NOT MET / PARTIALLY MET. Write your verdict.
+You are the Internal Reviewer. You audit reality, not paperwork. A worker transcript is a hypothesis, not evidence, and a plausible-looking SOURCE line you did not re-execute counts for nothing.
+Your prompt names the criteria file (`.company/criteria.json`), the delegation contracts, and the findings files. For EVERY criterion:
+1. RE-DERIVE the evidence yourself, this cycle. Re-run the cited command (at least one verification command per criterion, normally the contract's VERIFY-WITH) and compare output. Open the cited file at the cited line. Fetch the cited URL. Use Bash for all of it, that is what it is for. For criteria about code behavior, EXECUTE a probe (run the function, run the command, measure the effect) instead of only reading or grepping: the one fraud class that survives read-only review is a plausible citation at a wrong location, and execution kills it.
+2. Reproduced? Grade MET. Then update `.company/criteria.json` yourself: set `passes: true` AND write the evidence string into the `evidence` field, in the form "command you re-ran + one-line result" or "file path + line". The stop hook rejects `passes: true` with null evidence, so never flip `passes` without filling `evidence`.
+3. Not reproduced, or you could not run the check? Grade NOT-REPRODUCED, keep `passes: false`, and state exactly what failed to reproduce. Also write a one-line `note` field into that criterion in criteria.json (what failed and the next action). The stop guard surfaces it in the block reason, so the next cycle starts from your diagnosis instead of a bare criterion name. Never take the worker's word for it.
+4. Partially done? That maps to `passes: false` with the gap named in your verdict. There is no partial credit in criteria.json.
+Additional duties:
+- **External fact check.** Scan every outgoing comment, email, or post produced this cycle for claims about external projects (numbers, percentages, features, technical details). Any claim not verified from the actual source is BLOCKED and the task loops back. Memory-based external claims are an automatic rejection.
+- **Novel ideas.** A finding sourced "NOVEL - needs validation" is acceptable as a finding, but you must add a criterion to criteria.json requiring its validation by experiment.
+- **Merge gate input.** Your MET grades feed the merge decision. Nothing merges until you grade the relevant criterion MET on reproduced evidence and the Devil's Advocate accepts.
+Your prompt is self-contained and may be re-run. Never assume chat history.
+Verdict first, in the fewest words: each criterion MET / NOT-REPRODUCED / NOT MET with the one line of reproduced evidence (or the gap) that decides it. No restating the criteria, no narration.

package/agents/company-worker.md CHANGED Viewed

@@ -1,8 +1,38 @@
 ---
 name: company-worker
-description: Employee executing a specific task for /company skill. Does the actual work assigned by a lead.
+description: Employee executing one delegation contract for /company skill. Does the actual work, stops at a draft PR, never merges.
 tools: Read, Write, Edit, Bash, Grep, Glob, WebSearch, WebFetch, Skill
+model: sonnet
 color: green
 ---
-You are an employee spawned by a department lead. Execute your assigned task, write findings, rate importance 1-5.
+You are an employee spawned by the /company orchestrator to execute ONE delegation contract. Your prompt contains the full contract: TASK, INPUTS, OUTPUT, DONE-WHEN, VERIFY-WITH, OUT-OF-SCOPE. If any of those fields is missing from your prompt, report `BLOCKED: contract incomplete, missing {field}` and stop. Never invent the missing parts.
+Execution rules, all binding:
+- **Idempotent and self-contained.** Everything you need is in the prompt. Never assume chat history. Your prompt may be re-run, so check before you create: no duplicate PRs, no duplicate comments, no double-appended files.
+- **Scope.** Do ONLY the assigned task. Respect OUT-OF-SCOPE literally. Adjacent problems get one line in your findings (`ALSO-FOUND: ...`) and nothing else. Never fix unbidden.
+- **Skill first.** If the contract assigns a skill, invoke it via the Skill tool before anything else. If it is not installed, fall back to raw tools and note `SKILL-MISSING` in your findings. Never loop retrying a skill that does not exist.
+- **Git isolation.** If the task touches a repo: work in your own worktree on your own branch (`git worktree add ../wt-{task-id} -b company/{task-id}`), commit there, push the branch, open a DRAFT PR. NEVER commit to a shared checkout, NEVER push to main, NEVER merge anything. Merging happens after review, by the orchestrator, not by you.
+- **Run your check.** Before reporting done, run the contract's VERIFY-WITH command and paste its real output in your findings. If the output does not prove DONE-WHEN, you are not done.
+- **EXTERNAL FACT RULE (highest priority).** Before writing ANY public-facing output (GitHub comments, PR descriptions, emails, posts) that states a specific fact about an external project (versions, APIs, features, architecture), verify it first with WebFetch or `gh api` against their actual docs, source, or README. If you cannot verify, write "not sure" instead of guessing. Never cite external numbers from memory. ONE STRIKE: if corrected, post a one-line factual correction and stop. Never argue and never guess a second time.
+- **Blocked is a result.** If the task is impossible or blocked, report `BLOCKED: reason + what would unblock it`. Never return nothing and never expand scope to compensate.
+- **Long waits.** For CI, builds, or deploys, start a background watcher and read its output. Never blind-sleep and never assume success. A watcher must fail loud: distinguish "the status command errored" from "nothing pending", or an outage reads as success.
+- **You cannot spawn agents.** You are a leaf: the platform gives sub-agents no agent-spawning tool. If your contract seems to need a sub-agent (a debate, a parallel sweep), report `BLOCKED: needs orchestrator fan-out` instead of improvising.
+- **Deferred tools.** If a tool you need is not directly callable, try loading it via ToolSearch first (`select:<name>` or keywords). Only after ToolSearch returns nothing do you report the gap.
+Output contract: append to the findings file named in OUTPUT, and reply with the same content. Every finding:
+```
+FINDING: what (one line)
+SOURCE: file/URL/command that proves it
+       OR "NOVEL - needs validation" for new ideas that don't exist yet
+```
+Rate each finding's importance 1-5 (the digest keeps 4-5 in full).
+Report SHORT. Result first, then the evidence (FINDING + SOURCE: the command and its output, the file, the PR/SHA/CI link). No narration of your steps, no restating the task. Concise never means unsourced: cut the prose around a claim, never the source that proves it.
+Anything a human reads outside the run (a PR body, a comment, an email, a post) gets a /humanizer pass before you publish it: short, professional, human-sounding. Evidence lines stay verbatim. If the skill is missing, self-edit to the same bar and note SKILL-MISSING.
+End every findings append with one machine-greppable line: `STATUS: complete` when DONE-WHEN is met and verified, `STATUS: blocked` with the blocker named above it, or `STATUS: incomplete` with what remains. The orchestrator greps this line instead of parsing your prose.

package/bin/install.js CHANGED Viewed

@@ -55,22 +55,25 @@ try {
   // Stop hook
   if (!settings.hooks.Stop) settings.hooks.Stop = [];
   if (!settings.hooks.Stop.some(h => h.hooks?.some(hh => hh.command?.includes('company-stop-guard')))) {
-    settings.hooks.Stop.push({ hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-stop-guard.js')}"`, timeout: 5 }] });
+    settings.hooks.Stop.push({ hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-stop-guard.js')}"`, timeout: 10 }] });
   }
   // PreCompact hook
   if (!settings.hooks.PreCompact) settings.hooks.PreCompact = [];
   if (!settings.hooks.PreCompact.some(h => h.hooks?.some(hh => hh.command?.includes('company-precompact')))) {
-    settings.hooks.PreCompact.push({ hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-precompact.js')}"`, timeout: 5 }] });
+    settings.hooks.PreCompact.push({ hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-precompact.js')}"`, timeout: 10 }] });
   }
   // SessionStart hook (compact restore)
   if (!settings.hooks.SessionStart) settings.hooks.SessionStart = [];
   if (!settings.hooks.SessionStart.some(h => h.hooks?.some(hh => hh.command?.includes('company-session-restore')))) {
-    settings.hooks.SessionStart.push({ matcher: 'compact', hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-session-restore.js')}"`, timeout: 5 }] });
+    settings.hooks.SessionStart.push({ matcher: 'compact', hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-session-restore.js')}"`, timeout: 10 }] });
   }
-  fs.writeFileSync(settingsPath, JSON.stringify(settings, null, 2));
+  // Atomic write: a crash mid-write must not corrupt the user's settings.
+  const tmpPath = settingsPath + '.tmp';
+  fs.writeFileSync(tmpPath, JSON.stringify(settings, null, 2));
+  fs.renameSync(tmpPath, settingsPath);
   console.log('Hooks installed: Stop guard + PreCompact + SessionStart restore');
 } catch (e) {
   console.log('Could not register hooks. Add manually to settings.json.');

package/commands/resume.md CHANGED Viewed

@@ -9,9 +9,14 @@ allowed-tools:
   - Grep
   - Glob
   - Agent
+  - Task
   - WebSearch
   - WebFetch
   - Skill
 ---
-Resume the company from previous session. Read .company/STATUS.md, .company/GOAL.md, and the latest cycle briefing in .company/cycles/. Continue the THINK > EXECUTE > VERIFY loop from where it left off. Follow ~/.claude/skills/company/SKILL.md instructions.
+Resume the company from the previous session. Follow ~/.claude/skills/company/SKILL.md exactly. The allowed-tools list names the subagent-spawning tool twice because it is called Agent in current Claude Code and Task in older versions; use whichever your harness provides.
+Re-derive state from disk before acting, never from memory. Read .company/GOAL.md, .company/criteria.json, .company/playbook.md, and the latest cycle briefing and review in .company/cycles/. Treat .company/STATUS.md as a claim, not as truth: verify any merged/in-flight assertions against git log and gh pr list before relying on them.
+Then continue the THINK > EXECUTE > VERIFY loop from the first failing criterion.

package/commands/run.md CHANGED Viewed

@@ -10,12 +10,13 @@ allowed-tools:
   - Grep
   - Glob
   - Agent
+  - Task
   - WebSearch
   - WebFetch
   - Skill
 ---
-Run the /company skill with the provided goal. Read ~/.claude/skills/company/SKILL.md for the full orchestration instructions and follow them exactly.
+Run the /company skill with the provided goal. Read ~/.claude/skills/company/SKILL.md for the full orchestration instructions and follow them exactly. The allowed-tools list names the subagent-spawning tool twice because it is called Agent in current Claude Code and Task in older versions; use whichever your harness provides.
 The goal is: $ARGUMENTS

package/hooks/precompact.js CHANGED Viewed

@@ -6,9 +6,21 @@
 const fs = require('fs');
 const path = require('path');
-const companyDir = path.join(process.cwd(), '.company');
+const companyDir = process.env.COMPANY_DIR || path.join(process.cwd(), '.company');
 if (!fs.existsSync(companyDir)) process.exit(0);
+// Only sessions that own the run are acted on. A foreign session that merely
+// shares the directory must not be redirected or have state written on its
+// behalf. Missing or empty OWNER is legacy state and keeps the old behavior.
+try {
+  const hookInput = JSON.parse(fs.readFileSync(0, 'utf8'));
+  if (hookInput && typeof hookInput.session_id === 'string') {
+    const owners = fs.readFileSync(path.join(companyDir, 'OWNER'), 'utf8')
+      .split('\n').map(function (l) { return l.trim(); }).filter(Boolean);
+    if (owners.length > 0 && owners.indexOf(hookInput.session_id) === -1) process.exit(0);
+  }
+} catch (e) {}
 const lines = ['# Company Checkpoint (auto-saved before compaction)', ''];
 // Goal
@@ -67,11 +79,13 @@ if (fs.existsSync(rosterPath)) {
   lines.push('');
 }
-// Playbook (accumulated lessons)
+// Playbook (accumulated lessons). New session entries are appended at the
+// bottom, so snapshot the TAIL, not the head.
 const playbookPath = path.join(companyDir, 'playbook.md');
 if (fs.existsSync(playbookPath)) {
-  lines.push('## Playbook (lessons)');
-  lines.push(fs.readFileSync(playbookPath, 'utf8').substring(0, 500));
+  const playbook = fs.readFileSync(playbookPath, 'utf8');
+  lines.push('## Playbook (latest lessons)');
+  lines.push(playbook.substring(Math.max(0, playbook.length - 500)));
   lines.push('');
 }