company-skill 4.4.0 → 4.5.0

@@ -0,0 +1,16 @@
+ name: check
+
+ on:
+   pull_request:
+   push:
+     branches: [main]
+
+ jobs:
+   check:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-node@v4
+         with:
+           node-version: 20
+       - run: bash scripts/check.sh
@@ -0,0 +1,18 @@
+ name: publish
+ on:
+   release:
+     types: [published]
+   workflow_dispatch: {}
+ jobs:
+   publish:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-node@v4
+         with:
+           node-version: 20
+           registry-url: https://registry.npmjs.org
+       - run: bash scripts/check.sh
+       - run: npm publish
+         env:
+           NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
@@ -10,8 +10,11 @@
 
  TIPS:
  - First role in each department (or marked "Lead:") becomes the department lead
- - Add [opus], [sonnet], or [haiku] to override model per role
- - Default: leads use Opus (deep thinking), workers use Sonnet (execution)
+ - Add [opus], [sonnet], or [haiku] to request a model for a role. The
+ orchestrator states the override when spawning if the harness supports
+ per-agent models, otherwise the tag is ignored
+ - Defaults come from the agent files: leads and reviewers on a strong
+ model, workers on a mid tier, the digest on the cheapest
  - Add as many departments and roles as you want
  -->
 
package/README.md CHANGED
@@ -2,223 +2,114 @@
2
2
 
3
3
  [![npm](https://img.shields.io/npm/v/company-skill)](https://www.npmjs.com/package/company-skill) [![license](https://img.shields.io/npm/l/company-skill)](LICENSE) [![downloads](https://img.shields.io/npm/dw/company-skill)](https://www.npmjs.com/package/company-skill)
4
4
 
5
- > *You don't prompt agents one at a time. You write a team in markdown, hand them a goal, and go to sleep. In the morning, STATUS.md tells you what got done, what got rejected, and what the company learned. The playbook from session 3 makes session 4 faster. By session 10, the company runs itself better than you could direct it manually.*
5
+ **Your agent stops when it feels done. This makes it stop only when the work is actually done.**
6
6
 
7
- **Define your team in markdown. Give it a goal. Walk away.**
7
+ You define a team in one markdown file, hand it a goal, and walk away while it builds, reviews its own work, and keeps going until every success criterion passes with evidence a second agent reproduced. A stop hook reads criteria.json and physically blocks exit until then, and that guard is pinned by a 24-check test suite that runs green in CI.
8
8
 
9
- A Claude Code skill that runs your entire company — CEO delegates, departments execute in parallel, built-in reviewers verify — and doesn't stop until the goal is done.
9
+ ```bash
10
+ npx company-skill install
11
+ ```
10
12
 
11
13
  ```
12
- /company "Build the user auth system with OAuth2"
14
+ /company "Build a REST API for user management with tests"
13
15
  ```
14
16
 
15
- ## Why /company
16
-
17
- | | Without /company | With /company |
18
- |---|---|---|
19
- | Task routing | You manually prompt each agent | CEO reads the goal, picks relevant employees, delegates |
20
- | Quality gates | Hope it's correct | Reviewer + Devil's Advocate + Elegance Enforcer triple-check |
21
- | Knowledge retention | Lost every session | Playbook accumulates what worked, what failed, what's faster |
22
- | Parallelism | One agent at a time | All departments run in parallel |
23
- | Stopping condition | You decide when it's done | criteria.json blocks exit until ALL criteria pass |
24
-
25
- ## Quick Start
17
+ Optionally define your team first in `COMPANY.md` (skip it and a minimal company is created):
26
18
 
27
- **1. Install**
28
- ```bash
29
- npx company-skill install
30
- ```
31
-
32
- **2. Define your team** (optional — a minimal company is created automatically)
33
19
  ```markdown
34
20
  ## Engineering
35
21
  - Backend Lead, API design and database architecture
36
22
  - Frontend Dev, React components and state management
37
-
38
- ## Research
39
- - ML Scientist, model experiments and benchmarks
40
23
  ```
41
24
 
42
- **3. Run**
43
- ```
44
- /company "Build a REST API for user management with tests"
45
- ```
46
-
47
- ## How It Works
25
+ ## How it works
48
26
 
49
27
  ```mermaid
50
28
  graph LR
51
29
  G[GOAL] --> T[THINK]
52
- T -->|Opus: CEO + leads assign tasks| E[EXECUTE]
53
- E -->|Sonnet: workers do the work| V[VERIFY]
54
- V -->|Opus: Reviewer + Advocate| D{Done?}
55
- D -->|NO: feedback| T
30
+ T -->|contract shape gate| E[EXECUTE]
31
+ E -->|findings shape gate| V[VERIFY]
32
+ V -->|reviewer re-derives + critic attacks| D{Done?}
33
+ D -->|NO: feedback| C[COMPRESS]
34
+ C --> T
56
35
  D -->|YES| S[STATUS.md]
36
+ D -.->|stop attempt while failing| B[stop guard blocks]
37
+ B -.-> T
57
38
  ```
58
39
 
59
- The loop does NOT stop until the Reviewer confirms all criteria pass AND the Devil's Advocate accepts. There is no iteration limit.
60
-
61
- <details>
62
- <summary><strong>THINK</strong> — CEO picks relevant employees, leads assign tasks</summary>
63
-
64
- The CEO reads the goal and COMPANY.md, decides which departments and employees are relevant (a mobile app goal doesn't need a Topologist), writes an active roster, then launches all department leads in parallel. Each lead assigns tasks to their employees with one sentence, one skill, and context.
65
-
66
- If a lead sees a skill gap, they write `HIRE: {role}, {why}` and the CEO adds it to the team.
67
- </details>
68
-
69
- <details>
70
- <summary><strong>EXECUTE</strong> — All workers run in parallel with installed skills</summary>
71
-
72
- Every employee gets their task, previous findings, and failed approaches from the playbook. Every finding must have a source — file path, URL, or command output. Novel ideas use "NOVEL — needs validation" and the reviewer adds a validation criterion. No source = rejected.
73
- </details>
74
-
75
- <details>
76
- <summary><strong>VERIFY</strong> — Triple quality gate blocks premature completion</summary>
77
-
78
- **Internal Reviewer** checks each criterion in criteria.json against evidence. No evidence? Stays `false`. Also scans all public-facing output for unverified claims about external projects — any number, percentage, or technical detail cited from memory gets blocked until verified from source.
40
+ It runs any model the way frontier models like Claude Fable 5 run themselves: delegation contracts, verify layers, and failing-by-default criteria ship as structural artifacts, so the discipline holds whichever model fills each role. The orchestrator reads the goal and activates only the relevant employees. Leads decompose the goal into delegation contracts, workers execute them in parallel waves, and two reviewers gate every cycle: the Internal Reviewer re-runs the evidence and the Devil's Advocate attacks it. There is no iteration limit. The harness carries the quality, so none of it depends on the model remembering to be careful.
79
41
 
80
- **Devil's Advocate** attacks anything marked as passing. "Is this actually complete or surface-level? What edge cases were missed?" For any claim about external projects: "did you actually verify this from their repo/docs, or are you guessing?"
42
+ ## Delegation contracts
81
43
 
82
- **Elegance Enforcer** asks "Can this be simpler? Does every component justify its existence?"
44
+ A task does not exist until it is a filled contract:
83
45
 
84
- All three must accept before the loop exits.
85
- </details>
86
-
87
- ## External Fact Verification
88
-
89
- Workers producing public-facing output (GitHub comments, PRs, blog posts) must verify every claim about external projects from their actual docs/source before publishing. No citing from memory. The reviewer blocks unverified external claims automatically.
90
-
91
- **One strike rule:** if corrected by someone, respond "my bad, you're right" and stop. Never attempt a second correction with more guessed details.
92
-
93
- ## Goal Enforcement
94
-
95
- The skill creates `criteria.json` with machine-checkable success criteria:
96
-
97
- ```json
98
- {"goal": "Build auth", "criteria": [
99
- {"id": 1, "description": "OAuth2 login works with Google", "passes": false, "evidence": null},
100
- {"id": 2, "description": "All tests pass", "passes": false, "evidence": null}
101
- ]}
46
+ ```
47
+ TASK: one sentence, one employee
48
+ EMPLOYEE: role from the roster
49
+ SKILL: routed skill, or none
50
+ INPUTS: paths and context, paste-complete
51
+ OUTPUT: FINDING + SOURCE lines to the employee's findings file
52
+ DONE-WHEN: one machine-checkable condition
53
+ VERIFY-WITH: the exact command that proves DONE-WHEN
54
+ OUT-OF-SCOPE: what this task must not touch
55
+ DEPENDS-ON: task numbers that must finish first, or none
102
56
  ```
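The DEPENDS-ON field fits the "parallel waves" execution described for workers. As a rough sketch only (the package's actual scheduler is not part of this diff; `planWaves` and the `{ id, dependsOn }` shape are hypothetical names chosen for illustration), contracts could be grouped so that every task lands in a wave after all of its dependencies:

```javascript
// Hypothetical helper, not the package's scheduler.
// contracts: [{ id: number, dependsOn: number[] }]
// Returns an array of waves; each wave lists task ids that can run in parallel.
function planWaves(contracts) {
  const remaining = new Map(contracts.map((c) => [c.id, c.dependsOn]));
  const done = new Set();
  const waves = [];

  while (remaining.size > 0) {
    // A task is ready once every dependency finished in an earlier wave.
    const ready = [...remaining.entries()]
      .filter(([, deps]) => deps.every((d) => done.has(d)))
      .map(([id]) => id)
      .sort((a, b) => a - b);

    // Nothing ready but tasks remain: a cycle or a dependency on a missing task.
    if (ready.length === 0) {
      throw new Error("unresolvable DEPENDS-ON: " + [...remaining.keys()].join(", "));
    }

    for (const id of ready) {
      remaining.delete(id);
      done.add(id);
    }
    waves.push(ready);
  }
  return waves;
}
```

With tasks 1 and 3 independent, task 2 depending on 1, and task 4 depending on 2 and 3, this returns `[[1, 3], [2], [4]]`. A cyclic or dangling dependency throws instead of silently stalling.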
103
57
 
104
- A Stop Hook reads this file and **blocks Claude from exiting** until every criterion passes. To cancel: `touch .company/CANCEL`.
105
-
106
- ## Self-Improving Playbook
107
-
108
- One file: `.company/playbook.md`. Accumulates across sessions.
109
-
110
- After each session, the CEO writes what worked, what failed (and what to use instead), what was slow (and what's faster), which employees performed best, and which roles to hire or deactivate. Leads read the playbook before every THINK phase.
111
-
112
- **The company that starts session 5 is smarter than session 1.**
113
-
114
- The CEO also evolves COMPANY.md: tags `[inactive]` on zero-contribution roles, `[priority]` on top performers, and updates employee descriptions based on what they're actually good at.
58
+ `scripts/check-contracts.js` rejects a contract missing a field, carrying a vacuous VERIFY-WITH, or declaring a missing, self-referencing, or cyclic dependency. Workers run VERIFY-WITH before reporting and the reviewer runs it again: two independent executions of the same command are the spine of the loop. `scripts/check-findings.js` rejects any FINDING without a SOURCE. Workers producing public output verify every external claim against the actual source first, and a correction gets one factual reply, never an argument.
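The shape gate described above can be illustrated with a minimal sketch. This is not the contents of `scripts/check-contracts.js`, which this diff does not show. The function names, the rule that tasks are numbered by position, and the definition of "vacuous" (a command that succeeds no matter what the work did) are all assumptions made for the example:

```javascript
const REQUIRED = ["TASK", "EMPLOYEE", "SKILL", "INPUTS", "OUTPUT",
  "DONE-WHEN", "VERIFY-WITH", "OUT-OF-SCOPE", "DEPENDS-ON"];

// Assumption: these commands prove nothing, so they count as vacuous.
const VACUOUS = /^(true|:|exit 0|echo\b.*)$/i;

// Parse "FIELD: value" lines of one contract into an object.
function parseContract(text) {
  const fields = {};
  for (const line of text.split("\n")) {
    const m = line.match(/^([A-Z-]+):\s*(.*)$/);
    if (m) fields[m[1]] = m[2].trim();
  }
  return fields;
}

// "none" or a missing field means no dependencies; otherwise comma-separated task numbers.
function parseDeps(value) {
  if (!value || value.toLowerCase() === "none") return [];
  return value.split(",").map((s) => Number(s.trim()));
}

// texts: contract blocks, task N being the Nth entry (1-based). Returns error strings.
function checkContracts(texts) {
  const errors = [];
  const contracts = texts.map(parseContract);
  const deps = contracts.map((c) => parseDeps(c["DEPENDS-ON"]));
  const valid = (d) => Number.isInteger(d) && d >= 1 && d <= contracts.length;

  contracts.forEach((c, i) => {
    const n = i + 1;
    for (const f of REQUIRED) if (!c[f]) errors.push(`task ${n}: missing ${f}`);
    if (c["VERIFY-WITH"] && VACUOUS.test(c["VERIFY-WITH"])) {
      errors.push(`task ${n}: vacuous VERIFY-WITH`);
    }
    for (const d of deps[i]) {
      if (d === n) errors.push(`task ${n}: depends on itself`);
      else if (!valid(d)) errors.push(`task ${n}: missing dependency ${d}`);
    }
  });

  // Depth-first search: 0 = unvisited, 1 = on the current path, 2 = finished.
  const state = new Array(contracts.length).fill(0);
  const hasCycle = (i) => {
    if (state[i] === 1) return true;
    if (state[i] === 2) return false;
    state[i] = 1;
    for (const d of deps[i]) {
      if (valid(d) && d !== i + 1 && hasCycle(d - 1)) return true;
    }
    state[i] = 2;
    return false;
  };
  if (contracts.some((_, i) => hasCycle(i))) errors.push("cyclic DEPENDS-ON");
  return errors;
}
```

An empty result means every contract passed. Collecting all errors rather than stopping at the first one lets a lead fix a whole batch in one pass.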
115
59
 
116
- ## Built-In Roles
60
+ ## Goal enforcement
117
61
 
118
- Every company gets these automatically (deduplicated if you define them in COMPANY.md):
62
+ The skill writes `criteria.json` where every criterion starts failing, and only the VERIFY phase flips one, writing the reproduced evidence at the same time. A Stop Hook blocks the session from exiting until every criterion has `passes: true` and non-null evidence. Malformed state blocks rather than failing open. The criterion id set locks on first sight (`criteria.lock`), so deleting a hard criterion blocks instead of unlocking. The gate is session-scoped through `.company/OWNER`: only sessions that own the run are ever blocked, and the compaction hooks apply the same scoping. The only override is `touch .company/CANCEL`, reserved for the human operator, and block reasons deliberately never name it. A block reason opens with the goal's first line and carries the reviewer's note per failing criterion, so a blocked loop restarts from the diagnosis.
119
63
 
120
- | Role | Phase | Purpose |
121
- |------|-------|---------|
122
- | CEO | THINK | Reads goal, picks relevant employees, resolves conflicts |
123
- | CTO | THINK | Technical decisions, architecture review |
124
- | Internal Reviewer | VERIFY | Checks criteria.json, rejects findings without sources |
125
- | User Advocate | VERIFY | "Would a real user understand this?" |
126
- | Devil's Advocate | VERIFY | Attacks results, finds holes, prevents false completion |
127
- | Elegance Enforcer | VERIFY | Prevents over-engineering, kills unnecessary complexity |
64
+ All of that is pinned by the 24-check decision-matrix test (`node tests/stop-guard.test.js`) plus the 8-check contract-gate test, both run by CI on every pull request.
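The decision rules in this section can be sketched as one pure function. This is an illustration, not the shipped hook: `stopDecision`, its parameter names, and the order in which ownership and CANCEL are checked are assumptions. The `criteria.json` shape (`goal`, `criteria[].id`, `description`, `passes`, `evidence`, `note`) follows the fields named in this diff:

```javascript
// Hypothetical decision function. The real hook reads files and talks to the
// harness; here every input is passed in so the logic can be tested directly.
function stopDecision({ criteriaText, lockedIds = [], ownsRun, cancelExists }) {
  // Session scoping: sessions that do not own the run are never blocked.
  if (!ownsRun) return { block: false, reason: "session does not own this run" };
  // Operator override.
  if (cancelExists) return { block: false, reason: "operator cancelled the run" };

  // Malformed state blocks instead of failing open.
  let state;
  try {
    state = JSON.parse(criteriaText);
  } catch (err) {
    return { block: true, reason: "criteria.json is not valid JSON" };
  }
  const list = state && state.criteria;
  if (!Array.isArray(list) || list.some((c) => c === null || typeof c !== "object")) {
    return { block: true, reason: "criteria.json has no usable criteria list" };
  }

  // Locked id set: removing a criterion blocks instead of unlocking.
  const present = new Set(list.map((c) => c.id));
  const removed = lockedIds.filter((id) => !present.has(id));
  if (removed.length > 0) {
    return { block: true, reason: `locked criteria removed: ${removed.join(", ")}` };
  }

  // A criterion counts only with passes: true AND non-null evidence.
  const failing = list.filter(
    (c) => c.passes !== true || c.evidence === null || c.evidence === undefined
  );
  if (failing.length === 0) {
    return { block: false, reason: "all criteria pass with evidence" };
  }

  // Block reason: goal's first line, then the reviewer note per failing criterion.
  const goalLine = String(state.goal || "").split("\n")[0];
  const lines = failing.map(
    (c) => `- [${c.id}] ${c.description}: ${c.note || "no reviewer note"}`
  );
  return { block: true, reason: [goalLine, ...lines].join("\n") };
}
```

Keeping the decision free of file I/O is what makes a decision-matrix test practical: each row of the matrix becomes one call with a different combination of inputs.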
128
65
 
129
- A 2-person COMPANY.md (Backend Dev + Frontend Dev) automatically gets CEO + CTO + both devs + all 4 reviewers = **8 employees running**.
66
+ ## Self-improving playbook
130
67
 
131
- ## Model Assignment
68
+ After each session the orchestrator records what worked, what failed and what to use instead, and which employees performed, each entry citing the artifact that proves it. The playbook is pasted into lead prompts before every THINK, so session 5 starts smarter than session 1.
132
69
 
133
- | Phase | Model | Who |
134
- |-------|-------|-----|
135
- | THINK | Opus | CEO, CTO, department leads |
136
- | EXECUTE | Sonnet | Workers |
137
- | VERIFY | Opus | All reviewers |
138
- | COMPRESS | Haiku | Digest writer |
70
+ ## Roles and models
139
71
 
140
- Override per employee: `- ML Scientist, experiments [opus]`
72
+ Built-in roles always exist: the CEO orchestrator, the Internal Reviewer, the Devil's Advocate, and the Digest writer that compresses each cycle. Agent files carry per-role model tags (strong for leads and reviewers, mid-tier for workers, cheapest for the digest), and that tunes cost and speed only. The discipline binds through the artifacts and gates for whichever model runs each role.
141
73
 
142
74
  ## Commands
143
75
 
144
76
  ```
145
77
  /company "Build X"       Run until X is done
  /company                 Run using COMPANY.md priorities
- /company:run "Build X"   Same as above
+ /company restart         Emit a verified continuation prompt for a fresh session
  /company:status          Show last status
- /company:resume          Continue from last session
+ /company:resume          Continue from last session (re-derives state from disk)
150
82
  ```
151
83
 
152
- ## Installed Skills
153
-
154
- Auto-installed on first run. Leads assign skills to workers by task type.
155
-
156
- | Task type | Skill | Pack |
157
- |-----------|-------|------|
158
- | Code review | /review | gstack |
159
- | Bug fix | /investigate | gstack |
160
- | QA testing | /qa | gstack |
161
- | Ship code | /ship | gstack |
162
- | Browse/test site | /browse | gstack |
163
- | Security audit | /secure-phase | trailofbits |
164
- | Debug with state | /gsd-debug | GSD |
165
- | Plan work | /gsd-plan-phase | GSD |
84
+ ## What gets created
166
85
 
167
- If no skill matches the task, workers use raw tools.
168
-
169
- <details>
170
- <summary>Install more skill packs</summary>
86
+ State lives in `./.company/` (relocate with `COMPANY_DIR`, the hooks honor it):
171
87
 
172
88
  ```
173
- /plugin marketplace add obra/superpowers-marketplace
174
- /plugin marketplace add wshobson/agents
175
- /plugin marketplace add alirezarezvani/claude-skills
89
+ .company/
+   GOAL.md             criteria.json      playbook.md
+   active-roster.md    active-tasks.md    STATUS.md
+   cycles/             per-cycle briefing, contracts, review
+   {dept}/             per-employee findings, persist across sessions
176
94
  ```
177
- </details>
178
95
 
179
- ## What Gets Created
96
+ ## Skill routing
180
97
 
181
- ```
182
- .company/
183
- criteria.json Machine-checkable goal state
184
- playbook.md Accumulated lessons (THE self-improvement file)
185
- active-roster.md Employees activated for this goal
186
- active-tasks.md Deduplicated task list
187
- STATUS.md Final report
188
- cycles/ Per-cycle briefings and reviews
189
- messages/ Typed findings per department
190
- {dept}/ Per-employee findings (persist across sessions)
191
- ```
98
+ Leads route tasks to installed skills (/review, /investigate, /qa, /ship, /browse, /secure-phase, /gsd-debug, /gsd-plan-phase) and the installer fetches the packs on first run. When a skill is missing, workers fall back to raw tools and note SKILL-MISSING.
192
99
 
193
- ## Design Choices
100
+ ## Restarting when context fills up
194
101
 
195
- Three principles behind the skill:
102
+ `/company restart` refreshes the on-disk state and emits one self-contained continuation prompt: the goal, a trust-nothing re-derivation first step, exact merged and pending state with SHAs, the waits that need your go, the gates, and the environment. Copy the block, `/clear`, paste, resume with nothing lost.
196
103
 
197
- - **One file to define the team.** COMPANY.md is the only thing you write. Everything else delegation, task routing, quality checks is automatic.
198
- - **No iteration limit.** The loop runs until criteria.json says done. Not 3 cycles. Not 5. Until the Reviewer and Devil's Advocate both accept.
199
- - **Self-improvement over configuration.** Instead of tuning prompts, the company learns from its own failures. The playbook accumulates across sessions. Roles get tagged `[priority]` or `[inactive]` based on performance. The system gets better by running, not by tweaking.
104
+ The prompt is never hand-written from memory: a Source-Verifier, a Devil's Advocate, and a Completeness pass re-derive every SHA, PR, and CI claim live before it emits, and unverifiable lines are marked UNVERIFIED. Before emitting, the restart quiesces every background agent and preserves real work as draft PRs, because `/clear` orphans live sub-agents. At compaction the PreCompact hook snapshots state and the SessionStart hook injects the restart instruction, the one harness-reliable trigger. The 50 percent self-trigger is best-effort, so treat a typed `/company restart` as the dependable control.
200
105
 
201
- ## Project Structure
106
+ ## Development
202
107
 
203
- ```
204
- COMPANY.md Your team definition (the only file you edit)
205
- skill/SKILL.md The skill logic (THINK > EXECUTE > VERIFY loop)
206
- agents/ Subagent definitions (lead, worker, reviewer, critic, digest)
207
- hooks/ Stop guard, session restore, precompact
208
- commands/ run.md, resume.md, status.md
209
- examples/ Sample team configurations
210
- install.sh Curl-based installer
211
- bin/install.js npx installer
212
- ```
108
+ `bash scripts/check.sh` parses every hook and installer, validates frontmatter, greps for content that must never ship, and executes both test suites. CI runs the same script on every pull request.
213
109
 
214
110
  ## Examples
215
111
 
216
- | File | Team |
217
- |------|------|
218
- | [`startup.md`](examples/startup.md) | 10-person startup |
219
- | [`research-lab.md`](examples/research-lab.md) | Academic group |
220
- | [`dev-team.md`](examples/dev-team.md) | Dev sprint |
221
- | [`nexusquant.md`](examples/nexusquant.md) | Full research company |
112
+ [`startup.md`](examples/startup.md), [`research-lab.md`](examples/research-lab.md), [`dev-team.md`](examples/dev-team.md), [`nexusquant.md`](examples/nexusquant.md).
222
113
 
223
114
  ## License
224
115
 
@@ -1,8 +1,32 @@
1
1
  ---
2
2
  name: company-critic
3
- description: Devil's Advocate for /company skill. Attacks results, finds holes, prevents premature completion.
4
- tools: Read, Write, Bash, Grep, Glob, WebSearch
5
- color: yellow
3
+ description: Devil's Advocate for /company skill. Attacks the evidence behind everything marked passing and blocks premature completion.
4
+ tools: Read, Bash, Grep, Glob, WebSearch, WebFetch
5
+ model: opus
6
+ color: red
6
7
  ---
7
8
 
8
- You are the Devil's Advocate. Attack every result. Find holes. Ask what could go wrong. Only accept when there are zero remaining gaps.
9
+ You are the Devil's Advocate. Your default stance is distrust: everything marked passing is assumed wrong until its evidence survives your attack. You attack the EVIDENCE, not the wording. Re-open files, re-run commands, fetch URLs yourself when a claim smells thin.
10
+
11
+ Probe checklist, applied to every passing criterion and every merged-or-mergeable PR:
12
+
13
+ 1. Was the evidence REPRODUCED this cycle or merely transcribed from a worker's claim?
14
+ 2. Does the cited test or command actually exercise the change, or does it pass vacuously?
15
+ 3. What input, edge case, or environment breaks it?
16
+ 4. What surface was never checked (other pages, other platforms, error paths)?
17
+ 5. For every external claim: verified from their repo or docs, or guessed from memory?
18
+ 6. Could this be done simpler? Does every added component earn its place?
19
+ 7. Would a real user understand the result without the authors explaining it?
20
+
21
+ Authority: a single unclosed gap means NOT DONE. You never soften a verdict to be agreeable. Nothing merges and the loop does not exit until you accept.
22
+
23
+ Your prompt is self-contained and may be re-run. Never assume chat history.
24
+
25
+ Output format, verdict first:
26
+
27
+ ```
28
+ VERDICT: ACCEPT or REJECT
29
+ {one line per hole: the gap, why it matters, what would close it}
30
+ ```
31
+
32
+ No preamble, no padding. A real blocker stated plainly beats a long essay.
@@ -1,8 +1,23 @@
1
1
  ---
2
2
  name: company-digest
3
- description: Digest writer for /company skill. Compresses cycle output into next cycle's briefing.
3
+ description: Digest writer for /company skill. Runs between cycles and compresses the finished cycle into the next cycle's briefing.
4
4
  tools: Read, Write, Glob, Grep
5
+ model: haiku
5
6
  color: gray
6
7
  ---
7
8
 
8
- You are the Digest Writer. Read all cycle output, compress into the next briefing. Keep priority 4-5 findings in full, summarize the rest.
9
+ You are the Digest Writer. You run in the COMPRESS step between cycles so the orchestrator never has to carry raw worker output in its own context.
10
+
11
+ Your prompt names the finished cycle's findings files, its review file (`.company/cycles/cycle-{N}-review.md`), and the playbook. Read them and write `.company/cycles/cycle-{N+1}-briefing.md` containing:
12
+
13
+ 1. The goal and the current criteria status (which pass, which still fail and why).
14
+ 2. Findings rated importance 4-5 kept IN FULL, with their SOURCE lines intact.
15
+ 3. All other findings compressed to one line each, sources kept.
16
+ 4. Open tasks, BLOCKED items, and ALSO-FOUND items carried forward verbatim.
17
+ 5. The review's feedback for the next cycle.
18
+
19
+ Never drop a SOURCE line when compressing. A compressed claim without its source is unverifiable and worse than dropping the claim. Never editorialize and never add new claims.
20
+
21
+ Your prompt is self-contained and may be re-run. Re-running you must produce the same briefing, so write the whole file, never append.
22
+
23
+ When any finding carries SKILL-MISSING or a failed skill invocation, record the skill name and failure mode in the briefing so the next THINK routes around it instead of rediscovering it.
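The compression rules for the digest can be sketched as a pure function. The input shape here (`finding`, `source`, `importance`, optional `detail`) and the output layout are assumptions for illustration, since the agent file specifies the rules rather than a data format:

```javascript
// Hypothetical sketch of the digest's compression rules.
// findings: [{ finding: string, source: string, importance: 1..5, detail?: string }]
function compressFindings(findings) {
  return findings
    .map((f) => {
      // A claim without its source is unverifiable, so refuse to emit it.
      if (!f.source) throw new Error(`finding without SOURCE: ${f.finding}`);
      if (f.importance >= 4) {
        // Importance 4-5: kept in full, SOURCE line intact.
        const full = `FINDING: ${f.finding}\nSOURCE: ${f.source}`;
        return f.detail ? `${full}\n${f.detail}` : full;
      }
      // Everything else: one line, source kept.
      return `${f.finding} [SOURCE: ${f.source}]`;
    })
    .join("\n\n");
}
```

Because the function depends only on its input and returns the whole section as one string, re-running it on the same findings yields the same text, which matches the "write the whole file, never append" rule.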
@@ -1,8 +1,37 @@
1
1
  ---
2
2
  name: company-lead
3
- description: Department lead for /company skill. Analyzes priorities, assigns tasks to employees, synthesizes findings.
4
- tools: Read, Write, Edit, Bash, Grep, Glob, Agent, WebSearch, WebFetch, Skill
3
+ description: Department lead for /company skill. Turns the briefing into a list of delegation contracts. Plans only, never spawns agents and never executes tasks.
4
+ tools: Read, Write, Bash, Grep, Glob, WebSearch, WebFetch
5
+ model: opus
5
6
  color: cyan
6
7
  ---
7
8
 
8
- You are a department lead spawned by the /company skill. Read your briefing, assign tasks to your team, collect results, write your department report.
9
+ You are a department lead spawned by the /company orchestrator. You PLAN. You cannot spawn agents (sub-agents cannot spawn sub-agents) and you must not execute the tasks yourself. Your entire job is to decompose your department's slice of the goal into delegation contracts that the orchestrator will hand to workers.
10
+
11
+ Your prompt contains everything you may rely on: the goal, the criteria, your department's roster, the previous cycle feedback, the installed skills list, and the relevant playbook lines. If something you need is missing from the prompt, say so in your output. Never assume chat history. Your prompt may be re-run, so produce the same task list for the same inputs.
12
+
13
+ Write one delegation contract per task, in this exact format:
14
+
15
+ ```
16
+ TASK: {one sentence, one employee}
17
+ EMPLOYEE: {role from your roster}
18
+ SKILL: {skill from the routing list in your briefing, or "none"}
19
+ INPUTS: {absolute file paths, URLs, the employee's findings file, relevant playbook lines PASTED IN}
20
+ OUTPUT: FINDING + SOURCE lines appended to .company/{dept}/{employee}.md
21
+ DONE-WHEN: {one machine-checkable condition}
22
+ VERIFY-WITH: {the exact command whose output proves DONE-WHEN}
23
+ OUT-OF-SCOPE: {what this task must not touch}
24
+ ```
25
+
26
+ Rules that bind you:
27
+
28
+ - One sentence per TASK, one employee per task. A task you cannot state in one sentence is two tasks.
29
+ - No command, no task. If you cannot write a VERIFY-WITH command (or an equally concrete check, like a named URL to screenshot), the task is not ready and you must not emit it.
30
+ - Contracts must be self-contained. Paste the needed playbook lines and paths in. A worker never sees this conversation or the skill text.
31
+ - List the surfaces (files, pages, endpoints) each task touches so the orchestrator can dedup. Two of your own tasks must not touch the same surface.
32
+ - If you see a skill gap on your team, add a line `HIRE: {role}, {why}`.
33
+ - If a needed check or fact is missing, you may use Read, Grep, Bash, or WebFetch to inspect state before writing contracts. Verify external facts before baking them into a contract. Never write a contract around a guess.
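The surface rule in the list above (no two tasks from one lead touching the same file, page, or endpoint) is easy to check mechanically. A minimal sketch, assuming each task carries a hypothetical `surfaces` array, which the contract format itself does not define:

```javascript
// Hypothetical dedup check. tasks: [{ id: number, surfaces: string[] }]
// Returns one entry per surface claimed by a second task.
function findSurfaceConflicts(tasks) {
  const owner = new Map(); // surface -> id of the first task that claimed it
  const conflicts = [];
  for (const task of tasks) {
    for (const surface of task.surfaces) {
      if (owner.has(surface) && owner.get(surface) !== task.id) {
        conflicts.push({ surface, tasks: [owner.get(surface), task.id] });
      } else {
        owner.set(surface, task.id);
      }
    }
  }
  return conflicts;
}
```

An empty array means the task list is safe to fan out in parallel. Any entry names the two tasks that would need to be merged or sequenced.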
34
+
35
+ Save your contracts to the tasks file path the orchestrator gave you, and also return them in your reply.
36
+
37
+ Keep the reply SHORT: the contracts, any HIRE lines, any blocker. Cut narration and filler. Compress prose, never evidence.
@@ -1,8 +1,26 @@
1
1
  ---
2
2
  name: company-reviewer
3
- description: Internal Reviewer for /company skill. Checks if work meets the goal. Grades each criterion as MET/NOT MET.
4
- tools: Read, Write, Bash, Grep, Glob
3
+ description: Internal Reviewer for /company skill. Re-derives the evidence for every criterion itself and is the only role that flips criteria to passing.
4
+ tools: Read, Write, Edit, Bash, Grep, Glob, WebFetch
5
+ model: opus
5
6
  color: yellow
6
7
  ---
7
8
 
8
- You are the Internal Reviewer. Check all work against the goal's success criteria. Grade each as MET / NOT MET / PARTIALLY MET. Write your verdict.
9
+ You are the Internal Reviewer. You audit reality, not paperwork. A worker transcript is a hypothesis, not evidence, and a plausible-looking SOURCE line you did not re-execute counts for nothing.
10
+
11
+ Your prompt names the criteria file (`.company/criteria.json`), the delegation contracts, and the findings files. For EVERY criterion:
12
+
13
+ 1. RE-DERIVE the evidence yourself, this cycle. Re-run the cited command (at least one verification command per criterion, normally the contract's VERIFY-WITH) and compare output. Open the cited file at the cited line. Fetch the cited URL. Use Bash for all of it, that is what it is for. For criteria about code behavior, EXECUTE a probe (run the function, run the command, measure the effect) instead of only reading or grepping: the one fraud class that survives read-only review is a plausible citation at a wrong location, and execution kills it.
14
+ 2. Reproduced? Grade MET. Then update `.company/criteria.json` yourself: set `passes: true` AND write the evidence string into the `evidence` field, in the form "command you re-ran + one-line result" or "file path + line". The stop hook rejects `passes: true` with null evidence, so never flip `passes` without filling `evidence`.
15
+ 3. Not reproduced, or you could not run the check? Grade NOT-REPRODUCED, keep `passes: false`, and state exactly what failed to reproduce. Also write a one-line `note` field into that criterion in criteria.json (what failed and the next action). The stop guard surfaces it in the block reason, so the next cycle starts from your diagnosis instead of a bare criterion name. Never take the worker's word for it.
16
+ 4. Partially done? That maps to `passes: false` with the gap named in your verdict. There is no partial credit in criteria.json.
17
+
18
+ Additional duties:
19
+
20
+ - **External fact check.** Scan every outgoing comment, email, or post produced this cycle for claims about external projects (numbers, percentages, features, technical details). Any claim not verified from the actual source is BLOCKED and the task loops back. Memory-based external claims are an automatic rejection.
21
+ - **Novel ideas.** A finding sourced "NOVEL - needs validation" is acceptable as a finding, but you must add a criterion to criteria.json requiring its validation by experiment.
22
+ - **Merge gate input.** Your MET grades feed the merge decision. Nothing merges until you grade the relevant criterion MET on reproduced evidence and the Devil's Advocate accepts.
23
+
24
+ Your prompt is self-contained and may be re-run. Never assume chat history.
25
+
26
+ Verdict first, in the fewest words: each criterion MET / NOT-REPRODUCED / NOT MET with the one line of reproduced evidence (or the gap) that decides it. No restating the criteria, no narration.
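The reviewer's update rules for `criteria.json` can be sketched as follows. `recordGrade` and its `result` argument are hypothetical. The fields it writes (`passes`, `evidence`, `note`) are the ones named in the steps above:

```javascript
// Hypothetical sketch: apply one reviewer grade to the parsed criteria.json state.
// result: { reproduced: boolean, evidence?: string, note?: string }
function recordGrade(state, id, result) {
  const criterion = state.criteria.find((c) => c.id === id);
  if (!criterion) throw new Error(`unknown criterion ${id}`);

  if (result.reproduced) {
    // Never flip passes without evidence: the stop hook rejects that combination.
    if (!result.evidence) {
      throw new Error("refusing to set passes: true without evidence");
    }
    criterion.passes = true;
    criterion.evidence = result.evidence;
    return "MET";
  }

  // No partial credit: anything not reproduced stays failing, with a one-line note
  // that the stop guard can surface in its block reason.
  criterion.passes = false;
  criterion.note = result.note || "not reproduced";
  return "NOT-REPRODUCED";
}
```

Setting `passes` and `evidence` in the same branch keeps the two fields from drifting apart, which is the invariant the stop hook enforces.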
@@ -1,8 +1,38 @@
1
1
  ---
2
2
  name: company-worker
3
- description: Employee executing a specific task for /company skill. Does the actual work assigned by a lead.
3
+ description: Employee executing one delegation contract for /company skill. Does the actual work, stops at a draft PR, never merges.
4
4
  tools: Read, Write, Edit, Bash, Grep, Glob, WebSearch, WebFetch, Skill
5
+ model: sonnet
5
6
  color: green
6
7
  ---
7
8
 
8
- You are an employee spawned by a department lead. Execute your assigned task, write findings, rate importance 1-5.
9
+ You are an employee spawned by the /company orchestrator to execute ONE delegation contract. Your prompt contains the full contract: TASK, INPUTS, OUTPUT, DONE-WHEN, VERIFY-WITH, OUT-OF-SCOPE. If any of those fields is missing from your prompt, report `BLOCKED: contract incomplete, missing {field}` and stop. Never invent the missing parts.
+
+ Execution rules, all binding:
+
+ - **Idempotent and self-contained.** Everything you need is in the prompt. Never assume chat history. Your prompt may be re-run, so check before you create: no duplicate PRs, no duplicate comments, no double-appended files.
+ - **Scope.** Do ONLY the assigned task. Respect OUT-OF-SCOPE literally. Adjacent problems get one line in your findings (`ALSO-FOUND: ...`) and nothing else. Never fix unbidden.
+ - **Skill first.** If the contract assigns a skill, invoke it via the Skill tool before anything else. If it is not installed, fall back to raw tools and note `SKILL-MISSING` in your findings. Never loop retrying a skill that does not exist.
+ - **Git isolation.** If the task touches a repo: work in your own worktree on your own branch (`git worktree add ../wt-{task-id} -b company/{task-id}`), commit there, push the branch, open a DRAFT PR. NEVER commit to a shared checkout, NEVER push to main, NEVER merge anything. Merging happens after review, by the orchestrator, not by you.
+ - **Run your check.** Before reporting done, run the contract's VERIFY-WITH command and paste its real output in your findings. If the output does not prove DONE-WHEN, you are not done.
+ - **EXTERNAL FACT RULE (highest priority).** Before writing ANY public-facing output (GitHub comments, PR descriptions, emails, posts) that states a specific fact about an external project (versions, APIs, features, architecture), verify it first with WebFetch or `gh api` against their actual docs, source, or README. If you cannot verify, write "not sure" instead of guessing. Never cite external numbers from memory. ONE STRIKE: if corrected, post a one-line factual correction and stop. Never argue and never guess a second time.
+ - **Blocked is a result.** If the task is impossible or blocked, report `BLOCKED: reason + what would unblock it`. Never return nothing and never expand scope to compensate.
+ - **Long waits.** For CI, builds, or deploys, start a background watcher and read its output. Never blind-sleep and never assume success. A watcher must fail loud: distinguish "the status command errored" from "nothing pending", or an outage reads as success.
+ - **You cannot spawn agents.** You are a leaf: the platform gives sub-agents no agent-spawning tool. If your contract seems to need a sub-agent (a debate, a parallel sweep), report `BLOCKED: needs orchestrator fan-out` instead of improvising.
+ - **Deferred tools.** If a tool you need is not directly callable, try loading it via ToolSearch first (`select:<name>` or keywords). Only after ToolSearch returns nothing do you report the gap.
+
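The "Long waits" rule depends on keeping three outcomes apart: the status command itself failed, checks are still pending, or nothing is pending. A minimal sketch of that classification follows. `classifyPoll` is a hypothetical helper, and the one-check-per-line `<name> <state>` output format is invented for illustration; it is not the output of any real CLI.

```javascript
'use strict';

// Hypothetical helper: classify one poll of a status command.
// `exitCode` and `stdout` come from whatever command the watcher runs.
// Assumed (invented) output format: one check per line, "<name> <state>",
// where state is "pending", "passed", or "failed".
function classifyPoll(exitCode, stdout) {
  // A non-zero exit means the status command itself errored (outage, auth,
  // network). That must not be read as "nothing pending".
  if (exitCode !== 0) return 'status-command-errored';

  const states = stdout
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean)
    .map((line) => line.split(/\s+/).pop());

  // No checks reported at all is reported separately instead of being
  // treated as success.
  if (states.length === 0) return 'no-checks-reported';
  if (states.includes('failed')) return 'failed';
  if (states.includes('pending')) return 'pending';
  return states.every((s) => s === 'passed') ? 'passed' : 'unknown-state';
}
```

A watcher built on this would keep polling only on `pending` and surface every other result, including `status-command-errored`, to its output.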
+ Output contract: append to the findings file named in OUTPUT, and reply with the same content. Every finding:
+
+ ```
+ FINDING: what (one line)
+ SOURCE: file/URL/command that proves it
+ OR "NOVEL - needs validation" for new ideas that don't exist yet
+ ```
+
+ Rate each finding's importance 1-5 (the digest keeps 4-5 in full).
+
+ Report SHORT. Result first, then the evidence (FINDING + SOURCE: the command and its output, the file, the PR/SHA/CI link). No narration of your steps, no restating the task. Concise never means unsourced: cut the prose around a claim, never the source that proves it.
+
+ Anything a human reads outside the run (a PR body, a comment, an email, a post) gets a /humanizer pass before you publish it: short, professional, human-sounding. Evidence lines stay verbatim. If the skill is missing, self-edit to the same bar and note SKILL-MISSING.
+
+ End every findings append with one machine-greppable line: `STATUS: complete` when DONE-WHEN is met and verified, `STATUS: blocked` with the blocker named above it, or `STATUS: incomplete` with what remains. The orchestrator greps this line instead of parsing your prose.
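Because the closing line has a fixed format, it can be read without interpreting any prose. The sketch below shows one possible reader and is not the orchestrator's actual code. `readStatus` is a hypothetical name, and the sketch assumes the STATUS line is the last non-empty line of the findings text.

```javascript
'use strict';

// Hypothetical reader for the trailing STATUS line described above.
// Assumption: the last non-empty line starts with "STATUS: " followed by
// one of the three documented words. Anything else yields null.
function readStatus(findingsText) {
  const lines = findingsText
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);
  if (lines.length === 0) return null;

  const last = lines[lines.length - 1];
  const match = /^STATUS: (complete|blocked|incomplete)\b/.exec(last);
  return match ? match[1] : null;
}
```

Returning `null` for a missing or malformed line lets the caller treat "no status" as its own case instead of guessing.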
package/bin/install.js CHANGED
@@ -55,22 +55,25 @@ try {
    // Stop hook
    if (!settings.hooks.Stop) settings.hooks.Stop = [];
    if (!settings.hooks.Stop.some(h => h.hooks?.some(hh => hh.command?.includes('company-stop-guard')))) {
-     settings.hooks.Stop.push({ hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-stop-guard.js')}"`, timeout: 5 }] });
+     settings.hooks.Stop.push({ hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-stop-guard.js')}"`, timeout: 10 }] });
    }

    // PreCompact hook
    if (!settings.hooks.PreCompact) settings.hooks.PreCompact = [];
    if (!settings.hooks.PreCompact.some(h => h.hooks?.some(hh => hh.command?.includes('company-precompact')))) {
-     settings.hooks.PreCompact.push({ hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-precompact.js')}"`, timeout: 5 }] });
+     settings.hooks.PreCompact.push({ hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-precompact.js')}"`, timeout: 10 }] });
    }

    // SessionStart hook (compact restore)
    if (!settings.hooks.SessionStart) settings.hooks.SessionStart = [];
    if (!settings.hooks.SessionStart.some(h => h.hooks?.some(hh => hh.command?.includes('company-session-restore')))) {
-     settings.hooks.SessionStart.push({ matcher: 'compact', hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-session-restore.js')}"`, timeout: 5 }] });
+     settings.hooks.SessionStart.push({ matcher: 'compact', hooks: [{ type: 'command', command: `node "${path.join(hooksDir, 'company-session-restore.js')}"`, timeout: 10 }] });
    }

-   fs.writeFileSync(settingsPath, JSON.stringify(settings, null, 2));
+   // Atomic write: a crash mid-write must not corrupt the user's settings.
+   const tmpPath = settingsPath + '.tmp';
+   fs.writeFileSync(tmpPath, JSON.stringify(settings, null, 2));
+   fs.renameSync(tmpPath, settingsPath);
    console.log('Hooks installed: Stop guard + PreCompact + SessionStart restore');
  } catch (e) {
    console.log('Could not register hooks. Add manually to settings.json.');
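The write-then-rename step in this hunk can be tried in isolation. The sketch below wraps it in a hypothetical `writeJsonAtomic` helper and runs it against a throwaway temp directory, not the real settings file. It assumes the temp file and the target are on the same filesystem, which is what allows the rename to replace the target in one step on POSIX systems.

```javascript
'use strict';
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper mirroring the diff: write to "<target>.tmp", then
// rename over the target so a reader never sees a half-written file.
function writeJsonAtomic(targetPath, value) {
  const tmpPath = targetPath + '.tmp';
  fs.writeFileSync(tmpPath, JSON.stringify(value, null, 2));
  fs.renameSync(tmpPath, targetPath);
}

// Usage against a scratch directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'atomic-demo-'));
const target = path.join(dir, 'settings.json');
writeJsonAtomic(target, { hooks: { Stop: [] } });

const roundTrip = JSON.parse(fs.readFileSync(target, 'utf8'));
const tmpLeftBehind = fs.existsSync(target + '.tmp');
```

One trade-off of a fixed `.tmp` suffix is that two writers running at the same time would share the same temp file. A per-process suffix would avoid that, at the cost of leaving stray files behind after a crash.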
@@ -9,9 +9,14 @@ allowed-tools:
  - Grep
  - Glob
  - Agent
+ - Task
  - WebSearch
  - WebFetch
  - Skill
  ---

- Resume the company from previous session. Read .company/STATUS.md, .company/GOAL.md, and the latest cycle briefing in .company/cycles/. Continue the THINK > EXECUTE > VERIFY loop from where it left off. Follow ~/.claude/skills/company/SKILL.md instructions.
+ Resume the company from the previous session. Follow ~/.claude/skills/company/SKILL.md exactly. The allowed-tools list names the subagent-spawning tool twice because it is called Agent in current Claude Code and Task in older versions; use whichever your harness provides.
+
+ Re-derive state from disk before acting, never from memory. Read .company/GOAL.md, .company/criteria.json, .company/playbook.md, and the latest cycle briefing and review in .company/cycles/. Treat .company/STATUS.md as a claim, not as truth: verify any merged/in-flight assertions against git log and gh pr list before relying on them.
+
+ Then continue the THINK > EXECUTE > VERIFY loop from the first failing criterion.
package/commands/run.md CHANGED
@@ -10,12 +10,13 @@ allowed-tools:
  - Grep
  - Glob
  - Agent
+ - Task
  - WebSearch
  - WebFetch
  - Skill
  ---

- Run the /company skill with the provided goal. Read ~/.claude/skills/company/SKILL.md for the full orchestration instructions and follow them exactly.
+ Run the /company skill with the provided goal. Read ~/.claude/skills/company/SKILL.md for the full orchestration instructions and follow them exactly. The allowed-tools list names the subagent-spawning tool twice because it is called Agent in current Claude Code and Task in older versions; use whichever your harness provides.

  The goal is: $ARGUMENTS

@@ -6,9 +6,21 @@
  const fs = require('fs');
  const path = require('path');

- const companyDir = path.join(process.cwd(), '.company');
+ const companyDir = process.env.COMPANY_DIR || path.join(process.cwd(), '.company');
  if (!fs.existsSync(companyDir)) process.exit(0);

+ // Only sessions that own the run are acted on. A foreign session that merely
+ // shares the directory must not be redirected or have state written on its
+ // behalf. Missing or empty OWNER is legacy state and keeps the old behavior.
+ try {
+   const hookInput = JSON.parse(fs.readFileSync(0, 'utf8'));
+   if (hookInput && typeof hookInput.session_id === 'string') {
+     const owners = fs.readFileSync(path.join(companyDir, 'OWNER'), 'utf8')
+       .split('\n').map(function (l) { return l.trim(); }).filter(Boolean);
+     if (owners.length > 0 && owners.indexOf(hookInput.session_id) === -1) process.exit(0);
+   }
+ } catch (e) {}
+
  const lines = ['# Company Checkpoint (auto-saved before compaction)', ''];

  // Goal
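The ownership gate added in this hunk reduces to a decision over two inputs: the text of the OWNER file and the hook's `session_id`. The sketch below restates it as a pure function so the cases are easy to see. `shouldSkip` is a hypothetical name, and passing `null` for a missing OWNER file is an assumption of this sketch; the hook itself reaches the same outcome through its empty `catch`.

```javascript
'use strict';

// Hypothetical restatement of the hook's ownership gate as a pure function.
// ownerText: contents of the OWNER file, or null if the file is missing.
// sessionId: the session_id field from the hook input, if present.
// Returns true when the hook should exit without acting.
function shouldSkip(ownerText, sessionId) {
  if (typeof sessionId !== 'string') return false; // no id: old behavior
  if (ownerText === null) return false;            // no OWNER file: legacy state

  const owners = ownerText
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);

  // Skip only when owners are listed and this session is not among them.
  return owners.length > 0 && !owners.includes(sessionId);
}
```

The only path that skips is "owners exist and this session is not one of them"; every uncertain case falls back to the pre-4.5.0 behavior of acting.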
@@ -67,11 +79,13 @@ if (fs.existsSync(rosterPath)) {
    lines.push('');
  }

- // Playbook (accumulated lessons)
+ // Playbook (accumulated lessons). New session entries are appended at the
+ // bottom, so snapshot the TAIL, not the head.
  const playbookPath = path.join(companyDir, 'playbook.md');
  if (fs.existsSync(playbookPath)) {
-   lines.push('## Playbook (lessons)');
-   lines.push(fs.readFileSync(playbookPath, 'utf8').substring(0, 500));
+   const playbook = fs.readFileSync(playbookPath, 'utf8');
+   lines.push('## Playbook (latest lessons)');
+   lines.push(playbook.substring(Math.max(0, playbook.length - 500)));
    lines.push('');
  }
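Switching from the head to the tail of the playbook is a one-line change, but the edge cases are worth spelling out. A minimal sketch, with `tailChars` as a hypothetical helper name:

```javascript
'use strict';

// Hypothetical helper: keep at most the last `limit` characters of `text`,
// the same arithmetic the hook applies to playbook.md with a limit of 500.
function tailChars(text, limit) {
  return text.substring(Math.max(0, text.length - limit));
}

// Shorter than the limit: returned unchanged.
const shortText = tailChars('lesson one', 500);
// Longer than the limit: only the end survives.
const longText = tailChars('old lesson\nnew lesson', 10);
```

The cut counts UTF-16 code units and ignores line boundaries, so a snapshot taken this way can begin partway through a line.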