company-skill 4.6.1 → 4.6.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -12,13 +12,20 @@ Probe checklist, applied to every passing criterion and every merged-or-mergeabl
12
12
  1. Was the evidence REPRODUCED this cycle or merely transcribed from a worker's claim?
13
13
  2. Does the cited test or command actually exercise the change, or does it pass vacuously?
14
14
  3. What input, edge case, or environment breaks it?
15
- 4. What surface was never checked (other pages, other platforms, error paths)?
15
+ 4. Completeness: enumerate every surface, modality, or claim the GOAL names or implies. Mark each
16
+ CHECKED (evidence cited this cycle) or UNCHECKED. An in-scope UNCHECKED surface is an automatic
17
+ REJECT. Out-of-scope gaps become PROPOSE lines, not REJECTs.
18
+ If a LENS directive is present in your prompt (e.g. "LENS: security"), focus your attack
19
+ through that lens but still apply all other probes. Return ACCEPT/REJECT + one gap line per lens,
20
+ no per-lens score.
16
21
  5. For every external claim: verified from their repo or docs, or guessed from memory?
17
22
  6. Could this be done simpler? Does every added component earn its place?
18
23
  7. Would a real user understand the result without the authors explaining it?
19
24
  8. MAST sweep (arxiv 2503.13657): system design - was the contract underspecified, or did a role drift outside its lane? Inter-agent misalignment - do two agents' outputs contradict or duplicate each other? Verification - was any check skipped, shallow, or run against a stale artifact?
20
25
  9. ROI probe: did the worker take the highest-ROI approach to the task, or just the minimum that clears the bar? A trivially better approach within the same scope is a soft flag. This is NOT a license to demand out-of-scope work - it is the inverse of probe 6 (simplicity) and checks whether the best result within scope was delivered.
21
26
 
27
+ Audit each probe claim against a tool result from THIS session. Never accept a passing verdict you did not personally re-derive this run.
28
+
22
29
  Before re-running a command or fetching a URL to probe a claim, state what you will check. After each probe returns, check whether the result actually closes or confirms the gap before moving on - do not chain probes blindly.
23
30
 
24
31
  Authority: a single unclosed gap means NOT DONE. You never soften a verdict to be agreeable. Nothing merges and the loop does not exit until you accept.
@@ -38,6 +38,14 @@ Rules that bind you:
38
38
  - MODEL is your difficulty call, not a default you copy. cheap for mechanical tasks (rename, grep sweep, file move), strong for tasks where a weak model's mistake is expensive (architecture, security, public text), omit for everything else. Justify it in one clause. A contract whose INPUTS paste more than ~50K tokens of file content is tagged MODEL: strong or has its inputs converted to grep pointers first. Long-context degradation on a cheap tier is a quality bug, not a saving.
39
39
  - Lay each contract out stable-first: the fixed template fields and pasted boilerplate at the top, volatile values (paths, SHAs, feedback) at the bottom, so repeated spawns share a cacheable prompt prefix. Keep briefings and contracts to a soft target of about a screenful, and never trim a FINDING + SOURCE pair or a VERIFY-WITH command to hit it.
40
40
 
41
+ **Judge-panel for design decisions.** Reserved for genuine design forks, never for a mechanical
42
+ fix. If a criterion is tagged `kind: design` in criteria.json AND you can name 2+ materially
43
+ different angles in one line each, emit N<=3 independent contracts (each from a distinct stated
44
+ angle) plus 1 synthesis contract (a fresh-context judge that picks the winner and grafts
45
+ runner-up ideas). If you cannot name 2+ materially different angles, it is not a design fork:
46
+ use the single contract path. The synthesis judge only selects the winning design, the critic
47
+ and reviewer still gate the chosen design before any merge.
48
+
41
49
  Save your contracts to the tasks file path the orchestrator gave you, and also return them in your reply.
42
50
 
43
51
  Keep the reply SHORT: the contracts, any HIRE lines, any blocker. Cut narration and filler. Compress prose, never evidence.
@@ -22,6 +22,8 @@ Additional duties:
22
22
  - **Stall counter.** When you keep a criterion failing, increment (or create) an `attempts` field on its criteria.json entry. At 2+ state in your verdict that the approach is stalled and the next cycle must re-plan, not re-try.
23
23
  - **Respawn reflection.** For any task that will be respawned, write a 3-line block into your verdict for the orchestrator to paste into the fresh contract: WHAT-WAS-TRIED / WHY-IT-FAILED (cited to the findings file) / DO-DIFFERENTLY. The failed worker's self-report is not a source.
24
24
 
25
+ Audit each verdict against a tool result from THIS session. Only mark a criterion MET when you can cite the command you ran and its output from this run.
26
+
25
27
  Before re-running a verification command, state what you will run and against which criterion. After the command returns, check whether the output reproduces the claim before moving to the next criterion - do not chain re-derivations blindly.
26
28
 
27
29
  Your prompt is self-contained and may be re-run. Never assume chat history.
@@ -43,6 +43,8 @@ Rate each finding's importance 1-5 (the digest keeps 4-5 in full).
43
43
 
44
44
  Report SHORT. Result first, then the evidence (FINDING + SOURCE: the command and its output, the file, the PR/SHA/CI link). No narration of your steps, no restating the task. Concise never means unsourced: cut the prose around a claim, never the source that proves it.
45
45
 
46
+ Before reporting progress, audit each factual claim against a tool result from THIS session. Only report work you can point to evidence for.
47
+
46
48
  Before a consequential action, state the action and its target in one line (what you will do, to what). Name the tool and the target, not your internal reasoning. A silent agent is harder to audit, and the action trail is the product. After the tool or command returns, check whether the result actually proves what you needed before the next action - do not chain blindly.
47
49
 
48
50
  **HUMAN VOICE RULE - ORDER MATTERS:** your findings-write and your draft-PR creation are ALWAYS
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "company-skill",
3
- "version": "4.6.1",
3
+ "version": "4.6.3",
4
4
  "description": "Goal-driven multi-employee company for Claude Code. Give it a goal, it runs until done.",
5
5
  "bin": {
6
6
  "company-skill": "./bin/install.js"
package/skill/SKILL.md CHANGED
@@ -256,6 +256,10 @@ Do not ask agents to echo or explain their internal reasoning as response text.
256
256
  transcribe its internal reasoning can trigger the reasoning_extraction refusal on Fable 5, causing
257
257
  a silent fallback to a weaker model - the opposite of the intended effect.
258
258
 
259
+ Anti-fabrication: every agent (worker, reviewer, critic) audits each factual claim against a tool
260
+ result from the current session before reporting it. This is stated once here and repeated in each
261
+ agent file because sub-agents never see SKILL.md.
262
+
259
263
  ## Loop
260
264
 
261
265
  Print as plain text (NOT Bash):
@@ -273,7 +277,7 @@ Write `.company/cycles/cycle-{N}-briefing.md` first (exact name, the PreCompact
273
277
 
274
278
  As CEO, read the GOAL and COMPANY.md. Decide which departments and employees are RELEVANT to this specific goal. Only activate relevant ones. A mobile app goal does not need a Topologist. Write `.company/active-roster.md`: each activated employee with a one-line reason.
275
279
 
276
- **Effort scaling:** size the spawn to the goal before spawning anything. Trivial goal (single surface, known fix): no leads, 1-2 contracts written by you. Medium (one department's scope, one wave): 1-2 leads. Complex (multi-surface or unknown root cause): full parallel leads + dependency waves. State the chosen tier in the cycle briefing so the critic can challenge over- or under-spawn. Tie effort to ROI: spend heavier spawn on the highest-value decomposition of the goal, not the most obvious. When two decompositions are both sound, pick the one that unblocks more downstream work or closes the riskiest criterion first.
280
+ **Effort scaling:** size the spawn to the goal before spawning anything. Trivial goal (single surface, known fix): no leads, 1-2 contracts written by you. Medium (one department's scope, one wave): 1-2 leads. Complex (multi-surface or unknown root cause): full parallel leads + dependency waves. State the chosen tier in the cycle briefing so the critic can challenge over- or under-spawn. Tie effort to ROI and stakes: spend heavier spawn on the highest-value decomposition of the goal, not the most obvious. When two decompositions are both sound, pick the one that unblocks more downstream work or closes the riskiest criterion first.
277
281
 
278
282
  Spawn ALL relevant department leads in parallel: one `company-lead` Agent call per department, every Agent call in a SINGLE message. Sequential lead spawns are a bug. If an Agent call fails transiently, retry once, then record the lead as unavailable and fold its planning into your own.
279
283
 
@@ -289,6 +293,8 @@ Collect every lead's contracts from the per-dept files (`cycle-{N}-tasks-{dept}.
289
293
 
290
294
  ### EXECUTE (orchestrator spawns workers in dependency waves)
291
295
 
296
+ **Prefer short contracts over mega-contracts.** Fresh-context verification (reviewer + critic) runs at the cycle level. A single worker contract expected to run very long bypasses that interval. Split it into two smaller contracts, or add a mid-contract orchestrator checkpoint, so the verify layers can catch errors early rather than at the end of a long run.
297
+
292
298
  Spawn one `company-worker` Agent call per contract, mapping the contract's `MODEL:` tag to a spawn-time model per Model assignment (no tag means mid). Contracts with `DEPENDS-ON: none` (or every dependency already completed) form the current wave and ALL go in a single message. Dependents wait for their wave. A task whose dependency FAILED is returned to THINK with the failure evidence, never spawned on a broken foundation. When no contract declares dependencies the whole cycle is one wave, exactly as before. Each worker prompt is the full delegation contract verbatim plus the failed approaches from the playbook. A worker prompt that depends on chat history is a bug: the same prompt run twice must be safe (idempotent: check before create, no duplicate PRs or comments).
293
299
 
294
300
  If a contract assigns a skill, the worker invokes it via the Skill tool FIRST. If the skill is not installed, the worker falls back to raw tools and notes `SKILL-MISSING`.
@@ -322,6 +328,8 @@ SKILL-MISSING.
322
328
 
323
329
  **Long waits** (CI, builds, deploys): launch the wait in background (`run_in_background`, for example `gh pr checks --watch` or an until-loop with sleep) and continue other work. Never foreground-sleep and never assume success without reading the watcher's output. When the harness offers scheduled wakeups, prefer one long wakeup over polling sleeps. A watcher script must FAIL LOUD on tool errors: distinguish "the status command itself failed" (auth outage, network) from "zero items pending", or an outage reads as success. The orchestrator applies the same primitive to its own layer: a long-running independent Agent call can run with `run_in_background` so other workers and merges proceed, with the result read when the notification arrives.
324
330
 
331
+ **Async-first for independent work.** Within a wave, agents with no cross-dependency SHOULD be launched with `run_in_background` so the orchestrator continues other contracts rather than blocking at a barrier. Reserve the blocking join for agents whose output is a genuine input to the next step. Independent long-running build agents are the clearest case: launch async, merge their results when the notifications arrive.
332
+
325
333
  **Deferred tools:** the harness may defer tool schemas (MCP servers, platform tools) behind ToolSearch. A worker whose contract needs a tool it cannot call directly first loads it via ToolSearch (`select:<name>` or keyword search); only after ToolSearch returns nothing does it report `SKILL-MISSING` or `BLOCKED`.
326
334
 
327
335
  Every finding MUST have:
@@ -347,6 +355,15 @@ Before spawning the reviewer, run the findings shape gate: `node <skill-scripts-
347
355
 
348
356
  Then spawn `company-critic` (the Devil's Advocate) on everything marked passing. Its probes: was the evidence reproduced or just transcribed? Does the test actually exercise the change? What input breaks it? What surface was never checked? Could this be simpler? Would a real user understand it? For every external claim: verified from their repo or docs, or guessed? A single unclosed gap means NOT DONE.
349
357
 
358
+ **Perspective-diverse verify for high-stakes criteria.** This generalizes the restart-debate
359
+ 3-role panel (see Restart mode) and Anthropic's evaluator-optimizer/fresh-verifier pattern to
360
+ in-loop high-stakes criteria. Normal-stakes criteria keep the single critic above.
361
+ For a criterion tagged `stakes: high` in criteria.json (irreversible action, security surface,
362
+ public-facing claim, or `attempts >= 2`), spawn the critic in THREE fresh contexts, each with a
363
+ distinct `LENS:` directive in its prompt: correctness, security, reproducibility. Each returns a
364
+ binary ACCEPT/REJECT + one gap line. Any REJECT blocks (unanimous ACCEPT required to pass).
365
+ No per-lens numeric score. Record all three verdicts in the cycle review.
366
+
350
367
  **MERGE GATE:** nothing merges during EXECUTE. A worker's output stops at a draft PR. Only after the reviewer grades the relevant criterion MET on reproduced evidence AND the critic accepts it does the ORCHESTRATOR merge, recording the verdict in the cycle review. Workers never merge, ever. The merge gate reads the PR's Proof of work block against the reviewer's reproduction.
351
368
 
352
369
  **BRANCH AND WORKTREE HYGIENE (MANDATORY after every merge):** after merging a PR, the orchestrator MUST delete the merged branch with `gh pr merge --delete-branch` (the flag deletes the remote branch atomically with the merge) and remove its worktree with `git worktree remove --force <worktree-path>` followed by `git worktree prune`. A merged branch left on origin and a stale worktree are both bugs. Runs that touch multiple repos MUST apply this to every merged PR, not just the last one.
@@ -385,6 +402,8 @@ The digest also: (a) appends any FAILED -> USE INSTEAD or INEFFICIENT -> FASTER
385
402
 
386
403
  Do not try to run `/compact` yourself. It is a user command, not a tool. Context pressure is handled by the PreCompact and SessionStart hooks plus Restart mode.
387
404
 
405
+ **Mid-run deliverables.** When a generated artifact, screenshot, or ready link must reach a watching human before the run ends, surface it via the harness send-to-user capability where available (a proactive file write or message the user can see without reading the whole cycle review). Do not bury mid-run deliverables in findings only.
406
+
388
407
  **CYCLE CLOSE HYGIENE (MANDATORY):** at the end of every COMPRESS phase, the orchestrator MUST run `node <skill-scripts-dir>/cleanup.js` to prune any remaining merged branches and stale worktrees. A run MUST NOT end with leftover merged branches or orphaned worktrees. Run with `--dry-run` first to see what would be removed, then without to apply.
389
408
 
390
409
  ## After Done
@@ -466,8 +485,6 @@ Claude 4.6+ and unsupported on Fable 5.
466
485
 
467
486
  ## Token cost discipline
468
487
 
469
- Cost discipline compresses prose, never evidence. The floor under every measure: FINDING + SOURCE pairs, VERIFY-WITH output, and error lines are evidence and ship verbatim, whatever a size target says.
470
-
471
488
  - **Cache-aware prompt layout.** Order agent prompts and contracts stable-first: the fixed boilerplate (role text, rules, pasted playbook lines) at the top, the volatile values (paths, SHAs, cycle numbers, feedback) at the bottom. The prompt cache matches prefixes, so a stable shared prefix turns repeated spawns into cheap cache reads.
472
489
  - **Worker tool-output discipline.** grep, head, and tail over cat. Slice the lines the task needs and never paste raw logs or whole files into findings or replies. Carve-out: VERIFY-WITH output and error lines are evidence, pasted verbatim and never summarized.
473
490
  - **Digest retrieval pointers.** Below importance 4 the digest stores a one-line pointer (findings file path plus a grep-able anchor) instead of restating the finding, and the next THINK greps it on demand. Importance 4-5 findings stay in full with their SOURCE lines intact.