company-skill 4.6.2 → 4.6.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -12,7 +12,12 @@ Probe checklist, applied to every passing criterion and every merged-or-mergeabl
12
12
  1. Was the evidence REPRODUCED this cycle or merely transcribed from a worker's claim?
13
13
  2. Does the cited test or command actually exercise the change, or does it pass vacuously?
14
14
  3. What input, edge case, or environment breaks it?
15
- 4. What surface was never checked (other pages, other platforms, error paths)?
15
+ 4. Completeness: enumerate every surface, modality, or claim the GOAL names or implies. Mark each
16
+ CHECKED (evidence cited this cycle) or UNCHECKED. An in-scope UNCHECKED surface is an automatic
17
+ REJECT. Out-of-scope gaps become PROPOSE lines, not REJECTs.
18
+ If a LENS directive is present in your prompt (e.g. "LENS: security"), focus your attack
19
+ through that lens but still apply all other probes. Return ACCEPT/REJECT + one gap line per lens,
20
+ no per-lens score.
16
21
  5. For every external claim: verified from their repo or docs, or guessed from memory?
17
22
  6. Could this be done simpler? Does every added component earn its place?
18
23
  7. Would a real user understand the result without the authors explaining it?
@@ -38,6 +38,14 @@ Rules that bind you:
38
38
  - MODEL is your difficulty call, not a default you copy. cheap for mechanical tasks (rename, grep sweep, file move), strong for tasks where a weak model's mistake is expensive (architecture, security, public text), omit for everything else. Justify it in one clause. A contract whose INPUTS paste more than ~50K tokens of file content is tagged MODEL: strong or has its inputs converted to grep pointers first. Long-context degradation on a cheap tier is a quality bug, not a saving.
39
39
  - Lay each contract out stable-first: the fixed template fields and pasted boilerplate at the top, volatile values (paths, SHAs, feedback) at the bottom, so repeated spawns share a cacheable prompt prefix. Keep briefings and contracts to a soft target of about a screenful, and never trim a FINDING + SOURCE pair or a VERIFY-WITH command to hit it.
40
40
 
41
+ **Judge-panel for design decisions.** Reserved for genuine design forks, never for a mechanical
42
+ fix. If a criterion is tagged `kind: design` in criteria.json AND you can name 2+ materially
43
+ different angles in one line each, emit N<=3 independent contracts (each from a distinct stated
44
+ angle) plus 1 synthesis contract (a fresh-context judge that picks the winner and grafts
45
+ runner-up ideas). If you cannot name 2+ materially different angles, it is not a design fork:
46
+ use the single contract path. The synthesis judge only selects the winning design, the critic
47
+ and reviewer still gate the chosen design before any merge.
48
+
41
49
  Save your contracts to the tasks file path the orchestrator gave you, and also return them in your reply.
42
50
 
43
51
  Keep the reply SHORT: the contracts, any HIRE lines, any blocker. Cut narration and filler. Compress prose, never evidence.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "company-skill",
3
- "version": "4.6.2",
3
+ "version": "4.6.3",
4
4
  "description": "Goal-driven multi-employee company for Claude Code. Give it a goal, it runs until done.",
5
5
  "bin": {
6
6
  "company-skill": "./bin/install.js"
package/skill/SKILL.md CHANGED
@@ -277,7 +277,7 @@ Write `.company/cycles/cycle-{N}-briefing.md` first (exact name, the PreCompact
277
277
 
278
278
  As CEO, read the GOAL and COMPANY.md. Decide which departments and employees are RELEVANT to this specific goal. Only activate relevant ones. A mobile app goal does not need a Topologist. Write `.company/active-roster.md`: each activated employee with a one-line reason.
279
279
 
280
- **Effort scaling:** size the spawn to the goal before spawning anything. Trivial goal (single surface, known fix): no leads, 1-2 contracts written by you. Medium (one department's scope, one wave): 1-2 leads. Complex (multi-surface or unknown root cause): full parallel leads + dependency waves. State the chosen tier in the cycle briefing so the critic can challenge over- or under-spawn. Tie effort to ROI: spend heavier spawn on the highest-value decomposition of the goal, not the most obvious. When two decompositions are both sound, pick the one that unblocks more downstream work or closes the riskiest criterion first.
280
+ **Effort scaling:** size the spawn to the goal before spawning anything. Trivial goal (single surface, known fix): no leads, 1-2 contracts written by you. Medium (one department's scope, one wave): 1-2 leads. Complex (multi-surface or unknown root cause): full parallel leads + dependency waves. State the chosen tier in the cycle briefing so the critic can challenge over- or under-spawn. Tie effort to ROI and stakes: spend heavier spawn on the highest-value decomposition of the goal, not the most obvious. When two decompositions are both sound, pick the one that unblocks more downstream work or closes the riskiest criterion first.
281
281
 
282
282
  Spawn ALL relevant department leads in parallel: one `company-lead` Agent call per department, every Agent call in a SINGLE message. Sequential lead spawns are a bug. If an Agent call fails transiently, retry once, then record the lead as unavailable and fold its planning into your own.
283
283
 
@@ -355,6 +355,15 @@ Before spawning the reviewer, run the findings shape gate: `node <skill-scripts-
355
355
 
356
356
  Then spawn `company-critic` (the Devil's Advocate) on everything marked passing. Its probes: was the evidence reproduced or just transcribed? Does the test actually exercise the change? What input breaks it? What surface was never checked? Could this be simpler? Would a real user understand it? For every external claim: verified from their repo or docs, or guessed? A single unclosed gap means NOT DONE.
357
357
 
358
+ **Perspective-diverse verify for high-stakes criteria.** This generalizes the restart-debate
359
+ 3-role panel (see Restart mode) and Anthropic's evaluator-optimizer/fresh-verifier pattern to
360
+ in-loop high-stakes criteria. Normal-stakes criteria keep the single critic above.
361
+ For a criterion tagged `stakes: high` in criteria.json (irreversible action, security surface,
362
+ public-facing claim, or `attempts >= 2`), spawn the critic in THREE fresh contexts, each with a
363
+ distinct `LENS:` directive in its prompt: correctness, security, reproducibility. Each returns a
364
+ binary ACCEPT/REJECT + one gap line. Any REJECT blocks (unanimous ACCEPT required to pass).
365
+ No per-lens numeric score. Record all three verdicts in the cycle review.
366
+
358
367
  **MERGE GATE:** nothing merges during EXECUTE. A worker's output stops at a draft PR. Only after the reviewer grades the relevant criterion MET on reproduced evidence AND the critic accepts it does the ORCHESTRATOR merge, recording the verdict in the cycle review. Workers never merge, ever. The merge gate reads the PR's Proof of work block against the reviewer's reproduction.
359
368
 
360
369
  **BRANCH AND WORKTREE HYGIENE (MANDATORY after every merge):** after merging a PR, the orchestrator MUST delete the merged branch with `gh pr merge --delete-branch` (the flag deletes the remote branch atomically with the merge) and remove its worktree with `git worktree remove --force <worktree-path>` followed by `git worktree prune`. A merged branch left on origin and a stale worktree are both bugs. Runs that touch multiple repos MUST apply this to every merged PR, not just the last one.