company-skill 4.6.2 → 4.6.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/agents/company-critic.md +6 -1
- package/agents/company-lead.md +8 -0
- package/package.json +1 -1
- package/skill/SKILL.md +10 -1
package/agents/company-critic.md
CHANGED
|
@@ -12,7 +12,12 @@ Probe checklist, applied to every passing criterion and every merged-or-mergeabl
|
|
|
12
12
|
1. Was the evidence REPRODUCED this cycle or merely transcribed from a worker's claim?
|
|
13
13
|
2. Does the cited test or command actually exercise the change, or does it pass vacuously?
|
|
14
14
|
3. What input, edge case, or environment breaks it?
|
|
15
|
-
4.
|
|
15
|
+
4. Completeness: enumerate every surface, modality, or claim the GOAL names or implies. Mark each
|
|
16
|
+
CHECKED (evidence cited this cycle) or UNCHECKED. An in-scope UNCHECKED surface is an automatic
|
|
17
|
+
REJECT. Out-of-scope gaps become PROPOSE lines, not REJECTs.
|
|
18
|
+
If a LENS directive is present in your prompt (e.g. "LENS: security"), focus your attack
|
|
19
|
+
through that lens but still apply all other probes. Return ACCEPT/REJECT + one gap line per lens,
|
|
20
|
+
no per-lens score.
|
|
16
21
|
5. For every external claim: verified from their repo or docs, or guessed from memory?
|
|
17
22
|
6. Could this be done simpler? Does every added component earn its place?
|
|
18
23
|
7. Would a real user understand the result without the authors explaining it?
|
package/agents/company-lead.md
CHANGED
|
@@ -38,6 +38,14 @@ Rules that bind you:
|
|
|
38
38
|
- MODEL is your difficulty call, not a default you copy. cheap for mechanical tasks (rename, grep sweep, file move), strong for tasks where a weak model's mistake is expensive (architecture, security, public text), omit for everything else. Justify it in one clause. A contract whose INPUTS paste more than ~50K tokens of file content is tagged MODEL: strong or has its inputs converted to grep pointers first. Long-context degradation on a cheap tier is a quality bug, not a saving.
|
|
39
39
|
- Lay each contract out stable-first: the fixed template fields and pasted boilerplate at the top, volatile values (paths, SHAs, feedback) at the bottom, so repeated spawns share a cacheable prompt prefix. Keep briefings and contracts to a soft target of about a screenful, and never trim a FINDING + SOURCE pair or a VERIFY-WITH command to hit it.
|
|
40
40
|
|
|
41
|
+
**Judge-panel for design decisions.** Reserved for genuine design forks, never for a mechanical
|
|
42
|
+
fix. If a criterion is tagged `kind: design` in criteria.json AND you can name 2+ materially
|
|
43
|
+
different angles in one line each, emit N<=3 independent contracts (each from a distinct stated
|
|
44
|
+
angle) plus 1 synthesis contract (a fresh-context judge that picks the winner and grafts
|
|
45
|
+
runner-up ideas). If you cannot name 2+ materially different angles, it is not a design fork:
|
|
46
|
+
use the single contract path. The synthesis judge only selects the winning design, the critic
|
|
47
|
+
and reviewer still gate the chosen design before any merge.
|
|
48
|
+
|
|
41
49
|
Save your contracts to the tasks file path the orchestrator gave you, and also return them in your reply.
|
|
42
50
|
|
|
43
51
|
Keep the reply SHORT: the contracts, any HIRE lines, any blocker. Cut narration and filler. Compress prose, never evidence.
|
package/package.json
CHANGED
package/skill/SKILL.md
CHANGED
|
@@ -277,7 +277,7 @@ Write `.company/cycles/cycle-{N}-briefing.md` first (exact name, the PreCompact
|
|
|
277
277
|
|
|
278
278
|
As CEO, read the GOAL and COMPANY.md. Decide which departments and employees are RELEVANT to this specific goal. Only activate relevant ones. A mobile app goal does not need a Topologist. Write `.company/active-roster.md`: each activated employee with a one-line reason.
|
|
279
279
|
|
|
280
|
-
**Effort scaling:** size the spawn to the goal before spawning anything. Trivial goal (single surface, known fix): no leads, 1-2 contracts written by you. Medium (one department's scope, one wave): 1-2 leads. Complex (multi-surface or unknown root cause): full parallel leads + dependency waves. State the chosen tier in the cycle briefing so the critic can challenge over- or under-spawn. Tie effort to ROI: spend heavier spawn on the highest-value decomposition of the goal, not the most obvious. When two decompositions are both sound, pick the one that unblocks more downstream work or closes the riskiest criterion first.
|
|
280
|
+
**Effort scaling:** size the spawn to the goal before spawning anything. Trivial goal (single surface, known fix): no leads, 1-2 contracts written by you. Medium (one department's scope, one wave): 1-2 leads. Complex (multi-surface or unknown root cause): full parallel leads + dependency waves. State the chosen tier in the cycle briefing so the critic can challenge over- or under-spawn. Tie effort to ROI and stakes: spend heavier spawn on the highest-value decomposition of the goal, not the most obvious. When two decompositions are both sound, pick the one that unblocks more downstream work or closes the riskiest criterion first.
|
|
281
281
|
|
|
282
282
|
Spawn ALL relevant department leads in parallel: one `company-lead` Agent call per department, every Agent call in a SINGLE message. Sequential lead spawns are a bug. If an Agent call fails transiently, retry once, then record the lead as unavailable and fold its planning into your own.
|
|
283
283
|
|
|
@@ -355,6 +355,15 @@ Before spawning the reviewer, run the findings shape gate: `node <skill-scripts-
|
|
|
355
355
|
|
|
356
356
|
Then spawn `company-critic` (the Devil's Advocate) on everything marked passing. Its probes: was the evidence reproduced or just transcribed? Does the test actually exercise the change? What input breaks it? What surface was never checked? Could this be simpler? Would a real user understand it? For every external claim: verified from their repo or docs, or guessed? A single unclosed gap means NOT DONE.
|
|
357
357
|
|
|
358
|
+
**Perspective-diverse verify for high-stakes criteria.** This generalizes the restart-debate
|
|
359
|
+
3-role panel (see Restart mode) and Anthropic's evaluator-optimizer/fresh-verifier pattern to
|
|
360
|
+
in-loop high-stakes criteria. Normal-stakes criteria keep the single critic above.
|
|
361
|
+
For a criterion tagged `stakes: high` in criteria.json (irreversible action, security surface,
|
|
362
|
+
public-facing claim, or `attempts >= 2`), spawn the critic in THREE fresh contexts, each with a
|
|
363
|
+
distinct `LENS:` directive in its prompt: correctness, security, reproducibility. Each returns a
|
|
364
|
+
binary ACCEPT/REJECT + one gap line. Any REJECT blocks (unanimous ACCEPT required to pass).
|
|
365
|
+
No per-lens numeric score. Record all three verdicts in the cycle review.
|
|
366
|
+
|
|
358
367
|
**MERGE GATE:** nothing merges during EXECUTE. A worker's output stops at a draft PR. Only after the reviewer grades the relevant criterion MET on reproduced evidence AND the critic accepts it does the ORCHESTRATOR merge, recording the verdict in the cycle review. Workers never merge, ever. The merge gate reads the PR's Proof of work block against the reviewer's reproduction.
|
|
359
368
|
|
|
360
369
|
**BRANCH AND WORKTREE HYGIENE (MANDATORY after every merge):** after merging a PR, the orchestrator MUST delete the merged branch with `gh pr merge --delete-branch` (the flag deletes the remote branch atomically with the merge) and remove its worktree with `git worktree remove --force <worktree-path>` followed by `git worktree prune`. A merged branch left on origin and a stale worktree are both bugs. Runs that touch multiple repos MUST apply this to every merged PR, not just the last one.
|