company-skill 4.6.0 → 4.6.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/agents/company-critic.md +4 -0
- package/agents/company-reviewer.md +4 -0
- package/agents/company-worker.md +3 -1
- package/package.json +1 -1
- package/scripts/dashboard.js +70 -7
- package/skill/SKILL.md +33 -2
package/agents/company-critic.md
CHANGED
|
@@ -19,6 +19,10 @@ Probe checklist, applied to every passing criterion and every merged-or-mergeabl
|
|
|
19
19
|
8. MAST sweep (arxiv 2503.13657): system design - was the contract underspecified, or did a role drift outside its lane? Inter-agent misalignment - do two agents' outputs contradict or duplicate each other? Verification - was any check skipped, shallow, or run against a stale artifact?
|
|
20
20
|
9. ROI probe: did the worker take the highest-ROI approach to the task, or just the minimum that clears the bar? A trivially better approach within the same scope is a soft flag. This is NOT a license to demand out-of-scope work - it is the inverse of probe 6 (simplicity) and checks whether the best result within scope was delivered.
|
|
21
21
|
|
|
22
|
+
Audit each probe claim against a tool result from THIS session. Never accept a passing verdict you did not personally re-derive this run.
|
|
23
|
+
|
|
24
|
+
Before re-running a command or fetching a URL to probe a claim, state what you will check. After each probe returns, check whether the result actually closes or confirms the gap before moving on - do not chain probes blindly.
|
|
25
|
+
|
|
22
26
|
Authority: a single unclosed gap means NOT DONE. You never soften a verdict to be agreeable. Nothing merges and the loop does not exit until you accept.
|
|
23
27
|
|
|
24
28
|
Your prompt is self-contained and may be re-run. Never assume chat history.
|
|
@@ -22,6 +22,10 @@ Additional duties:
|
|
|
22
22
|
- **Stall counter.** When you keep a criterion failing, increment (or create) an `attempts` field on its criteria.json entry. At 2+ state in your verdict that the approach is stalled and the next cycle must re-plan, not re-try.
|
|
23
23
|
- **Respawn reflection.** For any task that will be respawned, write a 3-line block into your verdict for the orchestrator to paste into the fresh contract: WHAT-WAS-TRIED / WHY-IT-FAILED (cited to the findings file) / DO-DIFFERENTLY. The failed worker's self-report is not a source.
|
|
24
24
|
|
|
25
|
+
Audit each verdict against a tool result from THIS session. Only mark a criterion MET when you can cite the command you ran and its output from this run.
|
|
26
|
+
|
|
27
|
+
Before re-running a verification command, state what you will run and against which criterion. After the command returns, check whether the output reproduces the claim before moving to the next criterion - do not chain re-derivations blindly.
|
|
28
|
+
|
|
25
29
|
Your prompt is self-contained and may be re-run. Never assume chat history.
|
|
26
30
|
|
|
27
31
|
Verdict first, in the fewest words: each criterion MET / NOT-REPRODUCED / NOT MET with the one line of reproduced evidence (or the gap) that decides it. No restating the criteria, no narration.
|
package/agents/company-worker.md
CHANGED
|
@@ -43,7 +43,9 @@ Rate each finding's importance 1-5 (the digest keeps 4-5 in full).
|
|
|
43
43
|
|
|
44
44
|
Report SHORT. Result first, then the evidence (FINDING + SOURCE: the command and its output, the file, the PR/SHA/CI link). No narration of your steps, no restating the task. Concise never means unsourced: cut the prose around a claim, never the source that proves it.
|
|
45
45
|
|
|
46
|
-
|
|
46
|
+
Before reporting progress, audit each factual claim against a tool result from THIS session. Only report work you can point to evidence for.
|
|
47
|
+
|
|
48
|
+
Before a consequential action, state the action and its target in one line (what you will do, to what). Name the tool and the target, not your internal reasoning. A silent agent is harder to audit, and the action trail is the product. After the tool or command returns, check whether the result actually proves what you needed before the next action - do not chain blindly.
|
|
47
49
|
|
|
48
50
|
**HUMAN VOICE RULE - ORDER MATTERS:** your findings-write and your draft-PR creation are ALWAYS
|
|
49
51
|
your final two tool calls. Nothing may come after them.
|
package/package.json
CHANGED
package/scripts/dashboard.js
CHANGED
|
@@ -1202,12 +1202,17 @@ tr:last-child td { border-bottom: none; }
|
|
|
1202
1202
|
.pill.warn { background: rgba(154,103,0,0.12); color: var(--amber); }
|
|
1203
1203
|
.empty { color: var(--dim); font-size: 13.5px; padding: 0.5rem 0; }
|
|
1204
1204
|
.feed { font-size: 13px; max-height: 22rem; overflow-y: auto; }
|
|
1205
|
-
.feed .row {
|
|
1205
|
+
.feed .row { border-bottom: 1px solid var(--line); }
|
|
1206
1206
|
.feed .row:last-child { border-bottom: none; }
|
|
1207
|
-
.feed .
|
|
1207
|
+
.feed-header { display: flex; gap: 0.75rem; align-items: baseline; padding: 0.3rem 0; cursor: pointer; width: 100%; background: none; border: none; text-align: left; font-size: 13px; font-family: inherit; color: inherit; }
|
|
1208
|
+
.feed-header:hover { background: var(--hover); }
|
|
1209
|
+
.feed .ts { font-family: var(--font-mono); color: var(--dim); white-space: nowrap; font-size: 12px; flex-shrink: 0; }
|
|
1208
1210
|
.feed .kind { font-weight: 600; width: 3.5rem; flex-shrink: 0; }
|
|
1209
1211
|
.feed .kind.spawn { color: var(--cyan); }
|
|
1210
1212
|
.feed .kind.finish { color: var(--accent); }
|
|
1213
|
+
.feed-bullet { flex: 1; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }
|
|
1214
|
+
.feed-caret { font-size: 11px; color: var(--dim); flex-shrink: 0; padding-left: 0.5rem; }
|
|
1215
|
+
.feed-detail { padding: 0.3rem 0 0.5rem 1.25rem; font-size: 12.5px; white-space: normal; word-break: break-word; color: var(--dim); }
|
|
1211
1216
|
.ctx-card { margin-top: 2.5rem; }
|
|
1212
1217
|
.ctx-gauge-wrap { position: relative; height: 10px; border-radius: var(--radius-pill); overflow: visible; background: var(--hover); margin: 1.25rem 0 2rem; }
|
|
1213
1218
|
.ctx-gauge-bar { position: absolute; left: 0; top: 0; height: 100%; border-radius: var(--radius-pill); transition: width 0.3s; }
|
|
@@ -1895,16 +1900,74 @@ function setupTreeInteractions() {
|
|
|
1895
1900
|
});
|
|
1896
1901
|
}
|
|
1897
1902
|
|
|
1903
|
+
// Per-feed-item expand: sessionStorage set of expanded item keys, keyed per session
|
|
1904
|
+
function feedStorageKey() {
|
|
1905
|
+
return 'feedExpanded:' + (_treeSessionId || 'unbound');
|
|
1906
|
+
}
|
|
1907
|
+
|
|
1908
|
+
function loadFeedExpanded() {
|
|
1909
|
+
try {
|
|
1910
|
+
return new Set(JSON.parse(sessionStorage.getItem(feedStorageKey()) || '[]'));
|
|
1911
|
+
} catch (_) { return new Set(); }
|
|
1912
|
+
}
|
|
1913
|
+
|
|
1914
|
+
function saveFeedExpanded(set) {
|
|
1915
|
+
try { sessionStorage.setItem(feedStorageKey(), JSON.stringify([...set])); } catch (_) {}
|
|
1916
|
+
}
|
|
1917
|
+
|
|
1918
|
+
let _lastFeed = null;
|
|
1919
|
+
|
|
1920
|
+
// Build a stable per-item key from ts + kind + agentType (no id on feed events)
|
|
1921
|
+
function feedItemKey(e, idx) {
|
|
1922
|
+
return (e.ts || '') + ':' + (e.kind || '') + ':' + (e.agentType || '') + ':' + idx;
|
|
1923
|
+
}
|
|
1924
|
+
|
|
1898
1925
|
function renderFeed(s) {
|
|
1899
1926
|
const root = $('feed');
|
|
1900
1927
|
root.replaceChildren();
|
|
1901
1928
|
const feed = s.feed || [];
|
|
1902
|
-
|
|
1903
|
-
|
|
1929
|
+
_lastFeed = feed;
|
|
1930
|
+
if (!feed.length) {
|
|
1931
|
+
root.appendChild(el('div', 'empty', 'No spawn or finish events in the current transcript tail'));
|
|
1932
|
+
return;
|
|
1933
|
+
}
|
|
1934
|
+
const expanded = loadFeedExpanded();
|
|
1935
|
+
for (let idx = 0; idx < feed.length; idx++) {
|
|
1936
|
+
const e = feed[idx];
|
|
1937
|
+
const key = feedItemKey(e, idx);
|
|
1938
|
+
const isOpen = expanded.has(key);
|
|
1939
|
+
|
|
1904
1940
|
const row = el('div', 'row');
|
|
1905
|
-
|
|
1906
|
-
|
|
1907
|
-
|
|
1941
|
+
|
|
1942
|
+
// Collapsed header: timestamp + kind badge + truncated bullet + caret
|
|
1943
|
+
const header = el('button', 'feed-header');
|
|
1944
|
+
header.appendChild(el('span', 'ts', e.ts ? new Date(e.ts).toLocaleTimeString() : '?'));
|
|
1945
|
+
header.appendChild(el('span', 'kind ' + e.kind, e.kind));
|
|
1946
|
+
const bullet = el('span', 'feed-bullet');
|
|
1947
|
+
bullet.textContent = (e.agentType || '') + ': ' + (e.description || '');
|
|
1948
|
+
header.appendChild(bullet);
|
|
1949
|
+
const caret = el('span', 'feed-caret');
|
|
1950
|
+
caret.textContent = isOpen ? 'v' : '>';
|
|
1951
|
+
header.appendChild(caret);
|
|
1952
|
+
row.appendChild(header);
|
|
1953
|
+
|
|
1954
|
+
// Expanded detail: full text, wrapping freely
|
|
1955
|
+
if (isOpen) {
|
|
1956
|
+
const detail = el('div', 'feed-detail');
|
|
1957
|
+
detail.textContent = (e.agentType || '') + ': ' + (e.description || '');
|
|
1958
|
+
row.appendChild(detail);
|
|
1959
|
+
}
|
|
1960
|
+
|
|
1961
|
+
header.addEventListener('click', () => {
|
|
1962
|
+
// Read fresh from storage so concurrent tabs don't clobber each other
|
|
1963
|
+
const current = loadFeedExpanded();
|
|
1964
|
+
if (current.has(key)) current.delete(key);
|
|
1965
|
+
else current.add(key);
|
|
1966
|
+
saveFeedExpanded(current);
|
|
1967
|
+
// Re-render only the feed section using cached state
|
|
1968
|
+
if (_lastFeed !== null) renderFeed({ feed: _lastFeed });
|
|
1969
|
+
});
|
|
1970
|
+
|
|
1908
1971
|
root.appendChild(row);
|
|
1909
1972
|
}
|
|
1910
1973
|
}
|
package/skill/SKILL.md
CHANGED
|
@@ -240,6 +240,26 @@ No command, no task. If nobody can write a VERIFY-WITH command (or an equally co
|
|
|
240
240
|
|
|
241
241
|
MODEL is optional and defaults to mid. When present it carries the lead's judgment about the task's difficulty, and the orchestrator maps it to a model at spawn time (see Model assignment). ROI is required and carries the lead's value rationale, why this task over an alternative, so the orchestrator can rank contracts by value-over-effort and triage surfaced proposals meaningfully. Lay contracts out stable-first: the fixed template fields and pasted boilerplate at the top, the volatile values (paths, SHAs, cycle feedback) at the bottom, so repeated spawns share a cacheable prompt prefix.
|
|
242
242
|
|
|
243
|
+
## Thinking and reflection
|
|
244
|
+
|
|
245
|
+
On models that support adaptive thinking (Fable 5 and later), agents run with thinking enabled by
|
|
246
|
+
default for the orchestrator and verify layers. Do NOT set a `budget_tokens` parameter - it is
|
|
247
|
+
deprecated on Claude 4.6+ and unsupported on Fable 5. Adaptive thinking depth is model-controlled.
|
|
248
|
+
|
|
249
|
+
After every tool call or command returns, check whether the result actually proves what you needed
|
|
250
|
+
before the next action. Do not chain tool calls blindly. A result that partially answers or
|
|
251
|
+
contradicts the plan changes the plan, not just the evidence box. This is the reflect-after-tool
|
|
252
|
+
practice Anthropic documents for multi-step agentic loops.
|
|
253
|
+
|
|
254
|
+
Do not ask agents to echo or explain their internal reasoning as response text. Narrate the ACTION
|
|
255
|
+
(what you will do, to what target) not the internal chain-of-thought. Prompting a model to
|
|
256
|
+
transcribe its internal reasoning can trigger the reasoning_extraction refusal on Fable 5, causing
|
|
257
|
+
a silent fallback to a weaker model - the opposite of the intended effect.
|
|
258
|
+
|
|
259
|
+
Anti-fabrication: every agent (worker, reviewer, critic) audits each factual claim against a tool
|
|
260
|
+
result from the current session before reporting it. This is stated once here and repeated in each
|
|
261
|
+
agent file because sub-agents never see SKILL.md.
|
|
262
|
+
|
|
243
263
|
## Loop
|
|
244
264
|
|
|
245
265
|
Print as plain text (NOT Bash):
|
|
@@ -273,6 +293,8 @@ Collect every lead's contracts from the per-dept files (`cycle-{N}-tasks-{dept}.
|
|
|
273
293
|
|
|
274
294
|
### EXECUTE (orchestrator spawns workers in dependency waves)
|
|
275
295
|
|
|
296
|
+
**Prefer short contracts over mega-contracts.** Fresh-context verification (reviewer + critic) runs at the cycle level. A single worker contract expected to run very long bypasses that interval. Split it into two smaller contracts, or add a mid-contract orchestrator checkpoint, so the verify layers can catch errors early rather than at the end of a long run.
|
|
297
|
+
|
|
276
298
|
Spawn one `company-worker` Agent call per contract, mapping the contract's `MODEL:` tag to a spawn-time model per Model assignment (no tag means mid). Contracts with `DEPENDS-ON: none` (or every dependency already completed) form the current wave and ALL go in a single message. Dependents wait for their wave. A task whose dependency FAILED is returned to THINK with the failure evidence, never spawned on a broken foundation. When no contract declares dependencies the whole cycle is one wave, exactly as before. Each worker prompt is the full delegation contract verbatim plus the failed approaches from the playbook. A worker prompt that depends on chat history is a bug: the same prompt run twice must be safe (idempotent: check before create, no duplicate PRs or comments).
|
|
277
299
|
|
|
278
300
|
If a contract assigns a skill, the worker invokes it via the Skill tool FIRST. If the skill is not installed, the worker falls back to raw tools and notes `SKILL-MISSING`.
|
|
@@ -306,6 +328,8 @@ SKILL-MISSING.
|
|
|
306
328
|
|
|
307
329
|
**Long waits** (CI, builds, deploys): launch the wait in background (`run_in_background`, for example `gh pr checks --watch` or an until-loop with sleep) and continue other work. Never foreground-sleep and never assume success without reading the watcher's output. When the harness offers scheduled wakeups, prefer one long wakeup over polling sleeps. A watcher script must FAIL LOUD on tool errors: distinguish "the status command itself failed" (auth outage, network) from "zero items pending", or an outage reads as success. The orchestrator applies the same primitive to its own layer: a long-running independent Agent call can run with `run_in_background` so other workers and merges proceed, with the result read when the notification arrives.
|
|
308
330
|
|
|
331
|
+
**Async-first for independent work.** Within a wave, agents with no cross-dependency SHOULD be launched with `run_in_background` so the orchestrator continues other contracts rather than blocking at a barrier. Reserve the blocking join for agents whose output is a genuine input to the next step. Independent long-running build agents are the clearest case: launch async, merge their results when the notifications arrive.
|
|
332
|
+
|
|
309
333
|
**Deferred tools:** the harness may defer tool schemas (MCP servers, platform tools) behind ToolSearch. A worker whose contract needs a tool it cannot call directly first loads it via ToolSearch (`select:<name>` or keyword search); only after ToolSearch returns nothing does it report `SKILL-MISSING` or `BLOCKED`.
|
|
310
334
|
|
|
311
335
|
Every finding MUST have:
|
|
@@ -369,6 +393,8 @@ The digest also: (a) appends any FAILED -> USE INSTEAD or INEFFICIENT -> FASTER
|
|
|
369
393
|
|
|
370
394
|
Do not try to run `/compact` yourself. It is a user command, not a tool. Context pressure is handled by the PreCompact and SessionStart hooks plus Restart mode.
|
|
371
395
|
|
|
396
|
+
**Mid-run deliverables.** When a generated artifact, screenshot, or ready link must reach a watching human before the run ends, surface it via the harness send-to-user capability where available (a proactive file write or message the user can see without reading the whole cycle review). Do not bury mid-run deliverables in findings only.
|
|
397
|
+
|
|
372
398
|
**CYCLE CLOSE HYGIENE (MANDATORY):** at the end of every COMPRESS phase, the orchestrator MUST run `node <skill-scripts-dir>/cleanup.js` to prune any remaining merged branches and stale worktrees. A run MUST NOT end with leftover merged branches or orphaned worktrees. Run with `--dry-run` first to see what would be removed, then without to apply.
|
|
373
399
|
|
|
374
400
|
## After Done
|
|
@@ -441,9 +467,14 @@ Two founder overrides exist, one per timescale:
|
|
|
441
467
|
|
|
442
468
|
Degradation stays visible: the cycle briefing records the session model, and when the session family is the cheap tier the orchestrator warns in its opening paragraph that the verify layers run on a weak model. A `[model]` tag on a role in COMPANY.md is a request: state the override in the Agent call when the harness supports one, otherwise ignore it. Never claim a model switch happened unless the harness reports it.
|
|
443
469
|
|
|
444
|
-
|
|
470
|
+
**Effort (Fable 5 and compatible models).** Where the active model exposes an `effort` control,
|
|
471
|
+
the cheap/mid/strong contract intent maps to lower/default/higher effort respectively. The harness
|
|
472
|
+
does not set effort automatically; treat it as a launch-time control - pass it in the Agent call or
|
|
473
|
+
via session config where supported. The default on Fable 5 is high effort, which is the right
|
|
474
|
+
default for orchestrator and verify roles. Do not add a `budget_tokens` field: it is deprecated on
|
|
475
|
+
Claude 4.6+ and unsupported on Fable 5.
|
|
445
476
|
|
|
446
|
-
|
|
477
|
+
## Token cost discipline
|
|
447
478
|
|
|
448
479
|
- **Cache-aware prompt layout.** Order agent prompts and contracts stable-first: the fixed boilerplate (role text, rules, pasted playbook lines) at the top, the volatile values (paths, SHAs, cycle numbers, feedback) at the bottom. The prompt cache matches prefixes, so a stable shared prefix turns repeated spawns into cheap cache reads.
|
|
449
480
|
- **Worker tool-output discipline.** grep, head, and tail over cat. Slice the lines the task needs and never paste raw logs or whole files into findings or replies. Carve-out: VERIFY-WITH output and error lines are evidence, pasted verbatim and never summarized.
|