get-claudia 1.61.1 → 1.63.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +21 -1
- package/memory-daemon/claudia_memory/loops/__init__.py +9 -0
- package/memory-daemon/claudia_memory/loops/status.py +85 -0
- package/package.json +1 -1
- package/template-v2/.claude/agents/loop-checker.md +51 -0
- package/template-v2/.claude/manifest.json +11 -6
- package/template-v2/.claude/skills/_loop/README.md +33 -0
- package/template-v2/.claude/skills/_loop/checker.md +64 -0
- package/template-v2/.claude/skills/_loop/maker.md +39 -0
- package/template-v2/.claude/skills/_loop/repair.md +85 -0
- package/template-v2/.claude/skills/auto-research/SKILL.md +47 -9
- package/template-v2/.claude/skills/build-team/SKILL.md +167 -0
- package/template-v2/.claude/skills/hire-agent.md +46 -0
- package/template-v2/.claude/skills/meditate/SKILL.md +19 -0
- package/template-v2/.claude/skills/skill-index.json +15 -0
package/CHANGELOG.md
CHANGED
|
@@ -2,6 +2,27 @@
|
|
|
2
2
|
|
|
3
3
|
All notable changes to Claudia will be documented in this file.
|
|
4
4
|
|
|
5
|
+
## 1.63.0 (2026-06-13)
|
|
6
|
+
|
|
7
|
+
### Added
|
|
8
|
+
|
|
9
|
+
- **`/build-team` skill** (Proposal 11, E6). A user-invoked counterpart to `hire-agent`: it reads the user's profile, judgment rules, and real task history, then proposes a minimal, tailored agent team in one pass. The proposal runs through the independent `loop-checker` (scored against goal alignment, right-sizing, and a "progressive, not overwhelming" hard constraint, bounded to 2 revisions), is written to a `team_status.md` control file, and is gated on explicit user approval. On approval it scaffolds agent definitions and writes `.bak` siblings for any modified file so the change is reversible. Reuses the existing roster first and never auto-spawns agents or takes external actions. New files: `template-v2/.claude/skills/build-team/SKILL.md` (+ `skill-index.json` entry).
|
|
10
|
+
- **Proactive team review** (Proposal 11, E7). `hire-agent` now distinguishes "add one agent" from "the whole team has drifted." When it detects an archetype shift, a new recurring task class the roster does not cover, or two-plus unused agents, it gently suggests reviewing the team as a unit and routes to `/build-team` for the validated, approval-gated, reversible change. One suggestion at a time; never auto-applied; a review can shrink a team as readily as grow it.
|
|
11
|
+
|
|
12
|
+
## 1.62.0 (2026-06-13)
|
|
13
|
+
|
|
14
|
+
### Loop engineering foundation (Proposal 11, Phases 1-2)
|
|
15
|
+
|
|
16
|
+
The first two phases of the Autonomy & Personalization Layer: the Maker-Checker pattern on the existing `auto-research` loop, a bounded self-repair sub-loop, the shared loop infrastructure later loops reuse, and a `meditate` feed that reviews how the loops themselves performed. See `docs/proposals/11-autonomy-personalization-layer.md`.
|
|
17
|
+
|
|
18
|
+
### Added
|
|
19
|
+
|
|
20
|
+
- **Independent Checker for `auto-research`** (Proposal 11, E2). The iteration loop no longer grades its own work. Each iteration now dispatches a new `loop-checker` agent (Haiku, adversarial brief) that scores the artifact against the rubric independently; the Checker's score, not Claudia's, drives the keep/revert decision. When the Checker's score and Claudia's self-score diverge past threshold, the iteration is flagged `contested` and surfaced in the end-of-run summary. New files: `template-v2/.claude/agents/loop-checker.md` and `template-v2/.claude/skills/_loop/` (`maker.md`, `checker.md`, `README.md`).
|
|
21
|
+
- **Loop status-file schema** (Proposal 11, E1). `docs/loop-status-schema.md` defines a standard Markdown-body + YAML-frontmatter control plane (`last_input`, `maker_proposal`, `checker_verdict`, `verified`, `next_action`, and friends) plus the exit-condition standard every loop must declare up front. `auto-research` now writes `research_status.md` in this format each iteration.
|
|
22
|
+
- **Atomic status-file helper** (Proposal 11, E1/B3). New daemon module `claudia_memory/loops/status.py` (`write_status` / `read_status`) writes a status file through a same-directory temp file plus `os.replace`, so an interrupted write never leaves a partial file at the canonical path. Six tests in `memory-daemon/tests/test_loop_status.py` cover round-trip, native-type preservation, parent-dir creation, temp cleanup, and crash-safety.
|
|
23
|
+
- **Self-repair sub-loop** (Proposal 11, E3). When a loop stalls (the Checker fails 3 times running or scores regress), `auto-research` now enters a bounded self-repair pass (`template-v2/.claude/skills/_loop/repair.md`): it diagnoses whether the rubric or a brief is the real problem, proposes one fix, and proves it on the exact failing input before adopting it. Capped at 2 repair attempts; confirmed repairs leave a regression fixture under `~/.claudia/loops/regressions/`; edits to shipped briefs always go to a human approval gate.
|
|
24
|
+
- **`meditate` reviews loop-harness performance** (Proposal 11, E4). End-of-session reflection now reads recent loop status files, summarizes how Claudia's own Maker-Checker loops did (contested iterations, stuck loops, recurring Checker findings), and can propose a harness improvement. Any such proposal passes through the Checker before it is written, and edits to shipped briefs are presented as a diff for approval.
|
|
25
|
+
|
|
5
26
|
## 1.61.1 (2026-05-23)
|
|
6
27
|
|
|
7
28
|
### `claudia` command works from anywhere, and repairs itself
|
|
@@ -31,7 +52,6 @@ This release adds a `claudia` shell command (and `update-claudia`) so you no lon
|
|
|
31
52
|
|
|
32
53
|
- **ASCII greeting logo: row 1 (hair top) realigned over face** (in `template-v2/CLAUDE.md`). The top hair row was indented 6 leading spaces while the face row below it starts at column 0, so the hair appeared shifted ~4 characters to the right. Now sits correctly above the face. Existing installs see the fix on next upgrade.
|
|
33
54
|
- **Installer no longer treats flag-looking arguments as install targets.** Previously, `npx get-claudia --help` (or any unrecognized `--flag`) was interpreted as the install path and would create a `./--help/` directory with templates copied into it. Now: `--help`/`-h` prints a proper usage message and exits, `--version`/`-V` prints the version, and any other leading-dash argument prints `Error: unrecognized argument: <flag>` and exits 1 without touching the filesystem.
|
|
34
|
-
|
|
35
55
|
## 1.60.1 (2026-05-22)
|
|
36
56
|
|
|
37
57
|
### Two bug fixes from the fixing-phase pass
|
|
@@ -0,0 +1,9 @@
|
|
|
1
|
+
"""Loop-engineering helpers (Proposal 11).
|
|
2
|
+
|
|
3
|
+
Shared infrastructure for Maker-Checker loops: atomic, human-readable status
|
|
4
|
+
files that act as the control plane for both skill-level and daemon-level loops.
|
|
5
|
+
"""
|
|
6
|
+
|
|
7
|
+
from claudia_memory.loops.status import read_status, write_status
|
|
8
|
+
|
|
9
|
+
__all__ = ["read_status", "write_status"]
|
|
@@ -0,0 +1,85 @@
|
|
|
1
|
+
"""Atomic status-file helper for loop engineering (Proposal 11, E1/B3).
|
|
2
|
+
|
|
3
|
+
A status file is Markdown with a YAML frontmatter block. The frontmatter carries
|
|
4
|
+
the structured control fields (loop_id, verified, checker_verdict, next_action,
|
|
5
|
+
...) and the body carries the human-readable narrative:
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
loop_id: consolidation
|
|
9
|
+
verified: true
|
|
10
|
+
next_action: none
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# Loop status: consolidation
|
|
14
|
+
|
|
15
|
+
Last run consolidated 12 memories; all invariants held.
|
|
16
|
+
|
|
17
|
+
Writes go through a temp file in the same directory followed by ``os.replace``,
|
|
18
|
+
so an interrupted write never leaves a partially-written file at the canonical
|
|
19
|
+
path. A reader sees either the complete previous file or the complete new one.
|
|
20
|
+
"""
|
|
21
|
+
|
|
22
|
+
from __future__ import annotations
|
|
23
|
+
|
|
24
|
+
import os
|
|
25
|
+
from pathlib import Path
|
|
26
|
+
from typing import Any
|
|
27
|
+
|
|
28
|
+
import yaml
|
|
29
|
+
|
|
30
|
+
_DELIM = "---"
|
|
31
|
+
|
|
32
|
+
|
|
33
|
+
def write_status(path: str | os.PathLike, fields: dict[str, Any], body: str = "") -> None:
|
|
34
|
+
"""Atomically write a status file at ``path``.
|
|
35
|
+
|
|
36
|
+
Serializes ``fields`` as YAML frontmatter and appends ``body`` as the
|
|
37
|
+
Markdown body. Creates missing parent directories. If the final rename
|
|
38
|
+
fails, the canonical path is left untouched and the temp file is removed.
|
|
39
|
+
"""
|
|
40
|
+
target = Path(path)
|
|
41
|
+
target.parent.mkdir(parents=True, exist_ok=True)
|
|
42
|
+
|
|
43
|
+
frontmatter = yaml.safe_dump(fields, sort_keys=False, default_flow_style=False).strip()
|
|
44
|
+
content = f"{_DELIM}\n{frontmatter}\n{_DELIM}\n\n{body}"
|
|
45
|
+
|
|
46
|
+
# Temp file lives in the SAME directory as the target so that os.replace is
|
|
47
|
+
# a same-filesystem rename (atomic), not a cross-device copy.
|
|
48
|
+
tmp = target.with_name(f".{target.name}.tmp")
|
|
49
|
+
try:
|
|
50
|
+
with open(tmp, "w", encoding="utf-8") as fh:
|
|
51
|
+
fh.write(content)
|
|
52
|
+
fh.flush()
|
|
53
|
+
os.fsync(fh.fileno())
|
|
54
|
+
os.replace(tmp, target)
|
|
55
|
+
except BaseException:
|
|
56
|
+
# Any failure (including the rename) must leave the canonical path
|
|
57
|
+
# intact and must not litter the directory with a temp file.
|
|
58
|
+
try:
|
|
59
|
+
os.unlink(tmp)
|
|
60
|
+
except OSError:
|
|
61
|
+
pass
|
|
62
|
+
raise
|
|
63
|
+
|
|
64
|
+
|
|
65
|
+
def read_status(path: str | os.PathLike) -> tuple[dict[str, Any], str]:
|
|
66
|
+
"""Read a status file, returning ``(fields, body)``.
|
|
67
|
+
|
|
68
|
+
``fields`` is the parsed frontmatter (empty dict if there is none); ``body``
|
|
69
|
+
is the Markdown body with the single separating blank line removed.
|
|
70
|
+
"""
|
|
71
|
+
text = Path(path).read_text(encoding="utf-8")
|
|
72
|
+
return _split_frontmatter(text)
|
|
73
|
+
|
|
74
|
+
|
|
75
|
+
def _split_frontmatter(text: str) -> tuple[dict[str, Any], str]:
|
|
76
|
+
if not text.startswith(_DELIM):
|
|
77
|
+
return {}, text
|
|
78
|
+
rest = text[len(_DELIM):].lstrip("\n")
|
|
79
|
+
end = rest.find(f"\n{_DELIM}")
|
|
80
|
+
if end == -1:
|
|
81
|
+
return {}, text
|
|
82
|
+
raw = rest[:end]
|
|
83
|
+
body = rest[end + 1 + len(_DELIM):].lstrip("\n")
|
|
84
|
+
fields = yaml.safe_load(raw) or {}
|
|
85
|
+
return fields, body
|
package/package.json
CHANGED
|
@@ -0,0 +1,51 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: loop-checker
|
|
3
|
+
description: Independently scores a loop iteration's output against a rubric and returns a structured verdict. Adversarial by design: finds faults, does not confirm. Used by auto-research and future Maker-Checker loops.
|
|
4
|
+
model: haiku
|
|
5
|
+
dispatch-category: verification
|
|
6
|
+
dispatch-tier: task
|
|
7
|
+
auto-dispatch: false
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
# Loop Checker
|
|
11
|
+
|
|
12
|
+
You are Claudia's Loop Checker. When a loop produces an iteration, Claudia
|
|
13
|
+
dispatches you to score it independently of whoever produced it. You are the
|
|
14
|
+
"Checker" half of the Maker-Checker pattern (Proposal 11).
|
|
15
|
+
|
|
16
|
+
Your brief is the Checker role brief at `.claude/skills/_loop/checker.md`. Follow
|
|
17
|
+
it exactly. The short version:
|
|
18
|
+
|
|
19
|
+
- You receive a `rubric`, an `artifact` to score, and the `proposed_change` the
|
|
20
|
+
Maker made (with the Maker's expected effect, if any).
|
|
21
|
+
- Score the artifact against each rubric dimension. Sum to a total.
|
|
22
|
+
- Hunt for faults: unmet constraints, regressions the change introduced,
|
|
23
|
+
dimensions the Maker over-claimed.
|
|
24
|
+
- Decide `verified`. When uncertain, default to `verified: false`: a false pass
|
|
25
|
+
costs more than one more iteration.
|
|
26
|
+
- Do not be swayed by the Maker's self-score. Score independently.
|
|
27
|
+
|
|
28
|
+
## Output
|
|
29
|
+
|
|
30
|
+
Return exactly the verdict JSON defined in the Checker role brief:
|
|
31
|
+
|
|
32
|
+
```json
|
|
33
|
+
{
|
|
34
|
+
"verified": false,
|
|
35
|
+
"score": 7.2,
|
|
36
|
+
"max_score": 10,
|
|
37
|
+
"issues": [
|
|
38
|
+
{"severity": "major", "issue": "the ask is buried in paragraph 3", "where": "body"}
|
|
39
|
+
],
|
|
40
|
+
"rationale": "One or two sentences, grounded in the rubric.",
|
|
41
|
+
"hard_constraint_violated": false
|
|
42
|
+
}
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
## Constraints
|
|
46
|
+
|
|
47
|
+
- Do NOT edit the artifact. You score and report; the Maker edits.
|
|
48
|
+
- Do NOT decide keep or revert. You return a verdict; the loop applies the rule.
|
|
49
|
+
- Do NOT soften findings to be agreeable. Disagreement with the Maker is the
|
|
50
|
+
entire reason you exist as a separate agent.
|
|
51
|
+
- Be concrete: every issue names a severity, what is wrong, and where.
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
|
-
"version": "1.
|
|
3
|
-
"generated": "2026-
|
|
2
|
+
"version": "1.63.0",
|
|
3
|
+
"generated": "2026-06-14T02:49:59.295Z",
|
|
4
4
|
"algorithm": "sha256",
|
|
5
5
|
"files": {
|
|
6
6
|
".claude/rules/claudia-principles.md": "939e9720421628e7f2e4c8dfbaa4aeb9c1e18e8c6a5379cd6b772a6835b812e5",
|
|
@@ -10,6 +10,10 @@
|
|
|
10
10
|
".claude/rules/shell-compatibility.md": "565977bc04e269b3ce7d8a7963173df4f44bb9634692ea76abb1b64f6a67513e",
|
|
11
11
|
".claude/rules/trust-north-star.md": "0188b17c26b791cf597ce975bb75d40f543aa3d6b84b7edb1309b78530b3d43f",
|
|
12
12
|
".claude/skills/README.md": "22311215317c61dcb67975af02bdd1d7164578713ff9faf6fbed8aec14256bcd",
|
|
13
|
+
".claude/skills/_loop/README.md": "13d44b9d2f88be477a71773a1dbbe8a3270bdda39a16785beaeff92abaeb019e",
|
|
14
|
+
".claude/skills/_loop/checker.md": "dcec7e38330244973020050bb95e76853ff5f05ebf4ca82b6b442d2eb26f7832",
|
|
15
|
+
".claude/skills/_loop/maker.md": "eadc6c2c59f3e43f4b029c3d40ce3af29cd746da219c132da0888d31a0743e01",
|
|
16
|
+
".claude/skills/_loop/repair.md": "675704a1aa1eb8463dac6270bd02ea34099674a9012e1b65924ef6923b384d5b",
|
|
13
17
|
".claude/skills/agent-dispatcher.md": "b48fc5283f0a88d1ebbc692b30b6c326fef012d4168dcf721a94e3c954b76b90",
|
|
14
18
|
".claude/skills/archetypes/_base-structure.md": "c0d8df77c07aa48cd9ba9c5ba3eddb533fcea790c7f98440fa3d31405fd5c75d",
|
|
15
19
|
".claude/skills/archetypes/consultant.md": "1e0ccbf89115f92a1fa0c86b48edcce22668c4bb828ba4268376fcd6b5c05680",
|
|
@@ -17,11 +21,12 @@
|
|
|
17
21
|
".claude/skills/archetypes/executive.md": "0d43c5c2c6315f5a4d6d36150778b55229cc922e0491a80af54824d87ac52e82",
|
|
18
22
|
".claude/skills/archetypes/founder.md": "a5bf3f22439c7c5f554c13966899a4af4de80f90c8f6a25eb62be1c68de6d54f",
|
|
19
23
|
".claude/skills/archetypes/solo.md": "cc26332b96394775e62a0f1a06f32093be6147bb75fa3783aa271e554d323c6b",
|
|
20
|
-
".claude/skills/auto-research/SKILL.md": "
|
|
24
|
+
".claude/skills/auto-research/SKILL.md": "2f7f8a7d0731813921094b0117026693314f0f59e8d9b52637bd3e775a2e95f8",
|
|
21
25
|
".claude/skills/auto-research/references/program-template.md": "850457de96bde5ad65d24f1018509da94dd31fec9ec45c15087031f470a28dbc",
|
|
22
26
|
".claude/skills/auto-research/references/safety-rules.md": "ae036270e86949ed0f9bface582f6038276adafd3f40a29aa733698111a26683",
|
|
23
27
|
".claude/skills/brain/SKILL.md": "b87dd93c85923c676292bea67f7a0580ee71eea2bea05426ea6eda951a140d5c",
|
|
24
28
|
".claude/skills/brain-monitor/SKILL.md": "6e9ce1e6e40bc612a5126d84d743c1fd7ebc596f60575873533479fdcae5b608",
|
|
29
|
+
".claude/skills/build-team/SKILL.md": "38a33c8f3e84d0b27f4a4a6ac10cc8318e684b0611a9052a1dd47acbd26874d9",
|
|
25
30
|
".claude/skills/capability-suggester.md": "06235c8ab062d356cf9b6c18c7aa44ea667e7694983a46e8ae0a9fe32e734944",
|
|
26
31
|
".claude/skills/capture-meeting/SKILL.md": "815e7587fcd925323c379a6c57e458b3484d034338c1237378641d142c126175",
|
|
27
32
|
".claude/skills/capture-meeting/evals/basic.yaml": "47c4eb1479aeda8067bc877cebfe8688c9b2bb3ce09297e42375cdf558f412e3",
|
|
@@ -40,13 +45,13 @@
|
|
|
40
45
|
".claude/skills/fix-duplicates/SKILL.md": "bd3d6aae0bce09bc844ba93f76abb4a97ff0286c6a4da29cd369763864fd4a86",
|
|
41
46
|
".claude/skills/follow-up-draft/SKILL.md": "49212946cb7b2db39cecd435ab6fcb7cb26cc49666306084d27676543be67281",
|
|
42
47
|
".claude/skills/growth-check/SKILL.md": "a3ed3f2427f3639f81d16b1a2adfa268bafed795d2b0d02621bc6be318784c24",
|
|
43
|
-
".claude/skills/hire-agent.md": "
|
|
48
|
+
".claude/skills/hire-agent.md": "8f46632d6c9b02563a41c00c811e1bcd365614bb5e64d770482f3536fa30e6fe",
|
|
44
49
|
".claude/skills/inbox-check/SKILL.md": "e2cddc91f72ffc0443d105385d2e8b464f86404a4825d308bd03a68e4ee43726",
|
|
45
50
|
".claude/skills/ingest-sources/SKILL.md": "938b93d299649db156ad19e9a7bd5d09e950f53bc2eaf8a85386f9fdbca350d0",
|
|
46
51
|
".claude/skills/ingest-sources/references/extraction-patterns.md": "28d6dc604c4a11eacfe4523e705ad430ba8fe26632bafcccaef656e9618e5aea",
|
|
47
52
|
".claude/skills/judgment-awareness.md": "0077129a87f91d295515108f3f9a68ca6874c6e38fc7f2a73a462704fa00ba2d",
|
|
48
53
|
".claude/skills/map-connections/SKILL.md": "e8133d5e3de36e32e3858a1ee3918a577855e77ade022933097f87a0100cc9a3",
|
|
49
|
-
".claude/skills/meditate/SKILL.md": "
|
|
54
|
+
".claude/skills/meditate/SKILL.md": "1d4eab90fdedd652f75750df1a1c3be885fbeefb89d5d99979324e395eb6fd83",
|
|
50
55
|
".claude/skills/meditate/evals/basic.yaml": "daba441b2fd9d1d4afddcff6eaa9673884198b1d0f85d9d06edbe0738012291a",
|
|
51
56
|
".claude/skills/meeting-prep/SKILL.md": "cb0b2afc41052a0652e7de30d09e71ab489162fbb09c648e3c95cec2d438ba63",
|
|
52
57
|
".claude/skills/memory-audit/SKILL.md": "7daed2ec128ea70171b8e3ffaf4d933ca69db2637b1487f69724e3428c04663a",
|
|
@@ -65,7 +70,7 @@
|
|
|
65
70
|
".claude/skills/research/SKILL.md": "0d317e6350b043ee774916b923957f61dbc174e3caafc0187198fa236a2fd28d",
|
|
66
71
|
".claude/skills/research/references/source-evaluation.md": "64614b7eff83468d7ff76dd640252579f23e69e760969672e9aebe5ceb00f695",
|
|
67
72
|
".claude/skills/risk-surfacer.md": "b01e6bc7a22df414a7ee7a5e4a108088b5d049646723cc7513fac27a3ba4856c",
|
|
68
|
-
".claude/skills/skill-index.json": "
|
|
73
|
+
".claude/skills/skill-index.json": "576ebd98fe5f597a45871366d286cd6765d6bdabefe49633c2d00514518c54bc",
|
|
69
74
|
".claude/skills/skill-router/SKILL.md": "12fc45aebc94d3ead65843cd336a88797a40856e82908cda47256c9cbb4cb60f",
|
|
70
75
|
".claude/skills/skill-router/references/overlap-clusters.md": "bc819b4892cbdf9919db392825230f4cd4eec546cd7f0c9de6c29a2468ab489a",
|
|
71
76
|
".claude/skills/structure-generator.md": "dbe70841ab60599a632687aaa3c3652361483df2fe854640c56982b60622eb19",
|
|
@@ -0,0 +1,33 @@
|
|
|
1
|
+
# _loop: shared loop-engineering resources
|
|
2
|
+
|
|
3
|
+
This directory is **not a skill**. It has no `SKILL.md` and is not listed in
|
|
4
|
+
`skill-index.json`, so the skill router never loads it directly. It holds the
|
|
5
|
+
versioned Maker and Checker prompt templates that skill-level loops reuse, so the
|
|
6
|
+
Maker-Checker pattern is defined once and shared.
|
|
7
|
+
|
|
8
|
+
Consumers today:
|
|
9
|
+
|
|
10
|
+
- `auto-research` (Proposal 11, E2) uses `checker.md` to score each iteration
|
|
11
|
+
independently of the Maker.
|
|
12
|
+
- `build-team` (Proposal 11, E6, not yet built) will reuse both templates.
|
|
13
|
+
|
|
14
|
+
The underscore prefix marks this as an internal resource, not a user-facing
|
|
15
|
+
capability. See `docs/loop-status-schema.md` for the status-file contract these
|
|
16
|
+
loops write, and `docs/proposals/11-autonomy-personalization-layer.md` for the
|
|
17
|
+
design.
|
|
18
|
+
|
|
19
|
+
## Files
|
|
20
|
+
|
|
21
|
+
| File | Role |
|
|
22
|
+
|------|------|
|
|
23
|
+
| `maker.md` | The producer prompt: makes one bounded change per iteration. |
|
|
24
|
+
| `checker.md` | The independent verifier prompt: scores adversarially, returns a structured verdict. |
|
|
25
|
+
| `repair.md` | The self-repair sub-loop: when a loop stalls, diagnose and fix the harness, proven on the exact failing input. |
|
|
26
|
+
|
|
27
|
+
## Why Maker and Checker are separate
|
|
28
|
+
|
|
29
|
+
A single agent grading its own work has a self-justification bias: it tends to
|
|
30
|
+
rate what it just produced as good. Splitting the roles, and giving the Checker
|
|
31
|
+
an adversarial brief (find faults, do not confirm), makes a passing verdict
|
|
32
|
+
actually mean something. The Checker runs on a cheaper model tier (Haiku) so the
|
|
33
|
+
verification pass is fast and inexpensive.
|
|
@@ -0,0 +1,64 @@
|
|
|
1
|
+
# Checker role brief (Proposal 11, E1/B2)
|
|
2
|
+
|
|
3
|
+
The Checker verifies work the Maker produced. It is dispatched as a separate
|
|
4
|
+
agent (`loop-checker`, Haiku) so it reasons independently of the Maker and costs
|
|
5
|
+
little to run. Its brief is adversarial: **find faults, do not confirm.** A
|
|
6
|
+
passing verdict from a Checker told to look for problems is worth something; a
|
|
7
|
+
rubber stamp is not.
|
|
8
|
+
|
|
9
|
+
## What the Checker receives
|
|
10
|
+
|
|
11
|
+
The dispatching skill passes:
|
|
12
|
+
|
|
13
|
+
- `rubric`: the scoring criteria (from the loop's program/rubric file).
|
|
14
|
+
- `artifact`: the current output to score (the edited draft, the proposed team,
|
|
15
|
+
the consolidation result).
|
|
16
|
+
- `proposed_change`: the one change the Maker made this iteration, and the
|
|
17
|
+
Maker's expected effect (the self-score), if any.
|
|
18
|
+
|
|
19
|
+
## How the Checker scores
|
|
20
|
+
|
|
21
|
+
1. Score the artifact against each rubric dimension on its stated scale. Sum to a
|
|
22
|
+
total `score`.
|
|
23
|
+
2. Actively hunt for what is wrong or weak: unmet constraints, regressions the
|
|
24
|
+
change introduced, dimensions the Maker over-claimed, anything the rubric
|
|
25
|
+
penalizes.
|
|
26
|
+
3. Decide `verified`: did this artifact meet the rubric's bar (and not violate
|
|
27
|
+
any hard constraint)? When uncertain, default to `verified: false`. The cost
|
|
28
|
+
of a false pass is higher than the cost of one more iteration.
|
|
29
|
+
4. Do not be swayed by the Maker's self-score. Score independently first, then
|
|
30
|
+
you may note divergence.
|
|
31
|
+
|
|
32
|
+
## Output contract (return exactly this JSON)
|
|
33
|
+
|
|
34
|
+
```json
|
|
35
|
+
{
|
|
36
|
+
"verified": false,
|
|
37
|
+
"score": 7.2,
|
|
38
|
+
"max_score": 10,
|
|
39
|
+
"issues": [
|
|
40
|
+
{"severity": "major", "issue": "the ask is buried in paragraph 3", "where": "body"},
|
|
41
|
+
{"severity": "minor", "issue": "two sentences exceed the length cap", "where": "closing"}
|
|
42
|
+
],
|
|
43
|
+
"rationale": "Lede improved, but the rubric weights ask-clarity heavily and the ask is still not in the opening.",
|
|
44
|
+
"hard_constraint_violated": false
|
|
45
|
+
}
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
- `verified` (bool): does the artifact clear the bar this iteration?
|
|
49
|
+
- `score` (float): total across rubric dimensions.
|
|
50
|
+
- `max_score` (float): the maximum the rubric allows, so the caller can normalize.
|
|
51
|
+
- `issues` (array): concrete, located faults. Empty only when there genuinely are
|
|
52
|
+
none. Each has `severity` (`major` | `minor`), `issue`, and `where`.
|
|
53
|
+
- `rationale` (string): one or two sentences. Why this verdict, grounded in the
|
|
54
|
+
rubric.
|
|
55
|
+
- `hard_constraint_violated` (bool): true if any non-negotiable constraint failed
|
|
56
|
+
(which forces `verified: false` regardless of score).
|
|
57
|
+
|
|
58
|
+
## What the Checker does not do
|
|
59
|
+
|
|
60
|
+
- It does not edit the artifact. It scores and reports; the Maker edits.
|
|
61
|
+
- It does not decide keep/revert. It returns a verdict; the loop applies the
|
|
62
|
+
rule (keep if the score beats the running best and no hard constraint failed).
|
|
63
|
+
- It does not soften findings to be agreeable. Disagreement with the Maker is the
|
|
64
|
+
whole point.
|
|
@@ -0,0 +1,39 @@
|
|
|
1
|
+
# Maker role brief (Proposal 11, E1/B2)
|
|
2
|
+
|
|
3
|
+
The Maker produces work. In a Claudia loop, the Maker is Claudia herself (the
|
|
4
|
+
main model), not a dispatched subagent. This brief defines how the Maker behaves
|
|
5
|
+
inside any Maker-Checker loop, so the producer role is consistent across
|
|
6
|
+
`auto-research`, `build-team`, and future loops.
|
|
7
|
+
|
|
8
|
+
## Contract
|
|
9
|
+
|
|
10
|
+
Each iteration, the Maker does exactly one thing:
|
|
11
|
+
|
|
12
|
+
1. **Read the state.** The program/rubric, the current artifact, and the last
|
|
13
|
+
few rows of history (what was tried, what was kept, what was reverted).
|
|
14
|
+
2. **Propose ONE change.** One specific, bounded edit, justified in a single
|
|
15
|
+
sentence. Not a rewrite. Not a refactor. Not three changes bundled together.
|
|
16
|
+
One lever, pulled deliberately, so the Checker can attribute the score delta
|
|
17
|
+
to it.
|
|
18
|
+
3. **Apply it** to the working copy (never the user's original file during the
|
|
19
|
+
loop).
|
|
20
|
+
4. **Hand off to the Checker.** The Maker does not score its own work for the
|
|
21
|
+
keep/revert decision. It may state an expected effect ("this should raise the
|
|
22
|
+
lede dimension"), which becomes the Maker self-score the Checker's verdict is
|
|
23
|
+
compared against, but the Checker's score governs.
|
|
24
|
+
|
|
25
|
+
## Discipline
|
|
26
|
+
|
|
27
|
+
- One change per iteration. Bundling changes destroys the signal.
|
|
28
|
+
- Justify in one sentence. If the justification needs a paragraph, the change is
|
|
29
|
+
too big; split it.
|
|
30
|
+
- Stay inside the bounds declared up front (see the exit-condition standard in
|
|
31
|
+
`docs/loop-status-schema.md`).
|
|
32
|
+
- Honest self-assessment. Do not inflate the expected effect to pre-empt the
|
|
33
|
+
Checker. The point of the split is that the Checker is not you.
|
|
34
|
+
|
|
35
|
+
## Why the Maker does not grade itself
|
|
36
|
+
|
|
37
|
+
A model that grades its own output rates it generously: it just argued itself
|
|
38
|
+
into producing exactly that. The keep/revert decision therefore belongs to an
|
|
39
|
+
independent Checker (see `checker.md`). The Maker proposes; the Checker disposes.
|
|
@@ -0,0 +1,85 @@
|
|
|
1
|
+
# Self-repair brief (Proposal 11, E3)
|
|
2
|
+
|
|
3
|
+
When a Maker-Checker loop keeps failing, the problem may not be the artifact: it
|
|
4
|
+
may be the harness (the rubric, the Maker brief, or the Checker brief). The
|
|
5
|
+
self-repair sub-loop diagnoses that, proposes a fix, and proves the fix on the
|
|
6
|
+
exact input that failed before adopting it. It is itself bounded, and it never
|
|
7
|
+
auto-edits shipped briefs or the user's live files.
|
|
8
|
+
|
|
9
|
+
This brief is shared. `auto-research` uses it today; future loops reuse it.
|
|
10
|
+
|
|
11
|
+
## B1: When self-repair triggers
|
|
12
|
+
|
|
13
|
+
Enter the self-repair sub-loop when any of these holds during a loop:
|
|
14
|
+
|
|
15
|
+
- The Checker returns `verified: false` for **3 iterations in a row**.
|
|
16
|
+
- The Checker's `score` **regresses** across 3 consecutive iterations (each lower
|
|
17
|
+
than the last).
|
|
18
|
+
- An iteration raises an **exception** the loop cannot attribute to the artifact
|
|
19
|
+
(for example, the Checker returns malformed output twice).
|
|
20
|
+
|
|
21
|
+
These are signals that hill-climbing has stalled for a structural reason, not
|
|
22
|
+
that one more edit will help.
|
|
23
|
+
|
|
24
|
+
## B2: The repair sub-loop
|
|
25
|
+
|
|
26
|
+
Self-repair is itself a small Maker-Checker loop, run on the harness instead of
|
|
27
|
+
the artifact:
|
|
28
|
+
|
|
29
|
+
1. **Diagnose (Maker).** Read the last few status rows and the Checker's `issues`.
|
|
30
|
+
State, in one or two sentences, the single most likely harness cause: an
|
|
31
|
+
ambiguous rubric dimension, a Maker brief that permits bundled changes, a
|
|
32
|
+
Checker brief that scores the wrong thing, a missing hard constraint.
|
|
33
|
+
2. **Propose one fix (Maker).** One concrete edit to exactly one harness piece
|
|
34
|
+
(the rubric in `program.md`, `maker.md`, or `checker.md`). Not a rewrite. One
|
|
35
|
+
lever.
|
|
36
|
+
3. **Validate on the exact failing input (Checker).** Re-run the failing
|
|
37
|
+
iteration's input through the loop **with the proposed fix applied in a scratch
|
|
38
|
+
copy**, and dispatch the `loop-checker`. The fix is adopted only if that exact
|
|
39
|
+
previously-failing case now reaches `verified: true` (or scores materially
|
|
40
|
+
higher) and no previously-passing regression fixture breaks.
|
|
41
|
+
4. **Adopt or discard.** If it passes, apply the fix (see B4 for where).
|
|
42
|
+
If it does not, discard it and either try one more diagnosis (within the B4
|
|
43
|
+
bound) or stop and hand the loop back to the user with the diagnosis.
|
|
44
|
+
|
|
45
|
+
The non-negotiable rule is step 3: a fix is never adopted on a hunch. It must be
|
|
46
|
+
proven on the input that exposed the problem.
|
|
47
|
+
|
|
48
|
+
## B3: Regression capture
|
|
49
|
+
|
|
50
|
+
Every confirmed repair leaves a regression behind so the same failure cannot
|
|
51
|
+
silently return:
|
|
52
|
+
|
|
53
|
+
- Save the failing input plus the verdict it should now produce as a fixture
|
|
54
|
+
under `~/.claudia/loops/regressions/<loop-id>/<short-slug>.md` (a status file in
|
|
55
|
+
the standard schema, with the expected `verified`/`score`).
|
|
56
|
+
- On future runs of that loop, replay the loop-id's regression fixtures through
|
|
57
|
+
the Checker first. If any regression breaks, the loop refuses to start and
|
|
58
|
+
reports which fix regressed.
|
|
59
|
+
|
|
60
|
+
Shipped-skill regressions (ones that should travel with the template, not just
|
|
61
|
+
the user's machine) are proposed to the user for inclusion as in-repo test
|
|
62
|
+
fixtures instead.
|
|
63
|
+
|
|
64
|
+
## B4: Bound and safety
|
|
65
|
+
|
|
66
|
+
Self-repair obeys the same bounded-autonomy rules as every loop:
|
|
67
|
+
|
|
68
|
+
- **Bounded.** At most **2 repair attempts** per loop run. After that, stop and
|
|
69
|
+
hand back to the user with the diagnosis. Self-repair never loops forever.
|
|
70
|
+
- **Workspace-only for artifacts.** The repair sub-loop validates in a scratch
|
|
71
|
+
copy and never touches the user's original file.
|
|
72
|
+
- **Never auto-edit shipped briefs.** A fix to `maker.md` or `checker.md` (files
|
|
73
|
+
that ship in the template) is **proposed as a diff for the user to approve**,
|
|
74
|
+
never written silently. A fix to a run-local `program.md` rubric may be applied
|
|
75
|
+
within the run, since that rubric is the user's own per-run governance file.
|
|
76
|
+
- **Logged.** Each repair attempt appends to the loop's status file: what was
|
|
77
|
+
diagnosed, what was changed, and whether the exact failing input passed after
|
|
78
|
+
the change.
|
|
79
|
+
|
|
80
|
+
## What self-repair is not
|
|
81
|
+
|
|
82
|
+
- It is not a license to rewrite the harness whenever a loop is hard. It triggers
|
|
83
|
+
only on the B1 signals, and it changes one thing at a time.
|
|
84
|
+
- It is not autonomous prompt-editing of shipped files. Those changes always end
|
|
85
|
+
at a human approval gate.
|
|
@@ -14,11 +14,11 @@ A hill-climber for any local artifact with a rubric. Inspired by [Karpathy's aut
|
|
|
14
14
|
Three primitives, one governance file:
|
|
15
15
|
|
|
16
16
|
1. **The artifact** — what gets iterated on (a draft email, a brief, a wiki page, a one-pager). Lives in the workspace, never touches the user's original file during the loop.
|
|
17
|
-
2. **The evaluator** — a scalar rubric
|
|
17
|
+
2. **The evaluator** — a scalar rubric. Plain English. An independent Checker (the `loop-checker` agent, Haiku) scores each iteration against it, not Claudia herself. The rubric must produce a number for the original artifact before the loop starts.
|
|
18
18
|
3. **The budget** — iteration count (default 20). Loop stops when budget is exhausted or score plateaus.
|
|
19
19
|
4. **The program** (governance file) — what the user wants, what's off-limits, what counts as "better." Claudia helps the user write this if they don't volunteer it.
|
|
20
20
|
|
|
21
|
-
Loop: read program + artifact + results history
|
|
21
|
+
Loop: read program + artifact + results history, Claudia (the Maker) proposes one specific change, implements it, then dispatches the `loop-checker` (the Checker) to score it independently. If the Checker's score beats the running best, ratchet (commit); if worse, revert (git reset). Write the status file. Report each iteration as a one-line update. Repeat until budget is hit.
|
|
22
22
|
|
|
23
23
|
## When to invoke this skill
|
|
24
24
|
|
|
@@ -53,7 +53,8 @@ Each run gets its own workspace at `~/.claudia/auto-research/<task-id>/`:
|
|
|
53
53
|
├── program.md the brief (user-authored with Claudia's help)
|
|
54
54
|
├── artifact.md (or .txt, the working copy that gets edited)
|
|
55
55
|
├── original.md immutable copy of the input, for reference and diff
|
|
56
|
-
├── results.tsv one row per iteration: timestamp, score, kept/reverted, change-summary
|
|
56
|
+
├── results.tsv one row per iteration: timestamp, score, kept/reverted/contested, change-summary
|
|
57
|
+
├── research_status.md loop control plane (see docs/loop-status-schema.md): iteration, verified, score, checker_verdict, next_action
|
|
57
58
|
├── best.md symlink (or copy on Windows) to the highest-scoring version
|
|
58
59
|
└── iterations/
|
|
59
60
|
├── 01/artifact.md
|
|
@@ -111,16 +112,50 @@ For each iteration N:
|
|
|
111
112
|
1. **Read the state.** Read `program.md`, `artifact.md`, last 3 rows of `results.tsv`.
|
|
112
113
|
2. **Propose one change.** One specific edit, justified in one sentence. Not a rewrite. Not a refactor. One change.
|
|
113
114
|
3. **Implement.** Apply the edit to `artifact.md`. Save.
|
|
114
|
-
4. **Score.**
|
|
115
|
-
5. **Decide.** If
|
|
116
|
-
6. **
|
|
117
|
-
7. **
|
|
115
|
+
4. **Score (independent Checker).** Dispatch the `loop-checker` agent (Haiku) with the rubric from `program.md`, the new `artifact.md`, and the one change you just made plus your expected effect (your Maker self-score). The Checker returns a verdict JSON: `verified`, `score`, `max_score`, `issues`, `rationale`, `hard_constraint_violated`. You do NOT score it yourself; the Checker's `score` governs the decision. (See `.claude/skills/_loop/checker.md`.)
|
|
116
|
+
5. **Decide.** If the Checker's `score` > best score so far and `hard_constraint_violated` is false: ratchet. Copy `artifact.md` to `iterations/NN/`, update `best.md` to point at the new winner, append a row to `results.tsv` marked `kept`. Otherwise: revert. Copy `iterations/(best_iter)/artifact.md` back to `artifact.md`, append a row marked `reverted`.
|
|
117
|
+
6. **Flag divergence.** Compare the Checker's `score` against your Maker self-score. If they diverge by more than 1.5 points on a 10-point scale, mark the iteration `contested` in `results.tsv`. The Checker's verdict still governs; the flag just surfaces the disagreement in the end-of-run summary.
|
|
118
|
+
7. **Write the status file.** Atomically update `research_status.md` with `iteration`, `verified`, `score`, `budget_remaining`, `last_input`, `maker_proposal`, `checker_verdict` (the Checker's rationale), and `next_action`. Write to a temp sibling and rename; never edit it in place. (See `docs/loop-status-schema.md`.)
|
|
119
|
+
8. **Report.** One line to the user: `iter N: score 7.4 (kept), change: tightened lede to one sentence`. Append `[contested]` when flagged.
|
|
120
|
+
9. **Check stop conditions and self-repair.** If the Checker has returned `verified: false` 3 times in a row, or its score has regressed for 3 straight iterations, enter the self-repair sub-loop (see `.claude/skills/_loop/repair.md`) before giving up: it diagnoses whether the rubric or a brief is the real problem, proposes one fix, and proves it on the exact failing input. If budget exhausted: stop. If score plateaued (no improvement for 5 iterations): stop. Otherwise: next iteration.
|
|
121
|
+
|
|
122
|
+
## The independent Checker (Maker-Checker)
|
|
123
|
+
|
|
124
|
+
This loop separates the producer from the verifier (Proposal 11, E2). Claudia is
|
|
125
|
+
the **Maker**: she reads the state and proposes one change. The **Checker** is a
|
|
126
|
+
separate agent (`loop-checker`, Haiku) that scores the result independently. The
|
|
127
|
+
two never collapse into one model grading its own work, which is the bias the
|
|
128
|
+
split exists to remove.
|
|
129
|
+
|
|
130
|
+
- The Checker's `score`, not Claudia's, drives keep/revert. Claudia's expected
|
|
131
|
+
effect is only a self-score, used to detect disagreement.
|
|
132
|
+
- The Checker is adversarial by brief: it hunts for faults and defaults to
|
|
133
|
+
`verified: false` when uncertain. A passing verdict therefore means something.
|
|
134
|
+
- A `hard_constraint_violated: true` verdict forces a revert regardless of score.
|
|
135
|
+
|
|
136
|
+
**Contested iterations.** When the Checker's score and Claudia's self-score
|
|
137
|
+
diverge by more than 1.5 points (10-point scale), the iteration is marked
|
|
138
|
+
`contested`. The Checker still governs the decision; the flag makes the
|
|
139
|
+
disagreement visible. The end-of-run summary reports how many iterations were
|
|
140
|
+
contested, which is a useful signal that the rubric is ambiguous or that Claudia
|
|
141
|
+
was over-optimistic.
|
|
142
|
+
|
|
143
|
+
**Cost.** The Checker runs on Haiku, so the verification pass is cheap and fast.
|
|
144
|
+
With the default 20-iteration budget, that is at most 21 Checker dispatches
|
|
145
|
+
(including the baseline), each scoped to a single small artifact.
|
|
146
|
+
|
|
147
|
+
**Self-repair.** If the loop stalls (the Checker fails 3 times running, or scores
|
|
148
|
+
regress), `auto-research` enters a bounded self-repair sub-loop that treats the
|
|
149
|
+
harness, not the artifact, as the suspect: it diagnoses the rubric or a brief,
|
|
150
|
+
proposes one fix, and proves it on the exact failing input before adopting it.
|
|
151
|
+
See `.claude/skills/_loop/repair.md`. It is capped at 2 repair attempts and never
|
|
152
|
+
auto-edits shipped briefs (those changes go to a human approval gate).
|
|
118
153
|
|
|
119
154
|
## Safety rules (mandatory, follow without exception)
|
|
120
155
|
|
|
121
156
|
1. **Workspace-only edits.** The loop never writes to a file outside `~/.claudia/auto-research/<task-id>/`. The user's original file at `/Users/.../wherever.md` is untouched until the user explicitly approves the final hand-off.
|
|
122
157
|
2. **No external actions during the loop.** No sending, no posting, no scheduling, no calling tools that affect the outside world. The loop is a closed system.
|
|
123
|
-
3. **Baseline-score gate.** Before iteration 1, score the original artifact and write it to `results.tsv` as `iter 0: score X (baseline)
|
|
158
|
+
3. **Baseline-score gate.** Before iteration 1, dispatch the `loop-checker` to score the original artifact and write it to `results.tsv` as `iter 0: score X (baseline)` and to `research_status.md` as iteration 0. If the Checker cannot produce a number for the original, the loop does not start; Claudia tells the user the rubric needs to be more specific.
|
|
124
159
|
4. **Bounded budget.** Default 20 iterations. User can set higher (50 max) or lower. Plateau detection (no improvement for 5 in a row) stops early. The loop is not allowed to be "open-ended" the way Karpathy's autoresearch is; Claudia's safety principles require an end.
|
|
125
160
|
5. **Per-iteration revertability.** Each iteration is a git commit inside the workspace. Workspace is a fresh git repo (`git init` on start, `git commit -am` per iteration). At any point: `git log` shows the history, `git checkout` rolls back to any prior iteration.
|
|
126
161
|
6. **User interrupt at any time.** If the user says "stop", "pause", "show me what you have", or otherwise interrupts: stop the loop immediately. The workspace persists. The best version so far is in `best.md`. The user can resume by saying "continue" (loop picks up from the current best).
|
|
@@ -178,10 +213,13 @@ When the user gives a bad rubric, help them turn it into a good one before start
|
|
|
178
213
|
- `wiki` for iterating wiki pages specifically (compress, sharpen, restructure)
|
|
179
214
|
- `draft-reply`, `follow-up-draft` for one-shot draft generation
|
|
180
215
|
- `summarize-doc` for one-pass summarization
|
|
216
|
+
- `.claude/skills/_loop/` for the shared Maker and Checker role briefs
|
|
217
|
+
- `.claude/agents/loop-checker.md` for the Checker agent definition
|
|
218
|
+
- `docs/loop-status-schema.md` for the `research_status.md` control-plane format
|
|
181
219
|
|
|
182
220
|
## Open questions for future versions
|
|
183
221
|
|
|
184
|
-
- Shell-command evaluators (run `node test.js my-draft.md`, take the number it prints) are deferred. Today,
|
|
222
|
+
- Shell-command evaluators (run `node test.js my-draft.md`, take the number it prints) are deferred. Today, the `loop-checker` agent scores against the rubric in `program.md`.
|
|
185
223
|
- Token-spend budget and wall-clock budget are deferred. Today, only iteration count is supported.
|
|
186
224
|
- Parallel branches (try 3 different changes in parallel, keep the winner) are deferred. Today, sequential only.
|
|
187
225
|
- Wiki-page iteration as a first-class use case (apply auto-research to a wiki page until it's < N words while citing all sources) is deferred. The skill supports this in principle today, but a worked example doesn't ship until the wiki has earned its keep.
|
|
@@ -0,0 +1,167 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: build-team
|
|
3
|
+
description: Propose a personalized team of specialized agents based on the user's profile, goals, and how they actually work. Runs the proposal through an independent Checker, gates on the user's approval, and applies with rollback. Use when the user says "build my team", "set up my agents", "what agents should I have", "design my team", "/build-team", or asks Claudia to tailor her team to their work. See also: `hire-agent` for proactive single-agent suggestions; `structure-generator` for folder scaffolding; `capability-suggester` for command-level additions.
|
|
4
|
+
argument-hint: "[optional: focus area, e.g. 'around client work']"
|
|
5
|
+
invocation: explicit
|
|
6
|
+
effort-level: high
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Build Team
|
|
10
|
+
|
|
11
|
+
The user-invoked counterpart to `hire-agent`. Where `hire-agent` suggests one
|
|
12
|
+
agent reactively when it spots a repeated pattern, `build-team` looks at the
|
|
13
|
+
whole picture (the user's profile, judgment rules, and real task history) and
|
|
14
|
+
proposes a tailored team in one pass. It is a Maker-Checker flow (Proposal 11,
|
|
15
|
+
E6): Claudia proposes, an independent Checker validates, the user approves, and
|
|
16
|
+
only then is anything written.
|
|
17
|
+
|
|
18
|
+
## What this is, and is not
|
|
19
|
+
|
|
20
|
+
- It **proposes and, on approval, scaffolds** agent definitions. It never
|
|
21
|
+
auto-spawns agents or takes external actions.
|
|
22
|
+
- It **reuses the existing roster first** (the agents in `.claude/agents/`) and
|
|
23
|
+
adds a new role only when the user's work clearly justifies one.
|
|
24
|
+
- It **starts minimal.** A small team that covers the real work beats a sprawling
|
|
25
|
+
org chart the user will never use. Growth comes later, from `hire-agent` and
|
|
26
|
+
the proactive team-update suggestions (Proposal 11, E7).
|
|
27
|
+
- It does **not** duplicate `agent-dispatcher` routing logic, and it does not
|
|
28
|
+
change how Claudia delegates. It only proposes which agents exist.
|
|
29
|
+
|
|
30
|
+
## When to fire
|
|
31
|
+
|
|
32
|
+
Trigger phrases: "build my team", "set up my agents", "what agents should I
|
|
33
|
+
have", "design my team", "/build-team", "tailor your team to my work".
|
|
34
|
+
|
|
35
|
+
Also the landing spot when a proactive suggestion (from `capability-suggester`
|
|
36
|
+
or `hire-agent`) escalates from "add one agent" to "let's set up your whole
|
|
37
|
+
team".
|
|
38
|
+
|
|
39
|
+
Do NOT fire for: adding a single agent for one observed pattern (use
|
|
40
|
+
`hire-agent`), folder/structure changes (use `structure-generator`), or
|
|
41
|
+
command/workflow additions (use `capability-suggester`).
|
|
42
|
+
|
|
43
|
+
## The flow
|
|
44
|
+
|
|
45
|
+
### Step 1: Read the profile
|
|
46
|
+
|
|
47
|
+
Gather, silently:
|
|
48
|
+
|
|
49
|
+
- `context/me.md`: role, archetype, priorities, the shape of a typical week.
|
|
50
|
+
- `context/judgment.yaml` (if it exists): priorities, delegation preferences,
|
|
51
|
+
surfacing rules. A user who said "just auto-process transcripts" wants a
|
|
52
|
+
processing agent; a user who said "always run things by me" wants a smaller,
|
|
53
|
+
more ask-first team.
|
|
54
|
+
- Recent task patterns: what the user has actually asked Claudia to do
|
|
55
|
+
repeatedly (lean on the same signals `hire-agent` uses, plus `memory_recall`
|
|
56
|
+
for recurring work). The team should map to real work, not a generic template.
|
|
57
|
+
|
|
58
|
+
If `context/me.md` does not exist, the user has not onboarded. Route to
|
|
59
|
+
onboarding first; do not propose a team into a vacuum.
|
|
60
|
+
|
|
61
|
+
### Step 2: Maker proposes the team
|
|
62
|
+
|
|
63
|
+
Claudia (the Maker) drafts a team:
|
|
64
|
+
|
|
65
|
+
1. Start from the **minimal seed for the user's archetype** (see the seed table
|
|
66
|
+
below).
|
|
67
|
+
2. Adjust to the user's actual work: drop a seed role they will not use, add a
|
|
68
|
+
role a recurring task clearly needs.
|
|
69
|
+
3. For each role, write a one-line **rationale tied to the user's real work**
|
|
70
|
+
("You process client transcripts most weeks, so a Document Processor handles
|
|
71
|
+
the extraction while I keep the judgment"). A role with no concrete rationale
|
|
72
|
+
does not belong in the proposal.
|
|
73
|
+
4. Every new (non-roster) role must pass the `hire-agent` candidate test:
|
|
74
|
+
compute-heavy, judgment-light, structured in and out, repeatable. If it needs
|
|
75
|
+
relationship context or would take external actions, it is not an agent.
|
|
76
|
+
|
|
77
|
+
### Step 3: The Checker validates (bounded)
|
|
78
|
+
|
|
79
|
+
Dispatch the `loop-checker` agent with the team proposal as the `artifact` and
|
|
80
|
+
the team rubric below. The Checker scores independently and returns the standard
|
|
81
|
+
verdict JSON (`verified`, `score`, `issues`, `hard_constraint_violated`).
|
|
82
|
+
|
|
83
|
+
If the Checker returns `verified: false` (most often: too large, or a role that
|
|
84
|
+
needs judgment), the Maker **revises once** based on the issues (usually by
|
|
85
|
+
trimming) and re-checks. Cap at **2 revisions**. This is the same bounded
|
|
86
|
+
Maker-Checker discipline every loop follows; the team proposal is not exempt.
|
|
87
|
+
|
|
88
|
+
**The team rubric:**
|
|
89
|
+
|
|
90
|
+
| Dimension | 10 | 0 |
|
|
91
|
+
|-----------|----|----|
|
|
92
|
+
| Goal alignment | Every role maps to a stated priority or a recurring task in the profile | Roles are generic, not tied to this user |
|
|
93
|
+
| Right-sized | The minimum team that covers the work (progressive, not overwhelming) | Sprawling; roles the user will not use |
|
|
94
|
+
| Judgment-safe | Every role is compute-heavy and judgment-light | A role would take external actions or need relationship judgment |
|
|
95
|
+
| Reuse-first | Uses existing roster agents where they fit; new roles only when justified | Invents roles that duplicate existing agents |
|
|
96
|
+
| Tier-correct | Haiku for structured extraction, Sonnet for multi-turn research | Mismatched model tiers |
|
|
97
|
+
|
|
98
|
+
**Hard constraints** (any one forces `verified: false`): more than 5 roles in a
|
|
99
|
+
first team, or any role that would require relationship judgment or take external
|
|
100
|
+
actions.
|
|
101
|
+
|
|
102
|
+
### Step 4: Status file + approval gate
|
|
103
|
+
|
|
104
|
+
Write `team_status.md` in the standard schema (`docs/loop-status-schema.md`):
|
|
105
|
+
`last_input` (one-line profile summary), `maker_proposal` (the team), `score`,
|
|
106
|
+
`checker_verdict`, `verified`, `next_action: await user approval`. Write it to
|
|
107
|
+
the workspace, atomically (temp sibling then rename).
|
|
108
|
+
|
|
109
|
+
Then present the validated proposal to the user: each role, its model tier, and
|
|
110
|
+
its one-line rationale, plus what is reused vs new. **Write nothing to
|
|
111
|
+
`.claude/agents/` or `.claude/skills/` yet.** This is a Human-Approved action
|
|
112
|
+
(see `claudia-principles.md`). Ask plainly: "Want me to set this team up?"
|
|
113
|
+
|
|
114
|
+
### Step 5: Apply and rollback (only after explicit approval)
|
|
115
|
+
|
|
116
|
+
On a clear yes:
|
|
117
|
+
|
|
118
|
+
1. For each **new** role, scaffold `.claude/agents/<name>.md` using the agent
|
|
119
|
+
definition format from `hire-agent` (frontmatter: `name`, `description`,
|
|
120
|
+
`model`, `dispatch-category`, `dispatch-tier`, `auto-dispatch`; body: role,
|
|
121
|
+
job, output format, constraints).
|
|
122
|
+
2. For any **existing** file you modify (for example, adding the team to
|
|
123
|
+
`.claude/agents/README.md`), write a `.bak` sibling first, the same
|
|
124
|
+
mechanism the installer upgrade flow uses. New files need no `.bak`.
|
|
125
|
+
3. Report exactly what was created or changed, and how to undo it: "Rollback by
|
|
126
|
+
restoring the `.bak` files and deleting the new agent files, or just say
|
|
127
|
+
'undo the team setup' and I will."
|
|
128
|
+
4. Do not edit `agent-dispatcher.md` routing here. The new agents exist; routing
|
|
129
|
+
them is the dispatcher's job and can be tuned separately.
|
|
130
|
+
|
|
131
|
+
Rollback on request: restore each `.bak`, delete each newly created agent file,
|
|
132
|
+
and report what was reverted.
|
|
133
|
+
|
|
134
|
+
## Minimal seed teams (Proposal 11, D5)
|
|
135
|
+
|
|
136
|
+
Starting points, not prescriptions. Every seed gets pruned and adjusted to the
|
|
137
|
+
user's actual work in Step 2. All roles are drawn from the existing roster.
|
|
138
|
+
|
|
139
|
+
| Archetype | Minimal seed |
|
|
140
|
+
|-----------|-------------|
|
|
141
|
+
| Consultant / Advisor | Document Archivist (intake), Document Processor (extract from client docs and transcripts), Research Scout (client and market research) |
|
|
142
|
+
| Executive / Manager | Document Processor (reports and decks), Schedule Analyst (calendar patterns), Research Scout |
|
|
143
|
+
| Founder / Entrepreneur | Research Scout (market and competitor), Document Processor (investor and product docs), Document Archivist |
|
|
144
|
+
| Solo Professional | Document Archivist, Document Processor |
|
|
145
|
+
| Content Creator | Research Scout (topic research), Document Processor (transcript and draft extraction), Canvas Generator (visual) |
|
|
146
|
+
|
|
147
|
+
Two or three roles is a healthy first team. Five is the ceiling, not the target.
|
|
148
|
+
|
|
149
|
+
## Safety
|
|
150
|
+
|
|
151
|
+
- **Approval gate is non-negotiable.** Nothing is written to `agents/` or
|
|
152
|
+
`skills/` without an explicit yes (Step 4).
|
|
153
|
+
- **Reversible.** `.bak` siblings for any modified file; new files are
|
|
154
|
+
deletable; "undo the team setup" restores the prior state.
|
|
155
|
+
- **Judgment stays human.** Agents handle processing. Relationship judgment,
|
|
156
|
+
strategy, and external actions never get delegated to a generated agent.
|
|
157
|
+
- **Minimal by default.** When the Checker and the Maker disagree on size, the
|
|
158
|
+
smaller team wins. The user can always add more later.
|
|
159
|
+
|
|
160
|
+
## See also
|
|
161
|
+
|
|
162
|
+
- `hire-agent` for proactive, single-agent suggestions from one observed pattern.
|
|
163
|
+
- `structure-generator` for the archetype folder structures this composes with.
|
|
164
|
+
- `capability-suggester` for command and workflow additions (not agents).
|
|
165
|
+
- `.claude/agents/README.md` for the current roster and the two-tier model.
|
|
166
|
+
- `.claude/skills/_loop/checker.md` for the Checker brief the validation uses.
|
|
167
|
+
- `docs/loop-status-schema.md` for the `team_status.md` format.
|
|
@@ -147,3 +147,49 @@ After creating a new agent, I monitor:
|
|
|
147
147
|
If an agent isn't being used or consistently needs my intervention, I might suggest retiring it:
|
|
148
148
|
|
|
149
149
|
"The LinkedIn Processor hasn't been used in 3 weeks. Want me to remove it to keep things simple?"
|
|
150
|
+
|
|
151
|
+
## Proactive team review (Proposal 11, E7)
|
|
152
|
+
|
|
153
|
+
Single-agent suggestions (above) cover incremental growth. Sometimes the whole
|
|
154
|
+
team has drifted from how the user now works, and the right move is to review the
|
|
155
|
+
team as a unit, not bolt on one more agent.
|
|
156
|
+
|
|
157
|
+
### Drift triggers
|
|
158
|
+
|
|
159
|
+
I raise a team review (not just a single hire) when I notice:
|
|
160
|
+
|
|
161
|
+
- **Archetype shift.** The user's `context/me.md` archetype has changed, or their
|
|
162
|
+
described work no longer matches it (a consultant who is now mostly building a
|
|
163
|
+
product).
|
|
164
|
+
- **A new recurring task class.** A whole category of repeated work has appeared
|
|
165
|
+
that the current roster does not cover, not just one task type.
|
|
166
|
+
- **Roster drift.** Two or more agents have gone unused for weeks while the user
|
|
167
|
+
keeps doing a kind of work by hand.
|
|
168
|
+
|
|
169
|
+
These are about the shape of the team, not a single gap. One missing agent is a
|
|
170
|
+
`hire-agent` suggestion; a team that no longer fits is a team review.
|
|
171
|
+
|
|
172
|
+
### Suggest a diff, then route to /build-team
|
|
173
|
+
|
|
174
|
+
When a drift trigger fires, I suggest the change as a **team diff**, gently and
|
|
175
|
+
as one suggestion:
|
|
176
|
+
|
|
177
|
+
```
|
|
178
|
+
"The way you work has shifted toward [X] over the last while. Your current team
|
|
179
|
+
was set up for [Y]. Want me to review the whole team? Roughly, I'd add [role],
|
|
180
|
+
retire [unused role], and keep the rest."
|
|
181
|
+
```
|
|
182
|
+
|
|
183
|
+
If the user says yes, I route to `/build-team`, which does the real work: it reads
|
|
184
|
+
the current profile, proposes the adjusted team, validates it through the Checker,
|
|
185
|
+
shows it for approval, and applies it with `.bak` rollback. I do not re-implement
|
|
186
|
+
the proposal, validation, or apply logic here. This skill only notices the drift
|
|
187
|
+
and offers the review; `build-team` owns the change.
|
|
188
|
+
|
|
189
|
+
### Discipline
|
|
190
|
+
|
|
191
|
+
- One suggestion, not a flood. If the user declines, I drop it and do not re-raise
|
|
192
|
+
until a new, distinct drift signal appears.
|
|
193
|
+
- Never auto-apply. A team review is always proposed, never performed silently.
|
|
194
|
+
- Minimal still wins. A review can shrink a team as readily as grow it; an unused
|
|
195
|
+
agent is friction, not value.
|
|
@@ -42,6 +42,7 @@ Silently retrieve:
|
|
|
42
42
|
|
|
43
43
|
- Call the `memory_reflections` MCP tool to see what already exists
|
|
44
44
|
- Call the `memory_session_context` MCP tool for recent context (if available)
|
|
45
|
+
- Read recent loop status files (`~/.claudia/loops/*_status.md`, and any `research_status.md` from `auto-research` runs this session) to see how Claudia's own Maker-Checker loops performed: which iterations were `contested`, which loops never reached `verified`, where budgets ran out. This feeds the harness review in Step 2b.
|
|
45
46
|
|
|
46
47
|
### Step 2: Generate Reflections
|
|
47
48
|
|
|
@@ -78,6 +79,18 @@ Review the session and identify 1-3 reflections. Ask yourself:
|
|
|
78
79
|
|
|
79
80
|
**Quality over quantity.** One genuine insight beats three generic observations.
|
|
80
81
|
|
|
82
|
+
### Step 2b: Loop harness review (only if loops ran this session)
|
|
83
|
+
|
|
84
|
+
If any Maker-Checker loop ran this session (`auto-research`, or a wrapped daemon job), review how the harness itself performed, separately from reflections about the user. Read the signal from the status files gathered in Step 1, not from memory:
|
|
85
|
+
|
|
86
|
+
- **Contested iterations.** Did the Maker (Claudia) and the Checker disagree often? Frequent `contested` flags usually mean the rubric is ambiguous or Claudia was over-optimistic, not that the Checker is wrong.
|
|
87
|
+
- **Stuck loops.** Did a loop burn its whole budget without reaching `verified`? That points at the rubric, the Maker brief, or the Checker brief needing sharpening.
|
|
88
|
+
- **Recurring failure shape.** Across runs, is the same kind of issue showing up in the Checker's `issues` list?
|
|
89
|
+
|
|
90
|
+
If a clear, repeatable harness problem emerges, draft a **harness-improvement proposal**: a concrete edit to a rubric, the Maker brief (`.claude/skills/_loop/maker.md`), the Checker brief (`.claude/skills/_loop/checker.md`), or a new judgment rule about how to run loops. Do not apply it yet. Harness proposals go through the Checker gate in Step 5.
|
|
91
|
+
|
|
92
|
+
Most sessions produce no harness proposal. Raise one only when the evidence is in the status files.
|
|
93
|
+
|
|
81
94
|
### Step 3: Present for Approval
|
|
82
95
|
|
|
83
96
|
Format reflections clearly and ask for approval:
|
|
@@ -171,6 +184,12 @@ Call the `memory_end_session` MCP tool with:
|
|
|
171
184
|
5. Set `source` to `meditate/YYYY-MM-DD` (today's date)
|
|
172
185
|
6. Write the updated file
|
|
173
186
|
|
|
187
|
+
**Harness proposals go through the Checker before anything is written** (Proposal 11, E4). If Step 2b produced a harness-improvement proposal (a rubric edit, a Maker or Checker brief edit, or a loop-running judgment rule), validate it first:
|
|
188
|
+
|
|
189
|
+
1. Dispatch the `loop-checker` agent with the proposal as the artifact and a one-line rubric: "does this change actually fix the observed failure without breaking the brief's contract?"
|
|
190
|
+
2. Write the proposal only if the Checker returns `verified: true`. A proposal the Checker rejects is surfaced to the user as an open question, not saved.
|
|
191
|
+
3. Edits to shipped briefs (`_loop/maker.md`, `_loop/checker.md`) are never auto-applied. Present them as a diff for the user to approve, like any other change to Claudia's own files. Loop-running judgment rules go into `context/judgment.yaml` like any other rule, after passing the Checker.
|
|
192
|
+
|
|
174
193
|
Confirm storage with both reflections and rules:
|
|
175
194
|
"Got it. I've saved [N] reflection(s) and added [M] judgment rule(s) to your judgment file. I'll apply them going forward. See you next time."
|
|
176
195
|
|
|
@@ -17,6 +17,21 @@
|
|
|
17
17
|
"I have an idea for Claudia"
|
|
18
18
|
]
|
|
19
19
|
},
|
|
20
|
+
{
|
|
21
|
+
"name": "build-team",
|
|
22
|
+
"path": "build-team/SKILL.md",
|
|
23
|
+
"description": "Propose a personalized team of specialized agents based on the user profile and real work, validated by an independent Checker, applied on approval with rollback.",
|
|
24
|
+
"invocation": "explicit",
|
|
25
|
+
"effort_level": "high",
|
|
26
|
+
"triggers": ["build my team", "set up my agents", "what agents should I have", "design my team", "/build-team"],
|
|
27
|
+
"examples": [
|
|
28
|
+
"build my team",
|
|
29
|
+
"what agents should I have for my work?",
|
|
30
|
+
"set up my agents",
|
|
31
|
+
"design a team around my client work",
|
|
32
|
+
"tailor your team to how I work"
|
|
33
|
+
]
|
|
34
|
+
},
|
|
20
35
|
{
|
|
21
36
|
"name": "onboarding",
|
|
22
37
|
"path": "onboarding.md",
|