code-lens-cli 0.7.1__tar.gz → 0.8.0__tar.gz
- code_lens_cli-0.8.0/.claude/skills/eval/SKILL.md +399 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/CHANGELOG.md +15 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/PKG-INFO +1 -1
- code_lens_cli-0.8.0/docs/eval-rounds/2026-05-16-round-02.md +74 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/skill-sources.md +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/_io.py +6 -3
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/backfill.py +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/corpus.yaml +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/judge.py +230 -59
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/report.py +40 -27
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/summarize.py +153 -65
- code_lens_cli-0.8.0/experiments/scripts_eval/switch-arm.sh +58 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/trial.py +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/validate.py +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/pyproject.toml +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_io.py +6 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_judge.py +130 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_report.py +3 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/uv.lock +1 -1
- code_lens_cli-0.7.1/.claude/skills/eval/SKILL.md +0 -263
- code_lens_cli-0.7.1/experiments/scripts_eval/switch-arm.sh +0 -91
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/settings.json +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/_resolve-nick.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/portability-lint.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/pr-reply.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/pr-status.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/workflow.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/code-lookup/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/code-lookup/scripts/classify.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/code-lookup/scripts/grep.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/code-lookup/scripts/recent.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/fetch-issues.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/mesh-message.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/post-comment.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/post-issue.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/templates/skill-update-brief.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/repo-map/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/repo-map/scripts/connections.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/repo-map/scripts/graph.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/repo-map/scripts/profile.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/run-tests/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/run-tests/scripts/test.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/sonarclaude/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/sonarclaude/scripts/sonar.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/version-bump/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/version-bump/scripts/bump.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills.local.yaml.example +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.flake8 +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.github/workflows/publish.yml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.github/workflows/security-checks.yml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.github/workflows/tests.yml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.gitignore +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.markdownlint-cli2.yaml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.pre-commit-config.yaml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/CLAUDE.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/LICENSE +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/README.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/culture.yaml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/eval-rounds/2026-05-15-round-01.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/eval-rounds/2026-05-15-smoke-02-examples.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/plans/2026-05-15-repo-map.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/plans/2026-05-15-scripts-eval-harness.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/plans/2026-05-16-seer-classify.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/specs/2026-05-15-repo-map-design.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/specs/2026-05-15-scripts-eval-harness-design.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/specs/2026-05-16-seer-classify-design.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/README.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/RUNBOOK.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/corpus.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/hooks/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/hooks/post_tool.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/hooks/pre_tool.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/judge_rubric.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/manifest.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/results/.gitkeep +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/__main__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/classify.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/explain.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/grep.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/learn.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/recent.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/whoami.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_errors.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_output.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/ast_scope.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/classify.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/grep_context.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/recent_outline.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/render.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/__main__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/config.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/connections.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/detect.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/errors.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/graph.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/manifest.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/profile.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/render.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/sonar-project.properties +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/fixtures/.gitkeep +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/fixtures/corpus_minimal.yaml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/fixtures/sidechain_min.jsonl +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_backfill.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_corpus.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_hooks_post_tool.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_hooks_pre_tool.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_manifest.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_summarize.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_trial.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_validate.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_ast_scope.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_classify.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_classify_render.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_cli_chassis.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_cli_errors.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_cli_output.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_cli_stubs.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_grep_cmd.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_grep_context.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_package.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_recent_cmd.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_recent_outline.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_cli.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_config.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_connections.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_detect.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_errors.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_graph.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_manifest.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_profile.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_render.py +0 -0
@@ -0,0 +1,399 @@
+---
+name: eval
+description: >
+Run one scripts-eval set — one `(target, question)` row from
+`experiments/scripts_eval/corpus.yaml` × 3 trials × one arm — including
+tester subagent dispatches + captures, plus (for arm C) judge subagent
+dispatches + records, then `summarize` + commit to the round's
+accumulator file. Use when the user says "run eval set", "eval",
+"scripts-eval", "round-NN set", or asks to execute a row of the corpus.
+Three arms: A (banned — rider forbids the seer skills), B (directed
+— rider instructs use of seer skills), C (organic — rider permits
+but doesn't direct). Two judge pairs: A-vs-B ("do the skills help
+when used") and A-vs-C ("do the skills get adopted organically").
+`judge prepare --pair AB|AC` selects the pair.
+---
+
+# scripts-eval — running a set
+
+This skill drives one **set** of the scripts-eval harness:
+one `(target, question)` row × 3 trials × one arm.
+
+The harness pipeline (`trial` / `validate` / `judge` / `summarize`) and
+the corpus (`corpus.yaml`) are repo state — this skill is just the
+operator procedure that sequences them per session.
+
+## When to push back
+
+Before doing anything, verify the user's intent matches the session
+state. Stop and ask if any of these hold:
+
+- `env | grep SEER_EVAL_RUN_ID` is empty → the harness hooks no-op, no
+metrics get captured. Operator needs to re-launch with the env vars
+exported.
+- `SEER_EVAL_ARM` is set to anything other than `A`, `B`, or `C` → bad config.
+- User says "do arm C" but the matching arm-A cells don't exist on
+disk under `experiments/scripts_eval/results/$SEER_EVAL_RUN_ID/arm-A/`
+→ arm A must complete first; there's nothing to pair against.
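Editor's sketch (not part of the released diff): the three push-back conditions above could be checked programmatically along these lines. The function name `pushback_reasons`, the `results_root` parameter, and the assumption that cells are `*.json` files directly under `arm-A/` are illustrative, not taken from the harness.

```python
from pathlib import Path

VALID_ARMS = {"A", "B", "C"}


def pushback_reasons(env, results_root="experiments/scripts_eval/results"):
    """Return the reasons (possibly none) to stop and ask the user.

    `env` is any mapping of environment variables, e.g. os.environ.
    Hypothetical helper: the real harness may structure this differently.
    """
    reasons = []
    run_id = env.get("SEER_EVAL_RUN_ID", "")
    arm = env.get("SEER_EVAL_ARM", "")

    # Without a run id the harness hooks no-op and capture nothing.
    if not run_id:
        reasons.append("SEER_EVAL_RUN_ID is unset")

    # Only A, B, and C are valid arms.
    if arm not in VALID_ARMS:
        reasons.append(f"SEER_EVAL_ARM={arm!r} is not one of A, B, C")

    # Arm C has nothing to pair against until arm-A cells exist on disk.
    if run_id and arm == "C":
        arm_a_dir = Path(results_root) / run_id / "arm-A"
        if not arm_a_dir.is_dir() or not any(arm_a_dir.glob("*.json")):
            reasons.append(f"no arm-A cells under {arm_a_dir}")

    return reasons
```

Calling `pushback_reasons(os.environ)` before dispatching anything gives an empty list when the session is safe to proceed.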
+
+All three arms run with `repo-map` and `code-lookup` enabled on disk.
+Arm-A's constraint is **verbal** — the rider in the dispatched prompt
+is the sole guard against the subagent using the seer skills. Do not
+edit the rider; copy it verbatim. (Earlier versions of this skill
+physically moved `.claude/skills/repo-map/` aside for arm A as
+defense-in-depth; that step was dropped because the rider proved
+sufficient and the move-aside dance made operator setup brittle.)
+
+Three arms, three questions they answer:
+
+- **A (banned)** — verbal rider forbids both seer skills. Establishes
+the "without the new skills" baseline.
+- **B (directed)** — verbal rider instructs the subagent to use the
+seer skills where applicable. Establishes the "with the new skills,
+when actually used" upper bound.
+- **C (organic)** — verbal rider permits but does not direct use of
+the seer skills. Measures organic adoption rate.
+
+A-vs-B is the primary "do the skills help?" comparison; A-vs-C is the
+adoption canary. The judge supports both pairs via the `--pair` flag.
+
+## Preflight (every session)
+
+```bash
+env | grep -E "^SEER_EVAL_(RUN_ID|ARM)="
+# expect both set to the intended round / arm
+```
+
+If unset, export them in your shell before launching `claude`:
+
+```bash
+# arm-A session (banned):
+export SEER_EVAL_RUN_ID=2026-05-NN-round-XX SEER_EVAL_ARM=A
+# arm-B session (directed):
+export SEER_EVAL_RUN_ID=2026-05-NN-round-XX SEER_EVAL_ARM=B
+# arm-C session (organic):
+export SEER_EVAL_RUN_ID=2026-05-NN-round-XX SEER_EVAL_ARM=C
+```
+
+`experiments/scripts_eval/switch-arm.sh A|B|C <run_id>` does the same
+thing.
+
+If this is the first set of the run (idempotent, safe to re-run):
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.manifest \
+init --run $SEER_EVAL_RUN_ID
+```
+
+## Arm-A procedure
+
+**For each trial in {1, 2, 3}:**
+
+1. Read the question template for the target's `question_id` from
+`experiments/scripts_eval/corpus.yaml`. Look up the target's path
+from the same file's `targets:` list.
+
+2. Substitute `{repo_path}` (or `{workspace_root}` for the workspace
+question) in the template, then append **verbatim**:
+
+```text
+
+Constraints (verbatim):
+- You may NOT use the `repo-map` skill, `python -m seer.repo`,
+the `seer.repo` Python module, or any `scripts/*.sh` paths under
+`.claude/skills/repo-map/`.
+- You may NOT use the `code-lookup` skill, the `seer.lookup`
+Python module, the `seer grep` / `seer recent` / `seer classify`
+CLI verbs, or any `scripts/*.sh` paths under
+`.claude/skills/code-lookup/`.
+If you cannot answer without them, say so explicitly and stop.
+- Use only Read, Grep, Glob, and Bash.
+- After answering, append two sections and stop:
+### tools_used
+- <ToolName>: <count> (one line per distinct tool)
+### evidence
+- <one path per line>
+```
+
+3. **Before dispatch** — start the trial. The script reads
+`CLAUDE_CODE_SESSION_ID` from env, stamps an in-flight record, and
+prints the `trial_id` to stdout:
+
+```bash
+TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
+start --run $SEER_EVAL_RUN_ID --arm $SEER_EVAL_ARM \
+--target <target> --question <question_id> --trial <n>)
+```
+
+(For the workspace-scope question, omit `--target`.)
+
+4. Dispatch **one** `Explore` subagent with the full prompt.
+
+5. After the subagent finishes, end the trial. The script reads the
+subagent's sidechain transcript from
+`$HOME/.claude/projects/<encoded_cwd>/<session>/subagents/agent-*.jsonl`
+and writes the cell JSON:
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.trial \
+end --trial-id "$TRIAL_ID"
+```
+
+6. Confirm the cell JSON appeared under
+`experiments/scripts_eval/results/$SEER_EVAL_RUN_ID/arm-A/`.
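Editor's sketch (not part of the released diff): step 2 of the procedure above, substituting the placeholder and appending the rider unchanged, might look like the following. `render_prompt` is a hypothetical name, and joining template and rider with a single newline is an assumption; the skill only says to append the rider verbatim.

```python
def render_prompt(template, rider, repo_path=None, workspace_root=None):
    """Fill the corpus question template, then append the rider verbatim.

    Uses str.replace rather than str.format so that any other braces in
    the template are left untouched. Hypothetical helper for illustration.
    """
    prompt = template
    if repo_path is not None:
        prompt = prompt.replace("{repo_path}", repo_path)
    if workspace_root is not None:
        prompt = prompt.replace("{workspace_root}", workspace_root)
    # The rider is never edited: it is concatenated exactly as given.
    return prompt + "\n" + rider
```

Keeping the rider as an opaque string, rather than rebuilding it from parts, is what makes "copy it verbatim" easy to honour.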
+
+**After all 3 trials**, summarize + commit:
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.summarize \
+--run $SEER_EVAL_RUN_ID \
+--out docs/eval-rounds/$SEER_EVAL_RUN_ID.md
+
+git add docs/eval-rounds/$SEER_EVAL_RUN_ID.md
+git commit -m "$SEER_EVAL_RUN_ID: arm-A captured for <target>/<question_id> (3 trials)"
+```
+
+Report back: cell count under arm-A/, what's pending for arm-B and
+arm-C on this set, the next pending set per the run-state table in the
+accumulator file.
+
+## Arm-B procedure
+
+Arm-B captures the **directed** trials so the A-vs-B judge run can
+assess "do the skills help when actually used?". Capture happens in
+its own session (`SEER_EVAL_ARM=B`); the A-vs-B judges then run in
+the arm-C session's Judge phase, alongside the A-vs-C judges
+(`judge prepare --pair AB`).
+
+**For each trial in {1, 2, 3}:**
+
+1. Substitute the corpus question template (same target / question
+resolution as arm A), then append **verbatim** the arm-B rider:
+
+```text
+
+Constraints (verbatim):
+- For this question, you MUST use the seer skills where they
+apply:
+* `repo-map` (`scripts/profile.sh`, `scripts/connections.sh`,
+`scripts/graph.sh` under `.claude/skills/repo-map/`) for
+repo overview, dependencies, and workspace shape.
+* `code-lookup` (`seer grep`, `seer recent`, `seer classify`,
+or the equivalent scripts under
+`.claude/skills/code-lookup/`) for symbol references,
+recent commit-symbol diffs, and project-kind classification.
+Only fall back to Read / Grep / Glob / Bash for facts the
+scripts do not cover.
+- After answering, append two sections and stop:
+### tools_used
+- <ToolName>: <count> (one line per distinct tool)
+### evidence
+- <one path per line>
+```
+
+2. Bookend the dispatch with `trial start` and `trial end` exactly
+as in arm A, just with `--arm B`:
+
+```bash
+TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
+start --run $SEER_EVAL_RUN_ID --arm $SEER_EVAL_ARM \
+--target <target> --question <question_id> --trial <n>)
+# dispatch one Explore subagent with the rendered prompt above
+uv run --group experiments python -m experiments.scripts_eval.trial \
+end --trial-id "$TRIAL_ID"
+```
+
+3. Confirm the cell JSON appeared under
+`experiments/scripts_eval/results/$SEER_EVAL_RUN_ID/arm-B/`.
+
+**After all 3 trials**, summarize + commit:
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.summarize \
+--run $SEER_EVAL_RUN_ID \
+--out docs/eval-rounds/$SEER_EVAL_RUN_ID.md
+
+git add docs/eval-rounds/$SEER_EVAL_RUN_ID.md
+git commit -m "$SEER_EVAL_RUN_ID: arm-B captured for <target>/<question_id> (3 trials)"
+```
+
+Report back: cell count under arm-B/, whether the subagent actually
+followed the directive (look at the `### tools_used` of each arm-B
+cell — `B_did_not_use_scripts` is a finding, not a bug), the next
+pending set per the run-state table.
+
+## Arm-C procedure
+
+**Precondition check (mandatory):**
+
+```bash
+ls experiments/scripts_eval/results/$SEER_EVAL_RUN_ID/arm-A/<target>-<question_id>-t*.json
+# expect: 3 files (t1, t2, t3)
+```
+
+If fewer than 3, stop — arm A must complete first.
+
+### Tester phase
+
+**For each trial in {1, 2, 3}:**
+
+1. Substitute the corpus question template (same as arm A) but with the
+arm-C rider:
+
+```text
+
+Constraints (verbatim):
+- You may use the `repo-map` skill (and its scripts under
+`.claude/skills/repo-map/`) and the `code-lookup` skill (and its
+scripts under `.claude/skills/code-lookup/`) at your discretion.
+This includes `seer grep` / `seer recent` / `seer classify`.
+- After answering, append two sections and stop:
+### tools_used
+- <ToolName>: <count>
+### evidence
+- <one path per line>
+```
+
+2. Bookend the dispatch with `trial start` and `trial end`:
+
+```bash
+TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
+start --run $SEER_EVAL_RUN_ID --arm $SEER_EVAL_ARM \
+--target <target> --question <question_id> --trial <n>)
+# dispatch one Explore subagent with the rendered prompt above
+uv run --group experiments python -m experiments.scripts_eval.trial \
+end --trial-id "$TRIAL_ID"
+```
+
+(For the workspace-scope question, omit `--target`.)
+
+### Judge phase
+
+Two pairs are judged independently:
+
+- **A-vs-C** — the original "with vs without (organic)" comparison.
+- **A-vs-B** — the new "with (directed) vs without" comparison; needs
+arm-B cells captured first.
+
+Both pairs use the same `prepare` / `record` flow; only `--pair`
+(`AC` or `AB`) and the matching `--blind-label-for-<arm>` flags differ.
+
+**For each trial in {1, 2, 3}**, run the A-vs-C judge first (if arm-C
+cells exist) and then the A-vs-B judge (if arm-B cells exist):
+
+1. Prepare the blinded job. `--pair` defaults to `AC`; pass `--pair AB`
+for the A-vs-B run.
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.judge \
+prepare --run $SEER_EVAL_RUN_ID \
+--pair AC \
+--pair-key <target>/<question_id>/<n> \
+--seed 0 > /tmp/judge-AC-<n>.json
+```
+
+2. Materialise the prompt to a text file for dispatch (`jq -j`
+joins without adding a trailing newline, so the bytes match what
+`prepare` emitted):
+
+```bash
+jq -j '.prompt_text' /tmp/judge-AC-<n>.json > /tmp/judge-AC-<n>.txt
+```
+
+3. Dispatch the judge subagent. **The description prefix is
+load-bearing** — the `pre_tool` hook recognises `scripts_eval judge:`
+and skips logging, so the judge dispatch does not pollute the
+harness's `raw/` directory:
+
+- `subagent_type`: `general-purpose`
+- `description`: `scripts_eval judge: AC <target>/<question_id>/<n>`
+- `prompt`: the verbatim contents of `/tmp/judge-AC-<n>.txt`
+
+4. Capture the subagent's final-text response and record. The blind
+labels for an AC pair come back as `blind_label_for_A` /
+`blind_label_for_C`:
+
+```bash
+A_LABEL=$(jq -r .blind_label_for_A /tmp/judge-AC-<n>.json)
+C_LABEL=$(jq -r .blind_label_for_C /tmp/judge-AC-<n>.json)
+uv run --group experiments python -m experiments.scripts_eval.judge \
+record --run $SEER_EVAL_RUN_ID \
+--pair AC \
+--pair-key <target>/<question_id>/<n> \
+--blind-label-for-a "$A_LABEL" \
+--blind-label-for-c "$C_LABEL" \
+--verdict-file -
+```
+
+5. **Repeat the four steps with `--pair AB`** to judge the directed
+arm. The job JSON for an AB pair carries `blind_label_for_A` and
+`blind_label_for_B` (no `_C`); use `--blind-label-for-b` instead of
+`--blind-label-for-c`:
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.judge \
+prepare --run $SEER_EVAL_RUN_ID \
+--pair AB \
+--pair-key <target>/<question_id>/<n> \
+--seed 0 > /tmp/judge-AB-<n>.json
+jq -j '.prompt_text' /tmp/judge-AB-<n>.json > /tmp/judge-AB-<n>.txt
+# …dispatch general-purpose subagent with description
+# "scripts_eval judge: AB <target>/<question_id>/<n>" and the txt prompt…
+A_LABEL=$(jq -r .blind_label_for_A /tmp/judge-AB-<n>.json)
+B_LABEL=$(jq -r .blind_label_for_B /tmp/judge-AB-<n>.json)
+uv run --group experiments python -m experiments.scripts_eval.judge \
+record --run $SEER_EVAL_RUN_ID \
+--pair AB \
+--pair-key <target>/<question_id>/<n> \
+--blind-label-for-a "$A_LABEL" \
+--blind-label-for-b "$B_LABEL" \
+--verdict-file -
+```
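Editor's sketch (not part of the released diff): step 3 of the judge procedure calls the description prefix load-bearing because the `pre_tool` hook keys on it. A minimal illustration of that kind of prefix gate follows; `should_log_dispatch` and `judge_description` are hypothetical names, and the real hook in `hooks/pre_tool.py` is not shown in this diff, so its actual logic may differ.

```python
JUDGE_PREFIX = "scripts_eval judge:"


def judge_description(pair, target, question_id, trial):
    """Build a dispatch description in the shape the skill prescribes."""
    return f"{JUDGE_PREFIX} {pair} {target}/{question_id}/{trial}"


def should_log_dispatch(description):
    """Return False for judge dispatches so they stay out of raw/ logs.

    Assumed behaviour: a plain startswith check on the description.
    """
    return not description.startswith(JUDGE_PREFIX)
```

Under this assumption, any deviation in the prefix (a typo, different casing, a leading space) would cause the judge dispatch to be logged like a tester trial.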
+
+If `record` exits non-zero with `non-JSON` / `winner` / `margin` /
+`blind_label` in the error, re-dispatch the judge subagent for that
+trial and re-record. `record` is idempotent on replay — the operator's
+recovery path is "re-dispatch + re-record"; no manual cell editing.
+
+Storage: AC verdicts land under `cell["judges"]["AC"]` (and are mirrored
+to `cell["judge"]` for back-compat with pre-phase-2 readers); AB verdicts
+land under `cell["judges"]["AB"]` only.
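Editor's sketch (not part of the released diff): the storage rule described above can be pictured as a small pure function. `store_verdict` is a hypothetical name and the verdict fields in the usage are placeholders; the real `judge.py` functions and cell schema are not reproduced here.

```python
import copy


def store_verdict(cell, pair, verdict):
    """Return a copy of `cell` with `verdict` stored for `pair`.

    Every pair lands under cell["judges"][pair]; only the AC pair is
    also mirrored to the legacy cell["judge"] key for older readers.
    """
    updated = copy.deepcopy(cell)
    updated.setdefault("judges", {})[pair] = verdict
    if pair == "AC":
        updated["judge"] = verdict
    return updated
```

Because the function overwrites the slot for a pair rather than appending to it, recording the same verdict twice leaves the cell unchanged, which is consistent with the "re-dispatch + re-record" recovery path.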
+
+### Wrap-up
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.validate \
+--run $SEER_EVAL_RUN_ID
+
+uv run --group experiments python -m experiments.scripts_eval.summarize \
+--run $SEER_EVAL_RUN_ID \
+--out docs/eval-rounds/$SEER_EVAL_RUN_ID.md
+
+git add docs/eval-rounds/$SEER_EVAL_RUN_ID.md
+git commit -m "$SEER_EVAL_RUN_ID: completed <target>/<question_id> (both arms + judge)"
+```
+
+Report back:
+- A-vs-B winners (A / B / tie) and A-vs-C winners (A / C / tie).
+- Whether arm B and arm C actually used the seer scripts (look at the
+`### tools_used` of each cell — `B_did_not_use_scripts` and
+`C_did_not_use_scripts` are findings, not bugs).
+- The next pending set per the run-state table.
+
+## Reading the run state
+
+The committed run-state table and per-set verdicts live in
+`docs/eval-rounds/$SEER_EVAL_RUN_ID.md`, between the
+`<!-- runstate:start -->` / `<!-- runstate:end -->` and
+`<!-- evidence:start -->` / `<!-- evidence:end -->` markers. `summarize.py`
+rewrites those regions idempotently — do not hand-edit them.
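Editor's sketch (not part of the released diff): an idempotent rewrite of a marker-delimited region, as described above, is commonly done by replacing everything between the start and end comments. `replace_region` is a hypothetical helper; how `summarize.py` actually does this is not shown in this excerpt.

```python
import re


def replace_region(text, name, body):
    """Replace the content between <!-- name:start --> and <!-- name:end -->.

    Text outside the markers is preserved, and running the function twice
    with the same body yields the same document.
    """
    start = f"<!-- {name}:start -->"
    end = f"<!-- {name}:end -->"
    pattern = re.compile(
        re.escape(start) + r".*?" + re.escape(end), re.DOTALL
    )
    if not pattern.search(text):
        raise ValueError(f"markers for region {name!r} not found")
    block = f"{start}\n{body}\n{end}"
    # A function replacement avoids backslash-escape handling in `body`.
    return pattern.sub(lambda _match: block, text, count=1)
```

This also shows why hand edits inside the markers do not survive: the next run discards whatever was between them.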
+
+The accumulator file is also the operator's source of truth for what's
+pending: a row's `arm-A` or `arm-C` count below `3/3` means more trials
+are needed; `judged` below the arm counts means judges still owe verdicts.
+
+## Cite-don't-import
+
+This skill is original to seer-cli (the harness only exists here). When
+promoted upstream, it would re-vendor into steward's skill suppliers —
+update `docs/skill-sources.md` accordingly at that point.
@@ -5,6 +5,21 @@ All notable changes to this project will be documented in this file.
 Format follows [Keep a Changelog](https://keepachangelog.com/). This project
 adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.8.0] - 2026-05-16
+
+### Added
+
+- scripts-eval: arm-B (directed-use) — rider explicitly instructs the subagent to use repo-map + code-lookup scripts. Cells captured under `results/<run>/arm-B/`. `corpus.yaml` arms field becomes `[A, B, C]`.
+- scripts-eval: `judge.py` is now pair-aware. `judge prepare` / `judge record` take `--pair AC|AB|BC`; new label flag `--blind-label-for-b`. Verdicts land under `cell["judges"][pair]`; the AC pair still mirrors to legacy `cell["judge"]` for back-compat. New `iter_jobs_pair` / `record_verdict_pair` public APIs; old `iter_jobs` / `record_verdict` stay as AC-pair wrappers.
+- scripts-eval: summarize.py renders both A-vs-B and A-vs-C winner tallies in the run-state table, with per-pair verdict tables in the evidence section.
+- eval skill: arm-A rider tightened to also forbid the code-lookup skill / seer.lookup / seer grep/recent/classify so "without" means without both new skills. Added arm-B procedure section and pair-aware judge procedure (--pair AB | AC).
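The verdict layout described in the `judge.py` entry above can be sketched like this. It illustrates the storage shape only: `store_verdict` is a hypothetical name, not the package's `record_verdict_pair`, whose real signature is not shown in this diff.

```python
PAIRS = ("AB", "AC", "BC")  # the values the changelog lists for --pair


def store_verdict(cell: dict, pair: str, verdict: dict) -> dict:
    """Store a verdict under cell["judges"][pair] (hypothetical helper).

    The AC pair is also mirrored to the legacy cell["judge"] key, which is
    the back-compat behaviour the changelog describes.
    """
    if pair not in PAIRS:
        raise ValueError(f"pair must be one of {PAIRS} (got {pair!r})")
    cell.setdefault("judges", {})[pair] = verdict
    if pair == "AC":
        cell["judge"] = verdict  # legacy readers only look at this key
    return cell
```

With this shape, a reader written before pair-aware judging keeps finding the A-vs-C verdict at `cell["judge"]`, while new code reads `cell["judges"]["AB"]` or `cell["judges"]["AC"]`.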
+
+### Changed
+
+- eval skill: switch-arm.sh no longer moves .claude/skills/repo-map/ on disk; arm-A relies on the verbal rider alone (rider proved sufficient; move-aside dance made operator setup brittle).
+- scripts-eval: report.py violation patterns now also flag code-lookup script use; report aggregates median validation across all three arms; per-cell view shows every captured arm.
+- scripts-eval: validate.py / backfill.py iterate over _io.ARMS instead of hardcoding (A, C).
+
 ## [0.7.1] - 2026-05-16
 
 ### Changed
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: code-lens-cli
-Version: 0.
+Version: 0.8.0
 Summary: seer — codebase lookup and indexing for agent skills (greenfield AgentCulture sibling).
 Project-URL: Homepage, https://github.com/agentculture/seer-cli
 Project-URL: Issues, https://github.com/agentculture/seer-cli/issues
@@ -0,0 +1,74 @@
+# scripts-eval Round 02 — 2026-05-16
+
+Run id: `2026-05-16-round-02`
+Corpus: `experiments/scripts_eval/corpus.yaml` (corpus_version: 1)
+Trials per cell: 3
+Arms: A (verbal rider bans both seer skills), B (verbal rider
+directs use of seer skills), C (verbal rider permits but doesn't
+direct). Judge pairs: A-vs-B + A-vs-C.
+
+This file is the round's **evidence accumulator**. The two
+`<!-- runstate:... -->` / `<!-- evidence:... -->` regions below are
+rewritten by `experiments/scripts_eval/summarize.py` at the end of every
+session — operator never edits them by hand.
+
+Raw per-cell JSONs stay gitignored under
+`experiments/scripts_eval/results/2026-05-16-round-02/`. This file is
+the committed evidence.
+
+## Procedure
+
+The operator procedure is locked in the **`eval` skill** —
+`.claude/skills/eval/SKILL.md`. Read or invoke that skill from a fresh
+session; this file does not duplicate it.
+
+Round 02 is the first run under the post-move-aside skill: all three
+arms run with `.claude/skills/repo-map/` and
+`.claude/skills/code-lookup/` present on disk; arm A's verbal rider in
+the dispatched prompt is the sole guard against the subagent using
+either skill. It is also the first run under the pair-aware judge
+(`judge prepare --pair AB|AC`), so the run-state table carries A-vs-B
+winners alongside A-vs-C.
+
+## Per-session preflight (your shell, *before* `claude`)
+
+```bash
+# arm-A session (banned):
+export SEER_EVAL_RUN_ID=2026-05-16-round-02 SEER_EVAL_ARM=A
+
+# arm-B session (directed):
+export SEER_EVAL_RUN_ID=2026-05-16-round-02 SEER_EVAL_ARM=B
+
+# arm-C session (organic; same run id, different arm):
+export SEER_EVAL_RUN_ID=2026-05-16-round-02 SEER_EVAL_ARM=C
+```
+
+Then launch `claude` from the seer-cli repo root.
+
+## Run state
+
+<!-- runstate:start -->
+
+| target | question | arm-A | arm-B | arm-C | A-vs-B judged | A-vs-B (A/B/tie) | A-vs-C judged | A-vs-C (A/C/tie) |
+|---|---|---|---|---|---|---|---|---|
+| culture | q-profile-overview | 3/3 | 3/3 | 0/3 | 3/3 | 2/1/0 | 0/3 | 0/0/0 |
+
+<!-- runstate:end -->
+
+## Evidence per set
+
+<!-- evidence:start -->
+
+### culture / q-profile-overview
+
+**AB** winners: A=2, B=1, tie=0 (of 3 judged).
+
+| trial | winner | margin | A duration | B duration | A tools | B tools | judge reasoning |
+|---|---|---|---|---|---|---|---|
+| 1 | A | slight | 136.7s | 47.2s | Bash:26, Read:8 | Bash:4, Read:2, Skill:2 | Y is more accurate about the agentirc shim relationship and gives a concrete runnable test command with coverage details, while X invents some specifics (e.g. publish.yml trigger) and lists skills not matching Y's set; both are solid but Y is tighter and more actionable. |
+| 2 | A | slight | 78.7s | 66.8s | Bash:27, Read:8 | Bash:4, Read:2, Skill:1 | Y is more accurate and actionable with concrete test commands, CI workflow names, and architectural choices, while X includes some questionable specifics (e.g., exact 90% coverage floor, v12.1.7 release details) that read as potentially fabricated. |
+| 3 | B | slight | 74.6s | 42.5s | Bash:20, Read:7 | Bash:5, Skill:2 | Y adds concrete dependency list, release state, and vendored skills inventory that aid actionability, while X has a slightly cleaner structure but less verifiable detail; Y's minor duplication (afi-cli listed twice) is a small blemish. |
+
+*No AC verdicts yet.*
+
+<!-- evidence:end -->
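The per-pair winner line in the evidence section above can be reproduced from a list of verdicts with a small tally. This is a sketch under assumptions: the verdict dicts, their `"winner"` key, the `"tie"` value, and the name `tally_line` are all hypothetical, since the actual rendering code in `summarize.py` is not shown in this part of the diff.

```python
from collections import Counter


def tally_line(pair: str, verdicts: list[dict]) -> str:
    """Render the winner summary for one judge pair (hypothetical helper)."""
    if not verdicts:
        return f"*No {pair} verdicts yet.*"
    first, second = pair  # "AB" -> "A", "B"
    # Counter returns 0 for winners that never occur, so ties need no special case.
    counts = Counter(verdict["winner"] for verdict in verdicts)
    return (
        f"**{pair}** winners: {first}={counts[first]}, {second}={counts[second]}, "
        f"tie={counts['tie']} (of {len(verdicts)} judged)."
    )


# Winners taken from the three AB trial rows in the evidence table.
ab_verdicts = [{"winner": "A"}, {"winner": "A"}, {"winner": "B"}]
print(tally_line("AB", ab_verdicts))
print(tally_line("AC", []))
```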
@@ -17,7 +17,7 @@ explicitly — these copies do not auto-update.
 | `version-bump` | `steward` (`../steward/.claude/skills/version-bump/`) | 2026-05-15 | None — portable verbatim. Pure Python, no per-repo customization. seer-cli's `CHANGELOG.md` keeps a `# Changelog` + intro-prose header, so the first `## [` entry is a valid insertion point for the upstream script. |
 | `repo-map` | _internal implementation_ — seer-cli origin | 2026-05-15 | **Runtime:** thin shell wrappers under `.claude/skills/repo-map/scripts/{profile,connections,graph}.sh` that invoke `uv run --directory <repo-root> python -m seer.repo <verb>`. Engine lives in `seer/repo/` in this repo. **Divergence:** N/A — this skill is original to seer-cli, not vendored from steward. If/when promoted upstream, this row flips to a `Re-vendor from steward` pointer. |
 | `code-lookup` | _internal implementation_ — seer-cli origin | 2026-05-16 | **Runtime:** thin shell wrapper under `.claude/skills/code-lookup/scripts/classify.sh` that invokes `uv run --directory <repo-root> python -m seer classify`. Engine lives in `seer/lookup/` in this repo. **Divergence:** N/A — original to seer-cli, sibling of `repo-map`. If/when promoted upstream, this row flips to a `Re-vendor from steward` pointer. |
-| `eval` | _internal implementation_ — seer-cli origin | 2026-05-
+| `eval` | _internal implementation_ — seer-cli origin | 2026-05-16 | **Runtime:** locked operator procedure for one scripts-eval set (one `(target, question)` row × 3 trials × one arm). Three arms — **A** (banned), **B** (directed), **C** (organic) — and two judge pairs — **A-vs-B** ("do the skills help when used?") and **A-vs-C** ("do the skills get adopted organically?"). Backing CLIs live in `experiments/scripts_eval/` (`trial`, `validate`, `judge`, `summarize`, `manifest`); hooks live in `experiments/scripts_eval/hooks/`. `judge prepare --pair AB\|AC` selects the pair; verdicts land under `cell["judges"][pair]` (the AC pair also mirrors to legacy `cell["judge"]`). Round-agnostic — reads `$SEER_EVAL_RUN_ID` and writes the round's accumulator at `docs/eval-rounds/$SEER_EVAL_RUN_ID.md`. **Divergence:** N/A — original to seer-cli (the harness only exists here). If/when promoted upstream, this row flips to a `Re-vendor from steward` pointer. |
 
 ## Vendoring policy
 
@@ -25,13 +25,16 @@ def eval_run_id() -> str | None:
     return val if val else None
 
 
+ARMS = ("A", "B", "C")
+
+
 def eval_arm() -> str | None:
-    """Return SEER_EVAL_ARM (must be 'A' or 'C'), or None if unset."""
+    """Return SEER_EVAL_ARM (must be 'A', 'B', or 'C'), or None if unset."""
     val = os.environ.get("SEER_EVAL_ARM")
     if val is None or val == "":
         return None
-    if val not in
-        raise ValueError(f"SEER_EVAL_ARM must be
+    if val not in ARMS:
+        raise ValueError(f"SEER_EVAL_ARM must be one of {ARMS} (got {val!r})")
     return val
 
 
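For reference, the post-change `eval_arm` from the hunk above can be exercised in isolation as below. The function body follows the added lines of the diff, with `Optional[str]` swapped in for `str | None` so the snippet also runs on older interpreters. The caller `describe_arm` is a hypothetical illustration and not part of `_io.py`.

```python
import os
from typing import Optional

ARMS = ("A", "B", "C")


def eval_arm() -> Optional[str]:
    """Return SEER_EVAL_ARM (must be 'A', 'B', or 'C'), or None if unset."""
    val = os.environ.get("SEER_EVAL_ARM")
    if val is None or val == "":
        return None
    if val not in ARMS:
        raise ValueError(f"SEER_EVAL_ARM must be one of {ARMS} (got {val!r})")
    return val


def describe_arm() -> str:
    """Hypothetical caller: turn the environment setting into a status line."""
    try:
        arm = eval_arm()
    except ValueError as exc:
        return f"invalid: {exc}"
    return "no arm selected" if arm is None else f"arm {arm}"
```

Validating against the single `ARMS` tuple means that adding a fourth arm later only requires changing that tuple, which is the same reason the changelog gives for `validate.py` and `backfill.py` iterating over `_io.ARMS`.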
|