code-lens-cli 0.7.1__tar.gz → 0.9.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/cicd/SKILL.md +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/cicd/scripts/portability-lint.sh +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/code-lookup/SKILL.md +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/code-lookup/scripts/classify.sh +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/code-lookup/scripts/grep.sh +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/code-lookup/scripts/recent.sh +1 -1
- code_lens_cli-0.9.1/.claude/skills/eval/SKILL.md +399 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/repo-map/SKILL.md +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/repo-map/scripts/connections.sh +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/repo-map/scripts/graph.sh +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/repo-map/scripts/profile.sh +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.github/workflows/publish.yml +5 -5
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.github/workflows/security-checks.yml +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.github/workflows/tests.yml +5 -5
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.markdownlint-cli2.yaml +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/CHANGELOG.md +27 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/CLAUDE.md +18 -18
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/PKG-INFO +6 -6
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/README.md +2 -2
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/__init__.py +8 -7
- code_lens_cli-0.9.1/antoine/__main__.py +8 -0
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/__init__.py +27 -27
- code_lens_cli-0.9.1/antoine/cli/_commands/__init__.py +1 -0
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/_commands/classify.py +4 -4
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/_commands/explain.py +7 -7
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/_commands/grep.py +3 -3
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/_commands/learn.py +8 -8
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/_commands/recent.py +3 -3
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/_commands/whoami.py +7 -7
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/_errors.py +7 -7
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/cli/_output.py +4 -4
- code_lens_cli-0.9.1/antoine/lookup/__init__.py +25 -0
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/lookup/ast_scope.py +1 -1
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/lookup/classify.py +9 -9
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/lookup/grep_context.py +11 -11
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/lookup/recent_outline.py +16 -16
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/lookup/render.py +1 -1
- code_lens_cli-0.9.1/antoine/repo/__init__.py +9 -0
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/__main__.py +22 -22
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/connections.py +9 -9
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/errors.py +17 -17
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/graph.py +8 -8
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/manifest.py +2 -2
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/profile.py +2 -2
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/render.py +7 -7
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/culture.yaml +1 -1
- code_lens_cli-0.9.1/docs/eval-rounds/2026-05-16-round-02.md +74 -0
- code_lens_cli-0.9.1/docs/skill-sources.md +29 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/README.md +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/RUNBOOK.md +13 -13
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/_io.py +10 -7
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/backfill.py +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/corpus.yaml +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/hooks/pre_tool.py +3 -3
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/judge.py +230 -59
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/report.py +43 -30
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/summarize.py +153 -65
- code_lens_cli-0.9.1/experiments/scripts_eval/switch-arm.sh +58 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/trial.py +1 -1
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/validate.py +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/pyproject.toml +10 -10
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/sonar-project.properties +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_hooks_post_tool.py +8 -8
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_hooks_pre_tool.py +11 -11
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_io.py +10 -5
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_judge.py +130 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_report.py +3 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_ast_scope.py +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_classify.py +9 -9
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_classify_render.py +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_cli_chassis.py +5 -5
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_cli_errors.py +10 -10
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_cli_output.py +8 -8
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_cli_stubs.py +8 -8
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_grep_cmd.py +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_grep_context.py +7 -7
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_package.py +12 -12
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_recent_cmd.py +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_recent_outline.py +12 -12
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_cli.py +3 -3
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_config.py +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_connections.py +5 -5
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_detect.py +2 -2
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_errors.py +4 -4
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_graph.py +5 -5
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_manifest.py +5 -5
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_profile.py +4 -4
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/test_repo_render.py +6 -6
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/uv.lock +51 -51
- code_lens_cli-0.7.1/.claude/skills/eval/SKILL.md +0 -263
- code_lens_cli-0.7.1/docs/skill-sources.md +0 -29
- code_lens_cli-0.7.1/experiments/scripts_eval/switch-arm.sh +0 -91
- code_lens_cli-0.7.1/seer/__main__.py +0 -8
- code_lens_cli-0.7.1/seer/cli/_commands/__init__.py +0 -1
- code_lens_cli-0.7.1/seer/lookup/__init__.py +0 -25
- code_lens_cli-0.7.1/seer/repo/__init__.py +0 -9
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/settings.json +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/cicd/scripts/_resolve-nick.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/cicd/scripts/pr-reply.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/cicd/scripts/pr-status.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/cicd/scripts/workflow.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/communicate/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/communicate/scripts/fetch-issues.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/communicate/scripts/mesh-message.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/communicate/scripts/post-comment.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/communicate/scripts/post-issue.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/communicate/scripts/templates/skill-update-brief.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/run-tests/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/run-tests/scripts/test.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/sonarclaude/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/sonarclaude/scripts/sonar.sh +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/version-bump/SKILL.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills/version-bump/scripts/bump.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.claude/skills.local.yaml.example +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.flake8 +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.gitignore +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/.pre-commit-config.yaml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/LICENSE +0 -0
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/config.py +0 -0
- {code_lens_cli-0.7.1/seer → code_lens_cli-0.9.1/antoine}/repo/detect.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/docs/eval-rounds/2026-05-15-round-01.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/docs/eval-rounds/2026-05-15-smoke-02-examples.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/docs/superpowers/plans/2026-05-15-repo-map.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/docs/superpowers/plans/2026-05-15-scripts-eval-harness.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/docs/superpowers/plans/2026-05-16-seer-classify.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/docs/superpowers/specs/2026-05-15-repo-map-design.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/docs/superpowers/specs/2026-05-15-scripts-eval-harness-design.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/docs/superpowers/specs/2026-05-16-seer-classify-design.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/corpus.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/hooks/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/hooks/post_tool.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/judge_rubric.md +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/manifest.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/experiments/scripts_eval/results/.gitkeep +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/__init__.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/fixtures/.gitkeep +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/fixtures/corpus_minimal.yaml +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/fixtures/sidechain_min.jsonl +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_backfill.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_corpus.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_manifest.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_summarize.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_trial.py +0 -0
- {code_lens_cli-0.7.1 → code_lens_cli-0.9.1}/tests/scripts_eval/test_validate.py +0 -0
@@ -9,7 +9,7 @@ description: >
 review feedback, polling CI status, or the user says "create PR",
 "review comments", "address feedback", "resolve threads". Renamed
 from `pr-review` in steward 0.7.0; rebased on agex in 0.12.0.
-
+antoine divergence: `scripts/portability-lint.sh` drops the GNU-only
 `xargs -r` flag for BSD/macOS portability — see `docs/skill-sources.md`.
 ---
 
@@ -21,7 +21,7 @@ esac
 [ -z "$files" ] && { echo "(no files to check)"; exit 0; }
 
 # ----- Check 1: hard-coded /home/<user>/... paths -----
-#
+# antoine divergence: `xargs -r` is GNU-only and fails on BSD/macOS xargs.
 # `$files` is already guarded non-empty above, so `-r` is redundant — dropped.
 hits1=$(echo "$files" | xargs grep -nE '/home/[a-z][a-z0-9_-]+/' 2>/dev/null || true)
 
@@ -31,7 +31,7 @@ hits1=$(echo "$files" | xargs grep -nE '/home/[a-z][a-z0-9_-]+/' 2>/dev/null ||
 # - ~/.culture/ Culture mesh data this skill is supposed to read
 md_yaml=$(echo "$files" | grep -E '\.(md|ya?ml|toml|json|jsonc)$' || true)
 if [ -n "$md_yaml" ]; then
-#
+# antoine divergence: `xargs -r` is GNU-only; `$md_yaml` is guarded
 # non-empty by the enclosing `if`, so `-r` is redundant — dropped.
 hits2=$(echo "$md_yaml" | xargs grep -nE '~/\.[A-Za-z]' 2>/dev/null \
 | grep -vE '~/\.claude/skills/[^[:space:]"]+/scripts/' \
@@ -96,5 +96,5 @@ One call each, no re-grepping.
 
 ## Engine
 
-`
+`antoine/lookup/` — `python -m antoine <verb> …`. Each shell wrapper is a
 one-liner; the agent-facing contract is the verb and its flags.
@@ -0,0 +1,399 @@
+---
+name: eval
+description: >
+Run one scripts-eval set — one `(target, question)` row from
+`experiments/scripts_eval/corpus.yaml` × 3 trials × one arm — including
+tester subagent dispatches + captures, plus (for arm C) judge subagent
+dispatches + records, then `summarize` + commit to the round's
+accumulator file. Use when the user says "run eval set", "eval",
+"scripts-eval", "round-NN set", or asks to execute a row of the corpus.
+Three arms: A (banned — rider forbids the antoine skills), B (directed
+— rider instructs use of antoine skills), C (organic — rider permits
+but doesn't direct). Two judge pairs: A-vs-B ("do the skills help
+when used") and A-vs-C ("do the skills get adopted organically").
+`judge prepare --pair AB|AC` selects the pair.
+---
+
+# scripts-eval — running a set
+
+This skill drives one **set** of the scripts-eval harness:
+one `(target, question)` row × 3 trials × one arm.
+
+The harness pipeline (`trial` / `validate` / `judge` / `summarize`) and
+the corpus (`corpus.yaml`) are repo state — this skill is just the
+operator procedure that sequences them per session.
+
+## When to push back
+
+Before doing anything, verify the user's intent matches the session
+state. Stop and ask if any of these hold:
+
+- `env | grep ANTOINE_EVAL_RUN_ID` is empty → the harness hooks no-op, no
+metrics get captured. Operator needs to re-launch with the env vars
+exported.
+- `ANTOINE_EVAL_ARM` is set to anything other than `A`, `B`, or `C` → bad config.
+- User says "do arm C" but the matching arm-A cells don't exist on
+disk under `experiments/scripts_eval/results/$ANTOINE_EVAL_RUN_ID/arm-A/`
+→ arm A must complete first; there's nothing to pair against.
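
The three stop conditions above can be sketched as one small check. This is only an illustration: `preflight_problems` is a hypothetical helper, not part of the harness, and it assumes arm-A cells are `*.json` files directly under the `arm-A/` results directory.

```python
from pathlib import Path

VALID_ARMS = {"A", "B", "C"}


def preflight_problems(env, results_root, requested_arm=None):
    """Return reasons to stop and ask; an empty list means proceed.

    Hypothetical helper mirroring the three push-back checks; the real
    harness may validate its session state differently.
    """
    problems = []
    run_id = env.get("ANTOINE_EVAL_RUN_ID", "")
    arm = env.get("ANTOINE_EVAL_ARM", "")
    if not run_id:
        problems.append("ANTOINE_EVAL_RUN_ID is empty: hooks would no-op")
    if arm not in VALID_ARMS:
        problems.append(f"ANTOINE_EVAL_ARM={arm!r} is not one of A, B, C")
    if requested_arm == "C" and run_id:
        # Assumed layout: <results_root>/<run_id>/arm-A/*.json
        arm_a_dir = Path(results_root) / run_id / "arm-A"
        if not any(arm_a_dir.glob("*.json")):
            problems.append("no arm-A cells on disk: arm A must complete first")
    return problems


# Example call with a deliberately bad configuration.
print(preflight_problems({"ANTOINE_EVAL_RUN_ID": "", "ANTOINE_EVAL_ARM": "D"}, "results"))
```

Returning a list of reasons, rather than raising on the first failure, lets the operator see every mismatch in one pass before re-launching.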
+
+All three arms run with `repo-map` and `code-lookup` enabled on disk.
+Arm-A's constraint is **verbal** — the rider in the dispatched prompt
+is the sole guard against the subagent using the antoine skills. Do not
+edit the rider; copy it verbatim. (Earlier versions of this skill
+physically moved `.claude/skills/repo-map/` aside for arm A as
+defense-in-depth; that step was dropped because the rider proved
+sufficient and the move-aside dance made operator setup brittle.)
+
+Three arms, three questions they answer:
+
+- **A (banned)** — verbal rider forbids both antoine skills. Establishes
+the "without the new skills" baseline.
+- **B (directed)** — verbal rider instructs the subagent to use the
+antoine skills where applicable. Establishes the "with the new skills,
+when actually used" upper bound.
+- **C (organic)** — verbal rider permits but does not direct use of
+the antoine skills. Measures organic adoption rate.
+
+A-vs-B is the primary "do the skills help?" comparison; A-vs-C is the
+adoption canary. The judge supports both pairs via the `--pair` flag.
+
+## Preflight (every session)
+
+```bash
+env | grep -E "^ANTOINE_EVAL_(RUN_ID|ARM)="
+# expect both set to the intended round / arm
+```
+
+If unset, export them in your shell before launching `claude`:
+
+```bash
+# arm-A session (banned):
+export ANTOINE_EVAL_RUN_ID=2026-05-NN-round-XX ANTOINE_EVAL_ARM=A
+# arm-B session (directed):
+export ANTOINE_EVAL_RUN_ID=2026-05-NN-round-XX ANTOINE_EVAL_ARM=B
+# arm-C session (organic):
+export ANTOINE_EVAL_RUN_ID=2026-05-NN-round-XX ANTOINE_EVAL_ARM=C
+```
+
+`experiments/scripts_eval/switch-arm.sh A|B|C <run_id>` does the same
+thing.
+
+If this is the first set of the run (idempotent, safe to re-run):
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.manifest \
+init --run $ANTOINE_EVAL_RUN_ID
+```
+
+## Arm-A procedure
+
+**For each trial in {1, 2, 3}:**
+
+1. Read the question template for the target's `question_id` from
+`experiments/scripts_eval/corpus.yaml`. Look up the target's path
+from the same file's `targets:` list.
+
+2. Substitute `{repo_path}` (or `{workspace_root}` for the workspace
+question) in the template, then append **verbatim**:
+
+```text
+
+Constraints (verbatim):
+- You may NOT use the `repo-map` skill, `python -m antoine.repo`,
+the `antoine.repo` Python module, or any `scripts/*.sh` paths under
+`.claude/skills/repo-map/`.
+- You may NOT use the `code-lookup` skill, the `antoine.lookup`
+Python module, the `antoine grep` / `antoine recent` / `antoine classify`
+CLI verbs, or any `scripts/*.sh` paths under
+`.claude/skills/code-lookup/`.
+If you cannot answer without them, say so explicitly and stop.
+- Use only Read, Grep, Glob, and Bash.
+- After answering, append two sections and stop:
+### tools_used
+- <ToolName>: <count> (one line per distinct tool)
+### evidence
+- <one path per line>
+```
+
+3. **Before dispatch** — start the trial. The script reads
+`CLAUDE_CODE_SESSION_ID` from env, stamps an in-flight record, and
+prints the `trial_id` to stdout:
+
+```bash
+TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
+start --run $ANTOINE_EVAL_RUN_ID --arm $ANTOINE_EVAL_ARM \
+--target <target> --question <question_id> --trial <n>)
+```
+
+(For the workspace-scope question, omit `--target`.)
+
+4. Dispatch **one** `Explore` subagent with the full prompt.
+
+5. After the subagent finishes, end the trial. The script reads the
+subagent's sidechain transcript from
+`$HOME/.claude/projects/<encoded_cwd>/<session>/subagents/agent-*.jsonl`
+and writes the cell JSON:
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.trial \
+end --trial-id "$TRIAL_ID"
+```
+
+6. Confirm the cell JSON appeared under
+`experiments/scripts_eval/results/$ANTOINE_EVAL_RUN_ID/arm-A/`.
+
+**After all 3 trials**, summarize + commit:
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.summarize \
+--run $ANTOINE_EVAL_RUN_ID \
+--out docs/eval-rounds/$ANTOINE_EVAL_RUN_ID.md
+
+git add docs/eval-rounds/$ANTOINE_EVAL_RUN_ID.md
+git commit -m "$ANTOINE_EVAL_RUN_ID: arm-A captured for <target>/<question_id> (3 trials)"
+```
+
+Report back: cell count under arm-A/, what's pending for arm-B and
+arm-C on this set, the next pending set per the run-state table in the
+accumulator file.
+
+## Arm-B procedure
+
+Arm-B captures the **directed** trials so the A-vs-B judge run can
+assess "do the skills help when actually used?". Capture happens in
+its own session (`ANTOINE_EVAL_ARM=B`); the A-vs-B judges then run in
+the arm-C session's Judge phase, alongside the A-vs-C judges
+(`judge prepare --pair AB`).
+
+**For each trial in {1, 2, 3}:**
+
+1. Substitute the corpus question template (same target / question
+resolution as arm A), then append **verbatim** the arm-B rider:
+
+```text
+
+Constraints (verbatim):
+- For this question, you MUST use the antoine skills where they
+apply:
+* `repo-map` (`scripts/profile.sh`, `scripts/connections.sh`,
+`scripts/graph.sh` under `.claude/skills/repo-map/`) for
+repo overview, dependencies, and workspace shape.
+* `code-lookup` (`antoine grep`, `antoine recent`, `antoine classify`,
+or the equivalent scripts under
+`.claude/skills/code-lookup/`) for symbol references,
+recent commit-symbol diffs, and project-kind classification.
+Only fall back to Read / Grep / Glob / Bash for facts the
+scripts do not cover.
+- After answering, append two sections and stop:
+### tools_used
+- <ToolName>: <count> (one line per distinct tool)
+### evidence
+- <one path per line>
+```
+
+2. Bookend the dispatch with `trial start` and `trial end` exactly
+as in arm A, just with `--arm B`:
+
+```bash
+TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
+start --run $ANTOINE_EVAL_RUN_ID --arm $ANTOINE_EVAL_ARM \
+--target <target> --question <question_id> --trial <n>)
+# dispatch one Explore subagent with the rendered prompt above
+uv run --group experiments python -m experiments.scripts_eval.trial \
+end --trial-id "$TRIAL_ID"
+```
+
+3. Confirm the cell JSON appeared under
+`experiments/scripts_eval/results/$ANTOINE_EVAL_RUN_ID/arm-B/`.
+
+**After all 3 trials**, summarize + commit:
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.summarize \
+--run $ANTOINE_EVAL_RUN_ID \
+--out docs/eval-rounds/$ANTOINE_EVAL_RUN_ID.md
+
+git add docs/eval-rounds/$ANTOINE_EVAL_RUN_ID.md
+git commit -m "$ANTOINE_EVAL_RUN_ID: arm-B captured for <target>/<question_id> (3 trials)"
+```
+
+Report back: cell count under arm-B/, whether the subagent actually
+followed the directive (look at the `### tools_used` of each arm-B
+cell — `B_did_not_use_scripts` is a finding, not a bug), the next
+pending set per the run-state table.
+
+## Arm-C procedure
+
+**Precondition check (mandatory):**
+
+```bash
+ls experiments/scripts_eval/results/$ANTOINE_EVAL_RUN_ID/arm-A/<target>-<question_id>-t*.json
+# expect: 3 files (t1, t2, t3)
+```
+
+If fewer than 3, stop — arm A must complete first.
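
The precondition above amounts to checking that one cell file exists per trial. A minimal sketch follows, assuming cells are named `<target>-<question_id>-t<n>.json` as the `ls` pattern suggests; `missing_trials` is a hypothetical name and not a harness function.

```python
from pathlib import Path


def missing_trials(results_root, run_id, arm, target, question_id, trials=(1, 2, 3)):
    """Return the trial numbers that have no cell file on disk.

    Assumes the naming scheme <target>-<question_id>-t<n>.json under
    <results_root>/<run_id>/arm-<arm>/ (inferred from the `ls` pattern).
    """
    arm_dir = Path(results_root) / run_id / f"arm-{arm}"
    return [
        n for n in trials
        if not (arm_dir / f"{target}-{question_id}-t{n}.json").is_file()
    ]
```

An empty result means all three arm-A cells are present and arm C can proceed; a non-empty result names exactly which trials still need capturing.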
+
+### Tester phase
+
+**For each trial in {1, 2, 3}:**
+
+1. Substitute the corpus question template (same as arm A) but with the
+arm-C rider:
+
+```text
+
+Constraints (verbatim):
+- You may use the `repo-map` skill (and its scripts under
+`.claude/skills/repo-map/`) and the `code-lookup` skill (and its
+scripts under `.claude/skills/code-lookup/`) at your discretion.
+This includes `antoine grep` / `antoine recent` / `antoine classify`.
+- After answering, append two sections and stop:
+### tools_used
+- <ToolName>: <count>
+### evidence
+- <one path per line>
+```
+
+2. Bookend the dispatch with `trial start` and `trial end`:
+
+```bash
+TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
+start --run $ANTOINE_EVAL_RUN_ID --arm $ANTOINE_EVAL_ARM \
+--target <target> --question <question_id> --trial <n>)
+# dispatch one Explore subagent with the rendered prompt above
+uv run --group experiments python -m experiments.scripts_eval.trial \
+end --trial-id "$TRIAL_ID"
+```
+
+(For the workspace-scope question, omit `--target`.)
+
+### Judge phase
+
+Two pairs are judged independently:
+
+- **A-vs-C** — the original "with vs without (organic)" comparison.
+- **A-vs-B** — the new "with (directed) vs without" comparison; needs
+arm-B cells captured first.
+
+Both pairs use the same `prepare` / `record` flow; only `--pair`
+(`AC` or `AB`) and the matching `--blind-label-for-<arm>` flags differ.
+
+**For each trial in {1, 2, 3}**, run the A-vs-C judge first (if arm-C
+cells exist) and then the A-vs-B judge (if arm-B cells exist):
+
+1. Prepare the blinded job. `--pair` defaults to `AC`; pass `--pair AB`
+for the A-vs-B run.
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.judge \
+prepare --run $ANTOINE_EVAL_RUN_ID \
+--pair AC \
+--pair-key <target>/<question_id>/<n> \
+--seed 0 > /tmp/judge-AC-<n>.json
+```
+
+2. Materialise the prompt to a text file for dispatch (`jq -j`
+joins without adding a trailing newline, so the bytes match what
+`prepare` emitted):
+
+```bash
+jq -j '.prompt_text' /tmp/judge-AC-<n>.json > /tmp/judge-AC-<n>.txt
+```
+
+3. Dispatch the judge subagent. **The description prefix is
+load-bearing** — the `pre_tool` hook recognises `scripts_eval judge:`
+and skips logging, so the judge dispatch does not pollute the
+harness's `raw/` directory:
+
+- `subagent_type`: `general-purpose`
+- `description`: `scripts_eval judge: AC <target>/<question_id>/<n>`
+- `prompt`: the verbatim contents of `/tmp/judge-AC-<n>.txt`
+
+4. Capture the subagent's final-text response and record. The blind
+labels for an AC pair come back as `blind_label_for_A` /
+`blind_label_for_C`:
+
+```bash
+A_LABEL=$(jq -r .blind_label_for_A /tmp/judge-AC-<n>.json)
+C_LABEL=$(jq -r .blind_label_for_C /tmp/judge-AC-<n>.json)
+uv run --group experiments python -m experiments.scripts_eval.judge \
+record --run $ANTOINE_EVAL_RUN_ID \
+--pair AC \
+--pair-key <target>/<question_id>/<n> \
+--blind-label-for-a "$A_LABEL" \
+--blind-label-for-c "$C_LABEL" \
+--verdict-file -
+```
+
+5. **Repeat the four steps with `--pair AB`** to judge the directed
+arm. The job JSON for an AB pair carries `blind_label_for_A` and
+`blind_label_for_B` (no `_C`); use `--blind-label-for-b` instead of
+`--blind-label-for-c`:
+
+```bash
+uv run --group experiments python -m experiments.scripts_eval.judge \
+prepare --run $ANTOINE_EVAL_RUN_ID \
+--pair AB \
+--pair-key <target>/<question_id>/<n> \
+--seed 0 > /tmp/judge-AB-<n>.json
+jq -j '.prompt_text' /tmp/judge-AB-<n>.json > /tmp/judge-AB-<n>.txt
+# …dispatch general-purpose subagent with description
+# "scripts_eval judge: AB <target>/<question_id>/<n>" and the txt prompt…
+A_LABEL=$(jq -r .blind_label_for_A /tmp/judge-AB-<n>.json)
+B_LABEL=$(jq -r .blind_label_for_B /tmp/judge-AB-<n>.json)
+uv run --group experiments python -m experiments.scripts_eval.judge \
+record --run $ANTOINE_EVAL_RUN_ID \
+--pair AB \
+--pair-key <target>/<question_id>/<n> \
+--blind-label-for-a "$A_LABEL" \
+--blind-label-for-b "$B_LABEL" \
+--verdict-file -
+```
+
+If `record` exits non-zero with `non-JSON` / `winner` / `margin` /
+`blind_label` in the error, re-dispatch the judge subagent for that
+trial and re-record. `record` is idempotent on replay — the operator's
+recovery path is "re-dispatch + re-record"; no manual cell editing.
+
+Storage: AC verdicts land under `cell["judges"]["AC"]` (and are mirrored
+to `cell["judge"]` for back-compat with pre-phase-2 readers); AB verdicts
+land under `cell["judges"]["AB"]` only.
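
The storage rule above can be illustrated with a plain dictionary. This is a sketch of the described layout only, not the code in `judge.py`; `store_verdict` is a hypothetical name, and the verdict fields used in the usage lines are illustrative values.

```python
def store_verdict(cell, pair, verdict):
    """Store a judge verdict in a cell dict using the layout described above.

    Hypothetical helper: AC and AB verdicts go under cell["judges"][pair];
    AC is additionally mirrored to the older cell["judge"] key.
    """
    if pair not in ("AC", "AB"):
        raise ValueError(f"unknown pair: {pair!r}")
    cell.setdefault("judges", {})[pair] = verdict
    if pair == "AC":
        # Mirror for readers that only know the single-judge key.
        cell["judge"] = verdict
    return cell


# Illustrative verdicts; the real verdict schema is not shown in this diff.
cell = {}
store_verdict(cell, "AC", {"winner": "C", "margin": "slight"})
store_verdict(cell, "AB", {"winner": "B", "margin": "clear"})
```

Because each call overwrites the slot for its pair, recording the same verdict twice leaves the cell unchanged, which is consistent with the "re-dispatch + re-record" recovery path.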
|
|
361
|
+
|
|
362
|
+
### Wrap-up
|
|
363
|
+
|
|
364
|
+
```bash
uv run --group experiments python -m experiments.scripts_eval.validate \
  --run $ANTOINE_EVAL_RUN_ID

uv run --group experiments python -m experiments.scripts_eval.summarize \
  --run $ANTOINE_EVAL_RUN_ID \
  --out docs/eval-rounds/$ANTOINE_EVAL_RUN_ID.md

git add docs/eval-rounds/$ANTOINE_EVAL_RUN_ID.md
git commit -m "$ANTOINE_EVAL_RUN_ID: completed <target>/<question_id> (both arms + judge)"
```

Report back:
- A-vs-B winners (A / B / tie) and A-vs-C winners (A / C / tie).
- Whether arm B and arm C actually used the antoine scripts (look at the
  `### tools_used` of each cell — `B_did_not_use_scripts` and
  `C_did_not_use_scripts` are findings, not bugs).
- The next pending set per the run-state table.

## Reading the run state

The committed run-state table and per-set verdicts live in
`docs/eval-rounds/$ANTOINE_EVAL_RUN_ID.md`, between the
`<!-- runstate:start -->` / `<!-- runstate:end -->` and
`<!-- evidence:start -->` / `<!-- evidence:end -->` markers. `summarize.py`
rewrites those regions idempotently — do not hand-edit them.
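
An idempotent rewrite of a marker-delimited region can be sketched as follows. This is an assumption about the general technique, not a copy of `summarize.py`: the function name `replace_region` and the sample table row are hypothetical, and only the marker strings come from the text above.

```python
import re


def replace_region(text: str, name: str, body: str) -> str:
    """Replace everything between <!-- name:start --> and <!-- name:end -->."""
    start, end = f"<!-- {name}:start -->", f"<!-- {name}:end -->"
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    if not pattern.search(text):
        raise ValueError(f"markers for region {name!r} not found")
    replacement = f"{start}\n{body.strip()}\n{end}"
    # A function replacement avoids backslash-escape processing of `body`.
    return pattern.sub(lambda _match: replacement, text, count=1)


doc = "# Round\n<!-- runstate:start -->\nold\n<!-- runstate:end -->\nnotes\n"
once = replace_region(doc, "runstate", "| set | arm-A |")
twice = replace_region(once, "runstate", "| set | arm-A |")
```

Because the markers themselves are re-emitted on every write, running the rewrite a second time with the same body yields the same file, and text outside the markers is never touched.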

The accumulator file is also the operator's source of truth for what's
pending: a row's `arm-A` or `arm-C` count below `3/3` means more trials
are needed; `judged` below the arm counts means judges still owe verdicts.
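
The reading rule above could be applied mechanically along these lines. The row shape (a dict of `"done/total"` strings keyed by column name) and the helper names are assumptions made for illustration; comparing `judged` against the smaller of the two arm counts is likewise one interpretation of "below the arm counts".

```python
def parse_count(value: str) -> int:
    """Return the numerator of a count such as '2/3' (a bare '2' also works)."""
    done, _, _total = value.partition("/")
    return int(done)


def pending_work(row: dict[str, str], target: int = 3) -> list[str]:
    """List what a run-state row still needs (hypothetical row shape)."""
    todo = []
    arm_counts = {}
    for arm in ("arm-A", "arm-C"):
        arm_counts[arm] = parse_count(row[arm])
        if arm_counts[arm] < target:
            todo.append(f"{arm}: {target - arm_counts[arm]} more trial(s)")
    # Assumption: a verdict needs both arms, so compare against the smaller count.
    if parse_count(row["judged"]) < min(arm_counts.values()):
        todo.append("judge: verdicts owed")
    return todo


partial = pending_work({"arm-A": "3/3", "arm-C": "2/3", "judged": "1/3"})
complete = pending_work({"arm-A": "3/3", "arm-C": "3/3", "judged": "3/3"})
```

An empty list means the row is complete; anything else names the next step for the operator.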

## Cite-don't-import

This skill is original to antoine (the harness only exists here). When
promoted upstream, it would re-vendor into steward's skill suppliers —
update `docs/skill-sources.md` accordingly at that point.

@@ -97,8 +97,8 @@ Flags always override config.
 
 ## Engine
 
-The actual logic lives in `
-`uv run python -m
+The actual logic lives in `antoine/repo/` and is invoked via
+`uv run python -m antoine.repo <verb>`. The shell scripts are one-line wrappers; the
 agent-facing contract is the verbs and their flags, not the wrappers.
 
 > **Interpreter note:** the scripts use `uv run --directory <project-root>`
@@ -5,12 +5,12 @@ on:
     branches: [main]
     paths:
       - "pyproject.toml"
-      - "
+      - "antoine/**"
   pull_request:
     branches: [main]
     paths:
       - "pyproject.toml"
-      - "
+      - "antoine/**"
 
 jobs:
   test:
@@ -57,7 +57,7 @@ jobs:
       - name: Build and publish each distribution to TestPyPI
         run: |
           set -euo pipefail
-          for pkg in
+          for pkg in antoine-cli kata-cli code-lens-cli; do
             echo "::group::TestPyPI publish $pkg"
             # Run the per-package steps in a subshell so set -e failures
             # don't skip the ::endgroup:: marker — keeps Actions logs
@@ -81,7 +81,7 @@ jobs:
       - name: Print install commands
         if: always()
         run: |
-          for pkg in
+          for pkg in antoine-cli kata-cli code-lens-cli; do
             echo "::notice::Test with: uv tool install --index-url https://test.pypi.org/simple/ --index-strategy unsafe-best-match $pkg==${DEV_VERSION}"
           done
 
@@ -105,7 +105,7 @@ jobs:
       - name: Build and publish each distribution
         run: |
           set -euo pipefail
-          for pkg in
+          for pkg in antoine-cli kata-cli code-lens-cli; do
             echo "::group::Publishing $pkg"
             # Run the per-package steps in a subshell so set -e failures
             # don't skip the ::endgroup:: marker — keeps Actions logs
@@ -25,11 +25,11 @@ jobs:
       - run: uv sync
 
       - name: Run Bandit
-        run: uv run bandit -r
+        run: uv run bandit -r antoine/ -f json -o bandit-results.json -c pyproject.toml
         continue-on-error: true
 
       - name: Run Pylint
-        run: uv run pylint
+        run: uv run pylint antoine/ --output-format=json:pylint-results.json,text
         continue-on-error: true
 
       - name: Upload Security Results
@@ -30,7 +30,7 @@ jobs:
 
       - run: uv sync
 
-      - run: uv run pytest -n auto --cov=
+      - run: uv run pytest -n auto --cov=antoine --cov-report=xml:coverage.xml --cov-report=term -v
 
       - name: SonarCloud Scan
         if: env.SONAR_TOKEN != ''
@@ -56,16 +56,16 @@ jobs:
       - run: uv sync
 
       - name: black --check
-        run: uv run black --check
+        run: uv run black --check antoine tests
 
       - name: isort --check
-        run: uv run isort --check-only
+        run: uv run isort --check-only antoine tests
 
       - name: flake8
-        run: uv run flake8 --config=.flake8
+        run: uv run flake8 --config=.flake8 antoine/ tests/
 
       - name: bandit
-        run: uv run bandit -c pyproject.toml -r
+        run: uv run bandit -c pyproject.toml -r antoine
 
       - name: markdownlint-cli2
         run: |
@@ -1,4 +1,4 @@
-# markdownlint-cli2 config for
+# markdownlint-cli2 config for antoine.
 # markdownlint-cli2 stops walking at the git root, so a global
 # markdownlint config in the user's home directory isn't picked up from
 # inside the repo. Keep this file aligned with the global preset.
@@ -5,6 +5,33 @@ All notable changes to this project will be documented in this file.
 Format follows [Keep a Changelog](https://keepachangelog.com/). This project
 adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.9.1] - 2026-05-17
+
+### Changed
+
+- PyPI distribution renamed from `antoine` to `antoine-cli` to avoid name collision and stay consistent with the `kata-cli` / `code-lens-cli` alt-publish naming convention. The Python module (`antoine`) and console script (`antoine`) are unchanged; only the wheel-distribution name moves. `_resolve_version()` fallback list and `.github/workflows/publish.yml` publish loop updated to match.
+
+## [0.9.0] - 2026-05-17
+
+### Changed
+
+- **Repository rename: `seer-cli` → `antoine`.** GitHub remote moved to `agentculture/antoine`; primary PyPI distribution renamed from `seer-cli` to `antoine`; `kata-cli` and `code-lens-cli` alt-publishes preserved. Python module renamed `seer/` → `antoine/`; primary console script renamed `seer` → `antoine` (the `kata` alias is retained). All imports, error classes (`SeerError` → `AntoineError`, `_SeerArgumentParser` → `_AntoineArgumentParser`), env vars (`SEER_EVAL_*` → `ANTOINE_EVAL_*`), Sonar project key (`agentculture_seer-cli` → `agentculture_antoine`), `culture.yaml` agent suffix, vendored skill bodies, and the scripts-eval harness's banned-pattern detection updated accordingly. Historical `CHANGELOG.md` entries, `docs/eval-rounds/`, and dated `docs/superpowers/{specs,plans}/` files are intentionally left referring to `seer` — those describe past state.
+
+## [0.8.0] - 2026-05-16
+
+### Added
+
+- scripts-eval: arm-B (directed-use) — rider explicitly instructs the subagent to use repo-map + code-lookup scripts. Cells captured under `results/<run>/arm-B/`. `corpus.yaml` arms field becomes `[A, B, C]`.
+- scripts-eval: `judge.py` is now pair-aware. `judge prepare` / `judge record` take `--pair AC|AB|BC`; new label flag `--blind-label-for-b`. Verdicts land under `cell["judges"][pair]`; the AC pair still mirrors to legacy `cell["judge"]` for back-compat. New `iter_jobs_pair` / `record_verdict_pair` public APIs; old `iter_jobs` / `record_verdict` stay as AC-pair wrappers.
+- scripts-eval: summarize.py renders both A-vs-B and A-vs-C winner tallies in the run-state table, with per-pair verdict tables in the evidence section.
+- eval skill: arm-A rider tightened to also forbid the code-lookup skill / seer.lookup / seer grep/recent/classify so "without" means without both new skills. Added arm-B procedure section and pair-aware judge procedure (--pair AB | AC).
+
+### Changed
+
+- eval skill: switch-arm.sh no longer moves .claude/skills/repo-map/ on disk; arm-A relies on the verbal rider alone (rider proved sufficient; move-aside dance made operator setup brittle).
+- scripts-eval: report.py violation patterns now also flag code-lookup script use; report aggregates median validation across all three arms; per-cell view shows every captured arm.
+- scripts-eval: validate.py / backfill.py iterate over _io.ARMS instead of hardcoding (A, C).
+
 ## [0.7.1] - 2026-05-16
 
 ### Changed
|