code-lens-cli 0.7.1__tar.gz → 0.8.0__tar.gz

Files changed (142)
  1. code_lens_cli-0.8.0/.claude/skills/eval/SKILL.md +399 -0
  2. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/CHANGELOG.md +15 -0
  3. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/PKG-INFO +1 -1
  4. code_lens_cli-0.8.0/docs/eval-rounds/2026-05-16-round-02.md +74 -0
  5. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/skill-sources.md +1 -1
  6. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/_io.py +6 -3
  7. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/backfill.py +1 -1
  8. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/corpus.yaml +1 -1
  9. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/judge.py +230 -59
  10. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/report.py +40 -27
  11. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/summarize.py +153 -65
  12. code_lens_cli-0.8.0/experiments/scripts_eval/switch-arm.sh +58 -0
  13. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/trial.py +1 -1
  14. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/validate.py +2 -2
  15. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/pyproject.toml +1 -1
  16. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_io.py +6 -1
  17. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_judge.py +130 -0
  18. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_report.py +3 -2
  19. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/uv.lock +1 -1
  20. code_lens_cli-0.7.1/.claude/skills/eval/SKILL.md +0 -263
  21. code_lens_cli-0.7.1/experiments/scripts_eval/switch-arm.sh +0 -91
  22. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/settings.json +0 -0
  23. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/SKILL.md +0 -0
  24. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/_resolve-nick.sh +0 -0
  25. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/portability-lint.sh +0 -0
  26. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/pr-reply.sh +0 -0
  27. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/pr-status.sh +0 -0
  28. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/cicd/scripts/workflow.sh +0 -0
  29. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/code-lookup/SKILL.md +0 -0
  30. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/code-lookup/scripts/classify.sh +0 -0
  31. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/code-lookup/scripts/grep.sh +0 -0
  32. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/code-lookup/scripts/recent.sh +0 -0
  33. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/SKILL.md +0 -0
  34. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/fetch-issues.sh +0 -0
  35. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/mesh-message.sh +0 -0
  36. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/post-comment.sh +0 -0
  37. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/post-issue.sh +0 -0
  38. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/communicate/scripts/templates/skill-update-brief.md +0 -0
  39. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/repo-map/SKILL.md +0 -0
  40. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/repo-map/scripts/connections.sh +0 -0
  41. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/repo-map/scripts/graph.sh +0 -0
  42. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/repo-map/scripts/profile.sh +0 -0
  43. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/run-tests/SKILL.md +0 -0
  44. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/run-tests/scripts/test.sh +0 -0
  45. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/sonarclaude/SKILL.md +0 -0
  46. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/sonarclaude/scripts/sonar.sh +0 -0
  47. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/version-bump/SKILL.md +0 -0
  48. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills/version-bump/scripts/bump.py +0 -0
  49. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.claude/skills.local.yaml.example +0 -0
  50. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.flake8 +0 -0
  51. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.github/workflows/publish.yml +0 -0
  52. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.github/workflows/security-checks.yml +0 -0
  53. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.github/workflows/tests.yml +0 -0
  54. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.gitignore +0 -0
  55. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.markdownlint-cli2.yaml +0 -0
  56. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/.pre-commit-config.yaml +0 -0
  57. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/CLAUDE.md +0 -0
  58. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/LICENSE +0 -0
  59. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/README.md +0 -0
  60. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/culture.yaml +0 -0
  61. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/eval-rounds/2026-05-15-round-01.md +0 -0
  62. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/eval-rounds/2026-05-15-smoke-02-examples.md +0 -0
  63. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/plans/2026-05-15-repo-map.md +0 -0
  64. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/plans/2026-05-15-scripts-eval-harness.md +0 -0
  65. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/plans/2026-05-16-seer-classify.md +0 -0
  66. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/specs/2026-05-15-repo-map-design.md +0 -0
  67. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/specs/2026-05-15-scripts-eval-harness-design.md +0 -0
  68. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/docs/superpowers/specs/2026-05-16-seer-classify-design.md +0 -0
  69. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/__init__.py +0 -0
  70. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/README.md +0 -0
  71. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/RUNBOOK.md +0 -0
  72. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/__init__.py +0 -0
  73. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/corpus.py +0 -0
  74. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/hooks/__init__.py +0 -0
  75. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/hooks/post_tool.py +0 -0
  76. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/hooks/pre_tool.py +0 -0
  77. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/judge_rubric.md +0 -0
  78. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/manifest.py +0 -0
  79. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/experiments/scripts_eval/results/.gitkeep +0 -0
  80. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/__init__.py +0 -0
  81. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/__main__.py +0 -0
  82. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/__init__.py +0 -0
  83. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/__init__.py +0 -0
  84. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/classify.py +0 -0
  85. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/explain.py +0 -0
  86. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/grep.py +0 -0
  87. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/learn.py +0 -0
  88. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/recent.py +0 -0
  89. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_commands/whoami.py +0 -0
  90. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_errors.py +0 -0
  91. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/cli/_output.py +0 -0
  92. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/__init__.py +0 -0
  93. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/ast_scope.py +0 -0
  94. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/classify.py +0 -0
  95. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/grep_context.py +0 -0
  96. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/recent_outline.py +0 -0
  97. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/lookup/render.py +0 -0
  98. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/__init__.py +0 -0
  99. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/__main__.py +0 -0
  100. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/config.py +0 -0
  101. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/connections.py +0 -0
  102. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/detect.py +0 -0
  103. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/errors.py +0 -0
  104. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/graph.py +0 -0
  105. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/manifest.py +0 -0
  106. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/profile.py +0 -0
  107. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/seer/repo/render.py +0 -0
  108. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/sonar-project.properties +0 -0
  109. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/__init__.py +0 -0
  110. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/__init__.py +0 -0
  111. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/fixtures/.gitkeep +0 -0
  112. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/fixtures/corpus_minimal.yaml +0 -0
  113. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/fixtures/sidechain_min.jsonl +0 -0
  114. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_backfill.py +0 -0
  115. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_corpus.py +0 -0
  116. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_hooks_post_tool.py +0 -0
  117. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_hooks_pre_tool.py +0 -0
  118. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_manifest.py +0 -0
  119. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_summarize.py +0 -0
  120. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_trial.py +0 -0
  121. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/scripts_eval/test_validate.py +0 -0
  122. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_ast_scope.py +0 -0
  123. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_classify.py +0 -0
  124. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_classify_render.py +0 -0
  125. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_cli_chassis.py +0 -0
  126. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_cli_errors.py +0 -0
  127. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_cli_output.py +0 -0
  128. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_cli_stubs.py +0 -0
  129. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_grep_cmd.py +0 -0
  130. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_grep_context.py +0 -0
  131. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_package.py +0 -0
  132. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_recent_cmd.py +0 -0
  133. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_recent_outline.py +0 -0
  134. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_cli.py +0 -0
  135. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_config.py +0 -0
  136. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_connections.py +0 -0
  137. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_detect.py +0 -0
  138. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_errors.py +0 -0
  139. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_graph.py +0 -0
  140. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_manifest.py +0 -0
  141. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_profile.py +0 -0
  142. {code_lens_cli-0.7.1 → code_lens_cli-0.8.0}/tests/test_repo_render.py +0 -0
@@ -0,0 +1,399 @@
1
+ ---
2
+ name: eval
3
+ description: >
4
+ Run one scripts-eval set — one `(target, question)` row from
5
+ `experiments/scripts_eval/corpus.yaml` × 3 trials × one arm — including
6
+ tester subagent dispatches + captures, plus (for arm C) judge subagent
7
+ dispatches + records, then `summarize` + commit to the round's
8
+ accumulator file. Use when the user says "run eval set", "eval",
9
+ "scripts-eval", "round-NN set", or asks to execute a row of the corpus.
10
+ Three arms: A (banned — rider forbids the seer skills), B (directed
11
+ — rider instructs use of seer skills), C (organic — rider permits
12
+ but doesn't direct). Two judge pairs: A-vs-B ("do the skills help
13
+ when used") and A-vs-C ("do the skills get adopted organically").
14
+ `judge prepare --pair AB|AC` selects the pair.
15
+ ---
16
+
17
+ # scripts-eval — running a set
18
+
19
+ This skill drives one **set** of the scripts-eval harness:
20
+ one `(target, question)` row × 3 trials × one arm.
21
+
22
+ The harness pipeline (`trial` / `validate` / `judge` / `summarize`) and
23
+ the corpus (`corpus.yaml`) are repo state — this skill is just the
24
+ operator procedure that sequences them per session.
25
+
26
+ ## When to push back
27
+
28
+ Before doing anything, verify the user's intent matches the session
29
+ state. Stop and ask if any of these hold:
30
+
31
+ - `env | grep SEER_EVAL_RUN_ID` is empty → the harness hooks no-op, no
32
+ metrics get captured. Operator needs to re-launch with the env vars
33
+ exported.
34
+ - `SEER_EVAL_ARM` is set to anything other than `A`, `B`, or `C` → bad config.
35
+ - User says "do arm C" but the matching arm-A cells don't exist on
36
+ disk under `experiments/scripts_eval/results/$SEER_EVAL_RUN_ID/arm-A/`
37
+ → arm A must complete first; there's nothing to pair against.
38
+
39
+ All three arms run with `repo-map` and `code-lookup` enabled on disk.
40
+ Arm-A's constraint is **verbal** — the rider in the dispatched prompt
41
+ is the sole guard against the subagent using the seer skills. Do not
42
+ edit the rider; copy it verbatim. (Earlier versions of this skill
43
+ physically moved `.claude/skills/repo-map/` aside for arm A as
44
+ defense-in-depth; that step was dropped because the rider proved
45
+ sufficient and the move-aside dance made operator setup brittle.)
46
+
47
+ Three arms, three questions they answer:
48
+
49
+ - **A (banned)** — verbal rider forbids both seer skills. Establishes
50
+ the "without the new skills" baseline.
51
+ - **B (directed)** — verbal rider instructs the subagent to use the
52
+ seer skills where applicable. Establishes the "with the new skills,
53
+ when actually used" upper bound.
54
+ - **C (organic)** — verbal rider permits but does not direct use of
55
+ the seer skills. Measures organic adoption rate.
56
+
57
+ A-vs-B is the primary "do the skills help?" comparison; A-vs-C is the
58
+ adoption canary. The judge supports both pairs via the `--pair` flag.
59
+
60
+ ## Preflight (every session)
61
+
62
+ ```bash
63
+ env | grep -E "^SEER_EVAL_(RUN_ID|ARM)="
64
+ # expect both set to the intended round / arm
65
+ ```
66
+
67
+ If unset, export them in your shell before launching `claude`:
68
+
69
+ ```bash
70
+ # arm-A session (banned):
71
+ export SEER_EVAL_RUN_ID=2026-05-NN-round-XX SEER_EVAL_ARM=A
72
+ # arm-B session (directed):
73
+ export SEER_EVAL_RUN_ID=2026-05-NN-round-XX SEER_EVAL_ARM=B
74
+ # arm-C session (organic):
75
+ export SEER_EVAL_RUN_ID=2026-05-NN-round-XX SEER_EVAL_ARM=C
76
+ ```
77
+
78
+ `experiments/scripts_eval/switch-arm.sh A|B|C <run_id>` does the same
79
+ thing.
80
+
81
+ If this is the first set of the run (idempotent, safe to re-run):
82
+
83
+ ```bash
84
+ uv run --group experiments python -m experiments.scripts_eval.manifest \
85
+ init --run $SEER_EVAL_RUN_ID
86
+ ```
87
+
88
+ ## Arm-A procedure
89
+
90
+ **For each trial in {1, 2, 3}:**
91
+
92
+ 1. Read the question template for the target's `question_id` from
93
+ `experiments/scripts_eval/corpus.yaml`. Look up the target's path
94
+ from the same file's `targets:` list.
95
+
96
+ 2. Substitute `{repo_path}` (or `{workspace_root}` for the workspace
97
+ question) in the template, then append **verbatim**:
98
+
99
+ ```text
100
+
101
+ Constraints (verbatim):
102
+ - You may NOT use the `repo-map` skill, `python -m seer.repo`,
103
+ the `seer.repo` Python module, or any `scripts/*.sh` paths under
104
+ `.claude/skills/repo-map/`.
105
+ - You may NOT use the `code-lookup` skill, the `seer.lookup`
106
+ Python module, the `seer grep` / `seer recent` / `seer classify`
107
+ CLI verbs, or any `scripts/*.sh` paths under
108
+ `.claude/skills/code-lookup/`.
109
+ If you cannot answer without them, say so explicitly and stop.
110
+ - Use only Read, Grep, Glob, and Bash.
111
+ - After answering, append two sections and stop:
112
+ ### tools_used
113
+ - <ToolName>: <count> (one line per distinct tool)
114
+ ### evidence
115
+ - <one path per line>
116
+ ```
117
+
118
+ 3. **Before dispatch** — start the trial. The script reads
119
+ `CLAUDE_CODE_SESSION_ID` from env, stamps an in-flight record, and
120
+ prints the `trial_id` to stdout:
121
+
122
+ ```bash
123
+ TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
124
+ start --run $SEER_EVAL_RUN_ID --arm $SEER_EVAL_ARM \
125
+ --target <target> --question <question_id> --trial <n>)
126
+ ```
127
+
128
+ (For the workspace-scope question, omit `--target`.)
129
+
130
+ 4. Dispatch **one** `Explore` subagent with the full prompt.
131
+
132
+ 5. After the subagent finishes, end the trial. The script reads the
133
+ subagent's sidechain transcript from
134
+ `$HOME/.claude/projects/<encoded_cwd>/<session>/subagents/agent-*.jsonl`
135
+ and writes the cell JSON:
136
+
137
+ ```bash
138
+ uv run --group experiments python -m experiments.scripts_eval.trial \
139
+ end --trial-id "$TRIAL_ID"
140
+ ```
141
+
142
+ 6. Confirm the cell JSON appeared under
143
+ `experiments/scripts_eval/results/$SEER_EVAL_RUN_ID/arm-A/`.
144
+
145
+ **After all 3 trials**, summarize + commit:
146
+
147
+ ```bash
148
+ uv run --group experiments python -m experiments.scripts_eval.summarize \
149
+ --run $SEER_EVAL_RUN_ID \
150
+ --out docs/eval-rounds/$SEER_EVAL_RUN_ID.md
151
+
152
+ git add docs/eval-rounds/$SEER_EVAL_RUN_ID.md
153
+ git commit -m "$SEER_EVAL_RUN_ID: arm-A captured for <target>/<question_id> (3 trials)"
154
+ ```
155
+
156
+ Report back: cell count under arm-A/, what's pending for arm-B and
157
+ arm-C on this set, the next pending set per the run-state table in the
158
+ accumulator file.
159
+
160
+ ## Arm-B procedure
161
+
162
+ Arm-B captures the **directed** trials so the A-vs-B judge run can
163
+ assess "do the skills help when actually used?". Capture happens in
164
+ its own session (`SEER_EVAL_ARM=B`); the A-vs-B judges then run in
165
+ the arm-C session's Judge phase, alongside the A-vs-C judges
166
+ (`judge prepare --pair AB`).
167
+
168
+ **For each trial in {1, 2, 3}:**
169
+
170
+ 1. Substitute the corpus question template (same target / question
171
+ resolution as arm A), then append **verbatim** the arm-B rider:
172
+
173
+ ```text
174
+
175
+ Constraints (verbatim):
176
+ - For this question, you MUST use the seer skills where they
177
+ apply:
178
+ * `repo-map` (`scripts/profile.sh`, `scripts/connections.sh`,
179
+ `scripts/graph.sh` under `.claude/skills/repo-map/`) for
180
+ repo overview, dependencies, and workspace shape.
181
+ * `code-lookup` (`seer grep`, `seer recent`, `seer classify`,
182
+ or the equivalent scripts under
183
+ `.claude/skills/code-lookup/`) for symbol references,
184
+ recent commit-symbol diffs, and project-kind classification.
185
+ Only fall back to Read / Grep / Glob / Bash for facts the
186
+ scripts do not cover.
187
+ - After answering, append two sections and stop:
188
+ ### tools_used
189
+ - <ToolName>: <count> (one line per distinct tool)
190
+ ### evidence
191
+ - <one path per line>
192
+ ```
193
+
194
+ 2. Bookend the dispatch with `trial start` and `trial end` exactly
195
+ as in arm A, just with `--arm B`:
196
+
197
+ ```bash
198
+ TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
199
+ start --run $SEER_EVAL_RUN_ID --arm $SEER_EVAL_ARM \
200
+ --target <target> --question <question_id> --trial <n>)
201
+ # dispatch one Explore subagent with the rendered prompt above
202
+ uv run --group experiments python -m experiments.scripts_eval.trial \
203
+ end --trial-id "$TRIAL_ID"
204
+ ```
205
+
206
+ 3. Confirm the cell JSON appeared under
207
+ `experiments/scripts_eval/results/$SEER_EVAL_RUN_ID/arm-B/`.
208
+
209
+ **After all 3 trials**, summarize + commit:
210
+
211
+ ```bash
212
+ uv run --group experiments python -m experiments.scripts_eval.summarize \
213
+ --run $SEER_EVAL_RUN_ID \
214
+ --out docs/eval-rounds/$SEER_EVAL_RUN_ID.md
215
+
216
+ git add docs/eval-rounds/$SEER_EVAL_RUN_ID.md
217
+ git commit -m "$SEER_EVAL_RUN_ID: arm-B captured for <target>/<question_id> (3 trials)"
218
+ ```
219
+
220
+ Report back: cell count under arm-B/, whether the subagent actually
221
+ followed the directive (look at the `### tools_used` of each arm-B
222
+ cell — `B_did_not_use_scripts` is a finding, not a bug), the next
223
+ pending set per the run-state table.
224
+
225
+ ## Arm-C procedure
226
+
227
+ **Precondition check (mandatory):**
228
+
229
+ ```bash
230
+ ls experiments/scripts_eval/results/$SEER_EVAL_RUN_ID/arm-A/<target>-<question_id>-t*.json
231
+ # expect: 3 files (t1, t2, t3)
232
+ ```
233
+
234
+ If fewer than 3, stop — arm A must complete first.
235
+
236
+ ### Tester phase
237
+
238
+ **For each trial in {1, 2, 3}:**
239
+
240
+ 1. Substitute the corpus question template (same as arm A) but with the
241
+ arm-C rider:
242
+
243
+ ```text
244
+
245
+ Constraints (verbatim):
246
+ - You may use the `repo-map` skill (and its scripts under
247
+ `.claude/skills/repo-map/`) and the `code-lookup` skill (and its
248
+ scripts under `.claude/skills/code-lookup/`) at your discretion.
249
+ This includes `seer grep` / `seer recent` / `seer classify`.
250
+ - After answering, append two sections and stop:
251
+ ### tools_used
252
+ - <ToolName>: <count>
253
+ ### evidence
254
+ - <one path per line>
255
+ ```
256
+
257
+ 2. Bookend the dispatch with `trial start` and `trial end`:
258
+
259
+ ```bash
260
+ TRIAL_ID=$(uv run --group experiments python -m experiments.scripts_eval.trial \
261
+ start --run $SEER_EVAL_RUN_ID --arm $SEER_EVAL_ARM \
262
+ --target <target> --question <question_id> --trial <n>)
263
+ # dispatch one Explore subagent with the rendered prompt above
264
+ uv run --group experiments python -m experiments.scripts_eval.trial \
265
+ end --trial-id "$TRIAL_ID"
266
+ ```
267
+
268
+ (For the workspace-scope question, omit `--target`.)
269
+
270
+ ### Judge phase
271
+
272
+ Two pairs are judged independently:
273
+
274
+ - **A-vs-C** — the original "with vs without (organic)" comparison.
275
+ - **A-vs-B** — the new "with (directed) vs without" comparison; needs
276
+ arm-B cells captured first.
277
+
278
+ Both pairs use the same `prepare` / `record` flow; only `--pair`
279
+ (`AC` or `AB`) and the matching `--blind-label-for-<arm>` flags differ.
280
+
281
+ **For each trial in {1, 2, 3}**, run the A-vs-C judge first (if arm-C
282
+ cells exist) and then the A-vs-B judge (if arm-B cells exist):
283
+
284
+ 1. Prepare the blinded job. `--pair` defaults to `AC`; pass `--pair AB`
285
+ for the A-vs-B run.
286
+
287
+ ```bash
288
+ uv run --group experiments python -m experiments.scripts_eval.judge \
289
+ prepare --run $SEER_EVAL_RUN_ID \
290
+ --pair AC \
291
+ --pair-key <target>/<question_id>/<n> \
292
+ --seed 0 > /tmp/judge-AC-<n>.json
293
+ ```
294
+
295
+ 2. Materialise the prompt to a text file for dispatch (`jq -j`
296
+ joins without adding a trailing newline, so the bytes match what
297
+ `prepare` emitted):
298
+
299
+ ```bash
300
+ jq -j '.prompt_text' /tmp/judge-AC-<n>.json > /tmp/judge-AC-<n>.txt
301
+ ```
302
+
303
+ 3. Dispatch the judge subagent. **The description prefix is
304
+ load-bearing** — the `pre_tool` hook recognises `scripts_eval judge:`
305
+ and skips logging, so the judge dispatch does not pollute the
306
+ harness's `raw/` directory:
307
+
308
+ - `subagent_type`: `general-purpose`
309
+ - `description`: `scripts_eval judge: AC <target>/<question_id>/<n>`
310
+ - `prompt`: the verbatim contents of `/tmp/judge-AC-<n>.txt`
311
+
312
+ 4. Capture the subagent's final-text response and record. The blind
313
+ labels for an AC pair come back as `blind_label_for_A` /
314
+ `blind_label_for_C`:
315
+
316
+ ```bash
317
+ A_LABEL=$(jq -r .blind_label_for_A /tmp/judge-AC-<n>.json)
318
+ C_LABEL=$(jq -r .blind_label_for_C /tmp/judge-AC-<n>.json)
319
+ uv run --group experiments python -m experiments.scripts_eval.judge \
320
+ record --run $SEER_EVAL_RUN_ID \
321
+ --pair AC \
322
+ --pair-key <target>/<question_id>/<n> \
323
+ --blind-label-for-a "$A_LABEL" \
324
+ --blind-label-for-c "$C_LABEL" \
325
+ --verdict-file -
326
+ ```
327
+
328
+ 5. **Repeat the four steps with `--pair AB`** to judge the directed
329
+ arm. The job JSON for an AB pair carries `blind_label_for_A` and
330
+ `blind_label_for_B` (no `_C`); use `--blind-label-for-b` instead of
331
+ `--blind-label-for-c`:
332
+
333
+ ```bash
334
+ uv run --group experiments python -m experiments.scripts_eval.judge \
335
+ prepare --run $SEER_EVAL_RUN_ID \
336
+ --pair AB \
337
+ --pair-key <target>/<question_id>/<n> \
338
+ --seed 0 > /tmp/judge-AB-<n>.json
339
+ jq -j '.prompt_text' /tmp/judge-AB-<n>.json > /tmp/judge-AB-<n>.txt
340
+ # …dispatch general-purpose subagent with description
341
+ # "scripts_eval judge: AB <target>/<question_id>/<n>" and the txt prompt…
342
+ A_LABEL=$(jq -r .blind_label_for_A /tmp/judge-AB-<n>.json)
343
+ B_LABEL=$(jq -r .blind_label_for_B /tmp/judge-AB-<n>.json)
344
+ uv run --group experiments python -m experiments.scripts_eval.judge \
345
+ record --run $SEER_EVAL_RUN_ID \
346
+ --pair AB \
347
+ --pair-key <target>/<question_id>/<n> \
348
+ --blind-label-for-a "$A_LABEL" \
349
+ --blind-label-for-b "$B_LABEL" \
350
+ --verdict-file -
351
+ ```
352
+
353
+ If `record` exits non-zero with `non-JSON` / `winner` / `margin` /
354
+ `blind_label` in the error, re-dispatch the judge subagent for that
355
+ trial and re-record. `record` is idempotent on replay — the operator's
356
+ recovery path is "re-dispatch + re-record"; no manual cell editing.
357
+
358
+ Storage: AC verdicts land under `cell["judges"]["AC"]` (and are mirrored
359
+ to `cell["judge"]` for back-compat with pre-phase-2 readers); AB verdicts
360
+ land under `cell["judges"]["AB"]` only.
361
+
362
+ ### Wrap-up
363
+
364
+ ```bash
365
+ uv run --group experiments python -m experiments.scripts_eval.validate \
366
+ --run $SEER_EVAL_RUN_ID
367
+
368
+ uv run --group experiments python -m experiments.scripts_eval.summarize \
369
+ --run $SEER_EVAL_RUN_ID \
370
+ --out docs/eval-rounds/$SEER_EVAL_RUN_ID.md
371
+
372
+ git add docs/eval-rounds/$SEER_EVAL_RUN_ID.md
373
+ git commit -m "$SEER_EVAL_RUN_ID: completed <target>/<question_id> (all arms + judges)"
374
+ ```
375
+
376
+ Report back:
377
+ - A-vs-B winners (A / B / tie) and A-vs-C winners (A / C / tie).
378
+ - Whether arm B and arm C actually used the seer scripts (look at the
379
+ `### tools_used` of each cell — `B_did_not_use_scripts` and
380
+ `C_did_not_use_scripts` are findings, not bugs).
381
+ - The next pending set per the run-state table.
382
+
383
+ ## Reading the run state
384
+
385
+ The committed run-state table and per-set verdicts live in
386
+ `docs/eval-rounds/$SEER_EVAL_RUN_ID.md`, between the
387
+ `<!-- runstate:start -->` / `<!-- runstate:end -->` and
388
+ `<!-- evidence:start -->` / `<!-- evidence:end -->` markers. `summarize.py`
389
+ rewrites those regions idempotently — do not hand-edit them.
390
+
391
+ The accumulator file is also the operator's source of truth for what's
392
+ pending: a row's `arm-A`, `arm-B`, or `arm-C` count below `3/3` means more trials
393
+ are needed; `judged` below the arm counts means judges still owe verdicts.
394
+
395
+ ## Cite-don't-import
396
+
397
+ This skill is original to seer-cli (the harness only exists here). When
398
+ promoted upstream, it would re-vendor into steward's skill suppliers —
399
+ update `docs/skill-sources.md` accordingly at that point.
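The "When to push back" checks and the arm-C precondition in the skill above can be sketched in Python. This is an illustrative sketch only and is not part of the package: `preflight` and `arm_a_cells_ready` are hypothetical helper names. The details taken from the diff are the `SEER_EVAL_RUN_ID` / `SEER_EVAL_ARM` variables, the valid arms `A`, `B`, `C`, and the `results/<run>/arm-A/<target>-<question_id>-t*.json` cell layout. The file-name pattern for the workspace-scope question (which omits `--target`) is not shown in the diff, so the sketch assumes a target is always given.

```python
from pathlib import Path

# Values taken from the skill text; everything else is a hypothetical helper.
VALID_ARMS = ("A", "B", "C")
RESULTS_ROOT = Path("experiments/scripts_eval/results")


def preflight(env):
    """Return the reasons (if any) the operator should stop and ask."""
    problems = []
    if not env.get("SEER_EVAL_RUN_ID"):
        problems.append("SEER_EVAL_RUN_ID is unset; harness hooks would no-op")
    if env.get("SEER_EVAL_ARM") not in VALID_ARMS:
        problems.append("SEER_EVAL_ARM must be one of A, B, C")
    return problems


def arm_a_cells_ready(run_id, target, question_id, root=RESULTS_ROOT, trials=3):
    """True when arm A already has at least `trials` cell files for this set."""
    arm_a_dir = Path(root) / run_id / "arm-A"
    cells = list(arm_a_dir.glob(f"{target}-{question_id}-t*.json"))
    return len(cells) >= trials
```

Called as `preflight(os.environ)`, an empty list would mean the session matches the documented preconditions. `arm_a_cells_ready` mirrors the `ls ... -t*.json` check that gates arm C.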
@@ -5,6 +5,21 @@ All notable changes to this project will be documented in this file.
5
5
  Format follows [Keep a Changelog](https://keepachangelog.com/). This project
6
6
  adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
+ ## [0.8.0] - 2026-05-16
9
+
10
+ ### Added
11
+
12
+ - scripts-eval: arm-B (directed-use) — rider explicitly instructs the subagent to use repo-map + code-lookup scripts. Cells captured under `results/<run>/arm-B/`. `corpus.yaml` arms field becomes `[A, B, C]`.
13
+ - scripts-eval: `judge.py` is now pair-aware. `judge prepare` / `judge record` take `--pair AC|AB|BC`; new label flag `--blind-label-for-b`. Verdicts land under `cell["judges"][pair]`; the AC pair still mirrors to legacy `cell["judge"]` for back-compat. New `iter_jobs_pair` / `record_verdict_pair` public APIs; old `iter_jobs` / `record_verdict` stay as AC-pair wrappers.
14
+ - scripts-eval: summarize.py renders both A-vs-B and A-vs-C winner tallies in the run-state table, with per-pair verdict tables in the evidence section.
15
+ - eval skill: arm-A rider tightened to also forbid the code-lookup skill / seer.lookup / seer grep/recent/classify so "without" means without both new skills. Added arm-B procedure section and pair-aware judge procedure (--pair AB | AC).
16
+
17
+ ### Changed
18
+
19
+ - eval skill: switch-arm.sh no longer moves .claude/skills/repo-map/ on disk; arm-A relies on the verbal rider alone (rider proved sufficient; move-aside dance made operator setup brittle).
20
+ - scripts-eval: report.py violation patterns now also flag code-lookup script use; report aggregates median validation across all three arms; per-cell view shows every captured arm.
21
+ - scripts-eval: validate.py / backfill.py iterate over _io.ARMS instead of hardcoding (A, C).
22
+
8
23
  ## [0.7.1] - 2026-05-16
9
24
 
10
25
  ### Changed
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: code-lens-cli
3
- Version: 0.7.1
3
+ Version: 0.8.0
4
4
  Summary: seer — codebase lookup and indexing for agent skills (greenfield AgentCulture sibling).
5
5
  Project-URL: Homepage, https://github.com/agentculture/seer-cli
6
6
  Project-URL: Issues, https://github.com/agentculture/seer-cli/issues
@@ -0,0 +1,74 @@
1
+ # scripts-eval Round 02 — 2026-05-16
2
+
3
+ Run id: `2026-05-16-round-02`
4
+ Corpus: `experiments/scripts_eval/corpus.yaml` (corpus_version: 1)
5
+ Trials per cell: 3
6
+ Arms: A (verbal rider bans both seer skills), B (verbal rider
7
+ directs use of seer skills), C (verbal rider permits but doesn't
8
+ direct). Judge pairs: A-vs-B + A-vs-C.
9
+
10
+ This file is the round's **evidence accumulator**. The two
11
+ `<!-- runstate:... -->` / `<!-- evidence:... -->` regions below are
12
+ rewritten by `experiments/scripts_eval/summarize.py` at the end of every
13
+ session — operator never edits them by hand.
14
+
15
+ Raw per-cell JSONs stay gitignored under
16
+ `experiments/scripts_eval/results/2026-05-16-round-02/`. This file is
17
+ the committed evidence.
18
+
19
+ ## Procedure
20
+
21
+ The operator procedure is locked in the **`eval` skill** —
22
+ `.claude/skills/eval/SKILL.md`. Read or invoke that skill from a fresh
23
+ session; this file does not duplicate it.
24
+
25
+ Round 02 is the first run under the post-move-aside skill: all three
26
+ arms run with `.claude/skills/repo-map/` and
27
+ `.claude/skills/code-lookup/` present on disk; arm A's verbal rider in
28
+ the dispatched prompt is the sole guard against the subagent using
29
+ either skill. It is also the first run under the pair-aware judge
30
+ (`judge prepare --pair AB|AC`), so the run-state table carries A-vs-B
31
+ winners alongside A-vs-C.
32
+
33
+ ## Per-session preflight (your shell, *before* `claude`)
34
+
35
+ ```bash
36
+ # arm-A session (banned):
37
+ export SEER_EVAL_RUN_ID=2026-05-16-round-02 SEER_EVAL_ARM=A
38
+
39
+ # arm-B session (directed):
40
+ export SEER_EVAL_RUN_ID=2026-05-16-round-02 SEER_EVAL_ARM=B
41
+
42
+ # arm-C session (organic; same run id, different arm):
43
+ export SEER_EVAL_RUN_ID=2026-05-16-round-02 SEER_EVAL_ARM=C
44
+ ```
45
+
46
+ Then launch `claude` from the seer-cli repo root.
47
+
48
+ ## Run state
49
+
50
+ <!-- runstate:start -->
51
+
52
+ | target | question | arm-A | arm-B | arm-C | A-vs-B judged | A-vs-B (A/B/tie) | A-vs-C judged | A-vs-C (A/C/tie) |
53
+ |---|---|---|---|---|---|---|---|---|
54
+ | culture | q-profile-overview | 3/3 | 3/3 | 0/3 | 3/3 | 2/1/0 | 0/3 | 0/0/0 |
55
+
56
+ <!-- runstate:end -->
57
+
58
+ ## Evidence per set
59
+
60
+ <!-- evidence:start -->
61
+
62
+ ### culture / q-profile-overview
63
+
64
+ **AB** winners: A=2, B=1, tie=0 (of 3 judged).
65
+
66
+ | trial | winner | margin | A duration | B duration | A tools | B tools | judge reasoning |
67
+ |---|---|---|---|---|---|---|---|
68
+ | 1 | A | slight | 136.7s | 47.2s | Bash:26, Read:8 | Bash:4, Read:2, Skill:2 | Y is more accurate about the agentirc shim relationship and gives a concrete runnable test command with coverage details, while X invents some specifics (e.g. publish.yml trigger) and lists skills not matching Y's set; both are solid but Y is tighter and more actionable. |
69
+ | 2 | A | slight | 78.7s | 66.8s | Bash:27, Read:8 | Bash:4, Read:2, Skill:1 | Y is more accurate and actionable with concrete test commands, CI workflow names, and architectural choices, while X includes some questionable specifics (e.g., exact 90% coverage floor, v12.1.7 release details) that read as potentially fabricated. |
70
+ | 3 | B | slight | 74.6s | 42.5s | Bash:20, Read:7 | Bash:5, Skill:2 | Y adds concrete dependency list, release state, and vendored skills inventory that aid actionability, while X has a slightly cleaner structure but less verifiable detail; Y's minor duplication (afi-cli listed twice) is a small blemish. |
71
+
72
+ *No AC verdicts yet.*
73
+
74
+ <!-- evidence:end -->
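The round file above relies on two behaviours: verdicts are stored per pair under `cell["judges"][pair]` (with the AC pair mirrored to the legacy `cell["judge"]` key), and per-pair winners are tallied into lines such as `A=2, B=1, tie=0 (of 3 judged)`. The sketch below illustrates that data flow only. It is not the code in `judge.py` or `summarize.py`: the helper names `store_verdict` and `tally_winners` and the verdict fields `winner` and `margin` are assumptions made for illustration.

```python
from collections import Counter

PAIRS = ("AB", "AC", "BC")  # pair names as accepted by the --pair flag


def store_verdict(cell: dict, pair: str, verdict: dict) -> None:
    """Hypothetical helper: file a verdict under cell["judges"][pair]."""
    if pair not in PAIRS:
        raise ValueError(f"pair must be one of {PAIRS} (got {pair!r})")
    cell.setdefault("judges", {})[pair] = verdict
    if pair == "AC":
        # Mirror the AC verdict to the legacy key for back-compat.
        cell["judge"] = verdict


def tally_winners(cells: list[dict], pair: str) -> Counter:
    """Count winners for one pair across trial cells; unjudged cells are skipped."""
    tally = Counter()
    for cell in cells:
        verdict = cell.get("judges", {}).get(pair)
        if verdict is not None:
            tally[verdict["winner"]] += 1
    return tally


# Three trials of one (target, question) set, judged for the AB pair only.
cells = [{}, {}, {}]
for cell, winner in zip(cells, ["A", "A", "B"]):
    store_verdict(cell, "AB", {"winner": winner, "margin": "slight"})

ab = tally_winners(cells, "AB")
print(f"A={ab['A']}, B={ab['B']}, tie={ab['tie']} (of {sum(ab.values())} judged)")
```

Keeping one verdict per pair lets the A-vs-B and A-vs-C questions be answered independently from the same trial cells, while readers of the legacy key still see the AC result.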
@@ -17,7 +17,7 @@ explicitly — these copies do not auto-update.
17
17
  | `version-bump` | `steward` (`../steward/.claude/skills/version-bump/`) | 2026-05-15 | None — portable verbatim. Pure Python, no per-repo customization. seer-cli's `CHANGELOG.md` keeps a `# Changelog` + intro-prose header, so the first `## [` entry is a valid insertion point for the upstream script. |
18
18
  | `repo-map` | _internal implementation_ — seer-cli origin | 2026-05-15 | **Runtime:** thin shell wrappers under `.claude/skills/repo-map/scripts/{profile,connections,graph}.sh` that invoke `uv run --directory <repo-root> python -m seer.repo <verb>`. Engine lives in `seer/repo/` in this repo. **Divergence:** N/A — this skill is original to seer-cli, not vendored from steward. If/when promoted upstream, this row flips to a `Re-vendor from steward` pointer. |
19
19
  | `code-lookup` | _internal implementation_ — seer-cli origin | 2026-05-16 | **Runtime:** thin shell wrapper under `.claude/skills/code-lookup/scripts/classify.sh` that invokes `uv run --directory <repo-root> python -m seer classify`. Engine lives in `seer/lookup/` in this repo. **Divergence:** N/A — original to seer-cli, sibling of `repo-map`. If/when promoted upstream, this row flips to a `Re-vendor from steward` pointer. |
20
- | `eval` | _internal implementation_ — seer-cli origin | 2026-05-15 | **Runtime:** locked operator procedure for one scripts-eval set (one `(target, question)` row × 3 trials × one arm). Backing CLIs live in `experiments/scripts_eval/` (`capture`, `validate`, `judge`, `summarize`, `manifest`); hooks live in `experiments/scripts_eval/hooks/`. Round-agnostic — reads `$SEER_EVAL_RUN_ID` and writes the round's accumulator at `docs/eval-rounds/$SEER_EVAL_RUN_ID.md`. **Divergence:** N/A — original to seer-cli (the harness only exists here). If/when promoted upstream, this row flips to a `Re-vendor from steward` pointer. |
20
+ | `eval` | _internal implementation_ — seer-cli origin | 2026-05-16 | **Runtime:** locked operator procedure for one scripts-eval set (one `(target, question)` row × 3 trials × one arm). Three arms — **A** (banned), **B** (directed), **C** (organic) — and two judge pairs — **A-vs-B** ("do the skills help when used?") and **A-vs-C** ("do the skills get adopted organically?"). Backing CLIs live in `experiments/scripts_eval/` (`trial`, `validate`, `judge`, `summarize`, `manifest`); hooks live in `experiments/scripts_eval/hooks/`. `judge prepare --pair AB\|AC` selects the pair; verdicts land under `cell["judges"][pair]` (the AC pair also mirrors to legacy `cell["judge"]`). Round-agnostic — reads `$SEER_EVAL_RUN_ID` and writes the round's accumulator at `docs/eval-rounds/$SEER_EVAL_RUN_ID.md`. **Divergence:** N/A — original to seer-cli (the harness only exists here). If/when promoted upstream, this row flips to a `Re-vendor from steward` pointer. |
21
21
 
22
22
  ## Vendoring policy
23
23
 
@@ -25,13 +25,16 @@ def eval_run_id() -> str | None:
25
25
  return val if val else None
26
26
 
27
27
 
28
+ ARMS = ("A", "B", "C")
29
+
30
+
28
31
  def eval_arm() -> str | None:
29
- """Return SEER_EVAL_ARM (must be 'A' or 'C'), or None if unset."""
32
+ """Return SEER_EVAL_ARM (must be 'A', 'B', or 'C'), or None if unset."""
30
33
  val = os.environ.get("SEER_EVAL_ARM")
31
34
  if val is None or val == "":
32
35
  return None
33
- if val not in ("A", "C"):
34
- raise ValueError(f"SEER_EVAL_ARM must be 'A' or 'C' (got {val!r})")
36
+ if val not in ARMS:
37
+ raise ValueError(f"SEER_EVAL_ARM must be one of {ARMS} (got {val!r})")
35
38
  return val
36
39
 
37
40
 
@@ -327,7 +327,7 @@ def main(argv: list[str] | None = None) -> int:
327
327
  return 1
328
328
 
329
329
  total = 0
330
- for arm in ("A", "C"):
330
+ for arm in _io.ARMS:
331
331
  arm_dir = rd / f"arm-{arm}"
332
332
  if not arm_dir.exists():
333
333
  continue
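The two hunks above replace a hardcoded `("A", "C")` with a shared `ARMS` tuple, used both to validate `SEER_EVAL_ARM` and to iterate over `arm-<X>` directories. A minimal standalone sketch of that pattern follows. The `env` parameter and the `existing_arm_dirs` helper are assumptions added so the example runs without touching the real environment or results tree; the package's own `eval_arm()` reads `os.environ` directly.

```python
import os
import tempfile
from pathlib import Path

ARMS = ("A", "B", "C")  # mirrors the tuple added to _io.py


def eval_arm(env=None):
    """Return SEER_EVAL_ARM if valid, or None if unset or empty."""
    env = os.environ if env is None else env  # `env` is an illustrative assumption
    val = env.get("SEER_EVAL_ARM")
    if val is None or val == "":
        return None
    if val not in ARMS:
        raise ValueError(f"SEER_EVAL_ARM must be one of {ARMS} (got {val!r})")
    return val


def existing_arm_dirs(run_dir: Path) -> list[Path]:
    """Hypothetical helper: the arm-<X> directories present under a run dir."""
    candidates = (run_dir / f"arm-{arm}" for arm in ARMS)
    return [path for path in candidates if path.exists()]


# Simulate a run directory where only arms A and B have been captured so far.
with tempfile.TemporaryDirectory() as tmp:
    rd = Path(tmp)
    (rd / "arm-A").mkdir()
    (rd / "arm-B").mkdir()
    found = [p.name for p in existing_arm_dirs(rd)]

print(found)
print(eval_arm({"SEER_EVAL_ARM": "B"}))
```

Driving both the validation and the directory loop from one tuple means a future arm only needs to be added in a single place.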
@@ -2,7 +2,7 @@ corpus_version: 1
2
2
 
3
3
  config:
4
4
  trials_per_cell: 3
5
- arms: [A, C]
5
+ arms: [A, B, C]
6
6
  workspace_root: /home/spark/git
7
7
 
8
8
  targets: