touchstone-eval 0.1.0__tar.gz

Files changed (205)
  1. touchstone_eval-0.1.0/.github/workflows/publish.yml +59 -0
  2. touchstone_eval-0.1.0/.gitignore +29 -0
  3. touchstone_eval-0.1.0/CONTEXT.md +184 -0
  4. touchstone_eval-0.1.0/LICENSE +21 -0
  5. touchstone_eval-0.1.0/PKG-INFO +343 -0
  6. touchstone_eval-0.1.0/README.md +310 -0
  7. touchstone_eval-0.1.0/acp_agents.yaml.example +27 -0
  8. touchstone_eval-0.1.0/cutoffs.yaml.example +20 -0
  9. touchstone_eval-0.1.0/docs/BENCHMARK.md +84 -0
  10. touchstone_eval-0.1.0/docs/adr/0001-responder-mediated-interaction.md +35 -0
  11. touchstone_eval-0.1.0/docs/adr/0002-parallel-safe-store-and-isolation.md +35 -0
  12. touchstone_eval-0.1.0/docs/adr/0003-acp-as-single-rich-adapter.md +35 -0
  13. touchstone_eval-0.1.0/docs/adr/0004-per-cell-environment.md +48 -0
  14. touchstone_eval-0.1.0/docs/adr/0005-pluggable-provisioning-and-executor.md +87 -0
  15. touchstone_eval-0.1.0/docs/adr/0006-native-stream-json-claude-adapter.md +74 -0
  16. touchstone_eval-0.1.0/docs/adr/0007-fixtures-repo-source-and-hidden.md +91 -0
  17. touchstone_eval-0.1.0/docs/adr/0008-reachability-and-availability-policy.md +115 -0
  18. touchstone_eval-0.1.0/docs/plans/0001-observability-and-interaction.md +97 -0
  19. touchstone_eval-0.1.0/docs/plans/0002-harder-diverse-battery.md +332 -0
  20. touchstone_eval-0.1.0/docs/plans/0003-reachability-and-fallback.md +113 -0
  21. touchstone_eval-0.1.0/evals/cron-droid/case.yaml +52 -0
  22. touchstone_eval-0.1.0/evals/csv-droid/case.yaml +49 -0
  23. touchstone_eval-0.1.0/evals/diff-droid/case.yaml +51 -0
  24. touchstone_eval-0.1.0/evals/dummy-droid/case.yaml +33 -0
  25. touchstone_eval-0.1.0/evals/example-case/case.yaml +42 -0
  26. touchstone_eval-0.1.0/evals/example-case/graders/rubric.md +11 -0
  27. touchstone_eval-0.1.0/evals/example-case/source/client.py +15 -0
  28. touchstone_eval-0.1.0/evals/glob-droid/case.yaml +49 -0
  29. touchstone_eval-0.1.0/evals/humanize-bytes-droid/case.yaml +47 -0
  30. touchstone_eval-0.1.0/evals/json-droid/case.yaml +50 -0
  31. touchstone_eval-0.1.0/evals/markdown-droid/case.yaml +57 -0
  32. touchstone_eval-0.1.0/evals/numwords-droid/case.yaml +47 -0
  33. touchstone_eval-0.1.0/evals/observed-droid/case.yaml +53 -0
  34. touchstone_eval-0.1.0/evals/observed-droid/graders/rubric.md +10 -0
  35. touchstone_eval-0.1.0/evals/pluralize-droid/case.yaml +47 -0
  36. touchstone_eval-0.1.0/evals/realistic-droid/case.yaml +47 -0
  37. touchstone_eval-0.1.0/evals/regex-droid/case.yaml +52 -0
  38. touchstone_eval-0.1.0/evals/repo-bucketize-droid/case.yaml +66 -0
  39. touchstone_eval-0.1.0/evals/repo-chunkedeven-droid/case.yaml +52 -0
  40. touchstone_eval-0.1.0/evals/repo-codex-tui-docs/case.yaml +81 -0
  41. touchstone_eval-0.1.0/evals/repo-codex-tui-docs/graders/rubric.md +22 -0
  42. touchstone_eval-0.1.0/evals/repo-collapse-droid/case.yaml +53 -0
  43. touchstone_eval-0.1.0/evals/repo-droid/case.yaml +54 -0
  44. touchstone_eval-0.1.0/evals/repo-flowforge-bug-droid/case.yaml +74 -0
  45. touchstone_eval-0.1.0/evals/repo-funcy-chunks-droid/case.yaml +59 -0
  46. touchstone_eval-0.1.0/evals/repo-java-camelcase-droid/case.yaml +100 -0
  47. touchstone_eval-0.1.0/evals/repo-js-bytes-droid/case.yaml +81 -0
  48. touchstone_eval-0.1.0/evals/repo-js-camelcase-droid/case.yaml +81 -0
  49. touchstone_eval-0.1.0/evals/repo-js-kindof-wf/case.yaml +85 -0
  50. touchstone_eval-0.1.0/evals/repo-js-prettybytes-wf/case.yaml +90 -0
  51. touchstone_eval-0.1.0/evals/repo-js-wordwrap-droid/case.yaml +73 -0
  52. touchstone_eval-0.1.0/evals/repo-js-wordwrap-frugal-droid/case.yaml +77 -0
  53. touchstone_eval-0.1.0/evals/repo-mergewith-droid/case.yaml +53 -0
  54. touchstone_eval-0.1.0/evals/repo-mutated-filename-droid/case.yaml +88 -0
  55. touchstone_eval-0.1.0/evals/repo-parameterize-droid/case.yaml +54 -0
  56. touchstone_eval-0.1.0/evals/repo-py-chunkedeven-wf/case.yaml +83 -0
  57. touchstone_eval-0.1.0/evals/repo-py-chunkwindow-wf/case.yaml +101 -0
  58. touchstone_eval-0.1.0/evals/repo-py-easter-wf/case.yaml +94 -0
  59. touchstone_eval-0.1.0/evals/repo-py-formatintlist-wf/case.yaml +80 -0
  60. touchstone_eval-0.1.0/evals/repo-py-inflectioncase-wf/case.yaml +94 -0
  61. touchstone_eval-0.1.0/evals/repo-py-inflectionplural-wf/case.yaml +92 -0
  62. touchstone_eval-0.1.0/evals/repo-py-intword-wf/case.yaml +87 -0
  63. touchstone_eval-0.1.0/evals/repo-py-naturalsize-wf/case.yaml +96 -0
  64. touchstone_eval-0.1.0/evals/repo-py-ordinal-wf/case.yaml +90 -0
  65. touchstone_eval-0.1.0/evals/repo-py-pathsplit-wf/case.yaml +86 -0
  66. touchstone_eval-0.1.0/evals/repo-py-semvercompare-wf/case.yaml +92 -0
  67. touchstone_eval-0.1.0/evals/repo-py-slugify-fullsuite-wf/case.yaml +74 -0
  68. touchstone_eval-0.1.0/evals/repo-py-splitfamily-wf/case.yaml +104 -0
  69. touchstone_eval-0.1.0/evals/repo-py-splitnl-wf/case.yaml +89 -0
  70. touchstone_eval-0.1.0/evals/repo-py-striptags-wf/case.yaml +94 -0
  71. touchstone_eval-0.1.0/evals/repo-py-truncate-debug-droid/case.yaml +87 -0
  72. touchstone_eval-0.1.0/evals/repo-py-windowed-droid/case.yaml +86 -0
  73. touchstone_eval-0.1.0/evals/repo-scheduler-bug-droid/case.yaml +73 -0
  74. touchstone_eval-0.1.0/evals/repo-securefilename-droid/case.yaml +73 -0
  75. touchstone_eval-0.1.0/evals/repo-smarttruncate-droid/case.yaml +73 -0
  76. touchstone_eval-0.1.0/evals/repo-splitinto-droid/case.yaml +52 -0
  77. touchstone_eval-0.1.0/evals/repo-swebench-afero-577/case.yaml +206 -0
  78. touchstone_eval-0.1.0/evals/repo-swebench-anyio-1189/case.yaml +113 -0
  79. touchstone_eval-0.1.0/evals/repo-swebench-astropy-13453/case.yaml +239 -0
  80. touchstone_eval-0.1.0/evals/repo-swebench-chi-1085/case.yaml +107 -0
  81. touchstone_eval-0.1.0/evals/repo-swebench-chi-1097/case.yaml +90 -0
  82. touchstone_eval-0.1.0/evals/repo-swebench-chrono-1798/case.yaml +107 -0
  83. touchstone_eval-0.1.0/evals/repo-swebench-clap-6276/case.yaml +208 -0
  84. touchstone_eval-0.1.0/evals/repo-swebench-click-3434/case.yaml +140 -0
  85. touchstone_eval-0.1.0/evals/repo-swebench-click-3493/case.yaml +122 -0
  86. touchstone_eval-0.1.0/evals/repo-swebench-cobra-2356/case.yaml +128 -0
  87. touchstone_eval-0.1.0/evals/repo-swebench-commonscoll-693/case.yaml +129 -0
  88. touchstone_eval-0.1.0/evals/repo-swebench-commonslang-1713/case.yaml +123 -0
  89. touchstone_eval-0.1.0/evals/repo-swebench-commonslang-1717/case.yaml +121 -0
  90. touchstone_eval-0.1.0/evals/repo-swebench-commonslang-1729/case.yaml +117 -0
  91. touchstone_eval-0.1.0/evals/repo-swebench-commonstext-748/case.yaml +117 -0
  92. touchstone_eval-0.1.0/evals/repo-swebench-express-7181/case.yaml +130 -0
  93. touchstone_eval-0.1.0/evals/repo-swebench-flask-5014/case.yaml +146 -0
  94. touchstone_eval-0.1.0/evals/repo-swebench-flask-5917/case.yaml +140 -0
  95. touchstone_eval-0.1.0/evals/repo-swebench-gson-3034/case.yaml +123 -0
  96. touchstone_eval-0.1.0/evals/repo-swebench-ky-861/case.yaml +142 -0
  97. touchstone_eval-0.1.0/evals/repo-swebench-pylint-4551/case.yaml +105 -0
  98. touchstone_eval-0.1.0/evals/repo-swebench-pylint-6528/case.yaml +337 -0
  99. touchstone_eval-0.1.0/evals/repo-swebench-pylint-7080/case.yaml +466 -0
  100. touchstone_eval-0.1.0/evals/repo-swebench-pylint-8898/case.yaml +184 -0
  101. touchstone_eval-0.1.0/evals/repo-swebench-pytest-10356/case.yaml +229 -0
  102. touchstone_eval-0.1.0/evals/repo-swebench-pytest-5787/case.yaml +350 -0
  103. touchstone_eval-0.1.0/evals/repo-swebench-pytest-5840/case.yaml +150 -0
  104. touchstone_eval-0.1.0/evals/repo-swebench-pytest-6197/case.yaml +286 -0
  105. touchstone_eval-0.1.0/evals/repo-swebench-pytest-7236/case.yaml +218 -0
  106. touchstone_eval-0.1.0/evals/repo-swebench-requests-7309/case.yaml +142 -0
  107. touchstone_eval-0.1.0/evals/repo-swebench-requests-7315/case.yaml +144 -0
  108. touchstone_eval-0.1.0/evals/repo-swebench-sklearn-14053/case.yaml +150 -0
  109. touchstone_eval-0.1.0/evals/repo-swebench-sphinx-10466/case.yaml +245 -0
  110. touchstone_eval-0.1.0/evals/repo-swebench-sphinx-11510/case.yaml +225 -0
  111. touchstone_eval-0.1.0/evals/repo-swebench-sphinx-7590/case.yaml +150 -0
  112. touchstone_eval-0.1.0/evals/repo-swebench-sphinx-8035/case.yaml +116 -0
  113. touchstone_eval-0.1.0/evals/repo-swebench-sphinx-8548/case.yaml +116 -0
  114. touchstone_eval-0.1.0/evals/repo-swebench-sphinx-8551/case.yaml +199 -0
  115. touchstone_eval-0.1.0/evals/repo-swebench-sphinx-9229/case.yaml +215 -0
  116. touchstone_eval-0.1.0/evals/repo-swebench-sphinx-9461/case.yaml +234 -0
  117. touchstone_eval-0.1.0/evals/repo-swebench-testify-1877/case.yaml +135 -0
  118. touchstone_eval-0.1.0/evals/repo-swebench-testify-1888/case.yaml +121 -0
  119. touchstone_eval-0.1.0/evals/repo-swebench-time-782/case.yaml +149 -0
  120. touchstone_eval-0.1.0/evals/repo-swebench-validator-2693/case.yaml +130 -0
  121. touchstone_eval-0.1.0/evals/repo-swebench-validator-2774/case.yaml +130 -0
  122. touchstone_eval-0.1.0/evals/repo-swebench-werkzeug-3129/case.yaml +137 -0
  123. touchstone_eval-0.1.0/evals/repo-swebench-werkzeug-3147/case.yaml +145 -0
  124. touchstone_eval-0.1.0/evals/repo-swebench-xarray-3677/case.yaml +143 -0
  125. touchstone_eval-0.1.0/evals/repo-windowed-droid/case.yaml +52 -0
  126. touchstone_eval-0.1.0/evals/roman-droid/case.yaml +47 -0
  127. touchstone_eval-0.1.0/evals/scored-droid/case.yaml +48 -0
  128. touchstone_eval-0.1.0/evals/titlecase-droid/case.yaml +47 -0
  129. touchstone_eval-0.1.0/evals/toposort-droid/case.yaml +51 -0
  130. touchstone_eval-0.1.0/evals-private/README.md +46 -0
  131. touchstone_eval-0.1.0/evals-private/example-private-case/case.yaml +34 -0
  132. touchstone_eval-0.1.0/evals-private/example-private-case/graders/rubric.md +12 -0
  133. touchstone_eval-0.1.0/harnesses.yaml.example +13 -0
  134. touchstone_eval-0.1.0/pyproject.toml +57 -0
  135. touchstone_eval-0.1.0/src/touchstone/__init__.py +3 -0
  136. touchstone_eval-0.1.0/src/touchstone/artifacts.py +51 -0
  137. touchstone_eval-0.1.0/src/touchstone/cli.py +212 -0
  138. touchstone_eval-0.1.0/src/touchstone/concurrency.py +34 -0
  139. touchstone_eval-0.1.0/src/touchstone/config.py +431 -0
  140. touchstone_eval-0.1.0/src/touchstone/environment.py +126 -0
  141. touchstone_eval-0.1.0/src/touchstone/executor.py +162 -0
  142. touchstone_eval-0.1.0/src/touchstone/export/__init__.py +5 -0
  143. touchstone_eval-0.1.0/src/touchstone/export/langfuse.py +102 -0
  144. touchstone_eval-0.1.0/src/touchstone/fixtures.py +53 -0
  145. touchstone_eval-0.1.0/src/touchstone/grader/__init__.py +6 -0
  146. touchstone_eval-0.1.0/src/touchstone/grader/base.py +55 -0
  147. touchstone_eval-0.1.0/src/touchstone/grader/command.py +40 -0
  148. touchstone_eval-0.1.0/src/touchstone/grader/efficiency.py +63 -0
  149. touchstone_eval-0.1.0/src/touchstone/grader/files.py +71 -0
  150. touchstone_eval-0.1.0/src/touchstone/grader/implemented.py +42 -0
  151. touchstone_eval-0.1.0/src/touchstone/grader/model_judge.py +149 -0
  152. touchstone_eval-0.1.0/src/touchstone/grader/pytest_runner.py +209 -0
  153. touchstone_eval-0.1.0/src/touchstone/grader/registry.py +40 -0
  154. touchstone_eval-0.1.0/src/touchstone/grader/swebench.py +101 -0
  155. touchstone_eval-0.1.0/src/touchstone/grader/trace.py +116 -0
  156. touchstone_eval-0.1.0/src/touchstone/harness/__init__.py +7 -0
  157. touchstone_eval-0.1.0/src/touchstone/harness/acp.py +492 -0
  158. touchstone_eval-0.1.0/src/touchstone/harness/base.py +90 -0
  159. touchstone_eval-0.1.0/src/touchstone/harness/claude_code.py +85 -0
  160. touchstone_eval-0.1.0/src/touchstone/harness/claude_stream.py +158 -0
  161. touchstone_eval-0.1.0/src/touchstone/harness/cli_agent.py +97 -0
  162. touchstone_eval-0.1.0/src/touchstone/harness/echo.py +30 -0
  163. touchstone_eval-0.1.0/src/touchstone/harness/registry.py +78 -0
  164. touchstone_eval-0.1.0/src/touchstone/interaction/__init__.py +11 -0
  165. touchstone_eval-0.1.0/src/touchstone/interaction/base.py +86 -0
  166. touchstone_eval-0.1.0/src/touchstone/interaction/policies.py +104 -0
  167. touchstone_eval-0.1.0/src/touchstone/interaction/registry.py +40 -0
  168. touchstone_eval-0.1.0/src/touchstone/interaction/responder.py +98 -0
  169. touchstone_eval-0.1.0/src/touchstone/metrics.py +125 -0
  170. touchstone_eval-0.1.0/src/touchstone/reachability.py +170 -0
  171. touchstone_eval-0.1.0/src/touchstone/report.py +433 -0
  172. touchstone_eval-0.1.0/src/touchstone/runner.py +333 -0
  173. touchstone_eval-0.1.0/src/touchstone/sandbox.py +121 -0
  174. touchstone_eval-0.1.0/src/touchstone/setup.py +58 -0
  175. touchstone_eval-0.1.0/src/touchstone/store.py +170 -0
  176. touchstone_eval-0.1.0/src/touchstone/trace.py +189 -0
  177. touchstone_eval-0.1.0/tests/conftest.py +17 -0
  178. touchstone_eval-0.1.0/tests/fake_acp_agent.py +82 -0
  179. touchstone_eval-0.1.0/tests/test_acp.py +104 -0
  180. touchstone_eval-0.1.0/tests/test_acp_profiles.py +32 -0
  181. touchstone_eval-0.1.0/tests/test_claude_stream.py +106 -0
  182. touchstone_eval-0.1.0/tests/test_config.py +52 -0
  183. touchstone_eval-0.1.0/tests/test_container.py +137 -0
  184. touchstone_eval-0.1.0/tests/test_efficiency_grader.py +64 -0
  185. touchstone_eval-0.1.0/tests/test_environment.py +154 -0
  186. touchstone_eval-0.1.0/tests/test_graders.py +65 -0
  187. touchstone_eval-0.1.0/tests/test_implemented.py +34 -0
  188. touchstone_eval-0.1.0/tests/test_integration.py +85 -0
  189. touchstone_eval-0.1.0/tests/test_interaction.py +68 -0
  190. touchstone_eval-0.1.0/tests/test_judge_jury.py +20 -0
  191. touchstone_eval-0.1.0/tests/test_langfuse.py +60 -0
  192. touchstone_eval-0.1.0/tests/test_metrics.py +103 -0
  193. touchstone_eval-0.1.0/tests/test_multiturn.py +51 -0
  194. touchstone_eval-0.1.0/tests/test_observe.py +86 -0
  195. touchstone_eval-0.1.0/tests/test_parallel_isolation.py +125 -0
  196. touchstone_eval-0.1.0/tests/test_pytest_grader.py +222 -0
  197. touchstone_eval-0.1.0/tests/test_reachability.py +115 -0
  198. touchstone_eval-0.1.0/tests/test_reachability_runner.py +102 -0
  199. touchstone_eval-0.1.0/tests/test_report_caveats.py +78 -0
  200. touchstone_eval-0.1.0/tests/test_setup.py +47 -0
  201. touchstone_eval-0.1.0/tests/test_store.py +53 -0
  202. touchstone_eval-0.1.0/tests/test_swebench_grader.py +101 -0
  203. touchstone_eval-0.1.0/tests/test_trace.py +52 -0
  204. touchstone_eval-0.1.0/tests/test_trace_grader.py +105 -0
  205. touchstone_eval-0.1.0/uv.lock +956 -0
@@ -0,0 +1,59 @@
+ name: Publish to PyPI
+
+ # Publishes touchstone-eval to PyPI when a version tag is pushed, e.g.:
+ # git tag v0.1.0 && git push origin v0.1.0
+ #
+ # Uses PyPI Trusted Publishing (OIDC) — no API token / password stored as a secret.
+ # One-time setup on PyPI: project Settings -> Publishing -> add a GitHub publisher with
+ # owner: krimvp repo: touchstone workflow: publish.yml environment: pypi
+ # (Create the "pypi" environment under the repo's Settings -> Environments first, or use
+ # PyPI's "pending publisher" flow to register it before the project's first release.)
+
+ on:
+   push:
+     tags:
+       - "v*"
+
+ permissions:
+   contents: read
+
+ jobs:
+   build:
+     name: Build sdist + wheel
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v5
+
+       - name: Build distributions
+         run: uv build
+
+       - name: Check metadata
+         run: uvx twine check dist/*
+
+       - name: Upload artifacts
+         uses: actions/upload-artifact@v4
+         with:
+           name: dist
+           path: dist/
+
+   publish:
+     name: Publish to PyPI
+     needs: build
+     runs-on: ubuntu-latest
+     environment:
+       name: pypi
+       url: https://pypi.org/p/touchstone-eval
+     permissions:
+       id-token: write # required for Trusted Publishing (OIDC)
+     steps:
+       - name: Download built distributions
+         uses: actions/download-artifact@v4
+         with:
+           name: dist
+           path: dist/
+
+       - name: Publish
+         uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,29 @@
+ # Run outputs (intermediate + final results) — large and machine-specific.
+ runs/
+
+ # Transient run logs
+ *.log
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.egg-info/
+ .eggs/
+ build/
+ dist/
+ .venv/
+ venv/
+ .pytest_cache/
+ .ruff_cache/
+
+ # Env / secrets
+ .env
+
+ # Private held-out eval set — your OWN usecases, never committed (contamination-proof
+ # tiebreaker; Lessons 3 & 6). Run with: touchstone --evals-dir evals-private run
+ # The README + example template stay tracked; real private cases are ignored.
+ evals-private/*
+ !evals-private/.gitignore
+ !evals-private/README.md
+ !evals-private/example-private-case
+ !evals-private/example-private-case/**
@@ -0,0 +1,184 @@
+ # touchstone
+
+ A personal eval benchmark for deciding which model works best for the user's own
+ usecases. This glossary fixes the language the framework and its docs use.
+
+ ## Language
+
+ ### Benchmark structure
+
+ **Case**:
+ One eval — a task plus the input, artifacts, graders, and expectations needed to judge it.
+ _Avoid_: test, suite, scenario.
+
+ **Run**:
+ A single execution of the benchmark that expands a matrix and produces a report.
+ _Avoid_: session, job.
+
+ **Cell**:
+ The atomic unit of work and persistence: one `(Case × Harness × Model × Trial)`.
+ _Avoid_: task-run, instance.
+
+ **Trial**:
+ One repeated attempt of the same Cell coordinates, used for consistency / pass@k.
+ _Avoid_: attempt, sample.
+
+ **Grader**:
+ A component that turns a Harness's result into a Score.
+ _Avoid_: judge (reserve "judge" for the model-as-judge grader specifically), scorer, checker.
+
+ **Harness**:
+ The swappable thing that turns a Case's task into an output. Every Harness is an adapter
+ behind one interface.
+ _Avoid_: runner, agent, driver.
+
+ ### Observation & interaction
+
+ **Trace**:
+ The normalized, vendor-neutral event stream captured from a Harness run (messages, tool
+ calls, token usage, etc.). The framework's own schema — never an external protocol's types.
+ _Avoid_: log (reserve for raw stdout/transcript), transcript (that is the raw capture).
+
+ **Trace Event**:
+ One item in a Trace (e.g. a tool call, a token-usage update, a permission request).
+ _Avoid_: span (reserve "span" for the future LangFuse mapping), record.
+
+ **Tracing** (a Harness capability):
+ A Harness's ability to emit a Trace. Opt-in; Harnesses that lack it degrade to output-only.
+ _Avoid_: instrumentation, observability (use as adjectives, not as the capability name).
+
+ **Interaction** (a Harness capability):
+ A Harness's ability to let the framework answer the agent's mid-run requests (tool
+ permission / approval / input). Strictly richer than Tracing.
+ _Avoid_: feedback, callback, HITL.
+
+ **Output-only**:
+ A Harness that exposes neither Tracing nor Interaction — only its final output. Today's
+ default and the universal fallback.
+ _Avoid_: black-box, basic.
+
+ **Tool Kind**:
+ The portable, canonical category of a tool call (`read | write | execute | search |
+ fetch | other`) — the one tool axis that means the same across agents. Distinct from a
+ tool's `raw_name` (verbatim from the agent) and `name` (the Adapter's normalized name).
+ _Avoid_: tool type, category.
+
+ **Interaction Policy**:
+ The per-Case rule that answers *agent-initiated* mid-run requests. One of `auto-approve`,
+ `auto-deny`, `scripted`, `llm-based`, or `manual`.
+ _Avoid_: handler, strategy.
+
+ **Turn**:
+ One *eval-initiated* prompt sent to the agent within a Cell. Distinct from an
+ agent-initiated request (which the Interaction Policy answers) and from a Trial (a repeat
+ of the whole Cell). The first Turn is the Case's task; later Turns are scripted follow-ups.
+ _Avoid_: round, step, message.
+
+ **Conversation**:
+ The ordered Turns of a Case, sent one after another, each dispatched once the agent's
+ previous Turn reaches a `stop`. A single-prompt Case is a one-Turn Conversation.
+ _Avoid_: dialogue, thread, chat.
+
+ **Responder**:
+ The fixed auxiliary LLM that answers the agent's mid-run requests under Case guidelines
+ when the Interaction Policy is `llm-based`. A control variable, held constant across the
+ matrix like the Judge, so the agent stays the only thing being compared.
+ _Avoid_: helper, user-sim, proxy.
+
+ **Judge**:
+ The fixed auxiliary LLM used by the model-as-judge Grader to score output. Like the
+ Responder, a control variable held constant across the matrix.
+ _Avoid_: grader (the Judge is used *by* a Grader, not a synonym for one).
+
+ **Adapter**:
+ A concrete Harness implementation. The **ACP adapter** is the single rich adapter — one
+ implementation driving every Agent-Client-Protocol agent (Claude via `claude-agent-acp`,
+ Codex, Gemini, droid, devin-cli, …) through one event-translation path. The **CLI adapter**
+ is the generic output-only fallback. (A native **Claude Agent SDK adapter** is an optional
+ future no-Node alternative, not part of the core.)
+ _Avoid_: backend, plugin, connector.
+
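Reviewer's sketch (not part of the package): the deterministic Interaction Policies defined above can be pictured roughly as follows. The `Trace` class, the `answer_request` function, the request dictionary shape, and the `"allow"`/`"deny"` decision strings are all assumptions made for illustration. Only the policy names and the `permission_request` / `permission_response` event names come from CONTEXT.md.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Minimal stand-in for a Trace: an ordered list of event dicts (hypothetical)."""
    events: list = field(default_factory=list)

    def record(self, kind, **payload):
        self.events.append({"type": kind, **payload})


def answer_request(policy, request, trace, script=None):
    """Answer one agent-initiated request under a deterministic policy.

    Both the request and its answer are recorded in the trace, whatever the policy.
    """
    trace.record("permission_request", tool=request["tool"])
    if policy == "auto-approve":
        decision = "allow"
    elif policy == "auto-deny":
        decision = "deny"
    elif policy == "scripted":
        # Assumption: scripted answers are keyed by tool; unknown tools are denied.
        decision = (script or {}).get(request["tool"], "deny")
    else:
        # `llm-based` needs a Responder and `manual` needs a human; neither is modeled here.
        raise ValueError(f"policy {policy!r} is not deterministic")
    trace.record("permission_response", tool=request["tool"], decision=decision)
    return decision


trace = Trace()
decision = answer_request("scripted", {"tool": "write"}, trace, script={"write": "allow"})
```

After this call `trace.events` holds the request followed by its response, which is the property the glossary relies on when it says every mid-run exchange is observable.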
+ ### Execution & isolation
+
+ **Sandbox**:
+ The isolated working directory a Cell's Harness operates in, prepared fresh from the Case
+ source. Self-contained: own directory, own subprocess env, never shared between Cells.
+ _Avoid_: workspace, workdir (informal synonyms only).
+
+ **Isolation Mode**:
+ How a Sandbox is created from the source — `copy` (copy a folder), `clone` (git clone at a
+ commit), or `worktree` (git worktree at a commit). Inferred from source type and
+ overridable; `worktree` is opt-in, not the default.
+ _Avoid_: sandbox type, strategy.
+
+ **Environment**:
+ A Cell's own throwaway dependency setup (the *broader Sandbox*), provisioned when a Case
+ declares an `environment` block. A **Provisioner** (selected by `kind`) prepares it:
+ `pip-venv` (default) / `uv` build a per-Cell virtualenv and install declared dependencies
+ (and optionally the Sandbox repo via `install: editable`); `command` runs declared shell
+ commands for ecosystems with project-local deps (`npm ci`, `cargo fetch`). Every subprocess
+ the Cell spawns (Harness, setup, `command`/`tests`/`pytest` Graders) runs under it via an
+ explicit env, never a shared global interpreter. Absent the block, the Cell uses the host.
+ _Avoid_: virtualenv (one Provisioner's mechanism, not the concept), image, container (no
+ OS-level isolation *yet* — see Executor).
+
+ **Provisioner**:
+ The strategy that prepares a Cell's Environment, selected by `environment.kind` (`pip-venv`,
+ `uv`, `command`). Mirrors Isolation Mode for the Sandbox: one declarative knob, multiple
+ backends, one contract (return the subprocess `env`, or `None` for host).
+ _Avoid_: installer, builder.
+
+ **Executor**:
+ Where a Cell's non-Harness commands run, behind one `run(argv, cwd, env)` + `create_venv`
+ interface. `LocalExecutor` runs host subprocesses; `ContainerExecutor` runs them via
+ `docker exec` in a container with the cell bind-mounted (selected by a Case's `container`
+ block), bringing OS-level isolation and OS packages. Provisioning, `setup.run`, and the
+ `command`/`tests`/`pytest` Graders run through the Cell's Executor; the Harness still runs on
+ the host (ADR 0005). The provisioner recipes are written once and run under either backend.
+ _Avoid_: runner (that is the orchestration loop), shell, backend (use as adjective only).
+
+ **Reachability / Availability**:
+ Whether this host can reach a Case's external git repos (its remote `source` and/or
+ `fixtures`). A preflight probes each (access-level `git ls-remote`, cached per URL) before the
+ Run does work, then applies the **availability policy**: `fail` (default — a single unreachable
+ *required* Case aborts the Run) or `skip` (degrade unreachable Cases to the `skipped` status and
+ continue). A Case marked `availability: optional` always degrades. `skipped` is terminal and
+ excluded from every aggregate — distinct from `failed`, which is a defect. See ADR 0008.
+ _Avoid_: offline (a probe failure may be auth, not network), error (a skip is not a failure).
+
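Reviewer's sketch (not part of the package): the availability policy described above can be modeled as a small pure function. The function name `apply_availability` and both input mappings are assumptions for illustration. The real preflight probe (`git ls-remote`) is not performed here; its outcome is passed in as a boolean per Case.

```python
def apply_availability(cases, reachable, policy="fail"):
    """Split Cases into (runnable, skipped) under the availability policy.

    cases:     maps case name -> "required" or "optional"
    reachable: maps case name -> bool (result of a preflight probe done elsewhere)
    policy:    "fail" aborts on an unreachable required Case; "skip" degrades it
    """
    if policy not in ("fail", "skip"):
        raise ValueError(f"unknown availability policy: {policy!r}")
    runnable, skipped = [], []
    for name, availability in cases.items():
        if reachable[name]:
            runnable.append(name)
        elif availability == "optional" or policy == "skip":
            # Optional Cases always degrade, whatever the policy.
            skipped.append(name)
        else:
            raise RuntimeError(f"required case {name!r} is unreachable; aborting run")
    return runnable, skipped


cases = {"public-case": "required", "private-a": "optional", "private-b": "required"}
reachable = {"public-case": True, "private-a": False, "private-b": False}
runnable, skipped = apply_availability(cases, reachable, policy="skip")
```

With `policy="fail"` the same inputs raise, because `private-b` is required and unreachable. Keeping `skipped` as its own list mirrors the rule that skipped Cells stay out of every aggregate rather than counting as failures.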
+ ## Relationships
+
+ - A **Run** expands into many **Cells**; each **Cell** has one **Harness**, one model, one **Trial** index.
+ - A **Harness** is realized by exactly one **Adapter**; an **Adapter** declares its **Tracing** and **Interaction** capabilities.
+ - A **Tracing**-capable **Harness** produces a **Trace** (a sequence of **Trace Events**) per **Cell**, alongside the raw transcript.
+ - **Interaction** implies **Tracing** (you cannot answer requests you cannot observe), not vice-versa.
+ - **Graders** may read the final output, the **Trace**, or both.
+ - A **Trace** is the source mapped to LangFuse spans later — graders and LangFuse both depend on the **Trace**, never on ACP or the Claude SDK directly.
+ - A **Case** opts into observation via an `observe` block (Tracing and/or Interaction); absent it, the **Cell** is **Output-only**. A Run-level flag can override.
+ - When a **Case** requests more than its **Adapter** supports, the **Cell** soft-degrades (empty **Trace**, warning recorded) — unless a **Grader** needs the **Trace**, which is a hard failure for that **Cell**.
+ - An **Interaction Policy** of `llm-based` uses a **Responder**; deterministic policies (`auto-approve`/`auto-deny`/`scripted`) use none. `manual` is non-reproducible and excluded from aggregation; `llm-based` is included but flagged responder-mediated.
+ - Every mid-run request and its answer is recorded in the **Trace** (`permission_request` / `permission_response`), whatever the **Interaction Policy**.
+ - A **Case** is a **Conversation** of one or more **Turns**; **Turns** are eval-initiated, while requests answered by the **Interaction Policy** are agent-initiated. Both happen within one **Cell**.
+ - Each **Cell** gets its own **Sandbox** via an **Isolation Mode**; all modes yield a fully isolated, parallel-safe directory. Commit-pinned modes (`clone`/`worktree`) give reproducibility.
+ - A **Cell** that declares an **Environment** also gets its own venv beside the **Sandbox**; the venv (and its installed dependencies) is the *broader Sandbox* that keeps dependency-bearing Cells reproducible and parallel-safe — provisioned before the agent runs, torn down with the **Sandbox**.
+ - A **Cell**'s outcome lives in its own `result.json` (the source of truth); the run manifest is a derived index merged from those, so parallel **Cells** never write the same file.
+ - Before a **Run** does work it checks **Reachability** of every **Case**'s external repos and applies the availability policy; unreachable **Cases** either abort the Run (`fail`) or become `skipped` **Cells** (`skip`/`optional`), which are excluded from every aggregate — a missing private repo can never silently shrink the benchmark.
+ - Model selection is **Adapter-specific**: a CLI flag (CLI adapter) or applied **via the ACP instruction** — launch arg and/or `session/new` / `session/setConfigOption` (ACP adapter). A model string is opaque to the framework and may be a custom alias (e.g. `glm-5.1:cloud`).
+ - A model is meaningful only relative to a **Harness**, so the matrix **pairs models per-Harness** (entries of `{harness, models}`) rather than taking a blind cross-product.
+
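Reviewer's sketch (not part of the package): the first and last relationships above, a Run expanding into Cells and models being paired per Harness, can be illustrated with a short expansion routine. The `Cell` dataclass, the `expand_matrix` function, the harness name `acp-droid`, and the model names `model-a` / `model-b` are hypothetical. The `{harness, models}` entry shape, the case names, and the `glm-5.1:cloud` alias are taken from the diff.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One (Case x Harness x Model x Trial) coordinate; field names are illustrative."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(cases, pairings, trials):
    """Expand per-Harness model pairings into Cells.

    Each pairing is {"harness": str, "models": [str, ...]}, so a model is only
    ever combined with the Harness it was declared for (no blind cross-product).
    """
    cells = []
    for case, entry in product(cases, pairings):
        for model in entry["models"]:
            for trial in range(trials):
                cells.append(Cell(case, entry["harness"], model, trial))
    return cells


pairings = [
    {"harness": "claude-code", "models": ["model-a", "model-b"]},
    {"harness": "acp-droid", "models": ["glm-5.1:cloud"]},
]
cells = expand_matrix(["csv-droid", "roman-droid"], pairings, trials=2)
print(len(cells))  # 2 cases * (2 + 1) paired models * 2 trials = 12
```

A blind cross-product of two harnesses and three models would yield 24 Cells here, half of them pairing a model with a Harness that cannot select it.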
+ ## Example dialogue
+
+ > **Dev:** "Does the ACP **Adapter** give us **Interaction**?"
+ > **User:** "Yes — ACP's `request_permission` lets us answer the agent, so that **Adapter** is **Interaction**-capable. The generic **CLI Adapter** is **Output-only**."
+ > **Dev:** "And aider today?"
+ > **User:** "**Output-only** — no ACP, no SDK. We still observe its final output; the **Trace** is just empty of tool events."
+
+ ## Flagged ambiguities
+
+ - "ACP" was used as if it were the abstraction. Resolved: ACP is one **Adapter**'s transport;
+   the abstraction is the **Trace**. The Claude Agent SDK populates the same **Trace** without ACP.
+ - "wrap the calls" meant two distinct capabilities — **Tracing** (observe) and **Interaction**
+   (respond). Resolved: they are separate, with **Interaction** implying **Tracing**.
+ - "tool name" was treated as one thing. Resolved into three: `raw_name` (verbatim, never
+   lost), `name` (Adapter-normalized), and **Tool Kind** (portable enum). Cross-model grading
+   prefers **Tool Kind**; within-agent grading may use `raw_name`.
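Reviewer's sketch (not part of the package): the three-way split of "tool name" resolved above might look like this in code. The `ToolKind` values are the ones listed in CONTEXT.md. The `ToolCall` dataclass, the `normalize` function, its normalization rule, and the name-to-kind table are assumptions for one imaginary agent.

```python
from dataclasses import dataclass
from enum import Enum


class ToolKind(str, Enum):
    """Portable tool categories, as enumerated in the glossary."""
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"
    SEARCH = "search"
    FETCH = "fetch"
    OTHER = "other"


@dataclass(frozen=True)
class ToolCall:
    raw_name: str   # verbatim from the agent, never lost
    name: str       # Adapter-normalized name
    kind: ToolKind  # portable category, comparable across agents


# Hypothetical mapping for one imaginary agent; a real Adapter would define its own.
_KIND_BY_NAME = {"read_file": ToolKind.READ, "bash": ToolKind.EXECUTE}


def normalize(raw_name):
    """Build a ToolCall, keeping raw_name intact and defaulting unknown tools to OTHER."""
    name = raw_name.strip().lower().replace("-", "_")
    return ToolCall(raw_name, name, _KIND_BY_NAME.get(name, ToolKind.OTHER))


call = normalize("Read-File")
```

Here `call.raw_name` stays `"Read-File"` while `call.kind` is `ToolKind.READ`, so a cross-model grader can compare on `kind` and a within-agent grader can still match the verbatim name.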
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 krimvp
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,343 @@
1
+ Metadata-Version: 2.4
2
+ Name: touchstone-eval
3
+ Version: 0.1.0
4
+ Summary: Personal eval benchmark: compare model outcomes across swappable CLI-agent harnesses on custom tasks.
5
+ Project-URL: Homepage, https://github.com/krimvp/touchstone
6
+ Project-URL: Repository, https://github.com/krimvp/touchstone
7
+ Project-URL: Issues, https://github.com/krimvp/touchstone/issues
8
+ Author-email: krimvp <anton.balboa@gmail.com>
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Keywords: acp,agent,benchmark,claude-code,cli,eval,evaluation,llm
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Environment :: Console
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Operating System :: OS Independent
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Programming Language :: Python :: 3.13
21
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
22
+ Classifier: Topic :: Software Development :: Testing
23
+ Requires-Python: >=3.10
24
+ Requires-Dist: pydantic>=2.6
25
+ Requires-Dist: pyyaml>=6.0
26
+ Provides-Extra: dev
27
+ Requires-Dist: pytest>=8.0; extra == 'dev'
28
+ Provides-Extra: judge
29
+ Requires-Dist: anthropic>=0.40; extra == 'judge'
30
+ Provides-Extra: langfuse
31
+ Requires-Dist: langfuse>=2.0; extra == 'langfuse'
32
+ Description-Content-Type: text/markdown
33
+
34
+ # touchstone
35
+
36
+ > A *touchstone* is the dark stone jewelers rub gold against to read its purity from the
37
+ > streak it leaves — telling true gold from convincing fakes. That is this benchmark's whole
38
+ > job: telling apart models that look identical on paper, by the marks they leave on real work.
39
+
40
+ A personal eval benchmark for answering one question: **for my use cases, which model
41
+ works best?**
42
+
43
+ Each eval (a *case*) bundles its own task, its own input source files, its own AI
44
+ artifacts (skills / commands / plugins / MCP), and its own definition of a correct
45
+ outcome. A *run* executes a **matrix** of cells — one cell per
46
+ `(case × harness × model × trial)` — fully isolated and persisted independently, then
47
+ aggregates everything into a single report.
48
+
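The matrix expansion described above can be sketched in a few lines. This is a minimal illustration, not the package's own code: the `Cell` dataclass and `expand_matrix` function are hypothetical names, and the only thing taken from the README is that one cell exists per `(case × harness × model × trial)`.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """Hypothetical record for one unit of work in the matrix."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(case, harnesses, models, trials):
    """Return one Cell per (harness, model, trial) combination for a case."""
    return [
        Cell(case, harness, model, trial)
        for harness, model, trial in product(harnesses, models, range(1, trials + 1))
    ]


cells = expand_matrix("my-case", ["claude-code"], ["opus", "sonnet", "haiku"], 3)
print(len(cells))  # 1 harness x 3 models x 3 trials
```

Because each cell is an independent value, the list can be handed to a worker pool without any shared state between entries.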
49
+ ## Core model
50
+
51
+ ```
52
+ Case (one eval) Matrix axes Cell (unit of work + persistence)
53
+ task / prompt × harnesses[] = sandbox + transcript + output
54
+ source/ files models[] + grader scores + metrics + status
55
+ artifacts/ trials (k)
56
+ graders[]
57
+ ```
58
+
59
+ - **Harness** — the swappable thing that turns a task into an output, behind one interface
60
+ (`harness/base.py`). `echo` (fake) and `claude-code` are output-only. For *rich* runs
61
+ (a Trace of tool calls / tokens / cost) there are two paths: **`claude-code-stream`** drives
62
+ Claude natively over `--output-format stream-json` (no ACP, no Node; Tracing-only,
63
+ autonomous via skip-permissions — see `docs/adr/0006`), and the **ACP adapter** drives any
64
+ Agent Client Protocol agent (droid, gemini, codex, claude-acp, devin-cli) with full
65
+ observation **and** bidirectional interaction. ACP is one rich path, not the only one — the
66
+ Trace is the contract.
67
+ - **Graders** — `command` (run tests/build), `files` (expected files / grep patterns),
68
+ `model_judge` (LLM-as-judge), and `trace` (assert over observed tool usage / token &
69
+ cost budgets). All run; combined per the case's `expect.pass_threshold`.
70
+ - **Observation & interaction** (opt-in per case via `observe:`) — capture a normalized
71
+ **Trace** (tool calls, tokens, cost, permission events) and answer the agent's mid-run
72
+ requests with an **Interaction Policy** (`auto-approve`/`auto-deny`/`scripted`/
73
+ `llm-based`/`manual`). See `CONTEXT.md` + `docs/adr/`.
74
+ - **Resumability & parallelism** — each cell's `result.json` is the source of truth (the
75
+ manifest is a derived index), so cells run in parallel (`--workers`) without contention
76
+ and `run --resume <run_id>` continues after a crash.
77
+
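One way to picture resumability with `result.json` as the source of truth is the sketch below. It is an assumption-laden illustration: the `cells/<cell_id>/result.json` directory shape and both function names are hypothetical, and the real runner may decide completion differently (for example by inspecting a status field).

```python
import json
from pathlib import Path


def completed_cells(run_dir):
    """Return ids of cells whose result.json exists and parses as JSON."""
    done = set()
    for result in Path(run_dir).glob("cells/*/result.json"):
        try:
            json.loads(result.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or half-written file: treat the cell as not done
        done.add(result.parent.name)
    return done


def pending_cells(run_dir, planned):
    """Return the planned cell ids that still need to run, in planned order."""
    done = completed_cells(run_dir)
    return [cell_id for cell_id in planned if cell_id not in done]
```

Since the set of finished cells is recomputed from the per-cell files, no shared manifest has to be locked while workers write results.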
78
+ ## Install
79
+
80
+ The published package is `touchstone-eval`; the command it installs is `touchstone`
81
+ (the bare `touchstone` name on PyPI belongs to an unrelated, abandoned project).
82
+
83
+ ```bash
84
+ uvx touchstone-eval --help # run without installing (recommended)
85
+ pipx install touchstone-eval # or install as an isolated tool
86
+ pip install touchstone-eval # or into the current environment
87
+ ```
88
+
89
+ Add the optional extras when you need them: `touchstone-eval[judge]` (Anthropic SDK for
90
+ `model_judge`), `[langfuse]` (export), `[dev]` (pytest).
91
+
92
+ For local development from a checkout:
93
+
94
+ ```bash
95
+ pip install -e ".[judge,dev]" # judge = Anthropic SDK for model_judge; dev = pytest
96
+ ```
97
+
98
+ ## Usage
99
+
100
+ ```bash
101
+ touchstone validate # schema-check every evals/<case>/case.yaml
102
+ touchstone list # list cases and past runs
103
+ touchstone run # run the whole evals/ suite
104
+ touchstone run --eval example-case --harness echo --trials 2
105
+ touchstone run --harness droid --with-model A --with-model B # compare models, same harness
106
+ touchstone run --workers 4 # run cells in parallel
107
+ touchstone run --resume <run_id> # continue an interrupted run
108
+ touchstone report <run_id> # (re)generate runs/<run_id>/report.md
109
+ touchstone export <run_id> [--push] # write runs/<run_id>/langfuse.json (and optionally push)
110
+ ```
111
+
112
+ ### Comparing models on the same harness
113
+
114
+ The matrix is what answers "which model for my use cases?": distinct models become
115
+ distinct cells, and the report ranks them in a per-case matrix + a leaderboard (score,
116
+ cost, time, tools, tokens). A case can declare the models inline
117
+ (`matrix.models` / `matrix.entries[].models`), or you can hold a harness fixed and push
118
+ models through it at run time without editing the case:
119
+
120
+ ```bash
121
+ # Run these models on droid even if the cases declared only one — they replace the
122
+ # case's models for that harness. Each becomes its own row in the comparison.
123
+ touchstone run --harness droid \
124
+ --with-model custom:glm-5.1:cloud-0 --with-model custom:glm-4.6:cloud
125
+ ```
126
+
127
+ `--with-model` *replaces* the declared models (so you can introduce new ones); `--model`
128
+ only *filters* the models a case already declares. Models are agent-specific opaque
129
+ strings, so prefix `HARNESS=` (`--with-model droid=A`) to scope an override to one harness
130
+ when a run spans several.
131
+
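The `HARNESS=` scoping rule can be illustrated with a small parser. This is only a sketch under the assumption that the first `=` separates the harness name from the model string; `parse_with_model` is a hypothetical helper, not the CLI's actual implementation.

```python
def parse_with_model(values):
    """Split --with-model values into per-harness and unscoped overrides.

    Assumes the first '=' (if any) separates a harness name from the model;
    everything else is treated as an opaque model string.
    """
    scoped = {}
    unscoped = []
    for value in values:
        harness, sep, model = value.partition("=")
        if sep and harness and model:
            scoped.setdefault(harness, []).append(model)
        else:
            unscoped.append(value)
    return scoped, unscoped


scoped, unscoped = parse_with_model(["droid=A", "custom:glm-4.6:cloud"])
print(scoped, unscoped)
```

Colons inside a model string are left untouched, which matters because models are opaque, agent-specific identifiers.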
132
+ ACP agents are configured in `acp_agents.yaml` (see `acp_agents.yaml.example`); the
133
+ built-in profiles (`droid`, `gemini`, `codex`, `claude-acp`, `devin-cli`) work out of the
134
+ box once the agent's CLI is on `PATH`. `evals/observed-droid/` is a worked example of a
135
+ fully observed, interactive, multi-turn case.
136
+
137
+ Real harnesses (e.g. `claude-code`) cost money and require their CLI on `PATH`.
138
+ The built-in `echo` harness runs the full loop with no network/API spend — use it for
139
+ testing the framework itself.
140
+
141
+ ## Defining a case
142
+
143
+ See `evals/example-case/case.yaml` for a worked example. Schema:
144
+
145
+ ```yaml
146
+ id: my-case
147
+ description: ...
148
+ task:
149
+ prompt: |
150
+ What the model/agent must accomplish.
151
+ source: # optional; copied fresh into every cell sandbox
152
+ path: ./source # ...or {repo: owner/name, commit: <sha>} (pinned clone)
153
+ # repo form may add `subdir: <dir>` to use just one sub-directory of the clone as the
154
+ # sandbox — lets one fixtures repo hold many cases (see "Source fixtures repo" below).
155
+ artifacts: # optional AI artifacts injected into the harness
156
+ skills: [./artifacts/skills/foo]
157
+ commands: [./artifacts/commands/bar.md]
158
+ mcp: ./artifacts/.mcp.json
159
+ environment: # optional per-cell dependency setup (the "broader sandbox")
160
+ kind: pip-venv # pip-venv (default) | uv | command — how deps are provisioned
161
+ requirements: [markupsafe] # (pip-venv/uv) installed into an isolated venv per cell
162
+ install: editable # (pip-venv/uv) `pip install -e .` (src-layout pkg + its deps)
163
+ # kind: command → run shell installs for project-local ecosystems, e.g.
164
+ # commands: ["npm ci"] # node_modules / target/ etc. live in the sandbox
165
+ setup: # optional; introduce the task state after clone, before the agent
166
+ stub: [{file: pkg/mod.py, function: target}] # blank a fn body -> NotImplementedError
167
+ run: ["rm -rf .git"] # shell commands in the sandbox
168
+ matrix:
169
+ harnesses: [claude-code]
170
+ models: [opus, sonnet, haiku]
171
+ trials: 3
172
+ graders:
173
+ - {type: command, cmd: "pytest -q", weight: 1.0}
174
+ - {type: files, patterns: ["retry", "backoff"]}
175
+ - {type: model_judge, rubric: ./graders/rubric.md, model: opus, pass_threshold: 0.8}
176
+ expect:
177
+ pass_threshold: 1.0
178
+ ```
179
+
180
+ ### Source fixtures repo
181
+
182
+ A case's bulky, hand-written assets — synthetic codebases to debug **and** the `hidden/`
183
+ oracle test suites — live **out of this repo**, in a separate fixtures repo
184
+ ([`krimvp/touchstone-eval-fixtures`](https://github.com/krimvp/touchstone-eval-fixtures)), so they
185
+ don't pollute the runner/eval tree. The eval repo keeps only the *contract* (task, graders,
186
+ expectations); the fixtures repo holds the code. Each case has one directory there, split by
187
+ **visibility**:
188
+
189
+ ```
190
+ <case-id>/
191
+ source/ # agent-VISIBLE input → promoted to the sandbox before the agent runs
192
+ hidden/ # grader ORACLE → injected at grade time only; the agent never sees it
193
+ ```
194
+
195
+ A case wires the two halves with two independent pins (both default-pinned by commit):
196
+
197
+ ```yaml
198
+ source: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>, subdir: <case-id>/source}
199
+ fixtures: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>} # subdir defaults to <case-id>
200
+ graders:
201
+ - {type: pytest, inject: ["./hidden/test_x.py"]} # resolved under <case-id>/hidden/
202
+ ```
203
+
204
+ - **`source`** clones the repo, checks out the commit, and promotes `<case-id>/source/` into
205
+ the sandbox (no `.git`, like `copy`). SWE-bench-style cases point `source` at the *real
206
+ upstream repo* instead, so they have only a `hidden/` in the fixtures repo (no `source/`).
207
+ - **`fixtures`** names the repo that graders resolve `inject:` paths against — `Case.asset()`
208
+ pulls each hidden file from a host-cached clone (`src/touchstone/fixtures.py`), at grade
209
+ time, *after* the agent has stopped. Because `source/` and `hidden/` are sibling directories
210
+ and only `source/` is promoted, the oracle can never leak into the agent's sandbox.
211
+
212
+ Keep the fixtures repo **private** for the anti-memorization cases. `evals/example-case/`
213
+ stays local (`source: path`) as the offline worked example / integration fixture.
214
+
215
+ ### Real-repo (SWE-bench-style) cases
216
+
217
+ A case can pin a real GitHub repo at a commit (`source: {repo, commit}`), `setup.stub` a
218
+ function to blank its body, and inject **hidden tests** (oracle = the real function) only
219
+ at grade time — so the agent reimplements real library code and the `pytest` grader scores
220
+ the fraction of FAIL→PASS tests. See `evals/repo-*-droid/`.
221
+
222
+ When a repo needs third-party dependencies or isn't importable from its root (a `src/`
223
+ layout), declare an **`environment`**: each cell gets its own throwaway virtualenv, into
224
+ which `requirements` are pip-installed and — with `install: editable` — the repo itself
225
+ (`pip install -e .`, which resolves a src-layout package and pulls its deps). Every
226
+ subprocess the cell spawns (harness, setup, and the `command`/`pytest` graders) runs under
227
+ that venv via an explicit env, so dependency-bearing cases stay reproducible and
228
+ parallel-safe (no shared site-packages). Worked examples: `repo-smarttruncate-droid`
229
+ (a `requirements` dep) and `repo-securefilename-droid` (`install: editable`, src-layout).
230
+
231
+ ### Non-Python projects
232
+
233
+ Cases aren't Python-specific. The `command`, `files`, `model_judge`, and `trace` graders
234
+ are language-agnostic, and the **`tests`** grader gives the same partial-credit scoring as
235
+ `pytest` for any runner whose results it can read. Two substrates, **XML primary with a
236
+ console fallback**:
237
+
238
+ - **JUnit XML** (`junit_xml: <glob>`) — the universal report format every framework/build
239
+ tool can emit (Maven Surefire, Gradle, pytest `--junitxml`, vitest/jest/mocha reporters,
240
+ `go-junit-report`, `cargo2junit`). Deterministic, exact per-test counts, framework-agnostic.
241
+ - **Console summary** (`_parse_counts`) — scraped when no XML report is produced: pytest/
242
+ unittest, `node --test`/TAP, Maven Surefire, **`go test -v`** (`--- PASS:`/`--- FAIL:`), and
243
+ **`cargo test`** (`test result: … N passed; M failed`).
244
+
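To make the console fallback concrete, here is a rough sketch of scraping pass/fail counts from `cargo test` and `go test -v` output. It is not the package's `_parse_counts`; the function name `count_results` and both regular expressions are assumptions written against the output shapes quoted above.

```python
import re

# "test result: ok. 3 passed; 0 failed; ..." (one summary line per test binary)
CARGO_RE = re.compile(r"test result: \w+\. (\d+) passed; (\d+) failed")
# "--- PASS: TestName (0.00s)" / "--- FAIL: TestName (0.00s)", possibly indented
GO_RE = re.compile(r"^\s*--- (PASS|FAIL): ", re.MULTILINE)


def count_results(output):
    """Return (passed, failed) scraped from console output, or None if unrecognized."""
    cargo = CARGO_RE.findall(output)
    if cargo:
        # Sum across summaries, since cargo prints one per test binary.
        passed = sum(int(p) for p, _ in cargo)
        failed = sum(int(f) for _, f in cargo)
        return passed, failed
    verdicts = GO_RE.findall(output)
    if verdicts:
        return verdicts.count("PASS"), verdicts.count("FAIL")
    return None
```

Returning `None` for unrecognized output keeps "no tests found" distinct from "zero tests passed", which is one reason a structured XML report is the more dependable substrate.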
245
+ A `tests` grader with `gate: true` is a validity **gate** (never adds credit; disqualifies the
246
+ cell to 0 on failure) — use it to mirror SWE-bench's PASS_TO_PASS regression gate in any
247
+ language. `inject` takes either a bare filename (dropped at the sandbox root) or `{src, dest}`
248
+ to place a hidden test at a runner-specific path (e.g. Maven's `src/test/java/...`). Use
249
+ `setup.run` to blank the function (the AST-based `setup.stub` is Python-only); the
250
+ `implemented` gate works on any language when pointed at explicit `files`. Worked examples:
251
+ `repo-js-wordwrap-droid` (CommonJS, `node --test`), `repo-java-camelcase-droid` (Maven,
252
+ Surefire), and the `repo-swebench-*` battery — real recent GitHub issues across Python, Go
253
+ (`go test`), Java (Surefire + JUnit XML), JS/TS (mocha/ava/TAP), and Rust (`cargo test`).
254
+
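The gate semantics can be sketched as follows. The README states only that a gate never adds credit and zeroes the cell on failure; the weighted mean used here for the remaining graders, the "score below 1.0 means the gate failed" rule, and the names `cell_score` and `cell_passes` are all assumptions made for illustration.

```python
def cell_score(results):
    """Combine grader results into one cell score in [0, 1].

    Each result is a dict with 'score' (0..1), 'weight', and an optional
    'gate' flag. Assumed rule: weighted mean over the non-gate graders.
    """
    # A failed gate disqualifies the cell outright.
    if any(r.get("gate") and r["score"] < 1.0 for r in results):
        return 0.0
    scored = [r for r in results if not r.get("gate")]
    total_weight = sum(r["weight"] for r in scored)
    if total_weight == 0:
        return 0.0
    return sum(r["score"] * r["weight"] for r in scored) / total_weight


def cell_passes(results, pass_threshold=1.0):
    """A cell passes when its combined score reaches the case's threshold."""
    return cell_score(results) >= pass_threshold
```

Keeping gates out of the weighted mean is what makes them pure validity checks: a passing gate leaves the score unchanged, while a failing one overrides it.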
255
+ **Dependencies aren't special — how they're *isolated* is.** Real projects have
256
+ dependencies; the question is only whether installing them safely needs the `environment`
257
+ venv. It depends on where the ecosystem puts deps:
258
+
259
+ | Ecosystem | Where deps go | Isolation | How to declare |
260
+ | --- | --- | --- | --- |
261
+ | Python | shared `site-packages` (mutable) | needs the per-cell venv | `environment:` `kind: pip-venv` (or `uv`) + `requirements` / `install: editable` |
262
+ | Node / Rust / Go | project-local (`node_modules`, `target/`, build cache) | per-cell for free | `environment:` `kind: command` + `commands: ["npm ci"]` etc. |
263
+ | Java / Maven | shared `~/.m2` (versioned, immutable artifacts) | safe to share across cells | resolved by the build (`mvn test`) |
264
+
265
+ The `environment.kind` is the one declarative knob (mirroring the Sandbox's Isolation Mode):
266
+ `pip-venv` and `uv` build an isolated venv and install into it; `command` runs your install
267
+ commands for ecosystems whose deps are project-local.
268
+
269
+ ### OS-level isolation + OS packages (containers)
270
+
271
+ For cases that need OS packages or a pinned, reproducible build/grade environment, declare a
272
+ **`container`**: provisioning, `setup.run`, and the `command`/`tests`/`pytest` graders then run
273
+ inside it (via `docker exec`), with the cell bind-mounted at its same path.
274
+
275
+ ```yaml
276
+ container:
277
+   image: python:3.12-slim # pin by digest (…@sha256:…) for full reproducibility
278
+   setup: ["apt-get update -qq", "apt-get install -y -qq libxml2"] # OS packages, once at start
279
+   caches: [".cache/pip"] # share the host's cache so cells don't re-download deps
280
+ environment:
281
+   kind: pip-venv # the venv is now built *inside* the container
282
+   requirements: [lxml, pytest]
283
+ graders:
284
+   - {type: pytest, inject: ["./hidden/test_x.py"], weight: 4.0} # runs in the container
285
+ ```
286
+
287
+ `caches` mounts a home-relative dir (e.g. `.cache/pip`, `.m2`) shared with the host and
288
+ across cells, so a fresh container per cell reuses already-downloaded dependencies instead
289
+ of re-fetching them — the same shared-cache benefit the host's `~/.m2` gives today. The
290
+ suite uses this on its dependency-bearing cases: `repo-js-wordwrap` (`node:20-slim`,
291
+ zero-dep), `repo-smarttruncate` / `repo-securefilename` (`python:3.12-slim` + pip cache),
292
+ and `repo-java-camelcase` (`maven:3.9-eclipse-temurin-21` + shared `~/.m2`).
293
+
294
+ Every provisioner and grader runs through the Cell's **Executor** — `LocalExecutor` (host
295
+ subprocess) by default, `ContainerExecutor` when a `container` is declared — so the same
296
+ recipe runs under either backend (needs the docker daemon running). The Harness (the agent
297
+ under test) still runs on the host against the bind-mounted Sandbox; running the agent
298
+ itself in-container is future work. See `docs/adr/0005`.
299
+
300
+ So the earlier zero-dep examples were picked to keep the *demo* offline, not because deps
301
+ are rare. `repo-java-camelcase-droid` is a genuinely dependency-bearing non-Python case:
302
+ commons-text's source needs `commons-lang3`, which Maven resolves from Maven Central.
303
+
304
+ ## Bring your own private repos (reachability & fallback)
305
+
306
+ `touchstone` is an **engine + a public sample battery**. The verdict you can actually trust for
307
+ "which model is best **for me**" comes from *your own* tasks, so the design is built to pull
308
+ case material from external git repos you own — both the agent-visible `source: {repo, commit}`
309
+ and the hidden oracle in `fixtures: {repo, commit}` — some of them private. Auth is just your
310
+ normal git credentials (SSH agent / `gh` / a credential helper); nothing extra to configure.
311
+
312
+ Because a given host may not have access to every referenced repo (a teammate's private
313
+ fixtures, a CI box without keys, an offline laptop), a run **probes each case's external repos
314
+ before doing any work** (`git ls-remote`, cached per URL) and applies a policy:
315
+
316
+ ```bash
317
+ touchstone run # default: FAIL FAST if any required repo is unreachable
318
+ touchstone run --on-unavailable skip # degrade: skip unreachable cases, run the rest
319
+ touchstone validate --check-access # preflight only: report what a run would skip/fail on
320
+ ```
321
+
322
+ - **Fail by default.** A missing repo on a host you expected to be complete is a *loud, early*
323
+ error — never a silently smaller benchmark (which would corrupt cross-model comparisons).
324
+ - **`--on-unavailable skip`** degrades the unreachable cases to a `skipped` status: excluded
325
+ from every score and the leaderboard, surfaced in a "Skipped (unavailable)" report section,
326
+ and **not** counted as failures. Resume re-probes, so a transient outage is retried.
327
+ - **Per-case `availability: optional`** marks a case that may reference a repo you might not
328
+ have — it degrades to `skipped` even under the default fail mode.
329
+ - Only *access* failures (no auth / no network / not found) are degradable; a bad commit or
330
+ schema error is a defect and still fails loudly.
331
+
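The probe-then-apply-policy flow in the list above could look roughly like this. It is a sketch, not the engine's code: `is_reachable` and `plan_cases` are hypothetical names, the in-process cache stands in for whatever per-URL caching the runner actually uses, and the sketch does not distinguish access failures from other `git` errors.

```python
import os
import subprocess
from functools import lru_cache


@lru_cache(maxsize=None)
def is_reachable(url, timeout=30):
    """Probe a repo with `git ls-remote`; the result is cached per URL."""
    env = {**os.environ, "GIT_TERMINAL_PROMPT": "0"}  # never block on a credential prompt
    try:
        proc = subprocess.run(
            ["git", "ls-remote", url],
            capture_output=True,
            timeout=timeout,
            env=env,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def plan_cases(case_repos, on_unavailable="fail", optional=frozenset(), probe=is_reachable):
    """Split cases into (runnable, skipped) before any work starts.

    case_repos maps a case id to the external repo URLs it references.
    Raises RuntimeError for an unreachable required case under the fail policy.
    """
    runnable, skipped = [], []
    for case_id, urls in case_repos.items():
        missing = [url for url in urls if not probe(url)]
        if not missing:
            runnable.append(case_id)
        elif on_unavailable == "skip" or case_id in optional:
            skipped.append(case_id)
        else:
            raise RuntimeError(f"{case_id}: unreachable repos: {missing}")
    return runnable, skipped
```

Passing the probe in as a parameter keeps the policy testable offline; skipped cases are returned separately so a report can list them without counting them as failures.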
332
+ A fork can repoint the default hidden-fixtures repo to its own private one without editing every
333
+ case by setting `TOUCHSTONE_FIXTURES_REPO=owner/my-fixtures`. Your fully-private held-out suite
334
+ lives in `evals-private/` (gitignored) and runs with `--evals-dir evals-private` — see its
335
+ README. Design: `docs/adr/0008-reachability-and-availability-policy.md`.
336
+
337
+ ## Layout
338
+
339
+ ```
340
+ evals/<case>/ the benchmark suite (one dir per case)
341
+ src/touchstone/ the framework (config, harness/, grader/, runner, report, cli)
342
+ runs/<run_id>/ results (gitignored): manifest.json + cells/ + report.md
343
+ ```