touchstone-eval 0.1.0__tar.gz
- touchstone_eval-0.1.0/.github/workflows/publish.yml +59 -0
- touchstone_eval-0.1.0/.gitignore +29 -0
- touchstone_eval-0.1.0/CONTEXT.md +184 -0
- touchstone_eval-0.1.0/LICENSE +21 -0
- touchstone_eval-0.1.0/PKG-INFO +343 -0
- touchstone_eval-0.1.0/README.md +310 -0
- touchstone_eval-0.1.0/acp_agents.yaml.example +27 -0
- touchstone_eval-0.1.0/cutoffs.yaml.example +20 -0
- touchstone_eval-0.1.0/docs/BENCHMARK.md +84 -0
- touchstone_eval-0.1.0/docs/adr/0001-responder-mediated-interaction.md +35 -0
- touchstone_eval-0.1.0/docs/adr/0002-parallel-safe-store-and-isolation.md +35 -0
- touchstone_eval-0.1.0/docs/adr/0003-acp-as-single-rich-adapter.md +35 -0
- touchstone_eval-0.1.0/docs/adr/0004-per-cell-environment.md +48 -0
- touchstone_eval-0.1.0/docs/adr/0005-pluggable-provisioning-and-executor.md +87 -0
- touchstone_eval-0.1.0/docs/adr/0006-native-stream-json-claude-adapter.md +74 -0
- touchstone_eval-0.1.0/docs/adr/0007-fixtures-repo-source-and-hidden.md +91 -0
- touchstone_eval-0.1.0/docs/adr/0008-reachability-and-availability-policy.md +115 -0
- touchstone_eval-0.1.0/docs/plans/0001-observability-and-interaction.md +97 -0
- touchstone_eval-0.1.0/docs/plans/0002-harder-diverse-battery.md +332 -0
- touchstone_eval-0.1.0/docs/plans/0003-reachability-and-fallback.md +113 -0
- touchstone_eval-0.1.0/evals/cron-droid/case.yaml +52 -0
- touchstone_eval-0.1.0/evals/csv-droid/case.yaml +49 -0
- touchstone_eval-0.1.0/evals/diff-droid/case.yaml +51 -0
- touchstone_eval-0.1.0/evals/dummy-droid/case.yaml +33 -0
- touchstone_eval-0.1.0/evals/example-case/case.yaml +42 -0
- touchstone_eval-0.1.0/evals/example-case/graders/rubric.md +11 -0
- touchstone_eval-0.1.0/evals/example-case/source/client.py +15 -0
- touchstone_eval-0.1.0/evals/glob-droid/case.yaml +49 -0
- touchstone_eval-0.1.0/evals/humanize-bytes-droid/case.yaml +47 -0
- touchstone_eval-0.1.0/evals/json-droid/case.yaml +50 -0
- touchstone_eval-0.1.0/evals/markdown-droid/case.yaml +57 -0
- touchstone_eval-0.1.0/evals/numwords-droid/case.yaml +47 -0
- touchstone_eval-0.1.0/evals/observed-droid/case.yaml +53 -0
- touchstone_eval-0.1.0/evals/observed-droid/graders/rubric.md +10 -0
- touchstone_eval-0.1.0/evals/pluralize-droid/case.yaml +47 -0
- touchstone_eval-0.1.0/evals/realistic-droid/case.yaml +47 -0
- touchstone_eval-0.1.0/evals/regex-droid/case.yaml +52 -0
- touchstone_eval-0.1.0/evals/repo-bucketize-droid/case.yaml +66 -0
- touchstone_eval-0.1.0/evals/repo-chunkedeven-droid/case.yaml +52 -0
- touchstone_eval-0.1.0/evals/repo-codex-tui-docs/case.yaml +81 -0
- touchstone_eval-0.1.0/evals/repo-codex-tui-docs/graders/rubric.md +22 -0
- touchstone_eval-0.1.0/evals/repo-collapse-droid/case.yaml +53 -0
- touchstone_eval-0.1.0/evals/repo-droid/case.yaml +54 -0
- touchstone_eval-0.1.0/evals/repo-flowforge-bug-droid/case.yaml +74 -0
- touchstone_eval-0.1.0/evals/repo-funcy-chunks-droid/case.yaml +59 -0
- touchstone_eval-0.1.0/evals/repo-java-camelcase-droid/case.yaml +100 -0
- touchstone_eval-0.1.0/evals/repo-js-bytes-droid/case.yaml +81 -0
- touchstone_eval-0.1.0/evals/repo-js-camelcase-droid/case.yaml +81 -0
- touchstone_eval-0.1.0/evals/repo-js-kindof-wf/case.yaml +85 -0
- touchstone_eval-0.1.0/evals/repo-js-prettybytes-wf/case.yaml +90 -0
- touchstone_eval-0.1.0/evals/repo-js-wordwrap-droid/case.yaml +73 -0
- touchstone_eval-0.1.0/evals/repo-js-wordwrap-frugal-droid/case.yaml +77 -0
- touchstone_eval-0.1.0/evals/repo-mergewith-droid/case.yaml +53 -0
- touchstone_eval-0.1.0/evals/repo-mutated-filename-droid/case.yaml +88 -0
- touchstone_eval-0.1.0/evals/repo-parameterize-droid/case.yaml +54 -0
- touchstone_eval-0.1.0/evals/repo-py-chunkedeven-wf/case.yaml +83 -0
- touchstone_eval-0.1.0/evals/repo-py-chunkwindow-wf/case.yaml +101 -0
- touchstone_eval-0.1.0/evals/repo-py-easter-wf/case.yaml +94 -0
- touchstone_eval-0.1.0/evals/repo-py-formatintlist-wf/case.yaml +80 -0
- touchstone_eval-0.1.0/evals/repo-py-inflectioncase-wf/case.yaml +94 -0
- touchstone_eval-0.1.0/evals/repo-py-inflectionplural-wf/case.yaml +92 -0
- touchstone_eval-0.1.0/evals/repo-py-intword-wf/case.yaml +87 -0
- touchstone_eval-0.1.0/evals/repo-py-naturalsize-wf/case.yaml +96 -0
- touchstone_eval-0.1.0/evals/repo-py-ordinal-wf/case.yaml +90 -0
- touchstone_eval-0.1.0/evals/repo-py-pathsplit-wf/case.yaml +86 -0
- touchstone_eval-0.1.0/evals/repo-py-semvercompare-wf/case.yaml +92 -0
- touchstone_eval-0.1.0/evals/repo-py-slugify-fullsuite-wf/case.yaml +74 -0
- touchstone_eval-0.1.0/evals/repo-py-splitfamily-wf/case.yaml +104 -0
- touchstone_eval-0.1.0/evals/repo-py-splitnl-wf/case.yaml +89 -0
- touchstone_eval-0.1.0/evals/repo-py-striptags-wf/case.yaml +94 -0
- touchstone_eval-0.1.0/evals/repo-py-truncate-debug-droid/case.yaml +87 -0
- touchstone_eval-0.1.0/evals/repo-py-windowed-droid/case.yaml +86 -0
- touchstone_eval-0.1.0/evals/repo-scheduler-bug-droid/case.yaml +73 -0
- touchstone_eval-0.1.0/evals/repo-securefilename-droid/case.yaml +73 -0
- touchstone_eval-0.1.0/evals/repo-smarttruncate-droid/case.yaml +73 -0
- touchstone_eval-0.1.0/evals/repo-splitinto-droid/case.yaml +52 -0
- touchstone_eval-0.1.0/evals/repo-swebench-afero-577/case.yaml +206 -0
- touchstone_eval-0.1.0/evals/repo-swebench-anyio-1189/case.yaml +113 -0
- touchstone_eval-0.1.0/evals/repo-swebench-astropy-13453/case.yaml +239 -0
- touchstone_eval-0.1.0/evals/repo-swebench-chi-1085/case.yaml +107 -0
- touchstone_eval-0.1.0/evals/repo-swebench-chi-1097/case.yaml +90 -0
- touchstone_eval-0.1.0/evals/repo-swebench-chrono-1798/case.yaml +107 -0
- touchstone_eval-0.1.0/evals/repo-swebench-clap-6276/case.yaml +208 -0
- touchstone_eval-0.1.0/evals/repo-swebench-click-3434/case.yaml +140 -0
- touchstone_eval-0.1.0/evals/repo-swebench-click-3493/case.yaml +122 -0
- touchstone_eval-0.1.0/evals/repo-swebench-cobra-2356/case.yaml +128 -0
- touchstone_eval-0.1.0/evals/repo-swebench-commonscoll-693/case.yaml +129 -0
- touchstone_eval-0.1.0/evals/repo-swebench-commonslang-1713/case.yaml +123 -0
- touchstone_eval-0.1.0/evals/repo-swebench-commonslang-1717/case.yaml +121 -0
- touchstone_eval-0.1.0/evals/repo-swebench-commonslang-1729/case.yaml +117 -0
- touchstone_eval-0.1.0/evals/repo-swebench-commonstext-748/case.yaml +117 -0
- touchstone_eval-0.1.0/evals/repo-swebench-express-7181/case.yaml +130 -0
- touchstone_eval-0.1.0/evals/repo-swebench-flask-5014/case.yaml +146 -0
- touchstone_eval-0.1.0/evals/repo-swebench-flask-5917/case.yaml +140 -0
- touchstone_eval-0.1.0/evals/repo-swebench-gson-3034/case.yaml +123 -0
- touchstone_eval-0.1.0/evals/repo-swebench-ky-861/case.yaml +142 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pylint-4551/case.yaml +105 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pylint-6528/case.yaml +337 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pylint-7080/case.yaml +466 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pylint-8898/case.yaml +184 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pytest-10356/case.yaml +229 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pytest-5787/case.yaml +350 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pytest-5840/case.yaml +150 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pytest-6197/case.yaml +286 -0
- touchstone_eval-0.1.0/evals/repo-swebench-pytest-7236/case.yaml +218 -0
- touchstone_eval-0.1.0/evals/repo-swebench-requests-7309/case.yaml +142 -0
- touchstone_eval-0.1.0/evals/repo-swebench-requests-7315/case.yaml +144 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sklearn-14053/case.yaml +150 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sphinx-10466/case.yaml +245 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sphinx-11510/case.yaml +225 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sphinx-7590/case.yaml +150 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sphinx-8035/case.yaml +116 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sphinx-8548/case.yaml +116 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sphinx-8551/case.yaml +199 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sphinx-9229/case.yaml +215 -0
- touchstone_eval-0.1.0/evals/repo-swebench-sphinx-9461/case.yaml +234 -0
- touchstone_eval-0.1.0/evals/repo-swebench-testify-1877/case.yaml +135 -0
- touchstone_eval-0.1.0/evals/repo-swebench-testify-1888/case.yaml +121 -0
- touchstone_eval-0.1.0/evals/repo-swebench-time-782/case.yaml +149 -0
- touchstone_eval-0.1.0/evals/repo-swebench-validator-2693/case.yaml +130 -0
- touchstone_eval-0.1.0/evals/repo-swebench-validator-2774/case.yaml +130 -0
- touchstone_eval-0.1.0/evals/repo-swebench-werkzeug-3129/case.yaml +137 -0
- touchstone_eval-0.1.0/evals/repo-swebench-werkzeug-3147/case.yaml +145 -0
- touchstone_eval-0.1.0/evals/repo-swebench-xarray-3677/case.yaml +143 -0
- touchstone_eval-0.1.0/evals/repo-windowed-droid/case.yaml +52 -0
- touchstone_eval-0.1.0/evals/roman-droid/case.yaml +47 -0
- touchstone_eval-0.1.0/evals/scored-droid/case.yaml +48 -0
- touchstone_eval-0.1.0/evals/titlecase-droid/case.yaml +47 -0
- touchstone_eval-0.1.0/evals/toposort-droid/case.yaml +51 -0
- touchstone_eval-0.1.0/evals-private/README.md +46 -0
- touchstone_eval-0.1.0/evals-private/example-private-case/case.yaml +34 -0
- touchstone_eval-0.1.0/evals-private/example-private-case/graders/rubric.md +12 -0
- touchstone_eval-0.1.0/harnesses.yaml.example +13 -0
- touchstone_eval-0.1.0/pyproject.toml +57 -0
- touchstone_eval-0.1.0/src/touchstone/__init__.py +3 -0
- touchstone_eval-0.1.0/src/touchstone/artifacts.py +51 -0
- touchstone_eval-0.1.0/src/touchstone/cli.py +212 -0
- touchstone_eval-0.1.0/src/touchstone/concurrency.py +34 -0
- touchstone_eval-0.1.0/src/touchstone/config.py +431 -0
- touchstone_eval-0.1.0/src/touchstone/environment.py +126 -0
- touchstone_eval-0.1.0/src/touchstone/executor.py +162 -0
- touchstone_eval-0.1.0/src/touchstone/export/__init__.py +5 -0
- touchstone_eval-0.1.0/src/touchstone/export/langfuse.py +102 -0
- touchstone_eval-0.1.0/src/touchstone/fixtures.py +53 -0
- touchstone_eval-0.1.0/src/touchstone/grader/__init__.py +6 -0
- touchstone_eval-0.1.0/src/touchstone/grader/base.py +55 -0
- touchstone_eval-0.1.0/src/touchstone/grader/command.py +40 -0
- touchstone_eval-0.1.0/src/touchstone/grader/efficiency.py +63 -0
- touchstone_eval-0.1.0/src/touchstone/grader/files.py +71 -0
- touchstone_eval-0.1.0/src/touchstone/grader/implemented.py +42 -0
- touchstone_eval-0.1.0/src/touchstone/grader/model_judge.py +149 -0
- touchstone_eval-0.1.0/src/touchstone/grader/pytest_runner.py +209 -0
- touchstone_eval-0.1.0/src/touchstone/grader/registry.py +40 -0
- touchstone_eval-0.1.0/src/touchstone/grader/swebench.py +101 -0
- touchstone_eval-0.1.0/src/touchstone/grader/trace.py +116 -0
- touchstone_eval-0.1.0/src/touchstone/harness/__init__.py +7 -0
- touchstone_eval-0.1.0/src/touchstone/harness/acp.py +492 -0
- touchstone_eval-0.1.0/src/touchstone/harness/base.py +90 -0
- touchstone_eval-0.1.0/src/touchstone/harness/claude_code.py +85 -0
- touchstone_eval-0.1.0/src/touchstone/harness/claude_stream.py +158 -0
- touchstone_eval-0.1.0/src/touchstone/harness/cli_agent.py +97 -0
- touchstone_eval-0.1.0/src/touchstone/harness/echo.py +30 -0
- touchstone_eval-0.1.0/src/touchstone/harness/registry.py +78 -0
- touchstone_eval-0.1.0/src/touchstone/interaction/__init__.py +11 -0
- touchstone_eval-0.1.0/src/touchstone/interaction/base.py +86 -0
- touchstone_eval-0.1.0/src/touchstone/interaction/policies.py +104 -0
- touchstone_eval-0.1.0/src/touchstone/interaction/registry.py +40 -0
- touchstone_eval-0.1.0/src/touchstone/interaction/responder.py +98 -0
- touchstone_eval-0.1.0/src/touchstone/metrics.py +125 -0
- touchstone_eval-0.1.0/src/touchstone/reachability.py +170 -0
- touchstone_eval-0.1.0/src/touchstone/report.py +433 -0
- touchstone_eval-0.1.0/src/touchstone/runner.py +333 -0
- touchstone_eval-0.1.0/src/touchstone/sandbox.py +121 -0
- touchstone_eval-0.1.0/src/touchstone/setup.py +58 -0
- touchstone_eval-0.1.0/src/touchstone/store.py +170 -0
- touchstone_eval-0.1.0/src/touchstone/trace.py +189 -0
- touchstone_eval-0.1.0/tests/conftest.py +17 -0
- touchstone_eval-0.1.0/tests/fake_acp_agent.py +82 -0
- touchstone_eval-0.1.0/tests/test_acp.py +104 -0
- touchstone_eval-0.1.0/tests/test_acp_profiles.py +32 -0
- touchstone_eval-0.1.0/tests/test_claude_stream.py +106 -0
- touchstone_eval-0.1.0/tests/test_config.py +52 -0
- touchstone_eval-0.1.0/tests/test_container.py +137 -0
- touchstone_eval-0.1.0/tests/test_efficiency_grader.py +64 -0
- touchstone_eval-0.1.0/tests/test_environment.py +154 -0
- touchstone_eval-0.1.0/tests/test_graders.py +65 -0
- touchstone_eval-0.1.0/tests/test_implemented.py +34 -0
- touchstone_eval-0.1.0/tests/test_integration.py +85 -0
- touchstone_eval-0.1.0/tests/test_interaction.py +68 -0
- touchstone_eval-0.1.0/tests/test_judge_jury.py +20 -0
- touchstone_eval-0.1.0/tests/test_langfuse.py +60 -0
- touchstone_eval-0.1.0/tests/test_metrics.py +103 -0
- touchstone_eval-0.1.0/tests/test_multiturn.py +51 -0
- touchstone_eval-0.1.0/tests/test_observe.py +86 -0
- touchstone_eval-0.1.0/tests/test_parallel_isolation.py +125 -0
- touchstone_eval-0.1.0/tests/test_pytest_grader.py +222 -0
- touchstone_eval-0.1.0/tests/test_reachability.py +115 -0
- touchstone_eval-0.1.0/tests/test_reachability_runner.py +102 -0
- touchstone_eval-0.1.0/tests/test_report_caveats.py +78 -0
- touchstone_eval-0.1.0/tests/test_setup.py +47 -0
- touchstone_eval-0.1.0/tests/test_store.py +53 -0
- touchstone_eval-0.1.0/tests/test_swebench_grader.py +101 -0
- touchstone_eval-0.1.0/tests/test_trace.py +52 -0
- touchstone_eval-0.1.0/tests/test_trace_grader.py +105 -0
- touchstone_eval-0.1.0/uv.lock +956 -0
@@ -0,0 +1,59 @@
+name: Publish to PyPI
+
+# Publishes touchstone-eval to PyPI when a version tag is pushed, e.g.:
+# git tag v0.1.0 && git push origin v0.1.0
+#
+# Uses PyPI Trusted Publishing (OIDC) — no API token / password stored as a secret.
+# One-time setup on PyPI: project Settings -> Publishing -> add a GitHub publisher with
+# owner: krimvp repo: touchstone workflow: publish.yml environment: pypi
+# (Create the "pypi" environment under the repo's Settings -> Environments first, or use
+# PyPI's "pending publisher" flow to register it before the project's first release.)
+
+on:
+  push:
+    tags:
+      - "v*"
+
+permissions:
+  contents: read
+
+jobs:
+  build:
+    name: Build sdist + wheel
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Build distributions
+        run: uv build
+
+      - name: Check metadata
+        run: uvx twine check dist/*
+
+      - name: Upload artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+  publish:
+    name: Publish to PyPI
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/touchstone-eval
+    permissions:
+      id-token: write # required for Trusted Publishing (OIDC)
+    steps:
+      - name: Download built distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+      - name: Publish
+        uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,29 @@
+# Run outputs (intermediate + final results) — large and machine-specific.
+runs/
+
+# Transient run logs
+*.log
+
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+.eggs/
+build/
+dist/
+.venv/
+venv/
+.pytest_cache/
+.ruff_cache/
+
+# Env / secrets
+.env
+
+# Private held-out eval set — your OWN usecases, never committed (contamination-proof
+# tiebreaker; Lessons 3 & 6). Run with: touchstone --evals-dir evals-private run
+# The README + example template stay tracked; real private cases are ignored.
+evals-private/*
+!evals-private/.gitignore
+!evals-private/README.md
+!evals-private/example-private-case
+!evals-private/example-private-case/**
@@ -0,0 +1,184 @@
+# touchstone
+
+A personal eval benchmark for deciding which model works best for the user's own
+usecases. This glossary fixes the language the framework and its docs use.
+
+## Language
+
+### Benchmark structure
+
+**Case**:
+One eval — a task plus the input, artifacts, graders, and expectations needed to judge it.
+_Avoid_: test, suite, scenario.
+
+**Run**:
+A single execution of the benchmark that expands a matrix and produces a report.
+_Avoid_: session, job.
+
+**Cell**:
+The atomic unit of work and persistence: one `(Case × Harness × Model × Trial)`.
+_Avoid_: task-run, instance.
+
+**Trial**:
+One repeated attempt of the same Cell coordinates, used for consistency / pass@k.
+_Avoid_: attempt, sample.
+
+**Grader**:
+A component that turns a Harness's result into a Score.
+_Avoid_: judge (reserve "judge" for the model-as-judge grader specifically), scorer, checker.
+
+**Harness**:
+The swappable thing that turns a Case's task into an output. Every Harness is an adapter
+behind one interface.
+_Avoid_: runner, agent, driver.
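
The Cell and Trial definitions above can be illustrated with a short sketch. It is not taken from the touchstone source: the `Cell` dataclass and the `pass_at_k` helper are hypothetical names, and the formula is the commonly used unbiased pass@k estimator, assumed here rather than confirmed as what the framework's metrics module computes.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class Cell:
    """Hypothetical model of one (Case x Harness x Model x Trial) coordinate."""
    case: str
    harness: str
    model: str
    trial: int  # assumed to be a zero-based repeat index


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n Trials of which c passed.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k Trials contains at least one passing Trial.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k Trials include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With four Trials of which two passed, `pass_at_k(4, 2, 1)` evaluates to `0.5`. Keeping `Cell` frozen makes it hashable, so it could serve as a dictionary key when grouping Trials that share the same Case, Harness, and Model.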
+
+### Observation & interaction
+
+**Trace**:
+The normalized, vendor-neutral event stream captured from a Harness run (messages, tool
+calls, token usage, etc.). The framework's own schema — never an external protocol's types.
+_Avoid_: log (reserve for raw stdout/transcript), transcript (that is the raw capture).
+
+**Trace Event**:
+One item in a Trace (e.g. a tool call, a token-usage update, a permission request).
+_Avoid_: span (reserve "span" for the future LangFuse mapping), record.
+
+**Tracing** (a Harness capability):
+A Harness's ability to emit a Trace. Opt-in; Harnesses that lack it degrade to output-only.
+_Avoid_: instrumentation, observability (use as adjectives, not as the capability name).
+
+**Interaction** (a Harness capability):
+A Harness's ability to let the framework answer the agent's mid-run requests (tool
+permission / approval / input). Strictly richer than Tracing.
+_Avoid_: feedback, callback, HITL.
+
+**Output-only**:
+A Harness that exposes neither Tracing nor Interaction — only its final output. Today's
+default and the universal fallback.
+_Avoid_: black-box, basic.
+
+**Tool Kind**:
+The portable, canonical category of a tool call (`read | write | execute | search |
+fetch | other`) — the one tool axis that means the same across agents. Distinct from a
+tool's `raw_name` (verbatim from the agent) and `name` (the Adapter's normalized name).
+_Avoid_: tool type, category.
+
+**Interaction Policy**:
+The per-Case rule that answers *agent-initiated* mid-run requests. One of `auto-approve`,
+`auto-deny`, `scripted`, `llm-based`, or `manual`.
+_Avoid_: handler, strategy.
+
+**Turn**:
+One *eval-initiated* prompt sent to the agent within a Cell. Distinct from an
+agent-initiated request (which the Interaction Policy answers) and from a Trial (a repeat
+of the whole Cell). The first Turn is the Case's task; later Turns are scripted follow-ups.
+_Avoid_: round, step, message.
+
+**Conversation**:
+The ordered Turns of a Case, sent one after another, each dispatched once the agent's
+previous Turn reaches a `stop`. A single-prompt Case is a one-Turn Conversation.
+_Avoid_: dialogue, thread, chat.
+
+**Responder**:
+The fixed auxiliary LLM that answers the agent's mid-run requests under Case guidelines
+when the Interaction Policy is `llm-based`. A control variable, held constant across the
+matrix like the Judge, so the agent stays the only thing being compared.
+_Avoid_: helper, user-sim, proxy.
+
+**Judge**:
+The fixed auxiliary LLM used by the model-as-judge Grader to score output. Like the
+Responder, a control variable held constant across the matrix.
+_Avoid_: grader (the Judge is used *by* a Grader, not a synonym for one).
+
+**Adapter**:
+A concrete Harness implementation. The **ACP adapter** is the single rich adapter — one
+implementation driving every Agent-Client-Protocol agent (Claude via `claude-agent-acp`,
+Codex, Gemini, droid, devin-cli, …) through one event-translation path. The **CLI adapter**
+is the generic output-only fallback. (A native **Claude Agent SDK adapter** is an optional
+future no-Node alternative, not part of the core.)
+_Avoid_: backend, plugin, connector.
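
To make the Tool Kind and Interaction Policy entries above concrete, here is a minimal sketch. The enum members come from the glossary, but everything else is an assumption for illustration: the raw tool names in `RAW_TO_KIND`, the `tool_kind` and `answer_request` helpers, and the choice to deny once a script runs out are all hypothetical and not the framework's actual API.

```python
from enum import Enum
from typing import Sequence


class ToolKind(str, Enum):
    """The portable categories listed in the glossary."""
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"
    SEARCH = "search"
    FETCH = "fetch"
    OTHER = "other"


# Hypothetical raw tool names; a real Adapter would map its agent's own names.
RAW_TO_KIND = {
    "read_file": ToolKind.READ,
    "write_file": ToolKind.WRITE,
    "shell": ToolKind.EXECUTE,
    "grep": ToolKind.SEARCH,
    "http_get": ToolKind.FETCH,
}


def tool_kind(raw_name: str) -> ToolKind:
    """Map an agent's verbatim tool name to a Tool Kind, defaulting to OTHER."""
    return RAW_TO_KIND.get(raw_name, ToolKind.OTHER)


def answer_request(policy: str, index: int, script: Sequence[bool] = ()) -> bool:
    """Answer the index-th agent-initiated request under a deterministic policy.

    Returns True to approve and False to deny. Denying once the script is
    exhausted is an assumption made for this sketch.
    """
    if policy == "auto-approve":
        return True
    if policy == "auto-deny":
        return False
    if policy == "scripted":
        return script[index] if index < len(script) else False
    # llm-based needs a Responder and manual needs a human; neither is modeled.
    raise NotImplementedError(f"policy {policy!r} is not deterministic")
```

Falling back to `ToolKind.OTHER` keeps unknown tools gradeable across agents, while the verbatim name stays available separately as `raw_name`.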
+
+### Execution & isolation
+
+**Sandbox**:
+The isolated working directory a Cell's Harness operates in, prepared fresh from the Case
+source. Self-contained: own directory, own subprocess env, never shared between Cells.
+_Avoid_: workspace, workdir (informal synonyms only).
+
+**Isolation Mode**:
+How a Sandbox is created from the source — `copy` (copy a folder), `clone` (git clone at a
+commit), or `worktree` (git worktree at a commit). Inferred from source type and
+overridable; `worktree` is opt-in, not the default.
+_Avoid_: sandbox type, strategy.
+
+**Environment**:
+A Cell's own throwaway dependency setup (the *broader Sandbox*), provisioned when a Case
+declares an `environment` block. A **Provisioner** (selected by `kind`) prepares it:
+`pip-venv` (default) / `uv` build a per-Cell virtualenv and install declared dependencies
+(and optionally the Sandbox repo via `install: editable`); `command` runs declared shell
+commands for ecosystems with project-local deps (`npm ci`, `cargo fetch`). Every subprocess
+the Cell spawns (Harness, setup, `command`/`tests`/`pytest` Graders) runs under it via an
+explicit env, never a shared global interpreter. Absent the block, the Cell uses the host.
+_Avoid_: virtualenv (one Provisioner's mechanism, not the concept), image, container (no
+OS-level isolation *yet* — see Executor).
+
+**Provisioner**:
+The strategy that prepares a Cell's Environment, selected by `environment.kind` (`pip-venv`,
+`uv`, `command`). Mirrors Isolation Mode for the Sandbox: one declarative knob, multiple
+backends, one contract (return the subprocess `env`, or `None` for host).
+_Avoid_: installer, builder.
+
+**Executor**:
+Where a Cell's non-Harness commands run, behind one `run(argv, cwd, env)` + `create_venv`
+interface. `LocalExecutor` runs host subprocesses; `ContainerExecutor` runs them via
+`docker exec` in a container with the cell bind-mounted (selected by a Case's `container`
+block), bringing OS-level isolation and OS packages. Provisioning, `setup.run`, and the
+`command`/`tests`/`pytest` Graders run through the Cell's Executor; the Harness still runs on
+the host (ADR 0005). The provisioner recipes are written once and run under either backend.
+_Avoid_: runner (that is the orchestration loop), shell, backend (use as adjective only).
+
+**Reachability / Availability**:
+Whether this host can reach a Case's external git repos (its remote `source` and/or
+`fixtures`). A preflight probes each (access-level `git ls-remote`, cached per URL) before the
+Run does work, then applies the **availability policy**: `fail` (default — a single unreachable
+*required* Case aborts the Run) or `skip` (degrade unreachable Cases to the `skipped` status and
+continue). A Case marked `availability: optional` always degrades. `skipped` is terminal and
+excluded from every aggregate — distinct from `failed`, which is a defect. See ADR 0008.
+_Avoid_: offline (a probe failure may be auth, not network), error (a skip is not a failure).
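
The availability policy described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the framework's code: `apply_availability` and `RunAborted` are hypothetical names, `"required"` is assumed as the counterpart of `availability: optional`, and the probe results are passed in as a plain dictionary instead of being produced by `git ls-remote`.

```python
from typing import Dict, List


class RunAborted(RuntimeError):
    """Hypothetical error raised when the `fail` policy meets an unreachable required Case."""


def apply_availability(
    cases: Dict[str, str],
    reachable: Dict[str, bool],
    policy: str = "fail",
) -> List[str]:
    """Return the names of Cases to mark `skipped`, or abort the Run.

    cases maps a Case name to "required" or "optional" (assumed values).
    reachable maps a Case name to its preflight probe result; a missing
    entry is treated as unreachable.
    """
    if policy not in ("fail", "skip"):
        raise ValueError(f"unknown availability policy: {policy!r}")
    skipped = []
    for name, availability in cases.items():
        if reachable.get(name, False):
            continue
        if availability == "optional" or policy == "skip":
            # Optional Cases always degrade; `skip` degrades required ones too.
            skipped.append(name)
        else:
            raise RunAborted(f"required Case {name!r} is unreachable")
    return skipped
```

The returned names would then be recorded as `skipped` Cells and left out of every aggregate, which keeps them distinct from `failed` Cells.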
+
+## Relationships
+
+- A **Run** expands into many **Cells**; each **Cell** has one **Harness**, one model, one **Trial** index.
+- A **Harness** is realized by exactly one **Adapter**; an **Adapter** declares its **Tracing** and **Interaction** capabilities.
+- A **Tracing**-capable **Harness** produces a **Trace** (a sequence of **Trace Events**) per **Cell**, alongside the raw transcript.
+- **Interaction** implies **Tracing** (you cannot answer requests you cannot observe), not vice-versa.
+- **Graders** may read the final output, the **Trace**, or both.
+- A **Trace** is the source mapped to LangFuse spans later — graders and LangFuse both depend on the **Trace**, never on ACP or the Claude SDK directly.
+- A **Case** opts into observation via an `observe` block (Tracing and/or Interaction); absent it, the **Cell** is **Output-only**. A Run-level flag can override.
+- When a **Case** requests more than its **Adapter** supports, the **Cell** soft-degrades (empty **Trace**, warning recorded) — unless a **Grader** needs the **Trace**, which is a hard failure for that **Cell**.
+- An **Interaction Policy** of `llm-based` uses a **Responder**; deterministic policies (`auto-approve`/`auto-deny`/`scripted`) use none. `manual` is non-reproducible and excluded from aggregation; `llm-based` is included but flagged responder-mediated.
+- Every mid-run request and its answer is recorded in the **Trace** (`permission_request` / `permission_response`), whatever the **Interaction Policy**.
+- A **Case** is a **Conversation** of one or more **Turns**; **Turns** are eval-initiated, while requests answered by the **Interaction Policy** are agent-initiated. Both happen within one **Cell**.
+- Each **Cell** gets its own **Sandbox** via an **Isolation Mode**; all modes yield a fully isolated, parallel-safe directory. Commit-pinned modes (`clone`/`worktree`) give reproducibility.
+- A **Cell** that declares an **Environment** also gets its own venv beside the **Sandbox**; the venv (and its installed dependencies) is the *broader Sandbox* that keeps dependency-bearing Cells reproducible and parallel-safe — provisioned before the agent runs, torn down with the **Sandbox**.
+- A **Cell**'s outcome lives in its own `result.json` (the source of truth); the run manifest is a derived index merged from those, so parallel **Cells** never write the same file.
+- Before a **Run** does work it checks **Reachability** of every **Case**'s external repos and applies the availability policy; unreachable **Cases** either abort the Run (`fail`) or become `skipped` **Cells** (`skip`/`optional`), which are excluded from every aggregate — a missing private repo can never silently shrink the benchmark.
+- Model selection is **Adapter-specific**: a CLI flag (CLI adapter) or applied **via the ACP instruction** — launch arg and/or `session/new` / `session/setConfigOption` (ACP adapter). A model string is opaque to the framework and may be a custom alias (e.g. `glm-5.1:cloud`).
+- A model is meaningful only relative to a **Harness**, so the matrix **pairs models per-Harness** (entries of `{harness, models}`) rather than taking a blind cross-product.
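
The relationships above say that a Run expands into Cells and that models are paired per Harness instead of being crossed blindly. A minimal sketch of that expansion follows. The function name `expand_matrix`, the tuple layout, the zero-based Trial index, and the harness and model names in the usage note are assumptions for illustration; only the `{harness, models}` entry shape comes from the text.

```python
from typing import Dict, List, Sequence, Tuple, Union

MatrixEntry = Dict[str, Union[str, List[str]]]
CellKey = Tuple[str, str, str, int]  # (case, harness, model, trial)


def expand_matrix(
    cases: Sequence[str],
    matrix: Sequence[MatrixEntry],
    trials: int,
) -> List[CellKey]:
    """Expand Cases x per-Harness model pairs x Trials into Cell coordinates.

    Each matrix entry looks like {"harness": "...", "models": ["...", ...]}.
    A model is only ever combined with the Harness that lists it.
    """
    cells: List[CellKey] = []
    for case in cases:
        for entry in matrix:
            harness = entry["harness"]
            for model in entry["models"]:
                for trial in range(trials):
                    cells.append((case, harness, model, trial))
    return cells
```

For example, two Cases, a matrix of `{"harness": "acp", "models": ["m1", "m2"]}` and `{"harness": "cli", "models": ["m3"]}`, and two Trials yield 12 Cells, none of which pairs `cli` with `m1`. A blind cross-product of the same inputs would have produced 24.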
+
+## Example dialogue
+
+> **Dev:** "Does the ACP **Adapter** give us **Interaction**?"
+> **User:** "Yes — ACP's `request_permission` lets us answer the agent, so that **Adapter** is **Interaction**-capable. The generic **CLI Adapter** is **Output-only**."
+> **Dev:** "And aider today?"
+> **User:** "**Output-only** — no ACP, no SDK. We still observe its final output; the **Trace** is just empty of tool events."
+
+## Flagged ambiguities
+
+- "ACP" was used as if it were the abstraction. Resolved: ACP is one **Adapter**'s transport;
+the abstraction is the **Trace**. The Claude Agent SDK populates the same **Trace** without ACP.
+- "wrap the calls" meant two distinct capabilities — **Tracing** (observe) and **Interaction**
+(respond). Resolved: they are separate, with **Interaction** implying **Tracing**.
+- "tool name" was treated as one thing. Resolved into three: `raw_name` (verbatim, never
+lost), `name` (Adapter-normalized), and **Tool Kind** (portable enum). Cross-model grading
+prefers **Tool Kind**; within-agent grading may use `raw_name`.
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 krimvp
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,343 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: touchstone-eval
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Personal eval benchmark: compare model outcomes across swappable CLI-agent harnesses on custom tasks.
|
|
5
|
+
Project-URL: Homepage, https://github.com/krimvp/touchstone
|
|
6
|
+
Project-URL: Repository, https://github.com/krimvp/touchstone
|
|
7
|
+
Project-URL: Issues, https://github.com/krimvp/touchstone/issues
|
|
8
|
+
Author-email: krimvp <anton.balboa@gmail.com>
|
|
9
|
+
License-Expression: MIT
|
|
10
|
+
License-File: LICENSE
|
|
11
|
+
Keywords: acp,agent,benchmark,claude-code,cli,eval,evaluation,llm
|
|
12
|
+
Classifier: Development Status :: 3 - Alpha
|
|
13
|
+
Classifier: Environment :: Console
|
|
14
|
+
Classifier: Intended Audience :: Developers
|
|
15
|
+
Classifier: Operating System :: OS Independent
|
|
16
|
+
Classifier: Programming Language :: Python :: 3
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
18
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
19
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
21
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
22
|
+
Classifier: Topic :: Software Development :: Testing
|
|
23
|
+
Requires-Python: >=3.10
|
|
24
|
+
Requires-Dist: pydantic>=2.6
|
|
25
|
+
Requires-Dist: pyyaml>=6.0
|
|
26
|
+
Provides-Extra: dev
|
|
27
|
+
Requires-Dist: pytest>=8.0; extra == 'dev'
|
|
28
|
+
Provides-Extra: judge
|
|
29
|
+
Requires-Dist: anthropic>=0.40; extra == 'judge'
|
|
30
|
+
Provides-Extra: langfuse
|
|
31
|
+
Requires-Dist: langfuse>=2.0; extra == 'langfuse'
|
|
32
|
+
Description-Content-Type: text/markdown
|
|
33
|
+
|
|
34
|
+
# touchstone
|
|
35
|
+
|
|
36
|
+
> A *touchstone* is the dark stone jewelers rub gold against to read its purity from the
|
|
37
|
+
> streak it leaves — telling true gold from convincing fakes. That is this benchmark's whole
|
|
38
|
+
> job: telling apart models that look identical on paper, by the marks they leave on real work.
|
|
39
|
+
|
|
40
|
+
A personal eval benchmark for answering one question: **for my use cases, which model
|
|
41
|
+
works best?**
|
|
42
|
+
|
|
43
|
+
Each eval (a *case*) bundles its own task, its own input source files, its own AI
|
|
44
|
+
artifacts (skills / commands / plugins / MCP), and its own definition of a correct
|
|
45
|
+
outcome. A *run* executes a **matrix** of cells — one cell per
|
|
46
|
+
`(case × harness × model × trial)` — fully isolated and persisted independently, then
|
|
47
|
+
aggregates everything into a single report.
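
As a rough illustration of how a matrix expands into cells, the sketch below enumerates one cell per combination for a single case. The `Cell` dataclass and `expand_matrix` function are hypothetical names chosen for this example, not touchstone's actual API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One unit of work: a single (case, harness, model, trial) combination."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(case, harnesses, models, trials):
    """Return one Cell per harness x model x trial for a single case."""
    return [
        Cell(case, harness, model, trial)
        for harness, model, trial in product(harnesses, models, range(1, trials + 1))
    ]


cells = expand_matrix("my-case", ["claude-code"], ["opus", "sonnet", "haiku"], trials=3)
print(len(cells))  # 1 harness x 3 models x 3 trials = 9 cells
```

Because every cell is an independent value, each one can be sandboxed, run, and persisted without reference to the others.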
|
|
48
|
+
|
|
49
|
+
## Core model
|
|
50
|
+
|
|
51
|
+
```
|
|
52
|
+
Case (one eval) Matrix axes Cell (unit of work + persistence)
|
|
53
|
+
task / prompt × harnesses[] = sandbox + transcript + output
|
|
54
|
+
source/ files models[] + grader scores + metrics + status
|
|
55
|
+
artifacts/ trials (k)
|
|
56
|
+
graders[]
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
- **Harness** — the swappable thing that turns a task into an output, behind one interface
|
|
60
|
+
(`harness/base.py`). `echo` (fake) and `claude-code` are output-only. For *rich* runs
|
|
61
|
+
(a Trace of tool calls / tokens / cost) there are two paths: **`claude-code-stream`** drives
|
|
62
|
+
Claude natively over `--output-format stream-json` (no ACP, no Node; Tracing-only,
|
|
63
|
+
autonomous via skip-permissions — see `docs/adr/0006`), and the **ACP adapter** drives any
|
|
64
|
+
Agent Client Protocol agent (droid, gemini, codex, claude-acp, devin-cli) with full
|
|
65
|
+
observation **and** bidirectional interaction. ACP is one rich path, not the only one — the
|
|
66
|
+
Trace is the contract.
|
|
67
|
+
- **Graders** — `command` (run tests/build), `files` (expected files / grep patterns),
|
|
68
|
+
`model_judge` (LLM-as-judge), and `trace` (assert over observed tool usage / token &
|
|
69
|
+
cost budgets). All run; combined per the case's `expect.pass_threshold`.
|
|
70
|
+
- **Observation & interaction** (opt-in per case via `observe:`) — capture a normalized
|
|
71
|
+
**Trace** (tool calls, tokens, cost, permission events) and answer the agent's mid-run
|
|
72
|
+
requests with an **Interaction Policy** (`auto-approve`/`auto-deny`/`scripted`/
|
|
73
|
+
`llm-based`/`manual`). See `CONTEXT.md` + `docs/adr/`.
|
|
74
|
+
- **Resumability & parallelism** — each cell's `result.json` is the source of truth (the
|
|
75
|
+
manifest is a derived index), so cells run in parallel (`--workers`) without contention
|
|
76
|
+
and `run --resume <id>` continues after a crash.
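
To make the resumability point concrete, here is a minimal sketch of how a resume step could decide which cells still need to run when each cell's `result.json` is the source of truth. The `pending_cells` helper, the `cells/<cell_id>/result.json` path shape, and the rule "a readable JSON file means done" are assumptions for illustration, not the package's confirmed behavior.

```python
import json
import tempfile
from pathlib import Path


def pending_cells(run_dir, planned):
    """Return planned cell ids that have no readable result.json yet."""
    todo = []
    for cell_id in planned:
        result = Path(run_dir) / "cells" / cell_id / "result.json"
        try:
            json.loads(result.read_text())
        except (OSError, ValueError):
            # Missing or half-written result: the cell must be (re)run.
            todo.append(cell_id)
    return todo


with tempfile.TemporaryDirectory() as run_dir:
    done = Path(run_dir) / "cells" / "a"
    done.mkdir(parents=True)
    (done / "result.json").write_text('{"status": "passed"}')

    crashed = Path(run_dir) / "cells" / "b"
    crashed.mkdir(parents=True)
    (crashed / "result.json").write_text('{"status": ')  # truncated by a crash

    remaining = pending_cells(run_dir, ["a", "b", "c"])
    print(remaining)  # ['b', 'c']
```

Since no shared file has to be updated to record progress, parallel workers in this sketch never contend over a common index.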
|
|
77
|
+
|
|
78
|
+
## Install
|
|
79
|
+
|
|
80
|
+
The published package is `touchstone-eval`; the command it installs is `touchstone`
|
|
81
|
+
(the bare `touchstone` name on PyPI belongs to an unrelated, abandoned project).
|
|
82
|
+
|
|
83
|
+
```bash
|
|
84
|
+
uvx touchstone-eval --help # run without installing (recommended)
|
|
85
|
+
pipx install touchstone-eval # or install as an isolated tool
|
|
86
|
+
pip install touchstone-eval # or into the current environment
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
Add the optional extras when you need them: `touchstone-eval[judge]` (Anthropic SDK for
|
|
90
|
+
`model_judge`), `[langfuse]` (export), `[dev]` (pytest).
|
|
91
|
+
|
|
92
|
+
For local development from a checkout:
|
|
93
|
+
|
|
94
|
+
```bash
|
|
95
|
+
pip install -e ".[judge,dev]" # judge = Anthropic SDK for model_judge; dev = pytest
|
|
96
|
+
```
|
|
97
|
+
|
|
98
|
+
## Usage
|
|
99
|
+
|
|
100
|
+
```bash
|
|
101
|
+
touchstone validate # schema-check every evals/<case>/case.yaml
|
|
102
|
+
touchstone list # list cases and past runs
|
|
103
|
+
touchstone run # run the whole evals/ suite
|
|
104
|
+
touchstone run --eval example-case --harness echo --trials 2
|
|
105
|
+
touchstone run --harness droid --with-model A --with-model B # compare models, same harness
|
|
106
|
+
touchstone run --workers 4 # run cells in parallel
|
|
107
|
+
touchstone run --resume <run_id> # continue an interrupted run
|
|
108
|
+
touchstone report <run_id> # (re)generate runs/<run_id>/report.md
|
|
109
|
+
touchstone export <run_id> [--push] # write runs/<run_id>/langfuse.json (and optionally push)
|
|
110
|
+
```
|
|
111
|
+
|
|
112
|
+
### Comparing models on the same harness
|
|
113
|
+
|
|
114
|
+
The matrix is what answers "which model for my use cases?": distinct models become
|
|
115
|
+
distinct cells, and the report ranks them in a per-case matrix + a leaderboard (score,
|
|
116
|
+
cost, time, tools, tokens). A case can declare the models inline
|
|
117
|
+
(`matrix.models` / `matrix.entries[].models`), or you can hold a harness fixed and push
|
|
118
|
+
models through it at run time without editing the case:
|
|
119
|
+
|
|
120
|
+
```bash
|
|
121
|
+
# Run these models on droid even if the cases declared only one — they replace the
|
|
122
|
+
# case's models for that harness. Each becomes its own row in the comparison.
|
|
123
|
+
touchstone run --harness droid \
|
|
124
|
+
--with-model custom:glm-5.1:cloud-0 --with-model custom:glm-4.6:cloud
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
`--with-model` *replaces* the declared models (so you can introduce new ones); `--model`
|
|
128
|
+
only *filters* the models a case already declares. Models are agent-specific opaque
|
|
129
|
+
strings, so prefix `HARNESS=` (`--with-model droid=A`) to scope an override to one harness
|
|
130
|
+
when a run spans several.
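
The difference between replacing and filtering can be sketched as follows. `resolve_models` is a hypothetical helper, and it assumes model strings never contain `=` themselves; the real CLI may parse and combine these flags differently.

```python
def resolve_models(declared, harness, with_model=(), model_filter=()):
    """Resolve the models to run for one harness.

    with_model entries are "MODEL" (applies to every harness) or
    "HARNESS=MODEL" (applies only to that harness).
    """
    overrides = []
    for entry in with_model:
        scope, sep, name = entry.partition("=")
        if not sep:
            overrides.append(entry)   # unscoped: applies to every harness
        elif scope == harness:
            overrides.append(name)    # scoped to this harness only
    if overrides:
        return overrides              # --with-model replaces the declared models
    if model_filter:
        # --model only narrows what the case already declares
        return [m for m in declared if m in model_filter]
    return list(declared)


print(resolve_models(["opus"], "droid", with_model=["A", "B"]))
print(resolve_models(["opus", "sonnet"], "claude-code", with_model=["droid=A"]))
print(resolve_models(["opus", "sonnet"], "claude-code", model_filter=["sonnet"]))
```

In this sketch a scoped override for `droid` leaves a `claude-code` harness on its declared models, which is the behavior the `HARNESS=` prefix is described as providing.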
|
|
131
|
+
|
|
132
|
+
ACP agents are configured in `acp_agents.yaml` (see `acp_agents.yaml.example`); the
|
|
133
|
+
built-in profiles (`droid`, `gemini`, `codex`, `claude-acp`, `devin-cli`) work out of the
|
|
134
|
+
box once the agent's CLI is on `PATH`. `evals/observed-droid/` is a worked example of a
|
|
135
|
+
fully observed, interactive, multi-turn case.
|
|
136
|
+
|
|
137
|
+
Real harnesses (e.g. `claude-code`) cost money and require their CLI on `PATH`.
|
|
138
|
+
The built-in `echo` harness runs the full loop with no network/API spend — use it for
|
|
139
|
+
testing the framework itself.
|
|
140
|
+
|
|
141
|
+
## Defining a case
|
|
142
|
+
|
|
143
|
+
See `evals/example-case/case.yaml` for a worked example. Schema:
|
|
144
|
+
|
|
145
|
+
```yaml
|
|
146
|
+
id: my-case
|
|
147
|
+
description: ...
|
|
148
|
+
task:
|
|
149
|
+
prompt: |
|
|
150
|
+
What the model/agent must accomplish.
|
|
151
|
+
source: # optional; copied fresh into every cell sandbox
|
|
152
|
+
path: ./source # ...or {repo: owner/name, commit: <sha>} (pinned clone)
|
|
153
|
+
# repo form may add `subdir: <dir>` to use just one sub-directory of the clone as the
|
|
154
|
+
# sandbox — lets one fixtures repo hold many cases (see "Source fixtures repo" below).
|
|
155
|
+
artifacts: # optional AI artifacts injected into the harness
|
|
156
|
+
skills: [./artifacts/skills/foo]
|
|
157
|
+
commands: [./artifacts/commands/bar.md]
|
|
158
|
+
mcp: ./artifacts/.mcp.json
|
|
159
|
+
environment: # optional per-cell dependency setup (the "broader sandbox")
|
|
160
|
+
kind: pip-venv # pip-venv (default) | uv | command — how deps are provisioned
|
|
161
|
+
requirements: [markupsafe] # (pip-venv/uv) installed into an isolated venv per cell
|
|
162
|
+
install: editable # (pip-venv/uv) `pip install -e .` (src-layout pkg + its deps)
|
|
163
|
+
# kind: command → run shell installs for project-local ecosystems, e.g.
|
|
164
|
+
# commands: ["npm ci"] # node_modules / target/ etc. live in the sandbox
|
|
165
|
+
setup: # optional; introduce the task state after clone, before the agent
|
|
166
|
+
stub: [{file: pkg/mod.py, function: target}] # blank a fn body -> NotImplementedError
|
|
167
|
+
run: ["rm -rf .git"] # shell commands in the sandbox
|
|
168
|
+
matrix:
|
|
169
|
+
harnesses: [claude-code]
|
|
170
|
+
models: [opus, sonnet, haiku]
|
|
171
|
+
trials: 3
|
|
172
|
+
graders:
|
|
173
|
+
- {type: command, cmd: "pytest -q", weight: 1.0}
|
|
174
|
+
- {type: files, patterns: ["retry", "backoff"]}
|
|
175
|
+
- {type: model_judge, rubric: ./graders/rubric.md, model: opus, pass_threshold: 0.8}
|
|
176
|
+
expect:
|
|
177
|
+
pass_threshold: 1.0
|
|
178
|
+
```
|
|
179
|
+
|
|
180
|
+
### Source fixtures repo
|
|
181
|
+
|
|
182
|
+
A case's bulky, hand-written assets — synthetic codebases to debug **and** the `hidden/`
|
|
183
|
+
oracle test suites — live **out of this repo**, in a separate fixtures repo
|
|
184
|
+
([`krimvp/touchstone-eval-fixtures`](https://github.com/krimvp/touchstone-eval-fixtures)), so they
|
|
185
|
+
don't pollute the runner/eval tree. The eval repo keeps only the *contract* (task, graders,
|
|
186
|
+
expectations); the fixtures repo holds the code. Each case has one directory there, split by
|
|
187
|
+
**visibility**:
|
|
188
|
+
|
|
189
|
+
```
|
|
190
|
+
<case-id>/
|
|
191
|
+
source/ # agent-VISIBLE input → promoted to the sandbox before the agent runs
|
|
192
|
+
hidden/ # grader ORACLE → injected at grade time only; the agent never sees it
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
A case wires the two halves with two independent pins (both default-pinned by commit):
|
|
196
|
+
|
|
197
|
+
```yaml
|
|
198
|
+
source: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>, subdir: <case-id>/source}
|
|
199
|
+
fixtures: {repo: krimvp/touchstone-eval-fixtures, commit: <sha>} # subdir defaults to <case-id>
|
|
200
|
+
graders:
|
|
201
|
+
- {type: pytest, inject: ["./hidden/test_x.py"]} # resolved under <case-id>/hidden/
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
- **`source`** clones the repo, checks out the commit, and promotes `<case-id>/source/` into
|
|
205
|
+
the sandbox (no `.git`, like `copy`). SWE-bench-style cases point `source` at the *real
|
|
206
|
+
upstream repo* instead, so they have only a `hidden/` in the fixtures repo (no `source/`).
|
|
207
|
+
- **`fixtures`** names the repo that graders resolve `inject:` paths against — `Case.asset()`
|
|
208
|
+
pulls each hidden file from a host-cached clone (`src/touchstone/fixtures.py`), at grade
|
|
209
|
+
time, *after* the agent has stopped. Because `source/` and `hidden/` are sibling directories
|
|
210
|
+
and only `source/` is promoted, the oracle can never leak into the agent's sandbox.
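
The visibility split can be illustrated with a small sketch that promotes only the `source/` half of a case directory. `promote_source` is a hypothetical function operating on a local stand-in for the fixtures clone; it is not the code in `src/touchstone/fixtures.py`.

```python
import shutil
import tempfile
from pathlib import Path


def promote_source(fixtures_clone, case_id, sandbox):
    """Copy only <case-id>/source/ into the sandbox; hidden/ is never touched."""
    shutil.copytree(Path(fixtures_clone) / case_id / "source", sandbox, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in for a checked-out fixtures repo.
    clone = Path(tmp) / "clone"
    (clone / "my-case" / "source").mkdir(parents=True)
    (clone / "my-case" / "hidden").mkdir(parents=True)
    (clone / "my-case" / "source" / "mod.py").write_text("def target(): ...\n")
    (clone / "my-case" / "hidden" / "test_x.py").write_text("def test_x(): ...\n")

    sandbox = Path(tmp) / "sandbox"
    promote_source(clone, "my-case", sandbox)
    visible = sorted(p.name for p in sandbox.rglob("*"))
    print(visible)  # ['mod.py']
```

The copy starts from the `source/` directory itself, so nothing under the sibling `hidden/` directory is ever on the path being copied.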
|
|
211
|
+
|
|
212
|
+
Keep the fixtures repo **private** for the anti-memorization cases. `evals/example-case/`
|
|
213
|
+
stays local (`source: path`) as the offline worked example / integration fixture.
|
|
214
|
+
|
|
215
|
+
### Real-repo (SWE-bench-style) cases
|
|
216
|
+
|
|
217
|
+
A case can pin a real GitHub repo at a commit (`source: {repo, commit}`), `setup.stub` a
|
|
218
|
+
function to blank its body, and inject **hidden tests** (oracle = the real function) only
|
|
219
|
+
at grade time — so the agent reimplements real library code and the `pytest` grader scores
|
|
220
|
+
the fraction of FAIL→PASS tests. See `evals/repo-*-droid/`.
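
One plausible reading of "fraction of FAIL→PASS tests" is sketched below. The `fail_to_pass_score` function and its input shape (test id mapped to pass/fail) are assumptions for illustration; the actual `pytest` grader may compute its score differently.

```python
def fail_to_pass_score(before, after):
    """Fraction of tests that failed before the agent ran and pass afterwards.

    before/after map test id -> True (passed) or False (failed).
    """
    targets = [t for t, passed in before.items() if not passed]
    if not targets:
        return 0.0  # nothing to fix, so no credit to award
    fixed = sum(1 for t in targets if after.get(t, False))
    return fixed / len(targets)


before = {"test_a": False, "test_b": False, "test_c": False, "test_d": True}
after = {"test_a": True, "test_b": True, "test_c": False, "test_d": True}
score = fail_to_pass_score(before, after)
print(score)  # 2 of the 3 previously failing tests now pass
```

Tests that already passed against the stubbed code (`test_d` here) do not earn credit in this sketch, since they say nothing about the reimplemented function.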
|
|
221
|
+
|
|
222
|
+
When a repo needs third-party dependencies or isn't importable from its root (a `src/`
|
|
223
|
+
layout), declare an **`environment`**: each cell gets its own throwaway virtualenv, into
|
|
224
|
+
which `requirements` are pip-installed and — with `install: editable` — the repo itself
|
|
225
|
+
(`pip install -e .`, which resolves a src-layout package and pulls its deps). Every
|
|
226
|
+
subprocess the cell spawns (harness, setup, and the `command`/`pytest` graders) runs under
|
|
227
|
+
that venv via an explicit env, so dependency-bearing cases stay reproducible and
|
|
228
|
+
parallel-safe (no shared site-packages). Worked examples: `repo-smarttruncate-droid`
|
|
229
|
+
(a `requirements` dep) and `repo-securefilename-droid` (`install: editable`, src-layout).
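
Running a subprocess "under that venv via an explicit env" can be sketched as building an environment dict rather than activating the venv in a shell. `venv_env` is a hypothetical helper that mirrors what a venv `activate` script sets; it is not touchstone's own code.

```python
import os
from pathlib import Path


def venv_env(venv_dir, base=None):
    """Return an environment dict that makes subprocesses use the given venv."""
    env = dict(os.environ if base is None else base)
    bin_dir = Path(venv_dir) / ("Scripts" if os.name == "nt" else "bin")
    env["VIRTUAL_ENV"] = str(venv_dir)
    # Put the venv's executables first so `python` and `pip` resolve to it.
    env["PATH"] = str(bin_dir) + os.pathsep + env.get("PATH", "")
    env.pop("PYTHONHOME", None)  # a stale PYTHONHOME would override the venv
    return env


env = venv_env("/tmp/cell/.venv", base={"PATH": "/usr/bin", "PYTHONHOME": "/opt/py"})
print(env["PATH"])
```

The returned dict would be passed as `env=` to `subprocess.run`, so two cells running in parallel never share an activated interpreter or a `site-packages` directory.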
|
|
230
|
+
|
|
231
|
+
### Non-Python projects
|
|
232
|
+
|
|
233
|
+
Cases aren't Python-specific. The `command`, `files`, `model_judge`, and `trace` graders
|
|
234
|
+
are language-agnostic, and the **`tests`** grader gives the same partial-credit scoring as
|
|
235
|
+
`pytest` for any runner whose results it can read. Two substrates, **XML primary with a
|
|
236
|
+
console fallback**:
|
|
237
|
+
|
|
238
|
+
- **JUnit XML** (`junit_xml: <glob>`): the de facto report format that nearly every framework/build
|
|
239
|
+
tool can emit (Maven Surefire, Gradle, pytest `--junitxml`, vitest/jest/mocha reporters,
|
|
240
|
+
`go-junit-report`, `cargo2junit`). Deterministic, exact per-test counts, framework-agnostic.
|
|
241
|
+
- **Console summary** (`_parse_counts`) — scraped when no XML report is produced: pytest/
|
|
242
|
+
unittest, `node --test`/TAP, Maven Surefire, **`go test -v`** (`--- PASS:`/`--- FAIL:`), and
|
|
243
|
+
**`cargo test`** (`test result: … N passed; M failed`).
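
A minimal sketch of console scraping for two of the runners listed above. This `parse_counts` function is written for illustration only and is not the package's `_parse_counts`; the regular expressions assume the standard `cargo test` summary line and `go test -v` result lines.

```python
import re

# cargo prints e.g. "test result: FAILED. 3 passed; 1 failed; 0 ignored; ..."
CARGO_RE = re.compile(r"test result: \S+ (\d+) passed; (\d+) failed")
# go test -v prints "--- PASS: TestName (0.00s)" / "--- FAIL: TestName (0.00s)"
GO_RE = re.compile(r"^\s*--- (PASS|FAIL): ", re.MULTILINE)


def parse_counts(output):
    """Return (passed, failed) scraped from console output, or None if unrecognized."""
    cargo = CARGO_RE.findall(output)
    if cargo:
        # cargo prints one summary line per test binary; add them up.
        return (sum(int(p) for p, _ in cargo), sum(int(f) for _, f in cargo))
    go = GO_RE.findall(output)
    if go:
        # Indented subtest lines are counted as tests too.
        return (go.count("PASS"), go.count("FAIL"))
    return None


cargo_out = "test result: FAILED. 3 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out\n"
go_out = "=== RUN   TestA\n--- PASS: TestA (0.00s)\n=== RUN   TestB\n--- FAIL: TestB (0.00s)\nFAIL\n"
print(parse_counts(cargo_out))  # (3, 1)
print(parse_counts(go_out))     # (1, 1)
```

Scraping like this depends on each runner's exact output format, which is one reason an XML report is the preferred substrate when it is available.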
|
|
244
|
+
|
|
245
|
+
A `tests` grader with `gate: true` is a validity **gate**: it never adds credit, and a failure sets the
|
|
246
|
+
cell's score to 0. Use it to mirror SWE-bench's PASS_TO_PASS regression gate in any
|
|
247
|
+
language. `inject` takes either a bare filename (dropped at the sandbox root) or `{src, dest}`
|
|
248
|
+
to place a hidden test at a runner-specific path (e.g. Maven's `src/test/java/...`). Use
|
|
249
|
+
`setup.run` to blank the function (the AST-based `setup.stub` is Python-only); the
|
|
250
|
+
`implemented` gate works on any language when pointed at explicit `files`. Worked examples:
|
|
251
|
+
`repo-js-wordwrap-droid` (CommonJS, `node --test`), `repo-java-camelcase-droid` (Maven,
|
|
252
|
+
Surefire), and the `repo-swebench-*` battery — real recent GitHub issues across Python, Go
|
|
253
|
+
(`go test`), Java (Surefire + JUnit XML), JS/TS (mocha/ava/TAP), and Rust (`cargo test`).
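
The gate semantics can be sketched as a combination rule in which gates only disqualify and never contribute to the score. The `combine` function, the weighted-mean formula, and the result dict shape are assumptions made for this example; the README does not spell out how touchstone combines grader scores.

```python
def combine(results, pass_threshold):
    """Combine grader results into (score, passed).

    Each result is a dict with "score" in [0, 1], an optional "weight"
    (default 1.0) and an optional "gate" flag (default False).
    """
    # A failed gate disqualifies the cell outright.
    if any(r.get("gate") and r["score"] < 1.0 for r in results):
        return 0.0, False
    scored = [r for r in results if not r.get("gate")]  # gates never add credit
    total = sum(r.get("weight", 1.0) for r in scored)
    if total == 0:
        return 0.0, False
    score = sum(r["score"] * r.get("weight", 1.0) for r in scored) / total
    return score, score >= pass_threshold


graded = [
    {"score": 0.5, "weight": 4.0},   # hidden tests: half of them pass
    {"score": 1.0, "weight": 1.0},   # build command succeeds
    {"score": 1.0, "gate": True},    # regression gate holds
]
print(combine(graded, pass_threshold=0.5))

broken = graded[:2] + [{"score": 0.0, "gate": True}]
print(combine(broken, pass_threshold=0.5))  # (0.0, False)
```

A passing gate leaves the score untouched here, so a cell cannot raise its result simply by not breaking existing tests.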
|
|
254
|
+
|
|
255
|
+
**Dependencies aren't special — how they're *isolated* is.** Real projects have
|
|
256
|
+
dependencies; the question is only whether installing them safely needs the `environment`
|
|
257
|
+
venv. It depends on where the ecosystem puts deps:
|
|
258
|
+
|
|
259
|
+
| Ecosystem | Where deps go | Isolation | How to declare |
|
|
260
|
+
| --- | --- | --- | --- |
|
|
261
|
+
| Python | shared `site-packages` (mutable) | needs the per-cell venv | `environment:` `kind: pip-venv` (or `uv`) + `requirements` / `install: editable` |
|
|
262
|
+
| Node / Rust / Go | project-local (`node_modules`, `target/`, build cache) | per-cell for free | `environment:` `kind: command` + `commands: ["npm ci"]` etc. |
|
|
263
|
+
| Java / Maven | shared `~/.m2` (versioned, immutable artifacts) | safe to share across cells | resolved by the build (`mvn test`) |
|
|
264
|
+
|
|
265
|
+
The `environment.kind` is the one declarative knob (mirroring the Sandbox's Isolation Mode):
|
|
266
|
+
`pip-venv` and `uv` build an isolated venv and install into it; `command` runs your install
|
|
267
|
+
commands for ecosystems whose deps are project-local.
|
|
268
|
+
|
|
269
|
+
### OS-level isolation + OS packages (containers)
|
|
270
|
+
|
|
271
|
+
For cases that need OS packages or a pinned, reproducible build/grade environment, declare a
|
|
272
|
+
**`container`**: provisioning, `setup.run`, and the `command`/`tests`/`pytest` graders then run
|
|
273
|
+
inside it (via `docker exec`), with the cell bind-mounted at its same path.
|
|
274
|
+
|
|
275
|
+
```yaml
|
|
276
|
+
container:
|
|
277
|
+
image: python:3.12-slim # pin by digest (…@sha256:…) for full reproducibility
|
|
278
|
+
setup: ["apt-get update -qq", "apt-get install -y -qq libxml2"] # OS packages, once at start
|
|
279
|
+
caches: [".cache/pip"] # share the host's cache so cells don't re-download deps
|
|
280
|
+
environment:
|
|
281
|
+
kind: pip-venv # the venv is now built *inside* the container
|
|
282
|
+
requirements: [lxml, pytest]
|
|
283
|
+
graders:
|
|
284
|
+
- {type: pytest, inject: ["./hidden/test_x.py"], weight: 4.0} # runs in the container
|
|
285
|
+
```
|
|
286
|
+
|
|
287
|
+
`caches` mounts a home-relative dir (e.g. `.cache/pip`, `.m2`) shared with the host and
|
|
288
|
+
across cells, so a fresh container per cell reuses already-downloaded dependencies instead
|
|
289
|
+
of re-fetching them — the same shared-cache benefit the host's `~/.m2` gives today. The
|
|
290
|
+
suite uses this on its dependency-bearing cases: `repo-js-wordwrap` (`node:20-slim`,
|
|
291
|
+
zero-dep), `repo-smarttruncate` / `repo-securefilename` (`python:3.12-slim` + pip cache),
|
|
292
|
+
and `repo-java-camelcase` (`maven:3.9-eclipse-temurin-21` + shared `~/.m2`).
|
|
293
|
+
|
|
294
|
+
Every provisioner and grader runs through the Cell's **Executor** — `LocalExecutor` (host
|
|
295
|
+
subprocess) by default, `ContainerExecutor` when a `container` is declared — so the same
|
|
296
|
+
recipe runs under either backend (needs the docker daemon running). The Harness (the agent
|
|
297
|
+
under test) still runs on the host against the bind-mounted Sandbox; running the agent
|
|
298
|
+
itself in-container is future work. See `docs/adr/0005`.
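
The idea of one recipe running under either backend can be sketched as two classes that differ only in how they build the command line. The class names come from the text above, but the `argv` method, the constructor arguments, and the use of `sh -c` are assumptions for illustration; see `docs/adr/0005` for the actual design.

```python
import subprocess


class LocalExecutor:
    """Runs a shell command directly on the host."""

    def argv(self, cmd):
        return ["sh", "-c", cmd]


class ContainerExecutor:
    """Runs the same shell command inside an already-started container."""

    def __init__(self, container, workdir):
        self.container = container
        self.workdir = workdir  # the cell path, bind-mounted at the same location

    def argv(self, cmd):
        return ["docker", "exec", "--workdir", self.workdir,
                self.container, "sh", "-c", cmd]


def run(executor, cmd):
    """Provisioners and graders call this without knowing which backend is active."""
    return subprocess.run(executor.argv(cmd), capture_output=True, text=True)


local = LocalExecutor().argv("pytest -q")
boxed = ContainerExecutor("cell-1", "/runs/r1/cells/c1").argv("pytest -q")
print(local)
print(boxed)
```

Keeping the command string identical across backends is what lets a case add or drop its `container` block without rewriting its graders.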
|
|
299
|
+
|
|
300
|
+
So the earlier zero-dep examples were picked to keep the *demo* offline, not because deps
|
|
301
|
+
are rare. `repo-java-camelcase-droid` is a genuinely dependency-bearing non-Python case:
|
|
302
|
+
commons-text's source needs `commons-lang3`, which Maven resolves from Maven Central.
|
|
303
|
+
|
|
304
|
+
## Bring your own private repos (reachability & fallback)
|
|
305
|
+
|
|
306
|
+
`touchstone` is an **engine + a public sample battery**. The verdict you can actually trust for
|
|
307
|
+
"which model is best **for me**" comes from *your own* tasks, so the design is built to pull
|
|
308
|
+
case material from external git repos you own — both the agent-visible `source: {repo, commit}`
|
|
309
|
+
and the hidden oracle in `fixtures: {repo, commit}` — some of them private. Auth is just your
|
|
310
|
+
normal git credentials (SSH agent / `gh` / a credential helper); nothing extra to configure.
|
|
311
|
+
|
|
312
|
+
Because a given host may not have access to every referenced repo (a teammate's private
|
|
313
|
+
fixtures, a CI box without keys, an offline laptop), a run **probes each case's external repos
|
|
314
|
+
before doing any work** (`git ls-remote`, cached per URL) and applies a policy:
|
|
315
|
+
|
|
316
|
+
```bash
|
|
317
|
+
touchstone run # default: FAIL FAST if any required repo is unreachable
|
|
318
|
+
touchstone run --on-unavailable skip # degrade: skip unreachable cases, run the rest
|
|
319
|
+
touchstone validate --check-access # preflight only: report what a run would skip/fail on
|
|
320
|
+
```
|
|
321
|
+
|
|
322
|
+
- **Fail by default.** A missing repo on a host you expected to be complete is a *loud, early*
|
|
323
|
+
error — never a silently smaller benchmark (which would corrupt cross-model comparisons).
|
|
324
|
+
- **`--on-unavailable skip`** degrades the unreachable cases to a `skipped` status: excluded
|
|
325
|
+
from every score and the leaderboard, surfaced in a "Skipped (unavailable)" report section,
|
|
326
|
+
and **not** counted as failures. Resume re-probes, so a transient outage is retried.
|
|
327
|
+
- **Per-case `availability: optional`** marks a case that may reference a repo you might not
|
|
328
|
+
have — it degrades to `skipped` even under the default fail mode.
|
|
329
|
+
- Only *access* failures (no auth / no network / not found) are degradable; a bad commit or
|
|
330
|
+
schema error is a defect and still fails loudly.
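
The policy in the list above can be sketched as a cached probe plus a small decision function. `probe` and `decide` are hypothetical names, and the sketch covers only access failures; it does not model bad commits or schema errors, which the text says always fail loudly.

```python
import subprocess
from functools import lru_cache


@lru_cache(maxsize=None)
def probe(url):
    """True if `git ls-remote` can reach the repo; the result is cached per URL."""
    proc = subprocess.run(["git", "ls-remote", url], capture_output=True)
    return proc.returncode == 0


def decide(reachable, on_unavailable="fail", availability="required"):
    """Return "run", "skip", or "fail" for one case.

    reachable maps each external repo URL the case references to True/False.
    """
    if all(reachable.values()):
        return "run"
    if on_unavailable == "skip" or availability == "optional":
        return "skip"   # excluded from scores, not counted as a failure
    return "fail"       # loud, early error by default


status = {"git@example.com:team/private-fixtures.git": False}
print(decide(status))                           # fail
print(decide(status, on_unavailable="skip"))    # skip
print(decide(status, availability="optional"))  # skip
```

In a real run the `reachable` mapping would be filled by calling `probe` on each URL before any cell starts; the example uses a hard-coded mapping so that it needs no network access.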
|
|
331
|
+
|
|
332
|
+
A fork can repoint the default hidden-fixtures repo to its own private one without editing every
|
|
333
|
+
case by setting `TOUCHSTONE_FIXTURES_REPO=owner/my-fixtures`. Your fully-private held-out suite
|
|
334
|
+
lives in `evals-private/` (gitignored) and runs with `--evals-dir evals-private` — see its
|
|
335
|
+
README. Design: `docs/adr/0008-reachability-and-availability-policy.md`.
|
|
336
|
+
|
|
337
|
+
## Layout
|
|
338
|
+
|
|
339
|
+
```
|
|
340
|
+
evals/<case>/ the benchmark suite (one dir per case)
|
|
341
|
+
src/touchstone/ the framework (config, harness/, grader/, runner, report, cli)
|
|
342
|
+
runs/<run_id>/ results (gitignored): manifest.json + cells/ + report.md
|
|
343
|
+
```
|