proofswe 0.1.3
- package/LICENSE +21 -0
- package/README.md +116 -0
- package/npm/bin/proofswe.js +87 -0
- package/package.json +33 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 proofswe contributors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,116 @@
+# proofswe
+
+**SWE-bench tells you how models score in the lab. This is proof from real work.**
+
+proofswe is an open benchmark for coding agents, built from real developer
+sessions instead of synthetic tasks. The question it answers is the one no
+existing benchmark can:
+
+> Oracle benchmarks measure whether a model can close tasks that *have answers*.
+> proofswe measures whether a model's work *survives* in tasks that don't.
+
+Status: **design exploration.** No collector, no binary yet. We are nailing the
+hard parts on paper before writing a line of code:
+
+- [`docs/METHODOLOGY.md`](docs/METHODOLOGY.md) — how raw real-world sessions
+  become a statistically meaningful benchmark (survival analysis, IRT,
+  hierarchical Bradley–Terry, learning the metric weights from the OSS merge
+  oracle). This is what decides whether proofswe is a benchmark or just telemetry
+  with a leaderboard attached.
+- [`docs/CAPTURE.md`](docs/CAPTURE.md) — the data-capture pipeline architecture:
+  a Go binary that captures Claude Code / Codex / (later) Cursor sessions
+  through a harness-agnostic narrow waist, backed by how the best Go CLIs
+  (gh, hugo, fzf, GoReleaser, go-git) actually build this.
+
+## Why this category is empty
+
+Every serious coding benchmark today is **oracle-based**: it needs a
+machine-checkable definition of success *before* the task runs.
+
+| Benchmark | The question it actually answers |
+|---|---|
+| SWE-bench | Can the model close historical issues that happen to have test oracles? |
+| Terminal-Bench / Aider | Can the model complete synthetic tasks with verifiable end states? |
+| Arena-style (Copilot Arena, LMArena) | Which output do strangers prefer at a glance? |
+| **proofswe** | **Whose work do informed owners keep, in tasks that have no oracle?** |
+
+The first three are *pre-task* benchmarks. Success must be definable before work
+starts, which structurally forbids ambiguity. But ambiguity is the defining
+property of real software engineering: "refactor this," "build me a dashboard,"
+"make it faster," "figure out why staging is weird." None of those can ever
+enter an oracle benchmark, no matter how good the benchmark gets.
+
+proofswe is a *post-task* benchmark. Success is defined by what happened
+*afterward*: did the person who owns the codebase and knows the real
+requirements **keep** the work, **commit** it, **build on top of it** — or
+revert it. That is the most informed judgment available for an unverifiable
+task, and it is expressed as a costly action rather than a cheap rating.
+
+## Three properties no existing benchmark has at once
+
+- **Contamination-proof by construction.** Every datapoint is a fresh task in a
+  private codebase that did not exist at training time.
+- **Ungameable in the usual way.** A lab cannot fine-tune against a metric that
+  lives downstream of thousands of uncontrolled real environments. (One real
+  attack vector exists — astroturfing the public pool — and the methodology
+  addresses it head-on.)
+- **The real task distribution.** Including the boring, ambiguous, underspecified
+  majority of work that defines what "good at software engineering" means.
+
+## The collection ladder
+
+Ordered by how much the data is worth to research:
+
+1. **Observational outcomes** — privacy-safe, line-hash telemetry. Scales widest,
+   weakest for causal claims.
+2. **Paired replay** — the same real task attempted by two models, judged by the
+   same informed user. One paired comparison is worth 10–100 observational ones.
+   The feature users want most is also the data researchers need most.
+3. **Open-source transcripts** — when the repo is public, the privacy cost of
+   donating the full transcript collapses, and PR merge status becomes an
+   external oracle. This is the successor dataset to SWE-bench.
+
+The leaderboard is marketing for the dataset. The dataset — real multi-turn
+trajectories with survival-based reward signals, openly licensed and auditable —
+is the research contribution the labs sit on privately and nobody else has.
+
+## Current Pipeline
+
+Install the CLI through npm:
+
+```sh
+npx proofswe version
+```
+
+Submit a real Codex or Claude Code transcript for server-side judging:
+
+```sh
+proofswe submit
+```
+
+`submit` auto-detects the latest supported Codex or Claude Code transcript for
+the current git repo, builds the same scrubbed reproducible task as
+`contribute`, sends it to the proofswe API, and prints the server scorecard. The
+contributor does not need an OpenAI or Anthropic key; the official judge runs on
+the proofswe server. Use `proofswe submit <path>` for an explicit transcript,
+`--no-wait` for automation, and `PROOFSWE_API_URL` or `--endpoint` to point at a
+staging server.
+
+Agent chat helpers are explicit opt-in:
+
+- Codex: `/prompts:benchmark` plus `$proofswe-benchmark`
+- Claude Code: `$proofswe-benchmark`
+
+```sh
+proofswe agent install
+```
+
+Run the staging judge endpoint with server-side credentials:
+
+```sh
+OPENAI_API_KEY=... proofswe serve --addr=:8080 --judge-provider=openai
+```
+
+## License
+
+MIT. See [`LICENSE`](LICENSE).
package/npm/bin/proofswe.js
ADDED
@@ -0,0 +1,87 @@
+#!/usr/bin/env node
+
+const { spawnSync } = require("node:child_process");
+const fs = require("node:fs");
+const path = require("node:path");
+
+function packageName() {
+  const platform = currentPlatform();
+  const arch = currentArch();
+  const supported = new Set([
+    "darwin:arm64",
+    "darwin:x64",
+    "linux:arm64",
+    "linux:x64",
+    "win32:arm64",
+    "win32:x64",
+  ]);
+  const key = `${platform}:${arch}`;
+  if (!supported.has(key)) {
+    throw new Error(`unsupported platform ${platform}/${arch}`);
+  }
+  return `proofswe-${packagePlatform(platform)}-${arch}`;
+}
+
+function devOverridesEnabled() {
+  return process.env.PROOFSWE_ENABLE_DEV_OVERRIDES === "1";
+}
+
+function currentPlatform() {
+  if (devOverridesEnabled() && process.env.PROOFSWE_TEST_PLATFORM) {
+    return process.env.PROOFSWE_TEST_PLATFORM;
+  }
+  return process.platform;
+}
+
+function currentArch() {
+  if (devOverridesEnabled() && process.env.PROOFSWE_TEST_ARCH) {
+    return process.env.PROOFSWE_TEST_ARCH;
+  }
+  return process.arch;
+}
+
+function packagePlatform(platform) {
+  return platform === "win32" ? "windows" : platform;
+}
+
+function binaryPath() {
+  if (devOverridesEnabled() && process.env.PROOFSWE_BINARY_PATH) {
+    return process.env.PROOFSWE_BINARY_PATH;
+  }
+
+  const platform = currentPlatform();
+  const suffix = platform === "win32" ? ".exe" : "";
+  const pkg = packageName();
+  if (devOverridesEnabled() && process.env.PROOFSWE_PACKAGE_ROOT) {
+    const candidate = path.join(process.env.PROOFSWE_PACKAGE_ROOT, "node_modules", pkg, "bin", `proofswe${suffix}`);
+    if (fs.existsSync(candidate)) {
+      return candidate;
+    }
+  }
+  try {
+    return require.resolve(`${pkg}/bin/proofswe${suffix}`);
+  } catch (err) {
+    const local = path.resolve(__dirname, "..", "..", "dist", `proofswe${suffix}`);
+    if (fs.existsSync(local)) {
+      return local;
+    }
+    throw new Error(
+      `could not find native proofswe binary package ${pkg}; reinstall with optional dependencies enabled`
+    );
+  }
+}
+
+let bin;
+try {
+  bin = binaryPath();
+} catch (err) {
+  console.error(`proofswe: ${err.message}`);
+  process.exit(1);
+}
+
+const result = spawnSync(bin, process.argv.slice(2), { stdio: "inherit" });
+if (result.error) {
+  console.error(`proofswe: ${result.error.message}`);
+  process.exit(1);
+}
+process.exit(result.status === null ? 1 : result.status);
package/package.json
ADDED
@@ -0,0 +1,33 @@
+{
+  "name": "proofswe",
+  "version": "0.1.3",
+  "description": "Benchmark coding agents from real developer sessions.",
+  "license": "MIT",
+  "repository": {
+    "type": "git",
+    "url": "https://github.com/Atharva-Kanherkar/proofswe"
+  },
+  "bin": {
+    "proofswe": "npm/bin/proofswe.js"
+  },
+  "files": [
+    "npm/bin/proofswe.js",
+    "README.md",
+    "LICENSE"
+  ],
+  "optionalDependencies": {
+    "proofswe-darwin-arm64": "0.1.3",
+    "proofswe-darwin-x64": "0.1.3",
+    "proofswe-linux-arm64": "0.1.3",
+    "proofswe-linux-x64": "0.1.3",
+    "proofswe-windows-arm64": "0.1.3",
+    "proofswe-windows-x64": "0.1.3"
+  },
+  "engines": {
+    "node": ">=18"
+  },
+  "scripts": {
+    "smoke": "node npm/bin/proofswe.js version",
+    "test": "node npm/test-wrapper.mjs"
+  }
+}
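Reviewer note: in this manifest every platform package is pinned to the exact version of the wrapper itself (`0.1.3`). A release check along the following lines could flag a pin that drifts from the wrapper version. This is only a sketch: the function name `mismatchedPins` is hypothetical, and the rule that pins must equal the wrapper version is an assumption inferred from this manifest, not a documented proofswe policy.

```javascript
// Hypothetical release check (not part of proofswe): list optional
// dependencies whose pinned version differs from the wrapper's own version.
function mismatchedPins(manifest) {
  const deps = manifest.optionalDependencies || {};
  return Object.entries(deps)
    .filter(([, version]) => version !== manifest.version)
    .map(([name]) => name);
}

// Abbreviated copy of the manifest above, used as sample input.
const manifest = {
  name: "proofswe",
  version: "0.1.3",
  optionalDependencies: {
    "proofswe-darwin-arm64": "0.1.3",
    "proofswe-linux-x64": "0.1.3",
    "proofswe-windows-x64": "0.1.3",
  },
};

console.log(mismatchedPins(manifest));
```

For the manifest in this diff the list is empty, since all six pins match the wrapper version.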