proofswe 0.1.3

package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 proofswe contributors
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,116 @@
+ # proofswe
+
+ **SWE-bench tells you how models score in the lab. This is proof from real work.**
+
+ proofswe is an open benchmark for coding agents, built from real developer
+ sessions instead of synthetic tasks. The question it answers is the one no
+ existing benchmark can:
+
+ > Oracle benchmarks measure whether a model can close tasks that *have answers*.
+ > proofswe measures whether a model's work *survives* in tasks that don't.
+
+ Status: **design exploration.** No collector, no binary yet. We are nailing the
+ hard parts on paper before writing a line of code:
+
+ - [`docs/METHODOLOGY.md`](docs/METHODOLOGY.md) — how raw real-world sessions
+ become a statistically meaningful benchmark (survival analysis, IRT,
+ hierarchical Bradley–Terry, learning the metric weights from the OSS merge
+ oracle). This is what decides whether proofswe is a benchmark or just telemetry
+ with a leaderboard attached.
+ - [`docs/CAPTURE.md`](docs/CAPTURE.md) — the data-capture pipeline architecture:
+ a Go binary that captures Claude Code / Codex / (later) Cursor sessions
+ through a harness-agnostic narrow waist, backed by how the best Go CLIs
+ (gh, hugo, fzf, GoReleaser, go-git) actually build this.
+
+ ## Why this category is empty
+
+ Every serious coding benchmark today is **oracle-based**: it needs a
+ machine-checkable definition of success *before* the task runs.
+
+ | Benchmark | The question it actually answers |
+ |---|---|
+ | SWE-bench | Can the model close historical issues that happen to have test oracles? |
+ | Terminal-Bench / Aider | Can the model complete synthetic tasks with verifiable end states? |
+ | Arena-style (Copilot Arena, LMArena) | Which output do strangers prefer at a glance? |
+ | **proofswe** | **Whose work do informed owners keep, in tasks that have no oracle?** |
+
+ The first three are *pre-task* benchmarks. Success must be definable before work
+ starts, which structurally forbids ambiguity. But ambiguity is the defining
+ property of real software engineering: "refactor this," "build me a dashboard,"
+ "make it faster," "figure out why staging is weird." None of those can ever
+ enter an oracle benchmark, no matter how good the benchmark gets.
+
+ proofswe is a *post-task* benchmark. Success is defined by what happened
+ *afterward*: did the person who owns the codebase and knows the real
+ requirements **keep** the work, **commit** it, **build on top of it** — or
+ revert it. That is the most informed judgment available for an unverifiable
+ task, and it is expressed as a costly action rather than a cheap rating.
+
+ ## Three properties no existing benchmark has at once
+
+ - **Contamination-proof by construction.** Every datapoint is a fresh task in a
+ private codebase that did not exist at training time.
+ - **Ungameable in the usual way.** A lab cannot fine-tune against a metric that
+ lives downstream of thousands of uncontrolled real environments. (One real
+ attack vector exists — astroturfing the public pool — and the methodology
+ addresses it head-on.)
+ - **The real task distribution.** Including the boring, ambiguous, underspecified
+ majority of work that defines what "good at software engineering" means.
+
+ ## The collection ladder
+
+ Ordered by how much the data is worth to research:
+
+ 1. **Observational outcomes** — privacy-safe, line-hash telemetry. Scales widest,
+ weakest for causal claims.
+ 2. **Paired replay** — the same real task attempted by two models, judged by the
+ same informed user. One paired comparison is worth 10–100 observational ones.
+ The feature users want most is also the data researchers need most.
+ 3. **Open-source transcripts** — when the repo is public, the privacy cost of
+ donating the full transcript collapses, and PR merge status becomes an
+ external oracle. This is the successor dataset to SWE-bench.
+
+ The leaderboard is marketing for the dataset. The dataset — real multi-turn
+ trajectories with survival-based reward signals, openly licensed and auditable —
+ is the research contribution the labs sit on privately and nobody else has.
+
+ ## Current Pipeline
+
+ Install the CLI through npm:
+
+ ```sh
+ npx proofswe version
+ ```
+
+ Submit a real Codex or Claude Code transcript for server-side judging:
+
+ ```sh
+ proofswe submit
+ ```
+
+ `submit` auto-detects the latest supported Codex or Claude Code transcript for
+ the current git repo, builds the same scrubbed reproducible task as
+ `contribute`, sends it to the proofswe API, and prints the server scorecard. The
+ contributor does not need an OpenAI or Anthropic key; the official judge runs on
+ the proofswe server. Use `proofswe submit <path>` for an explicit transcript,
+ `--no-wait` for automation, and `PROOFSWE_API_URL` or `--endpoint` to point at a
+ staging server.
+
+ Agent chat helpers are explicit opt-in:
+
+ - Codex: `/prompts:benchmark` plus `$proofswe-benchmark`
+ - Claude Code: `$proofswe-benchmark`
+
+ ```sh
+ proofswe agent install
+ ```
+
+ Run the staging judge endpoint with server-side credentials:
+
+ ```sh
+ OPENAI_API_KEY=... proofswe serve --addr=:8080 --judge-provider=openai
+ ```
+
+ ## License
+
+ MIT. See [`LICENSE`](LICENSE).
package/npm/bin/proofswe.js ADDED
@@ -0,0 +1,87 @@
+ #!/usr/bin/env node
+
+ const { spawnSync } = require("node:child_process");
+ const fs = require("node:fs");
+ const path = require("node:path");
+
+ function packageName() {
+   const platform = currentPlatform();
+   const arch = currentArch();
+   const supported = new Set([
+     "darwin:arm64",
+     "darwin:x64",
+     "linux:arm64",
+     "linux:x64",
+     "win32:arm64",
+     "win32:x64",
+   ]);
+   const key = `${platform}:${arch}`;
+   if (!supported.has(key)) {
+     throw new Error(`unsupported platform ${platform}/${arch}`);
+   }
+   return `proofswe-${packagePlatform(platform)}-${arch}`;
+ }
+
+ function devOverridesEnabled() {
+   return process.env.PROOFSWE_ENABLE_DEV_OVERRIDES === "1";
+ }
+
+ function currentPlatform() {
+   if (devOverridesEnabled() && process.env.PROOFSWE_TEST_PLATFORM) {
+     return process.env.PROOFSWE_TEST_PLATFORM;
+   }
+   return process.platform;
+ }
+
+ function currentArch() {
+   if (devOverridesEnabled() && process.env.PROOFSWE_TEST_ARCH) {
+     return process.env.PROOFSWE_TEST_ARCH;
+   }
+   return process.arch;
+ }
+
+ function packagePlatform(platform) {
+   return platform === "win32" ? "windows" : platform;
+ }
+
+ function binaryPath() {
+   if (devOverridesEnabled() && process.env.PROOFSWE_BINARY_PATH) {
+     return process.env.PROOFSWE_BINARY_PATH;
+   }
+
+   const platform = currentPlatform();
+   const suffix = platform === "win32" ? ".exe" : "";
+   const pkg = packageName();
+   if (devOverridesEnabled() && process.env.PROOFSWE_PACKAGE_ROOT) {
+     const candidate = path.join(process.env.PROOFSWE_PACKAGE_ROOT, "node_modules", pkg, "bin", `proofswe${suffix}`);
+     if (fs.existsSync(candidate)) {
+       return candidate;
+     }
+   }
+   try {
+     return require.resolve(`${pkg}/bin/proofswe${suffix}`);
+   } catch (err) {
+     const local = path.resolve(__dirname, "..", "..", "dist", `proofswe${suffix}`);
+     if (fs.existsSync(local)) {
+       return local;
+     }
+     throw new Error(
+       `could not find native proofswe binary package ${pkg}; reinstall with optional dependencies enabled`
+     );
+   }
+ }
+
+ let bin;
+ try {
+   bin = binaryPath();
+ } catch (err) {
+   console.error(`proofswe: ${err.message}`);
+   process.exit(1);
+ }
+
+ const result = spawnSync(bin, process.argv.slice(2), { stdio: "inherit" });
+ if (result.error) {
+   console.error(`proofswe: ${result.error.message}`);
+   process.exit(1);
+ }
+ process.exit(result.status === null ? 1 : result.status);
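Reviewer sketch (not part of the package): the wrapper's platform-to-package mapping can be checked in isolation by writing it as a pure function. The block below mirrors `packageName()` and `packagePlatform()` from the file above. The name `resolvePackageName` and its explicit `platform` and `arch` parameters are hypothetical, introduced only for illustration; the shipped wrapper reads `process.platform` and `process.arch` (or the dev-override environment variables) instead.

```javascript
// Hypothetical helper: same mapping as packageName() in the wrapper,
// but with platform and arch passed in so it can be exercised directly.
const SUPPORTED = new Set([
  "darwin:arm64",
  "darwin:x64",
  "linux:arm64",
  "linux:x64",
  "win32:arm64",
  "win32:x64",
]);

function resolvePackageName(platform, arch) {
  if (!SUPPORTED.has(`${platform}:${arch}`)) {
    throw new Error(`unsupported platform ${platform}/${arch}`);
  }
  // Node reports Windows as "win32"; the published packages use "windows".
  const name = platform === "win32" ? "windows" : platform;
  return `proofswe-${name}-${arch}`;
}

console.log(resolvePackageName("linux", "x64")); // proofswe-linux-x64
console.log(resolvePackageName("win32", "arm64")); // proofswe-windows-arm64
```

Any pair outside the six supported keys, for example `freebsd`/`x64`, throws the same `unsupported platform` error the wrapper prints before exiting with status 1.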
package/package.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "name": "proofswe",
+   "version": "0.1.3",
+   "description": "Benchmark coding agents from real developer sessions.",
+   "license": "MIT",
+   "repository": {
+     "type": "git",
+     "url": "https://github.com/Atharva-Kanherkar/proofswe"
+   },
+   "bin": {
+     "proofswe": "npm/bin/proofswe.js"
+   },
+   "files": [
+     "npm/bin/proofswe.js",
+     "README.md",
+     "LICENSE"
+   ],
+   "optionalDependencies": {
+     "proofswe-darwin-arm64": "0.1.3",
+     "proofswe-darwin-x64": "0.1.3",
+     "proofswe-linux-arm64": "0.1.3",
+     "proofswe-linux-x64": "0.1.3",
+     "proofswe-windows-arm64": "0.1.3",
+     "proofswe-windows-x64": "0.1.3"
+   },
+   "engines": {
+     "node": ">=18"
+   },
+   "scripts": {
+     "smoke": "node npm/bin/proofswe.js version",
+     "test": "node npm/test-wrapper.mjs"
+   }
+ }
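Reviewer sketch (not part of the package): the wrapper only finds a binary if the manifest lists an optional dependency for each platform it accepts. A small consistency check could look like the block below. The function name `missingBinaryPackages` is hypothetical, and the rule that each pin must equal the manifest's own `version` is an assumption drawn from the 0.1.3 pins in this manifest, not a documented policy.

```javascript
// Hypothetical consistency check, not part of the published package.
// The expected names are the six packages the wrapper can resolve.
const EXPECTED = [
  "proofswe-darwin-arm64",
  "proofswe-darwin-x64",
  "proofswe-linux-arm64",
  "proofswe-linux-x64",
  "proofswe-windows-arm64",
  "proofswe-windows-x64",
];

// Returns the expected packages that are absent or pinned to a version
// other than the manifest's own (assumed lockstep versioning).
function missingBinaryPackages(manifest) {
  const optional = manifest.optionalDependencies || {};
  return EXPECTED.filter((name) => optional[name] !== manifest.version);
}

// Inline copy of the relevant fields from the manifest above.
const manifest = {
  version: "0.1.3",
  optionalDependencies: {
    "proofswe-darwin-arm64": "0.1.3",
    "proofswe-darwin-x64": "0.1.3",
    "proofswe-linux-arm64": "0.1.3",
    "proofswe-linux-x64": "0.1.3",
    "proofswe-windows-arm64": "0.1.3",
    "proofswe-windows-x64": "0.1.3",
  },
};

console.log(missingBinaryPackages(manifest)); // []
```

An empty result means every platform the wrapper accepts has a matching pinned binary package in this manifest.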