open-research-protocol 0.4.7 → 0.4.9
- package/README.md +15 -0
- package/cli/orp.py +1158 -43
- package/docs/AGENT_LOOP.md +3 -0
- package/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md +125 -0
- package/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md +97 -0
- package/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md +100 -0
- package/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md +116 -0
- package/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md +86 -0
- package/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md +261 -0
- package/docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md +131 -0
- package/docs/ORP_REASONING_KERNEL_EVOLUTION.md +123 -0
- package/docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md +107 -0
- package/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md +140 -22
- package/docs/ORP_REASONING_KERNEL_V0_1.md +11 -0
- package/docs/ORP_YOUTUBE_INSPECT.md +97 -0
- package/docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json +796 -0
- package/docs/benchmarks/orp_reasoning_kernel_agent_replication_task_smoke.json +487 -0
- package/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_1.json +1927 -0
- package/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json +10217 -0
- package/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_task_smoke.json +174 -0
- package/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json +598 -0
- package/docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json +688 -0
- package/docs/benchmarks/orp_reasoning_kernel_continuation_task_smoke.json +150 -0
- package/docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json +448 -0
- package/docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json +594 -0
- package/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json +769 -41
- package/examples/README.md +2 -0
- package/examples/kernel/comparison/comparison-corpus.json +337 -0
- package/examples/kernel/comparison/next-task-continuation.json +55 -0
- package/examples/kernel/corpus/operations/habanero-routing.checkpoint.kernel.yml +12 -0
- package/examples/kernel/corpus/operations/runner-routing.policy.kernel.yml +9 -0
- package/examples/kernel/corpus/product/project-home.decision.kernel.yml +11 -0
- package/examples/kernel/corpus/research/kernel-handoff.experiment.kernel.yml +16 -0
- package/examples/kernel/corpus/research/lane-drift.hypothesis.kernel.yml +11 -0
- package/examples/kernel/corpus/software/trace-widget.task.kernel.yml +13 -0
- package/examples/kernel/corpus/writing/kernel-launch.result.kernel.yml +12 -0
- package/llms.txt +3 -0
- package/package.json +4 -1
- package/scripts/orp-kernel-agent-pilot.py +673 -0
- package/scripts/orp-kernel-agent-replication.py +307 -0
- package/scripts/orp-kernel-benchmark.py +471 -2
- package/scripts/orp-kernel-canonical-continuation.py +381 -0
- package/scripts/orp-kernel-ci-check.py +138 -0
- package/scripts/orp-kernel-comparison.py +592 -0
- package/scripts/orp-kernel-continuation-pilot.py +384 -0
- package/scripts/orp-kernel-pickup.py +401 -0
- package/spec/v1/kernel-extension.schema.json +96 -0
- package/spec/v1/kernel-proposal.schema.json +115 -0
- package/spec/v1/kernel.schema.json +2 -1
- package/spec/v1/youtube-source.schema.json +151 -0
package/docs/AGENT_LOOP.md
CHANGED
@@ -21,6 +21,9 @@ Use this loop when an AI agent is the primary operator of an ORP-enabled repo.
 - or `orp pack fetch --source <git-url> --pack-id <pack-id> --install-target . --json`
 - If the workflow depends on public Erdos data, sync it first:
 - `orp erdos sync --problem-id <id> --out-problem-dir <dir> --json`
+- If the task begins from a public YouTube link, normalize it first:
+- `orp youtube inspect <youtube-url> --json`
+- or `orp youtube inspect <youtube-url> --save --json` when the source artifact should stay with the repo
 
 ## 3. Run
 
package/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md
@@ -0,0 +1,125 @@
+# ORP Reasoning Kernel Agent Pilot
+
+This document records the first live in-environment Codex pickup simulation for
+the ORP Reasoning Kernel.
+
+Supporting artifact:
+
+- [docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json)
+
+Supporting corpus and harness:
+
+- [examples/kernel/comparison/comparison-corpus.json](/Volumes/Code_2TB/code/orp/examples/kernel/comparison/comparison-corpus.json)
+- [scripts/orp-kernel-agent-pilot.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-agent-pilot.py)
+
+## What This Pilot Measures
+
+This pilot asks a real fresh Codex session to act like a downstream handoff
+consumer with no repo context.
+
+For each matched case and condition, the agent sees only the artifact and must
+recover the **full required field set** for that artifact class:
+
+- `task`: `object`, `goal`, `boundary`, `constraints`, `success_criteria`
+- `decision`: `question`, `chosen_path`, `rejected_alternatives`, `rationale`, `consequences`
+- `hypothesis`: `claim`, `boundary`, `assumptions`, `test_path`, `falsifiers`
+- `experiment`: `objective`, `method`, `inputs`, `outputs`, `evidence_expectations`, `interpretation_limits`
+- `checkpoint`: `completed_unit`, `current_state`, `risks`, `next_handoff_target`, `artifact_refs`
+- `policy`: `scope`, `rule`, `rationale`, `invariants`, `enforcement_surface`
+- `result`: `claim`, `evidence_paths`, `status`, `interpretation_limits`, `next_follow_up`
+
+The agent is instructed to use `null` unless a field is explicit enough to
+carry forward into a canonical artifact without invention.
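As an illustration of the recovery rule above, the following Python sketch scores one agent response against the required field set for its artifact class. The field names are copied from the list above. The `pickup_score` function, the example response, and the choice to score as the fraction of required fields returned non-null are assumptions made for illustration; the diff does not show how `scripts/orp-kernel-agent-pilot.py` actually computes its pickup score.

```python
from typing import Any, Mapping, Optional

# Required fields per artifact class, copied from the list above.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "task": ("object", "goal", "boundary", "constraints", "success_criteria"),
    "decision": ("question", "chosen_path", "rejected_alternatives", "rationale", "consequences"),
    "hypothesis": ("claim", "boundary", "assumptions", "test_path", "falsifiers"),
    "experiment": ("objective", "method", "inputs", "outputs", "evidence_expectations", "interpretation_limits"),
    "checkpoint": ("completed_unit", "current_state", "risks", "next_handoff_target", "artifact_refs"),
    "policy": ("scope", "rule", "rationale", "invariants", "enforcement_surface"),
    "result": ("claim", "evidence_paths", "status", "interpretation_limits", "next_follow_up"),
}


def pickup_score(artifact_class: str, recovered: Mapping[str, Optional[Any]]) -> float:
    """Fraction of required fields recovered as non-null (assumed metric)."""
    required = REQUIRED_FIELDS[artifact_class]
    # A missing key and an explicit None both count as "not recovered".
    hits = sum(1 for name in required if recovered.get(name) is not None)
    return hits / len(required)


# Hypothetical agent response for a `decision` artifact: the agent left two
# fields null because the source did not state them explicitly.
example_response = {
    "question": "Where should the project home live?",
    "chosen_path": "Keep it in the main repo",
    "rejected_alternatives": None,
    "rationale": "Fewer moving parts for contributors",
    "consequences": None,
}

print(pickup_score("decision", example_response))
```

Under this reading, an artifact that leaves every required field explicit scores `1.0`, and each field the agent has to null out lowers the score by one share of the class's field count.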
+
+This is a stronger standard than the earlier deterministic pickup proxy because
+it uses an actual fresh Codex session rather than a local rubric only.
+
+## What This Is And Is Not
+
+This is a **live internal agent simulation**, not a human handoff study.
+
+It does **not** prove:
+
+- human pickup speed
+- human clarification count
+- downstream implementation quality
+- cross-team or cross-model outcome superiority
+
+It **does** prove something narrower and important:
+
+- how much of the kernel’s required structure remains explicitly recoverable to
+a fresh downstream agent
+- whether the kernel’s structural advantages survive contact with a real agent,
+not just a deterministic local scorer
+
+## Current Result
+
+On the matched `7`-case, `5`-domain live Codex corpus, the current report
+shows:
+
+- kernel mean pickup score: `1.000`
+- generic checklist mean pickup score: `0.810`
+- free-form mean pickup score: `0.695`
+
+Pairwise result:
+
+- kernel beats free-form on `7/7` cases
+- kernel beats generic checklist on `4/7` cases and ties on `3/7`
+- generic checklist beats free-form on average, but with `1` loss and `2` ties
+
+Additional result:
+
+- kernel keeps all required fields explicitly recoverable on the matched corpus
+- kernel mean invention rate: `0.000`
+- free-form and generic checklist both leave recoverability gaps for at least
+some artifact classes
+
+## Why This Matters
+
+This pilot matters because it is the first evidence layer that uses a real
+fresh agent rather than only local deterministic scoring.
+
+That gives us a stronger internal claim:
+
+- the kernel’s structural advantage is not decorative
+- the advantage remains visible to a downstream Codex session
+- the benefit is strongest against free-form artifacts
+- the generic checklist baseline is helpful, but it does not match the kernel’s
+full recoverability
+
+The generic-checklist result is especially useful because it is **not** a
+strawman. It performs reasonably well and even ties the kernel on some cases.
+That makes the kernel’s wins more credible.
+
+## Honest Caveat
+
+This pilot still does not seal the kernel as a universally outcome-superior
+methodology.
+
+It is stronger than the earlier deterministic proxy, but it remains:
+
+- internal
+- model-specific
+- artifact-recoverability-focused
+
+The next real evidence bar is still:
+
+- blinded human pickup studies
+- downstream execution or review studies
+- broader cross-model or cross-team replication
+
+## Bottom Line
+
+The live agent pilot makes the kernel materially harder to dismiss.
+
+We now have evidence across three levels:
+
+- deterministic structural comparison
+- deterministic pickup proxy
+- live fresh-agent recoverability simulation
+
+Together, those support a strong internal claim:
+
+ORP kernel artifacts preserve more explicit canonical structure for downstream
+agents than free-form artifacts, and more than a generic checklist on average,
+while keeping full required-field recoverability on the matched corpus.
package/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md
@@ -0,0 +1,97 @@
+# ORP Reasoning Kernel Agent Replication
+
+This document records the completed `10`-repeat full-corpus repeatability
+pilot for the live ORP kernel agent evaluation.
+
+Supporting artifact:
+
+- [docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json)
+
+Supporting harness:
+
+- [scripts/orp-kernel-agent-replication.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-agent-replication.py)
+
+The harness now supports:
+
+- per-field stability tables
+- progress reporting for long runs
+- shard-and-merge execution for higher-repeat studies
+- confidence-interval reporting for repeated live runs
+
+## What This Measures
+
+The original live agent pilot showed that a fresh Codex session could recover
+the kernel’s required fields more completely than the simpler alternatives.
+
+This replication harness asks a different question:
+
+- does that result survive across repeated fresh-agent runs?
+
+The current completed pilot covers:
+
+- the full matched `7`-case corpus
+- `10` independent fresh Codex repetitions of that corpus
+
+## Current Result
+
+Across the repeated full matched corpus:
+
+- kernel mean pickup score: `1.000`
+- generic checklist mean pickup score: `0.790`
+- free-form mean pickup score: `0.718`
+
+Stability result:
+
+- kernel stayed above checklist on all `10/10` run-level repeats
+- kernel stayed above free-form on all `10/10` run-level repeats
+- kernel won `47/70` case-level comparisons against checklist and tied the other `23/70`
+- kernel won `70/70` case-level comparisons against free-form
+- kernel invention rate remained `0.000` across all `10` repeats
+
+Confidence and stability result:
+
+- kernel pickup CI95 half-width: `0.000`
+- generic checklist pickup CI95 half-width: `0.023`
+- free-form pickup CI95 half-width: `0.007`
+- kernel per-field stability gap stayed at `0.000` for every tracked required field
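The document does not say how the CI95 half-width is computed. One common choice, assumed here purely for illustration, is the normal approximation `1.96 * s / sqrt(n)` over the per-run mean pickup scores, where `s` is the sample standard deviation and `n` the number of repeats. The `ci95_half_width` helper and both sample lists below are hypothetical and are not taken from the benchmark JSON.

```python
import math
import statistics


def ci95_half_width(run_means: list[float]) -> float:
    """Normal-approximation 95% CI half-width over per-run means (assumed method)."""
    n = len(run_means)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    sample_sd = statistics.stdev(run_means)  # sample standard deviation (n - 1)
    return 1.96 * sample_sd / math.sqrt(n)


# Hypothetical per-run means for ten repeats; illustrative values only.
constant_runs = [1.0] * 10
varied_runs = [0.76, 0.80, 0.79, 0.81, 0.77, 0.82, 0.78, 0.80, 0.79, 0.78]

print(ci95_half_width(constant_runs))
print(round(ci95_half_width(varied_runs), 3))
```

Under this assumption a condition that scores identically on every repeat has zero sample variance, which is consistent with a reported half-width of `0.000`; any run-to-run variation produces a positive half-width.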
+
+The per-field tables are especially useful because they show where the simpler
+alternatives still degrade under repetition:
+
+- checklist repeatedly misses `decision.question` and `decision.rationale`
+- checklist repeatedly misses `experiment.inputs`, `experiment.outputs`, and `hypothesis.falsifiers`
+- checklist still struggles with `policy.enforcement_surface` and `task.object`
+- free-form most often drops `experiment.outputs`, `result.evidence_paths`, and sometimes `checkpoint.next_handoff_target`
+
+## Why This Matters
+
+The replication pilot is now much stronger than a single-case smoke, because
+it covers the full matched corpus, repeats it `10` times with fresh sessions,
+and exposes field-level stability instead of only overall means.
+
+That strengthens the evidence story in an agent-first way:
+
+- the kernel advantage can survive fresh-session variation
+- the kernel continues to look safe on invention
+- the checklist baseline is helpful, but it still leaves repeatable structural
+holes that the kernel does not
+
+## Honest Boundary
+
+This is now a strong internal replication pilot, but it is still not a final
+external replication study.
+
+What remains:
+
+- compare multiple agent models when practical
+- run blinded human or mixed human/agent handoff replications
+- extend the repeated continuation benchmark in the same style
+
+## Bottom Line
+
+The replication pilot makes the live agent story much more credible:
+
+the kernel’s recoverability advantage appears stable under repeated fresh-agent
+execution across the full matched corpus, not only on a single representative
+case, and its field-level stability now looks materially cleaner than both
+free-form and checklist alternatives.
package/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md
@@ -0,0 +1,100 @@
+# ORP Reasoning Kernel Canonical Continuation Pilot
+
+This document records the first full-corpus live canonical continuation pilot
+for the ORP Reasoning Kernel.
+
+Supporting artifact:
+
+- [docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json)
+
+Supporting harness:
+
+- [scripts/orp-kernel-canonical-continuation.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-canonical-continuation.py)
+
+## What This Pilot Measures
+
+The earlier continuation pilot asked whether a fresh downstream agent could
+continue work safely in a general sense.
+
+This harder pilot asks a more demanding question:
+
+- can a fresh downstream agent turn the source artifact into the next canonical
+kernel task artifact?
+
+For each source artifact, the agent must produce a real task-shaped follow-on
+object with:
+
+- `object`
+- `goal`
+- `boundary`
+- `constraints`
+- `success_criteria`
+
+The benchmark then scores:
+
+- field alignment against the expected next canonical task
+- unsupported invention
+- whether missing structure is reported explicitly instead of silently filled
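The three scoring criteria above can be sketched as a single comparison between the task the agent produced and the expected next task. Everything in this sketch is an assumption made for illustration: the `score_continuation` function, the convention that `None` in the expected task means "the source does not support this field", the case-insensitive exact match used for alignment, and the sample values. The real harness in `scripts/orp-kernel-canonical-continuation.py` may score quite differently.

```python
from typing import Mapping, Optional

# The five task fields listed above.
TASK_FIELDS = ("object", "goal", "boundary", "constraints", "success_criteria")


def score_continuation(
    produced: Mapping[str, Optional[str]],
    expected: Mapping[str, Optional[str]],
) -> dict[str, float]:
    """Score a produced task against the expected next task (assumed rubric)."""
    aligned = invented = reported_missing = 0
    for name in TASK_FIELDS:
        got, want = produced.get(name), expected.get(name)
        if got is None:
            if want is None:
                reported_missing += 1  # gap left explicit rather than filled
        elif want is None:
            invented += 1  # value supplied with no support in the source
        elif got.strip().lower() == want.strip().lower():
            aligned += 1
    filled = sum(1 for name in TASK_FIELDS if produced.get(name) is not None)
    return {
        "alignment": aligned / len(TASK_FIELDS),
        "invention_rate": invented / filled if filled else 0.0,
        "reported_missing": float(reported_missing),
    }


# Hypothetical expected and produced tasks; not drawn from the corpus.
expected_task = {
    "object": "trace widget",
    "goal": "ship the trace panel",
    "boundary": None,
    "constraints": None,
    "success_criteria": "panel renders a sample trace",
}
produced_task = {
    "object": "Trace widget",
    "goal": "Ship the trace panel",
    "boundary": None,                      # honestly reported as missing
    "constraints": "Must land by Friday",  # invented: not in the source
    "success_criteria": "panel renders a sample trace",
}

print(score_continuation(produced_task, expected_task))
```

Separating invention from alignment in this way lets an agent be rewarded for leaving a gap explicit instead of being penalized the same as for guessing.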
+
+## Current Result
+
+On the matched `7`-case, `5`-domain live canonical continuation corpus:
+
+- kernel mean total score: `0.738`
+- generic checklist mean total score: `0.663`
+- free-form mean total score: `0.484`
+
+Invention result:
+
+- kernel mean invention rate: `0.386`
+- generic checklist mean invention rate: `0.495`
+- free-form mean invention rate: `0.748`
+
+Pairwise result:
+
+- kernel beat free-form on `7/7` cases
+- kernel beat generic checklist on `4/7` cases
+- kernel tied generic checklist on `1/7` cases
+- kernel lost to generic checklist on `2/7` cases
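A pairwise result like the one above is a per-case win/tie/loss tally between two conditions. The sketch below shows one way to compute it. The `pairwise_tally` function, the tie tolerance, and both score lists are hypothetical; the scores were chosen only so that the tally has the same 4/1/2 shape as the kernel versus checklist line above, and they are not the per-case values from the benchmark JSON.

```python
def pairwise_tally(
    a_scores: list[float],
    b_scores: list[float],
    tol: float = 1e-9,
) -> tuple[int, int, int]:
    """Return (wins, ties, losses) for condition A against B, case by case."""
    if len(a_scores) != len(b_scores):
        raise ValueError("conditions must be scored on the same matched cases")
    wins = ties = losses = 0
    for a, b in zip(a_scores, b_scores):
        if abs(a - b) <= tol:
            ties += 1  # scores equal within tolerance
        elif a > b:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


# Hypothetical per-case totals for seven matched cases.
kernel_scores = [0.9, 0.8, 0.7, 0.8, 0.6, 0.5, 0.7]
checklist_scores = [0.7, 0.6, 0.5, 0.6, 0.6, 0.7, 0.9]

print(pairwise_tally(kernel_scores, checklist_scores))
```

Reporting the tally alongside the means matters because a condition can lead on average while still losing individual cases, as the checklist comparison above shows.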
+
+## Why This Matters
+
+This pilot is stronger than the softer continuation benchmark because it asks
+the downstream agent to produce a real canonical artifact rather than only a
+safe next action.
+
+The result is more discriminative:
+
+- free-form falls off sharply once a real next artifact must be constructed
+- checklist stays meaningfully competitive
+- kernel still leads on aggregate score and invention control
+
+That makes the evidence more believable, not less. The kernel is not winning
+against a straw baseline here.
+
+## Honest Boundary
+
+The canonical continuation pilot does not prove universal methodological
+superiority.
+
+It does show a narrower and important result:
+
+- kernel structure gives a fresh downstream agent a stronger base for producing
+the next canonical task artifact than free-form notes
+- checklist can still be strong on some cases, so the kernel advantage is real
+but not absolute
+
+What remains:
+
+- repeat the canonical continuation benchmark across multiple fresh runs
+- add per-field stability summaries for canonical continuation itself
+- compare multiple models when practical
+
+## Bottom Line
+
+The harder continuation benchmark makes the kernel evidence more grounded:
+
+the kernel still wins on average, stays safest on invention, and clearly beats
+free-form when the downstream task is “produce the next canonical artifact,”
+while revealing that a strong checklist baseline remains competitive enough to
+matter on some cases.
package/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md
@@ -0,0 +1,116 @@
+# ORP Reasoning Kernel Comparison Pilot
+
+This document records the first in-repo side-by-side comparison between three
+artifact styles:
+
+1. free-form artifact writing
+2. generic checklist artifact writing
+3. ORP typed kernel artifact writing
+
+Supporting artifact:
+
+- [docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json)
+
+Supporting corpus and harness:
+
+- [examples/kernel/comparison/comparison-corpus.json](/Volumes/Code_2TB/code/orp/examples/kernel/comparison/comparison-corpus.json)
+- [scripts/orp-kernel-comparison.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-comparison.py)
+
+## What This Pilot Measures
+
+This pilot is intentionally narrow.
+
+It does **not** measure downstream execution quality, human review outcomes, or
+team handoff performance directly.
+
+It **does** measure structural clarity on a matched internal corpus using a
+deterministic rubric.
+
+Each condition is scored on:
+
+- artifact type clarity
+- objective clarity
+- limits clarity
+- evaluation clarity
+- handoff readiness
+- class-specific completeness
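To make the six rubric dimensions above concrete, the sketch below combines per-dimension scores into a single total. The dimension keys mirror the list above, but the `total_score` function, the assumption that each dimension is scored in `[0, 1]`, the equal weighting, and the sample scores are all hypothetical; the diff does not show how `scripts/orp-kernel-comparison.py` weights or aggregates the dimensions.

```python
# Rubric dimensions from the list above, as snake_case keys (naming assumed).
RUBRIC_DIMENSIONS = (
    "artifact_type_clarity",
    "objective_clarity",
    "limits_clarity",
    "evaluation_clarity",
    "handoff_readiness",
    "class_specific_completeness",
)


def total_score(dimension_scores: dict[str, float]) -> float:
    """Unweighted mean of the six rubric dimensions (assumed aggregation)."""
    missing = [d for d in RUBRIC_DIMENSIONS if d not in dimension_scores]
    if missing:
        raise KeyError(f"missing rubric dimensions: {missing}")
    return sum(dimension_scores[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)


# Hypothetical scores for one artifact under two conditions.
fully_explicit = {d: 1.0 for d in RUBRIC_DIMENSIONS}
partly_explicit = {
    "artifact_type_clarity": 1.0,
    "objective_clarity": 1.0,
    "limits_clarity": 0.5,
    "evaluation_clarity": 0.5,
    "handoff_readiness": 0.5,
    "class_specific_completeness": 0.5,
}

print(total_score(fully_explicit))
print(round(total_score(partly_explicit), 3))
```

Because the rubric is deterministic, the same artifact always yields the same total, which is what distinguishes this pilot from the live agent runs.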
+
+The corpus spans:
+
+- `7` cases
+- `5` domains
+- all `7` v0.1 kernel artifact classes
+
+## Why This Comparison Matters
+
+The current technical validation package proves that the kernel works.
+
+What it did not yet prove by itself was whether the kernel offers more usable
+structure than simpler alternatives on the same prompts.
+
+This pilot addresses that specific gap by comparing matched artifacts instead
+of evaluating the kernel in isolation.
+
+## Current Result
+
+On the matched internal corpus, the current report shows:
+
+- kernel mean total score: `1.000`
+- generic checklist mean total score: `0.687`
+- free-form mean total score: `0.275`
+
+Pairwise result:
+
+- kernel beats generic checklist on all `7/7` cases
+- kernel beats free-form on all `7/7` cases
+- generic checklist beats free-form on all `7/7` cases
+
+Additional result:
+
+- kernel class-specific completeness mean: `1.000`
+- generic checklist class-specific completeness mean: `0.657`
+- free-form class-specific completeness mean: `0.328`
+
+## What This Supports
+
+This pilot supports a **narrow but real** claim:
+
+On a matched internal comparison corpus, ORP kernel artifacts preserve more
+explicit structural coverage than both free-form artifacts and a generic
+checklist alternative.
+
+That is stronger than a pure rationale-only claim.
+
+## What This Does Not Yet Support
+
+This pilot does **not** prove that the kernel:
+
+- improves downstream implementation success
+- improves human pickup speed in live handoffs
+- reduces rework in actual projects
+- is universally superior across all teams or domains
+
+Those still require the larger studies in
+[docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md).
+
+## Why The Scoring Is Structured This Way
+
+The scoring deliberately rewards explicit structure rather than latent meaning
+buried in prose.
+
+That reflects the actual ORP goal:
+
+- humans can stay loose at the boundary
+- but promotable repository artifacts should remain structurally legible
+
+So this pilot is not a writing-style contest. It is a test of how much
+canonically useful structure survives into the artifact itself.
+
+## Bottom Line
+
+This comparison pilot does not seal the kernel as a universally outcome-better
+methodology.
+
+It does provide the first comparative evidence that the kernel is not only
+valid in isolation, but also materially stronger as a structural artifact
+surface than the simpler alternatives tested here.
package/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md
@@ -0,0 +1,86 @@
+# ORP Reasoning Kernel Continuation Pilot
+
+This document records the first full-corpus live continuation pilot for the ORP
+Reasoning Kernel.
+
+Supporting artifact:
+
+- [docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json)
+
+Supporting harness:
+
+- [scripts/orp-kernel-continuation-pilot.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-continuation-pilot.py)
+
+Related harder benchmark:
+
+- [docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md)
+
+## What This Pilot Measures
+
+The earlier live agent pilot measured recoverability: can a fresh downstream
+agent reconstruct the required kernel fields?
+
+This continuation pilot asks a slightly different question:
+
+- can a fresh downstream agent continue the work safely?
+
+For each artifact, the agent must:
+
+- propose a recommended next action
+- identify handoff-critical fields to carry forward
+- surface what is explicitly missing instead of inventing it
+
+The scoring emphasizes:
+
+- carry-forward coverage
+- invention rate
+- whether a concrete next action is present
+
+This remains the softer downstream benchmark. The harder follow-on test is the
+canonical continuation pilot, where the downstream agent must produce the next
+full kernel task artifact rather than only a safe continuation.
+
+## Current Result
+
+On the matched `7`-case, `5`-domain live continuation corpus:
+
+- kernel continuation score: `1.000`
+- generic checklist continuation score: `0.984`
+- free-form continuation score: `0.968`
+
+Invention result:
+
+- kernel invention rate: `0.000`
+- generic checklist invention rate: `0.000`
+- free-form invention rate: `0.048`
+
+## Why This Matters
+
+This pilot pushes the evaluation one step closer to real downstream agent work.
+
+The continuation corpus suggests:
+
+- the kernel clearly supports safer continuation than free-form artifacts
+- the kernel slightly exceeds the generic checklist on continuation score at
+corpus level, while never doing worse on any matched case
+- the kernel preserves a safety advantage by keeping invention at `0.000`
+
+## Honest Boundary
+
+This is now a real full-corpus pilot, but it is still not a full external
+continuation study.
+
+What remains:
+
+- replicate across more fresh sessions
+- compare additional agent models when practical
+- extend from artifact continuation to fuller downstream execution quality
+
+## Bottom Line
+
+The continuation pilot strengthens the kernel in a well-rounded way:
+
+it suggests the kernel is not only easier to recover, but also a strong and
+safe surface for downstream agent continuation across the matched corpus, while
+showing that a generic checklist remains competitive enough to be a meaningful
+baseline rather than a strawman.