open-research-protocol 0.4.7 → 0.4.9

Files changed (50)
  1. package/README.md +15 -0
  2. package/cli/orp.py +1158 -43
  3. package/docs/AGENT_LOOP.md +3 -0
  4. package/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md +125 -0
  5. package/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md +97 -0
  6. package/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md +100 -0
  7. package/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md +116 -0
  8. package/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md +86 -0
  9. package/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md +261 -0
  10. package/docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md +131 -0
  11. package/docs/ORP_REASONING_KERNEL_EVOLUTION.md +123 -0
  12. package/docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md +107 -0
  13. package/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md +140 -22
  14. package/docs/ORP_REASONING_KERNEL_V0_1.md +11 -0
  15. package/docs/ORP_YOUTUBE_INSPECT.md +97 -0
  16. package/docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json +796 -0
  17. package/docs/benchmarks/orp_reasoning_kernel_agent_replication_task_smoke.json +487 -0
  18. package/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_1.json +1927 -0
  19. package/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json +10217 -0
  20. package/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_task_smoke.json +174 -0
  21. package/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json +598 -0
  22. package/docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json +688 -0
  23. package/docs/benchmarks/orp_reasoning_kernel_continuation_task_smoke.json +150 -0
  24. package/docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json +448 -0
  25. package/docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json +594 -0
  26. package/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json +769 -41
  27. package/examples/README.md +2 -0
  28. package/examples/kernel/comparison/comparison-corpus.json +337 -0
  29. package/examples/kernel/comparison/next-task-continuation.json +55 -0
  30. package/examples/kernel/corpus/operations/habanero-routing.checkpoint.kernel.yml +12 -0
  31. package/examples/kernel/corpus/operations/runner-routing.policy.kernel.yml +9 -0
  32. package/examples/kernel/corpus/product/project-home.decision.kernel.yml +11 -0
  33. package/examples/kernel/corpus/research/kernel-handoff.experiment.kernel.yml +16 -0
  34. package/examples/kernel/corpus/research/lane-drift.hypothesis.kernel.yml +11 -0
  35. package/examples/kernel/corpus/software/trace-widget.task.kernel.yml +13 -0
  36. package/examples/kernel/corpus/writing/kernel-launch.result.kernel.yml +12 -0
  37. package/llms.txt +3 -0
  38. package/package.json +4 -1
  39. package/scripts/orp-kernel-agent-pilot.py +673 -0
  40. package/scripts/orp-kernel-agent-replication.py +307 -0
  41. package/scripts/orp-kernel-benchmark.py +471 -2
  42. package/scripts/orp-kernel-canonical-continuation.py +381 -0
  43. package/scripts/orp-kernel-ci-check.py +138 -0
  44. package/scripts/orp-kernel-comparison.py +592 -0
  45. package/scripts/orp-kernel-continuation-pilot.py +384 -0
  46. package/scripts/orp-kernel-pickup.py +401 -0
  47. package/spec/v1/kernel-extension.schema.json +96 -0
  48. package/spec/v1/kernel-proposal.schema.json +115 -0
  49. package/spec/v1/kernel.schema.json +2 -1
  50. package/spec/v1/youtube-source.schema.json +151 -0
@@ -21,6 +21,9 @@ Use this loop when an AI agent is the primary operator of an ORP-enabled repo.
21
21
  - or `orp pack fetch --source <git-url> --pack-id <pack-id> --install-target . --json`
22
22
  - If the workflow depends on public Erdos data, sync it first:
23
23
  - `orp erdos sync --problem-id <id> --out-problem-dir <dir> --json`
24
+ - If the task begins from a public YouTube link, normalize it first:
25
+ - `orp youtube inspect <youtube-url> --json`
26
+ - or `orp youtube inspect <youtube-url> --save --json` when the source artifact should stay with the repo
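
An agent harness that drives this step from Python could wrap the two command forms shown above. The sketch below is illustrative only: the helper names `build_inspect_argv` and `inspect_youtube` are hypothetical, and it assumes that `--json` writes a single JSON document to stdout, which the diff does not confirm.

```python
import json
import subprocess
from typing import Any, List


def build_inspect_argv(url: str, save: bool = False) -> List[str]:
    """Build the argv for `orp youtube inspect`, using only flags shown in the doc."""
    argv = ["orp", "youtube", "inspect", url]
    if save:
        # --save keeps the source artifact with the repo.
        argv.append("--save")
    argv.append("--json")
    return argv


def inspect_youtube(url: str, save: bool = False) -> Any:
    """Run the command and parse stdout as JSON (assumed output shape)."""
    completed = subprocess.run(
        build_inspect_argv(url, save),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Keeping argv construction separate from execution lets the command line be checked without `orp` installed.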
24
27
 
25
28
  ## 3. Run
26
29
 
@@ -0,0 +1,125 @@
1
+ # ORP Reasoning Kernel Agent Pilot
2
+
3
+ This document records the first live in-environment Codex pickup simulation for
4
+ the ORP Reasoning Kernel.
5
+
6
+ Supporting artifact:
7
+
8
+ - [docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json)
9
+
10
+ Supporting corpus and harness:
11
+
12
+ - [examples/kernel/comparison/comparison-corpus.json](/Volumes/Code_2TB/code/orp/examples/kernel/comparison/comparison-corpus.json)
13
+ - [scripts/orp-kernel-agent-pilot.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-agent-pilot.py)
14
+
15
+ ## What This Pilot Measures
16
+
17
+ This pilot asks a real fresh Codex session to act like a downstream handoff
18
+ consumer with no repo context.
19
+
20
+ For each matched case and condition, the agent sees only the artifact and must
21
+ recover the **full required field set** for that artifact class:
22
+
23
+ - `task`: `object`, `goal`, `boundary`, `constraints`, `success_criteria`
24
+ - `decision`: `question`, `chosen_path`, `rejected_alternatives`, `rationale`, `consequences`
25
+ - `hypothesis`: `claim`, `boundary`, `assumptions`, `test_path`, `falsifiers`
26
+ - `experiment`: `objective`, `method`, `inputs`, `outputs`, `evidence_expectations`, `interpretation_limits`
27
+ - `checkpoint`: `completed_unit`, `current_state`, `risks`, `next_handoff_target`, `artifact_refs`
28
+ - `policy`: `scope`, `rule`, `rationale`, `invariants`, `enforcement_surface`
29
+ - `result`: `claim`, `evidence_paths`, `status`, `interpretation_limits`, `next_follow_up`
30
+
31
+ The agent is instructed to use `null` unless a field is explicit enough to
32
+ carry forward into a canonical artifact without invention.
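
The per-class field sets and the `null` rule above can be sketched as a lookup table plus a recoverability score. This is a minimal illustration, not the harness in `scripts/orp-kernel-agent-pilot.py`. The names `REQUIRED_FIELDS` and `pickup_score` are hypothetical, and scoring as the fraction of required fields that come back non-null is an assumption.

```python
from typing import Any, Dict, List, Mapping, Optional

# Required field sets per artifact class, copied from the list above.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "task": ["object", "goal", "boundary", "constraints", "success_criteria"],
    "decision": ["question", "chosen_path", "rejected_alternatives",
                 "rationale", "consequences"],
    "hypothesis": ["claim", "boundary", "assumptions", "test_path", "falsifiers"],
    "experiment": ["objective", "method", "inputs", "outputs",
                   "evidence_expectations", "interpretation_limits"],
    "checkpoint": ["completed_unit", "current_state", "risks",
                   "next_handoff_target", "artifact_refs"],
    "policy": ["scope", "rule", "rationale", "invariants", "enforcement_surface"],
    "result": ["claim", "evidence_paths", "status",
               "interpretation_limits", "next_follow_up"],
}


def pickup_score(artifact_class: str,
                 recovered: Mapping[str, Optional[Any]]) -> float:
    """Fraction of required fields reported as non-null (assumed metric)."""
    required = REQUIRED_FIELDS[artifact_class]
    # A field counts only if the agent returned something other than None/null.
    present = sum(1 for name in required if recovered.get(name) is not None)
    return present / len(required)
```

Under this assumed metric, a `task` artifact from which the agent recovers every field except `boundary` scores 4 of 5.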
33
+
34
+ This is a stronger standard than the earlier deterministic pickup proxy because
35
+ it uses an actual fresh Codex session rather than only a local rubric.
36
+
37
+ ## What This Is And Is Not
38
+
39
+ This is a **live internal agent simulation**, not a human handoff study.
40
+
41
+ It does **not** prove:
42
+
43
+ - human pickup speed
44
+ - human clarification count
45
+ - downstream implementation quality
46
+ - cross-team or cross-model outcome superiority
47
+
48
+ It **does** show something narrower but important:
49
+
50
+ - how much of the kernel’s required structure remains explicitly recoverable to
51
+ a fresh downstream agent
52
+ - whether the kernel’s structural advantages survive contact with a real agent,
53
+ not just a deterministic local scorer
54
+
55
+ ## Current Result
56
+
57
+ On the matched `7`-case, `5`-domain live Codex corpus, the current report
58
+ shows:
59
+
60
+ - kernel mean pickup score: `1.000`
61
+ - generic checklist mean pickup score: `0.810`
62
+ - free-form mean pickup score: `0.695`
63
+
64
+ Pairwise result:
65
+
66
+ - kernel beats free-form on `7/7` cases
67
+ - kernel beats generic checklist on `4/7` cases and ties on `3/7`
68
+ - generic checklist beats free-form on average, with `4` wins, `2` ties, and `1` loss
69
+
70
+ Additional result:
71
+
72
+ - kernel keeps all required fields explicitly recoverable on the matched corpus
73
+ - kernel mean invention rate: `0.000`
74
+ - free-form and generic checklist both leave recoverability gaps for at least
75
+ some artifact classes
76
+
77
+ ## Why This Matters
78
+
79
+ This pilot matters because it is the first evidence layer that uses a real
80
+ fresh agent rather than only local deterministic scoring.
81
+
82
+ That gives us a stronger internal claim:
83
+
84
+ - the kernel’s structural advantage is not decorative
85
+ - the advantage remains visible to a downstream Codex session
86
+ - the benefit is strongest against free-form artifacts
87
+ - the generic checklist baseline is helpful, but it does not match the kernel’s
88
+ full recoverability
89
+
90
+ The generic-checklist result is especially useful because it is **not** a
91
+ strawman. It performs reasonably well and even ties the kernel on some cases.
92
+ That makes the kernel’s wins more credible.
93
+
94
+ ## Honest Caveat
95
+
96
+ This pilot still does not establish the kernel as a universally outcome-superior
97
+ methodology.
98
+
99
+ It is stronger than the earlier deterministic proxy, but it remains:
100
+
101
+ - internal
102
+ - model-specific
103
+ - artifact-recoverability-focused
104
+
105
+ The next real evidence bar is still:
106
+
107
+ - blinded human pickup studies
108
+ - downstream execution or review studies
109
+ - broader cross-model or cross-team replication
110
+
111
+ ## Bottom Line
112
+
113
+ The live agent pilot makes the kernel materially harder to dismiss.
114
+
115
+ We now have evidence across three levels:
116
+
117
+ - deterministic structural comparison
118
+ - deterministic pickup proxy
119
+ - live fresh-agent recoverability simulation
120
+
121
+ Together, those support a strong internal claim:
122
+
123
+ ORP kernel artifacts preserve more explicit canonical structure for downstream
124
+ agents than free-form artifacts, and more than a generic checklist on average,
125
+ while keeping full required-field recoverability on the matched corpus.
@@ -0,0 +1,97 @@
1
+ # ORP Reasoning Kernel Agent Replication
2
+
3
+ This document records the completed `10`-repeat full-corpus repeatability
4
+ pilot for the live ORP kernel agent evaluation.
5
+
6
+ Supporting artifact:
7
+
8
+ - [docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json)
9
+
10
+ Supporting harness:
11
+
12
+ - [scripts/orp-kernel-agent-replication.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-agent-replication.py)
13
+
14
+ The harness now supports:
15
+
16
+ - per-field stability tables
17
+ - progress reporting for long runs
18
+ - shard-and-merge execution for higher-repeat studies
19
+ - confidence-interval reporting for repeated live runs
20
+
21
+ ## What This Measures
22
+
23
+ The original live agent pilot showed that a fresh Codex session could recover
24
+ the kernel’s required fields more completely than the simpler alternatives.
25
+
26
+ This replication harness asks a different question:
27
+
28
+ - does that result survive across repeated fresh-agent runs?
29
+
30
+ The current completed pilot covers:
31
+
32
+ - the full matched `7`-case corpus
33
+ - `10` independent fresh Codex repetitions of that corpus
34
+
35
+ ## Current Result
36
+
37
+ Across the repeated full matched corpus:
38
+
39
+ - kernel mean pickup score: `1.000`
40
+ - generic checklist mean pickup score: `0.790`
41
+ - free-form mean pickup score: `0.718`
42
+
43
+ Stability result:
44
+
45
+ - kernel stayed above checklist on all `10/10` run-level repeats
46
+ - kernel stayed above free-form on all `10/10` run-level repeats
47
+ - kernel won `47/70` case-level comparisons against checklist and tied the other `23/70`
48
+ - kernel won `70/70` case-level comparisons against free-form
49
+ - kernel invention rate remained `0.000` across all `10` repeats
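
The case-level counts above come from flattening `10` repeats of `7` cases into `70` paired comparisons (for checklist, `47` wins plus `23` ties). A tally of that kind might look like the sketch below. The function name `tally_pairwise` and the tie tolerance are assumptions, not the replication harness's actual logic.

```python
from typing import Dict, Sequence


def tally_pairwise(scores_a: Sequence[float],
                   scores_b: Sequence[float],
                   tol: float = 1e-9) -> Dict[str, int]:
    """Count wins, ties, and losses of condition A against condition B.

    Each position is one (repeat, case) pair scored under both conditions.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired one-to-one")
    counts = {"wins": 0, "ties": 0, "losses": 0}
    for a, b in zip(scores_a, scores_b):
        if abs(a - b) <= tol:
            counts["ties"] += 1  # scores equal within tolerance
        elif a > b:
            counts["wins"] += 1
        else:
            counts["losses"] += 1
    return counts
```

The three counts always sum to the number of pairs, which gives a quick consistency check on any reported split.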
50
+
51
+ Confidence and stability result:
52
+
53
+ - kernel pickup CI95 half-width: `0.000`
54
+ - generic checklist pickup CI95 half-width: `0.023`
55
+ - free-form pickup CI95 half-width: `0.007`
56
+ - kernel per-field stability gap stayed at `0.000` for every tracked required field
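
For readers unfamiliar with the term, a CI95 half-width over repeated runs can be computed as in the sketch below. It uses a normal approximation (`1.96 * s / sqrt(n)`) over run-level means, which is an assumption: the harness may use a t-distribution, a bootstrap, or case-level samples instead. The function name is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_ci95_half_width(run_means: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, CI95 half-width) using a normal approximation."""
    n = len(run_means)
    mean = statistics.fmean(run_means)
    if n < 2:
        # A single run carries no spread information.
        return mean, 0.0
    stdev = statistics.stdev(run_means)  # sample standard deviation
    return mean, 1.96 * stdev / math.sqrt(n)
```

A condition that scores identically on every repeat has zero sample variance, so its half-width is exactly `0.0`, which matches the shape of the kernel row above.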
57
+
58
+ The per-field tables are especially useful because they show where the simpler
59
+ alternatives still degrade under repetition:
60
+
61
+ - checklist repeatedly misses `decision.question` and `decision.rationale`
62
+ - checklist repeatedly misses `experiment.inputs`, `experiment.outputs`, and `hypothesis.falsifiers`
63
+ - checklist still struggles with `policy.enforcement_surface` and `task.object`
64
+ - free-form most often drops `experiment.outputs`, `result.evidence_paths`, and sometimes `checkpoint.next_handoff_target`
65
+
66
+ ## Why This Matters
67
+
68
+ The replication pilot is now much stronger than a single-case smoke test because
69
+ it covers the full matched corpus, repeats it `10` times with fresh sessions,
70
+ and exposes field-level stability instead of only overall means.
71
+
72
+ That strengthens the evidence story in an agent-first way:
73
+
74
+ - the kernel advantage can survive fresh-session variation
75
+ - the kernel continues to look safe on invention
76
+ - the checklist baseline is helpful, but it still leaves repeatable structural
77
+ holes that the kernel does not
78
+
79
+ ## Honest Boundary
80
+
81
+ This is now a strong internal replication pilot, but it is still not a final
82
+ external replication study.
83
+
84
+ What remains:
85
+
86
+ - compare multiple agent models when practical
87
+ - run blinded human or mixed human/agent handoff replications
88
+ - extend the repeated continuation benchmark in the same style
89
+
90
+ ## Bottom Line
91
+
92
+ The replication pilot makes the live agent story much more credible:
93
+
94
+ the kernel’s recoverability advantage appears stable under repeated fresh-agent
95
+ execution across the full matched corpus, not only on a single representative
96
+ case, and its field-level stability now looks materially cleaner than both
97
+ free-form and checklist alternatives.
@@ -0,0 +1,100 @@
1
+ # ORP Reasoning Kernel Canonical Continuation Pilot
2
+
3
+ This document records the first full-corpus live canonical continuation pilot
4
+ for the ORP Reasoning Kernel.
5
+
6
+ Supporting artifact:
7
+
8
+ - [docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json)
9
+
10
+ Supporting harness:
11
+
12
+ - [scripts/orp-kernel-canonical-continuation.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-canonical-continuation.py)
13
+
14
+ ## What This Pilot Measures
15
+
16
+ The earlier continuation pilot asked whether a fresh downstream agent could
17
+ continue work safely in a general sense.
18
+
19
+ This harder pilot asks a more demanding question:
20
+
21
+ - can a fresh downstream agent turn the source artifact into the next canonical
22
+ kernel task artifact?
23
+
24
+ For each source artifact, the agent must produce a real task-shaped follow-on
25
+ object with:
26
+
27
+ - `object`
28
+ - `goal`
29
+ - `boundary`
30
+ - `constraints`
31
+ - `success_criteria`
32
+
33
+ The benchmark then scores:
34
+
35
+ - field alignment against the expected next canonical task
36
+ - unsupported invention
37
+ - whether missing structure is reported explicitly instead of silently filled
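
Two of the scored properties above, unsupported invention and silent filling, can be illustrated with a short sketch. It is not the scoring code in `scripts/orp-kernel-canonical-continuation.py`. The function names, the `supported` set (fields the source artifact explicitly grounds), and the `reported_missing` set are all hypothetical inputs chosen for illustration.

```python
from typing import Any, List, Mapping, Optional, Set

# The five task fields listed above.
TASK_FIELDS = ["object", "goal", "boundary", "constraints", "success_criteria"]


def invention_rate(produced: Mapping[str, Optional[Any]],
                   supported: Set[str]) -> float:
    """Fraction of filled task fields that the source does not support."""
    filled = [f for f in TASK_FIELDS if produced.get(f) is not None]
    if not filled:
        return 0.0
    invented = [f for f in filled if f not in supported]
    return len(invented) / len(filled)


def silently_filled(produced: Mapping[str, Optional[Any]],
                    supported: Set[str],
                    reported_missing: Set[str]) -> List[str]:
    """Unsupported fields that were filled without being flagged as missing."""
    return [
        f for f in TASK_FIELDS
        if f not in supported
        and produced.get(f) is not None
        and f not in reported_missing
    ]
```

In this framing, an agent that leaves an unsupported field as `null` and lists it as missing contributes nothing to either measure.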
38
+
39
+ ## Current Result
40
+
41
+ On the matched `7`-case, `5`-domain live canonical continuation corpus:
42
+
43
+ - kernel mean total score: `0.738`
44
+ - generic checklist mean total score: `0.663`
45
+ - free-form mean total score: `0.484`
46
+
47
+ Invention result:
48
+
49
+ - kernel mean invention rate: `0.386`
50
+ - generic checklist mean invention rate: `0.495`
51
+ - free-form mean invention rate: `0.748`
52
+
53
+ Pairwise result:
54
+
55
+ - kernel beat free-form on `7/7` cases
56
+ - kernel beat generic checklist on `4/7` cases
57
+ - kernel tied generic checklist on `1/7` cases
58
+ - kernel lost to generic checklist on `2/7` cases
59
+
60
+ ## Why This Matters
61
+
62
+ This pilot is stronger than the softer continuation benchmark because it asks
63
+ the downstream agent to produce a real canonical artifact rather than only a
64
+ safe next action.
65
+
66
+ The result is more discriminative:
67
+
68
+ - free-form falls off sharply once a real next artifact must be constructed
69
+ - checklist stays meaningfully competitive
70
+ - kernel still leads on aggregate score and invention control
71
+
72
+ That makes the evidence more believable, not less. The kernel is not winning
73
+ against a straw baseline here.
74
+
75
+ ## Honest Boundary
76
+
77
+ The canonical continuation pilot does not prove universal methodological
78
+ superiority.
79
+
80
+ It does show a narrower and important result:
81
+
82
+ - kernel structure gives a fresh downstream agent a stronger base for producing
83
+ the next canonical task artifact than free-form notes
84
+ - checklist can still be strong on some cases, so the kernel advantage is real
85
+ but not absolute
86
+
87
+ What remains:
88
+
89
+ - repeat the canonical continuation benchmark across multiple fresh runs
90
+ - add per-field stability summaries for canonical continuation itself
91
+ - compare multiple models when practical
92
+
93
+ ## Bottom Line
94
+
95
+ The harder continuation benchmark makes the kernel evidence more grounded:
96
+
97
+ the kernel still wins on average, stays safest on invention, and clearly beats
98
+ free-form when the downstream task is “produce the next canonical artifact,”
99
+ while revealing that a strong checklist baseline remains competitive enough to
100
+ matter on some cases.
@@ -0,0 +1,116 @@
1
+ # ORP Reasoning Kernel Comparison Pilot
2
+
3
+ This document records the first in-repo side-by-side comparison between three
4
+ artifact styles:
5
+
6
+ 1. free-form artifact writing
7
+ 2. generic checklist artifact writing
8
+ 3. ORP typed kernel artifact writing
9
+
10
+ Supporting artifact:
11
+
12
+ - [docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json)
13
+
14
+ Supporting corpus and harness:
15
+
16
+ - [examples/kernel/comparison/comparison-corpus.json](/Volumes/Code_2TB/code/orp/examples/kernel/comparison/comparison-corpus.json)
17
+ - [scripts/orp-kernel-comparison.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-comparison.py)
18
+
19
+ ## What This Pilot Measures
20
+
21
+ This pilot is intentionally narrow.
22
+
23
+ It does **not** measure downstream execution quality, human review outcomes, or
24
+ team handoff performance directly.
25
+
26
+ It **does** measure structural clarity on a matched internal corpus using a
27
+ deterministic rubric.
28
+
29
+ Each condition is scored on:
30
+
31
+ - artifact type clarity
32
+ - objective clarity
33
+ - limits clarity
34
+ - evaluation clarity
35
+ - handoff readiness
36
+ - class-specific completeness
37
+
38
+ The corpus spans:
39
+
40
+ - `7` cases
41
+ - `5` domains
42
+ - all `7` v0.1 kernel artifact classes
43
+
44
+ ## Why This Comparison Matters
45
+
46
+ The current technical validation package proves that the kernel works.
47
+
48
+ What it does not prove by itself is whether the kernel offers more usable
49
+ structure than simpler alternatives on the same prompts.
50
+
51
+ This pilot addresses that specific gap by comparing matched artifacts instead
52
+ of evaluating the kernel in isolation.
53
+
54
+ ## Current Result
55
+
56
+ On the matched internal corpus, the current report shows:
57
+
58
+ - kernel mean total score: `1.000`
59
+ - generic checklist mean total score: `0.687`
60
+ - free-form mean total score: `0.275`
61
+
62
+ Pairwise result:
63
+
64
+ - kernel beats generic checklist on all `7/7` cases
65
+ - kernel beats free-form on all `7/7` cases
66
+ - generic checklist beats free-form on all `7/7` cases
67
+
68
+ Additional result:
69
+
70
+ - kernel class-specific completeness mean: `1.000`
71
+ - generic checklist class-specific completeness mean: `0.657`
72
+ - free-form class-specific completeness mean: `0.328`
73
+
74
+ ## What This Supports
75
+
76
+ This pilot supports a **narrow but real** claim:
77
+
78
+ On a matched internal comparison corpus, ORP kernel artifacts preserve more
79
+ explicit structural coverage than both free-form artifacts and a generic
80
+ checklist alternative.
81
+
82
+ That is stronger than a pure rationale-only claim.
83
+
84
+ ## What This Does Not Yet Support
85
+
86
+ This pilot does **not** prove that the kernel:
87
+
88
+ - improves downstream implementation success
89
+ - improves human pickup speed in live handoffs
90
+ - reduces rework in actual projects
91
+ - is universally superior across all teams or domains
92
+
93
+ Those still require the larger studies in
94
+ [docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md).
95
+
96
+ ## Why The Scoring Is Structured This Way
97
+
98
+ The scoring deliberately rewards explicit structure rather than latent meaning
99
+ buried in prose.
100
+
101
+ That reflects the actual ORP goal:
102
+
103
+ - humans can stay loose at the boundary
104
+ - but promotable repository artifacts should remain structurally legible
105
+
106
+ So this pilot is not a writing-style contest. It is a test of how much
107
+ canonically useful structure survives into the artifact itself.
108
+
109
+ ## Bottom Line
110
+
111
+ This comparison pilot does not establish the kernel as a universally outcome-superior
112
+ methodology.
113
+
114
+ It does provide the first comparative evidence that the kernel is not only
115
+ valid in isolation, but also materially stronger as a structural artifact
116
+ surface than the simpler alternatives tested here.
@@ -0,0 +1,86 @@
1
+ # ORP Reasoning Kernel Continuation Pilot
2
+
3
+ This document records the first full-corpus live continuation pilot for the ORP
4
+ Reasoning Kernel.
5
+
6
+ Supporting artifact:
7
+
8
+ - [docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json)
9
+
10
+ Supporting harness:
11
+
12
+ - [scripts/orp-kernel-continuation-pilot.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-continuation-pilot.py)
13
+
14
+ Related harder benchmark:
15
+
16
+ - [docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md)
17
+
18
+ ## What This Pilot Measures
19
+
20
+ The earlier live agent pilot measured recoverability: can a fresh downstream
21
+ agent reconstruct the required kernel fields?
22
+
23
+ This continuation pilot asks a slightly different question:
24
+
25
+ - can a fresh downstream agent continue the work safely?
26
+
27
+ For each artifact, the agent must:
28
+
29
+ - propose a recommended next action
30
+ - identify handoff-critical fields to carry forward
31
+ - surface what is explicitly missing instead of inventing it
32
+
33
+ The scoring emphasizes:
34
+
35
+ - carry-forward coverage
36
+ - invention rate
37
+ - whether a concrete next action is present
38
+
39
+ This remains the softer downstream benchmark. The harder follow-on test is the
40
+ canonical continuation pilot, where the downstream agent must produce the next
41
+ full kernel task artifact rather than only a safe continuation.
42
+
43
+ ## Current Result
44
+
45
+ On the matched `7`-case, `5`-domain live continuation corpus:
46
+
47
+ - kernel continuation score: `1.000`
48
+ - generic checklist continuation score: `0.984`
49
+ - free-form continuation score: `0.968`
50
+
51
+ Invention result:
52
+
53
+ - kernel invention rate: `0.000`
54
+ - generic checklist invention rate: `0.000`
55
+ - free-form invention rate: `0.048`
56
+
57
+ ## Why This Matters
58
+
59
+ This pilot pushes the evaluation one step closer to real downstream agent work.
60
+
61
+ The continuation corpus suggests:
62
+
63
+ - the kernel supports safer continuation than free-form artifacts (invention rate `0.000` vs `0.048`)
64
+ - the kernel slightly exceeds the generic checklist on continuation score at
65
+ corpus level, while never doing worse on any matched case
66
+ - the kernel preserves a safety advantage by keeping invention at `0.000`
67
+
68
+ ## Honest Boundary
69
+
70
+ This is now a real full-corpus pilot, but it is still not a full external
71
+ continuation study.
72
+
73
+ What remains:
74
+
75
+ - replicate across more fresh sessions
76
+ - compare additional agent models when practical
77
+ - extend from artifact continuation to fuller downstream execution quality
78
+
79
+ ## Bottom Line
80
+
81
+ The continuation pilot strengthens the kernel in a well-rounded way:
82
+
83
+ it suggests the kernel is not only easier to recover, but also a strong and
84
+ safe surface for downstream agent continuation across the matched corpus, while
85
+ showing that a generic checklist remains competitive enough to be a meaningful
86
+ baseline rather than a strawman.