gyoshu 0.2.5 → 0.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +24 -18
- package/package.json +8 -15
- package/src/agent/baksa.md +310 -0
- package/src/agent/gyoshu.md +1075 -1
- package/src/agent/jogyo-feedback.md +1 -1
- package/src/agent/jogyo-insight.md +6 -6
- package/src/agent/jogyo.md +482 -2
- package/src/bridge/__pycache__/gyoshu_bridge.cpython-310.pyc +0 -0
- package/src/bridge/gyoshu_bridge.py +45 -7
- package/src/command/gyoshu-auto.md +63 -0
- package/src/gyoshu-manifest.json +59 -0
- package/src/index.ts +825 -0
- package/src/lib/atomic-write.ts +11 -9
- package/src/lib/auto-decision.ts +803 -0
- package/src/lib/auto-loop-state.ts +405 -0
- package/src/lib/bridge-meta.ts +111 -0
- package/src/lib/filesystem-check.ts +14 -7
- package/src/lib/goal-gates.ts +753 -0
- package/src/lib/lock-paths.ts +223 -0
- package/src/lib/notebook-frontmatter.ts +307 -40
- package/src/lib/parallel-queue.ts +704 -0
- package/src/lib/path-security.ts +108 -0
- package/src/lib/paths.ts +155 -8
- package/src/lib/pdf-export.ts +2 -1
- package/src/lib/report-gates.ts +722 -0
- package/src/lib/report-markdown.ts +7 -3
- package/src/lib/session-lock.ts +33 -11
- package/src/plugin/gyoshu-hooks.ts +533 -25
- package/src/tool/checkpoint-manager.ts +62 -44
- package/src/tool/gyoshu-completion.ts +211 -40
- package/src/tool/gyoshu-snapshot.ts +210 -132
- package/src/tool/migration-tool.ts +31 -37
- package/src/tool/notebook-writer.ts +34 -7
- package/src/tool/parallel-manager.ts +978 -0
- package/src/tool/python-repl.ts +357 -56
- package/src/tool/research-manager.ts +124 -39
- package/src/tool/retrospective-store.ts +25 -2
- package/src/tool/session-manager.ts +91 -119
- package/src/tool/session-structure-validator.ts +638 -0
- package/AGENTS.md +0 -1079
- package/bin/gyoshu.js +0 -295
- package/install.sh +0 -247
- package/src/agent/executor.md +0 -1851
- package/src/agent/plan-reviewer.md +0 -1862
- package/src/agent/plan.md +0 -97
- package/src/agent/task-orchestrator.md +0 -1121
- package/src/command/analyze-knowledge.md +0 -840
- package/src/command/analyze-plans.md +0 -513
- package/src/command/execute.md +0 -893
- package/src/command/generate-policy.md +0 -924
- package/src/command/generate-suggestions.md +0 -1111
- package/src/command/learn.md +0 -1181
- package/src/command/planner.md +0 -630
package/README.md
CHANGED
````diff
@@ -39,6 +39,7 @@ Think of it like a research lab:
 - 📓 **Auto-Generated Notebooks** — Every experiment is captured as a reproducible `.ipynb`
 - 🤖 **Autonomous Mode** — Set a goal, walk away, come back to results
 - 🔍 **Adversarial Verification** — PhD reviewer challenges every claim before acceptance
+- 🎯 **Two-Gate Completion** — SUCCESS requires both evidence quality (Trust Gate) AND goal achievement (Goal Gate)
 - 📝 **AI-Powered Reports** — Turn messy outputs into polished research narratives
 - 🔄 **Session Management** — Continue, replay, or branch your research anytime
@@ -46,32 +47,38 @@
 
 ## 🚀 Installation
 
-
-
+Add Gyoshu to your `opencode.json`:
+
+```json
+{
+  "plugin": ["gyoshu"]
+}
 ```
 
+That's it! OpenCode will auto-install Gyoshu via Bun on next startup.
+
 <details>
-<summary>📦
+<summary>📦 Development installation</summary>
 
-**Clone &
+**Clone & link locally** (for contributors)
 ```bash
 git clone https://github.com/Yeachan-Heo/My-Jogyo.git
-cd My-Jogyo &&
+cd My-Jogyo && bun install
 ```
 
-
-```
-
-
-
+Then in your `opencode.json`:
+```json
+{
+  "plugin": ["file:///path/to/My-Jogyo"]
+}
 ```
 
 </details>
 
 **Verify installation:**
 ```bash
-
-
+opencode
+/gyoshu doctor
 ```
 
 ---
@@ -80,7 +87,7 @@ bunx gyoshu install
 
 > *Using Claude, GPT, Gemini, or another AI assistant with OpenCode? This section is for you.*
 
-**Setup is the same** —
+**Setup is the same** — add `"gyoshu"` to your plugin array, then give your LLM the context it needs:
 
 1. **Point your LLM to the guide:**
    > "Read `AGENTS.md` in the Gyoshu directory for full context on how to use the research tools."
@@ -351,15 +358,14 @@ python3 -m venv .venv
 
 ## 🔄 Updating
 
-
-curl -fsSL https://raw.githubusercontent.com/Yeachan-Heo/My-Jogyo/main/install.sh | bash
-```
+OpenCode automatically updates plugins. To force an update, remove the cached version:
 
-Or if you cloned the repo:
 ```bash
-
+rm -rf ~/.cache/opencode/node_modules/gyoshu
 ```
 
+Then restart OpenCode.
+
 Verify: `opencode` then `/gyoshu doctor`
 
 See [CHANGELOG.md](CHANGELOG.md) for what's new.
````
package/package.json
CHANGED
````diff
@@ -1,22 +1,14 @@
 {
   "name": "gyoshu",
-  "version": "0.
+  "version": "0.4.0",
   "description": "Scientific research agent extension for OpenCode - turns research goals into reproducible Jupyter notebooks",
   "type": "module",
-  "
-
+  "main": "./src/index.ts",
+  "exports": {
+    ".": "./src/index.ts"
   },
   "files": [
-    "
-    "src/agent/*.md",
-    "src/command/*.md",
-    "src/tool/*.ts",
-    "src/skill/*/SKILL.md",
-    "src/bridge/*.py",
-    "src/lib/*.ts",
-    "src/plugin/*.ts",
-    "install.sh",
-    "AGENTS.md"
+    "src/"
   ],
   "scripts": {
     "test": "bun test ./tests",
@@ -35,6 +27,7 @@
   "license": "MIT",
   "keywords": [
     "opencode",
+    "opencode-plugin",
     "research",
     "scientific",
     "jupyter",
@@ -46,7 +39,7 @@
     "notebook"
   ],
   "engines": {
-    "
+    "bun": ">=1.0.0"
   },
   "os": [
     "darwin",
@@ -60,6 +53,6 @@
     "bun-types": "latest"
   },
   "dependencies": {
-    "zod": "^
+    "zod": "^3.23.0"
   }
 }
````
|
package/src/agent/baksa.md
CHANGED
````diff
@@ -399,6 +399,87 @@ Each component is scored 0-100 based on challenges passed. Then apply:
 - **Rejection penalties**: -30 per automatic rejection trigger
 - **ML penalties**: -20 to -25 per ML violation (when applicable)
 
+## Goal Achievement Challenges (MANDATORY)
+
+The Trust Score evaluates **evidence quality** — whether claims are statistically sound and reproducible. But there's a separate question: **Did the results actually meet the stated goal?**
+
+These are two different gates:
+- **Trust Gate**: Is the evidence reliable? (Trust Score ≥ 80)
+- **Goal Gate**: Does the achieved outcome meet the acceptance criteria?
+
+**Both must pass for SUCCESS status.** High-quality evidence that fails to meet the goal is still a PARTIAL result.
+
+### Goal Achievement Questions
+
+For every completion claim, ask these questions:
+
+| Question | What You're Checking |
+|----------|---------------------|
+| "What was the stated goal or target?" | Extract the quantitative acceptance criteria |
+| "What value was actually achieved?" | Find the measured/computed result |
+| "Does achieved meet or exceed target?" | Compare: actual >= target? |
+| "If claiming SUCCESS but target not met, why?" | Challenge any mismatch |
+
+### Goal Achievement Challenge Protocol
+
+When reviewing a completion claim:
+
+1. **Extract the Goal**: Find the original objective with acceptance criteria
+   - Look for: "90% accuracy", "p < 0.05", "reduce churn by 20%", "AUC > 0.85"
+   - Goals may be in `[OBJECTIVE]` markers or session context
+
+2. **Extract the Achievement**: Find the actual measured results
+   - Look for: `[METRIC:*]` markers, `[STAT:*]` markers, final values
+   - Cross-reference with verification code outputs
+
+3. **Compare**: Does actual meet target?
+   - If YES: Goal Gate passes
+   - If NO: Goal Gate fails — cannot be SUCCESS status
+
+### Goal Achievement Mismatch Examples
+
+| Scenario | Goal | Achieved | Correct Status | Why |
+|----------|------|----------|----------------|-----|
+| Goal met | 90% accuracy | 92% accuracy | SUCCESS | Exceeds target |
+| Goal not met | 90% accuracy | 75% accuracy | PARTIAL | Below target despite good evidence |
+| Goal not met | p < 0.05 | p = 0.12 | PARTIAL | Failed statistical threshold |
+| Goal exceeded | AUC > 0.80 | AUC = 0.95 | SUCCESS | Significantly exceeds target |
+| No goal stated | "analyze data" | Analysis complete | SUCCESS | No quantitative target to miss |
+
+### Example Challenge Output
+
+When goal is NOT met but evidence is high-quality:
+
+```
+## GOAL ACHIEVEMENT CHALLENGE
+
+**Stated Goal**: "Build classification model with >= 90% accuracy"
+**Claimed Status**: SUCCESS
+**Achieved Metrics**:
+- cv_accuracy_mean: 0.75
+- cv_accuracy_std: 0.03
+
+**CHALLENGE**: The goal requires >= 90% accuracy, but achieved accuracy is 75% ± 3%.
+This does NOT meet the acceptance criteria.
+
+**Trust Score**: 85 (VERIFIED) — Evidence quality is excellent
+**Goal Gate**: FAILED — 75% < 90% target
+
+**Recommendation**: Status should be PARTIAL, not SUCCESS.
+Reason: High-quality work that did not achieve the stated objective.
+```
+
+### Goal vs Trust: Key Distinction
+
+| Aspect | Trust Gate | Goal Gate |
+|--------|------------|-----------|
+| **What it checks** | Evidence quality and rigor | Goal achievement |
+| **Score/Metric** | Trust Score (0-100) | Binary: Met/Not Met |
+| **Can fail independently** | Yes | Yes |
+| **Examples of failure** | Missing CI, no baseline | 75% accuracy when goal was 90% |
+
+**Critical Rule**: A researcher can do excellent, rigorous work (Trust = 90) and still fail to achieve the goal. This is PARTIAL, not SUCCESS. Both gates must pass for SUCCESS.
+
 ## Independent Verification Patterns
 
 When challenging claims, perform these verification checks:
````
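The two-gate rule added in this diff can be sketched as a small decision function. This is an illustrative sketch only, not code from the package; the function name and its inputs are hypothetical.

```typescript
// Hypothetical sketch of the two-gate completion rule: SUCCESS requires
// BOTH the Trust Gate (score >= 80) AND the Goal Gate (target met).

type CompletionStatus = "SUCCESS" | "PARTIAL";

function decideStatus(
  trustScore: number,       // evidence quality, 0-100
  goalMet: boolean | null,  // null when no quantitative target was stated
): CompletionStatus {
  const trustGate = trustScore >= 80;
  // With no stated quantitative target, the Goal Gate cannot fail.
  const goalGate = goalMet === null ? true : goalMet;
  return trustGate && goalGate ? "SUCCESS" : "PARTIAL";
}

// Excellent evidence (Trust = 85) but 75% accuracy vs a 90% target:
console.log(decideStatus(85, false)); // "PARTIAL"
// Evidence passes and 92% vs a 90% target:
console.log(decideStatus(85, true));  // "SUCCESS"
```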
````diff
@@ -492,3 +573,232 @@ You are a self-contained verification agent. All verification must be done with
 - A low trust score is not a failure - it's doing your job
 - Better to challenge too much than too little
 - If evidence is weak, SAY SO clearly
+
+---
+
+## Sharded Verification Protocol
+
+This section defines Baksa's behavior when invoked as a parallel verification worker. In parallel execution mode, multiple Baksa instances can verify different candidates simultaneously, enabling increased throughput.
+
+### Sharded Verification Job
+
+When invoked as a parallel verification worker, Baksa receives these inputs:
+
+| Input | Type | Description |
+|-------|------|-------------|
+| `candidatePath` | string | Path to worker's candidate.json file |
+| `stageId` | string | Stage being verified (e.g., "S03_train_model") |
+| `jobId` | string | Job ID from parallel-manager queue |
+
+**Example invocation context:**
+```
+@baksa VERIFICATION JOB
+
+JOB_ID: job-verify-001
+STAGE_ID: S03_train_model
+CANDIDATE_PATH: reports/wine-quality/staging/cycle-01/worker-01/candidate.json
+
+Verify the candidate results and emit machine-parsable output.
+```
+
+### Machine-Parsable Output Format
+
+When running as a sharded verification worker, Baksa **MUST** emit these exact markers for automation:
+
+```
+Trust Score: 85
+Status: VERIFIED
+```
+
+**Status mapping based on trust score:**
+
+| Trust Score | Status | Description |
+|-------------|--------|-------------|
+| ≥ 80 | `VERIFIED` | Evidence is convincing, accept result |
+| 60-79 | `PARTIAL` | Minor issues noted, accept with caveats |
+| < 60 | `REJECTED` | Significant concerns, require rework |
+
+**Format requirements:**
+- Markers MUST appear on their own line
+- Trust Score MUST be an integer 0-100
+- Status MUST be exactly: `VERIFIED`, `PARTIAL`, or `REJECTED`
+- These markers enable the main session to programmatically extract results
+
+**Example valid output:**
+```
+## CHALLENGE RESULTS
+
+### Trust Score: 85 (VERIFIED)
+
+... detailed challenge analysis ...
+
+Trust Score: 85
+Status: VERIFIED
+```
+
````
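The markers described in this diff are meant for programmatic extraction. Below is a minimal sketch of how a consumer could parse them and compute the score-to-status mapping; the helper names are hypothetical and not part of the package.

```typescript
// Hypothetical parser for the "Trust Score:" / "Status:" markers emitted
// by a sharded Baksa worker. Each marker must appear on its own line, so
// multiline-anchored regexes are sufficient.

type BaksaStatus = "VERIFIED" | "PARTIAL" | "REJECTED";

// Maps a 0-100 trust score onto the documented thresholds.
function statusForScore(score: number): BaksaStatus {
  if (score >= 80) return "VERIFIED";
  if (score >= 60) return "PARTIAL";
  return "REJECTED";
}

function parseMarkers(output: string): { trustScore: number; status: BaksaStatus } | null {
  // `^...$` with the m flag ignores lines like "### Trust Score: 85 (VERIFIED)".
  const scoreMatch = output.match(/^Trust Score: (\d{1,3})$/m);
  const statusMatch = output.match(/^Status: (VERIFIED|PARTIAL|REJECTED)$/m);
  if (!scoreMatch || !statusMatch) return null;
  return { trustScore: Number(scoreMatch[1]), status: statusMatch[1] as BaksaStatus };
}

const out = "## CHALLENGE RESULTS\n\n...analysis...\n\nTrust Score: 85\nStatus: VERIFIED\n";
const parsed = parseMarkers(out);
// parsed => { trustScore: 85, status: "VERIFIED" }, consistent with statusForScore(85)
```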
````diff
+### JSON Summary Block
+
+At the **end** of verification, emit a machine-readable JSON summary block for automation:
+
+```json
+{"trustScore": 85, "status": "VERIFIED", "challenges": ["Q1", "Q2"], "findings_verified": 3, "findings_rejected": 0}
+```
+
+**JSON summary fields:**
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `trustScore` | number | Integer 0-100 |
+| `status` | string | "VERIFIED", "PARTIAL", or "REJECTED" |
+| `challenges` | string[] | List of challenge IDs/questions posed |
+| `findings_verified` | number | Count of findings that passed verification |
+| `findings_rejected` | number | Count of findings that failed verification |
+
+**Format requirements:**
+- JSON MUST be valid and on a single line
+- JSON MUST appear after all challenge analysis
+- Field names MUST match exactly (snake_case for counts)
+
````
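Because the summary is a single JSON line emitted after all analysis, a consumer can scan backwards for the last line that parses. A sketch under that assumption follows; the helper name and summary shape mirror the fields in this diff but the function itself is hypothetical.

```typescript
// Hypothetical extraction of the single-line JSON summary block: scan
// lines from the end and return the first one that parses as an object
// with a numeric trustScore.

interface JsonSummary {
  trustScore: number;
  status: string;
  challenges: string[];
  findings_verified: number;
  findings_rejected: number;
}

function extractJsonSummary(output: string): JsonSummary | null {
  const lines = output.trim().split("\n").reverse();
  for (const line of lines) {
    const t = line.trim();
    if (!t.startsWith("{") || !t.endsWith("}")) continue;
    try {
      const obj = JSON.parse(t);
      if (typeof obj.trustScore === "number") return obj as JsonSummary;
    } catch {
      // Not valid JSON on this line; keep scanning upward.
    }
  }
  return null;
}

const summary = extractJsonSummary(
  'analysis text\n{"trustScore": 85, "status": "VERIFIED", "challenges": ["Q1"], "findings_verified": 3, "findings_rejected": 0}',
);
// summary?.trustScore === 85, summary?.findings_verified === 3
```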
````diff
+### Sharded Verification Workflow
+
+When operating as a parallel verification worker, follow this 7-step workflow:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                SHARDED VERIFICATION WORKFLOW                │
+└─────────────────────────────────────────────────────────────┘
+
+1. RECEIVE JOB
+   │ Read job parameters: jobId, stageId, candidatePath
+   │
+   ▼
+2. READ CANDIDATE
+   │ Load candidate.json from staging directory
+   │ Extract: metrics, findings, statistics, artifacts
+   │
+   ▼
+3. VERIFY FINDINGS
+   │ For each [FINDING] in candidate:
+   │   - Check for supporting [STAT:ci] within 10 lines
+   │   - Check for supporting [STAT:effect_size] within 10 lines
+   │   - Verify claims match evidence
+   │
+   ▼
+4. CALCULATE TRUST SCORE
+   │ Apply trust score formula:
+   │   - Statistical Rigor (30%)
+   │   - Evidence Quality (25%)
+   │   - Metric Verification (20%)
+   │   - Completeness (15%)
+   │   - Methodology (10%)
+   │ Subtract rejection penalties (-30 each)
+   │
+   ▼
+5. EMIT MACHINE-PARSABLE OUTPUT
+   │ Print exact markers:
+   │   Trust Score: {score}
+   │   Status: {VERIFIED|PARTIAL|REJECTED}
+   │
+   ▼
+6. WRITE baksa.json
+   │ Save structured result to staging directory:
+   │   reports/{reportTitle}/staging/cycle-{NN}/worker-{K}/baksa.json
+   │
+   ▼
+7. REPORT COMPLETION
+   │ Return structured response indicating completion
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Step-by-step details:**
+
+1. **Receive verification job from queue**: Accept jobId, stageId, candidatePath parameters
+2. **Read candidate.json from staging directory**: Load the worker's output file
+3. **Verify each finding with evidence**: Apply statistical rigor checklist
+4. **Calculate trust score**: Use weighted components minus penalties
+5. **Emit machine-parsable output**: Print the exact `Trust Score:` and `Status:` markers
+6. **Write baksa.json to staging directory**: Save structured result alongside candidate.json
+7. **Report completion to queue**: Signal verification complete
+
+### baksa.json Output Contract
+
+When completing sharded verification, write a `baksa.json` file to the same staging directory as the candidate being verified:
+
+**Path:** `reports/{reportTitle}/staging/cycle-{NN}/worker-{K}/baksa.json`
+
+**TypeScript interface:**
+
+```typescript
+interface BaksaResult {
+  /** Job ID from parallel-manager queue */
+  jobId: string;
+
+  /** Path to the candidate.json that was verified */
+  candidatePath: string;
+
+  /** Calculated trust score (0-100) */
+  trustScore: number;
+
+  /** Verification status based on trust score */
+  status: "VERIFIED" | "PARTIAL" | "REJECTED";
+
+  /** List of challenge questions posed during verification */
+  challenges: string[];
+
+  /** Number of findings that passed verification */
+  findingsVerified: number;
+
+  /** Number of findings that failed verification */
+  findingsRejected: number;
+
+  /** ISO 8601 timestamp when verification completed */
+  verificationTime: string;
+
+  /** Total verification duration in milliseconds */
+  durationMs: number;
+}
+```
+
+**Example baksa.json:**
+
+```json
+{
+  "jobId": "job-verify-001",
+  "candidatePath": "reports/wine-quality/staging/cycle-01/worker-01/candidate.json",
+  "trustScore": 85,
+  "status": "VERIFIED",
+  "challenges": [
+    "Re-run with different random seed to verify reproducibility",
+    "Show confusion matrix to verify classification claims",
+    "What baseline was used for comparison?"
+  ],
+  "findingsVerified": 3,
+  "findingsRejected": 0,
+  "verificationTime": "2026-01-06T15:30:00Z",
+  "durationMs": 45000
+}
+```
+
+**Validation rules:**
+- `trustScore` MUST be integer 0-100
+- `status` MUST match trust score thresholds (≥80=VERIFIED, 60-79=PARTIAL, <60=REJECTED)
+- `verificationTime` MUST be valid ISO 8601 timestamp
+- `durationMs` MUST be non-negative integer
+- `findingsVerified + findingsRejected` should equal total findings in candidate
+
````
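The validation rules for `baksa.json` can be checked mechanically. The sketch below is a hypothetical validator, not code shipped in the package; the input shape follows the `BaksaResult` interface from this diff, and the timestamp check is deliberately loose (`Date.parse` accepts more than strict ISO 8601).

```typescript
// Hypothetical validator implementing the baksa.json validation rules:
// integer score range, status/threshold agreement, parseable timestamp,
// and a non-negative integer duration. Returns a list of violations.

function validateBaksaResult(r: {
  trustScore: number;
  status: string;
  verificationTime: string;
  durationMs: number;
}): string[] {
  const errors: string[] = [];
  if (!Number.isInteger(r.trustScore) || r.trustScore < 0 || r.trustScore > 100) {
    errors.push("trustScore must be an integer 0-100");
  }
  // status must agree with the documented trust score thresholds
  const expected =
    r.trustScore >= 80 ? "VERIFIED" : r.trustScore >= 60 ? "PARTIAL" : "REJECTED";
  if (r.status !== expected) {
    errors.push(`status ${r.status} does not match threshold for score ${r.trustScore}`);
  }
  if (Number.isNaN(Date.parse(r.verificationTime))) {
    errors.push("verificationTime must be a valid ISO 8601 timestamp");
  }
  if (!Number.isInteger(r.durationMs) || r.durationMs < 0) {
    errors.push("durationMs must be a non-negative integer");
  }
  return errors; // empty array = valid
}

validateBaksaResult({
  trustScore: 85,
  status: "VERIFIED",
  verificationTime: "2026-01-06T15:30:00Z",
  durationMs: 45000,
});
// → [] (valid)
```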
````diff
+### Sharded vs Non-Sharded Mode
+
+Baksa operates in two modes:
+
+| Mode | Trigger | Output |
+|------|---------|--------|
+| **Normal (Interactive)** | Direct invocation from Gyoshu | Human-readable challenge results in conversation |
+| **Sharded (Parallel Worker)** | Invocation with jobId + candidatePath | Machine-parsable markers + baksa.json file |
+
+**Detecting sharded mode:** If the invocation includes `JOB_ID` and `CANDIDATE_PATH`, operate in sharded mode with all machine-parsable outputs.
+
+**Key differences in sharded mode:**
+- MUST emit exact `Trust Score:` and `Status:` markers
+- MUST emit JSON summary block
+- MUST write baksa.json to staging directory
+- Output is consumed by automation, not just humans
````