gyoshu 0.2.5 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/AGENTS.md CHANGED
@@ -573,6 +573,369 @@ print(f"[FINDING] Random Forest achieves {scores.mean():.1%} accuracy, "
573
573
  f"outperforming baseline by {improvement:.1%} (95% CI [{ci_low:.3f}, {ci_high:.3f}])")
574
574
  ```
575
575
 
576
+ ## Goal Contract System
577
+
578
+ A **Goal Contract** defines measurable success criteria for research before execution begins. This enables Gyoshu to objectively determine whether a research goal has been achieved, rather than relying solely on subjective verification.
579
+
580
+ ### What is a Goal Contract?
581
+
582
+ A Goal Contract is a formal specification that:
583
+ 1. **States the goal** in clear, measurable terms
584
+ 2. **Defines acceptance criteria** that must be met for success
585
+ 3. **Limits retry attempts** to prevent infinite loops
586
+ 4. **Enables automatic verification** at research completion
587
+
588
+ ### Goal Contract in Notebook Frontmatter
589
+
590
+ Goal contracts are stored in the notebook's YAML frontmatter under the `gyoshu.goal_contract` key:
591
+
592
+ ```yaml
593
+ ---
594
+ title: "Customer Churn Classification"
595
+ gyoshu:
596
+ schema_version: 1
597
+ reportTitle: churn-classification
598
+ status: active
599
+ goal_contract:
600
+ version: 1
601
+ goal_text: "Build a classification model with 90% accuracy"
602
+ goal_type: "ml_classification"
603
+ max_goal_attempts: 3
604
+ acceptance_criteria:
605
+ - id: AC1
606
+ kind: metric_threshold
607
+ metric: cv_accuracy_mean
608
+ op: ">="
609
+ target: 0.90
610
+ - id: AC2
611
+ kind: marker_required
612
+ marker: "METRIC:baseline_accuracy"
613
+ - id: AC3
614
+ kind: artifact_exists
615
+ artifactPattern: "*.pkl"
616
+ - id: AC4
617
+ kind: finding_count
618
+ minCount: 3
619
+ ---
620
+ ```
621
+
622
+ ### Goal Contract Fields
623
+
624
+ | Field | Type | Required | Description |
625
+ |-------|------|----------|-------------|
626
+ | `version` | number | Yes | Schema version (currently `1`) |
627
+ | `goal_text` | string | Yes | Human-readable goal statement |
628
+ | `goal_type` | string | No | Goal category: `ml_classification`, `ml_regression`, `eda`, `statistical`, `custom` |
629
+ | `max_goal_attempts` | number | No | Maximum pivot attempts before BLOCKED (default: 3) |
630
+ | `acceptance_criteria` | array | Yes | List of criteria that must ALL pass |
631
+
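Before starting a run, it can help to sanity-check a contract against these required fields. Below is a minimal sketch in Python, assuming the frontmatter has already been parsed into a dict (e.g. with a YAML loader); `validate_goal_contract` is a hypothetical helper, not part of Gyoshu:

```python
def validate_goal_contract(contract: dict) -> list[str]:
    """Return a list of problems; an empty list means the contract looks valid."""
    problems = []
    # Required top-level fields per the table above.
    for field in ("version", "goal_text", "acceptance_criteria"):
        if field not in contract:
            problems.append(f"missing required field: {field}")
    # Every criterion needs at least an id and a kind.
    for criterion in contract.get("acceptance_criteria", []):
        if "id" not in criterion or "kind" not in criterion:
            problems.append(f"criterion missing id/kind: {criterion}")
    return problems

contract = {
    "version": 1,
    "goal_text": "Build a classification model with 90% accuracy",
    "acceptance_criteria": [
        {"id": "AC1", "kind": "metric_threshold",
         "metric": "cv_accuracy_mean", "op": ">=", "target": 0.90},
    ],
}
print(validate_goal_contract(contract))  # [] -> no problems found
```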
632
+ ### Acceptance Criteria Types
633
+
634
+ #### 1. `metric_threshold` — Compare Metric to Target
635
+
636
+ Checks if a `[METRIC:name]` marker value meets a threshold.
637
+
638
+ | Field | Type | Description |
639
+ |-------|------|-------------|
640
+ | `id` | string | Unique identifier (e.g., `AC1`) |
641
+ | `kind` | string | Must be `metric_threshold` |
642
+ | `metric` | string | Metric name (e.g., `cv_accuracy_mean`, `f1_score`) |
643
+ | `op` | string | Comparison operator: `>=`, `>`, `<=`, `<`, `==` |
644
+ | `target` | number | Target value to compare against |
645
+
646
+ **Example:**
647
+ ```yaml
648
+ - id: AC1
649
+ kind: metric_threshold
650
+ metric: cv_accuracy_mean
651
+ op: ">="
652
+ target: 0.90
653
+ ```
654
+
655
+ **How it works:** Scans notebook output for `[METRIC:cv_accuracy_mean] 0.92` and checks if `0.92 >= 0.90`.
656
+
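The scan-and-compare step can be sketched in Python. This is illustrative only: `check_metric_threshold` is a hypothetical name, and real marker parsing may be more involved:

```python
import operator
import re

# Comparison operators allowed in a metric_threshold criterion.
OPS = {">=": operator.ge, ">": operator.gt,
       "<=": operator.le, "<": operator.lt, "==": operator.eq}

def check_metric_threshold(output: str, metric: str, op: str, target: float) -> bool:
    """Scan output for '[METRIC:<name>] <value>' and compare the value to target."""
    match = re.search(rf"\[METRIC:{re.escape(metric)}\]\s*([-+0-9.eE]+)", output)
    if match is None:
        return False  # metric never reported, so the criterion fails
    return OPS[op](float(match.group(1)), target)

print(check_metric_threshold("[METRIC:cv_accuracy_mean] 0.92",
                             "cv_accuracy_mean", ">=", 0.90))  # True
```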
657
+ #### 2. `marker_required` — Check Marker Exists
658
+
659
+ Verifies that a specific marker type appears in the notebook output.
660
+
661
+ | Field | Type | Description |
662
+ |-------|------|-------------|
663
+ | `id` | string | Unique identifier |
664
+ | `kind` | string | Must be `marker_required` |
665
+ | `marker` | string | Marker type to find (e.g., `METRIC:baseline_accuracy`, `STAT:ci`) |
666
+
667
+ **Example:**
668
+ ```yaml
669
+ - id: AC2
670
+ kind: marker_required
671
+ marker: "METRIC:baseline_accuracy"
672
+ ```
673
+
674
+ **How it works:** Searches for `[METRIC:baseline_accuracy]` in any cell output. Passes if found at least once.
675
+
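A minimal sketch of the presence check (a hypothetical helper, assuming markers are matched literally inside square brackets):

```python
import re

def check_marker_required(output: str, marker: str) -> bool:
    """Pass if '[<marker>]' appears at least once in the output."""
    # Marker names like 'METRIC:baseline_accuracy' are escaped and matched literally.
    return re.search(rf"\[{re.escape(marker)}\]", output) is not None

print(check_marker_required("[METRIC:baseline_accuracy] 0.61",
                            "METRIC:baseline_accuracy"))  # True
```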
676
+ #### 3. `artifact_exists` — Check File Exists
677
+
678
+ Verifies that a specific artifact file was created in the reports directory.
679
+
680
+ | Field | Type | Description |
681
+ |-------|------|-------------|
682
+ | `id` | string | Unique identifier |
683
+ | `kind` | string | Must be `artifact_exists` |
684
+ | `artifactPattern` | string | Glob pattern to match (e.g., `*.pkl`, `figures/*.png`, `model.joblib`) |
685
+
686
+ **Example:**
687
+ ```yaml
688
+ - id: AC3
689
+ kind: artifact_exists
690
+ artifactPattern: "models/*.pkl"
691
+ ```
692
+
693
+ **How it works:** Checks `reports/{reportTitle}/models/` for any `.pkl` file. Passes if at least one match exists.
694
+
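The glob check might look like this in Python (a sketch; `check_artifact_exists` and the way the report directory is passed in are assumptions, not Gyoshu's API):

```python
import tempfile
from pathlib import Path

def check_artifact_exists(report_dir: Path, pattern: str) -> bool:
    """Pass if at least one file under the report directory matches the glob."""
    return any(report_dir.glob(pattern))

# Demo against a throwaway directory standing in for reports/{reportTitle}/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models").mkdir()
    (root / "models" / "rf.pkl").write_bytes(b"")
    print(check_artifact_exists(root, "models/*.pkl"))  # True: one .pkl exists
```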
695
+ #### 4. `finding_count` — Count Verified Findings
696
+
697
+ Verifies that a minimum number of verified `[FINDING]` markers exist.
698
+
699
+ | Field | Type | Description |
700
+ |-------|------|-------------|
701
+ | `id` | string | Unique identifier |
702
+ | `kind` | string | Must be `finding_count` |
703
+ | `minCount` | number | Minimum number of verified findings required |
704
+
705
+ **Example:**
706
+ ```yaml
707
+ - id: AC4
708
+ kind: finding_count
709
+ minCount: 3
710
+ ```
711
+
712
+ **How it works:** Counts `[FINDING]` markers that have supporting `[STAT:ci]` and `[STAT:effect_size]` markers within the 10 preceding lines. Only findings verified this way count toward `minCount`.
713
+
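The windowed counting rule can be sketched as follows (illustrative; the 10-line window mirrors the description above, and the substring matching is simplified):

```python
def count_verified_findings(output: str, window: int = 10) -> int:
    """Count [FINDING] lines that have [STAT:ci] and [STAT:effect_size]
    within the preceding `window` lines."""
    lines = output.splitlines()
    count = 0
    for i, line in enumerate(lines):
        if "[FINDING]" in line:
            context = "\n".join(lines[max(0, i - window):i])
            if "[STAT:ci]" in context and "[STAT:effect_size]" in context:
                count += 1
    return count

demo = "\n".join([
    "[FINDING] Unsupported claim",        # no preceding STAT markers: not counted
    "[STAT:ci] [0.89, 0.94]",
    "[STAT:effect_size] cohens_d=1.2",
    "[FINDING] Model beats baseline",     # both STATs within 10 lines: counted
])
print(count_verified_findings(demo))  # 1
```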
714
+ ### Goal Contract Examples
715
+
716
+ #### ML Classification Goal
717
+
718
+ ```yaml
719
+ goal_contract:
720
+ version: 1
721
+ goal_text: "Classify wine quality with F1 >= 0.85"
722
+ goal_type: ml_classification
723
+ max_goal_attempts: 3
724
+ acceptance_criteria:
725
+ - id: AC1
726
+ kind: metric_threshold
727
+ metric: cv_f1_mean
728
+ op: ">="
729
+ target: 0.85
730
+ - id: AC2
731
+ kind: marker_required
732
+ marker: "METRIC:baseline_accuracy"
733
+ - id: AC3
734
+ kind: artifact_exists
735
+ artifactPattern: "models/*.pkl"
736
+ ```
737
+
738
+ #### Exploratory Data Analysis Goal
739
+
740
+ ```yaml
741
+ goal_contract:
742
+ version: 1
743
+ goal_text: "Complete comprehensive EDA with 5+ insights"
744
+ goal_type: eda
745
+ max_goal_attempts: 2
746
+ acceptance_criteria:
747
+ - id: AC1
748
+ kind: finding_count
749
+ minCount: 5
750
+ - id: AC2
751
+ kind: artifact_exists
752
+ artifactPattern: "figures/*.png"
753
+ - id: AC3
754
+ kind: marker_required
755
+ marker: "CONCLUSION"
756
+ ```
757
+
758
+ #### Statistical Analysis Goal
759
+
760
+ ```yaml
761
+ goal_contract:
762
+ version: 1
763
+ goal_text: "Test hypothesis with p < 0.05"
764
+ goal_type: statistical
765
+ max_goal_attempts: 2
766
+ acceptance_criteria:
767
+ - id: AC1
768
+ kind: marker_required
769
+ marker: "STAT:p_value"
770
+ - id: AC2
771
+ kind: marker_required
772
+ marker: "STAT:ci"
773
+ - id: AC3
774
+ kind: marker_required
775
+ marker: "STAT:effect_size"
776
+ - id: AC4
777
+ kind: finding_count
778
+ minCount: 1
779
+ ```
780
+
781
+ ## Two-Gate Completion
782
+
783
+ Gyoshu uses a **Two-Gate verification system** to ensure both research quality (Trust Gate) and goal achievement (Goal Gate) before accepting results.
784
+
785
+ ### The Two Gates
786
+
787
+ | Gate | What It Checks | Who Evaluates | Pass Condition |
788
+ |------|----------------|---------------|----------------|
789
+ | **Trust Gate** | Research quality, statistical rigor, evidence validity | Baksa (adversarial verifier) | Trust score ≥ 80 |
790
+ | **Goal Gate** | Whether acceptance criteria are met | Automated (from goal contract) | All criteria pass |
791
+
792
+ ### Why Two Gates?
793
+
794
+ **Trust Gate alone is insufficient:**
795
+ - Research can be methodologically sound but fail to achieve the stated goal
796
+ - Example: Perfect statistical analysis showing 70% accuracy when goal was 90%
797
+
798
+ **Goal Gate alone is insufficient:**
799
+ - Goal can be "achieved" through flawed methodology
800
+ - Example: Claiming 95% accuracy on training set without cross-validation
801
+
802
+ **Together, they ensure:**
803
+ - Results are trustworthy AND meaningful
804
+ - Claims are verified AND goals are met
805
+ - Research is rigorous AND successful
806
+
807
+ ### Two-Gate Decision Matrix
808
+
809
+ | Trust Gate | Goal Gate | Final Status | Action |
810
+ |------------|-----------|--------------|--------|
811
+ | ✅ PASS | ✅ MET | **SUCCESS** | Accept result, generate report |
812
+ | ✅ PASS | ❌ NOT_MET | **PARTIAL** | Pivot: try different approach |
813
+ | ✅ PASS | 🚫 BLOCKED | **BLOCKED** | Goal impossible, escalate to user |
814
+ | ❌ FAIL | ✅ MET | **PARTIAL** | Rework: improve evidence quality |
815
+ | ❌ FAIL | ❌ NOT_MET | **PARTIAL** | Rework: fix methodology |
816
+ | ❌ FAIL | 🚫 BLOCKED | **BLOCKED** | Cannot proceed, escalate to user |
817
+
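The matrix reduces to a small decision function. A sketch in Python (hypothetical name, not Gyoshu's implementation):

```python
def final_status(trust_passed: bool, goal_status: str) -> str:
    """Combine the two gates per the decision matrix above."""
    if goal_status == "BLOCKED":
        return "BLOCKED"          # blocked goals escalate regardless of trust
    if trust_passed and goal_status == "MET":
        return "SUCCESS"          # the only path to SUCCESS
    return "PARTIAL"              # every other combination needs pivot or rework

print(final_status(True, "MET"))      # SUCCESS
print(final_status(True, "NOT_MET"))  # PARTIAL
print(final_status(False, "MET"))     # PARTIAL
```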
818
+ ### Gate Status Definitions
819
+
820
+ **Trust Gate:**
821
+ - `PASS`: Trust score ≥ 80 (verified)
822
+ - `FAIL`: Trust score < 80 (needs rework)
823
+
824
+ **Goal Gate:**
825
+ - `MET`: All acceptance criteria pass
826
+ - `NOT_MET`: Some criteria failed, but retry is possible
827
+ - `BLOCKED`: Goal is impossible (e.g., data doesn't support the hypothesis)
828
+
829
+ ### The Pivot and Rework Cycle
830
+
831
+ When gates fail, Gyoshu doesn't immediately give up:
832
+
833
+ ```
+ ┌────────────────────────────────────────────────┐
+ │               Research Execution               │
+ └───────────────────────┬────────────────────────┘
+                         ▼
+ ┌────────────────────────────────────────────────┐
+ │                Trust Gate Check                │
+ │        (Baksa adversarial verification)        │
+ └──────────┬────────────────────────┬────────────┘
+            │ PASS                   │ FAIL
+            ▼                        ▼
+ ┌──────────────────────┐  ┌────────────────────────┐
+ │   Goal Gate Check    │  │     Rework Request     │
+ │ (Automated criteria) │  │ (Fix evidence quality) │
+ └────┬───────────┬─────┘  └───────────┬────────────┘
+      │ MET       │ NOT_MET            │
+      ▼           ▼                    │
+   SUCCESS     PARTIAL                 │
+                  │                    │
+                  ├── Attempt < Max? ──┤
+                  │ Yes                │
+                  ▼                    │
+            ┌─────────────┐           │
+            │    PIVOT    │◄──────────┘
+            │   Try new   │
+            │  approach   │
+            └──────┬──────┘
+                   │ Attempt >= Max
+                   ▼
+                BLOCKED
+ ```
867
+
868
+ ### Pivot vs Rework
869
+
870
+ | Action | Trigger | What Happens |
871
+ |--------|---------|--------------|
872
+ | **Rework** | Trust Gate FAIL | Jogyo improves evidence (adds CI, effect size, etc.) without changing approach |
873
+ | **Pivot** | Goal Gate NOT_MET | Jogyo tries a different approach (new model, different features, etc.) |
874
+
875
+ ### Max Attempts and BLOCKED Status
876
+
877
+ The `max_goal_attempts` field in the goal contract limits how many times Gyoshu will try to achieve the goal:
878
+
879
+ ```yaml
880
+ goal_contract:
881
+ max_goal_attempts: 3 # Try up to 3 different approaches
882
+ ```
883
+
884
+ **Attempt counting:**
885
+ - Each Pivot increments the attempt counter
886
+ - Reworks do NOT increment (same approach, better evidence)
887
+ - When attempts ≥ max_goal_attempts, status becomes BLOCKED
888
+
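The counting rules above can be sketched as a small tracker (an illustrative class, not Gyoshu's actual state machine):

```python
class GoalAttemptTracker:
    """Pivot/rework bookkeeping: pivots consume the attempt budget, reworks do not."""
    def __init__(self, max_goal_attempts: int = 3):
        self.max_goal_attempts = max_goal_attempts
        self.attempts = 0

    def pivot(self) -> str:
        self.attempts += 1  # each pivot counts against the budget
        return "BLOCKED" if self.attempts >= self.max_goal_attempts else "PARTIAL"

    def rework(self) -> str:
        return "PARTIAL"  # same approach, better evidence: counter unchanged

tracker = GoalAttemptTracker(max_goal_attempts=3)
print(tracker.rework())  # PARTIAL (attempts still 0)
print(tracker.pivot())   # PARTIAL (attempts = 1)
print(tracker.pivot())   # PARTIAL (attempts = 2)
print(tracker.pivot())   # BLOCKED (attempts = 3 >= max)
```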
889
+ **BLOCKED status means:**
890
+ - The goal cannot be achieved with available data/methods
891
+ - User intervention is required
892
+ - Gyoshu will NOT keep trying indefinitely
893
+
894
+ ### Example: Two-Gate Flow
895
+
896
+ **Goal:** "Build classifier with 90% accuracy"
897
+
898
+ **Attempt 1:**
899
+ 1. Jogyo trains Random Forest → 85% accuracy
900
+ 2. Trust Gate: PASS (proper CV, baseline comparison)
901
+ 3. Goal Gate: NOT_MET (85% < 90%)
902
+ 4. Decision: PARTIAL → Pivot
903
+
904
+ **Attempt 2:**
905
+ 1. Jogyo trains XGBoost → 92% accuracy
906
+ 2. Trust Gate: FAIL (no confidence interval reported)
907
+ 3. Decision: PARTIAL → Rework
908
+
909
+ **Attempt 2 (Rework):**
910
+ 1. Jogyo adds CI: 95% CI [0.90, 0.94]
911
+ 2. Trust Gate: PASS
912
+ 3. Goal Gate: MET (92% ≥ 90%)
913
+ 4. Decision: **SUCCESS** ✅
914
+
915
+ ### Viewing Gate Results
916
+
917
+ Gate results are included in the completion response:
918
+
919
+ ```json
920
+ {
921
+ "status": "PARTIAL",
922
+ "trustGate": {
923
+ "passed": true,
924
+ "score": 85
925
+ },
926
+ "goalGate": {
927
+ "status": "NOT_MET",
928
+ "criteriaResults": [
929
+ { "id": "AC1", "passed": false, "actual": 0.85, "target": 0.90 },
930
+ { "id": "AC2", "passed": true }
931
+ ]
932
+ },
933
+ "action": "PIVOT",
934
+ "attemptNumber": 1,
935
+ "maxAttempts": 3
936
+ }
937
+ ```
938
+
576
939
  ## Structured Output Markers
577
940
 
578
941
  When working with Gyoshu REPL output, use these markers:
package/README.md CHANGED
@@ -39,6 +39,7 @@ Think of it like a research lab:
39
39
  - 📓 **Auto-Generated Notebooks** — Every experiment is captured as a reproducible `.ipynb`
40
40
  - 🤖 **Autonomous Mode** — Set a goal, walk away, come back to results
41
41
  - 🔍 **Adversarial Verification** — PhD reviewer challenges every claim before acceptance
42
+ - 🎯 **Two-Gate Completion** — SUCCESS requires both evidence quality (Trust Gate) AND goal achievement (Goal Gate)
42
43
  - 📝 **AI-Powered Reports** — Turn messy outputs into polished research narratives
43
44
  - 🔄 **Session Management** — Continue, replay, or branch your research anytime
44
45
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "gyoshu",
3
- "version": "0.2.5",
3
+ "version": "0.3.0",
4
4
  "description": "Scientific research agent extension for OpenCode - turns research goals into reproducible Jupyter notebooks",
5
5
  "type": "module",
6
6
  "bin": {
@@ -399,6 +399,87 @@ Each component is scored 0-100 based on challenges passed. Then apply:
399
399
  - **Rejection penalties**: -30 per automatic rejection trigger
400
400
  - **ML penalties**: -20 to -25 per ML violation (when applicable)
401
401
 
402
+ ## Goal Achievement Challenges (MANDATORY)
403
+
404
+ The Trust Score evaluates **evidence quality** — whether claims are statistically sound and reproducible. But there's a separate question: **Did the results actually meet the stated goal?**
405
+
406
+ These are two different gates:
407
+ - **Trust Gate**: Is the evidence reliable? (Trust Score ≥ 80)
408
+ - **Goal Gate**: Does the achieved outcome meet the acceptance criteria?
409
+
410
+ **Both must pass for SUCCESS status.** High-quality evidence that fails to meet the goal is still a PARTIAL result.
411
+
412
+ ### Goal Achievement Questions
413
+
414
+ For every completion claim, ask these questions:
415
+
416
+ | Question | What You're Checking |
417
+ |----------|---------------------|
418
+ | "What was the stated goal or target?" | Extract the quantitative acceptance criteria |
419
+ | "What value was actually achieved?" | Find the measured/computed result |
420
+ | "Does achieved meet or exceed target?" | Compare: actual >= target? |
421
+ | "If claiming SUCCESS but target not met, why?" | Challenge any mismatch |
422
+
423
+ ### Goal Achievement Challenge Protocol
424
+
425
+ When reviewing a completion claim:
426
+
427
+ 1. **Extract the Goal**: Find the original objective with acceptance criteria
428
+ - Look for: "90% accuracy", "p < 0.05", "reduce churn by 20%", "AUC > 0.85"
429
+ - Goals may be in `[OBJECTIVE]` markers or session context
430
+
431
+ 2. **Extract the Achievement**: Find the actual measured results
432
+ - Look for: `[METRIC:*]` markers, `[STAT:*]` markers, final values
433
+ - Cross-reference with verification code outputs
434
+
435
+ 3. **Compare**: Does actual meet target?
436
+ - If YES: Goal Gate passes
437
+ - If NO: Goal Gate fails — cannot be SUCCESS status
438
+
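Step 3 of the protocol is a plain comparison. A minimal sketch (a hypothetical helper, not part of the verifier):

```python
import operator

def goal_gate_verdict(achieved: float, target: float, op: str = ">=") -> str:
    """Compare the achieved value to the stated target (step 3 above)."""
    ops = {">=": operator.ge, ">": operator.gt,
           "<=": operator.le, "<": operator.lt}
    return "MET" if ops[op](achieved, target) else "FAILED"

print(goal_gate_verdict(0.75, 0.90))  # FAILED -> status must be PARTIAL, not SUCCESS
```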
439
+ ### Goal Achievement Mismatch Examples
440
+
441
+ | Scenario | Goal | Achieved | Correct Status | Why |
442
+ |----------|------|----------|----------------|-----|
443
+ | Goal met | 90% accuracy | 92% accuracy | SUCCESS | Exceeds target |
444
+ | Goal not met | 90% accuracy | 75% accuracy | PARTIAL | Below target despite good evidence |
445
+ | Goal not met | p < 0.05 | p = 0.12 | PARTIAL | Failed statistical threshold |
446
+ | Goal exceeded | AUC > 0.80 | AUC = 0.95 | SUCCESS | Significantly exceeds target |
447
+ | No goal stated | "analyze data" | Analysis complete | SUCCESS | No quantitative target to miss |
448
+
449
+ ### Example Challenge Output
450
+
451
+ When goal is NOT met but evidence is high-quality:
452
+
453
+ ```
454
+ ## GOAL ACHIEVEMENT CHALLENGE
455
+
456
+ **Stated Goal**: "Build classification model with >= 90% accuracy"
457
+ **Claimed Status**: SUCCESS
458
+ **Achieved Metrics**:
459
+ - cv_accuracy_mean: 0.75
460
+ - cv_accuracy_std: 0.03
461
+
462
+ **CHALLENGE**: The goal requires >= 90% accuracy, but achieved accuracy is 75% ± 3%.
463
+ This does NOT meet the acceptance criteria.
464
+
465
+ **Trust Score**: 85 (VERIFIED) — Evidence quality is excellent
466
+ **Goal Gate**: FAILED — 75% < 90% target
467
+
468
+ **Recommendation**: Status should be PARTIAL, not SUCCESS.
469
+ Reason: High-quality work that did not achieve the stated objective.
470
+ ```
471
+
472
+ ### Goal vs Trust: Key Distinction
473
+
474
+ | Aspect | Trust Gate | Goal Gate |
475
+ |--------|------------|-----------|
476
+ | **What it checks** | Evidence quality and rigor | Goal achievement |
477
+ | **Score/Metric** | Trust Score (0-100) | Binary: Met/Not Met |
478
+ | **Can fail independently** | Yes | Yes |
479
+ | **Examples of failure** | Missing CI, no baseline | 75% accuracy when goal was 90% |
480
+
481
+ **Critical Rule**: A researcher can do excellent, rigorous work (Trust = 90) and still fail to achieve the goal. This is PARTIAL, not SUCCESS. Both gates must pass for SUCCESS.
482
+
402
483
  ## Independent Verification Patterns
403
484
 
404
485
  When challenging claims, perform these verification checks:
@@ -542,6 +542,186 @@ For backwards compatibility, you can still use researchId-based creation:
542
542
  2. Add a run with `research-manager` (action: addRun, runId: "run-xxx", data: {goal, mode})
543
543
  3. Initialize notebook with `notebook-writer` (action: ensure_notebook, notebookPath: "...")
544
544
 
545
+ ### Goal Contract Creation
546
+
547
+ **CRITICAL**: Every research must have a **Goal Contract** that defines measurable acceptance criteria. The Goal Contract enables the **Two-Gate Verification System** — research cannot be marked SUCCESS without passing BOTH the Trust Gate (evidence quality) AND the Goal Gate (acceptance criteria met).
548
+
549
+ #### Why Goal Contracts?
550
+
551
+ Without a Goal Contract, research can be incorrectly marked SUCCESS when:
552
+ - Evidence quality is high (Trust Gate passes), BUT
553
+ - The actual goal was not achieved (e.g., "Build 90% accurate model" achieved 75%)
554
+
555
+ The Goal Contract makes acceptance criteria **explicit and verifiable**.
556
+
557
+ #### Goal Contract Format
558
+
559
+ Include goal_contract in the notebook frontmatter or run configuration:
560
+
561
+ ```yaml
562
+ gyoshu:
563
+ goal_contract:
564
+ version: 1
565
+ goal_text: "Build a churn prediction model with 90% accuracy"
566
+ goal_type: "ml_classification" # ml_classification | ml_regression | statistical_test | eda | custom
567
+ max_goal_attempts: 3 # Maximum pivots before BLOCKED
568
+ acceptance_criteria:
569
+ - id: AC1
570
+ kind: metric_threshold # metric_threshold | artifact_exists | statistical_significance
571
+ metric: cv_accuracy_mean # Must match a [METRIC:*] marker
572
+ op: ">=" # >= | > | <= | < | == | !=
573
+ target: 0.90
574
+
575
+ - id: AC2
576
+ kind: metric_threshold
577
+ metric: cv_accuracy_std
578
+ op: "<="
579
+ target: 0.05 # Low variance required
580
+
581
+ - id: AC3
582
+ kind: artifact_exists
583
+ artifact: "model.pkl" # Must exist in reports/{reportTitle}/models/
584
+ ```
585
+
586
+ #### Goal Contract Fields
587
+
588
+ | Field | Required | Description |
589
+ |-------|----------|-------------|
590
+ | `version` | Yes | Schema version (currently: 1) |
591
+ | `goal_text` | Yes | Human-readable goal statement |
592
+ | `goal_type` | Yes | Category for validation rules |
593
+ | `max_goal_attempts` | No | Max pivots before BLOCKED (default: 3) |
594
+ | `acceptance_criteria` | Yes | List of verifiable criteria |
595
+
596
+ #### Acceptance Criteria Kinds
597
+
598
+ | Kind | Use When | Verification Method |
599
+ |------|----------|---------------------|
600
+ | `metric_threshold` | Numeric target (accuracy, R², etc.) | Compare `[METRIC:*]` marker value against target |
601
+ | `artifact_exists` | Required output file | Check file exists in reports directory |
602
+ | `statistical_significance` | Hypothesis testing | Verify p-value < alpha AND effect size reported |
603
+
604
+ #### Two-Gate Decision Matrix
605
+
606
+ **CRITICAL**: SUCCESS requires BOTH gates to pass. This is the core rule.
607
+
608
+ | Trust Gate | Goal Gate | Result | Action |
609
+ |------------|-----------|--------|--------|
610
+ | PASS (≥80) | MET | **SUCCESS** | Accept result, finalize research |
611
+ | PASS (≥80) | NOT_MET | **PARTIAL** | Pivot: try alternative approach |
612
+ | PASS (≥80) | BLOCKED | **BLOCKED** | Cannot achieve goal, report to user |
613
+ | FAIL (<80) | MET | **PARTIAL** | Rework: strengthen evidence |
614
+ | FAIL (<80) | NOT_MET | **PARTIAL** | Rework both evidence and approach |
615
+ | FAIL (<80) | BLOCKED | **BLOCKED** | Fundamental issue, escalate to user |
616
+
617
+ **Gate Definitions:**
618
+ - **Trust Gate**: Baksa verification score ≥ 80 (evidence quality)
619
+ - **Goal Gate**: All acceptance_criteria in goal_contract are satisfied
620
+
621
+ #### Goal Gate Evaluation
622
+
623
+ After Trust Gate evaluation, check Goal Gate:
624
+
625
+ ```
626
+ FUNCTION evaluateGoalGate(goalContract, snapshot):
627
+ FOR each criterion IN goalContract.acceptance_criteria:
628
+
629
+ IF criterion.kind == "metric_threshold":
630
+ metricValue = findMetric(snapshot, criterion.metric)
631
+ IF NOT compare(metricValue, criterion.op, criterion.target):
632
+ RETURN { status: "NOT_MET", failed: criterion.id }
633
+
634
+ ELSE IF criterion.kind == "artifact_exists":
635
+ IF NOT artifactExists(criterion.artifact):
636
+ RETURN { status: "NOT_MET", failed: criterion.id }
637
+
638
+ ELSE IF criterion.kind == "statistical_significance":
639
+ IF NOT hasValidFinding(criterion):
640
+ RETURN { status: "NOT_MET", failed: criterion.id }
641
+
642
+ RETURN { status: "MET" }
643
+ ```
644
+
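The pseudocode above translates naturally to Python. The following is a sketch under the assumption that the snapshot is plain text with `[METRIC:name] value` markers; `find_metric` and the simplified `statistical_significance` check are illustrative, not Gyoshu's implementation:

```python
import operator
import re
from pathlib import Path

OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le,
       "<": operator.lt, "==": operator.eq, "!=": operator.ne}

def find_metric(snapshot: str, name: str):
    """Extract the value of a [METRIC:<name>] marker, or None if absent."""
    m = re.search(rf"\[METRIC:{re.escape(name)}\]\s*([-+0-9.eE]+)", snapshot)
    return float(m.group(1)) if m else None

def evaluate_goal_gate(contract: dict, snapshot: str, report_dir: Path) -> dict:
    """Return {'status': 'MET'} or {'status': 'NOT_MET', 'failed': <id>}."""
    for criterion in contract["acceptance_criteria"]:
        kind = criterion["kind"]
        if kind == "metric_threshold":
            value = find_metric(snapshot, criterion["metric"])
            if value is None or not OPS[criterion["op"]](value, criterion["target"]):
                return {"status": "NOT_MET", "failed": criterion["id"]}
        elif kind == "artifact_exists":
            if not (report_dir / criterion["artifact"]).exists():
                return {"status": "NOT_MET", "failed": criterion["id"]}
        elif kind == "statistical_significance":
            # Simplified stand-in for hasValidFinding: require both STAT markers.
            if "[STAT:p_value]" not in snapshot or "[STAT:effect_size]" not in snapshot:
                return {"status": "NOT_MET", "failed": criterion["id"]}
    return {"status": "MET"}
```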
645
+ #### Pivot Protocol
646
+
647
+ When the Trust Gate passes but the Goal Gate returns NOT_MET:
648
+
649
+ ```
650
+ PIVOT PROTOCOL (Trust ≥ 80, Goal NOT_MET):
651
+
652
+ 1. Increment goal_attempt counter
653
+ 2. IF goal_attempt >= max_goal_attempts:
654
+ → Status: BLOCKED
655
+ → Report: "Goal not achievable after {N} attempts"
656
+ → Present options to user
657
+
658
+ 3. ELSE:
659
+ → Status: PARTIAL
660
+ → Analyze WHY goal not met:
661
+ - Which criteria failed?
662
+ - What was the gap? (e.g., achieved 85% vs target 90%)
663
+ - What approaches were tried?
664
+
665
+ → Generate pivot strategy:
666
+ - Alternative algorithm?
667
+ - More feature engineering?
668
+ - Relaxed criteria (with user approval)?
669
+ - Different data preprocessing?
670
+
671
+ → Delegate to @jogyo with pivot context:
672
+ "PIVOT ATTEMPT {N}/{max}
673
+ Previous: XGBoost achieved 85% accuracy
674
+ Gap: 5% below 90% target
675
+ Try: Random Forest with hyperparameter tuning"
676
+ ```
677
+
678
+ #### Example: Two-Gate Verification Flow
679
+
680
+ ```
681
+ 1. @jogyo completes model training
682
+ → Signals: gyoshu_completion(status: "SUCCESS", evidence: {...})
683
+
684
+ 2. Gyoshu gets snapshot
685
+ → gyoshu_snapshot(researchSessionID: "...")
686
+
687
+ 3. TRUST GATE: Invoke @baksa
688
+ → Trust Score: 85 (PASS ✓)
689
+ → Evidence quality verified
690
+
691
+ 4. GOAL GATE: Check acceptance criteria
692
+ → AC1: cv_accuracy_mean = 0.87 (target: 0.90) → FAIL
693
+ → Goal Gate: NOT_MET ✗
694
+
695
+ 5. TWO-GATE RESULT: Trust PASS + Goal NOT_MET = PARTIAL
696
+ → Do NOT mark SUCCESS
697
+ → Trigger PIVOT PROTOCOL
698
+
699
+ 6. PIVOT: Attempt 1/3
700
+ → Delegate to @jogyo with alternative approach
701
+ → "Try ensemble with feature selection"
702
+
703
+ 7. (Repeat verification after pivot...)
704
+
705
+ 8. FINAL: Trust PASS + Goal MET = SUCCESS
706
+ → Accept result, finalize research
707
+ ```
708
+
709
+ #### Never Accept SUCCESS Without Goal Gate
710
+
711
+ **HARD RULE**: Even with perfect evidence (Trust = 100), if acceptance criteria are not met, the result is PARTIAL, not SUCCESS.
712
+
713
+ ```
714
+ ❌ WRONG:
715
+ Trust: 95 (excellent evidence!)
716
+ Goal: 85% accuracy (target was 90%)
717
+ → "SUCCESS" // NO! Goal not met
718
+
719
+ ✓ CORRECT:
720
+ Trust: 95 (excellent evidence!)
721
+ Goal: 85% accuracy (target was 90%)
722
+ → "PARTIAL" → Pivot or report gap to user
723
+ ```
724
+
545
725
  ### Continuing Research
546
726
 
547
727
  When continuing existing research with notebook-centric workflow: