gyoshu 0.2.5 → 0.3.0
- package/AGENTS.md +363 -0
- package/README.md +1 -0
- package/package.json +1 -1
- package/src/agent/baksa.md +81 -0
- package/src/agent/gyoshu.md +180 -0
- package/src/agent/jogyo.md +55 -0
- package/src/lib/goal-gates.ts +753 -0
- package/src/lib/notebook-frontmatter.ts +307 -40
- package/src/tool/gyoshu-completion.ts +53 -0
package/AGENTS.md
CHANGED
|
@@ -573,6 +573,369 @@ print(f"[FINDING] Random Forest achieves {scores.mean():.1%} accuracy, "
|
|
|
573
573
|
f"outperforming baseline by {improvement:.1%} (95% CI [{ci_low:.3f}, {ci_high:.3f}])")
|
|
574
574
|
```
|
|
575
575
|
|
|
576
|
+
## Goal Contract System
|
|
577
|
+
|
|
578
|
+
A **Goal Contract** defines measurable success criteria for research before execution begins. This enables Gyoshu to objectively determine whether a research goal has been achieved, rather than relying solely on subjective verification.
|
|
579
|
+
|
|
580
|
+
### What is a Goal Contract?
|
|
581
|
+
|
|
582
|
+
A Goal Contract is a formal specification that:
|
|
583
|
+
1. **States the goal** in clear, measurable terms
|
|
584
|
+
2. **Defines acceptance criteria** that must be met for success
|
|
585
|
+
3. **Limits retry attempts** to prevent infinite loops
|
|
586
|
+
4. **Enables automatic verification** at research completion
|
|
587
|
+
|
|
588
|
+
### Goal Contract in Notebook Frontmatter
|
|
589
|
+
|
|
590
|
+
Goal contracts are stored in the notebook's YAML frontmatter under the `gyoshu.goal_contract` key:
|
|
591
|
+
|
|
592
|
+
```yaml
---
title: "Customer Churn Classification"
gyoshu:
  schema_version: 1
  reportTitle: churn-classification
  status: active
  goal_contract:
    version: 1
    goal_text: "Build a classification model with 90% accuracy"
    goal_type: "ml_classification"
    max_goal_attempts: 3
    acceptance_criteria:
      - id: AC1
        kind: metric_threshold
        metric: cv_accuracy_mean
        op: ">="
        target: 0.90
      - id: AC2
        kind: marker_required
        marker: "METRIC:baseline_accuracy"
      - id: AC3
        kind: artifact_exists
        artifactPattern: "*.pkl"
      - id: AC4
        kind: finding_count
        minCount: 3
---
```
|
|
621
|
+
|
|
622
|
+
### Goal Contract Fields
|
|
623
|
+
|
|
624
|
+
| Field | Type | Required | Description |
|
|
625
|
+
|-------|------|----------|-------------|
|
|
626
|
+
| `version` | number | Yes | Schema version (currently `1`) |
|
|
627
|
+
| `goal_text` | string | Yes | Human-readable goal statement |
|
|
628
|
+
| `goal_type` | string | No | Goal category: `ml_classification`, `ml_regression`, `eda`, `statistical`, `custom` |
|
|
629
|
+
| `max_goal_attempts` | number | No | Maximum pivot attempts before BLOCKED (default: 3) |
|
|
630
|
+
| `acceptance_criteria` | array | Yes | List of criteria that must ALL pass |
|
|
631
|
+
|
|
632
|
+
### Acceptance Criteria Types
|
|
633
|
+
|
|
634
|
+
#### 1. `metric_threshold` — Compare Metric to Target
|
|
635
|
+
|
|
636
|
+
Checks if a `[METRIC:name]` marker value meets a threshold.
|
|
637
|
+
|
|
638
|
+
| Field | Type | Description |
|
|
639
|
+
|-------|------|-------------|
|
|
640
|
+
| `id` | string | Unique identifier (e.g., `AC1`) |
|
|
641
|
+
| `kind` | string | Must be `metric_threshold` |
|
|
642
|
+
| `metric` | string | Metric name (e.g., `cv_accuracy_mean`, `f1_score`) |
|
|
643
|
+
| `op` | string | Comparison operator: `>=`, `>`, `<=`, `<`, `==` |
|
|
644
|
+
| `target` | number | Target value to compare against |
|
|
645
|
+
|
|
646
|
+
**Example:**
|
|
647
|
+
```yaml
- id: AC1
  kind: metric_threshold
  metric: cv_accuracy_mean
  op: ">="
  target: 0.90
```
|
|
654
|
+
|
|
655
|
+
**How it works:** Scans notebook output for `[METRIC:cv_accuracy_mean] 0.92` and checks if `0.92 >= 0.90`.
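
The scan-and-compare step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the package's actual implementation: the function name `check_metric_threshold`, the regular expression, and the choice to use the last reported value when a metric is printed more than once are all hypothetical.

```python
import operator
import re

# Comparison operators listed in the metric_threshold field table.
OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def check_metric_threshold(output: str, metric: str, op: str, target: float):
    """Return (passed, actual); actual is None when the marker is absent."""
    # Matches output lines such as "[METRIC:cv_accuracy_mean] 0.92".
    pattern = re.compile(
        r"\[METRIC:" + re.escape(metric) + r"\]\s*(-?\d+(?:\.\d+)?)"
    )
    values = [float(m.group(1)) for m in pattern.finditer(output)]
    if not values:
        return False, None  # a missing metric cannot satisfy the criterion
    actual = values[-1]  # assumption: the most recently printed value wins
    return OPS[op](actual, target), actual
```

With the example from the text, `check_metric_threshold("[METRIC:cv_accuracy_mean] 0.92", "cv_accuracy_mean", ">=", 0.90)` evaluates `0.92 >= 0.90` and reports a pass.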
|
|
656
|
+
|
|
657
|
+
#### 2. `marker_required` — Check Marker Exists
|
|
658
|
+
|
|
659
|
+
Verifies that a specific marker type appears in the notebook output.
|
|
660
|
+
|
|
661
|
+
| Field | Type | Description |
|
|
662
|
+
|-------|------|-------------|
|
|
663
|
+
| `id` | string | Unique identifier |
|
|
664
|
+
| `kind` | string | Must be `marker_required` |
|
|
665
|
+
| `marker` | string | Marker type to find (e.g., `METRIC:baseline_accuracy`, `STAT:ci`) |
|
|
666
|
+
|
|
667
|
+
**Example:**
|
|
668
|
+
```yaml
- id: AC2
  kind: marker_required
  marker: "METRIC:baseline_accuracy"
```
|
|
673
|
+
|
|
674
|
+
**How it works:** Searches for `[METRIC:baseline_accuracy]` in any cell output. Passes if found at least once.
|
|
675
|
+
|
|
676
|
+
#### 3. `artifact_exists` — Check File Exists
|
|
677
|
+
|
|
678
|
+
Verifies that a specific artifact file was created in the reports directory.
|
|
679
|
+
|
|
680
|
+
| Field | Type | Description |
|
|
681
|
+
|-------|------|-------------|
|
|
682
|
+
| `id` | string | Unique identifier |
|
|
683
|
+
| `kind` | string | Must be `artifact_exists` |
|
|
684
|
+
| `artifactPattern` | string | Glob pattern to match (e.g., `*.pkl`, `figures/*.png`, `model.joblib`) |
|
|
685
|
+
|
|
686
|
+
**Example:**
|
|
687
|
+
```yaml
- id: AC3
  kind: artifact_exists
  artifactPattern: "models/*.pkl"
```
|
|
692
|
+
|
|
693
|
+
**How it works:** Checks `reports/{reportTitle}/models/` for any `.pkl` file. Passes if at least one match exists.
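
A glob check of this kind can be sketched with the Python standard library. The helper name `artifact_exists` and the `reports_root` parameter are assumptions made for illustration; only the `reports/{reportTitle}/` layout and the "at least one match" rule come from the text.

```python
from pathlib import Path


def artifact_exists(reports_root: str, report_title: str, pattern: str) -> bool:
    """True if at least one file under {reports_root}/{report_title}/ matches."""
    report_dir = Path(reports_root) / report_title
    if not report_dir.is_dir():
        return False  # no report directory means no artifacts
    # Path.glob accepts patterns with subdirectories, e.g. "models/*.pkl".
    return any(path.is_file() for path in report_dir.glob(pattern))
```

For the example criterion, `artifact_exists("reports", "churn-classification", "models/*.pkl")` passes as soon as one `.pkl` file exists in the `models/` subdirectory.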
|
|
694
|
+
|
|
695
|
+
#### 4. `finding_count` — Count Verified Findings
|
|
696
|
+
|
|
697
|
+
Verifies that a minimum number of verified `[FINDING]` markers exist.
|
|
698
|
+
|
|
699
|
+
| Field | Type | Description |
|
|
700
|
+
|-------|------|-------------|
|
|
701
|
+
| `id` | string | Unique identifier |
|
|
702
|
+
| `kind` | string | Must be `finding_count` |
|
|
703
|
+
| `minCount` | number | Minimum number of verified findings required |
|
|
704
|
+
|
|
705
|
+
**Example:**
|
|
706
|
+
```yaml
- id: AC4
  kind: finding_count
  minCount: 3
```
|
|
711
|
+
|
|
712
|
+
**How it works:** Counts `[FINDING]` markers that have supporting `[STAT:ci]` and `[STAT:effect_size]` within 10 lines before. Only verified findings count.
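
The counting rule can be sketched as follows. This is an illustrative reading of the rule, not the package's code: the function name `count_verified_findings` is hypothetical, and it assumes that "within 10 lines before" excludes the `[FINDING]` line itself.

```python
def count_verified_findings(output: str, window: int = 10) -> int:
    """Count [FINDING] lines preceded by both supporting STAT markers."""
    lines = output.splitlines()
    verified = 0
    for index, line in enumerate(lines):
        if "[FINDING]" not in line:
            continue
        # Look only at the `window` lines directly before the finding.
        preceding = "\n".join(lines[max(0, index - window):index])
        if "[STAT:ci]" in preceding and "[STAT:effect_size]" in preceding:
            verified += 1
    return verified
```

A `finding_count` criterion with `minCount: 3` would then pass when `count_verified_findings(output) >= 3`.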
|
|
713
|
+
|
|
714
|
+
### Goal Contract Examples
|
|
715
|
+
|
|
716
|
+
#### ML Classification Goal
|
|
717
|
+
|
|
718
|
+
```yaml
goal_contract:
  version: 1
  goal_text: "Classify wine quality with F1 >= 0.85"
  goal_type: ml_classification
  max_goal_attempts: 3
  acceptance_criteria:
    - id: AC1
      kind: metric_threshold
      metric: cv_f1_mean
      op: ">="
      target: 0.85
    - id: AC2
      kind: marker_required
      marker: "METRIC:baseline_accuracy"
    - id: AC3
      kind: artifact_exists
      artifactPattern: "models/*.pkl"
```
|
|
737
|
+
|
|
738
|
+
#### Exploratory Data Analysis Goal
|
|
739
|
+
|
|
740
|
+
```yaml
goal_contract:
  version: 1
  goal_text: "Complete comprehensive EDA with 5+ insights"
  goal_type: eda
  max_goal_attempts: 2
  acceptance_criteria:
    - id: AC1
      kind: finding_count
      minCount: 5
    - id: AC2
      kind: artifact_exists
      artifactPattern: "figures/*.png"
    - id: AC3
      kind: marker_required
      marker: "CONCLUSION"
```
|
|
757
|
+
|
|
758
|
+
#### Statistical Analysis Goal
|
|
759
|
+
|
|
760
|
+
```yaml
goal_contract:
  version: 1
  goal_text: "Test hypothesis with p < 0.05"
  goal_type: statistical
  max_goal_attempts: 2
  acceptance_criteria:
    - id: AC1
      kind: marker_required
      marker: "STAT:p_value"
    - id: AC2
      kind: marker_required
      marker: "STAT:ci"
    - id: AC3
      kind: marker_required
      marker: "STAT:effect_size"
    - id: AC4
      kind: finding_count
      minCount: 1
```
|
|
780
|
+
|
|
781
|
+
## Two-Gate Completion
|
|
782
|
+
|
|
783
|
+
Gyoshu uses a **Two-Gate verification system** to ensure both research quality (Trust Gate) and goal achievement (Goal Gate) before accepting results.
|
|
784
|
+
|
|
785
|
+
### The Two Gates
|
|
786
|
+
|
|
787
|
+
| Gate | What It Checks | Who Evaluates | Pass Condition |
|
|
788
|
+
|------|----------------|---------------|----------------|
|
|
789
|
+
| **Trust Gate** | Research quality, statistical rigor, evidence validity | Baksa (adversarial verifier) | Trust score ≥ 80 |
|
|
790
|
+
| **Goal Gate** | Whether acceptance criteria are met | Automated (from goal contract) | All criteria pass |
|
|
791
|
+
|
|
792
|
+
### Why Two Gates?
|
|
793
|
+
|
|
794
|
+
**Trust Gate alone is insufficient:**
|
|
795
|
+
- Research can be methodologically sound but fail to achieve the stated goal
|
|
796
|
+
- Example: Perfect statistical analysis showing 70% accuracy when goal was 90%
|
|
797
|
+
|
|
798
|
+
**Goal Gate alone is insufficient:**
|
|
799
|
+
- Goal can be "achieved" through flawed methodology
|
|
800
|
+
- Example: Claiming 95% accuracy on training set without cross-validation
|
|
801
|
+
|
|
802
|
+
**Together, they ensure:**
|
|
803
|
+
- Results are trustworthy AND meaningful
|
|
804
|
+
- Claims are verified AND goals are met
|
|
805
|
+
- Research is rigorous AND successful
|
|
806
|
+
|
|
807
|
+
### Two-Gate Decision Matrix
|
|
808
|
+
|
|
809
|
+
| Trust Gate | Goal Gate | Final Status | Action |
|
|
810
|
+
|------------|-----------|--------------|--------|
|
|
811
|
+
| ✅ PASS | ✅ MET | **SUCCESS** | Accept result, generate report |
|
|
812
|
+
| ✅ PASS | ❌ NOT_MET | **PARTIAL** | Pivot: try different approach |
|
|
813
|
+
| ✅ PASS | 🚫 BLOCKED | **BLOCKED** | Goal impossible, escalate to user |
|
|
814
|
+
| ❌ FAIL | ✅ MET | **PARTIAL** | Rework: improve evidence quality |
|
|
815
|
+
| ❌ FAIL | ❌ NOT_MET | **PARTIAL** | Rework: fix methodology |
|
|
816
|
+
| ❌ FAIL | 🚫 BLOCKED | **BLOCKED** | Cannot proceed, escalate to user |
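
The matrix above can be expressed as a small lookup. The sketch below is only an illustration: the function name `two_gate_decision` and the action labels `ACCEPT`, `REWORK`, and `ESCALATE` are assumptions (only `PIVOT` appears as an action value elsewhere in this document), while the threshold of 80 and the status names come from the table.

```python
TRUST_THRESHOLD = 80  # Trust Gate pass condition from the table


def two_gate_decision(trust_score: float, goal_status: str):
    """Map a trust score and Goal Gate status to (final_status, action)."""
    if goal_status not in ("MET", "NOT_MET", "BLOCKED"):
        raise ValueError(f"unknown Goal Gate status: {goal_status}")
    trust_passed = trust_score >= TRUST_THRESHOLD
    if goal_status == "BLOCKED":
        return "BLOCKED", "ESCALATE"  # both BLOCKED rows escalate to the user
    if trust_passed and goal_status == "MET":
        return "SUCCESS", "ACCEPT"
    if trust_passed:
        return "PARTIAL", "PIVOT"  # sound evidence, goal missed
    return "PARTIAL", "REWORK"  # Trust Gate failed; fix evidence first
```

Checking `BLOCKED` first mirrors the table, where a blocked goal leads to escalation regardless of the trust score.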
|
|
817
|
+
|
|
818
|
+
### Gate Status Definitions
|
|
819
|
+
|
|
820
|
+
**Trust Gate:**
|
|
821
|
+
- `PASS`: Trust score ≥ 80 (verified)
|
|
822
|
+
- `FAIL`: Trust score < 80 (needs rework)
|
|
823
|
+
|
|
824
|
+
**Goal Gate:**
|
|
825
|
+
- `MET`: All acceptance criteria pass
|
|
826
|
+
- `NOT_MET`: Some criteria failed, but retry is possible
|
|
827
|
+
- `BLOCKED`: Goal is judged unachievable (e.g., data doesn't support the hypothesis, or `max_goal_attempts` has been reached)
|
|
828
|
+
|
|
829
|
+
### The Pivot and Rework Cycle
|
|
830
|
+
|
|
831
|
+
When gates fail, Gyoshu doesn't immediately give up:
|
|
832
|
+
|
|
833
|
+
```
┌─────────────────────────────────────────────────────────────┐
│                     Research Execution                      │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                      Trust Gate Check                       │
│              (Baksa adversarial verification)               │
└───────────┬─────────────────────────────────┬───────────────┘
            │ PASS                            │ FAIL
            ▼                                 ▼
┌───────────────────────┐         ┌───────────────────────────┐
│    Goal Gate Check    │         │      Rework Request       │
│ (Automated criteria)  │         │  (Fix evidence quality)   │
└───────┬───────────────┘         └─────────────┬─────────────┘
        │                                       │
   MET  │  NOT_MET                              │
        │     │                                 │
        ▼     ▼                                 │
   SUCCESS  PARTIAL                             │
              │                                 │
              ├───────── Attempt < Max? ────────┤
              │ Yes                             │
              ▼                                 │
        ┌─────────────┐                         │
        │    PIVOT    │◄────────────────────────┘
        │   Try new   │
        │  approach   │
        └─────────────┘
               │
               │ Attempt >= Max
               ▼
            BLOCKED
```
|
|
867
|
+
|
|
868
|
+
### Pivot vs Rework
|
|
869
|
+
|
|
870
|
+
| Action | Trigger | What Happens |
|
|
871
|
+
|--------|---------|--------------|
|
|
872
|
+
| **Rework** | Trust Gate FAIL | Jogyo improves evidence (adds CI, effect size, etc.) without changing approach |
|
|
873
|
+
| **Pivot** | Goal Gate NOT_MET | Jogyo tries a different approach (new model, different features, etc.) |
|
|
874
|
+
|
|
875
|
+
### Max Attempts and BLOCKED Status
|
|
876
|
+
|
|
877
|
+
The `max_goal_attempts` field in the goal contract limits how many times Gyoshu will try to achieve the goal:
|
|
878
|
+
|
|
879
|
+
```yaml
goal_contract:
  max_goal_attempts: 3  # Try up to 3 different approaches
```
|
|
883
|
+
|
|
884
|
+
**Attempt counting:**
|
|
885
|
+
- Each Pivot increments the attempt counter
|
|
886
|
+
- Reworks do NOT increment (same approach, better evidence)
|
|
887
|
+
- When attempts ≥ max_goal_attempts, status becomes BLOCKED
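
The counting rules above can be sketched as a tiny state update. This is a hedged illustration: the function name `record_action` is hypothetical, and it assumes the counter starts at 0 and that the status stays PARTIAL until the limit is reached.

```python
def record_action(attempts: int, action: str, max_goal_attempts: int = 3):
    """Return (attempts, status) after a PIVOT or a REWORK."""
    if action not in ("PIVOT", "REWORK"):
        raise ValueError(f"unknown action: {action}")
    if action == "PIVOT":
        attempts += 1  # a new approach consumes one attempt
    # REWORK leaves the counter unchanged: same approach, better evidence.
    status = "BLOCKED" if attempts >= max_goal_attempts else "PARTIAL"
    return attempts, status
```

Under these assumptions and the default limit of 3, two pivots leave the status at PARTIAL and the third one yields BLOCKED, while any number of reworks in between leaves the counter untouched.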
|
|
888
|
+
|
|
889
|
+
**BLOCKED status means:**
|
|
890
|
+
- The goal cannot be achieved with available data/methods
|
|
891
|
+
- User intervention is required
|
|
892
|
+
- Gyoshu will NOT keep trying indefinitely
|
|
893
|
+
|
|
894
|
+
### Example: Two-Gate Flow
|
|
895
|
+
|
|
896
|
+
**Goal:** "Build classifier with 90% accuracy"
|
|
897
|
+
|
|
898
|
+
**Attempt 1:**
|
|
899
|
+
1. Jogyo trains Random Forest → 85% accuracy
|
|
900
|
+
2. Trust Gate: PASS (proper CV, baseline comparison)
|
|
901
|
+
3. Goal Gate: NOT_MET (85% < 90%)
|
|
902
|
+
4. Decision: PARTIAL → Pivot
|
|
903
|
+
|
|
904
|
+
**Attempt 2:**
|
|
905
|
+
1. Jogyo trains XGBoost → 92% accuracy
|
|
906
|
+
2. Trust Gate: FAIL (no confidence interval reported)
|
|
907
|
+
3. Decision: PARTIAL → Rework
|
|
908
|
+
|
|
909
|
+
**Attempt 2 (Rework):**
|
|
910
|
+
1. Jogyo adds CI: 95% CI [0.90, 0.94]
|
|
911
|
+
2. Trust Gate: PASS
|
|
912
|
+
3. Goal Gate: MET (92% ≥ 90%)
|
|
913
|
+
4. Decision: **SUCCESS** ✅
|
|
914
|
+
|
|
915
|
+
### Viewing Gate Results
|
|
916
|
+
|
|
917
|
+
Gate results are included in the completion response:
|
|
918
|
+
|
|
919
|
+
```json
|
|
920
|
+
{
|
|
921
|
+
"status": "PARTIAL",
|
|
922
|
+
"trustGate": {
|
|
923
|
+
"passed": true,
|
|
924
|
+
"score": 85
|
|
925
|
+
},
|
|
926
|
+
"goalGate": {
|
|
927
|
+
"status": "NOT_MET",
|
|
928
|
+
"criteriaResults": [
|
|
929
|
+
{ "id": "AC1", "passed": false, "actual": 0.85, "target": 0.90 },
|
|
930
|
+
{ "id": "AC2", "passed": true }
|
|
931
|
+
]
|
|
932
|
+
},
|
|
933
|
+
"action": "PIVOT",
|
|
934
|
+
"attemptNumber": 1,
|
|
935
|
+
"maxAttempts": 3
|
|
936
|
+
}
|
|
937
|
+
```
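
A consumer of this response might extract the failing criteria as sketched below. The field names (`goalGate`, `criteriaResults`, `id`, `passed`) are taken from the example above; the helper name `failed_criteria` is hypothetical, and treating a missing `passed` field as a failure is an assumption.

```python
import json


def failed_criteria(response_json: str) -> list:
    """Return the ids of acceptance criteria that did not pass."""
    response = json.loads(response_json)
    results = response.get("goalGate", {}).get("criteriaResults", [])
    # Assumption: a result without a "passed" field counts as failed.
    return [result["id"] for result in results if not result.get("passed", False)]
```

Applied to the example response, this returns only `AC1`, the criterion whose actual value of 0.85 fell short of the 0.90 target.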
|
|
938
|
+
|
|
576
939
|
## Structured Output Markers
|
|
577
940
|
|
|
578
941
|
When working with Gyoshu REPL output, use these markers:
|
package/README.md
CHANGED
|
@@ -39,6 +39,7 @@ Think of it like a research lab:
|
|
|
39
39
|
- 📓 **Auto-Generated Notebooks** — Every experiment is captured as a reproducible `.ipynb`
|
|
40
40
|
- 🤖 **Autonomous Mode** — Set a goal, walk away, come back to results
|
|
41
41
|
- 🔍 **Adversarial Verification** — PhD reviewer challenges every claim before acceptance
|
|
42
|
+
- 🎯 **Two-Gate Completion** — SUCCESS requires both evidence quality (Trust Gate) AND goal achievement (Goal Gate)
|
|
42
43
|
- 📝 **AI-Powered Reports** — Turn messy outputs into polished research narratives
|
|
43
44
|
- 🔄 **Session Management** — Continue, replay, or branch your research anytime
|
|
44
45
|
|
package/package.json
CHANGED
package/src/agent/baksa.md
CHANGED
|
@@ -399,6 +399,87 @@ Each component is scored 0-100 based on challenges passed. Then apply:
|
|
|
399
399
|
- **Rejection penalties**: -30 per automatic rejection trigger
|
|
400
400
|
- **ML penalties**: -20 to -25 per ML violation (when applicable)
|
|
401
401
|
|
|
402
|
+
## Goal Achievement Challenges (MANDATORY)
|
|
403
|
+
|
|
404
|
+
The Trust Score evaluates **evidence quality** — whether claims are statistically sound and reproducible. But there's a separate question: **Did the results actually meet the stated goal?**
|
|
405
|
+
|
|
406
|
+
These are two different gates:
|
|
407
|
+
- **Trust Gate**: Is the evidence reliable? (Trust Score ≥ 80)
|
|
408
|
+
- **Goal Gate**: Does the achieved outcome meet the acceptance criteria?
|
|
409
|
+
|
|
410
|
+
**Both must pass for SUCCESS status.** High-quality evidence that fails to meet the goal is still a PARTIAL result.
|
|
411
|
+
|
|
412
|
+
### Goal Achievement Questions
|
|
413
|
+
|
|
414
|
+
For every completion claim, ask these questions:
|
|
415
|
+
|
|
416
|
+
| Question | What You're Checking |
|----------|---------------------|
| "What was the stated goal or target?" | Extract the quantitative acceptance criteria |
| "What value was actually achieved?" | Find the measured/computed result |
| "Does achieved meet or exceed target?" | Compare: actual >= target? |
| "If claiming SUCCESS but target not met, why?" | Challenge any mismatch |
|
|
422
|
+
|
|
423
|
+
### Goal Achievement Challenge Protocol
|
|
424
|
+
|
|
425
|
+
When reviewing a completion claim:
|
|
426
|
+
|
|
427
|
+
1. **Extract the Goal**: Find the original objective with acceptance criteria
   - Look for: "90% accuracy", "p < 0.05", "reduce churn by 20%", "AUC > 0.85"
   - Goals may be in `[OBJECTIVE]` markers or session context

2. **Extract the Achievement**: Find the actual measured results
   - Look for: `[METRIC:*]` markers, `[STAT:*]` markers, final values
   - Cross-reference with verification code outputs

3. **Compare**: Does actual meet target?
   - If YES: Goal Gate passes
   - If NO: Goal Gate fails — cannot be SUCCESS status
|
|
438
|
+
|
|
439
|
+
### Goal Achievement Mismatch Examples
|
|
440
|
+
|
|
441
|
+
| Scenario | Goal | Achieved | Correct Status | Why |
|----------|------|----------|----------------|-----|
| Goal met | 90% accuracy | 92% accuracy | SUCCESS | Exceeds target |
| Goal not met | 90% accuracy | 75% accuracy | PARTIAL | Below target despite good evidence |
| Goal not met | p < 0.05 | p = 0.12 | PARTIAL | Failed statistical threshold |
| Goal exceeded | AUC > 0.80 | AUC = 0.95 | SUCCESS | Significantly exceeds target |
| No goal stated | "analyze data" | Analysis complete | SUCCESS | No quantitative target to miss |
|
|
448
|
+
|
|
449
|
+
### Example Challenge Output
|
|
450
|
+
|
|
451
|
+
When goal is NOT met but evidence is high-quality:
|
|
452
|
+
|
|
453
|
+
```
|
|
454
|
+
## GOAL ACHIEVEMENT CHALLENGE
|
|
455
|
+
|
|
456
|
+
**Stated Goal**: "Build classification model with >= 90% accuracy"
|
|
457
|
+
**Claimed Status**: SUCCESS
|
|
458
|
+
**Achieved Metrics**:
|
|
459
|
+
- cv_accuracy_mean: 0.75
|
|
460
|
+
- cv_accuracy_std: 0.03
|
|
461
|
+
|
|
462
|
+
**CHALLENGE**: The goal requires >= 90% accuracy, but achieved accuracy is 75% ± 3%.
|
|
463
|
+
This does NOT meet the acceptance criteria.
|
|
464
|
+
|
|
465
|
+
**Trust Score**: 85 (VERIFIED) — Evidence quality is excellent
|
|
466
|
+
**Goal Gate**: FAILED — 75% < 90% target
|
|
467
|
+
|
|
468
|
+
**Recommendation**: Status should be PARTIAL, not SUCCESS.
|
|
469
|
+
Reason: High-quality work that did not achieve the stated objective.
|
|
470
|
+
```
|
|
471
|
+
|
|
472
|
+
### Goal vs Trust: Key Distinction
|
|
473
|
+
|
|
474
|
+
| Aspect | Trust Gate | Goal Gate |
|
|
475
|
+
|--------|------------|-----------|
|
|
476
|
+
| **What it checks** | Evidence quality and rigor | Goal achievement |
|
|
477
|
+
| **Score/Metric** | Trust Score (0-100) | Binary: Met/Not Met |
|
|
478
|
+
| **Can fail independently** | Yes | Yes |
|
|
479
|
+
| **Examples of failure** | Missing CI, no baseline | 75% accuracy when goal was 90% |
|
|
480
|
+
|
|
481
|
+
**Critical Rule**: A researcher can do excellent, rigorous work (Trust = 90) and still fail to achieve the goal. This is PARTIAL, not SUCCESS. Both gates must pass for SUCCESS.
|
|
482
|
+
|
|
402
483
|
## Independent Verification Patterns
|
|
403
484
|
|
|
404
485
|
When challenging claims, perform these verification checks:
|
package/src/agent/gyoshu.md
CHANGED
|
@@ -542,6 +542,186 @@ For backwards compatibility, you can still use researchId-based creation:
|
|
|
542
542
|
2. Add a run with `research-manager` (action: addRun, runId: "run-xxx", data: {goal, mode})
|
|
543
543
|
3. Initialize notebook with `notebook-writer` (action: ensure_notebook, notebookPath: "...")
|
|
544
544
|
|
|
545
|
+
### Goal Contract Creation
|
|
546
|
+
|
|
547
|
+
**CRITICAL**: Every research must have a **Goal Contract** that defines measurable acceptance criteria. The Goal Contract enables the **Two-Gate Verification System** — research cannot be marked SUCCESS without passing BOTH the Trust Gate (evidence quality) AND the Goal Gate (acceptance criteria met).
|
|
548
|
+
|
|
549
|
+
#### Why Goal Contracts?
|
|
550
|
+
|
|
551
|
+
Without a Goal Contract, research can be incorrectly marked SUCCESS when:
|
|
552
|
+
- Evidence quality is high (Trust Gate passes), BUT
|
|
553
|
+
- The actual goal was not achieved (e.g., "Build 90% accurate model" achieved 75%)
|
|
554
|
+
|
|
555
|
+
The Goal Contract makes acceptance criteria **explicit and verifiable**.
|
|
556
|
+
|
|
557
|
+
#### Goal Contract Format
|
|
558
|
+
|
|
559
|
+
Include goal_contract in the notebook frontmatter or run configuration:
|
|
560
|
+
|
|
561
|
+
```yaml
gyoshu:
  goal_contract:
    version: 1
    goal_text: "Build a churn prediction model with 90% accuracy"
    goal_type: "ml_classification"  # ml_classification | ml_regression | statistical_test | eda | custom
    max_goal_attempts: 3            # Maximum pivots before BLOCKED
    acceptance_criteria:
      - id: AC1
        kind: metric_threshold      # metric_threshold | artifact_exists | statistical_significance
        metric: cv_accuracy_mean    # Must match a [METRIC:*] marker
        op: ">="                    # >= | > | <= | < | == | !=
        target: 0.90

      - id: AC2
        kind: metric_threshold
        metric: cv_accuracy_std
        op: "<="
        target: 0.05                # Low variance required

      - id: AC3
        kind: artifact_exists
        artifact: "model.pkl"       # Must exist in reports/{reportTitle}/models/
```
|
|
585
|
+
|
|
586
|
+
#### Goal Contract Fields
|
|
587
|
+
|
|
588
|
+
| Field | Required | Description |
|
|
589
|
+
|-------|----------|-------------|
|
|
590
|
+
| `version` | Yes | Schema version (currently: 1) |
|
|
591
|
+
| `goal_text` | Yes | Human-readable goal statement |
|
|
592
|
+
| `goal_type` | Yes | Category for validation rules |
|
|
593
|
+
| `max_goal_attempts` | No | Max pivots before BLOCKED (default: 3) |
|
|
594
|
+
| `acceptance_criteria` | Yes | List of verifiable criteria |
|
|
595
|
+
|
|
596
|
+
#### Acceptance Criteria Kinds
|
|
597
|
+
|
|
598
|
+
| Kind | Use When | Verification Method |
|
|
599
|
+
|------|----------|---------------------|
|
|
600
|
+
| `metric_threshold` | Numeric target (accuracy, R², etc.) | Compare `[METRIC:*]` marker value against target |
|
|
601
|
+
| `artifact_exists` | Required output file | Check file exists in reports directory |
|
|
602
|
+
| `statistical_significance` | Hypothesis testing | Verify p-value < alpha AND effect size reported |
|
|
603
|
+
|
|
604
|
+
#### Two-Gate Decision Matrix
|
|
605
|
+
|
|
606
|
+
**CRITICAL**: SUCCESS requires BOTH gates to pass. This is the core rule.
|
|
607
|
+
|
|
608
|
+
| Trust Gate | Goal Gate | Result | Action |
|
|
609
|
+
|------------|-----------|--------|--------|
|
|
610
|
+
| PASS (≥80) | MET | **SUCCESS** | Accept result, finalize research |
|
|
611
|
+
| PASS (≥80) | NOT_MET | **PARTIAL** | Pivot: try alternative approach |
|
|
612
|
+
| PASS (≥80) | BLOCKED | **BLOCKED** | Cannot achieve goal, report to user |
|
|
613
|
+
| FAIL (<80) | MET | **PARTIAL** | Rework: strengthen evidence |
|
|
614
|
+
| FAIL (<80) | NOT_MET | **PARTIAL** | Rework both evidence and approach |
|
|
615
|
+
| FAIL (<80) | BLOCKED | **BLOCKED** | Fundamental issue, escalate to user |
|
|
616
|
+
|
|
617
|
+
**Gate Definitions:**
|
|
618
|
+
- **Trust Gate**: Baksa verification score ≥ 80 (evidence quality)
|
|
619
|
+
- **Goal Gate**: All acceptance_criteria in goal_contract are satisfied
|
|
620
|
+
|
|
621
|
+
#### Goal Gate Evaluation
|
|
622
|
+
|
|
623
|
+
After Trust Gate evaluation, check Goal Gate:
|
|
624
|
+
|
|
625
|
+
```
FUNCTION evaluateGoalGate(goalContract, snapshot):
  FOR each criterion IN goalContract.acceptance_criteria:

    IF criterion.kind == "metric_threshold":
      metricValue = findMetric(snapshot, criterion.metric)
      IF NOT compare(metricValue, criterion.op, criterion.target):
        RETURN { status: "NOT_MET", failed: criterion.id }

    ELSE IF criterion.kind == "artifact_exists":
      IF NOT artifactExists(criterion.artifact):
        RETURN { status: "NOT_MET", failed: criterion.id }

    ELSE IF criterion.kind == "statistical_significance":
      IF NOT hasValidFinding(criterion):
        RETURN { status: "NOT_MET", failed: criterion.id }

  RETURN { status: "MET" }
```
|
|
644
|
+
|
|
645
|
+
#### Pivot Protocol
|
|
646
|
+
|
|
647
|
+
When the Trust Gate passes but the Goal Gate is NOT_MET:
|
|
648
|
+
|
|
649
|
+
```
PIVOT PROTOCOL (Trust ≥ 80, Goal NOT_MET):

1. Increment goal_attempt counter
2. IF goal_attempt >= max_goal_attempts:
   → Status: BLOCKED
   → Report: "Goal not achievable after {N} attempts"
   → Present options to user

3. ELSE:
   → Status: PARTIAL
   → Analyze WHY goal not met:
     - Which criteria failed?
     - What was the gap? (e.g., achieved 85% vs target 90%)
     - What approaches were tried?

   → Generate pivot strategy:
     - Alternative algorithm?
     - More feature engineering?
     - Relaxed criteria (with user approval)?
     - Different data preprocessing?

   → Delegate to @jogyo with pivot context:
     "PIVOT ATTEMPT {N}/{max}
      Previous: XGBoost achieved 85% accuracy
      Gap: 5% below 90% target
      Try: Random Forest with hyperparameter tuning"
```
|
|
677
|
+
|
|
678
|
+
#### Example: Two-Gate Verification Flow
|
|
679
|
+
|
|
680
|
+
```
1. @jogyo completes model training
   → Signals: gyoshu_completion(status: "SUCCESS", evidence: {...})

2. Gyoshu gets snapshot
   → gyoshu_snapshot(researchSessionID: "...")

3. TRUST GATE: Invoke @baksa
   → Trust Score: 85 (PASS ✓)
   → Evidence quality verified

4. GOAL GATE: Check acceptance criteria
   → AC1: cv_accuracy_mean = 0.87 (target: 0.90) → FAIL
   → Goal Gate: NOT_MET ✗

5. TWO-GATE RESULT: Trust PASS + Goal NOT_MET = PARTIAL
   → Do NOT mark SUCCESS
   → Trigger PIVOT PROTOCOL

6. PIVOT: Attempt 1/3
   → Delegate to @jogyo with alternative approach
   → "Try ensemble with feature selection"

7. (Repeat verification after pivot...)

8. FINAL: Trust PASS + Goal MET = SUCCESS
   → Accept result, finalize research
```
|
|
708
|
+
|
|
709
|
+
#### Never Accept SUCCESS Without Goal Gate
|
|
710
|
+
|
|
711
|
+
**HARD RULE**: Even with perfect evidence (Trust = 100), if acceptance criteria are not met, the result is PARTIAL, not SUCCESS.
|
|
712
|
+
|
|
713
|
+
```
❌ WRONG:
   Trust: 95 (excellent evidence!)
   Goal: 85% accuracy (target was 90%)
   → "SUCCESS"  // NO! Goal not met

✓ CORRECT:
   Trust: 95 (excellent evidence!)
   Goal: 85% accuracy (target was 90%)
   → "PARTIAL" → Pivot or report gap to user
```
|
|
724
|
+
|
|
545
725
|
### Continuing Research
|
|
546
726
|
|
|
547
727
|
When continuing existing research with notebook-centric workflow:
|