gyoshu 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (59)
  1. package/AGENTS.md +1039 -0
  2. package/README.ja.md +390 -0
  3. package/README.ko.md +385 -0
  4. package/README.md +459 -0
  5. package/README.zh.md +383 -0
  6. package/bin/gyoshu.js +295 -0
  7. package/install.sh +241 -0
  8. package/package.json +65 -0
  9. package/src/agent/baksa.md +494 -0
  10. package/src/agent/executor.md +1851 -0
  11. package/src/agent/gyoshu.md +2351 -0
  12. package/src/agent/jogyo-feedback.md +137 -0
  13. package/src/agent/jogyo-insight.md +359 -0
  14. package/src/agent/jogyo-paper-writer.md +370 -0
  15. package/src/agent/jogyo.md +1445 -0
  16. package/src/agent/plan-reviewer.md +1862 -0
  17. package/src/agent/plan.md +97 -0
  18. package/src/agent/task-orchestrator.md +1121 -0
  19. package/src/bridge/gyoshu_bridge.py +782 -0
  20. package/src/command/analyze-knowledge.md +840 -0
  21. package/src/command/analyze-plans.md +513 -0
  22. package/src/command/execute.md +893 -0
  23. package/src/command/generate-policy.md +924 -0
  24. package/src/command/generate-suggestions.md +1111 -0
  25. package/src/command/gyoshu-auto.md +258 -0
  26. package/src/command/gyoshu.md +1352 -0
  27. package/src/command/learn.md +1181 -0
  28. package/src/command/planner.md +630 -0
  29. package/src/lib/artifact-security.ts +159 -0
  30. package/src/lib/atomic-write.ts +107 -0
  31. package/src/lib/cell-identity.ts +176 -0
  32. package/src/lib/checkpoint-schema.ts +455 -0
  33. package/src/lib/environment-capture.ts +181 -0
  34. package/src/lib/filesystem-check.ts +84 -0
  35. package/src/lib/literature-client.ts +1048 -0
  36. package/src/lib/marker-parser.ts +474 -0
  37. package/src/lib/notebook-frontmatter.ts +835 -0
  38. package/src/lib/paths.ts +799 -0
  39. package/src/lib/pdf-export.ts +340 -0
  40. package/src/lib/quality-gates.ts +369 -0
  41. package/src/lib/readme-index.ts +462 -0
  42. package/src/lib/report-markdown.ts +870 -0
  43. package/src/lib/session-lock.ts +411 -0
  44. package/src/plugin/gyoshu-hooks.ts +140 -0
  45. package/src/skill/data-analysis/SKILL.md +369 -0
  46. package/src/skill/experiment-design/SKILL.md +374 -0
  47. package/src/skill/ml-rigor/SKILL.md +672 -0
  48. package/src/skill/scientific-method/SKILL.md +331 -0
  49. package/src/tool/checkpoint-manager.ts +1387 -0
  50. package/src/tool/gyoshu-completion.ts +493 -0
  51. package/src/tool/gyoshu-snapshot.ts +745 -0
  52. package/src/tool/literature-search.ts +389 -0
  53. package/src/tool/migration-tool.ts +1404 -0
  54. package/src/tool/notebook-search.ts +794 -0
  55. package/src/tool/notebook-writer.ts +391 -0
  56. package/src/tool/python-repl.ts +1038 -0
  57. package/src/tool/research-manager.ts +1494 -0
  58. package/src/tool/retrospective-store.ts +347 -0
  59. package/src/tool/session-manager.ts +565 -0
package/AGENTS.md ADDED
@@ -0,0 +1,1039 @@
# AGENTS.md - Gyoshu Repository Guide

> Guidelines for AI agents operating in this repository.

## Overview

Gyoshu is a scientific research agent extension for OpenCode. It provides:
- Persistent Python REPL with structured output markers
- Jupyter notebook integration for reproducible research
- Session management for research workflows

## The Agent Team

| Agent | Role | Korean | What They Do |
|-------|------|--------|--------------|
| **Gyoshu** | Professor | 교수 | Plans research, orchestrates workflow, manages sessions |
| **Jogyo** | Teaching Assistant | 조교 | Executes Python code, runs experiments, generates outputs |
| **Baksa** | PhD Reviewer | 박사 | Adversarial verifier: challenges claims, calculates trust scores |
| **Jogyo Paper Writer** | Grad Student | 조교 | Transforms raw findings into narrative research reports |

## Build & Test Commands

### Python Tests (pytest)

```bash
# Run all tests
pytest

# Run with verbose output (default via pyproject.toml)
pytest -v --tb=short

# Run a single test file
pytest tests/test_bridge.py

# Run a specific test class
pytest tests/test_bridge.py::TestParseMarkers

# Run a single test
pytest tests/test_bridge.py::TestParseMarkers::test_simple_marker

# Run with coverage
pytest --cov=src/bridge --cov-report=term-missing
```

### TypeScript/JavaScript (Bun)

```bash
# Run all tests
bun test

# Watch mode for development
bun test --watch

# Run a specific test file
bun test src/tool/session-manager.test.ts
```

### No Build Step Required

This is an OpenCode extension; TypeScript files are executed directly by Bun, so no compilation is needed.

## Code Style Guidelines

### Python (.py files)

#### Imports
```python
# Standard library first (alphabetical within each category)
import argparse
import json
import os
import sys
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Callable

# Third-party next (blank line before)
import pytest

# Local imports last (blank line before)
from gyoshu_bridge import parse_markers, execute_code
```

#### Type Hints (Required)
```python
from typing import Any, Dict, List, Optional

def execute_code(code: str, namespace: dict) -> Dict[str, Any]:
    """Execute Python code in the given namespace."""
    ...

def parse_markers(text: str) -> List[Dict]:
    ...
```

#### Docstrings
```python
"""Module-level docstring at the top of each file.

Describe the module's purpose and key components.
Include protocol formats, methods, or usage examples.
"""

def send_response(
    id: Optional[str],
    result: Optional[Dict] = None,
    error: Optional[Dict] = None
) -> None:
    """Send JSON-RPC 2.0 response via protocol channel."""
    ...
```

#### Naming Conventions
- `UPPER_SNAKE_CASE` for constants: `JSON_RPC_VERSION`, `ERROR_PARSE`
- `PascalCase` for classes: `ExecutionState`, `TestParseMarkers`
- `snake_case` for functions/variables: `send_response`, `parse_markers`
- `_leading_underscore` for private/internal: `_send_protocol`, `_protocol_fd`

#### Section Organization
```python
# =============================================================================
# SECTION NAME IN ALL CAPS
# =============================================================================

# Code for this section...
```

#### Error Handling
```python
# Use specific exception types with descriptive context
try:
    result = json.loads(data)
except json.JSONDecodeError as e:
    return make_error(ERROR_PARSE, f"Parse error: {e}")
except TimeoutError as e:
    result["exception"] = str(e)
    result["exception_type"] = "TimeoutError"
except KeyboardInterrupt:
    result["exception"] = "Execution interrupted"
    result["exception_type"] = "KeyboardInterrupt"
except Exception as e:
    # Last resort - always include type and message
    result["exception"] = str(e)
    result["exception_type"] = type(e).__name__

# Never use bare except
# Never silently swallow exceptions
```

### TypeScript (.ts files)

#### Imports
```typescript
// External packages first
import { tool } from "@opencode-ai/plugin";

// Built-in Node modules next
import * as fs from "fs/promises";
import * as path from "path";
import * as os from "os";

// Local modules last (use multi-line for readability)
import { durableAtomicWrite, fileExists, readFile } from "../lib/atomic-write";
import {
  getRuntimeDir,
  getSessionDir,
  ensureDirSync,
  existsSync,
} from "../lib/paths";
```

#### JSDoc Comments
```typescript
/**
 * Session Manager - OpenCode tool for managing Gyoshu research sessions
 *
 * Provides CRUD operations for session manifests with:
 * - Atomic, durable writes to prevent data corruption
 * - Cell execution tracking with content hashes
 *
 * @module session-manager
 */

// Import from centralized path resolver (see src/lib/paths.ts)
import { getRuntimeDir, getResearchDir } from "../lib/paths";

/**
 * Get the runtime directory for a specific session.
 * Uses centralized path resolver for consistency.
 */
const runtimeDir = getRuntimeDir(sessionId);

/**
 * Get the research directory for storing research manifests.
 * Always use path helpers instead of hardcoding paths.
 */
const researchDir = getResearchDir();
```

#### Interfaces and Types
```typescript
// Descriptive JSDoc for each interface
/**
 * Environment metadata captured for reproducibility.
 */
interface EnvironmentMetadata {
  /** Python interpreter version */
  pythonVersion: string;
  /** Operating system platform */
  platform: string;
}

// Use type for unions
type SessionMode = "PLANNER" | "AUTO" | "REPL";
type GoalStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "BLOCKED";
```

#### Naming Conventions
- `PascalCase` for interfaces/types: `SessionManifest`, `CellExecution`
- `UPPER_SNAKE_CASE` for constants: `DEFAULT_TIMEOUT`, `MAX_RETRIES`
- `camelCase` for variables/functions: `researchSessionID`, `readFile`

### Test Files

#### Python Tests (pytest)
```python
import pytest

class TestModuleName:
    """Tests for module_name - brief description."""

    def test_specific_behavior(self):
        """What this test verifies."""
        result = function_under_test(input)
        assert result["expected_key"] == expected_value

    @pytest.fixture
    def setup_data(self):
        """Fixture description."""
        return {"test": "data"}
```

#### TypeScript Tests (Bun)
```typescript
import { describe, test, expect } from "bun:test";

describe("ModuleName", () => {
  test("specific behavior", () => {
    const result = functionUnderTest(input);
    expect(result.expectedKey).toBe(expectedValue);
  });
});
```

## Slash Commands

Gyoshu provides **two commands** for all research operations:

| Command | Purpose |
|---------|---------|
| `/gyoshu [subcommand\|goal]` | Unified interactive research command |
| `/gyoshu-auto <goal>` | Autonomous research (hands-off bounded execution) |

### `/gyoshu` - Unified Research Command

The main entry point for all research operations. Supports subcommands and direct goals.

| Subcommand | Description | Example |
|------------|-------------|---------|
| *(no args)* | Show status and suggestions | `/gyoshu` |
| `<goal>` | Start new research with discovery | `/gyoshu analyze customer churn` |
| `plan <goal>` | Create research plan only | `/gyoshu plan classify iris species` |
| `continue [id]` | Continue existing research | `/gyoshu continue iris-clustering` |
| `list [--status X]` | List all research projects | `/gyoshu list --status active` |
| `search <query>` | Search research projects & notebooks | `/gyoshu search "correlation"` |
| `report [id]` | Generate research report | `/gyoshu report` |
| `repl <query>` | Direct REPL exploration | `/gyoshu repl show df columns` |
| `migrate [--options]` | Migrate legacy data | `/gyoshu migrate --to-notebooks` |
| `replay <sessionId>` | Replay for reproducibility | `/gyoshu replay ses_abc123` |
| `unlock <sessionId>` | Unlock stuck session | `/gyoshu unlock ses_abc123` |
| `abort [sessionId]` | Abort current research | `/gyoshu abort` |
| `doctor` | Check system health and diagnose issues | `/gyoshu doctor` |
| `help` | Show usage and examples | `/gyoshu help` |

### `/gyoshu-auto` - Autonomous Research

Runs research autonomously with bounded cycles (max 10). Executes until completion, blocked, or budget exhausted.

```bash
/gyoshu-auto analyze wine quality factors using XGBoost
```

Use this when you have a clear goal and want hands-off execution.

### Quick Examples

```bash
# See current status and suggestions
/gyoshu

# Start interactive research (searches for similar prior work first)
/gyoshu analyze customer churn patterns

# Continue previous research
/gyoshu continue churn-analysis

# Search across all notebooks and research
/gyoshu search "feature importance"

# Generate a report for the current research
/gyoshu report

# Hands-off autonomous research
/gyoshu-auto cluster wine dataset and identify quality predictors
```

## Adversarial Verification Protocol

Gyoshu implements a "Never Trust" philosophy where every claim from Jogyo must be verified by Baksa before acceptance.

### The Challenge Loop

1. **Jogyo Completes Work**: Signals completion with evidence via `gyoshu_completion`
2. **Gyoshu Gets Snapshot**: Reviews current state via `gyoshu_snapshot`
3. **Baksa Challenges**: Generates probing questions and calculates trust score
4. **Decision**:
   - Trust >= 80: VERIFIED - Accept result
   - Trust 60-79: PARTIAL - Accept with caveats
   - Trust < 60: DOUBTFUL - Request rework from Jogyo
5. **Max 3 Rounds**: If verification fails 3 times, escalate to BLOCKED

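The decision rule above can be sketched as a small function. This is purely illustrative; the function and constant names are hypothetical and are not Gyoshu's actual implementation:

```python
def verification_decision(trust_score: int, round_number: int, max_rounds: int = 3) -> str:
    """Map a Baksa trust score to a challenge-loop decision (illustrative sketch)."""
    if trust_score >= 80:
        return "VERIFIED"   # accept the result
    if trust_score >= 60:
        return "PARTIAL"    # accept with caveats
    if round_number >= max_rounds:
        return "BLOCKED"    # escalate after repeated failed verifications
    return "DOUBTFUL"       # request rework from Jogyo
```

In this sketch, a doubtful result on the final allowed round escalates to BLOCKED instead of requesting another rework.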
### Trust Score Components

| Component | Weight | Description |
|-----------|--------|-------------|
| Statistical Rigor | 30% | CI reported, effect size calculated, assumptions checked |
| Evidence Quality | 25% | Artifacts exist, code is reproducible |
| Metric Verification | 20% | Independent checks match claimed values |
| Completeness | 15% | All objectives addressed |
| Methodology | 10% | Sound approach, appropriate tests |

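Assuming each component is scored 0-100, the weighted aggregation implied by the table can be sketched as follows (the dictionary keys are hypothetical names mirroring the table, not Baksa's actual code):

```python
# Weights mirror the Trust Score Components table above.
WEIGHTS = {
    "statistical_rigor": 0.30,
    "evidence_quality": 0.25,
    "metric_verification": 0.20,
    "completeness": 0.15,
    "methodology": 0.10,
}

def trust_score(components: dict) -> float:
    """Weighted sum of per-component scores (each scored 0-100)."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

For example, perfect component scores yield 100, while halving Statistical Rigor alone drops the total by 15 points.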
### Automatic Rejection Triggers

The following immediately reduce the trust score by 30 points:
- `[FINDING]` without accompanying `[STAT:ci]`
- `[FINDING]` without accompanying `[STAT:effect_size]`
- "Significant" claim without p-value
- Correlation claim without effect size interpretation
- ML metrics without baseline comparison

### Challenge Response Markers

When Jogyo responds to challenges, use these markers:

```python
# Respond to a specific challenge (N = challenge number)
print("[CHALLENGE-RESPONSE:1] Re-verified correlation with alternative method")

# Provide reproducible verification code
print("[VERIFICATION-CODE] df['accuracy'].mean() == 0.95")

# Show independent cross-validation
print("[INDEPENDENT-CHECK] 5-fold CV confirms accuracy: 0.94 ± 0.02")
```

### Example Challenge Flow

```
1. Jogyo: "Model accuracy is 95%"
2. Baksa challenges:
   - "Re-run with different random seed"
   - "Show confusion matrix"
   - "What's the baseline accuracy?"
3. Trust Score: 45 (DOUBTFUL)
4. Gyoshu sends rework request to Jogyo
5. Jogyo responds with enhanced evidence
6. Baksa re-evaluates: Trust Score 82 (VERIFIED)
7. Gyoshu accepts result
```

## Research Quality Standards

Gyoshu enforces **senior data scientist level** research quality through hard quality gates. Every claim must have statistical evidence before becoming a verified finding.

### The Claim Contract

Every finding must include:
```
Claim → Data slice → Method/Test → Assumptions →
Estimate + CI → Effect size → p-value → Robustness checks →
Practical "so what"
```

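As an illustration, the chain above can be modeled as a record whose fields must all be populated before a claim is reportable. The field names below are hypothetical, chosen only to mirror the contract:

```python
from dataclasses import dataclass, field

@dataclass
class ClaimRecord:
    """Illustrative container for the claim contract chain (hypothetical names)."""
    claim: str
    data_slice: str
    method: str
    assumptions: list
    estimate: float
    ci: tuple                      # (lower, upper) bounds
    effect_size: float
    p_value: float
    robustness_checks: list = field(default_factory=list)
    so_what: str = ""

    def is_complete(self) -> bool:
        """Reportable only when every element of the chain is present."""
        return bool(self.claim and self.method and self.robustness_checks and self.so_what)
```

A record missing its robustness checks or practical "so what" fails `is_complete()`, which matches the gates listed below.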
### Quality Gates

| Gate | Requirement | Consequence if Missing |
|------|-------------|------------------------|
| Hypothesis | H0/H1 stated before analysis | Finding marked "exploratory" |
| Confidence Interval | 95% CI reported | Finding rejected |
| Effect Size | Cohen's d, r², or OR reported | Finding rejected |
| Assumptions | Statistical assumptions checked | Warning flag |
| Robustness | At least one sensitivity check | Warning flag |
| So What | Practical significance explained | Finding incomplete |

### Finding Categories

| Category | Trust Score | Report Section |
|----------|-------------|----------------|
| **Verified Findings** | ≥ 80 | Key Findings |
| **Partial Findings** | 60-79 | Findings (with caveats) |
| **Exploratory Notes** | < 60 | Exploratory Observations |

### How Quality Gates Work

Quality gates are automated checks that run at research completion (via the `gyoshu-completion` tool) to enforce statistical rigor. The system:

1. **Scans notebook outputs** for structured markers
2. **Validates findings** using the "Finding Gating Rule"
3. **Validates ML pipelines** for required components
4. **Calculates quality score** (100 - sum of penalties)
5. **Categorizes findings** as Verified, Partial, or Exploratory

#### The Finding Gating Rule

Every `[FINDING]` marker must have supporting evidence within 10 lines BEFORE it:
- `[STAT:ci]` - Confidence interval (required)
- `[STAT:effect_size]` - Effect magnitude (required)

If either is missing, the finding is marked as **unverified** and goes to "Exploratory Observations" in the report.

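A minimal sketch of this rule, assuming plain-text output lines (illustrative only; the repository's actual check lives in the TypeScript tooling such as `src/lib/quality-gates.ts`):

```python
REQUIRED_EVIDENCE = ("[STAT:ci]", "[STAT:effect_size]")

def gate_findings(lines, window: int = 10):
    """Return (line_index, verified) for each [FINDING] line, requiring both
    evidence markers to appear within `window` lines before it."""
    results = []
    for i, line in enumerate(lines):
        if "[FINDING]" in line:
            context = lines[max(0, i - window):i]
            verified = all(any(marker in c for c in context) for marker in REQUIRED_EVIDENCE)
            results.append((i, verified))
    return results
```

A finding preceded by both markers passes; a bare finding is flagged as unverified.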
#### Quality Score Calculation

```
Quality Score = 100 - (sum of all penalties)
```

| Violation | Penalty | Description |
|-----------|---------|-------------|
| `FINDING_NO_CI` | -30 | Finding without confidence interval |
| `FINDING_NO_EFFECT_SIZE` | -30 | Finding without effect size |
| `ML_NO_BASELINE` | -20 | ML metrics without baseline comparison |
| `ML_NO_CV` | -25 | ML metrics without cross-validation |
| `ML_NO_INTERPRETATION` | -15 | ML metrics without feature importance |

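The penalty table translates directly into a small scoring function. This is a sketch with hypothetical names; the score is floored at 0 here, consistent with the 0-59 band in the decision table below:

```python
# Penalty values mirror the violation table above.
PENALTIES = {
    "FINDING_NO_CI": 30,
    "FINDING_NO_EFFECT_SIZE": 30,
    "ML_NO_BASELINE": 20,
    "ML_NO_CV": 25,
    "ML_NO_INTERPRETATION": 15,
}

def quality_score(violations) -> int:
    """Quality Score = 100 - (sum of all penalties), floored at 0."""
    return max(0, 100 - sum(PENALTIES[v] for v in violations))
```

For example, a finding missing both its CI and effect size scores 40/100, matching the "BAD" example shown later in this section.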
#### Quality Gate Decision

| Score Range | Status | Result |
|-------------|--------|--------|
| 100 | SUCCESS | All quality gates passed |
| 80-99 | PARTIAL | Minor issues, findings still accepted |
| 60-79 | PARTIAL | Some findings moved to exploratory |
| 0-59 | PARTIAL | Significant quality issues |

> **Note**: Quality gates never block completion, but they do affect how findings are categorized in reports.

### Good vs Bad Findings: Examples

Understanding the difference between verified and exploratory findings:

#### BAD: Exploratory Finding (Missing Evidence)

```python
# ❌ This finding will be marked EXPLORATORY (-60 in penalties)
print("[FINDING] Model accuracy is 95%")

# Why it fails:
# - No [STAT:ci] within 10 lines before
# - No [STAT:effect_size] within 10 lines before
```

This produces:
```
Quality Score: 40/100
Violations:
- FINDING_NO_CI: Missing confidence interval (-30)
- FINDING_NO_EFFECT_SIZE: Missing effect size (-30)
Report Section: "Exploratory Observations" (not trusted)
```

#### GOOD: Verified Finding (Full Evidence)

```python
# ✅ This finding will be VERIFIED (score 100)

# 1. Statistical evidence BEFORE the finding
print("[STAT:estimate] accuracy = 0.95")
print("[STAT:ci] 95% CI [0.93, 0.97]")
print("[STAT:effect_size] Cohen's d = 0.82 (large improvement over baseline)")
print("[STAT:p_value] p < 0.001")

# 2. NOW state the finding with summary evidence
print("[FINDING] Model (accuracy=0.95) significantly outperforms baseline "
      "(d=0.82, 95% CI [0.93, 0.97], p<0.001)")

# 3. Explain practical significance
print("[SO_WHAT] This means 40% fewer false negatives in fraud detection")
```

This produces:
```
Quality Score: 100/100
Violations: None
Report Section: "Key Findings" (trusted, verified)
```

#### ML Pipeline: Complete Example

```python
# ✅ Complete ML pipeline with all required markers
# Assumes rf_model, X, y are defined in earlier cells
from sklearn.model_selection import cross_val_score

# 1. Baseline comparison (REQUIRED)
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='stratified')
dummy_scores = cross_val_score(dummy, X, y, cv=5)
print(f"[METRIC:baseline_accuracy] {dummy_scores.mean():.3f}")

# 2. Model cross-validation (REQUIRED)
scores = cross_val_score(rf_model, X, y, cv=5)
print(f"[METRIC:cv_accuracy_mean] {scores.mean():.3f}")
print(f"[METRIC:cv_accuracy_std] {scores.std():.3f}")

# 3. Feature interpretation (REQUIRED)
importances = rf_model.feature_importances_
print(f"[METRIC:feature_importance] age={importances[0]:.2f}, income={importances[1]:.2f}")

# 4. Statistical evidence for finding (approximate CI from CV fold scores)
improvement = scores.mean() - dummy_scores.mean()
ci_low, ci_high = scores.mean() - 1.96*scores.std(), scores.mean() + 1.96*scores.std()
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[STAT:effect_size] Improvement = {improvement:.3f} ({improvement/dummy_scores.std():.1f}σ)")

# 5. Verified finding
print(f"[FINDING] Random Forest achieves {scores.mean():.1%} accuracy, "
      f"outperforming baseline by {improvement:.1%} (95% CI [{ci_low:.3f}, {ci_high:.3f}])")
```

## Structured Output Markers

When working with Gyoshu REPL output, use these markers:

### Research Process Markers

```python
# Research Process
print("[OBJECTIVE] Research goal statement")
print("[HYPOTHESIS] H0: no effect; H1: treatment improves outcome")
print("[CONCLUSION] Final conclusions with evidence summary")
```

### Statistical Evidence Markers (REQUIRED for Findings)

```python
# Test Decision - explain why this test
print("[DECISION] Using Welch's t-test: two independent groups, unequal variance")

# Assumption Checking
print("[CHECK:normality] Shapiro-Wilk p=0.23 - normality assumption OK")
print("[CHECK:homogeneity] Levene's p=0.04 - using Welch's (unequal var)")

# Statistical Results (ALL required before [FINDING])
print(f"[STAT:estimate] mean_diff = {mean_diff:.3f}")
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[STAT:effect_size] Cohen's d = {d:.3f} (medium)")
print(f"[STAT:p_value] p = {p:.4f}")

# Robustness Check
print("[INDEPENDENT_CHECK] Bootstrap 95% CI: [0.12, 0.28] - consistent")

# Only AFTER above evidence:
print("[FINDING] Treatment shows medium effect (d=0.45, 95% CI [0.2, 0.7])")

# Practical Significance
print("[SO_WHAT] Effect translates to $50K annual savings per customer segment")

# Limitations
print("[LIMITATION] Self-selection bias - users opted in voluntarily")
```

### ML Pipeline Markers

```python
# Baseline (REQUIRED before claiming model performance)
print(f"[METRIC:baseline_accuracy] {dummy_score:.3f}")

# Cross-Validation (REQUIRED - report mean ± std)
print(f"[METRIC:cv_accuracy_mean] {scores.mean():.3f}")
print(f"[METRIC:cv_accuracy_std] {scores.std():.3f}")

# Model Performance with CI
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[METRIC:improvement_over_baseline] {improvement:.3f}")

# Interpretation (REQUIRED)
print("[METRIC:top_features] age (0.23), income (0.18), tenure (0.15)")
print("[FINDING] Random Forest (AUC=0.82) outperforms baseline (0.65) by 0.17")
print("[SO_WHAT] Model identifies 80% of churners in top 20% of predictions")
```

### Data Operations Markers

```python
print("[DATA] Dataset description")
print(f"[SHAPE] {df.shape}")
print(f"[METRIC:missing_rate] {missing_pct:.1f}%")
```

### Legacy Markers (Still Supported)

```python
print("[PATTERN] Identified pattern")
print("[OBSERVATION] Descriptive observation")
print("[EXPERIMENT] Experimental setup description")
```

### Complete Statistical Analysis Example

```python
import numpy as np
from scipy.stats import ttest_ind, shapiro, levene

# treatment and control are 1-D NumPy arrays defined in earlier cells

# 1. State hypothesis
print("[HYPOTHESIS] H0: No difference between groups; H1: Treatment > Control")
print("[DECISION] Using Welch's t-test for independent samples")

# 2. Check assumptions
_, p_norm_t = shapiro(treatment)
_, p_norm_c = shapiro(control)
print(f"[CHECK:normality] Treatment p={p_norm_t:.3f}, Control p={p_norm_c:.3f}")

_, p_var = levene(treatment, control)
print(f"[CHECK:homogeneity] Levene's p={p_var:.3f} - using Welch's t-test")

# 3. Run test
t_stat, p_value = ttest_ind(treatment, control, equal_var=False)

# 4. Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(treatment)-1)*treatment.std()**2 +
                      (len(control)-1)*control.std()**2) /
                     (len(treatment) + len(control) - 2))
cohens_d = (treatment.mean() - control.mean()) / pooled_std

# 5. Calculate CI for difference
from scipy.stats import sem
mean_diff = treatment.mean() - control.mean()
se_diff = np.sqrt(sem(treatment)**2 + sem(control)**2)
ci_low = mean_diff - 1.96 * se_diff
ci_high = mean_diff + 1.96 * se_diff

# 6. Report ALL statistics
print(f"[STAT:estimate] mean_diff = {mean_diff:.3f}")
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[STAT:effect_size] Cohen's d = {cohens_d:.3f} ({'small' if abs(cohens_d) < 0.5 else 'medium' if abs(cohens_d) < 0.8 else 'large'})")
print(f"[STAT:p_value] p = {p_value:.4f}")

# 7. Robustness check
from scipy.stats import mannwhitneyu
_, p_mw = mannwhitneyu(treatment, control, alternative='greater')
print(f"[INDEPENDENT_CHECK] Mann-Whitney U p={p_mw:.4f} (non-parametric confirmation)")

# 8. NOW state finding with full evidence
print(f"[FINDING] Treatment shows {'significant' if p_value < 0.05 else 'no significant'} effect "
      f"(d={cohens_d:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}], p={p_value:.4f})")

# 9. Practical significance
print(f"[SO_WHAT] A {abs(cohens_d):.1f}σ effect means ~{abs(mean_diff)*100:.0f} unit improvement per customer")

# 10. Limitations
print("[LIMITATION] Single time point; longitudinal effects unknown")
```

## Report Generation

Gyoshu can generate publication-quality research reports from notebooks and export them to PDF.

### Report Markers

Reports are generated by extracting structured markers from notebook cell outputs. Use these markers in your REPL output to populate report sections:

| Marker | Report Section | Description |
|--------|----------------|-------------|
| `[OBJECTIVE]` | Executive Summary | Research goal statement |
| `[HYPOTHESIS]` | Hypotheses | Proposed explanations |
| `[METRIC:name]` | Performance Metrics | Named metrics with values |
| `[FINDING]` | Key Findings | Important discoveries |
| `[LIMITATION]` | Limitations | Known constraints |
| `[NEXT_STEP]` | Recommended Next Steps | Follow-up actions |
| `[CONCLUSION]` | Conclusion | Final summary |

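The extraction idea behind the table above can be sketched as a marker-to-section router. Gyoshu ships its own marker parser (`src/lib/marker-parser.ts`); the sketch below only illustrates the concept, and the regex and names are hypothetical:

```python
import re

# Map marker names to report sections, following the table above.
MARKER_TO_SECTION = {
    "OBJECTIVE": "Executive Summary",
    "HYPOTHESIS": "Hypotheses",
    "FINDING": "Key Findings",
    "LIMITATION": "Limitations",
    "NEXT_STEP": "Recommended Next Steps",
    "CONCLUSION": "Conclusion",
}
MARKER_RE = re.compile(r"^\[([A-Z_]+)(?::([a-z_]+))?\]\s*(.*)$")

def route_markers(output: str) -> dict:
    """Group marker lines from cell output into report sections."""
    sections = {}
    for line in output.splitlines():
        m = MARKER_RE.match(line.strip())
        if not m:
            continue
        name, qualifier, text = m.groups()
        if name == "METRIC":
            sections.setdefault("Performance Metrics", []).append(f"{qualifier}: {text}")
        elif name in MARKER_TO_SECTION:
            sections.setdefault(MARKER_TO_SECTION[name], []).append(text)
    return sections
```

Unrecognized markers are simply skipped, so legacy output does not break report assembly.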
### IMRAD Report Structure

Reports follow the IMRAD structure (Introduction, Methods, Results, and Discussion), extended with Gyoshu-specific sections:

| Section | Content | Required Markers |
|---------|---------|------------------|
| **Executive Summary** | Question, answer, magnitude, confidence | `[OBJECTIVE]`, `[CONCLUSION]` |
| **Hypotheses & Endpoints** | H0/H1, metrics, alpha | `[HYPOTHESIS]`, `[DECISION]` |
| **Methods** | Data, tests, assumptions | `[DATA]`, `[CHECK:*]` |
| **Results** | Estimates + CI + effect sizes | `[STAT:*]`, `[METRIC:*]` |
| **Robustness** | Sensitivity analyses | `[INDEPENDENT_CHECK]` |
| **Key Findings** | Verified discoveries (trust ≥ 80) | `[FINDING]` + `[STAT:ci]` + `[STAT:effect_size]` |
| **Exploratory Observations** | Unverified claims (trust < 80) | `[FINDING]` without full stats |
| **Implications ("So What")** | Practical significance | `[SO_WHAT]` |
| **Limitations** | Threats to validity | `[LIMITATION]` |
| **Next Steps** | Follow-up actions | `[NEXT_STEP]` |

### Automatic Report Generation

When research completes with SUCCESS status, a markdown report is automatically generated and saved to:

```
reports/{reportTitle}/report.md
```

The report includes:
- **Executive Summary**: Objective, key metrics, and status
- **Hypotheses**: All proposed explanations
- **Performance Metrics**: Table of all `[METRIC:name]` values
- **Key Findings**: Numbered list of discoveries
- **Output Files**: Artifacts from the reports directory
- **Conclusion**: Final research summary

### Manual Report Generation

Generate a report manually using the `/gyoshu report` command:

```bash
# Generate report for current research
/gyoshu report

# Generate report for specific research
/gyoshu report my-research-slug
```

Or via the research-manager tool:

```typescript
research-manager(action: "report", reportTitle: "my-research")
```

### PDF Export

Export markdown reports to PDF using available converters:

| Priority | Converter | Quality | Install Command |
|----------|-----------|---------|-----------------|
| 1 | pandoc | Best (LaTeX math support) | `apt install pandoc texlive-xetex` or `brew install pandoc basictex` |
| 2 | wkhtmltopdf | Good (widely available) | `apt install wkhtmltopdf` or `brew install wkhtmltopdf` |
| 3 | weasyprint | Good (CSS-based) | `pip install weasyprint` |

Export via the research-manager tool:

```typescript
research-manager(action: "export-pdf", reportTitle: "my-research")
```

PDF files are saved to:

```
reports/{reportTitle}/report.pdf
```

> **Note**: At least one PDF converter must be installed for PDF export. Gyoshu automatically detects and uses the best available converter.

### Automatic PDF Export on Completion

When using the `gyoshu-completion` tool with `exportPdf: true`, PDF export happens automatically after report generation:

```typescript
gyoshu-completion({
  researchSessionID: "my-session",
  status: "SUCCESS",
  summary: "Research complete",
  evidence: { ... },
  exportPdf: true // Automatically exports report.pdf after generating report.md
})
```

This is useful for autonomous research workflows where you want both the markdown report and the PDF without a separate step.

779
## Checkpoint System

Gyoshu provides checkpoint/resume capability for long-running research:

### Stage Protocol

Research is divided into bounded stages (max 4 minutes each):
- Each stage has a unique ID: `S{NN}_{verb}_{noun}` (e.g., `S01_load_data`)
- Stages emit markers: `[STAGE:begin]`, `[STAGE:progress]`, `[STAGE:end]`
- Checkpoints are created at stage boundaries

### Checkpoint Markers

```python
# Stage boundaries
print("[STAGE:begin:id=S01_load_data]")
print("[STAGE:end:id=S01_load_data:duration=120s]")

# Checkpoint saved
print("[CHECKPOINT:saved:id=ckpt-001:stage=S01_load_data]")

# Rehydrated from checkpoint
print("[REHYDRATED:from=ckpt-001]")
```

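The stage markers above can be recovered from captured output with a small scanner. This is an illustrative sketch, not the shipped marker parser (`src/lib/marker-parser.ts`):

```python
import re

# Matches the [STAGE:...] markers shown above; duration is optional.
STAGE_RE = re.compile(r"\[STAGE:(begin|end):id=([A-Za-z0-9_]+)(?::duration=(\d+)s)?\]")

def scan_stages(output: str) -> dict:
    """Return {stage_id: duration_seconds or None} for stages that ended."""
    stages = {}
    for kind, stage_id, duration in STAGE_RE.findall(output):
        if kind == "end":
            stages[stage_id] = int(duration) if duration else None
    return stages

log = "[STAGE:begin:id=S01_load_data]\n[STAGE:end:id=S01_load_data:duration=120s]"
print(scan_stages(log))  # {'S01_load_data': 120}
```

A stage that began but never ended simply doesn't appear in the result, which is one way to spot an interrupted run.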
### Checkpoint Storage

```
reports/{reportTitle}/checkpoints/{runId}/{checkpointId}/
└── checkpoint.json   # Manifest with artifact hashes
```

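The manifest's artifact hashes could be produced as below. The field names (`artifacts`, `sha256`) are assumptions for illustration; the real layout is defined in `src/lib/checkpoint-schema.ts`:

```python
import hashlib
import json
import pathlib

def write_manifest(checkpoint_dir: str, artifacts: list) -> dict:
    """Hash each artifact file and write a checkpoint.json manifest."""
    manifest = {"artifacts": {}}
    for path in artifacts:
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        manifest["artifacts"][path] = {"sha256": digest}
    out = pathlib.Path(checkpoint_dir) / "checkpoint.json"
    out.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Hashing at save time is what lets a later resume detect missing or modified artifacts.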
### Resume Commands

```bash
# Continue research (auto-detects checkpoints)
/gyoshu continue my-research

# List checkpoints
checkpoint-manager(action: "list", reportTitle: "my-research")

# Resume from specific checkpoint
checkpoint-manager(action: "resume", reportTitle: "my-research", runId: "run-001")
```

### Troubleshooting

| Issue | Solution |
|-------|----------|
| "No valid checkpoints" | Artifacts may be missing or corrupted. Check `reports/*/checkpoints/` |
| "Manifest SHA256 mismatch" | Checkpoint file was modified. Use previous checkpoint |
| "Session locked" | Use `/gyoshu unlock <sessionId>` after verifying no active process |

### Checkpoint Manager Actions

| Action | Description |
|--------|-------------|
| `save` | Create new checkpoint at stage boundary |
| `list` | List all checkpoints for a research/run |
| `validate` | Verify checkpoint integrity (manifest + artifacts) |
| `resume` | Find last valid checkpoint and generate rehydration code |
| `prune` | Keep only last N checkpoints (default: 5) |
| `emergency` | Fast checkpoint for watchdog/abort (skips validation) |

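Conceptually, `validate` re-hashes every artifact listed in the manifest and compares digests. A sketch of that check, assuming the same hypothetical `artifacts`/`sha256` manifest fields used for illustration above:

```python
import hashlib
import json
import pathlib

def validate_checkpoint(checkpoint_dir: str) -> bool:
    """Re-hash artifacts listed in checkpoint.json and compare digests."""
    manifest = json.loads((pathlib.Path(checkpoint_dir) / "checkpoint.json").read_text())
    for path, entry in manifest.get("artifacts", {}).items():
        p = pathlib.Path(path)
        if not p.exists():
            return False  # missing artifact invalidates the checkpoint
        if hashlib.sha256(p.read_bytes()).hexdigest() != entry["sha256"]:
            return False  # content drifted since the checkpoint was saved
    return True
```

A failed check corresponds to the "No valid checkpoints" / "Manifest SHA256 mismatch" rows in the troubleshooting table.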
### Trust Levels

Checkpoints have a trust level that controls security validation:

| Level | Description | Validation |
|-------|-------------|------------|
| `local` | Created by this system (default) | Standard validation |
| `imported` | Copied from another project | + Parent directory symlink check |
| `untrusted` | From external/unknown source | + Parent symlink check + User confirmation |

**When to use each level:**
- `local`: Normal checkpoints created during research (automatic)
- `imported`: When copying checkpoints from a colleague or another machine
- `untrusted`: When loading checkpoints from the internet or unknown sources

**Security implications:**
- `local` checkpoints trust the local filesystem
- `imported` and `untrusted` checkpoints verify that parent directories aren't symlinks (prevents escape attacks)
- `untrusted` checkpoints show a warning before resume, as rehydration code could execute arbitrary Python

**Example:**
```bash
# Save with explicit trust level (for imported checkpoint)
checkpoint-manager(action: "save", ..., trustLevel: "imported")

# Resume will show warning for non-local checkpoints
checkpoint-manager(action: "resume", reportTitle: "imported-research")
# Returns: { ..., trustWarning: "Checkpoint is imported - verify source before resuming" }
```

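The parent-directory symlink check applied to `imported` and `untrusted` checkpoints can be sketched as walking each ancestor of the checkpoint path and rejecting any symlink (illustrative only; the shipped check lives in `src/lib/artifact-security.ts`):

```python
import pathlib

def has_symlinked_parent(path: str, stop_at: str = "/") -> bool:
    """Return True if any ancestor of `path` (up to `stop_at`) is a symlink."""
    p = pathlib.Path(path).absolute()      # keep the literal path; don't resolve links
    stop = pathlib.Path(stop_at).absolute()
    for parent in [p, *p.parents]:
        if parent == stop:
            break
        if parent.is_symlink():
            return True
    return False
```

Note the use of `absolute()` rather than `resolve()`: resolving would silently follow the very symlinks the check is meant to detect.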
## Project Structure

### Durable (Tracked in Git)

```
Gyoshu/
├── notebooks/                  # Research notebooks (default location)
│   ├── README.md               # Auto-generated index
│   ├── _migrated/              # Migrated legacy research
│   └── {reportTitle}.ipynb     # Self-describing notebooks

├── reports/                    # Research reports (mirrors notebooks)
│   └── {reportTitle}/
│       ├── README.md           # Combined report view
│       ├── figures/
│       ├── models/
│       ├── exports/
│       ├── report.md           # Generated research report
│       └── report.pdf          # PDF export (if converter available)

├── src/                        # OpenCode extension source
│   ├── agent/                  # Agent definitions
│   ├── command/                # Slash commands
│   ├── tool/                   # Tool implementations
│   ├── lib/                    # Shared utilities
│   ├── bridge/                 # Python REPL bridge
│   └── skill/                  # Research skills
├── data/                       # Datasets
├── .venv/                      # Python environment
└── ...
```

### Ephemeral (OS Temp Directory)

Runtime data is stored in OS-appropriate temp directories, NOT in the project root:

```
Linux (with XDG_RUNTIME_DIR):
  $XDG_RUNTIME_DIR/gyoshu/        # Usually /run/user/{uid}/gyoshu
  └── {shortSessionId}/
      ├── bridge.sock             # Python REPL socket
      ├── session.lock            # Session lock
      └── bridge_meta.json        # Runtime state

macOS:
  ~/Library/Caches/gyoshu/runtime/
  └── {shortSessionId}/...

Linux (fallback):
  ~/.cache/gyoshu/runtime/
  └── {shortSessionId}/...
```

**Environment Variable Override**: Set `GYOSHU_RUNTIME_DIR` to force a custom location.

**Note**: Session IDs are hashed to 12 characters to respect Unix socket path limits (~108 bytes).

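The lookup order above, plus the 12-character session hash, could be resolved like this (a sketch under the priorities described above; the real resolver is `src/lib/paths.ts`):

```python
import hashlib
import os
import pathlib
import sys

def runtime_dir(session_id: str) -> pathlib.Path:
    """Resolve the per-session runtime directory per the priority above."""
    # Hash to 12 chars so socket paths stay under the ~108-byte Unix limit.
    short_id = hashlib.sha256(session_id.encode()).hexdigest()[:12]
    if os.environ.get("GYOSHU_RUNTIME_DIR"):           # explicit override
        base = pathlib.Path(os.environ["GYOSHU_RUNTIME_DIR"])
    elif os.environ.get("XDG_RUNTIME_DIR"):            # Linux runtime dir
        base = pathlib.Path(os.environ["XDG_RUNTIME_DIR"]) / "gyoshu"
    elif sys.platform == "darwin":                     # macOS cache dir
        base = pathlib.Path.home() / "Library/Caches/gyoshu/runtime"
    else:                                              # generic fallback
        base = pathlib.Path.home() / ".cache/gyoshu/runtime"
    return base / short_id
```

Hashing also means the same session ID always maps to the same directory, which is what makes `bridge.sock` rediscoverable across tool invocations.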
## Notebook-Centric Architecture

Gyoshu stores research metadata **in notebooks**, not separate JSON files:

### Notebook Frontmatter
Each notebook has YAML frontmatter in the first cell (raw cell):

```yaml
---
# Quarto-compatible fields (optional)
title: "Customer Churn Prediction"
date: 2026-01-01

# Gyoshu-specific fields
gyoshu:
  schema_version: 1
  reportTitle: churn-prediction   # Notebook identifier
  status: active                  # active | completed | archived
  created: "2026-01-01T10:00:00Z"
  updated: "2026-01-01T15:00:00Z"
  tags: [ml, classification]
  runs:
    - id: run-001
      started: "2026-01-01T10:00:00Z"
      status: completed
---
```

### Cell Tags (Papermill-style)
Cells are tagged with `gyoshu-*` markers in metadata to structure the research:
- `gyoshu-objective`, `gyoshu-hypothesis`, `gyoshu-finding`, etc.

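Extracting that frontmatter amounts to reading the first raw cell and taking the text between the `---` fences. The real parser is `src/lib/notebook-frontmatter.ts`; this Python sketch is illustrative and returns the raw YAML text rather than a parsed object:

```python
from typing import Optional

def extract_frontmatter(notebook: dict) -> Optional[str]:
    """Return the YAML text between '---' fences in the first raw cell."""
    cells = notebook.get("cells", [])
    if not cells or cells[0].get("cell_type") != "raw":
        return None
    source = "".join(cells[0].get("source", []))
    lines = source.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    try:
        end = lines.index("---", 1)  # closing fence
    except ValueError:
        return None                  # unterminated frontmatter
    return "\n".join(lines[1:end])
```

The returned text would then be handed to a YAML parser to get the `gyoshu` fields.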
## Key Files

| File | Purpose |
|------|---------|
| `src/bridge/gyoshu_bridge.py` | JSON-RPC Python execution bridge |
| `src/tool/research-manager.ts` | Research operations |
| `src/tool/session-manager.ts` | Runtime session management |
| `src/tool/python-repl.ts` | REPL tool interface |
| `src/tool/notebook-writer.ts` | Jupyter notebook generation |
| `src/tool/migration-tool.ts` | Legacy session migration utility |
| `src/tool/notebook-search.ts` | Notebook content search |
| `src/lib/notebook-frontmatter.ts` | Frontmatter parsing/updating |
| `src/lib/readme-index.ts` | README index generation |
| `src/lib/paths.ts` | Centralized path resolver |
| `src/lib/report-markdown.ts` | Report generation library |
| `src/lib/pdf-export.ts` | PDF export utilities |
| `tests/test_bridge.py` | Bridge unit tests |

## Common Tasks

### Adding a New Test
1. Create test class in appropriate file under `tests/`
2. Use `test_` prefix for test methods
3. Run: `pytest tests/test_file.py::TestClass::test_method -v`

### Modifying the Python Bridge
1. Edit `src/bridge/gyoshu_bridge.py`
2. Run tests: `pytest tests/test_bridge.py -v`
3. Test manually with JSON-RPC messages

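For step 3, a test message can be composed by hand. The method name and params below are assumptions for illustration only; consult `src/bridge/gyoshu_bridge.py` for the bridge's real method names:

```python
import json

def make_request(request_id: int, method: str, params: dict) -> str:
    """Build a newline-delimited JSON-RPC 2.0 request string."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    }) + "\n"

# Hypothetical method name -- check the bridge source for the real one.
msg = make_request(1, "execute", {"code": "print(1 + 1)"})
```

Writing such a line to the bridge's socket and reading the response line back is enough for a quick manual smoke test.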
### Working with Research
Research is now stored in the `notebooks/` directory by default.

```
./notebooks/
├── README.md             # Auto-generated root index
└── {reportTitle}.ipynb   # Research notebook with YAML frontmatter
```

Reports are stored in a mirrored structure:
```
./reports/
└── {reportTitle}/
    ├── README.md         # Combined report view
    ├── figures/          # Saved plots
    ├── models/           # Saved model files
    └── exports/          # Data exports (CSV, etc.)
```

> **Migration Note**: Legacy research stored at `gyoshu/research/` or `~/.gyoshu/sessions/` is still readable. Use the `/gyoshu migrate --to-notebooks` command to move data to the new structure.

## Python Environment Management

Gyoshu uses Python virtual environments for research reproducibility.

### Detection Priority

| Priority | Type | Detection Method |
|----------|------|------------------|
| 1 | Custom | `GYOSHU_PYTHON_PATH` env var |
| 2 | venv | `.venv/bin/python` exists |

### Quick Setup

```bash
python3 -m venv .venv
.venv/bin/pip install pandas numpy scikit-learn matplotlib seaborn
```

### Environment Override

Set `GYOSHU_PYTHON_PATH` to force a specific Python interpreter:

```bash
export GYOSHU_PYTHON_PATH=/path/to/custom/python
```

> **Note:** Gyoshu uses your project's virtual environment. It never modifies system Python.
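
The detection priority table can be sketched as a simple two-step resolver (illustrative only; the real logic lives in the extension source):

```python
import os
import pathlib

def resolve_python(project_root: str = ".") -> str:
    """Pick an interpreter per the priority table: env override, then .venv."""
    override = os.environ.get("GYOSHU_PYTHON_PATH")   # priority 1: explicit path
    if override:
        return override
    venv_python = pathlib.Path(project_root) / ".venv" / "bin" / "python"
    if venv_python.exists():                          # priority 2: project venv
        return str(venv_python)
    raise FileNotFoundError(
        "No Python environment found: set GYOSHU_PYTHON_PATH or create .venv"
    )
```

Raising instead of falling back to system Python reflects the note above: the system interpreter is never touched.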