npm - mindforge-cc - Versions diffs - 11.5.0 → 11.6.0 - Mend

mindforge-cc 11.5.0 → 11.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (177)

package/.agent/mindforge/skill-tdd.md +53 -0
package/.agent/mindforge/skills-index.md +118 -0
package/.agent/mindforge/systematic-debug.md +60 -0
package/.agent/skills/1password-skill/SKILL.md +156 -0
package/.agent/skills/1password-skill/references/cli-examples.md +31 -0
package/.agent/skills/1password-skill/references/get-started.md +21 -0
package/.agent/skills/article-illustrator/SKILL.md +199 -0
package/.agent/skills/article-illustrator/references/prompt-construction.md +426 -0
package/.agent/skills/article-illustrator/references/style-presets.md +80 -0
package/.agent/skills/article-illustrator/references/styles.md +224 -0
package/.agent/skills/article-illustrator/references/usage.md +50 -0
package/.agent/skills/article-illustrator/references/workflow.md +332 -0
package/.agent/skills/arxiv/SKILL.md +275 -0
package/.agent/skills/blogwatcher/SKILL.md +130 -0
package/.agent/skills/code-wiki/SKILL.md +438 -0
package/.agent/skills/code-wiki/templates/README.md +31 -0
package/.agent/skills/code-wiki/templates/architecture.md +30 -0
package/.agent/skills/code-wiki/templates/getting-started.md +47 -0
package/.agent/skills/code-wiki/templates/module.md +38 -0
package/.agent/skills/codebase-inspection/SKILL.md +109 -0
package/.agent/skills/comic-creator/SKILL.md +240 -0
package/.agent/skills/comic-creator/references/analysis-framework.md +176 -0
package/.agent/skills/comic-creator/references/auto-selection.md +71 -0
package/.agent/skills/comic-creator/references/base-prompt.md +98 -0
package/.agent/skills/comic-creator/references/character-template.md +180 -0
package/.agent/skills/comic-creator/references/ohmsha-guide.md +85 -0
package/.agent/skills/comic-creator/references/partial-workflows.md +106 -0
package/.agent/skills/comic-creator/references/storyboard-template.md +143 -0
package/.agent/skills/comic-creator/references/workflow.md +401 -0
package/.agent/skills/concept-diagrams/SKILL.md +355 -0
package/.agent/skills/concept-diagrams/references/dashboard-patterns.md +43 -0
package/.agent/skills/concept-diagrams/references/infrastructure-patterns.md +144 -0
package/.agent/skills/concept-diagrams/references/physical-shape-cookbook.md +42 -0
package/.agent/skills/creative-ideation/SKILL.md +144 -0
package/.agent/skills/creative-ideation/references/full-prompt-library.md +110 -0
package/.agent/skills/devops-cli/SKILL.md +149 -0
package/.agent/skills/devops-cli/references/app-discovery.md +112 -0
package/.agent/skills/devops-cli/references/authentication.md +59 -0
package/.agent/skills/devops-cli/references/cli-reference.md +104 -0
package/.agent/skills/devops-cli/references/running-apps.md +171 -0
package/.agent/skills/devops-watchers/SKILL.md +103 -0
package/.agent/skills/docker-management/SKILL.md +273 -0
package/.agent/skills/domain-intel/SKILL.md +96 -0
package/.agent/skills/duckduckgo-search/SKILL.md +230 -0
package/.agent/skills/github-auth/SKILL.md +240 -0
package/.agent/skills/github-code-review/SKILL.md +474 -0
package/.agent/skills/github-code-review/references/review-output-template.md +74 -0
package/.agent/skills/github-issues/SKILL.md +363 -0
package/.agent/skills/github-issues/templates/bug-report.md +35 -0
package/.agent/skills/github-issues/templates/feature-request.md +31 -0
package/.agent/skills/github-pr-workflow/SKILL.md +360 -0
package/.agent/skills/github-pr-workflow/references/ci-troubleshooting.md +183 -0
package/.agent/skills/github-pr-workflow/references/conventional-commits.md +71 -0
package/.agent/skills/github-pr-workflow/templates/pr-body-bugfix.md +35 -0
package/.agent/skills/github-pr-workflow/templates/pr-body-feature.md +33 -0
package/.agent/skills/github-repo-management/SKILL.md +509 -0
package/.agent/skills/github-repo-management/references/github-api-cheatsheet.md +161 -0
package/.agent/skills/godmode/SKILL.md +396 -0
package/.agent/skills/godmode/references/jailbreak-templates.md +128 -0
package/.agent/skills/godmode/references/refusal-detection.md +142 -0
package/.agent/skills/hyperframes/SKILL.md +182 -0
package/.agent/skills/hyperframes/references/cli.md +185 -0
package/.agent/skills/hyperframes/references/composition.md +129 -0
package/.agent/skills/hyperframes/references/features.md +289 -0
package/.agent/skills/hyperframes/references/gsap.md +136 -0
package/.agent/skills/hyperframes/references/troubleshooting.md +137 -0
package/.agent/skills/hyperframes/references/website-to-video.md +145 -0
package/.agent/skills/jupyter-live-kernel/SKILL.md +160 -0
package/.agent/skills/kanban-orchestrator/SKILL.md +209 -0
package/.agent/skills/kanban-worker/SKILL.md +188 -0
package/.agent/skills/llm-wiki/SKILL.md +499 -0
package/.agent/skills/meme-generation/SKILL.md +122 -0
package/.agent/skills/node-inspect-debugger/SKILL.md +312 -0
package/.agent/skills/obsidian/SKILL.md +60 -0
package/.agent/skills/osint-investigation/SKILL.md +269 -0
package/.agent/skills/osint-investigation/templates/source-template.md +59 -0
package/.agent/skills/oss-forensics/SKILL.md +422 -0
package/.agent/skills/oss-forensics/references/evidence-types.md +89 -0
package/.agent/skills/oss-forensics/references/github-archive-guide.md +184 -0
package/.agent/skills/oss-forensics/references/investigation-templates.md +131 -0
package/.agent/skills/oss-forensics/references/recovery-techniques.md +164 -0
package/.agent/skills/oss-forensics/templates/forensic-report.md +151 -0
package/.agent/skills/oss-forensics/templates/malicious-package-report.md +43 -0
package/.agent/skills/parallel-cli/SKILL.md +384 -0
package/.agent/skills/pinggy-tunnel/SKILL.md +302 -0
package/.agent/skills/pixel-art/SKILL.md +209 -0
package/.agent/skills/pixel-art/references/palettes.md +49 -0
package/.agent/skills/plan/SKILL.md +331 -0
package/.agent/skills/polymarket/SKILL.md +75 -0
package/.agent/skills/polymarket/references/api-endpoints.md +220 -0
package/.agent/skills/python-debugpy/SKILL.md +368 -0
package/.agent/skills/requesting-code-review/SKILL.md +273 -0
package/.agent/skills/research-paper-writing/SKILL.md +2367 -0
package/.agent/skills/research-paper-writing/references/autoreason-methodology.md +394 -0
package/.agent/skills/research-paper-writing/references/checklists.md +434 -0
package/.agent/skills/research-paper-writing/references/citation-workflow.md +563 -0
package/.agent/skills/research-paper-writing/references/experiment-patterns.md +728 -0
package/.agent/skills/research-paper-writing/references/human-evaluation.md +476 -0
package/.agent/skills/research-paper-writing/references/paper-types.md +481 -0
package/.agent/skills/research-paper-writing/references/reviewer-guidelines.md +433 -0
package/.agent/skills/research-paper-writing/references/sources.md +191 -0
package/.agent/skills/research-paper-writing/references/writing-guide.md +474 -0
package/.agent/skills/research-paper-writing/templates/README.md +251 -0
package/.agent/skills/rest-graphql-debug/SKILL.md +507 -0
package/.agent/skills/s6-container-supervision/SKILL.md +171 -0
package/.agent/skills/scrapling/SKILL.md +328 -0
package/.agent/skills/sherlock/SKILL.md +186 -0
package/.agent/skills/simplify-code/SKILL.md +168 -0
package/.agent/skills/skill-authoring/SKILL.md +158 -0
package/.agent/skills/spike/SKILL.md +190 -0
package/.agent/skills/subagent-driven-development/SKILL.md +345 -0
package/.agent/skills/subagent-driven-development/references/context-budget-discipline.md +53 -0
package/.agent/skills/subagent-driven-development/references/gates-taxonomy.md +93 -0
package/.agent/skills/systematic-debugging/SKILL.md +360 -0
package/.agent/skills/test-driven-development/SKILL.md +336 -0
package/.agent/skills/video-orchestrator/SKILL.md +194 -0
package/.agent/skills/video-orchestrator/references/examples.md +227 -0
package/.agent/skills/video-orchestrator/references/intake.md +166 -0
package/.agent/skills/video-orchestrator/references/kanban-setup.md +278 -0
package/.agent/skills/video-orchestrator/references/monitoring.md +180 -0
package/.agent/skills/video-orchestrator/references/role-archetypes.md +298 -0
package/.agent/skills/video-orchestrator/references/tool-matrix.md +317 -0
package/.agent/skills/web-pentest/SKILL.md +332 -0
package/.agent/skills/web-pentest/references/bypass-techniques.md +133 -0
package/.agent/skills/web-pentest/references/exploitation-techniques.md +204 -0
package/.agent/skills/web-pentest/references/scope-enforcement.md +110 -0
package/.agent/skills/web-pentest/references/vuln-taxonomy.md +81 -0
package/.agent/skills/web-pentest/templates/authorization.md +69 -0
package/.agent/skills/web-pentest/templates/pentest-report.md +178 -0
package/.claude/commands/mindforge/skill-tdd.md +53 -0
package/.claude/commands/mindforge/skills-index.md +118 -0
package/.claude/commands/mindforge/systematic-debug.md +60 -0
package/.mindforge/config.json +2 -2
package/.mindforge/memory/sync-manifest.json +1 -1
package/.mindforge/skills/arxiv/SKILL.md +294 -0
package/.mindforge/skills/blogwatcher/SKILL.md +147 -0
package/.mindforge/skills/code-wiki/SKILL.md +457 -0
package/.mindforge/skills/codebase-inspection/SKILL.md +126 -0
package/.mindforge/skills/concept-diagrams/SKILL.md +373 -0
package/.mindforge/skills/creative-ideation/SKILL.md +162 -0
package/.mindforge/skills/domain-intel/SKILL.md +116 -0
package/.mindforge/skills/duckduckgo-search/SKILL.md +249 -0
package/.mindforge/skills/github-code-review/SKILL.md +493 -0
package/.mindforge/skills/github-issues/SKILL.md +382 -0
package/.mindforge/skills/github-pr-workflow/SKILL.md +379 -0
package/.mindforge/skills/jupyter-live-kernel/SKILL.md +179 -0
package/.mindforge/skills/kanban-orchestrator/SKILL.md +227 -0
package/.mindforge/skills/kanban-worker/SKILL.md +206 -0
package/.mindforge/skills/meme-generation/SKILL.md +141 -0
package/.mindforge/skills/obsidian/SKILL.md +80 -0
package/.mindforge/skills/osint-investigation/SKILL.md +288 -0
package/.mindforge/skills/oss-forensics/SKILL.md +421 -0
package/.mindforge/skills/pixel-art/SKILL.md +228 -0
package/.mindforge/skills/plan/SKILL.md +350 -0
package/.mindforge/skills/requesting-code-review/SKILL.md +292 -0
package/.mindforge/skills/research-paper-writing/SKILL.md +2384 -0
package/.mindforge/skills/scrapling/SKILL.md +345 -0
package/.mindforge/skills/sherlock/SKILL.md +203 -0
package/.mindforge/skills/simplify-code/SKILL.md +187 -0
package/.mindforge/skills/spike/SKILL.md +209 -0
package/.mindforge/skills/subagent-driven-development/SKILL.md +364 -0
package/.mindforge/skills/systematic-debugging/SKILL.md +379 -0
package/.mindforge/skills/test-driven-development/SKILL.md +355 -0
package/.mindforge/skills/web-pentest/SKILL.md +327 -0
package/CHANGELOG.md +88 -0
package/MINDFORGE.md +3 -3
package/README.md +38 -3
package/RELEASENOTES.md +100 -0
package/bin/dashboard/api-router.js +10 -1
package/bin/governance/approve.js +5 -1
package/bin/memory/federated-sync.js +11 -2
package/bin/memory/knowledge-capture.js +10 -1
package/bin/memory/pillar-health-tracker.js +9 -1
package/bin/review/ads-engine.js +2 -2
package/bin/security/trust-boundaries.js +5 -0
package/docs/getting-started.md +42 -5
package/package.json +1 -1

package/.agent/skills/research-paper-writing/references/experiment-patterns.md ADDED

@@ -0,0 +1,728 @@
+# Experiment Design Patterns
+Patterns and best practices distilled from running research experiments at scale with the
+---
+## Experiment Infrastructure
+### Directory Structure
+Organize experiments with a consistent structure:
+```
+workspace/
+  experiments/
+    run_main.py                # Core experiment runner
+    run_baselines.py           # Baseline comparison
+    run_ablation.py            # Ablation studies
+    strategies.py              # Method implementations
+    config.yaml                # Shared configuration
+  results/
+    <experiment_name>/
+      <task_or_problem>/
+        <strategy>/
+          result.json          # Final metrics
+          final_output.md      # Final output artifact
+          history.json         # Full trajectory/log
+          pass_01/             # Per-iteration artifacts (if iterative)
+            intermediate.md
+  analysis/
+    analyze_results.py         # Statistical analysis
+    compute_stats.py           # Significance tests
+    make_charts.py             # Visualization
+  paper/
+    paper.tex                  # LaTeX source
+    fig_*.pdf                  # Generated figures
+```
+### Script Design Principles
+**1. Incremental Saving (Crash Recovery)**
+Every experiment script should save results after each unit of work, and skip already-completed work on restart:
+```python
+import json
+from pathlib import Path
+def run_experiment(problems, strategies, output_dir):
+    for problem in problems:
+        for strategy in strategies:
+            result_path = Path(output_dir) / problem["id"] / strategy / "result.json"
+            if result_path.exists():
+                print(f"Skipping {problem['id']}/{strategy} (already done)")
+                continue
+            # Run the experiment (execute_strategy is your own method implementation, e.g. from strategies.py)
+            result = execute_strategy(problem, strategy)
+            # Save immediately
+            result_path.parent.mkdir(parents=True, exist_ok=True)
+            with open(result_path, 'w') as f:
+                json.dump(result, f, indent=2)
+```
+This pattern makes re-runs safe and efficient. If a process crashes at problem 47/150, restarting skips the first 46.
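One gap in the resume check above: a crash in the middle of `json.dump` can leave a truncated `result.json` that a later run treats as completed work. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `save_result_atomic` (not part of the original file): write to a temporary file in the same directory, then rename it into place.

```python
import json
import os
import tempfile
from pathlib import Path


def save_result_atomic(result: dict, result_path: Path) -> None:
    """Write result as JSON so result_path is either absent or complete.

    Hypothetical helper: the name and signature are illustrative.
    """
    result_path = Path(result_path)
    result_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=result_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(result, f, indent=2)
        # os.replace swaps the finished file into place (atomic on POSIX).
        os.replace(tmp_name, result_path)
    except BaseException:
        # Remove the partial temp file; the final path is left untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

With this in place, the `result_path.exists()` check only ever sees fully written files.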
+**2. Artifact Preservation**
+Save all intermediate outputs, not just final results. This enables post-hoc analysis without re-running:
+```python
+from pathlib import Path
+def save_pass_artifacts(output_dir, pass_num, artifacts):
+    """Save all artifacts from a single pass of an iterative method."""
+    pass_dir = Path(output_dir) / f"pass_{pass_num:02d}"
+    pass_dir.mkdir(parents=True, exist_ok=True)
+    for name, content in artifacts.items():
+        with open(pass_dir / f"{name}.md", 'w') as f:
+            f.write(content)
+```
+**3. Configuration Management**
+Use YAML configs for reproducibility:
+```yaml
+# config.yaml
+model: anthropic/claude-sonnet-4-20250514
+author_temperature: 0.8
+judge_temperature: 0.3
+max_tokens: 4096
+num_judges: 3
+max_passes: 15
+convergence_k: 2
+```
+```python
+import yaml
+with open("config.yaml") as f:
+    config = yaml.safe_load(f)
+```
+**4. Separation of Concerns**
+Keep generation, evaluation, and visualization in separate scripts:
+| Script | Purpose |
+|--------|---------|
+| `run_main.py` | Core method execution |
+| `run_baselines.py` | Baseline comparisons at same compute |
+| `run_eval.py` | Blind evaluation / judge panels |
+| `analyze_results.py` | Statistical analysis |
+| `make_charts.py` | Figure generation |
+This lets you re-run evaluation without re-running expensive generation, and regenerate figures without re-running analysis.
+---
+## Evaluation Protocols
+### Blind Judge Panels (for Subjective Tasks)
+When evaluating subjective outputs (writing, analysis, recommendations), use a blind judge panel:
+```python
+import random
+def run_blind_evaluation(outputs: dict, task_prompt: str, num_judges: int = 7):
+    """
+    Run blind evaluation of multiple method outputs.
+    Args:
+        outputs: {"method_name": "output_text", ...}
+        task_prompt: The original task description
+        num_judges: Number of independent judge evaluations
+    """
+    rankings = []
+    for judge_i in range(num_judges):
+        # Randomize labels and presentation order per judge
+        methods = list(outputs.keys())
+        random.shuffle(methods)
+        labels = {m: chr(65 + i) for i, m in enumerate(methods)}  # A, B, C...
+        # Present to judge with randomized labels
+        prompt = f"Task: {task_prompt}\n\n"
+        for method in methods:
+            prompt += f"--- Proposal {labels[method]} ---\n{outputs[method]}\n\n"
+        prompt += "Rank all proposals from best to worst. Format: RANKING: [best], [second], [worst]"
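+        # call_judge is user-supplied; assumed here to return the judge's labels, best first (e.g. ["B", "A", "C"])
+        ranked_labels = call_judge(prompt)
+        # Map this judge's randomized labels back to method names before aggregating
+        label_to_method = {label: m for m, label in labels.items()}
+        ranking = [label_to_method[label] for label in ranked_labels]
+        rankings.append({"labels": labels, "ranking": ranking})
+    # Aggregate via Borda count
+    return compute_borda(rankings, n_methods=len(outputs))
+def compute_borda(rankings, n_methods=3):
+    """Borda count: n_methods points for 1st place, one fewer for each later place (3/2/1 for three methods)."""
+    scores = {}
+    for r in rankings:
+        for position, method in enumerate(r["ranking"]):
+            scores[method] = scores.get(method, 0) + max(n_methods - position, 0)
+    return scores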
+```
+Key design decisions:
+- **Randomize both labels AND order** per judge to prevent position bias
+- **Use an odd number of judges** (3, 5, 7) to break ties
+- **Conservative tiebreak**: Incumbent/baseline wins ties (prevents false positives)
+- **CoT judges** match non-CoT quality at ~40% cost (1 CoT judge ≈ 3 standard judges)
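The conservative tiebreak in the list above can be sketched as follows. `pick_winner` is a hypothetical helper, and the alphabetical fallback for a tie among challengers only is an assumption, since the text only specifies what happens when the incumbent is tied.

```python
def pick_winner(scores: dict, incumbent: str) -> str:
    """Return the top-scoring method; the incumbent keeps its place on ties.

    Hypothetical helper: `scores` is a {method: borda_points} dict.
    """
    best = max(scores.values())
    leaders = [m for m, s in scores.items() if s == best]
    if incumbent in leaders:
        # A challenger has to beat the incumbent outright, not merely match it.
        return incumbent
    # Tie among challengers only: fall back to a deterministic order (assumption).
    return sorted(leaders)[0]
```

Requiring a strict win means a new method is never declared better on the strength of a tied score.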
+### Code/Objective Evaluation
+For tasks with ground-truth evaluation (code, math, factual):
+```python
+import subprocess
+def evaluate_code(solution: str, test_cases: list, timeout: int = 30):
+    """Run a code solution against test cases in a subprocess with a timeout (process isolation only, not a full sandbox)."""
+    results = {"public": [], "private": []}
+    for test in test_cases:
+        try:
+            proc = subprocess.run(
+                ["python3", "-c", solution],
+                input=test["input"],
+                capture_output=True,
+                timeout=timeout,
+                text=True
+            )
+            actual = proc.stdout.strip()
+            expected = test["expected"].strip()
+            passed = actual == expected
+        except subprocess.TimeoutExpired:
+            passed = False
+        category = "public" if test.get("public") else "private"
+        results[category].append(passed)
+    return {
+        "public_pass_rate": sum(results["public"]) / max(len(results["public"]), 1),
+        "private_pass_rate": sum(results["private"]) / max(len(results["private"]), 1),
+    }
+```
+### Compute-Matched Comparison
+Always compare methods at equal compute budget. If your method uses N API calls, baselines get N calls too:
+| Method | Call Budget | Allocation |
+|--------|-----------|------------|
+| Single pass | 6 calls | 6 independent generations |
+| Critique & revise | 6 calls | 1 generate + 5 revise rounds |
+| Autoreason | 6 calls | 1 generate + 1 analysis + 4 revisions |
+| Best-of-N | 6 calls | 6 independent, pick best on public test |
+### Human Evaluation Design
+Many ML/NLP papers require human evaluation, especially for subjective tasks (text generation, summarization, dialogue, creative writing). Poorly designed human evals are a common rejection reason.
+#### When Human Evaluation Is Required
+| Task Type | Required? | Notes |
+|-----------|-----------|-------|
+| Text generation (open-ended) | Yes | Reviewers at venues such as ACL/EMNLP often treat LLM-as-judge alone as insufficient |
+| Summarization | Usually | At minimum for a subset of outputs |
+| Dialogue systems | Yes | User studies or annotation |
+| Code generation | No | Test suites are objective ground truth |
+| Classification | No | Standard metrics suffice |
+| Any task with subjective quality | Strongly recommended | Strengthens the paper significantly |
+#### Annotation Protocol Design
+```
+Human Evaluation Protocol:
+1. Define the evaluation dimensions (fluency, relevance, factual accuracy, etc.)
+2. Create annotation guidelines with examples of each score level
+3. Run a pilot with 2-3 annotators on 20-30 examples
+4. Compute pilot inter-annotator agreement — if low, revise guidelines
+5. Run full evaluation
+6. Report: annotator count, agreement metrics, compensation, time per item
+```
+**Evaluation dimensions** (pick relevant subset):
+| Dimension | Definition | Scale |
+|-----------|-----------|-------|
+| Fluency | Grammaticality and naturalness | 1-5 Likert |
+| Relevance | Does it address the task? | 1-5 Likert |
+| Factual accuracy | Are stated facts correct? | Binary or 1-5 |
+| Coherence | Logical flow and consistency | 1-5 Likert |
+| Informativeness | Does it provide useful information? | 1-5 Likert |
+| Overall preference | Which output is better? | A/B/Tie (pairwise) |
+**Pairwise comparison** (preferred over absolute scoring — more reliable):
+- Present two outputs side-by-side (randomize left/right position)
+- Ask: "Which is better? A / B / Tie"
+- More discriminative and less susceptible to annotator calibration drift
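A minimal sketch of the pairwise setup described above, assuming each output is a `(system_name, text)` tuple. The helper names and the dictionary layout are illustrative, not taken from the original file.

```python
import random
from collections import Counter


def make_pairwise_item(item_id, output_x, output_y, rng):
    """Build one comparison with the left/right position randomized.

    `output_x` / `output_y` are (system_name, text) tuples. System names are
    kept in a separate key so they can be hidden from annotators.
    """
    left, right = (output_x, output_y) if rng.random() < 0.5 else (output_y, output_x)
    return {
        "item_id": item_id,
        "A": left[1],
        "B": right[1],
        "key": {"A": left[0], "B": right[0]},  # never shown to annotators
    }


def tally_preferences(items, answers):
    """Count wins per system from answers of 'A', 'B', or 'Tie'."""
    counts = Counter()
    for item, answer in zip(items, answers):
        counts["Tie" if answer == "Tie" else item["key"][answer]] += 1
    return dict(counts)
```

Passing a seeded `random.Random` instance as `rng` makes the left/right assignment reproducible for the paper's appendix.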
+#### Inter-Annotator Agreement
+Always report agreement metrics. Without them, reviewers assume your annotations are unreliable.
+```python
+# Krippendorff's alpha (preferred: handles missing data, any scale)
+# pip install krippendorff
+import numpy as np
+import krippendorff
+# Ratings: rows = annotators, columns = items, values = scores
+# Missing ratings are encoded as np.nan
+ratings = [
+    [3, 4, 1, 2, 5, np.nan, 3],  # Annotator 1
+    [3, 5, 1, 3, 5, 2, 3],       # Annotator 2
+    [4, 4, 2, 2, 4, 2, np.nan],  # Annotator 3
+]
+alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
+print(f"Krippendorff's alpha: {alpha:.3f}")
+# Interpretation: >0.80 good, 0.67-0.80 acceptable, <0.67 questionable
+```
+```python
+# Cohen's kappa (for exactly 2 annotators, categorical data)
+from sklearn.metrics import cohen_kappa_score
+annotator_1 = [1, 2, 3, 1, 2, 3, 2]
+annotator_2 = [1, 2, 2, 1, 3, 3, 2]
+kappa = cohen_kappa_score(annotator_1, annotator_2)
+print(f"Cohen's kappa: {kappa:.3f}")
+# Interpretation: >0.80 excellent, 0.60-0.80 substantial, 0.40-0.60 moderate
+```
+| Metric | When to Use | Annotators | Scale |
+|--------|------------|-----------|-------|
+| Krippendorff's alpha | Default choice | Any number | Any (ordinal, nominal, ratio) |
+| Cohen's kappa | 2 annotators, categorical | Exactly 2 | Nominal/ordinal |
+| Fleiss' kappa | 3+ annotators, categorical | 3+ | Nominal |
+| Pearson/Spearman | Continuous scores | 2 | Interval/ratio |
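The table lists Fleiss' kappa for three or more annotators, but the examples above cover only Krippendorff's alpha and Cohen's kappa. The following is a dependency-free sketch of the standard Fleiss formula, assuming every item is rated by the same number of annotators. The function name and input layout are illustrative.

```python
from collections import Counter


def fleiss_kappa(ratings_per_item):
    """Fleiss' kappa for nominal labels.

    `ratings_per_item` has one entry per item; each entry lists the category
    chosen by every rater. All items need the same number of raters.
    """
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings_per_item):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(r) for r in ratings_per_item]
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_bar = sum(
        (sum(c * c for c in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    totals = Counter()
    for item in counts:
        totals.update(item)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)
```

For example, `fleiss_kappa([["a", "a", "b"], ["b", "b", "b"]])` takes two items rated by three annotators each.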
+#### Crowdsourcing Platforms
+| Platform | Best For | Cost | Quality |
+|----------|----------|------|---------|
+| **Prolific** | Academic research, higher quality | $8-15/hr | High — academic participant pool |
+| **MTurk** | Large-scale, fast turnaround | $2-10/hr | Variable — use qualifications |
+| **Surge AI** | NLP-specific annotations | Premium | High — trained annotators |
+| **Expert annotators** | Domain-specific (medical, legal) | Highest | Highest — but slow |
+**Ethics requirements**:
+- Report the compensation rate (it must be at least the local minimum wage)
+- Describe annotator demographics if relevant
+- Obtain IRB/ethics approval if required by your institution
+- ACL venues explicitly require compensation documentation
+#### What to Report in the Paper
+```
+Human Evaluation Section Checklist:
+- [ ] Number of annotators
+- [ ] Annotator qualifications / recruitment method
+- [ ] Number of items evaluated
+- [ ] Evaluation dimensions with definitions
+- [ ] Scale used (Likert, pairwise, binary)
+- [ ] Inter-annotator agreement (Krippendorff's alpha or Cohen's kappa)
+- [ ] Compensation rate
+- [ ] Time per annotation item
+- [ ] Whether annotators saw model identities (should be blind)
+- [ ] Randomization of presentation order
+```
+---
+## Statistical Analysis
+### Required Tests
+| Test | When to Use | Python |
+|------|------------|--------|
+| McNemar's test | Comparing two methods on same problems | `scipy.stats.binomtest` on discordant pairs for small n, or `statsmodels.stats.contingency_tables.mcnemar` |
+| Two-proportion z-test | Comparing success rates | Custom or `statsmodels.stats.proportion.proportions_ztest` |
+| Fisher's exact test | Small sample pairwise comparison | `scipy.stats.fisher_exact` |
+| Bootstrapped CI | Confidence intervals for any metric | Custom bootstrap or `scipy.stats.bootstrap` |
+| Cohen's h | Effect size for proportions | Manual calculation |
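The "Custom" option for the two-proportion z-test can be sketched with the standard library alone. This version assumes two independent samples and uses the pooled-proportion normal approximation; the function name is illustrative. When both methods run on the same problems, the paired McNemar's test is the better fit.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two independent success rates.

    Uses the pooled-proportion standard error (normal approximation), so it
    suits reasonably large samples.
    """
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return {"z": 0.0, "p_value": 1.0}  # both rates are 0% or both are 100%
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"z": z, "p_value": p_value}
```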
+### Standard Analysis Script
+```python
+import numpy as np
+from scipy import stats
+from pathlib import Path
+import json
+def load_all_results(results_dir):
+    """Load all results into a structured format."""
+    results = {}
+    for result_file in Path(results_dir).rglob("result.json"):
+        parts = result_file.relative_to(results_dir).parts
+        if len(parts) >= 3:
+            experiment, task, strategy = parts[0], parts[1], parts[2]
+            data = json.loads(result_file.read_text())
+            results.setdefault(experiment, {}).setdefault(strategy, {})[task] = data
+    return results
+def pairwise_mcnemar(method_a_results, method_b_results):
+    """McNemar's test for paired binary outcomes."""
+    a_win_b_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if a and not b)
+    b_win_a_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if b and not a)
+    n = a_win_b_lose + b_win_a_lose
+    if n == 0:
+        # No discordant pairs: the methods never disagree, so there is no evidence of a difference
+        p_value = 1.0
+    elif n < 25:
+        # Use exact binomial for small samples
+        result = stats.binomtest(a_win_b_lose, n, 0.5)
+        p_value = result.pvalue
+    else:
+        # Chi-squared approximation with continuity correction
+        chi2 = (abs(a_win_b_lose - b_win_a_lose) - 1)**2 / n
+        p_value = stats.chi2.sf(chi2, df=1)
+    return {
+        "a_wins": a_win_b_lose,
+        "b_wins": b_win_a_lose,
+        "n_discordant": n,
+        "p_value": float(p_value),
+        "significant": bool(p_value < 0.05)
+    }
+def bootstrap_ci(data, n_bootstrap=10000, ci=0.95):
+    """Bootstrap confidence interval for mean."""
+    means = []
+    for _ in range(n_bootstrap):
+        sample = np.random.choice(data, size=len(data), replace=True)
+        means.append(np.mean(sample))
+    lower = np.percentile(means, (1 - ci) / 2 * 100)
+    upper = np.percentile(means, (1 + ci) / 2 * 100)
+    return {"mean": np.mean(data), "ci_lower": lower, "ci_upper": upper}
+def cohens_h(p1, p2):
+    """Cohen's h effect size for two proportions."""
+    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
+```
+### Reporting Standards
+Always include in the paper:
+- **Sample sizes**: n=X problems/tasks
+- **Number of runs**: K independent runs if applicable
+- **Error bars**: Specify standard deviation or standard error
+- **Confidence intervals**: 95% CI for key results
+- **Significance tests**: p-values for key comparisons
+- **Effect sizes**: Cohen's d or h for practical significance
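Standard deviation and standard error are easy to mix up when labeling error bars. A small sketch of both, assuming `values` holds one metric per independent run; the helper name is illustrative.

```python
import math
import statistics


def summarize_runs(values):
    """Mean, sample standard deviation, and standard error over K independent runs."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return {
        "n": n,
        "mean": statistics.fmean(values),
        "sd": sd,                  # spread of individual runs
        "se": sd / math.sqrt(n),   # uncertainty of the mean
    }
```

Report which of the two the error bars show; SE shrinks as runs are added, SD does not.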
+---
+## Monitoring (Cron Pattern)
+### Cron Prompt Template
+For each experiment batch, create a monitoring prompt:
+```
+Check the status of the [EXPERIMENT_NAME] experiment:
+1. Process check: ps aux | grep [PROCESS_PATTERN]
+2. Log check: tail -30 [LOG_FILE]
+3. Results check: ls [RESULT_DIR]/eval/ (or appropriate result location)
+4. If results are available:
+   - Read the result JSON files
+   - Report metrics in a table (Borda scores, accuracy, etc.)
+   - Compute key comparisons between methods
+5. If all experiments in this batch are complete:
+   - git add -A && git commit -m "[COMMIT_MESSAGE]" && git push
+   - Report final summary
+6. Key question: [SPECIFIC ANALYTICAL QUESTION]
+If nothing has changed since the last check, respond with [SILENT].
+```
+### Monitoring Best Practices
+1. **Check processes first** — don't read results if the experiment is still running and results are incomplete
+2. **Read the log tail** — look for errors, progress indicators, completion messages
+3. **Count completed vs expected** — "45/150 problems done" is more useful than "some results exist"
+4. **Report in structured tables** — always include key metrics in a table
+5. **Answer the key question** — each experiment should have a specific analytical question to answer when done
+6. **[SILENT] for no-news** — suppress notifications when nothing has changed
+7. **Commit on completion** — every completed batch gets committed with a descriptive message
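Counting completed versus expected work (item 3 above) can be sketched by counting `result.json` files, assuming the results layout from the directory structure earlier in this file. `progress_report` is a hypothetical helper.

```python
from pathlib import Path


def progress_report(results_dir, expected_total):
    """Count finished units of work by counting result.json files."""
    done = sum(1 for _ in Path(results_dir).rglob("result.json"))
    pct = 100.0 * done / expected_total if expected_total else 0.0
    return f"{done}/{expected_total} done ({pct:.1f}%)"
```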
+### Example Monitoring Report
+```
+## Code Experiments (Haiku 3.5) - COMPLETE
+| Strategy | Pass Rate (150 problems) | vs Single |
+|----------|------------------------|-----------|
+| single_pass | 38.0% | — |
+| critique_revise | 35.2% | -2.8pp |
+| **autoreason** | **40.0%** | **+2.0pp** |
+| best_of_6 | 31.0% | -7.0pp |
+Key finding: Autoreason shows +2pp improvement over single pass, while
+best-of-6 collapses due to single-public-test selection issue.
+Committed: `git commit -m "Add Haiku code results (150 problems, 4 strategies)"`
+Next: Run significance tests on these results.
+```
+---
+## Failure Recovery
+### Common Failures and Recovery
+| Failure | Detection | Recovery |
+|---------|-----------|----------|
+| **API credit exhaustion** | 402 errors in logs, incomplete results | Top up credits, re-run (skips completed work automatically) |
+| **Rate limiting** | 429 errors, slow progress | Add retry logic with exponential backoff |
+| **Process crash** | PID gone, log stops mid-problem | Re-run script (resumes from last checkpoint) |
+| **Wrong model ID** | Model not found errors | Fix ID (e.g., `claude-opus-4-6` not `claude-opus-4.6`) |
+| **Parallel slowdown** | Each experiment taking 2x longer | Reduce parallel experiments to 2-3 max |
+| **Security scan blocks** | Commands blocked by security | Use `execute_code` instead of piped `terminal` commands |
+| **Delegation failures** | `delegate_task` returns errors | Fall back to doing work directly |
+| **Timeout on hard problems** | Process stuck, no log progress | Kill, skip problem, note in results |
+| **Dataset path mismatch** | File not found errors | Verify paths before launching |
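The rate-limiting row recommends retry logic with exponential backoff. A minimal sketch, assuming a hypothetical wrapper named `call_with_backoff`; the exception type to retry on depends on your API client and is left as a parameter.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait and try again with growing delays.

    Hypothetical helper: pass the exception type your API client raises for
    HTTP 429 as `retry_on`.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.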
+### Retry Naming Convention
+When re-running failed experiments, use a suffix to track rounds:
+```
+logs/experiment_haiku_0_50.log       # Round 1
+logs/experiment_haiku_0_50_r2.log    # Round 2 (after credit exhaustion)
+logs/experiment_haiku_0_50_r3.log    # Round 3 (after bug fix)
+```
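The suffix convention above can be automated so that a re-run never overwrites an earlier log. A sketch, assuming a hypothetical helper named `next_log_path`:

```python
from pathlib import Path


def next_log_path(log_dir, base_name):
    """Return the first unused path: base.log, then base_r2.log, base_r3.log, ..."""
    log_dir = Path(log_dir)
    candidate = log_dir / f"{base_name}.log"
    round_num = 1
    while candidate.exists():
        round_num += 1
        candidate = log_dir / f"{base_name}_r{round_num}.log"
    return candidate
```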
+### Pre-Flight Checklist
+Before launching any experiment batch:
+```
+Pre-Flight:
+- [ ] API credits sufficient for estimated calls
+- [ ] Model IDs correct (test with 1 problem first)
+- [ ] Output directory exists and is writable
+- [ ] Resume logic works (re-run won't overwrite existing results)
+- [ ] Log file path is unique (won't overwrite previous logs)
+- [ ] Dataset/task files are accessible
+- [ ] Config matches intended experiment
+```
+---
+## Task/Benchmark Design
+### Open-Ended Tasks (Subjective Evaluation)
+Design tasks that have clear objectives but subjective quality:
+```markdown
+# Task: [Title]
+## Context
+[Specific scenario with concrete details: company size, constraints, timeline]
+## Deliverable
+[Exact format and structure required]
+## Requirements
+- [Specific, measurable requirements]
+- [Not vague — "be comprehensive" is bad, "include exactly 6 sections" is good]
+```
+### Constrained Tasks (for Testing Scope Effects)
+Constrained tasks test whether methods respect scope boundaries. Design with:
+- **Fixed facts**: "Use only these N data points, add nothing else"
+- **Fixed deliverable**: Specific format (pitch, postmortem, memo — not "improve this")
+- **Fixed structure**: "These sections in this order, do not add/remove"
+- **Fixed change items**: "Address exactly these N points, nothing else"
+**Do NOT use word count as a scope constraint.** Word limits cause false convergence — outputs get rejected for length, not quality. Constrain scope (what to include) not length.
+### Example: Good vs Bad Constraints
+| Bad Constraint | Why | Good Constraint |
+|---------------|-----|-----------------|
+| "Max 500 words" | Judges reject for length | "Exactly 4 sections, each with 3 numbered items" |
+| "Be concise" | Too vague | "Each prohibition must reference a specific base fact" |
+| "Improve this" | Unbounded scope | "Write an incident postmortem with this exact structure" |
+| "Make it better" | No clear criterion | "Address exactly these 3 reviewer concerns" |
+---
+## Visualization Best Practices
+### Setup: SciencePlots + matplotlib
+Install SciencePlots for publication-ready defaults:
+```bash
+pip install SciencePlots matplotlib numpy
+```
+**Option A: SciencePlots styles** (recommended — handles most defaults automatically):
+```python
+import matplotlib.pyplot as plt
+import scienceplots  # registers the styles
+# Pick a style:
+# 'science'        — clean, serif fonts, suitable for most venues
+# 'science+ieee'   — IEEE-style (good for two-column papers)
+# 'science+nature' — Nature-style
+# Add 'no-latex' if LaTeX is not installed on the machine generating plots
+with plt.style.context(['science', 'no-latex']):
+    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # single-column width
+    # ... plot ...
+    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')
+```
+**Option B: Manual rcParams** (when you need full control):
+```python
+import matplotlib.pyplot as plt
+plt.rcParams.update({
+    'font.size': 10,
+    'font.family': 'serif',
+    'axes.labelsize': 11,
+    'axes.titlesize': 11,
+    'xtick.labelsize': 9,
+    'ytick.labelsize': 9,
+    'legend.fontsize': 9,
+    'figure.figsize': (3.5, 2.5),    # single-column default
+    'figure.dpi': 300,
+    'savefig.dpi': 300,
+    'savefig.bbox': 'tight',
+    'savefig.pad_inches': 0.05,
+    'axes.linewidth': 0.8,
+    'lines.linewidth': 1.5,
+    'lines.markersize': 5,
+    'axes.grid': True,
+    'grid.alpha': 0.3,
+    'grid.linewidth': 0.5,
+})
+```
+### Standard Figure Sizes (Two-Column Format)
+| Use Case | figsize (inches) | Notes |
+|----------|---------|-------|
+| Single column | `(3.5, 2.5)` | Fits in one column of two-column layout |
+| Double column | `(7.0, 3.0)` | Spans full page width |
+| Square (heatmap, confusion matrix) | `(3.5, 3.5)` | Single column |
+| Tall single (many rows) | `(3.5, 5.0)` | Use sparingly |
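The sizes in the table assume a 3.5 in column. If a venue's template uses a different width, the figure size can be derived from the LaTeX column width instead (printing `\the\columnwidth` in the document reports it in points). A small sketch; `figsize_from_pt` is a hypothetical helper, and its default height ratio simply reproduces the 3.5 × 2.5 shape used above.

```python
TEX_PT_PER_INCH = 72.27  # TeX points per inch


def figsize_from_pt(width_pt, height_ratio=2.5 / 3.5, fraction=1.0):
    """Convert a LaTeX width in points to a matplotlib (width, height) in inches."""
    width_in = width_pt * fraction / TEX_PT_PER_INCH
    return (width_in, width_in * height_ratio)
```

For example, `figsize_from_pt(252.945)` gives approximately `(3.5, 2.5)`, matching the single-column row.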
+### Colorblind-Safe Palette (Okabe-Ito)
+Use this palette for all paper figures. It is distinguishable by people with all common forms of color vision deficiency:
+```python
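+COLORS = {
+    'blue':    '#0072B2',
+    'orange':  '#E69F00',
+    'green':   '#009E73',  # "bluish green" in Okabe-Ito
+    'red':     '#D55E00',  # "vermillion" in Okabe-Ito
+    'purple':  '#CC79A7',  # "reddish purple" in Okabe-Ito
+    'cyan':    '#56B4E9',  # "sky blue" in Okabe-Ito
+    'yellow':  '#F0E442',  # low contrast on white; avoid for thin lines
+    'black':   '#000000',
+}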
+# As a list for cycling:
+COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']
+```
+Also differentiate lines by **marker and linestyle**, not just color:
+```python
+STYLES = [
+    {'color': '#0072B2', 'marker': 'o', 'linestyle': '-'},
+    {'color': '#D55E00', 'marker': 's', 'linestyle': '--'},
+    {'color': '#009E73', 'marker': '^', 'linestyle': '-.'},
+    {'color': '#E69F00', 'marker': 'D', 'linestyle': ':'},
+]
+```
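Rather than unpacking a `STYLES` entry into every `plot` call, the same pairing can be installed as the default property cycle. A sketch using `cycler` (installed alongside matplotlib); adding cyclers together advances color, marker, and linestyle in lockstep. The `Agg` backend line is only there so the snippet runs without a display, and the plotted series are dummy data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in an interactive session
import matplotlib.pyplot as plt
from cycler import cycler

paper_cycle = (
    cycler(color=['#0072B2', '#D55E00', '#009E73', '#E69F00'])
    + cycler(marker=['o', 's', '^', 'D'])
    + cycler(linestyle=['-', '--', '-.', ':'])
)
plt.rcParams['axes.prop_cycle'] = paper_cycle

# Each successive plot call now picks up the next color/marker/linestyle triple.
fig, ax = plt.subplots(figsize=(3.5, 2.5))
lines = [ax.plot([0, 1, 2], [k, k + 1, k + 2], label=f'series {k}')[0]
         for k in range(4)]
```

The cycle wraps around after four series, so a fifth line would repeat the first style; extend all three lists together if more are needed.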
+### Complete Example: Method Comparison Bar Chart
+```python
+import matplotlib.pyplot as plt
+import numpy as np
+try:
+    import scienceplots
+    style = ['science', 'no-latex']
+except ImportError:
+    style = 'default'
+with plt.style.context(style):
+    methods = ['Single Pass', 'Critique+Revise', 'Best-of-N', 'Ours']
+    scores = [73.2, 74.1, 68.5, 77.0]
+    errors = [2.1, 1.8, 3.2, 1.5]
+    colors = ['#56B4E9', '#E69F00', '#CC79A7', '#0072B2']
+    fig, ax = plt.subplots(figsize=(3.5, 2.5))
+    bars = ax.bar(methods, scores, yerr=errors, capsize=3,
+                  color=colors, edgecolor='black', linewidth=0.5)
+    # Highlight "Ours" with a heavier outline; the edge stays black, since an
+    # edge in the bar's own fill color (#0072B2) would not be visible
+    bars[-1].set_linewidth(1.5)
+    ax.set_ylabel('Pass Rate (%)')
+    ax.set_ylim(60, 85)
+    ax.spines['top'].set_visible(False)
+    ax.spines['right'].set_visible(False)
+    fig.savefig('paper/fig_comparison.pdf', bbox_inches='tight')
+```
+### Complete Example: Convergence/Trajectory Line Chart
+```python
+# Reuses plt, np, and style from the bar chart example, and STYLES from the palette section
+with plt.style.context(style):
+    fig, ax = plt.subplots(figsize=(3.5, 2.5))
+    passes = np.arange(1, 16)
+    ours = [65, 72, 78, 82, 85, 87, 88, 89, 89.5, 90, 90, 90, 90, 90, 90]
+    baseline = [65, 68, 70, 71, 69, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58]
+    ax.plot(passes, ours, **STYLES[0], label='Ours', markersize=4)
+    ax.plot(passes, baseline, **STYLES[1], label='Critique+Revise', markersize=4)
+    # Mark convergence point
+    ax.axvline(x=10, color='gray', linestyle=':', alpha=0.5, linewidth=0.8)
+    ax.annotate('Converged', xy=(10, 90), fontsize=8, ha='center',
+                xytext=(10, 93), arrowprops=dict(arrowstyle='->', color='gray'))
+    ax.set_ylim(55, 96)  # headroom so the 'Converged' label at y=93 stays inside the axes
+    ax.set_xlabel('Iteration')
+    ax.set_ylabel('Quality Score')
+    ax.legend(loc='lower right')
+    ax.spines['top'].set_visible(False)
+    ax.spines['right'].set_visible(False)
+    fig.savefig('paper/fig_trajectory.pdf', bbox_inches='tight')
+```
+### Output Rules
+- **Always save as PDF**: `fig.savefig('fig.pdf')` — vector graphics, sharp at any zoom
+- **Never save as PNG** for paper figures — raster PNGs look blurry when printed/zoomed
+- **Exception**: Screenshots, photographs, or pixel-art visualizations → PNG at 600 DPI
+- **Verify grayscale**: Print to grayscale PDF and check all information is still visible
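A quick numeric proxy for the grayscale check is to convert each color to a single luma value and flag pairs that land close together. This is a rough sketch, not a substitute for looking at the printed figure: it uses the Rec. 601 luma weights, and the threshold of 15 (on a 0 to 255 scale) is an arbitrary assumption.

```python
from itertools import combinations


def luma(hex_color):
    """Approximate grayscale value (0-255) using Rec. 601 luma weights."""
    h = hex_color.lstrip('#')
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b


def close_in_grayscale(colors, min_gap=15.0):
    """Return color pairs whose luma values differ by less than min_gap."""
    return [
        (a, b) for a, b in combinations(colors, 2)
        if abs(luma(a) - luma(b)) < min_gap
    ]


line_colors = ['#0072B2', '#D55E00', '#009E73', '#E69F00']
for a, b in close_in_grayscale(line_colors):
    print(f'{a} and {b} may be hard to tell apart in grayscale')
```

With the four line colors from `STYLES`, the vermillion `#D55E00` and green `#009E73` are the pair flagged at this threshold, which is one reason to vary marker and linestyle as well as color.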
+### Chart Types for Common Comparisons
+| Comparison Type | Chart | Notes |
+|----------------|-------|-------|
+| Method vs method | Grouped bar chart | Include error bars |
+| Across model sizes | Line chart with CI bands | Log scale for model size axis |
+| Ablation study | Stacked/grouped bar | Highlight removed component |
+| Trajectory/convergence | Line chart over iterations | Show winner per iteration |
+| Per-task breakdown | Heatmap or grouped bar | Show variance across tasks |
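The "line chart with CI bands" row is the one chart type not covered by the complete examples above. A minimal sketch, assuming per-seed scores are available as a 2-D array (seeds × model sizes). The numbers are randomly generated placeholders, not results, and the band is a normal-approximation 95% interval (mean ± 1.96 standard errors), which is crude for a handful of seeds.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in an interactive session
import matplotlib.pyplot as plt
import numpy as np

model_sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])  # parameter counts (placeholders)
rng = np.random.default_rng(0)
# Placeholder data: 5 seeds x 5 model sizes.
scores = 50 + 8 * np.log10(model_sizes / 1e8) + rng.normal(0, 2, size=(5, 5))

mean = scores.mean(axis=0)
sem = scores.std(axis=0, ddof=1) / np.sqrt(scores.shape[0])  # standard error over seeds
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem

fig, ax = plt.subplots(figsize=(3.5, 2.5))
ax.plot(model_sizes, mean, color='#0072B2', marker='o', label='Ours')
ax.fill_between(model_sizes, lower, upper, color='#0072B2', alpha=0.2, linewidth=0)
ax.set_xscale('log')  # log scale for the model size axis
ax.set_xlabel('Model size (parameters)')
ax.set_ylabel('Score')
ax.legend(loc='lower right')
```

Saving follows the same pattern as the other examples (`fig.savefig(..., bbox_inches='tight')` to a PDF).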