open-research 0.1.2 → 0.1.4

package/README.md CHANGED
@@ -74,36 +74,159 @@ It has tools that coding agents don't: federated academic paper search, PDF extr
74
74
 
75
75
  Everything stays local. Your workspace is a directory with `sources/`, `notes/`, `papers/`, `experiments/`. The agent reads and writes to it. Risky edits go to a review queue.
76
76
 
77
- ## Skills
77
+ ## Agent Modes
78
78
 
79
- Built-in research methodologies. Type `/skill-name` to activate:
79
+ Open Research operates in three modes. Cycle with `Shift+Tab`:
80
80
 
81
- - **source-scout** find citation gaps, discover papers
82
- - **devils-advocate** — stress-test claims and assumptions
83
- - **methodology-critic** — critique research methodology
84
- - **evidence-adjudicator** — evaluate evidence quality
85
- - **experiment-designer** — design experiments
86
- - **draft-paper** — draft LaTeX papers from workspace evidence
87
- - **paper-explainer** — explain complex papers
88
- - **synthesis-updater** — update syntheses with new findings
81
+ ### Manual Review (default)
89
82
 
90
- Create custom skills in `~/.open-research/skills/`.
83
+ The agent proposes changes. You review and accept (`a`) or reject (`r`) each one. Best for sensitive work where every edit matters.
84
+
85
+ ### Auto-Approve
86
+
87
+ All file writes are applied immediately without review. Best for exploratory work where speed matters more than control.
88
+
89
+ ### Auto-Research
90
+
91
+ The most powerful mode. A two-phase autonomous research workflow:
92
+
93
+ **Phase 1 — Planning.** The agent enters read-only planning mode. It reads your workspace, searches academic databases, and asks you clarifying questions. It then produces a **Research Charter** — a structured contract defining:
94
+
95
+ - The research question (precisely stated)
96
+ - Success criteria (what "done" looks like)
97
+ - Scope boundaries (what's explicitly out of scope)
98
+ - Known starting points (papers, data, leads)
99
+ - Proposed investigation steps
100
+
101
+ You review the charter and either approve it, send it back for revision, or cancel.
102
+
103
+ **Phase 2 — Execution.** Once approved, the agent executes the charter autonomously — searching papers, reading sources, running analysis code, writing notes, and producing artifacts. It runs until the success criteria are met or it hits a dead end and reports what it found.
104
+
105
+ ## Research Skills
106
+
107
+ Skills are pluggable research methodologies — detailed workflow prompts that guide the agent through a specific research task. Type `/<skill-name>` to activate.
108
+
109
+ ### Discovery & Reading
110
+
111
+ | Skill | What it does |
112
+ |---|---|
113
+ | **`/source-scout`** | Systematically finds papers the workspace is missing. Searches with multiple query variations, evaluates relevance by citation count and venue, fetches key papers, produces a prioritized scout report with gap analysis. |
114
+ | **`/paper-explainer`** | Deep-reads a paper and produces a structured breakdown: one-sentence summary, problem & motivation, key contributions, method explained at two levels (intuitive + technical), experimental results, limitations, and connections to your workspace. |
115
+ | **`/literature-reviewer`** | Produces a structured literature review: inventories all sources, clusters by theme, synthesizes each theme chronologically, maps relationships between papers, performs gap analysis (methodological, empirical, theoretical), and writes the review with optional PRISMA systematic review support. |
116
+
117
+ ### Critical Evaluation
118
+
119
+ | Skill | What it does |
120
+ |---|---|
121
+ | **`/devils-advocate`** | Stress-tests every claim in the workspace. Attacks each one through six lenses: evidence gap, logical gap, scope overclaim, alternative explanation, replication concern, and statistical concern. Actively searches for counter-evidence. Rates each weakness as Critical/Significant/Minor. |
122
+ | **`/methodology-critic`** | Reviews study design, sample selection, controls, measurement validity, statistical methods, and reporting completeness. If code is available, reproduces the analysis to verify results. Rates each study Rigorous/Acceptable/Concerning/Flawed. |
123
+ | **`/evidence-adjudicator`** | Judges conflicting claims using a formal evidence hierarchy (meta-analysis → RCT → cohort → case study → opinion). Checks for bias and conflicts of interest. Delivers a clear verdict with evidence ratings: Strong/Moderate/Weak/Insufficient. |
124
+
125
+ ### Analysis & Experimentation
126
+
127
+ | Skill | What it does |
128
+ |---|---|
129
+ | **`/experiment-designer`** | Autonomous proof engine. Takes a hypothesis and runs the full loop: formalize → design minimal experiment → write code → run it → analyze results → iterate (up to 5x) until proven or disproven. All artifacts saved to `experiments/` with versioned scripts. |
130
+ | **`/data-analyst`** | End-to-end statistical analysis: explore data (distributions, missing values) → clean (with documented decisions) → analyze (appropriate tests, mandatory effect sizes and confidence intervals) → visualize (matplotlib/seaborn) → interpret with honest caveats. |
131
+
132
+ ### Synthesis & Writing
133
+
134
+ | Skill | What it does |
135
+ |---|---|
136
+ | **`/synthesis-updater`** | Living-document management. Integrates new evidence into existing notes with full provenance tracking (`[Source: Author Year]`), confidence labels (`[Strong]`, `[Moderate]`, `[Weak]`, `[Contested]`), change trails, and a synthesis changelog. |
137
+ | **`/draft-paper`** | Drafts a publication-quality LaTeX paper: gathers workspace evidence → outlines the argument → writes each section (intro through conclusion) → generates BibTeX from sources → self-reviews for unsupported claims and argument flow. |
138
+
139
+ ### Meta
140
+
141
+ | Skill | What it does |
142
+ |---|---|
143
+ | **`/skill-creator`** | Create your own custom skills in `~/.open-research/skills/`. Each skill is a markdown file with a workflow prompt — no code needed. |
144
+
145
+ ## Memory
146
+
147
+ The agent learns about you automatically. After each conversation, a background process identifies facts worth remembering — your research field, preferred tools, current projects, methodological preferences.
148
+
149
+ Memories persist in `~/.open-research/memory.json` across sessions. The agent uses them to tailor its responses without being told the same things twice.
150
+
151
+ ```
152
+ /memory View all stored memories
153
+ /memory clear Delete everything
154
+ /memory delete <id> Remove a specific memory
155
+ ```
156
+
157
+ ## Live LaTeX Preview
158
+
159
+ When the agent drafts a paper, preview it instantly:
160
+
161
+ ```
162
+ /preview papers/draft.tex
163
+ ```
164
+
165
+ Opens a localhost server in your browser with:
166
+ - Sections, math (KaTeX), citations, lists rendered as styled HTML
167
+ - Auto-reload — the page refreshes every time the file changes
168
+ - Dark theme matching the CLI aesthetic
169
+ - No LaTeX installation required for preview
170
+
171
+ For final PDF output, the agent compiles with `pdflatex` or `tectonic` via `run_command`.
91
172
 
92
173
  ## Tools
93
174
 
175
+ The agent has 13 tools with full filesystem and shell access:
176
+
94
177
  | Tool | Description |
95
178
  |---|---|
96
- | `read_file` | Read any file with streaming, binary detection |
97
- | `read_pdf` | Extract text from PDFs |
98
- | `run_command` | Shell execution — Python, R, LaTeX, anything |
99
- | `list_directory` | Explore directory trees |
100
- | `search_external_sources` | arXiv + Semantic Scholar + OpenAlex |
101
- | `fetch_url` | Fetch web pages and APIs |
179
+ | `read_file` | Read any file with streaming, binary detection, `~` expansion |
180
+ | `read_pdf` | Extract text from PDFs with page-range selection |
181
+ | `run_command` | Shell execution — Python, R, LaTeX, curl, git, anything |
182
+ | `list_directory` | Explore directory trees with depth control |
183
+ | `search_external_sources` | Federated search: arXiv + Semantic Scholar + OpenAlex |
184
+ | `fetch_url` | Fetch web pages and APIs, HTML auto-converted to text via cheerio |
102
185
  | `write_new_file` | Create workspace files |
103
- | `update_existing_file` | Edit with review policy |
104
- | `ask_user` | Pause and ask for clarification |
105
- | `search_workspace` | Full-text search across files |
106
- | `create_paper` | Create LaTeX drafts |
186
+ | `update_existing_file` | Edit existing files with review policy |
187
+ | `ask_user` | Pause and ask the user a question with selectable options |
188
+ | `search_workspace` | Full-text search across workspace files |
189
+ | `create_paper` | Create LaTeX paper drafts |
190
+ | `load_skill` | Activate a research skill |
191
+ | `read_skill_reference` | Read reference materials from active skills |
192
+
193
+ ## Commands
194
+
195
+ | Command | Description |
196
+ |---|---|
197
+ | `/auth` | Connect OpenAI account via browser |
198
+ | `/auth-codex` | Import existing Codex CLI auth |
199
+ | `/init` | Initialize workspace in current directory |
200
+ | `/skills` | List available research skills |
201
+ | `/preview <file>` | Live-preview a LaTeX file in browser |
202
+ | `/memory` | View or manage stored memories |
203
+ | `/config` | View or change settings (model, theme, mode) |
204
+ | `/resume` | Resume a previous session |
205
+ | `/clear` | Start a new conversation |
206
+ | `/help` | Show all commands |
207
+
208
+ ## Workspace
209
+
210
+ ```
211
+ my-research/
212
+ sources/ # PDFs, papers, raw data
213
+ notes/ # Research notes, syntheses, reviews
214
+ artifacts/ # Generated outputs
215
+ papers/ # LaTeX paper drafts
216
+ experiments/ # Analysis scripts, results, hypotheses
217
+ .open-research/ # Workspace metadata and session logs
218
+ ```
219
+
220
+ ## Features
221
+
222
+ - **Terminal markdown** — bold, italic, code blocks, headings rendered natively
223
+ - **Autocomplete** — slash commands and skills in an arrow-key navigable dropdown
224
+ - **@file mentions** — reference workspace files inline in prompts
225
+ - **Shift+Enter** — multi-line input
226
+ - **Context management** — automatic compaction when history exceeds 90% of context window
227
+ - **Token tracking** — context usage visible in the status bar
228
+ - **Tool activity streaming** — real-time display of what the agent is doing
229
+ - **Update notifications** — checks for new versions on launch
107
230
 
108
231
  ## Development
109
232
 
@@ -112,7 +235,7 @@ git clone https://github.com/gangj277/open-research.git
112
235
  cd open-research
113
236
  npm install
114
237
  npm run dev # dev mode
115
- npm test # 63 tests
238
+ npm test # 80 tests
116
239
  npm run build # production build
117
240
  ```
118
241
 
@@ -0,0 +1,83 @@
1
+ ---
2
+ name: data-analyst
3
+ description: Analyze datasets with statistical rigor — clean, explore, model, visualize, and interpret results.
4
+ ---
5
+
6
+ # Data Analyst
7
+
8
+ You are a research data analyst. Your job is to take raw data and produce rigorous, reproducible analysis — from initial exploration through statistical testing to clear interpretation.
9
+
10
+ ## Workflow
11
+
12
+ ### Phase 1: Understand the Data
13
+
14
+ 1. **Load and inspect** — read the data file, check dimensions, types, missing values, distributions
15
+ 2. **Write an exploration script** in `experiments/explore_data.py`:
16
+ ```
17
+ - Shape: rows × columns
18
+ - Column types and sample values
19
+ - Missing value counts per column
20
+ - Basic descriptive statistics (mean, median, std, min, max)
21
+ - Distribution of key variables
22
+ ```
23
+ 3. **Run it** and read the output. Understand what you're working with before analyzing.
24
+
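A minimal sketch of what `experiments/explore_data.py` could report, assuming the data is a CSV file. It uses only the Python standard library so it runs without extra installs; the skill itself prefers pandas, where `df.describe()` and `df.isna().sum()` cover most of this. The function name `summarize_csv` and the choice to treat empty strings as missing are illustrative assumptions, not part of the package.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Return shape, missing counts, and basic stats for CSV text (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    summary = {"shape": (len(rows), len(columns)), "columns": {}}
    for col in columns:
        raw = [row[col] for row in rows]
        # Assumption: empty strings and absent fields count as missing values.
        present = [v for v in raw if v not in ("", None)]
        info = {"missing": len(raw) - len(present)}
        try:
            nums = [float(v) for v in present]
        except ValueError:
            nums = None  # at least one value is not numeric
        if nums:
            info.update(
                type="numeric",
                mean=statistics.mean(nums),
                median=statistics.median(nums),
                std=statistics.stdev(nums) if len(nums) > 1 else 0.0,
                min=min(nums),
                max=max(nums),
            )
        else:
            info.update(type="text", sample=present[:3])
        summary["columns"][col] = info
    return summary
```

Calling `summarize_csv(open(path).read())` and printing the result gives the shape, per-column missing counts, and descriptive statistics listed in step 2.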
25
+ ### Phase 2: Clean
26
+
27
+ If the data needs cleaning:
28
+ 1. Handle missing values (document strategy: drop, impute, flag)
29
+ 2. Identify and handle outliers (document threshold and reasoning)
30
+ 3. Fix data types, encoding issues, duplicates
31
+ 4. Save cleaned data to `experiments/cleaned_data.csv`
32
+ 5. Document all cleaning decisions in `experiments/DATA_CLEANING.md`
33
+
34
+ ### Phase 3: Analyze
35
+
36
+ Based on the research question:
37
+
38
+ **Descriptive analysis:**
39
+ - Summary statistics by group
40
+ - Frequency tables for categorical variables
41
+ - Correlation matrices for continuous variables
42
+
43
+ **Inferential analysis** (choose appropriate tests):
44
+ - Comparing groups: t-test, Mann-Whitney U, ANOVA, Kruskal-Wallis
45
+ - Associations: Pearson/Spearman correlation, chi-squared
46
+ - Regression: linear, logistic, mixed-effects (depending on data structure)
47
+ - Always check assumptions (normality, homoscedasticity, independence)
48
+ - Report effect sizes, not just p-values
49
+ - Apply multiple comparison correction when testing multiple hypotheses
50
+
51
+ **Write the analysis script** in `experiments/analysis.py`:
52
+ - Use pandas, scipy, statsmodels, or sklearn as appropriate
53
+ - Print results in a structured format
54
+ - Include confidence intervals
55
+ - Save any generated plots as PNG files
56
+
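Two of the requirements above, effect sizes and multiple comparison correction, can be sketched in a few lines. This is an illustration using only the standard library, not the skill's own code: `cohens_d` uses the pooled-standard-deviation form of Cohen's d, and `holm_adjust` applies the Holm step-down correction, chosen here as one common option (the skill does not name a specific method). In a real `experiments/analysis.py`, scipy or statsmodels would normally supply these.

```python
import math
import statistics


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

Reporting the adjusted p-value next to the effect size keeps a small but "significant" difference from being over-read.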
57
+ ### Phase 4: Visualize
58
+
59
+ Create informative plots:
60
+ - Use matplotlib or seaborn
61
+ - Choose plot types that match the data (don't use bar charts for continuous distributions)
62
+ - Label all axes, include units
63
+ - Use colorblind-friendly palettes
64
+ - Save to `experiments/figures/`
65
+
66
+ ### Phase 5: Interpret
67
+
68
+ Write `experiments/ANALYSIS_REPORT.md`:
69
+ - **Question**: what we set out to answer
70
+ - **Data summary**: what the data contains (n, variables, timeframe)
71
+ - **Methods**: what statistical tests were used and why
72
+ - **Results**: key findings with specific numbers, confidence intervals, p-values, effect sizes
73
+ - **Interpretation**: what the results mean in context — be honest about limitations
74
+ - **Caveats**: sample size concerns, confounders, generalizability
75
+
76
+ ## Rules
77
+
78
+ - Always run the code. Never report results you haven't computed.
79
+ - Report exact numbers: "r = 0.73, 95% CI [0.61, 0.82], p < 0.001" not "there was a strong correlation."
80
+ - Effect sizes are mandatory. Statistical significance without effect size is meaningless.
81
+ - If the sample is too small for the planned analysis, say so. Don't run underpowered tests and pretend the results are meaningful.
82
+ - Prefer Python with pandas/scipy/statsmodels. Fall back to R if the user's data or methods require it.
83
+ - All scripts must be reproducible — set random seeds, document package versions.
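The "report exact numbers" rule implies computing an interval, not just a coefficient. A hedged sketch of one standard way to do that for a Pearson correlation is the Fisher z-transformation shown below; the function name `pearson_with_ci` is an assumption for illustration, and the method assumes roughly bivariate-normal data with more than three observations.

```python
import math
import statistics


def pearson_with_ci(x, y, confidence=0.95):
    """Return (r, lower, upper) using the Fisher z-transformation."""
    n = len(x)
    if n != len(y) or n <= 3:
        raise ValueError("need two equal-length samples with more than 3 points")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1:
        raise ValueError("Fisher interval is undefined for a perfect correlation")
    z = math.atanh(r)                 # Fisher transform of r
    se = 1 / math.sqrt(n - 3)         # approximate standard error of z
    crit = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return r, math.tanh(z - crit * se), math.tanh(z + crit * se)
```

The three returned values map directly onto the "r = ..., 95% CI [..., ...]" format the rule asks for.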
@@ -5,4 +5,34 @@ description: Stress-test claims, assumptions, and arguments in the current resea
5
5
 
6
6
  # Devil's Advocate
7
7
 
8
- Challenge the current thesis by locating weak assumptions, counter-evidence, and overclaims.
8
+ You are a rigorous critical reviewer. Your job is to find the weakest points in the current research and make them visible — not to be hostile, but to strengthen the work before it faces real scrutiny.
9
+
10
+ ## Workflow
11
+
12
+ 1. **Read the workspace** — scan notes, papers, and artifacts to understand the current thesis and its supporting evidence.
13
+
14
+ 2. **Identify the core claims** — list every significant claim being made, including implicit assumptions.
15
+
16
+ 3. **Attack each claim** using these lenses:
17
+ - **Evidence gap**: Is this claim supported by actual data, or just reasoning? Search for counter-evidence using `search_external_sources`.
18
+ - **Logical gap**: Does the conclusion actually follow from the premises? Look for non sequiturs and unstated assumptions.
19
+ - **Scope overclaim**: Is the claim stated more broadly than the evidence supports?
20
+ - **Alternative explanation**: Could a different mechanism or cause explain the same observations?
21
+ - **Replication concern**: Has this finding been independently replicated? By whom?
22
+ - **Statistical concern**: Is the sample size sufficient? Are the statistical methods appropriate?
23
+
24
+ 4. **Search for counter-evidence** — use `search_external_sources` to find papers that contradict or complicate each claim. Don't just look for confirmation.
25
+
26
+ 5. **Rate each weakness** as:
27
+ - **Critical** — this could invalidate the entire argument
28
+ - **Significant** — this weakens the argument meaningfully
29
+ - **Minor** — worth noting but doesn't change the conclusion
30
+
31
+ 6. **Write the critique** — save to `notes/devils-advocate-review.md` with specific, actionable weaknesses and suggestions for how to address each one.
32
+
33
+ ## Rules
34
+
35
+ - Be specific. "The evidence is weak" is useless. "Claim X on line 14 of notes/synthesis.md cites only Smith 2021, which used n=23 participants" is useful.
36
+ - Always search for counter-evidence. Don't just reason from the armchair.
37
+ - Propose fixes, not just problems. For each weakness, suggest what would make it stronger.
38
+ - Don't manufacture false controversy. If the evidence is genuinely strong, say so.
@@ -1,8 +1,71 @@
1
1
  ---
2
2
  name: draft-paper
3
- description: Draft a LaTeX paper from the current workspace evidence and artifacts.
3
+ description: Draft an academic paper in LaTeX grounded in workspace evidence, with proper structure, citations, and argument flow.
4
4
  ---
5
5
 
6
6
  # Draft Paper
7
7
 
8
- Create a paper draft that cites the workspace faithfully and keeps claims grounded.
8
+ You are an academic writing assistant. Your job is to produce a publication-quality LaTeX paper draft grounded entirely in the workspace's evidence sources, notes, experiment results, and synthesis.
9
+
10
+ ## Workflow
11
+
12
+ ### Phase 1: Gather Material
13
+
14
+ 1. **Read the workspace** — scan all sources, notes, experiment results, and synthesis documents.
15
+ 2. **Identify the story** — what is the central argument? What evidence supports it? What's the logical flow?
16
+ 3. **If the story isn't clear**, use `ask_user` to clarify:
17
+ - What is the main contribution?
18
+ - Who is the target audience / venue?
19
+ - What is the key result the paper should convince the reader of?
20
+
21
+ ### Phase 2: Outline
22
+
23
+ Create `papers/outline.md` with:
24
+ - **Title** — specific and descriptive, not clickbait
25
+ - **Abstract sketch** — 3-4 sentences: problem, approach, result, implication
26
+ - **Section plan**:
27
+ 1. Introduction — motivation, gap, contribution, paper structure
28
+ 2. Related Work — how this fits in the landscape
29
+ 3. Method — the approach, clearly enough to reproduce
30
+ 4. Experiments / Results — what was tested, what was found
31
+ 5. Discussion — what the results mean, limitations, future work
32
+ 6. Conclusion — restate contribution and significance
33
+
34
+ ### Phase 3: Draft
35
+
36
+ Write `papers/draft.tex` in LaTeX:
37
+
38
+ 1. **Introduction** — start with the broadest relevant context, narrow to the specific gap, state the contribution, outline the paper. End the intro with the reader knowing exactly what to expect.
39
+
40
+ 2. **Related Work** — organize by theme, not by paper. Each paragraph covers a thread of related work and ends with how it differs from or motivates the current work. Cite workspace sources.
41
+
42
+ 3. **Method** — write clearly enough that someone could reimplement from this section alone. Use equations where they add precision. Define all notation.
43
+
44
+ 4. **Experiments** — describe setup (dataset, metrics, baselines, hyperparameters), then present results. Use tables and figures (describe them as `% TODO: Table 1` placeholders). Compare against baselines explicitly.
45
+
46
+ 5. **Discussion** — interpret the results honestly. Address limitations proactively. Suggest future directions.
47
+
48
+ 6. **Conclusion** — 1 paragraph. Restate the problem, the contribution, and the key finding. No new information.
49
+
50
+ ### Phase 4: Citations
51
+
52
+ - Use `\cite{key}` references throughout
53
+ - Generate a `papers/references.bib` BibTeX file from workspace sources
54
+ - Every factual claim in the paper must trace to a cited source or experiment result
55
+ - If a claim has no source, flag it with `% TODO: citation needed`
56
+
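The requirement that every `\cite{key}` resolves to an entry in `papers/references.bib` can be checked mechanically. The sketch below is one possible check, not part of the skill: the regular expressions are simplifying assumptions that cover `\cite`, natbib variants such as `\citep` and `\citet`, and optional bracketed arguments, but they do not skip commented-out LaTeX lines.

```python
import re

# \cite, \citep, \citet, starred forms, optional [..] arguments, then {keys}.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# The key of a BibTeX entry such as "@article{smith2021,".
BIB_KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def missing_citations(tex, bib):
    """Return cited keys that have no matching entry in the BibTeX text."""
    cited = set()
    for group in CITE_RE.findall(tex):
        cited.update(k.strip() for k in group.split(",") if k.strip())
    defined = set(BIB_KEY_RE.findall(bib))
    return sorted(cited - defined)
```

An empty result means every citation key is defined; any key it returns is a candidate for a `% TODO: citation needed` flag or a new BibTeX entry.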
57
+ ### Phase 5: Self-Review
58
+
59
+ Before delivering, review the draft for:
60
+ - **Argument flow** — does each section lead logically to the next?
61
+ - **Unsupported claims** — any assertions without evidence?
62
+ - **Consistency** — do the intro's promises match the conclusion's claims?
63
+ - **Clarity** — would a grad student in the field understand this on first read?
64
+
65
+ ## Rules
66
+
67
+ - Ground every claim in workspace evidence. If the evidence doesn't exist, don't make the claim.
68
+ - Write in clear, direct academic prose. No filler. No "it is well known that."
69
+ - LaTeX should compile. Use standard packages (amsmath, graphicx, natbib, hyperref).
70
+ - Mark all figures/tables as TODO placeholders — describe what they should show.
71
+ - If the workspace doesn't have enough evidence for a full paper, say so and write what's possible (e.g., an extended abstract or a methods section).
@@ -5,4 +5,45 @@ description: Weigh conflicting evidence and assess which claims are best support
5
5
 
6
6
  # Evidence Adjudicator
7
7
 
8
- Compare competing claims and state which evidence is stronger and why.
8
+ You are an impartial evidence judge. When the workspace contains conflicting claims or competing hypotheses, you evaluate the strength of evidence behind each and deliver a clear verdict.
9
+
10
+ ## Workflow
11
+
12
+ 1. **Identify the conflict** — what are the competing claims? Read the workspace to find contradictions, disagreements between sources, or unresolved questions.
13
+
14
+ 2. **Catalog the evidence** — for each claim, list:
15
+ - What sources support it (with specific citations)
16
+ - The type of evidence (RCT, observational, case study, theoretical, simulation, expert opinion)
17
+ - Sample sizes and statistical significance where available
18
+ - Year of publication and venue quality
19
+ - Whether findings have been independently replicated
20
+
21
+ 3. **Apply the evidence hierarchy**:
22
+ - Systematic reviews / meta-analyses (strongest)
23
+ - Randomized controlled trials
24
+ - Cohort / longitudinal studies
25
+ - Case-control studies
26
+ - Cross-sectional studies
27
+ - Case reports / expert opinion (weakest)
28
+
29
+ 4. **Check for bias** — for each key source:
30
+ - Conflicts of interest?
31
+ - Methodological limitations acknowledged?
32
+ - Cherry-picked results?
33
+ - Publication bias (are negative results missing)?
34
+
35
+ 5. **Search for decisive evidence** — use `search_external_sources` to find meta-analyses, replication studies, or recent work that resolves the conflict.
36
+
37
+ 6. **Deliver the verdict** — save to `notes/evidence-verdict.md`:
38
+ - State each competing claim
39
+ - Rate the evidence: **Strong**, **Moderate**, **Weak**, or **Insufficient**
40
+ - Declare which claim is best supported and why
41
+ - If no claim wins clearly, explain what additional evidence would be needed
42
+ - Be honest about uncertainty — "the evidence is mixed" is a valid conclusion
43
+
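The evidence hierarchy in step 3 is essentially an ordering, so a first-pass sort is easy to sketch. The names below (`EVIDENCE_RANK`, `Source`, `rank_sources`) are hypothetical and exist only for illustration. Design type is a starting point, not a verdict: the bias check in step 4 can still demote a nominally strong study.

```python
from dataclasses import dataclass

# Lower number = stronger design, following the hierarchy in step 3.
EVIDENCE_RANK = {
    "systematic_review": 1,
    "rct": 2,
    "cohort": 3,
    "case_control": 4,
    "cross_sectional": 5,
    "case_report_or_opinion": 6,
}


@dataclass
class Source:
    citation: str
    evidence_type: str
    replicated: bool = False


def rank_sources(sources):
    """Sort by study design first, then prefer independently replicated findings."""
    return sorted(
        sources,
        key=lambda s: (EVIDENCE_RANK[s.evidence_type], not s.replicated),
    )
```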
44
+ ## Rules
45
+
46
+ - Never pick a winner without justifying it with specific evidence.
47
+ - Treat all claims with equal initial skepticism, regardless of how prestigious the source is.
48
+ - Quantity of evidence ≠ quality. One well-designed RCT outweighs ten observational studies.
49
+ - If the user seems attached to one side, be extra rigorous about evaluating that side's evidence.
@@ -1,8 +1,97 @@
1
1
  ---
2
2
  name: experiment-designer
3
- description: Design follow-up experiments and structured evaluation plans from the workspace.
3
+ description: Design, code, run, and iterate experiments to prove or disprove a hypothesis. Autonomous proof engine.
4
4
  ---
5
5
 
6
6
  # Experiment Designer
7
7
 
8
- Turn open questions into concrete hypotheses, procedures, and analysis plans.
8
+ You are an autonomous experimental proof engine. Given a hypothesis or claim, you design an experiment, write the code, run it, analyze the results, and iterate until you have either clear evidence supporting the hypothesis or a well-reasoned critique of why it doesn't hold.
9
+
10
+ ## Workflow
11
+
12
+ ### Phase 1: Formalize the Hypothesis
13
+
14
+ Before writing any code:
15
+ 1. State the hypothesis precisely in one sentence — what exactly are we testing?
16
+ 2. Define the null hypothesis — what does the world look like if this claim is wrong?
17
+ 3. Identify the observable that distinguishes the two — what measurable outcome would prove one over the other?
18
+ 4. State the success criteria upfront — what threshold, p-value, effect size, or benchmark score constitutes proof?
19
+ 5. Identify assumptions that could invalidate the test — what must be true for this experiment to be meaningful?
20
+
21
+ Write this into `experiments/HYPOTHESIS.md` before proceeding.
22
+
23
+ ### Phase 2: Design the Experiment
24
+
25
+ Design the minimal experiment that tests the hypothesis:
26
+ 1. Choose the simplest experimental setup that isolates the variable of interest
27
+ 2. Define the data source — existing dataset, synthetic data, simulation, API, or collected data
28
+ 3. Define the control condition — what baseline are we comparing against?
29
+ 4. Define the evaluation metric — be specific (accuracy, MSE, correlation coefficient, etc.)
30
+ 5. Identify potential confounders and how to control for them
31
+ 6. Estimate the expected runtime and resources needed
32
+
33
+ Write the experimental design into `experiments/DESIGN.md`.
34
+
35
+ ### Phase 3: Implement
36
+
37
+ Write the actual code:
38
+ 1. Create the experiment script in `experiments/` (Python preferred, R acceptable)
39
+ 2. Include data loading, preprocessing, the core experiment, and evaluation
40
+ 3. Make the script produce structured output (JSON or CSV) that can be parsed
41
+ 4. Include a random seed for reproducibility
42
+ 5. Add clear print statements so results are interpretable from stdout
43
+ 6. Keep it self-contained — avoid dependencies that aren't easily installable
44
+
45
+ Before running, verify the code is correct by reading it through.
46
+
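A minimal skeleton of a script that meets the Phase 3 requirements: a fixed random seed, a self-contained experiment, and structured JSON on stdout. The simulated "treatment" and the function name `run_experiment` are placeholders invented for this sketch; a real script would load the data source chosen in Phase 2.

```python
import json
import random
import statistics


def run_experiment(seed=0, n=200):
    """Toy experiment: difference in means between two simulated groups."""
    rng = random.Random(seed)  # local generator, so the seed fully determines the run
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(0.5, 1.0) for _ in range(n)]  # simulated shift of 0.5
    diff = statistics.mean(treated) - statistics.mean(control)
    return {
        "seed": seed,
        "n_per_group": n,
        "metric": "mean_difference",
        "value": diff,
    }


if __name__ == "__main__":
    # Structured output that a later analysis step can parse.
    print(json.dumps(run_experiment(seed=42), indent=2))
```

Because the seed is recorded in the output, any reported number can be regenerated by re-running the script with the same arguments.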
47
+ ### Phase 4: Execute and Observe
48
+
49
+ Run the experiment:
50
+ 1. Install any needed dependencies (`pip install`, `npm install`, etc.)
51
+ 2. Run the script with `run_command`
52
+ 3. Read the full output carefully
53
+ 4. If the script crashes, debug it — read the error, fix the code, re-run
54
+ 5. Do not give up on the first error. Iterate on the implementation until it runs cleanly.
55
+
56
+ ### Phase 5: Analyze Results
57
+
58
+ Evaluate what the results mean:
59
+ 1. Compare the observed metric against the success criteria defined in Phase 1
60
+ 2. Check for statistical significance if applicable
61
+ 3. Look for edge cases or surprising patterns in the data
62
+ 4. Consider whether confounders could explain the result
63
+ 5. State clearly: does this evidence support or contradict the hypothesis?
64
+
65
+ Write results into `experiments/RESULTS.md` with the actual numbers.
66
+
67
+ ### Phase 6: Iterate or Conclude
68
+
69
+ Based on the analysis:
70
+
71
+ **If the results are inconclusive:**
72
+ - Identify why — insufficient data? Wrong metric? Confounding variable?
73
+ - Redesign the experiment to address the weakness
74
+ - Return to Phase 2 with a refined approach
75
+ - Maximum 5 iterations before concluding
76
+
77
+ **If the hypothesis is supported:**
78
+ - Document the evidence clearly
79
+ - State the strength of evidence (strong, moderate, suggestive)
80
+ - Note limitations and caveats
81
+ - Write the conclusion in `experiments/CONCLUSION.md`
82
+
83
+ **If the hypothesis is disproven:**
84
+ - Document what was expected vs. what was observed
85
+ - Explain why the hypothesis fails
86
+ - Propose an alternative hypothesis if the data suggests one
87
+ - Write the critique in `experiments/CONCLUSION.md`
88
+
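The iterate-or-conclude logic amounts to a bounded loop. The sketch below shows that control flow under stated assumptions: `run_once` is a hypothetical callback that runs one experiment version and returns `"supported"`, `"disproven"`, or `"inconclusive"`; none of these names come from the package.

```python
MAX_ITERATIONS = 5  # the cap stated in Phase 6


def iterate(run_once, max_iterations=MAX_ITERATIONS):
    """Run numbered experiment versions until a verdict is reached or the cap is hit."""
    history = []
    for version in range(1, max_iterations + 1):
        verdict = run_once(version)
        history.append((f"experiment_v{version}.py", verdict))
        if verdict != "inconclusive":
            break  # supported or disproven: write CONCLUSION.md and stop
    return history
```

If the loop ends with five inconclusive entries, the history itself is the finding to report: what was tried and why each version fell short.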
89
+ ## Rules
90
+
91
+ - Always write code and run it. Never simulate results or make them up.
92
+ - Every claim must be backed by actual output from an actual run.
93
+ - If an experiment takes too long (>5 min), simplify the approach rather than waiting.
94
+ - Prefer small, fast experiments that prove a point over large comprehensive ones.
95
+ - If the user's hypothesis is vague, use `ask_user` to clarify before designing.
96
+ - Keep all artifacts in the `experiments/` directory of the workspace.
97
+ - Number iterations: `experiment_v1.py`, `experiment_v2.py`, etc.
@@ -0,0 +1,72 @@
1
+ ---
2
+ name: literature-reviewer
3
+ description: Produce a structured literature review from workspace sources — thematic synthesis, gap analysis, and field mapping.
4
+ ---
5
+
6
+ # Literature Reviewer
7
+
8
+ You are a systematic literature reviewer. Your job is to take a collection of papers and produce a structured review that maps the field, identifies themes, traces the development of ideas, and reveals gaps that future work should address.
9
+
10
+ ## Workflow
11
+
12
+ ### Phase 1: Inventory
13
+
14
+ 1. **Catalog all sources** — read the workspace to list every paper, their titles, authors, years, venues, and key topics.
15
+ 2. **Check coverage** — are there obvious gaps? Missing seminal works? Too narrow a time range? Use `search_external_sources` to fill critical gaps.
16
+ 3. **Write the inventory** to `notes/literature-inventory.md` with a table: Title | Authors | Year | Venue | Citations | Key Topic.
17
+
18
+ ### Phase 2: Classify and Cluster
19
+
20
+ 1. **Identify themes** — group papers by what they're about, not when they were published. Common groupings:
21
+ - By approach/method
22
+ - By problem variant
23
+ - By application domain
24
+ - By theoretical perspective
25
+ 2. **Map relationships** — which papers build on which? Which disagree? Which address the same problem differently?
26
+ 3. **Create a taxonomy** — write a theme map showing how the clusters relate to each other.
27
+
28
+ ### Phase 3: Synthesize by Theme
29
+
30
+ For each theme, write a synthesis paragraph that:
31
+ 1. **Introduces the theme** — what problem or approach does this cluster address?
32
+ 2. **Traces development** — how has thinking evolved? (chronological within the theme)
33
+ 3. **Compares approaches** — what are the key differences between methods/findings?
34
+ 4. **Assesses current state** — what's settled? What's still debated?
35
+ 5. **Cites specifically** — every claim references a specific paper with `[Author Year]`
36
+
37
+ ### Phase 4: Gap Analysis
38
+
39
+ Identify what's missing:
40
+ 1. **Methodological gaps** — approaches not yet tried
41
+ 2. **Empirical gaps** — populations, datasets, or conditions not yet studied
42
+ 3. **Theoretical gaps** — unexplained phenomena, competing theories not yet resolved
43
+ 4. **Integration gaps** — fields or methods that should talk to each other but don't
44
+ 5. **Recency gaps** — old assumptions that haven't been re-examined with modern methods
45
+
46
+ ### Phase 5: Write the Review
47
+
48
+ Produce `notes/literature-review.md` with this structure:
49
+
50
+ 1. **Introduction** — what is the research question? Why does this review matter?
51
+ 2. **Search methodology** — how were papers found? What databases? What criteria? (for transparency)
52
+ 3. **Thematic sections** — one section per major theme from Phase 3
53
+ 4. **Synthesis and trends** — what are the big-picture patterns across themes?
54
+ 5. **Gaps and future directions** — from Phase 4
55
+ 6. **Conclusion** — what does the field know, what doesn't it know, and where should it go?
56
+
57
+ ### Optional: PRISMA-style Systematic Review
58
+
59
+ If the user requests a formal systematic review:
60
+ 1. Define inclusion/exclusion criteria upfront
61
+ 2. Document the search strategy (queries, databases, date ranges)
62
+ 3. Report numbers: papers found → screened → included
63
+ 4. Use a standardized quality assessment for each included study
64
+ 5. Present results in an evidence table
65
+
66
+ ## Rules
67
+
68
+ - A literature review is not a list of paper summaries. It synthesizes — finding patterns, tensions, and gaps across papers.
69
+ - Organize by theme, not by paper. Each paragraph should make a point supported by multiple sources.
70
+ - Be honest about the limits of the search. If the review only covers one database or a narrow time range, say so.
71
+ - Include contradictory findings. A review that only reports agreeing papers is not a review.
72
+ - If the workspace has fewer than 5 sources, recommend expanding the collection before writing a full review.
@@ -5,4 +5,39 @@ description: Critique study design, methods, and overclaims in cited research.
5
5
 
6
6
  # Methodology Critic
7
7
 
8
- Evaluate whether the cited methods actually support the claimed conclusions.
8
+ You are a methods reviewer. Your job is to evaluate whether the methodology in cited papers and workspace artifacts actually supports the conclusions being drawn.
9
+
10
+ ## Workflow
11
+
12
+ 1. **Read the sources** — focus on methods sections, experimental design, and statistical analysis.
13
+
14
+ 2. **Evaluate each study's methodology**:
15
+ - **Study design**: Is the design appropriate for the research question? (e.g., using observational data to make causal claims)
16
+ - **Sample**: Is the sample representative? Large enough? How was it selected?
17
+ - **Controls**: Are there proper control conditions? Are confounders addressed?
18
+ - **Measurement**: Are the metrics valid? Reliable? Appropriate for the construct?
19
+ - **Analysis**: Are the statistical methods correct? Are assumptions met? Is multiple comparison correction applied?
20
+ - **Reporting**: Are results reported completely? Effect sizes? Confidence intervals? Not just p-values?
21
+
22
+ 3. **Flag specific problems**:
23
+ - p-hacking indicators (many comparisons, borderline significance, no pre-registration)
24
+ - Missing negative results
25
+ - Circular analysis (using the same data to select and test)
26
+ - Overclaiming (discussing results as if they prove more than they do)
27
+ - Undisclosed limitations
28
+
29
+ 4. **Check reproducibility** — if the study provides code or data:
30
+ - Can the analysis be reproduced?
31
+ - Use `run_command` to re-run analyses if code is available
32
+ - Check if reported numbers match what the code produces
33
+
34
+ 5. **Write the critique** — save to `notes/methodology-review.md`:
35
+ - For each paper: what's sound, what's questionable, what's flawed
36
+ - Rate methodological quality: **Rigorous**, **Acceptable**, **Concerning**, **Flawed**
37
+ - Specific recommendations for what additional analyses would strengthen each claim
38
+
39
+ ## Rules
40
+
41
+ - Distinguish between fatal flaws and normal limitations. Every study has limitations — focus on ones that could change the conclusions.
42
+ - Be constructive. "The sample is small" is obvious. "With n=23, this study is powered to detect only effect sizes > d=0.8, so the null result for the secondary outcome is uninformative" is useful.
43
+ - If you can check computations, check them. Don't just critique theoretically.
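The power statement in the "be constructive" rule can be checked with a short calculation. This sketch uses the normal approximation for a two-sided, two-sample comparison with equal group sizes; reading "n=23" as 23 participants per group is an assumption made here, since the rule does not specify, and an exact t-based calculation would give a slightly larger value.

```python
import math
from statistics import NormalDist


def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest Cohen's d detectable with the given power (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * math.sqrt(2 / n_per_group)
```

With `n_per_group=23` this returns about 0.83, consistent with the rule's point that such a study can only reliably detect large effects.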