
flonat-research 0.1.0 → 0.2.0


Files changed (498)

package/skills/causal-design/references/strategy-memo-template.md ADDED

@@ -0,0 +1,138 @@
+# Strategy Memo Template
+> Use this template when producing the strategy memo in `/causal-design` Design Phase 3.
+> Fill in every section. If a section is not applicable, state why rather than leaving it blank.
+---
+```markdown
+# Causal Strategy Memo
+**Project:** [project name]
+**Date:** YYYY-MM-DD
+**Status:** Draft | Locked
+---
+## 1. Research Question
+**Causal question:** [State the causal question in one sentence. What is the effect of X on Y?]
+**Treatment (X):** [Define precisely -- what is the treatment, intervention, or policy?]
+**Outcome (Y):** [Define precisely -- what is the primary outcome variable? How is it measured?]
+**Population:** [Who is the estimand defined over? What is the target population?]
+## 2. Estimand
+**Formal definition:**
+$$\tau = E[Y_i(1) - Y_i(0) \mid \text{subpopulation}]$$
+**Type:** [ATE / ATT / LATE / CATE / Other]
+**Interpretation:** [One sentence explaining what the parameter means substantively.]
+## 3. Identification Strategy
+**Strategy:** [DiD / IV / RDD / SC / Event Study / Matching / Other]
+**Source of variation:** [What generates exogenous variation in treatment? Why is this variation plausibly exogenous?]
+**Intuition:** [2-3 sentences explaining the identification argument in plain language. A non-technical reader should understand why this comparison is valid.]
+**Formal identification result:**
+[State the formal result: under assumptions A1-An, the estimand is identified by [expression]. Reference the relevant econometric result if applicable.]
+## 4. Key Assumptions
+For each assumption, state it formally, provide a conceptual defence, and describe how (if possible) it will be tested.
+### Assumption 1: [Name]
+**Statement:** [Formal statement]
+**Defence:** [Why should this hold in your setting?]
+**Testable?** [Yes / No / Partially]
+**Test plan:** [If testable, what diagnostic will you run? What would a failure look like?]
+### Assumption 2: [Name]
+[Same structure]
+### Assumption N: [Name]
+[Same structure]
+## 5. Threats and Mitigations
+| # | Threat | Severity | Mitigation | Residual risk |
+|---|--------|----------|-----------|---------------|
+| T1 | [What could go wrong] | High/Medium/Low | [How you address it] | [What risk remains] |
+| T2 | ... | ... | ... | ... |
+| T3 | ... | ... | ... | ... |
+## 6. Diagnostics Plan
+List every diagnostic test you will run before trusting the main estimates. These must be run **before** examining point estimates (per the `design-before-results` rule).
+| # | Diagnostic | Purpose | Pass criterion |
+|---|-----------|---------|----------------|
+| D1 | [Test name] | [What it checks] | [What counts as passing] |
+| D2 | ... | ... | ... |
+| D3 | ... | ... | ... |
+## 7. Robustness Checks
+Pre-commit to alternative specifications. These are decided now, before seeing results. Post-hoc robustness checks added after seeing the main results are not credible.
+| # | Specification | What it varies | Why informative |
+|---|--------------|----------------|-----------------|
+| R1 | [Description] | [What changes vs. main spec] | [What we learn] |
+| R2 | ... | ... | ... |
+| R3 | ... | ... | ... |
+## 8. Alternative Strategies Considered
+| Strategy | Why considered | Why rejected |
+|----------|---------------|-------------|
+| [Strategy 1] | [What made it a candidate] | [Why it was not chosen] |
+| [Strategy 2] | ... | ... |
+## 9. Data Requirements
+| Variable | Source | Available? | Notes |
+|----------|--------|-----------|-------|
+| Treatment | [source] | Yes/No/Partial | |
+| Outcome | [source] | Yes/No/Partial | |
+| Running variable (RDD) | [source] | Yes/No/N/A | |
+| Instrument (IV) | [source] | Yes/No/N/A | |
+| Pre-treatment covariates | [source] | Yes/No/Partial | |
+| Panel structure | [source] | Yes/No/N/A | |
+## 10. Implementation Notes
+**Estimator:** [What R/Python/Stata package and function will implement this? e.g., `did` (Callaway & Sant'Anna), `rdrobust`, `ivreg`]
+**Standard errors:** [How will inference be conducted? Clustered at what level? Bootstrap?]
+**Sample restrictions:** [Any sample restrictions beyond the target population?]
+---
+## Sign-Off
+- [ ] Estimand is precisely defined
+- [ ] Identification strategy matches the estimand
+- [ ] All key assumptions are stated and defended
+- [ ] Diagnostics plan is complete and pre-committed
+- [ ] Robustness checks are pre-committed
+- [ ] This memo has been reviewed by domain-reviewer agent
+**Locked by:** [name]
+**Lock date:** [date]
+```
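
Note on sections 2 and 3 of the template above: as an illustration only, this is how the estimand and the formal identification result might be filled in for a two-period difference-in-differences design. The notation ($D_i$ for the treated-group indicator, $t \in \{0, 1\}$ for the pre and post periods) is an assumption made for this sketch and does not come from the package.

$$\tau_{ATT} = E[Y_{i1}(1) - Y_{i1}(0) \mid D_i = 1]$$

Under parallel trends, $E[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 1] = E[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 0]$, and no anticipation, the estimand is identified by

$$\tau_{ATT} = \big(E[Y_{i1} \mid D_i = 1] - E[Y_{i0} \mid D_i = 1]\big) - \big(E[Y_{i1} \mid D_i = 0] - E[Y_{i0} \mid D_i = 0]\big)$$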

package/skills/code-review/SKILL.md CHANGED

@@ -1,13 +1,13 @@
 ---
 name: code-review
-description: "Use when you need a quality review of R or Python research scripts."
-allowed-tools: Read, Glob, Grep
-argument-hint: [script-path or project-path]
+description: "Use when you need a quality review of R, Python, or Julia research scripts. Multi-persona orchestrator with parallel specialist reviewers."
+allowed-tools: Read, Glob, Grep, Agent, Bash(wc*)
+argument-hint: "[script-path or project-path]"
 ---
 # Research Code Review
-**Report-only skill.** Never edit source files — produce `CODE-REVIEW-REPORT.md` only.
+**Report-only skill.** Never edit source files — produce `reviews/code-review/YYYY-MM-DD_CODE-REVIEW-REPORT.md` only.
 ## When to Use
@@ -22,216 +22,217 @@ argument-hint: [script-path or project-path]
 - **Formal verification** — use the Referee 2 agent for cross-language replication
 - **General software projects** — this is for research scripts, not applications
-## Workflow
+---
-1. **Locate scripts**: Find all `.R`, `.py`, `.do`, `.jl` files in the project
-2. **Read each script** carefully
-3. **Score each category** (Pass / Fail / N/A)
-4. **Produce report**: Write `CODE-REVIEW-REPORT.md` in the project directory
+## Architecture
-## 11 Review Categories
+**Orchestrator + parallel specialist reviewers.** The main context runs a baseline checklist, then spawns 3-6 specialist sub-agents in parallel. Each reviewer produces structured JSON findings. The orchestrator deduplicates, merges, and synthesizes a single report.
-### 1. Reproducibility
+```
+Phase 1: Scope → Phase 2: Baseline Checklist → Phase 3: Spawn Reviewers
+→ Phase 4: Merge & Dedup → Phase 5: Synthesize Report
+```
-| Check | Pass Criteria |
-|-------|--------------|
-| Random seeds | `set.seed()` / `random.seed()` / `np.random.seed()` set before any stochastic operation |
-| Relative paths | No hardcoded absolute paths (e.g., `/Users/username/...` or `C:\...`) |
-| Working directory | Script does not `setwd()` / `os.chdir()` — uses project-relative paths |
-| Session info | Script prints session info at end (`sessionInfo()` / `sys.version`) or documents environment |
+---
-### 2. Script Structure
+## Phase 1: Scope Detection
-| Check | Pass Criteria |
-|-------|--------------|
-| Header | Script begins with comment block: purpose, author, date, inputs, outputs |
-| Sections | Code organised into labelled sections (comments or `# ---- Section ----`) |
-| Imports at top | All `library()` / `import` statements at the top of the file |
-| Reasonable length | Single script < 500 lines; longer scripts should be split |
+1. **Locate scripts:** Find all `.R`, `.py`, `.jl`, `.do` files in the project (or the specified path)
+2. **Count and classify:** Report file count, languages, total lines of code
+3. **Read project CLAUDE.md** (if it exists) for domain context, estimand, methodology
-### 3. Output Hygiene
+If no code files are found, stop: "No code files found at [path]."
-| Check | Pass Criteria |
-|-------|--------------|
-| No print pollution | No stray `print()` / `cat()` / `message()` dumping to console |
-| Outputs saved | Key results saved to files, not just printed |
-| Clean console | Running the script does not produce walls of text |
+---
-### 4. Function Quality
+## Phase 2: Baseline Checklist (main context, fast pass)
-| Check | Pass Criteria |
-|-------|--------------|
-| Documentation | Functions have comments explaining purpose, inputs, outputs |
-| Naming | Function names are descriptive verbs (`estimate_ate`, not `f1`) |
-| Defaults | Reasonable defaults for optional parameters |
-| No side effects | Functions don't modify global state |
+Run through all 11 categories as a quick structural check. This catches mechanical issues that don't need specialist reviewers.
-### 5. Domain Correctness
+### 11 Checklist Categories
-| Check | Pass Criteria |
-|-------|--------------|
-| Estimator matches paper | The estimator used matches what the paper claims |
-| Weights | If weighted: weights sum to expected value, correct application |
-| Standard errors | Clustering / HC / bootstrap matches paper specification |
-| Sample restrictions | Filters match the paper's sample description |
-| Variable construction | Variables constructed as described in the paper |
+See [`references/checklist-categories.md`](references/checklist-categories.md) for detailed specifications of all 11 categories: Reproducibility, Script Structure, Output Hygiene, Function Quality, Domain Correctness, Figure Quality, Data Persistence, Dependencies, Python-Specific, R-Specific, and Cross-Language Verification.
-### 6. Figure Quality
+Record checklist results (Pass/Fail/N/A per category) for the report. Continue to Phase 3 regardless of results.
-| Check | Pass Criteria |
-|-------|--------------|
-| Dimensions specified | Figure size set explicitly (not default) |
-| Transparency/resolution | Appropriate for publication (300+ DPI for raster, vector preferred) |
-| Saved to file | Figures saved with `ggsave()` / `plt.savefig()`, not just displayed |
-| Labels | Axes labelled, legend present where needed, title informative |
-| Colour | Colourblind-friendly palette; not relying on red/green distinction |
+---
-### 7. Data Persistence
+## Phase 3: Spawn Specialist Reviewers
-| Check | Pass Criteria |
-|-------|--------------|
-| Intermediate objects saved | Expensive computations saved (`saveRDS()` / `pickle.dump()` / `.parquet`) |
-| Load before recompute | Script checks for saved objects before rerunning expensive operations |
-| Output format | Final outputs in portable format (CSV, parquet — not just `.RData`) |
+Read `references/persona-catalog.md` for the full persona definitions and selection logic.
-### 8. Dependencies
+### 3a. Select Reviewers
-| Check | Pass Criteria |
-|-------|--------------|
-| Declared at top | All `library()` / `import` at the start of the script |
-| Versions documented | `renv.lock` / `requirements.txt` / `pyproject.toml` exists |
-| No unnecessary packages | Each loaded package is actually used |
-| Installation instructions | README or comment explains how to set up the environment |
+**Always spawn (3 reviewers):**
+- `correctness-reviewer` — logic errors, bugs, state issues
+- `reproducibility-reviewer` — seeds, paths, environment, portability
+- `design-reviewer` — structure, naming, dead code, complexity
-### 9. Python-Specific
+**Conditionally spawn (scan code to decide):**
+- `domain-reviewer` — if statistical/econometric methods detected
+- `performance-reviewer` — if loops over data, DB queries, or expensive operations detected
+- `security-reviewer` — if user input handling, HTTP, SQL, shell commands, or credentials detected
-*Score N/A if no Python files.*
+### 3b. Announce Team
-| Check | Pass Criteria |
-|-------|--------------|
-| Type hints | Functions have type annotations for parameters and return values |
-| Docstrings | Functions have docstrings (not just comments) |
-| uv usage | Uses `uv` for environment management (per project conventions) |
-| f-strings | Uses f-strings, not `.format()` or `%` formatting |
+Before spawning, list the team:
-### 10. R-Specific
+```
+Review team: correctness, reproducibility, design, domain (detected: lm() with cluster SEs)
+```
-*Score N/A if no R files.*
+### 3c. Spawn in Parallel
-| Check | Pass Criteria |
-|-------|--------------|
-| tidyverse consistency | Doesn't mix base R and tidyverse for the same operation |
-| Assignment operator | Uses `<-` not `=` for assignment |
-| Boolean values | Uses `TRUE`/`FALSE`, not `T`/`F` |
-| Pipe consistency | Uses one pipe style consistently (`%>%` or `|>`) |
+For each selected reviewer, launch a sub-agent (subagent_type: "general-purpose", model: "haiku") with:
-### 11. Cross-Language Verification
+1. Read `references/subagent-template.md` — substitute `{persona_name}` and `{persona_content}` from the catalog
+2. Pass the file list and instruct the agent to read each file
+3. Instruct: return ONLY JSON matching `references/findings-schema.json`
-*Score N/A if the project has no numerical results or only uses one language.*
+**All reviewers run in parallel** — launch them in a single message with multiple Agent tool calls.
-| Check | Pass Criteria |
-|-------|--------------|
-| Replication directory | `code/replication/` (or equivalent) exists with cross-language scripts |
-| Two-language coverage | Key numerical results reproduced in a second language (e.g., R results verified in Python or vice versa) |
-| Result comparison | Scripts compare outputs and report discrepancies (tolerance-based, not exact match) |
-| Precision threshold | Numerical outputs compared to 6+ decimal places — discrepancies at lower precision indicate real bugs |
-| Documentation | README or comments explain what is being replicated and acceptable tolerance |
+---
-#### Why Cross-Language Replication Works
+## Phase 4: Merge & Deduplicate
-Different languages produce different hallucination patterns when AI-assisted. An error in a Python implementation is unlikely to appear identically in R (or vice versa), making discrepancies easy to spot. This is the core insight from Scott Cunningham's Referee 2 protocol.
+After all reviewers return:
-#### How to Set Up
+### 4a. Validate
-1. Create `code/replication/` with scripts that independently implement key numerical results in a second language
-2. Write a comparison script that loads outputs from both languages and reports discrepancies at 6+ decimal places
-3. Document what is being replicated, which results are covered, and the acceptable tolerance (e.g., 1e-6 for coefficients, 1e-4 for standard errors)
+- Parse each reviewer's JSON output
+- Drop malformed findings (note count of dropped findings)
+- Drop findings with confidence < 0.60 (exception: P0 at 0.50+ survives)
-## Confidence Filtering
+### 4b. Deduplicate
-- Only report issues where you are >80% confident they are genuine problems
-- Consolidate similar findings (e.g., 5 instances of the same naming issue = 1 finding with count)
-- For borderline cases, note uncertainty: "Possible issue (medium confidence): ..."
-- Never pad the report with low-confidence observations to appear thorough
+Fingerprint each finding:
-## Scorecard
+```
+fingerprint = normalize(file) + line_bucket(line, ±3) + normalize(title)
+```
-| # | Category | Result | Notes |
-|---|----------|--------|-------|
-| 1 | Reproducibility | Pass/Fail | |
-| 2 | Script structure | Pass/Fail | |
-| 3 | Output hygiene | Pass/Fail | |
-| 4 | Function quality | Pass/Fail | |
-| 5 | Domain correctness | Pass/Fail | |
-| 6 | Figure quality | Pass/Fail | |
-| 7 | Data persistence | Pass/Fail | |
-| 8 | Dependencies | Pass/Fail | |
-| 9 | Python-specific | Pass/Fail/N/A | |
-| 10 | R-specific | Pass/Fail/N/A | |
-| 11 | Cross-language verification | Pass/Fail/N/A | |
+Where:
+- `normalize()` = lowercase, strip whitespace
+- `line_bucket(line, ±3)` = two findings whose line numbers differ by at most 3 are treated as the same location
-**Overall: X/11 Pass** (adjust denominator for N/A categories)
+When fingerprints match across reviewers:
+- Keep the **highest severity**
+- Keep the **highest confidence** and take the union of all evidence
+- Record which reviewers agreed (e.g., "correctness, domain")
+- **Cross-reviewer agreement bonus:** +0.10 confidence (capped at 1.0)
-## Quality Scoring
+### 4c. Map to Quality Rubric
-Apply numeric quality scoring using the shared framework and skill-specific rubric:
+Map each merged finding to the closest entry in `references/quality-rubric.md` to determine the deduction. If no exact match, classify by severity tier and use the midpoint deduction.
-- **Framework:** [`../shared/quality-scoring.md`](../shared/quality-scoring.md) — severity tiers, thresholds, verdict rules
-- **Rubric:** [`references/quality-rubric.md`](references/quality-rubric.md) — issue-to-deduction mappings for this skill
+### 4d. Sort
-Start at 100, deduct per issue found, apply verdict. Insert the Score Block into the report after the scorecard.
+Sort findings: P0 first → P1 → P2 → P3, then by confidence (descending), then by file, then by line.
+---
-## Report Format
+## Phase 5: Synthesize Report
+Create `reviews/code-review/` if it does not exist (`mkdir -p`). Write `reviews/code-review/YYYY-MM-DD_CODE-REVIEW-REPORT.md` in the project directory (date-stamped so prior reports are preserved, matching the pattern used by `paper-critic`, `peer-reviewer`, `domain-reviewer`, `referee2-reviewer`, and `proofread`).
+### Report Format
 ```markdown
 # Code Review Report
 **Project:** [path]
 **Date:** YYYY-MM-DD
-**Scripts reviewed:** [list]
-**Languages:** R / Python / Both
+**Scripts reviewed:** [list with line counts]
+**Languages:** R / Python / Julia / Both
+**Review team:** [list of reviewers with conditional justifications]
-## Scorecard
+## Quality Score
-[Table above, filled in]
+| Metric | Value |
+|--------|-------|
+| **Score** | XX / 100 |
+| **Verdict** | Ship / Ship with notes / Revise / Revise (major) / Blocked |
-## Detailed Findings
+### Deductions
-### Category 1: Reproducibility
-**Result: Pass/Fail**
+| # | Issue | Tier | Deduction | Category | Reviewer(s) | Confidence |
+|---|-------|------|-----------|----------|-------------|------------|
+| 1 | [title] | P0 | -25 | Domain Correctness | domain, correctness | 0.92 |
+| 2 | [title] | P1 | -15 | Reproducibility | reproducibility | 0.85 |
+| ... | | | | | | |
+| | **Total deductions** | | **-XX** | | | |
-[Specific findings with file:line references]
+## Checklist Scorecard
-### Category 2: Script Structure
-...
+| # | Category | Result | Notes |
+|---|----------|--------|-------|
+| 1 | Reproducibility | Pass/Fail | |
+| 2 | Script structure | Pass/Fail | |
+| ... | | | |
+| 11 | Cross-language verification | Pass/Fail/N/A | |
-[Continue for all 11 categories]
+**Checklist: X/11 Pass** (adjust denominator for N/A categories)
-## Priority Fixes
+## Detailed Findings
-1. [Most important issue — what to fix first]
-2. [Second most important]
-3. [Third]
+### P0 — Blocker
-## Quality Score
+| # | File | Issue | Reviewer(s) | Confidence | Evidence |
+|---|------|-------|-------------|------------|----------|
+| 1 | path:line | [title + why_it_matters] | [reviewers] | 0.92 | [evidence] |
-| Metric | Value |
-|--------|-------|
-| **Score** | XX / 100 |
-| **Verdict** | Ship / Ship with notes / Revise / Revise (major) / Blocked |
+### P1 — Critical
+[same format, omit if empty]
-### Deductions
+### P2 — Major
+[same format, omit if empty]
+### P3 — Minor
+[same format, omit if empty]
+## Residual Risks
-| # | Issue | Tier | Deduction | Category |
-|---|-------|------|-----------|----------|
-| 1 | [description] | [tier] | -X | [category] |
-| | **Total deductions** | | **-XX** | |
+[Union of residual_risks from all reviewers — things that can't be verified from code alone]
+## Priority Fixes
+1. [Most impactful issue — what to fix first]
+2. [Second]
+3. [Third]
 ## Positive Observations
 [Things done well — important for morale and learning]
+## Review Metadata
+- Reviewers spawned: [N]
+- Findings before dedup: [N]
+- Findings after dedup: [N]
+- Findings suppressed (low confidence): [N]
+- Cross-reviewer agreements: [N]
 ```
+---
+## Confidence Filtering
+- Suppress findings below 0.60 confidence (exception: P0 at 0.50+)
+- Consolidate identical patterns: 5 instances of the same issue = 1 finding with count in evidence
+- Cross-reviewer agreement boosts confidence by +0.10 (capped at 1.0)
+- Never pad the report with low-confidence observations
+## Quality Scoring
+Apply numeric quality scoring using the shared framework and skill-specific rubric:
+- **Framework:** [`../shared/quality-scoring.md`](../shared/quality-scoring.md) — severity tiers, thresholds, verdict rules
+- **Rubric:** [`references/quality-rubric.md`](references/quality-rubric.md) — issue-to-deduction mappings for this skill
+Start at 100, deduct per issue found, apply verdict.
+---
 ## Council Mode (Optional)
 For complex codebases or high-stakes replication packages, run the code review across multiple LLM providers. Different models have different strengths: some excel at spotting statistical errors, others at code structure or reproducibility issues.
@@ -241,25 +242,17 @@ For complex codebases or high-stakes replication packages, run the code review a
 **How it works:**
 1. Each model independently scores all 11 categories against the same scripts
 2. Cross-review: models evaluate each other's findings — catching false positives and missed issues
-3. Chairman synthesis: produces a single `CODE-REVIEW-REPORT.md` with the union of confirmed findings
-**Invocation (CLI backend):**
-```bash
-cd packages/cli-council
-uv run python -m cli_council \
-    --prompt-file /tmp/code-review-prompt.txt \
-    --context-file /tmp/scripts-content.txt \
-    --output-md /tmp/code-review-council.md \
-    --chairman claude \
-    --timeout 180
-```
+3. Chairman synthesis: produces a single `reviews/code-review/YYYY-MM-DD_CODE-REVIEW-REPORT.md` with the union of confirmed findings
 See `skills/shared/council-protocol.md` for the full orchestration protocol.
-**Value:** Moderate to high — most valuable for domain correctness (Category 5) and cross-language verification (Category 11), where different models may catch different statistical or logical errors.
+---
 ## Cross-References
 - **`/code-archaeology`** — For understanding unfamiliar code before reviewing it
-- **Referee 2 agent** — For formal cross-language replication and verification (Category 11 flags the absence; Referee 2 does the actual replication)
+- **Referee 2 agent** — For formal cross-language replication and verification
 - **`/proofread`** — For the paper that accompanies this code
+- **`references/persona-catalog.md`** — Reviewer persona definitions and selection logic
+- **`references/findings-schema.json`** — JSON output contract for sub-agents
+- **`references/subagent-template.md`** — Prompt template for spawning reviewers
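
Note on Phase 4 of the new SKILL.md: the validate, deduplicate, and sort steps could be sketched roughly as below. This is a minimal sketch, not the package's implementation. The `Finding` class, its field names, and `merge_findings` are hypothetical; the real output contract lives in `references/findings-schema.json`, which is not shown in this part of the diff. The sketch also assumes that the ±3 line bucket is measured against the first finding seen at a location.

```python
from dataclasses import dataclass, field

SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


@dataclass
class Finding:
    """Hypothetical shape of one reviewer finding (field names are assumptions)."""
    file: str
    line: int
    title: str
    severity: str        # "P0" .. "P3"
    confidence: float    # 0.0 .. 1.0
    reviewer: str
    evidence: list = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and strip whitespace, as in the fingerprint definition."""
    return text.strip().lower()


def passes_confidence(f: Finding) -> bool:
    """Phase 4a: drop findings below 0.60, except P0 findings at 0.50 or above."""
    threshold = 0.50 if f.severity == "P0" else 0.60
    return f.confidence >= threshold


def same_location(merged: dict, f: Finding) -> bool:
    """Fingerprint match: same file, same title, lines at most 3 apart."""
    return (
        normalize(merged["file"]) == normalize(f.file)
        and normalize(merged["title"]) == normalize(f.title)
        and abs(merged["line"] - f.line) <= 3
    )


def merge_findings(findings: list) -> list:
    """Filter, deduplicate across reviewers, apply the agreement bonus, and sort."""
    merged = []
    for f in filter(passes_confidence, findings):
        match = next((m for m in merged if same_location(m, f)), None)
        if match is None:
            merged.append({
                "file": f.file, "line": f.line, "title": f.title,
                "severity": f.severity, "confidence": f.confidence,
                "reviewers": {f.reviewer}, "evidence": list(f.evidence),
            })
            continue
        # Phase 4b: keep the highest severity and confidence, union the evidence.
        if SEVERITY_RANK[f.severity] < SEVERITY_RANK[match["severity"]]:
            match["severity"] = f.severity
        match["confidence"] = max(match["confidence"], f.confidence)
        match["reviewers"].add(f.reviewer)
        match["evidence"].extend(e for e in f.evidence if e not in match["evidence"])

    for m in merged:
        # Cross-reviewer agreement bonus: +0.10, capped at 1.0.
        if len(m["reviewers"]) > 1:
            m["confidence"] = round(min(1.0, m["confidence"] + 0.10), 2)

    # Phase 4d: severity first, then confidence (descending), then file, then line.
    merged.sort(key=lambda m: (SEVERITY_RANK[m["severity"]], -m["confidence"],
                               m["file"], m["line"]))
    return merged
```

Applying the low-confidence filter before merging follows the 4a then 4b order in the diff. One consequence is that a finding suppressed for low confidence cannot contribute to an agreement bonus.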

package/skills/code-review/references/checklist-categories.md ADDED Viewed

@@ -0,0 +1,111 @@
+# 11 Checklist Categories
+Detailed specifications for Phase 2 baseline checklist.
+## 1. Reproducibility
+| Check | Pass Criteria |
+|-------|--------------|
+| Random seeds | `set.seed()` / `random.seed()` / `np.random.seed()` set before any stochastic operation |
+| Relative paths | No hardcoded absolute paths (e.g., `/Users/username/...` or `C:\...`) |
+| Working directory | Script does not `setwd()` / `os.chdir()` — uses project-relative paths |
+| Session info | Script prints session info at end (`sessionInfo()` / `sys.version`) or documents environment |
+| HPC SHA logging | If project has `hpc/*.sbatch`: every sbatch writes `git-sha.txt` + `git-status.txt` to `OUT_DIR` before `srun` (pins results to code version). Missing SHA log = P1 reproducibility deduction. See [`docs/guides/hpc.md`](../../docs/guides/hpc.md). |
+| HPC account/partition | If `*.sbatch` present: `--account=wbs` set, partition matches workload (compute/gpu/hmem/devel), no hardcoded user paths |
+## 2. Script Structure
+| Check | Pass Criteria |
+|-------|--------------|
+| Header | Script begins with comment block: purpose, author, date, inputs, outputs |
+| Sections | Code organised into labelled sections (comments or `# ---- Section ----`) |
+| Imports at top | All `library()` / `import` statements at the top of the file |
+| Reasonable length | Single script < 500 lines; longer scripts should be split |
+## 3. Output Hygiene
+| Check | Pass Criteria |
+|-------|--------------|
+| No print pollution | No stray `print()` / `cat()` / `message()` dumping to console |
+| Outputs saved | Key results saved to files, not just printed |
+| Clean console | Running the script does not produce walls of text |
+## 4. Function Quality
+| Check | Pass Criteria |
+|-------|--------------|
+| Documentation | Functions have comments explaining purpose, inputs, outputs |
+| Naming | Function names are descriptive verbs (`estimate_ate`, not `f1`) |
+| Defaults | Reasonable defaults for optional parameters |
+| No side effects | Functions don't modify global state |
+## 5. Domain Correctness
+| Check | Pass Criteria |
+|-------|--------------|
+| Estimator matches paper | The estimator used matches what the paper claims |
+| Weights | If weighted: weights sum to expected value, correct application |
+| Standard errors | Clustering / HC / bootstrap matches paper specification |
+| Sample restrictions | Filters match the paper's sample description |
+| Variable construction | Variables constructed as described in the paper |
+## 6. Figure Quality
+| Check | Pass Criteria |
+|-------|--------------|
+| Dimensions specified | Figure size set explicitly (not default) |
+| Transparency/resolution | Appropriate for publication (300+ DPI for raster, vector preferred) |
+| Saved to file | Figures saved with `ggsave()` / `plt.savefig()`, not just displayed |
+| Labels | Axes labelled, legend present where needed, title informative |
+| Colour | Colourblind-friendly palette; not relying on red/green distinction |
+## 7. Data Persistence
+| Check | Pass Criteria |
+|-------|--------------|
+| Intermediate objects saved | Expensive computations saved (`saveRDS()` / `pickle.dump()` / `.parquet`) |
+| Load before recompute | Script checks for saved objects before rerunning expensive operations |
+| Output format | Final outputs in portable format (CSV, parquet — not just `.RData`) |
+## 8. Dependencies
+| Check | Pass Criteria |
+|-------|--------------|
+| Declared at top | All `library()` / `import` at the start of the script |
+| Versions documented | `renv.lock` / `requirements.txt` / `pyproject.toml` exists |
+| No unnecessary packages | Each loaded package is actually used |
+| Installation instructions | README or comment explains how to set up the environment |
+## 9. Python-Specific
+*Score N/A if no Python files.*
+| Check | Pass Criteria |
+|-------|--------------|
+| Type hints | Functions have type annotations for parameters and return values |
+| Docstrings | Functions have docstrings (not just comments) |
+| uv usage | Uses `uv` for environment management (per project conventions) |
+| f-strings | Uses f-strings, not `.format()` or `%` formatting |
+## 10. R-Specific
+*Score N/A if no R files.*
+| Check | Pass Criteria |
+|-------|--------------|
+| tidyverse consistency | Doesn't mix base R and tidyverse for the same operation |
+| Assignment operator | Uses `<-` not `=` for assignment |
+| Boolean values | Uses `TRUE`/`FALSE`, not `T`/`F` |
+| Pipe consistency | Uses one pipe style consistently (`%>%` or `\|>`) |
+## 11. Cross-Language Verification
+*Score N/A if the project has no numerical results or only uses one language.*
+| Check | Pass Criteria |
+|-------|--------------|
+| Replication directory | `code/replication/` (or equivalent) exists with cross-language scripts |
+| Two-language coverage | Key numerical results reproduced in a second language |
+| Result comparison | Scripts compare outputs and report discrepancies (tolerance-based) |
+| Precision threshold | Numerical outputs compared to 6+ decimal places |
+| Documentation | README explains what is being replicated and acceptable tolerance |
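
Note on category 11 above: a tolerance-based comparison script might look something like the sketch below. The function name `compare_results`, the dictionary layout, and the default absolute tolerance of `1e-6` (one reading of "6+ decimal places") are assumptions for illustration; the package does not specify a comparison script in this diff.

```python
import math


def compare_results(primary: dict, replica: dict, abs_tol: float = 1e-6) -> list:
    """Return a list of discrepancy messages between two sets of named estimates.

    `primary` and `replica` map result names (e.g. "beta") to floats produced
    by the two language implementations. An empty list means they agree.
    """
    problems = []
    for name in sorted(set(primary) | set(replica)):
        if name not in primary or name not in replica:
            problems.append(f"{name}: present in only one implementation")
            continue
        a, b = primary[name], replica[name]
        # Absolute tolerance only, so the threshold reads as decimal places.
        if not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol):
            problems.append(f"{name}: {a:.8f} vs {b:.8f} (diff {abs(a - b):.2e})")
    return problems
```

In practice the two dictionaries would be loaded from files written by each language's script, and the tolerance would be whatever the project's README documents as acceptable.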