npm - @bohuyeshan/openagent-labforge-core - Versions diffs - 3.11.2 → 3.11.3 - Mend

@bohuyeshan/openagent-labforge-core 3.11.2 → 3.11.3


Files changed (226)

package/generated/skills-bundles/paper/skills/data-analysis/optimization/auto-claude__dse-loop/SKILL.md ADDED

@@ -0,0 +1,279 @@
+---
+name: "auto-claude/dse-loop"
+description: "Autonomous design space exploration loop for computer architecture and EDA. Runs a program, analyzes results, tunes parameters, and iterates until objective is met or timeout. Use when user says \"DSE\", \"design space exploration\", \"sweep parameters\", \"optimize\", \"find best config\", or wants iterative parameter tuning."
+argument-hint: ["task-description — include program","parameters","objective","and timeout"]
+allowed-tools: "Bash(*), Read, Grep, Glob, Write, Edit, Agent"
+metadata:
+  category: "data-analysis/optimization"
+---
+# DSE Loop: Autonomous Design Space Exploration
+Autonomously explore a design space: run → analyze → pick next parameters → repeat, until the objective is met or timeout is reached. Designed for computer architecture and EDA problems.
+## Context: $ARGUMENTS
+## Safety Rules — READ FIRST
+**NEVER do any of the following:**
+- `sudo` anything
+- `rm -rf`, `rm -r`, or any recursive deletion
+- `rm` any file you did not create in this session
+- Overwrite existing source files without reading them first
+- `git push`, `git reset --hard`, or any destructive git operation
+- Kill processes you did not start
+**If a step requires any of the above, STOP and report to the user.**
+## Constants (override via $ARGUMENTS)
+| Constant | Default | Description |
+|----------|---------|-------------|
+| `TIMEOUT` | 2h | Total wall-clock budget. Stop exploring after this. |
+| `MAX_ITERATIONS` | 50 | Hard cap on number of design points evaluated. |
+| `PATIENCE` | 10 | Stop early if no improvement for this many consecutive iterations. |
+| `OBJECTIVE` | minimize | `minimize` or `maximize` the target metric. |
+Override inline: `/dse-loop "task desc — timeout: 4h, max_iterations: 100, patience: 15"`
+## Typical Use Cases
+| Problem | Program | Parameters | Objective |
+|---------|---------|-----------|-----------|
+| Microarch DSE | gem5 simulation | cache size, assoc, pipeline width, ROB size, branch predictor | maximize IPC or minimize area×delay |
+| Synthesis tuning | yosys/DC script | optimization passes, target freq, effort level | minimize area at timing closure |
+| RTL parameterization | verilator sim | data width, FIFO depth, pipeline stages, buffer sizes | meet throughput target at min area |
+| Compiler flags | gcc/llvm build + benchmark | -O levels, unroll factor, vectorization, scheduling | minimize runtime or code size |
+| Placement/routing | openroad/innovus | utilization, aspect ratio, layer config | minimize wirelength / timing |
+| Formal verification | abc/sby | bound depth, engine, timeout per property | maximize coverage in time budget |
+| Memory subsystem | cacti / ramulator | bank count, row buffer policy, scheduling | optimize bandwidth/energy |
+## Workflow
+### Phase 0: Parse Task & Setup
+1. **Parse $ARGUMENTS** to extract:
+   - **Program**: what to run (command, script, or Makefile target)
+   - **Parameter space**: which knobs to tune and their ranges/options (may be incomplete — see step 2)
+   - **Objective metric**: what to optimize (and how to extract it from output)
+   - **Constraints**: hard limits that must not be violated (e.g., timing must close)
+   - **Timeout**: wall-clock budget
+   - **Success criteria**: when is the result "good enough" to stop early?
+2. **Infer missing parameter ranges** — If the user provides parameter names but NOT ranges/options, you MUST infer them before exploring:
+   a. **Read the source code** — search for the parameter names in the codebase:
+      - Look for argparse/click definitions, config files, Makefile variables, module parameters, `#define`, `parameter` (SystemVerilog), `localparam`, etc.
+      - Extract defaults, types, and any comments hinting at valid values
+   b. **Apply domain knowledge** to set reasonable ranges:
+      | Parameter type | Inference strategy |
+      |---------------|-------------------|
+      | Cache/memory sizes | Powers of 2, typically 1KB–16MB |
+      | Associativity | Powers of 2: 1, 2, 4, 8, 16 |
+      | Pipeline width / issue width | Small integers: 1, 2, 4, 8 |
+      | Buffer/queue/FIFO depth | Powers of 2: 4, 8, 16, 32, 64 |
+      | Clock period / frequency | Based on technology node; try ±50% from default |
+      | Bound depth (BMC/formal) | Geometric: 5, 10, 20, 50, 100 |
+      | Timeout values | Geometric: 10s, 30s, 60s, 120s, 300s |
+      | Boolean/enum flags | Enumerate all options found in source |
+      | Continuous (learning rate, threshold) | Log-scale sweep: 5 points spanning 2 orders of magnitude around default |
+      | Integer counts (threads, cores) | Linear: from 1 to hardware max |
+   c. **Start conservative** — begin with 3-5 values per parameter. Expand range later if the best result is at a boundary.
+   d. **Log inferred ranges** — write the inferred parameter space to `dse_results/inferred_params.md` so the user can review:
+      ```markdown
+      # Inferred Parameter Space
+      | Parameter | Source | Default | Inferred Range | Reasoning |
+      |-----------|--------|---------|---------------|-----------|
+      | CACHE_SIZE | config.py:42 | 32768 | [8192, 16384, 32768, 65536, 131072] | powers of 2, ±2x from default |
+      | ASSOC | config.py:43 | 4 | [1, 2, 4, 8] | standard associativities |
+      | BMC_DEPTH | run_bmc.py:15 | 10 | [5, 10, 20, 50] | geometric, common BMC depths |
+      ```
+   e. **Boundary expansion** — during the search, if the best result is at the min or max of a range, automatically extend that range by one step in that direction (but log the extension).
+3. **Read the project** to understand:
+   - How to run the program
+   - Where results are produced (stdout, log files, reports)
+   - How to parse the objective metric from output
+   - Current/baseline configuration (if any)
+4. **Create working directory**: `dse_results/` in project root
+   - `dse_results/dse_log.csv` — one row per design point
+   - `dse_results/DSE_REPORT.md` — final report
+   - `dse_results/DSE_STATE.json` — state for recovery
+   - `dse_results/inferred_params.md` — inferred parameter space (if ranges were not provided)
+   - `dse_results/configs/` — config files for each run
+   - `dse_results/outputs/` — raw output for each run
+5. **Write a metric extraction script** (`dse_results/parse_result.py` or similar) that takes a run's output and returns the objective metric as a number. Test it on a baseline run first.
+6. **Run baseline** (iteration 0): run the program with default/current parameters. Record the baseline metric. This is the point to beat.
+### Phase 1: Initial Exploration
+**Goal**: Quickly survey the space to understand which parameters matter most.
+**Strategy**: Latin Hypercube Sampling or structured sweep of key parameters.
+1. Pick 5-10 diverse design points that span the parameter ranges
+2. Run them: in parallel as background processes if the runs are independent, otherwise sequentially
+3. Record all results in `dse_log.csv`:
+   ```
+   iteration,param1,param2,...,metric,constraint_met,timestamp,notes
+   0,default,default,...,baseline_val,yes,2026-03-13T10:00:00,baseline
+   1,val1a,val2a,...,result1,yes,2026-03-13T10:05:00,initial sweep
+   ...
+   ```
+4. Analyze: which parameters have the most impact on the objective?
+5. Narrow the search to the most sensitive parameters
+### Phase 2: Directed Search
+**Goal**: Converge toward the optimum by making informed choices.
+**Strategy**: Adaptive — pick the approach that fits the problem:
+- **Few parameters (≤3)**: Fine-grained grid search around the best region from Phase 1
+- **Many parameters (>3)**: Coordinate descent — optimize one parameter at a time, holding others at current best
+- **Binary/categorical params**: Enumerate promising combinations
+- **Continuous params**: Binary search or golden section between best neighbors
+- **Multi-objective**: Track Pareto frontier, explore along the front
+For each iteration:
+1. **Select next design point** based on results so far:
+   - Look at the trend: which direction improves the metric?
+   - Avoid re-running configurations already evaluated
+   - Balance exploration (untested regions) vs exploitation (near current best)
+2. **Modify parameters**: edit config file, command-line args, or source constants
+3. **Run the program**: execute and capture output
+4. **Parse results**: extract the objective metric and check constraints
+5. **Log to `dse_log.csv`**: append the new row
+6. **Check stopping conditions**:
+   - Timeout reached? → stop
+   - Max iterations reached? → stop
+   - Patience exhausted (no improvement in N iterations)? → stop
+   - Success criteria met (metric is "good enough")? → stop
+   - Constraint violation pattern detected? → adjust search bounds
+7. **Update `DSE_STATE.json`**:
+   ```json
+   {
+     "iteration": 15,
+     "status": "in_progress",
+     "best_metric": 1.23,
+     "best_params": {"cache_size": 32768, "assoc": 4, "pipeline_width": 2},
+     "total_iterations": 15,
+     "start_time": "2026-03-13T10:00:00",
+     "timeout": "2h",
+     "patience_counter": 3
+   }
+   ```
+8. **Decide next step** → back to step 1
+### Phase 3: Refinement (if time allows)
+If the search converged and there's still time budget:
+1. **Local perturbation**: try ±1 step on each parameter from the best point
+2. **Sensitivity analysis**: which parameters can be relaxed without hurting the metric?
+3. **Constraint boundary**: if a constraint is nearly binding, explore near-feasible points
+### Phase 4: Report
+Write `dse_results/DSE_REPORT.md`:
+```markdown
+# Design Space Exploration Report
+**Task**: [description]
+**Date**: [start] → [end]
+**Total iterations**: N
+**Wall-clock time**: X hours Y minutes
+## Objective
+- **Metric**: [what was optimized]
+- **Direction**: minimize / maximize
+- **Baseline**: [value]
+- **Best found**: [value] ([improvement]% better than baseline)
+## Best Configuration
+| Parameter | Baseline | Best |
+|-----------|----------|------|
+| param1    | default  | best_val |
+| param2    | default  | best_val |
+| ...       | ...      | ... |
+## Search Trajectory
+| Iteration | param1 | param2 | ... | Metric | Notes |
+|-----------|--------|--------|-----|--------|-------|
+| 0 (baseline) | ... | ... | ... | ... | baseline |
+| 1 | ... | ... | ... | ... | initial sweep |
+| ... | ... | ... | ... | ... | ... |
+| N (best) | ... | ... | ... | ... | ★ best |
+## Parameter Sensitivity
+- **param1**: [high/medium/low impact] — [brief explanation]
+- **param2**: [high/medium/low impact] — [brief explanation]
+## Pareto Frontier (if multi-objective)
+[Table or description of non-dominated points]
+## Stopping Reason
+[timeout / max_iterations / patience / success_criteria_met]
+## Recommendations
+- [actionable insights from the exploration]
+- [which parameters matter most]
+- [suggested follow-up explorations]
+```
+Also generate a summary plot if matplotlib is available:
+- Convergence curve (metric vs iteration)
+- Parameter sensitivity bar chart
+- Pareto frontier scatter (if multi-objective)
+## State Recovery
+If the context window compacts mid-run, the loop recovers from `DSE_STATE.json` + `dse_log.csv`:
+1. Read `DSE_STATE.json` for current iteration, best params, patience counter
+2. Read `dse_log.csv` for full history
+3. Resume from next iteration
+## Key Rules
+- Work AUTONOMOUSLY — do not ask the user for permission at each iteration
+- **Every run must be logged** — even failed runs, constraint violations, errors. The log is the ground truth.
+- **Never re-run an identical configuration** — check `dse_log.csv` before each run
+- **Respect the timeout** — check elapsed time before starting a new iteration. If the next run is likely to exceed the timeout, stop and report.
+- **Parse metrics programmatically** — write a parsing script, don't eyeball logs
+- **Keep raw outputs** — save each run's full output in `dse_results/outputs/iter_N/`
+- **Constraint violations are not improvements** — a design point that violates constraints is never "best", regardless of the metric
+- If a run crashes, log the error, skip that point, and continue with the next
+- If the same crash repeats 3 times with different configs, stop and report the issue
+## Example Invocations
+```
+# Minimal — just name the parameters, let the agent figure out ranges
+/dse-loop "Run gem5 mcf benchmark. Tune: L1D_SIZE, L2_SIZE, ROB_ENTRIES. Objective: maximize IPC. Timeout: 3h"
+# Partial — some ranges given, some not
+/dse-loop "Run make synth. Tune: CLOCK_PERIOD [5ns, 4ns, 3ns, 2ns], FLATTEN, ABC_SCRIPT. Objective: minimize area at timing closure. Timeout: 1h"
+# Fully specified — explicit ranges for everything
+/dse-loop "Simulate processor with FIFO_DEPTH [4,8,16,32], ISSUE_WIDTH [1,2,4], PREFETCH [on,off]. Run: make sim. Objective: max throughput/area. Timeout: 2h"
+# Real-world: PDAG-SFA formal verification tuning
+/dse-loop "Run python run_bmc.py. Tune: BMC_DEPTH, ENGINE, TIMEOUT_PER_PROP. Objective: maximize properties proved. Timeout: 2h"
+```
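The stopping conditions and the "never re-run an identical configuration" rule from the dse-loop skill above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the package: the function names (`already_evaluated`, `stop_reason`, `record_result`) are hypothetical, `start_time` is held as epoch seconds rather than the ISO string shown in `DSE_STATE.json`, and `est_run_s` is an assumed estimate of how long the next run will take.

```python
def already_evaluated(log_rows, params):
    """True if a logged row matches every parameter value (CSV values are strings)."""
    return any(
        all(row.get(key) == str(value) for key, value in params.items())
        for row in log_rows
    )


def stop_reason(state, now, timeout_s, max_iterations, patience, est_run_s=0.0):
    """Return the first stopping condition that holds, or None to keep exploring."""
    # Assumption: start_time and now are epoch seconds.
    if now - state["start_time"] + est_run_s >= timeout_s:
        return "timeout"
    if state["iteration"] >= max_iterations:
        return "max_iterations"
    if state["patience_counter"] >= patience:
        return "patience"
    return None


def record_result(state, params, metric, constraint_met, objective="minimize"):
    """Update best-so-far and the patience counter after one run."""
    state["iteration"] += 1
    best = state["best_metric"]
    # A point that violates constraints is never treated as an improvement.
    improved = constraint_met and (
        best is None
        or (metric < best if objective == "minimize" else metric > best)
    )
    if improved:
        state["best_metric"] = metric
        state["best_params"] = dict(params)
        state["patience_counter"] = 0
    else:
        state["patience_counter"] += 1
    return improved
```

Rows read with `csv.DictReader` from `dse_log.csv` can be passed directly as `log_rows`, since the comparison converts each candidate value to a string first. The success-criteria check is left out because its form depends on the task.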

package/generated/skills-bundles/paper/skills/data-analysis/statistics/auto-claude__analyze-results/SKILL.md ADDED

@@ -0,0 +1,47 @@
+---
+name: "auto-claude/analyze-results"
+description: "Analyze ML experiment results, compute statistics, generate comparison tables and insights. Use when user says \"analyze results\", \"compare\", or needs to interpret experimental data."
+argument-hint: ["results-path-or-description"]
+allowed-tools: "Bash(*), Read, Grep, Glob, Write, Edit, Agent"
+metadata:
+  category: "data-analysis/statistics"
+---
+# Analyze Experiment Results
+Analyze: $ARGUMENTS
+## Workflow
+### Step 1: Locate Results
+Find all relevant JSON/CSV result files:
+- Check `figures/`, `results/`, or project-specific output directories
+- Parse JSON results into structured data
+### Step 2: Build Comparison Table
+Organize results by:
+- **Independent variables**: model type, hyperparameters, data config
+- **Dependent variables**: primary metric (e.g., perplexity, accuracy, loss), secondary metrics
+- **Delta vs baseline**: always compute relative improvement
+### Step 3: Statistical Analysis
+- If multiple seeds: report mean ± std and check reproducibility
+- If sweeping a parameter: identify trends (monotonic, U-shaped, plateau)
+- Flag outliers or suspicious results
+### Step 4: Generate Insights
+For each finding, structure as:
+1. **Observation**: what the data shows (with numbers)
+2. **Interpretation**: why this might be happening
+3. **Implication**: what this means for the research question
+4. **Next step**: what experiment would test the interpretation
+### Step 5: Update Documentation
+If findings are significant:
+- Propose updates to project notes or experiment reports
+- Draft a concise finding statement (1-2 sentences)
+## Output Format
+Always include:
+1. Raw data table
+2. Key findings (numbered, concise)
+3. Suggested next experiments (if any)
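Steps 2 and 3 of the analyze-results skill above (delta vs baseline, mean ± std across seeds) come down to a small amount of arithmetic. The following is a minimal sketch using only the standard library; the helper names `summarize` and `relative_delta` and the numbers in the usage lines are hypothetical placeholders, not results from any experiment.

```python
from statistics import mean, stdev


def summarize(runs):
    """Map {method: [metric per seed]} to {method: (mean, sample std)}.

    A single seed has no spread to estimate, so its std is reported as 0.0.
    """
    return {
        method: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for method, values in runs.items()
    }


def relative_delta(value, baseline):
    """Relative change versus the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Placeholder numbers purely for illustration.
runs = {"baseline": [80.0, 82.0, 81.0], "ours": [85.0, 86.0, 87.0]}
stats = summarize(runs)
base_mean = stats["baseline"][0]
for method, (m, s) in stats.items():
    print(f"| {method} | {m:.2f} ± {s:.2f} | {relative_delta(m, base_mean):+.1f}% |")
```

`stdev` is the sample standard deviation (n − 1 denominator), which is the usual choice for a handful of seeds; `pstdev` would be the population variant.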

package/generated/skills-bundles/paper/skills/data-analysis/visualization/auto-claude__paper-figure/SKILL.md ADDED

@@ -0,0 +1,281 @@
+---
+name: "auto-claude/paper-figure"
+description: "Generate publication-quality figures and tables from experiment results. Use when user says \"画图\", \"作图\", \"generate figures\", \"paper figures\", or needs plots for a paper."
+argument-hint: ["figure-plan-or-data-path"]
+allowed-tools: "Bash(*), Read, Write, Edit, Grep, Glob, Agent, mcp__codex__codex, mcp__codex__codex-reply"
+metadata:
+  category: "data-analysis/visualization"
+---
+# Paper Figure: Publication-Quality Plots from Experiment Data
+Generate all figures and tables for a paper based on: **$ARGUMENTS**
+## Scope: What This Skill Can and Cannot Do
+| Category | Can auto-generate? | Examples |
+|----------|-------------------|----------|
+| **Data-driven plots** | ✅ Yes | Line plots (training curves), bar charts (method comparison), scatter plots, heatmaps, box/violin plots |
+| **Comparison tables** | ✅ Yes | LaTeX tables comparing prior bounds, method features, ablation results |
+| **Multi-panel figures** | ✅ Yes | Subfigure grids combining multiple plots (e.g., 3×3 dataset × method) |
+| **Architecture/pipeline diagrams** | ❌ No — manual | Model architecture, data flow diagrams, system overviews. At best can generate a rough TikZ skeleton, but **expect to draw these yourself** using tools like draw.io, Figma, or TikZ |
+| **Generated image grids** | ❌ No — manual | Grids of generated samples (e.g., GAN/diffusion outputs). These come from running your model, not from this skill |
+| **Photographs / screenshots** | ❌ No — manual | Real-world images, UI screenshots, qualitative examples |
+**In practice:** For a typical ML paper, this skill handles ~60% of figures (all data plots + tables). The remaining ~40% (hero figure, architecture diagram, qualitative results) need to be created manually and placed in `figures/` before running `/paper-write`. The skill will detect these as "existing figures" and preserve them.
+## Constants
+- **STYLE = `publication`** — Visual style preset. Options: `publication` (default, clean for print), `poster` (larger fonts), `slide` (bold colors)
+- **DPI = 300** — Output resolution
+- **FORMAT = `pdf`** — Output format. Options: `pdf` (vector, best for LaTeX), `png` (raster fallback)
+- **COLOR_PALETTE = `tab10`** — Default matplotlib color cycle. Options: `tab10`, `Set2`, `colorblind` (deuteranopia-safe)
+- **FONT_SIZE = 10** — Base font size (matches typical conference body text)
+- **FIG_DIR = `figures/`** — Output directory for generated figures
+- **REVIEWER_MODEL = `gpt-5.4`** — Model used via Codex MCP for figure quality review.
+## Inputs
+1. **PAPER_PLAN.md** — figure plan table (from `/paper-plan`)
+2. **Experiment data** — JSON files, CSV files, or screen logs in `figures/` or project root
+3. **Existing figures** — any manually created figures to preserve
+If no PAPER_PLAN.md exists, scan for data files and ask the user which figures to generate.
+## Workflow
+### Step 1: Read Figure Plan
+Parse the Figure Plan table from PAPER_PLAN.md:
+```markdown
+| ID | Type | Description | Data Source | Priority |
+|----|------|-------------|-------------|----------|
+| Fig 1 | Architecture | ... | manual | HIGH |
+| Fig 2 | Line plot | ... | figures/exp.json | HIGH |
+```
+Identify:
+- Which figures can be auto-generated from data
+- Which need manual creation (architecture diagrams, etc.)
+- Which are comparison tables (generate as LaTeX)
+### Step 2: Set Up Plotting Environment
+Create a shared style configuration script:
+```python
+# paper_plot_style.py — shared across all figure scripts
+import matplotlib
+import matplotlib.pyplot as plt
+# Values from the Constants section; the rcParams below and save_fig depend on them
+FONT_SIZE = 10
+DPI = 300
+FORMAT = 'pdf'
+FIG_DIR = 'figures'
+matplotlib.rcParams.update({
+    'font.size': FONT_SIZE,
+    'font.family': 'serif',
+    'font.serif': ['Times New Roman', 'Times', 'DejaVu Serif'],
+    'axes.labelsize': FONT_SIZE,
+    'axes.titlesize': FONT_SIZE + 1,
+    'xtick.labelsize': FONT_SIZE - 1,
+    'ytick.labelsize': FONT_SIZE - 1,
+    'legend.fontsize': FONT_SIZE - 1,
+    'figure.dpi': DPI,
+    'savefig.dpi': DPI,
+    'savefig.bbox': 'tight',
+    'savefig.pad_inches': 0.05,
+    'axes.grid': False,
+    'axes.spines.top': False,
+    'axes.spines.right': False,
+    'text.usetex': False,  # set True if LaTeX is available
+    'mathtext.fontset': 'stix',
+})
+# Color palette
+COLORS = plt.cm.tab10.colors  # or Set2, or colorblind-safe
+def save_fig(fig, name, fmt=FORMAT):
+    """Save figure to FIG_DIR with consistent naming."""
+    path = f'{FIG_DIR}/{name}.{fmt}'
+    fig.savefig(path)
+    print(f'Saved: {path}')
+### Step 3: Auto-Select Figure Type
+Use this decision tree for data-driven figures (inspired by Imbad0202/academic-research-skills):
+| Data Pattern | Recommended Type | Size |
+|-------------|-----------------|------|
+| X=time/steps, Y=metric | Line plot | 0.48\textwidth |
+| Methods × 1 metric | Bar chart | 0.48\textwidth |
+| Methods × multiple metrics | Grouped bar / radar | 0.95\textwidth |
+| Two continuous variables | Scatter plot | 0.48\textwidth |
+| Matrix / grid values | Heatmap | 0.48\textwidth |
+| Distribution comparison | Box/violin plot | 0.48\textwidth |
+| Multi-dataset results | Multi-panel (subfigure) | 0.95\textwidth |
+| Prior work comparison | LaTeX table | — |
+### Step 4: Generate Each Figure
+For each figure in the plan, create a standalone Python script:
+**Line plots** (training curves, scaling):
+```python
+# gen_fig2_training_curves.py
+from paper_plot_style import *
+import json
+with open('figures/exp_results.json') as f:
+    data = json.load(f)
+fig, ax = plt.subplots(1, 1, figsize=(5, 3.5))
+ax.plot(data['steps'], data['fac_loss'], label='Factorized', color=COLORS[0])
+ax.plot(data['steps'], data['crf_loss'], label='CRF-LR', color=COLORS[1])
+ax.set_xlabel('Training Steps')
+ax.set_ylabel('Cross-Entropy Loss')
+ax.legend(frameon=False)
+save_fig(fig, 'fig2_training_curves')
+```
+**Bar charts** (comparison, ablation):
+```python
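+# gen_fig3_comparison.py
+from paper_plot_style import *
+fig, ax = plt.subplots(1, 1, figsize=(5, 3))
+# Placeholder values for illustration; load real results from JSON/CSV (see Key Rules)
+methods = ['Baseline', 'Method A', 'Method B', 'Ours']
+values = [82.3, 85.1, 86.7, 89.2]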
+bars = ax.bar(methods, values, color=[COLORS[i] for i in range(len(methods))])
+ax.set_ylabel('Accuracy (%)')
+# Add value labels on bars
+for bar, val in zip(bars, values):
+    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
+            f'{val:.1f}', ha='center', va='bottom', fontsize=FONT_SIZE-1)
+save_fig(fig, 'fig3_comparison')
+```
+**Comparison tables** (LaTeX, for theory papers):
+```latex
+\begin{table}[t]
+\centering
+\caption{Comparison of estimation error bounds. $n$: sample size, $D$: ambient dim, $d$: latent dim, $K$: subspaces, $n_k$: modes.}
+\label{tab:bounds}
+\begin{tabular}{lccc}
+\toprule
+Method & Rate & Depends on $D$? & Multi-modal? \\
+\midrule
+\citet{MinimaxOkoAS23} & $n^{-s'/D}$ & Yes (curse) & No \\
+\citet{ScoreMatchingdistributionrecovery} & $n^{-2/d}$ & No & No \\
+\textbf{Ours} & $\sqrt{\sum n_k d_k / n}$ & No & Yes \\
+\bottomrule
+\end{tabular}
+\end{table}
+```
+**Architecture/pipeline diagrams** (MANUAL — outside this skill's scope):
+- These require manual creation using draw.io, Figma, Keynote, or TikZ
+- This skill can generate a rough TikZ skeleton as a starting point, but **do not expect publication-quality results**
+- If the figure already exists in `figures/`, preserve it and generate only the LaTeX `\includegraphics` snippet
+- Flag as `[MANUAL]` in the figure plan and `latex_includes.tex`
+### Step 5: Run All Scripts
+```bash
+# Run all figure generation scripts from the project root (scripts live in figures/)
+for script in figures/gen_fig*.py; do
+    python "$script"
+done
+```
+Verify all output files exist and are non-empty.
+### Step 6: Generate LaTeX Include Snippets
+For each figure, output the LaTeX code to include it:
+```latex
+% === Fig 2: Training Curves ===
+\begin{figure}[t]
+    \centering
+    \includegraphics[width=0.48\textwidth]{figures/fig2_training_curves.pdf}
+    \caption{Training curves comparing factorized and CRF-LR denoising.}
+    \label{fig:training_curves}
+\end{figure}
+```
+Save all snippets to `figures/latex_includes.tex` for easy copy-paste into the paper.
+### Step 7: Figure Quality Review with REVIEWER_MODEL
+Send figure descriptions and captions to REVIEWER_MODEL (default `gpt-5.4`) via Codex MCP for review:
+```
+mcp__codex__codex:
+  model: gpt-5.4
+  config: {"model_reasoning_effort": "xhigh"}
+  prompt: |
+    Review these figure/table plans for a [VENUE] submission.
+    For each figure:
+    1. Is the caption informative and self-contained?
+    2. Does the figure type match the data being shown?
+    3. Is the comparison fair and clear?
+    4. Any missing baselines or ablations?
+    5. Would a different visualization be more effective?
+    [list all figures with captions and descriptions]
+```
+### Step 8: Quality Checklist
+Before finishing, verify each figure (from pedrohcgs/claude-code-my-workflow):
+- [ ] Font size readable at printed paper size (not too small)
+- [ ] Colors distinguishable in grayscale (print-friendly)
+- [ ] **No title inside figures** — titles go only in LaTeX `\caption{}` (from pedrohcgs)
+- [ ] Legend does not overlap data
+- [ ] Axis labels have units where applicable
+- [ ] Axis labels are publication-quality (not variable names like `emp_rate`)
+- [ ] Figure width fits single column (0.48\textwidth) or full width (0.95\textwidth)
+- [ ] PDF output is vector (not rasterized text)
+- [ ] No matplotlib default title (remove `plt.title` for publications)
+- [ ] Serif font matches paper body text (Times / Computer Modern)
+- [ ] Colorblind-accessible (if using colorblind palette)
+## Output
+```
+figures/
+├── paper_plot_style.py          # shared style config
+├── gen_fig1_architecture.py     # per-figure scripts
+├── gen_fig2_training_curves.py
+├── gen_fig3_comparison.py
+├── fig1_architecture.pdf        # generated figures
+├── fig2_training_curves.pdf
+├── fig3_comparison.pdf
+├── latex_includes.tex           # LaTeX snippets for all figures
+└── TABLE_*.tex                  # standalone table LaTeX files
+```
+## Key Rules
+- **Every figure must be reproducible** — save the generation script alongside the output
+- **Do NOT hardcode data** — always read from JSON/CSV files
+- **Use vector format (PDF)** for all plots — PNG only as fallback
+- **No decorative elements** — no background colors, no 3D effects, no chart junk
+- **Consistent style across all figures** — same fonts, colors, line widths
+- **Colorblind-safe** — verify with https://davidmathlogic.com/colorblind/ if needed
+- **One script per figure** — easy to re-run individual figures when data changes
+- **No titles inside figures** — captions are in LaTeX only
+- **Comparison tables count as figures** — generate them as standalone .tex files
+## Figure Type Reference
+| Type | When to Use | Typical Size |
+|------|------------|--------------|
+| Line plot | Training curves, scaling trends | 0.48\textwidth |
+| Bar chart | Method comparison, ablation | 0.48\textwidth |
+| Grouped bar | Multi-metric comparison | 0.95\textwidth |
+| Scatter plot | Correlation analysis | 0.48\textwidth |
+| Heatmap | Attention, confusion matrix | 0.48\textwidth |
+| Box/violin | Distribution comparison | 0.48\textwidth |
+| Architecture | System overview | 0.95\textwidth |
+| Multi-panel | Combined results (subfigures) | 0.95\textwidth |
+| Comparison table | Prior bounds vs. ours (theory) | full width |
+## Acknowledgements
+Design pattern (type × style matrix) inspired by [baoyu-skills](https://github.com/jimliu/baoyu-skills). Publication style defaults and figure rules from [pedrohcgs/claude-code-my-workflow](https://github.com/pedrohcgs/claude-code-my-workflow). Visualization decision tree from [Imbad0202/academic-research-skills](https://github.com/Imbad0202/academic-research-skills).
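Step 5 of the paper-figure skill above ends with "Verify all output files exist and are non-empty". One way that check might look is sketched below; the helper name `missing_or_empty` is hypothetical, and the demo uses a temporary directory with made-up file names rather than a real `figures/` folder.

```python
import tempfile
from pathlib import Path


def missing_or_empty(fig_dir, names, fmt="pdf"):
    """Return the expected figure paths that are absent or zero bytes."""
    bad = []
    for name in names:
        path = Path(fig_dir) / f"{name}.{fmt}"
        if not path.is_file() or path.stat().st_size == 0:
            bad.append(path)
    return bad


# Demo in a throwaway directory: one good file, one empty file, one missing file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fig2_training_curves.pdf").write_bytes(b"%PDF-1.4 placeholder")
    (Path(tmp) / "fig3_comparison.pdf").write_bytes(b"")
    expected = ["fig2_training_curves", "fig3_comparison", "fig4_ablation"]
    for path in missing_or_empty(tmp, expected):
        print(f"needs regeneration: {path.name}")
```

A size check only catches scripts that failed before writing anything; it does not confirm that the PDF is valid or that the text is vector rather than rasterized.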