@research-copilot/plugin 1.1.16 → 1.1.18

Files changed (799)

package/dist/skills/benchmark-paper-template/references/checklist.md ADDED

@@ -0,0 +1,113 @@
+# Pre-submission Checklist
+## Contents
+- [1. Introduction Checklist](#1-introduction-checklist)
+- [2. Benchmark Section Checklist](#2-benchmark-section-checklist)
+- [3. Experiment Section Checklist](#3-experiment-section-checklist)
+- [4. Structure & Completeness Checklist](#4-structure--completeness-checklist)
+- [5. Venue-Specific Checks](#5-venue-specific-checks)
+- [Severity Classification](#severity-classification)
+- [After the Checklist](#after-the-checklist)
+The final quality gate before submission. A single missing element (a forgotten pipeline figure, absent Finding summaries, or no benchmark comparison table) can cost a paper its acceptance at top venues.
+## 1. Introduction Checklist
+- [ ] **Running Example (Figure 1)**: A carefully designed figure illustrating the evaluation gap with a concrete example?
+- [ ] **Evaluation Gap**: Key blind spots of existing benchmarks clearly stated (≤3 points, specific not vague)?
+- [ ] **Benchmark Comparison (Table 1)**: Comparison with ≥5 existing benchmarks across key dimensions?
+- [ ] **Research Questions**: 2-3 RQs explicitly stated?
+- [ ] **Design Considerations**: Design rationale and key choices explained (WHY, not just WHAT)?
+- [ ] **Contributions**: 2-4 contributions listed, each aligned with an RQ and a paper section?
+- [ ] **Narrative Flow**: Background → Gap → Our Solution → Preview Findings → Contributions?
+- [ ] **Concrete Numbers**: Scale, model count, and key metrics mentioned in the Introduction?
+## 2. Benchmark Section Checklist
+- [ ] **Design Goals**: G1-G4 (or custom goals) explicitly stated with strategies?
+- [ ] **Task Scope**: Clear boundary of what IS and IS NOT evaluated, with reasons for exclusions?
+- [ ] **Pipeline Figure (Figure 2)**: A clear construction pipeline flowchart with steps, data counts, and quality gates?
+- [ ] **Pipeline Steps**: Every step has explicit input, operation, and output?
+- [ ] **Quality Control**: QC strategy described (inter-annotator agreement, expert review, validation scores)?
+- [ ] **Dataset Statistics (Table 2)**: Comprehensive statistics presented (total count, per-split, per-category)?
+- [ ] **Distribution Figures**: Taxonomy distribution, difficulty distribution, length distribution visualized?
+- [ ] **Data Examples**: 2-3 representative examples shown as a figure?
+- [ ] **Taxonomy Coverage**: Data distribution covers all taxonomy categories with reasonable balance?
+## 3. Experiment Section Checklist
+- [ ] **Model Coverage**: Baselines span open/closed source, multiple scales, different architectures (≥10 models)?
+- [ ] **Overall Performance Table (Table 3)**: Comprehensive table with all models × all metrics, best/second-best marked?
+- [ ] **Fine-grained Analysis**: Multi-dimensional analysis beyond overall scores, organized by RQ?
+- [ ] **Error Taxonomy**: Analysis of WHAT types of mistakes models make (not just accuracy numbers)?
+- [ ] **Human vs. Model**: Human performance baseline included? (Mark N/A with reason if genuinely not applicable)
+- [ ] **Case Studies**: 2-4 qualitative examples showing concrete model success/failure?
+- [ ] **Finding X Summaries**: Key insights extracted and bold-highlighted after each major analysis?
+- [ ] **Research Opportunities**: Future directions discussed, grounded in the Findings?
+## 4. Structure & Completeness Checklist
+- [ ] **Logical Thread**: Paper follows Gap → Benchmark → Evaluation → Insights → Opportunities?
+- [ ] **Open Source**: Code and data released (or will be)? Link in Abstract or Contributions?
+- [ ] **Data Hosting**: Dataset hosted on a dedicated platform (Hugging Face, Kaggle, OpenML, Dataverse), accessible without emailing authors?
+- [ ] **License**: Explicit license specified for both code and data?
+- [ ] **Limitations**: Honest discussion of benchmark limitations and scope restrictions?
+- [ ] **Ethics**: Discussion of potential harms, biases, PII exposure, informed consent, and responsible-use guidelines?
+- [ ] **Datasheet / Data Card**: Structured documentation of collection process, intended use, and limitations included (in appendix or supplementary)?
+- [ ] **Related Work**: Systematic comparison with existing work, including a comparison table?
+- [ ] **Appendix**: Supplementary material includes full results, annotation guidelines, more examples, prompts used?
+- [ ] **Reproducibility**: Enough detail for another team to reproduce construction and experiments? Executable evaluation script provided?
+- [ ] **Statistical Rigor**: Multiple evaluation runs reported with mean ± std? Statistical significance tests where appropriate?
+- [ ] **Contamination Mitigation**: Strategy to prevent training data leakage documented?
+- [ ] **Maintenance Plan**: Who will maintain the benchmark? Feedback channels? Update/retirement criteria?
+- [ ] **Page Budget**: Paper fits within the venue's page limit (main body)?
+- [ ] **References**: All cited works verified and formatted per venue style?
+## 5. Venue-Specific Checks
+### NeurIPS (Evaluations & Datasets track, renamed from D&B in 2026)
+- [ ] Evaluations & Datasets track selected (double-blind by default since 2026)?
+- [ ] NeurIPS paper checklist completed?
+- [ ] **Croissant metadata file** included and validated? (Mandatory since 2025; a missing or invalid Croissant file is grounds for desk rejection)
+- [ ] Data and code submitted in final form alongside the paper (NOT supplementary material since 2026)?
+- [ ] Supplementary material within size limits?
+- [ ] Anonymous submission (no author-identifying information)?
+### ICLR / ICML
+- [ ] Reproducibility form / checklist completed?
+- [ ] Supplementary material within size limits?
+- [ ] Anonymous submission?
+### ACL / EMNLP / NAACL
+- [ ] Limitations section present (mandatory)?
+- [ ] Ethics statement present?
+- [ ] ARR submission format followed?
+## Severity Classification
+After checking all items, classify issues:
+| Severity | Definition | Action |
+|----------|-----------|--------|
+| **Critical** ✗✗ | Missing core element (no pipeline figure, no Finding summaries, no comparison table) | MUST fix before submission |
+| **Major** ✗ | Present but insufficient (too few baselines, weak QC description, vague gap) | Strongly recommended to fix |
+| **Minor** ~ | Polish-level (figure aesthetics, word choice, appendix order) | Fix if time permits |
+Present a summary:
+```
+Pre-submission Verdict: [READY / NOT READY]
+Critical issues: [count], [list]
+Major issues:    [count], [list]
+Minor issues:    [count], [list]
+```
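The classification and summary above can be produced mechanically once issues are recorded. The following is a minimal sketch, not part of the package: the `Issue` class and `format_verdict` function are hypothetical names, and the rule that any critical issue yields NOT READY is an assumption drawn from the "MUST fix before submission" action in the table.

```python
from dataclasses import dataclass

SEVERITIES = ("critical", "major", "minor")  # labels from the severity table


@dataclass(frozen=True)
class Issue:
    severity: str      # one of SEVERITIES
    description: str   # free-text note, e.g. "no pipeline figure"


def format_verdict(issues):
    """Render the summary block. Assumption: any critical issue means NOT READY."""
    for issue in issues:
        if issue.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {issue.severity!r}")
    grouped = {s: [i.description for i in issues if i.severity == s] for s in SEVERITIES}
    verdict = "NOT READY" if grouped["critical"] else "READY"
    lines = [f"Pre-submission Verdict: {verdict}"]
    for s in SEVERITIES:
        label = f"{s.capitalize()} issues:"
        items = "; ".join(grouped[s]) or "none"
        # Pad the label so the counts line up as in the template above.
        lines.append(f"{label:<17}{len(grouped[s])}, {items}")
    return "\n".join(lines)


print(format_verdict([Issue("critical", "no pipeline figure"),
                      Issue("minor", "appendix order")]))
```

Major issues do not flip the verdict in this sketch; a stricter team policy could treat them as blocking too.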
+## After the Checklist
+- **If critical issues remain**: Help the user fix them by routing back to the relevant sub-skill (e.g., missing pipeline figure → bench-construction concepts)
+- **If all clear**: Confirm readiness and remind about:
+  - Venue-specific formatting and templates
+  - Supplementary material preparation
+  - Co-author review before final submission
+  - Camera-ready preparation timeline (if accepted)

package/dist/skills/benchmark-paper-template/references/construction-pipeline.md ADDED

@@ -0,0 +1,127 @@
+# Benchmark Construction Pipeline
+## Contents
+- [Pipeline Design Principles](#pipeline-design-principles)
+- [Three Proven Construction Paradigms](#three-proven-construction-paradigms)
+- [Pipeline Stages Specification](#pipeline-stages-specification)
+- [Quality Control Strategies](#quality-control-strategies)
+- [Statistical Characteristics (Section 2.3 Content)](#statistical-characteristics-section-23-content)
+- [Pipeline Figure (Figure 2)](#pipeline-figure-figure-2)
+Design a systematic, reproducible, and scalable data construction pipeline. The pipeline is the core technical contribution of a benchmark paper, equivalent to the "Method" section of a technique paper.
+## Pipeline Design Principles
+Every construction pipeline must satisfy three properties:
+1. **Reproducibility**: another team can follow the same steps and get comparable data
+2. **Scalability**: the pipeline can generate more data without proportionally more human effort
+3. **Quality Assurance**: every data point passes explicit quality gates
+## Three Proven Construction Paradigms
+Choose the paradigm that best fits your task, or combine elements from multiple paradigms.
+### Paradigm 1: Reverse Synthesis
+**StatQA approach**: Fix the answer first, then generate the question.
+```
+Define answer space → Generate matching conditions → Compose questions → Validate uniqueness
+     (enumerate)          (scenarios)                   (NL wrap)         (one correct answer)
+```
+**When to use:** The answer space is enumerable (e.g., statistical methods, chart types, code patterns). You need precise control over difficulty distribution and category balance.
+**Key advantage:** Guarantees unambiguous ground truth by construction.
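The four steps of reverse synthesis can be sketched in a few lines. This is an illustrative toy, not StatQA's actual pipeline: the answer space, the condition fields (`groups`, `normal`), and the function names are all hypothetical.

```python
import itertools

# Step 1: enumerate the answer space. Each answer is paired with the
# conditions under which it is the single correct choice (hypothetical data).
ANSWER_SPACE = {
    "t-test": {"groups": 2, "normal": True},
    "Mann-Whitney U": {"groups": 2, "normal": False},
    "ANOVA": {"groups": 3, "normal": True},
}


def compose_question(cond):
    """Step 3: wrap the matching conditions in natural language."""
    shape = "normally distributed" if cond["normal"] else "clearly non-normal"
    return (f"We compare a numeric outcome across {cond['groups']} independent "
            f"groups; the data are {shape}. Which test is appropriate?")


def build_samples(answer_space):
    """Steps 2-4: derive conditions, compose questions, validate uniqueness."""
    samples = [{"question": compose_question(c), "answer": a, "conditions": c}
               for a, c in answer_space.items()]
    # Step 4: two answers sharing the same conditions would make the ground truth ambiguous.
    for x, y in itertools.combinations(samples, 2):
        if x["conditions"] == y["conditions"]:
            raise ValueError(f"ambiguous ground truth: {x['answer']} vs {y['answer']}")
    return samples


for sample in build_samples(ANSWER_SPACE):
    print(sample["answer"], "<-", sample["question"])
```

Because every question is derived from its answer, category balance is controlled simply by how many conditions are generated per answer.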
+### Paradigm 2: Controlled Injection
+**nvBench 2.0 approach**: Start from clean seeds, then systematically inject the target phenomenon.
+```
+Collect clean seeds → Define injection types → Inject at calibrated levels → Validate naturalness
+    (unambiguous)        (e.g., 4 types)         (e.g., 5 severity levels)    (human check)
+```
+**When to use:** You study a specific phenomenon (ambiguity, noise, bias, adversarial perturbation) and need controlled severity gradients across the phenomenon.
+**Key advantage:** Isolates the variable of interest; enables severity-stratified analysis.
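As a minimal sketch of injecting at calibrated levels, the example below uses noise (token dropping) as the phenomenon. It is not nvBench 2.0's method: the injection type, the 5-level scale, and the drop-rate formula are assumptions chosen for illustration, and the naturalness check by humans is left out.

```python
import random

SEVERITY_LEVELS = 5  # hypothetical scale; the source gives "5 severity levels" only as an example


def inject_token_drop(seed_text, severity, rng):
    """Drop a severity-proportional share of tokens from a clean seed.

    Assumed calibration: level k removes about k / (2 * SEVERITY_LEVELS) of the tokens.
    """
    if not 1 <= severity <= SEVERITY_LEVELS:
        raise ValueError("severity must be between 1 and SEVERITY_LEVELS")
    tokens = seed_text.split()
    n_drop = round(len(tokens) * severity / (2 * SEVERITY_LEVELS))
    n_drop = max(0, min(len(tokens) - 1, n_drop))  # always keep at least one token
    dropped = set(rng.sample(range(len(tokens)), n_drop))
    kept = [t for i, t in enumerate(tokens) if i not in dropped]
    # Keep the seed, type, and level so results can be stratified by severity later.
    return {"seed": seed_text, "injection_type": "token_drop",
            "severity": severity, "text": " ".join(kept)}


rng = random.Random(0)  # fixed seed so the construction is reproducible
clean = "show total sales by region for each quarter last year"
for level in range(1, SEVERITY_LEVELS + 1):
    print(level, inject_token_drop(clean, level, rng)["text"])
```

Storing the clean seed next to each perturbed sample is what makes the severity-stratified analysis possible.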
+### Paradigm 3: Adaptive Generation plus Expert Annotation
+**VisJudge-Bench approach**: Use LLMs to generate candidate data, then apply multi-stage human annotation.
+```
+LLM generates candidates → Stage-1 screening → Stage-2 fine-grained scoring → Stage-3 cross-validation
+   (adapted to content)      (expert filter)       (multi-dimension)            (disagreement resolution)
+```
+**When to use:** The task requires nuanced human judgment that cannot be fully automated. Quality depends on expert assessment across multiple subjective dimensions.
+**Key advantage:** Combines LLM scalability with human judgment quality.
+Ask the user: "Which construction paradigm fits this project? Describe the data source and rough pipeline."
+## Pipeline Stages Specification
+<EXTREMELY-IMPORTANT>
+The construction pipeline is what reviewers scrutinize most intensely. Every step must have explicit input, output, and key operations described in detail. Vague pipeline descriptions are a top rejection reason.
+</EXTREMELY-IMPORTANT>
+Regardless of paradigm, document every stage with this template:
+| Stage | Input | Operation | Output | Quality Gate |
+|-------|-------|-----------|--------|-------------|
+| **1. Source Selection** | Raw data sources | Filtering, licensing, dedup | Clean source corpus | Coverage audit, license check |
+| **2. Seed Generation** | Source corpus | [Paradigm-specific method] | Initial samples | Format validation, uniqueness |
+| **3. Enrichment** | Initial samples | Add metadata, labels, tags | Annotated samples | Taxonomy coverage balance |
+| **4. Quality Control** | Annotated samples | Human review / auto validation | Validated samples | IAA ≥ threshold |
+| **5. Splitting** | Validated samples | Stratified train/dev/test split | Final dataset | Distribution balance check |
+For each stage, be explicit about:
+- **Who performs it** (automated script / LLM / human annotator / domain expert)
+- **How long it takes** (per sample and total)
+- **What can go wrong** and how you handle failures
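The splitting stage in the table above can be sketched with the standard library alone. This is one possible implementation under stated assumptions: the 80/10/10 ratios, the `cat` field, and the function name `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split into train/dev/test while keeping per-category proportions.

    `key` maps a sample to its taxonomy category; `ratios` are assumed defaults.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_category = defaultdict(list)
    for sample in samples:
        by_category[key(sample)].append(sample)

    splits = {"train": [], "dev": [], "test": []}
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_dev = int(len(items) * ratios[1])
        splits["train"] += items[:n_train]
        splits["dev"] += items[n_train:n_train + n_dev]
        splits["test"] += items[n_train + n_dev:]  # remainder goes to test
    return splits


data = [{"id": i, "cat": "A" if i < 20 else "B"} for i in range(30)]
parts = stratified_split(data, key=lambda s: s["cat"])
print({name: len(rows) for name, rows in parts.items()})
```

The distribution balance check then reduces to comparing category counts across the three returned lists.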
+## Quality Control Strategies
+Document at least three strategies. Reviewers scrutinize this heavily:
+| Strategy | Method | When to Use |
+|----------|--------|------------|
+| **Inter-Annotator Agreement** | Multiple annotators on same samples; report Cohen's κ, Fleiss' κ, or ICC (Intraclass Correlation Coefficient) | Any human annotation task |
+| **Expert Spot-Check** | Domain expert reviews random ≥10% subset | Semi-automated pipelines |
+| **Adversarial Validation** | Test if a classifier can distinguish synthetic vs. real data | Synthetic data generation |
+| **Automated Sanity Checks** | Rule-based validation: format, range, consistency, deduplication | All pipelines |
+| **Pilot Study** | Small-scale trial run (50-100 samples) before full construction | Complex multi-stage pipelines |
+| **LLM Cross-Validation** | Use a different LLM family to verify LLM-generated data | LLM-in-the-loop pipelines |
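For the inter-annotator agreement row, Cohen's κ for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. A minimal standard-library sketch follows; the function name and sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of samples with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


annotator_1 = ["valid", "valid", "invalid", "invalid"]
annotator_2 = ["valid", "invalid", "invalid", "invalid"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

With more than two annotators, Fleiss' κ or ICC (both named in the table) is the appropriate statistic instead.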
+## Statistical Characteristics (Section 2.3 Content)
+After construction, present comprehensive statistics:
+**Must-have (typically Table 2 + Figures 3-4):**
+- Total sample count by split (train/dev/test)
+- Distribution across taxonomy categories (bar chart or pie chart)
+- Difficulty distribution (histogram)
+- Text length / complexity distribution
+- Scale comparison with existing benchmarks
+**Recommended:**
+- Word cloud or topic distribution
+- Source diversity metrics
+- 2-3 representative data examples as a Figure
+- Annotation time and cost statistics
+## Pipeline Figure (Figure 2)
+The pipeline MUST be visualized. Design guidelines:
+- **Left-to-right or top-to-bottom flow** with clear directionality
+- **Each step**: labeled box with brief description
+- **Between steps**: arrows annotated with data count (e.g., "10K → filtering → 8.5K")
+- **Quality gates**: visually distinct (diamond shapes, colored borders, or gate icons)
+- **Concrete example**: show one sample flowing through the entire pipeline
+- **Paradigm label**: indicate which construction paradigm is used
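The arrow annotations in the guidelines above can be generated from per-stage counts so the figure stays in sync with the data. A small sketch, assuming a hypothetical list of `(stage name, samples remaining)` pairs whose first entry is the raw source:

```python
def short_count(n):
    """Format 10000 as '10K' and 8500 as '8.5K'; small counts stay as plain integers."""
    if n >= 1000:
        return f"{n / 1000:.1f}".rstrip("0").rstrip(".") + "K"
    return str(n)


def arrow_labels(stages):
    """Build one 'before → stage → after' label per pipeline step."""
    labels = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        labels.append(f"{short_count(before)} → {name} → {short_count(after)}")
    return labels


# Hypothetical counts for illustration only.
pipeline = [("source", 10000), ("filtering", 8500), ("expert review", 8000)]
for label in arrow_labels(pipeline):
    print(label)
```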

package/dist/skills/benchmark-paper-template/references/experiments.md ADDED

@@ -0,0 +1,159 @@
+# Experiment & Insight Design
+Design experiments that go beyond leaderboard numbers to reveal deep insights about model capabilities. In a Benchmark paper, the experiment section is NOT about "proving our method is best"; it is about **revealing where and why models fail, and what this means for future research.**
+## 1. Baseline Model Selection
+Cover multiple axes for comprehensive comparison. Evaluate at least 10 models, ideally up to 15:
+| Axis | Categories | Purpose |
+|------|-----------|---------|
+| **Open vs. Closed** | Open (LLaMA, Qwen, Mistral, DeepSeek) vs. closed (GPT-4o, Claude, Gemini) | Capability gap between proprietary and accessible models |
+| **Model Scale** | 7B → 13B → 70B → 100B+ | How capability scales with parameters |
+| **Architecture** | Decoder-only vs. Encoder-Decoder; text-only vs. multimodal | Architecture-specific strengths/weaknesses |
+| **Specialization** | General vs. domain-specific (code, math, vision, etc.) | Whether specialized training transfers |
+Ask the user: "Which models will be evaluated? Do they cover all four axes above?"
+## 2. Evaluation Protocol
+### 2.1 Evaluation Settings
+Define how models interact with the benchmark:
+| Element | Key Questions | Guidance |
+|---------|-------------|----------|
+| **Input format** | How does the model receive input? What context is provided? | Specify prompt template, input structure, and any special formatting |
+| **Output extraction** | How is model output parsed and scored? | Define extraction rules, parsing logic, and edge case handling |
+| **Prompting strategies** | Zero-shot, Few-shot (1/3/5), CoT, Domain knowledge prompting, Tool-use | Test multiple; report which helps/hurts |
+| **Evaluation method** | Auto metrics, LLM-as-Judge, Human eval | Use ≥2 methods; validate LLM-judge against human |
+| **Metrics selection** | What metrics and why? (e.g., Accuracy, F1@K, MAE, Correlation) | Justify metric choices; prefer ONE headline metric for main ranking, with breakdowns as secondary |
+| **Repetitions** | 1-5 runs per model | Report mean ± std for non-deterministic setups; report intra-model variance |
+| **Temperature** | 0 for reproducibility, >0 for diversity analysis | Document the choice |
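The repetitions row asks for mean ± std across runs. A minimal sketch with the standard library, assuming scores are on a 0-100 scale and that the sample standard deviation (n - 1 denominator) is the intended statistic:

```python
import statistics


def summarize_runs(scores):
    """Return 'mean ± std' for repeated runs of one model on one metric."""
    if not scores:
        raise ValueError("need at least one run")
    mean = statistics.mean(scores)
    if len(scores) == 1:
        # A single run has no variance estimate; say so instead of printing ± 0.0.
        return f"{mean:.1f} (single run)"
    std = statistics.stdev(scores)  # sample standard deviation
    return f"{mean:.1f} ± {std:.1f}"


print(summarize_runs([70.0, 72.0, 74.0]))
print(summarize_runs([71.3]))
```

Flagging single-run results explicitly keeps deterministic (temperature 0) and non-deterministic setups distinguishable in the results table.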
+### 2.2 Human Baseline Experiment
+If conducting human evaluation, specify:
+| Element | Details |
+|---------|---------|
+| **Participant profile** | How many? Background/expertise? Recruitment criteria? |
+| **Experiment protocol** | Task instructions, time limit, annotation interface |
+| **Inter-rater reliability** | Agreement metric (Cohen's κ, Fleiss' κ, ICC) and threshold |
+| **Comparison design** | Same samples evaluated by both humans and models |
+### 2.3 Baseline Fairness
+Document the optimization effort for each baseline equally. Reviewers are increasingly aware of "baseline nerfing" (under-optimizing competitor hyperparameters), so each baseline should use its best known configuration.
+## 3. RQ-driven Experiment Structure
+Organize ALL experiments around your Research Questions. Each RQ drives one analysis subsection:
+### RQ Analysis Template
+For each RQ, specify:
+```
+RQ[N]: [The question]
+├── Hypothesis: [What you expect to find]
+├── Experiment: [Specific comparison or analysis]
+├── Variables: [What you vary vs. control]
+├── Metrics: [What you measure]
+├── Visualization: [Figure or Table type]
+└── Expected Finding: [What insight this yields]
+```
+### Common Analysis Types
+| Analysis | What It Reveals | Visualization | Priority |
+|----------|----------------|---------------|----------|
+| **Overall Performance** | General capability landscape | Large table: all models × all metrics | MUST |
+| **Category Breakdown** | Per-taxonomy performance | Grouped bar chart or heatmap | MUST |
+| **Difficulty Gradient** | How performance degrades with difficulty | Line chart (x=difficulty, y=score) | MUST |
+| **Error Taxonomy** | WHAT types of mistakes models make | Stacked bar chart, pie chart + error examples | HIGH |
+| **Model Behavioral Bias** | Whether models have systematic tendencies (e.g., score inflation, over-conservatism, preference for certain answer types) | Distribution density curves, calibration plots | HIGH |
+| **Human vs. Model** | Where AI matches/exceeds/falls behind | Radar chart or paired comparison bars | HIGH |
+| **Prompting Impact** | How strategy affects performance | Ablation table (rows=models, cols=strategies) | MEDIUM |
+| **Scale Effect** | How model size affects capability | Line chart (x=params, y=score) | MEDIUM |
+| **Cross-dim Correlation** | Which capabilities are linked | Correlation heatmap | OPTIONAL |
+## 4. The Overall Performance Table
+The largest and most important table in the paper (typically Table 2 or 3):
+```
+| Model | Size | Overall | [Dim1] | [Dim2] | [Dim3] | [SubDim1.1] | [SubDim1.2] | ... |
+|-------|------|---------|--------|--------|--------|-------------|-------------|-----|
+| GPT-4o | - | XX.X | XX.X | XX.X | XX.X | XX.X | XX.X | |
+| Claude | - | XX.X | XX.X | XX.X | XX.X | XX.X | XX.X | |
+| LLaMA-70B | 70B | XX.X | XX.X | XX.X | XX.X | XX.X | XX.X | |
+| ... | | | | | | | | |
+| Human | - | XX.X | XX.X | XX.X | XX.X | XX.X | XX.X | |
+```
+Design guidelines:
+- **Bold** the best result per column; **underline** second-best
+- Group models by category (closed-source / open-source / specialized)
+- Include human performance as a reference row (if applicable); models may match or exceed it on some dimensions
+- Add average and worst-case rows if insightful
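Marking best and second-best results by hand is error-prone, so it helps to script it. The sketch below is illustrative: it assumes higher scores are better, uses `<u>` for underline because Markdown has no native underline syntax, and uses placeholder model names and scores.

```python
def mark_best(rows, metric_cols):
    """Format metric cells, bolding the best and underlining the second-best per column."""
    formatted = [
        {k: (f"{v:.1f}" if k in metric_cols else str(v)) for k, v in row.items()}
        for row in rows
    ]
    for col in metric_cols:
        # Distinct values in descending order, so ties share the same mark.
        ranked = sorted({row[col] for row in rows}, reverse=True)
        for src, out in zip(rows, formatted):
            if src[col] == ranked[0]:
                out[col] = f"**{out[col]}**"
            elif len(ranked) > 1 and src[col] == ranked[1]:
                out[col] = f"<u>{out[col]}</u>"
    return formatted


# Placeholder rows, not real results.
results = [
    {"Model": "model-a", "Overall": 71.2, "Dim1": 80.0},
    {"Model": "model-b", "Overall": 65.0, "Dim1": 82.5},
    {"Model": "model-c", "Overall": 68.4, "Dim1": 79.1},
]
for row in mark_best(results, ["Overall", "Dim1"]):
    print("| " + " | ".join(row.values()) + " |")
```

If a human row is included, it is usually excluded from the ranking; filter it out before calling the function and append it afterwards.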
+## 5. The "Finding X" Pattern
+<EXTREMELY-IMPORTANT>
+This is the SIGNATURE writing technique of Benchmark papers. After each major analysis, extract and bold-highlight a numbered Finding:
+**Finding 1.** Few-shot learning and domain knowledge inclusion help LLMs, whereas Chain-of-Thought tends to slightly degrade performance in smaller models.
+**Finding 2.** All tested models perform below random baseline on [specific sub-category], revealing a fundamental capability gap in [specific skill].
+**Finding 3.** Model scale improves [Dimension A] performance monotonically, but has no significant effect on [Dimension B], suggesting these capabilities have different scaling properties.
+Each Finding MUST be:
+- **Surprising or non-obvious**: not just "bigger models are better"
+- **Specific**: names concrete models, categories, or conditions
+- **Actionable**: implies what future research should address
+- **Supported by data**: directly backed by the analysis above it
+**Micro-structure template** for each Finding:
+```
+[Analysis paragraph with data...]
+**Finding N.** [Lead sentence: the key result in one sentence.]
+[Evidence: specific numbers, models, conditions.]
+[Qualification: scope and boundary conditions.]
+[Forward pointer: what this implies for future work.]
+```
+</EXTREMELY-IMPORTANT>
+## 6. Case Study Design
+Include 2-4 case studies for qualitative depth:
+| Type | Purpose | Format |
+|------|---------|--------|
+| **Success case** | Show benchmark discriminates capability | Strong model vs. weak model on same input |
+| **Failure case** | Reveal specific model limitation | Model output + annotation of where/why it fails |
+| **Surprising case** | Highlight counter-intuitive behavior | Setup expectation → show unexpected result |
+| **Edge case** | Test capability boundary | Minimal change that flips correct → incorrect |
+## 7. Companion Method Experiments (If Applicable)
+If proposing a companion method (from bench-design):
+- Comparison with baselines on the new benchmark
+- Ablation study of method components
+- Per-category improvement analysis (WHERE does the method help most?)
+- Generalization test (does improvement transfer to other benchmarks?)
+- **Downstream application validation** (can the benchmark/method be used to improve real-world performance? e.g., VisJudge-Bench validated through downstream visualization quality improvement)
+## 8. Research Opportunities
+Derive 3-5 future directions from your Findings. Typical directions:
+- How to enhance model capability on [the gap area]
+- Human-AI collaboration patterns for [the task]
+- Benchmark extension (new modalities, languages, domains, difficulty levels)
+- Training methodology improvements inspired by the findings
+- Theoretical understanding of why [specific Finding] occurs

package/dist/skills/benchmark-paper-template/references/gap-analysis.md ADDED

@@ -0,0 +1,86 @@
+# Benchmark Gap Analysis
+Systematically identify and articulate an evaluation gap. This answers the foundational question of every benchmark paper: **"What critical capability is NOT being evaluated, and why does that matter?"**
+## Step 1: Survey the Evaluation Landscape
+Help the user map what ALREADY exists. Build a comparison table (this becomes **Table 1** in the paper):
+| Dimension | Benchmark A | Benchmark B | Benchmark C | ... | Gap |
+|-----------|------------|------------|------------|-----|-----|
+| Task scope | | | | | |
+| Data scale | | | | | |
+| Evaluation dimensions | | | | | |
+| Granularity level | | | | | |
+| Real-world alignment | | | | | |
+| [Key differentiator] | ✗ | ✗ | ✗ | | **Ours** |
+Guide the user with these questions:
+- What mainstream benchmarks already exist in this area, and what does each one evaluate?
+- What implicit assumption do they share, and does that assumption hold in realistic scenarios?
+- If a model scored perfectly on every existing benchmark, would it really have mastered the underlying capability?
+## Step 2: Identify the Blind Spot
+The gap must be a **structural limitation** of existing evaluation, not "we need more data" or "we need a bigger dataset." Three proven gap patterns:
+### Pattern A: Dimension Blindness
+Existing benchmarks measure capability X but completely ignore related capability Y.
+> **StatQA example**: Math reasoning benchmarks test whether models can compute correct answers, but completely ignore whether models can select the appropriate statistical method. A model that applies the wrong test but computes correctly would score perfectly, yet its reasoning is fundamentally flawed.
+### Pattern B: Assumption Violation
+Existing benchmarks share an implicit assumption that does not hold in real-world scenarios.
+> **nvBench 2.0 example**: All Text-to-Visualization benchmarks assume each natural language query maps to exactly one correct visualization. In practice, real queries are inherently ambiguous: "show sales trends" could reasonably produce a line chart, bar chart, or area chart. Current benchmarks penalize valid alternative interpretations.
+### Pattern C: Evaluation Granularity Mismatch
+Existing evaluation is too coarse to diagnose specific failure modes.
+> **VisJudge-Bench example**: Existing visualization evaluation checks surface aesthetics OR information accuracy in isolation. But real quality requires the interplay of Fidelity, Expressiveness, and Aesthetics. A chart can be beautiful but misleading, or accurate but incomprehensible.
+**Ask the user:**
+- Which gap pattern does the observed blind spot most resemble? (Dimension Blindness, Assumption Violation, or Granularity Mismatch)
+- Can the user cite a concrete failure case where existing evaluation misses the real problem?
+## Step 3: Validate the Gap
+A gap is only worth a paper if it passes ALL four checks:
+- [ ] **Practical impact**: Real users or applications encounter this problem
+- [ ] **Measurable**: The gap can be quantified with concrete metrics, not just described qualitatively
+- [ ] **Non-trivial**: Existing benchmarks cannot be trivially extended to cover it (adding a few samples is not enough)
+- [ ] **Actionable**: Identifying this gap opens concrete research directions for model improvement
+If any check fails, the gap needs refinement. Help the user iterate.
+## Step 4: Articulate the Gap Statement
+Draft a 2-3 sentence gap statement using this structure:
+> **Existing benchmarks for [TASK] focus on [WHAT THEY MEASURE], operating under the assumption that [IMPLICIT ASSUMPTION]. However, in real-world scenarios, [WHY THE ASSUMPTION FAILS]. This creates a critical evaluation blind spot: [SPECIFIC CAPABILITY THAT CANNOT BE EVALUATED], meaning that [CONSEQUENCE: what can go wrong with models that appear to perform well].**
+The gap statement must be:
+- **Specific**: names the exact blind spot, not a vague "limitation"
+- **Surprising**: challenges a common assumption the community takes for granted
+- **Consequential**: shows real harm from the blind spot (not just theoretical incompleteness)
+## Step 5: Derive Research Questions
+Map the gap directly to 2-3 Research Questions. RQs should cover three areas:
+| Coverage Area | RQ Template | Purpose |
+|--------------|-------------|---------|
+| **Benchmark construction** | How should we design a benchmark to systematically evaluate [the blind spot]? | Justify design choices |
+| **Model capability boundary** | To what extent do current models fail on [the blind spot], and what sub-capabilities differentiate strong vs. weak models? | Establish severity + enable diagnosis |
+| **Human-model gap** | How do models compare with human experts on [this dimension], and what factors (scale / prompting / domain knowledge) affect the gap? | Ground truth + actionable insights |
+Typical RQ examples:
+| RQ | Template | Maps to |
+|----|----------|---------|
+| RQ1 | To what extent do current models fail on [the blind spot]? | §4.2 Overall Performance |
+| RQ2 | What specific sub-capabilities differentiate strong vs. weak models in [this dimension]? | §4.3 Fine-grained Analysis |
+| RQ3 | How does [factor: model scale / prompting / domain knowledge] affect performance, and how do models compare with human experts? | §4.3 + Human vs. Model |
+These RQs will drive the experiment design (bench-experiments) and structure the paper. Each RQ should correspond to a subsection in the experiments chapter.

package/dist/skills/benchmark-paper-template/references/instantiation-template.md ADDED

@@ -0,0 +1,194 @@
+# Benchmark Paper Instantiation Template
+## Table of contents
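+1. Basic information
+2. Core evaluation definition
+3. Gap analysis
+4. Research questions
+5. Benchmark design
+6. Construction pipeline
+7. Dataset characteristics
+8. Expected findings
+9. Running example design
+10. Contamination mitigation
+11. Experiment plan
+12. Figure and table plan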
+Use this template to concretize your benchmark idea before writing. Fill in each section through discussion.
+## 1. Basic Information
+| Field | Your Answer |
+|-------|-------------|
+| **Benchmark Name** | |
+| **Target Domain** | |
+| **Target Venue** | |
+| **Target Submission Date** | |
+## 2. Core Evaluation Definition
+| Field | Your Answer |
+|-------|-------------|
+| **Core capability/dimension to evaluate** (one sentence) | |
+| **What makes your evaluation definition unique** vs. existing benchmarks? | |
+| **Sub-dimensions** in your evaluation framework | |
+## 3. Gap Analysis (bench-gap-analysis)
+| Field | Your Answer |
+|-------|-------------|
+| **Existing Benchmarks** (list 5+) | |
+| **Their Shared Assumption** | |
+| **The Blind Spot** | |
+| **Gap Pattern** (Dimension Blindness / Assumption Violation / Granularity Mismatch) | |
+| **Gap Statement** (2-3 sentences) | |
+| **Concrete Failure Example** | |
+## 4. Research Questions
+| RQ | Question | Mapped to Section |
+|----|----------|------------------|
+| RQ1 | | §4.2 |
+| RQ2 | | §4.3 |
+| RQ3 | | §4.3 |
+## 5. Benchmark Design (bench-design)
+### Design Goals
+| Goal | Strategy |
+|------|----------|
+| G1 (Coverage) | |
+| G2 (Fine-grained) | |
+| G3 (Scalability) | |
+| G4 (Quality) | |
+### Task Scope
+| In Scope | Out of Scope (with reason) |
+|----------|---------------------------|
+| | |
+| | |
+### Taxonomy
+| Pattern | Chosen: 1 / 2 / 3 |
+|---------|-------------------|
+| **Dimension 1** | Sub: |
+| **Dimension 2** | Sub: |
+| **Dimension 3** | Sub: |
+| **Total cells** | |
+### Evaluation Framework
+| Sub-dimension | Metric | Scoring | Range | Auto/Human |
+|--------------|--------|---------|-------|-----------|
+| | | | | |
+| | | | | |
+### Companion Method (Optional)
+| Field | Your Answer |
+|-------|-------------|
+| Training Signal | |
+| Optimization Technique | |
+| Expected Improvement | |
+## 6. Construction Pipeline (bench-construction)
+| Field | Your Answer |
+|-------|-------------|
+| **Paradigm** (Reverse Synthesis / Controlled Injection / Adaptive Generation) | |
+| **Data Source** | |
+| **Target Scale** | |
+| **Biggest Construction Challenge** (data scarcity? subjective annotation? ambiguity? cost?) | |
+| **Pipeline Core Innovation** | |
+### Pipeline Steps
+| Step | Input | Operation | Output | QC Gate |
+|------|-------|-----------|--------|---------|
+| 1 | | | | |
+| 2 | | | | |
+| 3 | | | | |
+| 4 | | | | |
+| 5 | | | | |
+## 7. Dataset Characteristics
+| Field | Your Answer |
+|-------|-------------|
+| **Scale** (samples, splits, domains) | |
+| **Difficulty levels** | |
+| **Rich metadata / reasoning paths** for deep evaluation | |
+| **Key comparison dimensions** with existing benchmarks | |
+## 8. Expected Findings (Anticipated Insights)
+| Field | Your Answer |
+|-------|-------------|
+| **What difficulties will SOTA models face?** | |
+| **Where will model errors concentrate?** | |
+| **Expected human-model performance gap** | |
+## 9. Running Example Design (Figure 1)
+| Field | Your Answer |
+|-------|-------------|
+| **What concrete example will Figure 1 show?** | |
+| **Does it simultaneously illustrate** existing method limitations AND your benchmark's value? | |
+## 10. Contamination Mitigation
+| Field | Your Answer |
+|-------|-------------|
+| **How do you prevent training data leakage?** | |
+| **Data provenance verification strategy** | |
+## 11. Experiment Plan (bench-experiments)
+### Baseline Models
+| Model | Type | Scale | Why Include |
+|-------|------|-------|------------|
+| | | | |
+| | | | |
+### Analysis Plan
+| RQ | Hypothesis | Analysis Type | Visualization |
+|----|-----------|--------------|---------------|
+| RQ1 | | | |
+| RQ2 | | | |
+| RQ3 | | | |
+## 12. Figure & Table Plan
+| Item | Content | Position |
+|------|---------|----------|
+| Figure 1 | Running Example: | Page 1-2 |
+| Table 1 | Benchmark Comparison: | Page 2-3 |
+| Figure 2 | Pipeline: | Page 3-4 |
+| Table 2 | Statistics: | Page 4-5 |
+| Table 3 | Overall Performance: | Page 6-7 |
+## Reference Case Studies
+For inspiration, consult these three published benchmark papers:
+### StatQA (NeurIPS 2024)
+- **Gap**: Math benchmarks test computation but ignore statistical method selection
+- **Paradigm**: Reverse synthesis (answer-first)
+- **Taxonomy**: 5 categories × 2 difficulty levels
+- **Key Finding**: Models compute correctly but choose wrong statistical methods
+### nvBench 2.0 (NeurIPS 2025)
+- **Gap**: Text2VIS benchmarks assume single correct answer, ignoring query ambiguity
+- **Paradigm**: Controlled ambiguity injection
+- **Taxonomy**: 4 ambiguity types × 5 severity levels
+- **Companion Method**: Step-Text2Vis (Step-DPO)
+- **Key Finding**: All models struggle with ambiguous queries, even with explicit disambiguation prompts
+### VisJudge-Bench (ICLR 2026)
+- **Gap**: Evaluation checks aesthetics OR accuracy, not their interplay
+- **Paradigm**: Adaptive generation + 3-stage expert annotation
+- **Taxonomy**: 3 dimensions (Fidelity, Expressiveness, Aesthetics) → 6 sub-dimensions
+- **Companion Method**: VisJudge (GRPO)
+- **Key Finding**: Models excel at surface aesthetics but fail at fidelity-expressiveness balance