mindforge-cc 11.5.1 → 11.6.0

Files changed (170)
  1. package/.agent/mindforge/skill-tdd.md +53 -0
  2. package/.agent/mindforge/skills-index.md +118 -0
  3. package/.agent/mindforge/systematic-debug.md +60 -0
  4. package/.agent/skills/1password-skill/SKILL.md +156 -0
  5. package/.agent/skills/1password-skill/references/cli-examples.md +31 -0
  6. package/.agent/skills/1password-skill/references/get-started.md +21 -0
  7. package/.agent/skills/article-illustrator/SKILL.md +199 -0
  8. package/.agent/skills/article-illustrator/references/prompt-construction.md +426 -0
  9. package/.agent/skills/article-illustrator/references/style-presets.md +80 -0
  10. package/.agent/skills/article-illustrator/references/styles.md +224 -0
  11. package/.agent/skills/article-illustrator/references/usage.md +50 -0
  12. package/.agent/skills/article-illustrator/references/workflow.md +332 -0
  13. package/.agent/skills/arxiv/SKILL.md +275 -0
  14. package/.agent/skills/blogwatcher/SKILL.md +130 -0
  15. package/.agent/skills/code-wiki/SKILL.md +438 -0
  16. package/.agent/skills/code-wiki/templates/README.md +31 -0
  17. package/.agent/skills/code-wiki/templates/architecture.md +30 -0
  18. package/.agent/skills/code-wiki/templates/getting-started.md +47 -0
  19. package/.agent/skills/code-wiki/templates/module.md +38 -0
  20. package/.agent/skills/codebase-inspection/SKILL.md +109 -0
  21. package/.agent/skills/comic-creator/SKILL.md +240 -0
  22. package/.agent/skills/comic-creator/references/analysis-framework.md +176 -0
  23. package/.agent/skills/comic-creator/references/auto-selection.md +71 -0
  24. package/.agent/skills/comic-creator/references/base-prompt.md +98 -0
  25. package/.agent/skills/comic-creator/references/character-template.md +180 -0
  26. package/.agent/skills/comic-creator/references/ohmsha-guide.md +85 -0
  27. package/.agent/skills/comic-creator/references/partial-workflows.md +106 -0
  28. package/.agent/skills/comic-creator/references/storyboard-template.md +143 -0
  29. package/.agent/skills/comic-creator/references/workflow.md +401 -0
  30. package/.agent/skills/concept-diagrams/SKILL.md +355 -0
  31. package/.agent/skills/concept-diagrams/references/dashboard-patterns.md +43 -0
  32. package/.agent/skills/concept-diagrams/references/infrastructure-patterns.md +144 -0
  33. package/.agent/skills/concept-diagrams/references/physical-shape-cookbook.md +42 -0
  34. package/.agent/skills/creative-ideation/SKILL.md +144 -0
  35. package/.agent/skills/creative-ideation/references/full-prompt-library.md +110 -0
  36. package/.agent/skills/devops-cli/SKILL.md +149 -0
  37. package/.agent/skills/devops-cli/references/app-discovery.md +112 -0
  38. package/.agent/skills/devops-cli/references/authentication.md +59 -0
  39. package/.agent/skills/devops-cli/references/cli-reference.md +104 -0
  40. package/.agent/skills/devops-cli/references/running-apps.md +171 -0
  41. package/.agent/skills/devops-watchers/SKILL.md +103 -0
  42. package/.agent/skills/docker-management/SKILL.md +273 -0
  43. package/.agent/skills/domain-intel/SKILL.md +96 -0
  44. package/.agent/skills/duckduckgo-search/SKILL.md +230 -0
  45. package/.agent/skills/github-auth/SKILL.md +240 -0
  46. package/.agent/skills/github-code-review/SKILL.md +474 -0
  47. package/.agent/skills/github-code-review/references/review-output-template.md +74 -0
  48. package/.agent/skills/github-issues/SKILL.md +363 -0
  49. package/.agent/skills/github-issues/templates/bug-report.md +35 -0
  50. package/.agent/skills/github-issues/templates/feature-request.md +31 -0
  51. package/.agent/skills/github-pr-workflow/SKILL.md +360 -0
  52. package/.agent/skills/github-pr-workflow/references/ci-troubleshooting.md +183 -0
  53. package/.agent/skills/github-pr-workflow/references/conventional-commits.md +71 -0
  54. package/.agent/skills/github-pr-workflow/templates/pr-body-bugfix.md +35 -0
  55. package/.agent/skills/github-pr-workflow/templates/pr-body-feature.md +33 -0
  56. package/.agent/skills/github-repo-management/SKILL.md +509 -0
  57. package/.agent/skills/github-repo-management/references/github-api-cheatsheet.md +161 -0
  58. package/.agent/skills/godmode/SKILL.md +396 -0
  59. package/.agent/skills/godmode/references/jailbreak-templates.md +128 -0
  60. package/.agent/skills/godmode/references/refusal-detection.md +142 -0
  61. package/.agent/skills/hyperframes/SKILL.md +182 -0
  62. package/.agent/skills/hyperframes/references/cli.md +185 -0
  63. package/.agent/skills/hyperframes/references/composition.md +129 -0
  64. package/.agent/skills/hyperframes/references/features.md +289 -0
  65. package/.agent/skills/hyperframes/references/gsap.md +136 -0
  66. package/.agent/skills/hyperframes/references/troubleshooting.md +137 -0
  67. package/.agent/skills/hyperframes/references/website-to-video.md +145 -0
  68. package/.agent/skills/jupyter-live-kernel/SKILL.md +160 -0
  69. package/.agent/skills/kanban-orchestrator/SKILL.md +209 -0
  70. package/.agent/skills/kanban-worker/SKILL.md +188 -0
  71. package/.agent/skills/llm-wiki/SKILL.md +499 -0
  72. package/.agent/skills/meme-generation/SKILL.md +122 -0
  73. package/.agent/skills/node-inspect-debugger/SKILL.md +312 -0
  74. package/.agent/skills/obsidian/SKILL.md +60 -0
  75. package/.agent/skills/osint-investigation/SKILL.md +269 -0
  76. package/.agent/skills/osint-investigation/templates/source-template.md +59 -0
  77. package/.agent/skills/oss-forensics/SKILL.md +422 -0
  78. package/.agent/skills/oss-forensics/references/evidence-types.md +89 -0
  79. package/.agent/skills/oss-forensics/references/github-archive-guide.md +184 -0
  80. package/.agent/skills/oss-forensics/references/investigation-templates.md +131 -0
  81. package/.agent/skills/oss-forensics/references/recovery-techniques.md +164 -0
  82. package/.agent/skills/oss-forensics/templates/forensic-report.md +151 -0
  83. package/.agent/skills/oss-forensics/templates/malicious-package-report.md +43 -0
  84. package/.agent/skills/parallel-cli/SKILL.md +384 -0
  85. package/.agent/skills/pinggy-tunnel/SKILL.md +302 -0
  86. package/.agent/skills/pixel-art/SKILL.md +209 -0
  87. package/.agent/skills/pixel-art/references/palettes.md +49 -0
  88. package/.agent/skills/plan/SKILL.md +331 -0
  89. package/.agent/skills/polymarket/SKILL.md +75 -0
  90. package/.agent/skills/polymarket/references/api-endpoints.md +220 -0
  91. package/.agent/skills/python-debugpy/SKILL.md +368 -0
  92. package/.agent/skills/requesting-code-review/SKILL.md +273 -0
  93. package/.agent/skills/research-paper-writing/SKILL.md +2367 -0
  94. package/.agent/skills/research-paper-writing/references/autoreason-methodology.md +394 -0
  95. package/.agent/skills/research-paper-writing/references/checklists.md +434 -0
  96. package/.agent/skills/research-paper-writing/references/citation-workflow.md +563 -0
  97. package/.agent/skills/research-paper-writing/references/experiment-patterns.md +728 -0
  98. package/.agent/skills/research-paper-writing/references/human-evaluation.md +476 -0
  99. package/.agent/skills/research-paper-writing/references/paper-types.md +481 -0
  100. package/.agent/skills/research-paper-writing/references/reviewer-guidelines.md +433 -0
  101. package/.agent/skills/research-paper-writing/references/sources.md +191 -0
  102. package/.agent/skills/research-paper-writing/references/writing-guide.md +474 -0
  103. package/.agent/skills/research-paper-writing/templates/README.md +251 -0
  104. package/.agent/skills/rest-graphql-debug/SKILL.md +507 -0
  105. package/.agent/skills/s6-container-supervision/SKILL.md +171 -0
  106. package/.agent/skills/scrapling/SKILL.md +328 -0
  107. package/.agent/skills/sherlock/SKILL.md +186 -0
  108. package/.agent/skills/simplify-code/SKILL.md +168 -0
  109. package/.agent/skills/skill-authoring/SKILL.md +158 -0
  110. package/.agent/skills/spike/SKILL.md +190 -0
  111. package/.agent/skills/subagent-driven-development/SKILL.md +345 -0
  112. package/.agent/skills/subagent-driven-development/references/context-budget-discipline.md +53 -0
  113. package/.agent/skills/subagent-driven-development/references/gates-taxonomy.md +93 -0
  114. package/.agent/skills/systematic-debugging/SKILL.md +360 -0
  115. package/.agent/skills/test-driven-development/SKILL.md +336 -0
  116. package/.agent/skills/video-orchestrator/SKILL.md +194 -0
  117. package/.agent/skills/video-orchestrator/references/examples.md +227 -0
  118. package/.agent/skills/video-orchestrator/references/intake.md +166 -0
  119. package/.agent/skills/video-orchestrator/references/kanban-setup.md +278 -0
  120. package/.agent/skills/video-orchestrator/references/monitoring.md +180 -0
  121. package/.agent/skills/video-orchestrator/references/role-archetypes.md +298 -0
  122. package/.agent/skills/video-orchestrator/references/tool-matrix.md +317 -0
  123. package/.agent/skills/web-pentest/SKILL.md +332 -0
  124. package/.agent/skills/web-pentest/references/bypass-techniques.md +133 -0
  125. package/.agent/skills/web-pentest/references/exploitation-techniques.md +204 -0
  126. package/.agent/skills/web-pentest/references/scope-enforcement.md +110 -0
  127. package/.agent/skills/web-pentest/references/vuln-taxonomy.md +81 -0
  128. package/.agent/skills/web-pentest/templates/authorization.md +69 -0
  129. package/.agent/skills/web-pentest/templates/pentest-report.md +178 -0
  130. package/.claude/commands/mindforge/skill-tdd.md +53 -0
  131. package/.claude/commands/mindforge/skills-index.md +118 -0
  132. package/.claude/commands/mindforge/systematic-debug.md +60 -0
  133. package/.mindforge/config.json +2 -2
  134. package/.mindforge/memory/sync-manifest.json +1 -1
  135. package/.mindforge/skills/arxiv/SKILL.md +294 -0
  136. package/.mindforge/skills/blogwatcher/SKILL.md +147 -0
  137. package/.mindforge/skills/code-wiki/SKILL.md +457 -0
  138. package/.mindforge/skills/codebase-inspection/SKILL.md +126 -0
  139. package/.mindforge/skills/concept-diagrams/SKILL.md +373 -0
  140. package/.mindforge/skills/creative-ideation/SKILL.md +162 -0
  141. package/.mindforge/skills/domain-intel/SKILL.md +116 -0
  142. package/.mindforge/skills/duckduckgo-search/SKILL.md +249 -0
  143. package/.mindforge/skills/github-code-review/SKILL.md +493 -0
  144. package/.mindforge/skills/github-issues/SKILL.md +382 -0
  145. package/.mindforge/skills/github-pr-workflow/SKILL.md +379 -0
  146. package/.mindforge/skills/jupyter-live-kernel/SKILL.md +179 -0
  147. package/.mindforge/skills/kanban-orchestrator/SKILL.md +227 -0
  148. package/.mindforge/skills/kanban-worker/SKILL.md +206 -0
  149. package/.mindforge/skills/meme-generation/SKILL.md +141 -0
  150. package/.mindforge/skills/obsidian/SKILL.md +80 -0
  151. package/.mindforge/skills/osint-investigation/SKILL.md +288 -0
  152. package/.mindforge/skills/oss-forensics/SKILL.md +421 -0
  153. package/.mindforge/skills/pixel-art/SKILL.md +228 -0
  154. package/.mindforge/skills/plan/SKILL.md +350 -0
  155. package/.mindforge/skills/requesting-code-review/SKILL.md +292 -0
  156. package/.mindforge/skills/research-paper-writing/SKILL.md +2384 -0
  157. package/.mindforge/skills/scrapling/SKILL.md +345 -0
  158. package/.mindforge/skills/sherlock/SKILL.md +203 -0
  159. package/.mindforge/skills/simplify-code/SKILL.md +187 -0
  160. package/.mindforge/skills/spike/SKILL.md +209 -0
  161. package/.mindforge/skills/subagent-driven-development/SKILL.md +364 -0
  162. package/.mindforge/skills/systematic-debugging/SKILL.md +379 -0
  163. package/.mindforge/skills/test-driven-development/SKILL.md +355 -0
  164. package/.mindforge/skills/web-pentest/SKILL.md +327 -0
  165. package/CHANGELOG.md +43 -0
  166. package/MINDFORGE.md +2 -2
  167. package/README.md +39 -3
  168. package/RELEASENOTES.md +55 -0
  169. package/docs/getting-started.md +42 -5
  170. package/package.json +1 -1
@@ -0,0 +1,476 @@
1
+ # Human Evaluation Guide for ML/AI Research
2
+
3
+ Comprehensive guide for designing, running, and reporting human evaluations in ML/AI papers. Human evaluation is the primary evidence for many NLP, HCI, and alignment papers, and is increasingly expected as complementary evidence at all ML venues.
4
+
5
+ ---
6
+
7
+ ## Contents
8
+
9
+ - [When Human Evaluation Is Needed](#when-human-evaluation-is-needed)
10
+ - [Study Design](#study-design)
11
+ - [Annotation Guidelines](#annotation-guidelines)
12
+ - [Platforms and Recruitment](#platforms-and-recruitment)
13
+ - [Quality Control](#quality-control)
14
+ - [Agreement Metrics](#agreement-metrics)
15
+ - [Statistical Analysis for Human Eval](#statistical-analysis-for-human-eval)
16
+ - [Reporting Requirements](#reporting-requirements)
17
+ - [IRB and Ethics](#irb-and-ethics)
18
+ - [Common Pitfalls](#common-pitfalls)
19
+
20
+ ---
21
+
22
+ ## When Human Evaluation Is Needed
23
+
24
+ | Scenario | Human Eval Required? | Notes |
25
+ |----------|---------------------|-------|
26
+ | Text generation quality (fluency, coherence) | **Yes** | Automated metrics (BLEU, ROUGE) correlate poorly with human judgment |
27
+ | Factual accuracy of generated text | **Strongly recommended** | Automated fact-checking is unreliable |
28
+ | Safety/toxicity evaluation | **Yes for nuanced cases** | Classifiers miss context-dependent harm |
29
+ | Preference between two systems | **Yes** | Most reliable method for comparing LLM outputs |
30
+ | Summarization quality | **Yes** | ROUGE doesn't capture faithfulness or relevance well |
31
+ | Task completion (UI, agents) | **Yes** | User studies are the gold standard |
32
+ | Classification accuracy | **Usually no** | Ground truth labels suffice; human eval adds cost without insight |
33
+ | Perplexity or loss comparisons | **No** | Automated metrics are the correct evaluation |
34
+
35
+ ---
36
+
37
+ ## Study Design
38
+
39
+ ### Evaluation Types
40
+
41
+ | Type | When to Use | Pros | Cons |
42
+ |------|-------------|------|------|
43
+ | **Pairwise comparison** | Comparing two systems | Most reliable, minimizes scale bias | Only compares pairs, quadratic in systems |
44
+ | **Likert scale** (1-5 or 1-7) | Rating individual outputs | Easy to aggregate | Subjective anchoring, scale compression |
45
+ | **Ranking** | Ordering 3+ systems | Captures full preference order | Cognitive load increases with items |
46
+ | **Best-worst scaling** | Comparing many systems efficiently | More reliable than Likert, linear in items | Requires careful item selection |
47
+ | **Binary judgment** | Yes/no decisions (grammatical? factual?) | Simple, high agreement | Loses nuance |
48
+ | **Error annotation** | Identifying specific error types | Rich diagnostic information | Expensive, requires trained annotators |
49
+
50
+ **Recommendation for most ML papers**: Pairwise comparison is the most defensible. Reviewers rarely question its validity. For Likert scales, always report both mean and distribution.
51
+
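The advice to report both mean and distribution for Likert ratings can be illustrated with a minimal sketch. The helper name `likert_summary` and the two rating lists are hypothetical, chosen only to show how identical means can hide very different response patterns.

```python
from collections import Counter
from statistics import mean

def likert_summary(ratings, scale=(1, 2, 3, 4, 5)):
    """Return the mean and the share of responses at each scale point."""
    counts = Counter(ratings)
    total = len(ratings)
    distribution = {point: counts.get(point, 0) / total for point in scale}
    return {"mean": mean(ratings), "distribution": distribution}

# Hypothetical ratings: same mean (3), very different shapes
polarized = [1, 1, 1, 5, 5, 5, 3, 3]
consistent = [3, 3, 3, 3, 3, 3, 3, 3]

print(likert_summary(polarized))
print(likert_summary(consistent))
```

Reporting only the mean would make these two systems look identical; the distribution shows that one of them splits annotators.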
52
+ ### Sample Size Planning
53
+
54
+ **Minimum viable sample sizes:**
55
+
56
+ | Study Type | Minimum Items | Minimum Annotators | Notes |
57
+ |------------|--------------|-------------------|-------|
58
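+ | Pairwise comparison | 100 pairs | 3 per pair | Detects a ~15-point win rate difference (65% vs. 50%) at p<0.05 with 80% power; a 10-point difference needs ~200 pairs (see power analysis below) |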
59
+ | Likert rating | 100 items | 3 per item | Enough for meaningful averages |
60
+ | Ranking | 50 sets | 3 per set | Each set contains all systems being compared |
61
+ | Error annotation | 200 items | 2 per item | Higher agreement expected for structured schemes |
62
+
63
+ **Power analysis** (for planning more precisely):
64
+
65
+ ```python
+ from scipy import stats
+ import numpy as np
+
+ def sample_size_pairwise(effect_size=0.10, alpha=0.05, power=0.80):
+     """
+     Estimate sample size for pairwise comparison (sign test).
+     effect_size: expected win rate difference from 0.50
+     """
+     p_expected = 0.50 + effect_size
+     # Normal approximation to binomial
+     z_alpha = stats.norm.ppf(1 - alpha / 2)
+     z_beta = stats.norm.ppf(power)
+     n = ((z_alpha * np.sqrt(0.25) + z_beta * np.sqrt(p_expected * (1 - p_expected))) ** 2) / (effect_size ** 2)
+     return int(np.ceil(n))
+
+ print(f"Sample size for 10% effect: {sample_size_pairwise(0.10)}")  # ~200
+ print(f"Sample size for 15% effect: {sample_size_pairwise(0.15)}")  # ~90
+ print(f"Sample size for 20% effect: {sample_size_pairwise(0.20)}")  # ~50
84
+ ```
85
+
86
+ ### Controlling for Bias
87
+
88
+ | Bias | Mitigation |
89
+ |------|-----------|
90
+ | **Order bias** (first item preferred) | Randomize presentation order for each annotator |
91
+ | **Length bias** (longer = better) | Control for length or analyze separately |
92
+ | **Anchoring** (first annotation sets scale) | Include warm-up items (not counted) |
93
+ | **Fatigue** (quality drops over time) | Limit session length (30-45 min max), randomize item order |
94
+ | **Annotator expertise** | Report annotator background; use qualification tasks |
95
+
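The order-bias mitigation in the table above can be sketched as follows. This is one possible approach, not a prescribed implementation: `build_pairwise_tasks`, the task dictionary keys, and the seed handling are all hypothetical names introduced for illustration.

```python
import random

def build_pairwise_tasks(pairs, annotator_ids, seed=0):
    """
    pairs: list of (item_id, output_a, output_b) tuples.
    Returns {annotator_id: [task, ...]} where each annotator gets
    (1) a freshly shuffled item order and
    (2) a coin flip per item deciding which system appears on the left.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    tasks = {}
    for annotator in annotator_ids:
        order = list(pairs)
        rng.shuffle(order)
        annotator_tasks = []
        for item_id, out_a, out_b in order:
            a_on_left = rng.random() < 0.5
            left, right = (out_a, out_b) if a_on_left else (out_b, out_a)
            annotator_tasks.append({
                "item_id": item_id,
                "left": left,
                "right": right,
                # Stored for analysis only; never shown to the annotator
                "left_system": "A" if a_on_left else "B",
            })
        tasks[annotator] = annotator_tasks
    return tasks

# Hypothetical usage
pairs = [(1, "a1", "b1"), (2, "a2", "b2"), (3, "a3", "b3")]
tasks = build_pairwise_tasks(pairs, ["ann1", "ann2"], seed=42)
```

Keeping `left_system` in the stored record makes it possible to map "left" and "right" votes back to systems afterwards, and to check whether one screen position was favored.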
96
+ ---
97
+
98
+ ## Annotation Guidelines
99
+
100
+ Well-written annotation guidelines are the single biggest factor in evaluation quality. Invest significant time here.
101
+
102
+ ### Structure of Good Guidelines
103
+
104
+ ```markdown
105
+ # [Task Name] Annotation Guidelines
106
+
107
+ ## Overview
108
+ [1-2 sentences describing the task]
109
+
110
+ ## Definitions
111
+ [Define every term annotators will use in their judgments]
112
+ - Quality: [specific definition for this study]
113
+ - Fluency: [specific definition]
114
+ - Factuality: [specific definition]
115
+
116
+ ## Rating Scale
117
+ [For each scale point, provide:]
118
+ - Numeric value
119
+ - Label (e.g., "Excellent", "Good", "Acceptable", "Poor", "Unacceptable")
120
+ - Definition of what qualifies for this rating
121
+ - 1-2 concrete examples at this level
122
+
123
+ ## Examples
124
+
125
+ ### Example 1: [Rating = 5]
126
+ Input: [exact input]
127
+ Output: [exact output]
128
+ Rating: 5
129
+ Explanation: [why this is a 5]
130
+
131
+ ### Example 2: [Rating = 2]
132
+ Input: [exact input]
133
+ Output: [exact output]
134
+ Rating: 2
135
+ Explanation: [why this is a 2]
136
+
137
+ [Include at least 2 examples per rating level, covering edge cases]
138
+
139
+ ## Edge Cases
140
+ - If the output is [ambiguous case]: [instruction]
141
+ - If the input is [unusual case]: [instruction]
142
+
143
+ ## Common Mistakes
144
+ - Don't [common annotator error]
145
+ - Don't let [bias] influence your rating
146
+ ```
147
+
148
+ ### Pilot Testing
149
+
150
+ **Always run a pilot** before the full study:
151
+ 1. 3-5 annotators, 20-30 items
152
+ 2. Compute agreement metrics
153
+ 3. Discuss disagreements in group session
154
+ 4. Revise guidelines based on confusion points
155
+ 5. Run second pilot if agreement was poor (<0.40 kappa)
156
+
157
+ ---
158
+
159
+ ## Platforms and Recruitment
160
+
161
+ | Platform | Best For | Cost | Quality |
162
+ |----------|----------|------|---------|
163
+ | **Prolific** | General annotation, surveys | $8-15/hr | High (academic-focused pool) |
164
+ | **Amazon MTurk** | Large-scale simple tasks | $5-12/hr | Variable (needs strong QC) |
165
+ | **Surge AI** | NLP-specific annotation | $15-25/hr | Very high (trained annotators) |
166
+ | **Scale AI** | Production-quality labeling | Varies | High (managed workforce) |
167
+ | **Internal team** | Domain expertise required | Varies | Highest for specialized tasks |
168
+ | **Upwork/contractors** | Long-term annotation projects | $10-30/hr | Depends on hiring |
169
+
170
+ **Fair compensation**: Always pay at least the equivalent of local minimum wage for the annotator's location. Many conferences (ACL in particular) now ask about annotator compensation. Paying below minimum wage is an ethics risk.
171
+
172
+ **Prolific setup (recommended for most ML papers):**
173
+ 1. Create study on prolific.co
174
+ 2. Set prescreening filters (language, country, approval rate >95%)
175
+ 3. Estimate time per task from pilot → set fair payment
176
+ 4. Use Prolific's built-in attention checks or add your own
177
+ 5. Collect Prolific IDs for quality tracking (but don't share in paper)
178
+
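Step 3 of the setup (estimating time per task from the pilot and setting payment) comes down to simple arithmetic. The sketch below assumes the median pilot time is a reasonable basis and rounds up to whole cents; the function name, the pilot timings, and the $12/hour target are hypothetical.

```python
import math
from statistics import median

def payment_per_task_cents(pilot_seconds, hourly_rate_usd):
    """
    Per-task payment (in cents) such that an annotator working at the
    median pilot speed earns at least hourly_rate_usd per hour.
    """
    median_seconds = median(pilot_seconds)
    cents = hourly_rate_usd * 100 * median_seconds / 3600
    return math.ceil(cents)  # round up so pay never falls below the target

# Hypothetical pilot timings (seconds per task) and target rate
pilot_times = [70, 95, 80, 120, 90]
print(payment_per_task_cents(pilot_times, hourly_rate_usd=12))  # 30
```

If the pilot shows a long tail of slow but careful annotators, basing the estimate on a higher percentile than the median is a more conservative choice.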
179
+ ---
180
+
181
+ ## Quality Control
182
+
183
+ ### Attention Checks
184
+
185
+ Include items where the correct answer is unambiguous:
186
+
187
+ ```python
188
+ # Types of attention checks
189
+ attention_checks = {
190
+ "instructed_response": "For this item, please select 'Strongly Agree' regardless of content.",
191
+ "obvious_quality": "Rate this clearly ungrammatical text: 'The cat dog house green yesterday.'", # Should get lowest score
192
+ "gold_standard": "Items where expert consensus exists (pre-annotated by authors)",
193
+ "trap_question": "What color is the sky on a clear day? (embedded in annotation interface)"
194
+ }
195
+
196
+ # Recommended: 10-15% of total items should be checks
197
+ # Exclusion criterion: fail 2+ attention checks → exclude annotator
198
+ ```
199
+
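The two rules in the comments above (10-15% of items are checks; fail 2+ checks and the annotator is excluded) can be sketched as below. All function names and the data layout are hypothetical; the sketch assumes check results are recorded as pass/fail booleans per annotator.

```python
import math
import random

def n_checks_needed(n_real_items, check_fraction=0.10):
    """Number of checks so that checks make up check_fraction of ALL items shown."""
    return math.ceil(check_fraction * n_real_items / (1 - check_fraction))

def insert_attention_checks(items, checks, seed=0):
    """Mix real items and attention checks into one shuffled task list."""
    rng = random.Random(seed)
    tasks = [{"payload": item, "is_check": False} for item in items]
    tasks += [{"payload": check, "is_check": True} for check in checks]
    rng.shuffle(tasks)
    return tasks

def annotators_to_exclude(check_results, min_failures=2):
    """
    check_results: {annotator_id: [True if check passed, else False, ...]}
    Returns annotators who failed min_failures or more checks.
    """
    return sorted(
        annotator for annotator, passed in check_results.items()
        if sum(not p for p in passed) >= min_failures
    )

# Hypothetical usage
print(n_checks_needed(100, 0.10))  # 12 checks -> 12 of 112 items shown
results = {"ann1": [True, True, True], "ann2": [False, True, False]}
print(annotators_to_exclude(results))  # ['ann2']
```

Deciding the exclusion rule before collection starts, and stating it in the paper, avoids the appearance of removing annotators after seeing the results.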
200
+ ### Annotator Qualification
201
+
202
+ For tasks requiring expertise:
203
+
204
+ ```
205
+ Qualification Task Design:
206
+ 1. Create a set of 20-30 items with known-correct labels
207
+ 2. Require annotators to complete this before the main task
208
+ 3. Set threshold: ≥80% agreement with gold labels to qualify
209
+ 4. Record qualification scores for reporting
210
+ ```
211
+
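The qualification steps above can be scored with a few lines of code. This is a sketch under stated assumptions: responses and gold labels are stored as dictionaries keyed by item ID, unanswered items count as wrong, and `qualifies` is a hypothetical helper name.

```python
def qualifies(responses, gold, threshold=0.80):
    """
    responses: {item_id: label} given by the candidate annotator
    gold:      {item_id: label} known-correct labels
    Returns (passed, score). Items missing from responses count as wrong.
    """
    if not gold:
        raise ValueError("gold label set is empty")
    correct = sum(1 for item_id, label in gold.items()
                  if responses.get(item_id) == label)
    score = correct / len(gold)
    return score >= threshold, score

# Hypothetical 20-item qualification set: candidate gets 16 right
gold = {i: "good" for i in range(20)}
responses = {i: ("good" if i < 16 else "bad") for i in range(20)}
passed, score = qualifies(responses, gold)
print(passed, score)  # True 0.8
```

Storing the returned score per annotator covers step 4, so qualification results can be reported later.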
212
+ ### Monitoring During Collection
213
+
214
+ ```python
+ # Real-time quality monitoring
+ # Expects a pandas DataFrame with columns: annotator, rating, time_seconds,
+ # is_attention_check (bool), gold_rating
+ def monitor_quality(annotations):
+     """Check for annotation quality issues during collection."""
+     issues = []
+
+     # 1. Check for straight-lining (same answer for everything)
+     for annotator_id, items in annotations.groupby('annotator'):
+         if items['rating'].nunique() <= 1:
+             issues.append(f"Annotator {annotator_id}: straight-lining detected")
+
+     # 2. Check time per item (too fast = not reading)
+     median_time = annotations['time_seconds'].median()
+     annotator_medians = annotations.groupby('annotator')['time_seconds'].median()
+     for ann_id, ann_time in annotator_medians.items():
+         if ann_time < median_time * 0.3:
+             issues.append(f"Annotator {ann_id}: suspiciously fast ({ann_time:.0f}s vs median {median_time:.0f}s)")
+
+     # 3. Check attention check performance
+     checks = annotations[annotations['is_attention_check']]
+     for ann_id, items in checks.groupby('annotator'):
+         accuracy = (items['rating'] == items['gold_rating']).mean()
+         if accuracy < 0.80:
+             issues.append(f"Annotator {ann_id}: failing attention checks ({accuracy:.0%})")
+
+     return issues
240
+ ```
241
+
242
+ ---
243
+
244
+ ## Agreement Metrics
245
+
246
+ ### Which Metric to Use
247
+
248
+ | Metric | When to Use | Interpretation |
249
+ |--------|-------------|---------------|
250
+ | **Cohen's kappa (κ)** | Exactly 2 annotators, categorical | Chance-corrected agreement |
251
+ | **Fleiss' kappa** | 3+ annotators, all rate same items, categorical | Multi-annotator extension of Cohen's |
252
+ | **Krippendorff's alpha (α)** | Any number of annotators, handles missing data | Most general; recommended default |
253
+ | **ICC (Intraclass Correlation)** | Continuous or interval-scale ratings (e.g., Likert treated as interval) | Consistency among raters |
254
+ | **Percent agreement** | Reporting alongside kappa/alpha | Raw agreement (not chance-corrected) |
255
+ | **Kendall's W** | Rankings | Concordance among rankers |
256
+
257
+ **Always report at least two**: one chance-corrected metric (kappa or alpha) AND raw percent agreement.
258
+
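The implementation later in this guide covers Krippendorff's alpha and pairwise Cohen's kappa; Fleiss' kappa from the table can be computed directly from its definition. The following is a dependency-free sketch, assuming every item received the same number of categorical ratings. The function name and the example labels are illustrative.

```python
from collections import Counter

def fleiss_kappa(ratings_per_item):
    """
    ratings_per_item: list of lists; each inner list holds the categorical
    labels one item received. Every item must have the same number of ratings.
    """
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings_per_item):
        raise ValueError("need the same number (>= 2) of ratings for every item")

    category_totals = Counter()
    agreement_sum = 0.0
    for ratings in ratings_per_item:
        counts = Counter(ratings)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    p_observed = agreement_sum / n_items
    total_ratings = n_items * n_raters
    # Chance agreement from the overall category proportions
    p_expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical binary judgments from 3 annotators on 4 items
ratings = [["y", "y", "n"], ["y", "n", "n"], ["y", "y", "y"], ["n", "n", "n"]]
print(round(fleiss_kappa(ratings), 3))  # 0.333
```

In this example the mean pairwise agreement is about 0.67 while kappa is 0.33, which shows why the raw and chance-corrected figures should be reported together.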
259
+ ### Interpretation Guide
260
+
261
+ | Value (Krippendorff's α / Cohen's κ) | Agreement | Guidance |
262
+ |-------|-------------------------------|---------|
263
+ | > 0.80 | Excellent agreement | Reliable for most purposes |
264
+ | 0.67 - 0.80 | Good agreement | Acceptable for most ML papers |
265
+ | 0.40 - 0.67 | Moderate agreement | Borderline; discuss in paper |
266
+ | < 0.40 | Poor agreement | Revise guidelines and redo annotation |
267
+
268
+ **Note**: Krippendorff recommends α > 0.667 as minimum for tentative conclusions. NLP tasks with subjective judgments (fluency, helpfulness) typically achieve 0.40-0.70.
269
+
270
+ ### Implementation
271
+
272
+ ```python
+ import numpy as np
+ from sklearn.metrics import cohen_kappa_score
+ import krippendorff  # pip install krippendorff
+
+ def compute_agreement(annotations_matrix):
+     """
+     annotations_matrix: shape (n_items, n_annotators)
+     Values: ratings (int or float). Use np.nan for missing.
+     """
+     results = {}
+
+     # Krippendorff's alpha (handles missing data, any number of annotators)
+     results['krippendorff_alpha'] = krippendorff.alpha(
+         annotations_matrix.T,  # krippendorff expects (annotators, items)
+         level_of_measurement='ordinal'  # or 'nominal', 'interval', 'ratio'
+     )
+
+     # Pairwise Cohen's kappa (for 2 annotators at a time)
+     n_annotators = annotations_matrix.shape[1]
+     kappas = []
+     for i in range(n_annotators):
+         for j in range(i + 1, n_annotators):
+             mask = ~np.isnan(annotations_matrix[:, i]) & ~np.isnan(annotations_matrix[:, j])
+             if mask.sum() > 0:
+                 k = cohen_kappa_score(
+                     annotations_matrix[mask, i].astype(int),
+                     annotations_matrix[mask, j].astype(int)
+                 )
+                 kappas.append(k)
+     results['mean_pairwise_kappa'] = np.mean(kappas) if kappas else None
+
+     # Raw percent agreement (share of items on which all annotators agree)
+     agree_count = 0
+     total_count = 0
+     for item in range(annotations_matrix.shape[0]):
+         ratings = annotations_matrix[item, ~np.isnan(annotations_matrix[item, :])]
+         if len(ratings) >= 2:
+             # All annotators agree
+             if len(set(ratings.astype(int))) == 1:
+                 agree_count += 1
+             total_count += 1
+     results['percent_agreement'] = agree_count / total_count if total_count > 0 else None
+
+     return results
317
+ ```
318
+
319
+ ---
320
+
321
+ ## Statistical Analysis for Human Eval
322
+
323
+ ### Pairwise Comparisons
324
+
325
+ ```python
+ import numpy as np
+ from scipy import stats
+
+ def analyze_pairwise(wins_a, wins_b, ties=0):
+     """
+     Analyze pairwise comparison results.
+     wins_a: number of times system A won
+     wins_b: number of times system B won
+     ties: number of ties (excluded from sign test)
+     """
+     n = wins_a + wins_b  # exclude ties
+     if n == 0:
+         raise ValueError("Need at least one non-tied comparison")
+
+     # Sign test (exact binomial). stats.binomtest (SciPy >= 1.7) replaces
+     # stats.binom_test, which was removed in SciPy 1.12.
+     p_value = stats.binomtest(wins_a, n, 0.5, alternative='two-sided').pvalue
+
+     # Win rate with 95% CI (Wilson score interval)
+     win_rate = wins_a / n
+     z = 1.96
+     denominator = 1 + z**2 / n
+     center = (win_rate + z**2 / (2 * n)) / denominator
+     margin = z * np.sqrt((win_rate * (1 - win_rate) + z**2 / (4 * n)) / n) / denominator
+     ci_lower = center - margin
+     ci_upper = center + margin
+
+     return {
+         'win_rate_a': win_rate,
+         'win_rate_b': 1 - win_rate,
+         'p_value': p_value,
+         'ci_95': (ci_lower, ci_upper),
+         'significant': p_value < 0.05,
+         'n_comparisons': n,
+         'ties': ties,
+     }
358
+ ```
359
+
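When several annotators judge the same pair, their votes are not independent observations, so feeding every individual vote into the sign test overstates the sample size. One common option, sketched here as an assumption rather than the guide's prescribed method, is to collapse each pair to a single outcome first. The function name and the vote labels `'A'`, `'B'`, `'tie'` are hypothetical.

```python
from collections import Counter

def per_pair_outcomes(votes_per_pair):
    """
    votes_per_pair: list of lists, one inner list per compared pair,
                    each vote being 'A', 'B', or 'tie'.
    Returns (wins_a, wins_b, ties), counting each pair exactly once.
    A pair goes to whichever system received more votes; equal counts are a tie.
    """
    wins_a = wins_b = ties = 0
    for votes in votes_per_pair:
        counts = Counter(votes)
        if counts["A"] > counts["B"]:
            wins_a += 1
        elif counts["B"] > counts["A"]:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties

# Hypothetical votes from 3 annotators on 4 pairs
votes = [["A", "A", "B"], ["B", "B", "B"], ["A", "tie", "B"], ["A", "A", "A"]]
print(per_pair_outcomes(votes))  # (2, 1, 1)
```

The resulting counts can be passed to `analyze_pairwise(wins_a, wins_b, ties)`, so that `n_comparisons` reflects the number of pairs rather than the number of votes.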
360
+ ### Likert Scale Analysis
361
+
362
+ ```python
+ import numpy as np
+ from scipy import stats
+
+ def analyze_likert(ratings_a, ratings_b):
+     """Compare Likert ratings between two systems (paired)."""
+     # Wilcoxon signed-rank test (non-parametric, paired)
+     stat, p_value = stats.wilcoxon(ratings_a, ratings_b, alternative='two-sided')
+
+     # Effect size: magnitude of the matched-pairs rank-biserial correlation.
+     # For the two-sided test, `stat` is the smaller of the two signed-rank sums,
+     # and zero differences are dropped, so n counts non-zero differences only.
+     # The rank sums total n(n+1)/2, hence the factor of 4.
+     diffs = np.asarray(ratings_a) - np.asarray(ratings_b)
+     n = np.count_nonzero(diffs)
+     r = 1 - (4 * stat) / (n * (n + 1))
+
+     return {
+         'mean_a': np.mean(ratings_a),
+         'mean_b': np.mean(ratings_b),
+         'std_a': np.std(ratings_a),
+         'std_b': np.std(ratings_b),
+         'wilcoxon_stat': stat,
+         'p_value': p_value,
+         'effect_size_r': r,  # direction follows the sign of mean_a - mean_b
+         'significant': p_value < 0.05,
+     }
382
+ ```
383
+
384
+ ### Multiple Comparisons Correction
385
+
386
+ When comparing more than two systems:
387
+
388
+ ```python
389
+ from statsmodels.stats.multitest import multipletests
390
+
391
+ # After computing p-values for all pairs
392
+ p_values = [0.03, 0.001, 0.08, 0.04, 0.15, 0.002]
393
+ rejected, corrected_p, _, _ = multipletests(p_values, method='holm')
394
+ # Use corrected p-values in your paper
395
+ ```
396
+
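To make clear what the Holm correction does to the p-values above, here is a dependency-free sketch of the step-down procedure. It is meant as an illustration of the arithmetic, not a replacement for `multipletests`; the function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """
    Holm step-down adjustment.
    Sort p-values ascending, multiply the k-th smallest (k = 0, 1, ...) by
    (m - k), cap at 1, and enforce that adjusted values never decrease.
    Returns adjusted p-values in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

p_values = [0.03, 0.001, 0.08, 0.04, 0.15, 0.002]
adjusted = holm_adjust(p_values)
rejected = [p < 0.05 for p in adjusted]
print([round(p, 3) for p in adjusted])
print(rejected)  # only the 0.001 and 0.002 comparisons remain significant
```

Note that 0.03 and 0.04, which look significant in isolation, no longer pass the 0.05 threshold once six comparisons are accounted for.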
397
+ ---
398
+
399
+ ## Reporting Requirements
400
+
401
+ Reviewers at NLP venues (ACL, EMNLP, NAACL) check for all of these. ML venues (NeurIPS, ICML) increasingly expect them too.
402
+
403
+ ### Mandatory Reporting
404
+
405
+ ```latex
406
+ % In your paper's human evaluation section:
407
+ \paragraph{Annotators.} We recruited [N] annotators via [platform].
408
+ [Describe qualifications or screening.] Annotators were paid
409
+ \$[X]/hour, above the [country] minimum wage.
410
+
411
+ \paragraph{Agreement.} Inter-annotator agreement was [metric] = [value]
412
+ (Krippendorff's $\alpha$ = [value]; raw agreement = [value]\%).
413
+ [If low: explain why the task is subjective and how you handle disagreements.]
414
+
415
+ \paragraph{Evaluation Protocol.} Each [item type] was rated by [N]
416
+ annotators on a [scale description]. We collected [total] annotations
417
+ across [N items]. [Describe randomization and blinding.]
418
+ ```
419
+
420
+ ### What Goes in the Appendix
421
+
422
+ ```
423
+ Appendix: Human Evaluation Details
424
+ - Full annotation guidelines (verbatim)
425
+ - Screenshot of annotation interface
426
+ - Qualification task details and threshold
427
+ - Attention check items and failure rates
428
+ - Per-annotator agreement breakdown
429
+ - Full results table (not just averages)
430
+ - Compensation calculation
431
+ - IRB approval number (if applicable)
432
+ ```
433
+
434
+ ---
435
+
436
+ ## IRB and Ethics
437
+
438
+ ### When IRB Approval Is Needed
439
+
440
+ | Situation | IRB Required? |
441
+ |-----------|---------------|
442
+ | Crowdworkers rating text quality | **Usually no** (not "human subjects research" at most institutions) |
443
+ | User study with real users | **Yes** at most US/EU institutions |
444
+ | Collecting personal information | **Yes** |
445
+ | Studying annotator behavior/cognition | **Yes** (they become the subject) |
446
+ | Using existing annotated data | **Usually no** (secondary data analysis) |
447
+
448
+ **Check your institution's policy.** The definition of "human subjects research" varies. When in doubt, submit an IRB protocol — the review is often fast for minimal-risk studies.
449
+
450
+ ### Ethics Checklist for Human Evaluation
451
+
452
+ ```
453
+ - [ ] Annotators informed about task purpose (not deceptive)
454
+ - [ ] Annotators can withdraw at any time without penalty
455
+ - [ ] No personally identifiable information collected beyond platform ID
456
+ - [ ] Content being evaluated does not expose annotators to harm
457
+ (if it does: content warnings + opt-out + higher compensation)
458
+ - [ ] Fair compensation (>= equivalent local minimum wage)
459
+ - [ ] Data stored securely, access limited to research team
460
+ - [ ] IRB approval obtained if required by institution
461
+ ```
462
+
463
+ ---
464
+
465
+ ## Common Pitfalls
466
+
467
+ | Pitfall | Problem | Fix |
468
+ |---------|---------|-----|
469
+ | Too few annotators (1-2) | One annotator allows no agreement metric; two allow no majority vote and give unstable estimates | Use at least 3 annotators per item for subjective judgments |
470
+ | No attention checks | Can't detect low-quality annotations | Include 10-15% attention checks |
471
+ | Not reporting compensation | Reviewers flag as ethics concern | Always report hourly rate |
472
+ | Using only automated metrics for generation | Reviewers will ask for human eval | Add at least pairwise comparison |
473
+ | Not piloting guidelines | Low agreement, wasted budget | Always pilot with 3-5 people first |
474
+ | Reporting only averages | Hides annotator disagreement | Report distribution and agreement |
475
+ | Not controlling for order/position | Position bias inflates results | Randomize presentation order |
476
+ | Conflating annotator agreement with ground truth | High agreement doesn't mean correct | Validate against expert judgments |