@rfxlamia/skillkit 1.0.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (269)
  1. package/agents/agents/creative-copywriter.md +212 -0
  2. package/agents/agents/dario-amodei.md +135 -0
  3. package/agents/agents/doc-simplifier.md +63 -0
  4. package/agents/agents/kotlin-pro.md +433 -0
  5. package/agents/agents/red-team.md +136 -0
  6. package/agents/agents/sam-altman.md +121 -0
  7. package/agents/agents/seo-manager.md +184 -0
  8. package/package.json +7 -2
  9. package/skills/quick-spec/tests/__pycache__/test_skill.cpython-314-pytest-9.0.2.pyc +0 -0
  10. package/skills/skillkit/.claude/settings.local.json +7 -0
  11. package/skills/skillkit/scripts/__pycache__/decision_helper.cpython-314.pyc +0 -0
  12. package/skills/skillkit/scripts/__pycache__/quick_validate.cpython-312.pyc +0 -0
  13. package/skills/skillkit/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
  14. package/skills/skillkit/scripts/__pycache__/test_generator.cpython-314-pytest-9.0.2.pyc +0 -0
  15. package/skills/skillkit/scripts/utils/__pycache__/__init__.cpython-312.pyc +0 -0
  16. package/skills/skillkit/scripts/utils/__pycache__/__init__.cpython-314.pyc +0 -0
  17. package/skills/skillkit/scripts/utils/__pycache__/budget_tracker.cpython-312.pyc +0 -0
  18. package/skills/skillkit/scripts/utils/__pycache__/budget_tracker.cpython-314.pyc +0 -0
  19. package/skills/skillkit/scripts/utils/__pycache__/output_formatter.cpython-312.pyc +0 -0
  20. package/skills/skillkit/scripts/utils/__pycache__/output_formatter.cpython-314.pyc +0 -0
  21. package/skills/skillkit/scripts/utils/__pycache__/reference_validator.cpython-312.pyc +0 -0
  22. package/skills/skillkit/scripts/utils/__pycache__/reference_validator.cpython-314.pyc +0 -0
  23. package/skills/skillkit-help/SKILL.md +81 -0
  24. package/skills/skillkit-help/knowledge/application/09-case-studies.md +257 -0
  25. package/skills/skillkit-help/knowledge/application/12-testing-and-validation.md +276 -0
  26. package/skills/skillkit-help/knowledge/foundation/01-why-skills-exist.md +246 -0
  27. package/skills/skillkit-help/knowledge/foundation/02-skills-vs-subagents-comparison.md +312 -0
  28. package/skills/skillkit-help/knowledge/foundation/03-skills-vs-subagents-decision-tree.md +346 -0
  29. package/skills/skillkit-help/knowledge/foundation/06-platform-constraints.md +237 -0
  30. package/skills/skillkit-help/knowledge/foundation/08-when-not-to-use-skills.md +270 -0
  31. package/skills/skillkit-help/template/SKILL.md +52 -0
  32. package/skills/skills/adversarial-review/SKILL.md +219 -0
  33. package/skills/skills/baby-education/SKILL.md +260 -0
  34. package/skills/skills/baby-education/references/advanced-techniques.md +323 -0
  35. package/skills/skills/baby-education/references/transformations.md +345 -0
  36. package/skills/skills/been-there-done-that/SKILL.md +455 -0
  37. package/skills/skills/been-there-done-that/references/analysis-patterns.md +162 -0
  38. package/skills/skills/been-there-done-that/references/git-commands.md +132 -0
  39. package/skills/skills/been-there-done-that/references/tree-insertion-logic.md +145 -0
  40. package/skills/skills/coolhunter/SKILL.md +270 -0
  41. package/skills/skills/coolhunter/assets/elicitation-methods.csv +51 -0
  42. package/skills/skills/coolhunter/knowledge/elicitation-methods.md +312 -0
  43. package/skills/skills/coolhunter/references/workflow-execution.md +238 -0
  44. package/skills/skills/coolhunter/workflow-plan-coolhunter.md +232 -0
  45. package/skills/skills/creative-copywriting/SKILL.md +324 -0
  46. package/skills/skills/creative-copywriting/databases/README.md +60 -0
  47. package/skills/skills/creative-copywriting/databases/carousel-structures.csv +16 -0
  48. package/skills/skills/creative-copywriting/databases/emotional-arcs.csv +11 -0
  49. package/skills/skills/creative-copywriting/databases/hook-formulas.csv +51 -0
  50. package/skills/skills/creative-copywriting/databases/power-words.csv +201 -0
  51. package/skills/skills/creative-copywriting/databases/psychological-triggers.csv +21 -0
  52. package/skills/skills/creative-copywriting/databases/read-more-patterns.csv +26 -0
  53. package/skills/skills/creative-copywriting/databases/swipe-triggers.csv +31 -0
  54. package/skills/skills/creative-copywriting/references/carousel-psychology.md +223 -0
  55. package/skills/skills/creative-copywriting/references/hook-anatomy.md +169 -0
  56. package/skills/skills/creative-copywriting/references/power-word-science.md +134 -0
  57. package/skills/skills/creative-copywriting/references/storytelling-frameworks.md +157 -0
  58. package/skills/skills/diverse-content-gen/SKILL.md +201 -0
  59. package/skills/skills/diverse-content-gen/references/advanced-techniques.md +320 -0
  60. package/skills/skills/diverse-content-gen/references/research-findings.md +379 -0
  61. package/skills/skills/diverse-content-gen/references/task-workflows.md +241 -0
  62. package/skills/skills/diverse-content-gen/references/tool-integration.md +419 -0
  63. package/skills/skills/diverse-content-gen/references/troubleshooting.md +426 -0
  64. package/skills/skills/diverse-content-gen/references/vs-core-technique.md +240 -0
  65. package/skills/skills/framework-critical-thinking/SKILL.md +220 -0
  66. package/skills/skills/framework-critical-thinking/references/bias_detector.md +375 -0
  67. package/skills/skills/framework-critical-thinking/references/fallback_handler.md +239 -0
  68. package/skills/skills/framework-critical-thinking/references/memory_curator.md +161 -0
  69. package/skills/skills/framework-critical-thinking/references/metacognitive_monitor.md +297 -0
  70. package/skills/skills/framework-critical-thinking/references/producer_critic_orchestrator.md +333 -0
  71. package/skills/skills/framework-critical-thinking/references/reasoning_router.md +235 -0
  72. package/skills/skills/framework-critical-thinking/references/reasoning_validator.md +97 -0
  73. package/skills/skills/framework-critical-thinking/references/reflection_trigger.md +78 -0
  74. package/skills/skills/framework-critical-thinking/references/self_verification.md +388 -0
  75. package/skills/skills/framework-critical-thinking/references/uncertainty_quantifier.md +207 -0
  76. package/skills/skills/framework-initiative/SKILL.md +231 -0
  77. package/skills/skills/framework-initiative/references/examples.md +150 -0
  78. package/skills/skills/framework-initiative/references/impact-analysis.md +157 -0
  79. package/skills/skills/framework-initiative/references/intent-patterns.md +145 -0
  80. package/skills/skills/framework-initiative/references/star-framework.md +165 -0
  81. package/skills/skills/humanize-docs/SKILL.md +203 -0
  82. package/skills/skills/humanize-docs/references/advanced-techniques.md +13 -0
  83. package/skills/skills/humanize-docs/references/core-transformations.md +368 -0
  84. package/skills/skills/humanize-docs/references/detection-patterns.md +400 -0
  85. package/skills/skills/humanize-docs/references/examples-gallery.md +374 -0
  86. package/skills/skills/imagine/SKILL.md +190 -0
  87. package/skills/skills/imagine/references/artstyle-corporate-memphis.md +625 -0
  88. package/skills/skills/imagine/references/artstyle-crewdson-hyperrealism.md +295 -0
  89. package/skills/skills/imagine/references/artstyle-iphone-social-media.md +426 -0
  90. package/skills/skills/imagine/references/artstyle-sciencesaru.md +276 -0
  91. package/skills/skills/pre-deploy-checklist/README.md +26 -0
  92. package/skills/skills/pre-deploy-checklist/SKILL.md +153 -0
  93. package/skills/skills/pre-deploy-checklist/references/checklist-categories.md +174 -0
  94. package/skills/skills/pre-deploy-checklist/references/domain-prompts.md +216 -0
  95. package/skills/skills/prompt-engineering/SKILL.md +209 -0
  96. package/skills/skills/prompt-engineering/references/advanced-combinations.md +444 -0
  97. package/skills/skills/prompt-engineering/references/chain-of-thought.md +140 -0
  98. package/skills/skills/prompt-engineering/references/decision_matrix.md +220 -0
  99. package/skills/skills/prompt-engineering/references/few-shot.md +346 -0
  100. package/skills/skills/prompt-engineering/references/json-format.md +270 -0
  101. package/skills/skills/prompt-engineering/references/natural-language.md +420 -0
  102. package/skills/skills/prompt-engineering/references/pitfalls.md +365 -0
  103. package/skills/skills/prompt-engineering/references/prompt-chaining.md +498 -0
  104. package/skills/skills/prompt-engineering/references/react.md +108 -0
  105. package/skills/skills/prompt-engineering/references/self-consistency.md +322 -0
  106. package/skills/skills/prompt-engineering/references/tree-of-thoughts.md +386 -0
  107. package/skills/skills/prompt-engineering/references/xml-format.md +220 -0
  108. package/skills/skills/prompt-engineering/references/yaml-format.md +488 -0
  109. package/skills/skills/prompt-engineering/references/zero-shot.md +74 -0
  110. package/skills/skills/quick-spec/SKILL.md +280 -0
  111. package/skills/skills/quick-spec/assets/tech-spec-template.md +74 -0
  112. package/skills/skills/quick-spec/references/step-01-understand.md +189 -0
  113. package/skills/skills/quick-spec/references/step-02-investigate.md +144 -0
  114. package/skills/skills/quick-spec/references/step-03-generate.md +128 -0
  115. package/skills/skills/quick-spec/references/step-04-review.md +173 -0
  116. package/skills/skills/quick-spec/tests/__pycache__/test_skill.cpython-314-pytest-9.0.2.pyc +0 -0
  117. package/skills/skills/quick-spec/tests/test_scenarios.md +83 -0
  118. package/skills/skills/quick-spec/tests/test_skill.py +136 -0
  119. package/skills/skills/readme-expert/SKILL.md +538 -0
  120. package/skills/skills/readme-expert/knowledge/INDEX.md +192 -0
  121. package/skills/skills/readme-expert/knowledge/application/quality-standards.md +470 -0
  122. package/skills/skills/readme-expert/knowledge/application/script-executor.md +604 -0
  123. package/skills/skills/readme-expert/knowledge/application/template-library.md +822 -0
  124. package/skills/skills/readme-expert/knowledge/foundation/codebase-scanner.md +361 -0
  125. package/skills/skills/readme-expert/knowledge/foundation/validation-checklist.md +481 -0
  126. package/skills/skills/red-teaming/SKILL.md +321 -0
  127. package/skills/skills/red-teaming/references/ai-llm-redteam.md +517 -0
  128. package/skills/skills/red-teaming/references/attack-techniques.md +410 -0
  129. package/skills/skills/red-teaming/references/cybersecurity-redteam.md +383 -0
  130. package/skills/skills/red-teaming/references/tools-frameworks.md +446 -0
  131. package/skills/skills/releasing/.skillkit-mode +1 -0
  132. package/skills/skills/releasing/SKILL.md +225 -0
  133. package/skills/skills/releasing/references/version-detection.md +108 -0
  134. package/skills/skills/screenwriter/SKILL.md +273 -0
  135. package/skills/skills/screenwriter/references/advanced-techniques.md +216 -0
  136. package/skills/skills/screenwriter/references/pipeline-integration.md +266 -0
  137. package/skills/skills/skillkit/.claude/settings.local.json +7 -0
  138. package/skills/skills/skillkit/.claude-plugin/plugin.json +27 -0
  139. package/skills/skills/skillkit/CHANGELOG.md +484 -0
  140. package/skills/skills/skillkit/SKILL.md +511 -0
  141. package/skills/skills/skillkit/commands/skillkit.md +6 -0
  142. package/skills/skills/skillkit/commands/validate-plan.md +6 -0
  143. package/skills/skills/skillkit/commands/verify.md +6 -0
  144. package/skills/skills/skillkit/knowledge/INDEX.md +352 -0
  145. package/skills/skills/skillkit/knowledge/application/09-case-studies.md +257 -0
  146. package/skills/skills/skillkit/knowledge/application/10-technical-architecture.md +324 -0
  147. package/skills/skills/skillkit/knowledge/application/11-adoption-strategy.md +267 -0
  148. package/skills/skills/skillkit/knowledge/application/12-testing-and-validation.md +276 -0
  149. package/skills/skills/skillkit/knowledge/application/13-competitive-landscape.md +198 -0
  150. package/skills/skills/skillkit/knowledge/foundation/01-why-skills-exist.md +246 -0
  151. package/skills/skills/skillkit/knowledge/foundation/02-skills-vs-subagents-comparison.md +312 -0
  152. package/skills/skills/skillkit/knowledge/foundation/03-skills-vs-subagents-decision-tree.md +346 -0
  153. package/skills/skills/skillkit/knowledge/foundation/04-hybrid-patterns.md +308 -0
  154. package/skills/skills/skillkit/knowledge/foundation/05-token-economics.md +275 -0
  155. package/skills/skills/skillkit/knowledge/foundation/06-platform-constraints.md +237 -0
  156. package/skills/skills/skillkit/knowledge/foundation/07-security-concerns.md +322 -0
  157. package/skills/skills/skillkit/knowledge/foundation/08-when-not-to-use-skills.md +270 -0
  158. package/skills/skills/skillkit/knowledge/plugin-guide.md +614 -0
  159. package/skills/skills/skillkit/knowledge/tools/14-validation-tools-guide.md +150 -0
  160. package/skills/skills/skillkit/knowledge/tools/15-cost-tools-guide.md +157 -0
  161. package/skills/skills/skillkit/knowledge/tools/16-security-tools-guide.md +122 -0
  162. package/skills/skills/skillkit/knowledge/tools/17-pattern-tools-guide.md +161 -0
  163. package/skills/skills/skillkit/knowledge/tools/18-decision-helper-guide.md +243 -0
  164. package/skills/skills/skillkit/knowledge/tools/19-test-generator-guide.md +275 -0
  165. package/skills/skills/skillkit/knowledge/tools/20-split-skill-guide.md +149 -0
  166. package/skills/skills/skillkit/knowledge/tools/21-quality-scorer-guide.md +226 -0
  167. package/skills/skills/skillkit/knowledge/tools/22-migration-helper-guide.md +356 -0
  168. package/skills/skills/skillkit/knowledge/tools/23-subagent-creation-guide.md +448 -0
  169. package/skills/skills/skillkit/knowledge/tools/24-behavioral-testing-guide.md +122 -0
  170. package/skills/skills/skillkit/references/proposal-generation.md +982 -0
  171. package/skills/skills/skillkit/references/rationalization-catalog.md +75 -0
  172. package/skills/skills/skillkit/references/research-methodology.md +661 -0
  173. package/skills/skills/skillkit/references/section-2-full-creation-workflow.md +452 -0
  174. package/skills/skills/skillkit/references/section-3-validation-workflow-existing-skill.md +63 -0
  175. package/skills/skills/skillkit/references/section-4-decision-workflow-skills-vs-subagents.md +64 -0
  176. package/skills/skills/skillkit/references/section-5-migration-workflow-doc-to-skill.md +58 -0
  177. package/skills/skills/skillkit/references/section-6-subagent-creation-workflow.md +499 -0
  178. package/skills/skills/skillkit/references/section-7-knowledge-reference-map.md +72 -0
  179. package/skills/skills/skillkit/scripts/__pycache__/decision_helper.cpython-314.pyc +0 -0
  180. package/skills/skills/skillkit/scripts/__pycache__/quick_validate.cpython-312.pyc +0 -0
  181. package/skills/skills/skillkit/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
  182. package/skills/skills/skillkit/scripts/__pycache__/test_generator.cpython-314-pytest-9.0.2.pyc +0 -0
  183. package/skills/skills/skillkit/scripts/decision_helper.py +799 -0
  184. package/skills/skills/skillkit/scripts/init_skill.py +400 -0
  185. package/skills/skills/skillkit/scripts/init_subagent.py +231 -0
  186. package/skills/skills/skillkit/scripts/migration_helper.py +669 -0
  187. package/skills/skills/skillkit/scripts/package_skill.py +211 -0
  188. package/skills/skills/skillkit/scripts/pattern_detector.py +381 -0
  189. package/skills/skills/skillkit/scripts/pattern_detector_new.py +382 -0
  190. package/skills/skills/skillkit/scripts/pressure_tester.py +157 -0
  191. package/skills/skills/skillkit/scripts/quality_scorer.py +999 -0
  192. package/skills/skills/skillkit/scripts/quick_validate.py +100 -0
  193. package/skills/skills/skillkit/scripts/security_scanner.py +474 -0
  194. package/skills/skills/skillkit/scripts/split_skill.py +540 -0
  195. package/skills/skills/skillkit/scripts/test_generator.py +695 -0
  196. package/skills/skills/skillkit/scripts/token_estimator.py +493 -0
  197. package/skills/skills/skillkit/scripts/utils/__init__.py +49 -0
  198. package/skills/skills/skillkit/scripts/utils/__pycache__/__init__.cpython-312.pyc +0 -0
  199. package/skills/skills/skillkit/scripts/utils/__pycache__/__init__.cpython-314.pyc +0 -0
  200. package/skills/skills/skillkit/scripts/utils/__pycache__/budget_tracker.cpython-312.pyc +0 -0
  201. package/skills/skills/skillkit/scripts/utils/__pycache__/budget_tracker.cpython-314.pyc +0 -0
  202. package/skills/skills/skillkit/scripts/utils/__pycache__/output_formatter.cpython-312.pyc +0 -0
  203. package/skills/skills/skillkit/scripts/utils/__pycache__/output_formatter.cpython-314.pyc +0 -0
  204. package/skills/skills/skillkit/scripts/utils/__pycache__/reference_validator.cpython-312.pyc +0 -0
  205. package/skills/skills/skillkit/scripts/utils/__pycache__/reference_validator.cpython-314.pyc +0 -0
  206. package/skills/skills/skillkit/scripts/utils/budget_tracker.py +388 -0
  207. package/skills/skills/skillkit/scripts/utils/output_formatter.py +263 -0
  208. package/skills/skills/skillkit/scripts/utils/reference_validator.py +401 -0
  209. package/skills/skills/skillkit/scripts/validate_skill.py +594 -0
  210. package/skills/skills/skillkit/tests/test_behavioral.py +39 -0
  211. package/skills/skills/skillkit/tests/test_scenarios.md +83 -0
  212. package/skills/skills/skillkit/tests/test_skill.py +136 -0
  213. package/skills/skills/skillkit-help/SKILL.md +81 -0
  214. package/skills/skills/skillkit-help/knowledge/application/09-case-studies.md +257 -0
  215. package/skills/skills/skillkit-help/knowledge/application/12-testing-and-validation.md +276 -0
  216. package/skills/skills/skillkit-help/knowledge/foundation/01-why-skills-exist.md +246 -0
  217. package/skills/skills/skillkit-help/knowledge/foundation/02-skills-vs-subagents-comparison.md +312 -0
  218. package/skills/skills/skillkit-help/knowledge/foundation/03-skills-vs-subagents-decision-tree.md +346 -0
  219. package/skills/skills/skillkit-help/knowledge/foundation/06-platform-constraints.md +237 -0
  220. package/skills/skills/skillkit-help/knowledge/foundation/08-when-not-to-use-skills.md +270 -0
  221. package/skills/skills/skillkit-help/template/SKILL.md +52 -0
  222. package/skills/skills/social-media-seo/SKILL.md +278 -0
  223. package/skills/skills/social-media-seo/databases/caption-styles.csv +31 -0
  224. package/skills/skills/social-media-seo/databases/engagement-tactics.csv +16 -0
  225. package/skills/skills/social-media-seo/databases/hashtag-strategies.csv +21 -0
  226. package/skills/skills/social-media-seo/databases/hook-formulas.csv +26 -0
  227. package/skills/skills/social-media-seo/databases/keyword-clusters.csv +11 -0
  228. package/skills/skills/social-media-seo/databases/thread-structures.csv +26 -0
  229. package/skills/skills/social-media-seo/databases/viral-patterns.csv +21 -0
  230. package/skills/skills/social-media-seo/references/analytics-guide.md +321 -0
  231. package/skills/skills/social-media-seo/references/instagram-seo.md +235 -0
  232. package/skills/skills/social-media-seo/references/threads-seo.md +305 -0
  233. package/skills/skills/social-media-seo/references/x-twitter-seo.md +337 -0
  234. package/skills/skills/social-media-seo/scripts/query_database.py +191 -0
  235. package/skills/skills/storyteller/SKILL.md +241 -0
  236. package/skills/skills/storyteller/references/transformation-methodology.md +293 -0
  237. package/skills/skills/storyteller/references/visual-vocabulary.md +177 -0
  238. package/skills/skills/thread-pro/SKILL.md +162 -0
  239. package/skills/skills/thread-pro/anti-ai-patterns.md +120 -0
  240. package/skills/skills/thread-pro/hook-formulas.md +138 -0
  241. package/skills/skills/thread-pro/references/anti-ai-patterns.md +120 -0
  242. package/skills/skills/thread-pro/references/hook-formulas.md +138 -0
  243. package/skills/skills/thread-pro/references/thread-structures.md +240 -0
  244. package/skills/skills/thread-pro/references/voice-injection.md +130 -0
  245. package/skills/skills/thread-pro/thread-structures.md +240 -0
  246. package/skills/skills/thread-pro/voice-injection.md +130 -0
  247. package/skills/skills/tinkering/SKILL.md +251 -0
  248. package/skills/skills/tinkering/references/graduation-checklist.md +100 -0
  249. package/skills/skills/validate-plan/.skillkit-mode +1 -0
  250. package/skills/skills/validate-plan/SKILL.md +406 -0
  251. package/skills/skills/validate-plan/references/dry-principles.md +251 -0
  252. package/skills/skills/validate-plan/references/gap-analysis-guide.md +320 -0
  253. package/skills/skills/validate-plan/references/tdd-patterns.md +413 -0
  254. package/skills/skills/validate-plan/references/yagni-checklist.md +330 -0
  255. package/skills/skills/verify-before-ship/.skillkit-mode +1 -0
  256. package/skills/skills/verify-before-ship/SKILL.md +116 -0
  257. package/skills/skills/verify-before-ship/references/anti-rationalization.md +212 -0
  258. package/skills/skills/verify-before-ship/references/verification-gates.md +305 -0
  259. package/skills-manifest.json +8 -2
  260. package/src/banner.js +1 -1
  261. package/src/cli.js +15 -4
  262. package/src/install.js +45 -29
  263. package/src/install.test.js +75 -7
  264. package/src/picker.js +15 -4
  265. package/src/picker.test.js +36 -1
  266. package/src/scope.js +8 -39
  267. package/src/scope.test.js +9 -13
  268. package/src/tools.js +76 -0
  269. package/src/tools.test.js +80 -0
@@ -0,0 +1,320 @@
1
+ # Advanced VS Techniques
2
+
3
+ **Purpose:** Advanced VS methods for higher quality, better control, and production use
4
+
5
+ **Load when:** Basic VS doesn't meet quality needs, or user requests refinement/polish
6
+
7
+ **Prerequisite:** Familiarity with core VS technique (`vs-core-technique.md`)
8
+
9
+ ---
10
+
11
+ ## VS-CoT: Chain-of-Thought Enhancement
12
+
13
+ ### When to Use
14
+ - Complex creative tasks requiring reasoning
15
+ - When basic VS outputs lack coherence
16
+ - User requests "thoughtful" or "well-reasoned" variations
17
+
18
+ ### How It Works
19
+
20
+ Add a reasoning step **before** generating responses:
21
+
22
+ ```
23
+ Before generating responses, first think through:
24
+ 1. What are the different possible angles/perspectives for this request?
25
+ 2. What styles/tones would work? (humorous, professional, casual, poetic, etc.)
26
+ 3. What makes each response unique from the others?
27
+
28
+ Then generate {k} responses with probabilities as instructed.
29
+ ```
30
+
31
+ ### Complete VS-CoT Prompt Template
32
+
33
+ ```
34
+ [REASONING SECTION]
35
+ Before generating responses, think through:
36
+ 1. Different angles/perspectives for this request
37
+ 2. Appropriate styles/tones
38
+ 3. What makes each response unique
39
+
40
+ [STANDARD VS SECTION]
41
+ Generate {k} responses to the following user request. Each response should be approximately {target_words} words.
42
+
43
+ Return the responses in JSON format with the key: "responses" (list of dicts). Each dictionary must include:
44
+ • text: the response string only (no explanation or extra text)
45
+ • probability: the estimated probability from 0.0 to 1.0
46
+
47
+ Give ONLY the JSON object, with no explanations or extra text.
48
+
49
+ USER REQUEST:
50
+ {user_original_request}
51
+ ```
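A minimal sketch of how the VS-CoT template above might be filled in and how its JSON reply might be parsed. The names `build_vs_cot_prompt` and `parse_vs_response` are hypothetical and do not come from the package; the bracketed section labels are dropped, and the bullets are written with `-` for plain-text safety. Only the Python standard library is used, and no model call is made.

```python
import json

# Template text follows the VS-CoT prompt shown above; the placeholders
# {k}, {target_words} and {user_original_request} are the ones it names.
VS_COT_TEMPLATE = """Before generating responses, think through:
1. Different angles/perspectives for this request
2. Appropriate styles/tones
3. What makes each response unique

Generate {k} responses to the following user request. Each response should be approximately {target_words} words.

Return the responses in JSON format with the key: "responses" (list of dicts). Each dictionary must include:
- text: the response string only (no explanation or extra text)
- probability: the estimated probability from 0.0 to 1.0

Give ONLY the JSON object, with no explanations or extra text.

USER REQUEST:
{user_original_request}
"""


def build_vs_cot_prompt(user_request: str, k: int = 5, target_words: int = 100) -> str:
    """Fill the template (hypothetical helper, not part of the package)."""
    return VS_COT_TEMPLATE.format(
        k=k, target_words=target_words, user_original_request=user_request
    )


def parse_vs_response(raw: str) -> list:
    """Parse the model's JSON reply and validate each probability."""
    data = json.loads(raw)
    parsed = []
    for item in data["responses"]:
        probability = float(item["probability"])
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"probability out of range: {probability}")
        parsed.append({"text": str(item["text"]), "probability": probability})
    return parsed
```

Validating the probability range on the way in is a design choice of this sketch: a reply that breaks the requested format fails loudly instead of being passed on to the user.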
52
+
53
+ ### Research Results
54
+
55
+ **VS-CoT achieves:**
56
+ - **Highest quality+diversity balance** among all VS variants
57
+ - Creative writing: 25.8% diversity (vs. 21.9% for VS-Standard)
58
+ - Quality scores: Matches or exceeds baseline
59
+
60
+ **Best for:** Stories, essays, campaign briefs, product narratives
61
+
62
+ ---
63
+
64
+ ## VS-Multi: Multi-Turn Refinement
65
+
66
+ ### When to Use
67
+ - Production content that needs polish
68
+ - User says "good but needs refinement"
69
+ - Quality is priority, diversity is secondary
70
+
71
+ ### The VS-Multi Workflow
72
+
73
+ **3-round process combining diversity + quality:**
74
+
75
+ #### Round 1: Initial VS Generation (Diversity)
76
+ ```
77
+ Parameters: k=5-10, threshold=0.10
78
+ Goal: Cast wide net, explore diverse options
79
+ ```
80
+
81
+ **Agent action:** Generate 5-10 diverse candidates using standard VS
82
+
83
+ #### Round 2: User Selection + Refinement (Curation)
84
+ ```
85
+ Agent: "Which 2-3 options resonate most? I'll refine them."
86
+ User: Selects top candidates
87
+ ```
88
+
89
+ **Agent action:** For each selected candidate, generate 3 refined variations (k=3)
90
+
91
+ #### Round 3: Final Polish (Quality)
92
+ ```
93
+ Agent: Standard high-quality rewrite of user's final selection
94
+ No VS needed - just quality optimization
95
+ ```
96
+
97
+ **Agent action:** Traditional quality-focused generation on chosen option
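The three rounds can be sketched as a small driver function. This is an illustration under stated assumptions, not the package's implementation: `generate_vs`, `choose`, and `polish` are hypothetical callables standing in for the VS generation step, the user's selection, and the final quality rewrite, and the wording of the refinement request is invented for the example.

```python
from typing import Callable, List


def vs_multi(
    request: str,
    generate_vs: Callable[[str, int], List[str]],  # hypothetical: VS call returning k texts
    choose: Callable[[List[str]], List[str]],      # hypothetical: user selection step
    polish: Callable[[str], str],                  # hypothetical: non-VS quality rewrite
    k_initial: int = 5,
    k_refine: int = 3,
) -> str:
    # Round 1: cast a wide net with standard VS.
    candidates = generate_vs(request, k_initial)

    # Round 2: the user shortlists candidates; each one gets k_refine variations.
    shortlisted = choose(candidates)
    refined: List[str] = []
    for candidate in shortlisted:
        refined.extend(generate_vs(f"Refine this draft: {candidate}", k_refine))

    # Round 3: the user picks a final option, which is polished without VS.
    final_choice = choose(refined)
    if not final_choice:
        raise ValueError("no final selection was made")
    return polish(final_choice[0])
```

Passing the three steps in as callables keeps the sketch independent of any particular model client or user interface.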
98
+
99
+ ### VS-Multi Benefits
100
+
101
+ **Research shows:**
102
+ - Large models: **+5.0 quality points** vs. single-shot VS
103
+ - Combines diversity (Round 1) + quality (Rounds 2-3)
104
+ - User involvement ensures alignment with intent
105
+
106
+ **Time investment:** ~2-3× longer than standard VS, but worth it for production content
107
+
108
+ ---
109
+
110
+ ## Parameter Tuning for Advanced Control
111
+
112
+ ### Probability Threshold Tuning
113
+
114
+ **Purpose:** Control how far into the probability tail to sample
115
+
116
+ | Threshold | Effect | Use Case |
117
+ |-----------|--------|----------|
118
+ | **None** | Balanced diversity+quality | Standard brainstorming |
119
+ | **0.10** | Moderate tail sampling | More creative options |
120
+ | **0.01** | Deep tail sampling | Maximum creativity |
121
+ | **0.001** | Extreme tail sampling | Experimental/avant-garde |
122
+
123
+ **Prompt modification:**
124
+ ```
125
+ [Add to VS prompt]
126
+ Randomly sample the responses from the distribution, with the probability of each response below {threshold}.
127
+ ```
128
+
129
+ **Research optimal range:** 0.01-0.10 for most tasks
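A sketch of the prompt modification above, assuming the clause is simply appended to the instruction part of the VS prompt (the source says only "Add to VS prompt"). The helper names `add_threshold` and `over_threshold` are hypothetical; the second one flags returned responses whose stated probability is not below the threshold.

```python
from typing import Dict, List, Optional

# Clause text taken from the prompt modification shown above.
THRESHOLD_CLAUSE = (
    "Randomly sample the responses from the distribution, "
    "with the probability of each response below {threshold}."
)


def add_threshold(vs_instructions: str, threshold: Optional[float] = None) -> str:
    """Append the tail-sampling clause; None leaves the prompt unchanged."""
    if threshold is None:
        return vs_instructions
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    return vs_instructions.rstrip() + "\n\n" + THRESHOLD_CLAUSE.format(threshold=threshold)


def over_threshold(responses: List[Dict], threshold: float) -> List[Dict]:
    """Return responses whose verbalized probability is not below the threshold."""
    return [r for r in responses if r["probability"] >= threshold]
```

Checking the returned probabilities is an extra step of this sketch, useful because a model may not always respect the requested bound.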
130
+
131
+ ### k Value Tuning
132
+
133
+ **Quality-diversity tradeoff:**
134
+
135
+ | k | Diversity Gain | Quality Impact | Best For |
136
+ |---|----------------|----------------|----------|
137
+ | **3** | +15% | -2% | Quick tasks, time-sensitive |
138
+ | **5** | +35% | -3% | **Optimal balance (recommended)** |
139
+ | **10** | +52% | -5% | Deep exploration, ideation |
140
+ | **20** | +68% | -8% | Research, synthetic data |
141
+
142
+ **Recommendation:** Start with k=5, adjust based on results
143
+
144
+ ### Temperature Combination
145
+
146
+ **VS is orthogonal to temperature**, so the two can be combined:
147
+
148
+ | Configuration | Diversity | Quality | Use Case |
149
+ |--------------|-----------|---------|----------|
150
+ | VS only, temp=0.7 | Moderate | High | Production content |
151
+ | VS only, temp=1.0 | High | Moderate | Creative brainstorming |
152
+ | VS + temp=0.8 | Very High | Moderate | Maximum diversity |
153
+ | VS + temp=0.5 | Low-Moderate | Very High | Controlled variation |
154
+
155
+ **Optimal combo:** VS with temperature 0.8-1.0 for creative tasks
156
+
157
+ ---
158
+
159
+ ## Iterative Refinement Patterns
160
+
161
+ ### Pattern 1: Expand-Then-Narrow
162
+
163
+ **Use case:** User unsure of direction
164
+
165
+ ```
166
+ Round 1: VS with k=10, threshold=0.05 (wide exploration)
167
+ Round 2: User narrows to 3 favorites
168
+ Round 3: VS-Multi refinement on favorites (k=3 each)
169
+ Round 4: User picks final, agent polishes
170
+ ```
171
+
172
+ ### Pattern 2: Parallel Tracks
173
+
174
+ **Use case:** Exploring multiple distinct directions
175
+
176
+ ```
177
+ Track A: Humorous tone
178
+ - VS with k=5, emphasis on "playful, witty"
179
+
180
+ Track B: Professional tone
181
+ - VS with k=5, emphasis on "authoritative, polished"
182
+
183
+ Track C: Inspirational tone
184
+ - VS with k=5, emphasis on "uplifting, motivational"
185
+
186
+ User compares across tracks, selects best direction
187
+ ```
188
+
189
+ ### Pattern 3: Incremental Diversity
190
+
191
+ **Use case:** Gradually exploring creative space
192
+
193
+ ```
194
+ Iteration 1: Standard VS (k=5)
195
+ Iteration 2: If outputs too similar → Lower threshold (0.10 → 0.01)
196
+ Iteration 3: If still similar → Increase k (5 → 10)
197
+ Iteration 4: If still similar → Add explicit diversity constraints
198
+ ```
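The escalation ladder in Pattern 3 can be expressed as a small state update. This is a sketch under assumptions: `VSSettings` and `escalate` are hypothetical names, and the starting threshold of 0.10 is inferred from the "0.10 → 0.01" step rather than stated for Iteration 1.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VSSettings:
    k: int = 5
    threshold: Optional[float] = 0.10   # assumed starting point
    explicit_diversity: bool = False    # whether diversity constraints are added


def escalate(settings: VSSettings) -> VSSettings:
    """Apply the next step of the ladder when outputs are still too similar."""
    if settings.threshold is None or settings.threshold > 0.01:
        return replace(settings, threshold=0.01)         # Iteration 2
    if settings.k < 10:
        return replace(settings, k=10)                   # Iteration 3
    if not settings.explicit_diversity:
        return replace(settings, explicit_diversity=True)  # Iteration 4
    return settings  # ladder exhausted; nothing left to change
```

Calling `escalate` once per unsatisfactory round walks through the iterations in the listed order, changing one lever at a time so its effect can be judged.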
199
+
200
+ ---
201
+
202
+ ## Advanced Diversity Controls
203
+
204
+ ### Explicit Diversity Instructions
205
+
206
+ **Add to VS prompt when needed:**
207
+
208
+ ```
209
+ IMPORTANT: Ensure responses cover different:
210
+ - Tones: (humorous, professional, casual, inspirational)
211
+ - Perspectives: (consumer, expert, beginner, skeptic)
212
+ - Formats: (question, statement, story, call-to-action)
213
+
214
+ Avoid generating similar responses.
215
+ ```
216
+
217
+ **Effect:** Pushes the model to diversify deliberately; works well with VS-CoT
218
+
219
+ ### Negative Examples
220
+
221
+ **Provide examples of what NOT to generate:**
222
+
223
+ ```
224
+ USER REQUEST:
225
+ Write social media captions for a coffee shop
226
+
227
+ AVOID outputs like:
228
+ - "Wake up and smell the coffee! ☕" (too generic)
229
+ - "Coffee is life" (overused cliché)
230
+
231
+ Generate 5 UNIQUE captions with probabilities.
232
+ ```
233
+
234
+ **Effect:** Steers away from typical outputs, increases tail sampling
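A sketch of how a negative-example prompt like the one above might be assembled from data. The function name `with_negative_examples` is hypothetical, and the closing instruction is generalized from "captions" to "responses" so the helper is not tied to one content type.

```python
from typing import List, Tuple


def with_negative_examples(
    user_request: str,
    avoid: List[Tuple[str, str]],  # (example text, reason to avoid it)
    k: int = 5,
) -> str:
    """Build a VS request that lists outputs the model should steer away from."""
    lines = ["USER REQUEST:", user_request, "", "AVOID outputs like:"]
    for text, reason in avoid:
        lines.append(f'- "{text}" ({reason})')
    lines += ["", f"Generate {k} UNIQUE responses with probabilities."]
    return "\n".join(lines)
```

Keeping the avoid list as data makes it easy to append outputs that users rejected in earlier rounds.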
235
+
236
+ ---
237
+
238
+ ## Quality Constraints
239
+
240
+ ### Maintaining Quality at High Diversity
241
+
242
+ **Add quality guardrails to VS prompt:**
243
+
244
+ ```
245
+ Requirements:
246
+ - Maintain professional tone
247
+ - Align with brand voice (friendly, approachable)
248
+ - No clichés or overused phrases
249
+ - Proofread for grammar and clarity
250
+
251
+ Then generate {k} diverse responses meeting these standards.
252
+ ```
253
+
254
+ **Best practice:** Combine quality constraints with VS-CoT for best results
255
+
256
+ ### Quality Metrics to Track
257
+
258
+ **Before presenting to user, check:**
259
+
260
+ 1. **Baseline quality:** Each output individually acceptable?
261
+ 2. **Consistency:** All outputs meet same quality bar?
262
+ 3. **Diversity-quality balance:** Not sacrificing too much for variety?
263
+
264
+ **If quality drops below threshold:** Use VS-Multi instead of single-shot VS
265
+
266
+ ---
267
+
268
+ ## Production Checklist
269
+
270
+ **Before using advanced VS in production:**
271
+
272
+ - [ ] Tested on representative samples (not just cherry-picked examples)
273
+ - [ ] Quality metrics defined and tracked
274
+ - [ ] User feedback incorporated (iterate based on real usage)
275
+ - [ ] Fallback plan if VS fails (standard generation)
276
+ - [ ] Cost analysis done (advanced techniques use more tokens)
277
+
278
+ ---
279
+
280
+ ## Advanced Technique Selector
281
+
282
+ **Quick guide to choosing an advanced technique:**
283
+
284
+ | Situation | Recommended Technique | Why |
285
+ |-----------|----------------------|-----|
286
+ | Complex creative task | VS-CoT | Reasoning improves coherence |
287
+ | Production content | VS-Multi | 3-round process ensures quality |
288
+ | Outputs too similar | Lower threshold OR increase k | Samples more from tail |
289
+ | Quality too low | VS-Multi OR add quality constraints | Balances diversity+quality |
290
+ | Need maximum diversity | k=10 + threshold=0.01 + temp=1.0 | All diversity levers |
291
+ | Time-sensitive | k=3, no threshold, temp=0.7 | Quick, quality-focused |
292
+
293
+ ---
294
+
295
+ ## Research-Backed Recommendations
296
+
297
+ ### For Best Results
298
+
299
+ - **Creative writing:** VS-CoT with k=5, temperature=0.8
300
+ - **Marketing content:** VS-Multi for production, VS-Standard for ideation
301
+ - **Brainstorming:** VS-Standard with k=10, threshold=0.05
302
+ - **Production at scale:** VS with k=5, add quality constraints
303
+
304
+ ### Model-Specific Tips
305
+
306
+ **GPT-4.1+, Claude 4+, Gemini 2.5+:**
307
+ - All advanced techniques work excellently
308
+ - Large models benefit most from VS-Multi (+5.0% quality)
309
+ - Can handle complex VS-CoT reasoning
310
+
311
+ **Smaller models (<70B params):**
312
+ - Stick to VS-Standard (may struggle with VS-CoT)
313
+ - Use lower k values (3-5)
314
+ - Avoid stacking multiple techniques
315
+
316
+ ---
317
+
318
+ **Next steps:**
319
+ - **Tool integration:** See `tool-integration.md` for file operations
320
+ - **Troubleshooting:** See `troubleshooting.md` if issues arise
@@ -0,0 +1,379 @@
1
+ # VS Research Findings & Benchmarks
2
+
3
+ **Purpose:** Research-backed performance data, model compatibility, and evidence base
4
+
5
+ **Load when:** User asks about effectiveness, model selection, or wants evidence
6
+
7
+ **Source:** "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity" (Zhang et al., 2025)
8
+
9
+ ---
10
+
11
+ ## Key Research Results
12
+
13
+ ### Diversity Improvements
14
+
15
+ **Creative writing tasks (average across models):**
16
+
17
+ | Task | Direct Prompting | VS-Standard | VS-CoT | Improvement |
18
+ |------|-----------------|-------------|---------|-------------|
19
+ | **Poem** | 11.4% | 21.9% | **25.8%** | **+126% diversity** |
20
+ | **Story** | 22.2% | 34.7% | **38.2%** | **+72% diversity** |
21
+ | **Joke** | 30.0% | 62.5% | **62.9%** | **+110% diversity** |
22
+
23
+ **Key finding:** VS-CoT achieves **1.6-2.1× diversity improvement** while maintaining quality
24
+
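The Improvement column is the relative gain of VS-CoT over direct prompting. A short sketch of that arithmetic, using only the scores from the table (the helper name `relative_gain` is illustrative):

```python
# Diversity scores (percent) copied from the creative-writing table.
DIVERSITY = {
    "Poem": {"direct": 11.4, "vs_cot": 25.8},
    "Story": {"direct": 22.2, "vs_cot": 38.2},
    "Joke": {"direct": 30.0, "vs_cot": 62.9},
}


def relative_gain(direct: float, improved: float) -> float:
    """Relative improvement over the direct-prompting score, in percent."""
    return (improved - direct) / direct * 100.0


# Rounded to whole percent, as in the table's Improvement column.
gains = {
    task: round(relative_gain(scores["direct"], scores["vs_cot"]))
    for task, scores in DIVERSITY.items()
}
```

With these inputs `gains` comes out as 126, 72, and 110 for Poem, Story, and Joke, matching the column.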
25
+ ### Quality Maintenance
26
+
27
+ **Quality scores remain high:**
28
+ - Poems: VS-CoT achieves highest quality AND diversity
29
+ - Stories: All VS variants within 3% of baseline quality
30
+ - Jokes: Quality scores 82-84/100 across all methods
31
+
32
+ **Conclusion:** Diversity gains don't sacrifice quality
33
+
34
+ ### Human Evaluation
35
+
36
+ **30 annotators rated diversity (4-point Likert scale):**
37
+
38
+ | Task | Direct | Sequence | VS-Standard | Improvement |
39
+ |------|--------|----------|-------------|-------------|
40
+ | Poem | 1.90 | 2.07 | **2.39** | **+25.7%** |
41
+ | Story | 2.74 | 2.76 | **3.06** | **+11.7%** |
42
+ | Joke | 1.83 | 2.93 | **3.01** | **+64.5%** |
43
+
44
+ **Statistical significance:** p < 0.001 for all improvements
45
+
46
+ ---
47
+
48
+ ## Model Compatibility
49
+
50
+ ### Tested Models (Research)
51
+
52
+ **Excellent compatibility:**
53
+ - GPT-4.1 series (GPT-4.1, GPT-4.1-mini)
54
+ - Claude 4 series (Claude 4 Sonnet)
55
+ - Gemini 2.5 series (Gemini 2.5 Pro, Flash)
56
+ - DeepSeek-R1
57
+ - o3
58
+ - Llama-3.1-70B and larger
59
+
60
+ **Limited compatibility:**
61
+ - Models < 70B parameters show quality degradation
62
+ - Smaller models may not follow VS format reliably
63
+
64
+ ### Model-Specific Performance
65
+
66
+ | Model Family | Diversity Gain | Quality Impact | Best For |
67
+ |--------------|----------------|----------------|----------|
68
+ | **GPT-4.1** | ✅ Excellent (+15.4% large model) | Neutral | General purpose, balanced |
69
+ | **Claude 4** | ✅ Excellent | Slight improvement | Creative writing, narratives |
70
+ | **Gemini 2.5** | ✅ Excellent | Neutral | Balanced across tasks |
71
+ | **DeepSeek-R1** | ✅ Excellent | Improved factual accuracy | Reasoning tasks, QA |
72
+ | **o3** | ✅ Excellent | Improved | Complex prompts, CoT |
73
+ | **Llama-3.1-70B** | ✅ Good | -3 to -5% | Open source, cost-effective |
74
+ | **Smaller models** | ⚠️ Limited (-2% to -8%) | Quality drop | Not recommended |
75
+
76
+ ### Emergent Trend: Larger Models Benefit More
77
+
78
+ **Diversity gain over direct prompting:**
79
+
80
+ | Model Size | Sequence | VS-Standard | VS-CoT | VS-Multi |
81
+ |------------|----------|-------------|---------|----------|
82
+ | **Small** (Mini, Flash) | +4.4% | +5.4% | +6.4% | +6.8% |
83
+ | **Large** (Full, Pro) | +9.6% | +15.4% | +9.9% | +7.7% |
84
+
85
+ **Ratio:** Large models gain **1.5-2.0× more diversity** than small models
86
+
87
+ **Quality improvements with scale:**
88
+ - Small models: VS-Multi +0.7% quality
89
+ - Large models: VS-Multi **+5.0% quality**
90
+
91
+ **Conclusion:** Frontier models turn prompt complexity into benefits
92
+
93
+ ---
94
+
95
+ ## Performance Benchmarks
96
+
97
+ ### Open-Ended QA (CoverageQA dataset)
98
+
99
+ **Task:** Generate diverse valid answers (e.g., "Name a US state")
100
+
101
+ | Metric | Direct | Sequence | VS-Standard | VS-CoT |
102
+ |--------|--------|----------|-------------|---------|
103
+ | **KL Divergence** ↓ | 14.43 | 4.27 | 3.50 | **3.23** |
104
+ | **Coverage** ↑ | 0.10 | 0.64 | 0.67 | **0.68** |
105
+ | **Precision** ↑ | 1.00 | 0.96 | 0.96 | **0.96** |
106
+
107
+ **Key finding:** VS-CoT achieves **68% coverage** of valid answers (vs. 10% direct) while maintaining 96% precision
108
+
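For readers unfamiliar with the two ratio metrics, here is a minimal sketch of how coverage and precision could be computed for one question. The definitions are assumptions for illustration (coverage = share of valid answers that appear at least once, precision = share of samples that are valid); the paper's exact formulation may differ, and `coverage_and_precision` is a hypothetical helper.

```python
def coverage_and_precision(samples: list[str], valid_answers: set[str]) -> tuple[float, float]:
    """Return (coverage, precision) for one question.

    Assumed definitions, for illustration only:
      coverage  = distinct valid answers produced / number of valid answers
      precision = samples that are valid answers / number of samples
    """
    if not samples or not valid_answers:
        raise ValueError("samples and valid_answers must be non-empty")
    produced = set(samples)
    coverage = len(produced & valid_answers) / len(valid_answers)
    precision = sum(1 for s in samples if s in valid_answers) / len(samples)
    return coverage, precision


# Toy example with a four-answer ground truth (not data from the paper).
coverage, precision = coverage_and_precision(
    ["Texas", "Texas", "Ohio", "Guam"],
    {"Texas", "Ohio", "Utah", "Iowa"},
)
```

In the toy example two of four valid answers are covered and three of four samples are valid, which shows how a method can keep precision high while coverage stays low when it repeats the same answer.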
109
+ ### Dialogue Simulation (PersuasionForGood)
110
+
111
+ **Task:** Simulate human persuadee in donation conversation
112
+
113
+ **Donation distribution alignment:**
114
+
115
+ | Model | Method | KS Test ↓ | L1 Distance ↓ |
116
+ |-------|--------|-----------|---------------|
117
+ | GPT-4.1 | Direct | 0.373 | 0.613 |
118
+ | GPT-4.1 | **VS** | **0.211** | **0.579** |
119
+ | DeepSeek-R1 | Direct | 0.368 | 0.684 |
120
+ | DeepSeek-R1 | **VS** | **0.114** | **0.642** |
121
+
122
+ **Linguistic diversity:**
123
+
124
+ | Metric | Direct | VS-Standard | Human | Fine-tuned |
125
+ |--------|--------|-------------|--------|------------|
126
+ | Distinct-1 | 0.178 | **0.269** | 0.419 | 0.400 |
127
+ | Distinct-2 | 0.633 | **0.763** | 0.809 | 0.791 |
128
+ | Semantic Diversity | 0.577 | **0.664** | 0.721 | 0.696 |
129
+
130
+ **Key finding:** VS with GPT-4.1 approaches fine-tuned model diversity (e.g., Distinct-2 of 0.763 vs. 0.791) with no training needed
131
+
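Distinct-n is the ratio of unique n-grams to total n-grams across a set of texts. The sketch below assumes lowercase whitespace tokenization, which may not match the tokenization used in the research; `distinct_n` is an illustrative helper, not code from the package.

```python
def distinct_n(texts: list[str], n: int) -> float:
    """Unique n-grams divided by total n-grams over all texts.

    Tokenization is a simple lowercase whitespace split (an assumption).
    Returns 0.0 when no n-grams can be formed.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Toy utterances (not data from the paper).
utterances = ["I will donate", "I will not donate"]
d1 = distinct_n(utterances, 1)  # 4 unique unigrams out of 7
d2 = distinct_n(utterances, 2)  # 4 unique bigrams out of 5
```

Higher values mean less repetition, which is why the human and fine-tuned columns serve as the reference points in the table.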
132
+ ### Synthetic Data Generation (Math problems)
133
+
134
+ **Task:** Generate 1K math problems, train models on synthetic data
135
+
136
+ **Downstream accuracy (average across 3 benchmarks):**
137
+
138
+ | Method | Qwen2.5-7B | Qwen3-1.7B | Qwen3-4B | Average |
139
+ |--------|------------|------------|----------|---------|
140
+ | Baseline (no synth) | 27.2 | 30.5 | 40.7 | 32.8 |
141
+ | Direct | 26.1 | 31.4 | 34.5 | 30.6 ⚠️ |
142
+ | Sequence | 30.5 | 31.0 | 42.1 | 34.3 |
143
+ | **VS-Standard** | 32.7 | 33.6 | 45.5 | **36.1** |
144
+ | **VS-CoT** | 33.4 | 33.7 | **45.9** | **36.9** |
145
+ | **VS-Multi** | **34.8** | **34.9** | 45.0 | **37.5** |
146
+
147
+ **Key finding:** VS-Multi reaches **37.5% average accuracy** vs. 32.8% for the baseline (+4.7 points). Direct prompting can actually **hurt** performance (-2.2 points) due to mode collapse.
148
+
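The deltas quoted in the key finding are differences in percentage points against the no-synthetic-data baseline. A small sketch using only the Average column of the table:

```python
# Average downstream accuracy (percent), copied from the table's Average column.
AVERAGE_ACCURACY = {
    "Baseline (no synth)": 32.8,
    "Direct": 30.6,
    "Sequence": 34.3,
    "VS-Standard": 36.1,
    "VS-CoT": 36.9,
    "VS-Multi": 37.5,
}

baseline = AVERAGE_ACCURACY["Baseline (no synth)"]

# Change versus the baseline, in percentage points (rounded to one decimal).
deltas = {
    method: round(accuracy - baseline, 1)
    for method, accuracy in AVERAGE_ACCURACY.items()
    if method != "Baseline (no synth)"
}
```

Only direct prompting ends up below the baseline; every list-style method is above it.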
149
+ ---
150
+
151
+ ## Safety & Factual Accuracy
152
+
153
+ ### Safety Evaluation (StrongReject benchmark)
154
+
155
+ **Task:** 353 harmful prompts, measure refusal rate
156
+
157
+ | Method | Refusal Rate | Δ vs. Direct |
158
+ |--------|--------------| -------------|
159
+ | Direct | 98.22% | - |
160
+ | VS-Standard | 97.45% | -0.77% |
161
+ | VS-CoT | 97.81% | -0.41% |
162
+ | VS-Multi | 97.91% | -0.31% |
163
+
164
+ **Conclusion:** VS keeps the refusal rate above **97%**, a decrease of only 0.3-0.8 percentage points relative to direct prompting
165
+
166
+ ### Factual Accuracy (SimpleQA)
167
+
168
+ **Task:** 300 factual questions
169
+
170
+ | Metric | Direct | CoT | VS-Standard | VS-CoT |
171
+ |--------|--------|-----|-------------|---------|
172
+ | **Top@1 Accuracy** | 0.310 | 0.342 | 0.329 | **0.348** |
173
+ | **Pass@N Accuracy** | 0.430 | 0.473 | 0.448 | **0.485** |
174
+
175
+ **Conclusion:** VS-CoT achieves **highest accuracy** on both metrics. Diversity doesn't compromise correctness.
176
+
177
+ ---
178
+
179
+ ## Optimal Hyperparameters (Research-Backed)
180
+
181
+ ### k (Candidates per Call)
182
+
183
+ **Quality-diversity tradeoff:**
184
+
185
+ | k | Diversity Gain | Quality Impact | Recommended For |
186
+ |---|----------------|----------------|-----------------|
187
+ | **3** | +15% | -2% | Quick tasks, time-sensitive |
188
+ | **5** | +35% | -3% | **Optimal balance ✅** |
189
+ | **10** | +52% | -5% | Deep exploration, ideation |
190
+ | **20** | +68% | -8% | Research, synthetic data |
191
+
192
+ **Recommendation:** k=5 provides best practical balance
193
+
194
+ ### Probability Threshold
195
+
196
+ **Diversity tuning (poem generation):**
197
+
198
+ | Threshold | Diversity | Notes |
199
+ |-----------|-----------|-------|
200
+ | None | 17.0% | Baseline VS |
201
+ | 0.10 | 17.8% | +5% diversity |
202
+ | **0.01** | **18.2%** | **+7% diversity (optimal)** |
203
+ | 0.001 | 17.4% | Diminishing returns |
204
+
205
+ **Optimal range:** 0.01-0.10 for most tasks
206
+
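One way to check on the client side that returned candidates respect a requested threshold is to filter the parsed response. This is a sketch under stated assumptions: the JSON shape (`responses`, `text`, `probability`) is hypothetical and not confirmed by the package, and `filter_tail_responses` is an illustrative helper.

```python
import json


def filter_tail_responses(raw_json: str, threshold: float) -> list[dict]:
    """Keep candidates whose verbalized probability is below the threshold.

    Assumes a payload shaped like
    {"responses": [{"text": "...", "probability": 0.08}, ...]},
    which is a hypothetical schema used only for this example.
    """
    payload = json.loads(raw_json)
    return [r for r in payload["responses"] if float(r["probability"]) < threshold]


# Toy model output (not data from the paper).
raw = json.dumps({
    "responses": [
        {"text": "A sonnet about tide pools", "probability": 0.45},
        {"text": "A haiku from a lighthouse keeper", "probability": 0.08},
        {"text": "A ballad sung by a barnacle", "probability": 0.02},
    ]
})
tail = filter_tail_responses(raw, threshold=0.10)
```

Here the first candidate is dropped because its stated probability is above 0.10, leaving the two tail candidates.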
207
+ ### Temperature Combination
208
+
209
+ **VS is orthogonal to temperature**, so the two can be combined:
210
+
211
+ - VS alone achieves **1.6-2.1× diversity**
212
+ - VS + temperature 0.8-1.0 pushes Pareto front further
213
+ - Optimal: VS with temperature 0.8-1.0
214
+
215
+ ---
216
+
217
+ ## Post-Training Stage Analysis
218
+
219
+ **Using Tulu-3 family (Llama-3.1-70B base → SFT → DPO → RLVR):**
220
+
221
+ | Stage | Direct Diversity | VS Diversity | Base Model Diversity |
222
+ |-------|------------------|--------------|---------------------|
223
+ | Base | - | - | 45.4% |
224
+ | After SFT | 20.8% | 30.3% | - |
225
+ | After DPO | **10.8%** | **30.2%** | - |
226
+ | After RLVR | 10.8% | 30.3% | - |
227
+
228
+ **Key findings:**
229
+ - Direct prompting: **76.2% relative diversity drop** from the base model after DPO (45.4% → 10.8%)
230
+ - VS: Maintains ~30% diversity across all stages
231
+ - VS recovers **66.8%** of base model diversity (vs. 23.8% for direct)
232
+
233
+ **Conclusion:** Alignment dramatically reduces diversity, but VS recovers most of it
234
+
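The retention figures can be re-derived from the table as a share of base-model diversity. The sketch below uses the rounded final-stage (after RLVR) values, so the VS result lands at about 66.7% rather than the quoted 66.8%; the small gap is presumably due to rounding in the table, which is an assumption.

```python
BASE_DIVERSITY = 45.4     # base model, percent
DIRECT_AFTER_RLVR = 10.8  # direct prompting, final stage
VS_AFTER_RLVR = 30.3      # VS, final stage


def share_of_base(diversity: float) -> float:
    """Diversity expressed as a percentage of the base model's diversity."""
    return diversity / BASE_DIVERSITY * 100.0


direct_retained = round(share_of_base(DIRECT_AFTER_RLVR), 1)
vs_retained = round(share_of_base(VS_AFTER_RLVR), 1)

# Relative loss for direct prompting is the complement of what it retains.
direct_relative_drop = round(100.0 - share_of_base(DIRECT_AFTER_RLVR), 1)
```

Direct prompting retains roughly a quarter of the base model's diversity, so its relative loss is roughly three quarters.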
235
+ ---
236
+
237
+ ## Use Case Suitability
238
+
239
+ ### Highly Effective For:
240
+
241
+ ✅ **Creative writing** (poems, stories, scripts)
242
+ - Research shows: +60-120% diversity improvement
243
+ - Quality maintained or improved
244
+
245
+ ✅ **Marketing content** (campaigns, taglines, ad copy)
246
+ - Multiple valid approaches
247
+ - Benefits from exploring creative space
248
+
249
+ ✅ **Brainstorming & ideation**
250
+ - Open-ended exploration
251
+ - No single "correct" answer
252
+
253
+ ✅ **Open-ended QA** (tasks with multiple valid answers)
254
+ - Research shows: coverage rises from 0.10 to 0.68 (roughly 6.8×)
255
+ - Example: "Name a US state" → covers 68% vs. 10%
256
+
257
+ ✅ **Synthetic data generation**
258
+ - Diverse training data improves model performance
259
+ - Research shows: +4.7% downstream accuracy
260
+
261
+ ### Less Effective For:
262
+
263
+ ⚠️ **Single-answer factual questions**
264
+ - Example: "What's the capital of France?" → Only Paris is correct
265
+ - VS adds no value
266
+
267
+ ⚠️ **Deterministic outputs**
268
+ - Tasks requiring exact, reproducible results
269
+ - Diversity is not desired
270
+
271
+ ⚠️ **Real-time low-latency applications**
272
+ - VS requires k× more generation time
273
+ - Not suitable for < 1 second response requirements
274
+
275
+ ⚠️ **Weak models** (< 70B parameters)
276
+ - Quality degradation observed
277
+ - May not follow VS format reliably
278
+
279
+ ---
280
+
281
+ ## Cost-Benefit Analysis
282
+
283
+ ### Token Usage
284
+
285
+ **Typical VS call:**
286
+ - Prompt: ~200 tokens (template + request)
287
+ - Response: ~300 tokens (k=5, JSON format)
288
+ - **Total: ~500 tokens per VS call**
289
+
290
+ **Comparison:**
291
+ - Direct prompting: 1 call × 100 tokens = 100 tokens
292
+ - VS: 1 call × 500 tokens = 500 tokens
293
+ - **Multiplier: 5× tokens for 2× diversity**
294
+
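The 5× multiplier follows directly from the token estimates in this section. A minimal sketch of the arithmetic, with the estimates as inputs (`token_multiplier` is an illustrative helper):

```python
def token_multiplier(vs_prompt: int, vs_response: int, direct_total: int) -> float:
    """Total tokens of one VS call divided by total tokens of one direct call."""
    if direct_total <= 0:
        raise ValueError("direct_total must be positive")
    return (vs_prompt + vs_response) / direct_total


# Estimates from this section: ~200 prompt + ~300 response tokens for VS (k=5),
# ~100 tokens for a direct call.
multiplier = token_multiplier(vs_prompt=200, vs_response=300, direct_total=100)

# Tokens spent per returned candidate when k=5.
tokens_per_candidate = (200 + 300) / 5
```

Per candidate, the VS call costs about the same as one direct call under these estimates; the extra spend buys five candidates in a single request.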
295
+ ### When Worth the Cost
296
+
297
+ ✅ **High value per output** (production content, campaigns, key messaging)
298
+ ✅ **Diversity critical** (avoiding repetitive content, exploring options)
299
+ ✅ **Perceived diversity** (11.7-64.5% higher diversity ratings in human evaluation)
300
+ ✅ **Downstream quality** (better synthetic training data)
301
+
302
+ ⚠️ **Not worth for:**
303
+ - High-volume low-value content
304
+ - Single-answer tasks
305
+ - Extremely tight budgets
306
+
307
+ ---
308
+
309
+ ## Comparative Methods
310
+
311
+ ### VS vs. Temperature Tuning
312
+
313
+ | Method | Diversity Gain | Quality Impact | Works with API? |
314
+ |--------|----------------|----------------|-----------------|
315
+ | Temperature ↑ | Moderate | Can degrade | ✅ Yes |
316
+ | **VS** | **High (2×)** | **Maintained** | ✅ Yes |
317
+ | VS + Temp | **Very High** | Moderate | ✅ Yes |
318
+
319
+ **Conclusion:** VS and temperature are complementary, can combine
320
+
321
+ ### VS vs. Multiple Sampling
322
+
323
+ | Method | API Calls | Diversity | Token Cost |
324
+ |--------|-----------|-----------|------------|
325
+ | Sample 5× | 5 | Low (mode collapse) | 5× |
326
+ | **VS (k=5)** | **1** | **High** | **1× (but larger)** |
327
+
328
+ **Conclusion:** VS more token-efficient than repeated sampling
329
+
330
+ ### VS vs. Fine-Tuning
331
+
332
+ | Method | Setup Cost | Inference Cost | Flexibility |
333
+ |--------|------------|----------------|-------------|
334
+ | Fine-tune for diversity | High (data + compute) | Same as base | Low |
335
+ | **VS** | **Zero** | **Higher tokens** | **High** |
336
+
337
+ **Conclusion:** VS is training-free, more flexible
338
+
339
+ ---
340
+
341
+ ## Research Methodology Notes
342
+
343
+ **Dataset sizes:**
344
+ - Creative writing: 100 samples per task
345
+ - Open-ended QA: 40 questions, 100 samples each
346
+ - Dialogue simulation: 200 test dialogues
347
+ - Factual accuracy: 300 questions (SimpleQA)
348
+ - Safety: 353 harmful prompts (StrongReject)
349
+
350
+ **Models tested:**
351
+ - GPT-4.1 series, Gemini 2.5 series, Claude 4 series
352
+ - DeepSeek-R1, o3, Llama-3.1-70B
353
+ - Tulu-3 family (for post-training analysis)
354
+
355
+ **Metrics:**
356
+ - Diversity: Self-BLEU, distinct-n, semantic diversity, coverage
357
+ - Quality: GPT-4-as-judge, human evaluation (Likert scale)
358
+ - Accuracy: Exact match, pass@N
359
+ - Safety: Refusal rate
360
+
361
+ ---
362
+
363
+ ## Citation
364
+
365
+ ```bibtex
366
+ @article{zhang2025verbalized,
367
+ title={Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity},
368
+ author={Zhang, Jiayi and Yu, Simon and Chong, Derek and Sicilia, Anthony and Tomz, Michael R. and Manning, Christopher D. and Shi, Weiyan},
369
+ journal={arXiv preprint arXiv:2510.01171},
370
+ year={2025}
371
+ }
372
+ ```
373
+
374
+ **Project page:** https://verbalized-sampling.github.io
375
+
376
+ ---
377
+
378
+ **For practical implementation:** See `vs-core-technique.md`
379
+ **For task-specific workflows:** See `task-workflows.md`