quiver-cli 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (281)
  1. package/README.md +188 -0
  2. package/bin/quiver-cli.mjs +2 -0
  3. package/dist/cli.js +3074 -0
  4. package/package.json +55 -0
  5. package/template/.agents/AGENTS.md +25 -0
  6. package/template/.agents/commands/cp.md +116 -0
  7. package/template/.agents/commands/next-setup.md +1064 -0
  8. package/template/.agents/commands/tf-readme.md +38 -0
  9. package/template/.agents/config.json +60 -0
  10. package/template/.agents/skills/agent-browser/SKILL.md +55 -0
  11. package/template/.agents/skills/apps/skybridge/SKILL.md +46 -0
  12. package/template/.agents/skills/apps/skybridge/references/architecture.md +175 -0
  13. package/template/.agents/skills/apps/skybridge/references/copy-template.md +24 -0
  14. package/template/.agents/skills/apps/skybridge/references/csp.md +33 -0
  15. package/template/.agents/skills/apps/skybridge/references/deploy.md +33 -0
  16. package/template/.agents/skills/apps/skybridge/references/discover.md +84 -0
  17. package/template/.agents/skills/apps/skybridge/references/download-file.md +77 -0
  18. package/template/.agents/skills/apps/skybridge/references/fetch-and-render-data.md +151 -0
  19. package/template/.agents/skills/apps/skybridge/references/oauth.md +115 -0
  20. package/template/.agents/skills/apps/skybridge/references/open-external-links.md +71 -0
  21. package/template/.agents/skills/apps/skybridge/references/prompt-llm.md +20 -0
  22. package/template/.agents/skills/apps/skybridge/references/publish.md +19 -0
  23. package/template/.agents/skills/apps/skybridge/references/run-locally.md +51 -0
  24. package/template/.agents/skills/apps/skybridge/references/state-and-context.md +151 -0
  25. package/template/.agents/skills/apps/skybridge/references/ui-guidelines.md +205 -0
  26. package/template/.agents/skills/code/cleanup/SKILL.md +26 -0
  27. package/template/.agents/skills/code/vercel-react-best-practices/AGENTS.md +3810 -0
  28. package/template/.agents/skills/code/vercel-react-best-practices/README.md +123 -0
  29. package/template/.agents/skills/code/vercel-react-best-practices/SKILL.md +149 -0
  30. package/template/.agents/skills/code/vercel-react-best-practices/metadata.json +15 -0
  31. package/template/.agents/skills/code/vercel-react-best-practices/rules/_sections.md +46 -0
  32. package/template/.agents/skills/code/vercel-react-best-practices/rules/_template.md +28 -0
  33. package/template/.agents/skills/code/vercel-react-best-practices/rules/advanced-effect-event-deps.md +56 -0
  34. package/template/.agents/skills/code/vercel-react-best-practices/rules/advanced-event-handler-refs.md +55 -0
  35. package/template/.agents/skills/code/vercel-react-best-practices/rules/advanced-init-once.md +42 -0
  36. package/template/.agents/skills/code/vercel-react-best-practices/rules/advanced-use-latest.md +39 -0
  37. package/template/.agents/skills/code/vercel-react-best-practices/rules/async-api-routes.md +38 -0
  38. package/template/.agents/skills/code/vercel-react-best-practices/rules/async-cheap-condition-before-await.md +37 -0
  39. package/template/.agents/skills/code/vercel-react-best-practices/rules/async-defer-await.md +82 -0
  40. package/template/.agents/skills/code/vercel-react-best-practices/rules/async-dependencies.md +51 -0
  41. package/template/.agents/skills/code/vercel-react-best-practices/rules/async-parallel.md +28 -0
  42. package/template/.agents/skills/code/vercel-react-best-practices/rules/async-suspense-boundaries.md +99 -0
  43. package/template/.agents/skills/code/vercel-react-best-practices/rules/bundle-analyzable-paths.md +63 -0
  44. package/template/.agents/skills/code/vercel-react-best-practices/rules/bundle-barrel-imports.md +60 -0
  45. package/template/.agents/skills/code/vercel-react-best-practices/rules/bundle-conditional.md +31 -0
  46. package/template/.agents/skills/code/vercel-react-best-practices/rules/bundle-defer-third-party.md +49 -0
  47. package/template/.agents/skills/code/vercel-react-best-practices/rules/bundle-dynamic-imports.md +35 -0
  48. package/template/.agents/skills/code/vercel-react-best-practices/rules/bundle-preload.md +50 -0
  49. package/template/.agents/skills/code/vercel-react-best-practices/rules/client-event-listeners.md +74 -0
  50. package/template/.agents/skills/code/vercel-react-best-practices/rules/client-localstorage-schema.md +71 -0
  51. package/template/.agents/skills/code/vercel-react-best-practices/rules/client-passive-event-listeners.md +48 -0
  52. package/template/.agents/skills/code/vercel-react-best-practices/rules/client-swr-dedup.md +56 -0
  53. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-batch-dom-css.md +107 -0
  54. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-cache-function-results.md +80 -0
  55. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-cache-property-access.md +28 -0
  56. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-cache-storage.md +70 -0
  57. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-combine-iterations.md +32 -0
  58. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-early-exit.md +50 -0
  59. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-flatmap-filter.md +60 -0
  60. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-hoist-regexp.md +45 -0
  61. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-index-maps.md +37 -0
  62. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-length-check-first.md +49 -0
  63. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-min-max-loop.md +82 -0
  64. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-request-idle-callback.md +105 -0
  65. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-set-map-lookups.md +24 -0
  66. package/template/.agents/skills/code/vercel-react-best-practices/rules/js-tosorted-immutable.md +57 -0
  67. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-activity.md +26 -0
  68. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-animate-svg-wrapper.md +47 -0
  69. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-conditional-render.md +40 -0
  70. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-content-visibility.md +38 -0
  71. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-hoist-jsx.md +46 -0
  72. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-hydration-no-flicker.md +82 -0
  73. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-hydration-suppress-warning.md +30 -0
  74. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-resource-hints.md +85 -0
  75. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-script-defer-async.md +68 -0
  76. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-svg-precision.md +28 -0
  77. package/template/.agents/skills/code/vercel-react-best-practices/rules/rendering-usetransition-loading.md +75 -0
  78. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-defer-reads.md +39 -0
  79. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-dependencies.md +45 -0
  80. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-derived-state-no-effect.md +40 -0
  81. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-derived-state.md +29 -0
  82. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-functional-setstate.md +74 -0
  83. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-lazy-state-init.md +58 -0
  84. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-memo-with-default-value.md +38 -0
  85. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-memo.md +44 -0
  86. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-move-effect-to-event.md +45 -0
  87. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-no-inline-components.md +82 -0
  88. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-simple-expression-in-memo.md +35 -0
  89. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-split-combined-hooks.md +64 -0
  90. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-transitions.md +40 -0
  91. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-use-deferred-value.md +59 -0
  92. package/template/.agents/skills/code/vercel-react-best-practices/rules/rerender-use-ref-transient-values.md +73 -0
  93. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-after-nonblocking.md +73 -0
  94. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-auth-actions.md +96 -0
  95. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-cache-lru.md +41 -0
  96. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-cache-react.md +76 -0
  97. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-dedup-props.md +65 -0
  98. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-hoist-static-io.md +149 -0
  99. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-no-shared-module-state.md +50 -0
  100. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-parallel-fetching.md +83 -0
  101. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-parallel-nested-fetching.md +34 -0
  102. package/template/.agents/skills/code/vercel-react-best-practices/rules/server-serialization.md +38 -0
  103. package/template/.agents/skills/data/prisma-cli/SKILL.md +247 -0
  104. package/template/.agents/skills/data/prisma-cli/references/db-execute.md +78 -0
  105. package/template/.agents/skills/data/prisma-cli/references/db-pull.md +185 -0
  106. package/template/.agents/skills/data/prisma-cli/references/db-push.md +148 -0
  107. package/template/.agents/skills/data/prisma-cli/references/db-seed.md +188 -0
  108. package/template/.agents/skills/data/prisma-cli/references/debug.md +46 -0
  109. package/template/.agents/skills/data/prisma-cli/references/dev.md +157 -0
  110. package/template/.agents/skills/data/prisma-cli/references/format.md +48 -0
  111. package/template/.agents/skills/data/prisma-cli/references/generate.md +173 -0
  112. package/template/.agents/skills/data/prisma-cli/references/init.md +136 -0
  113. package/template/.agents/skills/data/prisma-cli/references/mcp.md +38 -0
  114. package/template/.agents/skills/data/prisma-cli/references/migrate-deploy.md +127 -0
  115. package/template/.agents/skills/data/prisma-cli/references/migrate-dev.md +145 -0
  116. package/template/.agents/skills/data/prisma-cli/references/migrate-diff.md +89 -0
  117. package/template/.agents/skills/data/prisma-cli/references/migrate-reset.md +78 -0
  118. package/template/.agents/skills/data/prisma-cli/references/migrate-resolve.md +57 -0
  119. package/template/.agents/skills/data/prisma-cli/references/migrate-status.md +65 -0
  120. package/template/.agents/skills/data/prisma-cli/references/studio.md +137 -0
  121. package/template/.agents/skills/data/prisma-cli/references/validate.md +53 -0
  122. package/template/.agents/skills/data/prisma-client-api/SKILL.md +216 -0
  123. package/template/.agents/skills/data/prisma-client-api/references/client-methods.md +223 -0
  124. package/template/.agents/skills/data/prisma-client-api/references/constructor.md +208 -0
  125. package/template/.agents/skills/data/prisma-client-api/references/filters.md +256 -0
  126. package/template/.agents/skills/data/prisma-client-api/references/model-queries.md +281 -0
  127. package/template/.agents/skills/data/prisma-client-api/references/query-options.md +276 -0
  128. package/template/.agents/skills/data/prisma-client-api/references/raw-queries.md +194 -0
  129. package/template/.agents/skills/data/prisma-client-api/references/relations.md +308 -0
  130. package/template/.agents/skills/data/prisma-client-api/references/transactions.md +184 -0
  131. package/template/.agents/skills/design/impeccable/SKILL.md +176 -0
  132. package/template/.agents/skills/design/impeccable/reference/adapt.md +311 -0
  133. package/template/.agents/skills/design/impeccable/reference/animate.md +201 -0
  134. package/template/.agents/skills/design/impeccable/reference/audit.md +133 -0
  135. package/template/.agents/skills/design/impeccable/reference/bolder.md +113 -0
  136. package/template/.agents/skills/design/impeccable/reference/brand.md +108 -0
  137. package/template/.agents/skills/design/impeccable/reference/clarify.md +288 -0
  138. package/template/.agents/skills/design/impeccable/reference/codex.md +105 -0
  139. package/template/.agents/skills/design/impeccable/reference/colorize.md +257 -0
  140. package/template/.agents/skills/design/impeccable/reference/craft.md +123 -0
  141. package/template/.agents/skills/design/impeccable/reference/critique.md +767 -0
  142. package/template/.agents/skills/design/impeccable/reference/delight.md +302 -0
  143. package/template/.agents/skills/design/impeccable/reference/distill.md +111 -0
  144. package/template/.agents/skills/design/impeccable/reference/document.md +429 -0
  145. package/template/.agents/skills/design/impeccable/reference/extract.md +69 -0
  146. package/template/.agents/skills/design/impeccable/reference/harden.md +347 -0
  147. package/template/.agents/skills/design/impeccable/reference/init.md +172 -0
  148. package/template/.agents/skills/design/impeccable/reference/interaction-design.md +189 -0
  149. package/template/.agents/skills/design/impeccable/reference/layout.md +161 -0
  150. package/template/.agents/skills/design/impeccable/reference/live.md +718 -0
  151. package/template/.agents/skills/design/impeccable/reference/onboard.md +234 -0
  152. package/template/.agents/skills/design/impeccable/reference/optimize.md +258 -0
  153. package/template/.agents/skills/design/impeccable/reference/overdrive.md +130 -0
  154. package/template/.agents/skills/design/impeccable/reference/polish.md +241 -0
  155. package/template/.agents/skills/design/impeccable/reference/product.md +60 -0
  156. package/template/.agents/skills/design/impeccable/reference/quieter.md +99 -0
  157. package/template/.agents/skills/design/impeccable/reference/shape.md +165 -0
  158. package/template/.agents/skills/design/impeccable/reference/typeset.md +279 -0
  159. package/template/.agents/skills/design/impeccable/scripts/cleanup-deprecated.mjs +284 -0
  160. package/template/.agents/skills/design/impeccable/scripts/command-metadata.json +94 -0
  161. package/template/.agents/skills/design/impeccable/scripts/context-signals.mjs +225 -0
  162. package/template/.agents/skills/design/impeccable/scripts/context.mjs +270 -0
  163. package/template/.agents/skills/design/impeccable/scripts/critique-storage.mjs +242 -0
  164. package/template/.agents/skills/design/impeccable/scripts/design-parser.mjs +835 -0
  165. package/template/.agents/skills/design/impeccable/scripts/detect-csp.mjs +198 -0
  166. package/template/.agents/skills/design/impeccable/scripts/detect.mjs +21 -0
  167. package/template/.agents/skills/design/impeccable/scripts/detector/browser/injected/index.mjs +1733 -0
  168. package/template/.agents/skills/design/impeccable/scripts/detector/cli/main.mjs +244 -0
  169. package/template/.agents/skills/design/impeccable/scripts/detector/detect-antipatterns-browser.js +4551 -0
  170. package/template/.agents/skills/design/impeccable/scripts/detector/detect-antipatterns.mjs +43 -0
  171. package/template/.agents/skills/design/impeccable/scripts/detector/engines/browser/detect-url.mjs +252 -0
  172. package/template/.agents/skills/design/impeccable/scripts/detector/engines/regex/detect-text.mjs +535 -0
  173. package/template/.agents/skills/design/impeccable/scripts/detector/engines/static-html/css-cascade.mjs +986 -0
  174. package/template/.agents/skills/design/impeccable/scripts/detector/engines/static-html/detect-html.mjs +208 -0
  175. package/template/.agents/skills/design/impeccable/scripts/detector/engines/visual/screenshot-contrast.mjs +189 -0
  176. package/template/.agents/skills/design/impeccable/scripts/detector/findings.mjs +12 -0
  177. package/template/.agents/skills/design/impeccable/scripts/detector/node/file-system.mjs +198 -0
  178. package/template/.agents/skills/design/impeccable/scripts/detector/profile/profiler.mjs +166 -0
  179. package/template/.agents/skills/design/impeccable/scripts/detector/registry/antipatterns.mjs +419 -0
  180. package/template/.agents/skills/design/impeccable/scripts/detector/rules/checks.mjs +2316 -0
  181. package/template/.agents/skills/design/impeccable/scripts/detector/shared/color.mjs +124 -0
  182. package/template/.agents/skills/design/impeccable/scripts/detector/shared/constants.mjs +101 -0
  183. package/template/.agents/skills/design/impeccable/scripts/detector/shared/page.mjs +7 -0
  184. package/template/.agents/skills/design/impeccable/scripts/impeccable-paths.mjs +126 -0
  185. package/template/.agents/skills/design/impeccable/scripts/is-generated.mjs +69 -0
  186. package/template/.agents/skills/design/impeccable/scripts/live-accept.mjs +812 -0
  187. package/template/.agents/skills/design/impeccable/scripts/live-browser-session.js +123 -0
  188. package/template/.agents/skills/design/impeccable/scripts/live-browser.js +10316 -0
  189. package/template/.agents/skills/design/impeccable/scripts/live-commit-manual-edits.mjs +1241 -0
  190. package/template/.agents/skills/design/impeccable/scripts/live-complete.mjs +75 -0
  191. package/template/.agents/skills/design/impeccable/scripts/live-completion.mjs +19 -0
  192. package/template/.agents/skills/design/impeccable/scripts/live-copy-edit-agent.mjs +683 -0
  193. package/template/.agents/skills/design/impeccable/scripts/live-discard-manual-edits.mjs +51 -0
  194. package/template/.agents/skills/design/impeccable/scripts/live-event-validation.mjs +136 -0
  195. package/template/.agents/skills/design/impeccable/scripts/live-inject.mjs +557 -0
  196. package/template/.agents/skills/design/impeccable/scripts/live-insert-ui.mjs +458 -0
  197. package/template/.agents/skills/design/impeccable/scripts/live-insert.mjs +272 -0
  198. package/template/.agents/skills/design/impeccable/scripts/live-manual-edit-evidence.mjs +363 -0
  199. package/template/.agents/skills/design/impeccable/scripts/live-manual-edits-buffer.mjs +152 -0
  200. package/template/.agents/skills/design/impeccable/scripts/live-poll.mjs +379 -0
  201. package/template/.agents/skills/design/impeccable/scripts/live-resume.mjs +94 -0
  202. package/template/.agents/skills/design/impeccable/scripts/live-server.mjs +2322 -0
  203. package/template/.agents/skills/design/impeccable/scripts/live-session-store.mjs +289 -0
  204. package/template/.agents/skills/design/impeccable/scripts/live-status.mjs +61 -0
  205. package/template/.agents/skills/design/impeccable/scripts/live-svelte-component.mjs +826 -0
  206. package/template/.agents/skills/design/impeccable/scripts/live-sveltekit-adapter.mjs +274 -0
  207. package/template/.agents/skills/design/impeccable/scripts/live-ui-core.mjs +179 -0
  208. package/template/.agents/skills/design/impeccable/scripts/live-wrap.mjs +894 -0
  209. package/template/.agents/skills/design/impeccable/scripts/live.mjs +246 -0
  210. package/template/.agents/skills/design/impeccable/scripts/modern-screenshot.umd.js +14 -0
  211. package/template/.agents/skills/design/impeccable/scripts/palette.mjs +633 -0
  212. package/template/.agents/skills/design/impeccable/scripts/pin.mjs +214 -0
  213. package/template/.agents/skills/design/shadcn/SKILL.md +242 -0
  214. package/template/.agents/skills/design/shadcn/agents/openai.yml +5 -0
  215. package/template/.agents/skills/design/shadcn/assets/shadcn-small.png +0 -0
  216. package/template/.agents/skills/design/shadcn/assets/shadcn.png +0 -0
  217. package/template/.agents/skills/design/shadcn/cli.md +257 -0
  218. package/template/.agents/skills/design/shadcn/customization.md +202 -0
  219. package/template/.agents/skills/design/shadcn/evals/evals.json +47 -0
  220. package/template/.agents/skills/design/shadcn/mcp.md +94 -0
  221. package/template/.agents/skills/design/shadcn/rules/base-vs-radix.md +306 -0
  222. package/template/.agents/skills/design/shadcn/rules/composition.md +195 -0
  223. package/template/.agents/skills/design/shadcn/rules/forms.md +192 -0
  224. package/template/.agents/skills/design/shadcn/rules/icons.md +101 -0
  225. package/template/.agents/skills/design/shadcn/rules/styling.md +162 -0
  226. package/template/.agents/skills/find-skills/SKILL.md +142 -0
  227. package/template/.agents/skills/integrations/langfuse/SKILL.md +142 -0
  228. package/template/.agents/skills/integrations/langfuse/references/cli.md +52 -0
  229. package/template/.agents/skills/integrations/langfuse/references/error-analysis.md +100 -0
  230. package/template/.agents/skills/integrations/langfuse/references/instrumentation.md +134 -0
  231. package/template/.agents/skills/integrations/langfuse/references/judge-calibration.md +288 -0
  232. package/template/.agents/skills/integrations/langfuse/references/prompt-migration.md +234 -0
  233. package/template/.agents/skills/integrations/langfuse/references/sdk-upgrade.md +175 -0
  234. package/template/.agents/skills/integrations/langfuse/references/skill-feedback.md +52 -0
  235. package/template/.agents/skills/integrations/langfuse/references/user-feedback.md +88 -0
  236. package/template/.agents/skills/integrations/posthog/SKILL.md +102 -0
  237. package/template/.agents/skills/integrations/posthog/references/error-tracking-alerts.md +63 -0
  238. package/template/.agents/skills/integrations/posthog/references/error-tracking-assigning-issues.md +77 -0
  239. package/template/.agents/skills/integrations/posthog/references/error-tracking-fingerprints.md +57 -0
  240. package/template/.agents/skills/integrations/posthog/references/error-tracking-monitoring.md +140 -0
  241. package/template/.agents/skills/integrations/posthog/references/error-tracking-nextjs.md +490 -0
  242. package/template/.agents/skills/integrations/posthog/references/error-tracking-source-maps.md +45 -0
  243. package/template/.agents/skills/integrations/posthog/references/feature-flags-best-practices.md +139 -0
  244. package/template/.agents/skills/integrations/posthog/references/feature-flags-react.md +302 -0
  245. package/template/.agents/skills/integrations/posthog/references/identify-users.md +202 -0
  246. package/template/.agents/skills/integrations/posthog/references/integration-example.md +706 -0
  247. package/template/.agents/skills/integrations/posthog/references/integration-nextjs.md +385 -0
  248. package/template/.agents/skills/integrations/posthog/references/integration-step-1-begin.md +43 -0
  249. package/template/.agents/skills/integrations/posthog/references/integration-step-2-edit.md +37 -0
  250. package/template/.agents/skills/integrations/posthog/references/integration-step-3-revise.md +22 -0
  251. package/template/.agents/skills/integrations/posthog/references/integration-step-4-conclude.md +38 -0
  252. package/template/.agents/skills/integrations/posthog/references/llm-analytics-anthropic.md +200 -0
  253. package/template/.agents/skills/integrations/posthog/references/llm-analytics-basics.md +62 -0
  254. package/template/.agents/skills/integrations/posthog/references/llm-analytics-costs.md +197 -0
  255. package/template/.agents/skills/integrations/posthog/references/llm-analytics-manual-capture.md +397 -0
  256. package/template/.agents/skills/integrations/posthog/references/llm-analytics-traces.md +98 -0
  257. package/template/.agents/skills/integrations/posthog/references/llm-analytics-vercel-ai.md +120 -0
  258. package/template/.agents/skills/repo/repo-ci/SKILL.md +265 -0
  259. package/template/.agents/skills/repo/repo-init-next-js/SKILL.md +129 -0
  260. package/template/.agents/skills/repo/repo-init-next-js/references/file-contents.md +800 -0
  261. package/template/.agents/skills/repo/repo-init-next-js/scripts/setup.sh +47 -0
  262. package/template/.agents/skills/repo/repo-init-node/SKILL.md +196 -0
  263. package/template/.agents/skills/skill-creator/LICENSE.txt +202 -0
  264. package/template/.agents/skills/skill-creator/SKILL.md +485 -0
  265. package/template/.agents/skills/skill-creator/agents/analyzer.md +274 -0
  266. package/template/.agents/skills/skill-creator/agents/comparator.md +202 -0
  267. package/template/.agents/skills/skill-creator/agents/grader.md +223 -0
  268. package/template/.agents/skills/skill-creator/assets/eval_review.html +146 -0
  269. package/template/.agents/skills/skill-creator/eval-viewer/generate_review.py +471 -0
  270. package/template/.agents/skills/skill-creator/eval-viewer/viewer.html +1325 -0
  271. package/template/.agents/skills/skill-creator/references/schemas.md +430 -0
  272. package/template/.agents/skills/skill-creator/scripts/__init__.py +0 -0
  273. package/template/.agents/skills/skill-creator/scripts/aggregate_benchmark.py +401 -0
  274. package/template/.agents/skills/skill-creator/scripts/generate_report.py +326 -0
  275. package/template/.agents/skills/skill-creator/scripts/improve_description.py +247 -0
  276. package/template/.agents/skills/skill-creator/scripts/package_skill.py +136 -0
  277. package/template/.agents/skills/skill-creator/scripts/quick_validate.py +103 -0
  278. package/template/.agents/skills/skill-creator/scripts/run_eval.py +310 -0
  279. package/template/.agents/skills/skill-creator/scripts/run_loop.py +328 -0
  280. package/template/.agents/skills/skill-creator/scripts/utils.py +47 -0
  281. package/template/.agents/upstreams.json +80 -0
@@ -0,0 +1,100 @@
1
+ ---
2
+ name: langfuse-error-analysis
3
+ description: Deep-dive error analysis of an LLM pipeline or AI application using Langfuse traces.
4
+ Use this skill whenever the user wants to understand why their AI system is producing
5
+ bad outputs, where their pipeline is failing, how to categorise or label failures,
6
+ what to prioritise fixing, or how to set up evaluators. Also trigger for "review my
7
+ traces", "my outputs look wrong", "help me debug my LLM app", "I want to analyse
8
+ errors", "build a failure taxonomy", "what's going wrong with my pipeline", or any
9
+ request to systematically inspect, annotate, or score Langfuse traces. If the user
10
+ is trying to understand or improve the quality of an AI system's outputs, use this skill.
11
+ ---
12
+
13
+ # Error Analysis
14
+
15
+ ## Primary Guide
16
+
17
+ **1. Fetch the guide from this blog post**
18
+
19
+ https://langfuse.com/guides/cookbook/error-analysis-llm-applications.md
20
+
21
+ If fetch is not available, search for the langfuse.com error analysis guide.
22
+
23
+ Read it in full. It defines the authoritative 5-step process (sample selection → open coding → clustering → labelling → deciding what to fix).
24
+
25
+ **2. Guide the user through this step by step**
26
+
27
+ You, as a coding agent, and the user go through this together to perform a full error analysis on their data in Langfuse. Do everything you can via the CLI (look up traces, create annotation queues, ...) on the user's behalf. Give them direct links to the UI wherever their action is required. Be proactive and narrate what is happening as you go.
28
+
29
+ ## Rules CRITICAL
30
+ - Use the Langfuse CLI wherever possible
31
+ - Use charts where possible to display data
32
+
33
+ ---
34
+
35
+ ## Langfuse Implementation Notes
36
+
37
+ The guide describes the process. These notes cover the Langfuse-specific API and CLI mechanics required to execute it.
38
+
39
+ ### Credentials
40
+
41
+ ```bash
42
+ echo $LANGFUSE_PUBLIC_KEY # pk-lf-...
43
+ echo $LANGFUSE_SECRET_KEY # sk-lf-...
44
+ echo $LANGFUSE_BASE_URL # https://cloud.langfuse.com (EU), https://us.cloud.langfuse.com (US), https://jp.cloud.langfuse.com (JP) or self-hosted
45
+ ```
46
+
47
+ If not set, check `.env` in the project root: `export $(grep -v '^#' .env | xargs)`. If `LANGFUSE_HOST` is used instead of `LANGFUSE_BASE_URL`, run `export LANGFUSE_BASE_URL="$LANGFUSE_HOST"`.
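The `export $(grep ... | xargs)` one-liner above splits on whitespace, so a value containing spaces or quotes may not survive intact. A minimal alternative sketch follows, assuming the `.env` file uses shell-compatible `KEY=value` syntax and is trusted (sourcing executes whatever the file contains). The `load_env` helper name is hypothetical and not part of Langfuse.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Langfuse): export every variable
# assigned in a shell-compatible .env file.
load_env() {
  local env_file="${1:-.env}"
  # Prefix bare filenames with ./ so `.` does not search PATH for them
  case "$env_file" in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  [ -f "$env_file" ] || { echo "missing $env_file" >&2; return 1; }
  set -a            # auto-export variables assigned while sourcing
  . "$env_file"
  set +a
  # Fall back to LANGFUSE_HOST when LANGFUSE_BASE_URL is not set
  if [ -z "${LANGFUSE_BASE_URL:-}" ] && [ -n "${LANGFUSE_HOST:-}" ]; then
    export LANGFUSE_BASE_URL="$LANGFUSE_HOST"
  fi
}
```

Calling `load_env` with no argument reads `.env` from the current directory; pass a path to load a different file.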
48
+
49
+ ```bash
50
+ AUTH=$(printf '%s' "${LANGFUSE_PUBLIC_KEY}:${LANGFUSE_SECRET_KEY}" | base64 | tr -d '\n') # strip newlines: some base64 implementations wrap long output
51
+
52
+ # Verify before proceeding
53
+ STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
54
+ -H "Authorization: Basic $AUTH" \
55
+ "${LANGFUSE_BASE_URL}/api/public/projects")
56
+ echo "Auth check: $STATUS"
57
+ ```
58
+
59
+ If status is not `200`, stop and ask the user to check their credentials and host before continuing.
60
+
61
+ ### Annotation target: OBSERVATION versus TRACE
62
+
63
+ > **CRITICAL:** In OpenTelemetry-instrumented apps, trace-level `input`/`output` can be null because the content often lives in a GENERATION observation. When adding items to annotation queues, always consider whether the right target is `objectType: OBSERVATION` pointing to the GENERATION observation ID rather than the trace itself.
64
+
65
+ ### Annotation queues
66
+
67
+ > **CRITICAL:** Queues cannot be updated or deleted after creation. Create score configs first, then the queue with all config IDs. To add new configs later, create a new queue.
68
+
69
+
70
+ **Always give the user a direct link immediately after creating a queue:**
71
+
72
+ | Host | URL pattern |
73
+ |------|-------------|
74
+ | EU cloud | `https://cloud.langfuse.com/project/<projectId>/annotation-queues/<queueId>` |
75
+ | US cloud | `https://us.cloud.langfuse.com/project/<projectId>/annotation-queues/<queueId>` |
76
+ | Self-hosted | `<LANGFUSE_BASE_URL>/project/<projectId>/annotation-queues/<queueId>` |
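The URL pattern in the table can be assembled mechanically once the project and queue IDs are known. A small sketch, in which `queue_url` is a hypothetical helper name and the IDs in the usage line are placeholders, not real values:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the annotation-queue UI link from the
# "<base>/project/<projectId>/annotation-queues/<queueId>" pattern.
queue_url() {
  local base="${1%/}"      # drop one trailing slash from the base URL
  local project_id="$2"
  local queue_id="$3"
  printf '%s/project/%s/annotation-queues/%s\n' "$base" "$project_id" "$queue_id"
}

# Usage with placeholder IDs; falls back to the EU cloud host if unset
queue_url "${LANGFUSE_BASE_URL:-https://cloud.langfuse.com}" "<projectId>" "<queueId>"
```

Because the base URL is a parameter, the same helper covers the EU, US, and self-hosted rows.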
77
+
78
+ Instruction to give: *"Please open code the first ~50 examples. For each trace, write what you observe in the `open_coding` field (describe behaviour, don't diagnose root causes), then set `pass_fail_assessment` to Pass or Fail."*
79
+
80
+
81
+ ### Prompt fixes
82
+
83
+ When a category warrants a prompt fix, always offer the user two options:
84
+ 1. Create it as a versioned prompt in Langfuse (tracked, usable via the prompt API)
85
+ 2. Draft the specific text change for them to review and apply
86
+
87
+ ### Set up evaluators
88
+
89
+ When a category warrants an evaluator, propose the type of evaluator and offer to set it up for the user via the CLI.
90
+
91
+
92
+ ### Common gotchas
93
+
94
+ | Mistake | Fix |
95
+ |---------|-----|
96
+ | `objectType: TRACE` in queue | Use `objectType: OBSERVATION` with GENERATION obs ID |
97
+ | Creating score config without checking existing | `GET /api/public/score-configs` first; can't delete |
98
+ | Queue created before score configs | Create configs → collect IDs → create queue |
99
+ | `--limit` > 100 on traces list | API hard cap; paginate with `--page` |
100
+ | No rate limiting on queue item creation | `sleep 0.4` between calls to avoid 429 |
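The last two rows of the table (paginate instead of exceeding the limit, and pause between queue-item calls) can be sketched as two small loops. This is an illustration under stated assumptions, not the Langfuse CLI itself: `fetch_page` and `add_queue_item` are hypothetical stand-ins that you would replace with the real list and create calls, and the stub data is invented for the demo.

```bash
#!/usr/bin/env bash
# HYPOTHETICAL stub: replace with the real list call. It must print one ID
# per line for the given page and limit, and nothing once pages run out.
fetch_page() {
  case "$1" in
    1) printf 'trace-a\ntrace-b\n' ;;
    2) printf 'trace-c\n' ;;
  esac
}

# HYPOTHETICAL stub: replace with the real queue-item creation call.
add_queue_item() {
  echo "queued $1"
}

# Walk pages until an empty page comes back, never asking for more than 100.
fetch_all_pages() {
  local limit="${1:-100}"
  local page=1 items
  [ "$limit" -le 100 ] || limit=100
  while :; do
    items=$(fetch_page "$page" "$limit") || return 1
    [ -n "$items" ] || break
    printf '%s\n' "$items"
    page=$((page + 1))
  done
}

# Read IDs from stdin and create queue items with a pause between calls.
add_items_throttled() {
  local delay="${1:-0.4}" id
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    add_queue_item "$id" || return 1
    sleep "$delay"
  done
}
```

The two functions compose as `fetch_all_pages 100 | add_items_throttled 0.4`, which keeps the paging logic separate from the rate limiting.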
@@ -0,0 +1,134 @@
1
+ ---
2
+ name: langfuse-observability
3
+ description: Instrument LLM applications with Langfuse tracing. Use when setting up Langfuse, adding observability to LLM calls, or auditing existing instrumentation.
4
+ ---
5
+
6
+ # Langfuse Observability
7
+
8
+ Instrument LLM applications with Langfuse tracing, following best practices and tailored to your use case.
9
+
10
+ ## Workflow
11
+
12
+ ### 1. Assess Current State
13
+
14
+ Check the project:
15
+
16
+ - Is Langfuse SDK installed?
17
+ - What LLM frameworks are used? (OpenAI SDK, LangChain, LlamaIndex, Vercel AI SDK, etc.)
18
+ - Is there existing instrumentation?
19
+
20
+ **No integration yet:** Set up Langfuse using a framework integration if available. Integrations capture more context automatically and require less code than manual instrumentation.
21
+
22
+ **Integration exists:** Audit against baseline requirements below.
23
+
24
+ ### 2. Verify Baseline Requirements
25
+
26
+ Every trace should have these fundamentals:
27
+
28
+ | Requirement | Check | Why |
29
+ | ------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------ |
30
+ | Model name | Is the LLM model captured? | Enables model comparison and filtering |
31
+ | Token usage | Are input/output tokens tracked? | Enables automatic cost calculation |
32
+ | Good trace names | Are names descriptive? (`chat-response`, not `trace-1`) | Makes traces findable and filterable |
33
+ | Span hierarchy | Are multi-step operations nested properly? | Shows which step is slow or failing |
34
+ | Correct observation types | Are generations marked as generations? | Enables model-specific analytics |
35
+ | Sensitive data masked | Is PII/confidential data excluded or masked? | Prevents data leakage |
36
+ | Trace input/output | Does the trace capture meaningful input/output? Is input explicitly set to show only relevant data (e.g., user message), not all function args? | Makes traces readable in the UI and avoids leaking sensitive args |
37
+
38
+ Framework integrations (OpenAI, LangChain, etc.) handle model name, tokens, and observation types automatically. Prefer integrations over manual instrumentation.
39
+
40
+ Docs: https://langfuse.com/docs/tracing
41
+
42
+ ### 3. Explore Traces First
43
+
44
+ Once baseline instrumentation is working, encourage the user to explore their traces in the Langfuse UI before adding more context:
45
+
46
+ "Your traces are now appearing in Langfuse. Take a look at a few of them—see what data is being captured, what's useful, and what's missing. This will help us decide what additional context to add."
47
+
48
+ This helps the user:
49
+
50
+ - Understand what they're already getting
51
+ - Form opinions about what's missing
52
+ - Ask better questions about what they need
53
+
54
+ ### 4. Discover Additional Context Needs
55
+
56
+ Determine what additional instrumentation would be valuable. **Infer from code when possible, only ask when unclear.**
57
+
58
+ **Infer from code:**
59
+
60
+ | If you see in code... | Infer | Suggest |
61
+ | ---------------------------------------------------- | ----------------- | ------------------------- |
62
+ | Conversation history, chat endpoints, message arrays | Multi-turn app | `session_id` |
63
+ | User authentication, `user_id` variables | User-aware app | `user_id` on traces |
64
+ | Multiple distinct endpoints/features | Multi-feature app | `feature` tag |
65
+ | Customer/tenant identifiers | Multi-tenant app | `customer_id` or tier tag |
66
+ | Feedback collection, ratings | Has user feedback | Capture as scores |
67
+
68
+ **Only ask when not obvious from code:**
69
+
70
+ - "How do you know when a response is good vs bad?" → Determines scoring approach
71
+ - "What would you want to filter by in a dashboard?" → Surfaces non-obvious tags
72
+ - "Are there different user segments you'd want to compare?" → Customer tiers, plans, etc.
73
+
74
+ **Additions and their value:**
75
+
76
+ | Addition | Why | Docs |
77
+ | ------------------- | ------------------------------------------- | --------------------------------------------------- |
78
+ | `session_id` | Groups conversations together | https://langfuse.com/docs/tracing-features/sessions |
79
+ | `user_id` | Enables user filtering and cost attribution | https://langfuse.com/docs/tracing-features/users |
80
+ | User feedback score | Enables quality filtering and trends | https://langfuse.com/docs/scores/overview |
81
+ | `feature` tag | Per-feature analytics | https://langfuse.com/docs/tracing-features/tags |
82
+ | `customer_tier` tag | Cost/quality breakdown by segment | https://langfuse.com/docs/tracing-features/tags |
83
+
84
+ These are NOT baseline requirements—only add what's relevant based on inference or user input.
85
+
86
+ ### 5. Guide to UI
87
+
88
+ After adding context, point users to relevant UI features:
89
+
90
+ - Traces view: See individual requests
91
+ - Sessions view: See grouped conversations (if session_id added)
92
+ - Dashboard: Build filtered views using tags
93
+ - Scores: Filter by quality metrics
94
+
95
+ ## Framework Integrations
96
+
97
+ Prefer these over manual instrumentation:
98
+
99
+ | Framework | Integration | Docs |
100
+ | ------------- | ---------------------- | ---------------------------------------------------- |
101
+ | OpenAI SDK | Drop-in replacement | https://langfuse.com/docs/integrations/openai |
102
+ | LangChain | Callback handler | https://langfuse.com/docs/integrations/langchain |
103
+ | LlamaIndex | Callback handler | https://langfuse.com/docs/integrations/llama-index |
104
+ | Vercel AI SDK | OpenTelemetry exporter | https://langfuse.com/docs/integrations/vercel-ai-sdk |
105
+ | LiteLLM | Callback or proxy | https://langfuse.com/docs/integrations/litellm |
106
+
107
+ Full list: https://langfuse.com/docs/integrations
108
+
109
+ ## Always Explain Why
110
+
111
+ When suggesting additions, explain the user benefit:
112
+
113
+ ```
114
+ "I recommend adding session_id to your traces.
115
+
116
+ Why: This groups messages from the same conversation together.
117
+ You'll be able to see full conversation flows in the Sessions view,
118
+ making it much easier to debug multi-turn interactions.
119
+
120
+ Learn more: https://langfuse.com/docs/tracing-features/sessions"
121
+ ```
122
+
123
+ ## Common Mistakes
124
+
125
+ | Mistake | Problem | Fix |
126
+ | ---------------------------------------------- | --------------------------------------------------- | --------------------------------------------------------------------------------- |
127
+ | No `flush()` in scripts | Traces never sent | Call `langfuse.flush()` before exit |
128
+ | Flat traces | Can't see which step failed | Use nested spans for distinct steps |
129
+ | Generic trace names | Hard to filter | Use descriptive names: `chat-response`, `doc-summary` |
130
+ | Logging sensitive data | Data leakage risk | Mask PII before tracing |
131
+ | Not explicitly setting input with `@observe` | All function args become trace input (including API keys, configs) | Python: use `langfuse.update_current_span(input=...)`. JS/TS: use `updateActiveObservation({ input: ... })`. Set only the relevant input (e.g., user message) |
132
+ | Manual instrumentation when integration exists | More code, less context | Use framework integration |
133
+ | Langfuse import before env vars loaded | Langfuse initializes with missing/wrong credentials | Import Langfuse AFTER loading environment variables (e.g., after `load_dotenv()`) |
134
+ | Wrong import order with OpenAI | Langfuse can't patch the OpenAI client | Import Langfuse and call its setup BEFORE importing OpenAI client |
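Two of the rows above (masking PII, and setting trace input explicitly instead of logging every function argument) can be combined in one pre-processing step. The sketch below uses only the standard library; the names `mask_text` and `safe_trace_input`, the `user_message` key, and the email-only pattern are illustrative assumptions, and real PII detection needs considerably more than one regex.

```python
import re

# Deliberately simple pattern for illustration: matches most plain email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_text(text: str) -> str:
    """Replace email addresses with a fixed placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def safe_trace_input(function_args: dict, relevant_key: str = "user_message") -> dict:
    """Keep only the relevant argument, masked; every other argument is dropped."""
    value = function_args.get(relevant_key, "")
    return {relevant_key: mask_text(str(value))}
```

The returned dictionary is what would be passed as the `input=` value in the `update_current_span` call named in the table, so API keys and config objects in the other arguments never reach the trace.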
@@ -0,0 +1,288 @@
1
+ ---
2
+ name: langfuse-judge-calibration
3
+ description: Calibrate and validate LLM-as-a-Judge evaluators against dataset ground truth. Runs the judge prompt as a Langfuse dataset experiment, compares judge outputs with dataset item expected outputs, and reports simple accuracy or advanced confusion-matrix metrics. Use this guide whenever a user asks if their LLM judge is actually useful,
4
+ aligned with human judgment, or safe to trust for monitoring decisions.
5
+ ---
6
+
7
+ # Judge Calibration (LLM-as-a-Judge)
8
+
9
+ ## Goal
10
+
11
+ Validate judge outputs against human labels using the smallest reliable workflow
12
+ for the user's goal.
13
+
14
+ Default to a **Langfuse dataset experiment** when the user has a Langfuse
15
+ dataset or wants results in the Langfuse Experiments UI.
16
+
17
+ Default to **simple calibration** unless the user asks for deeper metrics,
18
+ split-based validation, thresholding, or production automation.
19
+
20
+ ## 1) Choose the calibration mode
21
+
22
+ ### Simple calibration
23
+
24
+ Use this when the user wants a quick answer like "does this judge basically
25
+ match human labels?" or explicitly asks for accuracy only.
26
+
27
+ - No train/dev/test split is required.
28
+ - Compute `exact_match` for each valid row.
29
+ - Report valid sample size, invalid-label count, accuracy, and a short
30
+ recommendation.
31
+ - Do not include Precision/Recall/F1, TPR/TNR, denominator notes, or top failure
32
+ direction unless the user asks for advanced metrics.
33
+
34
+ ### Advanced calibration
35
+
36
+ Use this when the user asks for confusion matrix metrics, thresholds,
37
+ production monitoring, high-stakes automation, or train/test-style validation.
38
+
39
+ - If split labels exist, keep train/dev/test separate.
40
+ - If no split exists, compute metrics on the provided rows and state that this is
41
+ not a held-out final quality claim.
42
+ - Compute TP/FP/FN/TN and derived metrics.
43
+ - Use `references/error-analysis.md` for qualitative diagnosis of disagreements.
44
+
45
+ ## 2) Primary workflow
46
+
47
+ 1. Confirm the dataset name, ground-truth label location in `expectedOutput`,
48
+ judge prompt name/version, judge model, and label vocabulary.
49
+ 2. Choose simple or advanced mode. If ambiguous, use simple mode.
50
+ 3. Run the judge prompt against each dataset item input as a Langfuse experiment.
51
+ 4. Compare the judge output to `item.expected_output` in evaluator functions.
52
+ 5. Return the matching report format from section 7.
53
+
54
+ ## 3) Langfuse experiment workflow
55
+
56
+ Use the SDK experiment runner as the default implementation. Running it against a
57
+ Langfuse-hosted dataset automatically creates a dataset run that can be inspected and compared
58
+ in the Langfuse UI.
59
+
60
+ Before implementing, you **must** retrieve the current experiment SDK
61
+ documentation from the Langfuse docs. Do not rely on memory, because the SDK changes
62
+ frequently. Fetch these pages (see SKILL.md section 2 for retrieval methods):
63
+
64
+ - [Experiments via SDK](https://langfuse.com/docs/evaluation/experiments/experiments-via-sdk) — primary reference for `dataset.run_experiment`, task/evaluator signatures, and `Evaluation` return shape
65
+ - [Datasets](https://langfuse.com/docs/evaluation/experiments/datasets) — dataset item structure (`input`, `expectedOutput`) and how to load a hosted dataset
66
+ - [Experiments data model](https://langfuse.com/docs/evaluation/experiments/data-model) — how runs, items, and scores relate in the UI
67
+
68
+ High-level shape of a simple-mode calibration experiment:
69
+
70
+ ```
71
+ load dataset and judge prompt from Langfuse
72
+ define POSITIVE / NEGATIVE label set
73
+
74
+ task(item):
75
+     compile judge prompt with item.input
76
+     call judge model
77
+     return normalized label
78
+     # never read item.expected_output here: that would leak the answer
79
+
80
+ item_evaluator(output, expected_output):
81
+     if either label is outside the allowed set:
82
+         return invalid (excluded from accuracy denominator)
83
+     return exact_match score (0 or 1)
84
+
85
+ run_evaluator(item_results):
86
+     accuracy = matches / valid_rows
87
+     return aggregate score + invalid_label count
88
+
89
+ dataset.run_experiment(task, [item_evaluator], [run_evaluator],
90
+                        metadata={calibration_mode, judge_prompt, labels})
91
+ ```
92
+
93
+ For advanced calibration, the item evaluator emits one score per confusion-matrix
94
+ cell (`judge-is-tp`, `judge-is-fp`, `judge-is-fn`, `judge-is-tn`), and the run
95
+ evaluator aggregates them into Precision / Recall / F1 / TPR / TNR. See section
96
+ 6 for the metric definitions and zero-denominator guardrails.
97
+
98
+ Rules:
99
+ - Use a Langfuse-hosted dataset when the user wants a real Langfuse experiment.
100
+ Local SDK datasets create traces and scores, but not Langfuse dataset runs.
101
+ - The dataset item `input` must contain everything needed to run the judge
102
+ prompt. The dataset item `expectedOutput` must contain the ground-truth label.
103
+ - Never pass `expectedOutput` into the judge prompt or task. That would leak the
104
+ answer and invalidate calibration.
105
+ - Return `Evaluation(...)` objects from item and run evaluators for stable SDK
106
+ formatting and score ingestion.
107
+ - Store prompt name/version, judge model, label vocabulary, dataset version, and
108
+ calibration mode in run metadata.
109
+ - Use a unique run name so the experiment appears as a separate dataset run.
110
+
111
+ ## 4) Label validation
112
+
113
+ Never silently treat unknown labels as negative.
114
+
115
+ - For binary judges, define the positive and negative labels before computing
116
+ confusion-matrix metrics.
117
+ - Normalize only deterministic differences such as surrounding whitespace or
118
+ casing, and mention that normalization was applied.
119
+ - Rows where `expected` or `actual` is outside the allowed label set are invalid.
120
+ Exclude them from metric denominators and report the invalid count.
121
+ - For multi-class judges, use simple `exact_match` / accuracy unless the user
122
+ defines one positive class for binary metrics.
123
+
124
+ Example binary labels:
125
+ - Positive: `ESCALATE`
126
+ - Negative: `RESOLVE`
127
+
128
+ ## 5) Simple metrics
129
+
130
+ For each valid row:
131
+
132
+ - `exact_match = 1 if actual == expected else 0`
133
+
134
+ Aggregate:
135
+
136
+ - `accuracy = sum(exact_match) / valid_rows`
137
+
138
+ If `valid_rows == 0`, report that accuracy is undefined and ask for valid
139
+ expected/actual labels.
140
+
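The simple-mode arithmetic, including the invalid-label exclusion from section 4, can be sketched in plain Python. The function name `simple_accuracy`, the `(expected, actual)` tuple shape, and the returned keys are assumptions made for illustration; the labels are the example pair from section 4.

```python
from typing import Iterable, Optional, Tuple

ALLOWED = {"ESCALATE", "RESOLVE"}  # example labels from section 4


def normalize(label: str) -> str:
    """Apply only deterministic normalization: surrounding whitespace and casing."""
    return label.strip().upper()


def simple_accuracy(rows: Iterable[Tuple[str, str]]) -> dict:
    """Compute accuracy over (expected, actual) pairs, excluding invalid rows."""
    total = valid = matches = 0
    for expected, actual in rows:
        total += 1
        expected, actual = normalize(expected), normalize(actual)
        if expected not in ALLOWED or actual not in ALLOWED:
            continue  # invalid row: counted in total, excluded from the denominator
        valid += 1
        matches += int(expected == actual)  # exact_match for this row
    accuracy: Optional[float] = matches / valid if valid else None  # None = undefined
    return {
        "total_rows": total,
        "valid_rows": valid,
        "invalid_labels": total - valid,
        "accuracy": accuracy,
    }
```

Returning `None` rather than `0.0` when there are no valid rows keeps "undefined" distinguishable from "always wrong" in the report.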
141
+ ## 6) Advanced metrics
142
+
143
+ ### Dataset and split discipline
144
+
145
+ Use split discipline only when the user asks for advanced validation or final
146
+ quality claims.
147
+
148
+ - **Train**: optional few-shot examples for the judge prompt
149
+ - **Dev**: iterative prompt/model refinement
150
+ - **Test**: single final calibration pass
151
+
152
+ Do not tune on the same rows used to claim final quality. Use balanced classes
153
+ in dev/test when possible so both error directions are measurable.
154
+
155
+ ### Per-row classification mapping
156
+
157
+ For each row with `expected` and `actual` labels:
158
+
159
+ - **TP**: expected = positive and actual = positive
160
+ - **FP**: expected = negative and actual = positive
161
+ - **FN**: expected = positive and actual = negative
162
+ - **TN**: expected = negative and actual = negative
163
+
164
+ Also compute:
165
+ - `exact_match = 1 if actual == expected else 0`
166
+
167
+ ### Aggregate metrics
168
+
169
+ From aggregate counts:
170
+ - `accuracy = (TP + TN) / valid_rows`
171
+ - `precision = TP / (TP + FP)`
172
+ - `recall = TP / (TP + FN)`
173
+ - `f1 = 2 * precision * recall / (precision + recall)`
174
+ - `TPR = TP / (TP + FN)`
175
+ - `TNR = TN / (TN + FP)`
176
+
177
+ Guardrails:
178
+ - if `TP + FP == 0`, precision is undefined (report null + note)
179
+ - if `TP + FN == 0`, recall and TPR are undefined (report null + note)
180
+ - if `TN + FP == 0`, TNR is undefined (report null + note)
181
+ - if `precision + recall == 0`, set `f1 = 0`
182
+
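The aggregate formulas and guardrails above translate directly into code. In this sketch the names `safe_div` and `confusion_metrics` are illustrative, `None` stands for "undefined (report null + note)", and reporting `f1` as `None` when precision or recall is itself undefined is an assumption, since the guardrails only specify the `precision + recall == 0` case.

```python
from typing import Optional


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive accuracy, precision, recall, F1, TPR, and TNR from aggregate counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # identical to TPR by definition
    if precision is None or recall is None:
        f1 = None  # assumption: F1 is undefined when either input is undefined
    elif precision + recall == 0:
        f1 = 0.0  # guardrail from the list above
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "tpr": recall,
        "tnr": safe_div(tn, tn + fp),
    }
```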
183
+ ### Advanced quality gates
184
+
185
+ Before trusting the judge on production traffic:
186
+
187
+ 1. **Split integrity**: no leakage from held-out rows into prompt examples.
188
+ 2. **Confusion matrix sanity**: `TP + FP + FN + TN == valid_rows`.
189
+ 3. **Metric recomputation check**: recompute aggregate stats from row-level
190
+ flags and compare.
191
+ 4. **TPR/TNR review**: inspect both directions for class-direction bias.
192
+ 5. **Threshold**: target `TPR > 0.90` and `TNR > 0.90` before high-stakes
193
+ automation.
194
+
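Gates 2 and 5 are mechanical and can be checked in code; gates 1, 3, and 4 still need access to the data and a human review. The sketch below assumes a hypothetical function name `passes_quality_gates` and treats an undefined TPR or TNR as a failed gate, which is a conservative choice not stated in the list above.

```python
from typing import List, Tuple


def passes_quality_gates(
    tp: int, fp: int, fn: int, tn: int, valid_rows: int, min_rate: float = 0.90
) -> Tuple[bool, List[str]]:
    """Check confusion-matrix sanity and the TPR/TNR threshold; return (ok, reasons)."""
    reasons: List[str] = []
    if tp + fp + fn + tn != valid_rows:
        reasons.append("confusion matrix does not sum to valid_rows")
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    # The target is strictly greater than the threshold, so exactly 0.90 does not pass.
    if tpr is None or tpr <= min_rate:
        reasons.append(f"TPR not above {min_rate}")
    if tnr is None or tnr <= min_rate:
        reasons.append(f"TNR not above {min_rate}")
    return (not reasons, reasons)
```

Returning the list of reasons alongside the boolean makes it straightforward to copy the failing gate into the advanced report's recommendation.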
195
+ ## 7) Report format
196
+
197
+ ### Simple report
198
+
199
+ Return only:
200
+ - dataset name and dataset run URL when available
201
+ - valid rows / total rows
202
+ - invalid-label count
203
+ - accuracy
204
+ - one-sentence recommendation
205
+
206
+ ### Advanced report
207
+
208
+ Add:
209
+ - confusion matrix: TP, FP, FN, TN
210
+ - accuracy, precision, recall, F1, TPR, TNR
211
+ - denominator notes for undefined metrics
212
+ - top failure direction: false positives or false negatives
213
+ - recommendation: ship, iterate, collect more labels, or do not automate
214
+
215
+ ## 8) Langfuse implementation notes
216
+
217
+ Prefer SDK experiment evaluators for score creation. They attach item-level
218
+ scores to the experiment traces and run-level scores to the dataset run.
219
+
220
+ Use manual REST score creation only as a fallback when not using the SDK
221
+ experiment runner, or for local smoke tests. See
222
+ [Scores via SDK](https://langfuse.com/docs/evaluation/evaluation-methods/scores-via-sdk)
223
+ and the [Scores API reference](https://langfuse.com/docs/api) (`POST /api/public/scores`)
224
+ for the current payload shape. Do not use the current `langfuse-cli` score-create
225
+ wrapper unless `--help` shows a usable `value` argument; `langfuse-cli@0.0.10`
226
+ exposes `legacy-score-v1s create` but cannot pass the required score `value`.
227
+
228
+ Score names to emit:
229
+
230
+ Simple mode:
231
+ - `judge-exact-match`
232
+ - `judge-accuracy`
233
+
234
+ Advanced mode:
235
+ - `judge-exact-match`
236
+ - `judge-is-tp`
237
+ - `judge-is-fp`
238
+ - `judge-is-fn`
239
+ - `judge-is-tn`
240
+
241
+ Recommended metadata:
242
+ - `expected_label`, `actual_label`
243
+ - calibration mode: `simple` or `advanced`
244
+ - positive/negative labels when binary metrics are used
245
+ - evaluator prompt name+version
246
+ - dataset/split version when used
247
+ - run identifier
248
+
249
+ ## 9) Classification logic (pseudo-code)
250
+
251
+ The per-row classification each evaluator must perform:
252
+
253
+ ```
254
+ normalize(label) = strip whitespace, uppercase
255
+ ALLOWED = {POSITIVE, NEGATIVE}
256
+
257
+ classify(expected, actual):
258
+     expected, actual = normalize(expected), normalize(actual)
259
+
260
+     if expected ∉ ALLOWED or actual ∉ ALLOWED:
261
+         mark row invalid → exclude from denominators
262
+
263
+     exact_match = (expected == actual)
264
+     is_tp = (expected == POSITIVE and actual == POSITIVE)
265
+     is_fp = (expected == NEGATIVE and actual == POSITIVE)
266
+     is_fn = (expected == POSITIVE and actual == NEGATIVE)
267
+     is_tn = (expected == NEGATIVE and actual == NEGATIVE)
268
+ ```
269
+
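A runnable Python rendering of the pseudo-code above, as a sketch rather than a reference implementation. `POSITIVE` and `NEGATIVE` reuse the example labels from section 4, and the names `classify` and `count_cells` are illustrative; neither comes from the Langfuse SDK.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

POSITIVE, NEGATIVE = "ESCALATE", "RESOLVE"  # example labels from section 4


def classify(expected: str, actual: str) -> Optional[str]:
    """Return 'tp', 'fp', 'fn', or 'tn', or None when either label is invalid."""
    expected, actual = expected.strip().upper(), actual.strip().upper()
    allowed = {POSITIVE, NEGATIVE}
    if expected not in allowed or actual not in allowed:
        return None  # invalid row: the caller excludes it from denominators
    if actual == POSITIVE:
        return "tp" if expected == POSITIVE else "fp"
    return "fn" if expected == POSITIVE else "tn"


def count_cells(rows: Iterable[Tuple[str, str]]) -> dict:
    """Tally confusion-matrix cells and invalid rows over (expected, actual) pairs."""
    counts = Counter({"tp": 0, "fp": 0, "fn": 0, "tn": 0, "invalid": 0})
    for expected, actual in rows:
        counts[classify(expected, actual) or "invalid"] += 1
    return dict(counts)
```

Because each valid row lands in exactly one cell, `tp + fp + fn + tn` equals the valid row count by construction, which is the sanity check listed under the advanced quality gates.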
270
+ ## 10) Common failure modes
271
+
272
+ - label vocabulary not constrained (judge outputs free text instead of strict
273
+ labels)
274
+ - positive/negative label inversion between annotators and evaluator code
275
+ - leaking `expectedOutput` into the judge task instead of only the evaluator
276
+ - using local SDK data when the user expects a Langfuse dataset run in the UI
277
+ - reporting only accuracy when classes are imbalanced and error direction matters
278
+ - calculating F1 without explicit zero-denominator handling
279
+ - making advanced validation claims without a held-out split
280
+
281
+ ## 11) What to do after calibration
282
+
283
+ - If simple accuracy is enough: report it and stop.
284
+ - If metrics are weak and advanced validation is needed: iterate prompt and
285
+ few-shots on dev data only.
286
+ - If metrics pass: freeze the baseline and monitor drift over time.
287
+ - For qualitative diagnosis of disagreements, switch to
288
+ `references/error-analysis.md`.