xtrm-tools 0.7.0 → 0.7.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (633)
  1. package/.xtrm/config/README.md +10 -0
  2. package/{config/mcp_servers.json → .xtrm/config/claude.mcp.json} +0 -28
  3. package/{config → .xtrm/config}/instructions/agents-top.md +12 -3
  4. package/{config → .xtrm/config}/instructions/claude-top.md +12 -3
  5. package/{config → .xtrm/config}/pi/extensions/beads/index.ts +44 -13
  6. package/{config → .xtrm/config}/pi/extensions/custom-footer/index.ts +59 -82
  7. package/{config → .xtrm/config}/pi/extensions/xtrm-ui/index.ts +2 -2
  8. package/{config → .xtrm/config}/pi/install-schema.json +2 -2
  9. package/.xtrm/config/pi.mcp.json +39 -0
  10. package/.xtrm/config/settings.json +41 -0
  11. package/.xtrm/extensions/auto-session-name/index.ts +29 -0
  12. package/.xtrm/extensions/auto-session-name/package.json +16 -0
  13. package/.xtrm/extensions/auto-update/index.ts +71 -0
  14. package/.xtrm/extensions/auto-update/package.json +16 -0
  15. package/.xtrm/extensions/beads/index.ts +232 -0
  16. package/.xtrm/extensions/beads/package.json +19 -0
  17. package/.xtrm/extensions/compact-header/index.ts +69 -0
  18. package/.xtrm/extensions/compact-header/package.json +16 -0
  19. package/.xtrm/extensions/core/adapter.ts +52 -0
  20. package/.xtrm/extensions/core/guard-rules.ts +100 -0
  21. package/.xtrm/extensions/core/lib.ts +3 -0
  22. package/.xtrm/extensions/core/logger.ts +45 -0
  23. package/.xtrm/extensions/core/package.json +18 -0
  24. package/.xtrm/extensions/core/runner.ts +71 -0
  25. package/.xtrm/extensions/core/session-state.ts +59 -0
  26. package/.xtrm/extensions/custom-footer/index.ts +398 -0
  27. package/.xtrm/extensions/custom-footer/package.json +19 -0
  28. package/.xtrm/extensions/custom-provider-qwen-cli/index.ts +363 -0
  29. package/.xtrm/extensions/custom-provider-qwen-cli/package.json +1 -0
  30. package/.xtrm/extensions/git-checkpoint/index.ts +53 -0
  31. package/.xtrm/extensions/git-checkpoint/package.json +16 -0
  32. package/.xtrm/extensions/lsp-bootstrap/index.ts +134 -0
  33. package/.xtrm/extensions/lsp-bootstrap/package.json +17 -0
  34. package/.xtrm/extensions/pi-serena-compact/index.ts +121 -0
  35. package/.xtrm/extensions/pi-serena-compact/package.json +16 -0
  36. package/.xtrm/extensions/quality-gates/index.ts +66 -0
  37. package/.xtrm/extensions/quality-gates/package.json +19 -0
  38. package/.xtrm/extensions/service-skills/index.ts +108 -0
  39. package/.xtrm/extensions/service-skills/package.json +19 -0
  40. package/.xtrm/extensions/session-flow/index.ts +96 -0
  41. package/.xtrm/extensions/session-flow/package.json +19 -0
  42. package/.xtrm/extensions/xtrm-loader/index.ts +152 -0
  43. package/.xtrm/extensions/xtrm-loader/package.json +19 -0
  44. package/.xtrm/extensions/xtrm-ui/format.ts +93 -0
  45. package/.xtrm/extensions/xtrm-ui/index.ts +1044 -0
  46. package/.xtrm/extensions/xtrm-ui/package.json +10 -0
  47. package/.xtrm/extensions/xtrm-ui/themes/pidex-dark.json +85 -0
  48. package/.xtrm/extensions/xtrm-ui/themes/pidex-light.json +85 -0
  49. package/{hooks → .xtrm/hooks}/README.md +2 -1
  50. package/{hooks → .xtrm/hooks}/beads-commit-gate.mjs +4 -0
  51. package/.xtrm/hooks/beads-memory-gate.mjs +119 -0
  52. package/{plugins/xtrm-tools → .xtrm}/hooks/quality-check-env.mjs +1 -4
  53. package/.xtrm/hooks/statusline.mjs +156 -0
  54. package/{plugins/xtrm-tools → .xtrm}/hooks/using-xtrm-reminder.mjs +8 -7
  55. package/.xtrm/registry.json +1323 -0
  56. package/CHANGELOG.md +31 -0
  57. package/README.md +20 -3
  58. package/cli/dist/index.cjs +26796 -30901
  59. package/cli/dist/index.cjs.map +1 -1
  60. package/cli/package.json +6 -3
  61. package/package.json +15 -13
  62. package/scripts/ghgrep.mjs +358 -0
  63. package/.claude-plugin/marketplace.json +0 -19
  64. package/.claude-plugin/plugin.json +0 -9
  65. package/config/hooks.json +0 -83
  66. package/config/settings.json +0 -70
  67. package/hooks/beads-memory-gate.mjs +0 -94
  68. package/hooks/quality-check-env.mjs +0 -79
  69. package/hooks/statusline.mjs +0 -183
  70. package/hooks/tsconfig-cache.json +0 -4
  71. package/hooks/using-xtrm-reminder.mjs +0 -47
  72. package/plugins/xtrm-tools/.claude-plugin/plugin.json +0 -9
  73. package/plugins/xtrm-tools/.mcp.json +0 -18
  74. package/plugins/xtrm-tools/hooks/README.md +0 -61
  75. package/plugins/xtrm-tools/hooks/beads-claim-sync.mjs +0 -154
  76. package/plugins/xtrm-tools/hooks/beads-commit-gate.mjs +0 -70
  77. package/plugins/xtrm-tools/hooks/beads-compact-restore.mjs +0 -77
  78. package/plugins/xtrm-tools/hooks/beads-compact-save.mjs +0 -63
  79. package/plugins/xtrm-tools/hooks/beads-edit-gate.mjs +0 -85
  80. package/plugins/xtrm-tools/hooks/beads-gate-core.mjs +0 -236
  81. package/plugins/xtrm-tools/hooks/beads-gate-messages.mjs +0 -75
  82. package/plugins/xtrm-tools/hooks/beads-gate-utils.mjs +0 -176
  83. package/plugins/xtrm-tools/hooks/beads-memory-gate.mjs +0 -94
  84. package/plugins/xtrm-tools/hooks/beads-stop-gate.mjs +0 -53
  85. package/plugins/xtrm-tools/hooks/gitnexus/gitnexus-hook.cjs +0 -222
  86. package/plugins/xtrm-tools/hooks/hooks.json +0 -129
  87. package/plugins/xtrm-tools/hooks/quality-check.cjs +0 -1286
  88. package/plugins/xtrm-tools/hooks/quality-check.py +0 -345
  89. package/plugins/xtrm-tools/hooks/statusline.mjs +0 -183
  90. package/plugins/xtrm-tools/hooks/tsconfig-cache.json +0 -4
  91. package/plugins/xtrm-tools/hooks/worktree-boundary.mjs +0 -33
  92. package/plugins/xtrm-tools/hooks/xtrm-logger.mjs +0 -123
  93. package/plugins/xtrm-tools/hooks/xtrm-session-logger.mjs +0 -27
  94. package/plugins/xtrm-tools/hooks/xtrm-tool-logger.mjs +0 -53
  95. package/plugins/xtrm-tools/skills/README.txt +0 -31
  96. package/plugins/xtrm-tools/skills/clean-code/SKILL.md +0 -201
  97. package/plugins/xtrm-tools/skills/creating-service-skills/SKILL.md +0 -433
  98. package/plugins/xtrm-tools/skills/creating-service-skills/references/script_quality_standards.md +0 -425
  99. package/plugins/xtrm-tools/skills/creating-service-skills/references/service_skill_system_guide.md +0 -278
  100. package/plugins/xtrm-tools/skills/creating-service-skills/scripts/bootstrap.py +0 -326
  101. package/plugins/xtrm-tools/skills/creating-service-skills/scripts/deep_dive.py +0 -304
  102. package/plugins/xtrm-tools/skills/creating-service-skills/scripts/scaffolder.py +0 -482
  103. package/plugins/xtrm-tools/skills/delegating/SKILL.md +0 -196
  104. package/plugins/xtrm-tools/skills/delegating/config.yaml +0 -210
  105. package/plugins/xtrm-tools/skills/delegating/references/orchestration-protocols.md +0 -41
  106. package/plugins/xtrm-tools/skills/docker-expert/SKILL.md +0 -409
  107. package/plugins/xtrm-tools/skills/documenting/CHANGELOG.md +0 -23
  108. package/plugins/xtrm-tools/skills/documenting/README.md +0 -148
  109. package/plugins/xtrm-tools/skills/documenting/SKILL.md +0 -113
  110. package/plugins/xtrm-tools/skills/documenting/examples/example_pattern.md +0 -70
  111. package/plugins/xtrm-tools/skills/documenting/examples/example_reference.md +0 -70
  112. package/plugins/xtrm-tools/skills/documenting/examples/example_ssot_analytics.md +0 -64
  113. package/plugins/xtrm-tools/skills/documenting/examples/example_workflow.md +0 -141
  114. package/plugins/xtrm-tools/skills/documenting/references/changelog-format.md +0 -97
  115. package/plugins/xtrm-tools/skills/documenting/references/metadata-schema.md +0 -136
  116. package/plugins/xtrm-tools/skills/documenting/references/taxonomy.md +0 -81
  117. package/plugins/xtrm-tools/skills/documenting/references/versioning-rules.md +0 -78
  118. package/plugins/xtrm-tools/skills/documenting/scripts/bump_version.sh +0 -60
  119. package/plugins/xtrm-tools/skills/documenting/scripts/changelog/__init__.py +0 -0
  120. package/plugins/xtrm-tools/skills/documenting/scripts/changelog/add_entry.py +0 -216
  121. package/plugins/xtrm-tools/skills/documenting/scripts/changelog/bump_release.py +0 -117
  122. package/plugins/xtrm-tools/skills/documenting/scripts/changelog/init_changelog.py +0 -54
  123. package/plugins/xtrm-tools/skills/documenting/scripts/changelog/validate_changelog.py +0 -128
  124. package/plugins/xtrm-tools/skills/documenting/scripts/drift_detector.py +0 -266
  125. package/plugins/xtrm-tools/skills/documenting/scripts/generate_template.py +0 -311
  126. package/plugins/xtrm-tools/skills/documenting/scripts/list_by_category.sh +0 -84
  127. package/plugins/xtrm-tools/skills/documenting/scripts/orchestrator.py +0 -255
  128. package/plugins/xtrm-tools/skills/documenting/scripts/validate_metadata.py +0 -242
  129. package/plugins/xtrm-tools/skills/documenting/templates/CHANGELOG.md.template +0 -13
  130. package/plugins/xtrm-tools/skills/documenting/tests/integration_test.sh +0 -70
  131. package/plugins/xtrm-tools/skills/documenting/tests/test_changelog.py +0 -201
  132. package/plugins/xtrm-tools/skills/documenting/tests/test_drift_detector.py +0 -80
  133. package/plugins/xtrm-tools/skills/documenting/tests/test_orchestrator.py +0 -52
  134. package/plugins/xtrm-tools/skills/documenting/tests/test_validate_metadata.py +0 -64
  135. package/plugins/xtrm-tools/skills/find-skills/SKILL.md +0 -133
  136. package/plugins/xtrm-tools/skills/gitnexus-exploring/SKILL.md +0 -75
  137. package/plugins/xtrm-tools/skills/gitnexus-impact-analysis/SKILL.md +0 -94
  138. package/plugins/xtrm-tools/skills/gitnexus-refactoring/SKILL.md +0 -113
  139. package/plugins/xtrm-tools/skills/hook-development/SKILL.md +0 -797
  140. package/plugins/xtrm-tools/skills/hook-development/examples/load-context.sh +0 -55
  141. package/plugins/xtrm-tools/skills/hook-development/examples/quality-check.js +0 -1168
  142. package/plugins/xtrm-tools/skills/hook-development/examples/validate-bash.sh +0 -43
  143. package/plugins/xtrm-tools/skills/hook-development/examples/validate-write.sh +0 -38
  144. package/plugins/xtrm-tools/skills/hook-development/references/advanced.md +0 -527
  145. package/plugins/xtrm-tools/skills/hook-development/references/migration.md +0 -369
  146. package/plugins/xtrm-tools/skills/hook-development/references/patterns.md +0 -412
  147. package/plugins/xtrm-tools/skills/hook-development/scripts/README.md +0 -164
  148. package/plugins/xtrm-tools/skills/hook-development/scripts/hook-linter.sh +0 -153
  149. package/plugins/xtrm-tools/skills/hook-development/scripts/test-hook.sh +0 -252
  150. package/plugins/xtrm-tools/skills/hook-development/scripts/validate-hook-schema.sh +0 -159
  151. package/plugins/xtrm-tools/skills/obsidian-cli/SKILL.md +0 -106
  152. package/plugins/xtrm-tools/skills/orchestrating-agents/SKILL.md +0 -135
  153. package/plugins/xtrm-tools/skills/orchestrating-agents/config.yaml +0 -45
  154. package/plugins/xtrm-tools/skills/orchestrating-agents/references/agent-context-integration.md +0 -37
  155. package/plugins/xtrm-tools/skills/orchestrating-agents/references/examples.md +0 -45
  156. package/plugins/xtrm-tools/skills/orchestrating-agents/references/handover-protocol.md +0 -31
  157. package/plugins/xtrm-tools/skills/orchestrating-agents/references/workflows.md +0 -42
  158. package/plugins/xtrm-tools/skills/orchestrating-agents/scripts/detect_neighbors.py +0 -23
  159. package/plugins/xtrm-tools/skills/planning/SKILL.md +0 -405
  160. package/plugins/xtrm-tools/skills/planning/evals/evals.json +0 -19
  161. package/plugins/xtrm-tools/skills/prompt-improving/README.md +0 -162
  162. package/plugins/xtrm-tools/skills/prompt-improving/SKILL.md +0 -74
  163. package/plugins/xtrm-tools/skills/prompt-improving/references/analysis_commands.md +0 -24
  164. package/plugins/xtrm-tools/skills/prompt-improving/references/chain_of_thought.md +0 -24
  165. package/plugins/xtrm-tools/skills/prompt-improving/references/mcp_definitions.md +0 -20
  166. package/plugins/xtrm-tools/skills/prompt-improving/references/multishot.md +0 -23
  167. package/plugins/xtrm-tools/skills/prompt-improving/references/xml_core.md +0 -60
  168. package/plugins/xtrm-tools/skills/python-testing/SKILL.md +0 -815
  169. package/plugins/xtrm-tools/skills/scoping-service-skills/SKILL.md +0 -231
  170. package/plugins/xtrm-tools/skills/scoping-service-skills/scripts/scope.py +0 -74
  171. package/plugins/xtrm-tools/skills/senior-backend/SKILL.md +0 -209
  172. package/plugins/xtrm-tools/skills/senior-backend/references/api_design_patterns.md +0 -103
  173. package/plugins/xtrm-tools/skills/senior-backend/references/backend_security_practices.md +0 -103
  174. package/plugins/xtrm-tools/skills/senior-backend/references/database_optimization_guide.md +0 -103
  175. package/plugins/xtrm-tools/skills/senior-backend/scripts/api_load_tester.py +0 -114
  176. package/plugins/xtrm-tools/skills/senior-backend/scripts/api_scaffolder.py +0 -114
  177. package/plugins/xtrm-tools/skills/senior-backend/scripts/database_migration_tool.py +0 -114
  178. package/plugins/xtrm-tools/skills/senior-data-scientist/SKILL.md +0 -226
  179. package/plugins/xtrm-tools/skills/senior-data-scientist/references/experiment_design_frameworks.md +0 -80
  180. package/plugins/xtrm-tools/skills/senior-data-scientist/references/feature_engineering_patterns.md +0 -80
  181. package/plugins/xtrm-tools/skills/senior-data-scientist/references/statistical_methods_advanced.md +0 -80
  182. package/plugins/xtrm-tools/skills/senior-data-scientist/scripts/experiment_designer.py +0 -100
  183. package/plugins/xtrm-tools/skills/senior-data-scientist/scripts/feature_engineering_pipeline.py +0 -100
  184. package/plugins/xtrm-tools/skills/senior-data-scientist/scripts/model_evaluation_suite.py +0 -100
  185. package/plugins/xtrm-tools/skills/senior-devops/SKILL.md +0 -209
  186. package/plugins/xtrm-tools/skills/senior-devops/references/cicd_pipeline_guide.md +0 -103
  187. package/plugins/xtrm-tools/skills/senior-devops/references/deployment_strategies.md +0 -103
  188. package/plugins/xtrm-tools/skills/senior-devops/references/infrastructure_as_code.md +0 -103
  189. package/plugins/xtrm-tools/skills/senior-devops/scripts/deployment_manager.py +0 -114
  190. package/plugins/xtrm-tools/skills/senior-devops/scripts/pipeline_generator.py +0 -114
  191. package/plugins/xtrm-tools/skills/senior-devops/scripts/terraform_scaffolder.py +0 -114
  192. package/plugins/xtrm-tools/skills/senior-security/SKILL.md +0 -209
  193. package/plugins/xtrm-tools/skills/senior-security/references/cryptography_implementation.md +0 -103
  194. package/plugins/xtrm-tools/skills/senior-security/references/penetration_testing_guide.md +0 -103
  195. package/plugins/xtrm-tools/skills/senior-security/references/security_architecture_patterns.md +0 -103
  196. package/plugins/xtrm-tools/skills/senior-security/scripts/pentest_automator.py +0 -114
  197. package/plugins/xtrm-tools/skills/senior-security/scripts/security_auditor.py +0 -114
  198. package/plugins/xtrm-tools/skills/senior-security/scripts/threat_modeler.py +0 -114
  199. package/plugins/xtrm-tools/skills/skill-creator/LICENSE.txt +0 -202
  200. package/plugins/xtrm-tools/skills/skill-creator/SKILL.md +0 -479
  201. package/plugins/xtrm-tools/skills/skill-creator/agents/analyzer.md +0 -274
  202. package/plugins/xtrm-tools/skills/skill-creator/agents/comparator.md +0 -202
  203. package/plugins/xtrm-tools/skills/skill-creator/agents/grader.md +0 -223
  204. package/plugins/xtrm-tools/skills/skill-creator/assets/eval_review.html +0 -146
  205. package/plugins/xtrm-tools/skills/skill-creator/eval-viewer/generate_review.py +0 -471
  206. package/plugins/xtrm-tools/skills/skill-creator/eval-viewer/viewer.html +0 -1325
  207. package/plugins/xtrm-tools/skills/skill-creator/references/schemas.md +0 -430
  208. package/plugins/xtrm-tools/skills/skill-creator/scripts/__init__.py +0 -0
  209. package/plugins/xtrm-tools/skills/skill-creator/scripts/aggregate_benchmark.py +0 -401
  210. package/plugins/xtrm-tools/skills/skill-creator/scripts/generate_report.py +0 -326
  211. package/plugins/xtrm-tools/skills/skill-creator/scripts/improve_description.py +0 -248
  212. package/plugins/xtrm-tools/skills/skill-creator/scripts/package_skill.py +0 -136
  213. package/plugins/xtrm-tools/skills/skill-creator/scripts/quick_validate.py +0 -103
  214. package/plugins/xtrm-tools/skills/skill-creator/scripts/run_eval.py +0 -310
  215. package/plugins/xtrm-tools/skills/skill-creator/scripts/run_loop.py +0 -332
  216. package/plugins/xtrm-tools/skills/skill-creator/scripts/utils.py +0 -47
  217. package/plugins/xtrm-tools/skills/sync-docs/SKILL.md +0 -286
  218. package/plugins/xtrm-tools/skills/sync-docs/evals/evals.json +0 -89
  219. package/plugins/xtrm-tools/skills/sync-docs/references/doc-structure.md +0 -99
  220. package/plugins/xtrm-tools/skills/sync-docs/references/schema.md +0 -103
  221. package/plugins/xtrm-tools/skills/sync-docs/scripts/changelog/add_entry.py +0 -216
  222. package/plugins/xtrm-tools/skills/sync-docs/scripts/context_gatherer.py +0 -240
  223. package/plugins/xtrm-tools/skills/sync-docs/scripts/doc_structure_analyzer.py +0 -495
  224. package/plugins/xtrm-tools/skills/sync-docs/scripts/drift_detector.py +0 -563
  225. package/plugins/xtrm-tools/skills/sync-docs/scripts/validate_doc.py +0 -365
  226. package/plugins/xtrm-tools/skills/sync-docs/scripts/validate_metadata.py +0 -185
  227. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/benchmark.json +0 -293
  228. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/benchmark.md +0 -13
  229. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-doc-audit/eval_metadata.json +0 -27
  230. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-doc-audit/with_skill/outputs/result.md +0 -210
  231. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-doc-audit/with_skill/run-1/grading.json +0 -28
  232. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-doc-audit/with_skill/run-1/timing.json +0 -1
  233. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-doc-audit/without_skill/outputs/result.md +0 -101
  234. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-doc-audit/without_skill/run-1/grading.json +0 -28
  235. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-doc-audit/without_skill/run-1/timing.json +0 -5
  236. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-doc-audit/without_skill/timing.json +0 -5
  237. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-fix-mode/eval_metadata.json +0 -27
  238. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-fix-mode/with_skill/outputs/result.md +0 -198
  239. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-fix-mode/with_skill/run-1/grading.json +0 -28
  240. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-fix-mode/with_skill/run-1/timing.json +0 -1
  241. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-fix-mode/without_skill/outputs/result.md +0 -94
  242. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-fix-mode/without_skill/run-1/grading.json +0 -28
  243. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-fix-mode/without_skill/run-1/timing.json +0 -1
  244. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/eval_metadata.json +0 -27
  245. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/with_skill/outputs/result.md +0 -237
  246. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/with_skill/run-1/grading.json +0 -28
  247. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/with_skill/run-1/timing.json +0 -1
  248. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/without_skill/outputs/result.md +0 -134
  249. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/without_skill/run-1/grading.json +0 -28
  250. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/without_skill/run-1/timing.json +0 -1
  251. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/benchmark.json +0 -297
  252. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/benchmark.md +0 -13
  253. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-doc-audit/eval_metadata.json +0 -27
  254. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-doc-audit/with_skill/outputs/result.md +0 -137
  255. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-doc-audit/with_skill/run-1/grading.json +0 -92
  256. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-doc-audit/with_skill/run-1/timing.json +0 -1
  257. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-doc-audit/without_skill/outputs/result.md +0 -134
  258. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-doc-audit/without_skill/run-1/grading.json +0 -86
  259. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-doc-audit/without_skill/run-1/timing.json +0 -1
  260. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-fix-mode/eval_metadata.json +0 -27
  261. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-fix-mode/with_skill/outputs/result.md +0 -193
  262. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-fix-mode/with_skill/run-1/grading.json +0 -72
  263. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-fix-mode/with_skill/run-1/timing.json +0 -1
  264. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-fix-mode/without_skill/outputs/result.md +0 -211
  265. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-fix-mode/without_skill/run-1/grading.json +0 -91
  266. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-fix-mode/without_skill/run-1/timing.json +0 -5
  267. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/eval_metadata.json +0 -27
  268. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/with_skill/outputs/result.md +0 -182
  269. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/with_skill/run-1/grading.json +0 -95
  270. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/with_skill/run-1/timing.json +0 -1
  271. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/outputs/result.md +0 -222
  272. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/run-1/grading.json +0 -88
  273. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/run-1/timing.json +0 -5
  274. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/benchmark.json +0 -298
  275. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/benchmark.md +0 -13
  276. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-doc-audit/eval_metadata.json +0 -27
  277. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-doc-audit/with_skill/outputs/result.md +0 -125
  278. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-doc-audit/with_skill/run-1/grading.json +0 -97
  279. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-doc-audit/with_skill/run-1/timing.json +0 -5
  280. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-doc-audit/without_skill/outputs/result.md +0 -144
  281. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-doc-audit/without_skill/run-1/grading.json +0 -78
  282. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-doc-audit/without_skill/run-1/timing.json +0 -5
  283. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-fix-mode/eval_metadata.json +0 -27
  284. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-fix-mode/with_skill/outputs/result.md +0 -104
  285. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-fix-mode/with_skill/run-1/grading.json +0 -91
  286. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-fix-mode/with_skill/run-1/timing.json +0 -5
  287. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-fix-mode/without_skill/outputs/result.md +0 -79
  288. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-fix-mode/without_skill/run-1/grading.json +0 -82
  289. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-fix-mode/without_skill/run-1/timing.json +0 -5
  290. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/eval_metadata.json +0 -27
  291. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase1_context.json +0 -302
  292. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase2_drift.txt +0 -33
  293. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase3_analysis.json +0 -114
  294. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase4_fix.txt +0 -118
  295. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase5_validate.txt +0 -38
  296. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/result.md +0 -158
  297. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/run-1/grading.json +0 -95
  298. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/run-1/timing.json +0 -5
  299. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/without_skill/outputs/result.md +0 -71
  300. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/without_skill/run-1/grading.json +0 -90
  301. package/plugins/xtrm-tools/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/without_skill/run-1/timing.json +0 -5
  302. package/plugins/xtrm-tools/skills/test-planning/SKILL.md +0 -465
  303. package/plugins/xtrm-tools/skills/test-planning/evals/evals.json +0 -23
  304. package/plugins/xtrm-tools/skills/updating-service-skills/SKILL.md +0 -136
  305. package/plugins/xtrm-tools/skills/updating-service-skills/scripts/drift_detector.py +0 -222
  306. package/plugins/xtrm-tools/skills/using-quality-gates/SKILL.md +0 -254
  307. package/plugins/xtrm-tools/skills/using-serena-lsp/README.md +0 -8
  308. package/plugins/xtrm-tools/skills/using-serena-lsp/REFERENCE.md +0 -194
  309. package/plugins/xtrm-tools/skills/using-serena-lsp/SKILL.md +0 -82
  310. package/plugins/xtrm-tools/skills/using-service-skills/SKILL.md +0 -108
  311. package/plugins/xtrm-tools/skills/using-service-skills/scripts/cataloger.py +0 -74
  312. package/plugins/xtrm-tools/skills/using-service-skills/scripts/skill_activator.py +0 -152
  313. package/plugins/xtrm-tools/skills/using-service-skills/scripts/test_skill_activator.py +0 -58
  314. package/plugins/xtrm-tools/skills/using-tdd/SKILL.md +0 -410
  315. package/plugins/xtrm-tools/skills/using-xtrm/SKILL.md +0 -127
  316. package/plugins/xtrm-tools/skills/xt-debugging/SKILL.md +0 -149
  317. package/plugins/xtrm-tools/skills/xt-end/SKILL.md +0 -297
  318. package/plugins/xtrm-tools/skills/xt-merge/SKILL.md +0 -313
  319. package/project-skills/quality-gates/.claude/hooks/hook-config.json +0 -66
  320. package/project-skills/quality-gates/.claude/hooks/quality-check.cjs +0 -1286
  321. package/project-skills/quality-gates/.claude/hooks/quality-check.py +0 -334
  322. package/project-skills/quality-gates/.claude/settings.json +0 -3
  323. package/project-skills/quality-gates/.claude/skills/using-quality-gates/SKILL.md +0 -254
  324. package/project-skills/quality-gates/README.md +0 -109
  325. package/project-skills/quality-gates/evals/evals.json +0 -181
  326. package/project-skills/quality-gates/workspace/iteration-1/FINAL-EVAL-SUMMARY.md +0 -75
  327. package/project-skills/quality-gates/workspace/iteration-1/edge-case-auto-fix-verification/with_skill/outputs/response.md +0 -59
  328. package/project-skills/quality-gates/workspace/iteration-1/edge-case-mixed-language-project/with_skill/outputs/response.md +0 -60
  329. package/project-skills/quality-gates/workspace/iteration-1/eval-summary.md +0 -105
  330. package/project-skills/quality-gates/workspace/iteration-1/partial-install-python-only/with_skill/outputs/response.md +0 -93
  331. package/project-skills/quality-gates/workspace/iteration-1/python-refactor-request/with_skill/outputs/response.md +0 -104
  332. package/project-skills/quality-gates/workspace/iteration-1/quality-gate-error-fix/with_skill/outputs/response.md +0 -74
  333. package/project-skills/quality-gates/workspace/iteration-1/should-not-trigger-general-chat/with_skill/outputs/response.md +0 -18
  334. package/project-skills/quality-gates/workspace/iteration-1/should-not-trigger-math-question/with_skill/outputs/response.md +0 -18
  335. package/project-skills/quality-gates/workspace/iteration-1/should-not-trigger-unrelated-coding/with_skill/outputs/response.md +0 -56
  336. package/project-skills/quality-gates/workspace/iteration-1/tdd-guard-blocking-confusion/with_skill/outputs/response.md +0 -67
  337. package/project-skills/quality-gates/workspace/iteration-1/typescript-feature-with-tests/with_skill/outputs/response.md +0 -97
  338. package/project-skills/service-skills-set/.claude/git-hooks/doc_reminder.py +0 -67
  339. package/project-skills/service-skills-set/.claude/git-hooks/skill_staleness.py +0 -194
  340. package/project-skills/service-skills-set/.claude/service-registry.json +0 -4
  341. package/project-skills/service-skills-set/.claude/settings.json +0 -37
  342. package/project-skills/service-skills-set/.claude/skills/creating-service-skills/SKILL.md +0 -433
  343. package/project-skills/service-skills-set/.claude/skills/creating-service-skills/references/script_quality_standards.md +0 -425
  344. package/project-skills/service-skills-set/.claude/skills/creating-service-skills/references/service_skill_system_guide.md +0 -278
  345. package/project-skills/service-skills-set/.claude/skills/creating-service-skills/scripts/bootstrap.py +0 -308
  346. package/project-skills/service-skills-set/.claude/skills/creating-service-skills/scripts/deep_dive.py +0 -304
  347. package/project-skills/service-skills-set/.claude/skills/creating-service-skills/scripts/scaffolder.py +0 -482
  348. package/project-skills/service-skills-set/.claude/skills/scoping-service-skills/SKILL.md +0 -231
  349. package/project-skills/service-skills-set/.claude/skills/scoping-service-skills/scripts/scope.py +0 -74
  350. package/project-skills/service-skills-set/.claude/skills/updating-service-skills/SKILL.md +0 -136
  351. package/project-skills/service-skills-set/.claude/skills/updating-service-skills/scripts/drift_detector.py +0 -222
  352. package/project-skills/service-skills-set/.claude/skills/using-service-skills/SKILL.md +0 -108
  353. package/project-skills/service-skills-set/.claude/skills/using-service-skills/scripts/cataloger.py +0 -74
  354. package/project-skills/service-skills-set/.claude/skills/using-service-skills/scripts/skill_activator.py +0 -152
  355. package/project-skills/service-skills-set/README.md +0 -93
  356. package/project-skills/service-skills-set/install-service-skills.py +0 -193
  357. package/project-skills/service-skills-set/service-skills-readme.md +0 -236
  358. package/skills/README.txt +0 -31
  359. package/skills/clean-code/SKILL.md +0 -201
  360. package/skills/creating-service-skills/SKILL.md +0 -433
  361. package/skills/creating-service-skills/references/script_quality_standards.md +0 -425
  362. package/skills/creating-service-skills/references/service_skill_system_guide.md +0 -278
  363. package/skills/creating-service-skills/scripts/bootstrap.py +0 -326
  364. package/skills/creating-service-skills/scripts/deep_dive.py +0 -304
  365. package/skills/creating-service-skills/scripts/scaffolder.py +0 -482
  366. package/skills/delegating/SKILL.md +0 -196
  367. package/skills/delegating/config.yaml +0 -210
  368. package/skills/delegating/references/orchestration-protocols.md +0 -41
  369. package/skills/docker-expert/SKILL.md +0 -409
  370. package/skills/documenting/CHANGELOG.md +0 -23
  371. package/skills/documenting/README.md +0 -148
  372. package/skills/documenting/SKILL.md +0 -113
  373. package/skills/documenting/examples/example_pattern.md +0 -70
  374. package/skills/documenting/examples/example_reference.md +0 -70
  375. package/skills/documenting/examples/example_ssot_analytics.md +0 -64
  376. package/skills/documenting/examples/example_workflow.md +0 -141
  377. package/skills/documenting/references/changelog-format.md +0 -97
  378. package/skills/documenting/references/metadata-schema.md +0 -136
  379. package/skills/documenting/references/taxonomy.md +0 -81
  380. package/skills/documenting/references/versioning-rules.md +0 -78
  381. package/skills/documenting/scripts/bump_version.sh +0 -60
  382. package/skills/documenting/scripts/changelog/__init__.py +0 -0
  383. package/skills/documenting/scripts/changelog/add_entry.py +0 -216
  384. package/skills/documenting/scripts/changelog/bump_release.py +0 -117
  385. package/skills/documenting/scripts/changelog/init_changelog.py +0 -54
  386. package/skills/documenting/scripts/changelog/validate_changelog.py +0 -128
  387. package/skills/documenting/scripts/drift_detector.py +0 -266
  388. package/skills/documenting/scripts/generate_template.py +0 -311
  389. package/skills/documenting/scripts/list_by_category.sh +0 -84
  390. package/skills/documenting/scripts/orchestrator.py +0 -255
  391. package/skills/documenting/scripts/validate_metadata.py +0 -242
  392. package/skills/documenting/templates/CHANGELOG.md.template +0 -13
  393. package/skills/find-skills/SKILL.md +0 -133
  394. package/skills/gitnexus-exploring/SKILL.md +0 -75
  395. package/skills/gitnexus-impact-analysis/SKILL.md +0 -94
  396. package/skills/gitnexus-refactoring/SKILL.md +0 -113
  397. package/skills/hook-development/SKILL.md +0 -797
  398. package/skills/hook-development/examples/load-context.sh +0 -55
  399. package/skills/hook-development/examples/quality-check.js +0 -1168
  400. package/skills/hook-development/examples/validate-bash.sh +0 -43
  401. package/skills/hook-development/examples/validate-write.sh +0 -38
  402. package/skills/hook-development/references/advanced.md +0 -527
  403. package/skills/hook-development/references/migration.md +0 -369
  404. package/skills/hook-development/references/patterns.md +0 -412
  405. package/skills/hook-development/scripts/README.md +0 -164
  406. package/skills/hook-development/scripts/hook-linter.sh +0 -153
  407. package/skills/hook-development/scripts/test-hook.sh +0 -252
  408. package/skills/hook-development/scripts/validate-hook-schema.sh +0 -159
  409. package/skills/obsidian-cli/SKILL.md +0 -106
  410. package/skills/orchestrating-agents/SKILL.md +0 -135
  411. package/skills/orchestrating-agents/config.yaml +0 -45
  412. package/skills/orchestrating-agents/references/agent-context-integration.md +0 -37
  413. package/skills/orchestrating-agents/references/examples.md +0 -45
  414. package/skills/orchestrating-agents/references/handover-protocol.md +0 -31
  415. package/skills/orchestrating-agents/references/workflows.md +0 -42
  416. package/skills/orchestrating-agents/scripts/detect_neighbors.py +0 -23
  417. package/skills/planning/SKILL.md +0 -405
  418. package/skills/planning/evals/evals.json +0 -19
  419. package/skills/prompt-improving/README.md +0 -162
  420. package/skills/prompt-improving/SKILL.md +0 -74
  421. package/skills/prompt-improving/references/analysis_commands.md +0 -24
  422. package/skills/prompt-improving/references/chain_of_thought.md +0 -24
  423. package/skills/prompt-improving/references/mcp_definitions.md +0 -20
  424. package/skills/prompt-improving/references/multishot.md +0 -23
  425. package/skills/prompt-improving/references/xml_core.md +0 -60
  426. package/skills/python-testing/SKILL.md +0 -815
  427. package/skills/scoping-service-skills/SKILL.md +0 -231
  428. package/skills/scoping-service-skills/scripts/scope.py +0 -74
  429. package/skills/senior-backend/SKILL.md +0 -209
  430. package/skills/senior-backend/references/api_design_patterns.md +0 -103
  431. package/skills/senior-backend/references/backend_security_practices.md +0 -103
  432. package/skills/senior-backend/references/database_optimization_guide.md +0 -103
  433. package/skills/senior-backend/scripts/api_load_tester.py +0 -114
  434. package/skills/senior-backend/scripts/api_scaffolder.py +0 -114
  435. package/skills/senior-backend/scripts/database_migration_tool.py +0 -114
  436. package/skills/senior-data-scientist/SKILL.md +0 -226
  437. package/skills/senior-data-scientist/references/experiment_design_frameworks.md +0 -80
  438. package/skills/senior-data-scientist/references/feature_engineering_patterns.md +0 -80
  439. package/skills/senior-data-scientist/references/statistical_methods_advanced.md +0 -80
  440. package/skills/senior-data-scientist/scripts/experiment_designer.py +0 -100
  441. package/skills/senior-data-scientist/scripts/feature_engineering_pipeline.py +0 -100
  442. package/skills/senior-data-scientist/scripts/model_evaluation_suite.py +0 -100
  443. package/skills/senior-devops/SKILL.md +0 -209
  444. package/skills/senior-devops/references/cicd_pipeline_guide.md +0 -103
  445. package/skills/senior-devops/references/deployment_strategies.md +0 -103
  446. package/skills/senior-devops/references/infrastructure_as_code.md +0 -103
  447. package/skills/senior-devops/scripts/deployment_manager.py +0 -114
  448. package/skills/senior-devops/scripts/pipeline_generator.py +0 -114
  449. package/skills/senior-devops/scripts/terraform_scaffolder.py +0 -114
  450. package/skills/senior-security/SKILL.md +0 -209
  451. package/skills/senior-security/references/cryptography_implementation.md +0 -103
  452. package/skills/senior-security/references/penetration_testing_guide.md +0 -103
  453. package/skills/senior-security/references/security_architecture_patterns.md +0 -103
  454. package/skills/senior-security/scripts/pentest_automator.py +0 -114
  455. package/skills/senior-security/scripts/security_auditor.py +0 -114
  456. package/skills/senior-security/scripts/threat_modeler.py +0 -114
  457. package/skills/skill-creator/LICENSE.txt +0 -202
  458. package/skills/skill-creator/SKILL.md +0 -479
  459. package/skills/skill-creator/agents/analyzer.md +0 -274
  460. package/skills/skill-creator/agents/comparator.md +0 -202
  461. package/skills/skill-creator/agents/grader.md +0 -223
  462. package/skills/skill-creator/assets/eval_review.html +0 -146
  463. package/skills/skill-creator/eval-viewer/generate_review.py +0 -471
  464. package/skills/skill-creator/eval-viewer/viewer.html +0 -1325
  465. package/skills/skill-creator/references/schemas.md +0 -430
  466. package/skills/skill-creator/scripts/__init__.py +0 -0
  467. package/skills/skill-creator/scripts/aggregate_benchmark.py +0 -401
  468. package/skills/skill-creator/scripts/generate_report.py +0 -326
  469. package/skills/skill-creator/scripts/improve_description.py +0 -248
  470. package/skills/skill-creator/scripts/package_skill.py +0 -136
  471. package/skills/skill-creator/scripts/quick_validate.py +0 -103
  472. package/skills/skill-creator/scripts/run_eval.py +0 -310
  473. package/skills/skill-creator/scripts/run_loop.py +0 -332
  474. package/skills/skill-creator/scripts/utils.py +0 -47
  475. package/skills/sync-docs/SKILL.md +0 -286
  476. package/skills/sync-docs/evals/evals.json +0 -89
  477. package/skills/sync-docs/references/doc-structure.md +0 -99
  478. package/skills/sync-docs/references/schema.md +0 -103
  479. package/skills/sync-docs/scripts/changelog/add_entry.py +0 -216
  480. package/skills/sync-docs/scripts/context_gatherer.py +0 -240
  481. package/skills/sync-docs/scripts/doc_structure_analyzer.py +0 -495
  482. package/skills/sync-docs/scripts/drift_detector.py +0 -563
  483. package/skills/sync-docs/scripts/validate_doc.py +0 -365
  484. package/skills/sync-docs/scripts/validate_metadata.py +0 -185
  485. package/skills/sync-docs-workspace/iteration-1/benchmark.json +0 -293
  486. package/skills/sync-docs-workspace/iteration-1/benchmark.md +0 -13
  487. package/skills/sync-docs-workspace/iteration-1/eval-doc-audit/eval_metadata.json +0 -27
  488. package/skills/sync-docs-workspace/iteration-1/eval-doc-audit/with_skill/outputs/result.md +0 -210
  489. package/skills/sync-docs-workspace/iteration-1/eval-doc-audit/with_skill/run-1/grading.json +0 -28
  490. package/skills/sync-docs-workspace/iteration-1/eval-doc-audit/with_skill/run-1/timing.json +0 -1
  491. package/skills/sync-docs-workspace/iteration-1/eval-doc-audit/without_skill/outputs/result.md +0 -101
  492. package/skills/sync-docs-workspace/iteration-1/eval-doc-audit/without_skill/run-1/grading.json +0 -28
  493. package/skills/sync-docs-workspace/iteration-1/eval-doc-audit/without_skill/run-1/timing.json +0 -5
  494. package/skills/sync-docs-workspace/iteration-1/eval-doc-audit/without_skill/timing.json +0 -5
  495. package/skills/sync-docs-workspace/iteration-1/eval-fix-mode/eval_metadata.json +0 -27
  496. package/skills/sync-docs-workspace/iteration-1/eval-fix-mode/with_skill/outputs/result.md +0 -198
  497. package/skills/sync-docs-workspace/iteration-1/eval-fix-mode/with_skill/run-1/grading.json +0 -28
  498. package/skills/sync-docs-workspace/iteration-1/eval-fix-mode/with_skill/run-1/timing.json +0 -1
  499. package/skills/sync-docs-workspace/iteration-1/eval-fix-mode/without_skill/outputs/result.md +0 -94
  500. package/skills/sync-docs-workspace/iteration-1/eval-fix-mode/without_skill/run-1/grading.json +0 -28
  501. package/skills/sync-docs-workspace/iteration-1/eval-fix-mode/without_skill/run-1/timing.json +0 -1
  502. package/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/eval_metadata.json +0 -27
  503. package/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/with_skill/outputs/result.md +0 -237
  504. package/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/with_skill/run-1/grading.json +0 -28
  505. package/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/with_skill/run-1/timing.json +0 -1
  506. package/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/without_skill/outputs/result.md +0 -134
  507. package/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/without_skill/run-1/grading.json +0 -28
  508. package/skills/sync-docs-workspace/iteration-1/eval-sprint-closeout/without_skill/run-1/timing.json +0 -1
  509. package/skills/sync-docs-workspace/iteration-2/benchmark.json +0 -297
  510. package/skills/sync-docs-workspace/iteration-2/benchmark.md +0 -13
  511. package/skills/sync-docs-workspace/iteration-2/eval-doc-audit/eval_metadata.json +0 -27
  512. package/skills/sync-docs-workspace/iteration-2/eval-doc-audit/with_skill/outputs/result.md +0 -137
  513. package/skills/sync-docs-workspace/iteration-2/eval-doc-audit/with_skill/run-1/grading.json +0 -92
  514. package/skills/sync-docs-workspace/iteration-2/eval-doc-audit/with_skill/run-1/timing.json +0 -1
  515. package/skills/sync-docs-workspace/iteration-2/eval-doc-audit/without_skill/outputs/result.md +0 -134
  516. package/skills/sync-docs-workspace/iteration-2/eval-doc-audit/without_skill/run-1/grading.json +0 -86
  517. package/skills/sync-docs-workspace/iteration-2/eval-doc-audit/without_skill/run-1/timing.json +0 -1
  518. package/skills/sync-docs-workspace/iteration-2/eval-fix-mode/eval_metadata.json +0 -27
  519. package/skills/sync-docs-workspace/iteration-2/eval-fix-mode/with_skill/outputs/result.md +0 -193
  520. package/skills/sync-docs-workspace/iteration-2/eval-fix-mode/with_skill/run-1/grading.json +0 -72
  521. package/skills/sync-docs-workspace/iteration-2/eval-fix-mode/with_skill/run-1/timing.json +0 -1
  522. package/skills/sync-docs-workspace/iteration-2/eval-fix-mode/without_skill/outputs/result.md +0 -211
  523. package/skills/sync-docs-workspace/iteration-2/eval-fix-mode/without_skill/run-1/grading.json +0 -91
  524. package/skills/sync-docs-workspace/iteration-2/eval-fix-mode/without_skill/run-1/timing.json +0 -5
  525. package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/eval_metadata.json +0 -27
  526. package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/with_skill/outputs/result.md +0 -182
  527. package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/with_skill/run-1/grading.json +0 -95
  528. package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/with_skill/run-1/timing.json +0 -1
  529. package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/outputs/result.md +0 -222
  530. package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/run-1/grading.json +0 -88
  531. package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/run-1/timing.json +0 -5
  532. package/skills/sync-docs-workspace/iteration-3/benchmark.json +0 -298
  533. package/skills/sync-docs-workspace/iteration-3/benchmark.md +0 -13
  534. package/skills/sync-docs-workspace/iteration-3/eval-doc-audit/eval_metadata.json +0 -27
  535. package/skills/sync-docs-workspace/iteration-3/eval-doc-audit/with_skill/outputs/result.md +0 -125
  536. package/skills/sync-docs-workspace/iteration-3/eval-doc-audit/with_skill/run-1/grading.json +0 -97
  537. package/skills/sync-docs-workspace/iteration-3/eval-doc-audit/with_skill/run-1/timing.json +0 -5
  538. package/skills/sync-docs-workspace/iteration-3/eval-doc-audit/without_skill/outputs/result.md +0 -144
  539. package/skills/sync-docs-workspace/iteration-3/eval-doc-audit/without_skill/run-1/grading.json +0 -78
  540. package/skills/sync-docs-workspace/iteration-3/eval-doc-audit/without_skill/run-1/timing.json +0 -5
  541. package/skills/sync-docs-workspace/iteration-3/eval-fix-mode/eval_metadata.json +0 -27
  542. package/skills/sync-docs-workspace/iteration-3/eval-fix-mode/with_skill/outputs/result.md +0 -104
  543. package/skills/sync-docs-workspace/iteration-3/eval-fix-mode/with_skill/run-1/grading.json +0 -91
  544. package/skills/sync-docs-workspace/iteration-3/eval-fix-mode/with_skill/run-1/timing.json +0 -5
  545. package/skills/sync-docs-workspace/iteration-3/eval-fix-mode/without_skill/outputs/result.md +0 -79
  546. package/skills/sync-docs-workspace/iteration-3/eval-fix-mode/without_skill/run-1/grading.json +0 -82
  547. package/skills/sync-docs-workspace/iteration-3/eval-fix-mode/without_skill/run-1/timing.json +0 -5
  548. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/eval_metadata.json +0 -27
  549. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase1_context.json +0 -302
  550. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase2_drift.txt +0 -33
  551. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase3_analysis.json +0 -114
  552. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase4_fix.txt +0 -118
  553. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/phase5_validate.txt +0 -38
  554. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/outputs/result.md +0 -158
  555. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/run-1/grading.json +0 -95
  556. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/with_skill/run-1/timing.json +0 -5
  557. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/without_skill/outputs/result.md +0 -71
  558. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/without_skill/run-1/grading.json +0 -90
  559. package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/without_skill/run-1/timing.json +0 -5
  560. package/skills/test-planning/SKILL.md +0 -465
  561. package/skills/test-planning/evals/evals.json +0 -23
  562. package/skills/updating-service-skills/SKILL.md +0 -136
  563. package/skills/updating-service-skills/scripts/drift_detector.py +0 -222
  564. package/skills/using-quality-gates/SKILL.md +0 -254
  565. package/skills/using-serena-lsp/README.md +0 -8
  566. package/skills/using-serena-lsp/REFERENCE.md +0 -194
  567. package/skills/using-serena-lsp/SKILL.md +0 -82
  568. package/skills/using-service-skills/SKILL.md +0 -108
  569. package/skills/using-service-skills/scripts/cataloger.py +0 -74
  570. package/skills/using-service-skills/scripts/skill_activator.py +0 -152
  571. package/skills/using-service-skills/scripts/test_skill_activator.py +0 -58
  572. package/skills/using-tdd/SKILL.md +0 -410
  573. package/skills/using-xtrm/SKILL.md +0 -127
  574. package/skills/xt-debugging/SKILL.md +0 -149
  575. package/skills/xt-end/SKILL.md +0 -297
  576. package/skills/xt-merge/SKILL.md +0 -313
  577. /package/{config → .xtrm/config}/.env.example +0 -0
  578. /package/{config/mcp_servers_optional.json → .xtrm/config/claude.mcp.optional.json} +0 -0
  579. /package/{hooks → .xtrm/config}/hooks.json +0 -0
  580. /package/{config → .xtrm/config}/pi/auth.json.template +0 -0
  581. /package/{config → .xtrm/config}/pi/extensions/auto-session-name/index.ts +0 -0
  582. /package/{config → .xtrm/config}/pi/extensions/auto-session-name/package.json +0 -0
  583. /package/{config → .xtrm/config}/pi/extensions/auto-update/index.ts +0 -0
  584. /package/{config → .xtrm/config}/pi/extensions/auto-update/package.json +0 -0
  585. /package/{config → .xtrm/config}/pi/extensions/beads/package.json +0 -0
  586. /package/{config → .xtrm/config}/pi/extensions/compact-header/index.ts +0 -0
  587. /package/{config → .xtrm/config}/pi/extensions/compact-header/package.json +0 -0
  588. /package/{config → .xtrm/config}/pi/extensions/core/adapter.ts +0 -0
  589. /package/{config → .xtrm/config}/pi/extensions/core/guard-rules.ts +0 -0
  590. /package/{config → .xtrm/config}/pi/extensions/core/lib.ts +0 -0
  591. /package/{config → .xtrm/config}/pi/extensions/core/logger.ts +0 -0
  592. /package/{config → .xtrm/config}/pi/extensions/core/package.json +0 -0
  593. /package/{config → .xtrm/config}/pi/extensions/core/runner.ts +0 -0
  594. /package/{config → .xtrm/config}/pi/extensions/core/session-state.ts +0 -0
  595. /package/{config → .xtrm/config}/pi/extensions/custom-footer/package.json +0 -0
  596. /package/{config → .xtrm/config}/pi/extensions/custom-provider-qwen-cli/index.ts +0 -0
  597. /package/{config → .xtrm/config}/pi/extensions/custom-provider-qwen-cli/package.json +0 -0
  598. /package/{config → .xtrm/config}/pi/extensions/git-checkpoint/index.ts +0 -0
  599. /package/{config → .xtrm/config}/pi/extensions/git-checkpoint/package.json +0 -0
  600. /package/{config → .xtrm/config}/pi/extensions/lsp-bootstrap/index.ts +0 -0
  601. /package/{config → .xtrm/config}/pi/extensions/lsp-bootstrap/package.json +0 -0
  602. /package/{config → .xtrm/config}/pi/extensions/pi-serena-compact/index.ts +0 -0
  603. /package/{config → .xtrm/config}/pi/extensions/pi-serena-compact/package.json +0 -0
  604. /package/{config → .xtrm/config}/pi/extensions/quality-gates/index.ts +0 -0
  605. /package/{config → .xtrm/config}/pi/extensions/quality-gates/package.json +0 -0
  606. /package/{config → .xtrm/config}/pi/extensions/service-skills/index.ts +0 -0
  607. /package/{config → .xtrm/config}/pi/extensions/service-skills/package.json +0 -0
  608. /package/{config → .xtrm/config}/pi/extensions/session-flow/index.ts +0 -0
  609. /package/{config → .xtrm/config}/pi/extensions/session-flow/package.json +0 -0
  610. /package/{config → .xtrm/config}/pi/extensions/xtrm-loader/index.ts +0 -0
  611. /package/{config → .xtrm/config}/pi/extensions/xtrm-loader/package.json +0 -0
  612. /package/{config → .xtrm/config}/pi/extensions/xtrm-ui/format.ts +0 -0
  613. /package/{config → .xtrm/config}/pi/extensions/xtrm-ui/package.json +0 -0
  614. /package/{config → .xtrm/config}/pi/extensions/xtrm-ui/themes/pidex-dark.json +0 -0
  615. /package/{config → .xtrm/config}/pi/extensions/xtrm-ui/themes/pidex-light.json +0 -0
  616. /package/{config → .xtrm/config}/pi/models.json.template +0 -0
  617. /package/{config → .xtrm/config}/pi/pi-worktrees-settings.json +0 -0
  618. /package/{config → .xtrm/config}/pi/settings.json.template +0 -0
  619. /package/{hooks → .xtrm/hooks}/beads-claim-sync.mjs +0 -0
  620. /package/{hooks → .xtrm/hooks}/beads-compact-restore.mjs +0 -0
  621. /package/{hooks → .xtrm/hooks}/beads-compact-save.mjs +0 -0
  622. /package/{hooks → .xtrm/hooks}/beads-edit-gate.mjs +0 -0
  623. /package/{hooks → .xtrm/hooks}/beads-gate-core.mjs +0 -0
  624. /package/{hooks → .xtrm/hooks}/beads-gate-messages.mjs +0 -0
  625. /package/{hooks → .xtrm/hooks}/beads-gate-utils.mjs +0 -0
  626. /package/{hooks → .xtrm/hooks}/beads-stop-gate.mjs +0 -0
  627. /package/{hooks → .xtrm/hooks}/gitnexus/gitnexus-hook.cjs +0 -0
  628. /package/{hooks → .xtrm/hooks}/quality-check.cjs +0 -0
  629. /package/{hooks → .xtrm/hooks}/quality-check.py +0 -0
  630. /package/{hooks → .xtrm/hooks}/worktree-boundary.mjs +0 -0
  631. /package/{hooks → .xtrm/hooks}/xtrm-logger.mjs +0 -0
  632. /package/{hooks → .xtrm/hooks}/xtrm-session-logger.mjs +0 -0
  633. /package/{hooks → .xtrm/hooks}/xtrm-tool-logger.mjs +0 -0
@@ -1,274 +0,0 @@
1
- # Post-hoc Analyzer Agent
2
-
3
- Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.
4
-
5
- ## Role
6
-
7
- After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?
8
-
9
- ## Inputs
10
-
11
- You receive these parameters in your prompt:
12
-
13
- - **winner**: "A" or "B" (from blind comparison)
14
- - **winner_skill_path**: Path to the skill that produced the winning output
15
- - **winner_transcript_path**: Path to the execution transcript for the winner
16
- - **loser_skill_path**: Path to the skill that produced the losing output
17
- - **loser_transcript_path**: Path to the execution transcript for the loser
18
- - **comparison_result_path**: Path to the blind comparator's output JSON
19
- - **output_path**: Where to save the analysis results
20
-
21
- ## Process
22
-
23
- ### Step 1: Read Comparison Result
24
-
25
- 1. Read the blind comparator's output at comparison_result_path
26
- 2. Note the winning side (A or B), the reasoning, and any scores
27
- 3. Understand what the comparator valued in the winning output
28
-
29
- ### Step 2: Read Both Skills
30
-
31
- 1. Read the winner skill's SKILL.md and key referenced files
32
- 2. Read the loser skill's SKILL.md and key referenced files
33
- 3. Identify structural differences:
34
- - Instructions clarity and specificity
35
- - Script/tool usage patterns
36
- - Example coverage
37
- - Edge case handling
38
-
39
- ### Step 3: Read Both Transcripts
40
-
41
- 1. Read the winner's transcript
42
- 2. Read the loser's transcript
43
- 3. Compare execution patterns:
44
- - How closely did each follow their skill's instructions?
45
- - What tools were used differently?
46
- - Where did the loser diverge from optimal behavior?
47
- - Did either encounter errors or make recovery attempts?
48
-
49
- ### Step 4: Analyze Instruction Following
50
-
51
- For each transcript, evaluate:
52
- - Did the agent follow the skill's explicit instructions?
53
- - Did the agent use the skill's provided tools/scripts?
54
- - Were there missed opportunities to leverage skill content?
55
- - Did the agent add unnecessary steps not in the skill?
56
-
57
- Score instruction following 1-10 and note specific issues.
58
-
59
- ### Step 5: Identify Winner Strengths
60
-
61
- Determine what made the winner better:
62
- - Clearer instructions that led to better behavior?
63
- - Better scripts/tools that produced better output?
64
- - More comprehensive examples that guided edge cases?
65
- - Better error handling guidance?
66
-
67
- Be specific. Quote from skills/transcripts where relevant.
68
-
69
- ### Step 6: Identify Loser Weaknesses
70
-
71
- Determine what held the loser back:
72
- - Ambiguous instructions that led to suboptimal choices?
73
- - Missing tools/scripts that forced workarounds?
74
- - Gaps in edge case coverage?
75
- - Poor error handling that caused failures?
76
-
77
- ### Step 7: Generate Improvement Suggestions
78
-
79
- Based on the analysis, produce actionable suggestions for improving the loser skill:
80
- - Specific instruction changes to make
81
- - Tools/scripts to add or modify
82
- - Examples to include
83
- - Edge cases to address
84
-
85
- Prioritize by impact. Focus on changes that would have changed the outcome.
86
-
87
- ### Step 8: Write Analysis Results
88
-
89
- Save structured analysis to `{output_path}`.
90
-
91
- ## Output Format
92
-
93
- Write a JSON file with this structure:
94
-
95
- ```json
96
- {
97
- "comparison_summary": {
98
- "winner": "A",
99
- "winner_skill": "path/to/winner/skill",
100
- "loser_skill": "path/to/loser/skill",
101
- "comparator_reasoning": "Brief summary of why comparator chose winner"
102
- },
103
- "winner_strengths": [
104
- "Clear step-by-step instructions for handling multi-page documents",
105
- "Included validation script that caught formatting errors",
106
- "Explicit guidance on fallback behavior when OCR fails"
107
- ],
108
- "loser_weaknesses": [
109
- "Vague instruction 'process the document appropriately' led to inconsistent behavior",
110
- "No script for validation, agent had to improvise and made errors",
111
- "No guidance on OCR failure, agent gave up instead of trying alternatives"
112
- ],
113
- "instruction_following": {
114
- "winner": {
115
- "score": 9,
116
- "issues": [
117
- "Minor: skipped optional logging step"
118
- ]
119
- },
120
- "loser": {
121
- "score": 6,
122
- "issues": [
123
- "Did not use the skill's formatting template",
124
- "Invented own approach instead of following step 3",
125
- "Missed the 'always validate output' instruction"
126
- ]
127
- }
128
- },
129
- "improvement_suggestions": [
130
- {
131
- "priority": "high",
132
- "category": "instructions",
133
- "suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
134
- "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
135
- },
136
- {
137
- "priority": "high",
138
- "category": "tools",
139
- "suggestion": "Add validate_output.py script similar to winner skill's validation approach",
140
- "expected_impact": "Would catch formatting errors before final output"
141
- },
142
- {
143
- "priority": "medium",
144
- "category": "error_handling",
145
- "suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
146
- "expected_impact": "Would prevent early failure on difficult documents"
147
- }
148
- ],
149
- "transcript_insights": {
150
- "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
151
- "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
152
- }
153
- }
154
- ```
155
-
156
- ## Guidelines
157
-
158
- - **Be specific**: Quote from skills and transcripts, don't just say "instructions were unclear"
159
- - **Be actionable**: Suggestions should be concrete changes, not vague advice
160
- - **Focus on skill improvements**: The goal is to improve the losing skill, not critique the agent
161
- - **Prioritize by impact**: Which changes would most likely have changed the outcome?
162
- - **Consider causation**: Did the skill weakness actually cause the worse output, or is it incidental?
163
- - **Stay objective**: Analyze what happened, don't editorialize
164
- - **Think about generalization**: Would this improvement help on other evals too?
165
-
166
- ## Categories for Suggestions
167
-
168
- Use these categories to organize improvement suggestions:
169
-
170
- | Category | Description |
171
- |----------|-------------|
172
- | `instructions` | Changes to the skill's prose instructions |
173
- | `tools` | Scripts, templates, or utilities to add/modify |
174
- | `examples` | Example inputs/outputs to include |
175
- | `error_handling` | Guidance for handling failures |
176
- | `structure` | Reorganization of skill content |
177
- | `references` | External docs or resources to add |
178
-
179
- ## Priority Levels
180
-
181
- - **high**: Would likely change the outcome of this comparison
182
- - **medium**: Would improve quality but may not change win/loss
183
- - **low**: Nice to have, marginal improvement
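A minimal Python sketch of a structural check for the analysis JSON described above, assuming the field names from the example output and the allowed values from the category and priority tables. The function `validate_analysis` is hypothetical and is not one of the skill-creator scripts; it checks shape only, not the quality of the analysis.

```python
# Allowed values, taken from the "Categories for Suggestions" and
# "Priority Levels" sections of the analyzer prompt.
CATEGORIES = {"instructions", "tools", "examples", "error_handling", "structure", "references"}
PRIORITIES = {"high", "medium", "low"}
TOP_LEVEL_KEYS = {
    "comparison_summary",
    "winner_strengths",
    "loser_weaknesses",
    "instruction_following",
    "improvement_suggestions",
    "transcript_insights",
}


def validate_analysis(analysis: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks valid.

    Hypothetical helper: only checks the structure shown in the example output.
    """
    problems = []
    missing = TOP_LEVEL_KEYS - set(analysis)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")

    winner = analysis.get("comparison_summary", {}).get("winner")
    if winner not in ("A", "B"):
        problems.append(f"winner must be 'A' or 'B', got {winner!r}")

    # Instruction following is scored 1-10 for each side.
    for side in ("winner", "loser"):
        score = analysis.get("instruction_following", {}).get(side, {}).get("score")
        if not isinstance(score, int) or not 1 <= score <= 10:
            problems.append(f"{side} instruction_following score must be an int in 1-10")

    # Every suggestion needs a known priority and category.
    for i, s in enumerate(analysis.get("improvement_suggestions", [])):
        if s.get("priority") not in PRIORITIES:
            problems.append(f"suggestion {i}: unknown priority {s.get('priority')!r}")
        if s.get("category") not in CATEGORIES:
            problems.append(f"suggestion {i}: unknown category {s.get('category')!r}")
    return problems
```

In practice the file at `{output_path}` would be loaded with `json.load` and passed to the helper before the analysis is handed on.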
184
-
185
- ---
186
-
187
- # Analyzing Benchmark Results
188
-
189
- When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.
190
-
191
- ## Role
192
-
193
- Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
194
-
195
- ## Inputs
196
-
197
- You receive these parameters in your prompt:
198
-
199
- - **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
200
- - **skill_path**: Path to the skill being benchmarked
201
- - **output_path**: Where to save the notes (as JSON array of strings)
202
-
203
- ## Process
204
-
205
- ### Step 1: Read Benchmark Data
206
-
207
- 1. Read the benchmark.json containing all run results
208
- 2. Note the configurations tested (with_skill, without_skill)
209
- 3. Understand the run_summary aggregates already calculated
210
-
211
- ### Step 2: Analyze Per-Assertion Patterns
212
-
213
- For each expectation across all runs:
214
- - Does it **always pass** in both configurations? (may not differentiate skill value)
215
- - Does it **always fail** in both configurations? (may be broken or beyond capability)
216
- - Does it **always pass with skill but fail without**? (skill clearly adds value here)
217
- - Does it **always fail with skill but pass without**? (skill may be hurting)
218
- - Is it **highly variable**? (flaky expectation or non-deterministic behavior)
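The per-assertion questions above can be sketched as a small classifier. This assumes each expectation's history is available as a list of per-run pass flags for each configuration; the function name and the returned labels are hypothetical, chosen to mirror the five questions.

```python
def classify_assertion(with_skill: list[bool], without_skill: list[bool]) -> str:
    """Label one expectation's pass/fail history across both configurations.

    Hypothetical helper: `with_skill` and `without_skill` are per-run pass flags.
    """
    if not with_skill or not without_skill:
        return "insufficient data"

    def uniform(runs: list[bool]):
        # True if every run passed, False if every run failed, None if mixed.
        if all(runs):
            return True
        if not any(runs):
            return False
        return None

    w, wo = uniform(with_skill), uniform(without_skill)
    if w is None or wo is None:
        return "highly variable"
    if w and wo:
        return "always passes in both"
    if not w and not wo:
        return "always fails in both"
    if w and not wo:
        return "skill adds value"
    return "skill may be hurting"
```

Any mixed result within a configuration is treated as "highly variable" here, which is a simplification: a real analysis might still report the with/without pass rates for such assertions.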
-
- ### Step 3: Analyze Cross-Eval Patterns
-
- Look for patterns across evals:
- - Are certain eval types consistently harder/easier?
- - Do some evals show high variance while others are stable?
- - Are there surprising results that contradict expectations?
-
- ### Step 4: Analyze Metrics Patterns
-
- Look at time_seconds, tokens, tool_calls:
- - Does the skill significantly increase execution time?
- - Is there high variance in resource usage?
- - Are there outlier runs that skew the aggregates?
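For the variance and outlier questions above, a spread measure and a simple outlier check are enough to back a note with numbers. A sketch, assuming per-run values (pass rate, `time_seconds`, tokens, or tool calls) have already been collected into arrays; the sample standard deviation and the median-ratio outlier rule with a factor of 2 are illustrative choices, not something the benchmark format prescribes. A median-based rule is used because a "more than 2 standard deviations" test can never fire with three runs: with n = 3, no value can lie more than about 1.15 sample standard deviations from the mean.

```typescript
interface Spread {
  mean: number;
  stddev: number; // sample standard deviation (n - 1 denominator)
}

function meanAndStddev(values: number[]): Spread {
  if (values.length === 0) throw new Error("no runs to summarize");
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const squared = values.reduce((sum, v) => sum + (v - mean) ** 2, 0);
  // A single run has no spread; avoid dividing by zero.
  const stddev = values.length > 1 ? Math.sqrt(squared / (values.length - 1)) : 0;
  return { mean, stddev };
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Indices of runs more than `factor` times above or below the median.
function outlierRuns(values: number[], factor = 2): number[] {
  if (values.length === 0) return [];
  const m = median(values);
  if (m <= 0) return []; // a ratio test is meaningless for zero or negative medians
  return values.flatMap((v, i) => (v > m * factor || v < m / factor ? [i] : []));
}

// Pass rates for three runs of one eval, then time_seconds for three runs.
const passRates = meanAndStddev([0.5, 0.9, 0.1]);
console.log(`${Math.round(passRates.mean * 100)}% ± ${Math.round(passRates.stddev * 100)}%`); // prints: 50% ± 40%
console.log(outlierRuns([41, 44, 122])); // prints: [ 2 ]
```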
-
- ### Step 5: Generate Notes
-
- Write freeform observations as a list of strings. Each note should:
- - State a specific observation
- - Be grounded in the data (not speculation)
- - Help the user understand something the aggregate metrics don't show
-
- Examples:
- - "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
- - "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
- - "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
- - "Skill adds 13s average execution time but improves pass rate by 50%"
- - "Token usage is 80% higher with skill, primarily due to script output parsing"
- - "All 3 without-skill runs for eval 1 produced empty output"
-
- ### Step 6: Write Notes
-
- Save notes to `{output_path}` as a JSON array of strings:
-
- ```json
- [
- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
- "Without-skill runs consistently fail on table extraction expectations",
- "Skill adds 13s average execution time but improves pass rate by 50%"
- ]
- ```
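Because `{output_path}` must contain a JSON array of strings, it can help to build numeric notes from the aggregates and validate the array before writing it. A sketch, assuming per-configuration averages are already computed; `ConfigSummary`, `timeAndPassRateNote`, and `serializeNotes` are hypothetical helpers. The note states the pass-rate change in percentage points, since "improves pass rate by 50%" could otherwise be read as a relative change.

```typescript
// Hypothetical per-configuration aggregates taken from the benchmark data.
interface ConfigSummary {
  meanSeconds: number; // average time_seconds across runs
  passRate: number; // average pass rate, 0.0 to 1.0
}

function timeAndPassRateNote(withSkill: ConfigSummary, withoutSkill: ConfigSummary): string {
  const seconds = Math.round(withSkill.meanSeconds - withoutSkill.meanSeconds);
  const points = Math.round((withSkill.passRate - withoutSkill.passRate) * 100);
  const time = seconds >= 0 ? `adds ${seconds}s` : `saves ${-seconds}s`;
  const rate = points >= 0 ? `improves pass rate by ${points}` : `lowers pass rate by ${-points}`;
  return `Skill ${time} of average execution time and ${rate} percentage points`;
}

// Reject anything that is not a JSON array of strings before it is written out.
function serializeNotes(notes: unknown): string {
  if (!Array.isArray(notes) || !notes.every((note) => typeof note === "string")) {
    throw new TypeError("notes must be an array of strings");
  }
  return JSON.stringify(notes, null, 2);
}

const note = timeAndPassRateNote(
  { meanSeconds: 41.2, passRate: 0.9 },
  { meanSeconds: 28.0, passRate: 0.4 },
);
console.log(note);
// prints: Skill adds 13s of average execution time and improves pass rate by 50 percentage points
console.log(serializeNotes([note]));
```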
-
- ## Guidelines
-
- **DO:**
- - Report what you observe in the data
- - Be specific about which evals, expectations, or runs you're referring to
- - Note patterns that aggregate metrics would hide
- - Provide context that helps interpret the numbers
-
- **DO NOT:**
- - Suggest improvements to the skill (that's for the improvement step, not benchmarking)
- - Make subjective quality judgments ("the output was good/bad")
- - Speculate about causes without evidence
- - Repeat information already in the run_summary aggregates
@@ -1,202 +0,0 @@
- # Blind Comparator Agent
-
- Compare two outputs WITHOUT knowing which skill produced them.
-
- ## Role
-
- The Blind Comparator judges which output better accomplishes the eval task. You receive two outputs labeled A and B, but you do NOT know which skill produced which. This prevents bias toward a particular skill or approach.
-
- Your judgment is based purely on output quality and task completion.
-
- ## Inputs
-
- You receive these parameters in your prompt:
-
- - **output_a_path**: Path to the first output file or directory
- - **output_b_path**: Path to the second output file or directory
- - **eval_prompt**: The original task/prompt that was executed
- - **expectations**: List of expectations to check (optional - may be empty)
-
- ## Process
-
- ### Step 1: Read Both Outputs
-
- 1. Examine output A (file or directory)
- 2. Examine output B (file or directory)
- 3. Note the type, structure, and content of each
- 4. If outputs are directories, examine all relevant files inside
-
- ### Step 2: Understand the Task
-
- 1. Read the eval_prompt carefully
- 2. Identify what the task requires:
- - What should be produced?
- - What qualities matter (accuracy, completeness, format)?
- - What would distinguish a good output from a poor one?
-
- ### Step 3: Generate Evaluation Rubric
-
- Based on the task, generate a rubric with two dimensions:
-
- **Content Rubric** (what the output contains):
- | Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
- |-----------|----------|----------------|---------------|
- | Correctness | Major errors | Minor errors | Fully correct |
- | Completeness | Missing key elements | Mostly complete | All elements present |
- | Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
-
- **Structure Rubric** (how the output is organized):
- | Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
- |-----------|----------|----------------|---------------|
- | Organization | Disorganized | Reasonably organized | Clear, logical structure |
- | Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
- | Usability | Difficult to use | Usable with effort | Easy to use |
-
- Adapt criteria to the specific task. For example:
- - PDF form → "Field alignment", "Text readability", "Data placement"
- - Document → "Section structure", "Heading hierarchy", "Paragraph flow"
- - Data output → "Schema correctness", "Data types", "Completeness"
-
- ### Step 4: Evaluate Each Output Against the Rubric
-
- For each output (A and B):
-
- 1. **Score each criterion** on the rubric (1-5 scale)
65
- 2. **Calculate dimension totals**: Content score, Structure score
66
- 3. **Calculate overall score**: Average of dimension scores, scaled to 1-10
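The arithmetic in Steps 3 and 4 can be reproduced from the example output further down. A sketch, assuming each criterion is scored 1-5 and each dimension score is rounded to one decimal before the two are combined; the scale factor of 2 is inferred from the example (average 4.5 becomes 9.0), which means the combined score actually spans 2-10. Rounding first matters: output B's 2.7 and 2.7 give the example's 5.4, whereas the unrounded averages (8/3 each) would give 5.3.

```typescript
// Criterion name -> score on the 1-5 rubric scale.
type CriterionScores = Record<string, number>;

interface RubricResult {
  content_score: number;
  structure_score: number;
  overall_score: number;
}

const roundTo1 = (x: number): number => Math.round(x * 10) / 10;

function averageOf(scores: CriterionScores): number {
  const values = Object.values(scores);
  if (values.length === 0) throw new Error("a rubric dimension needs at least one criterion");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function rubricScores(content: CriterionScores, structure: CriterionScores): RubricResult {
  const content_score = roundTo1(averageOf(content));
  const structure_score = roundTo1(averageOf(structure));
  // Average of the two 1-5 dimension scores, doubled to land on the 10-point scale.
  const overall_score = roundTo1(((content_score + structure_score) / 2) * 2);
  return { content_score, structure_score, overall_score };
}

console.log(
  rubricScores(
    { correctness: 5, completeness: 5, accuracy: 4 },
    { organization: 4, formatting: 5, usability: 4 },
  ),
); // prints: { content_score: 4.7, structure_score: 4.3, overall_score: 9 }
```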
-
- ### Step 5: Check Assertions (if provided)
-
- If expectations are provided:
-
- 1. Check each expectation against output A
- 2. Check each expectation against output B
- 3. Count pass rates for each output
- 4. Use expectation scores as secondary evidence (not the primary decision factor)
-
- ### Step 6: Determine the Winner
-
- Compare A and B based on (in priority order):
-
- 1. **Primary**: Overall rubric score (content + structure)
- 2. **Secondary**: Assertion pass rates (if applicable)
- 3. **Tiebreaker**: If truly equal, declare a TIE
-
- Be decisive - ties should be rare. One output is usually better, even if marginally.
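The priority order above can be expressed as a small decision function. This is a sketch under assumptions: "truly equal" is approximated as a numeric tie within a small tolerance, and the `Side` shape is hypothetical. In practice the comparator is making a judgment call, and the reasoning field should explain it.

```typescript
type Winner = "A" | "B" | "TIE";

// Hypothetical per-output summary used for the decision.
interface Side {
  overallScore: number; // rubric overall_score
  passRate?: number; // expectation pass rate, only when expectations were provided
}

function pickWinner(a: Side, b: Side, tolerance = 1e-9): Winner {
  // Primary: overall rubric score.
  if (Math.abs(a.overallScore - b.overallScore) > tolerance) {
    return a.overallScore > b.overallScore ? "A" : "B";
  }
  // Secondary: assertion pass rates, if both outputs were checked.
  if (
    a.passRate !== undefined &&
    b.passRate !== undefined &&
    Math.abs(a.passRate - b.passRate) > tolerance
  ) {
    return a.passRate > b.passRate ? "A" : "B";
  }
  // Tiebreaker: nothing separates the outputs.
  return "TIE";
}

console.log(pickWinner({ overallScore: 9.0, passRate: 0.8 }, { overallScore: 5.4, passRate: 0.6 })); // prints: A
```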
-
- ### Step 7: Write Comparison Results
-
- Save results to a JSON file at the path specified (or `comparison.json` if not specified).
-
- ## Output Format
-
- Write a JSON file with this structure:
-
- ```json
- {
- "winner": "A",
- "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
- "rubric": {
- "A": {
- "content": {
- "correctness": 5,
- "completeness": 5,
- "accuracy": 4
- },
- "structure": {
- "organization": 4,
- "formatting": 5,
- "usability": 4
- },
- "content_score": 4.7,
- "structure_score": 4.3,
- "overall_score": 9.0
- },
- "B": {
- "content": {
- "correctness": 3,
- "completeness": 2,
- "accuracy": 3
- },
- "structure": {
- "organization": 3,
- "formatting": 2,
- "usability": 3
- },
- "content_score": 2.7,
- "structure_score": 2.7,
- "overall_score": 5.4
- }
- },
- "output_quality": {
- "A": {
- "score": 9,
- "strengths": ["Complete solution", "Well-formatted", "All fields present"],
- "weaknesses": ["Minor style inconsistency in header"]
- },
- "B": {
- "score": 5,
- "strengths": ["Readable output", "Correct basic structure"],
- "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
- }
- },
- "expectation_results": {
- "A": {
- "passed": 4,
- "total": 5,
- "pass_rate": 0.80,
- "details": [
- {"text": "Output includes name", "passed": true},
- {"text": "Output includes date", "passed": true},
- {"text": "Format is PDF", "passed": true},
- {"text": "Contains signature", "passed": false},
- {"text": "Readable text", "passed": true}
- ]
- },
- "B": {
- "passed": 3,
- "total": 5,
- "pass_rate": 0.60,
- "details": [
- {"text": "Output includes name", "passed": true},
- {"text": "Output includes date", "passed": false},
- {"text": "Format is PDF", "passed": true},
- {"text": "Contains signature", "passed": false},
- {"text": "Readable text", "passed": true}
- ]
- }
- }
- }
- ```
-
- If no expectations were provided, omit the `expectation_results` field entirely.
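Omitting the field is easiest when it is added conditionally rather than set to an empty value. A sketch, assuming both outputs were checked against the same expectation list; `summarizeChecks` and `expectationResultsField` are hypothetical helpers that reproduce the `passed`, `total`, and `pass_rate` fields shown in the example.

```typescript
interface ExpectationCheck {
  text: string;
  passed: boolean;
}

interface ExpectationResult {
  passed: number;
  total: number;
  pass_rate: number; // fraction passed, 0.0 to 1.0
  details: ExpectationCheck[];
}

function summarizeChecks(details: ExpectationCheck[]): ExpectationResult {
  const passed = details.filter((d) => d.passed).length;
  const total = details.length;
  return { passed, total, pass_rate: total === 0 ? 0 : passed / total, details };
}

// Returns an object to spread into comparison.json: empty when there were no expectations.
function expectationResultsField(
  a: ExpectationCheck[],
  b: ExpectationCheck[],
): { expectation_results?: { A: ExpectationResult; B: ExpectationResult } } {
  if (a.length === 0 && b.length === 0) return {};
  return { expectation_results: { A: summarizeChecks(a), B: summarizeChecks(b) } };
}

const comparison = {
  winner: "A",
  reasoning: "Output A is complete; output B is missing the date field.",
  ...expectationResultsField([], []),
};
console.log("expectation_results" in comparison); // prints: false
```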
173
-
174
- ## Field Descriptions
175
-
176
- - **winner**: "A", "B", or "TIE"
177
- - **reasoning**: Clear explanation of why the winner was chosen (or why it's a tie)
178
- - **rubric**: Structured rubric evaluation for each output
179
- - **content**: Scores for content criteria (correctness, completeness, accuracy)
180
- - **structure**: Scores for structure criteria (organization, formatting, usability)
181
- - **content_score**: Average of content criteria (1-5)
182
- - **structure_score**: Average of structure criteria (1-5)
183
- - **overall_score**: Combined score scaled to 1-10
184
- - **output_quality**: Summary quality assessment
185
- - **score**: 1-10 rating (should match rubric overall_score)
186
- - **strengths**: List of positive aspects
187
- - **weaknesses**: List of issues or shortcomings
188
- - **expectation_results**: (Only if expectations provided)
189
- - **passed**: Number of expectations that passed
190
- - **total**: Total number of expectations
191
- - **pass_rate**: Fraction passed (0.0 to 1.0)
192
- - **details**: Individual expectation results
193
-
194
- ## Guidelines
195
-
196
- - **Stay blind**: DO NOT try to infer which skill produced which output. Judge purely on output quality.
197
- - **Be specific**: Cite specific examples when explaining strengths and weaknesses.
198
- - **Be decisive**: Choose a winner unless outputs are genuinely equivalent.
199
- - **Output quality first**: Assertion scores are secondary to overall task completion.
200
- - **Be objective**: Don't favor outputs based on style preferences; focus on correctness and completeness.
201
- - **Explain your reasoning**: The reasoning field should make it clear why you chose the winner.
202
- - **Handle edge cases**: If both outputs fail, pick the one that fails less badly. If both are excellent, pick the one that's marginally better.
@@ -1,223 +0,0 @@
- # Grader Agent
-
- Evaluate expectations against an execution transcript and outputs.
-
- ## Role
-
- The Grader reviews a transcript and output files, then determines whether each expectation passes or fails. Provide clear evidence for each judgment.
-
- You have two jobs: grade the outputs, and critique the evals themselves. A passing grade on a weak assertion is worse than useless — it creates false confidence. When you notice an assertion that's trivially satisfied, or an important outcome that no assertion checks, say so.
-
- ## Inputs
-
- You receive these parameters in your prompt:
-
- - **expectations**: List of expectations to evaluate (strings)
- - **transcript_path**: Path to the execution transcript (markdown file)
- - **outputs_dir**: Directory containing output files from execution
-
- ## Process
-
- ### Step 1: Read the Transcript
-
- 1. Read the transcript file completely
- 2. Note the eval prompt, execution steps, and final result
- 3. Identify any issues or errors documented
-
- ### Step 2: Examine Output Files
-
- 1. List files in outputs_dir
- 2. Read/examine each file relevant to the expectations. If outputs aren't plain text, use the inspection tools provided in your prompt — don't rely solely on what the transcript says the executor produced.
- 3. Note contents, structure, and quality
-
- ### Step 3: Evaluate Each Assertion
-
- For each expectation:
-
- 1. **Search for evidence** in the transcript and outputs
- 2. **Determine verdict**:
- - **PASS**: Clear evidence the expectation is true AND the evidence reflects genuine task completion, not just surface-level compliance
- - **FAIL**: No evidence, or evidence contradicts the expectation, or the evidence is superficial (e.g., correct filename but empty/wrong content)
- 3. **Cite the evidence**: Quote the specific text or describe what you found
-
- ### Step 4: Extract and Verify Claims
-
- Beyond the predefined expectations, extract implicit claims from the outputs and verify them:
-
- 1. **Extract claims** from the transcript and outputs:
- - Factual statements ("The form has 12 fields")
- - Process claims ("Used pypdf to fill the form")
- - Quality claims ("All fields were filled correctly")
-
- 2. **Verify each claim**:
- - **Factual claims**: Can be checked against the outputs or external sources
- - **Process claims**: Can be verified from the transcript
- - **Quality claims**: Evaluate whether the claim is justified
-
- 3. **Flag unverifiable claims**: Note claims that cannot be verified with available information
-
- This catches issues that predefined expectations might miss.
-
- ### Step 5: Read User Notes
-
- If `{outputs_dir}/user_notes.md` exists:
- 1. Read it and note any uncertainties or issues flagged by the executor
- 2. Include relevant concerns in the grading output
- 3. These may reveal problems even when expectations pass
-
- ### Step 6: Critique the Evals
-
- After grading, consider whether the evals themselves could be improved. Only surface suggestions when there's a clear gap.
-
- Good suggestions test meaningful outcomes — assertions that are hard to satisfy without actually doing the work correctly. Think about what makes an assertion *discriminating*: it passes when the skill genuinely succeeds and fails when it doesn't.
-
- Suggestions worth raising:
- - An assertion that passed but would also pass for a clearly wrong output (e.g., checking filename existence but not file content)
- - An important outcome you observed — good or bad — that no assertion covers at all
- - An assertion that can't actually be verified from the available outputs
-
- Keep the bar high. The goal is to flag things the eval author would say "good catch" about, not to nitpick every assertion.
-
- ### Step 7: Write Grading Results
-
- Save results to `{outputs_dir}/../grading.json` (sibling to outputs_dir).
-
- ## Grading Criteria
-
- **PASS when**:
- - The transcript or outputs clearly demonstrate the expectation is true
- - Specific evidence can be cited
- - The evidence reflects genuine substance, not just surface compliance (e.g., a file exists AND contains correct content, not just the right filename)
-
- **FAIL when**:
- - No evidence found for the expectation
- - Evidence contradicts the expectation
- - The expectation cannot be verified from available information
- - The evidence is superficial — the assertion is technically satisfied but the underlying task outcome is wrong or incomplete
- - The output appears to meet the assertion by coincidence rather than by actually doing the work
-
- **When uncertain**: The burden of proof to pass is on the expectation.
-
- ### Step 8: Read Executor Metrics and Timing
-
- 1. If `{outputs_dir}/metrics.json` exists, read it and include in grading output
- 2. If `{outputs_dir}/../timing.json` exists, read it and include timing data
-
- ## Output Format
-
- Write a JSON file with this structure:
-
- ```json
- {
- "expectations": [
- {
- "text": "The output includes the name 'John Smith'",
- "passed": true,
- "evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
- },
- {
- "text": "The spreadsheet has a SUM formula in cell B10",
- "passed": false,
- "evidence": "No spreadsheet was created. The output was a text file."
- },
- {
- "text": "The assistant used the skill's OCR script",
- "passed": true,
- "evidence": "Transcript Step 2 shows: 'Tool: Bash - python ocr_script.py image.png'"
- }
- ],
- "summary": {
- "passed": 2,
- "failed": 1,
- "total": 3,
- "pass_rate": 0.67
- },
- "execution_metrics": {
- "tool_calls": {
- "Read": 5,
- "Write": 2,
- "Bash": 8
- },
- "total_tool_calls": 15,
- "total_steps": 6,
- "errors_encountered": 0,
- "output_chars": 12450,
- "transcript_chars": 3200
- },
- "timing": {
- "executor_duration_seconds": 165.0,
- "grader_duration_seconds": 26.0,
- "total_duration_seconds": 191.0
- },
- "claims": [
- {
- "claim": "The form has 12 fillable fields",
- "type": "factual",
- "verified": true,
- "evidence": "Counted 12 fields in field_info.json"
- },
- {
- "claim": "All required fields were populated",
- "type": "quality",
- "verified": false,
- "evidence": "Reference section was left blank despite data being available"
- }
- ],
- "user_notes_summary": {
- "uncertainties": ["Used 2023 data, may be stale"],
- "needs_review": [],
- "workarounds": ["Fell back to text overlay for non-fillable fields"]
- },
- "eval_feedback": {
- "suggestions": [
- {
- "assertion": "The output includes the name 'John Smith'",
- "reason": "A hallucinated document that mentions the name would also pass — consider checking it appears as the primary contact with matching phone and email from the input"
- },
- {
- "reason": "No assertion checks whether the extracted phone numbers match the input — I observed incorrect numbers in the output that went uncaught"
- }
- ],
- "overall": "Assertions check presence but not correctness. Consider adding content verification."
- }
- }
- ```
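The `summary` block and `total_tool_calls` are derived values, so they can be computed rather than typed by hand. A sketch, assuming `pass_rate` is rounded to two decimals (as the 0.67 in the example suggests) and that `total_tool_calls` is the sum of the per-tool counts (5 + 2 + 8 = 15 in the example); both are inferences from the example, not stated rules, and `summarize` and `totalToolCalls` are hypothetical helpers.

```typescript
interface GradedExpectation {
  text: string;
  passed: boolean;
  evidence: string;
}

interface GradingSummary {
  passed: number;
  failed: number;
  total: number;
  pass_rate: number;
}

function summarize(expectations: GradedExpectation[]): GradingSummary {
  const total = expectations.length;
  const passed = expectations.filter((e) => e.passed).length;
  // No partial credit: every expectation counts as either passed or failed.
  const pass_rate = total === 0 ? 0 : Math.round((passed / total) * 100) / 100;
  return { passed, failed: total - passed, total, pass_rate };
}

function totalToolCalls(toolCalls: Record<string, number>): number {
  return Object.values(toolCalls).reduce((sum, n) => sum + n, 0);
}

const summary = summarize([
  { text: "The output includes the name 'John Smith'", passed: true, evidence: "transcript Step 3" },
  { text: "The spreadsheet has a SUM formula in cell B10", passed: false, evidence: "no spreadsheet created" },
  { text: "The assistant used the skill's OCR script", passed: true, evidence: "transcript Step 2" },
]);
console.log(summary); // prints: { passed: 2, failed: 1, total: 3, pass_rate: 0.67 }
console.log(totalToolCalls({ Read: 5, Write: 2, Bash: 8 })); // prints: 15
```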
185
-
186
- ## Field Descriptions
187
-
188
- - **expectations**: Array of graded expectations
189
- - **text**: The original expectation text
190
- - **passed**: Boolean - true if expectation passes
191
- - **evidence**: Specific quote or description supporting the verdict
192
- - **summary**: Aggregate statistics
193
- - **passed**: Count of passed expectations
194
- - **failed**: Count of failed expectations
195
- - **total**: Total expectations evaluated
196
- - **pass_rate**: Fraction passed (0.0 to 1.0)
197
- - **execution_metrics**: Copied from executor's metrics.json (if available)
198
- - **output_chars**: Total character count of output files (proxy for tokens)
199
- - **transcript_chars**: Character count of transcript
200
- - **timing**: Wall clock timing from timing.json (if available)
201
- - **executor_duration_seconds**: Time spent in executor subagent
202
- - **total_duration_seconds**: Total elapsed time for the run
203
- - **claims**: Extracted and verified claims from the output
204
- - **claim**: The statement being verified
205
- - **type**: "factual", "process", or "quality"
206
- - **verified**: Boolean - whether the claim holds
207
- - **evidence**: Supporting or contradicting evidence
208
- - **user_notes_summary**: Issues flagged by the executor
209
- - **uncertainties**: Things the executor wasn't sure about
210
- - **needs_review**: Items requiring human attention
211
- - **workarounds**: Places where the skill didn't work as expected
212
- - **eval_feedback**: Improvement suggestions for the evals (only when warranted)
213
- - **suggestions**: List of concrete suggestions, each with a `reason` and optionally an `assertion` it relates to
214
- - **overall**: Brief assessment — can be "No suggestions, evals look solid" if nothing to flag
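The field descriptions above translate directly into types, which makes it possible to sanity-check a `grading.json` before it is used. A sketch: the `Grading` interface, and which fields are marked optional, are assumptions drawn from the "(if available)" and "(only when warranted)" notes and from Step 5's "if user_notes.md exists"; `checkGrading` is a hypothetical helper that only checks internal consistency, not whether the verdicts themselves are right.

```typescript
type ClaimType = "factual" | "process" | "quality";

interface Grading {
  expectations: { text: string; passed: boolean; evidence: string }[];
  summary: { passed: number; failed: number; total: number; pass_rate: number };
  execution_metrics?: Record<string, unknown>; // copied from metrics.json if available
  timing?: Record<string, number>; // copied from timing.json if available
  claims: { claim: string; type: ClaimType; verified: boolean; evidence: string }[];
  user_notes_summary?: { uncertainties: string[]; needs_review: string[]; workarounds: string[] };
  eval_feedback?: { suggestions: { assertion?: string; reason: string }[]; overall: string };
}

const CLAIM_TYPES: readonly string[] = ["factual", "process", "quality"];

// Returns a list of problems; an empty list means the file is internally consistent.
function checkGrading(g: Grading): string[] {
  const problems: string[] = [];
  const passed = g.expectations.filter((e) => e.passed).length;
  if (g.summary.total !== g.expectations.length) problems.push("summary.total != number of expectations");
  if (g.summary.passed !== passed) problems.push("summary.passed != passed expectations");
  if (g.summary.passed + g.summary.failed !== g.summary.total) problems.push("passed + failed != total");
  for (const c of g.claims) {
    // JSON parsed at runtime is not type-checked, so re-validate the enum.
    if (!CLAIM_TYPES.includes(c.type)) problems.push(`unknown claim type: ${c.type}`);
  }
  for (const s of g.eval_feedback?.suggestions ?? []) {
    if (s.reason.trim() === "") problems.push("suggestion without a reason");
  }
  return problems;
}

// Illustrative input only; real files come from the grader.
const grading: Grading = {
  expectations: [{ text: "Format is PDF", passed: true, evidence: "illustrative evidence" }],
  summary: { passed: 1, failed: 0, total: 1, pass_rate: 1 },
  claims: [{ claim: "The form has 12 fillable fields", type: "factual", verified: true, evidence: "Counted 12 fields" }],
};
console.log(checkGrading(grading)); // prints: []
```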
-
- ## Guidelines
-
- - **Be objective**: Base verdicts on evidence, not assumptions
- - **Be specific**: Quote the exact text that supports your verdict
- - **Be thorough**: Check both transcript and output files
- - **Be consistent**: Apply the same standard to each expectation
- - **Explain failures**: Make it clear why evidence was insufficient
- - **No partial credit**: Each expectation is pass or fail, not partial