pixie-qa 0.1.1__tar.gz → 0.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (416)
  1. pixie_qa-0.1.2/.claude/skills/eval-driven-dev/SKILL.md +522 -0
  2. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev/references/pixie-api.md +9 -7
  3. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/PKG-INFO +11 -79
  4. pixie_qa-0.1.2/README.md +38 -0
  5. pixie_qa-0.1.2/changelogs/pixie-directory-and-skill-improvements.md +63 -0
  6. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/cli/main.py +41 -5
  7. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/config.py +17 -4
  8. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/evals/runner.py +10 -0
  9. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/handlers.py +26 -4
  10. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pyproject.toml +1 -1
  11. pixie_qa-0.1.2/specs/agent-skill-1.md +25 -0
  12. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/agent-skill.md +15 -5
  13. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/test_config.py +24 -9
  14. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/uv.lock +1 -1
  15. pixie_qa-0.1.1/.claude/settings.local.json +0 -42
  16. pixie_qa-0.1.1/.claude/skills/eval-driven-dev/SKILL.md +0 -345
  17. pixie_qa-0.1.1/.env +0 -0
  18. pixie_qa-0.1.1/README.md +0 -106
  19. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/benchmark.json +0 -0
  20. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/benchmark.md +0 -0
  21. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/eval_metadata.json +0 -0
  22. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/outputs/metrics.json +0 -0
  23. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/outputs/response.md +0 -0
  24. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  25. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/run-1/timing.json +0 -0
  26. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/outputs/metrics.json +0 -0
  27. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/outputs/response.md +0 -0
  28. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  29. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/run-1/timing.json +0 -0
  30. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/eval_metadata.json +0 -0
  31. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/outputs/metrics.json +0 -0
  32. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/outputs/response.md +0 -0
  33. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/run-1/grading.json +0 -0
  34. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/run-1/timing.json +0 -0
  35. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/outputs/metrics.json +0 -0
  36. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/outputs/response.md +0 -0
  37. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/run-1/grading.json +0 -0
  38. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/run-1/timing.json +0 -0
  39. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/eval_metadata.json +0 -0
  40. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/outputs/metrics.json +0 -0
  41. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/outputs/response.md +0 -0
  42. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  43. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/run-1/timing.json +0 -0
  44. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/outputs/metrics.json +0 -0
  45. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/outputs/response.md +0 -0
  46. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  47. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/run-1/timing.json +0 -0
  48. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/benchmark.json +0 -0
  49. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/benchmark.md +0 -0
  50. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/eval_metadata.json +0 -0
  51. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  52. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/outputs/metrics.json +0 -0
  53. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/outputs/summary.md +0 -0
  54. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/pixie_datasets/qa-golden-set.json +0 -0
  55. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/qa_app.py +0 -0
  56. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/requirements.txt +0 -0
  57. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/tests/test_qa.py +0 -0
  58. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/timing.json +0 -0
  59. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  60. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/outputs/metrics.json +0 -0
  61. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/outputs/summary.md +0 -0
  62. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/pixie_datasets/qa-golden-set.json +0 -0
  63. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/qa_app.py +0 -0
  64. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/requirements.txt +0 -0
  65. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/tests/test_qa.py +0 -0
  66. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/timing.json +0 -0
  67. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/eval_metadata.json +0 -0
  68. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/grading.json +0 -0
  69. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/outputs/metrics.json +0 -0
  70. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/outputs/summary.md +0 -0
  71. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/MEMORY.md +0 -0
  72. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/build_dataset.py +0 -0
  73. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/extractor.py +0 -0
  74. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/requirements.txt +0 -0
  75. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/tests/__init__.py +0 -0
  76. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/tests/test_email_extraction.py +0 -0
  77. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/timing.json +0 -0
  78. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/grading.json +0 -0
  79. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/outputs/metrics.json +0 -0
  80. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/outputs/summary.md +0 -0
  81. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/build_dataset.py +0 -0
  82. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/extractor.py +0 -0
  83. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/requirements.txt +0 -0
  84. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/test_extractor.py +0 -0
  85. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/timing.json +0 -0
  86. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/eval_metadata.json +0 -0
  87. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  88. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/outputs/metrics.json +0 -0
  89. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/outputs/summary.md +0 -0
  90. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/MEMORY.md +0 -0
  91. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/build_dataset.py +0 -0
  92. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/chatbot.py +0 -0
  93. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/requirements.txt +0 -0
  94. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/tests/test_rag_chatbot.py +0 -0
  95. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/timing.json +0 -0
  96. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  97. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/outputs/metrics.json +0 -0
  98. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/outputs/summary.md +0 -0
  99. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/build_dataset.py +0 -0
  100. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/chatbot.py +0 -0
  101. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/requirements.txt +0 -0
  102. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/test_chatbot.py +0 -0
  103. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/timing.json +0 -0
  104. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/benchmark.json +0 -0
  105. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/benchmark.md +0 -0
  106. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/eval_metadata.json +0 -0
  107. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/grading.json +0 -0
  108. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
  109. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  110. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/qa_app.py +0 -0
  111. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/requirements.txt +0 -0
  112. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
  113. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  114. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
  115. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/outputs/test_qa.py +0 -0
  116. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/timing.json +0 -0
  117. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/grading.json +0 -0
  118. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/INVESTIGATION_NOTES.md +0 -0
  119. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  120. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/qa_app.py +0 -0
  121. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/requirements.txt +0 -0
  122. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
  123. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  124. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/outputs/INVESTIGATION_NOTES.md +0 -0
  125. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/outputs/test_qa.py +0 -0
  126. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/timing.json +0 -0
  127. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/eval_metadata.json +0 -0
  128. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/grading.json +0 -0
  129. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
  130. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/build_dataset.py +0 -0
  131. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/extractor.py +0 -0
  132. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/requirements.txt +0 -0
  133. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/run_evals.sh +0 -0
  134. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/tests/test_classifier.py +0 -0
  135. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/grading.json +0 -0
  136. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
  137. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/build_dataset.py +0 -0
  138. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
  139. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/test_classifier.py +0 -0
  140. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/timing.json +0 -0
  141. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/grading.json +0 -0
  142. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/collect_traces.py +0 -0
  143. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/extractor.py +0 -0
  144. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/requirements.txt +0 -0
  145. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/grading.json +0 -0
  146. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/outputs/collect_traces.py +0 -0
  147. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
  148. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/timing.json +0 -0
  149. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/eval_metadata.json +0 -0
  150. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/grading.json +0 -0
  151. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
  152. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
  153. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/pixie_datasets/rag-chatbot-golden.json +0 -0
  154. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/pixie_observations.db +0 -0
  155. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
  156. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/tests/test_chatbot.py +0 -0
  157. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  158. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
  159. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
  160. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/test_chatbot.py +0 -0
  161. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/timing.json +0 -0
  162. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/grading.json +0 -0
  163. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/capture_traces.py +0 -0
  164. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
  165. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
  166. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/test_chatbot_evals.py +0 -0
  167. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  168. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/capture_traces.py +0 -0
  169. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/chatbot.py +0 -0
  170. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_evals.py +0 -0
  171. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/timing.json +0 -0
  172. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/benchmark.json +0 -0
  173. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/benchmark.md +0 -0
  174. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/eval_metadata.json +0 -0
  175. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/grading.json +0 -0
  176. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
  177. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  178. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/qa_app.py +0 -0
  179. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/requirements.txt +0 -0
  180. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
  181. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  182. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
  183. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/outputs/test_qa.py +0 -0
  184. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/timing.json +0 -0
  185. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/grading.json +0 -0
  186. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  187. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/qa_app.py +0 -0
  188. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/requirements.txt +0 -0
  189. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
  190. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  191. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/run-1/outputs/test_qa.py +0 -0
  192. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/timing.json +0 -0
  193. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/eval_metadata.json +0 -0
  194. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/grading.json +0 -0
  195. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
  196. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/extractor.py +0 -0
  197. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/pixie_datasets/email-classifier-golden.json +0 -0
  198. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/pixie_observations.db +0 -0
  199. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/requirements.txt +0 -0
  200. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/tests/test_email_classifier.py +0 -0
  201. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/grading.json +0 -0
  202. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
  203. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
  204. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/test_email_classifier.py +0 -0
  205. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/timing.json +0 -0
  206. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/grading.json +0 -0
  207. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/conftest.py +0 -0
  208. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/extractor.py +0 -0
  209. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/generate_dataset.py +0 -0
  210. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/instrumented_extractor.py +0 -0
  211. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/pytest.ini +0 -0
  212. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/requirements.txt +0 -0
  213. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/test_email_classifier.py +0 -0
  214. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/grading.json +0 -0
  215. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
  216. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/outputs/test_email_classifier.py +0 -0
  217. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/timing.json +0 -0
  218. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/eval_metadata.json +0 -0
  219. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/grading.json +0 -0
  220. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
  221. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
  222. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/pixie_datasets/rag-chatbot-golden.json +0 -0
  223. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/pixie_observations.db +0 -0
  224. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
  225. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/tests/test_rag_chatbot.py +0 -0
  226. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  227. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
  228. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
  229. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/test_rag_chatbot.py +0 -0
  230. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/timing.json +0 -0
  231. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/grading.json +0 -0
  232. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
  233. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/chatbot_instrumented.py +0 -0
  234. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
  235. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/save_dataset.py +0 -0
  236. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/test_chatbot_evals.py +0 -0
  237. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  238. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/outputs/chatbot_instrumented.py +0 -0
  239. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_evals.py +0 -0
  240. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/timing.json +0 -0
  241. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/benchmark.json +0 -0
  242. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/benchmark.md +0 -0
  243. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/eval_metadata.json +0 -0
  244. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/grading.json +0 -0
  245. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
  246. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  247. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/qa_app.py +0 -0
  248. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/requirements.txt +0 -0
  249. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
  250. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  251. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
  252. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/pixie_datasets/qa-golden-set.json +0 -0
  253. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/qa_app.py +0 -0
  254. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/requirements.txt +0 -0
  255. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/tests/test_qa.py +0 -0
  256. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/timing.json +0 -0
  257. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/grading.json +0 -0
  258. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/MEMORY.md +0 -0
  259. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  260. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/qa_app.py +0 -0
  261. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/requirements.txt +0 -0
  262. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
  263. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  264. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/MEMORY.md +0 -0
  265. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/pixie_datasets/qa-golden-set.json +0 -0
  266. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/qa_app.py +0 -0
  267. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/requirements.txt +0 -0
  268. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/tests/test_qa.py +0 -0
  269. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/timing.json +0 -0
  270. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/eval_metadata.json +0 -0
  271. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/grading.json +0 -0
  272. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
  273. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/build_dataset.py +0 -0
  274. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/extractor.py +0 -0
  275. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/requirements.txt +0 -0
  276. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/tests/test_email_classifier.py +0 -0
  277. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/grading.json +0 -0
  278. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
  279. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/build_dataset.py +0 -0
  280. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
  281. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/requirements.txt +0 -0
  282. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/tests/test_email_classifier.py +0 -0
  283. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/timing.json +0 -0
  284. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/grading.json +0 -0
  285. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/build_dataset.py +0 -0
  286. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/extractor.py +0 -0
  287. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/requirements.txt +0 -0
  288. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/test_email_classifier.py +0 -0
  289. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/grading.json +0 -0
  290. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/build_dataset.py +0 -0
  291. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
  292. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/requirements.txt +0 -0
  293. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/test_email_classifier.py +0 -0
  294. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/timing.json +0 -0
  295. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/eval_metadata.json +0 -0
  296. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/grading.json +0 -0
  297. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
  298. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/build_dataset.py +0 -0
  299. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
  300. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
  301. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/tests/test_rag_chatbot.py +0 -0
  302. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  303. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
  304. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/build_dataset.py +0 -0
  305. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
  306. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/requirements.txt +0 -0
  307. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/tests/test_rag_chatbot.py +0 -0
  308. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/timing.json +0 -0
  309. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/grading.json +0 -0
  310. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/MEMORY.md +0 -0
  311. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
  312. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/datasets/rag-chatbot-golden.json +0 -0
  313. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
  314. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/test_chatbot_eval.py +0 -0
  315. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  316. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/MEMORY.md +0 -0
  317. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/chatbot.py +0 -0
  318. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/datasets/rag-chatbot-golden.json +0 -0
  319. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/requirements.txt +0 -0
  320. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_eval.py +0 -0
  321. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/timing.json +0 -0
  322. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/review-iteration-1.html +0 -0
  323. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/review-iteration-2.html +0 -0
  324. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/review-iteration-3.html +0 -0
  325. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/review-iteration-4.html +0 -0
  326. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/review-iteration-5.html +0 -0
  327. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.claude/skills/eval-driven-dev-workspace/trigger-eval-set.json +0 -0
  328. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.github/copilot-instructions.md +0 -0
  329. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.github/workflows/daily-release.yml +0 -0
  330. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.github/workflows/publish.yml +0 -0
  331. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/.gitignore +0 -0
  332. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/LICENSE +0 -0
  333. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/async-handler-processing.md +0 -0
  334. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/autoevals-adapters.md +0 -0
  335. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/cli-dataset-commands.md +0 -0
  336. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/dataset-management.md +0 -0
  337. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/eval-harness.md +0 -0
  338. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/expected-output-in-evals.md +0 -0
  339. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/instrumentation-module-implementation.md +0 -0
  340. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/loud-failure-mode.md +0 -0
  341. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/manual-instrumentation-usability.md +0 -0
  342. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/observation-store-implementation.md +0 -0
  343. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/root-package-exports-and-trace-id.md +0 -0
  344. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/changelogs/usability-utils.md +0 -0
  345. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/docs/package.md +0 -0
  346. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/__init__.py +0 -0
  347. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/cli/__init__.py +0 -0
  348. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/cli/dataset_command.py +0 -0
  349. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/cli/test_command.py +0 -0
  350. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/dataset/__init__.py +0 -0
  351. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/dataset/models.py +0 -0
  352. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/dataset/store.py +0 -0
  353. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/evals/__init__.py +0 -0
  354. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/evals/criteria.py +0 -0
  355. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/evals/eval_utils.py +0 -0
  356. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/evals/evaluation.py +0 -0
  357. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/evals/scorers.py +0 -0
  358. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/evals/trace_capture.py +0 -0
  359. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/evals/trace_helpers.py +0 -0
  360. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/__init__.py +0 -0
  361. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/context.py +0 -0
  362. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/handler.py +0 -0
  363. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/instrumentors.py +0 -0
  364. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/observation.py +0 -0
  365. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/processor.py +0 -0
  366. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/queue.py +0 -0
  367. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/instrumentation/spans.py +0 -0
  368. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/storage/__init__.py +0 -0
  369. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/storage/evaluable.py +0 -0
  370. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/storage/piccolo_conf.py +0 -0
  371. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/storage/piccolo_migrations/__init__.py +0 -0
  372. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/storage/serialization.py +0 -0
  373. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/storage/store.py +0 -0
  374. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/storage/tables.py +0 -0
  375. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/pixie/storage/tree.py +0 -0
  376. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/autoevals-adapters.md +0 -0
  377. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/dataset-management.md +0 -0
  378. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/evals-harness.md +0 -0
  379. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/expected-output-in-evals.md +0 -0
  380. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/instrumentation.md +0 -0
  381. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/manual-instrumentation-usability.md +0 -0
  382. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/storage.md +0 -0
  383. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/specs/usability-utils.md +0 -0
  384. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/__init__.py +0 -0
  385. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/__init__.py +0 -0
  386. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/cli/__init__.py +0 -0
  387. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/cli/test_dataset_command.py +0 -0
  388. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/cli/test_main.py +0 -0
  389. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/dataset/__init__.py +0 -0
  390. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/dataset/test_models.py +0 -0
  391. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/dataset/test_store.py +0 -0
  392. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/evals/__init__.py +0 -0
  393. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/evals/test_criteria.py +0 -0
  394. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/evals/test_eval_utils.py +0 -0
  395. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/evals/test_evaluation.py +0 -0
  396. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/evals/test_runner.py +0 -0
  397. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/evals/test_scorers.py +0 -0
  398. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/evals/test_trace_capture.py +0 -0
  399. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/evals/test_trace_helpers.py +0 -0
  400. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/__init__.py +0 -0
  401. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/conftest.py +0 -0
  402. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/test_context.py +0 -0
  403. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/test_handler.py +0 -0
  404. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/test_integration.py +0 -0
  405. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/test_observation.py +0 -0
  406. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/test_processor.py +0 -0
  407. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/test_queue.py +0 -0
  408. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/test_spans.py +0 -0
  409. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/instrumentation/test_storage_handler.py +0 -0
  410. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/observation_store/__init__.py +0 -0
  411. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/observation_store/conftest.py +0 -0
  412. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/observation_store/test_evaluable.py +0 -0
  413. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/observation_store/test_serialization.py +0 -0
  414. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/observation_store/test_store.py +0 -0
  415. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/observation_store/test_tree.py +0 -0
  416. {pixie_qa-0.1.1 → pixie_qa-0.1.2}/tests/pixie/test_init.py +0 -0
@@ -0,0 +1,522 @@
1
+ ---
2
+ name: eval-driven-dev
3
+ description: Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycle. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, even if they don't say "evals" explicitly. Use for making sure an AI app works correctly, catching regressions after prompt changes, debugging why an agent started behaving differently, or validating output quality before shipping.
4
+ ---
5
+
6
+ # Eval-Driven Development with pixie
7
+
8
+ This skill is about doing the work, not describing it. When a user asks you to set up evals for their app, you should be reading their code, editing their files, running commands, and producing a working test pipeline — not writing a plan for them to follow later.
9
+
10
+ The loop is: understand the app → instrument it → write the test file → build a dataset → run the tests → investigate failures → iterate. In practice the stages blur and you'll be going back and forth, but this ordering helps: write all the files (instrumentation, test file, MEMORY.md) before running any commands. That way your work survives even if an execution step hits a snag.
11
+
12
+ **All pixie-generated files live in a single `.pixie` directory** at the project root:
13
+
14
+ ```
15
+ .pixie/
16
+   MEMORY.md        # your understanding and eval plan
17
+   observations.db  # SQLite trace DB (auto-created by enable_storage)
18
+   datasets/        # golden datasets (JSON files)
19
+   tests/           # eval test files (test_*.py)
20
+   scripts/         # helper scripts (build_dataset.py, etc.)
21
+ ```
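If the directory doesn't exist yet, it can be created up front. A minimal sketch — `observations.db` is created automatically by `enable_storage()`, so only the subdirectories and the notes file need to exist:

```shell
# Create the pixie workspace at the project root
mkdir -p .pixie/datasets .pixie/tests .pixie/scripts
touch .pixie/MEMORY.md
ls .pixie
```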
22
+
23
+ ---
24
+
25
+ ## Stage 0: Ensure pixie-qa is Installed and API Keys Are Set
26
+
27
+ Before doing anything else, check that the `pixie-qa` package is available:
28
+
29
+ ```bash
30
+ python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
31
+ ```
32
+
33
+ If it's not installed, install it:
34
+
35
+ ```bash
36
+ pip install pixie-qa
37
+ ```
38
+
39
+ This provides the `pixie` Python module, the `pixie` CLI, and the `pixie test` runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.
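The same check can be done from Python with a small helper (illustrative, not part of pixie), which is handy when several packages need verifying at once:

```python
import importlib.util

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# "pixie" is the module that the pixie-qa package provides
print(missing_modules(["pixie"]))
```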
40
+
41
+ ### Verify API keys
42
+
43
+ The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. **Before running anything**, verify the key is set:
44
+
45
+ ```bash
46
+ [ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
47
+ ```
48
+
49
+ If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors.
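The same check in Python, covering several providers at once — the key names below are the common SDK defaults; adjust the tuple to whatever the app actually reads:

```python
import os

def missing_keys(required=("OPENAI_API_KEY",)):
    """Return the names of required env vars that are unset or empty."""
    return [k for k in required if not os.environ.get(k)]

missing = missing_keys(("OPENAI_API_KEY", "ANTHROPIC_API_KEY"))
if missing:
    print("Ask the user to set:", ", ".join(missing))
```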
50
+
51
+ ---
52
+
53
+ ## Stage 1: Understand the Application
54
+
55
+ Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.
56
+
57
+ ### What to investigate
58
+
59
+ 1. **How the software runs**: What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
60
+
61
+ 2. **All inputs to the LLM**: This is not limited to the user's message. Trace every piece of data that gets incorporated into any LLM prompt:
62
+ - User input (queries, messages, uploaded files)
63
+ - System prompts (hardcoded or templated)
64
+ - Retrieved context (RAG chunks, search results, database records)
65
+ - Tool definitions and function schemas
66
+ - Conversation history / memory
67
+ - Configuration or feature flags that change prompt behavior
68
+
69
+ 3. **All intermediate steps and outputs**: Walk through the code path from input to final output and document each stage:
70
+ - Retrieval / search results
71
+ - Tool calls and their results
72
+ - Agent routing / handoff decisions
73
+ - Intermediate LLM calls (e.g., summarization before final answer)
74
+ - Post-processing or formatting steps
75
+
76
+ 4. **The final output**: What does the user see? What format is it in? What are the quality expectations?
77
+
78
+ 5. **Use cases and expected behaviors**: What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?
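A quick way to surface candidate LLM call sites before reading in depth — the pattern below assumes the common SDK names (OpenAI, Anthropic); extend it for whatever client the project uses:

```shell
# List Python lines that look like LLM client usage, with file:line pointers
grep -rn --include="*.py" -E "openai|anthropic|chat\.completions|messages\.create" . | head -20
```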
79
+
80
+ ### Write MEMORY.md
81
+
82
+ Write your findings down in `.pixie/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.
83
+
84
+ **CRITICAL: MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet.** Those belong in later sections, only after they've been implemented.
85
+
86
+ The understanding section should include:
87
+
88
+ ```markdown
89
+ # Eval Notes: <Project Name>
90
+
91
+ ## How the application works
92
+
93
+ ### Entry point and execution flow
94
+
95
+ <Describe how to start/run the app, what happens step by step>
96
+
97
+ ### Inputs to LLM calls
98
+
99
+ <For each LLM call in the codebase, document:>
100
+
101
+ - Where it is in the code (file + function name)
102
+ - What system prompt it uses (quote it or summarize)
103
+ - What user/dynamic content feeds into it
104
+ - What tools/functions are available to it
105
+
106
+ ### Intermediate processing
107
+
108
+ <Describe any steps between input and output:>
109
+ - Retrieval, routing, tool execution, etc.
110
+ - Include code pointers (file:line) for each step
111
+
112
+ ### Final output
113
+
114
+ <What the user sees, what format, what the quality bar should be>
115
+
116
+ ### Use cases
117
+
118
+ <List each distinct scenario the app handles, with examples of good/bad outputs>
119
+
120
+ ## Evaluation plan
121
+
122
+ ### What to evaluate and why
123
+
124
+ <Quality dimensions: factual accuracy, relevance, format compliance, safety, etc.>
125
+
126
+ ### Evaluation granularity
127
+
128
+ <Which function/span boundary captures one "test case"? Why that boundary?>
129
+
130
+ ### Evaluators and criteria
131
+
132
+ <For each eval test, specify: evaluator, dataset, threshold, reasoning>
133
+
134
+ ### Data needed for evaluation
135
+
136
+ <What data points need to be captured, with code pointers to where they live>
137
+ ```
138
+
139
+ If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.
140
+
141
+ ---
142
+
143
+ ## Stage 2: Decide What to Evaluate
144
+
145
+ Now that you understand the app, you can make thoughtful choices about what to measure:
146
+
147
+ - **What quality dimension matters most?** Factual accuracy for QA apps, output format for structured extraction, relevance for RAG, safety for user-facing text.
148
+ - **Which span to evaluate:** the whole pipeline (`root`) or just the LLM call (`last_llm_call`)? If you're debugging retrieval, you might evaluate at a different point than if you're checking final answer quality.
149
+ - **Which evaluators fit:** see `references/pixie-api.md` → Evaluators. For factual QA: `FactualityEval`. For structured output: `ValidJSONEval` / `JSONDiffEval`. For RAG pipelines: `ContextRelevancyEval` / `FaithfulnessEval`.
150
+ - **Pass criteria:** `ScoreThreshold(threshold=0.7, pct=0.8)` means 80% of cases must score ≥ 0.7. Think about what "good enough" looks like for this app.
151
+ - **Expected outputs:** `FactualityEval` needs them. Format evaluators usually don't.
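The `ScoreThreshold` semantics read as a plain predicate over per-case scores. A sketch of the rule (not pixie's implementation) for the `threshold=0.7, pct=0.8` example above:

```python
def meets_threshold(scores, threshold=0.7, pct=0.8):
    """True when at least `pct` of cases score >= `threshold`."""
    if not scores:
        return False
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct

print(meets_threshold([0.9, 0.8, 0.75, 0.4, 0.95]))  # → True (4/5 = 80% pass)
```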
152
+
153
+ Update `.pixie/MEMORY.md` with the plan before writing any code.
154
+
155
+ ---
156
+
157
+ ## Stage 3: Instrument the Application
158
+
159
+ Add pixie instrumentation to the **existing application code**. The goal is to capture the inputs and outputs of existing functions as observable spans. **Do not add new functions or change the application's behavior** — only wrap existing code paths.
160
+
161
+ ### Add `enable_storage()` at application startup
162
+
163
+ Call `enable_storage()` once at the beginning of the application's startup code — inside `main()`, or at the top of a server's initialization. **Never at module level** (top of a file outside any function), because that causes storage setup to trigger on import.
164
+
165
+ Good places:
166
+
167
+ - Inside `if __name__ == "__main__":` blocks
168
+ - In a FastAPI `lifespan` or `on_startup` handler
169
+ - At the top of `main()` / `run()` functions
170
+ - Inside the `runnable` function in test files
171
+
172
+ ```python
173
+ # ✅ CORRECT — at application startup
174
+ async def main():
175
+ enable_storage()
176
+ ...
177
+
178
+ # ✅ CORRECT — in a runnable for tests
179
+ def runnable(eval_input):
180
+ enable_storage()
181
+ my_function(**eval_input)
182
+
183
+ # ❌ WRONG — at module level, runs on import
184
+ from pixie import enable_storage
185
+ enable_storage() # this runs when any file imports this module!
186
+ ```
187
+
188
+ ### Wrap existing functions with `@observe`
189
+
190
+ `@observe` on an existing function captures all its kwargs as `eval_input` and its return value as `eval_output`. **Apply it to the existing function that represents one "test case"** — typically the outermost function a user interaction flows through:
191
+
192
+ ```python
193
+ from pixie import observe
194
+
195
+ @observe(name="answer_question")
196
+ def answer_question(question: str, context: str) -> str: # existing function
197
+ ... # existing code, unchanged
198
+ ```
199
+
200
+ For more control, use the context manager around existing code:
201
+
202
+ ```python
203
+ from pixie import start_observation
204
+
205
+ def process_request(query: str) -> str: # existing function
206
+ with start_observation(input={"query": query}, name="process_request") as obs:
207
+ result = existing_pipeline(query) # existing code
208
+ obs.set_output(result)
209
+ obs.set_metadata("chunks_retrieved", len(chunks))
210
+ return result
211
+ ```
212
+
213
+ **CRITICAL rules:**
214
+
215
+ - **Never add new wrapper functions** to the application code. Wrap existing functions in-place.
216
+ - **Never change the function's interface** (arguments, return type, behavior).
217
+ - The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
218
+ - After instrumentation, call `flush()` at the end of runs to make sure all spans are written.
219
+
220
+ **Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
221
+
222
+ ---
223
+
224
+ ## Stage 4: Write the Eval Test File
225
+
226
+ Write the test file before building the dataset. This might seem backwards, but it forces you to decide what you're actually measuring before you start collecting data — otherwise the data collection has no direction.
227
+
228
+ Create `.pixie/tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls your app function, plus an async test function that calls `assert_dataset_pass`:
229
+
230
+ ```python
231
+ from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
232
+
233
+ from myapp import answer_question
234
+
235
+
236
+ def runnable(eval_input):
237
+ """Replays one dataset item through the app. enable_storage() here ensures traces are captured."""
238
+ enable_storage()
239
+ answer_question(**eval_input)
240
+
241
+
242
+ async def test_factuality():
243
+ await assert_dataset_pass(
244
+ runnable=runnable,
245
+ dataset_name="<dataset-name>",
246
+ evaluators=[FactualityEval()],
247
+ pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
248
+ from_trace=last_llm_call,
249
+ )
250
+ ```
251
+
252
+ Note that `enable_storage()` belongs inside the `runnable`, not at module level in the test file — it needs to fire on each invocation so the trace is captured for that specific run.
253
+
254
+ The test runner is `pixie test` (not `pytest`):
255
+
256
+ ```bash
257
+ pixie test # run all test_*.py in current directory
258
+ pixie test .pixie/tests/ # specify path
259
+ pixie test -k factuality # filter by name
260
+ pixie test -v # verbose: shows per-case scores and reasoning
261
+ ```
262
+
263
+ `pixie test` automatically adds the project root and parent directories to `sys.path`, so imports of your application modules work without any extra configuration.
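The path handling described above amounts to something like the following sketch (illustrative only — not pixie's actual code; the function name is hypothetical):

```python
# Sketch of "add the project root and parent directories to sys.path"
# so that application modules import without extra configuration.
import sys
from pathlib import Path

def add_project_paths(start: Path) -> None:
    """Prepend `start` and each of its parent directories to sys.path."""
    for p in [start, *start.parents]:
        s = str(p)
        if s not in sys.path:
            sys.path.insert(0, s)

add_project_paths(Path.cwd())
```

This is why a test file under `.pixie/tests/` can do `from myapp import answer_question` even though the test file is not next to the application code.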
264
+
265
+ ---
266
+
267
+ ## Stage 5: Build the Dataset
268
+
269
+ Create the dataset first, then populate it by **actually running the app** with representative inputs. This is critical — dataset items should contain real app outputs and trace metadata, not fabricated data.
270
+
271
+ ```bash
272
+ pixie dataset create <dataset-name>
273
+ pixie dataset list # verify it exists
274
+ ```
275
+
276
+ ### Run the app and capture traces to the dataset
277
+
278
+ Write a simple script (`.pixie/scripts/build_dataset.py`) that calls the instrumented function for each input, flushes traces, then saves them to the dataset:
279
+
280
+ ```python
281
+ import asyncio
282
+ from pixie import enable_storage, flush, DatasetStore, Evaluable
283
+
284
+ from myapp import answer_question
285
+
286
+ GOLDEN_CASES = [
287
+ ("What is the capital of France?", "Paris"),
288
+ ("What is the speed of light?", "299,792,458 meters per second"),
289
+ ]
290
+
291
+ async def build_dataset():
292
+ enable_storage()
293
+ store = DatasetStore()
294
+ try:
295
+ store.create("qa-golden-set")
296
+ except FileExistsError:
297
+ pass
298
+
299
+ for question, expected in GOLDEN_CASES:
300
+ result = answer_question(question=question)
301
+ flush()
302
+
303
+ store.append("qa-golden-set", Evaluable(
304
+ eval_input={"question": question},
305
+ eval_output=result,
306
+ expected_output=expected,
307
+ ))
308
+
309
+ asyncio.run(build_dataset())
310
+ ```
311
+
312
+ Alternatively, use the CLI for per-case capture:
313
+
314
+ ```bash
315
+ # Run the app (enable_storage() must be active)
316
+ python -c "from myapp import main; main('What is the capital of France?')"
317
+
318
+ # Save the root span to the dataset
319
+ pixie dataset save <dataset-name>
320
+
321
+ # Or specifically save the last LLM call:
322
+ pixie dataset save <dataset-name> --select last_llm_call
323
+
324
+ # Add context:
325
+ pixie dataset save <dataset-name> --notes "basic geography question"
326
+
327
+ # Attach expected output for evaluators like FactualityEval:
328
+ echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
329
+ ```
330
+
331
+ **Key rules for dataset building:**
332
+
333
+ - **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces.
334
+ - **Include expected outputs** for comparison-based evaluators like `FactualityEval`.
335
+ - **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
336
+ - When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
337
+
338
+ ---
339
+
340
+ ## Stage 6: Run the Tests
341
+
342
+ ```bash
343
+ pixie test .pixie/tests/ -v
344
+ ```
345
+
346
+ The `-v` flag shows per-case scores and reasoning, which makes it much easier to see what's passing and what isn't. Check that the pass rates look reasonable given your `ScoreThreshold`.
347
+
348
+ ---
349
+
350
+ ## Stage 7: Investigate Failures
351
+
352
+ When tests fail, the goal is to understand _why_, not to adjust thresholds until things pass. Investigation must be thorough and documented — the user needs to see the actual data, your reasoning, and your conclusion.
353
+
354
+ ### Step 1: Get the detailed test output
355
+
356
+ ```bash
357
+ pixie test .pixie/tests/ -v # shows score and reasoning per case
358
+ ```
359
+
360
+ Capture the full verbose output. For each failing case, note:
361
+
362
+ - The `eval_input` (what was sent)
363
+ - The `eval_output` (what the app produced)
364
+ - The `expected_output` (what was expected, if applicable)
365
+ - The evaluator score and reasoning
366
+
367
+ ### Step 2: Inspect the trace data
368
+
369
+ For each failing case, look up the full trace to see what happened inside the app:
370
+
371
+ ```python
372
+ from pixie import DatasetStore
373
+
374
+ store = DatasetStore()
375
+ ds = store.get("<dataset-name>")
376
+ for i, item in enumerate(ds.items):
377
+ print(i, item.eval_metadata) # trace_id is here
378
+ ```
379
+
380
+ Then inspect the full span tree:
381
+
382
+ ```python
383
+ import asyncio
384
+ from pixie import ObservationStore
385
+
386
+ async def inspect(trace_id: str):
387
+ store = ObservationStore()
388
+ roots = await store.get_trace(trace_id)
389
+ for root in roots:
390
+ print(root.to_text()) # full span tree: inputs, outputs, LLM messages
391
+
392
+ asyncio.run(inspect("the-trace-id-here"))
393
+ ```
394
+
395
+ ### Step 3: Root-cause analysis
396
+
397
+ Walk through the trace and identify exactly where the failure originates. Common patterns:
398
+
399
+ | Symptom | Likely cause |
400
+ | -------------------------------- | ----------------------------------------------- |
401
+ | Output is factually wrong | Prompt or retrieved context is bad |
402
+ | Output is right but score is low | Wrong `expected_output`, or criteria too strict |
403
+ | Score 0.0 with error details | Evaluator crashed (missing API key, etc.) |
404
+ | All cases fail at same point | `@observe` is on the wrong function |
405
+
406
+ ### Step 4: Document findings in MEMORY.md
407
+
408
+ **Every failure investigation must be documented in `.pixie/MEMORY.md`** in a structured format:
409
+
410
+ ```markdown
411
+ ### Investigation: <test_name> failure — <date>
412
+
413
+ **Test**: `test_faq_factuality` in `.pixie/tests/test_customer_service.py`
414
+ **Result**: 3/5 cases passed (60%), threshold was 80% ≥ 0.7
415
+
416
+ #### Failing case 1: "What rows have extra legroom?"
417
+
418
+ - **eval_input**: `{"user_message": "What rows have extra legroom?"}`
419
+ - **eval_output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
420
+ - **expected_output**: "rows 5-8 Economy Plus with extra legroom"
421
+ - **Evaluator score**: 0.1 (FactualityEval)
422
+ - **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."
423
+
424
+ **Trace analysis**:
425
+ Inspected trace `abc123`. The span tree shows:
426
+
427
+ 1. Triage Agent routed to FAQ Agent ✓
428
+ 2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
429
+ 3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**
430
+
431
+ **Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
432
+ The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
433
+ The question "What rows have extra legroom?" contains none of these keywords, so it
434
+ falls through to the default "I don't know" response — even though the seat FAQ
435
+ entry contains exactly the information requested ("Rows 5-8 are Economy Plus, with extra legroom").
436
+
437
+ **Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
438
+ `faq_lookup_tool` (customer_service.py:130).
439
+
440
+ **Verification**: After fix, re-run:
441
+ \`\`\`bash
442
+ python .pixie/scripts/build_dataset.py # refresh dataset
443
+ pixie test .pixie/tests/ -k faq -v # verify
444
+ \`\`\`
445
+ ```
446
+
447
+ ### Step 5: Fix and re-run
448
+
449
+ Make the targeted change, rebuild the dataset if needed, and re-run. Always finish by giving the user the exact commands to verify:
450
+
451
+ ```bash
452
+ pixie test .pixie/tests/test_<feature>.py -v
453
+ ```
454
+
455
+ ---
456
+
457
+ ## Memory Template
458
+
459
+ ```markdown
460
+ # Eval Notes: <Project Name>
461
+
462
+ ## How the application works
463
+
464
+ ### Entry point and execution flow
465
+
466
+ <How to start/run the app. Step-by-step flow from input to output.>
467
+
468
+ ### Inputs to LLM calls
469
+
470
+ <For EACH LLM call, document: location in code, system prompt, dynamic content, available tools>
471
+
472
+ ### Intermediate processing
473
+
474
+ <Steps between input and output: retrieval, routing, tool calls, etc. Code pointers for each.>
475
+
476
+ ### Final output
477
+
478
+ <What the user sees. Format. Quality expectations.>
479
+
480
+ ### Use cases
481
+
482
+ <Each scenario with examples of good/bad outputs:>
483
+
484
+ 1. <Use case 1>: <description>
485
+ - Input example: ...
486
+ - Good output: ...
487
+ - Bad output: ...
488
+
489
+ ## Evaluation plan
490
+
491
+ ### What to evaluate and why
492
+
493
+ <Quality dimensions and rationale>
494
+
495
+ ### Evaluators and criteria
496
+
497
+ | Test | Dataset | Evaluator | Criteria | Rationale |
498
+ | ---- | ------- | --------- | -------- | --------- |
499
+ | ... | ... | ... | ... | ... |
500
+
501
+ ### Data needed for evaluation
502
+
503
+ <What data to capture, with code pointers>
504
+
505
+ ## Datasets
506
+
507
+ | Dataset | Items | Purpose |
508
+ | ------- | ----- | ------- |
509
+ | ... | ... | ... |
510
+
511
+ ## Investigation log
512
+
513
+ ### <date> — <test_name> failure
514
+
515
+ <Full structured investigation as described in Stage 7>
516
+ ```
517
+
518
+ ---
519
+
520
+ ## Reference
521
+
522
+ See `references/pixie-api.md` for all CLI commands, evaluator signatures, and the Python dataset/store API.
@@ -2,13 +2,15 @@
2
2
 
3
3
  ## Configuration
4
4
 
5
- All settings read from environment variables at call time:
5
+ All settings are read from environment variables at call time. By default,
6
+ every artefact lives inside a single `.pixie` project directory:
6
7
 
7
- | Variable | Default | Description |
8
- | ------------------- | ----------------------- | ---------------------------------- |
9
- | `PIXIE_DB_PATH` | `pixie_observations.db` | SQLite database file path |
10
- | `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
11
- | `PIXIE_DATASET_DIR` | `pixie_datasets` | Directory for dataset JSON files |
8
+ | Variable | Default | Description |
9
+ | ------------------- | ------------------------ | ---------------------------------- |
10
+ | `PIXIE_ROOT` | `.pixie` | Root directory for all artefacts |
11
+ | `PIXIE_DB_PATH` | `.pixie/observations.db` | SQLite database file path |
12
+ | `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
13
+ | `PIXIE_DATASET_DIR` | `.pixie/datasets` | Directory for dataset JSON files |
12
14
 
13
15
  ---
14
16
 
@@ -42,7 +44,7 @@ pixie dataset save <name> --notes "some note"
42
44
  echo '"expected value"' | pixie dataset save <name> --expected-output
43
45
 
44
46
  # Run eval tests
45
- pixie-test [path] [-k filter_substring] [-v]
47
+ pixie test [path] [-k filter_substring] [-v]
46
48
  ```
47
49
 
48
50
  **`pixie dataset save` selection modes:**
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pixie-qa
3
- Version: 0.1.1
3
+ Version: 0.1.2
4
4
  Summary: Automated quality assurance for AI applications
5
5
  Project-URL: Homepage, https://github.com/yiouli/pixie-qa
6
6
  Project-URL: Repository, https://github.com/yiouli/pixie-qa
@@ -65,106 +65,38 @@ Description-Content-Type: text/markdown
65
65
 
66
66
  # pixie-qa
67
67
 
68
- A Claude skill and Python package for **eval-driven development** of LLM-powered applications.
68
+ An agent skill for **eval-driven development** of LLM-powered applications.
69
69
 
70
70
  Use this skill to instrument your app, build golden datasets from real runs, write eval-based tests, and catch regressions before they ship — all from a single conversation with Claude.
71
71
 
72
72
  ## What the Skill Does
73
73
 
74
- The `eval-driven-dev` skill guides Claude through the full QA loop for LLM applications:
74
+ The `eval-driven-dev` skill guides your coding agent through the full QA loop for LLM applications:
75
75
 
76
76
  1. **Understand the app** — read the codebase, trace the data flow, learn what the app is supposed to do
77
77
  2. **Instrument it** — add `enable_storage()` and `@observe` so every run is captured to a local SQLite database
78
78
  3. **Build a dataset** — save representative traces as test cases with `pixie dataset save`
79
79
  4. **Write eval tests** — generate `test_*.py` files with `assert_dataset_pass` and appropriate evaluators
80
- 5. **Run the tests** — `pixie-test` to run all evals and report per-case scores
80
+ 5. **Run the tests** — `pixie test` to run all evals and report per-case scores
81
81
  6. **Investigate failures** — look up the stored trace for each failure, diagnose, fix, repeat
82
82
 
83
83
  ## Getting Started
84
84
 
85
- ### 1. Add the skill to Claude
86
-
87
- The skill is bundled in this repository. Claude will automatically use it when you ask to evaluate, test, QA, or benchmark an LLM-powered Python project.
88
-
89
- If you are using an openskills-compatible agent host:
90
-
91
- ```bash
92
- npx openskills install anthropics/skills
93
- ```
94
-
95
- ### 2. Install the `pixie-qa` package in your project
96
-
97
- ```bash
98
- pip install pixie-qa # or: uv add pixie-qa
99
- ```
100
-
101
- Provider instrumentation extras:
102
-
103
- ```bash
104
- pip install "pixie-qa[openai]" # OpenAI
105
- pip install "pixie-qa[anthropic]" # Anthropic
106
- pip install "pixie-qa[langchain]" # LangChain
107
- pip install "pixie-qa[all]" # all providers
108
- ```
109
-
110
- ### 3. Ask Claude to set up evals
111
-
112
- Open a conversation and describe your project:
113
-
114
- > "I have a RAG chatbot in `app/chatbot.py`. Help me set up evals to make sure it's giving accurate answers."
115
-
116
- Claude will read your code, instrument it, build a dataset from a few real runs, write tests, and run them for you.
117
-
118
- ## Skill Workflow Example
119
-
120
- Here is a quick summary of what Claude does end-to-end:
121
-
122
- ```python
123
- # Claude instruments your app entry point
124
- from pixie import enable_storage, observe
125
-
126
- enable_storage() # one line: creates DB, registers handler
127
-
128
- # Claude adds @observe on the function to test
129
- @observe(name="answer_question")
130
- def answer_question(question: str) -> str:
131
- ...
132
- ```
85
+ ### 1. Add the skill to your coding agent
133
86
 
134
87
  ```bash
135
- # After running the app with a few real inputs:
136
- pixie dataset create qa-golden-set
137
- pixie dataset save qa-golden-set
88
+ npx openskills install yiouli/pixie-qa
138
89
  ```
139
90
 
140
- ```python
141
- # Claude writes tests/test_qa.py with:
142
- from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold
143
-
144
- async def test_factuality():
145
- await assert_dataset_pass(
146
- runnable=runnable,
147
- dataset_name="qa-golden-set",
148
- evaluators=[FactualityEval()],
149
- pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
150
- )
151
- ```
91
+ The accompanying Python package is installed automatically by the skill when it is used.
152
92
 
153
- ```bash
154
- # Then runs:
155
- pixie-test -v
156
- ```
93
+ ### 2. Ask coding agent to set up evals
157
94
 
158
- All symbols are importable from the top-level `pixie` package no need for submodule paths.
95
+ When developing a Python-based AI project, open a conversation and say something like:
159
96
 
160
- ## Repository Structure
97
+ > "setup QA for my agent"
161
98
 
162
- ```
163
- pixie/ Python package (instrumentation, storage, evals, dataset, cli)
164
- specs/ Design specs and architecture docs
165
- changelogs/ Per-feature change history
166
- .claude/skills/ Claude skill definitions and benchmarks
167
- ```
99
+ Your coding agent will read your code, instrument it, build a dataset from a few real runs, write and run eval-based tests, investigate failures and fix.
168
100
 
169
101
  ## Python Package
170
102