pixie-qa 0.1.0__tar.gz → 0.1.1__tar.gz

Files changed (427)
  1. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev/SKILL.md +94 -31
  2. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev/references/pixie-api.md +50 -47
  3. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.github/workflows/daily-release.yml +3 -3
  4. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.github/workflows/publish.yml +2 -2
  5. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/PKG-INFO +15 -6
  6. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/README.md +14 -5
  7. pixie_qa-0.1.1/changelogs/loud-failure-mode.md +58 -0
  8. pixie_qa-0.1.1/changelogs/root-package-exports-and-trace-id.md +58 -0
  9. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/docs/package.md +10 -9
  10. pixie_qa-0.1.1/pixie/__init__.py +108 -0
  11. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/evaluation.py +13 -17
  12. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/runner.py +30 -14
  13. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/evaluable.py +12 -3
  14. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pyproject.toml +1 -1
  15. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/evals-harness.md +8 -4
  16. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_evaluation.py +15 -6
  17. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_runner.py +87 -1
  18. pixie_qa-0.1.1/tests/pixie/observation_store/__init__.py +0 -0
  19. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/test_evaluable.py +48 -8
  20. pixie_qa-0.1.1/tests/pixie/test_init.py +157 -0
  21. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/uv.lock +28 -185
  22. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/evals.json +0 -52
  23. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/email-classifier/extractor.py +0 -40
  24. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/email-classifier/requirements.txt +0 -2
  25. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/email-classifier-mock/extractor.py +0 -57
  26. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/email-classifier-mock/requirements.txt +0 -1
  27. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/qa-app-with-tests/pixie_datasets/qa-golden-set.json +0 -23
  28. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/qa-app-with-tests/qa_app.py +0 -26
  29. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/qa-app-with-tests/requirements.txt +0 -2
  30. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/qa-app-with-tests/tests/test_qa.py +0 -24
  31. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/rag-chatbot/chatbot.py +0 -53
  32. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/rag-chatbot/requirements.txt +0 -2
  33. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/rag-chatbot-mock/chatbot.py +0 -46
  34. pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/rag-chatbot-mock/requirements.txt +0 -1
  35. pixie_qa-0.1.0/.github/workflows/deploy-docs.yml +0 -171
  36. pixie_qa-0.1.0/pixie/__init__.py +0 -11
  37. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/settings.local.json +0 -0
  38. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/benchmark.json +0 -0
  39. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/benchmark.md +0 -0
  40. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/eval_metadata.json +0 -0
  41. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/outputs/metrics.json +0 -0
  42. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/outputs/response.md +0 -0
  43. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  44. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/run-1/timing.json +0 -0
  45. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/outputs/metrics.json +0 -0
  46. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/outputs/response.md +0 -0
  47. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  48. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/run-1/timing.json +0 -0
  49. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/eval_metadata.json +0 -0
  50. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/outputs/metrics.json +0 -0
  51. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/outputs/response.md +0 -0
  52. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/run-1/grading.json +0 -0
  53. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/run-1/timing.json +0 -0
  54. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/outputs/metrics.json +0 -0
  55. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/outputs/response.md +0 -0
  56. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/run-1/grading.json +0 -0
  57. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/run-1/timing.json +0 -0
  58. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/eval_metadata.json +0 -0
  59. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/outputs/metrics.json +0 -0
  60. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/outputs/response.md +0 -0
  61. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  62. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/run-1/timing.json +0 -0
  63. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/outputs/metrics.json +0 -0
  64. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/outputs/response.md +0 -0
  65. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  66. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/run-1/timing.json +0 -0
  67. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/benchmark.json +0 -0
  68. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/benchmark.md +0 -0
  69. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/eval_metadata.json +0 -0
  70. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  71. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/outputs/metrics.json +0 -0
  72. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/outputs/summary.md +0 -0
  73. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/pixie_datasets/qa-golden-set.json +0 -0
  74. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/qa_app.py +0 -0
  75. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/requirements.txt +0 -0
  76. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/tests/test_qa.py +0 -0
  77. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/timing.json +0 -0
  78. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  79. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/outputs/metrics.json +0 -0
  80. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/outputs/summary.md +0 -0
  81. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/pixie_datasets/qa-golden-set.json +0 -0
  82. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/qa_app.py +0 -0
  83. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/requirements.txt +0 -0
  84. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/tests/test_qa.py +0 -0
  85. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/timing.json +0 -0
  86. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/eval_metadata.json +0 -0
  87. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/grading.json +0 -0
  88. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/outputs/metrics.json +0 -0
  89. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/outputs/summary.md +0 -0
  90. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/MEMORY.md +0 -0
  91. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/build_dataset.py +0 -0
  92. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/extractor.py +0 -0
  93. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/requirements.txt +0 -0
  94. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/tests/__init__.py +0 -0
  95. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/tests/test_email_extraction.py +0 -0
  96. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/timing.json +0 -0
  97. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/grading.json +0 -0
  98. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/outputs/metrics.json +0 -0
  99. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/outputs/summary.md +0 -0
  100. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/build_dataset.py +0 -0
  101. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/extractor.py +0 -0
  102. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/requirements.txt +0 -0
  103. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/test_extractor.py +0 -0
  104. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/timing.json +0 -0
  105. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/eval_metadata.json +0 -0
  106. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  107. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/outputs/metrics.json +0 -0
  108. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/outputs/summary.md +0 -0
  109. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/MEMORY.md +0 -0
  110. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/build_dataset.py +0 -0
  111. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/chatbot.py +0 -0
  112. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/requirements.txt +0 -0
  113. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/tests/test_rag_chatbot.py +0 -0
  114. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/timing.json +0 -0
  115. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  116. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/outputs/metrics.json +0 -0
  117. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/outputs/summary.md +0 -0
  118. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/build_dataset.py +0 -0
  119. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/chatbot.py +0 -0
  120. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/requirements.txt +0 -0
  121. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/test_chatbot.py +0 -0
  122. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/timing.json +0 -0
  123. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/benchmark.json +0 -0
  124. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/benchmark.md +0 -0
  125. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/eval_metadata.json +0 -0
  126. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/grading.json +0 -0
  127. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
  128. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  129. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/qa_app.py +0 -0
  130. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/requirements.txt +0 -0
  131. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
  132. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  133. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
  134. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/outputs/test_qa.py +0 -0
  135. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/timing.json +0 -0
  136. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/grading.json +0 -0
  137. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/INVESTIGATION_NOTES.md +0 -0
  138. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  139. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/qa_app.py +0 -0
  140. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/requirements.txt +0 -0
  141. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
  142. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  143. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/outputs/INVESTIGATION_NOTES.md +0 -0
  144. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/outputs/test_qa.py +0 -0
  145. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/timing.json +0 -0
  146. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/eval_metadata.json +0 -0
  147. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/grading.json +0 -0
  148. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
  149. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/build_dataset.py +0 -0
  150. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/extractor.py +0 -0
  151. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/requirements.txt +0 -0
  152. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/run_evals.sh +0 -0
  153. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/tests/test_classifier.py +0 -0
  154. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/grading.json +0 -0
  155. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
  156. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/build_dataset.py +0 -0
  157. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
  158. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/test_classifier.py +0 -0
  159. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/timing.json +0 -0
  160. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/grading.json +0 -0
  161. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/collect_traces.py +0 -0
  162. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/extractor.py +0 -0
  163. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/requirements.txt +0 -0
  164. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/grading.json +0 -0
  165. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/outputs/collect_traces.py +0 -0
  166. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
  167. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/timing.json +0 -0
  168. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/eval_metadata.json +0 -0
  169. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/grading.json +0 -0
  170. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
  171. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
  172. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/pixie_datasets/rag-chatbot-golden.json +0 -0
  173. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/pixie_observations.db +0 -0
  174. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
  175. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/tests/test_chatbot.py +0 -0
  176. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  177. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
  178. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
  179. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/test_chatbot.py +0 -0
  180. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/timing.json +0 -0
  181. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/grading.json +0 -0
  182. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/capture_traces.py +0 -0
  183. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
  184. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
  185. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/test_chatbot_evals.py +0 -0
  186. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  187. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/capture_traces.py +0 -0
  188. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/chatbot.py +0 -0
  189. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_evals.py +0 -0
  190. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/timing.json +0 -0
  191. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/benchmark.json +0 -0
  192. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/benchmark.md +0 -0
  193. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/eval_metadata.json +0 -0
  194. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/grading.json +0 -0
  195. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
  196. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  197. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/qa_app.py +0 -0
  198. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/requirements.txt +0 -0
  199. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
  200. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  201. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
  202. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/outputs/test_qa.py +0 -0
  203. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/timing.json +0 -0
  204. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/grading.json +0 -0
  205. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  206. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/qa_app.py +0 -0
  207. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/requirements.txt +0 -0
  208. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
  209. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  210. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/run-1/outputs/test_qa.py +0 -0
  211. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/timing.json +0 -0
  212. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/eval_metadata.json +0 -0
  213. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/grading.json +0 -0
  214. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
  215. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/extractor.py +0 -0
  216. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/pixie_datasets/email-classifier-golden.json +0 -0
  217. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/pixie_observations.db +0 -0
  218. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/requirements.txt +0 -0
  219. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/tests/test_email_classifier.py +0 -0
  220. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/grading.json +0 -0
  221. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
  222. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
  223. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/test_email_classifier.py +0 -0
  224. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/timing.json +0 -0
  225. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/grading.json +0 -0
  226. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/conftest.py +0 -0
  227. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/extractor.py +0 -0
  228. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/generate_dataset.py +0 -0
  229. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/instrumented_extractor.py +0 -0
  230. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/pytest.ini +0 -0
  231. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/requirements.txt +0 -0
  232. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/test_email_classifier.py +0 -0
  233. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/grading.json +0 -0
  234. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
  235. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/outputs/test_email_classifier.py +0 -0
  236. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/timing.json +0 -0
  237. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/eval_metadata.json +0 -0
  238. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/grading.json +0 -0
  239. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
  240. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
  241. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/pixie_datasets/rag-chatbot-golden.json +0 -0
  242. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/pixie_observations.db +0 -0
  243. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
  244. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/tests/test_rag_chatbot.py +0 -0
  245. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  246. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
  247. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
  248. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/test_rag_chatbot.py +0 -0
  249. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/timing.json +0 -0
  250. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/grading.json +0 -0
  251. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
  252. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/chatbot_instrumented.py +0 -0
  253. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
  254. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/save_dataset.py +0 -0
  255. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/test_chatbot_evals.py +0 -0
  256. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  257. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/outputs/chatbot_instrumented.py +0 -0
  258. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_evals.py +0 -0
  259. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/timing.json +0 -0
  260. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/benchmark.json +0 -0
  261. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/benchmark.md +0 -0
  262. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/eval_metadata.json +0 -0
  263. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/grading.json +0 -0
  264. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
  265. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  266. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/qa_app.py +0 -0
  267. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/requirements.txt +0 -0
  268. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
  269. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/grading.json +0 -0
  270. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
  271. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/pixie_datasets/qa-golden-set.json +0 -0
  272. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/qa_app.py +0 -0
  273. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/requirements.txt +0 -0
  274. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/tests/test_qa.py +0 -0
  275. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/timing.json +0 -0
  276. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/grading.json +0 -0
  277. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/MEMORY.md +0 -0
  278. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
  279. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/qa_app.py +0 -0
  280. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/requirements.txt +0 -0
  281. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
  282. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/grading.json +0 -0
  283. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/MEMORY.md +0 -0
  284. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/pixie_datasets/qa-golden-set.json +0 -0
  285. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/qa_app.py +0 -0
  286. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/requirements.txt +0 -0
  287. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/tests/test_qa.py +0 -0
  288. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/timing.json +0 -0
  289. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/eval_metadata.json +0 -0
  290. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/grading.json +0 -0
  291. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
  292. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/build_dataset.py +0 -0
  293. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/extractor.py +0 -0
  294. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/requirements.txt +0 -0
  295. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/tests/test_email_classifier.py +0 -0
  296. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/grading.json +0 -0
  297. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
  298. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/build_dataset.py +0 -0
  299. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
  300. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/requirements.txt +0 -0
  301. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/tests/test_email_classifier.py +0 -0
  302. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/timing.json +0 -0
  303. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/grading.json +0 -0
  304. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/build_dataset.py +0 -0
  305. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/extractor.py +0 -0
  306. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/requirements.txt +0 -0
  307. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/test_email_classifier.py +0 -0
  308. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/grading.json +0 -0
  309. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/build_dataset.py +0 -0
  310. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
  311. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/requirements.txt +0 -0
  312. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/test_email_classifier.py +0 -0
  313. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/timing.json +0 -0
  314. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/eval_metadata.json +0 -0
  315. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/grading.json +0 -0
  316. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
  317. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/build_dataset.py +0 -0
  318. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
  319. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
  320. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/tests/test_rag_chatbot.py +0 -0
  321. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
  322. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
  323. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/build_dataset.py +0 -0
  324. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
  325. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/requirements.txt +0 -0
  326. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/tests/test_rag_chatbot.py +0 -0
  327. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/timing.json +0 -0
  328. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/grading.json +0 -0
  329. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/MEMORY.md +0 -0
  330. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
  331. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/datasets/rag-chatbot-golden.json +0 -0
  332. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
  333. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/test_chatbot_eval.py +0 -0
  334. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
  335. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/MEMORY.md +0 -0
  336. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/chatbot.py +0 -0
  337. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/datasets/rag-chatbot-golden.json +0 -0
  338. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/requirements.txt +0 -0
  339. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_eval.py +0 -0
  340. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/timing.json +0 -0
  341. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-1.html +0 -0
  342. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-2.html +0 -0
  343. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-3.html +0 -0
  344. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-4.html +0 -0
  345. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-5.html +0 -0
  346. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/trigger-eval-set.json +0 -0
  347. /pixie_qa-0.1.0/tests/__init__.py → /pixie_qa-0.1.1/.env +0 -0
  348. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.github/copilot-instructions.md +0 -0
  349. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.gitignore +0 -0
  350. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/LICENSE +0 -0
  351. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/async-handler-processing.md +0 -0
  352. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/autoevals-adapters.md +0 -0
  353. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/cli-dataset-commands.md +0 -0
  354. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/dataset-management.md +0 -0
  355. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/eval-harness.md +0 -0
  356. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/expected-output-in-evals.md +0 -0
  357. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/instrumentation-module-implementation.md +0 -0
  358. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/manual-instrumentation-usability.md +0 -0
  359. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/observation-store-implementation.md +0 -0
  360. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/usability-utils.md +0 -0
  361. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/cli/__init__.py +0 -0
  362. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/cli/dataset_command.py +0 -0
  363. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/cli/main.py +0 -0
  364. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/cli/test_command.py +0 -0
  365. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/config.py +0 -0
  366. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/dataset/__init__.py +0 -0
  367. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/dataset/models.py +0 -0
  368. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/dataset/store.py +0 -0
  369. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/__init__.py +0 -0
  370. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/criteria.py +0 -0
  371. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/eval_utils.py +0 -0
  372. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/scorers.py +0 -0
  373. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/trace_capture.py +0 -0
  374. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/trace_helpers.py +0 -0
  375. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/__init__.py +0 -0
  376. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/context.py +0 -0
  377. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/handler.py +0 -0
  378. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/handlers.py +0 -0
  379. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/instrumentors.py +0 -0
  380. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/observation.py +0 -0
  381. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/processor.py +0 -0
  382. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/queue.py +0 -0
  383. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/spans.py +0 -0
  384. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/__init__.py +0 -0
  385. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/piccolo_conf.py +0 -0
  386. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/piccolo_migrations/__init__.py +0 -0
  387. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/serialization.py +0 -0
  388. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/store.py +0 -0
  389. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/tables.py +0 -0
  390. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/tree.py +0 -0
  391. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/agent-skill.md +0 -0
  392. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/autoevals-adapters.md +0 -0
  393. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/dataset-management.md +0 -0
  394. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/expected-output-in-evals.md +0 -0
  395. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/instrumentation.md +0 -0
  396. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/manual-instrumentation-usability.md +0 -0
  397. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/storage.md +0 -0
  398. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/usability-utils.md +0 -0
  399. {pixie_qa-0.1.0/tests/pixie → pixie_qa-0.1.1/tests}/__init__.py +0 -0
  400. {pixie_qa-0.1.0/tests/pixie/cli → pixie_qa-0.1.1/tests/pixie}/__init__.py +0 -0
  401. {pixie_qa-0.1.0/tests/pixie/dataset → pixie_qa-0.1.1/tests/pixie/cli}/__init__.py +0 -0
  402. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/cli/test_dataset_command.py +0 -0
  403. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/cli/test_main.py +0 -0
  404. {pixie_qa-0.1.0/tests/pixie/evals → pixie_qa-0.1.1/tests/pixie/dataset}/__init__.py +0 -0
  405. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/dataset/test_models.py +0 -0
  406. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/dataset/test_store.py +0 -0
  407. {pixie_qa-0.1.0/tests/pixie/instrumentation → pixie_qa-0.1.1/tests/pixie/evals}/__init__.py +0 -0
  408. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_criteria.py +0 -0
  409. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_eval_utils.py +0 -0
  410. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_scorers.py +0 -0
  411. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_trace_capture.py +0 -0
  412. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_trace_helpers.py +0 -0
  413. {pixie_qa-0.1.0/tests/pixie/observation_store → pixie_qa-0.1.1/tests/pixie/instrumentation}/__init__.py +0 -0
  414. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/conftest.py +0 -0
  415. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_context.py +0 -0
  416. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_handler.py +0 -0
  417. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_integration.py +0 -0
  418. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_observation.py +0 -0
  419. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_processor.py +0 -0
  420. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_queue.py +0 -0
  421. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_spans.py +0 -0
  422. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_storage_handler.py +0 -0
  423. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/conftest.py +0 -0
  424. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/test_serialization.py +0 -0
  425. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/test_store.py +0 -0
  426. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/test_tree.py +0 -0
  427. {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/test_config.py +0 -0
@@ -11,6 +11,34 @@ The loop is: understand the app → instrument it → write the test file → bu
11
11
 
12
12
  ---
13
13
 
14
+ ## Stage 0: Ensure pixie-qa is Installed and API Keys Are Set
15
+
16
+ Before doing anything else, check that the `pixie-qa` package is available:
17
+
18
+ ```bash
19
+ python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
20
+ ```
21
+
22
+ If it's not installed, install it:
23
+
24
+ ```bash
25
+ pip install pixie-qa
26
+ ```
27
+
28
+ This provides the `pixie` Python module, the `pixie` CLI, and the `pixie-test` test runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.
29
+
30
+ ### Verify API keys
31
+
32
+ The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. **Before running anything**, verify the key is set:
33
+
34
+ ```bash
35
+ [ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
36
+ ```
37
+
38
+ If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors.
39
+
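The two shell checks above can also be combined into one pre-flight step. The following is a minimal sketch using only the Python standard library; the helper name `preflight` and its default arguments are assumptions made for illustration and are not part of `pixie-qa`.

```python
import importlib.util
import os


def preflight(module="pixie", keys=("OPENAI_API_KEY",)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module) is None:
        problems.append(f"module '{module}' is not importable")
    for key in keys:
        # Treat unset and empty values alike, mirroring `[ -n "$VAR" ]`.
        if not os.environ.get(key):
            problems.append(f"environment variable {key} is missing")
    return problems
```

Calling `preflight()` before running the app or the evals, and stopping if the returned list is non-empty, gives the same guarantee as the shell commands while reporting every problem at once.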
40
+ ---
41
+
14
42
  ## Stage 1: Understand the Application
15
43
 
16
44
  Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.
@@ -69,9 +97,9 @@ This is what actually persists traces to disk. Without it, `@observe` decorators
69
97
  `@observe` on a function captures all its kwargs as `eval_input` and its return value as `eval_output`:
70
98
 
71
99
  ```python
72
- import pixie.instrumentation as px
100
+ from pixie import observe
73
101
 
74
- @px.observe(name="answer_question")
102
+ @observe(name="answer_question")
75
103
  def answer_question(question: str, context: str) -> str:
76
104
      ...
77
105
  ```
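To make the capture behaviour described above concrete, here is a toy stand-in decorator. It is not pixie's implementation: `toy_observe` and the `captured` list are hypothetical names, and the sketch only illustrates the idea of recording a call's arguments as `eval_input` and its return value as `eval_output`.

```python
import functools
import inspect

captured = []  # hypothetical in-memory sink; pixie itself persists traces to storage


def toy_observe(name):
    """Illustrative decorator that records inputs and output of each call."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Resolve positional and keyword arguments to parameter names.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            output = func(*args, **kwargs)
            captured.append({
                "name": name,
                "eval_input": dict(bound.arguments),
                "eval_output": output,
            })
            return output
        return wrapper
    return decorator


@toy_observe(name="answer_question")
def answer_question(question: str, context: str) -> str:
    return f"{question} -> {context}"


answer_question("Capital of France?", context="Paris is the capital.")
```

After the call, `captured[0]` holds one record whose `eval_input` has the same keys as the function's parameters, which is why the wrapped function's signature determines the shape of the dataset items.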
@@ -79,7 +107,9 @@ def answer_question(question: str, context: str) -> str:
79
107
  For more control, use the context manager:
80
108
 
81
109
  ```python
82
- with px.start_observation(input={"question": question, "context": context}, name="answer_question") as obs:
110
+ from pixie import start_observation
111
+
112
+ with start_observation(input={"question": question, "context": context}, name="answer_question") as obs:
83
113
      result = run_pipeline(question, context)
84
114
      obs.set_output(result)
85
115
      obs.set_metadata("retrieved_chunks", len(chunks))
@@ -87,7 +117,14 @@ with px.start_observation(input={"question": question, "context": context}, name
87
117
 
88
118
  Wrap at the outermost boundary that represents one "test case" — for a RAG app that's probably `answer_question(question, context)`, not the internal LLM call. The dataset items will have the same shape as whatever this function receives and returns.
89
119
 
90
- After instrumentation, call `px.flush()` at the end of runs to make sure all spans are written before you try to save them to a dataset.
120
+ After instrumentation, call `flush()` at the end of runs to make sure all spans are written before you try to save them to a dataset:
121
+
122
+ ```python
123
+ from pixie import flush
124
+ flush()
125
+ ```
126
+
127
+ **Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
91
128
 
92
129
  ---
93
130
 
@@ -98,9 +135,7 @@ Write the test file before building the dataset. This might seem backwards, but
98
135
  Create `tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls your app function, plus an async test function that calls `assert_dataset_pass`:
99
136
 
100
137
  ```python
101
- from pixie import enable_storage
102
- from pixie.evals import assert_dataset_pass, FactualityEval, ScoreThreshold
103
- from pixie.evals import last_llm_call # or: from pixie.evals import root
138
+ from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
104
139
 
105
140
  from myapp import answer_question
106
141
 
@@ -136,16 +171,56 @@ pixie-test -v # verbose: shows per-case scores and reasoning
136
171
 
137
172
  ## Stage 5: Build the Dataset
138
173
 
139
- Create the dataset first, then populate it by running the app:
174
+ Create the dataset first, then populate it by **actually running the app** with representative inputs. This is critical — dataset items should contain real app outputs and trace metadata, not fabricated data.
140
175
 
141
176
  ```bash
142
177
  pixie dataset create <dataset-name>
143
178
  pixie dataset list # verify it exists
144
179
  ```
145
180
 
146
- ### Option A: Capture from real runs (the natural starting point)
181
+ ### Run the app and capture traces to the dataset
182
+
183
+ Write a simple script that calls the instrumented function for each input, flushes traces, then saves them to the dataset. This is the **recommended and default** approach:
184
+
185
+ ```python
186
+ import asyncio
187
+ from pixie import enable_storage, flush, DatasetStore, Evaluable
188
+
189
+ from myapp import answer_question
190
+
191
+ enable_storage()
192
+
193
+ GOLDEN_CASES = [
194
+     ("What is the capital of France?", "Paris"),
195
+     ("What is the speed of light?", "299,792,458 meters per second"),
196
+ ]
197
+
198
+ async def build_dataset():
199
+     store = DatasetStore()
200
+     try:
201
+         store.create("qa-golden-set")
202
+     except FileExistsError:
203
+         pass
204
+
205
+     for question, expected in GOLDEN_CASES:
206
+         # Actually run the app so traces are captured
207
+         result = answer_question(question=question)
208
+         flush()  # ensure trace is written to DB
209
+
210
+         # Save the latest trace to the dataset with expected output
211
+         # Using the CLI is the easiest way:
212
+         #   pixie dataset save qa-golden-set --expected-output
213
+         # Or save programmatically with the real output:
214
+         store.append("qa-golden-set", Evaluable(
215
+             eval_input={"question": question},
216
+             eval_output=result,
217
+             expected_output=expected,
218
+         ))
219
+
220
+ asyncio.run(build_dataset())
221
+ ```
147
222
 
148
- Run the app with representative inputs, then save each trace to the dataset:
223
+ Alternatively, use the CLI for per-case capture:
149
224
 
150
225
  ```bash
151
226
  # Run the app (enable_storage() must be active)
@@ -164,24 +239,11 @@ pixie dataset save <dataset-name> --notes "basic geography question"
164
239
  echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
165
240
  ```
166
241
 
167
- Try to cover the range of inputs you actually care about: normal cases, edge cases, things the app might plausibly get wrong (empty input, ambiguous queries, no-answer cases).
168
-
169
- ### Option B: Build programmatically
170
-
171
- When you want to bulk-load items or add expected outputs directly:
172
-
173
- ```python
174
- from pixie.dataset.store import DatasetStore
175
- from pixie.storage.evaluable import Evaluable
176
-
177
- store = DatasetStore()
178
- store.create("<dataset-name>")
179
- store.append("<dataset-name>", Evaluable(
180
- eval_input={"question": "What is the capital of France?", "context": "Paris is the capital..."},
181
- eval_output="Paris is the capital of France.",
182
- expected_output="Paris",
183
- ))
184
- ```
242
+ **Key rules for dataset building:**
243
+ - **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces.
244
+ - **Include expected outputs** for comparison-based evaluators like `FactualityEval`.
245
+ - **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
246
+ - When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
185
247
 
186
248
  ---
187
249
 
@@ -206,18 +268,19 @@ pixie-test -v # start here — shows score and reasoning per case
206
268
  If you need to dig into a specific trace, look up the `trace_id` from the dataset:
207
269
 
208
270
  ```python
209
- from pixie.dataset.store import DatasetStore
271
+ from pixie import DatasetStore
272
+
210
273
  store = DatasetStore()
211
274
  ds = store.get("<dataset-name>")
212
275
  for i, item in enumerate(ds.items):
213
- print(i, item.eval_metadata) # trace_id is here if saved via pixie dataset save
276
+ print(i, item.eval_metadata) # trace_id is always included in eval_metadata
214
277
  ```
215
278
 
216
279
  Then inspect the full span tree:
217
280
 
218
281
  ```python
219
282
  import asyncio
220
- from pixie.storage.store import ObservationStore
283
+ from pixie import ObservationStore
221
284
 
222
285
  async def inspect(trace_id: str):
223
286
  store = ObservationStore()
@@ -4,29 +4,28 @@
4
4
 
5
5
  All settings read from environment variables at call time:
6
6
 
7
- | Variable | Default | Description |
8
- |---------------------|-------------------------|-------------------------------------|
9
- | `PIXIE_DB_PATH` | `pixie_observations.db` | SQLite database file path |
10
- | `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
11
- | `PIXIE_DATASET_DIR` | `pixie_datasets` | Directory for dataset JSON files |
7
+ | Variable | Default | Description |
8
+ | ------------------- | ----------------------- | ---------------------------------- |
9
+ | `PIXIE_DB_PATH` | `pixie_observations.db` | SQLite database file path |
10
+ | `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
11
+ | `PIXIE_DATASET_DIR` | `pixie_datasets` | Directory for dataset JSON files |
12
12
 
13
13
  ---
14
14
 
15
- ## Instrumentation API (`pixie.instrumentation` / `pixie`)
15
+ ## Instrumentation API (`pixie`)
16
16
 
17
17
  ```python
18
- from pixie import enable_storage # one-liner setup
19
- import pixie.instrumentation as px # full API
18
+ from pixie import enable_storage, observe, start_observation, flush, init, add_handler
20
19
  ```
21
20
 
22
- | Function / Decorator | Signature | Notes |
23
- |---|---|---|
24
- | `enable_storage()` | `() → StorageHandler` | Idempotent. Creates DB, registers handler. Call at app startup. |
25
- | `px.init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
26
- | `px.observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
27
- | `px.start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
28
- | `px.flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
29
- | `px.add_handler` | `(handler) → None` | Register a custom handler (must call `px.init()` first). |
21
+ | Function / Decorator | Signature | Notes |
22
+ | -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- |
23
+ | `enable_storage()` | `() → StorageHandler` | Idempotent. Creates DB, registers handler. Call at app startup. |
24
+ | `init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
25
+ | `observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
26
+ | `start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
27
+ | `flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
28
+ | `add_handler` | `(handler) → None` | Register a custom handler (must call `init()` first). |
30
29
 
31
30
  ---
32
31
 
@@ -47,16 +46,17 @@ pixie-test [path] [-k filter_substring] [-v]
47
46
  ```
48
47
 
49
48
  **`pixie dataset save` selection modes:**
49
+
50
50
  - `root` (default) — the outermost `@observe` or `start_observation` span
51
51
  - `last_llm_call` — the most recent LLM API call span in the trace
52
52
  - `by_name` — a span matching the `--span-name` argument (takes the last matching span)
53
53
 
54
54
  ---
55
55
 
56
- ## Eval Harness (`pixie.evals`)
56
+ ## Eval Harness (`pixie`)
57
57
 
58
58
  ```python
59
- from pixie.evals import (
59
+ from pixie import (
60
60
  assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
61
61
  EvalAssertionError, Evaluation, ScoreThreshold,
62
62
  capture_traces, MemoryTraceHandler,
@@ -67,6 +67,7 @@ from pixie.evals import (
67
67
  ### Key functions
68
68
 
69
69
  **`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**
70
+
70
71
  - Loads dataset by name, runs `assert_pass` with all items.
71
72
  - `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
72
73
  - `evaluators`: list of evaluator callables.
@@ -74,12 +75,15 @@ from pixie.evals import (
74
75
  - `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.
75
76
 
76
77
  **`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**
78
+
77
79
  - Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs).
78
80
 
79
81
  **`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**
82
+
80
83
  - Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.
81
84
 
82
85
  **`ScoreThreshold(threshold=0.5, pct=1.0)`**
86
+
83
87
  - `threshold`: min score per item (default 0.5).
84
88
  - `pct`: fraction of items that must meet threshold (default 1.0 = all).
85
89
  - Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
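
The `threshold` / `pct` arithmetic described above can be sketched in a few lines. This is a minimal illustration, assuming one score per item and a "greater than or equal" comparison on both values; `ThresholdSketch` and its `passes` method are hypothetical stand-ins, not pixie's `ScoreThreshold` implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThresholdSketch:
    """Hypothetical stand-in used only to illustrate the arithmetic."""

    threshold: float = 0.5  # minimum score an item needs
    pct: float = 1.0        # fraction of items that must reach the threshold

    def passes(self, scores: Sequence[float]) -> bool:
        if not scores:
            # Assumption: a run with no scores is treated as a failure.
            return False
        meeting = sum(1 for score in scores if score >= self.threshold)
        return meeting / len(scores) >= self.pct


criteria = ThresholdSketch(threshold=0.7, pct=0.8)

# 4 of 5 items reach 0.7, i.e. 80% of cases, so the criterion is met.
mostly_good = [0.9, 0.8, 0.75, 0.7, 0.4]
# Only 3 of 5 items reach 0.7, i.e. 60% of cases, so it is not met.
too_many_low = [0.9, 0.8, 0.6, 0.7, 0.4]

print(criteria.passes(mostly_good), criteria.passes(too_many_low))
```

How pixie aggregates scores when several evaluators run on the same item is not stated here, so the sketch deliberately leaves that out.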
@@ -96,42 +100,41 @@ from pixie.evals import (
96
100
 
97
101
  ### Heuristic (no LLM needed)
98
102
 
99
- | Evaluator | Use when |
100
- |---|---|
101
- | `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
102
- | `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
103
- | `NumericDiffEval(expected=...)` | Normalised numeric difference |
104
- | `JSONDiffEval(expected=...)` | Structural JSON comparison |
105
- | `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
106
- | `ListContainsEval(expected=...)` | Output list contains expected items |
103
+ | Evaluator | Use when |
104
+ | -------------------------------- | --------------------------------------------------- |
105
+ | `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
106
+ | `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
107
+ | `NumericDiffEval(expected=...)` | Normalised numeric difference |
108
+ | `JSONDiffEval(expected=...)` | Structural JSON comparison |
109
+ | `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
110
+ | `ListContainsEval(expected=...)` | Output list contains expected items |
107
111
 
108
112
  ### LLM-as-judge (require OpenAI key or compatible client)
109
113
 
110
- | Evaluator | Use when |
111
- |---|---|
114
+ | Evaluator | Use when |
115
+ | ----------------------------------------------------- | ----------------------------------------- |
112
116
  | `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference |
113
- | `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
114
- | `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
115
- | `TranslationEval(expected=..., language=..., ...)` | Translation quality |
116
- | `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
117
- | `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
118
- | `ModerationEval(threshold=..., client=...)` | Content moderation |
119
- | `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |
117
+ | `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
118
+ | `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
119
+ | `TranslationEval(expected=..., language=..., ...)` | Translation quality |
120
+ | `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
121
+ | `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
122
+ | `ModerationEval(threshold=..., client=...)` | Content moderation |
123
+ | `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |
120
124
 
121
125
  ### RAG / retrieval
122
126
 
123
- | Evaluator | Use when |
124
- |---|---|
125
- | `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to query |
126
- | `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
127
- | `AnswerRelevancyEval(client=...)` | Answer addresses the question |
128
- | `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference |
127
+ | Evaluator | Use when |
128
+ | ------------------------------------------------- | ------------------------------------------ |
129
+ | `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to query |
130
+ | `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
131
+ | `AnswerRelevancyEval(client=...)` | Answer addresses the question |
132
+ | `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference |
129
133
 
130
134
  ### Custom evaluator template
131
135
 
132
136
  ```python
133
- from pixie.evals import Evaluation
134
- from pixie.storage.evaluable import Evaluable
137
+ from pixie import Evaluation, Evaluable
135
138
 
136
139
  async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
137
140
  # evaluable.eval_input — what was passed to the observed function
@@ -146,8 +149,7 @@ async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
146
149
  ## Dataset Python API
147
150
 
148
151
  ```python
149
- from pixie.dataset.store import DatasetStore
150
- from pixie.storage.evaluable import Evaluable
152
+ from pixie import DatasetStore, Evaluable
151
153
 
152
154
  store = DatasetStore() # reads PIXIE_DATASET_DIR
153
155
  store.create("my-dataset") # create empty
@@ -160,9 +162,10 @@ store.delete("my-dataset") # delete entirely
160
162
  ```
161
163
 
162
164
  **`Evaluable` fields:**
165
+
163
166
  - `eval_input`: the input (what `@observe` captured as function kwargs)
164
167
  - `eval_output`: the output (return value of the observed function)
165
- - `eval_metadata`: dict of extra info (trace_id, provider, token counts, etc.)
168
+ - `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
166
169
  - `expected_output`: reference answer for comparison (`UNSET` if not provided)
167
170
 
168
171
  ---
@@ -170,7 +173,7 @@ store.delete("my-dataset") # delete entirely
170
173
  ## ObservationStore Python API
171
174
 
172
175
  ```python
173
- from pixie.storage.store import ObservationStore
176
+ from pixie import ObservationStore
174
177
 
175
178
  store = ObservationStore() # reads PIXIE_DB_PATH
176
179
  await store.create_tables()
@@ -10,14 +10,14 @@ jobs:
10
10
  release-and-publish:
11
11
  runs-on: ubuntu-latest
12
12
  permissions:
13
- contents: write # Required for creating tags and releases
14
- id-token: write # Required for trusted publishing to PyPI
13
+ contents: write # Required for creating tags and releases
14
+ id-token: write # Required for trusted publishing to PyPI
15
15
 
16
16
  steps:
17
17
  - name: Checkout repository
18
18
  uses: actions/checkout@v4
19
19
  with:
20
- fetch-depth: 0 # Required for accurate git history
20
+ fetch-depth: 0 # Required for accurate git history
21
21
  token: ${{ secrets.GITHUB_TOKEN }}
22
22
 
23
23
  - name: Check for commits since last successful daily release
@@ -11,8 +11,8 @@ jobs:
11
11
  publish-and-release:
12
12
  runs-on: ubuntu-latest
13
13
  permissions:
14
- contents: write # Required for creating tags and releases
15
- id-token: write # Required for trusted publishing to PyPI
14
+ contents: write # Required for creating tags and releases
15
+ id-token: write # Required for trusted publishing to PyPI
16
16
 
17
17
  steps:
18
18
  - name: Checkout
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pixie-qa
3
- Version: 0.1.0
3
+ Version: 0.1.1
4
4
  Summary: Automated quality assurance for AI applications
5
5
  Project-URL: Homepage, https://github.com/yiouli/pixie-qa
6
6
  Project-URL: Repository, https://github.com/yiouli/pixie-qa
@@ -119,23 +119,28 @@ Claude will read your code, instrument it, build a dataset from a few real runs,
119
119
 
120
120
  Here is a quick summary of what Claude does end-to-end:
121
121
 
122
- ```
122
+ ```python
123
123
  # Claude instruments your app entry point
124
- from pixie import enable_storage
124
+ from pixie import enable_storage, observe
125
+
125
126
  enable_storage() # one line: creates DB, registers handler
126
127
 
127
128
  # Claude adds @observe on the function to test
128
- import pixie.instrumentation as px
129
-
130
- @px.observe(name="answer_question")
129
+ @observe(name="answer_question")
131
130
  def answer_question(question: str) -> str:
132
131
  ...
132
+ ```
133
133
 
134
+ ```bash
134
135
  # After running the app with a few real inputs:
135
136
  pixie dataset create qa-golden-set
136
137
  pixie dataset save qa-golden-set
138
+ ```
137
139
 
140
+ ```python
138
141
  # Claude writes tests/test_qa.py with:
142
+ from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold
143
+
139
144
  async def test_factuality():
140
145
  await assert_dataset_pass(
141
146
  runnable=runnable,
@@ -143,11 +148,15 @@ async def test_factuality():
143
148
  evaluators=[FactualityEval()],
144
149
  pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
145
150
  )
151
+ ```
146
152
 
153
+ ```bash
147
154
  # Then runs:
148
155
  pixie-test -v
149
156
  ```
150
157
 
158
+ All symbols are importable from the top-level `pixie` package — no need for submodule paths.
159
+
151
160
  ## Repository Structure
152
161
 
153
162
  ```
@@ -54,23 +54,28 @@ Claude will read your code, instrument it, build a dataset from a few real runs,
54
54
 
55
55
  Here is a quick summary of what Claude does end-to-end:
56
56
 
57
- ```
57
+ ```python
58
58
  # Claude instruments your app entry point
59
- from pixie import enable_storage
59
+ from pixie import enable_storage, observe
60
+
60
61
  enable_storage() # one line: creates DB, registers handler
61
62
 
62
63
  # Claude adds @observe on the function to test
63
- import pixie.instrumentation as px
64
-
65
- @px.observe(name="answer_question")
64
+ @observe(name="answer_question")
66
65
  def answer_question(question: str) -> str:
67
66
  ...
67
+ ```
68
68
 
69
+ ```bash
69
70
  # After running the app with a few real inputs:
70
71
  pixie dataset create qa-golden-set
71
72
  pixie dataset save qa-golden-set
73
+ ```
72
74
 
75
+ ```python
73
76
  # Claude writes tests/test_qa.py with:
77
+ from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold
78
+
74
79
  async def test_factuality():
75
80
  await assert_dataset_pass(
76
81
  runnable=runnable,
@@ -78,11 +83,15 @@ async def test_factuality():
78
83
  evaluators=[FactualityEval()],
79
84
  pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
80
85
  )
86
+ ```
81
87
 
88
+ ```bash
82
89
  # Then runs:
83
90
  pixie-test -v
84
91
  ```
85
92
 
93
+ All symbols are importable from the top-level `pixie` package — no need for submodule paths.
94
+
86
95
  ## Repository Structure
87
96
 
88
97
  ```
@@ -0,0 +1,58 @@
1
+ # Loud Failure Mode
2
+
3
+ ## What Changed
4
+
5
+ Eliminated all silent failure paths in the eval harness. Runtime errors (missing
6
+ API keys, import failures, evaluator crashes) now propagate as exceptions instead
7
+ of being silently swallowed.
8
+
9
+ ### 1. `evaluate()` — evaluator exceptions propagate
10
+
11
+ **Before:** Any exception from an evaluator (e.g. missing API key, network error)
12
+ was caught and returned as `Evaluation(score=0.0, reasoning=str(exc))`. This made
13
+ real errors indistinguishable from legitimate low scores.
14
+
15
+ **After:** Evaluator exceptions propagate unchanged to the caller. If an evaluator
16
+ cannot run, the test fails loudly with the original error and traceback.
17
+
18
+ ### 2. `_load_module()` / `discover_tests()` — import errors propagate
19
+
20
+ **Before:** `_load_module()` caught all exceptions and returned `None`, causing
21
+ `discover_tests()` to silently skip broken test files. The result was
22
+ "no tests collected" with no explanation.
23
+
24
+ **After:** Import errors (missing packages, syntax errors, bad imports) propagate
25
+ immediately with the original traceback, making the root cause obvious.
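
The propagating behaviour described for `_load_module()` can be illustrated with the standard `importlib` machinery. This is a minimal sketch, not the code in `pixie/evals/runner.py`; the function name `load_module_loud` and the temporary file `test_broken.py` are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_module_loud(path: Path) -> ModuleType:
    """Load a Python file as a module and let any import-time error propagate."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot create an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    # No try/except here: SyntaxError, ImportError, etc. reach the caller
    # with their original traceback.
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    broken = Path(tmp) / "test_broken.py"
    broken.write_text("import a_package_that_does_not_exist_xyz\n")
    try:
        load_module_loud(broken)
        error = None
    except ModuleNotFoundError as exc:
        error = exc  # the root cause is visible instead of "no tests collected"

print(type(error).__name__)
```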
26
+
27
+ ### 3. `format_results()` — error messages always visible
28
+
29
+ **Before:** Failure and error messages were shown only with the `--verbose` flag.
30
+ Without it, tests showed only `✗` with no message.
31
+
32
+ **After:** The first line of the error message is always shown. `--verbose`
33
+ controls whether the full traceback is displayed.
34
+
35
+ ### 4. Removed dead `evals/` resource folder
36
+
37
+ Deleted `.claude/skills/eval-driven-dev/evals/` (contained `evals.json` and
38
+ `sample-projects/` with no references from the skill instructions).
39
+
40
+ ## Files Affected
41
+
42
+ - `pixie/evals/evaluation.py` — removed exception swallowing in `evaluate()`
43
+ - `pixie/evals/runner.py` — `_load_module()` raises on error; `discover_tests()`
44
+ propagates; `format_results()` always shows messages
45
+ - `tests/pixie/evals/test_evaluation.py` — updated test: expects propagation
46
+ instead of `score=0.0`; added sync evaluator error test
47
+ - `tests/pixie/evals/test_runner.py` — added import error, syntax error,
48
+ and format_results tests
49
+ - `specs/evals-harness.md` — updated error handling behavior and test expectations
50
+ - `.claude/skills/eval-driven-dev/evals/` — deleted
51
+
52
+ ## Migration Notes
53
+
54
+ - `evaluate()` no longer catches evaluator exceptions. Code that relied on
55
+ getting `Evaluation(score=0.0, details={"error": ...})` from crashed evaluators
56
+ must now handle exceptions directly.
57
+ - `discover_tests()` now raises on import errors instead of silently skipping
58
+ broken test files.
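
For callers affected by the first migration note, the difference can be sketched as follows. The example is a simplified, synchronous illustration with hypothetical stand-ins (`EvaluationSketch`, `evaluate_swallowing`, `evaluate_loud`, `failing_evaluator`); it does not reproduce pixie's real `evaluate()` signature.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvaluationSketch:
    """Hypothetical stand-in for an evaluation result (fields assumed)."""

    score: float
    reasoning: str


def failing_evaluator(output: str) -> EvaluationSketch:
    # Simulates an evaluator that cannot run, e.g. because of a missing key.
    raise RuntimeError("OPENAI_API_KEY is not set")


def evaluate_swallowing(
    evaluator: Callable[[str], EvaluationSketch], output: str
) -> EvaluationSketch:
    """Old style: a crash is converted into a zero score."""
    try:
        return evaluator(output)
    except Exception as exc:
        return EvaluationSketch(score=0.0, reasoning=str(exc))


def evaluate_loud(
    evaluator: Callable[[str], EvaluationSketch], output: str
) -> EvaluationSketch:
    """New style: the evaluator's exception reaches the caller unchanged."""
    return evaluator(output)


# Old behaviour: the error looks like a legitimate low score.
old_result = evaluate_swallowing(failing_evaluator, "Paris")

# New behaviour: callers that want to keep going must catch explicitly.
propagated: Optional[Exception] = None
try:
    evaluate_loud(failing_evaluator, "Paris")
except RuntimeError as exc:
    propagated = exc

print(old_result.score, repr(propagated))
```

Code that genuinely needs the old zero-score fallback can wrap the call in its own `try`/`except`, as `evaluate_swallowing` does, so the choice is visible at the call site.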
@@ -0,0 +1,58 @@
1
+ # Root Package Re-exports and Evaluable trace_id
2
+
3
+ ## What Changed
4
+
5
+ ### 1. Full public API re-exported from `pixie` root package
6
+
7
+ Previously, `pixie/__init__.py` only exported `enable_storage` and `StorageHandler`. Users (and the eval-driven-dev skill) had to use submodule imports like `import pixie.instrumentation as px`, `from pixie.evals import ...`, `from pixie.dataset.store import DatasetStore`, and `from pixie.storage.evaluable import Evaluable`.
8
+
9
+ Now **every public symbol** is importable from the top-level `pixie` package:
10
+
11
+ ```python
12
+ from pixie import observe, flush, start_observation, init, add_handler
13
+ from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call, root
14
+ from pixie import DatasetStore, Evaluable, ObservationStore, UNSET
15
+ ```
16
+
17
+ This eliminates Pylance resolution errors for downstream users and simplifies the import story.
18
+
19
+ ### 2. `as_evaluable()` now includes `trace_id` and `span_id` in metadata
20
+
21
+ Both `_observe_span_to_evaluable()` and `_llm_span_to_evaluable()` now inject the span's `trace_id` and `span_id` into `eval_metadata`. This means:
22
+
23
+ - `pixie dataset save` automatically includes trace provenance in the dataset
24
+ - Users can always look up the original trace for any dataset item
25
+ - The skill's investigation flow ("look up trace_id from metadata") actually works
26
+
27
+ ### 3. Skill instructions updated
28
+
29
+ - **Stage 0**: Now verifies `OPENAI_API_KEY` (or equivalent) before running anything
30
+ - **Stage 3**: All code examples use `from pixie import ...` (no submodule imports)
31
+ - **Stage 4**: Test file example uses `from pixie import ...`
32
+ - **Stage 5**: Dataset building now emphasizes actually running the app to capture real outputs and traces; removed the misleading "Option B" that built datasets with fabricated/null outputs
33
+ - **Stage 7**: Investigation examples use `from pixie import DatasetStore, ObservationStore`
34
+ - **API reference**: All imports updated to top-level
35
+
36
+ ## Files Affected
37
+
38
+ ### Package
39
+
40
+ - `pixie/__init__.py` — re-exports all public API symbols
41
+ - `pixie/storage/evaluable.py` — `as_evaluable()` includes trace_id/span_id
42
+
43
+ ### Tests
44
+
45
+ - `tests/pixie/test_init.py` — **new** — 27 tests verifying root package exports
46
+ - `tests/pixie/observation_store/test_evaluable.py` — added trace_id/span_id assertions
47
+
48
+ ### Docs
49
+
50
+ - `README.md` — code examples updated to top-level imports
51
+ - `docs/package.md` — all import examples updated
52
+ - `.claude/skills/eval-driven-dev/SKILL.md` — full skill instruction rewrite
53
+ - `.claude/skills/eval-driven-dev/references/pixie-api.md` — API reference import paths
54
+
55
+ ## Migration Notes
56
+
57
+ - **No breaking changes.** Submodule imports (`from pixie.evals import ...`, `import pixie.instrumentation as px`) continue to work. The top-level re-exports are purely additive.
58
+ - `eval_metadata` from `as_evaluable()` now always contains `trace_id` and `span_id` keys. Code that checks `eval_metadata is None` for ObserveSpans with no user metadata should instead check for specific keys.
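
The second migration note can be handled with a small helper. This is a minimal sketch, assuming `eval_metadata` behaves like a plain dict; the helper `user_metadata` and the constant `PROVENANCE_KEYS` are hypothetical names, not part of pixie.

```python
from typing import Any, Dict, Mapping, Optional

# Keys that the changelog says are now always injected into eval_metadata.
PROVENANCE_KEYS = frozenset({"trace_id", "span_id"})


def user_metadata(eval_metadata: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Return only the metadata that is not trace provenance."""
    if not eval_metadata:
        return {}
    return {
        key: value
        for key, value in eval_metadata.items()
        if key not in PROVENANCE_KEYS
    }


# A span with no user metadata is no longer None; it carries provenance only.
provenance_only = {"trace_id": "abc123", "span_id": "def456"}
with_extras = {"trace_id": "abc123", "span_id": "def456", "provider": "openai"}

print(user_metadata(provenance_only))  # {}
print(user_metadata(with_extras))      # {'provider': 'openai'}
```

Checking `if user_metadata(item.eval_metadata):` then replaces the earlier `eval_metadata is None` test, and it keeps working for datasets saved before this release.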