pixie-qa 0.1.0.tar.gz → 0.1.1.tar.gz
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev/SKILL.md +94 -31
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev/references/pixie-api.md +50 -47
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.github/workflows/daily-release.yml +3 -3
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.github/workflows/publish.yml +2 -2
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/PKG-INFO +15 -6
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/README.md +14 -5
- pixie_qa-0.1.1/changelogs/loud-failure-mode.md +58 -0
- pixie_qa-0.1.1/changelogs/root-package-exports-and-trace-id.md +58 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/docs/package.md +10 -9
- pixie_qa-0.1.1/pixie/__init__.py +108 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/evaluation.py +13 -17
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/runner.py +30 -14
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/evaluable.py +12 -3
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pyproject.toml +1 -1
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/evals-harness.md +8 -4
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_evaluation.py +15 -6
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_runner.py +87 -1
- pixie_qa-0.1.1/tests/pixie/observation_store/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/test_evaluable.py +48 -8
- pixie_qa-0.1.1/tests/pixie/test_init.py +157 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/uv.lock +28 -185
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/evals.json +0 -52
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/email-classifier/extractor.py +0 -40
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/email-classifier/requirements.txt +0 -2
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/email-classifier-mock/extractor.py +0 -57
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/email-classifier-mock/requirements.txt +0 -1
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/qa-app-with-tests/pixie_datasets/qa-golden-set.json +0 -23
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/qa-app-with-tests/qa_app.py +0 -26
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/qa-app-with-tests/requirements.txt +0 -2
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/qa-app-with-tests/tests/test_qa.py +0 -24
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/rag-chatbot/chatbot.py +0 -53
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/rag-chatbot/requirements.txt +0 -2
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/rag-chatbot-mock/chatbot.py +0 -46
- pixie_qa-0.1.0/.claude/skills/eval-driven-dev/evals/sample-projects/rag-chatbot-mock/requirements.txt +0 -1
- pixie_qa-0.1.0/.github/workflows/deploy-docs.yml +0 -171
- pixie_qa-0.1.0/pixie/__init__.py +0 -11
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/settings.local.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/benchmark.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/benchmark.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/outputs/response.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/with_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/outputs/response.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-debug-failures/without_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/outputs/response.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/with_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/outputs/response.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-json-extraction/without_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/outputs/response.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/with_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/outputs/response.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-1/eval-rag-chatbot/without_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/benchmark.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/benchmark.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/outputs/summary.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/project/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/with_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/outputs/summary.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/project/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-debug-failures/without_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/outputs/summary.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/tests/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/project/tests/test_email_extraction.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/with_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/outputs/summary.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/project/test_extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-json-extraction/without_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/outputs/summary.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/project/tests/test_rag_chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/with_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/outputs/metrics.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/outputs/summary.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/project/test_chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-2/eval-rag-chatbot/without_skill/run-1/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/benchmark.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/benchmark.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/run-1/outputs/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/INVESTIGATION_NOTES.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/outputs/INVESTIGATION_NOTES.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/run-1/outputs/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-debug-failures/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/run_evals.sh +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/project/tests/test_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/run-1/outputs/test_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/collect_traces.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/outputs/collect_traces.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-email-classifier/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/pixie_datasets/rag-chatbot-golden.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/pixie_observations.db +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/project/tests/test_chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/run-1/outputs/test_chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/capture_traces.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/project/test_chatbot_evals.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/capture_traces.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_evals.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-3/eval-rag-chatbot/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/benchmark.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/benchmark.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/run-1/outputs/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/run-1/outputs/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-debug-failures/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/pixie_datasets/email-classifier-golden.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/pixie_observations.db +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/project/tests/test_email_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/run-1/outputs/test_email_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/conftest.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/generate_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/instrumented_extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/pytest.ini +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/project/test_email_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/run-1/outputs/test_email_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-email-classifier/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/pixie_datasets/rag-chatbot-golden.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/pixie_observations.db +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/project/tests/test_rag_chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/run-1/outputs/test_rag_chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/chatbot_instrumented.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/save_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/project/test_chatbot_evals.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/outputs/chatbot_instrumented.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_evals.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-4/eval-rag-chatbot/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/benchmark.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/benchmark.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/project/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/run-1/outputs/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/project/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/pixie_datasets/qa-golden-set.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/qa_app.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/run-1/outputs/tests/test_qa.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-debug-failures/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/project/tests/test_email_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/run-1/outputs/tests/test_email_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/project/test_email_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/extractor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/run-1/outputs/test_email_classifier.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-email-classifier/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/eval_metadata.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/project/tests/test_rag_chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/build_dataset.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/run-1/outputs/tests/test_rag_chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/with_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/datasets/rag-chatbot-golden.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/project/test_chatbot_eval.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/grading.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/MEMORY.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/chatbot.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/datasets/rag-chatbot-golden.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/requirements.txt +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/run-1/outputs/test_chatbot_eval.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/iteration-5/eval-rag-chatbot/without_skill/timing.json +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-1.html +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-2.html +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-3.html +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-4.html +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/review-iteration-5.html +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev-workspace/trigger-eval-set.json +0 -0
- /pixie_qa-0.1.0/tests/__init__.py → /pixie_qa-0.1.1/.env +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.github/copilot-instructions.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/.gitignore +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/LICENSE +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/async-handler-processing.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/autoevals-adapters.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/cli-dataset-commands.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/dataset-management.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/eval-harness.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/expected-output-in-evals.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/instrumentation-module-implementation.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/manual-instrumentation-usability.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/observation-store-implementation.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/changelogs/usability-utils.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/cli/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/cli/dataset_command.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/cli/main.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/cli/test_command.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/config.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/dataset/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/dataset/models.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/dataset/store.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/criteria.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/eval_utils.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/scorers.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/trace_capture.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/evals/trace_helpers.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/context.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/handler.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/handlers.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/instrumentors.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/observation.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/processor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/queue.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/instrumentation/spans.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/piccolo_conf.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/piccolo_migrations/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/serialization.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/store.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/tables.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/pixie/storage/tree.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/agent-skill.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/autoevals-adapters.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/dataset-management.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/expected-output-in-evals.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/instrumentation.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/manual-instrumentation-usability.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/storage.md +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/specs/usability-utils.md +0 -0
- {pixie_qa-0.1.0/tests/pixie → pixie_qa-0.1.1/tests}/__init__.py +0 -0
- {pixie_qa-0.1.0/tests/pixie/cli → pixie_qa-0.1.1/tests/pixie}/__init__.py +0 -0
- {pixie_qa-0.1.0/tests/pixie/dataset → pixie_qa-0.1.1/tests/pixie/cli}/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/cli/test_dataset_command.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/cli/test_main.py +0 -0
- {pixie_qa-0.1.0/tests/pixie/evals → pixie_qa-0.1.1/tests/pixie/dataset}/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/dataset/test_models.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/dataset/test_store.py +0 -0
- {pixie_qa-0.1.0/tests/pixie/instrumentation → pixie_qa-0.1.1/tests/pixie/evals}/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_criteria.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_eval_utils.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_scorers.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_trace_capture.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/evals/test_trace_helpers.py +0 -0
- {pixie_qa-0.1.0/tests/pixie/observation_store → pixie_qa-0.1.1/tests/pixie/instrumentation}/__init__.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/conftest.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_context.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_handler.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_integration.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_observation.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_processor.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_queue.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_spans.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/instrumentation/test_storage_handler.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/conftest.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/test_serialization.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/test_store.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/observation_store/test_tree.py +0 -0
- {pixie_qa-0.1.0 → pixie_qa-0.1.1}/tests/pixie/test_config.py +0 -0
@@ -11,6 +11,34 @@ The loop is: understand the app → instrument it → write the test file → bu
 
 ---
 
+## Stage 0: Ensure pixie-qa is Installed and API Keys Are Set
+
+Before doing anything else, check that the `pixie-qa` package is available:
+
+```bash
+python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
+```
+
+If it's not installed, install it:
+
+```bash
+pip install pixie-qa
+```
+
+This provides the `pixie` Python module, the `pixie` CLI, and the `pixie-test` test runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.
+
+### Verify API keys
+
+The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. **Before running anything**, verify the key is set:
+
+```bash
+[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
+```
+
+If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors.
+
+---
+
 ## Stage 1: Understand the Application
 
 Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.
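The two Stage 0 checks added in this hunk (package importable, API key set) can also be expressed in Python. The following is a minimal sketch using only the standard library; the helper name `preflight`, its parameters, and its return shape are illustrative assumptions and are not part of pixie-qa.

```python
import importlib.util
import os


def preflight(module="pixie", required_keys=("OPENAI_API_KEY",), env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module) is None:
        problems.append(f"module {module!r} is not installed")
    for key in required_keys:
        # An empty string counts as unset, mirroring `[ -n "$VAR" ]`.
        if not env.get(key):
            problems.append(f"{key} is missing")
    return problems
```

Calling `preflight()` with no arguments checks the real environment; passing `env` as a plain dict keeps the check easy to exercise without touching `os.environ`.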
@@ -69,9 +97,9 @@ This is what actually persists traces to disk. Without it, `@observe` decorators
 `@observe` on a function captures all its kwargs as `eval_input` and its return value as `eval_output`:
 
 ```python
-
+from pixie import observe
 
-@
+@observe(name="answer_question")
 def answer_question(question: str, context: str) -> str:
     ...
 ```
@@ -79,7 +107,9 @@ def answer_question(question: str, context: str) -> str:
 For more control, use the context manager:
 
 ```python
-
+from pixie import start_observation
+
+with start_observation(input={"question": question, "context": context}, name="answer_question") as obs:
     result = run_pipeline(question, context)
     obs.set_output(result)
     obs.set_metadata("retrieved_chunks", len(chunks))
@@ -87,7 +117,14 @@ with px.start_observation(input={"question": question, "context": context}, name
 
 Wrap at the outermost boundary that represents one "test case" — for a RAG app that's probably `answer_question(question, context)`, not the internal LLM call. The dataset items will have the same shape as whatever this function receives and returns.
 
-After instrumentation, call `
+After instrumentation, call `flush()` at the end of runs to make sure all spans are written before you try to save them to a dataset:
+
+```python
+from pixie import flush
+flush()
+```
+
+**Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
 
 ---
 
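The advice in this hunk to call `flush()` at the end of runs can be made robust to a failing case by flushing in a `finally` block. This is a minimal sketch, not pixie-qa code: `run_cases` is a hypothetical helper, and `flush` is passed in as a parameter (in practice presumably the `flush` imported from `pixie`) so the sketch runs without the package installed. Whether pixie's `flush` is safe to call after an exception is an assumption.

```python
def run_cases(app, cases, flush):
    """Call app(**case) for every case, then flush once, even on failure."""
    results = []
    try:
        for case in cases:
            results.append(app(**case))
    finally:
        # Runs whether or not a case raised, so spans recorded for the
        # cases that did complete are still written out.
        flush()
    return results
```

A typical call might look like `run_cases(answer_question, [{"question": "...", "context": "..."}], flush)`.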
@@ -98,9 +135,7 @@ Write the test file before building the dataset. This might seem backwards, but
 Create `tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls your app function, plus an async test function that calls `assert_dataset_pass`:
 
 ```python
-from pixie import enable_storage
-from pixie.evals import assert_dataset_pass, FactualityEval, ScoreThreshold
-from pixie.evals import last_llm_call # or: from pixie.evals import root
+from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
 
 from myapp import answer_question
 
@@ -136,16 +171,56 @@ pixie-test -v # verbose: shows per-case scores and reasoning
 
 ## Stage 5: Build the Dataset
 
-Create the dataset first, then populate it by running the app
+Create the dataset first, then populate it by **actually running the app** with representative inputs. This is critical — dataset items should contain real app outputs and trace metadata, not fabricated data.
 
 ```bash
 pixie dataset create <dataset-name>
 pixie dataset list # verify it exists
 ```
 
-###
+### Run the app and capture traces to the dataset
+
+Write a simple script that calls the instrumented function for each input, flushes traces, then saves them to the dataset. This is the **recommended and default** approach:
+
+```python
+import asyncio
+from pixie import enable_storage, flush, DatasetStore, Evaluable
+
+from myapp import answer_question
+
+enable_storage()
+
+GOLDEN_CASES = [
+    ("What is the capital of France?", "Paris"),
+    ("What is the speed of light?", "299,792,458 meters per second"),
+]
+
+async def build_dataset():
+    store = DatasetStore()
+    try:
+        store.create("qa-golden-set")
+    except FileExistsError:
+        pass
+
+    for question, expected in GOLDEN_CASES:
+        # Actually run the app so traces are captured
+        result = answer_question(question=question)
+        flush() # ensure trace is written to DB
+
+        # Save the latest trace to the dataset with expected output
+        # Using the CLI is the easiest way:
+        # pixie dataset save qa-golden-set --expected-output
+        # Or save programmatically with the real output:
+        store.append("qa-golden-set", Evaluable(
+            eval_input={"question": question},
+            eval_output=result,
+            expected_output=expected,
+        ))
+
+asyncio.run(build_dataset())
+```
 
-
+Alternatively, use the CLI for per-case capture:
 
 ```bash
 # Run the app (enable_storage() must be active)
@@ -164,24 +239,11 @@ pixie dataset save <dataset-name> --notes "basic geography question"
 echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
 ```
 
-
-
-
-
-When
-
-```python
-from pixie.dataset.store import DatasetStore
-from pixie.storage.evaluable import Evaluable
-
-store = DatasetStore()
-store.create("<dataset-name>")
-store.append("<dataset-name>", Evaluable(
-    eval_input={"question": "What is the capital of France?", "context": "Paris is the capital..."},
-    eval_output="Paris is the capital of France.",
-    expected_output="Paris",
-))
-```
+**Key rules for dataset building:**
+- **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces.
+- **Include expected outputs** for comparison-based evaluators like `FactualityEval`.
+- **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
+- When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
 
 ---
 
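The key rules added in this hunk lend themselves to a quick sanity check before running evals. A minimal sketch, assuming each dataset item has been loaded as a plain dict keyed by the field names that appear in this diff (`eval_output`, `expected_output`, `eval_metadata`); the function `check_item` is hypothetical, does not call pixie-qa, and the on-disk layout of the dataset JSON is not confirmed by the diff.

```python
def check_item(item, require_trace=True):
    """Return rule violations for one dataset item given as a plain dict."""
    issues = []
    # Rule: eval_output should come from an actual run of the app.
    if item.get("eval_output") is None:
        issues.append("eval_output is empty: run the app to capture it")
    # Rule: comparison-based evaluators need an expected output.
    if item.get("expected_output") is None:
        issues.append("expected_output is missing")
    # Rule: items saved from traces should carry a trace_id for debugging.
    metadata = item.get("eval_metadata") or {}
    if require_trace and "trace_id" not in metadata:
        issues.append("eval_metadata has no trace_id")
    return issues
```

An empty list means the item satisfies all three checks; `require_trace=False` relaxes the last one for items that were appended programmatically rather than saved from a trace.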
@@ -206,18 +268,19 @@ pixie-test -v # start here — shows score and reasoning per case
 If you need to dig into a specific trace, look up the `trace_id` from the dataset:
 
 ```python
-from pixie
+from pixie import DatasetStore
+
 store = DatasetStore()
 ds = store.get("<dataset-name>")
 for i, item in enumerate(ds.items):
-    print(i, item.eval_metadata) # trace_id is here
+    print(i, item.eval_metadata) # trace_id is here — always included in eval_metadata
 ```
 
 Then inspect the full span tree:
 
 ```python
 import asyncio
-from pixie
+from pixie import ObservationStore
 
 async def inspect(trace_id: str):
     store = ObservationStore()
@@ -4,29 +4,28 @@
|
|
|
4
4
|
|
|
5
5
|
All settings read from environment variables at call time:
|
|
6
6
|
|
|
7
|
-
| Variable | Default | Description
|
|
8
|
-
|
|
9
|
-
| `PIXIE_DB_PATH` | `pixie_observations.db` | SQLite database file path
|
|
10
|
-
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite)
|
|
11
|
-
| `PIXIE_DATASET_DIR` | `pixie_datasets` | Directory for dataset JSON files
|
|
7
|
+
| Variable | Default | Description |
|
|
8
|
+
| ------------------- | ----------------------- | ---------------------------------- |
|
|
9
|
+
| `PIXIE_DB_PATH` | `pixie_observations.db` | SQLite database file path |
|
|
10
|
+
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
|
|
11
|
+
| `PIXIE_DATASET_DIR` | `pixie_datasets` | Directory for dataset JSON files |
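
The "read at call time" behaviour described for this table can be pictured with a minimal sketch. `PixieSettings` and `read_settings` are hypothetical names used only for illustration and are not pixie's configuration API; only the variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PixieSettings:
    """Hypothetical container for illustration; not part of pixie."""

    db_path: str
    db_engine: str
    dataset_dir: str


def read_settings() -> PixieSettings:
    # The environment is consulted on every call, so a value exported
    # after import time is still picked up by the next call.
    return PixieSettings(
        db_path=os.environ.get("PIXIE_DB_PATH", "pixie_observations.db"),
        db_engine=os.environ.get("PIXIE_DB_ENGINE", "sqlite"),
        dataset_dir=os.environ.get("PIXIE_DATASET_DIR", "pixie_datasets"),
    )
```

Under this reading, a test can point `PIXIE_DB_PATH` at a temporary file just before the code under test runs, without reloading any module.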
|
|
12
12
|
|
|
13
13
|
---
|
|
14
14
|
|
|
15
|
-
## Instrumentation API (`pixie
|
|
15
|
+
## Instrumentation API (`pixie`)
|
|
16
16
|
|
|
17
17
|
```python
|
|
18
|
-
from pixie import enable_storage
|
|
19
|
-
import pixie.instrumentation as px # full API
|
|
18
|
+
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
|
|
20
19
|
```
|
|
21
20
|
|
|
22
|
-
| Function / Decorator | Signature
|
|
23
|
-
|
|
24
|
-
| `enable_storage()`
|
|
25
|
-
| `
|
|
26
|
-
| `
|
|
27
|
-
| `
|
|
28
|
-
| `
|
|
29
|
-
| `
|
|
21
|
+
| Function / Decorator | Signature | Notes |
|
|
22
|
+
| -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- |
|
|
23
|
+
| `enable_storage()` | `() → StorageHandler` | Idempotent. Creates DB, registers handler. Call at app startup. |
|
|
24
|
+
| `init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
|
|
25
|
+
| `observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
|
|
26
|
+
| `start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
|
|
27
|
+
| `flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
|
|
28
|
+
| `add_handler` | `(handler) → None` | Register a custom handler (must call `init()` first). |
|
|
30
29
|
|
|
31
30
|
---
|
|
32
31
|
|
|
@@ -47,16 +46,17 @@ pixie-test [path] [-k filter_substring] [-v]
|
|
|
47
46
|
```
|
|
48
47
|
|
|
49
48
|
**`pixie dataset save` selection modes:**
|
|
49
|
+
|
|
50
50
|
- `root` (default) — the outermost `@observe` or `start_observation` span
|
|
51
51
|
- `last_llm_call` — the most recent LLM API call span in the trace
|
|
52
52
|
- `by_name` — a span matching the `--span-name` argument (takes the last matching span)
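
The three selection modes above could be sketched roughly as follows. The `Span` record, its `kind` labels and the `select_span` helper are assumptions made for this example; pixie's real span types and selection code are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Span:
    """Hypothetical, simplified span record; not pixie's span type."""

    name: str
    kind: str  # assumed labels: "observe" or "llm"
    parent: Optional[str] = None  # parent span name, None for the outermost span


def select_span(
    spans: Sequence[Span], mode: str = "root", span_name: Optional[str] = None
) -> Span:
    # `spans` is assumed to be ordered from earliest to latest.
    if mode == "root":
        candidates = [s for s in spans if s.parent is None]
    elif mode == "last_llm_call":
        candidates = [s for s in spans if s.kind == "llm"]
    elif mode == "by_name":
        candidates = [s for s in spans if s.name == span_name]
    else:
        raise ValueError(f"unknown selection mode: {mode!r}")
    if not candidates:
        raise LookupError(f"no span matches mode {mode!r}")
    # The last match wins, mirroring "takes the last matching span".
    return candidates[-1]
```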
|
|
53
53
|
|
|
54
54
|
---
|
|
55
55
|
|
|
56
|
-
## Eval Harness (`pixie
|
|
56
|
+
## Eval Harness (`pixie`)
|
|
57
57
|
|
|
58
58
|
```python
|
|
59
|
-
from pixie
|
|
59
|
+
from pixie import (
|
|
60
60
|
assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
|
|
61
61
|
EvalAssertionError, Evaluation, ScoreThreshold,
|
|
62
62
|
capture_traces, MemoryTraceHandler,
|
|
@@ -67,6 +67,7 @@ from pixie.evals import (
|
|
|
67
67
|
### Key functions
|
|
68
68
|
|
|
69
69
|
**`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**
|
|
70
|
+
|
|
70
71
|
- Loads dataset by name, runs `assert_pass` with all items.
|
|
71
72
|
- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
|
|
72
73
|
- `evaluators`: list of evaluator callables.
|
|
@@ -74,12 +75,15 @@ from pixie.evals import (
|
|
|
74
75
|
- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.
|
|
75
76
|
|
|
76
77
|
**`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**
|
|
78
|
+
|
|
77
79
|
- Same as `assert_dataset_pass`, but takes explicit inputs (and optionally `Evaluable` items for expected outputs) instead of a dataset name.
|
|
78
80
|
|
|
79
81
|
**`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**
|
|
82
|
+
|
|
80
83
|
- Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.
|
|
81
84
|
|
|
82
85
|
**`ScoreThreshold(threshold=0.5, pct=1.0)`**
|
|
86
|
+
|
|
83
87
|
- `threshold`: min score per item (default 0.5).
|
|
84
88
|
- `pct`: fraction of items that must meet threshold (default 1.0 = all).
|
|
85
89
|
- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
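
As a rough illustration of the rule stated above, the check can be written as a fraction test over per-item scores. `meets_threshold` is a hypothetical helper, not pixie's `ScoreThreshold`, and it ignores multiple evaluators and multiple passes; treating an empty score list as a failure is also an assumption.

```python
from typing import Sequence


def meets_threshold(
    scores: Sequence[float], threshold: float = 0.5, pct: float = 1.0
) -> bool:
    # Assumption: a run with no scores does not pass.
    if not scores:
        return False
    # Count the items whose score reaches the per-item minimum.
    passing = sum(1 for score in scores if score >= threshold)
    # The run passes when the passing fraction reaches `pct`.
    return passing / len(scores) >= pct
```

With `threshold=0.7, pct=0.8`, nine passing items out of ten satisfy the criterion, while three out of four do not.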
|
|
@@ -96,42 +100,41 @@ from pixie.evals import (
|
|
|
96
100
|
|
|
97
101
|
### Heuristic (no LLM needed)
|
|
98
102
|
|
|
99
|
-
| Evaluator
|
|
100
|
-
|
|
101
|
-
| `ExactMatchEval(expected=...)`
|
|
102
|
-
| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance)
|
|
103
|
-
| `NumericDiffEval(expected=...)`
|
|
104
|
-
| `JSONDiffEval(expected=...)`
|
|
105
|
-
| `ValidJSONEval(schema=None)`
|
|
106
|
-
| `ListContainsEval(expected=...)` | Output list contains expected items
|
|
103
|
+
| Evaluator | Use when |
|
|
104
|
+
| -------------------------------- | --------------------------------------------------- |
|
|
105
|
+
| `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
|
|
106
|
+
| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
|
|
107
|
+
| `NumericDiffEval(expected=...)` | Normalised numeric difference |
|
|
108
|
+
| `JSONDiffEval(expected=...)` | Structural JSON comparison |
|
|
109
|
+
| `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
|
|
110
|
+
| `ListContainsEval(expected=...)` | Output list contains expected items |
|
|
107
111
|
|
|
108
112
|
### LLM-as-judge (requires an OpenAI key or a compatible client)
|
|
109
113
|
|
|
110
|
-
| Evaluator
|
|
111
|
-
|
|
114
|
+
| Evaluator | Use when |
|
|
115
|
+
| ----------------------------------------------------- | ----------------------------------------- |
|
|
112
116
|
| `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference |
|
|
113
|
-
| `ClosedQAEval(expected=..., model=..., client=...)`
|
|
114
|
-
| `SummaryEval(expected=..., model=..., client=...)`
|
|
115
|
-
| `TranslationEval(expected=..., language=..., ...)`
|
|
116
|
-
| `PossibleEval(model=..., client=...)`
|
|
117
|
-
| `SecurityEval(model=..., client=...)`
|
|
118
|
-
| `ModerationEval(threshold=..., client=...)`
|
|
119
|
-
| `BattleEval(expected=..., model=..., client=...)`
|
|
117
|
+
| `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
|
|
118
|
+
| `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
|
|
119
|
+
| `TranslationEval(expected=..., language=..., ...)` | Translation quality |
|
|
120
|
+
| `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
|
|
121
|
+
| `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
|
|
122
|
+
| `ModerationEval(threshold=..., client=...)` | Content moderation |
|
|
123
|
+
| `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |
|
|
120
124
|
|
|
121
125
|
### RAG / retrieval
|
|
122
126
|
|
|
123
|
-
| Evaluator
|
|
124
|
-
|
|
125
|
-
| `ContextRelevancyEval(expected=..., client=...)`
|
|
126
|
-
| `FaithfulnessEval(client=...)`
|
|
127
|
-
| `AnswerRelevancyEval(client=...)`
|
|
128
|
-
| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference
|
|
127
|
+
| Evaluator | Use when |
|
|
128
|
+
| ------------------------------------------------- | ------------------------------------------ |
|
|
129
|
+
| `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to query |
|
|
130
|
+
| `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
|
|
131
|
+
| `AnswerRelevancyEval(client=...)` | Answer addresses the question |
|
|
132
|
+
| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference |
|
|
129
133
|
|
|
130
134
|
### Custom evaluator template
|
|
131
135
|
|
|
132
136
|
```python
|
|
133
|
-
from pixie
|
|
134
|
-
from pixie.storage.evaluable import Evaluable
|
|
137
|
+
from pixie import Evaluation, Evaluable
|
|
135
138
|
|
|
136
139
|
async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
|
|
137
140
|
# evaluable.eval_input — what was passed to the observed function
|
|
@@ -146,8 +149,7 @@ async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
|
|
|
146
149
|
## Dataset Python API
|
|
147
150
|
|
|
148
151
|
```python
|
|
149
|
-
from pixie
|
|
150
|
-
from pixie.storage.evaluable import Evaluable
|
|
152
|
+
from pixie import DatasetStore, Evaluable
|
|
151
153
|
|
|
152
154
|
store = DatasetStore() # reads PIXIE_DATASET_DIR
|
|
153
155
|
store.create("my-dataset") # create empty
|
|
@@ -160,9 +162,10 @@ store.delete("my-dataset") # delete entirely
|
|
|
160
162
|
```
|
|
161
163
|
|
|
162
164
|
**`Evaluable` fields:**
|
|
165
|
+
|
|
163
166
|
- `eval_input`: the input (what `@observe` captured as function kwargs)
|
|
164
167
|
- `eval_output`: the output (return value of the observed function)
|
|
165
|
-
- `eval_metadata`: dict of extra info (trace_id, provider, token counts, etc.)
|
|
168
|
+
- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
|
|
166
169
|
- `expected_output`: reference answer for comparison (`UNSET` if not provided)
|
|
167
170
|
|
|
168
171
|
---
|
|
@@ -170,7 +173,7 @@ store.delete("my-dataset") # delete entirely
|
|
|
170
173
|
## ObservationStore Python API
|
|
171
174
|
|
|
172
175
|
```python
|
|
173
|
-
from pixie
|
|
176
|
+
from pixie import ObservationStore
|
|
174
177
|
|
|
175
178
|
store = ObservationStore() # reads PIXIE_DB_PATH
|
|
176
179
|
await store.create_tables()
|
|
@@ -10,14 +10,14 @@ jobs:
|
|
|
10
10
|
release-and-publish:
|
|
11
11
|
runs-on: ubuntu-latest
|
|
12
12
|
permissions:
|
|
13
|
-
contents: write
|
|
14
|
-
id-token: write
|
|
13
|
+
contents: write # Required for creating tags and releases
|
|
14
|
+
id-token: write # Required for trusted publishing to PyPI
|
|
15
15
|
|
|
16
16
|
steps:
|
|
17
17
|
- name: Checkout repository
|
|
18
18
|
uses: actions/checkout@v4
|
|
19
19
|
with:
|
|
20
|
-
fetch-depth: 0
|
|
20
|
+
fetch-depth: 0 # Required for accurate git history
|
|
21
21
|
token: ${{ secrets.GITHUB_TOKEN }}
|
|
22
22
|
|
|
23
23
|
- name: Check for commits since last successful daily release
|
|
@@ -11,8 +11,8 @@ jobs:
|
|
|
11
11
|
publish-and-release:
|
|
12
12
|
runs-on: ubuntu-latest
|
|
13
13
|
permissions:
|
|
14
|
-
contents: write
|
|
15
|
-
id-token: write
|
|
14
|
+
contents: write # Required for creating tags and releases
|
|
15
|
+
id-token: write # Required for trusted publishing to PyPI
|
|
16
16
|
|
|
17
17
|
steps:
|
|
18
18
|
- name: Checkout
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: pixie-qa
|
|
3
|
-
Version: 0.1.
|
|
3
|
+
Version: 0.1.1
|
|
4
4
|
Summary: Automated quality assurance for AI applications
|
|
5
5
|
Project-URL: Homepage, https://github.com/yiouli/pixie-qa
|
|
6
6
|
Project-URL: Repository, https://github.com/yiouli/pixie-qa
|
|
@@ -119,23 +119,28 @@ Claude will read your code, instrument it, build a dataset from a few real runs,
|
|
|
119
119
|
|
|
120
120
|
Here is a quick summary of what Claude does end-to-end:
|
|
121
121
|
|
|
122
|
-
```
|
|
122
|
+
```python
|
|
123
123
|
# Claude instruments your app entry point
|
|
124
|
-
from pixie import enable_storage
|
|
124
|
+
from pixie import enable_storage, observe
|
|
125
|
+
|
|
125
126
|
enable_storage() # one line: creates DB, registers handler
|
|
126
127
|
|
|
127
128
|
# Claude adds @observe on the function to test
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
@px.observe(name="answer_question")
|
|
129
|
+
@observe(name="answer_question")
|
|
131
130
|
def answer_question(question: str) -> str:
|
|
132
131
|
...
|
|
132
|
+
```
|
|
133
133
|
|
|
134
|
+
```bash
|
|
134
135
|
# After running the app with a few real inputs:
|
|
135
136
|
pixie dataset create qa-golden-set
|
|
136
137
|
pixie dataset save qa-golden-set
|
|
138
|
+
```
|
|
137
139
|
|
|
140
|
+
```python
|
|
138
141
|
# Claude writes tests/test_qa.py with:
|
|
142
|
+
from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold
|
|
143
|
+
|
|
139
144
|
async def test_factuality():
|
|
140
145
|
await assert_dataset_pass(
|
|
141
146
|
runnable=runnable,
|
|
@@ -143,11 +148,15 @@ async def test_factuality():
|
|
|
143
148
|
evaluators=[FactualityEval()],
|
|
144
149
|
pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
|
|
145
150
|
)
|
|
151
|
+
```
|
|
146
152
|
|
|
153
|
+
```bash
|
|
147
154
|
# Then runs:
|
|
148
155
|
pixie-test -v
|
|
149
156
|
```
|
|
150
157
|
|
|
158
|
+
All symbols are importable from the top-level `pixie` package — no need for submodule paths.
|
|
159
|
+
|
|
151
160
|
## Repository Structure
|
|
152
161
|
|
|
153
162
|
```
|
|
@@ -54,23 +54,28 @@ Claude will read your code, instrument it, build a dataset from a few real runs,
|
|
|
54
54
|
|
|
55
55
|
Here is a quick summary of what Claude does end-to-end:
|
|
56
56
|
|
|
57
|
-
```
|
|
57
|
+
```python
|
|
58
58
|
# Claude instruments your app entry point
|
|
59
|
-
from pixie import enable_storage
|
|
59
|
+
from pixie import enable_storage, observe
|
|
60
|
+
|
|
60
61
|
enable_storage() # one line: creates DB, registers handler
|
|
61
62
|
|
|
62
63
|
# Claude adds @observe on the function to test
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
@px.observe(name="answer_question")
|
|
64
|
+
@observe(name="answer_question")
|
|
66
65
|
def answer_question(question: str) -> str:
|
|
67
66
|
...
|
|
67
|
+
```
|
|
68
68
|
|
|
69
|
+
```bash
|
|
69
70
|
# After running the app with a few real inputs:
|
|
70
71
|
pixie dataset create qa-golden-set
|
|
71
72
|
pixie dataset save qa-golden-set
|
|
73
|
+
```
|
|
72
74
|
|
|
75
|
+
```python
|
|
73
76
|
# Claude writes tests/test_qa.py with:
|
|
77
|
+
from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold
|
|
78
|
+
|
|
74
79
|
async def test_factuality():
|
|
75
80
|
await assert_dataset_pass(
|
|
76
81
|
runnable=runnable,
|
|
@@ -78,11 +83,15 @@ async def test_factuality():
|
|
|
78
83
|
evaluators=[FactualityEval()],
|
|
79
84
|
pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
|
|
80
85
|
)
|
|
86
|
+
```
|
|
81
87
|
|
|
88
|
+
```bash
|
|
82
89
|
# Then runs:
|
|
83
90
|
pixie-test -v
|
|
84
91
|
```
|
|
85
92
|
|
|
93
|
+
All symbols are importable from the top-level `pixie` package — no need for submodule paths.
|
|
94
|
+
|
|
86
95
|
## Repository Structure
|
|
87
96
|
|
|
88
97
|
```
|
|
@@ -0,0 +1,58 @@
|
|
|
1
|
+
# Loud Failure Mode
|
|
2
|
+
|
|
3
|
+
## What Changed
|
|
4
|
+
|
|
5
|
+
Eliminated all silent failure paths in the eval harness. Runtime errors (missing
|
|
6
|
+
API keys, import failures, evaluator crashes) now propagate as exceptions instead
|
|
7
|
+
of being silently swallowed.
|
|
8
|
+
|
|
9
|
+
### 1. `evaluate()` — evaluator exceptions propagate
|
|
10
|
+
|
|
11
|
+
**Before:** Any exception from an evaluator (e.g. missing API key, network error)
|
|
12
|
+
was caught and returned as `Evaluation(score=0.0, reasoning=str(exc))`. This made
|
|
13
|
+
real errors indistinguishable from legitimate low scores.
|
|
14
|
+
|
|
15
|
+
**After:** Evaluator exceptions propagate unchanged to the caller. If an evaluator
|
|
16
|
+
cannot run, the test fails loudly with the original error and traceback.
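
The difference between the two behaviours can be sketched with stand-in types. The `Evaluation` dataclass and both `evaluate_*` helpers below are illustrative assumptions, not the code in `pixie/evals/evaluation.py`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evaluation:
    """Stand-in for illustration; not pixie's Evaluation class."""

    score: float
    reasoning: str


def failing_evaluator(output: str) -> Evaluation:
    # Simulates an evaluator that cannot run at all.
    raise RuntimeError("OPENAI_API_KEY is not set")


def evaluate_swallowing(
    evaluator: Callable[[str], Evaluation], output: str
) -> Evaluation:
    # Old-style behaviour: the error is folded into a zero score,
    # which looks the same as a legitimately bad answer.
    try:
        return evaluator(output)
    except Exception as exc:
        return Evaluation(score=0.0, reasoning=str(exc))


def evaluate_loud(
    evaluator: Callable[[str], Evaluation], output: str
) -> Evaluation:
    # New-style behaviour: nothing is caught, so the caller sees the
    # original exception and traceback.
    return evaluator(output)
```

Callers that still want a score for a crashed evaluator would, under this sketch, wrap the call in their own `try`/`except` rather than rely on the harness.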
|
|
17
|
+
|
|
18
|
+
### 2. `_load_module()` / `discover_tests()` — import errors propagate
|
|
19
|
+
|
|
20
|
+
**Before:** `_load_module()` caught all exceptions and returned `None`, causing
|
|
21
|
+
`discover_tests()` to silently skip broken test files. The result was
|
|
22
|
+
"no tests collected" with no explanation.
|
|
23
|
+
|
|
24
|
+
**After:** Import errors (missing packages, syntax errors, bad imports) propagate
|
|
25
|
+
immediately with the original traceback, making the root cause obvious.
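
A loader that lets import errors surface might look like the sketch below. `load_module` here is a hypothetical stand-in built on the standard library's `importlib.util`; it is not the `_load_module()` in `pixie/evals/runner.py`.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_module(path: Path) -> ModuleType:
    # No try/except around the import: SyntaxError, ImportError and
    # similar errors reach the caller with their original traceback.
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        broken = Path(tmp) / "test_broken.py"
        broken.write_text("import a_module_that_does_not_exist_xyz\n")
        try:
            load_module(broken)
        except ModuleNotFoundError as exc:
            print(f"import failed loudly: {exc}")
```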
|
|
26
|
+
|
|
27
|
+
### 3. `format_results()` — error messages always visible
|
|
28
|
+
|
|
29
|
+
**Before:** Failure and error messages were shown only with the `--verbose` flag.
|
|
30
|
+
Without it, tests showed only `✗` with no message.
|
|
31
|
+
|
|
32
|
+
**After:** The first line of the error message is always shown. `--verbose`
|
|
33
|
+
controls whether the full traceback is displayed.
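
One way to picture this rule is a formatter that always prints the first line and adds the remainder only on request. `format_failure` is a hypothetical helper written for this example; the actual `format_results()` output layout is not shown in this diff.

```python
def format_failure(name: str, message: str, verbose: bool = False) -> str:
    lines = message.strip().splitlines() or ["(no message)"]
    # The first line of the message is shown regardless of verbosity.
    header = f"✗ {name}: {lines[0]}"
    if not verbose:
        return header
    # Verbose mode appends the remaining lines, indented under the header.
    details = [f"    {line}" for line in lines[1:]]
    return "\n".join([header, *details])
```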
|
|
34
|
+
|
|
35
|
+
### 4. Removed dead `evals/` resource folder
|
|
36
|
+
|
|
37
|
+
Deleted `.claude/skills/eval-driven-dev/evals/` (contained `evals.json` and
|
|
38
|
+
`sample-projects/` with no references from the skill instructions).
|
|
39
|
+
|
|
40
|
+
## Files Affected
|
|
41
|
+
|
|
42
|
+
- `pixie/evals/evaluation.py` — removed exception swallowing in `evaluate()`
|
|
43
|
+
- `pixie/evals/runner.py` — `_load_module()` raises on error; `discover_tests()`
|
|
44
|
+
propagates; `format_results()` always shows messages
|
|
45
|
+
- `tests/pixie/evals/test_evaluation.py` — updated test: expects propagation
|
|
46
|
+
instead of `score=0.0`; added sync evaluator error test
|
|
47
|
+
- `tests/pixie/evals/test_runner.py` — added import error, syntax error,
|
|
48
|
+
and `format_results` tests
|
|
49
|
+
- `specs/evals-harness.md` — updated error handling behavior and test expectations
|
|
50
|
+
- `.claude/skills/eval-driven-dev/evals/` — deleted
|
|
51
|
+
|
|
52
|
+
## Migration Notes
|
|
53
|
+
|
|
54
|
+
- `evaluate()` no longer catches evaluator exceptions. Code that relied on
|
|
55
|
+
getting `Evaluation(score=0.0, details={"error": ...})` from crashed evaluators
|
|
56
|
+
must now handle exceptions directly.
|
|
57
|
+
- `discover_tests()` now raises on import errors instead of silently skipping
|
|
58
|
+
broken test files.
|
|
@@ -0,0 +1,58 @@
|
|
|
1
|
+
# Root Package Re-exports and Evaluable trace_id
|
|
2
|
+
|
|
3
|
+
## What Changed
|
|
4
|
+
|
|
5
|
+
### 1. Full public API re-exported from `pixie` root package
|
|
6
|
+
|
|
7
|
+
Previously, `pixie/__init__.py` only exported `enable_storage` and `StorageHandler`. Users (and the eval-driven-dev skill) had to use submodule imports like `import pixie.instrumentation as px`, `from pixie.evals import ...`, `from pixie.dataset.store import DatasetStore`, and `from pixie.storage.evaluable import Evaluable`.
|
|
8
|
+
|
|
9
|
+
Now **every public symbol** is importable from the top-level `pixie` package:
|
|
10
|
+
|
|
11
|
+
```python
|
|
12
|
+
from pixie import observe, flush, start_observation, init, add_handler
|
|
13
|
+
from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call, root
|
|
14
|
+
from pixie import DatasetStore, Evaluable, ObservationStore, UNSET
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
This eliminates Pylance resolution errors for downstream users and simplifies the import story.
|
|
18
|
+
|
|
19
|
+
### 2. `as_evaluable()` now includes `trace_id` and `span_id` in metadata
|
|
20
|
+
|
|
21
|
+
Both `_observe_span_to_evaluable()` and `_llm_span_to_evaluable()` now inject the span's `trace_id` and `span_id` into `eval_metadata`. This means:
|
|
22
|
+
|
|
23
|
+
- `pixie dataset save` automatically includes trace provenance in the dataset
|
|
24
|
+
- Users can always look up the original trace for any dataset item
|
|
25
|
+
- The skill's investigation flow ("look up trace_id from metadata") actually works
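
The metadata injection described above amounts to a small merge step. The helper below is a sketch with a hypothetical name, `with_trace_ids`; letting the span's ids overwrite user-supplied keys of the same name is an assumption, since the diff does not show how conflicts are resolved.

```python
from typing import Any, Dict, Mapping, Optional


def with_trace_ids(
    user_metadata: Optional[Mapping[str, Any]], trace_id: str, span_id: str
) -> Dict[str, Any]:
    # Start from a copy so the caller's mapping is never mutated.
    metadata: Dict[str, Any] = dict(user_metadata or {})
    # Assumption: the span's own ids take precedence over user keys.
    metadata["trace_id"] = trace_id
    metadata["span_id"] = span_id
    return metadata
```

Even a span with no user metadata yields a non-empty dict this way, which is what makes the later trace lookup possible.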
|
|
26
|
+
|
|
27
|
+
### 3. Skill instructions updated
|
|
28
|
+
|
|
29
|
+
- **Stage 0**: Now verifies `OPENAI_API_KEY` (or equivalent) before running anything
|
|
30
|
+
- **Stage 3**: All code examples use `from pixie import ...` (no submodule imports)
|
|
31
|
+
- **Stage 4**: Test file example uses `from pixie import ...`
|
|
32
|
+
- **Stage 5**: Dataset building now emphasizes actually running the app to capture real outputs and traces; removed the misleading "Option B" that built datasets with fabricated/null outputs
|
|
33
|
+
- **Stage 7**: Investigation examples use `from pixie import DatasetStore, ObservationStore`
|
|
34
|
+
- **API reference**: All imports updated to top-level
|
|
35
|
+
|
|
36
|
+
## Files Affected
|
|
37
|
+
|
|
38
|
+
### Package
|
|
39
|
+
|
|
40
|
+
- `pixie/__init__.py` — re-exports all public API symbols
|
|
41
|
+
- `pixie/storage/evaluable.py` — `as_evaluable()` includes trace_id/span_id
|
|
42
|
+
|
|
43
|
+
### Tests
|
|
44
|
+
|
|
45
|
+
- `tests/pixie/test_init.py` — **new** — 27 tests verifying root package exports
|
|
46
|
+
- `tests/pixie/observation_store/test_evaluable.py` — added trace_id/span_id assertions
|
|
47
|
+
|
|
48
|
+
### Docs
|
|
49
|
+
|
|
50
|
+
- `README.md` — code examples updated to top-level imports
|
|
51
|
+
- `docs/package.md` — all import examples updated
|
|
52
|
+
- `.claude/skills/eval-driven-dev/SKILL.md` — full skill instruction rewrite
|
|
53
|
+
- `.claude/skills/eval-driven-dev/references/pixie-api.md` — API reference import paths
|
|
54
|
+
|
|
55
|
+
## Migration Notes
|
|
56
|
+
|
|
57
|
+
- **No breaking changes.** Submodule imports (`from pixie.evals import ...`, `import pixie.instrumentation as px`) continue to work. The top-level re-exports are purely additive.
|
|
58
|
+
- `eval_metadata` from `as_evaluable()` now always contains `trace_id` and `span_id` keys. Code that checks `eval_metadata is None` for ObserveSpans with no user metadata should instead check for specific keys.
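
For code that used `eval_metadata is None` to detect "no user metadata", a key-based check along these lines is one possible replacement. `PROVENANCE_KEYS` and `user_metadata` are hypothetical names for this sketch, not pixie exports.

```python
from typing import Any, Dict, Mapping, Optional

# Keys that, per the note above, are now always present.
PROVENANCE_KEYS = frozenset({"trace_id", "span_id"})


def user_metadata(eval_metadata: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    # Drop the provenance keys; whatever is left was supplied by the user.
    return {
        key: value
        for key, value in (eval_metadata or {}).items()
        if key not in PROVENANCE_KEYS
    }
```

An empty result then plays the role the `None` check used to play.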
|