PyPI - pixie-qa - Versions diffs - 0.1.0__tar.gz → 0.1.1__tar.gz - Mend

pixie-qa 0.1.0.tar.gz → 0.1.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (427)

{pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev/SKILL.md RENAMED

@@ -11,6 +11,34 @@ The loop is: understand the app → instrument it → write the test file → bu
 ---
+## Stage 0: Ensure pixie-qa is Installed and API Keys Are Set
+Before doing anything else, check that the `pixie-qa` package is available:
+```bash
+python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
+```
+If it's not installed, install it:
+```bash
+pip install pixie-qa
+```
+This provides the `pixie` Python module, the `pixie` CLI, and the `pixie-test` test runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.
+### Verify API keys
+The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. **Before running anything**, verify the key is set:
+```bash
+[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
+```
+If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors.
+---
 ## Stage 1: Understand the Application
 Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.
@@ -69,9 +97,9 @@ This is what actually persists traces to disk. Without it, `@observe` decorators
 `@observe` on a function captures all its kwargs as `eval_input` and its return value as `eval_output`:
 ```python
-import pixie.instrumentation as px
+from pixie import observe
-@px.observe(name="answer_question")
+@observe(name="answer_question")
 def answer_question(question: str, context: str) -> str:
     ...
 ```
@@ -79,7 +107,9 @@ def answer_question(question: str, context: str) -> str:
 For more control, use the context manager:
 ```python
-with px.start_observation(input={"question": question, "context": context}, name="answer_question") as obs:
+from pixie import start_observation
+with start_observation(input={"question": question, "context": context}, name="answer_question") as obs:
     result = run_pipeline(question, context)
     obs.set_output(result)
     obs.set_metadata("retrieved_chunks", len(chunks))
@@ -87,7 +117,14 @@ with px.start_observation(input={"question": question, "context": context}, name
 Wrap at the outermost boundary that represents one "test case" — for a RAG app that's probably `answer_question(question, context)`, not the internal LLM call. The dataset items will have the same shape as whatever this function receives and returns.
-After instrumentation, call `px.flush()` at the end of runs to make sure all spans are written before you try to save them to a dataset.
+After instrumentation, call `flush()` at the end of runs to make sure all spans are written before you try to save them to a dataset:
+```python
+from pixie import flush
+flush()
+```
+**Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
 ---
@@ -98,9 +135,7 @@ Write the test file before building the dataset. This might seem backwards, but
 Create `tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls your app function, plus an async test function that calls `assert_dataset_pass`:
 ```python
-from pixie import enable_storage
-from pixie.evals import assert_dataset_pass, FactualityEval, ScoreThreshold
-from pixie.evals import last_llm_call  # or: from pixie.evals import root
+from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
 from myapp import answer_question
@@ -136,16 +171,56 @@ pixie-test -v                  # verbose: shows per-case scores and reasoning
 ## Stage 5: Build the Dataset
-Create the dataset first, then populate it by running the app:
+Create the dataset first, then populate it by **actually running the app** with representative inputs. This is critical — dataset items should contain real app outputs and trace metadata, not fabricated data.
 ```bash
 pixie dataset create <dataset-name>
 pixie dataset list   # verify it exists
 ```
-### Option A: Capture from real runs (the natural starting point)
+### Run the app and capture traces to the dataset
+Write a simple script that calls the instrumented function for each input, flushes traces, then saves them to the dataset. This is the **recommended and default** approach:
+```python
+import asyncio
+from pixie import enable_storage, flush, DatasetStore, Evaluable
+from myapp import answer_question
+enable_storage()
+GOLDEN_CASES = [
+    ("What is the capital of France?", "Paris"),
+    ("What is the speed of light?", "299,792,458 meters per second"),
+]
+async def build_dataset():
+    store = DatasetStore()
+    try:
+        store.create("qa-golden-set")
+    except FileExistsError:
+        pass
+    for question, expected in GOLDEN_CASES:
+        # Actually run the app so traces are captured
+        result = answer_question(question=question)
+        flush()  # ensure trace is written to DB
+        # Save the latest trace to the dataset with expected output
+        # Using the CLI is the easiest way:
+        #   pixie dataset save qa-golden-set --expected-output
+        # Or save programmatically with the real output:
+        store.append("qa-golden-set", Evaluable(
+            eval_input={"question": question},
+            eval_output=result,
+            expected_output=expected,
+        ))
+asyncio.run(build_dataset())
+```
-Run the app with representative inputs, then save each trace to the dataset:
+Alternatively, use the CLI for per-case capture:
 ```bash
 # Run the app (enable_storage() must be active)
@@ -164,24 +239,11 @@ pixie dataset save <dataset-name> --notes "basic geography question"
 echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
 ```
-Try to cover the range of inputs you actually care about: normal cases, edge cases, things the app might plausibly get wrong (empty input, ambiguous queries, no-answer cases).
-### Option B: Build programmatically
-When you want to bulk-load items or add expected outputs directly:
-```python
-from pixie.dataset.store import DatasetStore
-from pixie.storage.evaluable import Evaluable
-store = DatasetStore()
-store.create("<dataset-name>")
-store.append("<dataset-name>", Evaluable(
-    eval_input={"question": "What is the capital of France?", "context": "Paris is the capital..."},
-    eval_output="Paris is the capital of France.",
-    expected_output="Paris",
-))
-```
+**Key rules for dataset building:**
+- **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces.
+- **Include expected outputs** for comparison-based evaluators like `FactualityEval`.
+- **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
+- When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
 ---
@@ -206,18 +268,19 @@ pixie-test -v    # start here — shows score and reasoning per case
 If you need to dig into a specific trace, look up the `trace_id` from the dataset:
 ```python
-from pixie.dataset.store import DatasetStore
+from pixie import DatasetStore
 store = DatasetStore()
 ds = store.get("<dataset-name>")
 for i, item in enumerate(ds.items):
-    print(i, item.eval_metadata)   # trace_id is here if saved via pixie dataset save
+    print(i, item.eval_metadata)   # trace_id is here — always included in eval_metadata
 ```
 Then inspect the full span tree:
 ```python
 import asyncio
-from pixie.storage.store import ObservationStore
+from pixie import ObservationStore
 async def inspect(trace_id: str):
     store = ObservationStore()

{pixie_qa-0.1.0 → pixie_qa-0.1.1}/.claude/skills/eval-driven-dev/references/pixie-api.md RENAMED

@@ -4,29 +4,28 @@
 All settings read from environment variables at call time:
-| Variable            | Default                 | Description                         |
-|---------------------|-------------------------|-------------------------------------|
-| `PIXIE_DB_PATH`     | `pixie_observations.db` | SQLite database file path           |
-| `PIXIE_DB_ENGINE`   | `sqlite`                | Database engine (currently sqlite)  |
-| `PIXIE_DATASET_DIR` | `pixie_datasets`        | Directory for dataset JSON files    |
+| Variable            | Default                 | Description                        |
+| ------------------- | ----------------------- | ---------------------------------- |
+| `PIXIE_DB_PATH`     | `pixie_observations.db` | SQLite database file path          |
+| `PIXIE_DB_ENGINE`   | `sqlite`                | Database engine (currently sqlite) |
+| `PIXIE_DATASET_DIR` | `pixie_datasets`        | Directory for dataset JSON files   |
 ---
-## Instrumentation API (`pixie.instrumentation` / `pixie`)
+## Instrumentation API (`pixie`)
 ```python
-from pixie import enable_storage          # one-liner setup
-import pixie.instrumentation as px        # full API
+from pixie import enable_storage, observe, start_observation, flush, init, add_handler
 ```
-| Function / Decorator | Signature | Notes |
-|---|---|---|
-| `enable_storage()` | `() → StorageHandler` | Idempotent. Creates DB, registers handler. Call at app startup. |
-| `px.init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
-| `px.observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
-| `px.start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
-| `px.flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
-| `px.add_handler` | `(handler) → None` | Register a custom handler (must call `px.init()` first). |
+| Function / Decorator | Signature                                                    | Notes                                                                                               |
+| -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- |
+| `enable_storage()`   | `() → StorageHandler`                                        | Idempotent. Creates DB, registers handler. Call at app startup.                                     |
+| `init()`             | `(*, capture_content=True, queue_size=1000) → None`          | Called internally by `enable_storage`. Idempotent.                                                  |
+| `observe`            | `(name=None) → decorator`                                    | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
+| `start_observation`  | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside.                    |
+| `flush`              | `(timeout_seconds=5.0) → bool`                               | Drains the queue. Call after a run before using CLI commands.                                       |
+| `add_handler`        | `(handler) → None`                                           | Register a custom handler (must call `init()` first).                                               |
 ---
@@ -47,16 +46,17 @@ pixie-test [path] [-k filter_substring] [-v]
 ```
 **`pixie dataset save` selection modes:**
 - `root` (default) — the outermost `@observe` or `start_observation` span
 - `last_llm_call` — the most recent LLM API call span in the trace
 - `by_name` — a span matching the `--span-name` argument (takes the last matching span)
 ---
-## Eval Harness (`pixie.evals`)
+## Eval Harness (`pixie`)
 ```python
-from pixie.evals import (
+from pixie import (
     assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
     EvalAssertionError, Evaluation, ScoreThreshold,
     capture_traces, MemoryTraceHandler,
@@ -67,6 +67,7 @@ from pixie.evals import (
 ### Key functions
 **`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**
 - Loads dataset by name, runs `assert_pass` with all items.
 - `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
 - `evaluators`: list of evaluator callables.
@@ -74,12 +75,15 @@ from pixie.evals import (
 - `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.
 **`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**
 - Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs).
 **`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**
 - Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.
 **`ScoreThreshold(threshold=0.5, pct=1.0)`**
 - `threshold`: min score per item (default 0.5).
 - `pct`: fraction of items that must meet threshold (default 1.0 = all).
 - Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
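Reviewer sketch: the pass rule described for `ScoreThreshold` can be illustrated as below. The helper `meets_score_threshold` is hypothetical and is not pixie's implementation; it only restates the documented rule (at least `pct` of the items must score at or above `threshold`). Treating an empty score list as a failure is an assumption.

```python
from typing import Sequence


def meets_score_threshold(
    scores: Sequence[float], threshold: float = 0.5, pct: float = 1.0
) -> bool:
    """Return True when at least `pct` of `scores` are >= `threshold`.

    Hypothetical stand-in for the documented rule, not pixie's own code.
    """
    if not scores:
        return False  # assumption: a run with no scores does not pass
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct


scores = [0.9, 0.8, 0.75, 0.7, 0.4]
# 4 of 5 items reach 0.7, so an 80% requirement is met
print(meets_score_threshold(scores, threshold=0.7, pct=0.8))
# With pct=1.0 every item must reach 0.7, and the 0.4 item does not
print(meets_score_threshold(scores, threshold=0.7, pct=1.0))
```

With the defaults (`threshold=0.5`, `pct=1.0`) every item has to score at least 0.5, which matches the defaults listed in the reference.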
@@ -96,42 +100,41 @@ from pixie.evals import (
 ### Heuristic (no LLM needed)
-| Evaluator | Use when |
-|---|---|
-| `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
-| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
-| `NumericDiffEval(expected=...)` | Normalised numeric difference |
-| `JSONDiffEval(expected=...)` | Structural JSON comparison |
-| `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
-| `ListContainsEval(expected=...)` | Output list contains expected items |
+| Evaluator                        | Use when                                            |
+| -------------------------------- | --------------------------------------------------- |
+| `ExactMatchEval(expected=...)`   | Output must exactly equal the expected string       |
+| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance)           |
+| `NumericDiffEval(expected=...)`  | Normalised numeric difference                       |
+| `JSONDiffEval(expected=...)`     | Structural JSON comparison                          |
+| `ValidJSONEval(schema=None)`     | Output is valid JSON (optionally matching a schema) |
+| `ListContainsEval(expected=...)` | Output list contains expected items                 |
 ### LLM-as-judge (require OpenAI key or compatible client)
-| Evaluator | Use when |
-|---|---|
+| Evaluator                                             | Use when                                  |
+| ----------------------------------------------------- | ----------------------------------------- |
 | `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference |
-| `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
-| `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
-| `TranslationEval(expected=..., language=..., ...)` | Translation quality |
-| `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
-| `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
-| `ModerationEval(threshold=..., client=...)` | Content moderation |
-| `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |
+| `ClosedQAEval(expected=..., model=..., client=...)`   | Closed-book QA comparison                 |
+| `SummaryEval(expected=..., model=..., client=...)`    | Summarisation quality                     |
+| `TranslationEval(expected=..., language=..., ...)`    | Translation quality                       |
+| `PossibleEval(model=..., client=...)`                 | Output is feasible / plausible            |
+| `SecurityEval(model=..., client=...)`                 | No security vulnerabilities in output     |
+| `ModerationEval(threshold=..., client=...)`           | Content moderation                        |
+| `BattleEval(expected=..., model=..., client=...)`     | Head-to-head comparison                   |
 ### RAG / retrieval
-| Evaluator | Use when |
-|---|---|
-| `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to query |
-| `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
-| `AnswerRelevancyEval(client=...)` | Answer addresses the question |
-| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference |
+| Evaluator                                         | Use when                                   |
+| ------------------------------------------------- | ------------------------------------------ |
+| `ContextRelevancyEval(expected=..., client=...)`  | Retrieved context is relevant to query     |
+| `FaithfulnessEval(client=...)`                    | Answer is faithful to the provided context |
+| `AnswerRelevancyEval(client=...)`                 | Answer addresses the question              |
+| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference             |
 ### Custom evaluator template
 ```python
-from pixie.evals import Evaluation
-from pixie.storage.evaluable import Evaluable
+from pixie import Evaluation, Evaluable
 async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
     # evaluable.eval_input  — what was passed to the observed function
@@ -146,8 +149,7 @@ async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
 ## Dataset Python API
 ```python
-from pixie.dataset.store import DatasetStore
-from pixie.storage.evaluable import Evaluable
+from pixie import DatasetStore, Evaluable
 store = DatasetStore()                               # reads PIXIE_DATASET_DIR
 store.create("my-dataset")                          # create empty
@@ -160,9 +162,10 @@ store.delete("my-dataset")                          # delete entirely
 ```
 **`Evaluable` fields:**
 - `eval_input`: the input (what `@observe` captured as function kwargs)
 - `eval_output`: the output (return value of the observed function)
-- `eval_metadata`: dict of extra info (trace_id, provider, token counts, etc.)
+- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
 - `expected_output`: reference answer for comparison (`UNSET` if not provided)
 ---
@@ -170,7 +173,7 @@ store.delete("my-dataset")                          # delete entirely
 ## ObservationStore Python API
 ```python
-from pixie.storage.store import ObservationStore
+from pixie import ObservationStore
 store = ObservationStore()   # reads PIXIE_DB_PATH
 await store.create_tables()

{pixie_qa-0.1.0 → pixie_qa-0.1.1}/.github/workflows/daily-release.yml RENAMED

@@ -10,14 +10,14 @@ jobs:
   release-and-publish:
     runs-on: ubuntu-latest
     permissions:
-      contents: write  # Required for creating tags and releases
-      id-token: write  # Required for trusted publishing to PyPI
+      contents: write # Required for creating tags and releases
+      id-token: write # Required for trusted publishing to PyPI
     steps:
       - name: Checkout repository
         uses: actions/checkout@v4
         with:
-          fetch-depth: 0  # Required for accurate git history
+          fetch-depth: 0 # Required for accurate git history
           token: ${{ secrets.GITHUB_TOKEN }}
       - name: Check for commits since last successful daily release

{pixie_qa-0.1.0 → pixie_qa-0.1.1}/.github/workflows/publish.yml RENAMED

@@ -11,8 +11,8 @@ jobs:
   publish-and-release:
     runs-on: ubuntu-latest
     permissions:
-      contents: write  # Required for creating tags and releases
-      id-token: write  # Required for trusted publishing to PyPI
+      contents: write # Required for creating tags and releases
+      id-token: write # Required for trusted publishing to PyPI
     steps:
       - name: Checkout

{pixie_qa-0.1.0 → pixie_qa-0.1.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pixie-qa
-Version: 0.1.0
+Version: 0.1.1
 Summary: Automated quality assurance for AI applications
 Project-URL: Homepage, https://github.com/yiouli/pixie-qa
 Project-URL: Repository, https://github.com/yiouli/pixie-qa
@@ -119,23 +119,28 @@ Claude will read your code, instrument it, build a dataset from a few real runs,
 Here is a quick summary of what Claude does end-to-end:
-```
+```python
 # Claude instruments your app entry point
-from pixie import enable_storage
+from pixie import enable_storage, observe
 enable_storage()              # one line: creates DB, registers handler
 # Claude adds @observe on the function to test
-import pixie.instrumentation as px
-@px.observe(name="answer_question")
+@observe(name="answer_question")
 def answer_question(question: str) -> str:
     ...
+```
+```bash
 # After running the app with a few real inputs:
 pixie dataset create qa-golden-set
 pixie dataset save qa-golden-set
+```
+```python
 # Claude writes tests/test_qa.py with:
+from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold
 async def test_factuality():
     await assert_dataset_pass(
         runnable=runnable,
@@ -143,11 +148,15 @@ async def test_factuality():
         evaluators=[FactualityEval()],
         pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
     )
+```
+```bash
 # Then runs:
 pixie-test -v
 ```
+All symbols are importable from the top-level `pixie` package — no need for submodule paths.
 ## Repository Structure
 ```

{pixie_qa-0.1.0 → pixie_qa-0.1.1}/README.md RENAMED

@@ -54,23 +54,28 @@ Claude will read your code, instrument it, build a dataset from a few real runs,
 Here is a quick summary of what Claude does end-to-end:
-```
+```python
 # Claude instruments your app entry point
-from pixie import enable_storage
+from pixie import enable_storage, observe
 enable_storage()              # one line: creates DB, registers handler
 # Claude adds @observe on the function to test
-import pixie.instrumentation as px
-@px.observe(name="answer_question")
+@observe(name="answer_question")
 def answer_question(question: str) -> str:
     ...
+```
+```bash
 # After running the app with a few real inputs:
 pixie dataset create qa-golden-set
 pixie dataset save qa-golden-set
+```
+```python
 # Claude writes tests/test_qa.py with:
+from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold
 async def test_factuality():
     await assert_dataset_pass(
         runnable=runnable,
@@ -78,11 +83,15 @@ async def test_factuality():
         evaluators=[FactualityEval()],
         pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
     )
+```
+```bash
 # Then runs:
 pixie-test -v
 ```
+All symbols are importable from the top-level `pixie` package — no need for submodule paths.
 ## Repository Structure
 ```

pixie_qa-0.1.1/changelogs/loud-failure-mode.md ADDED

@@ -0,0 +1,58 @@
+# Loud Failure Mode
+## What Changed
+Eliminated all silent failure paths in the eval harness. Runtime errors (missing
+API keys, import failures, evaluator crashes) now propagate as exceptions instead
+of being silently swallowed.
+### 1. `evaluate()` — evaluator exceptions propagate
+**Before:** Any exception from an evaluator (e.g. missing API key, network error)
+was caught and returned as `Evaluation(score=0.0, reasoning=str(exc))`. This made
+real errors indistinguishable from legitimate low scores.
+**After:** Evaluator exceptions propagate unchanged to the caller. If an evaluator
+cannot run, the test fails loudly with the original error and traceback.
+### 2. `_load_module()` / `discover_tests()` — import errors propagate
+**Before:** `_load_module()` caught all exceptions and returned `None`, causing
+`discover_tests()` to silently skip broken test files. The result was
+"no tests collected" with no explanation.
+**After:** Import errors (missing packages, syntax errors, bad imports) propagate
+immediately with the original traceback, making the root cause obvious.
+### 3. `format_results()` — error messages always visible
+**Before:** Failure and error messages were only shown with `--verbose` flag.
+Without it, tests showed only `✗` with no message.
+**After:** The first line of the error message is always shown. `--verbose`
+controls whether the full traceback is displayed.
+### 4. Removed dead `evals/` resource folder
+Deleted `.claude/skills/eval-driven-dev/evals/` (contained `evals.json` and
+`sample-projects/` with no references from the skill instructions).
+## Files Affected
+- `pixie/evals/evaluation.py` — removed exception swallowing in `evaluate()`
+- `pixie/evals/runner.py` — `_load_module()` raises on error; `discover_tests()`
+  propagates; `format_results()` always shows messages
+- `tests/pixie/evals/test_evaluation.py` — updated test: expects propagation
+  instead of `score=0.0`; added sync evaluator error test
+- `tests/pixie/evals/test_runner.py` — added import error, syntax error,
+  and format_results tests
+- `specs/evals-harness.md` — updated error handling behavior and test expectations
+- `.claude/skills/eval-driven-dev/evals/` — deleted
+## Migration Notes
+- `evaluate()` no longer catches evaluator exceptions. Code that relied on
+  getting `Evaluation(score=0.0, details={"error": ...})` from crashed evaluators
+  must now handle exceptions directly.
+- `discover_tests()` now raises on import errors instead of silently skipping
+  broken test files.
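
Reviewer sketch: callers that relied on crashed evaluators producing a zero score could opt back into that behaviour explicitly with a wrapper along these lines. Everything here is an assumption for illustration: the `Evaluation` dataclass is a local stand-in with only two fields, and `tolerate_errors` and `broken_evaluator` are hypothetical names, not part of pixie.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass
class Evaluation:
    """Local stand-in for pixie's Evaluation; only the fields used here."""
    score: float
    reasoning: str


Evaluator = Callable[..., Awaitable[Evaluation]]


def tolerate_errors(evaluator: Evaluator) -> Evaluator:
    """Wrap an async evaluator so that exceptions become a 0.0 score."""

    async def wrapped(evaluable: Any, **kwargs: Any) -> Evaluation:
        try:
            return await evaluator(evaluable, **kwargs)
        except Exception as exc:  # deliberate: opt back into swallowing errors
            return Evaluation(score=0.0, reasoning=f"evaluator error: {exc}")

    return wrapped


async def broken_evaluator(evaluable: Any, *, trace: Any = None) -> Evaluation:
    # Simulates an evaluator that cannot run, e.g. because a key is missing
    raise RuntimeError("OPENAI_API_KEY is not set")


result = asyncio.run(tolerate_errors(broken_evaluator)("some output"))
print(result)
```

Keeping the wrapper opt-in preserves the intent of the change: an unwrapped evaluator still fails loudly, and a zero score only appears where the caller asked for it.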

pixie_qa-0.1.1/changelogs/root-package-exports-and-trace-id.md ADDED

@@ -0,0 +1,58 @@
+# Root Package Re-exports and Evaluable trace_id
+## What Changed
+### 1. Full public API re-exported from `pixie` root package
+Previously, `pixie/__init__.py` only exported `enable_storage` and `StorageHandler`. Users (and the eval-driven-dev skill) had to use submodule imports like `import pixie.instrumentation as px`, `from pixie.evals import ...`, `from pixie.dataset.store import DatasetStore`, and `from pixie.storage.evaluable import Evaluable`.
+Now **every public symbol** is importable from the top-level `pixie` package:
+```python
+from pixie import observe, flush, start_observation, init, add_handler
+from pixie import assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call, root
+from pixie import DatasetStore, Evaluable, ObservationStore, UNSET
+```
+This eliminates Pylance resolution errors for downstream users and simplifies the import story.
+### 2. `as_evaluable()` now includes `trace_id` and `span_id` in metadata
+Both `_observe_span_to_evaluable()` and `_llm_span_to_evaluable()` now inject the span's `trace_id` and `span_id` into `eval_metadata`. This means:
+- `pixie dataset save` automatically includes trace provenance in the dataset
+- Users can always look up the original trace for any dataset item
+- The skill's investigation flow ("look up trace_id from metadata") actually works
+### 3. Skill instructions updated
+- **Stage 0**: Now verifies `OPENAI_API_KEY` (or equivalent) before running anything
+- **Stage 3**: All code examples use `from pixie import ...` (no submodule imports)
+- **Stage 4**: Test file example uses `from pixie import ...`
+- **Stage 5**: Dataset building now emphasizes actually running the app to capture real outputs and traces; removed the misleading "Option B" that built datasets with fabricated/null outputs
+- **Stage 7**: Investigation examples use `from pixie import DatasetStore, ObservationStore`
+- **API reference**: All imports updated to top-level
+## Files Affected
+### Package
+- `pixie/__init__.py` — re-exports all public API symbols
+- `pixie/storage/evaluable.py` — `as_evaluable()` includes trace_id/span_id
+### Tests
+- `tests/pixie/test_init.py` — **new** — 27 tests verifying root package exports
+- `tests/pixie/observation_store/test_evaluable.py` — added trace_id/span_id assertions
+### Docs
+- `README.md` — code examples updated to top-level imports
+- `docs/package.md` — all import examples updated
+- `.claude/skills/eval-driven-dev/SKILL.md` — full skill instruction rewrite
+- `.claude/skills/eval-driven-dev/references/pixie-api.md` — API reference import paths
+## Migration Notes
+- **No breaking changes.** Submodule imports (`from pixie.evals import ...`, `import pixie.instrumentation as px`) continue to work. The top-level re-exports are purely additive.
+- `eval_metadata` from `as_evaluable()` now always contains `trace_id` and `span_id` keys. Code that checks `eval_metadata is None` for ObserveSpans with no user metadata should instead check for specific keys.
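
Reviewer sketch: the second migration note suggests checking for specific keys rather than testing `eval_metadata is None`. One way to do that is to filter out the provenance keys and look at what remains. The helper `user_metadata`, the constant `PROVENANCE_KEYS`, and the sample IDs are hypothetical; the `None` branch assumes hand-built items may still carry no metadata.

```python
from typing import Any, Dict, Mapping, Optional

# Keys that as_evaluable() is described as always injecting in 0.1.1
PROVENANCE_KEYS = frozenset({"trace_id", "span_id"})


def user_metadata(eval_metadata: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Return the metadata entries other than trace_id and span_id."""
    if eval_metadata is None:  # assumption: hand-built items may have none
        return {}
    return {k: v for k, v in eval_metadata.items() if k not in PROVENANCE_KEYS}


item_metadata = {"trace_id": "abc123", "span_id": "def456"}  # made-up IDs
# No user-supplied metadata: the filtered dict is empty
print(bool(user_metadata(item_metadata)))
# User-supplied entries survive the filter
print(user_metadata({**item_metadata, "retrieved_chunks": 3}))
```

A check such as `if user_metadata(item.eval_metadata):` then behaves like the old `is not None` test for spans without user metadata.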
