pixie-qa 0.5.1__tar.gz → 0.6.1__tar.gz
This diff shows the changes between two publicly released versions of the package, as they appear in their public registry. It is provided for informational purposes only.
- pixie_qa-0.6.1/PKG-INFO +228 -0
- pixie_qa-0.6.1/README.md +159 -0
- pixie_qa-0.6.1/pixie/assets/webui.html +64 -0
- pixie_qa-0.6.1/pixie/cli/format_command.py +204 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/main.py +24 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/start_command.py +17 -8
- pixie_qa-0.6.1/pixie/cli/stop_command.py +33 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/trace_command.py +18 -10
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/__init__.py +6 -8
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/harness/runner.py +26 -44
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/instrumentation/__init__.py +4 -0
- pixie_qa-0.6.1/pixie/instrumentation/models.py +72 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/instrumentation/wrap.py +31 -0
- pixie_qa-0.6.1/pixie/web/_serve.py +34 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/web/app.py +21 -7
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/web/server.py +159 -9
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pyproject.toml +1 -1
- pixie_qa-0.5.1/PKG-INFO +0 -136
- pixie_qa-0.5.1/README.md +0 -67
- pixie_qa-0.5.1/pixie/assets/webui.html +0 -64
- pixie_qa-0.5.1/pixie/cli/format_command.py +0 -223
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/.gitignore +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/LICENSE +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/__init__.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/assets/mock-data.json +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/__init__.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/analyze_command.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/init_command.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/test_command.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/config.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/evaluable.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/evaluation.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/llm_evaluator.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/rate_limiter.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/scorers.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/favicon.png +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/harness/__init__.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/harness/run_result.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/harness/runnable.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/instrumentation/llm_tracing.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/web/__init__.py +0 -0
- {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/web/watcher.py +0 -0
pixie_qa-0.6.1/PKG-INFO
ADDED
@@ -0,0 +1,228 @@
Metadata-Version: 2.4
Name: pixie-qa
Version: 0.6.1
Summary: Automated quality assurance for AI applications
Project-URL: Homepage, https://github.com/yiouli/pixie-qa
Project-URL: Repository, https://github.com/yiouli/pixie-qa
Project-URL: Documentation, https://yiouli.github.io/pixie-qa/
Project-URL: Bug Tracker, https://github.com/yiouli/pixie-qa/issues
License: MIT License

Copyright (c) 2026 Yiou Li

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
License-File: LICENSE
Keywords: ai,evals,llm,observability,opentelemetry,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Requires-Dist: autoevals>=0.1.0
Requires-Dist: jsonpickle>=4.0.0
Requires-Dist: openai>=2.29.0
Requires-Dist: openinference-instrumentation>=0.1.44
Requires-Dist: opentelemetry-api>=1.27.0
Requires-Dist: opentelemetry-sdk>=1.27.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.2.2
Requires-Dist: starlette>=1.0.0
Requires-Dist: uvicorn>=0.42.0
Requires-Dist: watchfiles>=1.1.1
Provides-Extra: all
Requires-Dist: openinference-instrumentation-anthropic; extra == 'all'
Requires-Dist: openinference-instrumentation-dspy; extra == 'all'
Requires-Dist: openinference-instrumentation-google-genai; extra == 'all'
Requires-Dist: openinference-instrumentation-langchain; extra == 'all'
Requires-Dist: openinference-instrumentation-openai; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: openinference-instrumentation-anthropic; extra == 'anthropic'
Provides-Extra: dspy
Requires-Dist: openinference-instrumentation-dspy; extra == 'dspy'
Provides-Extra: google
Requires-Dist: openinference-instrumentation-google-genai; extra == 'google'
Provides-Extra: langchain
Requires-Dist: openinference-instrumentation-langchain; extra == 'langchain'
Provides-Extra: openai
Requires-Dist: openinference-instrumentation-openai; extra == 'openai'
Description-Content-Type: text/markdown

# pixie-qa

Eval-driven development for Python LLM applications.

pixie-qa ships two complementary tools:

- **`eval-driven-dev` agent skill** — guides a coding agent through the full eval-driven development loop: instrument → capture → build dataset → test → investigate → iterate.
- **`pixie-qa` Python package** — the runtime: `wrap()` for data-boundary instrumentation, `Runnable` for dataset-driven test execution, built-in and custom evaluators, and the `pixie` CLI.

## Agent Skill

### Install

```bash
npx skills add yiouli/pixie-qa
```

### Usage

Open a conversation with your coding agent and say something like:

> "set up QA for my app"

The agent follows a six-step workflow:

1. **Understand the app** — entry point, execution flow, expected behaviors
2. **Instrument with `wrap()`** — mark data boundaries in the production code path
3. **Define evaluators** — map quality criteria to built-in or custom evaluators
4. **Build a dataset** — diverse representative scenarios in JSON
5. **Run `pixie test`** — real pass/fail scores for every scenario
6. **Investigate & iterate** — root-cause failures and fix

## Python Package

### Install

```bash
pip install pixie-qa
# with an LLM provider auto-instrumentor:
pip install "pixie-qa[openai]"  # openai | anthropic | langchain | google | dspy | all
```

### `wrap()` — instrument data boundaries

Call `wrap()` at data boundaries in your application code. At test time, `wrap(purpose="input")` values are injected from the dataset; `wrap(purpose="output")` values are captured and scored by evaluators.

```python
from pixie import wrap

db_result = wrap(fetch_from_db(user_id), purpose="input", name="db_result")
response = wrap(generate_response(db_result), purpose="output", name="response")
```

| Purpose    | Meaning                                                 |
| ---------- | ------------------------------------------------------- |
| `"input"`  | External data fed into the LLM (injected at test time)  |
| `"output"` | Final or intermediate output to evaluate                |
| `"state"`  | Intermediate state captured for debugging               |

### `Runnable` — run the app against each dataset entry

Implement the `Runnable` protocol so `pixie test` and `pixie trace` know how to run your app:

```python
from pydantic import BaseModel
import pixie

class MyArgs(BaseModel):
    user_id: str
    message: str

class MyAppRunnable(pixie.Runnable[MyArgs]):
    @classmethod
    def create(cls) -> "MyAppRunnable":
        return cls()

    async def setup(self) -> None:
        pass  # one-time initialization before entries run

    async def run(self, args: MyArgs) -> None:
        await my_app.handle(args.user_id, args.message)

    async def teardown(self) -> None:
        pass  # one-time cleanup after all entries finish
```

`run()` is called concurrently for all dataset entries — protect shared mutable state with `asyncio.Semaphore` or `asyncio.Lock` if needed.

### Dataset JSON format

```json
{
  "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "entry_kwargs": { "user_id": "u1", "message": "What is my balance?" },
      "test_case": {
        "eval_input": [
          { "purpose": "input", "name": "db_result", "data": { "balance": 120.5 } }
        ],
        "expectation": "Your current balance is $120.50.",
        "description": "basic balance query"
      }
    }
  ]
}
```

Use `pixie trace` + `pixie format` to capture real traces and turn them into dataset entries with the correct data shapes.

### Evaluators

| Evaluator              | Task                                             |
| ---------------------- | ------------------------------------------------ |
| `Factuality`           | LLM-as-judge factual accuracy                    |
| `ClosedQA`             | LLM-as-judge Q&A with reference answer           |
| `AnswerCorrectness`    | RAGAS combined factual + semantic similarity     |
| `EmbeddingSimilarity`  | Cosine similarity between output and expectation |
| `ExactMatch`           | Deterministic exact string match                 |
| `create_llm_evaluator` | Custom prompt-based LLM-as-judge                 |

Full evaluator list: [docs/pixie/index.md](docs/pixie/index.md)

### CLI reference

| Command                                         | Description                                   |
| ----------------------------------------------- | --------------------------------------------- |
| `pixie test [path]`                             | Run eval tests; open scorecard in browser     |
| `pixie trace --runnable R --input I --output O` | Run a Runnable, capture trace to JSONL        |
| `pixie format --input I --output O`             | Convert a trace JSONL to a dataset entry JSON |
| `pixie analyze <test_run_id>`                   | LLM analysis of a completed test run          |
| `pixie init [root]`                             | Scaffold the `pixie_qa/` working directory    |
| `pixie start [root]`                            | Launch the web UI at `http://localhost:7118`  |

## Web UI

View all eval artifacts (results, datasets, markdown docs) in a live-updating local web UI:

```bash
pixie start          # initializes pixie_qa/ (if needed) and opens http://localhost:7118
pixie start my_dir   # use a custom artifact root
pixie init           # scaffolds pixie_qa/ without starting the server
```

Changes to artifacts are pushed to the browser in real time via SSE.

## Configuration

Pixie reads configuration from environment variables and a local `.env` file. Existing process env vars take priority over `.env` values.

| Variable                   | Description                                    |
| -------------------------- | ---------------------------------------------- |
| `PIXIE_ROOT`               | Root directory for all generated artefacts     |
| `PIXIE_RATE_LIMIT_ENABLED` | `true` to enable evaluator throttling          |
| `PIXIE_RATE_LIMIT_RPS`     | Max requests per second for LLM-as-judge calls |
| `PIXIE_RATE_LIMIT_RPM`     | Max requests per minute                        |
| `PIXIE_RATE_LIMIT_TPS`     | Max tokens per second                          |
| `PIXIE_RATE_LIMIT_TPM`     | Max tokens per minute                          |

pixie_qa-0.6.1/README.md
ADDED
@@ -0,0 +1,159 @@
# pixie-qa

Eval-driven development for Python LLM applications.

pixie-qa ships two complementary tools:

- **`eval-driven-dev` agent skill** — guides a coding agent through the full eval-driven development loop: instrument → capture → build dataset → test → investigate → iterate.
- **`pixie-qa` Python package** — the runtime: `wrap()` for data-boundary instrumentation, `Runnable` for dataset-driven test execution, built-in and custom evaluators, and the `pixie` CLI.

## Agent Skill

### Install

```bash
npx skills add yiouli/pixie-qa
```

### Usage

Open a conversation with your coding agent and say something like:

> "set up QA for my app"

The agent follows a six-step workflow:

1. **Understand the app** — entry point, execution flow, expected behaviors
2. **Instrument with `wrap()`** — mark data boundaries in the production code path
3. **Define evaluators** — map quality criteria to built-in or custom evaluators
4. **Build a dataset** — diverse representative scenarios in JSON
5. **Run `pixie test`** — real pass/fail scores for every scenario
6. **Investigate & iterate** — root-cause failures and fix

## Python Package

### Install

```bash
pip install pixie-qa
# with an LLM provider auto-instrumentor:
pip install "pixie-qa[openai]"  # openai | anthropic | langchain | google | dspy | all
```

### `wrap()` — instrument data boundaries

Call `wrap()` at data boundaries in your application code. At test time, `wrap(purpose="input")` values are injected from the dataset; `wrap(purpose="output")` values are captured and scored by evaluators.

```python
from pixie import wrap

db_result = wrap(fetch_from_db(user_id), purpose="input", name="db_result")
response = wrap(generate_response(db_result), purpose="output", name="response")
```

| Purpose    | Meaning                                                 |
| ---------- | ------------------------------------------------------- |
| `"input"`  | External data fed into the LLM (injected at test time)  |
| `"output"` | Final or intermediate output to evaluate                |
| `"state"`  | Intermediate state captured for debugging               |
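
The `"state"` purpose uses the same call shape as the snippet above. A minimal sketch, where `build_query_plan` is a hypothetical helper standing in for your own code:

```python
from pixie import wrap

# Hypothetical intermediate value: per the table above, "state" data is
# captured for debugging, not injected from the dataset or scored.
plan = wrap(build_query_plan(db_result), purpose="state", name="query_plan")
```
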
### `Runnable` — run the app against each dataset entry

Implement the `Runnable` protocol so `pixie test` and `pixie trace` know how to run your app:

```python
from pydantic import BaseModel
import pixie

class MyArgs(BaseModel):
    user_id: str
    message: str

class MyAppRunnable(pixie.Runnable[MyArgs]):
    @classmethod
    def create(cls) -> "MyAppRunnable":
        return cls()

    async def setup(self) -> None:
        pass  # one-time initialization before entries run

    async def run(self, args: MyArgs) -> None:
        await my_app.handle(args.user_id, args.message)

    async def teardown(self) -> None:
        pass  # one-time cleanup after all entries finish
```

`run()` is called concurrently for all dataset entries — protect shared mutable state with `asyncio.Semaphore` or `asyncio.Lock` if needed.
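
One way to do that is a semaphore created in `setup()`. A sketch reusing `MyArgs` and `my_app` from the snippet above; the cap of 4 is an arbitrary example:

```python
import asyncio
import pixie

class ThrottledRunnable(pixie.Runnable[MyArgs]):
    @classmethod
    def create(cls) -> "ThrottledRunnable":
        return cls()

    async def setup(self) -> None:
        # Created once before any entries run; shared by every run() call.
        self._sem = asyncio.Semaphore(4)

    async def run(self, args: MyArgs) -> None:
        # At most 4 entries touch the shared resource at a time.
        async with self._sem:
            await my_app.handle(args.user_id, args.message)

    async def teardown(self) -> None:
        pass
```
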
### Dataset JSON format

```json
{
  "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "entry_kwargs": { "user_id": "u1", "message": "What is my balance?" },
      "test_case": {
        "eval_input": [
          { "purpose": "input", "name": "db_result", "data": { "balance": 120.5 } }
        ],
        "expectation": "Your current balance is $120.50.",
        "description": "basic balance query"
      }
    }
  ]
}
```

Use `pixie trace` + `pixie format` to capture real traces and turn them into dataset entries with the correct data shapes.
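
Chained together, using the flags from the CLI reference below (all file paths here are illustrative):

```bash
# Run the Runnable once, capturing its trace to JSONL
pixie trace --runnable pixie_qa/scripts/run_app.py:MyAppRunnable \
  --input pixie_qa/inputs/balance.json \
  --output pixie_qa/traces/balance.jsonl

# Convert the captured trace into a dataset entry
pixie format --input pixie_qa/traces/balance.jsonl \
  --output pixie_qa/datasets/balance.json

# Score the app against the new dataset
pixie test pixie_qa/datasets/balance.json
```
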
### Evaluators

| Evaluator              | Task                                             |
| ---------------------- | ------------------------------------------------ |
| `Factuality`           | LLM-as-judge factual accuracy                    |
| `ClosedQA`             | LLM-as-judge Q&A with reference answer           |
| `AnswerCorrectness`    | RAGAS combined factual + semantic similarity     |
| `EmbeddingSimilarity`  | Cosine similarity between output and expectation |
| `ExactMatch`           | Deterministic exact string match                 |
| `create_llm_evaluator` | Custom prompt-based LLM-as-judge                 |

Full evaluator list: [docs/pixie/index.md](docs/pixie/index.md)
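
For criteria the built-ins don't cover, `create_llm_evaluator` builds a custom prompt-based judge. The parameter names below are assumptions for illustration, not the documented API; see the linked docs for the real signature:

```python
from pixie import create_llm_evaluator  # import path assumed

# Hypothetical usage: parameter names are illustrative only.
politeness = create_llm_evaluator(
    name="Politeness",
    prompt="Is the response polite and professional? Answer PASS or FAIL.",
)
```
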
### CLI reference

| Command                                         | Description                                   |
| ----------------------------------------------- | --------------------------------------------- |
| `pixie test [path]`                             | Run eval tests; open scorecard in browser     |
| `pixie trace --runnable R --input I --output O` | Run a Runnable, capture trace to JSONL        |
| `pixie format --input I --output O`             | Convert a trace JSONL to a dataset entry JSON |
| `pixie analyze <test_run_id>`                   | LLM analysis of a completed test run          |
| `pixie init [root]`                             | Scaffold the `pixie_qa/` working directory    |
| `pixie start [root]`                            | Launch the web UI at `http://localhost:7118`  |

## Web UI

View all eval artifacts (results, datasets, markdown docs) in a live-updating local web UI:

```bash
pixie start          # initializes pixie_qa/ (if needed) and opens http://localhost:7118
pixie start my_dir   # use a custom artifact root
pixie init           # scaffolds pixie_qa/ without starting the server
```

Changes to artifacts are pushed to the browser in real time via SSE.

## Configuration

Pixie reads configuration from environment variables and a local `.env` file. Existing process env vars take priority over `.env` values.

| Variable                   | Description                                    |
| -------------------------- | ---------------------------------------------- |
| `PIXIE_ROOT`               | Root directory for all generated artefacts     |
| `PIXIE_RATE_LIMIT_ENABLED` | `true` to enable evaluator throttling          |
| `PIXIE_RATE_LIMIT_RPS`     | Max requests per second for LLM-as-judge calls |
| `PIXIE_RATE_LIMIT_RPM`     | Max requests per minute                        |
| `PIXIE_RATE_LIMIT_TPS`     | Max tokens per second                          |
| `PIXIE_RATE_LIMIT_TPM`     | Max tokens per minute                          |
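
A minimal `.env` using the variables above (the values shown are examples only):

```bash
# .env: variables already set in the process environment take priority
PIXIE_ROOT=pixie_qa
PIXIE_RATE_LIMIT_ENABLED=true
PIXIE_RATE_LIMIT_RPM=60
PIXIE_RATE_LIMIT_TPM=90000
```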