pixie-qa 0.5.1__tar.gz → 0.6.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. pixie_qa-0.6.1/PKG-INFO +228 -0
  2. pixie_qa-0.6.1/README.md +159 -0
  3. pixie_qa-0.6.1/pixie/assets/webui.html +64 -0
  4. pixie_qa-0.6.1/pixie/cli/format_command.py +204 -0
  5. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/main.py +24 -0
  6. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/start_command.py +17 -8
  7. pixie_qa-0.6.1/pixie/cli/stop_command.py +33 -0
  8. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/trace_command.py +18 -10
  9. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/__init__.py +6 -8
  10. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/harness/runner.py +26 -44
  11. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/instrumentation/__init__.py +4 -0
  12. pixie_qa-0.6.1/pixie/instrumentation/models.py +72 -0
  13. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/instrumentation/wrap.py +31 -0
  14. pixie_qa-0.6.1/pixie/web/_serve.py +34 -0
  15. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/web/app.py +21 -7
  16. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/web/server.py +159 -9
  17. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pyproject.toml +1 -1
  18. pixie_qa-0.5.1/PKG-INFO +0 -136
  19. pixie_qa-0.5.1/README.md +0 -67
  20. pixie_qa-0.5.1/pixie/assets/webui.html +0 -64
  21. pixie_qa-0.5.1/pixie/cli/format_command.py +0 -223
  22. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/.gitignore +0 -0
  23. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/LICENSE +0 -0
  24. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/__init__.py +0 -0
  25. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/assets/mock-data.json +0 -0
  26. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/__init__.py +0 -0
  27. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/analyze_command.py +0 -0
  28. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/init_command.py +0 -0
  29. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/cli/test_command.py +0 -0
  30. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/config.py +0 -0
  31. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/evaluable.py +0 -0
  32. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/evaluation.py +0 -0
  33. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/llm_evaluator.py +0 -0
  34. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/rate_limiter.py +0 -0
  35. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/eval/scorers.py +0 -0
  36. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/favicon.png +0 -0
  37. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/harness/__init__.py +0 -0
  38. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/harness/run_result.py +0 -0
  39. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/harness/runnable.py +0 -0
  40. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/instrumentation/llm_tracing.py +0 -0
  41. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/web/__init__.py +0 -0
  42. {pixie_qa-0.5.1 → pixie_qa-0.6.1}/pixie/web/watcher.py +0 -0
pixie_qa-0.6.1/PKG-INFO
@@ -0,0 +1,228 @@
+ Metadata-Version: 2.4
+ Name: pixie-qa
+ Version: 0.6.1
+ Summary: Automated quality assurance for AI applications
+ Project-URL: Homepage, https://github.com/yiouli/pixie-qa
+ Project-URL: Repository, https://github.com/yiouli/pixie-qa
+ Project-URL: Documentation, https://yiouli.github.io/pixie-qa/
+ Project-URL: Bug Tracker, https://github.com/yiouli/pixie-qa/issues
+ License: MIT License
+
+ Copyright (c) 2026 Yiou Li
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+ License-File: LICENSE
+ Keywords: ai,evals,llm,observability,opentelemetry,testing
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Topic :: Software Development :: Testing
+ Requires-Python: >=3.11
+ Requires-Dist: autoevals>=0.1.0
+ Requires-Dist: jsonpickle>=4.0.0
+ Requires-Dist: openai>=2.29.0
+ Requires-Dist: openinference-instrumentation>=0.1.44
+ Requires-Dist: opentelemetry-api>=1.27.0
+ Requires-Dist: opentelemetry-sdk>=1.27.0
+ Requires-Dist: pydantic>=2.0
+ Requires-Dist: python-dotenv>=1.2.2
+ Requires-Dist: starlette>=1.0.0
+ Requires-Dist: uvicorn>=0.42.0
+ Requires-Dist: watchfiles>=1.1.1
+ Provides-Extra: all
+ Requires-Dist: openinference-instrumentation-anthropic; extra == 'all'
+ Requires-Dist: openinference-instrumentation-dspy; extra == 'all'
+ Requires-Dist: openinference-instrumentation-google-genai; extra == 'all'
+ Requires-Dist: openinference-instrumentation-langchain; extra == 'all'
+ Requires-Dist: openinference-instrumentation-openai; extra == 'all'
+ Provides-Extra: anthropic
+ Requires-Dist: openinference-instrumentation-anthropic; extra == 'anthropic'
+ Provides-Extra: dspy
+ Requires-Dist: openinference-instrumentation-dspy; extra == 'dspy'
+ Provides-Extra: google
+ Requires-Dist: openinference-instrumentation-google-genai; extra == 'google'
+ Provides-Extra: langchain
+ Requires-Dist: openinference-instrumentation-langchain; extra == 'langchain'
+ Provides-Extra: openai
+ Requires-Dist: openinference-instrumentation-openai; extra == 'openai'
+ Description-Content-Type: text/markdown
+
+ # pixie-qa
+
+ Eval-driven development for Python LLM applications.
+
+ pixie-qa ships two complementary tools:
+
+ - **`eval-driven-dev` agent skill** — guides a coding agent through the full eval-driven development loop: instrument → capture → build dataset → test → investigate → iterate.
+ - **`pixie-qa` Python package** — the runtime: `wrap()` for data-boundary instrumentation, `Runnable` for dataset-driven test execution, built-in and custom evaluators, and the `pixie` CLI.
+
+ ## Agent Skill
+
+ ### Install
+
+ ```bash
+ npx skills add yiouli/pixie-qa
+ ```
+
+ ### Usage
+
+ Open a conversation with your coding agent and say something like:
+
+ > "set up QA for my app"
+
+ The agent follows a six-step workflow:
+
+ 1. **Understand the app** — entry point, execution flow, expected behaviors
+ 2. **Instrument with `wrap()`** — mark data boundaries in the production code path
+ 3. **Define evaluators** — map quality criteria to built-in or custom evaluators
+ 4. **Build a dataset** — diverse representative scenarios in JSON
+ 5. **Run `pixie test`** — real pass/fail scores for every scenario
+ 6. **Investigate & iterate** — root-cause failures and fix
+
+ ## Python Package
+
+ ### Install
+
+ ```bash
+ pip install pixie-qa
+ # with an LLM provider auto-instrumentor:
+ pip install "pixie-qa[openai]"  # openai | anthropic | langchain | google | dspy | all
+ ```
+
+ ### `wrap()` — instrument data boundaries
+
+ Call `wrap()` at data boundaries in your application code. At test time, `wrap(purpose="input")` values are injected from the dataset; `wrap(purpose="output")` values are captured and scored by evaluators.
+
+ ```python
+ from pixie import wrap
+
+ db_result = wrap(fetch_from_db(user_id), purpose="input", name="db_result")
+ response = wrap(generate_response(db_result), purpose="output", name="response")
+ ```
+
+ | Purpose | Meaning |
+ | ---------- | ----------------------------------------------------- |
+ | `"input"` | External data fed into the LLM (injected at test time) |
+ | `"output"` | Final or intermediate output to evaluate |
+ | `"state"` | Intermediate state captured for debugging |
+
+ ### `Runnable` — run the app against each dataset entry
+
+ Implement the `Runnable` protocol so `pixie test` and `pixie trace` know how to run your app:
+
+ ```python
+ from pydantic import BaseModel
+ import pixie
+
+ class MyArgs(BaseModel):
+     user_id: str
+     message: str
+
+ class MyAppRunnable(pixie.Runnable[MyArgs]):
+     @classmethod
+     def create(cls) -> "MyAppRunnable":
+         return cls()
+
+     async def setup(self) -> None:
+         pass  # one-time initialization before entries run
+
+     async def run(self, args: MyArgs) -> None:
+         await my_app.handle(args.user_id, args.message)
+
+     async def teardown(self) -> None:
+         pass  # one-time cleanup after all entries finish
+ ```
+
+ `run()` is called concurrently for all dataset entries — protect shared mutable state with `asyncio.Semaphore` or `asyncio.Lock` if needed.
+
+ ### Dataset JSON format
+
+ ```json
+ {
+   "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
+   "evaluators": ["Factuality"],
+   "entries": [
+     {
+       "entry_kwargs": { "user_id": "u1", "message": "What is my balance?" },
+       "test_case": {
+         "eval_input": [
+           { "purpose": "input", "name": "db_result", "data": { "balance": 120.5 } }
+         ],
+         "expectation": "Your current balance is $120.50.",
+         "description": "basic balance query"
+       }
+     }
+   ]
+ }
+ ```
+
+ Use `pixie trace` + `pixie format` to capture real traces and turn them into dataset entries with the correct data shapes.
+
+ ### Evaluators
+
+ | Evaluator | Task |
+ | ----------------------- | --------------------------------------------------- |
+ | `Factuality` | LLM-as-judge factual accuracy |
+ | `ClosedQA` | LLM-as-judge Q&A with reference answer |
+ | `AnswerCorrectness` | RAGAS combined factual + semantic similarity |
+ | `EmbeddingSimilarity` | Cosine similarity between output and expectation |
+ | `ExactMatch` | Deterministic exact string match |
+ | `create_llm_evaluator` | Custom prompt-based LLM-as-judge |
+
+ Full evaluator list: [docs/pixie/index.md](docs/pixie/index.md)
+
+ ### CLI reference
+
+ | Command | Description |
+ | ------------------------------------------------ | ------------------------------------------------ |
+ | `pixie test [path]` | Run eval tests; open scorecard in browser |
+ | `pixie trace --runnable R --input I --output O` | Run a Runnable, capture trace to JSONL |
+ | `pixie format --input I --output O` | Convert a trace JSONL to a dataset entry JSON |
+ | `pixie analyze <test_run_id>` | LLM analysis of a completed test run |
+ | `pixie init [root]` | Scaffold the `pixie_qa/` working directory |
+ | `pixie start [root]` | Launch the web UI at `http://localhost:7118` |
+
+ ## Web UI
+
+ View all eval artifacts (results, datasets, markdown docs) in a live-updating local web UI:
+
+ ```bash
+ pixie start          # initializes pixie_qa/ (if needed) and opens http://localhost:7118
+ pixie start my_dir   # use a custom artifact root
+ pixie init           # scaffolds pixie_qa/ without starting the server
+ ```
+
+ Changes to artifacts are pushed to the browser in real time via SSE.
+
+ ## Configuration
+
+ Pixie reads configuration from environment variables and a local `.env` file. Existing process env vars take priority over `.env` values.
+
+ | Variable | Description |
+ | -------------------------- | ------------------------------------------------- |
+ | `PIXIE_ROOT` | Root directory for all generated artefacts |
+ | `PIXIE_RATE_LIMIT_ENABLED` | `true` to enable evaluator throttling |
+ | `PIXIE_RATE_LIMIT_RPS` | Max requests per second for LLM-as-judge calls |
+ | `PIXIE_RATE_LIMIT_RPM` | Max requests per minute |
+ | `PIXIE_RATE_LIMIT_TPS` | Max tokens per second |
+ | `PIXIE_RATE_LIMIT_TPM` | Max tokens per minute |
pixie_qa-0.6.1/README.md
@@ -0,0 +1,159 @@
+ # pixie-qa
+
+ Eval-driven development for Python LLM applications.
+
+ pixie-qa ships two complementary tools:
+
+ - **`eval-driven-dev` agent skill** — guides a coding agent through the full eval-driven development loop: instrument → capture → build dataset → test → investigate → iterate.
+ - **`pixie-qa` Python package** — the runtime: `wrap()` for data-boundary instrumentation, `Runnable` for dataset-driven test execution, built-in and custom evaluators, and the `pixie` CLI.
+
+ ## Agent Skill
+
+ ### Install
+
+ ```bash
+ npx skills add yiouli/pixie-qa
+ ```
+
+ ### Usage
+
+ Open a conversation with your coding agent and say something like:
+
+ > "set up QA for my app"
+
+ The agent follows a six-step workflow:
+
+ 1. **Understand the app** — entry point, execution flow, expected behaviors
+ 2. **Instrument with `wrap()`** — mark data boundaries in the production code path
+ 3. **Define evaluators** — map quality criteria to built-in or custom evaluators
+ 4. **Build a dataset** — diverse representative scenarios in JSON
+ 5. **Run `pixie test`** — real pass/fail scores for every scenario
+ 6. **Investigate & iterate** — root-cause failures and fix
+
+ ## Python Package
+
+ ### Install
+
+ ```bash
+ pip install pixie-qa
+ # with an LLM provider auto-instrumentor:
+ pip install "pixie-qa[openai]"  # openai | anthropic | langchain | google | dspy | all
+ ```
+
+ ### `wrap()` — instrument data boundaries
+
+ Call `wrap()` at data boundaries in your application code. At test time, `wrap(purpose="input")` values are injected from the dataset; `wrap(purpose="output")` values are captured and scored by evaluators.
+
+ ```python
+ from pixie import wrap
+
+ db_result = wrap(fetch_from_db(user_id), purpose="input", name="db_result")
+ response = wrap(generate_response(db_result), purpose="output", name="response")
+ ```
+
+ | Purpose | Meaning |
+ | ---------- | ----------------------------------------------------- |
+ | `"input"` | External data fed into the LLM (injected at test time) |
+ | `"output"` | Final or intermediate output to evaluate |
+ | `"state"` | Intermediate state captured for debugging |
+
+ ### `Runnable` — run the app against each dataset entry
+
+ Implement the `Runnable` protocol so `pixie test` and `pixie trace` know how to run your app:
+
+ ```python
+ from pydantic import BaseModel
+ import pixie
+
+ class MyArgs(BaseModel):
+     user_id: str
+     message: str
+
+ class MyAppRunnable(pixie.Runnable[MyArgs]):
+     @classmethod
+     def create(cls) -> "MyAppRunnable":
+         return cls()
+
+     async def setup(self) -> None:
+         pass  # one-time initialization before entries run
+
+     async def run(self, args: MyArgs) -> None:
+         await my_app.handle(args.user_id, args.message)
+
+     async def teardown(self) -> None:
+         pass  # one-time cleanup after all entries finish
+ ```
+
+ `run()` is called concurrently for all dataset entries — protect shared mutable state with `asyncio.Semaphore` or `asyncio.Lock` if needed.
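Because entries run concurrently, a shared resource inside `run()` can be guarded with a semaphore. A minimal sketch, building on the class above (the cap of 4 and the `my_app` client are illustrative, not part of pixie-qa):

```python
import asyncio

import pixie

class MyAppRunnable(pixie.Runnable[MyArgs]):
    async def setup(self) -> None:
        # Allow at most 4 entries to use the shared app at once (illustrative cap).
        self._sem = asyncio.Semaphore(4)

    async def run(self, args: MyArgs) -> None:
        async with self._sem:
            await my_app.handle(args.user_id, args.message)
```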
+
+ ### Dataset JSON format
+
+ ```json
+ {
+   "runnable": "pixie_qa/scripts/run_app.py:MyAppRunnable",
+   "evaluators": ["Factuality"],
+   "entries": [
+     {
+       "entry_kwargs": { "user_id": "u1", "message": "What is my balance?" },
+       "test_case": {
+         "eval_input": [
+           { "purpose": "input", "name": "db_result", "data": { "balance": 120.5 } }
+         ],
+         "expectation": "Your current balance is $120.50.",
+         "description": "basic balance query"
+       }
+     }
+   ]
+ }
+ ```
+
+ Use `pixie trace` + `pixie format` to capture real traces and turn them into dataset entries with the correct data shapes.
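A capture-then-convert round trip might look like the following sketch, using the flags from the CLI reference below (all file paths here are illustrative):

```bash
# Run the Runnable once and capture its trace as JSONL.
pixie trace --runnable pixie_qa/scripts/run_app.py:MyAppRunnable \
  --input sample_input.json --output traces/balance.jsonl

# Convert the captured trace into a dataset entry JSON.
pixie format --input traces/balance.jsonl --output pixie_qa/datasets/balance.json
```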
+
+ ### Evaluators
+
+ | Evaluator | Task |
+ | ----------------------- | --------------------------------------------------- |
+ | `Factuality` | LLM-as-judge factual accuracy |
+ | `ClosedQA` | LLM-as-judge Q&A with reference answer |
+ | `AnswerCorrectness` | RAGAS combined factual + semantic similarity |
+ | `EmbeddingSimilarity` | Cosine similarity between output and expectation |
+ | `ExactMatch` | Deterministic exact string match |
+ | `create_llm_evaluator` | Custom prompt-based LLM-as-judge |
+
+ Full evaluator list: [docs/pixie/index.md](docs/pixie/index.md)
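The README lists `create_llm_evaluator` without a signature. Purely as a hypothetical sketch (the import path and keyword arguments are assumptions; consult the linked docs for the real API), a custom judge might be declared like this:

```python
from pixie.eval import create_llm_evaluator  # import path assumed, not confirmed

# Hypothetical arguments; the real signature may differ.
Politeness = create_llm_evaluator(
    name="Politeness",
    prompt=(
        "Rate whether the response below is polite and professional.\n"
        "Respond with PASS or FAIL.\n\nResponse: {output}"
    ),
)
```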
+
+ ### CLI reference
+
+ | Command | Description |
+ | ------------------------------------------------ | ------------------------------------------------ |
+ | `pixie test [path]` | Run eval tests; open scorecard in browser |
+ | `pixie trace --runnable R --input I --output O` | Run a Runnable, capture trace to JSONL |
+ | `pixie format --input I --output O` | Convert a trace JSONL to a dataset entry JSON |
+ | `pixie analyze <test_run_id>` | LLM analysis of a completed test run |
+ | `pixie init [root]` | Scaffold the `pixie_qa/` working directory |
+ | `pixie start [root]` | Launch the web UI at `http://localhost:7118` |
+
+ ## Web UI
+
+ View all eval artifacts (results, datasets, markdown docs) in a live-updating local web UI:
+
+ ```bash
+ pixie start          # initializes pixie_qa/ (if needed) and opens http://localhost:7118
+ pixie start my_dir   # use a custom artifact root
+ pixie init           # scaffolds pixie_qa/ without starting the server
+ ```
+
+ Changes to artifacts are pushed to the browser in real time via SSE.
+
+ ## Configuration
+
+ Pixie reads configuration from environment variables and a local `.env` file. Existing process env vars take priority over `.env` values.
+
+ | Variable | Description |
+ | -------------------------- | ------------------------------------------------- |
+ | `PIXIE_ROOT` | Root directory for all generated artefacts |
+ | `PIXIE_RATE_LIMIT_ENABLED` | `true` to enable evaluator throttling |
+ | `PIXIE_RATE_LIMIT_RPS` | Max requests per second for LLM-as-judge calls |
+ | `PIXIE_RATE_LIMIT_RPM` | Max requests per minute |
+ | `PIXIE_RATE_LIMIT_TPS` | Max tokens per second |
+ | `PIXIE_RATE_LIMIT_TPM` | Max tokens per minute |
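A sample `.env` using the variables above (all values are illustrative; anything already set in the process environment overrides these):

```bash
# .env (illustrative values only)
PIXIE_ROOT=pixie_qa
PIXIE_RATE_LIMIT_ENABLED=true
PIXIE_RATE_LIMIT_RPS=5
PIXIE_RATE_LIMIT_RPM=200
PIXIE_RATE_LIMIT_TPS=10000
PIXIE_RATE_LIMIT_TPM=400000
```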