@arizeai/phoenix-client 6.5.4 → 6.5.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. package/dist/esm/__generated__/api/v1.d.ts +244 -0
  2. package/dist/esm/__generated__/api/v1.d.ts.map +1 -1
  3. package/dist/esm/experiments/resumeEvaluation.d.ts.map +1 -1
  4. package/dist/esm/experiments/resumeEvaluation.js +179 -170
  5. package/dist/esm/experiments/resumeEvaluation.js.map +1 -1
  6. package/dist/esm/experiments/resumeExperiment.d.ts.map +1 -1
  7. package/dist/esm/experiments/resumeExperiment.js +201 -185
  8. package/dist/esm/experiments/resumeExperiment.js.map +1 -1
  9. package/dist/esm/experiments/runExperiment.d.ts.map +1 -1
  10. package/dist/esm/experiments/runExperiment.js +238 -207
  11. package/dist/esm/experiments/runExperiment.js.map +1 -1
  12. package/dist/esm/experiments/tracing.d.ts +10 -0
  13. package/dist/esm/experiments/tracing.d.ts.map +1 -0
  14. package/dist/esm/experiments/tracing.js +21 -0
  15. package/dist/esm/experiments/tracing.js.map +1 -0
  16. package/dist/esm/prompts/sdks/toSDK.d.ts +2 -2
  17. package/dist/esm/tsconfig.esm.tsbuildinfo +1 -1
  18. package/dist/esm/utils/formatPromptMessages.d.ts.map +1 -1
  19. package/dist/esm/utils/getPromptBySelector.d.ts.map +1 -1
  20. package/dist/src/__generated__/api/v1.d.ts +244 -0
  21. package/dist/src/__generated__/api/v1.d.ts.map +1 -1
  22. package/dist/src/experiments/resumeEvaluation.d.ts.map +1 -1
  23. package/dist/src/experiments/resumeEvaluation.js +192 -183
  24. package/dist/src/experiments/resumeEvaluation.js.map +1 -1
  25. package/dist/src/experiments/resumeExperiment.d.ts.map +1 -1
  26. package/dist/src/experiments/resumeExperiment.js +214 -198
  27. package/dist/src/experiments/resumeExperiment.js.map +1 -1
  28. package/dist/src/experiments/runExperiment.d.ts.map +1 -1
  29. package/dist/src/experiments/runExperiment.js +228 -197
  30. package/dist/src/experiments/runExperiment.js.map +1 -1
  31. package/dist/src/experiments/tracing.d.ts +10 -0
  32. package/dist/src/experiments/tracing.d.ts.map +1 -0
  33. package/dist/src/experiments/tracing.js +24 -0
  34. package/dist/src/experiments/tracing.js.map +1 -0
  35. package/dist/src/utils/formatPromptMessages.d.ts.map +1 -1
  36. package/dist/src/utils/getPromptBySelector.d.ts.map +1 -1
  37. package/dist/tsconfig.tsbuildinfo +1 -1
  38. package/docs/annotations.mdx +83 -0
  39. package/docs/datasets.mdx +77 -0
  40. package/docs/document-annotations.mdx +208 -0
  41. package/docs/experiments.mdx +271 -0
  42. package/docs/overview.mdx +176 -0
  43. package/docs/prompts.mdx +73 -0
  44. package/docs/session-annotations.mdx +158 -0
  45. package/docs/sessions.mdx +87 -0
  46. package/docs/span-annotations.mdx +283 -0
  47. package/docs/spans.mdx +76 -0
  48. package/docs/traces.mdx +63 -0
  49. package/package.json +10 -4
  50. package/src/__generated__/api/v1.ts +244 -0
  51. package/src/experiments/resumeEvaluation.ts +224 -206
  52. package/src/experiments/resumeExperiment.ts +237 -213
  53. package/src/experiments/runExperiment.ts +281 -243
  54. package/src/experiments/tracing.ts +30 -0
+++ package/docs/annotations.mdx
@@ -0,0 +1,83 @@
---
title: "Annotations"
description: "Attach structured feedback to spans, documents, and sessions with @arizeai/phoenix-client"
---

Annotations are structured feedback records — labels, scores, and explanations — attached to observability artifacts in Phoenix. They are the primary mechanism for recording quality signals, whether those signals come from end-users, LLM judges, or programmatic checks.

<section className="hidden" data-agent-context="relevant-source-files" aria-label="Relevant source files">
  <h2>Relevant Source Files</h2>
  <ul>
    <li><code>src/types/annotations.ts</code> for the shared <code>Annotation</code> base interface</li>
    <li><code>src/spans/types.ts</code> for <code>SpanAnnotation</code> and <code>DocumentAnnotation</code></li>
    <li><code>src/sessions/types.ts</code> for <code>SessionAnnotation</code></li>
  </ul>
</section>

## Why Annotations Matter

Annotations close the feedback loop on your LLM application:

- **Human feedback** — Thumbs-up/down from end-users, QA reviews from teammates, or labeling tasks for dataset curation.
- **LLM-as-judge evaluations** — Automated quality scoring using a second LLM (groundedness, helpfulness, safety).
- **Code-based metrics** — Programmatic checks like regex validation, threshold comparisons, or retrieval precision calculations.

Once attached, annotations appear in the Phoenix UI alongside traces and can be used to filter spans, build datasets, and track improvements during experimentation.

## Annotation Targets

Phoenix supports three annotation targets, each focused on a different level of your application:

- **[Span Annotations](./span-annotations)** — Feedback on individual traced operations: an LLM call, a tool invocation, a retrieval step. The most common annotation target.
- **[Document Annotations](./document-annotations)** — Feedback on specific retrieved documents within a retriever span, indexed by position. Essential for evaluating RAG pipeline quality.
- **[Session Annotations](./session-annotations)** — Feedback on multi-turn conversations or threads as a whole. Use for conversation-level quality signals like resolution rate or customer satisfaction.

## Annotator Kinds

Every annotation records _who or what_ produced the feedback:

| Kind | Default | Use case |
|------|---------|----------|
| `"HUMAN"` | Yes | Manual review, end-user thumbs-up/down, labeling tasks |
| `"LLM"` | — | LLM-as-judge evaluations, automated quality scoring |
| `"CODE"` | — | Programmatic rules, regex checks, threshold-based metrics |

## Shared Annotation Shape

All annotation types share this base interface:

```ts
interface Annotation {
  name: string; // What is being measured (e.g. "groundedness")
  label?: string; // Categorical result (e.g. "grounded")
  score?: number; // Numeric result (e.g. 0.95)
  explanation?: string; // Free-text justification
  identifier?: string; // For idempotent upserts
  metadata?: Record<string, unknown>; // Arbitrary context
}
```

At least one of `label`, `score`, or `explanation` must be provided.

Each target adds its own identifier field — `spanId` for spans, `spanId` + `documentPosition` for documents, and `sessionId` for sessions.
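As a sketch, the target-specific shapes compose with the base interface like this (the interface names match the relevant source files listed earlier, but the canonical definitions live in `src/spans/types.ts` and `src/sessions/types.ts`):

```ts
// Sketch only: interface names mirror the relevant source files above;
// the canonical definitions live in src/spans/types.ts and
// src/sessions/types.ts.
interface Annotation {
  name: string;
  label?: string;
  score?: number;
  explanation?: string;
  identifier?: string;
  metadata?: Record<string, unknown>;
}

interface SpanAnnotation extends Annotation {
  spanId: string;
}

interface DocumentAnnotation extends Annotation {
  spanId: string;
  documentPosition: number; // 0-based index into the retrieval results
}

interface SessionAnnotation extends Annotation {
  sessionId: string;
}

const spanFeedback: SpanAnnotation = {
  spanId: "abc123",
  name: "helpfulness",
  score: 1,
};

const docFeedback: DocumentAnnotation = {
  spanId: "abc123",
  documentPosition: 0,
  name: "relevance",
  label: "relevant",
};

const sessionFeedback: SessionAnnotation = {
  sessionId: "session-1",
  name: "resolution",
  label: "resolved",
};
```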
## Sync vs. Async

All annotation write functions accept an optional `sync` parameter:

- **`sync: false`** (default) — The server acknowledges receipt and processes the annotation asynchronously. Higher throughput, but the response does not include the annotation ID.
- **`sync: true`** — The server processes the annotation synchronously and returns its ID. Useful in tests or workflows that need to read the annotation back immediately.

<section className="hidden" data-agent-context="source-map" aria-label="Source map">
  <h2>Source Map</h2>
  <ul>
    <li><code>src/types/annotations.ts</code></li>
    <li><code>src/spans/addSpanAnnotation.ts</code></li>
    <li><code>src/spans/logSpanAnnotations.ts</code></li>
    <li><code>src/spans/addDocumentAnnotation.ts</code></li>
    <li><code>src/spans/logDocumentAnnotations.ts</code></li>
    <li><code>src/spans/getSpanAnnotations.ts</code></li>
    <li><code>src/sessions/addSessionAnnotation.ts</code></li>
    <li><code>src/sessions/logSessionAnnotations.ts</code></li>
  </ul>
</section>
+++ package/docs/datasets.mdx
@@ -0,0 +1,77 @@
---
title: "Datasets"
description: "Create and inspect datasets with @arizeai/phoenix-client"
---

Datasets are the foundation for experiment runs. The dataset helpers cover creation, idempotent creation, record inspection, and example appends.

<section className="hidden" data-agent-context="relevant-source-files" aria-label="Relevant source files">
  <h2>Relevant Source Files</h2>
  <ul>
    <li>
      <code>src/datasets/createOrGetDataset.ts</code> for the exact return
      shape of the idempotent helper
    </li>
  </ul>
</section>

## Create A Dataset

```ts
import { createDataset } from "@arizeai/phoenix-client/datasets";

const { datasetId } = await createDataset({
  name: "support-eval",
  description: "Support questions with expected answers",
  examples: [
    {
      input: { question: "Where is my order?" },
      output: { answer: "Use the tracking page in your account." },
      metadata: { channel: "chat" },
    },
  ],
});
```

## Reuse Or Append

```ts
import {
  appendDatasetExamples,
  createOrGetDataset,
} from "@arizeai/phoenix-client/datasets";

const dataset = await createOrGetDataset({
  name: "support-eval",
  description: "Support questions with expected answers",
  examples: [],
});

await appendDatasetExamples({
  dataset,
  examples: [
    {
      input: { question: "How do I reset my password?" },
      output: { answer: "Use the forgot password flow." },
    },
  ],
});
```

`createOrGetDataset()` returns `{ datasetId }`, so you can pass that object directly as the dataset selector for append or experiment calls.

## Read Back Dataset State

Use `getDataset`, `getDatasetExamples`, and `getDatasetInfo` to inspect datasets after creation.

<section className="hidden" data-agent-context="source-map" aria-label="Source map">
  <h2>Source Map</h2>
  <ul>
    <li><code>src/datasets/createDataset.ts</code></li>
    <li><code>src/datasets/createOrGetDataset.ts</code></li>
    <li><code>src/datasets/appendDatasetExamples.ts</code></li>
    <li><code>src/datasets/getDataset.ts</code></li>
    <li><code>src/datasets/getDatasetExamples.ts</code></li>
    <li><code>src/datasets/getDatasetInfo.ts</code></li>
  </ul>
</section>
+++ package/docs/document-annotations.mdx
@@ -0,0 +1,208 @@
---
title: "Document Annotations"
description: "Log document-level annotations for RAG evaluation with @arizeai/phoenix-client"
---

Document annotations tag individual retrieved documents as relevant or irrelevant within a retriever span. They are the building block for RAG evaluation — once you annotate documents with relevance scores, Phoenix automatically computes retrieval metrics like **nDCG**, **Precision@K**, **MRR**, and **Hit Rate** across your project.

All functions are imported from `@arizeai/phoenix-client/spans`. See [Annotations](./annotations) for the shared annotation model and concepts.

<section className="hidden" data-agent-context="relevant-source-files" aria-label="Relevant source files">
  <h2>Relevant Source Files</h2>
  <ul>
    <li><code>src/spans/addDocumentAnnotation.ts</code> for the single-annotation API</li>
    <li><code>src/spans/logDocumentAnnotations.ts</code> for batch logging</li>
    <li><code>src/spans/types.ts</code> for the <code>DocumentAnnotation</code> interface</li>
  </ul>
</section>

## Why Document Annotations

When a retriever returns a ranked list of documents, you need to know:

- **Were the right documents retrieved?** (relevance)
- **Were they ranked in the right order?** (nDCG, MRR)
- **Was at least one relevant document returned?** (hit rate)
- **How many of the top-K were relevant?** (Precision@K)

Document annotations let you label each retrieved document with a relevance score. Phoenix then aggregates those scores into standard retrieval metrics — both per-span and across your entire project.

## How Document Annotations Work

Each document annotation targets a specific document by its **position** in the retriever span's output. The `documentPosition` is a 0-based index: if a retriever returns 5 documents, positions `0` through `4` are valid targets.

Document annotations share the same fields as span annotations (`spanId`, `name`, `annotatorKind`, `label`, `score`, `explanation`, `metadata`). The `documentPosition` tells Phoenix _which_ retrieved document the feedback applies to.

### Automatic Retrieval Metrics

<Note>
  Phoenix automatically computes **nDCG**, **Precision@K**, **MRR**, and **Hit Rate** from document annotations that have `annotatorKind: "LLM"` and a numeric `score`. Annotations with `annotatorKind: "HUMAN"` or `"CODE"` are stored but do not feed into the auto-computed retrieval metrics.
</Note>

If you want Phoenix to compute retrieval metrics for you, use `annotatorKind: "LLM"` when logging relevance scores. This is the typical pattern when running an LLM-as-judge relevance evaluator over your retrieval results.

## Score All Documents In A Retrieval

The most common pattern: after a retriever returns N documents, score each one for relevance. Use `logDocumentAnnotations` to send them in a single batch:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

// retrievedDocs comes from your evaluator — each has a relevanceScore
const annotations = retrievedDocs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: doc.relevanceScore,
  label: doc.relevanceScore > 0.7 ? "relevant" : "not-relevant",
}));

await logDocumentAnnotations({ documentAnnotations: annotations });
// Phoenix now auto-computes nDCG, Precision@K, MRR, and Hit Rate
// for this retriever span in the UI.
```

## Binary Relevance Labeling

The simplest relevance scheme: each document is either relevant (1) or not (0). This is the most common input for hit rate and nDCG:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

const annotations = retrievedDocs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: isRelevant(doc, userQuery) ? 1 : 0,
  label: isRelevant(doc, userQuery) ? "relevant" : "irrelevant",
}));

await logDocumentAnnotations({ documentAnnotations: annotations });
```

With binary scores:

- **Hit Rate** = 1 if any document has score 1, else 0
- **Precision@K** = fraction of top-K documents with score 1
- **MRR** = 1 / (rank of first document with score 1)
- **nDCG** = normalized discounted cumulative gain across the ranked list
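To make those definitions concrete, here is a small self-contained sketch of the four metrics over a binary score list. This is illustrative only: Phoenix computes these metrics server-side from your logged annotations.

```ts
// Illustrative implementations of the four retrieval metrics above, over a
// ranked list of binary relevance scores (index = documentPosition).

function hitRate(scores: number[]): number {
  return scores.some((s) => s === 1) ? 1 : 0;
}

function precisionAtK(scores: number[], k: number): number {
  const topK = scores.slice(0, k);
  return topK.filter((s) => s === 1).length / topK.length;
}

function mrr(scores: number[]): number {
  const firstHit = scores.findIndex((s) => s === 1);
  return firstHit === -1 ? 0 : 1 / (firstHit + 1);
}

// DCG discounts each score by log2(rank + 1); nDCG divides by the DCG of
// the ideal (descending) ordering, so a perfect ranking scores 1.
function ndcg(scores: number[]): number {
  const dcg = (xs: number[]) =>
    xs.reduce((sum, s, i) => sum + s / Math.log2(i + 2), 0);
  const ideal = dcg([...scores].sort((a, b) => b - a));
  return ideal === 0 ? 0 : dcg(scores) / ideal;
}

// Relevant documents at positions 1 and 3:
const scores = [0, 1, 0, 1];
hitRate(scores); // → 1
precisionAtK(scores, 2); // → 0.5
mrr(scores); // → 0.5 (first relevant document at rank 2)
ndcg(scores); // ≈ 0.651 (relevant docs ranked below irrelevant ones)
```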
## Graded Relevance

For finer-grained evaluation, use continuous scores (e.g. 0–1) instead of binary. This gives nDCG more signal about _how_ relevant each document is, not just whether it's relevant at all:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

// LLM judge returns a 0-1 relevance score per document
const annotations = retrievedDocs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: doc.relevanceScore, // e.g. 0.0, 0.3, 0.7, 1.0
  explanation: doc.relevanceReasoning,
  metadata: { model: "gpt-4o-mini" },
}));

await logDocumentAnnotations({ documentAnnotations: annotations });
```

## Add A Single Document Annotation

For one-off annotations — e.g. a human reviewer flagging a specific document:

```ts
import { addDocumentAnnotation } from "@arizeai/phoenix-client/spans";

await addDocumentAnnotation({
  documentAnnotation: {
    spanId: "retriever-span-id",
    documentPosition: 0,
    name: "relevance",
    annotatorKind: "LLM",
    score: 0.95,
    label: "relevant",
    explanation: "Document directly answers the user question.",
  },
});
```

## Multi-Dimensional Document Scoring

Score the same documents on multiple axes by using different annotation names. Each name creates a separate annotation series in the Phoenix UI:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

const relevanceAnnotations = docs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: doc.relevanceScore,
}));

const recencyAnnotations = docs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "recency",
  annotatorKind: "CODE" as const,
  score: isRecent(doc.publishDate) ? 1 : 0,
}));

await logDocumentAnnotations({
  documentAnnotations: [...relevanceAnnotations, ...recencyAnnotations],
});
```

## Re-Ranking Evaluation

Document annotations are useful for evaluating re-rankers. Annotate the retriever span (original order) and the re-ranker span (re-ranked order) with the same annotation name, then compare the two spans' metrics:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

// Score documents in the re-ranker's output order
const annotations = rerankedDocs.map((doc, position) => ({
  spanId: rerankerSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: doc.relevanceScore,
}));

await logDocumentAnnotations({ documentAnnotations: annotations });
// Compare nDCG between the retriever span and re-ranker span
// in the Phoenix UI to measure re-ranking effectiveness.
```

## Parameter Reference

### `DocumentAnnotation`

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `spanId` | `string` | Yes | The retriever span's OpenTelemetry ID |
| `documentPosition` | `number` | Yes | 0-based index of the document in retrieval results |
| `name` | `string` | Yes | Annotation name (e.g. `"relevance"`) |
| `annotatorKind` | `"HUMAN" \| "LLM" \| "CODE"` | No | Defaults to `"HUMAN"`. Use `"LLM"` for auto-computed retrieval metrics. |
| `label` | `string` | No* | Categorical label (e.g. `"relevant"`, `"irrelevant"`) |
| `score` | `number` | No* | Numeric relevance score (e.g. 0 or 1 for binary, 0–1 for graded) |
| `explanation` | `string` | No* | Free-text explanation |
| `metadata` | `Record<string, unknown>` | No | Arbitrary metadata |

\*At least one of `label`, `score`, or `explanation` is required.

Document annotations are unique by `(name, spanId, documentPosition)`. Unlike span annotations, the `identifier` field is not supported for document annotations.

<section className="hidden" data-agent-context="source-map" aria-label="Source map">
  <h2>Source Map</h2>
  <ul>
    <li><code>src/spans/addDocumentAnnotation.ts</code></li>
    <li><code>src/spans/logDocumentAnnotations.ts</code></li>
    <li><code>src/spans/types.ts</code></li>
    <li><code>src/types/annotations.ts</code></li>
  </ul>
</section>
+++ package/docs/experiments.mdx
@@ -0,0 +1,271 @@
---
title: "Experiments"
description: "Run experiments with @arizeai/phoenix-client"
---

The experiments module runs tasks over dataset examples, records experiment runs in Phoenix, and can evaluate each run with either plain experiment evaluators or `@arizeai/phoenix-evals` evaluators.

<section className="hidden" data-agent-context="relevant-source-files" aria-label="Relevant source files">
  <h2>Relevant Source Files</h2>
  <ul>
    <li><code>src/experiments/runExperiment.ts</code> for the task execution flow and return shape</li>
    <li><code>src/experiments/helpers/getExperimentEvaluators.ts</code> for evaluator normalization</li>
    <li><code>src/experiments/helpers/fromPhoenixLLMEvaluator.ts</code> for the phoenix-evals bridge</li>
    <li><code>src/experiments/getExperimentRuns.ts</code> for reading runs back after execution</li>
  </ul>
</section>

## Two Common Patterns

Use `asExperimentEvaluator()` when your evaluation logic is plain TypeScript. Use `@arizeai/phoenix-evals` evaluators directly when you want model-backed judging.

## Code-Based Example

If you just want to compare task output against a reference answer or apply deterministic checks, use `asExperimentEvaluator()`:

```ts
/* eslint-disable no-console */
import { createDataset } from "@arizeai/phoenix-client/datasets";
import {
  asExperimentEvaluator,
  runExperiment,
} from "@arizeai/phoenix-client/experiments";

async function main() {
  const { datasetId } = await createDataset({
    name: `simple-dataset-${Date.now()}`,
    description: "a simple dataset",
    examples: [
      {
        input: { name: "John" },
        output: { text: "Hello, John!" },
        metadata: {},
      },
      {
        input: { name: "Jane" },
        output: { text: "Hello, Jane!" },
        metadata: {},
      },
      {
        input: { name: "Bill" },
        output: { text: "Hello, Bill!" },
        metadata: {},
      },
    ],
  });

  const experiment = await runExperiment({
    dataset: { datasetId },
    task: async (example) => `hello ${example.input.name}`,
    evaluators: [
      asExperimentEvaluator({
        name: "matches",
        kind: "CODE",
        evaluate: async ({ output, expected }) => {
          const matches = output === expected?.text;
          return {
            label: matches ? "matches" : "does not match",
            score: matches ? 1 : 0,
            explanation: matches
              ? "output matches expected"
              : "output does not match expected",
            metadata: {},
          };
        },
      }),
      asExperimentEvaluator({
        name: "contains-hello",
        kind: "CODE",
        evaluate: async ({ output }) => {
          const matches =
            typeof output === "string" && output.includes("hello");
          return {
            label: matches ? "contains hello" : "does not contain hello",
            score: matches ? 1 : 0,
            explanation: matches
              ? "output contains hello"
              : "output does not contain hello",
            metadata: {},
          };
        },
      }),
    ],
  });

  console.table(experiment.runs);
  console.table(experiment.evaluationRuns);
}

main().catch(console.error);
```

This pattern is useful when:

- you already know the exact correctness rule
- you want fast, deterministic evaluation
- you do not want to call another model during evaluation

## Model-Backed Example

If you want a model-backed experiment with automatic tracing and an LLM-as-a-judge evaluator, this is the core pattern:

```ts
import { openai } from "@ai-sdk/openai";
import { createOrGetDataset } from "@arizeai/phoenix-client/datasets";
import { runExperiment } from "@arizeai/phoenix-client/experiments";
import type { ExperimentTask } from "@arizeai/phoenix-client/types/experiments";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { generateText } from "ai";

const model = openai("gpt-4o-mini");

const main = async () => {
  const answersQuestion = createClassificationEvaluator({
    name: "answersQuestion",
    model,
    promptTemplate:
      "Does the following answer the user's question: <question>{{input.question}}</question><answer>{{output}}</answer>",
    choices: {
      correct: 1,
      incorrect: 0,
    },
  });

  const dataset = await createOrGetDataset({
    name: "correctness-eval",
    description: "Evaluate the correctness of the model",
    examples: [
      {
        input: {
          question: "Is ArizeAI Phoenix Open-Source?",
          context: "ArizeAI Phoenix is Open-Source.",
        },
      },
      // ... more examples
    ],
  });

  const task: ExperimentTask = async (example) => {
    if (typeof example.input.question !== "string") {
      throw new Error("Invalid input: question must be a string");
    }
    if (typeof example.input.context !== "string") {
      throw new Error("Invalid input: context must be a string");
    }

    return generateText({
      model,
      experimental_telemetry: {
        isEnabled: true,
      },
      prompt: [
        {
          role: "system",
          content: `You answer questions based on this context: ${example.input.context}`,
        },
        {
          role: "user",
          content: example.input.question,
        },
      ],
    }).then((response) => {
      if (response.text) {
        return response.text;
      }
      throw new Error("Invalid response: text is required");
    });
  };

  const experiment = await runExperiment({
    experimentName: "answers-question-eval",
    experimentDescription:
      "Evaluate the ability of the model to answer questions based on the context",
    dataset,
    task,
    evaluators: [answersQuestion],
    repetitions: 3,
  });

  console.log(experiment.id);
  console.log(Object.values(experiment.runs).length);
  console.log(experiment.evaluationRuns?.length ?? 0);
};

main().catch(console.error);
```

## What This Example Shows

- `createOrGetDataset()` creates or reuses the dataset the experiment will run against
- `task` receives the full dataset example object
- `generateText()` emits traces that Phoenix can attach to the experiment when telemetry is enabled
- `createClassificationEvaluator()` from `@arizeai/phoenix-evals` can be passed directly to `runExperiment()`
- `runExperiment()` records both task runs and evaluation runs in Phoenix

## Task Inputs

`runExperiment()` calls your task with the full dataset example, not just `example.input`.

That means your task should usually read:

- `example.input` for the task inputs
- `example.output` for any reference answer
- `example.metadata` for additional context

In the example above, the task validates `example.input.question` and `example.input.context` before generating a response.
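A minimal standalone sketch of that contract (the `Example` type here is an assumed shape based on the fields listed above; no Phoenix client is involved):

```ts
// Standalone sketch: a task receives the full example object, not just
// example.input. The Example type below is an assumed shape based on the
// fields listed above.
type Example = {
  input: Record<string, unknown>;
  output?: Record<string, unknown>;
  metadata?: Record<string, unknown>;
};

const greetTask = async (example: Example): Promise<string> => {
  // Validate before use, as in the model-backed example above.
  const name = example.input.name;
  if (typeof name !== "string") {
    throw new Error("Invalid input: name must be a string");
  }
  return `hello ${name}`;
};
```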
## Evaluator Inputs

When an evaluator runs, it receives a normalized object with these fields:

| Field | Description |
|--------|-------------|
| `input` | The dataset example's `input` object |
| `output` | The task output for that run |
| `expected` | The dataset example's `output` object |
| `metadata` | The dataset example's `metadata` object |

This is why the `createClassificationEvaluator()` prompt can reference `{{input.question}}` and `{{output}}`.

For code-based evaluators created with `asExperimentEvaluator()`, those same fields are available inside `evaluate({ input, output, expected, metadata })`.
+ ## What `runExperiment()` Returns
235
+
236
+ The returned object includes the experiment metadata plus the task and evaluation results from the run.
237
+
238
+ - `experiment.id` is the experiment ID in Phoenix
239
+ - `experiment.projectName` is the Phoenix project that received the task traces
240
+ - `experiment.runs` is a map of run IDs to task run objects
241
+ - `experiment.evaluationRuns` contains evaluator results when evaluators are provided
242
+
243
+ ## Follow-Up Helpers
244
+
245
+ Use these exports for follow-up workflows:
246
+
247
+ - `createExperiment`
248
+ - `getExperiment`
249
+ - `getExperimentInfo`
250
+ - `getExperimentRuns`
251
+ - `listExperiments`
252
+ - `resumeExperiment`
253
+ - `resumeEvaluation`
254
+ - `deleteExperiment`
255
+
256
+ ## Tracing Behavior
257
+
258
+ `runExperiment()` can register a tracer provider for the task run so that task spans and evaluator spans show up in Phoenix during the experiment. This is why tasks that call the AI SDK can still emit traces to Phoenix when global tracing is enabled.
259
+
260
+ <section className="hidden" data-agent-context="source-map" aria-label="Source map">
261
+ <h2>Source Map</h2>
262
+ <ul>
263
+ <li><code>src/experiments/runExperiment.ts</code></li>
264
+ <li><code>src/experiments/createExperiment.ts</code></li>
265
+ <li><code>src/experiments/getExperiment.ts</code></li>
266
+ <li><code>src/experiments/getExperimentRuns.ts</code></li>
267
+ <li><code>src/experiments/helpers/getExperimentEvaluators.ts</code></li>
268
+ <li><code>src/experiments/helpers/fromPhoenixLLMEvaluator.ts</code></li>
269
+ <li><code>src/experiments/helpers/asExperimentEvaluator.ts</code></li>
270
+ </ul>
271
+ </section>