@arizeai/phoenix-client 6.5.4 → 6.5.5
- package/dist/esm/__generated__/api/v1.d.ts +244 -0
- package/dist/esm/__generated__/api/v1.d.ts.map +1 -1
- package/dist/esm/experiments/resumeEvaluation.d.ts.map +1 -1
- package/dist/esm/experiments/resumeEvaluation.js +179 -170
- package/dist/esm/experiments/resumeEvaluation.js.map +1 -1
- package/dist/esm/experiments/resumeExperiment.d.ts.map +1 -1
- package/dist/esm/experiments/resumeExperiment.js +201 -185
- package/dist/esm/experiments/resumeExperiment.js.map +1 -1
- package/dist/esm/experiments/runExperiment.d.ts.map +1 -1
- package/dist/esm/experiments/runExperiment.js +238 -207
- package/dist/esm/experiments/runExperiment.js.map +1 -1
- package/dist/esm/experiments/tracing.d.ts +10 -0
- package/dist/esm/experiments/tracing.d.ts.map +1 -0
- package/dist/esm/experiments/tracing.js +21 -0
- package/dist/esm/experiments/tracing.js.map +1 -0
- package/dist/esm/prompts/sdks/toSDK.d.ts +2 -2
- package/dist/esm/tsconfig.esm.tsbuildinfo +1 -1
- package/dist/esm/utils/formatPromptMessages.d.ts.map +1 -1
- package/dist/esm/utils/getPromptBySelector.d.ts.map +1 -1
- package/dist/src/__generated__/api/v1.d.ts +244 -0
- package/dist/src/__generated__/api/v1.d.ts.map +1 -1
- package/dist/src/experiments/resumeEvaluation.d.ts.map +1 -1
- package/dist/src/experiments/resumeEvaluation.js +192 -183
- package/dist/src/experiments/resumeEvaluation.js.map +1 -1
- package/dist/src/experiments/resumeExperiment.d.ts.map +1 -1
- package/dist/src/experiments/resumeExperiment.js +214 -198
- package/dist/src/experiments/resumeExperiment.js.map +1 -1
- package/dist/src/experiments/runExperiment.d.ts.map +1 -1
- package/dist/src/experiments/runExperiment.js +228 -197
- package/dist/src/experiments/runExperiment.js.map +1 -1
- package/dist/src/experiments/tracing.d.ts +10 -0
- package/dist/src/experiments/tracing.d.ts.map +1 -0
- package/dist/src/experiments/tracing.js +24 -0
- package/dist/src/experiments/tracing.js.map +1 -0
- package/dist/src/utils/formatPromptMessages.d.ts.map +1 -1
- package/dist/src/utils/getPromptBySelector.d.ts.map +1 -1
- package/dist/tsconfig.tsbuildinfo +1 -1
- package/docs/annotations.mdx +83 -0
- package/docs/datasets.mdx +77 -0
- package/docs/document-annotations.mdx +208 -0
- package/docs/experiments.mdx +271 -0
- package/docs/overview.mdx +176 -0
- package/docs/prompts.mdx +73 -0
- package/docs/session-annotations.mdx +158 -0
- package/docs/sessions.mdx +87 -0
- package/docs/span-annotations.mdx +283 -0
- package/docs/spans.mdx +76 -0
- package/docs/traces.mdx +63 -0
- package/package.json +10 -4
- package/src/__generated__/api/v1.ts +244 -0
- package/src/experiments/resumeEvaluation.ts +224 -206
- package/src/experiments/resumeExperiment.ts +237 -213
- package/src/experiments/runExperiment.ts +281 -243
- package/src/experiments/tracing.ts +30 -0

@@ -0,0 +1,83 @@
---
title: "Annotations"
description: "Attach structured feedback to spans, documents, and sessions with @arizeai/phoenix-client"
---

Annotations are structured feedback records — labels, scores, and explanations — attached to observability artifacts in Phoenix. They are the primary mechanism for recording quality signals, whether those signals come from end-users, LLM judges, or programmatic checks.

<section className="hidden" data-agent-context="relevant-source-files" aria-label="Relevant source files">
<h2>Relevant Source Files</h2>
<ul>
<li><code>src/types/annotations.ts</code> for the shared <code>Annotation</code> base interface</li>
<li><code>src/spans/types.ts</code> for <code>SpanAnnotation</code> and <code>DocumentAnnotation</code></li>
<li><code>src/sessions/types.ts</code> for <code>SessionAnnotation</code></li>
</ul>
</section>

## Why Annotations Matter

Annotations close the feedback loop on your LLM application:

- **Human feedback** — Thumbs-up/down from end-users, QA reviews from teammates, or labeling tasks for dataset curation.
- **LLM-as-judge evaluations** — Automated quality scoring using a second LLM (groundedness, helpfulness, safety).
- **Code-based metrics** — Programmatic checks like regex validation, threshold comparisons, or retrieval precision calculations.

Once attached, annotations appear in the Phoenix UI alongside traces and can be used to filter spans, build datasets, and track improvements during experimentation.

## Annotation Targets

Phoenix supports three annotation targets, each focused on a different level of your application:

- **[Span Annotations](./span-annotations)** — Feedback on individual traced operations: an LLM call, a tool invocation, a retrieval step. The most common annotation target.
- **[Document Annotations](./document-annotations)** — Feedback on specific retrieved documents within a retriever span, indexed by position. Essential for evaluating RAG pipeline quality.
- **[Session Annotations](./session-annotations)** — Feedback on multi-turn conversations or threads as a whole. Use for conversation-level quality signals like resolution rate or customer satisfaction.

## Annotator Kinds

Every annotation records _who or what_ produced the feedback:

| Kind | Default | Use case |
|------|---------|----------|
| `"HUMAN"` | Yes | Manual review, end-user thumbs-up/down, labeling tasks |
| `"LLM"` | — | LLM-as-judge evaluations, automated quality scoring |
| `"CODE"` | — | Programmatic rules, regex checks, threshold-based metrics |

## Shared Annotation Shape

All annotation types share this base interface:

```ts
interface Annotation {
  name: string; // What is being measured (e.g. "groundedness")
  label?: string; // Categorical result (e.g. "grounded")
  score?: number; // Numeric result (e.g. 0.95)
  explanation?: string; // Free-text justification
  identifier?: string; // For idempotent upserts
  metadata?: Record<string, unknown>; // Arbitrary context
}
```

At least one of `label`, `score`, or `explanation` must be provided.

Each target adds its own identifier field — `spanId` for spans, `spanId` + `documentPosition` for documents, and `sessionId` for sessions.
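The "at least one of `label`, `score`, or `explanation`" rule can be checked client-side before sending. A minimal sketch (the `hasResult` helper is illustrative, not part of the client API):

```typescript
// Mirrors the base interface above; redefined here so the sketch is
// self-contained. `hasResult` is an illustrative helper, not a client API.
interface Annotation {
  name: string;
  label?: string;
  score?: number;
  explanation?: string;
  identifier?: string;
  metadata?: Record<string, unknown>;
}

// True when at least one of label, score, or explanation is present.
// Checks against undefined so that score: 0 still counts as a result.
function hasResult(annotation: Annotation): boolean {
  return (
    annotation.label !== undefined ||
    annotation.score !== undefined ||
    annotation.explanation !== undefined
  );
}

hasResult({ name: "groundedness", score: 0.95 }); // true
hasResult({ name: "groundedness" }); // false
```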
## Sync vs. Async

All annotation write functions accept an optional `sync` parameter:

- **`sync: false`** (default) — The server acknowledges receipt and processes the annotation asynchronously. Higher throughput, but the response does not include the annotation ID.
- **`sync: true`** — The server processes the annotation synchronously and returns its ID. Useful in tests or workflows that need to read the annotation back immediately.

<section className="hidden" data-agent-context="source-map" aria-label="Source map">
<h2>Source Map</h2>
<ul>
<li><code>src/types/annotations.ts</code></li>
<li><code>src/spans/addSpanAnnotation.ts</code></li>
<li><code>src/spans/logSpanAnnotations.ts</code></li>
<li><code>src/spans/addDocumentAnnotation.ts</code></li>
<li><code>src/spans/logDocumentAnnotations.ts</code></li>
<li><code>src/spans/getSpanAnnotations.ts</code></li>
<li><code>src/sessions/addSessionAnnotation.ts</code></li>
<li><code>src/sessions/logSessionAnnotations.ts</code></li>
</ul>
</section>

@@ -0,0 +1,77 @@
---
title: "Datasets"
description: "Create and inspect datasets with @arizeai/phoenix-client"
---

Datasets are the foundation for experiment runs. The dataset helpers cover creation, idempotent creation, record inspection, and example appends.

<section className="hidden" data-agent-context="relevant-source-files" aria-label="Relevant source files">
<h2>Relevant Source Files</h2>
<ul>
<li>
<code>src/datasets/createOrGetDataset.ts</code> for the exact return
shape of the idempotent helper
</li>
</ul>
</section>

## Create A Dataset

```ts
import { createDataset } from "@arizeai/phoenix-client/datasets";

const { datasetId } = await createDataset({
  name: "support-eval",
  description: "Support questions with expected answers",
  examples: [
    {
      input: { question: "Where is my order?" },
      output: { answer: "Use the tracking page in your account." },
      metadata: { channel: "chat" },
    },
  ],
});
```

## Reuse Or Append

```ts
import {
  appendDatasetExamples,
  createOrGetDataset,
} from "@arizeai/phoenix-client/datasets";

const dataset = await createOrGetDataset({
  name: "support-eval",
  description: "Support questions with expected answers",
  examples: [],
});

await appendDatasetExamples({
  dataset,
  examples: [
    {
      input: { question: "How do I reset my password?" },
      output: { answer: "Use the forgot password flow." },
    },
  ],
});
```

`createOrGetDataset()` returns `{ datasetId }`, so you can pass that object directly as the dataset selector for append or experiment calls.

## Read Back Dataset State

Use `getDataset`, `getDatasetExamples`, and `getDatasetInfo` to inspect datasets after creation.

<section className="hidden" data-agent-context="source-map" aria-label="Source map">
<h2>Source Map</h2>
<ul>
<li><code>src/datasets/createDataset.ts</code></li>
<li><code>src/datasets/createOrGetDataset.ts</code></li>
<li><code>src/datasets/appendDatasetExamples.ts</code></li>
<li><code>src/datasets/getDataset.ts</code></li>
<li><code>src/datasets/getDatasetExamples.ts</code></li>
<li><code>src/datasets/getDatasetInfo.ts</code></li>
</ul>
</section>

@@ -0,0 +1,208 @@
---
title: "Document Annotations"
description: "Log document-level annotations for RAG evaluation with @arizeai/phoenix-client"
---

Document annotations tag individual retrieved documents as relevant or irrelevant within a retriever span. They are the building block for RAG evaluation — once you annotate documents with relevance scores, Phoenix automatically computes retrieval metrics like **nDCG**, **Precision@K**, **MRR**, and **Hit Rate** across your project.

All functions are imported from `@arizeai/phoenix-client/spans`. See [Annotations](./annotations) for the shared annotation model and concepts.

<section className="hidden" data-agent-context="relevant-source-files" aria-label="Relevant source files">
<h2>Relevant Source Files</h2>
<ul>
<li><code>src/spans/addDocumentAnnotation.ts</code> for the single-annotation API</li>
<li><code>src/spans/logDocumentAnnotations.ts</code> for batch logging</li>
<li><code>src/spans/types.ts</code> for the <code>DocumentAnnotation</code> interface</li>
</ul>
</section>

## Why Document Annotations

When a retriever returns a ranked list of documents, you need to know:

- **Were the right documents retrieved?** (relevance)
- **Were they ranked in the right order?** (nDCG, MRR)
- **Was at least one relevant document returned?** (hit rate)
- **How many of the top-K were relevant?** (Precision@K)

Document annotations let you label each retrieved document with a relevance score. Phoenix then aggregates those scores into standard retrieval metrics — both per-span and across your entire project.

## How Document Annotations Work

Each document annotation targets a specific document by its **position** in the retriever span's output. The `documentPosition` is a 0-based index: if a retriever returns 5 documents, positions `0` through `4` are valid targets.

Document annotations share the same fields as span annotations (`spanId`, `name`, `annotatorKind`, `label`, `score`, `explanation`, `metadata`). The `documentPosition` tells Phoenix _which_ retrieved document the feedback applies to.

### Automatic Retrieval Metrics

<Note>
Phoenix automatically computes **nDCG**, **Precision@K**, **MRR**, and **Hit Rate** from document annotations that have `annotatorKind: "LLM"` and a numeric `score`. Annotations with `annotatorKind: "HUMAN"` or `"CODE"` are stored but do not feed into the auto-computed retrieval metrics.
</Note>

If you want Phoenix to compute retrieval metrics for you, use `annotatorKind: "LLM"` when logging relevance scores. This is the typical pattern when running an LLM-as-judge relevance evaluator over your retrieval results.

## Score All Documents In A Retrieval

The most common pattern: after a retriever returns N documents, score each one for relevance. Use `logDocumentAnnotations` to send them in a single batch:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

// retrievedDocs comes from your evaluator — each has a relevanceScore
const annotations = retrievedDocs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: doc.relevanceScore,
  label: doc.relevanceScore > 0.7 ? "relevant" : "not-relevant",
}));

await logDocumentAnnotations({ documentAnnotations: annotations });
// Phoenix now auto-computes nDCG, Precision@K, MRR, and Hit Rate
// for this retriever span in the UI.
```

## Binary Relevance Labeling

The simplest relevance scheme: each document is either relevant (1) or not (0). This is the most common input for hit rate and nDCG:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

const annotations = retrievedDocs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: isRelevant(doc, userQuery) ? 1 : 0,
  label: isRelevant(doc, userQuery) ? "relevant" : "irrelevant",
}));

await logDocumentAnnotations({ documentAnnotations: annotations });
```

With binary scores:
- **Hit Rate** = 1 if any document has score 1, else 0
- **Precision@K** = fraction of top-K documents with score 1
- **MRR** = 1 / (rank of first document with score 1)
- **nDCG** = normalized discounted cumulative gain across the ranked list
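The four bullet formulas above can be sketched in a few lines of TypeScript. This is illustrative only (Phoenix computes the metrics server-side from the logged annotations); `scores[i]` is the binary relevance of the document at 0-based position `i`, in ranked order:

```typescript
// Illustrative only: Phoenix computes these metrics server-side.
// scores[i] is the binary relevance (0 or 1) of the document at
// 0-based position i, in ranked order.
function hitRate(scores: number[]): number {
  return scores.some((s) => s === 1) ? 1 : 0;
}

function precisionAtK(scores: number[], k: number): number {
  const topK = scores.slice(0, k);
  return topK.filter((s) => s === 1).length / topK.length;
}

function mrr(scores: number[]): number {
  const firstHit = scores.findIndex((s) => s === 1);
  return firstHit === -1 ? 0 : 1 / (firstHit + 1);
}

function ndcg(scores: number[]): number {
  // Discounted cumulative gain: each score is discounted by log2 of
  // its 1-based rank + 1, then normalized against the ideal ordering.
  const dcg = (xs: number[]) =>
    xs.reduce((sum, s, i) => sum + s / Math.log2(i + 2), 0);
  const ideal = dcg([...scores].sort((a, b) => b - a));
  return ideal === 0 ? 0 : dcg(scores) / ideal;
}

// Relevant documents at positions 1 and 3 of a 4-document retrieval:
const scores = [0, 1, 0, 1];
hitRate(scores); // 1
precisionAtK(scores, 2); // 0.5
mrr(scores); // 0.5
```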
## Graded Relevance

For finer-grained evaluation, use continuous scores (e.g. 0–1) instead of binary. This gives nDCG more signal about _how_ relevant each document is, not just whether it's relevant at all:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

// LLM judge returns a 0-1 relevance score per document
const annotations = retrievedDocs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: doc.relevanceScore, // e.g. 0.0, 0.3, 0.7, 1.0
  explanation: doc.relevanceReasoning,
  metadata: { model: "gpt-4o-mini" },
}));

await logDocumentAnnotations({ documentAnnotations: annotations });
```

## Add A Single Document Annotation

For one-off annotations — e.g. a human reviewer flagging a specific document:

```ts
import { addDocumentAnnotation } from "@arizeai/phoenix-client/spans";

await addDocumentAnnotation({
  documentAnnotation: {
    spanId: "retriever-span-id",
    documentPosition: 0,
    name: "relevance",
    annotatorKind: "LLM",
    score: 0.95,
    label: "relevant",
    explanation: "Document directly answers the user question.",
  },
});
```

## Multi-Dimensional Document Scoring

Score the same documents on multiple axes by using different annotation names. Each name creates a separate annotation series in the Phoenix UI:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

const relevanceAnnotations = docs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: doc.relevanceScore,
}));

const recencyAnnotations = docs.map((doc, position) => ({
  spanId: retrieverSpanId,
  documentPosition: position,
  name: "recency",
  annotatorKind: "CODE" as const,
  score: isRecent(doc.publishDate) ? 1 : 0,
}));

await logDocumentAnnotations({
  documentAnnotations: [...relevanceAnnotations, ...recencyAnnotations],
});
```

## Re-Ranking Evaluation

Document annotations are useful for evaluating re-rankers. Annotate the original retriever span and the re-ranker span separately, then compare the quality of the original order against the re-ranked order:

```ts
import { logDocumentAnnotations } from "@arizeai/phoenix-client/spans";

// Score documents in the re-ranker's output order
const annotations = rerankedDocs.map((doc, position) => ({
  spanId: rerankerSpanId,
  documentPosition: position,
  name: "relevance",
  annotatorKind: "LLM" as const,
  score: doc.relevanceScore,
}));

await logDocumentAnnotations({ documentAnnotations: annotations });
// Compare nDCG between the retriever span and re-ranker span
// in the Phoenix UI to measure re-ranking effectiveness.
```

## Parameter Reference

### `DocumentAnnotation`

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `spanId` | `string` | Yes | The retriever span's OpenTelemetry ID |
| `documentPosition` | `number` | Yes | 0-based index of the document in retrieval results |
| `name` | `string` | Yes | Annotation name (e.g. `"relevance"`) |
| `annotatorKind` | `"HUMAN" \| "LLM" \| "CODE"` | No | Defaults to `"HUMAN"`. Use `"LLM"` for auto-computed retrieval metrics. |
| `label` | `string` | No* | Categorical label (e.g. `"relevant"`, `"irrelevant"`) |
| `score` | `number` | No* | Numeric relevance score (e.g. 0 or 1 for binary, 0–1 for graded) |
| `explanation` | `string` | No* | Free-text explanation |
| `metadata` | `Record<string, unknown>` | No | Arbitrary metadata |

\*At least one of `label`, `score`, or `explanation` is required.

Document annotations are unique by `(name, spanId, documentPosition)`. Unlike span annotations, the `identifier` field is not supported for document annotations.

<section className="hidden" data-agent-context="source-map" aria-label="Source map">
<h2>Source Map</h2>
<ul>
<li><code>src/spans/addDocumentAnnotation.ts</code></li>
<li><code>src/spans/logDocumentAnnotations.ts</code></li>
<li><code>src/spans/types.ts</code></li>
<li><code>src/types/annotations.ts</code></li>
</ul>
</section>

@@ -0,0 +1,271 @@
---
title: "Experiments"
description: "Run experiments with @arizeai/phoenix-client"
---

The experiments module runs tasks over dataset examples, records experiment runs in Phoenix, and can evaluate each run with either plain experiment evaluators or `@arizeai/phoenix-evals` evaluators.

<section className="hidden" data-agent-context="relevant-source-files" aria-label="Relevant source files">
<h2>Relevant Source Files</h2>
<ul>
<li><code>src/experiments/runExperiment.ts</code> for the task execution flow and return shape</li>
<li><code>src/experiments/helpers/getExperimentEvaluators.ts</code> for evaluator normalization</li>
<li><code>src/experiments/helpers/fromPhoenixLLMEvaluator.ts</code> for the phoenix-evals bridge</li>
<li><code>src/experiments/getExperimentRuns.ts</code> for reading runs back after execution</li>
</ul>
</section>

## Two Common Patterns

Use `asExperimentEvaluator()` when your evaluation logic is plain TypeScript.

Use `@arizeai/phoenix-evals` evaluators directly when you want model-backed judging.

## Code-Based Example

If you just want to compare task output against a reference answer or apply deterministic checks, use `asExperimentEvaluator()`:

```ts
/* eslint-disable no-console */
import { createDataset } from "@arizeai/phoenix-client/datasets";
import {
  asExperimentEvaluator,
  runExperiment,
} from "@arizeai/phoenix-client/experiments";

async function main() {
  const { datasetId } = await createDataset({
    name: `simple-dataset-${Date.now()}`,
    description: "a simple dataset",
    examples: [
      {
        input: { name: "John" },
        output: { text: "Hello, John!" },
        metadata: {},
      },
      {
        input: { name: "Jane" },
        output: { text: "Hello, Jane!" },
        metadata: {},
      },
      {
        input: { name: "Bill" },
        output: { text: "Hello, Bill!" },
        metadata: {},
      },
    ],
  });

  const experiment = await runExperiment({
    dataset: { datasetId },
    task: async (example) => `hello ${example.input.name}`,
    evaluators: [
      asExperimentEvaluator({
        name: "matches",
        kind: "CODE",
        evaluate: async ({ output, expected }) => {
          const matches = output === expected?.text;
          return {
            label: matches ? "matches" : "does not match",
            score: matches ? 1 : 0,
            explanation: matches
              ? "output matches expected"
              : "output does not match expected",
            metadata: {},
          };
        },
      }),
      asExperimentEvaluator({
        name: "contains-hello",
        kind: "CODE",
        evaluate: async ({ output }) => {
          const matches =
            typeof output === "string" && output.includes("hello");
          return {
            label: matches ? "contains hello" : "does not contain hello",
            score: matches ? 1 : 0,
            explanation: matches
              ? "output contains hello"
              : "output does not contain hello",
            metadata: {},
          };
        },
      }),
    ],
  });

  console.table(experiment.runs);
  console.table(experiment.evaluationRuns);
}

main().catch(console.error);
```

This pattern is useful when:

- you already know the exact correctness rule
- you want fast, deterministic evaluation
- you do not want to call another model during evaluation

## Model-Backed Example

If you want a model-backed experiment with automatic tracing and an LLM-as-a-judge evaluator, this is the core pattern:

```ts
import { openai } from "@ai-sdk/openai";
import { createOrGetDataset } from "@arizeai/phoenix-client/datasets";
import { runExperiment } from "@arizeai/phoenix-client/experiments";
import type { ExperimentTask } from "@arizeai/phoenix-client/types/experiments";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { generateText } from "ai";

const model = openai("gpt-4o-mini");

const main = async () => {
  const answersQuestion = createClassificationEvaluator({
    name: "answersQuestion",
    model,
    promptTemplate:
      "Does the following answer the user's question: <question>{{input.question}}</question><answer>{{output}}</answer>",
    choices: {
      correct: 1,
      incorrect: 0,
    },
  });

  const dataset = await createOrGetDataset({
    name: "correctness-eval",
    description: "Evaluate the correctness of the model",
    examples: [
      {
        input: {
          question: "Is ArizeAI Phoenix Open-Source?",
          context: "ArizeAI Phoenix is Open-Source.",
        },
      },
      // ... more examples
    ],
  });

  const task: ExperimentTask = async (example) => {
    if (typeof example.input.question !== "string") {
      throw new Error("Invalid input: question must be a string");
    }
    if (typeof example.input.context !== "string") {
      throw new Error("Invalid input: context must be a string");
    }

    return generateText({
      model,
      experimental_telemetry: {
        isEnabled: true,
      },
      prompt: [
        {
          role: "system",
          content: `You answer questions based on this context: ${example.input.context}`,
        },
        {
          role: "user",
          content: example.input.question,
        },
      ],
    }).then((response) => {
      if (response.text) {
        return response.text;
      }
      throw new Error("Invalid response: text is required");
    });
  };

  const experiment = await runExperiment({
    experimentName: "answers-question-eval",
    experimentDescription:
      "Evaluate the ability of the model to answer questions based on the context",
    dataset,
    task,
    evaluators: [answersQuestion],
    repetitions: 3,
  });

  console.log(experiment.id);
  console.log(Object.values(experiment.runs).length);
  console.log(experiment.evaluationRuns?.length ?? 0);
};

main().catch(console.error);
```

## What This Example Shows

- `createOrGetDataset()` creates or reuses the dataset the experiment will run against
- `task` receives the full dataset example object
- `generateText()` emits traces that Phoenix can attach to the experiment when telemetry is enabled
- `createClassificationEvaluator()` from `@arizeai/phoenix-evals` can be passed directly to `runExperiment()`
- `runExperiment()` records both task runs and evaluation runs in Phoenix

## Task Inputs

`runExperiment()` calls your task with the full dataset example, not just `example.input`.

That means your task should usually read:

- `example.input` for the task inputs
- `example.output` for any reference answer
- `example.metadata` for additional context

In the example above, the task validates `example.input.question` and `example.input.context` before generating a response.

## Evaluator Inputs

When an evaluator runs, it receives a normalized object with these fields:

| Field | Description |
|-------|-------------|
| `input` | The dataset example's `input` object |
| `output` | The task output for that run |
| `expected` | The dataset example's `output` object |
| `metadata` | The dataset example's `metadata` object |

This is why the `createClassificationEvaluator()` prompt can reference `{{input.question}}` and `{{output}}`.
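To see how those variable paths resolve, here is a toy substitution sketch. It is not the actual phoenix-evals template engine, just an illustration of how `{{input.question}}` and `{{output}}` map onto the normalized evaluator fields:

```typescript
// Illustrative only: a toy renderer showing how template variables map
// onto the evaluator's { input, output } fields. phoenix-evals uses its
// own template engine; this just demonstrates the variable paths.
function renderTemplate(
  template: string,
  fields: Record<string, unknown>
): string {
  // Replace each {{dotted.path}} with the value found by walking the
  // path through the fields object.
  return template.replace(/\{\{([\w.]+)\}\}/g, (_, path: string) =>
    String(
      path
        .split(".")
        .reduce<unknown>(
          (obj, key) => (obj as Record<string, unknown> | undefined)?.[key],
          fields
        )
    )
  );
}

renderTemplate(
  "Question: {{input.question}} Answer: {{output}}",
  { input: { question: "Is Phoenix open-source?" }, output: "Yes." }
);
// "Question: Is Phoenix open-source? Answer: Yes."
```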
For code-based evaluators created with `asExperimentEvaluator()`, those same fields are available inside `evaluate({ input, output, expected, metadata })`.
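Because those fields are plain data, an evaluator's `evaluate` function can be exercised directly before wiring it into `runExperiment()`. A sketch reusing the "matches" logic from the code-based example (the `EvaluatorArgs` type here is an assumption for the sketch, not the client's exported type):

```typescript
// Illustrative sketch: the normalized { input, output, expected, metadata }
// shape that runExperiment() passes to evaluators, exercised directly.
// EvaluatorArgs is an assumption for this sketch, not the client's type.
interface EvaluatorArgs {
  input: Record<string, unknown>;
  output: unknown;
  expected?: Record<string, unknown>;
  metadata?: Record<string, unknown>;
}

const exactMatch = async ({ output, expected }: EvaluatorArgs) => {
  const matches = output === expected?.text;
  return {
    label: matches ? "matches" : "does not match",
    score: matches ? 1 : 0,
  };
};

// Call it the way runExperiment() would for one run:
exactMatch({
  input: { name: "John" },
  output: "Hello, John!",
  expected: { text: "Hello, John!" },
}).then((result) => {
  console.log(result.label); // "matches"
});
```

Testing evaluators this way keeps the feedback loop fast, since no dataset or Phoenix server is involved.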
## What `runExperiment()` Returns

The returned object includes the experiment metadata plus the task and evaluation results from the run.

- `experiment.id` is the experiment ID in Phoenix
- `experiment.projectName` is the Phoenix project that received the task traces
- `experiment.runs` is a map of run IDs to task run objects
- `experiment.evaluationRuns` contains evaluator results when evaluators are provided

## Follow-Up Helpers

Use these exports for follow-up workflows:

- `createExperiment`
- `getExperiment`
- `getExperimentInfo`
- `getExperimentRuns`
- `listExperiments`
- `resumeExperiment`
- `resumeEvaluation`
- `deleteExperiment`

## Tracing Behavior

`runExperiment()` can register a tracer provider for the task run so that task spans and evaluator spans show up in Phoenix during the experiment. This is why tasks that call the AI SDK can still emit traces to Phoenix when global tracing is enabled.

<section className="hidden" data-agent-context="source-map" aria-label="Source map">
<h2>Source Map</h2>
<ul>
<li><code>src/experiments/runExperiment.ts</code></li>
<li><code>src/experiments/createExperiment.ts</code></li>
<li><code>src/experiments/getExperiment.ts</code></li>
<li><code>src/experiments/getExperimentRuns.ts</code></li>
<li><code>src/experiments/helpers/getExperimentEvaluators.ts</code></li>
<li><code>src/experiments/helpers/fromPhoenixLLMEvaluator.ts</code></li>
<li><code>src/experiments/helpers/asExperimentEvaluator.ts</code></li>
</ul>
</section>