bir-sdk 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- bir_sdk-0.1.0/LICENSE +100 -0
- bir_sdk-0.1.0/PKG-INFO +584 -0
- bir_sdk-0.1.0/README.md +558 -0
- bir_sdk-0.1.0/pyproject.toml +49 -0
- bir_sdk-0.1.0/setup.cfg +4 -0
- bir_sdk-0.1.0/src/bir/__init__.py +47 -0
- bir_sdk-0.1.0/src/bir/_sdk.py +1688 -0
- bir_sdk-0.1.0/src/bir/evals.py +1454 -0
- bir_sdk-0.1.0/src/bir/integrations/__init__.py +21 -0
- bir_sdk-0.1.0/src/bir/integrations/_common.py +47 -0
- bir_sdk-0.1.0/src/bir/integrations/anthropic.py +92 -0
- bir_sdk-0.1.0/src/bir/integrations/cohere.py +87 -0
- bir_sdk-0.1.0/src/bir/integrations/google.py +84 -0
- bir_sdk-0.1.0/src/bir/integrations/langchain.py +429 -0
- bir_sdk-0.1.0/src/bir/integrations/litellm.py +108 -0
- bir_sdk-0.1.0/src/bir/integrations/llamaindex.py +374 -0
- bir_sdk-0.1.0/src/bir/integrations/mistral.py +87 -0
- bir_sdk-0.1.0/src/bir/integrations/openai.py +167 -0
- bir_sdk-0.1.0/src/bir_sdk.egg-info/PKG-INFO +584 -0
- bir_sdk-0.1.0/src/bir_sdk.egg-info/SOURCES.txt +33 -0
- bir_sdk-0.1.0/src/bir_sdk.egg-info/dependency_links.txt +1 -0
- bir_sdk-0.1.0/src/bir_sdk.egg-info/requires.txt +3 -0
- bir_sdk-0.1.0/src/bir_sdk.egg-info/top_level.txt +1 -0
- bir_sdk-0.1.0/tests/test_anthropic_integration.py +187 -0
- bir_sdk-0.1.0/tests/test_cohere_integration.py +187 -0
- bir_sdk-0.1.0/tests/test_evals.py +1285 -0
- bir_sdk-0.1.0/tests/test_examples.py +138 -0
- bir_sdk-0.1.0/tests/test_google_integration.py +199 -0
- bir_sdk-0.1.0/tests/test_langchain_integration.py +212 -0
- bir_sdk-0.1.0/tests/test_litellm_integration.py +216 -0
- bir_sdk-0.1.0/tests/test_llamaindex_integration.py +208 -0
- bir_sdk-0.1.0/tests/test_mistral_integration.py +189 -0
- bir_sdk-0.1.0/tests/test_openai_integration.py +277 -0
- bir_sdk-0.1.0/tests/test_redaction_parity.py +41 -0
- bir_sdk-0.1.0/tests/test_sdk.py +2369 -0
bir_sdk-0.1.0/LICENSE
ADDED
@@ -0,0 +1,100 @@
# Functional Source License, Version 1.1, ALv2 Future License

## Abbreviation

FSL-1.1-ALv2

## Notice

Copyright 2026 mskayacioglu

## Terms and Conditions

### Licensor ("We")

The party offering the Software under these Terms and Conditions.

### The Software

The "Software" is each version of the software that we make available under
these Terms and Conditions, as indicated by our inclusion of these Terms and
Conditions with the Software.

### License Grant

Subject to your compliance with this License Grant and the Patents,
Redistribution and Trademark clauses below, we hereby grant you the right to
use, copy, modify, create derivative works, publicly perform, publicly display
and redistribute the Software for any Permitted Purpose identified below.

### Permitted Purpose

A Permitted Purpose is any purpose other than a Competing Use. A Competing Use
means making the Software available to others in a commercial product or service
that:

1. substitutes for the Software;
2. substitutes for any other product or service we offer using the Software that
   exists as of the date we make the Software available; or
3. offers the same or substantially similar functionality as the Software.

Permitted Purposes specifically include using the Software:

1. for your internal use and access;
2. for non-commercial education;
3. for non-commercial research; and
4. in connection with professional services that you provide to a licensee using
   the Software in accordance with these Terms and Conditions.

### Patents

To the extent your use for a Permitted Purpose would necessarily infringe our
patents, the license grant above includes a license under our patents. If you
make a claim against any party that the Software infringes or contributes to the
infringement of any patent, then your patent license to the Software ends
immediately.

### Redistribution

The Terms and Conditions apply to all copies, modifications and derivatives of
the Software.

If you redistribute any copies, modifications or derivatives of the Software,
you must include a copy of or a link to these Terms and Conditions and not
remove any copyright notices provided in or with the Software.

### Disclaimer

THE SOFTWARE IS PROVIDED "AS IS" AND WITHOUT WARRANTIES OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR
PURPOSE, MERCHANTABILITY, TITLE OR NON-INFRINGEMENT.

IN NO EVENT WILL WE HAVE ANY LIABILITY TO YOU ARISING OUT OF OR RELATED TO THE
SOFTWARE, INCLUDING INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES, EVEN
IF WE HAVE BEEN INFORMED OF THEIR POSSIBILITY IN ADVANCE.

### Trademarks

Except for displaying the License Details and identifying us as the origin of
the Software, you have no right under these Terms and Conditions to use our
trademarks, trade names, service marks or product names.

## Grant of Future License

We hereby irrevocably grant you an additional license to use the Software under
the Apache License, Version 2.0 that is effective on the second anniversary of
the date we make the Software available. On or after that date, you may use the
Software under the Apache License, Version 2.0, in which case the following
will apply:

Licensed under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License.

You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
bir_sdk-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,584 @@
Metadata-Version: 2.4
Name: bir-sdk
Version: 0.1.0
Summary: Minimal local tracing SDK for LLM applications.
Author: mskayacioglu
Project-URL: Homepage, https://github.com/mskayacioglu/bir
Project-URL: Documentation, https://github.com/mskayacioglu/bir/tree/main/docs
Project-URL: Source, https://github.com/mskayacioglu/bir
Project-URL: Issues, https://github.com/mskayacioglu/bir/issues
Keywords: ai,evals,llm,observability,tracing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Dynamic: license-file

# Bir Python SDK

Minimal local tracing SDK for Python LLM applications.

Bir records traces, spans, generations, tool calls, and scores to local JSONL
without requiring a server. Start locally, then send events to the Bir FastAPI
server when you want to inspect them in the dashboard.

## Installation

After the first package release:

```bash
python -m pip install bir-sdk
```

The distribution is published on PyPI as `bir-sdk`; the import name is `bir`
(e.g. `from bir import observe`).

For local development from this repository:

```bash
python -m pip install -e ".[dev]"
```

## Quickstart

```python
from bir import generation, observe, retrieval, score, span


@observe()
def answer_question(question: str) -> str:
    with span("retrieve_context"):
        with retrieval("search_docs", query=question) as result:
            result.add_document(id="doc-1", text="local context")
        documents = ["local context"]

    with generation("local.llm", model="demo-model") as gen:
        response = f"{documents[0]}: {question}"
        gen.set_output(response)
        gen.set_usage(input_tokens=12, output_tokens=24)
        gen.set_cost(input_cost=0.000012, output_cost=0.000048)

    score("helpfulness", 0.82)
    return response
```

Trace, span, tool call, generation, and score events are written as JSONL to:

```text
.bir/traces.jsonl
```

Writes to the local trace file are serialized within the SDK process so
multi-threaded sync applications keep the JSONL file line-delimited and
parseable.
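
The serialized-write guarantee described above can be pictured with a process-wide lock around each append. This is a minimal sketch under stated assumptions, not Bir's actual implementation: `append_event`, `write_from_threads`, and the module-level lock are hypothetical names, and only the standard library is used.

```python
import json
import tempfile
import threading
from pathlib import Path

_write_lock = threading.Lock()  # hypothetical lock shared by all writers in the process


def append_event(path: Path, event: dict) -> None:
    """Append one event as a single JSON line while holding the lock."""
    line = json.dumps(event, separators=(",", ":")) + "\n"
    with _write_lock:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as handle:
            handle.write(line)


def write_from_threads(path: Path, workers: int = 8, per_worker: int = 50) -> int:
    """Write events from several threads and return how many lines parse back."""

    def work(worker_id: int) -> None:
        for index in range(per_worker):
            append_event(path, {"worker": worker_id, "index": index})

    threads = [threading.Thread(target=work, args=(n,)) for n in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    with path.open(encoding="utf-8") as handle:
        # json.loads raises if any line was interleaved or truncated.
        return len([json.loads(raw) for raw in handle])
```

Because each line is built in memory first and written under the lock, no two threads can interleave partial lines. Writers in separate processes are outside what this sketch covers.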

## Manual Trace Contexts

Use `trace()` when your workflow is easier to wrap with a context manager than a
decorator:

```python
from bir import generation, score, span, trace

with trace("answer_question", metadata={"kind": "manual"}):
    with span("draft_answer"):
        with generation("local.llm", model="demo-model") as gen:
            response = "ok"
            gen.set_output(response)
    score("helpfulness", 0.82)
```

## Read Local Traces

You can also read local traces back from the same file:

```python
from bir import load_traces

for trace in load_traces():
    print(trace.name, trace.status, trace.duration_ms)
    for event in trace.events:
        print(event.type, event.name)
```

## Send Events To The Server

To send local events to a running Bir server:

```python
from bir import send_events

result = send_events("http://127.0.0.1:8000")
print(result.accepted, result.attempted, result.skipped)
```

`send_events()` posts local JSONL events to the Bir server, batching them into a
single request when the server supports it and otherwise posting them to
`/v1/events` one at a time. It uses the Python standard library and reports how
many local events were attempted, how many were newly accepted, and how many
were skipped by an idempotent server response. It raises `RuntimeError` when the
server rejects an event or cannot be reached, and it does not remove local
events after sending. Re-sending the same file is safe against the Bir server
because duplicate event IDs are treated as already ingested.
Complete traces are sent root-first so the server receives the trace event before
its spans, tool calls, generations, and scores.
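
The root-first ordering and the idempotent re-send behaviour can be sketched without a server. The example below is illustrative only: the event fields `id`, `type`, and `trace_id` and both helper functions are assumptions made for the sketch, not the SDK's documented event schema or API.

```python
from collections import defaultdict


def order_root_first(events: list[dict]) -> list[dict]:
    """Group events by trace and place each trace root ahead of its children."""
    by_trace: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        by_trace[event["trace_id"]].append(event)

    ordered: list[dict] = []
    for trace_events in by_trace.values():
        # sorted() is stable, so children keep their original relative order.
        ordered.extend(sorted(trace_events, key=lambda e: e["type"] != "trace"))
    return ordered


def skip_duplicates(events: list[dict], seen_ids: set[str]) -> tuple[list[dict], int]:
    """Mimic an idempotent receiver: accept unseen IDs, count the rest as skipped."""
    accepted, skipped = [], 0
    for event in events:
        if event["id"] in seen_ids:
            skipped += 1
        else:
            seen_ids.add(event["id"])
            accepted.append(event)
    return accepted, skipped
```

Sending the same ordered list twice through `skip_duplicates` accepts every event the first time and skips every event the second time, which is the property that makes re-sending a file safe.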

## Privacy And Capture

Input and output capture is disabled by default. Enable it globally with `configure()`
or for a single function with `@observe(capture_inputs=True, capture_outputs=True)`.
Common secret-like fields such as `api_key`, `authorization`, `password`, `secret`,
and `token` are redacted before events are written.
Captured strings, fallback object representations, and captured error messages
are also scanned for common secret-like text patterns before events are written.
Redaction is best-effort, so keep capture opt-in for sensitive payloads and review
what your application records.

```python
from bir import configure

configure(capture_inputs=True, capture_outputs=True)
```

Captured values are normalized to JSON-compatible data before writing. Non-finite
floats such as `NaN` and `Infinity` are stored as strings, and deeply nested
values are truncated. `score()` requires a finite numeric value and accepts
optional `metadata` (for example `score("faithfulness", 0.4, metadata={"reason":
"answer cites no context"})`) that is redacted with the same rules before it is
written to the score event. Generation token usage and generation cost require at
least one field, and all provided values must be non-negative finite numbers.

Generation cost is user-provided. Bir records explicit cost values and defaults
the currency to `USD`; it does not calculate provider pricing automatically.
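
As a rough illustration of the key-based redaction and JSON normalization described in this section, the sketch below walks a captured value recursively. The placeholder strings, the depth limit, and the `normalize` helper are assumptions for the example; Bir's real rules, including its text-pattern scanning, may differ.

```python
import math

SECRET_KEYS = {"api_key", "authorization", "password", "secret", "token"}
REDACTED = "[REDACTED]"    # hypothetical placeholder text
TRUNCATED = "[TRUNCATED]"  # hypothetical placeholder text
MAX_DEPTH = 5              # hypothetical nesting limit


def normalize(value, depth: int = 0):
    """Convert a value to JSON-compatible data, redacting secret-like keys."""
    if depth > MAX_DEPTH:
        return TRUNCATED
    if isinstance(value, dict):
        return {
            str(key): REDACTED if str(key).lower() in SECRET_KEYS else normalize(item, depth + 1)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [normalize(item, depth + 1) for item in value]
    if isinstance(value, float) and not math.isfinite(value):
        return str(value)  # stored as the strings "nan", "inf", or "-inf"
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    return repr(value)  # fallback representation for arbitrary objects
```

Storing non-finite floats as strings keeps every line valid for strict JSON parsers, which reject bare `NaN` and `Infinity` tokens.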

## Service Metadata

Use `configure()` to tag traces with the service and environment that produced
them. Both values are optional, must be non-empty strings, and are recorded on
trace root events under `metadata.service` so the server and dashboard can
filter traces by service and environment.

```python
from bir import configure

configure(service_name="rag-api", environment="production")
```

## Sampling

Use `configure(sample_rate=...)` to keep local trace volume bounded under load.
`sample_rate` is the probability (`0.0` to `1.0`) that a trace is recorded and
defaults to `1.0`, which records every trace. The decision is made once per
trace root and inherited by every event under it, so a sampled-out trace and all
of its spans, generations, tool calls, retrievals, and scores write nothing.

Sampling never changes control flow: a sampled-out function still runs and still
raises its own exceptions; only the local JSONL writes are skipped.

```python
from bir import configure

configure(sample_rate=0.1)  # record about 10% of traces
```
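
To make the "decide once per trace root" rule concrete, here is a small stand-alone sketch. `TraceSampler` and `run_traced` are hypothetical names used only for illustration and are not part of the Bir API; the list `sink` stands in for the JSONL file.

```python
import random


class TraceSampler:
    """Make one keep-or-drop decision per trace root."""

    def __init__(self, sample_rate: float = 1.0, rng=None) -> None:
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0.0 and 1.0")
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def start_trace(self) -> bool:
        """Return True when the new trace should be recorded."""
        # random() is in [0.0, 1.0), so 1.0 always records and 0.0 never does.
        return self._rng.random() < self.sample_rate


def run_traced(sampler: TraceSampler, func, sink: list):
    """Always run func; append an event to sink only when the trace is sampled in."""
    recorded = sampler.start_trace()
    try:
        return func()
    finally:
        if recorded:
            sink.append({"type": "trace", "name": func.__name__})
```

The wrapped function runs and raises exactly as it would without sampling; only the write in the `finally` block depends on the decision.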

## Retrieval

Use `retrieval()` to record RAG lookups with the existing `tool_call` event
contract. It sets `metadata.kind` to `retrieval`, stores the query at
`input.query` when input capture is enabled, and stores retrieved records at
`output.documents` when output capture is enabled.
Document `rank` values must be non-negative integers, and document `score`
values must be non-negative finite numbers.

```python
from bir import retrieval

with retrieval("vector_search", query=question) as result:
    result.add_document(
        id="doc-1",
        rank=1,
        score=0.82,
        source="docs",
        text="Bir records local traces with JSONL.",
    )
```
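
The `rank` and `score` constraints above amount to a small validation step. The following sketch shows one way such a check could look; `validate_document` is a hypothetical helper, and the choice to reject booleans and to raise `ValueError` is an assumption rather than documented SDK behaviour.

```python
import math
from numbers import Real


def validate_document(rank=None, score=None) -> None:
    """Raise ValueError when rank or score violates the documented constraints."""
    if rank is not None:
        # bool is a subclass of int, so it is excluded explicitly (assumption).
        if isinstance(rank, bool) or not isinstance(rank, int) or rank < 0:
            raise ValueError("rank must be a non-negative integer")
    if score is not None:
        if isinstance(score, bool) or not isinstance(score, Real):
            raise ValueError("score must be a number")
        if not math.isfinite(score) or score < 0:
            raise ValueError("score must be a non-negative finite number")
```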

## Prompt Versions

Use `prompt()` to attach prompt identity and version metadata to a generation.
Prompt template text, variables, and rendered prompts are not captured unless
you opt in.

```python
from bir import generation, prompt

answer_prompt = prompt(
    "answer_question",
    version="v1",
    template="Answer using this context: {context}",
    variables={"context": "local context"},
)

with generation("local.llm", model="demo-model", prompt=answer_prompt) as gen:
    gen.set_output("ok")
```

The generation event records `metadata.prompt.name`, `metadata.prompt.version`,
and a `metadata.prompt.template_sha256` when a template is provided. To inspect
the actual prompt payload locally, opt in explicitly:

```python
answer_prompt = prompt(
    "answer_question",
    version="v1",
    template="Answer using this context: {context}",
    variables={"context": "local context"},
    capture_template=True,
    capture_variables=True,
    capture_rendered=True,
)
```

Captured prompt fields use the same best-effort redaction as other captured
payloads. After sending traces to the local server, the dashboard shows prompt
metadata on generation details without requiring you to inspect the raw event
JSON.
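
A template hash lets two runs be compared by prompt version without storing the template text. The sketch below shows the general idea with `hashlib`; it assumes the hash is taken over the UTF-8 bytes of the template and that rendering uses `str.format`, neither of which is confirmed by the README. `prompt_metadata` and the `template` and `rendered` keys are hypothetical.

```python
import hashlib


def prompt_metadata(
    name: str,
    version: str,
    template=None,
    variables=None,
    capture_template: bool = False,
    capture_rendered: bool = False,
) -> dict:
    """Build prompt metadata that identifies a template without storing it by default."""
    meta = {"name": name, "version": version}
    if template is not None:
        # Assumption: the digest covers the UTF-8 encoding of the raw template.
        meta["template_sha256"] = hashlib.sha256(template.encode("utf-8")).hexdigest()
        if capture_template:
            meta["template"] = template
        if capture_rendered and variables is not None:
            meta["rendered"] = template.format(**variables)
    return meta
```

With the defaults, only the name, version, and digest leave the function, which mirrors the opt-in capture behaviour described above.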

## LangChain Callback

Use `BirCallbackHandler` to record LangChain callback events as Bir traces
without adding LangChain as a Bir dependency:

```python
from bir import configure
from bir.integrations.langchain import BirCallbackHandler

configure(capture_inputs=True, capture_outputs=True)

result = chain.invoke(
    {"question": "What is Bir?"},
    config={"callbacks": [BirCallbackHandler()]},
)
```

Root chains become trace events, nested chains become spans, LLM/chat model
callbacks become generation events, retrievers become retrieval tool calls, and
tools become tool call events. Direct model calls without an active chain create
a small implicit trace root. The handler records token usage from common
LangChain response shapes including `llm_output.token_usage`,
`usage_metadata`, and `response_metadata.token_usage`.
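
Reading token usage from several response shapes is essentially a fallback chain. The sketch below models the three shapes named above as plain dictionaries; the inner key names (`prompt_tokens`, `completion_tokens`, `input_tokens`, `output_tokens`) and the lookup order are assumptions for illustration, and `extract_token_usage` is not the handler's real method.

```python
def extract_token_usage(llm_output=None, usage_metadata=None, response_metadata=None):
    """Return {"input_tokens", "output_tokens"} from the first shape that has usage."""
    candidates = [
        (llm_output or {}).get("token_usage"),
        usage_metadata,
        (response_metadata or {}).get("token_usage"),
    ]
    for usage in candidates:
        if not usage:
            continue
        # Accept either naming convention for the two counters.
        input_tokens = usage.get("input_tokens", usage.get("prompt_tokens"))
        output_tokens = usage.get("output_tokens", usage.get("completion_tokens"))
        if input_tokens is None and output_tokens is None:
            continue
        return {"input_tokens": input_tokens, "output_tokens": output_tokens}
    return None
```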

## Mistral

Use `trace_chat()` to record Mistral chat completions without adding `mistralai`
as a Bir dependency:

```python
from bir import trace
from bir.integrations.mistral import trace_chat

with trace("chat"):
    response = trace_chat(
        client.chat.complete,
        model="mistral-small-latest",
        messages=[{"role": "user", "content": "What is Bir?"}],
    )
```

The wrapper forwards positional and keyword arguments unchanged, returns the
Mistral response untouched, and records the response model, token usage, and
`model_dump()` output when capture settings allow it.

## Cohere

Use `trace_chat()` to record Cohere v2 chat calls without adding `cohere` as a
Bir dependency:

```python
from bir import trace
from bir.integrations.cohere import trace_chat

with trace("chat"):
    response = trace_chat(
        client.chat,
        model="command-a-03-2025",
        messages=[{"role": "user", "content": "What is Bir?"}],
    )
```

The wrapper forwards positional and keyword arguments unchanged, returns the
Cohere response untouched, records the request model, and reads token usage from
`response.usage.tokens` when present.
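
The pass-through pattern shared by the Mistral and Cohere wrappers can be sketched with a generic function that takes the client method as its first argument. Everything here is illustrative: `trace_call`, the `sink` list, the event fields, and the `fake_chat` stand-in are hypothetical, and the `input_tokens` and `output_tokens` attribute names under `usage.tokens` are an assumption.

```python
import time
from types import SimpleNamespace


def trace_call(func, *args, sink: list, **kwargs):
    """Call func unchanged, record a generation-like event, return the response as-is."""
    started = time.perf_counter()
    event = {"type": "generation", "model": kwargs.get("model"), "status": "ok"}
    try:
        response = func(*args, **kwargs)
    except Exception:
        event["status"] = "error"
        raise  # the caller sees the original exception
    else:
        tokens = getattr(getattr(response, "usage", None), "tokens", None)
        if tokens is not None:
            event["usage"] = {
                "input_tokens": getattr(tokens, "input_tokens", None),
                "output_tokens": getattr(tokens, "output_tokens", None),
            }
        return response
    finally:
        event["duration_ms"] = (time.perf_counter() - started) * 1000.0
        sink.append(event)


def fake_chat(*, model: str, messages: list):
    """Stand-in for a provider client method, used only to exercise the wrapper."""
    tokens = SimpleNamespace(input_tokens=4, output_tokens=9)
    return SimpleNamespace(text="ok", usage=SimpleNamespace(tokens=tokens))
```

Taking the callable rather than importing a provider SDK is what keeps the provider package out of the dependency list: the wrapper only relies on attributes it finds on the response.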

## Local Evals And Experiments

Bir includes a small deterministic evaluation layer for local regression checks.
It does not require a server or an LLM judge.

```python
from bir.evals import Dataset, DatasetExample, contains, exact_match, latency_under, run_experiment


dataset = Dataset(
    [
        DatasetExample(
            id="q1",
            input={"question": "What is Bir?"},
            expected="An observability SDK",
        )
    ]
)


def answer_question(question: str) -> str:
    return "Bir is an observability SDK."


result = run_experiment(
    "quickstart",
    dataset=dataset,
    task=answer_question,
    evaluators=[
        contains(),
        exact_match("Bir is an observability SDK."),
        latency_under(1000),
    ],
)

print(result.aggregate_scores)
```

Datasets can be stored as JSONL:

```json
{"id":"q1","input":{"question":"What is Bir?"},"expected":"An observability SDK"}
```

Load and run them locally:

```python
from bir.evals import Dataset, contains, list_experiments, load_experiment, run_experiment, send_experiment

dataset = Dataset.from_jsonl("questions.jsonl")
result = run_experiment("prompt-v1", dataset=dataset, task=answer_question, evaluators=[contains()])
loaded = load_experiment(result.path)
summaries = list_experiments()
```

`Dataset.to_jsonl()` redacts common secret-like values by default when exporting
examples. If you intentionally need to preserve raw dataset payloads, opt out
explicitly:

```python
dataset.to_jsonl("questions.jsonl", redact=False)
```

Experiment results are written to `.bir/experiments/*.jsonl` by default, with
one result row per example. Bir also writes a sibling `.summary.json` file with
the experiment id, status, example count, error count, aggregate scores, and
result path so local runs can be listed without scanning every result row.
Available deterministic evaluators are `exact_match()`, `contains()`,
`regex_match()`, `json_valid()`, `field_equals()`, `field_contains()`,
`latency_under()`, `cost_under()`, `numeric_between()`,
`retrieved_context_contains()`, `answer_context_overlap()`,
`answer_contains_citation()`, and `custom_evaluator()`.

Upload a completed experiment to a running Bir server so the dashboard can show
the experiment list and per-example result detail:

```python
send_experiment(result.path, "http://127.0.0.1:8000")
```

To connect local experiment rows back to traces, opt in when running the
experiment:

```python
result = run_experiment(
    "prompt-v1",
    dataset=dataset,
    task=answer_question,
    evaluators=[contains()],
    record_traces=True,
)
```

`record_traces=True` writes one trace per dataset example and records evaluator
outputs as score events on that trace.

Use threshold evaluators for local gates:

```python
from bir.evals import cost_under, latency_under, numeric_between

evaluators = [
    latency_under(1000),
    cost_under(0.05),
    numeric_between(min_value=0.0, max_value=1.0),
]
```

`latency_under()` uses measured task duration from `run_experiment()`.
`cost_under()` checks explicit cost fields returned by your task, either
`{"total_cost": 0.01}` or `{"cost": {"total_cost": 0.01}}`. Bir records only the
cost values you provide and does not calculate provider pricing.
`numeric_between()` checks numeric task outputs, or a numeric field when
`field=` is provided.
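
The two accepted cost shapes can be handled with a short lookup followed by a threshold comparison. This is a sketch only: `total_cost_of` and `cost_under_score` are hypothetical helpers, and treating the limit as strict (a cost equal to the limit fails) is an assumption the README does not settle.

```python
import math


def total_cost_of(output):
    """Read total_cost from {"total_cost": x} or {"cost": {"total_cost": x}}."""
    if not isinstance(output, dict):
        return None
    if "total_cost" in output:
        return output["total_cost"]
    cost = output.get("cost")
    if isinstance(cost, dict):
        return cost.get("total_cost")
    return None


def cost_under_score(output, max_cost: float) -> float:
    """Score 1.0 when a finite total cost is below max_cost, else 0.0."""
    total = total_cost_of(output)
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        return 0.0
    if not math.isfinite(total):
        return 0.0
    return 1.0 if total < max_cost else 0.0  # assumption: strict comparison
```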

Use structured output evaluators for JSON-like task results:

```python
from bir.evals import field_contains, field_equals, numeric_between

evaluators = [
    field_contains("answer", "observability"),
    field_equals("citations[0].id", "doc-1"),
    numeric_between(min_value=0.7, max_value=1.0, field="confidence"),
]
```

Field paths support dot paths and list indexes, such as `answer`,
`usage.total_tokens`, and `items[0].name`. Missing paths produce a `0.0` score
with failure metadata instead of failing the experiment.
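
A field path such as `items[0].name` can be resolved by splitting it into key and index tokens and walking the structure one step at a time. The sketch below is one possible approach, not the SDK's parser; `resolve_path` and `field_equals_score` are hypothetical names, and a missing path is reported through a `found` flag instead of an exception.

```python
import re

# A token is either a dict key (no dots or brackets) or a list index like [0].
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_path(data, path: str):
    """Return (found, value) for paths such as "usage.total_tokens" or "items[0].name"."""
    current = data
    for key, index in _TOKEN.findall(path):
        if key:
            if not isinstance(current, dict) or key not in current:
                return False, None
            current = current[key]
        else:
            position = int(index)
            if not isinstance(current, list) or position >= len(current):
                return False, None
            current = current[position]
    return True, current


def field_equals_score(output, path: str, expected) -> float:
    """Score 1.0 when the path exists and equals expected; a missing path scores 0.0."""
    found, value = resolve_path(output, path)
    return 1.0 if found and value == expected else 0.0
```

Returning a flag alongside the value distinguishes a missing path from a field that is present but set to `None`.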

Use `custom_evaluator()` for local checks that are specific to your task:

```python
from bir.evals import EvalResult, custom_evaluator

has_citation = custom_evaluator(
    "has_citation",
    lambda output, expected: "[1]" in str(output),
)

debuggable = custom_evaluator(
    "debuggable",
    lambda output, expected: EvalResult(
        name="debuggable",
        value=1.0,
        metadata={"expected": expected},
    ),
)
```

Custom evaluators may return `bool`, `int`, `float`, or `EvalResult`. Exceptions
from custom evaluator functions surface normally during development.

Use `retrieved_context_contains()` to check retrieval quality without an LLM
judge:

```python
from bir.evals import retrieved_context_contains

evaluators = [
    retrieved_context_contains("observability"),
]
```

`retrieved_context_contains()` reads the `contexts` list from a structured RAG
output such as `{"answer": "...", "contexts": ["doc text", ...]}` and scores
`1.0` when `expected` appears in one of the retrieved strings. Missing or empty
`contexts` produce a `0.0` score with failure metadata instead of failing the
experiment. Pass `case_sensitive=False` for case-insensitive matching. This is a
deterministic retrieval check, not proof that the answer used the context.
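
The check described above is a substring search over the retrieved strings. A minimal stand-alone version might look like the following; `retrieved_context_contains_score` is a hypothetical helper that returns only the numeric score and leaves out the failure metadata the real evaluator attaches.

```python
def retrieved_context_contains_score(output, expected: str, case_sensitive: bool = True) -> float:
    """Score 1.0 when expected appears in any retrieved context string."""
    contexts = output.get("contexts") if isinstance(output, dict) else None
    if not isinstance(contexts, list) or not contexts:
        return 0.0  # missing or empty contexts do not raise
    needle = expected if case_sensitive else expected.lower()
    for text in contexts:
        if not isinstance(text, str):
            continue
        haystack = text if case_sensitive else text.lower()
        if needle in haystack:
            return 1.0
    return 0.0
```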

Use `answer_context_overlap()` to flag answers that may not be grounded in the
retrieved context, also without an LLM judge:

```python
from bir.evals import answer_context_overlap

evaluators = [
    answer_context_overlap(0.5),
]
```

`answer_context_overlap()` reads the same structured RAG output
(`{"answer": "...", "contexts": ["doc text", ...]}`) and scores `1.0` when at
least `min_ratio` of the answer's word tokens also appear in the retrieved
contexts. It is a deterministic faithfulness heuristic, not proof of
faithfulness: paraphrased but faithful answers can score low, and unfaithful
answers that reuse context words can score high. Missing answers or contexts
produce a `0.0` score with failure metadata instead of failing the experiment.
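
The overlap heuristic boils down to a ratio: the share of answer tokens that also occur somewhere in the contexts. The sketch below assumes lowercase alphanumeric tokenization, which the README does not specify; `overlap_ratio` and `answer_context_overlap_score` are hypothetical helpers.

```python
import re

_WORD = re.compile(r"[a-z0-9]+")  # assumption: lowercase alphanumeric word tokens


def overlap_ratio(answer: str, contexts: list[str]) -> float:
    """Fraction of answer tokens that also appear in any context."""
    answer_tokens = _WORD.findall(answer.lower())
    if not answer_tokens:
        return 0.0
    context_tokens: set[str] = set()
    for text in contexts:
        context_tokens.update(_WORD.findall(text.lower()))
    hits = sum(1 for token in answer_tokens if token in context_tokens)
    return hits / len(answer_tokens)


def answer_context_overlap_score(output, min_ratio: float) -> float:
    """Score 1.0 when the overlap ratio reaches min_ratio; bad input scores 0.0."""
    if not isinstance(output, dict):
        return 0.0
    answer, contexts = output.get("answer"), output.get("contexts")
    if not isinstance(answer, str) or not isinstance(contexts, list) or not contexts:
        return 0.0
    return 1.0 if overlap_ratio(answer, contexts) >= min_ratio else 0.0
```

The limitation noted above is visible in the code: the ratio counts shared words only, so a paraphrase with different vocabulary scores low even when it is faithful.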

Use `answer_contains_citation()` to check that an answer cites a source, also
without an LLM judge:

```python
from bir.evals import answer_contains_citation

evaluators = [
    answer_contains_citation(),
]
```

`answer_contains_citation()` reads a plain answer string or the `answer` field
of a structured RAG output (`{"answer": "...", "contexts": [...]}`) and scores
`1.0` when the answer contains a citation marker. By default any bracketed
marker such as `[1]` or `[doc-1]` counts; pass `pattern` to require a custom
citation format such as `pattern=r"\(\d+\)"` for markers like `(1)`. This is a
deterministic format check, not proof that the citation is correct or that the
cited source supports the answer. Non-text output or a missing `answer` produces
a `0.0` score with failure metadata instead of failing the experiment.
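
A citation-format check of this kind is a single regular-expression search. In the sketch below the default pattern is an assumption chosen to match any non-empty bracketed marker; the SDK's actual default pattern is not shown in the README, and `answer_contains_citation_score` is a hypothetical helper.

```python
import re

DEFAULT_PATTERN = r"\[[^\[\]]+\]"  # assumption: any non-empty bracketed marker


def answer_contains_citation_score(output, pattern: str = DEFAULT_PATTERN) -> float:
    """Score 1.0 when the answer text contains a match for the citation pattern."""
    answer = output.get("answer") if isinstance(output, dict) else output
    if not isinstance(answer, str):
        return 0.0  # non-text output or a missing answer does not raise
    return 1.0 if re.search(pattern, answer) else 0.0
```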

## Event Loading

`load_events()` validates JSONL records against the current event schema and
raises `ValueError` for malformed rows, unsupported event types, invalid
timestamps, or unsupported schema versions.
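
The four failure cases listed above map onto a row-by-row validator. The sketch below is illustrative and makes several assumptions: the field names `type`, `schema_version`, and `timestamp`, the set of supported schema versions, and ISO 8601 timestamps are all hypothetical, and `parse_event_line` is not part of the Bir API.

```python
import json
from datetime import datetime

SUPPORTED_TYPES = {"trace", "span", "generation", "tool_call", "score"}
SUPPORTED_SCHEMA_VERSIONS = {1}  # hypothetical version set


def parse_event_line(line: str, line_number: int) -> dict:
    """Parse one JSONL row and raise ValueError with the line number on bad input."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"line {line_number}: malformed JSON") from exc
    if not isinstance(record, dict):
        raise ValueError(f"line {line_number}: event must be a JSON object")
    if record.get("type") not in SUPPORTED_TYPES:
        raise ValueError(f"line {line_number}: unsupported event type")
    if record.get("schema_version") not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"line {line_number}: unsupported schema version")
    try:
        datetime.fromisoformat(str(record.get("timestamp")))
    except ValueError as exc:
        raise ValueError(f"line {line_number}: invalid timestamp") from exc
    return record
```

Including the line number in each message makes it practical to repair a hand-edited trace file.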

To write traces somewhere else:

```python
from bir import configure

configure(trace_path="tmp/bir-traces.jsonl")
```

## Development

Run the SDK unit tests from this directory:

```bash
PYTHONPATH=src ../../.venv/bin/python -m unittest discover -s tests
```

Or install the package with test dependencies and run pytest:

```bash
python3 -m pip install -e ".[dev]"
pytest
```

Run repository type checking from the repository root:

```bash
./.venv/bin/pyright
```

Run the release verification script from the repository root before publishing:

```bash
./.venv/bin/python scripts/verify_release.py
```

The script builds a temporary SDK wheel, installs it in a fresh temporary
virtual environment, and smoke-tests local tracing and retrieval without writing
build artifacts into the repository.

Release planning lives in `CHANGELOG.md` and `docs/SDK_RELEASE_CHECKLIST.md`.

## License

Bir is source-available under the Functional Source License 1.1 with Apache 2.0
as the future license (`FSL-1.1-ALv2`). FSL is not an OSI-approved open source
license.