bir-sdk 0.1.0__tar.gz

Files changed (35)
  1. bir_sdk-0.1.0/LICENSE +100 -0
  2. bir_sdk-0.1.0/PKG-INFO +584 -0
  3. bir_sdk-0.1.0/README.md +558 -0
  4. bir_sdk-0.1.0/pyproject.toml +49 -0
  5. bir_sdk-0.1.0/setup.cfg +4 -0
  6. bir_sdk-0.1.0/src/bir/__init__.py +47 -0
  7. bir_sdk-0.1.0/src/bir/_sdk.py +1688 -0
  8. bir_sdk-0.1.0/src/bir/evals.py +1454 -0
  9. bir_sdk-0.1.0/src/bir/integrations/__init__.py +21 -0
  10. bir_sdk-0.1.0/src/bir/integrations/_common.py +47 -0
  11. bir_sdk-0.1.0/src/bir/integrations/anthropic.py +92 -0
  12. bir_sdk-0.1.0/src/bir/integrations/cohere.py +87 -0
  13. bir_sdk-0.1.0/src/bir/integrations/google.py +84 -0
  14. bir_sdk-0.1.0/src/bir/integrations/langchain.py +429 -0
  15. bir_sdk-0.1.0/src/bir/integrations/litellm.py +108 -0
  16. bir_sdk-0.1.0/src/bir/integrations/llamaindex.py +374 -0
  17. bir_sdk-0.1.0/src/bir/integrations/mistral.py +87 -0
  18. bir_sdk-0.1.0/src/bir/integrations/openai.py +167 -0
  19. bir_sdk-0.1.0/src/bir_sdk.egg-info/PKG-INFO +584 -0
  20. bir_sdk-0.1.0/src/bir_sdk.egg-info/SOURCES.txt +33 -0
  21. bir_sdk-0.1.0/src/bir_sdk.egg-info/dependency_links.txt +1 -0
  22. bir_sdk-0.1.0/src/bir_sdk.egg-info/requires.txt +3 -0
  23. bir_sdk-0.1.0/src/bir_sdk.egg-info/top_level.txt +1 -0
  24. bir_sdk-0.1.0/tests/test_anthropic_integration.py +187 -0
  25. bir_sdk-0.1.0/tests/test_cohere_integration.py +187 -0
  26. bir_sdk-0.1.0/tests/test_evals.py +1285 -0
  27. bir_sdk-0.1.0/tests/test_examples.py +138 -0
  28. bir_sdk-0.1.0/tests/test_google_integration.py +199 -0
  29. bir_sdk-0.1.0/tests/test_langchain_integration.py +212 -0
  30. bir_sdk-0.1.0/tests/test_litellm_integration.py +216 -0
  31. bir_sdk-0.1.0/tests/test_llamaindex_integration.py +208 -0
  32. bir_sdk-0.1.0/tests/test_mistral_integration.py +189 -0
  33. bir_sdk-0.1.0/tests/test_openai_integration.py +277 -0
  34. bir_sdk-0.1.0/tests/test_redaction_parity.py +41 -0
  35. bir_sdk-0.1.0/tests/test_sdk.py +2369 -0
bir_sdk-0.1.0/LICENSE ADDED
@@ -0,0 +1,100 @@
1
+ # Functional Source License, Version 1.1, ALv2 Future License
2
+
3
+ ## Abbreviation
4
+
5
+ FSL-1.1-ALv2
6
+
7
+ ## Notice
8
+
9
+ Copyright 2026 mskayacioglu
10
+
11
+ ## Terms and Conditions
12
+
13
+ ### Licensor ("We")
14
+
15
+ The party offering the Software under these Terms and Conditions.
16
+
17
+ ### The Software
18
+
19
+ The "Software" is each version of the software that we make available under
20
+ these Terms and Conditions, as indicated by our inclusion of these Terms and
21
+ Conditions with the Software.
22
+
23
+ ### License Grant
24
+
25
+ Subject to your compliance with this License Grant and the Patents,
26
+ Redistribution and Trademark clauses below, we hereby grant you the right to
27
+ use, copy, modify, create derivative works, publicly perform, publicly display
28
+ and redistribute the Software for any Permitted Purpose identified below.
29
+
30
+ ### Permitted Purpose
31
+
32
+ A Permitted Purpose is any purpose other than a Competing Use. A Competing Use
33
+ means making the Software available to others in a commercial product or service
34
+ that:
35
+
36
+ 1. substitutes for the Software;
37
+ 2. substitutes for any other product or service we offer using the Software that
38
+ exists as of the date we make the Software available; or
39
+ 3. offers the same or substantially similar functionality as the Software.
40
+
41
+ Permitted Purposes specifically include using the Software:
42
+
43
+ 1. for your internal use and access;
44
+ 2. for non-commercial education;
45
+ 3. for non-commercial research; and
46
+ 4. in connection with professional services that you provide to a licensee using
47
+ the Software in accordance with these Terms and Conditions.
48
+
49
+ ### Patents
50
+
51
+ To the extent your use for a Permitted Purpose would necessarily infringe our
52
+ patents, the license grant above includes a license under our patents. If you
53
+ make a claim against any party that the Software infringes or contributes to the
54
+ infringement of any patent, then your patent license to the Software ends
55
+ immediately.
56
+
57
+ ### Redistribution
58
+
59
+ The Terms and Conditions apply to all copies, modifications and derivatives of
60
+ the Software.
61
+
62
+ If you redistribute any copies, modifications or derivatives of the Software,
63
+ you must include a copy of or a link to these Terms and Conditions and not
64
+ remove any copyright notices provided in or with the Software.
65
+
66
+ ### Disclaimer
67
+
68
+ THE SOFTWARE IS PROVIDED "AS IS" AND WITHOUT WARRANTIES OF ANY KIND, EXPRESS OR
69
+ IMPLIED, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR
70
+ PURPOSE, MERCHANTABILITY, TITLE OR NON-INFRINGEMENT.
71
+
72
+ IN NO EVENT WILL WE HAVE ANY LIABILITY TO YOU ARISING OUT OF OR RELATED TO THE
73
+ SOFTWARE, INCLUDING INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES, EVEN
74
+ IF WE HAVE BEEN INFORMED OF THEIR POSSIBILITY IN ADVANCE.
75
+
76
+ ### Trademarks
77
+
78
+ Except for displaying the License Details and identifying us as the origin of
79
+ the Software, you have no right under these Terms and Conditions to use our
80
+ trademarks, trade names, service marks or product names.
81
+
82
+ ## Grant of Future License
83
+
84
+ We hereby irrevocably grant you an additional license to use the Software under
85
+ the Apache License, Version 2.0 that is effective on the second anniversary of
86
+ the date we make the Software available. On or after that date, you may use the
87
+ Software under the Apache License, Version 2.0, in which case the following
88
+ will apply:
89
+
90
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use
91
+ this file except in compliance with the License.
92
+
93
+ You may obtain a copy of the License at
94
+
95
+ http://www.apache.org/licenses/LICENSE-2.0
96
+
97
+ Unless required by applicable law or agreed to in writing, software distributed
98
+ under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
99
+ CONDITIONS OF ANY KIND, either express or implied. See the License for the
100
+ specific language governing permissions and limitations under the License.
bir_sdk-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,584 @@
1
+ Metadata-Version: 2.4
2
+ Name: bir-sdk
3
+ Version: 0.1.0
4
+ Summary: Minimal local tracing SDK for LLM applications.
5
+ Author: mskayacioglu
6
+ Project-URL: Homepage, https://github.com/mskayacioglu/bir
7
+ Project-URL: Documentation, https://github.com/mskayacioglu/bir/tree/main/docs
8
+ Project-URL: Source, https://github.com/mskayacioglu/bir
9
+ Project-URL: Issues, https://github.com/mskayacioglu/bir/issues
10
+ Keywords: ai,evals,llm,observability,tracing
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3 :: Only
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Programming Language :: Python :: 3.13
19
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
20
+ Requires-Python: >=3.10
21
+ Description-Content-Type: text/markdown
22
+ License-File: LICENSE
23
+ Provides-Extra: dev
24
+ Requires-Dist: pytest>=8.0; extra == "dev"
25
+ Dynamic: license-file
26
+
27
+ # Bir Python SDK
28
+
29
+ Minimal local tracing SDK for Python LLM applications.
30
+
31
+ Bir records traces, spans, generations, tool calls, and scores to local JSONL
32
+ without requiring a server. Start locally, then send events to the Bir FastAPI
33
+ server when you want to inspect them in the dashboard.
34
+
35
+ ## Installation
36
+
37
+ After the first package release:
38
+
39
+ ```bash
40
+ python -m pip install bir-sdk
41
+ ```
42
+
43
+ The distribution is published on PyPI as `bir-sdk`; the import name is `bir`
44
+ (e.g. `from bir import observe`).
45
+
46
+ For local development from this repository:
47
+
48
+ ```bash
49
+ python -m pip install -e ".[dev]"
50
+ ```
51
+
52
+ ## Quickstart
53
+
54
+ ```python
55
+ from bir import generation, observe, retrieval, score, span
56
+
57
+
58
+ @observe()
59
+ def answer_question(question: str) -> str:
60
+ with span("retrieve_context"):
61
+ with retrieval("search_docs", query=question) as result:
62
+ result.add_document(id="doc-1", text="local context")
63
+ documents = ["local context"]
64
+
65
+ with generation("local.llm", model="demo-model") as gen:
66
+ response = f"{documents[0]}: {question}"
67
+ gen.set_output(response)
68
+ gen.set_usage(input_tokens=12, output_tokens=24)
69
+ gen.set_cost(input_cost=0.000012, output_cost=0.000048)
70
+
71
+ score("helpfulness", 0.82)
72
+ return response
73
+ ```
74
+
75
+ Trace, span, tool call, generation, and score events are written as JSONL to:
76
+
77
+ ```text
78
+ .bir/traces.jsonl
79
+ ```
80
+
81
+ Writes to the local trace file are serialized within the SDK process so
82
+ multi-threaded sync applications keep the JSONL file line-delimited and
83
+ parseable.
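
The serialized writes described above can be pictured as a process-wide lock around each append. This is a minimal standalone sketch, not Bir's internal code: `append_event` and `_WRITE_LOCK` are hypothetical names, and the real SDK may serialize writes differently.

```python
import json
import threading
from pathlib import Path

# Hypothetical names for illustration; this is not Bir's internal code.
_WRITE_LOCK = threading.Lock()


def append_event(path: Path, event: dict) -> None:
    """Append one event as a single JSON line, one writer at a time."""
    # Serialize outside the lock so the critical section stays short.
    line = json.dumps(event, separators=(",", ":"))
    with _WRITE_LOCK:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as handle:
            handle.write(line + "\n")
```

Because each event is written as one complete line while the lock is held, concurrent threads cannot interleave partial records. A lock like this only protects writers inside one process.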
84
+
85
+ ## Manual Trace Contexts
86
+
87
+ Use `trace()` when your workflow is easier to wrap with a context manager than a
88
+ decorator:
89
+
90
+ ```python
91
+ from bir import generation, score, span, trace
92
+
93
+ with trace("answer_question", metadata={"kind": "manual"}):
94
+ with span("draft_answer"):
95
+ with generation("local.llm", model="demo-model") as gen:
96
+ response = "ok"
97
+ gen.set_output(response)
98
+ score("helpfulness", 0.82)
99
+ ```
100
+
101
+ ## Read Local Traces
102
+
103
+ You can also read local traces back from the same file:
104
+
105
+ ```python
106
+ from bir import load_traces
107
+
108
+ for trace in load_traces():
109
+ print(trace.name, trace.status, trace.duration_ms)
110
+ for event in trace.events:
111
+ print(event.type, event.name)
112
+ ```
113
+
114
+ ## Send Events To The Server
115
+
116
+ To send local events to a running Bir server:
117
+
118
+ ```python
119
+ from bir import send_events
120
+
121
+ result = send_events("http://127.0.0.1:8000")
122
+ print(result.accepted, result.attempted, result.skipped)
123
+ ```
124
+
125
+ `send_events()` posts local JSONL events to the Bir server, batching them into a
126
+ single request when the server supports it and otherwise posting them to
127
+ `/v1/events` one at a time. It uses the Python standard library and reports
128
+ how many local events were attempted, how many were newly accepted, and how
129
+ many were skipped by an idempotent server response. It raises `RuntimeError`
130
+ when the server rejects an event or cannot be reached, and it
131
+ does not remove local events after sending. Re-sending the same file is safe
132
+ against the Bir server because duplicate event IDs are treated as already
133
+ ingested.
134
+ Complete traces are sent root-first so the server receives the trace event before
135
+ its spans, tool calls, generations, and scores.
136
+
137
+ ## Privacy And Capture
138
+
139
+ Input and output capture is disabled by default. Enable it globally with `configure()`
140
+ or for a single function with `@observe(capture_inputs=True, capture_outputs=True)`.
141
+ Common secret-like fields such as `api_key`, `authorization`, `password`, `secret`,
142
+ and `token` are redacted before events are written.
143
+ Captured strings, fallback object representations, and captured error messages
144
+ are also scanned for common secret-like text patterns before events are written.
145
+ Redaction is best-effort, so keep capture opt-in for sensitive payloads and review
146
+ what your application records.
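
To make the two redaction passes concrete (secret-like field names, then secret-like text), here is a small standalone sketch. It is not Bir's redaction code: the key list comes from the paragraph above, while the exact-match rule, the `Bearer` pattern, and the `[REDACTED]` placeholder are assumptions chosen for illustration.

```python
import re
from typing import Any

# Key names taken from the README text; exact (case-insensitive) matching is an assumption.
SECRET_KEYS = {"api_key", "authorization", "password", "secret", "token"}
# Assumed example of a secret-like text pattern; Bir's real patterns are not shown here.
BEARER = re.compile(r"Bearer\s+[A-Za-z0-9._\-]+")
PLACEHOLDER = "[REDACTED]"  # assumed placeholder string


def redact(value: Any) -> Any:
    """Return a copy of value with secret-like fields and text masked."""
    if isinstance(value, dict):
        return {
            key: PLACEHOLDER if str(key).lower() in SECRET_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return BEARER.sub(f"Bearer {PLACEHOLDER}", value)
    return value
```

A pattern-based pass like this only catches the shapes it knows about, which is why the README calls redaction best-effort.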
147
+
148
+ ```python
149
+ from bir import configure
150
+
151
+ configure(capture_inputs=True, capture_outputs=True)
152
+ ```
153
+
154
+ Captured values are normalized to JSON-compatible data before writing. Non-finite
155
+ floats such as `NaN` and `Infinity` are stored as strings, and deeply nested
156
+ values are truncated. `score()` requires a finite numeric value and accepts
157
+ optional `metadata` (for example `score("faithfulness", 0.4, metadata={"reason":
158
+ "answer cites no context"})`) that is redacted with the same rules before it is
159
+ written to the score event. Generation token usage and generation cost require at
160
+ least one field, and all provided values must be non-negative finite numbers.
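
The normalization rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the SDK's implementation: `MAX_DEPTH`, the `"<truncated>"` placeholder, and the use of `str()` for non-finite floats are all hypothetical choices, so the stored strings may differ from what Bir writes.

```python
import math
from typing import Any

MAX_DEPTH = 3  # assumed limit for illustration; Bir's real limit may differ
TRUNCATED = "<truncated>"  # assumed placeholder


def normalize(value: Any, depth: int = 0) -> Any:
    """Return a JSON-compatible copy of value."""
    if depth > MAX_DEPTH:
        return TRUNCATED
    if value is None or isinstance(value, (bool, int, str)):
        return value
    if isinstance(value, float):
        # Non-finite floats are not valid JSON, so store them as strings.
        return value if math.isfinite(value) else str(value)
    if isinstance(value, dict):
        return {str(key): normalize(item, depth + 1) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(item, depth + 1) for item in value]
    # Fallback object representation for anything else.
    return repr(value)
```

After this step, `json.dumps(..., allow_nan=False)` succeeds on the result, which keeps every JSONL row strictly parseable.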
161
+
162
+ Generation cost is user-provided. Bir records explicit cost values and defaults
163
+ the currency to `USD`; it does not calculate provider pricing automatically.
164
+
165
+ ## Service Metadata
166
+
167
+ Use `configure()` to tag traces with the service and environment that produced
168
+ them. Both values are optional, must be non-empty strings, and are recorded on
169
+ trace root events under `metadata.service` so the server and dashboard can
170
+ filter traces by service and environment.
171
+
172
+ ```python
173
+ from bir import configure
174
+
175
+ configure(service_name="rag-api", environment="production")
176
+ ```
177
+
178
+ ## Sampling
179
+
180
+ Use `configure(sample_rate=...)` to keep local trace volume bounded under load.
181
+ `sample_rate` is the probability (`0.0` to `1.0`) that a trace is recorded and
182
+ defaults to `1.0`, which records every trace. The decision is made once per
183
+ trace root and inherited by every event under it, so a sampled-out trace and all
184
+ of its spans, generations, tool calls, retrievals, and scores write nothing.
185
+
186
+ Sampling never changes control flow: a sampled-out function still runs and still
187
+ raises its own exceptions; only the local JSONL writes are skipped.
188
+
189
+ ```python
190
+ from bir import configure
191
+
192
+ configure(sample_rate=0.1) # record about 10% of traces
193
+ ```
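
The "decide once per trace root, inherit everywhere" rule can be sketched as follows. `TraceContext` and `start_trace` are hypothetical names used only for this illustration; they are not part of the Bir API, and the SDK's own random source is not shown in this README.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TraceContext:
    """Hypothetical stand-in for a trace root; not a Bir class."""

    sampled: bool
    events: list = field(default_factory=list)

    def record(self, event: dict) -> None:
        # Child events inherit the root decision instead of rolling again.
        if self.sampled:
            self.events.append(event)


def start_trace(sample_rate: float, rng: random.Random) -> TraceContext:
    """Make the sampling decision exactly once, at the trace root."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # rng.random() is in [0.0, 1.0), so 1.0 keeps every trace and 0.0 keeps none.
    return TraceContext(sampled=rng.random() < sample_rate)
```

Deciding at the root means a trace is either complete or absent; rolling per event would leave partial traces with missing spans.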
194
+
195
+ ## Retrieval
196
+
197
+ Use `retrieval()` to record RAG lookups with the existing `tool_call` event
198
+ contract. It sets `metadata.kind` to `retrieval`, stores the query at
199
+ `input.query` when input capture is enabled, and stores retrieved records at
200
+ `output.documents` when output capture is enabled.
201
+ Document `rank` values must be non-negative integers, and document `score`
202
+ values must be non-negative finite numbers.
203
+
204
+ ```python
205
+ from bir import retrieval
206
+
207
+ with retrieval("vector_search", query=question) as result:
208
+ result.add_document(
209
+ id="doc-1",
210
+ rank=1,
211
+ score=0.82,
212
+ source="docs",
213
+ text="Bir records local traces with JSONL.",
214
+ )
215
+ ```
216
+
217
+ ## Prompt Versions
218
+
219
+ Use `prompt()` to attach prompt identity and version metadata to a generation.
220
+ Prompt template text, variables, and rendered prompts are not captured unless
221
+ you opt in.
222
+
223
+ ```python
224
+ from bir import generation, prompt
225
+
226
+ answer_prompt = prompt(
227
+ "answer_question",
228
+ version="v1",
229
+ template="Answer using this context: {context}",
230
+ variables={"context": "local context"},
231
+ )
232
+
233
+ with generation("local.llm", model="demo-model", prompt=answer_prompt) as gen:
234
+ gen.set_output("ok")
235
+ ```
236
+
237
+ The generation event records `metadata.prompt.name`, `metadata.prompt.version`,
238
+ and a `metadata.prompt.template_sha256` when a template is provided. To inspect
239
+ the actual prompt payload locally, opt in explicitly:
240
+
241
+ ```python
242
+ answer_prompt = prompt(
243
+ "answer_question",
244
+ version="v1",
245
+ template="Answer using this context: {context}",
246
+ variables={"context": "local context"},
247
+ capture_template=True,
248
+ capture_variables=True,
249
+ capture_rendered=True,
250
+ )
251
+ ```
252
+
253
+ Captured prompt fields use the same best-effort redaction as other captured
254
+ payloads. After sending traces to the local server, the dashboard shows prompt
255
+ metadata on generation details without requiring you to inspect the raw event
256
+ JSON.
257
+
258
+ ## LangChain Callback
259
+
260
+ Use `BirCallbackHandler` to record LangChain callback events as Bir traces
261
+ without adding LangChain as a Bir dependency:
262
+
263
+ ```python
264
+ from bir import configure
265
+ from bir.integrations.langchain import BirCallbackHandler
266
+
267
+ configure(capture_inputs=True, capture_outputs=True)
268
+
269
+ result = chain.invoke(
270
+ {"question": "What is Bir?"},
271
+ config={"callbacks": [BirCallbackHandler()]},
272
+ )
273
+ ```
274
+
275
+ Root chains become trace events, nested chains become spans, LLM/chat model
276
+ callbacks become generation events, retrievers become retrieval tool calls, and
277
+ tools become tool call events. Direct model calls without an active chain create
278
+ a small implicit trace root. The handler records token usage from common
279
+ LangChain response shapes including `llm_output.token_usage`,
280
+ `usage_metadata`, and `response_metadata.token_usage`.
281
+
282
+ ## Mistral
283
+
284
+ Use `trace_chat()` to record Mistral chat completions without adding `mistralai`
285
+ as a Bir dependency:
286
+
287
+ ```python
288
+ from bir import trace
289
+ from bir.integrations.mistral import trace_chat
290
+
291
+ with trace("chat"):
292
+ response = trace_chat(
293
+ client.chat.complete,
294
+ model="mistral-small-latest",
295
+ messages=[{"role": "user", "content": "What is Bir?"}],
296
+ )
297
+ ```
298
+
299
+ The wrapper forwards positional and keyword arguments unchanged, returns the
300
+ Mistral response untouched, and records the response model, token usage, and
301
+ `model_dump()` output when capture settings allow it.
302
+
303
+ ## Cohere
304
+
305
+ Use `trace_chat()` to record Cohere v2 chat calls without adding `cohere` as a
306
+ Bir dependency:
307
+
308
+ ```python
309
+ from bir import trace
310
+ from bir.integrations.cohere import trace_chat
311
+
312
+ with trace("chat"):
313
+ response = trace_chat(
314
+ client.chat,
315
+ model="command-a-03-2025",
316
+ messages=[{"role": "user", "content": "What is Bir?"}],
317
+ )
318
+ ```
319
+
320
+ The wrapper forwards positional and keyword arguments unchanged, returns the
321
+ Cohere response untouched, records the request model, and reads token usage from
322
+ `response.usage.tokens` when present.
323
+
324
+ ## Local Evals And Experiments
325
+
326
+ Bir includes a small deterministic evaluation layer for local regression checks.
327
+ It does not require a server or an LLM judge.
328
+
329
+ ```python
330
+ from bir.evals import Dataset, DatasetExample, contains, exact_match, latency_under, run_experiment
331
+
332
+
333
+ dataset = Dataset(
334
+ [
335
+ DatasetExample(
336
+ id="q1",
337
+ input={"question": "What is Bir?"},
338
+ expected="An observability SDK",
339
+ )
340
+ ]
341
+ )
342
+
343
+
344
+ def answer_question(question: str) -> str:
345
+ return "Bir is an observability SDK."
346
+
347
+
348
+ result = run_experiment(
349
+ "quickstart",
350
+ dataset=dataset,
351
+ task=answer_question,
352
+ evaluators=[
353
+ contains(),
354
+ exact_match("Bir is an observability SDK."),
355
+ latency_under(1000),
356
+ ],
357
+ )
358
+
359
+ print(result.aggregate_scores)
360
+ ```
361
+
362
+ Datasets can be stored as JSONL:
363
+
364
+ ```json
365
+ {"id":"q1","input":{"question":"What is Bir?"},"expected":"An observability SDK"}
366
+ ```
367
+
368
+ Load and run them locally:
369
+
370
+ ```python
371
+ from bir.evals import Dataset, contains, list_experiments, load_experiment, run_experiment, send_experiment
372
+
373
+ dataset = Dataset.from_jsonl("questions.jsonl")
374
+ result = run_experiment("prompt-v1", dataset=dataset, task=answer_question, evaluators=[contains()])
375
+ loaded = load_experiment(result.path)
376
+ summaries = list_experiments()
377
+ ```
378
+
379
+ `Dataset.to_jsonl()` redacts common secret-like values by default when exporting
380
+ examples. If you intentionally need to preserve raw dataset payloads, opt out
381
+ explicitly:
382
+
383
+ ```python
384
+ dataset.to_jsonl("questions.jsonl", redact=False)
385
+ ```
386
+
387
+ Experiment results are written to `.bir/experiments/*.jsonl` by default, with
388
+ one result row per example. Bir also writes a sibling `.summary.json` file with
389
+ the experiment id, status, example count, error count, aggregate scores, and
390
+ result path so local runs can be listed without scanning every result row.
391
+ Available deterministic evaluators are `exact_match()`, `contains()`,
392
+ `regex_match()`, `json_valid()`, `field_equals()`, `field_contains()`,
393
+ `latency_under()`, `cost_under()`, `numeric_between()`,
394
+ `retrieved_context_contains()`, `answer_context_overlap()`,
395
+ `answer_contains_citation()`, and `custom_evaluator()`.
396
+
397
+ Upload a completed experiment to a running Bir server so the dashboard can show
398
+ the experiment list and per-example result detail:
399
+
400
+ ```python
401
+ send_experiment(result.path, "http://127.0.0.1:8000")
402
+ ```
403
+
404
+ To connect local experiment rows back to traces, opt in when running the
405
+ experiment:
406
+
407
+ ```python
408
+ result = run_experiment(
409
+ "prompt-v1",
410
+ dataset=dataset,
411
+ task=answer_question,
412
+ evaluators=[contains()],
413
+ record_traces=True,
414
+ )
415
+ ```
416
+
417
+ `record_traces=True` writes one trace per dataset example and records evaluator
418
+ outputs as score events on that trace.
419
+
420
+ Use threshold evaluators for local gates:
421
+
422
+ ```python
423
+ from bir.evals import cost_under, latency_under, numeric_between
424
+
425
+ evaluators = [
426
+ latency_under(1000),
427
+ cost_under(0.05),
428
+ numeric_between(min_value=0.0, max_value=1.0),
429
+ ]
430
+ ```
431
+
432
+ `latency_under()` uses measured task duration from `run_experiment()`.
433
+ `cost_under()` checks explicit cost fields returned by your task, either
434
+ `{"total_cost": 0.01}` or `{"cost": {"total_cost": 0.01}}`. Bir records only the
435
+ cost values you provide and does not calculate provider pricing.
436
+ `numeric_between()` checks numeric task outputs, or a numeric field when
437
+ `field=` is provided.
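
As a rough picture of the two cost shapes that `cost_under()` accepts, the lookup could work like the sketch below. The helper names are hypothetical, and treating "under" as a strict `<` comparison is an assumption, not something this README states.

```python
import math
from typing import Any, Optional


def extract_total_cost(output: Any) -> Optional[float]:
    """Read total_cost from {"total_cost": x} or {"cost": {"total_cost": x}}."""
    if not isinstance(output, dict):
        return None
    candidate = output.get("total_cost")
    if candidate is None and isinstance(output.get("cost"), dict):
        candidate = output["cost"].get("total_cost")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(candidate, bool) or not isinstance(candidate, (int, float)):
        return None
    return float(candidate) if math.isfinite(candidate) else None


def cost_under_score(output: Any, max_cost: float) -> float:
    """Score 1.0 when an explicit cost is present and below max_cost (assumed strict)."""
    total = extract_total_cost(output)
    return 1.0 if total is not None and total < max_cost else 0.0
```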
438
+
439
+ Use structured output evaluators for JSON-like task results:
440
+
441
+ ```python
442
+ from bir.evals import field_contains, field_equals, numeric_between
443
+
444
+ evaluators = [
445
+ field_contains("answer", "observability"),
446
+ field_equals("citations[0].id", "doc-1"),
447
+ numeric_between(min_value=0.7, max_value=1.0, field="confidence"),
448
+ ]
449
+ ```
450
+
451
+ Field paths support dot paths and list indexes, such as `answer`,
452
+ `usage.total_tokens`, and `items[0].name`. Missing paths produce a `0.0` score
453
+ with failure metadata instead of failing the experiment.
454
+
455
+ Use `custom_evaluator()` for local checks that are specific to your task:
456
+
457
+ ```python
458
+ from bir.evals import EvalResult, custom_evaluator
459
+
460
+ has_citation = custom_evaluator(
461
+ "has_citation",
462
+ lambda output, expected: "[1]" in str(output),
463
+ )
464
+
465
+ debuggable = custom_evaluator(
466
+ "debuggable",
467
+ lambda output, expected: EvalResult(
468
+ name="debuggable",
469
+ value=1.0,
470
+ metadata={"expected": expected},
471
+ ),
472
+ )
473
+ ```
474
+
475
+ Custom evaluators may return `bool`, `int`, `float`, or `EvalResult`. Exceptions
476
+ from custom evaluator functions surface normally during development.
477
+
478
+ Use `retrieved_context_contains()` to check retrieval quality without an LLM
479
+ judge:
480
+
481
+ ```python
482
+ from bir.evals import retrieved_context_contains
483
+
484
+ evaluators = [
485
+ retrieved_context_contains("observability"),
486
+ ]
487
+ ```
488
+
489
+ `retrieved_context_contains()` reads the `contexts` list from a structured RAG
490
+ output such as `{"answer": "...", "contexts": ["doc text", ...]}` and scores
491
+ `1.0` when `expected` appears in one of the retrieved strings. Missing or empty
492
+ `contexts` produce a `0.0` score with failure metadata instead of failing the
493
+ experiment. Pass `case_sensitive=False` for case-insensitive matching. This is a
494
+ deterministic retrieval check, not proof that the answer used the context.
495
+
496
+ Use `answer_context_overlap()` to flag answers that may not be grounded in the
497
+ retrieved context, also without an LLM judge:
498
+
499
+ ```python
500
+ from bir.evals import answer_context_overlap
501
+
502
+ evaluators = [
503
+ answer_context_overlap(0.5),
504
+ ]
505
+ ```
506
+
507
+ `answer_context_overlap()` reads the same structured RAG output
508
+ (`{"answer": "...", "contexts": ["doc text", ...]}`) and scores `1.0` when at
509
+ least `min_ratio` of the answer's word tokens also appear in the retrieved
510
+ contexts. It is a deterministic faithfulness heuristic, not proof of
511
+ faithfulness: paraphrased but faithful answers can score low, and unfaithful
512
+ answers that reuse context words can score high. Missing answers or contexts
513
+ produce a `0.0` score with failure metadata instead of failing the experiment.
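
The overlap heuristic can be approximated in a few lines. This is a sketch under assumptions, not the evaluator's source: the tokenizer (lowercased alphanumeric runs) and the `>=` comparison against `min_ratio` are guesses at reasonable behavior, and `overlap_score` is a hypothetical name.

```python
import re


def _tokens(text: str) -> list[str]:
    # Assumed tokenization: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(output: dict, min_ratio: float) -> float:
    """Score 1.0 when enough answer tokens also occur in the contexts."""
    answer = output.get("answer")
    contexts = output.get("contexts")
    if not isinstance(answer, str) or not isinstance(contexts, list) or not contexts:
        return 0.0
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = {
        token for text in contexts if isinstance(text, str) for token in _tokens(text)
    }
    ratio = sum(token in context_vocab for token in answer_tokens) / len(answer_tokens)
    return 1.0 if ratio >= min_ratio else 0.0
```

The sketch makes the stated limitation easy to see: it counts shared words, so a paraphrase with different vocabulary scores low even when it is faithful.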
514
+
515
+ Use `answer_contains_citation()` to check that an answer cites a source, also
516
+ without an LLM judge:
517
+
518
+ ```python
519
+ from bir.evals import answer_contains_citation
520
+
521
+ evaluators = [
522
+ answer_contains_citation(),
523
+ ]
524
+ ```
525
+
526
+ `answer_contains_citation()` reads a plain answer string or the `answer` field
527
+ of a structured RAG output (`{"answer": "...", "contexts": [...]}`) and scores
528
+ `1.0` when the answer contains a citation marker. By default any bracketed
529
+ marker such as `[1]` or `[doc-1]` counts; pass `pattern` to require a custom
530
+ citation format such as `pattern=r"\(\d+\)"` for markers like `(1)`. This is a
531
+ deterministic format check, not proof that the citation is correct or that the
532
+ cited source supports the answer. Non-text output or a missing `answer` produces
533
+ a `0.0` score with failure metadata instead of failing the experiment.
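
The citation check reduces to a regular-expression search over the answer text. In the sketch below, `citation_score` is a hypothetical name and `DEFAULT_PATTERN` is an assumed approximation of "any bracketed marker"; the evaluator's real default pattern is not shown in this README.

```python
import re
from typing import Any

# Assumed default: any non-empty bracketed marker such as [1] or [doc-1].
DEFAULT_PATTERN = r"\[[^\[\]]+\]"


def citation_score(output: Any, pattern: str = DEFAULT_PATTERN) -> float:
    """Score 1.0 when the answer text contains a citation marker."""
    # Accept either a plain answer string or a structured output with "answer".
    answer = output.get("answer") if isinstance(output, dict) else output
    if not isinstance(answer, str):
        return 0.0
    return 1.0 if re.search(pattern, answer) else 0.0
```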
534
+
535
+ ## Event Loading
536
+
537
+ `load_events()` validates JSONL records against the current event schema and
538
+ raises `ValueError` for malformed rows, unsupported event types, invalid
539
+ timestamps, or unsupported schema versions.
540
+
541
+ To write traces somewhere else:
542
+
543
+ ```python
544
+ configure(trace_path="tmp/bir-traces.jsonl")
545
+ ```
546
+
547
+ ## Development
548
+
549
+ Run the SDK unit tests from this directory:
550
+
551
+ ```bash
552
+ PYTHONPATH=src ../../.venv/bin/python -m unittest discover -s tests
553
+ ```
554
+
555
+ Or install the package with test dependencies and run pytest:
556
+
557
+ ```bash
558
+ python3 -m pip install -e ".[dev]"
559
+ pytest
560
+ ```
561
+
562
+ Run repository type checking from the repository root:
563
+
564
+ ```bash
565
+ ./.venv/bin/pyright
566
+ ```
567
+
568
+ Run the release verification script from the repository root before publishing:
569
+
570
+ ```bash
571
+ ./.venv/bin/python scripts/verify_release.py
572
+ ```
573
+
574
+ The script builds a temporary SDK wheel, installs it in a fresh temporary
575
+ virtual environment, and smoke-tests local tracing and retrieval without writing
576
+ build artifacts into the repository.
577
+
578
+ Release planning lives in `CHANGELOG.md` and `docs/SDK_RELEASE_CHECKLIST.md`.
579
+
580
+ ## License
581
+
582
+ Bir is source-available under the Functional Source License 1.1 with Apache 2.0
583
+ as the future license (`FSL-1.1-ALv2`). FSL is not an OSI-approved open source
584
+ license.