regression-substrate 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- regression_substrate-0.1.0/.gitignore +24 -0
- regression_substrate-0.1.0/CHANGES.md +39 -0
- regression_substrate-0.1.0/LICENSE +21 -0
- regression_substrate-0.1.0/PKG-INFO +104 -0
- regression_substrate-0.1.0/README.md +63 -0
- regression_substrate-0.1.0/adapters.py +185 -0
- regression_substrate-0.1.0/data/gold.jsonl +14 -0
- regression_substrate-0.1.0/data/responses.csv +25 -0
- regression_substrate-0.1.0/diff_engine.py +350 -0
- regression_substrate-0.1.0/examples/gold.jsonl +14 -0
- regression_substrate-0.1.0/examples/responses.csv +25 -0
- regression_substrate-0.1.0/gold.py +108 -0
- regression_substrate-0.1.0/ingest.py +255 -0
- regression_substrate-0.1.0/otel_exporter.py +220 -0
- regression_substrate-0.1.0/otel_spec.md +107 -0
- regression_substrate-0.1.0/pyproject.toml +35 -0
- regression_substrate-0.1.0/requirements.txt +2 -0
- regression_substrate-0.1.0/run_evaluation.py +181 -0
- regression_substrate-0.1.0/sequential_gate.py +243 -0
- regression_substrate-0.1.0/src/regression_substrate/__init__.py +36 -0
- regression_substrate-0.1.0/src/regression_substrate/adapters.py +185 -0
- regression_substrate-0.1.0/src/regression_substrate/cli.py +108 -0
- regression_substrate-0.1.0/src/regression_substrate/diff_engine.py +350 -0
- regression_substrate-0.1.0/src/regression_substrate/gold.py +108 -0
- regression_substrate-0.1.0/src/regression_substrate/ingest.py +255 -0
- regression_substrate-0.1.0/src/regression_substrate/otel_exporter.py +220 -0
- regression_substrate-0.1.0/src/regression_substrate/sequential_gate.py +243 -0
- regression_substrate-0.1.0/tests/test_gate.py +69 -0
@@ -0,0 +1,24 @@
+# Build artifacts
+dist/
+build/
+*.egg-info/
+src/*.egg-info/
+
+# Python
+__pycache__/
+*.pyc
+*.pyo
+
+# Test / output
+.pytest_cache/
+out/
+out_test/
+out_test2/
+
+# IDE
+.vscode/
+.idea/
+
+# Env
+.venv/
+*.env
@@ -0,0 +1,39 @@
+# Changes — closing the deployment gaps
+
+Four loopholes were raised for moving from a local runnable ZIP to a live
+deployment. Here is what changed and what is verified.
+
+## 1. Ephemeral martingale state → persistence boundary (sequential_gate.py)
+`SequentialGate` now takes a pluggable `Backend`. The default is in-memory; in
+production you back `append_event` with a durable append-only store (Kafka /
+Postgres) and `save_checkpoint`/`load_checkpoint` with a fast cache (Redis)
+keyed by `(stream, epoch)`. A cold-booting pod resumes from the checkpoint in
+O(1) instead of replaying the whole log, while the log stays the audit source of
+truth. **Verified:** a fresh gate sharing a backend resumes a stream's capital
+exactly, and `replay()` matches the checkpoint. **Not done here:** the actual
+Redis/Postgres `Backend` subclass — it's a drop-in against the documented
+interface, but needs your infra to test.
+
+## 2. Gold-set concept drift → rolling gold + drift detection (gold.py, new)
+`RollingGoldSet` (FIFO, oldest labels roll out), `sample_for_labeling` (routes a
+random fraction of live traffic to a human queue so the gold set tracks the live
+distribution), and `drift_report` (compares judge agreement and error_sd on the
+older vs most-recent half at a fixed threshold; flags `drift_warning` or
+`judge_inadmissible`). **Verified:** a judge that silently breaks on a new query
+type is caught (kappa 1.0 → 0.0, error_sd 0.10 → 0.41). **Not done here:** wiring
+the sample queue to a real labeling tool.
+
+## 3. TF-IDF too brittle → pluggable embedder (ingest.py)
+`auto_cluster` now takes an `embedder`. Default `tfidf_embedder` stays offline;
+`sentence_transformer_embedder("all-MiniLM-L6-v2")` is a one-line swap for
+semantically-close failure modes. **Verified:** the TF-IDF path. **Not done
+here:** the sentence-transformer path — the model download needs network access
+to the model hub, which this build environment blocks; verified the interface,
+not the weights.
+
+## 4. Small-N bootstrap instability → power floor (diff_engine.py)
+`gate()` gained `min_n` (default 30). Below it the gate returns `HOLD`
+(insufficient power) instead of risking a `SHIP`/`REGRESSION` on a bootstrap
+that underestimates variance. **Verified:** N=8 → HOLD, N=40 → REGRESSION on the
+same effect size. The bundled sample (N=6) now reports `underpowered` and relaxes
+the floor for illustration only; real data keeps the default 30.
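The persistence boundary described in section 1 of the CHANGES.md hunk above can be sketched as follows. This is a minimal illustration, not the package's code: `InMemoryBackend` and `ToyGate` are hypothetical names, the method signatures are assumed from the prose (`append_event`, `save_checkpoint`, `load_checkpoint`, keyed by `(stream, epoch)`), and the capital process is reduced to a running product of bet factors.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InMemoryBackend:
    """Hypothetical stand-in for the pluggable Backend named in CHANGES.md."""
    log: list[tuple[str, int, float]] = field(default_factory=list)
    checkpoints: dict[tuple[str, int], float] = field(default_factory=dict)

    def append_event(self, stream: str, epoch: int, factor: float) -> None:
        # Append-only audit log; in production this would be a durable store.
        self.log.append((stream, epoch, factor))

    def save_checkpoint(self, stream: str, epoch: int, capital: float) -> None:
        # Latest capital per (stream, epoch); in production a fast cache.
        self.checkpoints[(stream, epoch)] = capital

    def load_checkpoint(self, stream: str, epoch: int) -> float | None:
        return self.checkpoints.get((stream, epoch))


class ToyGate:
    """Toy capital process (assumption): capital is a product of bet factors."""

    def __init__(self, backend: InMemoryBackend, stream: str, epoch: int = 0):
        self.backend, self.stream, self.epoch = backend, stream, epoch
        saved = backend.load_checkpoint(stream, epoch)
        # Cold boot: resume from the checkpoint without reading the log.
        self.capital = 1.0 if saved is None else saved

    def update(self, factor: float) -> float:
        self.backend.append_event(self.stream, self.epoch, factor)
        self.capital *= factor
        self.backend.save_checkpoint(self.stream, self.epoch, self.capital)
        return self.capital

    def replay(self) -> float:
        # Audit path: recompute capital from the full log for this stream.
        capital = 1.0
        for stream, epoch, factor in self.backend.log:
            if (stream, epoch) == (self.stream, self.epoch):
                capital *= factor
        return capital
```

A second `ToyGate` built on the same backend starts from the saved capital, and `replay()` recomputes the same value from the log, which is the property the changelog reports as verified.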
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 [Your Name or Organization]
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,104 @@
+Metadata-Version: 2.4
+Name: regression-substrate
+Version: 0.1.0
+Summary: A statistically rigorous CI gate for AI: treats model outputs as distributions, penalizes unreliable judges, and decides ship / hold / regression.
+License: MIT License
+
+Copyright (c) 2026 [Your Name or Organization]
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+License-File: LICENSE
+Keywords: ci,evaluation,llm,mlops,regression-testing
+Requires-Python: >=3.10
+Requires-Dist: numpy>=1.24
+Requires-Dist: scipy>=1.10
+Provides-Extra: clustering
+Requires-Dist: scikit-learn>=1.2; extra == 'clustering'
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0; extra == 'dev'
+Requires-Dist: scikit-learn>=1.2; extra == 'dev'
+Provides-Extra: embeddings
+Requires-Dist: sentence-transformers>=2.2; extra == 'embeddings'
+Provides-Extra: langsmith
+Requires-Dist: langsmith>=0.1; extra == 'langsmith'
+Description-Content-Type: text/markdown
+
+# regression-substrate
+
+A statistically rigorous CI gate for AI systems. It treats model outputs as
+distributions, penalizes unreliable judges, and returns a `SHIP` / `HOLD` /
+`REGRESSION` verdict you can block a pull request on.
+
+## Install
+
+```bash
+pip install regression-substrate # core (numpy, scipy)
+pip install "regression-substrate[clustering]" # + auto_cluster (scikit-learn)
+pip install "regression-substrate[langsmith]" # + LangSmith adapter
+```
+
+For development (editable install with test dependencies):
+
+```bash
+git clone <repo-url>
+cd regression-substrate
+pip install -e ".[dev]"
+```
+
+## CLI (drop into CI)
+
+```bash
+regsub --data evals.csv --gold gold.jsonl --version-a v1 --version-b v2 --out out/
+# exit 0 = SHIP / SHIP_WITH_FLAGS ; 1 = REGRESSION / HOLD ; 2 = JUDGE_INADMISSIBLE
+```
+
+One line in your CI pipeline blocks the PR on a regression.
+
+## Library
+
+```python
+from regression_substrate import gate, load_from_csv, Judge
+
+judge = Judge(my_llm_scorer) # any (input, response) -> [0,1]
+cal = judge.calibrate(gold_records) # -> kappa, error_sd
+sa, sb, cids, meta = load_from_csv("evals.csv", "v1", "v2")
+decision = gate(sa, sb, cids, judge_error_sd=cal["error_sd"], kappa=cal["kappa"])
+print(decision.verdict)
+```
+
+## What's inside
+
+| Module | Purpose |
+|---|---|
+| `diff_engine` | Offline gate: variance components, bootstrap CI, cluster scan, BH/e-BH |
+| `ingest` | Loaders (JSONL, CSV), judge harness, auto-clustering |
+| `sequential_gate` | Always-valid martingale monitor for continuous deployment |
+| `gold` | Rolling gold set, drift detection, forced sampling for labeling |
+| `adapters` | Vendor flatteners (LangSmith preset) |
+| `otel_exporter` | OTel-aligned span capture path |
+| `cli` | The `regsub` console command |
+
+## Running tests
+
+```bash
+pip install -e ".[dev]"
+pytest
+```
+
+See `examples/` for a runnable dataset and `CHANGES.md` for design decisions.
@@ -0,0 +1,63 @@
+# regression-substrate
+
+A statistically rigorous CI gate for AI systems. It treats model outputs as
+distributions, penalizes unreliable judges, and returns a `SHIP` / `HOLD` /
+`REGRESSION` verdict you can block a pull request on.
+
+## Install
+
+```bash
+pip install regression-substrate # core (numpy, scipy)
+pip install "regression-substrate[clustering]" # + auto_cluster (scikit-learn)
+pip install "regression-substrate[langsmith]" # + LangSmith adapter
+```
+
+For development (editable install with test dependencies):
+
+```bash
+git clone <repo-url>
+cd regression-substrate
+pip install -e ".[dev]"
+```
+
+## CLI (drop into CI)
+
+```bash
+regsub --data evals.csv --gold gold.jsonl --version-a v1 --version-b v2 --out out/
+# exit 0 = SHIP / SHIP_WITH_FLAGS ; 1 = REGRESSION / HOLD ; 2 = JUDGE_INADMISSIBLE
+```
+
+One line in your CI pipeline blocks the PR on a regression.
+
+## Library
+
+```python
+from regression_substrate import gate, load_from_csv, Judge
+
+judge = Judge(my_llm_scorer) # any (input, response) -> [0,1]
+cal = judge.calibrate(gold_records) # -> kappa, error_sd
+sa, sb, cids, meta = load_from_csv("evals.csv", "v1", "v2")
+decision = gate(sa, sb, cids, judge_error_sd=cal["error_sd"], kappa=cal["kappa"])
+print(decision.verdict)
+```
+
+## What's inside
+
+| Module | Purpose |
+|---|---|
+| `diff_engine` | Offline gate: variance components, bootstrap CI, cluster scan, BH/e-BH |
+| `ingest` | Loaders (JSONL, CSV), judge harness, auto-clustering |
+| `sequential_gate` | Always-valid martingale monitor for continuous deployment |
+| `gold` | Rolling gold set, drift detection, forced sampling for labeling |
+| `adapters` | Vendor flatteners (LangSmith preset) |
+| `otel_exporter` | OTel-aligned span capture path |
+| `cli` | The `regsub` console command |
+
+## Running tests
+
+```bash
+pip install -e ".[dev]"
+pytest
+```
+
+See `examples/` for a runnable dataset and `CHANGES.md` for design decisions.
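The exit-code contract in the README's CLI section above could be mirrored in a CI wrapper roughly like this. The verdict names and codes come from the README; `exit_code_for` and the decision to reject unknown verdicts are assumptions made for illustration, not part of the package.

```python
# Verdict -> process exit code, as documented in the README's CLI section.
EXIT_CODES = {
    "SHIP": 0,
    "SHIP_WITH_FLAGS": 0,
    "REGRESSION": 1,
    "HOLD": 1,
    "JUDGE_INADMISSIBLE": 2,
}


def exit_code_for(verdict: str) -> int:
    """Return the exit code for a verdict (hypothetical helper).

    Unknown verdicts raise instead of defaulting to 0, so a typo or a newly
    added verdict cannot silently pass CI. That policy is an assumption.
    """
    if verdict not in EXIT_CODES:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return EXIT_CODES[verdict]
```

A script that calls the library's `gate()` directly could finish with `sys.exit(exit_code_for(decision.verdict))` to reproduce the CLI behaviour; note that `HOLD` blocks the pipeline the same way `REGRESSION` does.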
@@ -0,0 +1,185 @@
+"""
+adapters.py — pull from vendor observability platforms into the 7-field schema.
+
+HONESTY NOTE — READ THIS:
+  * The FLATTENING logic below is tested (see __main__) against a synthetic
+    fixture shaped like a vendor's documented run/feedback model. That transform
+    is proven.
+  * The LIVE FETCH (`load_from_langsmith`) is SDK- and auth-dependent and is NOT
+    exercised here. Vendor schemas drift, so the field paths in the presets are
+    best-effort and MUST be verified against the platform's current docs before
+    you trust them in production.
+
+Design: don't hardwire any one vendor. A `TraceMap` says where each field lives
+inside one run record; `flatten_runs` collapses (possibly nested, multi-step)
+runs + feedback into flat 7-field records that ingest.assemble_records consumes.
+A vendor is then just a TraceMap preset plus a thin fetch wrapper.
+"""
+
+from __future__ import annotations
+from dataclasses import dataclass
+import json
+
+
+def _dig(obj, path: str, default=None):
+    """Read a nested field by dotted path, e.g. 'extra.metadata.version'."""
+    cur = obj
+    for key in path.split("."):
+        if isinstance(cur, dict) and key in cur:
+            cur = cur[key]
+        else:
+            return default
+    return cur
+
+
+def _canon(x) -> str:
+    """Canonicalize an input/output payload to a stable string. The `input`
+    string is what groups replicates and pairs versions, so it must be stable."""
+    if isinstance(x, str):
+        return x
+    if isinstance(x, dict):
+        for k in ("input", "question", "query", "text", "prompt",
+                  "output", "answer", "result", "response"):
+            if isinstance(x.get(k), str):
+                return x[k]
+        return json.dumps(x, sort_keys=True)
+    return str(x)
+
+
+@dataclass
+class TraceMap:
+    """Where the fields live inside one vendor run record."""
+    input_path: str
+    output_path: str
+    version_path: str  # MUST have been logged by the team; no version => unpairable
+    score_key: str  # which feedback key carries the quality score
+    run_type_path: str = "run_type"
+    parent_path: str = "parent_run_id"
+    id_path: str = "id"
+    cluster_path: str | None = None
+    score_scale: tuple = (0.0, 1.0)
+
+
+# Best-effort preset. VERIFY these paths against current LangSmith docs.
+LANGSMITH = TraceMap(
+    input_path="inputs",
+    output_path="outputs",
+    version_path="extra.metadata.version",
+    score_key="quality",
+    run_type_path="run_type",
+    parent_path="parent_run_id",
+    id_path="id",
+    cluster_path="extra.metadata.cluster",
+    score_scale=(0.0, 1.0),
+)
+
+
+def flatten_runs(runs: list[dict], feedback: list[dict], tmap: TraceMap,
+                 unit: str = "root") -> list[dict]:
+    """Collapse runs + feedback into flat 7-field records.
+
+    unit="root" -> evaluate whole trajectories (input=root input, response=root
+                   output). Child LLM/tool runs are diagnostic and dropped.
+    unit=<type> -> evaluate a component instead (e.g. "retriever", "llm").
+    """
+    by_run: dict[str, dict] = {}
+    for f in feedback:
+        by_run.setdefault(f["run_id"], {})[f["key"]] = f.get("score")
+
+    lo, hi = tmap.score_scale
+    skipped_no_version = skipped_no_score = 0
+    records = []
+    for r in runs:
+        is_root = _dig(r, tmap.parent_path) is None
+        if unit == "root":
+            if not is_root:
+                continue
+        elif _dig(r, tmap.run_type_path) != unit:
+            continue
+
+        raw = (by_run.get(_dig(r, tmap.id_path)) or {}).get(tmap.score_key)
+        if raw is None:
+            skipped_no_score += 1
+            continue
+        version = _dig(r, tmap.version_path)
+        if version is None:
+            skipped_no_version += 1
+            continue
+
+        score = (float(raw) - lo) / (hi - lo) if hi != lo else float(raw)
+        rec = {
+            "input": _canon(_dig(r, tmap.input_path)),
+            "version": str(version),
+            "response": _canon(_dig(r, tmap.output_path)),
+            "score": max(0.0, min(1.0, score)),
+        }
+        if tmap.cluster_path:
+            c = _dig(r, tmap.cluster_path)
+            if c is not None:
+                rec["cluster"] = c
+        records.append(rec)
+
+    if skipped_no_version:
+        print(f" [adapter] WARNING: skipped {skipped_no_version} runs with no "
+              f"version tag at '{tmap.version_path}' -- they cannot be paired.")
+    if skipped_no_score:
+        print(f" [adapter] note: skipped {skipped_no_score} runs with no "
+              f"'{tmap.score_key}' feedback.")
+    return records
+
+
+def load_from_langsmith(project: str, version_a: str, version_b: str,
+                        tmap: TraceMap = LANGSMITH, unit: str = "root"):
+    """LIVE fetch + flatten + assemble. NOT exercised in this repo -- requires
+    the `langsmith` SDK and LANGSMITH_API_KEY, and the SDK call signatures and
+    schema below must be verified against current LangSmith docs."""
+    from langsmith import Client  # raises if not installed
+    from ingest import assemble_records
+
+    client = Client()
+    runs = [r.dict() for r in client.list_runs(project_name=project)]
+    run_ids = [r["id"] for r in runs]
+    feedback = [f.dict() for f in client.list_feedback(run_ids=run_ids)]
+    records = flatten_runs(runs, feedback, tmap, unit=unit)
+    return assemble_records(records, version_a, version_b)
+
+
+# --------------------------------------------------------------------------- #
+# Demo: flatten a SYNTHETIC LangSmith-shaped fixture -> gate(). Proves the
+# transform (nested trajectory collapse, version extraction, score scaling,
+# replicate derivation) without touching the live API.
+# --------------------------------------------------------------------------- #
+
+if __name__ == "__main__":
+    import numpy as np
+    from ingest import assemble_records, validate_records
+    from diff_engine import gate
+
+    rng = np.random.default_rng(0)
+    inputs = ([("billing", f"billing question {i}") for i in range(3)] +
+              [("general", f"general question {i}") for i in range(3)])
+
+    runs, feedback, rid = [], [], 0
+    for cluster, q in inputs:
+        for ver, base in [("v1", 0.90), ("v2", 0.20 if cluster == "billing" else 0.78)]:
+            for _rep in range(2):  # two replicates per (input, version)
+                root = f"run-{rid}"; rid += 1
+                runs.append({"id": root, "run_type": "chain", "parent_run_id": None,
+                             "inputs": {"question": q}, "outputs": {"answer": "..."},
+                             "extra": {"metadata": {"version": ver, "cluster": cluster}}})
+                child = f"run-{rid}"; rid += 1  # a nested LLM step (must be dropped)
+                runs.append({"id": child, "run_type": "llm", "parent_run_id": root,
+                             "inputs": {}, "outputs": {}, "extra": {"metadata": {"version": ver}}})
+                score = float(np.clip(base + rng.normal(0, 0.03), 0, 1))
+                feedback.append({"run_id": root, "key": "quality", "score": round(score, 3)})
+
+    records = flatten_runs(runs, feedback, LANGSMITH, unit="root")
+    print(f"FLATTEN: {len(runs)} raw runs -> {len(records)} flat records "
+          f"(child runs dropped under unit='root')")
+    print(" sample:", records[0])
+    print(" validation problems:", validate_records(records) or "none")
+
+    sa, sb, cids, meta = assemble_records(records, "v1", "v2")
+    print(" assembled:", meta)
+    dec = gate(sa, sb, cids, judge_error_sd=0.05, kappa=0.78, alpha=0.05)
+    print(" verdict:", dec.verdict, "| CI:", tuple(round(x, 3) for x in dec.delta_ci))
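To make the score-scaling step in `flatten_runs` concrete, here is a self-contained sketch for a vendor that reports feedback on a 1 to 5 scale. It re-implements the dotted-path lookup and the rescaling arithmetic from the adapters.py hunk above rather than importing the package; `dig`, `rescale`, and the sample run record are illustrative names and data, not package API.

```python
def dig(obj, path, default=None):
    """Follow a dotted path through nested dicts; return default on a miss."""
    cur = obj
    for key in path.split("."):
        if isinstance(cur, dict) and key in cur:
            cur = cur[key]
        else:
            return default
    return cur


def rescale(raw, scale=(0.0, 1.0)):
    """Map a raw score from [lo, hi] onto [0, 1], clamping out-of-range values."""
    lo, hi = scale
    score = (float(raw) - lo) / (hi - lo) if hi != lo else float(raw)
    return max(0.0, min(1.0, score))


run = {"id": "run-0", "extra": {"metadata": {"version": "v2"}}}  # made-up record
print(dig(run, "extra.metadata.version"))   # v2
print(dig(run, "extra.metadata.cluster"))   # None (path missing)
print(rescale(4, scale=(1, 5)))             # 0.75
print(rescale(7, scale=(1, 5)))             # 1.0 (clamped)
```

With the real `TraceMap`, the equivalent configuration would be a preset whose `score_scale` is `(1.0, 5.0)`; a missing version path returns `None`, which is the case `flatten_runs` counts and skips.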
@@ -0,0 +1,14 @@
+{"input": "refund question", "response": "I'm sorry, I'll issue a refund within 5-7 business days.", "human": 0.9}
+{"input": "billing question", "response": "Please contact billing.", "human": 0.2}
+{"input": "password question", "response": "You can reset your password using the link we email you.", "human": 0.85}
+{"input": "billing question", "response": "Check your bill.", "human": 0.25}
+{"input": "hours question", "response": "We're available 24/7 to help.", "human": 0.8}
+{"input": "vague question", "response": "Maybe.", "human": 0.05}
+{"input": "email question", "response": "Head to account settings to update your email.", "human": 0.85}
+{"input": "vague question", "response": "It depends.", "human": 0.15}
+{"input": "duplicate question", "response": "I'll refund the duplicate charge to your account right away.", "human": 0.9}
+{"input": "overcharge question", "response": "Look at your invoice.", "human": 0.3}
+{"input": "email question", "response": "Sure — you can update your email in settings.", "human": 0.8}
+{"input": "apology question", "response": "No idea, sorry.", "human": 0.2}
+{"input": "hours question", "response": "We're open 24/7.", "human": 0.6}
+{"input": "reset question", "response": "I'm happy to help reset your account.", "human": 0.75}
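Gold records of this shape are what the README's `judge.calibrate(gold_records)` call turns into `kappa` and `error_sd`. The package's actual computation is not shown in this part of the diff, so the following is only one plausible reading: Cohen's kappa on scores binarized at a fixed threshold, and the population standard deviation of judge-minus-human errors. The function name `calibrate_sketch`, the threshold of 0.5, and the toy length-based judge are all assumptions.

```python
import json
from statistics import pstdev


def calibrate_sketch(gold_lines, judge, threshold=0.5):
    """Compare a judge to human labels on gold records (illustrative only).

    Both metric definitions below are assumptions; the package may differ.
    """
    records = [json.loads(line) for line in gold_lines if line.strip()]
    human = [r["human"] for r in records]
    judged = [judge(r["input"], r["response"]) for r in records]

    # Binarize at the threshold, then compute Cohen's kappa.
    h = [s >= threshold for s in human]
    j = [s >= threshold for s in judged]
    n = len(records)
    observed = sum(a == b for a, b in zip(h, j)) / n
    p_h, p_j = sum(h) / n, sum(j) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    # If chance agreement is 1, kappa is undefined; 0.0 is a conservative
    # convention chosen for this sketch.
    kappa = 0.0 if expected == 1 else (observed - expected) / (1 - expected)

    error_sd = pstdev([jv - hv for jv, hv in zip(judged, human)])
    return {"kappa": kappa, "error_sd": error_sd}


gold = [
    '{"input": "billing question", "response": "Please contact billing.", "human": 0.2}',
    '{"input": "hours question", "response": "We\'re available 24/7 to help.", "human": 0.8}',
    '{"input": "vague question", "response": "Maybe.", "human": 0.05}',
]
# Toy judge (assumption): longer answers score higher, capped at 1.0.
print(calibrate_sketch(gold, lambda inp, resp: min(1.0, len(resp) / 40)))
```

A judge that tracks the human labels closely yields kappa near 1 and a small `error_sd`; the drift check described in CHANGES.md compares these two numbers between the older and newer halves of the gold set.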
@@ -0,0 +1,25 @@
+input,version,replicate,cluster,response
+How do I get a refund for a double charge?,v1,0,billing,"I'm sorry about the double charge — I can issue a refund, which posts in 5-7 business days."
+How do I get a refund for a double charge?,v1,1,billing,"Happy to help: we'll refund the double charge, usually within 5-7 business days."
+How do I get a refund for a double charge?,v2,0,billing,Please contact billing.
+How do I get a refund for a double charge?,v2,1,billing,That's a billing issue.
+Why was I billed twice this month?,v1,0,billing,"It looks like a duplicate charge — I'll refund the extra payment right away."
+Why was I billed twice this month?,v1,1,billing,"Sorry about that; that's a duplicate, and I'll refund it to your account."
+Why was I billed twice this month?,v2,0,billing,Possibly a duplicate.
+Why was I billed twice this month?,v2,1,billing,Check your bill.
+"I think I was overcharged, what should I do?",v1,0,billing,"I can review the overcharge and refund the difference to your account."
+"I think I was overcharged, what should I do?",v1,1,billing,"Sorry for the overcharge — I'll refund the difference in a few business days."
+"I think I was overcharged, what should I do?",v2,0,billing,Overcharges can happen.
+"I think I was overcharged, what should I do?",v2,1,billing,Look at your invoice.
+What are your support hours?,v1,0,general,"We're available 24/7 to help, any day of the week."
+What are your support hours?,v1,1,general,"Our team is here 24/7, including weekends."
+What are your support hours?,v2,0,general,Support is 24/7.
+What are your support hours?,v2,1,general,We're open 24/7.
+How do I reset my password?,v1,0,general,"To reset your password, click the reset link we email you."
+How do I reset my password?,v1,1,general,Use the password reset link we send to your email address.
+How do I reset my password?,v2,0,general,Use the reset link.
+How do I reset my password?,v2,1,general,Reset it from the login page.
+Where do I update my email address?,v1,0,general,You can update your email under Account settings.
+Where do I update my email address?,v1,1,general,Head to settings to change your email address.
+Where do I update my email address?,v2,0,general,Update it in settings.
+Where do I update my email address?,v2,1,general,Under account settings.
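The CSV above carries one row per `(input, version, replicate)`, so a loader has to group replicates and then keep only inputs that appear under both versions. The sketch below shows that grouping with the standard `csv` module on a four-row excerpt; `group_replicates` and `paired_inputs` are hypothetical helpers written for illustration and are not the package's `load_from_csv`.

```python
import csv
import io
from collections import defaultdict

# Four rows copied from the responses.csv hunk above.
SAMPLE = '''input,version,replicate,cluster,response
What are your support hours?,v1,0,general,"We're available 24/7 to help, any day of the week."
What are your support hours?,v1,1,general,"Our team is here 24/7, including weekends."
What are your support hours?,v2,0,general,Support is 24/7.
What are your support hours?,v2,1,general,We're open 24/7.
'''


def group_replicates(text):
    """Group responses by (input, version); each list holds the replicates."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[(row["input"], row["version"])].append(row["response"])
    return dict(groups)


def paired_inputs(groups, version_a, version_b):
    """Inputs present under both versions, i.e. the ones that can be paired."""
    inputs_a = {inp for inp, ver in groups if ver == version_a}
    inputs_b = {inp for inp, ver in groups if ver == version_b}
    return sorted(inputs_a & inputs_b)


groups = group_replicates(SAMPLE)
for (inp, ver), responses in groups.items():
    print(ver, len(responses), "replicates for:", inp)
print(paired_inputs(groups, "v1", "v2"))
```

Note that the file has no score column: responses are scored later by the judge, so pairing happens on the `input` string alone, which is why the adapters code insists that string be stable.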