vrty 1.0.0__tar.gz

vrty-1.0.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Sundeyp Singh
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
vrty-1.0.0/PKG-INFO ADDED
@@ -0,0 +1,383 @@
1
+ Metadata-Version: 2.4
2
+ Name: vrty
3
+ Version: 1.0.0
4
+ Summary: Deterministic LLM-output quality scoring in milliseconds. No AI judge in the loop.
5
+ License-Expression: MIT
6
+ Project-URL: Homepage, https://github.com/sundeyp/vrty
7
+ Project-URL: Repository, https://github.com/sundeyp/vrty
8
+ Project-URL: Issues, https://github.com/sundeyp/vrty/issues
9
+ Keywords: llm,evaluation,scoring,tf-idf,deterministic
10
+ Classifier: Development Status :: 5 - Production/Stable
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Programming Language :: Python :: 3
13
+ Classifier: Programming Language :: Python :: 3.11
14
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
15
+ Classifier: Topic :: Text Processing :: Linguistic
16
+ Requires-Python: <3.12,>=3.11
17
+ Description-Content-Type: text/markdown
18
+ License-File: LICENSE
19
+ Provides-Extra: dev
20
+ Requires-Dist: pytest==8.3.3; extra == "dev"
21
+ Dynamic: license-file
22
+
23
+ # VRTY
24
+
25
+ [![CI](https://github.com/sundeyp/vrty/actions/workflows/vrty.yml/badge.svg)](https://github.com/sundeyp/vrty/actions/workflows/vrty.yml)
26
+ [![PyPI](https://img.shields.io/pypi/v/vrty.svg)](https://pypi.org/project/vrty/)
27
+ [![Python 3.11](https://img.shields.io/badge/python-3.11.9-blue.svg)](https://www.python.org/downloads/release/python-3119/)
28
+ [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
29
+ [![Runtime deps: 0](https://img.shields.io/badge/runtime%20deps-0-brightgreen.svg)](pyproject.toml)
30
+
31
+ **The deterministic, zero-dependency LLM evaluator. Sub-millisecond on short responses, no API key, byte-identical across runs.**
32
+
33
+ *A stdlib alternative to ROUGE for no-reference scoring, and a sanity layer
34
+ in front of GPT-as-judge when reproducibility matters.*
35
+
36
+ VRTY scores a `(prompt, response)` pair on four standard, auditable
37
+ dimensions and returns a single composite plus a per-dimension breakdown.
38
+ Every formula is a textbook formula you can verify against a reference in
39
+ five minutes. There is no LLM call anywhere in the scoring path.
40
+
41
+ > **What VRTY does not do.** VRTY measures *surface text properties* —
42
+ > vocabulary overlap, sentence flow, term coverage, information density.
43
+ > **It does not check whether the answer is true.** A confident wrong answer
44
+ > that echoes the prompt's vocabulary will score *higher* than a correct
45
+ > one-word answer (see [Known properties and limitations](#known-properties-and-limitations):
46
+ > `"London is the capital of France."` scores 0.879; `"Paris."` scores 0.350).
47
+ > Use VRTY to catch malformed, off-topic, or padded output; pair it with a
48
+ > fact-check or human review when correctness matters.
49
+
50
+ ```python
51
+ from vrty import score
52
+ result = score("What is the capital of France?", "Paris is the capital of France.")
53
+ print(result.composite) # 0.8653358523094898
54
+ print(result.explanations["relevance"]) # Relevance: 0.83 - response strongly overlaps with the prompt's key terms.
55
+ ```
56
+
57
+ That is the entire 60-second example. Four lines, runs as-is, returns a
58
+ score. No configuration, no API key.
59
+
60
+ > **About that 0.865.** That number is what *factoid* prompts look like —
61
+ > short prompt, short answer, heavy vocabulary overlap. Open-ended prompts
62
+ > (customer support, instruction-following, prose drafts) typically score
63
+ > **0.20 – 0.40** because the response is *expected* not to echo prompt
64
+ > vocabulary. VRTY is calibrated *relative to a fixed prompt*, not as an
65
+ > absolute quality threshold. See [Calibration bands](#calibration-bands)
66
+ > below before setting CI gates.
67
+
68
+ ---
69
+
70
+ ## Install
71
+
72
+ ```sh
73
+ pip install vrty
74
+ ```
75
+
76
+ Or from source:
77
+
78
+ ```sh
79
+ git clone https://github.com/sundeyp/vrty
80
+ cd vrty
81
+ pip install -e .
82
+ ```
83
+
84
+ Determinism is guaranteed only on the pinned interpreter (Python 3.11.9)
85
+ and pinned dependency set. The scoring path has **zero third-party runtime
86
+ dependencies** — everything is Python stdlib. See [Determinism](#determinism)
87
+ below.
88
+
89
+ ---
90
+
91
+ ## The four dimensions
92
+
93
+ | Dimension | Formula | What it measures |
94
+ |---|---|---|
95
+ | **Relevance** | TF·IDF weighted cosine similarity between prompt and response | How much the response's content overlaps the prompt's content |
96
+ | **Coherence** | Mean cosine similarity of adjacent-sentence TF·IDF vectors | How much each sentence shares with the next (topical flow) |
97
+ | **Completeness** | IDF-weighted fraction of prompt content terms that appear in the response | How many of the prompt's key terms are addressed |
98
+ | **Conciseness** | `|unique content tokens| / |total tokens|` (content-word type–token ratio) | Information density vs padding |
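The relevance and completeness rows can be illustrated with a minimal sketch. It is not VRTY's implementation: the `TOY_IDF` table, the pre-tokenized inputs, and the helper names `tfidf`, `cosine`, and `completeness` are all hypothetical, chosen only to show the shape of a TF·IDF cosine and an IDF-weighted coverage fraction.

```python
import math
from collections import Counter

# Hypothetical toy IDF weights, for illustration only.
# VRTY ships its own corpus-derived table.
TOY_IDF = {"capital": 2.0, "france": 3.0, "paris": 3.5, "london": 3.2}


def tfidf(tokens, idf):
    """Map each token to its term frequency times its IDF weight."""
    counts = Counter(tokens)
    return {tok: counts[tok] * idf[tok] for tok in sorted(counts)}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in sorted(u.keys() & v.keys()))
    norm_u = math.sqrt(sum(u[t] ** 2 for t in sorted(u)))
    norm_v = math.sqrt(sum(v[t] ** 2 for t in sorted(v)))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def completeness(prompt_tokens, response_tokens, idf):
    """IDF-weighted share of distinct prompt terms found in the response."""
    prompt_terms = sorted(set(prompt_tokens))
    total = sum(idf[t] for t in prompt_terms)
    if total == 0.0:
        return 0.0
    present = set(response_tokens)
    return sum(idf[t] for t in prompt_terms if t in present) / total


# Content tokens only; stopword removal is assumed to have happened already.
prompt = ["capital", "france"]
paris = ["paris", "capital", "france"]
london = ["london", "capital", "france"]

relevance_paris = cosine(tfidf(prompt, TOY_IDF), tfidf(paris, TOY_IDF))
relevance_london = cosine(tfidf(prompt, TOY_IDF), tfidf(london, TOY_IDF))
coverage = completeness(prompt, paris, TOY_IDF)
```

In this toy table the off-prompt token `london` carries less weight than `paris`, so the London response ends up with the higher cosine. That is the same kind of asymmetry the worked example under "Known properties and limitations" attributes to the bundled corpus.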
99
+
100
+ Each dimension returns a value in `[0.0, 1.0]`. The composite is a fixed,
101
+ version-locked weighted sum:
102
+
103
+ ```
104
+ composite = 0.35 * relevance
105
+ + 0.20 * coherence
106
+ + 0.30 * completeness
107
+ + 0.15 * conciseness
108
+ ```
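As a quick check of the weighted sum, the sketch below recomputes the composite from the per-dimension values shown in this README's example output. The `composite` helper and the sorted accumulation order are assumptions for illustration, so the last floating-point digit may differ from what the library returns.

```python
import math

# Weights copied from the formula above.
WEIGHTS = {
    "relevance": 0.35,
    "coherence": 0.20,
    "completeness": 0.30,
    "conciseness": 0.15,
}


def composite(dimensions):
    """Weighted sum of the four dimension scores, accumulated in key order."""
    return sum(WEIGHTS[name] * dimensions[name] for name in sorted(WEIGHTS))


# Dimension values from the README's example output for
# "Paris is the capital of France."
verbose = {
    "relevance": 0.8295310065985426,
    "coherence": 1.0,
    "completeness": 1.0,
    "conciseness": 0.5,
}

# Dimension values the README reports for the one-word answer "Paris."
terse = {
    "relevance": 0.0,
    "coherence": 1.0,
    "completeness": 0.0,
    "conciseness": 1.0,
}

verbose_composite = composite(verbose)
terse_composite = composite(terse)
```

The two results agree with the 0.865 and 0.350 composites quoted elsewhere in this README, up to rounding.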
109
+
110
+ The weights are pinned constants, not configurable. Configurability is
111
+ explicitly post-v1.0.
112
+
113
+ ---
114
+
115
+ ## What you get back
116
+
117
+ `score()` returns a frozen `VrtyScore` object with a 9-key `to_dict()`:
118
+
119
+ ```python
120
+ {
121
+ "composite": 0.8653358523094898,
122
+ "relevance": 0.8295310065985426,
123
+ "coherence": 1.0,
124
+ "completeness": 1.0,
125
+ "conciseness": 0.5,
126
+ "explanations": {
127
+ "relevance": "Relevance: 0.83 - response strongly overlaps with the prompt's key terms.",
128
+ "coherence": "Coherence: 1.00 - adjacent sentences carry consistent topic.",
129
+ "completeness": "Completeness: 1.00 - most of the prompt's key terms appear in the response.",
130
+ "conciseness": "Conciseness: 0.50 - response has moderate information density."
131
+ },
132
+ "vrty_version": "1.0.0",
133
+ "idf_sha256": "0e475bcaa5524d1e26cbb166bb5c138e37f87e1e47b75e6506c6460a94259fd2",
134
+ "weights": {"relevance": 0.35, "coherence": 0.20, "completeness": 0.30, "conciseness": 0.15}
135
+ }
136
+ ```
137
+
138
+ `vrty_version` and `idf_sha256` make every score reproducible — together
139
+ they pin the scoring logic and the exact IDF data used.
140
+
141
+ ---
142
+
143
+ ## CLI
144
+
145
+ ```sh
146
+ vrty --prompt "What is the capital of France?" \
147
+ --response "Paris is the capital of France."
148
+ ```
149
+
150
+ Equivalent module invocation:
151
+
152
+ ```sh
153
+ python -m vrty --prompt "..." --response "..."
154
+ ```
155
+
156
+ Accepts `--prompt-file PATH` / `--response-file PATH` for long inputs;
157
+ `/dev/stdin` works as a file path. `--pretty` indents the JSON.
158
+ Exit codes: `0` success, `1` I/O error, `2` argparse error.
159
+
160
+ ---
161
+
162
+ ## Benchmarks
163
+
164
+ VRTY is not an embedding-based scorer; if you need semantic similarity that
165
+ survives paraphrase, use **BERTScore** or **MoverScore**. VRTY is not n-gram
166
+ precision against a reference; if you have reference answers, use **BLEU**
167
+ or **ROUGE**. VRTY's niche is *no-reference, no-model, deterministic*
168
+ scoring — the gap ROUGE leaves when you don't have a gold reference, and
169
+ the gap GPT-as-judge leaves when you need reproducibility.
170
+
171
+ Reproducibility, cost, and latency vs ROUGE and LLM-as-judge. VRTY and
172
+ ROUGE were measured on the same machine with the same 1000 synthetic
173
+ (prompt, response) pairs per response-size bucket; reproduce via
174
+ `python tools/benchmark.py`. LLM-as-judge cost and latency are intentionally
175
+ not measured here — they depend on model choice and provider pricing, both
176
+ of which drift; fill them in for your own model before relying on the
177
+ comparison.
178
+
179
+ | | VRTY | ROUGE (rouge-score 0.1.2) | LLM-as-judge |
180
+ |------------------------|------|---------------------------|--------------|
181
+ | **Reproducibility** | Byte-identical across processes (pinned Python 3.11.9, asserted in CI on three subprocesses with adversarial `PYTHONHASHSEED` values) | Deterministic for a fixed tokenizer | Non-deterministic; varies with temperature, sampling, model version |
182
+ | **Cost per score** | $0 (no API call) | $0 (local) | $ per call × tokens; measure with your chosen model |
183
+ | **Latency p99 — 100 tokens** | **0.16 ms** | 1.66 ms | typically 500–2000 ms (network + inference) |
184
+ | **Latency p99 — 500 tokens** | **0.52 ms** | 6.66 ms | typically 500–2000 ms |
185
+ | **Latency p99 — 2000 tokens** | **2.94 ms** | 25.96 ms | typically 1000–5000 ms |
186
+ | **Network required** | No | No | Yes |
187
+ | **Reference hardware** | AMD Ryzen 7 8745HS, 8 cores / 16 threads, 27 GiB RAM, Ubuntu 24.04, Python 3.11.9 | (same) | (varies by provider) |
188
+
189
+ **Latency claim (v1.0)**: `< 3 ms p99 for responses under 2000 tokens on
190
+ AMD Ryzen 7 8745HS`. Reproduce: `python tools/benchmark.py` from a clean
191
+ venv with `vrty` and `rouge-score==0.1.2` installed.
192
+
193
+ VRTY is roughly **9–13× faster than ROUGE** across the input sizes in this
194
+ table because the scoring path is pure stdlib with no regex-based stemmer
195
+ and no sentence-pair grid construction.
196
+
197
+ ---
198
+
199
+ ## Calibration bands
200
+
201
+ Expected composite ranges by prompt type, observed across realistic input.
202
+ Use these to set CI gates and user-facing displays — do not assume a single
203
+ threshold works across prompt types.
204
+
205
+ | Prompt type | Typical composite | Use the score as |
206
+ |---|---|---|
207
+ | Factoid Q&A where the answer echoes prompt vocabulary (`"capital of France?"` → `"Paris is the capital of France."`) | 0.70 – 0.90 | Absolute threshold viable |
208
+ | Customer-support / instruction-following | 0.20 – 0.40 | Relative delta from a baseline answer on the *same* prompt |
209
+ | Open-ended prose (email drafts, summaries) | 0.15 – 0.35 | Relative delta only |
210
+ | Repetition / padding spam with OOV technical terms | can score 0.60+ | Catch by pairing with a length / repetition sanity check |
211
+
212
+ **Practical rule.** Compute a baseline composite on a known-good response
213
+ to your prompt, then gate on `score >= baseline * k` for some
214
+ `k ∈ [0.7, 0.9]`. Do not gate on `composite > 0.8` as an absolute: that
215
+ will reject obviously fine open-ended responses, which typically score 0.15 to 0.40.
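The practical rule can be written as a small helper. This is a sketch, not part of the package: `passes_gate` is a hypothetical name, and the default `k=0.8` is simply the midpoint of the suggested range.

```python
def passes_gate(candidate, baseline, k=0.8):
    """Return True when the candidate composite is at least k times the baseline.

    `baseline` is the composite of a known-good response to the same prompt.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return candidate >= baseline * k


# An open-ended prompt whose known-good response scored 0.35.
baseline = 0.35
good_enough = passes_gate(0.30, baseline)
too_low = passes_gate(0.20, baseline)
```

A candidate at 0.30 passes here even though it is far below 0.8, which is the behavior a relative gate is meant to give on open-ended prompts.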
216
+
217
+ ---
218
+
219
+ ## Determinism
220
+
221
+ Identical input returns byte-identical output. This guarantee holds under
222
+ the following conditions, all of which are documented and enforced:
223
+
224
+ - **Pinned interpreter**: Python 3.11.9 (CPython, official build or
225
+ python-build-standalone). The CI matrix runs on this version. Other 3.11.x
226
+ releases are likely to produce identical output but are not asserted.
227
+ - **Pinned IDF data**: `vrty/data/idf.json.gz` ships with the package
228
+ and is SHA-256-verified at import. A modified data file fails fast with
229
+ `VrtyDataError` before any score is computed.
230
+ - **Zero third-party runtime dependencies**: the scoring path uses only
231
+ CPython stdlib (`re`, `math`, `collections`, `json`, `gzip`,
232
+ `hashlib`, `importlib.resources`, `unicodedata`). No `numpy`, no
233
+ `scikit-learn`, no BLAS-backed FP variance.
234
+ - **Sort-before-reduction**: every set and dict is sorted before any
235
+ floating-point accumulation, so set-iteration order under
236
+ `PYTHONHASHSEED` randomization cannot change the result.
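To see why the sort-before-reduction rule matters, the sketch below sums the same three floats in every possible order. Floating-point addition is not associative, so the unsorted sums disagree in the last digit, while sorting the keys first always gives one result. The `weights` table and helper names are illustrative, not taken from the package.

```python
from itertools import permutations

# Illustrative weights; any floats that are not exactly representable will do.
weights = {"alpha": 0.1, "beta": 0.2, "gamma": 0.3}


def naive_sum(keys):
    """Accumulate in whatever order the keys arrive."""
    total = 0.0
    for key in keys:
        total += weights[key]
    return total


def stable_sum(keys):
    """Sort the keys first so the accumulation order is fixed."""
    return naive_sum(sorted(keys))


# Try every iteration order a set of these three keys could produce.
naive_results = {naive_sum(order) for order in permutations(weights)}
stable_results = {stable_sum(order) for order in permutations(weights)}
```

`naive_results` ends up with more than one distinct value, while `stable_results` holds exactly one.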
237
+
238
+ The test suite asserts byte-identity on `json.dumps(result.to_dict(),
239
+ sort_keys=True)` across three fresh OS subprocesses with `PYTHONHASHSEED`
240
+ set to `0`, `12345`, and the CPython default (`random`).
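A subprocess check of this kind might look like the sketch below. It is an assumption about the general shape, not the project's actual test: the `CHILD` program is a stand-in that reduces a small table in sorted order, where the real suite would call `score()` and dump `to_dict()`.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the scored program: reduce a small table in
# sorted order and print the result as canonical JSON.
CHILD = (
    "import json\n"
    "terms = {'alpha': 0.1, 'beta': 0.2, 'gamma': 0.3}\n"
    "total = sum(terms[t] for t in sorted(set(terms)))\n"
    "print(json.dumps({'total': total}, sort_keys=True))\n"
)


def run_with_seed(seed):
    """Run CHILD in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    done = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        check=True,
    )
    return done.stdout


# Same three seed settings the README lists.
outputs = [run_with_seed(seed) for seed in ("0", "12345", "random")]
identical = len(set(outputs)) == 1
```

Comparing raw `stdout` bytes rather than parsed values is what makes the check byte-level.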
241
+
242
+ ---
243
+
244
+ ## Self-host
245
+
246
+ A one-command Docker self-host is shipped alongside the library. See the
247
+ [Dockerfile](Dockerfile) for the pinned image and the
248
+ [GitHub Actions snippet](.github/workflows/vrty.yml) for CI/CD
249
+ integration.
250
+
251
+ ```sh
252
+ docker build -t vrty:1.0.0 .
253
+ docker run --rm vrty:1.0.0 \
254
+ --prompt "What is the capital of France?" \
255
+ --response "Paris is the capital of France."
256
+ ```
257
+
258
+ ---
259
+
260
+ ## Known properties and limitations
261
+
262
+ **Read this section before integrating VRTY into anything load-bearing.**
263
+ Seven honest limitations of the v1.0 design.
264
+
265
+ ### 1. VRTY scores surface properties, not factual correctness
266
+
267
+ The four dimensions measure **term overlap, sentence flow, key-term
268
+ coverage, and information density**. They do *not* verify that the response
269
+ is factually true. A correct answer that does not echo prompt vocabulary
270
+ scores low on relevance and completeness; a confident wrong answer that
271
+ echoes prompt vocabulary scores high.
272
+
273
+ Worked example, prompt = `"What is the capital of France?"`:
274
+
275
+ | Response | Correct? | Composite | Relevance | Completeness | Conciseness |
276
+ |-------------------------------------------|----------|-----------|-----------|--------------|-------------|
277
+ | `"Paris is the capital of France."` | yes | 0.865 | 0.830 | 1.000 | 0.500 |
278
+ | `"London is the capital of France."` | **no** | 0.879 | 0.867 | 1.000 | 0.500 |
279
+ | `"Paris."` | yes | 0.350 | 0.000 | 0.000 | 1.000 |
280
+ | `"London."` | **no** | 0.350 | 0.000 | 0.000 | 1.000 |
281
+ | `"Banana."` | **no** | 0.350 | 0.000 | 0.000 | 1.000 |
282
+
283
+ The verbose incorrect answer scores *higher* than the verbose correct one
284
+ (slight IDF asymmetry between `"london"` and `"paris"` in the bundled
285
+ corpus); the three terse responses — one correct, two wrong — receive
286
+ identical 0.350 scores. **VRTY cannot distinguish them; an external
287
+ fact-check must.** Use VRTY to detect malformed, off-topic, or padded
288
+ outputs; use a separate fact-check or human review to verify truth.
289
+
290
+ ### 2. Conciseness and completeness intentionally pull against each other
291
+
292
+ A response that covers every prompt term tends to be longer (lower
293
+ conciseness); a terse response tends to omit prompt terms (lower
294
+ completeness). This tension is correct behavior, not a bug. Always read
295
+ the per-dimension breakdown — a single composite hides the trade-off.
296
+
297
+ ### 3. Single-sentence coherence returns 1.0 by deliberate choice
298
+
299
+ When the response is one sentence (or zero — see the empty-response
300
+ wrapper), there is no adjacent-sentence pair that can disagree, so
301
+ coherence is set to 1.0. This is a deliberate v1.0 convention: penalizing
302
+ short responses on coherence would double-count what completeness already
303
+ measures via prompt-term coverage.
304
+
305
+ ### 4. OOV tokens receive maximum IDF weight by deliberate choice
306
+
307
+ Tokens not present in the bundled IDF corpus are assigned `idf_oov =
308
+ log(N+1) + 1`, the value the smoothed IDF formula assigns to a token that
309
+ appears in zero documents. This treats unseen words as maximally
310
+ informative — the standard add-one (Laplace) smoothing choice — so
311
+ technical jargon and proper nouns are not silently dropped to zero weight.
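The OOV weight can be reproduced under one assumption: that the smoothed IDF has the common form `log((N + 1) / (df + 1)) + 1`, which yields `log(N + 1) + 1` at `df = 0`. The function name and the corpus size of 5,400 (the approximate figure given under limitation 7) are illustrative.

```python
import math


def smoothed_idf(doc_freq, n_docs):
    """Assumed smoothed IDF: log((N + 1) / (df + 1)) + 1."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0


N_DOCS = 5400  # approximate pseudo-document count quoted in this README

idf_oov = smoothed_idf(0, N_DOCS)           # token seen in no document
idf_rare = smoothed_idf(1, N_DOCS)          # token seen in one document
idf_everywhere = smoothed_idf(N_DOCS, N_DOCS)  # token seen in every document
```

Under this form the weight falls monotonically as document frequency rises, so an unseen token always sits at the top of the scale and a token present in every document bottoms out at 1.0.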
312
+
313
+ ### 5. Conciseness is a type–token ratio, which is mildly length-sensitive
314
+
315
+ The conciseness measure (`|unique content tokens| / |total tokens|`) tends
316
+ to decline for longer responses because the vocabulary saturates while the
317
+ length keeps growing. This is a known property of the type–token ratio
318
+ (Hess et al. 1986). Two responses of very different lengths are not
319
+ directly comparable on conciseness alone; interpret the conciseness score
320
+ together with the other dimensions and the response length.
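A minimal sketch of the ratio follows. The tokenizer and the tiny `STOPWORDS` set are hypothetical stand-ins for whatever VRTY uses internally; with them, the sketch happens to reproduce the 0.500 and 1.000 conciseness values from the worked example table.

```python
import re

# Hypothetical stopword list; the package's real list is not shown here.
STOPWORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def conciseness(text):
    """Unique content tokens divided by total tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    content = {tok for tok in tokens if tok not in STOPWORDS}
    return len(content) / len(tokens)


sentence = conciseness("Paris is the capital of France.")
one_word = conciseness("Paris.")
padded = conciseness("Paris Paris Paris is the capital of France.")
```

Padding the sentence with repeated words adds tokens without adding unique content, so `padded` falls below `sentence`.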
321
+
322
+ ### 6. Repetition can score high when prompt terms are out-of-corpus
323
+
324
+ Because OOV tokens receive maximum IDF weight (limitation 4 above) and
325
+ conciseness is a type–token ratio (limitation 5), a response that *repeats*
326
+ OOV technical terms (e.g. `"multi-head multi-head attention attention
327
+ attention transformer transformer transformer."` against a transformer-
328
+ architecture prompt) can score *higher* than a substantive paragraph on the
329
+ same prompt. Mitigation: combine the VRTY composite with a basic length /
330
+ repetition sanity check, or treat the composite as one signal among
331
+ several. This is a known property of TF·IDF-family scorers, not unique to
332
+ VRTY.
333
+
334
+ ### 7. The bundled IDF corpus is 19th-century English literature
335
+
336
+ IDF weights are computed from ten US-public-domain Project Gutenberg books
337
+ (Austen, Melville, Shelley, Doyle, Stoker, Carroll, Wilde, Dickens, Wells,
338
+ Thoreau): about 5,400 pseudo-documents of 200 tokens each, with a 32,000-word vocabulary.
339
+ Modern technical vocabulary like `"API"`, `"endpoint"`, `"deploy"`,
340
+ `"kubernetes"`, `"async"` is not in the corpus and falls into the OOV
341
+ bucket, where it receives the maximum IDF weight (see limitation 4).
342
+
343
+ This generally *helps* technical text (rare jargon is correctly treated as
344
+ informative) but can cause uneven weighting when one technical term is
345
+ in-corpus by coincidence and a similar one is not. **A domain-matched IDF
346
+ corpus is explicitly post-v1.0**; v1.0 disclaims this rather than fixes it.
347
+ Non-English text scores as-is with no special handling and is similarly
348
+ disclaimed.
349
+
350
+ ---
351
+
352
+ ## Input contract
353
+
354
+ Behavior on degenerate inputs is part of the v1.0 spec, not an afterthought:
355
+
356
+ | Input | Behavior |
357
+ |--------------------------------------|----------------------------------------------------------------|
358
+ | Empty response | Every dimension and the composite return `0.0`; explanations say "response contained no scorable tokens." |
359
+ | Empty prompt | Relevance and completeness return `0.0`; coherence and conciseness depend only on the response and score normally |
360
+ | Inputs above 2,048 tokens | Truncated at 2,048 tokens (the `MAX_TOKENS` constant) before scoring; truncation is deterministic |
361
+ | Non-English text | NFKD-normalized then ASCII-stripped; accented Latin folds to base letters; non-Latin scripts (CJK, Cyrillic, Arabic, ...) drop entirely. Quality outside English is not claimed |
362
+ | Response identical to prompt | Scored normally; no special case |
363
+ | Single word | Scored normally; no special case |
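The non-English row can be illustrated with the standard library alone. This is a sketch of "NFKD-normalize, then strip to ASCII" as described in the table; the helper name `fold_to_ascii` is hypothetical and the package may order or implement the steps differently.

```python
import unicodedata


def fold_to_ascii(text):
    """Decompose characters with NFKD, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Accented Latin folds to its base letters; \u00e9 is e with acute accent.
folded = fold_to_ascii("caf\u00e9")

# Cyrillic has no ASCII base letters, so nothing survives.
dropped = fold_to_ascii("\u041f\u0440\u0438\u0432\u0435\u0442")
```

With this approach accented Latin text keeps its base letters, while text in a non-Latin script folds to an empty string and would therefore be scored as an empty input.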
364
+
365
+ ---
366
+
367
+ ## License
368
+
369
+ MIT — see [LICENSE](LICENSE).
370
+
371
+ ---
372
+
373
+ ## Versioning
374
+
375
+ `vrty_version` is included with every score so any historical score is
376
+ traceable to the exact scoring logic that produced it. The bundled IDF
377
+ data file's SHA-256 (`idf_sha256`) is also returned with every score so
378
+ two scores from different builds can be compared at the data-pinning
379
+ level, not just the code level. Changing either the scoring logic or the
380
+ IDF data invalidates byte-equality guarantees and requires a version bump.
381
+
382
+ A score from `vrty_version="1.0.0"` will be reproducible on any future
383
+ machine that installs `vrty==1.0.0` on Python 3.11.9.
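The data-pinning idea can be sketched as a hash check before load. Everything here is illustrative: `DataIntegrityError` is a hypothetical stand-in for the package's `VrtyDataError`, the payload is a toy table, and hashing the compressed bytes (rather than the decompressed JSON) is an assumption.

```python
import gzip
import hashlib
import json


class DataIntegrityError(Exception):
    """Hypothetical stand-in for the package's VrtyDataError."""


def sha256_hex(blob):
    """Hex digest of the raw bytes."""
    return hashlib.sha256(blob).hexdigest()


def load_verified(blob, expected_sha256):
    """Refuse to parse the data unless its digest matches the pinned value."""
    actual = sha256_hex(blob)
    if actual != expected_sha256:
        raise DataIntegrityError(f"IDF data hash mismatch: {actual}")
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Toy payload; mtime=0 keeps the gzip header stable between runs.
payload = json.dumps({"capital": 2.0}, sort_keys=True).encode("utf-8")
blob = gzip.compress(payload, mtime=0)
pinned = sha256_hex(blob)

table = load_verified(blob, pinned)
```

Checking the digest before decompressing means a modified file is rejected without any of its contents being parsed.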