context_recall-0.1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- context_recall-0.1.0/.gitignore +6 -0
- context_recall-0.1.0/CHANGELOG.md +21 -0
- context_recall-0.1.0/LICENSE +21 -0
- context_recall-0.1.0/PKG-INFO +356 -0
- context_recall-0.1.0/README.md +317 -0
- context_recall-0.1.0/context_lens/__init__.py +840 -0
- context_recall-0.1.0/examples/basic_usage.py +105 -0
- context_recall-0.1.0/examples/rag_pipeline_audit.py +137 -0
- context_recall-0.1.0/pyproject.toml +69 -0
- context_recall-0.1.0/tests/test_context_lens.py +786 -0
@@ -0,0 +1,21 @@
# Changelog

## [0.1.0] - 2026-03-27

### Added
- `Needle` dataclass — defines key fact + question + expected keywords
- `HaystackTemplate` — builds context strings with needle at arbitrary position (0.0–1.0)
- `KeywordJudge` — deterministic zero-cost retrieval evaluation (50% keyword threshold)
- `PositionResult` — per-position audit record (retrieved, keyword hits/misses, latency, error)
- `PositionHeatmap` — full audit result with score, verdict, fault zones, human-readable report
- `ContextLens` — main engine: `audit()`, `audit_multi()`, `summary_report()`, `ci_gate()`, `history()`
- Three verdicts: `RELIABLE` (>=90%), `CONDITIONAL` (>=70%), `UNRELIABLE` (<70%)
- Three fault zone patterns: MIDDLE-HEAVY, EDGE, SCATTERED
- SQLite audit history store (zero dependencies)
- CLI: `context-lens audit`, `context-lens ci`, `context-lens history`
- YAML/JSON config file support
- Anthropic and OpenAI CLI provider support
- `build_from_config()` helper for programmatic config loading
- 80 tests passing, 0 hard dependencies
- Biblical pattern: Ezekiel 37:1-10 (Valley of Dry Bones — Spirit traverses back and forth = exhaustive positional coverage)
- BibleWorld pivot score: 8.80 (third-highest in BibleWorld history)
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 BuildWorld

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,356 @@
Metadata-Version: 2.4
Name: context-recall
Version: 0.1.0
Summary: Test whether your LLM retrieves information from every position in its context window.
Project-URL: Homepage, https://github.com/Rowusuduah/context-lens
Project-URL: Repository, https://github.com/Rowusuduah/context-lens
Project-URL: Issues, https://github.com/Rowusuduah/context-lens/issues
Project-URL: Changelog, https://github.com/Rowusuduah/context-lens/blob/main/CHANGELOG.md
Author-email: Richmond Owusu Duah <Rowusuduah@users.noreply.github.com>
License: MIT
License-File: LICENSE
Keywords: ai,context-window,evaluation,llm,lost-in-the-middle,machine-learning,rag,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Provides-Extra: all
Requires-Dist: anthropic>=0.20.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: pyyaml>=6.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.20.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: pyyaml>=6.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: yaml
Requires-Dist: pyyaml>=6.0; extra == 'yaml'
Description-Content-Type: text/markdown

# context-lens

**Test whether your LLM actually retrieves information from every position in its context window — before it silently fails in production.**

Your LLM passes all your evals. You ship it. Users start complaining that it ignores half their documents. You check the logs — technically successful calls, no errors. The bug is invisible. This is the lost-in-the-middle problem, and context-lens is the missing test gate.

---

## The Problem

Research confirmed what production engineers already knew: LLMs pay heavy attention to information at the **start** and **end** of long contexts, and silently drop information **buried in the middle**.

```
Context position: [start] ====== [middle] ====== [end]
LLM attention:     HIGH            LOW            HIGH
```

This breaks:
- **RAG pipelines** — retrieved chunks in positions 3-7 of 10 may be ignored
- **Long document analysis** — key clauses in the middle of contracts get missed
- **Multi-turn agents** — prior tool outputs buried in conversation history get lost
- **System prompts with long instructions** — middle constraints are violated silently

**The failure mode is always the same:** traditional evals test correctness on a single input. They do not test whether the model is reliable *at every position* in the context window.

context-lens fills this gap.

---

## What context-lens Does

context-lens places a "needle" (a key fact or instruction) at every position across your context window, runs your LLM, and produces a **PositionHeatmap** — a complete picture of where your model is reliable and where it fails.

```
position  1/10 (fraction=0.00)  [OK]   ##
position  2/10 (fraction=0.11)  [OK]   ##
position  3/10 (fraction=0.22)  [OK]   ##
position  4/10 (fraction=0.33)  [MISS] ..  <- FAULT ZONE
position  5/10 (fraction=0.44)  [MISS] ..  <- FAULT ZONE
position  6/10 (fraction=0.56)  [MISS] ..  <- FAULT ZONE
position  7/10 (fraction=0.67)  [OK]   ##
position  8/10 (fraction=0.78)  [OK]   ##
position  9/10 (fraction=0.89)  [OK]   ##
position 10/10 (fraction=1.00)  [OK]   ##

Retrieval Score: 70% — CONDITIONAL
Fault zones: MIDDLE-HEAVY FAILURE (lost-in-the-middle pattern detected)
```

Then it gives you a **CI gate** that fails your pipeline if the score drops below your threshold.
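Each `[OK]`/`[MISS]` in the heatmap comes from the deterministic `KeywordJudge` (50% keyword threshold). A minimal sketch of that rule, for intuition only and not necessarily the library's exact implementation:

```python
def keyword_retrieved(response: str, keywords: list[str], threshold: float = 0.5) -> bool:
    """Count a position as retrieved when at least `threshold` of the
    needle's answer_keywords appear in the response (case-insensitive)."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits >= threshold * len(keywords)

# 1 of 2 keywords found -> 50% -> counts as retrieved
keyword_retrieved("The limit is 1000 requests.", ["1000", "per minute"])
```

Because the check is plain substring matching, it costs no tokens and is fully reproducible across runs.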

---

## Installation

```bash
pip install context-lens

# With Anthropic support:
pip install "context-lens[anthropic]"

# With OpenAI support:
pip install "context-lens[openai]"

# With YAML config support:
pip install "context-lens[yaml]"

# Everything:
pip install "context-lens[all]"
```

**Zero hard dependencies.** context-lens uses only Python stdlib. Install provider SDKs separately.

---

## Quick Start

```python
from context_lens import ContextLens, Needle, HaystackTemplate
import anthropic

# 1. Wrap your LLM in a str -> str function
client = anthropic.Anthropic()
def my_llm(prompt: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# 2. Define what must be found (the "needle")
needle = Needle(
    label="API rate limit",
    content="The API rate limit is 1000 requests per minute per key.",
    question="What is the API rate limit?",
    expected_answer="1000 requests per minute",
    answer_keywords=["1000", "per minute"],
)

# 3. Define the surrounding context (the "haystack")
haystack = HaystackTemplate(
    filler_text="This document describes the system API. All endpoints require authentication. ",
    target_tokens=4000,
    tokens_per_filler=15,
)

# 4. Run the audit
lens = ContextLens(model_fn=my_llm, model_name="claude-haiku")
heatmap = lens.audit(needle=needle, haystack=haystack, positions=10)

# 5. Read the result
heatmap.report()
print(f"Score: {heatmap.retrieval_score:.1%}")
print(f"Verdict: {heatmap.verdict}")
print(f"Fault zones: {heatmap.fault_zones}")
```
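Conceptually, `HaystackTemplate` repeats the filler until it reaches roughly `target_tokens` and splices the needle in at the requested fraction. A rough sketch of that mechanic (illustrative only; the real builder may count tokens and place the needle differently):

```python
def build_haystack(filler_text: str, target_tokens: int, tokens_per_filler: int,
                   needle: str, fraction: float) -> str:
    """Repeat the filler to roughly target_tokens worth of text, then insert
    the needle at `fraction` of the way through (0.0 = start, 1.0 = end)."""
    n_fillers = max(1, target_tokens // tokens_per_filler)
    chunks = [filler_text] * n_fillers
    chunks.insert(round(fraction * n_fillers), needle)
    return "".join(chunks)

context = build_haystack(
    "This document describes the system API. ", 4000, 8,
    "The API rate limit is 1000 requests per minute per key. ", 0.44,
)
```

`audit(..., positions=10)` then asks the needle's question once per placement, at ten evenly spaced fractions (0.00, 0.11, ..., 1.00), as the heatmap above shows.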

---

## Multi-Needle Audit

Test multiple pieces of critical information in one run:

```python
needles = [
    Needle(
        label="Rate limit",
        content="Rate limit: 1000 req/min.",
        question="What is the rate limit?",
        expected_answer="1000 req/min",
        answer_keywords=["1000"],
    ),
    Needle(
        label="Retry policy",
        content="On 429 errors, use exponential backoff starting at 2 seconds.",
        question="How should you handle 429 errors?",
        expected_answer="exponential backoff, 2 seconds",
        answer_keywords=["exponential backoff", "2 seconds"],
    ),
    Needle(
        label="Token expiry",
        content="Session tokens expire after 24 hours.",
        question="When do session tokens expire?",
        expected_answer="24 hours",
        answer_keywords=["24 hours"],
    ),
]

heatmaps = lens.audit_multi(needles, haystack, positions=10)
summary = lens.summary_report(heatmaps)
print(f"Overall score: {summary['overall_score']:.1%}")
print(f"Overall verdict: {summary['overall_verdict']}")
```
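`summary_report()` rolls the per-needle heatmaps into one number. The aggregation rule is plausibly a simple mean of the individual retrieval scores (an assumption for illustration; the source does not confirm it):

```python
def overall_score(per_needle_scores: list[float]) -> float:
    """Assumed aggregation: mean of the per-needle retrieval scores."""
    return sum(per_needle_scores) / len(per_needle_scores)

# e.g. rate limit 100%, retry policy 70%, token expiry 90%
overall_score([1.0, 0.7, 0.9])  # ~0.87
```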

---

## CI Gate

Block deployment if context retrieval is unreliable:

```python
import sys

heatmaps = lens.audit_multi(needles, haystack, positions=10)
passed, message = lens.ci_gate(heatmaps, min_score=0.80)
print(message)
sys.exit(0 if passed else 1)
```

---

## CLI Usage

```bash
# Run an audit from config file
context-lens audit --config my_audit.yaml

# Write results to JSON
context-lens audit --config my_audit.yaml --output results.json

# CI gate (exits 1 on failure)
context-lens ci --config my_audit.yaml --min-score 0.85

# View audit history
context-lens history --limit 10
```

---

## Config File Format (YAML)

```yaml
# my_audit.yaml
model_name: claude-haiku-4-5-20251001
provider: anthropic   # anthropic | openai | mock
model: claude-haiku-4-5-20251001
positions: 10
reliable_threshold: 0.90
conditional_threshold: 0.70

haystack:
  filler_text: "This document contains system documentation. "
  target_tokens: 4000
  tokens_per_filler: 10
system_prompt: "Answer questions using only the provided context."

needles:
  - label: "Database connection string"
    content: "The database connection string is db://prod-server:5432/myapp"
    question: "What is the database connection string?"
    expected_answer: "db://prod-server:5432/myapp"
    answer_keywords: ["prod-server", "5432"]

  - label: "Retry limit"
    content: "The maximum retry count is 3 attempts with 5-second intervals."
    question: "How many retries are allowed and at what interval?"
    expected_answer: "3 retries, 5-second intervals"
    answer_keywords: ["3", "5-second"]
```

---

## GitHub Actions Integration

```yaml
# .github/workflows/context-lens.yml
name: Context Window Audit

on: [push, pull_request]

jobs:
  context-audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install context-lens
        run: pip install "context-lens[all]"

      - name: Run context position audit
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          context-lens ci --config .context_lens.yaml --min-score 0.80

      - name: Upload audit results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: context-lens-results
          path: .context_lens.db
```

---

## Verdicts Explained

| Verdict | Score Range | Meaning |
|---------|-------------|---------|
| **RELIABLE** | >= 90% | LLM consistently retrieves information from all context positions. Safe to ship. |
| **CONDITIONAL** | 70–89% | LLM has some positional failures. Review fault zones before shipping. |
| **UNRELIABLE** | < 70% | LLM has significant positional failures. Do not ship this configuration. |
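The table maps to code directly, using the same `reliable_threshold`/`conditional_threshold` defaults that appear in the YAML config (0.90 and 0.70):

```python
def verdict(score: float, reliable: float = 0.90, conditional: float = 0.70) -> str:
    """Map a retrieval score in [0.0, 1.0] to one of the three verdicts."""
    if score >= reliable:
        return "RELIABLE"
    if score >= conditional:
        return "CONDITIONAL"
    return "UNRELIABLE"

verdict(0.70)  # "CONDITIONAL" -- the 7/10 heatmap shown earlier
```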

---

## Fault Zone Patterns

context-lens identifies three failure patterns:

### 1. Middle-Heavy Failure (Lost-in-the-Middle)

```
Positions: [OK] [OK] [MISS] [MISS] [MISS] [OK] [OK]
```

Information in the middle of the context is not retrieved. Classic LLM attention pattern.

**Fix:** Reorder retrieved chunks to put critical info first/last. Reduce total context size.

### 2. Edge Failure

```
Positions: [MISS] [OK] [OK] [OK] [OK] [OK] [MISS]
```

Rare — usually indicates prompt structure issues.

### 3. Scattered Failures

```
Positions: [OK] [MISS] [OK] [MISS] [OK] [MISS] [OK]
```

General degradation. Often indicates context is too long for the model's reliable attention window.
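One way to distinguish the three patterns from a per-position pass/fail list is sketched below. This is an illustrative heuristic, not necessarily the classifier context-lens actually uses:

```python
def classify_fault_zone(results: list[bool]) -> str:
    """Heuristic pattern classifier over per-position retrieval results.

    Misses confined to the middle third -> MIDDLE-HEAVY; misses only at the
    first/last position -> EDGE; anything else -> SCATTERED.
    """
    n = len(results)
    misses = [i for i, ok in enumerate(results) if not ok]
    if not misses:
        return "NONE"
    if all(n // 3 <= i < n - n // 3 for i in misses):
        return "MIDDLE-HEAVY"
    if all(i in (0, n - 1) for i in misses):
        return "EDGE"
    return "SCATTERED"

classify_fault_zone([True, True, False, False, False, True, True])  # "MIDDLE-HEAVY"
```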

---

## Why context-lens?

| Tool | What it tests |
|------|---------------|
| DeepEval, Promptfoo | Whether specific inputs produce correct outputs |
| prompt-shield | Whether outputs are stable across paraphrase variants |
| drift-guard | Whether PR code matches PR intent |
| **context-lens** | **Whether the LLM retrieves information from all context positions** |

These tools solve different problems. context-lens tests a specific failure mode that is invisible to all of them: positional sensitivity in the context window.

---

## Roadmap

- **v0.1** (current): KeywordJudge, PositionHeatmap, CLI, SQLite history, CI gate, GitHub Action
- **v0.2**: LLM-as-judge for semantic retrieval checking (beyond keyword matching)
- **v0.3**: Automatic fault zone diagnosis with remediation suggestions
- **v0.4**: Token-precise position control (place needle at exact token offset)
- **v0.5**: Multi-model comparison (which model is more position-robust?)
- **v1.0**: pytest plugin, pre-commit hook

---

## License

MIT License. Copyright 2026 BuildWorld.