context-recall-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,6 @@
__pycache__/
*.pyc
dist/
*.egg-info/
.eggs/
build/
@@ -0,0 +1,21 @@
# Changelog

## [0.1.0] - 2026-03-27

### Added
- `Needle` dataclass — defines key fact + question + expected keywords
- `HaystackTemplate` — builds context strings with needle at arbitrary position (0.0–1.0)
- `KeywordJudge` — deterministic zero-cost retrieval evaluation (50% keyword threshold)
- `PositionResult` — per-position audit record (retrieved, keyword hits/misses, latency, error)
- `PositionHeatmap` — full audit result with score, verdict, fault zones, human-readable report
- `ContextLens` — main engine: `audit()`, `audit_multi()`, `summary_report()`, `ci_gate()`, `history()`
- Three verdicts: `RELIABLE` (>=90%), `CONDITIONAL` (>=70%), `UNRELIABLE` (<70%)
- Three fault zone patterns: MIDDLE-HEAVY, EDGE, SCATTERED
- SQLite audit history store (zero dependencies)
- CLI: `context-lens audit`, `context-lens ci`, `context-lens history`
- YAML/JSON config file support
- Anthropic and OpenAI CLI provider support
- `build_from_config()` helper for programmatic config loading
- 80 tests passing, 0 hard dependencies
- Biblical pattern: Ezekiel 37:1-10 (Valley of Dry Bones — Spirit traverses back and forth = exhaustive positional coverage)
- BibleWorld pivot score: 8.80 (third-highest in BibleWorld history)
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 BuildWorld

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,356 @@
Metadata-Version: 2.4
Name: context-recall
Version: 0.1.0
Summary: Test whether your LLM retrieves information from every position in its context window.
Project-URL: Homepage, https://github.com/Rowusuduah/context-lens
Project-URL: Repository, https://github.com/Rowusuduah/context-lens
Project-URL: Issues, https://github.com/Rowusuduah/context-lens/issues
Project-URL: Changelog, https://github.com/Rowusuduah/context-lens/blob/main/CHANGELOG.md
Author-email: Richmond Owusu Duah <Rowusuduah@users.noreply.github.com>
License: MIT
License-File: LICENSE
Keywords: ai,context-window,evaluation,llm,lost-in-the-middle,machine-learning,rag,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Provides-Extra: all
Requires-Dist: anthropic>=0.20.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: pyyaml>=6.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.20.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: pyyaml>=6.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: yaml
Requires-Dist: pyyaml>=6.0; extra == 'yaml'
Description-Content-Type: text/markdown

# context-lens

**Test whether your LLM actually retrieves information from every position in its context window — before it silently fails in production.**

Your LLM passes all your evals. You ship it. Users start complaining that it ignores half their documents. You check the logs — technically successful calls, no errors. The bug is invisible. This is the lost-in-the-middle problem, and context-lens is the missing test gate.

---

## The Problem

Research confirmed what production engineers already knew: LLMs pay heavy attention to information at the **start** and **end** of long contexts, and silently drop information **buried in the middle**.

```
Context position:  [start] ====== [middle] ====== [end]
LLM attention:       HIGH           LOW            HIGH
```

This breaks:
- **RAG pipelines** — retrieved chunks in positions 3-7 of 10 may be ignored
- **Long document analysis** — key clauses in the middle of contracts get missed
- **Multi-turn agents** — prior tool outputs buried in conversation history get lost
- **System prompts with long instructions** — middle constraints are violated silently

**The failure mode is always the same:** traditional evals test correctness on a single input. They do not test whether the model is reliable *at every position* in the context window.

context-lens fills this gap.
---

## What context-lens Does

context-lens places a "needle" (a key fact or instruction) at every position across your context window, runs your LLM, and produces a **PositionHeatmap** — a complete picture of where your model is reliable and where it fails.

```
position  1/10 (fraction=0.00)  [OK]   ##
position  2/10 (fraction=0.11)  [OK]   ##
position  3/10 (fraction=0.22)  [OK]   ##
position  4/10 (fraction=0.33)  [MISS] ..  <- FAULT ZONE
position  5/10 (fraction=0.44)  [MISS] ..  <- FAULT ZONE
position  6/10 (fraction=0.56)  [MISS] ..  <- FAULT ZONE
position  7/10 (fraction=0.67)  [OK]   ##
position  8/10 (fraction=0.78)  [OK]   ##
position  9/10 (fraction=0.89)  [OK]   ##
position 10/10 (fraction=1.00)  [OK]   ##

Retrieval Score: 70% — CONDITIONAL
Fault zones: MIDDLE-HEAVY FAILURE (lost-in-the-middle pattern detected)
```
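The placement sweep can be sketched roughly as follows. This is a hypothetical `build_haystack` helper, not the library's actual API; it sizes the context in characters rather than tokens purely for illustration.

```python
# Hypothetical sketch of fractional needle placement (not the real
# HaystackTemplate API): repeat filler text up to a target size, then
# insert the needle at a position controlled by `fraction`.

def build_haystack(filler: str, needle: str, fraction: float, target_chars: int = 2000) -> str:
    """Place `needle` at `fraction` (0.0 = start, 1.0 = end) of repeated filler."""
    repeats = max(1, target_chars // len(filler))
    chunks = [filler] * repeats
    index = round(fraction * len(chunks))  # 0 .. len(chunks)
    chunks.insert(index, needle)
    return "".join(chunks)

# A 10-position sweep mirrors the heatmap above: fractions 0.00, 0.11, ..., 1.00
prompts = [
    build_haystack(
        filler="All endpoints require authentication. ",
        needle="The API rate limit is 1000 requests per minute. ",
        fraction=i / 9,
    )
    for i in range(10)
]
```

Each prompt would then be paired with the needle's question and sent through the model, one call per position.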

Then it gives you a **CI gate** that fails your pipeline if the score drops below your threshold.

---

## Installation

```bash
pip install context-lens

# With Anthropic support:
pip install "context-lens[anthropic]"

# With OpenAI support:
pip install "context-lens[openai]"

# With YAML config support:
pip install "context-lens[yaml]"

# Everything:
pip install "context-lens[all]"
```

**Zero hard dependencies.** context-lens uses only the Python standard library; provider SDKs are optional extras, installed separately as shown above.

---

## Quick Start

```python
from context_lens import ContextLens, Needle, HaystackTemplate
import anthropic

# 1. Wrap your LLM in a str -> str function
client = anthropic.Anthropic()
def my_llm(prompt: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# 2. Define what must be found (the "needle")
needle = Needle(
    label="API rate limit",
    content="The API rate limit is 1000 requests per minute per key.",
    question="What is the API rate limit?",
    expected_answer="1000 requests per minute",
    answer_keywords=["1000", "per minute"],
)

# 3. Define the surrounding context (the "haystack")
haystack = HaystackTemplate(
    filler_text="This document describes the system API. All endpoints require authentication. ",
    target_tokens=4000,
    tokens_per_filler=15,
)

# 4. Run the audit
lens = ContextLens(model_fn=my_llm, model_name="claude-haiku")
heatmap = lens.audit(needle=needle, haystack=haystack, positions=10)

# 5. Read the result
heatmap.report()
print(f"Score: {heatmap.retrieval_score:.1%}")
print(f"Verdict: {heatmap.verdict}")
print(f"Fault zones: {heatmap.fault_zones}")
```
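Per the changelog, the `KeywordJudge` scores a position as retrieved when at least 50% of `answer_keywords` appear in the response. A minimal sketch of that check; the real implementation may normalize text differently, and whether exactly 50% counts as a hit is an assumption here:

```python
def keyword_retrieved(response: str, keywords: list[str], threshold: float = 0.5) -> bool:
    """Deterministic, zero-cost retrieval check: case-insensitive substring hits."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) >= threshold

keyword_retrieved("The limit is 1000 requests per minute.", ["1000", "per minute"])  # -> True
```

Because the judge is a pure string check, re-running an audit on identical responses is fully reproducible and costs no extra model calls.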

---

## Multi-Needle Audit

Test multiple pieces of critical information in one run:

```python
needles = [
    Needle(
        label="Rate limit",
        content="Rate limit: 1000 req/min.",
        question="What is the rate limit?",
        expected_answer="1000 req/min",
        answer_keywords=["1000"],
    ),
    Needle(
        label="Retry policy",
        content="On 429 errors, use exponential backoff starting at 2 seconds.",
        question="How should you handle 429 errors?",
        expected_answer="exponential backoff, 2 seconds",
        answer_keywords=["exponential backoff", "2 seconds"],
    ),
    Needle(
        label="Token expiry",
        content="Session tokens expire after 24 hours.",
        question="When do session tokens expire?",
        expected_answer="24 hours",
        answer_keywords=["24 hours"],
    ),
]

heatmaps = lens.audit_multi(needles, haystack, positions=10)
summary = lens.summary_report(heatmaps)
print(f"Overall score: {summary['overall_score']:.1%}")
print(f"Overall verdict: {summary['overall_verdict']}")
```

---

## CI Gate

Block deployment if context retrieval is unreliable:

```python
import sys

heatmaps = lens.audit_multi(needles, haystack, positions=10)
passed, message = lens.ci_gate(heatmaps, min_score=0.80)
print(message)
sys.exit(0 if passed else 1)
```

---

## CLI Usage

```bash
# Run an audit from a config file
context-lens audit --config my_audit.yaml

# Write results to JSON
context-lens audit --config my_audit.yaml --output results.json

# CI gate (exits 1 on failure)
context-lens ci --config my_audit.yaml --min-score 0.85

# View audit history
context-lens history --limit 10
```

---

## Config File Format (YAML)

```yaml
# my_audit.yaml
model_name: claude-haiku-4-5-20251001
provider: anthropic   # anthropic | openai | mock
model: claude-haiku-4-5-20251001
positions: 10
reliable_threshold: 0.90
conditional_threshold: 0.70

haystack:
  filler_text: "This document contains system documentation. "
  target_tokens: 4000
  tokens_per_filler: 10
system_prompt: "Answer questions using only the provided context."

needles:
  - label: "Database connection string"
    content: "The database connection string is db://prod-server:5432/myapp"
    question: "What is the database connection string?"
    expected_answer: "db://prod-server:5432/myapp"
    answer_keywords: ["prod-server", "5432"]

  - label: "Retry limit"
    content: "The maximum retry count is 3 attempts with 5-second intervals."
    question: "How many retries are allowed and at what interval?"
    expected_answer: "3 retries, 5-second intervals"
    answer_keywords: ["3", "5-second"]
```

---

## GitHub Actions Integration

```yaml
# .github/workflows/context-lens.yml
name: Context Window Audit

on: [push, pull_request]

jobs:
  context-audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install context-lens
        run: pip install "context-lens[all]"

      - name: Run context position audit
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          context-lens ci --config .context_lens.yaml --min-score 0.80

      - name: Upload audit results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: context-lens-results
          path: .context_lens.db
```

---

## Verdicts Explained

| Verdict | Score Range | Meaning |
|---------|-------------|---------|
| **RELIABLE** | >= 90% | LLM consistently retrieves information from all context positions. Safe to ship. |
| **CONDITIONAL** | 70–89% | LLM has some positional failures. Review fault zones before shipping. |
| **UNRELIABLE** | < 70% | LLM has significant positional failures. Do not ship this configuration. |
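The verdict mapping is a plain threshold check. A sketch using the defaults from the table (the config keys `reliable_threshold` and `conditional_threshold` let you override them):

```python
def verdict(score: float, reliable: float = 0.90, conditional: float = 0.70) -> str:
    """Map a retrieval score (0.0-1.0) to a verdict using inclusive lower bounds."""
    if score >= reliable:
        return "RELIABLE"
    if score >= conditional:
        return "CONDITIONAL"
    return "UNRELIABLE"

verdict(0.70)  # -> "CONDITIONAL" (boundaries are inclusive)
```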

---

## Fault Zone Patterns

context-lens identifies three failure patterns:

### 1. Middle-Heavy Failure (Lost-in-the-Middle)
```
Positions: [OK] [OK] [MISS] [MISS] [MISS] [OK] [OK]
```
Information in the middle of the context is not retrieved. This is the classic LLM attention pattern.
**Fix:** Reorder retrieved chunks to put critical info first or last, or reduce total context size.

### 2. Edge Failure
```
Positions: [MISS] [OK] [OK] [OK] [OK] [OK] [MISS]
```
Rare — usually indicates prompt structure issues.

### 3. Scattered Failures
```
Positions: [OK] [MISS] [OK] [MISS] [OK] [MISS] [OK]
```
General degradation. Often indicates the context is too long for the model's reliable attention window.
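One plausible way to sketch the classification above; this is hypothetical logic, and the library's own heuristic may weigh zones differently. Misses confined to the middle third are MIDDLE-HEAVY, misses only at the first or last position are EDGE, and anything else is SCATTERED:

```python
def classify_fault_zone(results: list[bool]) -> str:
    """Classify a per-position pass/fail list. True = retrieved, False = miss."""
    misses = [i for i, ok in enumerate(results) if not ok]
    if not misses:
        return "NONE"  # no fault zone at all (not one of the three patterns)
    n = len(results)
    middle = range(n // 3, n - n // 3)  # the middle third of positions
    if all(i in middle for i in misses):
        return "MIDDLE-HEAVY"
    if all(i in (0, n - 1) for i in misses):
        return "EDGE"
    return "SCATTERED"
```

Running it on the three example rows above yields MIDDLE-HEAVY, EDGE, and SCATTERED respectively.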

---

## Why context-lens?

| Tool | What it tests |
|------|---------------|
| DeepEval, Promptfoo | Whether specific inputs produce correct outputs |
| prompt-shield | Whether outputs are stable across paraphrase variants |
| drift-guard | Whether PR code matches PR intent |
| **context-lens** | **Whether the LLM retrieves information from all context positions** |

These tools solve different problems. context-lens tests a specific failure mode that is invisible to all of them: positional sensitivity in the context window.

---

## Roadmap

- **v0.1** (current): KeywordJudge, PositionHeatmap, CLI, SQLite history, CI gate, GitHub Action
- **v0.2**: LLM-as-judge for semantic retrieval checking (beyond keyword matching)
- **v0.3**: Automatic fault zone diagnosis with remediation suggestions
- **v0.4**: Token-precise position control (place needle at exact token offset)
- **v0.5**: Multi-model comparison (which model is more position-robust?)
- **v1.0**: pytest plugin, pre-commit hook

---

## License

MIT License. Copyright 2026 BuildWorld.