freshcontext-mcp 0.3.23 → 0.4.0

package/METHODOLOGY.md CHANGED
@@ -1,381 +1,455 @@
1
- # FreshContext Data Intelligence Methodology
2
- **Version 1.2 — May 2026**
3
- *Authored by Immanuel Gabriel (Prince Gabriel) — Grootfontein, Namibia*
4
-
5
- ---
6
-
7
- ## What This Document Is
8
-
9
- This document formally describes the data collection, scoring, ranking, storage, and provenance methodology underlying FreshContext.
10
-
11
- It exists for four audiences:
12
-
13
- 1. **Technical integrators** — teams embedding FreshContext into their agent infrastructure who need to understand what the data represents and how it is scored.
14
- 2. **Agent/retrieval system builders** — teams designing retrieval pipelines that need temporal relevance instead of undated context.
15
- 3. **Auditors and reviewers** — people verifying that timestamped AI context is represented honestly and reproducibly.
16
- 4. **Future licensing or platform partners** — entities evaluating FreshContext as infrastructure, who need to audit the methodology that makes the data defensible.
17
-
18
- ---
19
-
20
- ## Section 1: Core Methodology and Source Collection
21
-
22
- ### 1.1 Architecture
23
-
24
- FreshContext Core methodology describes the signal contract and temporal scoring primitives that can be used by MCP servers, APIs, CLIs, dashboards, agents, or internal retrieval systems.
25
-
26
- The Core methodology covers:
27
-
28
- - **Signal schema** — source, content, timestamps, confidence, adapter identity
29
- - **Source/provenance** — where the observation came from and how it was retrieved
30
- - **Published/content date** — when the source claims the content became true or available
31
- - **Retrieved timestamp** — when FreshContext observed the content
32
- - **Confidence** — how reliable the timestamp extraction is
33
- - **Decay-Adjusted Relevancy (DAR)** — temporal utility after source-specific decay
34
- - **Failure honesty** — failed adapters must not be promoted as fresh successful context
35
- - **Ranking/explain primitives** — fields that let agents and systems explain why a signal ranked where it did
36
-
37
- FreshContext also supports a Store/Ledger methodology for systems that persist recurring signals over time. The production Worker implementation uses Cloudflare runtime pieces for MCP transport, KV cache policy, rate limiting, D1 persistence, feeds, cron collection, and deployment concerns. Those runtime concerns are implementation layers, not requirements for every FreshContext-compatible system.
38
-
39
- ### 1.2 Store / Ledger Collection Layer
40
-
41
- The Store/Ledger methodology describes a continuous data collection pipeline that can run on Cloudflare's global edge infrastructure. A deployment may execute scheduled collection via cron and query watched definitions stored in D1 or another durable store.
42
-
43
- Each watched query specifies:
44
- - **Adapter** — the data source to query (e.g., `hackernews`, `jobs`, `reposearch`)
45
- - **Query** — the search term or URL
46
- - **User ID** — the profile this query serves
47
- - **Filters** — optional parameters (location, exclusion terms, etc.)
48
-
49
- This D1 cron ledger is one implementation layer and future Store direction. It is not required for every FreshContext-compatible envelope implementation.
50
-
51
- ### 1.3 Example Adapter / Source Classes
52
-
53
- FreshContext currently has:
54
-
55
- - A reference MCP implementation with `evaluate_context` and 21 read-only reference adapters
56
- - Separate feed products such as Fresh HN Feed and Fresh Jobs Feed
57
- - A Store/Ledger methodology for systems that collect recurring signals over time
58
-
59
- The following table describes example source classes used by FreshContext implementations. Not every source class is necessarily collected by every cron/feed deployment.
60
-
61
- | Adapter class | Source | Auth Required | Typical Update Frequency |
62
- |---|---|---|---|
63
- | `hackernews` | Hacker News Algolia API | None | Real-time |
64
- | `jobs` | Remotive API | None | Continuous |
65
- | `reposearch` | GitHub Search API | Optional (rate limit) | Real-time |
66
- | `github` | GitHub Repository API | Optional | Real-time |
67
- | `reddit` | Reddit JSON API | None | Real-time |
68
- | `yc` | YC Open Source API | None | Per batch cycle |
69
- | `packagetrends` | npm Registry + npm Downloads API | None | Per publish |
70
- | `finance` | Stooq quote API | None | Market hours / quote feed cadence |
71
- | `producthunt` | Product Hunt launch data | Token when API-backed | Launch cadence |
72
- | `changelog` | GitHub Releases / npm package metadata | Optional | Per release |
73
- | `arxiv` / `scholar` | Academic sources | None | Publication cadence |
74
- | `gdelt` | GDELT global news | None | 15-minute feed cadence |
75
- | `govcontracts` / `gebiz` | Government procurement datasets | None | Dataset cadence |
76
- | `sec_filings` | SEC EDGAR filings | None | Filing cadence |
77
-
78
- FreshContext adapters operate on publicly accessible or publicly documented data sources. Most reference adapters require no credentials. Some APIs may optionally use tokens for rate limits or official API access, but FreshContext-compatible adapters should not require private user data unless explicitly documented by the implementation. All fetch requests include a `User-Agent` header identifying the FreshContext crawler where the runtime/source supports it.
79
-
80
- ### 1.4 Content Hash Deduplication
81
-
82
- Before any signal is stored, the platform computes a 32-bit rolling hash of the raw content. If the most recent stored result for a given watched query carries an identical hash, the current result is discarded. This prevents storing unchanged content across cron cycles.
83
-
84
- ### 1.5 Semantic Deduplication
85
-
86
- Beyond exact-match deduplication, FreshContext implements semantic deduplication to prevent the same underlying story appearing as multiple signals because it was covered by multiple sources (e.g., the same GitHub release appearing in both HN and Reddit).
87
-
88
- The semantic fingerprint is computed as follows:
89
-
90
- 1. Extract the first canonical URL from the raw content
91
- 2. Extract the first ISO 8601 publication date from the raw content
92
- 3. Extract and normalise the first substantive line (title) — lowercased, punctuation stripped, truncated to 80 characters
93
- 4. Concatenate: `normalised_title|canonical_url|publication_date`
94
- 5. Compute SHA-256 of the concatenated string
95
- 6. Retain the first 16 hex characters as the fingerprint
96
-
97
- If any signal stored within the preceding 48 hours carries an identical fingerprint, the new result is discarded. The 48-hour window is configurable.
98
-
99
- ---
100
-
101
- ## Section 2: Temporal Scoring — The DAR Engine
102
-
103
- ### 2.1 Overview
104
-
105
- The Decay-Adjusted Relevancy (DAR) engine scores every collected signal on two axes:
106
-
107
- - **R_0 (Base Score)** — semantic relevancy of the content against the user's profile, independent of time
108
- - **R_t (Decay-Adjusted Score)** — R_0 adjusted for how much time has elapsed since the content was published
109
-
110
- The final stored `rt_score` is what drives signal ranking in briefings and the intelligence feed.
111
-
112
- FreshContext measures temporal utility, not truth. A source can be valid and still have low utility for the current query if it is stale. A source can be fresh but low-confidence if its timestamp is missing, malformed, inferred, or contradicted.
113
-
114
- ### 2.2 Base Score Calculation (R_0)
115
-
116
- R_0 is the starting relevance or utility before temporal decay. In the Store/Feed implementation, R_0 is computed by matching content against the user profile:
117
-
118
- ```
119
- R_0 = baseline (40)
120
- + vital_keyword_matches × 15 [capped at +35]
121
- + skill_keyword_matches × 3 [capped at +15]
122
- + location_accessibility_bonus [+8 if remote/accessible]
123
- - error_penalty [−40 if content is empty/error]
124
- ```
125
-
126
- Vital keywords are drawn from the `targets` field of the user profile — job titles, company names, and technology domains the user is specifically tracking.
127
-
128
- Skill keywords are drawn from the `skills` field — the user's technical competencies. A match here adds relevancy signal but at lower weight than a direct target match.
129
-
130
- The location accessibility bonus is applied when the content explicitly mentions "remote", "worldwide", "anywhere", or the user's stated location. This is not a geographic filter — it is a signal boost for content that is accessible to the user regardless of their physical location.
131
-
132
- **Hard exclusions:** If any term from the `exclusion_terms` list appears in the content, R_0 is forced to zero. The result is still stored (for audit purposes) but marked `is_relevant = 0`.
133
-
134
- This profile formula is a Store/Feed implementation example, not the only possible way to produce base relevance. For Core/MCP envelope scoring, R_0 may be normalised to 100. For feed/ranking systems, R_0 may come from semantic relevance, profile relevance, adapter-specific relevance, or another documented scoring layer.
135
-
136
- ### 2.3 Context-Conditioned Utility
137
-
138
- FreshContext scoring is context-conditioned. The same signal can have different utility depending on the user, query, agent, platform, or workflow requesting it.
139
-
140
- In the Store/Feed implementation, this context is represented by `R_0`, the base relevance or utility score before temporal decay. `R_0` may be computed from profile targets, query terms, semantic relevance, adapter-specific relevance, or another documented scoring layer.
141
-
142
- The DAR function then applies temporal pressure:
143
-
144
- ```
145
- R_t = R_0 · e^(-λt)
146
- ```
147
-
148
- This means FreshContext does not treat freshness as a standalone ranking signal. A fresh but irrelevant signal should not outrank an older but highly relevant signal unless the source policy and use case justify it.
149
-
150
- FreshContext Core exposes a pure context utility primitive for this direction:
151
-
152
- ```
153
- U(q, s, t) = R(q, s) · e^(-λt) · C_date · C_status
154
- ```
155
-
156
- Where:
157
- - `q` is the requester context: user, query, agent, platform, or workflow
158
- - `s` is the signal or database record
159
- - `R(q, s)` is contextual relevance between the request and the signal
160
- - `λ` is the source-specific decay constant
161
- - `t` is signal age
162
- - `C_date` is a timestamp-confidence factor
163
- - `C_status` is a failure/partial/success factor
164
-
165
- This is an extension of the DAR methodology, not a replacement for it. The purpose is to support systems where FreshContext runs over databases, feeds, retrieved documents, or agent memory and ranks information by both relevance and temporal utility. It does not imply vector search, multi-agent orchestration, or a hosted context store.
166
-
167
- ### 2.4 Decay Function (R_t)
168
-
169
- ```
170
- R_t = R_0 · e^(-λt)
171
- ```
172
-
173
- Where:
174
- - `λ` = source-specific decay constant (per hour)
175
- - `t` = hours elapsed since `published_at` / `content_date`
176
- - `R_t` = current temporal utility score
177
-
178
- If `published_at` / `content_date` cannot be extracted, the system must not pretend the signal is fresh. Core-compatible envelope scoring SHOULD use `freshness_score: null` and low confidence. Store/feed systems MAY apply a conservative fallback assumption, such as one source half-life, but must mark confidence low and explain the assumption.
179
-
180
- ### 2.5 Source Decay Constants (λ)
181
-
182
- These constants are reference/default calibration values for how quickly signals from each source class lose temporal utility:
183
-
184
- | Source | λ (per hour) | Half-life |
185
- |---|---|---|
186
- | Hacker News | 0.050 | ~14 hours |
187
- | Reddit | 0.010 | ~3 days |
188
- | Product Hunt | 0.010 | ~3 days |
189
- | Job listings | 0.005 | ~6 days |
190
- | Financial data | 0.001 | ~29 days |
191
- | YC companies | 0.001 | ~29 days |
192
- | Package trends | 0.0005 | ~58 days |
193
- | GitHub repositories | 0.0002 | ~5 months |
194
- | Academic papers | 0.00005 | ~1.6 years |
195
-
196
- These constants are reference defaults used by the FreshContext methodology and may be tuned by implementation. Hosted or private deployments may use calibrated variants per source, query type, or user profile. The calibration process and production tuning may be proprietary, even when public reference defaults are documented.
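
The half-life column follows from the decay function: setting the decayed score to half of the base score and solving for the age gives the relation below. The Hacker News row is worked as a check; the remaining rows follow the same arithmetic.

```latex
\[
R_0\, e^{-\lambda t_{1/2}} = \tfrac{1}{2} R_0
\;\Longrightarrow\;
t_{1/2} = \frac{\ln 2}{\lambda}
\]
\[
\lambda = 0.050\ \text{h}^{-1}
\;\Longrightarrow\;
t_{1/2} = \frac{0.6931}{0.050} \approx 13.9\ \text{hours} \approx 14\ \text{hours}
\]
```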
197
-
198
- ### 2.6 Entropy Classification
199
-
200
- Each signal is classified into one of three entropy states based on its position on the decay curve:
201
-
202
- | State | Condition | Interpretation |
203
- |---|---|---|
204
- | `low` | `t < half_life / 2` | Signal near peak value — act now |
205
- | `stable` | `t < 1.5 × half_life` | Usable signal — monitor |
206
- | `high` | `t ≥ 1.5 × half_life` | Significantly degraded — verify before acting |
207
-
208
- Entropy labels describe signal decay state, not confidence level. A high-entropy signal may still be factually accurate, but it has lost temporal utility for current retrieval unless reinforced by newer evidence.
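
The three states can be sketched as a small classifier. This is an illustrative reading of the table, not the shipped code: the function name `entropy_level` and its parameter names are hypothetical, and it assumes that any age at or beyond 1.5 half-lives is `high`.

```python
def entropy_level(age_hours: float, half_life_hours: float) -> str:
    """Classify a signal's decay state from its age and its source half-life."""
    if age_hours < half_life_hours / 2:
        return "low"      # near peak value
    if age_hours < 1.5 * half_life_hours:
        return "stable"   # usable, keep monitoring
    return "high"         # assumed: everything at or past 1.5 half-lives
```

With the reference Hacker News half-life of roughly 14 hours, a 5-hour-old story would be `low` and a 21-hour-old story would be `high` under this reading.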
209
-
210
- ### 2.7 Relevancy Threshold
211
-
212
- Signals with `rt_score < 35` are stored with `is_relevant = 0`. They remain in the database for audit and historical analysis but are excluded from briefings and the intelligence feed by default. The threshold is configurable per profile.
213
-
214
- ### 2.8 Failure Honesty
215
-
216
- Failed adapters must not be promoted by freshness scoring. Empty, blocked, timeout, malformed, rate-limited, access-denied, or error-only outputs reduce R_0 or mark the signal status as failed/unknown.
217
-
218
- A failed result should not receive high confidence. A failed result should not produce `Score: 100/100`. Partial composites should preserve successful upstream results while marking failures explicitly.
219
-
220
- ---
221
-
222
- ## Section 3: FreshContext Store / Ledger Methodology
223
-
224
- ### 3.1 The Ha-Pri Audit Signature
225
-
226
- Every signal stored in a FreshContext Store/Ledger deployment carries a `ha_pri_sig` — a SHA-256 audit signature computed as:
227
-
228
- ```
229
- SHA-256( result_id + ":" + content_hash + ":" + "FRESHCONTEXT_DAR_V1" )
230
- ```
231
-
232
- In Ha-Pri v1, this signature is a provenance stamp and audit reference for stored signals. It binds the result ID, the current content hash, and the engine version. It is not yet a full tamper-enforcement system: the current `content_hash` source is the existing rolling `result_hash`, and signatures are not recomputed on read to reject modified rows.
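
A minimal sketch of the v1 formula, for readers who want to recompute a signature during an audit. The function name `ha_pri_sig` and the constant name are hypothetical; UTF-8 encoding of the joined string is an assumption, while the hex output matches the 64-hex-character column described in the schema.

```python
import hashlib

# Engine/version marker taken from the v1 formula above.
ENGINE_MARKER = "FRESHCONTEXT_DAR_V1"

def ha_pri_sig(result_id: str, content_hash: str) -> str:
    """Recompute a Ha-Pri v1 style signature as a 64-character hex digest."""
    # Assumption: the three parts are joined with ":" and encoded as UTF-8.
    payload = f"{result_id}:{content_hash}:{ENGINE_MARKER}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because v1 does not recompute signatures on read, a check like this would have to be run by the auditor; a mismatch indicates only that the row no longer matches the v1 formula, not who changed it.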
233
-
234
- Ha-Pri v1 serves three purposes:
235
-
236
- 1. **Provenance reference** — the signature binds the result ID, current rolling content hash, and engine/version marker so the stored signal can be audited against the v1 formula.
237
- 2. **Scoring lineage** — the signature records the scoring/signature formula used when the row was written.
238
- 3. **Licensing / audit reference** — when FreshContext data is provided to a third party under licence, the `ha_pri_sig` column gives a stable reference for what was stored and delivered.
239
-
240
- Ha-Pri v1 is not hard tamper enforcement. It is not recomputed on read, it signs the existing rolling result hash (`result_hash`) rather than a canonical content SHA-256, and it does not reject rows. Ha-Pri v2 is the planned/additive path for stronger verification.
241
-
242
- Future Ha-Pri v2 may add canonical content SHA-256, stronger canonicalization, and explicit verification/rejection on read. That hardening is separate from the current v1 provenance stamp.
243
-
244
- Ha-Pri v1 is the provenance layer and the foundation for a stronger integrity layer, while DAR and context-conditioned utility are the ranking/scoring layer.
245
-
246
- ### 3.2 D1 Historical Ledger
247
-
248
- The `scrape_results` table functions as a **Contextual Ledger** — not merely a cache, but a time-series record of intelligence signals with full provenance.
249
-
250
- This Store/Ledger methodology is not required for basic FreshContext-compatible envelope implementations. It is the methodology for systems that persist recurring signals and want auditability over time.
251
-
252
- Key properties of the ledger:
253
- - Scored signal material is treated as immutable once written; consumption metadata such as `is_new` may be updated
254
- - Every row carries a `scraped_at` timestamp with second precision
255
- - Every row carries a `published_at` date extracted from content (where available)
256
- - The ledger accumulates continuously at 6-hour intervals regardless of active user sessions
257
- - The ledger enables time-travel queries: "what was the intelligence landscape for topic X at date Y?"
258
-
259
- ### 3.3 Schema Reference
260
-
261
- ```sql
262
- scrape_results (
263
- id TEXT PRIMARY KEY, -- sr_{timestamp}_{random}
264
- watched_query_id TEXT, -- FK watched_queries.id
265
- adapter TEXT, -- source adapter name
266
- query TEXT, -- the search term used
267
- raw_content TEXT, -- scraped content (max 8000 chars)
268
- result_hash TEXT, -- 32-bit rolling hash of raw_content
269
- semantic_fingerprint TEXT, -- 16-char SHA-256 of normalised title|url|date
270
- is_new INTEGER, -- 1 until consumed by briefing
271
- scraped_at TEXT, -- ISO 8601 UTC timestamp
272
- published_at TEXT, -- extracted content publication date
273
- relevancy_score INTEGER, -- = round(rt_score), 0-100
274
- is_relevant INTEGER, -- 1 if rt_score >= 35, else 0
275
- base_score INTEGER, -- R_0 semantic score, 0-100
276
- rt_score REAL, -- R_t decay-adjusted score, 0-100
277
- ha_pri_sig TEXT, -- SHA-256 audit signature (64 hex chars)
278
- entropy_level TEXT -- 'low' | 'stable' | 'high'
279
- )
280
- ```
281
-
282
- ---
283
-
284
- ## Section 4: The Intelligence Feed
285
-
286
- ### 4.1 Endpoint
287
-
288
- ```
289
- GET /v1/intel/feed/{profile_id}
290
- ```
291
-
292
- Optional parameters:
293
- - `limit` — maximum signals to return (default: 20)
294
- - `min_rt` — minimum rt_score filter (default: 0)
295
-
296
- ### 4.2 Response Structure
297
-
298
- ```json
299
- {
300
- "feed_metadata": {
301
- "profile_id": "default",
302
- "generated_at": "2026-04-14T09:00:00Z",
303
- "signal_count": 18,
304
- "version": "freshcontext-1.2"
305
- },
306
- "signals": [
307
- {
308
- "signal_id": "sr_1744628412_a3f7b",
309
- "source": "hackernews",
310
- "label": "HN: MCP Servers",
311
- "content": {
312
- "preview": "...",
313
- "url": "mcp server 2026"
314
- },
315
- "intelligence_stamps": {
316
- "scraped_at": "2026-04-14T08:12:00Z",
317
- "published_at": "2026-04-14",
318
- "base_score": 78,
319
- "rt_score": 61.4,
320
- "entropy_level": "stable",
321
- "ha_pri_sig": "a3f7b2c1d4e5f6a7b8c9d0e1f2a3b4c5..."
322
- }
323
- }
324
- ]
325
- }
326
- ```
327
-
328
- ### 4.3 LLM Integration
329
-
330
- The intelligence feed is designed to be consumed directly by any language model or AI agent without modification. The `intelligence_stamps` block gives the agent everything it needs to reason about data freshness:
331
-
332
- - `rt_score` — a single number representing current signal value
333
- - `entropy_level` — human-readable decay state
334
- - `published_at` — the actual content date (not the retrieval date)
335
- - `ha_pri_sig` — provenance reference the agent can cite
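
One way an agent might turn these stamps into prompt context is sketched below. The helper names `render_signal` and `render_feed` are hypothetical, as is the line layout; only the field names come from the response structure in 4.2.

```python
def render_signal(signal: dict) -> str:
    """Format one feed signal as a single line an LLM can cite."""
    stamps = signal["intelligence_stamps"]
    return (
        f"[{signal['source']}] {signal['label']} | "
        f"published {stamps.get('published_at') or 'unknown'} | "
        f"rt_score {stamps['rt_score']:.1f} | "
        f"entropy {stamps['entropy_level']} | "
        f"sig {stamps['ha_pri_sig'][:12]}"
    )

def render_feed(feed: dict, min_rt: float = 0.0) -> list[str]:
    """Keep signals at or above min_rt and order them by rt_score, highest first."""
    keep = [s for s in feed["signals"]
            if s["intelligence_stamps"]["rt_score"] >= min_rt]
    keep.sort(key=lambda s: s["intelligence_stamps"]["rt_score"], reverse=True)
    return [render_signal(s) for s in keep]
```

Truncating the signature to 12 characters is a presentation choice for short prompts; an agent that needs to cite provenance exactly would pass the full value through.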
336
-
337
- This is the core value proposition: **AI agents get grounded, timestamped, scored intelligence rather than undated web content of unknown age.**
338
-
339
- MCP is one interface over this methodology, not the whole system. The same scoring, timestamp, confidence, and provenance primitives can support APIs, CLIs, npm packages, dashboards, agents, and internal services.
340
-
341
- ---
342
-
343
- ## Section 5: Asset Summary
344
-
345
- For technical integrators, auditors, and future platform partners:
346
-
347
- **What FreshContext owns:**
348
-
349
- 1. **The FreshContext Specification v1.2** (MIT licence, open standard) — defines the envelope format, confidence levels, structured JSON form, freshness score behavior, and failure-honesty requirements. Timestamped in the public GitHub repository.
350
-
351
- 2. **The DAR Engine** — the exponential decay scoring methodology with source-specific λ reference defaults and calibrated production tuning.
352
-
353
- 3. **The Semantic Fingerprinting Method** — the three-field normalisation and SHA-256 fingerprinting approach for cross-adapter deduplication.
354
-
355
- 4. **The Ha-Pri Audit Signature scheme** — the provenance stamp and audit reference that binds stored row material to the current v1 formula; stronger tamper-evidence is the future additive v2 path.
356
-
357
- 5. **The Store / Ledger design** — support for recurring watched queries, historical signal accumulation, D1-backed storage, and time-series auditability.
358
-
359
- 6. **The Reference Implementation** — `freshcontext-mcp@0.3.19`, the `evaluate_context` MCP interface, and 21 read-only reference adapters, listed on npm and the MCP Registry. The hosted Worker endpoint is a separate deployment surface.
360
-
361
- ---
362
-
363
- ## Changelog
364
-
365
- ### Version 1.2 — May 2026
366
- - Clarified Core methodology vs Store/Ledger methodology.
367
- - Preserved DAR as the mathematical scoring backbone.
368
- - Updated reference implementation language for `evaluate_context` plus 21 MCP reference adapters.
369
- - Reframed source decay constants as reference defaults/calibration values.
370
- - Added failure-honesty methodology.
371
- - Added context-conditioned utility as a Core scoring primitive.
372
- - Clarified missing timestamp behavior.
373
- - Clarified MCP as one interface, not the whole system.
374
-
375
- ### Version 1.1 — April 2026
376
- - Existing methodology version.
377
-
378
- ---
379
-
380
- *"The work isn't gone. It's just waiting to be continued."*
381
- *— Prince Gabriel, Grootfontein, Namibia*
1
+ # FreshContext Data Intelligence Methodology
2
+ **Version 1.3 — June 2026**
3
+ *Authored by Immanuel Gabriel (Prince Gabriel) — Grootfontein / Tsumeb, Namibia*
4
+
5
+ > v1.3 updates the provenance/integrity layer to document the SHIPPED, LIVE Ha-Pri v2/v3
6
+ > signed-verdict loop (HMAC, append-only ledger, stateless verify endpoint) which is in
7
+ > production as of 2026-06-30, and records the honest Flag A decay validation against 1,219
8
+ > rows of real data. See the Changelog and Section 3 for the v3 detail; see the companion
9
+ > FRESHCONTEXT_FLAG_A_THESIS for the full decay audit.
10
+
11
+ ---
12
+
13
+ ## What This Document Is
14
+
15
+ This document formally describes the data collection, scoring, ranking, storage, and provenance methodology underlying FreshContext.
16
+
17
+ It exists for four audiences:
18
+
19
+ 1. **Technical integrators** — teams embedding FreshContext into their agent infrastructure who need to understand what the data represents and how it is scored.
20
+ 2. **Agent/retrieval system builders** — teams designing retrieval pipelines that need temporal relevance instead of undated context.
21
+ 3. **Auditors and reviewers** — people verifying that timestamped AI context is represented honestly and reproducibly.
22
+ 4. **Future licensing or platform partners** — entities evaluating FreshContext as infrastructure, who need to audit the methodology that makes the data defensible.
23
+
24
+ ---
25
+
26
+ ## Section 1: Core Methodology and Source Collection
27
+
28
+ ### 1.1 Architecture
29
+
30
+ FreshContext Core methodology describes the signal contract and temporal scoring primitives that can be used by MCP servers, APIs, CLIs, dashboards, agents, or internal retrieval systems.
31
+
32
+ The Core methodology covers:
33
+
34
+ - **Signal schema** — source, content, timestamps, confidence, adapter identity
35
+ - **Source/provenance** — where the observation came from and how it was retrieved
36
+ - **Published/content date** — when the source claims the content became true or available
37
+ - **Retrieved timestamp** — when FreshContext observed the content
38
+ - **Confidence** — how reliable the timestamp extraction is
39
+ - **Decay-Adjusted Relevancy (DAR)** — temporal utility after source-specific decay
40
+ - **Failure honesty** — failed adapters must not be promoted as fresh successful context
41
+ - **Ranking/explain primitives** — fields that let agents and systems explain why a signal ranked where it did
42
+
43
+ FreshContext also supports a Store/Ledger methodology for systems that persist recurring signals over time. The production Worker implementation uses Cloudflare runtime pieces for MCP transport, KV cache policy, rate limiting, D1 persistence, feeds, cron collection, and deployment concerns. Those runtime concerns are implementation layers, not requirements for every FreshContext-compatible system.
44
+
45
+ ### 1.2 Store / Ledger Collection Layer
46
+
47
+ The Store/Ledger methodology describes a continuous data collection pipeline that can run on Cloudflare's global edge infrastructure. A deployment may execute scheduled collection via cron and query watched definitions stored in D1 or another durable store.
48
+
49
+ Each watched query specifies:
50
+ - **Adapter** — the data source to query (e.g., `hackernews`, `jobs`, `reposearch`)
51
+ - **Query** — the search term or URL
52
+ - **User ID** — the profile this query serves
53
+ - **Filters** — optional parameters (location, exclusion terms, etc.)
54
+
55
+ This D1 cron ledger is one implementation layer and future Store direction. It is not required for every FreshContext-compatible envelope implementation.
56
+
57
+ ### 1.3 Example Adapter / Source Classes
58
+
59
+ FreshContext currently has:
60
+
61
+ - A reference MCP implementation with `evaluate_context` and 21 read-only reference adapters
62
+ - Separate feed products such as Fresh HN Feed and Fresh Jobs Feed
63
+ - A Store/Ledger methodology for systems that collect recurring signals over time
64
+
65
+ The following table describes example source classes used by FreshContext implementations. Not every source class is necessarily collected by every cron/feed deployment.
66
+
67
+ | Adapter class | Source | Auth Required | Typical Update Frequency |
68
+ |---|---|---|---|
69
+ | `hackernews` | Hacker News Algolia API | None | Real-time |
70
+ | `jobs` | Remotive API | None | Continuous |
71
+ | `reposearch` | GitHub Search API | Optional (rate limit) | Real-time |
72
+ | `github` | GitHub Repository API | Optional | Real-time |
73
+ | `reddit` | Reddit JSON API | None | Real-time |
74
+ | `yc` | YC Open Source API | None | Per batch cycle |
75
+ | `packagetrends` | npm Registry + npm Downloads API | None | Per publish |
76
+ | `finance` | Stooq quote API | None | Market hours / quote feed cadence |
77
+ | `producthunt` | Product Hunt launch data | Token when API-backed | Launch cadence |
78
+ | `changelog` | GitHub Releases / npm package metadata | Optional | Per release |
79
+ | `arxiv` / `scholar` | Academic sources | None | Publication cadence |
80
+ | `gdelt` | GDELT global news | None | 15-minute feed cadence |
81
+ | `govcontracts` / `gebiz` | Government procurement datasets | None | Dataset cadence |
82
+ | `sec_filings` | SEC EDGAR filings | None | Filing cadence |
83
+
84
+ FreshContext adapters operate on publicly accessible or publicly documented data sources. Most reference adapters require no credentials. Some APIs may optionally use tokens for rate limits or official API access, but FreshContext-compatible adapters should not require private user data unless explicitly documented by the implementation. All fetch requests include a `User-Agent` header identifying the FreshContext crawler where the runtime/source supports it.
85
+
86
+ ### 1.4 Content Hash Deduplication
87
+
88
+ Before any signal is stored, the platform computes a 32-bit rolling hash of the raw content. If the most recent stored result for a given watched query carries an identical hash, the current result is discarded. This prevents storing unchanged content across cron cycles.
89
+
90
+ ### 1.5 Semantic Deduplication
91
+
92
+ Beyond exact-match deduplication, FreshContext implements semantic deduplication to prevent the same underlying story appearing as multiple signals because it was covered by multiple sources (e.g., the same GitHub release appearing in both HN and Reddit).
93
+
94
+ The semantic fingerprint is computed as follows:
95
+
96
+ 1. Extract the first canonical URL from the raw content
97
+ 2. Extract the first ISO 8601 publication date from the raw content
98
+ 3. Extract and normalise the first substantive line (title) — lowercased, punctuation stripped, truncated to 80 characters
99
+ 4. Concatenate: `normalised_title|canonical_url|publication_date`
100
+ 5. Compute SHA-256 of the concatenated string
101
+ 6. Retain the first 16 hex characters as the fingerprint
102
+
103
+ If any signal stored within the preceding 48 hours carries an identical fingerprint, the new result is discarded. The 48-hour window is configurable.
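
The six steps above can be sketched as follows. This is an illustration under stated assumptions rather than the shipped implementation: the function name, the URL and date regular expressions, the choice of the first non-empty line as the "first substantive line", whitespace collapsing, and the use of empty strings for missing fields are all hypothetical.

```python
import hashlib
import re

def semantic_fingerprint(raw_content: str) -> str:
    """Return a 16-hex-character fingerprint of normalised title|url|date."""
    # Steps 1-2: first URL and first ISO 8601 date (assumed patterns).
    url_match = re.search(r"https?://\S+", raw_content)
    date_match = re.search(r"\d{4}-\d{2}-\d{2}", raw_content)
    url = url_match.group(0) if url_match else ""
    date = date_match.group(0) if date_match else ""

    # Step 3: first non-empty line, lowercased, punctuation stripped, 80 chars.
    first_line = next(
        (line.strip() for line in raw_content.splitlines() if line.strip()), ""
    )
    title = re.sub(r"[^\w\s]", "", first_line.lower())
    title = " ".join(title.split())[:80]

    # Steps 4-6: concatenate, hash with SHA-256, keep the first 16 hex chars.
    key = f"{title}|{url}|{date}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Two captures of the same story that differ only in title casing or punctuation produce the same fingerprint under this sketch, which is the property the 48-hour window relies on.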
104
+
105
+ ---
106
+
107
+ ## Section 2: Temporal Scoring — The DAR Engine
108
+
109
+ ### 2.1 Overview
110
+
111
+ The Decay-Adjusted Relevancy (DAR) engine scores every collected signal on two axes:
112
+
113
+ - **R_0 (Base Score)** — semantic relevancy of the content against the user's profile, independent of time
114
+ - **R_t (Decay-Adjusted Score)** — R_0 adjusted for how much time has elapsed since the content was published
115
+
116
+ The final stored `rt_score` is what drives signal ranking in briefings and the intelligence feed.
117
+
118
+ FreshContext measures temporal utility, not truth. A source can be valid and still have low utility for the current query if it is stale. A source can be fresh but low-confidence if its timestamp is missing, malformed, inferred, or contradicted.
119
+
120
+ ### 2.2 Base Score Calculation (R_0)
121
+
122
+ R_0 is the starting relevance or utility before temporal decay. In the Store/Feed implementation, R_0 is computed by matching content against the user profile:
123
+
124
+ ```
125
+ R_0 = baseline (40)
126
+ + vital_keyword_matches × 15 [capped at +35]
127
+ + skill_keyword_matches × 3 [capped at +15]
128
+ + location_accessibility_bonus [+8 if remote/accessible]
129
+ - error_penalty [−40 if content is empty/error]
130
+ ```
131
+
132
+ Vital keywords are drawn from the `targets` field of the user profile — job titles, company names, and technology domains the user is specifically tracking.
133
+
134
+ Skill keywords are drawn from the `skills` field — the user's technical competencies. A match here adds relevancy signal but at lower weight than a direct target match.
135
+
136
+ The location accessibility bonus is applied when the content explicitly mentions "remote", "worldwide", "anywhere", or the user's stated location. This is not a geographic filter — it is a signal boost for content that is accessible to the user regardless of their physical location.
137
+
138
+ **Hard exclusions:** If any term from the `exclusion_terms` list appears in the content, R_0 is forced to zero. The result is still stored (for audit purposes) but marked `is_relevant = 0`.
139
+
140
+ This profile formula is a Store/Feed implementation example, not the only possible way to produce base relevance. For Core/MCP envelope scoring, R_0 may be normalised to 100. For feed/ranking systems, R_0 may come from semantic relevance, profile relevance, adapter-specific relevance, or another documented scoring layer.
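
A rough sketch of the profile formula from this section, to make the caps and the hard exclusion concrete. Several details are assumptions: the function and parameter names are hypothetical, a "match" is counted once per keyword found by case-insensitive substring search, only empty content triggers the error penalty, and the result is clamped to 0-100.

```python
def base_score(content, targets, skills, user_location="", exclusion_terms=()):
    """Compute an R_0-style base score for one piece of content."""
    text = content.lower()

    # Hard exclusion: any excluded term forces the score to zero.
    if any(term.lower() in text for term in exclusion_terms):
        return 0

    score = 40  # baseline
    vital = sum(1 for kw in targets if kw.lower() in text)
    score += min(vital * 15, 35)   # vital keywords, capped at +35
    skill = sum(1 for kw in skills if kw.lower() in text)
    score += min(skill * 3, 15)    # skill keywords, capped at +15

    access_terms = ["remote", "worldwide", "anywhere"]
    if user_location:
        access_terms.append(user_location.lower())
    if any(term in text for term in access_terms):
        score += 8                 # location accessibility bonus

    if not text.strip():
        score -= 40                # assumed: empty content is the error case

    return max(0, min(100, score))
```

For example, a remote job post that matches one target and one skill would score 40 + 15 + 3 + 8 = 66 under these assumptions.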
141
+
142
+ ### 2.3 Context-Conditioned Utility
143
+
144
+ FreshContext scoring is context-conditioned. The same signal can have different utility depending on the user, query, agent, platform, or workflow requesting it.
145
+
146
+ In the Store/Feed implementation, this context is represented by `R_0`, the base relevance or utility score before temporal decay. `R_0` may be computed from profile targets, query terms, semantic relevance, adapter-specific relevance, or another documented scoring layer.
147
+
148
+ The DAR function then applies temporal pressure:
149
+
150
+ ```
151
+ R_t = R_0 · e^(-λt)
152
+ ```
153
+
154
+ This means FreshContext does not treat freshness as a standalone ranking signal. A fresh but irrelevant signal should not outrank an older but highly relevant signal unless the source policy and use case justify it.
155
+
156
+ FreshContext Core exposes a pure context utility primitive for this direction:
157
+
158
+ ```
159
+ U(q, s, t) = R(q, s) · e^(-λt) · C_date · C_status
160
+ ```
161
+
162
+ Where:
163
+ - `q` is the requester context: user, query, agent, platform, or workflow
164
+ - `s` is the signal or database record
165
+ - `R(q, s)` is contextual relevance between the request and the signal
166
+ - `λ` is the source-specific decay constant
167
+ - `t` is signal age
168
+ - `C_date` is a timestamp-confidence factor
169
+ - `C_status` is a failure/partial/success factor
170
+
171
+ This is an extension of the DAR methodology, not a replacement for it. The purpose is to support systems where FreshContext runs over databases, feeds, retrieved documents, or agent memory and ranks information by both relevance and temporal utility. It does not imply vector search, multi-agent orchestration, or a hosted context store.
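
As a rough sketch, the utility primitive can be written as one pure function. The function and parameter names are hypothetical, and treating `C_date` and `C_status` as multipliers in the range 0 to 1 is an assumption; the text only defines them as confidence and status factors.

```python
import math


def context_utility(relevance: float, decay_per_hour: float, age_hours: float,
                    c_date: float = 1.0, c_status: float = 1.0) -> float:
    """U(q, s, t) = R(q, s) * e^(-lambda * t) * C_date * C_status."""
    return relevance * math.exp(-decay_per_hour * age_hours) * c_date * c_status


# A fresh but weakly relevant signal versus an older, highly relevant one,
# both under the Hacker News reference constant (0.050 per hour).
fresh_irrelevant = context_utility(relevance=20, decay_per_hour=0.05, age_hours=1)
older_relevant = context_utility(relevance=90, decay_per_hour=0.05, age_hours=14)
print(older_relevant > fresh_irrelevant)
```

The comparison illustrates the point made in the text: freshness alone does not decide the ranking, because relevance enters the product.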
172
+
173
+ ### 2.4 Decay Function (R_t)
174
+
175
+ ```
176
+ R_t = R_0 · e^(-λt)
177
+ ```
178
+
179
+ Where:
180
+ - `λ` = source-specific decay constant (per hour)
181
+ - `t` = hours elapsed since `published_at` / `content_date`
182
+ - `R_t` = current temporal utility score
183
+
184
+ If `published_at` / `content_date` cannot be extracted, the system must not pretend the signal is fresh. Core-compatible envelope scoring SHOULD use `freshness_score: null` and low confidence. Store/feed systems MAY apply a conservative fallback assumption, such as one source half-life, but must mark confidence low and explain the assumption.
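
A minimal sketch of the decay step, including the missing-timestamp rule, might look like this. The function name and the choice to clamp future-dated content to an age of zero are assumptions for illustration; returning `None` mirrors the `freshness_score: null` behaviour described for Core-compatible envelopes.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Optional


def rt_score(r0: float, decay_per_hour: float,
             published_at: Optional[datetime], now: datetime) -> Optional[float]:
    """Return R_t = R_0 * e^(-lambda * t), or None when no content date exists."""
    if published_at is None:
        return None  # never pretend an undated signal is fresh
    # Assumption: content dated in the future is treated as age zero.
    age_hours = max((now - published_at).total_seconds() / 3600.0, 0.0)
    return r0 * math.exp(-decay_per_hour * age_hours)


now = datetime(2026, 4, 14, 9, 0, tzinfo=timezone.utc)
one_half_life = math.log(2) / 14  # a decay constant with a 14-hour half-life
print(rt_score(80, one_half_life, now - timedelta(hours=14), now))
print(rt_score(80, one_half_life, None, now))
```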
185
+
186
+ ### 2.5 Source Decay Constants (λ)
187
+
188
+ These constants are reference/default calibration values for how quickly signals from each source class lose temporal utility:
189
+
190
+ | Source | λ (per hour) | Half-life |
191
+ |---|---|---|
192
+ | Hacker News | 0.050 | ~14 hours |
193
+ | Reddit | 0.010 | ~3 days |
194
+ | Product Hunt | 0.010 | ~3 days |
195
+ | Job listings | 0.005 | ~6 days |
196
+ | Financial data | 0.001 | ~29 days |
197
+ | YC companies | 0.001 | ~29 days |
198
+ | Package trends | 0.0005 | ~58 days |
199
+ | GitHub repositories | 0.0002 | ~5 months |
200
+ | Academic papers | 0.00005 | ~1.6 years |
201
+
202
+ These constants are reference defaults used by the FreshContext methodology and may be tuned per implementation. Hosted or private deployments may use calibrated variants per source, query type, or user profile. The calibration process and production tuning may be proprietary, even when public reference defaults are documented.
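
The half-life column follows from the decay constant as `half_life = ln(2) / λ`. The short check below recomputes it from the table values; the dictionary and function names are illustrative only.

```python
import math

# Decay constants per hour, copied from the table above.
REFERENCE_LAMBDAS = {
    "Hacker News": 0.050,
    "Reddit": 0.010,
    "Product Hunt": 0.010,
    "Job listings": 0.005,
    "Financial data": 0.001,
    "YC companies": 0.001,
    "Package trends": 0.0005,
    "GitHub repositories": 0.0002,
    "Academic papers": 0.00005,
}


def half_life_hours(decay_per_hour: float) -> float:
    """Time for R_t to fall to half of R_0 under exponential decay."""
    return math.log(2) / decay_per_hour


for source, lam in REFERENCE_LAMBDAS.items():
    hours = half_life_hours(lam)
    print(f"{source}: {hours:.1f} h ({hours / 24:.1f} days)")
```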
203
+
204
+ **Validation note (Flag A, 2026-06-30).** The decay model was tested against 1,219 rows of
205
+ real production data across 6 active sources. Findings, stated honestly: (1) the pure
206
+ per-source exponential `freshness_score` is clean and correctly implemented: measured
207
+ half-lives match design intent (Hacker News ~0.6 days, GitHub ~144 days, arXiv ~578 days);
208
+ (2) age predicts SOURCE-LEVEL decay *rate* (the volatility ordering HN-fastest to
209
+ code-search-slowest is stable and real in the data), but does NOT predict any individual
210
+ item's retained value — within a single age bin, retained relevance varies widely because it
211
+ is driven by per-item content, not age alone; (3) the λ values remain reasoned reference
212
+ defaults, not yet outcome-calibrated. Conclusion: λ is a correct source-volatility-ranking
213
+ primitive, not a per-item validity oracle, and the clean exponential is the right floor model
214
+ (Weibull did not beat it on the data). Per-item validity and λ calibration are staged future
215
+ work (the latter data-gated on the signed-verdict ledger accumulating). Full audit:
216
+ companion FRESHCONTEXT_FLAG_A_THESIS.
217
+
218
+ ### 2.6 Entropy Classification
219
+
220
+ Each signal is classified into one of three entropy states based on its position on the decay curve:
221
+
222
+ | State | Condition | Interpretation |
223
+ |---|---|---|
224
+ | `low` | `t < half_life / 2` | Signal near peak value — act now |
225
+ | `stable` | `t < 1.5 × half_life` | Usable signal — monitor |
226
+ | `high` | `t ≥ 1.5 × half_life` | Significantly degraded; verify before acting |
227
+
228
+ Entropy labels describe signal decay state, not confidence level. A high-entropy signal may still be factually accurate, but it has lost temporal utility for current retrieval unless reinforced by newer evidence.
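
The three conditions in the table translate directly into a small classifier. The sketch below derives the half-life from λ; the function name is hypothetical.

```python
import math


def entropy_level(age_hours: float, decay_per_hour: float) -> str:
    """Map a signal's age to 'low', 'stable', or 'high' per the table above."""
    half_life = math.log(2) / decay_per_hour
    if age_hours < half_life / 2:
        return "low"
    if age_hours < 1.5 * half_life:
        return "stable"
    return "high"


# Hacker News reference constant: half-life of roughly 14 hours.
for age in (3, 10, 24):
    print(age, entropy_level(age, 0.05))
```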
229
+
230
+ ### 2.7 Relevancy Threshold
231
+
232
+ Signals with `rt_score < 35` are stored with `is_relevant = 0`. They remain in the database for audit and historical analysis but are excluded from briefings and the intelligence feed by default. The threshold is configurable per profile.
233
+
234
+ ### 2.8 Failure Honesty
235
+
236
+ Failed adapters must not be promoted by freshness scoring. Empty, blocked, timeout, malformed, rate-limited, access-denied, or error-only outputs reduce R_0 or mark the signal status as failed/unknown.
237
+
238
+ A failed result should not receive high confidence. A failed result should not produce `Score: 100/100`. Partial composites should preserve successful upstream results while marking failures explicitly.
239
+
240
+ ---
241
+
242
+ ## Section 3: FreshContext Store / Ledger Methodology
243
+
244
+ ### 3.1 The Ha-Pri Audit Signature
245
+
246
+ Every signal stored in a FreshContext Store/Ledger deployment carries a `ha_pri_sig` — a SHA-256 audit signature computed as:
247
+
248
+ ```
249
+ SHA-256( result_id + ":" + content_hash + ":" + "FRESHCONTEXT_DAR_V1" )
250
+ ```
251
+
252
+ In Ha-Pri v1, this signature is a provenance stamp and audit reference for stored signals. It binds the result ID, the current content hash, and the engine version. It is not yet a full tamper-enforcement system: the current `content_hash` source is the existing rolling `result_hash`, and signatures are not recomputed on read to reject modified rows.
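
The v1 formula can be reproduced in a few lines. This is a sketch under two assumptions that the text does not state: the concatenated string is UTF-8 encoded before hashing, and the signature is stored as lowercase hex. The function name and the sample inputs are illustrative.

```python
import hashlib

ENGINE_MARKER = "FRESHCONTEXT_DAR_V1"


def ha_pri_v1(result_id: str, content_hash: str) -> str:
    """SHA-256(result_id + ":" + content_hash + ":" + engine marker) as hex."""
    payload = f"{result_id}:{content_hash}:{ENGINE_MARKER}"
    # Assumption: UTF-8 encoding of the joined string.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


sig = ha_pri_v1("sr_1744628412_a3f7b", "example-rolling-hash")
print(len(sig))
```

Because the input includes `content_hash`, a different hash value yields a different signature, but nothing in v1 re-checks this on read.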
253
+
254
+ Ha-Pri v1 serves three purposes:
255
+
256
+ 1. **Provenance reference:** the signature binds the result ID, current rolling content hash, and engine/version marker so the stored signal can be audited against the v1 formula.
257
+ 2. **Scoring lineage:** the signature records the scoring/signature formula used when the row was written.
258
+ 3. **Licensing / audit reference:** when FreshContext data is provided to a third party under licence, the `ha_pri_sig` column gives a stable reference for what was stored and delivered.
259
+
260
+ Ha-Pri v1 is not hard tamper enforcement. It is not recomputed on read, it signs the existing rolling hash (`result_hash`) rather than a canonical content SHA-256, and it does not reject rows. Ha-Pri v2 is the planned/additive path for stronger verification.
261
+
262
+ Future Ha-Pri v2 may add canonical content SHA-256, stronger canonicalization, and explicit verification/rejection on read. That hardening is separate from the current v1 provenance stamp.
263
+
264
+ Ha-Pri v1 is the provenance layer and the foundation for a stronger integrity layer, while DAR and context-conditioned utility are the ranking/scoring layer.
265
+
266
+ ### 3.1a The Ha-Pri v2 / v3 Signed-Verdict Loop (SHIPPED, LIVE 2026-06-30)
267
+
268
+ Ha-Pri v1 (above) was the provenance *stamp*. The hardening it anticipated is now built,
269
+ tested, and running in production. This subsection documents the live integrity layer.
270
+
271
+ **v2 HMAC-signed content provenance.** The signing payload binds the canonical content
272
+ SHA-256, semantic fingerprint, adapter, published/retrieved timestamps, and the version-scoped
273
+ engine version, signed with HMAC-SHA256 under a server-held secret (`FC_HMAC_SECRET`, never
274
+ shipped in Core, never logged). The secret is injected at the edge, not imported into Core.
275
+
276
+ **v3 verdict-bound signing.** v3 extends the v2 payload with the `verdict_id` and the
277
+ `decision` itself, under the header `FRESHCONTEXT_HA_PRI_V3`. Because the decision is inside
278
+ the signed bytes, the signature is **tamper-evident at the verdict level**: changing the
279
+ decision after signing breaks the HMAC, and a new valid signature cannot be forged without the
280
+ secret. v3 runs alongside v2 (additive, non-breaking).
281
+
282
+ **The append-only ledger.** Every signed verdict is written to an append-only D1 table
283
+ (`evaluation_snapshots`) storing the full signing payload, signature, version-scoped engine
284
+ version, decision, verdict_id, and the decision-time `evaluated_at` (the stored decision time,
285
+ not a fresh clock read — the now-per-pull invariant). Rows are never updated or deleted;
286
+ integrity rests on immutability. The write is non-fatal and non-blocking: a ledger failure can
287
+ never break the consumer's evaluation response.
288
+
289
+ **Trustless verification.** A stateless `/v1/verify` endpoint recomputes the HMAC and returns
290
+ `valid` / `invalid` / `unknown`. A third party verifies a verdict by recompute-and-compare
291
+ *without holding the secret*. Verification reads the STORED engine_version (version-scoped),
292
+ never a live constant, so a verdict signed under any version remains verifiable forever.
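
To make the tamper-evidence property concrete, here is a sketch of verdict-bound signing and recompute-and-compare using Python's standard `hmac` module. The payload field names, the JSON canonical form, and both function names are assumptions; the text does not document the actual byte layout. Recomputing an HMAC requires the secret, so the check is shown as it would run on the side that holds it. The `unknown` result is not modelled here.

```python
import hashlib
import hmac
import json

HEADER_V3 = "FRESHCONTEXT_HA_PRI_V3"


def sign_verdict(secret: bytes, payload: dict) -> str:
    # Assumption: canonical form is the header, a newline, then sorted-key JSON.
    message = HEADER_V3 + "\n" + json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_verdict(secret: bytes, payload: dict, signature: str) -> str:
    expected = sign_verdict(secret, payload)
    # compare_digest avoids leaking the mismatch position through timing.
    return "valid" if hmac.compare_digest(expected, signature) else "invalid"


secret = b"example-secret-not-a-real-key"
payload = {
    "content_sha256": "0" * 64,       # placeholder values throughout
    "semantic_fingerprint": "abcd1234abcd1234",
    "adapter": "hackernews",
    "published_at": "2026-04-14",
    "retrieved_at": "2026-04-14T08:12:00Z",
    "engine_version": "example-engine-version",
    "verdict_id": "vd_example",
    "decision": "fresh",
}
signature = sign_verdict(secret, payload)
tampered = dict(payload, decision="stale")
print(verify_verdict(secret, payload, signature), verify_verdict(secret, tampered, signature))
```

Changing only the `decision` field invalidates the signature, which is the verdict-level tamper evidence the v3 description relies on.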
293
+
294
+ **Why this matters (the trust-layer claim):** v1 could say "this is what we stored." v3 can
295
+ say "this verdict was reached at this time, signed, and you can prove it was not altered —
296
+ without trusting us." That is an audit primitive, not a dashboard. It is the difference
297
+ between a freshness *feature* and context-integrity *infrastructure*.
298
+
299
+ **Honest status line:** the signed evaluate → store → verify loop is live in production
300
+ (first real signed row landed 2026-06-30, verified byte-correct). The emitted `[FRESHCONTEXT_SIG_V1]`
301
+ block in tool output remains v2 for now; v3 is stored in the ledger (the verdict-bound record).
302
+ Mounting the full public REST surface and an enforcement wrapper that *acts* on verdicts are
303
+ staged future work.
304
+
305
+ ### 3.2 D1 Historical Ledger
306
+
307
+ The `scrape_results` table functions as a **Contextual Ledger** — not merely a cache, but a time-series record of intelligence signals with full provenance.
308
+
309
+ This Store/Ledger methodology is not required for basic FreshContext-compatible envelope implementations. It is the methodology for systems that persist recurring signals and want auditability over time.
310
+
311
+ Key properties of the ledger:
312
+ - Scored signal material is treated as immutable once written; consumption metadata such as `is_new` may be updated
313
+ - Every row carries a `scraped_at` timestamp with second precision
314
+ - Every row carries a `published_at` date extracted from content (where available)
315
+ - The ledger accumulates at 6-hour intervals regardless of active user sessions
316
+ - The ledger enables time-travel queries: "what was the intelligence landscape for topic X at date Y?"
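
A time-travel query can be illustrated with Python's built-in `sqlite3` module standing in for D1. The table below uses only a subset of the `scrape_results` columns, the rows are made up, and the `landscape_at` helper is hypothetical. It relies on `scraped_at` values sharing one ISO 8601 UTC format, so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE scrape_results (
        id TEXT PRIMARY KEY,
        query TEXT,
        scraped_at TEXT,
        rt_score REAL,
        is_relevant INTEGER
    )
""")
rows = [  # illustrative data only
    ("sr_1", "mcp server", "2026-04-10T06:00:00Z", 72.5, 1),
    ("sr_2", "mcp server", "2026-04-12T12:00:00Z", 58.0, 1),
    ("sr_3", "mcp server", "2026-04-15T00:00:00Z", 81.0, 1),
]
conn.executemany("INSERT INTO scrape_results VALUES (?, ?, ?, ?, ?)", rows)


def landscape_at(conn: sqlite3.Connection, query: str, as_of: str) -> list:
    """Relevant signals for `query` scraped on or before `as_of`, best first."""
    cur = conn.execute(
        "SELECT id, rt_score FROM scrape_results "
        "WHERE query = ? AND scraped_at <= ? AND is_relevant = 1 "
        "ORDER BY rt_score DESC",
        (query, as_of),
    )
    return cur.fetchall()


print(landscape_at(conn, "mcp server", "2026-04-13T00:00:00Z"))
```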
317
+
318
+ ### 3.3 Schema Reference
319
+
320
+ ```sql
321
+ CREATE TABLE scrape_results (
322
+ id TEXT PRIMARY KEY, -- sr_{timestamp}_{random}
323
+ watched_query_id TEXT, -- FK → watched_queries.id
324
+ adapter TEXT, -- source adapter name
325
+ query TEXT, -- the search term used
326
+ raw_content TEXT, -- scraped content (max 8000 chars)
327
+ result_hash TEXT, -- 32-bit rolling hash of raw_content
328
+ semantic_fingerprint TEXT, -- 16-char SHA-256 of normalised title|url|date
329
+ is_new INTEGER, -- 1 until consumed by briefing
330
+ scraped_at TEXT, -- ISO 8601 UTC timestamp
331
+ published_at TEXT, -- extracted content publication date
332
+ relevancy_score INTEGER, -- = round(rt_score), 0-100
333
+ is_relevant INTEGER, -- 1 if rt_score >= 35, else 0
334
+ base_score INTEGER, -- R_0 semantic score, 0-100
335
+ rt_score REAL, -- R_t decay-adjusted score, 0-100
336
+ ha_pri_sig TEXT, -- SHA-256 audit signature (64 hex chars)
337
+ entropy_level TEXT -- 'low' | 'stable' | 'high'
338
+ );
339
+ ```
340
+
341
+ ---
342
+
343
+ ## Section 4: The Intelligence Feed
344
+
345
+ ### 4.1 Endpoint
346
+
347
+ ```
348
+ GET /v1/intel/feed/{profile_id}
349
+ ```
350
+
351
+ Optional parameters:
352
+ - `limit`: maximum signals to return (default: 20)
353
+ - `min_rt`: minimum rt_score filter (default: 0)
354
+
355
+ ### 4.2 Response Structure
356
+
357
+ ```json
358
+ {
359
+ "feed_metadata": {
360
+ "profile_id": "default",
361
+ "generated_at": "2026-04-14T09:00:00Z",
362
+ "signal_count": 18,
363
+ "version": "freshcontext-1.2"
364
+ },
365
+ "signals": [
366
+ {
367
+ "signal_id": "sr_1744628412_a3f7b",
368
+ "source": "hackernews",
369
+ "label": "HN: MCP Servers",
370
+ "content": {
371
+ "preview": "...",
372
+ "url": "mcp server 2026"
373
+ },
374
+ "intelligence_stamps": {
375
+ "scraped_at": "2026-04-14T08:12:00Z",
376
+ "published_at": "2026-04-14",
377
+ "base_score": 78,
378
+ "rt_score": 61.4,
379
+ "entropy_level": "stable",
380
+ "ha_pri_sig": "a3f7b2c1d4e5f6a7b8c9d0e1f2a3b4c5..."
381
+ }
382
+ }
383
+ ]
384
+ }
385
+ ```
386
+
387
+ ### 4.3 LLM Integration
388
+
389
+ The intelligence feed is designed to be consumed directly by any language model or AI agent without modification. The `intelligence_stamps` block gives the agent everything it needs to reason about data freshness:
390
+
391
+ - `rt_score` — a single number representing current signal value
392
+ - `entropy_level` — human-readable decay state
393
+ - `published_at` — the actual content date (not the retrieval date)
394
+ - `ha_pri_sig` — provenance reference the agent can cite
395
+
396
+ This is the core value proposition: **AI agents get grounded, timestamped, scored intelligence rather than undated web content of unknown age.**
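
As one possible consumption pattern, an agent could filter the feed by its stamps before building a prompt. The sketch below parses a trimmed response in the shape shown in section 4.2. The `usable_signals` helper, its default threshold of 35, and the policy of skipping `high` entropy signals are assumptions for illustration, not behaviour defined by the feed.

```python
import json

# Trimmed response in the shape of section 4.2; values are illustrative.
FEED_JSON = """
{
  "signals": [
    {"signal_id": "sr_a", "source": "hackernews",
     "intelligence_stamps": {"published_at": "2026-04-14", "rt_score": 61.4,
                             "entropy_level": "stable"}},
    {"signal_id": "sr_b", "source": "reddit",
     "intelligence_stamps": {"published_at": "2026-03-01", "rt_score": 12.0,
                             "entropy_level": "high"}}
  ]
}
"""


def usable_signals(feed: dict, min_rt: float = 35.0) -> list:
    """Keep signals at or above min_rt that are not in the 'high' entropy state."""
    kept = []
    for signal in feed.get("signals", []):
        stamps = signal.get("intelligence_stamps", {})
        if stamps.get("rt_score", 0.0) >= min_rt and stamps.get("entropy_level") != "high":
            kept.append((signal["signal_id"], stamps["rt_score"], stamps.get("published_at")))
    # Highest temporal utility first.
    return sorted(kept, key=lambda item: item[1], reverse=True)


feed = json.loads(FEED_JSON)
print(usable_signals(feed))
```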
397
+
398
+ MCP is one interface over this methodology, not the whole system. The same scoring, timestamp, confidence, and provenance primitives can support APIs, CLIs, npm packages, dashboards, agents, and internal services.
399
+
400
+ ---
401
+
402
+ ## Section 5: Asset Summary
403
+
404
+ For technical integrators, auditors, and future platform partners:
405
+
406
+ **What FreshContext owns:**
407
+
408
+ 1. **The FreshContext Specification v1.2** (MIT licence, open standard) — defines the envelope format, confidence levels, structured JSON form, freshness score behavior, and failure-honesty requirements. Timestamped in the public GitHub repository.
409
+
410
+ 2. **The DAR Engine** — the exponential decay scoring methodology with source-specific λ reference defaults and calibrated production tuning.
411
+
412
+ 3. **The Semantic Fingerprinting Method** — the three-field normalisation and SHA-256 fingerprinting approach for cross-adapter deduplication.
413
+
414
+ 4. **The Ha-Pri Audit Signature scheme** — a layered integrity system: v1 is the provenance
415
+ stamp; **v2/v3 is the live, shipped, HMAC-signed, verdict-bound, tamper-evident loop** with
416
+ an append-only ledger and a stateless trustless `/v1/verify` endpoint (in production as of
417
+ 2026-06-30). This is the defensible core: a signed verdict a third party can verify without
418
+ trusting the issuer.
419
+
420
+ 5. **The Store / Ledger design** — support for recurring watched queries, historical signal accumulation, D1-backed storage, and time-series auditability.
421
+
422
+ 6. **The Reference Implementation** — `freshcontext-mcp@0.3.23`, the `evaluate_context` MCP interface, and 21 read-only reference adapters, listed on npm and the MCP Registry. The hosted Worker endpoint is a separate deployment surface.
423
+
424
+ ---
425
+
426
+ ## Changelog
427
+
428
+ ### Version 1.3 — June 2026
429
+ - Documented the shipped, live Ha-Pri v2/v3 signed-verdict loop (§3.1a): HMAC signing,
430
+ verdict-bound v3 payload, append-only `evaluation_snapshots` ledger, stateless trustless
431
+ `/v1/verify`, version-scoped verification, non-fatal ledger writes. Live in production
432
+ 2026-06-30 (first signed row verified byte-correct).
433
+ - Added the Flag A decay validation note (§2.5): tested against 1,219 real rows; pure
434
+ exponential `freshness_score` confirmed clean; honest boundary stated (source-level rate,
435
+ not per-item oracle; λ reasoned-not-calibrated). Full audit in companion FLAG_A_THESIS.
436
+ - Updated asset summary (§5.4) to reflect the live signed-verdict integrity layer as the
437
+ defensible core.
438
+
439
+ ### Version 1.2 — May 2026
440
+ - Clarified Core methodology vs Store/Ledger methodology.
441
+ - Preserved DAR as the mathematical scoring backbone.
442
+ - Updated reference implementation language for `evaluate_context` plus 21 MCP reference adapters.
443
+ - Reframed source decay constants as reference defaults/calibration values.
444
+ - Added failure-honesty methodology.
445
+ - Added context-conditioned utility as a Core scoring primitive.
446
+ - Clarified missing timestamp behavior.
447
+ - Clarified MCP as one interface, not the whole system.
448
+
449
+ ### Version 1.1 — April 2026
450
+ - Existing methodology version.
451
+
452
+ ---
453
+
454
+ *"The work isn't gone. It's just waiting to be continued."*
455
+ *— Prince Gabriel, Grootfontein, Namibia*