@xdarkicex/openclaw-memory-libravdb 1.3.9 → 1.3.12
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +18 -0
- package/docs/README.md +9 -1
- package/docs/ast-v2.md +125 -0
- package/docs/ast.md +70 -0
- package/docs/compaction-evaluation.md +182 -0
- package/docs/continuity.md +488 -0
- package/docs/contributing.md +1 -1
- package/docs/gating.md +53 -255
- package/docs/installation.md +45 -9
- package/docs/mathematics-v2.md +1228 -0
- package/openclaw.plugin.json +2 -2
- package/package.json +1 -1
- package/src/context-engine.ts +306 -35
- package/src/continuity.ts +93 -0
- package/src/index.ts +1 -1
- package/src/openclaw-plugin-sdk.d.ts +2 -2
- package/src/recall-utils.ts +100 -8
- package/src/scoring.ts +263 -9
- package/src/tokens.ts +1 -1
- package/src/types.ts +33 -2
@@ -0,0 +1,1228 @@

# Mathematical Reference

This document is the formal reference for the scoring and optimization math used
by the plugin. The gating scalar is documented separately in
[gating.md](./gating.md). The continuity model and recent-tail preservation
layer are documented in [continuity.md](./continuity.md). The authored
invariant/variant partitioning rules are documented in
[ast-v2.md](./ast-v2.md). Earlier non-versioned math docs are preserved for
historical context, but the reviewed `*-v*` documents are authoritative when
both forms exist.

Every formula below points at the file that currently implements it. If the code
changes first, this document must change with it.

This revision (3.3) merges the complete section set from `mathematics.md` with
the formal corrections introduced in `mathematics-3-2.md`. All sections are now
present and carry the 3.2 corrections:

- explicit domain and startup invariants where later proofs depend on them
- removal of self-referential set definitions in the planned two-pass model
- disambiguation of decay symbols with different units and meanings
- explicit convex-combination proof obligations for bounded scores
- regularized Matryoshka normalization with $\varepsilon$-guarded denominators
  and explicit early-exit threshold values
- division-by-zero guards in compaction clustering ($n = 0$ and $k = 0$ cases)
- clamped confidence formula with per-backend range proofs
- cold-start smoothing in the authority-weight frequency term $f(d)$
- separated coarse-candidate raw set from filtered set in Pass 1
- $\eta_{\mathrm{hop}}$ symbol replacing bare $\lambda$ for hop attenuation
- startup invariant $\tau_{\mathcal{I}} \le \tau$ made explicit
- edge-case safety and quality-multiplier boundedness added as runtime invariants
- Unicode code-point correction in sidecar token estimator
- $\chi$ calibration notice tied to tokenizer validation

## 1. Hybrid Scoring

Each candidate returned by the vector store starts with a cosine similarity score
$\cos(q,d) \in [0,1]$ from embedding retrieval. The host then applies a hybrid
ranker:

$$
\mathrm{base}(d) =
\alpha \cdot \cos(q,d) +
\beta \cdot R(d) +
\gamma \cdot S(d)
$$

$$
\mathrm{score}(d) = \mathrm{base}(d) \cdot Q(d)
$$

where:

$$
R(d) = e^{-\lambda_s(d)\,\Delta t_d}
$$

$$
S(d)=
\begin{cases}
1.0 & \text{if } d \text{ is from the active session} \\
0.6 & \text{if } d \text{ is from durable user memory} \\
0.3 & \text{if } d \text{ is from global memory}
\end{cases}
$$

$$
Q(d)=
\begin{cases}
1 - \delta \cdot \mathrm{decay\_rate}(d) & \text{if } d \text{ is a summary} \\
1 & \text{otherwise}
\end{cases}
$$

Implemented in [`src/scoring.ts`](../src/scoring.ts).

The current implementation defaults are:

- $\alpha = 0.7$
- $\beta = 0.2$
- $\gamma = 0.1$
- $\delta = 0.5$

The runtime enforces this convex-mixture contract by clamping weights into
$[0,1]$ and re-normalizing them onto a unit sum before scoring. This keeps the
base score on a stable scale and makes tuning interpretable: increasing one
weight means explicitly decreasing another.
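
As a concrete illustration, here is a minimal TypeScript sketch of the ranker
above, including the clamp-and-renormalize step. Names such as `hybridScore`
and `normalizeWeights` are illustrative, not the actual exports of
[`src/scoring.ts`](../src/scoring.ts):

```typescript
type Scope = "session" | "user" | "global";

interface Candidate {
  cosine: number;     // cos(q, d), already clamped to [0, 1]
  ageSeconds: number; // Δt_d
  scope: Scope;
  isSummary: boolean;
  decayRate: number;  // decay_rate(d) ∈ [0, 1]
}

// Per-second decay constants and scope weights from Sections 1–2.
const LAMBDA: Record<Scope, number> = { session: 0.0001, user: 0.00001, global: 0.000002 };
const SCOPE_WEIGHT: Record<Scope, number> = { session: 1.0, user: 0.6, global: 0.3 };

// Clamp each weight into [0, 1], then renormalize onto a unit sum.
function normalizeWeights(alpha: number, beta: number, gamma: number): [number, number, number] {
  const clamp01 = (x: number) => Math.min(1, Math.max(0, x));
  const [a, b, g] = [clamp01(alpha), clamp01(beta), clamp01(gamma)];
  const sum = a + b + g;
  return sum > 0 ? [a / sum, b / sum, g / sum] : [0.7, 0.2, 0.1]; // degenerate input → defaults
}

function hybridScore(d: Candidate, alpha = 0.7, beta = 0.2, gamma = 0.1, delta = 0.5): number {
  const [a, b, g] = normalizeWeights(alpha, beta, gamma);
  const R = Math.exp(-LAMBDA[d.scope] * d.ageSeconds); // recency term
  const S = SCOPE_WEIGHT[d.scope];                     // scope term
  const base = a * d.cosine + b * R + g * S;           // convex combination → [0, 1]
  const Q = d.isSummary ? 1 - delta * d.decayRate : 1; // quality multiplier ∈ [1-δ, 1]
  return base * Q;                                     // final score ∈ [0, 1]
}
```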

**Note on retrieval similarity.** The term $\cos(q,d) \in [0,1]$ represents the
similarity score as bounded at the host ranking boundary. If the retrieval layer
surfaces a negative cosine-style score, the host clamps it to $0$ before applying
the Section 1 hybrid ranker. The planned two-pass system in Section 7 uses raw
cosine similarity spanning $[-1,1]$ with negatives clipped explicitly. These are
described separately to avoid conflating current implementation with planned
architecture.

### 1.1 Domain Constraints

The following parameter domains are required for all formulas in this section:

$$
\alpha, \beta, \gamma \in [0,1], \qquad \alpha + \beta + \gamma = 1
$$

$$
\delta \in [0,1]
$$

$$
\cos(q,d) \in [0,1], \qquad R(d) \in (0,1], \qquad S(d) \in \{0.3, 0.6, 1.0\}
$$

$$
\mathrm{decay\_rate}(d) \in [0,1]
$$

Under these assumptions, $\mathrm{base}(d)$ is a convex combination of
quantities in $[0,1]$, so:

$$
\mathrm{base}(d) \in [0,1]
$$

And since $\delta \in [0,1]$ and the decay rate is in $[0,1]$:

$$
Q(d) \in [1-\delta,\, 1] \subseteq [0,1]
$$

Therefore:

$$
\mathrm{score}(d) \in [0,1]
$$

### 1.2 Boundary Cases

- $\alpha = 1$ collapses to semantic retrieval only.
- $\beta = 1$ collapses to pure recency preference.
- $\gamma = 1$ collapses to scope-only ranking and is almost always wrong
  because it ignores content.
- $\delta = 0$ ignores summary quality completely.
- $\delta = 1$ applies the maximum configured penalty to low-confidence
  summaries while preserving nonnegativity, because the decay rate is in
  $[0,1]$, which guarantees $Q(d) \ge 0$.

### 1.3 Note on $S(d)$ Values

The scope weights $\{1.0, 0.6, 0.3\}$ are empirically tuned constants, not
values derived from a normalized probability model. They are intentionally
stable across query types. At the default $\gamma = 0.1$, the maximum
contribution of $S(d)$ to $\mathrm{base}(d)$ is $0.1$, so miscalibration of
these values has bounded impact on the final score. Future work may replace
this step function with access-frequency priors derived from retrieval
telemetry.

## 2. Recency Decay

Recency uses exponential decay:

$$
R(d) = e^{-\lambda_s \Delta t_d}
$$

where $\Delta t_d$ is the age of the record in seconds and $\lambda_s$ is the
scope-specific decay constant.

Implemented in [`src/scoring.ts`](../src/scoring.ts).

In the current implementation, $\Delta t_d$ is measured in **seconds**, not
milliseconds:

$$
\Delta t_d = \frac{\mathrm{Date.now()} - ts_d}{1000}
$$

and the $\lambda_s$ values are therefore **per-second** decay constants. The
product $\lambda_s \Delta t_d$ is dimensionless, as required by the exponential.

The current implementation uses different constants by scope:

- active session: $\lambda_s = 0.0001$
- durable user memory: $\lambda_s = 0.00001$
- global memory: $\lambda_s = 0.000002$

The implied half-lives make the decay constants auditable at a glance:

| Scope | $\lambda_s$ | Half-life |
|---|---|---|
| Session | $0.0001$ | $\approx 1.9\ \text{hours}$ |
| User | $0.00001$ | $\approx 19\ \text{hours}$ |
| Global | $0.000002$ | $\approx 4\ \text{days}$ |

$$
t_{1/2} = \frac{\ln 2}{\lambda_s}
$$

If those half-lives feel wrong for a given deployment, adjust $\lambda_s$ via
config — do not change the decay formula itself.

This makes session context fade fastest, user memory fade more slowly, and
global memory remain the most stable.
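
For intuition, a small sketch of the decay curve and the half-life identity,
with the constants copied from this section (helper names are illustrative):

```typescript
// Per-second decay constants by scope, as listed above.
const DECAY = { session: 0.0001, user: 0.00001, global: 0.000002 } as const;

// R(d) = exp(-λ_s · Δt_d), with Δt_d in seconds.
function recency(lambda: number, ageSeconds: number): number {
  return Math.exp(-lambda * ageSeconds);
}

// t_{1/2} = ln 2 / λ_s
function halfLifeSeconds(lambda: number): number {
  return Math.LN2 / lambda;
}

console.log(halfLifeSeconds(DECAY.session) / 3600);                  // ≈ 1.93 hours
console.log(recency(DECAY.session, halfLifeSeconds(DECAY.session))); // = 0.5 at one half-life
```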

**Note on symbol disambiguation.** The symbol $\lambda_s$ here denotes the
scope-specific recency decay constant with units $\mathrm{s}^{-1}$. Section 7.3
uses $\lambda_r$ for a separate recency constant in the planned authority weight.
Section 7.7 uses $\eta_{\mathrm{hop}}$ for a dimensionless hop attenuation
factor. These three parameters are distinct and must not be substituted for each
other.

Why exponential instead of linear:

- exponential decay preserves ordering smoothly across many time scales
- it never goes negative
- it gives a natural "fast drop then long tail" shape for conversational relevance

Linear decay has a hard cutoff or requires arbitrary clipping. Exponential decay
fades old memories continuously without inventing a discontinuity.

## 3. Token Budget Fitting

After ranking, the system performs greedy prompt packing.

Implemented in [`src/tokens.ts`](../src/tokens.ts).

Let candidates be sorted by final hybrid score:

$$
\mathrm{score}(d_1) \ge \mathrm{score}(d_2) \ge \dots \ge \mathrm{score}(d_n)
$$

and let $c_i$ be the estimated token cost of candidate $d_i$. The current host
token estimator is:

$$
\mathrm{estimateTokens}(t)=\left\lceil\frac{|t|}{\chi(t)}\right\rceil
$$

where:

$$
\chi(t)=
\begin{cases}
1.6 & \text{for CJK scripts} \\
2.5 & \text{for Cyrillic, Arabic, or Hebrew scripts} \\
4.0 & \text{otherwise}
\end{cases}
$$

Given prompt budget $B$, the system selects the longest ranked prefix whose
cumulative cost fits:

$$
S = [d_1, d_2, \dots, d_m]
$$

such that:

$$
\sum_{i=1}^{m} c_i \le B
$$

and either $m=n$ or $\sum_{i=1}^{m+1} c_i > B$.

Greedy is optimal for this implementation because the ranking is already fixed.
The problem is not "find the best weighted subset under a knapsack objective";
it is "preserve rank order while honoring a hard prompt cap." Once rank order
is fixed, prefix acceptance is the correct policy.
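
A compact sketch of both steps follows. The script detection is reduced to a
crude code-point range check for illustration; the real estimator in
[`src/tokens.ts`](../src/tokens.ts) may classify scripts differently:

```typescript
// Chars-per-token ratio χ(t), keyed on a simplified script check.
function chi(text: string): number {
  if (/[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]/.test(text)) return 1.6; // CJK
  if (/[\u0400-\u04ff\u0590-\u05ff\u0600-\u06ff]/.test(text)) return 2.5; // Cyrillic/Hebrew/Arabic
  return 4.0;
}

// estimateTokens(t) = ⌈|t| / χ(t)⌉
function estimateTokens(text: string): number {
  return Math.ceil(text.length / chi(text));
}

// Greedy prefix packing: accept ranked candidates until the budget B would overflow.
function packPrefix(ranked: string[], budget: number): string[] {
  const selected: string[] = [];
  let used = 0;
  for (const text of ranked) {
    const cost = estimateTokens(text);
    if (used + cost > budget) break; // first overflow ends the accepted prefix
    selected.push(text);
    used += cost;
  }
  return selected;
}
```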

**Note on estimator divergence.** The host estimator
([`src/tokens.ts`](../src/tokens.ts)) is script-aware and is used for prompt
budget fitting. The sidecar estimator
([`sidecar/compact/tokens.go`](../sidecar/compact/tokens.go)) uses a fixed
normalization rule:

$$
\widehat{T}_{sidecar}(t)=\max\!\left(\left\lfloor\frac{C(t)}{4}\right\rfloor,\, 1\right)
$$

where $C(t)$ is the Unicode code-point count of the string. The sidecar uses
`utf8.RuneCountInString()` rather than `len()`, because Go's `len()` returns
the UTF-8 byte length, not the code-point count; a CJK character occupies 3
bytes, so `len()` would produce a systematic over-count relative to the host
estimator's character-based ratios. The remaining divergence is bounded in
impact because the sidecar value appears only as a normalization denominator
in $P(t)$, never in prompt-budget arithmetic.

The two estimators are intentionally different. The host estimator optimizes
prompt-budget accuracy. The sidecar estimator is used only as a stable
normalization denominator in the technical specificity signal $P(t)$ of the
gating scalar. They must not be substituted for each other.

**Note on $\chi$ calibration.** The ratios $\{1.6, 2.5, 4.0\}$ are validated
against GPT-4 family tokenizers. They should be re-validated against the
deployment tokenizer on a representative corpus sample whenever the tokenizer
changes; the validation script and its results should be committed alongside
this document.

## 4. Matryoshka Cascade

For Nomic embeddings, one full vector $\vec{v} \in \mathbb{R}^{768}$ produces
three tiers via regularized normalization:

$$
\vec{u}_{64} = \frac{\vec{v}_{1:64}}{\sqrt{\lVert \vec{v}_{1:64} \rVert_2^2 + \varepsilon^2}}, \quad
\vec{u}_{256} = \frac{\vec{v}_{1:256}}{\sqrt{\lVert \vec{v}_{1:256} \rVert_2^2 + \varepsilon^2}}, \quad
\vec{u}_{768} = \frac{\vec{v}_{1:768}}{\sqrt{\lVert \vec{v}_{1:768} \rVert_2^2 + \varepsilon^2}}
$$

where $\varepsilon = 10^{-8}$.

Re-normalization is required after truncation because a prefix of a unit vector
is not itself a unit vector in general. The regularized denominator
$\sqrt{\lVert \vec{v}_{1:k} \rVert_2^2 + \varepsilon^2}$ is numerically
identical to the plain $L_2$ norm when the norm is large, and smoothly forces
$\vec{u}_k \to \vec{0}$ when the norm approaches zero rather than producing NaN
or amplifying floating-point noise. A near-zero-norm tier vector yields a cosine
score near zero, which falls below both early-exit thresholds and produces
automatic fall-through to the next tier.
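
A hedged sketch of the regularized prefix normalization (the shipped
implementation lives in Go under
[`sidecar/embed/matryoshka.go`](../sidecar/embed/matryoshka.go); TypeScript is
used here for consistency with the other sketches):

```typescript
const EPS = 1e-8; // ε from this section

// u_k = v[0..k) / sqrt(‖v[0..k)‖² + ε²)
function normalizePrefix(v: Float32Array, k: number): Float32Array {
  const prefix = v.subarray(0, k);
  let sumSq = 0;
  for (let i = 0; i < prefix.length; i++) sumSq += prefix[i] * prefix[i];
  const denom = Math.sqrt(sumSq + EPS * EPS); // strictly positive, so never NaN
  const u = new Float32Array(k);
  for (let i = 0; i < k; i++) u[i] = prefix[i] / denom;
  return u; // ‖u‖ < 1; ≈ 1 when ‖prefix‖ ≫ ε, → 0 as ‖prefix‖ → 0
}

// Tier construction for a full 768-d vector v:
// const u64 = normalizePrefix(v, 64);
// const u256 = normalizePrefix(v, 256);
// const u768 = normalizePrefix(v, 768);
```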

**Note on approximate unit normalization.** For any nonzero prefix with
$\varepsilon > 0$:

$$
\lVert \vec{u}_k \rVert_2
= \frac{\lVert \vec{v}_{1:k} \rVert_2}{\sqrt{\lVert \vec{v}_{1:k} \rVert_2^2 + \varepsilon^2}}
< 1
$$

So regularized prefix vectors are **approximately** unit-normalized. The
approximation becomes negligible when the prefix norm is large relative to
$\varepsilon$; with $\varepsilon = 10^{-8}$ and ordinary float32 prefix norms
this difference is not operationally significant, but the distinction matters
for formal correctness.

Implemented in [`sidecar/embed/matryoshka.go`](../sidecar/embed/matryoshka.go)
and [`sidecar/store/libravdb.go`](../sidecar/store/libravdb.go).

Cascade search uses:

- L1: `64d` with early-exit threshold $\theta_{L1} = 0.65$
- L2: `256d` with early-exit threshold $\theta_{L2} = 0.75$
- L3: `768d`

These thresholds are calibrated on held-out cosine rank correlation with the
768d ground truth for the chosen embedding model. They control the
precision/recall tradeoff of the cascade and are not required to preserve exact
ranking — rank preservation at reduced dimension is approximate by design of
Matryoshka prefix embeddings, not a mathematical guarantee. The L1 and L2 tiers
function as recall-oriented coarse filters; the false-positive rate at each tier
is an explicit design parameter controlled by $\theta_{L1}$ and $\theta_{L2}$.
If the embedding model changes, both thresholds must be re-derived from the new
model's ROC curve against 768d ground truth.

The search exits early when a tier's best score exceeds the configured threshold;
otherwise it falls through to the next tier. Empty lower-tier collections
degrade gracefully because:

$$
\max(\emptyset) = 0
$$

and $0$ is below both early-exit thresholds by design.

Backfill condition:

- L3 is the source of truth
- L1 and L2 are derived caches
- if an L1 or L2 insert fails, a dirty-tier marker is recorded
- startup backfill reconstructs the missing tier vector from L3

**Note on $\varepsilon$ calibration.** The value $\varepsilon = 10^{-8}$ is
appropriate for float32 embeddings where pathological near-zero norms are
numerical artifacts. If the embedding model changes, verify that near-zero norms
in the new model are indeed artifacts and not meaningful signal before retaining
this value.

## 5. Compaction Clustering

Compaction groups raw session turns into deterministic chronological clusters
and replaces each cluster with one summary record. The intent is to turn many
highly local turns into fewer retrieval-worthy summaries.

Implemented in [`sidecar/compact/summarize.go`](../sidecar/compact/summarize.go).

The current algorithm is not semantic k-means. It is deterministic chronological
partitioning:

1. collect eligible non-summary turns
2. sort them by `(ts, id)`
3. choose target cluster size $k$
4. normalize the requested target cluster size:

   Non-positive runtime inputs are normalized to the shipped default $k = 20$
   before clustering. After normalization, the effective target size must
   satisfy $k \ge 1$.

5. derive cluster count:

   Let $n$ be the number of eligible turns. The cluster count is:

   $$
   c = \left\lceil \frac{\max(n,\,1)}{k} \right\rceil
   $$

6. assign turn $i$ to cluster:

   $$
   \mathrm{clusterIndex}(i) = \left\lfloor \frac{i \cdot c}{\max(n,\,1)} \right\rfloor
   $$

The $\max(n, 1)$ guards prevent division by zero when $n = 0$. When $n \ge 1$,
these are identical to the unguarded forms $\lceil n/k \rceil$ and
$\lfloor (i \cdot c)/n \rfloor$.

When $n < k$, the formula produces $c = 1$ and all turns map to cluster 0: a
single cluster containing fewer turns than the target size. Single-member
clusters should be tagged with method `trivial` so that downstream consumers can
apply a different quality interpretation if needed.

This yields contiguous chronological buckets of roughly equal size while
avoiding nondeterministic clustering behavior.
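
A sketch of the partitioning rules above. The shipped implementation is Go in
[`sidecar/compact/summarize.go`](../sidecar/compact/summarize.go); TypeScript is
used here for consistency, and the `Turn` shape is illustrative:

```typescript
interface Turn { id: string; ts: number; text: string; }

function clusterTurns(turns: Turn[], targetSize = 20): Turn[][] {
  const k = targetSize >= 1 ? targetSize : 20; // step 4: normalize non-positive k
  // step 2: deterministic sort by (ts, id)
  const sorted = [...turns].sort((a, b) => a.ts - b.ts || a.id.localeCompare(b.id));
  const n = sorted.length;
  const c = Math.ceil(Math.max(n, 1) / k);     // step 5: guarded cluster count
  const clusters: Turn[][] = Array.from({ length: c }, () => []);
  sorted.forEach((turn, i) => {
    const j = Math.floor((i * c) / Math.max(n, 1)); // step 6: clusterIndex(i)
    clusters[j].push(turn);
  });
  return clusters; // contiguous chronological buckets of roughly equal size
}
```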

The summarizer input for cluster $C_j$ is the ordered turn sequence:

$$
C_j = [t_1, t_2, \dots, t_m]
$$

with each element carrying turn id and text.

The output is a summary record $s(C_j)$ with:

- summary text
- source ids
- confidence
- method
- `decay_rate = 1 - confidence`

Implemented across [`sidecar/compact/summarize.go`](../sidecar/compact/summarize.go),
[`sidecar/summarize/engine.go`](../sidecar/summarize/engine.go), and
[`sidecar/summarize/onnx_local.go`](../sidecar/summarize/onnx_local.go).

For the first real-model benchmark pass comparing raw T5 confidence against
Nomic-space preservation metrics and the hard preservation gate, see
[`compaction-evaluation.md`](./compaction-evaluation.md).

### 5.1 Semiotic Mismatch

The system uses:

- T5-small as an optional local abstractive decoder
- Nomic `nomic-embed-text-v1.5` as the canonical retrieval embedding space

Those models do not measure the same thing.

The raw T5 confidence term is:

$$
\mathrm{conf}_{\mathrm{t5}}(s, C_j) =
\exp\!\left(\frac{1}{m}\sum_{i=1}^{m}\log p(x_i \mid x_{<i}, C_j)\right)
$$

where $x_i$ are generated summary tokens. This measures decoder
self-consistency, not geometric preservation in the vector space used later for
retrieval.

So a T5 summary can be locally confident while still drifting away from the
source cluster in Nomic space.

### 5.2 Nomic-Space Preservation

Let the embedding function be:

$$
E : \text{text} \to \mathbb{R}^d
$$

For a source cluster $C_j = \langle t_1, \dots, t_n \rangle$, define:

$$
v_i = E(t_i)
$$

$$
\mu_C = \frac{1}{n}\sum_{i=1}^{n} v_i
$$

$$
v_s = E(s)
$$

where cosine similarity renormalizes vectors at comparison time, so $\mu_C$
does not need separate unit normalization in the definition below.

The primary preservation term is centroid alignment:

$$
Q_{\mathrm{align}}(s, C_j) = \cos(v_s, \mu_C)
$$

The secondary preservation term is average positive source coverage:

$$
Q_{\mathrm{cover}}(s, C_j) =
\frac{1}{n}\sum_{i=1}^{n}\max(0, \cos(v_s, v_i))
$$

The Nomic-space confidence term is then:

$$
\mathrm{conf}_{\mathrm{nomic}}(s, C_j) =
\max\!\left(0,\;\min\!\left(1,\;\frac{Q_{\mathrm{align}} + Q_{\mathrm{cover}}}{2}\right)\right)
$$

This is the canonical compaction quality signal because it is defined in the
same geometric space the vector store uses at retrieval time.
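
A sketch of the three preservation terms, assuming the source turns and the
summary have already been embedded (no claim is made about the real embedding
API; only the arithmetic is shown):

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// Cosine renormalizes at comparison time, so inputs need not be unit vectors.
function cosine(a: number[], b: number[]): number {
  const na = Math.sqrt(dot(a, a));
  const nb = Math.sqrt(dot(b, b));
  return na > 0 && nb > 0 ? dot(a, b) / (na * nb) : 0;
}

function confNomic(summaryVec: number[], sourceVecs: number[][]): number {
  const n = sourceVecs.length;
  const dim = summaryVec.length;
  const mu = new Array(dim).fill(0); // centroid μ_C of the source cluster
  for (const v of sourceVecs) for (let i = 0; i < dim; i++) mu[i] += v[i] / n;
  const qAlign = cosine(summaryVec, mu); // Q_align: centroid alignment
  const qCover =                         // Q_cover: average positive coverage
    sourceVecs.reduce((s, v) => s + Math.max(0, cosine(summaryVec, v)), 0) / n;
  return Math.max(0, Math.min(1, (qAlign + qCover) / 2)); // clamped mean
}
```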

### 5.3 Preservation Gate

Before an abstractive T5 summary is accepted, it must pass a hard preservation
gate:

$$
Q_{\mathrm{align}}(s, C_j) \ge \tau_{\mathrm{preserve}}
$$

with the shipped default:

$$
\tau_{\mathrm{preserve}} = 0.65
$$

If the abstractive summary fails this test, the system rejects it and falls back
to deterministic extractive compaction.

This means the decoder may propose a summary, but Nomic-space preservation
decides whether it is faithful enough to become memory.

### 5.4 Final Confidence

For extractive summaries, the final stored confidence is:

$$
\mathrm{confidence}(s) = \mathrm{conf}_{\mathrm{nomic}}(s, C_j)
$$

For accepted abstractive T5 summaries, the final stored confidence is a
Nomic-heavy hybrid:

$$
\mathrm{confidence}(s) =
\lambda \cdot \mathrm{conf}_{\mathrm{nomic}}(s, C_j)
+ (1-\lambda)\cdot \mathrm{conf}_{\mathrm{t5}}(s, C_j)
$$

with the shipped default:

$$
\lambda = 0.8
$$

So Nomic-space preservation remains the dominant term, while T5 decoder
confidence contributes only auxiliary stability information.

Therefore:

$$
\mathrm{confidence}(s) \in [0,1]
$$

for all valid inputs, because both $\mathrm{conf}_{\mathrm{nomic}}$ and
$\mathrm{conf}_{\mathrm{t5}}$ are bounded in $[0,1]$ and the hybrid is a convex
combination.

### 5.5 Retrieval Decay Multiplier

The retrieval decay metadata is then:

$$
\mathrm{decay\_rate}(s) = 1 - \mathrm{confidence}(s)
$$

and the retrieval quality multiplier from Section 1 becomes:

$$
Q(s) = 1 - \delta \cdot \mathrm{decay\_rate}(s)
$$

Given $\delta \in [0,1]$ and $\mathrm{confidence}(s) \in [0,1]$, the decay rate
is in $[0,1]$ and therefore:

$$
Q(s) \in [1-\delta,\, 1] \subseteq [0,1]
$$

At the shipped default $\delta = 0.5$, this constrains summary quality weights
to:

$$
Q(s) \in [0.5,\, 1.0]
$$

This makes compaction load-bearing in retrieval rather than archival only.
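
A combined sketch of the gate, the final confidence, and the downstream decay
multiplier (constants are the shipped defaults quoted in Sections 5.3–5.5; the
function names are illustrative):

```typescript
const TAU_PRESERVE = 0.65; // τ_preserve
const LAMBDA_MIX = 0.8;    // λ, the Nomic-heavy mixing weight

interface SummaryScores { qAlign: number; confNomic: number; confT5: number; }

// Returns the stored confidence, or null when an abstractive summary fails the
// hard preservation gate and must fall back to extractive compaction.
function finalConfidence(s: SummaryScores, abstractive: boolean): number | null {
  if (!abstractive) return s.confNomic;     // extractive path
  if (s.qAlign < TAU_PRESERVE) return null; // hard gate: reject
  return LAMBDA_MIX * s.confNomic + (1 - LAMBDA_MIX) * s.confT5; // convex hybrid
}

// decay_rate(s) = 1 - confidence(s);  Q(s) = 1 - δ · decay_rate(s)
// At δ = 0.5 this keeps Q(s) in [0.5, 1.0].
function qualityMultiplier(confidence: number, delta = 0.5): number {
  return 1 - delta * (1 - confidence);
}
```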

## 6. Why These Pieces Compose

The full quality loop is:

$$
\text{high-value turns}
\rightarrow \text{better clusters}
\rightarrow \text{higher summary confidence}
\rightarrow \text{lower decay rate}
\rightarrow \text{higher retrieval score}
$$

That is the system-level reason the math is distributed across ingestion,
compaction, and retrieval instead of existing only in one scoring function.

For rigor, this section should be read in two parts:

- The upstream step
  `high-value turns -> better clusters -> higher summary confidence`
  is an engineering hypothesis supported by preservation metrics and empirical
  calibration evidence. It is not a pure algebraic proof obligation because it
  depends on learned-model behavior.
- The downstream step
  `higher summary confidence -> lower decay rate -> higher retrieval score`
  is a formal and implementation-correspondence obligation. It follows from:

$$
\mathrm{decay\_rate}(s) = 1 - \mathrm{confidence}(s)
$$

and

$$
Q(s) = 1 - \delta \cdot \mathrm{decay\_rate}(s),
\qquad
S_{\mathrm{final}}(s) = S_{\mathrm{base}}(s) \cdot Q(s)
$$

Under equal base score $S_{\mathrm{base}}$ and fixed $\delta \in [0,1]$,
higher confidence implies lower decay, larger $Q(s)$, and therefore a larger
final retrieval score. This downstream monotonic composition is the part that
must be locked by exact code-level tests before later retrieval architecture
work proceeds.
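
Since the document calls for code-level tests of this monotonicity, here is a
minimal framework-free sketch of such a check (the real test suite may phrase
this differently):

```typescript
// With S_base and δ fixed, the final score must be strictly increasing in confidence.
function finalScore(base: number, confidence: number, delta = 0.5): number {
  const decayRate = 1 - confidence;      // decay_rate(s) = 1 - confidence(s)
  return base * (1 - delta * decayRate); // S_final = S_base · Q(s)
}

// Sweep confidence pairs and assert monotonicity; throws on violation.
for (let c = 0; c < 0.99; c += 0.01) {
  if (finalScore(0.8, c + 0.01) <= finalScore(0.8, c)) {
    throw new Error(`monotonicity violated at confidence ${c.toFixed(2)}`);
  }
}
```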

## 7. Two-Pass Discovery Scoring

This section documents the reviewed scoring and assembly model for the
two-pass retrieval system. Parts of this section are now implemented in
[`src/scoring.ts`](../src/scoring.ts),
[`src/context-engine.ts`](../src/context-engine.ts),
[`src/continuity.ts`](../src/continuity.ts), and the sidecar store/RPC
adapter. Remaining unimplemented or approximate pieces should be treated as
explicit follow-on work, not as permission to relax the mathematical contract.

The design goal is to separate:

1. invariant documents that must always be present
2. cheap discovery over variant documents
3. selective second-pass expansion under a hard prompt budget

### 7.1 Foundational Definitions

Let the retrievable document corpus be:

$$
\mathbf{D}=\{d_1, d_2, \ldots, d_n\}
$$

and let the query space be $\mathbf{Q}$.

Let the embedding function:

$$
\varphi : \mathbf{D}\cup\mathbf{Q}\rightarrow \mathbb{R}^m
$$

map documents and queries to unit vectors:

$$
\lVert \varphi(x) \rVert_2 = 1 \qquad \forall x \in \mathbf{D}\cup\mathbf{Q}
$$

The gating function is:

$$
G : \mathbf{Q}\times\mathbf{D}\rightarrow \{0,1\}
$$

and determines whether a document is injected for a query.

### 7.2 Corpus Decomposition

The reviewed AST partitioning model in [`ast-v2.md`](./ast-v2.md) refines the
older binary invariant-or-variant split into three authored tiers plus a
continuity carve-out inside the retrievable variant corpus.

The authored corpus is partitioned into hard invariants, soft invariants, and
variant memory:

$$
\mathbf{D} = \mathcal{I}_1\cup\mathcal{I}_2\cup\mathcal{V},
\qquad
\mathcal{I}_1\cap\mathcal{I}_2=\mathcal{I}_1\cap\mathcal{V}=\mathcal{I}_2\cap\mathcal{V}=\emptyset
$$

The tier membership predicate is:

$$
\iota : \mathbf{D}\rightarrow \{0,1,2\}
$$

with:

$$
\mathcal{I}_1 = \{d\in\mathbf{D}\mid \iota(d)=1\}
$$

and:

$$
\mathcal{I}_2 = \{d\in\mathbf{D}\mid \iota(d)=2\}
\qquad
\mathcal{V} = \{d\in\mathbf{D}\mid \iota(d)=0\}
$$

Here:

- $\mathcal{I}_1$ is the hard invariant set, injected exactly and never
  truncated
- $\mathcal{I}_2$ is the soft invariant sequence, injected by longest-prefix
  truncation in authored order
- $\mathcal{V}$ is the retrievable variant corpus

For OpenClaw, the intended implementation is that authored documents such as
`AGENTS.md` and `souls.md` are compiled into $\mathcal{I}_1$, $\mathcal{I}_2$,
and $\mathcal{V}$ at load time rather than discovered monolithically at query
time.

The hard authored guarantee is:

$$
\iota(d)=1 \Rightarrow G(q,d)=1 \qquad \forall q\in\mathbf{Q}
$$

Soft invariants are also authored constants, but unlike $\mathcal{I}_1$ they
are budget-elastic. Let the authored order on $\mathcal{I}_2$ be:

$$
\mathcal{I}_2=\langle d^{(2)}_1,d^{(2)}_2,\dots,d^{(2)}_m\rangle
$$

and define the longest-prefix operator:

$$
\mathrm{Pref}(\mathcal{I}_2;\,b)=\langle d^{(2)}_1,\dots,d^{(2)}_j\rangle
$$

where:

$$
j=\max\left\{r\in\{0,\dots,m\}\ \middle|\ \sum_{i=1}^{r}\mathrm{toks}(d^{(2)}_i)\le b\right\}
$$

When continuity is enabled, the runtime further refines the variant corpus into
an exact recent raw suffix and the remaining retrievable variant set:

$$
\mathcal{V}=T_{\mathrm{recent}}\cup\mathcal{V}_{\mathrm{rest}},
\qquad
T_{\mathrm{recent}}\cap\mathcal{V}_{\mathrm{rest}}=\emptyset
$$

Only $\mathcal{V}_{\mathrm{rest}}$ participates in semantic retrieval. The
recent tail is preserved exactly and budgeted separately.

### 7.3 Document Authority Weight

Each retrievable variant document carries a precomputed authority weight:

$$
\omega(d)=\alpha_r\cdot r(d)+\alpha_f\cdot f(d)+\alpha_a\cdot a(d)
$$

with:

$$
\alpha_r+\alpha_f+\alpha_a=1, \qquad \alpha_r,\alpha_f,\alpha_a \in [0,1]
$$

where:

$$
r(d)=\exp\!\left(-\lambda_r\cdot \Delta t(d)\right)
$$

$$
f(d)=\frac{\log(1+\mathrm{acc}(d))}{\log\!\left(1+\max_{d'\in\mathcal{V}_{\mathrm{rest}}}\mathrm{acc}(d')+1\right)}
$$

$$
a(d)\in[0,1]
$$

Here $\lambda_r > 0$ is the recency decay constant with units $\mathrm{s}^{-1}$,
and $\Delta t(d) \ge 0$ is document age in seconds.

The $+1$ in the denominator of $f(d)$, but not the numerator, implements minimal
additive smoothing that guarantees a defined value at cold start. The asymmetry
is deliberate: a document with zero accesses should score $f(d) = 0$ exactly,
which the unsmoothed numerator preserves. When
$\max_{d'\in\mathcal{V}_{\mathrm{rest}}}\mathrm{acc}(d') = 0$, the denominator equals $\log 2$
and:

$$
f(d) = 0 \qquad \forall d\in\mathcal{V}_{\mathrm{rest}}
$$

cleanly deferring frequency weight to $r(d)$ and $a(d)$ until access history
accumulates.

Because $r(d)\in(0,1]$, $f(d)\in[0,1]$, and $a(d)\in[0,1]$, and $\omega(d)$
is a convex combination of these terms:

$$
\omega(d)\in[0,1]
$$

For variant nodes extracted from core authored identity documents,
[`ast-v2.md`](./ast-v2.md) sets $a(d)=1.0$. This lets the planned discovery
score incorporate recency, access frequency, and authored authority without
baking those concerns into the raw cosine term.
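
A sketch of $\omega(d)$ with the cold-start-smoothed $f(d)$. The mixing weights
below are placeholders; this section fixes only their convex constraint, not
their values:

```typescript
interface VariantDoc {
  ageSeconds: number;  // Δt(d)
  accessCount: number; // acc(d)
  authority: number;   // a(d) ∈ [0, 1]; 1.0 for core authored identity nodes
}

function authorityWeight(
  d: VariantDoc,
  maxAccess: number, // max acc(d') over V_rest
  lambdaR: number,   // per-second recency constant λ_r
  alphaR = 0.4, alphaF = 0.3, alphaA = 0.3, // placeholder convex weights
): number {
  const r = Math.exp(-lambdaR * d.ageSeconds);
  // Smoothed denominator: log(1 + maxAccess + 1) is log 2 > 0 even at cold start,
  // while the unsmoothed numerator keeps f(d) = 0 for never-accessed documents.
  const f = Math.log(1 + d.accessCount) / Math.log(1 + maxAccess + 1);
  return alphaR * r + alphaF * f + alphaA * d.authority; // convex → [0, 1]
}
```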

### 7.4 Pass 1: Coarse Semantic Filtering

Pass 1 computes cosine similarity:

$$
\mathrm{sim}(q,d)=\varphi(q)^\top \varphi(d) \in [-1,1]
$$

The raw top-$k_1$ candidate set is:

$$
\mathcal{C}_1^{\mathrm{raw}}(q)=\mathrm{TopK}_{d\in\mathcal{V}_{\mathrm{rest}}}\!\left(k_1,\,\mathrm{sim}(q,d)\right)
$$

with filtered coarse set:

$$
\mathcal{C}_1(q)=\left\{d\in\mathcal{C}_1^{\mathrm{raw}}(q)\mid \mathrm{sim}(q,d)\ge \theta_1\right\}
$$

where $\theta_1\in[-1,1]$.

The purpose of this pass is breadth with cheap semantic recall. Documents below
$\theta_1$ are rejected even if they land in the top-$k_1$ set, because the
first pass must not admit semantically orthogonal noise into second-pass work.
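
The two-step structure of the pass (raw top-$k_1$, then the $\theta_1$ floor)
reduces to a few lines (names are illustrative):

```typescript
interface ScoredDoc { id: string; sim: number; }

// Pass 1: C₁^raw(q) by descending cosine, then reject sub-threshold candidates.
function pass1(candidates: ScoredDoc[], k1: number, theta1: number): ScoredDoc[] {
  const raw = [...candidates]
    .sort((a, b) => b.sim - a.sim) // descending similarity
    .slice(0, k1);                 // raw top-k₁ set C₁^raw(q)
  return raw.filter((d) => d.sim >= theta1); // filtered coarse set C₁(q)
}
```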

### 7.5 Pass 2: Normalized Hybrid Scoring

Let the query keyword extractor return:

$$
K = \mathrm{KeyExt}(q)
$$

and define normalized keyword coverage:

$$
M_{norm}(K,d)=\frac{|K\cap \mathrm{terms}(d)|}{\max(|K|,\,1)}\in[0,1]
$$

When $|K| > 0$ this is identical to $|K\cap \mathrm{terms}(d)| / |K|$. When
$|K| = 0$ (the query yields no extractable keywords), the numerator is zero and
$M_{norm} = 0$ exactly, collapsing the second-pass score to pure semantic
retrieval — the correct degenerate behavior.

The proposed normalized second-pass score is:

$$
S_{final}(d)=
\frac{
\omega(d)\cdot\max(\mathrm{sim}(q,d),\,0)\cdot\left(1+\kappa\cdot M_{norm}(K,d)\right)
}{
1+\kappa
}
$$

where $\kappa\in[0,\infty)$.

The normalized second-pass score form above was suggested during design review
by GitHub contributor [@JuanHuaXu](https://github.com/JuanHuaXu). The broader
two-pass architecture in this section remains project-authored.

This form is preferred over a hard clamp such as $\min(\mathrm{term},1)$
because clamping discards ranking information at the high end of the score
distribution. The denominator $(1+\kappa)$ gives an analytic bound instead of
truncating the result.

The second-pass candidate set is:

$$
\mathcal{C}_2(q)=\mathrm{TopK}_{d\in\mathcal{C}_1(q)}\!\left(k_2,\,S_{final}(d)\right)
$$

with $k_2 \le k_1$ and $k_1, k_2 \in \mathbb{Z}_{>0}$.

### 7.6 Bounded Range and Interpretation of $\kappa$

Let:

$$
s=\max(\mathrm{sim}(q,d),\,0)\in[0,1]
$$

Then:

$$
S_{final}(d)=\frac{\omega(d)\cdot s\cdot(1+\kappa M_{norm}(K,d))}{1+\kappa}
$$

Because $M_{norm}(K,d)\in[0,1]$ and $\kappa\ge 0$:

$$
1 \le 1+\kappa M_{norm}(K,d) \le 1+\kappa
$$

so:

$$
0 \le \frac{1+\kappa M_{norm}(K,d)}{1+\kappa} \le 1
$$

Combining with $s\in[0,1]$ and $\omega(d)\in[0,1]$:

$$
0 \le S_{final}(d)\le \omega(d)\le 1
$$

This yields a clean interpretation of $\kappa$:

- $\kappa = 0$ gives pure semantic retrieval
- $\kappa = 0.5$ allows keyword coverage to provide up to a one-third relative
  boost before normalization
- $\kappa = 1.0$ makes full lexical support restore the pure semantic ceiling
  while penalizing semantic-only matches with no keyword support

A reasonable initial experiment value is:

$$
\kappa = 0.3
$$
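
Both pieces of the second pass translate directly into code (a sketch; the
keyword extractor itself is out of scope here):

```typescript
// M_norm(K, d): normalized keyword coverage; 0 exactly when |K| = 0.
function keywordCoverage(keywords: Set<string>, terms: Set<string>): number {
  let hits = 0;
  for (const k of keywords) if (terms.has(k)) hits++;
  return hits / Math.max(keywords.size, 1);
}

// S_final(d) = ω(d) · max(sim, 0) · (1 + κ·M_norm) / (1 + κ), bounded by ω(d) ≤ 1.
function sFinal(omega: number, sim: number, mNorm: number, kappa = 0.3): number {
  const s = Math.max(sim, 0); // clip negative cosine explicitly
  return (omega * s * (1 + kappa * mNorm)) / (1 + kappa);
}
```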

### 7.7 Multi-Hop Expansion

Let the authored hop graph be:

$$
\mathcal{G}=(\mathbf{D},\, E)
$$

where edges are registered in document metadata at authorship time.

For a document $d$, define its hop neighborhood:

$$
H(d)=\{d'\in\mathbf{D}\mid (d,d')\in E\}
$$

The hop expansion set is:

$$
\mathcal{C}_{hop}(q)=\bigcup_{d\in\mathcal{C}_2(q)} H(d)\setminus\mathcal{C}_2(q)
$$

Each hop candidate inherits a decayed score from its best parent:

$$
S_{hop}(d')=
\eta_{\mathrm{hop}}\cdot
\max_{d\in\mathcal{C}_2(q),\; d'\in H(d)} S_{final}(d)
$$

with hop decay factor $\eta_{\mathrm{hop}}\in(0,1)$.

**Note on symbol disambiguation.** The symbol $\eta_{\mathrm{hop}}$ is used
here deliberately to avoid collision with $\lambda_s$ (scope recency, Section 2)
and $\lambda_r$ (authority-weight recency, Section 7.3). The parameters have
different semantics and units: $\lambda_r$ has units $\mathrm{s}^{-1}$, while
$\eta_{\mathrm{hop}}$ is a dimensionless attenuation factor in $(0,1)$.

The filtered hop set is:

$$
\mathcal{C}_{hop}^{*}(q)=\{d'\in\mathcal{C}_{hop}(q)\mid S_{hop}(d')\ge\theta_{hop}\}
$$

with $\theta_{hop}\in[0,1]$.

Since $S_{final}(d)\in[0,1]$ and $\eta_{\mathrm{hop}}\in(0,1)$:

$$
S_{hop}(d')\in[0,\,1)
$$
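
A one-hop sketch of the expansion and filter above. The constants are
illustrative placeholders; the section fixes only their ranges:

```typescript
// C₂(q) as id → S_final(d); edges as the authored hop graph H(d).
function hopExpand(
  c2: Map<string, number>,
  edges: Map<string, string[]>,
  etaHop = 0.5,   // η_hop ∈ (0, 1), dimensionless attenuation
  thetaHop = 0.2, // θ_hop ∈ [0, 1], hop admission threshold
): Map<string, number> {
  const hops = new Map<string, number>();
  for (const [id, score] of c2) {
    for (const neighbor of edges.get(id) ?? []) {
      if (c2.has(neighbor)) continue;   // C_hop(q) excludes documents already in C₂(q)
      const inherited = etaHop * score; // decayed score via this parent
      // S_hop(d') keeps the best parent's contribution.
      if (inherited > (hops.get(neighbor) ?? -Infinity)) hops.set(neighbor, inherited);
    }
  }
  // C*_hop(q): admit only candidates at or above θ_hop.
  return new Map([...hops].filter(([, s]) => s >= thetaHop));
}
```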

### 7.8 Final Assembly Under a Token Budget

Variant projection is:

$$
\mathrm{Proj}(\mathcal{V}_{\mathrm{rest}},\, q)=\mathcal{C}_2(q)\cup\mathcal{C}_{hop}^{*}(q)
$$

The final injected context is:

$$
C_{\mathrm{total}}(q)=\mathcal{I}_1\cup \mathcal{I}_2^{*}\cup T_{\mathrm{recent}}\cup \mathrm{Proj}(\mathcal{V}_{\mathrm{rest}},\, q)
$$

Let the total prompt budget be $\tau$, and let the reserve fractions satisfy:

$$
\alpha_1,\alpha_2,\beta\in[0,1],
\qquad
\alpha_1+\alpha_2+\beta\le 1
$$

where:

- $\alpha_1$ reserves hard authored budget
- $\alpha_2$ reserves soft authored budget
- $\beta$ is the target recent-tail budget fraction

Define the hard authored token mass:

$$
\tau_{\mathcal{I}_1}=\sum_{d\in\mathcal{I}_1}\mathrm{toks}(d)
$$

**Required startup hard authored invariant:**

$$
\tau_{\mathcal{I}_1}\le \alpha_1\tau
$$

This must be enforced at startup or configuration validation time. If violated,
the system cannot simultaneously satisfy "the hard invariant set is never
truncated" and "total injected tokens do not exceed the total budget."
Initialization must fail or the deployment must be reconfigured.

Let $T_{\mathrm{base}}$ be the mandatory recent-tail base suffix defined in
[`continuity.md`](./continuity.md): the shortest raw suffix of the active
session containing at least the most recent $m$ turns. The mandatory continuity
fit requirement is:

$$
\tau_{\mathcal{I}_1} + \sum_{d\in T_{\mathrm{base}}}\mathrm{toks}(d)\le \tau
$$

Otherwise no legal assembly exists that preserves both hard invariants and the
minimum continuity tail. The runtime must surface degraded mode explicitly; it
must not silently truncate $\mathcal{I}_1$ or split the mandatory recent tail.

The effective soft authored budget is:

$$
\tau_{\mathcal{I}_2}^{\mathrm{eff}}
=
\min\!\left(
\alpha_2\tau,\,
\tau-\tau_{\mathcal{I}_1}-\sum_{d\in T_{\mathrm{base}}}\mathrm{toks}(d)
\right)
$$

and the injected soft invariant prefix is:

$$
\mathcal{I}_2^{*}=\mathrm{Pref}(\mathcal{I}_2;\,\tau_{\mathcal{I}_2}^{\mathrm{eff}})
$$

Define the recent-tail target:

$$
\tau_{\mathrm{tail}}^{\mathrm{target}}=\beta\tau
$$

The exact recent-tail selector is the longest bundle-safe raw suffix containing
$T_{\mathrm{base}}$ and satisfying:

$$
\sum_{d\in T_{\mathrm{recent}}}\mathrm{toks}(d)
\le
\min\!\left(
\max\!\left(\tau_{\mathrm{tail}}^{\mathrm{target}},\,
\sum_{d\in T_{\mathrm{base}}}\mathrm{toks}(d)\right),\,
\tau-\tau_{\mathcal{I}_1}-\sum_{d\in\mathcal{I}_2^{*}}\mathrm{toks}(d)
\right)
$$

This preserves the continuity rule that the mandatory recent suffix wins over
the nominal tail target when they conflict, while still respecting the total
prompt budget.
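
A sketch of the startup validation, the effective soft budget, and the
$\mathrm{Pref}$ operator. The `Doc` shape and all names are illustrative, and
the residual simplifies by taking $T_{\mathrm{recent}} = T_{\mathrm{base}}$:

```typescript
interface Doc { id: string; tokens: number; }

const sumToks = (ds: Doc[]) => ds.reduce((s, d) => s + d.tokens, 0);

function assembleAuthoredBudgets(
  hardInvariants: Doc[], // I₁
  softInvariants: Doc[], // I₂ in authored order
  baseTail: Doc[],       // T_base, the mandatory recent suffix
  tau: number,           // total prompt budget τ
  alpha1: number,
  alpha2: number,
): { softPrefix: Doc[]; residualVariantBudget: number } {
  const tauI1 = sumToks(hardInvariants);
  // Startup invariants: fail loudly instead of silently truncating I₁ or T_base.
  if (tauI1 > alpha1 * tau) throw new Error("hard invariant mass exceeds reserved budget");
  if (tauI1 + sumToks(baseTail) > tau) throw new Error("no legal assembly: degraded mode");

  // τ_eff for I₂, then Pref(I₂; b): longest authored-order prefix that fits.
  const softBudget = Math.min(alpha2 * tau, tau - tauI1 - sumToks(baseTail));
  const softPrefix: Doc[] = [];
  let used = 0;
  for (const d of softInvariants) {
    if (used + d.tokens > softBudget) break;
    softPrefix.push(d);
    used += d.tokens;
  }

  // τ_V(q), with the recent tail approximated by T_base in this sketch.
  const residualVariantBudget = tau - tauI1 - used - sumToks(baseTail);
  return { softPrefix, residualVariantBudget };
}
```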

The residual retrievable variant budget is:

$$
\tau_{\mathcal{V}}(q)
=
\tau-\tau_{\mathcal{I}_1}
-\sum_{d\in\mathcal{I}_2^{*}}\mathrm{toks}(d)
-\sum_{d\in T_{\mathrm{recent}}}\mathrm{toks}(d)
$$

which must satisfy:

$$
\tau_{\mathcal{V}}(q)\ge 0
$$

Documents in $\mathrm{Proj}(\mathcal{V}_{\mathrm{rest}}, q)$ are injected in descending
score order until:

$$
\sum_{d\in \text{injected}} \mathrm{toks}(d)\le\tau_{\mathcal{V}}
$$

The merged score sequence is:

$$
\sigma(d)=
\begin{cases}
S_{final}(d) & d\in\mathcal{C}_2(q) \\
S_{hop}(d) & d\in\mathcal{C}_{hop}^{*}(q)
\end{cases}
$$

### 7.9 Complete Gating Definition

$$
G(q,d)=
\begin{cases}
1 & \text{if } d\in\mathcal{I}_1\cup\mathcal{I}_2^{*}\cup T_{\mathrm{recent}} \\
\mathbf{1}[d\in\mathcal{C}_2(q)\cup\mathcal{C}_{hop}^{*}(q)] & \text{if } d\in\mathcal{V}_{\mathrm{rest}}
\end{cases}
$$

### 7.10 Required Runtime Invariants

The implementation must preserve these properties:

1. Invariant completeness:

   $$
   \forall d\in\mathcal{I}_1,\; \forall q\in\mathbf{Q}: d\in C_{\mathrm{total}}(q)
   $$

2. Soft invariant order preservation:

   $$
   \mathcal{I}_2^{*}\text{ is a prefix of }\mathcal{I}_2
   $$

3. Partition integrity:

   $$
   \mathcal{I}_1\cap\mathcal{I}_2=\mathcal{I}_1\cap\mathcal{V}=\mathcal{I}_2\cap\mathcal{V}=\emptyset,
   \qquad
   T_{\mathrm{recent}}\cap\mathcal{V}_{\mathrm{rest}}=\emptyset
   $$

4. Mandatory recent-tail completeness:

   $$
   T_{\mathrm{base}}\subseteq T_{\mathrm{recent}}
   $$

5. Score boundedness:

   $$
   S_{final}(d)\in[0,1]
   $$

6. Token budget respect:

   $$
   \sum_{d\in C_{\mathrm{total}}(q)} \mathrm{toks}(d)\le\tau
   $$

   with $\mathcal{I}_1$ never truncated, $\mathcal{I}_2$ truncated only by
   longest-prefix selection, and the recent-tail base never silently dropped.

7. Compaction boundary safety:

   Compaction may operate only on $\mathcal{V}_{\mathrm{rest}}$, never on
   $T_{\mathrm{recent}}$.

8. Hop termination:

   The authored hop graph should be acyclic, or the runtime must cap hop depth
   at one to guarantee termination.

9. Edge-case safety:

   No valid input in the declared domain may produce a NaN, a negative score,
   or a division-by-zero. This includes at minimum:

   - cold-start corpus with $\max \mathrm{acc}=0$
   - empty extracted keyword set with $|K|=0$
   - zero eligible clustering turns with $n=0$
   - near-zero-norm Matryoshka prefix vectors
   - empty hop neighborhoods
   - empty or zero-residual $\tau_{\mathcal{V}}(q)$ after invariant and
     continuity reservation

10. Quality multiplier boundedness:

    $$
    \mathrm{confidence}(s)\in[0,1],
    \qquad
    Q(d)\in[1-\delta,\,1]\subseteq[0,1]
    $$

    for all valid inputs with $\delta\in[0,1]$.