uchi-python 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Joseph Woodall
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,468 @@
1
+ Metadata-Version: 2.4
2
+ Name: uchi_python
3
+ Version: 0.1.0
4
+ Summary: Online credibility-weighted sequence predictor for tabular, time series, and generative ML
5
+ License: MIT
6
+ Keywords: machine learning,online learning,sequence prediction,time series,concept drift,trie,CTW
7
+ Classifier: Development Status :: 3 - Alpha
8
+ Classifier: Intended Audience :: Science/Research
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: Programming Language :: Python :: 3.10
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
15
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
16
+ Requires-Python: >=3.10
17
+ Description-Content-Type: text/markdown
18
+ License-File: LICENSE
19
+ Provides-Extra: sklearn
20
+ Requires-Dist: scikit-learn>=1.0; extra == "sklearn"
21
+ Provides-Extra: numpy
22
+ Requires-Dist: numpy>=1.20; extra == "numpy"
23
+ Provides-Extra: pandas
24
+ Requires-Dist: pandas>=1.3; extra == "pandas"
25
+ Provides-Extra: pyspark
26
+ Requires-Dist: pyspark>=3.0.0; extra == "pyspark"
27
+ Provides-Extra: optuna
28
+ Requires-Dist: optuna>=3.0.0; extra == "optuna"
29
+ Provides-Extra: all
30
+ Requires-Dist: scikit-learn>=1.0; extra == "all"
31
+ Requires-Dist: numpy>=1.20; extra == "all"
32
+ Requires-Dist: pandas>=1.3; extra == "all"
33
+ Requires-Dist: pyspark>=3.0.0; extra == "all"
34
+ Requires-Dist: optuna>=3.0.0; extra == "all"
35
+ Dynamic: license-file
36
+
37
+ # Universal Sequence Predictor
38
+
39
+ Online, instance-based sequence predictor. Given any stream of discrete observations, it learns to predict what comes next — for any symbol type, in any domain — without assuming a fixed distribution, a known alphabet, or a stationary process.
40
+
41
+ The `uchi` package extends this core engine to tabular classification, regression, multivariate time series forecasting, anomaly detection, and generative modeling. All classes are sklearn-compatible.
42
+
43
+ ---
44
+
45
+ ## Installation
46
+
47
+ ```bash
48
+ pip install -e . # editable install (no required deps)
49
+ pip install -e ".[all]"    # with scikit-learn, numpy, pandas, pyspark, optuna
50
+ ```
51
+
52
+ ```python
53
+ from uchi import (
54
+ UniversalPredictor, PredictorForest, # core engine
55
+ TabularPredictor, TabularRegressor, # tabular ML
56
+ MultivariateTSPredictor, TimeSeriesClassifier, # time series
57
+ AnomalyDetector,
58
+ SequenceGenerator, TabularGenerator, # generative
59
+ TimeSeriesGenerator,
60
+ )
61
+ ```
62
+
63
+ ---
64
+
65
+ ## Components
66
+
67
+ ### Core Engine
68
+
69
+ **`UniversalPredictor`**
70
+
71
+ The base algorithm. Maintains a prefix trie of observed contexts. Each node stores a credibility score that rises on correct predictions and falls — faster when the node was highly confident — on wrong ones. At prediction time it blends distributions from shallow (general) to deep (specific) contexts using CTW-style recursive mixing, where each depth's influence is proportional to its credibility track record. No forgetting parameter, no drift detector: concept drift is handled automatically because stale nodes lose credibility and the blend shifts back to shallower, more stable contexts.
72
+
73
+ API: `observe(x)` → `predict()` → `feedback(x)`. Set `min_confidence` to abstain rather than guess below a threshold.
74
+
75
+ **`PredictorForest`**
76
+
77
+ Ensemble of `UniversalPredictor` instances with four diversity mechanisms: heterogeneous context lengths (k, k+1, k+2, …), feedback dropout (each tree independently skips learning steps), staggered training offsets, and per-tree credibility weights. Adaptive voting: when trees agree confidently it uses a product (decisive), when uncertain it uses a mixture (calibrated).
78
+
79
+ ---
80
+
81
+ ### Preprocessing
82
+
83
+ **`FeatureDiscretizer`**
84
+
85
+ Converts any feature matrix to token sequences. Continuous features → equal-frequency quantile bins (tokens are bin indices). Categorical features → ordinal integers. Missing values and `NaN` → a special `__MISSING__` token. The result is a list of `(feature_index, bin)` tuples per row, which the trie can match exactly.
86
+
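The binning scheme described above can be sketched in plain Python. This is a minimal illustration rather than the package's `FeatureDiscretizer`: the helper names `quantile_edges` and `encode_row` are hypothetical, and cutting each edge at the value found at rank `i * n / n_bins` is only one simple way to obtain roughly equal-frequency bins.

```python
import math
from bisect import bisect_right

MISSING = "__MISSING__"  # token name taken from the description above


def quantile_edges(values, n_bins):
    """Interior cut points splitting the non-missing values into
    roughly equal-frequency bins (assumed edge-selection rule)."""
    clean = sorted(v for v in values if v is not None and not math.isnan(v))
    if not clean:
        return []
    edges = []
    for i in range(1, n_bins):
        idx = min(len(clean) - 1, (i * len(clean)) // n_bins)
        edges.append(clean[idx])
    return sorted(set(edges))  # tied values can produce duplicate edges


def encode_row(row, edges_per_feature):
    """Turn one row into a list of (feature_index, bin) tokens."""
    tokens = []
    for j, value in enumerate(row):
        if value is None or math.isnan(value):
            tokens.append((j, MISSING))
        else:
            tokens.append((j, bisect_right(edges_per_feature[j], value)))
    return tokens


X = [[1.0, 10.0], [2.0, float("nan")], [3.0, 30.0], [4.0, 40.0]]
edges = [quantile_edges([row[j] for row in X], n_bins=2) for j in range(2)]
tokens = [encode_row(row, edges) for row in X]
```

Because each token is a hashable tuple, rows that fall into the same bins produce identical token sequences, which is what allows exact matching in a trie.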
87
+ **`LabelEncoder`**
88
+
89
+ Bidirectional label ↔ integer mapping with `partial_fit` for new classes arriving at runtime. Used internally by all supervised classes.
90
+
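A label mapping that grows as new classes arrive might look like the sketch below. The class name `IncrementalLabelEncoder` and its method bodies are assumptions made for illustration; only the `partial_fit` idea comes from the description above.

```python
class IncrementalLabelEncoder:
    """Hypothetical stand-in: label <-> integer mapping that can grow."""

    def __init__(self):
        self._to_int = {}    # label -> integer code
        self._to_label = []  # integer code -> label

    def partial_fit(self, labels):
        # Unseen labels get the next free integer; known labels keep theirs.
        for label in labels:
            if label not in self._to_int:
                self._to_int[label] = len(self._to_label)
                self._to_label.append(label)
        return self

    def transform(self, labels):
        return [self._to_int[label] for label in labels]

    def inverse_transform(self, codes):
        return [self._to_label[code] for code in codes]


enc = IncrementalLabelEncoder().partial_fit(["cat", "dog"])
enc.partial_fit(["bird", "cat"])  # "bird" is new, "cat" is unchanged
codes = enc.transform(["bird", "cat", "dog"])
```

Codes assigned earlier never change, so anything already stored under an old code stays valid when a new class shows up.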
91
+ ---
92
+
93
+ ### Tabular ML
94
+
95
+ **`TabularPredictor`** — classification
96
+
97
+ Encodes each row as a sequence of feature tokens, with the class label as the final token. The trie learns `P(label | feature_sequence)`. Three feature orderings are ensembled (MI-ascending, MI-descending, natural) to reduce ordering sensitivity. Prediction averages label distributions across all orderings.
98
+
99
+ sklearn-compatible: works in `Pipeline`, `GridSearchCV`, `cross_val_score`. Supports `partial_fit` for streaming or incremental learning.
100
+
101
+ ```python
102
+ clf = TabularPredictor(n_bins=10, n_orderings=3)
103
+ clf.fit(X_train, y_train)
104
+ clf.predict(X_test) # class labels
105
+ clf.predict_proba(X_test) # list of {label: prob} dicts
106
+ clf.partial_fit(X_new, y_new) # online update
107
+ ```
108
+
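The ordering ensemble can be illustrated as follows. This sketch assumes mutual-information scores per feature are already available (the `mi_scores` list is made up) and that per-ordering label distributions are plain dicts; the function names are hypothetical and do not mirror the package internals.

```python
def feature_orderings(mi_scores):
    """Three feature orderings: MI-ascending, MI-descending, natural."""
    natural = list(range(len(mi_scores)))
    ascending = sorted(natural, key=lambda j: mi_scores[j])
    return {
        "mi_ascending": ascending,
        "mi_descending": ascending[::-1],
        "natural": natural,
    }


def row_to_sequence(tokens, order, label):
    """Feature tokens in the chosen order, class label as final token."""
    return [tokens[j] for j in order] + [label]


def average_distributions(dists):
    """Unweighted mean of label distributions from each ordering."""
    labels = set().union(*dists)
    return {lab: sum(d.get(lab, 0.0) for d in dists) / len(dists) for lab in labels}


orders = feature_orderings([0.30, 0.05, 0.60])  # made-up MI scores
seq = row_to_sequence([(0, 2), (1, 0), (2, 7)], orders["mi_descending"], "spam")
avg = average_distributions(
    [{"spam": 0.8, "ham": 0.2}, {"spam": 0.6, "ham": 0.4}, {"spam": 0.7, "ham": 0.3}]
)
```

Averaging across orderings means no single choice of which feature comes first decides the prediction on its own.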
109
+ **`TabularRegressor`** — regression
110
+
111
+ Same architecture as `TabularPredictor` but the continuous target is discretized into quantile bins. Prediction returns the credibility-weighted mean of bin centers. `predict_interval()` also returns the standard deviation of the bin distribution as a calibrated uncertainty estimate.
112
+
113
+ ```python
114
+ reg = TabularRegressor(n_bins=10, n_target_bins=20)
115
+ reg.fit(X_train, y_train)
116
+ reg.predict(X_test) # float means
117
+ reg.predict_interval(X_test) # list of (mean, std) tuples
118
+ reg.score(X_test, y_test) # R²
119
+ ```
120
+
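The mean and standard deviation of a bin distribution can be computed as below. This is a sketch under the assumption that the regressor exposes a probability per target bin; `bin_mean_std` is a hypothetical helper and the bin centers are invented.

```python
import math


def bin_mean_std(bin_centers, probs):
    """Mean and standard deviation of a discrete distribution over bins."""
    total = sum(probs)
    weights = [p / total for p in probs]  # renormalize defensively
    mean = sum(w * c for w, c in zip(weights, bin_centers))
    var = sum(w * (c - mean) ** 2 for w, c in zip(weights, bin_centers))
    return mean, math.sqrt(var)


mean, std = bin_mean_std([10.0, 20.0, 30.0], [0.25, 0.5, 0.25])
```

Since the output is a weighted average of bin centers, it can never fall outside the range of centers seen at fit time, which is one way the `n_target_bins` precision limit shows up.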
121
+ ---
122
+
123
+ ### Time Series
124
+
125
+ **`MultivariateTSPredictor`**
126
+
127
+ Online step-ahead predictor for multivariate (or univariate) time series. Each timestep is encoded as a compound token `(bin_0, bin_1, ..., bin_{M-1})` — a hashable tuple the trie matches exactly. Context is the last k compound tokens. Adapts immediately to distribution shift without retraining.
128
+
129
+ ```python
130
+ pred = MultivariateTSPredictor(n_bins=8, context_length=5)
131
+ pred.fit(X_train) # warm up trie on historical data
132
+ pred.predict() # float vector (per-dimension means)
133
+ pred.observe(x_new) # advance internal state
134
+ pred.feedback(x_new) # update trie with true value
135
+ pred.forecast(n_steps=10) # autoregressive multi-step forecast
136
+ pred.score(X_test) # bits/step (lower = better fit)
137
+ ```
138
+
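A compound token and its rolling context could be built roughly like this. The bin edges are invented, `compound_token` is a hypothetical helper, and a `deque` is just one convenient way to hold the last k tokens.

```python
from bisect import bisect_right
from collections import deque


def compound_token(x, edges_per_dim):
    """Map one timestep (a vector of floats) to a tuple of bin indices."""
    return tuple(bisect_right(edges, v) for v, edges in zip(x, edges_per_dim))


edges_per_dim = [[0.0, 1.0], [10.0]]  # made-up edges: 3 bins and 2 bins
context = deque(maxlen=3)             # keeps only the last k = 3 tokens

for x in [(-0.5, 5.0), (0.5, 12.0), (1.5, 9.0), (0.2, 20.0)]:
    context.append(compound_token(x, edges_per_dim))

key = tuple(context)  # hashable lookup key for the current context
```

Two timesteps match only if every dimension lands in the same bin, so the number of distinct tokens grows quickly with `n_bins` and the number of dimensions.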
139
+ **`TimeSeriesClassifier`**
140
+
141
+ Classifies fixed-length time series windows. Each window of T steps becomes T compound tokens; the class label is predicted as the next token after the full window. Supports `partial_fit` for streaming classification. Works in sklearn Pipeline.
142
+
143
+ ```python
144
+ clf = TimeSeriesClassifier(n_bins=8, window_size=50)
145
+ clf.fit(X_windows, y_labels)
146
+ clf.predict(X_test) # class labels
147
+ clf.predict_proba(X_test) # list of {label: prob} dicts
148
+ ```
149
+
150
+ **`AnomalyDetector`**
151
+
152
+ Trains a `MultivariateTSPredictor` on normal data. At inference, each timestep receives anomaly score = `-log2 P(actual | context)`. High score = low predictability = anomalous. The trie is not updated during scoring, so anomalous patterns do not contaminate the model of normal behavior.
153
+
154
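+ Follows the sklearn `OutlierMixin` interface: `predict()` returns 1 (anomaly) / -1 (normal), which is the reverse of the convention used by scikit-learn's own outlier detectors (-1 for outliers, 1 for inliers); `decision_function()` returns negative anomaly scores for threshold-based pipelines.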
155
+
156
+ ```python
157
+ det = AnomalyDetector(n_bins=8, context_length=5)
158
+ det.fit(X_normal)
159
+ det.score_samples(X_test) # float scores (higher = more anomalous)
160
+ det.predict(X_test) # 1 or -1 per timestep
161
+ ```
162
+
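The scoring rule `-log2 P(actual | context)` is easy to reproduce in isolation. In this sketch the probabilities are supplied by hand; the `floor` argument and the `flag` helper are assumptions added so that a zero probability does not raise and so that thresholding is visible.

```python
import math


def surprisal_bits(prob, floor=1e-12):
    """Anomaly score in bits: -log2 of the probability given to the actual value."""
    return -math.log2(max(prob, floor))


def flag(scores, threshold_bits):
    """True where the score exceeds a caller-chosen threshold."""
    return [score > threshold_bits for score in scores]


scores = [surprisal_bits(p) for p in (0.5, 0.125, 0.9)]
flags = flag(scores, threshold_bits=2.0)
```

A step the model considered a coin flip costs 1 bit, one it gave a 1-in-8 chance costs 3 bits, so the threshold has a direct interpretation in terms of predicted odds.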
163
+ ---
164
+
165
+ ### Generative
166
+
167
+ **`SequenceGenerator`**
168
+
169
+ Learns a distribution over sequences and samples from it. Supports temperature scaling (`p_i ← p_i^(1/T)`, followed by renormalization), top-k filtering, and nucleus (top-p) sampling. `generate_text()` joins tokens with a separator for character- or word-level text generation.
170
+
171
+ ```python
172
+ gen = SequenceGenerator(context_length=6, temperature=0.9)
173
+ gen.fit(list_of_sequences)
174
+ gen.generate(50, seed=['the '], stop_tokens=['.']) # list of tokens
175
+ gen.generate_text(100, sep='') # joined string
176
+ gen.score(sequence) # bits/token
177
+ ```
178
+
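The three sampling controls can be sketched on a plain dict distribution. The function names `reshape` and `sample` are hypothetical, and applying temperature first, then top-k, then top-p is an assumed order; the package may combine them differently.

```python
import random


def reshape(dist, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, then top-k, then nucleus filtering; renormalize."""
    scaled = {s: p ** (1.0 / temperature) for s, p in dist.items() if p > 0}
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(s, p / total) for s, p in ranked]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for s, p in ranked:
            kept.append((s, p))
            cumulative += p
            if cumulative >= top_p:  # smallest prefix reaching top_p mass
                break
        total = sum(p for _, p in kept)
        ranked = [(s, p / total) for s, p in kept]
    return dict(ranked)


def sample(dist, rng):
    symbols = list(dist)
    return rng.choices(symbols, weights=[dist[s] for s in symbols], k=1)[0]


dist = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
sharp = reshape(dist, temperature=0.5)  # T < 1 concentrates mass on "a"
nucleus = reshape(dist, top_p=0.75)     # keeps "a" and "b" only
token = sample(nucleus, random.Random(0))
```

Temperatures below 1 sharpen the distribution and temperatures above 1 flatten it; top-k and top-p then cut the low-probability tail before sampling.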
179
+ **`TabularGenerator`**
180
+
181
+ Learns the joint distribution `P(f0, f1, ..., fn, label)` and samples synthetic rows. Trains two predictors internally: one with label last (unconditional generation, `P(label | features)`) and one with label first (class-conditional generation, `P(features | label)`). This separation is necessary — a label-last model given a leading label token is out-of-distribution.
182
+
183
+ ```python
184
+ gen = TabularGenerator(n_bins=10, temperature=1.0)
185
+ gen.fit(X, y)
186
+ gen.sample(n_rows=100) # list of dicts
187
+ gen.sample(n_rows=50, given_label='cat') # class-conditional
188
+ gen.sample_dataframe(n_rows=100) # pandas DataFrame
189
+ ```
190
+
191
+ **`TimeSeriesGenerator`**
192
+
193
+ Learns a distribution over multivariate time series and samples from it. Unlike `MultivariateTSPredictor.forecast()` (argmax, deterministic), generation here samples from the distribution — producing diverse trajectories. `augment()` wraps generation for data augmentation.
194
+
195
+ ```python
196
+ gen = TimeSeriesGenerator(n_bins=8, temperature=1.1)
197
+ gen.fit(X_series)
198
+ gen.generate(n_steps=100, seed=X_seed) # list of float vectors
199
+ gen.augment(X, n_copies=5, temperature=1.1) # augmented dataset
200
+ ```
201
+
202
+ ---
203
+
204
+ ## Generative Services
205
+
206
+ The three generators (`SequenceGenerator`, `TabularGenerator`, `TimeSeriesGenerator`) share the same trie engine as the predictors. Generation is sampling from the learned conditional distribution rather than taking the argmax. All sampling controls (temperature, top-k, top-p, stop tokens) operate on that distribution at runtime.
207
+
208
+
209
+
210
+ ---
211
+
212
+ ## What this is for
213
+
214
+ **Its clearest domain: discrete event streams where the underlying pattern shifts over time.**
215
+
216
+ If you have a stream of categorical states and need to predict the next one, without knowing in advance how the pattern will change, this is the right tool. In the benchmarks below it beats the count-based methods (N-gram, PPM-D, CTW) on every concept-drift stream and the online LSTM on the short, noisy datasets, and it does so with no retraining, no drift detector, and no forgetting window to tune.
217
+
218
+ **Natural fits:**
219
+
220
+ - **System observability** — sequences of log event codes, API call chains, process state transitions. Predicts next failure type. When a deployment changes the pattern, adaptation is automatic.
221
+ - **User behavior** — clickstreams, navigation paths, in-app action sequences. Next-action prediction that updates on every new user event without a retraining cycle.
222
+ - **Industrial / IoT** — machine state sequences (idle / running / warning / fault), energy consumption states, production line events. Works on tiny datasets where neural methods don't have enough data.
223
+ - **Financial regimes** — discretized price movements, order flow states, market microstructure events. Handles regime shifts that break count-based models.
224
+ - **Anomaly detection** — when the predictor is consistently wrong, something structurally unusual is happening. Confidence collapses before a human notices; no separate anomaly model needed.
225
+ - **Game AI / opponent modeling** — predict next move in any discrete-action game. Adapts to opponent strategy shifts in real time.
226
+
227
+ **Where it is not competitive:**
228
+
229
+ - **Tabular classification where the data is large and stationary** — on tabular datasets >10K rows without concept drift, gradient boosting will typically win by 5–10pp. The trie shines when data is small, streaming, or drifting.
230
+ - **Long-range sequence dependencies** — the context window is fixed at k. Anything requiring memory beyond the last k observations needs a transformer or RNN.
231
+ - **Large stationary corpora** — on 50K tokens of text or DNA, count-based methods (CTW, KN) hold a 1–2pp accuracy advantage because their unbounded counts eventually outcompete the credibility cap. The gap closes on noisy or drifting data.
232
+ - **Continuous regression targets** — the regressor bins the output; precision is bounded by `n_target_bins`. Point-prediction accuracy on smooth regression tasks is below random forests.
233
+
234
+ ---
235
+
236
+ ## The Core Idea
237
+
238
+ The algorithm keeps a trie of contexts. Every node in the trie stores two things: a credibility-weighted distribution over successor symbols, and a track record of how reliable that context has been as a predictor. When predicting, it blends the distributions from shallow (general) to deep (specific), where each depth's influence is proportional to its track record. When updating after a wrong prediction, a node that was confidently wrong loses trust faster than one that was fresh and uncertain.
239
+
240
+ That is the entire algorithm. No drift detector, no forgetting parameter, no domain-specific tuning.
241
+
242
+ ---
243
+
244
+ ## Architecture
245
+
246
+ ### Module 1 — Universal Sequence Predictor (`predictor.py`)
247
+
248
+ **Data structure:** a prefix trie. Each `_TrieNode` stores:
249
+ - `succ_cred` — credibility weight per successor symbol
250
+ - `node_cred` — reliability of this context as a predictor overall
251
+ - `n_obs` — number of times this context has been seen
252
+
253
+ The root holds continuation counts (how many distinct predecessors each symbol appeared after, KN-style) for large vocabularies (|V|≥8), falling back to raw KT counts for small alphabets (DNA=4, Electricity=2) where continuation counts are too sparse. This seeds the blend with a better-calibrated unigram prior.
254
+
255
+ **Prediction O(k):**
256
+
257
+ Walk the trie at depths `min_k..k`. For each matching node, compute a KT-smoothed local distribution. Blend from shallow to deep using the CTW-style recursive formula:
258
+
259
+ ```
260
+ λ_d = node_cred_d^p / (node_cred_d^p + 1) # p=0.65; softened mixing weight
261
+ P_d = λ_d · P_local(d) + (1 − λ_d) · P_{d−1}
262
+ ```
263
+
264
+ High credibility → λ → 1 → deep context dominates.
265
+ Low credibility → λ → 0 → falls back to shallow.
266
+ Root provides the seed.
267
+
268
+ The exponent `p=0.65` (versus the standard CTW value of p=1) lets shallower contexts retain 22% blend weight even when deeper contexts are fully saturated. This acts as implicit depth regularization — preventing rare deep contexts from monopolizing predictions on stationary data, without affecting drift adaptation (credibility degrades naturally under drift regardless of p).
269
+
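The recursive blend can be written out directly from the two formulas above. This sketch assumes the local distributions are already KT-smoothed and passed in as dicts; `mixing_weight` and `blend` are hypothetical names. The credibility value 7.0 is also an assumption, chosen only because it reproduces roughly the 22% figure quoted above; the package's actual `C_MAX` is not stated here.

```python
def mixing_weight(node_cred, p=0.65):
    """lambda_d = cred^p / (cred^p + 1)."""
    c = node_cred ** p
    return c / (c + 1.0)


def blend(root_dist, levels, p=0.65):
    """Blend from shallow to deep.

    levels: list of (node_cred, local_dist) pairs, shallowest first.
    """
    current = dict(root_dist)  # the root provides the seed
    for node_cred, local in levels:
        lam = mixing_weight(node_cred, p)
        symbols = set(current) | set(local)
        current = {
            s: lam * local.get(s, 0.0) + (1.0 - lam) * current.get(s, 0.0)
            for s in symbols
        }
    return current


root = {"a": 0.5, "b": 0.5}
levels = [(1.0, {"a": 0.8, "b": 0.2}), (7.0, {"a": 0.1, "b": 0.9})]
mixed = blend(root, levels)

shallow_share = 1.0 - mixing_weight(7.0)                # about 0.22 with p = 0.65
shallow_share_p1 = 1.0 - mixing_weight(7.0, p=1.0)      # 0.125 with p = 1
```

With the same credibility, lowering `p` from 1 to 0.65 raises the weight left for shallower contexts from 12.5% to about 22%, which is the regularizing effect the paragraph above describes.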
270
+ **Update O(k):**
271
+
272
+ For each depth, find the context node and apply a multiplicative rule:
273
+
274
+ ```
275
+ effective_cap = C_MAX × (1 + 0.5 × log(1 + n_obs/100)) # adaptive (optional)
276
+ = C_MAX # fixed (default)
277
+
278
+ correct: node_cred ← min(cap, node_cred × (1 + lr))
279
+ succ_cred[actual] ← min(cap, succ_cred[actual] × (1 + lr))
280
+
281
+ wrong: lr_down = lr × (1 + node_cred / cap) # confidence-proportional
282
+ node_cred ← max(C_MIN, node_cred × (1 − lr_down))
283
+ succ_cred[wrong] ← max(C_MIN, succ_cred[wrong] × (1 − lr_down))
284
+ succ_cred[actual] ← min(cap, succ_cred[actual] × (1 + lr))
285
+ × binary_scale # for V≤2 only; prevents false-flip cascades
286
+ ```
287
+
288
+ With `adaptive_cap=True`, nodes with many observations are allowed to build higher credibility — the cap grows logarithmically with `n_obs`, so λ can approach 1 more closely on stationary data while the maximum `lr_down = 2×lr` is preserved.
289
+
290
+ The `lr_down` scaling is the key drift-adaptation mechanism: a node that was highly trusted when it turned wrong loses credibility up to 2× faster than a fresh node. This can cut the adaptation lag after a concept drift by up to half without requiring any drift detector.
291
+
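The fixed-cap version of the update rule can be sketched as below. The constants `C_MIN`, `C_MAX`, and `LR` are hypothetical values, the starting credibility of 1.0 for unseen successors is an assumption, and the `binary_scale` and adaptive-cap branches are left out for brevity.

```python
C_MIN, C_MAX, LR = 0.1, 7.0, 0.1  # hypothetical constants for illustration


class Node:
    def __init__(self):
        self.node_cred = 1.0
        self.succ_cred = {}  # successor symbol -> credibility weight


def update(node, predicted, actual, lr=LR, cap=C_MAX, c_min=C_MIN):
    """Multiplicative credibility update with a fixed cap."""
    prior_actual = node.succ_cred.get(actual, 1.0)
    if predicted == actual:
        node.node_cred = min(cap, node.node_cred * (1 + lr))
    else:
        # Confidence-proportional penalty: between lr and 2 * lr.
        lr_down = lr * (1 + node.node_cred / cap)
        prior_wrong = node.succ_cred.get(predicted, 1.0)
        node.succ_cred[predicted] = max(c_min, prior_wrong * (1 - lr_down))
        node.node_cred = max(c_min, node.node_cred * (1 - lr_down))
    # The observed symbol is reinforced in both cases.
    node.succ_cred[actual] = min(cap, prior_actual * (1 + lr))
    return node


trusted, fresh = Node(), Node()
trusted.node_cred = C_MAX           # saturated node: penalty is 2 * lr
update(trusted, predicted="a", actual="b")
update(fresh, predicted="a", actual="b")  # fresh node: penalty is near lr
```

After one wrong prediction the saturated node drops by 20% while the fresh node drops by about 11%, which is the asymmetry that drives faster recovery from confidently held stale patterns.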
292
+ **Concept drift:**
293
+
294
+ Wrong predictions degrade `node_cred`, reducing λ at that depth, causing the blend to automatically fall back to shallower (more general) contexts. As the new pattern accumulates correct observations, `node_cred` rebuilds at the updated depth. No explicit change detection; no forgetting window; adaptation speed is a function of how confidently the old pattern was held.
295
+
296
+ **Regret bound:**
297
+
298
+ The multiplicative credibility update is an instance of the Multiplicative Weights Update (MWU) algorithm applied to depth selection. For a class of k single-depth predictors, MWU achieves O(√(T ln k)) regret. The CTW-style blend runs this across all depths simultaneously.
299
+
300
+ ---
301
+
302
+ ### Module 1 — Forest Ensemble (`forest.py`)
303
+
304
+ `PredictorForest` is a collection of `UniversalPredictor` instances that start identical and diverge through experience. Diversity comes from four sources:
305
+
306
+ | Mechanism | How it works |
307
+ |---|---|
308
+ | Heterogeneous k | Each tree uses a different context length: k, k+1, k+2, … capturing different temporal scales. Disabled for DNA (4-symbol near-uniform alphabet) where deeper-k trees add noise rather than signal. |
309
+ | Feedback dropout | Each tree independently skips learning on each step with probability `dropout` — the sequence analogue of bagging |
310
+ | Staggered offsets | Tree i doesn't start learning until step `i × stagger`; early topology has outsized influence on later structure |
311
+ | Inter-tree credibility | Each tree maintains a persistent weight updated by whether it was right; correct trees speak louder on the next prediction |
312
+
313
+ **Voting:** adaptive hybrid by default. Each tree contributes two representations:
314
+
315
+ - **Full blended distribution** (`tree._distribution()`) — the complete CTW-style probability over all vocabulary symbols. Used in the *mixture* component: proper calibration when trees at different context lengths express partial disagreement.
316
+ - **Mode-focused distribution** (`_tree_dist`) — only the most-probable successor at each depth, weighted by node credibility. Used in the *product* component: maximally decisive agreement signal for high-persistence or low-entropy data where unanimous tree confidence should dominate.
317
+
318
+ The adaptive blend computes `α × product(mode-focused) + (1−α) × mixture(full)` where `α` is the mean per-tree confidence — high confidence drives product-mode behaviour, uncertainty drives mixture-mode behaviour.
319
+
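The adaptive blend can be illustrated with a small function. This is a simplified sketch: trees are weighted equally (the per-tree credibility weights are ignored), a small `eps` is added inside the product so that one missing symbol does not zero it out, and `adaptive_vote` is a hypothetical name.

```python
import math


def normalize(d):
    total = sum(d.values())
    return {s: v / total for s, v in d.items()}


def adaptive_vote(full_dists, mode_dists, confidences, eps=1e-6):
    """alpha * product(mode-focused) + (1 - alpha) * mixture(full)."""
    symbols = set().union(*full_dists, *mode_dists)
    n = len(full_dists)
    mixture = {s: sum(d.get(s, 0.0) for d in full_dists) / n for s in symbols}
    product = normalize(
        {s: math.prod(d.get(s, 0.0) + eps for d in mode_dists) for s in symbols}
    )
    alpha = sum(confidences) / len(confidences)  # mean per-tree confidence
    return normalize(
        {s: alpha * product[s] + (1.0 - alpha) * mixture[s] for s in symbols}
    )


full = [{"a": 0.6, "b": 0.4}, {"a": 0.7, "b": 0.3}]
mode = [{"a": 1.0}, {"a": 1.0}]
confident = adaptive_vote(full, mode, confidences=[0.9, 0.9])
unsure = adaptive_vote(full, mode, confidences=[0.1, 0.1])
```

When the trees are confident, the product term dominates and the vote becomes nearly unanimous for "a"; when they are unsure, the result stays close to the averaged full distributions.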
320
+ ---
321
+
322
+ ### Module 2 — Goal-Directed Generation (`module2.py`)
323
+
324
+ Module 1 is **intuition** — fast, associative, pattern-matching.
325
+ Module 2 is **deliberation** — goal-directed, using Module 1 as a world model.
326
+
327
+ Module 1 is already generative: call `predict()` autoregressively and it produces continuations. Module 2 adds **steering**: constraining or guiding that generation toward a target.
328
+
329
+ **Training format:** represent Q&A or any prompt→response task as a flat sequence:
330
+ ```
331
+ [prompt tokens ...] [SEPARATOR] [response tokens ...] [END]
332
+ ```
333
+ Module 1 learns that SEPARATOR is followed by responses, not more prompts. No architectural changes needed.
334
+
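Flattening a prompt and response into that format takes only a few lines. The sentinel strings `<SEP>` and `<END>` and the helper names are hypothetical; any tokens that cannot occur in normal text would serve.

```python
SEP, END = "<SEP>", "<END>"  # hypothetical sentinel tokens


def to_training_sequence(prompt_tokens, response_tokens):
    """[prompt ...] [SEPARATOR] [response ...] [END] as one flat list."""
    return list(prompt_tokens) + [SEP] + list(response_tokens) + [END]


def generation_seed(prompt_tokens):
    """Context to feed the predictor before generating a response."""
    return list(prompt_tokens) + [SEP]


def extract_response(generated_tokens):
    """Keep generated tokens up to, but not including, END."""
    response = []
    for token in generated_tokens:
        if token == END:
            break
        response.append(token)
    return response


train = to_training_sequence(["what", "is", "2+2"], ["4"])
seed = generation_seed(["what", "is", "2+2"])
answer = extract_response(["4", END, "ignored"])
```

The seed ends exactly where the training sequences switch from prompt to response, so the deepest matching contexts at generation time are the ones that were followed by response tokens.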
335
+ **Three generation strategies** (all implemented in `module2.py`):
336
+
337
+ | Strategy | Mechanism | Best for |
338
+ |---|---|---|
339
+ | **Autoregressive** | Feed `[prompt + SEPARATOR]` as context seed; generate token by token until END | Direct completion, short responses |
340
+ | **Beam search** | Maintain N candidate sequences; at each step expand by all vocabulary tokens; prune to top N by cumulative log-probability | Longer responses, controllable diversity |
341
+ | **Retrieval** | Two-stage: (1) Bhattacharyya similarity on post-SEP trie distributions — exact for seen prompts; (2) surface Jaccard fallback when Bhattacharyya < 0.5 — domain-correct for novel tokens | Factual lookup; graceful degradation to novel inputs |
342
+
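The two similarity measures named in the retrieval row are standard and can be sketched as follows. How `module2.py` obtains the post-SEP distributions is not shown here, so the distributions are passed in as dicts; the `similarity` wrapper and its `threshold` argument are an assumed reading of the 0.5 fallback rule.

```python
import math


def bhattacharyya(p, q):
    """Bhattacharyya coefficient: sum of sqrt(p_i * q_i) over shared symbols."""
    return sum(math.sqrt(p[s] * q[s]) for s in p.keys() & q.keys())


def jaccard(tokens_a, tokens_b):
    """Surface overlap: |A and B| / |A or B| on token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def similarity(p, q, tokens_a, tokens_b, threshold=0.5):
    """Use the distributional score unless it falls below the threshold."""
    bc = bhattacharyya(p, q)
    return bc if bc >= threshold else jaccard(tokens_a, tokens_b)


same = bhattacharyya({"x": 0.5, "y": 0.5}, {"x": 0.5, "y": 0.5})
disjoint = bhattacharyya({"x": 1.0}, {"y": 1.0})
fallback = similarity({"x": 1.0}, {"y": 1.0}, ["a", "b", "c"], ["b", "c", "d"])
```

The coefficient is 1 for identical distributions and 0 for distributions with no shared support, so a prompt made of unseen tokens naturally drops to the surface-overlap fallback.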
343
+ ---
344
+
345
+ ## Benchmark Results
346
+
347
+ Evaluated on 7 standard datasets (including two large text corpora and a full DNA genome) and 4 concept-drift streams. All methods use the same train/test split (80/20). Baselines: Persistence, Majority, N-gram(5), PPM-D(5), CTW(5).
348
+
349
+ **Standard benchmarks (test accuracy %):**
350
+
351
+ | Dataset | n | k | Persistence | PPM-D(5) | CTW(5) | **Predictor** | **Forest** |
352
+ |---|---|---|---|---|---|---|---|
353
+ | Airline passengers | 144 | 4 | 37.9 | 27.6 | 31.0 | **41.4** | **41.4** |
354
+ | Alice in Wonderland (15K) | 15,000 | 6 | 2.8 | 51.6 | **53.3** | 51.7 | 51.9 |
355
+ | Moby Dick (50K) | 50,000 | 6 | 2.1 | 45.7 | **47.4** | 46.2 | 46.1 |
356
+ | DNA — bacteriophage lambda (full) | 48,502 | 5 | 26.1 | 29.7 | **30.7** | 29.1 | 28.0 |
357
+ | Weather | 547 | 3 | **57.3** | 47.3 | 50.0 | **52.7** | **51.8** |
358
+ | PRNG (noise floor) | 500 | 3 | 10.0 | **18.0** | 16.0 | 15.0 | 13.0 |
359
+ | Electricity (45K) | 45,312 | 4 | **84.8** | **84.8** | **84.8** | **84.7** | **84.6** |
360
+
361
+ **Concept-drift streams (test accuracy %, k=1):**
362
+
363
+ | Drift type | N-gram | PPM-D | CTW | **Predictor** | **Forest** |
364
+ |---|---|---|---|---|---|
365
+ | Sudden reversal | 2.5 | 2.5 | 4.5 | **97.0** | **97.0** |
366
+ | Gradual ramp | 5.0 | 5.0 | 6.2 | **98.3** | **98.3** |
367
+ | Recurring A→B→A | 3.8 | 3.3 | 4.2 | **97.5** | **97.5** |
368
+ | Fast (150-step cycles) | 40.0 | 39.6 | 40.4 | **94.6** | 93.3 |
369
+
370
+ The concept-drift numbers are the clearest statement of what this architecture is for. Count-based methods (N-gram, PPM-D, CTW) never recover from a reversal because counts only accumulate. The Predictor recovers automatically.
371
+
372
+ **Extended baseline comparison — KN, PPM\*, Online LSTM (test accuracy %):**
373
+
374
+ | Dataset | KN(5) | PPM\*(20) | LSTM(64) | Predictor | Forest |
375
+ |---|---|---|---|---|---|
376
+ | Airline passengers | 27.6 | 27.6 | 24.1 | **41.4** | **41.4** |
377
+ | Alice in Wonderland (15K) | **52.8** | 51.8 | 39.9 | 51.7 | 51.9 |
378
+ | Moby Dick (50K) | **47.2** | 45.3 | 38.6 | 46.2 | 46.1 |
379
+ | DNA — bacteriophage lambda | 30.1 | 26.6 | **32.5** | 29.1 | 28.0 |
380
+ | Weather | 50.9 | 48.2 | 43.6 | **52.7** | **51.8** |
381
+ | PRNG (noise floor) | 15.0 | **18.0** | 10.0 | 15.0 | 13.0 |
382
+ | Electricity (45K) | **84.8** | 81.9 | **84.8** | **84.7** | **84.6** |
383
+
384
+ KN(5) = Interpolated Kneser-Ney N-gram. PPM\*(20) = PPM with max order 20. LSTM(64) = single-layer LSTM, hidden size 64, trained online with BPTT-1 and Adam.
385
+
386
+ **Key findings:**
387
+
388
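+ - **Predictor leads the learned baselines on Weather and every method on Airline**: short, noisy, non-stationary datasets where count-based methods overfit to stale patterns. On Weather it scores 52.7% against at most 50.9% for the other learned models, though Persistence is higher at 57.3%. On Airline (n=144) it scores 41.4%; the next best is Persistence at 37.9%.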
389
+ - **KN(5) is the strongest text predictor** on large stationary corpora (52.8% Alice, 47.2% Moby). The credibility cap prevents our predictor from fully converging — a structural trade-off for drift recovery.
390
+ - **LSTM wins on DNA** (32.5%) — neural sequence modeling captures long-range non-Markovian dependencies that any fixed-order predictor misses.
391
+ - **Electricity: all methods tie** (84.6–84.8%) — a high-persistence binary stream where persistence itself is the ceiling.
392
+
393
+ ---
394
+
395
+ ### Confidence-gated prediction (abstain mode)
396
+
397
+ `UniversalPredictor` accepts a `min_confidence` parameter (default `0.0`). When set, the predictor abstains (returns `(None, conf)`) whenever the probability of its best prediction is below `min_confidence × (1/|vocab|)`, that is, below `min_confidence` times the uniform baseline. A value of `1.5` means "only predict when at least 1.5× more confident than random."
398
+
399
+ Abstaining does not penalize the node: `node_cred` is unchanged. The successor distribution still updates so learning continues. This makes the warmup period implicit — early steps where the predictor is near-uniform simply produce no output rather than noisy guesses.
400
+
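The gating rule can be sketched on a plain distribution. This assumes the threshold is applied to the top probability as `min_confidence × (1/|vocab|)`, which matches the "1.5× more confident than random" example, and that the vocabulary is the set of keys in the distribution; `gated_prediction` is a hypothetical name.

```python
def gated_prediction(dist, min_confidence=0.0):
    """Return (symbol, conf), or (None, conf) when abstaining."""
    symbol, conf = max(dist.items(), key=lambda kv: kv[1])
    uniform = 1.0 / len(dist)  # baseline probability of a random guess
    if conf < min_confidence * uniform:
        return None, conf
    return symbol, conf


dist = {"a": 0.4, "b": 0.3, "c": 0.3}
always = gated_prediction(dist)                        # gate off
strict = gated_prediction(dist, min_confidence=1.5)    # needs conf >= 0.5
lenient = gated_prediction(dist, min_confidence=1.0)   # needs conf >= 1/3
```

The confidence is returned even when the symbol is withheld, so a caller can log how close an abstention was to the threshold.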
401
+ **Precision–coverage tradeoffs on natural language (Alice, k=4):**
402
+
403
+ | min_confidence | Accuracy (predicted only) | Coverage | Lift |
404
+ |---|---|---|---|
405
+ | 0.0 (off) | 48.5% | 100% | — |
406
+ | 3.0 | 50.3% | 96.5% | +1.8pp |
407
+ | 4.0 | 56.7% | 83.7% | +8.2pp |
408
+ | 5.0 | 59.4% | 77.6% | +10.9pp |
409
+ | 6.0 | 61.4% | 71.8% | +12.9pp |
410
+
411
+ Alice at `min_confidence=5.0` reaches 59.4% accuracy (vs CTW's 53.3% on 100% coverage) by only speaking when confident. For anomaly detection or alerting use cases where coverage matters less than per-prediction reliability, this is the correct mode.
412
+
413
+ ---
414
+
415
+ ### The two-regime finding
416
+
417
+ Expanding from small samples to full datasets exposed a fundamental architectural property:
418
+
419
+ **Data-limited regime (n ≲ 800):** Credibility builds up quickly, blend weights become decisive, and the Predictor is competitive or best. At 1,500 DNA bases the Predictor scored 33.0%, the best of all methods.
420
+
421
+ **Architecture-limited regime (n ≫ 800):** Every node hits `CRED_MAX` and the blend weight freezes at λ = cap^p/(cap^p+1). Count-based methods (PPM-D, CTW) have no cap — their counts keep growing, giving predictions increasingly close to 1.0. At 48K DNA bases CTW reaches 30.7% while the Predictor reaches 29.1%.
422
+
423
+ **The exception is noisy and drifting data.** Weather improved from 41% to 52.7%. The Predictor leads every learned baseline on Weather (only Persistence scores higher) because noisy, high-variance datasets are exactly where count-based methods overfit to stale patterns.
424
+
425
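+ **The CRED_MAX cap is a design choice, not a bug.** A node with unbounded credibility would need O(n) wrong predictions to lose the trust it built over n correct ones. With the cap, the number of wrong predictions needed to discredit a stale node depends only on `CRED_MAX` and `lr`, not on how long the old pattern held. The trade-off is explicit: fast drift recovery at the cost of long-term convergence on stationary data.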
426
+
427
+ ---
428
+
429
+ ## Files
430
+
431
+ **Package (`uchi/`):**
432
+
433
+ | File | Purpose |
434
+ |---|---|
435
+ | `predictor.py` | `UniversalPredictor` — core trie engine |
436
+ | `forest.py` | `PredictorForest` — ensemble with heterogeneous k and feedback dropout |
437
+ | `discretize.py` | `FeatureDiscretizer`, `LabelEncoder` — preprocessing |
438
+ | `tabular.py` | `TabularPredictor`, `TabularRegressor` — tabular ML |
439
+ | `timeseries.py` | `MultivariateTSPredictor`, `TimeSeriesClassifier`, `AnomalyDetector` |
440
+ | `generative.py` | `SequenceGenerator`, `TabularGenerator`, `TimeSeriesGenerator` |
441
+
442
+ **Root (benchmark scripts and shims):**
443
+
444
+ | File | Purpose |
445
+ |---|---|
446
+ | `baselines.py` | Standard baselines: Persistence, Majority, N-gram, PPM-D |
447
+ | `baselines_extended.py` | Extended baselines: KN, PPM\*, Online LSTM |
448
+ | `datasets.py` | Dataset loaders (airline, text, DNA, weather, PRNG, electricity) |
449
+ | `ieee_benchmark.py` | Full benchmark suite generating LaTeX tables |
450
+ | `run_experiments.py` | Quick single-predictor experiment runner |
451
+ | `run_forest.py` | Quick forest experiment runner |
452
+ | `module2.py` | `GoalDirectedGenerator` — autoregressive, beam search, retrieval |
453
+ | `tasks/` | Core-principle manifesto and todo |
454
+
455
+ ---
456
+
457
+ ## How This Differs from Standard Approaches
458
+
459
+ | Property | N-gram / PPM-D | CTW | **This architecture** |
460
+ |---|---|---|---|
461
+ | Drift adaptation | None — counts only grow | None — counts only grow | Automatic via credibility degradation |
462
+ | Depth selection | Fixed or backoff heuristic | Bayesian mixture (stationary) | MWU — theoretically optimal for adversarial depth selection |
463
+ | Concept drift recovery | Requires reset or windowing | Requires reset or windowing | Self-correcting; speed proportional to prior confidence |
464
+ | Node count | O(V^k) worst case | O(V^k) worst case | O(sequence length) — only observed contexts |
465
+ | Online adaptation | Counts update, predictions sharpen | Weights update | Credibility update; fresh vs. stale nodes naturally separated |
466
+ | Small dataset behavior | Overtrusts rare k-grams | Overtrusts rare k-grams | Credibility builds slowly on sparse observations |
467
+
468
+ The single deepest difference from count-based methods: **credibility is earned and can be lost.** A context that was reliable on Monday and wrong on Tuesday sees its influence reduced on Wednesday. Counts only accumulate.