PyPI - uchi-python - Versions diffs - 0.1.0__tar.gz - Mend

uchi-python 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28) hide show

uchi_python-0.1.0/LICENSE +21 -0
uchi_python-0.1.0/PKG-INFO +468 -0
uchi_python-0.1.0/README.md +432 -0
uchi_python-0.1.0/pyproject.toml +45 -0
uchi_python-0.1.0/setup.cfg +4 -0
uchi_python-0.1.0/tests/test_forest.py +348 -0
uchi_python-0.1.0/tests/test_generative_fixes.py +706 -0
uchi_python-0.1.0/tests/test_replay.py +40 -0
uchi_python-0.1.0/tests/tests.py +274 -0
uchi_python-0.1.0/uchi/__init__.py +57 -0
uchi_python-0.1.0/uchi/discretize.py +307 -0
uchi_python-0.1.0/uchi/distributional.py +105 -0
uchi_python-0.1.0/uchi/dual_predictor.py +172 -0
uchi_python-0.1.0/uchi/forest.py +410 -0
uchi_python-0.1.0/uchi/generative.py +910 -0
uchi_python-0.1.0/uchi/hoeffding.py +225 -0
uchi_python-0.1.0/uchi/long_term_store.py +345 -0
uchi_python-0.1.0/uchi/node_compressor.py +492 -0
uchi_python-0.1.0/uchi/online_tokenizer.py +349 -0
uchi_python-0.1.0/uchi/predictor.py +578 -0
uchi_python-0.1.0/uchi/semantic_tokenizer.py +48 -0
uchi_python-0.1.0/uchi/tabular.py +401 -0
uchi_python-0.1.0/uchi/timeseries.py +445 -0
uchi_python-0.1.0/uchi_python.egg-info/PKG-INFO +468 -0
uchi_python-0.1.0/uchi_python.egg-info/SOURCES.txt +26 -0
uchi_python-0.1.0/uchi_python.egg-info/dependency_links.txt +1 -0
uchi_python-0.1.0/uchi_python.egg-info/requires.txt +22 -0
uchi_python-0.1.0/uchi_python.egg-info/top_level.txt +1 -0

uchi_python-0.1.0/LICENSE ADDED Viewed

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Joseph Woodall
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

uchi_python-0.1.0/PKG-INFO ADDED Viewed

@@ -0,0 +1,468 @@
+Metadata-Version: 2.4
+Name: uchi_python
+Version: 0.1.0
+Summary: Online credibility-weighted sequence predictor for tabular, time series, and generative ML
+License: MIT
+Keywords: machine learning,online learning,sequence prediction,time series,concept drift,trie,CTW
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Provides-Extra: sklearn
+Requires-Dist: scikit-learn>=1.0; extra == "sklearn"
+Provides-Extra: numpy
+Requires-Dist: numpy>=1.20; extra == "numpy"
+Provides-Extra: pandas
+Requires-Dist: pandas>=1.3; extra == "pandas"
+Provides-Extra: pyspark
+Requires-Dist: pyspark>=3.0.0; extra == "pyspark"
+Provides-Extra: optuna
+Requires-Dist: optuna>=3.0.0; extra == "optuna"
+Provides-Extra: all
+Requires-Dist: scikit-learn>=1.0; extra == "all"
+Requires-Dist: numpy>=1.20; extra == "all"
+Requires-Dist: pandas>=1.3; extra == "all"
+Requires-Dist: pyspark>=3.0.0; extra == "all"
+Requires-Dist: optuna>=3.0.0; extra == "all"
+Dynamic: license-file
+# Universal Sequence Predictor
+Online, instance-based sequence predictor. Given any stream of discrete observations, it learns to predict what comes next — for any symbol type, in any domain — without assuming a fixed distribution, a known alphabet, or a stationary process.
+The `uchi` package extends this core engine to tabular classification, regression, multivariate time series forecasting, anomaly detection, and generative modeling. All classes are sklearn-compatible.
+---
+## Installation
+```bash
+pip install -e .                  # editable install (no required deps)
+pip install -e ".[all]"           # with scikit-learn, numpy, pandas
+```
+```python
+from uchi import (
+    UniversalPredictor, PredictorForest,           # core engine
+    TabularPredictor, TabularRegressor,             # tabular ML
+    MultivariateTSPredictor, TimeSeriesClassifier,  # time series
+    AnomalyDetector,
+    SequenceGenerator, TabularGenerator,            # generative
+    TimeSeriesGenerator,
+)
+```
+---
+## Components
+### Core Engine
+**`UniversalPredictor`**
+The base algorithm. Maintains a prefix trie of observed contexts. Each node stores a credibility score that rises on correct predictions and falls — faster when the node was highly confident — on wrong ones. At prediction time it blends distributions from shallow (general) to deep (specific) contexts using CTW-style recursive mixing, where each depth's influence is proportional to its credibility track record. No forgetting parameter, no drift detector: concept drift is handled automatically because stale nodes lose credibility and the blend shifts back to shallower, more stable contexts.
+API: `observe(x)` → `predict()` → `feedback(x)`. Set `min_confidence` to abstain rather than guess below a threshold.
+**`PredictorForest`**
+Ensemble of `UniversalPredictor` instances with four diversity mechanisms: heterogeneous context lengths (k, k+1, k+2, …), feedback dropout (each tree independently skips learning steps), staggered training offsets, and per-tree credibility weights. Adaptive voting: when trees agree confidently it uses a product (decisive), when uncertain it uses a mixture (calibrated).
+---
+### Preprocessing
+**`FeatureDiscretizer`**
+Converts any feature matrix to token sequences. Continuous features → equal-frequency quantile bins (tokens are bin indices). Categorical features → ordinal integers. Missing values and `NaN` → a special `__MISSING__` token. The result is a list of `(feature_index, bin)` tuples per row, which the trie can match exactly.
+**`LabelEncoder`**
+Bidirectional label ↔ integer mapping with `partial_fit` for new classes arriving at runtime. Used internally by all supervised classes.
+---
+### Tabular ML
+**`TabularPredictor`** — classification
+Encodes each row as a sequence of feature tokens, with the class label as the final token. The trie learns `P(label | feature_sequence)`. Three feature orderings are ensembled (MI-ascending, MI-descending, natural) to reduce ordering sensitivity. Prediction averages label distributions across all orderings.
+sklearn-compatible: works in `Pipeline`, `GridSearchCV`, `cross_val_score`. Supports `partial_fit` for streaming or incremental learning.
+```python
+clf = TabularPredictor(n_bins=10, n_orderings=3)
+clf.fit(X_train, y_train)
+clf.predict(X_test)            # class labels
+clf.predict_proba(X_test)      # list of {label: prob} dicts
+clf.partial_fit(X_new, y_new)  # online update
+```
+**`TabularRegressor`** — regression
+Same architecture as `TabularPredictor` but the continuous target is discretized into quantile bins. Prediction returns the credibility-weighted mean of bin centers. `predict_interval()` also returns the standard deviation of the bin distribution as a calibrated uncertainty estimate.
+```python
+reg = TabularRegressor(n_bins=10, n_target_bins=20)
+reg.fit(X_train, y_train)
+reg.predict(X_test)            # float means
+reg.predict_interval(X_test)   # list of (mean, std) tuples
+reg.score(X_test, y_test)      # R²
+```
+---
+### Time Series
+**`MultivariateTSPredictor`**
+Online step-ahead predictor for multivariate (or univariate) time series. Each timestep is encoded as a compound token `(bin_0, bin_1, ..., bin_{M-1})` — a hashable tuple the trie matches exactly. Context is the last k compound tokens. Adapts immediately to distribution shift without retraining.
+```python
+pred = MultivariateTSPredictor(n_bins=8, context_length=5)
+pred.fit(X_train)              # warm up trie on historical data
+pred.predict()                 # float vector (per-dimension means)
+pred.observe(x_new)            # advance internal state
+pred.feedback(x_new)           # update trie with true value
+pred.forecast(n_steps=10)      # autoregressive multi-step forecast
+pred.score(X_test)             # bits/step (lower = better fit)
+```
+**`TimeSeriesClassifier`**
+Classifies fixed-length time series windows. Each window of T steps becomes T compound tokens; the class label is predicted as the next token after the full window. Supports `partial_fit` for streaming classification. Works in sklearn Pipeline.
+```python
+clf = TimeSeriesClassifier(n_bins=8, window_size=50)
+clf.fit(X_windows, y_labels)
+clf.predict(X_test)            # class labels
+clf.predict_proba(X_test)      # list of {label: prob} dicts
+```
+**`AnomalyDetector`**
+Trains a `MultivariateTSPredictor` on normal data. At inference, each timestep receives anomaly score = `-log2 P(actual | context)`. High score = low predictability = anomalous. The trie is not updated during scoring, so anomalous patterns do not contaminate the model of normal behavior.
+sklearn `OutlierMixin` compliant: `predict()` returns 1 (anomaly) / -1 (normal); `decision_function()` returns negative anomaly scores for threshold-based pipelines.
+```python
+det = AnomalyDetector(n_bins=8, context_length=5)
+det.fit(X_normal)
+det.score_samples(X_test)      # float scores (higher = more anomalous)
+det.predict(X_test)            # 1 or -1 per timestep
+```
+---
+### Generative
+**`SequenceGenerator`**
+Learns a distribution over sequences and samples from it. Supports temperature scaling (`p_i ← p_i^(1/T)`), top-k filtering, and nucleus (top-p) sampling. `generate_text()` joins tokens with a separator for character- or word-level text generation.
+```python
+gen = SequenceGenerator(context_length=6, temperature=0.9)
+gen.fit(list_of_sequences)
+gen.generate(50, seed=['the '], stop_tokens=['.'])  # list of tokens
+gen.generate_text(100, sep='')                       # joined string
+gen.score(sequence)                                  # bits/token
+```
+**`TabularGenerator`**
+Learns the joint distribution `P(f0, f1, ..., fn, label)` and samples synthetic rows. Trains two predictors internally: one with label last (unconditional generation, `P(label | features)`) and one with label first (class-conditional generation, `P(features | label)`). This separation is necessary — a label-last model given a leading label token is out-of-distribution.
+```python
+gen = TabularGenerator(n_bins=10, temperature=1.0)
+gen.fit(X, y)
+gen.sample(n_rows=100)                       # list of dicts
+gen.sample(n_rows=50, given_label='cat')     # class-conditional
+gen.sample_dataframe(n_rows=100)             # pandas DataFrame
+```
+**`TimeSeriesGenerator`**
+Learns a distribution over multivariate time series and samples from it. Unlike `MultivariateTSPredictor.forecast()` (argmax, deterministic), generation here samples from the distribution — producing diverse trajectories. `augment()` wraps generation for data augmentation.
+```python
+gen = TimeSeriesGenerator(n_bins=8, temperature=1.1)
+gen.fit(X_series)
+gen.generate(n_steps=100, seed=X_seed)       # list of float vectors
+gen.augment(X, n_copies=5, temperature=1.1)  # augmented dataset
+```
+---
+## Generative Services
+The three generators (`SequenceGenerator`, `TabularGenerator`, `TimeSeriesGenerator`) share the same trie engine as the predictors. Generation is sampling from the learned conditional distribution rather than taking the argmax. All sampling controls (temperature, top-k, top-p, stop tokens) operate on that distribution at runtime.
+---
+## What this is for
+**Its clearest domain: discrete event streams where the underlying pattern shifts over time.**
+If you have a stream of categorical states and need to predict the next one — without knowing in advance how the pattern will change — this is the right tool. It beats count-based methods (N-gram, PPM, CTW) and online neural methods specifically in non-stationary settings, and it does so with no retraining, no drift detector, and no forgetting window to tune.
+**Natural fits:**
+- **System observability** — sequences of log event codes, API call chains, process state transitions. Predicts next failure type. When a deployment changes the pattern, adaptation is automatic.
+- **User behavior** — clickstreams, navigation paths, in-app action sequences. Next-action prediction that updates on every new user event without a retraining cycle.
+- **Industrial / IoT** — machine state sequences (idle / running / warning / fault), energy consumption states, production line events. Works on tiny datasets where neural methods don't have enough data.
+- **Financial regimes** — discretized price movements, order flow states, market microstructure events. Handles regime shifts that break count-based models.
+- **Anomaly detection** — when the predictor is consistently wrong, something structurally unusual is happening. Confidence collapses before a human notices; no separate anomaly model needed.
+- **Game AI / opponent modeling** — predict next move in any discrete-action game. Adapts to opponent strategy shifts in real time.
+**Where it is not competitive:**
+- **Tabular classification where the data is large and stationary** — on tabular datasets >10K rows without concept drift, gradient boosting will typically win by 5–10pp. The trie shines when data is small, streaming, or drifting.
+- **Long-range sequence dependencies** — the context window is fixed at k. Anything requiring memory beyond the last k observations needs a transformer or RNN.
+- **Large stationary corpora** — on 50K tokens of text or DNA, count-based methods (CTW, KN) hold a 1–2pp accuracy advantage because their unbounded counts eventually outcompete the credibility cap. The gap closes on noisy or drifting data.
+- **Continuous regression targets** — the regressor bins the output; precision is bounded by `n_target_bins`. Point-prediction accuracy on smooth regression tasks is below random forests.
+---
+## The Core Idea
+The algorithm keeps a trie of contexts. Every node in the trie stores two things: a credibility-weighted distribution over successor symbols, and a track record of how reliable that context has been as a predictor. When predicting, it blends the distributions from shallow (general) to deep (specific), where each depth's influence is proportional to its track record. When updating after a wrong prediction, a node that was confidently wrong loses trust faster than one that was fresh and uncertain.
+That is the entire algorithm. No drift detector, no forgetting parameter, no domain-specific tuning.
+---
+## Architecture
+### Module 1 — Universal Sequence Predictor (`predictor.py`)
+**Data structure:** a prefix trie. Each `_TrieNode` stores:
+- `succ_cred` — credibility weight per successor symbol
+- `node_cred` — reliability of this context as a predictor overall
+- `n_obs` — number of times this context has been seen
+The root holds continuation counts (how many distinct predecessors each symbol appeared after, KN-style) for large vocabularies (|V|≥8), falling back to raw KT counts for small alphabets (DNA=4, Electricity=2) where continuation counts are too sparse. This seeds the blend with a better-calibrated unigram prior.
+**Prediction O(k):**
+Walk the trie at depths `min_k..k`. For each matching node, compute a KT-smoothed local distribution. Blend from shallow to deep using the CTW-style recursive formula:
+```
+λ_d = node_cred_d^p / (node_cred_d^p + 1)     # p=0.65; softened mixing weight
+P_d = λ_d · P_local(d)  +  (1 − λ_d) · P_{d−1}
+```
+High credibility → λ → 1 → deep context dominates.
+Low credibility  → λ → 0 → falls back to shallow.
+Root provides the seed.
+The exponent `p=0.65` (versus the standard CTW value of p=1) lets shallower contexts retain 22% blend weight even when deeper contexts are fully saturated. This acts as implicit depth regularization — preventing rare deep contexts from monopolizing predictions on stationary data, without affecting drift adaptation (credibility degrades naturally under drift regardless of p).
+**Update O(k):**
+For each depth, find the context node and apply a multiplicative rule:
+```
+effective_cap = C_MAX × (1 + 0.5 × log(1 + n_obs/100))  # adaptive (optional)
+             = C_MAX                                       # fixed (default)
+correct:  node_cred ← min(cap, node_cred × (1 + lr))
+          succ_cred[actual] ← min(cap, succ_cred[actual] × (1 + lr))
+wrong:    lr_down = lr × (1 + node_cred / cap)   # confidence-proportional
+          node_cred ← max(C_MIN, node_cred × (1 − lr_down))
+          succ_cred[wrong] ← max(C_MIN, succ_cred[wrong] × (1 − lr_down))
+          succ_cred[actual] ← min(cap, succ_cred[actual] × (1 + lr))
+                               ×  binary_scale   # for V≤2 only; prevents false-flip cascades
+```
+With `adaptive_cap=True`, nodes with many observations are allowed to build higher credibility — the cap grows logarithmically with `n_obs`, so λ can approach 1 more closely on stationary data while the maximum `lr_down = 2×lr` is preserved.
+The `lr_down` scaling is the key drift-adaptation mechanism: a node that was highly trusted when it turned wrong loses credibility up to 2× faster than a fresh node. This halves the adaptation lag after a concept drift without requiring any drift detector.
+**Concept drift:**
+Wrong predictions degrade `node_cred`, reducing λ at that depth, causing the blend to automatically fall back to shallower (more general) contexts. As the new pattern accumulates correct observations, `node_cred` rebuilds at the updated depth. No explicit change detection; no forgetting window; adaptation speed is a function of how confidently the old pattern was held.
+**Regret bound:**
+The multiplicative credibility update is an instance of the Multiplicative Weights Update (MWU) algorithm applied to depth selection. For a class of k single-depth predictors, MWU achieves O(√(T ln k)) regret. The CTW-style blend runs this across all depths simultaneously.
+---
+### Module 1 — Forest Ensemble (`forest.py`)
+`PredictorForest` is a collection of `UniversalPredictor` instances that start identical and diverge through experience. Diversity comes from four sources:
+| Mechanism | How it works |
+|---|---|
+| Heterogeneous k | Each tree uses a different context length: k, k+1, k+2, … capturing different temporal scales. Disabled for DNA (4-symbol near-uniform alphabet) where deeper-k trees add noise rather than signal. |
+| Feedback dropout | Each tree independently skips learning on each step with probability `dropout` — the sequence analogue of bagging |
+| Staggered offsets | Tree i doesn't start learning until step `i × stagger`; early topology has outsized influence on later structure |
+| Inter-tree credibility | Each tree maintains a persistent weight updated by whether it was right; correct trees speak louder on the next prediction |
+**Voting:** adaptive hybrid by default. Each tree contributes two representations:
+- **Full blended distribution** (`tree._distribution()`) — the complete CTW-style probability over all vocabulary symbols. Used in the *mixture* component: proper calibration when trees at different context lengths express partial disagreement.
+- **Mode-focused distribution** (`_tree_dist`) — only the most-probable successor at each depth, weighted by node credibility. Used in the *product* component: maximally decisive agreement signal for high-persistence or low-entropy data where unanimous tree confidence should dominate.
+The adaptive blend computes `α × product(mode-focused) + (1−α) × mixture(full)` where `α` is the mean per-tree confidence — high confidence drives product-mode behaviour, uncertainty drives mixture-mode behaviour.
+---
+### Module 2 — Goal-Directed Generation (`module2.py`)
+Module 1 is **intuition** — fast, associative, pattern-matching.
+Module 2 is **deliberation** — goal-directed, using Module 1 as a world model.
+Module 1 is already generative: call `predict()` autoregressively and it produces continuations. Module 2 adds **steering**: constraining or guiding that generation toward a target.
+**Training format:** represent Q&A or any prompt→response task as a flat sequence:
+```
+[prompt tokens ...] [SEPARATOR] [response tokens ...] [END]
+```
+Module 1 learns that SEPARATOR is followed by responses, not more prompts. No architectural changes needed.
+**Three generation strategies** (all implemented in `module2.py`):
+| Strategy | Mechanism | Best for |
+|---|---|---|
+| **Autoregressive** | Feed `[prompt + SEPARATOR]` as context seed; generate token by token until END | Direct completion, short responses |
+| **Beam search** | Maintain N candidate sequences; at each step expand by all vocabulary tokens; prune to top N by cumulative log-probability | Longer responses, controllable diversity |
+| **Retrieval** | Two-stage: (1) Bhattacharyya similarity on post-SEP trie distributions — exact for seen prompts; (2) surface Jaccard fallback when Bhattacharyya < 0.5 — domain-correct for novel tokens | Factual lookup; graceful degradation to novel inputs |
+---
+## Benchmark Results
+Evaluated on 7 standard datasets (two large text corpora, full DNA genome) and 4 concept-drift streams. All methods use the same train/test split (80/20). Baselines: Persistence, Majority, N-gram(5), PPM-D(5), CTW(5).
+**Standard benchmarks (test accuracy %):**
+| Dataset | n | k | Persistence | PPM-D(5) | CTW(5) | **Predictor** | **Forest** |
+|---|---|---|---|---|---|---|---|
+| Airline passengers | 144 | 4 | 37.9 | 27.6 | 31.0 | **41.4** | **41.4** |
+| Alice in Wonderland (15K) | 15,000 | 6 | 2.8 | 51.6 | **53.3** | 51.7 | 51.9 |
+| Moby Dick (50K) | 50,000 | 6 | 2.1 | 45.7 | **47.4** | 46.2 | 46.1 |
+| DNA — bacteriophage lambda (full) | 48,502 | 5 | 26.1 | 29.7 | **30.7** | 29.1 | 28.0 |
+| Weather | 547 | 3 | **57.3** | 47.3 | 50.0 | **52.7** | **51.8** |
+| PRNG (noise floor) | 500 | 3 | 10.0 | **18.0** | 16.0 | 15.0 | 13.0 |
+| Electricity (45K) | 45,312 | 4 | **84.8** | **84.8** | **84.8** | **84.7** | **84.6** |
+**Concept-drift streams (test accuracy %, k=1):**
+| Drift type | N-gram | PPM-D | CTW | **Predictor** | **Forest** |
+|---|---|---|---|---|---|
+| Sudden reversal | 2.5 | 2.5 | 4.5 | **97.0** | **97.0** |
+| Gradual ramp | 5.0 | 5.0 | 6.2 | **98.3** | **98.3** |
+| Recurring A→B→A | 3.8 | 3.3 | 4.2 | **97.5** | **97.5** |
+| Fast (150-step cycles) | 40.0 | 39.6 | 40.4 | **94.6** | 93.3 |
+The concept-drift numbers are the clearest statement of what this architecture is for. Count-based methods (N-gram, PPM-D, CTW) never recover from a reversal because counts only accumulate. The Predictor recovers automatically.
+**Extended baseline comparison — KN, PPM\*, Online LSTM (test accuracy %):**
+| Dataset | KN(5) | PPM\*(20) | LSTM(64) | Predictor | Forest |
+|---|---|---|---|---|---|
+| Airline passengers | 27.6 | 27.6 | 24.1 | **41.4** | **41.4** |
+| Alice in Wonderland (15K) | **52.8** | 51.8 | 39.9 | 51.7 | 51.9 |
+| Moby Dick (50K) | **47.2** | 45.3 | 38.6 | 46.2 | 46.1 |
+| DNA — bacteriophage lambda | 30.1 | 26.6 | **32.5** | 29.1 | 28.0 |
+| Weather | 50.9 | 48.2 | 43.6 | **52.7** | **51.8** |
+| PRNG (noise floor) | 15.0 | **18.0** | 10.0 | 15.0 | 13.0 |
+| Electricity (45K) | **84.8** | 81.9 | **84.8** | **84.7** | **84.6** |
+KN(5) = Interpolated Kneser-Ney N-gram. PPM\*(20) = PPM with max order 20. LSTM(64) = single-layer LSTM, hidden size 64, trained online with BPTT-1 and Adam.
+**Key findings:**
+- **Predictor leads on Weather and Airline** — short, noisy, non-stationary datasets where count-based methods overfit to stale patterns. No other method is competitive on Airline (n=144).
+- **KN(5) is the strongest text predictor** on large stationary corpora (52.8% Alice, 47.2% Moby). The credibility cap prevents our predictor from fully converging — a structural trade-off for drift recovery.
+- **LSTM wins on DNA** (32.5%) — neural sequence modeling captures long-range non-Markovian dependencies that any fixed-order predictor misses.
+- **Electricity: all methods tie** (84.6–84.8%) — a high-persistence binary stream where persistence itself is the ceiling.
+---
+### Confidence-gated prediction (abstain mode)
+`UniversalPredictor` accepts a `min_confidence` parameter (default `0.0`). When set, the predictor abstains — returns `(None, conf)` — whenever its best prediction is less than `min_confidence × (1/|vocab|)` above the uniform baseline. A value of `1.5` means "only predict when at least 1.5× more confident than random."
+Abstaining does not penalize the node: `node_cred` is unchanged. The successor distribution still updates so learning continues. This makes the warmup period implicit — early steps where the predictor is near-uniform simply produce no output rather than noisy guesses.
+**Precision–coverage tradeoffs on natural language (Alice, k=4):**
+| min_confidence | Accuracy (predicted only) | Coverage | Lift |
+|---|---|---|---|
+| 0.0 (off) | 48.5% | 100% | — |
+| 3.0 | 50.3% | 96.5% | +1.8pp |
+| 4.0 | 56.7% | 83.7% | +8.2pp |
+| 5.0 | 59.4% | 77.6% | +10.9pp |
+| 6.0 | 61.4% | 71.8% | +12.9pp |
+Alice at min_conf=5.0 reaches 59.4% accuracy (vs CTW's 53.3% on 100% coverage) by only speaking when confident. For anomaly detection or alerting use cases where coverage matters less than per-prediction reliability, this is the correct mode.
+---
+### The two-regime finding
+Expanding from small samples to full datasets exposed a fundamental architectural property:
+**Data-limited regime (n ≲ 800):** Credibility builds up quickly, blend weights become decisive, and the Predictor is competitive or best. At 1,500 DNA bases the Predictor was 33.0% — best across all methods.
+**Architecture-limited regime (n ≫ 800):** Every node hits `CRED_MAX` and the blend weight freezes at λ = cap^p/(cap^p+1). Count-based methods (PPM-D, CTW) have no cap — their counts keep growing, giving predictions increasingly close to 1.0. At 48K DNA bases CTW reaches 30.7% while the Predictor reaches 29.1%.
+**The exception is noisy and drifting data.** Weather improved from 41% to 52.7% — the Predictor leads on Weather because noisy, high-variance datasets are exactly where count-based methods overfit to stale patterns.
+**The CRED_MAX cap is a design choice, not a bug.** A node with unbounded credibility would adapt from drift in O(n) steps. The cap guarantees O(1/CRED_MAX) adaptation speed. The trade-off is explicit: fast drift recovery at the cost of long-term convergence on stationary data.
+---
+## Files
+**Package (`uchi/`):**
+| File | Purpose |
+|---|---|
+| `predictor.py` | `UniversalPredictor` — core trie engine |
+| `forest.py` | `PredictorForest` — ensemble with heterogeneous k and feedback dropout |
+| `discretize.py` | `FeatureDiscretizer`, `LabelEncoder` — preprocessing |
+| `tabular.py` | `TabularPredictor`, `TabularRegressor` — tabular ML |
+| `timeseries.py` | `MultivariateTSPredictor`, `TimeSeriesClassifier`, `AnomalyDetector` |
+| `generative.py` | `SequenceGenerator`, `TabularGenerator`, `TimeSeriesGenerator` |
+**Root (benchmark scripts and shims):**
+| File | Purpose |
+|---|---|
+| `baselines.py` | Standard baselines: Persistence, Majority, N-gram, PPM-D |
+| `baselines_extended.py` | Extended baselines: KN, PPM\*, Online LSTM |
+| `datasets.py` | Dataset loaders (airline, text, DNA, weather, PRNG, electricity) |
+| `ieee_benchmark.py` | Full benchmark suite generating LaTeX tables |
+| `run_experiments.py` | Quick single-predictor experiment runner |
+| `run_forest.py` | Quick forest experiment runner |
+| `module2.py` | `GoalDirectedGenerator` — autoregressive, beam search, retrieval |
+| `tasks/` | Core-principle manifesto and todo |
+---
+## How This Differs from Standard Approaches
+| Property | N-gram / PPM-D | CTW | **This architecture** |
+|---|---|---|---|
+| Drift adaptation | None — counts only grow | None — counts only grow | Automatic via credibility degradation |
+| Depth selection | Fixed or backoff heuristic | Bayesian mixture (stationary) | MWU — theoretically optimal for adversarial depth selection |
+| Concept drift recovery | Requires reset or windowing | Requires reset or windowing | Self-correcting; speed proportional to prior confidence |
+| Node count | O(V^k) worst case | O(V^k) worst case | O(sequence length) — only observed contexts |
+| Online adaptation | Counts update, predictions sharpen | Weights update | Credibility update; fresh vs. stale nodes naturally separated |
+| Small dataset behavior | Overtrusts rare k-grams | Overtrusts rare k-grams | Credibility builds slowly on sparse observations |
+The single deepest difference from count-based methods: **credibility is earned and can be lost.** A context that was reliable on Monday and wrong on Tuesday sees its influence reduced on Wednesday. Counts only accumulate.