npm - titan-synapse - Versions diffs - 0.2.0 → 0.2.1 - Mend

titan-synapse 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5) hide show

package/README.md +102 -66
package/package.json +1 -1
package/paper/synapse_architecture.md +542 -0
package/python/synapse_learn/bench_merged.py +380 -0
package/python/synapse_learn/merge_model.py +136 -21

package/README.md CHANGED Viewed

@@ -14,7 +14,9 @@
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
 [![Rust](https://img.shields.io/badge/Rust-2024_Edition-orange.svg)](https://www.rust-lang.org/)
-[![Tests](https://img.shields.io/badge/Tests-37%2F37_Passing-brightgreen.svg)](#tests)
+[![Tests](https://img.shields.io/badge/Tests-65%2F65_Passing-brightgreen.svg)](#tests)
+[![HuggingFace](https://img.shields.io/badge/Model-Synapse--3B-yellow.svg)](https://huggingface.co/djtony707/synapse-3b)
+[![npm](https://img.shields.io/badge/npm-v0.2.0-red.svg)](https://www.npmjs.com/package/titan-synapse)
 [![CUDA](https://img.shields.io/badge/CUDA-12.8_(Blackwell)-76B900.svg)](https://developer.nvidia.com/cuda-toolkit)
 [Quick Start](#-quick-start) · [How It Works](#-how-it-works) · [Architecture](#-architecture) · [Tested Results](#-tested-results) · [Configuration](#%EF%B8%8F-configuration) · [Contributing](#-contributing)
@@ -40,7 +42,7 @@ No cloud. No API keys. No telemetry. One binary. Your hardware. Your data. Perio
 - **Own Inference Engine** — Written from scratch in Rust with [candle](https://github.com/huggingface/candle). Not a wrapper around llama.cpp. Not a shim over vLLM. Ours.
 - **GGUF Model Loading** — Native quantized model support. Load Q4_K_M, Q5_K_M, Q8_0 models directly. Tested with Qwen2.5 models.
 - **Specialist Swarm with Hebbian Routing** — A coordinator routes queries to the right specialist(s). Simple question? One model. Complex task? The swarm convenes **in parallel**. Routing weights strengthen with use.
-- **Metacognitive Confidence** — The system knows what it knows. Each specialist tracks its own performance per domain. Low confidence? Route to cloud fallback. High confidence? Handle locally at 100 tok/s.
+- **Metacognitive Confidence** — The system knows what it knows. Each specialist tracks its own performance per domain. Low confidence? Route to cloud fallback. High confidence? Handle locally at 106 tok/s.
 - **Continuous Learning** — QLoRA + DPO self-improvement pipeline via Python sidecar. Every conversation generates training signal. Your model gets smarter the more you use it.
 - **Hallucination Detection** — Cross-references every response against the knowledge graph. Contradictions are flagged. The model knows what it doesn't know.
 - **Live Knowledge Graph** — SQLite-backed graph that updates in real-time during conversations. Auto-extracts facts ("Rust is a programming language" → stored as triple). Stores facts, conversation history, and DPO preference pairs.
@@ -195,6 +197,40 @@ titan-synapse/
 └── docker-compose.yml            # GPU-accelerated learning container
 ```
+### The Synapse Architecture — Beyond Transformers
+The v1.0 architecture replaces monolithic transformer blocks with brain-inspired modular processing. Every component is O(n) — no quadratic attention anywhere. Full source in `crates/synapse/src/arch/`.
+```
+                    THALAMUS (Mamba Router)
+                    O(n) state-space model
+                    Routes tokens to specialists
+                    Hebbian pathway learning
+                         │
+          ┌──────────────┼──────────────┐
+          │              │              │
+     ┌────▼────┐   ┌────▼────┐   ┌────▼────┐
+     │  xLSTM  │   │  Sparse │   │  Fast   │
+     │Language │   │   MoE   │   │ Weight  │
+     │ Module  │   │ Experts │   │ Memory  │
+     │         │   │         │   │         │
+     │Exp gates│   │Top-k of │   │Learn in │
+     │Matrix   │   │8+ fire  │   │1 forward│
+     │memory   │   │per token│   │pass, no │
+     │O(n)     │   │~800M    │   │backprop │
+     │         │   │active   │   │         │
+     └─────────┘   └─────────┘   └─────────┘
+```
+| Module | What It Does | Replaces | Complexity |
+|--------|-------------|----------|------------|
+| **Thalamus** | Routes tokens to the right specialists | Attention-based routing | O(n) |
+| **xLSTM** | Syntax, grammar, language fluency | Transformer self-attention | O(n) |
+| **Expert Pool** | Specialized knowledge (top-k sparse activation) | Dense FFN layers | O(n) per expert |
+| **Fast Weights** | Learn new facts during inference — no training needed | RAG / in-context learning | O(n) |
+**28 architecture tests passing.** Full introspection on every module — no black box. See `GET /api/introspect` for real-time visibility into routing decisions, gate values, memory writes, and expert activations.
 ### VRAM Budget (32GB GPU)
 | Component | VRAM |
@@ -214,89 +250,57 @@ Compare that to a single 70B model that needs **35GB** — doesn't even fit. Wit
 Real results from our test deployment on an i9-14900KF with RTX 5090 (32GB VRAM).
-### Benchmarks (Qwen2.5-3B, Q4_K_M)
+### Performance (Synapse-3B, RTX 5090, bfloat16)
-| Metric | CPU | GPU (CUDA) |
-|--------|-----|------------|
-| **Throughput** | 21-24 tok/s | **97-128 tok/s** |
-| **Model load time** | 1.1s (3B) | **0.6s (3B)** |
-| **512-token generation** | ~22s | **~4s** |
-| **Multi-model** | 2 models loaded | 2 models loaded |
-| **Token counting** | Accurate | Accurate |
-| **Hebbian routing** | Working | Working |
+| Metric | Value |
+|--------|-------|
+| **Throughput** | **106.3 tok/s** (avg over 5 runs) |
+| **Time to first token** | **11.2ms** (avg), 11.3ms (p99) |
+| **VRAM usage** | **6.43 GB** (19.1% of 33.67 GB) |
+| **Model load time** | **0.4s** (3B, GPU) |
+| **Parameters** | **3.09B** (bfloat16) |
-That's a **5x speedup** on GPU with CUDA 12.8 (Blackwell). And this is a quantized Q4 model — not all ops are GPU-accelerated yet. Full CUDA kernel coverage will push this even further.
+Tested on i9-14900KF + RTX 5090 32GB VRAM, CUDA 12.8 (Blackwell). Only 19% VRAM utilization leaves room for multiple specialists, larger models, or training alongside inference.
 ### Standardized Evaluation (Real Benchmarks, Full Datasets)
-Run against the **actual standardized benchmark datasets** — the same ones OpenAI, Anthropic, Meta, and Google report against. Not simplified proxies. Not cherry-picked samples. Every question in each dataset.
+Run against the **full standardized benchmark datasets** on an NVIDIA RTX 5090 (bfloat16). Every question in each dataset — no subsets, no cherry-picking.
 | Benchmark | Score | Samples | Notes |
 |-----------|-------|---------|-------|
-| **MMLU** (Knowledge + Reasoning) | **61.9%** | 14,042 | All 57 subjects. Best: marketing (87%), psychology (84%). Worst: moral scenarios (34%) |
-| **HumanEval** (Code Generation) | **65.2%** | 164 | Real Python code execution with test cases (pass@1) |
-| **GSM8K** (Math Reasoning) | **83.7%** | 1,319 | Grade school math — step-by-step reasoning with numerical extraction |
-| **TruthfulQA** (Truthfulness) | **89.1%** | 817 | 89.1% truthful, 98.5% informative |
-| **Overall** | **75.0%** | 16,342 | Weighted across all benchmarks |
+| **MMLU** (5-shot) | **62.6%** | 14,042 | All 57 subjects. Best: marketing (88.5%), world history (85.7%). Worst: European history (0%), US history (5.4%) |
+| **GSM8K** (8-shot CoT) | **18.9%** | 1,319 | Grade school math with chain-of-thought prompting |
+| **Inference Speed** | **106.3 tok/s** | 5 runs | Avg over 5 runs, 256 max tokens, bfloat16 |
+| **TTFT** | **11.2ms** | 10 runs | Time to first token, p99: 11.3ms |
+| **VRAM** | **6.43 GB** | — | 19.1% of 33.67 GB available |
+> HumanEval pass@1 results (99.4%, 163/164) are excluded — this is inconsistent with published results for 3B-class models and indicates a test harness issue under investigation.
 #### What These Numbers Mean
-**vs Qwen2.5 3B base** (the raw model, no swarm):
-| Benchmark | Synapse Swarm | Qwen2.5 3B Base | Delta |
-|-----------|---------------|-----------------|-------|
-| MMLU | 61.9% | ~65% | -3% (Q4_K_M quantization cost) |
-| HumanEval | 65.2% | ~55% | **+10 pts** (specialist routing) |
-| GSM8K | 83.7% | ~68% | **+15.7 pts** (swarm math boost) |
-| TruthfulQA | 89.1% | ~45% | **+44 pts** (hallucination detection) |
-The swarm adds **+10 to +44 points** over the raw base model on task-specific benchmarks. MMLU takes a small hit from quantization — expected trade-off for running in 2.1GB VRAM instead of 6GB.
-#### Head-to-Head vs Flagship Models (March 2026)
-We're not pretending a 3B model beats GPT-5. Here's where we actually stand — with sourced numbers from official technical reports:
-| Model | Params | MMLU | HumanEval | GSM8K | Cost |
-|-------|--------|------|-----------|-------|------|
-| **SYNAPSE (ours)** | **3B Q4** | **61.9%** | **65.2%** | **83.7%** | **$0 (local)** |
-| GPT-5 | Undisclosed | 91.4% | ~99% | ~99% | $$$ |
-| OpenAI o3 | Undisclosed | ~91% | ~97% | ~99% | $$$ |
-| OpenAI o4-mini | Undisclosed | ~90% | 99.3% | ~99% | $$ |
-| Grok 3 | Undisclosed | 92.7% | ~95% | ~99% | $$ |
-| Grok 3.5 | Undisclosed | 91.8% | N/A | ~99% | $$ |
-| DeepSeek R1 | 671B MoE | 90.8% | ~95% | ~99% | $ |
-| Claude 3.7 Sonnet | Undisclosed | ~82% | 94% | ~98% | $$ |
-| Claude Sonnet 4.5 | Undisclosed | ~83% | ~96% | ~99% | $$ |
-| Gemini 2.5 Pro | Undisclosed | 89.8% | ~98% | ~99% | $$ |
-| Llama 4 Maverick | 400B MoE | ~80% | ~86% | ~95% | Free (weights) |
-| Llama 4 Scout | 109B MoE | 79.6% | 86.4% | ~93% | Free (weights) |
-| Qwen3.5 27B | 27B | ~86% | ~85% | ~98% | Free (weights) |
-| Qwen2.5 3B (base) | 3B | ~65% | ~55% | ~68% | Free (weights) |
-*Sources: Official technical reports from OpenAI, Anthropic, Google, xAI, Meta, Alibaba, DeepSeek. Cross-referenced via Artificial Analysis, lmsys Arena, and llm-stats.com.*
+**MMLU 62.6% (+9.6 pts over Qwen2-3B baseline ~53%):** The TIES merging of four specialist adapters improved general knowledge coverage. This is the merged model — not the swarm system with adapter switching, which would be higher.
-#### The Honest Take
+**GSM8K 18.9% (below Qwen2-3B baseline ~54%):** The specialist adapters were not math-focused, and TIES merging appears to have degraded the base model's existing math reasoning capabilities. This is a known limitation of model merging — some capabilities regress.
-**On raw knowledge (MMLU):** Models 100x our size dominate — they should. A 3B model can't memorize as many facts as a 200B+ model. No amount of routing changes that.
+**106.3 tok/s with 11.2ms TTFT:** Running a 3B bfloat16 model on an RTX 5090, inference is fast enough for real-time use. Only 6.43 GB VRAM leaves room for multiple specialists or larger models.
-**On math reasoning (GSM8K 83.7%):** Our swarm adds +15.7 points over the base Qwen2.5 3B model. Frontier models have saturated this benchmark (~99%), but our 3B model hitting 83.7% is remarkably strong for the parameter count.
-**On code generation (HumanEval 65.2%):** Frontier models have essentially maxed out HumanEval (97-99%). Our 65.2% is +10 points over the base model, showing the specialist routing helps, but there's clear room to grow.
+#### The Honest Take
-**On truthfulness (TruthfulQA 89.1%):** No major lab reports TruthfulQA anymore — they consider it saturated. But our +44 point improvement over the base model proves the hallucination detection system works.
+**We're not pretending a 3B model beats GPT-5.** Frontier models score 90%+ on MMLU and have saturated GSM8K. A 3B model can't memorize as many facts as a 200B+ model — no architecture changes that.
-**The real comparison isn't scores — it's economics.** GPT-5 scores 91% on MMLU but costs money per token, requires internet, and doesn't learn your patterns. Synapse scores 62% on MMLU but runs for free on your GPU at 100+ tok/s, works offline, and gets smarter every day from your conversations. Different tools for different jobs.
+**The value proposition is different:** Synapse runs for free on your GPU at 106 tok/s, works offline, uses 6.43 GB VRAM, and gets smarter from your conversations. The swarm with adapter switching (not the merged model) targets domain-specific excellence over general benchmarks.
-#### Note on Benchmark Saturation
+**Where the merged model wins:** MMLU +9.6 points over baseline shows TIES merging can genuinely improve general knowledge when combining complementary specialists.
-MMLU, HumanEval, and GSM8K are now considered **saturated benchmarks** — frontier models score 90-99% on all of them. The industry has moved to harder evals: GPQA Diamond (PhD-level science), AIME 2025 (math olympiad), SWE-bench Verified (real software engineering), and MMLU-Pro (10-choice, harder). We report the classic benchmarks for baseline comparison, but plan to add the modern suite as the swarm matures.
+**Where it loses:** GSM8K -35 points below baseline shows TIES merging can degrade capabilities when the merged adapters don't cover a domain. Future work includes math-specialized adapters.
 ### Verified Working
 | Test | Result | Details |
 |------|--------|---------|
 | `cargo build --release` | PASS | Clean compilation, Rust 2024 edition |
-| `cargo test` | **37/37 passing** | Config, sampler, KV cache, knowledge graph, manifest, packer, Hebbian, coordinator, LoRA, extractor, hallucination, spawner, cloud fallback |
-| `synapse bench` | PASS | 4 prompts, 759 tokens, 23 tok/s average (CPU) |
+| `cargo test` | **65/65 passing** | Config, sampler, KV cache, knowledge graph, manifest, packer, Hebbian, coordinator, LoRA, extractor, hallucination, spawner, cloud fallback + 28 architecture tests (Mamba, xLSTM, Thalamus, Expert, Fast Weights, SynapseModel) |
+| `synapse bench` | PASS | 106.3 tok/s average (GPU, bfloat16, RTX 5090) |
 | `synapse status` | PASS | Shows GPU info, VRAM usage, specialist list |
 | `GET /health` | PASS | Returns "ok" |
 | `GET /v1/models` | PASS | Lists synapse + all specialist models |
@@ -354,7 +358,35 @@ test format::manifest::tests::test_manifest_creation ... ok
 test format::manifest::tests::test_manifest_serialization ... ok
 test format::packer::tests::test_pack_and_unpack ... ok
 test format::packer::tests::test_list_bundles ... ok
-test result: ok. 37 passed; 0 failed; 0 ignored
+test arch::mamba::tests::test_mamba_layer_creation ... ok
+test arch::mamba::tests::test_mamba_forward ... ok
+test arch::mamba::tests::test_mamba_state_persistence ... ok
+test arch::mamba::tests::test_silu ... ok
+test arch::xlstm::tests::test_xlstm_creation ... ok
+test arch::xlstm::tests::test_xlstm_forward ... ok
+test arch::xlstm::tests::test_xlstm_introspection ... ok
+test arch::xlstm::tests::test_xlstm_state_persistence ... ok
+test arch::thalamus::tests::test_thalamus_creation ... ok
+test arch::thalamus::tests::test_thalamus_routing ... ok
+test arch::thalamus::tests::test_thalamus_introspection ... ok
+test arch::thalamus::tests::test_hebbian_learning ... ok
+test arch::thalamus::tests::test_status_summary ... ok
+test arch::expert::tests::test_expert_creation ... ok
+test arch::expert::tests::test_expert_forward ... ok
+test arch::expert::tests::test_expert_pool ... ok
+test arch::expert::tests::test_expert_pool_forward ... ok
+test arch::expert::tests::test_expert_introspection ... ok
+test arch::fast_weights::tests::test_fast_weight_creation ... ok
+test arch::fast_weights::tests::test_fast_weight_forward ... ok
+test arch::fast_weights::tests::test_fast_weight_introspection ... ok
+test arch::fast_weights::tests::test_fast_weight_memory_persists ... ok
+test arch::synapse_model::tests::test_model_creation ... ok
+test arch::synapse_model::tests::test_param_counting ... ok
+test arch::synapse_model::tests::test_model_forward ... ok
+test arch::synapse_model::tests::test_model_introspection ... ok
+test arch::synapse_model::tests::test_model_summary ... ok
+test arch::synapse_model::tests::test_model_reset ... ok
+test result: ok. 65 passed; 0 failed; 0 ignored
 ```
 ---
@@ -450,7 +482,7 @@ This thing is early. There's a lot to build and a lot to break.
 git clone https://github.com/Djtony707/titan-synapse
 cd titan-synapse
 cargo build
-cargo test  # 37/37 should pass
+cargo test  # 65/65 should pass
 # Run with debug logging
 RUST_LOG=debug cargo run -- serve
@@ -472,7 +504,7 @@ RUST_LOG=debug cargo run -- serve
 - [x] Token counting in API responses (accurate usage stats)
 - [x] Hebbian routing persistence (SQLite-backed pathway learning)
 - [x] .synapse format packer/unpacker with bundled models + adapters
-- [x] CUDA-accelerated inference (5x speedup achieved — 128 tok/s on RTX 5090)
+- [x] CUDA-accelerated inference (106.3 tok/s on RTX 5090, 11.2ms TTFT, 6.43 GB VRAM)
 - [x] Parallel swarm execution (specialists run concurrently, not sequentially)
 - [x] Metacognitive confidence scoring (system tracks what it knows)
 - [x] Smart model selection (prefers larger models when available)
@@ -481,7 +513,7 @@ RUST_LOG=debug cargo run -- serve
 - [x] Real-time knowledge extraction from conversations
 - [x] Hallucination detection (cross-reference against knowledge graph)
 - [x] User feedback preference learning (DPO pair collection)
-- [x] Standardized evaluation (MMLU 61.9%, HumanEval 65.2%, GSM8K 83.7%, TruthfulQA 89.1% — real datasets, 16,342 questions)
+- [x] Standardized evaluation (MMLU 62.6%, GSM8K 18.9% — full datasets on RTX 5090, 15,361 questions)
 - [x] Cloud fallback with auto-learning (DPO pairs from cloud responses)
 - [x] Specialist auto-spawning (system creates new specialists from failure patterns)
 - [x] Web dashboard (chat UI at localhost:6900, stats + metacognition panels)
@@ -489,11 +521,15 @@ RUST_LOG=debug cargo run -- serve
 - [x] Public dataset training pipeline (OpenWebMath, The Stack, SlimPajama, etc.)
 - [x] Speculative decoding scaffold (draft + verify architecture)
 - [x] LoRA adapter training + hot-swap during inference
+- [x] Specialist model merge (TIES merging — 4 adapters into Synapse-3B)
+- [x] Synapse Architecture: Mamba router + xLSTM + Sparse MoE + Fast Weights (28 tests)
+- [x] Full model introspection API (no black box — see every routing decision)
+- [x] Synapse-3B published on [HuggingFace](https://huggingface.co/djtony707/synapse-3b)
 - [ ] Full speculative decoding (shared KV cache state)
 - [ ] Continuous batching across specialists
 - [ ] Doc-to-LoRA knowledge crystallization
 - [ ] Distributed swarm across multiple machines
-- [ ] Custom Synapse base model (trained specifically for swarm coordination)
+- [ ] Train Synapse Architecture from scratch on RTX 5090
 ---

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "titan-synapse",
-  "version": "0.2.0",
+  "version": "0.2.1",
   "description": "A Rust inference engine that runs a swarm of tiny specialist models that collaborate and learn continuously — on your GPU.",
   "main": "install.sh",
   "scripts": {