groove-dev 0.27.28 → 0.27.29
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +7 -0
- package/node_modules/@groove-dev/cli/package.json +1 -1
- package/node_modules/@groove-dev/daemon/package.json +1 -1
- package/node_modules/@groove-dev/daemon/src/journalist.js +103 -45
- package/node_modules/@groove-dev/daemon/test/journalist.test.js +1 -1
- package/node_modules/@groove-dev/gui/dist/assets/{index-Ch1N9G4Z.js → index-CNsQ3n1t.js} +290 -290
- package/node_modules/@groove-dev/gui/dist/index.html +1 -1
- package/node_modules/@groove-dev/gui/package.json +1 -1
- package/node_modules/@groove-dev/gui/src/components/agents/agent-config.jsx +2 -2
- package/node_modules/@groove-dev/gui/src/views/settings.jsx +2 -2
- package/package.json +1 -1
- package/packages/cli/package.json +1 -1
- package/packages/daemon/package.json +1 -1
- package/packages/daemon/src/journalist.js +103 -45
- package/packages/gui/dist/assets/{index-Ch1N9G4Z.js → index-CNsQ3n1t.js} +290 -290
- package/packages/gui/dist/index.html +1 -1
- package/packages/gui/package.json +1 -1
- package/packages/gui/src/components/agents/agent-config.jsx +2 -2
- package/packages/gui/src/views/settings.jsx +2 -2
- package/.groove-staging/state.json +0 -3
- package/.groove-staging/timeline.json +0 -13
- package/DECENTRALIZED_NET_WP_V1.md +0 -871
- package/decentralized-net/ACTION_PLAN.md +0 -422
package/decentralized-net/ACTION_PLAN.md
@@ -1,422 +0,0 @@

Groove Decentralized Net — Layer 1 Action Plan
Proving the Inference Pipeline on a Local Network
================================================================================

HARDWARE MAP
================================================================================

Machine 1 — MacBook Air M1 (8GB)
  Role: Consumer node / "phone simulator"
  Runs: 1B-3B draft model for speculative decoding
  Also runs: The orchestration layer (session init, relay logic)
  Why: 8GB is too small for large model shards but perfect for testing the
       consumer experience — this IS the phone equivalent

Machine 2 — iMac (TBD specs, likely M-series with 16-32GB)
  Role: Compute Node A
  Runs: First half of model layers (e.g., Layers 0-15 of a 32-layer model)
  Why: Apple Silicon is fast at inference via MLX/llama.cpp

Machine 3 — GPU PC (TBD specs, likely NVIDIA with 8-24GB VRAM)
  Role: Compute Node B
  Runs: Second half of model layers (e.g., Layers 16-31)
  Why: CUDA is the fastest inference runtime for NVIDIA hardware

Action item: Document the exact specs of each machine before starting.
  Run on iMac:   system_profiler SPHardwareDataType
  Run on GPU PC: nvidia-smi (for GPU) + systeminfo or lscpu (for CPU/RAM)

TECH STACK FOR THE POC
================================================================================

Language: Python 3.11+
  Why: PyTorch, transformers, and all ML tooling is Python-native. The
       production Groove daemon is Node.js, but the inference layer will always
       be Python. No point fighting that.

Model Loading: HuggingFace transformers + PyTorch
  Why: Full control over which layers load on which machine. You can
       load a subset of layers by manipulating the model's state_dict.
  Alternative: llama.cpp (faster on Apple Silicon, but harder to split
       layers programmatically). We may migrate to llama.cpp in Sprint 3.

Networking: WebSocket (websockets library)
  Why: Simple, bidirectional, low overhead for a local network PoC.
       Production will use QUIC, but WebSocket proves the concept without
       the complexity of QUIC NAT traversal on a LAN.

Serialization: msgpack or pickle for tensor transfer
  Why: Hidden states are PyTorch tensors (~8KB per token for a 7B model at
       fp16 — see the calculation in Sprint 1, Task 1.3). msgpack is fast.
       Pickle works for a trusted local network. We are NOT doing JSON —
       binary tensors need binary serialization.

Draft Model: Qwen2.5-0.5B or SmolLM2-1.7B (quantized)
  Why: Fits easily in 8GB on the MacBook Air. Good enough acceptance
       rate for code tasks. Available on HuggingFace.

MODEL SELECTION FOR POC
================================================================================

Start small, prove the architecture, then scale up.

Phase A model: Llama 3.1 8B (or Qwen2.5-7B)
  - 32 transformer layers
  - FP16: ~16GB, Q4: ~4.5GB
  - Split across 2 nodes: ~2.3GB per node at Q4
  - This is trivially small for your hardware — the point is proving
    the sharding and pipeline work, not stressing the GPUs

Phase B model (after pipeline works): Llama 3.1 70B or Qwen2.5-32B
  - 80 layers (70B) or 64 layers (32B)
  - This is where you actually NEED sharding — a single consumer machine
    can't hold the full model
  - Split across 2 nodes: 35-40 layers each, ~20GB per node at Q4

PROJECT STRUCTURE
================================================================================

decentralized-net/
  ACTION_PLAN.md              — this file
  DECENTRALIZED_NET_WP_V1.md  — (symlink or copy from parent, reference only)

  src/
    node/                     — compute node (runs on iMac + GPU PC)
      server.py               — WebSocket server, receives activations, runs layers, returns results
      shard_loader.py         — loads a specific layer range from a model
      kv_cache.py             — manages KV cache for assigned layers

    consumer/                 — consumer client (runs on MacBook Air)
      client.py               — session init, sends prompts, receives tokens
      draft_model.py          — local draft model for speculative decoding
      speculative.py          — speculative decode loop (generate candidates, verify, display)

    relay/                    — relay / orchestration (runs on MacBook Air for PoC)
      relay.py                — pipeline assembly, routing, session management
      pipeline.py             — manages the ordered list of nodes, activation forwarding

    common/
      protocol.py             — message types, serialization, shared constants
      tensor_transfer.py      — efficient tensor packing/unpacking over WebSocket

  tests/
    test_shard_loading.py     — verify a model can be split and reassembled
    test_pipeline.py          — verify activations flow correctly through 2 nodes
    test_speculative.py       — verify speculative decode acceptance logic

  scripts/
    start_node.sh             — launch a compute node with a specified layer range
    start_consumer.sh         — launch the consumer client
    benchmark.py              — measure tok/s, latency, acceptance rate

  requirements.txt            — torch, transformers, websockets, msgpack, etc.

================================================================================
SPRINT 1 — Single Machine Baseline (2-3 days)
================================================================================

Goal: Load a model, run inference, measure baseline tok/s.
Machine: GPU PC (fastest single machine)

Tasks:

1.1 Set up Python environment
  - Create venv, install torch + transformers + accelerate
  - Verify CUDA works on GPU PC (torch.cuda.is_available())
  - Verify MPS works on MacBook/iMac (torch.backends.mps.is_available())

1.2 Baseline inference benchmark (timing sketch below)
  - Load Llama 3.1 8B (or Qwen2.5-7B) in Q4 on GPU PC
  - Run 10 prompts, measure:
    * Time to first token (TTFT)
    * Tokens per second (sustained)
    * Total generation time for 200 tokens
  - Record these numbers — they are your ceiling. Distributed inference
    will always be slower than this. The question is how much slower.
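
A minimal timing sketch for this task. It assumes the GPU PC has enough VRAM
for fp16 (otherwise swap in a smaller model or a quantized load); the model ID
and prompt are placeholders:

  # benchmark_baseline.py — rough TTFT / tok/s measurement (sketch, not the final benchmark.py)
  import time
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"   # assumption: swap for whichever Phase A model you pick
  device = "cuda" if torch.cuda.is_available() else "cpu"

  tok = AutoTokenizer.from_pretrained(MODEL_ID)
  model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to(device)
  model.eval()

  prompt = "Write a Python function that reverses a linked list."
  inputs = tok(prompt, return_tensors="pt").to(device)

  # TTFT: time to produce the first new token
  t0 = time.perf_counter()
  with torch.no_grad():
      model.generate(**inputs, max_new_tokens=1, do_sample=False)
  ttft = time.perf_counter() - t0

  # Sustained throughput over 200 tokens
  t0 = time.perf_counter()
  with torch.no_grad():
      out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
  elapsed = time.perf_counter() - t0
  new_tokens = out.shape[1] - inputs["input_ids"].shape[1]

  print(f"TTFT: {ttft*1000:.0f} ms, {new_tokens/elapsed:.1f} tok/s over {new_tokens} tokens")

Run it per prompt and keep the numbers — they are the single-machine ceiling
that every later sprint gets compared against.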

1.3 Understand the model internals (inspection sketch below)
  - Print model.config to see layer count, hidden_dim, num_heads
  - Print model.model.layers to see the ModuleList of transformer layers
  - Calculate hidden state size: hidden_dim * dtype_bytes
    (e.g., 4096 * 2 = 8KB per token for a 7B model at fp16)
  - This tells you exactly how much data transfers between nodes per token
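
A short inspection sketch for this task, assuming a Llama/Qwen-style decoder
where the layer stack lives at model.model.layers (other architectures name it
differently):

  import torch
  from transformers import AutoModelForCausalLM

  # small model just to inspect the structure; the 7B/8B has the same layout
  model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", torch_dtype=torch.float16)

  cfg = model.config
  print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)
  print(type(model.model.layers), len(model.model.layers))   # ModuleList of decoder layers

  # Per-token hidden state that must cross the wire between nodes:
  bytes_per_token = cfg.hidden_size * 2          # 2 bytes per element at fp16
  print(f"{bytes_per_token} bytes/token (4096 * 2 = 8192 bytes for a 7B-class model)")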

1.4 Manual layer split test (single machine; sketch below)
  - Load only layers 0-15 on one model instance
  - Load only layers 16-31 on another model instance
  - Run a forward pass: input -> layers 0-15 -> capture hidden state ->
    feed into layers 16-31 -> output logits
  - Verify the output matches a full-model forward pass
  - This proves the model CAN be split. If the hidden states don't match,
    something is wrong with how layers are being loaded.
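
A sketch of this check that avoids re-implementing attention internals: instead
of loading two partial instances, it loads the model once and temporarily swaps
the layer list, which is enough to prove the hidden-state handoff. It assumes a
Llama/Qwen-style layout (model.model.layers, model.model.norm) and a recent
transformers release; treat it as a starting point for shard_loader.py, not the
final loader:

  # split_check.py — run the decoder in two halves and compare to a full forward (sketch)
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  MODEL_ID = "Qwen/Qwen2.5-0.5B"              # small stand-in; same idea applies to the 7B/8B
  tok = AutoTokenizer.from_pretrained(MODEL_ID)
  model = AutoModelForCausalLM.from_pretrained(MODEL_ID)   # fp32 on CPU keeps the comparison exact
  model.eval()

  ids = tok("The quick brown fox", return_tensors="pt").input_ids
  full_layers = model.model.layers
  n_layers = model.config.num_hidden_layers
  split = n_layers // 2

  with torch.no_grad():
      ref_logits = model(ids, use_cache=False).logits      # reference: normal full forward

      # "Node A": first half only, final norm bypassed so we get the raw hidden state
      model.model.layers = full_layers[:split]
      model.config.num_hidden_layers = split
      final_norm, model.model.norm = model.model.norm, torch.nn.Identity()
      hidden = model.model(ids, use_cache=False).last_hidden_state   # tensor Node A would send

      # "Node B": second half, fed the received hidden state via inputs_embeds
      model.model.layers = full_layers[split:]
      model.config.num_hidden_layers = n_layers - split
      model.model.norm = final_norm
      logits = model(inputs_embeds=hidden, use_cache=False).logits

      # restore the full model
      model.model.layers = full_layers
      model.config.num_hidden_layers = n_layers

  print("halves match full forward:", torch.allclose(ref_logits, logits, atol=1e-4))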

Deliverable: Baseline numbers + confirmed layer splitting works on one machine.


================================================================================
SPRINT 2 — Two-Node Sharded Inference (3-5 days)
================================================================================

Goal: Run inference across two machines on the local network.
Machines: GPU PC (layers 0-15) + iMac (layers 16-31)

Tasks:

2.1 Build the compute node server (node/server.py; skeleton below)
  - WebSocket server that:
    * On startup: loads its assigned layer range
    * On receiving ACTIVATIONS message: runs forward pass through its layers
    * Returns: output hidden states (or logits if it's the final node)
  - Command line: python server.py --model llama-8b --layers 0-15 --port 8765
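
A bare-bones skeleton of that server, using the websockets asyncio API (recent
versions pass a single connection object to the handler). The shard loading is
stubbed out, framing is one binary message per activation transfer, and
serialization is torch.save/torch.load for simplicity; all of that is an
assumption to be replaced by shard_loader.py and common/tensor_transfer.py:

  # node/server.py — minimal compute-node shape (sketch; real layer execution stubbed out)
  import argparse, asyncio, io
  import torch
  import websockets   # pip install websockets

  def run_layers(hidden: torch.Tensor) -> torch.Tensor:
      # Placeholder: the real node runs its assigned decoder layers here (see shard_loader.py).
      return hidden

  async def handle(ws):
      async for message in ws:                       # one binary frame per activation transfer
          hidden = torch.load(io.BytesIO(message))   # unpickles — acceptable on a trusted LAN only
          out = run_layers(hidden)
          buf = io.BytesIO()
          torch.save(out, buf)
          await ws.send(buf.getvalue())

  async def main():
      ap = argparse.ArgumentParser()
      ap.add_argument("--layers", default="0-15")    # which layer range this node owns
      ap.add_argument("--port", type=int, default=8765)
      args = ap.parse_args()
      print(f"node serving layers {args.layers} on :{args.port}")
      async with websockets.serve(handle, "0.0.0.0", args.port):
          await asyncio.Future()                     # run forever

  if __name__ == "__main__":
      asyncio.run(main())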

2.2 Build the activation relay (common/tensor_transfer.py; sketch below)
  - Serialize PyTorch tensors to bytes efficiently
  - Use torch.save() to a BytesIO buffer, or raw .numpy().tobytes()
  - Measure serialization overhead — should be <1ms for an 8KB tensor
  - Deserialize on receiving end, move to correct device (CUDA/MPS/CPU)
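
One possible shape for these helpers, using msgpack for the envelope and raw
fp16 bytes for the payload (torch.save into a BytesIO buffer, as in the server
skeleton above, is the simpler alternative). Field names are illustrative:

  # common/tensor_transfer.py — pack/unpack a hidden-state tensor for the wire (sketch)
  import msgpack      # pip install msgpack
  import numpy as np
  import torch

  def pack(t: torch.Tensor) -> bytes:
      a = t.detach().to("cpu", torch.float16).numpy()    # ship activations as fp16
      return msgpack.packb({"shape": list(a.shape),
                            "dtype": str(a.dtype),
                            "data": a.tobytes()})

  def unpack(raw: bytes, device: str = "cpu") -> torch.Tensor:
      m = msgpack.unpackb(raw)
      a = np.frombuffer(m["data"], dtype=m["dtype"]).reshape(m["shape"])
      return torch.from_numpy(a.copy()).to(device)

  if __name__ == "__main__":
      h = torch.randn(1, 1, 4096)                         # one token of hidden state, 7B-class
      wire = pack(h)
      print(len(wire), "bytes on the wire")               # ~8 KB payload + a small msgpack header
      assert torch.allclose(unpack(wire).float(), h, atol=1e-2)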

2.3 Build a simple consumer client (consumer/client.py; loop sketched below)
  - Connects to Node A (GPU PC) and Node B (iMac) via WebSocket
  - Sends tokenized prompt to Node A
  - Node A processes layers 0-15, sends hidden states to Node B
    (for now, routed through the consumer — direct P2P comes later)
  - Node B processes layers 16-31, sends logits back to consumer
  - Consumer samples next token, repeats
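
The loop on the consumer looks roughly like the sketch below (star topology,
everything routed through the consumer, greedy sampling). The node addresses
are placeholders, the framing matches the torch.save/torch.load convention of
the server skeleton above, and it assumes the nodes actually run their layer
halves rather than the identity stub:

  # consumer/client.py — greedy token loop through two nodes (sketch; star topology)
  import asyncio, io
  import torch
  import websockets
  from transformers import AutoTokenizer

  NODE_A = "ws://192.168.1.10:8765"     # placeholder: GPU PC, layers 0-15
  NODE_B = "ws://192.168.1.11:8765"     # placeholder: iMac, layers 16-31

  def to_bytes(t):
      buf = io.BytesIO()
      torch.save(t, buf)
      return buf.getvalue()

  def from_bytes(b):
      return torch.load(io.BytesIO(b))

  async def forward(ws, tensor):
      await ws.send(to_bytes(tensor))
      return from_bytes(await ws.recv())

  async def generate(prompt, max_new_tokens=50):
      tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")   # any tokenizer matching the model
      ids = tok(prompt, return_tensors="pt").input_ids
      async with websockets.connect(NODE_A) as a, websockets.connect(NODE_B) as b:
          for _ in range(max_new_tokens):
              hidden = await forward(a, ids)       # Node A: embeddings + layers 0-15 -> hidden states
              logits = await forward(b, hidden)    # Node B: layers 16-31 + lm_head -> logits
              next_id = int(logits[0, -1].argmax())          # greedy sampling for the PoC
              ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
              print(tok.decode(next_id), end="", flush=True)
      return tok.decode(ids[0])

  if __name__ == "__main__":
      asyncio.run(generate("Write a haiku about local networks."))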

2.4 Run the 2-node pipeline
  - Start server.py on GPU PC (layers 0-15)
  - Start server.py on iMac (layers 16-31)
  - Run client.py on MacBook Air
  - Generate 200 tokens, measure:
    * TTFT
    * Tok/s
    * Per-hop latency (time for hidden state transfer)
    * Compare to Sprint 1 baseline

2.5 Implement basic pipeline parallelism
  - Modify the consumer to send token N+1's hidden states to Node A
    while Node B is still processing token N
  - This is the pipelining from Section 4.1 of the whitepaper
  - Measure throughput improvement vs. sequential

Expected results:
  - Sequential 2-node: 3-5 tok/s (down from 15-30 baseline, network overhead)
  - Pipelined 2-node: 5-10 tok/s (pipeline fills, bounded by the slower node)
  - On a local LAN, hop latency should be 1-3ms — much better than the internet

Deliverable: Working 2-node inference with measurable pipeline parallelism gain.


================================================================================
SPRINT 3 — Three-Node Pipeline (2-3 days)
================================================================================

Goal: Full 3-node pipeline matching the whitepaper architecture.
Machines: MacBook Air (consumer) + iMac (layers 0-15) + GPU PC (layers 16-31)

Tasks:

3.1 Direct node-to-node activation transfer
  - Currently activations route through the consumer (star topology)
  - Change to: Node A sends directly to Node B (chain topology)
  - Node A needs to know Node B's address — the consumer tells both
    nodes the full pipeline at session start
  - This halves the network hops for activation transfer

3.2 Build the relay/pipeline manager (relay/pipeline.py; example config below)
  - Assembles the pipeline: which node runs which layers
  - Sends PIPELINE_CONFIG to each node at session start
  - Monitors heartbeats from each node
  - For now, runs on the MacBook Air alongside the consumer
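
The config message itself can stay tiny. One possible shape — field names,
addresses, and the heartbeat interval are illustrative, not a fixed protocol:

  # Sent by the relay to every node at session start (illustrative field names)
  PIPELINE_CONFIG = {
      "type": "PIPELINE_CONFIG",
      "session_id": "sess-0001",
      "model": "llama-3.1-8b-q4",
      "pipeline": [
          {"node": "imac",        "addr": "ws://192.168.1.11:8765", "layers": [0, 10]},
          {"node": "gpu-pc",      "addr": "ws://192.168.1.10:8765", "layers": [11, 21]},
          {"node": "macbook-air", "addr": "ws://192.168.1.12:8765", "layers": [22, 31]},
      ],
      "heartbeat_interval_s": 5,
  }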

3.3 Three-node inference
  - Split the 32-layer model into 3 shards:
    * iMac: Layers 0-10
    * GPU PC: Layers 11-21
    * MacBook Air: Layers 22-31 (small shard, fits in 8GB)
    OR if the MacBook Air can't hold layers, use it as a pure consumer
    and split between iMac and GPU PC only (2 compute + 1 consumer)
  - Run the full pipeline with 3 participants
  - Measure throughput vs. the 2-node pipeline

3.4 Benchmark suite (scripts/benchmark.py)
  - Automated benchmark that tests:
    * Single node baseline
    * 2-node sequential
    * 2-node pipelined
    * 3-node pipelined
  - Outputs a comparison table with TTFT, tok/s, per-hop latency
  - This becomes the evidence that Layer 1 works

Deliverable: 3-node pipeline running, benchmark comparison table.


================================================================================
SPRINT 4 — Speculative Decoding (3-5 days)
================================================================================

Goal: Add the draft model on the MacBook Air, prove the speed multiplier.
This is the "secret weapon" from the whitepaper.

Tasks:

4.1 Load draft model on MacBook Air (consumer/draft_model.py)
  - Load Qwen2.5-0.5B or SmolLM2-1.7B quantized to 4-bit
  - Verify it runs on the M1 with the MPS backend
  - Measure draft generation speed (should be 50-100+ tok/s locally)

4.2 Implement speculative decode loop (consumer/speculative.py)
  - Draft model generates N candidate tokens
  - All N candidates sent to the pipeline in a single batch
  - Pipeline processes all N tokens in one forward pass
  - Final node returns: acceptance mask + correction token
  - Consumer accepts matching tokens, restarts from correction

4.3 Verification logic on the final compute node (greedy sketch below)
  - Receives N candidate tokens + the prompt
  - Runs a single forward pass for all positions
  - Compares the model's token distribution at each position to the
    candidate token
  - Returns: number of accepted tokens + first corrected token
  - Standard speculative decoding math (rejection sampling)
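
A sketch of the greedy variant of this check (accept while the candidate equals
the verifier's argmax). The full rejection-sampling rule from the speculative
decoding literature also needs the draft model's probabilities; the greedy form
below is an assumption-light starting point for the PoC measurement, not the
final math:

  import torch

  def verify_greedy(target_logits: torch.Tensor, candidates: torch.Tensor):
      """target_logits: [N, vocab] — final-node logits at each candidate position
         candidates:    [N]        — draft model's proposed token ids
         Returns (num_accepted, correction_token_or_None)."""
      predicted = target_logits.argmax(dim=-1)             # big model's greedy choice per position
      matches = (predicted == candidates).long()
      num_accepted = int(matches.cumprod(dim=0).sum())     # length of the matching prefix
      if num_accepted == len(candidates):
          return num_accepted, None                        # every draft token was accepted
      return num_accepted, int(predicted[num_accepted])    # big model's token replaces the first miss

  # example: draft proposed [5, 9, 2]; the big model would have emitted [5, 9, 7]
  # -> verify_greedy returns (2, 7): keep two tokens, emit 7, then draft again from there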

4.4 Adaptive window sizing (sketch below)
  - Track acceptance rate over a rolling window of 10 verification rounds
  - Expand window (N=12) when acceptance > 80%
  - Shrink window (N=4) when acceptance < 50%
  - Default window: N=8
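
A small sketch of that policy. The thresholds and window sizes come from the
bullets above; the deque length and the step-to-fixed-sizes behaviour are
assumptions:

  from collections import deque

  class AdaptiveWindow:
      def __init__(self, default=8, small=4, large=12, history=10):
          self.n = default
          self.small, self.default, self.large = small, default, large
          self.rates = deque(maxlen=history)     # acceptance rate of the last 10 rounds

      def record(self, accepted: int, proposed: int) -> int:
          self.rates.append(accepted / max(proposed, 1))
          avg = sum(self.rates) / len(self.rates)
          if avg > 0.8:
              self.n = self.large                # drafts are landing — speculate further ahead
          elif avg < 0.5:
              self.n = self.small                # drafts keep missing — shrink the window
          else:
              self.n = self.default
          return self.n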

4.5 Full stack benchmark
  - Run the 3-node pipeline WITH speculative decoding
  - Compare to the 3-node pipeline WITHOUT speculative decoding
  - Measure:
    * Effective tok/s (accepted tokens per second of wall time)
    * Acceptance rate per domain (try code prompts vs. creative prompts)
    * Overhead of draft model generation on the MacBook Air
    * Network round-trips saved

Expected results:
  - Without speculative: 5-10 tok/s (from Sprint 3)
  - With speculative (70% acceptance): 12-20 tok/s
  - Code-specific prompts should see higher acceptance than creative ones

Deliverable: Speculative decoding working, measured speed multiplier,
acceptance rate data across prompt types.


================================================================================
SPRINT 5 — Hardening & KV Cache (3-5 days)
================================================================================

Goal: Handle failures and long conversations gracefully.

Tasks:

5.1 KV cache management per node
  - Each node maintains the KV cache for its layers across tokens
  - Implement cache eviction when the context window fills
  - Test: run a 2000-token conversation, verify the cache doesn't OOM

5.2 Node dropout handling
  - Simulate Node B going offline mid-generation
  - Consumer/relay detects it via heartbeat timeout
  - For the PoC: restart the session (full re-prefill)
  - Measure recovery time
  - Later: implement standby promotion with KV cache streaming

5.3 Session resumption
  - Consumer disconnects and reconnects
  - Compute nodes still hold the KV cache for the session (within a TTL)
  - Consumer resumes generation without re-prefill
  - Test: disconnect for 30s, reconnect, verify continuity

5.4 Metrics dashboard
  - Simple terminal UI or web page showing:
    * Active pipeline topology
    * Per-node GPU utilization
    * Tok/s in real time
    * Acceptance rate (if speculative decoding is on)
    * Per-hop latency
  - This becomes the foundation for the Groove GUI integration later

Deliverable: Resilient pipeline that handles failures, with metrics visibility.

================================================================================
WHAT SUCCESS LOOKS LIKE
================================================================================

After Sprint 4, you should have hard data proving:

1. Model sharding works — a model split across 2-3 machines produces
   identical output to the same model on one machine

2. Pipeline parallelism provides a measurable throughput improvement —
   2-3x over naive sequential, bounded by the slowest node, not the sum of nodes

3. Speculative decoding provides a measurable throughput improvement —
   2-4x over non-speculative, with acceptance rates broken down by domain

4. The full stack (sharding + pipeline + speculative) achieves
   production-viable throughput — targeting 12-20 tok/s on a LAN,
   which maps to 5-10 tok/s over the internet (the "average case" from
   the whitepaper)

This data is what turns the whitepaper from a thesis into a proof.
Every claim in Section 13 of the whitepaper (Performance Expectations)
should be backed by real measurements from your 3-machine testbed.

================================================================================
WHAT COMES AFTER LAYER 1
================================================================================

Once Layer 1 is proven on the local network:

Layer 2 — Relay Network:
  Move from a hardcoded pipeline (you manually assign layers to machines)
  to dynamic discovery via DHT. Implement the relay node as a separate
  process. Test with machines on different networks (Tailscale).

Layer 3 — Proof & Settlement:
  Implement proof of compute generation. Deploy the $GROOVE token contract
  on Base testnet. Wire up escrow and batch settlement.

Layer 4 — Training Loop:
  Start capturing inference traces. Build the federated fine-tuning
  pipeline. Train Savant v0.1.

Integration with Groove:
  The decentralized inference pipeline becomes a new provider in
  packages/daemon/src/providers/ — "groove-net.js". Groove agents can
  route to the decentralized network just like they route to Claude,
  Codex, or Gemini today.


================================================================================
GETTING STARTED — LITERALLY THE FIRST COMMANDS
================================================================================

On the MacBook Air (your dev machine):

  cd ~/Desktop/groove/decentralized-net
  python3 -m venv venv
  source venv/bin/activate
  pip install torch transformers accelerate websockets msgpack

  mkdir -p src/{node,consumer,relay,common} tests scripts

  # Verify PyTorch works with MPS
  python3 -c "import torch; print(torch.backends.mps.is_available())"

  # Download a small model to start
  python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
    AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-0.5B'); \
    AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')"

Then start Sprint 1, Task 1.2 — run a baseline benchmark on your fastest
machine.