PyPI - loom-gpt - Versions diffs - 0.1.0__tar.gz - Mend

loom-gpt 0.1.0__tar.gz


Files changed (27)

loom_gpt-0.1.0/PKG-INFO +598 -0
loom_gpt-0.1.0/README.md +588 -0
loom_gpt-0.1.0/config.py +48 -0
loom_gpt-0.1.0/loom.py +165 -0
loom_gpt-0.1.0/loom_gpt.egg-info/PKG-INFO +598 -0
loom_gpt-0.1.0/loom_gpt.egg-info/SOURCES.txt +25 -0
loom_gpt-0.1.0/loom_gpt.egg-info/dependency_links.txt +1 -0
loom_gpt-0.1.0/loom_gpt.egg-info/entry_points.txt +2 -0
loom_gpt-0.1.0/loom_gpt.egg-info/requires.txt +3 -0
loom_gpt-0.1.0/loom_gpt.egg-info/top_level.txt +3 -0
loom_gpt-0.1.0/pyproject.toml +24 -0
loom_gpt-0.1.0/setup.cfg +4 -0
loom_gpt-0.1.0/src/__init__.py +0 -0
loom_gpt-0.1.0/src/attention.py +38 -0
loom_gpt-0.1.0/src/bigram.py +37 -0
loom_gpt-0.1.0/src/constellation.py +42 -0
loom_gpt-0.1.0/src/data_prep.py +123 -0
loom_gpt-0.1.0/src/dataset.py +105 -0
loom_gpt-0.1.0/src/model.py +91 -0
loom_gpt-0.1.0/src/tokenizer.py +62 -0
loom_gpt-0.1.0/src/training.py +67 -0
loom_gpt-0.1.0/src/weaving.py +157 -0
loom_gpt-0.1.0/tests/test_constellation.py +16 -0
loom_gpt-0.1.0/tests/test_data_prep.py +43 -0
loom_gpt-0.1.0/tests/test_tokenizer.py +20 -0
loom_gpt-0.1.0/tests/test_training.py +36 -0
loom_gpt-0.1.0/tests/test_weaving.py +55 -0

loom_gpt-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,598 @@
+Metadata-Version: 2.4
+Name: loom-gpt
+Version: 0.1.0
+Summary: A local toolkit for training tiny GPT models on your own data.
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: torch
+Requires-Dist: numpy
+Requires-Dist: matplotlib
+# LOOM-GPT
+Train small specialist transformers locally. Weave their outputs together. Inspect which specialist shaped each generated token.
+LOOM-GPT is a local transformer laboratory for students, developers, writers, and researchers who want to understand and experiment with GPT-style models from the inside.
+It started as a from-scratch PyTorch implementation inspired by Andrej Karpathy's "Let's build GPT" tutorial. It is now becoming **LOOM Studio**: a framework where users can prepare their own datasets, train compact specialist models, and blend those specialists during generation.
+LOOM-GPT is not a ChatGPT replacement. It does not use a giant pretrained model. Instead, it gives you a readable, hackable, local system for training tiny domain-specific transformers and studying how they behave.
+## What This Project Does
+LOOM-GPT lets you:
+- Prepare a dataset from your own files and folders.
+- Train a small GPT-style transformer from scratch.
+- Save reusable checkpoints with model configuration included.
+- Track training and validation loss in `history.csv`.
+- Stop training early when validation loss stops improving.
+- Generate text from one trained specialist.
+- Train multiple specialists on different datasets.
+- Weave specialists by blending their next-token predictions.
+- Export a JSON trace showing which specialist most influenced each generated token.
+- Open the Neural Constellation interface to visualize specialists, influence, threads, and token traces.
+The core workflow looks like this:
+```text
+Your files
+  -> dataset preparation
+  -> byte tokenization
+  -> specialist training
+  -> checkpoint
+  -> generation
+  -> optional Model Weaving
+```
+## Who Can Use It?
+LOOM-GPT is useful for:
+- **Students** learning how GPT models work without hiding everything behind an API.
+- **Developers** experimenting with small domain-specific text models.
+- **Writers** training tiny style models on different genres or voices.
+- **Researchers** testing interpretable model composition ideas.
+- **Educators** demonstrating tokenization, attention, overfitting, validation loss, and sampling.
+Example user stories:
+- A student trains one specialist on poetry and another on technical documentation, then blends the two to see how generation changes.
+- A developer trains a tiny model on internal notes or code comments to study local domain language.
+- A researcher compares one mixed-data model against several woven specialist models.
+- A teacher uses the training logs to show why validation loss matters more than training loss.
+## Key Features
+### Custom Dataset Preparation
+Point LOOM at a file or folder:
+```bash
+loom dataset add ./my-notes --name notes
+```
+LOOM combines supported files into:
+```text
+data/loom/notes/
+  input.txt
+  manifest.json
+```
+Supported file types include:
+- `.txt`
+- `.md`
+- `.jsonl`
+- `.csv`
+- Common code files such as `.py`, `.js`, `.ts`, `.java`, `.rs`, `.go`, `.html`, `.css`, `.sql`, `.yaml`
+Each source file is wrapped with a boundary marker:
+```text
+<loom:file path="docs/example.md">
+file contents
+</loom:file>
+```
+That keeps file context visible to the model and to future experiments.
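The wrapping step described above can be sketched in a few lines. This is an illustrative sketch only, not the code in `src/data_prep.py`: the names `wrap_file`, `combine_folder`, and `SUPPORTED` are hypothetical, and the extension set is a small subset of the list above.

```python
from pathlib import Path

# Hypothetical subset of the supported extensions listed above.
SUPPORTED = {".txt", ".md", ".jsonl", ".csv", ".py"}


def wrap_file(rel_path: str, text: str) -> str:
    """Surround one file's text with a boundary marker like the one shown above."""
    return f'<loom:file path="{rel_path}">\n{text}\n</loom:file>\n'


def combine_folder(root: Path) -> str:
    """Concatenate every supported file under root, visiting paths in sorted order."""
    parts = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            rel = path.relative_to(root).as_posix()
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(wrap_file(rel, text))
    return "".join(parts)
```

Sorting the paths keeps the combined output deterministic, so the same folder always yields the same combined text.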
+### Local Transformer Training
+Train a small decoder-only GPT model:
+```bash
+loom train --data data/loom/notes/input.txt --out out/notes --preset tiny
+```
+Longer training with early stopping:
+```bash
+loom train \
+  --data data/loom/notes/input.txt \
+  --out out/notes \
+  --preset laptop \
+  --max-iters 5000 \
+  --early-stopping 8 \
+  --seed 42
+```
+Training creates:
+```text
+out/notes/
+  best_model.pt
+  final_model.pt
+  history.csv
+```
+Use `best_model.pt` for generation: it is the checkpoint saved at the lowest validation loss.
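The relationship between early stopping and `best_model.pt` can be illustrated with a small patience counter. This sketch assumes that the value passed to `--early-stopping` is a patience, meaning the number of consecutive evaluations without improvement that are tolerated. The class name `EarlyStopping` and its methods are hypothetical, not the API in `src/training.py`.

```python
class EarlyStopping:
    """Track the best validation loss and count evaluations without improvement."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_evals = 0

    def update(self, val_loss: float) -> tuple[bool, bool]:
        """Return (improved, should_stop) for one validation result."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_evals = 0
            return True, False  # a new best: this is when a best checkpoint would be saved
        self.bad_evals += 1
        return False, self.bad_evals >= self.patience


stopper = EarlyStopping(patience=2)
for loss in [2.0, 1.5, 1.6, 1.7]:
    improved, should_stop = stopper.update(loss)
    if should_stop:
        break
```

In this run the best loss stays at 1.5 and training stops after the second evaluation that fails to beat it.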
+### Training Presets
+| Preset | Use case | Layers | Heads | Embedding size |
+| --- | --- | ---: | ---: | ---: |
+| `tiny` | Quick smoke tests | 2 | 2 | 64 |
+| `laptop` | Normal local experiments | 4 | 4 | 128 |
+| `single_gpu` | Longer GPU runs | 6 | 6 | 384 |
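The preset table can be expressed as plain data, as in the sketch below. The field names `n_layer`, `n_head`, and `n_embd` are assumptions borrowed from common GPT implementations and may not match `config.py`. The per-head size follows from the usual multi-head convention of splitting the embedding evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    n_layer: int  # number of transformer blocks
    n_head: int   # attention heads per block
    n_embd: int   # embedding size

    @property
    def head_size(self) -> int:
        """Width of each attention head when the embedding is split evenly."""
        if self.n_embd % self.n_head != 0:
            raise ValueError("embedding size must be divisible by the number of heads")
        return self.n_embd // self.n_head


# Values copied from the table above.
PRESETS = {
    "tiny": Preset(n_layer=2, n_head=2, n_embd=64),
    "laptop": Preset(n_layer=4, n_head=4, n_embd=128),
    "single_gpu": Preset(n_layer=6, n_head=6, n_embd=384),
}
```

Under that even-split assumption, `tiny` and `laptop` both use 32-wide heads and `single_gpu` uses 64-wide heads.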
+### Byte Tokenization
+LOOM uses UTF-8 byte tokenization by default:
+```text
+text -> bytes -> token IDs from 0 to 255
+```
+Because any UTF-8 text reduces to bytes, the same training pipeline can handle English, multilingual text, code, and mixed folders, and the vocabulary stays fixed at 256 tokens.
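A byte tokenizer of this kind needs only the standard library. The sketch below shows the idea; the function names `encode` and `decode` are illustrative and are not taken from `src/tokenizer.py`.

```python
def encode(text: str) -> list[int]:
    """Map text to token IDs in the range 0..255 via its UTF-8 bytes."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Map token IDs back to text; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode("héllo")
assert decode(ids) == "héllo"
assert len(ids) == 6  # "é" takes two bytes in UTF-8
```

One consequence worth noting: a generated sequence can end partway through a multi-byte character, which is why the decoder here replaces invalid bytes instead of raising.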
+The original character tokenizer is still available for educational experiments:
+```bash
+loom train --data data/input.txt --tokenizer char
+```
+### Generation
+Generate from a single trained specialist:
+```bash
+loom generate \
+  --checkpoint out/notes/best_model.pt \
+  --prompt "Today I learned that " \
+  --preset precise \
+  --tokens 250
+```
+Generation presets:
+| Preset | Temperature | Top-k | Behavior |
+| --- | ---: | ---: | --- |
+| `precise` | 0.5 | 15 | More conservative |
+| `balanced` | 0.8 | 40 | Default |
+| `creative` | 1.0 | 80 | More varied |
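To make the two knobs concrete, here is a minimal sketch of temperature scaling followed by top-k sampling over a plain list of logits. It is a generic illustration of the technique, not LOOM's generation code, and the function name `sample_top_k` is hypothetical.

```python
import math
import random


def sample_top_k(logits: list[float], temperature: float, top_k: int,
                 rng: random.Random) -> int:
    """Sample one token ID from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    k = max(1, min(top_k, len(scaled)))
    # Keep only the k highest-scoring token IDs.
    top_ids = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax over the survivors; subtracting the peak avoids overflow.
    peak = max(scaled[i] for i in top_ids)
    weights = [math.exp(scaled[i] - peak) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


rng = random.Random(42)
token_id = sample_top_k([0.1, 2.0, -1.0, 1.5], temperature=0.5, top_k=15, rng=rng)
```

A lower temperature sharpens the distribution and a smaller top-k removes unlikely tokens entirely, which is why `precise` behaves more conservatively than `creative`.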
+Manual override:
+```bash
+loom generate \
+  --checkpoint out/notes/best_model.pt \
+  --prompt "Artificial intelligence can " \
+  --temperature 0.6 \
+  --top-k 20
+```
+## Model Weaving
+Model Weaving is LOOM-GPT's signature feature.
+Instead of training one model on everything, you train separate specialists:
+```text
+poetry specialist
+technology specialist
+philosophy specialist
+```
+During generation, LOOM asks each specialist for its next-token prediction, blends their logits using your weights, samples one token, and repeats.
+```text
+Prompt
+  -> poetry logits
+  -> technology logits
+  -> philosophy logits
+  -> weighted blend
+  -> sampled token
+  -> influence trace
+```
+Simple example:
+```text
+poetry      70%
+technology  30%
+Prompt: "The city at night"
+```
+LOOM blends the specialists like this:
+```python
+woven_logits = 0.7 * poetry_logits + 0.3 * technology_logits
+```
+The result is not just one model generating text. It is several small models contributing to the next token.
+### Weaving Command
+```bash
+loom weave \
+  --model poetry=out/poetry/best_model.pt \
+  --model technology=out/technology/best_model.pt \
+  --weight poetry=0.7 \
+  --weight technology=0.3 \
+  --prompt "The city at night" \
+  --tokens 300 \
+  --preset balanced \
+  --trace-out out/weaving/city-trace.json
+```
+If no weights are provided, LOOM gives all specialists equal weight.
+```bash
+loom weave \
+  --model poetry=out/poetry/best_model.pt \
+  --model technology=out/technology/best_model.pt \
+  --prompt "The city at night"
+```
+### Influence Trace
+When you pass `--trace-out`, LOOM writes a JSON file like:
+```json
+[
+  {
+    "token_id": 84,
+    "specialist": "poetry",
+    "contributions": {
+      "poetry": 0.72,
+      "technology": 0.28
+    }
+  }
+]
+```
+Each item tells you:
+- The generated token ID.
+- Which specialist had the strongest contribution.
+- Each specialist's normalized contribution for that token.
+This trace is the foundation for the future dashboard visualization where generated tokens can be colored by specialist influence.
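A trace in this shape is easy to summarize offline. The sketch below assumes the JSON layout shown above and assumes that `token_id` values are raw bytes, which matches the byte tokenizer. The function name `summarize_trace` is hypothetical.

```python
import json
from collections import Counter


def summarize_trace(trace: list[dict]) -> tuple[Counter, dict[str, float], str]:
    """Return (dominant-specialist counts, mean contribution per specialist, decoded text)."""
    dominant = Counter(item["specialist"] for item in trace)
    totals: dict[str, float] = {}
    for item in trace:
        for name, share in item["contributions"].items():
            totals[name] = totals.get(name, 0.0) + share
    mean = {name: total / len(trace) for name, total in totals.items()} if trace else {}
    # Assumption: token IDs are UTF-8 bytes, so the text can be rebuilt directly.
    text = bytes(item["token_id"] for item in trace).decode("utf-8", errors="replace")
    return dominant, mean, text


sample = json.loads("""
[{"token_id": 84, "specialist": "poetry",
  "contributions": {"poetry": 0.72, "technology": 0.28}}]
""")
dominant, mean, text = summarize_trace(sample)
```

For a real run, the same function could be fed `json.load(open("out/weaving/city-trace.json"))` once that file has been written by `loom weave --trace-out`.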
+## Neural Constellation Interface
+LOOM-GPT includes a cinematic local interface called **The Neural Constellation**.
+It is not a standard chatbot and not a business dashboard. It is a visual explanation of Model Weaving:
+```text
+specialist stars
+  -> gravitational influence
+  -> energy streams
+  -> LOOM CORE
+  -> woven threads
+  -> generated tokens
+  -> clickable token trace
+```
+Run it locally:
+```bash
+loom constellation
+```
+Or choose a port:
+```bash
+loom constellation --port 8765
+```
+What you can do inside the interface:
+- Drag specialist stars closer to the LOOM CORE to increase influence.
+- Watch energy streams grow brighter and thicker as influence increases.
+- Enter a prompt and awaken the constellation.
+- See tokens form one by one from the Neural Weave.
+- Click generated tokens to inspect specialist contribution.
+- Load a real JSON trace exported by `loom weave --trace-out`.
+The current interface ships with sample trace data so visitors can understand the concept immediately, even before training their own specialists.
+### Current Weaving Constraints
+For now:
+- Specialists must use the default `byte` tokenizer.
+- Specialists must have the same architecture.
+- Legacy character-tokenizer checkpoints cannot be woven.
+- Weaving works best when specialists were trained with the same preset.
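These constraints could be checked before weaving with something like the sketch below. It assumes each checkpoint exposes a configuration dictionary. The keys `tokenizer`, `n_layer`, `n_head`, and `n_embd` are hypothetical stand-ins for whatever LOOM stores in its checkpoints.

```python
# Hypothetical config field names that define the architecture.
ARCH_KEYS = ("n_layer", "n_head", "n_embd")


def check_weavable(configs: dict[str, dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks weavable."""
    names = list(configs)
    if not names:
        return ["no specialists given"]
    problems = []
    for name in names:
        tokenizer = configs[name].get("tokenizer")
        if tokenizer != "byte":
            problems.append(f"{name}: tokenizer is {tokenizer!r}, expected 'byte'")
    reference = names[0]
    for name in names[1:]:
        for key in ARCH_KEYS:
            if configs[name].get(key) != configs[reference].get(key):
                problems.append(f"{name}: {key} differs from {reference}")
    return problems


laptop = {"tokenizer": "byte", "n_layer": 4, "n_head": 4, "n_embd": 128}
legacy = {"tokenizer": "char", "n_layer": 4, "n_head": 4, "n_embd": 128}
print(check_weavable({"poetry": laptop, "technology": legacy}))
```

Returning a list of problems rather than raising on the first one lets a caller report every mismatch at once.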
+Recommended specialist training:
+```bash
+loom train --data data/loom/poetry/input.txt --out out/poetry --preset laptop --early-stopping 8
+loom train --data data/loom/technology/input.txt --out out/technology --preset laptop --early-stopping 8
+loom train --data data/loom/philosophy/input.txt --out out/philosophy --preset laptop --early-stopping 8
+```
+Then weave:
+```bash
+loom weave \
+  --model poetry=out/poetry/best_model.pt \
+  --model technology=out/technology/best_model.pt \
+  --model philosophy=out/philosophy/best_model.pt \
+  --weight poetry=0.5 \
+  --weight technology=0.3 \
+  --weight philosophy=0.2 \
+  --prompt "The future belongs to "
+```
+## Complete Example Use Case
+Imagine a student wants to explore how style changes when technical writing and poetry are blended.
+Create two folders:
+```text
+demo-data/
+  poetry/
+    poems.txt
+  technology/
+    ai-notes.md
+    software-docs.txt
+```
+Prepare datasets:
+```bash
+loom dataset add ./demo-data/poetry --name poetry
+loom dataset add ./demo-data/technology --name technology
+```
+Train specialists:
+```bash
+loom train --data data/loom/poetry/input.txt --out out/poetry --preset laptop --early-stopping 8
+loom train --data data/loom/technology/input.txt --out out/technology --preset laptop --early-stopping 8
+```
+Generate from each specialist separately:
+```bash
+loom generate --checkpoint out/poetry/best_model.pt --prompt "The city at night" --preset precise
+loom generate --checkpoint out/technology/best_model.pt --prompt "The city at night" --preset precise
+```
+Now weave them:
+```bash
+loom weave \
+  --model poetry=out/poetry/best_model.pt \
+  --model technology=out/technology/best_model.pt \
+  --weight poetry=0.8 \
+  --weight technology=0.2 \
+  --prompt "The city at night" \
+  --trace-out out/weaving/poetic-city.json
+```
+Then flip the weights:
+```bash
+loom weave \
+  --model poetry=out/poetry/best_model.pt \
+  --model technology=out/technology/best_model.pt \
+  --weight poetry=0.2 \
+  --weight technology=0.8 \
+  --prompt "The city at night" \
+  --trace-out out/weaving/technical-city.json
+```
+The user can compare:
+- Poetry-only output
+- Technology-only output
+- Mostly-poetry woven output
+- Mostly-technology woven output
+- Token influence traces
+That is the main product idea: train local specialists, control their blend, and inspect how the blend shapes generation.
+## Installation
+```bash
+git clone https://github.com/Karthik-Unni/Loom-gpt.git
+cd Loom-gpt
+python -m venv .venv
+# Windows (on macOS/Linux, run `source .venv/bin/activate` instead)
+.venv\Scripts\activate
+pip install -e .
+```
+If PowerShell blocks activation:
+```powershell
+Set-ExecutionPolicy -Scope Process Bypass
+.venv\Scripts\Activate.ps1
+```
+## Commands
+Prepare a dataset:
+```bash
+loom dataset add ./my-notes --name notes
+loom dataset inspect notes
+```
+Train:
+```bash
+loom train --data data/loom/notes/input.txt --out out/notes --preset laptop
+```
+Resume:
+```bash
+loom train \
+  --data data/loom/notes/input.txt \
+  --out out/notes \
+  --preset laptop \
+  --resume out/notes/final_model.pt
+```
+Generate:
+```bash
+loom generate --checkpoint out/notes/best_model.pt --prompt "Today I learned"
+```
+Weave:
+```bash
+loom weave \
+  --model a=out/a/best_model.pt \
+  --model b=out/b/best_model.pt \
+  --weight a=0.6 \
+  --weight b=0.4 \
+  --prompt "Once upon a system"
+```
+## Architecture
+The model is a small decoder-only transformer built from scratch in PyTorch:
+```text
+tokens
+  -> token embeddings
+  -> position embeddings
+  -> causal multi-head self-attention
+  -> feed-forward layers
+  -> layer normalization
+  -> next-token logits
+```
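The "causal" part of the attention step means each position may attend only to itself and earlier positions. The sketch below shows that masking on a plain matrix of attention scores, without PyTorch. It is a generic illustration rather than the code in `src/attention.py`, and the function name is hypothetical.

```python
import math


def causal_attention_weights(scores: list[list[float]]) -> list[list[float]]:
    """Row-wise softmax where position t only sees positions 0..t.

    scores[t][s] is the raw score of query position t against key position s.
    """
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]  # drop future positions
        peak = max(visible)     # subtract the peak for numerical stability
        exps = [math.exp(x - peak) for x in visible]
        total = sum(exps)
        future = [0.0] * (len(row) - t - 1)  # future positions get zero weight
        weights.append([e / total for e in exps] + future)
    return weights


uniform = causal_attention_weights([[0.0] * 3 for _ in range(3)])
# With equal scores, row t spreads its weight evenly over its t + 1 visible positions.
```

Tensor implementations usually reach the same result by filling the upper triangle of the score matrix with negative infinity before the softmax.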
+Important files:
+```text
+loom.py              Main CLI wrapper
+train.py             Training entry point
+generate.py          Single-checkpoint generation
+weave.py             Multi-specialist weaving entry point
+config.py            Model presets
+src/model.py         GPT model
+src/attention.py     Causal self-attention
+src/tokenizer.py     Byte and character tokenizers
+src/data_prep.py     Dataset ingestion
+src/training.py      Early stopping, history, generation presets
+src/weaving.py       Weighted Model Weaving
+tests/               Unit tests
+```
+## What LOOM-GPT Is Good At
+- Learning transformer internals.
+- Running small local experiments.
+- Comparing datasets and specialists.
+- Demonstrating overfitting and validation loss.
+- Exploring controllable generation through weighted specialists.
+- Creating a portfolio project with a clear research-style idea.
+## What LOOM-GPT Is Not
+- It is not ChatGPT.
+- It is not a factual assistant.
+- It is not trained on internet-scale data.
+- It will not produce polished text from tiny datasets.
+- It does not yet have a full dashboard.
+Small models trained from scratch need clean data and patience. The goal is experimentation and interpretability, not production-grade language understanding.
+## Recommended Data Size
+For experiments:
+```text
+100,000+ characters: basic behavior
+500,000+ characters: better small-model experiments
+2,000,000+ characters: noticeably stronger local style learning
+```
+Use clean, consistent data. Remove broken HTML, duplicated lines, unrelated text, and noisy formatting when possible.
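A prepared dataset can be compared against these thresholds with a short helper. The tiers are copied from the list above; the function name `size_tier` and the label for undersized data are hypothetical.

```python
# (minimum characters, description) pairs from the list above, largest first.
TIERS = [
    (2_000_000, "noticeably stronger local style learning"),
    (500_000, "better small-model experiments"),
    (100_000, "basic behavior"),
]


def size_tier(n_chars: int) -> str:
    """Return the description of the highest tier that n_chars reaches."""
    for minimum, label in TIERS:
        if n_chars >= minimum:
            return label
    return "below the smallest suggested size"


print(size_tier(650_000))
```

To check a real dataset, pass the character count of the prepared file, for example `len(open("data/loom/notes/input.txt", encoding="utf-8").read())`.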
+## Roadmap
+Completed:
+- Custom dataset preparation
+- Byte tokenizer
+- GPT training from scratch
+- Early stopping
+- Training history CSV
+- Generation presets
+- Weighted Model Weaving CLI
+- Token influence trace export
+Next:
+- Streamlit dashboard
+- Loss charts
+- Specialist sliders
+- Colored token influence visualization
+- BPE tokenizer experiments
+- Research evaluation suite
+Future dashboard concept:
+```text
+Datasets -> Train -> Generate -> Weave -> Metrics
+```
+The long-term vision is a local LOOM Studio interface where users train specialists, move sliders, generate text, and see which specialist influenced each token.
+## Development Workflow
+Run tests:
+```bash
+python -m unittest discover -s tests -v
+```
+Compile check:
+```bash
+python -m compileall -q loom.py train.py generate.py weave.py src tests
+```
+Before pushing:
+```bash
+git status
+git diff --stat
+```
+Do not commit:
+- `.venv/`
+- `out/`
+- `data/loom/`
+- personal datasets
+- `.pt` checkpoints
+These are ignored by default.
+## License
+Add a license before using this as a public release project.