loom-gpt 0.1.0__tar.gz

@@ -0,0 +1,598 @@
1
+ Metadata-Version: 2.4
2
+ Name: loom-gpt
3
+ Version: 0.1.0
4
+ Summary: A local toolkit for training tiny GPT models on your own data.
5
+ Requires-Python: >=3.10
6
+ Description-Content-Type: text/markdown
7
+ Requires-Dist: torch
8
+ Requires-Dist: numpy
9
+ Requires-Dist: matplotlib
10
+
11
+ # LOOM-GPT
12
+
13
+ Train small specialist transformers locally. Weave their outputs together. Inspect which specialist shaped each generated token.
14
+
15
+ LOOM-GPT is a local transformer laboratory for students, developers, writers, and researchers who want to understand and experiment with GPT-style models from the inside.
16
+
17
+ It started as a from-scratch PyTorch implementation inspired by Andrej Karpathy's "Let's build GPT" tutorial. It is now becoming **LOOM Studio**: a framework where users can prepare their own datasets, train compact specialist models, and blend those specialists during generation.
18
+
19
+ LOOM-GPT is not a ChatGPT replacement. It does not use a giant pretrained model. Instead, it gives you a readable, hackable, local system for training tiny domain-specific transformers and studying how they behave.
20
+
21
+ ## What This Project Does
22
+
23
+ LOOM-GPT lets you:
24
+
25
+ - Prepare a dataset from your own files and folders.
26
+ - Train a small GPT-style transformer from scratch.
27
+ - Save reusable checkpoints with model configuration included.
28
+ - Track training and validation loss in `history.csv`.
29
+ - Stop training early when validation loss stops improving.
30
+ - Generate text from one trained specialist.
31
+ - Train multiple specialists on different datasets.
32
+ - Weave specialists by blending their next-token predictions.
33
+ - Export a JSON trace showing which specialist most influenced each generated token.
34
+ - Open the Neural Constellation interface to visualize specialists, influence, threads, and token traces.
35
+
36
+ The core workflow looks like this:
37
+
38
+ ```text
39
+ Your files
40
+ -> dataset preparation
41
+ -> byte tokenization
42
+ -> specialist training
43
+ -> checkpoint
44
+ -> generation
45
+ -> optional Model Weaving
46
+ ```
47
+
48
+ ## Who Can Use It?
49
+
50
+ LOOM-GPT is useful for:
51
+
52
+ - **Students** learning how GPT models work without hiding everything behind an API.
53
+ - **Developers** experimenting with small domain-specific text models.
54
+ - **Writers** training tiny style models on different genres or voices.
55
+ - **Researchers** testing interpretable model composition ideas.
56
+ - **Educators** demonstrating tokenization, attention, overfitting, validation loss, and sampling.
57
+
58
+ Example user stories:
59
+
60
+ - A student trains one specialist on poetry and another on technical documentation, then blends the two to see how generation changes.
61
+ - A developer trains a tiny model on internal notes or code comments to study local domain language.
62
+ - A researcher compares one mixed-data model against several woven specialist models.
63
+ - A teacher uses the training logs to show why validation loss matters more than training loss.
64
+
65
+ ## Key Features
66
+
67
+ ### Custom Dataset Preparation
68
+
69
+ Point LOOM at a file or folder:
70
+
71
+ ```bash
72
+ loom dataset add ./my-notes --name notes
73
+ ```
74
+
75
+ LOOM combines supported files into:
76
+
77
+ ```text
78
+ data/loom/notes/
79
+ input.txt
80
+ manifest.json
81
+ ```
82
+
83
+ Supported file types include:
84
+
85
+ - `.txt`
86
+ - `.md`
87
+ - `.jsonl`
88
+ - `.csv`
89
+ - Common code files such as `.py`, `.js`, `.ts`, `.java`, `.rs`, `.go`, `.html`, `.css`, `.sql`, `.yaml`
90
+
91
+ Each source file is wrapped with a boundary marker:
92
+
93
+ ```text
94
+ <loom:file path="docs/example.md">
95
+ file contents
96
+ </loom:file>
97
+ ```
98
+
99
+ That keeps file context visible to the model and to future experiments.
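The wrapping step described above can be sketched in a few lines. This is a minimal illustration, not the code in `src/data_prep.py`: the function names `wrap_file` and `combine_folder`, the sorted traversal order, and the shortened extension set are all assumptions.

```python
from pathlib import Path

# Hypothetical subset of the supported extensions listed above.
SUPPORTED = {".txt", ".md", ".jsonl", ".csv", ".py"}

def wrap_file(rel_path: str, contents: str) -> str:
    """Wrap one file's text in a <loom:file> boundary marker."""
    return f'<loom:file path="{rel_path}">\n{contents}\n</loom:file>\n'

def combine_folder(root: Path) -> str:
    """Concatenate every supported file under root, in sorted path order."""
    parts = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            rel = path.relative_to(root).as_posix()
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(wrap_file(rel, text))
    return "".join(parts)
```

Calling `combine_folder(Path("./my-notes"))` would return one string that could be written out as `input.txt`; unsupported files are skipped.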
100
+
101
+ ### Local Transformer Training
102
+
103
+ Train a small decoder-only GPT model:
104
+
105
+ ```bash
106
+ loom train --data data/loom/notes/input.txt --out out/notes --preset tiny
107
+ ```
108
+
109
+ Longer training with early stopping:
110
+
111
+ ```bash
112
+ loom train \
113
+ --data data/loom/notes/input.txt \
114
+ --out out/notes \
115
+ --preset laptop \
116
+ --max-iters 5000 \
117
+ --early-stopping 8 \
118
+ --seed 42
119
+ ```
120
+
121
+ Training creates:
122
+
123
+ ```text
124
+ out/notes/
125
+ best_model.pt
126
+ final_model.pt
127
+ history.csv
128
+ ```
129
+
130
+ Use `best_model.pt` for generation because it stores the checkpoint with the lowest validation loss.
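Early stopping and the best-checkpoint rule can be pictured with a small patience counter. The sketch below assumes that `--early-stopping 8` means "stop after 8 validation checks without improvement"; the README does not spell this out, and the class name `EarlyStopper` is hypothetical.

```python
class EarlyStopper:
    """Hypothetical helper: tracks the best validation loss and a patience counter."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss: float) -> tuple[bool, bool]:
        """Return (is_new_best, should_stop) after one validation pass."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
            return True, False  # a trainer would save best_model.pt here
        self.bad_evals += 1
        return False, self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
for loss in [2.1, 1.8, 1.9, 1.85, 1.7]:
    is_best, stop = stopper.update(loss)
    if stop:
        break  # the final 1.7 is never evaluated
```

In this run the loop stops after two non-improving checks, so the best recorded loss stays at the second value. That is the trade-off of a small patience: a later improvement can be missed.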
131
+
132
+ ### Training Presets
133
+
134
+ | Preset | Use case | Layers | Heads | Embedding size |
135
+ | --- | --- | ---: | ---: | ---: |
136
+ | `tiny` | Quick smoke tests | 2 | 2 | 64 |
137
+ | `laptop` | Normal local experiments | 4 | 4 | 128 |
138
+ | `single_gpu` | Longer GPU runs | 6 | 6 | 384 |
139
+
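To get a feel for how the presets differ in size, the table can be turned into a small config object. The field names (`n_layer`, `n_head`, `n_embd`) are assumptions rather than the names in `config.py`. The parameter estimate assumes a feed-forward layer four times the embedding width, as in the tutorial the project is based on, and ignores embeddings, biases, and layer norms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Preset:
    n_layer: int
    n_head: int
    n_embd: int

    @property
    def head_size(self) -> int:
        # Each attention head works on an equal slice of the embedding.
        return self.n_embd // self.n_head

    def approx_block_params(self) -> int:
        # 4*d^2 for the attention projections plus 8*d^2 for a 4x-wide feed-forward layer.
        return 12 * self.n_layer * self.n_embd ** 2

# Values copied from the table above.
PRESETS = {
    "tiny": Preset(2, 2, 64),
    "laptop": Preset(4, 4, 128),
    "single_gpu": Preset(6, 6, 384),
}

for name, preset in PRESETS.items():
    print(name, preset.head_size, preset.approx_block_params())
```

Under these assumptions the `single_gpu` preset is roughly two orders of magnitude larger than `tiny`, which is why it is reserved for longer GPU runs.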
140
+ ### Byte Tokenization
141
+
142
+ LOOM uses UTF-8 byte tokenization by default:
143
+
144
+ ```text
145
+ text -> bytes -> token IDs from 0 to 255
146
+ ```
147
+
148
+ This means the same training pipeline can handle English, multilingual text, code, and mixed folders.
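Byte tokenization needs no learned vocabulary, which is why it works on any input. A minimal sketch, with hypothetical function names `encode` and `decode` (the real tokenizer lives in `src/tokenizer.py` and may differ):

```python
def encode(text: str) -> list[int]:
    """Turn text into UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Turn byte values back into text, replacing incomplete sequences."""
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("héllo")
print(ids)          # [104, 195, 169, 108, 108, 111]
print(decode(ids))  # héllo
```

Note that `é` becomes two tokens. A model sampling byte by byte can emit an incomplete multi-byte sequence, which is why the sketch decodes with `errors="replace"`.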
149
+
150
+ The original character tokenizer is still available for educational experiments:
151
+
152
+ ```bash
153
+ loom train --data data/input.txt --tokenizer char
154
+ ```
155
+
156
+ ### Generation
157
+
158
+ Generate from a single trained specialist:
159
+
160
+ ```bash
161
+ loom generate \
162
+ --checkpoint out/notes/best_model.pt \
163
+ --prompt "Today I learned that " \
164
+ --preset precise \
165
+ --tokens 250
166
+ ```
167
+
168
+ Generation presets:
169
+
170
+ | Preset | Temperature | Top-k | Behavior |
171
+ | --- | ---: | ---: | --- |
172
+ | `precise` | 0.5 | 15 | More conservative |
173
+ | `balanced` | 0.8 | 40 | Default |
174
+ | `creative` | 1.0 | 80 | More varied |
175
+
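The two preset columns map onto a standard sampling recipe: divide the logits by the temperature, keep only the `top_k` largest, and sample from the softmax of what remains. The sketch below uses only the standard library so it runs without PyTorch; `sample_top_k` is a hypothetical name, and LOOM's own sampler may order or implement these steps differently.

```python
import math
import random

# Values copied from the table above: (temperature, top_k).
PRESETS = {
    "precise": (0.5, 15),
    "balanced": (0.8, 40),
    "creative": (1.0, 80),
}

def sample_top_k(logits: list[float], temperature: float, top_k: int,
                 rng: random.Random) -> int:
    """Sample one token ID from temperature-scaled logits limited to the top_k largest."""
    k = min(top_k, len(logits))
    # Indices of the k highest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in keep]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(keep, weights=weights, k=1)[0]

temperature, top_k = PRESETS["precise"]
token_id = sample_top_k([0.1, 2.0, -1.0, 0.5], temperature, top_k, random.Random(0))
```

Lower temperature sharpens the distribution toward the highest logit, and a smaller `top_k` removes unlikely tokens entirely, which is why `precise` reads as more conservative.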
176
+ Manual override:
177
+
178
+ ```bash
179
+ loom generate \
180
+ --checkpoint out/notes/best_model.pt \
181
+ --prompt "Artificial intelligence can " \
182
+ --temperature 0.6 \
183
+ --top-k 20
184
+ ```
185
+
186
+ ## Model Weaving
187
+
188
+ Model Weaving is LOOM-GPT's signature feature.
189
+
190
+ Instead of training one model on everything, you train separate specialists:
191
+
192
+ ```text
193
+ poetry specialist
194
+ technology specialist
195
+ philosophy specialist
196
+ ```
197
+
198
+ During generation, LOOM asks each specialist for its next-token prediction, blends their logits using your weights, samples one token, and repeats.
199
+
200
+ ```text
201
+ Prompt
202
+ -> poetry logits
203
+ -> technology logits
204
+ -> philosophy logits
205
+ -> weighted blend
206
+ -> sampled token
207
+ -> influence trace
208
+ ```
209
+
210
+ Simple example:
211
+
212
+ ```text
213
+ poetry 70%
214
+ technology 30%
215
+
216
+ Prompt: "The city at night"
217
+ ```
218
+
219
+ LOOM blends the specialists like this:
220
+
221
+ ```python
222
+ woven_logits = 0.7 * poetry_logits + 0.3 * technology_logits
223
+ ```
224
+
225
+ The result is not just one model generating text. It is several small models contributing to the next token.
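The blend generalizes to any number of specialists as a weighted sum over a shared vocabulary. A minimal sketch with plain lists standing in for tensors; `weave_logits` is a hypothetical name, and the three-entry logit vectors are made up for illustration (real byte-level logits would have 256 entries).

```python
def weave_logits(logits_by_name: dict[str, list[float]],
                 weights: dict[str, float]) -> list[float]:
    """Weighted sum of per-specialist logits over one shared vocabulary."""
    names = list(logits_by_name)
    vocab_size = len(logits_by_name[names[0]])
    woven = [0.0] * vocab_size
    for name in names:
        logits = logits_by_name[name]
        if len(logits) != vocab_size:
            raise ValueError("all specialists must share one vocabulary")
        for i, value in enumerate(logits):
            woven[i] += weights[name] * value
    return woven

woven = weave_logits(
    {"poetry": [2.0, 0.0, 1.0], "technology": [0.0, 2.0, 1.0]},
    {"poetry": 0.7, "technology": 0.3},
)
```

The woven logits would then go through the same temperature and top-k sampling as single-model generation.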
226
+
227
+ ### Weaving Command
228
+
229
+ ```bash
230
+ loom weave \
231
+ --model poetry=out/poetry/best_model.pt \
232
+ --model technology=out/technology/best_model.pt \
233
+ --weight poetry=0.7 \
234
+ --weight technology=0.3 \
235
+ --prompt "The city at night" \
236
+ --tokens 300 \
237
+ --preset balanced \
238
+ --trace-out out/weaving/city-trace.json
239
+ ```
240
+
241
+ If no weights are provided, LOOM gives all specialists equal weight.
242
+
243
+ ```bash
244
+ loom weave \
245
+ --model poetry=out/poetry/best_model.pt \
246
+ --model technology=out/technology/best_model.pt \
247
+ --prompt "The city at night"
248
+ ```
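The equal-weight default can be sketched as a small parsing step over the `name=value` arguments. `resolve_weights` is a hypothetical helper, not LOOM's parser. What LOOM does when only some specialists receive a weight is not documented here, so the sketch simply returns the weights it was given.

```python
def resolve_weights(model_names: list[str], weight_args: list[str]) -> dict[str, float]:
    """Parse name=value strings; fall back to equal weights when none are given."""
    if not weight_args:
        return {name: 1.0 / len(model_names) for name in model_names}
    weights = {}
    for arg in weight_args:
        name, _, value = arg.partition("=")
        if name not in model_names:
            raise ValueError(f"weight given for unknown specialist: {name}")
        weights[name] = float(value)
    return weights

print(resolve_weights(["poetry", "technology"], []))
print(resolve_weights(["poetry", "technology"], ["poetry=0.7", "technology=0.3"]))
```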
249
+
250
+ ### Influence Trace
251
+
252
+ When you pass `--trace-out`, LOOM writes a JSON file like:
253
+
254
+ ```json
255
+ [
256
+ {
257
+ "token_id": 84,
258
+ "specialist": "poetry",
259
+ "contributions": {
260
+ "poetry": 0.72,
261
+ "technology": 0.28
262
+ }
263
+ }
264
+ ]
265
+ ```
266
+
267
+ Each item tells you:
268
+
269
+ - The generated token ID.
270
+ - Which specialist had the strongest contribution.
271
+ - Each specialist's normalized contribution for that token.
272
+
273
+ This trace is the foundation for the future dashboard visualization where generated tokens can be colored by specialist influence.
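Because the trace is plain JSON, it can be summarized with the standard library. The sketch below relies only on the three fields shown above. The second trace item is invented for illustration, and decoding `token_id` values as bytes assumes the default byte tokenizer.

```python
import json
from collections import Counter

def summarize_trace(trace: list[dict]) -> dict:
    """Count dominant specialists and average each specialist's contribution."""
    dominant = Counter(item["specialist"] for item in trace)
    totals: Counter = Counter()
    for item in trace:
        totals.update(item["contributions"])  # adds the float values per specialist
    steps = max(len(trace), 1)
    return {
        "dominant_counts": dict(dominant),
        "mean_contribution": {name: total / steps for name, total in totals.items()},
    }

sample = json.loads("""[
  {"token_id": 84, "specialist": "poetry",
   "contributions": {"poetry": 0.72, "technology": 0.28}},
  {"token_id": 104, "specialist": "technology",
   "contributions": {"poetry": 0.4, "technology": 0.6}}
]""")
summary = summarize_trace(sample)
# With byte tokenization, token IDs are byte values, so the text can be rebuilt.
text = bytes(item["token_id"] for item in sample).decode("utf-8", errors="replace")
```

For a real run, `sample` would instead be loaded from the file passed to `--trace-out`.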
274
+
275
+ ## Neural Constellation Interface
276
+
277
+ LOOM-GPT includes a cinematic local interface called **The Neural Constellation**.
278
+
279
+ It is not a standard chatbot and not a business dashboard. It is a visual explanation of Model Weaving:
280
+
281
+ ```text
282
+ specialist stars
283
+ -> gravitational influence
284
+ -> energy streams
285
+ -> LOOM CORE
286
+ -> woven threads
287
+ -> generated tokens
288
+ -> clickable token trace
289
+ ```
290
+
291
+ Run it locally:
292
+
293
+ ```bash
294
+ loom constellation
295
+ ```
296
+
297
+ Or choose a port:
298
+
299
+ ```bash
300
+ loom constellation --port 8765
301
+ ```
302
+
303
+ What you can do inside the interface:
304
+
305
+ - Drag specialist stars closer to the LOOM CORE to increase influence.
306
+ - Watch energy streams grow brighter and thicker as influence increases.
307
+ - Enter a prompt and awaken the constellation.
308
+ - See tokens form one by one from the Neural Weave.
309
+ - Click generated tokens to inspect specialist contribution.
310
+ - Load a real JSON trace exported by `loom weave --trace-out`.
311
+
312
+ The current interface ships with sample trace data so visitors can understand the concept immediately, even before training their own specialists.
313
+
314
+ ### Current Weaving Constraints
315
+
316
+ For now:
317
+
318
+ - Specialists must use the default `byte` tokenizer.
319
+ - Specialists must have the same architecture.
320
+ - Legacy character-tokenizer checkpoints cannot be woven.
321
+ - Weaving works best when specialists were trained with the same preset.
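These constraints amount to a compatibility check that could run before weaving. The sketch assumes each checkpoint's saved configuration can be read as a dictionary with `tokenizer`, `n_layer`, `n_head`, and `n_embd` keys; those key names and the `check_weavable` function are hypothetical, since the checkpoint layout is not described here.

```python
ARCH_KEYS = ("n_layer", "n_head", "n_embd")  # hypothetical config keys

def check_weavable(configs: dict[str, dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means compatible."""
    problems = []
    names = list(configs)
    reference = configs[names[0]]
    for name in names:
        cfg = configs[name]
        if cfg.get("tokenizer") != "byte":
            problems.append(f"{name}: tokenizer is {cfg.get('tokenizer')!r}, expected 'byte'")
        for key in ARCH_KEYS:
            if cfg.get(key) != reference.get(key):
                problems.append(f"{name}: {key} differs from {names[0]}")
    return problems

laptop = {"tokenizer": "byte", "n_layer": 4, "n_head": 4, "n_embd": 128}
legacy = {"tokenizer": "char", "n_layer": 2, "n_head": 2, "n_embd": 64}
print(check_weavable({"poetry": laptop, "technology": dict(laptop)}))
print(check_weavable({"poetry": laptop, "old": legacy}))
```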
322
+
323
+ Recommended specialist training:
324
+
325
+ ```bash
326
+ loom train --data data/loom/poetry/input.txt --out out/poetry --preset laptop --early-stopping 8
327
+ loom train --data data/loom/technology/input.txt --out out/technology --preset laptop --early-stopping 8
328
+ loom train --data data/loom/philosophy/input.txt --out out/philosophy --preset laptop --early-stopping 8
329
+ ```
330
+
331
+ Then weave:
332
+
333
+ ```bash
334
+ loom weave \
335
+ --model poetry=out/poetry/best_model.pt \
336
+ --model technology=out/technology/best_model.pt \
337
+ --model philosophy=out/philosophy/best_model.pt \
338
+ --weight poetry=0.5 \
339
+ --weight technology=0.3 \
340
+ --weight philosophy=0.2 \
341
+ --prompt "The future belongs to "
342
+ ```
343
+
344
+ ## Complete Example Use Case
345
+
346
+ Imagine a student wants to explore how style changes when technical writing and poetry are blended.
347
+
348
+ Create two folders:
349
+
350
+ ```text
351
+ demo-data/
352
+ poetry/
353
+ poems.txt
354
+ technology/
355
+ ai-notes.md
356
+ software-docs.txt
357
+ ```
358
+
359
+ Prepare datasets:
360
+
361
+ ```bash
362
+ loom dataset add ./demo-data/poetry --name poetry
363
+ loom dataset add ./demo-data/technology --name technology
364
+ ```
365
+
366
+ Train specialists:
367
+
368
+ ```bash
369
+ loom train --data data/loom/poetry/input.txt --out out/poetry --preset laptop --early-stopping 8
370
+ loom train --data data/loom/technology/input.txt --out out/technology --preset laptop --early-stopping 8
371
+ ```
372
+
373
+ Generate from each specialist separately:
374
+
375
+ ```bash
376
+ loom generate --checkpoint out/poetry/best_model.pt --prompt "The city at night" --preset precise
377
+ loom generate --checkpoint out/technology/best_model.pt --prompt "The city at night" --preset precise
378
+ ```
379
+
380
+ Now weave them:
381
+
382
+ ```bash
383
+ loom weave \
384
+ --model poetry=out/poetry/best_model.pt \
385
+ --model technology=out/technology/best_model.pt \
386
+ --weight poetry=0.8 \
387
+ --weight technology=0.2 \
388
+ --prompt "The city at night" \
389
+ --trace-out out/weaving/poetic-city.json
390
+ ```
391
+
392
+ Then flip the weights:
393
+
394
+ ```bash
395
+ loom weave \
396
+ --model poetry=out/poetry/best_model.pt \
397
+ --model technology=out/technology/best_model.pt \
398
+ --weight poetry=0.2 \
399
+ --weight technology=0.8 \
400
+ --prompt "The city at night" \
401
+ --trace-out out/weaving/technical-city.json
402
+ ```
403
+
404
+ The user can compare:
405
+
406
+ - Poetry-only output
407
+ - Technology-only output
408
+ - Mostly-poetry woven output
409
+ - Mostly-technology woven output
410
+ - Token influence traces
411
+
412
+ That is the main product idea: train local specialists, control their blend, and inspect how the blend shapes generation.
413
+
414
+ ## Installation
415
+
416
+ ```bash
417
+ git clone https://github.com/Karthik-Unni/Loom-gpt.git
418
+ cd Loom-gpt
419
+ python -m venv .venv
420
+ .venv\Scripts\activate
421
+ pip install -e .
422
+ ```
423
+
424
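+ The activation command above is for Windows. On macOS or Linux, run `source .venv/bin/activate` instead. If PowerShell blocks activation on Windows: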
425
+
426
+ ```powershell
427
+ Set-ExecutionPolicy -Scope Process Bypass
428
+ .venv\Scripts\Activate.ps1
429
+ ```
430
+
431
+ ## Commands
432
+
433
+ Prepare a dataset:
434
+
435
+ ```bash
436
+ loom dataset add ./my-notes --name notes
437
+ loom dataset inspect notes
438
+ ```
439
+
440
+ Train:
441
+
442
+ ```bash
443
+ loom train --data data/loom/notes/input.txt --out out/notes --preset laptop
444
+ ```
445
+
446
+ Resume:
447
+
448
+ ```bash
449
+ loom train \
450
+ --data data/loom/notes/input.txt \
451
+ --out out/notes \
452
+ --preset laptop \
453
+ --resume out/notes/final_model.pt
454
+ ```
455
+
456
+ Generate:
457
+
458
+ ```bash
459
+ loom generate --checkpoint out/notes/best_model.pt --prompt "Today I learned"
460
+ ```
461
+
462
+ Weave:
463
+
464
+ ```bash
465
+ loom weave \
466
+ --model a=out/a/best_model.pt \
467
+ --model b=out/b/best_model.pt \
468
+ --weight a=0.6 \
469
+ --weight b=0.4 \
470
+ --prompt "Once upon a system"
471
+ ```
472
+
473
+ ## Architecture
474
+
475
+ The model is a small decoder-only transformer built from scratch in PyTorch:
476
+
477
+ ```text
478
+ tokens
479
+ -> token embeddings
480
+ -> position embeddings
481
+ -> causal multi-head self-attention
482
+ -> feed-forward layers
483
+ -> layer normalization
484
+ -> next-token logits
485
+ ```
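The "causal" part of the pipeline means that position `t` may only attend to positions `0..t`. The following single-head sketch uses plain Python lists so it runs without PyTorch. It illustrates the masking idea only and is not the code in `src/attention.py`; the function name `causal_attention` is hypothetical.

```python
import math

def causal_attention(q: list[list[float]], k: list[list[float]],
                     v: list[list[float]]) -> list[list[float]]:
    """Single-head scaled dot-product attention where position t sees only positions <= t."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Score against positions 0..t only; later positions are masked out.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        out.append([sum(p * v[s][j] for s, p in enumerate(probs))
                    for j in range(len(v[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(x, x, x)
```

The first output row equals the first value row because position 0 can only attend to itself. Changing a later token never changes an earlier output, which is what makes next-token training possible.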
486
+
487
+ Important files:
488
+
489
+ ```text
490
+ loom.py Main CLI wrapper
491
+ train.py Training entry point
492
+ generate.py Single-checkpoint generation
493
+ weave.py Multi-specialist weaving entry point
494
+ config.py Model presets
495
+ src/model.py GPT model
496
+ src/attention.py Causal self-attention
497
+ src/tokenizer.py Byte and character tokenizers
498
+ src/data_prep.py Dataset ingestion
499
+ src/training.py Early stopping, history, generation presets
500
+ src/weaving.py Weighted Model Weaving
501
+ tests/ Unit tests
502
+ ```
503
+
504
+ ## What LOOM-GPT Is Good At
505
+
506
+ - Learning transformer internals.
507
+ - Running small local experiments.
508
+ - Comparing datasets and specialists.
509
+ - Demonstrating overfitting and validation loss.
510
+ - Exploring controllable generation through weighted specialists.
511
+ - Creating a portfolio project with a clear research-style idea.
512
+
513
+ ## What LOOM-GPT Is Not
514
+
515
+ - It is not ChatGPT.
516
+ - It is not a factual assistant.
517
+ - It is not trained on internet-scale data.
518
+ - It will not produce polished text from tiny datasets.
519
+ - It does not yet have a full dashboard.
520
+
521
+ Small models trained from scratch need clean data and patience. The goal is experimentation and interpretability, not production-grade language understanding.
522
+
523
+ ## Recommended Data Size
524
+
525
+ For experiments:
526
+
527
+ ```text
528
+ 100,000+ characters: basic behavior
529
+ 500,000+ characters: better small-model experiments
530
+ 2,000,000+ characters: noticeably stronger local style learning
531
+ ```
532
+
533
+ Use clean, consistent data. Remove broken HTML, duplicated lines, unrelated text, and noisy formatting when possible.
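A quick way to see where a prepared dataset falls against these thresholds is to count its characters. The tier labels are copied from the list above; `describe_size` and `dataset_chars` are hypothetical helpers, not LOOM commands.

```python
from pathlib import Path

# Thresholds and labels copied from the list above, largest first.
TIERS = [
    (2_000_000, "noticeably stronger local style learning"),
    (500_000, "better small-model experiments"),
    (100_000, "basic behavior"),
]

def describe_size(num_chars: int) -> str:
    """Map a character count to the matching tier label."""
    for threshold, label in TIERS:
        if num_chars >= threshold:
            return label
    return "below the suggested minimum"

def dataset_chars(path: str) -> int:
    """Count characters in a prepared input.txt."""
    return len(Path(path).read_text(encoding="utf-8", errors="replace"))

print(describe_size(150_000))
```

For a real dataset, `describe_size(dataset_chars("data/loom/notes/input.txt"))` would report its tier.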
534
+
535
+ ## Roadmap
536
+
537
+ Completed:
538
+
539
+ - Custom dataset preparation
540
+ - Byte tokenizer
541
+ - GPT training from scratch
542
+ - Early stopping
543
+ - Training history CSV
544
+ - Generation presets
545
+ - Weighted Model Weaving CLI
546
+ - Token influence trace export
547
+
548
+ Next:
549
+
550
+ - Streamlit dashboard
551
+ - Loss charts
552
+ - Specialist sliders
553
+ - Colored token influence visualization
554
+ - BPE tokenizer experiments
555
+ - Research evaluation suite
556
+
557
+ Future dashboard concept:
558
+
559
+ ```text
560
+ Datasets -> Train -> Generate -> Weave -> Metrics
561
+ ```
562
+
563
+ The long-term vision is a local LOOM Studio interface where users train specialists, move sliders, generate text, and see which specialist influenced each token.
564
+
565
+ ## Development Workflow
566
+
567
+ Run tests:
568
+
569
+ ```bash
570
+ python -m unittest discover -s tests -v
571
+ ```
572
+
573
+ Compile check:
574
+
575
+ ```bash
576
+ python -m compileall -q loom.py train.py generate.py weave.py src tests
577
+ ```
578
+
579
+ Before pushing:
580
+
581
+ ```bash
582
+ git status
583
+ git diff --stat
584
+ ```
585
+
586
+ Do not commit:
587
+
588
+ - `.venv/`
589
+ - `out/`
590
+ - `data/loom/`
591
+ - personal datasets
592
+ - `.pt` checkpoints
593
+
594
+ These are ignored by default.
595
+
596
+ ## License
597
+
598
+ Add a license before using this as a public release project.