pokepy-generator 1.0.1__tar.gz
- pokepy_generator-1.0.1/LICENSE +21 -0
- pokepy_generator-1.0.1/PKG-INFO +411 -0
- pokepy_generator-1.0.1/README.md +399 -0
- pokepy_generator-1.0.1/app.py +199 -0
- pokepy_generator-1.0.1/pokepy_generator.egg-info/PKG-INFO +411 -0
- pokepy_generator-1.0.1/pokepy_generator.egg-info/SOURCES.txt +10 -0
- pokepy_generator-1.0.1/pokepy_generator.egg-info/dependency_links.txt +1 -0
- pokepy_generator-1.0.1/pokepy_generator.egg-info/entry_points.txt +2 -0
- pokepy_generator-1.0.1/pokepy_generator.egg-info/requires.txt +3 -0
- pokepy_generator-1.0.1/pokepy_generator.egg-info/top_level.txt +1 -0
- pokepy_generator-1.0.1/pyproject.toml +24 -0
- pokepy_generator-1.0.1/setup.cfg +4 -0
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Anay Shekhar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,411 @@
Metadata-Version: 2.4
Name: pokepy-generator
Version: 1.0.1
Summary: A character-level language model built from scratch using only numpy.
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24.0
Requires-Dist: gradio>=4.0.0
Requires-Dist: huggingface-hub<=0.24.0
Dynamic: license-file

# pokepy :)

a character-level language model built completely from scratch using only numpy that generates pokémon-sounding names. no PyTorch, no autograd, no deep learning frameworks: every forward pass, backward pass, and gradient update is manually implemented.

this project started as a way to understand how neural networks actually work under the hood. I first built a simple MLP, then expanded it into a WaveNet-style architecture to explore how increasing context length changes what a model can learn.

## demo

try it live (local desktop):

* for mac users, double click PokepyLauncher.command to launch the application.
* for windows users, double click PokepyLauncher.bat to launch the application.

once running, the launcher starts a local inference server in the terminal. open your browser and navigate to: http://localhost:10000

---

# what is this

I wanted to understand the foundations behind language models, so I built a mini character-level text generator completely from scratch.

instead of relying on existing machine learning libraries, I manually implemented:

* embeddings
* linear layers
* batch normalization
* tanh activations
* softmax
* cross entropy loss
* backpropagation
* gradient descent

everything runs only with numpy.
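
a minimal numpy sketch of how the pieces listed above could fit together in one forward pass. the sizes (a 27-character vocabulary, 3-character context, 10-dim embeddings, 200 hidden units), the random weights, and the fake batch are all assumptions for illustration; this is not the project's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes, chosen for illustration only
vocab_size, block_size, n_embd, n_hidden, batch = 27, 3, 10, 200, 32

# randomly initialised parameters (a trained model would load real values)
C = rng.normal(size=(vocab_size, n_embd))                    # embedding table
W1 = rng.normal(size=(block_size * n_embd, n_hidden)) * 0.1  # hidden linear layer
bngain, bnbias = np.ones((1, n_hidden)), np.zeros((1, n_hidden))
W2 = rng.normal(size=(n_hidden, vocab_size)) * 0.1           # output linear layer
b2 = np.zeros(vocab_size)

# a fake batch: integer character ids for contexts and targets
Xb = rng.integers(0, vocab_size, size=(batch, block_size))
Yb = rng.integers(0, vocab_size, size=batch)

emb = C[Xb]                                   # (batch, block_size, n_embd)
h_pre = emb.reshape(batch, -1) @ W1           # concatenate embeddings, then linear
mean = h_pre.mean(axis=0, keepdims=True)      # batch statistics
var = h_pre.var(axis=0, keepdims=True)
h_norm = bngain * (h_pre - mean) / np.sqrt(var + 1e-5) + bnbias
h = np.tanh(h_norm)                           # activation
logits = h @ W2 + b2                          # one score per character

# numerically stable softmax followed by cross entropy
logits = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(batch), Yb]).mean()
print(f"loss: {loss:.3f}")
```

each row of `probs` sums to 1, and the loss is the average negative log-probability assigned to the correct next character.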

the model learns character patterns from pokémon names and generates new names based on the relationships it discovers.

---

# model 1: MLP

## architecture

the first version was a simple multilayer perceptron:

* character embedding layer (10-dimensional vectors)
* linear layer
* batch normalization
* tanh activation
* output linear layer
* softmax + cross entropy loss

hidden size:

```
200 neurons
```

context length:

```
3 characters
```

this means the model only looks at the previous 3 characters to predict the next one.

example:

```
pik → a
ika → next character
```

the context continuously shifts as the model generates.
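
the shifting context can be sketched as a sliding window over each name. this is a hedged illustration: the `build_examples` helper and the `.` padding/end marker are assumptions (a convention borrowed from makemore), not necessarily what the project uses.

```python
def build_examples(name, block_size=3, pad="."):
    """Return (context, target) pairs for one name using a sliding window."""
    context = pad * block_size          # start with an all-padding context
    examples = []
    for ch in name + pad:               # the trailing pad marks end-of-name
        examples.append((context, ch))
        context = context[1:] + ch      # drop the oldest char, append the new one
    return examples


for ctx, target in build_examples("pikachu"):
    print(f"{ctx} -> {target}")
```

for `pikachu` this yields eight pairs, including `pik -> a` and `ika -> c`, and ends with `chu -> .` so the model can learn when a name stops.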

---

## MLP results

training:

```
train loss: 1.294
validation loss: 3.504
```

generated names:

```
blipedeedo
rosalini
lect
dartic
star
vigus
swannon
hippowdon
the
larvinerao
```

the MLP learned basic character relationships, but the limited context window made it difficult to understand longer patterns inside names. the gap between train loss (1.294) and validation loss (3.504) also suggests overfitting to the training names.

---

# model 2: WaveNet

## why I built this

the biggest limitation of the MLP was context length.

with only 3 characters of context, the model could only see a small part of each name.

for example:

```
charizard

cha
har
ari
riz
```

each window is seen in isolation, so the model never sees the larger structure of the word.

WaveNet improves this by gradually combining groups of characters, allowing the model to build larger representations without massively increasing the number of parameters.

---

## architecture

WaveNet-style architecture:

* character embedding layer (10 dimensions)
* FlattenConsecutive layers
* multiple linear layers
* batch normalization
* tanh activations
* final output layer
* softmax + cross entropy loss

context length:

```
8 characters
```

the model builds information hierarchically:

```
characters

↓

combined character groups

↓

higher level features

↓

next character prediction
```
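
a rough sketch of the hierarchical merging, assuming a FlattenConsecutive step that concatenates every 2 neighbouring positions before each linear layer. the group size of 2, the random weights, and the omission of batch normalization are simplifications for illustration; the real layers may differ.

```python
import numpy as np


def flatten_consecutive(x, n):
    """Merge every n neighbouring positions: (B, T, C) -> (B, T // n, C * n)."""
    B, T, C = x.shape
    out = x.reshape(B, T // n, C * n)
    if out.shape[1] == 1:               # drop the time axis once fully merged
        out = out.squeeze(1)
    return out


rng = np.random.default_rng(0)
n_hidden = 32                           # hidden size from the comparison table
x = rng.normal(size=(4, 8, 10))         # 4 examples, 8 characters, 10-dim embeddings

shapes = [x.shape]
for _ in range(3):                      # 8 -> 4 -> 2 -> 1 positions
    x = flatten_consecutive(x, 2)
    W = rng.normal(size=(x.shape[-1], n_hidden)) * 0.1
    x = np.tanh(x @ W)                  # linear + tanh (batch norm left out here)
    shapes.append(x.shape)

print(shapes)
```

the 8 character positions shrink to 4, then 2, then a single 32-dim vector per example, so each layer only ever mixes two neighbouring groups at a time.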

---

## WaveNet results

training:

```
train loss: 1.949
validation loss: 2.748
```

generated names:

```
gropinig
pyghislacat
poloun
hoongel
spuspiniyan
ongtover
kasato
xel
felspipon
linmatie
asherron
beatdiqdule
madstutf
drudona
rouzslra
liwsywunk
galeon
magnoslaws
araidono
lickopt
```

WaveNet produced longer and more structured generations because it had access to a larger context window. its validation loss (2.748) is also lower than the MLP's (3.504), even though its train loss is higher (1.949 vs 1.294).

---

# MLP vs WaveNet

|                  | MLP                 | WaveNet             |
| ---------------- | ------------------- | ------------------- |
| Context size     | 3 characters        | 8 characters        |
| Hidden size      | 200                 | 32                  |
| Training steps   | 300,000             | 10,000              |
| Architecture     | Single hidden layer | Hierarchical layers |
| Feature learning | Direct              | Progressive         |
| Main advantage   | Simple baseline     | Larger context      |

---

# challenges

## context length

one of the biggest lessons from this project was understanding why context matters.

a model with a smaller context window can only learn local patterns, while larger context allows it to understand longer relationships.

the MLP used:

```
3 character context
```

while WaveNet increased this to:

```
8 character context
```

which allowed it to capture more structure from names.

---

## batch normalization

implementing batch normalization manually was one of the hardest parts.

I had to handle:

* batch mean
* batch variance
* running mean
* running variance

training normalizes with the current batch's mean and variance, while inference uses the running estimates, so saving the running values was required for the deployed model to generate correctly.
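
a hedged sketch of a batch normalization layer that tracks running statistics. the class name, the `momentum` value, and the toy data are assumptions for illustration; the backward pass is omitted.

```python
import numpy as np


class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        self.training = True
        self.gain = np.ones(dim)            # learnable scale
        self.bias = np.zeros(dim)           # learnable shift
        self.running_mean = np.zeros(dim)   # used at inference time
        self.running_var = np.ones(dim)

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving average of the batch statistics
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gain * x_hat + self.bias


rng = np.random.default_rng(0)
bn = BatchNorm1d(4)
for _ in range(200):                        # "training": statistics get tracked
    bn(rng.normal(loc=3.0, scale=2.0, size=(32, 4)))

bn.training = False                         # inference: a single example works
single = bn(np.full((1, 4), 3.0))
print(bn.running_mean, single)
```

in training mode a batch of one example would have zero variance and normalize to all zeros, which is why generating one name at a time depends on the stored running values.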

---

## backpropagation

instead of calling autograd, as in PyTorch:

```python
loss.backward()
```

I manually calculated gradients for:

* embeddings
* linear layers
* batch normalization
* tanh activations
* softmax cross entropy

this helped me understand how neural networks actually learn instead of treating them as black boxes.
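
as one example of a hand-derived gradient, here is a sketch of the softmax cross entropy backward pass, checked against a finite difference. function names and the random test data are hypothetical; the formula itself (probabilities minus one-hot targets, divided by batch size) is the standard result.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy loss and the softmax probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
    return loss, probs


def cross_entropy_backward(probs, targets):
    """Gradient of the mean loss w.r.t. the logits: (probs - one_hot) / batch."""
    grad = probs.copy()
    grad[np.arange(len(targets)), targets] -= 1.0
    return grad / len(targets)


rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 27))
targets = rng.integers(0, 27, size=5)

loss, probs = cross_entropy(logits, targets)
analytic = cross_entropy_backward(probs, targets)

# finite-difference check on a single logit
eps = 1e-5
bumped = logits.copy()
bumped[2, 7] += eps
numeric = (cross_entropy(bumped, targets)[0] - loss) / eps
print(analytic[2, 7], numeric)
```

the two printed numbers should agree to several decimal places; comparing against a numerical gradient like this is a common way to catch mistakes in manual backpropagation.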

---

## random generation

generation is probabilistic.

even with the same trained model, outputs change from run to run (unless the random seed is fixed) because the next character is sampled from the model's probability distribution.
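
a small sketch of sampling from a probability distribution with numpy. the 3-entry distribution is made up and `sample_index` is a hypothetical helper; in the real model the probabilities would come from the softmax output.

```python
import numpy as np


def sample_index(probs, rng):
    """Draw one index according to the given probability distribution."""
    return int(rng.choice(len(probs), p=probs))


probs = np.array([0.1, 0.6, 0.3])           # made-up distribution over 3 characters

rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
draws_a = [sample_index(probs, rng_a) for _ in range(10)]
draws_b = [sample_index(probs, rng_b) for _ in range(10)]

print(draws_a)
print(draws_a == draws_b)                   # same seed, same sequence
```

two generators seeded identically produce the same draws, so fixing the seed is one way to make generations reproducible.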

---

# training details

## MLP

dataset:

```
pokemon names
```

training:

* optimizer: SGD
* batch size: 32
* steps: 300,000
* learning rate: `0.1 → 0.01 after 100k steps`

parameters:

```
C
W1
W2
b2
bngain
bnbias
```
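
the training recipe above can be sketched as plain SGD with a step-decay learning rate. the helper names and the toy objective are assumptions, and whether the switch to 0.01 happens exactly at step 100,000 or one step later is a guess.

```python
import numpy as np


def learning_rate(step, decay_at=100_000):
    """Step decay: 0.1 before `decay_at`, 0.01 from then on."""
    return 0.1 if step < decay_at else 0.01


def sgd_step(params, grads, lr):
    """In-place update: p <- p - lr * grad for every parameter array."""
    for p, g in zip(params, grads):
        p -= lr * g


# toy problem standing in for the real model: minimise ||w||^2
w = np.array([1.0, -2.0])
for step in range(3):
    grad = 2 * w                        # gradient of ||w||^2
    sgd_step([w], [grad], learning_rate(step))

print(w)
```

in the real training loop, `params` would be the arrays listed above (C, W1, W2, b2, bngain, bnbias) and `grads` their manually computed gradients; passing `decay_at=8000` would match the WaveNet schedule.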

---

## WaveNet

dataset:

```
pokemon names
```

training:

* optimizer: SGD
* batch size: 32
* steps: 10,000
* learning rate: `0.1 → 0.01 after 8000 steps`

parameters:

```
embeddings
linear layers
batch normalization parameters
```

---

# deployment

the model is deployed using Hugging Face Spaces with Gradio.

the demo loads the trained numpy weights and runs inference without PyTorch or external ML frameworks.

the deployed model uses:

* trained embeddings
* linear layer weights
* batch normalization parameters
* vocabulary mappings

the entire inference pipeline runs using manually implemented numpy layers.
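
one possible way to store and reload such weights is numpy's `.npz` format. this is only a sketch: the file name, the array names for the running statistics, and the truncated vocabulary are all hypothetical, and the project may save its weights differently.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# stand-in parameters with random values (not the trained weights)
params = {
    "C": rng.normal(size=(27, 10)),
    "W1": rng.normal(size=(30, 200)),
    "bn_running_mean": np.zeros((1, 200)),   # hypothetical name
    "bn_running_var": np.ones((1, 200)),     # hypothetical name
}
itos = {0: ".", 1: "a", 2: "b"}              # truncated vocabulary, for illustration

# save every array plus the vocabulary into a single file
path = os.path.join(tempfile.mkdtemp(), "weights.npz")
vocab = np.array([itos[i] for i in sorted(itos)])
np.savez(path, vocab=vocab, **params)

# reload, as an inference script might do at startup
loaded = np.load(path)
restored = {name: loaded[name] for name in params}
restored_itos = {i: str(ch) for i, ch in enumerate(loaded["vocab"])}
print(sorted(loaded.files))
```

keeping the vocabulary mapping in the same file as the weights avoids a mismatch between character ids and embedding rows at inference time.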

---

# usage

install dependencies:

```bash
pip install numpy
```

run:

```bash
python mlp.py
python wavenet.py
```

you will need:

```
data/
└── pokemon.txt
```

with one pokémon name per line.
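
a sketch of reading such a file and building the character vocabulary. the helper names, the lowercasing, and the `.` pad token are assumptions; the example writes a tiny stand-in file so it runs without the real dataset.

```python
import os
import tempfile


def load_names(path):
    """Read one name per line, lowercased, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]


def build_vocab(names, pad="."):
    """Map each character to an integer id; id 0 is reserved for the pad token."""
    chars = sorted(set("".join(names)))
    stoi = {ch: i + 1 for i, ch in enumerate(chars)}
    stoi[pad] = 0
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


# tiny stand-in for data/pokemon.txt
path = os.path.join(tempfile.mkdtemp(), "pokemon.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("pikachu\ncharizard\n\nmew\n")

names = load_names(path)
stoi, itos = build_vocab(names)
print(names, len(stoi))
```

if the dataset contains names with a literal `.` in them, a different pad character would be needed so it does not collide with real characters.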

---

# what I learned

this project taught me how language models are built from the ground up.

the biggest takeaway was that improving a model is not always about making it bigger. changing the architecture and giving the model better ways to understand context can have a larger impact than simply adding more parameters.

---

# license

MIT

heavily inspired by Andrej Karpathy's makemore series, highly recommended if you want to understand neural networks from the inside out :)