nanogpt 0.1.2 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: cf308fcec8ccec074200361b2327a2381a5412b92497391bb71ec5f154cd1283
-   data.tar.gz: 87dcc389df03af0ac59fc0e75bbef7be7071f00613a2662cecf365ba38bc853e
+   metadata.gz: c2d0853148f473bb23f4ffffaa5a7b9e45f3faff570cc0a85253d40b6b63b80a
+   data.tar.gz: 68f31e52460273d97d1a97de844fb79c9bfab22b939a897482b6def4c77bdecc
  SHA512:
-   metadata.gz: 2b6ceeb10236b639c82398c94d3c1a876eff549a17289ff45abae489e480a3d6a1db45f95a4b2e56f1d1df6afde28159965b188342d1e38c2482359dfb11e061
-   data.tar.gz: 0fd4653c2719d1d3c339a904437fd0657f17e49fa0a9d4361a3a259806fab8dc798ed7f179722b41913d3471e8799abf78823f207988987d86cba680d7f90f03
+   metadata.gz: 42a617a95e04d7727cc011a033ea039b8763a9103093dbec7ad614eb74a9a4f4ed818d82ffedeb8e627494dd73fd7950fdb30f97dd5ff63866123a0e892befc1
+   data.tar.gz: af2916e7179830cc91fc516be16b765e46a3c6f37cb732583900bf71591abf16d8bbdc250bbe07a0b83eac124a5ec5b7241684017b4c166ebe3fcb8611702592
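The new SHA-256 values can be checked against a locally fetched copy of the gem. A `.gem` file is an ordinary tar archive containing `metadata.gz` and `data.tar.gz`, so a minimal verification sketch in Ruby (the fetch/unpack commands in the comments are assumptions, not part of this diff) looks like:

```ruby
# Minimal sketch (not part of the gem): verify the new SHA-256 checksums
# for nanogpt 0.2.0 against a locally fetched copy.
#
# Assumed setup (a .gem file is a plain tar archive):
#   gem fetch nanogpt --version 0.2.0
#   tar -xf nanogpt-0.2.0.gem        # extracts metadata.gz and data.tar.gz
require "digest"

expected = {
  "metadata.gz" => "c2d0853148f473bb23f4ffffaa5a7b9e45f3faff570cc0a85253d40b6b63b80a",
  "data.tar.gz" => "68f31e52460273d97d1a97de844fb79c9bfab22b939a897482b6def4c77bdecc"
}

expected.each do |file, sha256|
  actual = Digest::SHA256.file(file).hexdigest
  puts "#{file}: #{actual == sha256 ? 'ok' : 'MISMATCH'}"
end
```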
data/CHANGELOG.md ADDED
@@ -0,0 +1,45 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [0.2.0] - 2025-12-08
+
+ ### Added
+
+ - **Custom text file training**: New `nanogpt prepare textfile <path>` command to train on any text file with character-level tokenization
+   - Streams through large files without loading everything into memory
+   - Auto-detects file encoding (UTF-8 or Windows-1252)
+   - Configurable output directory name (`--output=NAME`)
+   - Configurable train/validation split ratio (`--val_ratio=F`, default 0.1)
+ - Updated README with documentation for training on custom text files
+
+ ## [0.1.2] - 2025-12-07
+
+ ### Fixed
+
+ - Fixed `prepare` command to output files to current working directory
+
+ ## [0.1.1] - 2025-12-07
+
+ ### Fixed
+
+ - Fixed typo in documentation
+
+ ## [0.1.0] - 2025-12-07
+
+ ### Added
+
+ - Initial release
+ - Full GPT-2 architecture implementation in Ruby
+ - MPS (Metal) and CUDA GPU acceleration via torch.rb
+ - Flash attention support when dropout=0
+ - Character-level and GPT-2 BPE tokenizers
+ - Cosine learning rate schedule with warmup
+ - Gradient accumulation for larger effective batch sizes
+ - Checkpointing and training resumption
+ - Shakespeare character-level dataset
+ - OpenWebText dataset support
+ - CLI commands: `train`, `sample`, `bench`, `prepare`
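The "character-level tokenization" added in 0.2.0 simply assigns one integer id per distinct character in the corpus. A minimal illustrative sketch of the idea (not the gem's code; the actual implementation appears in the `data/exe/nanogpt` diff further down):

```ruby
# Illustrative sketch of character-level tokenization (not the gem's code).
text  = "hello world"
chars = text.chars.uniq.sort              # sorted unique characters = vocabulary
stoi  = chars.each_with_index.to_h         # character  -> integer id
itos  = stoi.invert                        # integer id -> character

ids = text.each_char.map { |c| stoi[c] }   # encode
puts ids.inspect
puts ids.map { |i| itos[i] }.join          # decode back to "hello world"
```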
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     nanogpt (0.1.0)
+     nanogpt (0.1.2)
        numo-narray (~> 0.9)
        tiktoken_ruby (~> 0.0)
        torch-rb (~> 0.14)
@@ -32,6 +32,7 @@ GEM

  PLATFORMS
    arm64-darwin-24
+   arm64-darwin-25

  DEPENDENCIES
    nanogpt!
data/README.md CHANGED
@@ -1,5 +1,7 @@
  # nanoGPT

+ [![Gem Version](https://badge.fury.io/rb/nanogpt.svg)](https://rubygems.org/gems/nanogpt)
+
  A Ruby port of Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT). Train GPT-2 style language models from scratch using [torch.rb](https://github.com/ankane/torch.rb).

  Built for Ruby developers who want to understand how LLMs work by building one.
@@ -13,7 +15,7 @@ gem install nanogpt
  nanogpt prepare shakespeare_char

  # Train (use MPS on Apple Silicon for 17x speedup)
- nanogpt train --dataset=shakespeare_char --device=mps
+ nanogpt train --dataset=shakespeare_char --device=mps --max_iters=2000

  # Generate text
  nanogpt sample --dataset=shakespeare_char
@@ -30,7 +32,7 @@ bundle install
  bundle exec ruby data/shakespeare_char/prepare.rb

  # Train
- bundle exec exe/nanogpt train --dataset=shakespeare_char --device=mps
+ bundle exec exe/nanogpt train --dataset=shakespeare_char --device=mps --max_iters=2000

  # Sample
  bundle exec exe/nanogpt sample --dataset=shakespeare_char
@@ -81,6 +83,53 @@ nanogpt bench [options]  # Run performance benchmarks
  --top_k=N              # Top-k sampling (default: 200)
  ```

+ ## Training on Your Own Text
+
+ You can train on any text file using the `textfile` command:
+
+ ```bash
+ # Prepare your text file (creates char-level tokenizer)
+ nanogpt prepare textfile /path/to/mybook.txt --output=mybook
+
+ # Train a model
+ nanogpt train --dataset=mybook --device=mps --max_iters=2000
+
+ # Generate text
+ nanogpt sample --dataset=mybook --start="Once upon a time"
+ ```
+
+ ### Options
+
+ ```bash
+ --output=NAME      # Output directory name (default: derived from filename)
+ --val_ratio=F      # Validation split ratio (default: 0.1)
+ ```
+
+ ### Example: Training on a Novel
+
+ ```bash
+ # Download a book
+ curl -o lotr.txt "https://example.com/fellowship.txt"
+
+ # Prepare (handles UTF-8 and Windows-1252 encodings)
+ nanogpt prepare textfile lotr.txt --output=lotr
+
+ # Train a larger model for better results
+ nanogpt train --dataset=lotr --device=mps \
+   --max_iters=2000 \
+   --n_layer=6 --n_head=6 --n_embd=384 \
+   --block_size=256 --batch_size=32
+
+ # Sample with a prompt
+ nanogpt sample --dataset=lotr --start="Frodo" --max_new_tokens=500
+ ```
+
+ The `textfile` command:
+ - Streams through large files without loading everything into memory
+ - Auto-detects encoding (UTF-8 or Windows-1252)
+ - Creates a character-level vocabulary from your text
+ - Splits into train/validation sets
+
  ## Features

  - Full GPT-2 architecture (attention, MLP, layer norm, embeddings)
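The files written by `prepare textfile` follow the layout the trainer expects: `train.bin` and `val.bin` are packed arrays of unsigned 16-bit token ids, and `meta.json` stores `vocab_size`, `stoi`, and `itos` (with stringified integer keys). A minimal sketch for inspecting that output, assuming the `--output=mybook` example from the README above (this helper is not part of the gem's CLI):

```ruby
# Minimal sketch (not part of the gem's CLI): inspect the files written by
# `nanogpt prepare textfile mybook.txt --output=mybook`.
require "json"
require "numo/narray"

meta = JSON.parse(File.read("data/mybook/meta.json"))
itos = meta["itos"]                                   # keys are stringified token ids
puts "vocab_size: #{meta['vocab_size']}"

# train.bin / val.bin are flat arrays of unsigned 16-bit token ids
tokens = Numo::UInt16.from_binary(File.binread("data/mybook/train.bin"))
puts "train tokens: #{tokens.size}"

# Decode a short preview back to text
puts tokens[0...200].to_a.map { |id| itos[id.to_s] }.join
```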
data/exe/nanogpt CHANGED
@@ -48,12 +48,25 @@ class NanoGPTCLI

      if dataset.nil?
        puts "Usage: nanogpt prepare <dataset>"
+       puts "       nanogpt prepare textfile <path> [options]"
        puts ""
        puts "Available datasets:"
        available.each { |d| puts "  #{d}" }
+       puts ""
+       puts "Custom text file:"
+       puts "  textfile    Prepare a custom text file with char-level tokenization"
+       puts ""
+       puts "Textfile options:"
+       puts "  --output=NAME    Output directory name (default: derived from filename)"
+       puts "  --val_ratio=F    Validation split ratio (default: 0.1)"
        exit 1
      end

+     if dataset == "textfile"
+       prepare_textfile
+       return
+     end
+
      prepare_script = File.join(data_dir, dataset, "prepare.rb")

      unless File.exist?(prepare_script)
@@ -73,6 +86,176 @@ class NanoGPTCLI
      load prepare_script
    end

+   def prepare_textfile
+     require "numo/narray"
+     require "json"
+     require "fileutils"
+
+     input_path = nil
+     output_name = nil
+     val_ratio = 0.1
+
+     @args[1..].each do |arg|
+       if arg.start_with?("--output=")
+         output_name = arg.split("=", 2).last
+       elsif arg.start_with?("--val_ratio=")
+         val_ratio = arg.split("=", 2).last.to_f
+       elsif !arg.start_with?("--")
+         input_path = arg
+       end
+     end
+
+     if input_path.nil?
+       puts "Error: No input file specified"
+       puts ""
+       puts "Usage: nanogpt prepare textfile <path> [options]"
+       puts ""
+       puts "Options:"
+       puts "  --output=NAME    Output directory name (default: derived from filename)"
+       puts "  --val_ratio=F    Validation split ratio (default: 0.1)"
+       exit 1
+     end
+
+     unless File.exist?(input_path)
+       puts "Error: File not found: #{input_path}"
+       exit 1
+     end
+
+     output_name ||= File.basename(input_path, ".*").gsub(/[^a-zA-Z0-9_-]/, "_")
+     output_dir = File.join(Dir.pwd, "data", output_name)
+     FileUtils.mkdir_p(output_dir)
+
+     file_size = File.size(input_path)
+     puts "Preparing text file: #{input_path}"
+     puts "File size: #{(file_size / 1_000_000.0).round(2)} MB"
+     puts "Output directory: #{output_dir}"
+     puts "Validation ratio: #{val_ratio}"
+     puts ""
+
+     # Phase 1: Build vocabulary by reading entire file
+     # For very large files, we read line by line to avoid memory issues
+     puts "Phase 1: Building vocabulary..."
+     char_set = Set.new
+     char_count = 0
+
+     # Detect encoding: check if file is valid UTF-8, otherwise assume Windows-1252
+     sample = File.binread(input_path, 100_000)
+     encoding = sample.force_encoding("UTF-8").valid_encoding? ? "UTF-8" : "Windows-1252:UTF-8"
+     puts "  Detected encoding: #{encoding.split(':').first}"
+
+     File.foreach(input_path, encoding: encoding) do |line|
+       line.each_char { |c| char_set.add(c) }
+       char_count += line.length
+       print "\r  Scanned #{char_count} characters, #{char_set.size} unique..." if (char_count % 100_000) < 1000
+     end
+     puts "\r  Scanned #{char_count} characters, #{char_set.size} unique..."
+
+     chars = char_set.to_a.sort
+     vocab_size = chars.size
+     puts "Vocabulary size: #{vocab_size}"
+
+     stoi = chars.each_with_index.to_h
+     itos = chars.each_with_index.map { |c, i| [i, c] }.to_h
+
+     # Phase 2: Calculate split point
+     total_chars = char_count
+     val_chars = (total_chars * val_ratio).to_i
+     train_chars = total_chars - val_chars
+     puts ""
+     puts "Train: #{train_chars} characters"
+     puts "Val:   #{val_chars} characters"
+
+     # Phase 3: Encode and write train.bin (streaming line by line)
+     puts ""
+     puts "Phase 2: Encoding and writing train.bin..."
+     train_path = File.join(output_dir, "train.bin")
+     chars_written = 0
+     buffer = []
+     buffer_size = 100_000
+
+     File.open(train_path, "wb") do |output|
+       File.foreach(input_path, encoding: encoding) do |line|
+         line.each_char do |c|
+           break if chars_written >= train_chars
+
+           buffer << stoi[c]
+           chars_written += 1
+
+           if buffer.size >= buffer_size
+             arr = Numo::UInt16.cast(buffer)
+             output.write(arr.to_binary)
+             buffer.clear
+             print "\r  Written #{chars_written}/#{train_chars} characters..."
+           end
+         end
+         break if chars_written >= train_chars
+       end
+
+       unless buffer.empty?
+         arr = Numo::UInt16.cast(buffer)
+         output.write(arr.to_binary)
+         buffer.clear
+       end
+     end
+     puts ""
+
+     # Phase 4: Encode and write val.bin (streaming line by line)
+     puts "Phase 3: Encoding and writing val.bin..."
+     val_path = File.join(output_dir, "val.bin")
+     chars_written = 0
+     skipped = 0
+     buffer = []
+
+     File.open(val_path, "wb") do |output|
+       File.foreach(input_path, encoding: encoding) do |line|
+         line.each_char do |c|
+           if skipped < train_chars
+             skipped += 1
+             next
+           end
+
+           buffer << stoi[c]
+           chars_written += 1
+
+           if buffer.size >= buffer_size
+             arr = Numo::UInt16.cast(buffer)
+             output.write(arr.to_binary)
+             buffer.clear
+             print "\r  Written #{chars_written}/#{val_chars} characters..."
+           end
+         end
+       end
+
+       unless buffer.empty?
+         arr = Numo::UInt16.cast(buffer)
+         output.write(arr.to_binary)
+         buffer.clear
+       end
+     end
+     puts ""
+
+     # Phase 5: Save meta.json
+     puts "Phase 4: Saving meta.json..."
+     meta = {
+       "vocab_size" => vocab_size,
+       "itos" => itos.transform_keys(&:to_s),
+       "stoi" => stoi
+     }
+     File.write(File.join(output_dir, "meta.json"), JSON.pretty_generate(meta))
+
+     train_size_mb = File.size(train_path) / 1_000_000.0
+     val_size_mb = File.size(val_path) / 1_000_000.0
+
+     puts ""
+     puts "Done!"
+     puts "  train.bin: #{train_chars} tokens (#{train_size_mb.round(2)} MB)"
+     puts "  val.bin:   #{val_chars} tokens (#{val_size_mb.round(2)} MB)"
+     puts "  meta.json: vocab_size=#{vocab_size}"
+     puts ""
+     puts "To train:"
+     puts "  nanogpt train --dataset=#{output_name}"
+   end
+
    def train
      config = NanoGPT::TrainConfig.load(@args)

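One detail worth calling out in `prepare_textfile` above is the encoding handling: the file is sniffed once with `File.binread`, and if that sample is not valid UTF-8 the `encoding:` option is set to `"Windows-1252:UTF-8"`, Ruby's `external:internal` form, which makes `File.foreach` transcode each line to UTF-8 as it is read. A standalone sketch of the same idiom (illustrative only; `mybook.txt` is a hypothetical input file):

```ruby
# Illustrative sketch of the encoding heuristic used by `prepare_textfile`.
path   = "mybook.txt"                      # hypothetical input file
sample = File.binread(path, 100_000)       # sniff the first ~100 KB as raw bytes

encoding =
  if sample.force_encoding("UTF-8").valid_encoding?
    "UTF-8"
  else
    "Windows-1252:UTF-8"                   # external:internal -> transcode while reading
  end

File.foreach(path, encoding: encoding) do |line|
  # each line arrives as UTF-8 here, regardless of the file's original encoding
end
```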
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module NanoGPT
-   VERSION = "0.1.2"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,14 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: nanogpt
  version: !ruby/object:Gem::Version
-   version: 0.1.2
+   version: 0.2.0
  platform: ruby
  authors:
  - Chris Hasiński
- autorequire:
  bindir: exe
  cert_chain: []
- date: 2025-12-02 00:00:00.000000000 Z
+ date: 1980-01-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: torch-rb
@@ -75,6 +74,7 @@ executables:
  extensions: []
  extra_rdoc_files: []
  files:
+ - CHANGELOG.md
  - Gemfile
  - Gemfile.lock
  - README.md
@@ -111,7 +111,6 @@ metadata:
    homepage_uri: https://github.com/khasinski/nanogpt-rb
    source_code_uri: https://github.com/khasinski/nanogpt-rb
    changelog_uri: https://github.com/khasinski/nanogpt-rb/blob/main/CHANGELOG.md
- post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -126,8 +125,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
    version: '0'
  requirements: []
- rubygems_version: 3.1.6
- signing_key:
+ rubygems_version: 3.6.9
  specification_version: 4
  summary: A Ruby port of Karpathy's nanoGPT
  test_files: []