RubyGems - cton - Versions diffs - 0.2.0 → 0.3.0 - Mend

cton 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml +4 -4
data/CHANGELOG.md +19 -0
data/README.md +108 -1
data/bench/encode_decode_bench.rb +65 -0
data/lib/cton/decoder.rb +43 -23
data/lib/cton/encoder.rb +99 -29
data/lib/cton/version.rb +1 -1
data/lib/cton.rb +2 -1
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b010a8f0e0da39e4e4d0a4217eddaa8f9496f1889bf32e12430fdb7737f17fab
-  data.tar.gz: 6fe6f58ff0a40233a279ae5c8881ccca4ce382fa85cae15c2c5e26782bb02875
+  metadata.gz: 1c9161ae830ba6b01d3ec94d1170fc4295aaacfa839c869aaa6adefe2711cc2d
+  data.tar.gz: 80c4ba30abbf8a562bde581f26e7dc5529aa46275b61930fe28370f34156db61
 SHA512:
-  metadata.gz: 3a85563dd205c2c00b204359d85376514de8fc45ce2b2c98e4d52a0325bff2937e2d88ba5e367fe718a0b82127603deadfe16dd6f60062e77a1b75babc666ec4
-  data.tar.gz: b4b27bfb483e0145c49def7b9ab735c27e03420dc59fd6bcaabc57d1b2bf6868d7bc5c55fea9866da3270a6c81126df032590129a1e28385827b8b4f3058e92a
+  metadata.gz: 914196284081bacd5b7f5f6ac9a1b246ea8924eddbb26cd28796b12a2ee2156718a9ff3b795b86188d483da4fc294b002a93e935815f692b432dac78b5304dcf
+  data.tar.gz: d9b1bfb1f7de402fe9de0d7da90f750d490dbdcb4e572e9283bc3ed15b1b43e3f3f9b44a4031f80ecf5d610a73d64719f3c78f170060de78035e04dab3a9d663

data/CHANGELOG.md CHANGED

@@ -5,6 +5,25 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.3.0] - 2025-11-20
+### Added
+- **Performance tunables**: `Cton.dump` now accepts `decimal_mode: :fast | :precise`, allowing callers to trade float-format determinism for lower allocation pressure. Specs cover both modes.
+- **Benchmark harness**: New `bench/encode_decode_bench.rb` script (wired into the README/Development docs) exercises encode/decode hot paths and prints comparative JSON vs CTON timings. On Ruby 3.1.4/macOS the fast encoder completes 1,000 iterations in ~0.63s and the new inline decoder stress test wraps 400 concatenated documents in ~4.14s.
+- **Regression tests**: Added specs for streaming documents without separators plus validation around the new decimal mode toggle.
+### Changed
+- **Encoder**: Memoizes table schemas per array instance, adds a fast-path for homogeneous scalar lists, and reduces float/BigDecimal copying by favoring Ruby's native float formatting before falling back to `BigDecimal`. Unsupported `decimal_mode` values now raise immediately.
+- **Decoder**: Replaces high-allocation `StringScanner` tokenization with raw string slicing, improves key-boundary detection for inline payloads, and keeps symbolization logic untouched. Boundary heuristics now prefer alphabetic key starts to avoid splitting numeric payloads.
+- **Documentation**: README now calls out the tuning flags, inline caveats, and benchmark instructions; Development workflow highlights how to rerun the perf suite.
+### Fixed
+- **Inline parsing**: Eliminated the runaway allocations and incorrect key splits when processing long documents with `separator: ""`.
+- **Float normalization**: Restored canonical `9.2`-style output in fast mode while keeping the new perf optimizations.
 ## [0.2.0] - 2025-11-19
 ### Added

data/README.md CHANGED

@@ -15,6 +15,8 @@
 - [Token Savings](#token-savings-vs-json--toon)
 - [Installation](#installation)
 - [Usage](#usage)
+- [Performance & Benchmarks](#performance--benchmarks)
+- [Teaching CTON to LLMs](#teaching-cton-to-llms)
 - [Development](#development)
 - [Contributing](#contributing)
 - [License](#license)
@@ -165,6 +167,10 @@ pretty = Cton.dump(payload, pretty: true)
 File.open("data.cton", "w") do |f|
   Cton.dump(payload, f)
 end
+# Toggle float normalization strategies
+fast  = Cton.dump(payload) # default :fast mode
+strict = Cton.dump(payload, decimal_mode: :precise)
 ```
 ### CLI Tool
@@ -196,7 +202,7 @@ CTON natively supports serialization for:
 Whenever an array is made of hashes that all expose the same scalar keys, the encoder flattens it into a table to save tokens. Mixed or nested arrays fall back to `[N]=(value1,value2,...)`.
 #### Separators & ambiguity
-Removing every newline makes certain inputs ambiguous because `sam` and the next key `hikes` can merge into `samhikes`. The default `separator: "\n"` avoids that by inserting a single newline between root segments. You may pass `separator: ""` to `Cton.dump` for maximum compactness, but decoding such strings is only safe if you can guarantee extra quoting or whitespace between segments.
+Removing every newline makes certain inputs ambiguous because `sam` and the next key `hikes` can merge into `samhikes`. The default `separator: "\n"` avoids that by inserting a single newline between root segments. You may pass `separator: ""` to `Cton.dump` for maximum compactness, but decoding such strings is only safe if you can guarantee extra quoting or whitespace between segments. When you intentionally omit separators, keep next-level keys alphabetic (e.g., `payload`, `k42`) so the decoder's boundary heuristic can split `...1payload...` without misclassifying numeric prefixes.
 #### Literal safety & number normalization
 Following the TOON specification's guardrails, the encoder now:
@@ -204,6 +210,106 @@ Following the TOON specification's guardrails, the encoder now:
 - Canonicalizes float/BigDecimal output: no exponent notation, no trailing zeros, and `-0` collapses to `0`.
 - Converts `NaN` and `±Infinity` inputs to `null`, matching TOON's normalization guidance so downstream decoders don't explode on non-finite numbers.
+#### Decimal normalization modes
+- `decimal_mode: :fast` (default) prefers Ruby's native float representation and only falls back to `BigDecimal` when scientific notation is detected, minimizing allocations on tight loops.
+- `decimal_mode: :precise` forces the legacy `BigDecimal` path for every float, which is slower but useful for audit-grade dumps where you want deterministic decimal expansion.
+- Both modes share the same trailing-zero stripping and `-0 → 0` normalization, so switching modes never affects integer formatting.
+---
+## Performance & Benchmarks
+CTON focuses on throughput: encoder table schemas are memoized, scalar list encoding keeps a reusable buffer, floats avoid `BigDecimal` when they can, and the decoder slices straight from the raw string to sidestep `StringScanner` allocations. You can reproduce the numbers below with the bundled script:
+```bash
+bundle exec ruby bench/encode_decode_bench.rb
+# customize input size / iterations
+ITERATIONS=2000 STREAM_SIZE=400 bundle exec ruby bench/encode_decode_bench.rb
+```
+Latest results on Ruby 3.1.4/macOS (M-series), 1,000 iterations, `STREAM_SIZE=200`:
+| Benchmark | Time (s) |
+| --- | --- |
+| `cton dump` (:fast) | 0.626 |
+| `cton dump` (:precise) | 0.658 |
+| `json generate` | 0.027 |
+| `cton load` | 2.067 |
+| `json parse` | 0.045 |
+| `cton inline load` (separator=`""`, double payload) | 4.140 |
+`cton inline load` deliberately concatenates documents without separators to stress the new boundary detector; it now finishes without the runaway allocations seen in earlier releases.
+---
+## Teaching CTON to LLMs
+Use this system prompt to teach an LLM how to understand and generate CTON:
+````markdown
+You are an expert in data serialization and specifically in CTON (Compact Token-Oriented Notation). CTON is a token-efficient data format optimized for LLMs that serves as a compact alternative to JSON.
+Your task is to interpret CTON input and convert it to JSON, or convert JSON input into valid CTON format, following the specification below.
+### CTON Specification
+CTON minimizes syntax characters (braces, quotes) while preserving structure and type safety.
+**1. Basic Structure (Key-Value)**
+- **Rule:** Do not use outer curly braces `{}` for the root object.
+- **Rule:** Use `=` to separate keys and values.
+- **Rule:** Use `,` to separate fields.
+- **Rule:** Do not use quotes around "safe" strings (alphanumeric, simple text).
+- **Example:** - JSON: `{"task": "planning", "urgent": true}`
+  - CTON: `task=planning,urgent=true`
+**2. Nested Objects**
+- **Rule:** Use parentheses `()` to denote a nested object instead of `{}`.
+- **Example:**
+  - JSON: `{"context": {"user": "Davide", "theme": "dark"}}`
+  - CTON: `context(user=Davide,theme=dark)`
+**3. Arrays of Objects (Table Compression)**
+- **Rule:** Use the syntax `key[count]{columns}=values` for arrays of objects to avoid repeating keys.
+- **Structure:** `key[Length]{col1,col2}=val1,val2;val1,val2`
+- **Details:** - `[N]` denotes the number of items in the array.
+  - `{col1,col2}` defines the schema headers.
+  - `;` separates distinct objects (rows).
+  - `,` separates values within an object.
+- **Example:**
+JSON:
+```json
+{
+  "files": [
+    { "name": "README.md", "size": 1024 },
+    { "name": "lib.rb", "size": 2048 }
+  ]
+}
+```
+CTON: `files[2]{name,size}=README.md,1024;lib.rb,2048`
+**4. Type Safety & Literals**
+- **Booleans/Null:** `true`, `false`, and `null` are preserved as literals (unquoted).
+- **Numbers:** Integers and floats are written as is (e.g., `1024`, `3.14`).
+- **Escaping:** If a string value looks like a boolean, number, or contains reserved characters (like `,`, `;`, `=`, `(`, `)`), it must be wrapped in double quotes (e.g., `"true"`).
+### Examples for Training
+**Input (JSON):**
+```json
+{
+  "id": 123,
+  "active": true,
+  "metadata": {
+    "created_at": "2023-01-01",
+    "tags": "admin"
+  }
+}
+```
+````
 ---
 ## Type Safety
@@ -216,6 +322,7 @@ CTON ships with RBS signatures (`sig/cton.rbs`) to support type checking and IDE
 bin/setup        # install dependencies
 bundle exec rake # run tests and rubocop
 bin/console      # interactive playground
+bundle exec ruby bench/encode_decode_bench.rb # performance smoke test
 ```
 To release a new version, bump `Cton::VERSION` and run `bundle exec rake release`.
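
The fast/precise split that the README hunk describes for `decimal_mode` can be illustrated outside the gem. The following is a minimal sketch, not the gem's implementation: `decimal_string` and its `mode:` keyword are hypothetical names, and the sketch omits the trailing-zero stripping and `-0` handling that the README says both modes share.

```ruby
require "bigdecimal"

# Hypothetical helper mirroring the fast/precise split shown in the diff.
# It is NOT part of the cton API.
def decimal_string(value, mode: :fast)
  raise ArgumentError, "mode must be :fast or :precise" unless %i[fast precise].include?(mode)

  native = value.to_s
  # Fast mode keeps Ruby's own float text unless it uses exponent notation.
  needs_bigdecimal = mode == :precise || native.match?(/e/i)
  return native unless needs_bigdecimal

  # "F" asks BigDecimal for plain (non-exponent) notation.
  BigDecimal(native).to_s("F")
end

decimal_string(9.2)                  # => "9.2"
decimal_string(9.2, mode: :precise)  # => "9.2"
decimal_string(1.0e-7)               # => "0.0000001"
```

In this sketch `9.2` never touches `BigDecimal` in fast mode, while `1.0e-07` takes the `BigDecimal` fallback in either mode because Ruby prints it with an exponent.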

data/bench/encode_decode_bench.rb ADDED

@@ -0,0 +1,65 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+require "benchmark"
+require "json"
+require_relative "../lib/cton"
+ITERATIONS = Integer(ENV.fetch("ITERATIONS", 1_000))
+STREAM_SIZE = Integer(ENV.fetch("STREAM_SIZE", 200))
+sample_payload = {
+  "context" => {
+    "task" => "Our favorite hikes together",
+    "location" => "Boulder",
+    "season" => "spring_2025"
+  },
+  "friends" => %w[ana luis sam],
+  "hikes" => Array.new(STREAM_SIZE) do |idx|
+    {
+      "id" => idx + 1,
+      "name" => "Trail ##{idx + 1}",
+      "distanceKm" => (6.0 + ((idx % 5) * 0.5)),
+      "elevationGain" => 250 + ((idx % 3) * 50),
+      "companion" => %w[ana luis sam][idx % 3],
+      "wasSunny" => idx.even?
+    }
+  end
+}
+warm_cton = Cton.dump(sample_payload)
+warm_json = JSON.generate(sample_payload)
+puts "\nEncoding benchmarks (iterations=#{ITERATIONS}, stream_size=#{STREAM_SIZE})"
+Benchmark.bm(25) do |bm|
+  bm.report("cton dump fast") do
+    ITERATIONS.times { Cton.dump(sample_payload) }
+  end
+  bm.report("cton dump precise") do
+    ITERATIONS.times { Cton.dump(sample_payload, decimal_mode: :precise) }
+  end
+  bm.report("json generate") do
+    ITERATIONS.times { JSON.generate(sample_payload) }
+  end
+end
+puts "\nDecoding benchmarks"
+Benchmark.bm(25) do |bm|
+  bm.report("cton load") do
+    ITERATIONS.times { Cton.load(warm_cton) }
+  end
+  bm.report("json parse") do
+    ITERATIONS.times { JSON.parse(warm_json) }
+  end
+end
+puts "\nStreaming decode stress (#{STREAM_SIZE * 2} documents, separator=\"\")"
+inline_blob = warm_cton.delete("\n") * 2
+Benchmark.bm(25) do |bm|
+  bm.report("cton inline load") do
+    ITERATIONS.times { Cton.load(inline_blob) }
+  end
+end

data/lib/cton/decoder.rb CHANGED

@@ -5,13 +5,15 @@ require "strscan"
 module Cton
   class Decoder
     TERMINATORS = [",", ";", ")", "]", "}"].freeze
+    KEY_VALUE_BOUNDARY_TOKENS = ["(", "[", "="].freeze
     def initialize(symbolize_names: false)
       @symbolize_names = symbolize_names
     end
     def decode(cton)
-      @scanner = StringScanner.new(cton.to_s)
+      @raw_string = cton.to_s
+      @scanner = StringScanner.new(@raw_string)
       skip_ws
       value = if key_ahead?
@@ -28,7 +30,7 @@ module Cton
     private
-    attr_reader :symbolize_names, :scanner
+    attr_reader :symbolize_names, :scanner, :raw_string
     def raise_error(message)
       line, col = calculate_location(@scanner.pos)
@@ -36,7 +38,7 @@ module Cton
     end
     def calculate_location(pos)
-      string = @scanner.string
+      string = raw_string
       consumed = string[0...pos]
       line = consumed.count("\n") + 1
       last_newline = consumed.rindex("\n")
@@ -168,56 +170,74 @@ module Cton
     end
     def scan_until_terminator
-      @scanner.scan(/[^,;\]\}\)\(\[\{\s]+/)
+      start_pos = @scanner.pos
+      end_pos = find_terminator_position(start_pos)
+      consume_slice(start_pos, end_pos)
     end
     def scan_until_boundary_or_terminator
       start_pos = @scanner.pos
+      boundary_pos = find_key_boundary(start_pos)
+      end_pos = boundary_pos || find_terminator_position(start_pos)
+      consume_slice(start_pos, end_pos)
+    end
-      chunk = @scanner.scan(/[0-9A-Za-z_.:-]+/)
-      return nil unless chunk
+    def consume_slice(start_pos, end_pos)
+      return nil if end_pos <= start_pos
-      boundary_idx = find_key_boundary(start_pos)
+      token = raw_string.byteslice(start_pos, end_pos - start_pos)
+      @scanner.pos = end_pos
+      token
+    end
-      if boundary_idx
-        length = boundary_idx - start_pos
-        @scanner.pos = start_pos
-        token = @scanner.peek(length)
-        @scanner.pos += length
-        token
-      else
-        @scanner.pos = start_pos + chunk.length
-        chunk
+    def find_terminator_position(start_pos)
+      str = raw_string
+      len = str.length
+      idx = start_pos
+      while idx < len
+        char = str[idx]
+        break if terminator?(char)
+        idx += 1
       end
+      idx
     end
     def find_key_boundary(from_index)
-      str = @scanner.string
+      str = raw_string
       len = str.length
       idx = from_index
       while idx < len
         char = str[idx]
-        return nil if TERMINATORS.include?(char) || whitespace?(char) || "([{".include?(char)
+        return nil if terminator?(char)
         if safe_key_char?(char)
           key_end = idx
           key_end += 1 while key_end < len && safe_key_char?(str[key_end])
-          next_char_idx = key_end
-          if next_char_idx < len
-            next_char = str[next_char_idx]
-            return idx if ["(", "[", "="].include?(next_char) && (idx > from_index)
+          if key_end < len && KEY_VALUE_BOUNDARY_TOKENS.include?(str[key_end]) && idx > from_index && boundary_start_allowed?(str[idx])
+            return idx
           end
         end
         idx += 1
       end
       nil
     end
+    def terminator?(char)
+      TERMINATORS.include?(char) || whitespace?(char) || ["(", "[", "{"].include?(char)
+    end
+    def boundary_start_allowed?(char)
+      !char.nil? && char.match?(/[A-Za-z_.:-]/)
+    end
     def convert_scalar(token)
       case token
       when "true" then true
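
The boundary heuristic in the decoder hunk above can be approximated in a standalone form. This is a sketch under assumptions: `find_boundary`, `key_char?`, and `boundary_start?` are hypothetical names, and the key character class is inferred from the regex removed in this diff (`[0-9A-Za-z_.:-]`), since `safe_key_char?` itself is not shown.

```ruby
BOUNDARY_TOKENS = ["(", "[", "="].freeze
STOP_CHARS = [",", ";", ")", "]", "}", "(", "[", "{"].freeze

# Assumed key alphabet; the real safe_key_char? is not visible in the diff.
def key_char?(char)
  char.match?(/[0-9A-Za-z_.:-]/)
end

# A new key may not start with a digit, so numeric payloads stay intact.
def boundary_start?(char)
  char.match?(/[A-Za-z_.:-]/)
end

# Returns the index where the next key appears to start, or nil.
def find_boundary(str, from)
  idx = from
  while idx < str.length
    char = str[idx]
    return nil if STOP_CHARS.include?(char) || char.match?(/\s/)

    if idx > from && boundary_start?(char)
      key_end = idx
      key_end += 1 while key_end < str.length && key_char?(str[key_end])
      return idx if key_end < str.length && BOUNDARY_TOKENS.include?(str[key_end])
    end
    idx += 1
  end
  nil
end

find_boundary("42hikes[2]", 0)   # => 2   ("42" | "hikes[2]")
find_boundary("samhikes[2]", 0)  # => 1   (ambiguous: "s" | "amhikes[2]")
find_boundary("1,payload=x", 0)  # => nil (terminator reached first)
```

The second call shows, for this sketch at least, why the README warns about separator-free output: an alphabetic value followed directly by a key gives the scan no reliable place to split.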

data/lib/cton/encoder.rb CHANGED

@@ -11,10 +11,14 @@ module Cton
     RESERVED_LITERALS = %w[true false null].freeze
     FLOAT_DECIMAL_PRECISION = Float::DIG
-    def initialize(separator: "\n", pretty: false)
+    def initialize(separator: "\n", pretty: false, decimal_mode: :fast)
       @separator = separator || ""
       @pretty = pretty
+      @decimal_mode = decimal_mode
+      raise ArgumentError, "decimal_mode must be :fast or :precise" unless %i[fast precise].include?(@decimal_mode)
       @indent_level = 0
+      @table_schema_cache = {}
     end
     def encode(payload, io: nil)
@@ -25,7 +29,7 @@ module Cton
     private
-    attr_reader :separator, :io, :pretty, :indent_level
+    attr_reader :separator, :io, :pretty, :indent_level, :decimal_mode
     def encode_root(value)
       case value
@@ -96,8 +100,8 @@ module Cton
       io << "[" << length.to_s << "]"
-      if table_candidate?(list)
-        encode_table(list)
+      if (header = table_schema_for(list))
+        encode_table(list, header)
       else
         io << "="
         if list.all? { |value| scalar?(value) }
@@ -108,8 +112,7 @@ module Cton
       end
     end
-    def encode_table(rows)
-      header = rows.first.keys
+    def encode_table(rows, header)
       io << "{"
       io << header.map { |key| format_key(key) }.join(",")
       io << "}="
@@ -150,10 +153,14 @@ module Cton
         outdent
       else
         first = true
-        list.each do |value|
-          io << "," unless first
-          encode_scalar(value)
-          first = false
+        if fast_scalar_stream?(list)
+          io << fast_scalar_stream(list)
+        else
+          list.each do |value|
+            io << "," unless first
+            encode_scalar(value)
+            first = false
+          end
         end
       end
     end
@@ -174,30 +181,34 @@ module Cton
     end
     def encode_scalar(value)
+      io << scalar_to_string(value)
+    end
+    def scalar_to_string(value)
       case value
       when String
-        encode_string(value)
+        format_string(value)
       when TrueClass, FalseClass
-        io << (value ? "true" : "false")
+        value ? "true" : "false"
       when NilClass
-        io << "null"
+        "null"
       when Numeric
-        io << format_number(value)
+        format_number(value)
       when Time, Date
-        encode_string(value.iso8601)
+        format_string(value.iso8601)
       else
         raise EncodeError, "Unsupported value: #{value.class}"
       end
     end
-    def encode_string(value)
-      io << if value.empty?
-              '""'
-            elsif string_needs_quotes?(value)
-              quote_string(value)
-            else
-              value
-            end
+    def format_string(value)
+      if value.empty?
+        '""'
+      elsif string_needs_quotes?(value)
+        quote_string(value)
+      else
+        value
+      end
     end
     def format_number(value)
@@ -234,6 +245,17 @@ module Cton
     end
     def float_decimal_string(value)
+      return precise_float_decimal_string(value) if decimal_mode == :precise
+      decimal = value.to_s
+      if decimal.include?("e") || decimal.include?("E")
+        precise_float_decimal_string(value)
+      else
+        decimal
+      end
+    end
+    def precise_float_decimal_string(value)
       if defined?(BigDecimal)
         BigDecimal(value.to_s).to_s("F")
       else
@@ -278,16 +300,64 @@ module Cton
       value.is_a?(String) || value.is_a?(Numeric) || value == true || value == false || value.nil? || value.is_a?(Time) || value.is_a?(Date)
     end
-    def table_candidate?(rows)
-      return false if rows.empty?
+    def table_schema_for(rows)
+      cache_lookup = @table_schema_cache.fetch(rows.object_id, :__missing__)
+      return cache_lookup unless cache_lookup == :__missing__
+      schema = compute_table_schema(rows)
+      @table_schema_cache[rows.object_id] = schema
+    end
+    def compute_table_schema(rows)
+      return nil if rows.empty?
       first = rows.first
-      return false unless first.is_a?(Hash) && !first.empty?
+      return nil unless first.is_a?(Hash) && !first.empty?
+      header = first.keys.freeze
+      rows.each do |row|
+        return nil unless row.is_a?(Hash)
+        return nil unless row.keys == header
+        return nil unless row.values.all? { |val| scalar?(val) }
+      end
+      header
+    end
+    def fast_scalar_stream?(list)
+      !pretty && list.length > 4 && homogeneous_scalar_tokens?(list)
+    end
+    def homogeneous_scalar_tokens?(list)
+      first_class = nil
+      list.all? do |value|
+        return false unless scalar?(value)
+        token_class = value.class
+        first_class ||= token_class
+        token_class == first_class && token_does_not_require_quotes?(value)
+      end
+    end
+    def token_does_not_require_quotes?(value)
+      case value
+      when String
+        !value.empty? && !string_needs_quotes?(value)
+      when Integer, TrueClass, FalseClass, NilClass
+        true
+      else
+        false
+      end
+    end
-      keys = first.keys
-      rows.all? do |row|
-        row.is_a?(Hash) && row.keys == keys && row.values.all? { |val| scalar?(val) }
+    def fast_scalar_stream(list)
+      buffer = String.new
+      list.each_with_index do |value, index|
+        buffer << "," unless index.zero?
+        buffer << scalar_to_string(value)
       end
+      buffer
     end
     def indent
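
The table-schema check introduced in `compute_table_schema` above, combined with the `key[count]{columns}=values` layout from the README, can be sketched independently of the gem. `table_header` and `render_table` are hypothetical names; the sketch skips quoting, `null` handling, and the per-array memoization, so it only suits plain strings and integers.

```ruby
# Simplified scalar test; the gem's version also accepts Time and Date.
def scalar?(value)
  case value
  when String, Numeric, true, false, nil then true
  else false
  end
end

# Returns the shared key list when every row is a hash with identical
# keys (same order) and scalar values; otherwise nil.
def table_header(rows)
  first = rows.first
  return nil unless first.is_a?(Hash) && !first.empty?

  header = first.keys
  uniform = rows.all? do |row|
    row.is_a?(Hash) && row.keys == header && row.values.all? { |v| scalar?(v) }
  end
  uniform ? header : nil
end

# Renders key[N]{cols}=row;row with no quoting (illustration only).
def render_table(key, rows)
  header = table_header(rows)
  return nil unless header

  body = rows.map { |row| header.map { |col| row[col].to_s }.join(",") }.join(";")
  "#{key}[#{rows.length}]{#{header.join(',')}}=#{body}"
end

files = [
  { "name" => "README.md", "size" => 1024 },
  { "name" => "lib.rb", "size" => 2048 }
]
render_table("files", files)
# => "files[2]{name,size}=README.md,1024;lib.rb,2048"
```

Returning the header (or `nil`) instead of a boolean is the design choice visible in the diff: the caller gets the schema it needs for encoding from the same pass that validates the rows.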

data/lib/cton/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Cton
-  VERSION = "0.2.0"
+  VERSION = "0.3.0"
 end

data/lib/cton.rb CHANGED

@@ -28,7 +28,8 @@ module Cton
     separator = options.fetch(:separator, "\n")
     pretty = options.fetch(:pretty, false)
-    Encoder.new(separator: separator, pretty: pretty).encode(payload, io: io)
+    decimal_mode = options.fetch(:decimal_mode, :fast)
+    Encoder.new(separator: separator, pretty: pretty, decimal_mode: decimal_mode).encode(payload, io: io)
   end
   alias generate dump

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: cton
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 platform: ruby
 authors:
 - Davide Santangelo
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-11-19 00:00:00.000000000 Z
+date: 2025-11-20 00:00:00.000000000 Z
 dependencies: []
 description: CTON provides a JSON-compatible, token-efficient text representation
   optimized for LLM prompts.
@@ -25,6 +25,7 @@ files:
 - LICENSE.txt
 - README.md
 - Rakefile
+- bench/encode_decode_bench.rb
 - lib/cton.rb
 - lib/cton/decoder.rb
 - lib/cton/encoder.rb