cton 0.1.0 → 0.2.0
This diff reflects the changes between publicly released versions of the package as they appear in their public registry; it is provided for informational purposes only.
- checksums.yaml +4 -4
- data/.rubocop.yml +13 -0
- data/CHANGELOG.md +31 -0
- data/README.md +158 -38
- data/lib/cton/decoder.rb +327 -0
- data/lib/cton/encoder.rb +307 -0
- data/lib/cton/version.rb +1 -1
- data/lib/cton.rb +17 -559
- data/sig/cton.rbs +76 -4
- metadata +5 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b010a8f0e0da39e4e4d0a4217eddaa8f9496f1889bf32e12430fdb7737f17fab
+  data.tar.gz: 6fe6f58ff0a40233a279ae5c8881ccca4ce382fa85cae15c2c5e26782bb02875
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3a85563dd205c2c00b204359d85376514de8fc45ce2b2c98e4d52a0325bff2937e2d88ba5e367fe718a0b82127603deadfe16dd6f60062e77a1b75babc666ec4
+  data.tar.gz: b4b27bfb483e0145c49def7b9ab735c27e03420dc59fd6bcaabc57d1b2bf6868d7bc5c55fea9866da3270a6c81126df032590129a1e28385827b8b4f3058e92a
data/.rubocop.yml
CHANGED
@@ -1,4 +1,5 @@
 AllCops:
+  NewCops: enable
   TargetRubyVersion: 3.1

 Style/StringLiterals:
@@ -6,3 +7,15 @@ Style/StringLiterals:

 Style/StringLiteralsInInterpolation:
   EnforcedStyle: double_quotes
+
+Style/FrozenStringLiteralComment:
+  Enabled: true
+
+Metrics/MethodLength:
+  Max: 25
+
+Metrics/ClassLength:
+  Max: 200
+
+Layout/LineLength:
+  Max: 120
data/CHANGELOG.md
CHANGED
@@ -5,6 +5,37 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.2.0] - 2025-11-19
+
+### Added
+
+- **CLI Tool**: New `bin/cton` executable for converting between JSON and CTON from the command line. Supports auto-detection, pretty printing, and file I/O.
+- **Streaming IO**: `Cton.dump` now accepts an `IO` object as the second argument (or via `io:` keyword), allowing direct writing to files or sockets without intermediate string allocation.
+- **Pretty Printing**: Added `pretty: true` option to `Cton.dump` to format output with indentation and newlines for better readability.
+- **Extended Types**: Native support for `Time`, `Date` (ISO8601), `Set` (as Array), and `OpenStruct` (as Object).
+- **Enhanced Error Reporting**: `ParseError` now includes line and column numbers to help locate syntax errors in large documents.
+
+### Changed
+
+- **Ruby 3 Compatibility**: Improved argument handling in `Cton.dump` to robustly support Ruby 3 keyword arguments when passing hashes.
+
+## [0.1.1] - 2025-11-18
+
+### Changed
+
+- **Performance**: Refactored `Encoder` to use `StringIO` and `Decoder` to use `StringScanner` for significantly improved performance and memory usage.
+- **Architecture**: Split `Cton` module into dedicated `Cton::Encoder` and `Cton::Decoder` classes for better maintainability.
+
+### Fixed
+
+- **Parsing**: Fixed an issue where unterminated strings were not correctly detected.
+- **Whitespace**: Improved whitespace handling in the decoder, specifically fixing issues with whitespace between keys and structure markers.
+
+### Added
+
+- **Type Safety**: Added comprehensive RBS signatures (`sig/cton.rbs`) for better IDE support and static analysis.
+- **Tests**: Expanded test coverage for validation, complex tables, mixed arrays, unicode values, and error cases.
+
 ## [0.1.0] - 2025-11-18

 ### Added
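To make the 0.2.0 additions above concrete, here is a minimal usage sketch based only on the options documented in this changelog and in the README diff below (`pretty:`, an `IO` second argument, `symbolize_names:`); the sample payload and file name are illustrative, not taken from the package.

```ruby
require "cton"

payload = { "task" => "planning", "tags" => %w[ruby gem llm] }

# Classic string output (unchanged from 0.1.x)
cton = Cton.dump(payload)

# New in 0.2.0: pretty printing with indentation and newlines
puts Cton.dump(payload, pretty: true)

# New in 0.2.0: stream straight to an IO instead of building a string
File.open("data.cton", "w") { |f| Cton.dump(payload, f) }

# Round-trip back to a Hash, optionally with symbol keys
Cton.load(cton, symbolize_names: true)
```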
data/README.md
CHANGED
@@ -1,32 +1,113 @@
 # CTON

-
+[](https://badge.fury.io/rb/cton)
+[](https://github.com/davidesantangelo/cton/blob/master/LICENSE.txt)
+
+**CTON** (Compact Token-Oriented Notation) is an aggressively minified, JSON-compatible wire format that keeps prompts short without giving up schema hints. It is shape-preserving (objects, arrays, scalars, table-like arrays) and deterministic, so you can safely round-trip between Ruby hashes and compact strings that work well in LLM prompts.
+
+---
+
+## 📖 Table of Contents
+
+- [What is CTON?](#what-is-cton)
+- [Why another format?](#why-another-format)
+- [Examples](#examples)
+- [Token Savings](#token-savings-vs-json--toon)
+- [Installation](#installation)
+- [Usage](#usage)
+- [Development](#development)
+- [Contributing](#contributing)
+- [License](#license)
+
+---
+
+## What is CTON?
+
+CTON is designed to be the most efficient way to represent structured data for Large Language Models (LLMs). It strips away the "syntactic sugar" of JSON that humans like (indentation, excessive quoting, braces) but machines don't strictly need, while adding "structural hints" that help LLMs generate valid output.
+
+### Key Concepts
+
+1. **Root is Implicit**: No curly braces `{}` wrapping the entire document.
+2. **Minimal Punctuation**:
+   * Objects use `key=value`.
+   * Nested objects use parentheses `(key=value)`.
+   * Arrays use brackets with length `[N]=item1,item2`.
+3. **Table Compression**: If an array contains objects with the same keys, CTON automatically converts it into a table format `[N]{header1,header2}=val1,val2;val3,val4`. This is a massive token saver for datasets.
+
+---
+
+## Examples
+
+### Simple Key-Value Pairs
+
+**JSON**
+```json
+{
+  "task": "planning",
+  "urgent": true,
+  "id": 123
+}
+```
+
+**CTON**
+```text
+task=planning,urgent=true,id=123
+```
+
+### Nested Objects
+
+**JSON**
+```json
+{
+  "user": {
+    "name": "Davide",
+    "settings": {
+      "theme": "dark"
+    }
+  }
+}
+```
+
+**CTON**
+```text
+user(name=Davide,settings(theme=dark))
+```
+
+### Arrays and Tables
+
+**JSON**
+```json
+{
+  "tags": ["ruby", "gem", "llm"],
+  "files": [
+    { "name": "README.md", "size": 1024 },
+    { "name": "lib/cton.rb", "size": 2048 }
+  ]
+}
+```
+
+**CTON**
+```text
+tags[3]=ruby,gem,llm
+files[2]{name,size}=README.md,1024;lib/cton.rb,2048
+```
+
+---

 ## Why another format?

 - **Less noise than YAML/JSON**: no indentation, no braces around the root, and optional quoting.
 - **Schema guardrails**: arrays carry their length (`friends[3]`) and table headers (`{id,name,...}`) so downstream parsing can verify shape.
 - **LLM-friendly**: works as a single string you can embed in a prompt together with short parsing instructions.
-- **Token savings**: CTON compounds the JSON → TOON savings
+- **Token savings**: CTON compounds the JSON → TOON savings.

-
+### Token savings vs JSON & TOON

 - **JSON → TOON**: The [TOON benchmarks](https://toonformat.dev) report roughly 40% fewer tokens than plain JSON on mixed-structure prompts while retaining accuracy due to explicit array lengths and headers.
-- **TOON → CTON**: By stripping indentation and forcing everything inline, CTON cuts another ~20–40% of characters.
-- **Net effect**: In practice you can often reclaim 50–60% of the token budget versus raw JSON, leaving more room for instructions or reasoning steps while keeping a deterministic schema.
+- **TOON → CTON**: By stripping indentation and forcing everything inline, CTON cuts another ~20–40% of characters.
+- **Net effect**: In practice you can often reclaim **50–60% of the token budget** versus raw JSON, leaving more room for instructions or reasoning steps while keeping a deterministic schema.

-
-
-```
-context(task="Our favorite hikes together",location=Boulder,season=spring_2025)
-friends[3]=ana,luis,sam
-hikes[3]{id,name,distanceKm,elevationGain,companion,wasSunny}=1,"Blue Lake Trail",7.5,320,ana,true;2,"Ridge Overlook",9.2,540,luis,false;3,"Wildflower Loop",5.1,180,sam,true
-```
-
-- Objects use parentheses and `key=value` pairs separated by commas.
-- Arrays encode their length: `[N]=...`. When every element is a flat hash with the same keys, they collapse into a compact table: `[N]{key1,key2}=row1;row2`.
-- Scalars (numbers, booleans, `null`) keep their JSON text. Strings only need quotes when they contain whitespace or reserved punctuation.
-- For parsing safety the Ruby encoder inserts a single `\n` between top-level segments. You can override this if you truly need a fully inline document (see options below).
+---

 ## Installation

@@ -42,28 +123,32 @@ Or install it directly:
 gem install cton
 ```

+---
+
 ## Usage

 ```ruby
 require "cton"

 payload = {
-
-
-
-
-
-
-
-
-
-
-
+  "context" => {
+    "task" => "Our favorite hikes together",
+    "location" => "Boulder",
+    "season" => "spring_2025"
+  },
+  "friends" => %w[ana luis sam],
+  "hikes" => [
+    { "id" => 1, "name" => "Blue Lake Trail", "distanceKm" => 7.5, "elevationGain" => 320, "companion" => "ana", "wasSunny" => true },
+    { "id" => 2, "name" => "Ridge Overlook", "distanceKm" => 9.2, "elevationGain" => 540, "companion" => "luis", "wasSunny" => false },
+    { "id" => 3, "name" => "Wildflower Loop", "distanceKm" => 5.1, "elevationGain" => 180, "companion" => "sam", "wasSunny" => true }
+  ]
 }

+# Encode to CTON
 cton = Cton.dump(payload)
 # => "context(... )\nfriends[3]=ana,luis,sam\nhikes[3]{...}"

+# Decode back to Hash
 round_tripped = Cton.load(cton)
 # => original hash

@@ -72,30 +157,65 @@ symbolized = Cton.load(cton, symbolize_names: true)

 # Want a truly inline document? Opt in explicitly (decoding becomes unsafe for ambiguous cases).
 inline = Cton.dump(payload, separator: "")
+
+# Pretty print for human readability
+pretty = Cton.dump(payload, pretty: true)
+
+# Stream to an IO object (file, socket, etc.)
+File.open("data.cton", "w") do |f|
+  Cton.dump(payload, f)
+end
 ```

-###
+### CLI Tool

-
+CTON comes with a command-line tool for quick conversions:

-
+```bash
+# Convert JSON to CTON
+echo '{"hello": "world"}' | cton
+# => hello=world

-
+# Convert CTON to JSON
+echo 'hello=world' | cton --to-json
+# => {"hello":"world"}

-
+# Pretty print
+cton --pretty input.json
+```

-
+### Advanced Features
+
+#### Extended Types
+CTON natively supports serialization for:
+- `Time` and `Date` (ISO8601 strings)
+- `Set` (converted to Arrays)
+- `OpenStruct` (converted to Objects)
+
+#### Table detection
+Whenever an array is made of hashes that all expose the same scalar keys, the encoder flattens it into a table to save tokens. Mixed or nested arrays fall back to `[N]=(value1,value2,...)`.

+#### Separators & ambiguity
+Removing every newline makes certain inputs ambiguous because `sam` and the next key `hikes` can merge into `samhikes`. The default `separator: "\n"` avoids that by inserting a single newline between root segments. You may pass `separator: ""` to `Cton.dump` for maximum compactness, but decoding such strings is only safe if you can guarantee extra quoting or whitespace between segments.
+
+#### Literal safety & number normalization
+Following the TOON specification's guardrails, the encoder now:
 - Auto-quotes strings that would otherwise be parsed as booleans, `null`, or numbers (e.g., `"true"`, `"007"`, `"1e6"`, `"-5"`) so they round-trip as strings without extra work.
 - Canonicalizes float/BigDecimal output: no exponent notation, no trailing zeros, and `-0` collapses to `0`.
 - Converts `NaN` and `±Infinity` inputs to `null`, matching TOON's normalization guidance so downstream decoders don't explode on non-finite numbers.

+---
+
+## Type Safety
+
+CTON ships with RBS signatures (`sig/cton.rbs`) to support type checking and IDE autocompletion.
+
 ## Development

 ```bash
-bin/setup
-bundle exec
-bin/console
+bin/setup        # install dependencies
+bundle exec rake # run tests and rubocop
+bin/console      # interactive playground
 ```

 To release a new version, bump `Cton::VERSION` and run `bundle exec rake release`.

@@ -106,4 +226,4 @@ Bug reports and pull requests are welcome at https://github.com/davidesantangelo

 ## License

-MIT © Davide Santangelo
+MIT © [Davide Santangelo](https://github.com/davidesantangelo)
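As a companion to the table-detection notes in the README above, a small round-trip sketch; the wire string in the comment follows the README's documented `[N]{header}=rows` form and is illustrative rather than normative.

```ruby
require "cton"

files = [
  { "name" => "README.md",   "size" => 1024 },
  { "name" => "lib/cton.rb", "size" => 2048 }
]

# Uniform hashes collapse into the compact table form, e.g.
#   files[2]{name,size}=README.md,1024;lib/cton.rb,2048
cton = Cton.dump({ "files" => files })

Cton.load(cton) # => hash with the same shape as the input
```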
data/lib/cton/decoder.rb
ADDED
@@ -0,0 +1,327 @@
+# frozen_string_literal: true
+
+require "strscan"
+
+module Cton
+  class Decoder
+    TERMINATORS = [",", ";", ")", "]", "}"].freeze
+
+    def initialize(symbolize_names: false)
+      @symbolize_names = symbolize_names
+    end
+
+    def decode(cton)
+      @scanner = StringScanner.new(cton.to_s)
+      skip_ws
+
+      value = if key_ahead?
+                parse_document
+              else
+                parse_value(allow_key_boundary: true)
+              end
+
+      skip_ws
+      raise_error("Unexpected trailing data") unless @scanner.eos?
+
+      value
+    end
+
+    private
+
+    attr_reader :symbolize_names, :scanner
+
+    def raise_error(message)
+      line, col = calculate_location(@scanner.pos)
+      raise ParseError, "#{message} at line #{line}, column #{col}"
+    end
+
+    def calculate_location(pos)
+      string = @scanner.string
+      consumed = string[0...pos]
+      line = consumed.count("\n") + 1
+      last_newline = consumed.rindex("\n")
+      col = last_newline ? pos - last_newline : pos + 1
+      [line, col]
+    end
+
+    def parse_document
+      result = {}
+      until @scanner.eos?
+        key = parse_key_name
+        value = parse_value_for_key
+        result[key] = value
+        skip_ws
+      end
+      result
+    end
+
+    def parse_value_for_key
+      skip_ws
+      if @scanner.scan("(")
+        parse_object
+      elsif @scanner.scan("[")
+        parse_array
+      elsif @scanner.scan("=")
+        parse_scalar(allow_key_boundary: true)
+      else
+        raise_error("Unexpected token")
+      end
+    end
+
+    def parse_object
+      skip_ws
+      return {} if @scanner.scan(")")
+
+      pairs = {}
+      loop do
+        key = parse_key_name
+        expect!("=")
+        value = parse_value
+        pairs[key] = value
+        skip_ws
+        break if @scanner.scan(")")
+
+        expect!(",")
+        skip_ws
+      end
+      pairs
+    end
+
+    def parse_array
+      length = parse_integer_literal
+      expect!("]")
+      skip_ws
+
+      header = parse_header if @scanner.peek(1) == "{"
+
+      expect!("=")
+      return [] if length.zero?
+
+      header ? parse_table_rows(length, header) : parse_array_elements(length)
+    end
+
+    def parse_header
+      expect!("{")
+      fields = []
+      loop do
+        fields << parse_key_name
+        break if @scanner.scan("}")
+
+        expect!(",")
+      end
+      fields
+    end
+
+    def parse_table_rows(length, header)
+      rows = []
+      length.times do |row_index|
+        row = {}
+        header.each_with_index do |field, column_index|
+          allow_boundary = row_index == length - 1 && column_index == header.length - 1
+          row[field] = parse_scalar(allow_key_boundary: allow_boundary)
+          expect!(",") if column_index < header.length - 1
+        end
+        rows << symbolize_keys(row)
+        expect!(";") if row_index < length - 1
+      end
+      rows
+    end
+
+    def parse_array_elements(length)
+      values = []
+      length.times do |index|
+        allow_boundary = index == length - 1
+        values << parse_value(allow_key_boundary: allow_boundary)
+        expect!(",") if index < length - 1
+      end
+      values
+    end
+
+    def parse_value(allow_key_boundary: false)
+      skip_ws
+      if @scanner.scan("(")
+        parse_object
+      elsif @scanner.scan("[")
+        parse_array
+      elsif @scanner.peek(1) == '"'
+        parse_string
+      else
+        parse_scalar(allow_key_boundary: allow_key_boundary)
+      end
+    end
+
+    def parse_scalar(allow_key_boundary: false)
+      skip_ws
+      return parse_string if @scanner.peek(1) == '"'
+
+      @scanner.pos
+
+      token = if allow_key_boundary
+                scan_until_boundary_or_terminator
+              else
+                scan_until_terminator
+              end
+
+      raise_error("Empty value") if token.nil? || token.empty?
+
+      convert_scalar(token)
+    end
+
+    def scan_until_terminator
+      @scanner.scan(/[^,;\]\}\)\(\[\{\s]+/)
+    end
+
+    def scan_until_boundary_or_terminator
+      start_pos = @scanner.pos
+
+      chunk = @scanner.scan(/[0-9A-Za-z_.:-]+/)
+      return nil unless chunk
+
+      boundary_idx = find_key_boundary(start_pos)
+
+      if boundary_idx
+        length = boundary_idx - start_pos
+        @scanner.pos = start_pos
+        token = @scanner.peek(length)
+        @scanner.pos += length
+        token
+      else
+        @scanner.pos = start_pos + chunk.length
+        chunk
+      end
+    end
+
+    def find_key_boundary(from_index)
+      str = @scanner.string
+      len = str.length
+      idx = from_index
+
+      while idx < len
+        char = str[idx]
+
+        return nil if TERMINATORS.include?(char) || whitespace?(char) || "([{".include?(char)
+
+        if safe_key_char?(char)
+          key_end = idx
+          key_end += 1 while key_end < len && safe_key_char?(str[key_end])
+
+          next_char_idx = key_end
+
+          if next_char_idx < len
+            next_char = str[next_char_idx]
+            return idx if ["(", "[", "="].include?(next_char) && (idx > from_index)
+          end
+        end
+
+        idx += 1
+      end
+      nil
+    end
+
+    def convert_scalar(token)
+      case token
+      when "true" then true
+      when "false" then false
+      when "null" then nil
+      else
+        if integer?(token)
+          token.to_i
+        elsif float?(token)
+          token.to_f
+        else
+          token
+        end
+      end
+    end
+
+    def parse_string
+      expect!("\"")
+      buffer = +""
+      loop do
+        raise_error("Unterminated string") if @scanner.eos?
+
+        char = @scanner.getch
+
+        if char == "\\"
+          escaped = @scanner.getch
+          raise_error("Invalid escape sequence") if escaped.nil?
+          buffer << case escaped
+                    when "n" then "\n"
+                    when "r" then "\r"
+                    when "t" then "\t"
+                    when '"', "\\" then escaped
+                    else
+                      raise_error("Unsupported escape sequence")
+                    end
+        elsif char == '"'
+          break
+        else
+          buffer << char
+        end
+      end
+      buffer
+    end
+
+    def parse_key_name
+      skip_ws
+      token = @scanner.scan(/[0-9A-Za-z_.:-]+/)
+      raise_error("Invalid key") if token.nil?
+      symbolize_names ? token.to_sym : token
+    end
+
+    def parse_integer_literal
+      token = @scanner.scan(/-?\d+/)
+      raise_error("Expected digits") if token.nil?
+      Integer(token, 10)
+    rescue ArgumentError
+      raise_error("Invalid length literal")
+    end
+
+    def symbolize_keys(row)
+      symbolize_names ? row.transform_keys(&:to_sym) : row
+    end
+
+    def expect!(char)
+      skip_ws
+      return if @scanner.scan(Regexp.new(Regexp.escape(char)))
+
+      raise_error("Expected #{char.inspect}, got #{@scanner.peek(1).inspect}")
+    end
+
+    def skip_ws
+      @scanner.skip(/\s+/)
+    end
+
+    def whitespace?(char)
+      [" ", "\t", "\n", "\r"].include?(char)
+    end
+
+    def key_ahead?
+      pos = @scanner.pos
+      skip_ws
+
+      if @scanner.scan(/[0-9A-Za-z_.:-]+/)
+        skip_ws
+        next_char = @scanner.peek(1)
+        result = ["(", "[", "="].include?(next_char)
+        @scanner.pos = pos
+        result
+      else
+        @scanner.pos = pos
+        false
+      end
+    end
+
+    def safe_key_char?(char)
+      !char.nil? && char.match?(/[0-9A-Za-z_.:-]/)
+    end
+
+    def integer?(token)
+      token.match?(/\A-?(?:0|[1-9]\d*)\z/)
+    end
+
+    def float?(token)
+      token.match?(/\A-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\z/)
+    end
+  end
+end
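A short sketch of how the decoder's line/column error reporting surfaces to callers, assuming the public `Cton.load` entry point delegates to this `Decoder` and lets its `Cton::ParseError` propagate (as the changelog's architecture note suggests); the sample input and exact message text are illustrative.

```ruby
require "cton"

begin
  # Missing closing parenthesis, so the decoder hits end of input
  Cton.load('user(name=Davide')
rescue Cton::ParseError => e
  # raise_error appends the position computed by calculate_location,
  # e.g. %(Expected "," ... at line 1, column 17)
  warn e.message
end
```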