jrf 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/DESIGN.txt +455 -0
- data/Gemfile +5 -0
- data/Rakefile +10 -0
- data/exe/jrf +7 -0
- data/jrf.gemspec +20 -0
- data/lib/jrf/cli.rb +32 -0
- data/lib/jrf/control.rb +8 -0
- data/lib/jrf/pipeline_parser.rb +147 -0
- data/lib/jrf/reducers.rb +27 -0
- data/lib/jrf/row_context.rb +200 -0
- data/lib/jrf/runner.rb +179 -0
- data/lib/jrf/version.rb +5 -0
- data/lib/jrf.rb +4 -0
- data/test/jrf_test.rb +325 -0
- metadata +54 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: 1b776b841380488528a7344ddf5cb4e640ae512b9b22f753305c60c2146ed3bb
|
|
4
|
+
data.tar.gz: 1def2307c6f2d8b14d7e374c1175cebe177b7086d2d4e7412955c3da7f917e88
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: 6a18435576e8c8fea910126e1c2278331569a0dc24c3916e209b274904102428e754b0965c501fae9ed09ce317495d9b222d1b222904a0748288908e6a057c52
|
|
7
|
+
data.tar.gz: f216e265bd94bc462e1e285957ca609f2b38bd2c5ebc238270f699a4b59d1cd21359aaa70d11dbbb2158982c4723ecb69604f2b9707823121cc1b8f5b0f32bf1
|
data/DESIGN.txt
ADDED
|
@@ -0,0 +1,455 @@
|
|
|
1
|
+
NAME
|
|
2
|
+
jr - a small, lightweight NDJSON transformer with Ruby-like expressions
|
|
3
|
+
|
|
4
|
+
OVERVIEW
|
|
5
|
+
jr is a command-line tool for transforming NDJSON using Ruby-like
|
|
6
|
+
expressions.
|
|
7
|
+
|
|
8
|
+
It is intentionally not a jq-compatible general-purpose JSON language.
|
|
9
|
+
Its value comes from a much narrower scope and from being implementable
|
|
10
|
+
in a very simple way.
|
|
11
|
+
|
|
12
|
+
The goal is to support expressions like:
|
|
13
|
+
|
|
14
|
+
jr '["foo"]'
|
|
15
|
+
jr 'select(/abc/.match(["aaa"])) >> ["foo"]'
|
|
16
|
+
jr '["items"] >> flat'
|
|
17
|
+
jr 'sum(["foo"])'
|
|
18
|
+
jr 'select(["x"] > 10) >> ["foo"] >> sum(["bar"])'
|
|
19
|
+
|
|
20
|
+
That is:
|
|
21
|
+
|
|
22
|
+
* extract a value from each JSON line
|
|
23
|
+
|
|
24
|
+
* filter lines by a predicate
|
|
25
|
+
|
|
26
|
+
* flatten arrays into multiple output lines
|
|
27
|
+
|
|
28
|
+
* aggregate values, such as summing them
|
|
29
|
+
|
|
30
|
+
This document is not just a user-facing description. It is a design
|
|
31
|
+
constraint document for implementors. The point is to preserve the
|
|
32
|
+
simplicity we agreed on, so that jr does not drift into a heavy
|
|
33
|
+
implementation.
|
|
34
|
+
|
|
35
|
+
DESIGN PRINCIPLE
|
|
36
|
+
jr must be implemented in a way that keeps the runtime model extremely
|
|
37
|
+
simple.
|
|
38
|
+
|
|
39
|
+
The implementation must not drift into:
|
|
40
|
+
|
|
41
|
+
* AST construction and optimization
|
|
42
|
+
|
|
43
|
+
* wrapping child objects in DSL wrapper objects
|
|
44
|
+
|
|
45
|
+
* a large generic streaming-stage framework
|
|
46
|
+
|
|
47
|
+
* per-line allocation of many intermediate DSL objects
|
|
48
|
+
|
|
49
|
+
* jq-like general stream semantics
|
|
50
|
+
|
|
51
|
+
Instead, jr should be implemented under the following constraints.
|
|
52
|
+
|
|
53
|
+
CORE MODEL
|
|
54
|
+
Input model
|
|
55
|
+
Input is NDJSON.
|
|
56
|
+
|
|
57
|
+
Each line is parsed as one JSON value.
|
|
58
|
+
|
|
59
|
+
The primary execution model is line-by-line processing.
|
|
60
|
+
|
|
61
|
+
A simple conceptual loop is sufficient:
|
|
62
|
+
|
|
63
|
+
ARGF.each_line do |line|
|
|
64
|
+
row = JSON.parse(line)
|
|
65
|
+
...
|
|
66
|
+
end
|
|
67
|
+
|
|
68
|
+
Evaluation context
|
|
69
|
+
Expressions are evaluated with the current row bound as "self".
|
|
70
|
+
|
|
71
|
+
That means the basic field access syntax is:
|
|
72
|
+
|
|
73
|
+
["foo"]
|
|
74
|
+
["foo"]["bar"]
|
|
75
|
+
|
|
76
|
+
No "_" or "_." prefix is required.
|
|
77
|
+
|
|
78
|
+
Root-only DSL
|
|
79
|
+
The DSL exists only at the root context.
|
|
80
|
+
|
|
81
|
+
This is a mandatory design rule.
|
|
82
|
+
|
|
83
|
+
The expression context object only needs to represent the current row.
|
|
84
|
+
Child values are not wrapped.
|
|
85
|
+
|
|
86
|
+
Return value of "[]"
|
|
87
|
+
"["foo"]" returns the underlying Ruby value directly.
|
|
88
|
+
|
|
89
|
+
That means:
|
|
90
|
+
|
|
91
|
+
* Hash values remain Hash
|
|
92
|
+
|
|
93
|
+
* Array values remain Array
|
|
94
|
+
|
|
95
|
+
* String values remain String
|
|
96
|
+
|
|
97
|
+
* Numeric values remain Numeric
|
|
98
|
+
|
|
99
|
+
* "nil" remains "nil"
|
|
100
|
+
|
|
101
|
+
This is critical.
|
|
102
|
+
|
|
103
|
+
For example:
|
|
104
|
+
|
|
105
|
+
["foo"]["bar"]
|
|
106
|
+
|
|
107
|
+
must work simply because "["foo"]" returned a normal Ruby "Hash", and
|
|
108
|
+
the next "["bar"]" is just Ruby's normal "Hash#[]".
|
|
109
|
+
|
|
110
|
+
Child wrappers must not exist.
|
|
111
|
+
|
|
112
|
+
Reuse of the root context
|
|
113
|
+
The root row context must be reused across all input lines.
|
|
114
|
+
|
|
115
|
+
A minimal model is:
|
|
116
|
+
|
|
117
|
+
class RowContext
|
|
118
|
+
def initialize(obj = nil)
|
|
119
|
+
@obj = obj
|
|
120
|
+
end
|
|
121
|
+
|
|
122
|
+
def reset(obj)
|
|
123
|
+
@obj = obj
|
|
124
|
+
self
|
|
125
|
+
end
|
|
126
|
+
|
|
127
|
+
def [](key)
|
|
128
|
+
@obj[key]
|
|
129
|
+
end
|
|
130
|
+
end
|
|
131
|
+
|
|
132
|
+
The per-line execution model should be conceptually as simple as:
|
|
133
|
+
|
|
134
|
+
ctx.reset(row)
|
|
135
|
+
ctx.instance_eval(expr_source)
|
|
136
|
+
|
|
137
|
+
The implementation should not allocate a new root DSL object for every
|
|
138
|
+
line.
|
|
139
|
+
|
|
140
|
+
PIPELINE SYNTAX
|
|
141
|
+
Multiple stages are connected using top-level ">>".
|
|
142
|
+
|
|
143
|
+
Example:
|
|
144
|
+
|
|
145
|
+
jr 'select(["x"] > 10) >> ["foo"] >> sum(["bar"])'
|
|
146
|
+
|
|
147
|
+
This ">>" is not Ruby's shift operator in the execution model.
|
|
148
|
+
|
|
149
|
+
Instead, jr splits the top-level source string on top-level occurrences
|
|
150
|
+
of ">>" before evaluating the individual stage expressions as Ruby.
|
|
151
|
+
|
|
152
|
+
So the above is treated internally as three stages:
|
|
153
|
+
|
|
154
|
+
select(["x"] > 10)
|
|
155
|
+
["foo"]
|
|
156
|
+
sum(["bar"])
|
|
157
|
+
|
|
158
|
+
This design choice is intentional and important.
|
|
159
|
+
|
|
160
|
+
It allows jr to have pipeline syntax without requiring a
|
|
161
|
+
delayed-expression DSL, operator overloading, or AST construction.
|
|
162
|
+
|
|
163
|
+
Consequence of reserving top-level ">>"
|
|
164
|
+
At top level, ">>" belongs to jr.
|
|
165
|
+
|
|
166
|
+
If users need Ruby's actual ">>" operator inside a stage expression,
|
|
167
|
+
they must use an alternative spelling such as "send(:>>, ...)", or some
|
|
168
|
+
other escape/alternative mechanism chosen by the implementation.
|
|
169
|
+
|
|
170
|
+
That tradeoff is acceptable because the primary value of jr is
|
|
171
|
+
simplicity.
|
|
172
|
+
|
|
173
|
+
STAGE KINDS
|
|
174
|
+
Each pipeline segment is interpreted according to a small set of
|
|
175
|
+
explicit rules.
|
|
176
|
+
|
|
177
|
+
The stage kinds are:
|
|
178
|
+
|
|
179
|
+
* "select(...)" - filter stage
|
|
180
|
+
|
|
181
|
+
* plain expression - extract stage
|
|
182
|
+
|
|
183
|
+
* "flat" - flatten stage
|
|
184
|
+
|
|
185
|
+
* "sum(...)" - reduce/aggregate stage
|
|
186
|
+
|
|
187
|
+
These roles must remain separate. Their responsibilities must not be
|
|
188
|
+
mixed.
|
|
189
|
+
|
|
190
|
+
Filter stage
|
|
191
|
+
"select(...)" denotes a filter stage.
|
|
192
|
+
|
|
193
|
+
Examples:
|
|
194
|
+
|
|
195
|
+
select(["x"] > 10)
|
|
196
|
+
select(/abc/.match(["aaa"]))
|
|
197
|
+
|
|
198
|
+
A filter stage decides whether the current value passes to the next
|
|
199
|
+
stage.
|
|
200
|
+
|
|
201
|
+
It should not also act as an extractor.
|
|
202
|
+
|
|
203
|
+
Extract stage
|
|
204
|
+
Any stage expression that is not one of the explicit special forms is an
|
|
205
|
+
extract stage.
|
|
206
|
+
|
|
207
|
+
Examples:
|
|
208
|
+
|
|
209
|
+
["foo"]
|
|
210
|
+
["foo"]["bar"]
|
|
211
|
+
["items"]
|
|
212
|
+
|
|
213
|
+
An extract stage computes a value from the current input and passes it
|
|
214
|
+
forward.
|
|
215
|
+
|
|
216
|
+
It should not also act as flattening or aggregation.
|
|
217
|
+
|
|
218
|
+
Flat stage
|
|
219
|
+
"flat" is a stage with no argument.
|
|
220
|
+
|
|
221
|
+
Example:
|
|
222
|
+
|
|
223
|
+
["items"] >> flat
|
|
224
|
+
|
|
225
|
+
It means that the result of the previous stage should be expanded into
|
|
226
|
+
multiple output lines.
|
|
227
|
+
|
|
228
|
+
Without "flat", an array is emitted as one JSON array value.
|
|
229
|
+
|
|
230
|
+
With "flat", each element is emitted separately.
|
|
231
|
+
|
|
232
|
+
"flat" must not also be used as a filter or aggregator.
|
|
233
|
+
|
|
234
|
+
Reduce stage
|
|
235
|
+
"sum(...)" denotes an aggregate stage.
|
|
236
|
+
|
|
237
|
+
Examples:
|
|
238
|
+
|
|
239
|
+
sum(["foo"])
|
|
240
|
+
sum(["foo"]["bar"])
|
|
241
|
+
|
|
242
|
+
A reduce stage consumes values across all matching rows and emits one
|
|
243
|
+
final value at the end.
|
|
244
|
+
|
|
245
|
+
For the first implementation, "sum(...)" is sufficient as the only
|
|
246
|
+
required aggregate.
|
|
247
|
+
|
|
248
|
+
IMPLEMENTATION DISCIPLINE
|
|
249
|
+
This section is the most important part of the document.
|
|
250
|
+
|
|
251
|
+
The implementation should stay close to the following simple execution
|
|
252
|
+
shapes.
|
|
253
|
+
|
|
254
|
+
Filter + extract only
|
|
255
|
+
Conceptually:
|
|
256
|
+
|
|
257
|
+
ctx = RowContext.new
|
|
258
|
+
|
|
259
|
+
ARGF.each_line do |line|
|
|
260
|
+
row = JSON.parse(line)
|
|
261
|
+
ctx.reset(row)
|
|
262
|
+
|
|
263
|
+
next unless ctx.instance_eval(filter_src)
|
|
264
|
+
out = ctx.instance_eval(extract_src)
|
|
265
|
+
|
|
266
|
+
emit(out)
|
|
267
|
+
end
|
|
268
|
+
|
|
269
|
+
This is the target level of simplicity.
|
|
270
|
+
|
|
271
|
+
Filter + extract + flat
|
|
272
|
+
Conceptually:
|
|
273
|
+
|
|
274
|
+
ctx = RowContext.new
|
|
275
|
+
|
|
276
|
+
ARGF.each_line do |line|
|
|
277
|
+
row = JSON.parse(line)
|
|
278
|
+
ctx.reset(row)
|
|
279
|
+
|
|
280
|
+
next unless ctx.instance_eval(filter_src)
|
|
281
|
+
out = ctx.instance_eval(extract_src)
|
|
282
|
+
|
|
283
|
+
if flat
|
|
284
|
+
out.each { |v| emit(v) }
|
|
285
|
+
else
|
|
286
|
+
emit(out)
|
|
287
|
+
end
|
|
288
|
+
end
|
|
289
|
+
|
|
290
|
+
Again, this is intentionally simple.
|
|
291
|
+
|
|
292
|
+
Filter + extract + sum
|
|
293
|
+
Conceptually:
|
|
294
|
+
|
|
295
|
+
ctx = RowContext.new
|
|
296
|
+
acc = 0
|
|
297
|
+
|
|
298
|
+
ARGF.each_line do |line|
|
|
299
|
+
row = JSON.parse(line)
|
|
300
|
+
ctx.reset(row)
|
|
301
|
+
|
|
302
|
+
next unless ctx.instance_eval(filter_src)
|
|
303
|
+
value = ctx.instance_eval(extract_src)
|
|
304
|
+
|
|
305
|
+
acc += value
|
|
306
|
+
end
|
|
307
|
+
|
|
308
|
+
emit(acc)
|
|
309
|
+
|
|
310
|
+
This is the intended model.
|
|
311
|
+
|
|
312
|
+
The implementation must not introduce a heavyweight generic framework
|
|
313
|
+
unless a clear need arises later.
|
|
314
|
+
|
|
315
|
+
Meaning of "sum(...)"
|
|
316
|
+
"sum(expr)" should be treated as syntactic sugar for:
|
|
317
|
+
|
|
318
|
+
* evaluate "expr" for each matching input row
|
|
319
|
+
|
|
320
|
+
* add the result to an accumulator
|
|
321
|
+
|
|
322
|
+
* emit the accumulator once, at the end
|
|
323
|
+
|
|
324
|
+
The important thing is not the internal abstraction but preserving the
|
|
325
|
+
simple runtime shape.
|
|
326
|
+
|
|
327
|
+
REQUIRED CONSTRAINTS
|
|
328
|
+
An implementation that follows this design must satisfy all of the
|
|
329
|
+
following.
|
|
330
|
+
|
|
331
|
+
1. NDJSON only
|
|
332
|
+
The initial implementation targets NDJSON line-by-line processing.
|
|
333
|
+
|
|
334
|
+
General stream semantics are out of scope.
|
|
335
|
+
|
|
336
|
+
2. Current row is "self"
|
|
337
|
+
Expressions run with the current row context bound as "self".
|
|
338
|
+
|
|
339
|
+
3. "["foo"]" is the primary field access syntax
|
|
340
|
+
This is the only required syntax for the first implementation.
|
|
341
|
+
|
|
342
|
+
Bareword sugar such as "foo" or dotted syntax such as "_.foo" is out of
|
|
343
|
+
scope.
|
|
344
|
+
|
|
345
|
+
4. "[]" returns raw Ruby values
|
|
346
|
+
No child wrapper objects are allowed.
|
|
347
|
+
|
|
348
|
+
5. Only one root context object is reused
|
|
349
|
+
A fresh DSL context object per row is not allowed.
|
|
350
|
+
|
|
351
|
+
The current row object inside the root context should simply be
|
|
352
|
+
replaced.
|
|
353
|
+
|
|
354
|
+
6. Pipeline parsing happens before Ruby evaluation
|
|
355
|
+
Top-level ">>" is split by jr itself before stage evaluation.
|
|
356
|
+
|
|
357
|
+
The implementation does not need to make ">>" work as a Ruby operator.
|
|
358
|
+
|
|
359
|
+
7. Stage responsibilities must stay separate
|
|
360
|
+
* "select(...)" filters
|
|
361
|
+
|
|
362
|
+
* plain expressions extract
|
|
363
|
+
|
|
364
|
+
* "flat" flattens
|
|
365
|
+
|
|
366
|
+
* "sum(...)" aggregates
|
|
367
|
+
|
|
368
|
+
Do not overload one stage kind with multiple semantics.
|
|
369
|
+
|
|
370
|
+
8. No "nil means skip" rule in extract
|
|
371
|
+
Skipping rows belongs to filtering.
|
|
372
|
+
|
|
373
|
+
Extract stages return values.
|
|
374
|
+
|
|
375
|
+
Do not make extract return-value conventions more complicated than
|
|
376
|
+
necessary.
|
|
377
|
+
|
|
378
|
+
9. No child DSL wrappers
|
|
379
|
+
This is worth repeating.
|
|
380
|
+
|
|
381
|
+
If a child value is a Hash, then further indexing is just normal Ruby
|
|
382
|
+
indexing. If a child value is an Array, then array access is just normal
|
|
383
|
+
Ruby array access.
|
|
384
|
+
|
|
385
|
+
10. Avoid heavyweight abstraction
|
|
386
|
+
Do not introduce any of the following in the first implementation unless
|
|
387
|
+
they are absolutely necessary:
|
|
388
|
+
|
|
389
|
+
* AST nodes
|
|
390
|
+
|
|
391
|
+
* delayed expression objects
|
|
392
|
+
|
|
393
|
+
* generic stage graphs
|
|
394
|
+
|
|
395
|
+
* EOF-marker-based general reducer pipelines
|
|
396
|
+
|
|
397
|
+
* jq-style multi-valued stream semantics
|
|
398
|
+
|
|
399
|
+
* child wrapper chains
|
|
400
|
+
|
|
401
|
+
WHAT IS EXPLICITLY OUT OF SCOPE FOR NOW
|
|
402
|
+
The following are intentionally deferred.
|
|
403
|
+
|
|
404
|
+
* jq compatibility
|
|
405
|
+
|
|
406
|
+
* bareword field access such as "foo"
|
|
407
|
+
|
|
408
|
+
* dotted field access such as "_.foo"
|
|
409
|
+
|
|
410
|
+
* child wrappers
|
|
411
|
+
|
|
412
|
+
* general reducer framework
|
|
413
|
+
|
|
414
|
+
* EOF-marker stage propagation
|
|
415
|
+
|
|
416
|
+
* general delayed-expression DSL
|
|
417
|
+
|
|
418
|
+
* AST optimization
|
|
419
|
+
|
|
420
|
+
* complicated "nil" output rules
|
|
421
|
+
|
|
422
|
+
* advanced aggregate families beyond the initial "sum(...)"
|
|
423
|
+
|
|
424
|
+
SUMMARY
|
|
425
|
+
jr is valuable only if it stays small and simple.
|
|
426
|
+
|
|
427
|
+
That means the implementation should follow these core rules:
|
|
428
|
+
|
|
429
|
+
* NDJSON input, processed line by line
|
|
430
|
+
|
|
431
|
+
* current row bound as "self"
|
|
432
|
+
|
|
433
|
+
* field access through "["foo"]"
|
|
434
|
+
|
|
435
|
+
* "[]" returns raw Ruby values
|
|
436
|
+
|
|
437
|
+
* no child wrappers
|
|
438
|
+
|
|
439
|
+
* one reusable root context object
|
|
440
|
+
|
|
441
|
+
* top-level pipeline split on ">>"
|
|
442
|
+
|
|
443
|
+
* "select(...)" for filter
|
|
444
|
+
|
|
445
|
+
* plain expressions for extract
|
|
446
|
+
|
|
447
|
+
* "flat" for flattening
|
|
448
|
+
|
|
449
|
+
* "sum(...)" for aggregation
|
|
450
|
+
|
|
451
|
+
* simple loops instead of heavyweight framework
|
|
452
|
+
|
|
453
|
+
If an implementation stops looking this simple, it has probably drifted
|
|
454
|
+
away from the intended design.
|
|
455
|
+
|
data/Gemfile
ADDED
data/Rakefile
ADDED
data/exe/jrf
ADDED
data/jrf.gemspec
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require_relative "lib/jrf/version"
|
|
4
|
+
|
|
5
|
+
# Gem packaging metadata for jrf. Version comes from lib/jrf/version.rb
# (required at the top of this file).
Gem::Specification.new do |spec|
  spec.name    = "jrf"
  spec.version = Jrf::VERSION
  spec.authors = ["kazuho"]
  spec.email   = ["n/a@example.com"]

  spec.summary     = "Small NDJSON transformer with Ruby expressions"
  spec.description = "A small, lightweight NDJSON transformer with Ruby-like expressions."
  spec.license     = "MIT"
  spec.required_ruby_version = ">= 3.0"

  # The CLI entry point lives under exe/ rather than the legacy bin/.
  spec.bindir      = "exe"
  spec.executables = ["jrf"]

  # NOTE(review): the two globs overlap on top-level lib/ entries, so
  # spec.files may contain duplicates — presumably harmless; verify at build.
  spec.files = Dir.glob("{exe,lib,test}/*") + Dir.glob("lib/**/*") +
               %w[DESIGN.txt jrf.gemspec Gemfile Rakefile]
end
|
data/lib/jrf/cli.rb
ADDED
|
@@ -0,0 +1,32 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require_relative "runner"
|
|
4
|
+
|
|
5
|
+
module Jrf
  # Command-line front end: parses options/arguments and delegates the
  # actual work to Runner.
  class CLI
    # Runs the CLI with the given argv and IO streams.
    #
    # Recognizes a single option, -v (verbose). Returns a process exit
    # status: 1 on an unknown option or missing expression (after printing
    # usage to err), otherwise 0 after Runner has processed the input.
    def self.run(argv = ARGV, input: ARGF, out: $stdout, err: $stderr)
      verbose = false

      # Consume leading options; stop at the first non-dash argument.
      loop do
        flag = argv.first
        break unless flag&.start_with?("-")

        unless flag == "-v"
          err.puts "unknown option: #{flag}"
          err.puts "usage: jrf [-v] 'EXPR'"
          return 1
        end

        verbose = true
        argv.shift
      end

      if argv.empty?
        err.puts "usage: jrf [-v] 'EXPR'"
        return 1
      end

      expression = argv.shift
      Runner.new(input: input, out: out, err: err).run(expression, verbose: verbose)
      0
    end
  end
end
|
data/lib/jrf/pipeline_parser.rb
ADDED
|
@@ -0,0 +1,147 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Jrf
  # Splits a raw jrf expression into top-level pipeline stages (on ">>")
  # and classifies each stage as a select (filter) or an extract.
  class PipelineParser
    # source: the full expression string, e.g. 'select(["x"] > 10) >> ["foo"]'.
    def initialize(source)
      @source = source.to_s
    end

    # Parses the source into { stages: [{ kind:, original:, src: }, ...] }.
    #
    # Raises ArgumentError when the expression is empty, a select stage is
    # malformed, or delimiters are unbalanced.
    def parse
      stages = split_top_level_pipeline(@source).map(&:strip).reject(&:empty?)
      raise ArgumentError, "empty expression" if stages.empty?
      { stages: stages.map { |stage| parse_stage!(stage) } }
    end

    private

    # Builds the descriptor for one pipeline segment.
    def parse_stage!(stage)
      if select_stage?(stage)
        {
          kind: :select,
          original: stage,
          # The generated source evaluates the predicate and maps a falsy
          # result to the DROPPED sentinel consumed by the runner.
          src: "(#{parse_select!(stage)}) ? _ : ::Jrf::Control::DROPPED"
        }
      else
        {
          kind: :extract,
          original: stage,
          src: validate_extract!(stage)
        }
      end
    end

    # Extract stages are passed through verbatim after validation.
    def validate_extract!(stage)
      reject_unsupported_stage!(stage)
      stage
    end

    # Pulls the predicate expression out of "select(...)".
    def parse_select!(stage)
      reject_unsupported_stage!(stage)
      match = /\Aselect\s*\((.*)\)\s*\z/m.match(stage)
      raise ArgumentError, "first stage must be select(...)" unless match

      inner = match[1].strip
      raise ArgumentError, "select(...) must contain an expression" if inner.empty?

      inner
    end

    def select_stage?(stage)
      /\Aselect\s*\(/.match?(stage)
    end

    # Extension hook: intentionally a no-op in the first implementation.
    def reject_unsupported_stage!(stage)
    end

    # Splits the source on ">>" occurrences that sit outside string
    # literals, regex literals, and bracketing delimiters.
    #
    # Fix: the previous string scanner recomputed the escape flag before
    # checking for the closing quote, so an escaped double quote ("\"")
    # terminated the string early, and single-quoted strings ignored
    # backslash escapes entirely (Ruby honors \' and \\ in them).
    def split_top_level_pipeline(source)
      parts = []
      start_idx = 0
      i = 0
      stack = []      # open-delimiter stack of [expected_closer, offset]
      quote = nil     # current string delimiter (' or ") when inside a string
      escaped = false # true when the previous character was a backslash
      regex = false
      regex_class = false

      while i < source.length
        ch = source[i]

        if quote
          # Inside a string literal: a backslash escapes the following
          # character; the string closes only on an unescaped delimiter.
          if escaped
            escaped = false
          elsif ch == "\\"
            escaped = true
          elsif ch == quote
            quote = nil
          end
          i += 1
          next
        end

        if regex
          if escaped
            escaped = false
          elsif regex_class
            regex_class = false if ch == "]"
          else
            case ch
            when "\\"
              escaped = true
            when "["
              regex_class = true
            when "/"
              regex = false
            end
          end
          i += 1
          next
        end

        case ch
        when "'", '"'
          quote = ch
        when "("
          stack << [")", i]
        when "["
          stack << ["]", i]
        when "{"
          stack << ["}", i]
        when ")", "]", "}"
          expected, = stack.pop
          unless expected == ch
            raise ArgumentError, "mismatched delimiter #{ch.inspect} at offset #{i}"
          end
        when "/"
          regex = looks_like_regex_start?(source, i)
        when ">"
          if stack.empty? && source[i, 2] == ">>"
            parts << source[start_idx...i]
            i += 2
            start_idx = i
            next
          end
        end

        i += 1
      end

      parts << source[start_idx..]
      unless stack.empty?
        expected, open_idx = stack.last
        raise ArgumentError, "unclosed delimiter #{expected.inspect} at offset #{open_idx}"
      end

      parts
    end

    # Heuristic: "/" starts a regex literal unless the preceding non-space
    # character could end a value (identifier char, "]", or ")"), in which
    # case it is treated as division.
    def looks_like_regex_start?(source, slash_idx)
      j = slash_idx - 1
      j -= 1 while j >= 0 && source[j] =~ /\s/
      return true if j < 0

      prev = source[j]
      !(/[[:alnum:]_\]\)]/.match?(prev))
    end
  end
end
|