jrf 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 1b776b841380488528a7344ddf5cb4e640ae512b9b22f753305c60c2146ed3bb
4
+ data.tar.gz: 1def2307c6f2d8b14d7e374c1175cebe177b7086d2d4e7412955c3da7f917e88
5
+ SHA512:
6
+ metadata.gz: 6a18435576e8c8fea910126e1c2278331569a0dc24c3916e209b274904102428e754b0965c501fae9ed09ce317495d9b222d1b222904a0748288908e6a057c52
7
+ data.tar.gz: f216e265bd94bc462e1e285957ca609f2b38bd2c5ebc238270f699a4b59d1cd21359aaa70d11dbbb2158982c4723ecb69604f2b9707823121cc1b8f5b0f32bf1
data/DESIGN.txt ADDED
@@ -0,0 +1,455 @@
1
+ NAME
2
+ jr - a small, lightweight NDJSON transformer with Ruby-like expressions
3
+
4
+ OVERVIEW
5
+ jr is a command-line tool for transforming NDJSON using Ruby-like
6
+ expressions.
7
+
8
+ It is intentionally not a jq-compatible general-purpose JSON language.
9
+ Its value comes from a much narrower scope and from being implementable
10
+ in a very simple way.
11
+
12
+ The goal is to support expressions like:
13
+
14
+ jr '["foo"]'
15
+ jr 'select(/abc/.match(["aaa"])) >> ["foo"]'
16
+ jr '["items"] >> flat'
17
+ jr 'sum(["foo"])'
18
+ jr 'select(["x"] > 10) >> ["foo"] >> sum(["bar"])'
19
+
20
+ That is:
21
+
22
+ * extract a value from each JSON line
23
+
24
+ * filter lines by a predicate
25
+
26
+ * flatten arrays into multiple output lines
27
+
28
+ * aggregate values, such as summing them
29
+
30
+ This document is not just a user-facing description. It is a design
31
+ constraint document for implementors. The point is to preserve the
32
+ simplicity we agreed on, so that jr does not drift into a heavy
33
+ implementation.
34
+
35
+ DESIGN PRINCIPLE
36
+ jr must be implemented in a way that keeps the runtime model extremely
37
+ simple.
38
+
39
+ The implementation must not drift into:
40
+
41
+ * AST construction and optimization
42
+
43
+ * wrapping child objects in DSL wrapper objects
44
+
45
+ * a large generic streaming-stage framework
46
+
47
+ * per-line allocation of many intermediate DSL objects
48
+
49
+ * jq-like general stream semantics
50
+
51
+ Instead, jr should be implemented under the following constraints.
52
+
53
+ CORE MODEL
54
+ Input model
55
+ Input is NDJSON.
56
+
57
+ Each line is parsed as one JSON value.
58
+
59
+ The primary execution model is line-by-line processing.
60
+
61
+ A simple conceptual loop is sufficient:
62
+
63
+ ARGF.each_line do |line|
64
+ row = JSON.parse(line)
65
+ ...
66
+ end
67
+
68
+ Evaluation context
69
+ Expressions are evaluated with the current row bound as "self".
70
+
71
+ That means the basic field access syntax is:
72
+
73
+ ["foo"]
74
+ ["foo"]["bar"]
75
+
76
+ No "_" or "_." prefix is required.
77
+
78
+ Root-only DSL
79
+ The DSL exists only at the root context.
80
+
81
+ This is a mandatory design rule.
82
+
83
+ The expression context object only needs to represent the current row.
84
+ Child values are not wrapped.
85
+
86
+ Return value of "[]"
87
+ "["foo"]" returns the underlying Ruby value directly.
88
+
89
+ That means:
90
+
91
+ * Hash values remain Hash
92
+
93
+ * Array values remain Array
94
+
95
+ * String values remain String
96
+
97
+ * Numeric values remain Numeric
98
+
99
+ * "nil" remains "nil"
100
+
101
+ This is critical.
102
+
103
+ For example:
104
+
105
+ ["foo"]["bar"]
106
+
107
+ must work simply because "["foo"]" returned a normal Ruby "Hash", and
108
+ the next "["bar"]" is just Ruby's normal "Hash#[]".
109
+
110
+ Child wrappers must not exist.
111
+
112
+ Reuse of the root context
113
+ The root row context must be reused across all input lines.
114
+
115
+ A minimal model is:
116
+
117
+ class RowContext
118
+ def initialize(obj = nil)
119
+ @obj = obj
120
+ end
121
+
122
+ def reset(obj)
123
+ @obj = obj
124
+ self
125
+ end
126
+
127
+ def [](key)
128
+ @obj[key]
129
+ end
130
+ end
131
+
132
+ The per-line execution model should be conceptually as simple as:
133
+
134
+ ctx.reset(row)
135
+ ctx.instance_eval(expr_source)
136
+
137
+ The implementation should not allocate a new root DSL object for every
138
+ line.
139
+
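A minimal runnable sketch of the reuse model above, reading from a `StringIO` instead of `ARGF`. Plain Ruby parses a source string that begins with `["foo"]` as an Array literal, so the sketch spells the receiver out as `self["foo"]`. How jr maps the bare form onto the context is not shown here. `run_extract` is a hypothetical helper name.

```ruby
require "json"
require "stringio"

# Same shape as the RowContext in the design text.
class RowContext
  def initialize(obj = nil)
    @obj = obj
  end

  def reset(obj)
    @obj = obj
    self
  end

  def [](key)
    @obj[key]
  end
end

# Hypothetical helper: evaluates one extract expression per NDJSON line,
# reusing a single context object for every row.
def run_extract(input, expr_source)
  ctx = RowContext.new
  input.each_line.map do |line|
    ctx.reset(JSON.parse(line))
    ctx.instance_eval(expr_source)
  end
end

input = StringIO.new(%({"foo":{"bar":1}}\n{"foo":{"bar":2}}\n))
puts run_extract(input, 'self["foo"]["bar"]').inspect # prints [1, 2]
```

The second `["bar"]` is ordinary `Hash#[]`, because `RowContext#[]` hands back the parsed Hash without wrapping it.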
140
+ PIPELINE SYNTAX
141
+ Multiple stages are connected using top-level ">>".
142
+
143
+ Example:
144
+
145
+ jr 'select(["x"] > 10) >> ["foo"] >> sum(["bar"])'
146
+
147
+ This ">>" is not Ruby's shift operator in the execution model.
148
+
149
+ Instead, jr splits the top-level source string on top-level occurrences
150
+ of ">>" before evaluating the individual stage expressions as Ruby.
151
+
152
+ So the above is treated internally as three stages:
153
+
154
+ select(["x"] > 10)
155
+ ["foo"]
156
+ sum(["bar"])
157
+
158
+ This design choice is intentional and important.
159
+
160
+ It allows jr to have pipeline syntax without requiring a
161
+ delayed-expression DSL, operator overloading, or AST construction.
162
+
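The split described above can be sketched as a single character scan. This is a simplified illustration, not the splitter jr ships: `split_stages` is a hypothetical name, and the sketch deliberately ignores regex literals and backslash escapes, which a fuller splitter would need to handle.

```ruby
# Hypothetical helper: split on ">>" only when not nested in (), [], {}
# and not inside a quoted string. Regex literals and backslash escapes
# are deliberately ignored to keep the sketch short.
def split_stages(source)
  parts = []
  depth = 0
  quote = nil
  start = 0
  i = 0
  while i < source.length
    ch = source[i]
    if quote
      quote = nil if ch == quote
    elsif ch == "'" || ch == '"'
      quote = ch
    elsif "([{".include?(ch)
      depth += 1
    elsif ")]}".include?(ch)
      depth -= 1
    elsif depth.zero? && source[i, 2] == ">>"
      parts << source[start...i].strip
      i += 2
      start = i
      next
    end
    i += 1
  end
  parts << source[start..].strip
end

p split_stages('select(["x"] > 10) >> ["foo"] >> sum(["bar"])')
p split_stages('select(["x"] >> 1 > 0) >> ["foo"]')
```

The second call shows why only top-level occurrences are split: the `>>` nested inside `select(...)` stays part of that stage and is later evaluated as ordinary Ruby.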
163
+ Consequence of reserving top-level ">>"
164
+ At top level, ">>" belongs to jr.
165
+
166
+ If users need Ruby's actual ">>" operator inside a stage expression,
167
+ they must use an alternative spelling such as "send(:>>, ...)", or some
168
+ other escape/alternative mechanism chosen by the implementation.
169
+
170
+ That tradeoff is acceptable because the primary value of jr is
171
+ simplicity.
172
+
173
+ STAGE KINDS
174
+ Each pipeline segment is interpreted according to a small set of
175
+ explicit rules.
176
+
177
+ The stage kinds are:
178
+
179
+ * "select(...)" - filter stage
180
+
181
+ * plain expression - extract stage
182
+
183
+ * "flat" - flatten stage
184
+
185
+ * "sum(...)" - reduce/aggregate stage
186
+
187
+ These roles must remain separate. Their responsibilities must not be
188
+ mixed.
189
+
190
+ Filter stage
191
+ "select(...)" denotes a filter stage.
192
+
193
+ Examples:
194
+
195
+ select(["x"] > 10)
196
+ select(/abc/.match(["aaa"]))
197
+
198
+ A filter stage decides whether the current value passes to the next
199
+ stage.
200
+
201
+ It should not also act as an extractor.
202
+
203
+ Extract stage
204
+ Any stage expression that is not one of the explicit special forms is an
205
+ extract stage.
206
+
207
+ Examples:
208
+
209
+ ["foo"]
210
+ ["foo"]["bar"]
211
+ ["items"]
212
+
213
+ An extract stage computes a value from the current input and passes it
214
+ forward.
215
+
216
+ It should not also act as flattening or aggregation.
217
+
218
+ Flat stage
219
+ "flat" is a stage with no argument.
220
+
221
+ Example:
222
+
223
+ ["items"] >> flat
224
+
225
+ It means that the result of the previous stage should be expanded into
226
+ multiple output lines.
227
+
228
+ Without "flat", an array is emitted as one JSON array value.
229
+
230
+ With "flat", each element is emitted separately.
231
+
232
+ "flat" must not also be used as a filter or aggregator.
233
+
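The difference between the two emission modes can be sketched as follows. To stay short, the sketch extracts by a plain key lookup instead of evaluating a stage expression. `emit_rows` and its `flat:` keyword are hypothetical names, not part of jr.

```ruby
require "json"
require "stringio"

# Hypothetical helper: writes one JSON line per value. With flat: true an
# Array result is expanded element by element; otherwise it is written whole.
def emit_rows(input, out, key, flat:)
  input.each_line do |line|
    value = JSON.parse(line)[key]
    if flat
      value.each { |v| out.puts(JSON.generate(v)) }
    else
      out.puts(JSON.generate(value))
    end
  end
  out
end

ndjson = %({"items":[1,2]}\n{"items":[3]}\n)
puts emit_rows(StringIO.new(ndjson), StringIO.new, "items", flat: false).string
puts emit_rows(StringIO.new(ndjson), StringIO.new, "items", flat: true).string
```

Without `flat`, the two input rows produce two output lines, each a JSON array. With `flat`, they produce three output lines, one per element.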
234
+ Reduce stage
235
+ "sum(...)" denotes an aggregate stage.
236
+
237
+ Examples:
238
+
239
+ sum(["foo"])
240
+ sum(["foo"]["bar"])
241
+
242
+ A reduce stage consumes values across all matching rows and emits one
243
+ final value at the end.
244
+
245
+ For the first implementation, "sum(...)" is sufficient as the only
246
+ required aggregate.
247
+
248
+ IMPLEMENTATION DISCIPLINE
249
+ This section is the most important part of the document.
250
+
251
+ The implementation should stay close to the following simple execution
252
+ shapes.
253
+
254
+ Filter + extract only
255
+ Conceptually:
256
+
257
+ ctx = RowContext.new
258
+
259
+ ARGF.each_line do |line|
260
+ row = JSON.parse(line)
261
+ ctx.reset(row)
262
+
263
+ next unless ctx.instance_eval(filter_src)
264
+ out = ctx.instance_eval(extract_src)
265
+
266
+ emit(out)
267
+ end
268
+
269
+ This is the target level of simplicity.
270
+
271
+ Filter + extract + flat
272
+ Conceptually:
273
+
274
+ ctx = RowContext.new
275
+
276
+ ARGF.each_line do |line|
277
+ row = JSON.parse(line)
278
+ ctx.reset(row)
279
+
280
+ next unless ctx.instance_eval(filter_src)
281
+ out = ctx.instance_eval(extract_src)
282
+
283
+ if flat
284
+ out.each { |v| emit(v) }
285
+ else
286
+ emit(out)
287
+ end
288
+ end
289
+
290
+ Again, this is intentionally simple.
291
+
292
+ Filter + extract + sum
293
+ Conceptually:
294
+
295
+ ctx = RowContext.new
296
+ acc = 0
297
+
298
+ ARGF.each_line do |line|
299
+ row = JSON.parse(line)
300
+ ctx.reset(row)
301
+
302
+ next unless ctx.instance_eval(filter_src)
303
+ value = ctx.instance_eval(extract_src)
304
+
305
+ acc += value
306
+ end
307
+
308
+ emit(acc)
309
+
310
+ This is the intended model.
311
+
312
+ The implementation must not introduce a heavyweight generic framework
313
+ unless a clear need arises later.
314
+
315
+ Meaning of "sum(...)"
316
+ "sum(expr)" should be treated as syntactic sugar for:
317
+
318
+ * evaluate "expr" for each matching input row
319
+
320
+ * add the result to an accumulator
321
+
322
+ * emit the accumulator once, at the end
323
+
324
+ The important thing is not the internal abstraction but preserving the
325
+ simple runtime shape.
326
+
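The three steps listed for "sum(expr)" can be sketched as one loop with a single accumulator. The `filter` and `extract` lambdas are stand-ins for the evaluated stage sources, and `sum_rows` is a hypothetical helper name. Neither is part of jr.

```ruby
require "json"
require "stringio"

# Hypothetical helper. The filter and extract lambdas stand in for the
# evaluated stage sources; they are not part of jr.
def sum_rows(input, filter:, extract:)
  acc = 0
  input.each_line do |line|
    row = JSON.parse(line)
    next unless filter.call(row) # select(...): skip non-matching rows
    acc += extract.call(row)     # sum(expr): accumulate per matching row
  end
  acc                            # returned once, after the last line
end

input = StringIO.new(%({"x":5,"bar":1}\n{"x":20,"bar":2}\n{"x":30,"bar":3}\n))
total = sum_rows(input,
                 filter: ->(row) { row["x"] > 10 },
                 extract: ->(row) { row["bar"] })
puts JSON.generate(total) # prints 5
```

No end-of-input marker travels through a pipeline here. The accumulator is simply read after the loop finishes.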
327
+ REQUIRED CONSTRAINTS
328
+ An implementation that follows this design must satisfy all of the
329
+ following.
330
+
331
+ 1. NDJSON only
332
+ The initial implementation targets NDJSON line-by-line processing.
333
+
334
+ General stream semantics are out of scope.
335
+
336
+ 2. Current row is "self"
337
+ Expressions run with the current row context bound as "self".
338
+
339
+ 3. "["foo"]" is the primary field access syntax
340
+ This is the only required syntax for the first implementation.
341
+
342
+ Bareword sugar such as "foo" or dotted syntax such as "_.foo" is out of
343
+ scope.
344
+
345
+ 4. "[]" returns raw Ruby values
346
+ No child wrapper objects are allowed.
347
+
348
+ 5. Only one root context object is reused
349
+ A fresh DSL context object per row is not allowed.
350
+
351
+ The current row object inside the root context should simply be
352
+ replaced.
353
+
354
+ 6. Pipeline parsing happens before Ruby evaluation
355
+ Top-level ">>" is split by jr itself before stage evaluation.
356
+
357
+ The implementation does not need to make ">>" work as a Ruby operator.
358
+
359
+ 7. Stage responsibilities must stay separate
360
+ * "select(...)" filters
361
+
362
+ * plain expressions extract
363
+
364
+ * "flat" flattens
365
+
366
+ * "sum(...)" aggregates
367
+
368
+ Do not overload one stage kind with multiple semantics.
369
+
370
+ 8. No "nil means skip" rule in extract
371
+ Skipping rows belongs to filtering.
372
+
373
+ Extract stages return values.
374
+
375
+ Do not make extract return-value conventions more complicated than
376
+ necessary.
377
+
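A small sketch of this rule, under the assumption that values are written with `JSON.generate`: an extract that finds no value still emits a line (`null`), and dropping rows is expressed as a separate filter step. `extract_all` and `select_then_extract` are hypothetical helper names.

```ruby
require "json"

# Extract only: a missing key yields nil, which is still emitted as "null".
def extract_all(rows, key)
  rows.map { |row| JSON.generate(row[key]) }
end

# Skipping rows is the filter's job, so it is written as its own step.
def select_then_extract(rows, key)
  rows.select { |row| row.key?(key) }
      .map { |row| JSON.generate(row[key]) }
end

rows = [{ "foo" => 1 }, { "bar" => 2 }]
p extract_all(rows, "foo")         # prints ["1", "null"]
p select_then_extract(rows, "foo") # prints ["1"]
```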
378
+ 9. No child DSL wrappers
379
+ This is worth repeating.
380
+
381
+ If a child value is a Hash, then further indexing is just normal Ruby
382
+ indexing. If a child value is an Array, then array access is just normal
383
+ Ruby array access.
384
+
385
+ 10. Avoid heavyweight abstraction
386
+ Do not introduce any of the following in the first implementation unless
387
+ they are absolutely necessary:
388
+
389
+ * AST nodes
390
+
391
+ * delayed expression objects
392
+
393
+ * generic stage graphs
394
+
395
+ * EOF-marker-based general reducer pipelines
396
+
397
+ * jq-style multi-valued stream semantics
398
+
399
+ * child wrapper chains
400
+
401
+ WHAT IS EXPLICITLY OUT OF SCOPE FOR NOW
402
+ The following are intentionally deferred.
403
+
404
+ * jq compatibility
405
+
406
+ * bareword field access such as "foo"
407
+
408
+ * dotted field access such as "_.foo"
409
+
410
+ * child wrappers
411
+
412
+ * general reducer framework
413
+
414
+ * EOF-marker stage propagation
415
+
416
+ * general delayed-expression DSL
417
+
418
+ * AST optimization
419
+
420
+ * complicated "nil" output rules
421
+
422
+ * advanced aggregate families beyond the initial "sum(...)"
423
+
424
+ SUMMARY
425
+ jr is valuable only if it stays small and simple.
426
+
427
+ That means the implementation should follow these core rules:
428
+
429
+ * NDJSON input, processed line by line
430
+
431
+ * current row bound as "self"
432
+
433
+ * field access through "["foo"]"
434
+
435
+ * "[]" returns raw Ruby values
436
+
437
+ * no child wrappers
438
+
439
+ * one reusable root context object
440
+
441
+ * top-level pipeline split on ">>"
442
+
443
+ * "select(...)" for filter
444
+
445
+ * plain expressions for extract
446
+
447
+ * "flat" for flattening
448
+
449
+ * "sum(...)" for aggregation
450
+
451
+ * simple loops instead of a heavyweight framework
452
+
453
+ If an implementation stops looking this simple, it has probably drifted
454
+ away from the intended design.
455
+
data/Gemfile ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ source "https://rubygems.org"
4
+
5
+ gemspec name: "jrf"
data/Rakefile ADDED
@@ -0,0 +1,10 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "rake/testtask"
4
+
5
+ Rake::TestTask.new do |t|
6
+ t.libs << "test"
7
+ t.test_files = FileList["test/**/*_test.rb"]
8
+ end
9
+
10
+ task default: :test
data/exe/jrf ADDED
@@ -0,0 +1,7 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ $LOAD_PATH.unshift(File.expand_path("../lib", __dir__))
5
+ require "jrf"
6
+
7
+ exit Jrf::CLI.run(ARGV)
data/jrf.gemspec ADDED
@@ -0,0 +1,20 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative "lib/jrf/version"
4
+
5
+ Gem::Specification.new do |spec|
6
+ spec.name = "jrf"
7
+ spec.version = Jrf::VERSION
8
+ spec.authors = ["kazuho"]
9
+ spec.email = ["n/a@example.com"]
10
+
11
+ spec.summary = "Small NDJSON transformer with Ruby expressions"
12
+ spec.description = "A small, lightweight NDJSON transformer with Ruby-like expressions."
13
+ spec.license = "MIT"
14
+ spec.required_ruby_version = ">= 3.0"
15
+
16
+ spec.bindir = "exe"
17
+ spec.executables = ["jrf"]
18
+
19
+ spec.files = Dir.glob("{exe,lib,test}/*") + Dir.glob("lib/**/*") + %w[DESIGN.txt jrf.gemspec Gemfile Rakefile]
20
+ end
data/lib/jrf/cli.rb ADDED
@@ -0,0 +1,32 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative "runner"
4
+
5
+ module Jrf
6
+ class CLI
7
+ def self.run(argv = ARGV, input: ARGF, out: $stdout, err: $stderr)
8
+ verbose = false
9
+
10
+ while argv.first&.start_with?("-")
11
+ case argv.first
12
+ when "-v"
13
+ verbose = true
14
+ argv.shift
15
+ else
16
+ err.puts "unknown option: #{argv.first}"
17
+ err.puts "usage: jrf [-v] 'EXPR'"
18
+ return 1
19
+ end
20
+ end
21
+
22
+ if argv.empty?
23
+ err.puts "usage: jrf [-v] 'EXPR'"
24
+ return 1
25
+ end
26
+
27
+ expression = argv.shift
28
+ Runner.new(input: input, out: out, err: err).run(expression, verbose: verbose)
29
+ 0
30
+ end
31
+ end
32
+ end
@@ -0,0 +1,8 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Jrf
4
+ module Control
5
+ Flat = Struct.new(:value)
6
+ DROPPED = Object.new.freeze
7
+ end
8
+ end
@@ -0,0 +1,147 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Jrf
4
+ class PipelineParser
5
+ def initialize(source)
6
+ @source = source.to_s
7
+ end
8
+
9
+ def parse
10
+ stages = split_top_level_pipeline(@source).map(&:strip).reject(&:empty?)
11
+ raise ArgumentError, "empty expression" if stages.empty?
12
+ { stages: stages.map { |stage| parse_stage!(stage) } }
13
+ end
14
+
15
+ private
16
+
17
+ def parse_stage!(stage)
18
+ if select_stage?(stage)
19
+ {
20
+ kind: :select,
21
+ original: stage,
22
+ src: "(#{parse_select!(stage)}) ? _ : ::Jrf::Control::DROPPED"
23
+ }
24
+ else
25
+ reject_unsupported_stage!(stage)
26
+ {
27
+ kind: :extract,
28
+ original: stage,
29
+ src: validate_extract!(stage)
30
+ }
31
+ end
32
+ end
33
+
34
+ def validate_extract!(stage)
35
+ reject_unsupported_stage!(stage)
36
+ stage
37
+ end
38
+
39
+ def parse_select!(stage)
40
+ reject_unsupported_stage!(stage)
41
+ match = /\Aselect\s*\((.*)\)\s*\z/m.match(stage)
42
+ raise ArgumentError, "select stage must have the form select(...)" unless match
43
+
44
+ inner = match[1].strip
45
+ raise ArgumentError, "select(...) must contain an expression" if inner.empty?
46
+
47
+ inner
48
+ end
49
+
50
+ def select_stage?(stage)
51
+ /\Aselect\s*\(/.match?(stage)
52
+ end
53
+
54
+ def reject_unsupported_stage!(stage)
55
+ end
56
+
57
+ def split_top_level_pipeline(source)
58
+ parts = []
59
+ start_idx = 0
60
+ i = 0
61
+ stack = []
62
+ quote = nil
63
+ escaped = false
64
+ regex = false
65
+ regex_class = false
66
+
67
+ while i < source.length
68
+ ch = source[i]
69
+
70
+ if quote
71
+ if escaped # the previous character was a backslash, so skip this one
72
+ escaped = false
73
+ elsif ch == "\\"
74
+ escaped = true
75
+ elsif ch == quote
76
+ quote = nil
77
+ end
78
+ i += 1
79
+ next
80
+ end
81
+
82
+ if regex
83
+ if escaped
84
+ escaped = false
85
+ elsif regex_class
86
+ regex_class = false if ch == "]"
87
+ else
88
+ case ch
89
+ when "\\"
90
+ escaped = true
91
+ when "["
92
+ regex_class = true
93
+ when "/"
94
+ regex = false
95
+ end
96
+ end
97
+ i += 1
98
+ next
99
+ end
100
+
101
+ case ch
102
+ when "'", '"'
103
+ quote = ch
104
+ when "("
105
+ stack << [")", i]
106
+ when "["
107
+ stack << ["]", i]
108
+ when "{"
109
+ stack << ["}", i]
110
+ when ")", "]", "}"
111
+ expected, open_idx = stack.pop
112
+ unless expected == ch
113
+ raise ArgumentError, "mismatched delimiter #{ch.inspect} at offset #{i}"
114
+ end
115
+ when "/"
116
+ regex = looks_like_regex_start?(source, i)
117
+ when ">"
118
+ if stack.empty? && source[i, 2] == ">>"
119
+ parts << source[start_idx...i]
120
+ i += 2
121
+ start_idx = i
122
+ next
123
+ end
124
+ end
125
+
126
+ i += 1
127
+ end
128
+
129
+ parts << source[start_idx..]
130
+ unless stack.empty?
131
+ expected, open_idx = stack.last
132
+ raise ArgumentError, "unclosed delimiter #{expected.inspect} at offset #{open_idx}"
133
+ end
134
+
135
+ parts
136
+ end
137
+
138
+ def looks_like_regex_start?(source, slash_idx)
139
+ j = slash_idx - 1
140
+ j -= 1 while j >= 0 && source[j] =~ /\s/
141
+ return true if j < 0
142
+
143
+ prev = source[j]
144
+ !(/[[:alnum:]_\]\)]/.match?(prev))
145
+ end
146
+ end
147
+ end
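A hedged sketch of how the `:select` source string built by `parse_stage!` might be evaluated. The Runner is not shown in this diff, so `SketchContext`, its `_` method, and `apply_select` are assumptions made for illustration. Only the shape of the generated source and the `DROPPED` sentinel come from the code above.

```ruby
module Jrf
  module Control
    DROPPED = Object.new.freeze # mirrors the constant added in this diff
  end
end

# Hypothetical context: `_` returning the current value is an assumption
# inferred from the generated source string, not from the Runner.
class SketchContext
  def reset(value)
    @value = value
    self
  end

  def _
    @value
  end
end

# Same shape as the :select stage source built by parse_stage!.
def select_src(condition)
  "(#{condition}) ? _ : ::Jrf::Control::DROPPED"
end

# Evaluates the select source per row and removes rows marked as dropped.
def apply_select(rows, condition)
  ctx = SketchContext.new
  src = select_src(condition)
  rows.map { |row| ctx.reset(row).instance_eval(src) }
      .reject { |value| value.equal?(Jrf::Control::DROPPED) }
end

p apply_select([{ "x" => 5 }, { "x" => 20 }], '_["x"] > 10')
```

A dedicated sentinel object, compared by identity with `equal?`, keeps a filtered-out row distinct from a row whose value is legitimately `nil` or `false`.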