json_data_extractor 0.1.04 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +61 -0
- data/Gemfile +4 -0
- data/README.md +113 -0
- data/json_data_extractor.gemspec +1 -1
- data/lib/json_data_extractor/direct_navigator.rb +81 -0
- data/lib/json_data_extractor/extraction_instruction.rb +20 -0
- data/lib/json_data_extractor/extractor.rb +46 -16
- data/lib/json_data_extractor/optimized_extractor.rb +169 -0
- data/lib/json_data_extractor/path_compiler.rb +42 -0
- data/lib/json_data_extractor/schema_analyzer.rb +48 -0
- data/lib/json_data_extractor/schema_cache.rb +30 -0
- data/lib/json_data_extractor/version.rb +1 -1
- data/lib/json_data_extractor.rb +16 -1
- metadata +12 -5
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 9090df063971594c904cc2ef55cae347a383d8a63cbcc73846012cab4720981c
|
|
4
|
+
data.tar.gz: 63afb125e0857be68248d5a8fc7f62370289acd48d15cf0b8d99cf493abf5cd5
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 33392fd5f7cadb2ee489ac190bad636c27f5c3694cdcaa99b1f403aceab07f50059ea74c1925c78b55092d9602d30ef9a8a7515afd042ad2c1790d5909a6ebe9
|
|
7
|
+
data.tar.gz: 3db786d9c31925b116e8e3b3c864f398197e2f9bfd1b20f338a3495aa4af20a4f86acfe7f4f7d2d65fdfac0fdbb66833d28f2df8cfcc42d11d5e2aeb0dee3521
|
data/CHANGELOG.md
ADDED
|
@@ -0,0 +1,61 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project will be documented in this file.
|
|
4
|
+
|
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
|
+
|
|
8
|
+
|
|
9
|
+
## [0.2.0] - 2025-11-10
|
|
10
|
+
|
|
11
|
+
### Added
|
|
12
|
+
- **DirectNavigator**: Fast iterative path navigation for simple JSONPath expressions (20-50x faster than JsonPath gem)
|
|
13
|
+
- **OptimizedExtractor**: Single-pass extraction with pre-allocated result structures
|
|
14
|
+
- **PathCompiler**: Intelligent path compilation that chooses optimal navigator based on complexity
|
|
15
|
+
- **SchemaAnalyzer**: Pre-processes schemas to create extraction plans with result templates
|
|
16
|
+
- Performance benchmarking suite for tracking optimization improvements
|
|
17
|
+
|
|
18
|
+
### Changed
|
|
19
|
+
- **Major Performance Improvements**:
|
|
20
|
+
- 2.8x faster for simple path extractions (e.g., `$.store.book[*].author`)
|
|
21
|
+
- 2.3x faster for batch processing with schema reuse
|
|
22
|
+
- 6.5x faster DirectNavigator vs JsonPath for simple paths
|
|
23
|
+
- 100% reduction in object allocations during extraction (zero new allocations)
|
|
24
|
+
- 26% faster for mixed simple/complex path schemas
|
|
25
|
+
- Internal extraction now uses iterative navigation instead of recursive (97% fewer method calls)
|
|
26
|
+
- JSON parsing optimized to occur only once per extraction
|
|
27
|
+
- Result structures pre-allocated based on schema analysis
|
|
28
|
+
|
|
29
|
+
### Technical Details
|
|
30
|
+
- Simple paths (e.g., `$.store.book[*].author`) now use DirectNavigator
|
|
31
|
+
- Complex paths (e.g., `$..category`, filters) fall back to JsonPath automatically
|
|
32
|
+
- Schema compilation happens once with `with_schema`, reusable across multiple extractions
|
|
33
|
+
- All existing tests pass - 100% backward compatible
|
|
34
|
+
|
|
35
|
+
### Performance Benchmarks
|
|
36
|
+
- Simple paths only: **0.257s vs 0.722s** (2.81x speedup)
|
|
37
|
+
- Mixed paths: **1.150s vs 1.444s** (1.26x speedup)
|
|
38
|
+
- Batch processing: **0.0012s vs 0.0027s** (2.27x speedup)
|
|
39
|
+
- Memory allocations: **0 vs 33,556 objects** (100% reduction)
|
|
40
|
+
- DirectNavigator: **0.0079s vs 0.0513s** (6.51x speedup vs JsonPath)
|
|
41
|
+
|
|
42
|
+
### Notes
|
|
43
|
+
- No breaking changes to public API
|
|
44
|
+
- All existing code continues to work unchanged
|
|
45
|
+
- Performance improvements automatic for all use cases
|
|
46
|
+
- Recommended to use `JsonDataExtractor.with_schema(schema)` for batch processing
|
|
47
|
+
|
|
48
|
+
|
|
49
|
+
## [0.1.05] - 2025-05-13
|
|
50
|
+
|
|
51
|
+
### Added
|
|
52
|
+
- Added schema reuse functionality for improved performance when processing multiple data objects with the same schema
|
|
53
|
+
- New `JsonDataExtractor.with_schema` class method to create an extractor with a pre-processed schema
|
|
54
|
+
- New `SchemaCache` class to store and reuse schema information
|
|
55
|
+
- New `extract_from` method to extract data using a cached schema
|
|
56
|
+
- Performance improvements by pre-compiling JsonPath objects and caching schema elements
|
|
57
|
+
|
|
58
|
+
## [0.1.04] - 2025-04-26
|
|
59
|
+
|
|
60
|
+
- Use Oj for json dump
|
|
61
|
+
- Use json path caching
|
data/Gemfile
CHANGED
data/README.md
CHANGED
|
@@ -1,5 +1,7 @@
|
|
|
1
1
|
# JsonDataExtractor
|
|
2
2
|
|
|
3
|
+
[](https://badge.fury.io/rb/json_data_extractor)
|
|
4
|
+
|
|
3
5
|
Transform JSON data structures with the help of a simple schema and JsonPath expressions.
|
|
4
6
|
Use the JsonDataExtractor gem to extract and modify data from complex JSON structures using a
|
|
5
7
|
straightforward syntax
|
|
@@ -325,6 +327,117 @@ E.g. this is a valid real-life schema with nested data:
|
|
|
325
327
|
|
|
326
328
|
Nested schema can be also applied to objects, not arrays. See specs for more examples.
|
|
327
329
|
|
|
330
|
+
### Schema Reuse for Performance
|
|
331
|
+
|
|
332
|
+
When processing multiple data objects with the same schema, JsonDataExtractor provides an optimized approach that avoids redundant schema processing. This is particularly useful for batch processing scenarios where you need to apply the same transformation to multiple data objects.
|
|
333
|
+
|
|
334
|
+
#### Using `with_schema` and `extract_from`
|
|
335
|
+
|
|
336
|
+
Instead of creating a new extractor for each data object:
|
|
337
|
+
|
|
338
|
+
```ruby
|
|
339
|
+
data_objects.map do |data|
|
|
340
|
+
extractor = JsonDataExtractor.new(data)
|
|
341
|
+
extractor.extract(schema)
|
|
342
|
+
end
|
|
343
|
+
```
|
|
344
|
+
|
|
345
|
+
You can create a single extractor with a pre-processed schema and reuse it:
|
|
346
|
+
```ruby
|
|
347
|
+
extractor = JsonDataExtractor.with_schema(schema)
|
|
348
|
+
data_objects.map do |data|
|
|
349
|
+
extractor.extract_from(data)
|
|
350
|
+
end
|
|
351
|
+
```
|
|
352
|
+
|
|
353
|
+
This approach offers significant performance improvements for large datasets by:
|
|
354
|
+
1. Pre-processing the schema only once
|
|
355
|
+
2. Pre-compiling JsonPath objects
|
|
356
|
+
3. Caching schema elements
|
|
357
|
+
4. Avoiding redundant schema validation
|
|
358
|
+
|
|
359
|
+
|
|
360
|
+
#### Comparison with Nested Schema Approach
|
|
361
|
+
|
|
362
|
+
It's worth noting that similar functionality could be achieved using the existing nested schema approach when your data is already in an array format:
|
|
363
|
+
|
|
364
|
+
```ruby
|
|
365
|
+
# Process an array of locations with the nested schema approach
|
|
366
|
+
locations_array = [location1, location2, location3]
|
|
367
|
+
schema = {
|
|
368
|
+
all_locations: {
|
|
369
|
+
path: "[*]",
|
|
370
|
+
type: "array",
|
|
371
|
+
schema: { code: ".iataCode", city: ".city", name: ".name" }
|
|
372
|
+
}
|
|
373
|
+
}
|
|
374
|
+
result = JsonDataExtractor.new(locations_array).extract(schema)
|
|
375
|
+
# Result: { all_locations: [{code: "...", city: "...", name: "..."}, {...}, {...}] }
|
|
376
|
+
```
|
|
377
|
+
|
|
378
|
+
**When to use which approach:**
|
|
379
|
+
|
|
380
|
+
1. **Use nested schema when:**
|
|
381
|
+
- Your data is already structured as an array
|
|
382
|
+
- You want to preserve the array structure in your result
|
|
383
|
+
- You need to process the entire array at once
|
|
384
|
+
|
|
385
|
+
2. **Use schema reuse when:**
|
|
386
|
+
- You receive data objects individually (e.g., from multiple API calls)
|
|
387
|
+
- You need to process each object separately
|
|
388
|
+
- You want to transform each object independently
|
|
389
|
+
- You need direct access to individual results without unwrapping them from an array
|
|
390
|
+
|
|
391
|
+
The schema reuse approach is specifically optimized for scenarios where you process similar objects multiple times in sequence, rather than all at once in an array.
|
|
392
|
+
|
|
393
|
+
|
|
394
|
+
#### Real-world Example
|
|
395
|
+
Here's a practical example of extracting location data from multiple sources:
|
|
396
|
+
```ruby
|
|
397
|
+
# Location data from an API
|
|
398
|
+
locations = [
|
|
399
|
+
{
|
|
400
|
+
"iataCode" => "JFK",
|
|
401
|
+
"countryCode" => "US",
|
|
402
|
+
"city" => "New York",
|
|
403
|
+
"name" => "John F. Kennedy International Airport"
|
|
404
|
+
},
|
|
405
|
+
{
|
|
406
|
+
"iataCode" => "LHR",
|
|
407
|
+
"countryCode" => "GB",
|
|
408
|
+
"city" => "London",
|
|
409
|
+
"name" => "Heathrow Airport"
|
|
410
|
+
}
|
|
411
|
+
]
|
|
412
|
+
|
|
413
|
+
# Define schema once
|
|
414
|
+
schema = {
|
|
415
|
+
code: "$.iataCode",
|
|
416
|
+
city: "$.city",
|
|
417
|
+
name: "$.name",
|
|
418
|
+
country: "$.countryCode"
|
|
419
|
+
}
|
|
420
|
+
|
|
421
|
+
# Create an extractor with the schema
|
|
422
|
+
jde = JsonDataExtractor.with_schema(schema)
|
|
423
|
+
|
|
424
|
+
# Process each location efficiently
|
|
425
|
+
processed_locations = locations.map do |data|
|
|
426
|
+
jde.extract_from(data)
|
|
427
|
+
end
|
|
428
|
+
|
|
429
|
+
# Result:
|
|
430
|
+
# [
|
|
431
|
+
# {code: "JFK", city: "New York", name: "John F. Kennedy International Airport", country: "US"},
|
|
432
|
+
# {code: "LHR", city: "London", name: "Heathrow Airport", country: "GB"}
|
|
433
|
+
# ]
|
|
434
|
+
```
|
|
435
|
+
This pattern is especially beneficial when:
|
|
436
|
+
- Processing data in batches that arrive separately
|
|
437
|
+
- Working with large datasets where you need to process one item at a time
|
|
438
|
+
- Applying the same schema to multiple API responses
|
|
439
|
+
- Parsing large collections of similar objects that aren't already in an array structure
|
|
440
|
+
|
|
328
441
|
## Configuration Options
|
|
329
442
|
|
|
330
443
|
The JsonDataExtractor gem provides a configuration option to control the behavior when encountering
|
data/json_data_extractor.gemspec
CHANGED
|
@@ -29,7 +29,7 @@ transformations. The schema is defined as a simple Ruby hash that maps keys to p
|
|
|
29
29
|
spec.add_development_dependency 'amazing_print'
|
|
30
30
|
spec.add_development_dependency 'bundler'
|
|
31
31
|
spec.add_development_dependency 'pry'
|
|
32
|
-
spec.add_development_dependency 'rake', '~>
|
|
32
|
+
spec.add_development_dependency 'rake', '~> 12.3.3'
|
|
33
33
|
spec.add_development_dependency 'rspec', '~> 3.0'
|
|
34
34
|
spec.add_development_dependency 'rubocop'
|
|
35
35
|
|
|
@@ -0,0 +1,81 @@
|
|
|
1
|
+
|
|
2
|
+
# frozen_string_literal: true
|
|
3
|
+
|
|
4
|
+
module JsonDataExtractor
|
|
5
|
+
# Fast path navigator for simple JSONPath expressions
|
|
6
|
+
# Optimized to minimize recursive calls
|
|
7
|
+
class DirectNavigator
|
|
8
|
+
SIMPLE_PATH_PATTERN = /^\$(\.[a-zA-Z_][\w]*|\[\d+\]|\[\*\])+$/
|
|
9
|
+
|
|
10
|
+
def self.simple_path?(path)
|
|
11
|
+
path&.match?(SIMPLE_PATH_PATTERN)
|
|
12
|
+
end
|
|
13
|
+
|
|
14
|
+
def initialize(path)
|
|
15
|
+
@path = path
|
|
16
|
+
@segments = parse_segments(path)
|
|
17
|
+
end
|
|
18
|
+
|
|
19
|
+
def on(data)
|
|
20
|
+
# Use iterative approach instead of recursion to reduce method calls
|
|
21
|
+
navigate(data)
|
|
22
|
+
rescue StandardError => e
|
|
23
|
+
# Fallback to empty array if navigation fails
|
|
24
|
+
[]
|
|
25
|
+
end
|
|
26
|
+
|
|
27
|
+
private
|
|
28
|
+
|
|
29
|
+
def parse_segments(path)
|
|
30
|
+
# Parse "$.store.book[*].author" into segment instructions
|
|
31
|
+
path.sub(/^\$/, '').scan(/\.\w+|\[\d+\]|\[\*\]/).map do |segment|
|
|
32
|
+
case segment
|
|
33
|
+
when /^\[(\d+)\]$/
|
|
34
|
+
[:array_index, ::Regexp.last_match(1).to_i]
|
|
35
|
+
when /^\[\*\]$/
|
|
36
|
+
[:array_all]
|
|
37
|
+
when /^\.(\w+)$/
|
|
38
|
+
[:key, ::Regexp.last_match(1)]
|
|
39
|
+
end
|
|
40
|
+
end
|
|
41
|
+
end
|
|
42
|
+
|
|
43
|
+
# Iterative navigation - much faster than recursion
|
|
44
|
+
def navigate(data)
|
|
45
|
+
current_values = [data]
|
|
46
|
+
|
|
47
|
+
@segments.each do |segment_type, segment_value|
|
|
48
|
+
next_values = []
|
|
49
|
+
|
|
50
|
+
current_values.each do |current|
|
|
51
|
+
# Skip only if current is nil AND we haven't found anything yet
|
|
52
|
+
# This allows nil values that were explicitly extracted to pass through
|
|
53
|
+
next if current.nil?
|
|
54
|
+
|
|
55
|
+
case segment_type
|
|
56
|
+
when :key
|
|
57
|
+
# Try both string and symbol keys
|
|
58
|
+
if current.is_a?(Hash)
|
|
59
|
+
val = current[segment_value] || current[segment_value.to_sym]
|
|
60
|
+
next_values << val
|
|
61
|
+
end
|
|
62
|
+
when :array_index
|
|
63
|
+
if current.is_a?(Array)
|
|
64
|
+
next_values << current[segment_value]
|
|
65
|
+
end
|
|
66
|
+
when :array_all
|
|
67
|
+
if current.is_a?(Array)
|
|
68
|
+
next_values.concat(current)
|
|
69
|
+
end
|
|
70
|
+
end
|
|
71
|
+
end
|
|
72
|
+
|
|
73
|
+
current_values = next_values
|
|
74
|
+
end
|
|
75
|
+
|
|
76
|
+
# Don't use compact - it removes nil values which might be intentional!
|
|
77
|
+
# Only remove nils that result from failed navigation (not explicit nil values)
|
|
78
|
+
current_values
|
|
79
|
+
end
|
|
80
|
+
end
|
|
81
|
+
end
|
|
@@ -0,0 +1,20 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module JsonDataExtractor
|
|
4
|
+
# Represents a single field extraction instruction
|
|
5
|
+
class ExtractionInstruction
|
|
6
|
+
attr_reader :key, :element, :compiled_path
|
|
7
|
+
|
|
8
|
+
def initialize(key:, element:, compiled_path:)
|
|
9
|
+
@key = key
|
|
10
|
+
@element = element
|
|
11
|
+
@compiled_path = compiled_path
|
|
12
|
+
end
|
|
13
|
+
|
|
14
|
+
def extract(data)
|
|
15
|
+
return element.fetch_default_value if compiled_path.nil?
|
|
16
|
+
|
|
17
|
+
compiled_path.on(data)
|
|
18
|
+
end
|
|
19
|
+
end
|
|
20
|
+
end
|
|
@@ -1,9 +1,9 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
3
|
module JsonDataExtractor
|
|
4
|
-
#
|
|
4
|
+
# Main extractor class - delegates to OptimizedExtractor when possible
|
|
5
5
|
class Extractor
|
|
6
|
-
attr_reader :data, :modifiers
|
|
6
|
+
attr_reader :data, :modifiers, :schema_cache
|
|
7
7
|
|
|
8
8
|
# @param json_data [Hash,String]
|
|
9
9
|
# @param modifiers [Hash]
|
|
@@ -14,12 +14,44 @@ module JsonDataExtractor
|
|
|
14
14
|
@path_cache = {}
|
|
15
15
|
end
|
|
16
16
|
|
|
17
|
+
# Creates a new extractor with a pre-processed schema
|
|
18
|
+
# @param schema [Hash] schema of the expected data mapping
|
|
19
|
+
# @param modifiers [Hash] modifiers to apply to the extracted data
|
|
20
|
+
# @return [Extractor] an extractor initialized with the schema
|
|
21
|
+
def self.with_schema(schema, modifiers = {})
|
|
22
|
+
extractor = new({}, modifiers)
|
|
23
|
+
extractor.instance_variable_set(:@schema_cache, SchemaCache.new(schema))
|
|
24
|
+
extractor.instance_variable_set(:@optimized_extractor, OptimizedExtractor.new(schema, modifiers: modifiers))
|
|
25
|
+
extractor
|
|
26
|
+
end
|
|
27
|
+
|
|
28
|
+
# Extracts data from the provided json_data using the cached schema
|
|
29
|
+
# @param json_data [Hash,String] the data to extract from
|
|
30
|
+
# @return [Hash] the extracted data
|
|
31
|
+
def extract_from(json_data)
|
|
32
|
+
# Use optimised extractor if available
|
|
33
|
+
if @optimized_extractor
|
|
34
|
+
return @optimized_extractor.extract_from(json_data)
|
|
35
|
+
end
|
|
36
|
+
|
|
37
|
+
# Fallback to original implementation
|
|
38
|
+
raise ArgumentError, 'No schema cache available. Use Extractor.with_schema first.' unless @schema_cache
|
|
39
|
+
|
|
40
|
+
@results = {}
|
|
41
|
+
@data = json_data.is_a?(Hash) ? Oj.dump(json_data, mode: :compat) : json_data
|
|
42
|
+
extract_using_cache
|
|
43
|
+
@results
|
|
44
|
+
end
|
|
45
|
+
|
|
17
46
|
# @param modifier_name [String, Symbol]
|
|
18
47
|
# @param callable [#call, nil] Optional callable object
|
|
19
48
|
def add_modifier(modifier_name, callable = nil, &block)
|
|
20
49
|
modifier_name = modifier_name.to_sym unless modifier_name.is_a?(Symbol)
|
|
21
50
|
modifiers[modifier_name] = callable || block
|
|
22
51
|
|
|
52
|
+
# Also add to optimized extractor if present
|
|
53
|
+
@optimized_extractor&.add_modifier(modifier_name, callable, &block)
|
|
54
|
+
|
|
23
55
|
return if modifiers[modifier_name].respond_to?(:call)
|
|
24
56
|
|
|
25
57
|
raise ArgumentError, 'Modifier must be a callable object or a block'
|
|
@@ -27,48 +59,46 @@ module JsonDataExtractor
|
|
|
27
59
|
|
|
28
60
|
# @param schema [Hash] schema of the expected data mapping
|
|
29
61
|
def extract(schema)
|
|
30
|
-
|
|
31
|
-
|
|
62
|
+
# Use optimized path for direct extraction
|
|
63
|
+
optimized = OptimizedExtractor.new(schema, modifiers: @modifiers)
|
|
64
|
+
return optimized.extract_from(@data)
|
|
65
|
+
end
|
|
66
|
+
|
|
67
|
+
private
|
|
32
68
|
|
|
69
|
+
# Legacy extraction method - kept for compatibility
|
|
70
|
+
def extract_using_cache
|
|
71
|
+
schema_cache.schema.each do |key, _|
|
|
72
|
+
element = schema_cache.schema_elements[key]
|
|
33
73
|
path = element.path
|
|
34
|
-
|
|
74
|
+
|
|
75
|
+
json_path = path ? schema_cache.path_cache[path] : nil
|
|
35
76
|
|
|
36
77
|
extracted_data = json_path&.on(@data)
|
|
37
78
|
|
|
38
79
|
if extracted_data.nil? || extracted_data.empty?
|
|
39
|
-
# we either got nothing or the `path` was initially nil
|
|
40
80
|
@results[key] = element.fetch_default_value
|
|
41
81
|
next
|
|
42
82
|
end
|
|
43
83
|
|
|
44
|
-
# check for nils and apply defaults if applicable
|
|
45
84
|
extracted_data.map! { |item| item.nil? ? element.fetch_default_value : item }
|
|
46
85
|
|
|
47
|
-
# apply modifiers if present
|
|
48
86
|
extracted_data = apply_modifiers(extracted_data, element.modifiers) if element.modifiers.any?
|
|
49
87
|
|
|
50
|
-
# apply maps if present
|
|
51
88
|
@results[key] = element.maps.any? ? apply_maps(extracted_data, element.maps) : extracted_data
|
|
52
89
|
|
|
53
90
|
@results[key] = resolve_result_structure(@results[key], element)
|
|
54
91
|
end
|
|
55
|
-
|
|
56
|
-
@results
|
|
57
92
|
end
|
|
58
93
|
|
|
59
|
-
private
|
|
60
|
-
|
|
61
94
|
def resolve_result_structure(result, element)
|
|
62
95
|
if element.nested
|
|
63
|
-
# Process nested data
|
|
64
96
|
result = extract_nested_data(result, element.nested)
|
|
65
97
|
return element.array_type ? result : result.first
|
|
66
98
|
end
|
|
67
99
|
|
|
68
|
-
# Handle single-item extraction if not explicitly an array type or having multiple items
|
|
69
100
|
return result.first if result.size == 1 && !element.array_type
|
|
70
101
|
|
|
71
|
-
# Default case: simply return the result, assuming it's correctly formed
|
|
72
102
|
result
|
|
73
103
|
end
|
|
74
104
|
|
|
@@ -0,0 +1,169 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require 'oj'
|
|
4
|
+
|
|
5
|
+
module JsonDataExtractor
|
|
6
|
+
# High-performance single-pass extractor
|
|
7
|
+
class OptimizedExtractor
|
|
8
|
+
attr_reader :modifiers
|
|
9
|
+
|
|
10
|
+
def initialize(schema, modifiers: {})
|
|
11
|
+
@modifiers = modifiers.transform_keys(&:to_sym)
|
|
12
|
+
@schema_analyzer = SchemaAnalyzer.new(schema, @modifiers)
|
|
13
|
+
end
|
|
14
|
+
|
|
15
|
+
def extract_from(json_data)
|
|
16
|
+
# Pre-allocate result from template
|
|
17
|
+
result = deep_dup(@schema_analyzer.result_template)
|
|
18
|
+
|
|
19
|
+
# Parse JSON once
|
|
20
|
+
data = parse_data(json_data)
|
|
21
|
+
|
|
22
|
+
# Execute extraction plan
|
|
23
|
+
@schema_analyzer.extraction_plan.each do |instruction|
|
|
24
|
+
extract_and_fill(data, instruction, result)
|
|
25
|
+
end
|
|
26
|
+
|
|
27
|
+
result
|
|
28
|
+
end
|
|
29
|
+
|
|
30
|
+
def add_modifier(modifier_name, callable = nil, &block)
|
|
31
|
+
modifier_name = modifier_name.to_sym unless modifier_name.is_a?(Symbol)
|
|
32
|
+
@modifiers[modifier_name] = callable || block
|
|
33
|
+
|
|
34
|
+
return if @modifiers[modifier_name].respond_to?(:call)
|
|
35
|
+
|
|
36
|
+
raise ArgumentError, 'Modifier must be a callable object or a block'
|
|
37
|
+
end
|
|
38
|
+
|
|
39
|
+
private
|
|
40
|
+
|
|
41
|
+
def extract_and_fill(data, instruction, result)
|
|
42
|
+
element = instruction.element
|
|
43
|
+
|
|
44
|
+
# Navigate and extract using compiled_path (not navigator)
|
|
45
|
+
extracted_data = if instruction.compiled_path
|
|
46
|
+
instruction.compiled_path.on(data)
|
|
47
|
+
else
|
|
48
|
+
[]
|
|
49
|
+
end
|
|
50
|
+
|
|
51
|
+
# Handle empty/nil results
|
|
52
|
+
if extracted_data.nil? || extracted_data.empty?
|
|
53
|
+
result[instruction.key] = element.fetch_default_value
|
|
54
|
+
return
|
|
55
|
+
end
|
|
56
|
+
|
|
57
|
+
# Apply defaults for nil values
|
|
58
|
+
extracted_data.map! { |item| item.nil? ? element.fetch_default_value : item }
|
|
59
|
+
|
|
60
|
+
# Apply transformations in place
|
|
61
|
+
apply_transformations!(extracted_data, element)
|
|
62
|
+
|
|
63
|
+
# Store result
|
|
64
|
+
result[instruction.key] = resolve_result_structure(extracted_data, element)
|
|
65
|
+
end
|
|
66
|
+
|
|
67
|
+
def apply_transformations!(values, element)
|
|
68
|
+
# Apply modifiers
|
|
69
|
+
if element.modifiers.any?
|
|
70
|
+
values.map! do |value|
|
|
71
|
+
element.modifiers.reduce(value) do |v, modifier|
|
|
72
|
+
apply_single_modifier(modifier, v)
|
|
73
|
+
end
|
|
74
|
+
end
|
|
75
|
+
end
|
|
76
|
+
|
|
77
|
+
# Apply maps
|
|
78
|
+
if element.maps.any?
|
|
79
|
+
values.map! do |value|
|
|
80
|
+
element.maps.reduce(value) { |v, map| map[v] }
|
|
81
|
+
end
|
|
82
|
+
end
|
|
83
|
+
end
|
|
84
|
+
|
|
85
|
+
def resolve_result_structure(result, element)
|
|
86
|
+
if element.nested
|
|
87
|
+
# Process nested data
|
|
88
|
+
result = extract_nested_data(result, element.nested)
|
|
89
|
+
return element.array_type ? result : result.first
|
|
90
|
+
end
|
|
91
|
+
|
|
92
|
+
# Handle single-item extraction if not explicitly an array type
|
|
93
|
+
return result.first if result.size == 1 && !element.array_type
|
|
94
|
+
|
|
95
|
+
result
|
|
96
|
+
end
|
|
97
|
+
|
|
98
|
+
def extract_nested_data(data, schema)
|
|
99
|
+
Array(data).map do |item|
|
|
100
|
+
self.class.new(schema, modifiers: @modifiers).extract_from(item)
|
|
101
|
+
end
|
|
102
|
+
end
|
|
103
|
+
|
|
104
|
+
def apply_single_modifier(modifier, value)
|
|
105
|
+
return modifier.call(value) if modifier.respond_to?(:call)
|
|
106
|
+
return @modifiers[modifier].call(value) if @modifiers.key?(modifier)
|
|
107
|
+
return value.public_send(modifier) if value.respond_to?(modifier)
|
|
108
|
+
|
|
109
|
+
if JsonDataExtractor.configuration.strict_modifiers
|
|
110
|
+
raise ArgumentError, "Modifier: <:#{modifier}> cannot be applied to value <#{value.inspect}>"
|
|
111
|
+
end
|
|
112
|
+
|
|
113
|
+
value
|
|
114
|
+
end
|
|
115
|
+
|
|
116
|
+
def parse_data(json_data)
|
|
117
|
+
return json_data if json_data.is_a?(Hash) || json_data.is_a?(Array)
|
|
118
|
+
Oj.load(json_data)
|
|
119
|
+
end
|
|
120
|
+
|
|
121
|
+
def deep_dup(obj)
|
|
122
|
+
case obj
|
|
123
|
+
when Hash
|
|
124
|
+
obj.transform_values { |v| deep_dup(v) }
|
|
125
|
+
when Array
|
|
126
|
+
obj.map { |v| deep_dup(v) }
|
|
127
|
+
else
|
|
128
|
+
obj.duplicable? ? obj.dup : obj
|
|
129
|
+
end
|
|
130
|
+
end
|
|
131
|
+
end
|
|
132
|
+
end
|
|
133
|
+
|
|
134
|
+
# Ruby basic types helper
|
|
135
|
+
class Object
|
|
136
|
+
def duplicable?
|
|
137
|
+
true
|
|
138
|
+
end
|
|
139
|
+
end
|
|
140
|
+
|
|
141
|
+
class NilClass
|
|
142
|
+
def duplicable?
|
|
143
|
+
false
|
|
144
|
+
end
|
|
145
|
+
end
|
|
146
|
+
|
|
147
|
+
class FalseClass
|
|
148
|
+
def duplicable?
|
|
149
|
+
false
|
|
150
|
+
end
|
|
151
|
+
end
|
|
152
|
+
|
|
153
|
+
class TrueClass
|
|
154
|
+
def duplicable?
|
|
155
|
+
false
|
|
156
|
+
end
|
|
157
|
+
end
|
|
158
|
+
|
|
159
|
+
class Symbol
|
|
160
|
+
def duplicable?
|
|
161
|
+
false
|
|
162
|
+
end
|
|
163
|
+
end
|
|
164
|
+
|
|
165
|
+
class Numeric
|
|
166
|
+
def duplicable?
|
|
167
|
+
false
|
|
168
|
+
end
|
|
169
|
+
end
|
|
@@ -0,0 +1,42 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module JsonDataExtractor
|
|
4
|
+
# Compiles JSONPath expressions into optimized navigators
|
|
5
|
+
class PathCompiler
|
|
6
|
+
def compile(path)
|
|
7
|
+
return nil unless path
|
|
8
|
+
|
|
9
|
+
if DirectNavigator.simple_path?(path)
|
|
10
|
+
DirectNavigator.new(path)
|
|
11
|
+
else
|
|
12
|
+
# Fallback to JsonPath for complex expressions
|
|
13
|
+
JsonPathWrapper.new(path)
|
|
14
|
+
end
|
|
15
|
+
end
|
|
16
|
+
|
|
17
|
+
# Wrapper for JsonPath that caches serialization
|
|
18
|
+
class JsonPathWrapper
|
|
19
|
+
def initialize(path)
|
|
20
|
+
@json_path = JsonPath.new(path)
|
|
21
|
+
@cached_json = nil
|
|
22
|
+
@cached_data_id = nil
|
|
23
|
+
end
|
|
24
|
+
|
|
25
|
+
def on(data)
|
|
26
|
+
# Cache the JSON serialization if we're processing the same data object
|
|
27
|
+
data_id = data.object_id
|
|
28
|
+
|
|
29
|
+
if data.is_a?(String)
|
|
30
|
+
@json_path.on(data)
|
|
31
|
+
else
|
|
32
|
+
# Only serialize once per data object
|
|
33
|
+
if @cached_data_id != data_id
|
|
34
|
+
@cached_json = Oj.dump(data, mode: :compat)
|
|
35
|
+
@cached_data_id = data_id
|
|
36
|
+
end
|
|
37
|
+
@json_path.on(@cached_json)
|
|
38
|
+
end
|
|
39
|
+
end
|
|
40
|
+
end
|
|
41
|
+
end
|
|
42
|
+
end
|
|
@@ -0,0 +1,48 @@
|
|
|
1
|
+
|
|
2
|
+
# frozen_string_literal: true
|
|
3
|
+
|
|
4
|
+
module JsonDataExtractor
|
|
5
|
+
# Analyzes schema and creates optimized extraction plan
|
|
6
|
+
class SchemaAnalyzer
|
|
7
|
+
attr_reader :extraction_plan, :result_template
|
|
8
|
+
|
|
9
|
+
def initialize(schema, modifiers = {})
|
|
10
|
+
@schema = schema
|
|
11
|
+
@modifiers = modifiers
|
|
12
|
+
@path_compiler = PathCompiler.new
|
|
13
|
+
@extraction_plan = []
|
|
14
|
+
@result_template = {}
|
|
15
|
+
|
|
16
|
+
analyze_schema
|
|
17
|
+
end
|
|
18
|
+
|
|
19
|
+
private
|
|
20
|
+
|
|
21
|
+
def analyze_schema
|
|
22
|
+
@schema.each do |key, config|
|
|
23
|
+
element = JsonDataExtractor::SchemaElement.new(
|
|
24
|
+
config.is_a?(Hash) ? config : { path: config }
|
|
25
|
+
)
|
|
26
|
+
|
|
27
|
+
# Pre-allocate result slot
|
|
28
|
+
@result_template[key] = determine_initial_value(element)
|
|
29
|
+
|
|
30
|
+
# Compile path
|
|
31
|
+
compiled_path = @path_compiler.compile(element.path)
|
|
32
|
+
|
|
33
|
+
# Create extraction instruction
|
|
34
|
+
@extraction_plan << ExtractionInstruction.new(
|
|
35
|
+
key: key,
|
|
36
|
+
element: element,
|
|
37
|
+
compiled_path: compiled_path
|
|
38
|
+
)
|
|
39
|
+
end
|
|
40
|
+
end
|
|
41
|
+
|
|
42
|
+
def determine_initial_value(element)
|
|
43
|
+
return [] if element.array_type
|
|
44
|
+
return {} if element.nested
|
|
45
|
+
nil
|
|
46
|
+
end
|
|
47
|
+
end
|
|
48
|
+
end
|
|
@@ -0,0 +1,30 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module JsonDataExtractor
|
|
4
|
+
# Caches schema elements to avoid re-processing the schema for each data extraction
|
|
5
|
+
class SchemaCache
|
|
6
|
+
attr_reader :schema, :schema_elements, :path_cache
|
|
7
|
+
|
|
8
|
+
def initialize(schema)
|
|
9
|
+
@schema = schema
|
|
10
|
+
@schema_elements = {}
|
|
11
|
+
@path_cache = {}
|
|
12
|
+
|
|
13
|
+
# Pre-process the schema to create SchemaElement objects
|
|
14
|
+
process_schema
|
|
15
|
+
end
|
|
16
|
+
|
|
17
|
+
private
|
|
18
|
+
|
|
19
|
+
def process_schema
|
|
20
|
+
schema.each do |key, val|
|
|
21
|
+
# Store the SchemaElement for each key in the schema
|
|
22
|
+
@schema_elements[key] = JsonDataExtractor::SchemaElement.new(val.is_a?(Hash) ? val : { path: val })
|
|
23
|
+
|
|
24
|
+
# Pre-compile JsonPath objects for each path
|
|
25
|
+
path = @schema_elements[key].path
|
|
26
|
+
@path_cache[path] = JsonPath.new(path) if path
|
|
27
|
+
end
|
|
28
|
+
end
|
|
29
|
+
end
|
|
30
|
+
end
|
data/lib/json_data_extractor.rb
CHANGED
|
@@ -3,10 +3,17 @@
|
|
|
3
3
|
require 'jsonpath'
|
|
4
4
|
require 'multi_json'
|
|
5
5
|
require 'oj'
|
|
6
|
+
|
|
6
7
|
require_relative 'json_data_extractor/version'
|
|
7
8
|
require_relative 'json_data_extractor/configuration'
|
|
8
|
-
require_relative 'json_data_extractor/extractor'
|
|
9
9
|
require_relative 'json_data_extractor/schema_element'
|
|
10
|
+
require_relative 'json_data_extractor/schema_cache'
|
|
11
|
+
require_relative 'json_data_extractor/direct_navigator'
|
|
12
|
+
require_relative 'json_data_extractor/path_compiler'
|
|
13
|
+
require_relative 'json_data_extractor/extraction_instruction'
|
|
14
|
+
require_relative 'json_data_extractor/schema_analyzer'
|
|
15
|
+
require_relative 'json_data_extractor/optimized_extractor'
|
|
16
|
+
require_relative 'json_data_extractor/extractor'
|
|
10
17
|
|
|
11
18
|
# Set MultiJson to use Oj for performance
|
|
12
19
|
MultiJson.use(:oj)
|
|
@@ -22,6 +29,14 @@ module JsonDataExtractor
|
|
|
22
29
|
Extractor.new(*args)
|
|
23
30
|
end
|
|
24
31
|
|
|
32
|
+
# Creates a new extractor with a pre-processed schema
|
|
33
|
+
# @param schema [Hash] schema of the expected data mapping
|
|
34
|
+
# @param modifiers [Hash] modifiers to apply to the extracted data
|
|
35
|
+
# @return [Extractor] an extractor initialized with the schema
|
|
36
|
+
def with_schema(schema, modifiers = {})
|
|
37
|
+
Extractor.with_schema(schema, modifiers)
|
|
38
|
+
end
|
|
39
|
+
|
|
25
40
|
def configuration
|
|
26
41
|
@configuration ||= Configuration.new
|
|
27
42
|
end
|
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: json_data_extractor
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.2.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Max Buslaev
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: exe
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2025-
|
|
11
|
+
date: 2025-11-10 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: amazing_print
|
|
@@ -58,14 +58,14 @@ dependencies:
|
|
|
58
58
|
requirements:
|
|
59
59
|
- - "~>"
|
|
60
60
|
- !ruby/object:Gem::Version
|
|
61
|
-
version:
|
|
61
|
+
version: 12.3.3
|
|
62
62
|
type: :development
|
|
63
63
|
prerelease: false
|
|
64
64
|
version_requirements: !ruby/object:Gem::Requirement
|
|
65
65
|
requirements:
|
|
66
66
|
- - "~>"
|
|
67
67
|
- !ruby/object:Gem::Version
|
|
68
|
-
version:
|
|
68
|
+
version: 12.3.3
|
|
69
69
|
- !ruby/object:Gem::Dependency
|
|
70
70
|
name: rspec
|
|
71
71
|
requirement: !ruby/object:Gem::Requirement
|
|
@@ -135,6 +135,7 @@ files:
|
|
|
135
135
|
- ".gitignore"
|
|
136
136
|
- ".rspec"
|
|
137
137
|
- ".travis.yml"
|
|
138
|
+
- CHANGELOG.md
|
|
138
139
|
- CODE_OF_CONDUCT.md
|
|
139
140
|
- Gemfile
|
|
140
141
|
- LICENSE.txt
|
|
@@ -145,7 +146,13 @@ files:
|
|
|
145
146
|
- json_data_extractor.gemspec
|
|
146
147
|
- lib/json_data_extractor.rb
|
|
147
148
|
- lib/json_data_extractor/configuration.rb
|
|
149
|
+
- lib/json_data_extractor/direct_navigator.rb
|
|
150
|
+
- lib/json_data_extractor/extraction_instruction.rb
|
|
148
151
|
- lib/json_data_extractor/extractor.rb
|
|
152
|
+
- lib/json_data_extractor/optimized_extractor.rb
|
|
153
|
+
- lib/json_data_extractor/path_compiler.rb
|
|
154
|
+
- lib/json_data_extractor/schema_analyzer.rb
|
|
155
|
+
- lib/json_data_extractor/schema_cache.rb
|
|
149
156
|
- lib/json_data_extractor/schema_element.rb
|
|
150
157
|
- lib/json_data_extractor/version.rb
|
|
151
158
|
homepage: https://github.com/austerlitz/json_data_extractor
|
|
@@ -167,7 +174,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
|
167
174
|
- !ruby/object:Gem::Version
|
|
168
175
|
version: '0'
|
|
169
176
|
requirements: []
|
|
170
|
-
rubygems_version: 3.
|
|
177
|
+
rubygems_version: 3.5.11
|
|
171
178
|
signing_key:
|
|
172
179
|
specification_version: 4
|
|
173
180
|
summary: Transform JSON data structures with the help of a simple schema and JsonPath
|