json_data_extractor 0.1.04 → 0.1.05
- checksums.yaml +4 -4
- data/CHANGELOG.md +20 -0
- data/README.md +113 -0
- data/json_data_extractor.gemspec +1 -1
- data/lib/json_data_extractor/extractor.rb +60 -1
- data/lib/json_data_extractor/schema_cache.rb +30 -0
- data/lib/json_data_extractor/version.rb +1 -1
- data/lib/json_data_extractor.rb +10 -1
- metadata +6 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 1a833b6de5421f4871a5a36814d6f304a495c16b65b8d27d60f1839317bd26ac
|
4
|
+
data.tar.gz: 2e1a98f860ad8f52fd81aa9003b1e2a0fb079bbfdeaef51380625d7e25e83b2c
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 1c408a29566f5999e7ccb251860442b4b7c1cbcee391a72173753bf09c30e9f6e0d78ae18af28c95cd8bc2708dbb7f180493f038221a365745f7f587da08c2a8
|
7
|
+
data.tar.gz: 8c936a46b176ebe7dbc365b79d4a267fb3217b16f5f88280a040f3b951b5a8f65a70384c9f025853ec90d8de5c41e0ac1b3545de2b0ed2ba5c64bc8405240fc8
|
data/CHANGELOG.md
ADDED
@@ -0,0 +1,20 @@
|
|
1
|
+
# Changelog
|
2
|
+
|
3
|
+
All notable changes to this project will be documented in this file.
|
4
|
+
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
7
|
+
|
8
|
+
## [0.1.05] - 2025-05-13
|
9
|
+
|
10
|
+
### Added
|
11
|
+
- Added schema reuse functionality for improved performance when processing multiple data objects with the same schema
|
12
|
+
- New `JsonDataExtractor.with_schema` class method to create an extractor with a pre-processed schema
|
13
|
+
- New `SchemaCache` class to store and reuse schema information
|
14
|
+
- New `extract_from` method to extract data using a cached schema
|
15
|
+
- Performance improvements by pre-compiling JsonPath objects and caching schema elements
|
16
|
+
|
17
|
+
## [0.1.04] - 2025-04-26
|
18
|
+
|
19
|
+
- Use Oj for json dump
|
20
|
+
- Use json path caching
|
data/README.md
CHANGED
@@ -1,5 +1,7 @@
|
|
1
1
|
# JsonDataExtractor
|
2
2
|
|
3
|
+
[Gem Version](https://badge.fury.io/rb/json_data_extractor)
|
4
|
+
|
3
5
|
Transform JSON data structures with the help of a simple schema and JsonPath expressions.
|
4
6
|
Use the JsonDataExtractor gem to extract and modify data from complex JSON structures using a
|
5
7
|
straightforward syntax
|
@@ -325,6 +327,117 @@ E.g. this is a valid real-life schema with nested data:
|
|
325
327
|
|
326
328
|
Nested schemas can also be applied to objects, not only arrays. See specs for more examples.
|
327
329
|
|
330
|
+
### Schema Reuse for Performance
|
331
|
+
|
332
|
+
When processing multiple data objects with the same schema, JsonDataExtractor provides an optimized approach that avoids redundant schema processing. This is particularly useful for batch processing scenarios where you need to apply the same transformation to multiple data objects.
|
333
|
+
|
334
|
+
#### Using `with_schema` and `extract_from`
|
335
|
+
|
336
|
+
Instead of creating a new extractor for each data object:
|
337
|
+
|
338
|
+
```ruby
|
339
|
+
data_objects.map do |data|
|
340
|
+
extractor = JsonDataExtractor.new(data)
|
341
|
+
extractor.extract(schema)
|
342
|
+
end
|
343
|
+
```
|
344
|
+
|
345
|
+
You can create a single extractor with a pre-processed schema and reuse it:
|
346
|
+
```ruby
|
347
|
+
extractor = JsonDataExtractor.with_schema(schema)
|
348
|
+
data_objects.map do |data|
|
349
|
+
extractor.extract_from(data)
|
350
|
+
end
|
351
|
+
```
|
352
|
+
|
353
|
+
This approach offers significant performance improvements for large datasets by:
|
354
|
+
1. Pre-processing the schema only once
|
355
|
+
2. Pre-compiling JsonPath objects
|
356
|
+
3. Caching schema elements
|
357
|
+
4. Avoiding redundant schema validation
|
358
|
+
|
359
|
+
|
360
|
+
#### Comparison with Nested Schema Approach
|
361
|
+
|
362
|
+
Similar functionality can be achieved with the existing nested schema approach when your data is already in an array:
|
363
|
+
|
364
|
+
```ruby
|
365
|
+
# Process an array of locations with the nested schema approach
|
366
|
+
locations_array = [location1, location2, location3]
|
367
|
+
schema = {
|
368
|
+
all_locations: {
|
369
|
+
path: "[*]",
|
370
|
+
type: "array",
|
371
|
+
schema: { code: ".iataCode", city: ".city", name: ".name" }
|
372
|
+
}
|
373
|
+
}
|
374
|
+
result = JsonDataExtractor.new(locations_array).extract(schema)
|
375
|
+
# Result: { all_locations: [{code: "...", city: "...", name: "..."}, {...}, {...}] }
|
376
|
+
```
|
377
|
+
|
378
|
+
**When to use which approach:**
|
379
|
+
|
380
|
+
1. **Use nested schema when:**
|
381
|
+
- Your data is already structured as an array
|
382
|
+
- You want to preserve the array structure in your result
|
383
|
+
- You need to process the entire array at once
|
384
|
+
|
385
|
+
2. **Use schema reuse when:**
|
386
|
+
- You receive data objects individually (e.g., from multiple API calls)
|
387
|
+
- You need to process each object separately
|
388
|
+
- You want to transform each object independently
|
389
|
+
- You need direct access to individual results without unwrapping them from an array
|
390
|
+
|
391
|
+
The schema reuse approach is specifically optimized for scenarios where you process similar objects multiple times in sequence, rather than all at once in an array.
|
392
|
+
|
393
|
+
|
394
|
+
#### Real-world Example
|
395
|
+
Here's a practical example of extracting location data from multiple sources:
|
396
|
+
```ruby
|
397
|
+
# Location data from an API
|
398
|
+
locations = [
|
399
|
+
{
|
400
|
+
"iataCode" => "JFK",
|
401
|
+
"countryCode" => "US",
|
402
|
+
"city" => "New York",
|
403
|
+
"name" => "John F. Kennedy International Airport"
|
404
|
+
},
|
405
|
+
{
|
406
|
+
"iataCode" => "LHR",
|
407
|
+
"countryCode" => "GB",
|
408
|
+
"city" => "London",
|
409
|
+
"name" => "Heathrow Airport"
|
410
|
+
}
|
411
|
+
]
|
412
|
+
|
413
|
+
# Define schema once
|
414
|
+
schema = {
|
415
|
+
code: "$.iataCode",
|
416
|
+
city: "$.city",
|
417
|
+
name: "$.name",
|
418
|
+
country: "$.countryCode"
|
419
|
+
}
|
420
|
+
|
421
|
+
# Create an extractor with the schema
|
422
|
+
jde = JsonDataExtractor.with_schema(schema)
|
423
|
+
|
424
|
+
# Process each location efficiently
|
425
|
+
processed_locations = locations.map do |data|
|
426
|
+
jde.extract_from(data)
|
427
|
+
end
|
428
|
+
|
429
|
+
# Result:
|
430
|
+
# [
|
431
|
+
# {code: "JFK", city: "New York", name: "John F. Kennedy International Airport", country: "US"},
|
432
|
+
# {code: "LHR", city: "London", name: "Heathrow Airport", country: "GB"}
|
433
|
+
# ]
|
434
|
+
```
|
435
|
+
This pattern is especially beneficial when:
|
436
|
+
- Processing data in batches that arrive separately
|
437
|
+
- Working with large datasets where you need to process one item at a time
|
438
|
+
- Applying the same schema to multiple API responses
|
439
|
+
- Parsing large collections of similar objects that aren't already in an array structure
|
440
|
+
|
328
441
|
## Configuration Options
|
329
442
|
|
330
443
|
The JsonDataExtractor gem provides a configuration option to control the behavior when encountering
|
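Note on the README section added above: the claimed benefit comes from doing the per-schema work once instead of once per data object. The following is a minimal, self-contained sketch of that idea only. It does not use the gem; `TinyPath`, `extract_naive`, and `extract_reusing` are hypothetical names, and `TinyPath` is a stand-in that understands nothing beyond `$.key` paths.

```ruby
# Hypothetical stand-in for a compiled JsonPath object; supports only "$.key".
class TinyPath
  @compiled = 0
  class << self
    attr_accessor :compiled # counts how many paths were "compiled"
  end

  def initialize(expr)
    self.class.compiled += 1
    @key = expr.delete_prefix('$.')
  end

  # Returns matches as an array, like a path lookup would.
  def on(hash)
    hash.key?(@key) ? [hash[@key]] : []
  end
end

# Naive approach: every path is compiled again for every object.
def extract_naive(schema, objects)
  objects.map do |obj|
    schema.to_h { |key, expr| [key, TinyPath.new(expr).on(obj).first] }
  end
end

# Reuse approach: each path is compiled once, then applied to every object.
def extract_reusing(schema, objects)
  compiled = schema.transform_values { |expr| TinyPath.new(expr) }
  objects.map do |obj|
    compiled.to_h { |key, path| [key, path.on(obj).first] }
  end
end

schema  = { code: '$.iataCode', city: '$.city' }
objects = [
  { 'iataCode' => 'JFK', 'city' => 'New York' },
  { 'iataCode' => 'LHR', 'city' => 'London' }
]

TinyPath.compiled = 0
naive = extract_naive(schema, objects)
naive_compilations = TinyPath.compiled   # 2 objects x 2 paths

TinyPath.compiled = 0
reused = extract_reusing(schema, objects)
reused_compilations = TinyPath.compiled  # 2 paths, independent of object count

puts "naive: #{naive_compilations} compilations, reuse: #{reused_compilations}"
```

Both functions return the same hashes; only the number of path compilations differs, and the gap grows with the number of objects. Whether this translates into a noticeable speedup with the real gem depends on how expensive path compilation is relative to the extraction itself.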
data/json_data_extractor.gemspec
CHANGED
@@ -29,7 +29,7 @@ transformations. The schema is defined as a simple Ruby hash that maps keys to p
|
|
29
29
|
spec.add_development_dependency 'amazing_print'
|
30
30
|
spec.add_development_dependency 'bundler'
|
31
31
|
spec.add_development_dependency 'pry'
|
32
|
-
spec.add_development_dependency 'rake', '~>
|
32
|
+
spec.add_development_dependency 'rake', '~> 12.3.3'
|
33
33
|
spec.add_development_dependency 'rspec', '~> 3.0'
|
34
34
|
spec.add_development_dependency 'rubocop'
|
35
35
|
|
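Note on the gemspec change: the new `rake` requirement `'~> 12.3.3'` is a pessimistic version constraint. A short sketch of what it admits, using RubyGems' own `Gem::Requirement` and `Gem::Version` classes; the list of candidate versions is made up for illustration.

```ruby
require 'rubygems'

requirement = Gem::Requirement.new('~> 12.3.3')

# Candidate versions chosen for illustration only.
candidates = %w[12.3.2 12.3.3 12.3.9 12.4.0 13.0.0]

allowed = candidates.to_h do |v|
  [v, requirement.satisfied_by?(Gem::Version.new(v))]
end

allowed.each { |version, ok| puts "#{version}: #{ok}" }
# 12.3.2: false
# 12.3.3: true
# 12.3.9: true
# 12.4.0: false
# 13.0.0: false
```

In other words, `'~> 12.3.3'` accepts patch releases from 12.3.3 up to, but not including, 12.4.0.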
data/lib/json_data_extractor/extractor.rb
CHANGED
@@ -3,7 +3,7 @@
|
|
3
3
|
module JsonDataExtractor
|
4
4
|
# does the main job of the gem
|
5
5
|
class Extractor
|
6
|
-
attr_reader :data, :modifiers
|
6
|
+
attr_reader :data, :modifiers, :schema_cache
|
7
7
|
|
8
8
|
# @param json_data [Hash,String]
|
9
9
|
# @param modifiers [Hash]
|
@@ -14,6 +14,35 @@ module JsonDataExtractor
|
|
14
14
|
@path_cache = {}
|
15
15
|
end
|
16
16
|
|
17
|
+
# Creates a new extractor with a pre-processed schema
|
18
|
+
# @param schema [Hash] schema of the expected data mapping
|
19
|
+
# @param modifiers [Hash] modifiers to apply to the extracted data
|
20
|
+
# @return [Extractor] an extractor initialized with the schema
|
21
|
+
def self.with_schema(schema, modifiers = {})
|
22
|
+
extractor = new({}, modifiers)
|
23
|
+
extractor.instance_variable_set(:@schema_cache, SchemaCache.new(schema))
|
24
|
+
extractor
|
25
|
+
end
|
26
|
+
|
27
|
+
# Extracts data from the provided json_data using the cached schema
|
28
|
+
# @param json_data [Hash,String] the data to extract from
|
29
|
+
# @return [Hash] the extracted data
|
30
|
+
def extract_from(json_data)
|
31
|
+
# Ensure we have a schema cache
|
32
|
+
raise ArgumentError, 'No schema cache available. Use Extractor.with_schema first.' unless @schema_cache
|
33
|
+
|
34
|
+
# Reset results
|
35
|
+
@results = {}
|
36
|
+
|
37
|
+
# Update data
|
38
|
+
@data = json_data.is_a?(Hash) ? Oj.dump(json_data, mode: :compat) : json_data
|
39
|
+
|
40
|
+
# Extract data using cached schema
|
41
|
+
extract_using_cache
|
42
|
+
|
43
|
+
@results
|
44
|
+
end
|
45
|
+
|
17
46
|
# @param modifier_name [String, Symbol]
|
18
47
|
# @param callable [#call, nil] Optional callable object
|
19
48
|
def add_modifier(modifier_name, callable = nil, &block)
|
@@ -58,6 +87,36 @@ module JsonDataExtractor
|
|
58
87
|
|
59
88
|
private
|
60
89
|
|
90
|
+
# Extracts data using the cached schema
|
91
|
+
def extract_using_cache
|
92
|
+
schema_cache.schema.each do |key, _|
|
93
|
+
element = schema_cache.schema_elements[key]
|
94
|
+
path = element.path
|
95
|
+
|
96
|
+
# Use cached JsonPath object
|
97
|
+
json_path = path ? schema_cache.path_cache[path] : nil
|
98
|
+
|
99
|
+
extracted_data = json_path&.on(@data)
|
100
|
+
|
101
|
+
if extracted_data.nil? || extracted_data.empty?
|
102
|
+
# we either got nothing or the `path` was initially nil
|
103
|
+
@results[key] = element.fetch_default_value
|
104
|
+
next
|
105
|
+
end
|
106
|
+
|
107
|
+
# check for nils and apply defaults if applicable
|
108
|
+
extracted_data.map! { |item| item.nil? ? element.fetch_default_value : item }
|
109
|
+
|
110
|
+
# apply modifiers if present
|
111
|
+
extracted_data = apply_modifiers(extracted_data, element.modifiers) if element.modifiers.any?
|
112
|
+
|
113
|
+
# apply maps if present
|
114
|
+
@results[key] = element.maps.any? ? apply_maps(extracted_data, element.maps) : extracted_data
|
115
|
+
|
116
|
+
@results[key] = resolve_result_structure(@results[key], element)
|
117
|
+
end
|
118
|
+
end
|
119
|
+
|
61
120
|
def resolve_result_structure(result, element)
|
62
121
|
if element.nested
|
63
122
|
# Process nested data
|
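Note on `extract_from` as added above: the method guards against a missing schema cache, resets `@results` on every call, and falls back to a default when nothing is extracted. The sketch below reproduces only that control flow in a self-contained form. `MiniExtractor` and its `{ path:, default: }` rule format are hypothetical simplifications, and plain hash lookup stands in for JsonPath evaluation.

```ruby
# Hypothetical miniature of the reuse flow; not the gem's implementation.
class MiniExtractor
  # In this sketch each schema entry is { path: "key", default: value }.
  def self.with_schema(schema)
    new.tap { |extractor| extractor.instance_variable_set(:@schema, schema) }
  end

  def extract_from(data)
    # Guard: fail early when the instance was not built via with_schema.
    raise ArgumentError, 'No schema available. Use with_schema first.' unless @schema

    # A fresh hash per call, so results returned earlier are never mutated.
    @results = {}
    @schema.each do |key, rule|
      value = data[rule[:path]]
      @results[key] = value.nil? ? rule[:default] : value
    end
    @results
  end
end

extractor = MiniExtractor.with_schema(
  {
    code: { path: 'iataCode' },
    city: { path: 'city', default: 'unknown' }
  }
)

first  = extractor.extract_from({ 'iataCode' => 'JFK', 'city' => 'New York' })
second = extractor.extract_from({ 'iataCode' => 'LHR' })

# Calling extract_from on an instance without a schema raises ArgumentError.
error_message =
  begin
    MiniExtractor.new.extract_from({})
  rescue ArgumentError => e
    e.message
  end

puts first.inspect
puts second.inspect
puts error_message
```

Because the method in the diff rewrites `@data` and `@results` on the instance for each call, sharing one extractor between threads would presumably need external synchronization; the diff itself does not address this.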
data/lib/json_data_extractor/schema_cache.rb
ADDED
@@ -0,0 +1,30 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module JsonDataExtractor
|
4
|
+
# Caches schema elements to avoid re-processing the schema for each data extraction
|
5
|
+
class SchemaCache
|
6
|
+
attr_reader :schema, :schema_elements, :path_cache
|
7
|
+
|
8
|
+
def initialize(schema)
|
9
|
+
@schema = schema
|
10
|
+
@schema_elements = {}
|
11
|
+
@path_cache = {}
|
12
|
+
|
13
|
+
# Pre-process the schema to create SchemaElement objects
|
14
|
+
process_schema
|
15
|
+
end
|
16
|
+
|
17
|
+
private
|
18
|
+
|
19
|
+
def process_schema
|
20
|
+
schema.each do |key, val|
|
21
|
+
# Store the SchemaElement for each key in the schema
|
22
|
+
@schema_elements[key] = JsonDataExtractor::SchemaElement.new(val.is_a?(Hash) ? val : { path: val })
|
23
|
+
|
24
|
+
# Pre-compile JsonPath objects for each path
|
25
|
+
path = @schema_elements[key].path
|
26
|
+
@path_cache[path] = JsonPath.new(path) if path
|
27
|
+
end
|
28
|
+
end
|
29
|
+
end
|
30
|
+
end
|
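Note on `SchemaCache` as added above: it normalizes each schema value (a bare string becomes `{ path: string }`) and keeps compiled paths in a hash keyed by the path string. A self-contained sketch of those two steps follows. `Element`, `CompiledPath`, and `MiniSchemaCache` are hypothetical stand-ins for `SchemaElement`, `JsonPath`, and `SchemaCache`, and the sketch uses `||=` where the diff uses plain assignment.

```ruby
# Hypothetical stand-ins: Element mimics a schema element,
# CompiledPath mimics a pre-compiled path object.
Element      = Struct.new(:path, :default)
CompiledPath = Struct.new(:expr)

class MiniSchemaCache
  attr_reader :elements, :path_cache

  def initialize(schema)
    @elements   = {}
    @path_cache = {}

    schema.each do |key, val|
      # A bare string is shorthand for { path: string }.
      spec = val.is_a?(Hash) ? val : { path: val }
      @elements[key] = Element.new(spec[:path], spec[:default])

      # Keyed by path string: two keys with the same path share one object.
      path = @elements[key].path
      @path_cache[path] ||= CompiledPath.new(path) if path
    end
  end
end

cache = MiniSchemaCache.new(
  {
    code: '$.iataCode',             # shorthand form
    iata: { path: '$.iataCode' },   # explicit form, same path
    note: { default: 'n/a' }        # no path, default only
  }
)

puts cache.elements.keys.inspect
puts cache.path_cache.keys.inspect
```

Entries without a path never reach the path cache, which matches the `json_path&.on(@data)` fallback to the default value in `extract_using_cache`.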
data/lib/json_data_extractor.rb
CHANGED
@@ -5,8 +5,9 @@ require 'multi_json'
|
|
5
5
|
require 'oj'
|
6
6
|
require_relative 'json_data_extractor/version'
|
7
7
|
require_relative 'json_data_extractor/configuration'
|
8
|
-
require_relative 'json_data_extractor/extractor'
|
9
8
|
require_relative 'json_data_extractor/schema_element'
|
9
|
+
require_relative 'json_data_extractor/schema_cache'
|
10
|
+
require_relative 'json_data_extractor/extractor'
|
10
11
|
|
11
12
|
# Set MultiJson to use Oj for performance
|
12
13
|
MultiJson.use(:oj)
|
@@ -22,6 +23,14 @@ module JsonDataExtractor
|
|
22
23
|
Extractor.new(*args)
|
23
24
|
end
|
24
25
|
|
26
|
+
# Creates a new extractor with a pre-processed schema
|
27
|
+
# @param schema [Hash] schema of the expected data mapping
|
28
|
+
# @param modifiers [Hash] modifiers to apply to the extracted data
|
29
|
+
# @return [Extractor] an extractor initialized with the schema
|
30
|
+
def with_schema(schema, modifiers = {})
|
31
|
+
Extractor.with_schema(schema, modifiers)
|
32
|
+
end
|
33
|
+
|
25
34
|
def configuration
|
26
35
|
@configuration ||= Configuration.new
|
27
36
|
end
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: json_data_extractor
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.05
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Max Buslaev
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2025-
|
11
|
+
date: 2025-05-13 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: amazing_print
|
@@ -58,14 +58,14 @@ dependencies:
|
|
58
58
|
requirements:
|
59
59
|
- - "~>"
|
60
60
|
- !ruby/object:Gem::Version
|
61
|
-
version:
|
61
|
+
version: 12.3.3
|
62
62
|
type: :development
|
63
63
|
prerelease: false
|
64
64
|
version_requirements: !ruby/object:Gem::Requirement
|
65
65
|
requirements:
|
66
66
|
- - "~>"
|
67
67
|
- !ruby/object:Gem::Version
|
68
|
-
version:
|
68
|
+
version: 12.3.3
|
69
69
|
- !ruby/object:Gem::Dependency
|
70
70
|
name: rspec
|
71
71
|
requirement: !ruby/object:Gem::Requirement
|
@@ -135,6 +135,7 @@ files:
|
|
135
135
|
- ".gitignore"
|
136
136
|
- ".rspec"
|
137
137
|
- ".travis.yml"
|
138
|
+
- CHANGELOG.md
|
138
139
|
- CODE_OF_CONDUCT.md
|
139
140
|
- Gemfile
|
140
141
|
- LICENSE.txt
|
@@ -146,6 +147,7 @@ files:
|
|
146
147
|
- lib/json_data_extractor.rb
|
147
148
|
- lib/json_data_extractor/configuration.rb
|
148
149
|
- lib/json_data_extractor/extractor.rb
|
150
|
+
- lib/json_data_extractor/schema_cache.rb
|
149
151
|
- lib/json_data_extractor/schema_element.rb
|
150
152
|
- lib/json_data_extractor/version.rb
|
151
153
|
homepage: https://github.com/austerlitz/json_data_extractor
|