xml_data_extractor 0.1.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 2ea176d26d7d1e43ca91ab838b75eca58044da1af8a98675dc3c1c1ba453f313
4
+ data.tar.gz: b4a69acdf185a5b830ef267cbb4cd86a08562795f178d5c0f79c4b893501a3b7
5
+ SHA512:
6
+ metadata.gz: 9bd4f5a9ea20e5ce63d92e9d213267eb06097f4a43e851bd65f622404a5f180e877409164ba5716d9e49dfa29ef1bb9b4cf8ff17ae269867ea74d2b040a7b85d
7
+ data.tar.gz: 21e37f2de2a5bc185fc2f6a30ea0367e67f9130f44a2645d53e10e1e7015e7569e47ccc7bb3e98203a12bae853bcff3d5d8ad60ff5ced36b702934a74a54bef2
@@ -0,0 +1,11 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+
10
+ # rspec failure tracking
11
+ .rspec_status
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
@@ -0,0 +1,6 @@
1
+ ---
2
+ language: ruby
3
+ cache: bundler
4
+ rvm:
5
+ - 2.6.6
6
+ before_install: gem install bundler -v 2.1.4
data/Gemfile ADDED
@@ -0,0 +1,7 @@
1
+ source "https://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in xml_data_extractor.gemspec
4
+ gemspec
5
+
6
+ gem "rake", "~> 12.0"
7
+ gem "rspec", "~> 3.0"
@@ -0,0 +1,53 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ xml_data_extractor (0.1.0)
5
+ activesupport
6
+ nokogiri
7
+
8
+ GEM
9
+ remote: https://rubygems.org/
10
+ specs:
11
+ activesupport (6.0.3.2)
12
+ concurrent-ruby (~> 1.0, >= 1.0.2)
13
+ i18n (>= 0.7, < 2)
14
+ minitest (~> 5.1)
15
+ tzinfo (~> 1.1)
16
+ zeitwerk (~> 2.2, >= 2.2.2)
17
+ concurrent-ruby (1.1.7)
18
+ diff-lcs (1.3)
19
+ i18n (1.8.5)
20
+ concurrent-ruby (~> 1.0)
21
+ mini_portile2 (2.4.0)
22
+ minitest (5.14.1)
23
+ nokogiri (1.10.10)
24
+ mini_portile2 (~> 2.4.0)
25
+ rake (12.3.3)
26
+ rspec (3.9.0)
27
+ rspec-core (~> 3.9.0)
28
+ rspec-expectations (~> 3.9.0)
29
+ rspec-mocks (~> 3.9.0)
30
+ rspec-core (3.9.2)
31
+ rspec-support (~> 3.9.3)
32
+ rspec-expectations (3.9.2)
33
+ diff-lcs (>= 1.2.0, < 2.0)
34
+ rspec-support (~> 3.9.0)
35
+ rspec-mocks (3.9.1)
36
+ diff-lcs (>= 1.2.0, < 2.0)
37
+ rspec-support (~> 3.9.0)
38
+ rspec-support (3.9.3)
39
+ thread_safe (0.3.6)
40
+ tzinfo (1.2.7)
41
+ thread_safe (~> 0.1)
42
+ zeitwerk (2.4.0)
43
+
44
+ PLATFORMS
45
+ ruby
46
+
47
+ DEPENDENCIES
48
+ rake (~> 12.0)
49
+ rspec (~> 3.0)
50
+ xml_data_extractor!
51
+
52
+ BUNDLED WITH
53
+ 2.1.4
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2020 Fernando Almeida
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
@@ -0,0 +1,381 @@
1
+ # XmlDataExtractor
2
+
3
+ This gem provides a DSL for extracting formatted data from any XML structure.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'xml_data_extractor'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle install
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install xml_data_extractor
20
+
21
+ ## Usage
22
+
23
+ The general idea is to declare a Ruby Hash that represents the field structure and contains instructions for how each piece of data should be retrieved from the XML document.
24
+
25
+ ```ruby
26
+ structure = { schemas: { character: { path: "xml/FirstName" } } }
27
+ xml = "<xml><FirstName>Gandalf</FirstName></xml>"
28
+
29
+ result = XmlDataExtractor.new(structure).parse(xml)
30
+
31
+ # result -> { character: "Gandalf" }
32
+ ```
33
+
34
+ For convenience, you can write the structure in YAML, which can be easily converted to a Ruby hash using `YAML.load(yml).deep_symbolize_keys`.
35
+
36
+ Considering the following yaml and xml:
37
+
38
+ ```yml
39
+ schemas:
40
+ description:
41
+ path: xml/desc
42
+ modifier: downcase
43
+ amount:
44
+ path: xml/info/price
45
+ modifier: to_f
46
+ ```
47
+ ```xml
48
+ <xml>
49
+ <desc>HELLO WORLD</desc>
50
+ <info>
51
+ <price>123</price>
52
+ </info>
53
+ </xml>
54
+ ```
55
+
56
+ The output is:
57
+ ```ruby
58
+ {
59
+ description: "hello world",
60
+ amount: 123.0
61
+ }
62
+ ```
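If ActiveSupport is not loaded where the configuration is read, the same symbol-keyed hash can be produced with the standard library alone. The sketch below uses `YAML.safe_load` with `symbolize_names: true` in place of `deep_symbolize_keys`; that substitution is an assumption of this example, not something the gem requires.

```ruby
require "yaml"

yml = <<~YAML
  schemas:
    description:
      path: xml/desc
      modifier: downcase
    amount:
      path: xml/info/price
      modifier: to_f
YAML

# symbolize_names: true converts every mapping key to a Symbol,
# matching the symbol keys the extractor reads (e.g. :schemas).
config = YAML.safe_load(yml, symbolize_names: true)

config[:schemas][:amount] # => { path: "xml/info/price", modifier: "to_f" }
```

The resulting `config` can then be passed to `XmlDataExtractor.new` in the same way as a hand-written hash.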
63
+
64
+ ### Defining the structure
65
+
66
+ The structure should be defined as a hash inside the `schemas` key. See the [complete example](https://github.com/monde-sistemas/xml_data_extractor/blob/master/spec/complete_example_spec.rb#L5).
67
+
68
+ When defining the structure you can combine any available command in order to extract and format the data as needed.
69
+
70
+ The available commands fall into two general purposes:
71
+
72
+ - [Navigation & Extraction](#navigation--extraction)
73
+ - [Formatting](#formatting)
74
+
75
+ ### Navigation & Extraction:
76
+
77
+ The data extraction process is based on `Xpath` using Nokogiri.
78
+ * [Xpath introduction](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples)
79
+ * [Xpath cheatsheet](https://devhints.io/xpath)
80
+
81
+ #### path
82
+
83
+ Defines the `xpath` of the element.
84
+ The `path` is the default command of a field definition, so this:
85
+ ```yml
86
+ schemas:
87
+ description:
88
+ path: xml/desc
89
+ ```
90
+ Is equivalent to this:
91
+ ```yml
92
+ schemas:
93
+ description: xml/desc
94
+ ```
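A rough sketch of why the two forms behave the same: judging from `Extract::ValueBuilder#value_for_string` in this package, a bare string is wrapped into a hash with a `path` key before extraction. `normalize_field` below is a hypothetical helper written for illustration, not part of the gem.

```ruby
# Hypothetical helper: expands the shorthand form into the explicit one.
def normalize_field(props)
  props.is_a?(String) ? { path: props } : props
end

normalize_field("xml/desc")           # => { path: "xml/desc" }
normalize_field({ path: "xml/desc" }) # => { path: "xml/desc" }
```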
95
+
96
+ It can be defined as a string:
97
+ ```yml
98
+ schemas:
99
+ description:
100
+ path: xml/some_field
101
+ ```
102
+ ```xml
103
+ <xml>
104
+ <some_field>ABC</some_field>
105
+ </xml>
106
+ ```
107
+ ```ruby
108
+ { description: "ABC" }
109
+ ```
110
+
111
+ Or as a string array:
112
+ ```yml
113
+ schemas:
114
+ address:
115
+ path: [street, info/city]
116
+ ```
117
+ ```xml
118
+ <xml>
119
+ <street>Diagon Alley</street>
120
+ <info>
121
+ <city>London</city>
122
+ </info>
123
+ </xml>
124
+ ```
125
+ ```ruby
126
+ { address: ["Diagon Alley", "London"] }
127
+ ```
128
+
129
+ And even as a hash array, for complex operations:
130
+ ```yml
131
+ schemas:
132
+ address:
133
+ path:
134
+ - path: street
135
+ modifier: downcase
136
+ - path: info/city
137
+ modifier: upcase
138
+ ```
139
+ ```ruby
140
+ { address: ["diagon alley", "LONDON"] }
141
+ ```
142
+
143
+ #### attr
144
+
145
+ Defines a tag attribute from which the value should be extracted, instead of the tag's own text:
146
+ ```yml
147
+ schemas:
148
+ description:
149
+ path: xml/info
150
+ attr: desc
151
+ ```
152
+ ```xml
153
+ <xml>
154
+ <info desc="ABC">some stuff</info>
155
+ </xml>
156
+ ```
157
+ ```ruby
158
+ { description: "ABC" }
159
+ ```
160
+
161
+ Like the path, it can also be defined as a string array.
162
+
163
+ #### within
164
+
165
+ To define a root path for the fields:
166
+ ```yml
167
+ schemas:
168
+ movie:
169
+ within: info/movie_data
170
+ title: original_title
171
+ actor: main_actor
172
+
173
+ ```
174
+ ```xml
175
+ <xml>
176
+ <info>
177
+ <movie_data>
178
+ <original_title>The Irishman</original_title>
179
+ <main_actor>Robert De Niro</main_actor>
180
+ </movie_data>
181
+ </info>
182
+ </xml>
183
+ ```
184
+ ```ruby
185
+ { movie: { title: "The Irishman", actor: "Robert De Niro" } }
186
+ ```
187
+
188
+ #### array_of
189
+
190
+ Defines the path to an XML collection, which is iterated to generate an array of hashes:
191
+ ```yml
192
+ schemas:
193
+ people:
194
+ array_of: characters/character
195
+ name: firstname
196
+ age: age
197
+ ```
198
+ ```xml
199
+ <xml>
200
+ <characters>
201
+ <character>
202
+ <firstname>Geralt</firstname>
203
+ <age>97</age>
204
+ </character>
205
+ <character>
206
+ <firstname>Yennefer</firstname>
207
+ <age>102</age>
208
+ </character>
209
+ </characters>
210
+ </xml>
211
+ ```
212
+ ```ruby
213
+ {
214
+ people: [
215
+ { name: "Geralt", age: "97" },
216
+ { name: "Yennefer", age: "102" }
217
+ ]
218
+ }
219
+ ```
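The looping can be pictured as expanding the collection path into one indexed XPath per matching element and resolving each field relative to it. The sketch below mirrors that idea in plain Ruby, loosely based on `PathBuilder#build` and `Extractor#paths_of` from this package; `build_xpath` and `indexed_paths` are hypothetical names, and the element count is passed in by hand rather than obtained from Nokogiri.

```ruby
# Joins a base path and a tag into an absolute XPath ("//a/b").
def build_xpath(base, tag)
  relative = base.start_with?("//") ? base[2..-1] : base
  "//" + [*relative.split("/"), tag].compact.join("/")
end

# One XPath per element of the collection; XPath positions are 1-based.
def indexed_paths(path, count)
  (1..count).map { |position| "#{path}[#{position}]" }
end

collection = build_xpath("", "characters/character")
items = indexed_paths(collection, 2)
# => ["//characters/character[1]", "//characters/character[2]"]

name_paths = items.map { |item| build_xpath(item, "firstname") }
# => ["//characters/character[1]/firstname", "//characters/character[2]/firstname"]
```

Each entry of `name_paths` addresses the `firstname` of exactly one `character`, which is how every hash in the resulting array ends up with its own values.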
220
+
221
+ If you need to loop through nested collections, you can define an array of paths:
222
+ ```yml
223
+ schemas:
224
+ show:
225
+ within: show_data
226
+ title: description
227
+ people:
228
+ array_of: [characters/character, info]
229
+ name: name
230
+ ```
231
+ ```xml
232
+ <xml>
233
+ <show_data>
234
+ <description>Peaky Blinders</description>
235
+ <characters>
236
+ <character>
237
+ <info>
238
+ <name>Tommy Shelby</name>
239
+ </info>
240
+ </character>
241
+ <character>
242
+ <info>
243
+ <name>Arthur Shelby</name>
244
+ </info>
245
+ <info>
246
+ <name>Alfie Solomons</name>
247
+ </info>
248
+ </character>
249
+ </characters>
250
+ </show_data>
251
+ </xml>
252
+ ```
253
+ ```ruby
254
+ {
255
+ show: {
256
+ title: "Peaky Blinders",
257
+ people: [
258
+ { name: "Tommy Shelby" },
259
+ { name: "Arthur Shelby" },
260
+ { name: "Alfie Solomons" }
261
+ ]
262
+ }
263
+ }
264
+ ```
265
+
266
+ ### Formatting:
267
+
268
+ #### fixed
269
+
270
+ Defines a fixed value for the field:
271
+ ```yml
272
+ currency:
273
+ fixed: BRL
274
+ ```
275
+ ```ruby
276
+ { currency: "BRL" }
277
+ ```
278
+
279
+ #### mapper
280
+
281
+ Uses a hash of predefined values to replace the extracted value with its respective option.
282
+ If the extracted value is not found in any of the mapper options, it is replaced by the `default` value; if no default is defined, the extracted value is returned unchanged.
283
+ ```yml
284
+ mappers:
285
+ currencies:
286
+ default: unknown
287
+ options:
288
+ BRL: R$
289
+ USD: [US$, $]
290
+ schemas:
291
+ money:
292
+ array_of: curr_type/type
293
+ path: symbol
294
+ mapper: currencies
295
+ ```
296
+ ```xml
297
+ <xml>
298
+ <curr_type>
299
+ <type>
300
+ <symbol>US$</symbol>
301
+ </type>
302
+ <type>
303
+ <symbol>R$</symbol>
304
+ </type>
305
+ <type>
306
+ <symbol>RB</symbol>
307
+ </type>
308
+ <type>
309
+ <symbol>$</symbol>
310
+ </type>
311
+ </curr_type>
312
+ </xml>
313
+ ```
314
+ ```ruby
315
+ {
316
+ money: ["USD", "BRL", "unknown", "USD"]
317
+ }
318
+ ```
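The lookup rule can be sketched in a few lines of plain Ruby. This mirrors `Format::Mapper#mapper_value` from this package as closely as a standalone example allows; `map_value` is a hypothetical name, and the hash below is the `currencies` mapper after its keys have been symbolized.

```ruby
CURRENCIES = {
  default: "unknown",
  options: { BRL: "R$", USD: ["US$", "$"] }
}.freeze

# Returns the option whose value list contains the extracted value,
# otherwise the default, otherwise the value itself.
def map_value(mapper, value)
  mapper.fetch(:options, {}).each do |option, values|
    return option.to_s if Array(values).include?(value.to_s)
  end
  mapper[:default] || value
end

%w[US$ R$ RB $].map { |symbol| map_value(CURRENCIES, symbol) }
# => ["USD", "BRL", "unknown", "USD"]
```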
319
+
320
+ #### modifier
321
+
322
+ Defines a method to be called on the returned value.
323
+ ```yml
324
+ schemas:
325
+ name:
326
+ path: some_field
327
+ modifier: upcase
328
+ ```
329
+ ```xml
330
+ <xml>
331
+ <some_field>Lewandovski</some_field>
332
+ </xml>
333
+ ```
334
+ ```ruby
335
+ { name: "LEWANDOVSKI" }
336
+ ```
337
+
338
+ You can also pass parameters to the method. In this case, declare the modifier as an array whose entries are either a plain method name or a hash with the `name` and `params` keys; the entries are applied in order:
339
+ ```yml
340
+ schemas:
341
+ name:
342
+ path: [firstname, lastname]
343
+ modifier:
344
+ - name: join
345
+ params: [" "]
346
+ - downcase
347
+ ```
348
+ ```xml
349
+ <xml>
350
+ <firstname>Robert</firstname>
351
+ <lastname>Martin</lastname>
352
+ </xml>
353
+ ```
354
+ ```ruby
355
+ { name: "robert martin" }
356
+ ```
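The chain can be read as a left-to-right reduction: each entry receives the result of the previous one. The sketch below reproduces that behaviour with `public_send` on plain Ruby objects. It is modelled on `Format::Modifier#apply` in this package, but `apply_modifiers` is a hypothetical, simplified stand-in that leaves out the gem's custom-helper fallback and error handling.

```ruby
# Each modifier is either a method name or a hash with :name and :params.
def apply_modifiers(value, modifiers)
  [modifiers].flatten.compact.reduce(value) do |current, modifier|
    modifier = { name: modifier } if modifier.is_a?(String)
    current.public_send(modifier[:name], *Array(modifier[:params]))
  end
end

modifiers = [{ name: "join", params: [" "] }, "downcase"]
apply_modifiers(%w[Robert Martin], modifiers) # => "robert martin"
apply_modifiers("Lewandovski", "upcase")      # => "LEWANDOVSKI"
```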
357
+
358
+ If you need to use custom methods, you can pass an object that defines them when initializing the extractor. The custom method receives the extracted value as its parameter:
359
+ ```yml
360
+ schemas:
361
+ price:
362
+ path: final_price
363
+ modifier: format_as_float
364
+ ```
365
+ ```xml
366
+ <xml>
367
+ <final_price>R$ 12.99</final_price>
368
+ </xml>
369
+ ```
370
+ ```ruby
371
+ class MyMethods
372
+ def format_as_float(value)
373
+ value.gsub(/[^\d.]/, "").to_f
374
+ end
375
+ end
376
+
377
+ XmlDataExtractor.new(yml, MyMethods.new).parse(xml)
378
+ ```
379
+ ```ruby
380
+ { price: 12.99 }
381
+ ```
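How the custom method is reached can be sketched as follows: if the extracted value does not respond to the modifier name, the call is forwarded to the helper object with the value as its first argument. This is a simplified reading of `Format::Modifier#modify_value` (which relies on ActiveSupport's `try`); `call_modifier` is a hypothetical name used only for this illustration.

```ruby
class MyMethods
  def format_as_float(value)
    value.gsub(/[^\d.]/, "").to_f
  end
end

# Prefer a method on the value itself; otherwise delegate to the helper.
def call_modifier(value, name, helper)
  return value.public_send(name) if value.respond_to?(name)

  helper.public_send(name, value)
end

call_modifier("R$ 12.99", "format_as_float", MyMethods.new) # => 12.99
call_modifier("R$ 12.99", "downcase", MyMethods.new)        # => "r$ 12.99"
```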
@@ -0,0 +1,6 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "xml_data_extractor"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start(__FILE__)
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,58 @@
1
+ module Extract
2
+ class ArrayOf < Base
3
+ def initialize(node, extractor, index = 0)
4
+ super(node, extractor)
5
+ @index = index
6
+ end
7
+
8
+ def value
9
+ process_paths.flatten.compact
10
+ end
11
+
12
+ private
13
+
14
+ attr_reader :index
15
+
16
+ def array_items
17
+ arr_path, link_path, uniq_by = node.array_of_paths
18
+
19
+ paths = extractor.paths_of(node.path, arr_path, link_path)
20
+ paths = uniq_paths(paths, uniq_by) if uniq_by
21
+
22
+ paths.each_with_index.map do |path, idx|
23
+ HashBuilder.new(Node.new(node.props, path), extractor).value(index + idx)
24
+ end.compact
25
+ end
26
+
27
+ def process_paths
28
+ paths = paths_from_props
29
+
30
+ if paths.size > 1
31
+ process_path(paths.shift, paths)
32
+ else
33
+ node.props[:array_of] = paths.first
34
+ array_items
35
+ end
36
+ end
37
+
38
+ def process_path(path, inner_paths)
39
+ path = build_path(path) if path.is_a?(Hash)
40
+
41
+ extractor.paths_of(node.path, path).each_with_index.map do |some, idx|
42
+ ArrayOf.new(Node.new(node.props.merge(array_of: inner_paths), some), extractor, index + idx).value
43
+ end
44
+ end
45
+
46
+ def uniq_paths(paths, uniq_by)
47
+ extractor.uniq_paths(paths, uniq_by)
48
+ end
49
+
50
+ def build_path(hash)
51
+ extractor.replace_link(hash[:path], [node.path, hash[:link]].join("/"))
52
+ end
53
+
54
+ def paths_from_props
55
+ [node.props[:array_of]].flatten
56
+ end
57
+ end
58
+ end
@@ -0,0 +1,10 @@
1
+ module Extract
2
+ class ArrayValue < Base
3
+ def value
4
+ props, path = node.to_h.values_at(:props, :path)
5
+ props.map do |prop|
6
+ ValueBuilder.new(Node.new(prop, path), extractor).value
7
+ end.flatten
8
+ end
9
+ end
10
+ end
@@ -0,0 +1,12 @@
1
+ module Extract
2
+ class Base
3
+ def initialize(node, extractor)
4
+ @node = node
5
+ @extractor = extractor
6
+ end
7
+
8
+ private
9
+
10
+ attr_reader :node, :extractor
11
+ end
12
+ end
@@ -0,0 +1,20 @@
1
+ module Extract
2
+ class Expression
3
+ def initialize(expression, hash)
4
+ @expression = expression
5
+ @hash = hash
6
+ end
7
+
8
+ def evaluate
9
+ field_name = expression.split.first.parameterize
10
+ field_value = hash[field_name.to_sym]
11
+ condition = expression.gsub(field_name, field_value.to_s)
12
+
13
+ eval(condition)
14
+ end
15
+
16
+ private
17
+
18
+ attr_reader :expression, :hash
19
+ end
20
+ end
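`Extract::Expression#evaluate` substitutes the field value into the `keep_if` expression and passes the result to `eval`, so the expression runs as arbitrary Ruby. Where configurations may come from less trusted sources, a restricted evaluator is one option. The sketch below is a hypothetical alternative, not the gem's behaviour: it assumes expressions of the form `field operator literal`, supports only comparison operators, and needs Ruby 2.6+ for `Float(..., exception: false)`.

```ruby
OPERATORS = %w[== != >= <= > <].freeze

# Converts numeric-looking strings to Float and strips quotes from the rest.
def coerce(token)
  text = token.to_s
  Float(text, exception: false) || text.delete(%q('"))
end

# Evaluates "<field> <operator> <literal>" against the extracted hash.
def keep?(expression, hash)
  field, operator, literal = expression.split(" ", 3)
  raise ArgumentError, "unsupported operator: #{operator}" unless OPERATORS.include?(operator)

  coerce(hash[field.to_sym]).public_send(operator, coerce(literal))
end

keep?("age > 18", { age: "20" })          # => true
keep?("age > 18", { age: "7" })           # => false
keep?("kind == 'hero'", { kind: "hero" }) # => true
```

Restricting the operator to a fixed list means nothing from the configuration is ever executed as code, at the cost of not supporting compound conditions.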
@@ -0,0 +1,33 @@
1
+ module Extract
2
+ class HashBuilder < Base
3
+ INTERNAL_FIELDS = %i[array_of keep_if within].freeze
4
+
5
+ def value(index = 0)
6
+ path, props = node.to_h.values_at(:path, :props)
7
+
8
+ hash = {}
9
+ props.each do |field_name, nested_props|
10
+ next unless valuable_field? field_name, nested_props, index
11
+
12
+ value = ValueBuilder.new(Node.new(nested_props, path), extractor).value
13
+ hash[field_name.to_sym] = value if value.present?
14
+ end
15
+
16
+ keep_hash?(hash, props) ? hash : nil
17
+ end
18
+
19
+ private
20
+
21
+ def keep_hash?(hash, props)
22
+ expression = props[:keep_if]
23
+ expression.present? ? Expression.new(expression, hash).evaluate : true
24
+ end
25
+
26
+ def valuable_field?(field_name, props, index)
27
+ return false if INTERNAL_FIELDS.include? field_name
28
+ return false if index.positive? && Node.new(props, "").first_only?
29
+
30
+ true
31
+ end
32
+ end
33
+ end
@@ -0,0 +1,32 @@
1
+ module Extract
2
+ class StringValue < Base
3
+ def value
4
+ path = node[:props][:path]
5
+ return formatted_array_values(path) if path.is_a?(Array)
6
+
7
+ extract_value(node)
8
+ end
9
+
10
+ private
11
+
12
+ def extract_value(node_to_extract)
13
+ extractor.extract(node_to_extract)
14
+ end
15
+
16
+ def formatted_array_values(paths)
17
+ extractor.format_value(values_from_array(paths), node[:props])
18
+ end
19
+
20
+ def values_from_array(paths)
21
+ node_path = node.path
22
+
23
+ paths.map do |inner|
24
+ if inner.is_a?(String)
25
+ extract_value(Node.new({ path: inner }, node_path))
26
+ else
27
+ StringValue.new(Node.new(inner, node_path), extractor).value
28
+ end
29
+ end
30
+ end
31
+ end
32
+ end
@@ -0,0 +1,44 @@
1
+ require_relative "base"
2
+ require_relative "array_value"
3
+ require_relative "array_of"
4
+ require_relative "hash_builder"
5
+ require_relative "string_value"
6
+ require_relative "value_builder"
7
+ require_relative "within"
8
+ require_relative "expression"
9
+
10
+ module Extract
11
+ class ValueBuilder < Base
12
+ def value
13
+ props = node.props
14
+ case props
15
+ when String then value_for_string
16
+ when Array then value_for_array
17
+ when Hash then value_for_hash
18
+ else
19
+ raise "Invalid kind #{props.class} (#{props})"
20
+ end
21
+ end
22
+
23
+ private
24
+
25
+ def value_for_hash
26
+ props = node.props
27
+ fixed_value = props[:fixed]
28
+ return fixed_value if fixed_value
29
+ return ArrayOf.new(node, extractor).value if props[:array_of]
30
+ return Within.new(node, extractor).value if props[:within]
31
+ return StringValue.new(node, extractor).value if (props.keys & %i[path attr]).any?
32
+
33
+ HashBuilder.new(node, extractor).value
34
+ end
35
+
36
+ def value_for_string
37
+ StringValue.new(Node.new({ path: node.props }, node.path), extractor).value
38
+ end
39
+
40
+ def value_for_array
41
+ ArrayValue.new(node, extractor).value
42
+ end
43
+ end
44
+ end
@@ -0,0 +1,11 @@
1
+ module Extract
2
+ class Within < Base
3
+ def value
4
+ props = node.props
5
+ paths = extractor.paths_of(node.path, props[:within])
6
+ return "" if paths.empty?
7
+
8
+ HashBuilder.new(Node.new(props, paths.first), extractor).value
9
+ end
10
+ end
11
+ end
@@ -0,0 +1,236 @@
1
+ require "cgi"
2
+ require "active_support/core_ext/string"
3
+ require_relative "format/formatter"
4
+
5
+ class PathBuilder < Struct.new(:base, :parent, :tag, keyword_init: true)
6
+ def build
7
+ paths = relative_path.split("/").then do |paths|
8
+ if parent.present?
9
+ navigate_to_parent(parent, paths)
10
+ else
11
+ paths
12
+ end
13
+ end
14
+
15
+ paths << tag unless tag.is_a? Array
16
+ full_path = paths.flatten.compact.join("/")
17
+ "//#{full_path}"
18
+ end
19
+
20
+ private
21
+
22
+ def relative_path
23
+ base.start_with?("//") ? base[2..-1] : base
24
+ end
25
+
26
+ def navigate_to_parent(parent_tag, paths)
27
+ index = path_index(parent_tag, paths)
28
+
29
+ paths[0, index + 1]
30
+ end
31
+
32
+ def path_index(tag, paths)
33
+ paths.each_with_index do |path, index|
34
+ return index if matching_tags?(path, tag)
35
+ end
36
+ 0
37
+ end
38
+
39
+ def matching_tags?(item, tag)
40
+ item.gsub(/\[\d+\]/, "") == tag
41
+ end
42
+ end
43
+
44
+ class NodeParamsExtractor < Struct.new(:node)
45
+ def extract
46
+ [node.path, *node.props.values_at(:in_parent, :path, :link, :attr)]
47
+ end
48
+ end
49
+
50
+ class NodeExtractor
51
+ def initialize(xml)
52
+ @xml = Nokogiri::XML(remove_special_elements(xml), nil, Encoding::UTF_8.to_s)
53
+ @xml.remove_namespaces!
54
+ end
55
+
56
+ def extract(path)
57
+ xml.xpath(path)
58
+ rescue StandardError
59
+ nil
60
+ end
61
+
62
+ private
63
+
64
+ def remove_special_elements(xml)
65
+ CGI.unescapeHTML(xml).gsub(/<br>|&nbsp;/, { "<br>" => "", "&nbsp;" => " " })
66
+ end
67
+
68
+ attr_reader :xml
69
+ end
70
+
71
+ class NodeValueExtractor
72
+ def initialize(node_extractor)
73
+ @node_extractor = node_extractor
74
+ end
75
+
76
+ def attr_values(path, attributes)
77
+ return attributes.map { |atr| attr_value(path, atr) } if attributes.is_a? Array
78
+ return tag_count(path) if attributes == :tag_count
79
+
80
+ attr_value(path, attributes)
81
+ end
82
+
83
+ def tag_count(path)
84
+ node_extractor.extract(path).size
85
+ end
86
+
87
+ def tag_values(base_path, paths)
88
+ return tag_value(base_path) unless paths.is_a? Array
89
+
90
+ paths.map { |path| tag_value([base_path, path].flatten.compact.join("/")) }
91
+ end
92
+
93
+ private
94
+
95
+ attr_reader :node_extractor
96
+
97
+ def tag_value(path)
98
+ node_raw_value node_extractor.extract(path)
99
+ end
100
+
101
+ def attr_value(path, att)
102
+ node_raw_value node_extractor.extract(path).attribute(att)
103
+ end
104
+
105
+ def node_raw_value(node)
106
+ NodeValue.new(node).raw_value
107
+ end
108
+ end
109
+
110
+ class NodeValue
111
+ def initialize(node)
112
+ @node = node
113
+ end
114
+
115
+ def raw_value
116
+ return "" unless node
117
+
118
+ node_size = node.try(:size).to_i
119
+ return node.map(&:text) if node_size > 1
120
+ return node.first if node_size == 1 && contains_children?
121
+
122
+ node.text
123
+ end
124
+
125
+ private
126
+
127
+ attr_reader :node
128
+
129
+ def contains_children?
130
+ node.first.try(:children).any? { |child| child.is_a? Nokogiri::XML::Element }
131
+ end
132
+ end
133
+
134
+ class PathManipulator
135
+ def initialize(node_value_extractor)
136
+ @node_value_extractor = node_value_extractor
137
+ end
138
+
139
+ def replace_link(original_path, link_path)
140
+ return original_path if link_path.blank?
141
+
142
+ link_value = node_value_extractor.tag_values(link_path, nil)
143
+
144
+ original_path.gsub "<link>", link_value
145
+ end
146
+
147
+ def uniq_paths(paths, uniq_by_path)
148
+ paths
149
+ .map { |path| { path: path, value: tag_value(path, uniq_by_path) } }
150
+ .then { |paths_values| remove_duplicated_paths(paths_values) }
151
+ .map { |path_value| path_value[:path] }
152
+ end
153
+
154
+ private
155
+
156
+ attr_reader :node_value_extractor
157
+
158
+ def tag_value(path, uniq_by_path)
159
+ node_value_extractor.tag_values([path, uniq_by_path].join("/"), "")
160
+ end
161
+
162
+ def remove_duplicated_paths(paths_values)
163
+ paths_values.delete_if.with_index do |path_value, index|
164
+ index != first_path_value_index(paths_values, path_value)
165
+ end
166
+ end
167
+
168
+ def first_path_value_index(paths_values, current_path)
169
+ paths_values.find_index { |path_value| path_value[:value] == current_path[:value] }
170
+ end
171
+ end
172
+
173
+ class Extractor
174
+ def initialize(xml, yml, modifiers)
175
+ @node_extractor = NodeExtractor.new(xml)
176
+ @node_value_extractor = NodeValueExtractor.new(node_extractor)
177
+ @path_manipulator = PathManipulator.new(node_value_extractor)
178
+ @formatter = Format::Formatter.new(yml, modifiers)
179
+ end
180
+
181
+ def extract(node)
182
+ base, parent, tag, link, attribute = NodeParamsExtractor.new(node).extract
183
+ path = PathBuilder.new(base: base, parent: parent, tag: tag).build
184
+
185
+ if link.present?
186
+ link_path = PathBuilder.new(base: base, parent: parent, tag: link).build
187
+
188
+ if tag.is_a? Array
189
+ tag = tag.map { |tag_path| replace_link(tag_path, link_path) }
190
+ else
191
+ path = replace_link(path, link_path)
192
+ end
193
+ end
194
+
195
+ value = path_value(path, tag, attribute)
196
+ format_value(value, node.props)
197
+ end
198
+
199
+ def format_value(value, props)
200
+ formatter.format_value(value, props)
201
+ end
202
+
203
+ def replace_link(original_path, link_path)
204
+ path_manipulator.replace_link(original_path, link_path)
205
+ end
206
+
207
+ def paths_of(base_path, tag_path, link_path = nil)
208
+ path = PathBuilder.new(base: base_path, tag: tag_path).build
209
+
210
+ if link_path.present?
211
+ link_path = PathBuilder.new(base: base_path, tag: link_path).build
212
+ path = replace_link(path, link_path)
213
+ end
214
+
215
+ node = node_extractor.extract(path)
216
+ (node || []).size.times.map do |index|
217
+ "#{path}[#{index + 1}]"
218
+ end
219
+ end
220
+
221
+ def uniq_paths(paths, uniq_by_path)
222
+ return paths if uniq_by_path.blank?
223
+
224
+ path_manipulator.uniq_paths(paths, uniq_by_path)
225
+ end
226
+
227
+ private
228
+
229
+ attr_reader :node_extractor, :node_value_extractor, :path_manipulator, :formatter
230
+
231
+ def path_value(path, tag, attribute)
232
+ return node_value_extractor.attr_values(path, attribute) if attribute.present?
233
+
234
+ node_value_extractor.tag_values(path, tag)
235
+ end
236
+ end
@@ -0,0 +1,28 @@
1
+ require_relative "mapper"
2
+ require_relative "modifier"
3
+
4
+ module Format
5
+ class Formatter
6
+ def initialize(yml, modifiers)
7
+ @mapper = Format::Mapper.new(yml)
8
+ @modifier = Format::Modifier.new(yml, modifiers)
9
+ end
10
+
11
+ def format_value(value, props)
12
+ modifier_prop, mapper_prop = props.values_at(:modifier, :mapper)
13
+
14
+ value
15
+ .then { |it| modifier.apply(it, modifier_prop) }
16
+ .then { |it| nullify_empty_value(it) }
17
+ .then { |it| mapper.apply(it, mapper_prop) }
18
+ end
19
+
20
+ private
21
+
22
+ attr_reader :modifier, :mapper
23
+
24
+ def nullify_empty_value(value)
25
+ value.blank? || value.try(:zero?) ? nil : value
26
+ end
27
+ end
28
+ end
@@ -0,0 +1,28 @@
1
+ module Format
2
+ class Mapper
3
+ def initialize(yml)
4
+ @mappers = yml.fetch(:mappers, {})
5
+ end
6
+
7
+ def apply(raw_value, mapper_name)
8
+ return raw_value unless mapper_name
9
+
10
+ mappers.each do |name, fields|
11
+ return mapper_value(fields, raw_value) if mapper_name.to_sym == name
12
+ end
13
+
14
+ raise "Mapper not found #{mapper_name}"
15
+ end
16
+
17
+ private
18
+
19
+ attr_reader :mappers
20
+
21
+ def mapper_value(fields, value)
22
+ (fields[:options] || []).each do |option, values|
23
+ return option.to_s if [values].flatten.include?(value.to_s)
24
+ end
25
+ fields[:default] || value
26
+ end
27
+ end
28
+ end
@@ -0,0 +1,37 @@
1
+ module Format
2
+ class Modifier
3
+ def initialize(yml, helper)
4
+ @debug = yml.fetch(:debug, false)
5
+ @helper = helper
6
+ end
7
+
8
+ def apply(raw_value, modifiers)
9
+ [modifiers].flatten.compact.reduce(raw_value) do |value, modifier|
10
+ method_name, params = modifier_props(modifier).values_at(:name, :params)
11
+
12
+ modify_value(value, method_name, params)
13
+ end
14
+ end
15
+
16
+ private
17
+
18
+ attr_reader :helper, :debug
19
+
20
+ def modifier_props(modifier)
21
+ modifier.is_a?(String) ? { name: modifier } : modifier
22
+ end
23
+
24
+ def modify_value(value, method, params)
25
+ args = [value]
26
+ if params.present?
27
+ args = params.is_a?(Array) ? [value, *params] : [value, **params]
28
+ end
29
+
30
+ value.try(method, *params) || helper.send(method, *args)
31
+ rescue StandardError => error
32
+ raise error unless debug
33
+
34
+ "Error invoking '#{method}' with (#{args.join(',')}): #{error}"
35
+ end
36
+ end
37
+ end
@@ -0,0 +1,26 @@
1
+ class Node < Struct.new(:props, :path)
2
+ def initialize(*)
3
+ super
4
+ self.path ||= ""
5
+ end
6
+
7
+ def first_only?
8
+ return unless props.is_a? Hash
9
+
10
+ props[:array_presence] == "first_only"
11
+ end
12
+
13
+ def array_of_paths
14
+ array_paths(props[:array_of])
15
+ end
16
+
17
+ private
18
+
19
+ def array_paths(array_props)
20
+ if array_props.is_a?(Hash)
21
+ array_props.values_at(:path, :link, :uniq_by)
22
+ else
23
+ [array_props].flatten
24
+ end
25
+ end
26
+ end
@@ -0,0 +1,27 @@
1
+ require "nokogiri"
2
+ require_relative "src/extractor"
3
+ require_relative "src/node"
4
+ require_relative "src/extract/value_builder"
5
+
6
+ class XmlDataExtractor
7
+ def initialize(config, modifiers = nil)
8
+ @config = config
9
+ @modifiers = modifiers
10
+ end
11
+
12
+ def parse(xml)
13
+ extractor = Extractor.new(xml, config, modifiers)
14
+ schemas = config.fetch(:schemas, {})
15
+
16
+ {}.tap do |hash|
17
+ schemas.map do |key, val|
18
+ value = Extract::ValueBuilder.new(Node.new(val), extractor).value
19
+ hash[key] = value if value.present?
20
+ end
21
+ end
22
+ end
23
+
24
+ private
25
+
26
+ attr_reader :config, :modifiers
27
+ end
@@ -0,0 +1,28 @@
1
+ Gem::Specification.new do |spec|
2
+ spec.name = "xml_data_extractor"
3
+ spec.version = "0.1.0"
4
+ spec.authors = ["Fernando Almeida"]
5
+ spec.email = ["fernandoprsbr@gmail.com"]
6
+
7
+ spec.summary = "Provides a simple DSL for extracting data from XML documents"
8
+ spec.homepage = "https://github.com/monde-sistemas/xml_data_extractor"
9
+ spec.license = "MIT"
10
+ spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0")
11
+
12
+ spec.metadata["homepage_uri"] = spec.homepage
13
+ spec.metadata["source_code_uri"] = spec.homepage
14
+ spec.metadata["changelog_uri"] = spec.homepage
15
+
16
+ # Specify which files should be added to the gem when it is released.
17
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
18
+ spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do
19
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
20
+ end
21
+ spec.bindir = "exe"
22
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
23
+ spec.require_paths = ["lib"]
24
+
25
+ spec.add_dependency "nokogiri"
26
+ spec.add_dependency "activesupport"
27
+ spec.add_development_dependency "rspec"
28
+ end
metadata ADDED
@@ -0,0 +1,113 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: xml_data_extractor
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Fernando Almeida
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2020-08-24 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: nokogiri
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: activesupport
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rspec
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ description:
56
+ email:
57
+ - fernandoprsbr@gmail.com
58
+ executables: []
59
+ extensions: []
60
+ extra_rdoc_files: []
61
+ files:
62
+ - ".gitignore"
63
+ - ".rspec"
64
+ - ".travis.yml"
65
+ - Gemfile
66
+ - Gemfile.lock
67
+ - LICENSE.txt
68
+ - README.md
69
+ - Rakefile
70
+ - bin/console
71
+ - bin/setup
72
+ - lib/src/extract/array_of.rb
73
+ - lib/src/extract/array_value.rb
74
+ - lib/src/extract/base.rb
75
+ - lib/src/extract/expression.rb
76
+ - lib/src/extract/hash_builder.rb
77
+ - lib/src/extract/string_value.rb
78
+ - lib/src/extract/value_builder.rb
79
+ - lib/src/extract/within.rb
80
+ - lib/src/extractor.rb
81
+ - lib/src/format/formatter.rb
82
+ - lib/src/format/mapper.rb
83
+ - lib/src/format/modifier.rb
84
+ - lib/src/node.rb
85
+ - lib/xml_data_extractor.rb
86
+ - xml_data_extractor.gemspec
87
+ homepage: https://github.com/monde-sistemas/xml_data_extractor
88
+ licenses:
89
+ - MIT
90
+ metadata:
91
+ homepage_uri: https://github.com/monde-sistemas/xml_data_extractor
92
+ source_code_uri: https://github.com/monde-sistemas/xml_data_extractor
93
+ changelog_uri: https://github.com/monde-sistemas/xml_data_extractor
94
+ post_install_message:
95
+ rdoc_options: []
96
+ require_paths:
97
+ - lib
98
+ required_ruby_version: !ruby/object:Gem::Requirement
99
+ requirements:
100
+ - - ">="
101
+ - !ruby/object:Gem::Version
102
+ version: 2.3.0
103
+ required_rubygems_version: !ruby/object:Gem::Requirement
104
+ requirements:
105
+ - - ">="
106
+ - !ruby/object:Gem::Version
107
+ version: '0'
108
+ requirements: []
109
+ rubygems_version: 3.0.3
110
+ signing_key:
111
+ specification_version: 4
112
+ summary: Provides a simple DSL for extracting data from XML documents
113
+ test_files: []