xml_data_extractor 0.1.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 2ea176d26d7d1e43ca91ab838b75eca58044da1af8a98675dc3c1c1ba453f313
4
+ data.tar.gz: b4a69acdf185a5b830ef267cbb4cd86a08562795f178d5c0f79c4b893501a3b7
5
+ SHA512:
6
+ metadata.gz: 9bd4f5a9ea20e5ce63d92e9d213267eb06097f4a43e851bd65f622404a5f180e877409164ba5716d9e49dfa29ef1bb9b4cf8ff17ae269867ea74d2b040a7b85d
7
+ data.tar.gz: 21e37f2de2a5bc185fc2f6a30ea0367e67f9130f44a2645d53e10e1e7015e7569e47ccc7bb3e98203a12bae853bcff3d5d8ad60ff5ced36b702934a74a54bef2
@@ -0,0 +1,11 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+
10
+ # rspec failure tracking
11
+ .rspec_status
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
@@ -0,0 +1,6 @@
1
+ ---
2
+ language: ruby
3
+ cache: bundler
4
+ rvm:
5
+ - 2.6.6
6
+ before_install: gem install bundler -v 2.1.4
data/Gemfile ADDED
@@ -0,0 +1,7 @@
1
+ source "https://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in xml_data_extractor.gemspec
4
+ gemspec
5
+
6
+ gem "rake", "~> 12.0"
7
+ gem "rspec", "~> 3.0"
@@ -0,0 +1,53 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ xml_data_extractor (0.1.0)
5
+ activesupport
6
+ nokogiri
7
+
8
+ GEM
9
+ remote: https://rubygems.org/
10
+ specs:
11
+ activesupport (6.0.3.2)
12
+ concurrent-ruby (~> 1.0, >= 1.0.2)
13
+ i18n (>= 0.7, < 2)
14
+ minitest (~> 5.1)
15
+ tzinfo (~> 1.1)
16
+ zeitwerk (~> 2.2, >= 2.2.2)
17
+ concurrent-ruby (1.1.7)
18
+ diff-lcs (1.3)
19
+ i18n (1.8.5)
20
+ concurrent-ruby (~> 1.0)
21
+ mini_portile2 (2.4.0)
22
+ minitest (5.14.1)
23
+ nokogiri (1.10.10)
24
+ mini_portile2 (~> 2.4.0)
25
+ rake (12.3.3)
26
+ rspec (3.9.0)
27
+ rspec-core (~> 3.9.0)
28
+ rspec-expectations (~> 3.9.0)
29
+ rspec-mocks (~> 3.9.0)
30
+ rspec-core (3.9.2)
31
+ rspec-support (~> 3.9.3)
32
+ rspec-expectations (3.9.2)
33
+ diff-lcs (>= 1.2.0, < 2.0)
34
+ rspec-support (~> 3.9.0)
35
+ rspec-mocks (3.9.1)
36
+ diff-lcs (>= 1.2.0, < 2.0)
37
+ rspec-support (~> 3.9.0)
38
+ rspec-support (3.9.3)
39
+ thread_safe (0.3.6)
40
+ tzinfo (1.2.7)
41
+ thread_safe (~> 0.1)
42
+ zeitwerk (2.4.0)
43
+
44
+ PLATFORMS
45
+ ruby
46
+
47
+ DEPENDENCIES
48
+ rake (~> 12.0)
49
+ rspec (~> 3.0)
50
+ xml_data_extractor!
51
+
52
+ BUNDLED WITH
53
+ 2.1.4
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2020 Fernando Almeida
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
@@ -0,0 +1,381 @@
1
+ # XmlDataExtractor
2
+
3
+ This gem provides a DSL for extracting formatted data from any XML structure.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'xml_data_extractor'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle install
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install xml_data_extractor
20
+
21
+ ## Usage
22
+
23
+ The general idea is to declare a Ruby Hash that represents the field structure, with instructions for how each piece of data should be retrieved from the XML document.
24
+
25
+ ```ruby
26
+ structure = { schemas: { character: { path: "xml/FirstName" } } }
27
+ xml = "<xml><FirstName>Gandalf</FirstName></xml>"
28
+
29
+ result = XmlDataExtractor.new(structure).parse(xml)
30
+
31
+ # result -> { character: "Gandalf" }
32
+ ```
33
+
34
+ For convenience, you can write the structure in YAML, which can be easily converted to a Ruby hash using `YAML.load(yml).deep_symbolize_keys` (`deep_symbolize_keys` is provided by ActiveSupport).
35
+
36
+ Considering the following yaml and xml:
37
+
38
+ ```yml
39
+ schemas:
40
+   description:
41
+     path: xml/desc
42
+     modifier: downcase
43
+   amount:
44
+     path: xml/info/price
45
+     modifier: to_f
46
+ ```
47
+ ```xml
48
+ <xml>
49
+ <desc>HELLO WORLD</desc>
50
+ <info>
51
+ <price>123</price>
52
+ </info>
53
+ </xml>
54
+ ```
55
+
56
+ The output is:
57
+ ```ruby
58
+ {
59
+ description: "hello world",
60
+ amount: 123.0
61
+ }
62
+ ```
63
+
64
+ ### Defining the structure
65
+
66
+ The structure should be defined as a hash inside the `schemas` key. See the [complete example](https://github.com/monde-sistemas/xml_data_extractor/blob/master/spec/complete_example_spec.rb#L5).
67
+
68
+ When defining the structure you can combine any available command in order to extract and format the data as needed.
69
+
70
+ The available commands fall into two general groups:
71
+
72
+ - [Navigation & Extraction](#navigation--extraction)
73
+ - [Formatting](#formatting)
74
+
75
+ ### Navigation & Extraction:
76
+
77
+ The data extraction process is based on XPath, using Nokogiri.
78
+ * [XPath introduction](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples)
79
+ * [XPath cheatsheet](https://devhints.io/xpath)
80
+
81
+ #### path
82
+
83
+ Defines the `xpath` of the element.
84
+ The `path` is the default command of a field definition, so this:
85
+ ```yml
86
+ schemas:
87
+   description:
88
+     path: xml/desc
89
+ ```
90
+ Is equivalent to this:
91
+ ```yml
92
+ schemas:
93
+   description: xml/desc
94
+ ```
95
+
96
+ It can be defined as a string:
97
+ ```yml
98
+ schemas:
99
+   description:
100
+     path: xml/some_field
101
+ ```
102
+ ```xml
103
+ <xml>
104
+ <some_field>ABC</some_field>
105
+ </xml>
106
+ ```
107
+ ```ruby
108
+ { description: "ABC" }
109
+ ```
110
+
111
+ Or as a string array:
112
+ ```yml
113
+ schemas:
114
+   address:
115
+     path: [street, info/city]
116
+ ```
117
+ ```xml
118
+ <xml>
119
+ <street>Diagon Alley</street>
120
+ <info>
121
+ <city>London</city>
122
+ </info>
123
+ </xml>
124
+ ```
125
+ ```ruby
126
+ { address: ["Diagon Alley", "London"] }
127
+ ```
128
+
129
+ And even as a hash array, for complex operations:
130
+ ```yml
131
+ schemas:
132
+   address:
133
+     path:
134
+       - path: street
135
+         modifier: downcase
136
+       - path: info/city
137
+         modifier: upcase
138
+ ```
139
+ ```ruby
140
+ { address: ["diagon alley", "LONDON"] }
141
+ ```
142
+
143
+ #### attr
144
+
145
+ Defines the tag attribute from which the value should be extracted, instead of the tag's own value:
146
+ ```yml
147
+ schemas:
148
+   description:
149
+     path: xml/info
150
+     attr: desc
151
+ ```
152
+ ```xml
153
+ <xml>
154
+ <info desc="ABC">some stuff</info>
155
+ </xml>
156
+ ```
157
+ ```ruby
158
+ { description: "ABC" }
159
+ ```
160
+
161
+ Like the path, it can also be defined as a string array.
162
+
163
+ #### within
164
+
165
+ Defines a root path for the nested fields:
166
+ ```yml
167
+ schemas:
168
+   movie:
169
+     within: info/movie_data
170
+     title: original_title
171
+     actor: main_actor
172
+
173
+ ```
174
+ ```xml
175
+ <xml>
176
+ <info>
177
+ <movie_data>
178
+ <original_title>The Irishman</original_title>
179
+ <main_actor>Robert De Niro</main_actor>
180
+ </movie_data>
181
+ </info>
182
+ </xml>
183
+ ```
184
+ ```ruby
185
+ { movie: { title: "The Irishman", actor: "Robert De Niro" } }
186
+ ```
187
+
188
+ #### array_of
189
+
190
+ Defines the path to an XML collection, which is iterated to generate an array of hashes:
191
+ ```yml
192
+ schemas:
193
+   people:
194
+     array_of: characters/character
195
+     name: firstname
196
+     age: age
197
+ ```
198
+ ```xml
199
+ <xml>
200
+ <characters>
201
+ <character>
202
+ <firstname>Geralt</firstname>
203
+ <age>97</age>
204
+ </character>
205
+ <character>
206
+ <firstname>Yennefer</firstname>
207
+ <age>102</age>
208
+ </character>
209
+ </characters>
210
+ </xml>
211
+ ```
212
+ ```ruby
213
+ {
214
+ people: [
215
+ { name: "Geralt", age: "97" },
216
+ { name: "Yennefer", age: "102" }
217
+ ]
218
+ }
219
+ ```
220
+
221
+ If you need to loop through nested collections, you can define an array of paths:
222
+ ```yml
223
+ schemas:
224
+   show:
225
+     within: show_data
226
+     title: description
227
+     people:
228
+       array_of: [characters/character, info]
229
+       name: name
230
+ ```
231
+ ```xml
232
+ <xml>
233
+ <show_data>
234
+ <description>Peaky Blinders</description>
235
+ <characters>
236
+ <character>
237
+ <info>
238
+ <name>Tommy Shelby</name>
239
+ </info>
240
+ </character>
241
+ <character>
242
+ <info>
243
+ <name>Arthur Shelby</name>
244
+ </info>
245
+ <info>
246
+ <name>Alfie Solomons</name>
247
+ </info>
248
+ </character>
249
+ </characters>
250
+ </show_data>
251
+ </xml>
252
+ ```
253
+ ```ruby
254
+ {
255
+ show: {
256
+ title: "Peaky Blinders",
257
+ people: [
258
+ { name: "Tommy Shelby" },
259
+ { name: "Arthur Shelby" },
260
+ { name: "Alfie Solomons" }
261
+ ]
262
+ }
263
+ }
264
+ ```
265
+
266
+ ### Formatting:
267
+
268
+ #### fixed
269
+
270
+ Defines a fixed value for the field:
271
+ ```yml
272
+ currency:
273
+   fixed: BRL
274
+ ```
275
+ ```ruby
276
+ { currency: "BRL" }
277
+ ```
278
+
279
+ #### mapper
280
+
281
+ Uses a hash of predefined values to replace the extracted value with its respective option.
282
+ If the extracted value does not match any mapper option, it is replaced by the `default` value; if no default is defined, the extracted value is returned unchanged.
283
+ ```yml
284
+ mappers:
285
+   currencies:
286
+     default: unknown
287
+     options:
288
+       BRL: R$
289
+       USD: [US$, $]
290
+ schemas:
291
+   money:
292
+     array_of: curr_type/type
293
+     path: symbol
294
+     mapper: currencies
295
+ ```
296
+ ```xml
297
+ <xml>
298
+ <curr_type>
299
+ <type>
300
+ <symbol>US$</symbol>
301
+ </type>
302
+ <type>
303
+ <symbol>R$</symbol>
304
+ </type>
305
+ <type>
306
+ <symbol>RB</symbol>
307
+ </type>
308
+ <type>
309
+ <symbol>$</symbol>
310
+ </type>
311
+ </curr_type>
312
+ </xml>
313
+ ```
314
+ ```ruby
315
+ {
316
+ money: ["USD", "BRL", "unknown", "USD"]
317
+ }
318
+ ```
319
+
320
+ #### modifier
321
+
322
+ Defines a method to be called on the returned value.
323
+ ```yml
324
+ schemas:
325
+   name:
326
+     path: some_field
327
+     modifier: upcase
328
+ ```
329
+ ```xml
330
+ <xml>
331
+ <some_field>Lewandovski</some_field>
332
+ </xml>
333
+ ```
334
+ ```ruby
335
+ { name: "LEWANDOVSKI" }
336
+ ```
337
+
338
+ You can also pass parameters to the method. In this case, declare the modifier as an array whose entries are either a plain method name or a hash with the `name` and `params` keys:
339
+ ```yml
340
+ schemas:
341
+   name:
342
+     path: [firstname, lastname]
343
+     modifier:
344
+       - name: join
345
+         params: [" "]
346
+       - downcase
347
+ ```
348
+ ```xml
349
+ <xml>
350
+ <firstname>Robert</firstname>
351
+ <lastname>Martin</lastname>
352
+ </xml>
353
+ ```
354
+ ```ruby
355
+ { name: "robert martin" }
356
+ ```
357
+
358
+ If you need to use custom methods, pass an object that implements them as the second argument when initializing the extractor. Each custom method receives the extracted value as its first parameter:
359
+ ```yml
360
+ schemas:
361
+   price:
362
+     path: final_price
363
+     modifier: format_as_float
364
+ ```
365
+ ```xml
366
+ <xml>
367
+ <final_price>R$ 12.99</final_price>
368
+ </xml>
369
+ ```
370
+ ```ruby
371
+ class MyMethods
372
+ def format_as_float(value)
373
+ value.gsub(/[^\d.]/, "").to_f
374
+ end
375
+ end
376
+
377
+ XmlDataExtractor.new(yml, MyMethods.new).parse(xml)
378
+ ```
379
+ ```ruby
380
+ { price: 12.99 }
381
+ ```
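The Usage section of the README above suggests writing the structure in YAML and converting it to a symbol-keyed hash. The following is a minimal sketch of that step using only Ruby's standard library. It swaps in `YAML.safe_load` with `symbolize_names: true` for the README's `YAML.load(yml).deep_symbolize_keys`, and the `config` variable name is an assumption made for illustration.

```ruby
require "yaml"

yml = <<~YAML
  schemas:
    description:
      path: xml/desc
      modifier: downcase
    amount:
      path: xml/info/price
      modifier: to_f
YAML

# symbolize_names converts every mapping key to a Symbol, which is the
# shape XmlDataExtractor#parse reads (it fetches config[:schemas]).
config = YAML.safe_load(yml, symbolize_names: true)

p config[:schemas][:amount]
```

The resulting hash should be usable wherever the README passes `structure` to `XmlDataExtractor.new`, assuming the rest of the document uses plain scalars, arrays, and mappings.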
@@ -0,0 +1,6 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "xml_data_extractor"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start(__FILE__)
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,58 @@
1
+ module Extract
2
+ class ArrayOf < Base
3
+ def initialize(node, extractor, index = 0)
4
+ super(node, extractor)
5
+ @index = index
6
+ end
7
+
8
+ def value
9
+ process_paths.flatten.compact
10
+ end
11
+
12
+ private
13
+
14
+ attr_reader :index
15
+
16
+ def array_items
17
+ arr_path, link_path, uniq_by = node.array_of_paths
18
+
19
+ paths = extractor.paths_of(node.path, arr_path, link_path)
20
+ paths = uniq_paths(paths, uniq_by) if uniq_by
21
+
22
+ paths.each_with_index.map do |path, idx|
23
+ HashBuilder.new(Node.new(node.props, path), extractor).value(index + idx)
24
+ end.compact
25
+ end
26
+
27
+ def process_paths
28
+ paths = paths_from_props
29
+
30
+ if paths.size > 1
31
+ process_path(paths.shift, paths)
32
+ else
33
+ node.props[:array_of] = paths.first
34
+ array_items
35
+ end
36
+ end
37
+
38
+ def process_path(path, inner_paths)
39
+ path = build_path(path) if path.is_a?(Hash)
40
+
41
+ extractor.paths_of(node.path, path).each_with_index.map do |some, idx|
42
+ ArrayOf.new(Node.new(node.props.merge(array_of: inner_paths), some), extractor, index + idx).value
43
+ end
44
+ end
45
+
46
+ def uniq_paths(paths, uniq_by)
47
+ extractor.uniq_paths(paths, uniq_by)
48
+ end
49
+
50
+ def build_path(hash)
51
+ extractor.replace_link(hash[:path], [node.path, hash[:link]].join("/"))
52
+ end
53
+
54
+ def paths_from_props
55
+ [node.props[:array_of]].flatten
56
+ end
57
+ end
58
+ end
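`ArrayOf#process_paths` above handles a list of nested collection paths by expanding the first path into indexed items and recursing into each item with the remaining paths. The sketch below illustrates only that recursion, not the gem's implementation: the `MATCH_COUNTS` table is a stand-in for the XML document (the gem asks Nokogiri instead), the counts mirror the README's "Peaky Blinders" example, `expand_paths` is a hypothetical name, and the leading `//` that the gem adds is omitted for brevity.

```ruby
# Stand-in for the XML document: how many nodes match each path.
MATCH_COUNTS = {
  "characters/character" => 2,
  "characters/character[1]/info" => 1,
  "characters/character[2]/info" => 2
}.freeze

# Expands a list of nested collection paths into one indexed path per item.
def expand_paths(base, collection_paths, counts)
  head, *rest = collection_paths
  full_path = [base, head].compact.join("/")
  indexed = (1..counts.fetch(full_path, 0)).map { |i| "#{full_path}[#{i}]" }
  return indexed if rest.empty?

  # Recurse into each outer item with the remaining inner paths, then flatten.
  indexed.flat_map { |item_path| expand_paths(item_path, rest, counts) }
end

item_paths = expand_paths(nil, ["characters/character", "info"], MATCH_COUNTS)
puts item_paths
```

Each resulting path identifies one `info` element, which is why the README example yields three `people` entries from two `character` elements.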
@@ -0,0 +1,10 @@
1
+ module Extract
2
+ class ArrayValue < Base
3
+ def value
4
+ props, path = node.to_h.values_at(:props, :path)
5
+ props.map do |prop|
6
+ ValueBuilder.new(Node.new(prop, path), extractor).value
7
+ end.flatten
8
+ end
9
+ end
10
+ end
@@ -0,0 +1,12 @@
1
+ module Extract
2
+ class Base
3
+ def initialize(node, extractor)
4
+ @node = node
5
+ @extractor = extractor
6
+ end
7
+
8
+ private
9
+
10
+ attr_reader :node, :extractor
11
+ end
12
+ end
@@ -0,0 +1,20 @@
1
+ module Extract
2
+ class Expression
3
+ def initialize(expression, hash)
4
+ @expression = expression
5
+ @hash = hash
6
+ end
7
+
8
+ def evaluate
9
+ field_name = expression.split.first.parameterize
10
+ field_value = hash[field_name.to_sym]
11
+ condition = expression.gsub(field_name, field_value.to_s)
12
+
13
+ eval(condition) # NOTE: runs the `keep_if` text as Ruby code; only use trusted structure definitions
14
+ end
15
+
16
+ private
17
+
18
+ attr_reader :expression, :hash
19
+ end
20
+ end
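`Expression#evaluate` above substitutes the field value into the `keep_if` text and passes the result to `eval`. As a hedged alternative, the sketch below evaluates a condition without `eval`. It assumes expressions of the form `field operator number` (the source only shows that the first whitespace-separated token is treated as the field name), and `SafeExpression` is a hypothetical class that is not part of the gem.

```ruby
# Hypothetical eval-free evaluator for "field operator number" expressions.
class SafeExpression
  OPERATORS = %w[== != > >= < <=].freeze

  def initialize(expression, hash)
    @expression = expression
    @hash = hash
  end

  def evaluate
    field, operator, operand = @expression.split(" ", 3)
    raise ArgumentError, "unsupported operator: #{operator.inspect}" unless OPERATORS.include?(operator)

    # Compare numerically; Float() raises if either side is not a number.
    left = Float(@hash.fetch(field.to_sym).to_s)
    right = Float(operand)
    left.public_send(operator, right)
  end
end

p SafeExpression.new("amount > 100", { amount: "123.0" }).evaluate # prints true
```

Restricting the operator to a fixed list means text from the structure definition is never executed as code, at the cost of supporting only numeric comparisons.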
@@ -0,0 +1,33 @@
1
+ module Extract
2
+ class HashBuilder < Base
3
+ INTERNAL_FIELDS = %i[array_of keep_if within].freeze
4
+
5
+ def value(index = 0)
6
+ path, props = node.to_h.values_at(:path, :props)
7
+
8
+ hash = {}
9
+ props.each do |field_name, nested_props|
10
+ next unless valuable_field? field_name, nested_props, index
11
+
12
+ value = ValueBuilder.new(Node.new(nested_props, path), extractor).value
13
+ hash[field_name.to_sym] = value if value.present?
14
+ end
15
+
16
+ keep_hash?(hash, props) ? hash : nil
17
+ end
18
+
19
+ private
20
+
21
+ def keep_hash?(hash, props)
22
+ expression = props[:keep_if]
23
+ expression.present? ? Expression.new(expression, hash).evaluate : true
24
+ end
25
+
26
+ def valuable_field?(field_name, props, index)
27
+ return false if INTERNAL_FIELDS.include? field_name
28
+ return false if index.positive? && Node.new(props, "").first_only?
29
+
30
+ true
31
+ end
32
+ end
33
+ end
@@ -0,0 +1,32 @@
1
+ module Extract
2
+ class StringValue < Base
3
+ def value
4
+ path = node[:props][:path]
5
+ return formatted_array_values(path) if path.is_a?(Array)
6
+
7
+ extract_value(node)
8
+ end
9
+
10
+ private
11
+
12
+ def extract_value(node_to_extract)
13
+ extractor.extract(node_to_extract)
14
+ end
15
+
16
+ def formatted_array_values(paths)
17
+ extractor.format_value(values_from_array(paths), node[:props])
18
+ end
19
+
20
+ def values_from_array(paths)
21
+ node_path = node.path
22
+
23
+ paths.map do |inner|
24
+ if inner.is_a?(String)
25
+ extract_value(Node.new({ path: inner }, node_path))
26
+ else
27
+ StringValue.new(Node.new(inner, node_path), extractor).value
28
+ end
29
+ end
30
+ end
31
+ end
32
+ end
@@ -0,0 +1,44 @@
1
+ require_relative "base"
2
+ require_relative "array_value"
3
+ require_relative "array_of"
4
+ require_relative "hash_builder"
5
+ require_relative "string_value"
6
+ require_relative "value_builder"
7
+ require_relative "within"
8
+ require_relative "expression"
9
+
10
+ module Extract
11
+ class ValueBuilder < Base
12
+ def value
13
+ props = node.props
14
+ case props
15
+ when String then value_for_string
16
+ when Array then value_for_array
17
+ when Hash then value_for_hash
18
+ else
19
+ raise "Invalid kind #{props.class} (#{props})"
20
+ end
21
+ end
22
+
23
+ private
24
+
25
+ def value_for_hash
26
+ props = node.props
27
+ fixed_value = props[:fixed]
28
+ return fixed_value if fixed_value
29
+ return ArrayOf.new(node, extractor).value if props[:array_of]
30
+ return Within.new(node, extractor).value if props[:within]
31
+ return StringValue.new(node, extractor).value if (props.keys & %i[path attr]).any?
32
+
33
+ HashBuilder.new(node, extractor).value
34
+ end
35
+
36
+ def value_for_string
37
+ StringValue.new(Node.new({ path: node.props }, node.path), extractor).value
38
+ end
39
+
40
+ def value_for_array
41
+ ArrayValue.new(node, extractor).value
42
+ end
43
+ end
44
+ end
@@ -0,0 +1,11 @@
1
+ module Extract
2
+ class Within < Base
3
+ def value
4
+ props = node.props
5
+ paths = extractor.paths_of(node.path, props[:within])
6
+ return "" if paths.empty?
7
+
8
+ HashBuilder.new(Node.new(props, paths.first), extractor).value
9
+ end
10
+ end
11
+ end
@@ -0,0 +1,236 @@
1
+ require "cgi"
2
+ require "active_support/core_ext/string"
3
+ require_relative "format/formatter"
4
+
5
+ class PathBuilder < Struct.new(:base, :parent, :tag, keyword_init: true)
6
+ def build
7
+ paths = relative_path.split("/").then do |paths|
8
+ if parent.present?
9
+ navigate_to_parent(parent, paths)
10
+ else
11
+ paths
12
+ end
13
+ end
14
+
15
+ paths << tag unless tag.is_a? Array
16
+ full_path = paths.flatten.compact.join("/")
17
+ "//#{full_path}"
18
+ end
19
+
20
+ private
21
+
22
+ def relative_path
23
+ base.start_with?("//") ? base[2..-1] : base
24
+ end
25
+
26
+ def navigate_to_parent(parent_tag, paths)
27
+ index = path_index(parent_tag, paths)
28
+
29
+ paths[0, index + 1]
30
+ end
31
+
32
+ def path_index(tag, paths)
33
+ paths.each_with_index do |path, index|
34
+ return index if matching_tags?(path, tag)
35
+ end
36
+ 0
37
+ end
38
+
39
+ def matching_tags?(item, tag)
40
+ item.gsub(/\[\d+\]/, "") == tag
41
+ end
42
+ end
43
+
44
+ class NodeParamsExtractor < Struct.new(:node)
45
+ def extract
46
+ [node.path, *node.props.values_at(:in_parent, :path, :link, :attr)]
47
+ end
48
+ end
49
+
50
+ class NodeExtractor
51
+ def initialize(xml)
52
+ @xml = Nokogiri::XML(remove_special_elements(xml), nil, Encoding::UTF_8.to_s)
53
+ @xml.remove_namespaces!
54
+ end
55
+
56
+ def extract(path)
57
+ xml.xpath(path)
58
+ rescue StandardError
59
+ nil
60
+ end
61
+
62
+ private
63
+
64
+ def remove_special_elements(xml)
65
+ CGI.unescapeHTML(xml).gsub(/<br>|&nbsp;/, { "<br>" => "", "&nbsp;" => " " })
66
+ end
67
+
68
+ attr_reader :xml
69
+ end
70
+
71
+ class NodeValueExtractor
72
+ def initialize(node_extractor)
73
+ @node_extractor = node_extractor
74
+ end
75
+
76
+ def attr_values(path, attributes)
77
+ return attributes.map { |atr| attr_value(path, atr) } if attributes.is_a? Array
78
+ return tag_count(path) if attributes == :tag_count
79
+
80
+ attr_value(path, attributes)
81
+ end
82
+
83
+ def tag_count(path)
84
+ node_extractor.extract(path).size
85
+ end
86
+
87
+ def tag_values(base_path, paths)
88
+ return tag_value(base_path) unless paths.is_a? Array
89
+
90
+ paths.map { |path| tag_value([base_path, path].flatten.compact.join("/")) }
91
+ end
92
+
93
+ private
94
+
95
+ attr_reader :node_extractor
96
+
97
+ def tag_value(path)
98
+ node_raw_value node_extractor.extract(path)
99
+ end
100
+
101
+ def attr_value(path, att)
102
+ node_raw_value node_extractor.extract(path).attribute(att)
103
+ end
104
+
105
+ def node_raw_value(node)
106
+ NodeValue.new(node).raw_value
107
+ end
108
+ end
109
+
110
+ class NodeValue
111
+ def initialize(node)
112
+ @node = node
113
+ end
114
+
115
+ def raw_value
116
+ return "" unless node
117
+
118
+ node_size = node.try(:size).to_i
119
+ return node.map(&:text) if node_size > 1
120
+ return node.first if node_size == 1 && contains_children?
121
+
122
+ node.text
123
+ end
124
+
125
+ private
126
+
127
+ attr_reader :node
128
+
129
+ def contains_children?
130
+ node.first.try(:children).any? { |child| child.is_a? Nokogiri::XML::Element }
131
+ end
132
+ end
133
+
134
+ class PathManipulator
135
+ def initialize(node_value_extractor)
136
+ @node_value_extractor = node_value_extractor
137
+ end
138
+
139
+ def replace_link(original_path, link_path)
140
+ return original_path if link_path.blank?
141
+
142
+ link_value = node_value_extractor.tag_values(link_path, nil)
143
+
144
+ original_path.gsub "<link>", link_value
145
+ end
146
+
147
+ def uniq_paths(paths, uniq_by_path)
148
+ paths
149
+ .map { |path| { path: path, value: tag_value(path, uniq_by_path) } }
150
+ .then { |paths_values| remove_duplicated_paths(paths_values) }
151
+ .map { |path_value| path_value[:path] }
152
+ end
153
+
154
+ private
155
+
156
+ attr_reader :node_value_extractor
157
+
158
+ def tag_value(path, uniq_by_path)
159
+ node_value_extractor.tag_values([path, uniq_by_path].join("/"), "")
160
+ end
161
+
162
+ def remove_duplicated_paths(paths_values)
163
+ paths_values.delete_if.with_index do |path_value, index|
164
+ index != first_path_value_index(paths_values, path_value)
165
+ end
166
+ end
167
+
168
+ def first_path_value_index(paths_values, current_path)
169
+ paths_values.find_index { |path_value| path_value[:value] == current_path[:value] }
170
+ end
171
+ end
172
+
173
+ class Extractor
174
+ def initialize(xml, yml, modifiers)
175
+ @node_extractor = NodeExtractor.new(xml)
176
+ @node_value_extractor = NodeValueExtractor.new(node_extractor)
177
+ @path_manipulator = PathManipulator.new(node_value_extractor)
178
+ @formatter = Format::Formatter.new(yml, modifiers)
179
+ end
180
+
181
+ def extract(node)
182
+ base, parent, tag, link, attribute = NodeParamsExtractor.new(node).extract
183
+ path = PathBuilder.new(base: base, parent: parent, tag: tag).build
184
+
185
+ if link.present?
186
+ link_path = PathBuilder.new(base: base, parent: parent, tag: link).build
187
+
188
+ if tag.is_a? Array
189
+ tag = tag.map { |tag_path| replace_link(tag_path, link_path) }
190
+ else
191
+ path = replace_link(path, link_path)
192
+ end
193
+ end
194
+
195
+ value = path_value(path, tag, attribute)
196
+ format_value(value, node.props)
197
+ end
198
+
199
+ def format_value(value, props)
200
+ formatter.format_value(value, props)
201
+ end
202
+
203
+ def replace_link(original_path, link_path)
204
+ path_manipulator.replace_link(original_path, link_path)
205
+ end
206
+
207
+ def paths_of(base_path, tag_path, link_path = nil)
208
+ path = PathBuilder.new(base: base_path, tag: tag_path).build
209
+
210
+ if link_path.present?
211
+ link_path = PathBuilder.new(base: base_path, tag: link_path).build
212
+ path = replace_link(path, link_path)
213
+ end
214
+
215
+ node = node_extractor.extract(path)
216
+ (node || []).size.times.map do |index|
217
+ "#{path}[#{index + 1}]"
218
+ end
219
+ end
220
+
221
+ def uniq_paths(paths, uniq_by_path)
222
+ return paths if uniq_by_path.blank?
223
+
224
+ path_manipulator.uniq_paths(paths, uniq_by_path)
225
+ end
226
+
227
+ private
228
+
229
+ attr_reader :node_extractor, :node_value_extractor, :path_manipulator, :formatter
230
+
231
+ def path_value(path, tag, attribute)
232
+ return node_value_extractor.attr_values(path, attribute) if attribute.present?
233
+
234
+ node_value_extractor.tag_values(path, tag)
235
+ end
236
+ end
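`PathBuilder#build` above turns a base path, an optional parent tag, and a tag into a single `//`-prefixed XPath. The following is a dependency-free sketch of that logic for experimentation. `build_xpath` is a hypothetical function name, and the index-stripping pattern here uses `\d+` so that multi-digit positions such as `[12]` also match.

```ruby
# Mirrors the path assembly: strip a leading "//", optionally cut the base
# back to the parent tag, append the tag, and prefix the result with "//".
def build_xpath(base:, tag:, parent: nil)
  segments = base.sub(%r{\A//}, "").split("/")

  if parent && !parent.empty?
    # Find the parent segment while ignoring positional predicates like "[2]".
    index = segments.index { |segment| segment.gsub(/\[\d+\]/, "") == parent } || 0
    segments = segments[0..index]
  end

  segments << tag unless tag.is_a?(Array)
  "//#{segments.compact.join('/')}"
end

puts build_xpath(base: "", tag: "xml/desc")
puts build_xpath(base: "//xml/characters/character[2]/info[1]", parent: "character", tag: "age")
```

The second call shows the `in_parent` behavior visible in `NodeParamsExtractor`: the base is truncated at the `character[2]` segment before `age` is appended.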
@@ -0,0 +1,28 @@
1
+ require_relative "mapper"
2
+ require_relative "modifier"
3
+
4
+ module Format
5
+ class Formatter
6
+ def initialize(yml, modifiers)
7
+ @mapper = Format::Mapper.new(yml)
8
+ @modifier = Format::Modifier.new(yml, modifiers)
9
+ end
10
+
11
+ def format_value(value, props)
12
+ modifier_prop, mapper_prop = props.values_at(:modifier, :mapper)
13
+
14
+ value
15
+ .then { |it| modifier.apply(it, modifier_prop) }
16
+ .then { |it| nullify_empty_value(it) }
17
+ .then { |it| mapper.apply(it, mapper_prop) }
18
+ end
19
+
20
+ private
21
+
22
+ attr_reader :modifier, :mapper
23
+
24
+ def nullify_empty_value(value)
25
+ value.blank? || value.try(:zero?) ? nil : value
26
+ end
27
+ end
28
+ end
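`Formatter#format_value` above applies three steps in a fixed order: the modifier, then the empty-value check, then the mapper. The sketch below reproduces that order with plain lambdas to show one consequence: a value that becomes zero after modification is turned into `nil`. The `nullify_empty` helper only approximates ActiveSupport's `blank?` (it does not treat whitespace-only strings or `false` as blank), and the lambda names are assumptions for illustration.

```ruby
# Approximates the blank?/zero? check without the ActiveSupport dependency.
def nullify_empty(value)
  empty = value.nil? || (value.respond_to?(:empty?) && value.empty?)
  zero = value.respond_to?(:zero?) && value.zero?
  empty || zero ? nil : value
end

# Same order as the formatter: modify, nullify, then map.
def format_value(value, modifier:, mapper:)
  modified = modifier.call(value)
  cleaned = nullify_empty(modified)
  mapper.call(cleaned)
end

to_float = ->(value) { value.to_f }
identity = ->(value) { value }

p format_value("123", modifier: to_float, mapper: identity) # prints 123.0
p format_value("abc", modifier: to_float, mapper: identity) # prints nil
```

Because `XmlDataExtractor#parse` and `HashBuilder#value` only keep values that are present, a field nullified this way is left out of the resulting hash.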
@@ -0,0 +1,28 @@
1
+ module Format
2
+ class Mapper
3
+ def initialize(yml)
4
+ @mappers = yml.fetch(:mappers, {})
5
+ end
6
+
7
+ def apply(raw_value, mapper_name)
8
+ return raw_value unless mapper_name
9
+
10
+ mappers.each do |name, fields|
11
+ return mapper_value(fields, raw_value) if mapper_name.to_sym == name
12
+ end
13
+
14
+ raise "Mapper not found #{mapper_name}"
15
+ end
16
+
17
+ private
18
+
19
+ attr_reader :mappers
20
+
21
+ def mapper_value(fields, value)
22
+ (fields[:options] || []).each do |option, values|
23
+ return option.to_s if [values].flatten.include?(value.to_s)
24
+ end
25
+ fields[:default] || value
26
+ end
27
+ end
28
+ end
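`Mapper#mapper_value` above returns the first option whose accepted values include the extracted value, and otherwise falls back to the `default` or the raw value. The sketch below is a standalone illustration of that lookup using the currencies from the README's mapper example; `map_value` and `CURRENCIES` are hypothetical names.

```ruby
CURRENCIES = {
  default: "unknown",
  options: { BRL: "R$", USD: ["US$", "$"] }
}.freeze

# Returns the option name whose accepted values include the given value,
# falling back to the default, then to the value itself.
def map_value(mapper, value)
  mapper.fetch(:options, {}).each do |option, accepted|
    return option.to_s if Array(accepted).include?(value.to_s)
  end
  mapper[:default] || value
end

p ["US$", "R$", "RB", "$"].map { |symbol| map_value(CURRENCIES, symbol) }
# prints ["USD", "BRL", "unknown", "USD"]
```

Note that the comparison is on strings, so an option listed as a number in YAML would not match an extracted `"1"` unless it is quoted.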
@@ -0,0 +1,37 @@
1
+ module Format
2
+ class Modifier
3
+ def initialize(yml, helper)
4
+ @debug = yml.fetch(:debug, false)
5
+ @helper = helper
6
+ end
7
+
8
+ def apply(raw_value, modifiers)
9
+ [modifiers].flatten.compact.reduce(raw_value) do |value, modifier|
10
+ method_name, params = modifier_props(modifier).values_at(:name, :params)
11
+
12
+ modify_value(value, method_name, params)
13
+ end
14
+ end
15
+
16
+ private
17
+
18
+ attr_reader :helper, :debug
19
+
20
+ def modifier_props(modifier)
21
+ modifier.is_a?(String) ? { name: modifier } : modifier
22
+ end
23
+
24
+ def modify_value(value, method, params)
25
+ args = [value]
26
+ if params.present?
27
+ args = params.is_a?(Array) ? [value, *params] : [value, **params]
28
+ end
29
+
30
+ value.try(method, *params) || helper.send(method, *args)
31
+ rescue StandardError => error
32
+ raise error unless debug
33
+
34
+ "Error invoking '#{method}' with (#{args.join(',')}): #{error}"
35
+ end
36
+ end
37
+ end
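`Modifier#apply` above reduces the extracted value through a list of modifiers, where each entry is either a method name or a hash with `name` and `params`. The following sketch shows that reduction in isolation. It is a simplification: it picks the helper object when the value does not respond to the method, whereas the gem calls `value.try` first and falls back to the helper whenever that returns `nil` or `false`. `apply_modifiers` and `PriceHelpers` are hypothetical names.

```ruby
# Hypothetical helper object, mirroring the README's MyMethods example.
class PriceHelpers
  def format_as_float(value)
    value.gsub(/[^\d.]/, "").to_f
  end
end

# Threads the value through each modifier in order.
def apply_modifiers(value, modifiers, helper = nil)
  [modifiers].flatten.compact.reduce(value) do |current, modifier|
    name, params = modifier.is_a?(Hash) ? modifier.values_at(:name, :params) : [modifier, nil]
    args = Array(params)

    if current.respond_to?(name)
      current.public_send(name, *args)         # built-in method on the value
    else
      helper.public_send(name, current, *args) # custom method on the helper
    end
  end
end

p apply_modifiers(%w[Robert Martin], [{ name: "join", params: [" "] }, "downcase"])
# prints "robert martin"
p apply_modifiers("R$ 12.99", "format_as_float", PriceHelpers.new)
# prints 12.99
```

Because modifiers run left to right, `join` must come before `downcase` in the first call: arrays do not respond to `downcase`.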
@@ -0,0 +1,26 @@
1
+ class Node < Struct.new(:props, :path)
2
+ def initialize(*)
3
+ super
4
+ self.path ||= ""
5
+ end
6
+
7
+ def first_only?
8
+ return unless props.is_a? Hash
9
+
10
+ props[:array_presence] == "first_only"
11
+ end
12
+
13
+ def array_of_paths
14
+ array_paths(props[:array_of])
15
+ end
16
+
17
+ private
18
+
19
+ def array_paths(array_props)
20
+ if array_props.is_a?(Hash)
21
+ array_props.values_at(:path, :link, :uniq_by)
22
+ else
23
+ [array_props].flatten
24
+ end
25
+ end
26
+ end
@@ -0,0 +1,27 @@
1
+ require "nokogiri"
2
+ require_relative "src/extractor"
3
+ require_relative "src/node"
4
+ require_relative "src/extract/value_builder"
5
+
6
+ class XmlDataExtractor
7
+ def initialize(config, modifiers = nil)
8
+ @config = config
9
+ @modifiers = modifiers
10
+ end
11
+
12
+ def parse(xml)
13
+ extractor = Extractor.new(xml, config, modifiers)
14
+ schemas = config.fetch(:schemas, {})
15
+
16
+ {}.tap do |hash|
17
+ schemas.map do |key, val|
18
+ value = Extract::ValueBuilder.new(Node.new(val), extractor).value
19
+ hash[key] = value if value.present?
20
+ end
21
+ end
22
+ end
23
+
24
+ private
25
+
26
+ attr_reader :config, :modifiers
27
+ end
@@ -0,0 +1,28 @@
1
+ Gem::Specification.new do |spec|
2
+ spec.name = "xml_data_extractor"
3
+ spec.version = "0.1.0"
4
+ spec.authors = ["Fernando Almeida"]
5
+ spec.email = ["fernandoprsbr@gmail.com"]
6
+
7
+ spec.summary = "Provides a simple DSL for extracting data from XML documents"
8
+ spec.homepage = "https://github.com/monde-sistemas/xml_data_extractor"
9
+ spec.license = "MIT"
10
+ spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0")
11
+
12
+ spec.metadata["homepage_uri"] = spec.homepage
13
+ spec.metadata["source_code_uri"] = spec.homepage
14
+ spec.metadata["changelog_uri"] = spec.homepage
15
+
16
+ # Specify which files should be added to the gem when it is released.
17
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
18
+ spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do
19
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
20
+ end
21
+ spec.bindir = "exe"
22
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
23
+ spec.require_paths = ["lib"]
24
+
25
+ spec.add_dependency "nokogiri"
26
+ spec.add_dependency "activesupport"
27
+ spec.add_development_dependency "rspec"
28
+ end
metadata ADDED
@@ -0,0 +1,113 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: xml_data_extractor
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Fernando Almeida
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2020-08-24 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: nokogiri
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: activesupport
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rspec
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ description:
56
+ email:
57
+ - fernandoprsbr@gmail.com
58
+ executables: []
59
+ extensions: []
60
+ extra_rdoc_files: []
61
+ files:
62
+ - ".gitignore"
63
+ - ".rspec"
64
+ - ".travis.yml"
65
+ - Gemfile
66
+ - Gemfile.lock
67
+ - LICENSE.txt
68
+ - README.md
69
+ - Rakefile
70
+ - bin/console
71
+ - bin/setup
72
+ - lib/src/extract/array_of.rb
73
+ - lib/src/extract/array_value.rb
74
+ - lib/src/extract/base.rb
75
+ - lib/src/extract/expression.rb
76
+ - lib/src/extract/hash_builder.rb
77
+ - lib/src/extract/string_value.rb
78
+ - lib/src/extract/value_builder.rb
79
+ - lib/src/extract/within.rb
80
+ - lib/src/extractor.rb
81
+ - lib/src/format/formatter.rb
82
+ - lib/src/format/mapper.rb
83
+ - lib/src/format/modifier.rb
84
+ - lib/src/node.rb
85
+ - lib/xml_data_extractor.rb
86
+ - xml_data_extractor.gemspec
87
+ homepage: https://github.com/monde-sistemas/xml_data_extractor
88
+ licenses:
89
+ - MIT
90
+ metadata:
91
+ homepage_uri: https://github.com/monde-sistemas/xml_data_extractor
92
+ source_code_uri: https://github.com/monde-sistemas/xml_data_extractor
93
+ changelog_uri: https://github.com/monde-sistemas/xml_data_extractor
94
+ post_install_message:
95
+ rdoc_options: []
96
+ require_paths:
97
+ - lib
98
+ required_ruby_version: !ruby/object:Gem::Requirement
99
+ requirements:
100
+ - - ">="
101
+ - !ruby/object:Gem::Version
102
+ version: 2.3.0
103
+ required_rubygems_version: !ruby/object:Gem::Requirement
104
+ requirements:
105
+ - - ">="
106
+ - !ruby/object:Gem::Version
107
+ version: '0'
108
+ requirements: []
109
+ rubygems_version: 3.0.3
110
+ signing_key:
111
+ specification_version: 4
112
+ summary: Provides a simple DSL for extracting data from XML documents
113
+ test_files: []