data_collector 0.62.0 → 0.64.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: '0923a758b6a621afd2d2fb4aa85f580cb757ee3ca1150e7175827a9bf617e07b'
-   data.tar.gz: 1a83fecc0244088447747f4c196ffcfddc6cb70e5ab8f40232a0401ac9912fc3
+   metadata.gz: cc87874763fab91f643447fac3148e3d31fd6fcfad889eb4e64d46ade463ac89
+   data.tar.gz: 8b8a01033c3cad493c2cc011d1d1bfb92cb5f3412fbd585dad34cfd19ee83647
  SHA512:
-   metadata.gz: d6d75c0bd763bbf2c107ebe7e5bc2c5c813454c6ad496eb234d38c79fae01525969e5d8b6abd601cf3103f8363f3f7965957cb54103eb10aaa80b65f80aa3073
-   data.tar.gz: '029d8bbf6819f90b955f373d8ee8e795fca2b130fc1900b487ef692a1b56cf7f7a8d02b222637bf83361354f1e795713c731eba90fb446f2906a758986c99af1'
+   metadata.gz: c4d0abd5ba7864ef268c584feec3b273ba54e30830c9e9addb061905bf66a2d63805b3965ad675fde02aa12102308ec6d4dee6b70fb89700723da031049f81ee
+   data.tar.gz: 9ae512465969cc870884ab4355f6cacc2e2e1a2cae9e77d694f4a198ae65d9d138037e6c5cf46b844a6c30db997373bdf832ce4a07d27ad979e28e2ba72b73ee
data/README.md CHANGED
@@ -1,364 +1,460 @@
1
- # DataCollector
2
- Convenience module to Extract, Transform and Load data in a Pipeline.
3
- The 'INPUT', 'OUTPUT' and 'FILTER' object will help you to read, transform and output your data.
4
- Support objects like CONFIG, LOG, ERROR, RULES help you to write manageable rules to transform and log your data.
5
- Include the DataCollector::Core module into your application gives you access to these objects.
1
+ # DataCollector Ruby Gem
2
+
3
+ ## Overview
4
+
5
+ DataCollector is a convenience module for Extract, Transform, and Load (ETL) operations in a pipeline architecture. It provides a simple way to collect, process, transform, and transfer data to various systems and applications.
6
+
7
+ ## Installation
8
+
9
+ Add this line to your application's Gemfile:
10
+
11
+ ```ruby
12
+ gem 'data_collector'
13
+ ```
14
+
15
+ And then execute:
16
+
17
+ ```
18
+ $ bundle
19
+ ```
20
+
21
+ Or install it yourself as:
22
+
23
+ ```
24
+ $ gem install data_collector
25
+ ```
26
+
27
+ ## Getting Started
28
+
29
+ Include the DataCollector::Core module in your application to access all available objects:
30
+
6
31
  ```ruby
32
+ require 'data_collector'
7
33
  include DataCollector::Core
8
34
  ```
9
- Every object can be used on its own.
10
-
11
- ### DataCollector Objects
12
- #### Pipeline
13
- Allows you to create a simple pipeline of operations to process data. With a data pipeline, you can collect, process, and transform data, and then transfer it to various systems and applications.
14
-
15
- You can set a schedule for pipelines that are triggered by new data, specifying how often the pipeline should be
16
- executed in the [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm). The processing logic is then executed.
17
- ###### methods:
18
- - .new(options): options can be schedule in [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm) and name
19
- - options:
20
- - name: pipeline name
21
- - schedule: [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm)
22
- - cron: in cron format ex. '1 12 * * *' intervals are not supported
23
- - uri: a directory/file to watch
24
- - xml_typecast: true/false -> convert string values to TrueClass, FalseClass, Time, Date, and DateTime
25
- - .run: start the pipeline. blocking if a schedule is supplied
26
- - .stop: stop the pipeline
27
- - .pause: pause the pipeline. Restart using .run
28
- - .running?: is pipeline running
29
- - .stopped?: is pipeline not running
30
- - .paused?: is pipeline paused
31
- - .name: name of the pipe
32
- - .run_count: number of times the pipe has run
33
- - .on_message: handle to run every time a trigger event happens
34
- ###### example:
35
+
36
+ This gives you access to the following objects: `pipeline`, `input`, `output`, `filter`, `rules`, `config`, `log`, and `error`.
37
+
38
+ ## Core Components
39
+
40
+ ### Pipeline
41
+
42
+ The Pipeline object allows you to create a data processing pipeline with scheduled execution.
43
+
44
+ #### Methods
45
+
46
+ - `.new(options)`: Create a new pipeline
47
+ - Options:
48
+ - `name`: Pipeline name
49
+ - `schedule`: ISO8601 duration format (e.g., 'PT10M' for every 10 minutes)
50
+ - `cron`: Cron format (e.g., '0 6 * * *' for 6:00 AM daily)
51
+ - `uri`: Directory/file to watch
52
+ - `xml_typecast`: true/false; when true, convert string values to TrueClass, FalseClass, Time, Date, and DateTime
53
+ - `.run`: Start the pipeline (blocking if schedule is supplied)
54
+ - `.stop`: Stop the pipeline
55
+ - `.pause`: Pause the pipeline (restart with `.run`)
56
+ - `.running?`: Check if pipeline is running
57
+ - `.stopped?`: Check if pipeline is not running
58
+ - `.paused?`: Check if pipeline is paused
59
+ - `.name`: Get pipeline name
60
+ - `.run_count`: Get number of times the pipeline has run
61
+ - `.on_message`: Handle to run every time a trigger event happens
62
+
63
+ #### Examples
64
+
65
+ Time-scheduled pipeline:
35
66
  ```ruby
36
- #create a pipeline scheduled to run every 10 minutes
67
+ # Run every 10 minutes
37
68
  pipeline = Pipeline.new(schedule: 'PT10M')
38
69
 
39
70
  pipeline.on_message do |input, output|
40
71
  data = input.from_uri("https://dummyjson.com/comments?limit=10")
41
- # process data
72
+ # Process data
42
73
  end
43
74
 
44
75
  pipeline.run
45
76
  ```
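The `schedule` option takes an ISO 8601 duration. As a rough illustration of what a value such as 'PT10M' encodes, the sketch below converts a time-only duration into seconds. `duration_to_seconds` is a hypothetical helper written for this note, not part of data_collector, and it only handles the `PT…H…M…S` subset.

```ruby
# Hypothetical helper (not part of data_collector): convert a time-only
# ISO 8601 duration such as 'PT10M' or 'PT1H30M' into seconds.
def duration_to_seconds(duration)
  match = /\APT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?\z/.match(duration)
  if match.nil? || match.captures.all?(&:nil?)
    raise ArgumentError, "unsupported duration: #{duration}"
  end

  # Missing components are nil, and nil.to_i is 0.
  hours, minutes, seconds = match.captures.map(&:to_i)
  hours * 3600 + minutes * 60 + seconds
end

puts duration_to_seconds('PT10M') # 600
puts duration_to_seconds('PT5S')  # 5
```

Date components (years, months, days) are deliberately left out here because their length in seconds depends on the calendar.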
46
77
 
78
+ Cron-scheduled pipeline:
47
79
  ```ruby
48
- #create a pipeline scheduled to run every morning at 06:00 am
80
+ # Run every morning at 06:00 AM
49
81
  pipeline = Pipeline.new(cron: '0 6 * * *')
50
82
 
51
83
  pipeline.on_message do |input, output|
52
84
  data = input.from_uri("https://dummyjson.com/comments?limit=10")
53
- # process data
85
+ # Process data
54
86
  end
55
87
 
56
88
  pipeline.run
57
89
  ```
58
90
 
59
-
91
+ File-watching pipeline:
60
92
  ```ruby
61
- #create a pipeline to listen and process files in a directory
93
+ # Listen for and process files in a directory
62
94
  extract = DataCollector::Pipeline.new(name: 'extract', uri: 'file://./data/in')
63
95
 
64
96
  extract.on_message do |input, output, filename|
65
97
  data = input.from_uri("file://#{filename}")
66
- # process data
98
+ # Process data
67
99
  end
68
100
 
69
101
  extract.run
70
102
  ```
71
103
 
72
- #### input
73
- The input component is part of the processing logic. All data is converted into a Hash, Array, ... accessible using plain Ruby or JSONPath using the filter object.
74
- The input component can fetch data from various URIs, such as files, URLs, directories, queues, ...
75
- For a push input component, a listener is created with a processing logic block that is executed whenever new data is available.
76
- A push happens when new data is created in a directory, message queue, ...
77
-
78
- ```ruby
79
- from_uri(source, options = {:raw, :content_type, :headers, :cookies})
80
- ```
81
- - source: an uri with a scheme of http, https, file, amqp
82
- - options:
83
- - raw: _boolean_ do not parse
84
- - content_type: _string_ force a content_type if the 'Content-Type' returned by the http server is incorrect
85
- - headers: request headers
86
- - cookies: session cookies etc.
87
- - method: http verb one of [GET, POST] default('GET')
88
- - body: http post body
89
-
90
- ###### example:
91
- ```ruby
92
- # read from an http endpoint
93
- input.from_uri("http://www.libis.be")
94
- input.from_uri("file://hello.txt")
95
- input.from_uri("http://www.libis.be/record.jsonld", content_type: 'application/ld+json')
96
- input.from_uri("https://www.w3.org/TR/rdf12-turtle/examples/example1.ttl")
97
- input.from_uri("https://dbpedia.org/sparql", body: "query=SELECT * WHERE {?sub ?pred ?obj} LIMIT 10", method:"POST", headers: {accept: "text/turtle"})
98
- input.from_uri(StringIO.new(File.read('myrecords.xml')), content_type: 'application/xml' )
99
-
100
- # read data from a RabbitMQ queue
101
- listener = input.from_uri('amqp://user:password@localhost?channel=hello&queue=world')
102
- listener.on_message do |input, output, message|
103
- puts message
104
- end
105
- listener.run
106
-
107
- # read data from a directory
108
- listener = input.from_uri('file://this/is/directory')
109
- listener.on_message do |input, output, filename|
110
- puts filename
111
- end
112
- listener.run
113
- ```
104
+ ### Input
114
105
 
115
- Inputs can be JSON, XML or CSV or XML in a TAR.GZ file
106
+ The input component fetches data from various URIs and converts it into Ruby objects (Hash, Array, etc.).
116
107
 
117
- ###### listener from input.from_uri(directory|message queue)
118
- When a listener is defined that is triggered by an event(PUSH) like a message queue or files written to a directory you have these extra methods.
108
+ #### Methods
119
109
 
120
- - .run: start the listener. blocking if a schedule is supplied
121
- - .stop: stop the listener
122
- - .pause: pause the listener. Restart using .run
123
- - .running?: is listener running
124
- - .stopped?: is listener not running
125
- - .paused?: is listener paused
126
- - .on_message: handle to run every time a trigger event happens
110
+ - `from_uri(source, options = {})`: Fetch data from a source
111
+ - Parameters:
112
+ - `source`: URI with scheme (http, https, file, amqp)
113
+ - `options`:
114
+ - `raw`: Boolean (do not parse)
115
+ - `content_type`: String (force a specific content type)
116
+ - `headers`: Request headers
117
+ - `cookies`: Session cookies
118
+ - `method`: HTTP verb, GET or POST (defaults to GET)
119
+ - `body`: HTTP post body
127
120
 
128
- ### output
129
- Output is an object you can store key/value pairs that needs to be written to an output stream.
130
- ```ruby
131
- output[:name] = 'John'
132
- output[:last_name] = 'Doe'
133
- ```
121
+ #### Examples
134
122
 
123
+ HTTP and file sources:
135
124
  ```ruby
136
- # get all keys from the output object
137
- output.keys
138
- output.key?(:name)
139
- output.each do |k,v|
140
- puts "#{k}:#{v}"
141
- end
125
+ # Read from an HTTP endpoint
126
+ input.from_uri("http://www.libis.be")
127
+
128
+ # Read from a file
129
+ input.from_uri("file://hello.txt")
130
+
131
+ # Force content type
132
+ input.from_uri("http://www.libis.be/record.jsonld", content_type: 'application/ld+json')
133
+
134
+ # Read RDF/Turtle data
135
+ input.from_uri("https://www.w3.org/TR/rdf12-turtle/examples/example1.ttl")
136
+
137
+ # POST request
138
+ input.from_uri(
139
+ "https://dbpedia.org/sparql",
140
+ body: "query=SELECT * WHERE {?sub ?pred ?obj} LIMIT 10",
141
+ method: "POST",
142
+ headers: {accept: "text/turtle"}
143
+ )
144
+
145
+ # Read from StringIO
146
+ input.from_uri(
147
+ StringIO.new(File.read('myrecords.xml')),
148
+ content_type: 'application/xml'
149
+ )
142
150
  ```
143
- ```ruby
144
- # add hash to output
145
- output << { age: 22 }
146
151
 
147
- puts output[:age]
148
- # # 22
149
- ```
152
+ Message queues:
150
153
  ```ruby
151
- # add array to output
152
- output << [1,2,3,4]
153
- puts output.keys
154
- # # datap
155
- puts output['datap']
156
- # # [1, 2, 3, 4]
154
+ # Read data from a RabbitMQ queue
155
+ listener = input.from_uri('amqp://user:password@localhost?channel=hello&queue=world')
156
+ listener.on_message do |input, output, message|
157
+ puts message
158
+ end
159
+ listener.run
157
160
  ```
158
161
 
159
- Write output to a file, string use an ERB file as a template
160
- example:
161
- ___test.erb___
162
- ```erbruby
163
- <names>
164
- <combined><%= data[:name] %> <%= data[:last_name] %></combined>
165
- <%= print data, :name, :first_name %>
166
- <%= print data, :last_name %>
167
- </names>
168
- ```
169
- will produce
170
- ```html
171
- <names>
172
- <combined>John Doe</combined>
173
- <first_name>John</first_name>
174
- <last_name>Doe</last_name>
175
- </names>
162
+ Directory monitoring:
163
+ ```ruby
164
+ # Read data from a directory
165
+ listener = input.from_uri('file://this/is/directory')
166
+ listener.on_message do |input, output, filename|
167
+ puts filename
168
+ end
169
+ listener.run
176
170
  ```
177
171
 
178
- Into a variable
172
+ CSV files with options:
179
173
  ```ruby
180
- result = output.to_s("test.erb")
181
- #template is optional
182
- result = output.to_s
183
- ```
174
+ # Load a CSV with semicolon separator
175
+ data = input.from_uri('https://example.com/data.csv', col_sep: ';')
176
+ ```
184
177
 
185
- Into a file
186
- ```ruby
187
- output.to_uri("file://data.xml", {template: "test.erb", content_type: "application/xml"})
188
- #template is optional
189
- output.to_uri("file://data.json", {content_type: "application/json"})
190
- ```
178
+ #### Listener Methods
191
179
 
192
- Into a tar file stored in data
193
- ```ruby
194
- # create a tar file with a random name
195
- data = output.to_uri("file://data.json", {content_type: "application/json", tar:true})
196
- #choose
197
- data = output.to_uri("file://./test.json", {template: "test.erb", content_type: 'application/json', tar_name: "test.tar.gz"})
198
- ```
180
+ When a listener is defined (for directories or message queues):
181
+
182
+ - `.run`: Start the listener (blocking)
183
+ - `.stop`: Stop the listener
184
+ - `.pause`: Pause the listener
185
+ - `.running?`: Check if listener is running
186
+ - `.stopped?`: Check if listener is not running
187
+ - `.paused?`: Check if listener is paused
188
+ - `.on_message`: Handle to run every time a trigger event happens
189
+
190
+ ### Output
191
+
192
+ Output is an object for storing key/value pairs to be written to an output stream.
193
+
194
+ #### Basic Operations
199
195
 
200
- Other output methods
201
196
  ```ruby
202
- output.raw
203
- output.clear
204
- output.to_xml(template: 'test.erb', root: 'record') # root defaults to 'data'
205
- output.to_json
206
- output.flatten
207
- output.crush
197
+ # Set values
198
+ output[:name] = 'John'
199
+ output[:last_name] = 'Doe'
200
+
201
+ # Get all keys
208
202
  output.keys
203
+
204
+ # Check if key exists
205
+ output.key?(:name)
206
+
207
+ # Iterate through keys and values
208
+ output.each do |k, v|
209
+ puts "#{k}:#{v}"
210
+ end
211
+
212
+ # Add hash to output
213
+ output << { age: 22 }
214
+ puts output[:age] # 22
215
+
216
+ # Add array to output
217
+ output << [1, 2, 3, 4]
218
+ puts output['datap'] # [1, 2, 3, 4]
219
+
220
+ # Clear output
221
+ output.clear
209
222
  ```
210
223
 
211
- Into a temp directory
224
+ #### Output Methods
225
+
226
+ - `to_s(template = nil)`: Convert output to string (optional ERB template)
227
+ - `to_uri(uri, options = {})`: Write output to a URI
228
+ - Options:
229
+ - `template`: ERB template file
230
+ - `content_type`: MIME type
231
+ - `tar`: Create a tar file (true/false)
232
+ - `tar_name`: Custom name for tar file
233
+ - `to_tmp_file(template, directory)`: Write to temporary file
234
+ - `to_xml(options = {})`: Convert to XML
235
+ - Options:
236
+ - `template`: ERB template
237
+ - `root`: Root element name (defaults to 'data')
238
+ - `to_json`: Convert to JSON
239
+ - `flatten`: Flatten nested structures
240
+ - `crush`: Compress output
241
+ - `raw`: Get raw output data
242
+
243
+ #### Examples
244
+
245
+ Using ERB templates:
212
246
  ```ruby
213
- output.to_tmp_file("test.erb","directory")
214
- ```
215
-
216
- #### filter
217
- filter data from a hash using [JSONPath](http://goessner.net/articles/JsonPath/index.html)
247
+ # Template (test.erb)
248
+ # <names>
249
+ # <combined><%= data[:name] %> <%= data[:last_name] %></combined>
250
+ # <%= print data, :name, :first_name %>
251
+ # <%= print data, :last_name %>
252
+ # </names>
253
+
254
+ # Generate string from template
255
+ result = output.to_s("test.erb")
256
+
257
+ # Without template
258
+ result = output.to_s
259
+ ```
218
260
 
261
+ Writing to files:
219
262
  ```ruby
220
- filtered_data = filter(data, "$..metadata.record")
263
+ # Write to file with template
264
+ output.to_uri(
265
+ "file://data.xml",
266
+ {template: "test.erb", content_type: "application/xml"}
267
+ )
268
+
269
+ # Write to file without template
270
+ output.to_uri("file://data.json", {content_type: "application/json"})
221
271
  ```
222
272
 
223
- #### rules
224
- The RULES objects have a simple concept. Rules exist of 3 components:
225
- - a destination tag
226
- - a jsonpath filter to get the data
227
- - a lambda to execute on every filter hit
273
+ Creating tar archives:
274
+ ```ruby
275
+ # Create tar with random name
276
+ data = output.to_uri(
277
+ "file://data.json",
278
+ {content_type: "application/json", tar: true}
279
+ )
280
+
281
+ # Create tar with specific name
282
+ data = output.to_uri(
283
+ "file://./test.json",
284
+ {
285
+ template: "test.erb",
286
+ content_type: 'application/json',
287
+ tar_name: "test.tar.gz"
288
+ }
289
+ )
290
+ ```
228
291
 
229
- TODO: work in progress see test for examples on how to use
292
+ ### Filter
230
293
 
231
- ```
232
- RULE_SET
233
- RULES*
234
- FILTERS*
235
- LAMBDA*
236
- SUFFIX
294
+ Filter data from a hash using JSONPath.
295
+
296
+ ```ruby
297
+ # Extract data using JSONPath
298
+ filtered_data = filter(data, "$..metadata.record")
237
299
  ```
238
300
 
239
- ##### Examples
301
+ ### Rules
240
302
 
241
- Here you find different rule combination that are possible
303
+ Rules provide a systematic way to transform data using three components:
304
+ - A destination tag
305
+ - A JSONPath filter to get the data
306
+ - A lambda function to execute on every filter hit
242
307
 
243
- ``` ruby
244
- RULE_SETS = {
245
- 'rs_only_filter' => {
246
- 'only_filter' => "$.title"
247
- },
248
- 'rs_only_text' => {
249
- 'plain_text_tag' => {
250
- 'text' => 'hello world'
251
- }
252
- },
253
- 'rs_text_with_suffix' => {
254
- 'text_tag_with_suffix' => {
255
- 'text' => ['hello_world', {'suffix' => '-suffix'}]
256
- }
257
- },
258
- 'rs_map_with_json_filter' => {
259
- 'language' => {
260
- '@' => {'nl' => 'dut', 'fr' => 'fre', 'de' => 'ger', 'en' => 'eng'}
261
- }
262
- },
263
- 'rs_hash_with_json_filter' => {
264
- 'multiple_of_2' => {
265
- '@' => lambda { |d| d.to_i * 2 }
266
- }
267
- },
268
- 'rs_hash_with_multiple_json_filter' => {
269
- 'multiple_of' => [
270
- {'@' => lambda { |d| d.to_i * 2 }},
271
- {'@' => lambda { |d| d.to_i * 3 }}
272
- ]
273
- },
274
- 'rs_hash_with_json_filter_and_suffix' => {
275
- 'multiple_of_with_suffix' => {
276
- '@' => [lambda {|d| d.to_i*2}, 'suffix' => '-multiple_of_2']
277
- }
278
- },
279
- 'rs_hash_with_json_filter_and_multiple_lambdas' => {
280
- 'multiple_lambdas' => {
281
- '@' => [lambda {|d| d.to_i*2}, lambda {|d| Math.sqrt(d.to_i) }]
282
- }
283
- },
284
- 'rs_hash_with_json_filter_and_option' => {
285
- 'subjects' => {
286
- '$..subject' => [
287
- lambda {|d,o|
288
- {
289
- doc_id: o['id'],
290
- subject: d
291
- }
292
- }
293
- ]
308
+ #### Example Rule Sets
309
+
310
+ ```ruby
311
+ RULE_SETS = {
312
+ # Simple filter
313
+ 'rs_only_filter' => {
314
+ 'only_filter' => "$.title"
315
+ },
316
+
317
+ # Plain text
318
+ 'rs_only_text' => {
319
+ 'plain_text_tag' => {
320
+ 'text' => 'hello world'
321
+ }
322
+ },
323
+
324
+ # Text with suffix
325
+ 'rs_text_with_suffix' => {
326
+ 'text_tag_with_suffix' => {
327
+ 'text' => ['hello_world', {'suffix' => '-suffix'}]
328
+ }
329
+ },
330
+
331
+ # Map values
332
+ 'rs_map_with_json_filter' => {
333
+ 'language' => {
334
+ '@' => {'nl' => 'dut', 'fr' => 'fre', 'de' => 'ger', 'en' => 'eng'}
335
+ }
336
+ },
337
+
338
+ # Transform with lambda
339
+ 'rs_hash_with_json_filter' => {
340
+ 'multiple_of_2' => {
341
+ '@' => lambda { |d| d.to_i * 2 }
342
+ }
343
+ },
344
+
345
+ # Multiple transforms
346
+ 'rs_hash_with_multiple_json_filter' => {
347
+ 'multiple_of' => [
348
+ {'@' => lambda { |d| d.to_i * 2 }},
349
+ {'@' => lambda { |d| d.to_i * 3 }}
350
+ ]
351
+ },
352
+
353
+ # Transform with suffix
354
+ 'rs_hash_with_json_filter_and_suffix' => {
355
+ 'multiple_of_with_suffix' => {
356
+ '@' => [lambda {|d| d.to_i*2}, 'suffix' => '-multiple_of_2']
357
+ }
358
+ },
359
+
360
+ # Multiple lambdas
361
+ 'rs_hash_with_json_filter_and_multiple_lambdas' => {
362
+ 'multiple_lambdas' => {
363
+ '@' => [lambda {|d| d.to_i*2}, lambda {|d| Math.sqrt(d.to_i) }]
364
+ }
365
+ },
366
+
367
+ # With options
368
+ 'rs_hash_with_json_filter_and_option' => {
369
+ 'subjects' => {
370
+ '$..subject' => [
371
+ lambda {|d,o|
372
+ {
373
+ doc_id: o['id'],
374
+ subject: d
294
375
  }
295
- }
376
+ }
377
+ ]
378
+ }
379
+ }
380
+ }
296
381
  ```
297
382
 
298
-
299
- ***rules.run*** can have 4 parameters. First 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
383
+ #### Using Rules
300
384
 
301
- ##### List of engine directives:
302
- - _no_array_with_one_element: defaults to false. if the result is an array with 1 element just return the element.
303
-
304
- ###### example:
305
385
  ```ruby
306
- # apply RULESET "rs_hash_with_json_filter_and_option" to data
307
- include DataCollector::Core
308
- output.clear
309
- data = {'subject' => ['water', 'thermodynamics']}
386
+ # Apply rule set with options
387
+ data = {'subject' => ['water', 'thermodynamics']}
388
+ rules.run(RULE_SETS['rs_hash_with_json_filter_and_option'], data, output, {'id' => 1})
389
+
390
+ # Result:
391
+ # {
392
+ # "subjects":[
393
+ # {"doc_id":1,"subject":"water"},
394
+ # {"doc_id":1,"subject":"thermodynamics"}
395
+ # ]
396
+ # }
397
+ ```
310
398
 
311
- rules_ng.run(RULE_SETS['rs_hash_with_json_filter_and_option'], data, output, {'id' => 1})
399
+ Engine directives:
400
+ - `_no_array_with_one_element`: If true and the result is a single-element array, return just that element (default: false)
401
+ - `_no_array_with_one_literal`: If true and the result is an array holding a single literal (not an Array or Hash), return just that element (default: false)
312
402
 
313
- ```
403
+ ### Config
314
404
 
315
- Results in:
316
- ```json
317
- {
318
- "subjects":[
319
- {"doc_id":1,"subject":"water"},
320
- {"doc_id":1,"subject":"thermodynamics"}
321
- ]
322
- }
405
+ The config object points to a configuration file (default: "config.yml").
406
+ - Configuration is automatically reloaded when the file is modified
407
+ - All hash keys are converted to symbols (:key instead of "key")
408
+ - Writing values with []= immediately persists changes to the YAML file
409
+ - The class uses a singleton pattern - you cannot create instances with .new
410
+ - Environment variable substitution uses ${VAR_NAME} syntax in the YAML file
411
+
412
+ ```shell
413
+ export SECRET=my_secret
414
+ ```
415
+ __Example__ config.yml
416
+ ```yaml
417
+ cache: "/tmp"
418
+ password: ${SECRET}
419
+ active: true
323
420
  ```
324
421
 
422
+ __Usage__
423
+ ```ruby
424
+ # Set config path and filename
425
+ config.path = "/path/to/my/config"
426
+ config.name = "not_my_config.yml"
325
427
 
428
+ # Check config
429
+ puts config.version
430
+ puts config.include?(:key)
431
+ puts config.keys
326
432
 
327
- #### config
328
- config is an object that points to "config.yml" you can read and/or store data to this object.
433
+ # Read config value
434
+ config[:active]
329
435
 
330
- ___read___
331
- ```ruby
332
- config[:active]
333
- ```
334
- ___write___
335
- ```ruby
336
- config[:active] = false
337
- ```
338
- #### log
339
- Log to stdout
340
- ```ruby
341
- log("hello world")
342
- ```
343
- #### error
344
- Log an error to stdout
345
- ```ruby
346
- error("if you have an issue take a tissue")
436
+ # Write config value
437
+ config[:active] = false
347
438
  ```
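The `${VAR_NAME}` substitution described for config.yml could work along these lines. This is a minimal sketch, assuming plain text substitution before the YAML is parsed; `substitute_env` is a hypothetical helper, and the gem's actual implementation may differ.

```ruby
require 'yaml'

# Hypothetical helper (not the gem's code): replace ${VAR_NAME} placeholders
# with values from the given environment; unknown variables become ''.
def substitute_env(text, env = ENV)
  text.gsub(/\$\{(\w+)\}/) { env.fetch(Regexp.last_match(1), '') }
end

yaml = <<~YAML
  cache: "/tmp"
  password: ${SECRET}
  active: true
YAML

# A plain Hash stands in for ENV so the example does not depend on the shell.
resolved = substitute_env(yaml, { 'SECRET' => 'my_secret' })
settings = YAML.safe_load(resolved, symbolize_names: true)

puts settings[:password] # my_secret
```

Symbolizing the keys mirrors the README's statement that config keys are exposed as symbols.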
348
- ### logger
349
- Logs are by default written to Standard OUT. If you want to change where to log to.
439
+
440
+ ### Logging
441
+
350
442
  ```ruby
351
- f = File.open('/tmp/data.log', 'w')
352
- f.sync = true # do not buffer
353
- # add multiple log outputs
354
- logger(STDOUT, f)
443
+ # Log to stdout
444
+ log("hello world")
445
+
446
+ # Log error
447
+ error("if you have an issue take a tissue")
355
448
 
356
- #write to both STDOUT and /tmp/data.log
357
- log('Hello world')
449
+ # Configure logger outputs
450
+ f = File.open('/tmp/data.log', 'w')
451
+ f.sync = true # Do not buffer
452
+ logger(STDOUT, f) # Log to both STDOUT and file
358
453
  ```
359
454
 
360
- ## Example
361
- Input data ___test.csv___
455
+ ## Complete Example
456
+
457
+ Input data (test.csv):
362
458
  ```csv
363
459
  sequence, data
364
460
  1, apple
@@ -366,107 +462,86 @@ sequence, data
366
462
  3, peach
367
463
  ```
368
464
 
369
- Output template ___test.erb___
370
- ```ruby
371
- <data>
372
- <% data[:record].each do |d| %>
373
- <record sequence="<%= d[:sequence] %>">
374
- <%= print d, :data %>
375
- </record>
376
- <% end %>
465
+ Output template (test.erb):
466
+ ```erb
467
+ <data>
468
+ <% data[:record].each do |d| %>
469
+ <record sequence="<%= d[:sequence] %>">
470
+ <%= print d, :data %>
471
+ </record>
472
+ <% end %>
377
473
  </data>
378
474
  ```
379
475
 
476
+ Processing script:
380
477
  ```ruby
381
478
  require 'data_collector'
382
479
  include DataCollector::Core
383
480
 
481
+ # Read CSV data
384
482
  data = input.from_uri('file://test.csv')
483
+
484
+ # Transform data
385
485
  data.map{ |m| m[:sequence] *=2; m }
386
486
 
387
- output[:record]=data
487
+ # Store in output
488
+ output[:record] = data
388
489
 
490
+ # Generate result using template
389
491
  puts output.to_s('test.erb')
390
492
  ```
391
493
 
392
- Should give as output
494
+ Output:
393
495
  ```xml
394
496
  <data>
395
- <record sequence="11">
396
- <data> apple</data>
397
- </record>
398
- <record sequence="22">
399
- <data> banana</data>
400
- </record>
401
- <record sequence="33">
402
- <data> peach</data>
403
- </record>
497
+ <record sequence="11">
498
+ <data> apple</data>
499
+ </record>
500
+ <record sequence="22">
501
+ <data> banana</data>
502
+ </record>
503
+ <record sequence="33">
504
+ <data> peach</data>
505
+ </record>
404
506
  </data>
405
507
  ```
406
508
 
407
- You can provide options to input.from_uri for better reading CSV formats these
408
- are the same the Ruby [CSV](https://docs.ruby-lang.org/en/master/CSV.html#class-CSV-label-Options) class
409
-
410
- Loading a CSV file with **;** as the column separator
411
- ```ruby
412
- i = input.from_uri('https://support.staffbase.com/hc/en-us/article_attachments/360009197031/username.csv', col_sep: ';')
413
- ```
414
-
415
- ## Installation
416
-
417
- Add this line to your application's Gemfile:
418
-
419
- ```ruby
420
- gem 'data_collector'
421
- ```
422
-
423
- And then execute:
424
-
425
- $ bundle
426
-
427
- Or install it yourself as:
428
-
429
- $ gem install data_collector
430
-
431
- ## Usage
509
+ ## Full Pipeline Example
432
510
 
433
511
  ```ruby
434
512
  require 'data_collector'
435
513
 
436
514
  include DataCollector::Core
437
- # including core gives you a pipeline, input, output, filter, config, log, error object to work with
515
+
516
+ # Define rules
438
517
  RULES = {
439
- 'title' => '$..vertitle'
518
+ 'title' => '$..vertitle'
440
519
  }
441
- #create a PULL pipeline and schedule it to run every 5 seconds
520
+
521
+ # Create a PULL pipeline and schedule it to run every 5 seconds
442
522
  pipeline = DataCollector::Pipeline.new(schedule: 'PT5S')
443
523
 
444
524
  pipeline.on_message do |input, output|
525
+ # Fetch data
445
526
  data = input.from_uri('https://services3.libis.be/primo_artefact/lirias3611609')
527
+
528
+ # Apply rules
446
529
  rules.run(RULES, data, output)
447
- #puts JSON.pretty_generate(input.raw)
530
+
531
+ # Output results
448
532
  puts JSON.pretty_generate(output.raw)
449
533
  output.clear
450
534
 
535
+ # Stop after 3 runs
451
536
  if pipeline.run_count > 2
452
537
  log('stopping pipeline after three runs')
453
538
  pipeline.stop
454
539
  end
455
540
  end
456
- pipeline.run
457
541
 
542
+ pipeline.run
458
543
  ```
459
544
 
460
- ## Development
461
-
462
- After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
463
-
464
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
465
-
466
- ## Contributing
467
-
468
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/data_collector.
469
-
470
545
  ## License
471
546
 
472
547
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
@@ -22,6 +22,7 @@ module DataCollector
  end
  
  def self.path
+   init if @config_file_path.empty?
    @config_file_path
  end
  
@@ -52,6 +53,16 @@ module DataCollector
    @config.keys
  end
  
+ def self.raw
+   init
+   @config
+ end
+
+ def self.dig(*args)
+   init
+   @config.dig(*args)
+ end
+
  def self.init
    @config_file_name = 'config.yml' if @config_file_name.nil?
    discover_config_file_path
@@ -66,6 +77,8 @@ module DataCollector
  end
  
  def self.discover_config_file_path
+   @config_file_name = ENV['CONFIG_FILE_NAME'] if ENV.key?('CONFIG_FILE_NAME')
+
    if @config_file_path.nil? || @config_file_path.empty?
      if ENV.key?('CONFIG_FILE_PATH')
        @config_file_path = ENV['CONFIG_FILE_PATH']
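The hunks above add `raw` and `dig` class methods that return the parsed configuration hash and perform nested lookups on it. The stand-in below shows the intended shape of those calls. `MiniConfig` and its sample data are hypothetical; the real class also runs `init` first to locate and load the YAML file.

```ruby
# Hypothetical stand-in for the gem's config class: class-level state only,
# with `raw` and `dig` delegating to the parsed Hash as in the diff.
class MiniConfig
  @config = { cache: '/tmp', db: { host: 'localhost', port: 5432 } }

  # Return the whole parsed configuration.
  def self.raw
    @config
  end

  # Nested lookup; returns nil when any step of the path is missing.
  def self.dig(*args)
    @config.dig(*args)
  end
end

puts MiniConfig.dig(:db, :port)            # 5432
puts MiniConfig.dig(:db, :missing).inspect # nil
puts MiniConfig.raw.keys.inspect           # [:cache, :db]
```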
@@ -24,13 +24,13 @@ module DataCollector
  when Array
    odata = {}
    rule.each do |sub_rule|
-     d=apply_rule(tag, sub_rule, input_data, output_data, options)
+     d = apply_rule(tag, sub_rule, input_data, output_data, options)
      next if d.nil?
-     odata.merge!(d) {|k,v, n|
-       [v,n].flatten
+     odata.merge!(d) { |k, v, n|
+       [v, n].flatten
      }
    end
-   odata.each do |k,v|
+   odata.each do |k, v|
      output_data.data[k] = v
    end
    return output_data
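This hunk only changes whitespace, but the `merge!` block it touches is what combines several sub-rules that write to the same tag. A small standalone sketch of that behaviour follows; the sample `results` array is made up for illustration and stands in for the values returned by `apply_rule`.

```ruby
# Made-up sub-rule results that all target the same tag; nil is skipped.
results = [{ 'multiple_of' => [2, 4] }, nil, { 'multiple_of' => [3, 6] }]

odata = {}
results.each do |d|
  next if d.nil?

  # On a key collision, keep both values and flatten them into one array.
  odata.merge!(d) { |_key, old, new| [old, new].flatten }
end

puts odata['multiple_of'].inspect # [2, 4, 3, 6]
```

Without the block, `merge!` would keep only the last sub-rule's value for the tag.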
@@ -120,13 +120,16 @@ module DataCollector
  
  output_data.compact! if output_data.is_a?(Array)
  output_data.flatten! if output_data.is_a?(Array)
- if output_data.is_a?(Array) &&
+ if options.with_indifferent_access.key?('_no_array_with_one_literal') &&
+     options.with_indifferent_access['_no_array_with_one_literal'] &&
+     output_data.is_a?(Array) &&
      output_data.size == 1 &&
-     (output_data.first.is_a?(Array) || output_data.first.is_a?(Hash))
+     not((output_data.first.is_a?(Array) || output_data.first.is_a?(Hash)))
    output_data = output_data.first
  end
  
- if options.with_indifferent_access.key?('_no_array_with_one_element') && options.with_indifferent_access['_no_array_with_one_element'] &&
+ if options.with_indifferent_access.key?('_no_array_with_one_element') &&
+     options.with_indifferent_access['_no_array_with_one_element'] &&
      output_data.is_a?(Array) && output_data.size == 1
    output_data = output_data.first
  end
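The two directives in this hunk differ only in which single-element arrays they unwrap. The sketch below restates the two checks as a standalone method so the difference can be tried out. `unwrap` is a hypothetical name, and plain string keys are used here in place of ActiveSupport's `with_indifferent_access`.

```ruby
# Hypothetical restatement of the two checks from the diff.
def unwrap(output_data, options = {})
  # _no_array_with_one_literal: unwrap only when the single element is
  # neither an Array nor a Hash.
  if options['_no_array_with_one_literal'] &&
     output_data.is_a?(Array) && output_data.size == 1 &&
     !(output_data.first.is_a?(Array) || output_data.first.is_a?(Hash))
    output_data = output_data.first
  end

  # _no_array_with_one_element: unwrap any single-element array.
  if options['_no_array_with_one_element'] &&
     output_data.is_a?(Array) && output_data.size == 1
    output_data = output_data.first
  end

  output_data
end

puts unwrap(['water']).inspect                                       # ["water"]
puts unwrap(['water'], '_no_array_with_one_literal' => true).inspect # "water"
puts unwrap([{ 'a' => 1 }], '_no_array_with_one_literal' => true).class # Array
puts unwrap([{ 'a' => 1 }], '_no_array_with_one_element' => true).class # Hash
```

Note that before this change, a single-element array holding an Array or Hash was unwrapped unconditionally; after it, nothing is unwrapped unless one of the directives is set.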
@@ -1,4 +1,4 @@
  # encoding: utf-8
  module DataCollector
-   VERSION = "0.62.0"
+   VERSION = "0.64.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: data_collector
  version: !ruby/object:Gem::Version
-   version: 0.62.0
+   version: 0.64.0
  platform: ruby
  authors:
  - Mehmet Celik
@@ -407,7 +407,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.6.8
+ rubygems_version: 3.7.2
  specification_version: 4
  summary: ETL helper library
  test_files: []