estratto 1.0.0 → 1.0.1
- checksums.yaml +4 -4
- data/Gemfile.lock +3 -1
- data/README.md +248 -2
- data/estratto.gemspec +2 -0
- data/lib/estratto/content.rb +11 -0
- data/lib/estratto/encoder.rb +21 -0
- data/lib/estratto/parser.rb +4 -4
- data/lib/estratto/version.rb +1 -1
- metadata +18 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 768680440af76aff2f08d4a72dad782fbfa17c70
|
4
|
+
data.tar.gz: 68974ef5b81a15b4fd57814f21025b93a5cd2ac8
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 0bff7f6b4c314824bd827e8a61a8459f2b9af8e2c8511fb319b244beb5d7d1034102293fda7843a71ed2c5a990885be763ae3507d79b2264d52d3d0d3eca4597
|
7
|
+
data.tar.gz: 31f0f3e48656b4359972033f5b188525dc34bacf78128f9d73a4ecbe990c9c6925cbbba8535d9e80fb4373eb248f6c32a28cf913e8bdf654ddfaaf2535516386
|
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -1,10 +1,18 @@
|
|
1
1
|
# Estratto
|
2
2
|
|
3
|
+
[](https://badge.fury.io/rb/estratto)
|
3
4
|
[](https://travis-ci.com/Rynaro/estratto)
|
4
5
|
[](https://coveralls.io/github/Rynaro/estratto?branch=master)
|
5
6
|
[](https://codeclimate.com/github/Rynaro/estratto/maintainability)
|
6
7
|
|
7
|
-
|
8
|
+
[](https://waffle.io/Rynaro/estratto)
|
9
|
+
|
10
|
+
> Estratto is an easy-to-use parser built on YAML templates. It gives developers and non-developers a simple interface for extracting data from fixed-width files.
|
11
|
+
|
12
|
+
## Motivation
|
13
|
+
|
14
|
+
In many scenarios, data processing is a crucial step in integrating with partner systems or in storing data. But writing parsers and importers for these text formats is tedious and leads to duplicated code across projects.
|
15
|
+
This project was born to help developers reduce the time spent on this task, or to delegate it entirely to another team.
|
8
16
|
|
9
17
|
## Installation
|
10
18
|
|
@@ -24,7 +32,245 @@ Or install it yourself as:
|
|
24
32
|
|
25
33
|
## Usage
|
26
34
|
|
27
|
-
|
35
|
+
**Estratto** takes two simple inputs: the _data file to parse_ and its _equivalent YAML layout_.
|
36
|
+
|
37
|
+
Example of a default call for parsing:
|
38
|
+
|
39
|
+
```ruby
|
40
|
+
Estratto::Document.process(file: 'path/to/data.txt', layout: 'path/to/layout.yml')
|
41
|
+
```
|
42
|
+
|
43
|
+
### Layout specifications
|
44
|
+
|
45
|
+
Fixed-width files are sometimes ~always~ painful for humans to read, and the layout manual comes in a very useful PDF or spreadsheet format.
|
46
|
+
|
47
|
+
Here, we'll try to make things fun again, or at least less painful. :joy:
|
48
|
+
|
49
|
+
The base layout for YAML file is:
|
50
|
+
|
51
|
+
```yaml
|
52
|
+
layout:
|
53
|
+
name: 'jojo stand users'
|
54
|
+
multi-register: true
|
55
|
+
prefix: 0..1
|
56
|
+
registers:
|
57
|
+
- register: '01'
|
58
|
+
fields:
|
59
|
+
- name: name
|
60
|
+
range: 2..45
|
61
|
+
type: String
|
62
|
+
- name: stand
|
63
|
+
range: 46..75
|
64
|
+
type: String
|
65
|
+
```
|
66
|
+
|
67
|
+
The output will be an array of hashes that mirrors your columns:
|
68
|
+
|
69
|
+
```ruby
|
70
|
+
[
|
71
|
+
{
|
72
|
+
name: 'Jotaro Kujo',
|
73
|
+
stand: 'Star Platinum'
|
74
|
+
},
|
75
|
+
{
|
76
|
+
name: 'Giorno Giovanna',
|
77
|
+
stand: 'Golden Experience Requiem'
|
78
|
+
},
|
79
|
+
{
|
80
|
+
name: 'Jobin Higashikata',
|
81
|
+
stand: 'Speed King'
|
82
|
+
}
|
83
|
+
]
|
84
|
+
```
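To make the mapping from layout to output concrete, here is a minimal sketch of prefix-based slicing in plain Ruby. It is not Estratto's implementation: `sample_layout` and `extract` are hypothetical names, the layout is written as a Ruby hash instead of being loaded from YAML, and stripping the padding is an assumption made for readability.

```ruby
# Hypothetical in-memory stand-in for the YAML layout above.
def sample_layout
  {
    prefix: 0..1,
    registers: {
      '01' => [
        { name: :name,  range: 2..45 },
        { name: :stand, range: 46..75 }
      ]
    }
  }
end

# Slice one fixed-width line into a hash keyed by field name.
# Returns nil when the line's prefix matches no register.
def extract(line, layout)
  fields = layout[:registers][line[layout[:prefix]]]
  return nil if fields.nil?

  # Stripping here is an assumption; it only removes the column padding.
  fields.to_h { |field| [field[:name], line[field[:range]].to_s.strip] }
end

line = '01' + 'Jotaro Kujo'.ljust(44) + 'Star Platinum'.ljust(30)
p extract(line, sample_layout)
```

Ranges are inclusive and zero-based, so `2..45` covers 44 characters and `46..75` covers 30.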
|
85
|
+
|
86
|
+
The structure follows this strict directive:
|
87
|
+
```yaml
|
88
|
+
layout:
|
89
|
+
(base configuration)
|
90
|
+
registers:
|
91
|
+
(layouts)
|
92
|
+
```
|
93
|
+
|
94
|
+
Currently, **Estratto** supports these types of fixed-width layouts:
|
95
|
+
|
96
|
+
- Batch prefix based registers
|
97
|
+
- Mono layout based registers _(development)_
|
98
|
+
|
99
|
+
|
100
|
+
### UTF-8 Conversion
|
101
|
+
|
102
|
+
Estratto uses the [CharlockHolmes](https://github.com/brianmario/charlock_holmes) gem to detect the file content encoding and convert it to UTF-8.
|
103
|
+
This approach prevents invalid characters from being present in the output.
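The conversion step itself can be illustrated with Ruby's standard library alone. The sketch below is a stand-in, not what Estratto does: it assumes the source encoding is already known (ISO-8859-1 here), whereas the gem detects it with CharlockHolmes. `to_utf8` is a hypothetical helper name.

```ruby
# Re-tag raw bytes with their real encoding, then transcode to UTF-8.
# No detection happens here; the caller must supply the source encoding.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

latin1 = "Jos\xE9".b # "José" as ISO-8859-1 bytes
utf8 = to_utf8(latin1, 'ISO-8859-1')
puts utf8.encoding
puts utf8.valid_encoding?
```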
|
104
|
+
|
105
|
+
### Type Coercion
|
106
|
+
|
107
|
+
**Estratto** supports type coercion, with some perks called _formats_, on layout file.
|
108
|
+
|
109
|
+
Data types supported by **Estratto**:
|
110
|
+
|
111
|
+
- String
|
112
|
+
- Integer
|
113
|
+
- Float
|
114
|
+
- DateTime
|
115
|
+
- Date
|
116
|
+
|
117
|
+
The default data type for a field is `String` if no type is set in the register's field list.
|
118
|
+
|
119
|
+
A register's field list always follows this base structure:
|
120
|
+
|
121
|
+
```yaml
|
122
|
+
fields:
|
123
|
+
- name: name
|
124
|
+
range: 2..12
|
125
|
+
type: String
|
126
|
+
formats:
|
127
|
+
strip: true
|
128
|
+
```
|
129
|
+
|
130
|
+
`name` identifies the field; this value becomes the symbol key in the parsed hash.
|
131
|
+
|
132
|
+
`range` is the position of the data within each line (the first index is 0).
|
133
|
+
|
134
|
+
`type` is the data type the value is coerced to.
|
135
|
+
|
136
|
+
`formats` receives a specific configuration for data type. Here we can format Strings, and adjust precision for unformatted Float data.
|
137
|
+
|
138
|
+
### Formats
|
139
|
+
|
140
|
+
Formats are the tool for dealing with the "surprises" this type of file can throw at us: very large string fields padded with blank space, DateTime values with suspicious formatting, or Float values without any decimal point even though the manual describes them as _"Decimal(15, 2)"_.
|
141
|
+
|
142
|
+
#### String
|
143
|
+
|
144
|
+
##### strip
|
145
|
+
|
146
|
+
Works like Ruby's standard `String#strip` method.
|
147
|
+
|
148
|
+
```yaml
|
149
|
+
strip: true
|
150
|
+
```
|
151
|
+
|
152
|
+
Output example:
|
153
|
+
|
154
|
+
```ruby
|
155
|
+
#raw_data
|
156
|
+
'Hierophant Green '
|
157
|
+
# with strip clause
|
158
|
+
'Hierophant Green'
|
159
|
+
```
|
160
|
+
|
161
|
+
#### Integer
|
162
|
+
|
163
|
+
A simple integer converter, useful when you need to deal with IDs.
|
164
|
+
|
165
|
+
Currently there are no formats for Integer. :)
|
166
|
+
|
167
|
+
```ruby
|
168
|
+
#raw_data
|
169
|
+
'000123'
|
170
|
+
# coerced
|
171
|
+
123
|
172
|
+
#raw_data
|
173
|
+
'123'
|
174
|
+
# coerced
|
175
|
+
123
|
176
|
+
#raw_data
|
177
|
+
'a'
|
178
|
+
# coerced
|
179
|
+
0
|
180
|
+
```
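The three results above match the behavior of Ruby's `String#to_i`, which ignores leading zeros and returns `0` for non-numeric input. The sketch below assumes that is the underlying conversion (the README does not say so) and contrasts it with a stricter, hypothetical alternative that reports bad input as `nil`.

```ruby
# Lenient coercion: mirrors the README examples ('a' becomes 0).
def coerce_integer(raw)
  raw.to_i
end

# Stricter variant (hypothetical, not part of Estratto):
# returns nil instead of silently turning garbage into 0.
def coerce_integer_strict(raw)
  Integer(raw, 10, exception: false)
end

p coerce_integer('000123')
p coerce_integer('a')
p coerce_integer_strict('a')
```

The lenient form is convenient for zero-padded IDs, but it cannot distinguish a real `0` from an unreadable field.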
|
181
|
+
|
182
|
+
#### Float
|
183
|
+
|
184
|
+
Float is one of the most important types here. Fixed-width files always seem to use a _non-logical_ format to deliver this information.
|
185
|
+
|
186
|
+
##### precision
|
187
|
+
|
188
|
+
```yaml
|
189
|
+
precision: <integer>
|
190
|
+
```
|
191
|
+
|
192
|
+
Examples:
|
193
|
+
|
194
|
+
```yaml
|
195
|
+
precision: 2
|
196
|
+
```
|
197
|
+
|
198
|
+
```ruby
|
199
|
+
#raw data
|
200
|
+
'12345'
|
201
|
+
# with precision
|
202
|
+
123.45
|
203
|
+
```
|
204
|
+
|
205
|
+
```yaml
|
206
|
+
precision: 3
|
207
|
+
```
|
208
|
+
|
209
|
+
```ruby
|
210
|
+
#raw data
|
211
|
+
'12345'
|
212
|
+
# with precision
|
213
|
+
12.345
|
214
|
+
```
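The arithmetic behind `precision` is a division by a power of ten: the last `precision` digits of the raw value become the fractional part. A minimal sketch, assuming the raw field contains only digits; `apply_precision` is a hypothetical name, not Estratto's API.

```ruby
# Treat the last `precision` digits of a digit-only string as decimals.
def apply_precision(raw, precision)
  raw.to_i.fdiv(10**precision)
end

p apply_precision('12345', 2)
p apply_precision('12345', 3)
```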
|
215
|
+
|
216
|
+
##### comma_format
|
217
|
+
|
218
|
+
```yaml
|
219
|
+
comma_format: <boolean>
|
220
|
+
```
|
221
|
+
|
222
|
+
Examples:
|
223
|
+
|
224
|
+
```yaml
|
225
|
+
comma_format: true
|
226
|
+
```
|
227
|
+
|
228
|
+
```ruby
|
229
|
+
#raw data
|
230
|
+
'123,45'
|
231
|
+
# with comma formats
|
232
|
+
123.45
|
233
|
+
```
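One simple way to obtain this result in plain Ruby is to swap the decimal comma for a point before converting. This is a sketch under that assumption, with a hypothetical helper name; it does not handle thousands separators.

```ruby
# Replace the decimal comma with a point, then convert to Float.
def parse_comma_float(raw)
  raw.tr(',', '.').to_f
end

p parse_comma_float('123,45')
```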
|
234
|
+
|
235
|
+
|
236
|
+
#### DateTime and Date
|
237
|
+
|
238
|
+
`DateTime` and `Date` share the same format attributes. The difference is that one returns a DateTime and the other always returns a Date.
|
239
|
+
|
240
|
+
|
241
|
+
##### format
|
242
|
+
|
243
|
+
```yaml
|
244
|
+
format: <ruby strptime format pattern>
|
245
|
+
```
|
246
|
+
|
247
|
+
Examples
|
248
|
+
|
249
|
+
```yaml
|
250
|
+
format: '%Y%m%d'
|
251
|
+
```
|
252
|
+
|
253
|
+
```ruby
|
254
|
+
#raw data
|
255
|
+
'20180101'
|
256
|
+
# with format
|
257
|
+
#<DateTime: 2018-01-01T00:00:00+00:00 ...>
|
258
|
+
```
|
259
|
+
|
260
|
+
```yaml
|
261
|
+
format: '%d/%m/%Y'
|
262
|
+
```
|
263
|
+
|
264
|
+
```ruby
|
265
|
+
#raw data
|
266
|
+
'01/01/2018'
|
267
|
+
# with format
|
268
|
+
#<DateTime: 2018-01-01T00:00:00+00:00 ...>
|
269
|
+
```
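Both patterns above are ordinary `strptime` patterns from Ruby's `date` library. The sketch below shows the two conversions side by side; the helper names are hypothetical, and whether Estratto calls `strptime` in exactly this way is an assumption.

```ruby
require 'date'

# Parse a raw field into a Date using a strptime pattern.
def coerce_date(raw, format)
  Date.strptime(raw, format)
end

# Same pattern, but the result carries a time component (midnight, UTC offset 0).
def coerce_datetime(raw, format)
  DateTime.strptime(raw, format)
end

puts coerce_date('20180101', '%Y%m%d')
puts coerce_datetime('01/01/2018', '%d/%m/%Y')
```

Input that does not match the pattern makes `strptime` raise an error, so malformed date columns surface immediately instead of producing a wrong value.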
|
270
|
+
|
271
|
+
## Tests
|
272
|
+
|
273
|
+
Run `rake spec`.
|
28
274
|
|
29
275
|
## Development
|
30
276
|
|
data/estratto.gemspec
CHANGED
@@ -20,6 +20,8 @@ Gem::Specification.new do |spec|
|
|
20
20
|
spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
|
21
21
|
spec.require_paths = ["lib"]
|
22
22
|
|
23
|
+
spec.add_dependency "charlock_holmes"
|
24
|
+
|
23
25
|
spec.add_development_dependency "bundler", "~> 1.17"
|
24
26
|
spec.add_development_dependency "rake", "~> 10.0"
|
25
27
|
spec.add_development_dependency "rspec", "~> 3.0"
|
data/lib/estratto/encoder.rb
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
require 'charlock_holmes'
|
2
|
+
|
3
|
+
module Estratto
|
4
|
+
class Encoder
|
5
|
+
attr_reader :content
|
6
|
+
|
7
|
+
def initialize(content)
|
8
|
+
@content = content
|
9
|
+
end
|
10
|
+
|
11
|
+
def encode
|
12
|
+
CharlockHolmes::Converter.convert(content, encoding, 'UTF-8')
|
13
|
+
end
|
14
|
+
|
15
|
+
private
|
16
|
+
|
17
|
+
def encoding
|
18
|
+
CharlockHolmes::EncodingDetector.detect(content)[:encoding]
|
19
|
+
end
|
20
|
+
end
|
21
|
+
end
|
data/lib/estratto/parser.rb
CHANGED
@@ -1,4 +1,5 @@
|
|
1
1
|
require_relative 'register'
|
2
|
+
require_relative 'content'
|
2
3
|
|
3
4
|
module Estratto
|
4
5
|
class Parser
|
@@ -10,16 +11,15 @@ module Estratto
|
|
10
11
|
end
|
11
12
|
|
12
13
|
def perform
|
13
|
-
@data ||=
|
14
|
+
@data ||= raw_content.map do |line|
|
14
15
|
register_layout = layout.register_fields_for(line[layout.prefix_range])
|
15
16
|
next if register_layout.nil?
|
16
17
|
Register.new(line, register_layout).refine
|
17
18
|
end.compact
|
18
19
|
end
|
19
20
|
|
20
|
-
def
|
21
|
-
@raw_data
|
21
|
+
def raw_content
|
22
|
+
@raw_data = Content.for(file_path)
|
22
23
|
end
|
23
|
-
|
24
24
|
end
|
25
25
|
end
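The new `perform` body relies on a common Ruby idiom: `next` inside a `map` block yields `nil` for that element, and `compact` then removes those entries. A standalone sketch of the idiom, using a hypothetical helper and plain strings in place of `Register` objects:

```ruby
# Hypothetical helper: keep only lines whose two-character prefix is known.
def refine_lines(lines, known_prefixes)
  lines.map do |line|
    next unless known_prefixes.include?(line[0..1]) # yields nil for this line
    line.chomp
  end.compact # drop the nils left by skipped lines
end

p refine_lines(["01abc\n", "99zzz\n", "01def\n"], ['01'])
```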
|
data/lib/estratto/version.rb
CHANGED
metadata
CHANGED
@@ -1,15 +1,29 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: estratto
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0.
|
4
|
+
version: 1.0.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Henrique A. Lavezzo
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2019-01-
|
11
|
+
date: 2019-01-21 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
|
+
- !ruby/object:Gem::Dependency
|
14
|
+
name: charlock_holmes
|
15
|
+
requirement: !ruby/object:Gem::Requirement
|
16
|
+
requirements:
|
17
|
+
- - ">="
|
18
|
+
- !ruby/object:Gem::Version
|
19
|
+
version: '0'
|
20
|
+
type: :runtime
|
21
|
+
prerelease: false
|
22
|
+
version_requirements: !ruby/object:Gem::Requirement
|
23
|
+
requirements:
|
24
|
+
- - ">="
|
25
|
+
- !ruby/object:Gem::Version
|
26
|
+
version: '0'
|
13
27
|
- !ruby/object:Gem::Dependency
|
14
28
|
name: bundler
|
15
29
|
requirement: !ruby/object:Gem::Requirement
|
@@ -85,6 +99,7 @@ files:
|
|
85
99
|
- bin/setup
|
86
100
|
- estratto.gemspec
|
87
101
|
- lib/estratto.rb
|
102
|
+
- lib/estratto/content.rb
|
88
103
|
- lib/estratto/data/base.rb
|
89
104
|
- lib/estratto/data/coercer.rb
|
90
105
|
- lib/estratto/data/date.rb
|
@@ -93,6 +108,7 @@ files:
|
|
93
108
|
- lib/estratto/data/integer.rb
|
94
109
|
- lib/estratto/data/string.rb
|
95
110
|
- lib/estratto/document.rb
|
111
|
+
- lib/estratto/encoder.rb
|
96
112
|
- lib/estratto/helpers/range.rb
|
97
113
|
- lib/estratto/layout/base.rb
|
98
114
|
- lib/estratto/layout/factory.rb
|