estratto 1.0.0 → 1.0.1
- checksums.yaml +4 -4
- data/Gemfile.lock +3 -1
- data/README.md +248 -2
- data/estratto.gemspec +2 -0
- data/lib/estratto/content.rb +11 -0
- data/lib/estratto/encoder.rb +21 -0
- data/lib/estratto/parser.rb +4 -4
- data/lib/estratto/version.rb +1 -1
- metadata +18 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 768680440af76aff2f08d4a72dad782fbfa17c70
|
4
|
+
data.tar.gz: 68974ef5b81a15b4fd57814f21025b93a5cd2ac8
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 0bff7f6b4c314824bd827e8a61a8459f2b9af8e2c8511fb319b244beb5d7d1034102293fda7843a71ed2c5a990885be763ae3507d79b2264d52d3d0d3eca4597
|
7
|
+
data.tar.gz: 31f0f3e48656b4359972033f5b188525dc34bacf78128f9d73a4ecbe990c9c6925cbbba8535d9e80fb4373eb248f6c32a28cf913e8bdf654ddfaaf2535516386
|
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -1,10 +1,18 @@
|
|
1
1
|
# Estratto
|
2
2
|
|
3
|
+
[![Gem Version](https://badge.fury.io/rb/estratto.svg)](https://badge.fury.io/rb/estratto)
|
3
4
|
[![Build Status](https://travis-ci.com/Rynaro/estratto.svg?branch=master)](https://travis-ci.com/Rynaro/estratto)
|
4
5
|
[![Coverage Status](https://coveralls.io/repos/github/Rynaro/estratto/badge.svg?branch=master)](https://coveralls.io/github/Rynaro/estratto?branch=master)
|
5
6
|
[![Maintainability](https://api.codeclimate.com/v1/badges/46532b90e850401fce72/maintainability)](https://codeclimate.com/github/Rynaro/estratto/maintainability)
|
6
7
|
|
7
|
-
|
8
|
+
[![Waffle.io - Columns and their card count](https://badge.waffle.io/Rynaro/estratto.svg?columns=all)](https://waffle.io/Rynaro/estratto)
|
9
|
+
|
10
|
+
> Estratto is an easy-to-handle parser based on a YAML templating engine. It gives developers and non-developers an easy interface for extracting data from fixed-width files.
|
11
|
+
|
12
|
+
## Motivation
|
13
|
+
|
14
|
+
In many scenarios, data processing is a crucial step of an integration with partner systems or of data storage. But writing code to parse and import data from these text formats is boring, and it causes code duplication in every project.
|
15
|
+
This project was born to help developers reduce the time spent on this task, or to let them delegate it entirely to another team.
|
8
16
|
|
9
17
|
## Installation
|
10
18
|
|
@@ -24,7 +32,245 @@ Or install it yourself as:
|
|
24
32
|
|
25
33
|
## Usage
|
26
34
|
|
27
|
-
|
35
|
+
**Estratto** takes two simple inputs: the _data file to parse_ and an _equivalent YAML layout_.
|
36
|
+
|
37
|
+
Example of a default call for parsing:
|
38
|
+
|
39
|
+
```ruby
|
40
|
+
Estratto::Document.process(file: 'path/to/data.txt', layout: 'path/to/layout.yml')
|
41
|
+
```
|
42
|
+
|
43
|
+
### Layout specifications
|
44
|
+
|
45
|
+
Fixed-width files are sometimes ~always~ painful for humans to read, and the layout manual comes in a very useful PDF or spreadsheet format.
|
46
|
+
|
47
|
+
Here, we'll try to make things fun again, or less painful. :joy:
|
48
|
+
|
49
|
+
The base layout for the YAML file is:
|
50
|
+
|
51
|
+
```yaml
|
52
|
+
layout:
|
53
|
+
name: 'jojo stand users'
|
54
|
+
multi-register: true
|
55
|
+
prefix: 0..1
|
56
|
+
registers:
|
57
|
+
- register: '01'
|
58
|
+
fields:
|
59
|
+
- name: name
|
60
|
+
range: 2..45
|
61
|
+
type: String
|
62
|
+
- name: stand
|
63
|
+
range: 46..75
|
64
|
+
type: String
|
65
|
+
```
|
66
|
+
|
67
|
+
And the output will be an array of hashes reflecting your columns:
|
68
|
+
|
69
|
+
```ruby
|
70
|
+
[
|
71
|
+
{
|
72
|
+
name: 'Jotaro Kujo',
|
73
|
+
stand: 'Star Platinum'
|
74
|
+
},
|
75
|
+
{
|
76
|
+
name: 'Giorno Giovanna',
|
77
|
+
stand: 'Golden Experience Requiem'
|
78
|
+
},
|
79
|
+
{
|
80
|
+
name: 'Jobin Higashikata',
|
81
|
+
stand: 'Speed King'
|
82
|
+
}
|
83
|
+
]
|
84
|
+
```
|
85
|
+
|
86
|
+
The structure strictly follows this directive:
|
87
|
+
```yaml
|
88
|
+
layout:
|
89
|
+
(base configuration)
|
90
|
+
registers:
|
91
|
+
(layouts)
|
92
|
+
```
|
93
|
+
|
94
|
+
Currently, **Estratto** supports these types of fixed-width layouts:
|
95
|
+
|
96
|
+
- Batch prefix based registers
|
97
|
+
- Mono layout based registers _(development)_
|
98
|
+
|
99
|
+
|
100
|
+
### UTF-8 Conversion
|
101
|
+
|
102
|
+
Estratto uses the [CharlockHolmes](https://github.com/brianmario/charlock_holmes) gem to detect the file content encoding and convert it to UTF-8.
|
103
|
+
This approach prevents invalid characters from being present in the output.
|
104
|
+
|
105
|
+
### Type Coercion
|
106
|
+
|
107
|
+
**Estratto** supports type coercion, with some perks called _formats_, in the layout file.
|
108
|
+
|
109
|
+
Data types supported by **Estratto**:
|
110
|
+
|
111
|
+
- String
|
112
|
+
- Integer
|
113
|
+
- Float
|
114
|
+
- DateTime
|
115
|
+
- Date
|
116
|
+
|
117
|
+
The default data type for a field is `String` when no type is set in the register's field list.
|
118
|
+
|
119
|
+
A register's field list always follows this base structure:
|
120
|
+
|
121
|
+
```yaml
|
122
|
+
fields:
|
123
|
+
- name: name
|
124
|
+
range: 2..12
|
125
|
+
type: String
|
126
|
+
formats:
|
127
|
+
strip: true
|
128
|
+
```
|
129
|
+
|
130
|
+
`name` identifies the field; this value becomes the symbol key in the parsed hash.
|
131
|
+
|
132
|
+
`range` is the position of the data within each line (the first index is 0).
|
133
|
+
|
134
|
+
`type` is the data type the value is coerced to.
|
135
|
+
|
136
|
+
`formats` receives configuration specific to the data type. Here we can format Strings and adjust the precision of unformatted Float data.
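
As a rough illustration of how a zero-based `range` maps onto a line, the sketch below slices a made-up record with plain Ruby ranges. The sample line and the variable names are assumptions chosen for this example; it is not Estratto's own code.

```ruby
# Hypothetical fixed-width line: a 2-character register prefix followed by
# a name field padded with spaces to 11 characters (indexes 2..12).
line = '01' + 'Jobin'.ljust(11)

prefix = line[0..1]   # characters at indexes 0 and 1
name   = line[2..12]  # characters at indexes 2 through 12, inclusive

puts prefix.inspect      # "01"
puts name.length         # 11 ("Jobin" followed by six spaces)
puts name.strip.inspect  # "Jobin"
```

Ruby ranges written with `..` include both ends, so `2..12` covers eleven characters.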
|
137
|
+
|
138
|
+
### Formats
|
139
|
+
|
140
|
+
Formats are the resource for dealing with the "surprises" this type of file can hold: very wide string fields padded with blank space, DateTime values with suspicious formatting, or Float values without any decimal point even though the manual describes them as _"Decimal(15, 2)"_.
|
141
|
+
|
142
|
+
#### String
|
143
|
+
|
144
|
+
##### strip
|
145
|
+
|
146
|
+
Works like Ruby's `String#strip` method.
|
147
|
+
|
148
|
+
```yaml
|
149
|
+
strip: true
|
150
|
+
```
|
151
|
+
|
152
|
+
Output example:
|
153
|
+
|
154
|
+
```ruby
|
155
|
+
#raw_data
|
156
|
+
'Hierophant Green '
|
157
|
+
# with strip clause
|
158
|
+
'Hierophant Green'
|
159
|
+
```
|
160
|
+
|
161
|
+
#### Integer
|
162
|
+
|
163
|
+
A simple converter for integer values. Useful when you need to deal with ids.
|
164
|
+
|
165
|
+
Currently we don't have any formats for Integer. :)
|
166
|
+
|
167
|
+
```ruby
|
168
|
+
#raw_data
|
169
|
+
'000123'
|
170
|
+
# coerced
|
171
|
+
123
|
172
|
+
#raw_data
|
173
|
+
'123'
|
174
|
+
# coerced
|
175
|
+
123
|
176
|
+
#raw_data
|
177
|
+
'a'
|
178
|
+
# coerced
|
179
|
+
0
|
180
|
+
```
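
These outputs match what Ruby's `String#to_i` returns, so one plausible way to picture the coercion is the sketch below. That Estratto relies on `to_i` internally is an assumption and is not documented here.

```ruby
# Assumption: coercion behaves like String#to_i, which ignores leading
# zeros and returns 0 when the text does not start with a number.
samples = ['000123', '123', 'a']
coerced = samples.map(&:to_i)

p coerced  # [123, 123, 0]
```

Note that a non-numeric value silently becomes `0` instead of raising, so a `0` in the output may signal malformed input.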
|
181
|
+
|
182
|
+
#### Float
|
183
|
+
|
184
|
+
Float is one of the most important types here. Fixed-width files always follow a _non-logical_ format to deliver information.
|
185
|
+
|
186
|
+
##### precision
|
187
|
+
|
188
|
+
```yaml
|
189
|
+
precision: <integer>
|
190
|
+
```
|
191
|
+
|
192
|
+
Examples:
|
193
|
+
|
194
|
+
```yaml
|
195
|
+
precision: 2
|
196
|
+
```
|
197
|
+
|
198
|
+
```ruby
|
199
|
+
#raw data
|
200
|
+
'12345'
|
201
|
+
# with precision
|
202
|
+
123.45
|
203
|
+
```
|
204
|
+
|
205
|
+
```yaml
|
206
|
+
precision: 3
|
207
|
+
```
|
208
|
+
|
209
|
+
```ruby
|
210
|
+
#raw data
|
211
|
+
'12345'
|
212
|
+
# with precision
|
213
|
+
12.345
|
214
|
+
```
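
One way to picture `precision` is as the number of trailing digits that belong after the decimal point. The helper below, `apply_precision`, is a hypothetical name used only for this sketch and is not part of Estratto's API.

```ruby
# Hypothetical helper: treat the last `precision` digits of an unformatted
# numeric string as the fractional part.
def apply_precision(raw, precision)
  raw.to_i / (10.0**precision)
end

p apply_precision('12345', 2)  # 123.45
p apply_precision('12345', 3)  # 12.345
```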
|
215
|
+
|
216
|
+
##### comma_format
|
217
|
+
|
218
|
+
```yaml
|
219
|
+
comma_format: <boolean>
|
220
|
+
```
|
221
|
+
|
222
|
+
Examples:
|
223
|
+
|
224
|
+
```yaml
|
225
|
+
comma_format: true
|
226
|
+
```
|
227
|
+
|
228
|
+
```ruby
|
229
|
+
#raw data
|
230
|
+
'123,45'
|
231
|
+
# with comma formats
|
232
|
+
123.45
|
233
|
+
```
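
The effect of `comma_format` can be approximated in plain Ruby by swapping the decimal comma for a point before converting. This is a sketch of the idea under that assumption, not Estratto's implementation.

```ruby
# Assumption: comma_format replaces the decimal comma with a point
# before the text is converted to a Float.
raw = '123,45'

with_comma_format = raw.tr(',', '.').to_f
without_it        = raw.to_f  # parsing stops at the comma

p with_comma_format  # 123.45
p without_it         # 123.0
```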
|
234
|
+
|
235
|
+
|
236
|
+
#### DateTime and Date
|
237
|
+
|
238
|
+
`DateTime` and `Date` share the same format attributes. The difference is that one outputs a DateTime and the other always outputs a Date.
|
239
|
+
|
240
|
+
|
241
|
+
##### format
|
242
|
+
|
243
|
+
```yaml
|
244
|
+
format: <ruby strptime format pattern>
|
245
|
+
```
|
246
|
+
|
247
|
+
Examples
|
248
|
+
|
249
|
+
```yaml
|
250
|
+
format: '%Y%m%d'
|
251
|
+
```
|
252
|
+
|
253
|
+
```ruby
|
254
|
+
#raw data
|
255
|
+
'20180101'
|
256
|
+
# with format
|
257
|
+
#<DateTime: 2018-01-01T00:00:00+00:00 ...>
|
258
|
+
```
|
259
|
+
|
260
|
+
```yaml
|
261
|
+
format: '%d/%m/%Y'
|
262
|
+
```
|
263
|
+
|
264
|
+
```ruby
|
265
|
+
#raw data
|
266
|
+
'01/01/2018'
|
267
|
+
# with format
|
268
|
+
#<DateTime: 2018-01-01T00:00:00+00:00 ...>
|
269
|
+
```
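
Because `format` is a strptime pattern, the same strings can be tried directly against Ruby's standard `date` library. Whether Estratto calls exactly these methods is an assumption; the sketch only shows how the patterns above parse.

```ruby
require 'date'

# Parse the two sample values with the patterns used in the examples.
compact = DateTime.strptime('20180101', '%Y%m%d')
slashed = Date.strptime('01/01/2018', '%d/%m/%Y')

puts compact.iso8601  # 2018-01-01T00:00:00+00:00
puts slashed.iso8601  # 2018-01-01
```

A value that does not match the pattern makes `strptime` raise an error, so the pattern has to describe the raw text exactly.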
|
270
|
+
|
271
|
+
## Tests
|
272
|
+
|
273
|
+
Simply run `rake spec`.
|
28
274
|
|
29
275
|
## Development
|
30
276
|
|
data/estratto.gemspec
CHANGED
@@ -20,6 +20,8 @@ Gem::Specification.new do |spec|
|
|
20
20
|
spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
|
21
21
|
spec.require_paths = ["lib"]
|
22
22
|
|
23
|
+
spec.add_dependency "charlock_holmes"
|
24
|
+
|
23
25
|
spec.add_development_dependency "bundler", "~> 1.17"
|
24
26
|
spec.add_development_dependency "rake", "~> 10.0"
|
25
27
|
spec.add_development_dependency "rspec", "~> 3.0"
|
data/lib/estratto/encoder.rb
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
require 'charlock_holmes'
|
2
|
+
|
3
|
+
module Estratto
|
4
|
+
class Encoder
|
5
|
+
attr_reader :content
|
6
|
+
|
7
|
+
def initialize(content)
|
8
|
+
@content = content
|
9
|
+
end
|
10
|
+
|
11
|
+
def encode
|
12
|
+
CharlockHolmes::Converter.convert(content, encoding, 'UTF-8')
|
13
|
+
end
|
14
|
+
|
15
|
+
private
|
16
|
+
|
17
|
+
def encoding
|
18
|
+
CharlockHolmes::EncodingDetector.detect(content)[:encoding]
|
19
|
+
end
|
20
|
+
end
|
21
|
+
end
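
For readers without `charlock_holmes` installed, the transcoding step can be approximated with Ruby's built-in `String#encode` when the source encoding is already known. This is a stand-in and not what `Estratto::Encoder` does: there is no detection here, and the ISO-8859-1 input is an assumption made for the example.

```ruby
# Bytes as they might be read from a Latin-1 encoded file ("José").
raw = "Jos\xE9".b

# Declare the (assumed) source encoding, then transcode to UTF-8.
utf8 = raw.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

p utf8.valid_encoding?  # true
p utf8.bytes            # [74, 111, 115, 195, 169]
```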
|
data/lib/estratto/parser.rb
CHANGED
@@ -1,4 +1,5 @@
|
|
1
1
|
require_relative 'register'
|
2
|
+
require_relative 'content'
|
2
3
|
|
3
4
|
module Estratto
|
4
5
|
class Parser
|
@@ -10,16 +11,15 @@ module Estratto
|
|
10
11
|
end
|
11
12
|
|
12
13
|
def perform
|
13
|
-
@data ||=
|
14
|
+
@data ||= raw_content.map do |line|
|
14
15
|
register_layout = layout.register_fields_for(line[layout.prefix_range])
|
15
16
|
next if register_layout.nil?
|
16
17
|
Register.new(line, register_layout).refine
|
17
18
|
end.compact
|
18
19
|
end
|
19
20
|
|
20
|
-
def
|
21
|
-
@raw_data
|
21
|
+
def raw_content
|
22
|
+
@raw_data = Content.for(file_path)
|
22
23
|
end
|
23
|
-
|
24
24
|
end
|
25
25
|
end
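
The `map` / `next` / `compact` idiom used in `perform` can be seen in isolation below. The `layouts` hash and the sample lines are hypothetical stand-ins for `layout.register_fields_for` and the file content; only the control flow mirrors the parser.

```ruby
# Hypothetical lookup table standing in for layout.register_fields_for.
layouts = { '01' => { name: 2..7 } }

lines = ['01Jotaro', '99skipme', '01Giorno']

records = lines.map do |line|
  fields = layouts[line[0..1]]  # look up the layout by its prefix
  next if fields.nil?           # unknown prefix: the block yields nil
  fields.transform_values { |range| line[range] }
end.compact                     # drop the nils left by skipped lines

p records.map { |record| record[:name] }  # ["Jotaro", "Giorno"]
```

Lines whose prefix has no matching register are dropped silently, which is why the result has two entries for three input lines.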
|
data/lib/estratto/version.rb
CHANGED
metadata
CHANGED
@@ -1,15 +1,29 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: estratto
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0.
|
4
|
+
version: 1.0.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Henrique A. Lavezzo
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2019-01-
|
11
|
+
date: 2019-01-21 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
|
+
- !ruby/object:Gem::Dependency
|
14
|
+
name: charlock_holmes
|
15
|
+
requirement: !ruby/object:Gem::Requirement
|
16
|
+
requirements:
|
17
|
+
- - ">="
|
18
|
+
- !ruby/object:Gem::Version
|
19
|
+
version: '0'
|
20
|
+
type: :runtime
|
21
|
+
prerelease: false
|
22
|
+
version_requirements: !ruby/object:Gem::Requirement
|
23
|
+
requirements:
|
24
|
+
- - ">="
|
25
|
+
- !ruby/object:Gem::Version
|
26
|
+
version: '0'
|
13
27
|
- !ruby/object:Gem::Dependency
|
14
28
|
name: bundler
|
15
29
|
requirement: !ruby/object:Gem::Requirement
|
@@ -85,6 +99,7 @@ files:
|
|
85
99
|
- bin/setup
|
86
100
|
- estratto.gemspec
|
87
101
|
- lib/estratto.rb
|
102
|
+
- lib/estratto/content.rb
|
88
103
|
- lib/estratto/data/base.rb
|
89
104
|
- lib/estratto/data/coercer.rb
|
90
105
|
- lib/estratto/data/date.rb
|
@@ -93,6 +108,7 @@ files:
|
|
93
108
|
- lib/estratto/data/integer.rb
|
94
109
|
- lib/estratto/data/string.rb
|
95
110
|
- lib/estratto/document.rb
|
111
|
+
- lib/estratto/encoder.rb
|
96
112
|
- lib/estratto/helpers/range.rb
|
97
113
|
- lib/estratto/layout/base.rb
|
98
114
|
- lib/estratto/layout/factory.rb
|