chronicle-etl 0.4.0 → 0.4.1
- checksums.yaml +4 -4
- data/.github/workflows/ruby.yml +2 -2
- data/README.md +148 -84
- data/lib/chronicle/etl/cli/jobs.rb +10 -7
- data/lib/chronicle/etl/configurable.rb +8 -0
- data/lib/chronicle/etl/exceptions.rb +6 -2
- data/lib/chronicle/etl/extractors/csv_extractor.rb +24 -17
- data/lib/chronicle/etl/extractors/extractor.rb +4 -4
- data/lib/chronicle/etl/extractors/file_extractor.rb +33 -13
- data/lib/chronicle/etl/extractors/helpers/input_reader.rb +76 -0
- data/lib/chronicle/etl/extractors/json_extractor.rb +21 -12
- data/lib/chronicle/etl/job_definition.rb +1 -1
- data/lib/chronicle/etl/loaders/json_loader.rb +44 -0
- data/lib/chronicle/etl/loaders/loader.rb +1 -1
- data/lib/chronicle/etl/loaders/table_loader.rb +12 -9
- data/lib/chronicle/etl/logger.rb +1 -0
- data/lib/chronicle/etl/models/base.rb +3 -0
- data/lib/chronicle/etl/models/entity.rb +8 -2
- data/lib/chronicle/etl/models/raw.rb +26 -0
- data/lib/chronicle/etl/runner.rb +5 -4
- data/lib/chronicle/etl/serializers/jsonapi_serializer.rb +6 -0
- data/lib/chronicle/etl/serializers/raw_serializer.rb +10 -0
- data/lib/chronicle/etl/serializers/serializer.rb +2 -1
- data/lib/chronicle/etl/transformers/null_transformer.rb +1 -1
- data/lib/chronicle/etl/version.rb +1 -1
- data/lib/chronicle/etl.rb +11 -4
- metadata +6 -5
- data/lib/chronicle/etl/extractors/helpers/filesystem_reader.rb +0 -104
- data/lib/chronicle/etl/loaders/stdout_loader.rb +0 -14
- data/lib/chronicle/etl/models/generic.rb +0 -23
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8a267de435b41b579e36128b7392729ef499eb37f05fabaead7811f089938ddb
+  data.tar.gz: d4af2f62f3f5de926bdfbb0e3d6dbe2c952ec286c07317af4dca8d98f665d6da
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c78080cce008340f0b2795be46da2b5eb6562b2bffd97728150960343870f2bea4699e4efa07905710dd0e2eba7aaa1e803d8c0f727196f5d9d655b28a04f02e
+  data.tar.gz: cae3a3ffb6527f5c0b3ff89c75dc98d9cd66157ee6230c9db797f4683f90e2146daadf291108e55d3090d0120d3c9e25135cb21c4e9078bcaf4d1edf2172c930
data/.github/workflows/ruby.yml
CHANGED
data/README.md
CHANGED
@@ -1,125 +1,189 @@
|
|
1
|
-
|
1
|
+
## A CLI toolkit for extracting and working with your digital history
|
2
2
|
|
3
3
|
[![Gem Version](https://badge.fury.io/rb/chronicle-etl.svg)](https://badge.fury.io/rb/chronicle-etl) [![Ruby](https://github.com/chronicle-app/chronicle-etl/actions/workflows/ruby.yml/badge.svg)](https://github.com/chronicle-app/chronicle-etl/actions/workflows/ruby.yml)
|
4
4
|
|
5
|
-
|
5
|
+
Are you trying to archive your digital history or incorporate it into your own projects? You’ve probably discovered how frustrating it is to get machine-readable access to your own data. While [building a memex](https://hyfen.net/memex/), I learned first-hand what great efforts must be made before you can begin using the data in interesting ways.
|
6
6
|
|
7
|
-
|
7
|
+
If you don’t want to spend all your time writing scrapers, reverse-engineering APIs, or parsing takeout data, this project is for you! (*If you do enjoy these things, please see the [open issues](https://github.com/chronicle-app/chronicle-etl/issues).*)
|
8
8
|
|
9
|
-
|
9
|
+
`chronicle-etl` is a CLI tool that gives you the ability to easily access your personal data. It uses the ETL pattern to **extract** it from a source (e.g. your local browser history, a directory of images, goodreads.com reading history), **transform** it (into a given schema), and **load** it to a destination (e.g. a CSV file, JSON, external API).
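The extract, transform, and load stages described here can be pictured as three small objects chained together. What follows is a minimal sketch in plain Ruby, not the gem's actual classes: `ArrayExtractor`, `UpcaseTransformer`, `MemoryLoader`, and `run_pipeline` are hypothetical names invented for illustration.

```ruby
# Hypothetical stand-in for an extractor: yields raw records one at a time.
class ArrayExtractor
  def initialize(rows)
    @rows = rows
  end

  def extract
    @rows.each { |row| yield row }
  end
end

# Hypothetical stand-in for a transformer: reshapes one raw record.
class UpcaseTransformer
  def transform(row)
    { title: row[:title].upcase }
  end
end

# Hypothetical stand-in for a loader: collects records in memory.
class MemoryLoader
  attr_reader :records

  def initialize
    @records = []
  end

  def load(record)
    @records << record
  end
end

# Wire the three stages together: extract -> transform -> load.
def run_pipeline(extractor, transformer, loader)
  extractor.extract { |row| loader.load(transformer.transform(row)) }
  loader
end

loader = run_pipeline(
  ArrayExtractor.new([{ title: 'first' }, { title: 'second' }]),
  UpcaseTransformer.new,
  MemoryLoader.new
)
p loader.records
```

Keeping each stage behind a single method (`extract`, `transform`, `load`) is what lets a CLI swap any one of them by name without touching the other two.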
|
10
10
|
|
11
|
-
|
12
|
-
|
11
|
+
## What does `chronicle-etl` give you?
|
12
|
+
* **CLI tool for working with personal data**. You can monitor progress of exports, manipulate the output, set up recurring jobs, manage credentials, and more.
|
13
|
+
* **Plugins for many third-party providers**. A plugin system allows you to access data from third-party providers and hook it into the shared CLI infrastructure.
|
14
|
+
* **A common, opinionated schema**: You can normalize different datasets into a single schema so that, for example, all your iMessages and emails are stored in a common schema. Don’t want to use the schema? `chronicle-etl` always allows you to fall back on working with the raw extraction data.
|
15
|
+
|
16
|
+
## Installation
|
17
|
+
```sh
|
18
|
+
# Install chronicle-etl
|
19
|
+
gem install chronicle-etl
|
13
20
|
```
|
14
21
|
|
15
|
-
|
22
|
+
After installation, the `chronicle-etl` command will be available in your shell. Homebrew support [is coming soon](https://github.com/chronicle-app/chronicle-etl/issues/13).
|
16
23
|
|
17
|
-
|
24
|
+
## Basic usage and running jobs
|
18
25
|
|
19
|
-
```
|
20
|
-
#
|
21
|
-
$ chronicle-etl
|
26
|
+
```sh
|
27
|
+
# Display help
|
28
|
+
$ chronicle-etl help
|
22
29
|
|
23
|
-
#
|
24
|
-
$ chronicle-etl
|
30
|
+
# Basic job usage
|
31
|
+
$ chronicle-etl --extractor NAME --transformer NAME --loader NAME
|
32
|
+
|
33
|
+
# Read data.csv and display it to stdout as a table
|
34
|
+
$ chronicle-etl --extractor csv --input ./data.csv --loader table
|
25
35
|
```
|
26
36
|
|
27
|
-
|
37
|
+
### Common options
|
38
|
+
```sh
|
39
|
+
Options:
|
40
|
+
-j, [--name=NAME] # Job configuration name
|
41
|
+
-e, [--extractor=EXTRACTOR-NAME] # Extractor class. Default: stdin
|
42
|
+
[--extractor-opts=key:value] # Extractor options
|
43
|
+
-t, [--transformer=TRANSFORMER-NAME] # Transformer class. Default: null
|
44
|
+
[--transformer-opts=key:value] # Transformer options
|
45
|
+
-l, [--loader=LOADER-NAME] # Loader class. Default: table
|
46
|
+
[--loader-opts=key:value] # Loader options
|
47
|
+
-i, [--input=FILENAME] # Input filename or directory
|
48
|
+
[--since=DATE] # Load records SINCE this date. Overrides job's `load_since` configuration option in extractor's options
|
49
|
+
[--until=DATE] # Load records UNTIL this date
|
50
|
+
[--limit=N] # Only extract the first LIMIT records
|
51
|
+
-o, [--output=OUTPUT] # Output filename
|
52
|
+
[--fields=field1 field2 ...] # Output only these fields
|
53
|
+
[--log-level=LOG_LEVEL] # Log level (debug, info, warn, error, fatal)
|
54
|
+
# Default: info
|
55
|
+
-v, [--verbose], [--no-verbose] # Set log level to verbose
|
56
|
+
[--silent], [--no-silent] # Silence all output
|
57
|
+
```
|
28
58
|
|
59
|
+
## Connectors
|
29
60
|
Connectors are available to read, process, and load data from different formats or external services.
|
30
61
|
|
31
|
-
```
|
62
|
+
```sh
|
32
63
|
# List all available connectors
|
33
64
|
$ chronicle-etl connectors:list
|
34
|
-
|
35
|
-
# Install a connector
|
36
|
-
$ chronicle-etl connectors:install imessage
|
37
65
|
```
|
38
66
|
|
39
|
-
Built
|
40
|
-
|
41
|
-
### Extractors
|
42
|
-
- `stdin` - (default) Load records from line-separated stdin
|
43
|
-
- `csv`
|
44
|
-
- `file` - load from a single file or directory (with a glob pattern)
|
45
|
-
|
46
|
-
### Transformers
|
47
|
-
- `null` - (default) Don't do anything
|
48
|
-
|
49
|
-
### Loaders
|
50
|
-
- `stdout` - (default) output records to stdout serialized as JSON
|
51
|
-
- `csv` - Load records to a csv file
|
52
|
-
- `rest` - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
|
53
|
-
- `table` - Output an ascii table of records. Useful for debugging.
|
54
|
-
|
55
|
-
### Provider-specific importers
|
56
|
-
|
57
|
-
In addition to the built-in importers, importers for third-party platforms are available. They are packaged as individual Ruby gems.
|
67
|
+
### Built-in Connectors
|
68
|
+
`chronicle-etl` comes with several built-in connectors for common formats and sources.
|
58
69
|
|
59
|
-
|
60
|
-
- [
|
61
|
-
- [
|
70
|
+
#### Extractors
|
71
|
+
- [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/csv_extractor.rb) - Load records from CSV files or stdin
|
72
|
+
- [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/json_extractor.rb) - Load JSON (either [line-separated objects](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON) or one object)
|
73
|
+
- [`file`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/file_extractor.rb) - load from a single file or directory (with a glob pattern)
|
62
74
|
|
63
|
-
|
75
|
+
#### Transformers
|
76
|
+
- [`null`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/transformers/null_transformer.rb) - (default) Don’t do anything and pass on raw extraction data
|
64
77
|
|
65
|
-
|
78
|
+
#### Loaders
|
79
|
+
- [`table`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/table_loader.rb) - (default) Output an ascii table of records. Useful for exploring data.
|
80
|
+
- [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/csv_loader.rb) - Load records to CSV
|
81
|
+
- [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/json_loader.rb) - Load records serialized as JSON
|
82
|
+
- [`rest`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/rest_loader.rb) - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
|
66
83
|
|
67
|
-
|
68
|
-
|
69
|
-
## Full commands
|
70
|
-
|
71
|
-
```
|
72
|
-
$ chronicle-etl help
|
73
|
-
|
74
|
-
ALL COMMANDS
|
75
|
-
help # This help menu
|
76
|
-
connectors help [COMMAND] # Describe subcommands or one specific subcommand
|
77
|
-
connectors:install NAME # Installs connector NAME
|
78
|
-
connectors:list # Lists available connectors
|
79
|
-
jobs help [COMMAND] # Describe subcommands or one specific subcommand
|
80
|
-
jobs:create # Create a job
|
81
|
-
jobs:list # List all available jobs
|
82
|
-
jobs:run # Start a job
|
83
|
-
jobs:show # Show details about a job
|
84
|
-
```
|
85
|
-
|
86
|
-
### Running a job
|
84
|
+
### Plugins
|
85
|
+
Plugins provide access to data from third-party platforms, services, or formats.
|
87
86
|
|
87
|
+
```bash
|
88
|
+
# Install a plugin
|
89
|
+
$ chronicle-etl connectors:install NAME
|
88
90
|
```
|
89
|
-
Usage:
|
90
|
-
chronicle-etl jobs:run
|
91
91
|
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
92
|
+
A few dozen importers exist [in my Memex project](https://hyfen.net/memex/) and they’re being ported over to the Chronicle system. This table shows what’s available now and what’s coming. Rows are sorted in very rough order of priority.
|
93
|
+
|
94
|
+
If you want to work together on a connector, please [get in touch](#get-in-touch)!
|
95
|
+
|
96
|
+
| Name | Description | Availability |
|
97
|
+
|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------|----------------------------------|
|
98
|
+
| [imessage](https://github.com/chronicle-app/chronicle-imessage) | iMessage messages and attachments | Available |
|
99
|
+
| [shell](https://github.com/chronicle-app/chronicle-shell) | Shell command history | Available (zsh support pending) |
|
100
|
+
| [email](https://github.com/chronicle-app/chronicle-email) | Emails and attachments from IMAP or .mbox files | Available (imap support pending) |
|
101
|
+
| [pinboard](https://github.com/chronicle-app/chronicle-pinboard) | Bookmarks and tags | Available |
|
102
|
+
| github | Github user and repo activity | In progress |
|
103
|
+
| safari | Browser history from local sqlite db | Needs porting |
|
104
|
+
| chrome | Browser history from local sqlite db | Needs porting |
|
105
|
+
| whatsapp | Messaging history (via individual chat exports) or reverse-engineered local desktop install | Unstarted |
|
106
|
+
| anki | Studying and card creation history | Needs porting |
|
107
|
+
| facebook | Messaging and posting history via data export files | Needs porting |
|
108
|
+
| twitter | History via API or export data files | Needs porting |
|
109
|
+
| foursquare | Location history via API | Needs porting |
|
110
|
+
| goodreads | Reading history via export csv (RIP goodreads API) | Needs porting |
|
111
|
+
| lastfm | Listening history via API | Needs porting |
|
112
|
+
| images | Process image files | Needs porting |
|
113
|
+
| arc | Location history from synced icloud backup files | Needs porting |
|
114
|
+
| firefox | Browser history from local sqlite db | Needs porting |
|
115
|
+
| fitbit | Personal analytics via API | Needs porting |
|
116
|
+
| git | Commit history on a repo | Needs porting |
|
117
|
+
| google-calendar | Calendar events via API | Needs porting |
|
118
|
+
| instagram | Posting and messaging history via export data | Needs porting |
|
119
|
+
| shazam | Song tags via reverse-engineered API | Needs porting |
|
120
|
+
| slack | Messaging history via API | Needs rethinking |
|
121
|
+
| strava | Activity history via API | Needs porting |
|
122
|
+
| things | Task activity via local sqlite db | Needs porting |
|
123
|
+
| bear | Note taking activity via local sqlite db | Needs porting |
|
124
|
+
| youtube | Video activity via takeout data and API | Needs porting |
|
125
|
+
|
126
|
+
### Writing your own connector
|
127
|
+
|
128
|
+
Additional connectors are packaged as separate ruby gems. You can view the [iMessage plugin](https://github.com/chronicle-app/chronicle-imessage) for an example.
|
129
|
+
|
130
|
+
If you want to load a custom connector without creating a gem, you can help by [completing this issue](https://github.com/chronicle-app/chronicle-etl/issues/23).
|
131
|
+
|
132
|
+
If you want to work together on a connector, please [get in touch](#get-in-touch)!
|
133
|
+
|
134
|
+
#### Sample custom Extractor class
|
135
|
+
```ruby
|
136
|
+
module Chronicle
|
137
|
+
module FooService
|
138
|
+
class FooExtractor < Chronicle::ETL::Extractor
|
139
|
+
register_connector do |r|
|
140
|
+
r.identifier = 'foo'
|
141
|
+
r.description = 'From foo.com'
|
142
|
+
end
|
143
|
+
|
144
|
+
setting :access_token, required: true
|
145
|
+
|
146
|
+
def prepare
|
147
|
+
@records = # load from somewhere
|
148
|
+
end
|
149
|
+
|
150
|
+
def extract
|
151
|
+
@records.each do |record|
|
152
|
+
yield Chronicle::ETL::Extraction.new(data: record.to_h)
|
153
|
+
end
|
154
|
+
end
|
155
|
+
end
|
156
|
+
end
|
157
|
+
end
|
107
158
|
```
|
108
159
|
|
109
160
|
## Development
|
110
|
-
|
111
161
|
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
|
112
162
|
|
113
163
|
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
|
114
164
|
|
115
|
-
|
165
|
+
### Additional development commands
|
166
|
+
```bash
|
167
|
+
# run tests
|
168
|
+
bundle exec rake spec
|
169
|
+
|
170
|
+
# generate docs
|
171
|
+
bundle exec rake yard
|
172
|
+
|
173
|
+
# use Guard to run specs automatically
|
174
|
+
bundle exec guard
|
175
|
+
```
|
116
176
|
|
177
|
+
## Get in touch
|
178
|
+
- [@hyfen](https://twitter.com/hyfen) on Twitter
|
179
|
+
- [@hyfen](https://github.com/hyfen) on Github
|
180
|
+
- Email: andrew@hyfen.net
|
181
|
+
|
182
|
+
## Contributing
|
117
183
|
Bug reports and pull requests are welcome on GitHub at https://github.com/chronicle-app/chronicle-etl. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
|
118
184
|
|
119
185
|
## License
|
120
|
-
|
121
186
|
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|
122
187
|
|
123
188
|
## Code of Conduct
|
124
|
-
|
125
|
-
Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/chronicle-app/chronicle-etl/blob/master/CODE_OF_CONDUCT.md).
|
189
|
+
Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/chronicle-app/chronicle-etl/blob/master/CODE_OF_CONDUCT.md).
|
@@ -6,20 +6,20 @@ module Chronicle
|
|
6
6
|
# CLI commands for working with ETL jobs
|
7
7
|
class Jobs < SubcommandBase
|
8
8
|
default_task "start"
|
9
|
-
namespace :jobs
|
9
|
+
namespace :jobs
|
10
10
|
|
11
11
|
class_option :name, aliases: '-j', desc: 'Job configuration name'
|
12
12
|
|
13
|
-
class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: '
|
13
|
+
class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'NAME'
|
14
14
|
class_option :'extractor-opts', desc: 'Extractor options', type: :hash, default: {}
|
15
|
-
class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: '
|
15
|
+
class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'NAME'
|
16
16
|
class_option :'transformer-opts', desc: 'Transformer options', type: :hash, default: {}
|
17
|
-
class_option :loader, aliases: '-l', desc: 'Loader class. Default:
|
17
|
+
class_option :loader, aliases: '-l', desc: 'Loader class. Default: table', banner: 'NAME'
|
18
18
|
class_option :'loader-opts', desc: 'Loader options', type: :hash, default: {}
|
19
19
|
|
20
20
|
# This is an array to deal with shell globbing
|
21
21
|
class_option :input, aliases: '-i', desc: 'Input filename or directory', default: [], type: 'array', banner: 'FILENAME'
|
22
|
-
class_option :since, desc: "Load records SINCE this date
|
22
|
+
class_option :since, desc: "Load records SINCE this date", banner: 'DATE'
|
23
23
|
class_option :until, desc: "Load records UNTIL this date", banner: 'DATE'
|
24
24
|
class_option :limit, desc: "Only extract the first LIMIT records", banner: 'N'
|
25
25
|
|
@@ -28,6 +28,7 @@ module Chronicle
|
|
28
28
|
|
29
29
|
class_option :log_level, desc: 'Log level (debug, info, warn, error, fatal)', default: 'info'
|
30
30
|
class_option :verbose, aliases: '-v', desc: 'Set log level to verbose', type: :boolean
|
31
|
+
class_option :silent, desc: 'Silence all output', type: :boolean
|
31
32
|
|
32
33
|
# Thor doesn't like `run` as a command name
|
33
34
|
map run: :start
|
@@ -93,7 +94,9 @@ LONG_DESC
|
|
93
94
|
private
|
94
95
|
|
95
96
|
def setup_log_level
|
96
|
-
if options[:
|
97
|
+
if options[:silent]
|
98
|
+
Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::SILENT
|
99
|
+
elsif options[:verbose]
|
97
100
|
Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::DEBUG
|
98
101
|
elsif options[:log_level]
|
99
102
|
level = Chronicle::ETL::Logger.const_get(options[:log_level].upcase)
|
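The branch order in this hunk means `--silent` wins over `--verbose`, which in turn wins over `--log-level`. A minimal sketch of that precedence follows; `resolve_log_level` is a hypothetical helper that returns symbols instead of the `Chronicle::ETL::Logger` constants, whose values are not shown in this diff.

```ruby
# Hypothetical helper mirroring the if/elsif order of setup_log_level.
# Returns a symbol rather than a Logger constant (an assumption for illustration).
def resolve_log_level(options)
  if options[:silent]
    :silent
  elsif options[:verbose]
    :debug
  elsif options[:log_level]
    options[:log_level].downcase.to_sym
  else
    :info
  end
end

p resolve_log_level(silent: true, verbose: true)
p resolve_log_level(verbose: true, log_level: 'warn')
p resolve_log_level(log_level: 'WARN')
```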
@@ -116,7 +119,7 @@ LONG_DESC
|
|
116
119
|
# Takes flag options and turns them into a runner config
|
117
120
|
def process_flag_options options
|
118
121
|
extractor_options = options[:'extractor-opts'].merge({
|
119
|
-
|
122
|
+
input: (options[:input] if options[:input].any?),
|
120
123
|
since: options[:since],
|
121
124
|
until: options[:until],
|
122
125
|
limit: options[:limit],
|
@@ -89,6 +89,14 @@ module Chronicle
         value.to_s
       end
 
+      def coerce_boolean(value)
+        if value.is_a?(String)
+          value.downcase == "true"
+        else
+          value
+        end
+      end
+
       def coerce_time(value)
         # TODO: handle durations like '3h'
         if value.is_a?(String)
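The added `coerce_boolean` treats only the string "true" (in any letter case) as true; every other string becomes false, and non-string values pass through unchanged. The same logic as a standalone function, to make the edge cases visible:

```ruby
# Same logic as the coerce_boolean method added in this hunk,
# lifted out of the Configurable module so it can run on its own.
def coerce_boolean(value)
  if value.is_a?(String)
    value.downcase == 'true'
  else
    value
  end
end

p coerce_boolean('TRUE')  # true
p coerce_boolean('false') # false
p coerce_boolean('yes')   # false: only "true" is recognized
p coerce_boolean(nil)     # nil: non-strings are returned as-is
```

This matters for options passed on the command line as `key:value` pairs, where a value such as `jsonl:false` arrives as a string and would otherwise be truthy.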
@@ -1,8 +1,8 @@
|
|
1
1
|
module Chronicle
|
2
2
|
module ETL
|
3
|
-
class Error < StandardError; end
|
3
|
+
class Error < StandardError; end
|
4
4
|
|
5
|
-
class ConfigurationError < Error; end
|
5
|
+
class ConfigurationError < Error; end
|
6
6
|
|
7
7
|
class RunnerTypeError < Error; end
|
8
8
|
|
@@ -18,6 +18,10 @@ module Chronicle
|
|
18
18
|
class ProviderNotAvailableError < ConnectorNotAvailableError; end
|
19
19
|
class ProviderConnectorNotAvailableError < ConnectorNotAvailableError; end
|
20
20
|
|
21
|
+
class ExtractionError < Error; end
|
22
|
+
|
23
|
+
class SerializationError < Error; end
|
24
|
+
|
21
25
|
class TransformationError < Error
|
22
26
|
attr_reader :transformation
|
23
27
|
|
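Because the new `ExtractionError` and `SerializationError` both inherit from the gem's base `Error`, callers can rescue a specific failure or fall back to the base class. A small sketch of that pattern, using a hypothetical `Sketch` module rather than the real `Chronicle::ETL` namespace:

```ruby
# Hypothetical namespace that copies the inheritance shape from the hunk.
module Sketch
  class Error < StandardError; end
  class ExtractionError < Error; end
  class SerializationError < Error; end
end

# Run a block and report which kind of failure occurred, if any.
def describe_failure
  yield
  'ok'
rescue Sketch::ExtractionError => e
  "extraction failed: #{e.message}"
rescue Sketch::Error => e
  "other ETL error: #{e.message}"
end

puts describe_failure { raise Sketch::ExtractionError, 'Input must exist' }
puts describe_failure { raise Sketch::SerializationError, 'bad record' }
puts describe_failure { 1 + 1 }
```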
@@ -3,39 +3,46 @@ require 'csv'
|
|
3
3
|
module Chronicle
|
4
4
|
module ETL
|
5
5
|
class CSVExtractor < Chronicle::ETL::Extractor
|
6
|
-
include Extractors::Helpers::
|
6
|
+
include Extractors::Helpers::InputReader
|
7
7
|
|
8
8
|
register_connector do |r|
|
9
|
-
r.description = '
|
9
|
+
r.description = 'CSV'
|
10
10
|
end
|
11
11
|
|
12
12
|
setting :headers, default: true
|
13
|
-
|
13
|
+
|
14
|
+
def prepare
|
15
|
+
@csvs = prepare_sources
|
16
|
+
end
|
14
17
|
|
15
18
|
def extract
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
+
@csvs.each do |csv|
|
20
|
+
csv.read.each do |row|
|
21
|
+
yield Chronicle::ETL::Extraction.new(data: row.to_h)
|
22
|
+
end
|
19
23
|
end
|
20
24
|
end
|
21
25
|
|
22
26
|
def results_count
|
23
|
-
|
27
|
+
@csvs.reduce(0) do |total_rows, csv|
|
28
|
+
row_count = csv.readlines.size
|
29
|
+
csv.rewind
|
30
|
+
total_rows + row_count
|
31
|
+
end
|
24
32
|
end
|
25
33
|
|
26
34
|
private
|
27
35
|
|
28
|
-
def
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
|
36
|
-
open_from_filesystem(filename: @config.filename) do |file|
|
37
|
-
return CSV.new(file, **csv_options)
|
36
|
+
def prepare_sources
|
37
|
+
@csvs = []
|
38
|
+
read_input do |csv_data|
|
39
|
+
csv_options = {
|
40
|
+
headers: @config.headers.is_a?(String) ? @config.headers.split(',') : @config.headers,
|
41
|
+
converters: :all
|
42
|
+
}
|
43
|
+
@csvs << CSV.new(csv_data, **csv_options)
|
38
44
|
end
|
45
|
+
@csvs
|
39
46
|
end
|
40
47
|
end
|
41
48
|
end
|
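The rewritten CSV extractor builds one `CSV` object per input, counts rows with `readlines`, and rewinds before the real read. The sketch below reproduces that sequence with Ruby's standard `csv` library; `parse_csv` is a hypothetical helper standing in for the `prepare_sources` step, and the sample data is made up.

```ruby
require 'csv'

# Hypothetical helper: accepts headers as true or as a comma-separated
# string, the same two cases the csv_options hash in the hunk handles.
def parse_csv(data, headers: true)
  headers = headers.split(',') if headers.is_a?(String)
  CSV.new(data, headers: headers, converters: :all)
end

csv = parse_csv("name,count\nfoo,1\nbar,2\n")

# Counting consumes the stream, so rewind before reading the rows again.
row_count = csv.readlines.size
csv.rewind

rows = csv.read.map(&:to_h)
p row_count
p rows
```

With `converters: :all`, numeric-looking fields come back as numbers rather than strings, which is why `count` is an integer in the resulting hashes.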
@@ -7,11 +7,11 @@ module Chronicle
|
|
7
7
|
extend Chronicle::ETL::Registry::SelfRegistering
|
8
8
|
include Chronicle::ETL::Configurable
|
9
9
|
|
10
|
-
setting :since, type: :
|
11
|
-
setting :until, type: :
|
10
|
+
setting :since, type: :time
|
11
|
+
setting :until, type: :time
|
12
12
|
setting :limit
|
13
13
|
setting :load_after_id
|
14
|
-
setting :
|
14
|
+
setting :input
|
15
15
|
|
16
16
|
# Construct a new instance of this extractor. Options are passed in from a Runner
|
17
17
|
# == Parameters:
|
@@ -46,7 +46,7 @@ module Chronicle
|
|
46
46
|
end
|
47
47
|
end
|
48
48
|
|
49
|
-
require_relative 'helpers/
|
49
|
+
require_relative 'helpers/input_reader'
|
50
50
|
require_relative 'csv_extractor'
|
51
51
|
require_relative 'file_extractor'
|
52
52
|
require_relative 'json_extractor'
|
@@ -2,35 +2,55 @@ require 'pathname'
|
|
2
2
|
|
3
3
|
module Chronicle
|
4
4
|
module ETL
|
5
|
+
# Return filenames that match a pattern in a directory
|
5
6
|
class FileExtractor < Chronicle::ETL::Extractor
|
6
|
-
include Extractors::Helpers::FilesystemReader
|
7
7
|
|
8
8
|
register_connector do |r|
|
9
9
|
r.description = 'file or directory of files'
|
10
10
|
end
|
11
11
|
|
12
|
-
|
13
|
-
setting :dir_glob_pattern
|
12
|
+
setting :input, default: ['.']
|
13
|
+
setting :dir_glob_pattern, default: "**/*"
|
14
|
+
setting :larger_than
|
15
|
+
setting :smaller_than
|
16
|
+
|
17
|
+
def prepare
|
18
|
+
@pathnames = gather_files
|
19
|
+
end
|
14
20
|
|
15
21
|
def extract
|
16
|
-
|
17
|
-
yield Chronicle::ETL::Extraction.new(data:
|
22
|
+
@pathnames.each do |pathname|
|
23
|
+
yield Chronicle::ETL::Extraction.new(data: pathname.to_path)
|
18
24
|
end
|
19
25
|
end
|
20
26
|
|
21
27
|
def results_count
|
22
|
-
|
28
|
+
@pathnames.count
|
23
29
|
end
|
24
30
|
|
25
31
|
private
|
26
32
|
|
27
|
-
def
|
28
|
-
@
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
33
|
+
def gather_files
|
34
|
+
roots = [@config.input].flatten.map { |filename| Pathname.new(filename) }
|
35
|
+
raise(ExtractionError, "Input must exist") unless roots.all?(&:exist?)
|
36
|
+
|
37
|
+
directories, files = roots.partition(&:directory?)
|
38
|
+
|
39
|
+
directories.each do |directory|
|
40
|
+
files += Dir.glob(File.join(directory, @config.dir_glob_pattern)).map { |filename| Pathname.new(filename) }
|
41
|
+
end
|
42
|
+
|
43
|
+
files = files.uniq
|
44
|
+
|
45
|
+
files = files.keep_if { |f| (f.mtime > @config.since) } if @config.since
|
46
|
+
files = files.keep_if { |f| (f.mtime < @config.until) } if @config.until
|
47
|
+
|
48
|
+
# pass in file sizes in bytes
|
49
|
+
files = files.keep_if { |f| (f.size < @config.smaller_than) } if @config.smaller_than
|
50
|
+
files = files.keep_if { |f| (f.size > @config.larger_than) } if @config.larger_than
|
51
|
+
|
52
|
+
# # TODO: incorporate sort argument
|
53
|
+
files.sort_by(&:mtime)
|
34
54
|
end
|
35
55
|
end
|
36
56
|
end
|
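`gather_files` expands directories with a glob pattern and then narrows the list by size and modification time. The sketch below is a standalone approximation that runs against a temporary directory. It covers only the size filters, takes plain keyword arguments instead of `@config`, and, unlike the code in the hunk, also drops directories from the glob results; all of these are assumptions made for the example.

```ruby
require 'pathname'
require 'tmpdir'

# Hypothetical standalone version of the gathering and filtering steps.
def gather_files(inputs, glob: '**/*', larger_than: nil, smaller_than: nil)
  roots = Array(inputs).map { |name| Pathname.new(name) }
  raise ArgumentError, 'Input must exist' unless roots.all?(&:exist?)

  directories, files = roots.partition(&:directory?)
  directories.each do |directory|
    files += Dir.glob(File.join(directory.to_s, glob)).map { |name| Pathname.new(name) }
  end

  # Assumption: keep regular files only (the hunk does not filter here).
  files = files.uniq.select(&:file?)

  # Sizes are in bytes, as noted in the hunk's comment.
  files = files.select { |f| f.size > larger_than } if larger_than
  files = files.select { |f| f.size < smaller_than } if smaller_than

  files.sort_by(&:mtime)
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.txt'), 'abc')
  Dir.mkdir(File.join(dir, 'sub'))
  File.write(File.join(dir, 'sub', 'b.txt'), 'abcdefghij')

  p gather_files(dir).map { |f| f.basename.to_s }.sort
  p gather_files(dir, larger_than: 5).map { |f| f.basename.to_s }
end
```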
@@ -0,0 +1,76 @@
+require 'pathname'
+
+module Chronicle
+  module ETL
+    module Extractors
+      module Helpers
+        module InputReader
+          # Return an array of input filenames; converts a single string
+          # to an array if necessary
+          def filenames
+            [@config.input].flatten.map
+          end
+
+          # Filenames as an array of pathnames
+          def pathnames
+            filenames.map { |filename| Pathname.new(filename) }
+          end
+
+          # Whether we're reading from files
+          def read_from_files?
+            filenames.any?
+          end
+
+          # Whether we're reading input from stdin
+          def read_from_stdin?
+            !read_from_files? && $stdin.stat.pipe?
+          end
+
+          # Read input sources and yield each content
+          def read_input
+            if read_from_files?
+              pathnames.each do |pathname|
+                File.open(pathname) do |file|
+                  yield file.read, pathname.to_path
+                end
+              end
+            elsif read_from_stdin?
+              yield $stdin.read, $stdin
+            else
+              raise ExtractionError, "No input files or stdin provided"
+            end
+          end
+
+          # Read input sources line by line
+          def read_input_as_lines(&block)
+            if read_from_files?
+              lines_from_files(&block)
+            elsif read_from_stdin?
+              lines_from_stdin(&block)
+            else
+              raise ExtractionError, "No input files or stdin provided"
+            end
+          end
+
+          private
+
+          def lines_from_files(&block)
+            pathnames.each do |pathname|
+              File.open(pathname) do |file|
+                lines_from_io(file, &block)
+              end
+            end
+          end
+
+          def lines_from_stdin(&block)
+            lines_from_io($stdin, &block)
+          end
+
+          def lines_from_io(io, &block)
+            io.each_line(&block)
+          end
+        end
+      end
+    end
+  end
+end
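The `InputReader` helper prefers named files and falls back to stdin only when no files are given. A compact sketch of that decision follows, with the fallback passed in as an argument so it can be exercised with a `StringIO`; `each_input_line` and its `fallback_io:` parameter are hypothetical and stand in for the `$stdin.stat.pipe?` check in the module.

```ruby
require 'stringio'

# Hypothetical condensed form of read_input_as_lines: files first,
# then a fallback IO, otherwise an error.
def each_input_line(paths, fallback_io: nil, &block)
  paths = Array(paths).compact
  if paths.any?
    paths.each do |path|
      File.open(path) { |file| file.each_line(&block) }
    end
  elsif fallback_io
    fallback_io.each_line(&block)
  else
    raise ArgumentError, 'No input files or fallback IO provided'
  end
end

lines = []
each_input_line(nil, fallback_io: StringIO.new("a\nb\n")) { |line| lines << line.chomp }
p lines
```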
@@ -1,35 +1,44 @@
|
|
1
1
|
module Chronicle
|
2
2
|
module ETL
|
3
|
-
class
|
4
|
-
include Extractors::Helpers::
|
3
|
+
class JSONExtractor < Chronicle::ETL::Extractor
|
4
|
+
include Extractors::Helpers::InputReader
|
5
5
|
|
6
6
|
register_connector do |r|
|
7
|
-
r.description = '
|
7
|
+
r.description = 'JSON'
|
8
8
|
end
|
9
9
|
|
10
|
-
setting :
|
11
|
-
setting :jsonl, default: true
|
10
|
+
setting :jsonl, default: true, type: :boolean
|
12
11
|
|
13
|
-
def
|
12
|
+
def prepare
|
13
|
+
@jsons = []
|
14
14
|
load_input do |input|
|
15
|
-
|
16
|
-
|
15
|
+
@jsons << parse_data(input)
|
16
|
+
end
|
17
|
+
end
|
18
|
+
|
19
|
+
def extract
|
20
|
+
@jsons.each do |json|
|
21
|
+
yield Chronicle::ETL::Extraction.new(data: json)
|
17
22
|
end
|
18
23
|
end
|
19
24
|
|
20
25
|
def results_count
|
26
|
+
@jsons.count
|
21
27
|
end
|
22
28
|
|
23
29
|
private
|
24
30
|
|
25
31
|
def parse_data data
|
26
32
|
JSON.parse(data)
|
27
|
-
rescue JSON::ParserError
|
33
|
+
rescue JSON::ParserError
|
34
|
+
raise Chronicle::ETL::ExtractionError, "Could not parse JSON"
|
28
35
|
end
|
29
36
|
|
30
|
-
def load_input
|
31
|
-
|
32
|
-
|
37
|
+
def load_input(&block)
|
38
|
+
if @config.jsonl
|
39
|
+
read_input_as_lines(&block)
|
40
|
+
else
|
41
|
+
read_input(&block)
|
33
42
|
end
|
34
43
|
end
|
35
44
|
end
|
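With `jsonl` enabled, the extractor parses each input line as its own JSON document; otherwise it parses the whole input as one document, and parse failures are re-raised as an extraction error. A minimal sketch with the standard `json` library follows. `parse_records` and `SketchExtractionError` are hypothetical names, and skipping blank lines is an assumption that the hunk does not show.

```ruby
require 'json'

# Hypothetical stand-in for Chronicle::ETL::ExtractionError.
class SketchExtractionError < StandardError; end

# Parse either line-delimited JSON (one document per line) or a single document.
def parse_records(text, jsonl: true)
  chunks = jsonl ? text.each_line.map(&:strip).reject(&:empty?) : [text]
  chunks.map { |chunk| JSON.parse(chunk) }
rescue JSON::ParserError
  raise SketchExtractionError, 'Could not parse JSON'
end

p parse_records(%({"id": 1}\n{"id": 2}\n))
p parse_records(%({\n  "id": 3\n}), jsonl: false)
```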
@@ -0,0 +1,44 @@
+module Chronicle
+  module ETL
+    class JSONLoader < Chronicle::ETL::Loader
+      register_connector do |r|
+        r.description = 'json'
+      end
+
+      setting :serializer
+      setting :output, default: $stdout
+
+      def start
+        if @config.output == $stdout
+          @output = @config.output
+        else
+          @output = File.open(@config.output, "w")
+        end
+      end
+
+      def load(record)
+        serialized = serializer.serialize(record)
+
+        # When dealing with raw data, we can get improperly encoded strings
+        # (eg from sqlite database columns). We force conversion to UTF-8
+        # before converting into JSON
+        encoded = serialized.transform_values do |value|
+          next value unless value.is_a?(String)
+
+          value.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
+        end
+        @output.puts encoded.to_json
+      end
+
+      def finish
+        @output.close
+      end
+
+      private
+
+      def serializer
+        @config.serializer || Chronicle::ETL::RawSerializer
+      end
+    end
+  end
+end
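The loader's `load` method re-encodes every string value to UTF-8 before calling `to_json`, replacing bytes that cannot be converted. The sketch below isolates that step; `encode_for_json` is a hypothetical helper, and the binary-tagged sample string is an assumption chosen to mimic raw bytes read from a database column.

```ruby
require 'json'

# Hypothetical helper isolating the transform_values step from JSONLoader#load.
def encode_for_json(record)
  record.transform_values do |value|
    next value unless value.is_a?(String)

    value.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
  end
end

# "\xE9" is a Latin-1 byte with no UTF-8 mapping when the string is tagged binary.
raw = { 'name' => "caf\xE9".b, 'count' => 1 }
puts encode_for_json(raw).to_json
```

Without the re-encoding step, serializing such a value to JSON would typically fail with an encoding error, so the replacement character trades fidelity for a loader that keeps running.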
@@ -11,20 +11,19 @@ module Chronicle
|
|
11
11
|
|
12
12
|
setting :fields_limit, default: nil
|
13
13
|
setting :fields_exclude, default: ['lids', 'type']
|
14
|
-
setting :
|
14
|
+
setting :fields, default: []
|
15
15
|
setting :truncate_values_at, default: 40
|
16
16
|
setting :table_renderer, default: :basic
|
17
17
|
|
18
18
|
def load(record)
|
19
|
-
|
20
|
-
@records << record.to_h_flattened
|
19
|
+
records << record.to_h_flattened
|
21
20
|
end
|
22
21
|
|
23
22
|
def finish
|
24
|
-
return if
|
23
|
+
return if records.empty?
|
25
24
|
|
26
|
-
headers = build_headers(
|
27
|
-
rows = build_rows(
|
25
|
+
headers = build_headers(records)
|
26
|
+
rows = build_rows(records, headers)
|
28
27
|
|
29
28
|
@table = TTY::Table.new(header: headers, rows: rows)
|
30
29
|
puts @table.render(
|
@@ -33,12 +32,16 @@ module Chronicle
|
|
33
32
|
)
|
34
33
|
end
|
35
34
|
|
35
|
+
def records
|
36
|
+
@records ||= []
|
37
|
+
end
|
38
|
+
|
36
39
|
private
|
37
40
|
|
38
41
|
def build_headers(records)
|
39
42
|
headers =
|
40
|
-
if @config.
|
41
|
-
Set[*@config.
|
43
|
+
if @config.fields.any?
|
44
|
+
Set[*@config.fields]
|
42
45
|
else
|
43
46
|
# use all the keys of the flattened record hash
|
44
47
|
Set[*records.map(&:keys).flatten.map(&:to_s).uniq]
|
@@ -52,7 +55,7 @@ module Chronicle
|
|
52
55
|
|
53
56
|
def build_rows(records, headers)
|
54
57
|
records.map do |record|
|
55
|
-
values = record.values_at(*headers).map{|value| value.to_s }
|
58
|
+
values = record.transform_keys(&:to_sym).values_at(*headers).map{|value| value.to_s }
|
56
59
|
|
57
60
|
if @config.truncate_values_at
|
58
61
|
values = values.map{ |value| value.truncate(@config.truncate_values_at) }
|
data/lib/chronicle/etl/logger.rb
CHANGED
@@ -5,6 +5,9 @@ module Chronicle
|
|
5
5
|
module Models
|
6
6
|
# Represents a record that's been transformed by a Transformer and
|
7
7
|
# ready to be loaded. Loosely based on ActiveModel.
|
8
|
+
#
|
9
|
+
# @todo Experiment with just mixing in ActiveModel instead of this
|
10
|
+
# reimplementation
|
8
11
|
class Base
|
9
12
|
ATTRIBUTES = [:provider, :provider_id, :lat, :lng, :metadata].freeze
|
10
13
|
ASSOCIATIONS = [].freeze
|
@@ -5,13 +5,19 @@ module Chronicle
     module Models
       class Entity < Chronicle::ETL::Models::Base
         TYPE = 'entities'.freeze
-        ATTRIBUTES = [:title, :body, :represents, :slug, :myself, :metadata].freeze
+        ATTRIBUTES = [:title, :body, :provider_url, :represents, :slug, :myself, :metadata].freeze
+
+        # TODO: This desperately needs a validation system
         ASSOCIATIONS = [
+          :involvements, # inverse of activity's `involved`
+
           :attachments,
           :abouts,
+          :aboutables, # inverse of above
           :depicts,
           :consumers,
-          :contains
+          :contains,
+          :containers # inverse of above
         ].freeze # TODO: add these to reflect Chronicle Schema
 
         attr_accessor(*ATTRIBUTES, *ASSOCIATIONS)
data/lib/chronicle/etl/models/raw.rb
ADDED
@@ -0,0 +1,26 @@
+require 'chronicle/etl/models/base'
+
+module Chronicle
+  module ETL
+    module Models
+      # A record from an extraction with no processing or normalization applied
+      class Raw
+        TYPE = 'raw'
+
+        attr_accessor :raw_data
+
+        def initialize(raw_data)
+          @raw_data = raw_data
+        end
+
+        def to_h
+          @raw_data.to_h
+        end
+
+        def to_h_flattened
+          Chronicle::ETL::Utils::HashUtilities.flatten_hash(to_h)
+        end
+      end
+    end
+  end
+end
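Note on the new `Raw` model: it wraps whatever the extractor produced and exposes `to_h` and `to_h_flattened`, the latter being what `TableLoader#load` calls. The sketch below shows that shape in isolation. `flatten_hash` here is a hypothetical stand-in for `Chronicle::ETL::Utils::HashUtilities.flatten_hash`, whose implementation is not part of this diff; the dot separator and the `RawSketch` name are assumptions.

```ruby
# Hypothetical stand-in for HashUtilities.flatten_hash; the real helper's
# key separator and edge-case behavior are not shown in this diff.
def flatten_hash(hash, parent_key = nil)
  hash.each_with_object({}) do |(key, value), flat|
    full_key = [parent_key, key].compact.join('.')
    if value.is_a?(Hash)
      flat.merge!(flatten_hash(value, full_key))
    else
      flat[full_key] = value
    end
  end
end

# Simplified copy of the Raw model's interface.
class RawSketch
  attr_accessor :raw_data

  def initialize(raw_data)
    @raw_data = raw_data
  end

  def to_h
    @raw_data.to_h
  end

  def to_h_flattened
    flatten_hash(to_h)
  end
end

raw = RawSketch.new({ 'user' => { 'name' => 'Ada' }, 'id' => 7 })
p raw.to_h_flattened # => {"user.name"=>"Ada", "id"=>7}
```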
data/lib/chronicle/etl/runner.rb
CHANGED
@@ -28,9 +28,10 @@ class Chronicle::ETL::Runner
       transformer = @job.instantiate_transformer(extraction)
       record = transformer.transform
 
-
-
-
+      # TODO: rethink this
+      # unless record.is_a?(Chronicle::ETL::Models)
+      # raise Chronicle::ETL::RunnerTypeError, "Transformed data should be a type of Chronicle::ETL::Models"
+      # end
 
       Chronicle::ETL::Logger.info(tty_log_transformation(transformer))
       @job_logger.log_transformation(transformer)
@@ -52,7 +53,7 @@ class Chronicle::ETL::Runner
     raise e
   ensure
     @job_logger.save
-    @progress_bar
+    @progress_bar&.finish
     Chronicle::ETL::Logger.detach_from_progress_bar
     Chronicle::ETL::Logger.info(tty_log_completion)
   end
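Note on `@progress_bar&.finish` in the `ensure` block: an `ensure` clause also runs when the method fails before the progress bar was created, and the safe navigation operator turns the call into a no-op in that case. The removed line is truncated in this diff, so the sketch below only illustrates the new behavior. `ProgressBarSketch` and `RunnerSketch` are hypothetical classes, not the gem's.

```ruby
# Hypothetical progress bar that only records whether finish was called.
class ProgressBarSketch
  attr_reader :finished

  def finish
    @finished = true
  end
end

# Hypothetical runner, not Chronicle::ETL::Runner.
class RunnerSketch
  attr_reader :progress_bar

  def run(fail_before_setup: false)
    raise ArgumentError, 'setup failed' if fail_before_setup

    @progress_bar = ProgressBarSketch.new
    :done
  ensure
    # Safe navigation: skipped when @progress_bar is still nil.
    @progress_bar&.finish
  end
end

RunnerSketch.new.run # finishes the bar and returns :done
```

With a plain `@progress_bar.finish`, the early-failure path would raise `NoMethodError` from inside `ensure` and mask the original error.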
data/lib/chronicle/etl/serializers/jsonapi_serializer.rb
CHANGED
@@ -1,6 +1,12 @@
 module Chronicle
   module ETL
     class JSONAPISerializer < Chronicle::ETL::Serializer
+      def initialize(*args)
+        super
+
+        raise(SerializationError, "Record must be a subclass of Chronicle::ETL::Model::Base") unless @record.is_a?(Chronicle::ETL::Models::Base)
+      end
+
       def serializable_hash
         @record
           .identifier_hash
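Note on the new `JSONAPISerializer#initialize`: it lets the parent constructor store the record, then rejects anything that is not a `Models::Base`, which would exclude the new `Raw` model since it does not inherit from `Base`. The sketch below reproduces that guard with hypothetical stand-in classes (all names ending in `Sketch`). It assumes the parent serializer stores its argument in `@record`, which the diff implies but does not show.

```ruby
# Hypothetical stand-ins for the gem's error, model, and serializer classes.
class SerializationErrorSketch < StandardError; end
class BaseModelSketch; end
class EntitySketch < BaseModelSketch; end

class SerializerSketch
  def initialize(record)
    @record = record # assumption: the parent stores the record like this
  end
end

class StrictSerializerSketch < SerializerSketch
  def initialize(*args)
    super # bare super forwards the original arguments to the parent

    unless @record.is_a?(BaseModelSketch)
      raise(SerializationErrorSketch, 'record must descend from BaseModelSketch')
    end
  end
end

StrictSerializerSketch.new(EntitySketch.new) # accepted
```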
data/lib/chronicle/etl.rb
CHANGED
@@ -3,23 +3,30 @@ require_relative 'etl/config'
 require_relative 'etl/configurable'
 require_relative 'etl/exceptions'
 require_relative 'etl/extraction'
-require_relative 'etl/extractors/extractor'
 require_relative 'etl/job_definition'
 require_relative 'etl/job_log'
 require_relative 'etl/job_logger'
 require_relative 'etl/job'
-require_relative 'etl/loaders/loader'
 require_relative 'etl/logger'
 require_relative 'etl/models/activity'
 require_relative 'etl/models/attachment'
 require_relative 'etl/models/base'
+require_relative 'etl/models/raw'
 require_relative 'etl/models/entity'
-require_relative 'etl/models/generic'
 require_relative 'etl/runner'
 require_relative 'etl/serializers/serializer'
-require_relative 'etl/transformers/transformer'
 require_relative 'etl/utils/binary_attachments'
 require_relative 'etl/utils/hash_utilities'
 require_relative 'etl/utils/text_recognition'
 require_relative 'etl/utils/progress_bar'
 require_relative 'etl/version'
+
+require_relative 'etl/extractors/extractor'
+require_relative 'etl/loaders/loader'
+require_relative 'etl/transformers/transformer'
+
+begin
+  require 'pry'
+rescue LoadError
+  # Pry not available
+end
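Note on the `begin ... rescue LoadError` block at the end of `etl.rb`: `require` raises `LoadError` when a library is not installed, so rescuing it makes the dependency optional. The sketch below wraps the same idea in a hypothetical `optional_require` helper; the gem inlines the block instead, and the missing library name is made up.

```ruby
# Hypothetical helper; returns true when the library could be loaded.
def optional_require(name)
  require name
  true
rescue LoadError
  false
end

optional_require('json')                        # standard library, loads fine
optional_require('chronicle_sketch_missing_lib') # made-up name, rescued
```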
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: chronicle-etl
 version: !ruby/object:Gem::Version
-  version: 0.4.
+  version: 0.4.1
 platform: ruby
 authors:
 - Andrew Louis
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-
+date: 2022-03-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -328,7 +328,7 @@ files:
 - lib/chronicle/etl/extractors/csv_extractor.rb
 - lib/chronicle/etl/extractors/extractor.rb
 - lib/chronicle/etl/extractors/file_extractor.rb
-- lib/chronicle/etl/extractors/helpers/
+- lib/chronicle/etl/extractors/helpers/input_reader.rb
 - lib/chronicle/etl/extractors/json_extractor.rb
 - lib/chronicle/etl/extractors/stdin_extractor.rb
 - lib/chronicle/etl/job.rb
@@ -336,21 +336,22 @@ files:
 - lib/chronicle/etl/job_log.rb
 - lib/chronicle/etl/job_logger.rb
 - lib/chronicle/etl/loaders/csv_loader.rb
+- lib/chronicle/etl/loaders/json_loader.rb
 - lib/chronicle/etl/loaders/loader.rb
 - lib/chronicle/etl/loaders/rest_loader.rb
-- lib/chronicle/etl/loaders/stdout_loader.rb
 - lib/chronicle/etl/loaders/table_loader.rb
 - lib/chronicle/etl/logger.rb
 - lib/chronicle/etl/models/activity.rb
 - lib/chronicle/etl/models/attachment.rb
 - lib/chronicle/etl/models/base.rb
 - lib/chronicle/etl/models/entity.rb
-- lib/chronicle/etl/models/
+- lib/chronicle/etl/models/raw.rb
 - lib/chronicle/etl/registry/connector_registration.rb
 - lib/chronicle/etl/registry/registry.rb
 - lib/chronicle/etl/registry/self_registering.rb
 - lib/chronicle/etl/runner.rb
 - lib/chronicle/etl/serializers/jsonapi_serializer.rb
+- lib/chronicle/etl/serializers/raw_serializer.rb
 - lib/chronicle/etl/serializers/serializer.rb
 - lib/chronicle/etl/transformers/image_file_transformer.rb
 - lib/chronicle/etl/transformers/null_transformer.rb
data/lib/chronicle/etl/extractors/helpers/filesystem_reader.rb
DELETED
@@ -1,104 +0,0 @@
-require 'pathname'
-
-module Chronicle
-  module ETL
-    module Extractors
-      module Helpers
-        module FilesystemReader
-
-          def filenames_in_directory(...)
-            filenames = gather_files(...)
-            if block_given?
-              filenames.each do |filename|
-                yield filename
-              end
-            else
-              filenames
-            end
-          end
-
-          def read_from_filesystem(filename:, yield_each_line: true, dir_glob_pattern: '**/*')
-            open_files(filename: filename, dir_glob_pattern: dir_glob_pattern) do |file|
-              if yield_each_line
-                file.each_line do |line|
-                  yield line
-                end
-              else
-                yield file.read
-              end
-            end
-          end
-
-          def open_from_filesystem(filename:, dir_glob_pattern: '**/*')
-            open_files(filename: filename, dir_glob_pattern: dir_glob_pattern) do |file|
-              yield file
-            end
-          end
-
-          def results_count
-            raise NotImplementedError
-            # if file?
-            # return 1
-            # else
-            # search_pattern = File.join(@options[:filename], '**/*')
-            # Dir.glob(search_pattern).count
-            # end
-          end
-
-          private
-
-          def gather_files(path:, dir_glob_pattern: '**/*', load_since: nil, load_until: nil, smaller_than: nil, larger_than: nil, sort: :mtime)
-            search_pattern = File.join(path, '**', dir_glob_pattern)
-            files = Dir.glob(search_pattern)
-
-            files = files.keep_if {|f| (File.mtime(f) > load_since)} if load_since
-            files = files.keep_if {|f| (File.mtime(f) < load_until)} if load_until
-
-            # pass in file sizes in bytes
-            files = files.keep_if {|f| (File.size(f) < smaller_than)} if smaller_than
-            files = files.keep_if {|f| (File.size(f) > larger_than)} if larger_than
-
-            # TODO: incorporate sort argument
-            files.sort_by{ |f| File.mtime(f) }
-          end
-
-          def select_files_in_directory(path:, dir_glob_pattern: '**/*')
-            raise IOError.new("#{path} is not a directory.") unless directory?(path)
-
-            search_pattern = File.join(path, dir_glob_pattern)
-            Dir.glob(search_pattern).each do |filename|
-              yield(filename)
-            end
-          end
-
-          def open_files(filename:, dir_glob_pattern:)
-            if stdin?(filename)
-              yield $stdin
-            elsif directory?(filename)
-              search_pattern = File.join(filename, dir_glob_pattern)
-              filenames = Dir.glob(search_pattern)
-              filenames.each do |filename|
-                file = File.open(filename)
-                yield(file)
-              end
-            elsif file?(filename)
-              yield File.open(filename)
-            end
-          end
-
-          def stdin?(filename)
-            filename == $stdin
-          end
-
-          def directory?(filename)
-            Pathname.new(filename).directory?
-          end
-
-          def file?(filename)
-            Pathname.new(filename).file?
-          end
-        end
-      end
-    end
-  end
-end
data/lib/chronicle/etl/loaders/stdout_loader.rb
DELETED
@@ -1,14 +0,0 @@
-module Chronicle
-  module ETL
-    class StdoutLoader < Chronicle::ETL::Loader
-      register_connector do |r|
-        r.description = 'stdout'
-      end
-
-      def load(record)
-        serializer = Chronicle::ETL::JSONAPISerializer.new(record)
-        puts serializer.serializable_hash.to_json
-      end
-    end
-  end
-end
data/lib/chronicle/etl/models/generic.rb
DELETED
@@ -1,23 +0,0 @@
-require 'chronicle/etl/models/base'
-
-module Chronicle
-  module ETL
-    module Models
-      class Generic < Chronicle::ETL::Models::Base
-        TYPE = 'generic'
-
-        attr_accessor :properties
-
-        def initialize(properties = {})
-          @properties = properties
-          super
-        end
-
-        # Generic models have arbitrary attributes stored in @properties
-        def attributes
-          @properties.transform_keys(&:to_sym)
-        end
-      end
-    end
-  end
-end