chronicle-etl 0.4.0 → 0.4.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 5fd411a9a41a645b85780230c79b09f361e121d0e8ca7f3270ca8eba55a76ca8
-  data.tar.gz: c09053715910ab4f027fbdc3a5b7d10c042eee962f7fa93c6571ce8359f51009
+  metadata.gz: 8a267de435b41b579e36128b7392729ef499eb37f05fabaead7811f089938ddb
+  data.tar.gz: d4af2f62f3f5de926bdfbb0e3d6dbe2c952ec286c07317af4dca8d98f665d6da
 SHA512:
-  metadata.gz: 2c9ec14b6c0a51f1c5ec77ee8d9a7f016d16bdc35db5634f9fa5d38aabc30dec201cd4b8bef06a31b86773a0c1cda2d271d7008dcb247a86d956c094919f3c0f
-  data.tar.gz: 0dca41e1654e5b2b98a148f853492a67126cdac767000b3c5f97c5c8ff88b77464e17a2fab38b72c1f014f3515c911e5f3f391eaf68d64e73dcfcff5d8e6cb6a
+  metadata.gz: c78080cce008340f0b2795be46da2b5eb6562b2bffd97728150960343870f2bea4699e4efa07905710dd0e2eba7aaa1e803d8c0f727196f5d9d655b28a04f02e
+  data.tar.gz: cae3a3ffb6527f5c0b3ff89c75dc98d9cd66157ee6230c9db797f4683f90e2146daadf291108e55d3090d0120d3c9e25135cb21c4e9078bcaf4d1edf2172c930
.github/workflows/ruby.yml CHANGED
@@ -9,9 +9,9 @@ name: Ruby
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
   pull_request:
-    branches: [ master ]
+    branches: [ main ]
 
 jobs:
   test:
data/README.md CHANGED
@@ -1,125 +1,189 @@
-# Chronicle::ETL
+## A CLI toolkit for extracting and working with your digital history
 
 [![Gem Version](https://badge.fury.io/rb/chronicle-etl.svg)](https://badge.fury.io/rb/chronicle-etl) [![Ruby](https://github.com/chronicle-app/chronicle-etl/actions/workflows/ruby.yml/badge.svg)](https://github.com/chronicle-app/chronicle-etl/actions/workflows/ruby.yml)
 
-Chronicle ETL is a utility that helps you archive and processes personal data. You can *extract* it from a variety of sources, *transform* it, and *load* it to an external API, file, or stdout.
+Are you trying to archive your digital history or incorporate it into your own projects? You’ve probably discovered how frustrating it is to get machine-readable access to your own data. While [building a memex](https://hyfen.net/memex/), I learned first-hand what great efforts must be made before you can begin using the data in interesting ways.
 
-This tool is an adaptation of Andrew Louis's experimental [Memex project](https://hyfen.net/memex) and the dozens of existing importers are being migrated to Chronicle.
+If you don’t want to spend all your time writing scrapers, reverse-engineering APIs, or parsing takeout data, this project is for you! (*If you do enjoy these things, please see the [open issues](https://github.com/chronicle-app/chronicle-etl/issues).*)
 
-## Installation
+`chronicle-etl` is a CLI tool that gives you the ability to easily access your personal data. It uses the ETL pattern to **extract** it from a source (e.g. your local browser history, a directory of images, goodreads.com reading history), **transform** it (into a given schema), and **load** it to a source (e.g. a CSV file, JSON, external API).
 
-```bash
-$ gem install chronicle-etl
+## What does `chronicle-etl` give you?
+* **CLI tool for working with personal data**. You can monitor progress of exports, manipulate the output, set up recurring jobs, manage credentials, and more.
+* **Plugins for many third-party providers**. A plugin system allows you to access data from third-party providers and hook it into the shared CLI infrastructure.
+* **A common, opinionated schema**: You can normalize different datasets into a single schema so that, for example, all your iMessages and emails are stored in a common schema. Don’t want to use the schema? `chronicle-etl` always allows you to fall back on working with the raw extraction data.
+
+## Installation
+```sh
+# Install chronicle-etl
+gem install chronicle-etl
 ```
 
-## Usage
+After installation, the `chronicle-etl` command will be available in your shell. Homebrew support [is coming soon](https://github.com/chronicle-app/chronicle-etl/issues/13).
 
-After installing the gem, `chronicle-etl` is available to run in your shell.
+## Basic usage and running jobs
 
-```bash
-# read test.csv and display it as a table
-$ chronicle-etl jobs:run --extractor csv --extractor-opts filename:test.csv --loader table
+```sh
+# Display help
+$ chronicle-etl help
 
-# Display help for the jobs:run command
-$ chronicle-etl jobs help run
+# Basic job usage
+$ chronicle-etl --extractor NAME --transformer NAME --loader NAME
+
+# Read test.csv and display it to stdout as a table
+$ chronicle-etl --extractor csv --input ./data.csv --loader table
 ```
 
-## Connectors
+### Common options
+```sh
+Options:
+  -j, [--name=NAME]                     # Job configuration name
+  -e, [--extractor=EXTRACTOR-NAME]      # Extractor class. Default: stdin
+      [--extractor-opts=key:value]      # Extractor options
+  -t, [--transformer=TRANSFORMER-NAME]  # Transformer class. Default: null
+      [--transformer-opts=key:value]    # Transformer options
+  -l, [--loader=LOADER-NAME]            # Loader class. Default: stdout
+      [--loader-opts=key:value]         # Loader options
+  -i, [--input=FILENAME]                # Input filename or directory
+      [--since=DATE]                    # Load records SINCE this date. Overrides job's `load_since` configuration option in extractor's options
+      [--until=DATE]                    # Load records UNTIL this date
+      [--limit=N]                       # Only extract the first LIMIT records
+  -o, [--output=OUTPUT]                 # Output filename
+      [--fields=field1 field2 ...]      # Output only these fields
+      [--log-level=LOG_LEVEL]           # Log level (debug, info, warn, error, fatal)
+                                        # Default: info
+  -v, [--verbose], [--no-verbose]       # Set log level to verbose
+      [--silent], [--no-silent]         # Silence all output
+```
 
+## Connectors
 Connectors are available to read, process, and load data from different formats or external services.
 
-```bash
+```sh
 # List all available connectors
 $ chronicle-etl connectors:list
-
-# Install a connector
-$ chronicle-etl connectors:install imessage
 ```
 
-Built in connectors:
-
-### Extractors
-- `stdin` - (default) Load records from line-separated stdin
-- `csv`
-- `file` - load from a single file or directory (with a glob pattern)
-
-### Transformers
-- `null` - (default) Don't do anything
-
-### Loaders
-- `stdout` - (default) output records to stdout serialized as JSON
-- `csv` - Load records to a csv file
-- `rest` - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
-- `table` - Output an ascii table of records. Useful for debugging.
-
-### Provider-specific importers
-
-In addition to the built-in importers, importers for third-party platforms are available. They are packaged as individual Ruby gems.
+### Built-in Connectors
+`chronicle-etl` comes with several built-in connectors for common formats and sources.
 
-- [email](https://github.com/chronicle-app/chronicle-email). Extractors for `mbox` and other email files
-- [shell](https://github.com/chronicle-app/chronicle-shell). Extract shell history from Bash or Zsh`
-- [imessage](https://github.com/chronicle-app/chronicle-imessage). Extract iMessage messages from a local macOS installation
+#### Extractors
+- [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/csv_extractor.rb) - Load records from CSV files or stdin
+- [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/json_extractor.rb) - Load JSON (either [line-separated objects](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON) or one object)
+- [`file`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/file_extractor.rb) - load from a single file or directory (with a glob pattern)
 
-To install any of these, run `gem install chronicle-PROVIDER`.
+#### Transformers
+- [`null`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/transformers/null_transformer.rb) - (default) Don’t do anything and pass on raw extraction data
 
-If you don't want to use the available rubygem importers, `chronicle-etl` can use `stdin` as an Extractor source (newline separated records). You can also use `stdout` as a loader — transformed records will be outputted separated by newlines.
+#### Loaders
+- [`table`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/table_loader.rb) - (default) Output an ascii table of records. Useful for exploring data.
+- [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/csv_extractor.rb) - Load records to CSV
+- [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/json_loader.rb) - Load records serialized as JSON
+- [`rest`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/rest_loader.rb) - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
 
-I'll be open-sourcing more importers. Please [contact me](mailto:andrew@hyfen.net) to chat about what will be available!
-
-## Full commands
-
-```
-$ chronicle-etl help
-
-ALL COMMANDS
-  help                       # This help menu
-  connectors help [COMMAND]  # Describe subcommands or one specific subcommand
-  connectors:install NAME    # Installs connector NAME
-  connectors:list            # Lists available connectors
-  jobs help [COMMAND]        # Describe subcommands or one specific subcommand
-  jobs:create                # Create a job
-  jobs:list                  # List all available jobs
-  jobs:run                   # Start a job
-  jobs:show                  # Show details about a job
-```
-
-### Running a job
+### Plugins
+Plugins provide access to data from third-party platforms, services, or formats.
 
+```bash
+# Install a plugin
+$ chronicle-etl connectors:install NAME
 ```
-Usage:
-  chronicle-etl jobs:run
 
-Options:
-  [--log-level=LOG_LEVEL]               # Log level (debug, info, warn, error, fatal)
-                                        # Default: info
-  -v, [--verbose], [--no-verbose]       # Set log level to verbose
-  [--dry-run], [--no-dry-run]           # Only run the extraction and transform steps, not the loading
-  -e, [--extractor=extractor-name]      # Extractor class. Default: stdin
-  [--extractor-opts=key:value]          # Extractor options
-  -t, [--transformer=transformer-name]  # Transformer class. Default: null
-  [--transformer-opts=key:value]        # Transformer options
-  -l, [--loader=loader-name]            # Loader class. Default: stdout
-  [--loader-opts=key:value]             # Loader options
-  -j, [--name=NAME]                     # Job configuration name
-
-
-Runs an ETL job
+A few dozen importers exist [in my Memex project](https://hyfen.net/memex/) and they’re being ported over to the Chronicle system. This table shows what’s available now and what’s coming. Rows are sorted in very rough order of priority.
+
+If you want to work together on a connector, please [get in touch](#get-in-touch)!
+
+| Name | Description | Availability |
+|------|-------------|--------------|
+| [imessage](https://github.com/chronicle-app/chronicle-imessage) | iMessage messages and attachments | Available |
+| [shell](https://github.com/chronicle-app/chronicle-shell) | Shell command history | Available (zsh support pending) |
+| [email](https://github.com/chronicle-app/chronicle-email) | Emails and attachments from IMAP or .mbox files | Available (imap support pending) |
+| [pinboard](https://github.com/chronicle-app/chronicle-email) | Bookmarks and tags | Available |
+| github | Github user and repo activity | In progress |
+| safari | Browser history from local sqlite db | Needs porting |
+| chrome | Browser history from local sqlite db | Needs porting |
+| whatsapp | Messaging history (via individual chat exports) or reverse-engineered local desktop install | Unstarted |
+| anki | Studying and card creation history | Needs porting |
+| facebook | Messaging and history posting via data export files | Needs porting |
+| twitter | History via API or export data files | Needs porting |
+| foursquare | Location history via API | Needs porting |
+| goodreads | Reading history via export csv (RIP goodreads API) | Needs porting |
+| lastfm | Listening history via API | Needs porting |
+| images | Process image files | Needs porting |
+| arc | Location history from synced icloud backup files | Needs porting |
+| firefox | Browser history from local sqlite db | Needs porting |
+| fitbit | Personal analytics via API | Needs porting |
+| git | Commit history on a repo | Needs porting |
+| google-calendar | Calendar events via API | Needs porting |
+| instagram | Posting and messaging history via export data | Needs porting |
+| shazam | Song tags via reverse-engineered API | Needs porting |
+| slack | Messaging history via API | Need rethinking |
+| strava | Activity history via API | Needs porting |
+| things | Task activity via local sqlite db | Needs porting |
+| bear | Note taking activity via local sqlite db | Needs porting |
+| youtube | Video activity via takeout data and API | Needs porting |
+
+### Writing your own connector
+
+Additional connectors are packaged as separate ruby gems. You can view the [iMessage plugin](https://github.com/chronicle-app/chronicle-imessage) for an example.
+
+If you want to load a custom connector without creating a gem, you can help by [completing this issue](https://github.com/chronicle-app/chronicle-etl/issues/23).
+
+If you want to work together on a connector, please [get in touch](#get-in-touch)!
+
+#### Sample custom Extractor class
+```ruby
+module Chronicle
+  module FooService
+    class FooExtractor < Chronicle::ETL::Extractor
+      register_connector do |r|
+        r.identifier = 'foo'
+        r.description = 'From foo.com'
+      end
+
+      setting :access_token, required: true
+
+      def prepare
+        @records = # load from somewhere
+      end
+
+      def extract
+        @records.each do |record|
+          yield Chronicle::ETL::Extraction.new(data: record.to_h)
+        end
+      end
+    end
+  end
+end
 ```
 
 ## Development
-
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
 
 To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
-## Contributing
+### Additional development commands
+```bash
+# run tests
+bundle exec rake spec
+
+# generate docs
+bundle exec rake yard
+
+# use Guard to run specs automatically
+bundle exec guard
+```
 
+## Get in touch
+- [@hyfen](https://twitter.com/hyfen) on Twitter
+- [@hyfen](https://github.com/hyfen) on Github
+- Email: andrew@hyfen.net
+
+## Contributing
 Bug reports and pull requests are welcome on GitHub at https://github.com/chronicle-app/chronicle-etl. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
 ## License
-
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
 ## Code of Conduct
-
-Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/chronicle-app/chronicle-etl/blob/master/CODE_OF_CONDUCT.md).
+Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/chronicle-app/chronicle-etl/blob/master/CODE_OF_CONDUCT.md).
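To complement the README's sample Extractor above, here is a minimal sketch of a matching custom Loader, modelled on the `JSONLoader` introduced later in this release (the start/load/finish lifecycle, `register_connector`, and `setting` shown there); the `FooService` module, `foo` identifier, and output filename are hypothetical.

```ruby
require 'json'

module Chronicle
  module FooService
    # Hypothetical loader: appends each record to a local file as JSON,
    # using the same lifecycle hooks as the built-in loaders.
    class FooLoader < Chronicle::ETL::Loader
      register_connector do |r|
        r.identifier = 'foo'
        r.description = 'to a local foo log'
      end

      setting :output, default: 'foo.log'

      def start
        @file = File.open(@config.output, 'w')
      end

      def load(record)
        @file.puts(record.to_h.to_json)
      end

      def finish
        @file.close
      end
    end
  end
end
```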
@@ -6,20 +6,20 @@ module Chronicle
    # CLI commands for working with ETL jobs
    class Jobs < SubcommandBase
      default_task "start"
-      namespace :jobs
+      namespace :jobs
 
      class_option :name, aliases: '-j', desc: 'Job configuration name'
 
-      class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'extractor-name'
+      class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'NAME'
      class_option :'extractor-opts', desc: 'Extractor options', type: :hash, default: {}
-      class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'transformer-name'
+      class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'NAME'
      class_option :'transformer-opts', desc: 'Transformer options', type: :hash, default: {}
-      class_option :loader, aliases: '-l', desc: 'Loader class. Default: stdout', banner: 'loader-name'
+      class_option :loader, aliases: '-l', desc: 'Loader class. Default: table', banner: 'NAME'
      class_option :'loader-opts', desc: 'Loader options', type: :hash, default: {}
 
      # This is an array to deal with shell globbing
      class_option :input, aliases: '-i', desc: 'Input filename or directory', default: [], type: 'array', banner: 'FILENAME'
-      class_option :since, desc: "Load records SINCE this date. Overrides job's `load_since` configuration option in extractor's options", banner: 'DATE'
+      class_option :since, desc: "Load records SINCE this date", banner: 'DATE'
      class_option :until, desc: "Load records UNTIL this date", banner: 'DATE'
      class_option :limit, desc: "Only extract the first LIMIT records", banner: 'N'
 
@@ -28,6 +28,7 @@ module Chronicle
 
      class_option :log_level, desc: 'Log level (debug, info, warn, error, fatal)', default: 'info'
      class_option :verbose, aliases: '-v', desc: 'Set log level to verbose', type: :boolean
+      class_option :silent, desc: 'Silence all output', type: :boolean
 
      # Thor doesn't like `run` as a command name
      map run: :start
@@ -93,7 +94,9 @@ LONG_DESC
      private
 
      def setup_log_level
-        if options[:verbose]
+        if options[:silent]
+          Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::SILENT
+        elsif options[:verbose]
          Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::DEBUG
        elsif options[:log_level]
          level = Chronicle::ETL::Logger.const_get(options[:log_level].upcase)
@@ -116,7 +119,7 @@ LONG_DESC
      # Takes flag options and turns them into a runner config
      def process_flag_options options
        extractor_options = options[:'extractor-opts'].merge({
-          filename: (options[:input] if options[:input].any?),
+          input: (options[:input] if options[:input].any?),
          since: options[:since],
          until: options[:until],
          limit: options[:limit],
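For orientation, a rough sketch of the extractor options this method now assembles from the CLI flags — the `input` key (an array, because of shell globbing) replaces the old `filename` key. The values below are purely illustrative.

```ruby
# Illustrative only: shape of the merged extractor options after this change.
extractor_options = {
  input: ['./data.csv'],   # from --input (nil when no files are given)
  since: '2022-01-01',     # from --since
  until: nil,              # from --until
  limit: 100               # from --limit
}
```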
@@ -89,6 +89,14 @@ module Chronicle
      value.to_s
    end
 
+    def coerce_boolean(value)
+      if value.is_a?(String)
+        value.downcase == "true"
+      else
+        value
+      end
+    end
+
    def coerce_time(value)
      # TODO: handle durations like '3h'
      if value.is_a?(String)
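The new `coerce_boolean` treats only the string `"true"` (case-insensitively) as true; other strings coerce to false and non-strings pass through unchanged. A quick illustration of the behaviour defined above:

```ruby
# Illustrative calls against the coercion shown above.
coerce_boolean("true")   # => true
coerce_boolean("TRUE")   # => true
coerce_boolean("false")  # => false
coerce_boolean(true)     # => true  (non-strings pass through)
coerce_boolean(nil)      # => nil
```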
@@ -1,8 +1,8 @@
 module Chronicle
   module ETL
-    class Error < StandardError; end;
+    class Error < StandardError; end
 
-    class ConfigurationError < Error; end;
+    class ConfigurationError < Error; end
 
     class RunnerTypeError < Error; end
 
@@ -18,6 +18,10 @@ module Chronicle
    class ProviderNotAvailableError < ConnectorNotAvailableError; end
    class ProviderConnectorNotAvailableError < ConnectorNotAvailableError; end
 
+    class ExtractionError < Error; end
+
+    class SerializationError < Error; end
+
    class TransformationError < Error
      attr_reader :transformation
 
@@ -3,39 +3,46 @@ require 'csv'
 module Chronicle
   module ETL
     class CSVExtractor < Chronicle::ETL::Extractor
-      include Extractors::Helpers::FilesystemReader
+      include Extractors::Helpers::InputReader
 
      register_connector do |r|
-        r.description = 'input as CSV'
+        r.description = 'CSV'
      end
 
      setting :headers, default: true
-      setting :filename, default: $stdin
+
+      def prepare
+        @csvs = prepare_sources
+      end
 
      def extract
-        csv = initialize_csv
-        csv.each do |row|
-          yield Chronicle::ETL::Extraction.new(data: row.to_h)
+        @csvs.each do |csv|
+          csv.read.each do |row|
+            yield Chronicle::ETL::Extraction.new(data: row.to_h)
+          end
        end
      end
 
      def results_count
-        CSV.read(@config.filename, headers: @config.headers).count unless stdin?(@config.filename)
+        @csvs.reduce(0) do |total_rows, csv|
+          row_count = csv.readlines.size
+          csv.rewind
+          total_rows + row_count
+        end
      end
 
      private
 
-      def initialize_csv
-        headers = @config.headers.is_a?(String) ? @config.headers.split(',') : @config.headers
-
-        csv_options = {
-          headers: headers,
-          converters: :all
-        }
-
-        open_from_filesystem(filename: @config.filename) do |file|
-          return CSV.new(file, **csv_options)
+      def prepare_sources
+        @csvs = []
+        read_input do |csv_data|
+          csv_options = {
+            headers: @config.headers.is_a?(String) ? @config.headers.split(',') : @config.headers,
+            converters: :all
+          }
+          @csvs << CSV.new(csv_data, **csv_options)
        end
+        @csvs
      end
    end
  end
@@ -7,11 +7,11 @@ module Chronicle
      extend Chronicle::ETL::Registry::SelfRegistering
      include Chronicle::ETL::Configurable
 
-      setting :since, type: :date
-      setting :until, type: :date
+      setting :since, type: :time
+      setting :until, type: :time
      setting :limit
      setting :load_after_id
-      setting :filename
+      setting :input
 
      # Construct a new instance of this extractor. Options are passed in from a Runner
      # == Parameters:
@@ -46,7 +46,7 @@ module Chronicle
   end
 end
 
-require_relative 'helpers/filesystem_reader'
+require_relative 'helpers/input_reader'
 require_relative 'csv_extractor'
 require_relative 'file_extractor'
 require_relative 'json_extractor'
@@ -2,35 +2,55 @@ require 'pathname'
 
 module Chronicle
   module ETL
+    # Return filenames that match a pattern in a directory
     class FileExtractor < Chronicle::ETL::Extractor
-      include Extractors::Helpers::FilesystemReader
 
      register_connector do |r|
        r.description = 'file or directory of files'
      end
 
-      # TODO: consolidate this with @config.filename
-      setting :dir_glob_pattern
+      setting :input, default: ['.']
+      setting :dir_glob_pattern, default: "**/*"
+      setting :larger_than
+      setting :smaller_than
+
+      def prepare
+        @pathnames = gather_files
+      end
 
      def extract
-        filenames.each do |filename|
-          yield Chronicle::ETL::Extraction.new(data: filename)
+        @pathnames.each do |pathname|
+          yield Chronicle::ETL::Extraction.new(data: pathname.to_path)
        end
      end
 
      def results_count
-        filenames.count
+        @pathnames.count
      end
 
      private
 
-      def filenames
-        @filenames ||= filenames_in_directory(
-          path: @config.filename,
-          dir_glob_pattern: @config.dir_glob_pattern,
-          load_since: @config.since,
-          load_until: @config.until
-        )
+      def gather_files
+        roots = [@config.input].flatten.map { |filename| Pathname.new(filename) }
+        raise(ExtractionError, "Input must exist") unless roots.all?(&:exist?)
+
+        directories, files = roots.partition(&:directory?)
+
+        directories.each do |directory|
+          files += Dir.glob(File.join(directory, @config.dir_glob_pattern)).map { |filename| Pathname.new(filename) }
+        end
+
+        files = files.uniq
+
+        files = files.keep_if { |f| (f.mtime > @config.since) } if @config.since
+        files = files.keep_if { |f| (f.mtime < @config.until) } if @config.until
+
+        # pass in file sizes in bytes
+        files = files.keep_if { |f| (f.size < @config.smaller_than) } if @config.smaller_than
+        files = files.keep_if { |f| (f.size > @config.larger_than) } if @config.larger_than
+
+        # # TODO: incorporate sort argument
+        files.sort_by(&:mtime)
      end
    end
  end
@@ -0,0 +1,76 @@
+require 'pathname'
+
+module Chronicle
+  module ETL
+    module Extractors
+      module Helpers
+        module InputReader
+          # Return an array of input filenames; converts a single string
+          # to an array if necessary
+          def filenames
+            [@config.input].flatten.map
+          end
+
+          # Filenames as an array of pathnames
+          def pathnames
+            filenames.map { |filename| Pathname.new(filename) }
+          end
+
+          # Whether we're reading from files
+          def read_from_files?
+            filenames.any?
+          end
+
+          # Whether we're reading input from stdin
+          def read_from_stdin?
+            !read_from_files? && $stdin.stat.pipe?
+          end
+
+          # Read input sources and yield each content
+          def read_input
+            if read_from_files?
+              pathnames.each do |pathname|
+                File.open(pathname) do |file|
+                  yield file.read, pathname.to_path
+                end
+              end
+            elsif read_from_stdin?
+              yield $stdin.read, $stdin
+            else
+              raise ExtractionError, "No input files or stdin provided"
+            end
+          end
+
+          # Read input sources line by line
+          def read_input_as_lines(&block)
+            if read_from_files?
+              lines_from_files(&block)
+            elsif read_from_stdin?
+              lines_from_stdin(&block)
+            else
+              raise ExtractionError, "No input files or stdin provided"
+            end
+          end
+
+          private
+
+          def lines_from_files(&block)
+            pathnames.each do |pathname|
+              File.open(pathname) do |file|
+                lines_from_io(file, &block)
+              end
+            end
+          end
+
+          def lines_from_stdin(&block)
+            lines_from_io($stdin, &block)
+          end
+
+          def lines_from_io(io, &block)
+            io.each_line(&block)
+          end
+        end
+      end
+    end
+  end
+end
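A sketch of how an extractor consumes this helper, following the pattern of the `CSVExtractor` and `JSONExtractor` changes elsewhere in this diff; the `LineCountExtractor` name and its record shape are made up for illustration.

```ruby
module Chronicle
  module ETL
    # Hypothetical extractor: counts lines in each input file (or stdin)
    # via the InputReader helper introduced above.
    class LineCountExtractor < Chronicle::ETL::Extractor
      include Extractors::Helpers::InputReader

      register_connector do |r|
        r.description = 'line counts of input files'
      end

      def prepare
        @counts = {}
        # read_input yields each source's full contents plus its name
        read_input do |contents, source|
          @counts[source.to_s] = contents.lines.count
        end
      end

      def extract
        @counts.each do |source, count|
          yield Chronicle::ETL::Extraction.new(data: { source: source, lines: count })
        end
      end

      def results_count
        @counts.size
      end
    end
  end
end
```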
@@ -1,35 +1,44 @@
 module Chronicle
   module ETL
-    class JsonExtractor < Chronicle::ETL::Extractor
-      include Extractors::Helpers::FilesystemReader
+    class JSONExtractor < Chronicle::ETL::Extractor
+      include Extractors::Helpers::InputReader
 
      register_connector do |r|
-        r.description = 'input as JSON'
+        r.description = 'JSON'
      end
 
-      setting :filename, default: $stdin
-      setting :jsonl, default: true
+      setting :jsonl, default: true, type: :boolean
 
-      def extract
+      def prepare
+        @jsons = []
        load_input do |input|
-          parsed_data = parse_data(input)
-          yield Chronicle::ETL::Extraction.new(data: parsed_data) if parsed_data
+          @jsons << parse_data(input)
+        end
+      end
+
+      def extract
+        @jsons.each do |json|
+          yield Chronicle::ETL::Extraction.new(data: json)
        end
      end
 
      def results_count
+        @jsons.count
      end
 
      private
 
      def parse_data data
        JSON.parse(data)
-      rescue JSON::ParserError => e
+      rescue JSON::ParserError
+        raise Chronicle::ETL::ExtractionError, "Could not parse JSON"
      end
 
-      def load_input
-        read_from_filesystem(filename: @options[:filename]) do |data|
-          yield data
+      def load_input(&block)
+        if @config.jsonl
+          read_input_as_lines(&block)
+        else
+          read_input(&block)
        end
      end
    end
@@ -14,7 +14,7 @@ module Chronicle
          options: {}
        },
        loader: {
-          name: 'stdout',
+          name: 'table',
          options: {}
        }
      }.freeze
@@ -0,0 +1,44 @@
+module Chronicle
+  module ETL
+    class JSONLoader < Chronicle::ETL::Loader
+      register_connector do |r|
+        r.description = 'json'
+      end
+
+      setting :serializer
+      setting :output, default: $stdout
+
+      def start
+        if @config.output == $stdout
+          @output = @config.output
+        else
+          @output = File.open(@config.output, "w")
+        end
+      end
+
+      def load(record)
+        serialized = serializer.serialize(record)
+
+        # When dealing with raw data, we can get improperly encoded strings
+        # (eg from sqlite database columns). We force conversion to UTF-8
+        # before converting into JSON
+        encoded = serialized.transform_values do |value|
+          next value unless value.is_a?(String)
+
+          value.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
+        end
+        @output.puts encoded.to_json
+      end
+
+      def finish
+        @output.close
+      end
+
+      private
+
+      def serializer
+        @config.serializer || Chronicle::ETL::RawSerializer
+      end
+    end
+  end
+end
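The encoding scrub in `load` is worth highlighting: string values that are not valid UTF-8 (for example, blobs read straight out of sqlite columns) are forced into UTF-8 before JSON serialization. A standalone illustration of the same idiom, with made-up data:

```ruby
require 'json'

# A value with an invalid UTF-8 byte, as might come from a raw extraction
record = { title: "caf\xE9", count: 3 }

encoded = record.transform_values do |value|
  next value unless value.is_a?(String)

  value.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end

puts encoded.to_json # => {"title":"caf?","count":3}
```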
@@ -30,6 +30,6 @@ module Chronicle
 end
 
 require_relative 'csv_loader'
+require_relative 'json_loader'
 require_relative 'rest_loader'
-require_relative 'stdout_loader'
 require_relative 'table_loader'
@@ -11,20 +11,19 @@ module Chronicle
 
      setting :fields_limit, default: nil
      setting :fields_exclude, default: ['lids', 'type']
-      setting :fields_include, default: []
+      setting :fields, default: []
      setting :truncate_values_at, default: 40
      setting :table_renderer, default: :basic
 
      def load(record)
-        @records ||= []
-        @records << record.to_h_flattened
+        records << record.to_h_flattened
      end
 
      def finish
-        return if @records.empty?
+        return if records.empty?
 
-        headers = build_headers(@records)
-        rows = build_rows(@records, headers)
+        headers = build_headers(records)
+        rows = build_rows(records, headers)
 
        @table = TTY::Table.new(header: headers, rows: rows)
        puts @table.render(
@@ -33,12 +32,16 @@ module Chronicle
        )
      end
 
+      def records
+        @records ||= []
+      end
+
      private
 
      def build_headers(records)
        headers =
-          if @config.fields_include.any?
-            Set[*@config.fields_include]
+          if @config.fields.any?
+            Set[*@config.fields]
          else
            # use all the keys of the flattened record hash
@@ -52,7 +55,7 @@
 
      def build_rows(records, headers)
        records.map do |record|
-          values = record.values_at(*headers).map{|value| value.to_s }
+          values = record.transform_keys(&:to_sym).values_at(*headers).map{|value| value.to_s }
 
          if @config.truncate_values_at
            values = values.map{ |value| value.truncate(@config.truncate_values_at) }
@@ -8,6 +8,7 @@ module Chronicle
    WARN = 2
    ERROR = 3
    FATAL = 4
+    SILENT = 5
 
    attr_accessor :log_level
 
@@ -5,6 +5,9 @@ module Chronicle
  module Models
    # Represents a record that's been transformed by a Transformer and
    # ready to be loaded. Loosely based on ActiveModel.
+    #
+    # @todo Experiment with just mixing in ActiveModel instead of this
+    #   this reimplementation
    class Base
      ATTRIBUTES = [:provider, :provider_id, :lat, :lng, :metadata].freeze
      ASSOCIATIONS = [].freeze
@@ -5,13 +5,19 @@ module Chronicle
  module Models
    class Entity < Chronicle::ETL::Models::Base
      TYPE = 'entities'.freeze
-      ATTRIBUTES = [:title, :body, :represents, :slug, :myself, :metadata].freeze
+      ATTRIBUTES = [:title, :body, :provider_url, :represents, :slug, :myself, :metadata].freeze
+
+      # TODO: This desperately needs a validation system
      ASSOCIATIONS = [
+        :involvements, # inverse of activity's `involved`
+
        :attachments,
        :abouts,
+        :aboutables, # inverse of above
        :depicts,
        :consumers,
-        :contains
+        :contains,
+        :containers # inverse of above
      ].freeze # TODO: add these to reflect Chronicle Schema
 
      attr_accessor(*ATTRIBUTES, *ASSOCIATIONS)
@@ -0,0 +1,26 @@
+require 'chronicle/etl/models/base'
+
+module Chronicle
+  module ETL
+    module Models
+      # A record from an extraction with no processing or normalization applied
+      class Raw
+        TYPE = 'raw'
+
+        attr_accessor :raw_data
+
+        def initialize(raw_data)
+          @raw_data = raw_data
+        end
+
+        def to_h
+          @raw_data.to_h
+        end
+
+        def to_h_flattened
+          Chronicle::ETL::Utils::HashUtilities.flatten_hash(to_h)
+        end
+      end
+    end
+  end
+end
@@ -28,9 +28,10 @@ class Chronicle::ETL::Runner
      transformer = @job.instantiate_transformer(extraction)
      record = transformer.transform
 
-      unless record.is_a?(Chronicle::ETL::Models::Base)
-        raise Chronicle::ETL::RunnerTypeError, "Transformed data should be a type of Chronicle::ETL::Models"
-      end
+      # TODO: rethink this
+      # unless record.is_a?(Chronicle::ETL::Models)
+      #   raise Chronicle::ETL::RunnerTypeError, "Transformed data should be a type of Chronicle::ETL::Models"
+      # end
 
      Chronicle::ETL::Logger.info(tty_log_transformation(transformer))
      @job_logger.log_transformation(transformer)
@@ -52,7 +53,7 @@ class Chronicle::ETL::Runner
    raise e
  ensure
    @job_logger.save
-    @progress_bar.finish
+    @progress_bar&.finish
    Chronicle::ETL::Logger.detach_from_progress_bar
    Chronicle::ETL::Logger.info(tty_log_completion)
  end
@@ -1,6 +1,12 @@
 module Chronicle
   module ETL
     class JSONAPISerializer < Chronicle::ETL::Serializer
+      def initialize(*args)
+        super
+
+        raise(SerializationError, "Record must be a subclass of Chronicle::ETL::Model::Base") unless @record.is_a?(Chronicle::ETL::Models::Base)
+      end
+
      def serializable_hash
        @record
          .identifier_hash
@@ -0,0 +1,10 @@
+module Chronicle
+  module ETL
+    # Take a Raw model and output `raw_data` as a hash
+    class RawSerializer < Chronicle::ETL::Serializer
+      def serializable_hash
+        @record.to_h
+      end
+    end
+  end
+end
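Putting the pieces together: `NullTransformer` now emits `Models::Raw`, `RawSerializer` turns it back into a plain hash, and `JSONLoader` defaults to that serializer. A small sketch — the data values are made up, and the `Serializer.new(record)` form mirrors the usage in the removed `StdoutLoader` further below:

```ruby
# Illustrative: a Raw record passed through the new RawSerializer.
raw = Chronicle::ETL::Models::Raw.new({ 'verb' => 'watched', 'title' => 'Arrival' })

serializer = Chronicle::ETL::RawSerializer.new(raw)
serializer.serializable_hash  # => { 'verb' => 'watched', 'title' => 'Arrival' }

raw.to_h_flattened            # nested hashes flattened via Utils::HashUtilities.flatten_hash
```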
@@ -24,4 +24,5 @@ module Chronicle
  end
 end
 
-require_relative 'jsonapi_serializer'
+require_relative 'jsonapi_serializer'
+require_relative 'raw_serializer'
@@ -7,7 +7,7 @@ module Chronicle
      end
 
      def transform
-        Chronicle::ETL::Models::Generic.new(@extraction.data)
+        Chronicle::ETL::Models::Raw.new(@extraction.data)
      end
 
      def timestamp; end
@@ -1,5 +1,5 @@
 module Chronicle
   module ETL
-    VERSION = "0.4.0"
+    VERSION = "0.4.1"
   end
 end
data/lib/chronicle/etl.rb CHANGED
@@ -3,23 +3,30 @@ require_relative 'etl/config'
 require_relative 'etl/configurable'
 require_relative 'etl/exceptions'
 require_relative 'etl/extraction'
-require_relative 'etl/extractors/extractor'
 require_relative 'etl/job_definition'
 require_relative 'etl/job_log'
 require_relative 'etl/job_logger'
 require_relative 'etl/job'
-require_relative 'etl/loaders/loader'
 require_relative 'etl/logger'
 require_relative 'etl/models/activity'
 require_relative 'etl/models/attachment'
 require_relative 'etl/models/base'
+require_relative 'etl/models/raw'
 require_relative 'etl/models/entity'
-require_relative 'etl/models/generic'
 require_relative 'etl/runner'
 require_relative 'etl/serializers/serializer'
-require_relative 'etl/transformers/transformer'
 require_relative 'etl/utils/binary_attachments'
 require_relative 'etl/utils/hash_utilities'
 require_relative 'etl/utils/text_recognition'
 require_relative 'etl/utils/progress_bar'
 require_relative 'etl/version'
+
+require_relative 'etl/extractors/extractor'
+require_relative 'etl/loaders/loader'
+require_relative 'etl/transformers/transformer'
+
+begin
+  require 'pry'
+rescue LoadError
+  # Pry not available
+end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: chronicle-etl
 version: !ruby/object:Gem::Version
-  version: 0.4.0
+  version: 0.4.1
 platform: ruby
 authors:
 - Andrew Louis
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-02-25 00:00:00.000000000 Z
+date: 2022-03-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -328,7 +328,7 @@ files:
 - lib/chronicle/etl/extractors/csv_extractor.rb
 - lib/chronicle/etl/extractors/extractor.rb
 - lib/chronicle/etl/extractors/file_extractor.rb
-- lib/chronicle/etl/extractors/helpers/filesystem_reader.rb
+- lib/chronicle/etl/extractors/helpers/input_reader.rb
 - lib/chronicle/etl/extractors/json_extractor.rb
 - lib/chronicle/etl/extractors/stdin_extractor.rb
 - lib/chronicle/etl/job.rb
@@ -336,21 +336,22 @@ files:
 - lib/chronicle/etl/job_log.rb
 - lib/chronicle/etl/job_logger.rb
 - lib/chronicle/etl/loaders/csv_loader.rb
+- lib/chronicle/etl/loaders/json_loader.rb
 - lib/chronicle/etl/loaders/loader.rb
 - lib/chronicle/etl/loaders/rest_loader.rb
-- lib/chronicle/etl/loaders/stdout_loader.rb
 - lib/chronicle/etl/loaders/table_loader.rb
 - lib/chronicle/etl/logger.rb
 - lib/chronicle/etl/models/activity.rb
 - lib/chronicle/etl/models/attachment.rb
 - lib/chronicle/etl/models/base.rb
 - lib/chronicle/etl/models/entity.rb
-- lib/chronicle/etl/models/generic.rb
+- lib/chronicle/etl/models/raw.rb
 - lib/chronicle/etl/registry/connector_registration.rb
 - lib/chronicle/etl/registry/registry.rb
 - lib/chronicle/etl/registry/self_registering.rb
 - lib/chronicle/etl/runner.rb
 - lib/chronicle/etl/serializers/jsonapi_serializer.rb
+- lib/chronicle/etl/serializers/raw_serializer.rb
 - lib/chronicle/etl/serializers/serializer.rb
 - lib/chronicle/etl/transformers/image_file_transformer.rb
 - lib/chronicle/etl/transformers/null_transformer.rb
@@ -1,104 +0,0 @@
-require 'pathname'
-
-module Chronicle
-  module ETL
-    module Extractors
-      module Helpers
-        module FilesystemReader
-
-          def filenames_in_directory(...)
-            filenames = gather_files(...)
-            if block_given?
-              filenames.each do |filename|
-                yield filename
-              end
-            else
-              filenames
-            end
-          end
-
-          def read_from_filesystem(filename:, yield_each_line: true, dir_glob_pattern: '**/*')
-            open_files(filename: filename, dir_glob_pattern: dir_glob_pattern) do |file|
-              if yield_each_line
-                file.each_line do |line|
-                  yield line
-                end
-              else
-                yield file.read
-              end
-            end
-          end
-
-          def open_from_filesystem(filename:, dir_glob_pattern: '**/*')
-            open_files(filename: filename, dir_glob_pattern: dir_glob_pattern) do |file|
-              yield file
-            end
-          end
-
-          def results_count
-            raise NotImplementedError
-            # if file?
-            #   return 1
-            # else
-            #   search_pattern = File.join(@options[:filename], '**/*')
-            #   Dir.glob(search_pattern).count
-            # end
-          end
-
-          private
-
-          def gather_files(path:, dir_glob_pattern: '**/*', load_since: nil, load_until: nil, smaller_than: nil, larger_than: nil, sort: :mtime)
-            search_pattern = File.join(path, '**', dir_glob_pattern)
-            files = Dir.glob(search_pattern)
-
-            files = files.keep_if {|f| (File.mtime(f) > load_since)} if load_since
-            files = files.keep_if {|f| (File.mtime(f) < load_until)} if load_until
-
-            # pass in file sizes in bytes
-            files = files.keep_if {|f| (File.size(f) < smaller_than)} if smaller_than
-            files = files.keep_if {|f| (File.size(f) > larger_than)} if larger_than
-
-            # TODO: incorporate sort argument
-            files.sort_by{ |f| File.mtime(f) }
-          end
-
-          def select_files_in_directory(path:, dir_glob_pattern: '**/*')
-            raise IOError.new("#{path} is not a directory.") unless directory?(path)
-
-            search_pattern = File.join(path, dir_glob_pattern)
-            Dir.glob(search_pattern).each do |filename|
-              yield(filename)
-            end
-          end
-
-          def open_files(filename:, dir_glob_pattern:)
-            if stdin?(filename)
-              yield $stdin
-            elsif directory?(filename)
-              search_pattern = File.join(filename, dir_glob_pattern)
-              filenames = Dir.glob(search_pattern)
-              filenames.each do |filename|
-                file = File.open(filename)
-                yield(file)
-              end
-            elsif file?(filename)
-              yield File.open(filename)
-            end
-          end
-
-          def stdin?(filename)
-            filename == $stdin
-          end
-
-          def directory?(filename)
-            Pathname.new(filename).directory?
-          end
-
-          def file?(filename)
-            Pathname.new(filename).file?
-          end
-        end
-      end
-    end
-  end
-end
@@ -1,14 +0,0 @@
-module Chronicle
-  module ETL
-    class StdoutLoader < Chronicle::ETL::Loader
-      register_connector do |r|
-        r.description = 'stdout'
-      end
-
-      def load(record)
-        serializer = Chronicle::ETL::JSONAPISerializer.new(record)
-        puts serializer.serializable_hash.to_json
-      end
-    end
-  end
-end
@@ -1,23 +0,0 @@
-require 'chronicle/etl/models/base'
-
-module Chronicle
-  module ETL
-    module Models
-      class Generic < Chronicle::ETL::Models::Base
-        TYPE = 'generic'
-
-        attr_accessor :properties
-
-        def initialize(properties = {})
-          @properties = properties
-          super
-        end
-
-        # Generic models have arbitrary attributes stored in @properties
-        def attributes
-          @properties.transform_keys(&:to_sym)
-        end
-      end
-    end
-  end
-end