chronicle-etl 0.4.0 → 0.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 5fd411a9a41a645b85780230c79b09f361e121d0e8ca7f3270ca8eba55a76ca8
- data.tar.gz: c09053715910ab4f027fbdc3a5b7d10c042eee962f7fa93c6571ce8359f51009
+ metadata.gz: 8a267de435b41b579e36128b7392729ef499eb37f05fabaead7811f089938ddb
+ data.tar.gz: d4af2f62f3f5de926bdfbb0e3d6dbe2c952ec286c07317af4dca8d98f665d6da
  SHA512:
- metadata.gz: 2c9ec14b6c0a51f1c5ec77ee8d9a7f016d16bdc35db5634f9fa5d38aabc30dec201cd4b8bef06a31b86773a0c1cda2d271d7008dcb247a86d956c094919f3c0f
- data.tar.gz: 0dca41e1654e5b2b98a148f853492a67126cdac767000b3c5f97c5c8ff88b77464e17a2fab38b72c1f014f3515c911e5f3f391eaf68d64e73dcfcff5d8e6cb6a
+ metadata.gz: c78080cce008340f0b2795be46da2b5eb6562b2bffd97728150960343870f2bea4699e4efa07905710dd0e2eba7aaa1e803d8c0f727196f5d9d655b28a04f02e
+ data.tar.gz: cae3a3ffb6527f5c0b3ff89c75dc98d9cd66157ee6230c9db797f4683f90e2146daadf291108e55d3090d0120d3c9e25135cb21c4e9078bcaf4d1edf2172c930
@@ -9,9 +9,9 @@ name: Ruby
 
  on:
    push:
-     branches: [ master ]
+     branches: [ main ]
    pull_request:
-     branches: [ master ]
+     branches: [ main ]
 
  jobs:
    test:
data/README.md CHANGED
@@ -1,125 +1,189 @@
- # Chronicle::ETL
+ ## A CLI toolkit for extracting and working with your digital history
 
  [![Gem Version](https://badge.fury.io/rb/chronicle-etl.svg)](https://badge.fury.io/rb/chronicle-etl) [![Ruby](https://github.com/chronicle-app/chronicle-etl/actions/workflows/ruby.yml/badge.svg)](https://github.com/chronicle-app/chronicle-etl/actions/workflows/ruby.yml)
 
- Chronicle ETL is a utility that helps you archive and processes personal data. You can *extract* it from a variety of sources, *transform* it, and *load* it to an external API, file, or stdout.
+ Are you trying to archive your digital history or incorporate it into your own projects? You’ve probably discovered how frustrating it is to get machine-readable access to your own data. While [building a memex](https://hyfen.net/memex/), I learned first-hand what great efforts must be made before you can begin using the data in interesting ways.
 
- This tool is an adaptation of Andrew Louis's experimental [Memex project](https://hyfen.net/memex) and the dozens of existing importers are being migrated to Chronicle.
+ If you don’t want to spend all your time writing scrapers, reverse-engineering APIs, or parsing takeout data, this project is for you! (*If you do enjoy these things, please see the [open issues](https://github.com/chronicle-app/chronicle-etl/issues).*)
 
- ## Installation
+ `chronicle-etl` is a CLI tool that gives you the ability to easily access your personal data. It uses the ETL pattern to **extract** it from a source (e.g. your local browser history, a directory of images, goodreads.com reading history), **transform** it (into a given schema), and **load** it to a source (e.g. a CSV file, JSON, external API).
 
- ```bash
- $ gem install chronicle-etl
+ ## What does `chronicle-etl` give you?
+ * **CLI tool for working with personal data**. You can monitor progress of exports, manipulate the output, set up recurring jobs, manage credentials, and more.
+ * **Plugins for many third-party providers**. A plugin system allows you to access data from third-party providers and hook it into the shared CLI infrastructure.
+ * **A common, opinionated schema**: You can normalize different datasets into a single schema so that, for example, all your iMessages and emails are stored in a common schema. Don’t want to use the schema? `chronicle-etl` always allows you to fall back on working with the raw extraction data.
+
+ ## Installation
+ ```sh
+ # Install chronicle-etl
+ gem install chronicle-etl
  ```
 
- ## Usage
+ After installation, the `chronicle-etl` command will be available in your shell. Homebrew support [is coming soon](https://github.com/chronicle-app/chronicle-etl/issues/13).
 
- After installing the gem, `chronicle-etl` is available to run in your shell.
+ ## Basic usage and running jobs
 
- ```bash
- # read test.csv and display it as a table
- $ chronicle-etl jobs:run --extractor csv --extractor-opts filename:test.csv --loader table
+ ```sh
+ # Display help
+ $ chronicle-etl help
 
- # Display help for the jobs:run command
- $ chronicle-etl jobs help run
+ # Basic job usage
+ $ chronicle-etl --extractor NAME --transformer NAME --loader NAME
+
+ # Read test.csv and display it to stdout as a table
+ $ chronicle-etl --extractor csv --input ./data.csv --loader table
  ```
 
- ## Connectors
+ ### Common options
+ ```sh
+ Options:
+ -j, [--name=NAME] # Job configuration name
+ -e, [--extractor=EXTRACTOR-NAME] # Extractor class. Default: stdin
+ [--extractor-opts=key:value] # Extractor options
+ -t, [--transformer=TRANFORMER-NAME] # Transformer class. Default: null
+ [--transformer-opts=key:value] # Transformer options
+ -l, [--loader=LOADER-NAME] # Loader class. Default: stdout
+ [--loader-opts=key:value] # Loader options
+ -i, [--input=FILENAME] # Input filename or directory
+ [--since=DATE] # Load records SINCE this date. Overrides job's `load_since` configuration option in extractor's options
+ [--until=DATE] # Load records UNTIL this date
+ [--limit=N] # Only extract the first LIMIT records
+ -o, [--output=OUTPUT] # Output filename
+ [--fields=field1 field2 ...] # Output only these fields
+ [--log-level=LOG_LEVEL] # Log level (debug, info, warn, error, fatal)
+ # Default: info
+ -v, [--verbose], [--no-verbose] # Set log level to verbose
+ [--silent], [--no-silent] # Silence all output
+ ```
 
+ ## Connectors
  Connectors are available to read, process, and load data from different formats or external services.
 
- ```bash
+ ```sh
  # List all available connectors
  $ chronicle-etl connectors:list
-
- # Install a connector
- $ chronicle-etl connectors:install imessage
  ```
 
- Built in connectors:
-
- ### Extractors
- - `stdin` - (default) Load records from line-separated stdin
- - `csv`
- - `file` - load from a single file or directory (with a glob pattern)
-
- ### Transformers
- - `null` - (default) Don't do anything
-
- ### Loaders
- - `stdout` - (default) output records to stdout serialized as JSON
- - `csv` - Load records to a csv file
- - `rest` - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
- - `table` - Output an ascii table of records. Useful for debugging.
-
- ### Provider-specific importers
-
- In addition to the built-in importers, importers for third-party platforms are available. They are packaged as individual Ruby gems.
+ ### Built-in Connectors
+ `chronicle-etl` comes with several built-in connectors for common formats and sources.
 
- - [email](https://github.com/chronicle-app/chronicle-email). Extractors for `mbox` and other email files
- - [shell](https://github.com/chronicle-app/chronicle-shell). Extract shell history from Bash or Zsh`
- - [imessage](https://github.com/chronicle-app/chronicle-imessage). Extract iMessage messages from a local macOS installation
+ #### Extractors
+ - [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/csv_extractor.rb) - Load records from CSV files or stdin
+ - [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/json_extractor.rb) - Load JSON (either [line-separated objects](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON) or one object)
+ - [`file`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/file_extractor.rb) - load from a single file or directory (with a glob pattern)
 
- To install any of these, run `gem install chronicle-PROVIDER`.
+ #### Transformers
+ - [`null`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/transformers/null_transformer.rb) - (default) Don’t do anything and pass on raw extraction data
 
- If you don't want to use the available rubygem importers, `chronicle-etl` can use `stdin` as an Extractor source (newline separated records). You can also use `stdout` as a loader — transformed records will be outputted separated by newlines.
+ #### Loaders
+ - [`table`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/table_loader.rb) - (default) Output an ascii table of records. Useful for exploring data.
+ - [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/csv_extractor.rb) - Load records to CSV
+ - [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/json_loader.rb) - Load records serialized as JSON
+ - [`rest`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/rest_loader.rb) - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
 
- I'll be open-sourcing more importers. Please [contact me](mailto:andrew@hyfen.net) to chat about what will be available!
-
- ## Full commands
-
- ```
- $ chronicle-etl help
-
- ALL COMMANDS
- help # This help menu
- connectors help [COMMAND] # Describe subcommands or one specific subcommand
- connectors:install NAME # Installs connector NAME
- connectors:list # Lists available connectors
- jobs help [COMMAND] # Describe subcommands or one specific subcommand
- jobs:create # Create a job
- jobs:list # List all available jobs
- jobs:run # Start a job
- jobs:show # Show details about a job
- ```
-
- ### Running a job
+ ### Plugins
+ Plugins provide access to data from third-party platforms, services, or formats.
 
+ ```bash
+ # Install a plugin
+ $ chronicle-etl connectors:install NAME
  ```
- Usage:
- chronicle-etl jobs:run
 
- Options:
- [--log-level=LOG_LEVEL] # Log level (debug, info, warn, error, fatal)
- # Default: info
- -v, [--verbose], [--no-verbose] # Set log level to verbose
- [--dry-run], [--no-dry-run] # Only run the extraction and transform steps, not the loading
- -e, [--extractor=extractor-name] # Extractor class. Default: stdin
- [--extractor-opts=key:value] # Extractor options
- -t, [--transformer=transformer-name] # Transformer class. Default: null
- [--transformer-opts=key:value] # Transformer options
- -l, [--loader=loader-name] # Loader class. Default: stdout
- [--loader-opts=key:value] # Loader options
- -j, [--name=NAME] # Job configuration name
-
-
- Runs an ETL job
+ A few dozen importers exist [in my Memex project](https://hyfen.net/memex/) and they’re being ported over to the Chronicle system. This table shows what’s available now and what’s coming. Rows are sorted in very rough order of priority.
+
+ If you want to work together on a connector, please [get in touch](#get-in-touch)!
+
+ | Name | Description | Availability |
+ |------|-------------|--------------|
+ | [imessage](https://github.com/chronicle-app/chronicle-imessage) | iMessage messages and attachments | Available |
+ | [shell](https://github.com/chronicle-app/chronicle-shell) | Shell command history | Available (zsh support pending) |
+ | [email](https://github.com/chronicle-app/chronicle-email) | Emails and attachments from IMAP or .mbox files | Available (imap support pending) |
+ | [pinboard](https://github.com/chronicle-app/chronicle-email) | Bookmarks and tags | Available |
+ | github | Github user and repo activity | In progress |
+ | safari | Browser history from local sqlite db | Needs porting |
+ | chrome | Browser history from local sqlite db | Needs porting |
+ | whatsapp | Messaging history (via individual chat exports) or reverse-engineered local desktop install | Unstarted |
+ | anki | Studying and card creation history | Needs porting |
+ | facebook | Messaging and history posting via data export files | Needs porting |
+ | twitter | History via API or export data files | Needs porting |
+ | foursquare | Location history via API | Needs porting |
+ | goodreads | Reading history via export csv (RIP goodreads API) | Needs porting |
+ | lastfm | Listening history via API | Needs porting |
+ | images | Process image files | Needs porting |
+ | arc | Location history from synced icloud backup files | Needs porting |
+ | firefox | Browser history from local sqlite db | Needs porting |
+ | fitbit | Personal analytics via API | Needs porting |
+ | git | Commit history on a repo | Needs porting |
+ | google-calendar | Calendar events via API | Needs porting |
+ | instagram | Posting and messaging history via export data | Needs porting |
+ | shazam | Song tags via reverse-engineered API | Needs porting |
+ | slack | Messaging history via API | Need rethinking |
+ | strava | Activity history via API | Needs porting |
+ | things | Task activity via local sqlite db | Needs porting |
+ | bear | Note taking activity via local sqlite db | Needs porting |
+ | youtube | Video activity via takeout data and API | Needs porting |
+
+ ### Writing your own connector
+
+ Additional connectors are packaged as separate ruby gems. You can view the [iMessage plugin](https://github.com/chronicle-app/chronicle-imessage) for an example.
+
+ If you want to load a custom connector without creating a gem, you can help by [completing this issue](https://github.com/chronicle-app/chronicle-etl/issues/23).
+
+ If you want to work together on a connector, please [get in touch](#get-in-touch)!
+
+ #### Sample custom Extractor class
+ ```ruby
+ module Chronicle
+   module FooService
+     class FooExtractor < Chronicle::ETL::Extractor
+       register_connector do |r|
+         r.identifier = 'foo'
+         r.description = 'From foo.com'
+       end
+
+       setting :access_token, required: true
+
+       def prepare
+         @records = # load from somewhere
+       end
+
+       def extract
+         @records.each do |record|
+           yield Chronicle::ETL::Extraction.new(data: row.to_h)
+         end
+       end
+     end
+   end
+ end
  ```
 
  ## Development
-
  After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
 
  To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
- ## Contributing
+ ### Additional development commands
+ ```bash
+ # run tests
+ bundle exec rake spec
+
+ # generate docs
+ bundle exec rake yard
+
+ # use Guard to run specs automatically
+ bundle exec guard
+ ```
 
+ ## Get in touch
+ - [@hyfen](https://twitter.com/hyfen) on Twitter
+ - [@hyfen](https://github.com/hyfen) on Github
+ - Email: andrew@hyfen.net
+
+ ## Contributing
  Bug reports and pull requests are welcome on GitHub at https://github.com/chronicle-app/chronicle-etl. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
  ## License
-
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
  ## Code of Conduct
-
- Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/chronicle-app/chronicle-etl/blob/master/CODE_OF_CONDUCT.md).
+ Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/chronicle-app/chronicle-etl/blob/master/CODE_OF_CONDUCT.md).
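The README changes above replace the old `jobs:run` subcommand with flag-based invocation. A minimal sketch of how the documented flags combine (connector names come from the option list above; the file path is illustrative):

```sh
# Extract a CSV, skip transformation (default null transformer), and print a table
$ chronicle-etl --extractor csv --input ./data.csv --loader table

# Same extraction, but emit JSON and suppress log output with the new --silent flag
$ chronicle-etl --extractor csv --input ./data.csv --loader json --silent
```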
@@ -6,20 +6,20 @@ module Chronicle
        # CLI commands for working with ETL jobs
        class Jobs < SubcommandBase
          default_task "start"
-         namespace :jobs
+         namespace :jobs
 
          class_option :name, aliases: '-j', desc: 'Job configuration name'
 
-         class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'extractor-name'
+         class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'NAME'
          class_option :'extractor-opts', desc: 'Extractor options', type: :hash, default: {}
-         class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'transformer-name'
+         class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'NAME'
          class_option :'transformer-opts', desc: 'Transformer options', type: :hash, default: {}
-         class_option :loader, aliases: '-l', desc: 'Loader class. Default: stdout', banner: 'loader-name'
+         class_option :loader, aliases: '-l', desc: 'Loader class. Default: table', banner: 'NAME'
          class_option :'loader-opts', desc: 'Loader options', type: :hash, default: {}
 
          # This is an array to deal with shell globbing
          class_option :input, aliases: '-i', desc: 'Input filename or directory', default: [], type: 'array', banner: 'FILENAME'
-         class_option :since, desc: "Load records SINCE this date. Overrides job's `load_since` configuration option in extractor's options", banner: 'DATE'
+         class_option :since, desc: "Load records SINCE this date", banner: 'DATE'
          class_option :until, desc: "Load records UNTIL this date", banner: 'DATE'
          class_option :limit, desc: "Only extract the first LIMIT records", banner: 'N'
 
@@ -28,6 +28,7 @@ module Chronicle
 
          class_option :log_level, desc: 'Log level (debug, info, warn, error, fatal)', default: 'info'
          class_option :verbose, aliases: '-v', desc: 'Set log level to verbose', type: :boolean
+         class_option :silent, desc: 'Silence all output', type: :boolean
 
          # Thor doesn't like `run` as a command name
          map run: :start
@@ -93,7 +94,9 @@ LONG_DESC
          private
 
          def setup_log_level
-           if options[:verbose]
+           if options[:silent]
+             Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::SILENT
+           elsif options[:verbose]
              Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::DEBUG
            elsif options[:log_level]
              level = Chronicle::ETL::Logger.const_get(options[:log_level].upcase)
@@ -116,7 +119,7 @@ LONG_DESC
          # Takes flag options and turns them into a runner config
          def process_flag_options options
            extractor_options = options[:'extractor-opts'].merge({
-             filename: (options[:input] if options[:input].any?),
+             input: (options[:input] if options[:input].any?),
              since: options[:since],
              until: options[:until],
              limit: options[:limit],
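The `filename:` to `input:` rename pairs with the `--input` option, which is declared as an array "to deal with shell globbing". A hedged example of what that enables, assuming your shell expands the glob before `chronicle-etl` sees it (the directory name is illustrative):

```sh
# Each file matched by the glob becomes one entry in the extractor's input array
$ chronicle-etl --extractor csv --input ./exports/*.csv --loader table
```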
@@ -89,6 +89,14 @@ module Chronicle
          value.to_s
        end
 
+       def coerce_boolean(value)
+         if value.is_a?(String)
+           value.downcase == "true"
+         else
+           value
+         end
+       end
+
        def coerce_time(value)
          # TODO: handle durations like '3h'
          if value.is_a?(String)
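The new `coerce_boolean` helper lets boolean settings be driven from string values. A plausible CLI illustration, assuming `--extractor-opts key:value` pairs arrive as strings and flow through this coercion (the file path is made up):

```sh
# "false" is coerced to the boolean false for the JSON extractor's jsonl setting
$ chronicle-etl --extractor json --extractor-opts jsonl:false --input ./export.json --loader table
```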
@@ -1,8 +1,8 @@
  module Chronicle
    module ETL
-     class Error < StandardError; end;
+     class Error < StandardError; end
 
-     class ConfigurationError < Error; end;
+     class ConfigurationError < Error; end
 
      class RunnerTypeError < Error; end
 
@@ -18,6 +18,10 @@ module Chronicle
      class ProviderNotAvailableError < ConnectorNotAvailableError; end
      class ProviderConnectorNotAvailableError < ConnectorNotAvailableError; end
 
+     class ExtractionError < Error; end
+
+     class SerializationError < Error; end
+
      class TransformationError < Error
        attr_reader :transformation
 
@@ -3,39 +3,46 @@ require 'csv'
  module Chronicle
    module ETL
      class CSVExtractor < Chronicle::ETL::Extractor
-       include Extractors::Helpers::FilesystemReader
+       include Extractors::Helpers::InputReader
 
        register_connector do |r|
-         r.description = 'input as CSV'
+         r.description = 'CSV'
        end
 
        setting :headers, default: true
-       setting :filename, default: $stdin
+
+       def prepare
+         @csvs = prepare_sources
+       end
 
        def extract
-         csv = initialize_csv
-         csv.each do |row|
-           yield Chronicle::ETL::Extraction.new(data: row.to_h)
+         @csvs.each do |csv|
+           csv.read.each do |row|
+             yield Chronicle::ETL::Extraction.new(data: row.to_h)
+           end
          end
        end
 
        def results_count
-         CSV.read(@config.filename, headers: @config.headers).count unless stdin?(@config.filename)
+         @csvs.reduce(0) do |total_rows, csv|
+           row_count = csv.readlines.size
+           csv.rewind
+           total_rows + row_count
+         end
        end
 
        private
 
-       def initialize_csv
-         headers = @config.headers.is_a?(String) ? @config.headers.split(',') : @config.headers
-
-         csv_options = {
-           headers: headers,
-           converters: :all
-         }
-
-         open_from_filesystem(filename: @config.filename) do |file|
-           return CSV.new(file, **csv_options)
+       def prepare_sources
+         @csvs = []
+         read_input do |csv_data|
+           csv_options = {
+             headers: @config.headers.is_a?(String) ? @config.headers.split(',') : @config.headers,
+             converters: :all
+           }
+           @csvs << CSV.new(csv_data, **csv_options)
          end
+         @csvs
        end
      end
    end
@@ -7,11 +7,11 @@ module Chronicle
      extend Chronicle::ETL::Registry::SelfRegistering
      include Chronicle::ETL::Configurable
 
-     setting :since, type: :date
-     setting :until, type: :date
+     setting :since, type: :time
+     setting :until, type: :time
      setting :limit
      setting :load_after_id
-     setting :filename
+     setting :input
 
      # Construct a new instance of this extractor. Options are passed in from a Runner
      # == Parameters:
@@ -46,7 +46,7 @@ module Chronicle
    end
  end
 
- require_relative 'helpers/filesystem_reader'
+ require_relative 'helpers/input_reader'
  require_relative 'csv_extractor'
  require_relative 'file_extractor'
  require_relative 'json_extractor'
@@ -2,35 +2,55 @@ require 'pathname'
 
  module Chronicle
    module ETL
+     # Return filenames that match a pattern in a directory
      class FileExtractor < Chronicle::ETL::Extractor
-       include Extractors::Helpers::FilesystemReader
 
        register_connector do |r|
          r.description = 'file or directory of files'
        end
 
-       # TODO: consolidate this with @config.filename
-       setting :dir_glob_pattern
+       setting :input, default: ['.']
+       setting :dir_glob_pattern, default: "**/*"
+       setting :larger_than
+       setting :smaller_than
+
+       def prepare
+         @pathnames = gather_files
+       end
 
        def extract
-         filenames.each do |filename|
-           yield Chronicle::ETL::Extraction.new(data: filename)
+         @pathnames.each do |pathname|
+           yield Chronicle::ETL::Extraction.new(data: pathname.to_path)
          end
        end
 
        def results_count
-         filenames.count
+         @pathnames.count
        end
 
        private
 
-       def filenames
-         @filenames ||= filenames_in_directory(
-           path: @config.filename,
-           dir_glob_pattern: @config.dir_glob_pattern,
-           load_since: @config.since,
-           load_until: @config.until
-         )
+       def gather_files
+         roots = [@config.input].flatten.map { |filename| Pathname.new(filename) }
+         raise(ExtractionError, "Input must exist") unless roots.all?(&:exist?)
+
+         directories, files = roots.partition(&:directory?)
+
+         directories.each do |directory|
+           files += Dir.glob(File.join(directory, @config.dir_glob_pattern)).map { |filename| Pathname.new(filename) }
+         end
+
+         files = files.uniq
+
+         files = files.keep_if { |f| (f.mtime > @config.since) } if @config.since
+         files = files.keep_if { |f| (f.mtime < @config.until) } if @config.until
+
+         # pass in file sizes in bytes
+         files = files.keep_if { |f| (f.size < @config.smaller_than) } if @config.smaller_than
+         files = files.keep_if { |f| (f.size > @config.larger_than) } if @config.larger_than
+
+         # # TODO: incorporate sort argument
+         files.sort_by(&:mtime)
        end
      end
    end
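The rewritten `FileExtractor` exposes its filters as settings (`dir_glob_pattern`, `larger_than`, `smaller_than`, alongside the shared `input`, `since`, and `until`). A sketch of driving them from the CLI; the directory, pattern, and date are illustrative, and the `key:value` syntax is the `--extractor-opts` format from the README:

```sh
# List JPEGs under ~/Pictures modified since March 2022
$ chronicle-etl --extractor file --input ~/Pictures \
    --extractor-opts dir_glob_pattern:"**/*.jpg" --since 2022-03-01 --loader table
```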
@@ -0,0 +1,76 @@
+ require 'pathname'
+
+ module Chronicle
+   module ETL
+     module Extractors
+       module Helpers
+         module InputReader
+           # Return an array of input filenames; converts a single string
+           # to an array if necessary
+           def filenames
+             [@config.input].flatten.map
+           end
+
+           # Filenames as an array of pathnames
+           def pathnames
+             filenames.map { |filename| Pathname.new(filename) }
+           end
+
+           # Whether we're reading from files
+           def read_from_files?
+             filenames.any?
+           end
+
+           # Whether we're reading input from stdin
+           def read_from_stdin?
+             !read_from_files? && $stdin.stat.pipe?
+           end
+
+           # Read input sources and yield each content
+           def read_input
+             if read_from_files?
+               pathnames.each do |pathname|
+                 File.open(pathname) do |file|
+                   yield file.read, pathname.to_path
+                 end
+               end
+             elsif read_from_stdin?
+               yield $stdin.read, $stdin
+             else
+               raise ExtractionError, "No input files or stdin provided"
+             end
+           end
+
+           # Read input sources line by line
+           def read_input_as_lines(&block)
+             if read_from_files?
+               lines_from_files(&block)
+             elsif read_from_stdin?
+               lines_from_stdin(&block)
+             else
+               raise ExtractionError, "No input files or stdin provided"
+             end
+           end
+
+           private
+
+           def lines_from_files(&block)
+             pathnames.each do |pathname|
+               File.open(pathname) do |file|
+                 lines_from_io(file, &block)
+               end
+             end
+           end
+
+           def lines_from_stdin(&block)
+             lines_from_io($stdin, &block)
+           end
+
+           def lines_from_io(io, &block)
+             io.each_line(&block)
+           end
+         end
+       end
+     end
+   end
+ end
@@ -1,35 +1,44 @@
  module Chronicle
    module ETL
-     class JsonExtractor < Chronicle::ETL::Extractor
-       include Extractors::Helpers::FilesystemReader
+     class JSONExtractor < Chronicle::ETL::Extractor
+       include Extractors::Helpers::InputReader
 
        register_connector do |r|
-         r.description = 'input as JSON'
+         r.description = 'JSON'
        end
 
-       setting :filename, default: $stdin
-       setting :jsonl, default: true
+       setting :jsonl, default: true, type: :boolean
 
-       def extract
+       def prepare
+         @jsons = []
          load_input do |input|
-           parsed_data = parse_data(input)
-           yield Chronicle::ETL::Extraction.new(data: parsed_data) if parsed_data
+           @jsons << parse_data(input)
+         end
+       end
+
+       def extract
+         @jsons.each do |json|
+           yield Chronicle::ETL::Extraction.new(data: json)
          end
        end
 
        def results_count
+         @jsons.count
        end
 
        private
 
        def parse_data data
          JSON.parse(data)
-       rescue JSON::ParserError => e
+       rescue JSON::ParserError
+         raise Chronicle::ETL::ExtractionError, "Could not parse JSON"
        end
 
-       def load_input
-         read_from_filesystem(filename: @options[:filename]) do |data|
-           yield data
+       def load_input(&block)
+         if @config.jsonl
+           read_input_as_lines(&block)
+         else
+           read_input(&block)
          end
        end
      end
@@ -14,7 +14,7 @@ module Chronicle
        options: {}
      },
      loader: {
-       name: 'stdout',
+       name: 'table',
        options: {}
      }
    }.freeze
@@ -0,0 +1,44 @@
+ module Chronicle
+   module ETL
+     class JSONLoader < Chronicle::ETL::Loader
+       register_connector do |r|
+         r.description = 'json'
+       end
+
+       setting :serializer
+       setting :output, default: $stdout
+
+       def start
+         if @config.output == $stdout
+           @output = @config.output
+         else
+           @output = File.open(@config.output, "w")
+         end
+       end
+
+       def load(record)
+         serialized = serializer.serialize(record)
+
+         # When dealing with raw data, we can get improperly encoded strings
+         # (eg from sqlite database columns). We force conversion to UTF-8
+         # before converting into JSON
+         encoded = serialized.transform_values do |value|
+           next value unless value.is_a?(String)
+
+           value.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
+         end
+         @output.puts encoded.to_json
+       end
+
+       def finish
+         @output.close
+       end
+
+       private
+
+       def serializer
+         @config.serializer || Chronicle::ETL::RawSerializer
+       end
+     end
+   end
+ end
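The new `JSONLoader` writes one JSON object per record, to stdout by default or to a file via its `output` setting. A minimal sketch, assuming loader settings are supplied through `--loader-opts key:value` as documented in the README (file names are illustrative):

```sh
# Stream records as JSON to stdout
$ chronicle-etl --extractor csv --input ./data.csv --loader json

# Write them to a file through the loader's output setting
$ chronicle-etl --extractor csv --input ./data.csv --loader json --loader-opts output:records.json
```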
@@ -30,6 +30,6 @@ module Chronicle
  end
 
  require_relative 'csv_loader'
+ require_relative 'json_loader'
  require_relative 'rest_loader'
- require_relative 'stdout_loader'
  require_relative 'table_loader'
@@ -11,20 +11,19 @@ module Chronicle
 
        setting :fields_limit, default: nil
        setting :fields_exclude, default: ['lids', 'type']
-       setting :fields_include, default: []
+       setting :fields, default: []
        setting :truncate_values_at, default: 40
        setting :table_renderer, default: :basic
 
        def load(record)
-         @records ||= []
-         @records << record.to_h_flattened
+         records << record.to_h_flattened
        end
 
        def finish
-         return if @records.empty?
+         return if records.empty?
 
-         headers = build_headers(@records)
-         rows = build_rows(@records, headers)
+         headers = build_headers(records)
+         rows = build_rows(records, headers)
 
          @table = TTY::Table.new(header: headers, rows: rows)
          puts @table.render(
@@ -33,12 +32,16 @@ module Chronicle
          )
        end
 
+       def records
+         @records ||= []
+       end
+
        private
 
        def build_headers(records)
          headers =
-           if @config.fields_include.any?
-             Set[*@config.fields_include]
+           if @config.fields.any?
+             Set[*@config.fields]
            else
              # use all the keys of the flattened record hash
              Set[*records.map(&:keys).flatten.map(&:to_s).uniq]
@@ -52,7 +55,7 @@ module Chronicle
 
        def build_rows(records, headers)
          records.map do |record|
-           values = record.values_at(*headers).map{|value| value.to_s }
+           values = record.transform_keys(&:to_sym).values_at(*headers).map{|value| value.to_s }
 
            if @config.truncate_values_at
              values = values.map{ |value| value.truncate(@config.truncate_values_at) }
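With `fields_include` renamed to `fields`, the table loader's column selection lines up with the `--fields` flag listed in the README's options. An illustrative run (the column names here are hypothetical and assume the flag is forwarded to the loader's `fields` setting):

```sh
# Show only two columns of the flattened records
$ chronicle-etl --extractor csv --input ./data.csv --loader table --fields title date
```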
@@ -8,6 +8,7 @@ module Chronicle
      WARN = 2
      ERROR = 3
      FATAL = 4
+     SILENT = 5
 
      attr_accessor :log_level
 
@@ -5,6 +5,9 @@ module Chronicle
      module Models
        # Represents a record that's been transformed by a Transformer and
        # ready to be loaded. Loosely based on ActiveModel.
+       #
+       # @todo Experiment with just mixing in ActiveModel instead of this
+       # this reimplementation
        class Base
          ATTRIBUTES = [:provider, :provider_id, :lat, :lng, :metadata].freeze
          ASSOCIATIONS = [].freeze
@@ -5,13 +5,19 @@ module Chronicle
      module Models
        class Entity < Chronicle::ETL::Models::Base
          TYPE = 'entities'.freeze
-         ATTRIBUTES = [:title, :body, :represents, :slug, :myself, :metadata].freeze
+         ATTRIBUTES = [:title, :body, :provider_url, :represents, :slug, :myself, :metadata].freeze
+
+         # TODO: This desperately needs a validation system
          ASSOCIATIONS = [
+           :involvements, # inverse of activity's `involved`
+
            :attachments,
            :abouts,
+           :aboutables, # inverse of above
            :depicts,
            :consumers,
-           :contains
+           :contains,
+           :containers # inverse of above
          ].freeze # TODO: add these to reflect Chronicle Schema
 
          attr_accessor(*ATTRIBUTES, *ASSOCIATIONS)
@@ -0,0 +1,26 @@
+ require 'chronicle/etl/models/base'
+
+ module Chronicle
+   module ETL
+     module Models
+       # A record from an extraction with no processing or normalization applied
+       class Raw
+         TYPE = 'raw'
+
+         attr_accessor :raw_data
+
+         def initialize(raw_data)
+           @raw_data = raw_data
+         end
+
+         def to_h
+           @raw_data.to_h
+         end
+
+         def to_h_flattened
+           Chronicle::ETL::Utils::HashUtilities.flatten_hash(to_h)
+         end
+       end
+     end
+   end
+ end
@@ -28,9 +28,10 @@ class Chronicle::ETL::Runner
        transformer = @job.instantiate_transformer(extraction)
        record = transformer.transform
 
-       unless record.is_a?(Chronicle::ETL::Models::Base)
-         raise Chronicle::ETL::RunnerTypeError, "Transformed data should be a type of Chronicle::ETL::Models"
-       end
+       # TODO: rethink this
+       # unless record.is_a?(Chronicle::ETL::Models)
+       #   raise Chronicle::ETL::RunnerTypeError, "Transformed data should be a type of Chronicle::ETL::Models"
+       # end
 
        Chronicle::ETL::Logger.info(tty_log_transformation(transformer))
        @job_logger.log_transformation(transformer)
@@ -52,7 +53,7 @@ class Chronicle::ETL::Runner
      raise e
    ensure
      @job_logger.save
-     @progress_bar.finish
+     @progress_bar&.finish
      Chronicle::ETL::Logger.detach_from_progress_bar
      Chronicle::ETL::Logger.info(tty_log_completion)
    end
@@ -1,6 +1,12 @@
  module Chronicle
    module ETL
      class JSONAPISerializer < Chronicle::ETL::Serializer
+       def initialize(*args)
+         super
+
+         raise(SerializationError, "Record must be a subclass of Chronicle::ETL::Model::Base") unless @record.is_a?(Chronicle::ETL::Models::Base)
+       end
+
        def serializable_hash
          @record
            .identifier_hash
@@ -0,0 +1,10 @@
+ module Chronicle
+   module ETL
+     # Take a Raw model and output `raw_data` as a hash
+     class RawSerializer < Chronicle::ETL::Serializer
+       def serializable_hash
+         @record.to_h
+       end
+     end
+   end
+ end
@@ -24,4 +24,5 @@ module Chronicle
    end
  end
 
- require_relative 'jsonapi_serializer'
+ require_relative 'jsonapi_serializer'
+ require_relative 'raw_serializer'
@@ -7,7 +7,7 @@ module Chronicle
        end
 
        def transform
-         Chronicle::ETL::Models::Generic.new(@extraction.data)
+         Chronicle::ETL::Models::Raw.new(@extraction.data)
        end
 
        def timestamp; end
@@ -1,5 +1,5 @@
  module Chronicle
    module ETL
-     VERSION = "0.4.0"
+     VERSION = "0.4.1"
    end
  end
data/lib/chronicle/etl.rb CHANGED
@@ -3,23 +3,30 @@ require_relative 'etl/config'
  require_relative 'etl/configurable'
  require_relative 'etl/exceptions'
  require_relative 'etl/extraction'
- require_relative 'etl/extractors/extractor'
  require_relative 'etl/job_definition'
  require_relative 'etl/job_log'
  require_relative 'etl/job_logger'
  require_relative 'etl/job'
- require_relative 'etl/loaders/loader'
  require_relative 'etl/logger'
  require_relative 'etl/models/activity'
  require_relative 'etl/models/attachment'
  require_relative 'etl/models/base'
+ require_relative 'etl/models/raw'
  require_relative 'etl/models/entity'
- require_relative 'etl/models/generic'
  require_relative 'etl/runner'
  require_relative 'etl/serializers/serializer'
- require_relative 'etl/transformers/transformer'
  require_relative 'etl/utils/binary_attachments'
  require_relative 'etl/utils/hash_utilities'
  require_relative 'etl/utils/text_recognition'
  require_relative 'etl/utils/progress_bar'
  require_relative 'etl/version'
+
+ require_relative 'etl/extractors/extractor'
+ require_relative 'etl/loaders/loader'
+ require_relative 'etl/transformers/transformer'
+
+ begin
+   require 'pry'
+ rescue LoadError
+   # Pry not available
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: chronicle-etl
  version: !ruby/object:Gem::Version
-   version: 0.4.0
+   version: 0.4.1
  platform: ruby
  authors:
  - Andrew Louis
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2022-02-25 00:00:00.000000000 Z
+ date: 2022-03-05 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activesupport
@@ -328,7 +328,7 @@ files:
  - lib/chronicle/etl/extractors/csv_extractor.rb
  - lib/chronicle/etl/extractors/extractor.rb
  - lib/chronicle/etl/extractors/file_extractor.rb
- - lib/chronicle/etl/extractors/helpers/filesystem_reader.rb
+ - lib/chronicle/etl/extractors/helpers/input_reader.rb
  - lib/chronicle/etl/extractors/json_extractor.rb
  - lib/chronicle/etl/extractors/stdin_extractor.rb
  - lib/chronicle/etl/job.rb
@@ -336,21 +336,22 @@ files:
  - lib/chronicle/etl/job_log.rb
  - lib/chronicle/etl/job_logger.rb
  - lib/chronicle/etl/loaders/csv_loader.rb
+ - lib/chronicle/etl/loaders/json_loader.rb
  - lib/chronicle/etl/loaders/loader.rb
  - lib/chronicle/etl/loaders/rest_loader.rb
- - lib/chronicle/etl/loaders/stdout_loader.rb
  - lib/chronicle/etl/loaders/table_loader.rb
  - lib/chronicle/etl/logger.rb
  - lib/chronicle/etl/models/activity.rb
  - lib/chronicle/etl/models/attachment.rb
  - lib/chronicle/etl/models/base.rb
  - lib/chronicle/etl/models/entity.rb
- - lib/chronicle/etl/models/generic.rb
+ - lib/chronicle/etl/models/raw.rb
  - lib/chronicle/etl/registry/connector_registration.rb
  - lib/chronicle/etl/registry/registry.rb
  - lib/chronicle/etl/registry/self_registering.rb
  - lib/chronicle/etl/runner.rb
  - lib/chronicle/etl/serializers/jsonapi_serializer.rb
+ - lib/chronicle/etl/serializers/raw_serializer.rb
  - lib/chronicle/etl/serializers/serializer.rb
  - lib/chronicle/etl/transformers/image_file_transformer.rb
  - lib/chronicle/etl/transformers/null_transformer.rb
@@ -1,104 +0,0 @@
- require 'pathname'
-
- module Chronicle
-   module ETL
-     module Extractors
-       module Helpers
-         module FilesystemReader
-
-           def filenames_in_directory(...)
-             filenames = gather_files(...)
-             if block_given?
-               filenames.each do |filename|
-                 yield filename
-               end
-             else
-               filenames
-             end
-           end
-
-           def read_from_filesystem(filename:, yield_each_line: true, dir_glob_pattern: '**/*')
-             open_files(filename: filename, dir_glob_pattern: dir_glob_pattern) do |file|
-               if yield_each_line
-                 file.each_line do |line|
-                   yield line
-                 end
-               else
-                 yield file.read
-               end
-             end
-           end
-
-           def open_from_filesystem(filename:, dir_glob_pattern: '**/*')
-             open_files(filename: filename, dir_glob_pattern: dir_glob_pattern) do |file|
-               yield file
-             end
-           end
-
-           def results_count
-             raise NotImplementedError
-             # if file?
-             #   return 1
-             # else
-             #   search_pattern = File.join(@options[:filename], '**/*')
-             #   Dir.glob(search_pattern).count
-             # end
-           end
-
-           private
-
-           def gather_files(path:, dir_glob_pattern: '**/*', load_since: nil, load_until: nil, smaller_than: nil, larger_than: nil, sort: :mtime)
-             search_pattern = File.join(path, '**', dir_glob_pattern)
-             files = Dir.glob(search_pattern)
-
-             files = files.keep_if {|f| (File.mtime(f) > load_since)} if load_since
-             files = files.keep_if {|f| (File.mtime(f) < load_until)} if load_until
-
-             # pass in file sizes in bytes
-             files = files.keep_if {|f| (File.size(f) < smaller_than)} if smaller_than
-             files = files.keep_if {|f| (File.size(f) > larger_than)} if larger_than
-
-             # TODO: incorporate sort argument
-             files.sort_by{ |f| File.mtime(f) }
-           end
-
-           def select_files_in_directory(path:, dir_glob_pattern: '**/*')
-             raise IOError.new("#{path} is not a directory.") unless directory?(path)
-
-             search_pattern = File.join(path, dir_glob_pattern)
-             Dir.glob(search_pattern).each do |filename|
-               yield(filename)
-             end
-           end
-
-           def open_files(filename:, dir_glob_pattern:)
-             if stdin?(filename)
-               yield $stdin
-             elsif directory?(filename)
-               search_pattern = File.join(filename, dir_glob_pattern)
-               filenames = Dir.glob(search_pattern)
-               filenames.each do |filename|
-                 file = File.open(filename)
-                 yield(file)
-               end
-             elsif file?(filename)
-               yield File.open(filename)
-             end
-           end
-
-           def stdin?(filename)
-             filename == $stdin
-           end
-
-           def directory?(filename)
-             Pathname.new(filename).directory?
-           end
-
-           def file?(filename)
-             Pathname.new(filename).file?
-           end
-         end
-       end
-     end
-   end
- end
@@ -1,14 +0,0 @@
- module Chronicle
-   module ETL
-     class StdoutLoader < Chronicle::ETL::Loader
-       register_connector do |r|
-         r.description = 'stdout'
-       end
-
-       def load(record)
-         serializer = Chronicle::ETL::JSONAPISerializer.new(record)
-         puts serializer.serializable_hash.to_json
-       end
-     end
-   end
- end
@@ -1,23 +0,0 @@
- require 'chronicle/etl/models/base'
-
- module Chronicle
-   module ETL
-     module Models
-       class Generic < Chronicle::ETL::Models::Base
-         TYPE = 'generic'
-
-         attr_accessor :properties
-
-         def initialize(properties = {})
-           @properties = properties
-           super
-         end
-
-         # Generic models have arbitrary attributes stored in @properties
-         def attributes
-           @properties.transform_keys(&:to_sym)
-         end
-       end
-     end
-   end
- end