chronicle-etl 0.2.2 → 0.3.1

Files changed (52)
  1. checksums.yaml +4 -4
  2. data/.gitignore +3 -0
  3. data/.rubocop.yml +3 -0
  4. data/README.md +22 -15
  5. data/chronicle-etl.gemspec +13 -7
  6. data/lib/chronicle/etl/cli/connectors.rb +19 -7
  7. data/lib/chronicle/etl/cli/jobs.rb +38 -26
  8. data/lib/chronicle/etl/cli/main.rb +10 -2
  9. data/lib/chronicle/etl/config.rb +24 -3
  10. data/lib/chronicle/etl/exceptions.rb +13 -0
  11. data/lib/chronicle/etl/extraction.rb +12 -0
  12. data/lib/chronicle/etl/extractors/csv_extractor.rb +43 -37
  13. data/lib/chronicle/etl/extractors/extractor.rb +25 -4
  14. data/lib/chronicle/etl/extractors/file_extractor.rb +15 -33
  15. data/lib/chronicle/etl/extractors/helpers/filesystem_reader.rb +104 -0
  16. data/lib/chronicle/etl/extractors/json_extractor.rb +45 -0
  17. data/lib/chronicle/etl/extractors/stdin_extractor.rb +6 -1
  18. data/lib/chronicle/etl/job.rb +72 -0
  19. data/lib/chronicle/etl/job_definition.rb +89 -0
  20. data/lib/chronicle/etl/job_log.rb +95 -0
  21. data/lib/chronicle/etl/job_logger.rb +81 -0
  22. data/lib/chronicle/etl/loaders/csv_loader.rb +6 -6
  23. data/lib/chronicle/etl/loaders/loader.rb +2 -2
  24. data/lib/chronicle/etl/loaders/rest_loader.rb +16 -9
  25. data/lib/chronicle/etl/loaders/stdout_loader.rb +8 -3
  26. data/lib/chronicle/etl/loaders/table_loader.rb +58 -7
  27. data/lib/chronicle/etl/logger.rb +48 -0
  28. data/lib/chronicle/etl/models/activity.rb +15 -0
  29. data/lib/chronicle/etl/models/attachment.rb +14 -0
  30. data/lib/chronicle/etl/models/base.rb +119 -0
  31. data/lib/chronicle/etl/models/entity.rb +21 -0
  32. data/lib/chronicle/etl/models/generic.rb +23 -0
  33. data/lib/chronicle/etl/registry/connector_registration.rb +61 -0
  34. data/lib/chronicle/etl/registry/registry.rb +52 -0
  35. data/lib/chronicle/etl/registry/self_registering.rb +25 -0
  36. data/lib/chronicle/etl/runner.rb +66 -24
  37. data/lib/chronicle/etl/serializers/jsonapi_serializer.rb +25 -0
  38. data/lib/chronicle/etl/serializers/serializer.rb +27 -0
  39. data/lib/chronicle/etl/transformers/image_file_transformer.rb +253 -0
  40. data/lib/chronicle/etl/transformers/null_transformer.rb +11 -3
  41. data/lib/chronicle/etl/transformers/transformer.rb +42 -13
  42. data/lib/chronicle/etl/utils/binary_attachments.rb +21 -0
  43. data/lib/chronicle/etl/utils/hash_utilities.rb +19 -0
  44. data/lib/chronicle/etl/utils/progress_bar.rb +3 -1
  45. data/lib/chronicle/etl/utils/text_recognition.rb +15 -0
  46. data/lib/chronicle/etl/version.rb +1 -1
  47. data/lib/chronicle/etl.rb +16 -1
  48. metadata +139 -36
  49. data/CHANGELOG.md +0 -23
  50. data/Gemfile.lock +0 -85
  51. data/lib/chronicle/etl/catalog.rb +0 -102
  52. data/lib/chronicle/etl/transformers/json_transformer.rb +0 -11
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: e1c08bc4f71c807525090abbf1701be19ab72cce08a99cc3bbec9b0db7150a02
-   data.tar.gz: 172a5d7e7ba7a9424ef7b5ab4da2b8c44defdb4e0a34c833248ff1b63f40407e
+   metadata.gz: b74c4a7782c1ab31173e628b3e5ccb8743fe21f29d6f48d739b0e3cc2dfda22e
+   data.tar.gz: 7ea44638b08f6da12c0a5386f3d852600f50336ce0bb57347114804770f75691
  SHA512:
-   metadata.gz: 0f671c00928b15f9c0f6fa159ac106ff9c4f65a8bd16048e5d0cab82d680945317f7680e7796e98c665bb5cc757e0657f1a36d773d89e3e1587d9eebc12abdd8
-   data.tar.gz: 449d1368e0054f39006c7903218300b9b97ca839d6eff43b6b7bd659e5146d443a31c53325c4769ae7a56db9d42417020ccde17362ae024c01aca2ed63029044
+   metadata.gz: efb23677c731a54b0382c3095dc9bb5f98a97365c1daf031bbc8c20335e7bd146b76b3a50486971e48192e7540bc0ae1b09f232590a590257203ae3560396767
+   data.tar.gz: cba40b71a7e8c0b17a286ecd3db3724bff290fdd79b3fdf55ab89967f6af14228911c0e6928a949b8dd899acd6ad396b8a21fb03162a8561247c97b1200bac29
data/.gitignore CHANGED
@@ -7,6 +7,9 @@
  /spec/reports/
  /tmp/

+ # https://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/
+ Gemfile.lock
+
  # rspec failure tracking
  .rspec_status
  .DS_Store
data/.rubocop.yml CHANGED
@@ -5,4 +5,7 @@ Style/StringLiterals:
    Enabled: false

  Style/MethodCallWithArgsParentheses:
+   Enabled: false
+
+ Lint/ConstantResolution:
    Enabled: false
data/README.md CHANGED
@@ -2,9 +2,9 @@
2
2
 
3
3
  [![Gem Version](https://badge.fury.io/rb/chronicle-etl.svg)](https://badge.fury.io/rb/chronicle-etl)
4
4
 
5
- Chronicle ETL is a utility tool for archiving and processing personal data. You can extract it from a variety of source, transform it, and load it to different APIs or file formats.
5
+ Chronicle ETL is a utility that helps you archive and process personal data. You can *extract* it from a variety of sources, *transform* it, and *load* it to an external API, file, or stdout.
6
6
 
7
- This project is an adaptation of Andrew Louis's experimental [Memex project](https://hyfen.net/memex).
7
+ This tool is an adaptation of Andrew Louis's experimental [Memex project](https://hyfen.net/memex) and the dozens of existing importers are being migrated to Chronicle.
8
8
 
9
9
  ## Installation
10
10
 
@@ -31,6 +31,9 @@ Connectors are available to read, process, and load data from different formats
31
31
  ```bash
32
32
  # List all available connectors
33
33
  $ chronicle-etl connectors:list
34
+
35
+ # Install a connector
36
+ $ chronicle-etl connectors:install imessage
34
37
  ```
35
38
 
36
39
  Built in connectors:
@@ -44,16 +47,18 @@ Built in connectors:
44
47
  - `null` - (default) Don't do anything
45
48
 
46
49
  ### Loaders
47
- - `stdout` - (default) output transformed records to stdount
50
+ - `stdout` - (default) output records to stdout serialized as JSON
48
51
  - `csv` - Load records to a csv file
52
+ - `rest` - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
49
53
  - `table` - Output an ascii table of records. Useful for debugging.
50
54
 
51
55
  ### Provider-specific importers
52
56
 
53
57
  In addition to the built-in importers, importers for third-party platforms are available. They are packaged as individual Ruby gems.
54
58
 
55
- - [email](https://github.com/chronicle-app/chronicle-email). Extractors for `mbox` files. Transformers for chronicle schema
56
- - [bash](https://github.com/chronicle-app/chronicle-bash). Extract bash history from `~/.bash_history`. Transform it for chronicle schema
59
+ - [email](https://github.com/chronicle-app/chronicle-email). Extractors for `mbox` and other email files
60
+ - [bash](https://github.com/chronicle-app/chronicle-bash). Extract bash history from `~/.bash_history`
61
+ - [imessage](https://github.com/chronicle-app/chronicle-imessage). Extract iMessage messages from a local macOS installation
57
62
 
58
63
  To install any of these, run `gem install chronicle-PROVIDER`.
59
64
 
@@ -61,7 +66,7 @@ If you don't want to use the available rubygem importers, `chronicle-etl` can us
61
66
 
62
67
  I'll be open-sourcing more importers. Please [contact me](mailto:andrew@hyfen.net) to chat about what will be available!
63
68
 
64
- ### Full commands
69
+ ## Full commands
65
70
 
66
71
  ```
67
72
  $ chronicle-etl help
@@ -75,26 +80,28 @@ ALL COMMANDS
75
80
  jobs:create # Create a job
76
81
  jobs:list # List all available jobs
77
82
  jobs:run # Start a job
78
- jobs:show # Show a job
83
+ jobs:show # Show details about a job
79
84
  ```
80
85
 
81
- ### Job options
86
+ ### Running a job
82
87
 
83
88
  ```
84
89
  Usage:
85
90
  chronicle-etl jobs:run
86
91
 
87
92
  Options:
88
- -e, [--extractor=extractor-name] # Extractor class (available: stdin, csv, file)
89
- # Default: stdin
93
+ [--log-level=LOG_LEVEL] # Log level (debug, info, warn, error, fatal)
94
+ # Default: info
95
+ -v, [--verbose], [--no-verbose] # Set log level to verbose
96
+ [--dry-run], [--no-dry-run] # Only run the extraction and transform steps, not the loading
97
+ -e, [--extractor=extractor-name] # Extractor class. Default: stdin
90
98
  [--extractor-opts=key:value] # Extractor options
91
- -t, [--transformer=transformer-name] # Transformer class (available: null)
92
- # Default: null
99
+ -t, [--transformer=transformer-name] # Transformer class. Default: null
93
100
  [--transformer-opts=key:value] # Transformer options
94
- -l, [--loader=loader-name] # Loader class (available: stdout, csv, table)
95
- # Default: stdout
101
+ -l, [--loader=loader-name] # Loader class. Default: stdout
96
102
  [--loader-opts=key:value] # Loader options
97
- -j, [--job=JOB] # Job configuration file
103
+ -j, [--name=NAME] # Job configuration name
104
+
98
105
 
99
106
  Runs an ETL job
100
107
  ```
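The README changes above describe a job as three steps (extract, transform, load) and add a `--dry-run` flag that skips the load step. A minimal sketch of that flow, assuming plain lambdas in place of the gem's connector classes; `run_job`, `null_transformer`, and `json_loader` are hypothetical names for illustration, not part of chronicle-etl:

```ruby
require 'json'

# Run each record through the transformer, then hand it to the loader
# unless dry_run is set. Returns the number of records processed.
# (Hypothetical helper; the gem's Runner is not reproduced here.)
def run_job(records, transformer:, loader:, dry_run: false)
  processed = 0
  records.each do |record|
    transformed = transformer.call(record)
    loader.call(transformed) unless dry_run
    processed += 1
  end
  processed
end

output = []
null_transformer = ->(record) { record }                      # does nothing to the data
json_loader = ->(record) { output << JSON.generate(record) }  # stand-in for the stdout loader

records = [{ 'id' => 1 }, { 'id' => 2 }]
full_run = run_job(records, transformer: null_transformer, loader: json_loader)
dry_run = run_job(records, transformer: null_transformer, loader: json_loader, dry_run: true)

puts output
```

Both calls process every record, but only the first one reaches the loader, which is the behaviour the `--dry-run` help text describes.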
data/chronicle-etl.gemspec CHANGED
@@ -17,11 +17,11 @@ Gem::Specification.new do |spec|
17
17
  # Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host'
18
18
  # to allow pushing to a single host or delete this section to allow pushing to any host.
19
19
  if spec.respond_to?(:metadata)
20
- # spec.metadata["allowed_push_host"] = "TODO: Set to 'http://mygemserver.com'"
20
+ spec.metadata['allowed_push_host'] = "https://rubygems.org"
21
21
 
22
22
  spec.metadata["homepage_uri"] = spec.homepage
23
23
  spec.metadata["source_code_uri"] = "https://github.com/chronicle-app/chronicle-etl"
24
- spec.metadata["changelog_uri"] = "https://github.com/chronicle-app/chronicle-etl/blob/master/CHANGELOG.md"
24
+ spec.metadata["changelog_uri"] = "https://github.com/chronicle-app/chronicle-etl/releases"
25
25
  else
26
26
  raise "RubyGems 2.0 or newer is required to protect against " \
27
27
  "public gem pushes."
@@ -36,15 +36,21 @@ Gem::Specification.new do |spec|
36
36
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
37
37
  spec.require_paths = ["lib"]
38
38
 
39
- spec.add_dependency "thor", "~> 0.20"
39
+ spec.add_dependency "activesupport"
40
+ spec.add_dependency "chronic_duration", "~> 0.10.6"
40
41
  spec.add_dependency "colorize", "~> 0.8.1"
41
- spec.add_dependency "tty-table", "~> 0.11"
42
+ spec.add_dependency "marcel", "~> 1.0.2"
43
+ spec.add_dependency "mini_exiftool", "~> 2.10"
44
+ spec.add_dependency "nokogiri", "~> 1.13"
45
+ spec.add_dependency "runcom", "~> 6.2"
46
+ spec.add_dependency "sequel", "~> 5.35"
47
+ spec.add_dependency "sqlite3", "~> 1.4"
48
+ spec.add_dependency "thor", "~> 0.20"
42
49
  spec.add_dependency "tty-progressbar", "~> 0.17"
50
+ spec.add_dependency "tty-table", "~> 0.11"
43
51
 
44
52
  spec.add_development_dependency "bundler", "~> 2.1"
53
+ spec.add_development_dependency "pry-byebug", "~> 3.9"
45
54
  spec.add_development_dependency "rake", "~> 13.0"
46
55
  spec.add_development_dependency "rspec", "~> 3.9"
47
- spec.add_development_dependency "pry-byebug", "~> 3.9"
48
- spec.add_development_dependency 'runcom', '~> 6.2'
49
- spec.add_development_dependency 'redcarpet', '~> 3.5'
50
56
  end
data/lib/chronicle/etl/cli/connectors.rb CHANGED
@@ -7,23 +7,35 @@ module Chronicle
7
7
  namespace :connectors
8
8
 
9
9
  desc "install NAME", "Installs connector NAME"
10
- def install
11
- puts "Installing"
10
+ def install(name)
11
+ Chronicle::ETL::Registry.install_connector(name)
12
12
  end
13
13
 
14
14
  desc "list", "Lists available connectors"
15
15
  # Display all available connectors that chronicle-etl has access to
16
16
  def list
17
- klasses = Chronicle::ETL::Catalog.available_classes
18
- klasses = klasses.sort_by do |a|
19
- [a[:built_in].to_s, a[:provider], a[:phase]]
17
+ Chronicle::ETL::Registry.load_all!
18
+
19
+ connector_info = Chronicle::ETL::Registry.connectors.map do |connector_registration|
20
+ {
21
+ identifier: connector_registration.identifier,
22
+ phase: connector_registration.phase,
23
+ description: connector_registration.descriptive_phrase,
24
+ provider: connector_registration.provider,
25
+ core: connector_registration.built_in? ? '✓' : '',
26
+ class: connector_registration.klass_name
27
+ }
28
+ end
29
+
30
+ connector_info = connector_info.sort_by do |a|
31
+ [a[:core].to_s, a[:provider], a[:phase], a[:identifier]]
20
32
  end
21
33
 
22
- headers = klasses.first.keys.map do |key|
34
+ headers = connector_info.first.keys.map do |key|
23
35
  key.to_s.upcase.bold
24
36
  end
25
37
 
26
- table = TTY::Table.new(headers, klasses.map(&:values))
38
+ table = TTY::Table.new(headers, connector_info.map(&:values))
27
39
  puts table.render(indent: 0, padding: [0, 2])
28
40
  end
29
41
  end
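The new `list` command maps each connector registration to a row hash, sorts the rows, and derives the table headers from the first row's keys. A rough sketch of that shape, assuming a `Registration` struct as a stand-in for the gem's `ConnectorRegistration` (the struct, its sample values, and the omission of `TTY::Table` and `.bold` are all simplifications made here):

```ruby
# Hypothetical stand-in for the gem's ConnectorRegistration; only the
# readers used by `connectors:list` in the diff are modelled.
Registration = Struct.new(:identifier, :phase, :descriptive_phrase,
                          :provider, :built_in, :klass_name,
                          keyword_init: true) do
  def built_in?
    built_in
  end
end

registrations = [
  Registration.new(identifier: 'stdin', phase: 'extractor', descriptive_phrase: 'stdin',
                   provider: 'chronicle', built_in: true, klass_name: 'StdinExtractor'),
  Registration.new(identifier: 'csv', phase: 'extractor', descriptive_phrase: 'input as CSV',
                   provider: 'chronicle', built_in: true, klass_name: 'CsvExtractor'),
  Registration.new(identifier: 'imessage', phase: 'extractor', descriptive_phrase: 'iMessage messages',
                   provider: 'imessage', built_in: false, klass_name: 'IMessageExtractor')
]

# One row hash per connector, mirroring the keys used in the diff.
rows = registrations.map do |r|
  {
    identifier: r.identifier,
    phase: r.phase,
    description: r.descriptive_phrase,
    provider: r.provider,
    core: r.built_in? ? '✓' : '',
    class: r.klass_name
  }
end

# Arrays compare element by element, so this sorts by core, then provider,
# then phase, then identifier.
rows = rows.sort_by { |a| [a[:core].to_s, a[:provider], a[:phase], a[:identifier]] }
headers = rows.first.keys.map { |key| key.to_s.upcase }

puts headers.join(' | ')
rows.each { |row| puts row.values.join(' | ') }
```

Because the empty string sorts before `'✓'`, this sort key lists non-core connectors ahead of built-in ones.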
data/lib/chronicle/etl/cli/jobs.rb CHANGED
@@ -1,5 +1,4 @@
1
1
  require 'pp'
2
-
3
2
  module Chronicle
4
3
  module ETL
5
4
  module CLI
@@ -8,16 +7,19 @@ module Chronicle
8
7
  default_task "start"
9
8
  namespace :jobs
10
9
 
11
- class_option :extractor, aliases: '-e', desc: 'Extractor class (available: stdin, csv, file)', default: 'stdin', banner: 'extractor-name'
10
+ class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'extractor-name'
12
11
  class_option :'extractor-opts', desc: 'Extractor options', type: :hash, default: {}
13
- class_option :transformer, aliases: '-t', desc: 'Transformer class (available: null)', default: 'null', banner: 'transformer-name'
12
+ class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'transformer-name'
14
13
  class_option :'transformer-opts', desc: 'Transformer options', type: :hash, default: {}
15
- class_option :loader, aliases: '-l', desc: 'Loader class (available: stdout, csv, table)', default: 'stdout', banner: 'loader-name'
14
+ class_option :loader, aliases: '-l', desc: 'Loader class. Default: stdout', banner: 'loader-name'
16
15
  class_option :'loader-opts', desc: 'Loader options', type: :hash, default: {}
17
- class_option :job, aliases: '-j', desc: 'Job configuration name (or filename)'
16
+ class_option :name, aliases: '-j', desc: 'Job configuration name'
18
17
 
19
18
  map run: :start # Thor doesn't like `run` as a command name
20
19
  desc "run", "Start a job"
20
+ option :log_level, desc: 'Log level (debug, info, warn, error, fatal)', default: 'info'
21
+ option :verbose, aliases: '-v', desc: 'Set log level to verbose', type: :boolean
22
+ option :dry_run, desc: 'Only run the extraction and transform steps, not the loading', type: :boolean
21
23
  long_desc <<-LONG_DESC
22
24
  This will run an ETL job. Each job needs three parts:
23
25
 
@@ -25,36 +27,37 @@ module Chronicle
25
27
 
26
28
  2. #{'Transformer'.underline}: transforms data into a new format. If none is specified, we use the `null` transformer which does nothing to the data.
27
29
 
28
- 3. #{'Loader'.underline}: takes that transformed data and loads it externally. This can be an API, flat files, (or by default), stdout.
30
+ 3. #{'Loader'.underline}: takes that transformed data and loads it externally. This can be an API, flat files, (or by default), stdout. With the --dry-run option, this step won't be run.
29
31
 
30
32
  If you do not want to use the command line flags, you can also configure a job with a .yml config file. You can either specify the path to this file or use the filename and place the file in ~/.config/chronicle/etl/jobs/NAME.yml and call it with `--job NAME`
31
33
  LONG_DESC
32
34
  # Run an ETL job
33
35
  def start
34
- runner_options = build_runner_options(options)
35
- runner = Chronicle::ETL::Runner.new(runner_options)
36
+ setup_log_level
37
+ job_definition = build_job_definition(options)
38
+ job = Chronicle::ETL::Job.new(job_definition)
39
+ runner = Chronicle::ETL::Runner.new(job)
36
40
  runner.run!
37
41
  end
38
42
 
39
43
  desc "create", "Create a job"
40
44
  # Create an ETL job
41
45
  def create
42
- runner_options = build_runner_options(options)
43
- path = File.join('chronicle', 'etl', 'jobs', options[:job])
44
- Chronicle::ETL::Config.write(path, runner_options)
46
+ job_definition = build_job_definition(options)
47
+ path = File.join('chronicle', 'etl', 'jobs', options[:name])
48
+ Chronicle::ETL::Config.write(path, job_definition.definition)
45
49
  end
46
50
 
47
51
  desc "show", "Show details about a job"
48
52
  # Show an ETL job
49
53
  def show
50
- runner_options = build_runner_options(options)
51
- pp runner_options
54
+ puts Chronicle::ETL::Job.new(build_job_definition(options))
52
55
  end
53
56
 
54
57
  desc "list", "List all available jobs"
55
58
  # List available ETL jobs
56
59
  def list
57
- jobs = Chronicle::ETL::Config.jobs
60
+ jobs = Chronicle::ETL::Config.available_jobs
58
61
 
59
62
  job_details = jobs.map do |job|
60
63
  r = Chronicle::ETL::Config.load("chronicle/etl/jobs/#{job}.yml")
@@ -74,34 +77,43 @@ LONG_DESC
74
77
 
75
78
  private
76
79
 
77
- # Create runner options by reading config file and then overwriting with flag options
78
- def build_runner_options options
79
- flag_options = process_flag_options(options)
80
- job_options = load_job(options[:job])
81
- flag_options.merge(job_options)
80
+ def setup_log_level
81
+ if options[:verbose]
82
+ Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::DEBUG
83
+ elsif options[:log_level]
84
+ level = Chronicle::ETL::Logger.const_get(options[:log_level].upcase)
85
+ Chronicle::ETL::Logger.log_level = level
86
+ end
87
+ end
88
+
89
+ # Create job definition by reading config file and then overwriting with flag options
90
+ def build_job_definition(options)
91
+ definition = Chronicle::ETL::JobDefinition.new
92
+ definition.add_config(load_job_config(options[:name]))
93
+ definition.add_config(process_flag_options(options))
94
+ definition
82
95
  end
83
96
 
84
- def load_job job
85
- yml_config = Chronicle::ETL::Config.load("chronicle/etl/jobs/#{job}.yml")
86
- # FIXME: use better trick to depely symbolize keys
87
- JSON.parse(yml_config.to_json, symbolize_names: true)
97
+ def load_job_config name
98
+ Chronicle::ETL::Config.load_job_from_config(name)
88
99
  end
89
100
 
90
101
  # Takes flag options and turns them into a runner config
91
102
  def process_flag_options options
92
103
  {
104
+ dry_run: options[:dry_run],
93
105
  extractor: {
94
106
  name: options[:extractor],
95
107
  options: options[:'extractor-opts']
96
- },
108
+ }.compact,
97
109
  transformer: {
98
110
  name: options[:transformer],
99
111
  options: options[:'transformer-opts']
100
- },
112
+ }.compact,
101
113
  loader: {
102
114
  name: options[:loader],
103
115
  options: options[:'loader-opts']
104
- }
116
+ }.compact
105
117
  }
106
118
  end
107
119
  end
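`build_job_definition` layers two sources: the saved job config first, then the command-line flags. The `.compact` calls added to `process_flag_options` drop a `nil` connector name when a flag is not given. A sketch of why that matters, assuming the layers are combined with a recursive merge; `JobDefinition#add_config` is not shown in this part of the diff, so `deep_merge` and `flag_config` below are hypothetical helpers:

```ruby
# Recursively merge `overrides` into `base`; nested hashes merge key by key.
# (Assumed merge strategy, not taken from the gem.)
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Simplified version of process_flag_options from the diff: without
# .compact, an unset flag would contribute `name: nil` to the merge.
def flag_config(options)
  {
    extractor: { name: options[:extractor], options: options[:'extractor-opts'] }.compact,
    loader: { name: options[:loader], options: options[:'loader-opts'] }.compact
  }
end

file_config = {
  extractor: { name: 'csv', options: { filename: 'in.csv' } },
  loader: { name: 'table' }
}

# Only --loader was passed; the *-opts flags default to {} as in the diff.
flags = { loader: 'stdout', 'extractor-opts': {}, 'loader-opts': {} }

merged = deep_merge(file_config, flag_config(flags))
p merged
```

Under this assumption the extractor from the config file survives, while the loader named on the command line replaces the one in the file.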
data/lib/chronicle/etl/cli/main.rb CHANGED
@@ -22,6 +22,11 @@ module Chronicle
22
22
 
23
23
  # Entrypoint for the CLI
24
24
  def self.start(given_args = ARGV, config = {})
25
+ if given_args[0] == "--version"
26
+ puts "#{Chronicle::ETL::VERSION}"
27
+ exit
28
+ end
29
+
25
30
  if given_args.none?
26
31
  abort "No command entered or job specified. To see commands, run `chronicle-etl help`".red
27
32
  end
@@ -52,10 +57,10 @@ module Chronicle
52
57
  shell.say " $ chronicle-etl connectors:list"
53
58
  shell.say
54
59
  shell.say " Run a simple job:".italic.light_black
55
- shell.say " $ chronicle-etl jobs:start --extractor stdin --transformer null --loader stdout"
60
+ shell.say " $ chronicle-etl jobs:run --extractor stdin --transformer null --loader stdout"
56
61
  shell.say
57
62
  shell.say " Show full job options:".italic.light_black
58
- shell.say " $ chronicle-etl jobs help start"
63
+ shell.say " $ chronicle-etl jobs help run"
59
64
 
60
65
  list = []
61
66
 
@@ -72,6 +77,9 @@ module Chronicle
72
77
  shell.say "VERSION".bold
73
78
  shell.say " #{Chronicle::ETL::VERSION}"
74
79
  shell.say
80
+ shell.say " Display current version:".italic.light_black
81
+ shell.say " $ chronicle-etl --version"
82
+ shell.say
75
83
  shell.say "FULL DOCUMENTATION".bold
76
84
  shell.say " https://github.com/chronicle-app/chronicle-etl".blue
77
85
  shell.say
data/lib/chronicle/etl/config.rb CHANGED
@@ -4,15 +4,17 @@ module Chronicle
4
4
  module ETL
5
5
  # Utility methods to read, write, and access config files
6
6
  module Config
7
+ module_function
8
+
7
9
  # Loads a yml config file
8
- def self.load(path)
10
+ def load(path)
9
11
  config = Runcom::Config.new(path)
10
12
  # FIXME: hack to deeply symbolize keys
11
13
  JSON.parse(config.to_h.to_json, symbolize_names: true)
12
14
  end
13
15
 
14
16
  # Writes a hash as a yml config file
15
- def self.write(path, data)
17
+ def write(path, data)
16
18
  config = Runcom::Config.new(path)
17
19
  filename = config.all[0].to_s + '.yml'
18
20
  File.open(filename, 'w') do |f|
@@ -21,12 +23,31 @@ module Chronicle
21
23
  end
22
24
 
23
25
  # Returns all jobs available in ~/.config/chronicle/etl/jobs/*.yml
24
- def self.jobs
26
+ def available_jobs
25
27
  job_directory = Runcom::Config.new('chronicle/etl/jobs').current
26
28
  Dir.glob(File.join(job_directory, "*.yml")).map do |filename|
27
29
  File.basename(filename, ".*")
28
30
  end
29
31
  end
32
+
33
+ # Returns all credentials available in ~/.config/chronicle/etl/credentials/*.yml
34
+ def available_credentials
35
+ job_directory = Runcom::Config.new('chronicle/etl/credentials').current
36
+ Dir.glob(File.join(job_directory, "*.yml")).map do |filename|
37
+ File.basename(filename, ".*")
38
+ end
39
+ end
40
+
41
+ # Load a job definition from job config directory
42
+ def load_job_from_config(job_name)
43
+ definition = self.load("chronicle/etl/jobs/#{job_name}.yml")
44
+ definition[:name] = job_name
45
+ definition
46
+ end
47
+
48
+ def load_credentials(name)
49
+ config = self.load("chronicle/etl/credentials/#{name}.yml")
50
+ end
30
51
  end
31
52
  end
32
53
  end
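`available_jobs` and `available_credentials` both list the `*.yml` files in a config directory and strip the extension, and `load` symbolizes keys by round-tripping through JSON. A standalone sketch of both ideas using a temporary directory instead of Runcom; `names_in` and `deep_symbolize` are hypothetical helper names:

```ruby
require 'json'
require 'tmpdir'
require 'fileutils'

# List the basenames (without extension) of every .yml file in a directory.
def names_in(directory)
  Dir.glob(File.join(directory, '*.yml')).map do |filename|
    File.basename(filename, '.*')
  end
end

# Deeply symbolize keys via a JSON round trip, the same trick the diff
# marks with a FIXME.
def deep_symbolize(hash)
  JSON.parse(hash.to_json, symbolize_names: true)
end

job_names = Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'backup.yml'))
  FileUtils.touch(File.join(dir, 'notes.txt')) # ignored: not a .yml file
  names_in(dir)
end

config = deep_symbolize('extractor' => { 'name' => 'csv' })

p job_names
p config
```

The JSON round trip only preserves JSON-compatible values, so a `Time` or `Symbol` value in the config would come back as a string.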
data/lib/chronicle/etl/exceptions.rb CHANGED
@@ -2,6 +2,8 @@ module Chronicle
    module ETL
      class Error < StandardError; end;

+     class RunnerTypeError < Error; end
+
      class ConnectorNotAvailableError < Error
        def initialize(message, provider: nil, name: nil)
          super(message)
@@ -13,5 +15,16 @@ module Chronicle

      class ProviderNotAvailableError < ConnectorNotAvailableError; end
      class ProviderConnectorNotAvailableError < ConnectorNotAvailableError; end
+
+     class TransformationError < Error
+       attr_reader :transformation
+
+       def initialize(message=nil, transformation:)
+         super(message)
+         @transformation = transformation
+       end
+     end
+
+     class UntransformableRecordError < TransformationError; end
    end
  end
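The new `TransformationError` carries a required `transformation:` keyword, and `UntransformableRecordError` subclasses it, so a caller can rescue the parent class and still inspect what failed. A small sketch of that pattern, with the classes re-declared under a hypothetical `Sketch` module; passing the raw record as `transformation:` is an assumption made for the example and may not match what the gem passes:

```ruby
module Sketch
  class Error < StandardError; end

  # Same shape as the class added in the diff.
  class TransformationError < Error
    attr_reader :transformation

    def initialize(message = nil, transformation:)
      super(message)
      @transformation = transformation
    end
  end

  class UntransformableRecordError < TransformationError; end
end

skipped_ids = []
records = [{ id: 1, ok: true }, { id: 2, ok: false }]

records.each do |record|
  begin
    unless record[:ok]
      raise Sketch::UntransformableRecordError.new('cannot transform', transformation: record)
    end
  rescue Sketch::TransformationError => e
    # Rescuing the parent class also catches the subclass.
    skipped_ids << e.transformation[:id]
  end
end

p skipped_ids
```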
data/lib/chronicle/etl/extraction.rb ADDED
@@ -0,0 +1,12 @@
+ module Chronicle
+   module ETL
+     class Extraction
+       attr_accessor :data, :meta
+
+       def initialize(data: {}, meta: {})
+         @data = data
+         @meta = meta
+       end
+     end
+   end
+ end
data/lib/chronicle/etl/extractors/csv_extractor.rb CHANGED
@@ -1,42 +1,48 @@
1
1
  require 'csv'
2
- class Chronicle::ETL::CsvExtractor < Chronicle::ETL::Extractor
3
- DEFAULT_OPTIONS = {
4
- headers: true,
5
- filename: $stdin
6
- }.freeze
7
-
8
- def initialize(options = {})
9
- super(DEFAULT_OPTIONS.merge(options))
10
- end
11
2
 
12
- def extract
13
- csv = initialize_csv
14
- csv.each do |row|
15
- result = row.to_h
16
- yield result
3
+ module Chronicle
4
+ module ETL
5
+ class CsvExtractor < Chronicle::ETL::Extractor
6
+ include Extractors::Helpers::FilesystemReader
7
+
8
+ register_connector do |r|
9
+ r.description = 'input as CSV'
10
+ end
11
+
12
+ DEFAULT_OPTIONS = {
13
+ headers: true,
14
+ filename: $stdin
15
+ }.freeze
16
+
17
+ def initialize(options = {})
18
+ super(DEFAULT_OPTIONS.merge(options))
19
+ end
20
+
21
+ def extract
22
+ csv = initialize_csv
23
+ csv.each do |row|
24
+ yield Chronicle::ETL::Extraction.new(data: row.to_h)
25
+ end
26
+ end
27
+
28
+ def results_count
29
+ CSV.read(@options[:filename], headers: @options[:headers]).count unless stdin?(@options[:filename])
30
+ end
31
+
32
+ private
33
+
34
+ def initialize_csv
35
+ headers = @options[:headers].is_a?(String) ? @options[:headers].split(',') : @options[:headers]
36
+
37
+ csv_options = {
38
+ headers: headers,
39
+ converters: :all
40
+ }
41
+
42
+ open_from_filesystem(filename: @options[:filename]) do |file|
43
+ return CSV.new(file, **csv_options)
44
+ end
45
+ end
17
46
  end
18
47
  end
19
-
20
- def results_count
21
- CSV.read(@options[:filename], headers: @options[:headers]).count if read_from_file?
22
- end
23
-
24
- private
25
-
26
- def initialize_csv
27
- headers = @options[:headers].is_a?(String) ? @options[:headers].split(',') : @options[:headers]
28
-
29
- csv_options = {
30
- headers: headers,
31
- header_converters: :symbol,
32
- converters: [:all]
33
- }
34
-
35
- stream = read_from_file? ? File.open(@options[:filename]) : @options[:filename]
36
- CSV.new(stream, **csv_options)
37
- end
38
-
39
- def read_from_file?
40
- @options[:filename] != $stdin
41
- end
42
48
  end
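In this release the CSV extractor yields an `Extraction` object for each row instead of a bare hash, accepts `headers` as a comma-separated string, and no longer passes `header_converters: :symbol`. A sketch of that behaviour using only Ruby's `csv` library and a `StringIO` source; `Extraction` here is a struct standing in for `Chronicle::ETL::Extraction`, and `normalize_headers` and `each_extraction` are hypothetical names:

```ruby
require 'csv'
require 'stringio'

# Hypothetical stand-in for Chronicle::ETL::Extraction.
Extraction = Struct.new(:data, :meta, keyword_init: true)

# Accept `true`, an Array, or a comma-separated String for headers,
# as initialize_csv does in the diff.
def normalize_headers(headers)
  headers.is_a?(String) ? headers.split(',') : headers
end

# Yield one Extraction per CSV row read from `io`.
def each_extraction(io, headers: true)
  csv = CSV.new(io, headers: normalize_headers(headers), converters: :all)
  csv.each { |row| yield Extraction.new(data: row.to_h, meta: {}) }
end

# Headers taken from the first line of the input.
extractions = []
each_extraction(StringIO.new("id,name\n1,Ada\n2,Grace\n")) { |e| extractions << e }

# Headers supplied as an option string; the first line is then data.
custom = []
each_extraction(StringIO.new("3,Edsger\n"), headers: 'id,name') { |e| custom << e }

p extractions.map(&:data)
p custom.map(&:data)
```

Without the symbol header converter the row keys stay strings, while `converters: :all` still turns numeric fields into numbers.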
data/lib/chronicle/etl/extractors/extractor.rb CHANGED
@@ -4,7 +4,7 @@ module Chronicle
4
4
  module ETL
5
5
  # Abstract class representing an Extractor for an ETL job
6
6
  class Extractor
7
- extend Chronicle::ETL::Catalog
7
+ extend Chronicle::ETL::Registry::SelfRegistering
8
8
 
9
9
  # Construct a new instance of this extractor. Options are passed in from a Runner
10
10
  # == Parameters:
@@ -12,20 +12,41 @@ module Chronicle
12
12
  # Options for configuring this Extractor
13
13
  def initialize(options = {})
14
14
  @options = options.transform_keys!(&:to_sym)
15
+ sanitize_options
16
+ handle_continuation
15
17
  end
16
18
 
19
+ # Hook called before #extract. Useful for gathering data, initializing proxies, etc.
20
+ def prepare; end
21
+
22
+ # An optional method to calculate how many records there are to extract. Used primarily for
23
+ # building the progress bar
24
+ def results_count; end
25
+
17
26
  # Entrypoint for this Extractor. Called by a Runner. Expects a series of records to be yielded
18
27
  def extract
19
28
  raise NotImplementedError
20
29
  end
21
30
 
22
- # An optional method to calculate how many records there are to extract. Used primarily for
23
- # building the progress bar
24
- def results_count; end
31
+ private
32
+
33
+ def sanitize_options
34
+ @options[:load_since] = Time.parse(@options[:load_since]) if @options[:load_since] && @options[:load_since].is_a?(String)
35
+ @options[:load_until] = Time.parse(@options[:load_until]) if @options[:load_until] && @options[:load_until].is_a?(String)
36
+ end
37
+
38
+ def handle_continuation
39
+ return unless @options[:continuation]
40
+
41
+ @options[:load_since] = @options[:continuation].highest_timestamp if @options[:continuation].highest_timestamp
42
+ @options[:load_after_id] = @options[:continuation].last_id if @options[:continuation].last_id
43
+ end
25
44
  end
26
45
  end
27
46
  end
28
47
 
48
+ require_relative 'helpers/filesystem_reader'
29
49
  require_relative 'csv_extractor'
30
50
  require_relative 'file_extractor'
51
+ require_relative 'json_extractor'
31
52
  require_relative 'stdin_extractor'
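The base `Extractor` now normalizes its options in two passes: `sanitize_options` parses string `load_since` / `load_until` values into `Time` objects, and `handle_continuation` lets a continuation object override `load_since` and set `load_after_id`. A self-contained sketch of those two passes; `Continuation` is a hypothetical struct exposing only the two readers the diff calls, and unlike the gem (which uses `transform_keys!`) this version does not mutate its argument:

```ruby
require 'time'

# Hypothetical continuation object; the gem's own type is not shown in the diff.
Continuation = Struct.new(:highest_timestamp, :last_id, keyword_init: true)

def normalize_options(options)
  opts = options.transform_keys(&:to_sym)

  # Parse timestamps that arrived as strings (e.g. from CLI flags).
  %i[load_since load_until].each do |key|
    opts[key] = Time.parse(opts[key]) if opts[key].is_a?(String)
  end

  # A continuation, when present, takes precedence over load_since.
  continuation = opts[:continuation]
  if continuation
    opts[:load_since] = continuation.highest_timestamp if continuation.highest_timestamp
    opts[:load_after_id] = continuation.last_id if continuation.last_id
  end

  opts
end

plain = normalize_options('load_since' => '2021-01-01T00:00:00Z')
resumed = normalize_options(
  load_since: '2021-01-01T00:00:00Z',
  continuation: Continuation.new(highest_timestamp: Time.utc(2022, 5, 1), last_id: 42)
)

p plain[:load_since]
p resumed[:load_since]
p resumed[:load_after_id]
```

Running the continuation pass after the parsing pass means a resumed job picks up from the last recorded timestamp even when an older `load_since` was supplied.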