chronicle-etl 0.2.2 → 0.3.1

Files changed (52)
  1. checksums.yaml +4 -4
  2. data/.gitignore +3 -0
  3. data/.rubocop.yml +3 -0
  4. data/README.md +22 -15
  5. data/chronicle-etl.gemspec +13 -7
  6. data/lib/chronicle/etl/cli/connectors.rb +19 -7
  7. data/lib/chronicle/etl/cli/jobs.rb +38 -26
  8. data/lib/chronicle/etl/cli/main.rb +10 -2
  9. data/lib/chronicle/etl/config.rb +24 -3
  10. data/lib/chronicle/etl/exceptions.rb +13 -0
  11. data/lib/chronicle/etl/extraction.rb +12 -0
  12. data/lib/chronicle/etl/extractors/csv_extractor.rb +43 -37
  13. data/lib/chronicle/etl/extractors/extractor.rb +25 -4
  14. data/lib/chronicle/etl/extractors/file_extractor.rb +15 -33
  15. data/lib/chronicle/etl/extractors/helpers/filesystem_reader.rb +104 -0
  16. data/lib/chronicle/etl/extractors/json_extractor.rb +45 -0
  17. data/lib/chronicle/etl/extractors/stdin_extractor.rb +6 -1
  18. data/lib/chronicle/etl/job.rb +72 -0
  19. data/lib/chronicle/etl/job_definition.rb +89 -0
  20. data/lib/chronicle/etl/job_log.rb +95 -0
  21. data/lib/chronicle/etl/job_logger.rb +81 -0
  22. data/lib/chronicle/etl/loaders/csv_loader.rb +6 -6
  23. data/lib/chronicle/etl/loaders/loader.rb +2 -2
  24. data/lib/chronicle/etl/loaders/rest_loader.rb +16 -9
  25. data/lib/chronicle/etl/loaders/stdout_loader.rb +8 -3
  26. data/lib/chronicle/etl/loaders/table_loader.rb +58 -7
  27. data/lib/chronicle/etl/logger.rb +48 -0
  28. data/lib/chronicle/etl/models/activity.rb +15 -0
  29. data/lib/chronicle/etl/models/attachment.rb +14 -0
  30. data/lib/chronicle/etl/models/base.rb +119 -0
  31. data/lib/chronicle/etl/models/entity.rb +21 -0
  32. data/lib/chronicle/etl/models/generic.rb +23 -0
  33. data/lib/chronicle/etl/registry/connector_registration.rb +61 -0
  34. data/lib/chronicle/etl/registry/registry.rb +52 -0
  35. data/lib/chronicle/etl/registry/self_registering.rb +25 -0
  36. data/lib/chronicle/etl/runner.rb +66 -24
  37. data/lib/chronicle/etl/serializers/jsonapi_serializer.rb +25 -0
  38. data/lib/chronicle/etl/serializers/serializer.rb +27 -0
  39. data/lib/chronicle/etl/transformers/image_file_transformer.rb +253 -0
  40. data/lib/chronicle/etl/transformers/null_transformer.rb +11 -3
  41. data/lib/chronicle/etl/transformers/transformer.rb +42 -13
  42. data/lib/chronicle/etl/utils/binary_attachments.rb +21 -0
  43. data/lib/chronicle/etl/utils/hash_utilities.rb +19 -0
  44. data/lib/chronicle/etl/utils/progress_bar.rb +3 -1
  45. data/lib/chronicle/etl/utils/text_recognition.rb +15 -0
  46. data/lib/chronicle/etl/version.rb +1 -1
  47. data/lib/chronicle/etl.rb +16 -1
  48. metadata +139 -36
  49. data/CHANGELOG.md +0 -23
  50. data/Gemfile.lock +0 -85
  51. data/lib/chronicle/etl/catalog.rb +0 -102
  52. data/lib/chronicle/etl/transformers/json_transformer.rb +0 -11
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: e1c08bc4f71c807525090abbf1701be19ab72cce08a99cc3bbec9b0db7150a02
4
- data.tar.gz: 172a5d7e7ba7a9424ef7b5ab4da2b8c44defdb4e0a34c833248ff1b63f40407e
3
+ metadata.gz: b74c4a7782c1ab31173e628b3e5ccb8743fe21f29d6f48d739b0e3cc2dfda22e
4
+ data.tar.gz: 7ea44638b08f6da12c0a5386f3d852600f50336ce0bb57347114804770f75691
5
5
  SHA512:
6
- metadata.gz: 0f671c00928b15f9c0f6fa159ac106ff9c4f65a8bd16048e5d0cab82d680945317f7680e7796e98c665bb5cc757e0657f1a36d773d89e3e1587d9eebc12abdd8
7
- data.tar.gz: 449d1368e0054f39006c7903218300b9b97ca839d6eff43b6b7bd659e5146d443a31c53325c4769ae7a56db9d42417020ccde17362ae024c01aca2ed63029044
6
+ metadata.gz: efb23677c731a54b0382c3095dc9bb5f98a97365c1daf031bbc8c20335e7bd146b76b3a50486971e48192e7540bc0ae1b09f232590a590257203ae3560396767
7
+ data.tar.gz: cba40b71a7e8c0b17a286ecd3db3724bff290fdd79b3fdf55ab89967f6af14228911c0e6928a949b8dd899acd6ad396b8a21fb03162a8561247c97b1200bac29
data/.gitignore CHANGED
@@ -7,6 +7,9 @@
7
7
  /spec/reports/
8
8
  /tmp/
9
9
 
10
+ # https://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/
11
+ Gemfile.lock
12
+
10
13
  # rspec failure tracking
11
14
  .rspec_status
12
15
  .DS_Store
data/.rubocop.yml CHANGED
@@ -5,4 +5,7 @@ Style/StringLiterals:
5
5
  Enabled: false
6
6
 
7
7
  Style/MethodCallWithArgsParentheses:
8
+ Enabled: false
9
+
10
+ Lint/ConstantResolution:
8
11
  Enabled: false
data/README.md CHANGED
@@ -2,9 +2,9 @@
2
2
 
3
3
  [![Gem Version](https://badge.fury.io/rb/chronicle-etl.svg)](https://badge.fury.io/rb/chronicle-etl)
4
4
 
5
- Chronicle ETL is a utility tool for archiving and processing personal data. You can extract it from a variety of source, transform it, and load it to different APIs or file formats.
5
+ Chronicle ETL is a utility that helps you archive and processes personal data. You can *extract* it from a variety of sources, *transform* it, and *load* it to an external API, file, or stdout.
6
6
 
7
- This project is an adaptation of Andrew Louis's experimental [Memex project](https://hyfen.net/memex).
7
+ This tool is an adaptation of Andrew Louis's experimental [Memex project](https://hyfen.net/memex) and the dozens of existing importers are being migrated to Chronicle.
8
8
 
9
9
  ## Installation
10
10
 
@@ -31,6 +31,9 @@ Connectors are available to read, process, and load data from different formats
31
31
  ```bash
32
32
  # List all available connectors
33
33
  $ chronicle-etl connectors:list
34
+
35
+ # Install a connector
36
+ $ chronicle-etl connectors:install imessage
34
37
  ```
35
38
 
36
39
  Built in connectors:
@@ -44,16 +47,18 @@ Built in connectors:
44
47
  - `null` - (default) Don't do anything
45
48
 
46
49
  ### Loaders
47
- - `stdout` - (default) output transformed records to stdount
50
+ - `stdout` - (default) output records to stdout serialized as JSON
48
51
  - `csv` - Load records to a csv file
52
+ - `rest` - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
49
53
  - `table` - Output an ascii table of records. Useful for debugging.
50
54
 
51
55
  ### Provider-specific importers
52
56
 
53
57
  In addition to the built-in importers, importers for third-party platforms are available. They are packaged as individual Ruby gems.
54
58
 
55
- - [email](https://github.com/chronicle-app/chronicle-email). Extractors for `mbox` files. Transformers for chronicle schema
56
- - [bash](https://github.com/chronicle-app/chronicle-bash). Extract bash history from `~/.bash_history`. Transform it for chronicle schema
59
+ - [email](https://github.com/chronicle-app/chronicle-email). Extractors for `mbox` and other email files
60
+ - [bash](https://github.com/chronicle-app/chronicle-bash). Extract bash history from `~/.bash_history`
61
+ - [imessage](https://github.com/chronicle-app/chronicle-imessage). Extract iMessage messages from a local macOS installation
57
62
 
58
63
  To install any of these, run `gem install chronicle-PROVIDER`.
59
64
 
@@ -61,7 +66,7 @@ If you don't want to use the available rubygem importers, `chronicle-etl` can us
61
66
 
62
67
  I'll be open-sourcing more importers. Please [contact me](mailto:andrew@hyfen.net) to chat about what will be available!
63
68
 
64
- ### Full commands
69
+ ## Full commands
65
70
 
66
71
  ```
67
72
  $ chronicle-etl help
@@ -75,26 +80,28 @@ ALL COMMANDS
75
80
  jobs:create # Create a job
76
81
  jobs:list # List all available jobs
77
82
  jobs:run # Start a job
78
- jobs:show # Show a job
83
+ jobs:show # Show details about a job
79
84
  ```
80
85
 
81
- ### Job options
86
+ ### Running a job
82
87
 
83
88
  ```
84
89
  Usage:
85
90
  chronicle-etl jobs:run
86
91
 
87
92
  Options:
88
- -e, [--extractor=extractor-name] # Extractor class (available: stdin, csv, file)
89
- # Default: stdin
93
+ [--log-level=LOG_LEVEL] # Log level (debug, info, warn, error, fatal)
94
+ # Default: info
95
+ -v, [--verbose], [--no-verbose] # Set log level to verbose
96
+ [--dry-run], [--no-dry-run] # Only run the extraction and transform steps, not the loading
97
+ -e, [--extractor=extractor-name] # Extractor class. Default: stdin
90
98
  [--extractor-opts=key:value] # Extractor options
91
- -t, [--transformer=transformer-name] # Transformer class (available: null)
92
- # Default: null
99
+ -t, [--transformer=transformer-name] # Transformer class. Default: null
93
100
  [--transformer-opts=key:value] # Transformer options
94
- -l, [--loader=loader-name] # Loader class (available: stdout, csv, table)
95
- # Default: stdout
101
+ -l, [--loader=loader-name] # Loader class. Default: stdout
96
102
  [--loader-opts=key:value] # Loader options
97
- -j, [--job=JOB] # Job configuration file
103
+ -j, [--name=NAME] # Job configuration name
104
+
98
105
 
99
106
  Runs an ETL job
100
107
  ```
data/chronicle-etl.gemspec CHANGED
@@ -17,11 +17,11 @@ Gem::Specification.new do |spec|
17
17
  # Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host'
18
18
  # to allow pushing to a single host or delete this section to allow pushing to any host.
19
19
  if spec.respond_to?(:metadata)
20
- # spec.metadata["allowed_push_host"] = "TODO: Set to 'http://mygemserver.com'"
20
+ spec.metadata['allowed_push_host'] = "https://rubygems.org"
21
21
 
22
22
  spec.metadata["homepage_uri"] = spec.homepage
23
23
  spec.metadata["source_code_uri"] = "https://github.com/chronicle-app/chronicle-etl"
24
- spec.metadata["changelog_uri"] = "https://github.com/chronicle-app/chronicle-etl/blob/master/CHANGELOG.md"
24
+ spec.metadata["changelog_uri"] = "https://github.com/chronicle-app/chronicle-etl/releases"
25
25
  else
26
26
  raise "RubyGems 2.0 or newer is required to protect against " \
27
27
  "public gem pushes."
@@ -36,15 +36,21 @@ Gem::Specification.new do |spec|
36
36
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
37
37
  spec.require_paths = ["lib"]
38
38
 
39
- spec.add_dependency "thor", "~> 0.20"
39
+ spec.add_dependency "activesupport"
40
+ spec.add_dependency "chronic_duration", "~> 0.10.6"
40
41
  spec.add_dependency "colorize", "~> 0.8.1"
41
- spec.add_dependency "tty-table", "~> 0.11"
42
+ spec.add_dependency "marcel", "~> 1.0.2"
43
+ spec.add_dependency "mini_exiftool", "~> 2.10"
44
+ spec.add_dependency "nokogiri", "~> 1.13"
45
+ spec.add_dependency "runcom", "~> 6.2"
46
+ spec.add_dependency "sequel", "~> 5.35"
47
+ spec.add_dependency "sqlite3", "~> 1.4"
48
+ spec.add_dependency "thor", "~> 0.20"
42
49
  spec.add_dependency "tty-progressbar", "~> 0.17"
50
+ spec.add_dependency "tty-table", "~> 0.11"
43
51
 
44
52
  spec.add_development_dependency "bundler", "~> 2.1"
53
+ spec.add_development_dependency "pry-byebug", "~> 3.9"
45
54
  spec.add_development_dependency "rake", "~> 13.0"
46
55
  spec.add_development_dependency "rspec", "~> 3.9"
47
- spec.add_development_dependency "pry-byebug", "~> 3.9"
48
- spec.add_development_dependency 'runcom', '~> 6.2'
49
- spec.add_development_dependency 'redcarpet', '~> 3.5'
50
56
  end
data/lib/chronicle/etl/cli/connectors.rb CHANGED
@@ -7,23 +7,35 @@ module Chronicle
7
7
  namespace :connectors
8
8
 
9
9
  desc "install NAME", "Installs connector NAME"
10
- def install
11
- puts "Installing"
10
+ def install(name)
11
+ Chronicle::ETL::Registry.install_connector(name)
12
12
  end
13
13
 
14
14
  desc "list", "Lists available connectors"
15
15
  # Display all available connectors that chronicle-etl has access to
16
16
  def list
17
- klasses = Chronicle::ETL::Catalog.available_classes
18
- klasses = klasses.sort_by do |a|
19
- [a[:built_in].to_s, a[:provider], a[:phase]]
17
+ Chronicle::ETL::Registry.load_all!
18
+
19
+ connector_info = Chronicle::ETL::Registry.connectors.map do |connector_registration|
20
+ {
21
+ identifier: connector_registration.identifier,
22
+ phase: connector_registration.phase,
23
+ description: connector_registration.descriptive_phrase,
24
+ provider: connector_registration.provider,
25
+ core: connector_registration.built_in? ? '✓' : '',
26
+ class: connector_registration.klass_name
27
+ }
28
+ end
29
+
30
+ connector_info = connector_info.sort_by do |a|
31
+ [a[:core].to_s, a[:provider], a[:phase], a[:identifier]]
20
32
  end
21
33
 
22
- headers = klasses.first.keys.map do |key|
34
+ headers = connector_info.first.keys.map do |key|
23
35
  key.to_s.upcase.bold
24
36
  end
25
37
 
26
- table = TTY::Table.new(headers, klasses.map(&:values))
38
+ table = TTY::Table.new(headers, connector_info.map(&:values))
27
39
  puts table.render(indent: 0, padding: [0, 2])
28
40
  end
29
41
  end
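Note on the new sort key in `list`: `core` is either `'✓'` or `''`, and the empty string sorts first, so connectors that are not built in appear before the built-in ones. A minimal sketch of that ordering follows; the sample rows, including the `'chronicle'` provider value, are hypothetical and only the sort key is taken from the change above.

```ruby
# Hypothetical sample rows; only the sort key mirrors the diff.
connector_info = [
  { identifier: 'table',    phase: :loader,    provider: 'chronicle', core: '✓' },
  { identifier: 'csv',      phase: :extractor, provider: 'chronicle', core: '✓' },
  { identifier: 'imessage', phase: :extractor, provider: 'imessage',  core: '' }
]

# Arrays compare element by element, so this sorts by core flag,
# then provider, then phase, then identifier.
sorted = connector_info.sort_by do |a|
  [a[:core].to_s, a[:provider], a[:phase], a[:identifier]]
end

identifiers = sorted.map { |row| row[:identifier] }
puts identifiers.inspect
```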
data/lib/chronicle/etl/cli/jobs.rb CHANGED
@@ -1,5 +1,4 @@
1
1
  require 'pp'
2
-
3
2
  module Chronicle
4
3
  module ETL
5
4
  module CLI
@@ -8,16 +7,19 @@ module Chronicle
8
7
  default_task "start"
9
8
  namespace :jobs
10
9
 
11
- class_option :extractor, aliases: '-e', desc: 'Extractor class (available: stdin, csv, file)', default: 'stdin', banner: 'extractor-name'
10
+ class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'extractor-name'
12
11
  class_option :'extractor-opts', desc: 'Extractor options', type: :hash, default: {}
13
- class_option :transformer, aliases: '-t', desc: 'Transformer class (available: null)', default: 'null', banner: 'transformer-name'
12
+ class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'transformer-name'
14
13
  class_option :'transformer-opts', desc: 'Transformer options', type: :hash, default: {}
15
- class_option :loader, aliases: '-l', desc: 'Loader class (available: stdout, csv, table)', default: 'stdout', banner: 'loader-name'
14
+ class_option :loader, aliases: '-l', desc: 'Loader class. Default: stdout', banner: 'loader-name'
16
15
  class_option :'loader-opts', desc: 'Loader options', type: :hash, default: {}
17
- class_option :job, aliases: '-j', desc: 'Job configuration name (or filename)'
16
+ class_option :name, aliases: '-j', desc: 'Job configuration name'
18
17
 
19
18
  map run: :start # Thor doesn't like `run` as a command name
20
19
  desc "run", "Start a job"
20
+ option :log_level, desc: 'Log level (debug, info, warn, error, fatal)', default: 'info'
21
+ option :verbose, aliases: '-v', desc: 'Set log level to verbose', type: :boolean
22
+ option :dry_run, desc: 'Only run the extraction and transform steps, not the loading', type: :boolean
21
23
  long_desc <<-LONG_DESC
22
24
  This will run an ETL job. Each job needs three parts:
23
25
 
@@ -25,36 +27,37 @@ module Chronicle
25
27
 
26
28
  2. #{'Transformer'.underline}: transforms data into a new format. If none is specified, we use the `null` transformer which does nothing to the data.
27
29
 
28
- 3. #{'Loader'.underline}: takes that transformed data and loads it externally. This can be an API, flat files, (or by default), stdout.
30
+ 3. #{'Loader'.underline}: takes that transformed data and loads it externally. This can be an API, flat files, (or by default), stdout. With the --dry-run option, this step won't be run.
29
31
 
30
32
  If you do not want to use the command line flags, you can also configure a job with a .yml config file. You can either specify the path to this file or use the filename and place the file in ~/.config/chronicle/etl/jobs/NAME.yml and call it with `--job NAME`
31
33
  LONG_DESC
32
34
  # Run an ETL job
33
35
  def start
34
- runner_options = build_runner_options(options)
35
- runner = Chronicle::ETL::Runner.new(runner_options)
36
+ setup_log_level
37
+ job_definition = build_job_definition(options)
38
+ job = Chronicle::ETL::Job.new(job_definition)
39
+ runner = Chronicle::ETL::Runner.new(job)
36
40
  runner.run!
37
41
  end
38
42
 
39
43
  desc "create", "Create a job"
40
44
  # Create an ETL job
41
45
  def create
42
- runner_options = build_runner_options(options)
43
- path = File.join('chronicle', 'etl', 'jobs', options[:job])
44
- Chronicle::ETL::Config.write(path, runner_options)
46
+ job_definition = build_job_definition(options)
47
+ path = File.join('chronicle', 'etl', 'jobs', options[:name])
48
+ Chronicle::ETL::Config.write(path, job_definition.definition)
45
49
  end
46
50
 
47
51
  desc "show", "Show details about a job"
48
52
  # Show an ETL job
49
53
  def show
50
- runner_options = build_runner_options(options)
51
- pp runner_options
54
+ puts Chronicle::ETL::Job.new(build_job_definition(options))
52
55
  end
53
56
 
54
57
  desc "list", "List all available jobs"
55
58
  # List available ETL jobs
56
59
  def list
57
- jobs = Chronicle::ETL::Config.jobs
60
+ jobs = Chronicle::ETL::Config.available_jobs
58
61
 
59
62
  job_details = jobs.map do |job|
60
63
  r = Chronicle::ETL::Config.load("chronicle/etl/jobs/#{job}.yml")
@@ -74,34 +77,43 @@ LONG_DESC
74
77
 
75
78
  private
76
79
 
77
- # Create runner options by reading config file and then overwriting with flag options
78
- def build_runner_options options
79
- flag_options = process_flag_options(options)
80
- job_options = load_job(options[:job])
81
- flag_options.merge(job_options)
80
+ def setup_log_level
81
+ if options[:verbose]
82
+ Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::DEBUG
83
+ elsif options[:log_level]
84
+ level = Chronicle::ETL::Logger.const_get(options[:log_level].upcase)
85
+ Chronicle::ETL::Logger.log_level = level
86
+ end
87
+ end
88
+
89
+ # Create job definition by reading config file and then overwriting with flag options
90
+ def build_job_definition(options)
91
+ definition = Chronicle::ETL::JobDefinition.new
92
+ definition.add_config(load_job_config(options[:name]))
93
+ definition.add_config(process_flag_options(options))
94
+ definition
82
95
  end
83
96
 
84
- def load_job job
85
- yml_config = Chronicle::ETL::Config.load("chronicle/etl/jobs/#{job}.yml")
86
- # FIXME: use better trick to depely symbolize keys
87
- JSON.parse(yml_config.to_json, symbolize_names: true)
97
+ def load_job_config name
98
+ Chronicle::ETL::Config.load_job_from_config(name)
88
99
  end
89
100
 
90
101
  # Takes flag options and turns them into a runner config
91
102
  def process_flag_options options
92
103
  {
104
+ dry_run: options[:dry_run],
93
105
  extractor: {
94
106
  name: options[:extractor],
95
107
  options: options[:'extractor-opts']
96
- },
108
+ }.compact,
97
109
  transformer: {
98
110
  name: options[:transformer],
99
111
  options: options[:'transformer-opts']
100
- },
112
+ }.compact,
101
113
  loader: {
102
114
  name: options[:loader],
103
115
  options: options[:'loader-opts']
104
- }
116
+ }.compact
105
117
  }
106
118
  end
107
119
  end
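Note on the `.compact` calls in `process_flag_options`: the `default:` values were dropped from the `extractor`, `transformer`, and `loader` options, so an unset flag now arrives as `nil`. `Hash#compact` removes it, presumably so that it cannot overwrite a name coming from the job's config file. The sketch below assumes `add_config` performs some kind of deep merge; `deep_merge` is a hypothetical stand-in, since `job_definition.rb` is not shown in this part of the diff.

```ruby
# Hypothetical stand-in for whatever merging JobDefinition#add_config does.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Mirrors the shape built by process_flag_options (transformer omitted for brevity).
def flag_options(options)
  {
    dry_run: options[:dry_run], # top-level nil is kept; only nested hashes are compacted
    extractor: { name: options[:extractor], options: options[:'extractor-opts'] }.compact,
    loader: { name: options[:loader], options: options[:'loader-opts'] }.compact
  }
end

from_file  = { extractor: { name: 'csv', options: { filename: 'in.csv' } }, loader: { name: 'table' } }
from_flags = flag_options(loader: 'stdout', 'extractor-opts': {}, 'loader-opts': {})

# Config file first, flags second, matching the order in build_job_definition.
merged = deep_merge(from_file, from_flags)
puts merged.inspect
```

Under these assumptions the extractor name from the file survives because no `--extractor` flag was given, while `--loader stdout` replaces the file's `table`.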
data/lib/chronicle/etl/cli/main.rb CHANGED
@@ -22,6 +22,11 @@ module Chronicle
22
22
 
23
23
  # Entrypoint for the CLI
24
24
  def self.start(given_args = ARGV, config = {})
25
+ if given_args[0] == "--version"
26
+ puts "#{Chronicle::ETL::VERSION}"
27
+ exit
28
+ end
29
+
25
30
  if given_args.none?
26
31
  abort "No command entered or job specified. To see commands, run `chronicle-etl help`".red
27
32
  end
@@ -52,10 +57,10 @@ module Chronicle
52
57
  shell.say " $ chronicle-etl connectors:list"
53
58
  shell.say
54
59
  shell.say " Run a simple job:".italic.light_black
55
- shell.say " $ chronicle-etl jobs:start --extractor stdin --transformer null --loader stdout"
60
+ shell.say " $ chronicle-etl jobs:run --extractor stdin --transformer null --loader stdout"
56
61
  shell.say
57
62
  shell.say " Show full job options:".italic.light_black
58
- shell.say " $ chronicle-etl jobs help start"
63
+ shell.say " $ chronicle-etl jobs help run"
59
64
 
60
65
  list = []
61
66
 
@@ -72,6 +77,9 @@ module Chronicle
72
77
  shell.say "VERSION".bold
73
78
  shell.say " #{Chronicle::ETL::VERSION}"
74
79
  shell.say
80
+ shell.say " Display current version:".italic.light_black
81
+ shell.say " $ chronicle-etl --version"
82
+ shell.say
75
83
  shell.say "FULL DOCUMENTATION".bold
76
84
  shell.say " https://github.com/chronicle-app/chronicle-etl".blue
77
85
  shell.say
data/lib/chronicle/etl/config.rb CHANGED
@@ -4,15 +4,17 @@ module Chronicle
4
4
  module ETL
5
5
  # Utility methods to read, write, and access config files
6
6
  module Config
7
+ module_function
8
+
7
9
  # Loads a yml config file
8
- def self.load(path)
10
+ def load(path)
9
11
  config = Runcom::Config.new(path)
10
12
  # FIXME: hack to deeply symbolize keys
11
13
  JSON.parse(config.to_h.to_json, symbolize_names: true)
12
14
  end
13
15
 
14
16
  # Writes a hash as a yml config file
15
- def self.write(path, data)
17
+ def write(path, data)
16
18
  config = Runcom::Config.new(path)
17
19
  filename = config.all[0].to_s + '.yml'
18
20
  File.open(filename, 'w') do |f|
@@ -21,12 +23,31 @@ module Chronicle
21
23
  end
22
24
 
23
25
  # Returns all jobs available in ~/.config/chronicle/etl/jobs/*.yml
24
- def self.jobs
26
+ def available_jobs
25
27
  job_directory = Runcom::Config.new('chronicle/etl/jobs').current
26
28
  Dir.glob(File.join(job_directory, "*.yml")).map do |filename|
27
29
  File.basename(filename, ".*")
28
30
  end
29
31
  end
32
+
33
+ # Returns all available credentials available in ~/.config/chronicle/etl/credentials/*.yml
34
+ def available_credentials
35
+ job_directory = Runcom::Config.new('chronicle/etl/credentials').current
36
+ Dir.glob(File.join(job_directory, "*.yml")).map do |filename|
37
+ File.basename(filename, ".*")
38
+ end
39
+ end
40
+
41
+ # Load a job definition from job config directory
42
+ def load_job_from_config(job_name)
43
+ definition = self.load("chronicle/etl/jobs/#{job_name}.yml")
44
+ definition[:name] = job_name
45
+ definition
46
+ end
47
+
48
+ def load_credentials(name)
49
+ config = self.load("chronicle/etl/credentials/#{name}.yml")
50
+ end
30
51
  end
31
52
  end
32
53
  end
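Note on the switch from `def self.load` to `module_function`: methods defined after `module_function` are copied onto the module itself and made private as instance methods, so `Config.load(...)` keeps working for callers. A minimal sketch with a hypothetical `Settings` module (not part of the gem) shows why `self.load` inside `load_job_from_config` still resolves:

```ruby
# Hypothetical module illustrating module_function; not the gem's Config.
module Settings
  module_function

  def load(path)
    "loaded #{path}"
  end

  def load_job(name)
    # When called as Settings.load_job, `self` is the module,
    # so self.load reaches the module-level copy of #load.
    self.load("jobs/#{name}.yml")
  end
end

result = Settings.load_job('daily')
puts result
```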
data/lib/chronicle/etl/exceptions.rb CHANGED
@@ -2,6 +2,8 @@ module Chronicle
2
2
  module ETL
3
3
  class Error < StandardError; end;
4
4
 
5
+ class RunnerTypeError < Error; end
6
+
5
7
  class ConnectorNotAvailableError < Error
6
8
  def initialize(message, provider: nil, name: nil)
7
9
  super(message)
@@ -13,5 +15,16 @@ module Chronicle
13
15
 
14
16
  class ProviderNotAvailableError < ConnectorNotAvailableError; end
15
17
  class ProviderConnectorNotAvailableError < ConnectorNotAvailableError; end
18
+
19
+ class TransformationError < Error
20
+ attr_reader :transformation
21
+
22
+ def initialize(message=nil, transformation:)
23
+ super(message)
24
+ @transformation = transformation
25
+ end
26
+ end
27
+
28
+ class UntransformableRecordError < TransformationError; end
16
29
  end
17
30
  end
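Note on `TransformationError`: the required `transformation:` keyword lets a rescuer inspect what failed. What callers actually pass for it is not visible in this hunk; the sketch below passes the offending record purely for illustration, and the `Sketch` namespace and `transform` method are hypothetical.

```ruby
# Local copies of the error classes from the diff, under a hypothetical namespace.
module Sketch
  class Error < StandardError; end

  class TransformationError < Error
    attr_reader :transformation

    def initialize(message = nil, transformation:)
      super(message)
      @transformation = transformation
    end
  end

  class UntransformableRecordError < TransformationError; end
end

# Hypothetical transform step that rejects records without an :id.
def transform(record)
  unless record.key?(:id)
    raise Sketch::UntransformableRecordError.new('missing id', transformation: record)
  end

  record
end

skipped = []
[{ id: 1 }, { title: 'no id' }].each do |record|
  transform(record)
rescue Sketch::TransformationError => e
  # Rescuing the parent class also catches UntransformableRecordError.
  skipped << [e.message, e.transformation]
end

puts skipped.inspect
```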
data/lib/chronicle/etl/extraction.rb ADDED
@@ -0,0 +1,12 @@
1
+ module Chronicle
2
+ module ETL
3
+ class Extraction
4
+ attr_accessor :data, :meta
5
+
6
+ def initialize(data: {}, meta: {})
7
+ @data = data
8
+ @meta = meta
9
+ end
10
+ end
11
+ end
12
+ end
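Note on the new `Extraction` class: extractors now yield a small wrapper with `data` and `meta` instead of a bare hash, as the `CsvExtractor` change in this diff shows. A minimal sketch follows, using a local copy of the class; the `extract` method and the `row_number` meta key are hypothetical.

```ruby
# Local copy of the class added in the diff (namespace omitted).
class Extraction
  attr_accessor :data, :meta

  def initialize(data: {}, meta: {})
    @data = data
    @meta = meta
  end
end

# Hypothetical extractor body: wraps each row and attaches metadata.
def extract(rows)
  return enum_for(:extract, rows) unless block_given?

  rows.each_with_index do |row, index|
    yield Extraction.new(data: row, meta: { row_number: index })
  end
end

extractions = extract([{ 'name' => 'a' }, { 'name' => 'b' }]).to_a
puts extractions.map(&:data).inspect
```

Keyword defaults such as `data: {}` are evaluated on each call, so two `Extraction.new` instances do not share the same default hash.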
data/lib/chronicle/etl/extractors/csv_extractor.rb CHANGED
@@ -1,42 +1,48 @@
1
1
  require 'csv'
2
- class Chronicle::ETL::CsvExtractor < Chronicle::ETL::Extractor
3
- DEFAULT_OPTIONS = {
4
- headers: true,
5
- filename: $stdin
6
- }.freeze
7
-
8
- def initialize(options = {})
9
- super(DEFAULT_OPTIONS.merge(options))
10
- end
11
2
 
12
- def extract
13
- csv = initialize_csv
14
- csv.each do |row|
15
- result = row.to_h
16
- yield result
3
+ module Chronicle
4
+ module ETL
5
+ class CsvExtractor < Chronicle::ETL::Extractor
6
+ include Extractors::Helpers::FilesystemReader
7
+
8
+ register_connector do |r|
9
+ r.description = 'input as CSV'
10
+ end
11
+
12
+ DEFAULT_OPTIONS = {
13
+ headers: true,
14
+ filename: $stdin
15
+ }.freeze
16
+
17
+ def initialize(options = {})
18
+ super(DEFAULT_OPTIONS.merge(options))
19
+ end
20
+
21
+ def extract
22
+ csv = initialize_csv
23
+ csv.each do |row|
24
+ yield Chronicle::ETL::Extraction.new(data: row.to_h)
25
+ end
26
+ end
27
+
28
+ def results_count
29
+ CSV.read(@options[:filename], headers: @options[:headers]).count unless stdin?(@options[:filename])
30
+ end
31
+
32
+ private
33
+
34
+ def initialize_csv
35
+ headers = @options[:headers].is_a?(String) ? @options[:headers].split(',') : @options[:headers]
36
+
37
+ csv_options = {
38
+ headers: headers,
39
+ converters: :all
40
+ }
41
+
42
+ open_from_filesystem(filename: @options[:filename]) do |file|
43
+ return CSV.new(file, **csv_options)
44
+ end
45
+ end
17
46
  end
18
47
  end
19
-
20
- def results_count
21
- CSV.read(@options[:filename], headers: @options[:headers]).count if read_from_file?
22
- end
23
-
24
- private
25
-
26
- def initialize_csv
27
- headers = @options[:headers].is_a?(String) ? @options[:headers].split(',') : @options[:headers]
28
-
29
- csv_options = {
30
- headers: headers,
31
- header_converters: :symbol,
32
- converters: [:all]
33
- }
34
-
35
- stream = read_from_file? ? File.open(@options[:filename]) : @options[:filename]
36
- CSV.new(stream, **csv_options)
37
- end
38
-
39
- def read_from_file?
40
- @options[:filename] != $stdin
41
- end
42
48
  end
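Note on the CSV options: this change drops `header_converters: :symbol`, so with `headers: true` the row hashes are keyed by strings rather than symbols, while `converters: :all` still turns numeric fields into numbers. A minimal sketch using only the standard `csv` library and an in-memory sample (the data is made up):

```ruby
require 'csv'
require 'stringio'

# Made-up sample input standing in for a file or $stdin.
io = StringIO.new("id,name,score\n1,alice,9.5\n2,bob,7\n")

# Same options the new initialize_csv passes when headers is true.
csv = CSV.new(io, headers: true, converters: :all)

# Each row is a CSV::Row; to_h gives a plain hash keyed by the header strings.
rows = csv.map(&:to_h)
puts rows.inspect
```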
data/lib/chronicle/etl/extractors/extractor.rb CHANGED
@@ -4,7 +4,7 @@ module Chronicle
4
4
  module ETL
5
5
  # Abstract class representing an Extractor for an ETL job
6
6
  class Extractor
7
- extend Chronicle::ETL::Catalog
7
+ extend Chronicle::ETL::Registry::SelfRegistering
8
8
 
9
9
  # Construct a new instance of this extractor. Options are passed in from a Runner
10
10
  # == Paramters:
@@ -12,20 +12,41 @@ module Chronicle
12
12
  # Options for configuring this Extractor
13
13
  def initialize(options = {})
14
14
  @options = options.transform_keys!(&:to_sym)
15
+ sanitize_options
16
+ handle_continuation
15
17
  end
16
18
 
19
+ # Hook called before #extract. Useful for gathering data, initailizing proxies, etc
20
+ def prepare; end
21
+
22
+ # An optional method to calculate how many records there are to extract. Used primarily for
23
+ # building the progress bar
24
+ def results_count; end
25
+
17
26
  # Entrypoint for this Extractor. Called by a Runner. Expects a series of records to be yielded
18
27
  def extract
19
28
  raise NotImplementedError
20
29
  end
21
30
 
22
- # An optional method to calculate how many records there are to extract. Used primarily for
23
- # building the progress bar
24
- def results_count; end
31
+ private
32
+
33
+ def sanitize_options
34
+ @options[:load_since] = Time.parse(@options[:load_since]) if @options[:load_since] && @options[:load_since].is_a?(String)
35
+ @options[:load_until] = Time.parse(@options[:load_until]) if @options[:load_until] && @options[:load_until].is_a?(String)
36
+ end
37
+
38
+ def handle_continuation
39
+ return unless @options[:continuation]
40
+
41
+ @options[:load_since] = @options[:continuation].highest_timestamp if @options[:continuation].highest_timestamp
42
+ @options[:load_after_id] = @options[:continuation].last_id if @options[:continuation].last_id
43
+ end
25
44
  end
26
45
  end
27
46
  end
28
47
 
48
+ require_relative 'helpers/filesystem_reader'
29
49
  require_relative 'csv_extractor'
30
50
  require_relative 'file_extractor'
51
+ require_relative 'json_extractor'
31
52
  require_relative 'stdin_extractor'
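Note on `sanitize_options`: string values for `load_since` and `load_until` (for example from `--extractor-opts`) are parsed into `Time` objects, and values that are already `Time` pass through. The hunk relies on `Time.parse`, which lives in the `time` standard library; this sketch requires it explicitly because it cannot see where the gem loads it. The standalone method below is a simplified illustration, not the gem's private method.

```ruby
require 'time'

# Simplified standalone version: is_a?(String) already excludes nil,
# so the extra truthiness check from the diff is not needed here.
def sanitize_options(options)
  %i[load_since load_until].each do |key|
    options[key] = Time.parse(options[key]) if options[key].is_a?(String)
  end
  options
end

already_time = Time.utc(2021, 6, 1)
opts = sanitize_options(load_since: '2021-01-01 00:00:00 UTC', load_until: already_time, limit: 10)
puts opts.inspect
```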