chronicle-etl 0.4.0 → 0.4.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 5fd411a9a41a645b85780230c79b09f361e121d0e8ca7f3270ca8eba55a76ca8
-  data.tar.gz: c09053715910ab4f027fbdc3a5b7d10c042eee962f7fa93c6571ce8359f51009
+  metadata.gz: 8a267de435b41b579e36128b7392729ef499eb37f05fabaead7811f089938ddb
+  data.tar.gz: d4af2f62f3f5de926bdfbb0e3d6dbe2c952ec286c07317af4dca8d98f665d6da
 SHA512:
-  metadata.gz: 2c9ec14b6c0a51f1c5ec77ee8d9a7f016d16bdc35db5634f9fa5d38aabc30dec201cd4b8bef06a31b86773a0c1cda2d271d7008dcb247a86d956c094919f3c0f
-  data.tar.gz: 0dca41e1654e5b2b98a148f853492a67126cdac767000b3c5f97c5c8ff88b77464e17a2fab38b72c1f014f3515c911e5f3f391eaf68d64e73dcfcff5d8e6cb6a
+  metadata.gz: c78080cce008340f0b2795be46da2b5eb6562b2bffd97728150960343870f2bea4699e4efa07905710dd0e2eba7aaa1e803d8c0f727196f5d9d655b28a04f02e
+  data.tar.gz: cae3a3ffb6527f5c0b3ff89c75dc98d9cd66157ee6230c9db797f4683f90e2146daadf291108e55d3090d0120d3c9e25135cb21c4e9078bcaf4d1edf2172c930
.github/workflows/ruby.yml CHANGED
@@ -9,9 +9,9 @@ name: Ruby
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
   pull_request:
-    branches: [ master ]
+    branches: [ main ]
 
 jobs:
   test:
data/README.md CHANGED
@@ -1,125 +1,189 @@
-# Chronicle::ETL
+## A CLI toolkit for extracting and working with your digital history
 
 [![Gem Version](https://badge.fury.io/rb/chronicle-etl.svg)](https://badge.fury.io/rb/chronicle-etl) [![Ruby](https://github.com/chronicle-app/chronicle-etl/actions/workflows/ruby.yml/badge.svg)](https://github.com/chronicle-app/chronicle-etl/actions/workflows/ruby.yml)
 
-Chronicle ETL is a utility that helps you archive and processes personal data. You can *extract* it from a variety of sources, *transform* it, and *load* it to an external API, file, or stdout.
+Are you trying to archive your digital history or incorporate it into your own projects? You’ve probably discovered how frustrating it is to get machine-readable access to your own data. While [building a memex](https://hyfen.net/memex/), I learned first-hand what great efforts must be made before you can begin using the data in interesting ways.
 
-This tool is an adaptation of Andrew Louis's experimental [Memex project](https://hyfen.net/memex) and the dozens of existing importers are being migrated to Chronicle.
+If you don’t want to spend all your time writing scrapers, reverse-engineering APIs, or parsing takeout data, this project is for you! (*If you do enjoy these things, please see the [open issues](https://github.com/chronicle-app/chronicle-etl/issues).*)
 
-## Installation
+`chronicle-etl` is a CLI tool that gives you the ability to easily access your personal data. It uses the ETL pattern to **extract** it from a source (e.g. your local browser history, a directory of images, goodreads.com reading history), **transform** it (into a given schema), and **load** it to a source (e.g. a CSV file, JSON, external API).
 
-```bash
-$ gem install chronicle-etl
+## What does `chronicle-etl` give you?
+* **CLI tool for working with personal data**. You can monitor progress of exports, manipulate the output, set up recurring jobs, manage credentials, and more.
+* **Plugins for many third-party providers**. A plugin system allows you to access data from third-party providers and hook it into the shared CLI infrastructure.
+* **A common, opinionated schema**: You can normalize different datasets into a single schema so that, for example, all your iMessages and emails are stored in a common schema. Don’t want to use the schema? `chronicle-etl` always allows you to fall back on working with the raw extraction data.
+
+## Installation
+```sh
+# Install chronicle-etl
+gem install chronicle-etl
 ```
 
-## Usage
+After installation, the `chronicle-etl` command will be available in your shell. Homebrew support [is coming soon](https://github.com/chronicle-app/chronicle-etl/issues/13).
 
-After installing the gem, `chronicle-etl` is available to run in your shell.
+## Basic usage and running jobs
 
-```bash
-# read test.csv and display it as a table
-$ chronicle-etl jobs:run --extractor csv --extractor-opts filename:test.csv --loader table
+```sh
+# Display help
+$ chronicle-etl help
 
-# Display help for the jobs:run command
-$ chronicle-etl jobs help run
+# Basic job usage
+$ chronicle-etl --extractor NAME --transformer NAME --loader NAME
+
+# Read test.csv and display it to stdout as a table
+$ chronicle-etl --extractor csv --input ./data.csv --loader table
 ```
 
-## Connectors
+### Common options
+```sh
+Options:
+  -j, [--name=NAME]                     # Job configuration name
+  -e, [--extractor=EXTRACTOR-NAME]      # Extractor class. Default: stdin
+      [--extractor-opts=key:value]      # Extractor options
+  -t, [--transformer=TRANSFORMER-NAME]  # Transformer class. Default: null
+      [--transformer-opts=key:value]    # Transformer options
+  -l, [--loader=LOADER-NAME]            # Loader class. Default: stdout
+      [--loader-opts=key:value]         # Loader options
+  -i, [--input=FILENAME]                # Input filename or directory
+      [--since=DATE]                    # Load records SINCE this date. Overrides job's `load_since` configuration option in extractor's options
+      [--until=DATE]                    # Load records UNTIL this date
+      [--limit=N]                       # Only extract the first LIMIT records
+  -o, [--output=OUTPUT]                 # Output filename
+      [--fields=field1 field2 ...]      # Output only these fields
+      [--log-level=LOG_LEVEL]           # Log level (debug, info, warn, error, fatal)
+                                        # Default: info
+  -v, [--verbose], [--no-verbose]       # Set log level to verbose
+      [--silent], [--no-silent]         # Silence all output
+```
 
+## Connectors
 Connectors are available to read, process, and load data from different formats or external services.
 
-```bash
+```sh
 # List all available connectors
 $ chronicle-etl connectors:list
-
-# Install a connector
-$ chronicle-etl connectors:install imessage
 ```
 
-Built in connectors:
-
-### Extractors
-- `stdin` - (default) Load records from line-separated stdin
-- `csv`
-- `file` - load from a single file or directory (with a glob pattern)
-
-### Transformers
-- `null` - (default) Don't do anything
-
-### Loaders
-- `stdout` - (default) output records to stdout serialized as JSON
-- `csv` - Load records to a csv file
-- `rest` - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
-- `table` - Output an ascii table of records. Useful for debugging.
-
-### Provider-specific importers
-
-In addition to the built-in importers, importers for third-party platforms are available. They are packaged as individual Ruby gems.
+### Built-in Connectors
+`chronicle-etl` comes with several built-in connectors for common formats and sources.
 
-- [email](https://github.com/chronicle-app/chronicle-email). Extractors for `mbox` and other email files
-- [shell](https://github.com/chronicle-app/chronicle-shell). Extract shell history from Bash or Zsh`
-- [imessage](https://github.com/chronicle-app/chronicle-imessage). Extract iMessage messages from a local macOS installation
+#### Extractors
+- [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/csv_extractor.rb) - Load records from CSV files or stdin
+- [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/json_extractor.rb) - Load JSON (either [line-separated objects](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON) or one object)
+- [`file`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/file_extractor.rb) - load from a single file or directory (with a glob pattern)
 
-To install any of these, run `gem install chronicle-PROVIDER`.
+#### Transformers
+- [`null`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/transformers/null_transformer.rb) - (default) Don’t do anything and pass on raw extraction data
 
-If you don't want to use the available rubygem importers, `chronicle-etl` can use `stdin` as an Extractor source (newline separated records). You can also use `stdout` as a loader — transformed records will be outputted separated by newlines.
+#### Loaders
+- [`table`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/table_loader.rb) - (default) Output an ascii table of records. Useful for exploring data.
+- [`csv`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/extractors/csv_extractor.rb) - Load records to CSV
+- [`json`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/json_loader.rb) - Load records serialized as JSON
+- [`rest`](https://github.com/chronicle-app/chronicle-etl/blob/main/lib/chronicle/etl/loaders/rest_loader.rb) - Serialize records with [JSONAPI](https://jsonapi.org/) and send to a REST API
 
-I'll be open-sourcing more importers. Please [contact me](mailto:andrew@hyfen.net) to chat about what will be available!
-
-## Full commands
-
-```
-$ chronicle-etl help
-
-ALL COMMANDS
-  help                       # This help menu
-  connectors help [COMMAND]  # Describe subcommands or one specific subcommand
-  connectors:install NAME    # Installs connector NAME
-  connectors:list            # Lists available connectors
-  jobs help [COMMAND]        # Describe subcommands or one specific subcommand
-  jobs:create                # Create a job
-  jobs:list                  # List all available jobs
-  jobs:run                   # Start a job
-  jobs:show                  # Show details about a job
-```
-
-### Running a job
+### Plugins
+Plugins provide access to data from third-party platforms, services, or formats.
 
+```bash
+# Install a plugin
+$ chronicle-etl connectors:install NAME
 ```
-Usage:
-  chronicle-etl jobs:run
 
-Options:
-  [--log-level=LOG_LEVEL]               # Log level (debug, info, warn, error, fatal)
-                                        # Default: info
-  -v, [--verbose], [--no-verbose]       # Set log level to verbose
-  [--dry-run], [--no-dry-run]           # Only run the extraction and transform steps, not the loading
-  -e, [--extractor=extractor-name]      # Extractor class. Default: stdin
-  [--extractor-opts=key:value]          # Extractor options
-  -t, [--transformer=transformer-name]  # Transformer class. Default: null
-  [--transformer-opts=key:value]        # Transformer options
-  -l, [--loader=loader-name]            # Loader class. Default: stdout
-  [--loader-opts=key:value]             # Loader options
-  -j, [--name=NAME]                     # Job configuration name
-
-
-Runs an ETL job
+A few dozen importers exist [in my Memex project](https://hyfen.net/memex/) and they’re being ported over to the Chronicle system. This table shows what’s available now and what’s coming. Rows are sorted in very rough order of priority.
+
+If you want to work together on a connector, please [get in touch](#get-in-touch)!
+
+| Name | Description | Availability |
+|------|-------------|--------------|
+| [imessage](https://github.com/chronicle-app/chronicle-imessage) | iMessage messages and attachments | Available |
+| [shell](https://github.com/chronicle-app/chronicle-shell) | Shell command history | Available (zsh support pending) |
+| [email](https://github.com/chronicle-app/chronicle-email) | Emails and attachments from IMAP or .mbox files | Available (imap support pending) |
+| [pinboard](https://github.com/chronicle-app/chronicle-email) | Bookmarks and tags | Available |
+| github | Github user and repo activity | In progress |
+| safari | Browser history from local sqlite db | Needs porting |
+| chrome | Browser history from local sqlite db | Needs porting |
+| whatsapp | Messaging history (via individual chat exports) or reverse-engineered local desktop install | Unstarted |
+| anki | Studying and card creation history | Needs porting |
+| facebook | Messaging and history posting via data export files | Needs porting |
+| twitter | History via API or export data files | Needs porting |
+| foursquare | Location history via API | Needs porting |
+| goodreads | Reading history via export csv (RIP goodreads API) | Needs porting |
+| lastfm | Listening history via API | Needs porting |
+| images | Process image files | Needs porting |
+| arc | Location history from synced icloud backup files | Needs porting |
+| firefox | Browser history from local sqlite db | Needs porting |
+| fitbit | Personal analytics via API | Needs porting |
+| git | Commit history on a repo | Needs porting |
+| google-calendar | Calendar events via API | Needs porting |
+| instagram | Posting and messaging history via export data | Needs porting |
+| shazam | Song tags via reverse-engineered API | Needs porting |
+| slack | Messaging history via API | Need rethinking |
+| strava | Activity history via API | Needs porting |
+| things | Task activity via local sqlite db | Needs porting |
+| bear | Note taking activity via local sqlite db | Needs porting |
+| youtube | Video activity via takeout data and API | Needs porting |
+
+### Writing your own connector
+
+Additional connectors are packaged as separate ruby gems. You can view the [iMessage plugin](https://github.com/chronicle-app/chronicle-imessage) for an example.
+
+If you want to load a custom connector without creating a gem, you can help by [completing this issue](https://github.com/chronicle-app/chronicle-etl/issues/23).
+
+If you want to work together on a connector, please [get in touch](#get-in-touch)!
+
+#### Sample custom Extractor class
+```ruby
+module Chronicle
+  module FooService
+    class FooExtractor < Chronicle::ETL::Extractor
+      register_connector do |r|
+        r.identifier = 'foo'
+        r.description = 'From foo.com'
+      end
+
+      setting :access_token, required: true
+
+      def prepare
+        @records = # load from somewhere
+      end
+
+      def extract
+        @records.each do |record|
+          yield Chronicle::ETL::Extraction.new(data: record.to_h)
+        end
+      end
+    end
+  end
+end
 ```
 
 ## Development
-
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
 
 To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
-## Contributing
+### Additional development commands
+```bash
+# run tests
+bundle exec rake spec
+
+# generate docs
+bundle exec rake yard
+
+# use Guard to run specs automatically
+bundle exec guard
+```
 
+## Get in touch
+- [@hyfen](https://twitter.com/hyfen) on Twitter
+- [@hyfen](https://github.com/hyfen) on Github
+- Email: andrew@hyfen.net
+
+## Contributing
 Bug reports and pull requests are welcome on GitHub at https://github.com/chronicle-app/chronicle-etl. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
 
 ## License
-
 The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
 
 ## Code of Conduct
-
-Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/chronicle-app/chronicle-etl/blob/master/CODE_OF_CONDUCT.md).
+Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/chronicle-app/chronicle-etl/blob/master/CODE_OF_CONDUCT.md).
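To complement the README's sample Extractor above, here is a minimal sketch of a matching custom Loader, modelled on the `JSONLoader` introduced later in this release (the start/load/finish lifecycle, `register_connector`, and `setting` shown there); the `FooService` module, `foo` identifier, and output filename are hypothetical.

```ruby
require 'json'

module Chronicle
  module FooService
    # Hypothetical loader: appends each record to a local file as JSON,
    # using the same lifecycle hooks as the built-in loaders.
    class FooLoader < Chronicle::ETL::Loader
      register_connector do |r|
        r.identifier = 'foo'
        r.description = 'to a local foo log'
      end

      setting :output, default: 'foo.log'

      def start
        @file = File.open(@config.output, 'w')
      end

      def load(record)
        @file.puts(record.to_h.to_json)
      end

      def finish
        @file.close
      end
    end
  end
end
```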
@@ -6,20 +6,20 @@ module Chronicle
    # CLI commands for working with ETL jobs
    class Jobs < SubcommandBase
      default_task "start"
-      namespace :jobs
+      namespace :jobs
 
      class_option :name, aliases: '-j', desc: 'Job configuration name'
 
-      class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'extractor-name'
+      class_option :extractor, aliases: '-e', desc: "Extractor class. Default: stdin", banner: 'NAME'
      class_option :'extractor-opts', desc: 'Extractor options', type: :hash, default: {}
-      class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'transformer-name'
+      class_option :transformer, aliases: '-t', desc: 'Transformer class. Default: null', banner: 'NAME'
      class_option :'transformer-opts', desc: 'Transformer options', type: :hash, default: {}
-      class_option :loader, aliases: '-l', desc: 'Loader class. Default: stdout', banner: 'loader-name'
+      class_option :loader, aliases: '-l', desc: 'Loader class. Default: table', banner: 'NAME'
      class_option :'loader-opts', desc: 'Loader options', type: :hash, default: {}
 
      # This is an array to deal with shell globbing
      class_option :input, aliases: '-i', desc: 'Input filename or directory', default: [], type: 'array', banner: 'FILENAME'
-      class_option :since, desc: "Load records SINCE this date. Overrides job's `load_since` configuration option in extractor's options", banner: 'DATE'
+      class_option :since, desc: "Load records SINCE this date", banner: 'DATE'
      class_option :until, desc: "Load records UNTIL this date", banner: 'DATE'
      class_option :limit, desc: "Only extract the first LIMIT records", banner: 'N'
 
@@ -28,6 +28,7 @@ module Chronicle
 
      class_option :log_level, desc: 'Log level (debug, info, warn, error, fatal)', default: 'info'
      class_option :verbose, aliases: '-v', desc: 'Set log level to verbose', type: :boolean
+      class_option :silent, desc: 'Silence all output', type: :boolean
 
      # Thor doesn't like `run` as a command name
      map run: :start
@@ -93,7 +94,9 @@ LONG_DESC
      private
 
      def setup_log_level
-        if options[:verbose]
+        if options[:silent]
+          Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::SILENT
+        elsif options[:verbose]
          Chronicle::ETL::Logger.log_level = Chronicle::ETL::Logger::DEBUG
        elsif options[:log_level]
          level = Chronicle::ETL::Logger.const_get(options[:log_level].upcase)
@@ -116,7 +119,7 @@ LONG_DESC
      # Takes flag options and turns them into a runner config
      def process_flag_options options
        extractor_options = options[:'extractor-opts'].merge({
-          filename: (options[:input] if options[:input].any?),
+          input: (options[:input] if options[:input].any?),
          since: options[:since],
          until: options[:until],
          limit: options[:limit],
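For orientation, a rough sketch of the extractor options this method now assembles from the CLI flags — the `input` key (an array, because of shell globbing) replaces the old `filename` key. The values below are purely illustrative.

```ruby
# Illustrative only: shape of the merged extractor options after this change.
extractor_options = {
  input: ['./data.csv'],   # from --input (nil when no files are given)
  since: '2022-01-01',     # from --since
  until: nil,              # from --until
  limit: 100               # from --limit
}
```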
@@ -89,6 +89,14 @@ module Chronicle
      value.to_s
    end
 
+    def coerce_boolean(value)
+      if value.is_a?(String)
+        value.downcase == "true"
+      else
+        value
+      end
+    end
+
    def coerce_time(value)
      # TODO: handle durations like '3h'
      if value.is_a?(String)
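The new `coerce_boolean` treats only the string `"true"` (case-insensitively) as true; other strings coerce to false and non-strings pass through unchanged. A quick illustration of the behaviour defined above:

```ruby
# Illustrative calls against the coercion shown above.
coerce_boolean("true")   # => true
coerce_boolean("TRUE")   # => true
coerce_boolean("false")  # => false
coerce_boolean(true)     # => true  (non-strings pass through)
coerce_boolean(nil)      # => nil
```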
@@ -1,8 +1,8 @@
 module Chronicle
   module ETL
-    class Error < StandardError; end;
+    class Error < StandardError; end
 
-    class ConfigurationError < Error; end;
+    class ConfigurationError < Error; end
 
     class RunnerTypeError < Error; end
 
@@ -18,6 +18,10 @@ module Chronicle
    class ProviderNotAvailableError < ConnectorNotAvailableError; end
    class ProviderConnectorNotAvailableError < ConnectorNotAvailableError; end
 
+    class ExtractionError < Error; end
+
+    class SerializationError < Error; end
+
    class TransformationError < Error
      attr_reader :transformation
 
@@ -3,39 +3,46 @@ require 'csv'
 module Chronicle
   module ETL
     class CSVExtractor < Chronicle::ETL::Extractor
-      include Extractors::Helpers::FilesystemReader
+      include Extractors::Helpers::InputReader
 
      register_connector do |r|
-        r.description = 'input as CSV'
+        r.description = 'CSV'
      end
 
      setting :headers, default: true
-      setting :filename, default: $stdin
+
+      def prepare
+        @csvs = prepare_sources
+      end
 
      def extract
-        csv = initialize_csv
-        csv.each do |row|
-          yield Chronicle::ETL::Extraction.new(data: row.to_h)
+        @csvs.each do |csv|
+          csv.read.each do |row|
+            yield Chronicle::ETL::Extraction.new(data: row.to_h)
+          end
        end
      end
 
      def results_count
-        CSV.read(@config.filename, headers: @config.headers).count unless stdin?(@config.filename)
+        @csvs.reduce(0) do |total_rows, csv|
+          row_count = csv.readlines.size
+          csv.rewind
+          total_rows + row_count
+        end
      end
 
      private
 
-      def initialize_csv
-        headers = @config.headers.is_a?(String) ? @config.headers.split(',') : @config.headers
-
-        csv_options = {
-          headers: headers,
-          converters: :all
-        }
-
-        open_from_filesystem(filename: @config.filename) do |file|
-          return CSV.new(file, **csv_options)
+      def prepare_sources
+        @csvs = []
+        read_input do |csv_data|
+          csv_options = {
+            headers: @config.headers.is_a?(String) ? @config.headers.split(',') : @config.headers,
+            converters: :all
+          }
+          @csvs << CSV.new(csv_data, **csv_options)
        end
+        @csvs
      end
    end
  end
@@ -7,11 +7,11 @@ module Chronicle
      extend Chronicle::ETL::Registry::SelfRegistering
      include Chronicle::ETL::Configurable
 
-      setting :since, type: :date
-      setting :until, type: :date
+      setting :since, type: :time
+      setting :until, type: :time
      setting :limit
      setting :load_after_id
-      setting :filename
+      setting :input
 
      # Construct a new instance of this extractor. Options are passed in from a Runner
      # == Parameters:
@@ -46,7 +46,7 @@ module Chronicle
   end
 end
 
-require_relative 'helpers/filesystem_reader'
+require_relative 'helpers/input_reader'
 require_relative 'csv_extractor'
 require_relative 'file_extractor'
 require_relative 'json_extractor'
@@ -2,35 +2,55 @@ require 'pathname'
 
 module Chronicle
   module ETL
+    # Return filenames that match a pattern in a directory
     class FileExtractor < Chronicle::ETL::Extractor
-      include Extractors::Helpers::FilesystemReader
 
      register_connector do |r|
        r.description = 'file or directory of files'
      end
 
-      # TODO: consolidate this with @config.filename
-      setting :dir_glob_pattern
+      setting :input, default: ['.']
+      setting :dir_glob_pattern, default: "**/*"
+      setting :larger_than
+      setting :smaller_than
+
+      def prepare
+        @pathnames = gather_files
+      end
 
      def extract
-        filenames.each do |filename|
-          yield Chronicle::ETL::Extraction.new(data: filename)
+        @pathnames.each do |pathname|
+          yield Chronicle::ETL::Extraction.new(data: pathname.to_path)
        end
      end
 
      def results_count
-        filenames.count
+        @pathnames.count
      end
 
      private
 
-      def filenames
-        @filenames ||= filenames_in_directory(
-          path: @config.filename,
-          dir_glob_pattern: @config.dir_glob_pattern,
-          load_since: @config.since,
-          load_until: @config.until
-        )
+      def gather_files
+        roots = [@config.input].flatten.map { |filename| Pathname.new(filename) }
+        raise(ExtractionError, "Input must exist") unless roots.all?(&:exist?)
+
+        directories, files = roots.partition(&:directory?)
+
+        directories.each do |directory|
+          files += Dir.glob(File.join(directory, @config.dir_glob_pattern)).map { |filename| Pathname.new(filename) }
+        end
+
+        files = files.uniq
+
+        files = files.keep_if { |f| (f.mtime > @config.since) } if @config.since
+        files = files.keep_if { |f| (f.mtime < @config.until) } if @config.until
+
+        # pass in file sizes in bytes
+        files = files.keep_if { |f| (f.size < @config.smaller_than) } if @config.smaller_than
+        files = files.keep_if { |f| (f.size > @config.larger_than) } if @config.larger_than
+
+        # # TODO: incorporate sort argument
+        files.sort_by(&:mtime)
      end
    end
  end
@@ -0,0 +1,76 @@
+require 'pathname'
+
+module Chronicle
+  module ETL
+    module Extractors
+      module Helpers
+        module InputReader
+          # Return an array of input filenames; converts a single string
+          # to an array if necessary
+          def filenames
+            [@config.input].flatten.map
+          end
+
+          # Filenames as an array of pathnames
+          def pathnames
+            filenames.map { |filename| Pathname.new(filename) }
+          end
+
+          # Whether we're reading from files
+          def read_from_files?
+            filenames.any?
+          end
+
+          # Whether we're reading input from stdin
+          def read_from_stdin?
+            !read_from_files? && $stdin.stat.pipe?
+          end
+
+          # Read input sources and yield each content
+          def read_input
+            if read_from_files?
+              pathnames.each do |pathname|
+                File.open(pathname) do |file|
+                  yield file.read, pathname.to_path
+                end
+              end
+            elsif read_from_stdin?
+              yield $stdin.read, $stdin
+            else
+              raise ExtractionError, "No input files or stdin provided"
+            end
+          end
+
+          # Read input sources line by line
+          def read_input_as_lines(&block)
+            if read_from_files?
+              lines_from_files(&block)
+            elsif read_from_stdin?
+              lines_from_stdin(&block)
+            else
+              raise ExtractionError, "No input files or stdin provided"
+            end
+          end
+
+          private
+
+          def lines_from_files(&block)
+            pathnames.each do |pathname|
+              File.open(pathname) do |file|
+                lines_from_io(file, &block)
+              end
+            end
+          end
+
+          def lines_from_stdin(&block)
+            lines_from_io($stdin, &block)
+          end
+
+          def lines_from_io(io, &block)
+            io.each_line(&block)
+          end
+        end
+      end
+    end
+  end
+end
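A sketch of how an extractor consumes this helper, following the pattern of the `CSVExtractor` and `JSONExtractor` changes elsewhere in this diff; the `LineCountExtractor` name and its record shape are made up for illustration.

```ruby
module Chronicle
  module ETL
    # Hypothetical extractor: counts lines in each input file (or stdin)
    # via the InputReader helper introduced above.
    class LineCountExtractor < Chronicle::ETL::Extractor
      include Extractors::Helpers::InputReader

      register_connector do |r|
        r.description = 'line counts of input files'
      end

      def prepare
        @counts = {}
        # read_input yields each source's full contents plus its name
        read_input do |contents, source|
          @counts[source.to_s] = contents.lines.count
        end
      end

      def extract
        @counts.each do |source, count|
          yield Chronicle::ETL::Extraction.new(data: { source: source, lines: count })
        end
      end

      def results_count
        @counts.size
      end
    end
  end
end
```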
@@ -1,35 +1,44 @@
 module Chronicle
   module ETL
-    class JsonExtractor < Chronicle::ETL::Extractor
-      include Extractors::Helpers::FilesystemReader
+    class JSONExtractor < Chronicle::ETL::Extractor
+      include Extractors::Helpers::InputReader
 
      register_connector do |r|
-        r.description = 'input as JSON'
+        r.description = 'JSON'
      end
 
-      setting :filename, default: $stdin
-      setting :jsonl, default: true
+      setting :jsonl, default: true, type: :boolean
 
-      def extract
+      def prepare
+        @jsons = []
        load_input do |input|
-          parsed_data = parse_data(input)
-          yield Chronicle::ETL::Extraction.new(data: parsed_data) if parsed_data
+          @jsons << parse_data(input)
+        end
+      end
+
+      def extract
+        @jsons.each do |json|
+          yield Chronicle::ETL::Extraction.new(data: json)
        end
      end
 
      def results_count
+        @jsons.count
      end
 
      private
 
      def parse_data data
        JSON.parse(data)
-      rescue JSON::ParserError => e
+      rescue JSON::ParserError
+        raise Chronicle::ETL::ExtractionError, "Could not parse JSON"
      end
 
-      def load_input
-        read_from_filesystem(filename: @options[:filename]) do |data|
-          yield data
+      def load_input(&block)
+        if @config.jsonl
+          read_input_as_lines(&block)
+        else
+          read_input(&block)
        end
      end
    end
@@ -14,7 +14,7 @@ module Chronicle
          options: {}
        },
        loader: {
-          name: 'stdout',
+          name: 'table',
          options: {}
        }
      }.freeze
@@ -0,0 +1,44 @@
+module Chronicle
+  module ETL
+    class JSONLoader < Chronicle::ETL::Loader
+      register_connector do |r|
+        r.description = 'json'
+      end
+
+      setting :serializer
+      setting :output, default: $stdout
+
+      def start
+        if @config.output == $stdout
+          @output = @config.output
+        else
+          @output = File.open(@config.output, "w")
+        end
+      end
+
+      def load(record)
+        serialized = serializer.serialize(record)
+
+        # When dealing with raw data, we can get improperly encoded strings
+        # (eg from sqlite database columns). We force conversion to UTF-8
+        # before converting into JSON
+        encoded = serialized.transform_values do |value|
+          next value unless value.is_a?(String)
+
+          value.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
+        end
+        @output.puts encoded.to_json
+      end
+
+      def finish
+        @output.close
+      end
+
+      private
+
+      def serializer
+        @config.serializer || Chronicle::ETL::RawSerializer
+      end
+    end
+  end
+end
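The encoding scrub in `load` is worth highlighting: string values that are not valid UTF-8 (for example, blobs read straight out of sqlite columns) are forced into UTF-8 before JSON serialization. A standalone illustration of the same idiom, with made-up data:

```ruby
require 'json'

# A value with an invalid UTF-8 byte, as might come from a raw extraction
record = { title: "caf\xE9", count: 3 }

encoded = record.transform_values do |value|
  next value unless value.is_a?(String)

  value.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end

puts encoded.to_json # => {"title":"caf?","count":3}
```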
@@ -30,6 +30,6 @@ module Chronicle
 end
 
 require_relative 'csv_loader'
+require_relative 'json_loader'
 require_relative 'rest_loader'
-require_relative 'stdout_loader'
 require_relative 'table_loader'
@@ -11,20 +11,19 @@ module Chronicle
 
      setting :fields_limit, default: nil
      setting :fields_exclude, default: ['lids', 'type']
-      setting :fields_include, default: []
+      setting :fields, default: []
      setting :truncate_values_at, default: 40
      setting :table_renderer, default: :basic
 
      def load(record)
-        @records ||= []
-        @records << record.to_h_flattened
+        records << record.to_h_flattened
      end
 
      def finish
-        return if @records.empty?
+        return if records.empty?
 
-        headers = build_headers(@records)
-        rows = build_rows(@records, headers)
+        headers = build_headers(records)
+        rows = build_rows(records, headers)
 
        @table = TTY::Table.new(header: headers, rows: rows)
        puts @table.render(
@@ -33,12 +32,16 @@ module Chronicle
        )
      end
 
+      def records
+        @records ||= []
+      end
+
      private
 
      def build_headers(records)
        headers =
-          if @config.fields_include.any?
-            Set[*@config.fields_include]
+          if @config.fields.any?
+            Set[*@config.fields]
          else
            # use all the keys of the flattened record hash
@@ -52,7 +55,7 @@
 
      def build_rows(records, headers)
        records.map do |record|
-          values = record.values_at(*headers).map{|value| value.to_s }
+          values = record.transform_keys(&:to_sym).values_at(*headers).map{|value| value.to_s }
 
          if @config.truncate_values_at
            values = values.map{ |value| value.truncate(@config.truncate_values_at) }
@@ -8,6 +8,7 @@ module Chronicle
    WARN = 2
    ERROR = 3
    FATAL = 4
+    SILENT = 5
 
    attr_accessor :log_level
 
@@ -5,6 +5,9 @@ module Chronicle
  module Models
    # Represents a record that's been transformed by a Transformer and
    # ready to be loaded. Loosely based on ActiveModel.
+    #
+    # @todo Experiment with just mixing in ActiveModel instead of this
+    #   this reimplementation
    class Base
      ATTRIBUTES = [:provider, :provider_id, :lat, :lng, :metadata].freeze
      ASSOCIATIONS = [].freeze
@@ -5,13 +5,19 @@ module Chronicle
  module Models
    class Entity < Chronicle::ETL::Models::Base
      TYPE = 'entities'.freeze
-      ATTRIBUTES = [:title, :body, :represents, :slug, :myself, :metadata].freeze
+      ATTRIBUTES = [:title, :body, :provider_url, :represents, :slug, :myself, :metadata].freeze
+
+      # TODO: This desperately needs a validation system
      ASSOCIATIONS = [
+        :involvements, # inverse of activity's `involved`
+
        :attachments,
        :abouts,
+        :aboutables, # inverse of above
        :depicts,
        :consumers,
-        :contains
+        :contains,
+        :containers # inverse of above
      ].freeze # TODO: add these to reflect Chronicle Schema
 
      attr_accessor(*ATTRIBUTES, *ASSOCIATIONS)
@@ -0,0 +1,26 @@
+require 'chronicle/etl/models/base'
+
+module Chronicle
+  module ETL
+    module Models
+      # A record from an extraction with no processing or normalization applied
+      class Raw
+        TYPE = 'raw'
+
+        attr_accessor :raw_data
+
+        def initialize(raw_data)
+          @raw_data = raw_data
+        end
+
+        def to_h
+          @raw_data.to_h
+        end
+
+        def to_h_flattened
+          Chronicle::ETL::Utils::HashUtilities.flatten_hash(to_h)
+        end
+      end
+    end
+  end
+end
@@ -28,9 +28,10 @@ class Chronicle::ETL::Runner
      transformer = @job.instantiate_transformer(extraction)
      record = transformer.transform
 
-      unless record.is_a?(Chronicle::ETL::Models::Base)
-        raise Chronicle::ETL::RunnerTypeError, "Transformed data should be a type of Chronicle::ETL::Models"
-      end
+      # TODO: rethink this
+      # unless record.is_a?(Chronicle::ETL::Models)
+      #   raise Chronicle::ETL::RunnerTypeError, "Transformed data should be a type of Chronicle::ETL::Models"
+      # end
 
      Chronicle::ETL::Logger.info(tty_log_transformation(transformer))
      @job_logger.log_transformation(transformer)
@@ -52,7 +53,7 @@ class Chronicle::ETL::Runner
    raise e
  ensure
    @job_logger.save
-    @progress_bar.finish
+    @progress_bar&.finish
    Chronicle::ETL::Logger.detach_from_progress_bar
    Chronicle::ETL::Logger.info(tty_log_completion)
  end
@@ -1,6 +1,12 @@
 module Chronicle
   module ETL
     class JSONAPISerializer < Chronicle::ETL::Serializer
+      def initialize(*args)
+        super
+
+        raise(SerializationError, "Record must be a subclass of Chronicle::ETL::Model::Base") unless @record.is_a?(Chronicle::ETL::Models::Base)
+      end
+
      def serializable_hash
        @record
          .identifier_hash
@@ -0,0 +1,10 @@
+module Chronicle
+  module ETL
+    # Take a Raw model and output `raw_data` as a hash
+    class RawSerializer < Chronicle::ETL::Serializer
+      def serializable_hash
+        @record.to_h
+      end
+    end
+  end
+end
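Putting the pieces together: `NullTransformer` now emits `Models::Raw`, `RawSerializer` turns it back into a plain hash, and `JSONLoader` defaults to that serializer. A small sketch — the data values are made up, and the `Serializer.new(record)` form mirrors the usage in the removed `StdoutLoader` further below:

```ruby
# Illustrative: a Raw record passed through the new RawSerializer.
raw = Chronicle::ETL::Models::Raw.new({ 'verb' => 'watched', 'title' => 'Arrival' })

serializer = Chronicle::ETL::RawSerializer.new(raw)
serializer.serializable_hash  # => { 'verb' => 'watched', 'title' => 'Arrival' }

raw.to_h_flattened            # nested hashes flattened via Utils::HashUtilities.flatten_hash
```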
@@ -24,4 +24,5 @@ module Chronicle
  end
 end
 
-require_relative 'jsonapi_serializer'
+require_relative 'jsonapi_serializer'
+require_relative 'raw_serializer'
@@ -7,7 +7,7 @@ module Chronicle
      end
 
      def transform
-        Chronicle::ETL::Models::Generic.new(@extraction.data)
+        Chronicle::ETL::Models::Raw.new(@extraction.data)
      end
 
      def timestamp; end
@@ -1,5 +1,5 @@
 module Chronicle
   module ETL
-    VERSION = "0.4.0"
+    VERSION = "0.4.1"
   end
 end
data/lib/chronicle/etl.rb CHANGED
@@ -3,23 +3,30 @@ require_relative 'etl/config'
 require_relative 'etl/configurable'
 require_relative 'etl/exceptions'
 require_relative 'etl/extraction'
-require_relative 'etl/extractors/extractor'
 require_relative 'etl/job_definition'
 require_relative 'etl/job_log'
 require_relative 'etl/job_logger'
 require_relative 'etl/job'
-require_relative 'etl/loaders/loader'
 require_relative 'etl/logger'
 require_relative 'etl/models/activity'
 require_relative 'etl/models/attachment'
 require_relative 'etl/models/base'
+require_relative 'etl/models/raw'
 require_relative 'etl/models/entity'
-require_relative 'etl/models/generic'
 require_relative 'etl/runner'
 require_relative 'etl/serializers/serializer'
-require_relative 'etl/transformers/transformer'
 require_relative 'etl/utils/binary_attachments'
 require_relative 'etl/utils/hash_utilities'
 require_relative 'etl/utils/text_recognition'
 require_relative 'etl/utils/progress_bar'
 require_relative 'etl/version'
+
+require_relative 'etl/extractors/extractor'
+require_relative 'etl/loaders/loader'
+require_relative 'etl/transformers/transformer'
+
+begin
+  require 'pry'
+rescue LoadError
+  # Pry not available
+end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: chronicle-etl
 version: !ruby/object:Gem::Version
-  version: 0.4.0
+  version: 0.4.1
 platform: ruby
 authors:
 - Andrew Louis
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2022-02-25 00:00:00.000000000 Z
+date: 2022-03-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -328,7 +328,7 @@ files:
 - lib/chronicle/etl/extractors/csv_extractor.rb
 - lib/chronicle/etl/extractors/extractor.rb
 - lib/chronicle/etl/extractors/file_extractor.rb
-- lib/chronicle/etl/extractors/helpers/filesystem_reader.rb
+- lib/chronicle/etl/extractors/helpers/input_reader.rb
 - lib/chronicle/etl/extractors/json_extractor.rb
 - lib/chronicle/etl/extractors/stdin_extractor.rb
 - lib/chronicle/etl/job.rb
@@ -336,21 +336,22 @@ files:
 - lib/chronicle/etl/job_log.rb
 - lib/chronicle/etl/job_logger.rb
 - lib/chronicle/etl/loaders/csv_loader.rb
+- lib/chronicle/etl/loaders/json_loader.rb
 - lib/chronicle/etl/loaders/loader.rb
 - lib/chronicle/etl/loaders/rest_loader.rb
-- lib/chronicle/etl/loaders/stdout_loader.rb
 - lib/chronicle/etl/loaders/table_loader.rb
 - lib/chronicle/etl/logger.rb
 - lib/chronicle/etl/models/activity.rb
 - lib/chronicle/etl/models/attachment.rb
 - lib/chronicle/etl/models/base.rb
 - lib/chronicle/etl/models/entity.rb
-- lib/chronicle/etl/models/generic.rb
+- lib/chronicle/etl/models/raw.rb
 - lib/chronicle/etl/registry/connector_registration.rb
 - lib/chronicle/etl/registry/registry.rb
 - lib/chronicle/etl/registry/self_registering.rb
 - lib/chronicle/etl/runner.rb
 - lib/chronicle/etl/serializers/jsonapi_serializer.rb
+- lib/chronicle/etl/serializers/raw_serializer.rb
 - lib/chronicle/etl/serializers/serializer.rb
 - lib/chronicle/etl/transformers/image_file_transformer.rb
 - lib/chronicle/etl/transformers/null_transformer.rb
@@ -1,104 +0,0 @@
-require 'pathname'
-
-module Chronicle
-  module ETL
-    module Extractors
-      module Helpers
-        module FilesystemReader
-
-          def filenames_in_directory(...)
-            filenames = gather_files(...)
-            if block_given?
-              filenames.each do |filename|
-                yield filename
-              end
-            else
-              filenames
-            end
-          end
-
-          def read_from_filesystem(filename:, yield_each_line: true, dir_glob_pattern: '**/*')
-            open_files(filename: filename, dir_glob_pattern: dir_glob_pattern) do |file|
-              if yield_each_line
-                file.each_line do |line|
-                  yield line
-                end
-              else
-                yield file.read
-              end
-            end
-          end
-
-          def open_from_filesystem(filename:, dir_glob_pattern: '**/*')
-            open_files(filename: filename, dir_glob_pattern: dir_glob_pattern) do |file|
-              yield file
-            end
-          end
-
-          def results_count
-            raise NotImplementedError
-            # if file?
-            #   return 1
-            # else
-            #   search_pattern = File.join(@options[:filename], '**/*')
-            #   Dir.glob(search_pattern).count
-            # end
-          end
-
-          private
-
-          def gather_files(path:, dir_glob_pattern: '**/*', load_since: nil, load_until: nil, smaller_than: nil, larger_than: nil, sort: :mtime)
-            search_pattern = File.join(path, '**', dir_glob_pattern)
-            files = Dir.glob(search_pattern)
-
-            files = files.keep_if {|f| (File.mtime(f) > load_since)} if load_since
-            files = files.keep_if {|f| (File.mtime(f) < load_until)} if load_until
-
-            # pass in file sizes in bytes
-            files = files.keep_if {|f| (File.size(f) < smaller_than)} if smaller_than
-            files = files.keep_if {|f| (File.size(f) > larger_than)} if larger_than
-
-            # TODO: incorporate sort argument
-            files.sort_by{ |f| File.mtime(f) }
-          end
-
-          def select_files_in_directory(path:, dir_glob_pattern: '**/*')
-            raise IOError.new("#{path} is not a directory.") unless directory?(path)
-
-            search_pattern = File.join(path, dir_glob_pattern)
-            Dir.glob(search_pattern).each do |filename|
-              yield(filename)
-            end
-          end
-
-          def open_files(filename:, dir_glob_pattern:)
-            if stdin?(filename)
-              yield $stdin
-            elsif directory?(filename)
-              search_pattern = File.join(filename, dir_glob_pattern)
-              filenames = Dir.glob(search_pattern)
-              filenames.each do |filename|
-                file = File.open(filename)
-                yield(file)
-              end
-            elsif file?(filename)
-              yield File.open(filename)
-            end
-          end
-
-          def stdin?(filename)
-            filename == $stdin
-          end
-
-          def directory?(filename)
-            Pathname.new(filename).directory?
-          end
-
-          def file?(filename)
-            Pathname.new(filename).file?
-          end
-        end
-      end
-    end
-  end
-end
@@ -1,14 +0,0 @@
-module Chronicle
-  module ETL
-    class StdoutLoader < Chronicle::ETL::Loader
-      register_connector do |r|
-        r.description = 'stdout'
-      end
-
-      def load(record)
-        serializer = Chronicle::ETL::JSONAPISerializer.new(record)
-        puts serializer.serializable_hash.to_json
-      end
-    end
-  end
-end
@@ -1,23 +0,0 @@
-require 'chronicle/etl/models/base'
-
-module Chronicle
-  module ETL
-    module Models
-      class Generic < Chronicle::ETL::Models::Base
-        TYPE = 'generic'
-
-        attr_accessor :properties
-
-        def initialize(properties = {})
-          @properties = properties
-          super
-        end
-
-        # Generic models have arbitrary attributes stored in @properties
-        def attributes
-          @properties.transform_keys(&:to_sym)
-        end
-      end
-    end
-  end
-end