cdmbl 0.7.2 → 0.8.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: cd1dcc251930032444c67fd600f933acc1d8a25f
-   data.tar.gz: 8c5df4dc3bc694e063961cb6e2a320e75bacf1d3
+   metadata.gz: 818b769be7195ae37538de0892e275545da57a44
+   data.tar.gz: 5a1a6c25b618ea2cd1d047e3ef3047e29e54db87
  SHA512:
-   metadata.gz: e8b1baf77f906f90cc8341d0af96e89afefeeb1b59d14961b9be5f3021f759d38d2d9b49bdb5e03f1211d926557f2cb4f4209fcc6e987be2352d1db7c6ad3431
-   data.tar.gz: 612b9147b36693f6dc742d2a6e46432aa8b5cc88165676bd5b5ef5a158faa33ce26e687dce837f20c6ae5e080ac130978e660d7c727fe4792d64e882648fdf50
+   metadata.gz: aba7fdaefa7ca9e8031b9ac42c9384ebd41ca7bff629d978715aa57cb416248c2b4f5e5e234cf223b61406ffa597591e1906584e4657ed39fc60a361641c7c1d
+   data.tar.gz: 7a1e280b31c57cdf98207481f02946bfdaa6abe2dbfebe36ae9722ea9b444da3dec939859d8db66ce01565b476b4e211ec65d679c707a1378c3f882d3fd9e9bd
data/.rubocop.yml ADDED
@@ -0,0 +1,4 @@
+ Layout/IndentationWidth:
+   # Number of spaces for each indentation level.
+   Width: 2
+   IgnoredPatterns: []
data/README.md CHANGED
@@ -4,7 +4,7 @@
 
  Use [Blacklight](https://github.com/projectblacklight/blacklight) as a front end for your CONTENTdm instance.
 
- At the moment, CDMBL consists only of a micro [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) system dedicated to extracting metadata records from a CONTENTdm instance (using the [CONTENTdm API gem](https://github.com/UMNLibraries/contentdm_api), transforming them into Solr documents, and loading them into Solr. After initially populating the entire index, CDMBL allows for selective harvesting for incremental Solr index updates.
+ At the moment, CDMBL consists only of a micro [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) system dedicated to extracting metadata records from a CONTENTdm instance (using the [CONTENTdm API gem](https://github.com/UMNLibraries/contentdm_api)), transforming them into Solr documents, and loading them into Solr.
 
  ## Installation
 
@@ -41,22 +41,24 @@ export export GEONAMES_USER="yourusernamehere"
 
  Run the ingester
 
- rake cdmbl:ingest[solr_url,oai_endpoint,cdm_endpoint,minimum_date]
+ rake cdmbl:batch[solr_url,oai_endpoint,cdm_endpoint,set_spec,batch_size,max_compounds]
 
  |Argument| Definition|
  |--:|---|
  |solr_url| The full URL to your Solr core instance (same as your blacklight.yml solr url)|
- |oai_endpoint| A URL to your OAI instance (e.g. http://reflections.mndigital.org/oai/oai.php) |
+ |oai_endpoint| A URL to your OAI instance (e.g. https://server16022.contentdm.oclc.org/oai/oai.php) |
  |cdm_endpoint| A URL to your CONTENTdm API endpoint (e.g. https://server16022.contentdm.oclc.org/dmwebservices/index.php) |
- |minimum_date| Date from which to [selectively harvest](https://www.openarchives.org/OAI/openarchivesprotocol.html#SelectiveHarvesting) identifiers from the OAI endpoint. These identifiers are used to determine which records to delete from your index and which records to request from the CONTENTdm API|
+ |set_spec| Selectively harvest from a single collection with [setSpec](http://www.openarchives.org/OAI/openarchivesprotocol.html#Set)|
+ |batch_size| The number of records to transform at a time. **Note**: the CONTENTdm API is requested during record transformation. This API can be sluggish, so we conservatively transform batches of ten records at a time to prevent timeouts.|
+ |max_compounds| CONTENTdm records with many compounds can take a long time to load from the CONTENTdm API, since multiple requests must be made to get the metadata for each child record of a parent compound object. For this reason, records with ten or more compound children are, by default, processed in batches of one. This setting allows you to override that threshold.|
 
  For example:
 
  ```ruby
- rake "cdmbl:ingest[http://solr:8983/solr/foo-bar-core, http://reflections.mndigital.org/oai/oai.php, https://server16022.contentdm.oclc.org/dmwebservices/index.php, 2015-01-01]"
+ rake "cdmbl:batch[http://solr:8983/solr/foo-bar-core, https://server16022.contentdm.oclc.org/oai/oai.php, https://server16022.contentdm.oclc.org/dmwebservices/index.php]"
  ```
 
- ### Custom Rake Task
+ ### Custom Rake Tasks
 
  You might also create your own rake task to run your modified field transformers:
 
@@ -64,14 +66,21 @@ You might also create your own rake task to run your modified field transformers
  require 'cdmbl'
 
  namespace :cdmbl do
-   desc 'Launch a background job to index metadata from CONTENTdm into Solr.'
-   task :ingest do
-     solr_config = { url: 'http://solr:8983/solr/foo-bar-core' }
-     etl_config = { oai_endpoint: 'http://reflections.mndigital.org/oai/oai.php',
-                    cdm_endpoint: 'https://server16022.contentdm.oclc.org/dmwebservices/index.php',
-                    field_mappings: my_field_mappings,
-                    minimum_date: '2016-09-01' }
-     CDMBL::ETLWorker.perform_async(solr_config, etl_config)
+   desc "ingest batches of records"
+   ##
+   # e.g. rake cdmbl:batch[10, my_set_spec]
+   task :batch, [:batch_size, :set_spec, :max_compounds] => :environment do |t, args|
+     solr_config = { url: 'http://solr:8983/solr/foo-bar-core' }
+     config =
+       {
+         oai_endpoint: 'http://cdm16022.contentdm.oclc.org/oai/oai.php',
+         cdm_endpoint: 'https://server16022.contentdm.oclc.org/dmwebservices/index.php',
+         set_spec: (args[:set_spec] != '""') ? args[:set_spec] : nil,
+         max_compounds: (args[:max_compounds]) ? args[:max_compounds] : 2,
+         batch_size: (args[:batch_size]) ? args[:batch_size] : 30,
+         solr_config: solr_config
+       }
+     CDMBL::ETLWorker.perform_async(config)
+   end
  end
  ```
  ### Your Own Custom Solr Field Mappings (see above code snippet)
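The 0.8.0 worker takes a single config hash rather than the old positional `(solr_config, etl_config, batch_size)` signature. A minimal sketch of assembling that hash, assuming the README's example endpoints and the rake task's defaults (the `'""'` check mirrors how rake passes an empty `set_spec` through):

```ruby
# Sketch of the single config hash the new ETLWorker expects.
# Endpoints are the README's example values; the batch_size and
# max_compounds defaults are assumptions taken from the rake task.
def etl_config(set_spec: '""', batch_size: nil, max_compounds: nil)
  {
    solr_config: { url: 'http://solr:8983/solr/foo-bar-core' },
    oai_endpoint: 'https://server16022.contentdm.oclc.org/oai/oai.php',
    cdm_endpoint: 'https://server16022.contentdm.oclc.org/dmwebservices/index.php',
    # rake hands an empty argument through as the literal string '""'
    set_spec: set_spec != '""' ? set_spec : nil,
    batch_size: batch_size || 10,
    max_compounds: max_compounds || 10
  }
end

config = etl_config
# CDMBL::ETLWorker.perform_async(config) would enqueue the job
```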
data/cdmbl.gemspec CHANGED
@@ -24,7 +24,7 @@ Gem::Specification.new do |spec|
  spec.add_dependency 'rsolr', '~> 2.0'
  # This gem generally wants to be in a rails app, but just to avoid adding
  # another external dependency for XML processing, we rely on activesupport's
- # Has.to_xml feature for testing and to allow this gem to function
+ # Has.to_jsonl feature for testing and to allow this gem to function
  # independently from a rails app
  spec.add_dependency 'activesupport', '>= 4.2'
 
data/lib/cdmbl/batch_deleter_worker.rb CHANGED
@@ -3,8 +3,7 @@ module CDMBL
    class BatchDeleterWorker
      include Sidekiq::Worker
      attr_reader :start, :prefix, :oai_url, :solr_url
-     attr_accessor :batch_deleter_klass, :oai_client, :solr_client
-     sidekiq_options :backtrace => true
+     attr_writer :batch_deleter_klass, :oai_client, :solr_client
      def perform(start = 0, prefix = '', oai_url = '', solr_url = '')
        @start = start
        @prefix = prefix
data/lib/cdmbl/compound_filter.rb ADDED
@@ -0,0 +1,45 @@
+ module CDMBL
+   # Takes a list of record id/collection data, uses CompoundLookup to
+   # identify records with large numbers of compounds, and sorts them
+   # into a large and a small heap
+   class CompoundFilter
+     attr_reader :record_ids,
+                 :max_compounds,
+                 :cdm_endpoint,
+                 :compound_lookup_klass
+     def initialize(record_ids: [],
+                    max_compounds: 10,
+                    cdm_endpoint: '',
+                    compound_lookup_klass: CompoundLookup)
+       @record_ids = record_ids
+       @max_compounds = max_compounds
+       @cdm_endpoint = cdm_endpoint
+       @compound_lookup_klass = compound_lookup_klass
+     end
+
+     def filter(large: true)
+       ids(records.select { |record| record[:large] == large })
+     end
+
+     private
+
+     def ids(records)
+       records.map { |record| record[:id] }
+     end
+
+     def records
+       @records ||= record_ids.map do |identifier|
+         {
+           large: count(*identifier) >= max_compounds,
+           id: identifier
+         }
+       end
+     end
+
+     def count(collection, id)
+       compound_lookup_klass.new(cdm_endpoint: cdm_endpoint,
+                                 collection: collection,
+                                 id: id).count
+     end
+   end
+ end
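The sorting that CompoundFilter performs can be pictured in isolation. A sketch with a stubbed compound-count table (the `[collection, id]` pairs and counts below are made up; in the gem, CompoundLookup fetches real counts from the CONTENTdm API):

```ruby
# Stubbed compound counts keyed by [collection, id] pairs.
COUNTS = {
  %w[coll1 10] => 0,   # simple record, no children
  %w[coll1 11] => 25,  # large compound object
  %w[coll2 7]  => 10   # exactly at the threshold counts as large
}.freeze

MAX_COMPOUNDS = 10

# Mirror of CompoundFilter#filter: select ids whose size class matches
def filter(record_ids, large:)
  record_ids.select { |rid| (COUNTS[rid] >= MAX_COMPOUNDS) == large }
end

small_ids = filter(COUNTS.keys, large: false) # later sliced into batches
large_ids = filter(COUNTS.keys, large: true)  # later processed one by one
```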
data/lib/cdmbl/compound_lookup.rb ADDED
@@ -0,0 +1,43 @@
+ module CDMBL
+   # Fetching the full metadata for compound records is expensive. This class
+   # lets us check how many compounds a CDM record has so that we know
+   # whether to batch it with other records or process it on its own
+   class CompoundLookup
+     attr_reader :cdm_endpoint,
+                 :collection,
+                 :id,
+                 :request_klass,
+                 :service_klass
+
+     def initialize(cdm_endpoint: '',
+                    collection: '',
+                    id: '',
+                    request_klass: CONTENTdmAPI::Request,
+                    service_klass: CONTENTdmAPI::Service)
+       @cdm_endpoint = cdm_endpoint
+       @collection = collection
+       @id = id
+       @request_klass = request_klass
+       @service_klass = service_klass
+     end
+
+     def count
+       page.respond_to?(:length) ? page.length : 0
+     end
+
+     private
+
+     def page
+       JSON.parse(request).fetch('page', [])
+     end
+
+     def service
+       @service ||= service_klass.new(function: 'dmGetCompoundObjectInfo',
+                                      params: [collection, id])
+     end
+
+     def request
+       @request ||= request_klass.new(base_url: cdm_endpoint,
+                                      service: service).fetch
+     end
+   end
+ end
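The `count` method above reduces to parsing a `dmGetCompoundObjectInfo` response and measuring its `page` array. A sketch against canned JSON (these response bodies are illustrative, not captured from a live CONTENTdm server):

```ruby
require 'json'

# Mirror of CompoundLookup#count: compound objects list their children
# under "page"; records that are not compound come back without one.
def compound_count(body)
  page = JSON.parse(body).fetch('page', [])
  page.respond_to?(:length) ? page.length : 0
end

# Illustrative payloads
compound_count('{"type":"Document","page":[{"pageptr":"101"},{"pageptr":"102"}]}')
compound_count('{"message":"is not compound"}')
```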
data/lib/cdmbl/default_solr.rb CHANGED
@@ -26,7 +26,6 @@ module CDMBL
 
    def add(records)
      connection.add records
-     connection.commit
    end
 
    def delete(ids)
data/lib/cdmbl/etl_worker.rb CHANGED
@@ -1,89 +1,140 @@
  require 'sidekiq'
  module CDMBL
+   # Extract records from OAI, delete records marked for deletion, sort the
+   # remaining records into "big and small" record piles based upon how many
+   # compounds a record has, chunk the small records into batches and the big
+   # records individually, and then send these records to a transformation worker
    class ETLWorker
      include Sidekiq::Worker
-
-     attr_reader :solr_config,
-                 :etl_config,
+     attr_reader :config,
+                 :solr_config,
+                 :cdm_endpoint,
+                 :oai_endpoint,
+                 :field_mappings,
+                 :resumption_token,
+                 :set_spec,
+                 :max_compounds,
                  :batch_size,
-                 :is_recursive,
-                 :identifiers,
-                 :deletables
-
-
-     def perform(solr_config,
-                 etl_config,
-                 batch_size = 5,
-                 is_recursive = true,
-                 identifiers = [],
-                 deletables = [])
-
-       @etl_config = etl_config.symbolize_keys
-       @solr_config = solr_config.symbolize_keys
-       @batch_size = batch_size.to_i
-       @is_recursive = is_recursive
-       @identifiers = identifiers
-       @deletables = deletables
-
-       if !identifiers.empty?
-         load!
+                 :is_recursive
+
+     attr_writer :compound_filter_klass,
+                 :extractor_klass,
+                 :etl_worker_klass,
+                 :load_worker_klass,
+                 :completed_callback_klass,
+                 :transform_worker_klass
+
+     def perform(config)
+       # Sidekiq stores params in JSON, so we can't inject dependencies. This
+       # results in the long set of arguments that follows. Otherwise, we'd
+       # simply inject the OAI request and extractor objects
+       @config = config
+       @solr_config = config.fetch('solr_config').symbolize_keys
+       @cdm_endpoint = config.fetch('cdm_endpoint')
+       @oai_endpoint = config.fetch('oai_endpoint')
+       @field_mappings = config.fetch('field_mappings', false)
+       @resumption_token = config.fetch('resumption_token', nil)
+       @set_spec = config.fetch('set_spec', nil)
+       @max_compounds = config.fetch('max_compounds', 10)
+       @batch_size = config.fetch('batch_size', 5).to_i
+       @is_recursive = config.fetch('is_recursive', true)
+       extract_batch!
+       next_batch!
+     end
+
+     # Because Sidekiq serializes params to JSON, we provide custom setters
+     # for dependencies (normally these would be default params in the
+     # constructor) so that they may be mocked and tested
+     def completed_callback_klass
+       @completed_callback_klass ||= CDMBL::CompletedCallback
+     end
+
+     def etl_worker_klass
+       @etl_worker_klass ||= ETLWorker
+     end
+
+     def compound_filter_klass
+       @compound_filter_klass ||= CompoundFilter
+     end
+
+     def extractor_klass
+       @extractor_klass ||= Extractor
+     end
+
+     def load_worker_klass
+       @load_worker_klass ||= LoadWorker
+     end
+
+     def transform_worker_klass
+       @transform_worker_klass ||= TransformWorker
+     end
+
+     # Recurse through OAI batches one at a time
+     def next_batch!
+       if next_resumption_token && is_recursive
+         etl_worker_klass.perform_async(next_config)
        else
-         ingest_batches!
-         if extraction.next_resumption_token && is_recursive
-           # Call the next batch of records
-           ETLWorker.perform_async(solr_config, next_etl_config, batch_size)
-         else
-           CDMBL::CompletedCallback.call!(solr_client)
-         end
+         completed_callback_klass.call!(solr_config)
        end
      end
 
      private
 
-     # Break down extractions into batches of IDs for ingestion
-     def ingest_batches!
-       sent_deleted = false
-       extraction.local_identifiers.each_slice(batch_size) do |ids|
-         delete_ids = (sent_deleted == false) ? extraction.deletable_ids : []
-         ETLWorker.perform_async(solr_config,
-                                 etl_config,
-                                 batch_size,
-                                 is_recursive,
-                                 ids,
-                                 delete_ids)
-         sent_deleted = true
-       end
+     # Extract an OAI response - a batch of records
+     def extract_batch!
+       # Delete records that OAI has marked for deletion
+       delete_deletables!
+       # Records with few compounds are processed in batches
+       transform_small_records!
+       # Large records are all transformed and loaded one by one to avoid
+       # timeouts
+       transform_large_records!
     end
 
-     def load!
-       CDMBL::LoaderNotification.call!(transformation.records, deletables)
-       etl_run.load!(deletables, transformation.records)
+     def next_config
+       config.merge(resumption_token: next_resumption_token)
      end
 
-     def transformation
-       @transformation ||= etl_run.transform(extraction.set_lookup, records)
+     def next_resumption_token
+       @next_resumption_token ||= extraction.next_resumption_token
      end
 
-     def records
-       identifiers.map do |identifier|
-         extraction.cdm_request(*identifier)
+     def transform_small_records!
+       compound_filter.filter(large: false).each_slice(batch_size) do |ids|
+         transform!(ids)
        end
      end
 
-     def extraction
-       @extraction ||= etl_run.extract
+     def transform_large_records!
+       compound_filter.filter(large: true).each do |id|
+         transform!([id])
+       end
      end
 
-     def etl_run
-       ETLRun.new(etl_config.merge(solr_client: solr_client))
+     def transform!(ids)
+       transform_worker_klass.perform_async(ids,
+                                            solr_config,
+                                            cdm_endpoint,
+                                            oai_endpoint,
+                                            field_mappings)
      end
 
-     def solr_client
-       @solr_client ||= CDMBL::Solr.new(solr_config)
+     def delete_deletables!
+       load_worker_klass.perform_async([], extraction.deletable_ids, solr_config)
      end
 
-     def next_etl_config
-       etl_config.merge(resumption_token: extraction.next_resumption_token)
+     def compound_filter
+       @compound_filter ||=
+         compound_filter_klass.new(record_ids: extraction.local_identifiers,
+                                   cdm_endpoint: cdm_endpoint,
+                                   max_compounds: max_compounds)
+     end
+
+     def extraction
+       @extraction ||=
+         extractor_klass.new(oai_endpoint: oai_endpoint,
+                             resumption_token: resumption_token,
+                             set_spec: set_spec)
      end
    end
- end
+ end
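`ETLWorker#next_batch!` above re-enqueues the worker with the next resumptionToken until the OAI feed is exhausted. A Sidekiq-free sketch of that recursion, driven by a made-up three-page feed:

```ruby
# Each token maps to the resumptionToken returned by that batch's OAI
# response; nil marks the final page. The token values are invented.
NEXT_TOKEN = { nil => 'page-2', 'page-2' => 'page-3', 'page-3' => nil }.freeze

# Mirror of the perform -> next_batch! loop: process a batch, then
# "enqueue" the next run with the new token, or stop and fire the callback.
def run_batches(config, processed = [])
  processed << config[:resumption_token]
  next_token = NEXT_TOKEN[config[:resumption_token]]
  if next_token && config[:is_recursive]
    run_batches(config.merge(resumption_token: next_token), processed)
  else
    processed # completed_callback_klass.call!(solr_config) fires here
  end
end
```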
data/lib/cdmbl/extract_worker.rb ADDED
@@ -0,0 +1,141 @@
+ # require 'sidekiq'
+ # module CDMBL
+ #   # Extract records from OAI, delete records marked for deletion, sort the
+ #   # remaining records into "big and small" record piles based upon how many
+ #   # compounds a record has, chunk the small records into batches and the big
+ #   # records individually, and then send these records to a transformation worker
+ #   class ExtractWorker
+ #     include Sidekiq::Worker
+ #
+ #     attr_reader :config,
+ #                 :solr_config,
+ #                 :cdm_endpoint,
+ #                 :oai_endpoint,
+ #                 :field_mappings,
+ #                 :resumption_token,
+ #                 :set_spec,
+ #                 :max_compounds,
+ #                 :batch_size,
+ #                 :is_recursive
+ #
+ #     attr_writer :compound_filter_klass,
+ #                 :extractor_klass,
+ #                 :extraction_worker_klass,
+ #                 :load_worker_klass,
+ #                 :completed_callback_klass,
+ #                 :transform_worker_klass
+ #
+ #     def perform(config)
+ #       # Sidekiq stores params in JSON, so we can't inject dependencies. This
+ #       # results in the long set of arguments that follows. Otherwise, we'd
+ #       # simply inject the OAI request and extractor objects
+ #       @config = config
+ #       @solr_config = config.fetch('solr_config').symbolize_keys
+ #       @cdm_endpoint = config.fetch('cdm_endpoint')
+ #       @oai_endpoint = config.fetch('oai_endpoint')
+ #       @field_mappings = config.fetch('field_mappings', [])
+ #       @resumption_token = config.fetch('resumption_token', nil)
+ #       @set_spec = config.fetch('set_spec', nil)
+ #       @max_compounds = config.fetch('max_compounds', 10)
+ #       @batch_size = config.fetch('batch_size', 5).to_i
+ #       @is_recursive = config.fetch('is_recursive', true)
+ #       extract_batch!
+ #       next_batch!
+ #     end
+ #
+ #     # Because Sidekiq serializes params to JSON, we provide custom setters
+ #     # for dependencies (normally these would be default params in the
+ #     # constructor) so that they may be mocked and tested
+ #     def completed_callback_klass
+ #       @completed_callback_klass ||= CDMBL::CompletedCallback
+ #     end
+ #
+ #     def extraction_worker_klass
+ #       @extraction_worker_klass ||= ExtractionWorker
+ #     end
+ #
+ #     def compound_filter_klass
+ #       @compound_filter_klass ||= CompoundFilter
+ #     end
+ #
+ #     def extractor_klass
+ #       @extractor_klass ||= Extractor
+ #     end
+ #
+ #     def load_worker_klass
+ #       @load_worker_klass ||= LoadWorker
+ #     end
+ #
+ #     def transform_worker_klass
+ #       @transform_worker_klass ||= TransformWorker
+ #     end
+ #
+ #     # Recurse through OAI batches one at a time
+ #     def next_batch!
+ #       if next_resumption_token && is_recursive
+ #         extraction_worker_klass.perform_async(next_config)
+ #       else
+ #         completed_callback_klass.call!(solr_config)
+ #       end
+ #     end
+ #
+ #     private
+ #
+ #     # Extract an OAI response - a batch of records
+ #     def extract_batch!
+ #       # Delete records that OAI has marked for deletion
+ #       delete_deletables!
+ #       # Records with few compounds are processed in batches
+ #       transform_small_records!
+ #       # Large records are all transformed and loaded one by one to avoid
+ #       # timeouts
+ #       transform_large_records!
+ #     end
+ #
+ #     def next_config
+ #       config.merge(resumption_token: next_resumption_token)
+ #     end
+ #
+ #     def next_resumption_token
+ #       @next_resumption_token ||= extraction.next_resumption_token
+ #     end
+ #
+ #     def transform_small_records!
+ #       compound_filter.filter(large: false).each_slice(batch_size) do |ids|
+ #         transform!(ids)
+ #       end
+ #     end
+ #
+ #     def transform_large_records!
+ #       compound_filter.filter(large: true).each do |id|
+ #         transform!([id])
+ #       end
+ #     end
+ #
+ #     def transform!(ids)
+ #       transform_worker_klass.perform_async(ids,
+ #                                            solr_config,
+ #                                            cdm_endpoint,
+ #                                            oai_endpoint,
+ #                                            field_mappings)
+ #     end
+ #
+ #     def delete_deletables!
+ #       load_worker_klass.perform_async([], extraction.deletable_ids, solr_config)
+ #     end
+ #
+ #     def compound_filter
+ #       @compound_filter ||=
+ #         compound_filter_klass.new(record_ids: extraction.local_identifiers,
+ #                                   cdm_endpoint: cdm_endpoint,
+ #                                   max_compounds: max_compounds)
+ #     end
+ #
+ #     def extraction
+ #       @extraction ||=
+ #         extractor_klass.new(oai_endpoint: oai_endpoint,
+ #                             resumption_token: resumption_token,
+ #                             set_spec: set_spec)
+ #     end
+ #   end
+ # end
data/lib/cdmbl/extractor.rb CHANGED
@@ -4,32 +4,27 @@ require 'hash_at_path'
  require 'forwardable'
 
  module CDMBL
-   # This extractor uses the SimpleGet extractor initially and then makes
-   # subsequent passes at the full ContentDM API with identifiers taken from
-   # the contentdm api
+   # Retrieve OAI records and sort them into add/updatables and deletables
    class Extractor
      extend ::Forwardable
      def_delegators :@oai_request, :sets, :identifiers
      attr_reader :oai_request,
-                 :cdm_item,
-                 :cdm_endpoint,
-                 :oai_set_lookup,
-                 :oai_filter
+                 :oai_request_klass,
+                 :oai_filter_klass,
+                 :oai_set_lookup_klass
 
-     def initialize(oai_request: OaiRequest.new,
-                    cdm_endpoint: '',
-                    oai_set_lookup: OAISetLookup,
-                    cdm_item: CONTENTdmAPI::Item,
-                    oai_filter: OAIFilter)
-       @oai_request = oai_request
-       @cdm_item = cdm_item
-       @cdm_endpoint = cdm_endpoint
-       @oai_set_lookup = oai_set_lookup
-       @oai_filter = oai_filter
-     end
-
-     def set_lookup
-       oai_set_lookup.new(oai_sets: sets).keyed
+     def initialize(oai_endpoint: '',
+                    resumption_token: nil,
+                    set_spec: nil,
+                    oai_request_klass: OaiRequest,
+                    oai_filter_klass: OAIFilter,
+                    oai_set_lookup_klass: OAISetLookup)
+       @oai_request_klass = oai_request_klass
+       @oai_filter_klass = oai_filter_klass
+       @oai_set_lookup_klass = oai_set_lookup_klass
+       @oai_request = oai_requester(oai_endpoint,
+                                    resumption_token,
+                                    set_spec)
      end
 
      def deletable_ids
@@ -44,16 +39,21 @@ module CDMBL
        oai_identifiers.at_path('OAI_PMH/ListIdentifiers/resumptionToken')
      end
 
-     # e.g. local_identifiers.map { |identifier| extractor.cdm_request(*identifier) }
-     def cdm_request(collection, id)
-       CDMBL::CdmNotification.call!(collection, id, cdm_endpoint)
-       cdm_item.new(base_url: cdm_endpoint, collection: collection, id: id).metadata
+     def oai_ids
+       oai_filter_klass.new(headers: oai_headers)
+     end
+
+     def set_lookup
+       oai_set_lookup_klass.new(oai_sets: sets).keyed
      end
 
      private
 
-     def oai_ids
-       oai_filter.new(headers: oai_headers)
+     def oai_requester(oai_endpoint, resumption_token, set_spec)
+       @oai_requester ||=
+         oai_request_klass.new(base_uri: oai_endpoint,
+                               resumption_token: resumption_token,
+                               set: set_spec)
      end
 
      # Get the local collection and id from an OAI namespaced identifier
@@ -67,7 +67,7 @@ module CDMBL
      end
 
      def oai_identifiers
-       identifiers
+       @oai_identifiers ||= identifiers
      end
    end
- end
+ end
data/lib/cdmbl/load_worker.rb ADDED
@@ -0,0 +1,35 @@
+ require 'sidekiq'
+ module CDMBL
+   # Load records into a Solr index
+   class LoadWorker
+     include Sidekiq::Worker
+     attr_reader :solr_config, :records, :deletables
+     attr_writer :loader_klass, :solr_klass
+     def perform(records = [], deletables = [], solr_config = {})
+       @solr_config = solr_config.symbolize_keys
+       @records = records
+       @deletables = deletables
+       load!
+     end
+
+     def loader_klass
+       @loader_klass ||= Loader
+     end
+
+     def solr_klass
+       @solr_klass ||= DefaultSolr
+     end
+
+     def load!
+       loader_klass.new(records: records,
+                        deletable_ids: deletables,
+                        solr_client: solr_client).load!
+     end
+
+     private
+
+     def solr_client
+       @solr_client ||= solr_klass.new(solr_config)
+     end
+   end
+ end
data/lib/cdmbl/oai_request.rb CHANGED
@@ -4,19 +4,16 @@ module CDMBL
    attr_reader :base_uri,
                :resumption_token,
                :client,
-               :from,
                :set,
                :identifier
    def initialize(base_uri: '',
-                  resumption_token: false,
-                  from: false,
-                  set: false,
+                  resumption_token: nil,
+                  set: nil,
                   identifier: '',
                   client: Net::HTTP)
      @base_uri = base_uri
      @resumption_token = resumption_token
      @client = client
-     @from = (from) ? "&from=#{from}" : ''
      @set = (set) ? "&set=#{set}" : ''
      @identifier = identifier
    end
@@ -32,7 +29,7 @@ module CDMBL
    private
 
    def first_batch_uri
-     "#{base_uri}?verb=ListIdentifiers&metadataPrefix=oai_dc#{from}#{set}"
+     "#{base_uri}?verb=ListIdentifiers&metadataPrefix=oai_dc#{set}"
    end
 
    def batch_uri
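With `from` gone, OaiRequest's first request URI is just the verb and metadata prefix plus an optional set. A sketch of the construction (the endpoint and setSpec values are placeholders):

```ruby
# Mirror of OaiRequest#first_batch_uri after the change: no `from`
# parameter, and `set` only appears when a setSpec was supplied.
def first_batch_uri(base_uri, set_spec = nil)
  set = set_spec ? "&set=#{set_spec}" : ''
  "#{base_uri}?verb=ListIdentifiers&metadataPrefix=oai_dc#{set}"
end

first_batch_uri('http://example.org/oai/oai.php', 'my_set')
```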
data/lib/cdmbl/tasks/etl.rake CHANGED
@@ -2,17 +2,37 @@ require 'cdmbl'
 
  namespace :cdmbl do
    desc 'Launch a background job to index metadata from CONTENTdm to Solr.'
-   task :ingest, [:solr_url, :oai_endpoint, :cdm_endpoint, :minimum_date, :batch_size, :set_spec] do |t, args|
-     solr_config = { url: args[:solr_url] }
-     etl_config = {
-       oai_endpoint: args[:oai_endpoint],
-       cdm_endpoint: args[:cdm_endpoint],
-       minimum_date: args[:minimum_date],
-       set_spec: args[:set_spec]
-     }
-     etl_config = (args[:resumption_token]) ? etl_config.merge(args[:resumption_token]) : etl_config
-     batch_size = (args[:batch_size]) ? args[:batch_size] : 10
-     CDMBL::ETLWorker.perform_async(solr_config, etl_config, batch_size, true)
+   task :batch, [
+     :solr_url,
+     :oai_endpoint,
+     :cdm_endpoint,
+     :set_spec,
+     :batch_size,
+     :max_compounds
+   ] do |t, args|
+     CDMBL::ETLWorker.perform_async(
+       solr_config: { url: args.fetch(:solr_url) },
+       oai_endpoint: args.fetch(:oai_endpoint),
+       cdm_endpoint: args.fetch(:cdm_endpoint),
+       set_spec: args[:set_spec] != '""' ? args[:set_spec] : nil,
+       batch_size: args.fetch(:batch_size, 10),
+       max_compounds: args.fetch(:max_compounds, 10)
+     )
    end
- end
 
+   desc 'Launch a background job to index a single record.'
+   task :record, [
+     :collection,
+     :id,
+     :solr_url,
+     :cdm_endpoint,
+     :oai_endpoint
+   ] do |t, args|
+     CDMBL::TransformWorker.perform_async(
+       [[args.fetch(:collection), args.fetch(:id)]],
+       { url: args.fetch(:solr_url) },
+       args.fetch(:cdm_endpoint),
+       args.fetch(:oai_endpoint)
+     )
+   end
+ end
data/lib/cdmbl/tasks/extract.rake ADDED
@@ -0,0 +1,9 @@
+ require 'cdmbl'
+
+ namespace :cdmbl do
+   desc 'Extract OAI results to the local file system.'
+   task :extract, [:oai_endpoint, :storage_dir] do |t, args|
+     CDMBL::OAIWorker.perform_async(args.fetch(:oai_endpoint), false, args.fetch(:storage_dir))
+   end
+ end
+
data/lib/cdmbl/transform_worker.rb ADDED
@@ -0,0 +1,93 @@
+ require 'sidekiq'
+ module CDMBL
+   class TransformWorker
+     include Sidekiq::Worker
+     attr_reader :identifiers,
+                 :solr_config,
+                 :cdm_endpoint,
+                 :oai_endpoint,
+                 :field_mappings
+
+     attr_writer :cdm_api_klass,
+                 :oai_request_klass,
+                 :oai_set_lookup_klass,
+                 :cdm_notification_klass,
+                 :load_worker_klass,
+                 :transformer_klass
+
+     def perform(identifiers,
+                 solr_config,
+                 cdm_endpoint,
+                 oai_endpoint,
+                 field_mappings)
+
+       @identifiers = identifiers
+       @solr_config = solr_config
+       @cdm_endpoint = cdm_endpoint
+       @oai_endpoint = oai_endpoint
+       @field_mappings = field_mappings
+
+       transform_and_load!
+     end
+
+     def oai_set_lookup_klass
+       @oai_set_lookup_klass ||= OAISetLookup
+     end
+
+     def oai_request_klass
+       @oai_request_klass ||= OaiRequest
+     end
+
+     def cdm_api_klass
+       @cdm_api_klass ||= CONTENTdmAPI::Item
+     end
+
+     def cdm_notification_klass
+       @cdm_notification_klass ||= CdmNotification
+     end
+
+     def transformer_klass
+       @transformer_klass ||= Transformer
+     end
+
+     def load_worker_klass
+       @load_worker_klass ||= LoadWorker
+     end
+
+     private
+
+     def transform_and_load!
+       load_worker_klass.perform_async(transformed_records, [], solr_config)
+     end
+
+     def transformed_records
+       @transformation ||=
+         transformer_klass.new(cdm_records: records,
+                               oai_sets: set_lookup,
+                               field_mappings: field_mappings).records
+     end
+
+     def set_lookup
+       oai_set_lookup_klass.new(oai_sets: sets).keyed
+     end
+
+     def records
+       identifiers.map do |identifier|
+         cdm_request(*identifier)
+       end
+     end
+
+     # e.g. local_identifiers.map { |identifier| extractor.cdm_request(*identifier) }
+     def cdm_request(collection, id)
+       cdm_notification_klass.call!(collection, id, cdm_endpoint)
+       cdm_api_klass.new(base_url: cdm_endpoint,
+                         collection: collection,
+                         id: id).metadata
+     end
+
+     def sets
+       @oai_request ||=
+         oai_request_klass.new(base_uri: oai_endpoint).sets
+     end
+   end
+ end
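TransformWorker receives identifiers as `[collection, id]` pairs and splats each pair into `cdm_request(collection, id)`. The destructuring in isolation (the pairs and the returned hash are placeholders for the real `CONTENTdmAPI::Item#metadata` call):

```ruby
# Stand-in for cdm_request: the splat expands each two-element
# [collection, id] pair into the two parameters.
def cdm_request(collection, id)
  { 'collection' => collection, 'id' => id } # real code returns CDM metadata
end

# Mirror of TransformWorker#records
def records(identifiers)
  identifiers.map { |identifier| cdm_request(*identifier) }
end

records([%w[collA 1], %w[collB 2]])
```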
data/lib/cdmbl/version.rb CHANGED
@@ -1,3 +1,3 @@
  module CDMBL
-   VERSION = "0.7.2"
+   VERSION = "0.8.0"
  end
data/lib/cdmbl.rb CHANGED
@@ -21,4 +21,8 @@ require 'cdmbl/oai_client'
  require 'cdmbl/oai_get_record'
  require 'cdmbl/oai_deletables'
  require 'cdmbl/batch_deleter'
- require 'cdmbl/batch_deleter_worker'
+ require 'cdmbl/batch_deleter_worker'
+ require 'cdmbl/compound_lookup'
+ require 'cdmbl/compound_filter'
+ require 'cdmbl/load_worker'
+ require 'cdmbl/transform_worker'
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: cdmbl
  version: !ruby/object:Gem::Version
-   version: 0.7.2
+   version: 0.8.0
  platform: ruby
  authors:
  - chadfennell
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2017-07-12 00:00:00.000000000 Z
+ date: 2017-08-01 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: hash_at_path
@@ -198,6 +198,7 @@ extensions: []
  extra_rdoc_files: []
  files:
  - ".gitignore"
+ - ".rubocop.yml"
  - ".travis.yml"
  - CODE_OF_CONDUCT.md
  - Gemfile
@@ -210,6 +211,8 @@ files:
  - lib/cdmbl.rb
  - lib/cdmbl/batch_deleter.rb
  - lib/cdmbl/batch_deleter_worker.rb
+ - lib/cdmbl/compound_filter.rb
+ - lib/cdmbl/compound_lookup.rb
  - lib/cdmbl/default_cdm_notification.rb
  - lib/cdmbl/default_completed_callback.rb
  - lib/cdmbl/default_loader_notification.rb
@@ -217,11 +220,13 @@ files:
  - lib/cdmbl/default_solr.rb
  - lib/cdmbl/etl_run.rb
  - lib/cdmbl/etl_worker.rb
+ - lib/cdmbl/extract_worker.rb
  - lib/cdmbl/extractor.rb
  - lib/cdmbl/field_formatter.rb
  - lib/cdmbl/field_transformer.rb
  - lib/cdmbl/formatters.rb
  - lib/cdmbl/hooks.rb
+ - lib/cdmbl/load_worker.rb
  - lib/cdmbl/loader.rb
  - lib/cdmbl/oai_client.rb
  - lib/cdmbl/oai_deletables.rb
@@ -233,6 +238,8 @@ files:
  - lib/cdmbl/record_transformer.rb
  - lib/cdmbl/tasks/delete.rake
  - lib/cdmbl/tasks/etl.rake
+ - lib/cdmbl/tasks/extract.rake
+ - lib/cdmbl/transform_worker.rb
  - lib/cdmbl/transformer.rb
  - lib/cdmbl/version.rb
  - travis.yml