preservation 0.4.2 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 26bdfccfba8fa2f79920f73a13effe5701fcafc1
- data.tar.gz: 73e1d0a7a0060f115f5cc90df76f5341092f8fc8
+ metadata.gz: 54db84bdb0bc782f05420b420200e78b9394a6af
+ data.tar.gz: a243b3e89cdf0fe830df9eea16639094d5854af1
  SHA512:
- metadata.gz: 2dbce48f44a040569acbfb7b5dc8082b48bf83e3495d2cbed13577b72a0ba38e0c20d4b459145797faac964bbc61d4ec069705d47d1068135a376e1a13c5328e
- data.tar.gz: 975a2b8424ddb540f5c187f8aee2b752f54bd2b81bafc04d4d5626bd471f5fc244717d28db477b9c5e9a085e2da0d8d44047af178531adcd699c8e030a863d30
+ metadata.gz: 51d73c2067b1d48c7ce8a5eff9659fd0cb0059e59850e6a0d80c9865a9080a9e718e839b0d83aa2bde790fbcdae18d2eb7f26480f7f72ae9a74084fc7f6975f1
+ data.tar.gz: b58bd774f4905d98fee7be08f4a99a1bca5fa3bd115f8136c7a524c92937757e1ddfea63308ed589c4aaa11174b73499d1b43d993fee660df95e1ab533998a6a
data/CHANGELOG.md CHANGED
@@ -4,8 +4,17 @@ This project adheres to [Semantic Versioning](http://semver.org/).
 
  ## Unreleased
 
+ ## 0.5.0 - 2017-05-23
+ ### Changed
+ - Transfer - created as ISO8601 date format.
+
+ ### Fixed
+ - Transfer - handling DOIs of related works for both datasets and publications.
+ - Transfer - handling missing DOIs of related works.
+
  ## 0.4.2 - 2017-05-18
  ### Fixed
+ - Transfer - presence check for DOI of a related work.
 
  ## 0.4.1 - 2016-09-30
  ### Fixed
data/README.md CHANGED
@@ -1,6 +1,9 @@
  # Preservation
 
- Extraction and Transformation for Loading by Archivematica's Automation Tools.
+ Extraction from the Pure Research Information System and transformation for
+ loading by Archivematica.
+
+ Includes transfer preparation, reporting and disk space management.
 
  ## Status
 
@@ -27,7 +30,9 @@ Or install it yourself as:
  ## Usage
 
  ### Configuration
- Configure Preservation. If ```log_path``` is omitted, logging (standard library) writes to STDOUT.
+
+ Configure Preservation. If ```log_path``` is omitted, logging (standard library)
+ writes to STDOUT.
 
  ```ruby
  Preservation.configure do |config|
@@ -37,24 +42,129 @@ Preservation.configure do |config|
  end
  ```
 
+ Create a hash for passing to a transfer.
+
+ ```ruby
+ # Pure host with authentication.
+ config = {
+   url: ENV['PURE_URL'],
+   username: ENV['PURE_USERNAME'],
+   password: ENV['PURE_PASSWORD']
+ }
+ ```
+
+ ```ruby
+ # Pure host without authentication.
+ config = {
+   url: ENV['PURE_URL']
+ }
+ ```
 
  ### Transfer
- Create a transfer using the Pure Research Information System as a data source.
+
+ Configure a transfer to retrieve data from a Pure host.
+
+ ```ruby
+ transfer = Preservation::Transfer::Dataset.new config
+ ```
+
+ #### Single
+
+ If necessary, fetch the metadata, prepare a directory in the ingest path and
+ populate it with the files and JSON description file.
+
+ ```ruby
+ transfer.prepare uuid: 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'
+ ```
+
+ #### Batch
+
+ For multiple Pure datasets, if necessary, fetch the metadata, prepare a
+ directory in the ingest path and populate it with the files and JSON description
+ file.
+
+ A maximum of 10 will be prepared using the doi_short directory naming scheme.
+ Each dataset will only be prepared if 20 days have elapsed since the metadata
+ record was last modified.
 
  ```ruby
- transfer = Preservation::Transfer::Pure.new base_url: ENV['PURE_BASE_URL'],
-                                             username: ENV['PURE_USERNAME'],
-                                             password: ENV['PURE_PASSWORD'],
-                                             basic_auth: true
+ transfer.prepare_batch max: 10,
+                        dir_scheme: :doi_short,
+                        delay: 20
  ```
 
- For a Pure dataset, if necessary, fetch the metadata, prepare
- a directory in the ingest path and populate it with the files and JSON description file.
+ #### Directory name
+
+ The following are permitted values for the dir_scheme parameter:
 
  ```ruby
- transfer.prepare_dataset uuid: 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'
+ :uuid_title
+ :title_uuid
+ :date_uuid_title
+ :date_title_uuid
+ :date_time_uuid
+ :date_time_title
+ :date_time_uuid_title
+ :date_time_title_uuid
+ :uuid
+ :doi
+ :doi_short
  ```
 
+ #### Load directory
+
+ A transfer-ready directory, with a name built according to the directory scheme
+ specified, in this case doi_short. This particular example has only one file
+ Ebola_data_Jun15.zip in the dataset.
+ ```
+ .
+ ├── 10.17635-lancaster-researchdata-6
+ │   ├── Ebola_data_Jun15.zip
+ │   └── metadata
+ │       └── metadata.json
+ ```
+
+ metadata.json:
+
+ ```json
+ [
+   {
+     "filename": "objects/Ebola_data_Jun15.zip",
+     "dc.title": "Ebolavirus evolution 2013-2015",
+     "dc.description": "Data used for analysis of selection and evolutionary rate in Zaire Ebolavirus variant Makona",
+     "dcterms.created": "2015-06-04",
+     "dcterms.available": "2015-06-04",
+     "dc.publisher": "Lancaster University",
+     "dc.identifier": "http://dx.doi.org/10.17635/lancaster/researchdata/6",
+     "dcterms.spatial": [
+       "Guinea, Sierra Leone, Liberia"
+     ],
+     "dc.creator": [
+       "Gatherer, Derek"
+     ],
+     "dc.contributor": [
+       "Robertson, David",
+       "Lovell, Simon"
+     ],
+     "dc.subject": [
+       "Ebolavirus",
+       "evolution",
+       "phylogenetics",
+       "virulence",
+       "Filoviridae",
+       "positive selection"
+     ],
+     "dcterms.license": "CC BY",
+     "dc.relation": [
+       "http://dx.doi.org/10.1136/ebmed-2014-110127",
+       "http://dx.doi.org/10.1099/vir.0.067199-0"
+     ]
+   }
+ ]
+ ```
+
+ ### Storage
+
  Free up disk space for completed transfers. Can be done at any time.
 
  ```ruby
@@ -62,13 +172,62 @@ Preservation::Storage.cleanup
  ```
 
  ### Report
+
  Can be used for scheduled monitoring of transfers.
 
  ```ruby
  Preservation::Report::Transfer.exception
  ```
 
- ## Documentation
- [API in YARD](http://www.rubydoc.info/gems/preservation)
-
- [Detailed usage in GitBook](https://aalbinclark.gitbooks.io/preservation)
+ Formatted as JSON:
+
+ ```json
+ {
+   "pending": {
+     "count": 3,
+     "data": [
+       {
+         "path": "10.17635-lancaster-researchdata-72",
+         "path_timestamp": "2016-09-29 12:08:58 +0100"
+       },
+       {
+         "path": "10.17635-lancaster-researchdata-74",
+         "path_timestamp": "2016-09-29 12:08:59 +0100"
+       },
+       {
+         "path": "10.17635-lancaster-researchdata-75",
+         "path_timestamp": "2016-09-29 12:09:00 +0100"
+       }
+     ]
+   },
+   "current": {
+     "path": "10.17635-lancaster-researchdata-90",
+     "unit_type": "ingest",
+     "status": "PROCESSING",
+     "current": 1,
+     "id": 91,
+     "uuid": "ebf048c3-0ca8-409c-94cf-ab3e5d97e901",
+     "path_timestamp": "2016-09-28 17:09:33 +0100"
+   },
+   "failed": {
+     "count": 0
+   },
+   "incomplete": {
+     "count": 1,
+     "data": [
+       {
+         "path": "10.17635-lancaster-researchdata-90",
+         "unit_type": "ingest",
+         "status": "PROCESSING",
+         "current": 1,
+         "id": 91,
+         "uuid": "ebf048c3-0ca8-409c-94cf-ab3e5d97e901",
+         "path_timestamp": "2016-09-28 17:09:33 +0100"
+       }
+     ]
+   },
+   "complete": {
+     "count": 78
+   }
+ }
+ ```
data/lib/preservation.rb CHANGED
@@ -8,11 +8,11 @@ require 'preservation/configuration'
  require 'preservation/report/database'
  require 'preservation/report/transfer'
  require 'preservation/conversion'
- require 'preservation/ingest'
  require 'preservation/builder'
  require 'preservation/storage'
  require 'preservation/temporal'
- require 'preservation/transfer/pure'
+ require 'preservation/transfer/base'
+ require 'preservation/transfer/dataset'
  require 'preservation/version'
 
  # Top level namespace
data/lib/preservation/builder.rb CHANGED
@@ -35,9 +35,9 @@ module Preservation
    # @param directory_name_scheme [Symbol]
    # @return [String]
    def self.build_directory_name(metadata_record, directory_name_scheme)
-     doi = metadata_record['doi']
-     uuid = metadata_record['uuid']
-     title = metadata_record['title'].strip.gsub(' ', '-').gsub('/', '-')
+     doi = metadata_record[:doi]
+     uuid = metadata_record[:uuid]
+     title = metadata_record[:title].strip.gsub(' ', '-').gsub('/', '-')
      time = Time.new
      date = time.strftime("%Y-%m-%d")
      time = time.strftime("%H:%M:%S")
@@ -63,12 +63,12 @@ module Preservation
      when :uuid
        uuid
      when :doi
-       if doi.empty?
+       if doi.nil? || doi.empty?
          return ''
        end
        doi.gsub('/', '-')
      when :doi_short
-       if doi.empty?
+       if doi.nil? || doi.empty?
          return ''
        end
        doi_short_to_remove = 'http://dx.doi.org/'
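
The change above swaps string keys for symbol keys and guards against a nil DOI. A minimal usage sketch (metadata values are hypothetical; `Preservation::Builder.build_directory_name` and the `:doi_short` scheme appear in the diff and README above):

```ruby
require 'preservation'

# Hypothetical metadata record, keyed by symbols as the new code expects.
metadata_record = {
  doi:   'http://dx.doi.org/10.17635/lancaster/researchdata/6',
  uuid:  'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx',
  title: 'Ebolavirus evolution 2013-2015'
}

# Expected to yield "10.17635-lancaster-researchdata-6", matching the README's
# load-directory example; a nil or empty DOI now returns '' instead of raising.
Preservation::Builder.build_directory_name(metadata_record, :doi_short)
```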
data/lib/preservation/report/database.rb CHANGED
@@ -13,8 +13,7 @@ module Preservation
      # @return [SQLite3::Database]
      def self.db_connection(db_path)
        if db_path.nil?
-         puts 'Missing db_path'
-         exit
+         raise 'Missing db_path'
        end
        @db ||= SQLite3::Database.new db_path
      end
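
Because a missing `db_path` now raises instead of exiting the process, callers can rescue and keep running. A minimal sketch, assuming the receiver is `Preservation::Report::Database` as the file path suggests:

```ruby
db_path = ENV['TRANSFER_DB_PATH'] # hypothetical variable; nil when unset

begin
  db = Preservation::Report::Database.db_connection(db_path)
rescue RuntimeError => e
  # raise 'Missing db_path' surfaces as a RuntimeError rather than killing the process.
  warn "Report database unavailable: #{e.message}"
end
```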
data/lib/preservation/temporal.rb CHANGED
@@ -6,13 +6,12 @@ module Preservation
 
    # time_to_preserve?
    #
-   # @param start_utc [String]
+   # @param start_utc [Time]
    # @param delay [Integer] days to wait (after start date) before preserving
    # @return [Boolean]
    def self.time_to_preserve?(start_utc, delay)
-     now = DateTime.now
-     start_datetime = DateTime.parse(start_utc)
-     days_since_start = (now - start_datetime).to_i # result in days
+     now = Time.now
+     days_since_start = (now - start_utc).to_i # result in days
      days_since_start >= delay ? true : false
    end
 
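`time_to_preserve?` now expects a `Time` rather than a `String`. One caveat: subtracting two `Time` objects yields seconds, not days (`DateTime` subtraction yielded days), so the `# result in days` comment only holds after dividing by 86,400. A sketch of the day-based check as documented (modification time is hypothetical):

```ruby
require 'time'

modified = Time.parse('2017-05-01 09:00:00 UTC') # hypothetical Pure modification time

# Time - Time is a Float number of seconds; convert before comparing with a delay in days.
days_since_start = ((Time.now - modified) / 86_400).to_i
days_since_start >= 20
```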
data/lib/preservation/transfer/base.rb ADDED
@@ -0,0 +1,42 @@
+ module Preservation
+
+   module Transfer
+
+     # Transfer base
+     #
+     class Base
+
+       attr_reader :logger
+
+       def initialize
+         setup_logger
+         check_ingest_path
+       end
+
+       private
+
+       def check_ingest_path
+         if Preservation.ingest_path.nil?
+           @logger.error 'Missing ingest path'
+           exit
+         end
+       end
+
+       def setup_logger
+         if @logger.nil?
+           if Preservation.log_path.nil?
+             @logger = Logger.new STDOUT
+           else
+             # Keep data for today and the past 20 days
+             @logger = Logger.new File.new(Preservation.log_path, 'a'), 20, 'daily'
+           end
+         end
+         @logger.level = Logger::INFO
+       end
+
+     end
+
+   end
+
+ end
+
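`Base` centralises the logger and ingest-path checks that every transfer relies on, honouring the configuration shown in the README. A minimal sketch (the `ingest_path`/`log_path` accessor names on the config object are assumptions, and both values are hypothetical):

```ruby
Preservation.configure do |config|
  config.ingest_path = '/mnt/archivematica/ingest'  # assumed accessor; hypothetical path
  config.log_path    = '/var/log/preservation.log'  # assumed accessor; omit to log to STDOUT
end

# Any subclass of Transfer::Base now logs to a daily-rotated file (20 days kept)
# and exits early if ingest_path is missing.
transfer = Preservation::Transfer::Dataset.new(url: ENV['PURE_URL'])
```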
data/lib/preservation/transfer/dataset.rb ADDED
@@ -0,0 +1,258 @@
+ module Preservation
+
+   # Transfer preparation
+   #
+   module Transfer
+
+     # Transfer preparation for dataset
+     #
+     class Dataset < Preservation::Transfer::Base
+
+       # @param config [Hash]
+       def initialize(config)
+         super()
+         @config = config
+       end
+
+       # For given uuid, if necessary, fetch the metadata,
+       # prepare a directory in the ingest path and populate it with the files and
+       # JSON description file.
+       #
+       # @param uuid [String] uuid to preserve
+       # @param dir_scheme [Symbol] how to make directory name
+       # @param delay [Integer] days to wait (after modification date) before preserving
+       # @return [Boolean] indicates presence of metadata description file
+       def prepare(uuid: nil,
+                   dir_scheme: :uuid,
+                   delay: 0)
+         success = false
+
+         if uuid.nil?
+           @logger.error 'Missing ' + uuid
+           exit
+         end
+         dir_base_path = Preservation.ingest_path
+
+         dataset_extractor = Puree::Extractor::Dataset.new @config
+         d = dataset_extractor.find uuid: uuid
+         if !d
+           @logger.error 'No metadata for ' + uuid
+           exit
+         end
+
+         metadata_record = {
+           doi: d.doi,
+           uuid: d.uuid,
+           title: d.title
+         }
+
+         # configurable to become more human-readable
+         dir_name = Preservation::Builder.build_directory_name(metadata_record, dir_scheme)
+
+         # continue only if dir_name is not empty (e.g. because there was no DOI)
+         # continue only if there is no DB entry
+         # continue only if the dataset has a DOI
+         # continue only if there are files for this resource
+         # continue only if it is time to preserve
+         if !dir_name.nil? &&
+            !dir_name.empty? &&
+            !Preservation::Report::Transfer.in_db?(dir_name) &&
+            d.doi &&
+            !d.files.empty? &&
+            Preservation::Temporal.time_to_preserve?(d.modified, delay)
+
+           dir_file_path = dir_base_path + '/' + dir_name
+           dir_metadata_path = dir_file_path + '/metadata/'
+           metadata_filename = dir_metadata_path + 'metadata.json'
+
+           # calculate total size of data files
+           download_storage_required = 0
+           d.files.each { |i| download_storage_required += i.size.to_i }
+
+           # do we have enough space in filesystem to fetch data files?
+           if Preservation::Storage.enough_storage_for_download? download_storage_required
+             # @logger.info 'Sufficient disk space for ' + dir_file_path
+           else
+             @logger.error 'Insufficient disk space to store files fetched from Pure. Skipping ' + dir_file_path
+           end
+
+           # has metadata file been created? if so, files and metadata are in place
+           # continue only if files not present in ingest location
+           if !File.size? metadata_filename
+
+             @logger.info 'Preparing ' + dir_name + ', Pure UUID ' + d.uuid
+
+             data = []
+             d.files.each do |f|
+               o = package_metadata d, f
+               data << o
+               wget_str = Preservation::Builder.build_wget @config[:username],
+                                                           @config[:password],
+                                                           f.url
+
+               Dir.mkdir(dir_file_path) if !Dir.exists?(dir_file_path)
+
+               # fetch the file
+               Dir.chdir(dir_file_path) do
+                 # puts 'Changing dir to ' + Dir.pwd
+                 # puts 'Size of ' + f.name + ' is ' + File.size(f.name).to_s
+                 if File.size?(f.name)
+                   # puts 'Should be deleting ' + f['name']
+                   File.delete(f.name)
+                 end
+                 # puts f.name + ' missing or empty'
+                 # puts wget_str
+                 `#{wget_str}`
+               end
+             end
+
+             Dir.mkdir(dir_metadata_path) if !Dir.exists?(dir_metadata_path)
+
+             pretty = JSON.pretty_generate( data, :indent => ' ')
+             # puts pretty
+             File.write(metadata_filename,pretty)
+             @logger.info 'Created ' + metadata_filename
+             success = true
+           else
+             @logger.info 'Skipping ' + dir_name + ', Pure UUID ' + d.uuid +
+                          ' because ' + metadata_filename + ' exists'
+           end
+         else
+           @logger.info 'Skipping ' + dir_name + ', Pure UUID ' + d.uuid
+         end
+         success
+       end
+
+       # For multiple datasets, if necessary, fetch the metadata,
+       # prepare a directory in the ingest path and populate it with the files and
+       # JSON description file.
+       #
+       # @param max [Integer] maximum to prepare, omit to set no maximum
+       # @param dir_scheme [Symbol] how to make directory name
+       # @param delay [Integer] days to wait (after modification date) before preserving
+       def prepare_batch(max: nil,
+                         dir_scheme: :uuid,
+                         delay: 30)
+         collection_extractor = Puree::Extractor::Collection.new config: @config,
+                                                                 resource: :dataset
+         count = collection_extractor.count
+
+         max = count if max.nil?
+
+         batch_size = 10
+         num_prepared = 0
+         0.step(count, batch_size) do |n|
+
+           dataset_collection = collection_extractor.find limit: batch_size,
+                                                          offset: n
+           dataset_collection.each do |dataset|
+             success = prepare uuid: dataset.uuid,
+                               dir_scheme: dir_scheme.to_sym,
+                               delay: delay
+
+             num_prepared += 1 if success
+             exit if num_prepared == max
+           end
+         end
+       end
+
+       private
+
+       def package_metadata(d, f)
+         o = {}
+         o['filename'] = 'objects/' + f.name
+         o['dc.title'] = d.title
+         if d.description
+           o['dc.description'] = d.description
+         end
+         o['dcterms.created'] = d.created.strftime("%F")
+         if d.available
+           o['dcterms.available'] = d.available.strftime("%F")
+         end
+         o['dc.publisher'] = d.publisher
+         if d.doi
+           o['dc.identifier'] = d.doi
+         end
+         if !d.spatial_places.empty?
+           o['dcterms.spatial'] = d.spatial_places
+         end
+
+         temporal = d.temporal
+         temporal_range = ''
+         if temporal
+           if temporal.start
+             temporal_range << temporal.start.strftime("%F")
+             if temporal.end
+               temporal_range << '/'
+               temporal_range << temporal.end.strftime("%F")
+             end
+             o['dcterms.temporal'] = temporal_range
+           end
+         end
+
+         creators = []
+         contributors = []
+         all_persons = []
+         all_persons << d.persons_internal
+         all_persons << d.persons_external
+         all_persons << d.persons_other
+         all_persons.each do |person_type|
+           person_type.each do |i|
+             name = i.name.last_first if i.name
+             if i.role == 'Creator'
+               creators << name if name
+             end
+             if i.role == 'Contributor'
+               contributors << name if name
+             end
+           end
+         end
+
+         o['dc.creator'] = creators
+         if !contributors.empty?
+           o['dc.contributor'] = contributors
+         end
+         keywords = []
+         d.keywords.each { |i|
+           keywords << i
+         }
+         if !keywords.empty?
+           o['dc.subject'] = keywords
+         end
+
+         o['dcterms.license'] = f.license.name if f.license
+         # o['dc.format'] = f.mime
+
+         related = []
+         publications = d.publications
+         publications.each do |i|
+           if i.type === 'Dataset'
+             extractor = Puree::Extractor::Dataset.new @config
+             dataset = extractor.find uuid: i.uuid
+             doi = dataset.doi
+             if doi
+               related << doi
+             end
+           end
+           if i.type === 'Publication'
+             extractor = Puree::Extractor::Publication.new @config
+             publication = extractor.find uuid: i.uuid
+             dois = publication.dois
+             if !dois.empty?
+               # Only one needed
+               related << dois[0]
+             end
+           end
+         end
+         if !related.empty?
+           o['dc.relation'] = related
+         end
+
+         o
+       end
+
+     end
+
+   end
+
+ end
data/lib/preservation/version.rb CHANGED
@@ -1,5 +1,5 @@
  module Preservation
    # Semantic version number
    #
-   VERSION = "0.4.2"
+   VERSION = "0.5.0"
  end
data/preservation.gemspec CHANGED
@@ -8,9 +8,9 @@ Gem::Specification.new do |spec|
    spec.version = Preservation::VERSION
    spec.authors = ["Adrian Albin-Clark"]
    spec.email = ["a.albin-clark@lancaster.ac.uk"]
-   spec.summary = %q{Extraction and Transformation for Loading by Archivematica's Automation Tools.}
-   spec.description = %q{Extraction and Transformation for Loading by Archivematica's Automation Tools. Includes transfer preparation, reporting and disk space management.}
-   spec.homepage = "https://aalbinclark.gitbooks.io/preservation"
+   spec.summary = %q{Extraction from the Pure Research Information System and transformation for
+     loading by Archivematica.}
+   spec.homepage = "https://github.com/lulibrary/preservation"
    spec.license = "MIT"
 
    spec.files = `git ls-files -z`.split("\x0")
@@ -21,6 +21,6 @@ Gem::Specification.new do |spec|
    spec.required_ruby_version = '~> 2.1'
 
    spec.add_runtime_dependency 'free_disk_space', '~> 1.0'
-   spec.add_runtime_dependency 'puree', '~> 0.19'
+   spec.add_runtime_dependency 'puree', '~> 1.3'
    spec.add_runtime_dependency 'sqlite3', '~> 1.3'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: preservation
  version: !ruby/object:Gem::Version
-   version: 0.4.2
+   version: 0.5.0
  platform: ruby
  authors:
  - Adrian Albin-Clark
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2017-05-18 00:00:00.000000000 Z
+ date: 2017-05-23 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: free_disk_space
@@ -30,14 +30,14 @@ dependencies:
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.19'
+         version: '1.3'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.19'
+         version: '1.3'
  - !ruby/object:Gem::Dependency
    name: sqlite3
    requirement: !ruby/object:Gem::Requirement
@@ -52,8 +52,7 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '1.3'
- description: Extraction and Transformation for Loading by Archivematica's Automation
-   Tools. Includes transfer preparation, reporting and disk space management.
+ description:
  email:
  - a.albin-clark@lancaster.ac.uk
  executables: []
@@ -71,15 +70,15 @@ files:
  - lib/preservation/builder.rb
  - lib/preservation/configuration.rb
  - lib/preservation/conversion.rb
- - lib/preservation/ingest.rb
  - lib/preservation/report/database.rb
  - lib/preservation/report/transfer.rb
  - lib/preservation/storage.rb
  - lib/preservation/temporal.rb
- - lib/preservation/transfer/pure.rb
+ - lib/preservation/transfer/base.rb
+ - lib/preservation/transfer/dataset.rb
  - lib/preservation/version.rb
  - preservation.gemspec
- homepage: https://aalbinclark.gitbooks.io/preservation
+ homepage: https://github.com/lulibrary/preservation
  licenses:
  - MIT
  metadata: {}
@@ -102,5 +101,6 @@ rubyforge_project:
  rubygems_version: 2.2.2
  signing_key:
  specification_version: 4
- summary: Extraction and Transformation for Loading by Archivematica's Automation Tools.
+ summary: Extraction from the Pure Research Information System and transformation for
+   loading by Archivematica.
  test_files: []
@@ -1,38 +0,0 @@
1
- module Preservation
2
-
3
- # Ingest
4
- #
5
- class Ingest
6
-
7
- attr_reader :logger
8
-
9
- def initialize
10
- setup_logger
11
- check_ingest_path
12
- end
13
-
14
- private
15
-
16
- def check_ingest_path
17
- if Preservation.ingest_path.nil?
18
- @logger.error 'Missing ingest path'
19
- exit
20
- end
21
- end
22
-
23
- def setup_logger
24
- if @logger.nil?
25
- if Preservation.log_path.nil?
26
- @logger = Logger.new STDOUT
27
- else
28
- # Keep data for today and the past 20 days
29
- @logger = Logger.new File.new(Preservation.log_path, 'a'), 20, 'daily'
30
- end
31
- end
32
- @logger.level = Logger::INFO
33
- end
34
-
35
- end
36
-
37
- end
38
-
@@ -1,259 +0,0 @@
1
- module Preservation
2
-
3
- # Transfer preparation
4
- #
5
- module Transfer
6
-
7
- # Transfer preparation for Pure
8
- #
9
- class Pure < Ingest
10
-
11
- # @param base_url [String]
12
- # @param username [String]
13
- # @param password [String]
14
- # @param basic_auth [Boolean]
15
- def initialize(base_url: nil, username: nil, password: nil, basic_auth: nil)
16
- super()
17
- @base_url = base_url
18
- @basic_auth = basic_auth
19
- if basic_auth === true
20
- @username = username
21
- @password = password
22
- end
23
- end
24
-
25
- # For given uuid, if necessary, fetch the metadata,
26
- # prepare a directory in the ingest path and populate it with the files and
27
- # JSON description file.
28
- #
29
- # @param uuid [String] uuid to preserve
30
- # @param dir_scheme [Symbol] how to make directory name
31
- # @param delay [Integer] days to wait (after modification date) before preserving
32
- # @return [Boolean] indicates presence of metadata description file
33
- def prepare_dataset(uuid: nil,
34
- dir_scheme: :uuid,
35
- delay: 0)
36
- success = false
37
-
38
- if uuid.nil?
39
- @logger.error 'Missing ' + uuid
40
- exit
41
- end
42
- dir_base_path = Preservation.ingest_path
43
-
44
- dataset = Puree::Dataset.new base_url: @base_url,
45
- username: @username,
46
- password: @password,
47
- basic_auth: @basic_auth
48
-
49
- dataset.find uuid: uuid
50
- d = dataset.metadata
51
- if d.empty?
52
- @logger.error 'No metadata for ' + uuid
53
- exit
54
- end
55
-
56
- # configurable to become more human-readable
57
- dir_name = Preservation::Builder.build_directory_name(d, dir_scheme)
58
-
59
- # continue only if dir_name is not empty (e.g. because there was no DOI)
60
- # continue only if there is no DB entry
61
- # continue only if the dataset has a DOI
62
- # continue only if there are files for this resource
63
- # continue only if it is time to preserve
64
- if !dir_name.nil? &&
65
- !dir_name.empty? &&
66
- !Preservation::Report::Transfer.in_db?(dir_name) &&
67
- !d['doi'].empty? &&
68
- !d['file'].empty? &&
69
- Preservation::Temporal.time_to_preserve?(d['modified'], delay)
70
-
71
- dir_file_path = dir_base_path + '/' + dir_name
72
- dir_metadata_path = dir_file_path + '/metadata/'
73
- metadata_filename = dir_metadata_path + 'metadata.json'
74
-
75
- # calculate total size of data files
76
- download_storage_required = 0
77
- d['file'].each { |i| download_storage_required += i['size'].to_i }
78
-
79
- # do we have enough space in filesystem to fetch data files?
80
- if Preservation::Storage.enough_storage_for_download? download_storage_required
81
- # @logger.info 'Sufficient disk space for ' + dir_file_path
82
- else
83
- @logger.error 'Insufficient disk space to store files fetched from Pure. Skipping ' + dir_file_path
84
- end
85
-
86
- # has metadata file been created? if so, files and metadata are in place
87
- # continue only if files not present in ingest location
88
- if !File.size? metadata_filename
89
-
90
- @logger.info 'Preparing ' + dir_name + ', Pure UUID ' + d['uuid']
91
-
92
- data = []
93
- d['file'].each do |f|
94
- o = package_dataset_metadata d, f
95
- data << o
96
- wget_str = Preservation::Builder.build_wget @username,
97
- @password,
98
- f['url']
99
-
100
- Dir.mkdir(dir_file_path) if !Dir.exists?(dir_file_path)
101
-
102
- # fetch the file
103
- Dir.chdir(dir_file_path) do
104
- # puts 'Changing dir to ' + Dir.pwd
105
- # puts 'Size of ' + f['name'] + ' is ' + File.size(f['name']).to_s
106
- if File.size?(f['name'])
107
- # puts 'Should be deleting ' + f['name']
108
- File.delete(f['name'])
109
- end
110
- # puts f['name'] + ' missing or empty'
111
- # puts wget_str
112
- `#{wget_str}`
113
- end
114
- end
115
-
116
- Dir.mkdir(dir_metadata_path) if !Dir.exists?(dir_metadata_path)
117
-
118
- pretty = JSON.pretty_generate( data, :indent => ' ')
119
- # puts pretty
120
- File.write(metadata_filename,pretty)
121
- @logger.info 'Created ' + metadata_filename
122
- success = true
123
- else
124
- @logger.info 'Skipping ' + dir_name + ', Pure UUID ' + d['uuid'] +
125
- ' because ' + metadata_filename + ' exists'
126
- end
127
- else
128
- @logger.info 'Skipping ' + dir_name + ', Pure UUID ' + d['uuid']
129
- end
130
- success
131
- end
132
-
133
- # For multiple datasets, if necessary, fetch the metadata,
134
- # prepare a directory in the ingest path and populate it with the files and
135
- # JSON description file.
136
- #
137
- # @param max [Integer] maximum to prepare, omit to set no maximum
138
- # @param dir_scheme [Symbol] how to make directory name
139
- # @param delay [Integer] days to wait (after modification date) before preserving
140
- def prepare_dataset_batch(max: nil,
141
- dir_scheme: :uuid,
142
- delay: 30)
143
- collection = Puree::Collection.new resource: :dataset,
144
- base_url: @base_url,
145
- username: @username,
146
- password: @password,
147
- basic_auth: @basic_auth
148
- count = collection.count
149
-
150
- max = count if max.nil?
151
-
152
- batch_size = 10
153
- num_prepared = 0
154
- 0.step(count, batch_size) do |n|
155
-
156
- minimal_metadata = collection.find limit: batch_size,
157
- offset: n,
158
- full: false
159
- uuids = []
160
- minimal_metadata.each do |i|
161
- uuids << i['uuid']
162
- end
163
-
164
- uuids.each do |uuid|
165
- success = prepare_dataset uuid: uuid,
166
- dir_scheme: dir_scheme.to_sym,
167
- delay: delay
168
-
169
- num_prepared += 1 if success
170
- exit if num_prepared == max
171
- end
172
- end
173
- end
174
-
175
- private
176
-
177
- def package_dataset_metadata(d, f)
178
- o = {}
179
- o['filename'] = 'objects/' + f['name']
180
- o['dc.title'] = d['title']
181
- if !d['description'].empty?
182
- o['dc.description'] = d['description']
183
- end
184
- o['dcterms.created'] = d['created']
185
- if !d['available']['year'].empty?
186
- o['dcterms.available'] = Puree::Date.iso(d['available'])
187
- end
188
- o['dc.publisher'] = d['publisher']
189
- if !d['doi'].empty?
190
- o['dc.identifier'] = d['doi']
191
- end
192
- if !d['spatial'].empty?
193
- o['dcterms.spatial'] = d['spatial']
194
- end
195
- if !d['temporal']['start']['year'].empty?
196
- temporal_range = ''
197
- temporal_range << Puree::Date.iso(d['temporal']['start'])
198
- if !d['temporal']['end']['year'].empty?
199
- temporal_range << '/'
200
- temporal_range << Puree::Date.iso(d['temporal']['end'])
201
- end
202
- o['dcterms.temporal'] = temporal_range
203
- end
204
- creators = []
205
- contributors = []
206
- person_types = %w(internal external other)
207
- person_types.each do |person_type|
208
- d['person'][person_type].each do |i|
209
- if i['role'] == 'Creator'
210
- creator = i['name']['last'] + ', ' + i['name']['first']
211
- creators << creator
212
- end
213
- if i['role'] == 'Contributor'
214
- contributor = i['name']['last'] + ', ' + i['name']['first']
215
- contributors << contributor
216
- end
217
- end
218
- end
219
- o['dc.creator'] = creators
220
- if !contributors.empty?
221
- o['dc.contributor'] = contributors
222
- end
223
- keywords = []
224
- d['keyword'].each { |i|
225
- keywords << i
226
- }
227
- if !keywords.empty?
228
- o['dc.subject'] = keywords
229
- end
230
- if !f['license']['name'].empty?
231
- o['dcterms.license'] = f['license']['name']
232
- end
233
- # o['dc.format'] = f['mime']
234
-
235
- related = []
236
- publications = d['publication']
237
- publications.each do |i|
238
- pub = Puree::Publication.new base_url: @base_url,
239
- username: @username,
240
- password: @password,
241
- basic_auth: @basic_auth
242
- pub.find uuid: i['uuid']
243
- doi = pub.doi
244
- if doi
245
- related << doi
246
- end
247
- end
248
- if !related.empty?
249
- o['dc.relation'] = related
250
- end
251
-
252
- o
253
- end
254
-
255
- end
256
-
257
- end
258
-
259
- end