marc2linkeddata 0.1.3 → 0.1.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: c7a2edf3d7a7a2abc392d18d1ec431d1b17152c8
- data.tar.gz: 1ff8daa2a0924649f0f5da2868bb3a97e7360365
+ metadata.gz: f53bf2a4de09468d82acd96afa56333d7d12bf2c
+ data.tar.gz: bcc81582e5ad1b723c56d87213165fba35b7d55b
  SHA512:
- metadata.gz: 62ebd8cb6a8e7531e70cfc2915962523bf8428f018cea6352e21001452998c37b2cbba666ca55cf80f2a59f3634f2c38078cad00a78d20e1f4d4d8cf2b3d2a1a
- data.tar.gz: 44cf240042696b4f6f4833c2ab5a15f904aa856f6e3f7694e8488fb4016d45a8977f3ef1f1babd1463c00d73c6f2df286e520e025793d647bdcd3e6b0f996d3b
+ metadata.gz: af891eb4a3dbb3b9fbba41d1d900c5e57bcfd417b0ba602a6b0a537382d2984923fdbc40b0fcf2c193af2291a8c7ff88318ff51930e7b4b263a196279a464bb5
+ data.tar.gz: 29acc2ad5d26f2d68079d1e10b9b29ace1616c469a356946173dadfe17dd02ff81bebdc2620bbf7af8162a24a4d3387c83d98db3692def042deb44c1c0e8a86f
data/README.md CHANGED
@@ -13,16 +13,20 @@ copy the .env_example file provided into the current path.
  Without any HTTP retrieval of RDF metadata, using only data in a MARC record, it can
  translate 100,000 authority records in about 5-6 min on a current laptop system. The
  config options allow specification of MARC fields that may already contain resource links.
- With HTTP/RDF retrieval options enabled, it can take a lot longer (days) and the
- RDF providers may not be happy about a barrage of requests.
-
- File IO is the most expensive operation in the MARC-only mode (it helps to have a solid
- state drive with high IO performance). In the RDF-HTTP retrieval mode, it may help
- to enable threading for concurrent retrieval of RDF resources. However, it's still
+ With RDF retrieval options enabled, it can take a lot longer (days; and the
+ RDF providers may not be happy about a barrage of requests).
+
+ It may help to enable threading for concurrent processing. The concurrency is provided
+ by the ruby parallel gem, which can automatically optimise concurrent threads or processes.
+ It can process 100,000 authority records in under 2 min (without any RDF retrieval).
+
+ The processing involves substantial file IO and, when enabled, network IO also. With
+ about 100,000 records in a .mrc file, all records can be loaded into memory and processed
+ in a serial list or concurrently. With regard to file IO, it is the most expensive
+ operation in the MARC-only mode (it may help to use a drive with high IO performance).
+ In the RDF retrieval mode, where network IO becomes more important, it may help
+ to enable threads for concurrent retrieval of RDF resources. However, it's still
  relatively slow (exploring options for caching and local downloads of RDF data).
- Note that it runs a lot slower on jruby-9.0.0.0-pre1 than MRI 2.2.0, whether threads
- are enabled or not. It raises exceptions on jruby-1.7.9, related to ruby
- language support (such as Array#delete_if).

  The current output is to the file system, but it should be easy to incorporate
  and configure alternatives by using the RDF.rb facilities for connecting to a
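
The README text added in this hunk describes loading all records into memory and processing them "in a serial list or concurrently". The following is a minimal sketch of that choice, not the gem's implementation. The README names the ruby parallel gem as the concurrency provider; to stay self-contained, this sketch swaps in plain stdlib threads instead. The names `translate`, `process_records`, and `pool_size` are hypothetical, and the `urn:example:` output is a placeholder rather than real linked data.

```ruby
# Hypothetical stand-in for translating one MARC record into linked data.
# The real gem builds an RDF graph here; this only returns a placeholder string.
def translate(record)
  "<urn:example:#{record}> a <urn:example:Authority> ."
end

# Process an in-memory list of records either serially or with a small
# pool of stdlib threads. Results keep the order of the input list.
def process_records(records, threads: false, pool_size: 4)
  return records.map { |record| translate(record) } unless threads

  results = Array.new(records.size)
  queue = Queue.new
  records.each_with_index { |record, index| queue << [record, index] }

  workers = Array.new(pool_size) do
    Thread.new do
      begin
        loop do
          # Non-blocking pop raises ThreadError once the queue is empty.
          record, index = queue.pop(true)
          results[index] = translate(record)
        end
      rescue ThreadError
        # Queue drained; this worker is done.
      end
    end
  end
  workers.each(&:join)
  results
end

serial = process_records(%w[n001 n002 n003])
concurrent = process_records(%w[n001 n002 n003], threads: true)
puts serial == concurrent
```

Threads mainly help when the work waits on network IO, which matches the README's remark about the RDF retrieval mode; for CPU-bound translation on MRI, separate processes are usually the more effective option.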
@@ -38,8 +42,8 @@ TODO: A significant problem to solve is effective caching or mirrors for linked
  The retrieval should inspect any HTTP cache headers that might be available and
  adding PROVO to the linked-data graph generated for each record.

- TODO: Provide system platform options, to dockerize the application and make it easier
- for automatic horizontal scaling. Consider https://www.packer.io/intro/index.html
+ TODO: Provide system platform options, like docker, to package the application and
+ make it easier to scale out the processing. Consider https://www.packer.io/intro/index.html

  Optional Dependencies

@@ -86,6 +90,12 @@ Scripting
  # marcAuthority2LD [ authfile1.mrc .. authfileN.mrc ]
  marcAuthority2LD auth.mrc

+ # To provide one-off config values on the command line, set
+ # the environment variable first; e.g. the following turns off
+ # debug mode, processes 20 records from auth.mrc, using
+ # concurrent processing.
+ DEBUG=false TEST_RECORDS=20 THREADS=true marcAuthority2LD auth.mrc
+
  # Check the syntax of the output turtle files.
  touch turtle_syntax_checks.log
  for f in $(find ./auth_turtle/ -type f -name '*.ttl'); do
data/bin/marcAuthority2LD CHANGED
@@ -47,7 +47,7 @@ def marc_authority_records(marc_filename)
  auth_count += 1
  $stdout.printf "\b\b\b\b\b\b" if auth_count > 1
  $stdout.printf '%06d', auth_count
- break if auth_count >= CONFIG.test_records
+ break if (CONFIG.test_records > 0 && CONFIG.test_records <= auth_count)
  end
  rescue => e
  stack_trace(e, record)
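
The changed guard in this hunk only stops the loop when `test_records` is positive. With the old condition, a value of 0 would stop after the first record, because `auth_count >= 0` is always true. The sketch below isolates that logic under stated assumptions: `stop_after?`, `take_records`, and `test_records_limit` are hypothetical helpers, and the way `CONFIG` actually reads `TEST_RECORDS`, including any default value, is not shown in this diff.

```ruby
# Hypothetical parser for the TEST_RECORDS environment variable named in the
# README. Falling back to 0 (meaning "no limit") is an assumption of this sketch.
def test_records_limit(env = ENV)
  env.fetch('TEST_RECORDS', '0').to_i
end

# Same boolean condition as the new guard in bin/marcAuthority2LD.
def stop_after?(auth_count, test_records)
  test_records > 0 && test_records <= auth_count
end

# Collect records until the guard says stop; 0 processes everything.
def take_records(records, test_records)
  taken = []
  records.each do |record|
    taken << record
    break if stop_after?(taken.size, test_records)
  end
  taken
end

p take_records((1..5).to_a, 2) # limit of 2 keeps the first two records
p take_records((1..5).to_a, 0) # limit of 0 keeps all five records
```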
@@ -4,7 +4,7 @@ $:.unshift lib unless $:.include?(lib)

  Gem::Specification.new do |s|
  s.name = 'marc2linkeddata'
- s.version = '0.1.3'
+ s.version = '0.1.4'
  s.licenses = ['Apache-2.0']

  # mysql and bson_ext only install on MRI (c-ruby)
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: marc2linkeddata
  version: !ruby/object:Gem::Version
- version: 0.1.3
+ version: 0.1.4
  platform: ruby
  authors:
  - Darren Weber