marc2linkeddata 0.1.3 → 0.1.4
- checksums.yaml +4 -4
- data/README.md +21 -11
- data/bin/marcAuthority2LD +1 -1
- data/marc2linkeddata.gemspec +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: f53bf2a4de09468d82acd96afa56333d7d12bf2c
+data.tar.gz: bcc81582e5ad1b723c56d87213165fba35b7d55b
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: af891eb4a3dbb3b9fbba41d1d900c5e57bcfd417b0ba602a6b0a537382d2984923fdbc40b0fcf2c193af2291a8c7ff88318ff51930e7b4b263a196279a464bb5
+data.tar.gz: 29acc2ad5d26f2d68079d1e10b9b29ace1616c469a356946173dadfe17dd02ff81bebdc2620bbf7af8162a24a4d3387c83d98db3692def042deb44c1c0e8a86f
data/README.md
CHANGED
@@ -13,16 +13,20 @@ copy the .env_example file provided into the current path.
 Without any HTTP retrieval of RDF metadata, using only data in a MARC record, it can
 translate 100,000 authority records in about 5-6 min on a current laptop system. The
 config options allow specification of MARC fields that may already contain resource links.
-With
-RDF providers may not be happy about a barrage of requests.
-
-
-
-
+With RDF retrieval options enabled, it can take a lot longer (days; and the
+RDF providers may not be happy about a barrage of requests).
+
+It may help to enable threading for concurrent processing. The concurrency is provided
+by the ruby parallel gem, which can automatically optimise concurrent threads or processes.
+It can process 100,000 authority records in under 2 min (without any RDF retrieval).
+
+The processing involves substantial file IO and, when enabled, network IO also. With
+about 100,000 records in a .mrc file, all records can be loaded into memory and processed
+in a serial list or concurrently. With regard to file IO, it is the most expensive
+operation in the MARC-only mode (it may help to use a drive with high IO performance).
+In the RDF retrieval mode, where network IO becomes more important, it may help
+to enable threads for concurrent retrieval of RDF resources. However, it's still
 relatively slow (exploring options for caching and local downloads of RDF data).
-Note that it runs a lot slower on jruby-9.0.0.0-pre1 than MRI 2.2.0, whether threads
-are enabled or not. It raises exceptions on jruby-1.7.9, related to ruby
-language support (such as Array#delete_if).

 The current output is to the file system, but it should be easy to incorporate
 and configure alternatives by using the RDF.rb facilities for connecting to a
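Note on the hunk above: the README attributes concurrency to the ruby parallel gem, but this diff does not show how the gem is invoked. As a rough illustration of the general idea only, the sketch below uses plain standard-library threads rather than the parallel gem. The names `translate` and `process_concurrently` are hypothetical and are not part of marc2linkeddata.

```ruby
# Hypothetical stand-in for translating one MARC authority record.
def translate(record)
  "converted:#{record}"
end

# Process records with a fixed pool of worker threads (stdlib only).
# Each result is stored at the index of its input, so the output order
# matches the input order regardless of which worker finishes first.
def process_concurrently(records, workers: 4)
  queue = Queue.new
  records.each_with_index { |record, index| queue << [record, index] }
  results = Array.new(records.size)

  threads = Array.new(workers) do
    Thread.new do
      loop do
        begin
          # Non-blocking pop raises ThreadError once the queue is empty.
          record, index = queue.pop(true)
        rescue ThreadError
          break
        end
        results[index] = translate(record)
      end
    end
  end

  threads.each(&:join)
  results
end
```

Threads mainly help when the work is dominated by waiting on IO, which matches the README's remark that they are most useful in the RDF retrieval mode.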
@@ -38,8 +42,8 @@ TODO: A significant problem to solve is effective caching or mirrors for linked
 The retrieval should inspect any HTTP cache headers that might be available and
 adding PROVO to the linked-data graph generated for each record.

-TODO: Provide system platform options, to
-
+TODO: Provide system platform options, like docker, to package the application and
+make it easier to scale out the processing. Consider https://www.packer.io/intro/index.html

 Optional Dependencies

@@ -86,6 +90,12 @@ Scripting
 # marcAuthority2LD [ authfile1.mrc .. authfileN.mrc ]
 marcAuthority2LD auth.mrc

+# To provide one-off config values on the command line, set
+# the environment variable first; e.g. the following turns off
+# debug mode, processes 20 records from auth.mrc, using
+# concurrent processing.
+DEBUG=false TEST_RECORDS=20 THREADS=true marcAuthority2LD auth.mrc
+
 # Check the syntax of the output turtle files.
 touch turtle_syntax_checks.log
 for f in $(find ./auth_turtle/ -type f -name '*.ttl'); do
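Note on the hunk above: the new README lines show DEBUG, TEST_RECORDS and THREADS being passed as environment variables, but the code that reads them is not part of this diff. The following is a minimal sketch of how such values could be parsed in Ruby. The helper names (`env_flag`, `env_integer`, `read_settings`), the defaults, and the set of accepted truthy strings are all assumptions for illustration, not the gem's actual configuration code.

```ruby
# Hypothetical helper: treat a few common strings as true, anything else as false.
def env_flag(env, name, default: false)
  value = env[name]
  return default if value.nil?
  %w[true 1 yes on].include?(value.strip.downcase)
end

# Hypothetical helper: parse a base-10 integer, falling back to a default.
def env_integer(env, name, default: 0)
  Integer(env.fetch(name, default.to_s), 10)
end

# Collect the three settings named in the README example.
# `env` defaults to the process environment but accepts a plain Hash,
# which makes the parsing easy to exercise in isolation.
def read_settings(env = ENV)
  {
    debug: env_flag(env, 'DEBUG'),
    test_records: env_integer(env, 'TEST_RECORDS'),
    threads: env_flag(env, 'THREADS')
  }
end
```

Under these assumptions, the README's command line would correspond to `read_settings('DEBUG' => 'false', 'TEST_RECORDS' => '20', 'THREADS' => 'true')`.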
data/bin/marcAuthority2LD
CHANGED
@@ -47,7 +47,7 @@ def marc_authority_records(marc_filename)
 auth_count += 1
 $stdout.printf "\b\b\b\b\b\b" if auth_count > 1
 $stdout.printf '%06d', auth_count
-break if
+break if (CONFIG.test_records > 0 && CONFIG.test_records <= auth_count)
 end
 rescue => e
 stack_trace(e, record)
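Note on the hunk above: the added `break if` condition stops the loop once `auth_count` reaches `CONFIG.test_records`, and a value of 0 (or less) never triggers the break. The standalone sketch below isolates that condition so its behaviour can be checked without a MARC file. `limit_reached?` and `take_records` are hypothetical names introduced for illustration, and the reading of 0 as "no limit" is inferred from the condition itself rather than stated in the source.

```ruby
# Same predicate as the added line, with the config value passed in explicitly.
def limit_reached?(test_records, auth_count)
  test_records > 0 && test_records <= auth_count
end

# Hypothetical loop mirroring the structure of the hunk:
# count each record, then stop once the limit (if any) is reached.
def take_records(records, test_records)
  processed = []
  records.each do |record|
    processed << record
    break if limit_reached?(test_records, processed.size)
  end
  processed
end
```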
data/marc2linkeddata.gemspec
CHANGED