marc2linkeddata 0.1.3 → 0.1.4
- checksums.yaml +4 -4
- data/README.md +21 -11
- data/bin/marcAuthority2LD +1 -1
- data/marc2linkeddata.gemspec +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f53bf2a4de09468d82acd96afa56333d7d12bf2c
+  data.tar.gz: bcc81582e5ad1b723c56d87213165fba35b7d55b
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: af891eb4a3dbb3b9fbba41d1d900c5e57bcfd417b0ba602a6b0a537382d2984923fdbc40b0fcf2c193af2291a8c7ff88318ff51930e7b4b263a196279a464bb5
+  data.tar.gz: 29acc2ad5d26f2d68079d1e10b9b29ace1616c469a356946173dadfe17dd02ff81bebdc2620bbf7af8162a24a4d3387c83d98db3692def042deb44c1c0e8a86f
data/README.md
CHANGED
@@ -13,16 +13,20 @@ copy the .env_example file provided into the current path.
 Without any HTTP retrieval of RDF metadata, using only data in a MARC record, it can
 translate 100,000 authority records in about 5-6 min on a current laptop system. The
 config options allow specification of MARC fields that may already contain resource links.
-With
-RDF providers may not be happy about a barrage of requests.
-
-
-
-
+With RDF retrieval options enabled, it can take a lot longer (days; and the
+RDF providers may not be happy about a barrage of requests).
+
+It may help to enable threading for concurrent processing. The concurrency is provided
+by the ruby parallel gem, which can automatically optimise concurrent threads or processes.
+It can process 100,000 authority records in under 2 min (without any RDF retrieval).
+
+The processing involves substantial file IO and, when enabled, network IO also. With
+about 100,000 records in a .mrc file, all records can be loaded into memory and processed
+in a serial list or concurrently. With regard to file IO, it is the most expensive
+operation in the MARC-only mode (it may help to use a drive with high IO performance).
+In the RDF retrieval mode, where network IO becomes more important, it may help
+to enable threads for concurrent retrieval of RDF resources. However, it's still
 relatively slow (exploring options for caching and local downloads of RDF data).
-Note that it runs a lot slower on jruby-9.0.0.0-pre1 than MRI 2.2.0, whether threads
-are enabled or not. It raises exceptions on jruby-1.7.9, related to ruby
-language support (such as Array#delete_if).
 
 The current output is to the file system, but it should be easy to incorporate
 and configure alternatives by using the RDF.rb facilities for connecting to a
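Reviewer note: the added README lines describe loading all records into memory and processing them either serially or concurrently. The sketch below illustrates that idea with standard-library threads only; it is a stand-in and not the gem's code, which the README says relies on the parallel gem. The names `convert`, `convert_all`, and `thread_count` are hypothetical.

```ruby
require 'thread' # Queue lives here on older Rubies; harmless on newer ones

# Hypothetical per-record work. The real tool would translate a MARC
# authority record into linked data; here we only tag the input.
def convert(record)
  "converted:#{record}"
end

# Process every record with a fixed pool of worker threads.
# Each result is stored at its input index, so output order matches input order.
def convert_all(records, thread_count: 4)
  queue = Queue.new
  records.each_with_index { |record, index| queue << [record, index] }
  results = Array.new(records.size)

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        begin
          record, index = queue.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break
        end
        results[index] = convert(record)
      end
    end
  end

  workers.each(&:join)
  results
end
```

Threads are the more plausible fit for the RDF retrieval mode the README mentions, since that work waits on network IO; the MARC-only mode is described as bound by file IO, so gains there may depend more on the drive than on the thread count.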
@@ -38,8 +42,8 @@ TODO: A significant problem to solve is effective caching or mirrors for linked
 The retrieval should inspect any HTTP cache headers that might be available and
 adding PROVO to the linked-data graph generated for each record.
 
-TODO: Provide system platform options, to
-
+TODO: Provide system platform options, like docker, to package the application and
+make it easier to scale out the processing. Consider https://www.packer.io/intro/index.html
 
 Optional Dependencies
 
@@ -86,6 +90,12 @@ Scripting
 # marcAuthority2LD [ authfile1.mrc .. authfileN.mrc ]
 marcAuthority2LD auth.mrc
 
+# To provide one-off config values on the command line, set
+# the environment variable first; e.g. the following turns off
+# debug mode, processes 20 records from auth.mrc, using
+# concurrent processing.
+DEBUG=false TEST_RECORDS=20 THREADS=true marcAuthority2LD auth.mrc
+
 # Check the syntax of the output turtle files.
 touch turtle_syntax_checks.log
 for f in $(find ./auth_turtle/ -type f -name '*.ttl'); do
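Reviewer note: the added README lines set `DEBUG`, `TEST_RECORDS`, and `THREADS` as one-off environment variables. As a rough illustration of how a Ruby script could read such values, here is a minimal sketch. The helper names (`env_flag`, `env_int`, `read_settings`), the accepted truthy strings, and the defaults are all assumptions; the gem's own configuration code is not shown in this diff and may differ.

```ruby
# Hypothetical helper: interpret an environment value as a boolean.
# Unset or blank values fall back to the default.
def env_flag(env, name, default: false)
  value = env[name]
  return default if value.nil? || value.strip.empty?
  %w[1 true yes on].include?(value.strip.downcase)
end

# Hypothetical helper: interpret an environment value as a base-10 integer.
# Unparseable values fall back to the default.
def env_int(env, name, default: 0)
  Integer(env.fetch(name, default.to_s), 10)
rescue ArgumentError
  default
end

# Collect the three settings named in the README example.
# `env` defaults to the process environment; a Hash can be passed for testing.
def read_settings(env = ENV)
  {
    debug: env_flag(env, 'DEBUG'),
    test_records: env_int(env, 'TEST_RECORDS'),
    threads: env_flag(env, 'THREADS')
  }
end

settings = read_settings('DEBUG' => 'false', 'TEST_RECORDS' => '20', 'THREADS' => 'true')
# settings == { debug: false, test_records: 20, threads: true }
```

Passing the environment in as an argument keeps the parsing easy to exercise without touching the real process environment.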
data/bin/marcAuthority2LD
CHANGED
@@ -47,7 +47,7 @@ def marc_authority_records(marc_filename)
     auth_count += 1
     $stdout.printf "\b\b\b\b\b\b" if auth_count > 1
     $stdout.printf '%06d', auth_count
-    break if
+    break if (CONFIG.test_records > 0 && CONFIG.test_records <= auth_count)
   end
 rescue => e
   stack_trace(e, record)
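Reviewer note: the new `break if` condition stops the loop once a positive `test_records` limit is reached and never fires when the limit is zero. The standalone sketch below reproduces only that condition so its behaviour can be checked in isolation. `limit_reached?` and `processed_count` are hypothetical names, and the integer arguments stand in for `CONFIG.test_records` and the record reader.

```ruby
# True when a positive record limit has been reached.
# A limit of zero or less never triggers, so it acts as "no limit".
def limit_reached?(test_records, auth_count)
  test_records > 0 && test_records <= auth_count
end

# Count how many of `total_records` would be visited before the loop stops.
def processed_count(total_records, test_records)
  auth_count = 0
  total_records.times do
    auth_count += 1
    break if limit_reached?(test_records, auth_count)
  end
  auth_count
end

processed_count(100, 20) # => 20  (stops at the limit)
processed_count(100, 0)  # => 100 (zero means no limit)
processed_count(5, 20)   # => 5   (fewer records than the limit)
```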
data/marc2linkeddata.gemspec
CHANGED