marc2linkeddata 0.1.1 → 0.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: d706c18964f5a78918f4a0fef6cf68612ff7a641
- data.tar.gz: 2b238aeaf555d6365d781e48d2435b766000fe68
+ metadata.gz: 30923ebbb08cf2eb45cbe20a79bfc115fb8f695a
+ data.tar.gz: 1a98f477c2f8c9b61b4efb1f8c2a4d171862f2f4
  SHA512:
- metadata.gz: 06411d43a9d623bd364d4cd9a5d9e368bbf33278f1c45430b1930586b3cdb9e90b59f79fbd701362bc11d02450373c78e7e97cc8bea08bd3d03d6cbef0f985c2
- data.tar.gz: b535571333cb53d865a4d512a0bae2fa0b87bab50b2d05c113c8b0862bbece81b3f2c5cc51c0f306c8afd600a69cffdcb26c5e223216eb46c51dcbf53d135f4e
+ metadata.gz: c92c1369d3e39d46df6f712a94c27189c848c36b04c529fe1d9cfd2e341da9a1c8fe56f974fb7c011672fb59369572f9427307c077de9b1db0d29f6ced7a7dce
+ data.tar.gz: bf3e016ec9a3b01c6ed7b1bede449ee1c7791fb34fca1150d1e1efde09eb6f94436692e7e4461f32b9458ebd34890b16428bff7e532d8a0aac69a01a13e3a170
data/.env_example CHANGED
@@ -11,8 +11,12 @@
  # Uncomment and set values as required. See used settings in
  # lib/marc2linkeddata/configuration.rb
 
+ # Enable debug logging and breakpoints at problematic code points.
  export DEBUG=false
 
+ # Only read X MARC records, for testing purposes?
+ export TEST_RECORDS=0 # 0 for all records
+
  export LOG_FILE='marc2ld.log'
  export LIB_PREFIX=http://linked-data.example.org/library/
 
data/README.md CHANGED
@@ -6,10 +6,23 @@ Utilities for translating MARC21 into linked data. The project has
  focused on authority records (as of 2015).
 
  It has config options that can be enabled to increase the amount of data retrieved.
- Without any HTTP options enabled, using only data in the MARC record, it can
- translate 100,000 authority records in about 5-6 min on a current laptop system.
- File IO is the most expensive operation in this mode, so it helps to have a solid
- state drive or something with high IO performance.
+ All config options are set by environment variables. The .env_example file documents
+ the options available and how to use a .env file; the `marc2LD_config` utility will
+ copy the .env_example file provided into the current path.
+
+ Without any HTTP retrieval of RDF metadata, using only data in a MARC record, it can
+ translate 100,000 authority records in about 5-6 min on a current laptop system. The
+ config options allow specification of MARC fields that may already contain resource links.
+ With HTTP/RDF retrieval options enabled, it can take a lot longer (days) and the
+ RDF providers may not be happy about a barrage of requests.
+
+ File IO is the most expensive operation in the MARC-only mode (it helps to have a solid
+ state drive with high IO performance). In the RDF-HTTP retrieval mode, it may help
+ to enable threading for concurrent retrieval of RDF resources. However, it's still
+ relatively slow (exploring options for caching and local downloads of RDF data).
+ Note that it runs a lot slower on jruby-9.0.0.0-pre1 than MRI 2.2.0, whether threads
+ are enabled or not. It raises exceptions on jruby-1.7.9, related to ruby
+ language support (such as Array#delete_if).
 
  The current output is to the file system, but it should be easy to incorporate
  and configure alternatives by using the RDF.rb facilities for connecting to a
@@ -18,12 +31,8 @@ exploration hasn't matured much, mainly because there is no 'cache-expiry' data
  yet and because it would be better to use an RDF.rb extension of some
  kind (for redis, mongodb, etc) or to use a triple store/solr platform.
 
- With HTTP/RDF retrieval options enabled, it can take a lot longer (days) and the
- providers may not be very happy about a barrage of requests.
-
- Note that it runs a lot slower on jruby-9.0.0.0-pre1 than MRI 2.2.0, whether threads
- are enabled or not. It raises exceptions on jruby-1.7.9, related to ruby
- language support (such as Array#delete_if).
+ TODO: Develop on additional example datasets, to evaluate the generality and robustness
+ of the utilities.
 
  TODO: A significant problem to solve is effective caching or mirrors for linked data.
  The retrieval should inspect any HTTP cache headers that might be available and
@@ -54,8 +63,8 @@ Install with rbenv (on linux)
  echo 'eval "$(rbenv init -)"' >> ~/.bash_profile
  source .bash_profile
  git clone https://github.com/sstephenson/ruby-build.git ~/.rbenv/plugins/ruby-build
- rbenv install 2.1.5 # or the latest ruby available
- rbenv global 2.1.5
+ rbenv install 2.2.0 # or the latest ruby available
+ rbenv global 2.2.0
  rbenv rehash
  gem install bundle
  gem install marc2linkeddata
@@ -63,20 +72,115 @@ Install with rbenv (on linux)
  Configure
 
  # set env values and/or create or modify a .env file
- # see the .env_example file for details
- marc2LD_config
+ # see the .env_example file for details.
  # Performance will slow with more retrieval of linked
  # data resources, such as OCLC works for authorities.
+ marc2LD_config
 
  Scripting
 
  # First configure (see details above).
  # Translate a MARC21 authority file to a turtle file.
- # readMarcAuthority [ authfile1.mrc .. authfileN.mrc ]
- marcAuthority2LD auth.01.mrc
-
- # Check the syntax of the resulting turtle file.
- rapper -c -i turtle auth.01.ttl
+ # It's assumed that '*.mrc' files contain multiple MARC21
+ # records and the record identifier is in field 001.
+ # marcAuthority2LD [ authfile1.mrc .. authfileN.mrc ]
+ marcAuthority2LD auth.mrc
+
+ # Check the syntax of the output turtle files.
+ touch turtle_syntax_checks.log
+ for f in $(find ./auth_turtle/ -type f -name '*.ttl'); do
+ rapper -c -i turtle $f >> turtle_syntax_checks.log 2>&1
+ done
+
+ Example Output Files
+
+ - In this example, only data in the MARC record was used, without any RDF link
+ resolution or retrieval. The example MARC record already contained links to
+ VIAF and ISNI IRIs (these 9xx MARC fields are identified in the configuration).
+
+ @prefix owl: <http://www.w3.org/2002/07/owl#> .
+ @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
+ @prefix schema: <http://schema.org/> .
+ @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+ <http://linked-data.stanford.edu/library/authority/N79044798> a schema:Person;
+ schema:name "Byrnes, Christopher I.,";
+ owl:sameAs <http://id.loc.gov/authorities/names/n79044798>,
+ <http://viaf.org/viaf/108317368>,
+ <http://www.isni.org/0000000109311081> .
+
+ - In this example, all the RDF link resolution and retrieval was enabled. Also, the
+ OCLC works for this authority were resolved. The result is an 'authority index' into LOD,
+ including associated works. Although some of the RDF was retrieved in the process (and
+ could be cached in a local triple store), the output record is designed to be an LOD index
+ only. The index could be stored in a local triple store, to be leveraged by local clients
+ that may retrieve and use additional data from the RDF links. Sharing such an 'LOD index'
+ in a distributed network database could be very useful and open opportunities for institutions
+ to collaborate on scaling the link resolution and maintenance issues.
+
+ @prefix owl: <http://www.w3.org/2002/07/owl#> .
+ @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
+ @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
+ @prefix schema: <http://schema.org/> .
+ @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+ <http://linked-data.example.org/library/authority/N79044798> a schema:Person;
+ schema:familyName "Byrnes";
+ schema:givenName "Christopher Ian",
+ "Christopher I";
+ schema:name "Byrnes, Christopher I., 1949-";
+ owl:sameAs <http://id.loc.gov/authorities/names/n79044798>,
+ <http://viaf.org/viaf/108317368>,
+ <http://www.isni.org/0000000109311081> .
+ <http://id.loc.gov/authorities/names/n79044798> owl:sameAs <http://www.worldcat.org/identities/lccn-n79044798> .
+ <http://www.worldcat.org/identities/lccn-n79044798> rdfs:seeAlso <http://www.worldcat.org/oclc/747413718>,
+ <http://www.worldcat.org/oclc/017649403>,
+ <http://www.worldcat.org/oclc/004933024>,
+ <http://www.worldcat.org/oclc/007170722>,
+ <http://www.worldcat.org/oclc/006626542>,
+ <http://www.worldcat.org/oclc/050868185>,
+ <http://www.worldcat.org/oclc/013525712>,
+ <http://www.worldcat.org/oclc/013700764>,
+ <http://www.worldcat.org/oclc/036387153>,
+ <http://www.worldcat.org/oclc/013525674>,
+ <http://www.worldcat.org/oclc/013700768>,
+ <http://www.worldcat.org/oclc/018380395>,
+ <http://www.worldcat.org/oclc/018292079>,
+ <http://www.worldcat.org/oclc/023969230>,
+ <http://www.worldcat.org/oclc/035911289>,
+ <http://www.worldcat.org/oclc/495781917>,
+ <http://www.worldcat.org/oclc/727657045>,
+ <http://www.worldcat.org/oclc/782013318>,
+ <http://www.worldcat.org/oclc/037671494>,
+ <http://www.worldcat.org/oclc/751661734>,
+ <http://www.worldcat.org/oclc/800600611> .
+
+ - In addition, when the option to resolve OCLC works is enabled (OCLC_AUTH2WORKS option), the
+ following triples were added to those above.
+
+ <http://www.worldcat.org/oclc/004933024> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/796991413> .
+ <http://www.worldcat.org/oclc/006626542> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/111527266> .
+ <http://www.worldcat.org/oclc/007170722> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/144285064> .
+ <http://www.worldcat.org/oclc/013525674> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/7358848> .
+ <http://www.worldcat.org/oclc/013525712> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/7360091> .
+ <http://www.worldcat.org/oclc/013700764> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/366036025> .
+ <http://www.worldcat.org/oclc/013700768> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/366036042> .
+ <http://www.worldcat.org/oclc/017649403> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/866252320> .
+ <http://www.worldcat.org/oclc/018292079> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/836712068> .
+ <http://www.worldcat.org/oclc/018380395> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/365996343> .
+ <http://www.worldcat.org/oclc/023969230> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/890420837> .
+ <http://www.worldcat.org/oclc/035911289> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/355875201> .
+ <http://www.worldcat.org/oclc/036387153> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/622568> .
+ <http://www.worldcat.org/oclc/037671494> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/9216290> .
+ <http://www.worldcat.org/oclc/050868185> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/366714531> .
+ <http://www.worldcat.org/oclc/495781917> schema:contributor <http://www.worldcat.org/identities/lccn-n79044798>;
+ schema:exampleOfWork <http://www.worldcat.org/entity/work/id/994448191> .
+ <http://www.worldcat.org/oclc/727657045> schema:contributor <http://www.worldcat.org/identities/lccn-n79044798>;
+ schema:exampleOfWork <http://www.worldcat.org/entity/work/id/1811109792> .
+ <http://www.worldcat.org/oclc/747413718> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/994448191> .
+ <http://www.worldcat.org/oclc/751661734> schema:contributor <http://www.worldcat.org/identities/lccn-n79044798>;
+ schema:exampleOfWork <http://www.worldcat.org/entity/work/id/1816359357> .
+ <http://www.worldcat.org/oclc/782013318> schema:contributor <http://www.worldcat.org/identities/lccn-n79044798>;
+ schema:exampleOfWork <http://www.worldcat.org/entity/work/id/146829946> .
+ <http://www.worldcat.org/oclc/889440750> schema:exampleOfWork <http://www.worldcat.org/entity/work/id/2061462527> .
 
 
  Ruby Library Use
@@ -93,7 +197,8 @@ Ruby Library Use
  record = MARC::Reader.decode(raw)
  auth = ParseMarcAuthority.new(record)
  auth_id = "auth:#{auth.get_id}"
- triples = auth.to_ttl
+ graph = auth.graph
+ puts graph.to_ttl
  end
  end
 
@@ -105,7 +210,12 @@ Development
  ./bin/test.sh
  cp .env_example .env # and edit .env
  # develop code and/or bin scripts; run bin scripts, e.g.
- .binstubs/marcAuthority2LD auth.01.mrc
+ .binstubs/marcAuthority2LD auth.mrc
+ # Look for results in auth_turtle/*.ttl files.
+ # see also full example script in
+ #.binstubs/run_test_data.sh
+ # which includes shell script for basic stats and
+ # using rapper to check the file output syntax.
 
 
  # License
data/bin/marcAuthority2LD CHANGED
@@ -47,6 +47,7 @@ def marc_authority_records(marc_filename)
  auth_count += 1
  $stdout.printf "\b\b\b\b\b\b" if auth_count > 1
  $stdout.printf '%06d', auth_count
+ break if auth_count >= CONFIG.test_records
  end
  rescue => e
  stack_trace(e, record)
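A note on the added `break`: .env_example documents `TEST_RECORDS=0` as "all records", and the configuration reads the value with `to_i`. Read literally, `auth_count >= 0` is already true after the first record, so with the default setting the loop would appear to stop after one record. The sketch below shows one way the guard could treat zero as "no limit". The helpers `limit_reached?` and `count_records` are hypothetical names for illustration and are not part of the gem.

```ruby
# Hypothetical helper: true only when a positive limit has been reached.
# A limit of 0 (or less) is treated as "no limit".
def limit_reached?(count, limit)
  limit > 0 && count >= limit
end

# Hypothetical stand-in for the record loop in bin/marcAuthority2LD.
# It only counts records; the real loop also parses and writes them.
def count_records(records, limit)
  count = 0
  records.each do |_record|
    count += 1
    break if limit_reached?(count, limit)
  end
  count
end

records = %w[rec1 rec2 rec3 rec4 rec5]
puts count_records(records, 0) # no limit: every record is visited
puts count_records(records, 2) # stops once two records have been visited
```

Keeping the check in a small predicate makes the "0 means all" convention explicit in one place, rather than relying on how the comparison happens to behave.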
@@ -4,6 +4,7 @@ module Marc2LinkedData
  class Configuration
 
  attr_accessor :debug
+ attr_accessor :test_records
 
  attr_accessor :threads
  attr_accessor :thread_limit
@@ -38,6 +39,8 @@ module Marc2LinkedData
 
  def initialize
  @debug = env_boolean('DEBUG')
+ @test_records = ENV['TEST_RECORDS'].to_i
+
  @threads = env_boolean('THREADS')
  @thread_limit = ENV['THREAD_LIMIT'].to_i || 25
 
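One detail worth knowing when reading these settings: in Ruby, both `String#to_i` and `NilClass#to_i` always return an Integer, never `nil`. An unset `TEST_RECORDS` therefore becomes `0`, and an expression of the form `ENV['X'].to_i || 25` never reaches the `|| 25` fallback. The sketch below shows one possible way to read an integer setting with a real default. `env_integer` is a hypothetical helper, not something the gem provides, and the injectable `env` argument exists only to make it easy to try out.

```ruby
# Hypothetical helper: read an integer setting, falling back to a default
# when the variable is unset, blank, or not a valid base-10 integer.
# The env argument defaults to ENV but accepts any Hash-like object.
def env_integer(name, default, env = ENV)
  value = env[name]
  return default if value.nil? || value.strip.empty?
  Integer(value, 10)
rescue ArgumentError
  default
end

settings = { 'TEST_RECORDS' => '0' }
puts env_integer('TEST_RECORDS', 0, settings)  # set explicitly to 0
puts env_integer('THREAD_LIMIT', 25, settings) # unset, so the default applies
puts (nil.to_i || 25)                          # nil.to_i is 0, so 25 is never used
```

`Kernel#Integer` is used instead of `to_i` so that a malformed value falls back to the default instead of silently becoming `0`.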
@@ -95,20 +95,7 @@ module Marc2LinkedData
  unless loc_iri.nil?
  # Verify the URL (used HEAD so it's as fast as possible)
  @@config.logger.debug "Trying to validate LOC IRI: #{loc_iri}"
- res = Marc2LinkedData.http_head_request(loc_iri + '.rdf')
- case res.code
- when '200'
- # it's good to go
- when '301'
- # use the redirection
- loc_iri = res['location']
- when '302','303'
- #302 Moved Temporarily
- #303 See Other
- # Use the current URL, most get requests will follow a 302 or 303
- else
- loc_iri = nil
- end
+ loc_iri = Marc2LinkedData.http_head_request(loc_iri + '.rdf')
  end
  if loc_iri.nil?
  # If it gets here, it's a problem.
@@ -84,34 +84,19 @@ module Marc2LinkedData
 
  def resolve_external_auth(url)
  begin
- res = Marc2LinkedData.http_head_request(url)
- case res.code
- when '200'
- @@config.logger.debug "Mapped #{@iri}\t-> #{url}"
- return url
- when '301'
- #301 Moved Permanently
- url = res['location']
- @@config.logger.debug "Mapped #{@iri}\t-> #{url}"
- return url
- when '302','303'
- #302 Moved Temporarily
- #303 See Other
- # Use the current URL, most get requests will follow a 302 or 303
- @@config.logger.debug "Mapped #{@iri}\t-> #{url}"
- return url
- when '404'
- @@config.logger.warn "#{@iri}\t// #{url}"
- return nil
- else
- # WTF
- binding.pry if @@config.debug
- @@config.logger.error "unknown http response code (#{res.code}) for #{@iri}"
- return nil
+ # RestClient does all the response code handling and redirection.
+ url = Marc2LinkedData.http_head_request(url)
+ if url.nil?
+ @@config.logger.warn "#{@iri}\t// #{url}"
+ else
+ @@config.logger.debug "Mapped #{@iri}\t-> #{url}"
  end
  rescue
- nil
+ binding.pry if @@config.debug
+ @@config.logger.error "unknown http error for #{@iri}"
+ url = nil
  end
+ url
  end
 
  def same_as
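In the refactored `resolve_external_auth`, the `url` parameter is overwritten with the result of the HEAD request. When that result is `nil`, the warning interpolates `nil`, which renders as an empty string, so the log line no longer names the URL that failed. A minimal sketch of one alternative keeps the requested URL in its own variable. Here `resolve_url` is a hypothetical function, and `head` is any callable standing in for `Marc2LinkedData.http_head_request`, so the example runs without network access.

```ruby
require 'logger'
require 'stringio'

# Hypothetical variant of the refactored method. `head` is any callable that
# returns a resolved URL string or nil.
def resolve_url(iri, url, head, logger)
  resolved = head.call(url)
  if resolved.nil?
    # Log the URL that was requested, not the nil result.
    logger.warn "#{iri}\t// #{url}"
  else
    logger.debug "Mapped #{iri}\t-> #{resolved}"
  end
  resolved
rescue StandardError => e
  logger.error "http error for #{iri}: #{e.message}"
  nil
end

log = StringIO.new
logger = Logger.new(log)
missing = ->(_url) { nil } # stub: pretend the resource could not be resolved
resolve_url('auth:n1', 'http://example.org/missing', missing, logger)
puts log.string # the warning still names the requested URL
```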
@@ -23,23 +23,20 @@ module Marc2LinkedData
  end
 
  def self.http_head_request(url)
- uri = URI.parse(url)
+ uri = nil
  begin
- if RUBY_VERSION =~ /^1\.9/
- req = Net::HTTP::Head.new(uri.path)
- else
- req = Net::HTTP::Head.new(uri)
- end
- Net::HTTP.start(uri.host, uri.port) {|http| http.request req }
+ response = RestClient.head(url)
+ uri = response.args[:url]
  rescue
- @configuration.logger.error "Net::HTTP::Head failed for #{uri}"
+ @configuration.logger.error "RestClient.head failed for #{url}"
  begin
- Net::HTTP.get_response(uri)
+ response = RestClient.get(url)
+ uri = response.args[:url]
  rescue
- @configuration.logger.error "Net::HTTP.get_response failed for #{uri}"
- nil
+ @configuration.logger.error "RestClient.get failed for #{url}"
  end
  end
+ uri
  end
 
  def self.write_prefixes(file)
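The rewritten `http_head_request` follows a simple pattern: try HEAD, fall back to GET if HEAD raises, and return `nil` if both fail. The nested `begin`/`rescue` blocks can also be expressed as a loop over attempts, sketched below under stated assumptions. `first_resolved_url` is a hypothetical helper, and the `head` and `get` lambdas are stubs standing in for the real HTTP calls so the example does not depend on RestClient or a network.

```ruby
# Hypothetical sketch of the HEAD-then-GET fallback. Each attempt is a callable
# that returns the final URL or raises; the first success wins, and nil is
# returned when every attempt fails. Error messages are collected in `errors`.
def first_resolved_url(url, attempts, errors = [])
  attempts.each do |name, attempt|
    begin
      return attempt.call(url)
    rescue StandardError => e
      errors << "#{name} failed for #{url}: #{e.message}"
    end
  end
  nil
end

# Stubs: HEAD is rejected, GET succeeds and reports a URL with a trailing slash.
head = ->(_url) { raise IOError, '405 Method Not Allowed' }
get  = ->(url)  { url + '/' }

errors = []
puts first_resolved_url('http://example.org/id/1', { 'HEAD' => head, 'GET' => get }, errors)
puts errors
```

Collecting the failures in a list, instead of logging inside each `rescue`, leaves it to the caller to decide whether a failed HEAD followed by a successful GET deserves an error-level log entry.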
@@ -4,7 +4,7 @@ $:.unshift lib unless $:.include?(lib)
 
  Gem::Specification.new do |s|
  s.name = 'marc2linkeddata'
- s.version = '0.1.1'
+ s.version = '0.1.2'
  s.licenses = ['Apache-2.0']
 
  # mysql and bson_ext only install on MRI (c-ruby)
@@ -9,7 +9,7 @@ module Marc2LinkedData
  @loc_id = 'no99010609'
  @loc_url = 'http://id.loc.gov/authorities/names/no99010609'
  @loc = Loc.new @loc_url
- @viaf_url = 'http://viaf.org/viaf/85312226'
+ @viaf_url = 'http://viaf.org/viaf/85312226/'
  end
 
  before :each do
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: marc2linkeddata
  version: !ruby/object:Gem::Version
- version: 0.1.1
+ version: 0.1.2
  platform: ruby
  authors:
  - Darren Weber