high_level_browse 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 286d0ce64d0d9e8dffa58b716f111d086310654d
4
+ data.tar.gz: 2a13aad07ee29e47b0bcc00f4ba16740491e9bfd
5
+ SHA512:
6
+ metadata.gz: 9960852abc0686da303da11c8ead326df0ea7e7df89432962f7d1353e62350afbc7a3ad556d1beecfe6cce816c1bf654ce4bdee78bb195caefdb08caeb67b7cf
7
+ data.tar.gz: 3d29b51feb0bd70c37eea28248eff4f3dccd8a38cdb23617be8998fdfa821e392743d0f87c25e9f233ee326b4097072ca63d3d04bcdf5c21216ec43a96ecae04
@@ -0,0 +1,22 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ *.bundle
19
+ *.so
20
+ *.o
21
+ *.a
22
+ mkmf.log
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in high_level_browse.gemspec
4
+ gemspec
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2014 Bill Dueber
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,198 @@
1
+ # HighLevelBrowse
2
+
3
+ Given an LC Call Number, try to get a set of academic disciplines associated with it
4
+
5
+ ## Usage
6
+
7
+ ```ruby
8
+
9
+ use 'high_level_browse'
10
+
11
+ # Pull a new version of the raw data from the UM website,
12
+ # transform it into something that can be quickly searched,
13
+ # and serialize it to `hlb.json.gz` in the specified directory
14
+ hlb = HighLevelBrowse.fetch_and_save(dir: '/tmp')
15
+
16
+ # ...or just grab an already fetch_and_saved copy
17
+ hlb = HighLevelBrowse.load(dir: '/tmp')
18
+
19
+ # What HLB categories is an LC Call Number in?
20
+ hlb.topics 'hc 9112.2'
21
+ # => [["Social Sciences", "Economics"],
22
+ # ["Social Sciences", "Social Sciences (General)"]]
23
+
24
+ # ... or use the #[] shortcut syntax
25
+
26
+ hlb['NC1766 .U52 D733 2014']
27
+ # => [["Arts", "Art History"],
28
+ # ["Arts", "Art and Design"],
29
+ # ["Arts", "Film and Video Studies"]]
30
+
31
+ # You can also send more than one call number at a time
32
+
33
+ hlb.topics('E 99 .S2 Y67 1993', 'PS 3565 .R5734 F67 2015')
34
+ # => [["Humanities", "American Culture"],
35
+ # ["Humanities", "United States History"],
36
+ # ["Social Sciences", "Native American Studies"],
37
+ # ["Social Sciences", "Archaeology"],
38
+ # ["Humanities", "English Language and Literature"]]
39
+
40
+ ```
41
+
42
+
43
+ ## Overview
44
+
45
+ While we in the library world sometimes use LC Call Numbers (or at least
46
+ the initial letters) as a proxy for subject matter, the mapping is iffy
47
+ in many cases and is, in any case, one-dimensional. Many works simply
48
+ cover multiple subjects or are relevant to sometimes quite different
49
+ types of academics.
50
+
51
+ Take, for example, the chemistry of the brain as it applies to mental
52
+ illness. We have a book, _Endorphins : new waves in brain chemistry_
53
+ cataloged as **QP552.E53 D381 1984**. The QP's map to "Phsiology", which
54
+ is correct but not complete.
55
+
56
+ The University of Michigan Library has for years maintained
57
+ the [High Level Browse](https://www.lib.umich.edu/browse/categories/) (HLB),
58
+ a mapping of call-number ranges to academic subjects. The entire
59
+ data set is available as [1.8MB XML file](https://www.lib.umich.edu/browse/categories/xml.php)
60
+ for download.
61
+
62
+ In the HLB, the call number for _Endorphins : new waves in brain chemistry_ maps
63
+ to the following categories:
64
+
65
+ * Science | Physiology
66
+ * Health Sciences | Physiology
67
+ * Health Sciences | Public Health (General)
68
+ * Science | Chemical Engineering
69
+ * Engineering | Chemical Engineering
70
+ * Health Sciences | Biological Chemistry
71
+ * Science | Chemistry | Biological Chemistry
72
+
73
+ This opens up potentially more accurate categorization of works for, say,
74
+ faceting in a library catalog.
75
+
76
+ This gem gives a relatively time-efficient way to get the set of disciplines associated
77
+ with the given callnumber or callnumbers as part of indexing MARC records into Solr.
78
+ This mapping is used in many places in the University Library at the University of
79
+ Michigan, including the
80
+ [Mirlyn Catalog](https://mirlyn.lib.umich.edu/)
81
+ (exposed as "Academic Discipline" in the facets) and ejournals/databases (and even
82
+ Librarians!) via the [Browse page](https://www.lib.umich.edu/browse).
83
+
84
+ This categorization may be useful for clustering/faceting
85
+ in similar applications at other institutions. Note that the actual creation and
86
+ maintenance of the call number ranges is done by subject specialist librarians and
87
+ is out of scope for this gem.
88
+
89
+ ## Command line utilities: `fetch_new_hlb` and `hlb`
90
+
91
+ There are also a couple command line applications for managing and querying the
92
+ data.
93
+
94
+ * **fetch_new_hlb** tries to grab a new copy of the data from the umich website
95
+ and serialize it to a ~500k file called `hlb.json.gz` in the given directory.
96
+ Useful for putting in a cron job to periodically update with fresh data
97
+
98
+ ```bash
99
+
100
+ $> fetch_new_hlb
101
+
102
+ fetch_new_hlb -- get a new copy of the HLB ready for use by high_level_browse
103
+ and stick it in the given directory
104
+
105
+ Usage: fetch_new_hlb <dir>
106
+ ```
107
+
108
+ * **hlb** takes one or more callnumbers and returns a text display of the categories
109
+ associated with them. It will stash a copy of the database in `Dir.tmpdir`if there
110
+ isn't one there already, and use it on subsequent calls so things aren't so
111
+ desperately slow. (To find your tmpdir, in your shell
112
+ run `ruby -e 'require "tmpdir"; puts Dir.tmpdir'`)
113
+
114
+
115
+ ```bash
116
+ $> hlb
117
+
118
+ hlb -- get high level browse data for an LC call number
119
+
120
+ Example:
121
+ hlb "qa 11.33 .C4 .H3"
122
+ or do several at once
123
+ hlb "PN 33.4" "AC 1122.3 .C22" ...
124
+
125
+ # Let's try it
126
+ $> hlb "qa 11.33 .C4"
127
+
128
+ Science | Mathematics
129
+ Social Sciences | Education
130
+
131
+ ```
132
+
133
+
134
+ ## A warning about (lack of) coverage
135
+
136
+ Note that not every possible valid callnumber will be necessarily be contained in any
137
+ dicipline at all. Many books aren't academic in nature, and even then
138
+ coverage is known to have some holes. Some of the ranges cover essentially a
139
+ single book in the umich collection. And, of course, not every record is going
140
+ to have a LC Call Number, so there's that.
141
+
142
+ This is all to say: this may or may not be useful at your insitution. You'll
143
+ have to experiment.
144
+
145
+ To help with this, there's a little script in the `bin/` directory called
146
+ `test_marc_file_for_hlb` which will, when given a MARC-XML file (ending in `.xml`)
147
+ or a MARC-binary file (ending in anything else), output some statistics on
148
+ what kind of coverage you would get. It might be useful to send a test file
149
+ through there to see what comes up. It looks in the `050` and the `852[h]` to
150
+ see if anything pops, but you can make it looks elsewhere pretty easily.
151
+
152
+ It produces something like this:
153
+
154
+ ```
155
+ 050 fields
156
+ 9790 total
157
+ 209 not recognized as LC call numbers
158
+ 9337 with at least one HLB category
159
+ 244 with NO category
160
+
161
+ Of 17642 records,
162
+ 9677 (54.85%) had a field that often contains an LC Call Number
163
+ 9262 (95.71%) of *those* had at least one HLB category
164
+
165
+ ```
166
+
167
+ ## Performance
168
+
169
+ On my laptop under normal load (e.g., not very scientific at all)
170
+ I get the following running in a single thread
171
+
172
+ ```
173
+ ruby 2.3 this gem ~8500 lookups/second
174
+ ruby 2.4 this gem ~9100 lookups/second
175
+ jruby 9 this gem ~20,000 lookups/second
176
+ jruby 9, old HLB.jar ~6500 lookups/second
177
+ jruby 1.7 this gem error, can't do named arguments since it's 1.9 mode
178
+ jruby 1.7 old HLB.jar ~6700 lookups/second
179
+ ```
180
+
181
+ The [old HLB.jar](https://github.com/billdueber/HLB-Java) refers to a pure java version that I call from within
182
+ Jruby as part of my catalog indexing process now. Ithas a different (worse) algorithm, but is of
183
+ interest because it's what I'm writing this to replace.
184
+
185
+ ## Installation
186
+
187
+ ```bash
188
+ gem 'high_level_browse'
189
+ ```
190
+
191
+
192
+ ## Contributing
193
+
194
+ 1. Fork it ( https://github.com/[my-github-username]/high_level_browse/fork )
195
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
196
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
197
+ 4. Push to the branch (`git push origin my-new-feature`)
198
+ 5. Create a new Pull Request
@@ -0,0 +1,9 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new(:test) do |t|
5
+ t.libs << "test"
6
+ end
7
+
8
+ task :default => :test
9
+
@@ -0,0 +1,57 @@
1
+ require 'benchmark/ips'
2
+ $:.unshift '../lib'
3
+ $:.unshift '.'
4
+
5
+
6
+ # On my laptop under normal load (e.g., not very scientific at all)
7
+ # I get the following running in a single thread
8
+ # ruby 2.3 ~8500 lookups/second
9
+ # ruby 2.4 ~9100 lookups/second
10
+ # jruby 9 ~20k lookups/second
11
+ # jruby 9, old HLB.jar ~6500 lookups/second
12
+ # jruby 1.7 error, can't do named arguments
13
+ # jruby 1.7, old HLB.jar ~6700 lookups/second
14
+ #
15
+ # The old HLB.jar has a different (worse) algorithm, but is of
16
+ # interest because it's what I'm writing this to replace.
17
+
18
+ # umich_traject holds .jar files with the old java implementation; see
19
+ # https://github.com/hathitrust/ht_traject/tree/9e8d414fd9bb2c79e243d289c4d39c05d2de27e5/lib/umich_traject
20
+ #
21
+
22
+ TEST_OLD_STUFF = defined? JRUBY_VERSION and Dir.exist?('./umich_traject')
23
+ if TEST_OLD_STUFF
24
+ puts "Loading old HLB3.jar stuff"
25
+ require 'umich_traject/jackson-core-asl-1.4.3.jar'
26
+ require 'umich_traject/jackson-mapper-asl-1.4.3.jar'
27
+ require 'umich_traject/apache-solr-umichnormalizers.jar'
28
+ require 'umich_traject/HLB3.jar'
29
+ java_import Java::edu.umich.lib.hlb::HLB
30
+ puts "Initializing HLB"
31
+ HLB.initialize()
32
+ end
33
+
34
+ require 'high_level_browse'
35
+
36
+ h = HighLevelBrowse.load(dir: '.')
37
+
38
+ cns = File.read('call_numbers.txt').split(/\n/).cycle
39
+
40
+ puts RUBY_DESCRIPTION
41
+
42
+ total = 0
43
+ Benchmark.ips do |x|
44
+ x.config(:time => 25, :warmup => 25)
45
+
46
+ x.report("HLB lookups") do
47
+ total += h[cns.next].count
48
+ end
49
+
50
+ if TEST_OLD_STUFF
51
+ total = 0
52
+ x.report("Old java lookups") do
53
+ total += HLB.categories(cns.next).to_a.count
54
+ end
55
+ x.compare!
56
+ end
57
+ end
Binary file
@@ -0,0 +1,62 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # If we're loading from source instead of a gem, rubygems
4
+ # isn't setting load paths for us, so we need to set it ourselves
5
+ self_load_path = File.expand_path("../lib", File.dirname(__FILE__))
6
+ unless $LOAD_PATH.include? self_load_path
7
+ $LOAD_PATH << self_load_path
8
+ end
9
+
10
+ def silence_warnings(&block)
11
+ warn_level = $VERBOSE
12
+ $VERBOSE = nil
13
+ result = block.call
14
+ $VERBOSE = warn_level
15
+ result
16
+ end
17
+
18
+ # minitest has a circular require warning, which
19
+ # drives me crazy. Suppress it.
20
+ silence_warnings do
21
+ require 'high_level_browse'
22
+ end
23
+
24
+ require 'fileutils'
25
+
26
+ def putsmsg(msg)
27
+ puts "---------------------------------------------------"
28
+ puts " ERROR: #{msg}"
29
+ puts "---------------------------------------------------"
30
+ puts
31
+ end
32
+
33
+ def usage(msg = nil)
34
+ puts
35
+ putsmsg(msg) if msg
36
+ puts "fetch_new_hlb -- get a new copy of the HLB ready for use by high_level_browse"
37
+ puts "and stick it in the given directory"
38
+ puts
39
+ puts " Usage: fetch_new_hlb <dir>"
40
+ puts
41
+ exit
42
+ end
43
+
44
+ unless ARGV.size == 1
45
+ usage
46
+ end
47
+
48
+ dir = ARGV.shift
49
+
50
+ File.exist? dir or usage "#{dir} does not exist"
51
+ Dir.exist? dir or usage "#{dir} is not a directory"
52
+ File.writable? dir or usage "#{dir} is not writable"
53
+
54
+ begin
55
+ db = HighLevelBrowse.fetch
56
+ db.save(dir: dir)
57
+ rescue => e
58
+ puts "============================="
59
+ puts "ERROR FETCHING HLB SOURCE"
60
+ puts " #{e}"
61
+ puts "============================="
62
+ end
data/bin/hlb ADDED
@@ -0,0 +1,46 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # Hmmm. How to pass along the location? Stick one in /tmp
4
+ # and see if it exists?
5
+
6
+
7
+ def usage
8
+ puts "hlb -- get high level browse data for an LC call number"
9
+ puts
10
+ puts %Q{Example:\n hlb "qa 11.33 .C4 .H3"}
11
+ puts " or do several at once"
12
+ puts %Q{ hlb "PN 33.4" "AC 1122.3 .C22" ... }
13
+ puts
14
+ exit(1)
15
+ end
16
+
17
+ usage if ARGV.empty?
18
+
19
+ self_load_path = File.expand_path("../lib", File.dirname(__FILE__))
20
+ unless $LOAD_PATH.include? self_load_path
21
+ $LOAD_PATH << self_load_path
22
+ end
23
+
24
+ require 'high_level_browse'
25
+ require 'fileutils'
26
+ require 'tmpdir'
27
+
28
+ filename = HighLevelBrowse::DB::FILENAME
29
+ dir = Dir.tmpdir()
30
+ fullpath = File.join(dir, filename)
31
+
32
+ hlb = if File.exist?(fullpath)
33
+ HighLevelBrowse.load(dir: dir)
34
+ else
35
+ STDERR.puts "Fetching raw data from UMich; wait a sec"
36
+ HighLevelBrowse.fetch_and_save(dir: dir)
37
+ end
38
+
39
+
40
+ topics = hlb[*ARGV]
41
+
42
+ if topics.empty?
43
+ puts "\nNo categories found for #{ARGV}\n\n"
44
+ else
45
+ puts "\n" + topics.map { |x| x.join(' | ') }.join("\n") + "\n\n"
46
+ end
@@ -0,0 +1,122 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ self_load_path = File.expand_path("../lib", File.dirname(__FILE__))
4
+ unless $LOAD_PATH.include? self_load_path
5
+ $LOAD_PATH << self_load_path
6
+ end
7
+
8
+ require 'marc'
9
+ require 'high_level_browse'
10
+ require 'lcsort'
11
+ require 'tmpdir'
12
+
13
+
14
+ filename = ARGV[0]
15
+
16
+ reader = if filename =~ /xml\Z/i
17
+ MARC::XMLReader.new(filename)
18
+ else
19
+ MARC::Reader.new(filename)
20
+ end
21
+
22
+ Counter = Struct.new(:count, :invalid, :found, :notfound, :hlb) do
23
+
24
+ def update(cn)
25
+ self.count += 1
26
+ case check_cn(cn)
27
+ when :invalid
28
+ self.invalid += 1
29
+ 0
30
+ when :found
31
+ self.found += 1
32
+ 1
33
+ when :notfound
34
+ self.notfound += 1
35
+ 0
36
+ end
37
+ end
38
+
39
+
40
+ def check_cn(cn)
41
+ normalized = Lcsort.normalize(cn)
42
+ return :invalid if normalized.nil?
43
+ cats = hlb[cn]
44
+ if cats.empty?
45
+ :notfound
46
+ else
47
+ :found
48
+ end
49
+ end
50
+
51
+ def puts_pretty_output
52
+ puts '%9d total' % count
53
+ puts '%9d not recognized as LC call numbers' % invalid
54
+ puts '%9d with at least one HLB category' % found
55
+ puts '%9d with NO category' % notfound
56
+ end
57
+ end
58
+
59
+
60
+ def puts_output(f050, f852)
61
+ puts "050 fields"
62
+ f050.puts_pretty_output
63
+ puts "\n852h fields"
64
+ f852.puts_pretty_output
65
+ end
66
+
67
+ puts "Fetching/parsing HLB XML file"
68
+ filename = HighLevelBrowse::DB::FILENAME
69
+ dir = Dir.tmpdir()
70
+ fullpath = File.join(dir, filename)
71
+
72
+ hlb = if File.exist?(fullpath)
73
+ puts "Using file at #{fullpath}"
74
+ HighLevelBrowse.load(dir: dir)
75
+ else
76
+ HighLevelBrowse.fetch_and_save(dir: dir)
77
+ end
78
+
79
+ f050 = Counter.new(0, 0, 0, 0, hlb)
80
+ f852 = Counter.new(0, 0, 0, 0, hlb)
81
+ records = 0
82
+ matched_records = 0
83
+ possible_records = 0
84
+ puts "Beginning analysis of marc records with 2k record progress reports"
85
+ reader.each do |r|
86
+ records += 1
87
+ found = 0
88
+ possible = false
89
+ puts '%8d records processed so far' % records if records % 2_000 == 0
90
+ if r['050']
91
+ cns = r.fields('050').map { |x| x.map(&:value).join('') }
92
+ cns.each do |cn|
93
+ found += f050.update(cn)
94
+ possible = true
95
+ end
96
+ cns = r.fields('852').keep_if { |x| x['h'] }.map { |x| x['h'] }
97
+ cns.each do |cn|
98
+ found += f852.update(cn)
99
+ possible = true
100
+ end
101
+ end
102
+ matched_records += 1 if found > 0
103
+ possible_records += 1 if possible
104
+ end
105
+
106
+ puts "\n\n"
107
+ puts_output(f050, f852)
108
+ puts format(
109
+ %Q[\nOf %d records,
110
+ %d (%4.2f%%) had a field that often contains an LC Call Number
111
+ %d (%4.2f%%) of *those* had at least one HLB category],
112
+ records,
113
+ possible_records,
114
+ possible_records.to_f / records * 100,
115
+ matched_records,
116
+ matched_records.to_f / possible_records * 100)
117
+
118
+
119
+
120
+
121
+
122
+
@@ -0,0 +1,26 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'high_level_browse/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "high_level_browse"
8
+ spec.version = HighLevelBrowse::VERSION
9
+ spec.authors = ["Bill Dueber"]
10
+ spec.email = ["bill@dueber.com"]
11
+ spec.summary = %q{Map LC call numbers to academic categories.}
12
+ spec.homepage = ""
13
+ spec.license = "MIT"
14
+
15
+ spec.files = `git ls-files -z`.split("\x0")
16
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
17
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
18
+ spec.require_paths = ["lib"]
19
+
20
+ spec.add_dependency 'oga', '~> 2.1'
21
+ spec.add_dependency 'lcsort'
22
+
23
+ spec.add_development_dependency "bundler", "~> 1.6"
24
+ spec.add_development_dependency "rake"
25
+ spec.add_development_dependency "minitest"
26
+ end
@@ -0,0 +1,41 @@
1
+ require "high_level_browse/version"
2
+ require 'high_level_browse/db'
3
+ require 'uri'
4
+ require 'open-uri'
5
+
6
+ module HighLevelBrowse
7
+
8
+ SOURCE_URL = ENV['HLB_XML_ENDPOINT'] || 'https://www.lib.umich.edu/browse/categories/xml.php'
9
+
10
+ # Fetch a new version of the raw file and turn it into a db
11
+ # @return [DB] The loaded database
12
+ def self.fetch
13
+ uri = URI.parse(SOURCE_URL)
14
+ # Why on earth OpenURI::OpenRead is mixed into http but not https, I don't know
15
+ uri.extend OpenURI::OpenRead
16
+
17
+ xml = uri.read
18
+ return DB.new_from_xml(xml)
19
+ rescue => e
20
+ raise "Could not fetch xml from '#{SOURCE_URL}': #{e}"
21
+ end
22
+
23
+
24
+ # Fetch and save to the specified directory
25
+ # @param [String] dir The directory where the hlb.json.gz file will end up
26
+ # @return [DB] The fetched and saved database
27
+ def self.fetch_and_save(dir:)
28
+ db = self.fetch
29
+ db.save(dir: dir)
30
+ db
31
+ end
32
+
33
+
34
+ # Load from disk
35
+ # @param [String] dir The directory where the hlb.json.gz file is located
36
+ # @return [DB] The loaded database
37
+ def self.load(dir:)
38
+ DB.load(dir: dir)
39
+ end
40
+
41
+ end
@@ -0,0 +1,154 @@
1
+ require 'lcsort'
2
+ require 'high_level_browse/range_tree'
3
+
4
+
5
+ # An efficient set of CallNumberRanges from which to get topics
6
+ class HighLevelBrowse::CallNumberRangeSet < HighLevelBrowse::RangeTree
7
+
8
+
9
+ # Returns the array of topic arrays for the given LC string
10
+ # @param [String] raw_lc A raw LC string (eg., 'qa 112.3 .A4 1990')
11
+ # @return [Array<Array<String>>] Arrays of topic labels
12
+ def topics_for(raw_lc)
13
+ normalized = Lcsort.normalize(HighLevelBrowse::CallNumberRange.preprocess(raw_lc))
14
+ self.search(normalized).map(&:topic_array).uniq
15
+ end
16
+ end
17
+
18
+
19
+ # A callnumber-range keeps track of the original begin/end
20
+ # strings as well as the normalized versions, and can be
21
+ # serialized to JSON
22
+
23
+ class HighLevelBrowse::CallNumberRange
24
+ include Comparable
25
+
26
+ attr_reader :min, :max, :min_raw, :max_raw, :firstletter
27
+
28
+
29
+ attr_accessor :topic_array, :redundant
30
+
31
+ SPACE_OR_PUNCT = /\A[\s\p{Punct}]*(.*?)[\s\p{Punct}]*\Z/
32
+ DIGIT_TO_LETTER = /(\d)([A-Z])/i
33
+
34
+ # @nodoc
35
+ # Remove spaces/punctuation from the ends of the string
36
+ def self.strip_spaces_and_punct(str)
37
+ str.gsub(SPACE_OR_PUNCT, '\1')
38
+ end
39
+
40
+ # @nodoc
41
+ # Force a space between any digit->letter transition
42
+ def self.force_break_between_digit_and_letter(str)
43
+ str.gsub(DIGIT_TO_LETTER, '\1 \2')
44
+ end
45
+ # @nodoc
46
+ # Preprocess the string, removing spaces/punctuation off the end
47
+ # and forcing a space where there's a digit->letter transition
48
+ def self.preprocess(str)
49
+ str ||= ''
50
+ force_break_between_digit_and_letter(
51
+ strip_spaces_and_punct(str)
52
+ )
53
+ end
54
+
55
+
56
+ def initialize(min:, max:, topic_array:)
57
+ @illegal = false
58
+ @redundant = false
59
+ self.min = self.class.preprocess(min)
60
+ self.max = self.class.preprocess(max)
61
+ @topic_array = topic_array
62
+ @firstletter = self.min[0] unless @illegal
63
+ end
64
+
65
+
66
+ # Compare based on @min, then end
67
+ # @param [CallNumberRange] o the range to compare to
68
+ def <=>(o)
69
+ [self.min, self.max] <=> [o.min, o.max]
70
+ end
71
+
72
+ def to_s
73
+ "[#{self.min_raw} - #{self.max_raw}]"
74
+ end
75
+
76
+ def reconstitute(min, max, min_raw, max_raw, firstletter, topic_array)
77
+ @min = min
78
+ @max = max
79
+ @min_raw = min_raw
80
+ @max_raw = max_raw
81
+ @firstletter = firstletter
82
+ @topic_array = topic_array
83
+ end
84
+
85
+
86
+ # Two ranges are equal if their @min, @max, and topic array
87
+ # are all the same
88
+ # @param [CallNumberRange] o the range to compare to
89
+ def ==(other)
90
+ @min == other.min and
91
+ @max == other.max and
92
+ @topic_array == other.topic_array
93
+ end
94
+
95
+
96
+ # @nodoc
97
+ # JSON roundtrip
98
+ def to_json(*a)
99
+ {
100
+ 'json_class' => self.class.name,
101
+ 'data' => [@min, @max, @min_raw, @max_raw, @firstletter, @topic_array]
102
+ }.to_json(*a)
103
+ end
104
+
105
+ # @nodoc
106
+ def self.json_create(h)
107
+ cnr = self.allocate
108
+ cnr.reconstitute(*(h['data']))
109
+ cnr
110
+ end
111
+
112
+
113
+ # In both @min= and end=, we also rescue any parsing errors
114
+ # and simply set the @illegal flag so we can use it later on.
115
+ def min=(x)
116
+ @min_raw = x
117
+ possible_min = Lcsort.normalize(x)
118
+ if possible_min.nil? # didn't normalize
119
+ @illegal = true
120
+ nil
121
+ else
122
+ @min = possible_min
123
+ end
124
+ end
125
+
126
+ # Same as start. Set the illegal flag if we get an error
127
+ def max=(x)
128
+ @max_raw = x
129
+ possible_max = Lcsort.normalize(x)
130
+ if possible_max.nil? # didn't normalize
131
+ @illegal = true
132
+ nil
133
+ else
134
+ @max = possible_max + '~' # add a tilde to make it a true endpoint
135
+ end
136
+ end
137
+
138
+ def illegal?
139
+ @illegal
140
+ end
141
+
142
+
143
+ def surrounds(other)
144
+ @min <= other.min and @max >= other.max
145
+ end
146
+
147
+ def contains(x)
148
+ @min <= x and @max >= x
149
+ end
150
+
151
+ alias_method :cover?, :contains
152
+ alias_method :member?, :contains
153
+
154
+ end
@@ -0,0 +1,150 @@
1
+ require 'oga'
2
+ require 'high_level_browse/call_number_range'
3
+ require 'zlib'
4
+ require 'json'
5
+
6
+ class HighLevelBrowse::DB
7
+
8
+ # Hard-code filename. If you need more than one, put them
9
+ # in different directories
10
+ FILENAME = 'hlb.json.gz'
11
+
12
+ # Given a bunch of CallNumberRange objects, create a new
13
+ # database with an efficient structure for querying
14
+ # @param [Array<HighLevelBrowse::CallNumberRange>] array_of_ranges
15
+ def initialize(array_of_ranges)
16
+ @all = array_of_ranges
17
+ @ranges = self.create_letter_indexed_ranges(@all)
18
+ end
19
+
20
+ # Given an array of ranges, create efficient
21
+ # search structures
22
+ # @private
23
+ def create_letter_indexed_ranges(all)
24
+ bins = {}
25
+ ('A'..'Z').each do |letter|
26
+ cnrs = all.find_all {|x| x.firstletter == letter}
27
+ bins[letter] = HighLevelBrowse::CallNumberRangeSet.new(cnrs)
28
+ end
29
+ bins
30
+ end
31
+
32
+ # Get the topic arrays associated with this callnumber
33
+ # of the form:
34
+ # [
35
+ # [toplevel, secondlevel],
36
+ # [toplevel, secondlevel, thirdlevel],
37
+ # ...
38
+ # ]
39
+ # @param [String] raw_callnumber_string
40
+ # @return [Array<Array>] A (possibly empty) array of arrays of topics
41
+ def topics(*raw_callnumber_strings)
42
+ raw_callnumber_strings.reduce([]) do |acc, raw_callnumber_string|
43
+ firstletter = raw_callnumber_string.strip.upcase[0]
44
+ if @ranges.has_key? firstletter
45
+ acc + @ranges[firstletter].topics_for(raw_callnumber_string)
46
+ else
47
+ acc
48
+ end
49
+ end.uniq
50
+ end
51
+
52
+
53
+ alias_method :[], :topics
54
+
55
+ # Create a new object from a string with the XML
56
+ # in it.
57
+ # @param [String] xml The contents of the HLB XML dump
58
+ # (e.g., from 'https://www.lib.umich.edu/browse/categories/xml.php')
59
+ # @return [DB]
60
+ def self.new_from_xml(xml)
61
+ oga_doc_root = Oga.parse_xml(xml)
62
+ simple_array_of_cnrs = cnrs_within_oga_node(node: oga_doc_root)
63
+ self.new(simple_array_of_cnrs).freeze
64
+ end
65
+
66
+
67
+ # Save to disk
68
+ # @param [String] dir The directory where the hlb.json.gz file will be saved
69
+ # @return [DB] The loaded database
70
+ def save(dir:)
71
+ Zlib::GzipWriter.open(File.join(dir, FILENAME)) do |out|
72
+ out.puts JSON.fast_generate(@all)
73
+ end
74
+ end
75
+
76
+
77
+ # Load from disk
78
+ # @param [String] dir The directory where the hlb.json.gz file is located
79
+ # @return [DB] The loaded database
80
+ def self.load(dir:)
81
+ simple_array_of_cnrs = Zlib::GzipReader.open(File.join(dir, FILENAME)) do |infile|
82
+ JSON.load(infile.read).to_a
83
+ end
84
+ db = self.new(simple_array_of_cnrs)
85
+ db.freeze
86
+ db
87
+ end
88
+
89
+
90
+ # Freeze everything
91
+ # @return [DB] the frozen db
92
+ def freeze
93
+ @ranges.freeze
94
+ @all.freeze
95
+ self
96
+ end
97
+
98
+ private
99
+
100
+ # Recurse through the parsed XML document, at each stage keeping track of
101
+ # * where we are (what are the xpath children?)
102
+ # * what the current topics are ([level1, level2])
103
+ # Get all the call numbers assocaited with the topic represented by the given node,
104
+ # as well as all the children of the given node, and send it back as a big ol' array
105
+ # @param [Oga::Node] node A node of the parsed HLB XML file
106
+ # @param [Array<String>] decendent_xpaths A list of xpaths to the decendents of this node
107
+ # @param [Array<String>] topic_array An array with all levels of the topics associated with this node
108
+ # @return [Array<HighLevelBrowse::CallNumberRange>]
109
+ def self.cnrs_within_oga_node(node:, decendent_xpaths: ['/hlb/subject', 'topic', 'sub-topic'], topic_array: [])
110
+ if decendent_xpaths.empty?
111
+ [] # base case -- we're as low as we're going to go
112
+ else
113
+ current_xpath_component = decendent_xpaths[0]
114
+ new_xpath = decendent_xpaths[1..-1]
115
+ new_topic = topic_array.dup
116
+ new_topic.push node.get(:name) unless node == node.root_node # skip the root
117
+ cnrs = []
118
+ # For each sub-component, get both the call-number-ranges (cnrs) assocaited
119
+ # with this level, as well as recusively getting from all the children
120
+ node.xpath(current_xpath_component).each do |c|
121
+ cnrs += call_numbers_list_from_leaves(node: c, topic_array: new_topic)
122
+ cnrs += cnrs_within_oga_node(node: c, decendent_xpaths: new_xpath, topic_array: new_topic)
123
+ end
124
+ cnrs
125
+ end
126
+ end
127
+
128
+
129
+ # Given a second-to-lowest-level node, get its topic and
130
+ # extract call number ranges from its children
131
+ def self.call_numbers_list_from_leaves(node:, topic_array:)
132
+ cnrs = []
133
+ new_topic = topic_array.dup.push node.get(:name)
134
+ node.xpath('call-numbers').each do |cn_node|
135
+ min = cn_node.get(:start)
136
+ max = cn_node.get(:end)
137
+
138
+ new_cnr = HighLevelBrowse::CallNumberRange.new(min: min, max: max, topic_array: new_topic)
139
+ if new_cnr.illegal?
140
+ # do some sort of logging
141
+ else
142
+ cnrs.push new_cnr
143
+ end
144
+ end
145
+ cnrs
146
+
147
+ end
148
+
149
+
150
+ end
@@ -0,0 +1,90 @@
1
+ # Never released as a gem, as near as I can tell.
2
+ # Taken from https://github.com/clearhaus/range-tree,
3
+ # which was released under the MIT license
4
+ # by ClearHaus (https://www.clearhaus.com/)
5
+
6
+ # Namespaced to avoid conflicts with other range_tree
7
+ # gems
8
+
9
+ module HighLevelBrowse
10
+ class RangeTree
11
+ class Node
12
+ def initialize(left, range, right, min, max)
13
+ @left = left
14
+ @range = range
15
+ @right = right
16
+ @min = min || range.min
17
+ @max = max || range.max
18
+ end
19
+
20
+ attr_reader :left, :range, :right, :min, :max
21
+ end
22
+
23
+ def initialize(ranges, sorted: false)
24
+ # ranges.sort_by! {|r| [r.min, r.max]} unless sorted
25
+ # It's only required to be sorted by `r.min`, but if many ranges has the
26
+ # same left endpoint, then it's more efficient if also secondarily sorted by
27
+ # the right endpoint (or equivalently by the length).
28
+
29
+ @root = RangeTree.split(ranges.sort{|a,b| (a.min <=> b.min) || (a.max <=> b.max)})
30
+ end
31
+
32
+ attr_reader :root
33
+
34
+ def self.split(ranges)
35
+ return nil if ranges.empty?
36
+
37
+ middle = ranges.length/2
38
+
39
+ left = split(ranges.slice(0, middle)) # Handle middle == 0 correctly.
40
+ range = ranges[middle] # Current range.
41
+ right = split(ranges[(middle+1)..-1]) # Handle middle == ranges.length correctly.
42
+
43
+ ary = [left, range, right].compact
44
+
45
+ Node.new(left, range, right,
46
+ ary.map(&:min).min, # Subtree's min.
47
+ ary.map(&:max).max) # Subtree's max.
48
+ end
49
+
50
+ def search(range, limit: Float::INFINITY)
51
+ range = range.is_a?(Range) ? range : (range..range)
52
+
53
+ result = []
54
+ RangeTree.search_helper(range, @root, result, limit)
55
+
56
+ result
57
+ end
58
+
59
+ def self.search_helper(q, root, result, limit)
60
+ return if root.nil?
61
+
62
+ # Visit left child?
63
+ if (l = root.left) and l.max and q.min and \
64
+ not l.max < q.min # The interesting part.
65
+ search_helper(q, root.left, result, limit)
66
+ end
67
+
68
+ return if result.length >= limit
69
+ # Yes, it needs to be checked here rather than in the top. Otherwise, at the
70
+ # point of checking, there wasn't added too many, but after left child has
71
+ # been checked, we might hit the limit and then, "this" will add one as
72
+ # well.
73
+
74
+ # Add root?
75
+ result << root.range if RangeTree.ranges_intersect?(q, root.range)
76
+
77
+ # Visit right child?
78
+ if (r = root.right) and q.max and r.min and \
79
+ not q.max < r.min # The interesting part.
80
+ search_helper(q, root.right, result, limit)
81
+ end
82
+ end
83
+
84
+ def self.ranges_intersect?(a, b)
85
+ return false unless a.min && a.max && b.min && b.max
86
+
87
+ a.min <= b.max && a.max >= b.min
88
+ end
89
+ end
90
+ end
@@ -0,0 +1,3 @@
1
+ module HighLevelBrowse
2
+ VERSION = "0.1.0"
3
+ end
@@ -0,0 +1,14 @@
1
+ $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
2
+
3
+ # Both oga and minitest have stupid warnings that I don't want to
4
+ # hear about
5
+
6
+ verbose = $VERBOSE
7
+ $VERBOSE = nil
8
+ require 'oga'
9
+ require 'minitest'
10
+ require 'minitest/spec'
11
+ require 'minitest/autorun'
12
+ $VERBOSE = verbose
13
+
14
+ require 'high_level_browse'
@@ -0,0 +1,27 @@
1
+ require 'minitest_helper'
2
+
3
+ require 'json'
4
+ TESTDIR = File.expand_path(File.dirname(__FILE__))
5
+
6
+ describe "loads" do
7
+ it "loads" do
8
+ assert true
9
+ end
10
+
11
+ it "has a version" do
12
+ HighLevelBrowse::VERSION.wont_be_nil
13
+ end
14
+ end
15
+
16
+ describe "Works the same as before" do
17
+ it "gets the same output for 30k randomly chosen call numbers" do
18
+ h = HighLevelBrowse.fetch_and_save(dir: TESTDIR)
19
+ JSON.load(File.open(File.join(TESTDIR, '30k_random_old_mappings.json'))).each do |rec|
20
+ cn = rec['cn'].strip
21
+ newcats = h[cn]
22
+ next if rec['jar'].empty?
23
+ assert_equal [cn, rec['jar'].sort], [rec['cn'], newcats.sort]
24
+ end
25
+
26
+ end
27
+ end
metadata ADDED
@@ -0,0 +1,138 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: high_level_browse
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Bill Dueber
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2017-06-02 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: oga
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '2.1'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '2.1'
27
+ - !ruby/object:Gem::Dependency
28
+ name: lcsort
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: bundler
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: '1.6'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: '1.6'
55
+ - !ruby/object:Gem::Dependency
56
+ name: rake
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ">="
67
+ - !ruby/object:Gem::Version
68
+ version: '0'
69
+ - !ruby/object:Gem::Dependency
70
+ name: minitest
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ">="
74
+ - !ruby/object:Gem::Version
75
+ version: '0'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - ">="
81
+ - !ruby/object:Gem::Version
82
+ version: '0'
83
+ description:
84
+ email:
85
+ - bill@dueber.com
86
+ executables:
87
+ - fetch_new_hlb
88
+ - hlb
89
+ - test_marc_file_for_hlb
90
+ extensions: []
91
+ extra_rdoc_files: []
92
+ files:
93
+ - ".gitignore"
94
+ - ".travis.yml"
95
+ - Gemfile
96
+ - LICENSE.txt
97
+ - README.md
98
+ - Rakefile
99
+ - bench/bench.rb
100
+ - bench/hlb.json.gz
101
+ - bin/fetch_new_hlb
102
+ - bin/hlb
103
+ - bin/test_marc_file_for_hlb
104
+ - high_level_browse.gemspec
105
+ - lib/high_level_browse.rb
106
+ - lib/high_level_browse/call_number_range.rb
107
+ - lib/high_level_browse/db.rb
108
+ - lib/high_level_browse/range_tree.rb
109
+ - lib/high_level_browse/version.rb
110
+ - test/minitest_helper.rb
111
+ - test/test_high_level_browse.rb
112
+ homepage: ''
113
+ licenses:
114
+ - MIT
115
+ metadata: {}
116
+ post_install_message:
117
+ rdoc_options: []
118
+ require_paths:
119
+ - lib
120
+ required_ruby_version: !ruby/object:Gem::Requirement
121
+ requirements:
122
+ - - ">="
123
+ - !ruby/object:Gem::Version
124
+ version: '0'
125
+ required_rubygems_version: !ruby/object:Gem::Requirement
126
+ requirements:
127
+ - - ">="
128
+ - !ruby/object:Gem::Version
129
+ version: '0'
130
+ requirements: []
131
+ rubyforge_project:
132
+ rubygems_version: 2.6.8
133
+ signing_key:
134
+ specification_version: 4
135
+ summary: Map LC call numbers to academic categories.
136
+ test_files:
137
+ - test/minitest_helper.rb
138
+ - test/test_high_level_browse.rb