bento_search 1.4.4 → 1.5.0

Files changed (36)
  1. checksums.yaml +4 -4
  2. data/README.md +41 -19
  3. data/app/models/bento_search/result_item.rb +1 -1
  4. data/app/models/bento_search/search_engine.rb +36 -3
  5. data/app/models/bento_search/search_engine/capabilities.rb +14 -0
  6. data/app/search_engines/bento_search/doaj_articles_engine.rb +279 -0
  7. data/app/search_engines/bento_search/ebsco_host_engine.rb +27 -7
  8. data/app/search_engines/bento_search/google_books_engine.rb +8 -1
  9. data/app/search_engines/bento_search/mock_engine.rb +8 -2
  10. data/app/search_engines/bento_search/scopus_engine.rb +27 -8
  11. data/app/search_engines/bento_search/summon_engine.rb +1 -1
  12. data/app/search_engines/bento_search/worldcat_sru_dc_engine.rb +22 -3
  13. data/config/locales/en.yml +5 -2
  14. data/lib/bento_search/version.rb +1 -1
  15. data/test/dummy/config/environments/development.rb +0 -4
  16. data/test/dummy/config/environments/production.rb +0 -4
  17. data/test/search_engines/doaj_articles_engine_test.rb +200 -0
  18. data/test/search_engines/ebsco_host_engine_test.rb +38 -0
  19. data/test/search_engines/google_books_engine_test.rb +18 -2
  20. data/test/search_engines/scopus_engine_test.rb +45 -1
  21. data/test/search_engines/search_engine_base_test.rb +59 -0
  22. data/test/search_engines/worldcat_sru_dc_engine_test.rb +17 -0
  23. data/test/vcr_cassettes/doaj_articles/basic_search.yml +97 -0
  24. data/test/vcr_cassettes/doaj_articles/catches_errors.yml +42 -0
  25. data/test/vcr_cassettes/doaj_articles/complex_multi-field.yml +67 -0
  26. data/test/vcr_cassettes/doaj_articles/live__get_identifier__round_trip.yml +387 -0
  27. data/test/vcr_cassettes/doaj_articles/live_get_identifier__raises_on_no_results.yml +41 -0
  28. data/test/vcr_cassettes/doaj_articles/multifield_author-title.yml +79 -0
  29. data/test/vcr_cassettes/doaj_articles/pagination.yml +691 -0
  30. data/test/vcr_cassettes/ebscohost/affiliation_search.yml +929 -0
  31. data/test/vcr_cassettes/ebscohost/multi-field_author_title.yml +122 -0
  32. data/test/vcr_cassettes/ebscohost/multi-field_citation_numbers.yml +122 -0
  33. data/test/vcr_cassettes/scopus/multi-field_search.yml +55 -0
  34. data/test/vcr_cassettes/scopus/multi-fielded_citation_details_search.yml +86 -0
  35. data/test/vcr_cassettes/worldcat_sru_dc/multi_field_search.yml +1839 -0
  36. metadata +31 -2
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: e070ceebebab6e8786050986873ee02f05990db9
4
- data.tar.gz: e69e6151f6e2205de884717a1b4fd2cfcce95c40
3
+ metadata.gz: 8824757e65434f6a003c9748cf68e51cddcc3795
4
+ data.tar.gz: 081397571dbcb6307fdf9f162f5493e1d75d1e77
5
5
  SHA512:
6
- metadata.gz: 43541b9f73943bd5d61e648f73e337104df2dc8a8cc01833106a92614a1cbcb14094eaf2e8364fbda1a135f6ed165275745aa3a11e0263b16a3a23b93e995159
7
- data.tar.gz: 9d515c11008d35b0bbca0ab20b744818f9a17aabc906990f14715e87c43da9b423cd0a3e22ff3f126e18b12f4a79cf7609a7ea12e1c0f86178fb56c11424c886
6
+ metadata.gz: b5f11bd2566934fc22409ec97c9aa89d91e10a497948839ce6708e9a7c90bbf1ee1d951d5d2585641aa1dc0afb0fb082ac4cd258d64b25f8589bfda112233fd5
7
+ data.tar.gz: 32ac0a5ef5be6ec91d2b9bd034e7332d7ab52812743e56aeca97f59832c69fba45a20153102264730e6e18c9d9f97dfb26f4c0d51d1589eaf651fa9d2632823e
data/README.md CHANGED
@@ -10,34 +10,31 @@ Rails 3.x or 4.x. ruby 1.9.3+
10
10
  ### Goals: To help you
11
11
 
12
12
  * **Get up and running as quickly as possible** with searching and displaying
13
- results from a third-party service. Solutions to idiosyncracies and
14
- undocumented workarounds are encoded in a shared codebase, which abstracts
15
- everything to a good, simple code API giving you building blocks to focus
16
- on your needs, not the search service's problems.
13
+ results from a third-party service. Simple common code API, with idiosyncrasies and
14
+ undocumented workarounds abstracted away.
17
15
  * Let you switch out one search service for another in an already built
18
16
  application with as little code rewriting as possible. **Avoid vendor lock-in**.
19
17
  * Give you the harness to **write adapters for new search services**, without
20
18
  having to rewrite common general functionality, just focus on the interface
21
19
  with the new API you want to support.
22
20
 
23
- bento_search is focused on use cases for academic libraries, which is mainly
24
- evidenced by the search engine adapters currently included, and by the
25
- generalized domain models including fields that matter in our domain (issn,
26
- vol/issue/page, etc), and some targetted functionality (OpenURL generation).
27
- But it ought to be useful for more general basic use
28
- cases too (we include a google site search adapter for instance).
21
+ bento_search is focused on use cases for academic libraries; the shared
22
+ model for search results includes fields that matter in our domain (issn,
23
+ vol/issue/page, etc), although they ought to have what's needed for general
24
+ basic use too. There is some targeted functionality for academic library/publishing use
25
+ (eg OpenURL generation).
29
26
 
30
27
  Adapters currently included in bento_search
31
28
 
32
- * Google Books (requires free api key)
33
- * Scopus (requires license)
34
- * Serial Solution Summon (requires license)
35
- * Ex Libris Primo (requires license)
36
- * EBSCO Discovery Service (requires license)
37
- * EBSCOHost 'traditional' API (requires license)
38
- * WorldCat Search (requires OCLC membership to get api key)
39
- * Google Site Search (requires sign-up for more than 100 searches/day)
40
- * JournalTOCs (limited support for fetching current articles by ISSN, free but requires registration)
29
+ * [Google Books](https://books.google.com/) (requires free api key)
30
+ * [Scopus](http://www.elsevier.com/solutions/scopus) (requires license)
31
+ * [Serial Solution Summon](http://www.proquest.com/products-services/The-Summon-Service.html) (requires license)
32
+ * [Ex Libris Primo](http://www.exlibrisgroup.com/category/PrimoOverview) (requires license)
33
+ * [EBSCO Discovery Service](https://www.ebscohost.com/discovery) (requires license)
34
+ * [EBSCOHost](https://www.ebscohost.com/) 'traditional' API (requires license)
35
+ * [WorldCat Search](https://www.worldcat.org/) (requires OCLC membership to get api key)
36
+ * [Google Site Search](https://www.google.com/work/search/products/gss.html) (requires sign-up for more than 100 searches/day)
37
+ * [JournalTOCs](http://www.journaltocs.hw.ac.uk/) (limited support for fetching current articles by ISSN, free but requires registration)
41
38
 
42
39
 
43
40
 
@@ -218,6 +215,31 @@ Kaminari's paginate method:
218
215
  <%= paginate results.pagination %>
219
216
  ~~~~
220
217
 
218
+ ### Multi-field search
219
+
220
+ Some search engines support multi-field searching; an engine advertises whether it does:
221
+
222
+ engine_instance.multi_field_search? # => `true` or `false`
223
+
224
+ The bento_search multi-field search feature always combines multiple
225
+ fields with boolean 'and' (intersection). You call a multi-field search
226
+ with a :query hash argument whose value is a hash of search-fields and
227
+ queries:
228
+
229
+ engine.search(:query => {
230
+ :title => '"Reflections on the History of Debt Resistance"',
231
+ :author => 'Caffentzis'
232
+ })
233
+
234
+ The search field keys can be either semantic_search_field names, or internal
235
+ engine search fields, or a combination. If the key matches a semantic search field
236
+ declared for the engine, that will be preferred.
237
+
238
+ This can be used to expose a multi-field search to users, and the `bento_field_hash_for`
239
+ helper method might be helpful in creating your UI. But this is also useful for looking
240
+ up known-item citations -- either by author/title, or issn/volume/issue/page, or doi, or
241
+ anything else -- as back-end support for various possible functions.
242
+
221
243
  ### Concurrent searching
222
244
 
223
245
  If you're going to search 2 or more search engines at once, you'll want to execute
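The README addition above mentions known-item citation lookups as a use for multi-field search. As a rough illustration only, the following sketch builds a `:query` hash from a citation, keeping just the semantic field names that appear in this release and dropping blank values. The helper name `known_item_query` and the `SEMANTIC_FIELDS` constant are hypothetical and not part of bento_search.

```ruby
# Semantic field names taken from the DOAJ engine definitions in this diff.
# The constant and helper below are illustrative, not bento_search API.
SEMANTIC_FIELDS = %i[title author issn volume issue start_page doi].freeze

# Build a multi-field :query hash from a citation hash, keeping only
# recognized semantic fields whose values are not blank.
def known_item_query(citation)
  citation
    .select { |field, value| SEMANTIC_FIELDS.include?(field) && !value.to_s.strip.empty? }
    .transform_values(&:to_s)
end

# Intended use, assuming `engine` answers true to multi_field_search?:
#   engine.search(:query => known_item_query(:issn => "1234-5678", :volume => 3))
```

Since bento_search combines the fields with boolean AND, leaving out blank fields matters: an empty value would otherwise become a constraint of its own.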
data/app/models/bento_search/result_item.rb CHANGED
@@ -224,7 +224,7 @@ module BentoSearch
224
224
  # Short summary of item.
225
225
  # Mark .html_safe if it includes html -- creator is responsible
226
226
  # for making sure html is safely sanitized and/or stripped,
227
- # rails ActionView::Helpers::Sanistize #sanitize and #strip_tags
227
+ # rails ActionView::Helpers::SanitizeHelper #sanitize and #strip_tags
228
228
  # may be helpful.
229
229
  serializable_attr_accessor :abstract
230
230
 
data/app/models/bento_search/search_engine.rb CHANGED
@@ -329,11 +329,44 @@ module BentoSearch
329
329
  if arguments[:sort]
330
330
  arguments[:sort] = arguments[:sort].to_s
331
331
  end
332
+
333
+
334
+ # Multi-field search
335
+ if arguments[:query].kind_of? Hash
336
+ # Only if allowed
337
+ unless self.multi_field_search?
338
+ raise ArgumentError.new("You supplied a :query as a hash, but this engine (#{self.class}) does not support multi-field search. #{arguments[:query].inspect}")
339
+ end
340
+ # Multi-field search incompatible with :search_field or :semantic_search_field
341
+ if arguments[:search_field].present?
342
+ raise ArgumentError.new("You supplied a :query as a Hash, but also a :search_field, you can only use one. #{arguments.inspect}")
343
+ end
344
+ if arguments[:semantic_search_field].present?
345
+ raise ArgumentError.new("You supplied a :query as a Hash, but also a :semantic_search_field, you can only use one. #{arguments.inspect}")
346
+ end
347
+
348
+ # translate semantic fields, raising for unfound fields if configured
349
+ arguments[:query].transform_keys! do |key|
350
+ new_key = self.semantic_search_map[key.to_s] || key
351
+
352
+ if ( config_arg(arguments, :unrecognized_search_field) == "raise" &&
353
+ ! self.search_keys.include?(new_key))
354
+ raise ArgumentError.new("#{self.class.name} does not know about search_field #{new_key}, in query Hash #{arguments[:query]}")
355
+ end
356
+
357
+ new_key
358
+ end
359
+
360
+ end
332
361
 
333
362
  # translate semantic_search_field to search_field, or raise if
334
363
  # can't.
335
- if (semantic = arguments.delete(:semantic_search_field)) && ! semantic.blank?
336
- mapped = self.semantic_search_map[semantic.to_s]
364
+ if (semantic = arguments.delete(:semantic_search_field)) && ! semantic.blank?
365
+ semantic = semantic.to_s
366
+ # Legacy publication_title is now called source_title
367
+ semantic = "source_title" if semantic == "publication_title"
368
+
369
+ mapped = self.semantic_search_map[semantic]
337
370
  if config_arg(arguments, :unrecognized_search_field) == "raise" && ! mapped
338
371
  raise ArgumentError.new("#{self.class.name} does not know about :semantic_search_field #{semantic}")
339
372
  end
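The hunk above translates semantic field names in a `:query` hash into the engine's internal field names, optionally raising on unknown fields. A minimal plain-Ruby sketch of that idea follows, assuming a hypothetical two-entry map. `SEMANTIC_MAP`, `KNOWN_FIELDS`, `normalize_query_fields`, and the `strict:` flag are illustrative names, and unlike the diff this version always returns string keys and builds a new hash instead of mutating the argument.

```ruby
# Hypothetical semantic-name => internal-name map for one engine.
SEMANTIC_MAP = {
  "title"  => "bibjson.title",
  "author" => "bibjson.author.name"
}.freeze

# Internal field names this imaginary engine accepts.
KNOWN_FIELDS = (SEMANTIC_MAP.values + ["id"]).freeze

# Return a new hash whose keys are internal field names.
# With strict: true, an unrecognized field raises ArgumentError.
def normalize_query_fields(query, strict: false)
  query.each_with_object({}) do |(key, value), out|
    internal = SEMANTIC_MAP[key.to_s] || key.to_s
    if strict && !KNOWN_FIELDS.include?(internal)
      raise ArgumentError, "unknown search field #{internal.inspect}"
    end
    out[internal] = value
  end
end
```

A semantic name wins when a key could be read either way, which matches the README statement that a declared semantic search field is preferred.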
@@ -361,7 +394,7 @@ module BentoSearch
361
394
 
362
395
 
363
396
  protected
364
-
397
+
365
398
  # get value of an arg that can be supplied in search args OR config,
366
399
  # with search_args over-ridding config. Also normalizes value to_s
367
400
  # (for symbols/strings).
data/app/search_engines/bento_search/search_engine/capabilities.rb CHANGED
@@ -66,6 +66,20 @@ module BentoSearch::SearchEngine::Capabilities
66
66
  end.compact
67
67
  ]
68
68
  end
69
+
70
+
71
+ # Engines that support multi-field search should
72
+ # override to return true. Returns false in base
73
+ # default implementation.
74
+ #
75
+ # If an engine returns true here, it can receive in :query
76
+ # a Hash of multiple fields/values. The fields will all be
77
+ # normalized to internal names before engine receives them.
78
+ # The multi-field search is meant to be run as a boolean AND
79
+ # of all field/values.
80
+ def multi_field_search?
81
+ return false
82
+ end
69
83
 
70
84
 
71
85
  end
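The capability added above follows a default-false, override-to-opt-in pattern. The sketch below shows that pattern in isolation with stand-in classes. `Capabilities`, `SingleFieldEngine`, `MultiFieldEngine`, and `search_with` are hypothetical names used only for illustration and do not reproduce the real bento_search classes.

```ruby
# Stand-in for the capabilities module: the default says "not supported".
module Capabilities
  def multi_field_search?
    false
  end
end

# An engine that keeps the default.
class SingleFieldEngine
  include Capabilities
end

# An engine that opts in by overriding the predicate.
class MultiFieldEngine
  include Capabilities

  def multi_field_search?
    true
  end
end

# Guard similar in spirit to the check in the search_engine.rb hunk:
# a Hash query is only accepted by engines that advertise support.
def search_with(engine, query)
  if query.is_a?(Hash) && !engine.multi_field_search?
    raise ArgumentError, "#{engine.class} does not support multi-field search"
  end
  :ok
end
```

Calling code can check `multi_field_search?` before deciding whether to offer a multi-field form, instead of rescuing the `ArgumentError`.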
data/app/search_engines/bento_search/doaj_articles_engine.rb ADDED
@@ -0,0 +1,279 @@
1
+ require 'httpclient'
2
+ require 'http_client_patch/include_client'
3
+
4
+ require 'json'
5
+
6
+ module BentoSearch
7
+ # DOAJ Articles search.
8
+ # https://doaj.org/api/v1/docs
9
+ #
10
+ # Phrase searches with double quotes are respected.
11
+ #
12
+ # Supports #get by unique_id feature
13
+ #
14
+ class DoajArticlesEngine
15
+ include BentoSearch::SearchEngine
16
+ include ActionView::Helpers::SanitizeHelper
17
+
18
+
19
+ class_attribute :http_timeout
20
+ self.http_timeout = 10
21
+
22
+ extend HTTPClientPatch::IncludeClient
23
+ include_http_client do |client|
24
+ client.connect_timeout = client.send_timeout = client.receive_timeout = self.http_timeout
25
+ end
26
+
27
+ class_attribute :base_url
28
+ self.base_url = "https://doaj.org/api/v1/search/articles/"
29
+
30
+ def search_implementation(arguments)
31
+ query_url = args_to_search_url(arguments)
32
+
33
+ results = Results.new
34
+
35
+ begin
36
+ Rails.logger.debug("DoajEngine: requesting #{query_url}")
37
+ response = http_client.get( query_url )
38
+ json = JSON.parse(response.body)
39
+ rescue TimeoutError, HTTPClient::TimeoutError,
40
+ HTTPClient::ConfigurationError, HTTPClient::BadResponseError,
41
+ JSON::ParserError => e
42
+ results.error ||= {}
43
+ results.error[:exception] = e
44
+ end
45
+
46
+ if ( response.nil? || json.nil? ||
47
+ (! HTTP::Status.successful? response.status) ||
48
+ (json && json["error"]))
49
+
50
+ results.error ||= {}
51
+ results.error[:status] = response.status if response
52
+ results.error[:message] = json["error"] if json && json["error"]
53
+
54
+ return results
55
+ end
56
+
57
+ results.total_items = json["total"]
58
+
59
+ (json["results"] || []).each do |item_response|
60
+ results << hash_to_item(item_response)
61
+ end
62
+
63
+ return results
64
+ end
65
+
66
+ def get(unique_id)
67
+ results = search(unique_id, :search_field => "id")
68
+
69
+ raise (results.error[:exception] || StandardError.new(results.error[:message] || results.error[:status])) if results.failed?
70
+ raise BentoSearch::NotFound.new("For id: #{unique_id}") if results.length == 0
71
+ raise BentoSearch::TooManyFound.new("For id: #{unique_id}") if results.length > 1
72
+
73
+ results.first
74
+ end
75
+
76
+
77
+ def args_to_search_url(arguments)
78
+ query = if arguments[:query].kind_of?(Hash)
79
+ # multi-field query
80
+ arguments[:query].collect {|field, query| fielded_query(query, field)}.join(" ")
81
+ else
82
+ fielded_query(arguments[:query], arguments[:search_field])
83
+ end
84
+
85
+ # We need to escape this for going in a PATH component,
86
+ # not a query. So space can't be "+", it needs to be "%20",
87
+ # and indeed DOAJ API does not like "+".
88
+ #
89
+ # But neither CGI.escape nor URI.escape does quite
90
+ # the right kind of escaping, seems to work out
91
+ # if we do CGI.escape but then replace '+'
92
+ # with '%20'
93
+ escaped_query = CGI.escape(query).gsub('+', '%20')
94
+ url = self.base_url + escaped_query
95
+
96
+ query_args = {}
97
+
98
+ if arguments[:per_page]
99
+ query_args["pageSize"] = arguments[:per_page]
100
+ end
101
+
102
+ if arguments[:page]
103
+ query_args["page"] = arguments[:page]
104
+ end
105
+
106
+ if arguments[:sort] &&
107
+ (defn = sort_definitions[arguments[:sort]]) &&
108
+ (value = defn[:implementation])
109
+ query_args["sort"] = value
110
+ end
111
+
112
+ query = query_args.to_query
113
+ url = url + "?" + query if query.present?
114
+
115
+ return url
116
+ end
117
+
118
+ # Prepares a DOAJ API (elastic search) query component for
119
+ # given textual query in a given field (or default non-fielded search)
120
+ #
121
+ # Separates query string into tokens (bare words and phrases),
122
+ # so they can each be made mandatory for ElasticSearch. Default
123
+ # DOAJ API makes them all optional, with a very low mm, which
124
+ # leads to low-precision odd looking results for standard use
125
+ # cases.
126
+ #
127
+ # Escapes all remaining special characters as literals (not including
128
+ # double quotes which can be used for phrases, which are respected. )
129
+ #
130
+ # Eg:
131
+ # fielded_query('apple orange "strawberry banana"', field_name)
132
+ # # => '+field_name:(+apple +orange +"strawberry banana")'
133
+ #
134
+ # The "+" prefixed before field-name is to make sure all separate
135
+ # fields are also mandatory when doing multi-field searches. It should
136
+ # make no difference for a single-field search.
137
+ def fielded_query(query, field = nil)
138
+ if field.present?
139
+ "+#{field}:(#{prepare_mandatory_terms(query)})"
140
+ else
141
+ prepare_mandatory_terms(query)
142
+ end
143
+ end
144
+
145
+ # Takes a query string, prepares an ElasticSearch query
146
+ # doing what we want:
147
+ # * tokenizes into bare words and double-quoted phrases
148
+ # * Escapes other punctuation to be literal not ElasticSearch operator.
149
+ # (Does NOT do URI escaping)
150
+ # * Makes each token mandatory with an ElasticSearch "+" operator prefixed.
151
+ def prepare_mandatory_terms(query)
152
+ # use string split with regex to too-cleverly split into space
153
+ # separated terms and phrases, keeping phrases as unit.
154
+ terms = query.split %r{[[:space:]]+|("[^"]+")}
155
+ # Wound up with some empty strings, get rid of em
156
+ terms.delete_if {|t| t.blank?}
157
+
158
+ terms.collect {|token| "+" + escape_query(token)}.join(" ")
159
+ end
160
+
161
+ # Converts from item found in DOAJ results to BentoSearch::ResultItem
162
+ def hash_to_item(hash)
163
+ item = ResultItem.new
164
+
165
+ bibjson = hash["bibjson"] || {}
166
+
167
+ item.unique_id = hash["id"]
168
+
169
+ # Hard-code to Article, we don't get any format information
170
+ item.format = "Article"
171
+
172
+ item.title = bibjson["title"]
173
+
174
+
175
+ item.start_page = bibjson["start_page"]
176
+ item.end_page = bibjson["end_page"]
177
+
178
+ item.year = bibjson["year"]
179
+ if (year = bibjson["year"].to_i) && (month = bibjson["month"].to_i)
180
+ if year != 0 && month != 0
181
+ item.publication_date = Date.new(bibjson["year"].to_i, bibjson["month"].to_i)
182
+ end
183
+ end
184
+
185
+ item.abstract = sanitize(bibjson["abstract"]) if bibjson.has_key?("abstract")
186
+
187
+ journal = bibjson["journal"] || {}
188
+ item.volume = journal["volume"]
189
+ item.issue = journal["number"]
190
+ item.source_title = journal["title"]
191
+ item.publisher = journal["publisher"]
192
+ item.language_str = journal["language"].try(:first)
193
+
194
+ (bibjson["identifier"] || []).each do |id_hash|
195
+ case id_hash["type"]
196
+ when "doi"
197
+ item.doi = id_hash["id"]
198
+ when "pissn"
199
+ item.issn = id_hash["id"]
200
+ end
201
+ end
202
+
203
+ (bibjson["author"] || []).each do |author_hash|
204
+ if author_hash.has_key?("name")
205
+ author = Author.new(:display => author_hash["name"])
206
+ item.authors << author
207
+ end
208
+ end
209
+
210
+ # I _think_ DOAJ articles results always only have one link,
211
+ # and it may always be of type 'fulltext'
212
+ link_hash = (bibjson["link"] || []).first
213
+ if link_hash && link_hash["url"]
214
+ item.link = link_hash["url"]
215
+ item.link_is_fulltext = true if link_hash["type"] == "fulltext"
216
+ end
217
+
218
+ return item
219
+ end
220
+
221
+ # Escape special chars in query, Doaj says it's elastic search,
222
+ # punctuation that needs to be escaped and how to escape (backslash)
223
+ # for ES documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
224
+ #
225
+ # We do not escape double quotes, want to allow them for phrases.
226
+ #
227
+ # This method does NOT return URI-escaped, it returns literal, escaped for ES.
228
+ def escape_query(q)
229
+ q.gsub(/([\+\-\=\&\|\>\<\!\(\)\{\}\[\]\^\~\*\?\:\\\/])/) {|m| "\\#{$1}"}
230
+ end
231
+
232
+
233
+ ###########
234
+ # BentoSearch::SearchEngine API
235
+ ###########
236
+
237
+ def max_per_page
238
+ 100
239
+ end
240
+
241
+ def search_field_definitions
242
+ { nil => {:semantic => :general},
243
+ "bibjson.title" => {:semantic => :title},
244
+ # Using 'exact' seems to produce much better results for
245
+ # author, don't entirely understand what's up.
246
+ "bibjson.author.name" => {:semantic => :author},
247
+ "publisher" => {:semantic => :publisher},
248
+ "bibjson.subject.term" => {:semantic => :subject},
249
+ "bibjson.journal.title" => {:semantic => :source_title},
250
+ "issn" => {:semantic => :issn},
251
+ "doi" => {:semantic => :doi},
252
+ "bibjson.journal.volume" => {:semantic => :volume},
253
+ "bibjson.journal.number" => {:semantic => :issue},
254
+ "bibjson.start_page" => {:semantic => :start_page},
255
+ "license" => {},
256
+ "id" => {}
257
+ }
258
+ end
259
+
260
+ def multi_field_search?
261
+ true
262
+ end
263
+
264
+ def sort_definitions
265
+ # Don't believe DOAJ supports sorting by author
266
+ {
267
+ "relevance" => {:implementation => nil}, # default
268
+ "title" => {:implementation => "title:asc"},
269
+ # We don't quite have publication date sorting, but we'll use
270
+ # created_date from DOAJ
271
+ "date_desc" => {:implementation => "article.created_date:desc"},
272
+ "date_asc" => {:implementation => "article.created_date:asc"},
273
+ # custom one not previously standardized
274
+ "publication_name" => {:implementation => "bibjson.journal.title:asc"}
275
+ }
276
+ end
277
+
278
+ end
279
+ end
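The query-building steps in the new DOAJ engine (tokenize into words and quoted phrases, backslash-escape operator characters, prefix each token with "+", then escape for a URL path) can be tried outside Rails. The sketch below is a standalone approximation: it reuses the escaping regex from the diff, replaces ActiveSupport's `blank?` and `present?` with plain Ruby checks, and adds a hypothetical `path_escape` helper for the `CGI.escape` plus `%20` step.

```ruby
require 'cgi'

# Backslash-escape query-string operator characters. Double quotes are
# left alone so quoted phrases survive. Regex copied from the diff.
def escape_query(token)
  token.gsub(/([\+\-\=\&\|\>\<\!\(\)\{\}\[\]\^\~\*\?\:\\\/])/) { "\\#{$1}" }
end

# Split into bare words and double-quoted phrases, drop empty pieces,
# and make every token mandatory with a leading "+".
def prepare_mandatory_terms(query)
  tokens = query.split(/[[:space:]]+|("[^"]+")/).reject { |t| t.strip.empty? }
  tokens.map { |t| "+" + escape_query(t) }.join(" ")
end

# Wrap the terms in a mandatory field clause when a field is given.
def fielded_query(query, field = nil)
  terms = prepare_mandatory_terms(query)
  field.nil? || field.to_s.empty? ? terms : "+#{field}:(#{terms})"
end

# Hypothetical helper: escape for a URL path segment, where a space
# has to be "%20" and not "+".
def path_escape(query)
  CGI.escape(query).gsub('+', '%20')
end

puts fielded_query('apple orange "strawberry banana"', "bibjson.title")
puts path_escape(fielded_query("a b", "title"))
```

The order matters: the "+" operators are added before URL escaping, so they reach the server as `%2B` and are decoded back to operators, while the spaces between tokens become `%20`.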