RubyGems - solr_cursorstream - Versions diffs - 0.1.0 - Mend

solr_cursorstream 0.1.0


Files changed (15)

checksums.yaml +7 -0
data/.rspec +3 -0
data/.rubocop.yml +13 -0
data/CHANGELOG.md +5 -0
data/Gemfile +6 -0
data/LICENSE.txt +21 -0
data/README.md +112 -0
data/Rakefile +10 -0
data/bin/console +15 -0
data/bin/setup +8 -0
data/lib/solr/cursorstream/response.rb +30 -0
data/lib/solr/cursorstream/version.rb +7 -0
data/lib/solr/cursorstream.rb +127 -0
data/solr_cursorstream.gemspec +39 -0
metadata +171 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 85344a1d5cbdad956770cdb60c76c3304a12f11707ebffc59096abbe403b5d53
+  data.tar.gz: 13f93a423feab337b1bde721b0b13bc3335cf4eb976e3d6e04cf629cfa486b0d
+SHA512:
+  metadata.gz: e93a0a7dca05d60f9a2f2f6731071a1d9896df6de2899928ebda53c95876cc7b08677c88be7ec9986e848b8c63b622369d6cdd4e57e9f638c40a778671efe500
+  data.tar.gz: 1c2b116c552f38d430fda98080f1af9ed728e88eabca60f926d6408c0c593c3258dbab60bebb4cfd3d2f6d5411db54b2d1c2a72419af263cff7a96158a859590
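
Note on checksums.yaml: the file records digests of the two archives inside the `.gem` package. As a rough sketch of how such a digest could be checked, the snippet below compares a file's SHA256 against the value stored under a given key. The helper name `sha256_matches?` and the temporary stand-in file are assumptions made for illustration; they are not part of the gem or of RubyGems.

```ruby
require "digest"
require "yaml"
require "tempfile"

# Compare a file's SHA256 digest with the value recorded in checksums.yaml.
# `checksums_yaml` is the YAML text; `name` is the key (e.g. "data.tar.gz").
# Hypothetical helper, for illustration only.
def sha256_matches?(checksums_yaml, name, path)
  expected = YAML.safe_load(checksums_yaml).dig("SHA256", name)
  !expected.nil? && Digest::SHA256.file(path).hexdigest == expected
end

# Stand-in archive so the sketch runs without the real gem contents.
Tempfile.create("data.tar.gz") do |f|
  f.write("example bytes")
  f.flush
  yaml = {"SHA256" => {"data.tar.gz" => Digest::SHA256.hexdigest("example bytes")}}.to_yaml
  puts sha256_matches?(yaml, "data.tar.gz", f.path)
end
```

With a real package, `path` would point at the extracted `data.tar.gz` or `metadata.gz`, and the YAML text would be the contents of `checksums.yaml`.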

data/.rspec ADDED

@@ -0,0 +1,3 @@
+--format documentation
+--color
+--require spec_helper

data/.rubocop.yml ADDED

@@ -0,0 +1,13 @@
+AllCops:
+  TargetRubyVersion: 2.6
+Style/StringLiterals:
+  Enabled: true
+  EnforcedStyle: double_quotes
+Style/StringLiteralsInInterpolation:
+  Enabled: true
+  EnforcedStyle: double_quotes
+Layout/LineLength:
+  Max: 120

data/CHANGELOG.md ADDED

@@ -0,0 +1,5 @@
+## [Unreleased]
+## [0.1.0] - 2022-06-21
+ * Initial release
+ * See bottom of README.md for todo list

data/Gemfile ADDED

@@ -0,0 +1,6 @@
+# frozen_string_literal: true
+source "https://rubygems.org"
+# Specify your gem's dependencies in solr_cursorstream.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2022 Bill Dueber
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,112 @@
+# Solr::CursorStream
+"Stream" results from solr with
+[cursor-based fetching](https://solr.apache.org/guide/8_6/pagination-of-results.html#fetching-a-large-number-of-sorted-results-cursors),
+exposing the stream as a normal ruby enumerator.
+Note that this is different from true streaming of results via, e.g.,
+the [default `/export` handler](https://solr.apache.org/guide/8_6/exporting-result-sets.html).
+Those queries can involve more complex processing, but are restricted in
+that
+  * you can't use relevancy ranking
+  * all fields have to be `docValues`.
+Cursor-based streaming allows, with some restrictions,
+downloading large sets of data without the "deep paging" problems
+associated with just using the `start` and `rows` parameters.
+The only significant restriction is that _the sort specification MUST
+include the `uniqueKey` field_. If you're just downloading a whole dataset and
+don't care about order, the default query of `*:*` and the default sort of `id asc`
+will be fine (assuming your uniqueKey is `id`). If you want to sort by
+another field/value, you must use the uniqueKey in a secondary sort (e.g.,
+`sort: "score desc, id asc"`) to guarantee a stable sort.
+NOTE that if you don't need the `score` (relevancy) field,
+_use the default query parameter of `*:*`_ so
+solr doesn't have to work as hard. Just put your restrictions in the
+`filters` array.
+## Usage
+```ruby
+require 'solr/cursorstream'
+core_url = "http://my.solr.com:8025/solr/mycore/"
+# Get everything in the solr core, no restrictions
+cs = Solr::CursorStream.new(url: core_url)
+cs.each {|doc| ... }
+# Filter for newer stuff
+# Note that you need to lucene-escape any q/fq values on your own, since
+# otherwise we'd need a full solr syntax parser to determine which
+# bits to escape.
+cs = Solr::CursorStream.new(url: core_url, filters: ['year:{2010 TO *}'])
+# Find everything with the phrase "Civil War" in the title and
+# pre-20th century, ordered by year
+cs = Solr::CursorStream.new(url: core_url) do |s|
+  s.filters = ['year:[* TO 1900]', 'title:"Civil War"']
+  s.sort = 'year asc, id asc' # need to include the uniqueKey field (id)!
+end
+# #each yields a solr document hash until it runs out
+cs.each {|doc| ... }
+# The underlying Faraday http connection is available if you need
+# to mess with it directly
+cs.connection.set_basic_auth(user, password)
+# There are a _lot_ of possible arguments to `new`. It may be easier
+# to specify values in a block
+cs = Solr::CursorStream.new(url: core_url) do |s|
+  s.batch_size = 100
+  s.fields = %w[id title author year]
+  s.filters = ["year:[* TO 1900]"]
+  s.query = "title:(Civil War)"
+  s.sort = 'score desc, id asc'
+end
+# Get the first 10_000 results from a query
+cs.each_with_index do |doc, i|
+  break if i >= 10_000
+  do_something_with_the_solr_doc(doc)
+end
+```
+## TODO
+- [ ] Add a :limit option
+- [ ] Add a `lucene_escape` utility function
+- [ ] Change q/fq to take either a string (as current) or a {field => value} hash
+- [ ] Actual error handling, or at least passing useful information along
+- [ ] Figure out how to test without a live solr to bounce off of. Maybe use
+  vcr or similar?
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'solr_cursorstream'
+```
+And then execute:
+    $ bundle install
+Or install it yourself as:
+    $ gem install solr_cursorstream
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/mlibrary/solr_cursorstream.
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
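
Note on the README: the TODO list mentions a `:limit` option, and the usage section works around its absence with `each_with_index` and `break`. Since the class includes `Enumerable`, `first(n)` should give the same effect. The sketch below shows the idea with a stand-in class; `FakeStream` is hypothetical and only mimics the batching, it does not talk to solr.

```ruby
# Stand-in for a cursor stream: yields documents in batches and records
# how many batches were "fetched". FakeStream is hypothetical, not the gem's class.
class FakeStream
  include Enumerable

  attr_reader :batches_fetched

  def initialize(total:, batch_size:)
    @total = total
    @batch_size = batch_size
    @batches_fetched = 0
  end

  def each
    return enum_for(:each) unless block_given?

    (0...@total).each_slice(@batch_size) do |ids|
      @batches_fetched += 1
      ids.each { |id| yield({"id" => id}) }
    end
  end
end

stream = FakeStream.new(total: 1_000, batch_size: 100)
docs = stream.first(250)    # Enumerable#first stops iterating once it has 250
puts docs.length            # 250
puts stream.batches_fetched # 3
```

Only three batches are requested because `first` breaks out of `each` as soon as it has enough documents. Judging by the source further down, the real class keeps its cursor position on the instance, so a fresh `CursorStream` is probably the safer choice after stopping early.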

data/Rakefile ADDED

@@ -0,0 +1,10 @@
+# frozen_string_literal: true
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+require "standard/rake"
+task default: [:spec, "standard:fix"]

data/bin/console ADDED

@@ -0,0 +1,15 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+require "bundler/setup"
+require "solr/cursorstream"
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+require "irb"
+IRB.start(__FILE__)

data/bin/setup ADDED

@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
+set -vx
+bundle install
+# Do any other automated setup that you need to do here

data/lib/solr/cursorstream/response.rb ADDED

@@ -0,0 +1,30 @@
+# frozen_string_literal: true
+require "delegate"
+# Wrapper around a Faraday::Response that provides sugar methods
+# to get solr docs, numFound, and the cursor value
+class Solr::CursorStream::Response < SimpleDelegator
+  # @param [Faraday::Response] faraday_response
+  def initialize(faraday_response)
+    super
+    @base_resp = faraday_response
+    @resp = faraday_response.body
+    __setobj__(@resp)
+  end
+  # @return [Array<Hash>] Array of solr documents returned, as simple hashes
+  def docs
+    @resp["response"]["docs"]
+  end
+  # @return [Integer] Number of documents found for the solr query
+  def num_found
+    @resp["response"]["numFound"]
+  end
+  # @return [String] value of the cursor as returned from solr
+  def cursor
+    @resp["nextCursorMark"]
+  end
+end
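
Note on response.rb: the wrapper delegates to the parsed response body and layers a few named readers on top. A minimal sketch of that pattern follows, using a `Struct` in place of `Faraday::Response` so it runs without any gems. `FakeHttpResponse`, `PageResponse`, and the sample body are illustrative assumptions.

```ruby
require "delegate"

# Hypothetical stand-in for Faraday::Response; only #body is needed here.
FakeHttpResponse = Struct.new(:status, :body)

# Same shape as the wrapper in the diff: delegate to the parsed body,
# add a few named readers on top.
class PageResponse < SimpleDelegator
  def initialize(http_response)
    super(http_response.body)
  end

  def docs
    self["response"]["docs"]
  end

  def cursor
    self["nextCursorMark"]
  end
end

body = {
  "response" => {"numFound" => 2, "docs" => [{"id" => "a"}, {"id" => "b"}]},
  "nextCursorMark" => "AoE/"
}
page = PageResponse.new(FakeHttpResponse.new(200, body))

puts page.docs.length        # named reader defined on the wrapper
puts page.cursor
puts page.key?("response")   # Hash methods are delegated to the body
```

Passing the body straight to `super` makes a separate `__setobj__` call unnecessary in this sketch.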

data/lib/solr/cursorstream/version.rb ADDED

@@ -0,0 +1,7 @@
+# frozen_string_literal: true
+module Solr
+  class CursorStream
+    VERSION = "0.1.0"
+  end
+end

data/lib/solr/cursorstream.rb ADDED

@@ -0,0 +1,127 @@
+# frozen_string_literal: true
+require "solr/cursorstream/version"
+require "solr/cursorstream/response"
+require "faraday"
+require "faraday/retry"
+module Solr
+  # Fetch results from a solr filter query via solr's cursor streaming.
+  # https://solr.apache.org/guide/8_6/pagination-of-results.html#fetching-a-large-number-of-sorted-results-cursors
+  #
+  # Note that accessors for things like query, filters, etc. are made available for ease of configuration _only_.
+  # Changing anything in the middle of a job will screw up the cursors and leave things undetermined. Just
+  # make another CursorStream object.
+  class CursorStream
+    include Enumerable
+    class Error < StandardError; end
+    attr_accessor :url, :query, :handler, :filters, :sort, :batch_size, :fields, :logger
+    # @param [String] url URL to the solr _core_ (e.g., http://my.machine.com/solr/mycore)
+    # @param [String] handler The specific handler to target.
+    # @param [String] query The main solr query (`q`). Defaults to `*:*`.
+    # @param [Array<String>] filters Array of filter queries to apply.
+    # @param [String] sort A valid solr sort string. MUST include the unique field (as per solr docs)
+    # @param [Integer] batch_size How many results to fetch at a time (for efficiency)
+    # @param [Array<String>] fields The solr fields to return.
+    # @param [Logger, #info] logger A logger or logger-like object. When set to `nil` will not do any logging.
+    # @param [Symbol] adapter A valid Faraday adapter. If not using the default httpx, it is up to the
+    #    programmer to do whatever `require` calls are necessary.
+    def initialize(url:, handler: "select", query: "*:*", filters: ["*:*"], sort: "id asc", batch_size: 100, fields: [], logger: nil, adapter: :httpx)
+      @url = url.gsub(/\/\Z/, "")
+      @query = query
+      @handler = handler
+      @filters = filters
+      @sort = sort
+      @batch_size = batch_size
+      @fields = fields
+      @logger = logger
+      @adapter = adapter
+      @current_cursor = "*"
+      yield self if block_given?
+    end
+    # @return [String] solr url built from the passed url and the handler
+    def solr_url
+      url + "/" + handler
+    end
+    # Iterate through the documents in the stream. Behind the scenes, these will be fetched in batches
+    # of `batch_size` for efficiency.
+    # @yieldparam [Hash] doc A single solr document from the stream
+    def each
+      return enum_for(:each) unless block_given?
+      verify_we_have_everything!
+      while solr_has_more?
+        cursor_response = get_page
+        cursor_response.docs.each { |d| yield d }
+      end
+    end
+    # Build up a Faraday connection
+    # @param [Symbol] adapter Which faraday adapter to use. If not :httpx, you must have loaded the
+    # necessary adapter already.
+    # @return [Faraday::Connection] A faraday connection object.
+    def self.connection(adapter: :httpx)
+      require "httpx/adapters/faraday" if adapter == :httpx
+      Faraday.new(request: {params_encoder: Faraday::FlatParamsEncoder}) do |builder|
+        builder.use Faraday::Response::RaiseError
+        builder.request :url_encoded
+        builder.request :retry
+        builder.response :json
+        builder.adapter adapter
+      end
+    end
+    # @see CursorStream.connection
+    def connection(adapter: @adapter)
+      # Build the connection once and reuse it for every page
+      @connection ||= self.class.connection(adapter: adapter)
+    end
+    # @private
+    # Get a single "page" (`batch_size` documents) from solr. Feeds into #each
+    # @return [Solr::CursorStream::Response]
+    def get_page
+      params = {cursorMark: @current_cursor}.merge default_params
+      r = connection.get(solr_url, params)
+      resp = Response.new(r)
+      @last_cursor = @current_cursor
+      @current_cursor = resp.cursor
+      resp
+    end
+    # @private
+    # @return [Hash] Default solr params derived from instance variables
+    def default_params
+      field_list = Array(fields).join(",")
+      params = {q: @query, wt: :json, rows: batch_size, sort: @sort, fq: filters, fl: field_list}
+      # Hash#reject returns a new hash, so its result must be the return value
+      params.reject { |_k, v| [nil, "", []].include?(v) }
+    end
+    # @private
+    # Make sure we have everything we need for a successful stream
+    def verify_we_have_everything!
+      missing = {handler: @handler, filters: @filters, batch_size: @batch_size}.select { |_k, v| v.nil? }.keys
+      raise Error.new("Solr::CursorStream missing value for #{missing.join(", ")}") unless missing.empty?
+    end
+    # @private
+    # Determine if solr has another page of results
+    # @return [Boolean]
+    def solr_has_more?
+      @last_cursor != @current_cursor
+    end
+    # @private
+    # @return [Proc] Lambda that runs every time the connection needs to retry due to http error
+    def http_request_retry_block
+      ->(env:, options:, retries_remaining:, exception:, will_retry_in:) do
+        # TODO: log that a retry happened
+      end
+    end
+  end
+end
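
Note on cursorstream.rb: `#each`, `#get_page`, and `#solr_has_more?` together implement the usual cursor loop. Start with a cursor mark of `*`, send the returned `nextCursorMark` with the next request, and stop once solr hands back the same mark that was sent. The sketch below walks that loop against an in-memory stand-in. `FakeSolr`, `each_doc`, and the cursor format (the last id seen) are assumptions for illustration; real cursor marks are opaque strings.

```ruby
# In-memory stand-in for a solr core, sorted by id. FakeSolr and its
# cursor format (the last id seen) are hypothetical.
class FakeSolr
  def initialize(ids)
    @ids = ids.sort
  end

  # Return up to `rows` docs after `cursor_mark`, plus the next mark.
  # An exhausted cursor returns the same mark that was sent.
  def select(cursor_mark:, rows:)
    remaining = cursor_mark == "*" ? @ids : @ids.select { |id| id > cursor_mark }
    page = remaining.first(rows)
    next_mark = page.empty? ? cursor_mark : page.last
    {"docs" => page.map { |id| {"id" => id} }, "nextCursorMark" => next_mark}
  end
end

# Walk the cursor until the mark stops changing, yielding each document.
def each_doc(solr, rows:)
  cursor = "*"
  loop do
    resp = solr.select(cursor_mark: cursor, rows: rows)
    resp["docs"].each { |doc| yield doc }
    break if resp["nextCursorMark"] == cursor
    cursor = resp["nextCursorMark"]
  end
end

solr = FakeSolr.new(%w[d a c b e])
ids = []
each_doc(solr, rows: 2) { |doc| ids << doc["id"] }
puts ids.join(",")
```

The final request returns no documents and an unchanged mark, which is what ends the loop. That extra round trip is part of the protocol rather than a flaw in the sketch.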

data/solr_cursorstream.gemspec ADDED

@@ -0,0 +1,39 @@
+# frozen_string_literal: true
+require_relative "lib/solr/cursorstream/version"
+Gem::Specification.new do |spec|
+  spec.name = "solr_cursorstream"
+  spec.version = Solr::CursorStream::VERSION
+  spec.authors = ["Bill Dueber"]
+  spec.email = ["bill@dueber.com"]
+  spec.summary = "Get an iterator on a solr filter using stream/cursor"
+  spec.homepage = "https://github.com/mlibrary/solr_cursorstream"
+  spec.license = "MIT"
+  spec.metadata["homepage_uri"] = spec.homepage
+  spec.metadata["source_code_uri"] = spec.homepage
+  spec.metadata["changelog_uri"] = spec.homepage + "/CHANGELOG.md"
+  # Specify which files should be added to the gem when it is released.
+  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+  spec.files = Dir.chdir(File.expand_path(__dir__)) do
+    `git ls-files -z`.split("\x0").reject do |f|
+      (f == __FILE__) || f.match(%r{\A(?:(?:test|spec|features)/|\.(?:git|travis|circleci)|appveyor)})
+    end
+  end
+  spec.bindir = "exe"
+  spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+  spec.add_dependency "faraday"
+  spec.add_dependency "faraday-retry"
+  spec.add_dependency "httpx"
+  spec.add_dependency "milemarker"
+  spec.add_development_dependency "pry"
+  spec.add_development_dependency "rake", "~> 13.0"
+  spec.add_development_dependency "rspec", "~> 3.0"
+  spec.add_development_dependency "standard"
+end
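
Note on the gemspec: `spec.files` takes the output of `git ls-files` and rejects paths matching an exclusion pattern. The sketch below applies that same pattern to a handful of sample paths to show which ones would be packaged. The path list is illustrative and is not the repository's actual contents.

```ruby
# The exclusion pattern from the gemspec, applied to sample paths.
# The path list is illustrative, not the repository's actual contents.
EXCLUDED = %r{\A(?:(?:test|spec|features)/|\.(?:git|travis|circleci)|appveyor)}

paths = %w[
  .gitignore
  .github/workflows/main.yml
  .rspec
  .rubocop.yml
  README.md
  lib/solr/cursorstream.rb
  spec/spec_helper.rb
]

# Keep only the paths the pattern does not exclude.
packaged = paths.reject { |path| path.match?(EXCLUDED) }
puts packaged
```

Note that `.rspec` and `.rubocop.yml` survive the filter, which is consistent with the `files:` list in the package metadata.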

metadata ADDED

@@ -0,0 +1,171 @@
+--- !ruby/object:Gem::Specification
+name: solr_cursorstream
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Bill Dueber
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2022-06-21 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: faraday
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: faraday-retry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: httpx
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: milemarker
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: pry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+- !ruby/object:Gem::Dependency
+  name: standard
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description:
+email:
+- bill@dueber.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".rspec"
+- ".rubocop.yml"
+- CHANGELOG.md
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- bin/console
+- bin/setup
+- lib/solr/cursorstream.rb
+- lib/solr/cursorstream/response.rb
+- lib/solr/cursorstream/version.rb
+- solr_cursorstream.gemspec
+homepage: https://github.com/mlibrary/solr_cursorstream
+licenses:
+- MIT
+metadata:
+  homepage_uri: https://github.com/mlibrary/solr_cursorstream
+  source_code_uri: https://github.com/mlibrary/solr_cursorstream
+  changelog_uri: https://github.com/mlibrary/solr_cursorstream/CHANGELOG.md
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.1.2
+signing_key:
+specification_version: 4
+summary: Get an iterator on a solr filter using stream/cursor
+test_files: []