rangefinder 0.0.1

@@ -0,0 +1,15 @@
+ ---
+ !binary "U0hBMQ==":
+   metadata.gz: !binary |-
+     ODRiODljY2Q5MmFhZjBjYzg1ZjI5Y2YxMDgyZmY1YWZiOTk0MTA4YQ==
+   data.tar.gz: !binary |-
+     YTgwODAzMjA3ODNjNTZmYjZkYWNhMGQ3YjM4MTdjYmMzNzFmN2RmMg==
+ SHA512:
+   metadata.gz: !binary |-
+     NDk5NWUwNWI0NGMyMzRlMTcwYWI5OGQ0N2M5NGNiMzU2MTc5ZDM1NmFkMTU2
+     NWRmZTRiYWMwOTVlZmFjZTc2OGQyZDY5ODdjMzk2ODk0Yjg5MjAyYjQ4YWQ2
+     MzJiY2ZhOGY4NDkxNGI2NjU1NmIzZTg0YWVkZmJiMWUxYzU5NWQ=
+   data.tar.gz: !binary |-
+     ZDFmMmE3NWM2YTMzNDNhNjU0ZTdjYWU2YTRmNjZjYmZmMGQ5ZTllYzY2YTdl
+     ZWJiNGY3ZjZlOTM1YmU1ZmYzYzUxZTVjYzE2MGE2OWY0OTA0MGM4YjljZTFi
+     YWNlZWM2NjA0MzA1MjBmYjlhNTMwNDE1OWMzYjhlYzcyMGJhMTI=
data/.gitignore ADDED
@@ -0,0 +1,17 @@
+ *.gem
+ *.rbc
+ .bundle
+ .config
+ .yardoc
+ Gemfile.lock
+ InstalledFiles
+ _yardoc
+ coverage
+ doc/
+ lib/bundler/man
+ pkg
+ rdoc
+ spec/reports
+ test/tmp
+ test/version_tmp
+ tmp
data/.rspec ADDED
@@ -0,0 +1,2 @@
+ --format documentation
+ --color
data/.travis.yml ADDED
@@ -0,0 +1,3 @@
+ language: ruby
+ rvm:
+   - 1.9.3
data/CHANGELOG ADDED
@@ -0,0 +1,3 @@
+ 0.0.1 / 2014-01-10
+
+ Initial release!
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in rangefinder.gemspec
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
+ Copyright (c) 2014 Seamus Abshere
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,71 @@
+ # Rangefinder
+
+ Helps you find ranges of IDs, like when you're scraping a website and you need to guess IDs.
+
+ You tell it what a valid ID is and it looks for ranges of consecutive valid IDs. It assumes that each probe is expensive.
+
+ ## Installation
+
+ Add this line to your application's Gemfile:
+
+     gem 'rangefinder'
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install rangefinder
+
+ ## Usage
+
+ Let's say you're scraping a website but you have to guess the IDs. What you **don't** know is that all valid IDs are in the ranges `100..11_000` and `100_000..110_000`. You pass a "probe" block that returns true if an ID is valid:
+
+     ranges = Rangefinder.new.probe do |possible_id|
+       # your probe code here. for example:
+       response = http.get "http://example.com/items", id: possible_id
+       response.status == 200
+     end
+
+ You get back ranges where we think there are valid IDs. In this case, pretty good! (See Goals below.)
+
+     >> ranges
+     => [ 0..12_200, 99_455..111_600 ]
+
+ Now you can scrape them one by one:
+
+     ranges.each do |range|
+       range.each do |id|
+         # scrape this ID
+       end
+     end
+
+ ### Please do cache
+
+ It's nice when your probe block makes a call that is cached somehow. That way when you go back and use the ranges, you're not hitting all those URLs over again.
+
+ ## Goals
+
+ By default
+
+ 1. Detect at least 90% of valid IDs in 1000-long ranges with up to 90% intra-range sparsity
+ 1. Tolerate gaps of 100,000
+ 1. Probe no more than 5% of the range
+
+ Maybe
+
+ 1. Don't overestimate valid ranges more than X
+
+ ### Wishlist
+
+ 1. Accept a known ID as the basis for smarter probing
+ 1. Internally, calculate density and use that to choose `min_range` and `samp`
+
+ ## Contributing
+
+ 1. Fork it ( http://github.com/<my-github-username>/rangefinder/fork )
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
+ 4. Push to the branch (`git push origin my-new-feature`)
+ 5. Create new Pull Request
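Note on the README's "Please do cache" advice: one way to follow it is to wrap the expensive check in a small memoizing object and hand that to the probe. The sketch below is only an illustration. `CachedProbe` is a hypothetical helper that is not part of the gem, and the block passed to it stands in for a real HTTP request.

```ruby
# Hypothetical helper (not part of rangefinder): remembers the verdict for
# every ID it has already checked, so each ID costs at most one real probe.
class CachedProbe
  attr_reader :calls

  def initialize(&expensive_check)
    @expensive_check = expensive_check
    @cache = {}
    @calls = 0 # number of times the expensive check actually ran
  end

  # Returns true/false for id, running the expensive check only on a cache miss.
  def valid?(id)
    @cache.fetch(id) do
      @calls += 1
      @cache[id] = @expensive_check.call(id) ? true : false
    end
  end

  # Lets the object be passed where a block is expected, e.g. probe(&cached).
  def to_proc
    method(:valid?).to_proc
  end
end

# Stand-in for an HTTP request: assume only IDs 100..200 exist.
cached = CachedProbe.new { |id| (100..200).cover?(id) }
cached.valid?(150) # runs the expensive check
cached.valid?(150) # served from the cache
cached.valid?(50)  # runs the expensive check
```

With the gem loaded, `Rangefinder.new.probe(&cached)` should accept this object as the probe block, and a later scraping pass can call `cached.valid?(id)` without repeating requests that were already made.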
data/Rakefile ADDED
@@ -0,0 +1,6 @@
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ task :default => :spec
data/lib/rangefinder.rb ADDED
@@ -0,0 +1,58 @@
+ require "rangefinder/version"
+ require 'rangefinder/memo'
+
+ require 'ranges_merger'
+
+ class Rangefinder
+   MAX = 2**32 - 1
+   MAX_GAP = 1e5
+   INIT_SAMP = 0.01
+   MAX_SAMP = 0.1
+
+   def probe(options = {}, &blk)
+     probe_with_hits_and_misses(options, &blk).first
+   end
+
+   def probe_with_hits_and_misses(options = {}, &blk)
+     memo = Memo.new
+     _probe(memo, options, &blk)
+     [ ::RangesMerger.merge(memo.ranges), memo.hits, memo.misses ]
+   end
+
+   private
+
+   def _probe(memo, options = {}, &blk)
+     first = [options.fetch(:first, 0), 0].max.round
+     last = [options.fetch(:last, MAX), MAX].min.round
+     max_gap = options.fetch(:max_gap, MAX_GAP)
+     samp = options.fetch(:samp, INIT_SAMP)
+     if samp >= MAX_SAMP
+       memo.ranges << (first..last)
+     else
+       min_range = (10 ** (2 - Math.log(samp, 10))).round
+       anything = false
+       first_good = nil
+       i = first
+       last_good = first
+       begin
+         if blk.call(i)
+           memo.hit!
+           anything = true
+           first_good ||= i
+           last_good = i
+         else
+           memo.miss!
+         end
+         gap = i - last_good
+         if first_good and gap > min_range
+           _probe memo, {first: first_good-min_range, last: last_good+min_range, samp: samp*3}, &blk
+           first_good = nil
+           last_good = i
+           gap = 0
+         end
+         samp1 = gap > Math::E ? samp * Math.log(gap) : samp
+         i += (rand(100) * (1 - samp1)).round
+       end until i >= last or (gap > max_gap and anything) # sorry for mixed metaphor
+     end
+   end
+ end
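Note on the sampling schedule in `_probe`: the expression `10 ** (2 - Math.log(samp, 10))` is algebraically `100 / samp`, and each recursive call triples `samp` until it reaches `MAX_SAMP`, at which point the whole sub-range is recorded without further probing. The sketch below only replays that arithmetic with the constants from the file. `min_range_for` and `schedule` are names made up for this illustration.

```ruby
# Constants copied from lib/rangefinder.rb.
INIT_SAMP = 0.01
MAX_SAMP  = 0.1

# Same formula as in _probe; equivalent to (100 / samp).round.
def min_range_for(samp)
  (10 ** (2 - Math.log(samp, 10))).round
end

# Lists [samp, min_range] for every recursion level that still probes,
# i.e. every level whose samp is below MAX_SAMP.
def schedule
  levels = []
  samp = INIT_SAMP
  while samp < MAX_SAMP
    levels << [samp, min_range_for(samp)]
    samp *= 3 # _probe recurses with samp*3
  end
  levels
end

schedule.each do |samp, min_range|
  puts format("samp=%.2f min_range=%d", samp, min_range)
end
```

Under these defaults there are three probing levels, with `min_range` shrinking from 10000 to 3333 to 1111, so a cluster of hits is re-examined with progressively tighter padding before being accepted as a range.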
data/lib/rangefinder/memo.rb ADDED
@@ -0,0 +1,19 @@
+ class Rangefinder
+   class Memo
+     attr_reader :ranges
+     attr_reader :hits
+     attr_reader :misses
+     def initialize
+       @ranges = []
+       @hits = 0
+       @misses = 0
+       @mutex = Mutex.new
+     end
+     def hit!
+       @mutex.synchronize { @hits += 1 }
+     end
+     def miss!
+       @mutex.synchronize { @misses += 1 }
+     end
+   end
+ end
data/lib/rangefinder/version.rb ADDED
@@ -0,0 +1,3 @@
+ class Rangefinder
+   VERSION = "0.0.1"
+ end
data/rangefinder.gemspec ADDED
@@ -0,0 +1,27 @@
+ # coding: utf-8
+ lib = File.expand_path('../lib', __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+ require 'rangefinder/version'
+
+ Gem::Specification.new do |spec|
+   spec.name = "rangefinder"
+   spec.version = Rangefinder::VERSION
+   spec.authors = ["Seamus Abshere"]
+   spec.email = ["seamus@abshere.net"]
+   spec.summary = %q{Helps you find ranges of IDs, like when you're scraping a website and you need to guess IDs.}
+   spec.description = %q{Helps you find ranges of IDs, like when you're scraping a website and you need to guess IDs. You tell it what a valid ID is and it looks for ranges of consecutive valid IDs. It assumes that each probe is expensive.}
+   spec.homepage = "https://github.com/seamusabshere/rangefinder"
+   spec.license = "MIT"
+
+   spec.files = `git ls-files`.split($/)
+   spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
+   spec.require_paths = ["lib"]
+
+   spec.add_runtime_dependency 'ranges_merger'
+
+   spec.add_development_dependency "bundler", "~> 1.5"
+   spec.add_development_dependency "rake"
+   spec.add_development_dependency "rspec"
+   spec.add_development_dependency "pry"
+ end
data/spec/rangefinder_spec.rb ADDED
@@ -0,0 +1,70 @@
+ require 'spec_helper'
+
+ # https://github.com/rails/rails/blob/444ce93397dba3505ecef4973edba40de4fc08c6/activesupport/lib/active_support/core_ext/range/include_range.rb#L12
+ # (1..5).include?(1..5) # => true
+ # (1..5).include?(2..3) # => true
+ # (1..5).include?(2..6) # => false
+ def range_include?(zelf, other)
+   # 1...10 includes 1..9 but it does not include 1..10.
+   operator = zelf.exclude_end? && !other.exclude_end? ? :< : :<=
+   zelf.include?(other.first) && other.last.send(operator, zelf.last)
+ end
+
+ describe Rangefinder do
+   expected_ranges = []
+   pos = 0
+   100.times do
+     len = 1000
+     pos += rand(100_000).to_i
+     expected_ranges << ((pos)..(len+pos))
+   end
+   expected_id_count = expected_ranges.map(&:count).inject(:+)
+
+   cache = {}
+
+   (0..0.9).step(0.1).each do |sparsity|
+     describe "sparsity=#{'%g' % sparsity}" do
+       found_ranges, hits, misses = Rangefinder.new.probe_with_hits_and_misses do |i|
+         r = (cache[i] ||= rand)
+         (r > sparsity) && expected_ranges.any? { |r| r.include?(i) }
+       end
+
+       # $stderr.puts
+       # $stderr.puts
+       # $stderr.puts "found_ranges=#{found_ranges}"
+       # $stderr.puts
+       # $stderr.puts "expected_ranges=#{expected_ranges}"
+
+       # it "finds #{expected_ranges.length} ranges" do
+       #   expected_ranges.each do |expected|
+       #     expect(found_ranges.any? { |found| range_include?(found, expected) }).to be_true, "#{expected} not in #{found_ranges} found"
+       #   end
+       # end
+
+       it "finds 95% of ids" do
+         real_found_ids = []
+         expected_ranges.each do |expected|
+           found_ranges.each do |found|
+             # if found.include?(expected)
+             if range_include?(found, expected)
+               real_found_ids << expected.to_a
+             end
+           end
+         end
+         real_found_ids = real_found_ids.flatten.uniq
+         expect((real_found_ids.count.to_f / expected_id_count).round(2)).to be >= 0.95
+       end
+
+       it "probes only 5% of the space" do
+         highest_id = expected_ranges.map(&:last).max
+         expect(((hits+misses).to_f / highest_id).round(2)).to be <= 0.05
+       end
+
+       it "exaggerates no more than 5%" do
+         found_ids = found_ranges.map(&:to_a).flatten.uniq
+         expect((found_ids.count.to_f / expected_id_count).round(2)).to be <= 1.05
+       end
+     end
+   end
+
+ end
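Note on the `range_include?` helper in the spec: it decides whether one range fully contains another, and the only subtle case is an exclusive-end outer range compared with an inclusive-end inner one. The snippet below copies the helper unchanged and runs it on the cases named in its comments, as a quick sanity check rather than an addition to the test suite.

```ruby
# Copied from spec/rangefinder_spec.rb.
def range_include?(zelf, other)
  # 1...10 includes 1..9 but it does not include 1..10.
  operator = zelf.exclude_end? && !other.exclude_end? ? :< : :<=
  zelf.include?(other.first) && other.last.send(operator, zelf.last)
end

checks = {
  [(1..5),   (1..5)]  => true,  # identical ranges
  [(1..5),   (2..3)]  => true,  # strictly inside
  [(1..5),   (2..6)]  => false, # runs past the end
  [(1...10), (1..9)]  => true,  # exclusive end still covers 9
  [(1...10), (1..10)] => false  # exclusive end does not cover 10
}

results = checks.map { |(outer, inner), _| range_include?(outer, inner) }
```

In the spec this means a found range only counts toward the "finds 95% of ids" figure when it covers an expected range completely; partial overlaps contribute nothing.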
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,4 @@
+ require 'pry'
+
+ $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
+ require 'rangefinder'
metadata ADDED
@@ -0,0 +1,134 @@
+ --- !ruby/object:Gem::Specification
+ name: rangefinder
+ version: !ruby/object:Gem::Version
+   version: 0.0.1
+ platform: ruby
+ authors:
+ - Seamus Abshere
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2014-01-11 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: ranges_merger
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.5'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.5'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: pry
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+ description: Helps you find ranges of IDs, like when you're scraping a website and
+   you need to guess IDs. You tell it what a valid ID is and it looks for ranges of
+   consecutive valid IDs. It assumes that each probe is expensive.
+ email:
+ - seamus@abshere.net
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - .gitignore
+ - .rspec
+ - .travis.yml
+ - CHANGELOG
+ - Gemfile
+ - LICENSE.txt
+ - README.md
+ - Rakefile
+ - lib/rangefinder.rb
+ - lib/rangefinder/memo.rb
+ - lib/rangefinder/version.rb
+ - rangefinder.gemspec
+ - spec/rangefinder_spec.rb
+ - spec/spec_helper.rb
+ homepage: https://github.com/seamusabshere/rangefinder
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.1.11
+ signing_key:
+ specification_version: 4
+ summary: Helps you find ranges of IDs, like when you're scraping a website and you
+   need to guess IDs.
+ test_files:
+ - spec/rangefinder_spec.rb
+ - spec/spec_helper.rb
+ has_rdoc: