rsgrep 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/.gitignore +18 -0
- data/Gemfile +4 -0
- data/LICENSE +22 -0
- data/README.md +97 -0
- data/Rakefile +11 -0
- data/bin/rsgrep +23 -0
- data/lib/rsgrep.rb +8 -0
- data/lib/rsgrep/file.rb +77 -0
- data/lib/rsgrep/version.rb +3 -0
- data/rsgrep.gemspec +21 -0
- data/spec/file_spec.rb +70 -0
- data/spec/spec_helper.rb +10 -0
- metadata +114 -0
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2012 Sam Rose
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,97 @@
+# Rsgrep
+
+This is a pure Ruby implementation with the same goal as the small but amazing
+[sorted grep](http://sourceforge.net/projects/sgrep/) program written by
+Stephen C. Losen.
+
+It is designed for use on large, lexicographically sorted files. It allows you
+to search for lines that *begin* with a certain pattern (searching for anything
+at a position anywhere other than the start of a line isn't possible using a
+binary search).
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+    gem 'rsgrep'
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install rsgrep
+
+## Usage
+
+The gem monkey-patches the File class. It can be used in the following two
+ways:
+
+``` ruby
+require 'rsgrep'
+
+puts File.sgrep("key pattern", "path/to/file.txt")
+#=> array of all lines that start with "key pattern", empty array for no
+# matches.
+
+# or ...
+
+f = File.open("path/to/file.txt")
+puts f.sgrep("key pattern")
+#=> array of all lines that start with "key pattern", empty array for no
+# matches.
+
+f.close
+```
+
+You can pass both of these functions an options hash. Here are some examples of
+the options you can pass:
+
+``` ruby
+require 'rsgrep'
+
+f = File.open("path/to/file.txt")
+
+# Case insensitive search
+f.sgrep("PaTTern", :insensitive => true)
+
+f.close
+```
+
+**NOTE**: There are a lot of caveats involved in getting this to work properly.
+For example, you **cannot** do a case insensitive search on a file that is not
+sorted in a case insensitive fashion. The results will not be what you expect.
+
+This will be true of almost all options you pass to rsgrep. You will get the
+best results on a file that uses alphanumeric characters and only uses one
+casing (upper or lower, doesn't matter which).
+
+## Contributing
+
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Added some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
+
+## A note on the specs...
+
+Because writing specs for this required having a very large file to scan, I had
+to choose a very large file that was freely available. For obvious reasons, I
+cannot put the file into this repository but you can download it from
+[here](http://books.google.com/ngrams/datasets). It's the 0th file of the 3grams
+dataset in English.
+
+Direct link:
+[http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip](http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip)
+
+It's about 440 MB compressed, 3 GB uncompressed. You will need to uncompress it
+into the `spec/data` directory in order to run the specs successfully.
+
+This file is a bit of a bad example though, to be honest. I'm only using it at
+the moment so that the specs give a good idea of how long it takes to scan
+through such a large file. The reason that this is not a good file to use is
+because it isn't sorted in a way that rsgrep knows how to process yet. Its
+handling of capital letters and punctuation is a bit confusing and I haven't
+yet been able to find a consistent and clean way of scanning it.
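The README's two main claims (only line prefixes can be found by binary search, and the sort order must match the comparison) can be illustrated in memory. This is a minimal sketch, not code from the gem: it assumes a sorted Array in place of a file, `prefix_matches` is a hypothetical helper name, and `Array#bsearch_index` requires Ruby 2.3 or later.

```ruby
# Hypothetical helper, not part of rsgrep: prefix search over a sorted Array.
def prefix_matches(sorted_lines, key)
  # Find-minimum mode: index of the first line that is >= key.
  first = sorted_lines.bsearch_index { |line| line >= key }
  return [] if first.nil?

  # In a sorted list, lines sharing a prefix sit next to each other.
  sorted_lines[first..-1].take_while { |line| line.start_with?(key) }
end

lines = ["apple pie", "apple tart", "banana", "cherry"]
p prefix_matches(lines, "apple")  # => ["apple pie", "apple tart"]
p prefix_matches(lines, "pie")    # => [] ("pie" never starts a line)

# Byte-order sorting puts every uppercase letter before every lowercase one,
# so a case insensitive search needs a case insensitively sorted input.
p ["apple", "Zebra", "banana"].sort                 # => ["Zebra", "apple", "banana"]
p ["apple", "Zebra", "banana"].sort_by(&:downcase)  # => ["apple", "banana", "Zebra"]
```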
data/Rakefile
ADDED
data/bin/rsgrep
ADDED
@@ -0,0 +1,23 @@
+#!/usr/bin/env ruby
+
+require File.join(File.dirname(__FILE__), '..', 'lib', 'rsgrep')
+require 'optparse'
+
+options = {}
+
+OptionParser.new do |opts|
+  opts.on('-i', '--insensitive', 'Case insensitive search.') do
+    options[:insensitive] = true
+  end
+end.parse!
+
+key = ARGV.shift
+file = ARGV.shift
+
+begin
+  puts File.sgrep(key, file, options)
+rescue Errno::EPIPE
+  # If piping to another program, then that program closes STDOUT (by exiting
+  # for example), you get an EPIPE exception. We don't really care about it. This
+  # is just to stop error output.
+end
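The script above takes the key and the file from whatever is left in `ARGV` after option parsing, without checking that both were given. The following sketch shows the same parsing against an explicit argument array, which makes it easy to try out. `parse_cli` and the usage message are hypothetical and not part of the released script; the argument check is an added assumption.

```ruby
require 'optparse'

# Hypothetical helper: parse rsgrep-style arguments from an explicit array.
def parse_cli(argv)
  options = {}

  # OptionParser#parse leaves argv untouched and returns the non-option args.
  rest = OptionParser.new do |opts|
    opts.on('-i', '--insensitive', 'Case insensitive search.') do
      options[:insensitive] = true
    end
  end.parse(argv)

  key, file = rest
  # Assumption: fail early instead of passing nil on to File.sgrep.
  raise ArgumentError, 'usage: rsgrep [-i] KEY FILE' if key.nil? || file.nil?

  [options, key, file]
end

options, key, file = parse_cli(%w[-i pattern words.txt])
puts "key=#{key} file=#{file} insensitive=#{options[:insensitive].inspect}"
```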
data/lib/rsgrep.rb
ADDED
data/lib/rsgrep/file.rb
ADDED
@@ -0,0 +1,77 @@
+class File
+  def self.sgrep key, filename, options = {}
+    File.open(filename) do |f|
+      f.sgrep key, options
+    end
+  end
+
+  def sgrep key, options = {}
+    # initialise the variables that the binary search algorithm needs.
+    hi = size
+    lo = 0
+    mid = (hi + lo) / 2
+    ret = []
+
+    if options[:insensitive]
+      comparator = Proc.new do |key, line|
+        key = key.downcase
+        line = line.downcase
+        if line.start_with? key
+          0
+        else
+          key <=> line
+        end
+      end
+    else
+      comparator = Proc.new do |key, line|
+        if line.start_with? key
+          0
+        else
+          key <=> line
+        end
+      end
+    end
+
+    while lo < mid and mid < hi
+      seek(mid)
+      seek(-2, IO::SEEK_CUR) while (c = getc) != "\n" and c != nil
+
+      case comparator.call(key, (line = gets))
+      when 0
+        # So we've found a line that matches the key, but this may not be the
+        # first line that does (due to the nature of binary searching). This
+        # begin/end block scans backwards in the file to find the first line
+        # that matches the key.
+        begin
+          # Seek back 2 lines because "gets" will advance the file pointer
+          # forward.
+          2.times do
+            seek(-2, IO::SEEK_CUR) # First read will always be a newline, skip it
+            seek(-2, IO::SEEK_CUR) while (c = getc) != "\n" and c != nil
+          end
+        end while comparator.call(key, (line = gets)) == 0
+
+        # Then scan forward line by line until all of the lines that match have
+        # been added to the return array.
+        ret << line.rstrip while (line = gets) != nil and comparator.call(key, line) == 0
+
+        # and we're done!
+        return ret
+      when -1
+        # Key is less than the line. Shift the hi value down and recalculate the
+        # mid.
+        hi = mid
+        mid = (hi + lo) / 2
+      when 1
+        # Key is greater than the line. Shift the lo value up and recalculate
+        # the mid.
+        lo = mid
+        mid = (hi + lo) / 2
+      else
+        raise "Should not be raised, ever."
+      end
+    end
+
+    return ret
+  end
+end
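The method above steps backwards one byte at a time with `seek(-2, IO::SEEK_CUR)`, which appears to have no guard for reaching the start of the file, and it passes the result of `gets` to the comparator without a `nil` check at end of file. The sketch below is a different formulation of the same idea, not the gem's code: it binary-searches byte offsets for the first line that is greater than or equal to the key, then reads forward. The names `line_at_or_after` and `sorted_prefix_lines` are hypothetical, and the file is assumed to be sorted in byte order.

```ruby
require 'tempfile'

# Hypothetical helper: the first line that starts at or after byte `offset`,
# or nil when there is none.
def line_at_or_after(io, offset)
  if offset.zero?
    io.seek(0)
  else
    io.seek(offset - 1)
    io.gets # consume up to and including the next newline
  end
  io.gets
end

# Hypothetical helper: all lines of the byte-sorted file `io` that start
# with `key`, without their line endings.
def sorted_prefix_lines(io, key)
  lo = 0
  hi = io.size

  # Find the smallest offset whose following line is >= key (or missing).
  while lo < hi
    mid = (lo + hi) / 2
    line = line_at_or_after(io, mid)
    if line.nil? || line.chomp >= key
      hi = mid
    else
      lo = mid + 1
    end
  end

  # Matching lines are contiguous from here on; read forward until one fails.
  matches = []
  line = line_at_or_after(io, lo)
  while line && line.start_with?(key)
    matches << line.chomp
    line = io.gets
  end
  matches
end

Tempfile.create('sorted') do |f|
  f.write("apple pie\napple tart\nbanana\ncherry\n")
  f.flush
  p sorted_prefix_lines(f, 'apple')  # => ["apple pie", "apple tart"]
  p sorted_prefix_lines(f, 'zzz')    # => []
end
```

Each probe reads at most two lines, so the number of reads grows with the logarithm of the file size. A case insensitive variant would need to downcase both sides of every comparison and, as the README notes, a file sorted case insensitively.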
data/rsgrep.gemspec
ADDED
@@ -0,0 +1,21 @@
+# -*- encoding: utf-8 -*-
+require File.expand_path('../lib/rsgrep/version', __FILE__)
+
+Gem::Specification.new do |gem|
+  gem.authors = ["Sam Rose"]
+  gem.email = ["samwho@lbak.co.uk"]
+  gem.description = %q{Pure Ruby implementation of the sorted grep command.}
+  gem.summary = %q{sgrep for Ruby!}
+  gem.homepage = "http://github.com/samwho/rsgrep"
+
+  gem.files = `git ls-files`.split($\)
+  gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
+  gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
+  gem.name = "rsgrep"
+  gem.require_paths = ["lib"]
+  gem.version = Rsgrep::VERSION
+
+  gem.add_development_dependency 'rake'
+  gem.add_development_dependency 'rspec'
+  gem.add_development_dependency 'rspec-core'
+end
data/spec/file_spec.rb
ADDED
@@ -0,0 +1,70 @@
+require 'spec_helper'
+
+describe "File#sgrep" do
+  let(:data_file1) { File.open(DATA_FILE1) }
+
+  after :each do
+    data_file1.close
+  end
+
+  context "when searching for 'search for '" do
+    key = "search for "
+
+    subject { data_file1.sgrep key }
+    it { should_not be_empty }
+
+    specify "all elements start with '#{key}'" do
+      subject.all? { |elem| elem.start_with? key }.should be_true
+    end
+  end
+
+  context "when searching for 'a'" do
+    key = 'a'
+
+    subject { data_file1.sgrep key }
+    it { should_not be_empty }
+
+    specify "all elements start with '#{key}'" do
+      subject.all? { |elem| elem.start_with? key }.should be_true
+    end
+  end
+
+  context "when searching for '!'" do
+    key = '!'
+
+    subject { data_file1.sgrep key }
+    it { should_not be_empty }
+
+    specify "all elements start with '#{key}'" do
+      subject.all? { |elem| elem.start_with? key }.should be_true
+    end
+  end
+
+  context "no results (search 'this is not in this file anywhere')" do
+    key = 'this is not in this file anywhere'
+
+    subject { data_file1.sgrep key }
+    it { should be_empty }
+  end
+
+  if DICT_FILE
+    let(:dict_file) { File.open(DICT_FILE) }
+
+    after :each do
+      dict_file.close
+    end
+
+    context "Dictionary data tests" do
+      context "case insensitive searches" do
+        context "search for 'zyzzogeton'" do
+          key = "zyzzogeton"
+
+          subject { dict_file.sgrep key, :insensitive => true }
+          it { should include "Zyzzogeton" }
+        end
+      end
+    end
+  else
+    STDERR.puts "It doesn't seem like you have a dictionary file in a standard location. Skipping dictionary file tests."
+  end
+end
data/spec/spec_helper.rb
ADDED
@@ -0,0 +1,10 @@
+require File.join(File.dirname(__FILE__), '..', 'lib', 'rsgrep')
+
+DATA_ROOT = File.join(File.dirname(__FILE__), 'data')
+DATA_FILE1 = File.join(DATA_ROOT, 'googlebooks-eng-all-3gram-20090715-0.csv')
+
+DICT_FILE = if File.exist?("/usr/share/dict/words")
+  "/usr/share/dict/words"
+else
+  nil
+end
metadata
ADDED
@@ -0,0 +1,114 @@
+--- !ruby/object:Gem::Specification
+name: rsgrep
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+  prerelease:
+platform: ruby
+authors:
+- Sam Rose
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2012-09-27 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec-core
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Pure Ruby implementation of the sorted grep command.
+email:
+- samwho@lbak.co.uk
+executables:
+- rsgrep
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE
+- README.md
+- Rakefile
+- bin/rsgrep
+- lib/rsgrep.rb
+- lib/rsgrep/file.rb
+- lib/rsgrep/version.rb
+- rsgrep.gemspec
+- spec/file_spec.rb
+- spec/spec_helper.rb
+homepage: http://github.com/samwho/rsgrep
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 1288555168689700844
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 1288555168689700844
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: sgrep for Ruby!
+test_files:
+- spec/file_spec.rb
+- spec/spec_helper.rb