rsgrep 0.0.1
- data/.gitignore +18 -0
- data/Gemfile +4 -0
- data/LICENSE +22 -0
- data/README.md +97 -0
- data/Rakefile +11 -0
- data/bin/rsgrep +23 -0
- data/lib/rsgrep.rb +8 -0
- data/lib/rsgrep/file.rb +77 -0
- data/lib/rsgrep/version.rb +3 -0
- data/rsgrep.gemspec +21 -0
- data/spec/file_spec.rb +70 -0
- data/spec/spec_helper.rb +10 -0
- metadata +114 -0
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2012 Sam Rose
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,97 @@
+# Rsgrep
+
+This is a pure Ruby implementation with the same goal as the small but amazing
+[sorted grep](http://sourceforge.net/projects/sgrep/) program written by
+Stephen C. Losen.
+
+It is designed for use on large, lexicographically sorted files. It allows you
+to search for lines that *begin* with a certain pattern (searching for anything
+at a position anywhere other than the start of a line isn't possible using a
+binary search).
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+    gem 'rsgrep'
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install rsgrep
+
+## Usage
+
+The gem monkey patches the File class. It can be used in the following two
+ways:
+
+``` ruby
+require 'rsgrep'
+
+puts File.sgrep("key pattern", "path/to/file.txt")
+#=> array of all lines that start with "key pattern", empty array for no
+#   matches.
+
+# or ...
+
+f = File.open("path/to/file.txt")
+puts f.sgrep("key pattern")
+#=> array of all lines that start with "key pattern", empty array for no
+#   matches.
+
+f.close
+```
+
+You can pass both of these methods an options hash. Here is an example of
+the options you can pass:
+
+``` ruby
+require 'rsgrep'
+
+f = File.open("path/to/file.txt")
+
+# Case insensitive search
+f.sgrep("PaTTern", :insensitive => true)
+
+f.close
+```
+
+**NOTE**: There are a lot of caveats involved in getting this to work properly.
+For example, you **cannot** do a case insensitive search on a file that is not
+sorted in a case insensitive fashion. The results will not be what you expect.
+
+This will be true of almost all options you pass to rsgrep. You will get the
+best results on a file that uses alphanumeric characters and only uses one
+casing (upper or lower, doesn't matter which).
+
+## Contributing
+
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Added some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
+
+## A note on the specs...
+
+Because writing specs for this required having a very large file to scan, I had
+to choose a very large file that was freely available. For obvious reasons, I
+cannot put the file into this repository but you can download it from
+[here](http://books.google.com/ngrams/datasets). It's the 0th file of the 3grams
+dataset in English.
+
+Direct link:
+[http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip](http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip)
+
+It's about 440 MB compressed, 3 GB uncompressed. You will need to uncompress it
+into the `spec/data` directory in order to run the specs successfully.
+
+This file is a bit of a bad example though, to be honest. I'm only using it at
+the moment so that the specs give a good idea of how long it takes to scan
+through such a large file. The reason that this is not a good file to use is
+that it isn't sorted in a way that rsgrep knows how to process yet. Its
+handling of capital letters and punctuation is a bit confusing and I haven't
+yet been able to find a consistent and clean way of scanning it.
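The README's two main claims (only line prefixes can be found by binary search, and the sort order must match the comparison being used) can be illustrated on an in-memory array. This is a minimal sketch, not code from the gem: `prefix_matches` is a hypothetical helper, and it assumes the input is sorted in Ruby's default byte order.

```ruby
# Hypothetical helper, not part of rsgrep: return every line in a sorted
# array that begins with `prefix`, using a lower-bound binary search.
def prefix_matches(sorted_lines, prefix)
  lo = 0
  hi = sorted_lines.length

  # Find the first index whose line is >= prefix.
  while lo < hi
    mid = (lo + hi) / 2
    if sorted_lines[mid] < prefix
      lo = mid + 1
    else
      hi = mid
    end
  end

  # In a sorted list, all lines sharing a prefix are contiguous, so scan
  # forward from the lower bound until the prefix no longer matches.
  sorted_lines[lo..-1].take_while { |line| line.start_with?(prefix) }
end

lines = %w[apple apricot banana band bandana cherry].sort
p prefix_matches(lines, "ban") # => ["banana", "band", "bandana"]
p prefix_matches(lines, "zzz") # => []

# The caveat from the NOTE: these lines are sorted case-sensitively
# (uppercase first), so comparing them case-insensitively breaks the
# ordering the search relies on and a match is silently missed.
mixed = %w[Banana apple band Cherry].sort
p prefix_matches(mixed.map(&:downcase), "ban") # => ["band"]
```

The last call misses "banana" because the downcased lines are no longer in sorted order, which is the failure mode the NOTE warns about.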
data/Rakefile
ADDED
data/bin/rsgrep
ADDED
@@ -0,0 +1,23 @@
+#!/usr/bin/env ruby
+
+require File.join(File.dirname(__FILE__), '..', 'lib', 'rsgrep')
+require 'optparse'
+
+options = {}
+
+OptionParser.new do |opts|
+  opts.on('-i', '--insensitive', 'Case insensitive search.') do
+    options[:insensitive] = true
+  end
+end.parse!
+
+key = ARGV.shift
+file = ARGV.shift
+
+begin
+  puts File.sgrep(key, file, options)
+rescue Errno::EPIPE
+  # If piping to another program and that program closes the pipe (by exiting,
+  # for example), you get an EPIPE exception. We don't really care about it. This
+  # is just to stop error output.
+end
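The argument handling in `bin/rsgrep` can be exercised without touching the real `ARGV`. The following is a sketch under assumptions: `parse_cli` is a hypothetical wrapper that does not exist in the gem, and it uses the non-destructive `OptionParser#parse` instead of `parse!` so that an explicit array can be passed in.

```ruby
require 'optparse'

# Hypothetical wrapper for illustration only. It mirrors the option
# definition in bin/rsgrep but takes the argument list as a parameter.
def parse_cli(argv)
  options = {}

  parser = OptionParser.new do |opts|
    opts.on('-i', '--insensitive', 'Case insensitive search.') do
      options[:insensitive] = true
    end
  end

  # parse returns the arguments that were not consumed as options.
  key, file = parser.parse(argv)
  [key, file, options]
end

key, file, options = parse_cli(%w[-i pattern words.txt])
puts "key=#{key} file=#{file} insensitive=#{options[:insensitive].inspect}"
```

With no `-i` flag the options hash stays empty, so `options[:insensitive]` is `nil` and the case-sensitive comparator would be chosen.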
data/lib/rsgrep.rb
ADDED
data/lib/rsgrep/file.rb
ADDED
@@ -0,0 +1,77 @@
+class File
+  def self.sgrep key, filename, options = {}
+    File.open(filename) do |f|
+      f.sgrep key, options
+    end
+  end
+
+  def sgrep key, options = {}
+    # initialise the variables that the binary search algorithm needs.
+    hi = size
+    lo = 0
+    mid = (hi + lo) / 2
+    ret = []
+
+    if options[:insensitive]
+      comparator = Proc.new do |key, line|
+        key = key.downcase
+        line = line.downcase
+        if line.start_with? key
+          0
+        else
+          key <=> line
+        end
+      end
+    else
+      comparator = Proc.new do |key, line|
+        if line.start_with? key
+          0
+        else
+          key <=> line
+        end
+      end
+    end
+
+    while lo < mid and mid < hi
+      seek(mid)
+      seek(-2, IO::SEEK_CUR) while (c = getc) != "\n" and c != nil
+
+      case comparator.call(key, (line = gets))
+      when 0
+        # So we've found a line that matches the key, but this may not be the
+        # first line that does (due to the nature of binary searching). This
+        # begin/end block scans backwards in the file to find the first line
+        # that matches the key.
+        begin
+          # Seek back 2 lines because "gets" will advance the file pointer
+          # forward.
+          2.times do
+            seek(-2, IO::SEEK_CUR) # First read will always be a newline, skip it
+            seek(-2, IO::SEEK_CUR) while (c = getc) != "\n" and c != nil
+          end
+        end while comparator.call(key, (line = gets)) == 0
+
+        # Then scan forward line by line until all of the lines that match have
+        # been added to the return array.
+        ret << line.rstrip while (line = gets) != nil and comparator.call(key, line) == 0
+
+        # and we're done!
+        return ret
+      when -1
+        # Key is less than the line. Shift the hi value down and recalculate the
+        # mid.
+        hi = mid
+        mid = (hi + lo) / 2
+      when 1
+        # Key is greater than the line. Shift the lo value up and recalculate
+        # the mid.
+        lo = mid
+        mid = (hi + lo) / 2
+      else
+        raise "Should not be raised, ever."
+      end
+    end
+
+    return ret
+  end
+end
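The core idea in `File#sgrep` is a binary search over byte offsets, where each probe has to be aligned to a line boundary before a line can be compared. The sketch below shows that idea in a different formulation from the gem: instead of seeking backwards to the start of the current line, it skips forward to the next line start and searches for a lower bound. `line_at_or_after` and `sorted_prefix_search` are hypothetical names, and the input is assumed to be sorted in byte order with "\n" line endings.

```ruby
require 'stringio'

# Hypothetical helper: return the first complete line that starts at or
# after byte offset `pos`, or nil if there is none.
def line_at_or_after(io, pos)
  if pos.zero?
    io.seek(0)
  else
    # Start one byte early and discard through the next newline. If `pos`
    # is already a line start, this discards only the preceding "\n".
    io.seek(pos - 1)
    io.gets
  end
  io.gets
end

# Hypothetical helper: collect all lines beginning with `prefix` from a
# byte-sorted, seekable IO. Not the gem's algorithm, only the same idea.
def sorted_prefix_search(io, prefix)
  lo = 0
  hi = io.size

  # Find the smallest offset whose following line is >= prefix (or EOF).
  while lo < hi
    mid = (lo + hi) / 2
    line = line_at_or_after(io, mid)
    if line.nil? || line.chomp >= prefix
      hi = mid
    else
      lo = mid + 1
    end
  end

  # Scan forward from that point while lines still carry the prefix.
  results = []
  line = line_at_or_after(io, lo)
  while line && line.start_with?(prefix)
    results << line.chomp
    line = io.gets
  end
  results
end

data = StringIO.new("apple\napricot\nbanana\nband\nbandana\ncherry\n")
p sorted_prefix_search(data, "ban") # => ["banana", "band", "bandana"]
p sorted_prefix_search(data, "a")   # => ["apple", "apricot"]
p sorted_prefix_search(data, "zzz") # => []
```

Searching for a lower bound avoids the backward scan over matching lines, at the cost of always running the full logarithmic number of probes. The same two helpers should work on a `File` opened for reading, since they only rely on `seek`, `gets`, and `size`.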
data/rsgrep.gemspec
ADDED
@@ -0,0 +1,21 @@
+# -*- encoding: utf-8 -*-
+require File.expand_path('../lib/rsgrep/version', __FILE__)
+
+Gem::Specification.new do |gem|
+  gem.authors = ["Sam Rose"]
+  gem.email = ["samwho@lbak.co.uk"]
+  gem.description = %q{Pure Ruby implementation of the sorted grep command.}
+  gem.summary = %q{sgrep for Ruby!}
+  gem.homepage = "http://github.com/samwho/rsgrep"
+
+  gem.files = `git ls-files`.split($\)
+  gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
+  gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
+  gem.name = "rsgrep"
+  gem.require_paths = ["lib"]
+  gem.version = Rsgrep::VERSION
+
+  gem.add_development_dependency 'rake'
+  gem.add_development_dependency 'rspec'
+  gem.add_development_dependency 'rspec-core'
+end
data/spec/file_spec.rb
ADDED
@@ -0,0 +1,70 @@
+require 'spec_helper'
+
+describe "File#sgrep" do
+  let(:data_file1) { File.open(DATA_FILE1) }
+
+  after :each do
+    data_file1.close
+  end
+
+  context "when searching for 'search for '" do
+    key = "search for "
+
+    subject { data_file1.sgrep key }
+    it { should_not be_empty }
+
+    specify "all elements start with '#{key}'" do
+      subject.all? { |elem| elem.start_with? key }.should be_true
+    end
+  end
+
+  context "when searching for 'a'" do
+    key = 'a'
+
+    subject { data_file1.sgrep key }
+    it { should_not be_empty }
+
+    specify "all elements start with '#{key}'" do
+      subject.all? { |elem| elem.start_with? key }.should be_true
+    end
+  end
+
+  context "when searching for '!'" do
+    key = '!'
+
+    subject { data_file1.sgrep key }
+    it { should_not be_empty }
+
+    specify "all elements start with '#{key}'" do
+      subject.all? { |elem| elem.start_with? key }.should be_true
+    end
+  end
+
+  context "no results (search 'this is not in this file anywhere')" do
+    key = 'this is not in this file anywhere'
+
+    subject { data_file1.sgrep key }
+    it { should be_empty }
+  end
+
+  if DICT_FILE
+    let(:dict_file) { File.open(DICT_FILE) }
+
+    after :each do
+      dict_file.close
+    end
+
+    context "Dictionary data tests" do
+      context "case insensitive searches" do
+        context "search for 'zyzzogeton'" do
+          key = "zyzzogeton"
+
+          subject { dict_file.sgrep key, :insensitive => true }
+          it { should include "Zyzzogeton" }
+        end
+      end
+    end
+  else
+    STDERR.puts "It doesn't seem like you have a dictionary file in a standard location. Skipping dictionary file tests."
+  end
+end
data/spec/spec_helper.rb
ADDED
@@ -0,0 +1,10 @@
+require File.join(File.dirname(__FILE__), '..', 'lib', 'rsgrep')
+
+DATA_ROOT = File.join(File.dirname(__FILE__), 'data')
+DATA_FILE1 = File.join(DATA_ROOT, 'googlebooks-eng-all-3gram-20090715-0.csv')
+
+DICT_FILE = if File.exists?("/usr/share/dict/words")
+  "/usr/share/dict/words"
+else
+  nil
+end
metadata
ADDED
@@ -0,0 +1,114 @@
+--- !ruby/object:Gem::Specification
+name: rsgrep
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+  prerelease:
+platform: ruby
+authors:
+- Sam Rose
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2012-09-27 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec-core
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Pure Ruby implementation of the sorted grep command.
+email:
+- samwho@lbak.co.uk
+executables:
+- rsgrep
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE
+- README.md
+- Rakefile
+- bin/rsgrep
+- lib/rsgrep.rb
+- lib/rsgrep/file.rb
+- lib/rsgrep/version.rb
+- rsgrep.gemspec
+- spec/file_spec.rb
+- spec/spec_helper.rb
+homepage: http://github.com/samwho/rsgrep
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 1288555168689700844
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 1288555168689700844
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: sgrep for Ruby!
+test_files:
+- spec/file_spec.rb
+- spec/spec_helper.rb