rsgrep 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,18 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ spec/data
16
+ test/tmp
17
+ test/version_tmp
18
+ tmp
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in rsgrep.gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2012 Sam Rose
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,97 @@
1
+ # Rsgrep
2
+
3
+ This is a pure Ruby implementation with the same goal as the small but amazing
4
+ [sorted grep](http://sourceforge.net/projects/sgrep/) program written by
5
+ Stephen C. Losen.
6
+
7
+ It is designed for use on large, lexicographically sorted files. It allows you
8
+ to search for lines that *begin* with a certain pattern (searching for anything
9
+ at a position anywhere other than the start of a line isn't possible using a
10
+ binary search).
11
+
12
+ ## Installation
13
+
14
+ Add this line to your application's Gemfile:
15
+
16
+ gem 'rsgrep'
17
+
18
+ And then execute:
19
+
20
+ $ bundle
21
+
22
+ Or install it yourself as:
23
+
24
+ $ gem install rsgrep
25
+
26
+ ## Usage
27
+
28
+ The gem monkey patches the File class. It can be used in the following two
29
+ ways:
30
+
31
+ ``` ruby
32
+ require 'rsgrep'
33
+
34
+ puts File.sgrep("key pattern", "path/to/file.txt")
35
+ #=> array of all lines that start with "key pattern", empty array for no
36
+ # matches.
37
+
38
+ # or ...
39
+
40
+ f = File.open("path/to/file.txt")
41
+ puts f.sgrep("key pattern")
42
+ #=> array of all lines that start with "key pattern", empty array for no
43
+ # matches.
44
+
45
+ f.close
46
+ ```
47
+
48
+ You can pass both of these functions an options hash. Here are some examples of
49
+ the options you can pass:
50
+
51
+ ``` ruby
52
+ require 'rsgrep'
53
+
54
+ f = File.open("path/to/file.txt")
55
+
56
+ # Case insensitive search
57
+ f.sgrep("PaTTern", :insensitive => true)
58
+
59
+ f.close
60
+ ```
61
+
62
+ **NOTE**: There are a lot of caveats involved in getting this to work properly.
63
+ For example, you **cannot** do a case insensitive search on a file that is not
64
+ sorted in a case insensitive fashion. The results will not be what you expect.
65
+
66
+ This will be true of almost all options you pass to rsgrep. You will get the
67
+ best results on a file that uses alphanumeric characters and only uses one
68
+ casing (upper or lower, doesn't matter which).
69
+
70
+ ## Contributing
71
+
72
+ 1. Fork it
73
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
74
+ 3. Commit your changes (`git commit -am 'Added some feature'`)
75
+ 4. Push to the branch (`git push origin my-new-feature`)
76
+ 5. Create new Pull Request
77
+
78
+ ## A note on the specs...
79
+
80
+ Because writing specs for this required having a very large file to scan, I had
81
+ to choose a very large file that was freely available. For obvious reasons, I
82
+ cannot put the file into this repository but you can download it from
83
+ [here](http://books.google.com/ngrams/datasets). It's the 0th file of the 3grams
84
+ dataset in English.
85
+
86
+ Direct link:
87
+ [http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip](http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip)
88
+
89
+ It's about 440 MB compressed, 3 GB uncompressed. You will need to uncompress it
90
+ into the `spec/data` directory in order to run the specs successfully.
91
+
92
+ This file is a bit of a bad example though, to be honest. I'm only using it at
93
+ the moment so that the specs give a good idea of how long it takes to scan
94
+ through such a large file. The reason that this is not a good file to use is
95
+ because it isn't sorted in a way that rsgrep knows how to process yet. Its
96
+ handling of capital letters and punctuation is a bit confusing and I haven't
97
+ yet been able to find a consistent and clean way of scanning it.
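The sorting caveat in the README can be illustrated with a small in-memory sketch. This is not rsgrep's code: `prefix_matches` is a hypothetical helper that uses Ruby's `Array#bsearch_index` on an array standing in for the lines of a sorted file. It assumes the lines are sorted under the same comparison that the search uses, and shows what happens when they are not.

```ruby
# Hypothetical helper, not part of rsgrep: binary-search a sorted array of
# lines for every entry that starts with +key+.
def prefix_matches(sorted_lines, key, insensitive: false)
  fold = insensitive ? ->(s) { s.downcase } : ->(s) { s }
  folded_key = fold.call(key)

  # Index of the first line that sorts at or after the key. This is only
  # meaningful when the lines are sorted under the same folding.
  first = sorted_lines.bsearch_index { |line| fold.call(line) >= folded_key }
  return [] if first.nil?

  sorted_lines[first..-1].take_while { |line| fold.call(line).start_with?(folded_key) }
end

# Byte-wise order puts every uppercase letter before every lowercase one.
byte_sorted = %w[apple Banana banana Apple].sort
# Case-insensitive order keeps "Apple" and "apple" adjacent.
fold_sorted = byte_sorted.sort_by { |line| [line.downcase, line] }

p prefix_matches(byte_sorted, "apple")                    # => ["apple"]
p prefix_matches(byte_sorted, "apple", insensitive: true) # => ["Apple"] ("apple" is missed)
p prefix_matches(fold_sorted, "apple", insensitive: true) # => ["Apple", "apple"]
```

The second call returns an incomplete result because the matching lines are not adjacent in a byte-sorted list, which is the situation the README's note warns about.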
data/Rakefile ADDED
@@ -0,0 +1,11 @@
1
+ #!/usr/bin/env rake
2
+ require "bundler/gem_tasks"
3
+ require 'rspec/core/rake_task'
4
+
5
+ desc 'Default: run specs.'
6
+ task :default => :spec
7
+
8
+ desc "Run specs"
9
+ RSpec::Core::RakeTask.new do |t|
10
+ t.rspec_opts = '-cfs'
11
+ end
data/bin/rsgrep ADDED
@@ -0,0 +1,23 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require File.join(File.dirname(__FILE__), '..', 'lib', 'rsgrep')
4
+ require 'optparse'
5
+
6
+ options = {}
7
+
8
+ OptionParser.new do |opts|
9
+ opts.on('-i', '--insensitive', 'Case insensitive search.') do
10
+ options[:insensitive] = true
11
+ end
12
+ end.parse!
13
+
14
+ key = ARGV.shift
15
+ file = ARGV.shift
16
+
17
+ begin
18
+ puts File.sgrep(key, file, options)
19
+ rescue Errno::EPIPE
20
+ # If output is piped to another program and that program closes the pipe (by
21
+ # exiting, for example), writing raises Errno::EPIPE. We don't care about it. This
22
+ # is just to stop error output.
23
+ end
data/lib/rsgrep.rb ADDED
@@ -0,0 +1,8 @@
1
+ libdir = File.dirname(__FILE__)
2
+ $LOAD_PATH.unshift(libdir) unless $LOAD_PATH.include?(libdir)
3
+
4
+ require "rsgrep/version"
5
+ require "rsgrep/file"
6
+
7
+ module Rsgrep
8
+ end
data/lib/rsgrep/file.rb ADDED
@@ -0,0 +1,77 @@
1
+ class File
2
+ def self.sgrep key, filename, options = {}
3
+ File.open(filename) do |f|
4
+ f.sgrep key, options
5
+ end
6
+ end
7
+
8
+ def sgrep key, options = {}
9
+ # initialise the variables that the binary search algorithm needs.
10
+ hi = size
11
+ lo = 0
12
+ mid = (hi + lo) / 2
13
+ ret = []
14
+
15
+ if options[:insensitive]
16
+ comparator = Proc.new do |key, line|
17
+ key = key.downcase
18
+ line = line.downcase
19
+ if line.start_with? key
20
+ 0
21
+ else
22
+ key <=> line
23
+ end
24
+ end
25
+ else
26
+ comparator = Proc.new do |key, line|
27
+ if line.start_with? key
28
+ 0
29
+ else
30
+ key <=> line
31
+ end
32
+ end
33
+ end
34
+
35
+ while lo < mid and mid < hi
36
+ seek(mid)
37
+ seek(-2, IO::SEEK_CUR) while (c = getc) != "\n" and c != nil
38
+
39
+ case comparator.call(key, (line = gets))
40
+ when 0
41
+ # So we've found a line that matches the key, but this may not be the
42
+ # first line that does (due to the nature of binary searching). This
43
+ # begin/end block scans backwards in the file to find the first line
44
+ # that matches the key.
45
+ begin
46
+ # Seek back 2 lines because "gets" will advance the file pointer
47
+ # forward.
48
+ 2.times do
49
+ seek(-2, IO::SEEK_CUR) # First read will always be a newline, skip it
50
+ seek(-2, IO::SEEK_CUR) while (c = getc) != "\n" and c != nil
51
+ end
52
+ end while comparator.call(key, (line = gets)) == 0
53
+
54
+ # Then scan forward line by line until all of the lines that match have
55
+ # been added to the return array.
56
+ ret << line.rstrip while (line = gets) != nil and comparator.call(key, line) == 0
57
+
58
+ # and we're done!
59
+ return ret
60
+ when -1
61
+ # Key is less than the line. Shift the hi value down and recalculate the
62
+ # mid.
63
+ hi = mid
64
+ mid = (hi + lo) / 2
65
+ when 1
66
+ # Key is greater than the line. Shift the lo value up and recalculate
67
+ # the mid.
68
+ lo = mid
69
+ mid = (hi + lo) / 2
70
+ else
71
+ raise "Should not be raised, ever."
72
+ end
73
+ end
74
+
75
+ return ret
76
+ end
77
+ end
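The `seek(-2, IO::SEEK_CUR)` idiom in `File#sgrep` steps backwards one character at a time until it reads a newline. A relative seek that would land before offset 0 typically raises `Errno::EINVAL`, so the start of the file appears to need an explicit guard. The following is a minimal sketch of one way to find a line start with that guard. `seek_to_line_start` is a hypothetical helper, not part of rsgrep, and it reads bytes with `getbyte` rather than characters with `getc`, which is an assumption made here to avoid landing inside a multibyte character.

```ruby
require 'stringio'

# Hypothetical helper, not part of rsgrep: move +io+ to the first byte of the
# line that contains byte offset +pos+ and return that offset.
def seek_to_line_start(io, pos)
  while pos > 0
    io.seek(pos - 1)
    break if io.getbyte == 0x0A # the previous byte is "\n", so pos starts a line
    pos -= 1
  end
  io.seek(pos)
  pos
end

io = StringIO.new("aa\nbb\ncc\n")
seek_to_line_start(io, 4) # offset 4 is the second "b"
p io.gets                 # => "bb\n"
seek_to_line_start(io, 1) # offset 1 is inside the first line
p io.gets                 # => "aa\n"
```

Seeking once per byte is slow compared with reading a block and scanning it, so this only shows the boundary handling and is not a tuned replacement.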
data/lib/rsgrep/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Rsgrep
2
+ VERSION = "0.0.1"
3
+ end
data/rsgrep.gemspec ADDED
@@ -0,0 +1,21 @@
1
+ # -*- encoding: utf-8 -*-
2
+ require File.expand_path('../lib/rsgrep/version', __FILE__)
3
+
4
+ Gem::Specification.new do |gem|
5
+ gem.authors = ["Sam Rose"]
6
+ gem.email = ["samwho@lbak.co.uk"]
7
+ gem.description = %q{Pure Ruby implementation of the sorted grep command.}
8
+ gem.summary = %q{sgrep for Ruby!}
9
+ gem.homepage = "http://github.com/samwho/rsgrep"
10
+
11
+ gem.files = `git ls-files`.split($\)
12
+ gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
13
+ gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
14
+ gem.name = "rsgrep"
15
+ gem.require_paths = ["lib"]
16
+ gem.version = Rsgrep::VERSION
17
+
18
+ gem.add_development_dependency 'rake'
19
+ gem.add_development_dependency 'rspec'
20
+ gem.add_development_dependency 'rspec-core'
21
+ end
data/spec/file_spec.rb ADDED
@@ -0,0 +1,70 @@
1
+ require 'spec_helper'
2
+
3
+ describe "File#sgrep" do
4
+ let(:data_file1) { File.open(DATA_FILE1) }
5
+
6
+ after :each do
7
+ data_file1.close
8
+ end
9
+
10
+ context "when searching for 'search for '" do
11
+ key = "search for "
12
+
13
+ subject { data_file1.sgrep key }
14
+ it { should_not be_empty }
15
+
16
+ specify "all elements start with '#{key}'" do
17
+ subject.all? { |elem| elem.start_with? key }.should be_true
18
+ end
19
+ end
20
+
21
+ context "when searching for 'a'" do
22
+ key = 'a'
23
+
24
+ subject { data_file1.sgrep key }
25
+ it { should_not be_empty }
26
+
27
+ specify "all elements start with '#{key}'" do
28
+ subject.all? { |elem| elem.start_with? key }.should be_true
29
+ end
30
+ end
31
+
32
+ context "when searching for '!'" do
33
+ key = '!'
34
+
35
+ subject { data_file1.sgrep key }
36
+ it { should_not be_empty }
37
+
38
+ specify "all elements start with '#{key}'" do
39
+ subject.all? { |elem| elem.start_with? key }.should be_true
40
+ end
41
+ end
42
+
43
+ context "no results (search 'this is not in this file anywhere')" do
44
+ key = 'this is not in this file anywhere'
45
+
46
+ subject { data_file1.sgrep key }
47
+ it { should be_empty }
48
+ end
49
+
50
+ if DICT_FILE
51
+ let(:dict_file) { File.open(DICT_FILE) }
52
+
53
+ after :each do
54
+ dict_file.close
55
+ end
56
+
57
+ context "Dictionary data tests" do
58
+ context "case insensitive searches" do
59
+ context "search for 'zyzzogeton'" do
60
+ key = "zyzzogeton"
61
+
62
+ subject { dict_file.sgrep key, :insensitive => true }
63
+ it { should include "Zyzzogeton" }
64
+ end
65
+ end
66
+ end
67
+ else
68
+ STDERR.puts "It doesn't seem like you have a dictionary file in a standard location. Skipping dictionary file tests."
69
+ end
70
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,10 @@
1
+ require File.join(File.dirname(__FILE__), '..', 'lib', 'rsgrep')
2
+
3
+ DATA_ROOT = File.join(File.dirname(__FILE__), 'data')
4
+ DATA_FILE1 = File.join(DATA_ROOT, 'googlebooks-eng-all-3gram-20090715-0.csv')
5
+
6
+ DICT_FILE = if File.exist?("/usr/share/dict/words")
7
+ "/usr/share/dict/words"
8
+ else
9
+ nil
10
+ end
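The spec helper above depends on a 3 GB download. As a sketch of an alternative, a small sorted fixture could be generated at test time with Ruby's standard `Tempfile`. `sorted_fixture` and the sample lines are hypothetical and not part of the gem; the helper sorts byte-wise, matching the case-sensitive comparison used by default.

```ruby
require 'tempfile'

# Hypothetical helper, not part of rsgrep: write +lines+ to a temporary file in
# byte-wise sorted order, one per line, and return the open Tempfile.
def sorted_fixture(lines)
  file = Tempfile.new('rsgrep-fixture')
  lines.sort.each { |line| file.puts(line) }
  file.flush
  file.rewind
  file
end

fixture = sorted_fixture(["search for me", "apple pie", "search for you", "zebra"])
p fixture.readlines.map(&:chomp)
# => ["apple pie", "search for me", "search for you", "zebra"]
fixture.close! # close and delete the temporary file
```

A spec could then open `fixture.path` with `File.open` and call `sgrep` on it. Whether the backward scan in `File#sgrep` copes with matches on the very first line of such a small file is not verified here.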
metadata ADDED
@@ -0,0 +1,114 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rsgrep
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Sam Rose
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-09-27 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: rake
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :development
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
30
+ - !ruby/object:Gem::Dependency
31
+ name: rspec
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: rspec-core
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ description: Pure Ruby implementation of the sorted grep command.
63
+ email:
64
+ - samwho@lbak.co.uk
65
+ executables:
66
+ - rsgrep
67
+ extensions: []
68
+ extra_rdoc_files: []
69
+ files:
70
+ - .gitignore
71
+ - Gemfile
72
+ - LICENSE
73
+ - README.md
74
+ - Rakefile
75
+ - bin/rsgrep
76
+ - lib/rsgrep.rb
77
+ - lib/rsgrep/file.rb
78
+ - lib/rsgrep/version.rb
79
+ - rsgrep.gemspec
80
+ - spec/file_spec.rb
81
+ - spec/spec_helper.rb
82
+ homepage: http://github.com/samwho/rsgrep
83
+ licenses: []
84
+ post_install_message:
85
+ rdoc_options: []
86
+ require_paths:
87
+ - lib
88
+ required_ruby_version: !ruby/object:Gem::Requirement
89
+ none: false
90
+ requirements:
91
+ - - ! '>='
92
+ - !ruby/object:Gem::Version
93
+ version: '0'
94
+ segments:
95
+ - 0
96
+ hash: 1288555168689700844
97
+ required_rubygems_version: !ruby/object:Gem::Requirement
98
+ none: false
99
+ requirements:
100
+ - - ! '>='
101
+ - !ruby/object:Gem::Version
102
+ version: '0'
103
+ segments:
104
+ - 0
105
+ hash: 1288555168689700844
106
+ requirements: []
107
+ rubyforge_project:
108
+ rubygems_version: 1.8.24
109
+ signing_key:
110
+ specification_version: 3
111
+ summary: sgrep for Ruby!
112
+ test_files:
113
+ - spec/file_spec.rb
114
+ - spec/spec_helper.rb