rsgrep 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,18 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ spec/data
16
+ test/tmp
17
+ test/version_tmp
18
+ tmp
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in rsgrep.gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2012 Sam Rose
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,97 @@
1
+ # Rsgrep
2
+
3
+ This is a pure Ruby implementation with the same goal as the small but amazing
4
+ [sorted grep](http://sourceforge.net/projects/sgrep/) program written by
5
+ Stephen C. Losen.
6
+
7
+ It is designed for use on large, lexicographically sorted files. It allows you
8
+ to search for lines that *begin* with a certain pattern (searching for anything
9
+ at a position anywhere other than the start of a line isn't possible using a
10
+ binary search).
11
+
12
+ ## Installation
13
+
14
+ Add this line to your application's Gemfile:
15
+
16
+ gem 'rsgrep'
17
+
18
+ And then execute:
19
+
20
+ $ bundle
21
+
22
+ Or install it yourself as:
23
+
24
+ $ gem install rsgrep
25
+
26
+ ## Usage
27
+
28
+ The gem monkey-patches the File class. It can be used in the following two
29
+ ways:
30
+
31
+ ``` ruby
32
+ require 'rsgrep'
33
+
34
+ puts File.sgrep("key pattern", "path/to/file.txt")
35
+ #=> array of all lines that start with "key pattern", empty array for no
36
+ # matches.
37
+
38
+ # or ...
39
+
40
+ f = File.open("path/to/file.txt")
41
+ puts f.sgrep("key pattern")
42
+ #=> array of all lines that start with "key pattern", empty array for no
43
+ # matches.
44
+
45
+ f.close
46
+ ```
47
+
48
+ You can pass both of these functions an options hash. Here are some examples of
49
+ the options you can pass:
50
+
51
+ ``` ruby
52
+ require 'rsgrep'
53
+
54
+ f = File.open("path/to/file.txt")
55
+
56
+ # Case insensitive search
57
+ f.sgrep("PaTTern", :insensitive => true)
58
+
59
+ f.close
60
+ ```
61
+
62
+ **NOTE**: There are a lot of caveats involved in getting this to work properly.
63
+ For example, you **cannot** do a case insensitive search on a file that is not
64
+ sorted in a case insensitive fashion. The results will not be what you expect.
65
+
66
+ This will be true of almost all options you pass to rsgrep. You will get the
67
+ best results on a file that uses alphanumeric characters and only uses one
68
+ casing (upper or lower, doesn't matter which).
69
+
70
+ ## Contributing
71
+
72
+ 1. Fork it
73
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
74
+ 3. Commit your changes (`git commit -am 'Added some feature'`)
75
+ 4. Push to the branch (`git push origin my-new-feature`)
76
+ 5. Create new Pull Request
77
+
78
+ ## A note on the specs...
79
+
80
+ Because writing specs for this required having a very large file to scan, I had
81
+ to choose a very large file that was freely available. For obvious reasons, I
82
+ cannot put the file into this repository but you can download it from
83
+ [here](http://books.google.com/ngrams/datasets). It's the 0th file of the 3grams
84
+ dataset in English.
85
+
86
+ Direct link:
87
+ [http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip](http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip)
88
+
89
+ It's about 440 MB compressed, 3 GB uncompressed. You will need to uncompress it
90
+ into the `spec/data` directory in order to run the specs successfully.
91
+
92
+ This file is a bit of a bad example though, to be honest. I'm only using it at
93
+ the moment so that the specs give a good idea of how long it takes to scan
94
+ through such a large file. The reason that this is not a good file to use is
95
+ because it isn't sorted in a way that rsgrep knows how to process yet. Its
96
+ handling of capital letters and punctuation is a bit confusing and I haven't
97
+ yet been able to find a consistent and clean way of scanning it.
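The README's caveat about case-insensitive searches can be made concrete. The following is a minimal sketch and not part of the gem: `sorted_case_insensitively?` and `sort_case_insensitively` are hypothetical helper names, and the sketch assumes the data is small enough to hold in memory. It illustrates the ordering that the `:insensitive` comparator in `lib/rsgrep/file.rb` appears to rely on, namely lines ordered by their downcased form.

```ruby
# Hypothetical helpers for illustration only; rsgrep does not provide them.

# Returns true when every line's downcased form is >= that of the line
# before it, i.e. the list is ordered the way a downcasing comparator expects.
def sorted_case_insensitively?(lines)
  lines.each_cons(2).all? { |a, b| (a.downcase <=> b.downcase) <= 0 }
end

# Returns a new array ordered by each line's downcased form.
def sort_case_insensitively(lines)
  lines.sort_by(&:downcase)
end

# Plain String#<=> ordering puts uppercase letters before lowercase ones.
byte_sorted = %w[Banana Cherry apple].sort

puts sorted_case_insensitively?(byte_sorted)   # false
fixed = sort_case_insensitively(byte_sorted)
puts sorted_case_insensitively?(fixed)         # true
```

A file that fails this kind of check is one where a binary search with the downcasing comparator can step in the wrong direction, which is presumably why the README warns that the results "will not be what you expect".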
data/Rakefile ADDED
@@ -0,0 +1,11 @@
1
+ #!/usr/bin/env rake
2
+ require "bundler/gem_tasks"
3
+ require 'rspec/core/rake_task'
4
+
5
+ desc 'Default: run specs.'
6
+ task :default => :spec
7
+
8
+ desc "Run specs"
9
+ RSpec::Core::RakeTask.new do |t|
10
+ t.rspec_opts = '-cfs'
11
+ end
data/bin/rsgrep ADDED
@@ -0,0 +1,23 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require File.join(File.dirname(__FILE__), '..', 'lib', 'rsgrep')
4
+ require 'optparse'
5
+
6
+ options = {}
7
+
8
+ OptionParser.new do |opts|
9
+ opts.on('-i', '--insensitive', 'Case insensitive search.') do
10
+ options[:insensitive] = true
11
+ end
12
+ end.parse!
13
+
14
+ key = ARGV.shift
15
+ file = ARGV.shift
16
+
17
+ begin
18
+ puts File.sgrep(key, file, options)
19
+ rescue Errno::EPIPE
20
+ # If you pipe to another program and that program closes STDOUT (by exiting,
21
+ # for example), you get an EPIPE exception. We don't really care about it. This
22
+ # is just to stop error output.
23
+ end
data/lib/rsgrep.rb ADDED
@@ -0,0 +1,8 @@
1
+ libdir = File.dirname(__FILE__)
2
+ $LOAD_PATH.unshift(libdir) unless $LOAD_PATH.include?(libdir)
3
+
4
+ require "rsgrep/version"
5
+ require "rsgrep/file"
6
+
7
+ module Rsgrep
8
+ end
data/lib/rsgrep/file.rb ADDED
@@ -0,0 +1,77 @@
1
+ class File
2
+ def self.sgrep key, filename, options = {}
3
+ File.open(filename) do |f|
4
+ f.sgrep key, options
5
+ end
6
+ end
7
+
8
+ def sgrep key, options = {}
9
+ # initialise the variables that the binary search algorithm needs.
10
+ hi = size
11
+ lo = 0
12
+ mid = (hi + lo) / 2
13
+ ret = []
14
+
15
+ if options[:insensitive]
16
+ comparator = Proc.new do |key, line|
17
+ key = key.downcase
18
+ line = line.downcase
19
+ if line.start_with? key
20
+ 0
21
+ else
22
+ key <=> line
23
+ end
24
+ end
25
+ else
26
+ comparator = Proc.new do |key, line|
27
+ if line.start_with? key
28
+ 0
29
+ else
30
+ key <=> line
31
+ end
32
+ end
33
+ end
34
+
35
+ while lo < mid and mid < hi
36
+ seek(mid)
37
+ seek(-2, IO::SEEK_CUR) while (c = getc) != "\n" and c != nil
38
+
39
+ case comparator.call(key, (line = gets))
40
+ when 0
41
+ # So we've found a line that matches the key, but this may not be the
42
+ # first line that does (due to the nature of binary searching). This
43
+ # begin/end block scans backwards in the file to find the first line
44
+ # that matches the key.
45
+ begin
46
+ # Seek back 2 lines because "gets" will advance the file pointer
47
+ # forward.
48
+ 2.times do
49
+ seek(-2, IO::SEEK_CUR) # First read will always be a newline, skip it
50
+ seek(-2, IO::SEEK_CUR) while (c = getc) != "\n" and c != nil
51
+ end
52
+ end while comparator.call(key, (line = gets)) == 0
53
+
54
+ # Then scan forward line by line until all of the lines that match have
55
+ # been added to the return array.
56
+ ret << line.rstrip while (line = gets) != nil and comparator.call(key, line) == 0
57
+
58
+ # and we're done!
59
+ return ret
60
+ when -1
61
+ # Key is less than the line. Shift the hi value down and recalculate the
62
+ # mid.
63
+ hi = mid
64
+ mid = (hi + lo) / 2
65
+ when 1
66
+ # Key is greater than the line. Shift the lo value up and recalculate
67
+ # the mid.
68
+ lo = mid
69
+ mid = (hi + lo) / 2
70
+ else
71
+ raise "Should not be raised, ever."
72
+ end
73
+ end
74
+
75
+ return ret
76
+ end
77
+ end
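The search strategy in `File#sgrep` (binary-search to a matching line, then collect the adjacent lines that share the prefix) can be illustrated on an in-memory array. This is only a sketch under stated assumptions, not the gem's implementation: `prefix_matches` is a hypothetical name, it uses Ruby's built-in `Array#bsearch_index` rather than seeking through a file by byte offset, and it assumes `lines` is already sorted with plain `String#<=>` ordering.

```ruby
# Hypothetical in-memory analogue of File#sgrep; not part of rsgrep.
# Assumes `lines` is sorted with String#<=> (plain byte order).
def prefix_matches(lines, key)
  # Find-minimum binary search: index of the first line that is >= key.
  # Every line that starts with key compares >= key, and in a sorted list
  # those lines sit next to each other, so any run of matches begins here.
  first = lines.bsearch_index { |line| line >= key }
  return [] if first.nil?

  # Collect lines from that point until the prefix no longer matches.
  lines[first..-1].take_while { |line| line.start_with?(key) }
end

lines = ["apple pie", "search for a", "search for b", "search party", "zebra"]

p prefix_matches(lines, "search for ")  # => ["search for a", "search for b"]
p prefix_matches(lines, "missing")      # => []
```

Because the binary search lands directly on the first candidate, this variant needs no backward scan; the file-based version above has to scan backwards because it can land on any line of the matching run.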
data/lib/rsgrep/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Rsgrep
2
+ VERSION = "0.0.1"
3
+ end
data/rsgrep.gemspec ADDED
@@ -0,0 +1,21 @@
1
+ # -*- encoding: utf-8 -*-
2
+ require File.expand_path('../lib/rsgrep/version', __FILE__)
3
+
4
+ Gem::Specification.new do |gem|
5
+ gem.authors = ["Sam Rose"]
6
+ gem.email = ["samwho@lbak.co.uk"]
7
+ gem.description = %q{Pure Ruby implementation of the sorted grep command.}
8
+ gem.summary = %q{sgrep for Ruby!}
9
+ gem.homepage = "http://github.com/samwho/rsgrep"
10
+
11
+ gem.files = `git ls-files`.split($\)
12
+ gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
13
+ gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
14
+ gem.name = "rsgrep"
15
+ gem.require_paths = ["lib"]
16
+ gem.version = Rsgrep::VERSION
17
+
18
+ gem.add_development_dependency 'rake'
19
+ gem.add_development_dependency 'rspec'
20
+ gem.add_development_dependency 'rspec-core'
21
+ end
data/spec/file_spec.rb ADDED
@@ -0,0 +1,70 @@
1
+ require 'spec_helper'
2
+
3
+ describe "File#sgrep" do
4
+ let(:data_file1) { File.open(DATA_FILE1) }
5
+
6
+ after :each do
7
+ data_file1.close
8
+ end
9
+
10
+ context "when searching for 'search for '" do
11
+ key = "search for "
12
+
13
+ subject { data_file1.sgrep key }
14
+ it { should_not be_empty }
15
+
16
+ specify "all elements start with '#{key}'" do
17
+ subject.all? { |elem| elem.start_with? key }.should be_true
18
+ end
19
+ end
20
+
21
+ context "when searching for 'a'" do
22
+ key = 'a'
23
+
24
+ subject { data_file1.sgrep key }
25
+ it { should_not be_empty }
26
+
27
+ specify "all elements start with '#{key}'" do
28
+ subject.all? { |elem| elem.start_with? key }.should be_true
29
+ end
30
+ end
31
+
32
+ context "when searching for '!'" do
33
+ key = '!'
34
+
35
+ subject { data_file1.sgrep key }
36
+ it { should_not be_empty }
37
+
38
+ specify "all elements start with '#{key}'" do
39
+ subject.all? { |elem| elem.start_with? key }.should be_true
40
+ end
41
+ end
42
+
43
+ context "no results (search 'this is not in this file anywhere')" do
44
+ key = 'this is not in this file anywhere'
45
+
46
+ subject { data_file1.sgrep key }
47
+ it { should be_empty }
48
+ end
49
+
50
+ if DICT_FILE
51
+ let(:dict_file) { File.open(DICT_FILE) }
52
+
53
+ after :each do
54
+ dict_file.close
55
+ end
56
+
57
+ context "Dictionary data tests" do
58
+ context "case insensitive searches" do
59
+ context "search for 'zyzzogeton'" do
60
+ key = "zyzzogeton"
61
+
62
+ subject { dict_file.sgrep key, :insensitive => true }
63
+ it { should include "Zyzzogeton" }
64
+ end
65
+ end
66
+ end
67
+ else
68
+ STDERR.puts "It doesn't seem like you have a dictionary file in a standard location. Skipping dictionary file tests."
69
+ end
70
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,10 @@
1
+ require File.join(File.dirname(__FILE__), '..', 'lib', 'rsgrep')
2
+
3
+ DATA_ROOT = File.join(File.dirname(__FILE__), 'data')
4
+ DATA_FILE1 = File.join(DATA_ROOT, 'googlebooks-eng-all-3gram-20090715-0.csv')
5
+
6
+ DICT_FILE = if File.exist?("/usr/share/dict/words")
7
+ "/usr/share/dict/words"
8
+ else
9
+ nil
10
+ end
metadata ADDED
@@ -0,0 +1,114 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rsgrep
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Sam Rose
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-09-27 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: rake
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :development
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ! '>='
28
+ - !ruby/object:Gem::Version
29
+ version: '0'
30
+ - !ruby/object:Gem::Dependency
31
+ name: rspec
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: rspec-core
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ description: Pure Ruby implementation of the sorted grep command.
63
+ email:
64
+ - samwho@lbak.co.uk
65
+ executables:
66
+ - rsgrep
67
+ extensions: []
68
+ extra_rdoc_files: []
69
+ files:
70
+ - .gitignore
71
+ - Gemfile
72
+ - LICENSE
73
+ - README.md
74
+ - Rakefile
75
+ - bin/rsgrep
76
+ - lib/rsgrep.rb
77
+ - lib/rsgrep/file.rb
78
+ - lib/rsgrep/version.rb
79
+ - rsgrep.gemspec
80
+ - spec/file_spec.rb
81
+ - spec/spec_helper.rb
82
+ homepage: http://github.com/samwho/rsgrep
83
+ licenses: []
84
+ post_install_message:
85
+ rdoc_options: []
86
+ require_paths:
87
+ - lib
88
+ required_ruby_version: !ruby/object:Gem::Requirement
89
+ none: false
90
+ requirements:
91
+ - - ! '>='
92
+ - !ruby/object:Gem::Version
93
+ version: '0'
94
+ segments:
95
+ - 0
96
+ hash: 1288555168689700844
97
+ required_rubygems_version: !ruby/object:Gem::Requirement
98
+ none: false
99
+ requirements:
100
+ - - ! '>='
101
+ - !ruby/object:Gem::Version
102
+ version: '0'
103
+ segments:
104
+ - 0
105
+ hash: 1288555168689700844
106
+ requirements: []
107
+ rubyforge_project:
108
+ rubygems_version: 1.8.24
109
+ signing_key:
110
+ specification_version: 3
111
+ summary: sgrep for Ruby!
112
+ test_files:
113
+ - spec/file_spec.rb
114
+ - spec/spec_helper.rb