ruby_ngrams 0.0.2 → 0.0.4

Files changed (3)
  1. data/README.rdoc +74 -8
  2. data/bin/ruby_ngrams +7 -8
  3. metadata +19 -8
data/README.rdoc CHANGED
@@ -1,18 +1,84 @@
  = ruby_ngrams
 
- == License
+ Author:: Martin Velez
+ Copyright:: Copyright (c) 2011 Martin Velez
+ License:: Distributed under the same terms as Ruby
 
- Copyright 2011 Martin Velez
+ = Description
 
- == Features
+ ruby_ngrams is an extension of Ruby's core String class. It provides a String
+ object with the capability to produce n-grams.
 
- * parses a string into a set of n-grams
+ From Wikipedia,
+ "In the fields of computational linguistics and probability, an n-gram is a
+ contiguous sequence of n items from a given sequence of text or speech. The
+ items in question can be phonemes, syllables, letters, words or base pairs
+ according to the application. n-grams are collected from a text or speech corpus.
 
- == Installation
+ An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"
+ (or, less commonly, a "digram"); size 3 is a "trigram"; size 4 is a "four-gram"
+ and size 5 or more is simply called an "n-gram"."
 
- * gem install ruby_ngrams
+ = Design
 
- == Usage
+ Instead of creating another namespace, this task seemed simple enough to merit
+ extending the String class. A string is a sequence of characters.
+ It can be words, binary code, sentences, paragraphs, etc. In short,
+ anything that you can store in a Ruby String object can be parsed into
+ n-grams of length n.
 
- usage goes here
+ The main method being added to the String class is ngrams(). It produces an
+ array of n-grams from a Ruby String object.
 
+ For example, let s be a Ruby String object.
+ Then s.ngrams() returns an array of n-grams from s.
+
+ Tokenization of s is set to single characters by default.
+ For example, if s = "Hello World!",
+ then the tokens of s are ["H","e","l","l","o"," ","W","o","r","l","d","!"].
+ By specifying a regular expression, you can tokenize the string s in many
+ different and useful ways.
+
+ If you set n = 4, then
+ s.ngrams = [["H", "e", "l", "l"],
+ ["e", "l", "l", "o"],
+ ["l", "l", "o", " "],
+ ["l", "o", " ", "W"],
+ ["o", " ", "W", "o"],
+ [" ", "W", "o", "r"],
+ ["W", "o", "r", "l"],
+ ["o", "r", "l", "d"],
+ ["r", "l", "d", "!"]].
+ Each item in the s.ngrams array can be joined but doesn't need to be.
+ If you want to join them, normally you can do so easily if it is text.
+ Be careful if you are trying to join n-grams with non-printable characters.
+
+ You can google "n-grams" to get more information about how n-grams are useful.
+
+ = Installation
+
+ gem install ruby_ngrams
+
+ = Alternative Tools
+
+ This is another tool I found, but it did too much. I only wanted
+ to produce n-grams from a string.
+ 1. raingrams[https://github.com/postmodern/raingrams]
+
+ = Usage
+
+ ./ruby_ngrams --
+
+ = Dependencies
+
+ * Ruby 1.9.1 or greater
+ * ruby_cli[https://github.com/martinvelez/ruby_cli] to run the gem executable
+
+ = TODO
+
+ * Test to determine limits of current approach which parses and stores n-grams
+ in memory.
+
+ = Source Code
+
+ https://github.com/martinvelez/ruby_ngrams
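The README added above describes ngrams() only in prose. As a rough illustration, the behaviour it describes could be sketched as follows. This is not the gem's source (lib/ is not part of this diff): the module name NgramsSketch and the positional n and regex parameters are assumptions, with defaults borrowed from the executable's {:regex => //, :n => 2}.

```ruby
# Hypothetical sketch of the behaviour described in the README.
# NgramsSketch is an illustrative name, not part of ruby_ngrams.
module NgramsSketch
  # Split +str+ into tokens with +regex+ (an empty regex yields single
  # characters), then return every contiguous run of +n+ tokens.
  def self.ngrams(str, n = 2, regex = //)
    tokens = str.split(regex)
    # each_cons raises on n < 1, so guard; too-short input gives no n-grams.
    return [] if n < 1 || tokens.length < n
    tokens.each_cons(n).to_a
  end
end

grams = NgramsSketch.ngrams("Hello World!", 4)
puts grams.length       # 9, matching the nine 4-grams listed in the README
puts grams.first.join   # Hell

# Word-level tokens: split on whitespace instead of on every character.
p NgramsSketch.ngrams("to be or not to be", 2, /\s+/).first  # ["to", "be"]
```

Keeping each n-gram as an array of tokens, rather than a joined string, mirrors the README's note that joining is optional and can be unsafe for non-printable characters.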
data/bin/ruby_ngrams CHANGED
@@ -1,4 +1,4 @@
- #!/usr/bin/ruby -w
+ #!/usr/bin/env ruby
 
  require 'ruby_cli'
  require 'ruby_ngrams'
@@ -8,10 +8,11 @@ class App
 
  def define_command_options() @options = {:regex => //, :n => 2} end
 
- # Define an OptionParser to parse the command line
- def parse_options?
+ # Redefining the RubyCLI define_option_parser method
+ # Need to tell the OptionParser how to handle this command's specific options.
+ def define_option_parser
  #configure an OptionParser
- @opt_parser = OptionParser.new do |opts|
+ OptionParser.new do |opts|
  opts.banner = "Usage: #{__FILE__} [OPTIONS]... [FILE]..."
  opts.separator ""
  opts.separator "Specific options:"
@@ -25,12 +26,10 @@ class App
  opts.on('-n', '--n NUM', Integer, 'set length n for n-grams') do |n|
  @options[:n] = n
  end
- opts.on('-r', '--regex REGEX', Regexp, 'set regex to split string into tokens') do |r|
+ opts.on('-r', '--regex "REGEX"', Regexp, 'set regex to split string into tokens') do |r|
  @options[:regex] = r
  end
  end
- @opt_parser.parse!(@default_argv) rescue return false
- true
  end
 
  def command
@@ -52,7 +51,7 @@ end
 
 
  if __FILE__ == $0
- app = App.new(ARGV)
+ app = App.new(ARGV, __FILE__)
  app.run
  end
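In this change, define_option_parser only builds and returns the OptionParser, and the parse step that parse_options? used to perform is no longer in the script. To see the option definitions in isolation, here is a standalone sketch that uses only Ruby's standard optparse library. It does not use ruby_cli, whose internals are not shown in this diff; the helper name parse_ngram_options and the file name input.txt are made up for illustration.

```ruby
require 'optparse'

# Hypothetical helper: builds the same two options as bin/ruby_ngrams
# and parses argv itself, since ruby_cli is not involved in this sketch.
def parse_ngram_options(argv)
  options = { :regex => //, :n => 2 }   # same defaults as the script
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ruby_ngrams [OPTIONS]... [FILE]..."
    # Integer and Regexp are built-in OptionParser argument converters.
    opts.on('-n', '--n NUM', Integer, 'set length n for n-grams') do |n|
      options[:n] = n
    end
    opts.on('-r', '--regex REGEX', Regexp, 'set regex to split string into tokens') do |r|
      options[:regex] = r
    end
  end
  # parse (unlike parse!) leaves argv untouched and returns the leftovers.
  files = parser.parse(argv)
  [options, files]
end

options, files = parse_ngram_options(['-n', '3', '-r', '/\s+/', 'input.txt'])
p options[:n]   # 3
p files         # ["input.txt"]
```

The Regexp converter accepts a pattern written between slashes, which is why the regex argument generally needs quoting on a shell command line, as the updated '--regex "REGEX"' help text suggests.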
 
metadata CHANGED
@@ -1,13 +1,12 @@
  --- !ruby/object:Gem::Specification
  name: ruby_ngrams
  version: !ruby/object:Gem::Version
- hash: 27
  prerelease: false
  segments:
  - 0
  - 0
- - 2
- version: 0.0.2
+ - 4
+ version: 0.0.4
  platform: ruby
  authors:
  - Martin Velez
@@ -15,10 +14,24 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2011-11-11 00:00:00 -08:00
+ date: 2011-11-29 00:00:00 -08:00
  default_executable:
- dependencies: []
-
+ dependencies:
+ - !ruby/object:Gem::Dependency
+ name: ruby_cli
+ prerelease: false
+ requirement: &id001 !ruby/object:Gem::Requirement
+ none: false
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ segments:
+ - 0
+ - 1
+ - 0
+ version: 0.1.0
+ type: :runtime
+ version_requirements: *id001
  description: A simple extension of the Ruby core string class to parse a string into n-grams
  email: mvelez999@gmail.com
  executables:
@@ -46,7 +59,6 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- hash: 3
  segments:
  - 0
  version: "0"
@@ -55,7 +67,6 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- hash: 3
  segments:
  - 0
  version: "0"
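The metadata change above records a new runtime dependency on ruby_cli (>= 0.1.0). The gemspec that produced this metadata is not part of the diff, so the following is only a guess at how such a dependency might be declared. The field values are copied from the metadata shown; the summary text and the overall layout are assumptions.

```ruby
require 'rubygems'

# Hypothetical gemspec fragment; the real ruby_ngrams.gemspec is not shown.
spec = Gem::Specification.new do |s|
  s.name        = 'ruby_ngrams'
  s.version     = '0.0.4'
  s.authors     = ['Martin Velez']
  s.email       = 'mvelez999@gmail.com'
  s.summary     = 'n-grams for Ruby strings'   # placeholder, not from the diff
  s.description = 'A simple extension of the Ruby core string class to ' \
                  'parse a string into n-grams'
  s.bindir      = 'bin'
  # add_dependency declares a runtime dependency by default, which
  # corresponds to "type: :runtime" in the metadata above.
  s.add_dependency 'ruby_cli', '>= 0.1.0'
end

dep = spec.dependencies.first
puts dep.name   # ruby_cli
puts dep.type   # runtime
```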