ruby_ngrams 0.0.2 → 0.0.4
- data/README.rdoc +74 -8
- data/bin/ruby_ngrams +7 -8
- metadata +19 -8
data/README.rdoc
CHANGED
@@ -1,18 +1,84 @@
|
|
1
1
|
= ruby_ngrams
|
2
2
|
|
3
|
-
|
3
|
+
Author:: Martin Velez
|
4
|
+
Copyright:: Copyright (c) 2011 Martin Velez
|
5
|
+
License:: Distributed under the same terms as Ruby
|
4
6
|
|
5
|
-
|
7
|
+
= Description
|
6
8
|
|
7
|
-
|
9
|
+
ruby_ngrams is an extension of Ruby's core String class. It provides a String
|
10
|
+
object with the capability to produce n-grams.
|
8
11
|
|
9
|
-
|
12
|
+
From Wikipedia,
|
13
|
+
"In the fields of computational linguistics and probability, an n-gram is a
|
14
|
+
contiguous sequence of n items from a given sequence of text or speech. The
|
15
|
+
items in question can be phonemes, syllables, letters, words or base pairs
|
16
|
+
according to the application. n-grams are collected from a text or speech corpus.
|
10
17
|
|
11
|
-
|
18
|
+
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"
|
19
|
+
(or, less commonly, a "digram"); size 3 is a "trigram"; size 4 is a "four-gram"
|
20
|
+
and size 5 or more is simply called an "n-gram"."
|
12
21
|
|
13
|
-
|
22
|
+
= Design
|
14
23
|
|
15
|
-
|
24
|
+
Instead of creating another namespace, this task seemed simple enough to merit
|
25
|
+
extending the String class. A string is a sequence of characters.
|
26
|
+
It can be words, binary code, sentences, paragraphs, etc. In short,
|
27
|
+
anything that you can store in a Ruby String object can be parsed into
|
28
|
+
n-grams of length n.
|
16
29
|
|
17
|
-
|
30
|
+
The main method being added to the String class is ngrams(). It produces an
|
31
|
+
array of n-grams from a Ruby String object.
|
18
32
|
|
33
|
+
For example, let s be a Ruby String object.
|
34
|
+
Then s.ngrams() returns an array of n-grams from s.
|
35
|
+
|
36
|
+
Tokenization of s is set to single characters by default.
|
37
|
+
For example, if s = "Hello World!",
|
38
|
+
then the tokens of s are ["H","e","l","l","o"," ","W","o","r","l","d","!"].
|
39
|
+
By specifying a regular expression, you can tokenize the string s in many
|
40
|
+
different and useful ways.
|
41
|
+
|
42
|
+
If you set n = 4, then
|
43
|
+
s.ngrams = [["H", "e", "l", "l"],
|
44
|
+
["e", "l", "l", "o"],
|
45
|
+
["l", "l", "o", " "],
|
46
|
+
["l", "o", " ", "W"],
|
47
|
+
["o", " ", "W", "o"],
|
48
|
+
[" ", "W", "o", "r"],
|
49
|
+
["W", "o", "r", "l"],
|
50
|
+
["o", "r", "l", "d"],
|
51
|
+
["r", "l", "d", "!"]].
|
52
|
+
Each item in the s.ngrams array can be joined but doesn't need to be.
|
53
|
+
If you want to join them, normally you can do so easily if it is text.
|
54
|
+
Be careful if you are trying to join n-grams with non-printable characters.
|
55
|
+
|
56
|
+
You can google "n-grams" to get more information about how n-grams are useful.
|
57
|
+
|
58
|
+
= Installation
|
59
|
+
|
60
|
+
gem install ruby_ngrams
|
61
|
+
|
62
|
+
= Alternative Tools
|
63
|
+
|
64
|
+
This is another tool I found, but it did too much. I only wanted
|
65
|
+
to produce n-grams from a string.
|
66
|
+
1. raingrams[https://github.com/postmodern/raingrams]
|
67
|
+
|
68
|
+
= Usage
|
69
|
+
|
70
|
+
./ruby_ngrams --
|
71
|
+
|
72
|
+
= Dependencies
|
73
|
+
|
74
|
+
* Ruby 1.9.1 or greater
|
75
|
+
* ruby_cli[https://github.com/martinvelez/ruby_cli] to run the gem executable
|
76
|
+
|
77
|
+
= TODO
|
78
|
+
|
79
|
+
* Test to determine limits of current approach which parses and stores n-grams
|
80
|
+
in memory.
|
81
|
+
|
82
|
+
= Source Code
|
83
|
+
|
84
|
+
https://github.com/martinvelez/ruby_ngrams
|
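Note on the README hunk: the behaviour it describes (tokenize, then take every run of n consecutive tokens) can be approximated in a few lines. The sketch below is not the gem's implementation. `sketch_ngrams` and its `n:` and `regex:` keyword parameters are hypothetical names chosen for illustration, with defaults borrowed from the executable's `@options` (`:regex => //`, `:n => 2`). It assumes tokenization is a plain `String#split` on the regular expression, and it uses keyword arguments, so it needs a newer Ruby than the 1.9.1 the README lists.

```ruby
# Hypothetical helper, not part of ruby_ngrams: tokenizes a string and
# returns every run of n consecutive tokens as an array of arrays.
def sketch_ngrams(str, n: 2, regex: //)
  # Splitting on an empty regex yields single characters.
  tokens = str.split(regex)
  # each_cons slides a window of n tokens across the token array.
  tokens.each_cons(n).to_a
end

# Character 4-grams, matching the "Hello World!" example in the README.
chars = sketch_ngrams("Hello World!", n: 4)

# Word bigrams, obtained by splitting on whitespace instead.
words = sketch_ngrams("the quick brown fox", regex: /\s+/)

# Joining is optional; for plain text each n-gram collapses to a string.
joined = chars.map(&:join)
```

Under these assumptions `chars` holds the nine four-character windows listed in the README, and `words` holds three word pairs. Whether the real `ngrams()` takes its arguments this way is not shown in the diff.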
data/bin/ruby_ngrams
CHANGED
@@ -1,4 +1,4 @@
|
|
1
|
-
#!/usr/bin/ruby
|
1
|
+
#!/usr/bin/env ruby
|
2
2
|
|
3
3
|
require 'ruby_cli'
|
4
4
|
require 'ruby_ngrams'
|
@@ -8,10 +8,11 @@ class App
|
|
8
8
|
|
9
9
|
def define_command_options() @options = {:regex => //, :n => 2} end
|
10
10
|
|
11
|
-
#
|
12
|
-
|
11
|
+
# Redefining the RubyCLI define_option_parser method
|
12
|
+
# Need to tell the OptionParser how to handle this command's specific options.
|
13
|
+
def define_option_parser
|
13
14
|
#configure an OptionParser
|
14
|
-
|
15
|
+
OptionParser.new do |opts|
|
15
16
|
opts.banner = "Usage: #{__FILE__} [OPTIONS]... [FILE]..."
|
16
17
|
opts.separator ""
|
17
18
|
opts.separator "Specific options:"
|
@@ -25,12 +26,10 @@ class App
|
|
25
26
|
opts.on('-n', '--n NUM', Integer, 'set length n for n-grams') do |n|
|
26
27
|
@options[:n] = n
|
27
28
|
end
|
28
|
-
opts.on('-r', '--regex REGEX', Regexp, 'set regex to split string into tokens') do |r|
|
29
|
+
opts.on('-r', '--regex "REGEX"', Regexp, 'set regex to split string into tokens') do |r|
|
29
30
|
@options[:regex] = r
|
30
31
|
end
|
31
32
|
end
|
32
|
-
@opt_parser.parse!(@default_argv) rescue return false
|
33
|
-
true
|
34
33
|
end
|
35
34
|
|
36
35
|
def command
|
@@ -52,7 +51,7 @@ end
|
|
52
51
|
|
53
52
|
|
54
53
|
if __FILE__ == $0
|
55
|
-
app = App.new(ARGV)
|
54
|
+
app = App.new(ARGV, __FILE__)
|
56
55
|
app.run
|
57
56
|
end
|
58
57
|
|
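Note on the bin/ruby_ngrams hunks: in 0.0.4 the method appears to build and return an `OptionParser` rather than parsing arguments itself, presumably leaving the `parse!` call to ruby_cli. The standalone sketch below illustrates that shape using only the standard `optparse` library. `build_option_parser`, the `options` hash passed in, and the `input.txt` argument are hypothetical; the two `opts.on` declarations are copied from the diff.

```ruby
require 'optparse'

# Hypothetical stand-in for define_option_parser: builds and returns the
# parser without parsing anything itself.
def build_option_parser(options)
  OptionParser.new do |opts|
    opts.banner = "Usage: ruby_ngrams [OPTIONS]... [FILE]..."
    opts.separator ""
    opts.separator "Specific options:"
    # Integer makes OptionParser convert the argument before the block runs.
    opts.on('-n', '--n NUM', Integer, 'set length n for n-grams') do |n|
      options[:n] = n
    end
    # Regexp makes OptionParser build a Regexp from the argument string.
    opts.on('-r', '--regex "REGEX"', Regexp, 'set regex to split string into tokens') do |r|
      options[:regex] = r
    end
  end
end

options = { :regex => //, :n => 2 }
parser = build_option_parser(options)

# The caller decides when to parse; leftover arguments are returned.
rest = parser.parse(['-n', '3', '-r', '\s+', 'input.txt'])
```

Returning the parser keeps option definitions separate from error handling, so the inline `rescue return false` removed in this hunk is no longer needed inside the method.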
metadata
CHANGED
@@ -1,13 +1,12 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: ruby_ngrams
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
hash: 27
|
5
4
|
prerelease: false
|
6
5
|
segments:
|
7
6
|
- 0
|
8
7
|
- 0
|
9
|
-
-
|
10
|
-
version: 0.0.
|
8
|
+
- 4
|
9
|
+
version: 0.0.4
|
11
10
|
platform: ruby
|
12
11
|
authors:
|
13
12
|
- Martin Velez
|
@@ -15,10 +14,24 @@ autorequire:
|
|
15
14
|
bindir: bin
|
16
15
|
cert_chain: []
|
17
16
|
|
18
|
-
date: 2011-11-
|
17
|
+
date: 2011-11-29 00:00:00 -08:00
|
19
18
|
default_executable:
|
20
|
-
dependencies:
|
21
|
-
|
19
|
+
dependencies:
|
20
|
+
- !ruby/object:Gem::Dependency
|
21
|
+
name: ruby_cli
|
22
|
+
prerelease: false
|
23
|
+
requirement: &id001 !ruby/object:Gem::Requirement
|
24
|
+
none: false
|
25
|
+
requirements:
|
26
|
+
- - ">="
|
27
|
+
- !ruby/object:Gem::Version
|
28
|
+
segments:
|
29
|
+
- 0
|
30
|
+
- 1
|
31
|
+
- 0
|
32
|
+
version: 0.1.0
|
33
|
+
type: :runtime
|
34
|
+
version_requirements: *id001
|
22
35
|
description: A simple extension of the Ruby core string class to parse a string into n-grams
|
23
36
|
email: mvelez999@gmail.com
|
24
37
|
executables:
|
@@ -46,7 +59,6 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
46
59
|
requirements:
|
47
60
|
- - ">="
|
48
61
|
- !ruby/object:Gem::Version
|
49
|
-
hash: 3
|
50
62
|
segments:
|
51
63
|
- 0
|
52
64
|
version: "0"
|
@@ -55,7 +67,6 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
55
67
|
requirements:
|
56
68
|
- - ">="
|
57
69
|
- !ruby/object:Gem::Version
|
58
|
-
hash: 3
|
59
70
|
segments:
|
60
71
|
- 0
|
61
72
|
version: "0"
|