ruby_ngrams 0.0.2 → 0.0.4
- data/README.rdoc +74 -8
- data/bin/ruby_ngrams +7 -8
- metadata +19 -8
data/README.rdoc
CHANGED
@@ -1,18 +1,84 @@
 = ruby_ngrams
 
-
+Author:: Martin Velez
+Copyright:: Copyright (c) 2011 Martin Velez
+License:: Distributed under the same terms as Ruby
 
-
+= Description
 
-
+ruby_ngrams is an extension of Ruby's core String class. It provides a String
+object with the capability to produce n-grams.
 
-
+From Wikipedia,
+"In the fields of computational linguistics and probability, an n-gram is a
+contiguous sequence of n items from a given sequence of text or speech. The
+items in question can be phonemes, syllables, letters, words or base pairs
+according to the application. n-grams are collected from a text or speech corpus.
 
-
+An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"
+(or, less commonly, a "digram"); size 3 is a "trigram"; size 4 is a "four-gram"
+and size 5 or more is simply called an "n-gram"."
 
-
+= Design
 
-
+Instead of creating another namespace, this task seemed simple enough to merit
+extending the String class. A string is a sequence of characters.
+It can be words, binary code, sentences, paragraphs, etc. In short,
+anything that you can store in a Ruby String object can be parsed into
+n-grams of length n.
 
-
+The main method being added to the String class is ngrams(). It produces an
+array of n-grams from a Ruby String object.
 
+For example, let s be a Ruby String object.
+Then s.ngrams() returns array of n-grams of from s.
+
+Tokenization of s is set to single characters by default.
+For example, if s = "Hello World!",
+then the tokens of s are ["H","e","l","l","o"," ","W","o","r","l","d","!"].
+By specifying a regular expression, you can tokenize the string s in many
+different and useful ways.
+
+If you set n = 4, then
+s.ngrams = [["H", "e", "l", "l"],
+["e", "l", "l", "o"],
+["l", "l", "o", " "],
+["l", "o", " ", "W"],
+["o", " ", "W", "o"],
+[" ", "W", "o", "r"],
+["W", "o", "r", "l"],
+["o", "r", "l", "d"],
+["r", "l", "d", "!"]].
+Each item in the s.ngrams array can joined but doesn't need to be.
+If you want to join them, normally you can do so easily if it is text.
+Be careful if you are trying to join n-grams with non-printable characters.
+
+You can google "n-grams" to get more information about how n-grams are useful.
+
+= Installation
+
+gem install ruby_ngrams
+
+= Alternative Tools
+
+This is another tool I found but which did too much. I only wanted
+to produce n-grams from a string.
+1. raingrams[https://github.com/postmodern/raingrams]
+
+= Usage
+
+./ruby_ngrams --
+
+= Dependencies
+
+* Ruby 1.9.1 or greater
+* ruby_cli[https://github.com/martinvelez/ruby_cli] to run the gem executable
+
+= TODO
+
+* Test to determine limits of current approach which parses and stores n-grams
+in memory.
+
+= Source Code
+
+https://github.com/martinvelez/ruby_ngrams
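The tokenize-then-window behaviour that the new README describes can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: `sketch_ngrams` is a hypothetical stand-alone helper, not the gem's `String#ngrams`, and its positional parameters merely mirror the `:n` and `:regex` defaults visible in the executable's options hash.

```ruby
# Hypothetical helper, NOT the gem's own String#ngrams.
# Splits +str+ into tokens with +regex+ and returns every run of +n+
# consecutive tokens as an array of arrays.
def sketch_ngrams(str, n = 2, regex = //)
  tokens = str.split(regex)   # an empty regex splits into single characters
  tokens.each_cons(n).to_a    # sliding window over n adjacent tokens
end

s = "Hello World!"
four_grams = sketch_ngrams(s, 4)
puts four_grams.length          # 12 tokens yield 12 - 4 + 1 = 9 four-grams
puts four_grams.first.inspect   # ["H", "e", "l", "l"]

# Joining each n-gram back into a string is optional.
puts four_grams.map(&:join).first(2).inspect   # ["Hell", "ello"]

# Tokenizing on whitespace gives word-level bigrams instead.
puts sketch_ngrams("to be or not to be", 2, /\s+/).inspect
```

Because the windows are plain arrays of tokens, the caller decides whether to join them, which matches the README's caution about joining n-grams that contain non-printable characters.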
data/bin/ruby_ngrams
CHANGED
@@ -1,4 +1,4 @@
-#!/usr/bin/ruby
+#!/usr/bin/env ruby
 
 require 'ruby_cli'
 require 'ruby_ngrams'
@@ -8,10 +8,11 @@ class App
 
 def define_command_options() @options = {:regex => //, :n => 2} end
 
-#
-
+# Redefining the RubyCLI define_option_parser method
+# Need to tell the OptionParser how to handle this command specific options.
+def define_option_parser
 #configure an OptionParser
-
+OptionParser.new do |opts|
 opts.banner = "Usage: #{__FILE__} [OPTIONS]... [FILE]..."
 opts.separator ""
 opts.separator "Specific options:"
@@ -25,12 +26,10 @@ class App
 opts.on('-n', '--n NUM', Integer, 'set length n for n-grams') do |n|
 @options[:n] = n
 end
-opts.on('-r', '--regex REGEX', Regexp, 'set regex to split string into tokens') do |r|
+opts.on('-r', '--regex "REGEX"', Regexp, 'set regex to split string into tokens') do |r|
 @options[:regex] = r
 end
 end
-@opt_parser.parse!(@default_argv) rescue return false
-true
 end
 
 def command
@@ -52,7 +51,7 @@ end
 
 
 if __FILE__ == $0
-app = App.new(ARGV)
+app = App.new(ARGV, __FILE__)
 app.run
 end
 
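The `-n` and `-r` switches in the hunks above rely on OptionParser's built-in type coercion for `Integer` and `Regexp`. The following is a minimal sketch of that behaviour using only the standard `optparse` library. `parse_ngram_options` is a hypothetical stand-alone function; the real script builds its parser inside a RubyCLI-based `App` class, which is not reproduced here.

```ruby
require 'optparse'

# Hypothetical stand-alone parser for illustration only; it does not use
# ruby_cli and is not the script's actual define_option_parser method.
def parse_ngram_options(argv)
  options = { :regex => //, :n => 2 }   # same defaults as in the diff
  parser = OptionParser.new do |opts|
    opts.on('-n', '--n NUM', Integer, 'set length n for n-grams') do |n|
      options[:n] = n
    end
    opts.on('-r', '--regex REGEX', Regexp, 'set regex to split string into tokens') do |r|
      options[:regex] = r
    end
  end
  parser.parse(argv)   # non-destructive parse; leaves argv untouched
  options
end

opts = parse_ngram_options(['-n', '3', '-r', '/\s+/'])
puts opts[:n].inspect       # the "3" argument arrives as an Integer
puts opts[:regex].inspect   # the "/\s+/" argument arrives as a Regexp
```

On a real command line the regex argument usually needs shell quoting so that characters such as `\` and `|` reach the script intact, which may be why the option's help text changed to show `"REGEX"` in quotes.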
metadata
CHANGED
@@ -1,13 +1,12 @@
 --- !ruby/object:Gem::Specification
 name: ruby_ngrams
 version: !ruby/object:Gem::Version
-hash: 27
 prerelease: false
 segments:
 - 0
 - 0
--
-version: 0.0.
+- 4
+version: 0.0.4
 platform: ruby
 authors:
 - Martin Velez
@@ -15,10 +14,24 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2011-11-
+date: 2011-11-29 00:00:00 -08:00
 default_executable:
-dependencies:
-
+dependencies:
+- !ruby/object:Gem::Dependency
+name: ruby_cli
+prerelease: false
+requirement: &id001 !ruby/object:Gem::Requirement
+none: false
+requirements:
+- - ">="
+- !ruby/object:Gem::Version
+segments:
+- 0
+- 1
+- 0
+version: 0.1.0
+type: :runtime
+version_requirements: *id001
 description: A simple extension of the Ruby core string class to parse a string into n-grams
 email: mvelez999@gmail.com
 executables:
@@ -46,7 +59,6 @@ required_ruby_version: !ruby/object:Gem::Requirement
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
-hash: 3
 segments:
 - 0
 version: "0"
@@ -55,7 +67,6 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
-hash: 3
 segments:
 - 0
 version: "0"
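The metadata now records a runtime dependency on ruby_cli with the constraint `>= 0.1.0`. As a rough illustration of what that constraint means, the sketch below evaluates it with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The version numbers tested are arbitrary examples, not known ruby_cli releases, and the gemspec that produced this metadata is not part of the diff.

```ruby
require 'rubygems'

# The constraint recorded in the metadata for the ruby_cli dependency.
requirement = Gem::Requirement.new('>= 0.1.0')

# Arbitrary example versions (assumptions, not actual ruby_cli releases).
%w[0.0.9 0.1.0 0.2.3].each do |v|
  ok = requirement.satisfied_by?(Gem::Version.new(v))
  puts "#{v}: #{ok ? 'satisfies' : 'does not satisfy'} #{requirement}"
end
```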