ruby-ahocorasick 0.4.5
- data/MIT-LICENSE +21 -0
- data/README.textile +101 -0
- data/examples/dict.rb +25 -0
- data/examples/elev.rb +19 -0
- data/examples/sample.c +94 -0
- data/examples/test.rb +46 -0
- data/ext/ac.c +623 -0
- data/ext/ac.h +36 -0
- data/ext/extconf.rb +6 -0
- data/ext/ruby-ahocorasick.c +339 -0
- metadata +68 -0
data/MIT-LICENSE
ADDED
@@ -0,0 +1,21 @@
Copyright (c) 2008 Aurelian Oancea

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.textile
ADDED
@@ -0,0 +1,101 @@
h1. This is a work in progress.


h2. Introduction

This library is a "Ruby":http://ruby-lang.org extension, a wrapper around the "Aho-Corasick":http://en.wikipedia.org/wiki/Aho-Corasick_algorithm implementation in C found in the "Strmat":http://www.cs.ucdavis.edu/~gusfield/strmat.html package.

The source code (ac.c and ac.h) was "adapted" from Strmat. In fact, I've changed only 3-4 lines of code from the original implementation so it will fit my needs: search needed to return the current position in the searched string.


h2. Okay, so what's the idea?

Given a dictionary of known sentences (note: not *words*!), this kick-ass algorithm can find individual patterns in an incoming stream of data. Kinda fast.

The algorithm has 2 stages: in the first, an internal tree is built from the given dictionary; the search itself is left to the second.


h2. Okay, so where can I use this?

Well, you can do some crazy things with it: you can look up DNA patterns, or maybe analyze network sequences (read: strange and maybe proprietary network protocols), or do domestic stuff like building contextual links on your blog posts to enrich your users' experience.


h2. Okay, so how can I install it?


h3. Rubygems - Development Version

<pre>
gem install aurelian-ruby-ahocorasick --source=http://gems.github.com
</pre>


h3. Build it from source

<pre>
$ git clone git://github.com/aurelian/ruby-ahocorasick.git
$ cd ruby-ahocorasick
</pre>

To build and install the gem on your machine (run with sudo if needed):

<pre>
$ rake install
</pre>

@rake -T@ will list other cool tasks.


h3. Rubygems - Stable Version

There's no stable version right now.


h4. Notes

It's known to work / compile / install on Ubuntu 8.04 and Mac OS 10.4.*. It should work out of the box if you have gcc around.
Unfortunately I don't have a Windows PC around, nor the required knowledge about Microsoft compilers.


h2. Okay, so how do I use it?

<pre>
require 'ahocorasick'

keyword_tree= AhoCorasick::KeywordTree.new # creates a new tree
keyword_tree.add_string( "foo-- Z@!bar" ) # adds a keyword to the tree
keyword_tree.add_string( "cervantes" ) # even more

results= keyword_tree.find_all( "1011000129 foo-- Z@!bar761 ! 001211 6xU" ).each do | result |
  result[:value] # => "foo-- Z@!bar"
  result[:starts_at] # => 11
  result[:ends_at] # => 23
  result[:id] # => 1
end

</pre>

You can get some API reference on the "wiki":http://github.com/aurelian/ruby-ahocorasick/wikis.


h2. Bugs? Suggestions? Ideas? Patches?

For now, just use the email address.


h2. Additional Reading

Other suffix-tree implementations:

* "Strmat":http://www.cs.ucdavis.edu/~gusfield/strmat.html
* "Pytst":http://nicolas.lehuen.com/download/pytst / "Ruby-Pytst":http://rubyforge.org/projects/ruby-pytst/
* "Aho-Corasick extension":http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
* "Keyword Prospector":http://latimes.rubyforge.org/keyword_prospector/rdoc/
* "libstree":http://www.cl.cam.ac.uk/~cpk25/libstree/


h2. License

(c) 2008 - Aurelian Oancea, < oancea at gmail dot com >

released under MIT-LICENSE

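The two stages the README mentions (build a tree from the dictionary, then search) can be sketched in plain Ruby. This is only an illustration under assumptions: @TinyKeywordTree@ and its methods are hypothetical names, the trie uses hashes rather than the C linked lists, and nothing here is the gem's actual implementation. The result hashes imitate the README example, with @:ends_at@ pointing one past the last matched character.

```ruby
# Pure-Ruby sketch of the two Aho-Corasick stages. TinyKeywordTree is a
# hypothetical name used for illustration; it is not part of the gem.
class TinyKeywordTree
  class Node
    attr_accessor :children, :fail, :ids, :outputs

    def initialize
      @children = {}  # character => child Node
      @fail = nil     # failure link, set by build
      @ids = []       # ids of keywords ending exactly at this node
      @outputs = []   # ids of every keyword ending here, set by build
    end
  end

  def initialize
    @root = Node.new
    @keywords = []
    @built = false
  end

  # Stage 1a: add a keyword to the trie; ids are assigned from 1 upwards.
  def <<(keyword)
    node = @root
    keyword.each_char { |ch| node = (node.children[ch] ||= Node.new) }
    @keywords << keyword
    node.ids << @keywords.size
    @built = false
    self
  end

  # Stage 1b: breadth-first pass that computes failure links and outputs.
  def build
    queue = []
    @root.children.each_value do |child|
      child.fail = @root
      child.outputs = child.ids.dup
      queue << child
    end
    until queue.empty?
      node = queue.shift
      node.children.each do |ch, child|
        f = node.fail
        f = f.fail until f.equal?(@root) || f.children.key?(ch)
        child.fail = f.children[ch] || @root
        child.outputs = child.ids + child.fail.outputs
        queue << child
      end
    end
    @built = true
  end

  # Stage 2: scan the text once, following failure links on a mismatch.
  def search(text)
    build unless @built
    results = []
    node = @root
    text.each_char.with_index do |ch, i|
      node = node.fail until node.equal?(@root) || node.children.key?(ch)
      node = node.children[ch] || @root
      node.outputs.each do |id|
        value = @keywords[id - 1]
        results << { id: id, value: value,
                     starts_at: i + 1 - value.length, ends_at: i + 1 }
      end
    end
    results
  end
end

tree = TinyKeywordTree.new
tree << "foo-- Z@!bar" << "cervantes"
p tree.search("1011000129 foo-- Z@!bar761 ! 001211 6xU")
```

The hash-based children keep the sketch short; the C code in ac.c stores children as a sorted linked list instead.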
data/examples/dict.rb
ADDED
@@ -0,0 +1,25 @@
#!/usr/bin/env ruby

require 'time'

require File.dirname(__FILE__) + '/../ext/ahocorasick'

t= Time.now

k= AhoCorasick::KeywordTree.from_file(File.dirname(__FILE__) + "/../spec/data/en.words")

t1= Time.now

puts "%d words added in %s seconds" % [k.size, (t1-t)]

query = File.read( File.dirname(__FILE__) + "/../spec/data/news.txt" )

results= k.search query

puts "took %s seconds to find %d results in a stream with %d characters" % [(Time.now-t1), results.size, query.size]

exit # everything below is unreachable; remove this line to dump each match
results.each do | r |
  puts query[r[:starts_at]].chr + ".." + query[r[:ends_at]-1].chr + " => " + r[:value]
end

data/examples/elev.rb
ADDED
@@ -0,0 +1,19 @@
#!/usr/bin/env ruby

$kcode='UTF-8'

require File.dirname(__FILE__) + '/../ext/ahocorasick'

k= AhoCorasick::KeywordTree.new

k << "I've"
k << "data"
k << "base"
k << "database"

query= "I've moved my data to a database"

k.search(query).each do | r |
  puts "-> [ " + r[:id].to_s + " ] " + r[:value] + " / " + query[r[:starts_at]].chr + ".." + query[r[:ends_at]-1].chr
end

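Both example scripts index the query with @r[:starts_at]@ and @r[:ends_at]-1@, which suggests that @:ends_at@ points one past the last matched character (the README example, 11 to 23 for a 12-character keyword, agrees). A small sketch of slicing under that assumption follows. The @results@ array is a hand-written stand-in, not output captured from the gem: the ids and the ordering are assumptions, and only the offsets are derived from the query string.

```ruby
query = "I've moved my data to a database"

# Hand-written stand-ins for the hashes a search might return. The ids and
# the ordering are assumptions; the offsets are computed from the query.
results = [
  { id: 1, value: "I've",     starts_at: 0,  ends_at: 4 },
  { id: 2, value: "data",     starts_at: 14, ends_at: 18 },
  { id: 2, value: "data",     starts_at: 24, ends_at: 28 },
  { id: 4, value: "database", starts_at: 24, ends_at: 32 },
  { id: 3, value: "base",     starts_at: 28, ends_at: 32 }
]

# A three-dot range includes starts_at and excludes ends_at.
def matched_text(text, result)
  text[result[:starts_at]...result[:ends_at]]
end

results.each do |r|
  puts "[ #{r[:id]} ] #{r[:value]} / #{matched_text(query, r).inspect}"
end
```

Overlapping keywords such as "data", "base" and "database" each get their own entry, so a caller that wants non-overlapping matches has to filter the list itself.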
data/examples/sample.c
ADDED
@@ -0,0 +1,94 @@
//
// Getting started with Aho-Corasick from Strmat
//
// lasick Makefile:
//
// --
// ac.o :
//   gcc -c -fPIC -shared ac.c
// libasick : ac.o
//   gcc -shared -Wl,-soname,libasick.so -o libasick.so.1.0.1 ac.o
//   ar rcs libasick.a ac.o
// clean :
//   rm -rf *.o *.a *.so* *.dylib*
// --
//
// Compile this stuff - asick is the library name, generated with the above Makefile :)
//
// gcc sample.c -o ac-sample -I../ext/ -L../ext/ -lasick
//

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include "ac.h"

int main(int argc, char *argv[]) {
  char * search;
  char * remain;
  char * result;
  char word[1024];

  FILE *dictionary;
  FILE *input;
  int lgt, id, n, i;

  AC_STRUCT * tree;

  input= fopen(argv[1], "r");

  if(input == NULL) {
    search= argv[1];
  } else {
    long lSize;
    fseek (input , 0 , SEEK_END);
    lSize= ftell (input);
    rewind(input);
    search = (char*) malloc (sizeof(char)*(lSize+1)); // room for the terminating NUL
    if (search == NULL) { fputs ("Error: Memory error",stderr); exit(-2); }
    search[fread(search, 1, lSize-1, input)]= '\0'; // strlen() is called on this buffer below
  }

  dictionary= fopen("../spec/data/dictionary.txt", "r");

  if(dictionary == NULL) {
    printf("Error: can't open file.\n");
    return -1;
  }

  tree= ac_alloc();

  // start counting from 1
  n= 1;

  printf("==> building dictionary ...");

  while(fgets(word, 1024, dictionary) != NULL) {
    // strip \n
    ac_add_string(tree, word, strlen(word)-1, n++);
  }

  printf("%d entries added.\n", n-1);

  ac_prep(tree);

  printf("==> input text [%d]:\n--\n%s\n--\n", (int) strlen(search), search);

  ac_search_init(tree, search, strlen(search) );

  // ac_search takes four arguments (see ac.h); ends_at may be NULL
  while((remain= ac_search(tree, &lgt, &id, NULL)) != NULL) {
    printf("`%d'", remain[lgt]); // character right after the match (0 at end of text)
    result = (char*) malloc (sizeof(char)*(lgt+1)); // +1 for the NUL written by sprintf
    sprintf( result, "%.*s", lgt, remain);
    // result: should read first lgt chars from remain.
    printf("==> result: length=> %d, id=> %d [%s]\n", lgt, id, result);
    free(result);
  }

  ac_free(tree);
  fclose(dictionary);
  if(input != NULL) { fclose(input); free(search); } // search is argv[1] otherwise

  return 0;
}

data/examples/test.rb
ADDED
@@ -0,0 +1,46 @@
#!/usr/bin/env ruby

require File.dirname(__FILE__) + '/../ext/ahocorasick'

k= AhoCorasick::KeywordTree.new

puts k.size
k.add_string("foo");

puts k.size
k.add_string("bar", 1991);

puts k.size
k.add_string("fomz");

begin
  k.add_string("foo", -1);
rescue RuntimeError => err
  puts "[ok]==> got " + err.class.name + ": " + err.message
end

begin
  k.add_string("foo", "bar");
rescue RuntimeError => err
  puts "[ok]==> got " + err.class.name + ": " + err.message
end

k.add_string("timisoara", 22);

puts k.size
begin
  k.add_string("bucuresti", 22);
rescue RuntimeError => err
  puts "[ok]==> got " + err.class.name + ": " + err.message
end

k << "bacau"

k.search('am fost la bacau').each do | result |
  puts result.inspect
end

k.search( 'din foo in foo' ).each do | q |
  puts q.inspect
end

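test.rb expects a RuntimeError for a negative id, a non-integer id and an id that is already taken. The rules it exercises can be sketched in plain Ruby as below. This is a guess at the behaviour, not the extension's code: @IdBook@, its method names and its error messages are all hypothetical, and the "next id is last id plus one" rule is borrowed from the commented-out @rb_add_string@ helper in the C source.

```ruby
# Hypothetical helper mirroring the id rules that test.rb exercises. The real
# checks live in the C extension, which is not reproduced here.
class IdBook
  def initialize
    @used = {}      # id => keyword
    @next_id = 1    # ids start at 1, since 0 means "no pattern" in ac.c
  end

  def size
    @used.size
  end

  # Registers keyword under id (or under the next free id) and returns the id.
  def add(keyword, id = nil)
    id = @next_id if id.nil?
    unless id.is_a?(Integer) && id > 0
      raise RuntimeError, "id must be a positive integer, got #{id.inspect}"
    end
    raise RuntimeError, "id #{id} is already taken" if @used.key?(id)

    @used[id] = keyword
    @next_id = id + 1 # assumption: same rule as the commented-out rb_add_string
    id
  end
end

book = IdBook.new
book.add("foo")         # automatic id
book.add("bar", 1991)   # explicit id
begin
  book.add("foo", -1)
rescue RuntimeError => err
  puts "[ok]==> got #{err.class.name}: #{err.message}"
end
puts book.size
```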
data/ext/ac.c
ADDED
@@ -0,0 +1,623 @@
/*
 * ac.c
 *
 * Implementation of the Aho-Corasick algorithm.
 *
 * NOTES:
 *    8/94  -  Original Implementation  (Sean Davis)
 *    9/94  -  Redid Implementation  (James Knight)
 *    3/96  -  Modularized the code  (James Knight)
 *    7/96  -  Finished the modularization  (James Knight)
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ac.h"

/*
 * ac_alloc
 *
 * Creates a new AC_STRUCT structure and initializes its fields.
 *
 * Parameters:    none.
 *
 * Returns:  A dynamically allocated AC_STRUCT structure.
 */
AC_STRUCT *ac_alloc(void)
{
  AC_STRUCT *node;

  if ((node = malloc(sizeof(AC_STRUCT))) == NULL)
    return NULL;
  memset(node, 0, sizeof(AC_STRUCT));

  if ((node->tree = malloc(sizeof(ACTREE_NODE))) == NULL) {
    free(node);
    return NULL;
  }
  memset(node->tree, 0, sizeof(ACTREE_NODE));

  return node;
}


/*
 * ac_add_string
 *
 * Adds a string to the AC_STRUCT structure's keyword tree.
 *
 * NOTE:  The `id' value given must be unique to any of the strings
 *        added to the tree, and must be a small integer greater than
 *        0 (since it is used to index an array holding information
 *        about each of the strings).
 *
 *        The best id's to use are to number the strings from 1 to K.
 *
 * Parameters:   node      -  an AC_STRUCT structure
 *               P         -  the sequence
 *               M         -  the sequence length
 *               id        -  the sequence identifier
 *
 * Returns:  non-zero on success, zero on error.
 */
int ac_add_string(AC_STRUCT *node, char *P, int M, int id)
{
  int i, j, newsize;
  AC_TREE tnode, child, back, newnode, list, tail;

  /*
   * Return a zero if a previous error had occurred, or if the
   * given id is not positive.  An id value of zero is used by the
   * algorithm to signal that no pattern ends at a node in the
   * keyword tree, and a negative id would index before the start
   * of the Plengths array.  So, neither can be used as a pattern's id.
   */
  if (node->errorflag || id <= 0)
    return 0;

  P--;            /* Shift to make sequence be P[1],...,P[M] */

  /*
   * Allocate space for the new string's information.
   */
  if (node->Psize <= id) {
    if (node->Psize == 0) {
      newsize = (id >= 16 ? id + 1 : 16);
      node->Plengths = malloc(newsize * sizeof(int));
    }
    else {
      newsize = node->Psize + id + 1;
      node->Plengths = realloc(node->Plengths, newsize * sizeof(int));
    }
    if (node->Plengths == NULL) {
      node->errorflag = 1;
      return 0;
    }

    for (i=node->Psize; i < newsize; i++)
      node->Plengths[i] = 0;
    node->Psize = newsize;
  }

  // duplicate id
  if (node->Plengths[id] != 0)
    return 0;

  /*
   * Add the string to the keyword tree.
   */
  tnode = node->tree;
  for (i=1; i <= M; i++) {
    /*
     * Find the child whose character is P[i].
     */
    back = NULL;
    child = tnode->children;
    while (child != NULL && child->ch < P[i]) {
      back = child;
      child = child->sibling;
    }

    if (child == NULL || child->ch != P[i])
      break;

    tnode = child;

#ifdef STATS
    node->prep_old_edges++;
#endif

  }

  /*
   * If only part of the pattern exists in the tree, add the
   * rest of the pattern to the tree.
   */
  if (i <= M) {
    list = tail = NULL;
    for (j=i; j <= M; j++) {
      if ((newnode = malloc(sizeof(ACTREE_NODE))) == NULL)
        break;
      memset(newnode, 0, sizeof(ACTREE_NODE));
      newnode->ch = P[j];

      if (list == NULL)
        list = tail = newnode;
      else
        tail = tail->children = newnode;

#ifdef STATS
      node->prep_new_edges++;
#endif

    }
    if (j <= M) {
      while (list != NULL) {
        tail = list->children;
        free(list);
        list = tail;
      }
      return 0;
    }

    list->sibling = child;
    if (back == NULL)
      tnode->children = list;
    else
      back->sibling = list;

    tnode = tail;
  }

  tnode->matchid = id;
  node->Plengths[id] = M;
  node->ispreprocessed = 0;

  return 1;
}


/*
 * ac_del_string
 *
 * Deletes a string from the keyword tree.
 *
 * Parameters:   node  -  an AC_STRUCT structure
 *               P     -  the sequence to be deleted
 *               M     -  its length
 *               id    -  its identifier
 *
 * Returns:  non-zero on success, zero on error.
 */
int ac_del_string(AC_STRUCT *node, char *P, int M, int id)
{
  int i, flag;
  AC_TREE tnode, tlast, tback, child, back;

  /* Valid ids are 1 .. Psize-1; anything else would index outside Plengths. */
  if (node->errorflag || id <= 0 || id >= node->Psize || node->Plengths[id] == 0)
    return 0;

  P--;            /* Shift to make sequence be P[1],...,P[M] */

  /*
   * Scan the tree for the path corresponding to the keyword to be deleted.
   */
  flag = 1;
  tlast = tnode = node->tree;
  tback = NULL;

  for (i=1; i <= M; i++) {
    /*
     * Find the child matching P[i].  It must be there.
     */
    child = tnode->children;
    back = NULL;
    while (child != NULL && child->ch != P[i]) {
      back = child;
      child = child->sibling;
    }

    if (child == NULL) {
      fprintf(stderr, "Error in Aho-Corasick preprocessing.  String to be "
              "deleted is not in tree.\n");
      return 0;
    }

    /*
     * Try to find the point where the pattern to be deleted branches off
     * from the paths of the other patterns in the tree.  This point must
     * be at the latest node which satisfies one of these two conditions:
     *
     *    1) Another pattern ends at that node (and so
     *       `child->matchid != 0').  In this case, the branch point is
     *       just below this node and so the children of this node
     *       should be removed.
     *    2) A node has other siblings.  In this case, the node itself
     *       is the branch point, and it and its children should be
     *       removed.
     */
    if (i < M && child->matchid != 0) {
      flag = 1;
      tlast = child;
    }
    else if (back != NULL || child->sibling != NULL) {
      flag = 2;
      tlast = child;
      tback = (back == NULL ? tnode : back);
    }

    tnode = child;
  }

  /*
   * If the node corresponding to the end of the keyword has children,
   * then the tree should not be altered, except to remove the keyword's
   * identifier from the tree.
   *
   * Otherwise, apply the appropriate removal, as described above.
   */
  if (tnode->children != NULL) {
    tnode->matchid = 0;
  }
  else {
    if (flag == 1) {
      child = tlast->children;
      tlast->children = NULL;
      tlast = child;
    }
    else {
      if (tback->children == tlast)
        tback->children = tlast->sibling;
      else
        tback->sibling = tlast->sibling;
    }

    while (tlast != NULL) {
      child = tlast->children;
      free(tlast);
      tlast = child;
    }
  }

  node->Plengths[id] = 0;
  node->ispreprocessed = 0;

  return 1;
}


/*
 * ac_prep
 *
 * Compute the failure and output links for the keyword tree.
 *
 * Parameters:  node  -  an AC_STRUCT structure
 *
 * Returns: non-zero on success, zero on error.
 */
int ac_prep(AC_STRUCT *node)
{
  char x;
  AC_TREE v, vprime, w, wprime, root, front, back, child;

  if (node->errorflag)
    return 0;

  /*
   * The failure link and output link computation requires a breadth-first
   * traversal of the keyword tree.  And, to do that, we need a queue of
   * the nodes yet to be processed.
   *
   * The `faillink' fields will be used as the pointers for the queue
   * of nodes to be computed (since the failure link is only set after
   * the node is removed from the queue).
   *
   * The `outlink' fields will be used as the pointers to a node's parent
   * for nodes in the queue (since the output link is also only set after
   * the node is removed from the queue).
   */
  root = node->tree;

  front = back = root;
  front->faillink = NULL;
  front->outlink = NULL;

  while (front != NULL) {
    v = front;
    x = v->ch;
    vprime = v->outlink;

    /*
     * Add the node's children to the queue.
     */
    for (child=v->children; child != NULL; child=child->sibling) {
      child->outlink = v;
      back->faillink = child;
      back = child;
    }
    back->faillink = NULL;

    front = front->faillink;
    v->faillink = v->outlink = NULL;

    /*
     * Set the failure and output links.
     */
    if (v == root)
      ;
    else if (vprime == root)
      v->faillink = root;
    else {
      /*
       * Find the first link in the failure link chain which has a child
       * labeled with x.
       */
      wprime = NULL;
      w = vprime->faillink;

      while (1) {
        wprime = w->children;
        while (wprime != NULL && wprime->ch < x)
          wprime = wprime->sibling;

        if ((wprime != NULL && wprime->ch == x) || w == root)
          break;

        w = w->faillink;

#ifdef STATS
        node->prep_fail_compares++;
#endif
      }
#ifdef STATS
      node->prep_fail_compares++;
#endif

      if (wprime != NULL && wprime->ch == x)
        v->faillink = wprime;
      else
        v->faillink = root;

      if (v->matchid != 0) {
        if (v->faillink->matchid != 0)
          v->outlink = v->faillink;
        else
          v->outlink = v->faillink->outlink;
      }
    }
  }

  node->ispreprocessed = 1;
  node->initflag = 0;

  return 1;
}


/*
 * ac_search_init
 *
 * Initializes the variables used during an Aho-Corasick search.
 * See ac_search for an example of how it should be used.
 *
 * Parameters:  node  -  an AC_STRUCT structure
 *              T     -  the sequence to be searched
 *              N     -  the length of the sequence
 *
 * Returns:  nothing.
 */
void ac_search_init(AC_STRUCT *node, char *T, int N)
{
  if (node->errorflag)
    return;
  else if (!node->ispreprocessed) {
    fprintf(stderr, "Error in Aho-Corasick search.  The preprocessing "
            "has not been completed.\n");
    return;
  }

  node->T = T - 1;          /* Shift to make sequence be T[1],...,T[N] */
  node->N = N;
  node->c = 1;
  node->w = node->tree;
  node->output = NULL;
  node->initflag = 1;
  node->endflag = 0;
}


/*
 * ac_search
 *
 * Scans a text to look for the next occurrence of one of the patterns
 * in the text.  An example of how this search should be used is the
 * following:
 *
 *    s = T;
 *    len = N;
 *    contflag = 0;
 *    ac_search_init(node, T, N);
 *    while ((s = ac_search(node, &matchlen, &matchid, &ends_at)) != NULL) {
 *      >>> Pattern `matchid' matched from `s' to `s + matchlen - 1'. <<<
 *    }
 *
 * where `node', `T' and `N' are assumed to be initialized appropriately.
 *
 * Parameters:  node           -  a preprocessed AC_STRUCT structure
 *              length_out     -  where to store the new match's length
 *              id_out         -  where to store the identifier of the
 *                                pattern that matched
 *              ends_at        -  where to store the 1-based text position
 *                                just past the match's last character
 *
 * Returns:  the left end of the text that matches a pattern, or NULL
 *           if no match occurs.  (It also stores values in `*length_out',
 *           `*id_out' and `*ends_at' giving the match's length, pattern
 *           identifier and end position; any of the three may be NULL.)
 */
char *ac_search(AC_STRUCT *node, int *length_out, int *id_out, int *ends_at)
{
  int c, N, id;
  char *T;
  AC_TREE w, wprime, root;

  if (node->errorflag)
    return NULL;
  else if (!node->ispreprocessed) {
    fprintf(stderr, "Error in Aho-Corasick search.  The preprocessing "
            "has not been completed.\n");
    return NULL;
  }
  else if (!node->initflag) {
    fprintf(stderr, "Error in Aho-Corasick search.  ac_search_init was not "
            "called.\n");
    return NULL;
  }
  else if (node->endflag)
    return NULL;

  T = node->T;
  N = node->N;
  c = node->c;
  w = node->w;
  root = node->tree;

  /*
   * If the last call to ac_search returned a match, check for another
   * match ending at the same right endpoint (denoted by a non-NULL
   * output link).
   */
  if (node->output != NULL) {
    node->output = node->output->outlink;

#ifdef STATS
    node->outlinks_traversed++;
#endif

    if (node->output != NULL) {
      id = node->output->matchid;
      if (id_out)
        *id_out = id;
      if (length_out)
        *length_out = node->Plengths[id];
      if (ends_at)
        *ends_at= c;
      return &T[c] - node->Plengths[id];
    }

  }

  /*
   * Run the search algorithm, stopping at the first position where a
   * match to one of the patterns occurs.
   */
  while (c <= N) {
    /*
     * Try to match the next input character to a child in the tree.
     */
    wprime = w->children;
    while (wprime != NULL && wprime->ch != T[c])
      wprime = wprime->sibling;

#ifdef STATS
    node->num_compares++;
#endif

    /*
     * If the match fails, then either use the failure link (if not
     * at the root), or move to the next character since no prefix
     * of any pattern ends with character T[c].
     */
    if (wprime == NULL) {
      if (w == root)
        c++;
      else {
        w = w->faillink;

#ifdef STATS
        node->num_failures++;
#endif

      }
    }
    else {
      /*
       * If we could match the input, move down the tree and to the
       * next input character, and see if that match completes the
       * match to a pattern (when matchid != 0 or outlink != NULL).
       */
      c++;
      w = wprime;

#ifdef STATS
      node->edges_traversed++;
#endif

      if (w->matchid != 0)
        node->output = w;
      else if (w->outlink != NULL) {
        node->output = w->outlink;

#ifdef STATS
        node->outlinks_traversed++;
#endif

      }

      if (node->output != NULL) {
        id = node->output->matchid;
        if (id_out)
          *id_out = id;
        if (length_out)
          *length_out= node->Plengths[id];
        if(ends_at)
          *ends_at= c;

        node->w = w;
        node->c = c; // ends_at - length_out;

        return &T[c] - node->Plengths[id];
      }
    }
  }

  node->c = c;
  node->endflag = 1;

  return NULL;
}


/*
 * ac_free
 *
 * Free up the allocated AC_STRUCT structure.
 *
 * Parameters:   node  -  a AC_STRUCT structure
 *
 * Returns:  nothing.
 */
void ac_free(AC_STRUCT *node)
{
  AC_TREE front, back, next;

  if (node == NULL)
    return;

  if (node->tree != NULL) {
    front = back = node->tree;
    while (front != NULL) {
      back->sibling = front->children;
      while (back->sibling != NULL)
        back = back->sibling;

      next = front->sibling;
      free(front);
      front = next;
    }
  }

  if (node->Plengths != NULL)
    free(node->Plengths);

  free(node);
}

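ac.c stores each node's children as a singly linked list (the @children@ pointer is the first child, @sibling@ the next one), and @ac_add_string@ walks that list while @child->ch < P[i]@, so siblings stay sorted by character. A rough Ruby rendering of that lookup-or-insert step is below. @SiblingNode@, @find_or_insert_child@ and @child_labels@ are hypothetical names made up for this sketch, and it handles one character at a time, whereas the C code builds the whole remaining chain before linking it in.

```ruby
# Hypothetical Ruby rendering of the child/sibling layout used by ACTREE_NODE.
class SiblingNode
  attr_accessor :ch, :children, :sibling

  def initialize(ch)
    @ch = ch
    @children = nil # first child
    @sibling = nil  # next sibling in the parent's child list
  end
end

# Returns the child of parent labelled ch, inserting it in sorted position
# when it is missing.
def find_or_insert_child(parent, ch)
  back = nil
  child = parent.children
  while child && child.ch < ch
    back = child
    child = child.sibling
  end
  return child if child && child.ch == ch

  node = SiblingNode.new(ch)
  node.sibling = child          # new node goes in front of the larger sibling
  if back.nil?
    parent.children = node      # new first child
  else
    back.sibling = node         # splice in after the smaller sibling
  end
  node
end

# Collects the labels of a node's children in list order.
def child_labels(parent)
  labels = []
  child = parent.children
  while child
    labels << child.ch
    child = child.sibling
  end
  labels
end

root = SiblingNode.new(nil)
%w[c a b a].each { |ch| find_or_insert_child(root, ch) }
puts child_labels(root).join(" ")
```

Keeping the list sorted lets the insert and @ac_prep@ loops stop early at the first larger character; note that the search loop in @ac_search@ compares with @!=@ and scans the whole list instead.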
data/ext/ac.h
ADDED
@@ -0,0 +1,36 @@

#ifndef _AC_H_
#define _AC_H_

typedef struct actreenode {
  char ch;
  int matchid;
  struct actreenode *outlink, *faillink;
  struct actreenode *children, *sibling;
} ACTREE_NODE, *AC_TREE;

typedef struct {
  AC_TREE tree;
  int ispreprocessed, errorflag;

  int Psize;
  int *Plengths;

  char *T;
  int N, c, initflag, endflag;
  AC_TREE w, output;
#ifdef STATS
  int prep_new_edges, prep_old_edges, prep_fail_compares;
  int num_compares, num_failures, edges_traversed, outlinks_traversed;
#endif
} AC_STRUCT;

AC_STRUCT *ac_alloc(void);
int ac_add_string(AC_STRUCT *node, char *P, int M, int id);
// int ac_del_string(AC_STRUCT *node, char *P, int M, int id);
int ac_prep(AC_STRUCT *node);
void ac_search_init(AC_STRUCT *node, char *T, int N);
char *ac_search(AC_STRUCT *node, int *length_out, int *id_out, int *ends_at);
void ac_free(AC_STRUCT *node);

#endif

data/ext/extconf.rb
ADDED
data/ext/ruby-ahocorasick.c
ADDED
@@ -0,0 +1,339 @@

//
// (c) 2008, Aurelian Oancea < oancea at gmail dot com >
//
// Released under MIT-LICENSE
//

//
// TODO: new methods?
//
//  * kwt[id] = word
//  * kwt.from_file (class instance method)
//
//  * kwt.find_each ("str") {|r| .. }
//  * kwt.find_first("str")
//  * kwt.find_all  ("str")
//
// TODO: change last_id and dictionary_size to long
//

#include <ruby.h>
#include "ac.h"

static VALUE sym_id, sym_value, sym_ends_at, sym_starts_at;

VALUE rb_mAhoCorasick;
VALUE rb_cKeywordTree;

#define KeywordTree(obj, kwt_data) {\
  Data_Get_Struct(obj, struct kwt_struct_data, kwt_data);\
}

struct kwt_struct_data {
  AC_STRUCT * tree;
  int last_id;
  int dictionary_size;
  int is_frozen;
};

// int
// rb_add_string(struct kwt_struct_data *kwt, char *word, int size, int id) {
//   if(ac_add_string( kwt->tree, word, size, id ) == 0)
//     return 0;
//   kwt->dictionary_size++;
//   kwt->last_id= id+1;
//   return 1;
// }

49
|
+
/*
|
50
|
+
* call-seq: initialize
|
51
|
+
*
|
52
|
+
* Creates a new KeywordTree
|
53
|
+
*
|
54
|
+
* require 'ahocorasick'
|
55
|
+
* kwt = Ahocorasick::KeywordTree.new
|
56
|
+
*
|
57
|
+
*/
|
58
|
+
static VALUE
|
59
|
+
rb_kwt_init(VALUE self)
|
60
|
+
{
|
61
|
+
AC_STRUCT * tree;
|
62
|
+
struct kwt_struct_data *kwt_data;
|
63
|
+
|
64
|
+
kwt_data = ALLOC(struct kwt_struct_data);
|
65
|
+
tree = ac_alloc();
|
66
|
+
DATA_PTR(self) = kwt_data;
|
67
|
+
kwt_data->tree = tree;
|
68
|
+
kwt_data->last_id = 1;
|
69
|
+
kwt_data->dictionary_size = 0;
|
70
|
+
kwt_data->is_frozen = 0;
|
71
|
+
return self;
|
72
|
+
}
|
73
|
+
+/*
+ * Document-method: make
+ * call-seq: make
+ *
+ * It freezes the current KeywordTree.
+ *
+ * ==== Note: This method is called internally by search
+ *
+ *   require 'ahocorasick'
+ *
+ *   kwt = AhoCorasick::KeywordTree.new
+ *
+ *   kwt.add_string("one")
+ *   kwt.add_string("two")
+ *   kwt.make()
+ */
+static VALUE
+rb_kwt_make(VALUE self)
+{
+  struct kwt_struct_data *kwt_data;
+  KeywordTree(self, kwt_data);
+
+  if(kwt_data->is_frozen == 1)
+    return Qtrue;
+
+  if(ac_prep( kwt_data->tree ) == 1) {
+    kwt_data->is_frozen = 1;
+    return Qtrue;
+  }
+
+  rb_raise(rb_eRuntimeError, "Cannot freeze the tree");
+}
+
+/*
+ * Document-method: find_all
+ * call-seq: find_all
+ *
+ * Search the current tree.
+ *
+ * It returns an array of hashes, e.g.
+ *
+ *   [ { :id => int, :value => string, :starts_at => int, :ends_at => int}, { ... } ]
+ *
+ * Returns an empty array when the search didn't return any result.
+ *
+ *   # assuming a valid KeywordTree kwt object:
+ *   kwt.add_string("one")
+ *   kwt.add_string("two")
+ *
+ *   kwt.search( "moved two times already" ).each do | result |
+ *     result[:id] # => 2
+ *     result[:ends_at] # => 9
+ *     result[:starts_at] # => 6
+ *     result[:value] # => two
+ *   end # => 1
+ *
+ */
+static VALUE
+rb_kwt_find_all(int argc, VALUE *argv, VALUE self)
+{
+  char * remain; // returned by ac_search, the remaining text to search
+  int lgt, id, ends_at, starts_at; // filled in by ac_search: the length of the result, the id, and starts_at/ends_at position
+  VALUE v_result; // one result, as hash
+  VALUE v_results; // all the results, an array
+
+  VALUE v_search; // search string, function argument
+  struct kwt_struct_data *kwt_data;
+
+  // one mandatory argument.
+  rb_scan_args(argc, argv, "1", &v_search);
+  // it should be string.
+  Check_Type(v_search, T_STRING);
+  // get the structure
+  KeywordTree(self, kwt_data);
+  // freeze the tree, if not already
+  if(kwt_data->is_frozen == 0) {
+    if(ac_prep( kwt_data->tree ) == 0)
+      rb_raise(rb_eRuntimeError, "Cannot freeze the tree");
+    kwt_data->is_frozen = 1;
+  }
+  // prepare the return value
+  v_results= rb_ary_new();
+  // fail quickly and return the empty array
+  if(kwt_data->dictionary_size == 0)
+    return v_results;
+  // prepare the search
+  ac_search_init(kwt_data->tree, RSTRING( v_search )->ptr, RSTRING( v_search )->len);
+  // loop through the results
+  while((remain= ac_search(kwt_data->tree, &lgt, &id, &ends_at)) != NULL) {
+    // this is an individual result as a hash
+    v_result= rb_hash_new();
+    rb_hash_aset( v_result, sym_id, INT2FIX(id) );
+    rb_hash_aset( v_result, sym_starts_at, INT2FIX( ends_at - lgt - 1 ) );
+    rb_hash_aset( v_result, sym_ends_at, INT2FIX( ends_at - 1 ) );
+    rb_hash_aset( v_result, sym_value, rb_str_new(remain, lgt) );
+    rb_ary_push( v_results, v_result );
+  }
+  // reopen the tree
+  kwt_data->is_frozen= 0;
+  return v_results;
+}
+
+/*
+ * Document-method: size
+ * call-seq: size
+ *
+ * Returns the size of this KeywordTree
+ *
+ *   kwt.add_string("foo")
+ *   kwt.add_string("bar")
+ *   kwt.size #=> 2
+ *
+ */
+static VALUE
+rb_kwt_size(VALUE self)
+{
+  struct kwt_struct_data *kwt_data;
+  KeywordTree(self, kwt_data);
+
+  return INT2FIX(kwt_data->dictionary_size);
+}
+
+/*
+ * Document-method: add_string
+ * call-seq: add_string
+ *
+ * Adds a sequence to this KeywordTree.
+ *
+ *   kwt.add_string("foo1$21^ 98N3 ba>Z")
+ *   kwt << "bar" # using the alias
+ *
+ * ==== Note: you can also specify the id, a number between 1 and k
+ *
+ *   kwt.add_string "bar", 123 # => 123
+ *
+ * This id should be unique in the context of the current tree.
+ *
+ * Returns the id of the inserted object.
+ *
+ *   kwt.add_string("test", 18) # => 18
+ *   kwt.add_string("baz") # => 19
+ *
+ */
+static VALUE
+rb_kwt_add_string(int argc, VALUE *argv, VALUE self)
+{
+  VALUE v_string, v_id;
+  struct kwt_struct_data *kwt_data;
+  char * string;
+  int id;
+
+  rb_scan_args(argc, argv, "11", &v_string, &v_id);
+
+  Check_Type(v_string, T_STRING);
+  string= RSTRING(v_string)->ptr;
+  KeywordTree(self, kwt_data);
+
+  if(kwt_data->is_frozen == 1)
+    rb_raise(rb_eRuntimeError, "Cannot add `%s\" into a frozen tree.", string);
+
+  if(v_id == Qnil) {
+    id = kwt_data->last_id;
+  } else if(TYPE(v_id) != T_FIXNUM) {
+    // v_id can be any object here, so print its inspect string instead of treating it as a String
+    rb_raise(rb_eRuntimeError, "Please use a number from 1 to K as id, or leave nil to auto-generate one. `%s\" given.", RSTRING(rb_inspect(v_id))->ptr);
+  } else if(NUM2INT(v_id) <= 0) {
+    rb_raise(rb_eRuntimeError, "Please use a number from 1 to K as id, or leave nil to auto-generate one. `%d\" given.", NUM2INT(v_id));
+  } else {
+    id= NUM2INT(v_id);
+  }
+
+  if(ac_add_string(kwt_data->tree, string, strlen(string), id) == 0)
+    rb_raise(rb_eRuntimeError, "Failed to add `%s\", duplicate id `%d\"?", string, id);
+
+  kwt_data->last_id= id + 1;
+  kwt_data->dictionary_size++;
+  return INT2FIX(id);
+}
+
+/*
+ * call-seq: from_file
+ *
+ * Creates a new KeywordTree and loads the dictionary from a file
+ *
+ *   % cat dict0.txt
+ *   foo
+ *   bar
+ *   base
+ *
+ *   k= AhoCorasick::KeywordTree.from_file "dict0.txt"
+ *   k.search("basement").size # => 1
+ *
+ */
+static VALUE
+rb_kwt_new_from_file(int argc, VALUE *argv, VALUE klass)
+{
+
+  // TODO:
+  //  * use rb_kwt_add_string
+  //  * use rb_io* to handle the file
+
+  struct kwt_struct_data *kwt_data;
+  char word[1024];
+  int id, len;
+  VALUE self;
+  VALUE f_string;
+  FILE *dictionary;
+
+  rb_scan_args(argc, argv, "10", &f_string);
+
+  id = 0;
+  SafeStringValue( f_string );
+  self= rb_class_new_instance( 0, NULL, klass );
+  KeywordTree( self, kwt_data );
+
+  dictionary = fopen( RSTRING( f_string )->ptr, "r" );
+  if(dictionary == NULL)
+    rb_raise(rb_eRuntimeError, "Cannot open `%s\". No such file?", RSTRING(f_string)->ptr);
+
+  while(fgets(word, 1024, dictionary) != NULL) {
+    // drop the trailing newline only when present: the last line may not have one
+    len = strlen(word);
+    if(len > 0 && word[len-1] == '\n')
+      len--;
+    ac_add_string(kwt_data->tree, word, len, id++);
+    kwt_data->dictionary_size++;
+  }
+
+  kwt_data->last_id= id+1;
+  fclose(dictionary);
+  return self;
+}
+
+static void
+rb_kwt_struct_free(struct kwt_struct_data * kwt_data)
+{
+  ac_free(kwt_data->tree);
+  // also release the wrapper struct allocated with ALLOC in rb_kwt_init
+  xfree(kwt_data);
+}
+
+static VALUE
+rb_kwt_struct_alloc(VALUE klass)
+{
+  return Data_Wrap_Struct(klass, 0, rb_kwt_struct_free, 0);
+}
+
+/*
+ * Blump.
+ */
+void Init_ahocorasick() {
+  rb_mAhoCorasick = rb_define_module("AhoCorasick");
+  rb_cKeywordTree = rb_define_class_under(rb_mAhoCorasick, "KeywordTree", rb_cObject);
+
+  rb_define_alloc_func(rb_cKeywordTree, rb_kwt_struct_alloc);
+
+  rb_define_method(rb_cKeywordTree, "initialize", rb_kwt_init, 0);
+  rb_define_method(rb_cKeywordTree, "size", rb_kwt_size, 0);
+  rb_define_method(rb_cKeywordTree, "make", rb_kwt_make, 0);
+  rb_define_method(rb_cKeywordTree, "add_string", rb_kwt_add_string, -1);
+  rb_define_alias(rb_cKeywordTree, "<<", "add_string");
+
+  rb_define_method(rb_cKeywordTree, "find_all", rb_kwt_find_all, -1);
+  rb_define_alias(rb_cKeywordTree, "search", "find_all");
+
+  rb_define_singleton_method(rb_cKeywordTree, "from_file", rb_kwt_new_from_file, -1);
+
+  sym_id = ID2SYM(rb_intern("id"));
+  sym_value = ID2SYM(rb_intern("value"));
+  sym_ends_at = ID2SYM( rb_intern("ends_at") );
+  sym_starts_at= ID2SYM( rb_intern("starts_at") );
+}
+
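The behaviour documented in the comments of ext/ruby-ahocorasick.c (id assignment, `size`, and the hashes returned by `find_all`) can be modelled in plain Ruby. What follows is only a sketch: `ToyKeywordTree` and `from_io` are hypothetical names, the matching is a naive scan rather than an Aho-Corasick automaton, and offsets are character indexes, whereas the C code works in bytes.

```ruby
require 'stringio'

# Hypothetical pure-Ruby stand-in for the C-backed KeywordTree.
# It models the documented behaviour only, not the algorithm.
class ToyKeywordTree
  def initialize
    @words   = {}  # id => keyword
    @last_id = 1   # next auto-generated id, as in rb_kwt_init
  end

  # Adds a keyword and returns the id that was used.
  # An explicit id must be a positive Integer that is not taken yet.
  def add_string(word, id = nil)
    id ||= @last_id
    unless id.is_a?(Integer) && id > 0
      raise RuntimeError, "Please use a number from 1 to K as id"
    end
    raise RuntimeError, "duplicate id #{id}?" if @words.key?(id)
    @words[id] = word
    @last_id = id + 1  # the next auto id follows the last id used
    id
  end
  alias_method :<<, :add_string

  def size
    @words.size
  end

  # Returns an array of hashes, one per match. :ends_at is the index
  # just past the match, as in the documented "two" example.
  def find_all(text)
    results = []
    @words.each do |id, word|
      start = 0
      while (pos = text.index(word, start))
        results << { :id => id, :value => word,
                     :starts_at => pos, :ends_at => pos + word.length }
        start = pos + 1
      end
    end
    results.sort_by { |r| r[:ends_at] }
  end
  alias_method :search, :find_all

  # Loads one keyword per line; the line terminator is stripped only
  # when present, since the last line may not end with a newline.
  def self.from_io(io)
    tree = new
    io.each_line do |line|
      word = line.chomp
      tree << word unless word.empty?
    end
    tree
  end
end

kwt = ToyKeywordTree.new
kwt.add_string("one")
kwt << "two"
p kwt.search("moved two times already")
```

The model keeps the one rule that is easy to miss in the C code: after an explicit id such as 18, the next auto-generated id is 19, because `last_id` is always set to the last id used plus one.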
metadata
ADDED
@@ -0,0 +1,68 @@
+--- !ruby/object:Gem::Specification
+name: ruby-ahocorasick
+version: !ruby/object:Gem::Version
+  version: 0.4.5
+platform: ruby
+authors:
+- Aurelian Oancea
+autorequire:
+bindir: bin
+cert_chain: []
+
+date: 2008-11-19 00:00:00 +01:00
+default_executable:
+dependencies: []
+
+description: Expose Aho-Corasick implementation from Strmat to Ruby.
+email: oancea at gmail dot com
+executables: []
+
+extensions:
+- ext/extconf.rb
+extra_rdoc_files: []
+
+files:
+- ext/extconf.rb
+- ext/ruby-ahocorasick.c
+- ext/ac.h
+- ext/ac.c
+- examples/dict.rb
+- examples/test.rb
+- examples/elev.rb
+- examples/sample.c
+- MIT-LICENSE
+- README.textile
+has_rdoc: true
+homepage: http://www.locknet.ro
+post_install_message:
+rdoc_options:
+- --title
+- Ruby-AhoCorasick
+- --inline-source
+- ext/ruby-ahocorasick.c
+- README.textile
+- --main
+- README.textile
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+requirements: []
+
+rubyforge_project: ruby-ahocorasick
+rubygems_version: 1.3.1
+signing_key:
+specification_version: 2
+summary: Aho-Corasick algorithm implementation to Ruby using Strmat lib.
+test_files: []
+
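The metadata above is the serialized form of a gem specification. A hand-written gemspec that would produce roughly the same fields might look like the sketch below. It is an assumption about how such a file could be written, not the author's actual gemspec; the values are copied from the metadata, and fields that newer RubyGems versions no longer accept (such as `has_rdoc` and `rubyforge_project`) are left out.

```ruby
require 'rubygems'

# Sketch of a gemspec carrying the main fields of the metadata.
spec = Gem::Specification.new do |s|
  s.name        = "ruby-ahocorasick"
  s.version     = "0.4.5"
  s.platform    = Gem::Platform::RUBY
  s.authors     = ["Aurelian Oancea"]
  s.email       = "oancea at gmail dot com"
  s.homepage    = "http://www.locknet.ro"
  s.description = "Expose Aho-Corasick implementation from Strmat to Ruby."

  # extconf.rb generates the Makefile that builds the C extension.
  s.extensions  = ["ext/extconf.rb"]
  s.files       = %w[
    ext/extconf.rb ext/ruby-ahocorasick.c ext/ac.h ext/ac.c
    examples/dict.rb examples/test.rb examples/elev.rb examples/sample.c
    MIT-LICENSE README.textile
  ]
  s.rdoc_options = ["--title", "Ruby-AhoCorasick", "--inline-source",
                    "ext/ruby-ahocorasick.c", "README.textile",
                    "--main", "README.textile"]
  s.require_paths = ["lib"]
end

puts spec.full_name
```

Listing `ext/extconf.rb` under `extensions` is what tells RubyGems to compile the C sources at install time.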