ruby-ahocorasick 0.4.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/MIT-LICENSE +21 -0
- data/README.textile +101 -0
- data/examples/dict.rb +25 -0
- data/examples/elev.rb +19 -0
- data/examples/sample.c +94 -0
- data/examples/test.rb +46 -0
- data/ext/ac.c +623 -0
- data/ext/ac.h +36 -0
- data/ext/extconf.rb +6 -0
- data/ext/ruby-ahocorasick.c +339 -0
- metadata +68 -0
data/MIT-LICENSE
ADDED
@@ -0,0 +1,21 @@
+Copyright (c) 2008 Aurelian Oancea
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
data/README.textile
ADDED
@@ -0,0 +1,101 @@
+h1. This is a work in progress.
+
+
+h2. Introduction
+
+This library is a "Ruby":http://ruby-lang.org extension, a wrapper around the "Aho-Corasick":http://en.wikipedia.org/wiki/Aho-Corasick_algorithm implementation in C found in the "Strmat":http://www.cs.ucdavis.edu/~gusfield/strmat.html package.
+
+The source code (ac.c and ac.h) was "adapted" from Strmat. In fact, I've changed only 3-4 lines of code from the original implementation so it will fit my needs: search needed to return the current position in the searched string.
+
+
+h2. Okay, so what's the idea?
+
+Having a dictionary of known sentences (note: not *words*!), this kick ass algorithm can find individual patterns in an incoming stream of data. Kinda Fast.
+
+The algorithm has 2 stages: in the first, an internal tree is built from the given dictionary; the search itself happens in the second.
+
+
+h2. Okay, so where can I use this?
+
+Well, you can do some crazy things with it, like, you can look up DNA patterns or maybe analyze network sequences (read: strange and maybe proprietary network protocols), or domestic stuff like building contextual links on your blog posts to enrich your users' experience.
+
+
+h2. Okay, so how can I install it?
+
+
+h3. Rubygems - Development Version
+
+<pre>
+gem install aurelian-ruby-ahocorasick --source=http://gems.github.com
+</pre>
+
+
+h3. Build it from source
+
+<pre>
+$ git clone git://github.com/aurelian/ruby-ahocorasick.git
+$ cd ruby-ahocorasick
+</pre>
+
+To build and install the gem on your machine (run with sudo if needed):
+
+<pre>
+$ rake install
+</pre>
+
+@rake -T@ will list other cool tasks.
+
+
+h3. Rubygems - Stable Version
+
+There's no stable version right now.
+
+
+h4. Notes
+
+It's known to work / compile / install on Ubuntu 8.04 and Mac OS 10.4.*. It should work out of the box if you have gcc around.
+Unfortunately I don't have a Windows PC around, nor the required knowledge about Microsoft compilers.
+
+
+h2. Okay, so how do I use it?
+
+<pre>
+require 'ahocorasick'
+
+keyword_tree= AhoCorasick::KeywordTree.new # creates a new tree
+keyword_tree.add_string( "foo-- Z@!bar" ) # adds a keyword to the tree
+keyword_tree.add_string( "cervantes" ) # even more
+
+results= keyword_tree.find_all( "1011000129 foo-- Z@!bar761 ! 001211 6xU" ).each do | result |
+  result[:value] # => "foo-- Z@!bar"
+  result[:starts_at] # => 11
+  result[:ends_at] # => 23
+  result[:id] # => 1
+end
+
+</pre>
+
+You can get some API reference on the "wiki":http://github.com/aurelian/ruby-ahocorasick/wikis.
+
+
+h2. Bugs? Suggestions? Ideas? Patches?
+
+For now, just use the email address.
+
+
+h2. Additional Reading
+
+Other suffix-tree implementations:
+
+* "Strmat":http://www.cs.ucdavis.edu/~gusfield/strmat.html
+* "Pytst":http://nicolas.lehuen.com/download/pytst / "Ruby-Pytst":http://rubyforge.org/projects/ruby-pytst/
+* "Aho-Corasick extension":http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
+* "Keyword Prospector":http://latimes.rubyforge.org/keyword_prospector/rdoc/
+* "libstree":http://www.cl.cam.ac.uk/~cpk25/libstree/
+
+
+h2. License
+
+(c) 2008 - Aurelian Oancea, < oancea at gmail dot com >
+
+released under MIT-LICENSE
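The README describes two stages: building a tree from the dictionary, then searching. The following is a minimal pure-Ruby sketch of that idea, written for illustration only. It is not the gem's C implementation; `TinyKeywordTree` and `TinyNode` are hypothetical names, and the result hashes merely imitate the `:id`, `:value`, `:starts_at`, `:ends_at` shape shown in the README.

```ruby
# Illustrative pure-Ruby sketch. TinyKeywordTree and TinyNode are hypothetical
# names and are not part of the ruby-ahocorasick gem.
class TinyNode
  attr_accessor :children, :fail, :own, :out

  def initialize
    @children = {}  # char => TinyNode
    @own = []       # ids of keywords ending exactly at this node
    @out = []       # own ids plus ids inherited through failure links
  end
end

class TinyKeywordTree
  def initialize
    @root = TinyNode.new
    @words = {}     # id => keyword
    @ready = false
  end

  # Stage 1a: insert a keyword into the trie under the next free id.
  def add_string(word)
    id = @words.size + 1
    node = @root
    word.each_char { |ch| node = (node.children[ch] ||= TinyNode.new) }
    node.own << id
    @words[id] = word
    @ready = false
    id
  end

  # Stage 1b: breadth-first pass that sets failure links and output lists.
  def make
    queue = []
    @root.children.each_value do |child|
      child.fail = @root
      child.out = child.own.dup
      queue << child
    end
    until queue.empty?
      node = queue.shift
      node.children.each do |ch, child|
        f = node.fail
        f = f.fail while f && !f.children.key?(ch)
        child.fail = f ? f.children[ch] : @root
        child.out = child.own + child.fail.out
        queue << child
      end
    end
    @ready = true
  end

  # Stage 2: a single left-to-right scan of the text.
  def search(text)
    make unless @ready
    results = []
    node = @root
    text.each_char.with_index do |ch, i|
      node = node.fail while !node.equal?(@root) && !node.children.key?(ch)
      node = node.children[ch] || @root
      node.out.each do |id|
        word = @words[id]
        results << { :id => id, :value => word,
                     :starts_at => i + 1 - word.length, :ends_at => i + 1 }
      end
    end
    results
  end
end

tree = TinyKeywordTree.new
tree.add_string("foo-- Z@!bar")
tree.add_string("cervantes")
p tree.search("1011000129 foo-- Z@!bar761 ! 001211 6xU")
```

Here `search` calls `make` lazily, which is one possible design: adding a keyword only invalidates the prepared state, and the failure links are rebuilt the next time a search runs.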
data/examples/dict.rb
ADDED
@@ -0,0 +1,25 @@
+#!/usr/bin/env ruby
+
+require 'time'
+
+require File.dirname(__FILE__) + '/../ext/ahocorasick'
+
+t= Time.now
+
+k= AhoCorasick::KeywordTree.from_file(File.dirname(__FILE__) + "/../spec/data/en.words")
+
+t1= Time.now
+
+puts "%d words added in %s seconds" % [k.size, (t1-t)]
+
+query = File.read( File.dirname(__FILE__) + "/../spec/data/news.txt" )
+
+results= k.search query
+
+puts "took %s seconds to find %d results in a stream with %d characters" % [(Time.now-t1), results.size, query.size]
+
+exit
+results.each do | r |
+  puts query[r[:starts_at]].chr + ".." + query[r[:ends_at]-1].chr + " => " + r[:value]
+end
+
data/examples/elev.rb
ADDED
@@ -0,0 +1,19 @@
+#!/usr/bin/env ruby
+
+$kcode='UTF-8'
+
+require File.dirname(__FILE__) + '/../ext/ahocorasick'
+
+k= AhoCorasick::KeywordTree.new
+
+k << "I've"
+k << "data"
+k << "base"
+k << "database"
+
+query= "I've moved my data to a database"
+
+k.search(query).each do | r |
+  puts "-> [ " + r[:id].to_s + " ] " + r[:value] + " / " + query[r[:starts_at]].chr + ".." + query[r[:ends_at]-1].chr
+end
+
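The keywords in elev.rb overlap ("data", "base" and "database" share characters), so it helps to know which occurrences exist in the query at all. The following brute-force reference, a sketch that does not use Aho-Corasick or the gem, lists every occurrence with an exclusive `:ends_at`, as the README example suggests. `all_occurrences` is a hypothetical helper name, and the ordering chosen here is an assumption; the gem may report matches in a different order or report fewer of them.

```ruby
# Brute-force reference (not Aho-Corasick): every occurrence of every keyword.
# all_occurrences is a hypothetical helper, not part of the gem.
def all_occurrences(keywords, text)
  hits = []
  keywords.each_with_index do |word, idx|
    pos = 0
    # String#index returns nil once no further occurrence exists.
    while (at = text.index(word, pos))
      hits << { :id => idx + 1, :value => word,
                :starts_at => at, :ends_at => at + word.length }
      pos = at + 1  # step by one so overlapping occurrences are found too
    end
  end
  # Order by right endpoint, then left endpoint (an arbitrary but stable choice).
  hits.sort_by { |h| [h[:ends_at], h[:starts_at]] }
end

query = "I've moved my data to a database"
all_occurrences(["I've", "data", "base", "database"], query).each do |h|
  puts "[ #{h[:id]} ] #{h[:value]} / #{h[:starts_at]}...#{h[:ends_at]}"
end
```

Comparing this list against the output of a real search is a cheap way to notice missed or duplicated matches when keywords are substrings of each other.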
data/examples/sample.c
ADDED
@@ -0,0 +1,94 @@
+//
+// Getting started with Aho-Corasick from Strmat
+//
+// lasick Makefile:
+//
+// --
+// ac.o :
+//   gcc -c -fPIC -shared ac.c
+// libasick : ac.o
+//   gcc -shared -Wl,-soname,libasick.so -o libasick.so.1.0.1
+//   ar rcs libasick.a ac.o
+// clean :
+//   rm -rf *.o *.a *.so* *.dylib*
+// --
+//
+// Compile this stuff - asick is the library name, generated with the above Makefile :)
+//
+// gcc sample.c -o ac-sample -I../ext/ -L../ext/ -lasick
+//
+
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "ac.h"
+
+int main(int argc, char *argv[]) {
+  char * search;
+  char * remain;
+  char * result;
+  char word[1024];
+
+  FILE *dictionary;
+  FILE *input;
+  int lgt, id, n, i;
+
+  AC_STRUCT * tree;
+  if(argc < 2) { fprintf(stderr, "usage: %s <file-or-text>\n", argv[0]); return -1; }
+  input= fopen(argv[1], "r");
+
+  if(input == NULL) {
+    search= argv[1];
+  } else {
+    long lSize;
+    fseek (input , 0 , SEEK_END);
+    lSize= ftell (input);
+    rewind(input);
+    search = (char*) malloc (sizeof(char)*(lSize+1)); // +1 for the terminating NUL
+    if (search == NULL) { fputs ("Error: Memory error",stderr); exit(-2); }
+    search[fread(search, 1, lSize, input)]= '\0'; // strlen(search) is used below
+  }
+
+  dictionary= fopen("../spec/data/dictionary.txt", "r");
+
+  if(dictionary == NULL) {
+    printf("Error: can't open file.\n");
+    return -1;
+  }
+
+  tree= ac_alloc();
+
+  // start counting from 1
+  n= 1;
+
+  printf("==> building dictionary ...");
+
+  while(fgets(word, 1024, dictionary) != NULL) {
+    // strip \n
+    ac_add_string(tree, word, strlen(word)-1, n++);
+  }
+
+  printf("%d entries added.\n", n-1); // n is one past the last id used
+
+  ac_prep(tree);
+
+  printf("==> input text [%d]:\n--\n%s\n--\n", (int) strlen(search), search);
+
+  ac_search_init(tree, search, strlen(search) );
+
+  while((remain= ac_search(tree, &lgt, &id, NULL)) != NULL) { // ends_at is optional
+    printf("`%d'", remain[lgt]); // character right after the match
+    result = (char*) malloc (sizeof(char)*(lgt+1)); // room for the NUL written by sprintf
+    sprintf( result, "%.*s", lgt, remain);
+    // result: should read first lgt chars from remain.
+    printf("==> result: length=> %d, id=> %d [%s]\n", lgt, id, result);
+    free(result);
+  }
+
+  ac_free(tree);
+  fclose(dictionary);
+  if(input != NULL) { free(search); fclose(input); } // search points at argv[1] otherwise
+
+  return 0;
+}
+
data/examples/test.rb
ADDED
@@ -0,0 +1,46 @@
+#!/usr/bin/env ruby
+
+require File.dirname(__FILE__) + '/../ext/ahocorasick'
+
+k= AhoCorasick::KeywordTree.new
+
+puts k.size
+k.add_string("foo");
+
+puts k.size
+k.add_string("bar", 1991);
+
+puts k.size
+k.add_string("fomz");
+
+begin
+  k.add_string("foo", -1);
+rescue RuntimeError => err
+  puts "[ok]==> got " + err.class.name + ": " + err.message
+end
+
+begin
+  k.add_string("foo", "bar");
+rescue RuntimeError => err
+  puts "[ok]==> got " + err.class.name + ": " + err.message
+end
+
+k.add_string("timisoara", 22);
+
+puts k.size
+begin
+  k.add_string("bucuresti", 22);
+rescue RuntimeError => err
+  puts "[ok]==> got " + err.class.name + ": " + err.message
+end
+
+k << "bacau"
+
+k.search('am fost la bacau').each do | result |
+  puts result.inspect
+end
+
+k.search( 'din foo in foo' ).each do | q |
+  puts q.inspect
+end
+
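test.rb exercises the rules for keyword ids: an id may be omitted, but an explicit id that is negative, non-numeric, or already taken is expected to raise a `RuntimeError`. The pure-Ruby sketch below models only those rules. `IdRegistry` is a hypothetical class, the error messages are invented, and the way the next automatic id is chosen (one past the largest id seen) is an assumption rather than the gem's documented behaviour.

```ruby
# Hypothetical model of the id rules exercised by test.rb; not the gem's code.
class IdRegistry
  def initialize
    @ids = {}      # id => keyword
    @next_id = 1   # ids start at 1 because 0 means "no pattern" in ac.c
  end

  def size
    @ids.size
  end

  # Registers word under id (or under the next automatic id) and returns the id.
  def add(word, id = nil)
    id = @next_id if id.nil?
    unless id.is_a?(Integer) && id > 0
      raise RuntimeError, "id must be a positive integer"
    end
    raise RuntimeError, "id #{id} is already in use" if @ids.key?(id)
    @ids[id] = word
    # Assumption: automatic ids continue after the largest id seen so far.
    @next_id = [@next_id, id + 1].max
    id
  end
end

registry = IdRegistry.new
registry.add("foo")            # automatic id
registry.add("bar", 1991)      # explicit id
begin
  registry.add("foo", -1)
rescue RuntimeError => err
  puts "[ok]==> got " + err.class.name + ": " + err.message
end
puts registry.size
```

Validating ids before they reach the C layer matters because ac.c uses the id directly as an index into its `Plengths` array.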
data/ext/ac.c
ADDED
@@ -0,0 +1,623 @@
+/*
+ * ac.c
+ *
+ * Implementation of the Aho-Corasick algorithm.
+ *
+ * NOTES:
+ *    8/94  -  Original Implementation  (Sean Davis)
+ *    9/94  -  Redid Implementation  (James Knight)
+ *    3/96  -  Modularized the code  (James Knight)
+ *    7/96  -  Finished the modularization  (James Knight)
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include "ac.h"
+
+/*
+ * ac_alloc
+ *
+ * Creates a new AC_STRUCT structure and initializes its fields.
+ *
+ * Parameters:    none.
+ *
+ * Returns:  A dynamically allocated AC_STRUCT structure.
+ */
+AC_STRUCT *ac_alloc(void)
+{
+  AC_STRUCT *node;
+
+  if ((node = malloc(sizeof(AC_STRUCT))) == NULL)
+    return NULL;
+  memset(node, 0, sizeof(AC_STRUCT));
+
+  if ((node->tree = malloc(sizeof(ACTREE_NODE))) == NULL) {
+    free(node);
+    return NULL;
+  }
+  memset(node->tree, 0, sizeof(ACTREE_NODE));
+
+  return node;
+}
+
+
+/*
+ * ac_add_string
+ *
+ * Adds a string to the AC_STRUCT structure's keyword tree.
+ *
+ * NOTE:  The `id' value given must be unique to any of the strings
+ *        added to the tree, and must be a small integer greater than
+ *        0 (since it is used to index an array holding information
+ *        about each of the strings).
+ *
+ *        The best id's to use are to number the strings from 1 to K.
+ *
+ * Parameters:   node  -  an AC_STRUCT structure
+ *               P     -  the sequence
+ *               M     -  the sequence length
+ *               id    -  the sequence identifier
+ *
+ * Returns:  non-zero on success, zero on error.
+ */
+int ac_add_string(AC_STRUCT *node, char *P, int M, int id)
+{
+  int i, j, newsize;
+  AC_TREE tnode, child, back, newnode, list, tail;
+
+  /*
+   * Return a zero if a previous error had occurred, or if the
+   * given id is not positive.  An id value of zero is used by the
+   * algorithm to signal that no pattern ends at a node in the
+   * keyword tree, and a negative id would index before `Plengths'.
+   */
+  if (node->errorflag || id <= 0)
+    return 0;
+
+  P--;            /* Shift to make sequence be P[1],...,P[M] */
+
+  /*
+   * Allocate space for the new string's information.
+   */
+  if (node->Psize <= id) {
+    if (node->Psize == 0) {
+      newsize = (id >= 16 ? id + 1 : 16);
+      node->Plengths = malloc(newsize * sizeof(int));
+    }
+    else {
+      newsize = node->Psize + id + 1;
+      node->Plengths = realloc(node->Plengths, newsize * sizeof(int));
+    }
+    if (node->Plengths == NULL) {
+      node->errorflag = 1;
+      return 0;
+    }
+
+    for (i=node->Psize; i < newsize; i++)
+      node->Plengths[i] = 0;
+    node->Psize = newsize;
+  }
+
+  // duplicate id
+  if (node->Plengths[id] != 0)
+    return 0;
+
+  /*
+   * Add the string to the keyword tree.
+   */
+  tnode = node->tree;
+  for (i=1; i <= M; i++) {
+    /*
+     * Find the child whose character is P[i].
+     */
+    back = NULL;
+    child = tnode->children;
+    while (child != NULL && child->ch < P[i]) {
+      back = child;
+      child = child->sibling;
+    }
+
+    if (child == NULL || child->ch != P[i])
+      break;
+
+    tnode = child;
+
+#ifdef STATS
+    node->prep_old_edges++;
+#endif
+
+  }
+
+  /*
+   * If only part of the pattern exists in the tree, add the
+   * rest of the pattern to the tree.
+   */
+  if (i <= M) {
+    list = tail = NULL;
+    for (j=i; j <= M; j++) {
+      if ((newnode = malloc(sizeof(ACTREE_NODE))) == NULL)
+        break;
+      memset(newnode, 0, sizeof(ACTREE_NODE));
+      newnode->ch = P[j];
+
+      if (list == NULL)
+        list = tail = newnode;
+      else
+        tail = tail->children = newnode;
+
+#ifdef STATS
+      node->prep_new_edges++;
+#endif
+
+    }
+    if (j <= M) {
+      while (list != NULL) {
+        tail = list->children;
+        free(list);
+        list = tail;
+      }
+      return 0;
+    }
+
+    list->sibling = child;
+    if (back == NULL)
+      tnode->children = list;
+    else
+      back->sibling = list;
+
+    tnode = tail;
+  }
+
+  tnode->matchid = id;
+  node->Plengths[id] = M;
+  node->ispreprocessed = 0;
+
+  return 1;
+}
+
+
+/*
+ * ac_del_string
+ *
+ * Deletes a string from the keyword tree.
+ *
+ * Parameters:   node  -  an AC_STRUCT structure
+ *               P     -  the sequence to be deleted
+ *               M     -  its length
+ *               id    -  its identifier
+ *
+ * Returns:  non-zero on success, zero on error.
+ */
+int ac_del_string(AC_STRUCT *node, char *P, int M, int id)
+{
+  int i, flag;
+  AC_TREE tnode, tlast, tback, child, back;
+
+  if (node->errorflag || id <= 0 || id >= node->Psize || node->Plengths[id] == 0)
+    return 0;
+
+  P--;            /* Shift to make sequence be P[1],...,P[M] */
+
+  /*
+   * Scan the tree for the path corresponding to the keyword to be deleted.
+   */
+  flag = 1;
+  tlast = tnode = node->tree;
+  tback = NULL;
+
+  for (i=1; i <= M; i++) {
+    /*
+     * Find the child matching P[i].  It must be there.
+     */
+    child = tnode->children;
+    back = NULL;
+    while (child != NULL && child->ch != P[i]) {
+      back = child;
+      child = child->sibling;
+    }
+
+    if (child == NULL) {
+      fprintf(stderr, "Error in Aho-Corasick preprocessing.  String to be "
+              "deleted is not in tree.\n");
+      return 0;
+    }
+
+    /*
+     * Try to find the point where the pattern to be deleted branches off
+     * from the paths of the other patterns in the tree.  This point must
+     * be at the latest node which satisfies one of these two conditions:
+     *
+     *    1) Another pattern ends at that node (and so
+     *       `child->matchid != 0').  In this case, the branch point is
+     *       just below this node and so the children of this node
+     *       should be removed.
+     *    2) A node has other siblings.  In this case, the node itself
+     *       is the branch point, and it and its children should be
+     *       removed.
+     */
+    if (i < M && child->matchid != 0) {
+      flag = 1;
+      tlast = child;
+    }
+    else if (back != NULL || child->sibling != NULL) {
+      flag = 2;
+      tlast = child;
+      tback = (back == NULL ? tnode : back);
+    }
+
+    tnode = child;
+  }
+
+  /*
+   * If the node corresponding to the end of the keyword has children,
+   * then the tree should not be altered, except to remove the keyword's
+   * identifier from the tree.
+   *
+   * Otherwise, apply the appropriate removal, as described above.
+   */
+  if (tnode->children != NULL) {
+    tnode->matchid = 0;
+  }
+  else {
+    if (flag == 1) {
+      child = tlast->children;
+      tlast->children = NULL;
+      tlast = child;
+    }
+    else {
+      if (tback->children == tlast)
+        tback->children = tlast->sibling;
+      else
+        tback->sibling = tlast->sibling;
+    }
+
+    while (tlast != NULL) {
+      child = tlast->children;
+      free(tlast);
+      tlast = child;
+    }
+  }
+
+  node->Plengths[id] = 0;
+  node->ispreprocessed = 0;
+
+  return 1;
+}
+
+
+/*
+ * ac_prep
+ *
+ * Compute the failure and output links for the keyword tree.
+ *
+ * Parameters:  node  -  an AC_STRUCT structure
+ *
+ * Returns: non-zero on success, zero on error.
+ */
+int ac_prep(AC_STRUCT *node)
+{
+  char x;
+  AC_TREE v, vprime, w, wprime, root, front, back, child;
+
+  if (node->errorflag)
+    return 0;
+
+  /*
+   * The failure link and output link computation requires a breadth-first
+   * traversal of the keyword tree.  And, to do that, we need a queue of
+   * the nodes yet to be processed.
+   *
+   * The `faillink' fields will be used as the pointers for the queue
+   * of nodes to be computed (since the failure link is only set after
+   * the node is removed from the queue).
+   *
+   * The `outlink' fields will be used as the pointers to a node's parent
+   * for nodes in the queue (since the output link is also only set after
+   * the node is removed from the queue).
+   */
+  root = node->tree;
+
+  front = back = root;
+  front->faillink = NULL;
+  front->outlink = NULL;
+
+  while (front != NULL) {
+    v = front;
+    x = v->ch;
+    vprime = v->outlink;
+
+    /*
+     * Add the node's children to the queue.
+     */
+    for (child=v->children; child != NULL; child=child->sibling) {
+      child->outlink = v;
+      back->faillink = child;
+      back = child;
+    }
+    back->faillink = NULL;
+
+    front = front->faillink;
+    v->faillink = v->outlink = NULL;
+
+    /*
+     * Set the failure and output links.
+     */
+    if (v == root)
+      ;
+    else if (vprime == root)
+      v->faillink = root;
+    else {
+      /*
+       * Find the first link in the failure link chain which has a child
+       * labeled with x.
+       */
+      wprime = NULL;
+      w = vprime->faillink;
+
+      while (1) {
+        wprime = w->children;
+        while (wprime != NULL && wprime->ch < x)
+          wprime = wprime->sibling;
+
+        if ((wprime != NULL && wprime->ch == x) || w == root)
+          break;
+
+        w = w->faillink;
+
+#ifdef STATS
+        node->prep_fail_compares++;
+#endif
+      }
+#ifdef STATS
+      node->prep_fail_compares++;
+#endif
+
+      if (wprime != NULL && wprime->ch == x)
+        v->faillink = wprime;
+      else
+        v->faillink = root;
+
+      if (v->matchid != 0) {
+        if (v->faillink->matchid != 0)
+          v->outlink = v->faillink;
+        else
+          v->outlink = v->faillink->outlink;
+      }
+    }
+  }
+
+  node->ispreprocessed = 1;
+  node->initflag = 0;
+
+  return 1;
+}
+
+
+/*
+ * ac_search_init
+ *
+ * Initializes the variables used during an Aho-Corasick search.
+ * See ac_search for an example of how it should be used.
+ *
+ * Parameters:  node  -  an AC_STRUCT structure
+ *              T     -  the sequence to be searched
+ *              N     -  the length of the sequence
+ *
+ * Returns:  nothing.
+ */
+void ac_search_init(AC_STRUCT *node, char *T, int N)
+{
+  if (node->errorflag)
+    return;
+  else if (!node->ispreprocessed) {
+    fprintf(stderr, "Error in Aho-Corasick search.  The preprocessing "
+            "has not been completed.\n");
+    return;
+  }
+
+  node->T = T - 1;          /* Shift to make sequence be T[1],...,T[N] */
+  node->N = N;
+  node->c = 1;
+  node->w = node->tree;
+  node->output = NULL;
+  node->initflag = 1;
+  node->endflag = 0;
+}
+
+
+/*
+ * ac_search
+ *
+ * Scans a text to look for the next occurrence of one of the patterns
+ * in the text.  An example of how this search should be used is the
+ * following:
+ *
+ *    s = T;
+ *    len = N;
+ *    contflag = 0;
+ *    ac_search_init(node, T, N);
+ *    while ((s = ac_search(node, &matchlen, &matchid, &ends_at)) != NULL) {
+ *      >>> Pattern `matchid' matched from `s' to `s + matchlen - 1'. <<<
+ *    }
+ *
+ * where `node', `T' and `N' are assumed to be initialized appropriately.
+ *
+ * Parameters:  node           -  a preprocessed AC_STRUCT structure
+ *              length_out     -  where to store the new match's length
+ *              id_out         -  where to store the identifier of the
+ *                                pattern that matched
+ *              ends_at        -  where to store the n-th matched char
+ *
+ * Returns:  the left end of the text that matches a pattern, or NULL
+ *           if no match occurs.  (It also stores values in `*length_out',
+ *           and `*id_out' giving the match's length and pattern identifier.)
+ */
+char *ac_search(AC_STRUCT *node, int *length_out, int *id_out, int *ends_at)
+{
+  int c, N, id;
+  char *T;
+  AC_TREE w, wprime, root;
+
+  if (node->errorflag)
+    return NULL;
+  else if (!node->ispreprocessed) {
+    fprintf(stderr, "Error in Aho-Corasick search.  The preprocessing "
+            "has not been completed.\n");
+    return NULL;
+  }
+  else if (!node->initflag) {
+    fprintf(stderr, "Error in Aho-Corasick search.  ac_search_init was not "
+            "called.\n");
+    return NULL;
+  }
+  else if (node->endflag)
+    return NULL;
+
+  T = node->T;
+  N = node->N;
+  c = node->c;
+  w = node->w;
+  root = node->tree;
+
+  /*
+   * If the last call to ac_search returned a match, check for another
+   * match ending at the same right endpoint (denoted by a non-NULL
+   * output link).
+   */
+  if (node->output != NULL) {
+    node->output = node->output->outlink;
+
+#ifdef STATS
+    node->outlinks_traversed++;
+#endif
+
+    if (node->output != NULL) {
+      id = node->output->matchid;
+      if (id_out)
+        *id_out = id;
+      if (length_out)
+        *length_out = node->Plengths[id];
+      if (ends_at)
+        *ends_at= c;
+      return &T[c] - node->Plengths[id];
+    }
+
+  }
+
+  /*
+   * Run the search algorithm, stopping at the first position where a
+   * match to one of the patterns occurs.
+   */
+  while (c <= N) {
+    /*
+     * Try to match the next input character to a child in the tree.
+     */
+    wprime = w->children;
+    while (wprime != NULL && wprime->ch != T[c])
+      wprime = wprime->sibling;
+
+#ifdef STATS
+    node->num_compares++;
+#endif
+
+    /*
+     * If the match fails, then either use the failure link (if not
+     * at the root), or move to the next character since no prefix
+     * of any pattern ends with character T[c].
+     */
+    if (wprime == NULL) {
+      if (w == root)
+        c++;
+      else {
+        w = w->faillink;
+
+#ifdef STATS
+        node->num_failures++;
+#endif
+
+      }
+    }
+    else {
+      /*
+       * If we could match the input, move down the tree and to the
+       * next input character, and see if that match completes the
+       * match to a pattern (when matchid != 0 or outlink != NULL).
+       */
+      c++;
+      w = wprime;
+
+#ifdef STATS
+      node->edges_traversed++;
+#endif
+
+      if (w->matchid != 0)
+        node->output = w;
+      else if (w->outlink != NULL) {
+        node->output = w->outlink;
+
+#ifdef STATS
+        node->outlinks_traversed++;
+#endif
+
+      }
+
+      if (node->output != NULL) {
+        id = node->output->matchid;
+        if (id_out)
+          *id_out = id;
+        if (length_out)
+          *length_out= node->Plengths[id];
+        if(ends_at)
+          *ends_at= c;
+
+        node->w = w;
+        node->c = c; // ends_at - length_out;
+
+        return &T[c] - node->Plengths[id];
+      }
+    }
+  }
+
+  node->c = c;
+  node->endflag = 1;
+
+  return NULL;
+}
+
+
+/*
+ * ac_free
+ *
+ * Free up the allocated AC_STRUCT structure.
+ *
+ * Parameters:   node  -  an AC_STRUCT structure
+ *
+ * Returns:  nothing.
+ */
+void ac_free(AC_STRUCT *node)
+{
+  AC_TREE front, back, next;
+
+  if (node == NULL)
+    return;
+
+  if (node->tree != NULL) {
+    front = back = node->tree;
+    while (front != NULL) {
+      back->sibling = front->children;
+      while (back->sibling != NULL)
+        back = back->sibling;
+
+      next = front->sibling;
+      free(front);
+      front = next;
+    }
+  }
+
+  if (node->Plengths != NULL)
+    free(node->Plengths);
+
+  free(node);
+}
+
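In ac.c each tree node stores at most one `matchid`, and further patterns that end at the same text position are reached by following `outlink` pointers, as `ac_search` does across successive calls. The short Ruby sketch below shows only that reporting step on hand-built nodes. `OutNode` and `ids_ending_here` are hypothetical names, and the ids 3 and 4 are made up for the example.

```ruby
# Hypothetical stand-in for the matchid/outlink fields of ACTREE_NODE.
OutNode = Struct.new(:matchid, :outlink)

# Collects the ids reported for one text position, in the order ac_search
# would return them over successive calls: the node itself if a pattern
# ends there, otherwise its output link, then the rest of the chain.
def ids_ending_here(node)
  ids = []
  node = node.outlink if node && node.matchid == 0
  while node
    ids << node.matchid
    node = node.outlink
  end
  ids
end

# "base" is a suffix of "database", so the node that ends "database"
# links to the node that ends "base".
base     = OutNode.new(3, nil)
database = OutNode.new(4, base)
p ids_ending_here(database)
```

Keeping a chain instead of a list per node means the tree needs no extra allocation for outputs; the cost is one pointer hop per additional match.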
data/ext/ac.h
ADDED
@@ -0,0 +1,36 @@
+
+#ifndef _AC_H_
+#define _AC_H_
+
+typedef struct actreenode {
+  char ch;
+  int matchid;
+  struct actreenode *outlink, *faillink;
+  struct actreenode *children, *sibling;
+} ACTREE_NODE, *AC_TREE;
+
+typedef struct {
+  AC_TREE tree;
+  int ispreprocessed, errorflag;
+
+  int Psize;
+  int *Plengths;
+
+  char *T;
+  int N, c, initflag, endflag;
+  AC_TREE w, output;
+#ifdef STATS
+  int prep_new_edges, prep_old_edges, prep_fail_compares;
+  int num_compares, num_failures, edges_traversed, outlinks_traversed;
+#endif
+} AC_STRUCT;
+
+AC_STRUCT *ac_alloc(void);
+int ac_add_string(AC_STRUCT *node, char *P, int M, int id);
+// int ac_del_string(AC_STRUCT *node, char *P, int M, int id);
+int ac_prep(AC_STRUCT *node);
+void ac_search_init(AC_STRUCT *node, char *T, int N);
+char *ac_search(AC_STRUCT *node, int *length_out, int *id_out, int *ends_at);
+void ac_free(AC_STRUCT *node);
+
+#endif
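`ACTREE_NODE` keeps a node's children as a linked list: `children` points at the first child and each child points at the next one through `sibling`. `ac_add_string` walks that list with a `child->ch < P[i]` comparison, which keeps siblings sorted by character. A minimal Ruby sketch of that first-child / next-sibling layout follows; `SibNode`, `find_or_add_child` and `child_chars` are hypothetical names used only for this illustration.

```ruby
# Hypothetical Ruby counterpart of the ch/children/sibling fields.
class SibNode
  attr_accessor :ch, :children, :sibling

  def initialize(ch)
    @ch = ch  # children and sibling start out as nil
  end
end

# Returns the child of parent labelled ch, inserting it in sorted
# position among its siblings when it does not exist yet.
def find_or_add_child(parent, ch)
  back = nil
  child = parent.children
  while child && child.ch < ch
    back = child
    child = child.sibling
  end
  return child if child && child.ch == ch

  node = SibNode.new(ch)
  node.sibling = child
  if back
    back.sibling = node
  else
    parent.children = node
  end
  node
end

# Lists the labels of parent's children by walking the sibling chain.
def child_chars(parent)
  chars = []
  child = parent.children
  while child
    chars << child.ch
    child = child.sibling
  end
  chars
end

root = SibNode.new(nil)
%w[c a b a].each { |ch| find_or_add_child(root, ch) }
p child_chars(root)
```

The layout trades lookup speed for memory: finding a child is linear in the number of siblings, but a node needs only two pointers regardless of alphabet size.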
data/ext/ruby-ahocorasick.c
ADDED
@@ -0,0 +1,339 @@
|
|
1
|
+
|
2
|
+
//
|
3
|
+
// (c) 2008, Aurelian Oancea < oancea at gmail dot com >
|
4
|
+
//
|
5
|
+
// Released under MIT-LICENSE
|
6
|
+
//
|
7
|
+
|
8
|
+
//
|
9
|
+
// TODO: new methods?
|
10
|
+
//
|
11
|
+
// * kwt[id] = word
|
12
|
+
// * kwt.from_file (class instance method)
|
13
|
+
//
|
14
|
+
// * kwt.find_each ("str") {|r| .. }
|
15
|
+
// * kwt.find_first("str")
|
16
|
+
// * kwt.find_all ("str")
|
17
|
+
//
|
18
|
+
// TODO: change last_id and dictionary_size to long
|
19
|
+
//
|
20
|
+
|
21
|
+
#include <ruby.h>
|
22
|
+
#include "ac.h"
|
23
|
+
|
24
|
+
static VALUE sym_id, sym_value, sym_ends_at, sym_starts_at;
|
25
|
+
|
26
|
+
VALUE rb_mAhoCorasick;
|
27
|
+
VALUE rb_cKeywordTree;
|
28
|
+
|
29
|
+
#define KeywordTree(obj, kwt_data) {\
|
30
|
+
Data_Get_Struct(obj, struct kwt_struct_data, kwt_data);\
|
31
|
+
}
|
32
|
+
|
33
|
+
struct kwt_struct_data {
|
34
|
+
AC_STRUCT * tree;
|
35
|
+
int last_id;
|
36
|
+
int dictionary_size;
|
37
|
+
int is_frozen;
|
38
|
+
};
|
39
|
+
|
40
|
+
// int
|
41
|
+
// rb_add_string(struct kwt_struct_data *kwt, char *word, int size, int id) {
|
42
|
+
// if(ac_add_string( kwt->tree, word, size, id ) == 0)
|
43
|
+
// return 0;
|
44
|
+
// kwt->dictionary_size++;
|
45
|
+
// kwt->last_id= id+1;
|
46
|
+
// return 1;
|
47
|
+
// }
|
48
|
+
|
49
|
+
/*
|
50
|
+
* call-seq: initialize
|
51
|
+
*
|
52
|
+
* Creates a new KeywordTree
|
53
|
+
*
|
54
|
+
* require 'ahocorasick'
|
55
|
+
* kwt = AhoCorasick::KeywordTree.new
|
56
|
+
*
|
57
|
+
*/
|
58
|
+
static VALUE
|
59
|
+
rb_kwt_init(VALUE self)
|
60
|
+
{
|
61
|
+
AC_STRUCT * tree;
|
62
|
+
struct kwt_struct_data *kwt_data;
|
63
|
+
|
64
|
+
kwt_data = ALLOC(struct kwt_struct_data);
|
65
|
+
tree = ac_alloc();
|
66
|
+
DATA_PTR(self) = kwt_data;
|
67
|
+
kwt_data->tree = tree;
|
68
|
+
kwt_data->last_id = 1;
|
69
|
+
kwt_data->dictionary_size = 0;
|
70
|
+
kwt_data->is_frozen = 0;
|
71
|
+
return self;
|
72
|
+
}
|
73
|
+
|
74
|
+
/*
|
75
|
+
* Document-method: make
|
76
|
+
* call-seq: make
|
77
|
+
*
|
78
|
+
* It freezes the current KeywordTree.
|
79
|
+
*
|
80
|
+
* ==== Note: This method is called internally by search
|
81
|
+
*
|
82
|
+
* require 'ahocorasick'
|
83
|
+
*
|
84
|
+
* kwt = AhoCorasick::KeywordTree.new
|
85
|
+
*
|
86
|
+
* kwt.add_string("one")
|
87
|
+
* kwt.add_string("two")
|
88
|
+
* kwt.make()
|
89
|
+
*/
|
90
|
+
static VALUE
|
91
|
+
rb_kwt_make(VALUE self)
|
92
|
+
{
|
93
|
+
struct kwt_struct_data *kwt_data;
|
94
|
+
KeywordTree(self, kwt_data);
|
95
|
+
|
96
|
+
if(kwt_data->is_frozen == 1)
|
97
|
+
return Qtrue;
|
98
|
+
|
99
|
+
if(ac_prep( kwt_data->tree ) == 1) {
|
100
|
+
kwt_data->is_frozen = 1;
|
101
|
+
return Qtrue;
|
102
|
+
}
|
103
|
+
|
104
|
+
rb_raise(rb_eRuntimeError, "Cannot freeze the tree");
|
105
|
+
}
|
106
|
+
|
107
|
+
/*
|
108
|
+
* Document-method: find_all
|
109
|
+
* call-seq: find_all
|
110
|
+
*
|
111
|
+
* Search the current tree.
|
112
|
+
*
|
113
|
+
* It returns an array of hashes, e.g.
|
114
|
+
*
|
115
|
+
* [ { :id => int, :value => string, :starts_at => int, :ends_at => int}, { ... } ]
|
116
|
+
*
|
117
|
+
* Returns an empty array when the search didn't return any result.
|
118
|
+
*
|
119
|
+
* # assuming a valid KeywordTree kwt object:
|
120
|
+
* kwt.add_string("one")
|
121
|
+
* kwt.add_string("two")
|
122
|
+
*
|
123
|
+
* kwt.search( "moved two times already" ).each do | result |
|
124
|
+
* result[:id] # => 2
|
125
|
+
* result[:ends_at] # => 9
|
126
|
+
* result[:starts_at] # => 6
|
127
|
+
* result[:value] # => two
|
128
|
+
* end # => 1
|
129
|
+
*
|
130
|
+
*/
|
131
|
+
static VALUE
|
132
|
+
rb_kwt_find_all(int argc, VALUE *argv, VALUE self)
|
133
|
+
{
|
134
|
+
char * remain; // returned by ac_search: points at the current match within the searched text
|
135
|
+
int lgt, id, ends_at; // filled in by ac_search: the match length, the keyword id, and the end position
|
136
|
+
VALUE v_result; // one result, as hash
|
137
|
+
VALUE v_results; // all the results, an array
|
138
|
+
|
139
|
+
VALUE v_search; // search string, function argument
|
140
|
+
struct kwt_struct_data *kwt_data;
|
141
|
+
|
142
|
+
// one mandatory argument.
|
143
|
+
rb_scan_args(argc, argv, "1", &v_search);
|
144
|
+
// it should be string.
|
145
|
+
Check_Type(v_search, T_STRING);
|
146
|
+
// get the structure
|
147
|
+
KeywordTree(self, kwt_data);
|
148
|
+
// freeze the tree, if not already
|
149
|
+
if(kwt_data->is_frozen == 0) {
|
150
|
+
if(ac_prep( kwt_data->tree ) == 0)
|
151
|
+
rb_raise(rb_eRuntimeError, "Cannot freeze the tree");
|
152
|
+
kwt_data->is_frozen = 1;
|
153
|
+
}
|
154
|
+
// prepare the return value
|
155
|
+
v_results= rb_ary_new();
|
156
|
+
// fail quickly and return the empty array
|
157
|
+
if(kwt_data->dictionary_size == 0)
|
158
|
+
return v_results;
|
159
|
+
// prepare the search
|
160
|
+
ac_search_init(kwt_data->tree, RSTRING( v_search )->ptr, RSTRING( v_search )->len);
|
161
|
+
// loop through the results
|
162
|
+
while((remain= ac_search(kwt_data->tree, &lgt, &id, &ends_at)) != NULL) {
|
163
|
+
// this is an individual result as a hash
|
164
|
+
v_result= rb_hash_new();
|
165
|
+
rb_hash_aset( v_result, sym_id, INT2FIX(id) );
|
166
|
+
rb_hash_aset( v_result, sym_starts_at, INT2FIX( ends_at - lgt - 1 ) );
|
167
|
+
rb_hash_aset( v_result, sym_ends_at, INT2FIX( ends_at - 1 ) );
|
168
|
+
rb_hash_aset( v_result, sym_value, rb_str_new(remain, lgt) );
|
169
|
+
rb_ary_push( v_results, v_result );
|
170
|
+
}
|
171
|
+
// reopen the tree
|
172
|
+
kwt_data->is_frozen= 0;
|
173
|
+
return v_results;
|
174
|
+
}
|
175
|
+
|
176
|
+
/*
|
177
|
+
* Document-method: size
|
178
|
+
* call-seq: size
|
179
|
+
*
|
180
|
+
* Returns the size of this KeywordTree
|
181
|
+
*
|
182
|
+
* kwt.add_string("foo")
|
183
|
+
* kwt.add_string("bar")
|
184
|
+
* kwt.size #=> 2
|
185
|
+
*
|
186
|
+
*/
|
187
|
+
static VALUE
|
188
|
+
rb_kwt_size(VALUE self)
|
189
|
+
{
|
190
|
+
struct kwt_struct_data *kwt_data;
|
191
|
+
KeywordTree(self, kwt_data);
|
192
|
+
|
193
|
+
return INT2FIX(kwt_data->dictionary_size);
|
194
|
+
}
|
195
|
+
|
196
|
+
/*
|
197
|
+
* Document-method: add_string
|
198
|
+
* call-seq: add_string
|
199
|
+
*
|
200
|
+
* Adds a sequence to this KeywordTree.
|
201
|
+
*
|
202
|
+
* kwt.add_string("foo1$21^ 98N3 ba>Z")
|
203
|
+
* kwt << "bar" # using the alias
|
204
|
+
*
|
205
|
+
* ==== Note: you can also specify the id, a number between 1 and k
|
206
|
+
*
|
207
|
+
* kwt.add_string "bar", 123 # => 123
|
208
|
+
*
|
209
|
+
* This id should be unique in the context of the current tree.
|
210
|
+
*
|
211
|
+
* Returns the id of the inserted object.
|
212
|
+
*
|
213
|
+
* kwt.add_string("test", 18) # => 18
|
214
|
+
* kwt.add_string("baz") # => 19
|
215
|
+
*
|
216
|
+
*/
|
217
|
+
static VALUE
|
218
|
+
rb_kwt_add_string(int argc, VALUE *argv, VALUE self)
|
219
|
+
{
|
220
|
+
VALUE v_string, v_id;
|
221
|
+
struct kwt_struct_data *kwt_data;
|
222
|
+
char * string;
|
223
|
+
int id;
|
224
|
+
|
225
|
+
rb_scan_args(argc, argv, "11", &v_string, &v_id);
|
226
|
+
|
227
|
+
Check_Type(v_string, T_STRING);
|
228
|
+
string= RSTRING(v_string)->ptr;
|
229
|
+
KeywordTree(self, kwt_data);
|
230
|
+
|
231
|
+
if(kwt_data->is_frozen == 1)
|
232
|
+
rb_raise(rb_eRuntimeError, "Cannot add `%s\" into a frozen tree.", string);
|
233
|
+
|
234
|
+
if(v_id == Qnil) {
|
235
|
+
id = kwt_data->last_id;
|
236
|
+
} else if(TYPE(v_id) != T_FIXNUM) {
|
237
|
+
rb_raise(rb_eRuntimeError, "Please use a number from 1 to K as id, or leave nil to auto-generate one. `%s\" given.", RSTRING(rb_inspect(v_id))->ptr);
|
238
|
+
} else if(NUM2INT(v_id) <= 0) {
|
239
|
+
rb_raise(rb_eRuntimeError, "Please use a number from 1 to K as id, or leave nil to auto-generate one. `%d\" given.", NUM2INT(v_id));
|
240
|
+
} else {
|
241
|
+
id= NUM2INT(v_id);
|
242
|
+
}
|
243
|
+
|
244
|
+
if(ac_add_string(kwt_data->tree, string, strlen(string), id) == 0)
|
245
|
+
rb_raise(rb_eRuntimeError, "Failed to add `%s\", duplicate id `%d\"?", string, id);
|
246
|
+
|
247
|
+
kwt_data->last_id= id + 1;
|
248
|
+
kwt_data->dictionary_size++;
|
249
|
+
return INT2FIX(id);
|
250
|
+
}
|
251
|
+
|
252
|
+
/*
|
253
|
+
* call-seq: from_file
|
254
|
+
*
|
255
|
+
* Creates a new KeywordTree and loads the dictionary from a file
|
256
|
+
*
|
257
|
+
* % cat dict0.txt
|
258
|
+
* foo
|
259
|
+
* bar
|
260
|
+
* base
|
261
|
+
*
|
262
|
+
* k= AhoCorasick::KeywordTree.from_file "dict0.txt"
|
263
|
+
* k.search("basement").size # => 1
|
264
|
+
*
|
265
|
+
*/
|
266
|
+
static VALUE
|
267
|
+
rb_kwt_new_from_file(int argc, VALUE *argv, VALUE klass)
|
268
|
+
{
|
269
|
+
|
270
|
+
// TODO:
|
271
|
+
// * use rb_kwt_add_string
|
272
|
+
// * use rb_io* to handle the file
|
273
|
+
|
274
|
+
struct kwt_struct_data *kwt_data;
|
275
|
+
char word[1024];
|
276
|
+
int id;
|
277
|
+
VALUE self;
|
278
|
+
VALUE f_string;
|
279
|
+
FILE *dictionary;
|
280
|
+
|
281
|
+
rb_scan_args(argc, argv, "10", &f_string);
|
282
|
+
|
283
|
+
id = 0;
|
284
|
+
SafeStringValue( f_string );
|
285
|
+
self= rb_class_new_instance( 0, NULL, klass );
|
286
|
+
KeywordTree( self, kwt_data );
|
287
|
+
|
288
|
+
dictionary = fopen( RSTRING( f_string )->ptr, "r" );
|
289
|
+
if(dictionary == NULL)
|
290
|
+
rb_raise(rb_eRuntimeError, "Cannot open `%s\". No such file?", RSTRING(f_string)->ptr);
|
291
|
+
|
292
|
+
while(fgets(word, 1024, dictionary) != NULL) {
|
293
|
+
ac_add_string(kwt_data->tree, word, strlen(word)-1, id++);
|
294
|
+
kwt_data->dictionary_size++;
|
295
|
+
}
|
296
|
+
|
297
|
+
kwt_data->last_id= id+1;
|
298
|
+
fclose(dictionary);
|
299
|
+
return self;
|
300
|
+
}
|
301
|
+
|
302
|
+
static void
|
303
|
+
rb_kwt_struct_free(struct kwt_struct_data * kwt_data)
|
304
|
+
{
|
305
|
+
ac_free(kwt_data->tree);
|
306
|
+
}
|
307
|
+
|
308
|
+
static VALUE
|
309
|
+
rb_kwt_struct_alloc(VALUE klass)
|
310
|
+
{
|
311
|
+
return Data_Wrap_Struct(klass, 0, rb_kwt_struct_free, 0);
|
312
|
+
}
|
313
|
+
|
314
|
+
/*
|
315
|
+
* Blump.
|
316
|
+
*/
|
317
|
+
void Init_ahocorasick() {
|
318
|
+
rb_mAhoCorasick = rb_define_module("AhoCorasick");
|
319
|
+
rb_cKeywordTree = rb_define_class_under(rb_mAhoCorasick, "KeywordTree", rb_cObject);
|
320
|
+
|
321
|
+
rb_define_alloc_func(rb_cKeywordTree, rb_kwt_struct_alloc);
|
322
|
+
|
323
|
+
rb_define_method(rb_cKeywordTree, "initialize", rb_kwt_init, 0);
|
324
|
+
rb_define_method(rb_cKeywordTree, "size", rb_kwt_size, 0);
|
325
|
+
rb_define_method(rb_cKeywordTree, "make", rb_kwt_make, 0);
|
326
|
+
rb_define_method(rb_cKeywordTree, "add_string", rb_kwt_add_string, -1);
|
327
|
+
rb_define_alias(rb_cKeywordTree, "<<", "add_string");
|
328
|
+
|
329
|
+
rb_define_method(rb_cKeywordTree, "find_all", rb_kwt_find_all, -1);
|
330
|
+
rb_define_alias(rb_cKeywordTree, "search", "find_all");
|
331
|
+
|
332
|
+
rb_define_singleton_method(rb_cKeywordTree, "from_file", rb_kwt_new_from_file, -1);
|
333
|
+
|
334
|
+
sym_id = ID2SYM(rb_intern("id"));
|
335
|
+
sym_value = ID2SYM(rb_intern("value"));
|
336
|
+
sym_ends_at = ID2SYM( rb_intern("ends_at") );
|
337
|
+
sym_starts_at= ID2SYM( rb_intern("starts_at") );
|
338
|
+
}
|
339
|
+
|
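The doc comments in ext/ruby-ahocorasick.c describe the Ruby-facing behaviour: ids start at 1 and auto-increment, an explicit id must be a positive integer, and each search result is a hash with `:id`, `:value`, `:starts_at` and `:ends_at`. The following is a minimal pure-Ruby sketch of that contract, so the result shape can be tried without compiling the extension. `SketchKeywordTree` is a hypothetical name, it uses a naive substring scan rather than Aho-Corasick, and it counts characters where the C code counts bytes; it is not the gem's implementation.

```ruby
# Hypothetical stand-in that mimics the documented result shape.
# It is NOT the gem's code and does not build an Aho-Corasick automaton.
class SketchKeywordTree
  def initialize
    @words   = {}  # id => keyword
    @next_id = 1   # mirrors last_id = 1 in rb_kwt_init
  end

  # Adds a keyword; returns the id that was used.
  def add_string(word, id = nil)
    id ||= @next_id
    unless id.is_a?(Integer) && id > 0
      raise ArgumentError, "id must be a positive Integer"
    end
    raise ArgumentError, "duplicate id #{id}" if @words.key?(id)

    @words[id] = word
    @next_id   = id + 1  # mirrors last_id = id + 1 in rb_kwt_add_string
    id
  end
  alias << add_string

  def size
    @words.size
  end

  # Returns an array of hashes; :ends_at is exclusive, matching the
  # documented example (starts_at 6, ends_at 9 for "two").
  def find_all(text)
    results = []
    @words.each do |id, word|
      next if word.empty?

      from = 0
      while (pos = text.index(word, from))
        results << { id: id, value: word,
                     starts_at: pos, ends_at: pos + word.length }
        from = pos + 1
      end
    end
    results.sort_by { |r| [r[:ends_at], r[:starts_at]] }
  end
  alias search find_all
end

kwt = SketchKeywordTree.new
kwt.add_string("one")
kwt << "two"
p kwt.search("moved two times already")
```

With the real extension the same calls would go through `AhoCorasick::KeywordTree`; the ordering of overlapping matches there depends on ac_search and may differ from this sketch.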
metadata
ADDED
@@ -0,0 +1,68 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: ruby-ahocorasick
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.4.5
|
5
|
+
platform: ruby
|
6
|
+
authors:
|
7
|
+
- Aurelian Oancea
|
8
|
+
autorequire:
|
9
|
+
bindir: bin
|
10
|
+
cert_chain: []
|
11
|
+
|
12
|
+
date: 2008-11-19 00:00:00 +01:00
|
13
|
+
default_executable:
|
14
|
+
dependencies: []
|
15
|
+
|
16
|
+
description: Expose Aho-Corasick implementation from Strmat to Ruby.
|
17
|
+
email: oancea at gmail dot com
|
18
|
+
executables: []
|
19
|
+
|
20
|
+
extensions:
|
21
|
+
- ext/extconf.rb
|
22
|
+
extra_rdoc_files: []
|
23
|
+
|
24
|
+
files:
|
25
|
+
- ext/extconf.rb
|
26
|
+
- ext/ruby-ahocorasick.c
|
27
|
+
- ext/ac.h
|
28
|
+
- ext/ac.c
|
29
|
+
- examples/dict.rb
|
30
|
+
- examples/test.rb
|
31
|
+
- examples/elev.rb
|
32
|
+
- examples/sample.c
|
33
|
+
- MIT-LICENSE
|
34
|
+
- README.textile
|
35
|
+
has_rdoc: true
|
36
|
+
homepage: http://www.locknet.ro
|
37
|
+
post_install_message:
|
38
|
+
rdoc_options:
|
39
|
+
- --title
|
40
|
+
- Ruby-AhoCorasick
|
41
|
+
- --inline-source
|
42
|
+
- ext/ruby-ahocorasick.c
|
43
|
+
- README.textile
|
44
|
+
- --main
|
45
|
+
- README.textile
|
46
|
+
require_paths:
|
47
|
+
- lib
|
48
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
49
|
+
requirements:
|
50
|
+
- - ">="
|
51
|
+
- !ruby/object:Gem::Version
|
52
|
+
version: "0"
|
53
|
+
version:
|
54
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
55
|
+
requirements:
|
56
|
+
- - ">="
|
57
|
+
- !ruby/object:Gem::Version
|
58
|
+
version: "0"
|
59
|
+
version:
|
60
|
+
requirements: []
|
61
|
+
|
62
|
+
rubyforge_project: ruby-ahocorasick
|
63
|
+
rubygems_version: 1.3.1
|
64
|
+
signing_key:
|
65
|
+
specification_version: 2
|
66
|
+
summary: Aho-Corasick algorithm implementation to Ruby using Strmat lib.
|
67
|
+
test_files: []
|
68
|
+
|