ruby-ahocorasick 0.4.5

@@ -0,0 +1,21 @@
1
+ Copyright (c) 2008 Aurelian Oancea
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21
+
@@ -0,0 +1,101 @@
1
+ h1. This is a work in progress.
2
+
3
+
4
+ h2. Introduction
5
+
6
+ This library is a "Ruby":http://ruby-lang.org extension: a wrapper around the C implementation of the "Aho-Corasick":http://en.wikipedia.org/wiki/Aho-Corasick_algorithm algorithm found in the "Strmat":http://www.cs.ucdavis.edu/~gusfield/strmat.html package.
7
+
8
+ The source code (ac.c and ac.h) was "adapted" from Strmat. In fact, I've changed only 3-4 lines of the original implementation so that it fits my needs: the search needed to return the current position in the searched string.
9
+
10
+
11
+ h2. Okay, so what's the idea?
12
+
13
+ Given a dictionary of known sentences (note: not *words*!), this kick-ass algorithm finds every dictionary pattern in an incoming stream of data in a single pass. Kinda fast.
14
+
15
+ The algorithm has 2 stages: in the first, an internal tree is built from the given dictionary; the search itself is left to the second.
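To make the two stages concrete, here is a minimal pure-Ruby sketch of the same idea. It is not the extension's C code: the class name @SketchKeywordTree@ and its internals are assumptions made for illustration, and only the result hash keys (@:id@, @:value@, @:starts_at@, @:ends_at@) follow the usage example in this README.

```ruby
# Illustrative pure-Ruby sketch; SketchKeywordTree is a hypothetical name,
# not part of the ruby-ahocorasick extension.
class SketchKeywordTree
  class Node
    attr_accessor :children, :fail, :id, :out
    def initialize
      @children = {}   # char => Node
      @fail = nil      # failure link, set by prepare
      @id = nil        # id of the keyword ending at this node, if any
      @out = []        # ids of every keyword that ends at this position
    end
  end

  def initialize
    @root = Node.new
    @words = {}        # id => keyword
    @prepared = false
  end

  # Stage 1a: insert a keyword into the tree and return its id (1, 2, ...).
  def add_string(word)
    node = @root
    word.each_char { |ch| node = (node.children[ch] ||= Node.new) }
    return node.id if node.id          # keyword was already added
    node.id = @words.size + 1
    @words[node.id] = word
    @prepared = false
    node.id
  end

  # Stage 1b: breadth-first pass that sets failure links and output lists.
  def prepare
    queue = []
    @root.children.each_value do |child|
      child.fail = @root
      child.out = child.id ? [child.id] : []
      queue << child
    end
    until queue.empty?
      node = queue.shift
      node.children.each do |ch, child|
        f = node.fail
        f = f.fail until f.equal?(@root) || f.children.key?(ch)
        child.fail = f.children[ch] || @root
        child.out = (child.id ? [child.id] : []) + child.fail.out
        queue << child
      end
    end
    @prepared = true
  end

  # Stage 2: scan the text once, following failure links on mismatches.
  def find_all(text)
    prepare unless @prepared
    results = []
    node = @root
    text.each_char.with_index do |ch, i|
      node = node.fail until node.equal?(@root) || node.children.key?(ch)
      node = node.children[ch] || @root
      node.out.each do |id|
        word = @words[id]
        results << { :id => id, :value => word,
                     :starts_at => i + 1 - word.length, :ends_at => i + 1 }
      end
    end
    results
  end
end

kwt = SketchKeywordTree.new
%w[data base database].each { |w| kwt.add_string(w) }
kwt.find_all("my database").each { |r| puts r.inspect }
```

The sketch keeps children in a Hash for brevity, whereas ac.c walks a sorted sibling list; the failure-link idea is the same, but the performance characteristics are not comparable.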
16
+
17
+
18
+ h2. Okay, so where can I use this?
19
+
20
+ Well, you can do some crazy things with it: you can look up DNA patterns, or maybe analyze network sequences (read: strange and maybe proprietary network protocols), or do domestic stuff like building contextual links in your blog posts to enrich your users' experience.
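As a rough illustration of the contextual-links idea, the sketch below wraps matched keywords in HTML links. The helper @link_keywords@ and the @links@ keyword-to-URL map are hypothetical; the only thing taken from this library is the shape of the result hashes (@:value@, @:starts_at@, @:ends_at@), which are hard-coded here so the example runs without the extension. It does no HTML escaping.

```ruby
# Hypothetical helper: not part of ruby-ahocorasick.
# results: array of hashes shaped like the ones returned by a search.
# links:   assumed map from keyword to URL.
def link_keywords(text, results, links)
  out = String.new
  cursor = 0
  # Leftmost match first; on ties, prefer the longest match.
  results.sort_by { |r| [r[:starts_at], -r[:ends_at]] }.each do |r|
    next if r[:starts_at] < cursor        # overlaps a match already linked
    url = links[r[:value]]
    next unless url                       # no link configured for this keyword
    out << text[cursor...r[:starts_at]]
    out << %(<a href="#{url}">#{r[:value]}</a>)
    cursor = r[:ends_at]
  end
  out << text[cursor..-1]
end

text    = "I've moved my data to a database"
results = [
  { :value => "data",     :starts_at => 14, :ends_at => 18 },
  { :value => "data",     :starts_at => 24, :ends_at => 28 },
  { :value => "database", :starts_at => 24, :ends_at => 32 },
  { :value => "base",     :starts_at => 28, :ends_at => 32 }
]
puts link_keywords(text, results, "data" => "/d", "database" => "/db")
```

Overlapping matches are common with this algorithm (here "data" and "base" both sit inside "database"), so some policy for choosing between them is needed; longest-leftmost is just one reasonable choice.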
21
+
22
+
23
+ h2. Okay, so how can I install it?
24
+
25
+
26
+ h3. Rubygems - Development Version
27
+
28
+ <pre>
29
+ gem install aurelian-ruby-ahocorasick --source=http://gems.github.com
30
+ </pre>
31
+
32
+
33
+ h3. Build it from source
34
+
35
+ <pre>
36
+ $ git clone git://github.com/aurelian/ruby-ahocorasick.git
37
+ $ cd ruby-ahocorasick
38
+ </pre>
39
+
40
+ To build and install the gem on your machine (run with sudo if needed):
41
+
42
+ <pre>
43
+ $ rake install
44
+ </pre>
45
+
46
+ @rake -T@ will list other cool tasks.
47
+
48
+
49
+ h3. Rubygems - Stable Version
50
+
51
+ There's no stable version right now.
52
+
53
+
54
+ h4. Notes
55
+
56
+ It's known to work / compile / install on Ubuntu 8.04 and Mac OS 10.4.*. It should work out of the box if you have gcc around.
57
+ Unfortunately I don't have a Windows PC around, nor the required knowledge about Microsoft compilers.
58
+
59
+
60
+ h2. Okay, so how do I use it?
61
+
62
+ <pre>
63
+ require 'ahocorasick'
64
+
65
+ keyword_tree= AhoCorasick::KeywordTree.new # creates a new tree
66
+ keyword_tree.add_string( "foo-- Z@!bar" ) # adds a keyword to the tree
67
+ keyword_tree.add_string( "cervantes" ) # even more
68
+
69
+ results= keyword_tree.find_all( "1011000129 foo-- Z@!bar761 ! 001211 6xU" ).each do | result |
70
+ result[:value] # => "foo-- Z@!bar"
71
+ result[:starts_at] # => 11
72
+ result[:ends_at] # => 23
73
+ result[:id] # => 1
74
+ end
75
+
76
+ </pre>
77
+
78
+ You can get some API reference on the "wiki":http://github.com/aurelian/ruby-ahocorasick/wikis.
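The offsets in the usage example above can be read as a half-open range: @:starts_at@ is the index of the first matched character and @:ends_at@ is one past the last. The sketch below checks that reading against the values shown in the example comments. The helper @matched_text@ is hypothetical, and the result hash is copied from those comments rather than produced by the extension.

```ruby
# Hypothetical helper: slices the matched text back out of the query.
def matched_text(query, result)
  # :starts_at is inclusive and :ends_at is exclusive, hence the three-dot range
  query[result[:starts_at]...result[:ends_at]]
end

query  = "1011000129 foo-- Z@!bar761 ! 001211 6xU"
# Values copied from the comments in the usage example.
result = { :id => 1, :value => "foo-- Z@!bar", :starts_at => 11, :ends_at => 23 }

puts matched_text(query, result) == result[:value]
```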
79
+
80
+
81
+ h2. Bugs? Suggestions? Ideas? Patches?
82
+
83
+ For now, just use the email address listed in the License section below.
84
+
85
+
86
+ h2. Additional Reading
87
+
88
+ Other suffix-tree implementations:
89
+
90
+ * "Strmat":http://www.cs.ucdavis.edu/~gusfield/strmat.html
91
+ * "Pytst":http://nicolas.lehuen.com/download/pytst / "Ruby-Pytst":http://rubyforge.org/projects/ruby-pytst/
92
+ * "Aho-Corasick extension":http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
93
+ * "Keyword Prospector":http://latimes.rubyforge.org/keyword_prospector/rdoc/
94
+ * "libstree":http://www.cl.cam.ac.uk/~cpk25/libstree/
95
+
96
+
97
+ h2. License
98
+
99
+ (c) 2008 - Aurelian Oancea, < oancea at gmail dot com >
100
+
101
+ Released under the MIT-LICENSE.
@@ -0,0 +1,25 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'time'
4
+
5
+ require File.dirname(__FILE__) + '/../ext/ahocorasick'
6
+
7
+ t= Time.now
8
+
9
+ k= AhoCorasick::KeywordTree.from_file(File.dirname(__FILE__) + "/../spec/data/en.words")
10
+
11
+ t1= Time.now
12
+
13
+ puts "%d words added in %s seconds" % [k.size, (t1-t)]
14
+
15
+ query = File.read( File.dirname(__FILE__) + "/../spec/data/news.txt" )
16
+
17
+ results= k.search query
18
+
19
+ puts "took %s seconds to find %d results in a stream with %d characters" % [(Time.now-t1), results.size, query.size]
20
+
21
+ exit
22
+ results.each do | r |
23
+ puts query[r[:starts_at]].chr + ".." + query[r[:ends_at]-1].chr + " => " + r[:value]
24
+ end
25
+
@@ -0,0 +1,19 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ $KCODE='UTF-8'
4
+
5
+ require File.dirname(__FILE__) + '/../ext/ahocorasick'
6
+
7
+ k= AhoCorasick::KeywordTree.new
8
+
9
+ k << "I've"
10
+ k << "data"
11
+ k << "base"
12
+ k << "database"
13
+
14
+ query= "I've moved my data to a database"
15
+
16
+ k.search(query).each do | r |
17
+ puts "-> [ " + r[:id].to_s + " ] " + r[:value] + " / " + query[r[:starts_at]].chr + ".." + query[r[:ends_at]-1].chr
18
+ end
19
+
@@ -0,0 +1,94 @@
1
+ //
2
+ // Getting started with Aho-Corasick from Strmat
3
+ //
4
+ // lasick Makefile:
5
+ //
6
+ // --
7
+ // ac.o :
8
+ // gcc -c -fPIC -shared ac.c
9
+ // libasick : ac.o
10
+ // gcc -shared -Wl,-soname,libasick.so -o libasick.so.1.0.1 ac.o
11
+ // ar rcs libasick.a ac.o
12
+ // clean :
13
+ // rm -rf *.o *.a *.so* *.dylib*
14
+ // --
15
+ //
16
+ // Compile this stuff - asick is the library name, generated with the above Makefile :)
17
+ //
18
+ // gcc sample.c -o ac-sample -I../ext/ -L../ext/ -lasick
19
+ //
20
+
21
+ #include <string.h>
22
+ #include <stdio.h>
23
+ #include <stdlib.h>
24
+ #include "ac.h"
25
+
26
+ int main(int argc, char *argv[]) {
27
+ char * search;
28
+ char * remain;
29
+ char * result;
30
+ char word[1024];
31
+
32
+ FILE *dictionary;
33
+ FILE *input;
34
+ int lgt, id, n, i;
35
+
36
+ AC_STRUCT * tree;
37
+
38
+ input= fopen(argv[1], "r");
39
+
40
+ if(input == NULL) {
41
+ search= argv[1];
42
+ } else {
43
+ long lSize;
44
+ fseek (input , 0 , SEEK_END);
45
+ lSize= ftell (input);
46
+ rewind(input);
47
+ search = (char*) malloc (sizeof(char)*(lSize+1));
48
+ if (search == NULL) { fputs ("Error: Memory error",stderr); exit(-2); }
49
+ search[fread(search, 1, lSize, input)] = '\0'; // NUL-terminate: strlen(search) is used below
50
+ }
51
+
52
+ dictionary= fopen("../spec/data/dictionary.txt", "r");
53
+
54
+ if(dictionary == NULL) {
55
+ printf("Error: can't open file.\n");
56
+ return -1;
57
+ }
58
+
59
+ tree= ac_alloc();
60
+
61
+ // start counting from 1
62
+ n= 1;
63
+
64
+ printf("==> building dictionary ...");
65
+
66
+ while(fgets(word, 1024, dictionary) != NULL) {
67
+ // strip \n
68
+ ac_add_string(tree, word, strlen(word)-1, n++);
69
+ }
70
+
71
+ printf("%d entries added.\n", n-1);
72
+
73
+ ac_prep(tree);
74
+
75
+ printf("==> input text [%d]:\n--\n%s\n--\n", (int) strlen(search), search);
76
+
77
+ ac_search_init(tree, search, strlen(search) );
78
+
79
+ while((remain= ac_search(tree, &lgt, &id, NULL)) != NULL) {
80
+ printf("`%d'", remain[lgt+1]);
81
+ result = (char*) malloc (sizeof(char)*(lgt+1)); // +1 for the terminating NUL written by sprintf
82
+ sprintf( result, "%.*s", lgt, remain);
83
+ // result: should read first lgt chars from remain.
84
+ printf("==> result: length=> %d, id=> %d [%s]\n", lgt, id, result);
85
+ free(result);
86
+ }
87
+
88
+ ac_free(tree);
89
+ fclose(dictionary);
90
+ if (input != NULL) { fclose(input); free(search); } // search points at argv[1] when no file was opened
91
+
92
+ return 0;
93
+ }
94
+
@@ -0,0 +1,46 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require File.dirname(__FILE__) + '/../ext/ahocorasick'
4
+
5
+ k= AhoCorasick::KeywordTree.new
6
+
7
+ puts k.size
8
+ k.add_string("foo");
9
+
10
+ puts k.size
11
+ k.add_string("bar", 1991);
12
+
13
+ puts k.size
14
+ k.add_string("fomz");
15
+
16
+ begin
17
+ k.add_string("foo", -1);
18
+ rescue RuntimeError => err
19
+ puts "[ok]==> got " + err.class.name + ": " + err.message
20
+ end
21
+
22
+ begin
23
+ k.add_string("foo", "bar");
24
+ rescue RuntimeError => err
25
+ puts "[ok]==> got " + err.class.name + ": " + err.message
26
+ end
27
+
28
+ k.add_string("timisoara", 22);
29
+
30
+ puts k.size
31
+ begin
32
+ k.add_string("bucuresti", 22);
33
+ rescue RuntimeError => err
34
+ puts "[ok]==> got " + err.class.name + ": " + err.message
35
+ end
36
+
37
+ k << "bacau"
38
+
39
+ k.search('am fost la bacau').each do | result |
40
+ puts result.inspect
41
+ end
42
+
43
+ k.search( 'din foo in foo' ).each do | q |
44
+ puts q.inspect
45
+ end
46
+
@@ -0,0 +1,623 @@
1
+ /*
2
+ * ac.c
3
+ *
4
+ * Implementation of the Aho-Corasick algorithm.
5
+ *
6
+ * NOTES:
7
+ * 8/94 - Original Implementation (Sean Davis)
8
+ * 9/94 - Redid Implementation (James Knight)
9
+ * 3/96 - Modularized the code (James Knight)
10
+ * 7/96 - Finished the modularization (James Knight)
11
+ */
12
+
13
+ #include <stdio.h>
14
+ #include <stdlib.h>
15
+ #include <string.h>
16
+ #include "ac.h"
17
+
18
+ /*
19
+ * ac_alloc
20
+ *
21
+ * Creates a new AC_STRUCT structure and initializes its fields.
22
+ *
23
+ * Parameters: none.
24
+ *
25
+ * Returns: A dynamically allocated AC_STRUCT structure.
26
+ */
27
+ AC_STRUCT *ac_alloc(void)
28
+ {
29
+ AC_STRUCT *node;
30
+
31
+ if ((node = malloc(sizeof(AC_STRUCT))) == NULL)
32
+ return NULL;
33
+ memset(node, 0, sizeof(AC_STRUCT));
34
+
35
+ if ((node->tree = malloc(sizeof(ACTREE_NODE))) == NULL) {
36
+ free(node);
37
+ return NULL;
38
+ }
39
+ memset(node->tree, 0, sizeof(ACTREE_NODE));
40
+
41
+ return node;
42
+ }
43
+
44
+
45
+ /*
46
+ * ac_add_string
47
+ *
48
+ * Adds a string to the AC_STRUCT structure's keyword tree.
49
+ *
50
+ * NOTE: The `id' value given must be unique to any of the strings
51
+ * added to the tree, and must be a small integer greater than
52
+ * 0 (since it is used to index an array holding information
53
+ * about each of the strings).
54
+ *
55
+ * The best id's to use are to number the strings from 1 to K.
56
+ *
57
+ * Parameters: node - an AC_STRUCT structure
58
+ * P - the sequence
59
+ * M - the sequence length
60
+ * id - the sequence identifier
61
+ *
62
+ * Returns: non-zero on success, zero on error.
63
+ */
64
+ int ac_add_string(AC_STRUCT *node, char *P, int M, int id)
65
+ {
66
+ int i, j, newsize;
67
+ AC_TREE tnode, child, back, newnode, list, tail;
68
+
69
+ /*
70
+ * Return a zero if a previous error had occurred, or if the
71
+ * given id equals zero. An id value of zero is used by the
72
+ * algorithm to signal that no pattern ends at a node in the
73
+ * keyword tree. So, it can't be used as a pattern's id.
74
+ */
75
+ if (node->errorflag || id == 0)
76
+ return 0;
77
+
78
+ P--; /* Shift to make sequence be P[1],...,P[M] */
79
+
80
+ /*
81
+ * Allocate space for the new string's information.
82
+ */
83
+ if (node->Psize <= id) {
84
+ if (node->Psize == 0) {
85
+ newsize = (id >= 16 ? id + 1 : 16);
86
+ node->Plengths = malloc(newsize * sizeof(int));
87
+ }
88
+ else {
89
+ newsize = node->Psize + id + 1;
90
+ node->Plengths = realloc(node->Plengths, newsize * sizeof(int));
91
+ }
92
+ if (node->Plengths == NULL) {
93
+ node->errorflag = 1;
94
+ return 0;
95
+ }
96
+
97
+ for (i=node->Psize; i < newsize; i++)
98
+ node->Plengths[i] = 0;
99
+ node->Psize = newsize;
100
+ }
101
+
102
+ // duplicate id
103
+ if (node->Plengths[id] != 0)
104
+ return 0;
105
+
106
+ /*
107
+ * Add the string to the keyword tree.
108
+ */
109
+ tnode = node->tree;
110
+ for (i=1; i <= M; i++) {
111
+ /*
112
+ * Find the child whose character is P[i].
113
+ */
114
+ back = NULL;
115
+ child = tnode->children;
116
+ while (child != NULL && child->ch < P[i]) {
117
+ back = child;
118
+ child = child->sibling;
119
+ }
120
+
121
+ if (child == NULL || child->ch != P[i])
122
+ break;
123
+
124
+ tnode = child;
125
+
126
+ #ifdef STATS
127
+ node->prep_old_edges++;
128
+ #endif
129
+
130
+ }
131
+
132
+ /*
133
+ * If only part of the pattern exists in the tree, add the
134
+ * rest of the pattern to the tree.
135
+ */
136
+ if (i <= M) {
137
+ list = tail = NULL;
138
+ for (j=i; j <= M; j++) {
139
+ if ((newnode = malloc(sizeof(ACTREE_NODE))) == NULL)
140
+ break;
141
+ memset(newnode, 0, sizeof(ACTREE_NODE));
142
+ newnode->ch = P[j];
143
+
144
+ if (list == NULL)
145
+ list = tail = newnode;
146
+ else
147
+ tail = tail->children = newnode;
148
+
149
+ #ifdef STATS
150
+ node->prep_new_edges++;
151
+ #endif
152
+
153
+ }
154
+ if (j <= M) {
155
+ while (list != NULL) {
156
+ tail = list->children;
157
+ free(list);
158
+ list = tail;
159
+ }
160
+ return 0;
161
+ }
162
+
163
+ list->sibling = child;
164
+ if (back == NULL)
165
+ tnode->children = list;
166
+ else
167
+ back->sibling = list;
168
+
169
+ tnode = tail;
170
+ }
171
+
172
+ tnode->matchid = id;
173
+ node->Plengths[id] = M;
174
+ node->ispreprocessed = 0;
175
+
176
+ return 1;
177
+ }
178
+
179
+
180
+ /*
181
+ * ac_del_string
182
+ *
183
+ * Deletes a string from the keyword tree.
184
+ *
185
+ * Parameters: node - an AC_STRUCT structure
186
+ * P - the sequence to be deleted
187
+ * M - its length
188
+ * id - its identifier
189
+ *
190
+ * Returns: non-zero on success, zero on error.
191
+ */
192
+ int ac_del_string(AC_STRUCT *node, char *P, int M, int id)
193
+ {
194
+ int i, flag;
195
+ AC_TREE tnode, tlast, tback, child, back;
196
+
197
+ if (node->errorflag || id <= 0 || id >= node->Psize || node->Plengths[id] == 0)
198
+ return 0;
199
+
200
+ P--; /* Shift to make sequence be P[1],...,P[M] */
201
+
202
+ /*
203
+ * Scan the tree for the path corresponding to the keyword to be deleted.
204
+ */
205
+ flag = 1;
206
+ tlast = tnode = node->tree;
207
+ tback = NULL;
208
+
209
+ for (i=1; i <= M; i++) {
210
+ /*
211
+ * Find the child matching P[i]. It must be there.
212
+ */
213
+ child = tnode->children;
214
+ back = NULL;
215
+ while (child != NULL && child->ch != P[i]) {
216
+ back = child;
217
+ child = child->sibling;
218
+ }
219
+
220
+ if (child == NULL) {
221
+ fprintf(stderr, "Error in Aho-Corasick preprocessing. String to be "
222
+ "deleted is not in tree.\n");
223
+ return 0;
224
+ }
225
+
226
+ /*
227
+ * Try to find the point where the pattern to be deleted branches off
228
+ * from the paths of the other patterns in the tree. This point must
229
+ * be at the latest node which satisfies one of these two conditions:
230
+ *
231
+ * 1) Another pattern ends at that node (and so
232
+ * `child->matchid != 0'). In this case, the branch point is
233
+ * just below this node and so the children of this node
234
+ * should be removed.
235
+ * 2) A node has other siblings. In this case, the node itself
236
+ * is the branch point, and it and its children should be
237
+ * removed.
238
+ */
239
+ if (i < M && child->matchid != 0) {
240
+ flag = 1;
241
+ tlast = child;
242
+ }
243
+ else if (back != NULL || child->sibling != NULL) {
244
+ flag = 2;
245
+ tlast = child;
246
+ tback = (back == NULL ? tnode : back);
247
+ }
248
+
249
+ tnode = child;
250
+ }
251
+
252
+ /*
253
+ * If the node corresponding to the end of the keyword has children,
254
+ * then the tree should not be altered, except to remove the keyword's
255
+ * identifier from the tree.
256
+ *
257
+ * Otherwise, apply the appropriate removal, as described above.
258
+ */
259
+ if (tnode->children != NULL) {
260
+ tnode->matchid = 0;
261
+ }
262
+ else {
263
+ if (flag == 1) {
264
+ child = tlast->children;
265
+ tlast->children = NULL;
266
+ tlast = child;
267
+ }
268
+ else {
269
+ if (tback->children == tlast)
270
+ tback->children = tlast->sibling;
271
+ else
272
+ tback->sibling = tlast->sibling;
273
+ }
274
+
275
+ while (tlast != NULL) {
276
+ child = tlast->children;
277
+ free(tlast);
278
+ tlast = child;
279
+ }
280
+ }
281
+
282
+ node->Plengths[id] = 0;
283
+ node->ispreprocessed = 0;
284
+
285
+ return 1;
286
+ }
287
+
288
+
289
+ /*
290
+ * ac_prep
291
+ *
292
+ * Compute the failure and output links for the keyword tree.
293
+ *
294
+ * Parameters: node - an AC_STRUCT structure
295
+ *
296
+ * Returns: non-zero on success, zero on error.
297
+ */
298
+ int ac_prep(AC_STRUCT *node)
299
+ {
300
+ char x;
301
+ AC_TREE v, vprime, w, wprime, root, front, back, child;
302
+
303
+ if (node->errorflag)
304
+ return 0;
305
+
306
+ /*
307
+ * The failure link and output link computation requires a breadth-first
308
+ * traversal of the keyword tree. And, to do that, we need a queue of
309
+ * the nodes yet to be processed.
310
+ *
311
+ * The `faillink' fields will be used as the pointers for the queue
312
+ * of nodes to be computed (since the failure link is only set after
313
+ * the node is removed from the queue).
314
+ *
315
+ * The `outlink' fields will be used as the pointers to a node's parent
316
+ * for nodes in the queue (since the output link is also only set after
317
+ * the node is removed from the queue).
318
+ */
319
+ root = node->tree;
320
+
321
+ front = back = root;
322
+ front->faillink = NULL;
323
+ front->outlink = NULL;
324
+
325
+ while (front != NULL) {
326
+ v = front;
327
+ x = v->ch;
328
+ vprime = v->outlink;
329
+
330
+ /*
331
+ * Add the node's children to the queue.
332
+ */
333
+ for (child=v->children; child != NULL; child=child->sibling) {
334
+ child->outlink = v;
335
+ back->faillink = child;
336
+ back = child;
337
+ }
338
+ back->faillink = NULL;
339
+
340
+ front = front->faillink;
341
+ v->faillink = v->outlink = NULL;
342
+
343
+ /*
344
+ * Set the failure and output links.
345
+ */
346
+ if (v == root)
347
+ ;
348
+ else if (vprime == root)
349
+ v->faillink = root;
350
+ else {
351
+ /*
352
+ * Find the first link in the failure link chain which has a child
353
+ * labeled with x.
354
+ */
355
+ wprime = NULL;
356
+ w = vprime->faillink;
357
+
358
+ while (1) {
359
+ wprime = w->children;
360
+ while (wprime != NULL && wprime->ch < x)
361
+ wprime = wprime->sibling;
362
+
363
+ if ((wprime != NULL && wprime->ch == x) || w == root)
364
+ break;
365
+
366
+ w = w->faillink;
367
+
368
+ #ifdef STATS
369
+ node->prep_fail_compares++;
370
+ #endif
371
+ }
372
+ #ifdef STATS
373
+ node->prep_fail_compares++;
374
+ #endif
375
+
376
+ if (wprime != NULL && wprime->ch == x)
377
+ v->faillink = wprime;
378
+ else
379
+ v->faillink = root;
380
+
381
+ if (v->matchid != 0) {
382
+ if (v->faillink->matchid != 0)
383
+ v->outlink = v->faillink;
384
+ else
385
+ v->outlink = v->faillink->outlink;
386
+ }
387
+ }
388
+ }
389
+
390
+ node->ispreprocessed = 1;
391
+ node->initflag = 0;
392
+
393
+ return 1;
394
+ }
395
+
396
+
397
+ /*
398
+ * ac_search_init
399
+ *
400
+ * Initializes the variables used during an Aho-Corasick search.
401
+ * See ac_search for an example of how it should be used.
402
+ *
403
+ * Parameters: node - an AC_STRUCT structure
404
+ * T - the sequence to be searched
405
+ * N - the length of the sequence
406
+ *
407
+ * Returns: nothing.
408
+ */
409
+ void ac_search_init(AC_STRUCT *node, char *T, int N)
410
+ {
411
+ if (node->errorflag)
412
+ return;
413
+ else if (!node->ispreprocessed) {
414
+ fprintf(stderr, "Error in Aho-Corasick search. The preprocessing "
415
+ "has not been completed.\n");
416
+ return;
417
+ }
418
+
419
+ node->T = T - 1; /* Shift to make sequence be T[1],...,T[N] */
420
+ node->N = N;
421
+ node->c = 1;
422
+ node->w = node->tree;
423
+ node->output = NULL;
424
+ node->initflag = 1;
425
+ node->endflag = 0;
426
+ }
427
+
428
+
429
+ /*
430
+ * ac_search
431
+ *
432
+ * Scans a text to look for the next occurrence of one of the patterns
433
+ * in the text. An example of how this search should be used is the
434
+ * following:
435
+ *
436
+ * s = T;
437
+ * len = N;
438
+ * contflag = 0;
439
+ * ac_search_init(node, T, N);
440
+ * while ((s = ac_search(node, &matchlen, &matchid, NULL)) != NULL) {
441
+ * >>> Pattern `matchid' matched from `s' to `s + matchlen - 1'. <<<
442
+ * }
443
+ *
444
+ * where `node', `T' and `N' are assumed to be initialized appropriately.
445
+ *
446
+ * Parameters: node - a preprocessed AC_STRUCT structure
447
+ * length_out - where to store the new match's length
448
+ * id_out - where to store the identifier of the
449
+ * pattern that matched
450
+ * ends_at - where to store the 1-based text position just past the match's last character (may be NULL)
451
+ *
452
+ * Returns: the left end of the text that matches a pattern, or NULL
453
+ * if no match occurs. (It also stores values in `*length_out',
454
+ * `*id_out' and `*ends_at', giving the match's length, pattern identifier, and end position.)
455
+ */
456
+ char *ac_search(AC_STRUCT *node, int *length_out, int *id_out, int *ends_at)
457
+ {
458
+ int c, N, id;
459
+ char *T;
460
+ AC_TREE w, wprime, root;
461
+
462
+ if (node->errorflag)
463
+ return NULL;
464
+ else if (!node->ispreprocessed) {
465
+ fprintf(stderr, "Error in Aho-Corasick search. The preprocessing "
466
+ "has not been completed.\n");
467
+ return NULL;
468
+ }
469
+ else if (!node->initflag) {
470
+ fprintf(stderr, "Error in Aho-Corasick search. ac_search_init was not "
471
+ "called.\n");
472
+ return NULL;
473
+ }
474
+ else if (node->endflag)
475
+ return NULL;
476
+
477
+ T = node->T;
478
+ N = node->N;
479
+ c = node->c;
480
+ w = node->w;
481
+ root = node->tree;
482
+
483
+ /*
484
+ * If the last call to ac_search returned a match, check for another
485
+ * match ending at the same right endpoint (denoted by a non-NULL
486
+ * output link).
487
+ */
488
+ if (node->output != NULL) {
489
+ node->output = node->output->outlink;
490
+
491
+ #ifdef STATS
492
+ node->outlinks_traversed++;
493
+ #endif
494
+
495
+ if (node->output != NULL) {
496
+ id = node->output->matchid;
497
+ if (id_out)
498
+ *id_out = id;
499
+ if (length_out)
500
+ *length_out = node->Plengths[id];
501
+ if (ends_at)
502
+ *ends_at= c;
503
+ return &T[c] - node->Plengths[id];
504
+ }
505
+
506
+ }
507
+
508
+ /*
509
+ * Run the search algorithm, stopping at the first position where a
510
+ * match to one of the patterns occurs.
511
+ */
512
+ while (c <= N) {
513
+ /*
514
+ * Try to match the next input character to a child in the tree.
515
+ */
516
+ wprime = w->children;
517
+ while (wprime != NULL && wprime->ch != T[c])
518
+ wprime = wprime->sibling;
519
+
520
+ #ifdef STATS
521
+ node->num_compares++;
522
+ #endif
523
+
524
+ /*
525
+ * If the match fails, then either use the failure link (if not
526
+ * at the root), or move to the next character since no prefix
527
+ * of any pattern ends with character T[c].
528
+ */
529
+ if (wprime == NULL) {
530
+ if (w == root)
531
+ c++;
532
+ else {
533
+ w = w->faillink;
534
+
535
+ #ifdef STATS
536
+ node->num_failures++;
537
+ #endif
538
+
539
+ }
540
+ }
541
+ else {
542
+ /*
543
+ * If we could match the input, move down the tree and to the
544
+ * next input character, and see if that match completes the
545
+ * match to a pattern (when matchid != 0 or outlink != NULL).
546
+ */
547
+ c++;
548
+ w = wprime;
549
+
550
+ #ifdef STATS
551
+ node->edges_traversed++;
552
+ #endif
553
+
554
+ if (w->matchid != 0)
555
+ node->output = w;
556
+ else if (w->outlink != NULL) {
557
+ node->output = w->outlink;
558
+
559
+ #ifdef STATS
560
+ node->outlinks_traversed++;
561
+ #endif
562
+
563
+ }
564
+
565
+ if (node->output != NULL) {
566
+ id = node->output->matchid;
567
+ if (id_out)
568
+ *id_out = id;
569
+ if (length_out)
570
+ *length_out= node->Plengths[id];
571
+ if(ends_at)
572
+ *ends_at= c;
573
+
574
+ node->w = w;
575
+ node->c = c; // ends_at - length_out;
576
+
577
+ return &T[c] - node->Plengths[id];
578
+ }
579
+ }
580
+ }
581
+
582
+ node->c = c;
583
+ node->endflag = 1;
584
+
585
+ return NULL;
586
+ }
587
+
588
+
589
+ /*
590
+ * ac_free
591
+ *
592
+ * Free up the allocated AC_STRUCT structure.
593
+ *
594
+ * Parameters: node - an AC_STRUCT structure
595
+ *
596
+ * Returns: nothing.
597
+ */
598
+ void ac_free(AC_STRUCT *node)
599
+ {
600
+ AC_TREE front, back, next;
601
+
602
+ if (node == NULL)
603
+ return;
604
+
605
+ if (node->tree != NULL) {
606
+ front = back = node->tree;
607
+ while (front != NULL) {
608
+ back->sibling = front->children;
609
+ while (back->sibling != NULL)
610
+ back = back->sibling;
611
+
612
+ next = front->sibling;
613
+ free(front);
614
+ front = next;
615
+ }
616
+ }
617
+
618
+ if (node->Plengths != NULL)
619
+ free(node->Plengths);
620
+
621
+ free(node);
622
+ }
623
+
@@ -0,0 +1,36 @@
1
+
2
+ #ifndef _AC_H_
3
+ #define _AC_H_
4
+
5
+ typedef struct actreenode {
6
+ char ch;
7
+ int matchid;
8
+ struct actreenode *outlink, *faillink;
9
+ struct actreenode *children, *sibling;
10
+ } ACTREE_NODE, *AC_TREE;
11
+
12
+ typedef struct {
13
+ AC_TREE tree;
14
+ int ispreprocessed, errorflag;
15
+
16
+ int Psize;
17
+ int *Plengths;
18
+
19
+ char *T;
20
+ int N, c, initflag, endflag;
21
+ AC_TREE w, output;
22
+ #ifdef STATS
23
+ int prep_new_edges, prep_old_edges, prep_fail_compares;
24
+ int num_compares, num_failures, edges_traversed, outlinks_traversed;
25
+ #endif
26
+ } AC_STRUCT;
27
+
28
+ AC_STRUCT *ac_alloc(void);
29
+ int ac_add_string(AC_STRUCT *node, char *P, int M, int id);
30
+ // int ac_del_string(AC_STRUCT *node, char *P, int M, int id);
31
+ int ac_prep(AC_STRUCT *node);
32
+ void ac_search_init(AC_STRUCT *node, char *T, int N);
33
+ char *ac_search(AC_STRUCT *node, int *length_out, int *id_out, int *ends_at);
34
+ void ac_free(AC_STRUCT *node);
35
+
36
+ #endif
@@ -0,0 +1,6 @@
1
+ require "mkmf"
2
+
3
+ dir_config("ahocorasick")
4
+
5
+ create_makefile("ahocorasick")
6
+
@@ -0,0 +1,339 @@
1
+
2
+ //
3
+ // (c) 2008, Aurelian Oancea < oancea at gmail dot com >
4
+ //
5
+ // Released under MIT-LICENSE
6
+ //
7
+
8
+ //
9
+ // TODO: new methods?
10
+ //
11
+ // * kwt[id] = word
12
+ // * kwt.from_file (class instance method)
13
+ //
14
+ // * kwt.find_each ("str") {|r| .. }
15
+ // * kwt.find_first("str")
16
+ // * kwt.find_all ("str")
17
+ //
18
+ // TODO: change last_id and dictionary_size to long
19
+ //
20
+
21
+ #include <ruby.h>
22
+ #include "ac.h"
23
+
24
+ static VALUE sym_id, sym_value, sym_ends_at, sym_starts_at;
25
+
26
+ VALUE rb_mAhoCorasick;
27
+ VALUE rb_cKeywordTree;
28
+
29
+ #define KeywordTree(obj, kwt_data) {\
30
+ Data_Get_Struct(obj, struct kwt_struct_data, kwt_data);\
31
+ }
32
+
33
+ struct kwt_struct_data {
34
+ AC_STRUCT * tree;
35
+ int last_id;
36
+ int dictionary_size;
37
+ int is_frozen;
38
+ };
39
+
40
+ // int
41
+ // rb_add_string(struct kwt_struct_data *kwt, char *word, int size, int id) {
42
+ // if(ac_add_string( kwt->tree, word, size, id ) == 0)
43
+ // return 0;
44
+ // kwt->dictionary_size++;
45
+ // kwt->last_id= id+1;
46
+ // return 1;
47
+ // }
48
+
49
+ /*
50
+ * call-seq: initialize
51
+ *
52
+ * Creates a new KeywordTree
53
+ *
54
+ * require 'ahocorasick'
55
+ * kwt = AhoCorasick::KeywordTree.new
56
+ *
57
+ */
58
+ static VALUE
59
+ rb_kwt_init(VALUE self)
60
+ {
61
+ AC_STRUCT * tree;
62
+ struct kwt_struct_data *kwt_data;
63
+
64
+ kwt_data = ALLOC(struct kwt_struct_data);
65
+ tree = ac_alloc();
66
+ DATA_PTR(self) = kwt_data;
67
+ kwt_data->tree = tree;
68
+ kwt_data->last_id = 1;
69
+ kwt_data->dictionary_size = 0;
70
+ kwt_data->is_frozen = 0;
71
+ return self;
72
+ }
73
+
74
+ /*
75
+ * Document-method: make
76
+ * call-seq: make
77
+ *
78
+ * It freezes the current KeywordTree.
79
+ *
80
+ * ==== Note: This method is called internally by search
81
+ *
82
+ * require 'ahocorasick'
83
+ *
84
+ * kwt = AhoCorasick::KeywordTree.new
85
+ *
86
+ * kwt.add_string("one")
87
+ * kwt.add_string("two")
88
+ * kwt.make()
89
+ */
90
+ static VALUE
91
+ rb_kwt_make(VALUE self)
92
+ {
93
+ struct kwt_struct_data *kwt_data;
94
+ KeywordTree(self, kwt_data);
95
+
96
+ if(kwt_data->is_frozen == 1)
97
+ return Qtrue;
98
+
99
+ if(ac_prep( kwt_data->tree ) == 1) {
100
+ kwt_data->is_frozen = 1;
101
+ return Qtrue;
102
+ }
103
+
104
+ rb_raise(rb_eRuntimeError, "Cannot freeze the tree");
105
+ }
106
+
107
+ /*
108
+ * Document-method: find_all
109
+ * call-seq: find_all
110
+ *
111
+ * Search the current tree.
112
+ *
113
+ * It returns an array of hashes, e.g.
114
+ *
115
+ * [ { :id => int, :value => string, :starts_at => int, :ends_at => int}, { ... } ]
116
+ *
117
+ * Returns an empty array when the search didn't return any result.
118
+ *
119
+ * # assuming a valid KeywordTree kwt object:
120
+ * kwt.add_string("one")
121
+ * kwt.add_string("two")
122
+ *
123
+ * kwt.search( "moved two times already" ).each do | result |
124
+ * result[:id] # => 2
125
+ * result[:ends_at] # => 9
126
+ * result[:starts_at] # => 6
127
+ * result[:value] # => two
128
+ * end # => 1
129
+ *
130
+ */
131
+ static VALUE
132
+ rb_kwt_find_all(int argc, VALUE *argv, VALUE self)
133
+ {
134
+ char * remain; // returned by ac_search: pointer to the left end of the match within the searched text
135
+ int lgt, id, ends_at, starts_at; // filled in by ac_search: the length of the result, the id, and starts_at/ends_at position
136
+ VALUE v_result; // one result, as hash
137
+ VALUE v_results; // all the results, an array
138
+
139
+ VALUE v_search; // search string, function argument
140
+ struct kwt_struct_data *kwt_data;
141
+
142
+ // one mandatory argument.
143
+ rb_scan_args(argc, argv, "1", &v_search);
144
+ // it should be string.
145
+ Check_Type(v_search, T_STRING);
146
+ // get the structure
147
+ KeywordTree(self, kwt_data);
148
+ // freeze the tree, if not already
149
+ if(kwt_data->is_frozen == 0) {
150
+ if(ac_prep( kwt_data->tree ) == 0)
151
+ rb_raise(rb_eRuntimeError, "Cannot freeze the tree");
152
+ kwt_data->is_frozen = 1;
153
+ }
154
+ // prepare the return value
155
+ v_results= rb_ary_new();
156
+ // fail quickly and return the empty array
157
+ if(kwt_data->dictionary_size == 0)
158
+ return v_results;
159
+ // prepare the search
160
+ ac_search_init(kwt_data->tree, RSTRING( v_search )->ptr, RSTRING( v_search )->len);
161
+ // loop through the results
162
+ while((remain= ac_search(kwt_data->tree, &lgt, &id, &ends_at)) != NULL) {
163
+ // this is an individual result as a hash
164
+ v_result= rb_hash_new();
165
+ rb_hash_aset( v_result, sym_id, INT2FIX(id) );
166
+ rb_hash_aset( v_result, sym_starts_at, INT2FIX( ends_at - lgt - 1 ) );
167
+ rb_hash_aset( v_result, sym_ends_at, INT2FIX( ends_at - 1 ) );
168
+ rb_hash_aset( v_result, sym_value, rb_str_new(remain, lgt) );
169
+ rb_ary_push( v_results, v_result );
170
+ }
171
+ // reopen the tree
172
+ kwt_data->is_frozen= 0;
173
+ return v_results;
174
+ }
175
+
176
+ /*
177
+ * Document-method: size
178
+ * call-seq: size
179
+ *
180
+ * Returns the size of this KeywordTree
181
+ *
182
+ * kwt.add_string("foo")
183
+ * kwt.add_string("bar")
184
+ * kwt.size #=> 2
185
+ *
186
+ */
187
+ static VALUE
188
+ rb_kwt_size(VALUE self)
189
+ {
190
+ struct kwt_struct_data *kwt_data;
191
+ KeywordTree(self, kwt_data);
192
+
193
+ return INT2FIX(kwt_data->dictionary_size);
194
+ }
195
+
+ /*
+  * Document-method: add_string
+  * call-seq: add_string
+  *
+  * Adds a sequence to this KeywordTree.
+  *
+  *    kwt.add_string("foo1$21^ 98N3 ba>Z")
+  *    kwt << "bar" # using the alias
+  *
+  * ==== Note: you can also specify the id, a number between 1 and k
+  *
+  *    kwt.add_string "bar", 123 # => 123
+  *
+  * This id should be unique in the context of the current tree.
+  *
+  * Returns the id of the inserted object.
+  *
+  *    kwt.add_string("test", 18) # => 18
+  *    kwt.add_string("baz") # => 19
+  *
+  */
+ static VALUE
+ rb_kwt_add_string(int argc, VALUE *argv, VALUE self)
+ {
+   VALUE v_string, v_id;
+   struct kwt_struct_data *kwt_data;
+   char * string;
+   int id;
+
+   rb_scan_args(argc, argv, "11", &v_string, &v_id);
+
+   Check_Type(v_string, T_STRING);
+   string= RSTRING(v_string)->ptr;
+   KeywordTree(self, kwt_data);
+
+   if(kwt_data->is_frozen == 1)
+     rb_raise(rb_eRuntimeError, "Cannot add `%s\" into a frozen tree.", string);
+
+   if(v_id == Qnil) {
+     // no id given: use the next auto-generated one
+     id = kwt_data->last_id;
+   } else if(TYPE(v_id) != T_FIXNUM) {
+     // v_id may be any object here, so format it with rb_inspect instead of reading it as a String
+     rb_raise(rb_eRuntimeError, "Please use a number from 1 to K as id, or leave nil to auto-generate one. `%s\" given.", RSTRING(rb_inspect(v_id))->ptr);
+   } else if(NUM2INT(v_id) <= 0) {
+     rb_raise(rb_eRuntimeError, "Please use a number from 1 to K as id, or leave nil to auto-generate one. `%d\" given.", NUM2INT(v_id));
+   } else {
+     id= NUM2INT(v_id);
+   }
+
+   if(ac_add_string(kwt_data->tree, string, strlen(string), id) == 0)
+     rb_raise(rb_eRuntimeError, "Failed to add `%s\", duplicate id `%d\"?", string, id);
+
+   kwt_data->last_id= id + 1;
+   kwt_data->dictionary_size++;
+   return INT2FIX(id);
+ }
+
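The id bookkeeping in `rb_kwt_add_string` can be modelled in a few lines of plain Ruby. This is only a sketch: `IdModel` is a hypothetical class, it stores strings in a Hash rather than a keyword tree, and the initial value of the auto-generated id (1) is an assumption, since the real one is set in the initializer, which is not shown here.

```ruby
# Hypothetical pure-Ruby model of the id rules in rb_kwt_add_string.
# It keeps strings in a Hash instead of a keyword tree.
class IdModel
  def initialize
    @last_id = 1   # assumption: the extension sets its own initial value
    @strings = {}
  end

  # Adds a string under an explicit id, or under the next auto-generated one.
  # Returns the id that was used.
  def add_string(string, id = nil)
    id = @last_id if id.nil?
    unless id.is_a?(Integer) && id > 0
      raise RuntimeError, "Please use a number from 1 to K as id, #{id.inspect} given."
    end
    if @strings.key?(id)
      raise RuntimeError, "Failed to add #{string.inspect}, duplicate id #{id}?"
    end
    @strings[id] = string
    @last_id = id + 1   # the next auto-generated id follows the last one used
    id
  end
  alias_method :<<, :add_string

  def size
    @strings.size
  end
end

kwt = IdModel.new
p kwt.add_string("test", 18)
p kwt.add_string("baz")
p kwt.size
```

Because the next id is always the last used id plus one, adding a string with a low explicit id after a high one can make a later auto-generated id collide with an existing entry; the model raises in that case, as the C code does when `ac_add_string` fails.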
+ /*
+  * call-seq: from_file
+  *
+  * Creates a new KeywordTree and loads the dictionary from a file
+  *
+  *    % cat dict0.txt
+  *    foo
+  *    bar
+  *    base
+  *
+  *    k= AhoCorasick::KeywordTree.from_file "dict0.txt"
+  *    k.search("basement").size # => 1
+  *
+  */
+ static VALUE
+ rb_kwt_new_from_file(int argc, VALUE *argv, VALUE klass)
+ {
+
+   // TODO:
+   //  * use rb_kwt_add_string
+   //  * use rb_io* to handle the file
+
+   struct kwt_struct_data *kwt_data;
+   char word[1024];
+   int id;
+   int len;
+   VALUE self;
+   VALUE f_string;
+   FILE *dictionary;
+
+   rb_scan_args(argc, argv, "10", &f_string);
+
+   id = 0;
+   SafeStringValue( f_string );
+   self= rb_class_new_instance( 0, NULL, klass );
+   KeywordTree( self, kwt_data );
+
+   dictionary = fopen( RSTRING( f_string )->ptr, "r" );
+   if(dictionary == NULL)
+     rb_raise(rb_eRuntimeError, "Cannot open `%s\". No such file?", RSTRING(f_string)->ptr);
+
+   while(fgets(word, 1024, dictionary) != NULL) {
+     // drop the trailing newline only when present (the last line of the file may lack one)
+     len= (int) strlen(word);
+     if(len > 0 && word[len-1] == '\n')
+       len--;
+     ac_add_string(kwt_data->tree, word, len, id++);
+     kwt_data->dictionary_size++;
+   }
+
+   kwt_data->last_id= id+1;
+   fclose(dictionary);
+   return self;
+ }
+
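The dictionary file format used by `from_file` is one entry per line. A rough Ruby sketch of that loading step follows, under stated assumptions: `load_dictionary` is a hypothetical helper that returns a plain Hash instead of a KeywordTree, it numbers entries from 1 as the `add_string` documentation describes (the C loop above counts from 0), and it skips empty lines, which the C loop does not.

```ruby
require "tempfile"

# Hypothetical helper: reads one dictionary entry per line and returns
# an { id => word } Hash. Assumptions: ids start at 1, empty lines are skipped.
def load_dictionary(path)
  words = {}
  next_id = 1
  File.foreach(path) do |line|
    word = line.chomp        # strips the newline only when there is one
    next if word.empty?
    words[next_id] = word
    next_id += 1
  end
  words
end

# Usage: write a small dictionary (no newline after the last entry) and load it.
Tempfile.create("dict0") do |file|
  file.write("foo\nbar\nbase")
  file.flush
  p load_dictionary(file.path)
end
```

Using `chomp` rather than dropping the last character unconditionally keeps the final entry intact when the file does not end with a newline.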
+ static void
+ rb_kwt_struct_free(struct kwt_struct_data * kwt_data)
+ {
+   ac_free(kwt_data->tree);
+ }
+
+ static VALUE
+ rb_kwt_struct_alloc(VALUE klass)
+ {
+   return Data_Wrap_Struct(klass, 0, rb_kwt_struct_free, 0);
+ }
+
+ /*
+  * Defines the AhoCorasick module, the KeywordTree class and its methods.
+  */
+ void Init_ahocorasick() {
+   rb_mAhoCorasick = rb_define_module("AhoCorasick");
+   rb_cKeywordTree = rb_define_class_under(rb_mAhoCorasick, "KeywordTree", rb_cObject);
+
+   rb_define_alloc_func(rb_cKeywordTree, rb_kwt_struct_alloc);
+
+   rb_define_method(rb_cKeywordTree, "initialize", rb_kwt_init, 0);
+   rb_define_method(rb_cKeywordTree, "size", rb_kwt_size, 0);
+   rb_define_method(rb_cKeywordTree, "make", rb_kwt_make, 0);
+   rb_define_method(rb_cKeywordTree, "add_string", rb_kwt_add_string, -1);
+   rb_define_alias(rb_cKeywordTree, "<<", "add_string");
+
+   rb_define_method(rb_cKeywordTree, "find_all", rb_kwt_find_all, -1);
+   rb_define_alias(rb_cKeywordTree, "search", "find_all");
+
+   rb_define_singleton_method(rb_cKeywordTree, "from_file", rb_kwt_new_from_file, -1);
+
+   // symbols used as keys in the result hashes
+   sym_id = ID2SYM(rb_intern("id"));
+   sym_value = ID2SYM(rb_intern("value"));
+   sym_ends_at = ID2SYM( rb_intern("ends_at") );
+   sym_starts_at= ID2SYM( rb_intern("starts_at") );
+ }
+
metadata ADDED
@@ -0,0 +1,68 @@
+ --- !ruby/object:Gem::Specification
+ name: ruby-ahocorasick
+ version: !ruby/object:Gem::Version
+   version: 0.4.5
+ platform: ruby
+ authors:
+ - Aurelian Oancea
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2008-11-19 00:00:00 +01:00
+ default_executable:
+ dependencies: []
+
+ description: Expose Aho-Corasick implementation from Strmat to Ruby.
+ email: oancea at gmail dot com
+ executables: []
+
+ extensions:
+ - ext/extconf.rb
+ extra_rdoc_files: []
+
+ files:
+ - ext/extconf.rb
+ - ext/ruby-ahocorasick.c
+ - ext/ac.h
+ - ext/ac.c
+ - examples/dict.rb
+ - examples/test.rb
+ - examples/elev.rb
+ - examples/sample.c
+ - MIT-LICENSE
+ - README.textile
+ has_rdoc: true
+ homepage: http://www.locknet.ro
+ post_install_message:
+ rdoc_options:
+ - --title
+ - Ruby-AhoCorasick
+ - --inline-source
+ - ext/ruby-ahocorasick.c
+ - README.textile
+ - --main
+ - README.textile
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ requirements: []
+
+ rubyforge_project: ruby-ahocorasick
+ rubygems_version: 1.3.1
+ signing_key:
+ specification_version: 2
+ summary: Aho-Corasick algorithm implementation for Ruby using Strmat lib.
+ test_files: []
+
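The YAML above is the serialized form of a `Gem::Specification`. As a sketch, the same core fields could be declared in Ruby as shown below. `build_spec` is a hypothetical helper name, the values are copied from the metadata, and fields that newer RubyGems versions no longer accept (such as `has_rdoc` and `rubyforge_project`) are left out, so this is not the author's original gemspec.

```ruby
require "rubygems"

# Hypothetical helper that rebuilds the core of the metadata above
# as a Gem::Specification. Deprecated fields are deliberately omitted.
def build_spec
  Gem::Specification.new do |s|
    s.name          = "ruby-ahocorasick"
    s.version       = "0.4.5"
    s.authors       = ["Aurelian Oancea"]
    s.email         = "oancea at gmail dot com"
    s.homepage      = "http://www.locknet.ro"
    s.summary       = "Aho-Corasick algorithm implementation for Ruby using Strmat lib."
    s.description   = "Expose Aho-Corasick implementation from Strmat to Ruby."
    s.extensions    = ["ext/extconf.rb"]   # builds the C extension at install time
    s.require_paths = ["lib"]
    s.files         = %w[
      ext/extconf.rb ext/ruby-ahocorasick.c ext/ac.h ext/ac.c
      examples/dict.rb examples/test.rb examples/elev.rb examples/sample.c
      MIT-LICENSE README.textile
    ]
  end
end

spec = build_spec
puts "#{spec.name} #{spec.version}"
```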