ruby-ahocorasick 0.4.5

@@ -0,0 +1,21 @@
1
+ Copyright (c) 2008 Aurelian Oancea
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21
+
@@ -0,0 +1,101 @@
1
+ h1. This is a work in progress.
2
+
3
+
4
+ h2. Introduction
5
+
6
+ This library is a "Ruby":http://ruby-lang.org extension, a wrapper around the "Aho-Corasick":http://en.wikipedia.org/wiki/Aho-Corasick_algorithm implementation in C, found in "Strmat":http://www.cs.ucdavis.edu/~gusfield/strmat.html package.
7
+
8
+ The source code (ac.c and ac.h) was "adapted" from Strmat. In fact, I changed only 3-4 lines of the original implementation to fit my needs: the search needed to return the current position in the searched string.
9
+
10
+
11
+ h2. Okay, so what's the idea?
12
+
13
+ Given a dictionary of known sentences (note: not *words*!), this kick ass algorithm finds every occurrence of them in an incoming stream of data in a single pass. Kinda Fast.
14
+
15
+ The algorithm has 2 stages: in the first, an internal tree is built from the given dictionary; in the second, the search runs against that tree.
16
+
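To make the two stages concrete, here is a minimal pure-Ruby sketch. It is for illustration only and is not the gem's C code: the @TinyKeywordTree@ class and its @add@, @build@ and @search@ methods are hypothetical names, and the result hashes only mimic the @:value@, @:starts_at@ and @:ends_at@ keys used in the usage example of this README.

```ruby
# Illustrative pure-Ruby Aho-Corasick; TinyKeywordTree is a hypothetical name,
# not part of the ruby-ahocorasick API.
class TinyKeywordTree
  class Node
    attr_accessor :children, :fail, :word, :outputs
    def initialize
      @children = {}   # char => Node
      @fail     = nil  # failure link, set by build
      @word     = nil  # keyword ending exactly here, if any
      @outputs  = []   # every keyword ending here, via failure links too
    end
  end

  def initialize
    @root  = Node.new
    @built = false
  end

  # Stage 1a: insert a keyword into the tree, one character per edge.
  def add(keyword)
    node = @root
    keyword.each_char { |ch| node = (node.children[ch] ||= Node.new) }
    node.word = keyword
    @built = false
    self
  end

  # Stage 1b: breadth-first pass that sets failure links and output lists.
  def build
    queue = []
    @root.children.each_value do |child|
      child.fail    = @root
      child.outputs = [child.word].compact
      queue << child
    end
    until queue.empty?
      node = queue.shift
      node.children.each do |ch, child|
        f = node.fail
        f = f.fail until f.equal?(@root) || f.children.key?(ch)
        child.fail    = f.children[ch] || @root
        child.outputs = [child.word].compact + child.fail.outputs
        queue << child
      end
    end
    @built = true
  end

  # Stage 2: scan the text once, following failure links on mismatches.
  def search(text)
    build unless @built
    results = []
    node = @root
    text.each_char.with_index do |ch, i|
      node = node.fail until node.equal?(@root) || node.children.key?(ch)
      node = node.children[ch] || @root
      node.outputs.each do |word|
        results << { :value => word, :starts_at => i + 1 - word.length, :ends_at => i + 1 }
      end
    end
    results
  end
end

tree = TinyKeywordTree.new
%w[data base database].each { |w| tree.add(w) }
tree.search("a database").each { |r| puts r.inspect }
```

The sketch keeps children in a Hash for brevity, whereas ac.c keeps them in a sorted sibling list. Overlapping keywords are all reported here; what the extension itself reports for such inputs should be checked against the extension, not against this sketch.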
17
+
18
+ h2. Okay, so where can I use this?
19
+
20
+ Well, you can do some crazy things with it: you can look up DNA patterns, analyze network sequences (read: strange and maybe proprietary network protocols), or do domestic stuff like building contextual links in your blog posts to enrich your users' experience.
21
+
22
+
23
+ h2. Okay, so how can I install it?
24
+
25
+
26
+ h3. Rubygems - Development Version
27
+
28
+ <pre>
29
+ gem install aurelian-ruby-ahocorasick --source=http://gems.github.com
30
+ </pre>
31
+
32
+
33
+ h3. Build it from source
34
+
35
+ <pre>
36
+ $ git clone git://github.com/aurelian/ruby-ahocorasick.git
37
+ $ cd ruby-ahocorasick
38
+ </pre>
39
+
40
+ To build and install the gem on your machine (run with sudo if needed):
41
+
42
+ <pre>
43
+ $ rake install
44
+ </pre>
45
+
46
+ @rake -T@ will list other cool tasks.
47
+
48
+
49
+ h3. Rubygems - Stable Version
50
+
51
+ There's no stable version right now.
52
+
53
+
54
+ h4. Notes
55
+
56
+ It's known to work / compile / install on Ubuntu 8.04 and Mac OS 10.4.*. It should work out of the box if you have gcc around.
57
+ Unfortunately I don't have a Windows PC around, nor the required knowledge about Microsoft compilers.
58
+
59
+
60
+ h2. Okay, so how do I use it?
61
+
62
+ <pre>
63
+ require 'ahocorasick'
64
+
65
+ keyword_tree= AhoCorasick::KeywordTree.new # creates a new tree
66
+ keyword_tree.add_string( "foo-- Z@!bar" ) # adds a keyword to the tree
67
+ keyword_tree.add_string( "cervantes" ) # even more
68
+
69
+ results= keyword_tree.find_all( "1011000129 foo-- Z@!bar761 ! 001211 6xU" ).each do | result |
70
+ result[:value] # => "foo-- Z@!bar"
71
+ result[:starts_at] # => 11
72
+ result[:ends_at] # => 23
73
+ result[:id] # => 1
74
+ end
75
+
76
+ </pre>
77
+
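A small sketch of how the offsets in a result relate to the searched string. It does not call the extension: the @result@ hash below is copied by hand from the example above, and the snippet assumes single-byte characters, since the C code counts bytes.

```ruby
# The searched text and one result hash, copied from the usage example above.
text   = "1011000129 foo-- Z@!bar761 ! 001211 6xU"
result = { :id => 1, :value => "foo-- Z@!bar", :starts_at => 11, :ends_at => 23 }

# :starts_at is the index of the first matched character and
# :ends_at is the index just past the last one, so an exclusive range fits.
matched = text[result[:starts_at]...result[:ends_at]]

puts matched                                 # the slice cut out of the text
puts matched == result[:value]               # compares the slice with :value
puts result[:ends_at] - result[:starts_at]   # length of the match
```

Treating @:ends_at@ as exclusive means the match length is simply @:ends_at - :starts_at@, which is handy when replacing matches in place.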
78
+ You can get some API reference on the "wiki":http://github.com/aurelian/ruby-ahocorasick/wikis.
79
+
80
+
81
+ h2. Bugs? Suggestions? Ideas? Patches?
82
+
83
+ For now, just use the email address.
84
+
85
+
86
+ h2. Additional Reading
87
+
88
+ Other suffix-tree implementations:
89
+
90
+ * "Strmat":http://www.cs.ucdavis.edu/~gusfield/strmat.html
91
+ * "Pytst":http://nicolas.lehuen.com/download/pytst / "Ruby-Pytst":http://rubyforge.org/projects/ruby-pytst/
92
+ * "Aho-Corasick extension":http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
93
+ * "Keyword Prospector":http://latimes.rubyforge.org/keyword_prospector/rdoc/
94
+ * "libstree":http://www.cl.cam.ac.uk/~cpk25/libstree/
95
+
96
+
97
+ h2. License
98
+
99
+ (c) 2008 - Aurelian Oancea, < oancea at gmail dot com >
100
+
101
+ released under MIT-LICENCE
@@ -0,0 +1,25 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'time'
4
+
5
+ require File.dirname(__FILE__) + '/../ext/ahocorasick'
6
+
7
+ t= Time.now
8
+
9
+ k= AhoCorasick::KeywordTree.from_file(File.dirname(__FILE__) + "/../spec/data/en.words")
10
+
11
+ t1= Time.now
12
+
13
+ puts "%d words added in %s seconds" % [k.size, (t1-t)]
14
+
15
+ query = File.read( File.dirname(__FILE__) + "/../spec/data/news.txt" )
16
+
17
+ results= k.search query
18
+
19
+ puts "took %s seconds to find %d results in a stream with %d characters" % [(Time.now-t1), results.size, query.size]
20
+
21
+ exit
22
+ results.each do | r |
23
+ puts query[r[:starts_at]].chr + ".." + query[r[:ends_at]-1].chr + " => " + r[:value]
24
+ end
25
+
@@ -0,0 +1,19 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ $KCODE='UTF-8'
4
+
5
+ require File.dirname(__FILE__) + '/../ext/ahocorasick'
6
+
7
+ k= AhoCorasick::KeywordTree.new
8
+
9
+ k << "I've"
10
+ k << "data"
11
+ k << "base"
12
+ k << "database"
13
+
14
+ query= "I've moved my data to a database"
15
+
16
+ k.search(query).each do | r |
17
+ puts "-> [ " + r[:id].to_s + " ] " + r[:value] + " / " + query[r[:starts_at]].chr + ".." + query[r[:ends_at]-1].chr
18
+ end
19
+
@@ -0,0 +1,94 @@
1
+ //
2
+ // Getting started with Aho-Corasick from Strmat
3
+ //
4
+ // lasick Makefile:
5
+ //
6
+ // --
7
+ // ac.o :
8
+ // gcc -c -fPIC -shared ac.c
9
+ // libasick : ac.o
10
+ // gcc -shared -Wl,-soname,libasick.so -o libasick.so.1.0.1 ac.o
11
+ // ar rcs libasick.a ac.o
12
+ // clean :
13
+ // rm -rf *.o *.a *.so* *.dylib*
14
+ // --
15
+ //
16
+ // Compile this stuff - asick is the library name, generated with the above Makefile :)
17
+ //
18
+ // gcc sample.c -o ac-sample -I../ext/ -L../ext/ -lasick
19
+ //
20
+
21
+ #include <string.h>
22
+ #include <stdio.h>
23
+ #include <stdlib.h>
24
+ #include "ac.h"
25
+
26
+ int main(int argc, char *argv[]) {
27
+ char * search;
28
+ char * remain;
29
+ char * result;
30
+ char word[1024];
31
+
32
+ FILE *dictionary;
33
+ FILE *input;
34
+ int lgt, id, n, i;
35
+
36
+ AC_STRUCT * tree;
37
+
38
+ input= fopen(argv[1], "r");
39
+
40
+ if(input == NULL) {
41
+ search= argv[1];
42
+ } else {
43
+ long lSize;
44
+ fseek (input , 0 , SEEK_END);
45
+ lSize= ftell (input);
46
+ rewind(input);
47
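+ search = (char*) malloc (sizeof(char)*(lSize > 0 ? lSize : 1)); // room for lSize-1 bytes plus the terminating NUL
48
+ if (search == NULL) { fputs ("Error: Memory error",stderr); exit(-2); }
49
+ search[fread(search, 1, lSize > 0 ? lSize-1 : 0, input)] = '\0'; // skip the last byte (trailing newline), then terminate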
50
+ }
51
+
52
+ dictionary= fopen("../spec/data/dictionary.txt", "r");
53
+
54
+ if(dictionary == NULL) {
55
+ printf("Error: can't open file.\n");
56
+ return -1;
57
+ }
58
+
59
+ tree= ac_alloc();
60
+
61
+ // start counting from 1
62
+ n= 1;
63
+
64
+ printf("==> building dictionary ...");
65
+
66
+ while(fgets(word, 1024, dictionary) != NULL) {
67
+ // strip \n
68
+ ac_add_string(tree, word, strlen(word)-1, n++);
69
+ }
70
+
71
+ printf("%d entries added.\n", n-1); // n started at 1, so n-1 strings were added
72
+
73
+ ac_prep(tree);
74
+
75
+ printf("==> input text [%d]:\n--\n%s\n--\n", (int) strlen(search), search);
76
+
77
+ ac_search_init(tree, search, strlen(search) );
78
+
79
+ while((remain= ac_search(tree, &lgt, &id, NULL)) != NULL) { // ends_at is optional, pass NULL to skip it
80
+ printf("`%d'", remain[lgt+1]);
81
+ result = (char*) malloc (sizeof(char)*(lgt+1)); // +1 for the NUL written by sprintf
82
+ sprintf( result, "%.*s", lgt, remain);
83
+ // result: should read first lgt chars from remain.
84
+ printf("==> result: length=> %d, id=> %d [%s]\n", lgt, id, result);
85
+ free(result);
86
+ }
87
+
88
+ ac_free(tree);
89
+ fclose(dictionary);
90
+ if(input != NULL) { free(search); fclose(input); } // search points at argv[1] when no file was opened
91
+
92
+ return 0;
93
+ }
94
+
@@ -0,0 +1,46 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require File.dirname(__FILE__) + '/../ext/ahocorasick'
4
+
5
+ k= AhoCorasick::KeywordTree.new
6
+
7
+ puts k.size
8
+ k.add_string("foo");
9
+
10
+ puts k.size
11
+ k.add_string("bar", 1991);
12
+
13
+ puts k.size
14
+ k.add_string("fomz");
15
+
16
+ begin
17
+ k.add_string("foo", -1);
18
+ rescue RuntimeError => err
19
+ puts "[ok]==> got " + err.class.name + ": " + err.message
20
+ end
21
+
22
+ begin
23
+ k.add_string("foo", "bar");
24
+ rescue RuntimeError => err
25
+ puts "[ok]==> got " + err.class.name + ": " + err.message
26
+ end
27
+
28
+ k.add_string("timisoara", 22);
29
+
30
+ puts k.size
31
+ begin
32
+ k.add_string("bucuresti", 22);
33
+ rescue RuntimeError => err
34
+ puts "[ok]==> got " + err.class.name + ": " + err.message
35
+ end
36
+
37
+ k << "bacau"
38
+
39
+ k.search('am fost la bacau').each do | result |
40
+ puts result.inspect
41
+ end
42
+
43
+ k.search( 'din foo in foo' ).each do | q |
44
+ puts q.inspect
45
+ end
46
+
@@ -0,0 +1,623 @@
1
+ /*
2
+ * ac.c
3
+ *
4
+ * Implementation of the Aho-Corasick algorithm.
5
+ *
6
+ * NOTES:
7
+ * 8/94 - Original Implementation (Sean Davis)
8
+ * 9/94 - Redid Implementation (James Knight)
9
+ * 3/96 - Modularized the code (James Knight)
10
+ * 7/96 - Finished the modularization (James Knight)
11
+ */
12
+
13
+ #include <stdio.h>
14
+ #include <stdlib.h>
15
+ #include <string.h>
16
+ #include "ac.h"
17
+
18
+ /*
19
+ * ac_alloc
20
+ *
21
+ * Creates a new AC_STRUCT structure and initializes its fields.
22
+ *
23
+ * Parameters: none.
24
+ *
25
+ * Returns: A dynamically allocated AC_STRUCT structure.
26
+ */
27
+ AC_STRUCT *ac_alloc(void)
28
+ {
29
+ AC_STRUCT *node;
30
+
31
+ if ((node = malloc(sizeof(AC_STRUCT))) == NULL)
32
+ return NULL;
33
+ memset(node, 0, sizeof(AC_STRUCT));
34
+
35
+ if ((node->tree = malloc(sizeof(ACTREE_NODE))) == NULL) {
36
+ free(node);
37
+ return NULL;
38
+ }
39
+ memset(node->tree, 0, sizeof(ACTREE_NODE));
40
+
41
+ return node;
42
+ }
43
+
44
+
45
+ /*
46
+ * ac_add_string
47
+ *
48
+ * Adds a string to the AC_STRUCT structure's keyword tree.
49
+ *
50
+ * NOTE: The `id' value given must be unique to any of the strings
51
+ * added to the tree, and must be a small integer greater than
52
+ * 0 (since it is used to index an array holding information
53
+ * about each of the strings).
54
+ *
55
+ * The best id's to use are to number the strings from 1 to K.
56
+ *
57
+ * Parameters: node - an AC_STRUCT structure
58
+ * P - the sequence
59
+ * M - the sequence length
60
+ * id - the sequence identifier
61
+ *
62
+ * Returns: non-zero on success, zero on error.
63
+ */
64
+ int ac_add_string(AC_STRUCT *node, char *P, int M, int id)
65
+ {
66
+ int i, j, newsize;
67
+ AC_TREE tnode, child, back, newnode, list, tail;
68
+
69
+ /*
70
+ * Return a zero if a previous error had occurred, or if the
71
+ * given id equals zero. An id value of zero is used by the
72
+ * algorithm to signal that no pattern ends at a node in the
73
+ * keyword tree. So, it can't be used as a pattern's id.
74
+ */
75
+ if (node->errorflag || id == 0)
76
+ return 0;
77
+
78
+ P--; /* Shift to make sequence be P[1],...,P[M] */
79
+
80
+ /*
81
+ * Allocate space for the new string's information.
82
+ */
83
+ if (node->Psize <= id) {
84
+ if (node->Psize == 0) {
85
+ newsize = (id >= 16 ? id + 1 : 16);
86
+ node->Plengths = malloc(newsize * sizeof(int));
87
+ }
88
+ else {
89
+ newsize = node->Psize + id + 1;
90
+ node->Plengths = realloc(node->Plengths, newsize * sizeof(int));
91
+ }
92
+ if (node->Plengths == NULL) {
93
+ node->errorflag = 1;
94
+ return 0;
95
+ }
96
+
97
+ for (i=node->Psize; i < newsize; i++)
98
+ node->Plengths[i] = 0;
99
+ node->Psize = newsize;
100
+ }
101
+
102
+ // duplicate id
103
+ if (node->Plengths[id] != 0)
104
+ return 0;
105
+
106
+ /*
107
+ * Add the string to the keyword tree.
108
+ */
109
+ tnode = node->tree;
110
+ for (i=1; i <= M; i++) {
111
+ /*
112
+ * Find the child whose character is P[i].
113
+ */
114
+ back = NULL;
115
+ child = tnode->children;
116
+ while (child != NULL && child->ch < P[i]) {
117
+ back = child;
118
+ child = child->sibling;
119
+ }
120
+
121
+ if (child == NULL || child->ch != P[i])
122
+ break;
123
+
124
+ tnode = child;
125
+
126
+ #ifdef STATS
127
+ node->prep_old_edges++;
128
+ #endif
129
+
130
+ }
131
+
132
+ /*
133
+ * If only part of the pattern exists in the tree, add the
134
+ * rest of the pattern to the tree.
135
+ */
136
+ if (i <= M) {
137
+ list = tail = NULL;
138
+ for (j=i; j <= M; j++) {
139
+ if ((newnode = malloc(sizeof(ACTREE_NODE))) == NULL)
140
+ break;
141
+ memset(newnode, 0, sizeof(ACTREE_NODE));
142
+ newnode->ch = P[j];
143
+
144
+ if (list == NULL)
145
+ list = tail = newnode;
146
+ else
147
+ tail = tail->children = newnode;
148
+
149
+ #ifdef STATS
150
+ node->prep_new_edges++;
151
+ #endif
152
+
153
+ }
154
+ if (j <= M) {
155
+ while (list != NULL) {
156
+ tail = list->children;
157
+ free(list);
158
+ list = tail;
159
+ }
160
+ return 0;
161
+ }
162
+
163
+ list->sibling = child;
164
+ if (back == NULL)
165
+ tnode->children = list;
166
+ else
167
+ back->sibling = list;
168
+
169
+ tnode = tail;
170
+ }
171
+
172
+ tnode->matchid = id;
173
+ node->Plengths[id] = M;
174
+ node->ispreprocessed = 0;
175
+
176
+ return 1;
177
+ }
178
+
179
+
180
+ /*
181
+ * ac_del_string
182
+ *
183
+ * Deletes a string from the keyword tree.
184
+ *
185
+ * Parameters: node - an AC_STRUCT structure
186
+ * P - the sequence to be deleted
187
+ * M - its length
188
+ * id - its identifier
189
+ *
190
+ * Returns: non-zero on success, zero on error.
191
+ */
192
+ int ac_del_string(AC_STRUCT *node, char *P, int M, int id)
193
+ {
194
+ int i, flag;
195
+ AC_TREE tnode, tlast, tback, child, back;
196
+
197
+ if (node->errorflag || id <= 0 || id >= node->Psize || node->Plengths[id] == 0)
198
+ return 0;
199
+
200
+ P--; /* Shift to make sequence be P[1],...,P[M] */
201
+
202
+ /*
203
+ * Scan the tree for the path corresponding to the keyword to be deleted.
204
+ */
205
+ flag = 1;
206
+ tlast = tnode = node->tree;
207
+ tback = NULL;
208
+
209
+ for (i=1; i <= M; i++) {
210
+ /*
211
+ * Find the child matching P[i]. It must be there.
212
+ */
213
+ child = tnode->children;
214
+ back = NULL;
215
+ while (child != NULL && child->ch != P[i]) {
216
+ back = child;
217
+ child = child->sibling;
218
+ }
219
+
220
+ if (child == NULL) {
221
+ fprintf(stderr, "Error in Aho-Corasick preprocessing. String to be "
222
+ "deleted is not in tree.\n");
223
+ return 0;
224
+ }
225
+
226
+ /*
227
+ * Try to find the point where the pattern to be deleted branches off
228
+ * from the paths of the other patterns in the tree. This point must
229
+ * be at the latest node which satisfies one of these two conditions:
230
+ *
231
+ * 1) Another pattern ends at that node (and so
232
+ * `child->matchid != 0'). In this case, the branch point is
233
+ * just below this node and so the children of this node
234
+ * should be removed.
235
+ * 2) A node has other siblings. In this case, the node itself
236
+ * is the branch point, and it and its children should be
237
+ * removed.
238
+ */
239
+ if (i < M && child->matchid != 0) {
240
+ flag = 1;
241
+ tlast = child;
242
+ }
243
+ else if (back != NULL || child->sibling != NULL) {
244
+ flag = 2;
245
+ tlast = child;
246
+ tback = (back == NULL ? tnode : back);
247
+ }
248
+
249
+ tnode = child;
250
+ }
251
+
252
+ /*
253
+ * If the node corresponding to the end of the keyword has children,
254
+ * then the tree should not be altered, except to remove the keyword's
255
+ * identifier from the tree.
256
+ *
257
+ * Otherwise, apply the appropriate removal, as described above.
258
+ */
259
+ if (tnode->children != NULL) {
260
+ tnode->matchid = 0;
261
+ }
262
+ else {
263
+ if (flag == 1) {
264
+ child = tlast->children;
265
+ tlast->children = NULL;
266
+ tlast = child;
267
+ }
268
+ else {
269
+ if (tback->children == tlast)
270
+ tback->children = tlast->sibling;
271
+ else
272
+ tback->sibling = tlast->sibling;
273
+ }
274
+
275
+ while (tlast != NULL) {
276
+ child = tlast->children;
277
+ free(tlast);
278
+ tlast = child;
279
+ }
280
+ }
281
+
282
+ node->Plengths[id] = 0;
283
+ node->ispreprocessed = 0;
284
+
285
+ return 1;
286
+ }
287
+
288
+
289
+ /*
290
+ * ac_prep
291
+ *
292
+ * Compute the failure and output links for the keyword tree.
293
+ *
294
+ * Parameters: node - an AC_STRUCT structure
295
+ *
296
+ * Returns: non-zero on success, zero on error.
297
+ */
298
+ int ac_prep(AC_STRUCT *node)
299
+ {
300
+ char x;
301
+ AC_TREE v, vprime, w, wprime, root, front, back, child;
302
+
303
+ if (node->errorflag)
304
+ return 0;
305
+
306
+ /*
307
+ * The failure link and output link computation requires a breadth-first
308
+ * traversal of the keyword tree. And, to do that, we need a queue of
309
+ * the nodes yet to be processed.
310
+ *
311
+ * The `faillink' fields will be used as the pointers for the queue
312
+ * of nodes to be computed (since the failure link is only set after
313
+ * the node is removed from the queue).
314
+ *
315
+ * The `outlink' fields will be used as the pointers to a node's parent
316
+ * for nodes in the queue (since the output link is also only set after
317
+ * the node is removed from the queue).
318
+ */
319
+ root = node->tree;
320
+
321
+ front = back = root;
322
+ front->faillink = NULL;
323
+ front->outlink = NULL;
324
+
325
+ while (front != NULL) {
326
+ v = front;
327
+ x = v->ch;
328
+ vprime = v->outlink;
329
+
330
+ /*
331
+ * Add the node's children to the queue.
332
+ */
333
+ for (child=v->children; child != NULL; child=child->sibling) {
334
+ child->outlink = v;
335
+ back->faillink = child;
336
+ back = child;
337
+ }
338
+ back->faillink = NULL;
339
+
340
+ front = front->faillink;
341
+ v->faillink = v->outlink = NULL;
342
+
343
+ /*
344
+ * Set the failure and output links.
345
+ */
346
+ if (v == root)
347
+ ;
348
+ else if (vprime == root)
349
+ v->faillink = root;
350
+ else {
351
+ /*
352
+ * Find the first link in the failure link chain which has a child
353
+ * labeled with x.
354
+ */
355
+ wprime = NULL;
356
+ w = vprime->faillink;
357
+
358
+ while (1) {
359
+ wprime = w->children;
360
+ while (wprime != NULL && wprime->ch < x)
361
+ wprime = wprime->sibling;
362
+
363
+ if ((wprime != NULL && wprime->ch == x) || w == root)
364
+ break;
365
+
366
+ w = w->faillink;
367
+
368
+ #ifdef STATS
369
+ node->prep_fail_compares++;
370
+ #endif
371
+ }
372
+ #ifdef STATS
373
+ node->prep_fail_compares++;
374
+ #endif
375
+
376
+ if (wprime != NULL && wprime->ch == x)
377
+ v->faillink = wprime;
378
+ else
379
+ v->faillink = root;
380
+
381
+ if (v->matchid != 0) {
382
+ if (v->faillink->matchid != 0)
383
+ v->outlink = v->faillink;
384
+ else
385
+ v->outlink = v->faillink->outlink;
386
+ }
387
+ }
388
+ }
389
+
390
+ node->ispreprocessed = 1;
391
+ node->initflag = 0;
392
+
393
+ return 1;
394
+ }
395
+
396
+
397
+ /*
398
+ * ac_search_init
399
+ *
400
+ * Initializes the variables used during an Aho-Corasick search.
401
+ * See ac_search for an example of how it should be used.
402
+ *
403
+ * Parameters: node - an AC_STRUCT structure
404
+ * T - the sequence to be searched
405
+ * N - the length of the sequence
406
+ *
407
+ * Returns: nothing.
408
+ */
409
+ void ac_search_init(AC_STRUCT *node, char *T, int N)
410
+ {
411
+ if (node->errorflag)
412
+ return;
413
+ else if (!node->ispreprocessed) {
414
+ fprintf(stderr, "Error in Aho-Corasick search. The preprocessing "
415
+ "has not been completed.\n");
416
+ return;
417
+ }
418
+
419
+ node->T = T - 1; /* Shift to make sequence be T[1],...,T[N] */
420
+ node->N = N;
421
+ node->c = 1;
422
+ node->w = node->tree;
423
+ node->output = NULL;
424
+ node->initflag = 1;
425
+ node->endflag = 0;
426
+ }
427
+
428
+
429
+ /*
430
+ * ac_search
431
+ *
432
+ * Scans a text to look for the next occurrence of one of the patterns
433
+ * in the text. An example of how this search should be used is the
434
+ * following:
435
+ *
436
+ * s = T;
437
+ * len = N;
438
+ * contflag = 0;
439
+ * ac_search_init(node, T, N);
440
+ * while ((s = ac_search(node, &matchlen, &matchid, &ends_at)) != NULL) {
441
+ * >>> Pattern `matchid' matched from `s' to `s + matchlen - 1'. <<<
442
+ * }
443
+ *
444
+ * where `node', `T' and `N' are assumed to be initialized appropriately.
445
+ *
446
+ * Parameters: node - a preprocessed AC_STRUCT structure
447
+ * length_out - where to store the new match's length
448
+ * id_out - where to store the identifier of the
449
+ * pattern that matched
450
+ * ends_at - where to store the 1-based text position just past the match's last character
451
+ *
452
+ * Returns: the left end of the text that matches a pattern, or NULL
453
+ * if no match occurs. (It also stores values in `*length_out',
454
+ * `*id_out' and `*ends_at' giving the match's length, pattern identifier and end position.)
455
+ */
456
+ char *ac_search(AC_STRUCT *node, int *length_out, int *id_out, int *ends_at)
457
+ {
458
+ int c, N, id;
459
+ char *T;
460
+ AC_TREE w, wprime, root;
461
+
462
+ if (node->errorflag)
463
+ return NULL;
464
+ else if (!node->ispreprocessed) {
465
+ fprintf(stderr, "Error in Aho-Corasick search. The preprocessing "
466
+ "has not been completed.\n");
467
+ return NULL;
468
+ }
469
+ else if (!node->initflag) {
470
+ fprintf(stderr, "Error in Aho-Corasick search. ac_search_init was not "
471
+ "called.\n");
472
+ return NULL;
473
+ }
474
+ else if (node->endflag)
475
+ return NULL;
476
+
477
+ T = node->T;
478
+ N = node->N;
479
+ c = node->c;
480
+ w = node->w;
481
+ root = node->tree;
482
+
483
+ /*
484
+ * If the last call to ac_search returned a match, check for another
485
+ * match ending at the same right endpoint (denoted by a non-NULL
486
+ * output link).
487
+ */
488
+ if (node->output != NULL) {
489
+ node->output = node->output->outlink;
490
+
491
+ #ifdef STATS
492
+ node->outlinks_traversed++;
493
+ #endif
494
+
495
+ if (node->output != NULL) {
496
+ id = node->output->matchid;
497
+ if (id_out)
498
+ *id_out = id;
499
+ if (length_out)
500
+ *length_out = node->Plengths[id];
501
+ if (ends_at)
502
+ *ends_at= c;
503
+ return &T[c] - node->Plengths[id];
504
+ }
505
+
506
+ }
507
+
508
+ /*
509
+ * Run the search algorithm, stopping at the first position where a
510
+ * match to one of the patterns occurs.
511
+ */
512
+ while (c <= N) {
513
+ /*
514
+ * Try to match the next input character to a child in the tree.
515
+ */
516
+ wprime = w->children;
517
+ while (wprime != NULL && wprime->ch != T[c])
518
+ wprime = wprime->sibling;
519
+
520
+ #ifdef STATS
521
+ node->num_compares++;
522
+ #endif
523
+
524
+ /*
525
+ * If the match fails, then either use the failure link (if not
526
+ * at the root), or move to the next character since no prefix
527
+ * of any pattern ends with character T[c].
528
+ */
529
+ if (wprime == NULL) {
530
+ if (w == root)
531
+ c++;
532
+ else {
533
+ w = w->faillink;
534
+
535
+ #ifdef STATS
536
+ node->num_failures++;
537
+ #endif
538
+
539
+ }
540
+ }
541
+ else {
542
+ /*
543
+ * If we could match the input, move down the tree and to the
544
+ * next input character, and see if that match completes the
545
+ * match to a pattern (when matchid != 0 or outlink != NULL).
546
+ */
547
+ c++;
548
+ w = wprime;
549
+
550
+ #ifdef STATS
551
+ node->edges_traversed++;
552
+ #endif
553
+
554
+ if (w->matchid != 0)
555
+ node->output = w;
556
+ else if (w->outlink != NULL) {
557
+ node->output = w->outlink;
558
+
559
+ #ifdef STATS
560
+ node->outlinks_traversed++;
561
+ #endif
562
+
563
+ }
564
+
565
+ if (node->output != NULL) {
566
+ id = node->output->matchid;
567
+ if (id_out)
568
+ *id_out = id;
569
+ if (length_out)
570
+ *length_out= node->Plengths[id];
571
+ if(ends_at)
572
+ *ends_at= c;
573
+
574
+ node->w = w;
575
+ node->c = c; // ends_at - length_out;
576
+
577
+ return &T[c] - node->Plengths[id];
578
+ }
579
+ }
580
+ }
581
+
582
+ node->c = c;
583
+ node->endflag = 1;
584
+
585
+ return NULL;
586
+ }
587
+
588
+
589
+ /*
590
+ * ac_free
591
+ *
592
+ * Free up the allocated AC_STRUCT structure.
593
+ *
594
+ * Parameters: node - an AC_STRUCT structure
595
+ *
596
+ * Returns: nothing.
597
+ */
598
+ void ac_free(AC_STRUCT *node)
599
+ {
600
+ AC_TREE front, back, next;
601
+
602
+ if (node == NULL)
603
+ return;
604
+
605
+ if (node->tree != NULL) {
606
+ front = back = node->tree;
607
+ while (front != NULL) {
608
+ back->sibling = front->children;
609
+ while (back->sibling != NULL)
610
+ back = back->sibling;
611
+
612
+ next = front->sibling;
613
+ free(front);
614
+ front = next;
615
+ }
616
+ }
617
+
618
+ if (node->Plengths != NULL)
619
+ free(node->Plengths);
620
+
621
+ free(node);
622
+ }
623
+
@@ -0,0 +1,36 @@
1
+
2
+ #ifndef _AC_H_
3
+ #define _AC_H_
4
+
5
+ typedef struct actreenode {
6
+ char ch;
7
+ int matchid;
8
+ struct actreenode *outlink, *faillink;
9
+ struct actreenode *children, *sibling;
10
+ } ACTREE_NODE, *AC_TREE;
11
+
12
+ typedef struct {
13
+ AC_TREE tree;
14
+ int ispreprocessed, errorflag;
15
+
16
+ int Psize;
17
+ int *Plengths;
18
+
19
+ char *T;
20
+ int N, c, initflag, endflag;
21
+ AC_TREE w, output;
22
+ #ifdef STATS
23
+ int prep_new_edges, prep_old_edges, prep_fail_compares;
24
+ int num_compares, num_failures, edges_traversed, outlinks_traversed;
25
+ #endif
26
+ } AC_STRUCT;
27
+
28
+ AC_STRUCT *ac_alloc(void);
29
+ int ac_add_string(AC_STRUCT *node, char *P, int M, int id);
30
+ // int ac_del_string(AC_STRUCT *node, char *P, int M, int id);
31
+ int ac_prep(AC_STRUCT *node);
32
+ void ac_search_init(AC_STRUCT *node, char *T, int N);
33
+ char *ac_search(AC_STRUCT *node, int *length_out, int *id_out, int *ends_at);
34
+ void ac_free(AC_STRUCT *node);
35
+
36
+ #endif
@@ -0,0 +1,6 @@
1
+ require "mkmf"
2
+
3
+ dir_config("ahocorasick")
4
+
5
+ create_makefile("ahocorasick")
6
+
@@ -0,0 +1,339 @@
1
+
2
+ //
3
+ // (c) 2008, Aurelian Oancea < oancea at gmail dot com >
4
+ //
5
+ // Released under MIT-LICENSE
6
+ //
7
+
8
+ //
9
+ // TODO: new methods?
10
+ //
11
+ // * kwt[id] = word
12
+ // * kwt.from_file (class instance method)
13
+ //
14
+ // * kwt.find_each ("str") {|r| .. }
15
+ // * kwt.find_first("str")
16
+ // * kwt.find_all ("str")
17
+ //
18
+ // TODO: change last_id and dictionary_size to long
19
+ //
20
+
21
+ #include <ruby.h>
22
+ #include "ac.h"
23
+
24
+ static VALUE sym_id, sym_value, sym_ends_at, sym_starts_at;
25
+
26
+ VALUE rb_mAhoCorasick;
27
+ VALUE rb_cKeywordTree;
28
+
29
+ #define KeywordTree(obj, kwt_data) {\
30
+ Data_Get_Struct(obj, struct kwt_struct_data, kwt_data);\
31
+ }
32
+
33
+ struct kwt_struct_data {
34
+ AC_STRUCT * tree;
35
+ int last_id;
36
+ int dictionary_size;
37
+ int is_frozen;
38
+ };
39
+
40
+ // int
41
+ // rb_add_string(struct kwt_struct_data *kwt, char *word, int size, int id) {
42
+ // if(ac_add_string( kwt->tree, word, size, id ) == 0)
43
+ // return 0;
44
+ // kwt->dictionary_size++;
45
+ // kwt->last_id= id+1;
46
+ // return 1;
47
+ // }
48
+
49
+ /*
50
+ * call-seq: initialize
51
+ *
52
+ * Creates a new KeywordTree
53
+ *
54
+ * require 'ahocorasick'
55
+ * kwt = AhoCorasick::KeywordTree.new
56
+ *
57
+ */
58
+ static VALUE
59
+ rb_kwt_init(VALUE self)
60
+ {
61
+ AC_STRUCT * tree;
62
+ struct kwt_struct_data *kwt_data;
63
+
64
+ kwt_data = ALLOC(struct kwt_struct_data);
65
+ tree = ac_alloc();
66
+ DATA_PTR(self) = kwt_data;
67
+ kwt_data->tree = tree;
68
+ kwt_data->last_id = 1;
69
+ kwt_data->dictionary_size = 0;
70
+ kwt_data->is_frozen = 0;
71
+ return self;
72
+ }
73
+
74
+ /*
75
+ * Document-method: make
76
+ * call-seq: make
77
+ *
78
+ * It freezes the current KeywordTree.
79
+ *
80
+ * ==== Note: This method is called internally by search
81
+ *
82
+ * require 'ahocorasick'
83
+ *
84
+ * kwt = AhoCorasick::KeywordTree.new
85
+ *
86
+ * kwt.add_string("one")
87
+ * kwt.add_string("two")
88
+ * kwt.make()
89
+ */
90
+ static VALUE
91
+ rb_kwt_make(VALUE self)
92
+ {
93
+ struct kwt_struct_data *kwt_data;
94
+ KeywordTree(self, kwt_data);
95
+
96
+ if(kwt_data->is_frozen == 1)
97
+ return Qtrue;
98
+
99
+ if(ac_prep( kwt_data->tree ) == 1) {
100
+ kwt_data->is_frozen = 1;
101
+ return Qtrue;
102
+ }
103
+
104
+ rb_raise(rb_eRuntimeError, "Cannot freeze the tree");
105
+ }
106
+
107
+ /*
108
+ * Document-method: find_all
109
+ * call-seq: find_all
110
+ *
111
+ * Search the current tree.
112
+ *
113
+ * It returns an array of hashes, e.g.
114
+ *
115
+ * [ { :id => int, :value => string, :starts_at => int, :ends_at => int}, { ... } ]
116
+ *
117
+ * Returns an empty array when the search didn't return any result.
118
+ *
119
+ * # assuming a valid KeywordTree kwt object:
120
+ * kwt.add_string("one")
121
+ * kwt.add_string("two")
122
+ *
123
+ * kwt.search( "moved two times already" ).each do | result |
124
+ * result[:id] # => 2
125
+ * result[:ends_at] # => 9
126
+ * result[:starts_at] # => 6
127
+ * result[:value] # => two
128
+ * end # => 1
129
+ *
130
+ */
131
+ static VALUE
132
+ rb_kwt_find_all(int argc, VALUE *argv, VALUE self)
133
+ {
134
+ char * remain; // returned by ac_search: pointer to the start of the current match in the searched text
135
+ int lgt, id, ends_at, starts_at; // filled in by ac_search: the length of the result, the id, and starts_at/ends_at position
136
+ VALUE v_result; // one result, as hash
137
+ VALUE v_results; // all the results, an array
138
+
139
+ VALUE v_search; // search string, function argument
140
+ struct kwt_struct_data *kwt_data;
141
+
142
+ // one mandatory argument.
143
+ rb_scan_args(argc, argv, "1", &v_search);
144
+ // it should be string.
145
+ Check_Type(v_search, T_STRING);
146
+ // get the structure
147
+ KeywordTree(self, kwt_data);
148
+ // freeze the tree, if not already
149
+ if(kwt_data->is_frozen == 0) {
150
+ if(ac_prep( kwt_data->tree ) == 0)
151
+ rb_raise(rb_eRuntimeError, "Cannot freeze the tree");
152
+ kwt_data->is_frozen = 1;
153
+ }
154
+ // prepare the return value
155
+ v_results= rb_ary_new();
156
+ // fail quickly and return the empty array
157
+ if(kwt_data->dictionary_size == 0)
158
+ return v_results;
159
+ // prepare the search
160
+ ac_search_init(kwt_data->tree, RSTRING( v_search )->ptr, RSTRING( v_search )->len);
161
+ // loop through the results
162
+ while((remain= ac_search(kwt_data->tree, &lgt, &id, &ends_at)) != NULL) {
163
+ // this is an individual result as a hash
164
+ v_result= rb_hash_new();
165
+ rb_hash_aset( v_result, sym_id, INT2FIX(id) );
166
+ rb_hash_aset( v_result, sym_starts_at, INT2FIX( ends_at - lgt - 1 ) );
167
+ rb_hash_aset( v_result, sym_ends_at, INT2FIX( ends_at - 1 ) );
168
+ rb_hash_aset( v_result, sym_value, rb_str_new(remain, lgt) );
169
+ rb_ary_push( v_results, v_result );
170
+ }
171
+ // reopen the tree
172
+ kwt_data->is_frozen= 0;
173
+ return v_results;
174
+ }
175
+
176
+ /*
177
+ * Document-method: size
178
+ * call-seq: size
179
+ *
180
+ * Returns the size of this KeywordTree
181
+ *
182
+ * kwt.add_string("foo")
183
+ * kwt.add_string("bar")
184
+ * kwt.size #=> 2
185
+ *
186
+ */
187
+ static VALUE
188
+ rb_kwt_size(VALUE self)
189
+ {
190
+ struct kwt_struct_data *kwt_data;
191
+ KeywordTree(self, kwt_data);
192
+
193
+ return INT2FIX(kwt_data->dictionary_size);
194
+ }
195
+
+ /*
+  * Document-method: add_string
+  * call-seq: add_string
+  *
+  * Adds a sequence to this KeywordTree.
+  *
+  *   kwt.add_string("foo1$21^ 98N3 ba>Z")
+  *   kwt << "bar" # using the alias
+  *
+  * ==== Note: you can also specify the id, a number between 1 and k
+  *
+  *   kwt.add_string "bar", 123 # => 123
+  *
+  * This id should be unique in the context of the current tree.
+  *
+  * Returns the id of the inserted object.
+  *
+  *   kwt.add_string("test", 18) # => 18
+  *   kwt.add_string("baz") # => 19
+  *
+  */
+ static VALUE
+ rb_kwt_add_string(int argc, VALUE *argv, VALUE self)
+ {
+   VALUE v_string, v_id;
+   struct kwt_struct_data *kwt_data;
+   char * string;
+   int id;
+
+   rb_scan_args(argc, argv, "11", &v_string, &v_id);
+
+   Check_Type(v_string, T_STRING);
+   string= RSTRING(v_string)->ptr;
+   KeywordTree(self, kwt_data);
+
+   if(kwt_data->is_frozen == 1)
+     rb_raise(rb_eRuntimeError, "Cannot add `%s\" into a frozen tree.", string);
+
+   if(v_id == Qnil) {
+     id = kwt_data->last_id;
+   } else if(TYPE(v_id) != T_FIXNUM) {
+     // v_id can be any object here, not necessarily a String, so report its inspect string
+     rb_raise(rb_eRuntimeError, "Please use a number from 1 to K as id, or leave nil to auto-generate one. `%s\" given.", RSTRING(rb_inspect(v_id))->ptr);
+   } else if(NUM2INT(v_id) <= 0) {
+     rb_raise(rb_eRuntimeError, "Please use a number from 1 to K as id, or leave nil to auto-generate one. `%d\" given.", NUM2INT(v_id));
+   } else {
+     id= NUM2INT(v_id);
+   }
+
+   if(ac_add_string(kwt_data->tree, string, strlen(string), id) == 0)
+     rb_raise(rb_eRuntimeError, "Failed to add `%s\", duplicate id `%d\"?", string, id);
+
+   kwt_data->last_id= id + 1;
+   kwt_data->dictionary_size++;
+   return INT2FIX(id);
+ }
+
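Review note on the id bookkeeping in rb_kwt_add_string: an omitted id falls back to last_id, an explicit id must be a positive integer, and last_id is then set to id + 1 unconditionally. The model below is a hypothetical pure-Ruby sketch of that logic for illustration only. The class name IdBook is invented, the starting value of 1 is an assumption (the real initial value is set in rb_kwt_init, which is not shown here), and the duplicate check with a Hash merely stands in for whatever ac_add_string does internally.

```ruby
# Hypothetical model of the id handling in rb_kwt_add_string (not the gem's API).
class IdBook
  attr_reader :size

  def initialize(first_id = 1) # assumption: real initial value comes from rb_kwt_init
    @last_id = first_id
    @used = {}                 # stand-in for the duplicate detection in ac_add_string
    @size = 0
  end

  def add(string, id = nil)
    id = @last_id if id.nil?   # auto-generate when no id is given
    unless id.is_a?(Integer) && id > 0
      raise "Please use a number from 1 to K as id, got #{id.inspect}"
    end
    raise "Failed to add #{string.inspect}, duplicate id #{id}" if @used.key?(id)
    @used[id] = string
    @last_id = id + 1          # mirrors the C code: always one past the most recent id
    @size += 1
    id
  end
end

book = IdBook.new
book.add("test", 18)
book.add("baz")   # picks up where the explicit id left off
```

One consequence of the unconditional assignment, visible in the C code as well: an explicit id lower than earlier ones moves the counter backwards, so a later auto-generated id could collide with an id already in use.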
+ /*
+  * call-seq: from_file
+  *
+  * Creates a new KeywordTree and loads the dictionary from a file
+  *
+  *   % cat dict0.txt
+  *   foo
+  *   bar
+  *   base
+  *
+  *   k= AhoCorasick::KeywordTree.from_file "dict0.txt"
+  *   k.search("basement").size # => 1
+  *
+  */
+ static VALUE
+ rb_kwt_new_from_file(int argc, VALUE *argv, VALUE klass)
+ {
+
+   // TODO:
+   // * use rb_kwt_add_string
+   // * use rb_io* to handle the file
+
+   struct kwt_struct_data *kwt_data;
+   char word[1024];
+   int id;
+   VALUE self;
+   VALUE f_string;
+   FILE *dictionary;
+
+   rb_scan_args(argc, argv, "10", &f_string);
+
+   id = 0;
+   SafeStringValue( f_string );
+   self= rb_class_new_instance( 0, NULL, klass );
+   KeywordTree( self, kwt_data );
+
+   dictionary = fopen( RSTRING( f_string )->ptr, "r" );
+   if(dictionary == NULL)
+     rb_raise(rb_eRuntimeError, "Cannot open `%s\". No such file?", RSTRING(f_string)->ptr);
+
+   while(fgets(word, 1024, dictionary) != NULL) {
+     ac_add_string(kwt_data->tree, word, strlen(word)-1, id++);
+     kwt_data->dictionary_size++;
+   }
+
+   kwt_data->last_id= id+1;
+   fclose(dictionary);
+   return self;
+ }
+
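Review note on the dictionary loop in rb_kwt_new_from_file: passing strlen(word)-1 as the length drops the last byte of every line, which removes the newline that fgets keeps but also cuts a real character from a final line that has no trailing newline. The loop also numbers entries from 0, while add_string documents ids from 1 to k. The sketch below shows one possible way to read such a file in Ruby. The helper name load_dictionary is hypothetical, and both starting ids at 1 and skipping blank lines are assumptions of this sketch, not behavior taken from the extension.

```ruby
require "tempfile"

# Hypothetical helper: returns [id, word] pairs read from a dictionary file,
# one keyword per line. Not part of the gem.
def load_dictionary(path, first_id = 1)
  entries = []
  id = first_id
  File.foreach(path) do |line|
    word = line.chomp    # strips a trailing "\n" or "\r\n" only when one is present
    next if word.empty?  # assumption: blank lines are skipped, not added as keywords
    entries << [id, word]
    id += 1
  end
  entries
end

# Usage: the last word deliberately has no trailing newline.
file = Tempfile.new("dict0")
file.write("foo\nbar\nbase")
file.close
load_dictionary(file.path).each { |id, word| puts "#{id}: #{word}" }
file.unlink
```

Each pair could then be handed to add_string with an explicit id, which is roughly what the TODO in the C function points towards.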
+ static void
+ rb_kwt_struct_free(struct kwt_struct_data * kwt_data)
+ {
+   ac_free(kwt_data->tree);
+ }
+
+ static VALUE
+ rb_kwt_struct_alloc(VALUE klass)
+ {
+   return Data_Wrap_Struct(klass, 0, rb_kwt_struct_free, 0);
+ }
+
+ /*
+  * Blump.
+  */
+ void Init_ahocorasick() {
+   rb_mAhoCorasick = rb_define_module("AhoCorasick");
+   rb_cKeywordTree = rb_define_class_under(rb_mAhoCorasick, "KeywordTree", rb_cObject);
+
+   rb_define_alloc_func(rb_cKeywordTree, rb_kwt_struct_alloc);
+
+   rb_define_method(rb_cKeywordTree, "initialize", rb_kwt_init, 0);
+   rb_define_method(rb_cKeywordTree, "size", rb_kwt_size, 0);
+   rb_define_method(rb_cKeywordTree, "make", rb_kwt_make, 0);
+   rb_define_method(rb_cKeywordTree, "add_string", rb_kwt_add_string, -1);
+   rb_define_alias(rb_cKeywordTree, "<<", "add_string");
+
+   rb_define_method(rb_cKeywordTree, "find_all", rb_kwt_find_all, -1);
+   rb_define_alias(rb_cKeywordTree, "search", "find_all");
+
+   rb_define_singleton_method(rb_cKeywordTree, "from_file", rb_kwt_new_from_file, -1);
+
+   sym_id = ID2SYM(rb_intern("id"));
+   sym_value = ID2SYM(rb_intern("value"));
+   sym_ends_at = ID2SYM( rb_intern("ends_at") );
+   sym_starts_at= ID2SYM( rb_intern("starts_at") );
+ }
+
metadata ADDED
@@ -0,0 +1,68 @@
+ --- !ruby/object:Gem::Specification
+ name: ruby-ahocorasick
+ version: !ruby/object:Gem::Version
+   version: 0.4.5
+ platform: ruby
+ authors:
+ - Aurelian Oancea
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2008-11-19 00:00:00 +01:00
+ default_executable:
+ dependencies: []
+
+ description: Expose Aho-Corasick implementation from Strmat to Ruby.
+ email: oancea at gmail dot com
+ executables: []
+
+ extensions:
+ - ext/extconf.rb
+ extra_rdoc_files: []
+
+ files:
+ - ext/extconf.rb
+ - ext/ruby-ahocorasick.c
+ - ext/ac.h
+ - ext/ac.c
+ - examples/dict.rb
+ - examples/test.rb
+ - examples/elev.rb
+ - examples/sample.c
+ - MIT-LICENSE
+ - README.textile
+ has_rdoc: true
+ homepage: http://www.locknet.ro
+ post_install_message:
+ rdoc_options:
+ - --title
+ - Ruby-AhoCorasick
+ - --inline-source
+ - ext/ruby-ahocorasick.c
+ - README.textile
+ - --main
+ - README.textile
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: "0"
+   version:
+ requirements: []
+
+ rubyforge_project: ruby-ahocorasick
+ rubygems_version: 1.3.1
+ signing_key:
+ specification_version: 2
+ summary: Aho-Corasick algorithm implementation to Ruby using Strmat lib.
+ test_files: []
+