string-eater 0.1.0
- data/LICENSE +24 -0
- data/README.md +133 -0
- data/Rakefile +33 -0
- data/examples/address.rb +35 -0
- data/examples/nginx.rb +70 -0
- data/ext/string-eater/c-tokenizer.c +141 -0
- data/ext/string-eater/extconf.rb +2 -0
- data/lib/c-tokenizer.rb +93 -0
- data/lib/ruby-tokenizer-each-char.rb +145 -0
- data/lib/ruby-tokenizer.rb +98 -0
- data/lib/string-eater.rb +10 -0
- data/lib/token.rb +26 -0
- data/lib/version.rb +9 -0
- data/spec/nginx_spec.rb +27 -0
- data/spec/spec_helper.rb +1 -0
- data/spec/string_eater_spec.rb +133 -0
- metadata +66 -0
data/LICENSE
ADDED
@@ -0,0 +1,24 @@
Copyright (c) 2012 Dan Swain

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,133 @@
# String Eater

A fast Ruby string tokenizer. It eats strings and dumps tokens.

## License

String Eater is released under the
[MIT license](http://en.wikipedia.org/wiki/MIT_License).
See the LICENSE file.

## Requirements

String Eater probably only works in Ruby 1.9.2+ with MRI. It's been
tested with Ruby 1.9.3p194.

String Eater uses a C extension, so it will only work on Ruby
implementations that provide support for C extensions.

## Installation

We'll publish this gem soon, but for now you can clone and install it as

    git clone git://github.com/dantswain/string-eater.git
    cd string-eater
    rake install

If you are working on a system where you need to `sudo gem install`,
you can do

    rake gem
    sudo gem install string-eater

As always, you can `rake -T` to find out what other rake tasks we have
provided.

## Basic Usage

Suppose we want to tokenize a string that contains address information
for a person and is consistently formatted like

    Last Name, First Name | Street address, City, State, Zip

Suppose we only want to extract the last name, city, and state.

To do this using String Eater, create a subclass of
`StringEater::Tokenizer` like this

    require 'string-eater'

    class PersonTokenizer < StringEater::Tokenizer
      add_field :last_name
      look_for ", "
      add_field :first_name, :extract => false
      look_for " | "
      add_field :street_address, :extract => false
      look_for ", "
      add_field :city
      look_for ", "
      add_field :state
      look_for ", "
    end

Note the use of `:extract => false` to specify fields that are important
to the structure of the line but that we don't necessarily need to
extract.

Then, we can tokenize the string like this:

    tokenizer = PersonTokenizer.new
    string = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
    tokenizer.tokenize! string

    puts tokenizer.last_name # => "Flinstone"
    puts tokenizer.city      # => "Bedrock"
    puts tokenizer.state     # => "NA"

We can also do something like this:

    tokenizer.tokenize!(string) do |tokens|
      puts "The #{tokens[:last_name]}s live in #{tokens[:city]}"
    end

For another example, see `examples/nginx.rb`, which defines an
[nginx](http://nginx.org) log line tokenizer.

## Implementation

There are actually three tokenizer algorithms provided here. The
three algorithms should be interchangeable.

1. `StringEater::CTokenizer` - A C extension implementation. The
   fastest of the three. This is the default implementation for
   `StringEater::Tokenizer`.

2. `StringEater::RubyTokenizer` - A pure-Ruby implementation. This is
   a slightly different implementation of the algorithm - an
   implementation that is faster in Ruby than a translation of the C
   algorithm. Probably not as fast (or not much faster) than using
   Ruby regular expressions.

3. `StringEater::RubyTokenizerEachChar` - A pure-Ruby implementation.
   This is essentially the same as the C implementation, but written
   in pure Ruby. It uses `String#each_char` and is therefore VERY
   SLOW! It provides a good way to hack the algorithm, though.
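
Because all three classes share the same class-level DSL
(`add_field` / `look_for`), you can switch implementations just by
changing the superclass. For example (an illustrative sketch, not part
of the gem):

    # same fields as PersonTokenizer above, but using the slow
    # each_char implementation to experiment with the algorithm
    class HackablePersonTokenizer < StringEater::RubyTokenizerEachChar
      add_field :last_name
      look_for ", "
      add_field :rest, :extract => false
    end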

The main algorithm works by finding the start and end points of tokens
in a string. The search is done incrementally (i.e., loop through the
string and look for each sequence of characters). The algorithm is
"lazy" in the sense that only the required tokens are copied for
output ("extracted").
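
For intuition, here is a rough pure-Ruby sketch of that idea
(illustrative only - see `lib/` and `ext/` for the real
implementations):

    # Walk the string once, recording where each separator starts and
    # ends, then slice out only the requested ("extracted") fields.
    def scan(string, separators, wanted_field_indexes)
      breakpoints = [0]
      start = 0
      separators.each do |sep|
        ix = string.index(sep, start) || string.length
        breakpoints << ix << (ix + sep.length)
        start = ix + sep.length
      end
      breakpoints << string.length
      # lazy extraction: only copy the fields that were asked for
      wanted_field_indexes.map do |i|
        string[breakpoints[2 * i]...breakpoints[2 * i + 1]]
      end
    end

    scan("foo bar|baz", [" ", "|"], [0, 2]) # => ["foo", "baz"]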

## Performance

Soon I'll add some code here to run your own benchmarks.

I've run my own benchmarks comparing String Eater to code that does
the same task (tokenizing nginx log lines) using Ruby regular
expressions. So far, String Eater is more than twice as fast: it can
process over 100,000 lines per second on my laptop vs. less than
50,000 lines per second for the regular expression version. I'm
working to further optimize the String Eater code.
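
Until then, a benchmark along these lines should work (a minimal
sketch using Ruby's standard `Benchmark` module; it assumes you save
it in the repository root next to `examples/` and supply your own
`access.log`):

    require 'benchmark'
    require_relative 'examples/nginx' # defines NginxLogTokenizer

    lines = File.readlines("access.log") # your own log data
    tokenizer = NginxLogTokenizer.new

    seconds = Benchmark.realtime do
      lines.each { |line| tokenizer.tokenize!(line) }
    end
    puts "#{(lines.size / seconds).round} lines/second"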

## Contributing

The usual github process applies here:

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Added some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

You can also contribute to the author's ego by letting him know that
you find String Eater useful ;)
data/Rakefile
ADDED
@@ -0,0 +1,33 @@
require 'rake/clean'

desc "Run rspec spec/ (compile if needed)"
task :test => :compile do
  sh "rspec spec/"
end

so_ext = RbConfig::CONFIG['DLEXT']
ext_dir = "ext/string-eater"
ext_file = ext_dir + "/c_tokenizer_ext.#{so_ext}"

file ext_file => Dir.glob("ext/string-eater/*{.rb,.c}") do
  Dir.chdir("ext/string-eater") do
    ruby "extconf.rb"
    sh "make"
  end
end

desc "Create gem"
task :gem => "string-eater.gemspec" do
  sh "gem build string-eater.gemspec"
end

desc "Install using 'gem install'"
task :install => :gem do
  sh "gem install string-eater"
end

desc "Compile the extension"
task :compile => ext_file

CLEAN.include('ext/**/*{.o,.log,.so,.bundle}')
CLEAN.include('ext/**/Makefile')
data/examples/address.rb
ADDED
@@ -0,0 +1,35 @@
# once the gem is installed, you don't need this
$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext/string-eater'))

# this is the example from the README
require 'string-eater'

class PersonTokenizer < StringEater::Tokenizer
  add_field :last_name
  look_for ", "
  add_field :first_name, :extract => false
  look_for " | "
  add_field :street_address, :extract => false
  look_for ", "
  add_field :city
  look_for ", "
  add_field :state
  look_for ", "
end

if __FILE__ == $0
  tokenizer = PersonTokenizer.new
  puts tokenizer.describe_line

  string = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
  tokenizer.tokenize! string

  puts tokenizer.last_name # => "Flinstone"
  puts tokenizer.city      # => "Bedrock"
  puts tokenizer.state     # => "NA"

  tokenizer.tokenize!(string) do |tokens|
    puts "The #{tokens[:last_name]}s live in #{tokens[:city]}"
  end
end
data/examples/nginx.rb
ADDED
@@ -0,0 +1,70 @@
# once the gem is installed, you don't need this
$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext/string-eater'))

require 'string-eater'

class NginxLogTokenizer < StringEater::CTokenizer
  add_field :ip
  look_for " - "
  add_field :remote_user, :extract => false
  look_for " ["
  add_field :timestamp, :extract => false
  look_for "] \""
  add_field :request
  look_for "\" "
  add_field :status_code
  look_for " "
  add_field :bytes_sent, :extract => false
  look_for " \""
  add_field :referrer_url
  look_for "\" \""
  add_field :user_agent
  look_for "\" \""
  add_field :compression, :extract => false
  look_for "\" "
  add_field :remainder

  def status_code
    @extracted_tokens[:status_code].to_i
  end

  def request_verb
    @extracted_tokens[:request_verb]
  end

  def request_url
    @extracted_tokens[:request_url]
  end

  # hook called by the tokenizer after the main pass; splits the
  # request field into its verb and URL
  def do_extra_parsing
    return unless @extracted_tokens[:request]
    request_parts = @extracted_tokens[:request].split
    if request_parts.size == 3
      @extracted_tokens[:request_verb] = request_parts[0]
      @extracted_tokens[:request_url] = request_parts[1]
    end
  end
end

if __FILE__ == $0
  tokenizer = NginxLogTokenizer.new
  puts tokenizer.describe_line

  str = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /this_is_a_url HTTP/1.1" 304 152 "http://referrer.com" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" "-" "there could be" other "stuff here"'

  puts "input string: " + str
  puts "Tokens: "

  # use a block to work with the extracted tokens
  tokenizer.tokenize!(str) do |tokens|
    tokens.each do |token|
      puts "\t" + token.inspect
    end
  end

  # use the token's name as a method to get its value
  puts tokenizer.ip
  puts tokenizer.status_code
  puts tokenizer.request_verb
end
data/ext/string-eater/c-tokenizer.c
ADDED
@@ -0,0 +1,141 @@
#include <ruby.h>

/* not used in production - useful for debugging */
#define puts_inspect(var) \
  ID inspect = rb_intern("inspect"); \
  VALUE x = rb_funcall(var, inspect, 0); \
  printf("%s\n", StringValueCStr(x));

static VALUE rb_cCTokenizer;
static VALUE rb_mStringEater;

static VALUE tokenize_string(VALUE self,
                             VALUE string,
                             VALUE tokens_to_find_indexes,
                             VALUE tokens_to_find_strings,
                             VALUE tokens_to_extract_indexes,
                             VALUE tokens_to_extract_names)
{
  const char* input_string = StringValueCStr(string);
  VALUE extracted_tokens = rb_hash_new();
  VALUE curr_token;
  unsigned int curr_token_ix;
  long n_tokens_to_find = RARRAY_LEN(tokens_to_find_indexes);
  size_t str_len = strlen(input_string);
  size_t ix;
  char c;
  char looking_for;
  size_t looking_for_len;
  size_t looking_for_ix = 0;
  long find_ix = 0;
  const char* looking_for_token;
  unsigned int n_tokens = (unsigned int)RARRAY_LEN(rb_iv_get(self, "@tokens"));

  size_t startpoint = 0;

  long n_tokens_to_extract = RARRAY_LEN(tokens_to_extract_indexes);
  long last_token_extracted_ix = 0;

  long next_token_to_extract_ix =
    NUM2UINT(rb_ary_entry(tokens_to_extract_indexes, last_token_extracted_ix));

  curr_token = rb_ary_entry(tokens_to_find_strings, find_ix);
  curr_token_ix = NUM2UINT(rb_ary_entry(tokens_to_find_indexes, find_ix));
  looking_for_token = StringValueCStr(curr_token);
  looking_for_len = strlen(looking_for_token);
  looking_for = looking_for_token[looking_for_ix];

  for(ix = 0; ix < str_len; ix++)
  {
    c = input_string[ix];
    if(c == looking_for)
    {
      if(looking_for_ix == 0)
      {
        /* entering new token */
        if(curr_token_ix > 0)
        {
          /* extract, if necessary */
          if((curr_token_ix - 1) == next_token_to_extract_ix)
          {
            last_token_extracted_ix++;
            if(last_token_extracted_ix < n_tokens_to_extract)
            {
              next_token_to_extract_ix =
                NUM2UINT(rb_ary_entry(tokens_to_extract_indexes,
                                      last_token_extracted_ix));
            }
            else
            {
              next_token_to_extract_ix = -1;
            }
            rb_hash_aset(extracted_tokens,
                         rb_ary_entry(tokens_to_extract_names, curr_token_ix - 1),
                         rb_usascii_str_new(input_string + startpoint,
                                            ix - startpoint));
          }
        }
        startpoint = ix;
      }
      if(looking_for_ix >= looking_for_len - 1)
      {
        /* leaving token */
        if(curr_token_ix >= n_tokens - 1)
        {
          break;
        }
        else
        {
          startpoint = ix + 1;
        }

        /* next token */
        find_ix++;
        if(find_ix >= n_tokens_to_find)
        {
          /* done! */
          break;
        }

        curr_token = rb_ary_entry(tokens_to_find_strings, find_ix);
        curr_token_ix = NUM2UINT(rb_ary_entry(tokens_to_find_indexes, find_ix));
        looking_for_token = StringValueCStr(curr_token);
        looking_for_len = strlen(looking_for_token);
        looking_for_ix = 0;
      }
      else
      {
        looking_for_ix++;
      }
      looking_for = looking_for_token[looking_for_ix];
    }
  }

  /* the last token runs to the end of the string */
  ix = str_len;
  curr_token_ix = n_tokens - 1;

  if(curr_token_ix == next_token_to_extract_ix)
  {
    rb_hash_aset(extracted_tokens,
                 rb_ary_entry(tokens_to_extract_names, curr_token_ix),
                 rb_usascii_str_new(input_string + startpoint,
                                    ix - startpoint));
  }

  return extracted_tokens;
}

void finalize_c_tokenizer_ext(VALUE unused)
{
  /* free memory, etc */
}

void Init_c_tokenizer_ext(void)
{
  rb_mStringEater = rb_define_module("StringEater");
  rb_cCTokenizer = rb_define_class_under(rb_mStringEater,
                                         "CTokenizer", rb_cObject);

  rb_define_method(rb_cCTokenizer, "ctokenize!", tokenize_string, 5);

  /* set the callback for when the extension is unloaded */
  rb_set_end_proc(finalize_c_tokenizer_ext, 0);
}
data/lib/c-tokenizer.rb
ADDED
@@ -0,0 +1,93 @@
require 'c_tokenizer_ext'

class StringEater::CTokenizer
  def self.tokens
    @tokens ||= []
  end

  def self.add_field name, opts={}
    self.tokens << StringEater::Token::new_field(name, opts)
    define_method(name) {@extracted_tokens[name]}
  end

  def self.look_for tokens
    self.tokens << StringEater::Token::new_separator(tokens)
  end

  def initialize
    refresh_tokens
  end

  def tokens
    @tokens
  end

  # precompute the parallel index/string arrays that the C extension
  # (ctokenize!) consumes
  def refresh_tokens
    @tokens = self.class.tokens
    tokens_to_find = tokens.each_with_index.map do |t, i|
      [i, t.string] if t.string
    end.compact

    @tokens_to_find_indexes = tokens_to_find.map{|t| t[0]}
    @tokens_to_find_strings = tokens_to_find.map{|t| t[1]}

    tokens_to_extract = tokens.each_with_index.map do |t, i|
      [i, t.name] if t.extract?
    end.compact

    @tokens_to_extract_indexes = tokens_to_extract.map{|t| t[0]}
    # names are indexed by token position in the C code, so keep them all
    @tokens_to_extract_names = tokens.map{|t| t.name}
  end

  def describe_line
    tokens.inject("") do |desc, t|
      desc << (t.string || t.name.to_s || "xxxxxx")
    end
  end

  # hook for subclasses (see examples/nginx.rb)
  def do_extra_parsing
  end

  def tokenize! string, &block
    @string = string
    @extracted_tokens ||= {}
    @extracted_tokens.clear

    tokens.first.breakpoints[0] = 0

    # ctokenize! is defined by the C extension
    @extracted_tokens = ctokenize!(@string,
                                   @tokens_to_find_indexes,
                                   @tokens_to_find_strings,
                                   @tokens_to_extract_indexes,
                                   @tokens_to_extract_names)

    # extra parsing hook
    do_extra_parsing

    if block_given?
      yield @extracted_tokens
    end

    # return self for chaining
    self
  end

  private

  def set_token_startpoint ix, startpoint
    @tokens[ix].breakpoints[0] = startpoint
  end

  def get_token_startpoint ix
    @tokens[ix].breakpoints[0]
  end

  def set_token_endpoint ix, endpoint
    @tokens[ix].breakpoints[1] = endpoint
  end

  def extract_token? ix
    @tokens[ix].extract?
  end

end
data/lib/ruby-tokenizer-each-char.rb
ADDED
@@ -0,0 +1,145 @@
# this tokenizer is very slow, but it illustrates the
# basic idea of the C tokenizer
class StringEater::RubyTokenizerEachChar

  def self.tokens
    @tokens ||= []
  end

  def self.combined_tokens
    @combined_tokens ||= []
  end

  def self.add_field name, opts={}
    self.tokens << StringEater::Token::new_field(name, opts)
    define_method(name) {@extracted_tokens[name]}
  end

  def self.look_for tokens
    self.tokens << StringEater::Token::new_separator(tokens)
  end

  def self.combine_fields opts={}
    from_token_index = self.tokens.index{|t| t.name == opts[:from]}
    to_token_index = self.tokens.index{|t| t.name == opts[:to]}
    self.combined_tokens << [opts[:as], from_token_index, to_token_index]
    define_method(opts[:as]) {@extracted_tokens[opts[:as]]}
  end

  def tokens
    @tokens ||= self.class.tokens
  end

  def combined_tokens
    @combined_tokens ||= self.class.combined_tokens
  end

  def refresh_tokens
    @combined_tokens = nil
    @tokens = nil
    tokens
  end

  def describe_line
    tokens.inject("") do |desc, t|
      desc << (t.string || t.name.to_s || "xxxxxx")
    end
  end

  def find_breakpoints string
    tokenize!(string) unless @string == string
    tokens.inject([]) do |bp, t|
      bp << t.breakpoints
      bp
    end.flatten.uniq
  end

  def tokenize! string, &block
    @string = string
    @extracted_tokens ||= {}
    @extracted_tokens.clear
    @tokens_to_find ||= tokens.each_with_index.map do |t, i|
      [i, t.string] if t.string
    end.compact
    @tokens_to_extract_indeces ||= tokens.each_with_index.map do |t, i|
      i if t.extract?
    end.compact

    tokens.first.breakpoints[0] = 0

    find_index = 0

    curr_token = @tokens_to_find[find_index]
    curr_token_index = curr_token[0]
    curr_token_length = curr_token[1].length
    looking_for_index = 0
    looking_for = curr_token[1][looking_for_index]

    counter = 0
    string.each_char do |c|
      if c == looking_for
        if looking_for_index == 0
          # entering new token
          if curr_token_index > 0
            t = tokens[curr_token_index - 1]
            t.breakpoints[1] = counter
            if t.extract?
              @extracted_tokens[t.name] = string[t.breakpoints[0]...t.breakpoints[1]]
            end
          end
          tokens[curr_token_index].breakpoints[0] = counter
        end
        if looking_for_index >= (curr_token_length - 1)
          # leaving token
          tokens[curr_token_index].breakpoints[1] = counter

          if curr_token_index >= tokens.size - 1
            # we're done!
            break
          else
            tokens[curr_token_index + 1].breakpoints[0] = counter + 1
          end

          # next token
          find_index += 1
          if find_index >= @tokens_to_find.length
            # we're done!
            break
          end
          curr_token = @tokens_to_find[find_index]
          curr_token_index = curr_token[0]
          curr_token_length = curr_token[1].length
          looking_for_index = 0
        else
          looking_for_index += 1
        end
      end
      looking_for = curr_token[1][looking_for_index]
      counter += 1
    end

    last_token = tokens.last
    last_token.breakpoints[1] = string.length

    if last_token.extract?
      @extracted_tokens[last_token.name] = string[last_token.breakpoints[0]..last_token.breakpoints[1]]
    end

    combined_tokens.each do |combiner|
      name = combiner[0]
      from = @tokens[combiner[1]].breakpoints[0]
      to = @tokens[combiner[2]].breakpoints[1]
      @extracted_tokens[name] = string[from...to]
    end

    if block_given?
      yield @extracted_tokens
    end

    # return self for chaining
    self
  end

end
data/lib/ruby-tokenizer.rb
ADDED
@@ -0,0 +1,98 @@
# this tokenizer is fairly fast, but not necessarily faster than regexps
class StringEater::RubyTokenizer
  def self.tokens
    @tokens ||= []
  end

  def self.combined_tokens
    @combined_tokens ||= []
  end

  def self.add_field name, opts={}
    self.tokens << StringEater::Token::new_field(name, opts)
    define_method(name) {@extracted_tokens[name]}
  end

  def self.look_for tokens
    self.tokens << StringEater::Token::new_separator(tokens)
  end

  def self.combine_fields opts={}
    from_token_index = self.tokens.index{|t| t.name == opts[:from]}
    to_token_index = self.tokens.index{|t| t.name == opts[:to]}
    self.combined_tokens << [opts[:as], from_token_index, to_token_index]
    define_method(opts[:as]) {@extracted_tokens[opts[:as]]}
  end

  def tokens
    @tokens ||= self.class.tokens
  end

  def combined_tokens
    @combined_tokens ||= self.class.combined_tokens
  end

  def refresh_tokens
    @combined_tokens = nil
    @tokens = nil
    tokens
  end

  def describe_line
    tokens.inject("") do |desc, t|
      desc << (t.string || t.name.to_s || "xxxxxx")
    end
  end

  def find_breakpoints(string)
    @literal_tokens ||= tokens.select{|t| t.string}
    @breakpoints ||= Array.new(2*@literal_tokens.size + 2)
    @breakpoints[0] = 0
    @breakpoints[-1] = string.length
    start_point = 0
    @literal_tokens.each_with_index do |t, i|
      @breakpoints[2*i+1], start_point = find_end_of(t, string, start_point)
      @breakpoints[2*i+2] = start_point
    end
    @breakpoints
  end

  def tokenize! string, &block
    @extracted_tokens ||= {}
    @extracted_tokens.clear
    @tokens_to_extract ||= tokens.select{|t| t.extract?}

    find_breakpoints(string)
    last_important_bp = [@breakpoints.length, tokens.size].min
    (0...last_important_bp).each do |i|
      tokens[i].breakpoints = [@breakpoints[i], @breakpoints[i+1]]
    end

    @tokens_to_extract.each do |t|
      @extracted_tokens[t.name] = string[t.breakpoints[0]...t.breakpoints[1]]
    end

    combined_tokens.each do |combiner|
      name = combiner[0]
      from = @tokens[combiner[1]].breakpoints[0]
      to = @tokens[combiner[2]].breakpoints[1]
      @extracted_tokens[name] = string[from...to]
    end

    if block_given?
      yield @extracted_tokens
    end

    # return self for chaining
    self
  end

  protected

  def find_end_of token, string, start_at
    start = string.index(token.string, start_at+1) || string.length
    [start, [start + token.string.length, string.length].min]
  end

end
data/lib/string-eater.rb
ADDED
@@ -0,0 +1,10 @@
module StringEater
  autoload :Token, 'token'
  autoload :RubyTokenizer, 'ruby-tokenizer'
  autoload :RubyTokenizerEachChar, 'ruby-tokenizer-each-char'
  autoload :CTokenizer, 'c-tokenizer'

  autoload :VERSION, 'version'

  # the default tokenizer is the C implementation
  class Tokenizer < CTokenizer; end
end
data/lib/token.rb
ADDED
@@ -0,0 +1,26 @@
class StringEater::Token
  attr_accessor :name, :string, :opts, :breakpoints, :children

  def initialize
    @opts = {}
    @breakpoints = [nil, nil]
  end

  def extract?
    @opts[:extract]
  end

  # a named field (extracted by default)
  def self.new_field(name, opts)
    t = new
    t.name = name
    t.opts = {:extract => true}.merge(opts)
    t
  end

  # a literal separator string to scan for
  def self.new_separator(string)
    t = new
    t.string = string
    t
  end

end
data/lib/version.rb
ADDED
data/spec/nginx_spec.rb
ADDED
@@ -0,0 +1,27 @@
require 'spec_helper'
require 'string-eater'

$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'examples'))

require 'nginx'

describe NginxLogTokenizer do
  before(:each) do
    @tokenizer = NginxLogTokenizer.new
    @str = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /this_is_a_url HTTP/1.1" 304 152 "http://referrer.com" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" "-" "there could be" other "stuff here"'
  end

  {
    :ip => "73.80.217.212",
    :request => "GET /this_is_a_url HTTP/1.1",
    :status_code => 304,
    :referrer_url => "http://referrer.com",
    :user_agent => "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    :remainder => "\"there could be\" other \"stuff here\"",
  }.each_pair do |token, val|
    it "should find the right value for #{token}" do
      @tokenizer.tokenize!(@str).send(token).should == val
    end
  end

end
data/spec/spec_helper.rb
ADDED
@@ -0,0 +1 @@
$LOAD_PATH.concat %w[./lib ./ext/string-eater]
data/spec/string_eater_spec.rb
ADDED
@@ -0,0 +1,133 @@
require 'spec_helper'
require 'string-eater'

TestedClass = StringEater::CTokenizer

describe StringEater do
  it "should have a version" do
    StringEater::VERSION::STRING.split(".").size.should >= 3
  end
end

# normal use
class Example1 < TestedClass
  add_field :first_word
  look_for " "
  add_field :second_word, :extract => false
  look_for "|"
  add_field :third_word
end

describe Example1 do

  before(:each) do
    @tokenizer = Example1.new
    @str1 = "foo bar|baz"
    @first_word1 = "foo"
    @third_word1 = "baz"
    @bp1 = [0, 3, 4, 7, 8, 11]
  end

  describe "find_breakpoints" do
    it "should return an array of the breakpoints" do
      @tokenizer.find_breakpoints(@str1).should == @bp1 if @tokenizer.respond_to?(:find_breakpoints)
    end
  end

  describe "tokenize!" do
    it "should return itself" do
      @tokenizer.tokenize!(@str1).should == @tokenizer
    end

    it "should set the first word" do
      @tokenizer.tokenize!(@str1).first_word.should == "foo"
    end

    it "should set the third word" do
      @tokenizer.tokenize!(@str1).third_word.should == "baz"
    end

    it "should not set the second word" do
      @tokenizer.tokenize!(@str1).second_word.should be_nil
    end

    it "should yield a hash of tokens if a block is given" do
      @tokenizer.tokenize!(@str1) do |tokens|
        tokens[:first_word].should == "foo"
      end
    end

    it "should return everything to the end of the line for the last token" do
      s = "c defg asdf | foo , baa"
      @tokenizer.tokenize!("a b|#{s}").third_word.should == s
    end

  end

end

# an example where we ignore after a certain point in the string
class Example2 < TestedClass
  add_field :first_word, :extract => false
  look_for " "
  add_field :second_word
  look_for " "
  add_field :third_word, :extract => false
  look_for "-"
end

describe Example2 do

  before(:each) do
    @tokenizer = Example2.new
    @str1 = "foo bar baz-"
    @second_word1 = "bar"
  end

  describe "tokenize!" do
    it "should find the token when there is extra stuff at the end of the string" do
      @tokenizer.tokenize!(@str1).second_word.should == @second_word1
    end
  end

end

# CTokenizer doesn't do combine_fields because
# writing out breakpoints is a significant slow-down
if TestedClass.respond_to?(:combine_fields)
  # an example where we combine fields
  class Example3 < TestedClass
    add_field :first_word, :extract => false
    look_for " \""
    add_field :part1, :extract => false
    look_for " "
    add_field :part2
    look_for " "
    add_field :part3, :extract => false
    look_for "\""

    combine_fields :from => :part1, :to => :part3, :as => :parts
  end

  describe Example3 do
    before(:each) do
      @tokenizer = Example3.new
      @str1 = "foo \"bar baz bang\""
      @part2 = "baz"
      @parts = "bar baz bang"
    end

    it "should extract like normal" do
      @tokenizer.tokenize!(@str1).part2.should == @part2
    end

    it "should ignore like normal" do
      @tokenizer.tokenize!(@str1).part1.should be_nil
    end

    it "should extract the combined field" do
      @tokenizer.tokenize!(@str1).parts.should == @parts
    end

  end
end
metadata
ADDED
@@ -0,0 +1,66 @@
--- !ruby/object:Gem::Specification
name: string-eater
version: !ruby/object:Gem::Version
  version: 0.1.0
prerelease:
platform: ruby
authors:
- Dan Swain
autorequire:
bindir: bin
cert_chain: []
date: 2012-08-20 00:00:00.000000000 Z
dependencies: []
description: Fast string tokenizer. Nom strings.
email:
- dan@simpli.fi
executables: []
extensions:
- ext/string-eater/extconf.rb
extra_rdoc_files: []
files:
- lib/c-tokenizer.rb
- lib/ruby-tokenizer-each-char.rb
- lib/ruby-tokenizer.rb
- lib/string-eater.rb
- lib/token.rb
- lib/version.rb
- ext/string-eater/extconf.rb
- ext/string-eater/c-tokenizer.c
- spec/nginx_spec.rb
- spec/spec_helper.rb
- spec/string_eater_spec.rb
- examples/address.rb
- examples/nginx.rb
- LICENSE
- Rakefile
- README.md
homepage: http://github.com/simplifi/string-eater
licenses: []
post_install_message:
rdoc_options: []
require_paths:
- lib
- ext/string-eater
required_ruby_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 1.8.24
signing_key:
specification_version: 3
summary: Fast string tokenizer. Nom strings.
test_files:
- spec/nginx_spec.rb
- spec/spec_helper.rb
- spec/string_eater_spec.rb