string-eater 0.1.0

data/LICENSE ADDED
@@ -0,0 +1,24 @@
1
+ Copyright (c) 2012 Dan Swain
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
23
+
24
+
data/README.md ADDED
@@ -0,0 +1,133 @@
1
+ # String Eater
2
+
3
+ A fast Ruby string tokenizer. It eats strings and dumps tokens.
4
+
5
+ ## License
6
+
7
+ String Eater is released under the
8
+ [MIT license](http://en.wikipedia.org/wiki/MIT_License).
9
+ See the LICENSE file.
10
+
11
+ ## Requirements
12
+
13
+ String Eater probably only works in Ruby 1.9.2+ with MRI. It's been
14
+ tested with Ruby 1.9.3p194.
15
+
16
+ String Eater uses a C extension, so it will only work on Ruby
17
+ implementations that provide support for C extensions.
18
+
19
+ ## Installation
20
+
21
+ We'll publish this gem soon, but for now you can clone and install it as follows:
22
+
23
+ git clone git://github.com/dantswain/string-eater.git
24
+ cd string-eater
25
+ rake install
26
+
27
+ If you are working on a system where you need to `sudo gem install`
28
+ you can do
29
+
30
+ rake gem
31
+ sudo gem install string-eater
32
+
33
+ As always, you can `rake -T` to find out what other rake tasks we have
34
+ provided.
35
+
36
+ ## Basic Usage
37
+
38
+ Suppose we want to tokenize a string that contains address information
39
+ for a person and is consistently formatted like
40
+
41
+ Last Name, First Name | Street address, City, State, Zip
42
+
43
+ Suppose we only want to extract the last name, city, and state.
44
+
45
+ To do this using String Eater, create a subclass of
46
+ `StringEater::Tokenizer` like this
47
+
48
+ require 'string-eater'
49
+
50
+ class PersonTokenizer < StringEater::Tokenizer
51
+ add_field :last_name
52
+ look_for ", "
53
+ add_field :first_name, :extract => false
54
+ look_for " | "
55
+ add_field :street_address, :extract => false
56
+ look_for ", "
57
+ add_field :city
58
+ look_for ", "
59
+ add_field :state
60
+ look_for ", "
61
+ end
62
+
63
+ Note the use of `:extract => false` to specify fields that are important
64
+ to the structure of the line but that we don't necessarily need to
65
+ extract.
66
+
67
+ Then, we can tokenize the string like this:
68
+
69
+ tokenizer = PersonTokenizer.new
70
+ string = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
71
+ tokenizer.tokenize! string
72
+
73
+ puts tokenizer.last_name # => "Flinstone"
74
+ puts tokenizer.city # => "Bedrock"
75
+ puts tokenizer.state # => "NA"
76
+
77
+ We can also do something like this:
78
+
79
+ tokenizer.tokenize!(string) do |tokens|
80
+ puts "The #{tokens[:last_name]}s live in #{tokens[:city]}"
81
+ end
82
+
83
+ For another example, see `examples/nginx.rb`, which defines an
84
+ [nginx](http://nginx.org) log line tokenizer.
85
+
86
+ ## Implementation
87
+
88
+ There are actually three tokenizer algorithms provided here. The
89
+ three algorithms should be interchangeable.
90
+
91
+ 1. `StringEater::CTokenizer` - A C extension implementation. The
92
+ fastest of the three. This is the default implementation for
93
+ `StringEater::Tokenizer`.
94
+
95
+ 2. `StringEater::RubyTokenizer` - A pure-Ruby implementation. This is
96
+ a slightly different implementation of the algorithm - an
97
+ implementation that is faster on Ruby than a translation of the C
98
+ algorithm. Probably not as fast as (or not much faster than) using
99
+ Ruby regular expressions.
100
+
101
+ 3. `StringEater::RubyTokenizerEachChar` - A pure-Ruby implementation.
102
+ This is essentially the same as the C implementation, but written
103
+ in pure Ruby. It uses `String#each_char` and is therefore VERY
104
+ SLOW! It provides a good way to hack the algorithm, though.
105
+
106
+ The main algorithm works by finding the start and end points of tokens
107
+ in a string. The search is done incrementally (i.e., it loops through the
108
+ string once, looking for each separator sequence in turn). The algorithm is
109
+ "lazy" in the sense that only the required tokens are copied for
110
+ output ("extracted").
111
+
112
+ ## Performance
113
+
114
+ Soon I'll add some code here to run your own benchmarks.
115
+
116
+ I've run my own benchmarks comparing String Eater to some code that does the
117
+ same task (both tokenizing nginx log lines) using Ruby regular expressions. So
118
+ far, String Eater is more than twice as fast, able to process over 100,000 lines per
119
+ second on my laptop vs. less than 50,000 lines per second for the regular
120
+ expression version. I'm working to further optimize the String Eater code.
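+ If you want to try a similar comparison yourself before a benchmark task
+ ships with the gem, something along these lines works (a sketch: the log
+ line, the regexp, and the relative path to the nginx example are
+ illustrative assumptions, not part of the gem):
+
+     require 'benchmark'
+     require 'string-eater'
+     require_relative 'examples/nginx'  # assumes the NginxLogTokenizer example from this repo
+
+     line = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /a_url HTTP/1.1" ' \
+            '304 152 "http://referrer.com" "Mozilla/5.0" "-" "rest of line"'
+     # rough regular-expression equivalent, for comparison only
+     regexp = /\A(\S+) - (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\d+)/
+
+     tokenizer = NginxLogTokenizer.new
+     n = 100_000
+
+     Benchmark.bm(12) do |b|
+       b.report("string-eater") { n.times { tokenizer.tokenize!(line) } }
+       b.report("regexp")       { n.times { line.match(regexp) } }
+     end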
121
+
122
+ ## Contributing
123
+
124
+ The usual github process applies here:
125
+
126
+ 1. Fork it
127
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
128
+ 3. Commit your changes (`git commit -am 'Added some feature'`)
129
+ 4. Push to the branch (`git push origin my-new-feature`)
130
+ 5. Create a new Pull Request
131
+
132
+ You can also contribute to the author's ego by letting him know that
133
+ you find String Eater useful ;)
data/Rakefile ADDED
@@ -0,0 +1,33 @@
1
+ require 'rake/clean'
2
+
3
+ desc "Run rspec spec/ (compile if needed)"
4
+ task :test => :compile do
5
+ sh "rspec spec/"
6
+ end
7
+
8
+ so_ext = RbConfig::CONFIG['DLEXT']
9
+ ext_dir = "ext/string-eater"
10
+ ext_file = ext_dir + "/c_tokenizer_ext.#{so_ext}"
11
+
12
+ file ext_file => Dir.glob("#{ext_dir}/*{.rb,.c}") do
13
+ Dir.chdir(ext_dir) do
14
+ ruby "extconf.rb"
15
+ sh "make"
16
+ end
17
+ end
18
+
19
+ desc "Create gem"
20
+ task :gem => "string-eater.gemspec" do
21
+ sh "gem build string-eater.gemspec"
22
+ end
23
+
24
+ desc "Install using 'gem install'"
25
+ task :install => :gem do
26
+ sh "gem install string-eater"
27
+ end
28
+
29
+ desc "Compile the extension"
30
+ task :compile => ext_file
31
+
32
+ CLEAN.include('ext/**/*{.o,.log,.so,.bundle}')
33
+ CLEAN.include('ext/**/Makefile')
data/examples/address.rb ADDED
@@ -0,0 +1,35 @@
1
+ # once the gem is installed, you don't need this
2
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
3
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext/string-eater'))
4
+
5
+ # this is the example from the README
6
+ require 'string-eater'
7
+
8
+ class PersonTokenizer < StringEater::Tokenizer
9
+ add_field :last_name
10
+ look_for ", "
11
+ add_field :first_name, :extract => false
12
+ look_for " | "
13
+ add_field :street_address, :extract => false
14
+ look_for ", "
15
+ add_field :city
16
+ look_for ", "
17
+ add_field :state
18
+ look_for ", "
19
+ end
20
+
21
+ if __FILE__ == $0
22
+ tokenizer = PersonTokenizer.new
23
+ puts tokenizer.describe_line
24
+
25
+ string = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
26
+ tokenizer.tokenize! string
27
+
28
+ puts tokenizer.last_name # => "Flinstone"
29
+ puts tokenizer.city # => "Bedrock"
30
+ puts tokenizer.state # => "NA"
31
+
32
+ tokenizer.tokenize!(string) do |tokens|
33
+ puts "The #{tokens[:last_name]}s live in #{tokens[:city]}"
34
+ end
35
+ end
data/examples/nginx.rb ADDED
@@ -0,0 +1,70 @@
1
+ # once the gem is installed, you don't need this
2
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
3
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext/string-eater'))
4
+
5
+ require 'string-eater'
6
+
7
+ class NginxLogTokenizer < StringEater::CTokenizer
8
+ add_field :ip
9
+ look_for " - "
10
+ add_field :remote_user, :extract => false
11
+ look_for " ["
12
+ add_field :timestamp, :extract => false
13
+ look_for "] \""
14
+ add_field :request
15
+ look_for "\" "
16
+ add_field :status_code
17
+ look_for " "
18
+ add_field :bytes_sent, :extract => false
19
+ look_for " \""
20
+ add_field :referrer_url
21
+ look_for "\" \""
22
+ add_field :user_agent
23
+ look_for "\" \""
24
+ add_field :compression, :extract => false
25
+ look_for "\" "
26
+ add_field :remainder
27
+
28
+ def status_code
29
+ @extracted_tokens[:status_code].to_i
30
+ end
31
+
32
+ def request_verb
33
+ @extracted_tokens[:request_verb]
34
+ end
35
+
36
+ def request_url
37
+ @extracted_tokens[:request_url]
38
+ end
39
+
40
+ def do_extra_parsing
41
+ return unless @extracted_tokens[:request]
42
+ request_parts = @extracted_tokens[:request].split
43
+ if request_parts.size == 3
44
+ @extracted_tokens[:request_verb] = request_parts[0]
45
+ @extracted_tokens[:request_url] = request_parts[1]
46
+ end
47
+ end
48
+ end
49
+
50
+ if __FILE__ == $0
51
+ tokenizer = NginxLogTokenizer.new
52
+ puts tokenizer.describe_line
53
+
54
+ str = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /this_is_a_url HTTP/1.1" 304 152 "http://referrer.com" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" "-" "there could be" other "stuff here"'
55
+
56
+ puts "input string: " + str
57
+ puts "Tokens: "
58
+
59
+ # use a block to work with the extracted tokens
60
+ tokenizer.tokenize!(str) do |tokens|
61
+ tokens.each do |token|
62
+ puts "\t" + token.inspect
63
+ end
64
+ end
65
+
66
+ # use the token's name as a method to get its value
67
+ puts tokenizer.ip
68
+ puts tokenizer.status_code
69
+ puts tokenizer.request_verb
70
+ end
data/ext/string-eater/c-tokenizer.c ADDED
@@ -0,0 +1,141 @@
1
+ #include <ruby.h>
2
+
3
+ /* not used in production - useful for debugging */
4
+ #define puts_inspect(var) \
5
+ ID inspect = rb_intern("inspect"); \
6
+ VALUE x = rb_funcall(var, inspect, 0); \
7
+ printf("%s\n", StringValueCStr(x));
8
+
9
+ static VALUE rb_cCTokenizer;
10
+ static VALUE rb_mStringEater;
11
+
12
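+ /* Core scanner: walk input_string once, matching each separator in
+  * tokens_to_find_strings in order.  When a separator is entered, the
+  * field that precedes it is sliced out iff its index appears in
+  * tokens_to_extract_indexes, and stored in the returned hash under the
+  * corresponding entry of tokens_to_extract_names. */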
+ static VALUE tokenize_string(VALUE self,
13
+ VALUE string,
14
+ VALUE tokens_to_find_indexes,
15
+ VALUE tokens_to_find_strings,
16
+ VALUE tokens_to_extract_indexes,
17
+ VALUE tokens_to_extract_names)
18
+ {
19
+ const char* input_string = StringValueCStr(string);
20
+ VALUE extracted_tokens = rb_hash_new();
21
+ VALUE curr_token;
22
+ unsigned int curr_token_ix;
23
+ long n_tokens_to_find = RARRAY_LEN(tokens_to_find_indexes);
24
+ size_t str_len = strlen(input_string);
25
+ size_t ix;
26
+ char c;
27
+ char looking_for;
28
+ size_t looking_for_len;
29
+ size_t looking_for_ix = 0;
30
+ long find_ix = 0;
31
+ const char* looking_for_token;
32
+ unsigned int n_tokens = (unsigned int)RARRAY_LEN(rb_iv_get(self, "@tokens"));
33
+
34
+ size_t startpoint = 0;
35
+
36
+ long n_tokens_to_extract = RARRAY_LEN(tokens_to_extract_indexes);
37
+ long last_token_extracted_ix = 0;
38
+
39
+ long next_token_to_extract_ix = NUM2UINT(rb_ary_entry(tokens_to_extract_indexes, last_token_extracted_ix));
40
+
41
+ curr_token = rb_ary_entry(tokens_to_find_strings, find_ix);
42
+ curr_token_ix = NUM2UINT(rb_ary_entry(tokens_to_find_indexes, find_ix));
43
+ looking_for_token = StringValueCStr(curr_token);
44
+ looking_for_len = strlen(looking_for_token);
45
+ looking_for = looking_for_token[looking_for_ix];
46
+
47
+ for(ix = 0; ix < str_len; ix++)
48
+ {
49
+ c = input_string[ix];
50
+ if(c == looking_for)
51
+ {
52
+ if(looking_for_ix == 0)
53
+ {
54
+ /* entering new token */
55
+ if(curr_token_ix > 0)
56
+ {
57
+ /* extract, if necessary */
58
+ if((curr_token_ix - 1) == next_token_to_extract_ix)
59
+ {
60
+ last_token_extracted_ix++;
61
+ if(last_token_extracted_ix < n_tokens_to_extract)
62
+ {
63
+ next_token_to_extract_ix = NUM2UINT(rb_ary_entry(tokens_to_extract_indexes, last_token_extracted_ix));
64
+ }
65
+ else
66
+ {
67
+ next_token_to_extract_ix = -1;
68
+ }
69
+ rb_hash_aset(extracted_tokens,
70
+ rb_ary_entry(tokens_to_extract_names, curr_token_ix - 1),
71
+ rb_usascii_str_new(input_string + startpoint,
72
+ ix - startpoint));
73
+ }
74
+ }
75
+ startpoint = ix;
76
+ }
77
+ if(looking_for_ix >= looking_for_len - 1)
78
+ {
79
+ /* leaving token */
80
+ if(curr_token_ix >= n_tokens-1)
81
+ {
82
+ break;
83
+ }
84
+ else
85
+ {
86
+ startpoint = ix + 1;
87
+ }
88
+
89
+
90
+ /* next token */
91
+ find_ix++;
92
+ if(find_ix >= n_tokens_to_find)
93
+ {
94
+ /* done! */
95
+ break;
96
+ }
97
+
98
+ curr_token = rb_ary_entry(tokens_to_find_strings, find_ix);
99
+ curr_token_ix = NUM2UINT(rb_ary_entry(tokens_to_find_indexes, find_ix));
100
+ looking_for_token = StringValueCStr(curr_token);
101
+ looking_for_len = strlen(looking_for_token);
102
+ looking_for_ix = 0;
103
+ }
104
+ else
105
+ {
106
+ looking_for_ix++;
107
+ }
108
+ looking_for = looking_for_token[looking_for_ix];
109
+ }
110
+ }
111
+
112
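+ /* whatever is left after the last separator is the final token;
+    extract it here if it was requested */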
+ ix = str_len;
113
+ curr_token_ix = n_tokens - 1;
114
+
115
+ if(curr_token_ix == next_token_to_extract_ix)
116
+ {
117
+ rb_hash_aset(extracted_tokens,
118
+ rb_ary_entry(tokens_to_extract_names, curr_token_ix),
119
+ rb_usascii_str_new(input_string + startpoint,
120
+ ix - startpoint));
121
+ }
122
+
123
+ return extracted_tokens;
124
+ }
125
+
126
+ void finalize_c_tokenizer_ext(VALUE unused)
127
+ {
128
+ /* free memory, etc */
129
+ }
130
+
131
+ void Init_c_tokenizer_ext(void)
132
+ {
133
+ rb_mStringEater = rb_define_module("StringEater");
134
+ rb_cCTokenizer = rb_define_class_under(rb_mStringEater,
135
+ "CTokenizer", rb_cObject);
136
+
137
+ rb_define_method(rb_cCTokenizer, "ctokenize!", tokenize_string, 5);
138
+
139
+ /* set the callback for when the extension is unloaded */
140
+ rb_set_end_proc(finalize_c_tokenizer_ext, 0);
141
+ }
data/ext/string-eater/extconf.rb ADDED
@@ -0,0 +1,2 @@
1
+ require 'mkmf'
2
+ create_makefile('c_tokenizer_ext')
data/lib/c-tokenizer.rb ADDED
@@ -0,0 +1,93 @@
1
+ require 'c_tokenizer_ext'
2
+
3
+ class StringEater::CTokenizer
4
+ def self.tokens
5
+ @tokens ||= []
6
+ end
7
+
8
+ def self.add_field name, opts={}
9
+ self.tokens << StringEater::Token::new_field(name, opts)
10
+ define_method(name) {@extracted_tokens[name]}
11
+ end
12
+
13
+ def self.look_for tokens
14
+ self.tokens << StringEater::Token::new_separator(tokens)
15
+ end
16
+
17
+ def initialize
18
+ refresh_tokens
19
+ end
20
+
21
+ def tokens
22
+ @tokens
23
+ end
24
+
25
+ def refresh_tokens
26
+ @tokens = self.class.tokens
27
+ tokens_to_find = tokens.each_with_index.map do |t, i|
28
+ [i, t.string] if t.string
29
+ end.compact
30
+
31
+ @tokens_to_find_indexes = tokens_to_find.map{|t| t[0]}
32
+ @tokens_to_find_strings = tokens_to_find.map{|t| t[1]}
33
+
34
+ tokens_to_extract = tokens.each_with_index.map do |t, i|
35
+ [i, t.name] if t.extract?
36
+ end.compact
37
+
38
+ @tokens_to_extract_indexes = tokens_to_extract.map{|t| t[0]}
39
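+ # note: the full list of names, indexed by token position (not extraction
+ # order) - the C scanner looks names up by token index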
+ @tokens_to_extract_names = tokens.map{|t| t.name}
40
+ end
41
+
42
+ def describe_line
43
+ tokens.inject("") do |desc, t|
44
+ desc << (t.string || t.name.to_s || "xxxxxx")
45
+ end
46
+ end
47
+
48
+ def do_extra_parsing
49
+ end
50
+
51
+ def tokenize! string, &block
52
+ @string = string
53
+ @extracted_tokens ||= {}
54
+ @extracted_tokens.clear
55
+
56
+ tokens.first.breakpoints[0] = 0
57
+
58
+ @extracted_tokens = ctokenize!(@string,
59
+ @tokens_to_find_indexes,
60
+ @tokens_to_find_strings,
61
+ @tokens_to_extract_indexes,
62
+ @tokens_to_extract_names)
63
+
64
+ # extra parsing hook
65
+ do_extra_parsing
66
+
67
+ if block_given?
68
+ yield @extracted_tokens
69
+ end
70
+
71
+ # return self for chaining
72
+ self
73
+ end
74
+
75
+ private
76
+
77
+ def set_token_startpoint ix, startpoint
78
+ @tokens[ix].breakpoints[0] = startpoint
79
+ end
80
+
81
+ def get_token_startpoint ix
82
+ @tokens[ix].breakpoints[0]
83
+ end
84
+
85
+ def set_token_endpoint ix, endpoint
86
+ @tokens[ix].breakpoints[1] = endpoint
87
+ end
88
+
89
+ def extract_token? ix
90
+ @tokens[ix].extract?
91
+ end
92
+
93
+ end
data/lib/ruby-tokenizer-each-char.rb ADDED
@@ -0,0 +1,145 @@
1
+ # this tokenizer is very slow, but it illustrates the
2
+ # basic idea of the C tokenizer
3
+ class StringEater::RubyTokenizerEachChar
4
+
5
+ def self.tokens
6
+ @tokens ||= []
7
+ end
8
+
9
+ def self.combined_tokens
10
+ @combined_tokens ||= []
11
+ end
12
+
13
+ def self.add_field name, opts={}
14
+ self.tokens << StringEater::Token::new_field(name, opts)
15
+ define_method(name) {@extracted_tokens[name]}
16
+ end
17
+
18
+ def self.look_for tokens
19
+ self.tokens << StringEater::Token::new_separator(tokens)
20
+ end
21
+
22
+ def self.combine_fields opts={}
23
+ from_token_index = self.tokens.index{|t| t.name == opts[:from]}
24
+ to_token_index = self.tokens.index{|t| t.name == opts[:to]}
25
+ self.combined_tokens << [opts[:as], from_token_index, to_token_index]
26
+ define_method(opts[:as]) {@extracted_tokens[opts[:as]]}
27
+ end
28
+
29
+ def tokens
30
+ @tokens ||= self.class.tokens
31
+ end
32
+
33
+ def combined_tokens
34
+ @combined_tokens ||= self.class.combined_tokens
35
+ end
36
+
37
+ def refresh_tokens
38
+ @combined_tokens = nil
39
+ @tokens = nil
40
+ tokens
41
+ end
42
+
43
+ def describe_line
44
+ tokens.inject("") do |desc, t|
45
+ desc << (t.string || t.name.to_s || "xxxxxx")
46
+ end
47
+ end
48
+
49
+ def find_breakpoints string
50
+ tokenize!(string) unless @string == string
51
+ tokens.inject([]) do |bp, t|
52
+ bp << t.breakpoints
53
+ bp
54
+ end.flatten.uniq
55
+ end
56
+
57
+ def tokenize! string, &block
58
+ @string = string
59
+ @extracted_tokens ||= {}
60
+ @extracted_tokens.clear
61
+ @tokens_to_find ||= tokens.each_with_index.map do |t, i|
62
+ [i, t.string] if t.string
63
+ end.compact
64
+ @tokens_to_extract_indeces ||= tokens.each_with_index.map do |t, i|
65
+ i if t.extract?
66
+ end.compact
67
+
68
+ tokens.first.breakpoints[0] = 0
69
+
70
+ find_index = 0
71
+
72
+ curr_token = @tokens_to_find[find_index]
73
+ curr_token_index = curr_token[0]
74
+ curr_token_length = curr_token[1].length
75
+ looking_for_index = 0
76
+ looking_for = curr_token[1][looking_for_index]
77
+
78
+ counter = 0
79
+ string.each_char do |c|
80
+ if c == looking_for
81
+ if looking_for_index == 0
82
+ # entering new token
83
+ if curr_token_index > 0
84
+ t = tokens[curr_token_index - 1]
85
+ t.breakpoints[1] = counter
86
+ if t.extract?
87
+ @extracted_tokens[t.name] = string[t.breakpoints[0]...t.breakpoints[1]]
88
+ end
89
+ end
90
+ tokens[curr_token_index].breakpoints[0] = counter
91
+ end
92
+ if looking_for_index >= (curr_token_length - 1)
93
+ # leaving token
94
+ tokens[curr_token_index].breakpoints[1] = counter
95
+
96
+ if curr_token_index >= tokens.size-1
97
+ # we're done!
98
+ break
99
+ else
100
+ tokens[curr_token_index + 1].breakpoints[0] = counter + 1
101
+ end
102
+
103
+ # next token
104
+ find_index += 1
105
+ if find_index >= @tokens_to_find.length
106
+ # we're done!
107
+ break
108
+ end
109
+ curr_token = @tokens_to_find[find_index]
110
+ curr_token_index = curr_token[0]
111
+ curr_token_length = curr_token[1].length
112
+ looking_for_index = 0
113
+ else
114
+ looking_for_index += 1
115
+ end
116
+ end
117
+ looking_for = curr_token[1][looking_for_index]
118
+ counter += 1
119
+ end
120
+
121
+ last_token = tokens.last
122
+ last_token.breakpoints[1] = string.length
123
+
124
+ if last_token.extract?
125
+ @extracted_tokens[last_token.name] = string[last_token.breakpoints[0]..last_token.breakpoints[1]]
126
+ end
127
+
128
+ combined_tokens.each do |combiner|
129
+ name = combiner[0]
130
+ from = @tokens[combiner[1]].breakpoints[0]
131
+ to = @tokens[combiner[2]].breakpoints[1]
132
+ @extracted_tokens[name] = string[from...to]
133
+ end
134
+
135
+ if block_given?
136
+ yield @extracted_tokens
137
+ end
138
+
139
+ # return self for chaining
140
+ self
141
+ end
142
+
143
+ end
144
+
145
+
data/lib/ruby-tokenizer.rb ADDED
@@ -0,0 +1,98 @@
1
+ # this tokenizer is fairly fast, but not necessarily faster than regexps
2
+ class StringEater::RubyTokenizer
3
+ def self.tokens
4
+ @tokens ||= []
5
+ end
6
+
7
+ def self.combined_tokens
8
+ @combined_tokens ||= []
9
+ end
10
+
11
+ def self.add_field name, opts={}
12
+ self.tokens << StringEater::Token::new_field(name, opts)
13
+ define_method(name) {@extracted_tokens[name]}
14
+ end
15
+
16
+ def self.look_for tokens
17
+ self.tokens << StringEater::Token::new_separator(tokens)
18
+ end
19
+
20
+ def self.combine_fields opts={}
21
+ from_token_index = self.tokens.index{|t| t.name == opts[:from]}
22
+ to_token_index = self.tokens.index{|t| t.name == opts[:to]}
23
+ self.combined_tokens << [opts[:as], from_token_index, to_token_index]
24
+ define_method(opts[:as]) {@extracted_tokens[opts[:as]]}
25
+ end
26
+
27
+ def tokens
28
+ @tokens ||= self.class.tokens
29
+ end
30
+
31
+ def combined_tokens
32
+ @combined_tokens ||= self.class.combined_tokens
33
+ end
34
+
35
+ def refresh_tokens
36
+ @combined_tokens = nil
37
+ @tokens = nil
38
+ tokens
39
+ end
40
+
41
+ def describe_line
42
+ tokens.inject("") do |desc, t|
43
+ desc << (t.string || t.name.to_s || "xxxxxx")
44
+ end
45
+ end
46
+
47
+ def find_breakpoints(string)
48
+ @literal_tokens ||= tokens.select{|t| t.string}
49
+ @breakpoints ||= Array.new(2*@literal_tokens.size + 2)
50
+ @breakpoints[0] = 0
51
+ @breakpoints[-1] = string.length
52
+ start_point = 0
53
+ @literal_tokens.each_with_index do |t, i|
54
+ @breakpoints[2*i+1], start_point = find_end_of(t, string, start_point)
55
+ @breakpoints[2*i+2] = start_point
56
+ end
57
+ @breakpoints
58
+ end
59
+
60
+ def tokenize! string, &block
61
+ @extracted_tokens ||= {}
62
+ @extracted_tokens.clear
63
+ @tokens_to_extract ||= tokens.select{|t| t.extract?}
64
+
65
+ find_breakpoints(string)
66
+ last_important_bp = [@breakpoints.length, tokens.size].min
67
+ (0...last_important_bp).each do |i|
68
+ tokens[i].breakpoints = [@breakpoints[i], @breakpoints[i+1]]
69
+ end
70
+
71
+ @tokens_to_extract.each do |t|
72
+ @extracted_tokens[t.name] = string[t.breakpoints[0]...t.breakpoints[1]]
73
+ end
74
+
75
+ combined_tokens.each do |combiner|
76
+ name = combiner[0]
77
+ from = @tokens[combiner[1]].breakpoints[0]
78
+ to = @tokens[combiner[2]].breakpoints[1]
79
+ @extracted_tokens[name] = string[from...to]
80
+ end
81
+
82
+ if block_given?
83
+ yield @extracted_tokens
84
+ end
85
+
86
+ # return self for chaining
87
+ self
88
+ end
89
+
90
+ protected
91
+
92
+ def find_end_of token, string, start_at
93
+ start = string.index(token.string, start_at+1) || string.length
94
+ [start, [start + token.string.length, string.length].min]
95
+ end
96
+
97
+ end
98
+
data/lib/string-eater.rb ADDED
@@ -0,0 +1,10 @@
1
+ module StringEater
2
+ autoload :Token, 'token'
3
+ autoload :RubyTokenizer, 'ruby-tokenizer'
4
+ autoload :RubyTokenizerEachChar, 'ruby-tokenizer-each-char'
5
+ autoload :CTokenizer, 'c-tokenizer'
6
+
7
+ autoload :VERSION, 'version'
8
+
9
+ class Tokenizer < CTokenizer; end
10
+ end
data/lib/token.rb ADDED
@@ -0,0 +1,26 @@
1
+ class StringEater::Token
2
+ attr_accessor :name, :string, :opts, :breakpoints, :children
3
+
4
+ def initialize
5
+ @opts = {}
6
+ @breakpoints = [nil,nil]
7
+ end
8
+
9
+ def extract?
10
+ @opts[:extract]
11
+ end
12
+
13
+ def self.new_field(name, opts)
14
+ t = new
15
+ t.name = name
16
+ t.opts = {:extract => true}.merge(opts)
17
+ t
18
+ end
19
+
20
+ def self.new_separator(string)
21
+ t = new
22
+ t.string = string
23
+ t
24
+ end
25
+
26
+ end
data/lib/version.rb ADDED
@@ -0,0 +1,9 @@
1
+ module StringEater
2
+ module VERSION
3
+ MAJOR = 0
4
+ MINOR = 1
5
+ PATCH = 0
6
+ PRE = nil
7
+ STRING = [MAJOR, MINOR, PATCH, PRE].compact.join('.')
8
+ end
9
+ end
data/spec/nginx_spec.rb ADDED
@@ -0,0 +1,27 @@
1
+ require 'spec_helper'
2
+ require 'string-eater'
3
+
4
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'examples'))
5
+
6
+ require 'nginx'
7
+
8
+ describe NginxLogTokenizer do
9
+ before(:each) do
10
+ @tokenizer = NginxLogTokenizer.new
11
+ @str = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /this_is_a_url HTTP/1.1" 304 152 "http://referrer.com" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" "-" "there could be" other "stuff here"'
12
+ end
13
+
14
+ {
15
+ :ip => "73.80.217.212",
16
+ :request => "GET /this_is_a_url HTTP/1.1",
17
+ :status_code => 304,
18
+ :referrer_url => "http://referrer.com",
19
+ :user_agent => "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
20
+ :remainder => "\"there could be\" other \"stuff here\"",
21
+ }.each_pair do |token,val|
22
+ it "should find the right value for #{token}" do
23
+ @tokenizer.tokenize!(@str).send(token).should == val
24
+ end
25
+ end
26
+
27
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1 @@
1
+ $LOAD_PATH.concat %w[./lib ./ext/string-eater]
data/spec/string_eater_spec.rb ADDED
@@ -0,0 +1,133 @@
1
+ require 'spec_helper'
2
+ require 'string-eater'
3
+
4
+ TestedClass = StringEater::CTokenizer
5
+
6
+ describe StringEater do
7
+ it "should have a version" do
8
+ StringEater::VERSION::STRING.split(".").size.should >= 3
9
+ end
10
+ end
11
+
12
+ # normal use
13
+ class Example1 < TestedClass
14
+ add_field :first_word
15
+ look_for " "
16
+ add_field :second_word, :extract => false
17
+ look_for "|"
18
+ add_field :third_word
19
+ end
20
+
21
+ describe Example1 do
22
+
23
+ before(:each) do
24
+ @tokenizer = Example1.new
25
+ @str1 = "foo bar|baz"
26
+ @first_word1 = "foo"
27
+ @third_word1 = "baz"
28
+ @bp1 = [0, 3,4,7,8,11]
29
+ end
30
+
31
+ describe "find_breakpoints" do
32
+ it "should return an array of the breakpoints" do
33
+ @tokenizer.find_breakpoints(@str1).should == @bp1 if @tokenizer.respond_to?(:find_breakpoints)
34
+ end
35
+ end
36
+
37
+ describe "tokenize!" do
38
+ it "should return itself" do
39
+ @tokenizer.tokenize!(@str1).should == @tokenizer
40
+ end
41
+
42
+ it "should set the first word" do
43
+ @tokenizer.tokenize!(@str1).first_word.should == "foo"
44
+ end
45
+
46
+ it "should set the third word" do
47
+ @tokenizer.tokenize!(@str1).third_word.should == "baz"
48
+ end
49
+
50
+ it "should not set the second word" do
51
+ @tokenizer.tokenize!(@str1).second_word.should be_nil
52
+ end
53
+
54
+ it "should yield a hash of tokens if a block is given" do
55
+ @tokenizer.tokenize!(@str1) do |tokens|
56
+ tokens[:first_word].should == "foo"
57
+ end
58
+ end
59
+
60
+ it "should return everything to the end of the line for the last token" do
61
+ s = "c defg asdf | foo , baa"
62
+ @tokenizer.tokenize!("a b|#{s}").third_word.should == s
63
+ end
64
+
65
+ end
66
+
67
+ end
68
+
69
+ # an example where we ignore after a certain point in the string
70
+ class Example2 < TestedClass
71
+ add_field :first_word, :extract => false
72
+ look_for " "
73
+ add_field :second_word
74
+ look_for " "
75
+ add_field :third_word, :extract => false
76
+ look_for "-"
77
+ end
78
+
79
+ describe Example2 do
80
+
81
+ before(:each) do
82
+ @tokenizer = Example2.new
83
+ @str1 = "foo bar baz-"
84
+ @second_word1 = "bar"
85
+ end
86
+
87
+ describe "tokenize!" do
88
+ it "should find the token when there is extra stuff at the end of the string" do
89
+ @tokenizer.tokenize!(@str1).second_word.should == @second_word1
90
+ end
91
+ end
92
+
93
+ end
94
+
95
+ # CTokenizer doesn't do combine_fields because
96
+ # writing out breakpoints is a significant slow-down
97
+ if TestedClass.respond_to?(:combine_fields)
98
+ # an example where we combine fields
99
+ class Example3 < TestedClass
100
+ add_field :first_word, :extract => false
101
+ look_for " \""
102
+ add_field :part1, :extract => false
103
+ look_for " "
104
+ add_field :part2
105
+ look_for " "
106
+ add_field :part3, :extract => false
107
+ look_for "\""
108
+
109
+ combine_fields :from => :part1, :to => :part3, :as => :parts
110
+ end
111
+
112
+ describe Example3 do
113
+ before(:each) do
114
+ @tokenizer = Example3.new
115
+ @str1 = "foo \"bar baz bang\""
116
+ @part2 = "baz"
117
+ @parts = "bar baz bang"
118
+ end
119
+
120
+ it "should extract like normal" do
121
+ @tokenizer.tokenize!(@str1).part2.should == @part2
122
+ end
123
+
124
+ it "should ignore like normal" do
125
+ @tokenizer.tokenize!(@str1).part1.should be_nil
126
+ end
127
+
128
+ it "should extract the combined field" do
129
+ @tokenizer.tokenize!(@str1).parts.should == @parts
130
+ end
131
+
132
+ end
133
+ end
metadata ADDED
@@ -0,0 +1,66 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: string-eater
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Dan Swain
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-08-20 00:00:00.000000000 Z
13
+ dependencies: []
14
+ description: Fast string tokenizer. Nom strings.
15
+ email:
16
+ - dan@simpli.fi
17
+ executables: []
18
+ extensions:
19
+ - ext/string-eater/extconf.rb
20
+ extra_rdoc_files: []
21
+ files:
22
+ - lib/c-tokenizer.rb
23
+ - lib/ruby-tokenizer-each-char.rb
24
+ - lib/ruby-tokenizer.rb
25
+ - lib/string-eater.rb
26
+ - lib/token.rb
27
+ - lib/version.rb
28
+ - ext/string-eater/extconf.rb
29
+ - ext/string-eater/c-tokenizer.c
30
+ - spec/nginx_spec.rb
31
+ - spec/spec_helper.rb
32
+ - spec/string_eater_spec.rb
33
+ - examples/address.rb
34
+ - examples/nginx.rb
35
+ - LICENSE
36
+ - Rakefile
37
+ - README.md
38
+ homepage: http://github.com/simplifi/string-eater
39
+ licenses: []
40
+ post_install_message:
41
+ rdoc_options: []
42
+ require_paths:
43
+ - lib
44
+ - ext/string-eater
45
+ required_ruby_version: !ruby/object:Gem::Requirement
46
+ none: false
47
+ requirements:
48
+ - - ! '>='
49
+ - !ruby/object:Gem::Version
50
+ version: '0'
51
+ required_rubygems_version: !ruby/object:Gem::Requirement
52
+ none: false
53
+ requirements:
54
+ - - ! '>='
55
+ - !ruby/object:Gem::Version
56
+ version: '0'
57
+ requirements: []
58
+ rubyforge_project:
59
+ rubygems_version: 1.8.24
60
+ signing_key:
61
+ specification_version: 3
62
+ summary: Fast string tokenizer. Nom strings.
63
+ test_files:
64
+ - spec/nginx_spec.rb
65
+ - spec/spec_helper.rb
66
+ - spec/string_eater_spec.rb