string-eater 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/LICENSE ADDED
@@ -0,0 +1,24 @@
+ Copyright (c) 2012 Dan Swain
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+
data/README.md ADDED
@@ -0,0 +1,133 @@
+ # String Eater
+
+ A fast Ruby string tokenizer. It eats strings and dumps tokens.
+
+ ## License
+
+ String Eater is released under the
+ [MIT license](http://en.wikipedia.org/wiki/MIT_License).
+ See the LICENSE file.
+
+ ## Requirements
+
+ String Eater probably only works in Ruby 1.9.2+ with MRI. It's been
+ tested with Ruby 1.9.3p194.
+
+ String Eater uses a C extension, so it will only work on Ruby
+ implementations that provide support for C extensions.
+
+ ## Installation
+
+ We'll publish this gem soon, but for now you can clone and install it as
+
+     git clone git://github.com/dantswain/string-eater.git
+     cd string-eater
+     rake install
+
+ If you are working on a system where you need to `sudo gem install`,
+ you can do
+
+     rake gem
+     sudo gem install string-eater
+
+ As always, you can `rake -T` to find out what other rake tasks we have
+ provided.
+
+ ## Basic Usage
+
+ Suppose we want to tokenize a string that contains address information
+ for a person and is consistently formatted like
+
+     Last Name, First Name | Street address, City, State, Zip
+
+ Suppose we only want to extract the last name, city, and state.
+
+ To do this using String Eater, create a subclass of
+ `StringEater::Tokenizer` like this:
+
+     require 'string-eater'
+
+     class PersonTokenizer < StringEater::Tokenizer
+       add_field :last_name
+       look_for ", "
+       add_field :first_name, :extract => false
+       look_for " | "
+       add_field :street_address, :extract => false
+       look_for ", "
+       add_field :city
+       look_for ", "
+       add_field :state
+       look_for ", "
+     end
+
+ Note the use of `:extract => false` to specify fields that are important
+ to the structure of the line but that we don't necessarily need to
+ extract.
+
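+ A field declared with `:extract => false` still gets an accessor method,
+ but since its value is never copied out of the string, that accessor
+ simply returns `nil` after `tokenize!` (so `tokenizer.first_name` in the
+ example below comes back as `nil`).
+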
+ Then, we can tokenize the string like this:
+
+     tokenizer = PersonTokenizer.new
+     string = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
+     tokenizer.tokenize! string
+
+     puts tokenizer.last_name # => "Flinstone"
+     puts tokenizer.city      # => "Bedrock"
+     puts tokenizer.state     # => "NA"
+
+ We can also pass a block, which receives the hash of extracted tokens:
+
+     tokenizer.tokenize!(string) do |tokens|
+       puts "The #{tokens[:last_name]}s live in #{tokens[:city]}"
+     end
+
+ For another example, see `examples/nginx.rb`, which defines an
+ [nginx](http://nginx.org) log line tokenizer.
+
+ ## Implementation
+
+ There are actually three tokenizer algorithms provided here. The three
+ algorithms should be interchangeable (see the sketch after this list).
+
+ 1. `StringEater::CTokenizer` - A C extension implementation. The
+    fastest of the three. This is the default implementation for
+    `StringEater::Tokenizer`.
+
+ 2. `StringEater::RubyTokenizer` - A pure-Ruby implementation. It uses
+    a slightly different algorithm - one that is faster in Ruby than a
+    direct translation of the C algorithm would be. It is probably not
+    much faster (if at all) than using Ruby regular expressions.
+
+ 3. `StringEater::RubyTokenizerEachChar` - A pure-Ruby implementation.
+    This is essentially the same as the C implementation, but written
+    in pure Ruby. It uses `String#each_char` and is therefore VERY
+    SLOW! It provides a good way to hack on the algorithm, though.
+
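+ All three classes expose the same class-level DSL (`add_field` and
+ `look_for`), so switching implementations amounts to changing the
+ superclass. A minimal sketch (the `PurePersonTokenizer` name is just for
+ illustration and is not part of the gem):
+
+     class PurePersonTokenizer < StringEater::RubyTokenizer
+       add_field :last_name
+       look_for ", "
+       add_field :first_name, :extract => false
+     end
+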
+ The main algorithm works by finding the start and end points of each
+ token in the string. The search is done incrementally (i.e., it loops
+ through the string looking for each separator sequence in turn). The
+ algorithm is "lazy" in the sense that only the required tokens are
+ copied out ("extracted").
+
+ ## Performance
+
+ Soon I'll add some code here to run your own benchmarks.
+
+ I've run my own benchmarks comparing String Eater to code that does the
+ same task (tokenizing nginx log lines) using Ruby regular expressions.
+ So far, String Eater is more than twice as fast: over 100,000 lines per
+ second on my laptop vs less than 50,000 lines per second for the regular
+ expression version. I'm working to further optimize the String Eater code.
+
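+ Until that lands, a rough comparison can be run with Ruby's standard
+ `benchmark` library; this sketch reuses the `PersonTokenizer` class from
+ above, and both the regular expression and the iteration count are only
+ illustrative:
+
+     require 'benchmark'
+
+     line = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
+     regexp = /^([^,]+), ([^|]+) \| ([^,]+), ([^,]+), ([^,]+), (.+)$/
+     tokenizer = PersonTokenizer.new
+
+     n = 100_000
+     Benchmark.bm(12) do |bm|
+       bm.report("regexp")       { n.times { line.match(regexp) } }
+       bm.report("string-eater") { n.times { tokenizer.tokenize!(line) } }
+     end
+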
+ ## Contributing
+
+ The usual GitHub process applies here:
+
+ 1. Fork it
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Added some feature'`)
+ 4. Push to the branch (`git push origin my-new-feature`)
+ 5. Create a new Pull Request
+
+ You can also contribute to the author's ego by letting him know that
+ you find String Eater useful ;)
data/Rakefile ADDED
@@ -0,0 +1,33 @@
+ require 'rake/clean'
+
+ desc "Run rspec spec/ (compile if needed)"
+ task :test => :compile do
+ sh "rspec spec/"
+ end
+
+ so_ext = RbConfig::CONFIG['DLEXT']
+ ext_dir = "ext/string-eater"
+ ext_file = ext_dir + "/c_tokenizer_ext.#{so_ext}"
+
+ file ext_file => Dir.glob("ext/string-eater/*{.rb,.c}") do
+ Dir.chdir("ext/string-eater") do
+ ruby "extconf.rb"
+ sh "make"
+ end
+ end
+
+ desc "Create gem"
+ task :gem => "string-eater.gemspec" do
+ sh "gem build string-eater.gemspec"
+ end
+
+ desc "Install using 'gem install'"
+ task :install => :gem do
+ sh "gem install string-eater"
+ end
+
+ desc "Compile the extension"
+ task :compile => ext_file
+
+ CLEAN.include('ext/**/*{.o,.log,.so,.bundle}')
+ CLEAN.include('ext/**/Makefile')
data/examples/address.rb ADDED
@@ -0,0 +1,35 @@
+ # once the gem is installed, you don't need this
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext/string-eater'))
+
+ # this is the example from the README
+ require 'string-eater'
+
+ class PersonTokenizer < StringEater::Tokenizer
+ add_field :last_name
+ look_for ", "
+ add_field :first_name, :extract => false
+ look_for " | "
+ add_field :street_address, :extract => false
+ look_for ", "
+ add_field :city
+ look_for ", "
+ add_field :state
+ look_for ", "
+ end
+
+ if __FILE__ == $0
+ tokenizer = PersonTokenizer.new
+ puts tokenizer.describe_line
+
+ string = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
+ tokenizer.tokenize! string
+
+ puts tokenizer.last_name # => "Flinstone"
+ puts tokenizer.city # => "Bedrock"
+ puts tokenizer.state # => "NA"
+
+ tokenizer.tokenize!(string) do |tokens|
+ puts "The #{tokens[:last_name]}s live in #{tokens[:city]}"
+ end
+ end
data/examples/nginx.rb ADDED
@@ -0,0 +1,70 @@
+ # once the gem is installed, you don't need this
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext/string-eater'))
+
+ require 'string-eater'
+
+ class NginxLogTokenizer < StringEater::CTokenizer
+ add_field :ip
+ look_for " - "
+ add_field :remote_user, :extract => false
+ look_for " ["
+ add_field :timestamp, :extract => false
+ look_for "] \""
+ add_field :request
+ look_for "\" "
+ add_field :status_code
+ look_for " "
+ add_field :bytes_sent, :extract => false
+ look_for " \""
+ add_field :referrer_url
+ look_for "\" \""
+ add_field :user_agent
+ look_for "\" \""
+ add_field :compression, :extract => false
+ look_for "\" "
+ add_field :remainder
+
+ def status_code
+ @extracted_tokens[:status_code].to_i
+ end
+
+ def request_verb
+ @extracted_tokens[:request_verb]
+ end
+
+ def request_url
+ @extracted_tokens[:request_url]
+ end
+
+ def do_extra_parsing
+ return unless @extracted_tokens[:request]
+ request_parts = @extracted_tokens[:request].split
+ if request_parts.size == 3
+ @extracted_tokens[:request_verb] = request_parts[0]
+ @extracted_tokens[:request_url] = request_parts[1]
+ end
+ end
+ end
+
+ if __FILE__ == $0
+ tokenizer = NginxLogTokenizer.new
+ puts tokenizer.describe_line
+
+ str = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /this_is_a_url HTTP/1.1" 304 152 "http://referrer.com" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" "-" "there could be" other "stuff here"'
+
+ puts "input string: " + str
+ puts "Tokens: "
+
+ # use a block to work with the extracted tokens
+ tokenizer.tokenize!(str) do |tokens|
+ tokens.each do |token|
+ puts "\t" + token.inspect
+ end
+ end
+
+ # use the token's name as a method to get its value
+ puts tokenizer.ip
+ puts tokenizer.status_code
+ puts tokenizer.request_verb
+ end
data/ext/string-eater/c-tokenizer.c ADDED
@@ -0,0 +1,141 @@
+ #include <ruby.h>
+
+ /* not used in production - useful for debugging */
+ #define puts_inspect(var) \
+ ID inspect = rb_intern("inspect"); \
+ VALUE x = rb_funcall(var, inspect, 0); \
+ printf("%s\n", StringValueCStr(x));
+
+ static VALUE rb_cCTokenizer;
+ static VALUE rb_mStringEater;
+
+ static VALUE tokenize_string(VALUE self,
+ VALUE string,
+ VALUE tokens_to_find_indexes,
+ VALUE tokens_to_find_strings,
+ VALUE tokens_to_extract_indexes,
+ VALUE tokens_to_extract_names)
+ {
+ const char* input_string = StringValueCStr(string);
+ VALUE extracted_tokens = rb_hash_new();
+ VALUE curr_token;
+ unsigned int curr_token_ix;
+ long n_tokens_to_find = RARRAY_LEN(tokens_to_find_indexes);
+ size_t str_len = strlen(input_string);
+ size_t ix;
+ char c;
+ char looking_for;
+ size_t looking_for_len;
+ size_t looking_for_ix = 0;
+ long find_ix = 0;
+ const char* looking_for_token;
+ unsigned int n_tokens = (unsigned int)RARRAY_LEN(rb_iv_get(self, "@tokens"));
+
+ size_t startpoint = 0;
+
+ long n_tokens_to_extract = RARRAY_LEN(tokens_to_extract_indexes);
+ long last_token_extracted_ix = 0;
+
+ long next_token_to_extract_ix = NUM2UINT(rb_ary_entry(tokens_to_extract_indexes, last_token_extracted_ix));
+
+ curr_token = rb_ary_entry(tokens_to_find_strings, find_ix);
+ curr_token_ix = NUM2UINT(rb_ary_entry(tokens_to_find_indexes, find_ix));
+ looking_for_token = StringValueCStr(curr_token);
+ looking_for_len = strlen(looking_for_token);
+ looking_for = looking_for_token[looking_for_ix];
+
+ for(ix = 0; ix < str_len; ix++)
+ {
+ c = input_string[ix];
+ if(c == looking_for)
+ {
+ if(looking_for_ix == 0)
+ {
+ /* entering new token */
+ if(curr_token_ix > 0)
+ {
+ /* extract, if necessary */
+ if((curr_token_ix - 1) == next_token_to_extract_ix)
+ {
+ last_token_extracted_ix++;
+ if(last_token_extracted_ix < n_tokens_to_extract)
+ {
+ next_token_to_extract_ix = NUM2UINT(rb_ary_entry(tokens_to_extract_indexes, last_token_extracted_ix));
+ }
+ else
+ {
+ next_token_to_extract_ix = -1;
+ }
+ rb_hash_aset(extracted_tokens,
+ rb_ary_entry(tokens_to_extract_names, curr_token_ix - 1),
+ rb_usascii_str_new(input_string + startpoint,
+ ix - startpoint));
+ }
+ }
+ startpoint = ix;
+ }
+ if(looking_for_ix >= looking_for_len - 1)
+ {
+ /* leaving token */
+ if(curr_token_ix >= n_tokens-1)
+ {
+ break;
+ }
+ else
+ {
+ startpoint = ix + 1;
+ }
+
+
+ /* next token */
+ find_ix++;
+ if(find_ix >= n_tokens_to_find)
+ {
+ /* done! */
+ break;
+ }
+
+ curr_token = rb_ary_entry(tokens_to_find_strings, find_ix);
+ curr_token_ix = NUM2UINT(rb_ary_entry(tokens_to_find_indexes, find_ix));
+ looking_for_token = StringValueCStr(curr_token);
+ looking_for_len = strlen(looking_for_token);
+ looking_for_ix = 0;
+ }
+ else
+ {
+ looking_for_ix++;
+ }
+ looking_for = looking_for_token[looking_for_ix];
+ }
+ }
+
+ ix = str_len;
+ curr_token_ix = n_tokens - 1;
+
+ if(curr_token_ix == next_token_to_extract_ix)
+ {
+ rb_hash_aset(extracted_tokens,
+ rb_ary_entry(tokens_to_extract_names, curr_token_ix),
+ rb_usascii_str_new(input_string + startpoint,
+ ix - startpoint));
+ }
+
+ return extracted_tokens;
+ }
+
+ void finalize_c_tokenizer_ext(VALUE unused)
+ {
+ /* free memory, etc */
+ }
+
+ void Init_c_tokenizer_ext(void)
+ {
+ rb_mStringEater = rb_define_module("StringEater");
+ rb_cCTokenizer = rb_define_class_under(rb_mStringEater,
+ "CTokenizer", rb_cObject);
+
+ rb_define_method(rb_cCTokenizer, "ctokenize!", tokenize_string, 5);
+
+ /* set the callback for when the extension is unloaded */
+ rb_set_end_proc(finalize_c_tokenizer_ext, 0);
+ }
data/ext/string-eater/extconf.rb ADDED
@@ -0,0 +1,2 @@
+ require 'mkmf'
+ create_makefile('c_tokenizer_ext')
data/lib/c-tokenizer.rb ADDED
@@ -0,0 +1,93 @@
+ require 'c_tokenizer_ext'
+
+ class StringEater::CTokenizer
+ def self.tokens
+ @tokens ||= []
+ end
+
+ def self.add_field name, opts={}
+ self.tokens << StringEater::Token::new_field(name, opts)
+ define_method(name) {@extracted_tokens[name]}
+ end
+
+ def self.look_for tokens
+ self.tokens << StringEater::Token::new_separator(tokens)
+ end
+
+ def initialize
+ refresh_tokens
+ end
+
+ def tokens
+ @tokens
+ end
+
+ def refresh_tokens
+ @tokens = self.class.tokens
+ tokens_to_find = tokens.each_with_index.map do |t, i|
+ [i, t.string] if t.string
+ end.compact
+
+ @tokens_to_find_indexes = tokens_to_find.map{|t| t[0]}
+ @tokens_to_find_strings = tokens_to_find.map{|t| t[1]}
+
+ tokens_to_extract = tokens.each_with_index.map do |t, i|
+ [i, t.name] if t.extract?
+ end.compact
+
+ @tokens_to_extract_indexes = tokens_to_extract.map{|t| t[0]}
+ @tokens_to_extract_names = tokens.map{|t| t.name}
+ end
+
+ def describe_line
+ tokens.inject("") do |desc, t|
+ desc << (t.string || t.name.to_s || "xxxxxx")
+ end
+ end
+
+ def do_extra_parsing
+ end
+
+ def tokenize! string, &block
+ @string = string
+ @extracted_tokens ||= {}
+ @extracted_tokens.clear
+
+ tokens.first.breakpoints[0] = 0
+
+ @extracted_tokens = ctokenize!(@string,
+ @tokens_to_find_indexes,
+ @tokens_to_find_strings,
+ @tokens_to_extract_indexes,
+ @tokens_to_extract_names)
+
+ # extra parsing hook
+ do_extra_parsing
+
+ if block_given?
+ yield @extracted_tokens
+ end
+
+ # return self for chaining
+ self
+ end
+
+ private
+
+ def set_token_startpoint ix, startpoint
+ @tokens[ix].breakpoints[0] = startpoint
+ end
+
+ def get_token_startpoint ix
+ @tokens[ix].breakpoints[0]
+ end
+
+ def set_token_endpoint ix, endpoint
+ @tokens[ix].breakpoints[1] = endpoint
+ end
+
+ def extract_token? ix
+ @tokens[ix].extract?
+ end
+
+ end
data/lib/ruby-tokenizer-each-char.rb ADDED
@@ -0,0 +1,145 @@
+ # this tokenizer is very slow, but it illustrates the
+ # basic idea of the C tokenizer
+ class StringEater::RubyTokenizerEachChar
+
+ def self.tokens
+ @tokens ||= []
+ end
+
+ def self.combined_tokens
+ @combined_tokens ||= []
+ end
+
+ def self.add_field name, opts={}
+ self.tokens << StringEater::Token::new_field(name, opts)
+ define_method(name) {@extracted_tokens[name]}
+ end
+
+ def self.look_for tokens
+ self.tokens << StringEater::Token::new_separator(tokens)
+ end
+
+ def self.combine_fields opts={}
+ from_token_index = self.tokens.index{|t| t.name == opts[:from]}
+ to_token_index = self.tokens.index{|t| t.name == opts[:to]}
+ self.combined_tokens << [opts[:as], from_token_index, to_token_index]
+ define_method(opts[:as]) {@extracted_tokens[opts[:as]]}
+ end
+
+ def tokens
+ @tokens ||= self.class.tokens
+ end
+
+ def combined_tokens
+ @combined_tokens ||= self.class.combined_tokens
+ end
+
+ def refresh_tokens
+ @combined_tokens = nil
+ @tokens = nil
+ tokens
+ end
+
+ def describe_line
+ tokens.inject("") do |desc, t|
+ desc << (t.string || t.name.to_s || "xxxxxx")
+ end
+ end
+
+ def find_breakpoints string
+ tokenize!(string) unless @string == string
+ tokens.inject([]) do |bp, t|
+ bp << t.breakpoints
+ bp
+ end.flatten.uniq
+ end
+
+ def tokenize! string, &block
+ @string = string
+ @extracted_tokens ||= {}
+ @extracted_tokens.clear
+ @tokens_to_find ||= tokens.each_with_index.map do |t, i|
+ [i, t.string] if t.string
+ end.compact
+ @tokens_to_extract_indeces ||= tokens.each_with_index.map do |t, i|
+ i if t.extract?
+ end.compact
+
+ tokens.first.breakpoints[0] = 0
+
+ find_index = 0
+
+ curr_token = @tokens_to_find[find_index]
+ curr_token_index = curr_token[0]
+ curr_token_length = curr_token[1].length
+ looking_for_index = 0
+ looking_for = curr_token[1][looking_for_index]
+
+ counter = 0
+ string.each_char do |c|
+ if c == looking_for
+ if looking_for_index == 0
+ # entering new token
+ if curr_token_index > 0
+ t = tokens[curr_token_index - 1]
+ t.breakpoints[1] = counter
+ if t.extract?
+ @extracted_tokens[t.name] = string[t.breakpoints[0]...t.breakpoints[1]]
+ end
+ end
+ tokens[curr_token_index].breakpoints[0] = counter
+ end
+ if looking_for_index >= (curr_token_length - 1)
+ # leaving token
+ tokens[curr_token_index].breakpoints[1] = counter
+
+ if curr_token_index >= tokens.size-1
+ # we're done!
+ break
+ else
+ tokens[curr_token_index + 1].breakpoints[0] = counter + 1
+ end
+
+ # next token
+ find_index += 1
+ if find_index >= @tokens_to_find.length
+ # we're done!
+ break
+ end
+ curr_token = @tokens_to_find[find_index]
+ curr_token_index = curr_token[0]
+ curr_token_length = curr_token[1].length
+ looking_for_index = 0
+ else
+ looking_for_index += 1
+ end
+ end
+ looking_for = curr_token[1][looking_for_index]
+ counter += 1
+ end
+
+ last_token = tokens.last
+ last_token.breakpoints[1] = string.length
+
+ if last_token.extract?
+ @extracted_tokens[last_token.name] = string[last_token.breakpoints[0]..last_token.breakpoints[1]]
+ end
+
+ combined_tokens.each do |combiner|
+ name = combiner[0]
+ from = @tokens[combiner[1]].breakpoints[0]
+ to = @tokens[combiner[2]].breakpoints[1]
+ @extracted_tokens[name] = string[from...to]
+ end
+
+ if block_given?
+ yield @extracted_tokens
+ end
+
+ # return self for chaining
+ self
+ end
+
+ end
+
+
data/lib/ruby-tokenizer.rb ADDED
@@ -0,0 +1,98 @@
+ # this tokenizer is fairly fast, but not necessarily faster than regexps
+ class StringEater::RubyTokenizer
+ def self.tokens
+ @tokens ||= []
+ end
+
+ def self.combined_tokens
+ @combined_tokens ||= []
+ end
+
+ def self.add_field name, opts={}
+ self.tokens << StringEater::Token::new_field(name, opts)
+ define_method(name) {@extracted_tokens[name]}
+ end
+
+ def self.look_for tokens
+ self.tokens << StringEater::Token::new_separator(tokens)
+ end
+
+ def self.combine_fields opts={}
+ from_token_index = self.tokens.index{|t| t.name == opts[:from]}
+ to_token_index = self.tokens.index{|t| t.name == opts[:to]}
+ self.combined_tokens << [opts[:as], from_token_index, to_token_index]
+ define_method(opts[:as]) {@extracted_tokens[opts[:as]]}
+ end
+
+ def tokens
+ @tokens ||= self.class.tokens
+ end
+
+ def combined_tokens
+ @combined_tokens ||= self.class.combined_tokens
+ end
+
+ def refresh_tokens
+ @combined_tokens = nil
+ @tokens = nil
+ tokens
+ end
+
+ def describe_line
+ tokens.inject("") do |desc, t|
+ desc << (t.string || t.name.to_s || "xxxxxx")
+ end
+ end
+
+ def find_breakpoints(string)
+ @literal_tokens ||= tokens.select{|t| t.string}
+ @breakpoints ||= Array.new(2*@literal_tokens.size + 2)
+ @breakpoints[0] = 0
+ @breakpoints[-1] = string.length
+ start_point = 0
+ @literal_tokens.each_with_index do |t, i|
+ @breakpoints[2*i+1], start_point = find_end_of(t, string, start_point)
+ @breakpoints[2*i+2] = start_point
+ end
+ @breakpoints
+ end
+
+ def tokenize! string, &block
+ @extracted_tokens ||= {}
+ @extracted_tokens.clear
+ @tokens_to_extract ||= tokens.select{|t| t.extract?}
+
+ find_breakpoints(string)
+ last_important_bp = [@breakpoints.length, tokens.size].min
+ (0...last_important_bp).each do |i|
+ tokens[i].breakpoints = [@breakpoints[i], @breakpoints[i+1]]
+ end
+
+ @tokens_to_extract.each do |t|
+ @extracted_tokens[t.name] = string[t.breakpoints[0]...t.breakpoints[1]]
+ end
+
+ combined_tokens.each do |combiner|
+ name = combiner[0]
+ from = @tokens[combiner[1]].breakpoints[0]
+ to = @tokens[combiner[2]].breakpoints[1]
+ @extracted_tokens[name] = string[from...to]
+ end
+
+ if block_given?
+ yield @extracted_tokens
+ end
+
+ # return self for chaining
+ self
+ end
+
+ protected
+
+ def find_end_of token, string, start_at
+ start = string.index(token.string, start_at+1) || string.length
+ [start, [start + token.string.length, string.length].min]
+ end
+
+ end
+
data/lib/string-eater.rb ADDED
@@ -0,0 +1,10 @@
+ module StringEater
+ autoload :Token, 'token'
+ autoload :RubyTokenizer, 'ruby-tokenizer'
+ autoload :RubyTokenizerEachChar, 'ruby-tokenizer-each-char'
+ autoload :CTokenizer, 'c-tokenizer'
+
+ autoload :VERSION, 'version'
+
+ class Tokenizer < CTokenizer; end
+ end
data/lib/token.rb ADDED
@@ -0,0 +1,26 @@
+ class StringEater::Token
+ attr_accessor :name, :string, :opts, :breakpoints, :children
+
+ def initialize
+ @opts = {}
+ @breakpoints = [nil,nil]
+ end
+
+ def extract?
+ @opts[:extract]
+ end
+
+ def self.new_field(name, opts)
+ t = new
+ t.name = name
+ t.opts = {:extract => true}.merge(opts)
+ t
+ end
+
+ def self.new_separator(string)
+ t = new
+ t.string = string
+ t
+ end
+
+ end
data/lib/version.rb ADDED
@@ -0,0 +1,9 @@
+ module StringEater
+ module VERSION
+ MAJOR = 0
+ MINOR = 1
+ PATCH = 0
+ PRE = nil
+ STRING = [MAJOR, MINOR, PATCH, PRE].compact.join('.')
+ end
+ end
data/spec/nginx_spec.rb ADDED
@@ -0,0 +1,27 @@
+ require 'spec_helper'
+ require 'string-eater'
+
+ $: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'examples'))
+
+ require 'nginx'
+
+ describe NginxLogTokenizer do
+ before(:each) do
+ @tokenizer = NginxLogTokenizer.new
+ @str = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /this_is_a_url HTTP/1.1" 304 152 "http://referrer.com" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" "-" "there could be" other "stuff here"'
+ end
+
+ {
+ :ip => "73.80.217.212",
+ :request => "GET /this_is_a_url HTTP/1.1",
+ :status_code => 304,
+ :referrer_url => "http://referrer.com",
+ :user_agent => "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
+ :remainder => "\"there could be\" other \"stuff here\"",
+ }.each_pair do |token,val|
+ it "should find the right value for #{token}" do
+ @tokenizer.tokenize!(@str).send(token).should == val
+ end
+ end
+
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1 @@
+ $LOAD_PATH.concat %w[./lib ./ext/string-eater]
data/spec/string_eater_spec.rb ADDED
@@ -0,0 +1,133 @@
+ require 'spec_helper'
+ require 'string-eater'
+
+ TestedClass = StringEater::CTokenizer
+
+ describe StringEater do
+ it "should have a version" do
+ StringEater::VERSION::STRING.split(".").size.should >= 3
+ end
+ end
+
+ # normal use
+ class Example1 < TestedClass
+ add_field :first_word
+ look_for " "
+ add_field :second_word, :extract => false
+ look_for "|"
+ add_field :third_word
+ end
+
+ describe Example1 do
+
+ before(:each) do
+ @tokenizer = Example1.new
+ @str1 = "foo bar|baz"
+ @first_word1 = "foo"
+ @third_word1 = "baz"
+ @bp1 = [0, 3,4,7,8,11]
+ end
+
+ describe "find_breakpoints" do
+ it "should return an array of the breakpoints" do
+ @tokenizer.find_breakpoints(@str1).should == @bp1 if @tokenizer.respond_to?(:find_breakpoints)
+ end
+ end
+
+ describe "tokenize!" do
+ it "should return itself" do
+ @tokenizer.tokenize!(@str1).should == @tokenizer
+ end
+
+ it "should set the first word" do
+ @tokenizer.tokenize!(@str1).first_word.should == "foo"
+ end
+
+ it "should set the third word" do
+ @tokenizer.tokenize!(@str1).third_word.should == "baz"
+ end
+
+ it "should not set the second word" do
+ @tokenizer.tokenize!(@str1).second_word.should be_nil
+ end
+
+ it "should yield a hash of tokens if a block is given" do
+ @tokenizer.tokenize!(@str1) do |tokens|
+ tokens[:first_word].should == "foo"
+ end
+ end
+
+ it "should return everything to the end of the line for the last token" do
+ s = "c defg asdf | foo , baa"
+ @tokenizer.tokenize!("a b|#{s}").third_word.should == s
+ end
+
+ end
+
+ end
+
+ # an example where we ignore after a certain point in the string
+ class Example2 < TestedClass
+ add_field :first_word, :extract => false
+ look_for " "
+ add_field :second_word
+ look_for " "
+ add_field :third_word, :extract => false
+ look_for "-"
+ end
+
+ describe Example2 do
+
+ before(:each) do
+ @tokenizer = Example2.new
+ @str1 = "foo bar baz-"
+ @second_word1 = "bar"
+ end
+
+ describe "tokenize!" do
+ it "should find the token when there is extra stuff at the end of the string" do
+ @tokenizer.tokenize!(@str1).second_word.should == @second_word1
+ end
+ end
+
+ end
+
+ # CTokenizer doesn't do combine_fields because
+ # writing out breakpoints is a significant slow-down
+ if TestedClass.respond_to?(:combine_fields)
+ # an example where we combine fields
+ class Example3 < TestedClass
+ add_field :first_word, :extract => false
+ look_for " \""
+ add_field :part1, :extract => false
+ look_for " "
+ add_field :part2
+ look_for " "
+ add_field :part3, :extract => false
+ look_for "\""
+
+ combine_fields :from => :part1, :to => :part3, :as => :parts
+ end
+
+ describe Example3 do
+ before(:each) do
+ @tokenizer = Example3.new
+ @str1 = "foo \"bar baz bang\""
+ @part2 = "baz"
+ @parts = "bar baz bang"
+ end
+
+ it "should extract like normal" do
+ @tokenizer.tokenize!(@str1).part2.should == @part2
+ end
+
+ it "should ignore like normal" do
+ @tokenizer.tokenize!(@str1).part1.should be_nil
+ end
+
+ it "should extract the combined field" do
+ @tokenizer.tokenize!(@str1).parts.should == @parts
+ end
+
+ end
+ end
metadata ADDED
@@ -0,0 +1,66 @@
+ --- !ruby/object:Gem::Specification
+ name: string-eater
+ version: !ruby/object:Gem::Version
+ version: 0.1.0
+ prerelease:
+ platform: ruby
+ authors:
+ - Dan Swain
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2012-08-20 00:00:00.000000000 Z
+ dependencies: []
+ description: Fast string tokenizer. Nom strings.
+ email:
+ - dan@simpli.fi
+ executables: []
+ extensions:
+ - ext/string-eater/extconf.rb
+ extra_rdoc_files: []
+ files:
+ - lib/c-tokenizer.rb
+ - lib/ruby-tokenizer-each-char.rb
+ - lib/ruby-tokenizer.rb
+ - lib/string-eater.rb
+ - lib/token.rb
+ - lib/version.rb
+ - ext/string-eater/extconf.rb
+ - ext/string-eater/c-tokenizer.c
+ - spec/nginx_spec.rb
+ - spec/spec_helper.rb
+ - spec/string_eater_spec.rb
+ - examples/address.rb
+ - examples/nginx.rb
+ - LICENSE
+ - Rakefile
+ - README.md
+ homepage: http://github.com/simplifi/string-eater
+ licenses: []
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ - ext/string-eater
+ required_ruby_version: !ruby/object:Gem::Requirement
+ none: false
+ requirements:
+ - - ! '>='
+ - !ruby/object:Gem::Version
+ version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+ none: false
+ requirements:
+ - - ! '>='
+ - !ruby/object:Gem::Version
+ version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 1.8.24
+ signing_key:
+ specification_version: 3
+ summary: Fast string tokenizer. Nom strings.
+ test_files:
+ - spec/nginx_spec.rb
+ - spec/spec_helper.rb
+ - spec/string_eater_spec.rb