string-eater 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/LICENSE +24 -0
- data/README.md +133 -0
- data/Rakefile +33 -0
- data/examples/address.rb +35 -0
- data/examples/nginx.rb +70 -0
- data/ext/string-eater/c-tokenizer.c +141 -0
- data/ext/string-eater/extconf.rb +2 -0
- data/lib/c-tokenizer.rb +93 -0
- data/lib/ruby-tokenizer-each-char.rb +145 -0
- data/lib/ruby-tokenizer.rb +98 -0
- data/lib/string-eater.rb +10 -0
- data/lib/token.rb +26 -0
- data/lib/version.rb +9 -0
- data/spec/nginx_spec.rb +27 -0
- data/spec/spec_helper.rb +1 -0
- data/spec/string_eater_spec.rb +133 -0
- metadata +66 -0
data/LICENSE
ADDED
@@ -0,0 +1,24 @@
+Copyright (c) 2012 Dan Swain
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+
data/README.md
ADDED
@@ -0,0 +1,133 @@
+# String Eater
+
+A fast ruby string tokenizer. It eats strings and dumps tokens.
+
+## License
+
+String Eater is released under the
+[MIT license](http://en.wikipedia.org/wiki/MIT_License).
+See the LICENSE file.
+
+## Requirements
+
+String Eater probably only works in Ruby 1.9.2+ with MRI. It's been
+tested with Ruby 1.9.3p194.
+
+String Eater uses a C extension, so it will only work on Ruby
+implementations that provide support for C extensions.
+
+## Installation
+
+We'll publish this gem soon, but for now you can clone and install as follows:
+
+    git clone git://github.com/dantswain/string-eater.git
+    cd string-eater
+    rake install
+
+If you are working on a system where you need to `sudo gem install`
+you can do
+
+    rake gem
+    sudo gem install string-eater
+
+As always, you can `rake -T` to find out what other rake tasks we have
+provided.
+
+## Basic Usage
+
+Suppose we want to tokenize a string that contains address information
+for a person and is consistently formatted like
+
+    Last Name, First Name | Street address, City, State, Zip
+
+Suppose we only want to extract the last name, city, and state.
+
+To do this using String Eater, create a subclass of
+`StringEater::Tokenizer` like this
+
+    require 'string-eater'
+
+    class PersonTokenizer < StringEater::Tokenizer
+      add_field :last_name
+      look_for ", "
+      add_field :first_name, :extract => false
+      look_for " | "
+      add_field :street_address, :extract => false
+      look_for ", "
+      add_field :city
+      look_for ", "
+      add_field :state
+      look_for ", "
+    end
+
+Note the use of `:extract => false` to specify fields that are important
+to the structure of the line but that we don't necessarily need to
+extract.
+
+Then, we can tokenize the string like this:
+
+    tokenizer = PersonTokenizer.new
+    string = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
+    tokenizer.tokenize! string
+
+    puts tokenizer.last_name # => "Flinstone"
+    puts tokenizer.city      # => "Bedrock"
+    puts tokenizer.state     # => "NA"
+
+We can also do something like this:
+
+    tokenizer.tokenize!(string) do |tokens|
+      puts "The #{tokens[:last_name]}s live in #{tokens[:city]}"
+    end
+
+For another example, see `examples/nginx.rb`, which defines an
+[nginx](http://nginx.org) log line tokenizer.
+
+## Implementation
+
+There are actually three tokenizer algorithms provided here. The
+three algorithms should be interchangeable.
+
+1. `StringEater::CTokenizer` - A C extension implementation. The
+   fastest of the three. This is the default implementation for
+   `StringEater::Tokenizer`.
+
+2. `StringEater::RubyTokenizer` - A pure-Ruby implementation. This is
+   a slightly different implementation of the algorithm - an
+   implementation that is faster on Ruby than a translation of the C
+   algorithm. Probably not as fast (or not much faster) than using
+   Ruby regular expressions.
+
+3. `StringEater::RubyTokenizerEachChar` - A pure-Ruby implementation.
+   This is essentially the same as the C implementation, but written
+   in pure Ruby. It uses `String#each_char` and is therefore VERY
+   SLOW! It provides a good way to hack the algorithm, though.
+
+The main algorithm works by finding the start and end points of tokens
+in a string. The search is done incrementally (i.e., loop through the
+string and look for each sequence of characters). The algorithm is
+"lazy" in the sense that only the required tokens are copied for
+output ("extracted").
+
+## Performance
+
+Soon I'll add some code here to run your own benchmarks.
+
+I've run my own benchmarks comparing String Eater to some code that does the
+same task (both tokenizing nginx log lines) using Ruby regular expressions. So
+far, String Eater is more than twice as fast; able to process over 100,000 lines per
+second on my laptop vs less than 50,000 lines per second for the regular
+expression version. I'm working to further optimize the String Eater code.
+
+## Contributing
+
+The usual github process applies here:
+
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Added some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
+
+You can also contribute to the author's ego by letting him know that
+you find String Eater useful ;)
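Note on the "Implementation" section above: since the three tokenizer classes are meant to be interchangeable, the PersonTokenizer example can target the pure-Ruby implementation simply by changing its superclass. A minimal sketch (SlowPersonTokenizer is an illustrative name, not part of the package):

    require 'string-eater'

    # same declarations as PersonTokenizer in the README, but using the
    # pure-Ruby implementation instead of the default C-backed one
    class SlowPersonTokenizer < StringEater::RubyTokenizer
      add_field :last_name
      look_for ", "
      add_field :first_name, :extract => false
      look_for " | "
      add_field :street_address, :extract => false
      look_for ", "
      add_field :city
      look_for ", "
      add_field :state
      look_for ", "
    end

    SlowPersonTokenizer.new.tokenize!("Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000") do |tokens|
      puts tokens[:city]  # => "Bedrock"
    end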
data/Rakefile
ADDED
@@ -0,0 +1,33 @@
+require 'rake/clean'
+
+desc "Run rspec spec/ (compile if needed)"
+task :test => :compile do
+  sh "rspec spec/"
+end
+
+so_ext = RbConfig::CONFIG['DLEXT']
+ext_dir = "ext/string-eater"
+ext_file = ext_dir + "/c_tokenizer_ext.#{so_ext}"
+
+file ext_file => Dir.glob("ext/string-eater/*{.rb,.c}") do
+  Dir.chdir("ext/string-eater") do
+    ruby "extconf.rb"
+    sh "make"
+  end
+end
+
+desc "Create gem"
+task :gem => "string-eater.gemspec" do
+  sh "gem build string-eater.gemspec"
+end
+
+desc "Install using 'gem install'"
+task :install => :gem do
+  sh "gem install string-eater"
+end
+
+desc "Compile the extension"
+task :compile => ext_file
+
+CLEAN.include('ext/**/*{.o,.log,.so,.bundle}')
+CLEAN.include('ext/**/Makefile')
data/examples/address.rb
ADDED
@@ -0,0 +1,35 @@
+# once the gem is installed, you don't need this
+$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
+$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext/string-eater'))
+
+# this is the example from the README
+require 'string-eater'
+
+class PersonTokenizer < StringEater::Tokenizer
+  add_field :last_name
+  look_for ", "
+  add_field :first_name, :extract => false
+  look_for " | "
+  add_field :street_address, :extract => false
+  look_for ", "
+  add_field :city
+  look_for ", "
+  add_field :state
+  look_for ", "
+end
+
+if __FILE__ == $0
+  tokenizer = PersonTokenizer.new
+  puts tokenizer.describe_line
+
+  string = "Flinstone, Fred | 301 Cobblestone Way, Bedrock, NA, 00000"
+  tokenizer.tokenize! string
+
+  puts tokenizer.last_name # => "Flinstone"
+  puts tokenizer.city      # => "Bedrock"
+  puts tokenizer.state     # => "NA"
+
+  tokenizer.tokenize!(string) do |tokens|
+    puts "The #{tokens[:last_name]}s live in #{tokens[:city]}"
+  end
+end
data/examples/nginx.rb
ADDED
@@ -0,0 +1,70 @@
+# once the gem is installed, you don't need this
+$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
+$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext/string-eater'))
+
+require 'string-eater'
+
+class NginxLogTokenizer < StringEater::CTokenizer
+  add_field :ip
+  look_for " - "
+  add_field :remote_user, :extract => false
+  look_for " ["
+  add_field :timestamp, :extract => false
+  look_for "] \""
+  add_field :request
+  look_for "\" "
+  add_field :status_code
+  look_for " "
+  add_field :bytes_sent, :extract => false
+  look_for " \""
+  add_field :referrer_url
+  look_for "\" \""
+  add_field :user_agent
+  look_for "\" \""
+  add_field :compression, :extract => false
+  look_for "\" "
+  add_field :remainder
+
+  def status_code
+    @extracted_tokens[:status_code].to_i
+  end
+
+  def request_verb
+    @extracted_tokens[:request_verb]
+  end
+
+  def request_url
+    @extracted_tokens[:request_url]
+  end
+
+  def do_extra_parsing
+    return unless @extracted_tokens[:request]
+    request_parts = @extracted_tokens[:request].split
+    if request_parts.size == 3
+      @extracted_tokens[:request_verb] = request_parts[0]
+      @extracted_tokens[:request_url] = request_parts[1]
+    end
+  end
+end
+
+if __FILE__ == $0
+  tokenizer = NginxLogTokenizer.new
+  puts tokenizer.describe_line
+
+  str = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /this_is_a_url HTTP/1.1" 304 152 "http://referrer.com" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" "-" "there could be" other "stuff here"'
+
+  puts "input string: " + str
+  puts "Tokens: "
+
+  # use a block to work with the extracted tokens
+  tokenizer.tokenize!(str) do |tokens|
+    tokens.each do |token|
+      puts "\t" + token.inspect
+    end
+  end
+
+  # use the token's name as a method to get its value
+  puts tokenizer.ip
+  puts tokenizer.status_code
+  puts tokenizer.request_verb
+end
data/ext/string-eater/c-tokenizer.c
ADDED
@@ -0,0 +1,141 @@
+#include <ruby.h>
+
+/* not used in production - useful for debugging */
+#define puts_inspect(var) \
+  ID inspect = rb_intern("inspect"); \
+  VALUE x = rb_funcall(var, inspect, 0); \
+  printf("%s\n", StringValueCStr(x));
+
+static VALUE rb_cCTokenizer;
+static VALUE rb_mStringEater;
+
+static VALUE tokenize_string(VALUE self,
+                             VALUE string,
+                             VALUE tokens_to_find_indexes,
+                             VALUE tokens_to_find_strings,
+                             VALUE tokens_to_extract_indexes,
+                             VALUE tokens_to_extract_names)
+{
+  const char* input_string = StringValueCStr(string);
+  VALUE extracted_tokens = rb_hash_new();
+  VALUE curr_token;
+  unsigned int curr_token_ix;
+  long n_tokens_to_find = RARRAY_LEN(tokens_to_find_indexes);
+  size_t str_len = strlen(input_string);
+  size_t ix;
+  char c;
+  char looking_for;
+  size_t looking_for_len;
+  size_t looking_for_ix = 0;
+  long find_ix = 0;
+  const char* looking_for_token;
+  unsigned int n_tokens = (unsigned int)RARRAY_LEN(rb_iv_get(self, "@tokens"));
+
+  size_t startpoint = 0;
+
+  long n_tokens_to_extract = RARRAY_LEN(tokens_to_extract_indexes);
+  long last_token_extracted_ix = 0;
+
+  long next_token_to_extract_ix = NUM2UINT(rb_ary_entry(tokens_to_extract_indexes, last_token_extracted_ix));
+
+  curr_token = rb_ary_entry(tokens_to_find_strings, find_ix);
+  curr_token_ix = NUM2UINT(rb_ary_entry(tokens_to_find_indexes, find_ix));
+  looking_for_token = StringValueCStr(curr_token);
+  looking_for_len = strlen(looking_for_token);
+  looking_for = looking_for_token[looking_for_ix];
+
+  for(ix = 0; ix < str_len; ix++)
+  {
+    c = input_string[ix];
+    if(c == looking_for)
+    {
+      if(looking_for_ix == 0)
+      {
+        /* entering new token */
+        if(curr_token_ix > 0)
+        {
+          /* extract, if necessary */
+          if((curr_token_ix - 1) == next_token_to_extract_ix)
+          {
+            last_token_extracted_ix++;
+            if(last_token_extracted_ix < n_tokens_to_extract)
+            {
+              next_token_to_extract_ix = NUM2UINT(rb_ary_entry(tokens_to_extract_indexes, last_token_extracted_ix));
+            }
+            else
+            {
+              next_token_to_extract_ix = -1;
+            }
+            rb_hash_aset(extracted_tokens,
+                         rb_ary_entry(tokens_to_extract_names, curr_token_ix - 1),
+                         rb_usascii_str_new(input_string + startpoint,
+                                            ix - startpoint));
+          }
+        }
+        startpoint = ix;
+      }
+      if(looking_for_ix >= looking_for_len - 1)
+      {
+        /* leaving token */
+        if(curr_token_ix >= n_tokens-1)
+        {
+          break;
+        }
+        else
+        {
+          startpoint = ix + 1;
+        }
+
+
+        /* next token */
+        find_ix++;
+        if(find_ix >= n_tokens_to_find)
+        {
+          /* done! */
+          break;
+        }
+
+        curr_token = rb_ary_entry(tokens_to_find_strings, find_ix);
+        curr_token_ix = NUM2UINT(rb_ary_entry(tokens_to_find_indexes, find_ix));
+        looking_for_token = StringValueCStr(curr_token);
+        looking_for_len = strlen(looking_for_token);
+        looking_for_ix = 0;
+      }
+      else
+      {
+        looking_for_ix++;
+      }
+      looking_for = looking_for_token[looking_for_ix];
+    }
+  }
+
+  ix = str_len;
+  curr_token_ix = n_tokens - 1;
+
+  if(curr_token_ix == next_token_to_extract_ix)
+  {
+    rb_hash_aset(extracted_tokens,
+                 rb_ary_entry(tokens_to_extract_names, curr_token_ix),
+                 rb_usascii_str_new(input_string + startpoint,
+                                    ix - startpoint));
+  }
+
+  return extracted_tokens;
+}
+
+void finalize_c_tokenizer_ext(VALUE unused)
+{
+  /* free memory, etc */
+}
+
+void Init_c_tokenizer_ext(void)
+{
+  rb_mStringEater = rb_define_module("StringEater");
+  rb_cCTokenizer = rb_define_class_under(rb_mStringEater,
+                                         "CTokenizer", rb_cObject);
+
+  rb_define_method(rb_cCTokenizer, "ctokenize!", tokenize_string, 5);
+
+  /* set the callback for when the extension is unloaded */
+  rb_set_end_proc(finalize_c_tokenizer_ext, 0);
+}
data/lib/c-tokenizer.rb
ADDED
@@ -0,0 +1,93 @@
+require 'c_tokenizer_ext'
+
+class StringEater::CTokenizer
+  def self.tokens
+    @tokens ||= []
+  end
+
+  def self.add_field name, opts={}
+    self.tokens << StringEater::Token::new_field(name, opts)
+    define_method(name) {@extracted_tokens[name]}
+  end
+
+  def self.look_for tokens
+    self.tokens << StringEater::Token::new_separator(tokens)
+  end
+
+  def initialize
+    refresh_tokens
+  end
+
+  def tokens
+    @tokens
+  end
+
+  def refresh_tokens
+    @tokens = self.class.tokens
+    tokens_to_find = tokens.each_with_index.map do |t, i|
+      [i, t.string] if t.string
+    end.compact
+
+    @tokens_to_find_indexes = tokens_to_find.map{|t| t[0]}
+    @tokens_to_find_strings = tokens_to_find.map{|t| t[1]}
+
+    tokens_to_extract = tokens.each_with_index.map do |t, i|
+      [i, t.name] if t.extract?
+    end.compact
+
+    @tokens_to_extract_indexes = tokens_to_extract.map{|t| t[0]}
+    @tokens_to_extract_names = tokens.map{|t| t.name}
+  end
+
+  def describe_line
+    tokens.inject("") do |desc, t|
+      desc << (t.string || t.name.to_s || "xxxxxx")
+    end
+  end
+
+  def do_extra_parsing
+  end
+
+  def tokenize! string, &block
+    @string = string
+    @extracted_tokens ||= {}
+    @extracted_tokens.clear
+
+    tokens.first.breakpoints[0] = 0
+
+    @extracted_tokens = ctokenize!(@string,
+                                   @tokens_to_find_indexes,
+                                   @tokens_to_find_strings,
+                                   @tokens_to_extract_indexes,
+                                   @tokens_to_extract_names)
+
+    # extra parsing hook
+    do_extra_parsing
+
+    if block_given?
+      yield @extracted_tokens
+    end
+
+    # return self for chaining
+    self
+  end
+
+  private
+
+  def set_token_startpoint ix, startpoint
+    @tokens[ix].breakpoints[0] = startpoint
+  end
+
+  def get_token_startpoint ix
+    @tokens[ix].breakpoints[0]
+  end
+
+  def set_token_endpoint ix, endpoint
+    @tokens[ix].breakpoints[1] = endpoint
+  end
+
+  def extract_token? ix
+    @tokens[ix].extract?
+  end
+
+end
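The do_extra_parsing method defined above is a no-op hook that runs inside tokenize!, after the C extension has filled @extracted_tokens; subclasses can override it to derive extra tokens, as the nginx example earlier does with :request_verb and :request_url. A minimal sketch of the same pattern (PairTokenizer and its fields are hypothetical, not part of the package):

    class PairTokenizer < StringEater::CTokenizer
      add_field :pair                    # e.g. "color=red"
      look_for " "
      add_field :rest, :extract => false

      # post-process the extracted :pair token into :key and :value
      def do_extra_parsing
        return unless @extracted_tokens[:pair]
        key, value = @extracted_tokens[:pair].split("=", 2)
        @extracted_tokens[:key] = key
        @extracted_tokens[:value] = value
      end
    end

    PairTokenizer.new.tokenize!("color=red trailing stuff") do |tokens|
      puts tokens[:value]  # => "red"
    end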
data/lib/ruby-tokenizer-each-char.rb
ADDED
@@ -0,0 +1,145 @@
+# this tokenizer is very slow, but it illustrates the
+# basic idea of the C tokenizer
+class StringEater::RubyTokenizerEachChar
+
+  def self.tokens
+    @tokens ||= []
+  end
+
+  def self.combined_tokens
+    @combined_tokens ||= []
+  end
+
+  def self.add_field name, opts={}
+    self.tokens << StringEater::Token::new_field(name, opts)
+    define_method(name) {@extracted_tokens[name]}
+  end
+
+  def self.look_for tokens
+    self.tokens << StringEater::Token::new_separator(tokens)
+  end
+
+  def self.combine_fields opts={}
+    from_token_index = self.tokens.index{|t| t.name == opts[:from]}
+    to_token_index = self.tokens.index{|t| t.name == opts[:to]}
+    self.combined_tokens << [opts[:as], from_token_index, to_token_index]
+    define_method(opts[:as]) {@extracted_tokens[opts[:as]]}
+  end
+
+  def tokens
+    @tokens ||= self.class.tokens
+  end
+
+  def combined_tokens
+    @combined_tokens ||= self.class.combined_tokens
+  end
+
+  def refresh_tokens
+    @combined_tokens = nil
+    @tokens = nil
+    tokens
+  end
+
+  def describe_line
+    tokens.inject("") do |desc, t|
+      desc << (t.string || t.name.to_s || "xxxxxx")
+    end
+  end
+
+  def find_breakpoints string
+    tokenize!(string) unless @string == string
+    tokens.inject([]) do |bp, t|
+      bp << t.breakpoints
+      bp
+    end.flatten.uniq
+  end
+
+  def tokenize! string, &block
+    @string = string
+    @extracted_tokens ||= {}
+    @extracted_tokens.clear
+    @tokens_to_find ||= tokens.each_with_index.map do |t, i|
+      [i, t.string] if t.string
+    end.compact
+    @tokens_to_extract_indeces ||= tokens.each_with_index.map do |t, i|
+      i if t.extract?
+    end.compact
+
+    tokens.first.breakpoints[0] = 0
+
+    find_index = 0
+
+    curr_token = @tokens_to_find[find_index]
+    curr_token_index = curr_token[0]
+    curr_token_length = curr_token[1].length
+    looking_for_index = 0
+    looking_for = curr_token[1][looking_for_index]
+
+    counter = 0
+    string.each_char do |c|
+      if c == looking_for
+        if looking_for_index == 0
+          # entering new token
+          if curr_token_index > 0
+            t = tokens[curr_token_index - 1]
+            t.breakpoints[1] = counter
+            if t.extract?
+              @extracted_tokens[t.name] = string[t.breakpoints[0]...t.breakpoints[1]]
+            end
+          end
+          tokens[curr_token_index].breakpoints[0] = counter
+        end
+        if looking_for_index >= (curr_token_length - 1)
+          # leaving token
+          tokens[curr_token_index].breakpoints[1] = counter
+
+          if curr_token_index >= tokens.size-1
+            # we're done!
+            break
+          else
+            tokens[curr_token_index + 1].breakpoints[0] = counter + 1
+          end
+
+          # next token
+          find_index += 1
+          if find_index >= @tokens_to_find.length
+            # we're done!
+            break
+          end
+          curr_token = @tokens_to_find[find_index]
+          curr_token_index = curr_token[0]
+          curr_token_length = curr_token[1].length
+          looking_for_index = 0
+        else
+          looking_for_index += 1
+        end
+      end
+      looking_for = curr_token[1][looking_for_index]
+      counter += 1
+    end
+
+    last_token = tokens.last
+    last_token.breakpoints[1] = string.length
+
+    if last_token.extract?
+      @extracted_tokens[last_token.name] = string[last_token.breakpoints[0]..last_token.breakpoints[1]]
+    end
+
+    combined_tokens.each do |combiner|
+      name = combiner[0]
+      from = @tokens[combiner[1]].breakpoints[0]
+      to = @tokens[combiner[2]].breakpoints[1]
+      @extracted_tokens[name] = string[from...to]
+    end
+
+    if block_given?
+      yield @extracted_tokens
+    end
+
+    # return self for chaining
+    self
+  end
+
+end
+
+
data/lib/ruby-tokenizer.rb
ADDED
@@ -0,0 +1,98 @@
+# this tokenizer is fairly fast, but not necessarily faster than regexps
+class StringEater::RubyTokenizer
+  def self.tokens
+    @tokens ||= []
+  end
+
+  def self.combined_tokens
+    @combined_tokens ||= []
+  end
+
+  def self.add_field name, opts={}
+    self.tokens << StringEater::Token::new_field(name, opts)
+    define_method(name) {@extracted_tokens[name]}
+  end
+
+  def self.look_for tokens
+    self.tokens << StringEater::Token::new_separator(tokens)
+  end
+
+  def self.combine_fields opts={}
+    from_token_index = self.tokens.index{|t| t.name == opts[:from]}
+    to_token_index = self.tokens.index{|t| t.name == opts[:to]}
+    self.combined_tokens << [opts[:as], from_token_index, to_token_index]
+    define_method(opts[:as]) {@extracted_tokens[opts[:as]]}
+  end
+
+  def tokens
+    @tokens ||= self.class.tokens
+  end
+
+  def combined_tokens
+    @combined_tokens ||= self.class.combined_tokens
+  end
+
+  def refresh_tokens
+    @combined_tokens = nil
+    @tokens = nil
+    tokens
+  end
+
+  def describe_line
+    tokens.inject("") do |desc, t|
+      desc << (t.string || t.name.to_s || "xxxxxx")
+    end
+  end
+
+  def find_breakpoints(string)
+    @literal_tokens ||= tokens.select{|t| t.string}
+    @breakpoints ||= Array.new(2*@literal_tokens.size + 2)
+    @breakpoints[0] = 0
+    @breakpoints[-1] = string.length
+    start_point = 0
+    @literal_tokens.each_with_index do |t, i|
+      @breakpoints[2*i+1], start_point = find_end_of(t, string, start_point)
+      @breakpoints[2*i+2] = start_point
+    end
+    @breakpoints
+  end
+
+  def tokenize! string, &block
+    @extracted_tokens ||= {}
+    @extracted_tokens.clear
+    @tokens_to_extract ||= tokens.select{|t| t.extract?}
+
+    find_breakpoints(string)
+    last_important_bp = [@breakpoints.length, tokens.size].min
+    (0...last_important_bp).each do |i|
+      tokens[i].breakpoints = [@breakpoints[i], @breakpoints[i+1]]
+    end
+
+    @tokens_to_extract.each do |t|
+      @extracted_tokens[t.name] = string[t.breakpoints[0]...t.breakpoints[1]]
+    end
+
+    combined_tokens.each do |combiner|
+      name = combiner[0]
+      from = @tokens[combiner[1]].breakpoints[0]
+      to = @tokens[combiner[2]].breakpoints[1]
+      @extracted_tokens[name] = string[from...to]
+    end
+
+    if block_given?
+      yield @extracted_tokens
+    end
+
+    # return self for chaining
+    self
+  end
+
+  protected
+
+  def find_end_of token, string, start_at
+    start = string.index(token.string, start_at+1) || string.length
+    [start, [start + token.string.length, string.length].min]
+  end
+
+end
+
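Unlike the C-backed tokenizer, both pure-Ruby classes also provide combine_fields, which exposes the span from one field through another as a single extracted token (the spec's Example3 below exercises it). A minimal usage sketch (QuoteTokenizer is an illustrative name, not part of the package):

    class QuoteTokenizer < StringEater::RubyTokenizer
      add_field :first_word, :extract => false
      look_for " \""
      add_field :part1, :extract => false
      look_for " "
      add_field :part2
      look_for " "
      add_field :part3, :extract => false
      look_for "\""

      # everything from :part1 through :part3 comes back as :parts
      combine_fields :from => :part1, :to => :part3, :as => :parts
    end

    tok = QuoteTokenizer.new.tokenize!("foo \"bar baz bang\"")
    puts tok.parts  # => "bar baz bang"
    puts tok.part2  # => "baz"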
data/lib/string-eater.rb
ADDED
@@ -0,0 +1,10 @@
+module StringEater
+  autoload :Token, 'token'
+  autoload :RubyTokenizer, 'ruby-tokenizer'
+  autoload :RubyTokenizerEachChar, 'ruby-tokenizer-each-char'
+  autoload :CTokenizer, 'c-tokenizer'
+
+  autoload :VERSION, 'version'
+
+  class Tokenizer < CTokenizer; end
+end
data/lib/token.rb
ADDED
@@ -0,0 +1,26 @@
+class StringEater::Token
+  attr_accessor :name, :string, :opts, :breakpoints, :children
+
+  def initialize
+    @opts = {}
+    @breakpoints = [nil,nil]
+  end
+
+  def extract?
+    @opts[:extract]
+  end
+
+  def self.new_field(name, opts)
+    t = new
+    t.name = name
+    t.opts = {:extract => true}.merge(opts)
+    t
+  end
+
+  def self.new_separator(string)
+    t = new
+    t.string = string
+    t
+  end
+
+end
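The Token class above is normally built for you by add_field and look_for, but it can be exercised directly; a small sketch of what the two constructors produce:

    require 'string-eater'

    field = StringEater::Token.new_field(:word, {})
    sep   = StringEater::Token.new_separator(", ")

    field.extract?     # => true  (fields default to :extract => true)
    sep.string         # => ", "
    field.breakpoints  # => [nil, nil] until a tokenizer fills them in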
data/lib/version.rb
ADDED
data/spec/nginx_spec.rb
ADDED
@@ -0,0 +1,27 @@
+require 'spec_helper'
+require 'string-eater'
+
+$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'examples'))
+
+require 'nginx'
+
+describe NginxLogTokenizer do
+  before(:each) do
+    @tokenizer = NginxLogTokenizer.new
+    @str = '73.80.217.212 - - [01/Aug/2012:09:14:25 -0500] "GET /this_is_a_url HTTP/1.1" 304 152 "http://referrer.com" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" "-" "there could be" other "stuff here"'
+  end
+
+  {
+    :ip => "73.80.217.212",
+    :request => "GET /this_is_a_url HTTP/1.1",
+    :status_code => 304,
+    :referrer_url => "http://referrer.com",
+    :user_agent => "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
+    :remainder => "\"there could be\" other \"stuff here\"",
+  }.each_pair do |token,val|
+    it "should find the right value for #{token}" do
+      @tokenizer.tokenize!(@str).send(token).should == val
+    end
+  end
+
+end
data/spec/spec_helper.rb
ADDED
@@ -0,0 +1 @@
+$LOAD_PATH.concat %w[./lib ./ext/string-eater]
data/spec/string_eater_spec.rb
ADDED
@@ -0,0 +1,133 @@
+require 'spec_helper'
+require 'string-eater'
+
+TestedClass = StringEater::CTokenizer
+
+describe StringEater do
+  it "should have a version" do
+    StringEater::VERSION::STRING.split(".").size.should >= 3
+  end
+end
+
+# normal use
+class Example1 < TestedClass
+  add_field :first_word
+  look_for " "
+  add_field :second_word, :extract => false
+  look_for "|"
+  add_field :third_word
+end
+
+describe Example1 do
+
+  before(:each) do
+    @tokenizer = Example1.new
+    @str1 = "foo bar|baz"
+    @first_word1 = "foo"
+    @third_word1 = "baz"
+    @bp1 = [0, 3,4,7,8,11]
+  end
+
+  describe "find_breakpoints" do
+    it "should return an array of the breakpoints" do
+      @tokenizer.find_breakpoints(@str1).should == @bp1 if @tokenizer.respond_to?(:find_breakpoints)
+    end
+  end
+
+  describe "tokenize!" do
+    it "should return itself" do
+      @tokenizer.tokenize!(@str1).should == @tokenizer
+    end
+
+    it "should set the first word" do
+      @tokenizer.tokenize!(@str1).first_word.should == "foo"
+    end
+
+    it "should set the third word" do
+      @tokenizer.tokenize!(@str1).third_word.should == "baz"
+    end
+
+    it "should not set the second word" do
+      @tokenizer.tokenize!(@str1).second_word.should be_nil
+    end
+
+    it "should yield a hash of tokens if a block is given" do
+      @tokenizer.tokenize!(@str1) do |tokens|
+        tokens[:first_word].should == "foo"
+      end
+    end
+
+    it "should return everything to the end of the line for the last token" do
+      s = "c defg asdf | foo , baa"
+      @tokenizer.tokenize!("a b|#{s}").third_word.should == s
+    end
+
+  end
+
+end
+
+# an example where we ignore after a certain point in the string
+class Example2 < TestedClass
+  add_field :first_word, :extract => false
+  look_for " "
+  add_field :second_word
+  look_for " "
+  add_field :third_word, :extract => false
+  look_for "-"
+end
+
+describe Example2 do
+
+  before(:each) do
+    @tokenizer = Example2.new
+    @str1 = "foo bar baz-"
+    @second_word1 = "bar"
+  end
+
+  describe "tokenize!" do
+    it "should find the token when there is extra stuff at the end of the string" do
+      @tokenizer.tokenize!(@str1).second_word.should == @second_word1
+    end
+  end
+
+end
+
+# CTokenizer doesn't do combine_fields because
+# writing out breakpoints is a significant slow-down
+if TestedClass.respond_to?(:combine_fields)
+  # an example where we combine fields
+  class Example3 < TestedClass
+    add_field :first_word, :extract => false
+    look_for " \""
+    add_field :part1, :extract => false
+    look_for " "
+    add_field :part2
+    look_for " "
+    add_field :part3, :extract => false
+    look_for "\""
+
+    combine_fields :from => :part1, :to => :part3, :as => :parts
+  end
+
+  describe Example3 do
+    before(:each) do
+      @tokenizer = Example3.new
+      @str1 = "foo \"bar baz bang\""
+      @part2 = "baz"
+      @parts = "bar baz bang"
+    end
+
+    it "should extract like normal" do
+      @tokenizer.tokenize!(@str1).part2.should == @part2
+    end
+
+    it "should ignore like normal" do
+      @tokenizer.tokenize!(@str1).part1.should be_nil
+    end
+
+    it "should extract the combined field" do
+      @tokenizer.tokenize!(@str1).parts.should == @parts
+    end
+
+  end
+end
metadata
ADDED
@@ -0,0 +1,66 @@
+--- !ruby/object:Gem::Specification
+name: string-eater
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+prerelease:
+platform: ruby
+authors:
+- Dan Swain
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2012-08-20 00:00:00.000000000 Z
+dependencies: []
+description: Fast string tokenizer. Nom strings.
+email:
+- dan@simpli.fi
+executables: []
+extensions:
+- ext/string-eater/extconf.rb
+extra_rdoc_files: []
+files:
+- lib/c-tokenizer.rb
+- lib/ruby-tokenizer-each-char.rb
+- lib/ruby-tokenizer.rb
+- lib/string-eater.rb
+- lib/token.rb
+- lib/version.rb
+- ext/string-eater/extconf.rb
+- ext/string-eater/c-tokenizer.c
+- spec/nginx_spec.rb
+- spec/spec_helper.rb
+- spec/string_eater_spec.rb
+- examples/address.rb
+- examples/nginx.rb
+- LICENSE
+- Rakefile
+- README.md
+homepage: http://github.com/simplifi/string-eater
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+- ext/string-eater
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: Fast string tokenizer. Nom strings.
+test_files:
+- spec/nginx_spec.rb
+- spec/spec_helper.rb
+- spec/string_eater_spec.rb