scanner 0.0.2 → 0.0.3

data/README.md CHANGED
@@ -24,19 +24,146 @@ Scanner is a module that you can include in your classes. It defines a
24
24
  token function that accepts the regular expression that the token
25
25
  matches.
26
26
 
27
- Example code
27
+ For example
28
28
 
29
29
  class TestScanner
30
30
  include Scanner
31
- ignore /\s+/
32
- token :number, /\d+/
33
- token :id, /\w+/
31
+ ignore '\s+'
32
+ token :number, '\d+'
33
+ token :id, '[a-z]+'
34
34
  end
35
35
 
36
36
  @scanner = TestScanner.new
37
37
  @scanner.parse("123")
38
38
  @scanner.look_ahead.is?(:number) # Should be true
39
39
 
40
+ ### Token definition
41
+ Each token is defined by a symbol, used to identify the token, and a
42
+ regular expression that the token should match. An optional third
43
+ parameter accepts a hash of options that we will explore later. For
44
+ example
45
+
46
+ token :number, '\d+'
47
+
48
+ will match a run of one or more digits.
49
+
50
+ Some care is needed when defining tokens that collide with other
51
+ tokens. For instance, a language may define the token '==' and the
52
+ token '='. You need to define the double equals before the single
53
+ equals, otherwise the string '==' will be identified as two '=' tokens,
54
+ instead of a '==' token.
55
+
56
+ ### Ignoring characters
57
+ For many scanning needs, there is a set of characters that is safely
58
+ ignored, for instance, in many programming languages, spaces and
59
+ newlines. You can define the set of characters to ignore with the
60
+ following definition:
61
+
62
+ ignore '\s+'
63
+
64
+ ### Defining keywords
65
+ For many scanning needs, there is a set of tokens that define the
66
+ reserved words or keywords of a language. For instance, in Ruby, the
67
+ tokens 'def', 'class', 'module', and so on, are language reserved words.
68
+ Usually, these tokens are a subset of a larger token group, called
69
+ identifiers or ids. You can define a family of reserved words by using
70
+ the 'keywords' function.
71
+
72
+ ignore '\s+'
73
+ token :id, '[a-z]+'
74
+ keywords %w{def class module}
75
+
76
+ @scanner.parse("other def")
77
+ @scanner.lookahead.is?(:id)
78
+ @scanner.lookahead(2).is?(:def)
79
+
80
+ Note that you will need to have a token definition that matches those
81
+ keywords, as the token :id in the previous example.
82
+
83
+ ### Consuming tokens and looking ahead
84
+ The Scanner method consume will try to match the first token remaining
85
+ in the input string. If successful, it will return the token, and remove
86
+ it from the input string.
87
+
88
+ ignore '\s+'
89
+ token :id, '[a-z]+'
90
+
91
+ @scanner.parse("one two")
92
+ @scanner.consume.content == "one"
93
+ @scanner.consume.content == "two"
94
+
95
+ Lookahead performs a similar function, but without removing the token
96
+ from the string. It accepts an optional parameter indicating the number
97
+ of tokens to look ahead.
98
+
99
+ @scanner.parse("one two")
100
+ @scanner.lookahead.content == "one"
101
+ @scanner.lookahead(2).content == "two"
102
+
103
+ ### End of file
104
+
105
+ ignore '\s+'
106
+ token :number, '\d+'
107
+ token :id, '[a-z]+'
108
+
109
+ @scanner = TestScanner.new
110
+ @scanner.parse("123 abc 456 other")
111
+ begin
112
+ token = @scanner.consume
113
+ puts token.content
114
+ end while token.is_not? :eof
115
+
116
+ You know you have reached the end of the parse string when you receive
117
+ the :eof token, as the loop above shows.
118
+
119
+ ### Looping through tokens
120
+ A scanner instance is a Ruby Enumerable, so you can use each, map, and
121
+ others.
122
+
123
+ @scanner.parse("123 456")
124
+ @scanner.map { |tok| "-#{tok.content}-" }
125
+
126
+ ### Token separation
127
+ Sometimes it is necessary to indicate that a given token needs to be
128
+ followed by a token separator. For instance, in this example
129
+
130
+ token :number, '\d+'
131
+ token :id, '[a-z]+'
132
+
133
+ The string "abc123" will be parsed as an :id followed by a :number,
134
+ which may be undesirable. You may want to indicate that a token
135
+ separator (commonly spaces, arithmetic operators, punctuation marks,
136
+ etc) needs to occur after :id or :number.
137
+
138
+ The following code requires a space after ids and numbers:
139
+
140
+ token :number, '\d+', check_for_token_separator: true
141
+ token :id, '[a-z]+', check_for_token_separator: true
142
+ token_separator '\s'
143
+
144
+ ### Looking ahead for token types
145
+ When scanning strings, it is often necessary to lookahead to check what
146
+ types of tokens are coming. For instance:
147
+
148
+ if @scanner.lookahead.is?(:id) && @scanner.lookahead(2).is?(:equal)
149
+ # variable assignment
150
+
151
+ Scanner provides a few utility functions to make this type of check
152
+ easier. For instance, the previous code could be refactored to:
153
+
154
+ if @scanner.tokens_are?(:id, :equal)
155
+
156
+ The other two methods available are token_is? and token_is_not?.
157
+
158
+ ### Tokens
159
+ The tokens returned by consume and lookahead have a few methods, which
160
+ should be self-explanatory:
161
+
162
+ * content
163
+ * line
164
+ * column
165
+ * is? => Checks that the token is of a given type
166
+ * is_not? => The opposite
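The README's note on colliding tokens ('==' must be defined before '=') follows from first-match-wins scanning. The sketch below illustrates that rule with Ruby's standard `StringScanner`. It is not the gem's implementation: the `tokenize` helper and the rule names are hypothetical, introduced only for illustration.

```ruby
require "strscan"

# Hypothetical helper, not part of the scanner gem: tries each rule in
# definition order and takes the first one that matches at the cursor.
def tokenize(input, rules)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/) # ignore whitespace between tokens

    rule = rules.find { |_name, pattern| scanner.match?(pattern) }
    raise ArgumentError, "no rule matches at position #{scanner.pos}" unless rule

    name, pattern = rule
    tokens << [name, scanner.scan(pattern)]
  end
  tokens
end

# Longer token first: '==' is recognised as a single token.
GOOD_ORDER = [[:id, /[a-z]+/], [:double_equal, /==/], [:equal, /=/]].freeze
# Shorter token first: '==' is split into two '=' tokens.
BAD_ORDER  = [[:id, /[a-z]+/], [:equal, /=/], [:double_equal, /==/]].freeze

good = tokenize("a == b", GOOD_ORDER)
bad  = tokenize("a == b", BAD_ORDER)
```

With `GOOD_ORDER` the middle token has type `:double_equal`; with `BAD_ORDER` the same input yields two `:equal` tokens, which is the failure the README warns about.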
40
167
 
41
168
  ## Contributing
42
169
 
@@ -50,38 +50,38 @@ module Scanner
50
50
 
51
51
  def check_for_token_separator
52
52
  self.class.instance_eval { @check_for_token_separator }
53
- end
53
+ end
54
54
 
55
55
  def separator
56
56
  self.class.instance_eval { @separator }
57
- end
57
+ end
58
58
 
59
59
  public
60
60
 
61
+ include Enumerable
62
+
61
63
  def parse(program)
62
64
  @program = program
63
65
  @token_list = []
64
66
  @line_number = 1
65
67
  @column_number = 1
68
+ @token_number = 0
66
69
  end
67
70
 
68
71
  def consume
69
- if @token_list.empty?
72
+ if @token_number >= @token_list.size
70
73
  consume_next_token
71
- else
72
- @token_list.shift
73
74
  end
75
+ token = @token_list[@token_number]
76
+ @token_number+=1
77
+ token
74
78
  end
75
79
 
76
80
  def look_ahead(number_of_tokens = 1)
77
- end_of_file_met = false
78
- while @token_list.size < number_of_tokens
79
- throw :scanner_exception if end_of_file_met
80
- token = consume_next_token
81
- @token_list << token
82
- end_of_file_met = token.is? :eof
81
+ while @token_list.size < @token_number + number_of_tokens
82
+ consume_next_token
83
83
  end
84
- @token_list[-1]
84
+ @token_list[@token_number + number_of_tokens - 1]
85
85
  end
86
86
 
87
87
  def token_is?(token_type)
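The hunk above replaces the shifting token queue with a cursor (`@token_number`) into a list that is never shortened. A minimal sketch of that idea follows, assuming tokens come from a pre-split word list rather than from regular expressions. `CursorBuffer` and `scan_next` are hypothetical names, and `consume` is written in terms of `look_ahead` for brevity, which differs slightly from the diff.

```ruby
# Simplified stand-in for the scanner: tokens are produced lazily, cached in
# @token_list, and a cursor marks the next unconsumed token.
class CursorBuffer
  def initialize(words)
    @pending = words.dup # input not yet "scanned"
    @token_list = []     # every token scanned so far (never shifted)
    @token_number = 0    # index of the next token to consume
  end

  # Returns the next token and advances the cursor.
  def consume
    token = look_ahead
    @token_number += 1
    token
  end

  # Returns the token number_of_tokens ahead without moving the cursor.
  def look_ahead(number_of_tokens = 1)
    scan_next while @token_list.size < @token_number + number_of_tokens
    @token_list[@token_number + number_of_tokens - 1]
  end

  private

  # Appends the next token, or :eof once the input is exhausted.
  def scan_next
    @token_list << (@pending.shift || :eof)
  end
end

buffer = CursorBuffer.new(%w[one two])
peeked = buffer.look_ahead(2)
first  = buffer.consume
second = buffer.consume
last   = buffer.consume
```

Because consumed tokens stay in the list, a separate index (as `each` uses in the later hunk) can walk the same tokens from the start without disturbing the cursor.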
@@ -101,6 +101,20 @@ module Scanner
101
101
  return true
102
102
  end
103
103
 
104
+ def each
105
+ local_index = 0
106
+ begin
107
+ if local_index >= @token_list.size
108
+ consume_next_token
109
+ end
110
+ current_token = @token_list[local_index]
111
+ if current_token.is_not? :eof
112
+ yield current_token
113
+ end
114
+ local_index += 1
115
+ end while current_token.is_not? :eof
116
+ end
117
+
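Including `Enumerable` only requires a working `each`; `map`, `select`, `count` and the rest are derived from it. The sketch below shows the same pattern on a simplified token stream that stops before an `:eof` marker, as the `each` method above does. `SketchToken` and `TokenStream` are hypothetical names, not classes from the gem.

```ruby
# Hypothetical token type for illustration only.
SketchToken = Struct.new(:type, :content)

class TokenStream
  include Enumerable

  def initialize(tokens)
    # Terminate the list with an :eof marker, as the scanner does.
    @token_list = tokens + [SketchToken.new(:eof, nil)]
  end

  # Yields every token up to, but not including, the :eof marker.
  def each
    return enum_for(:each) unless block_given?

    @token_list.each do |token|
      break if token.type == :eof

      yield token
    end
  end
end

stream = TokenStream.new([SketchToken.new(:number, "123"),
                          SketchToken.new(:number, "456")])
wrapped = stream.map { |tok| "-#{tok.content}-" }
```

This mirrors the spec added in this release, where `map` over "123 456" is expected to produce one entry per number and no entry for `:eof`.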
104
118
  private
105
119
 
106
120
 
@@ -114,7 +128,8 @@ module Scanner
114
128
  if check_for_token_separator[symbol]
115
129
  check_for_separator
116
130
  end
117
- return Token.new(token_type, content, @line_number, currently_at_column)
131
+ @token_list << Token.new(token_type, content, @line_number, currently_at_column)
132
+ return
118
133
  end
119
134
  end
120
135
 
@@ -128,7 +143,7 @@ module Scanner
128
143
 
129
144
  def get_token_from_reg_exp(reg_exp, symbol)
130
145
  content = consume_regular_expression(reg_exp)
131
- if keywords.include? content
146
+ if keywords && keywords.include?(content)
132
147
  token_type = content.to_sym
133
148
  else
134
149
  token_type = symbol
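The `keywords && keywords.include?(content)` guard suggests that `keywords` can be `nil` when a scanner class never declares any. A minimal sketch of the same decision in isolation, assuming that reading is right; `token_type_for` is a hypothetical helper, not a method of the gem.

```ruby
# Promotes matched content to a keyword token type when it is in the keyword
# list; otherwise falls back to the type of the rule that matched.
def token_type_for(content, default_type, keywords)
  if keywords && keywords.include?(content)
    content.to_sym
  else
    default_type
  end
end

with_keywords    = token_type_for("def", :id, %w[def class module])
plain_identifier = token_type_for("other", :id, %w[def class module])
no_keyword_list  = token_type_for("def", :id, nil) # nil list: no promotion, no error
```

Without the guard, the third call would raise `NoMethodError`, because `nil` does not respond to `include?`.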
@@ -1,3 +1,3 @@
1
1
  module Scanner
2
- VERSION = "0.0.2"
2
+ VERSION = "0.0.3"
3
3
  end
@@ -35,6 +35,21 @@ describe Scanner do
35
35
  end
36
36
  end
37
37
 
38
+ describe "has enumerable functions" do
39
+ it "has each" do
40
+ @scanner.parse("123 456")
41
+ @scanner.each do |tok|
42
+ tok.content.should match /123|456/
43
+ end
44
+ end
45
+
46
+ it "has map" do
47
+ @scanner.parse("123 456")
48
+ map_results = @scanner.map { |tok| "-#{tok.content}-" }
49
+ map_results.should eq ["-123-","-456-"]
50
+ end
51
+ end
52
+
38
53
  describe "lookahead" do
39
54
  it "returns the next token without arguments" do
40
55
  @scanner.parse("123")
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: scanner
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
4
+ version: 0.0.3
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-08-03 00:00:00.000000000 Z
12
+ date: 2012-08-09 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rspec
@@ -77,7 +77,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
77
77
  version: '0'
78
78
  segments:
79
79
  - 0
80
- hash: 1008594902208819548
80
+ hash: -2266866493885490648
81
81
  required_rubygems_version: !ruby/object:Gem::Requirement
82
82
  none: false
83
83
  requirements:
@@ -86,7 +86,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
86
86
  version: '0'
87
87
  segments:
88
88
  - 0
89
- hash: 1008594902208819548
89
+ hash: -2266866493885490648
90
90
  requirements: []
91
91
  rubyforge_project:
92
92
  rubygems_version: 1.8.24