lex 0.1.0

@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 2685ca7a938c15bd7a3a969b46f2db085053ea86
  data.tar.gz: fcee4a79a88908d49b87e847f6da3b3a9dd02cea
SHA512:
  metadata.gz: c7972c76ccaa667426ecec48b1010a51c5a6679481fdedad50c4ba9107d43fad86421afcb6eba73b573add28dbd121cad35beb5d0be38782996593e6593a1922
  data.tar.gz: 2eeacf7c91992920d4b44a0d55e6cb8b576c50f1982d70fea392436466218a570cecc088a80f8f9fbc27453177d1775cb3924adcc94d5ec2c7aae0b61c6824e5
@@ -0,0 +1,14 @@
/.bundle/
/.yardoc
/Gemfile.lock
/_yardoc/
/coverage/
/doc/
/pkg/
/spec/reports/
/tmp/
*.bundle
*.so
*.o
*.a
mkmf.log
data/.rspec ADDED
@@ -0,0 +1,3 @@
--color
--require spec_helper
--warnings
@@ -0,0 +1 @@
2.0.0
@@ -0,0 +1,22 @@
language: ruby
bundler_args: --without yard benchmarks
script: "bundle exec rake ci"
rvm:
  - 1.9.3
  - 2.0.0
  - 2.1.0
  - 2.2.0
  - ruby-head
matrix:
  include:
    - rvm: jruby-19mode
    - rvm: jruby-20mode
    - rvm: jruby-21mode
    - rvm: jruby-head
    - rvm: rbx-2
  allow_failures:
    - rvm: ruby-head
    - rvm: jruby-head
  fast_finish: true
branches:
  only: master
data/Gemfile ADDED
@@ -0,0 +1,19 @@
source 'https://rubygems.org'

gemspec

group :development do
  gem 'rake', '~> 10.3.2'
  gem 'rspec', '~> 3.2.0'
  gem 'yard', '~> 0.8.7'
end

group :metrics do
  gem 'coveralls', '~> 0.7.0'
  gem 'simplecov', '~> 0.8.2'
  gem 'yardstick', '~> 0.9.9'
end

group :benchmarks do
  gem 'benchmark_suite', '~> 1.0.0'
end
@@ -0,0 +1,22 @@
Copyright (c) 2015 Piotr Murach

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,423 @@
# Lex
[![Gem Version](https://badge.fury.io/rb/lex.png)][gem]
[![Build Status](https://secure.travis-ci.org/peter-murach/lex.png?branch=master)][travis]
[![Code Climate](https://codeclimate.com/github/peter-murach/lex.png)][codeclimate]
[![Coverage Status](https://coveralls.io/repos/peter-murach/lex/badge.png?branch=master)][coveralls]
[![Inline docs](http://inch-ci.org/github/peter-murach/lex.png?branch=master)][inchpages]

[gem]: http://badge.fury.io/rb/lex
[travis]: http://travis-ci.org/peter-murach/lex
[codeclimate]: https://codeclimate.com/github/peter-murach/lex
[gemnasium]: https://gemnasium.com/peter-murach/lex
[coveralls]: https://coveralls.io/r/peter-murach/lex
[inchpages]: http://inch-ci.org/github/peter-murach/lex

> Lex is an implementation of the compiler construction tool lex in Ruby. The goal is to stay close to the way the original tool works and combine it with the expressiveness of Ruby.

## Features

* Very focused tool that mimics the basic lex functionality.
* 100% Ruby implementation.
* Provides comprehensive error reporting to assist in lexer construction.

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'lex'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install lex

## Contents

* [1 Overview](#1-overview)
* [1.1 Example](#11-example)
* [1.2 Tokens list](#12-tokens-list)
* [1.3 Specifying rules](#13-specifying-rules)
* [1.4 Handling keywords](#14-handling-keywords)
* [1.5 Token values](#15-token-values)
* [1.6 Discarded tokens](#16-discarded-tokens)
* [1.7 Line numbers](#17-line-numbers)
* [1.8 Ignored characters](#18-ignored-characters)
* [1.9 Literal characters](#19-literal-characters)
* [1.10 Error handling](#110-error-handling)
* [1.11 Building the lexer](#111-building-the-lexer)
* [1.12 Maintaining state](#112-maintaining-state)
* [1.13 Conditional lexing](#113-conditional-lexing)
* [1.14 Debugging](#114-debugging)

## 1. Overview

**Lex** is a library that processes character input streams. For example, suppose you have the following input string:

```ruby
x = 5 + 44 * (s - t)
```

**Lex** partitions the input string into tokens that match a series of regular expression rules. In this instance, given the token definitions:

```ruby
:ID, :EQUALS, :NUMBER, :PLUS, :TIMES, :LPAREN, :RPAREN, :MINUS
```

the output will contain the following tokens:

```ruby
[:ID, 'x', 1, 1], [:EQUALS, '=', 1, 3], [:NUMBER, '5', 1, 5],
[:PLUS, '+', 1, 7], [:NUMBER, '44', 1, 9], [:TIMES, '*', 1, 12],
[:LPAREN, '(', 1, 14], [:ID, 's', 1, 15], [:MINUS, '-', 1, 17],
[:ID, 't', 1, 19], [:RPAREN, ')', 1, 20]
```

The rules specified in the lexer determine how the input is chunked. The following example demonstrates a high-level overview of how this is done.

### 1.1 Example

Given an input:

```ruby
x = 5 + 44 * (s - t)
```

and a simple tokenizer:

```ruby
class MyLexer < Lex::Lexer
  tokens(
    :NUMBER,
    :PLUS,
    :MINUS,
    :TIMES,
    :DIVIDE,
    :LPAREN,
    :RPAREN,
    :EQUALS,
    :IDENTIFIER
  )

  # Regular expression rules for simple tokens
  rule(:PLUS, /\+/)
  rule(:MINUS, /\-/)
  rule(:TIMES, /\*/)
  rule(:DIVIDE, /\//)
  rule(:LPAREN, /\(/)
  rule(:RPAREN, /\)/)
  rule(:IDENTIFIER, /\A[_\$a-zA-Z][_\$0-9a-zA-Z]*/)

  # A regular expression rule with an action
  rule(:NUMBER, /[0-9]+/) do |lexer, token|
    token.value = token.value.to_i
    token
  end

  # Define a rule so we can track line numbers
  rule(:newline, /\n+/) do |lexer, token|
    lexer.advance_line(token.value.length)
  end

  # A string containing ignored characters (spaces and tabs)
  ignore " \t"

  error do |lexer, token|
    puts "Illegal character: #{token.value}"
  end
end

# Build the lexer
my_lexer = MyLexer.new
```

To use the lexer you need to provide it with some input via the `lex` method, which will either yield tokens to a given block or return an enumerator that lets you retrieve tokens by repeatedly calling `next`.

```ruby
input = "x = 5 + 44 * (s - t)"
output = my_lexer.lex(input)
output.next # => Lex::Token(:ID, 'x', 1, 1)
output.next # => Lex::Token(:EQUALS, '=', 1, 3)
output.next # => Lex::Token(:NUMBER, '5', 1, 5)
...
```

The tokens returned by the lexer are instances of `Lex::Token`. This object has attributes such as `name`, `value`, `line` and `column`.
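
As a plain-Ruby illustration of that interface (a stand-in `Struct`, not the gem's actual class), a token is essentially a named tuple of these four attributes:

```ruby
# Stand-in for Lex::Token, mirroring the documented attributes:
# name, value, line and column.
Token = Struct.new(:name, :value, :line, :column)

token = Token.new(:ID, "x", 1, 1)
token.name   # => :ID
token.value  # => "x"
token.line   # => 1
token.column # => 1
```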

### 1.2 Tokens list

A lexer always requires a list of tokens that define all the possible token names that can be produced by the lexer. This list is used to perform validation checks.

The following list is an example of token names:

```ruby
tokens(
  :NUMBER,
  :PLUS,
  :MINUS,
  :TIMES,
  :DIVIDE,
  :LPAREN,
  :RPAREN
)
```

### 1.3 Specifying rules

Each token is specified by writing a regular expression rule, defined by calling the `rule` method. For simple tokens you can just specify the name and regular expression:

```ruby
rule(:PLUS, /\+/)
```

In this case, the first argument is the name of the token and needs to match exactly one of the names supplied in `tokens`. If you need to perform further processing on the matched token, the rule can be expanded by adding an action inside a block. For instance, this rule matches numbers and converts the matched string into an integer:

```ruby
rule(:NUMBER, /\d+/) do |lexer, token|
  token.value = token.value.to_i
  token
end
```

The action block always takes two arguments: the first is the lexer itself and the second is the token, an instance of `Lex::Token`. This object has the attributes `name` (the token name), `value` (the actual text matched), `line` (the current line, indexed from `1`) and `column` (the position of the token relative to the current line). By default the `name` is set to the rule name. Inside the block you can modify the token object's properties; however, when you change token properties, the token itself needs to be returned. If no value is returned by the action block, the token is simply discarded and the lexer moves on to the next token.

The rules are processed in the same order as they appear in the lexer definition. Therefore, if you want separate tokens for "=" and "==", you need to ensure that the rule matching "==" is checked first.
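
To illustrate why ordering matters, here is a self-contained sketch in plain Ruby (using `StringScanner`; this is not the gem's implementation) in which rules are tried in declaration order and the first match wins:

```ruby
require 'strscan'

# Plain-Ruby sketch of first-match-wins rule ordering (not the gem's
# internals): because /==/ is listed before /=/, "==" stays one token.
RULES = [
  [:EQ,     /==/],
  [:ASSIGN, /=/],
  [:ID,     /[a-z]+/],
  [:WS,     /\s+/]
]

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens  = []
  until scanner.eos?
    name, _ = RULES.find { |_, pattern| scanner.scan(pattern) }
    raise "Illegal character #{scanner.getch.inspect}" unless name
    tokens << [name, scanner.matched] unless name == :WS
  end
  tokens
end

tokenize("x == y") # => [[:ID, "x"], [:EQ, "=="], [:ID, "y"]]
```

Swapping the first two rules would lex `"=="` as two `:ASSIGN` tokens instead.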

### 1.4 Handling keywords

In order to handle keywords, you should write a single rule to match an identifier and then do a name lookup like so:

```ruby
def self.keywords
  {
    if: :IF,
    then: :THEN,
    else: :ELSE,
    while: :WHILE,
    ...
  }
end

tokens(:IDENTIFIER, *keywords.values)

rule(:IDENTIFIER, /\w[\w\d]*/) do |lexer, token|
  token.name = lexer.class.keywords.fetch(token.value.to_sym, :IDENTIFIER)
  token
end
```

### 1.5 Token values

By default, the token value is the text that was matched by the rule. However, the value can be changed to any object. For example, when processing identifiers you may wish to return both the identifier name and its actual value:

```ruby
rule(:IDENTIFIER, /\w[\w\d]*/) do |lexer, token|
  token.value = [token.value, lexer.class.keywords[token.value]]
  token
end
```

### 1.6 Discarded tokens

To discard a token, such as a comment, define a rule that returns no token. For instance:

```ruby
rule(:COMMENT, /\#.*/) do |lexer, token|
  # no token returned, so the comment is discarded
end
```

### 1.7 Line numbers

By default **Lex** knows nothing about line numbers since it doesn't understand what a "line" is. To provide this information you need to add a special rule called `:newline`:

```ruby
rule(:newline, /\n+/) do |lexer, token|
  lexer.advance_line(token.value.length)
end
```

Calling the `advance_line` method updates the `current_line` of the underlying lexer. Only the line number is updated; since no token is returned, the matched value is discarded.

**Lex** performs automatic column tracking for each token. This information is available by calling `column` on a `Lex::Token` instance.
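
The bookkeeping this implies can be sketched in plain Ruby (independent of the gem): each newline advances the line counter, and the column restarts at `1` after it:

```ruby
# Illustration of line/column tracking: newlines bump the line number
# and reset the column; any other character advances the column.
def position_after(input)
  line, column = 1, 1
  input.each_char do |char|
    if char == "\n"
      line  += 1
      column = 1
    else
      column += 1
    end
  end
  [line, column]
end

position_after("ab\ncd") # => [2, 3]
```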

### 1.8 Ignored characters

For any characters that should be completely skipped in the input stream, use the `ignore` declaration. Usually this is used to skip over whitespace and other non-essential characters. For example:

```ruby
ignore " \t"  # ignore spaces and tabs
```

You could create a rule to achieve similar behaviour; however, you are encouraged to use this method as it performs better than regular-expression rule matching.

### 1.9 Literal characters

Not implemented yet!

### 1.10 Error handling

To handle lexing error conditions, use the `error` method. In this case the token's `value` attribute contains the offending string. For example:

```ruby
error do |lexer, token|
  puts "Illegal character #{token.value}"
end
```

The lexer automatically skips the offending character and increments the column count.
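
That recovery behaviour can be mimicked in plain Ruby (a sketch, not the gem's code): when nothing matches, report the character, skip it, and resume scanning at the next position:

```ruby
# Sketch of error recovery: unknown characters are collected and
# skipped, and scanning resumes at the following position.
def lex_with_recovery(input)
  tokens, errors = [], []
  pos = 0
  while pos < input.length
    case input[pos..]
    when /\A\d+/    then tokens << [:NUMBER, $&]
    when /\A[a-z]+/ then tokens << [:ID, $&]
    when /\A\s+/    then nil # ignored characters
    else
      errors << input[pos] # offending character, then skip it
      pos += 1
      next
    end
    pos += $&.length
  end
  [tokens, errors]
end

lex_with_recovery("a ? 12")
# => [[[:ID, "a"], [:NUMBER, "12"]], ["?"]]
```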

When performing conditional lexing, you can handle errors per state like so:

```ruby
error :foo do |lexer, token|
  puts "Illegal character #{token.value}"
end
```

### 1.11 Building the lexer

A lexer is defined by subclassing `Lex::Lexer`:

```ruby
require 'lex'

class MyLexer < Lex::Lexer
  # required list of tokens
  tokens(
    :NUMBER,
  )
  ...
end
```

You can also provide the lexer definition using a block:

```ruby
my_lexer = Lex::Lexer.new do
  # required list of tokens
  tokens(
    :NUMBER,
  )
end
```

### 1.12 Maintaining state

Your lexer may need to store state information.

### 1.13 Conditional lexing

A lexer can maintain internal lexing state. When the lexer's state changes, only the tokens and rules for that state are considered. The start condition is called `:initial`, similar to GNU flex.

To define a new lexical state, it must first be declared. This can be achieved with a `states` declaration:

```ruby
states(
  foo: :exclusive,
  bar: :inclusive
)
```

The above definition declares two states, `:foo` and `:bar`. A state may be of two types, `:exclusive` and `:inclusive`. In an `:exclusive` state the default rules do not apply, which means that **Lex** will only return tokens and apply rules defined specifically for that state. An `:inclusive` state, on the other hand, adds its tokens and rules to the default set, so the `lex` method will return both the tokens defined by default and those defined specifically for the `:inclusive` state.

Once a state has been declared, tokens and rules are declared by including the state name in the token or rule definition. For example:

```ruby
rule(:foo_NUMBER, /\d+/)
rule(:bar_ID, /[a-z][a-z0-9]+/)
```

The above rules define a `:NUMBER` token in state `:foo` and an `:ID` token in state `:bar`.

A token can be specified in multiple states by prefixing the token name with the state names like so:

```ruby
rule(:foo_bar_NUMBER, /\d+/)
```

If no state information is provided, the lexer is assumed to be in the `:initial` state. For example, the following declarations are equivalent:

```ruby
rule(:NUMBER, /\d+/)
rule(:initial_NUMBER, /\d+/)
```

By default, lexing operates in the `:initial` state, which includes all the normally defined tokens. To change the lexing state during lexing, use the `begin` method. For example:

```ruby
rule(:begin_foo, /start_foo/) do |lexer, token|
  lexer.begin(:foo)
end
```

To get out of a state you can use `begin` again:

```ruby
rule(:foo_end, /end_foo/) do |lexer, token|
  lexer.begin(:initial)
end
```

For more complex scenarios with states you can use the `push_state` and `pop_state` methods. For example:

```ruby
rule(:begin_foo, /start_foo/) do |lexer, token|
  lexer.push_state(:foo)
end

rule(:foo_end, /end_foo/) do |lexer, token|
  lexer.pop_state(:foo)
end
```

Assume you are parsing HTML and you want to ignore anything inside comments. Here is how you might use lexer states to do this:

```ruby
class MyLexer < Lex::Lexer
  tokens()

  # Declare the states
  states(htmlcomment: :exclusive)

  # Enter HTML comment
  rule(:begin_htmlcomment, /<!--/) do |lexer, token|
    lexer.begin(:htmlcomment)
  end

  # Leave HTML comment
  rule(:htmlcomment, /-->/) do |lexer, token|
    lexer.begin(:initial)
  end

  error :htmlcomment do |lexer, token|
    lexer.logger.info "Ignoring character #{token.value}"
  end

  ignore :htmlcomment, " \t\n"

  ignore " \t"
end
```

### 1.14 Debugging

To run the lexer in debug mode, pass the `:debug` flag set to `true`:

```ruby
MyLexer.new(debug: true)
```

## Contributing

1. Fork it ( https://github.com/[my-github-username]/lex/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

## Copyright

Copyright (c) 2015 Piotr Murach. See LICENSE for further details.