rubylexer 0.6.2

data/testing.txt ADDED
@@ -0,0 +1,130 @@
1
+
2
+ Running the tests:
3
+ the simplest thing to do is run testcode/locatetest. this will use locate to find as much ruby
4
+ code on your system as it can, then test each specimen to see if it can be tokenized correctly (by feeding it
5
+ to testcode/rubylexervsruby.rb, the operation of which is outlined below under 'testing strategy').
6
+
7
+ Interpreting the output of rubylexervsruby.rb (and locatetest):
8
+ in rubylexervsruby, i've tried to follow the philosophy that the test program
9
+ doesn't print anything unless there's an error. perhaps i haven't followed
10
+ this far enough; every run of rubylexervsruby produces a little output, and
11
+ sometimes a run will produce output that doesn't actually indicate a problem,
12
+ or only a low-priority problem. (since locatetest, torment, and test all run
13
+ rubylexervsruby over and over, they all produce lots of (mostly harmless)
14
+ output. sorry.)
15
+
16
+ the following types of output should be ignored:
17
+
18
+ diff file or chunk headers
19
+
20
+ lines that look like this:
21
+ executing: ruby testcode/tokentest.rb ... #normal, 1 for every file
22
+ or this:
23
+ warning moved from 24 to 22: ambiguous first argument; put parentheses or even spaces
24
+ or this:
25
+ Created warning(s) in new file, line 85: useless use of <=> in void context
26
+ or this:
27
+ Removed warning(s) from old file (?!), line 85: useless use of <=> in void context
28
+ indicate that a warning was added, deleted, or moved. ultimately, these should
29
+ go away, but right now it's a low-priority issue.
30
+
31
+ if you ever see a ruby stack dump in rubylexervsruby output, that's certainly
32
+ an error (if the input ruby code is valid).
33
+
34
+ something that looks like a unidiff chunk body (not header) may indicate
35
+ an error as well. the problem is that sometimes those amorphous warnings
36
+ sneak through my filter (which is supposed to condense them into a single
37
+ line like those above), so you will see diff chunks where the only real
38
+ difference is a warning. here are some examples of the kind of diff chunks
39
+ that should NOT cause alarm:
40
+
41
+ --:89: warning: useless use of <=> in void context
42
+ --:92: warning: useless use of <=> in void context
43
+ +-:90: warning: useless use of <=> in void context
44
+ Stack now 0 2 62 300 5 110 365 544
45
+
46
+ Shifting token tIDENTIFIER, Entering state 34
47
+ -Reading a token: -:318: warning: ambiguous first argument; put parentheses or even spaces
48
+ -Next token is token tINTEGER ()
49
+ +Reading a token: Next token is token tINTEGER ()
50
+ Reducing stack by rule 476 (line 2382), tIDENTIFIER -> operation
51
+
52
+ if you look closely (and are experienced in reading unidiff output), you'll
53
+ see that the only difference is a warning. to understand more about how the
54
+ unidiff output is created, see the section on testing strategy below.
55
+
56
+ htree/template.rb:
57
+ testing this file prints a small unidiff chunk. analysis indicates that the
58
+ problem is that ruby's lexer generates an extra (empty) string content
59
+ token at this point, which mine omits. there's no actual semantic difference
60
+ between the two tokenizations, so there's nothing to be concerned about. in
61
+ a future release, when my lexer supports the notion of string contents and
62
+ string delimiters as separate token types, i'll try to emulate ruby more
63
+ closely. the same case is replicated in p.rb.
64
+ (in other words, ignore the error in this file and the identical one in p.rb.)
65
+
66
+
67
+ if you find any output that doesn't look like one of the above exceptions,
68
+ and the input file was valid ruby, please send it to me so that i can add it
69
+ to my arsenal of tests.
70
+
71
+ there are a number of 'ruby' files that i know of out there that actually
72
+ contain syntax errors:
73
+ rpcd.rb from freeride -- missing an end
74
+ sample1.rb from 1.6 version of tcltk -- not legal in ruby 1.8
75
+ bdb.rb from libdb2, 3, and 4 -- not how you declare [] method in ruby
76
+
77
+ testdata/p.rb (my menagerie of weird test cases) is one of the worst
78
+ offenders; it prints lots of output when tested, but all of the problems
79
+ are harmless or minor.
80
+
81
+ only the first 10 lines of each failing file are printed. the rest, as well
82
+ as other intermediate files are kept in the testresults directory. the test
83
+ output files are named *.prs.diff. beware: this directory is never cleaned,
84
+ and can get quite large. after a large test run, you'll want to empty this
85
+ directory to recover some disk space.
86
+
87
+ about the directories: tbd
88
+
89
+ about testcode/dumptokens.rb: tbd
90
+
91
+ about testcode/tokentest.rb:
92
+ a fairly simple-minded test utility; given an input file, it uses RubyLexer
93
+ to tokenize it, then prints out each token as it is found. certain small
94
+ changes will be made; numeric constants (including char constants) are
95
+ converted to decimal and strings are converted to double-quoted form, where
96
+ possible. optional flags can cause other changes: --maxws inserts whitespace
97
+ everywhere that it's possible, --implicit inserts parentheses where they
98
+ were left out at call sites. --implicit-all adds parentheses around the lists
99
+ following when, for, and rescue keywords. --keepws is the usual mode;
100
+ otherwise a 'symbolic mode' is used wherein newline is represented by '#;',
101
+ for instance. note: currently the output will not be valid ruby unless
102
+ only --maxws or --keepws is used. in a future release --implicit will
103
+ also be valid ruby, but currently it also puts '*[' and ']' around assignment
104
+ right hand sides, which only works most of the time.
105
+
106
+ about testcode/torment:
107
+ finds ruby files by other heuristics (not using locate) and runs each
108
+ through rubylexervsruby. this is roughly comparable to locatetest, but
109
+ more complicated and (probably) less comprehensive.
110
+
111
+ about ./test:
112
+ this contains a number of ruby files which have failed on my Debian system
113
+ in the past. as the paths are hard-coded, it's unlikely to be very portable.
114
+
115
+ testing strategy:
116
+ this command:
117
+ ruby -w -y < $1 2>&1 | grep ^Shift|cut -d" " -f3
118
+ gives a list of the types of token, as known to ruby, in a source file $1. the
119
+ utility program tokentest.rb runs the lexer against a source file and then simply
120
+ prints the tokens out again (perhaps with whitespace inserted between tokens). if
121
+ the list of token types in this derived source file, as determined by the above command,
122
+ is the same as in the original, we can be pretty confident that ruby and rubylexer are
123
+ tokenizing in the same way. since whitespaces are optionally inserted between tokens, it
124
+ is unlikely that rubylexer is ever finding two tokens where ruby thinks there's only one.
125
+ it is possible, however, that rubylexer is emitting as a single token things that ruby
126
+ thinks should be 2 tokens. and in fact, this is the case with strings: ruby divides a
127
+ string into string open, string body, and string close tokens with optional interpolations,
128
+ whereas rubylexer has just a single string token (with subtokens, if interpolations are
129
+ present.) this difference in handling accounts in part for rubylexer's inability
130
+ to correctly lex certain very complicated strings.
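
the comparison described under 'testing strategy' can be sketched in ruby roughly as follows. this is only an illustration, not the code in rubylexervsruby.rb: the method names shifted_token_types and same_tokenization? are made up for this example, and it assumes the parser debug output contains lines shaped like the 'Shifting token tIDENTIFIER, Entering state 34' sample shown earlier. like the grep/cut pipeline, it keeps the third space-separated field of each line that starts with 'Shift'.

```ruby
# hypothetical helper: mirrors `grep ^Shift | cut -d" " -f3` on text that
# was captured from `ruby -w -y < file 2>&1` (the capture is not shown here).
def shifted_token_types(debug_output)
  debug_output.each_line
              .select { |line| line.start_with?("Shift") }
              .map { |line| line.split(" ")[2] } # third field, e.g. "tIDENTIFIER,"
end

# hypothetical helper: true when both debug outputs shift the same token
# types in the same order, which is the condition the strategy checks for.
def same_tokenization?(original_output, derived_output)
  shifted_token_types(original_output) == shifted_token_types(derived_output)
end
```

warning lines never start with 'Shift', so a warning that only moves or disappears does not affect this comparison; that is why the diff chunks shown earlier are harmless.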
File without changes
data/token.rb ADDED
@@ -0,0 +1,486 @@
1
+ =begin copyright
2
+ rubylexer - a ruby lexer written in ruby
3
+ Copyright (C) 2004,2005 Caleb Clausen
4
+
5
+ This library is free software; you can redistribute it and/or
6
+ modify it under the terms of the GNU Lesser General Public
7
+ License as published by the Free Software Foundation; either
8
+ version 2.1 of the License, or (at your option) any later version.
9
+
10
+ This library is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ Lesser General Public License for more details.
14
+
15
+ You should have received a copy of the GNU Lesser General Public
16
+ License along with this library; if not, write to the Free Software
17
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18
+ =end
19
+
20
+ require "rubycode"
21
+
22
+ #-------------------------
23
+ class Token
24
+ attr_accessor :ident
25
+ alias to_s ident
26
+ attr_accessor :offset #file offset of start of this token
27
+
28
+ def initialize(ident,offset=nil)
29
+ @ident=ident
30
+ @offset=offset
31
+ end
32
+
33
+ def error; end
34
+ end
35
+
36
+ #-------------------------
37
+ class WToken< Token
38
+ def ===(pattern)
39
+ assert @ident
40
+ pattern===@ident
41
+ end
42
+ end
43
+
44
+ #-------------------------
45
+ class KeywordToken < WToken #also some operators
46
+
47
+ #-----------------------------------
48
+ def set_callsite!
49
+ @callsite=true
50
+ end
51
+
52
+ #-----------------------------------
53
+ def callsite?
54
+ @callsite ||= nil
55
+ end
56
+
57
+ #-----------------------------------
58
+ def has_end!
59
+ assert self===RubyLexer::BEGINWORDS
60
+ @has_end=true
61
+ end
62
+
63
+
64
+ #-----------------------------------
65
+ def has_end?
66
+ self===RubyLexer::BEGINWORDS and @has_end||=nil
67
+ end
68
+ end
69
+
70
+ #-------------------------
71
+ class OperatorToken < WToken
72
+ end
73
+
74
+
75
+ #-------------------------
76
+ module TokenPat
77
+ @@TokenPats={}
78
+ def token_pat #used in various case statements...
79
+ result=self.dup
80
+ @@TokenPats[self] ||=
81
+ (class <<result
82
+ alias old_3eq ===
83
+ def ===(token)
84
+ WToken===token and old_3eq(token.ident)
85
+ end
86
+ end;result)
87
+ end
88
+ end
89
+
90
+ class String; include TokenPat; end
91
+ class Regexp; include TokenPat; end
92
+
93
+ #-------------------------
94
+ class VarNameToken < WToken
95
+ end
96
+
97
+ #-------------------------
98
+ class NumberToken < Token
99
+ def to_s; @ident.to_s end
100
+ end
101
+
102
+ #-------------------------
103
+ class SymbolToken < Token
104
+ def initialize(ident,offset=nil)
105
+ super ":#{ident}", offset
106
+ # @char=':'
107
+ end
108
+ end
109
+
110
+ #-------------------------
111
+ class MethNameToken < Token # < SymbolToken
112
+ def initialize(ident,offset=nil)
113
+ @ident= (VarNameToken===ident)? ident.ident : ident
114
+ @offset=offset
115
+ # @char=''
116
+ end
117
+
118
+ def [](regex) #is this used?
119
+ regex===ident
120
+ end
121
+ def ===(pattern)
122
+ pattern===@ident
123
+ end
124
+ end
125
+
126
+ #-------------------------
127
+ class NewlineToken < Token
128
+ def initialize(nlstr="\n",offset=nil)
129
+ super(nlstr,offset)
130
+ #@char=''
131
+ end
132
+ end
133
+
134
+ #-------------------------
135
+ class StringToken < Token
136
+ attr :char
137
+
138
+ attr_accessor :modifiers #for regex only
139
+ attr_accessor :elems
140
+
141
+ def initialize(type='"',ident='')
142
+ super(ident)
143
+ type=="'" and type='"'
144
+ @char=type
145
+ assert(@char[/^[\[{"`\/]$/])
146
+ @elems=[ident.dup] #why .dup?
147
+ @modifiers=nil
148
+ end
149
+
150
+ DQUOTE_ESCAPE_TABLE = [
151
+ ["\n",'\n'],
152
+ ["\r",'\r'],
153
+ ["\t",'\t'],
154
+ ["\v",'\v'],
155
+ ["\f",'\f'],
156
+ ["\e",'\e'],
157
+ ["\b",'\b'],
158
+ ["\a",'\a']
159
+ ]
160
+ PREFIXERS={ '['=>"%w[", '{'=>'%W{' }
161
+ SUFFIXERS={ '['=>"]", '{'=>'}' }
162
+
163
+ def to_s(transname=:transform)
164
+ assert(@char[/[\[{"`\/]/])
165
+ #on output, all single-quoted strings become double-quoted
166
+ assert(@elems.length==1) if @char=='['
167
+
168
+ result=(PREFIXERS[@char] or @char).dup
169
+ starter=result[-1,1]
170
+ ender=(SUFFIXERS[@char] or @char).dup
171
+ 0.step(@elems.length-1,2) { |i|
172
+ strfrag=@elems[i].dup
173
+ result << send(transname,strfrag,starter,ender)
174
+
175
+ if e=@elems[i+1]
176
+ assert(e.kind_of?(RubyCode))
177
+ result << '#' + e.to_s
178
+ end
179
+ }
180
+ result << ender
181
+
182
+ modifiers and result << modifiers #regex only
183
+
184
+ return result
185
+ end
186
+
187
+ def to_term
188
+ result=[]
189
+ 0.step(@elems.length-1,2) { |i|
190
+ result << ConstTerm.new(@elems[i].dup)
191
+
192
+ if e=@elems[i+1]
193
+ assert(e.kind_of?(RubyCode))
194
+ result << (RubyTerm.new e)
195
+ end
196
+ }
197
+ return result
198
+ end
199
+
200
+ def append(glob)
201
+ assert @elems.last.kind_of?(String)
202
+ case glob
203
+ when String,Integer then append_str! glob
204
+ when RubyCode then append_code! glob
205
+ else raise "bad string contents: #{glob}, a #{glob.class}"
206
+ end
207
+ assert @elems.last.kind_of?(String)
208
+ end
209
+
210
+ def append_token(strtok)
211
+ assert @elems.last.kind_of?(String)
212
+ assert strtok.elems.last.kind_of?(String)
213
+ assert strtok.elems.first.kind_of?(String)
214
+
215
+ @elems.last << strtok.elems.shift
216
+
217
+ first=strtok.elems.first
218
+ assert( first.nil? || first.kind_of?(RubyCode) )
219
+
220
+ @elems += strtok.elems
221
+ @ident << strtok.ident
222
+
223
+ assert((!@modifiers or !strtok.modifiers))
224
+ @modifiers||=strtok.modifiers
225
+
226
+ assert @elems.last.kind_of?(String)
227
+
228
+ return self
229
+ end
230
+
231
+ private
232
+ #simpler transform, preserves original exactly
233
+ def simple_transform(strfrag,starter,ender)
234
+ #assert('[{/'[@char])
235
+ #strfrag.gsub!(/#([{$@])/,'\\#\\1') unless @char=='['
236
+ strfrag.gsub!(Regexp.new("[\\"+starter+"\\"+ender+"]"), '\\\\\&')
237
+ return strfrag
238
+ end
239
+
240
+ def transform(strfrag,starter,ender)
241
+ strfrag.gsub!("\\",'\\'*4)
242
+ strfrag.gsub!(/#([{$@])/,'\\#\\1')
243
+ strfrag.gsub!(Regexp.new("[\\"+starter+"\\"+ender+"]"),'\\\\\\&') unless @char=='?'
244
+ DQUOTE_ESCAPE_TABLE.each {|pair|
245
+ strfrag.gsub!(*pair)
246
+ } unless @char=='/'
247
+ strfrag.gsub!(/[^ -~]/){|np| #nonprintables
248
+ "\\x"+sprintf('%02X',np[0])
249
+ }
250
+ #break up long lines (best done later?)
251
+ strfrag.gsub!(/(\\x[0-9A-F]{2}|\\?.){40}/i, "\\&\\\n")
252
+ return strfrag
253
+ end
254
+
255
+ def append_str!(str)
256
+ assert @elems.last.kind_of?(String)
257
+ @elems.last << str
258
+ @ident << str
259
+ assert @elems.last.kind_of?(String)
260
+ end
261
+
262
+ def append_code!(code)
263
+ assert @elems.last.kind_of?(String)
264
+ @elems.concat [code, '']
265
+ @ident << "\#{#{code}}"
266
+ assert @elems.last.kind_of?(String)
267
+ end
268
+ end
269
+
270
+ #-------------------------
271
+ class RenderExactlyStringToken < StringToken
272
+ alias transform simple_transform
273
+ end
274
+
275
+ #-------------------------
276
+ class HerePlaceholderToken < WToken
277
+ attr_reader :termex, :quote, :ender
278
+ attr_accessor :unsafe_to_use, :string
279
+ attr_accessor :bodyclass
280
+
281
+ def initialize(dash,quote,ender)
282
+ @dash,@quote,@ender=dash,quote,ender
283
+ @unsafe_to_use=true
284
+ @string=StringToken.new
285
+
286
+ #@termex=/^#{'[\s\v]*' if dash}#{Regexp.escape ender}$/
287
+ @termex=Regexp.new \
288
+ ["^", ('[\s\v]*' if dash), Regexp.escape(ender), "$"].to_s
289
+ @bodyclass=HereBodyToken
290
+ end
291
+
292
+ def ===(bogus); false end
293
+
294
+ def to_s
295
+ if unsafe_to_use
296
+ result="<<"
297
+ result << if/[^a-z_0-9]/i===@ender
298
+ %["#{@ender.gsub(/[\\"]/, '\\\\'+'\\&')}"]
299
+ else
300
+ @ender
301
+ end
302
+ else
303
+ @string.to_s
304
+ end
305
+ end
306
+
307
+ def append s; @string.append s end
308
+
309
+ def append_token tok; @string.append_token tok end
310
+
311
+ end
312
+
313
+ #-------------------------
314
+ class IgnoreToken < Token
315
+ end
316
+
317
+ #-------------------------
318
+ class WsToken < IgnoreToken
319
+ end
320
+
321
+ #-------------------------
322
+ class ZwToken < IgnoreToken
323
+ def initialize(offset)
324
+ super('',offset)
325
+ end
326
+ def explicit_form
327
+ abstract
328
+ end
329
+ def explicit_form_all; explicit_form end
330
+ end
331
+
332
+ class NoWsToken < ZwToken
333
+ def explicit_form_all
334
+ "#nows#"
335
+ end
336
+ def explicit_form
337
+ nil
338
+ end
339
+ end
340
+
341
+ class ImplicitParamListStartToken < ZwToken
342
+ def explicit_form
343
+ '('
344
+ end
345
+ end
346
+ class ImplicitParamListEndToken < ZwToken
347
+ def explicit_form
348
+ ')'
349
+ end
350
+ end
351
+
352
+ class AssignmentRhsListStartToken < ZwToken
353
+ def explicit_form
354
+ '*['
355
+ end
356
+ end
357
+
358
+ class AssignmentRhsListEndToken < ZwToken
359
+ def explicit_form
360
+ ']'
361
+ end
362
+ end
363
+
364
+ class KwParamListStartToken < ZwToken
365
+ def explicit_form_all
366
+ "#((#"
367
+ end
368
+ def explicit_form
369
+ nil
370
+ end
371
+ end
372
+
373
+ class KwParamListEndToken < ZwToken
374
+ def explicit_form_all
375
+ "#))#"
376
+ end
377
+ def explicit_form
378
+ nil
379
+ end
380
+ end
381
+
382
+ #-------------------------
383
+ class EscNlToken < IgnoreToken
384
+ def initialize(filename,linenum,ident="\\\n",offset=nil)
385
+ super(ident,offset)
386
+ #@char='\\'
387
+ @filename=filename
388
+ @linenum=linenum
389
+ end
390
+ end
391
+
392
+ #-------------------------
393
+ class EoiToken < IgnoreToken
394
+ attr :file
395
+ alias :pos :offset
396
+
397
+ def initialize(cause,file, offset=nil)
398
+ super(cause,offset)
399
+ @file=file
400
+ end
401
+ end
402
+
403
+ #-------------------------
404
+ class HereBodyToken < IgnoreToken
405
+ #attr_accessor :ender
406
+ def initialize(headtok)
407
+ assert HerePlaceholderToken===headtok
408
+ super(headtok.string,headtok.string.offset)
409
+ @headtok=headtok
410
+ end
411
+
412
+ end
413
+
414
+ #-------------------------
415
+ class FileAndLineToken < IgnoreToken
416
+ attr :line
417
+
418
+ def initialize(ident,line,offset=nil)
419
+
420
+ super ident,offset
421
+ #@char='#'
422
+ @line=line
423
+ end
424
+
425
+ #def char; '#' end
426
+
427
+ def to_s()
428
+ ['#', @ident, ':', @line].to_s
429
+ end
430
+
431
+ def file() @ident end
432
+ def subitem() @line end #needed?
433
+ end
434
+
435
+ #-------------------------
436
+ class OutlinedHereBodyToken < HereBodyToken
437
+ def to_s
438
+ assert HerePlaceholderToken===@headtok
439
+ result=@headtok.string
440
+ result=result.to_s(:simple_transform).match(/^"(.*)"$/m)[1]
441
+ return "\n" +
442
+ result +
443
+ @headtok.ender +
444
+ "\n"
445
+ end
446
+ end
447
+
448
+ #-------------------------
449
+ module ErrorToken
450
+ attr_accessor :error
451
+ end
452
+
453
+ #-------------------------
454
+ class SubitemToken < Token
455
+ attr :char2
456
+ attr :subitem
457
+
458
+ def initialize(ident,subitem)
459
+ super ident
460
+ @subitem=subitem
461
+ end
462
+
463
+ def to_s()
464
+ super+@char2+@subitem.to_s
465
+ end
466
+ end
467
+
468
+
469
+ #-------------------------
470
+ class DecoratorToken < SubitemToken
471
+ def initialize(ident,subitem)
472
+ super '^'+ident,subitem
473
+ @subitem=@subitem.to_s #why to_s?
474
+ #@char='^'
475
+ @char2='='
476
+ end
477
+
478
+ #alias to_s ident #parent has right implementation of to_s... i think
479
+ def needs_value?() @subitem.nil? end
480
+
481
+ def value=(v) @subitem=v end
482
+ def value() @subitem end
483
+ end
484
+
485
+
486
+
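
the token_pat method above exists so that a String or Regexp can be used in a case statement against a token rather than against a bare string. the sketch below shows the same idea in a self-contained form. it is not the library's implementation: Token and WToken here are cut-down stand-ins, and TokenPatSketch and classify are names invented for this example. it uses a small wrapper object instead of redefining === on a duplicated pattern.

```ruby
# cut-down stand-ins for the classes in token.rb (assumption: only the
# parts needed for pattern matching are reproduced here)
class Token
  attr_accessor :ident, :offset

  def initialize(ident, offset = nil)
    @ident = ident
    @offset = offset
  end
end

class WToken < Token; end
class NumberToken < Token; end

# hypothetical wrapper playing the role of token_pat: it matches only
# WToken instances, and only when the wrapped pattern matches the ident
class TokenPatSketch
  def initialize(pattern)
    @pattern = pattern
  end

  def ===(token)
    WToken === token and @pattern === token.ident
  end
end

# hypothetical example of the 'various case statements' use
def classify(token)
  case token
  when TokenPatSketch.new(/\A(if|while|def)\z/) then :beginword
  when TokenPatSketch.new("end") then :end
  else :other
  end
end
```

with these definitions, classify(WToken.new("if")) gives :beginword, while classify(NumberToken.new("if")) gives :other, because the token is not a WToken.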