rubylexer 0.6.2

data/testing.txt ADDED
@@ -0,0 +1,130 @@
1
+
2
+ Running the tests:
3
+ the simplest thing to do is run testcode/locatetest. this will use locate to find as much ruby
4
+ code on your system as it can, then test each specimen to see if it can be tokenized correctly (by feeding it
5
+ to testcode/rubylexervsruby.rb, the operation of which is outlined below under 'testing strategy').
6
+
7
+ Interpreting the output of rubylexervsruby.rb (and locatetest):
8
+ in rubylexervsruby, i've tried to follow the philosophy that the test program
9
+ doesn't print anything unless there's an error. perhaps i haven't followed
10
+ this far enough; every run of rubylexervsruby produces a little output, and
11
+ sometimes a run will produce output that doesn't actually indicate a problem,
12
+ or only a low-priority problem. (since locatetest, torment, and test all run
13
+ rubylexervsruby over and over, they all produce lots of (mostly harmless)
14
+ output. sorry.)
15
+
16
+ the following types of output should be ignored:
17
+
18
+ diff file or chunk headers
19
+
20
+ lines that look like this:
21
+ executing: ruby testcode/tokentest.rb ... #normal, 1 for every file
22
+ or this:
23
+ warning moved from 24 to 22: ambiguous first argument; put parentheses or even spaces
24
+ or this:
25
+ Created warning(s) in new file, line 85: useless use of <=> in void context
26
+ or this:
27
+ Removed warning(s) from old file (?!), line 85: useless use of <=> in void context
28
+ indicate that a warning was added, deleted, or moved. ultimately, these should
29
+ go away, but right now it's a low-priority issue.
30
+
31
+ if you ever see a ruby stack dump in rubylexervsruby output, that's certainly
32
+ an error (if the input ruby code is valid).
33
+
34
+ something that looks like a unidiff chunk body (not header) may indicate
35
+ an error as well. the problem is that sometimes those morpheaous warnings
36
+ sneak through my filter (which is supposed to condense them into a single
37
+ line like those above), so you will see diff chunks where the only real
38
+ difference is a warning. here are some examples of the kind of diff chunks
39
+ that should NOT cause alarm:
40
+
41
+ --:89: warning: useless use of <=> in void context
42
+ --:92: warning: useless use of <=> in void context
43
+ +-:90: warning: useless use of <=> in void context
44
+ Stack now 0 2 62 300 5 110 365 544
45
+
46
+ Shifting token tIDENTIFIER, Entering state 34
47
+ -Reading a token: -:318: warning: ambiguous first argument; put parentheses or even spaces
48
+ -Next token is token tINTEGER ()
49
+ +Reading a token: Next token is token tINTEGER ()
50
+ Reducing stack by rule 476 (line 2382), tIDENTIFIER -> operation
51
+
52
+ if you look closely (and are experienced in reading unidiff output), you'll
53
+ see that the only difference is a warning. to understand more about how the
54
+ unidiff output is created, see the section on testing strategy below.
55
+
56
+ htree/template.rb:
57
+ testing this file prints a small unidiff chunk. analysis indicates that the
58
+ cause is that ruby's lexer generates an extra (empty) string content
59
+ token at this point, which mine omits. there's no actual semantic difference
60
+ between the two tokenizations, so there's nothing to be concerned about. in
61
+ a future release, when my lexer supports the notion of string contents and
62
+ string delimiters as separate token types, i'll try to emulate ruby more
63
+ closely. the same case is replicated in p.rb.
64
+ (in other words, ignore the error in this file and the identical one in p.rb.)
65
+
66
+
67
+ if you find any output that doesn't look like one of the above exceptions,
68
+ and the input file was valid ruby, please send it to me so that i can add it
69
+ to my arsenal of tests.
70
+
71
+ there are a number of 'ruby' files that i know of out there that actually
72
+ contain syntax errors:
73
+ rpcd.rb from freeride -- missing an end
74
+ sample1.rb from 1.6 version of tcltk -- not legal in ruby 1.8
75
+ bdb.rb from libdb2, 3, and 4 -- not how you declare [] method in ruby
76
+
77
+ testdata/p.rb (my menagerie of weird test cases) is one of the worst
78
+ offenders; it prints lots of output when tested, but all of the problems
79
+ are harmless or minor.
80
+
81
+ only the first 10 lines of each failing file are printed. the rest, as well
82
+ as other intermediate files, are kept in the testresults directory. the test
83
+ output files are named *.prs.diff. beware: this directory is never cleaned,
84
+ and can get quite large. after a large test run, you'll want to empty this
85
+ directory to recover some disk space.
86
+
87
+ about the directories: tbd
88
+
89
+ about testcode/dumptokens.rb: tbd
90
+
91
+ about testcode/tokentest.rb:
92
+ a fairly simple-minded test utility; given an input file, it uses RubyLexer
93
+ to tokenize it, then prints out each token as it is found. certain small
94
+ changes will be made; numeric constants (including char constants) are
95
+ converted to decimal and strings are converted to double-quoted form, where
96
+ possible. optional flags can cause other changes: --maxws inserts whitespace
97
+ everywhere that it's possible; --implicit inserts parentheses where they
98
+ were left out at call sites. --implicit-all adds parentheses around the lists
99
+ following when, for, and rescue keywords. --keepws is the usual mode;
100
+ otherwise a 'symbolic mode' is used wherein newline is represented by '#;',
101
+ for instance. note: currently the output will not be valid ruby unless
102
+ only --maxws or --keepws is used. in a future release --implicit output will
103
+ also be valid ruby, but currently it also puts '*[' and ']' around assignment
104
+ right hand sides, which only works most of the time.
105
+
106
+ about testcode/torment:
107
+ finds ruby files by other heuristics (not using locate) and runs each
108
+ through rubylexervsruby. this is roughly comparable to locatetest, but
109
+ more complicated and (probably) less comprehensive.
110
+
111
+ about ./test:
112
+ this contains a number of ruby files which have failed on my Debian system
113
+ in the past. as the paths are hard-coded, it's unlikely to be very portable.
114
+
115
+ testing strategy:
116
+ this command:
117
+ ruby -w -y < $1 2>&1 | grep ^Shift|cut -d" " -f3
118
+ gives a list of the types of token, as known to ruby, in a source file $1. the
119
+ utility program tokentest.rb runs the lexer against a source file and then simply
120
+ prints the tokens out again (perhaps with whitespace inserted between tokens). if
121
+ the list of token types in this derived source file, as determined by the above command,
122
+ is the same as in the original, we can be pretty confident that ruby and rubylexer are
123
+ tokenizing in the same way. since whitespaces are optionally inserted between tokens, it
124
+ is unlikely that rubylexer is ever finding two tokens where ruby thinks there's only one.
125
+ it is possible, however, that rubylexer is emitting as a single token things that ruby
126
+ thinks should be 2 tokens. and in fact, this is the case with strings: ruby divides a
127
+ string into string open, string body, and string close tokens with optional interpolations,
128
+ whereas rubylexer has just a single string token (with subtokens, if interpolations are
129
+ present). this difference in handling accounts in part for rubylexer's inability
130
+ to correctly lex certain very complicated strings.
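The comparison described under 'testing strategy' can be sketched in Ruby. This is only an illustration, not the code in testcode/rubylexervsruby.rb: the helper names `shifted_token_types` and `same_token_types?` are invented here, and the sketch works on trace text that has already been captured from `ruby -w -y` rather than running ruby itself.

```ruby
# Hypothetical helper (not part of rubylexer): mirrors
#   grep ^Shift | cut -d" " -f3
# on a string holding the captured output of `ruby -w -y`.
def shifted_token_types(parser_trace)
  parser_trace.each_line
              .select { |line| line.start_with?("Shift") }
              .map { |line| line.chomp.split(/ /, -1)[2].to_s }
end

# Hypothetical helper: true when both traces shift the same token types
# in the same order.
def same_token_types?(original_trace, derived_trace)
  shifted_token_types(original_trace) == shifted_token_types(derived_trace)
end

# Trace fragments modelled on the sample diff chunk in testing.txt.
original = <<~TRACE
  Shifting token tIDENTIFIER, Entering state 34
  Reading a token: -:318: warning: ambiguous first argument; put parentheses or even spaces
  Next token is token tINTEGER ()
TRACE

derived = <<~TRACE
  Shifting token tIDENTIFIER, Entering state 34
  Reading a token: Next token is token tINTEGER ()
TRACE

puts same_token_types?(original, derived)
```

Warning lines do not start with `Shift`, so they drop out of this list entirely; the two traces above therefore count as the same tokenization even though a plain diff of them would show a chunk.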
data/token.rb ADDED
@@ -0,0 +1,486 @@
1
+ =begin copyright
2
+ rubylexer - a ruby lexer written in ruby
3
+ Copyright (C) 2004,2005 Caleb Clausen
4
+
5
+ This library is free software; you can redistribute it and/or
6
+ modify it under the terms of the GNU Lesser General Public
7
+ License as published by the Free Software Foundation; either
8
+ version 2.1 of the License, or (at your option) any later version.
9
+
10
+ This library is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ Lesser General Public License for more details.
14
+
15
+ You should have received a copy of the GNU Lesser General Public
16
+ License along with this library; if not, write to the Free Software
17
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18
+ =end
19
+
20
+ require "rubycode"
21
+
22
+ #-------------------------
23
+ class Token
24
+ attr_accessor :ident
25
+ alias to_s ident
26
+ attr_accessor :offset #file offset of start of this token
27
+
28
+ def initialize(ident,offset=nil)
29
+ @ident=ident
30
+ @offset=offset
31
+ end
32
+
33
+ def error; end
34
+ end
35
+
36
+ #-------------------------
37
+ class WToken< Token
38
+ def ===(pattern)
39
+ assert @ident
40
+ pattern===@ident
41
+ end
42
+ end
43
+
44
+ #-------------------------
45
+ class KeywordToken < WToken #also some operators
46
+
47
+ #-----------------------------------
48
+ def set_callsite!
49
+ @callsite=true
50
+ end
51
+
52
+ #-----------------------------------
53
+ def callsite?
54
+ @callsite ||= nil
55
+ end
56
+
57
+ #-----------------------------------
58
+ def has_end!
59
+ assert self===RubyLexer::BEGINWORDS
60
+ @has_end=true
61
+ end
62
+
63
+
64
+ #-----------------------------------
65
+ def has_end?
66
+ self===RubyLexer::BEGINWORDS and @has_end||=nil
67
+ end
68
+ end
69
+
70
+ #-------------------------
71
+ class OperatorToken < WToken
72
+ end
73
+
74
+
75
+ #-------------------------
76
+ module TokenPat
77
+ @@TokenPats={}
78
+ def token_pat #used in various case statements...
79
+ result=self.dup
80
+ @@TokenPats[self] ||=
81
+ (class <<result
82
+ alias old_3eq ===
83
+ def ===(token)
84
+ WToken===token and old_3eq(token.ident)
85
+ end
86
+ end;result)
87
+ end
88
+ end
89
+
90
+ class String; include TokenPat; end
91
+ class Regexp; include TokenPat; end
92
+
93
+ #-------------------------
94
+ class VarNameToken < WToken
95
+ end
96
+
97
+ #-------------------------
98
+ class NumberToken < Token
99
+ def to_s; @ident.to_s end
100
+ end
101
+
102
+ #-------------------------
103
+ class SymbolToken < Token
104
+ def initialize(ident,offset=nil)
105
+ super ":#{ident}", offset
106
+ # @char=':'
107
+ end
108
+ end
109
+
110
+ #-------------------------
111
+ class MethNameToken < Token # < SymbolToken
112
+ def initialize(ident,offset=nil)
113
+ @ident= (VarNameToken===ident)? ident.ident : ident
114
+ @offset=offset
115
+ # @char=''
116
+ end
117
+
118
+ def [](regex) #is this used?
119
+ regex===ident
120
+ end
121
+ def ===(pattern)
122
+ pattern===@ident
123
+ end
124
+ end
125
+
126
+ #-------------------------
127
+ class NewlineToken < Token
128
+ def initialize(nlstr="\n",offset=nil)
129
+ super(nlstr,offset)
130
+ #@char=''
131
+ end
132
+ end
133
+
134
+ #-------------------------
135
+ class StringToken < Token
136
+ attr :char
137
+
138
+ attr_accessor :modifiers #for regex only
139
+ attr_accessor :elems
140
+
141
+ def initialize(type='"',ident='')
142
+ super(ident)
143
+ type=="'" and type='"'
144
+ @char=type
145
+ assert(@char[/^[\[{"`\/]$/])
146
+ @elems=[ident.dup] #why .dup?
147
+ @modifiers=nil
148
+ end
149
+
150
+ DQUOTE_ESCAPE_TABLE = [
151
+ ["\n",'\n'],
152
+ ["\r",'\r'],
153
+ ["\t",'\t'],
154
+ ["\v",'\v'],
155
+ ["\f",'\f'],
156
+ ["\e",'\e'],
157
+ ["\b",'\b'],
158
+ ["\a",'\a']
159
+ ]
160
+ PREFIXERS={ '['=>"%w[", '{'=>'%W{' }
161
+ SUFFIXERS={ '['=>"]", '{'=>'}' }
162
+
163
+ def to_s(transname=:transform)
164
+ assert(@char[/[\[{"`\/]/])
165
+ #on output, all single-quoted strings become double-quoted
166
+ assert(@elems.length==1) if @char=='['
167
+
168
+ result=(PREFIXERS[@char] or @char).dup
169
+ starter=result[-1,1]
170
+ ender=(SUFFIXERS[@char] or @char).dup
171
+ 0.step(@elems.length-1,2) { |i|
172
+ strfrag=@elems[i].dup
173
+ result << send(transname,strfrag,starter,ender)
174
+
175
+ if e=@elems[i+1]
176
+ assert(e.kind_of?(RubyCode))
177
+ result << '#' + e.to_s
178
+ end
179
+ }
180
+ result << ender
181
+
182
+ modifiers and result << modifiers #regex only
183
+
184
+ return result
185
+ end
186
+
187
+ def to_term
188
+ result=[]
189
+ 0.step(@elems.length-1,2) { |i|
190
+ result << ConstTerm.new(@elems[i].dup)
191
+
192
+ if e=@elems[i+1]
193
+ assert(e.kind_of?(RubyCode))
194
+ result << (RubyTerm.new e)
195
+ end
196
+ }
197
+ return result
198
+ end
199
+
200
+ def append(glob)
201
+ assert @elems.last.kind_of?(String)
202
+ case glob
203
+ when String,Integer then append_str! glob
204
+ when RubyCode then append_code! glob
205
+ else raise "bad string contents: #{glob}, a #{glob.class}"
206
+ end
207
+ assert @elems.last.kind_of?(String)
208
+ end
209
+
210
+ def append_token(strtok)
211
+ assert @elems.last.kind_of?(String)
212
+ assert strtok.elems.last.kind_of?(String)
213
+ assert strtok.elems.first.kind_of?(String)
214
+
215
+ @elems.last << strtok.elems.shift
216
+
217
+ first=strtok.elems.first
218
+ assert( first.nil? || first.kind_of?(RubyCode) )
219
+
220
+ @elems += strtok.elems
221
+ @ident << strtok.ident
222
+
223
+ assert((!@modifiers or !strtok.modifiers))
224
+ @modifiers||=strtok.modifiers
225
+
226
+ assert @elems.last.kind_of?(String)
227
+
228
+ return self
229
+ end
230
+
231
+ private
232
+ #simpler transform, preserves original exactly
233
+ def simple_transform(strfrag,starter,ender)
234
+ #assert('[{/'[@char])
235
+ #strfrag.gsub!(/#([{$@])/,'\\#\\1') unless @char=='['
236
+ strfrag.gsub!(Regexp.new("[\\"+starter+"\\"+ender+"]"), '\\\\\&')
237
+ return strfrag
238
+ end
239
+
240
+ def transform(strfrag,starter,ender)
241
+ strfrag.gsub!("\\",'\\'*4)
242
+ strfrag.gsub!(/#([{$@])/,'\\#\\1')
243
+ strfrag.gsub!(Regexp.new("[\\"+starter+"\\"+ender+"]"),'\\\\\\&') unless @char=='?'
244
+ DQUOTE_ESCAPE_TABLE.each {|pair|
245
+ strfrag.gsub!(*pair)
246
+ } unless @char=='/'
247
+ strfrag.gsub!(/[^ -~]/){|np| #nonprintables
248
+ np.unpack('C*').map{|byte| "\\x"+sprintf('%02X',byte)}.join
249
+ }
250
+ #break up long lines (best done later?)
251
+ strfrag.gsub!(/(\\x[0-9A-F]{2}|\\?.){40}/i, "\\&\\\n")
252
+ return strfrag
253
+ end
254
+
255
+ def append_str!(str)
256
+ assert @elems.last.kind_of?(String)
257
+ @elems.last << str
258
+ @ident << str
259
+ assert @elems.last.kind_of?(String)
260
+ end
261
+
262
+ def append_code!(code)
263
+ assert @elems.last.kind_of?(String)
264
+ @elems.concat [code, '']
265
+ @ident << "\#{#{code}}"
266
+ assert @elems.last.kind_of?(String)
267
+ end
268
+ end
269
+
270
+ #-------------------------
271
+ class RenderExactlyStringToken < StringToken
272
+ alias transform simple_transform
273
+ end
274
+
275
+ #-------------------------
276
+ class HerePlaceholderToken < WToken
277
+ attr_reader :termex, :quote, :ender
278
+ attr_accessor :unsafe_to_use, :string
279
+ attr_accessor :bodyclass
280
+
281
+ def initialize(dash,quote,ender)
282
+ @dash,@quote,@ender=dash,quote,ender
283
+ @unsafe_to_use=true
284
+ @string=StringToken.new
285
+
286
+ #@termex=/^#{'[\s\v]*' if dash}#{Regexp.escape ender}$/
287
+ @termex=Regexp.new \
288
+ ["^", ('[\s\v]*' if dash), Regexp.escape(ender), "$"].join
289
+ @bodyclass=HereBodyToken
290
+ end
291
+
292
+ def ===(bogus); false end
293
+
294
+ def to_s
295
+ if unsafe_to_use
296
+ result="<<"
297
+ result << if/[^a-z_0-9]/i===@ender
298
+ %["#{@ender.gsub(/[\\"]/, '\\\\'+'\\&')}"]
299
+ else
300
+ @ender
301
+ end
302
+ else
303
+ @string.to_s
304
+ end
305
+ end
306
+
307
+ def append s; @string.append s end
308
+
309
+ def append_token tok; @string.append_token tok end
310
+
311
+ end
312
+
313
+ #-------------------------
314
+ class IgnoreToken < Token
315
+ end
316
+
317
+ #-------------------------
318
+ class WsToken < IgnoreToken
319
+ end
320
+
321
+ #-------------------------
322
+ class ZwToken < IgnoreToken
323
+ def initialize(offset)
324
+ super('',offset)
325
+ end
326
+ def explicit_form
327
+ abstract
328
+ end
329
+ def explicit_form_all; explicit_form end
330
+ end
331
+
332
+ class NoWsToken < ZwToken
333
+ def explicit_form_all
334
+ "#nows#"
335
+ end
336
+ def explicit_form
337
+ nil
338
+ end
339
+ end
340
+
341
+ class ImplicitParamListStartToken < ZwToken
342
+ def explicit_form
343
+ '('
344
+ end
345
+ end
346
+ class ImplicitParamListEndToken < ZwToken
347
+ def explicit_form
348
+ ')'
349
+ end
350
+ end
351
+
352
+ class AssignmentRhsListStartToken < ZwToken
353
+ def explicit_form
354
+ '*['
355
+ end
356
+ end
357
+
358
+ class AssignmentRhsListEndToken < ZwToken
359
+ def explicit_form
360
+ ']'
361
+ end
362
+ end
363
+
364
+ class KwParamListStartToken < ZwToken
365
+ def explicit_form_all
366
+ "#((#"
367
+ end
368
+ def explicit_form
369
+ nil
370
+ end
371
+ end
372
+
373
+ class KwParamListEndToken < ZwToken
374
+ def explicit_form_all
375
+ "#))#"
376
+ end
377
+ def explicit_form
378
+ nil
379
+ end
380
+ end
381
+
382
+ #-------------------------
383
+ class EscNlToken < IgnoreToken
384
+ def initialize(filename,linenum,ident="\\\n",offset=nil)
385
+ super(ident,offset)
386
+ #@char='\\'
387
+ @filename=filename
388
+ @linenum=linenum
389
+ end
390
+ end
391
+
392
+ #-------------------------
393
+ class EoiToken < IgnoreToken
394
+ attr :file
395
+ alias :pos :offset
396
+
397
+ def initialize(cause,file, offset=nil)
398
+ super(cause,offset)
399
+ @file=file
400
+ end
401
+ end
402
+
403
+ #-------------------------
404
+ class HereBodyToken < IgnoreToken
405
+ #attr_accessor :ender
406
+ def initialize(headtok)
407
+ assert HerePlaceholderToken===headtok
408
+ super(headtok.string,headtok.string.offset)
409
+ @headtok=headtok
410
+ end
411
+
412
+ end
413
+
414
+ #-------------------------
415
+ class FileAndLineToken < IgnoreToken
416
+ attr :line
417
+
418
+ def initialize(ident,line,offset=nil)
419
+
420
+ super ident,offset
421
+ #@char='#'
422
+ @line=line
423
+ end
424
+
425
+ #def char; '#' end
426
+
427
+ def to_s()
428
+ ['#', @ident, ':', @line].join
429
+ end
430
+
431
+ def file() @ident end
432
+ def subitem() @line end #needed?
433
+ end
434
+
435
+ #-------------------------
436
+ class OutlinedHereBodyToken < HereBodyToken
437
+ def to_s
438
+ assert HerePlaceholderToken===@headtok
439
+ result=@headtok.string
440
+ result=result.to_s(:simple_transform).match(/^"(.*)"$/m)[1]
441
+ return "\n" +
442
+ result +
443
+ @headtok.ender +
444
+ "\n"
445
+ end
446
+ end
447
+
448
+ #-------------------------
449
+ module ErrorToken
450
+ attr_accessor :error
451
+ end
452
+
453
+ #-------------------------
454
+ class SubitemToken < Token
455
+ attr :char2
456
+ attr :subitem
457
+
458
+ def initialize(ident,subitem)
459
+ super ident
460
+ @subitem=subitem
461
+ end
462
+
463
+ def to_s()
464
+ super+@char2+@subitem.to_s
465
+ end
466
+ end
467
+
468
+
469
+ #-------------------------
470
+ class DecoratorToken < SubitemToken
471
+ def initialize(ident,subitem)
472
+ super '^'+ident,subitem
473
+ @subitem=@subitem.to_s #why to_s?
474
+ #@char='^'
475
+ @char2='='
476
+ end
477
+
478
+ #alias to_s ident #parent has right implementation of to_s... i think
479
+ def needs_value?() @subitem.nil? end
480
+
481
+ def value=(v) @subitem=v end
482
+ def value() @subitem end
483
+ end
484
+
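The `TokenPat` module in data/token.rb gives a String or Regexp a copy whose `===` matches against a token's `ident`, so the copy can be used in `case` statements over tokens. A minimal sketch of that idea in isolation follows. It assumes a stand-in `SketchToken` struct in place of `WToken` and a free function `token_pattern` in place of the mixin; neither name exists in rubylexer, and the caching done through `@@TokenPats` is left out.

```ruby
# Stand-in for WToken (assumption): only the ident attribute matters here.
SketchToken = Struct.new(:ident)

# Hypothetical helper: returns a copy of +pattern+ whose === accepts
# tokens and compares the original pattern against token.ident.
def token_pattern(pattern)
  wrapped = pattern.dup
  wrapped.define_singleton_method(:===) do |token|
    token.is_a?(SketchToken) && (pattern === token.ident)
  end
  wrapped
end

END_PAT  = token_pattern("end")
WORD_PAT = token_pattern(/\A[a-z_]+\z/)

# case/when calls pattern === token, so the wrapped copies do the matching.
def classify(token)
  case token
  when END_PAT  then :end_keyword
  when WORD_PAT then :word
  else :other
  end
end

p classify(SketchToken.new("end"))
p classify(SketchToken.new("foo"))
p classify(SketchToken.new("42"))
```

Because only the copy gets the singleton `===`, the original string or regexp keeps its normal matching behaviour everywhere else.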
485
+
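`StringToken` in data/token.rb keeps `@elems` as an alternating list: a String fragment, then an interpolated code element, then a String fragment again, always starting and ending with a String. The sketch below shows that invariant and the stepping loop used to render it. `SketchCode` and `SketchString` are simplified stand-ins invented for this example; the assumption that the code element's `to_s` supplies the braces is a guess based on `result << '#' + e.to_s`, and the escaping done by `transform` is omitted.

```ruby
# Stand-in for RubyCode (assumption: its to_s includes the braces).
SketchCode = Struct.new(:source) do
  def to_s
    "{#{source}}"
  end
end

# Simplified stand-in for StringToken: no escaping, double quotes only.
class SketchString
  attr_reader :elems

  def initialize(text = "")
    @elems = [text.dup]            # always begins with a String fragment
  end

  def append_str(str)
    @elems.last << str             # grow the trailing fragment in place
    self
  end

  def append_code(code)
    @elems.concat([code, +""])     # code element, then a fresh empty fragment
    self
  end

  def to_s
    result = +'"'
    # even indexes hold fragments, odd indexes hold code elements
    0.step(@elems.length - 1, 2) do |i|
      result << @elems[i]
      code = @elems[i + 1]
      result << "#" << code.to_s if code
    end
    result << '"'
  end
end

s = SketchString.new("a=").append_code(SketchCode.new("x")).append_str(" b")
p s.elems.map(&:class)
puts s.to_s
```

Appending a code element always pushes an empty String after it, which is what lets the asserts in `append` and `append_token` insist that the last element is a String.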
486
+
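The zero-width tokens in data/token.rb (`ZwToken` and its subclasses) carry no text of their own but can report an `explicit_form` and an `explicit_form_all`. The sketch below shows one way a printer could use those two methods. It is an assumption about usage, not the logic of testcode/tokentest.rb: the `Sketch...` classes are trimmed copies made up for this example, `render` and its mode names are hypothetical, and the token positions in the sample lists are illustrative.

```ruby
class SketchToken
  attr_reader :ident
  def initialize(ident)
    @ident = ident
  end
  alias to_s ident
end

# Zero-width token: empty ident; explicit_form_all falls back to explicit_form.
class SketchZwToken < SketchToken
  def initialize
    super("")
  end

  def explicit_form_all
    explicit_form
  end
end

class SketchParamListStart < SketchZwToken
  def explicit_form; "("; end
end

class SketchParamListEnd < SketchZwToken
  def explicit_form; ")"; end
end

class SketchKwListStart < SketchZwToken
  def explicit_form; nil; end
  def explicit_form_all; "#((#"; end
end

class SketchKwListEnd < SketchZwToken
  def explicit_form; nil; end
  def explicit_form_all; "#))#"; end
end

# Hypothetical renderer: mode is :plain, :explicit or :explicit_all.
def render(tokens, mode = :plain)
  tokens.map do |tok|
    next tok.to_s unless tok.is_a?(SketchZwToken)
    case mode
    when :explicit     then tok.explicit_form.to_s
    when :explicit_all then tok.explicit_form_all.to_s
    else ""
    end
  end.join
end

call = [SketchToken.new("puts"), SketchParamListStart.new, SketchToken.new(" "),
        SketchToken.new("x"), SketchParamListEnd.new]
when_clause = [SketchToken.new("when"), SketchToken.new(" "),
               SketchKwListStart.new, SketchToken.new("1"), SketchKwListEnd.new]

puts render(call, :explicit)
puts render(when_clause, :explicit_all)
```

A `nil` from `explicit_form` renders as nothing, so the keyword-list markers only show up in the `:explicit_all` mode.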