rubylexer 0.6.2

data/README ADDED
@@ -0,0 +1,134 @@
1
+ -=RubyLexer 0.6.2=-
2
+
3
+ RubyLexer is a lexer library for Ruby, written in Ruby. My goal with RubyLexer
4
+ was to create a lexer for Ruby that's complete and correct; all legal Ruby
5
+ code should be lexed correctly by RubyLexer as well. Just enough parsing
6
+ capability is included to give RubyLexer enough context to tokenize correctly
7
+ in all cases. (This turned out to be more parsing than I had thought or
8
+ wanted to take on at first.)
9
+
10
+ Other Ruby lexers exist, but most are inadequate. For instance, irb has its
11
+ own little lexer, as does (I believe) RDoc, and so do all the IDEs that can
12
+ colorize. I've seen several stand-alone libraries as well. All or almost all
13
+ suffer from the same problems: they skip the hard part of lexing. RubyLexer
14
+ handles the hard things like complicated strings, the ambiguous nature of
15
+ some punctuation characters and keywords in ruby, and distinguishing methods
16
+ and local variables.
17
+
18
+ RubyLexer is not particularly clean code. As I progressed in writing this,
19
+ I've learned a little about how these things are supposed to be done; the
20
+ lexer is not supposed to have any state of its own; instead it gets whatever
21
+ it needs to know from the parser. As a stand-alone lexer, RubyLexer maintains
22
+ quite a lot of state. Every instance variable in the RubyLexer class is some
23
+ sort of lexer state. Most of the complication and ugly code in RubyLexer is
24
+ in maintaining or using this state.
25
+
26
+ For information about using RubyLexer in your program, please see howtouse.txt.
27
+
28
+ For my notes on the testing of RubyLexer, see testing.txt.
29
+
30
+ If you have any questions, comments, problems, new feature requests, or just
31
+ want to figure out how to make it work for what you need to do, contact me:
32
+ rubylexer _at_ inforadical.net
33
+
34
+ RubyLexer is a RubyForge project. RubyForge is another good place to send your
35
+ bug reports or whatever: http://rubyforge.org/projects/rubylexer/
36
+
37
+ (There aren't any bugs filed against RubyLexer there yet, but don't be afraid
38
+ that your report will get lonely.)
39
+
40
+ Status:
41
+ RubyLexer can correctly lex all legal Ruby 1.8 code that I've been able to
42
+ find on my Debian system. It can also handle (most of) my catalog of nasty
43
+ test cases (in testdata/p.rb). At this point, new bugs are almost exclusively
44
+ found by my home-grown test code, rather than ruby code gathered 'from the
45
+ wild'. A largish sample of ruby recently tested for the first time (that is,
46
+ Rubyx) had _0_ lex errors. (And this is not the only example.) There are a
47
+ number of issues I know about and plan to fix, but it seems that Ruby coders
48
+ don't write code complex enough to trigger them very often. Although
49
+ incomplete, RubyLexer is nevertheless better than many existing ad-hoc
50
+ lexers. For instance, RubyLexer can correctly distinguish all cases of the
51
+ different uses of the following operators, depending on context:
52
+ % can be modulus operator or start of fancy string
53
+ / can be division operator or start of regex
54
+ * & + - can be unary or binary operator
55
+ [] can be for array literal or index method
56
+ << can be here document or left shift operator (or in class<<obj expr)
57
+ :: can be unary or binary operator
58
+ : can be start of symbol, substitute for then, or part of ternary op
59
+ (there are other uses too, but they're not supported yet.)
60
+ ? can be start of character constant or ternary operator
61
+ ` can be method name or start of exec string
62
+
63
+ todo:
64
+ test w/ more code (rubygems, rpa, obfuscated ruby contest, rubicon, others?)
65
+ these 5 should be my standard test suite: p.rb, (matz') test.rb, tk.rb, obfuscated ruby contest, rubicon
66
+ test more ways: cvt source to dos or mac fmt before testing
67
+ test more ways: run unit tests after passing thru rubylexer (0.7)
68
+ test more ways: test require'd, load'd, or eval'd code as well (0.7)
69
+ lex code a line (or chunk) at a time and save state for next line (irb wants this) (0.8)
70
+ incremental lexing (ides want this (for performance))
71
+ put everything in a namespace
72
+ integrate w/ other tools...
73
+ html colorized output?
74
+ move more state onto @bracestack (ongoing)
75
+ expand on test documentation above
76
+ the new cases in p.rb now compile, but won't run
77
+ use want_op_name more
78
+ return result as a half-parsed tree (with parentheses and the like matched)
79
+ emit advisory tokens when see beginword, then (or equivalent), or end... what else does florian want?
80
+ strings are still slow
81
+ rantfile
82
+ emit advisory tokens when local var defined/goes out of scope (or hidden/unhidden?)
83
+ fakefile should be a mixin
84
+ token pruning in dumptokens...
85
+
86
+ new ruby features not yet supported:
87
+ procs without proc keyword, looks like hash to current lexer
88
+ keyword arguments, in hash immediates or actual param lists (&formal param lists?)
89
+ unicode (0.9)
90
+ :wrap and friends... (i wish someone would make a list of all the uses of colon in ruby.)
91
+ parens in block param list (works, but hacky)
92
+
93
+
94
+ known issues: (and planned fix release)
95
+ context not really preserved when entering or leaving string inclusions. this causes
96
+ a number of problems. (0.8)
97
+ string tokenization sometimes a little different from ruby around newlines
98
+ (htree/template.rb) (0.8)
99
+ string contents might not be correctly translated in a few cases (0.8?)
100
+ the implicit tokens might be emitted at the wrong times. (or not at the right times) (need more test code) (0.7)
101
+ local variables should be temporarily hidden by class, module, and def (0.7)
102
+ windows or mac newlines in source are likely to cause problems in obscure cases (need test case)
103
+ line numbers are sometimes off... probably to do with multi-line strings (=begin...=end causes this) (0.8)
104
+ symbols which contain string interpolations are flattened into one token. eg :"foo#{bar}" (0.8)
105
+ methnames and varnames might get mixed up in def header (in idents after the 'def' but before param list) (0.7)
106
+ FileAndLineToken not emitted everywhere it should be (0.8)
107
+ '\r' whitespace sometimes seen in dos-formatted output.. shouldn't be (eg pre.rb) (0.7)
108
+ no way to get offset of __END__ (??) (0.7)
109
+ put things in lib/
110
+
111
+
112
+ fixed issues in 0.6.2:
113
+ testcode/dumptokens.rb charhandler.rb doesn't work... but does after unix2dos (not reproducible)
114
+ files should be opened in binmode to avoid all possible eol translation
115
+ (x.+?x) doesn't work
116
+ methname/varname mixups in some cases
117
+ performance, in most important cases.
118
+ error handling tokens should be emitted on error input... ErrorToken mixin module
119
+ but old error handling interface should be preserved and made available.
120
+ move readahead and friends into IOext. make optimized readahead et al for fakefile.
121
+ dos newlines (and newlines generally) can't be fancy string delimiters
122
+ do,if,until, etc, have no way to tell if an end is associated
123
+ break readme into pieces
124
+
125
+
126
+ fixed issues in 0.6.0:
127
+ the implicit tokens might be emitted at the wrong times. (or not at the right times) (partly fixed) (0.6)
128
+ : operator might be a synonym for 'then' (0.6)
129
+ variables other than the last are not recognized in multiple assignment. (0.7)
130
+ variables created by for and rescue are not recognized. (0.7)
131
+ token following :: should not be BareSymbolToken if begins with A-Z (unless obviously a func, eg b/c followed by func param list)
132
+ read code to be lexed from a string. (irb wants this) (0.7)
133
+ fancy symbols don't work at all. (like this: %s{abcdefg}) (0.7) [this is regressing now]
134
+ Newline,EscNl,BareSymbolToken may get renamed
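The context-dependent punctuation listed under Status in the README can be seen in ordinary Ruby. The snippet below is only an illustrative sketch in current Ruby syntax. It is not part of RubyLexer, and the method name `ambiguity_samples` is made up for this example. Each group of lines uses the same character with two of the meanings a lexer has to tell apart.

```ruby
# Illustration only: not part of RubyLexer. `ambiguity_samples` is a
# hypothetical helper name.
def ambiguity_samples
  x = 7
  y = 2
  out = {}

  out[:modulus]      = x % y            # % as a binary operator
  out[:fancy_string] = %(x % y)         # % starting a string literal

  out[:division]     = x / y            # / as a binary operator
  out[:regex_match]  = "a/b" =~ /\//    # / starting (and ending) a regex

  out[:left_shift]   = 1 << 3           # << as a binary operator
  out[:here_doc]     = <<EOS            # << starting a here document
x << y
EOS

  out[:array]        = [x, y]           # [ starting an array literal
  out[:index]        = out[:array][1]   # [ as the index method

  out[:unary_minus]  = -x               # - as a unary operator
  out[:binary_minus] = y - x            # - as a binary operator

  out[:symbol]       = :x               # : starting a symbol
  out[:ternary]      = x > y ? :big : :small  # ? and : as the ternary operator

  out
end
```

In every pair the deciding factor is what came before the character (a value or an operator), which is why a lexer for Ruby needs to carry some parsing context.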
data/Rantfile ADDED
@@ -0,0 +1,37 @@
1
+ import %w(rubydoc rubypackage)
2
+
3
+
4
+ test_files=Dir["test{code,data}/*.rb"]
5
+ lib_files = Dir["lib/**/*.rb"] #need to actually put files here...
6
+ dist_files = lib_files + %w(Rantfile README COPYING) + test_files
7
+
8
+ desc "Run unit tests."
9
+ task :test do
10
+ sys.mkdir 'testresults'
11
+ test_files.each{|f|
12
+ sys.ruby "testcode/rubylexervsruby.rb", f
13
+ }
14
+ lib_files.each{|f|
15
+ sys.ruby "testcode/rubylexervsruby.rb", f
16
+ }
17
+ system 'which locate grep' and
18
+ sys.ruby "testcode/rubylexervsruby.rb", `locate /tk.rb|grep 'tk.rb$'`
19
+ end
20
+
21
+ desc "Generate html documentation."
22
+ gen RubyDoc do |t|
23
+ t.opts = %w(--title RubyLexer --main README README)
24
+ end
25
+
26
+ desc "Create packages."
27
+ gen RubyPackage, :rubylexer do |t|
28
+ t.version = "0.6.2"
29
+ t.summary = "A complete lexer of ruby in ruby."
30
+ t.files = dist_files
31
+ t.package_task :gem
32
+ #need more here
33
+ end
34
+
35
+ task :clean do
36
+ sys.rm_rf %w(doc pkg testresults)
37
+ end
data/assert.rb ADDED
@@ -0,0 +1,31 @@
1
+ =begin copyright
2
+ rubylexer - a ruby lexer written in ruby
3
+ Copyright (C) 2004,2005 Caleb Clausen
4
+
5
+ This library is free software; you can redistribute it and/or
6
+ modify it under the terms of the GNU Lesser General Public
7
+ License as published by the Free Software Foundation; either
8
+ version 2.1 of the License, or (at your option) any later version.
9
+
10
+ This library is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ Lesser General Public License for more details.
14
+
15
+ You should have received a copy of the GNU Lesser General Public
16
+ License along with this library; if not, write to the Free Software
17
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18
+ =end
19
+ require 'set'
20
+
21
+
22
+ def assert(expr,msg="assertion failed")
23
+ $DEBUG and (expr or raise msg)
24
+ end
25
+
26
+ @@printed=Set.new
27
+ def fixme(s)
28
+ @@printed.include?( s) and return
29
+ $DEBUG and STDERR.print "FIXME: #{s}\n"
30
+ @@printed.add s
31
+ end
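The two helpers in assert.rb (a check that only fires under `$DEBUG`, and a FIXME notice printed at most once per message) can be sketched in current Ruby without a top-level class variable. This is an assumption-laden rewrite, not the author's code: the module name `DebugChecks`, the constant `PRINTED`, and the extra `out` parameter are hypothetical and exist only so the example is self-contained.

```ruby
require 'set'

# Hypothetical module; mirrors the behaviour of assert/fixme above.
module DebugChecks
  PRINTED = Set.new   # messages already reported by fixme

  module_function

  # Raise msg only when ruby runs in debug mode and expr is false or nil.
  def assert(expr, msg = "assertion failed")
    raise msg if $DEBUG && !expr
    nil
  end

  # Report each distinct message at most once. Returns true the first
  # time a message is seen, false on repeats. The message is recorded
  # even when $DEBUG is off, as in the original.
  def fixme(text, out = $stderr)
    return false unless PRINTED.add?(text)
    out.puts "FIXME: #{text}" if $DEBUG
    true
  end
end
```

`Set#add?` returns nil when the element is already present, which replaces the separate `include?` and `add` calls.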
data/charhandler.rb ADDED
@@ -0,0 +1,84 @@
1
+ =begin copyright
2
+ rubylexer - a ruby lexer written in ruby
3
+ Copyright (C) 2004,2005 Caleb Clausen
4
+
5
+ This library is free software; you can redistribute it and/or
6
+ modify it under the terms of the GNU Lesser General Public
7
+ License as published by the Free Software Foundation; either
8
+ version 2.1 of the License, or (at your option) any later version.
9
+
10
+ This library is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ Lesser General Public License for more details.
14
+
15
+ You should have received a copy of the GNU Lesser General Public
16
+ License along with this library; if not, write to the Free Software
17
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18
+ =end
19
+
20
+ require "charset"
21
+ #------------------------------------
22
+ class CharHandler
23
+ #-----------------------------------
24
+ CHARSETSPECIALS=CharSet[?[ ,?] ,?\\ ,?-]
25
+ def initialize(receiver,default,hash)
26
+ @default=default
27
+ @receiver=receiver
28
+ #breakpoint
29
+ @table=Array.new(0)
30
+ @matcher='^[^'
31
+
32
+ hash.each_pair {|pattern,action|
33
+ case pattern
34
+ when Range
35
+ pattern.each { |c|
36
+ c.kind_of? String and c=c[0] #cvt to integer #still needed?
37
+ self[c]=action
38
+ }
39
+ when String
40
+ pattern.each_byte {|b| self[b]=action }
41
+ when Fixnum
42
+ self[pattern]=action
43
+ else
44
+ raise "invalid pattern class #{pattern.class}"
45
+ end
46
+ }
47
+
48
+ @matcher += ']$'
49
+ @matcher=Regexp.new(@matcher)
50
+
51
+ freeze
52
+ end
53
+
54
+ #-----------------------------------
55
+ def []=(b,action) #for use in initialize only
56
+ assert b >= ?\x00
57
+ assert b <= ?\xFF
58
+ assert !frozen?
59
+
60
+ @table[b]=action
61
+ @matcher<<?\\ if CHARSETSPECIALS===b
62
+ @matcher<<b
63
+ end
64
+ private :[]=
65
+
66
+ #-----------------------------------
67
+ def go(b,*args)
68
+ @receiver.send((@table[b] or @default), b.chr, *args)
69
+ end
70
+
71
+ #-----------------------------------
72
+ def eat_file(file,blocksize,*args)
73
+ begin
74
+ chars=file.read(blocksize)
75
+ md=@matcher.match(chars)
76
+ mychar=md[0][0]
77
+ #get file back in the right pos
78
+ file.pos+=md.offset(0)[0] - chars.length
79
+ @receiver.send(@default,md[0])
80
+ end until go(mychar,*args)
81
+ end
82
+ end
83
+
84
+
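The core idea of CharHandler is a 256-entry table that maps a byte to the name of a method on a receiver, with a default for unmapped bytes. A minimal sketch of just that dispatch step in current Ruby follows. The names `ByteDispatcher` and `ToyLexer` are hypothetical, and the regex-based `eat_file` scanning is deliberately left out.

```ruby
# Hypothetical names; only the table-dispatch part of CharHandler.
class ByteDispatcher
  def initialize(receiver, default, patterns)
    @receiver = receiver
    @default  = default
    @table    = Array.new(256)   # byte value => method name (or nil)
    patterns.each_pair do |pattern, action|
      bytes_for(pattern).each { |b| @table[b] = action }
    end
    freeze
  end

  # Call the handler registered for byte, or the default handler.
  def go(byte, *args)
    @receiver.send(@table[byte] || @default, byte.chr, *args)
  end

  private

  # Expand a pattern (Range, String, or Integer) into byte values.
  def bytes_for(pattern)
    case pattern
    when Range   then pattern.flat_map { |c| bytes_for(c) }
    when String  then pattern.bytes
    when Integer then [pattern]
    else raise ArgumentError, "invalid pattern class #{pattern.class}"
    end
  end
end

# A toy receiver that records which handler saw which character.
class ToyLexer
  attr_reader :log

  def initialize
    @log = []
  end

  def digit(ch)
    @log << [:digit, ch]
  end

  def letter(ch)
    @log << [:letter, ch]
  end

  def other(ch)
    @log << [:other, ch]
  end
end
```

Usage, under the same assumptions: `d = ByteDispatcher.new(ToyLexer.new, :other, { ("0".."9") => :digit, ("a".."z") => :letter })`, then `"a1+".each_byte { |b| d.go(b) }`.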
data/charset.rb ADDED
@@ -0,0 +1,76 @@
1
+ =begin copyright
2
+ rubylexer - a ruby lexer written in ruby
3
+ Copyright (C) 2004,2005 Caleb Clausen
4
+
5
+ This library is free software; you can redistribute it and/or
6
+ modify it under the terms of the GNU Lesser General Public
7
+ License as published by the Free Software Foundation; either
8
+ version 2.1 of the License, or (at your option) any later version.
9
+
10
+ This library is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ Lesser General Public License for more details.
14
+
15
+ You should have received a copy of the GNU Lesser General Public
16
+ License along with this library; if not, write to the Free Software
17
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18
+ =end
19
+
20
+ class CharSet
21
+ def initialize(*charss)
22
+ clear
23
+ charss.each {|chars| add chars }
24
+ end
25
+
26
+ def CharSet.[](*charss) CharSet.new(*charss) end
27
+
28
+ def clear
29
+ @bitset=0
30
+ end
31
+
32
+ def add(chars)
33
+ case chars
34
+ when String
35
+ chars.each_byte {|c| @bitset |= (1<<c) }
36
+ when Fixnum then @bitset |= (1<<chars)
37
+ else chars.each {|c| @bitset |= (1<<c) }
38
+ end
39
+ end
40
+
41
+ def remove(chars)
42
+ case chars
43
+ when String
44
+ chars.each_byte {|c| @bitset &= ~(1<<c) }
45
+ when Fixnum then @bitset &= ~(1<<chars)
46
+ else chars.each {|c| @bitset &= ~(1<<c) }
47
+ end
48
+ #this math works right with bignums... (i'm pretty sure)
49
+ end
50
+
51
+ def ===(c) #c is String|Fixnum|nil
52
+ c.nil? and return false
53
+ c.kind_of? String and c=c[0]
54
+ return ( @bitset[c] != 0 )
55
+ end
56
+
57
+ #enumerate the chars in the set AS INTEGERS
58
+ def each_byte(&block)
59
+ #should use ffs... not available in ruby
60
+ (0..255).each { |n|
61
+ @bitset[n] != 0 and block[n] #bit test returns 0 or 1; 0 is truthy in ruby
62
+ }
63
+ end
64
+
65
+ def each(&block)
66
+ each_byte{|n| block[n.chr] }
67
+ end
68
+
69
+ def chars #turn bitset back into a string
70
+ result=''
71
+ each {|c| result << c }
72
+ return result
73
+ end
74
+ end
75
+
76
+
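CharSet keeps membership in the bits of a single integer: bit n is set when byte n belongs to the set. The sketch below restates that technique in current Ruby. The class name `ByteSet` and its method names are hypothetical, not RubyLexer's API. One detail worth spelling out is that the bit test returns 0 or 1, and both are truthy in Ruby, so the result has to be compared explicitly.

```ruby
# Hypothetical class; same bitset technique as CharSet above.
class ByteSet
  def initialize(*items)
    @bits = 0
    items.each { |item| add(item) }
  end

  def add(item)
    bytes_of(item).each { |b| @bits |= (1 << b) }
    self
  end

  def remove(item)
    bytes_of(item).each { |b| @bits &= ~(1 << b) }
    self
  end

  # Accepts a String (first byte is tested), an Integer, or nil.
  def include?(item)
    byte = item.is_a?(String) ? item.getbyte(0) : item
    return false if byte.nil?
    @bits[byte] == 1     # explicit comparison: 0 would be truthy
  end
  alias_method :===, :include?

  # Yield each member as an integer byte value, in ascending order.
  def each_byte
    return enum_for(:each_byte) unless block_given?
    (0..255).each { |n| yield n if @bits[n] == 1 }
  end

  # Turn the bitset back into a string.
  def chars
    each_byte.map(&:chr).join
  end

  private

  def bytes_of(item)
    case item
    when String  then item.bytes
    when Integer then [item]
    else item.flat_map { |c| bytes_of(c) }
    end
  end
end
```

Because Ruby integers grow as needed, setting bit 255 simply produces a larger integer; no fixed-width storage is required.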
data/context.rb ADDED
@@ -0,0 +1,174 @@
1
+ module NestedContexts
2
+ class NestedContext
3
+ attr :starter
4
+ attr :ender
5
+ attr :linenum
6
+ def initialize(starter,ender,linenum)
7
+ @starter,@ender,@linenum=starter,ender,linenum
8
+ end
9
+
10
+ alias dflt_initialize initialize
11
+
12
+ def matches?(tok)
13
+ @ender==tok
14
+ end
15
+
16
+ def see stack,msg; end
17
+ end
18
+
19
+ class ListContext < NestedContext
20
+ end
21
+
22
+ class ListImmedContext < ListContext
23
+ def initialize(starter,linenum)
24
+ assert '{['[starter]
25
+ super(starter, starter.tr('{[','}]') ,linenum)
26
+ end
27
+ end
28
+
29
+ class ParenContext < NestedContext
30
+ def initialize(linenum)
31
+ super('(', ')' ,linenum)
32
+ end
33
+ end
34
+
35
+ class BlockContext < NestedContext
36
+ def initialize(linenum)
37
+ super('{','}',linenum)
38
+ end
39
+ end
40
+
41
+ class BlockParamListContext < ListContext
42
+ def initialize(linenum)
43
+ super('|','|',linenum)
44
+ end
45
+ end
46
+
47
+ class ParamListContext < ListContext
48
+ def initialize(linenum)
49
+ super('(', ')',linenum)
50
+ end
51
+ end
52
+
53
+ class ImplicitContext < ListContext
54
+ end
55
+
56
+ class ParamListContextNoParen < ImplicitContext
57
+ def initialize(linenum)
58
+ dflt_initialize(nil,nil,linenum)
59
+ end
60
+ end
61
+
62
+ class KwParamListContext < ImplicitContext
63
+ def initialize(starter,linenum)
64
+ dflt_initialize(starter,nil,linenum)
65
+ end
66
+ end
67
+
68
+ class AssignmentRhsContext < ImplicitContext
69
+ def initialize(linenum)
70
+ dflt_initialize(nil,nil,linenum)
71
+ end
72
+ end
73
+
74
+ class WantsEndContext < NestedContext
75
+ def initialize(starter,linenum)
76
+ super(starter,'end',linenum)
77
+ end
78
+
79
+ def see stack,msg
80
+ msg==:rescue ? stack.push_rescue_sm : super
81
+ end
82
+ end
83
+
84
+ class StringContext < NestedContext #not used yet
85
+ def initialize(starter,linenum)
86
+ super(starter, starter[-1,1].tr('{[(','}])'),linenum)
87
+ end
88
+ end
89
+
90
+ class HereStringContext < StringContext #not used yet
91
+ def initialize(ender,linenum)
92
+ dflt_initialize("\n",ender,linenum)
93
+ end
94
+ end
95
+
96
+ class TopLevelContext < NestedContext
97
+ def initialize
98
+ dflt_initialize('','',1)
99
+ end
100
+ end
101
+
102
+
103
+ class RescueSMContext < NestedContext
104
+ #normal progression: rescue => arrow => then
105
+ EVENTS=[:rescue,:arrow,:then,:semi,:colon]
106
+ LEGAL_SUCCESSORS={nil=> [:rescue], :rescue => [:arrow,:then,:semi,:colon],:arrow => [:then,:semi,:colon],:then => [nil]}
107
+ #note on :semi and :colon events: in arrow state (and only then),
108
+ # (unescaped) newline, semicolon, and (unaccompanied) colon
109
+ # also trigger the :then event. otherwise, they are ignored.
110
+ attr :state
111
+
112
+ def initialize linenum
113
+ dflt_initialize("rescue","then",linenum)
114
+ @state=nil
115
+ @state=:rescue
116
+ end
117
+
118
+ def see(stack,msg)
119
+ case msg
120
+ when :rescue:
121
+ WantsEndContext===stack.last or
122
+ BlockContext===stack.last or
123
+ ParenContext===stack.last or
124
+ raise 'syntax error: rescue not expected at this time'
125
+ when :arrow: #local var defined in this state
126
+ when :then,:semi,:colon:
127
+ msg=:then
128
+ RescueSMContext===stack.pop or raise 'syntax error: then not expected at this time'
129
+ #pop self off owning context stack
130
+ else super
131
+ end
132
+ LEGAL_SUCCESSORS[@state].include? msg or raise "rescue syntax error: #{msg} unexpected in #@state"
133
+ @state=msg
134
+ end
135
+
136
+ end
137
+
138
+ class ForSMContext < NestedContext
139
+ #normal progression: for => in
140
+ EVENTS=[:for,:in]
141
+ LEGAL_SUCCESSORS={nil=> :for, :for => :in,:in => nil}
142
+ #note on :semi and :colon events: in :in state (and only then),
143
+ # (unescaped) newline, semicolon, and (unaccompanied) colon
144
+ # also trigger the :then event. otherwise, they are ignored.
145
+ attr :state
146
+
147
+ def initialize linenum
148
+ dflt_initialize("for","in",linenum)
149
+ @state=:for
150
+ end
151
+
152
+ def see(stack,msg)
153
+ case msg
154
+ when :for: WantsEndContext===stack.last or raise 'syntax error: for not expected at this time'
155
+ #local var defined in this state
156
+ when :in: ForSMContext===stack.pop or raise 'syntax error: in not expected at this time'
157
+ stack.push ExpectDoOrNlContext.new("for",/(do|;|:|\n)/,@linenum)
158
+ #pop self off owning context stack and push ExpectDoOrNlContext
159
+ else super
160
+ end
161
+ LEGAL_SUCCESSORS[@state] == msg or raise "for syntax error: #{msg} unexpected in #@state"
162
+ @state=msg
163
+ end
164
+ end
165
+
166
+ class ExpectDoOrNlContext < NestedContext
167
+ end
168
+
169
+ class TernaryContext < NestedContext
170
+ def initialize(linenum)
171
+ dflt_initialize('?',':',linenum)
172
+ end
173
+ end
174
+ end
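RescueSMContext and ForSMContext are small state machines driven by a table of legal successor states. The sketch below isolates that idea in current Ruby. It is a simplification under stated assumptions: the class name `RescueStateMachine` is hypothetical, there is no context stack, and `:semi` and `:colon` are always folded into `:then`, whereas the original comments restrict that to particular states.

```ruby
# Hypothetical, simplified version of the rescue state machine above.
class RescueStateMachine
  # state => events that may legally follow it
  LEGAL_SUCCESSORS = {
    nil     => [:rescue],
    :rescue => [:arrow, :then],
    :arrow  => [:then],
    :then   => []
  }.freeze

  # Events treated as the end of the rescue header.
  TERMINATORS = [:then, :semi, :colon].freeze

  attr_reader :state

  def initialize
    @state = nil
  end

  # Advance on event, or raise if the table does not allow it.
  def see(event)
    event = :then if TERMINATORS.include?(event)
    unless LEGAL_SUCCESSORS.fetch(@state).include?(event)
      raise ArgumentError,
            "rescue syntax error: #{event} unexpected in #{@state.inspect}"
    end
    @state = event
  end

  def done?
    @state == :then
  end
end
```

For input such as `rescue Foo => e;` a caller would send `:rescue`, `:arrow`, and `:semi` in turn, after which `done?` is true.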