rubylexer 0.6.2

data/README ADDED
@@ -0,0 +1,134 @@
1
+ -=RubyLexer 0.6.2=-
2
+
3
+ RubyLexer is a lexer library for Ruby, written in Ruby. My goal with RubyLexer
4
+ was to create a lexer for Ruby that's complete and correct; all legal Ruby
5
+ code should be lexed correctly by RubyLexer as well. Just enough parsing
6
+ capability is included to give RubyLexer enough context to tokenize correctly
7
+ in all cases. (This turned out to be more parsing than I had thought or
8
+ wanted to take on at first.)
9
+
10
+ Other Ruby lexers exist, but most are inadequate. For instance, irb has its
11
+ own little lexer, as does (I believe) RDoc, and so do all the IDEs that can
12
+ colorize. I've seen several stand-alone libraries as well. All or almost all
13
+ suffer from the same problems: they skip the hard part of lexing. RubyLexer
14
+ handles the hard things like complicated strings, the ambiguous nature of
15
+ some punctuation characters and keywords in ruby, and distinguishing methods
16
+ and local variables.
17
+
18
+ RubyLexer is not particularly clean code. As I progressed in writing this,
19
+ I've learned a little about how these things are supposed to be done; the
20
+ lexer is not supposed to have any state of its own; instead it gets whatever
21
+ it needs to know from the parser. As a stand-alone lexer, RubyLexer maintains
22
+ quite a lot of state. Every instance variable in the RubyLexer class is some
23
+ sort of lexer state. Most of the complication and ugly code in RubyLexer is
24
+ in maintaining or using this state.
25
+
26
+ For information about using RubyLexer in your program, please see howtouse.txt.
27
+
28
+ For my notes on the testing of RubyLexer, see testing.txt.
29
+
30
+ If you have any questions, comments, problems, new feature requests, or just
31
+ want to figure out how to make it work for what you need to do, contact me:
32
+ rubylexer _at_ inforadical.net
33
+
34
+ RubyLexer is a RubyForge project. RubyForge is another good place to send your
35
+ bug reports or whatever: http://rubyforge.org/projects/rubylexer/
36
+
37
+ (There aren't any bugs filed against RubyLexer there yet, but don't be afraid
38
+ that your report will get lonely.)
39
+
40
+ Status:
41
+ RubyLexer can correctly lex all legal Ruby 1.8 code that I've been able to
42
+ find on my Debian system. It can also handle (most of) my catalog of nasty
43
+ test cases (in testdata/p.rb). At this point, new bugs are almost exclusively
44
+ found by my home-grown test code, rather than ruby code gathered 'from the
45
+ wild'. A largish sample of ruby recently tested for the first time (that is,
46
+ Rubyx) had _0_ lex errors. (And this is not the only example.) There are a
47
+ number of issues I know about and plan to fix, but it seems that Ruby coders
48
+ don't write code complex enough to trigger them very often. Although
49
+ incomplete, RubyLexer is nevertheless better than many existing ad-hoc
50
+ lexers. For instance, RubyLexer can correctly distinguish all cases of the
51
+ different uses of the following operators, depending on context:
52
+ % can be modulus operator or start of fancy string
53
+ / can be division operator or start of regex
54
+ * & + - can be unary or binary operator
55
+ [] can be for array literal or index method
56
+ << can be here document or left shift operator (or in class<<obj expr)
57
+ :: can be unary or binary operator
58
+ : can be start of symbol, substitute for then, or part of ternary op
59
+ (there are other uses too, but they're not supported yet.)
60
+ ? can be start of character constant or ternary operator
61
+ ` can be method name or start of exec string
62
+
63
+ todo:
64
+ test w/ more code (rubygems, rpa, obfuscated ruby contest, rubicon, others?)
65
+ these 5 should be my standard test suite: p.rb, (matz') test.rb, tk.rb, obfuscated ruby contest, rubicon
66
+ test more ways: cvt source to dos or mac fmt before testing
67
+ test more ways: run unit tests after passing thru rubylexer (0.7)
68
+ test more ways: test require'd, load'd, or eval'd code as well (0.7)
69
+ lex code a line (or chunk) at a time and save state for next line (irb wants this) (0.8)
70
+ incremental lexing (ides want this (for performance))
71
+ put everything in a namespace
72
+ integrate w/ other tools...
73
+ html colorized output?
74
+ move more state onto @bracestack (ongoing)
75
+ expand on test documentation above
76
+ the new cases in p.rb now compile, but won't run
77
+ use want_op_name more
78
+ return result as a half-parsed tree (with parentheses and the like matched)
79
+ emit advisory tokens when see beginword, then (or equivalent), or end... what else does florian want?
80
+ strings are still slow
81
+ rantfile
82
+ emit advisory tokens when local var defined/goes out of scope (or hidden/unhidden?)
83
+ fakefile should be a mixin
84
+ token pruning in dumptokens...
85
+
86
+ new ruby features not yet supported:
87
+ procs without proc keyword, looks like hash to current lexer
88
+ keyword arguments, in hash immediates or actual param lists (&formal param lists?)
89
+ unicode (0.9)
90
+ :wrap and friends... (i wish someone would make a list of all the uses of colon in ruby.)
91
+ parens in block param list (works, but hacky)
92
+
93
+
94
+ known issues: (and planned fix release)
95
+ context not really preserved when entering or leaving string inclusions. this causes
96
+ a number of problems. (0.8)
97
+ string tokenization sometimes a little different from ruby around newlines
98
+ (htree/template.rb) (0.8)
99
+ string contents might not be correctly translated in a few cases (0.8?)
100
+ the implicit tokens might be emitted at the wrong times. (or not at the right times) (need more test code) (0.7)
101
+ local variables should be temporarily hidden by class, module, and def (0.7)
102
+ windows or mac newline in source are likely to cause problems in obscure cases (need test case)
103
+ line numbers are sometimes off... probably to do with multi-line strings (=begin...=end causes this) (0.8)
104
+ symbols which contain string interpolations are flattened into one token. eg :"foo#{bar}" (0.8)
105
+ methnames and varnames might get mixed up in def header (in idents after the 'def' but before param list) (0.7)
106
+ FileAndLineToken not emitted everywhere it should be (0.8)
107
+ '\r' whitespace sometimes seen in dos-formatted output.. shouldn't be (eg pre.rb) (0.7)
108
+ no way to get offset of __END__ (??) (0.7)
109
+ put things in lib/
110
+
111
+
112
+ fixed issues in 0.6.2:
113
+ testcode/dumptokens.rb charhandler.rb doesn't work... but does after unix2dos (not reproducible)
114
+ files should be opened in binmode to avoid all possible eol translation
115
+ (x.+?x) doesn't work
116
+ methname/varname mixups in some cases
117
+ performance, in most important cases.
118
+ error handling tokens should be emitted on error input... ErrorToken mixin module
119
+ but old error handling interface should be preserved and made available.
120
+ move readahead and friends into IOext. make optimized readahead et al for fakefile.
121
+ dos newlines (and newlines generally) can't be fancy string delimiters
122
+ do,if,until, etc, have no way to tell if an end is associated
123
+ break readme into pieces
124
+
125
+
126
+ fixed issues in 0.6.0:
127
+ the implicit tokens might be emitted at the wrong times. (or not at the right times) (partly fixed) (0.6)
128
+ : operator might be a synonym for 'then' (0.6)
129
+ variables other than the last are not recognized in multiple assignment. (0.7)
130
+ variables created by for and rescue are not recognized. (0.7)
131
+ token following :: should not be BareSymbolToken if begins with A-Z (unless obviously a func, eg b/c followed by func param list)
132
+ read code to be lexed from a string. (irb wants this) (0.7)
133
+ fancy symbols don't work at all. (like this: %s{abcdefg}) (0.7) [this is regressing now]
134
+ Newline,EscNl,BareSymbolToken may get renamed
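The context-dependent operators listed in the README's Status section can be seen side by side in ordinary Ruby. The following is a minimal sketch, not taken from RubyLexer or its test data; the method name `ambiguity_demo` and the values are illustrative assumptions. Each pair of entries uses the same character in two different syntactic roles, which is the distinction a lexer has to make from context.

```ruby
# Hypothetical demo (not part of RubyLexer): the same punctuation
# character appears in two different syntactic roles.
def ambiguity_demo
  a = 7
  b = 2
  here = <<EOS
heredoc body
EOS
  {
    quotient:  a / b,                 # '/' as division operator
    regex:     /b/.source,            # '/' as start of a regex literal
    remainder: a % b,                 # '%' as modulus operator
    words:     %w(x y),               # '%' as start of a fancy string/array literal
    shifted:   a << b,                # '<<' as left shift operator
    heredoc:   here,                  # '<<' as start of a here document (above)
    negated:   -a,                    # '-' as unary operator
    diff:      a - b,                 # '-' as binary operator
    ternary:   a > b ? :big : :small, # '?' and ':' as ternary; ':' also starts symbols
    literal:   [a, b],                # '[' as array literal
    indexed:   [a, b][0]              # '[' as index method
  }
end
```

In every case the character alone is not enough to pick the token type; what came before it (an operand, an operator, or the start of an expression) decides.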
data/Rantfile ADDED
@@ -0,0 +1,37 @@
1
+ import %w(rubydoc rubypackage)
2
+
3
+
4
+ test_files=Dir["test{code,data}/*.rb"]
5
+ lib_files = Dir["lib/**/*.rb"] #need to actually put files here...
6
+ dist_files = lib_files + %w(Rantfile README COPYING) + test_files
7
+
8
+ desc "Run unit tests."
9
+ task :test do
10
+ sys.mkdir 'testresults'
11
+ test_files.each{|f|
12
+ sys.ruby "testcode/rubylexervsruby.rb", f
13
+ }
14
+ lib_files.each{|f|
15
+ sys.ruby "testcode/rubylexervsruby.rb", f
16
+ }
17
+ system('which locate grep') &&
18
+ sys.ruby "testcode/rubylexervsruby.rb", `locate /tk.rb|grep 'tk.rb$'`
19
+ end
20
+
21
+ desc "Generate html documentation."
22
+ gen RubyDoc do |t|
23
+ t.opts = %w(--title RubyLexer --main README README)
24
+ end
25
+
26
+ desc "Create packages."
27
+ gen RubyPackage, :rubylexer do |t|
28
+ t.version = "0.6.2"
29
+ t.summary = "A complete lexer of ruby in ruby."
30
+ t.files = dist_files
31
+ t.package_task :gem
32
+ #need more here
33
+ end
34
+
35
+ task :clean do
36
+ sys.rm_rf %w(doc pkg testresults)
37
+ end
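Chaining a shell check to a follow-up step with `&&`, as the Rantfile's test task does, depends on whether the first call is parenthesized. The sketch below illustrates the parsing rule only; `run_check` and `precedence_demo` are hypothetical stand-ins and do not call `system` or Rant's `sys`.

```ruby
# Hypothetical stand-in for a command such as system(...).
def run_check(arg)
  "checked #{arg.inspect}"
end

def precedence_demo
  cmd = 'which locate'
  # Without parentheses this parses as run_check(cmd && :next_step):
  # cmd is truthy, so && yields :next_step, and that becomes the argument.
  loose = run_check cmd && :next_step
  # With parentheses the check runs first and && then chains the next step.
  tight = run_check(cmd) && :next_step
  [loose, tight]
end
```

The parenthesized form is the one that behaves like "run the check, and only then run the next step".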
data/assert.rb ADDED
@@ -0,0 +1,31 @@
1
+ =begin copyright
2
+ rubylexer - a ruby lexer written in ruby
3
+ Copyright (C) 2004,2005 Caleb Clausen
4
+
5
+ This library is free software; you can redistribute it and/or
6
+ modify it under the terms of the GNU Lesser General Public
7
+ License as published by the Free Software Foundation; either
8
+ version 2.1 of the License, or (at your option) any later version.
9
+
10
+ This library is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ Lesser General Public License for more details.
14
+
15
+ You should have received a copy of the GNU Lesser General Public
16
+ License along with this library; if not, write to the Free Software
17
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18
+ =end
19
+ require 'set'
20
+
21
+
22
+ def assert(expr,msg="assertion failed")
23
+ $DEBUG and (expr or raise msg)
24
+ end
25
+
26
+ @@printed=Set.new
27
+ def fixme(s)
28
+ @@printed.include?( s) and return
29
+ $DEBUG and STDERR.print "FIXME: #{s}\n"
30
+ @@printed.add s
31
+ end
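The two helpers in assert.rb are a debug-only assertion and a print-once FIXME notice. A minimal sketch of the same idea for current Ruby versions follows. It keeps the set of printed messages in a module constant instead of a top-level class variable; the module name `DebugHelpers` and the `enabled:` and `out:` keywords are assumptions added for illustration and are not part of rubylexer.

```ruby
require 'set'

# Hypothetical module (not part of rubylexer) sketching the same helpers.
module DebugHelpers
  PRINTED = Set.new

  # Raises only when checking is enabled (defaults to Ruby's $DEBUG flag).
  def self.assert(expr, msg = "assertion failed", enabled: $DEBUG)
    raise msg if enabled && !expr
    nil
  end

  # Reports each distinct message at most once.
  # Returns true the first time a message is seen, false afterwards.
  def self.fixme(s, enabled: $DEBUG, out: $stderr)
    return false if PRINTED.include?(s)
    out.print "FIXME: #{s}\n" if enabled
    PRINTED.add(s)
    true
  end
end
```

Passing `enabled: true` makes the behaviour testable without running the interpreter with `-d`.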
data/charhandler.rb ADDED
@@ -0,0 +1,84 @@
1
+ =begin copyright
2
+ rubylexer - a ruby lexer written in ruby
3
+ Copyright (C) 2004,2005 Caleb Clausen
4
+
5
+ This library is free software; you can redistribute it and/or
6
+ modify it under the terms of the GNU Lesser General Public
7
+ License as published by the Free Software Foundation; either
8
+ version 2.1 of the License, or (at your option) any later version.
9
+
10
+ This library is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ Lesser General Public License for more details.
14
+
15
+ You should have received a copy of the GNU Lesser General Public
16
+ License along with this library; if not, write to the Free Software
17
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18
+ =end
19
+
20
+ require "charset"
21
+ #------------------------------------
22
+ class CharHandler
23
+ #-----------------------------------
24
+ CHARSETSPECIALS=CharSet[?[ ,?] ,?\\ ,?-]
25
+ def initialize(receiver,default,hash)
26
+ @default=default
27
+ @receiver=receiver
28
+ #breakpoint
29
+ @table=Array.new(0)
30
+ @matcher='^[^'
31
+
32
+ hash.each_pair {|pattern,action|
33
+ case pattern
34
+ when Range
35
+ pattern.each { |c|
36
+ c.kind_of? String and c=c[0] #cvt to integer #still needed?
37
+ self[c]=action
38
+ }
39
+ when String
40
+ pattern.each_byte {|b| self[b]=action }
41
+ when Fixnum
42
+ self[pattern]=action
43
+ else
44
+ raise "invalid pattern class #{pattern.class}"
45
+ end
46
+ }
47
+
48
+ @matcher += ']$'
49
+ @matcher=Regexp.new(@matcher)
50
+
51
+ freeze
52
+ end
53
+
54
+ #-----------------------------------
55
+ def []=(b,action) #for use in initialize only
56
+ assert b >= ?\x00
57
+ assert b <= ?\xFF
58
+ assert !frozen?
59
+
60
+ @table[b]=action
61
+ @matcher<<?\\ if CHARSETSPECIALS===b
62
+ @matcher<<b
63
+ end
64
+ private :[]=
65
+
66
+ #-----------------------------------
67
+ def go(b,*args)
68
+ @receiver.send((@table[b] or @default), b.chr, *args)
69
+ end
70
+
71
+ #-----------------------------------
72
+ def eat_file(file,blocksize,*args)
73
+ begin
74
+ chars=file.read(blocksize)
75
+ md=@matcher.match(chars)
76
+ mychar=md[0][0]
77
+ #get file back in the right pos
78
+ file.pos+=md.offset(0)[0] - chars.length
79
+ @receiver.send(@default,md[0])
80
+ end until go(mychar,*args)
81
+ end
82
+ end
83
+
84
+
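CharHandler maps each byte to the name of a method on a receiver and falls back to a default action for unlisted bytes. The sketch below shows that table-dispatch idea for current Ruby versions (where `Fixnum` no longer exists and string indexing returns strings). `ByteDispatcher` and `DemoReceiver` are hypothetical names, and the regex-based `eat_file` scanning is deliberately left out.

```ruby
# Hypothetical simplified dispatcher (not the rubylexer class).
class ByteDispatcher
  def initialize(receiver, default, patterns)
    @receiver = receiver
    @default = default
    @table = []
    patterns.each_pair do |pattern, action|
      case pattern
      when Range   # ranges of one-character strings or of integers
        pattern.each { |c| @table[c.is_a?(String) ? c.ord : c] = action }
      when String  # every byte in the string gets the same action
        pattern.each_byte { |b| @table[b] = action }
      when Integer
        @table[pattern] = action
      else
        raise ArgumentError, "invalid pattern class #{pattern.class}"
      end
    end
    @table.freeze
  end

  # Calls the action registered for this byte, or the default action.
  def go(byte, *args)
    @receiver.send(@table[byte] || @default, byte.chr, *args)
  end
end

# Hypothetical receiver used only to demonstrate dispatch.
class DemoReceiver
  def digit(ch)  "digit:#{ch}"  end
  def letter(ch) "letter:#{ch}" end
  def other(ch)  "other:#{ch}"  end
end
```

For example, `ByteDispatcher.new(DemoReceiver.new, :other, { ("0".."9") => :digit, "_" => :letter, 65 => :letter })` routes digit bytes to `digit`, underscore and `A` to `letter`, and everything else to `other`.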
data/charset.rb ADDED
@@ -0,0 +1,76 @@
1
+ =begin copyright
2
+ rubylexer - a ruby lexer written in ruby
3
+ Copyright (C) 2004,2005 Caleb Clausen
4
+
5
+ This library is free software; you can redistribute it and/or
6
+ modify it under the terms of the GNU Lesser General Public
7
+ License as published by the Free Software Foundation; either
8
+ version 2.1 of the License, or (at your option) any later version.
9
+
10
+ This library is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
13
+ Lesser General Public License for more details.
14
+
15
+ You should have received a copy of the GNU Lesser General Public
16
+ License along with this library; if not, write to the Free Software
17
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18
+ =end
19
+
20
+ class CharSet
21
+ def initialize(*charss)
22
+ clear
23
+ charss.each {|chars| add chars }
24
+ end
25
+
26
+ def CharSet.[](*charss) CharSet.new(*charss) end
27
+
28
+ def clear
29
+ @bitset=0
30
+ end
31
+
32
+ def add(chars)
33
+ case chars
34
+ when String
35
+ chars.each_byte {|c| @bitset |= (1<<c) }
36
+ when Fixnum then @bitset |= (1<<chars)
37
+ else chars.each {|c| @bitset |= (1<<c) }
38
+ end
39
+ end
40
+
41
+ def remove(chars)
42
+ case chars
43
+ when String
44
+ chars.each_byte {|c| @bitset &= ~(1<<c) }
45
+ when Fixnum then @bitset &= ~(1<<chars)
46
+ else chars.each {|c| @bitset &= ~(1<<c) }
47
+ end
48
+ #this math works right with bignums... (i'm pretty sure)
49
+ end
50
+
51
+ def ===(c) #c is String|Fixnum|nil
52
+ c.nil? and return false
53
+ c.kind_of? String and c=c[0]
54
+ return ( @bitset[c] != 0 )
55
+ end
56
+
57
+ #enumerate the chars in n AS INTEGERS
58
+ def each_byte(&block)
59
+ #should use ffs... not available in ruby
60
+ (0..255).each { |n|
61
+ @bitset[n] != 0 and block[n]   #Integer#[] returns 0 or 1; both are truthy in ruby
62
+ }
63
+ end
64
+
65
+ def each(&block)
66
+ each_byte{|n| block[n.chr] }
67
+ end
68
+
69
+ def chars #turn bitset back into a string
70
+ result=''
71
+ each {|c| result << c }
72
+ return result
73
+ end
74
+ end
75
+
76
+
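CharSet stores a set of byte values as bits of a single integer. One detail matters when reading the bits back: `Integer#[]` returns 0 or 1, and both values are truthy in Ruby, so the result has to be compared explicitly. The following sketch shows the technique for current Ruby versions; the class name `ByteSet` and the method `include?` are assumptions for illustration, not the rubylexer API.

```ruby
# Hypothetical bitset-backed byte set (not the rubylexer class).
class ByteSet
  def initialize(*items)
    @bits = 0
    items.each { |item| add(item) }
  end

  # Accepts a String (all its bytes), an Integer byte, or any enumerable of those.
  def add(chars)
    case chars
    when String  then chars.each_byte { |b| @bits |= (1 << b) }
    when Integer then @bits |= (1 << chars)
    else chars.each { |c| add(c) }
    end
    self
  end

  def include?(c)
    return false if c.nil?
    c = c.ord if c.is_a?(String)
    @bits[c] == 1   # explicit comparison: 0 would otherwise count as true
  end

  # Yields member bytes as integers, in ascending order.
  def each_byte
    return enum_for(:each_byte) unless block_given?
    (0..255).each { |n| yield n if @bits[n] == 1 }
  end

  # Turns the bitset back into a string of its members.
  def chars
    each_byte.map(&:chr).join
  end
end
```

Because Ruby integers grow as needed, setting bit 255 works without any special handling.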
data/context.rb ADDED
@@ -0,0 +1,174 @@
1
+ module NestedContexts
2
+ class NestedContext
3
+ attr :starter
4
+ attr :ender
5
+ attr :linenum
6
+ def initialize(starter,ender,linenum)
7
+ @starter,@ender,@linenum=starter,ender,linenum
8
+ end
9
+
10
+ alias dflt_initialize initialize
11
+
12
+ def matches?(tok)
13
+ @ender==tok
14
+ end
15
+
16
+ def see stack,msg; end
17
+ end
18
+
19
+ class ListContext < NestedContext
20
+ end
21
+
22
+ class ListImmedContext < ListContext
23
+ def initialize(starter,linenum)
24
+ assert '{['[starter]
25
+ super(starter, starter.tr('{[','}]') ,linenum)
26
+ end
27
+ end
28
+
29
+ class ParenContext < NestedContext
30
+ def initialize(linenum)
31
+ super('(', ')' ,linenum)
32
+ end
33
+ end
34
+
35
+ class BlockContext < NestedContext
36
+ def initialize(linenum)
37
+ super('{','}',linenum)
38
+ end
39
+ end
40
+
41
+ class BlockParamListContext < ListContext
42
+ def initialize(linenum)
43
+ super('|','|',linenum)
44
+ end
45
+ end
46
+
47
+ class ParamListContext < ListContext
48
+ def initialize(linenum)
49
+ super('(', ')',linenum)
50
+ end
51
+ end
52
+
53
+ class ImplicitContext < ListContext
54
+ end
55
+
56
+ class ParamListContextNoParen < ImplicitContext
57
+ def initialize(linenum)
58
+ dflt_initialize(nil,nil,linenum)
59
+ end
60
+ end
61
+
62
+ class KwParamListContext < ImplicitContext
63
+ def initialize(starter,linenum)
64
+ dflt_initialize(starter,nil,linenum)
65
+ end
66
+ end
67
+
68
+ class AssignmentRhsContext < ImplicitContext
69
+ def initialize(linenum)
70
+ dflt_initialize(nil,nil,linenum)
71
+ end
72
+ end
73
+
74
+ class WantsEndContext < NestedContext
75
+ def initialize(starter,linenum)
76
+ super(starter,'end',linenum)
77
+ end
78
+
79
+ def see stack,msg
80
+ msg==:rescue ? stack.push_rescue_sm : super
81
+ end
82
+ end
83
+
84
+ class StringContext < NestedContext #not used yet
85
+ def initialize(starter,linenum)
86
+ super(starter, starter[-1,1].tr('{[(','}])'),linenum)   #tr, not tr!: tr! returns nil when nothing changes
87
+ end
88
+ end
89
+
90
+ class HereStringContext < StringContext #not used yet
91
+ def initialize(ender,linenum)
92
+ dflt_initialize("\n",ender,linenum)
93
+ end
94
+ end
95
+
96
+ class TopLevelContext < NestedContext
97
+ def initialize
98
+ dflt_initialize('','',1)
99
+ end
100
+ end
101
+
102
+
103
+ class RescueSMContext < NestedContext
104
+ #normal progression: rescue => arrow => then
105
+ EVENTS=[:rescue,:arrow,:then,:semi,:colon]
106
+ LEGAL_SUCCESSORS={nil=> [:rescue], :rescue => [:arrow,:then,:semi,:colon],:arrow => [:then,:semi,:colon],:then => [nil]}
107
+ #note on :semi and :colon events: in arrow state (and only then),
108
+ # (unescaped) newline, semicolon, and (unaccompanied) colon
109
+ # also trigger the :then event. otherwise, they are ignored.
110
+ attr :state
111
+
112
+ def initialize linenum
113
+ dflt_initialize("rescue","then",linenum)
114
+ @state=nil
115
+ @state=:rescue
116
+ end
117
+
118
+ def see(stack,msg)
119
+ case msg
120
+ when :rescue:
121
+ WantsEndContext===stack.last or
122
+ BlockContext===stack.last or
123
+ ParenContext===stack.last or
124
+ raise 'syntax error: rescue not expected at this time'
125
+ when :arrow: #local var defined in this state
126
+ when :then,:semi,:colon:
127
+ msg=:then
128
+ RescueSMContext===stack.pop or raise 'syntax error: then not expected at this time'
129
+ #pop self off owning context stack
130
+ else super
131
+ end
132
+ LEGAL_SUCCESSORS[@state].include? msg or raise "rescue syntax error: #{msg} unexpected in #@state"
133
+ @state=msg
134
+ end
135
+
136
+ end
137
+
138
+ class ForSMContext < NestedContext
139
+ #normal progression: for => in
140
+ EVENTS=[:for,:in]
141
+ LEGAL_SUCCESSORS={nil=> :for, :for => :in,:in => nil}
142
+ #note on :semi and :colon events: in :in state (and only then),
143
+ # (unescaped) newline, semicolon, and (unaccompanied) colon
144
+ # also trigger the :then event. otherwise, they are ignored.
145
+ attr :state
146
+
147
+ def initialize linenum
148
+ dflt_initialize("for","in",linenum)
149
+ @state=:for
150
+ end
151
+
152
+ def see(stack,msg)
153
+ case msg
154
+ when :for: WantsEndContext===stack.last or raise 'syntax error: for not expected at this time'
155
+ #local var defined in this state
156
+ when :in: ForSMContext===stack.pop or raise 'syntax error: in not expected at this time'
157
+ stack.push ExpectDoOrNlContext.new("for",/(do|;|:|\n)/,@linenum)
158
+ #pop self off owning context stack and push ExpectDoOrNlContext
159
+ else super
160
+ end
161
+ LEGAL_SUCCESSORS[@state] == msg or raise "for syntax error: #{msg} unexpected in #@state"
162
+ @state=msg
163
+ end
164
+ end
165
+
166
+ class ExpectDoOrNlContext < NestedContext
167
+ end
168
+
169
+ class TernaryContext < NestedContext
170
+ def initialize(linenum)
171
+ dflt_initialize('?',':',linenum)
172
+ end
173
+ end
174
+ end
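RescueSMContext tracks where the lexer is inside a rescue clause by checking each event against a table of legal successor states. The sketch below isolates that table-driven check as a standalone state machine. It is a simplification under stated assumptions: the class name `RescueStateMachine` is hypothetical, it starts in the `nil` state, it does not touch any context stack, and it always treats `:semi` and `:colon` as `:then`.

```ruby
# Hypothetical standalone state machine (not the rubylexer class).
class RescueStateMachine
  # Normal progression: rescue => arrow => then
  LEGAL_SUCCESSORS = {
    nil     => [:rescue],
    :rescue => [:arrow, :then],
    :arrow  => [:then],
    :then   => []
  }.freeze

  attr_reader :state

  def initialize
    @state = nil
  end

  # Advances to the given event's state, or raises if the event is not
  # a legal successor of the current state.
  def see(event)
    event = :then if [:semi, :colon].include?(event)
    unless LEGAL_SUCCESSORS.fetch(@state).include?(event)
      raise ArgumentError, "rescue syntax error: #{event} unexpected in #{@state.inspect}"
    end
    @state = event
  end
end
```

Keeping the transitions in one hash means an illegal sequence, such as an arrow after the clause has already ended, fails in a single place with a message naming both the event and the state.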