rubylexer 0.6.2

Files changed (41)

data/COPYING +510 -0
data/README +134 -0
data/Rantfile +37 -0
data/assert.rb +31 -0
data/charhandler.rb +84 -0
data/charset.rb +76 -0
data/context.rb +174 -0
data/howtouse.txt +136 -0
data/io.each_til_charset.rb +247 -0
data/require.rb +103 -0
data/rlold.rb +12 -0
data/rubycode.rb +44 -0
data/rubylexer.rb +1589 -0
data/rulexer.rb +532 -0
data/symboltable.rb +65 -0
data/testcode/deletewarns.rb +39 -0
data/testcode/dumptokens.rb +38 -0
data/testcode/locatetest +12 -0
data/testcode/rubylexervsruby.rb +104 -0
data/testcode/rubylexervsruby.sh +51 -0
data/testcode/tokentest.rb +237 -0
data/testcode/torment +51 -0
data/testdata/1.rb.broken +729 -0
data/testdata/23.rb +24 -0
data/testdata/g.rb +15 -0
data/testdata/newsyntax.rb +18 -0
data/testdata/noeolatend.rb +1 -0
data/testdata/p.rb +1227 -0
data/testdata/pleac.rb.broken +6282 -0
data/testdata/pre.rb +33 -0
data/testdata/pre.unix.rb +33 -0
data/testdata/regtest.rb +621 -0
data/testdata/tokentest.assert.rb.can +7 -0
data/testdata/untitled1.rb +1 -0
data/testdata/w.rb +22 -0
data/testdata/wsdlDriver.rb +499 -0
data/testing.txt +130 -0
data/testresults/placeholder +0 -0
data/token.rb +486 -0
data/tokenprinter.rb +152 -0
metadata +76 -0

data/testing.txt ADDED

@@ -0,0 +1,130 @@
+Running the tests:
+the simplest thing to do is run testcode/locatetest. this will use locate to find as much ruby
+code on your system as it can and test each specimen to see if it can be tokenized correctly (by feeding it
+to testcode/rubylexervsruby.rb, the operation of which is outlined below under 'testing strategy').
+Interpreting the output of rubylexervsruby.rb (and locatetest):
+in rubylexervsruby, i've tried to follow the philosophy that the test program
+doesn't print anything unless there's an error. perhaps i haven't followed
+this far enough; every run of rubylexervsruby produces a little output, and
+sometimes a run will produce output that doesn't actually indicate a problem,
+or only a low-priority problem. (since locatetest, torment, and test all run
+rubylexervsruby over and over, they all produce lots of (mostly harmless)
+output. sorry.)
+the following types of output should be ignored:
+diff file or chunk headers
+lines that look like this:
+  executing: ruby testcode/tokentest.rb ...     #normal, 1 for every file
+or this:
+  warning moved from 24 to 22: ambiguous first argument; put parentheses or even spaces
+or this:
+  Created warning(s) in new file, line 85: useless use of <=> in void context
+or this:
+  Removed warning(s) from old file (?!), line 85: useless use of <=> in void context
+indicate that a warning was added, deleted, or moved. ultimately, these should
+go away, but right now it's a low-priority issue.
+if you ever see a ruby stack dump in rubylexervsruby output, that's certainly
+an error (if the input ruby code is valid).
+something that looks like a unidiff chunk body (not header) may indicate
+an error as well. the problem is that sometimes those amorphous warnings
+sneak through my filter (which is supposed to condense them into a single
+line like those above), so you will see diff chunks where the only real
+difference is a warning. here are some examples of the kind of diff chunks
+that should NOT cause alarm:
+--:89: warning: useless use of <=> in void context
+--:92: warning: useless use of <=> in void context
++-:90: warning: useless use of <=> in void context
+ Stack now 0 2 62 300 5 110 365 544
+ Shifting token tIDENTIFIER, Entering state 34
+-Reading a token: -:318: warning: ambiguous first argument; put parentheses or even spaces
+-Next token is token tINTEGER ()
++Reading a token: Next token is token tINTEGER ()
+ Reducing stack by rule 476 (line 2382), tIDENTIFIER -> operation
+if you look closely (and are experienced in reading unidiff output), you'll
+see that the only difference is a warning. to understand more about how the
+unidiff output is created, see the section on testing strategy below.
+htree/template.rb:
+testing this file prints a small unidiff chunk. analysis indicates that this
+happens because ruby's lexer generates an extra (empty) string content
+token at this point, which mine omits. there's no actual semantic difference
+between the two tokenizations, so there's nothing to be concerned about. in
+a future release, when my lexer supports the notion of string contents and
+string delimiters as separate token types, i'll try to emulate ruby more
+closely. the same case is replicated in p.rb.
+(in other words, ignore the error in this file and the identical one in p.rb.)
+if you find any output that doesn't look like one of the above exceptions,
+and the input file was valid ruby, please send it to me so that i can add it
+to my arsenal of tests.
+there are a number of 'ruby' files that i know of out there that actually
+contain syntax errors:
+rpcd.rb from freeride -- missing an end
+sample1.rb from 1.6 version of tcltk -- not legal in ruby 1.8
+bdb.rb from libdb2, 3, and 4 -- not how you declare [] method in ruby
+testdata/p.rb (my menagerie of weird test cases) is one of the worst
+offenders; it prints lots of output when tested, but all of the problems
+are harmless or minor.
+only the first 10 lines of each failing file are printed. the rest, as well
+as other intermediate files, are kept in the testresults directory. the test
+output files are named *.prs.diff. beware: this directory is never cleaned,
+and can get quite large. after a large test run, you'll want to empty this
+directory to recover some disk space.
+about the directories: tbd
+about testcode/dumptokens.rb: tbd
+about testcode/tokentest.rb:
+a fairly simple-minded test utility; given an input file, it uses RubyLexer
+to tokenize it, then prints out each token as it is found. certain small
+changes will be made; numeric constants (including char constants) are
+converted to decimal and strings are converted to double-quoted form, where
+possible. optional flags can cause other changes: --maxws inserts whitespace
+everywhere that it's possible, --implicit inserts parentheses where they
+were left out at call sites. --implicit-all adds parentheses around the lists
+following when, for, and rescue keywords. --keepws is the usual mode;
+otherwise a 'symbolic mode' is used wherein newline is represented by '#;',
+for instance. note: currently the output will not be valid ruby unless
+only --maxws or --keepws is used. in a future release --implicit will
+also be valid ruby, but currently it also puts '*[' and ']' around assignment
+right hand sides, which only works most of the time.
+about testcode/torment:
+finds ruby files by other heuristics (not using locate) and runs each
+through rubylexervsruby. this is roughly comparable to locatetest, but
+more complicated and (probably) less comprehensive.
+about ./test:
+this contains a number of ruby files which have failed on my Debian system
+in the past. as the paths are hard-coded, it's unlikely to be very portable.
+testing strategy:
+this command:
+ruby -w -y < $1 2>&1 | grep ^Shift|cut -d" " -f3
+gives a list of the types of token, as known to ruby, in a source file $1. the
+utility program tokentest.rb runs the lexer against a source file and then simply
+prints the tokens out again (perhaps with whitespace inserted between tokens). if
+the list of token types in this derived source file, as determined by the above command,
+is the same as in the original, we can be pretty confident that ruby and rubylexer are
+tokenizing in the same way. since whitespaces are optionally inserted between tokens, it
+is unlikely that rubylexer is ever finding two tokens where ruby thinks there's only one.
+it is possible, however, that rubylexer is emitting as a single token things that ruby
+thinks should be 2 tokens. and in fact, this is the case with strings: ruby divides a
+string into string open, string body, and string close tokens with optional interpolations,
+whereas rubylexer has just a single string token (with subtokens, if interpolations are
+present.) this difference in handling accounts in part for rubylexer's inability
+to correctly lex certain very complicated strings.
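
The comparison described under 'testing strategy' can be sketched in Ruby as follows. This is only an illustration under stated assumptions: the helper names `shifted_token_types` and `same_token_types?` are hypothetical and not part of rubylexer, and the sample traces are invented in the style of the `ruby -w -y` excerpts shown in testing.txt. The first helper mirrors `grep ^Shift | cut -d" " -f3`, so each extracted field keeps its trailing comma, just as `cut` would leave it.

```ruby
# Sketch only: helper names here are hypothetical, not part of rubylexer.

# Ruby-side equivalent of `grep ^Shift | cut -d" " -f3`: keep the lines
# that start with "Shift" and take the third space-separated field.
def shifted_token_types(parser_trace)
  parser_trace.each_line
              .select { |line| line.start_with?("Shift") }
              .map { |line| line.chomp.split(" ")[2] }
end

# True when both traces shift the same token types in the same order.
def same_token_types?(original_trace, derived_trace)
  shifted_token_types(original_trace) == shifted_token_types(derived_trace)
end

# Invented sample traces in the style of `ruby -w -y` output. The second
# one lacks the warning, which must not affect the comparison.
ORIGINAL_TRACE = <<~TRACE
  Shifting token tIDENTIFIER, Entering state 34
  Reading a token: -:318: warning: ambiguous first argument; put parentheses or even spaces
  Next token is token tINTEGER ()
  Shifting token tINTEGER, Entering state 40
TRACE

DERIVED_TRACE = <<~TRACE
  Shifting token tIDENTIFIER, Entering state 34
  Reading a token: Next token is token tINTEGER ()
  Shifting token tINTEGER, Entering state 40
TRACE

puts same_token_types?(ORIGINAL_TRACE, DERIVED_TRACE)   # prints true
```

Because only the `Shifting` lines are compared, a warning that appears, disappears, or moves between the two traces does not change the result, which is the property the test strategy relies on.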

data/testresults/placeholder ADDED

File without changes

data/token.rb ADDED

@@ -0,0 +1,486 @@
+=begin copyright
+    rubylexer - a ruby lexer written in ruby
+    Copyright (C) 2004,2005  Caleb Clausen
+    This library is free software; you can redistribute it and/or
+    modify it under the terms of the GNU Lesser General Public
+    License as published by the Free Software Foundation; either
+    version 2.1 of the License, or (at your option) any later version.
+    This library is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+    Lesser General Public License for more details.
+    You should have received a copy of the GNU Lesser General Public
+    License along with this library; if not, write to the Free Software
+    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+=end
+require "rubycode"
+#-------------------------
+class Token
+   attr_accessor :ident
+   alias to_s ident
+   attr_accessor :offset #file offset of start of this token
+   def initialize(ident,offset=nil)
+      @ident=ident
+      @offset=offset
+   end
+   def error; end
+end
+#-------------------------
+class WToken< Token
+   def ===(pattern)
+      assert @ident
+      pattern===@ident
+   end
+end
+#-------------------------
+class KeywordToken < WToken   #also some operators
+  #-----------------------------------
+  def set_callsite!
+    @callsite=true
+  end
+  #-----------------------------------
+  def callsite?
+    @callsite ||= nil
+  end
+  #-----------------------------------
+  def has_end!
+    assert self===RubyLexer::BEGINWORDS
+    @has_end=true
+  end
+  #-----------------------------------
+  def has_end?
+    self===RubyLexer::BEGINWORDS and @has_end||=nil
+  end
+end
+#-------------------------
+class OperatorToken < WToken
+end
+#-------------------------
+module TokenPat
+   @@TokenPats={}
+   def token_pat #used in various case statements...
+     result=self.dup
+     @@TokenPats[self] ||=
+       (class <<result
+         alias old_3eq ===
+         def ===(token)
+           WToken===token and old_3eq(token.ident)
+         end
+       end;result)
+   end
+end
+class String; include TokenPat; end
+class Regexp; include TokenPat; end
+#-------------------------
+class VarNameToken < WToken
+end
+#-------------------------
+class NumberToken < Token
+  def to_s; @ident.to_s end
+end
+#-------------------------
+class SymbolToken < Token
+   def initialize(ident,offset=nil)
+      super ":#{ident}", offset
+   #   @char=':'
+   end
+end
+#-------------------------
+class MethNameToken  < Token # < SymbolToken
+   def initialize(ident,offset=nil)
+      @ident= (VarNameToken===ident)? ident.ident : ident
+      @offset=offset
+   #   @char=''
+   end
+   def [](regex) #is this used?
+      regex===ident
+   end
+   def ===(pattern)
+      pattern===@ident
+   end
+end
+#-------------------------
+class NewlineToken < Token
+   def initialize(nlstr="\n",offset=nil)
+      super(nlstr,offset)
+      #@char=''
+   end
+end
+#-------------------------
+class StringToken < Token
+   attr :char
+   attr_accessor :modifiers    #for regex only
+   attr_accessor :elems
+   def initialize(type='"',ident='')
+      super(ident)
+      type=="'" and type='"'
+      @char=type
+      assert(@char[/^[\[{"`\/]$/])
+      @elems=[ident.dup]     #why .dup?
+      @modifiers=nil
+   end
+   DQUOTE_ESCAPE_TABLE = [
+     ["\n",'\n'],
+     ["\r",'\r'],
+     ["\t",'\t'],
+     ["\v",'\v'],
+     ["\f",'\f'],
+     ["\e",'\e'],
+     ["\b",'\b'],
+     ["\a",'\a']
+   ]
+   PREFIXERS={ '['=>"%w[", '{'=>'%W{' }
+   SUFFIXERS={ '['=>"]",   '{'=>'}' }
+   def to_s(transname=:transform)
+      assert(@char[/[\[{"`\/]/])
+      #on output, all single-quoted strings become double-quoted
+      assert(@elems.length==1)  if @char=='['
+      result=(PREFIXERS[@char] or @char).dup
+      starter=result[-1,1]
+      ender=(SUFFIXERS[@char] or @char).dup
+      0.step(@elems.length-1,2) { |i|
+         strfrag=@elems[i].dup
+         result << send(transname,strfrag,starter,ender)
+         if e=@elems[i+1]
+            assert(e.kind_of?(RubyCode))
+            result << '#' + e.to_s
+         end
+      }
+      result << ender
+      modifiers and result << modifiers #regex only
+      return result
+   end
+   def to_term
+      result=[]
+      0.step(@elems.length-1,2) { |i|
+         result << ConstTerm.new(@elems[i].dup)
+         if e=@elems[i+1]
+            assert(e.kind_of?(RubyCode))
+            result << (RubyTerm.new e)
+         end
+      }
+      return result
+   end
+   def append(glob)
+      assert @elems.last.kind_of?(String)
+      case glob
+      when String,Integer then append_str! glob
+      when RubyCode then append_code! glob
+      else raise "bad string contents: #{glob}, a #{glob.class}"
+      end
+      assert @elems.last.kind_of?(String)
+   end
+   def append_token(strtok)
+      assert @elems.last.kind_of?(String)
+      assert strtok.elems.last.kind_of?(String)
+      assert strtok.elems.first.kind_of?(String)
+      @elems.last << strtok.elems.shift
+      first=strtok.elems.first
+      assert( first.nil? || first.kind_of?(RubyCode) )
+      @elems += strtok.elems
+      @ident << strtok.ident
+      assert((!@modifiers or !strtok.modifiers))
+      @modifiers||=strtok.modifiers
+      assert @elems.last.kind_of?(String)
+      return self
+   end
+private
+   #simpler transform, preserves original exactly
+   def simple_transform(strfrag,starter,ender)
+      #assert('[{/'[@char])
+      #strfrag.gsub!(/#([{$@])/,'\\#\\1') unless @char=='['
+      strfrag.gsub!(Regexp.new("[\\"+starter+"\\"+ender+"]"), '\\\\\&')
+      return strfrag
+   end
+   def transform(strfrag,starter,ender)
+      strfrag.gsub!("\\",'\\'*4)
+      strfrag.gsub!(/#([{$@])/,'\\#\\1')
+      strfrag.gsub!(Regexp.new("[\\"+starter+"\\"+ender+"]"),'\\\\\\&') unless @char=='?'
+      DQUOTE_ESCAPE_TABLE.each {|pair|
+         strfrag.gsub!(*pair)
+      } unless @char=='/'
+      strfrag.gsub!(/[^ -~]/){|np| #nonprintables
+         "\\x"+sprintf('%02X',np[0])
+      }
+      #break up long lines (best done later?)
+      strfrag.gsub!(/(\\x[0-9A-F]{2}|\\?.){40}/i, "\\&\\\n")
+      return strfrag
+   end
+   def append_str!(str)
+      assert @elems.last.kind_of?(String)
+      @elems.last << str
+      @ident << str
+      assert @elems.last.kind_of?(String)
+   end
+   def append_code!(code)
+      assert @elems.last.kind_of?(String)
+      @elems.concat [code, '']
+      @ident <<  "\#{#{code}}"
+      assert @elems.last.kind_of?(String)
+   end
+end
+#-------------------------
+class RenderExactlyStringToken < StringToken
+   alias transform simple_transform
+end
+#-------------------------
+class HerePlaceholderToken < WToken
+   attr_reader :termex, :quote, :ender
+   attr_accessor :unsafe_to_use, :string
+   attr_accessor :bodyclass
+   def initialize(dash,quote,ender)
+      @dash,@quote,@ender=dash,quote,ender
+      @unsafe_to_use=true
+      @string=StringToken.new
+      #@termex=/^#{'[\s\v]*' if dash}#{Regexp.escape ender}$/
+      @termex=Regexp.new \
+         ["^", ('[\s\v]*' if dash), Regexp.escape(ender), "$"].to_s
+      @bodyclass=HereBodyToken
+   end
+   def ===(bogus); false end
+   def to_s
+      if unsafe_to_use
+        result="<<"
+        result << if/[^a-z_0-9]/i===@ender
+          %["#{@ender.gsub(/[\\"]/, '\\\\'+'\\&')}"]
+        else
+          @ender
+        end
+      else
+        @string.to_s
+      end
+   end
+   def append s; @string.append s end
+   def append_token tok; @string.append_token tok  end
+end
+#-------------------------
+class IgnoreToken < Token
+end
+#-------------------------
+class WsToken < IgnoreToken
+end
+#-------------------------
+class ZwToken < IgnoreToken
+  def initialize(offset)
+    super('',offset)
+  end
+  def explicit_form
+    abstract
+  end
+  def explicit_form_all; explicit_form end
+end
+class NoWsToken < ZwToken
+  def explicit_form_all
+    "#nows#"
+  end
+  def explicit_form
+    nil
+  end
+end
+class ImplicitParamListStartToken < ZwToken
+  def explicit_form
+    '('
+  end
+end
+class ImplicitParamListEndToken < ZwToken
+  def explicit_form
+    ')'
+  end
+end
+class AssignmentRhsListStartToken < ZwToken
+  def explicit_form
+    '*['
+  end
+end
+class AssignmentRhsListEndToken < ZwToken
+  def explicit_form
+    ']'
+  end
+end
+class KwParamListStartToken  < ZwToken
+  def explicit_form_all
+    "#((#"
+  end
+  def explicit_form
+    nil
+  end
+end
+class KwParamListEndToken  < ZwToken
+  def explicit_form_all
+    "#))#"
+  end
+  def explicit_form
+    nil
+  end
+end
+#-------------------------
+class EscNlToken < IgnoreToken
+   def initialize(filename,linenum,ident="\\\n",offset=nil)
+      super(ident,offset)
+      #@char='\\'
+      @filename=filename
+      @linenum=linenum
+   end
+end
+#-------------------------
+class EoiToken < IgnoreToken
+   attr :file
+   alias :pos :offset
+   def initialize(cause,file, offset=nil)
+      super(cause,offset)
+      @file=file
+   end
+end
+#-------------------------
+class HereBodyToken < IgnoreToken
+  #attr_accessor :ender
+  def initialize(headtok)
+    assert HerePlaceholderToken===headtok
+    super(headtok.string,headtok.string.offset)
+    @headtok=headtok
+  end
+end
+#-------------------------
+class FileAndLineToken < IgnoreToken
+   attr :line
+   def initialize(ident,line,offset=nil)
+      super ident,offset
+      #@char='#'
+      @line=line
+   end
+   #def char; '#' end
+   def to_s()
+      ['#', @ident, ':', @line].to_s
+   end
+   def file()   @ident   end
+   def subitem()   @line   end #needed?
+end
+#-------------------------
+class OutlinedHereBodyToken < HereBodyToken
+  def to_s
+    assert HerePlaceholderToken===@headtok
+    result=@headtok.string
+    result=result.to_s(:simple_transform).match(/^"(.*)"$/m)[1]
+    return "\n" +
+           result +
+           @headtok.ender +
+           "\n"
+  end
+end
+#-------------------------
+module ErrorToken
+  attr_accessor :error
+end
+#-------------------------
+class SubitemToken < Token
+   attr :char2
+   attr :subitem
+   def initialize(ident,subitem)
+      super ident
+      @subitem=subitem
+   end
+   def to_s()
+      super+@char2+@subitem.to_s
+   end
+end
+#-------------------------
+class DecoratorToken < SubitemToken
+   def initialize(ident,subitem)
+      super '^'+ident,subitem
+      @subitem=@subitem.to_s  #why to_s?
+      #@char='^'
+      @char2='='
+   end
+   #alias to_s ident  #parent has right implementation of to_s... i think
+   def needs_value?()   @subitem.nil?   end
+   def value=(v)   @subitem=v  end
+   def value()     @subitem    end
+end
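
The `token_pat` helper in token.rb builds a pattern whose `===` accepts only `WToken` instances and then matches against the token's `ident`, so it can be used in `case` statements. The sketch below shows the same idea with a plain wrapper class instead of a per-object singleton method. It is an illustration, not the library's implementation: `TokenPattern` and `describe` are hypothetical names, the `Token`, `WToken`, and `NumberToken` classes are minimal stand-ins for those in token.rb, and the caching that the original does in `@@TokenPats` is left out.

```ruby
# Minimal stand-ins for the Token, WToken and NumberToken classes in token.rb.
class Token
  attr_accessor :ident, :offset

  def initialize(ident, offset = nil)
    @ident = ident
    @offset = offset
  end
end

class WToken < Token; end
class NumberToken < Token; end

# Hypothetical wrapper (not in rubylexer): matches only WToken instances
# whose ident satisfies the wrapped String or Regexp.
class TokenPattern
  def initialize(pattern)
    @pattern = pattern
  end

  def ===(token)
    token.is_a?(WToken) && (@pattern === token.ident)
  end
end

# `case` calls === on each `when` operand, so the wrapper does the filtering.
def describe(token)
  case token
  when TokenPattern.new("end")      then "end keyword"
  when TokenPattern.new(/\A[a-z_]/) then "lowercase word"
  else "something else"
  end
end

puts describe(WToken.new("end"))     # prints "end keyword"
puts describe(WToken.new("foo"))     # prints "lowercase word"
puts describe(NumberToken.new(42))   # prints "something else"
```

A wrapper class avoids reopening `String` and `Regexp`, at the cost of an explicit `TokenPattern.new` at each use; the original's approach keeps call sites shorter by adding `token_pat` to the core classes.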