RubyGems - rubylexer - Versions diffs - 0.7.3 → 0.7.4

rubylexer 0.7.3 → 0.7.4

Files changed (20)

data/History.txt +16 -0
data/Manifest.txt +3 -1
data/README.txt +12 -19
data/Rakefile +2 -2
data/lib/rubylexer.rb +214 -86
data/lib/rubylexer/context.rb +17 -6
data/lib/rubylexer/lextable.rb +202 -0
data/lib/rubylexer/rulexer.rb +61 -9
data/lib/rubylexer/test/illegal_oneliners.rb +4 -0
data/lib/rubylexer/test/stanzas.rb +2 -0
data/lib/rubylexer/test/testcases.rb +6 -1
data/lib/rubylexer/token.rb +4 -1
data/lib/rubylexer/version.rb +1 -1
data/test/code/regression.rb +1 -1
data/test/code/rubylexervsruby.rb +23 -6
data/test/data/1.rb +729 -0
data/test/data/heart.rb +43 -2
data/test/data/pleac.rb +6282 -0
data/testing.txt +1 -1
metadata +7 -4

data/History.txt CHANGED

@@ -1,3 +1,19 @@
+=== 0.7.4/5-20-2009
+* 2 Major Enhancements:
+  * preliminary support for ruby 1.9
+  * utf8 inputs should now work... more or less
+* 5 Minor Enhancements:
+  * better detection of illegal escapes and interpolations in strings
+  * indicate error on unterminated here body
+  * fixed pattern of keywords that can't start a param list (ignores ?,! now)
+  * in is_var_name?, check for global/instance vars first
+  * comma and star in a true lhs should be correctly marked as such, now
+* 2 Bugfixes:
+  * added tag field to Token; I hope many flags can be coalesced into tag.
+  * note line that all strings (and here docs) start and end on
 === 0.7.3/4-19-2009
 * 9 Bugfixes:
   * remember whether comma was seen in paren context

data/Manifest.txt CHANGED

@@ -15,6 +15,7 @@ lib/rubylexer/version.rb
 lib/rubylexer/rulexer.rb
 lib/rubylexer/tokenprinter.rb
 lib/rubylexer/charset.rb
+lib/rubylexer/lextable.rb
 lib/rubylexer/symboltable.rb
 lib/rubylexer/charhandler.rb
 lib/assert.rb
@@ -43,11 +44,13 @@ test/data/23.rb
 test/data/lbrack.rb
 test/data/untitled1.rb
 test/data/rescue.rb
+test/data/pleac.rb
 test/data/pleac.rb.broken
 test/data/heart.rb
 test/data/s.rb
 test/data/wsdlDriver.rb
 test/data/p-op.rb
+test/data/1.rb
 test/data/1.rb.broken
 test/data/untermed_here.rb.broken
 test/data/newsyntax.rb
@@ -72,7 +75,6 @@ test/code/regression.rb
 test/code/strgen.rb
 test/code/tarball.rb
 lib/rubylexer/test/testcases.rb
-test/data/chunky.plain.rb
 test/data/cvtesc.rb
 test/data/__eof2.rb
 test/data/__eof5.rb

data/README.txt CHANGED

@@ -1,8 +1,7 @@
 = RubyLexer
-*
-*
-*
+* rubyforge.net/projects/rubylexer
+* github.com/coatl/rubylexer
 === DESCRIPTION:
@@ -48,13 +47,9 @@ end
 == Status
 RubyLexer can correctly lex all legal Ruby 1.8 code that I've been able to
 find on my Debian system. It can also handle (most of) my catalog of nasty
-test cases (in testdata/p.rb) (see below for known problems). At this point,
-new bugs are almost exclusively found by my home-grown test code, rather
-than ruby code gathered 'from the wild'. There are a number of issues I know
-about and plan to fix, but it seems that Ruby coders don't write code complex
-enough to trigger them very often. Although incomplete, RubyLexer can
-correctly distinguish these ambiguous uses of the following operator and
-keywords, depending on context:
+test cases (see below for known problems). Modulo some very obscure bugs,
+RubyLexer can correctly distinguish these ambiguous uses of the following
+operators, depending on context:
   %   can be modulus operator or start of fancy string
   /   can be division operator or start of regex
   * & + - :: can be unary or binary operator
@@ -83,18 +78,16 @@ emit advisory tokens when local var defined/goes out of scope (or hidden/unhidde
 token pruning in dumptokens...
 == known issues: (and planned fix release)
-context not really preserved when entering or leaving string inclusions. this causes
-a number or problems. local variables are ok now, but here document headers started
-in a string inclusion with the body outside will be a problem. (0.8)
-string tokenization sometimes a little different from ruby around newlines
-  (htree/template.rb) (0.8)
+context not really preserved when entering or leaving string inclusions. this caused
+-a number or problems, which had to be hacked around. it would be better to avoid
+-tokens within tokens. (0.8)
 string contents might not be correctly translated in a few cases (0.8?)
 symbols which contain string interpolations are flattened into one token. eg :"foo#{bar}" (0.8)
 '\r' whitespace sometimes seen in dos-formatted output.. shouldn't be (eg pre.rb) (0.7)
 windows newline in source is likely to cause problems in obscure cases (need test case)
-unterminated =begin is not an error (0.8)
-ruby 1.9 completely unsupported (0.9)
-character sets other than ascii are not supported at all (1.0)
-regression test currently shows 14 errors with differences in exact token ordering
+ruby 1.9 incompletely supported (0.9)
+current character set is always forced to ascii-8bit. however, this mode should be
+-compatible with texts written in regular ascii, utf-8, and euc. (among others?) (1.0)
+regression test currently shows a few errors with differences in exact token ordering
 -around string inclusions. these errors are much less serious than they seem.
 offset of AssignmentRhsListEndToken appears to be off by 1
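
Note on the Status section: the ambiguities it lists can be reproduced in plain Ruby. The sketch below does not call RubyLexer, and the method name and values are invented for illustration. It only shows the same characters being read in two ways depending on what precedes them.

```ruby
# Hypothetical demo method; not part of rubylexer.
def ambiguity_demo
  a = 7
  b = 2
  {
    remainder:  a % b,     # after a local variable, % is the modulus operator
    words:      %w[x y],   # where a value is expected, % opens a "fancy" literal
    quotient:   a / b,     # after a local variable, / is division
    pattern:    /b/,       # where a value is expected, / opens a regex
    negated:    -a,        # unary minus
    difference: a - b,     # binary minus
  }
end

p ambiguity_demo
```

A lexer has to make the same decision from the preceding tokens alone, which is why the README describes the result as context dependent.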

data/Rakefile CHANGED

@@ -25,13 +25,13 @@ require 'lib/rubylexer/version.rb'
    hoe=Hoe.new("rubylexer", RubyLexer::VERSION) do |_|
      _.author = "Caleb Clausen"
      _.email = "rubylexer-owner @at@ inforadical .dot. net"
-     _.url = ["http://rubylexer.rubyforge.org/", "http://rubyforge.org/projects/rubylexer/"]
+     _.url = ["http://github.com/coatl/rubylexer/", "http://rubyforge.org/projects/rubylexer/"]
      _.extra_deps << ['sequence', '>= 0.2.0']
      _.test_globs=["test/code/regression.rb"]
      _.description=desc
      _.summary=desc[/\A[^.]+\./]
      _.spec_extras={:bindir=>'',:rdoc_options=>'-x lib/rubylexer/test'}
-     _.rdoc_pattern=/\A(howtouse\.txt|testing\.txt|README\.txt|lib\/[^\/]*\.rb|lib\/rubylexer\/[^\d][^\/]*\.rb)\Z/
+     #_.rdoc_pattern=/\A(howtouse\.txt|testing\.txt|README\.txt|lib\/[^\/]*\.rb|lib\/rubylexer\/[^\d][^\/]*\.rb)\Z/
    end

data/lib/rubylexer.rb CHANGED

@@ -1,4 +1,4 @@
-=begin legal crap
+=begin
     rubylexer - a ruby lexer written in ruby
     Copyright (C) 2004,2005,2008  Caleb Clausen
@@ -60,9 +60,6 @@ class RubyLexer
    INNERBOUNDINGWORDS="(#{INNERBOUNDINGWORDLIST.join '|'})"
    BINOPWORDLIST=%w"and or"
    BINOPWORDS="(#{BINOPWORDLIST.join '|'})"
-   NEVERSTARTPARAMLISTWORDS=/\A(#{OPORBEGINWORDS}|#{INNERBOUNDINGWORDS}|#{BINOPWORDS}|end)([^a-zA-Z0-9_!?=]|\Z)/o
-   NEVERSTARTPARAMLISTFIRST=CharSet['aoeitrwu']  #chars that begin NEVERSTARTPARAMLIST
-   NEVERSTARTPARAMLISTMAXLEN=7     #max len of a NEVERSTARTPARAMLIST
    RUBYKEYWORDS=%r{
      ^(alias|#{BINOPWORDS}|defined\?|not|undef|end|
@@ -72,6 +69,11 @@ class RubyLexer
    }xo
       #__END__ should not be in this set... its handled in start_of_line_directives
+   HIGHASCII=?\x80..?\xFF
+   NONASCII=HIGHASCII
+   #NONASCII=?\x80..?xFFFFFFFF  #or is it 10FFFF, whatever the highest conceivable code point
    CHARMAPPINGS = {
          ?$ => :dollar_identifier,
          ?@ => :at_identifier,
@@ -115,14 +117,43 @@ class RubyLexer
          "])}" => :close_brace,
-         ?# => :comment
+         ?# => :comment,
+         NONASCII => :identifier,
    }
    attr_reader :incomplete_here_tokens, :parsestack, :last_token_maybe_implicit
+   UCLETTER=@@UCLETTER="[A-Z]"
+   #cheaters way, treats utf chars as always 1 byte wide
+   #all high-bit chars are lowercase letters
+   #works, but strings compare with strict binary identity, not unicode collation
+   #works for euc too, I think
+   #(the ruby spec for utf8 support permits this interpretation)
+   LCLETTER=@@LCLETTER="[a-z_\x80-\xFF]"
+   LETTER=@@LETTER="[A-Za-z_\x80-\xFF]"
+   LETTER_DIGIT=@@LETTER_DIGIT="[A-Za-z_0-9\x80-\xFF]"
+   eval %w[UCLETTER LCLETTER LETTER LETTER_DIGIT].map{|n| "
+     def #{n}; #{n}; end
+     def self.#{n}; @@#{n}; end
+     "
+   }.to_s
+   NEVERSTARTPARAMLISTWORDS=/\A(#{OPORBEGINWORDS}|#{INNERBOUNDINGWORDS}|#{BINOPWORDS}|end)((?:(?!#@@LETTER_DIGIT).)|\Z)/om
+   NEVERSTARTPARAMLISTFIRST=CharSet['aoeitrwu']  #chars that begin NEVERSTARTPARAMLIST
+   NEVERSTARTPARAMLISTMAXLEN=7     #max len of a NEVERSTARTPARAMLIST
+=begin
+   require 'jcode'
+   utf8=String::PATTERN_UTF8 #or euc, or sjis...
+   LCLETTER_U="(?>[a-z_]|#{utf8})"
+   LETTER_U="(?>[A-Za-z_]|#{utf8})"
+   IDENTCHAR_U="(?>[A-Za-z_0-9]|#{utf8})"
+=end
    #-----------------------------------
-   def initialize(filename,file,linenum=1,offset_adjust=0)
+   def initialize(filename,file,linenum=1,offset_adjust=0,options={:rubyversion=>1.8})
       @offset_adjust=0 #set again in next line
       super(filename,file, linenum,offset_adjust)
       @start_linenum=linenum
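
Note on the LETTER / LETTER_DIGIT classes in this hunk: they treat every byte in \x80-\xFF as a lowercase letter, so a multi-byte UTF-8 character is accepted as a run of "letters". A minimal sketch of that idea for a current Ruby follows. The constant BYTE_IDENT and the helper leading_identifier are hypothetical names, and the /n flag plus String#b are used here as an assumption about how to get byte-wise matching today, not as what rubylexer itself does.

```ruby
# Byte-oriented identifier pattern modeled on LETTER and LETTER_DIGIT.
# The /n flag makes the regexp binary, so \x80-\xFF means raw bytes.
BYTE_IDENT = /\A[A-Za-z_\x80-\xFF][A-Za-z_0-9\x80-\xFF]*/n

# Hypothetical helper: returns the identifier at the start of +source+
# as a binary string, or nil if +source+ does not start with one.
def leading_identifier(source)
  source.b[BYTE_IDENT]
end

ident = leading_identifier("caf\u00e9 = 1")
# Each byte of the accented letter is counted as its own "letter".
puts ident.bytesize
```

The trade-off matches the comments in the hunk: identifiers compare by exact bytes, with no Unicode-aware case or collation handling.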
@@ -137,13 +168,61 @@ class RubyLexer
       @enable_macro=nil
       @base_file=nil
       @progress_thread=nil
+      @rubyversion=options[:rubyversion]
+      @encoding=options[:encoding]||:detect
       @toptable=CharHandler.new(self, :illegal_char, CHARMAPPINGS)
+      read_leading_encoding
       start_of_line_directives
       progress_printer
    end
+   ENCODING_ALIASES={
+    'utf-8'=>'utf8',
+    'ascii-8bit'=>'binary',
+    'ascii-7bit'=>'ascii',
+    'euc-jp'=>'euc',
+    'ascii8bit'=>'binary',
+    'ascii7bit'=>'ascii',
+    'eucjp'=>'euc',
+    'us-ascii'=>'ascii',
+    'shift-jis'=>'sjis',
+    'autodetect'=>'detect',
+   }
+   ENCODINGS=%w[ascii binary utf8 euc sjis]
+   def read_leading_encoding
+     return unless @encoding==:detect
+     @encoding=:ascii
+     @encoding=:utf8 if @file.skip( /\xEF\xBB\xBF/ )   #bom
+     if @file.skip( /\A#!/ )
+       loop do
+         til_charset( /[\s\v]/ )
+         break if @file.skip( / ([^-\s\v]|--[\s\v])/,4 )
+         if @file.skip( /.-K(.)/ )
+           case $1
+           when 'u'; @encoding=:utf8
+           when 'e'; @encoding=:euc
+           when 's'; @encoding=:sjis
+           end
+         end
+       end
+       til_charset( /[\n]/ )
+     end
+     if @rubyversion>=1.9 and @file.skip(
+          /\A#[\x00-\x7F]*?(?:en)?coding[\s\v]*[:=][\s\v]*([a-z0-9_-]+)[\x00-\x7F]*\n/i
+        )
+       name=$1
+       name.downcase!
+       name=ENCODING_ALIASES[name] if ENCODING_ALIASES[name]
+       @encoding=name.to_sym if ENCODINGS.include? name
+     end
+   end
    def progress_printer
      return unless ENV['RL_PROGRESS']
      $stderr.puts 'printing progresses'
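
Note on read_leading_encoding: the last step of that method reads a magic comment, lowercases the name, maps it through ENCODING_ALIASES, and accepts it only if it appears in ENCODINGS. The sketch below isolates that step. The two tables are copied from the hunk; the helper name encoding_from_magic_comment and the constant MAGIC_COMMENT are hypothetical, the regexp is a simplified form of the one in the diff, and returning nil for an unknown name is a choice made for this example (the diff leaves @encoding unchanged in that case).

```ruby
ENCODING_ALIASES = {
  'utf-8' => 'utf8',
  'ascii-8bit' => 'binary',
  'ascii-7bit' => 'ascii',
  'euc-jp' => 'euc',
  'ascii8bit' => 'binary',
  'ascii7bit' => 'ascii',
  'eucjp' => 'euc',
  'us-ascii' => 'ascii',
  'shift-jis' => 'sjis',
  'autodetect' => 'detect',
}
ENCODINGS = %w[ascii binary utf8 euc sjis]

# Simplified version of the magic-comment pattern in the diff.
MAGIC_COMMENT = /\A#[\x00-\x7F]*?(?:en)?coding\s*[:=]\s*([a-z0-9_-]+)/i

# Hypothetical helper: returns the encoding as a symbol, or nil when the
# line is not a magic comment or names an encoding outside ENCODINGS.
def encoding_from_magic_comment(line)
  md = MAGIC_COMMENT.match(line) or return nil
  name = md[1].downcase
  name = ENCODING_ALIASES.fetch(name, name)
  ENCODINGS.include?(name) ? name.to_sym : nil
end

p encoding_from_magic_comment("# -*- coding: UTF-8 -*-\n")
```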
@@ -163,6 +242,7 @@ class RubyLexer
    attr :localvars_stack
    attr :offset_adjust
    attr_writer :pending_here_bodies
+   attr :rubyversion
    #-----------------------------------
    def set_last_token(tok)
@@ -361,7 +441,7 @@ private
       result = ((
       #order matters here, but it shouldn't
       #(but til_charset must be last)
-         eat_if(/-[a-z0-9_]/i,2) or
+         eat_if(/-#@@LETTER_DIGIT/o,2) or
          eat_next_if(/[!@&+`'=~\-\/\\,.;<>*"$?:]/) or
          (?0..?9)===nextchar ? til_charset(/[^\d]/) : nil
       ))
@@ -376,21 +456,25 @@ private
       #or if in a non-bare context
       #just asserts because those contexts are never encountered.
       #control goes through symbol(<...>,nil)
-      assert( /^[a-z_]$/i===context)
+      assert( /^#@@LETTER$/o===context)
       assert MethNameToken===@last_operative_token || !(@last_operative_token===/^(\.|::|(un)?def|alias)$/)
-      @moretokens.unshift(*parse_keywords(str,oldpos) do |tok|
-        #if not a keyword,
-        case str
-          when FUNCLIKE_KEYWORDS; except=tok
-          when VARLIKE_KEYWORDS,RUBYKEYWORDS; raise "shouldnt see keywords here, now"
-        end
-        was_last=@last_operative_token
-        @last_operative_token=tok if tok
-        normally=safe_recurse { |a| var_or_meth_name(str,was_last,oldpos,after_nonid_op?{true}) }
-        (Array===normally ? normally[0]=except : normally=except) if except
-        normally
-      end)
+      if @parsestack.last.wantarrow and @rubyversion>=1.9 and @file.skip ":"
+        @moretokens.push SymbolToken.new(str,oldpos), KeywordToken.new("=>",input_position-1)
+      else
+        @moretokens.unshift(*parse_keywords(str,oldpos) do |tok|
+          #if not a keyword, decide if it should be var or method
+          case str
+            when FUNCLIKE_KEYWORDS; except=tok
+            when VARLIKE_KEYWORDS,RUBYKEYWORDS; raise "shouldnt see keywords here, now"
+          end
+          was_last=@last_operative_token
+          @last_operative_token=tok if tok
+          normally=safe_recurse { |a| var_or_meth_name(str,was_last,oldpos,after_nonid_op?{true}) }
+          (Array===normally ? normally[0]=except : normally=except) if except
+          normally
+        end)
+      end
       return @moretokens.shift
    end
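
Note on the wantarrow branch in this hunk: under :rubyversion 1.9, an identifier immediately followed by ":" is emitted as a SymbolToken plus a "=>" KeywordToken. That appears to correspond to the hash label syntax introduced in Ruby 1.9, where the two spellings below build the same hash. The method names and values are invented for illustration.

```ruby
# Ruby 1.8 style: explicit symbol keys and arrows.
def old_style_hash
  { :name => "rubylexer", :version => "0.7.4" }
end

# Ruby 1.9 style: "name:" is shorthand for ":name =>".
def new_style_hash
  { name: "rubylexer", version: "0.7.4" }
end

p old_style_hash == new_style_hash
```

Rewriting the label into a symbol and an arrow at lex time would let later stages treat both spellings identically.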
@@ -399,7 +483,7 @@ private
    def identifier_as_string(context)
       #must begin w/ letter or underscore
       #char class needs changing here for utf8 support
-      /[_a-z]/i===nextchar.chr or return
+      /#@@LETTER/o===nextchar.chr or return
       #equals, question mark, and exclamation mark
       #might be allowed at the end in some contexts.
@@ -418,7 +502,7 @@ private
         end
       @in_def_name||context==?: and trailers<<"|=(?![=~>])"
-      @file.scan(IDENTREX[trailers]||=/^(?>[_a-z][a-z0-9_]*(?:#{trailers})?)/i)
+      @file.scan(IDENTREX[trailers]||=/^(?>#@@LETTER#@@LETTER_DIGIT*(?:#{trailers})?)/)
    end
   #-----------------------------------
@@ -447,8 +531,8 @@ private
    def comma_in_lvalue_list?
      @parsestack.last.lhs=
        case l=@parsestack.last
-       when ListContext:
-       when DefContext: l.in_body
+       when ListContext;
+       when DefContext; l.in_body
        else true
        end
    end
@@ -459,7 +543,7 @@ private
      @defining_lvar or case ctx=@parsestack.last
        #when ForSMContext; ctx.state==:for
        when RescueSMContext
-         lasttok.ident=="=>" and @file.match?( /\A[\s\v]*([:;#\n]|then[^a-zA-Z0-9_])/m )
+         lasttok.ident=="=>" and @file.match?( /\A[\s\v]*([:;#\n]|then(?!#@@LETTER_DIGIT))/om )
        #when BlockParamListLhsContext; true
      end
    end
@@ -487,13 +571,13 @@ private
      was_in_lvar_define_state=in_lvar_define_state(lasttok)
      #maybe_local really means 'maybe local or constant'
      maybe_local=case name
-       when /[^a-z_0-9]$/i #do nothing
-       when /^[a-z_]/
+       when /(?!#@@LETTER_DIGIT).$/o #do nothing
+       when /^#@@LCLETTER/o
          (localvars===name or
           VARLIKE_KEYWORDS===name or
           was_in_lvar_define_state
          ) and not lasttok===/^(\.|::)$/
-       when /^[A-Z]/
+       when /^#@@UCLETTER/o
          is_const=true
          not lasttok==='.'  #this is the right algorithm for constants...
      end
@@ -509,7 +593,7 @@ private
      result=ws_toks=ignored_tokens(true) {|nl| sawnl=true }
      if sawnl || eof?
          if was_in_lvar_define_state
-           if /^[a-z_][a-zA-Z_0-9]*$/===name
+           if /^#@@LCLETTER#@@LETTER_DIGIT*$/o===name
              assert !(lasttok===/^(\.|::)$/)
              localvars[name]=true
            end
@@ -531,7 +615,7 @@ private
        when ?=;  not /^=[>=~]$/===readahead(2)
        when ?,; comma_in_lvalue_list?
        when ?); last_context_not_implicit.lhs
-       when ?i; /^in[^a-zA-Z_0-9]/===readahead(3) and
+       when ?i; /^in(?!#@@LETTER_DIGIT)/o===readahead(3) and
                   ForSMContext===last_context_not_implicit
        when ?>,?<; /^(.)\1=$/===readahead(3)
        when ?*,?&; /^(.)\1?=/===readahead(3)
@@ -543,8 +627,8 @@ private
      end
      if (assignment_coming && !(lasttok===/^(\.|::)$/) or was_in_lvar_define_state)
         tok=assign_lvar_type! VarNameToken.new(name,pos)
-        if /[^a-z_0-9]$/i===name
-        elsif /^[a-z_]/===name and !(lasttok===/^(\.|::)$/)
+        if /(?!#@@LETTER_DIGIT).$/o===name
+        elsif /^#@@LCLETTER/o===name and !(lasttok===/^(\.|::)$/)
           localvars[name]=true
         end
         return result.unshift(tok)
@@ -559,7 +643,7 @@ private
        when nil: 2
        when ?!; /^![=~]$/===readahead(2) ? 2 : 1
        when ?d;
-         if /^do([^a-zA-Z0-9_]|$)/===readahead(3)
+         if /^do((?!#@@LETTER_DIGIT)|$)/o===readahead(3)
            if maybe_local and expecting_do?
              ty=VarNameToken
              0
@@ -572,7 +656,7 @@ private
          end
        when NEVERSTARTPARAMLISTFIRST
          (NEVERSTARTPARAMLISTWORDS===readahead(NEVERSTARTPARAMLISTMAXLEN)) ? 2 : 1
-       when ?",?',?`,?a..?z,?A..?Z,?0..?9,?_,?@,?$,?~; 1 #"
+       when ?",?',?`,?a..?z,?A..?Z,?0..?9,?_,?@,?$,?~,NONASCII; 1 #"
        when ?{
          maybe_local=false
          1
@@ -633,10 +717,12 @@ private
          else
            3
          end
-       when ??; next3=readahead(3);
-                   /^\?([#{WHSPLF}]|[a-z_][a-z_0-9])/io===next3 ? 2 : 3
+       when ??; next3=readahead(3)
+                #? never begins a char constant if immediately followed
+                #by 2 or more letters or digits
+                   /^\?([#{WHSPLF}]|#@@LETTER_DIGIT{2})/o===next3 ? 2 : 3
 #       when ?:,??; (readahead(2)[/^.[#{WHSPLF}]/o]) ? 2 : 3
-       when ?<; (!ws_toks.empty? && readahead(4)[/^<<-?["'`a-zA-Z_0-9]/]) ? 3 : 2
+       when ?<; (!ws_toks.empty? && readahead(4)[/^<<-?(?:["'`]|#@@LETTER_DIGIT)/o]) ? 3 : 2
        when ?[;
            if ws_toks.empty?
              (KeywordToken===oldlast and /^(return|break|next)$/===oldlast.ident) ? 3 : 2
@@ -707,7 +793,7 @@ private
            break false
          elsif ','==tok.to_s and @parsestack.size==basesize+1
            break true
-         elsif OperatorToken===tok and /^[&*]$/===tok.ident and tok.unary and @parsestack.size==basesize+1
+         elsif OperatorToken===tok and /^[&*]$/===tok.ident and tok.tag and @parsestack.size==basesize+1
            break true
          elsif EoiToken===tok
            lexerror tok, "unexpected eof in parameter list"
@@ -890,7 +976,7 @@ private
          @moretokens.push KeywordToken.new('::',offset+md.end(0)-2) if dc
          loop do
            offset=input_position
-           @file.scan(/\A(#@@WSTOKS)?([A-Z][a-zA-Z_0-9]*)(::)?/o)
+           @file.scan(/\A(#@@WSTOKS)?(#@@UCLETTER#@@LETTER_DIGIT*)(::)?/o)
            #this regexp---^ will need to change in order to support utf8 properly.
            md=@file.last_match
            all,ws,name,dc=*md
@@ -1013,11 +1099,11 @@ private
               #maybe_local really means 'maybe local or constant'
               maybe_local=case name
-                when /[^a-z_0-9]$/i; #do nothing
+                when /(?!#@@LETTER_DIGIT).$/o; #do nothing
                 when /^[@$]/; true
                 when VARLIKE_KEYWORDS,FUNCLIKE_KEYWORDS; ty=KeywordToken
-                when /^[a-z_]/;  localvars===name
-                when /^[A-Z]/; is_const=true  #this is the right algorithm for constants...
+                when /^#@@LCLETTER/o;  localvars===name
+                when /^#@@UCLETTER/o; is_const=true  #this is the right algorithm for constants...
               end
               result.push(  *ignored_tokens(false,false)  )
               nc=nextchar
@@ -1059,7 +1145,7 @@ private
                #look for start of parameter list
                nc=(@moretokens.empty? ? nextchar.chr : @moretokens.first.to_s[0,1])
-               if state==:expect_op and /^[a-z_(&*]/i===nc
+               if state==:expect_op and /^(?:#@@LETTER|[(&*])/o===nc
                   ctx.state=:def_param_list
                   list,listend=def_param_list
                   result.concat list
@@ -1080,7 +1166,7 @@ private
                when EoiToken
                   lexerror tok,'unexpected eof in def header'
                when StillIgnoreToken
-               when MethNameToken ,VarNameToken # /^[a-z_]/i.token_pat
+               when MethNameToken ,VarNameToken # /^#@@LETTER/o.token_pat
                   lexerror tok,'expected . or ::' unless state==:expect_name
                   state=:expect_op
                when /^(\.|::)$/.token_pat
@@ -1416,7 +1502,7 @@ end
             #result.concat ignored_tokens
             if expect_name
               case tok
-                when IgnoreToken #, /^[A-Z]/ #do nothing
+                when IgnoreToken #, /^#@@UCLETTER/o #do nothing
                 when /^,$/.token_pat #hack
                 when VarNameToken
@@ -1498,12 +1584,20 @@ end
      if want_unary
        #readahead(2)[1..1][/[\s\v#\\]/] or #not needed?
        assert OperatorToken===result
-       result.unary=true         #result should distinguish unary+binary *&
+       result.tag=:unary         #result should distinguish unary+binary *&
        WHSPLF[nextchar.chr] or
          @moretokens << NoWsToken.new(input_position)
-       comma_in_lvalue_list?
+       cill=comma_in_lvalue_list?
        if ch=='*'
          @parsestack.last.see self, :splat
+         case @parsestack[-1]
+         when AssignmentRhsContext; result.tag= :rhs
+         when ParamListContext,ParamListContextNoParen; #:call
+         when ListImmedContext; #:array
+         when BlockParamListLhsContext; #:block
+         when KnownNestedLhsParenContext; #:nested
+         else          result.tag=     :lhs if cill
+         end
        end
      end
      result
@@ -1553,10 +1647,10 @@ end
      s=tok.to_s
      case s
-     when /[^a-z_0-9]$/i; false
-#     when /^[a-z_]/; localvars===s or VARLIKE_KEYWORDS===s
-     when /^[A-Z_]/i; VarNameToken===tok
      when /^[@$<]/; true
+     when /(?!#@@LETTER_DIGIT).$/o; false
+#     when /^#@@LCLETTER/o; localvars===s or VARLIKE_KEYWORDS===s
+     when /^#@@LETTER/o; VarNameToken===tok
      else raise "not var or method name: #{s}"
      end
    end
@@ -1573,7 +1667,7 @@ end
          if ch==':'
            not TernaryContext===@parsestack.last
          else
-           !readahead(3)[/^\?[a-z0-9_]{2}/i]
+           !readahead(3)[/^\?#@@LETTER_DIGIT{2}/o]
          end
      }
    end
@@ -1603,21 +1697,25 @@ end
         @moretokens.push tok=KeywordToken.new(':',startpos)
         case @parsestack.last
-        when TernaryContext:
+        when TernaryContext
           tok.ternary=true
           @parsestack.pop #should be in the context's see handler
-        when ExpectDoOrNlContext: #should be in the context's see handler
-          @parsestack.pop
-          assert @parsestack.last.starter[/^(while|until|for)$/]
-          tok.as=";"
-        when ExpectThenOrNlContext,WhenParamListContext:
-          #should be in the context's see handler
-          @parsestack.pop
-          tok.as="then"
-        when RescueSMContext:
+        when ExpectDoOrNlContext #should be in the context's see handler
+          if @rubyversion<1.9
+            @parsestack.pop
+            assert @parsestack.last.starter[/^(while|until|for)$/]
+            tok.as=";"
+          end
+        when ExpectThenOrNlContext,WhenParamListContext
+          if @rubyversion<1.9
+            #should be in the context's see handler
+            @parsestack.pop
+            tok.as="then"
+          end
+        when RescueSMContext
           tok.as=";"
-        else fail ": not expected in #{@parsestack.last.class}->#{@parsestack.last.starter}"
-        end
+        end or
+          fail ": not expected in #{@parsestack.last.class}->#{@parsestack.last.starter}"
         #end ternary context, if any
         @parsestack.last.see self,:colon
@@ -1631,7 +1729,7 @@ end
       lasttok=@last_operative_token
       assert !(String===lasttok)
       if (VarNameToken===lasttok or MethNameToken===lasttok) and
-          lasttok===/^[$@a-zA-Z_]/ and !WHSPCHARS[lastchar]
+          lasttok===/^(?:[$@]|#@@LETTER)/o and !WHSPCHARS[lastchar]
       then
          @moretokens << colon2
          result= NoWsToken.new(startpos)
@@ -1664,12 +1762,12 @@ end
          when ?` then read(1) #`
          when ?@ then at_identifier.to_s
          when ?$ then dollar_identifier.to_s
-         when ?_,?a..?z then identifier_as_string(?:)
+         when ?_,?a..?z,NONASCII then identifier_as_string(?:)
          when ?A..?Z then
            result=identifier_as_string(?:)
            if @last_operative_token==='::'
              assert klass==MethNameToken
-             /[A-Z_0-9]$/i===result and klass=VarNameToken
+             /#@@LETTER_DIGIT$/o===result and klass=VarNameToken
            end
            result
          else
@@ -1696,7 +1794,7 @@ end
      return [opmatches ? read(opmatches.size) :
        case nc=nextchar
          when ?` then read(1) #`
-         when ?_,?a..?z,?A..?Z then
+         when ?_,?a..?z,?A..?Z,NONASCII then
            context=merge_assignment_op_in_setter_callsites? ? ?: : nc
            identifier_as_string(context)
          else
@@ -1720,7 +1818,7 @@ end
         quote_real=true
       else
         quote='"'
-        ender=til_charset(/[^a-zA-Z0-9_]/)
+        ender=@file.scan(/#@@LETTER_DIGIT+/o)
         ender.length >= 1  or
           return lexerror(HerePlaceholderToken.new( dash, quote, ender, nil ), "invalid here header")
       end
@@ -1739,6 +1837,7 @@ if true
       nl=readnl or return lexerror(res, "here header without body (at eof)")
+      res.string.startline=linenum
       @moretokens<< res
       bodystart=input_position
       @offset_adjust = @min_offset_adjust+procrastinated.size
@@ -1748,6 +1847,8 @@ if true
       @offset_adjust = @min_offset_adjust
       #was: @offset_adjust -= procrastinated.size
       bodysize=input_position-bodystart
+      res.string.line=linenum-1
+      lexerror res,res.string.error
       #one or two already read characters are overwritten here,
       #in order to keep offsets correct in the long term
@@ -1814,7 +1915,7 @@ end
    #-----------------------------------
    def lessthan(ch) #match quadriop('<') or here doc or spaceship op
       case readahead(3)
-        when /^<<['"`\-a-z0-9_]$/i #'
+        when /^<<(?:['"`\-]|#@@LETTER_DIGIT)$/o #'
            if quote_expected?(ch) and not @last_operative_token==='class'
               here_header
            else
@@ -1901,7 +2002,11 @@ end
             if tofill.dash
               close+=til_charset(/[^#{WHSP}]/o)
             end
-            break if eof? #this is an error, should be handled better
+            if eof? #this is an error, should be handled better
+              lexerror tofill, "unterminated here body"
+              lexerror tofill.string, "unterminated here body"
+              break
+            end
             if read(tofill.ender.size)==tofill.ender
               crs=til_charset(/[^\r]/)||''
               if nl=readnl
@@ -1917,6 +2022,8 @@ end
               line=til_charset(/[\n]/)
               unless nl=readnl
                 assert eof?
+                lexerror tofill, "unterminated here body"
+                lexerror tofill.string, "unterminated here body"
                 break  #this is an error, should be handled better
               end
               line.chomp!("\r")
@@ -2118,7 +2225,7 @@ end
   #used to resolve the ambiguity of
   # unary ops (+, -, *, &, ~ !) in ruby
   #returns whether current token is to be the start of a literal
-  IDBEGINCHAR=/^[a-zA-Z_$@]/
+  IDBEGINCHAR=/^(?:#@@LETTER|[$@])/o
   def unary_op_expected?(ch) #yukko hack
     '*&='[readahead(2)[1..1]] and return false
@@ -2139,8 +2246,8 @@ end
    def quote_expected?(ch) #yukko hack
      case ch[0]
           when ?? then readahead(2)[/^\?[#{WHSPLF}]$/o] #not needed?
-          when ?% then readahead(3)[/^%([a-pt-vyzA-PR-VX-Z]|[QqrswWx][a-zA-Z0-9])/]
-          when ?< then !readahead(4)[/^<<-?['"`a-z0-9_]/i]
+          when ?% then readahead(3)[/^%([a-pt-vyzA-PR-VX-Z]|[QqrswWx]#{@@LETTER_DIGIT.gsub('_','')})/o]
+          when ?< then !readahead(4)[/^<<-?(?:['"`]|#@@LETTER_DIGIT)/o]
           else raise 'unexpected ch (#{ch}) in quote_expected?'
      #     when ?+,?-,?&,?*,?~,?! then '*&='[readahead(2)[1..1]]
      end and return false
@@ -2322,17 +2429,26 @@ end
       str << c
       result= operator_or_methname_token( str,offset)
       case c
-      when '=': #===,==
+      when '=' #===,==
         str<< (eat_next_if(?=)or'')
-      when '>': #=>
+      when '>' #=>
         unless ParamListContextNoParen===@parsestack.last
           @moretokens.unshift result
           @moretokens.unshift( *abort_noparens!("=>"))
           result=@moretokens.shift
         end
         @parsestack.last.see self,:arrow
-      when '': #plain assignment: record local variable definitions
+      when '~' # =~... after regex, maybe?
+        last=last_operative_token
+        if @rubyversion>=1.9 and StringToken===last and last.lvars
+          #ruby delays adding lvars from regexps to known lvars table
+          #for several tokens in some cases. not sure why or if on purpose
+          #i'm just going to add them right away
+          localvars.concat last.lvars
+        end
+      when '' #plain assignment: record local variable definitions
         last_context_not_implicit.lhs=false
         @moretokens.push( *ignored_tokens(true).map{|x|
           NewlineToken===x ? EscNlToken.new(@filename,@linenum,x.ident,x.offset) : x
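
Note on the new '~' branch in this hunk: it relates to a Ruby 1.9 feature in which a regexp literal with named groups, used as the left operand of =~, assigns each group to a local variable of the same name. A lexer that tracks local variables therefore has to register those names. A small sketch of the language behavior follows; the method name and the sample date are invented for illustration.

```ruby
# Hypothetical helper showing named captures becoming local variables.
def date_parts(date)
  # year and month are created as locals because the regexp literal
  # with named groups is on the left side of =~
  if /(?<year>\d{4})-(?<month>\d\d)/ =~ date
    [year, month]
  end
end

p date_parts("2009-05-20")
```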
@@ -2340,7 +2456,7 @@ end
         @parsestack.push AssignmentRhsContext.new(@linenum)
         if eat_next_if ?*
           tok=OperatorToken.new('*', input_position-1)
-          tok.unary=true
+          tok.tag=:unary
           @moretokens.push tok
           WHSPLF[nextchar.chr] or
             @moretokens << NoWsToken.new(input_position)
@@ -2450,14 +2566,15 @@ end
         tokch.set_infix! unless after_nonid_op?{WHSPLF[lastchar]}
         @parsestack.push ListImmedContext.new(ch,@linenum)
         lasttok=last_operative_token
-        #could be: lasttok===/^[a-z_]/i
-        if (VarNameToken===lasttok or ImplicitParamListEndToken===lasttok or MethNameToken===lasttok) and !WHSPCHARS[lastchar]
+        #could be: lasttok===/^#@@LETTER/o
+        if (VarNameToken===lasttok or ImplicitParamListEndToken===lasttok or
+            MethNameToken===lasttok or lasttok===FUNCLIKE_KEYWORDS) and !WHSPCHARS[lastchar]
                @moretokens << (tokch)
                tokch= NoWsToken.new(input_position-1)
         end
       when '('
         lasttok=last_token_maybe_implicit #last_operative_token
-        #could be: lasttok===/^[a-z_]/i
+        #could be: lasttok===/^#@@LETTER/o
         if (VarNameToken===lasttok or MethNameToken===lasttok or
             lasttok===FUNCLIKE_KEYWORDS)
           unless WHSPCHARS[lastchar]
@@ -2466,7 +2583,17 @@ end
           end
           @parsestack.push ParamListContext.new(@linenum)
         else
-          @parsestack.push ParenContext.new(@linenum)
+          ctx=@parsestack.last
+          lasttok=last_operative_token
+          maybe_def=DefContext===ctx && !ctx.in_body &&
+            !(KeywordToken===lasttok && lasttok.ident=="def")
+          if maybe_def or
+             BlockParamListLhsContext===ctx or
+             ParenContext===ctx && ctx.lhs
+            @parsestack.push KnownNestedLhsParenContext.new(@linenum)
+          else
+            @parsestack.push ParenContext.new(@linenum)
+          end
         end
       when '{'
@@ -2574,13 +2701,14 @@ end
          @parsestack.pop
          @moretokens.unshift AssignmentRhsListEndToken.new(input_position)
     end
-    token.comma_type=
     case @parsestack[-1]
-    when AssignmentRhsContext; :rhs
-    when ParamListContext,ParamListContextNoParen; :call
-    when ListImmedContext; :array
+    when AssignmentRhsContext; token.tag=:rhs
+    when ParamListContext,ParamListContextNoParen; #:call
+    when ListImmedContext; #:array
+    when BlockParamListLhsContext; #:block
+    when KnownNestedLhsParenContext; #:nested
     else
-      :lhs if comma_in_lvalue_list?
+      token.tag=:lhs if comma_in_lvalue_list?
     end
     @parsestack.last.see self,:comma
     return @moretokens.shift