RubyGems - rubylexer - Versions diffs - 0.7.3 → 0.7.4 - Mend

rubylexer 0.7.3 → 0.7.4


Files changed (20)

data/History.txt +16 -0
data/Manifest.txt +3 -1
data/README.txt +12 -19
data/Rakefile +2 -2
data/lib/rubylexer.rb +214 -86
data/lib/rubylexer/context.rb +17 -6
data/lib/rubylexer/lextable.rb +202 -0
data/lib/rubylexer/rulexer.rb +61 -9
data/lib/rubylexer/test/illegal_oneliners.rb +4 -0
data/lib/rubylexer/test/stanzas.rb +2 -0
data/lib/rubylexer/test/testcases.rb +6 -1
data/lib/rubylexer/token.rb +4 -1
data/lib/rubylexer/version.rb +1 -1
data/test/code/regression.rb +1 -1
data/test/code/rubylexervsruby.rb +23 -6
data/test/data/1.rb +729 -0
data/test/data/heart.rb +43 -2
data/test/data/pleac.rb +6282 -0
data/testing.txt +1 -1
metadata +7 -4

data/History.txt CHANGED

@@ -1,3 +1,19 @@
+=== 0.7.4/5-20-2009
+* 2 Major Enhancements:
+  * preliminary support for ruby 1.9
+  * utf8 inputs should now work... more or less
+* 5 Minor Enhancements:
+  * better detection of illegal escapes and interpolations in strings
+  * indicate error on unterminated here body
+  * fixed pattern of keywords that can't start a param list (ignores ?,! now)
+  * in is_var_name?, check for global/instance vars first
+  * comma and star in a true lhs should be correctly marked as such, now
+* 2 Bugfixes:
+  * added tag field to Token; I hope many flags can be coalesced into tag.
+  * note line that all strings (and here docs) start and end on
 === 0.7.3/4-19-2009
 * 9 Bugfixes:
   * remember whether comma was seen in paren context

data/Manifest.txt CHANGED

@@ -15,6 +15,7 @@ lib/rubylexer/version.rb
 lib/rubylexer/rulexer.rb
 lib/rubylexer/tokenprinter.rb
 lib/rubylexer/charset.rb
+lib/rubylexer/lextable.rb
 lib/rubylexer/symboltable.rb
 lib/rubylexer/charhandler.rb
 lib/assert.rb
@@ -43,11 +44,13 @@ test/data/23.rb
 test/data/lbrack.rb
 test/data/untitled1.rb
 test/data/rescue.rb
+test/data/pleac.rb
 test/data/pleac.rb.broken
 test/data/heart.rb
 test/data/s.rb
 test/data/wsdlDriver.rb
 test/data/p-op.rb
+test/data/1.rb
 test/data/1.rb.broken
 test/data/untermed_here.rb.broken
 test/data/newsyntax.rb
@@ -72,7 +75,6 @@ test/code/regression.rb
 test/code/strgen.rb
 test/code/tarball.rb
 lib/rubylexer/test/testcases.rb
-test/data/chunky.plain.rb
 test/data/cvtesc.rb
 test/data/__eof2.rb
 test/data/__eof5.rb

data/README.txt CHANGED

@@ -1,8 +1,7 @@
 = RubyLexer
-*
-*
-*
+* rubyforge.net/projects/rubylexer
+* github.com/coatl/rubylexer
 === DESCRIPTION:
@@ -48,13 +47,9 @@ end
 == Status
 RubyLexer can correctly lex all legal Ruby 1.8 code that I've been able to
 find on my Debian system. It can also handle (most of) my catalog of nasty
-test cases (in testdata/p.rb) (see below for known problems). At this point,
-new bugs are almost exclusively found by my home-grown test code, rather
-than ruby code gathered 'from the wild'. There are a number of issues I know
-about and plan to fix, but it seems that Ruby coders don't write code complex
-enough to trigger them very often. Although incomplete, RubyLexer can
-correctly distinguish these ambiguous uses of the following operator and
-keywords, depending on context:
+test cases (see below for known problems). Modulo some very obscure bugs,
+RubyLexer can correctly distinguish these ambiguous uses of the following
+operators, depending on context:
   %   can be modulus operator or start of fancy string
   /   can be division operator or start of regex
   * & + - :: can be unary or binary operator
@@ -83,18 +78,16 @@ emit advisory tokens when local var defined/goes out of scope (or hidden/unhidde
 token pruning in dumptokens...
 == known issues: (and planned fix release)
-context not really preserved when entering or leaving string inclusions. this causes
-a number or problems. local variables are ok now, but here document headers started
-in a string inclusion with the body outside will be a problem. (0.8)
-string tokenization sometimes a little different from ruby around newlines
-  (htree/template.rb) (0.8)
+context not really preserved when entering or leaving string inclusions. this caused
+-a number or problems, which had to be hacked around. it would be better to avoid
+-tokens within tokens. (0.8)
 string contents might not be correctly translated in a few cases (0.8?)
 symbols which contain string interpolations are flattened into one token. eg :"foo#{bar}" (0.8)
 '\r' whitespace sometimes seen in dos-formatted output.. shouldn't be (eg pre.rb) (0.7)
 windows newline in source is likely to cause problems in obscure cases (need test case)
-unterminated =begin is not an error (0.8)
-ruby 1.9 completely unsupported (0.9)
-character sets other than ascii are not supported at all (1.0)
-regression test currently shows 14 errors with differences in exact token ordering
+ruby 1.9 incompletely supported (0.9)
+current character set is always forced to ascii-8bit. however, this mode should be
+-compatible with texts written in regular ascii, utf-8, and euc. (among others?) (1.0)
+regression test currently shows a few errors with differences in exact token ordering
 -around string inclusions. these errors are much less serious than they seem.
 offset of AssignmentRhsListEndToken appears to be off by 1
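
The ambiguities listed in the README's Status section are easy to reproduce in plain Ruby. The snippet below is only an illustration (it does not call RubyLexer, and the variable names are made up): the same character is an operator or the start of a literal depending on what precedes it.

```ruby
x = 10
y = 3

remainder  = x % y     # % after an operand: modulus operator
words      = %w[a b]   # % at the start of an expression: "fancy" string/array literal

quotient   = x / y     # / after an operand: division operator
pattern    = /y/       # / at the start of an expression: regex literal

difference = x - y     # binary minus
negated    = -y        # unary minus
```

A lexer has to make the same decision from the preceding tokens alone, which is the job the README describes.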

data/Rakefile CHANGED

@@ -25,13 +25,13 @@ require 'lib/rubylexer/version.rb'
    hoe=Hoe.new("rubylexer", RubyLexer::VERSION) do |_|
      _.author = "Caleb Clausen"
      _.email = "rubylexer-owner @at@ inforadical .dot. net"
-     _.url = ["http://rubylexer.rubyforge.org/", "http://rubyforge.org/projects/rubylexer/"]
+     _.url = ["http://github.com/coatl/rubylexer/", "http://rubyforge.org/projects/rubylexer/"]
      _.extra_deps << ['sequence', '>= 0.2.0']
      _.test_globs=["test/code/regression.rb"]
      _.description=desc
      _.summary=desc[/\A[^.]+\./]
      _.spec_extras={:bindir=>'',:rdoc_options=>'-x lib/rubylexer/test'}
-     _.rdoc_pattern=/\A(howtouse\.txt|testing\.txt|README\.txt|lib\/[^\/]*\.rb|lib\/rubylexer\/[^\d][^\/]*\.rb)\Z/
+     #_.rdoc_pattern=/\A(howtouse\.txt|testing\.txt|README\.txt|lib\/[^\/]*\.rb|lib\/rubylexer\/[^\d][^\/]*\.rb)\Z/
    end

data/lib/rubylexer.rb CHANGED

@@ -1,4 +1,4 @@
-=begin legal crap
+=begin
     rubylexer - a ruby lexer written in ruby
     Copyright (C) 2004,2005,2008  Caleb Clausen
@@ -60,9 +60,6 @@ class RubyLexer
    INNERBOUNDINGWORDS="(#{INNERBOUNDINGWORDLIST.join '|'})"
    BINOPWORDLIST=%w"and or"
    BINOPWORDS="(#{BINOPWORDLIST.join '|'})"
-   NEVERSTARTPARAMLISTWORDS=/\A(#{OPORBEGINWORDS}|#{INNERBOUNDINGWORDS}|#{BINOPWORDS}|end)([^a-zA-Z0-9_!?=]|\Z)/o
-   NEVERSTARTPARAMLISTFIRST=CharSet['aoeitrwu']  #chars that begin NEVERSTARTPARAMLIST
-   NEVERSTARTPARAMLISTMAXLEN=7     #max len of a NEVERSTARTPARAMLIST
    RUBYKEYWORDS=%r{
      ^(alias|#{BINOPWORDS}|defined\?|not|undef|end|
@@ -72,6 +69,11 @@ class RubyLexer
    }xo
       #__END__ should not be in this set... its handled in start_of_line_directives
+   HIGHASCII=?\x80..?\xFF
+   NONASCII=HIGHASCII
+   #NONASCII=?\x80..?xFFFFFFFF  #or is it 10FFFF, whatever the highest conceivable code point
    CHARMAPPINGS = {
          ?$ => :dollar_identifier,
          ?@ => :at_identifier,
@@ -115,14 +117,43 @@ class RubyLexer
          "])}" => :close_brace,
-         ?# => :comment
+         ?# => :comment,
+         NONASCII => :identifier,
    }
    attr_reader :incomplete_here_tokens, :parsestack, :last_token_maybe_implicit
+   UCLETTER=@@UCLETTER="[A-Z]"
+   #cheaters way, treats utf chars as always 1 byte wide
+   #all high-bit chars are lowercase letters
+   #works, but strings compare with strict binary identity, not unicode collation
+   #works for euc too, I think
+   #(the ruby spec for utf8 support permits this interpretation)
+   LCLETTER=@@LCLETTER="[a-z_\x80-\xFF]"
+   LETTER=@@LETTER="[A-Za-z_\x80-\xFF]"
+   LETTER_DIGIT=@@LETTER_DIGIT="[A-Za-z_0-9\x80-\xFF]"
+   eval %w[UCLETTER LCLETTER LETTER LETTER_DIGIT].map{|n| "
+     def #{n}; #{n}; end
+     def self.#{n}; @@#{n}; end
+     "
+   }.to_s
+   NEVERSTARTPARAMLISTWORDS=/\A(#{OPORBEGINWORDS}|#{INNERBOUNDINGWORDS}|#{BINOPWORDS}|end)((?:(?!#@@LETTER_DIGIT).)|\Z)/om
+   NEVERSTARTPARAMLISTFIRST=CharSet['aoeitrwu']  #chars that begin NEVERSTARTPARAMLIST
+   NEVERSTARTPARAMLISTMAXLEN=7     #max len of a NEVERSTARTPARAMLIST
+=begin
+   require 'jcode'
+   utf8=String::PATTERN_UTF8 #or euc, or sjis...
+   LCLETTER_U="(?>[a-z_]|#{utf8})"
+   LETTER_U="(?>[A-Za-z_]|#{utf8})"
+   IDENTCHAR_U="(?>[A-Za-z_0-9]|#{utf8})"
+=end
    #-----------------------------------
-   def initialize(filename,file,linenum=1,offset_adjust=0)
+   def initialize(filename,file,linenum=1,offset_adjust=0,options={:rubyversion=>1.8})
       @offset_adjust=0 #set again in next line
       super(filename,file, linenum,offset_adjust)
       @start_linenum=linenum
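
The "cheaters way" comments in the hunk above describe treating every high-bit byte as a lowercase letter, so multi-byte UTF-8 (or EUC) characters are accepted inside identifiers one byte at a time. A rough sketch of that idea in present-day Ruby, where both the pattern and the input have to be binary for the byte ranges to apply (the constant and method names are illustrative and not part of RubyLexer):

```ruby
# Byte-oriented identifier pattern: any byte >= 0x80 counts as a letter.
# The /n flag makes the regexp binary (ASCII-8BIT).
IDENT_BYTES = /\A[A-Za-z_\x80-\xFF][A-Za-z_0-9\x80-\xFF]*\z/n

def identifier_bytes?(text)
  # Match against the raw bytes, so the string's declared encoding is ignored.
  IDENT_BYTES.match?(text.b)
end

accented = identifier_bytes?("caf\u00e9")  # multi-byte letter, accepted byte by byte
numeric  = identifier_bytes?("9lives")     # leading digit, rejected
```

The trade-off named in the diff's comments still applies: names are compared as byte strings, with no Unicode normalisation or collation.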
@@ -137,13 +168,61 @@ class RubyLexer
       @enable_macro=nil
       @base_file=nil
       @progress_thread=nil
+      @rubyversion=options[:rubyversion]
+      @encoding=options[:encoding]||:detect
       @toptable=CharHandler.new(self, :illegal_char, CHARMAPPINGS)
+      read_leading_encoding
       start_of_line_directives
       progress_printer
    end
+   ENCODING_ALIASES={
+    'utf-8'=>'utf8',
+    'ascii-8bit'=>'binary',
+    'ascii-7bit'=>'ascii',
+    'euc-jp'=>'euc',
+    'ascii8bit'=>'binary',
+    'ascii7bit'=>'ascii',
+    'eucjp'=>'euc',
+    'us-ascii'=>'ascii',
+    'shift-jis'=>'sjis',
+    'autodetect'=>'detect',
+   }
+   ENCODINGS=%w[ascii binary utf8 euc sjis]
+   def read_leading_encoding
+     return unless @encoding==:detect
+     @encoding=:ascii
+     @encoding=:utf8 if @file.skip( /\xEF\xBB\xBF/ )   #bom
+     if @file.skip( /\A#!/ )
+       loop do
+         til_charset( /[\s\v]/ )
+         break if @file.skip( / ([^-\s\v]|--[\s\v])/,4 )
+         if @file.skip( /.-K(.)/ )
+           case $1
+           when 'u'; @encoding=:utf8
+           when 'e'; @encoding=:euc
+           when 's'; @encoding=:sjis
+           end
+         end
+       end
+       til_charset( /[\n]/ )
+     end
+     if @rubyversion>=1.9 and @file.skip(
+          /\A#[\x00-\x7F]*?(?:en)?coding[\s\v]*[:=][\s\v]*([a-z0-9_-]+)[\x00-\x7F]*\n/i
+        )
+       name=$1
+       name.downcase!
+       name=ENCODING_ALIASES[name] if ENCODING_ALIASES[name]
+       @encoding=name.to_sym if ENCODINGS.include? name
+     end
+   end
    def progress_printer
      return unless ENV['RL_PROGRESS']
      $stderr.puts 'printing progresses'
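
The detection order in `read_leading_encoding` (default to ascii, switch to utf8 on a byte-order mark, then let a `coding:` comment override) can be sketched outside the lexer. This is a simplified stand-in under stated assumptions: it works on a plain String rather than the lexer's `@file`, skips the `#!` line and `-K` switch handling, only inspects the first line, and `sniff_encoding` is a hypothetical name. The alias table is a subset of the one in the diff.

```ruby
ALIASES = {
  "utf-8" => "utf8", "us-ascii" => "ascii", "ascii-8bit" => "binary",
  "euc-jp" => "euc", "shift-jis" => "sjis"
}.freeze
KNOWN = %w[ascii binary utf8 euc sjis].freeze
BOM = "\xEF\xBB\xBF".b

def sniff_encoding(source)
  src = source.b                      # work on raw bytes
  encoding = :ascii
  if src.start_with?(BOM)
    encoding = :utf8
    src = src[3..-1]                  # drop the three BOM bytes
  end
  first_line = src.lines.first.to_s
  if first_line =~ /\A#.*?coding[ \t]*[:=][ \t]*([A-Za-z0-9_-]+)/
    name = Regexp.last_match(1).downcase.force_encoding(Encoding::UTF_8)
    name = ALIASES.fetch(name, name)  # normalise spelling variants
    encoding = name.to_sym if KNOWN.include?(name)
  end
  encoding
end
```

For example, `sniff_encoding("# -*- coding: utf-8 -*-\nputs 1\n")` yields `:utf8`, while a file with no marker at all stays `:ascii`.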
@@ -163,6 +242,7 @@ class RubyLexer
    attr :localvars_stack
    attr :offset_adjust
    attr_writer :pending_here_bodies
+   attr :rubyversion
    #-----------------------------------
    def set_last_token(tok)
@@ -361,7 +441,7 @@ private
       result = ((
       #order matters here, but it shouldn't
       #(but til_charset must be last)
-         eat_if(/-[a-z0-9_]/i,2) or
+         eat_if(/-#@@LETTER_DIGIT/o,2) or
          eat_next_if(/[!@&+`'=~\-\/\\,.;<>*"$?:]/) or
          (?0..?9)===nextchar ? til_charset(/[^\d]/) : nil
       ))
@@ -376,21 +456,25 @@ private
       #or if in a non-bare context
       #just asserts because those contexts are never encountered.
       #control goes through symbol(<...>,nil)
-      assert( /^[a-z_]$/i===context)
+      assert( /^#@@LETTER$/o===context)
       assert MethNameToken===@last_operative_token || !(@last_operative_token===/^(\.|::|(un)?def|alias)$/)
-      @moretokens.unshift(*parse_keywords(str,oldpos) do |tok|
-        #if not a keyword,
-        case str
-          when FUNCLIKE_KEYWORDS; except=tok
-          when VARLIKE_KEYWORDS,RUBYKEYWORDS; raise "shouldnt see keywords here, now"
-        end
-        was_last=@last_operative_token
-        @last_operative_token=tok if tok
-        normally=safe_recurse { |a| var_or_meth_name(str,was_last,oldpos,after_nonid_op?{true}) }
-        (Array===normally ? normally[0]=except : normally=except) if except
-        normally
-      end)
+      if @parsestack.last.wantarrow and @rubyversion>=1.9 and @file.skip ":"
+        @moretokens.push SymbolToken.new(str,oldpos), KeywordToken.new("=>",input_position-1)
+      else
+        @moretokens.unshift(*parse_keywords(str,oldpos) do |tok|
+          #if not a keyword, decide if it should be var or method
+          case str
+            when FUNCLIKE_KEYWORDS; except=tok
+            when VARLIKE_KEYWORDS,RUBYKEYWORDS; raise "shouldnt see keywords here, now"
+          end
+          was_last=@last_operative_token
+          @last_operative_token=tok if tok
+          normally=safe_recurse { |a| var_or_meth_name(str,was_last,oldpos,after_nonid_op?{true}) }
+          (Array===normally ? normally[0]=except : normally=except) if except
+          normally
+        end)
+      end
       return @moretokens.shift
    end
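
As I read the hunk above, when the lexer is in 1.9 mode and the enclosing context can accept an arrow, an identifier immediately followed by `:` is emitted as a symbol token plus a synthetic `=>` keyword token. That mirrors how Ruby 1.9 and later treat the two hash spellings as equivalent, which can be checked directly (plain Ruby, no RubyLexer calls; the keys are arbitrary):

```ruby
new_style = { name: "rubylexer", major: 0 }        # 1.9 "key: value" spelling
old_style = { :name => "rubylexer", :major => 0 }  # 1.8 "symbol => value" spelling
same = (new_style == old_style)
```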
@@ -399,7 +483,7 @@ private
    def identifier_as_string(context)
       #must begin w/ letter or underscore
       #char class needs changing here for utf8 support
-      /[_a-z]/i===nextchar.chr or return
+      /#@@LETTER/o===nextchar.chr or return
       #equals, question mark, and exclamation mark
       #might be allowed at the end in some contexts.
@@ -418,7 +502,7 @@ private
         end
       @in_def_name||context==?: and trailers<<"|=(?![=~>])"
-      @file.scan(IDENTREX[trailers]||=/^(?>[_a-z][a-z0-9_]*(?:#{trailers})?)/i)
+      @file.scan(IDENTREX[trailers]||=/^(?>#@@LETTER#@@LETTER_DIGIT*(?:#{trailers})?)/)
    end
   #-----------------------------------
@@ -447,8 +531,8 @@ private
    def comma_in_lvalue_list?
      @parsestack.last.lhs=
        case l=@parsestack.last
-       when ListContext:
-       when DefContext: l.in_body
+       when ListContext;
+       when DefContext; l.in_body
        else true
        end
    end
@@ -459,7 +543,7 @@ private
      @defining_lvar or case ctx=@parsestack.last
        #when ForSMContext; ctx.state==:for
        when RescueSMContext
-         lasttok.ident=="=>" and @file.match?( /\A[\s\v]*([:;#\n]|then[^a-zA-Z0-9_])/m )
+         lasttok.ident=="=>" and @file.match?( /\A[\s\v]*([:;#\n]|then(?!#@@LETTER_DIGIT))/om )
        #when BlockParamListLhsContext; true
      end
    end
@@ -487,13 +571,13 @@ private
      was_in_lvar_define_state=in_lvar_define_state(lasttok)
      #maybe_local really means 'maybe local or constant'
      maybe_local=case name
-       when /[^a-z_0-9]$/i #do nothing
-       when /^[a-z_]/
+       when /(?!#@@LETTER_DIGIT).$/o #do nothing
+       when /^#@@LCLETTER/o
          (localvars===name or
           VARLIKE_KEYWORDS===name or
           was_in_lvar_define_state
          ) and not lasttok===/^(\.|::)$/
-       when /^[A-Z]/
+       when /^#@@UCLETTER/o
          is_const=true
          not lasttok==='.'  #this is the right algorithm for constants...
      end
@@ -509,7 +593,7 @@ private
      result=ws_toks=ignored_tokens(true) {|nl| sawnl=true }
      if sawnl || eof?
          if was_in_lvar_define_state
-           if /^[a-z_][a-zA-Z_0-9]*$/===name
+           if /^#@@LCLETTER#@@LETTER_DIGIT*$/o===name
              assert !(lasttok===/^(\.|::)$/)
              localvars[name]=true
            end
@@ -531,7 +615,7 @@ private
        when ?=;  not /^=[>=~]$/===readahead(2)
        when ?,; comma_in_lvalue_list?
        when ?); last_context_not_implicit.lhs
-       when ?i; /^in[^a-zA-Z_0-9]/===readahead(3) and
+       when ?i; /^in(?!#@@LETTER_DIGIT)/o===readahead(3) and
                   ForSMContext===last_context_not_implicit
        when ?>,?<; /^(.)\1=$/===readahead(3)
        when ?*,?&; /^(.)\1?=/===readahead(3)
@@ -543,8 +627,8 @@ private
      end
      if (assignment_coming && !(lasttok===/^(\.|::)$/) or was_in_lvar_define_state)
         tok=assign_lvar_type! VarNameToken.new(name,pos)
-        if /[^a-z_0-9]$/i===name
-        elsif /^[a-z_]/===name and !(lasttok===/^(\.|::)$/)
+        if /(?!#@@LETTER_DIGIT).$/o===name
+        elsif /^#@@LCLETTER/o===name and !(lasttok===/^(\.|::)$/)
           localvars[name]=true
         end
         return result.unshift(tok)
@@ -559,7 +643,7 @@ private
        when nil: 2
        when ?!; /^![=~]$/===readahead(2) ? 2 : 1
        when ?d;
-         if /^do([^a-zA-Z0-9_]|$)/===readahead(3)
+         if /^do((?!#@@LETTER_DIGIT)|$)/o===readahead(3)
            if maybe_local and expecting_do?
              ty=VarNameToken
              0
@@ -572,7 +656,7 @@ private
          end
        when NEVERSTARTPARAMLISTFIRST
          (NEVERSTARTPARAMLISTWORDS===readahead(NEVERSTARTPARAMLISTMAXLEN)) ? 2 : 1
-       when ?",?',?`,?a..?z,?A..?Z,?0..?9,?_,?@,?$,?~; 1 #"
+       when ?",?',?`,?a..?z,?A..?Z,?0..?9,?_,?@,?$,?~,NONASCII; 1 #"
        when ?{
          maybe_local=false
          1
@@ -633,10 +717,12 @@ private
          else
            3
          end
-       when ??; next3=readahead(3);
-                   /^\?([#{WHSPLF}]|[a-z_][a-z_0-9])/io===next3 ? 2 : 3
+       when ??; next3=readahead(3)
+                #? never begins a char constant if immediately followed
+                #by 2 or more letters or digits
+                   /^\?([#{WHSPLF}]|#@@LETTER_DIGIT{2})/o===next3 ? 2 : 3
 #       when ?:,??; (readahead(2)[/^.[#{WHSPLF}]/o]) ? 2 : 3
-       when ?<; (!ws_toks.empty? && readahead(4)[/^<<-?["'`a-zA-Z_0-9]/]) ? 3 : 2
+       when ?<; (!ws_toks.empty? && readahead(4)[/^<<-?(?:["'`]|#@@LETTER_DIGIT)/o]) ? 3 : 2
        when ?[;
            if ws_toks.empty?
              (KeywordToken===oldlast and /^(return|break|next)$/===oldlast.ident) ? 3 : 2
@@ -707,7 +793,7 @@ private
            break false
          elsif ','==tok.to_s and @parsestack.size==basesize+1
            break true
-         elsif OperatorToken===tok and /^[&*]$/===tok.ident and tok.unary and @parsestack.size==basesize+1
+         elsif OperatorToken===tok and /^[&*]$/===tok.ident and tok.tag and @parsestack.size==basesize+1
            break true
          elsif EoiToken===tok
            lexerror tok, "unexpected eof in parameter list"
@@ -890,7 +976,7 @@ private
          @moretokens.push KeywordToken.new('::',offset+md.end(0)-2) if dc
          loop do
            offset=input_position
-           @file.scan(/\A(#@@WSTOKS)?([A-Z][a-zA-Z_0-9]*)(::)?/o)
+           @file.scan(/\A(#@@WSTOKS)?(#@@UCLETTER#@@LETTER_DIGIT*)(::)?/o)
            #this regexp---^ will need to change in order to support utf8 properly.
            md=@file.last_match
            all,ws,name,dc=*md
@@ -1013,11 +1099,11 @@ private
               #maybe_local really means 'maybe local or constant'
               maybe_local=case name
-                when /[^a-z_0-9]$/i; #do nothing
+                when /(?!#@@LETTER_DIGIT).$/o; #do nothing
                 when /^[@$]/; true
                 when VARLIKE_KEYWORDS,FUNCLIKE_KEYWORDS; ty=KeywordToken
-                when /^[a-z_]/;  localvars===name
-                when /^[A-Z]/; is_const=true  #this is the right algorithm for constants...
+                when /^#@@LCLETTER/o;  localvars===name
+                when /^#@@UCLETTER/o; is_const=true  #this is the right algorithm for constants...
               end
               result.push(  *ignored_tokens(false,false)  )
               nc=nextchar
@@ -1059,7 +1145,7 @@ private
                #look for start of parameter list
                nc=(@moretokens.empty? ? nextchar.chr : @moretokens.first.to_s[0,1])
-               if state==:expect_op and /^[a-z_(&*]/i===nc
+               if state==:expect_op and /^(?:#@@LETTER|[(&*])/o===nc
                   ctx.state=:def_param_list
                   list,listend=def_param_list
                   result.concat list
@@ -1080,7 +1166,7 @@ private
                when EoiToken
                   lexerror tok,'unexpected eof in def header'
                when StillIgnoreToken
-               when MethNameToken ,VarNameToken # /^[a-z_]/i.token_pat
+               when MethNameToken ,VarNameToken # /^#@@LETTER/o.token_pat
                   lexerror tok,'expected . or ::' unless state==:expect_name
                   state=:expect_op
                when /^(\.|::)$/.token_pat
@@ -1416,7 +1502,7 @@ end
             #result.concat ignored_tokens
             if expect_name
               case tok
-                when IgnoreToken #, /^[A-Z]/ #do nothing
+                when IgnoreToken #, /^#@@UCLETTER/o #do nothing
                 when /^,$/.token_pat #hack
                 when VarNameToken
@@ -1498,12 +1584,20 @@ end
      if want_unary
        #readahead(2)[1..1][/[\s\v#\\]/] or #not needed?
        assert OperatorToken===result
-       result.unary=true         #result should distinguish unary+binary *&
+       result.tag=:unary         #result should distinguish unary+binary *&
        WHSPLF[nextchar.chr] or
          @moretokens << NoWsToken.new(input_position)
-       comma_in_lvalue_list?
+       cill=comma_in_lvalue_list?
        if ch=='*'
          @parsestack.last.see self, :splat
+         case @parsestack[-1]
+         when AssignmentRhsContext; result.tag= :rhs
+         when ParamListContext,ParamListContextNoParen; #:call
+         when ListImmedContext; #:array
+         when BlockParamListLhsContext; #:block
+         when KnownNestedLhsParenContext; #:nested
+         else          result.tag=     :lhs if cill
+         end
        end
      end
      result
@@ -1553,10 +1647,10 @@ end
      s=tok.to_s
      case s
-     when /[^a-z_0-9]$/i; false
-#     when /^[a-z_]/; localvars===s or VARLIKE_KEYWORDS===s
-     when /^[A-Z_]/i; VarNameToken===tok
      when /^[@$<]/; true
+     when /(?!#@@LETTER_DIGIT).$/o; false
+#     when /^#@@LCLETTER/o; localvars===s or VARLIKE_KEYWORDS===s
+     when /^#@@LETTER/o; VarNameToken===tok
      else raise "not var or method name: #{s}"
      end
    end
@@ -1573,7 +1667,7 @@ end
          if ch==':'
            not TernaryContext===@parsestack.last
          else
-           !readahead(3)[/^\?[a-z0-9_]{2}/i]
+           !readahead(3)[/^\?#@@LETTER_DIGIT{2}/o]
          end
      }
    end
@@ -1603,21 +1697,25 @@ end
         @moretokens.push tok=KeywordToken.new(':',startpos)
         case @parsestack.last
-        when TernaryContext:
+        when TernaryContext
           tok.ternary=true
           @parsestack.pop #should be in the context's see handler
-        when ExpectDoOrNlContext: #should be in the context's see handler
-          @parsestack.pop
-          assert @parsestack.last.starter[/^(while|until|for)$/]
-          tok.as=";"
-        when ExpectThenOrNlContext,WhenParamListContext:
-          #should be in the context's see handler
-          @parsestack.pop
-          tok.as="then"
-        when RescueSMContext:
+        when ExpectDoOrNlContext #should be in the context's see handler
+          if @rubyversion<1.9
+            @parsestack.pop
+            assert @parsestack.last.starter[/^(while|until|for)$/]
+            tok.as=";"
+          end
+        when ExpectThenOrNlContext,WhenParamListContext
+          if @rubyversion<1.9
+            #should be in the context's see handler
+            @parsestack.pop
+            tok.as="then"
+          end
+        when RescueSMContext
           tok.as=";"
-        else fail ": not expected in #{@parsestack.last.class}->#{@parsestack.last.starter}"
-        end
+        end or
+          fail ": not expected in #{@parsestack.last.class}->#{@parsestack.last.starter}"
         #end ternary context, if any
         @parsestack.last.see self,:colon
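
The version checks in this hunk reflect a syntax change: Ruby 1.8 accepted a colon in place of `then` (after `if` or `when`) or `do` (after `while`, `until`, `for`), and Ruby 1.9 dropped that form. The same change appears to be why the lexer's own `when X:` clauses are rewritten with `;` or a bare newline elsewhere in this diff. A small illustration of the spellings that remain valid on 1.9 and later (the method name is made up for the example):

```ruby
def describe(value)
  case value
  when Integer then "integer"   # 1.8 also allowed: when Integer: "integer"
  when String; "string"         # semicolon separator, as used in the updated lexer source
  when Symbol
    "symbol"                    # newline separator
  else "other"
  end
end
```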
@@ -1631,7 +1729,7 @@ end
       lasttok=@last_operative_token
       assert !(String===lasttok)
       if (VarNameToken===lasttok or MethNameToken===lasttok) and
-          lasttok===/^[$@a-zA-Z_]/ and !WHSPCHARS[lastchar]
+          lasttok===/^(?:[$@]|#@@LETTER)/o and !WHSPCHARS[lastchar]
       then
          @moretokens << colon2
          result= NoWsToken.new(startpos)
@@ -1664,12 +1762,12 @@ end
          when ?` then read(1) #`
          when ?@ then at_identifier.to_s
          when ?$ then dollar_identifier.to_s
-         when ?_,?a..?z then identifier_as_string(?:)
+         when ?_,?a..?z,NONASCII then identifier_as_string(?:)
          when ?A..?Z then
            result=identifier_as_string(?:)
            if @last_operative_token==='::'
              assert klass==MethNameToken
-             /[A-Z_0-9]$/i===result and klass=VarNameToken
+             /#@@LETTER_DIGIT$/o===result and klass=VarNameToken
            end
            result
          else
@@ -1696,7 +1794,7 @@ end
      return [opmatches ? read(opmatches.size) :
        case nc=nextchar
          when ?` then read(1) #`
-         when ?_,?a..?z,?A..?Z then
+         when ?_,?a..?z,?A..?Z,NONASCII then
            context=merge_assignment_op_in_setter_callsites? ? ?: : nc
            identifier_as_string(context)
          else
@@ -1720,7 +1818,7 @@ end
         quote_real=true
       else
         quote='"'
-        ender=til_charset(/[^a-zA-Z0-9_]/)
+        ender=@file.scan(/#@@LETTER_DIGIT+/o)
         ender.length >= 1  or
           return lexerror(HerePlaceholderToken.new( dash, quote, ender, nil ), "invalid here header")
       end
@@ -1739,6 +1837,7 @@ if true
       nl=readnl or return lexerror(res, "here header without body (at eof)")
+      res.string.startline=linenum
       @moretokens<< res
       bodystart=input_position
       @offset_adjust = @min_offset_adjust+procrastinated.size
@@ -1748,6 +1847,8 @@ if true
       @offset_adjust = @min_offset_adjust
       #was: @offset_adjust -= procrastinated.size
       bodysize=input_position-bodystart
+      res.string.line=linenum-1
+      lexerror res,res.string.error
       #one or two already read characters are overwritten here,
       #in order to keep offsets correct in the long term
@@ -1814,7 +1915,7 @@ end
    #-----------------------------------
    def lessthan(ch) #match quadriop('<') or here doc or spaceship op
       case readahead(3)
-        when /^<<['"`\-a-z0-9_]$/i #'
+        when /^<<(?:['"`\-]|#@@LETTER_DIGIT)$/o #'
            if quote_expected?(ch) and not @last_operative_token==='class'
               here_header
            else
@@ -1901,7 +2002,11 @@ end
             if tofill.dash
               close+=til_charset(/[^#{WHSP}]/o)
             end
-            break if eof? #this is an error, should be handled better
+            if eof? #this is an error, should be handled better
+              lexerror tofill, "unterminated here body"
+              lexerror tofill.string, "unterminated here body"
+              break
+            end
             if read(tofill.ender.size)==tofill.ender
               crs=til_charset(/[^\r]/)||''
               if nl=readnl
@@ -1917,6 +2022,8 @@ end
               line=til_charset(/[\n]/)
               unless nl=readnl
                 assert eof?
+                lexerror tofill, "unterminated here body"
+                lexerror tofill.string, "unterminated here body"
                 break  #this is an error, should be handled better
               end
               line.chomp!("\r")
@@ -2118,7 +2225,7 @@ end
   #used to resolve the ambiguity of
   # unary ops (+, -, *, &, ~ !) in ruby
   #returns whether current token is to be the start of a literal
-  IDBEGINCHAR=/^[a-zA-Z_$@]/
+  IDBEGINCHAR=/^(?:#@@LETTER|[$@])/o
   def unary_op_expected?(ch) #yukko hack
     '*&='[readahead(2)[1..1]] and return false
@@ -2139,8 +2246,8 @@ end
    def quote_expected?(ch) #yukko hack
      case ch[0]
           when ?? then readahead(2)[/^\?[#{WHSPLF}]$/o] #not needed?
-          when ?% then readahead(3)[/^%([a-pt-vyzA-PR-VX-Z]|[QqrswWx][a-zA-Z0-9])/]
-          when ?< then !readahead(4)[/^<<-?['"`a-z0-9_]/i]
+          when ?% then readahead(3)[/^%([a-pt-vyzA-PR-VX-Z]|[QqrswWx]#{@@LETTER_DIGIT.gsub('_','')})/o]
+          when ?< then !readahead(4)[/^<<-?(?:['"`]|#@@LETTER_DIGIT)/o]
           else raise 'unexpected ch (#{ch}) in quote_expected?'
      #     when ?+,?-,?&,?*,?~,?! then '*&='[readahead(2)[1..1]]
      end and return false
@@ -2322,17 +2429,26 @@ end
       str << c
       result= operator_or_methname_token( str,offset)
       case c
-      when '=': #===,==
+      when '=' #===,==
         str<< (eat_next_if(?=)or'')
-      when '>': #=>
+      when '>' #=>
         unless ParamListContextNoParen===@parsestack.last
           @moretokens.unshift result
           @moretokens.unshift( *abort_noparens!("=>"))
           result=@moretokens.shift
         end
         @parsestack.last.see self,:arrow
-      when '': #plain assignment: record local variable definitions
+      when '~' # =~... after regex, maybe?
+        last=last_operative_token
+        if @rubyversion>=1.9 and StringToken===last and last.lvars
+          #ruby delays adding lvars from regexps to known lvars table
+          #for several tokens in some cases. not sure why or if on purpose
+          #i'm just going to add them right away
+          localvars.concat last.lvars
+        end
+      when '' #plain assignment: record local variable definitions
         last_context_not_implicit.lhs=false
         @moretokens.push( *ignored_tokens(true).map{|x|
           NewlineToken===x ? EscNlToken.new(@filename,@linenum,x.ident,x.offset) : x
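
The new `'~'` branch deals with a Ruby 1.9 feature: when a regex literal with named groups appears on the left of `=~`, each group name becomes a local variable. The diff registers those names (`last.lvars`) as soon as `=~` is seen. The underlying language behaviour looks like this in plain Ruby (the sample string is arbitrary):

```ruby
if /(?<version>\d+\.\d+\.\d+)\/(?<date>[\d-]+)/ =~ "=== 0.7.4/5-20-2009"
  # `version` and `date` are ordinary local variables from here on
  summary = "#{version} released #{date}"
end
```

A lexer that tracks local variables needs to know about these names, since a later bare `version` is a variable reference and not a method call.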
@@ -2340,7 +2456,7 @@ end
         @parsestack.push AssignmentRhsContext.new(@linenum)
         if eat_next_if ?*
           tok=OperatorToken.new('*', input_position-1)
-          tok.unary=true
+          tok.tag=:unary
           @moretokens.push tok
           WHSPLF[nextchar.chr] or
             @moretokens << NoWsToken.new(input_position)
@@ -2450,14 +2566,15 @@ end
         tokch.set_infix! unless after_nonid_op?{WHSPLF[lastchar]}
         @parsestack.push ListImmedContext.new(ch,@linenum)
         lasttok=last_operative_token
-        #could be: lasttok===/^[a-z_]/i
-        if (VarNameToken===lasttok or ImplicitParamListEndToken===lasttok or MethNameToken===lasttok) and !WHSPCHARS[lastchar]
+        #could be: lasttok===/^#@@LETTER/o
+        if (VarNameToken===lasttok or ImplicitParamListEndToken===lasttok or
+            MethNameToken===lasttok or lasttok===FUNCLIKE_KEYWORDS) and !WHSPCHARS[lastchar]
                @moretokens << (tokch)
                tokch= NoWsToken.new(input_position-1)
         end
       when '('
         lasttok=last_token_maybe_implicit #last_operative_token
-        #could be: lasttok===/^[a-z_]/i
+        #could be: lasttok===/^#@@LETTER/o
         if (VarNameToken===lasttok or MethNameToken===lasttok or
             lasttok===FUNCLIKE_KEYWORDS)
           unless WHSPCHARS[lastchar]
@@ -2466,7 +2583,17 @@ end
           end
           @parsestack.push ParamListContext.new(@linenum)
         else
-          @parsestack.push ParenContext.new(@linenum)
+          ctx=@parsestack.last
+          lasttok=last_operative_token
+          maybe_def=DefContext===ctx && !ctx.in_body &&
+            !(KeywordToken===lasttok && lasttok.ident=="def")
+          if maybe_def or
+             BlockParamListLhsContext===ctx or
+             ParenContext===ctx && ctx.lhs
+            @parsestack.push KnownNestedLhsParenContext.new(@linenum)
+          else
+            @parsestack.push ParenContext.new(@linenum)
+          end
         end
       when '{'
@@ -2574,13 +2701,14 @@ end
          @parsestack.pop
          @moretokens.unshift AssignmentRhsListEndToken.new(input_position)
     end
-    token.comma_type=
     case @parsestack[-1]
-    when AssignmentRhsContext; :rhs
-    when ParamListContext,ParamListContextNoParen; :call
-    when ListImmedContext; :array
+    when AssignmentRhsContext; token.tag=:rhs
+    when ParamListContext,ParamListContextNoParen; #:call
+    when ListImmedContext; #:array
+    when BlockParamListLhsContext; #:block
+    when KnownNestedLhsParenContext; #:nested
     else
-      :lhs if comma_in_lvalue_list?
+      token.tag=:lhs if comma_in_lvalue_list?
     end
     @parsestack.last.see self,:comma
     return @moretokens.shift
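
This hunk, together with the unary-star hunk earlier in the file, moves the old `unary` and `comma_type` fields onto the new `tag` field. Only `:lhs` and `:rhs` are recorded for commas; the other contexts are left untagged, with their former names kept as comments. The positions being distinguished correspond to these spots in ordinary Ruby (illustrative only; variable names are arbitrary):

```ruby
first, *rest = 1, 2, 3           # comma and star in a true left-hand side
combined = *rest, 4              # star and comma on the right-hand side
total = [first, *combined].sum   # star and comma inside an array literal
```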