tokn 0.0.4

checksums.yaml ADDED
---
SHA1:
  metadata.gz: d44494c850d61cd0ab5e3e588bbee398d85f7902
  data.tar.gz: f5b35f65f7fb8f0df3adbcd4ff6d5df483ab3ce4
SHA512:
  metadata.gz: 8432678eb42bcbacfa3db0c04b6f1cf728516e69da13fab1f79ab88002b15bb360cdd31ee76b40f148f2250b3dc78c4263037496c33611c482f1289fdf4998cd
  data.tar.gz: c581302f4b0e77840d2a6e4d9e30657387b913b5c4f7c24376b083576fc4fdd76c28bdff401927528094caab059252e214ccc1979d170f268e27917ec1442708
data/README.txt ADDED
'tokn' : A Ruby gem for constructing DFAs and using them to tokenize text files.

Written and (c) by Jeff Sember, March 2013.
================================================================================


Description of the problem
================================================================================

For a simple example, suppose a particular text file is designed to have
tokens of the following three types:

1) 'a' followed by any number of 'a' or 'b'
2) 'b' followed by either 'aa' or zero or more 'b'
3) 'bbb'

We will also allow an additional separator token, a single whitespace character.
These four token types can be written using regular expressions as:

  sep: \s
  tku: a(a|b)*
  tkv: b(aa|b*)
  tkw: bbb

We've given each token definition a name (to the left of the colon).
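As a sanity check, these definitions can be approximated with ordinary anchored Ruby regexes. This is illustrative only: the names TOKEN_REGEXES and token_types are hypothetical, and tokn itself compiles its own regex dialect into a DFA rather than using Ruby's regex engine.

```ruby
# Ruby approximations of the four token definitions above.
TOKEN_REGEXES = {
  "sep" => /\A\s\z/,         # a whitespace character
  "tku" => /\Aa[ab]*\z/,     # a(a|b)*
  "tkv" => /\Ab(?:aa|b*)\z/, # b(aa|b*)
  "tkw" => /\Abbb\z/,        # bbb
}

# Return the names of every definition that matches the whole string.
def token_types(s)
  TOKEN_REGEXES.select { |_, re| re.match?(s) }.keys
end
```

Note that a string such as "bbb" matches more than one definition; resolving such overlaps is exactly what the DFA's token identifiers (described below) are for.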

Now suppose your program needs to read a text file and interpret the tokens it
finds there. This can be done using the DFA (deterministic finite state automaton)
shown in figures/sample_dfa.pdf. The token extraction algorithm is as follows:

1) Begin at the start state, S0.
2) Look at the next character in the source (text) file. If there is an arrow
   (edge) labelled with that character, follow it to another state (it may lead
   back to the same state; that's okay), and advance the cursor to the next
   character in the source file.
3) If there's an arrow labelled with a negative number N, don't follow the edge,
   but instead remember the lowest (i.e., most negative) such N found.
4) Repeat steps 2 and 3 until no further progress is possible.
5) At this point, N indicates the name of the token found. The cursor should be
   restored to the point it was at when that N was recorded. The token's text
   consists of the characters from the starting cursor position to that point.
6) If no N value was recorded, then the source text doesn't match any of the
   tokens, which is considered an error.
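The steps above can be sketched in a few lines of Ruby. Everything here is hypothetical and independent of tokn's actual implementation: the DFA is hand-built for two simple token types (letters and digits) plus a space separator, each state is a plain Hash of edges, and a state's :token entry plays the role of the negative-numbered edge of step 3. (Since each state here carries at most one token id, the "most negative N" tie-breaking never comes into play.)

```ruby
# Hypothetical hand-built DFA: states are Hashes whose character keys
# lead to other states; a state's :token entry is the (negative) token
# id recorded on arrival (the "negative edge" of step 3).
TOKEN_NAMES = { -2 => "id", -3 => "int", -4 => "sep" }

def build_demo_dfa
  id_state  = { token: -2 }   # one or more letters
  int_state = { token: -3 }   # one or more digits
  sep_state = { token: -4 }   # a single space
  start = { " " => sep_state }
  ("a".."z").each { |c| start[c] = id_state;  id_state[c]  = id_state }
  ("0".."9").each { |c| start[c] = int_state; int_state[c] = int_state }
  start
end

# Steps 1-6: follow edges while possible, remember where a token id was
# last recorded, then restore the cursor to that point (step 5).
def extract_tokens(dfa, text)
  tokens = []
  pos = 0
  while pos < text.length
    state, cursor = dfa, pos
    best_id, best_end = nil, pos
    while cursor < text.length && (nxt = state[text[cursor]])
      state = nxt
      cursor += 1
      if state[:token]
        best_id, best_end = state[:token], cursor
      end
    end
    raise "no token matches at offset #{pos}" unless best_id   # step 6
    tokens << [TOKEN_NAMES[best_id], text[pos...best_end]]
    pos = best_end                                             # restore cursor
  end
  tokens
end
```

For example, extract_tokens(build_demo_dfa, "speed 42") yields [["id", "speed"], ["sep", " "], ["int", "42"]].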

The tokn module provides a simple and efficient way to perform this tokenization
process. Its main contribution is not just performing the six steps above: it can
also construct, from a set of token definitions, the DFA those steps require.
Such DFAs are very useful, and can be used by non-Ruby programs as well.


Using the tokn module in a Ruby program
===================================================================================

There are three object classes of interest: DFA, Tokenizer, and Token. A DFA is
compiled once from a script containing token definitions (e.g., "tkv: b(aa|b*) ..."),
and can then be stored (either in memory, or on disk as a JSON string) for later
use.

When tokens need to be extracted from a source file (or a simple string), a
Tokenizer is constructed. It requires both the DFA and the source text as input.
Once this is done, individual Token objects can be read from the Tokenizer.

Here's some example Ruby code showing how a text file "source.txt" can be split
into tokens. We'll assume there's a text file "tokendefs.txt" that contains the
definitions shown earlier.

  require 'tokn'

  dfa = dfa_from_script(readTextFile("tokendefs.txt"))
  t = Tokenizer.new(dfa, readTextFile("source.txt"))

  while t.hasNext
    k = t.read                      # read next token
    next if t.typeOf(k) == "sep"    # skip 'whitespace'
    # ... do something with the token ...
  end

If, later, another file needs to be tokenized, a new Tokenizer object can be
constructed and given the same dfa object as before.


Using the tokn command line utilities
===================================================================================

The module has two utility scripts, tokncompile and toknprocess, which can be
found in the bin/ directory.

The tokncompile script reads a token definition script from standard input and
compiles it to a DFA. For example, from the tokn directory, you can type:

  tokncompile < sampletokens.txt > compileddfa.txt

It will produce the JSON encoding of the appropriate DFA. For a description of
how this JSON string represents the DFA, see Dfa.rb.

The toknprocess script takes two arguments: the name of a file containing a
previously compiled DFA, and the name of a source file. It extracts the sequence
of tokens from the source file to the standard output:

  toknprocess compileddfa.txt sampletext.txt

This will produce the following output:

WS 1 1 // Example source file that can be tokenized

WS 2 1

ID 3 1 speed
WS 3 6
ASSIGN 3 7 =
WS 3 8
INT 3 9 42
WS 3 11
WS 3 14 // speed of object

WS 4 1

ID 5 1 gravity
WS 5 8
ASSIGN 5 9 =
WS 5 10
DBL 5 11 -9.80
WS 5 16


ID 7 1 title
WS 7 6
ASSIGN 7 7 =
WS 7 8
LBL 7 9 'This is a string with \' an escaped delimiter'
WS 7 56


IF 9 1 if
WS 9 3
ID 9 4 gravity
WS 9 11
EQUIV 9 12 ==
WS 9 14
INT 9 15 12
WS 9 17
BROP 9 18 {
WS 9 19

DO 10 3 do
WS 10 5
ID 10 6 something
WS 10 15

BRCL 11 1 }
WS 11 2

The extra linefeeds are the result of a token containing a linefeed.


FAQ
===================================================================================

1) Why can't I just use Ruby's regular expressions for tokenizing text?

You could construct a regular expression describing each possible token, and use
that to extract a token from the start of a string; you could then remove that
token from the string, and repeat. The trouble is that a combined regular
expression has no easy way to indicate which individual token's expression was
matched. You would then (presumably) have to match the returned token against
each individual regular expression to identify the token type.
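That naive approach can be sketched as follows, using anchored Ruby regexes for the four definitions from the earlier example (the names NAIVE_DEFS and naive_tokenize are hypothetical). Notice that every definition must be retried against the remaining input on every iteration, which is the redundancy described above:

```ruby
# Naive regex tokenizer: at each step, try every definition at the start
# of the remaining input and keep the longest match.
NAIVE_DEFS = {
  "sep" => /\A\s/,
  "tku" => /\Aa[ab]*/,
  "tkv" => /\Ab(?:aa|b*)/,
  "tkw" => /\Abbb/,
}

def naive_tokenize(text)
  tokens = []
  until text.empty?
    name, m = NAIVE_DEFS
      .map { |n, re| [n, re.match(text)] }
      .reject { |_, match| match.nil? }
      .max_by { |_, match| match[0].length }
    raise "no token matches: #{text.inspect}" if name.nil?
    tokens << [name, m[0]]
    text = m.post_match
  end
  tokens
end
```

For instance, naive_tokenize("abba b") yields [["tku", "abba"], ["sep", " "], ["tkv", "b"]].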

Another reason why standard regular expressions can be troublesome is that their
implementations actually 'recognize' a richer class of tokens than the ones
described here. This extra power can come at a cost: in some pathological cases,
the running time can become exponential.

2) Is tokn compatible with Unicode?

The tokn tool is capable of extracting tokens made up of characters with codes
in the entire Unicode range: 0 through 0x10ffff (hex). In fact, the labels on
the DFA edges can be viewed as sets of arbitrary nonnegative integers (negative
values are reserved for the token identifiers). Note, however, that the current
implementation only reads Ruby characters from the input, which I believe are
only 8 bits wide.

3) What do I do if I have some ideas for enhancing tokn, or want to point out
some problems with it?

Well, I can be reached as jpsember at gmail dot com.
data/bin/tokncompile ADDED
#!/usr/local/bin/ruby

# Compile a DFA from a token definition script,
# then serialize that DFA to stdout.
#
# Example usage (for Unix):
#
#   tokncompile < sampletokens.txt > dfa.txt
#

require 'tokn'

puts dfa_from_script(ARGF.read).serialize()
data/bin/toknprocess ADDED
#!/usr/local/bin/ruby

# Given a compiled DFA file and a source file,
# extract all tokens from the source file.
#
# Example usage (for Unix); assumes tokncompile
# has been run beforehand:
#
#   toknprocess dfa.txt sampletext.txt
#

require 'tokn'

if ARGV.size != 2
  puts "Usage: toknprocess <dfa file> <source file>"
  abort
end

dfa = dfa_from_file(ARGV[0])
tk = Tokenizer.new(dfa, readTextFile(ARGV[1]))

while tk.hasNext()
  t = tk.read
  printf("%s %d %d %s\n", tk.nameOf(t), t.lineNumber, t.column, t.text)
end
Binary file
require_relative 'tools'

req('tokn_const')

# A CodeSet is an ordered set of character or token codes that
# are used as labels on DFA edges.
#
# In addition to unicode character codes 0...0x10ffff, they
# also represent epsilon transitions (-1), or token identifiers ( < -1).
#
# Each CodeSet is represented as an array with 2n elements;
# each pair represents a closed lower and open upper range of values.
#
# Thus a value x is within the set [a1,a2,b1,b2,..]
# iff (a1 <= x < a2) or (b1 <= x < b2) or ...
#
class CodeSet

  include Tokn

  # Construct a copy of this set
  #
  def makeCopy
    c = CodeSet.new
    c.setTo(self)
    c
  end

  # Initialize set; optionally add an initial contiguous range
  #
  def initialize(lower = nil, upper = nil)
    @elem = []
    if lower
      add(lower, upper)
    end
  end

  # Replace this set with a copy of another
  #
  def setTo(otherSet)
    @elem.replace(otherSet.array)
  end

  # Get the array containing the code set range pairs
  #
  def array
    return @elem
  end

  # Replace this set's array
  # @param a array to point to (does not make a copy of it)
  #
  def setArray(a)
    @elem = a
  end

  def hash
    return @elem.hash
  end

  # Determine if this set is equivalent to another
  #
  def eql?(other)
    @elem == other.array
  end

  # Add a contiguous range of values to the set
  # @param lower min value in range
  # @param upper one plus max value in range
  #
  def add(lower, upper = nil)
    if not upper
      upper = lower + 1
    end

    if lower >= upper
      raise RangeError
    end

    newSet = []
    i = 0
    while i < @elem.size and @elem[i] < lower
      newSet.push(@elem[i])
      i += 1
    end

    if (i & 1) == 0
      newSet.push(lower)
    end

    while i < @elem.size and @elem[i] <= upper
      i += 1
    end

    if (i & 1) == 0
      newSet.push(upper)
    end

    while i < @elem.size
      newSet.push(@elem[i])
      i += 1
    end

    @elem = newSet
  end

  # Remove a contiguous range of values from the set
  # @param lower min value in range
  # @param upper one plus max value in range
  #
  def remove(lower, upper = nil)
    if not upper
      upper = lower + 1
    end

    if lower >= upper
      raise RangeError
    end

    newSet = []
    i = 0
    while i < @elem.size and @elem[i] < lower
      newSet.push(@elem[i])
      i += 1
    end

    if (i & 1) == 1
      newSet.push(lower)
    end

    while i < @elem.size and @elem[i] <= upper
      i += 1
    end

    if (i & 1) == 1
      newSet.push(upper)
    end

    while i < @elem.size
      newSet.push(@elem[i])
      i += 1
    end

    setArray(newSet)
  end

  # Replace this set with itself minus another
  #
  def difference!(s)
    setTo(difference(s))
  end

  # Calculate difference of this set minus another
  def difference(s)
    combineWith(s, 'd')
  end

  # Calculate the intersection of this set and another
  def intersect(s)
    combineWith(s, 'i')
  end

  # Set this set equal to its intersection with another
  def intersect!(s)
    setTo(intersect(s))
  end

  # Add every value from another CodeSet to this one
  def addSet(s)
    sa = s.array
    (0...sa.length).step(2) { |i| add(sa[i], sa[i + 1]) }
  end

  # Determine if this set contains a particular value
  def contains?(val)
    ret = false
    i = 0
    while i < @elem.size
      if val < @elem[i]
        break
      end
      if val < @elem[i + 1]
        ret = true
        break
      end
      i += 2
    end
    ret
  end

  # Get string representation of set, treating values (where
  # possible) as printable ASCII characters
  #
  def to_s
    s = ''
    i = 0
    while i < @elem.size
      if s.size > 0
        s += ' '
      end
      lower = @elem[i]
      upper = @elem[i + 1]
      s += dbStr(lower)
      if upper != 1 + lower
        s += '..' + dbStr(upper - 1)
      end
      i += 2
    end
    return s
  end

  def inspect
    to_s
  end

  # Get string representation of set, treating values
  # as integers
  #
  def to_s_alt
    s = ''
    i = 0
    while i < @elem.size
      if s.length > 0
        s += ' '
      end
      low = @elem[i]
      upr = @elem[i + 1]
      s += low.to_s
      if upr > low + 1
        s += '..'
        s += (upr - 1).to_s
      end
      i += 2
    end
    return s
  end

  # Negate the inclusion of a contiguous range of values
  #
  # @param lower min value in range
  # @param upper one plus max value in range
  #
  def negate(lower = 0, upper = CODEMAX)
    if lower >= upper
      raise RangeError
    end

    newSet = []
    i = 0
    while i < @elem.size and @elem[i] <= lower
      newSet.push(@elem[i])
      i += 1
    end

    if i > 0 and newSet[i - 1] == lower
      newSet.pop
    else
      newSet.push(lower)
    end

    while i < @elem.size and @elem[i] <= upper
      newSet.push(@elem[i])
      i += 1
    end

    if newSet.length > 0 and newSet.last == upper
      newSet.pop
    else
      newSet.push(upper)
    end

    while i < @elem.size
      newSet.push(@elem[i])
      i += 1
    end

    @elem = newSet
  end

  # Determine how many distinct values are represented by this set
  def cardinality
    c = 0
    i = 0
    while i < @elem.length
      c += @elem[i + 1] - @elem[i]
      i += 2
    end
    c
  end

  # Determine if this set is empty
  #
  def empty?
    @elem.empty?
  end

  private

  # Get a debug description of a value within a CodeSet, suitable
  # for including within a .dot label
  #
  def dbStr(charCode)
    # Unless it corresponds to a non-confusing printable ASCII value,
    # just print its decimal equivalent
    s = charCode.to_s
    if charCode == EPSILON
      s = "(e)"
    elsif (charCode > 32 && charCode < 0x7f && !"'\"\\[]{}()".index(charCode.chr))
      s = charCode.chr
    end
    return s
  end

  # Combine this set (a) with another (b) according to a particular operation
  # > s     other set (b)
  # > oper  'i': intersection, a^b
  #         'd': difference, a-b
  #
  def combineWith(s, oper)
    sa = array
    sb = s.array

    i = 0
    j = 0
    c = []

    wasInside = false

    while i < sa.length || j < sb.length

      if i == sa.length
        v = sb[j]
      elsif j == sb.length
        v = sa[i]
      else
        v = [sa[i], sb[j]].min
      end

      if i < sa.length && v == sa[i]
        i += 1
      end
      if j < sb.length && v == sb[j]
        j += 1
      end

      case oper
      when 'i'
        inside = ((i & 1) == 1) && ((j & 1) == 1)
      when 'd'
        inside = ((i & 1) == 1) && ((j & 1) == 0)
      else
        raise ArgumentError, "unsupported operation: #{oper}"
      end

      if inside != wasInside
        c.push v
        wasInside = inside
      end
    end

    ret = CodeSet.new
    ret.setArray(c)
    ret
  end

end
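The pair-array representation described in the class comment can be tried out standalone. A minimal sketch, assuming nothing from the class above (ranges_contain? is a hypothetical helper, not part of tokn):

```ruby
# Standalone illustration of the CodeSet representation: a flat array
# [a1,a2, b1,b2, ...] of closed-lower/open-upper range pairs.
# A value x is in the set iff some pair satisfies a <= x < b.
def ranges_contain?(elem, val)
  elem.each_slice(2).any? { |lo, hi| lo <= val && val < hi }
end
```

For example, with elem = [97, 99, 120, 121] (the codes for 'a', 'b', and 'x'), the values 97, 98, and 120 are members, while the open upper bounds 99 and 121 are not.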