tokn 0.0.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: d44494c850d61cd0ab5e3e588bbee398d85f7902
+   data.tar.gz: f5b35f65f7fb8f0df3adbcd4ff6d5df483ab3ce4
+ SHA512:
+   metadata.gz: 8432678eb42bcbacfa3db0c04b6f1cf728516e69da13fab1f79ab88002b15bb360cdd31ee76b40f148f2250b3dc78c4263037496c33611c482f1289fdf4998cd
+   data.tar.gz: c581302f4b0e77840d2a6e4d9e30657387b913b5c4f7c24376b083576fc4fdd76c28bdff401927528094caab059252e214ccc1979d170f268e27917ec1442708
data/README.txt ADDED
@@ -0,0 +1,194 @@
+ 'tokn' : A Ruby gem for constructing DFAs and using them to tokenize text files.
+
+ Written and (c) by Jeff Sember, March 2013.
+ ================================================================================
+
+
+ Description of the problem
+ ================================================================================
+
+ For a simple example, suppose a particular text file is designed to have
+ tokens of the following three types:
+
+ 1) 'a' followed by any number of 'a' or 'b'
+ 2) 'b' followed by either 'aa' or zero or more 'b'
+ 3) 'bbb'
+
+ We will also allow an additional token, one or more spaces, to separate them.
+ These four token types can be written using regular expressions as:
+
+   sep: \s
+   tku: a(a|b)*
+   tkv: b(aa|b*)
+   tkw: bbb
+
+ We've given each token definition a name (to the left of the colon).
+
+ Now suppose your program needs to read a text file and interpret the tokens it
+ finds there. This can be done using the DFA (deterministic finite state automaton)
+ shown in figures/sample_dfa.pdf. The token extraction algorithm is as follows:
+
+ 1) Begin at the start state, S0.
+ 2) Look at the next character in the source (text) file. If there is an arrow (edge)
+    labelled with that character, follow it to another state (it may lead to the
+    same state; that's okay), and advance the cursor to the next character in
+    the source file.
+ 3) If there's an arrow labelled with a negative number N, don't follow the edge,
+    but instead remember the lowest (i.e., most negative) such N found.
+ 4) Repeat steps 2 and 3 until no further progress is possible.
+ 5) At this point, N indicates the name of the token found. The cursor should be
+    restored to the point it was at when that N was recorded. The token's text
+    consists of the characters from the starting cursor position to that point.
+ 6) If no N value was recorded, then the source text doesn't match any of the tokens,
+    which is considered an error.
+
+
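The six steps above can be sketched in plain Ruby over a toy transition table. This is an illustration only, not the gem's own DFA representation: the state names, edges, `accept` map, and token ids below are invented for the example.

```ruby
# Walk the DFA from the cursor, remembering the last point at which a
# (negative) token id was recorded; that gives the longest match.
def extract_token(dfa, text, cursor)
  state = dfa[:start]
  best  = nil                      # last [token_id, end_pos] recorded
  pos   = cursor
  loop do
    id = dfa[:accept][state]       # negative token id, or nil
    best = [id, pos] if id         # remember where this id was recorded
    break if pos >= text.length
    nxt = dfa[:edges][state][text[pos]]
    break unless nxt               # no labelled edge: no further progress
    state = nxt
    pos += 1
  end
  raise 'source text matches no token' unless best
  id, end_pos = best               # restore cursor to the recorded point
  [id, text[cursor...end_pos]]
end

# Toy DFA: token -2 is 'a' followed by any number of 'a' or 'b';
# token -1 is one or more spaces.
dfa = {
  start:  :s0,
  accept: { s1: -2, s2: -1 },
  edges:  {
    s0: { 'a' => :s1, ' ' => :s2 },
    s1: { 'a' => :s1, 'b' => :s1 },
    s2: { ' ' => :s2 }
  }
}

extract_token(dfa, 'abb ab', 0)    # => [-2, "abb"]
```

In this toy table each state carries at most one token id, so the "most negative N" tie-break from step 3 never comes into play; the real DFA uses it when several token definitions accept at the same state.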
+ The tokn module provides a simple and efficient way to perform this tokenization
+ process. Its main contribution is not the six steps themselves, but the ability to
+ construct, from a set of token definitions, the DFA those steps require.
+ Such DFAs are very useful, and can be used by non-Ruby programs as well.
+
+
+ Using the tokn module in a Ruby program
+ ===================================================================================
+
+ There are three object classes of interest: DFA, Tokenizer, and Token. A DFA is
+ compiled once from a script containing token definitions (e.g., "tku: b(aa|b*) ..."),
+ and can then be stored (either in memory, or on disk as a JSON string) for later use.
+
+ When tokens need to be extracted from a source file (or a simple string), a Tokenizer
+ is constructed. It requires both the DFA and the source file as input. Once this is
+ done, individual Token objects can be read from the Tokenizer.
+
+ Here's some example Ruby code showing how a text file "source.txt" can be split into
+ tokens. We'll assume there's a text file "tokendefs.txt" that contains the
+ definitions shown earlier.
+
+   require 'tokn'
+
+   dfa = dfa_from_script(readTextFile("tokendefs.txt"))
+
+   t = Tokenizer.new(dfa, readTextFile("source.txt"))
+
+   while t.hasNext
+     k = t.read                # read token
+     if t.typeOf(k) == "sep"   # skip 'whitespace'
+       next
+     end
+     # ... do something with the token ...
+   end
+
+ If later, another file needs to be tokenized, a new Tokenizer object can be
+ constructed and given the same dfa object as earlier.
+
+
+ Using the tokn command line utilities
+ ===================================================================================
+
+ The module has two utility scripts: tokncompile and toknprocess. These can be
+ found in the bin/ directory.
+
+ The tokncompile script reads a token definition script from standard input, and
+ compiles it to a DFA. For example, if you are in the tokn directory, you can
+ type:
+
+   tokncompile < sampletokens.txt > compileddfa.txt
+
+ It will produce the JSON encoding of the appropriate DFA. For a description of how
+ this JSON string represents the DFA, see Dfa.rb.
+
+ The toknprocess script takes two arguments: the name of a file containing a
+ previously compiled DFA, and the name of a source file. It extracts the sequence
+ of tokens from the source file and writes it to standard output:
+
+   toknprocess compileddfa.txt sampletext.txt
+
+ This will produce the following output:
+
+   WS 1 1 // Example source file that can be tokenized
+
+   WS 2 1
+
+   ID 3 1 speed
+   WS 3 6
+   ASSIGN 3 7 =
+   WS 3 8
+   INT 3 9 42
+   WS 3 11
+   WS 3 14 // speed of object
+
+   WS 4 1
+
+   ID 5 1 gravity
+   WS 5 8
+   ASSIGN 5 9 =
+   WS 5 10
+   DBL 5 11 -9.80
+   WS 5 16
+
+
+   ID 7 1 title
+   WS 7 6
+   ASSIGN 7 7 =
+   WS 7 8
+   LBL 7 9 'This is a string with \' an escaped delimiter'
+   WS 7 56
+
+
+   IF 9 1 if
+   WS 9 3
+   ID 9 4 gravity
+   WS 9 11
+   EQUIV 9 12 ==
+   WS 9 14
+   INT 9 15 12
+   WS 9 17
+   BROP 9 18 {
+   WS 9 19
+
+   DO 10 3 do
+   WS 10 5
+   ID 10 6 something
+   WS 10 15
+
+   BRCL 11 1 }
+   WS 11 2
+
+ The extra linefeeds are the result of a token containing a linefeed.
+
+
+ FAQ
+ ===================================================================================
+
+ 1) Why can't I just use Ruby's regular expressions for tokenizing text?
+
+ You could construct a regular expression describing each possible token, and use that
+ to extract a token from the start of a string; you could then remove that token from the
+ string, and repeat. The trouble is that the regular expression has no easy way to indicate
+ which individual token's expression was matched. You would then (presumably) have to match
+ the returned token against each individual regular expression to identify the token type.
+
+ Another reason standard regular expressions can be troublesome is that their
+ implementations actually 'recognize' a richer class of tokens than the ones described
+ here. This extra power can come at a cost: in some pathological cases, the running time
+ can become exponential.
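For concreteness, the try-each-expression approach described above can be sketched in plain Ruby (standard library only; the anchored regexes encode the four example tokens from earlier). Every token's regex is tried at the cursor and the longest match wins, which avoids the second identification pass at the cost of running every expression at every position:

```ruby
# One anchored regex per token type, mirroring the earlier definitions.
TOKENS = {
  sep: /\A\s+/,       # one or more spaces
  tku: /\Aa[ab]*/,    # a (a|b)*
  tkv: /\Ab(aa|b*)/,  # b (aa | b*)
  tkw: /\Abbb/        # bbb
}

# Try each token's regex at the start of the text; keep the longest match
# so the token type is known along with its text.
def next_token(text)
  best = nil
  TOKENS.each do |name, re|
    m = re.match(text)
    best = [name, m[0]] if m && (best.nil? || m[0].size > best[1].size)
  end
  best   # [token name, token text], or nil if nothing matches
end

next_token('baab')   # => [:tkv, "baa"]
```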
+
+ 2) Is tokn compatible with Unicode?
+
+ The tokn tool is capable of extracting tokens made up of characters with
+ codes anywhere in the Unicode range: 0 through 0x10ffff (hex). In fact, the labels
+ on the DFA edges can be viewed as sets of arbitrary nonnegative integers (negative
+ values are reserved for the token identifiers). Note, however, that the current
+ implementation only reads Ruby characters from the input, which I believe are only
+ 8 bits wide.
+
+ 3) What do I do if I have some ideas for enhancing tokn, or want to point out some
+ problems with it?
+
+ Well, I can be reached as jpsember at gmail dot com.
+
+
data/bin/tokncompile ADDED
@@ -0,0 +1,16 @@
+ #!/usr/local/bin/ruby
+
+ # Compile a DFA from a token definition script,
+ # then serialize that DFA to stdout
+ #
+ # Example usage (for Unix):
+ #
+ #   tokncompile < sampletokens.txt > dfa.txt
+ #
+
+ require 'tokn'
+
+ puts dfa_from_script(ARGF.read).serialize()
data/bin/toknprocess ADDED
@@ -0,0 +1,26 @@
+ #!/usr/local/bin/ruby
+
+ # Given a compiled DFA file and a source file,
+ # extract all tokens from the source file.
+ #
+ # Example usage (for Unix); assumes tokncompile
+ # has been run beforehand:
+ #
+ #   toknprocess dfa.txt sampletext.txt
+ #
+
+ require 'tokn'
+
+ if ARGV.size != 2
+   puts "Usage: toknprocess <dfa file> <source file>"
+   abort
+ end
+
+ dfa = dfa_from_file(ARGV[0])
+ tk = Tokenizer.new(dfa, readTextFile(ARGV[1]))
+
+ while tk.hasNext()
+   t = tk.read
+   printf("%s %d %d %s\n", tk.nameOf(t), t.lineNumber, t.column, t.text)
+ end
Binary file
@@ -0,0 +1,392 @@
+ require_relative 'tools'
+
+ req('tokn_const')
+
+
+ # A CodeSet is an ordered set of character or token codes that
+ # are used as labels on DFA edges.
+ #
+ # In addition to unicode character codes 0...0x10ffff, they
+ # also represent epsilon transitions (-1), or token identifiers ( < -1).
+ #
+ # Each CodeSet is represented as an array with 2n elements;
+ # each pair represents a closed lower and open upper range of values.
+ #
+ # Thus a value x is within the set [a1,a2,b1,b2,..]
+ # iff (a1 <= x < a2) or (b1 <= x < b2) or ...
+ #
+ class CodeSet
+
+   include Tokn
+
+   # Construct a copy of this set
+   #
+   def makeCopy
+     c = CodeSet.new
+     c.setTo(self)
+     c
+   end
+
+   # Initialize set; optionally add an initial contiguous range
+   #
+   def initialize(lower = nil, upper = nil)
+     @elem = []
+     if lower
+       add(lower, upper)
+     end
+   end
+
+   # Replace this set with a copy of another
+   #
+   def setTo(otherSet)
+     @elem.replace(otherSet.array)
+   end
+
+   # Get the array containing the code set range pairs
+   #
+   def array
+     return @elem
+   end
+
+   # Replace this set's array
+   # @param a array to point to (does not make a copy of it)
+   #
+   def setArray(a)
+     @elem = a
+   end
+
+   def hash
+     return @elem.hash
+   end
+
+   # Determine if this set is equivalent to another
+   #
+   def eql?(other)
+     @elem == other.array
+   end
+
+   # Add a contiguous range of values to the set
+   # @param lower min value in range
+   # @param upper one plus max value in range
+   #
+   def add(lower, upper = nil)
+     if not upper
+       upper = lower + 1
+     end
+
+     if lower >= upper
+       raise RangeError
+     end
+
+     newSet = []
+     i = 0
+     while i < @elem.size and @elem[i] < lower
+       newSet.push(@elem[i])
+       i += 1
+     end
+
+     if (i & 1) == 0
+       newSet.push(lower)
+     end
+
+     while i < @elem.size and @elem[i] <= upper
+       i += 1
+     end
+
+     if (i & 1) == 0
+       newSet.push(upper)
+     end
+
+     while i < @elem.size
+       newSet.push(@elem[i])
+       i += 1
+     end
+
+     @elem = newSet
+   end
+
+   # Remove a contiguous range of values from the set
+   # @param lower min value in range
+   # @param upper one plus max value in range
+   #
+   def remove(lower, upper = nil)
+     if not upper
+       upper = lower + 1
+     end
+
+     if lower >= upper
+       raise RangeError
+     end
+
+     newSet = []
+     i = 0
+     while i < @elem.size and @elem[i] < lower
+       newSet.push(@elem[i])
+       i += 1
+     end
+
+     if (i & 1) == 1
+       newSet.push(lower)
+     end
+
+     while i < @elem.size and @elem[i] <= upper
+       i += 1
+     end
+
+     if (i & 1) == 1
+       newSet.push(upper)
+     end
+
+     while i < @elem.size
+       newSet.push(@elem[i])
+       i += 1
+     end
+
+     setArray(newSet)
+   end
+
+   # Replace this set with itself minus another
+   #
+   def difference!(s)
+     setTo(difference(s))
+   end
+
+   # Calculate difference of this set minus another
+   def difference(s)
+     combineWith(s, 'd')
+   end
+
+   # Calculate the intersection of this set and another
+   def intersect(s)
+     combineWith(s, 'i')
+   end
+
+   # Set this set equal to its intersection with another
+   def intersect!(s)
+     setTo(intersect(s))
+   end
+
+   # Add every value from another CodeSet to this one
+   def addSet(s)
+     sa = s.array
+     (0...sa.length).step(2) { |i| add(sa[i], sa[i+1]) }
+   end
+
+   # Determine if this set contains a particular value
+   def contains?(val)
+     ret = false
+     i = 0
+     while i < @elem.size
+       if val < @elem[i]
+         break
+       end
+       if val < @elem[i+1]
+         ret = true
+         break
+       end
+       i += 2
+     end
+     ret
+   end
+
+   # Get string representation of set, treating values (where
+   # possible) as printable ASCII characters
+   #
+   def to_s
+     s = ''
+     i = 0
+     while i < @elem.size
+       if s.size > 0
+         s += ' '
+       end
+
+       lower = @elem[i]
+       upper = @elem[i+1]
+       s += dbStr(lower)
+       if upper != 1+lower
+         s += '..' + dbStr(upper-1)
+       end
+       i += 2
+     end
+     return s
+   end
+
+   def inspect
+     to_s
+   end
+
+   # Get string representation of set, treating values
+   # as integers
+   #
+   def to_s_alt
+     s = ''
+     i = 0
+     while i < @elem.size
+       if s.length > 0
+         s += ' '
+       end
+       low = @elem[i]
+       upr = @elem[i+1]
+       s += low.to_s
+       if upr > low+1
+         s += '..'
+         s += (upr-1).to_s
+       end
+       i += 2
+     end
+     return s
+   end
+
+   # Negate the inclusion of a contiguous range of values
+   #
+   # @param lower min value in range
+   # @param upper one plus max value in range
+   #
+   def negate(lower = 0, upper = CODEMAX)
+     if lower >= upper
+       raise RangeError
+     end
+
+     newSet = []
+     i = 0
+     while i < @elem.size and @elem[i] <= lower
+       newSet.push(@elem[i])
+       i += 1
+     end
+
+     if i > 0 and newSet[i-1] == lower
+       newSet.pop
+     else
+       newSet.push(lower)
+     end
+
+     while i < @elem.size and @elem[i] <= upper
+       newSet.push(@elem[i])
+       i += 1
+     end
+
+     if newSet.length > 0 and newSet.last == upper
+       newSet.pop
+     else
+       newSet.push(upper)
+     end
+
+     while i < @elem.size
+       newSet.push(@elem[i])
+       i += 1
+     end
+
+     @elem = newSet
+   end
+
+   # Determine how many distinct values are represented by this set
+   def cardinality
+     c = 0
+     i = 0
+     while i < @elem.length
+       c += @elem[i+1] - @elem[i]
+       i += 2
+     end
+     c
+   end
+
+   # Determine if this set is empty
+   #
+   def empty?
+     @elem.empty?
+   end
+
+   private
+
+   # Get a debug description of a value within a CodeSet, suitable
+   # for including within a .dot label
+   #
+   def dbStr(charCode)
+     # Unless it corresponds to a non-confusing printable ASCII value,
+     # just print its decimal equivalent
+     s = charCode.to_s
+
+     if charCode == EPSILON
+       s = "(e)"
+     elsif charCode > 32 && charCode < 0x7f && !"'\"\\[]{}()".index(charCode.chr)
+       s = charCode.chr
+     end
+     return s
+   end
+
+   # Combine this range set (a) with another (b) according to a particular operation
+   # > s     other range set (b)
+   # > oper  'i': intersection, a^b
+   #         'd': difference, a-b
+   #
+   def combineWith(s, oper)
+     sa = array
+     sb = s.array
+
+     i = 0
+     j = 0
+     c = []
+
+     wasInside = false
+
+     while i < sa.length || j < sb.length
+
+       if i == sa.length
+         v = sb[j]
+       elsif j == sb.length
+         v = sa[i]
+       else
+         v = [sa[i], sb[j]].min
+       end
+
+       if i < sa.length && v == sa[i]
+         i += 1
+       end
+       if j < sb.length && v == sb[j]
+         j += 1
+       end
+
+       case oper
+       when 'i'
+         inside = ((i & 1) == 1) && ((j & 1) == 1)
+       when 'd'
+         inside = ((i & 1) == 1) && ((j & 1) == 0)
+       else
+         raise ArgumentError, "unsupported operation: #{oper}"
+       end
+
+       if inside != wasInside
+         c.push v
+         wasInside = inside
+       end
+     end
+
+     ret = CodeSet.new
+     ret.setArray(c)
+     ret
+   end
+
+ end
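As a standalone footnote to the CodeSet class comment above, the range-pair membership rule can be demonstrated in a few lines of plain Ruby. This helper is illustrative only and not part of the gem:

```ruby
# A value x belongs to [a1,a2,b1,b2,...] iff some pair satisfies lo <= x < hi,
# mirroring the class comment (lower bound closed, upper bound open).
def in_code_set?(elem, x)
  elem.each_slice(2).any? { |lo, hi| lo <= x && x < hi }
end

pairs = [97, 100, 120, 121]    # covers 'a'..'c' (97..99) and 'x' (120)
in_code_set?(pairs, 'b'.ord)   # => true
in_code_set?(pairs, 'd'.ord)   # => false
```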