tokn 0.0.4

checksums.yaml ADDED
---
SHA1:
  metadata.gz: d44494c850d61cd0ab5e3e588bbee398d85f7902
  data.tar.gz: f5b35f65f7fb8f0df3adbcd4ff6d5df483ab3ce4
SHA512:
  metadata.gz: 8432678eb42bcbacfa3db0c04b6f1cf728516e69da13fab1f79ab88002b15bb360cdd31ee76b40f148f2250b3dc78c4263037496c33611c482f1289fdf4998cd
  data.tar.gz: c581302f4b0e77840d2a6e4d9e30657387b913b5c4f7c24376b083576fc4fdd76c28bdff401927528094caab059252e214ccc1979d170f268e27917ec1442708
data/README.txt ADDED
'tokn' : A Ruby gem for constructing DFAs and using them to tokenize text files.

Written and (c) by Jeff Sember, March 2013.
================================================================================


Description of the problem
================================================================================

For a simple example, suppose a particular text file is designed to have
tokens of the following three types:

1) 'a' followed by any number of 'a' or 'b'
2) 'b' followed by either 'aa' or zero or more 'b'
3) 'bbb'

We will also allow an additional separator token, a single whitespace character.
These four token types can be written using regular expressions as:

  sep: \s
  tku: a(a|b)*
  tkv: b(aa|b*)
  tkw: bbb

We've given each token definition a name (to the left of the colon).
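As a sanity check, these definitions can be approximated with ordinary anchored Ruby regexes. This is illustrative only: the names TOKEN_REGEXES and token_types are hypothetical, and tokn itself compiles its own regex dialect into a DFA rather than using Ruby's regex engine.

```ruby
# Ruby approximations of the four token definitions above.
TOKEN_REGEXES = {
  "sep" => /\A\s\z/,         # a whitespace character
  "tku" => /\Aa[ab]*\z/,     # a(a|b)*
  "tkv" => /\Ab(?:aa|b*)\z/, # b(aa|b*)
  "tkw" => /\Abbb\z/,        # bbb
}

# Return the names of every definition that matches the whole string.
def token_types(s)
  TOKEN_REGEXES.select { |_, re| re.match?(s) }.keys
end
```

Note that a string such as "bbb" matches more than one definition; resolving such overlaps is exactly what the DFA's token identifiers (described below) are for.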

Now suppose your program needs to read a text file and interpret the tokens it
finds there. This can be done using the DFA (deterministic finite state automaton)
shown in figures/sample_dfa.pdf. The token extraction algorithm is as follows:

1) Begin at the start state, S0.
2) Look at the next character in the source (text) file. If there is an arrow
   (edge) labelled with that character, follow it to another state (it may lead
   back to the same state; that's okay), and advance the cursor to the next
   character in the source file.
3) If there's an arrow labelled with a negative number N, don't follow the edge,
   but instead remember the lowest (i.e., most negative) such N found.
4) Repeat steps 2 and 3 until no further progress is possible.
5) At this point, N indicates the name of the token found. The cursor should be
   restored to the point it was at when that N was recorded. The token's text
   consists of the characters from the starting cursor position to that point.
6) If no N value was recorded, then the source text doesn't match any of the
   tokens, which is considered an error.
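The steps above can be sketched in a few lines of Ruby. Everything here is hypothetical and independent of tokn's actual implementation: the DFA is hand-built for two simple token types (letters and digits) plus a space separator, each state is a plain Hash of edges, and a state's :token entry plays the role of the negative-numbered edge of step 3. (Since each state here carries at most one token id, the "most negative N" tie-breaking never comes into play.)

```ruby
# Hypothetical hand-built DFA: states are Hashes whose character keys
# lead to other states; a state's :token entry is the (negative) token
# id recorded on arrival (the "negative edge" of step 3).
TOKEN_NAMES = { -2 => "id", -3 => "int", -4 => "sep" }

def build_demo_dfa
  id_state  = { token: -2 }   # one or more letters
  int_state = { token: -3 }   # one or more digits
  sep_state = { token: -4 }   # a single space
  start = { " " => sep_state }
  ("a".."z").each { |c| start[c] = id_state;  id_state[c]  = id_state }
  ("0".."9").each { |c| start[c] = int_state; int_state[c] = int_state }
  start
end

# Steps 1-6: follow edges while possible, remember where a token id was
# last recorded, then restore the cursor to that point (step 5).
def extract_tokens(dfa, text)
  tokens = []
  pos = 0
  while pos < text.length
    state, cursor = dfa, pos
    best_id, best_end = nil, pos
    while cursor < text.length && (nxt = state[text[cursor]])
      state = nxt
      cursor += 1
      if state[:token]
        best_id, best_end = state[:token], cursor
      end
    end
    raise "no token matches at offset #{pos}" unless best_id   # step 6
    tokens << [TOKEN_NAMES[best_id], text[pos...best_end]]
    pos = best_end                                             # restore cursor
  end
  tokens
end
```

For example, extract_tokens(build_demo_dfa, "speed 42") yields [["id", "speed"], ["sep", " "], ["int", "42"]].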

The tokn module provides a simple and efficient way to perform this tokenization
process. Its main contribution is not just performing the six steps above: it can
also construct, from a set of token definitions, the DFA those steps require.
Such DFAs are very useful, and can be used by non-Ruby programs as well.


Using the tokn module in a Ruby program
===================================================================================

There are three object classes of interest: DFA, Tokenizer, and Token. A DFA is
compiled once from a script containing token definitions (e.g., "tkv: b(aa|b*) ..."),
and can then be stored (either in memory, or on disk as a JSON string) for later
use.

When tokens need to be extracted from a source file (or a simple string), a
Tokenizer is constructed. It requires both the DFA and the source text as input.
Once this is done, individual Token objects can be read from the Tokenizer.

Here's some example Ruby code showing how a text file "source.txt" can be split
into tokens. We'll assume there's a text file "tokendefs.txt" that contains the
definitions shown earlier.

  require 'tokn'

  dfa = dfa_from_script(readTextFile("tokendefs.txt"))
  t = Tokenizer.new(dfa, readTextFile("source.txt"))

  while t.hasNext
    k = t.read                      # read next token
    next if t.typeOf(k) == "sep"    # skip 'whitespace'
    # ... do something with the token ...
  end

If, later, another file needs to be tokenized, a new Tokenizer object can be
constructed and given the same dfa object as before.


Using the tokn command line utilities
===================================================================================

The module has two utility scripts, tokncompile and toknprocess, which can be
found in the bin/ directory.

The tokncompile script reads a token definition script from standard input and
compiles it to a DFA. For example, from the tokn directory, you can type:

  tokncompile < sampletokens.txt > compileddfa.txt

It will produce the JSON encoding of the appropriate DFA. For a description of
how this JSON string represents the DFA, see Dfa.rb.

The toknprocess script takes two arguments: the name of a file containing a
previously compiled DFA, and the name of a source file. It extracts the sequence
of tokens from the source file to the standard output:

  toknprocess compileddfa.txt sampletext.txt

This will produce the following output:

WS 1 1 // Example source file that can be tokenized

WS 2 1

ID 3 1 speed
WS 3 6
ASSIGN 3 7 =
WS 3 8
INT 3 9 42
WS 3 11
WS 3 14 // speed of object

WS 4 1

ID 5 1 gravity
WS 5 8
ASSIGN 5 9 =
WS 5 10
DBL 5 11 -9.80
WS 5 16


ID 7 1 title
WS 7 6
ASSIGN 7 7 =
WS 7 8
LBL 7 9 'This is a string with \' an escaped delimiter'
WS 7 56


IF 9 1 if
WS 9 3
ID 9 4 gravity
WS 9 11
EQUIV 9 12 ==
WS 9 14
INT 9 15 12
WS 9 17
BROP 9 18 {
WS 9 19

DO 10 3 do
WS 10 5
ID 10 6 something
WS 10 15

BRCL 11 1 }
WS 11 2

The extra linefeeds are the result of a token containing a linefeed.


FAQ
===================================================================================

1) Why can't I just use Ruby's regular expressions for tokenizing text?

You could construct a regular expression describing each possible token, and use
that to extract a token from the start of a string; you could then remove that
token from the string, and repeat. The trouble is that a combined regular
expression has no easy way to indicate which individual token's expression was
matched. You would then (presumably) have to match the returned token against
each individual regular expression to identify the token type.
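That naive approach can be sketched as follows, using anchored Ruby regexes for the four definitions from the earlier example (the names NAIVE_DEFS and naive_tokenize are hypothetical). Notice that every definition must be retried against the remaining input on every iteration, which is the redundancy described above:

```ruby
# Naive regex tokenizer: at each step, try every definition at the start
# of the remaining input and keep the longest match.
NAIVE_DEFS = {
  "sep" => /\A\s/,
  "tku" => /\Aa[ab]*/,
  "tkv" => /\Ab(?:aa|b*)/,
  "tkw" => /\Abbb/,
}

def naive_tokenize(text)
  tokens = []
  until text.empty?
    name, m = NAIVE_DEFS
      .map { |n, re| [n, re.match(text)] }
      .reject { |_, match| match.nil? }
      .max_by { |_, match| match[0].length }
    raise "no token matches: #{text.inspect}" if name.nil?
    tokens << [name, m[0]]
    text = m.post_match
  end
  tokens
end
```

For instance, naive_tokenize("abba b") yields [["tku", "abba"], ["sep", " "], ["tkv", "b"]].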

Another reason why standard regular expressions can be troublesome is that their
implementations actually 'recognize' a richer class of tokens than the ones
described here. This extra power can come at a cost: in some pathological cases,
the running time can become exponential.

2) Is tokn compatible with Unicode?

The tokn tool is capable of extracting tokens made up of characters with codes
in the entire Unicode range: 0 through 0x10ffff (hex). In fact, the labels on
the DFA edges can be viewed as sets of arbitrary nonnegative integers (negative
values are reserved for the token identifiers). Note, however, that the current
implementation only reads Ruby characters from the input, which I believe are
only 8 bits wide.

3) What do I do if I have some ideas for enhancing tokn, or want to point out
some problems with it?

Well, I can be reached as jpsember at gmail dot com.
data/bin/tokncompile ADDED
#!/usr/local/bin/ruby

# Compile a DFA from a token definition script,
# then serialize that DFA to stdout.
#
# Example usage (for Unix):
#
#   tokncompile < sampletokens.txt > dfa.txt
#

require 'tokn'

puts dfa_from_script(ARGF.read).serialize()
data/bin/toknprocess ADDED
#!/usr/local/bin/ruby

# Given a compiled DFA file and a source file,
# extract all tokens from the source file.
#
# Example usage (for Unix); assumes tokncompile
# has been run beforehand:
#
#   toknprocess dfa.txt sampletext.txt
#

require 'tokn'

if ARGV.size != 2
  puts "Usage: toknprocess <dfa file> <source file>"
  abort
end

dfa = dfa_from_file(ARGV[0])
tk = Tokenizer.new(dfa, readTextFile(ARGV[1]))

while tk.hasNext()
  t = tk.read
  printf("%s %d %d %s\n", tk.nameOf(t), t.lineNumber, t.column, t.text)
end
Binary file
require_relative 'tools'

req('tokn_const')

# A CodeSet is an ordered set of character or token codes that
# are used as labels on DFA edges.
#
# In addition to unicode character codes 0...0x10ffff, they
# also represent epsilon transitions (-1), or token identifiers ( < -1).
#
# Each CodeSet is represented as an array with 2n elements;
# each pair represents a closed lower and open upper range of values.
#
# Thus a value x is within the set [a1,a2,b1,b2,..]
# iff (a1 <= x < a2) or (b1 <= x < b2) or ...
#
class CodeSet

  include Tokn

  # Construct a copy of this set
  #
  def makeCopy
    c = CodeSet.new
    c.setTo(self)
    c
  end

  # Initialize set; optionally add an initial contiguous range
  #
  def initialize(lower = nil, upper = nil)
    @elem = []
    if lower
      add(lower, upper)
    end
  end

  # Replace this set with a copy of another
  #
  def setTo(otherSet)
    @elem.replace(otherSet.array)
  end

  # Get the array containing the code set range pairs
  #
  def array
    return @elem
  end

  # Replace this set's array
  # @param a array to point to (does not make a copy of it)
  #
  def setArray(a)
    @elem = a
  end

  def hash
    return @elem.hash
  end

  # Determine if this set is equivalent to another
  #
  def eql?(other)
    @elem == other.array
  end

  # Add a contiguous range of values to the set
  # @param lower min value in range
  # @param upper one plus max value in range
  #
  def add(lower, upper = nil)
    if not upper
      upper = lower + 1
    end

    if lower >= upper
      raise RangeError
    end

    newSet = []
    i = 0
    while i < @elem.size and @elem[i] < lower
      newSet.push(@elem[i])
      i += 1
    end

    if (i & 1) == 0
      newSet.push(lower)
    end

    while i < @elem.size and @elem[i] <= upper
      i += 1
    end

    if (i & 1) == 0
      newSet.push(upper)
    end

    while i < @elem.size
      newSet.push(@elem[i])
      i += 1
    end

    @elem = newSet
  end

  # Remove a contiguous range of values from the set
  # @param lower min value in range
  # @param upper one plus max value in range
  #
  def remove(lower, upper = nil)
    if not upper
      upper = lower + 1
    end

    if lower >= upper
      raise RangeError
    end

    newSet = []
    i = 0
    while i < @elem.size and @elem[i] < lower
      newSet.push(@elem[i])
      i += 1
    end

    if (i & 1) == 1
      newSet.push(lower)
    end

    while i < @elem.size and @elem[i] <= upper
      i += 1
    end

    if (i & 1) == 1
      newSet.push(upper)
    end

    while i < @elem.size
      newSet.push(@elem[i])
      i += 1
    end

    setArray(newSet)
  end

  # Replace this set with itself minus another
  #
  def difference!(s)
    setTo(difference(s))
  end

  # Calculate difference of this set minus another
  def difference(s)
    combineWith(s, 'd')
  end

  # Calculate the intersection of this set and another
  def intersect(s)
    combineWith(s, 'i')
  end

  # Set this set equal to its intersection with another
  def intersect!(s)
    setTo(intersect(s))
  end

  # Add every value from another CodeSet to this one
  def addSet(s)
    sa = s.array
    (0...sa.length).step(2) { |i| add(sa[i], sa[i + 1]) }
  end

  # Determine if this set contains a particular value
  def contains?(val)
    ret = false
    i = 0
    while i < @elem.size
      if val < @elem[i]
        break
      end
      if val < @elem[i + 1]
        ret = true
        break
      end
      i += 2
    end
    ret
  end

  # Get string representation of set, treating values (where
  # possible) as printable ASCII characters
  #
  def to_s
    s = ''
    i = 0
    while i < @elem.size
      if s.size > 0
        s += ' '
      end
      lower = @elem[i]
      upper = @elem[i + 1]
      s += dbStr(lower)
      if upper != 1 + lower
        s += '..' + dbStr(upper - 1)
      end
      i += 2
    end
    return s
  end

  def inspect
    to_s
  end

  # Get string representation of set, treating values
  # as integers
  #
  def to_s_alt
    s = ''
    i = 0
    while i < @elem.size
      if s.length > 0
        s += ' '
      end
      low = @elem[i]
      upr = @elem[i + 1]
      s += low.to_s
      if upr > low + 1
        s += '..'
        s += (upr - 1).to_s
      end
      i += 2
    end
    return s
  end

  # Negate the inclusion of a contiguous range of values
  #
  # @param lower min value in range
  # @param upper one plus max value in range
  #
  def negate(lower = 0, upper = CODEMAX)
    if lower >= upper
      raise RangeError
    end

    newSet = []
    i = 0
    while i < @elem.size and @elem[i] <= lower
      newSet.push(@elem[i])
      i += 1
    end

    if i > 0 and newSet[i - 1] == lower
      newSet.pop
    else
      newSet.push(lower)
    end

    while i < @elem.size and @elem[i] <= upper
      newSet.push(@elem[i])
      i += 1
    end

    if newSet.length > 0 and newSet.last == upper
      newSet.pop
    else
      newSet.push(upper)
    end

    while i < @elem.size
      newSet.push(@elem[i])
      i += 1
    end

    @elem = newSet
  end

  # Determine how many distinct values are represented by this set
  def cardinality
    c = 0
    i = 0
    while i < @elem.length
      c += @elem[i + 1] - @elem[i]
      i += 2
    end
    c
  end

  # Determine if this set is empty
  #
  def empty?
    @elem.empty?
  end

  private

  # Get a debug description of a value within a CodeSet, suitable
  # for including within a .dot label
  #
  def dbStr(charCode)
    # Unless it corresponds to a non-confusing printable ASCII value,
    # just print its decimal equivalent
    s = charCode.to_s
    if charCode == EPSILON
      s = "(e)"
    elsif (charCode > 32 && charCode < 0x7f && !"'\"\\[]{}()".index(charCode.chr))
      s = charCode.chr
    end
    return s
  end

  # Combine this set (a) with another (b) according to a particular operation
  # > s     other set (b)
  # > oper  'i': intersection, a^b
  #         'd': difference, a-b
  #
  def combineWith(s, oper)
    sa = array
    sb = s.array

    i = 0
    j = 0
    c = []

    wasInside = false

    while i < sa.length || j < sb.length

      if i == sa.length
        v = sb[j]
      elsif j == sb.length
        v = sa[i]
      else
        v = [sa[i], sb[j]].min
      end

      if i < sa.length && v == sa[i]
        i += 1
      end
      if j < sb.length && v == sb[j]
        j += 1
      end

      case oper
      when 'i'
        inside = ((i & 1) == 1) && ((j & 1) == 1)
      when 'd'
        inside = ((i & 1) == 1) && ((j & 1) == 0)
      else
        raise ArgumentError, "unsupported operation: #{oper}"
      end

      if inside != wasInside
        c.push v
        wasInside = inside
      end
    end

    ret = CodeSet.new
    ret.setArray(c)
    ret
  end

end
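The pair-array representation described in the class comment can be tried out standalone. A minimal sketch, assuming nothing from the class above (ranges_contain? is a hypothetical helper, not part of tokn):

```ruby
# Standalone illustration of the CodeSet representation: a flat array
# [a1,a2, b1,b2, ...] of closed-lower/open-upper range pairs.
# A value x is in the set iff some pair satisfies a <= x < b.
def ranges_contain?(elem, val)
  elem.each_slice(2).any? { |lo, hi| lo <= val && val < hi }
end
```

For example, with elem = [97, 99, 120, 121] (the codes for 'a', 'b', and 'x'), the values 97, 98, and 120 are members, while the open upper bounds 99 and 121 are not.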