rdf-turtle 0.1.2 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/README.markdown +5 -5
- data/VERSION +1 -1
- data/lib/ebnf.rb +620 -0
- data/lib/rdf/ll1/lexer.rb +4 -4
- data/lib/rdf/ll1/parser.rb +35 -19
- data/lib/rdf/ll1/scanner.rb +1 -1
- data/lib/rdf/turtle/format.rb +1 -1
- data/lib/rdf/turtle/meta.rb +718 -1232
- data/lib/rdf/turtle/reader.rb +74 -45
- data/lib/rdf/turtle/terminals.rb +60 -34
- data/lib/rdf/turtle/writer.rb +3 -3
- metadata +192 -69
data/README.markdown
CHANGED
@@ -1,4 +1,4 @@
-# RDF::Turtle reader/writer
+# RDF::Turtle reader/writer [](http://travis-ci.org/ruby-rdf/rdf-turtle)
 [Turtle][] reader/writer for [RDF.rb][RDF.rb] .
 
 ## Description
@@ -48,7 +48,7 @@ In some cases, the specification is unclear on certain issues:
 cannot if the IRI contains any characters that might need escaping. This implementation currently abides
 by this restriction. Presumably, this would affect both PNAME\_NS and PNAME\_LN terminals.
 (This is being tracked as issues [67](http://www.w3.org/2011/rdf-wg/track/issues/67)).
-* The EBNF definition of
+* The EBNF definition of IRIREF seems malformed, and has no provision for \^, as discussed elsewhere in the spec.
 We presume that [#0000- ] is intended to be [#0000-#0020].
 * The list example in section 6 uses a list on it's own, without a predicate or object, which is not allowed
 by the grammar (neither is a blankNodeProperyList). Either the EBNF should be updated to allow for these
@@ -128,9 +128,9 @@ see <http://unlicense.org/> or the accompanying {file:UNLICENSE} file.
 [YARD]: http://yardoc.org/
 [YARD-GS]: http://rubydoc.info/docs/yard/file/docs/GettingStarted.md
 [PDD]: http://lists.w3.org/Archives/Public/public-rdf-ruby/2010May/0013.html
-[RDF.rb]: http://rubydoc.info/github/
+[RDF.rb]: http://rubydoc.info/github/ruby-rdf/rdf/master/frames
 [Backports]: http://rubygems.org/gems/backports
 [N-Triples]: http://www.w3.org/TR/rdf-testcases/#ntriples
-[Turtle]: http://www.w3.org/TR/
+[Turtle]: http://www.w3.org/TR/2012/WD-turtle-20120710/
 [Turtle doc]: http://rubydoc.info/github/ruby-rdf/rdf-turtle/master/file/README.markdown
-[Turtle EBNF]: http://
+[Turtle EBNF]: http://dvcs.w3.org/hg/rdf/file/8610b8f58685/rdf-turtle/turtle.bnf
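The IRIREF escaping issue discussed in the README changes above can be illustrated in plain Ruby. This is a minimal sketch, not part of the rdf-turtle API: the constant and method names are invented for illustration, and the character set reflects the presumed [#0000-#x20] control range plus the other characters IRIREF disallows outside of UCHAR escapes.

```ruby
# Characters Turtle's IRIREF production excludes unless written as UCHAR
# escapes: the presumed control range #x00-#x20, plus <>"{}|^` and \.
# IRI_EXCLUDED and iri_needs_escaping? are illustrative names only.
IRI_EXCLUDED = /[\u0000-\u0020<>"{}|^`\\]/

def iri_needs_escaping?(iri)
  # Returns true when the IRI contains at least one excluded character
  !!(iri =~ IRI_EXCLUDED)
end

puts iri_needs_escaping?("http://example.org/ok")   # false
puts iri_needs_escaping?("http://example.org/a b")  # true (space is #x20)
```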
data/VERSION
CHANGED
@@ -1 +1 @@
-0.1.2
+0.3.0
data/lib/ebnf.rb
ADDED
@@ -0,0 +1,620 @@
+require 'strscan'
+
+# Extended Backus-Naur Form (EBNF), being the W3C variation as
+# originally defined in the
+# [W3C XML 1.0 Spec](http://www.w3.org/TR/REC-xml/#sec-notation).
+#
+# This version attempts to be less strict than the strict definition
+# to allow for colloquial variations (such as in the Turtle syntax).
+#
+# A rule takes the following form:
+#     [1] symbol ::= expression
+#
+# Comments include the content between '/*' and '*/'
+#
+# @see http://www.w3.org/2000/10/swap/grammar/ebnf2turtle.py
+# @see http://www.w3.org/2000/10/swap/grammar/ebnf2bnf.n3
+#
+# Based on bnf2turtle by Dan Connolly.
+#
+# Motivation
+# ----------
+#
+# Many specifications include grammars that look formal but are not
+# actually checked, by machine, against test data sets. Debugging the
+# grammar in the XML specification has been a long, tedious manual
+# process. Only when the loop is closed between a fully formal grammar
+# and a large test data set can we be confident that we have an accurate
+# specification of a language [#]_.
+#
+# The grammar in the `N3 design note`_ has evolved based on the original
+# manual transcription into a python recursive-descent parser and
+# subsequent development of test cases. Rather than maintain the grammar
+# and the parser independently, our goal_ is to formalize the language
+# syntax sufficiently to replace the manual implementation with one
+# derived mechanically from the specification.
+#
+# .. [#] and even then, only the syntax of the language.
+# .. _N3 design note: http://www.w3.org/DesignIssues/Notation3
+#
+# Related Work
+# ------------
+#
+# Sean Palmer's `n3p announcement`_ demonstrated the feasibility of the
+# approach, though that work did not cover some aspects of N3.
+#
+# In development of the `SPARQL specification`_, Eric Prud'hommeaux
+# developed Yacker_, which converts EBNF syntax to perl and C and C++
+# yacc grammars. It includes an interactive facility for checking
+# strings against the resulting grammars.
+# Yosi Scharf used it in `cwm Release 1.1.0rc1`_, which includes
+# a SPARQL parser that is *almost* completely mechanically generated.
+#
+# The N3/turtle output from yacker is lower level than the EBNF notation
+# from the XML specification; it has the ?, +, and * operators compiled
+# down to pure context-free rules, obscuring the grammar
+# structure. Since that transformation is straightforwardly expressed in
+# semantic web rules (see bnf-rules.n3_), it seems best to keep the RDF
+# expression of the grammar in terms of the higher level EBNF
+# constructs.
+#
+# .. _goal: http://www.w3.org/2002/02/mid/1086902566.21030.1479.camel@dirk;list=public-cwm-bugs
+# .. _n3p announcement: http://lists.w3.org/Archives/Public/public-cwm-talk/2004OctDec/0029.html
+# .. _Yacker: http://www.w3.org/1999/02/26-modules/User/Yacker
+# .. _SPARQL specification: http://www.w3.org/TR/rdf-sparql-query/
+# .. _Cwm Release 1.1.0rc1: http://lists.w3.org/Archives/Public/public-cwm-announce/2005JulSep/0000.html
+# .. _bnf-rules.n3: http://www.w3.org/2000/10/swap/grammar/bnf-rules.n3
+#
+# Open Issues and Future Work
+# ---------------------------
+#
+# The yacker output also has the terminals compiled to elaborate regular
+# expressions. The best strategy for dealing with lexical tokens is not
+# yet clear. Many tokens in SPARQL are case insensitive; this is not yet
+# captured formally.
+#
+# The schema for the EBNF vocabulary used here (``g:seq``, ``g:alt``, ...)
+# is not yet published; it should be aligned with `swap/grammar/bnf`_
+# and the bnf2html.n3_ rules (and/or the style of linked XHTML grammar
+# in the SPARQL and XML specifications).
+#
+# It would be interesting to corroborate the claim in the SPARQL spec
+# that the grammar is LL(1) with a mechanical proof based on N3 rules.
+#
+# .. _swap/grammar/bnf: http://www.w3.org/2000/10/swap/grammar/bnf
+# .. _bnf2html.n3: http://www.w3.org/2000/10/swap/grammar/bnf2html.n3
+#
+# Background
+# ----------
+#
+# The `N3 Primer`_ by Tim Berners-Lee introduces RDF and the Semantic
+# web using N3, a teaching and scribbling language. Turtle is a subset
+# of N3 that maps directly to (and from) the standard XML syntax for
+# RDF.
+#
+# .. _N3 Primer: http://www.w3.org/2000/10/swap/Primer.html
+#
+# @author Gregg Kellogg
+class EBNF
+  class Rule
+    # @attr [Symbol] sym
+    attr_reader :sym
+    # @attr [String] id
+    attr_reader :id
+    # @attr [Symbol] kind one of :rule, :token, or :pass
+    attr_accessor :kind
+    # @attr [Array] expr
+    attr_reader :expr
+    # @attr [String] orig
+    attr_accessor :orig
+
+    # @param [Integer] id
+    # @param [Symbol] sym
+    # @param [Array] expr
+    # @param [String] orig
+    # @param [EBNF] ebnf
+    def initialize(id, sym, expr, ebnf)
+      @id, @sym, @expr, @ebnf = id, sym, expr, ebnf
+    end
+
+    def to_sxp
+      [id, sym, kind, expr].to_sxp
+    end
+
+    def to_ttl
+      @ebnf.debug("to_ttl") {inspect}
+      comment = orig.strip.
+        gsub(/"""/, '\"\"\"').
+        gsub("\\", "\\\\").
+        sub(/^\"/, '\"').
+        sub(/\"$/m, '\"')
+      statements = [
+        %{:#{id} rdfs:label "#{id}"; rdf:value "#{sym}";},
+        %{  rdfs:comment #{comment.inspect};},
+      ]
+
+      statements += ttl_expr(expr, kind == :token ? "re" : "g", 1, false)
+      "\n" + statements.join("\n")
+    end
+
+    def inspect
+      {:sym => sym, :id => id, :kind => kind, :expr => expr}.inspect
+    end
+
+    private
+    def ttl_expr(expr, pfx, depth, is_obj = true)
+      indent = '  ' * depth
+      @ebnf.debug("ttl_expr", :depth => depth) {expr.inspect}
+      op = expr.shift if expr.is_a?(Array)
+      statements = []
+
+      if is_obj
+        bra, ket = "[ ", " ]"
+      else
+        bra = ket = ''
+      end
+
+      case op
+      when :seq, :alt, :diff
+        statements << %{#{indent}#{bra}#{pfx}:#{op} (}
+        expr.each {|a| statements += ttl_expr(a, pfx, depth + 1)}
+        statements << %{#{indent} )#{ket}}
+      when :opt, :plus, :star
+        statements << %{#{indent}#{bra}#{pfx}:#{op} }
+        statements += ttl_expr(expr.first, pfx, depth + 1)
+        statements << %{#{indent} #{ket}} unless ket.empty?
+      when :"'"
+        statements << %{#{indent}"#{esc(expr)}"}
+      when :range
+        statements << %{#{indent}#{bra} re:matches #{cclass(expr.first).inspect} #{ket}}
+      when :hex
+        raise "didn't expect \" in expr" if expr.include?(:'"')
+        statements << %{#{indent}#{bra} re:matches #{cclass(expr.first).inspect} #{ket}}
+      else
+        if is_obj
+          statements << %{#{indent}#{expr.inspect}}
+        else
+          statements << %{#{indent}g:seq ( #{expr.inspect} )}
+        end
+      end
+
+      statements.last << " ." unless is_obj
+      @ebnf.debug("statements", :depth => depth) {statements.join("\n")}
+      statements
+    end
+
+    ##
+    # turn an XML BNF character class into an N3 literal for that
+    # character class (less the outer quote marks)
+    #
+    # >>> cclass("^<>'{}|^`")
+    # "[^<>'{}|^`]"
+    # >>> cclass("#x0300-#x036F")
+    # "[\\u0300-\\u036F]"
+    # >>> cclass("#xC0-#xD6")
+    # "[\\u00C0-\\u00D6]"
+    # >>> cclass("#x370-#x37D")
+    # "[\\u0370-\\u037D]"
+    #
+    # as in: ECHAR ::= '\' [tbnrf\"']
+    # >>> cclass("tbnrf\\\"'")
+    # 'tbnrf\\\\\\"\''
+    #
+    # >>> cclass("^#x22#x5C#x0A#x0D")
+    # '^\\u0022\\\\\\u005C\\u000A\\u000D'
+    def cclass(txt)
+      '[' +
+      txt.gsub(/\#x[0-9a-fA-F]+/) do |hx|
+        hx = hx[2..-1]
+        if hx.length <= 4
+          "\\u#{'0' * (4 - hx.length)}#{hx}"
+        elsif hx.length <= 8
+          "\\U#{'0' * (8 - hx.length)}#{hx}"
+        end
+      end +
+      ']'
+    end
+  end
+
+  # Abstract syntax tree from parse
+  attr_reader :ast
+
+  # Parse the string or file input generating an abstract syntax tree
+  # in S-Expressions (similar to SPARQL SSE)
+  #
+  # @param [#read, #to_s] input
+  def initialize(input, options = {})
+    @options = options
+    @lineno, @depth = 1, 0
+    token = false
+    @ast = []
+
+    input = input.respond_to?(:read) ? input.read : input.to_s
+    scanner = StringScanner.new(input)
+
+    eachRule(scanner) do |r|
+      debug("rule string") {r.inspect}
+      case r
+      when /^@terminals/
+        # Switch mode to parsing tokens
+        token = true
+      when /^@pass\s*(.*)$/m
+        rule = depth {ruleParts("[0] " + r)}
+        rule.kind = :pass
+        rule.orig = r
+        @ast << rule
+      else
+        rule = depth {ruleParts(r)}
+
+        # all caps symbols are tokens. Once a token is seen
+        # we don't go back
+        token ||= !!(rule.sym.to_s =~ /^[A-Z_]+$/)
+        rule.kind = token ? :token : :rule
+        rule.orig = r
+        @ast << rule
+      end
+    end
+  end
+
+  ##
+  # Write out parsed syntax string as an S-Expression
+  def to_sxp
+    begin
+      require 'sxp'
+      SXP::Generator.string(ast)
+    rescue LoadError
+      ast.to_sxp
+    end
+  end
+
+  ##
+  # Write out syntax tree as Turtle
+  # @param [String] prefix for language
+  # @return [String]
+  def to_ttl(prefix, ns)
+    token = false
+
+    unless ast.empty?
+      [
+        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.",
+        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.",
+        "@prefix #{prefix}: <#{ns}>.",
+        "@prefix : <#{ns}>.",
+        "@prefix re: <http://www.w3.org/2000/10/swap/grammar/regex#>.",
+        "@prefix g: <http://www.w3.org/2000/10/swap/grammar/ebnf#>.",
+        "",
+        ":language rdfs:isDefinedBy <>; g:start :#{ast.first.id}.",
+        "",
+      ]
+    end.join("\n") +
+
+    ast.
+      select {|a| [:rule, :token].include?(a.kind)}.
+      map(&:to_ttl).
+      join("\n")
+  end
+
+  ##
+  # Iterate over rule strings.
+  # a line that starts with '[' or '@' starts a new rule
+  #
+  # @param [StringScanner] scanner
+  # @yield rule_string
+  # @yieldparam [String] rule_string
+  def eachRule(scanner)
+    cur_lineno = 1
+    r = ''
+    until scanner.eos?
+      case
+      when s = scanner.scan(%r(\s+)m)
+        # Eat whitespace
+        cur_lineno += s.count("\n")
+        #debug("eachRule(ws)") { "[#{cur_lineno}] #{s.inspect}" }
+      when s = scanner.scan(%r(/\*([^\*]|\*[^\/])*\*/)m)
+        # Eat comments
+        cur_lineno += s.count("\n")
+        debug("eachRule(comment)") { "[#{cur_lineno}] #{s.inspect}" }
+      when s = scanner.scan(%r(^@terminals))
+        #debug("eachRule(@terminals)") { "[#{cur_lineno}] #{s.inspect}" }
+        yield(r) unless r.empty?
+        @lineno = cur_lineno
+        yield(s)
+        r = ''
+      when s = scanner.scan(/@pass/)
+        # Found rule start, if we've already collected a rule, yield it
+        #debug("eachRule(@pass)") { "[#{cur_lineno}] #{s.inspect}" }
+        yield r unless r.empty?
+        @lineno = cur_lineno
+        r = s
+      when s = scanner.scan(/\[(?=\w+\])/)
+        # Found rule start, if we've already collected a rule, yield it
+        yield r unless r.empty?
+        #debug("eachRule(rule)") { "[#{cur_lineno}] #{s.inspect}" }
+        @lineno = cur_lineno
+        r = s
+      else
+        # Collect until end of line, or start of comment
+        s = scanner.scan_until(%r((?:/\*)|$)m)
+        cur_lineno += s.count("\n")
+        #debug("eachRule(rest)") { "[#{cur_lineno}] #{s.inspect}" }
+        r += s
+      end
+    end
+    yield r unless r.empty?
+  end
+
+  ##
+  # Parse a rule into a rule number, a symbol and an expression
+  #
+  # @param [String] rule
+  # @return [Rule]
+  def ruleParts(rule)
+    num_sym, expr = rule.split('::=', 2).map(&:strip)
+    num, sym = num_sym.split(']', 2).map(&:strip)
+    num = num[1..-1]
+    r = Rule.new(sym && sym.to_sym, num, ebnf(expr).first, self)
+    debug("ruleParts") { r.inspect }
+    r
+  end
+
+  ##
+  # Parse a string into an expression tree and a remaining string
+  #
+  # @example
+  #   >>> ebnf("a b c")
+  #   ((seq, [('id', 'a'), ('id', 'b'), ('id', 'c')]), '')
+  #
+  #   >>> ebnf("a? b+ c*")
+  #   ((seq, [(opt, ('id', 'a')), (plus, ('id', 'b')), ('*', ('id', 'c'))]), '')
+  #
+  #   >>> ebnf(" | x xlist")
+  #   ((alt, [(seq, []), (seq, [('id', 'x'), ('id', 'xlist')])]), '')
+  #
+  #   >>> ebnf("a | (b - c)")
+  #   ((alt, [('id', 'a'), (diff, [('id', 'b'), ('id', 'c')])]), '')
+  #
+  #   >>> ebnf("a b | c d")
+  #   ((alt, [(seq, [('id', 'a'), ('id', 'b')]), (seq, [('id', 'c'), ('id', 'd')])]), '')
+  #
+  #   >>> ebnf("a | b | c")
+  #   ((alt, [('id', 'a'), ('id', 'b'), ('id', 'c')]), '')
+  #
+  #   >>> ebnf("a) b c")
+  #   (('id', 'a'), ' b c')
+  #
+  #   >>> ebnf("BaseDecl? PrefixDecl*")
+  #   ((seq, [(opt, ('id', 'BaseDecl')), ('*', ('id', 'PrefixDecl'))]), '')
+  #
+  #   >>> ebnf("NCCHAR1 | diff | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]")
+  #   ((alt, [('id', 'NCCHAR1'), ("'", diff), (range, '0-9'), (hex, '#x00B7'), (range, '#x0300-#x036F'), (range, '#x203F-#x2040')]), '')
+  #
+  # @param [String] s
+  # @return [Array]
+  def ebnf(s)
+    debug("ebnf") {"(#{s.inspect})"}
+    e, s = depth {alt(s)}
+    debug {"=> alt returned #{[e, s].inspect}"}
+    unless s.empty?
+      t, ss = depth {token(s)}
+      debug {"=> token returned #{[t, ss].inspect}"}
+      return [e, ss] if t.is_a?(Array) && t.first == :")"
+    end
+    [e, s]
+  end
+
+  ##
+  # Parse alt
+  #   >>> alt("a | b | c")
+  #   ((alt, [('id', 'a'), ('id', 'b'), ('id', 'c')]), '')
+  # @param [String] s
+  # @return [Array]
+  def alt(s)
+    debug("alt") {"(#{s.inspect})"}
+    args = []
+    while !s.empty?
+      e, s = depth {seq(s)}
+      debug {"=> seq returned #{[e, s].inspect}"}
+      if e.to_s.empty?
+        break unless args.empty?
+        e = [:seq, []] # empty sequence
+      end
+      args << e
+      unless s.empty?
+        t, ss = depth {token(s)}
+        break unless t[0] == :alt
+        s = ss
+      end
+    end
+    args.length > 1 ? [args.unshift(:alt), s] : [e, s]
+  end
+
+  ##
+  # parse seq
+  #
+  #   >>> seq("a b c")
+  #   ((seq, [('id', 'a'), ('id', 'b'), ('id', 'c')]), '')
+  #
+  #   >>> seq("a b? c")
+  #   ((seq, [('id', 'a'), (opt, ('id', 'b')), ('id', 'c')]), '')
+  def seq(s)
+    debug("seq") {"(#{s.inspect})"}
+    args = []
+    while !s.empty?
+      e, ss = depth {diff(s)}
+      debug {"=> diff returned #{[e, ss].inspect}"}
+      unless e.to_s.empty?
+        args << e
+        s = ss
+      else
+        break
+      end
+    end
+    if args.length > 1
+      [args.unshift(:seq), s]
+    elsif args.length == 1
+      args + [s]
+    else
+      ["", s]
+    end
+  end
+
+  ##
+  # parse diff
+  #
+  #   >>> diff("a - b")
+  #   ((diff, [('id', 'a'), ('id', 'b')]), '')
+  def diff(s)
+    debug("diff") {"(#{s.inspect})"}
+    e1, s = depth {postfix(s)}
+    debug {"=> postfix returned #{[e1, s].inspect}"}
+    unless e1.to_s.empty?
+      unless s.empty?
+        t, ss = depth {token(s)}
+        debug {"diff #{[t, ss].inspect}"}
+        if t.is_a?(Array) && t.first == :diff
+          s = ss
+          e2, s = primary(s)
+          unless e2.to_s.empty?
+            return [[:diff, e1, e2], s]
+          else
+            raise "Syntax Error"
+          end
+        end
+      end
+    end
+    [e1, s]
+  end
+
+  ##
+  # parse postfix
+  #
+  #   >>> postfix("a b c")
+  #   (('id', 'a'), ' b c')
+  #
+  #   >>> postfix("a? b c")
+  #   ((opt, ('id', 'a')), ' b c')
+  def postfix(s)
+    debug("postfix") {"(#{s.inspect})"}
+    e, s = depth {primary(s)}
+    debug {"=> primary returned #{[e, s].inspect}"}
+    return ["", s] if e.to_s.empty?
+    if !s.empty?
+      t, ss = depth {token(s)}
+      debug {"=> #{[t, ss].inspect}"}
+      if t.is_a?(Array) && [:opt, :star, :plus].include?(t.first)
+        return [[t.first, e], ss]
+      end
+    end
+    [e, s]
+  end
+
+  ##
+  # parse primary
+  #
+  #   >>> primary("a b c")
+  #   (('id', 'a'), ' b c')
+  def primary(s)
+    debug("primary") {"(#{s.inspect})"}
+    t, s = depth {token(s)}
+    debug {"=> token returned #{[t, s].inspect}"}
+    if t.is_a?(Symbol) || t.is_a?(String)
+      [t, s]
+    elsif %w(range hex).map(&:to_sym).include?(t.first)
+      [t, s]
+    elsif t.first == :"("
+      e, s = depth {ebnf(s)}
+      debug {"=> ebnf returned #{[e, s].inspect}"}
+      [e, s]
+    else
+      ["", s]
+    end
+  end
+
+  ##
+  # parse one token; return the token and the remaining string
+  #
+  # A token is represented as a tuple whose 1st item gives the type;
+  # some types have additional info in the tuple.
+  #
+  # @example
+  #   >>> token("'abc' def")
+  #   (("'", 'abc'), ' def')
+  #
+  #   >>> token("[0-9]")
+  #   ((range, '0-9'), '')
+  #   >>> token("#x00B7")
+  #   ((hex, '#x00B7'), '')
+  #   >>> token ("[#x0300-#x036F]")
+  #   ((range, '#x0300-#x036F'), '')
+  #   >>> token("[^<>'{}|^`]-[#x00-#x20]")
+  #   ((range, "^<>'{}|^`"), '-[#x00-#x20]')
+  def token(s)
+    s = s.strip
+    case m = s[0,1]
+    when '"', "'"
+      l, s = s[1..-1].split(m, 2)
+      [l, s]
+    when '['
+      l, s = s[1..-1].split(']', 2)
+      [[:range, l], s]
+    when '#'
+      s.match(/(#\w+)(.*)$/)
+      l, s = $1, $2
+      [[:hex, l], s]
+    when /[[:alpha:]]/
+      s.match(/(\w+)(.*)$/)
+      l, s = $1, $2
+      [l.to_sym, s]
+    when '@'
+      s.match(/@(#\w+)(.*)$/)
+      l, s = $1, $2
+      [[:"@", l], s]
+    when '-'
+      [[:diff], s[1..-1]]
+    when '?'
+      [[:opt], s[1..-1]]
+    when '|'
+      [[:alt], s[1..-1]]
+    when '+'
+      [[:plus], s[1..-1]]
+    when '*'
+      [[:star], s[1..-1]]
+    when /[\(\)]/
+      [[m.to_sym], s[1..-1]]
+    else
+      raise "unrecognized token: #{s.inspect}"
+    end
+  end
+
+  def depth
+    @depth += 1
+    ret = yield
+    @depth -= 1
+    ret
+  end
+
+  ##
+  # Progress output when debugging
+  # @param [String] node relative location in input
+  # @param [String] message ("")
+  # @yieldreturn [String] added to message
+  def debug(*args)
+    return unless @options[:debug]
+    options = args.last.is_a?(Hash) ? args.pop : {}
+    depth = options[:depth] || @depth
+    message = args.pop
+    message = message.call if message.is_a?(Proc)
+    args << message if message
+    args << yield if block_given?
+    message = "#{args.join(': ')}"
+    str = "[#{@lineno}]#{' ' * depth}#{message}"
+    @options[:debug] << str if @options[:debug].is_a?(Array)
+    $stderr.puts(str) if @options[:debug] == true
+  end
+end
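The #xHH-to-\uHHHH rewriting performed by EBNF::Rule#cclass in the added file can be exercised standalone. A minimal sketch, duplicating the method outside its class purely for illustration: it pads W3C-style hex character references to 4 (or 8) digits and wraps the result in a character class.

```ruby
# Standalone copy of the cclass technique from ebnf.rb above: rewrite
# W3C-style #xHH references into \uHHHH (or \UHHHHHHHH) escapes and wrap
# the whole text in a character class. Illustrative only, not the gem API.
def cclass(txt)
  '[' +
  txt.gsub(/\#x[0-9a-fA-F]+/) do |hx|
    hx = hx[2..-1]                         # drop the leading "#x"
    if hx.length <= 4
      "\\u#{'0' * (4 - hx.length)}#{hx}"   # pad to 4 hex digits
    elsif hx.length <= 8
      "\\U#{'0' * (8 - hx.length)}#{hx}"   # pad to 8 hex digits
    end
  end +
  ']'
end

puts cclass("#x0300-#x036F")  # => [\u0300-\u036F]
puts cclass("#xC0-#xD6")      # => [\u00C0-\u00D6]
```

Text without any #xHH references passes through unchanged, matching the docstring examples such as cclass("^<>'{}|^`").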