oga 0.1.3 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 8b60359c51f8e8eb14fb35fbc7ffdd6da99a6d50
4
- data.tar.gz: b7c569316e309823375f84dc628c1f170d3f2540
3
+ metadata.gz: 9abda7194e4d0f181bf8a43c5d5154c965fd1d81
4
+ data.tar.gz: 8cd27a710c2c761ffd37d9393e53e3b78f53444e
5
5
  SHA512:
6
- metadata.gz: 57224b74df2f069a99826dcbc49d1bf4efdb7d31c5568c23e8b817952d3708b365198fcbafdd82d95e0e1f7af9b221d6b6a71cc1d09fd4cf2951d177fa4ed456
7
- data.tar.gz: b7103802563321ebe4174a328e65d99a00df973abca79e1ef02544e0bcf2484e679db8009c9de98cb971d8912bfe573f0d0250bbbc25f77c5764e6d43f190d6c
6
+ metadata.gz: 556c869e33dfe785eda199e42a5d2fe869e269c8522809b81d8febdefe88285c52a3c7b17e22524b031cf5698395c505fd82f9ef8d69db538f9ac7f19f761e47
7
+ data.tar.gz: 149034fbe883e5e0df5f805aa17b07cf12afc8cccefbc22d9f73e6dcc6ba2b6fbe1ef39706f4945fbab90e9ecdcb208b4d299f939eb428706f724f657ecbc822
data/README.md CHANGED
@@ -70,6 +70,12 @@ Querying a document using XPath:
70
70
 
71
71
  document.xpath('string(people/person)') # => "Alice"
72
72
 
73
+ Querying a document using CSS:
74
+
75
+ document = Oga.parse_xml('<people><person>Alice</person></people>')
76
+
77
+ document.css('people person') # => NodeSet(Element(name: "person" ...))
78
+
73
79
  Modifying a document and serializing it back to XML:
74
80
 
75
81
  document = Oga.parse_xml('<people><person>Alice</person></people>')
@@ -95,6 +101,7 @@ Querying a document using a namespace:
95
101
  * Low memory footprint
96
102
  * High performance, if something doesn't perform well enough it's a bug
97
103
  * Support for XPath 1.0
104
+ * CSS3 selector support
98
105
  * XML namespace support (registering, querying, etc)
99
106
 
100
107
  ## Requirements
@@ -127,6 +134,53 @@ _not_ thread-safe and should not be done by multiple threads at once.
127
134
  It is advised that you do not share parsed documents between threads unless you
128
135
  _really_ have to.
129
136
 
137
+ ## Namespace Support
138
+
139
+ Oga fully supports parsing/registering XML namespaces as well as querying them
140
+ using XPath. For example, take the following XML:
141
+
142
+ <root xmlns="http://example.com">
143
+ <bar>bar</bar>
144
+ </root>
145
+
146
+ If one were to try and query the `bar` element (e.g. using XPath `root/bar`)
147
+ they'd end up with an empty node set. This is due to `<root>` defining an
148
+ alternative default namespace. Instead you can query this element using the
149
+ following XPath:
150
+
151
+ *[local-name() = "root"]/*[local-name() = "bar"]
152
+
153
+ Alternatively, if you don't really care where the `<bar>` element is located you
154
+ can use the following:
155
+
156
+ descendant::*[local-name() = "bar"]
157
+
158
+ And if you want to specify an explicit namespace URI, you can use this:
159
+
160
+ descendant::*[local-name() = "bar" and namespace-uri() = "http://example.com"]
161
+
162
+ Unlike Nokogiri, Oga does _not_ provide a way to create "dynamic" namespaces.
163
+ That is, Nokogiri allows one to query the above document as following:
164
+
165
+ document = Nokogiri::XML('<root xmlns="http://example.com"><bar>bar</bar></root>')
166
+
167
+ document.xpath('x:root/x:bar', :x => 'http://example.com')
168
+
169
+ Oga does have a small trick you can use to cut down the size of your XPath
170
+ queries. Because Oga assigns the name "xmlns" to default namespaces you can use
171
+ this in your XPath queries:
172
+
173
+ document = Oga.parse_xml('<root xmlns="http://example.com"><bar>bar</bar></root>')
174
+
175
+ document.xpath('xmlns:root/xmlns:bar')
176
+
177
+ When using this you can still restrict the query to the correct namespace URI:
178
+
179
+ document.xpath('xmlns:root[namespace-uri() = "http://example.com"]/xmlns:bar')
180
+
181
+ In the future I might add an API to ease this process, although at this time I
182
+ have little interest in providing an API similar to Nokogiri.
183
+
130
184
  ## Documentation
131
185
 
132
186
  The documentation is best viewed [on the documentation website][doc-website].
@@ -134,6 +188,9 @@ The documentation is best viewed [on the documentation website][doc-website].
134
188
  * {file:CONTRIBUTING Contributing}
135
189
  * {file:changelog Changelog}
136
190
  * {file:migrating\_from\_nokogiri Migrating From Nokogiri}
191
+ * {Oga::XML::Parser XML Parser}
192
+ * {Oga::XML::SaxParser XML SAX Parser}
193
+ * {file:xml\_namespaces XML Namespaces}
137
194
 
138
195
  ## Native Extension Setup
139
196
 
@@ -3,6 +3,134 @@
3
3
  This document contains details of the various releases and their release dates.
4
4
  Dates are in the format `yyyy-mm-dd`.
5
5
 
6
+ ## 0.2.0 - 2014-11-17
7
+
8
+ ### CSS Selector Support
9
+
10
+ Probably the biggest feature of this release: support for querying documents
11
+ using CSS selectors. Oga supports a subset of the CSS3 selector specification,
12
+ in particular the following selectors are supported:
13
+
14
+ * Element, class and ID selectors
15
+ * Attribute selectors (e.g. `foo[x ~= "y"]`)
16
+
17
+ The following pseudo classes are supported:
18
+
19
+ * `:root`
20
+ * `:nth-child(n)`
21
+ * `:nth-last-child(n)`
22
+ * `:nth-of-type(n)`
23
+ * `:nth-last-of-type(n)`
24
+ * `:first-child`
25
+ * `:last-child`
26
+ * `:first-of-type`
27
+ * `:last-of-type`
28
+ * `:only-child`
29
+ * `:only-of-type`
30
+ * `:empty`
31
+
32
+ You can use CSS selectors using the methods `css` and `at_css` on an instance of
33
+ `Oga::XML::Document` or `Oga::XML::Element`. For example:
34
+
35
+ document = Oga.parse_xml('<people><person>Alice</person></people>')
36
+
37
+ document.css('people person') # => NodeSet(Element(name: "person" ...))
38
+
39
+ The architecture behind this is quite similar to parsing XPath. There's a lexer
40
+ (`Oga::CSS::Lexer`) and a parser (`Oga::CSS::Parser`). Unlike Nokogiri (and
41
+ perhaps other libraries) the parser _does not_ output XPath expressions as a
42
+ String or a CSS specific AST. Instead it directly emits an XPath AST. This
43
+ allows the resulting AST to be directly evaluated by `Oga::XPath::Evaluator`.
44
+
45
+ See <https://github.com/YorickPeterse/oga/issues/11> for more information.
46
+
47
+ ### Multi-line Attribute Support
48
+
49
+ Oga can now lex/parse elements that have attributes with newlines in them.
50
+ Previously this would trigger memory allocation errors.
51
+
52
+ See <https://github.com/YorickPeterse/oga/issues/58> for more information.
53
+
54
+ ### SAX after_element
55
+
56
+ The `after_element` method in the SAX parsing API now always takes two
57
+ arguments: the namespace name and element name. Previously this method would
58
+ always receive a single nil value as its argument, which is rather pointless.
59
+
60
+ See <https://github.com/YorickPeterse/oga/issues/54> for more information.
61
+
62
+ ### XPath Grouping
63
+
64
+ XPath expressions can now be grouped together using parentheses. This allows one
65
+ to specify a custom operator precedence.
66
+
67
+ ### Enumerator Parsing Input
68
+
69
+ Enumerator instances can now be used as input for `Oga.parse_xml` and friends.
70
+ This can be used to download and parse XML files on the fly. For example:
71
+
72
+ enum = Enumerator.new do |yielder|
73
+ HTTPClient.get('http://some-website.com/some-big-file.xml') do |chunk|
74
+ yielder << chunk
75
+ end
76
+ end
77
+
78
+ document = Oga.parse_xml(enum)
79
+
80
+ See <https://github.com/YorickPeterse/oga/issues/48> for more information.
81
+
82
+ ### Removing Attributes
83
+
84
+ Element attributes can now be removed using `Oga::XML::Element#unset`:
85
+
86
+ element = Oga::XML::Element.new(:name => 'foo')
87
+
88
+ element.set('class', 'foo')
89
+ element.unset('class')
90
+
91
+ ### XPath Attributes
92
+
93
+ XPath predicates are now evaluated for every context node as opposed to being
94
+ evaluated once for the entire context. This ensures that expressions such as
95
+ `descendant-or-self::node()/foo[1]` are evaluated correctly.
96
+
97
+ ### Available Namespaces
98
+
99
+ When calling `Oga::XML::Element#available_namespaces` the Hash returned by
100
+ `Oga::XML::Element#namespaces` would be modified in place. This was a bug that
101
+ has been fixed in this release.
102
+
103
+ ### NodeSets
104
+
105
+ NodeSet instances can now be compared with each other using `==`. Previously
106
+ this would always consider two instances to be different from each other due to
107
+ the usage of the default `Object#==` method.
108
+
109
+ ### XML Entities
110
+
111
+ XML entities such as `&amp;` and `&lt;` are now encoded/decoded by the lexer,
112
+ string and text nodes.
113
+
114
+ See <https://github.com/YorickPeterse/oga/issues/49> for more information.
115
+
116
+ ### General
117
+
118
+ Source lines are no longer included in error messages generated by the XML
119
+ parser. This simplifies the code and removes the need of re-reading the input
120
+ (in case of IO/Enumerable inputs).
121
+
122
+ ### XML Lexer Newlines
123
+
124
+ Newlines in the XML lexer are now counted in native code (C/Java). On MRI and
125
+ JRuby the improvement is quite small, but on Rubinius it's a massive
126
+ improvement. See commit `8db77c0a09bf6c996dd2856a6dbe1ad076b1d30a` for more
127
+ information.
128
+
129
+ ### HTML Void Element Performance
130
+
131
+ Performance for detecting HTML void elements (e.g. `<br>` and `<link>`) has been
132
+ improved by removing String allocations that were not needed.
133
+
6
134
  ## 0.1.3 - 2014-09-24
7
135
 
8
136
  This release fixes a problem with serializing attributes using the namespace
@@ -6,11 +6,12 @@ body
6
6
  max-width: 960px;
7
7
  }
8
8
 
9
- p code
9
+ p code, dd code, li code
10
10
  {
11
- background: #f2f2f2;
12
- padding-left: 3px;
13
- padding-right: 3px;
11
+ background: #f9f2f4;
12
+ color: #c7254e;
13
+ border-radius: 4px;
14
+ padding: 2px 4px;
14
15
  }
15
16
 
16
17
  pre.code
@@ -0,0 +1,935 @@
1
+ # CSS Selectors Specification
2
+
3
+ This document acts as an alternative specification to the official W3
4
+ [CSS3 Selectors Specification][w3spec]. This document specifies only the
5
+ selectors supported by Oga itself. Only CSS3 selectors are covered, CSS4 is not
6
+ part of this specification.
7
+
8
+ This document is best viewed in the YARD generated documentation or any other
9
+ Markdown viewer that supports the [Kramdown][kramdown] syntax. Alternatively it
10
+ can be viewed in its raw form.
11
+
12
+ ## Abstract
13
+
14
+ The official W3 specification on CSS selectors is anything but pleasant to read.
15
+ A lack of good examples and unspecified behaviour are just two of many problems.
16
+ This document was written as a reference guide for myself as well as a way for
17
+ others to more easily understand how CSS selectors work.
18
+
19
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
20
+ "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
21
+ interpreted as described in [RFC 2119][rfc-2119].
22
+
23
+ ## Syntax
24
+
25
+ To describe syntax elements of CSS selectors this document uses the same grammar
26
+ as [Ragel][ragel]. For example, an integer would be defined as following:
27
+
28
+ integer = [0-9]+;
29
+
30
+ In turn an integer that can optionally be prefixed by `+` or `-` would be
31
+ defined as following:
32
+
33
+ integer = ('+' | '-')* [0-9]+;
34
+
35
+ A quick and basic crash course of the Ragel grammar:
36
+
37
+ * `*`: zero or more instances of the preceding token(s)
38
+ * `+`: one or more instances of the preceding token(s)
39
+ * `(` and `)`: used for grouping expressions together
40
+ * `^`: inverts a match, thus `^[0-9]` means "anything but a single digit"
41
+ * `"..."` or `'...'`: a literal character, `"x"` would match the literal "x"
42
+ * `|`: the OR operator, `x | y` translates to "x OR y"
43
+ * `[...]`: used to define a sequence, `[0-9]` translates to "0 OR 1 OR 2 OR
44
+ 3..." all the way up to 9
45
+
46
+ Semicolons are used to terminate lines. While not strictly required in this
47
+ specification they are included in order to produce a Ragel syntax compatible
48
+ grammar.
49
+
50
+ See the Ragel documentation for more information on the grammar.
51
+
52
+ ## Terminology
53
+
54
+ local name
55
+ : The name of an element without a namespace. For the element `<strong>` the
56
+ local name is `strong`.
57
+
58
+ namespace prefix
59
+ : The namespace prefix of an element. For the element `<foo:strong>` the
60
+ namespace prefix is `foo`.
61
+
62
+ expression
63
+ : A single or multiple selectors used together to retrieve a set of elements
64
+ from a document.
65
+
66
+ ## Selector Scoping
67
+
68
+ Whenever a selector is used to match an element the selector applies to all
69
+ nodes in the context. For example, the selector `foo` would match all `foo`
70
+ elements at any position in the document. On the other hand, the selector
71
+ `foo bar` only matches any `bar` elements that are a descendant of any `foo`
72
+ element.
73
+
74
+ In XPath the corresponding axis for this is `descendant`. In other words, this
75
+ CSS expression:
76
+
77
+ foo
78
+
79
+ is the same as this XPath expression:
80
+
81
+ descendant::foo
82
+
83
+ In turn this CSS expression:
84
+
85
+ foo bar
86
+
87
+ is the same as this XPath expression:
88
+
89
+ descendant::foo/descendant::bar
90
+
91
+ Note that in the various XPath examples the `descendant` axis is omitted in
92
+ order to enhance readability.
93
+
94
+ ### Syntax
95
+
96
+ A CSS expression is made up of multiple selectors separated by one or more
97
+ spaces. There MUST be at least 1 space between two selectors, there MAY be more
98
+ than one. Multiple spaces do not alter the behaviour of the expression in any
99
+ way.
100
+
101
+ ## Universal Selector
102
+
103
+ W3 chapter: <http://www.w3.org/TR/css3-selectors/#universal-selector>
104
+
105
+ The universal selector `*` (also known as the "wildcard selector") can be used
106
+ to match any element, regardless of its local name or namespace prefix.
107
+
108
+ Example XML:
109
+
110
+ <root>
111
+ <foo></foo>
112
+ <bar></bar>
113
+ </root>
114
+
115
+ CSS:
116
+
117
+ root *
118
+
119
+ This would return a set containing two elements: `<foo>` and `<bar>`
120
+
121
+ The corresponding XPath is also `*`.
122
+
123
+ ### Syntax
124
+
125
+ The syntax for the universal selector is very simple:
126
+
127
+ universal = '*';
128
+
129
+ ## Element Selector
130
+
131
+ W3 chapter: <http://www.w3.org/TR/css3-selectors/#type-selectors>
132
+
133
+ The element selector (known as "Type selector" in the official W3 specification)
134
+ can be used to match a set of elements by their local name or namespace. The
135
+ selector `foo` is used to match all elements with the local name being set to
136
+ `foo`.
137
+
138
+ Example XML:
139
+
140
+ <root>
141
+ <foo />
142
+ <bar />
143
+ </root>
144
+
145
+ CSS:
146
+
147
+ root foo
148
+
149
+ This would return a set with only the `<foo>` element.
150
+
151
+ This selector can be used in combination with the
152
+ [Universal Selector][universal-selector]. This allows one to select elements
153
+ using both a given local name and namespace. The syntax for this is as
154
+ following:
155
+
156
+ ns-prefix|local-name
157
+
158
+ Here the pipe (`|`) character separates the namespace prefix and the local name.
159
+ Both can either be an identifier or a wildcard. For example, the selector
160
+ `rb|foo` matches all elements with local name `foo` and namespace prefix `rb`.
161
+
162
+ The namespace prefix MAY be left out producing the selector `|local-name`. In
163
+ this case the selector only matches elements _without_ a namespace prefix.
164
+
165
+ If a namespace prefix is given and it's _not_ a wildcard then elements without a
166
+ namespace prefix will _not_ be matched.
167
+
168
+ The corresponding XPath expression for such a selector is
169
+ `ns-prefix:local-name`. For example, `rb|foo` in CSS is the same as `rb:foo` in
170
+ XPath.
171
+
172
+ ### Syntax
173
+
174
+ The syntax for just the local name is as following:
175
+
176
+ identifier = '*' | [a-zA-Z]+ [a-zA-Z\-_0-9]*;
177
+
178
+ The wildcard is put in place to allow a single rule to be used for both names
179
+ and wildcards.
180
+
181
+ The syntax for selecting an element including a namespace prefix is as
182
+ following:
183
+
184
+ ns_plus_local_name = identifier* '|' identifier
185
+
186
+ This would match `|foo`, `*|foo` and `foo|bar`. In order to match `foo` the
187
+ regular `identifier` rule declared above can be used.
188
+
189
+ ## Class Selector
190
+
191
+ Class selectors can be used to select a set of elements based on the values set
192
+ in the `class` attribute. Class selectors start with a period (`.`) followed by
193
+ an identifier. Multiple class selectors can be chained together, matching only
194
+ elements that have all the specified classes set.
195
+
196
+ As an example, `.foo` can be used to select all elements that have "foo" set in
197
+ the `class` attribute, either as the sole or one of many values. In turn,
198
+ `.foo.bar` matches elements that have both "foo" and "bar" set as the class.
199
+
200
+ Example XML:
201
+
202
+ <root>
203
+ <a class="first" />
204
+ <b class="second" />
205
+ </root>
206
+
207
+ Using the CSS selector `.first` would return a set containing only the `<a>`
208
+ element. Using `.first.second` would return an empty set, as neither element has
209
+ both classes set.
210
+
211
+ ### Syntax
212
+
213
+ identifier = '*' | [a-zA-Z]+ [a-zA-Z\-_0-9]*;
214
+
215
+ # .foo, .foo.bar, .foo.bar.baz, etc
216
+ class = ('.' identifier)+;
217
+
218
+ ## ID Selector
219
+
220
+ The ID selector can be used to match elements where the value of the `id`
221
+ attribute matches whatever is specified in the selector. ID selectors start with
222
+ a hash sign (`#`) followed by an identifier.
223
+
224
+ While technically multiple ID selectors _can_ be chained together, HTML only
225
+ allows elements to have a single ID. As a result doing so is fairly useless.
226
+ Unlike classes IDs are globally unique, no two elements can have the same ID.
227
+
228
+ Example XML:
229
+
230
+ <root>
231
+ <a id="first" />
232
+ <b id="second" />
233
+ </root>
234
+
235
+ Using the CSS selector `#first` would return a set containing only the `<a>`
236
+ node.
237
+
238
+ ### Syntax
239
+
240
+ identifier = '*' | [a-zA-Z]+ [a-zA-Z\-_0-9]*;
241
+
242
+ # #foo, #foo#bar, etc
243
+ id = ('#' identifier)+;
244
+
245
+ ## Attribute Selector
246
+
247
+ W3 chapter: <http://www.w3.org/TR/css3-selectors/#attribute-selectors>
248
+
249
+ Attribute selectors can be used to further narrow down a set of elements based
250
+ on their attribute list. In XPath these selectors are known as "predicates". For
251
+ example, the selector `foo[bar]` matches all `foo` elements that have a `bar`
252
+ attribute, regardless of the value of said attribute.
253
+
254
+ Example XML:
255
+
256
+ <root>
257
+ <foo number="1" />
258
+ <bar />
259
+ </root>
260
+
261
+ CSS:
262
+
263
+ root foo[number]
264
+
265
+ This would return a set containing only the `<foo>` element since the `<bar>`
266
+ element has no attributes.
267
+
268
+ For the CSS expression `foo[number]` the corresponding XPath expression is the
269
+ following:
270
+
271
+ foo[@number]
272
+
273
+ When specifying an attribute you MAY include an operator and a value to match.
274
+ In this case you MUST include an attribute value surrounded by either single or
275
+ double quotes (but not a combination of the two).
276
+
277
+ There are 6 operators available:
278
+
279
+ * `=`: equals operator
280
+ * `~=`: whitespace-in operator
281
+ * `^=`: starts-with operator
282
+ * `$=`: ends-with operator
283
+ * `*=`: contains operator
284
+ * `|=`: hyphen-starts-with operator
285
+
286
+ ### Equals Operator
287
+
288
+ The equals operator matches an element if a given attribute value equals the
289
+ value specified. For example, `foo[number="1"]` matches all `foo` elements that
290
+ have a `number` attribute whose value is _exactly_ "1".
291
+
292
+ Example XML:
293
+
294
+ <root>
295
+ <foo number="1" />
296
+ <foo number="2" />
297
+ </root>
298
+
299
+ CSS:
300
+
301
+ root foo[number="1"]
302
+
303
+ This would return a set containing only the first `<foo>` element.
304
+
305
+ The corresponding XPath expression is quite similar. For `foo[number="1"]` this
306
+ would be:
307
+
308
+ foo[@number="1"]
309
+
310
+ ### Whitespace-in Operator
311
+
312
+ This operator matches an element if the given attribute value consists of
313
+ space separated values of which one is exactly the given value. For example,
314
+ `foo[numbers~="1"]` matches all `foo` elements that have the value `"1"` in the
315
+ `numbers` attribute.
316
+
317
+ Example XML:
318
+
319
+ <root>
320
+ <foo numbers="1 2 3" />
321
+ <foo numbers="4 bar 6" />
322
+ </root>
323
+
324
+ CSS:
325
+
326
+ root foo[numbers~="1"]
327
+
328
+ This would return a set containing only the first `foo` element. On the other
329
+ hand, if one were to use the expression `root foo[numbers~="bar"]` instead then
330
+ only the second `<foo>` element would be matched.
331
+
332
+ The corresponding XPath expression is quite complex, `foo[numbers~="1"]` is
333
+ translated into the following XPath expression:
334
+
335
+ foo[contains(concat(" ", @numbers, " "), concat(" ", "1", " "))]
336
+
337
+ The `concat` calls are used to ensure the expression doesn't match the substring
338
+ of an attribute value and that the expression matches elements of which the
339
+ attribute only has a single value. If `foo[contains(@numbers, ' 1 ')]` were to
340
+ be used then elements such as `<foo numbers="1" />` would not be matched.
341
+
342
+ Software implementing this selector is free to decide how it concatenates
343
+ spaces around the value to match. Both Oga and Nokogiri use an extra call to
344
+ `concat` but the following would be perfectly valid too:
345
+
346
+ foo[contains(concat(" ", @numbers, " "), " 1 ")]
347
+
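The effect of padding both sides with spaces can be checked outside of XPath. The following is a minimal Ruby sketch that emulates the `contains(concat(...))` test on plain strings; the helper names `whitespace_in?` and `naive_whitespace_in?` are hypothetical and not part of Oga.

```ruby
# Emulates contains(concat(" ", @attr, " "), concat(" ", value, " ")).
# Hypothetical helper, used only to illustrate the padding trick.
def whitespace_in?(attribute_value, wanted)
  padded = " #{attribute_value} "

  padded.include?(" #{wanted} ")
end

# Same test without padding the attribute value, shown for comparison.
def naive_whitespace_in?(attribute_value, wanted)
  attribute_value.include?(" #{wanted} ")
end

p whitespace_in?('1 2 3', '1')     # => true
p whitespace_in?('12 3', '1')      # => false, "1" is only a substring of "12"
p whitespace_in?('1', '1')         # => true
p naive_whitespace_in?('1', '1')   # => false, a single value has no spaces around it
```

Note that, like the XPath expression, this sketch only treats the space character as a separator.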
348
+ ### Starts-with Operator
349
+
350
+ This operator matches elements of which the attribute value starts _exactly_
351
+ with the given value. For example, `foo[numbers^="1"]` would match the element
352
+ `<foo numbers="1 2 3" />` but _not_ the element `<foo numbers="2 3 1" />`.
353
+
354
+ For `foo[numbers^="1"]` the corresponding XPath expression is as following:
355
+
356
+ foo[starts-with(@numbers, "1")]
357
+
358
+ ### Ends-with Operator
359
+
360
+ This operator matches elements of which the attribute value ends _exactly_ with
361
+ the given value. For example, `foo[numbers$="3"]` would match the element `<foo
362
+ numbers="1 2 3" />` but _not_ the element `<foo numbers="2 3 1" />`.
363
+
364
+ The corresponding XPath expression is quite complex due to a lack of a
365
+ `ends-with` function in XPath. Instead one has to resort to using the
366
+ `substring()` function. As such the corresponding XPath expression for
367
+ `foo[bar$="baz"]` is as following:
368
+
369
+ foo[substring(@bar, string-length(@bar) - string-length("baz") + 1, string-length("baz")) = "baz"]
370
+
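The `substring()` arithmetic is easier to follow when written out on a plain string. Below is a small Ruby sketch of the same calculation, assuming 1-based positions as in XPath; `ends_with_via_substring?` is a hypothetical helper name, not an Oga method.

```ruby
# Mirrors substring(@bar, string-length(@bar) - string-length(suffix) + 1,
# string-length(suffix)) = suffix, using XPath's 1-based start position.
def ends_with_via_substring?(value, suffix)
  start = value.length - suffix.length + 1

  # A start position below 1 means the suffix is longer than the value, so
  # the substring can never equal the suffix.
  return false if start < 1

  # Ruby strings are 0-based, hence the "- 1".
  value[start - 1, suffix.length] == suffix
end

p ends_with_via_substring?('1 2 3', '3') # => true
p ends_with_via_substring?('2 3 1', '3') # => false
p ends_with_via_substring?('az', 'baz')  # => false
```

For these inputs the result agrees with Ruby's built-in `String#end_with?`.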
371
+ ### Contains Operator
372
+
373
+ This operator matches elements of which the attribute value contains the given
374
+ value. For example, `foo[bar*="baz"]` would match both `<foo bar="bazzzz" />`
375
+ and `<foo bar="hello baz" />`.
376
+
377
+ For `foo[bar*="baz"]` the corresponding XPath expression is as following:
378
+
379
+ foo[contains(@bar, "baz")]
380
+
381
+ ### Hyphen-starts-with Operator
382
+
383
+ This operator matches elements of which the attribute value is a hyphen
384
+ separated list of values that starts _exactly_ with the given value. For
385
+ example, `foo[numbers|="1"]` matches `<foo numbers="1-2-3" />` but not
386
+ `<foo numbers="2-1-3" />`.
387
+
388
+ For `foo[numbers|="1"]` the corresponding XPath expression is as following:
389
+
390
+ foo[@numbers = "1" or starts-with(@numbers, concat("1", "-"))]
391
+
392
+ Note that this selector will also match elements such as
393
+ `<foo numbers="1- foo bar" />`.
394
+
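The two branches of the XPath expression (exact match, or the value followed by a hyphen) can be sketched in Ruby as follows. The helper name `hyphen_starts_with?` is hypothetical and only serves as an illustration on plain strings.

```ruby
# Mirrors @numbers = "1" or starts-with(@numbers, concat("1", "-")).
def hyphen_starts_with?(value, prefix)
  value == prefix || value.start_with?("#{prefix}-")
end

p hyphen_starts_with?('1-2-3', '1')      # => true
p hyphen_starts_with?('2-1-3', '1')      # => false
p hyphen_starts_with?('1', '1')          # => true
p hyphen_starts_with?('1- foo bar', '1') # => true
p hyphen_starts_with?('12-3', '1')       # => false
```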
395
+ ### Syntax
396
+
397
+ The syntax of the various attribute selectors can be described as following:
398
+
399
+ # Strings are used for the attribute values
400
+
401
+ dquote = '"';
402
+ squote = "'";
403
+
404
+ string_dquote = dquote ^dquote* dquote;
405
+ string_squote = squote ^squote* squote;
406
+
407
+ string = string_dquote | string_squote;
408
+
409
+ # The `identifier` rule is the same as the one used for matching element
410
+ # names.
411
+ attr_test = identifier '[' space* identifier (space* '=' space* string)* space* ']';
412
+
413
+ Whitespace inside the brackets does not affect the behaviour of the selector.
414
+
415
+ ## Pseudo Classes
416
+
417
+ W3 chapter: <http://www.w3.org/TR/css3-selectors/#structural-pseudos>
418
+
419
+ Pseudo classes can be used to further narrow down elements besides just their
420
+ names and attribute values. In essence they are a combination of XPath function
421
+ calls and axes. Some pseudo classes can take an argument to alter their
422
+ behaviour.
423
+
424
+ Pseudo classes are often applied to element selectors. For example:
425
+
426
+ foo:bar
427
+
428
+ Here `:bar` would be a pseudo class applied to the `foo` element. Some pseudo
429
+ classes (e.g. the `:root` pseudo class) can also be used on their own, for
430
+ example:
431
+
432
+ :root
433
+
434
+ ### :root
435
+
436
+ The `:root` pseudo class selects an element only if it's the top-level element
437
+ in a document.
438
+
439
+ Example XML:
440
+
441
+ <root>
442
+ <foo />
443
+ </root>
444
+
445
+ Using the CSS expression `root foo:root` we'd get an empty set as the `<foo>`
446
+ element is not the root element. On the other hand, `root:root` would return a
447
+ set containing only the `<root>` element.
448
+
449
+ This selector can both be applied to an element selector as well as being used
450
+ on its own.
451
+
452
+ For the selector `foo:root` the corresponding XPath expression is as following:
453
+
454
+ foo[not(parent::*)]
455
+
456
+ For `:root` the XPath expression is:
457
+
458
+ *[not(parent::*)]
459
+
460
+ ### :nth-child(n)
461
+
462
+ The `:nth-child(n)` pseudo class can be used to select a set of elements based
463
+ on their position or an interval, skipping elements that occur in a set before
464
+ the given position or interval.
465
+
466
+ In the form `:nth-child(n)` the identifier `n` is an argument that can be used
467
+ to specify one of the following:
468
+
469
+ 1. A literal node set index
470
+ 2. A node interval used to match every N nodes
471
+ 3. A node interval plus an initial offset
472
+
473
+ The first element in a node set for `:nth-child()` is located at position 1,
474
+ _not_ position 0 (unlike most programming languages). As a result
475
+ `:nth-child(1)` matches the _first_ element, _not_ the second. This can be
476
+ visualized as following:
477
+
478
+ :nth-child(2)
479
+
480
+ 1 2 3 4 5 6
481
+ +---+ +---+ +---+ +---+ +---+ +---+
482
+ | | | X | | | | | | | | |
483
+ +---+ +---+ +---+ +---+ +---+ +---+
484
+
485
+ Besides using a literal index argument you can also use an interval, optionally
486
+ with an offset. This can be used, for example, to match every 2nd element, or
487
+ every 2nd element starting at element number 4.
488
+
489
+ The syntax of this argument is as following:
490
+
491
+ integer = ('+' | '-')* [0-9]+;
492
+ interval = ('n' | '-n' | integer 'n') integer;
493
+
494
+ Here `interval` would match any of the following:
495
+
496
+ n
497
+ -n
498
+ 2n
499
+ 2n+5
500
+ 2n-5
501
+ -2n+5
502
+ -2n-5
503
+
504
+ Due to `integer` also matching the `+` and `-` it will be part of the same
505
+ token. If this is not desired the following grammar can be used instead:
506
+
507
+ integer = [0-9]+;
508
+ modifier = '+' | '-';
509
+ interval = ('n' | '-n' | modifier* integer 'n') modifier integer;
510
+
511
+ To match every 2nd element you'd use the following:
512
+
513
+ :nth-child(2n)
514
+
515
+ 1 2 3 4 5 6
516
+ +---+ +---+ +---+ +---+ +---+ +---+
517
+ | | | X | | | | X | | | | X |
518
+ +---+ +---+ +---+ +---+ +---+ +---+
519
+
520
+ To match every 2nd element starting at element 1 you'd instead use this:
521
+
522
+ :nth-child(2n+1)
523
+
524
+ 1 2 3 4 5 6
525
+ +---+ +---+ +---+ +---+ +---+ +---+
526
+ | X | | | | X | | | | X | | |
527
+ +---+ +---+ +---+ +---+ +---+ +---+
528
+
529
+ As mentioned the `+1` in the above example is the initial offset. This is
530
+ however _only_ the case if the second number is positive. That means that for
531
+ `:nth-child(2n-2)` the offset is _not_ `-2`. When using a negative offset the
532
+ actual offset first has to be calculated. When using an argument in the form of
533
+ `An-B` we can calculate the actual offset as following:
534
+
535
+ offset = A - (B % A)
536
+
537
+ For example, for the selector `:nth-child(2n-2)` the formula would be:
538
+
539
+ offset = 2 - (2 % 2) # => 2
540
+
541
+ This would result in the selector `:nth-child(2n+2)`.
542
+
543
+ As another example, for the selector `:nth-child(2n-5)` the formula would be:
544
+
545
+ offset = 2 - (5 % 2) # => 1
546
+
547
+ Which would result in the selector `:nth-child(2n+1)`
548
+
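The offset formula can be checked by brute force. The sketch below assumes `B` is passed as a positive magnitude (so `3n-1` is `a = 3, b = 1`); with a signed value the result depends on how the language's `%` operator treats negative operands. The helper names `positive_offset` and `matched_positions` are hypothetical.

```ruby
# Turns `An-B` into the offset of the equivalent `An+offset` form.
# `a` and `b` are both expected to be positive Integers.
def positive_offset(a, b)
  a - (b % a)
end

# All positions in 1..count that can be written as a * n + c for some n >= 0.
def matched_positions(a, c, count)
  (0..count + c.abs)
    .map { |n| a * n + c }
    .select { |position| position.between?(1, count) }
end

p positive_offset(2, 2) # => 2
p positive_offset(2, 5) # => 1
p positive_offset(3, 1) # => 2

# 3n-1 and 3n+2 select the same positions.
p matched_positions(3, -1, 10)                    # => [2, 5, 8]
p matched_positions(3, positive_offset(3, 1), 10) # => [2, 5, 8]
```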
549
+ To ease the process of selecting even and uneven elements you can also use
550
+ `even` and `odd` as an argument. Using `:nth-child(even)` is the same as
551
+ `:nth-child(2n)` while using `:nth-child(odd)` in turn is the same as
552
+ `:nth-child(2n+1)`.
553
+
554
+ Using `:nth-child(n)` simply matches all elements in the set. Using
555
+ `:nth-child(-n)` doesn't match any elements, though Oga treats it the same as
556
+ `:nth-child(n)`.
557
+
558
+ Expressions such as `:nth-child(-n-5)` are invalid as both parts of the interval
559
+ (`-n` and `-5`) are negative. However, `:nth-child(-n+5)` is
560
+ perfectly valid and would match the first 5 elements in a set:
561
+
562
+ :nth-child(-n+5)
563
+
564
+ 1 2 3 4 5 6
565
+ +---+ +---+ +---+ +---+ +---+ +---+
566
+ | X | | X | | X | | X | | X | | |
567
+ +---+ +---+ +---+ +---+ +---+ +---+
568
+
569
+
570
+ Using `:nth-child(n+5)` would match all elements starting at element 5:
571
+
572
+ :nth-child(n+5)
573
+
574
+ 1 2 3 4 5 6 7 8 9 10
575
+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
576
+ | | | | | | | | | X | | X | | X | | X | | X | | X |
577
+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
578
+
579
+ To summarize:
580
+
581
+ :nth-child(n) => matches all elements
582
+ :nth-child(-n) => matches nothing, though Oga treats it the same as "n"
583
+ :nth-child(5) => matches element #5
584
+ :nth-child(2n) => matches every 2 elements
585
+ :nth-child(2n+2) => matches every 2 elements, starting at element 2
586
+ :nth-child(2n-2) => matches every 2 elements, starting at element 2
587
+ :nth-child(n+5) => matches all elements, starting at element 5
588
+ :nth-child(-n+5) => matches the first 5 elements
589
+ :nth-child(even) => matches every 2nd element, starting at element 2
590
+ :nth-child(odd) => matches every 2nd element, starting at element 1
591
+
592
+ The corresponding XPath expressions are quite complex and differ based on the
593
+ interval argument used. For the various forms the corresponding XPath
594
+ expressions are as following:
595
+
596
+ :nth-child(n) => *[((count(preceding-sibling::*) + 1) mod 1) = 0]
597
+ :nth-child(-n) => *[((count(preceding-sibling::*) + 1) mod 1) = 0]
598
+ :nth-child(5) => *[count(preceding-sibling::*) = 4]
599
+ :nth-child(2n) => *[((count(preceding-sibling::*) + 1) mod 2) = 0]
600
+ :nth-child(2n+2) => *[(count(preceding-sibling::*) + 1) >= 2 and (((count(preceding-sibling::*) + 1) - 2) mod 2) = 0]
601
+ :nth-child(2n-6) => *[(count(preceding-sibling::*) + 1) >= 2 and (((count(preceding-sibling::*) + 1) - 2) mod 2) = 0]
602
+ :nth-child(n+5) => *[(count(preceding-sibling::*) + 1) >= 5 and (((count(preceding-sibling::*) + 1) - 5) mod 1) = 0]
603
+ :nth-child(-n+6) => *[((count(preceding-sibling::*) + 1) <= 6) and (((count(preceding-sibling::*) + 1) - 6) mod 1) = 0]
604
+ :nth-child(even) => *[((count(preceding-sibling::*) + 1) mod 2) = 0]
605
+ :nth-child(odd) => *[(count(preceding-sibling::*) + 1) >= 1 and (((count(preceding-sibling::*) + 1) - 1) mod 2) = 0]
606
+
607
+ ### :nth-last-child(n)
608
+
609
+ The `:nth-last-child(n)` pseudo class can be used to select a set of elements
610
+ based on their position or an interval, skipping elements that occur in a set
611
+ after the given position or interval.
612
+
613
+ The arguments that can be used by this selector are the same as those mentioned
614
+ in [:nth-child(n)][nth-childn].
615
+
616
+ Because this selector matches in reverse (compared to
617
+ [:nth-child(n)][nth-childn]) using an index such as "1" will match the _last_
618
+ element in a set, not the first one:
619
+
620
+ :nth-last-child(1)
621
+
622
+   1     2     3     4     5     6
623
+ +---+ +---+ +---+ +---+ +---+ +---+
624
+ | | | | | | | | | | | X | <- matching direction
625
+ +---+ +---+ +---+ +---+ +---+ +---+
626
+
627
+ When using an interval (with or without an offset) the nodes are also matched in
628
+ reverse order. However, matched nodes should be returned in the order they
629
+ appear in the document.
630
+
631
+ For example, the selector `:nth-last-child(2n)` would match as follows:
632
+
633
+ :nth-last-child(2n)
634
+
635
+   1     2     3     4     5     6
636
+ +---+ +---+ +---+ +---+ +---+ +---+
637
+ | X | | | | X | | | | X | | | <- matching direction
638
+ +---+ +---+ +---+ +---+ +---+ +---+
639
+
640
+ The resulting set however would contain the nodes in the order `[1, 3, 5]`
641
+ instead of `[5, 3, 1]`.
642
+
643
+ When using an interval with an initial offset the offset is also applied in
644
+ reverse order. For example, the selector `:nth-last-child(2n+1)` would match as
645
+ follows:
646
+
647
+ :nth-last-child(2n+1)
648
+
649
+   1     2     3     4     5     6
650
+ +---+ +---+ +---+ +---+ +---+ +---+
651
+ | | | X | | | | X | | | | X | <- matching direction
652
+ +---+ +---+ +---+ +---+ +---+ +---+
653
+
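The reverse matching shown in the diagrams above can be sketched in Ruby. This is an illustration rather than Oga's implementation: `nth_last_positions` is a hypothetical helper that counts each position from the end of the set, applies the `an+b` rule (assuming the CSS3 semantics where `n >= 0`), and returns the matches in document order.

```ruby
# Hypothetical helper (not part of Oga): returns the 1-based positions in a
# set of `count` siblings that `:nth-last-child(an+b)` selects.
def nth_last_positions(a, b, count)
  (1..count).select do |pos|
    # Position of this element when counting from the end of the set.
    from_end = count - pos + 1
    diff     = from_end - b

    if a.zero?
      diff.zero?
    else
      (diff % a).zero? && (diff / a) >= 0
    end
  end
end

nth_last_positions(0, 1, 6) # :nth-last-child(1)    => [6]
nth_last_positions(2, 0, 6) # :nth-last-child(2n)   => [1, 3, 5]
nth_last_positions(2, 1, 6) # :nth-last-child(2n+1) => [2, 4, 6]
```

Because the helper iterates over the positions front to back and only the test is reversed, the result is already in document order.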
654
+ The corresponding XPath expressions are similar to those used for
655
+ [:nth-child(n)][nth-childn]:
656
+
657
+ :nth-last-child(n) => *[count(following-sibling::*) = -1]
658
+ :nth-last-child(-n) => *[count(following-sibling::*) = -1]
659
+ :nth-last-child(5) => *[count(following-sibling::*) = 4]
660
+ :nth-last-child(2n) => *[((count(following-sibling::*) + 1) mod 2) = 0]
661
+ :nth-last-child(2n+2) => *[((count(following-sibling::*) + 1) >= 2) and ((((count(following-sibling::*) + 1) - 2) mod 2) = 0)]
662
+ :nth-last-child(2n-6) => *[((count(following-sibling::*) + 1) >= 2) and ((((count(following-sibling::*) + 1) - 2) mod 2) = 0)]
663
+ :nth-last-child(n+5) => *[((count(following-sibling::*) + 1) >= 5) and ((((count(following-sibling::*) + 1) - 5) mod 1) = 0)]
664
+ :nth-last-child(-n+6) => *[((count(following-sibling::*) + 1) <= 6) and ((((count(following-sibling::*) + 1) - 6) mod 1) = 0)]
665
+ :nth-last-child(even) => *[((count(following-sibling::*) + 1) mod 2) = 0]
666
+ :nth-last-child(odd) => *[((count(following-sibling::*) + 1) >= 1) and ((((count(following-sibling::*) + 1) - 1) mod 2) = 0)]
667
+
668
+ ### :nth-of-type(n)
669
+
670
+ The `:nth-of-type(n)` pseudo class can be used to select elements based on
671
+ their position among the siblings with the same name. The arguments that can be
672
+ used by this selector are the same as those mentioned in
673
+ [:nth-child(n)][nth-childn].
674
+
675
+ The matching order of this selector is the same as [:nth-child(n)][nth-childn].
676
+
677
+ Example XML:
678
+
679
+ <root>
680
+ <foo />
681
+ <foo />
682
+ <foo />
683
+ <foo />
684
+ <bar />
685
+ </root>
686
+
687
+ Using the CSS expression `root foo:nth-of-type(even)` would return a set
688
+ containing the 2nd and 4th `<foo>` nodes.
689
+
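To make the "position among siblings of the same name" idea concrete, here is a small Ruby sketch. It is independent of Oga and works on a plain array of element names; `even_of_type` is a hypothetical helper name.

```ruby
# Hypothetical helper (not part of Oga): given the names of sibling elements in
# document order, returns the 0-based indexes of the siblings called `name`
# whose position among same-named siblings is even.
def even_of_type(sibling_names, name)
  position = 0

  sibling_names.each_index.select do |index|
    next false unless sibling_names[index] == name

    position += 1
    position.even?
  end
end

siblings = %w[foo foo foo foo bar]

even_of_type(siblings, 'foo') # => [1, 3], the 2nd and 4th <foo>
even_of_type(siblings, 'bar') # => [], the only <bar> is at position 1
```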
690
+ The corresponding XPath expressions for the various forms of this pseudo class
691
+ are as follows:
692
+
693
+ :nth-of-type(n) => *[position() = n]
694
+ :nth-of-type(-n) => *[position() = -n]
695
+ :nth-of-type(5) => *[position() = 5]
696
+ :nth-of-type(2n) => *[(position() mod 2) = 0]
697
+ :nth-of-type(2n+2) => *[(position() >= 2) and (((position() - 2) mod 2) = 0)]
698
+ :nth-of-type(2n-6) => *[(position() >= 2) and (((position() - 2) mod 2) = 0)]
699
+ :nth-of-type(n+5) => *[(position() >= 5) and (((position() - 5) mod 1) = 0)]
700
+ :nth-of-type(-n+6) => *[(position() <= 6) and (((position() - 6) mod 1) = 0)]
701
+ :nth-of-type(even) => *[(position() mod 2) = 0]
702
+ :nth-of-type(odd) => *[(position() >= 1) and (((position() - 1) mod 2) = 0)]
703
+
704
+ ### :nth-last-of-type(n)
705
+
706
+ The `:nth-last-of-type(n)` pseudo class behaves the same as
707
+ [:nth-of-type(n)][nth-of-typen] except it matches nodes in reverse order
708
+ similar to [:nth-last-child(n)][nth-last-childn]. To clarify, this means
709
+ matching occurs as follows:
710
+
711
+
712
+ :nth-last-of-type(1)
713
+
714
+   1     2     3     4     5     6
715
+ +---+ +---+ +---+ +---+ +---+ +---+
716
+ | | | | | | | | | | | X | <- matching direction
717
+ +---+ +---+ +---+ +---+ +---+ +---+
718
+
719
+ Example XML:
720
+
721
+ <root>
722
+ <foo />
723
+ <foo />
724
+ <foo />
725
+ <foo />
726
+ <bar />
727
+ </root>
728
+
729
+ Using the CSS expression `root foo:nth-last-of-type(even)` would return a set
730
+ containing the 1st and 3rd `<foo>` nodes.
731
+
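The result above follows from counting the `<foo>` siblings from the end. A minimal Ruby sketch of that counting, independent of Oga and using the hypothetical helper name `last_even_of_type`:

```ruby
# Hypothetical helper (not part of Oga): returns the 0-based indexes of the
# siblings called `name` whose position among same-named siblings, counted
# from the end, is even.
def last_even_of_type(sibling_names, name)
  # Indexes of all siblings with the requested name, in document order.
  indexes = sibling_names.each_index.select { |i| sibling_names[i] == name }

  indexes.select.with_index do |_, i|
    from_end = indexes.length - i
    from_end.even?
  end
end

siblings = %w[foo foo foo foo bar]

last_even_of_type(siblings, 'foo') # => [0, 2], the 1st and 3rd <foo>
```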
732
+ The corresponding XPath expressions for the various forms of this pseudo class
733
+ are as follows:
734
+
735
+ :nth-last-of-type(n) => *[position() = last() - -1]
736
+ :nth-last-of-type(-n) => *[position() = last() - -1]
737
+ :nth-last-of-type(5) => *[position() = last() - 4]
738
+ :nth-last-of-type(2n) => *[((last() - position()+1) mod 2) = 0]
739
+ :nth-last-of-type(2n+2) => *[((last() - position()+1) >= 2) and ((((last() - position() + 1) - 2) mod 2) = 0)]
740
+ :nth-last-of-type(2n-6) => *[((last() - position()+1) >= 2) and ((((last() - position() + 1) - 2) mod 2) = 0)]
741
+ :nth-last-of-type(n+5) => *[((last() - position()+1) >= 5) and ((((last() - position() + 1) - 5) mod 1) = 0)]
742
+ :nth-last-of-type(-n+6) => *[((last() - position()+1) <= 6) and ((((last() - position() + 1) - 6) mod 1) = 0)]
743
+ :nth-last-of-type(even) => *[((last() - position()+1) mod 2) = 0]
744
+ :nth-last-of-type(odd) => *[((last() - position()+1) >= 1) and ((((last() - position() + 1) - 1) mod 2) = 0)]
745
+
746
+ ### :first-child
747
+
748
+ The `:first-child` pseudo class can be used to match a node that is the first
749
+ child node of another node (= a node without any preceding nodes).
750
+
751
+ Example XML:
752
+
753
+ <root>
754
+ <foo />
755
+ <bar />
756
+ </root>
757
+
758
+ Using the CSS selector `root :first-child` would return a set containing only
759
+ the `<foo>` node.
760
+
761
+ The corresponding XPath expression for this pseudo class is as follows:
762
+
763
+ :first-child => *[count(preceding-sibling::*) = 0]
764
+
765
+ ### :last-child
766
+
767
+ The `:last-child` pseudo class can be used to match a node that is the last
768
+ child node of another node (= a node without any following nodes).
769
+
770
+ Example XML:
771
+
772
+ <root>
773
+ <foo />
774
+ <bar />
775
+ </root>
776
+
777
+ Using the CSS selector `root :last-child` would return a set containing only
778
+ the `<bar>` node.
779
+
780
+ The corresponding XPath expression for this pseudo class is as follows:
781
+
782
+ :last-child => *[count(following-sibling::*) = 0]
783
+
784
+ ### :first-of-type
785
+
786
+ The `:first-of-type` pseudo class matches elements that are the first sibling of
787
+ their type in the list of elements of their parent element. This selector is the
788
+ same as [:nth-of-type(1)][nth-of-typen].
789
+
790
+ Example XML:
791
+
792
+ <root>
793
+ <a id="1" />
794
+ <a id="2">
795
+ <a id="3" />
796
+ <a id="4" />
797
+ </a>
798
+ </root>
799
+
800
+ Using the CSS selector `root a:first-of-type` would return a node set containing
801
+ nodes `<a id="1">` and `<a id="3">` as both nodes are the first siblings of
802
+ their type.
803
+
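The reason both `<a id="1">` and `<a id="3">` match is that the check is made per parent. The following Ruby sketch illustrates this on a hand-built tree; `Node` and `first_of_type` are hypothetical names used only for illustration and are not part of Oga.

```ruby
# Hypothetical minimal tree node (not Oga's element class).
Node = Struct.new(:name, :id, :children)

# Collects the ids of every descendant called `name` that is the first
# sibling with that name under its own parent, in document order.
def first_of_type(node, name, found = [])
  seen = false

  node.children.each do |child|
    if child.name == name && !seen
      found << child.id
      seen = true
    end

    # Each parent gets its own "seen" flag through the recursive call.
    first_of_type(child, name, found)
  end

  found
end

root = Node.new('root', nil, [
  Node.new('a', 1, []),
  Node.new('a', 2, [Node.new('a', 3, []), Node.new('a', 4, [])])
])

first_of_type(root, 'a') # => [1, 3]
```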
804
+ The corresponding XPath for this pseudo class is as follows:
805
+
806
+ a:first-of-type => a[count(preceding-sibling::a) = 0]
807
+
808
+ An alternative way is to use the following XPath:
809
+
810
+ a:first-of-type => //a[position() = 1]
811
+
812
+ This however relies on the less efficient `descendant-or-self::node()` selector.
813
+ For querying larger documents it's recommended to use the first form instead.
814
+
815
+ ### :last-of-type
816
+
817
+ The `:last-of-type` pseudo class can be used to match elements that are the last
818
+ sibling of their type in the list of elements of their parent. This selector is the
819
+ same as [:nth-last-of-type(1)][nth-last-of-typen].
820
+
821
+ Example XML:
822
+
823
+ <root>
824
+ <a id="1" />
825
+ <a id="2">
826
+ <a id="3" />
827
+ <a id="4" />
828
+ </a>
829
+ </root>
830
+
831
+ Using the CSS selector `root a:last-of-type` would return a set containing nodes
832
+ `<a id="2">` and `<a id="4">` as both nodes are the last siblings of their type.
833
+
834
+ The corresponding XPath for this pseudo class is as follows:
835
+
836
+ a:last-of-type => a[count(following-sibling::a) = 0]
837
+
838
+ Similar to [:first-of-type][first-of-typen] this XPath can alternatively be
839
+ written as follows:
840
+
841
+ a:last-of-type => //a[position() = last()]
842
+
843
+ ### :only-child
844
+
845
+ The `:only-child` pseudo class can be used to match elements that are the only
846
+ child element of their parent.
847
+
848
+ Example XML:
849
+
850
+ <root>
851
+ <a id="1" />
852
+ <a id="2">
853
+ <a id="3" />
854
+ </a>
855
+ </root>
856
+
857
+ Using the CSS selector `root a:only-child` would return a set containing only
858
+ the `<a id="3">` node.
859
+
860
+ The corresponding XPath for this pseudo class is as follows:
861
+
862
+ a:only-child => a[count(preceding-sibling::*) = 0 and count(following-sibling::*) = 0]
863
+
864
+ ### :only-of-type
865
+
866
+ The `:only-of-type` pseudo class can be used to match elements that are the only
867
+ child element of their type within their parent.
868
+
869
+ Example XML:
870
+
871
+ <root>
872
+ <a id="1" />
873
+ <a id="2">
874
+ <a id="3" />
875
+ <b id="4" />
876
+ </a>
877
+ </root>
878
+
879
+ Using the CSS selector `root a:only-of-type` would return a set containing
880
+ only the `<a id="3">` node due to it being the only `<a>` node in the list of
881
+ elements of its parent.
882
+
883
+ The corresponding XPath for this pseudo class is as follows:
884
+
885
+ a:only-of-type => a[count(preceding-sibling::a) = 0 and count(following-sibling::a) = 0]
886
+
887
+ ### :empty
888
+
889
+ The `:empty` pseudo class can be used to match elements that have no child nodes
890
+ at all.
891
+
892
+ Example XML:
893
+
894
+ <root>
895
+ <a />
896
+ <b>10</b>
897
+ </root>
898
+
899
+ Using the CSS selector `root :empty` would return a set containing only the
900
+ `<a>` node.
901
+
902
+ ### Syntax
903
+
904
+ The syntax of the various pseudo classes is as follows:
905
+
906
+ integer = ('+' | '-')* [0-9]+;
907
+
908
+ odd = 'odd';
909
+ even = 'even';
910
+ nth = 'n';
911
+
912
+ pseudo_arg_interval = '-'* integer* nth;
913
+ pseudo_arg_offset = ('+' | '-')* integer;
914
+
915
+ pseudo_arg = odd
916
+ | even
917
+ | '-'* nth
918
+ | integer
919
+ | pseudo_arg_interval
920
+ | pseudo_arg_interval pseudo_arg_offset;
921
+
922
+ # The `identifier` rule is the same as the one used for element names.
923
+ pseudo = ':' identifier ('(' space* pseudo_arg space* ')')*;
924
+
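As a rough illustration of the `pseudo_arg` rule, the Ruby sketch below turns an argument string into an `[interval, offset]` pair. It is a simplified approximation and not Oga's lexer: `parse_pseudo_arg` is a hypothetical name, it accepts at most one sign per number (the grammar above allows repeated signs), and it does not allow whitespace inside the argument.

```ruby
# Hypothetical helper (not part of Oga): parses a pseudo class argument into
# [interval, offset], so that "2n+1" becomes [2, 1] and "5" becomes [0, 5].
def parse_pseudo_arg(arg)
  text = arg.strip

  case text
  when 'odd'
    [2, 1]
  when 'even'
    [2, 0]
  when /\A([+-]?\d*)n([+-]\d+)?\z/
    step   = Regexp.last_match(1)
    offset = Regexp.last_match(2)

    interval =
      case step
      when '', '+' then 1   # "n" and "+n"
      when '-'     then -1  # "-n"
      else step.to_i        # "2n", "-3n", ...
      end

    # A missing offset is nil, and nil.to_i is 0.
    [interval, offset.to_i]
  when /\A[+-]?\d+\z/
    [0, text.to_i]
  else
    raise ArgumentError, "unsupported argument: #{arg.inspect}"
  end
end

parse_pseudo_arg('2n+1') # => [2, 1]
parse_pseudo_arg('-n+5') # => [-1, 5]
parse_pseudo_arg('n')    # => [1, 0]
parse_pseudo_arg('5')    # => [0, 5]
parse_pseudo_arg('odd')  # => [2, 1]
```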
925
+ [w3spec]: http://www.w3.org/TR/css3-selectors/
926
+ [rfc-2119]: https://www.ietf.org/rfc/rfc2119.txt
927
+ [kramdown]: http://kramdown.gettalong.org/
928
+ [universal-selector]: #universal-selector
929
+ [ragel]: http://www.colm.net/open-source/ragel/
930
+ [nth-childn]: #nth-childn
931
+ [nth-last-childn]: #nth-last-childn
932
+ [nth-last-of-typen]: #nth-last-of-typen
933
+ [nth-of-typen]: #nth-of-typen
934
+ [nth-last-of-typen]: #nth-last-of-typen
935
+ [first-of-typen]: #first-of-type