oga 0.1.3 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 8b60359c51f8e8eb14fb35fbc7ffdd6da99a6d50
4
- data.tar.gz: b7c569316e309823375f84dc628c1f170d3f2540
3
+ metadata.gz: 9abda7194e4d0f181bf8a43c5d5154c965fd1d81
4
+ data.tar.gz: 8cd27a710c2c761ffd37d9393e53e3b78f53444e
5
5
  SHA512:
6
- metadata.gz: 57224b74df2f069a99826dcbc49d1bf4efdb7d31c5568c23e8b817952d3708b365198fcbafdd82d95e0e1f7af9b221d6b6a71cc1d09fd4cf2951d177fa4ed456
7
- data.tar.gz: b7103802563321ebe4174a328e65d99a00df973abca79e1ef02544e0bcf2484e679db8009c9de98cb971d8912bfe573f0d0250bbbc25f77c5764e6d43f190d6c
6
+ metadata.gz: 556c869e33dfe785eda199e42a5d2fe869e269c8522809b81d8febdefe88285c52a3c7b17e22524b031cf5698395c505fd82f9ef8d69db538f9ac7f19f761e47
7
+ data.tar.gz: 149034fbe883e5e0df5f805aa17b07cf12afc8cccefbc22d9f73e6dcc6ba2b6fbe1ef39706f4945fbab90e9ecdcb208b4d299f939eb428706f724f657ecbc822
data/README.md CHANGED
@@ -70,6 +70,12 @@ Querying a document using XPath:
70
70
 
71
71
  document.xpath('string(people/person)') # => "Alice"
72
72
 
73
+ Querying a document using CSS:
74
+
75
+ document = Oga.parse_xml('<people><person>Alice</person></people>')
76
+
77
+ document.css('people person') # => NodeSet(Element(name: "person" ...))
78
+
73
79
  Modifying a document and serializing it back to XML:
74
80
 
75
81
  document = Oga.parse_xml('<people><person>Alice</person></people>')
@@ -95,6 +101,7 @@ Querying a document using a namespace:
95
101
  * Low memory footprint
96
102
  * High performance, if something doesn't perform well enough it's a bug
97
103
  * Support for XPath 1.0
104
+ * CSS3 selector support
98
105
  * XML namespace support (registering, querying, etc)
99
106
 
100
107
  ## Requirements
@@ -127,6 +134,53 @@ _not_ thread-safe and should not be done by multiple threads at once.
127
134
  It is advised that you do not share parsed documents between threads unless you
128
135
  _really_ have to.
129
136
 
137
+ ## Namespace Support
138
+
139
+ Oga fully supports parsing/registering XML namespaces as well as querying them
140
+ using XPath. For example, take the following XML:
141
+
142
+ <root xmlns="http://example.com">
143
+ <bar>bar</bar>
144
+ </root>
145
+
146
+ If one were to try and query the `bar` element (e.g. using XPath `root/bar`)
147
+ they'd end up with an empty node set. This is due to `<root>` defining an
148
+ alternative default namespace. Instead you can query this element using the
149
+ following XPath:
150
+
151
+ *[local-name() = "root"]/*[local-name() = "bar"]
152
+
153
+ Alternatively, if you don't really care where the `<bar>` element is located you
154
+ can use the following:
155
+
156
+ descendant::*[local-name() = "bar"]
157
+
158
+ And if you want to specify an explicit namespace URI, you can use this:
159
+
160
+ descendant::*[local-name() = "bar" and namespace-uri() = "http://example.com"]
161
+
162
+ Unlike Nokogiri, Oga does _not_ provide a way to create "dynamic" namespaces.
163
+ That is, Nokogiri allows one to query the above document as following:
164
+
165
+ document = Nokogiri::XML('<root xmlns="http://example.com"><bar>bar</bar></root>')
166
+
167
+ document.xpath('x:root/x:bar', :x => 'http://example.com')
168
+
169
+ Oga does have a small trick you can use to cut down the size of your XPath
170
+ queries. Because Oga assigns the name "xmlns" to default namespaces you can use
171
+ this in your XPath queries:
172
+
173
+ document = Oga.parse_xml('<root xmlns="http://example.com"><bar>bar</bar></root>')
174
+
175
+ document.xpath('xmlns:root/xmlns:bar')
176
+
177
+ When using this you can still restrict the query to the correct namespace URI:
178
+
179
+ document.xpath('xmlns:root[namespace-uri() = "http://example.com"]/xmlns:bar')
180
+
181
+ In the future I might add an API to ease this process, although at this time I
182
+ have little interest in providing an API similar to Nokogiri.
183
+
130
184
  ## Documentation
131
185
 
132
186
  The documentation is best viewed [on the documentation website][doc-website].
@@ -134,6 +188,9 @@ The documentation is best viewed [on the documentation website][doc-website].
134
188
  * {file:CONTRIBUTING Contributing}
135
189
  * {file:changelog Changelog}
136
190
  * {file:migrating\_from\_nokogiri Migrating From Nokogiri}
191
+ * {Oga::XML::Parser XML Parser}
192
+ * {Oga::XML::SaxParser XML SAX Parser}
193
+ * {file:xml\_namespaces XML Namespaces}
137
194
 
138
195
  ## Native Extension Setup
139
196
 
@@ -3,6 +3,134 @@
3
3
  This document contains details of the various releases and their release dates.
4
4
  Dates are in the format `yyyy-mm-dd`.
5
5
 
6
+ ## 0.2.0 - 2014-11-17
7
+
8
+ ### CSS Selector Support
9
+
10
+ Probably the biggest feature of this release: support for querying documents
11
+ using CSS selectors. Oga supports a subset of the CSS3 selector specification,
12
+ in particular the following selectors are supported:
13
+
14
+ * Element, class and ID selectors
15
+ * Attribute selectors (e.g. `foo[x ~= "y"]`)
16
+
17
+ The following pseudo classes are supported:
18
+
19
+ * `:root`
20
+ * `:nth-child(n)`
21
+ * `:nth-last-child(n)`
22
+ * `:nth-of-type(n)`
23
+ * `:nth-last-of-type(n)`
24
+ * `:first-child`
25
+ * `:last-child`
26
+ * `:first-of-type`
27
+ * `:last-of-type`
28
+ * `:only-child`
29
+ * `:only-of-type`
30
+ * `:empty`
31
+
32
+ You can use CSS selectors using the methods `css` and `at_css` on an instance of
33
+ `Oga::XML::Document` or `Oga::XML::Element`. For example:
34
+
35
+ document = Oga.parse_xml('<people><person>Alice</person></people>')
36
+
37
+ document.css('people person') # => NodeSet(Element(name: "person" ...))
38
+
39
+ The architecture behind this is quite similar to parsing XPath. There's a lexer
40
+ (`Oga::CSS::Lexer`) and a parser (`Oga::CSS::Parser`). Unlike Nokogiri (and
41
+ perhaps other libraries) the parser _does not_ output XPath expressions as a
42
+ String or a CSS specific AST. Instead it directly emits an XPath AST. This
43
+ allows the resulting AST to be directly evaluated by `Oga::XPath::Evaluator`.
44
+
45
+ See <https://github.com/YorickPeterse/oga/issues/11> for more information.
46
+
47
+ ### Multi-line Attribute Support
48
+
49
+ Oga can now lex/parse elements that have attributes with newlines in them.
50
+ Previously this would trigger memory allocation errors.
51
+
52
+ See <https://github.com/YorickPeterse/oga/issues/58> for more information.
53
+
54
+ ### SAX after_element
55
+
56
+ The `after_element` method in the SAX parsing API now always takes two
57
+ arguments: the namespace name and element name. Previously this method would
58
+ always receive a single nil value as its argument, which is rather pointless.
59
+
60
+ See <https://github.com/YorickPeterse/oga/issues/54> for more information.
61
+
62
+ ### XPath Grouping
63
+
64
+ XPath expressions can now be grouped together using parentheses. This allows one
65
+ to specify a custom operator precedence.
66
+
67
+ ### Enumerator Parsing Input
68
+
69
+ Enumerator instances can now be used as input for `Oga.parse_xml` and friends.
70
+ This can be used to download and parse XML files on the fly. For example:
71
+
72
+ enum = Enumerator.new do |yielder|
73
+ HTTPClient.get('http://some-website.com/some-big-file.xml') do |chunk|
74
+ yielder << chunk
75
+ end
76
+ end
77
+
78
+ document = Oga.parse_xml(enum)
79
+
80
+ See <https://github.com/YorickPeterse/oga/issues/48> for more information.
81
+
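How chunks flow through an Enumerator can be tried out without a network call. The sketch below is only an illustration: `each_chunk` is a hypothetical stand-in for an HTTP client that yields a response body piece by piece, and the chunk size of 8 is arbitrary. It does not call Oga itself.

```ruby
# Hypothetical stand-in for an HTTP client that yields a body in chunks.
def each_chunk(body, size)
  offset = 0
  while offset < body.length
    yield body[offset, size]
    offset += size
  end
end

xml = '<people><person>Alice</person></people>'

enum = Enumerator.new do |yielder|
  each_chunk(xml, 8) { |chunk| yielder << chunk }
end

# A consumer pulls chunks one at a time; nothing is produced until asked.
first_chunk = enum.next

# Iterating the Enumerator from the start yields every chunk in order.
all_chunks = enum.to_a
```

An Enumerator built this way is the kind of object that, per the text above, can be handed to `Oga.parse_xml`.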
82
+ ### Removing Attributes
83
+
84
+ Element attributes can now be removed using `Oga::XML::Element#unset`:
85
+
86
+ element = Oga::XML::Element.new(:name => 'foo')
87
+
88
+ element.set('class', 'foo')
89
+ element.unset('class')
90
+
91
+ ### XPath Attributes
92
+
93
+ XPath predicates are now evaluated for every context node as opposed to being
94
+ evaluated once for the entire context. This ensures that expressions such as
95
+ `descendant-or-self::node()/foo[1]` are evaluated correctly.
96
+
97
+ ### Available Namespaces
98
+
99
+ When calling `Oga::XML::Element#available_namespaces` the Hash returned by
100
+ `Oga::XML::Element#namespaces` would be modified in place. This was a bug that
101
+ has been fixed in this release.
102
+
103
+ ### NodeSets
104
+
105
+ NodeSet instances can now be compared with each other using `==`. Previously
106
+ this would always consider two instances to be different from each other due to
107
+ the usage of the default `Object#==` method.
108
+
109
+ ### XML Entities
110
+
111
+ XML entities such as `&amp;` and `&lt;` are now encoded/decoded by the lexer,
112
+ string and text nodes.
113
+
114
+ See <https://github.com/YorickPeterse/oga/issues/49> for more information.
115
+
116
+ ### General
117
+
118
+ Source lines are no longer included in error messages generated by the XML
119
+ parser. This simplifies the code and removes the need of re-reading the input
120
+ (in case of IO/Enumerable inputs).
121
+
122
+ ### XML Lexer Newlines
123
+
124
+ Newlines in the XML lexer are now counted in native code (C/Java). On MRI and
125
+ JRuby the improvement is quite small, but on Rubinius it's a massive
126
+ improvement. See commit `8db77c0a09bf6c996dd2856a6dbe1ad076b1d30a` for more
127
+ information.
128
+
129
+ ### HTML Void Element Performance
130
+
131
+ Performance for detecting HTML void elements (e.g. `<br>` and `<link>`) has been
132
+ improved by removing String allocations that were not needed.
133
+
6
134
  ## 0.1.3 - 2014-09-24
7
135
 
8
136
  This release fixes a problem with serializing attributes using the namespace
@@ -6,11 +6,12 @@ body
6
6
  max-width: 960px;
7
7
  }
8
8
 
9
- p code
9
+ p code, dd code, li code
10
10
  {
11
- background: #f2f2f2;
12
- padding-left: 3px;
13
- padding-right: 3px;
11
+ background: #f9f2f4;
12
+ color: #c7254e;
13
+ border-radius: 4px;
14
+ padding: 2px 4px;
14
15
  }
15
16
 
16
17
  pre.code
@@ -0,0 +1,935 @@
1
+ # CSS Selectors Specification
2
+
3
+ This document acts as an alternative specification to the official W3
4
+ [CSS3 Selectors Specification][w3spec]. This document specifies only the
5
+ selectors supported by Oga itself. Only CSS3 selectors are covered, CSS4 is not
6
+ part of this specification.
7
+
8
+ This document is best viewed in the YARD generated documentation or any other
9
+ Markdown viewer that supports the [Kramdown][kramdown] syntax. Alternatively it
10
+ can be viewed in its raw form.
11
+
12
+ ## Abstract
13
+
14
+ The official W3 specification on CSS selectors is anything but pleasant to read.
15
+ A lack of good examples and unspecified behaviour are just two of many problems.
16
+ This document was written as a reference guide for myself as well as a way for
17
+ others to more easily understand how CSS selectors work.
18
+
19
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
20
+ "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
21
+ interpreted as described in [RFC 2119][rfc-2119].
22
+
23
+ ## Syntax
24
+
25
+ To describe syntax elements of CSS selectors this document uses the same grammar
26
+ as [Ragel][ragel]. For example, an integer would be defined as following:
27
+
28
+ integer = [0-9]+;
29
+
30
+ In turn an integer that can optionally be prefixed by `+` or `-` would be
31
+ defined as following:
32
+
33
+ integer = ('+' | '-')* [0-9]+;
34
+
35
+ A quick and basic crash course of the Ragel grammar:
36
+
37
+ * `*`: zero or more instances of the preceding token(s)
38
+ * `+`: one or more instances of the preceding token(s)
39
+ * `(` and `)`: used for grouping expressions together
40
+ * `^`: inverts a match, thus `^[0-9]` means "anything but a single digit"
41
+ * `"..."` or `'...'`: a literal character, `"x"` would match the literal "x"
42
+ * `|`: the OR operator, `x | y` translates to "x OR y"
43
+ * `[...]`: used to define a sequence, `[0-9]` translates to "0 OR 1 OR 2 OR
44
+ 3..." all the way up to 9
45
+
46
+ Semicolons are used to terminate lines. While not strictly required in this
47
+ specification they are included in order to produce a Ragel syntax compatible
48
+ grammar.
49
+
50
+ See the Ragel documentation for more information on the grammar.
51
+
52
+ ## Terminology
53
+
54
+ local name
55
+ : The name of an element without a namespace. For the element `<strong>` the
56
+ local name is `strong`.
57
+
58
+ namespace prefix
59
+ : The namespace prefix of an element. For the element `<foo:strong>` the
60
+ namespace prefix is `foo`.
61
+
62
+ expression
63
+ : A single or multiple selectors used together to retrieve a set of elements
64
+ from a document.
65
+
66
+ ## Selector Scoping
67
+
68
+ Whenever a selector is used to match an element the selector applies to all
69
+ nodes in the context. For example, the selector `foo` would match all `foo`
70
+ elements at any position in the document. On the other hand, the selector
71
+ `foo bar` only matches any `bar` elements that are a descendant of any `foo`
72
+ element.
73
+
74
+ In XPath the corresponding axis for this is `descendant`. In other words, this
75
+ CSS expression:
76
+
77
+ foo
78
+
79
+ is the same as this XPath expression:
80
+
81
+ descendant::foo
82
+
83
+ In turn this CSS expression:
84
+
85
+ foo bar
86
+
87
+ is the same as this XPath expression:
88
+
89
+ descendant::foo/descendant::bar
90
+
91
+ Note that in the various XPath examples the `descendant` axis is omitted in
92
+ order to enhance readability.
93
+
94
+ ### Syntax
95
+
96
+ A CSS expression is made up of multiple selectors separated by one or more
97
+ spaces. There MUST be at least 1 space between two selectors, there MAY be more
98
+ than one. Multiple spaces do not alter the behaviour of the expression in any
99
+ way.
100
+
101
+ ## Universal Selector
102
+
103
+ W3 chapter: <http://www.w3.org/TR/css3-selectors/#universal-selector>
104
+
105
+ The universal selector `*` (also known as the "wildcard selector") can be used
106
+ to match any element, regardless of its local name or namespace prefix.
107
+
108
+ Example XML:
109
+
110
+ <root>
111
+ <foo></foo>
112
+ <bar></bar>
113
+ </root>
114
+
115
+ CSS:
116
+
117
+ root *
118
+
119
+ This would return a set containing two elements: `<foo>` and `<bar>`.
120
+
121
+ The corresponding XPath is also `*`.
122
+
123
+ ### Syntax
124
+
125
+ The syntax for the universal selector is very simple:
126
+
127
+ universal = '*';
128
+
129
+ ## Element Selector
130
+
131
+ W3 chapter: <http://www.w3.org/TR/css3-selectors/#type-selectors>
132
+
133
+ The element selector (known as "Type selector" in the official W3 specification)
134
+ can be used to match a set of elements by their local name or namespace. The
135
+ selector `foo` is used to match all elements with the local name being set to
136
+ `foo`.
137
+
138
+ Example XML:
139
+
140
+ <root>
141
+ <foo />
142
+ <bar />
143
+ </root>
144
+
145
+ CSS:
146
+
147
+ root foo
148
+
149
+ This would return a set with only the `<foo>` element.
150
+
151
+ This selector can be used in combination with the
152
+ [Universal Selector][universal-selector]. This allows one to select elements
153
+ using both a given local name and namespace. The syntax for this is as
154
+ following:
155
+
156
+ ns-prefix|local-name
157
+
158
+ Here the pipe (`|`) character separates the namespace prefix and the local name.
159
+ Both can either be an identifier or a wildcard. For example, the selector
160
+ `rb|foo` matches all elements with local name `foo` and namespace prefix `rb`.
161
+
162
+ The namespace prefix MAY be left out producing the selector `|local-name`. In
163
+ this case the selector only matches elements _without_ a namespace prefix.
164
+
165
+ If a namespace prefix is given and it's _not_ a wildcard then elements without a
166
+ namespace prefix will _not_ be matched.
167
+
168
+ The corresponding XPath expression for such a selector is
169
+ `ns-prefix:local-name`. For example, `rb|foo` in CSS is the same as `rb:foo` in
170
+ XPath.
171
+
172
+ ### Syntax
173
+
174
+ The syntax for just the local name is as following:
175
+
176
+ identifier = '*' | [a-zA-Z]+ [a-zA-Z\-_0-9]*;
177
+
178
+ The wildcard is put in place to allow a single rule to be used for both names
179
+ and wildcards.
180
+
181
+ The syntax for selecting an element including a namespace prefix is as
182
+ following:
183
+
184
+ ns_plus_local_name = identifier* '|' identifier
185
+
186
+ This would match `|foo`, `*|foo` and `foo|bar`. In order to match `foo` the
187
+ regular `identifier` rule declared above can be used.
188
+
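The two grammar rules above can be approximated with Ruby regular expressions. This is a rough sketch, not Oga's actual lexer: the constant and method names are made up for illustration, and the optional prefix is modelled as "zero or one identifier" rather than the `identifier*` of the rule.

```ruby
# identifier = '*' | [a-zA-Z]+ [a-zA-Z\-_0-9]*;
IDENTIFIER = /\*|[a-zA-Z]+[a-zA-Z\-_0-9]*/

LOCAL_NAME = /\A(?:#{IDENTIFIER.source})\z/

# ns_plus_local_name, with the prefix treated as optional.
NS_PLUS_LOCAL_NAME =
  /\A(?<prefix>#{IDENTIFIER.source})?\|(?<name>#{IDENTIFIER.source})\z/

# Returns [prefix, local_name], or nil when the selector is not valid.
# A nil prefix means "no prefix given", an empty String means "|name".
def split_element_selector(selector)
  if (match = NS_PLUS_LOCAL_NAME.match(selector))
    [match[:prefix].to_s, match[:name]]
  elsif LOCAL_NAME.match?(selector)
    [nil, selector]
  end
end
```

Keeping `nil` and `""` apart preserves the difference between `foo` (no prefix given) and `|foo` (only elements without a namespace prefix).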
189
+ ## Class Selector
190
+
191
+ Class selectors can be used to select a set of elements based on the values set
192
+ in the `class` attribute. Class selectors start with a period (`.`) followed by
193
+ an identifier. Multiple class selectors can be chained together, matching only
194
+ elements that have all the specified classes set.
195
+
196
+ As an example, `.foo` can be used to select all elements that have "foo" set in
197
+ the `class` attribute, either as the sole or one of many values. In turn,
198
+ `.foo.bar` matches elements that have both "foo" and "bar" set as the class.
199
+
200
+ Example XML:
201
+
202
+ <root>
203
+ <a class="first" />
204
+ <b class="second" />
205
+ </root>
206
+
207
+ Using the CSS selector `.first` would return a set containing only the `<a>`
208
+ element. Using `.first.second` would return an empty set, as neither element
209
+ has both classes set.
210
+
211
+ ### Syntax
212
+
213
+ identifier = '*' | [a-zA-Z]+ [a-zA-Z\-_0-9]*;
214
+
215
+ # .foo, .foo.bar, .foo.bar.baz, etc
216
+ class = ('.' identifier)+;
217
+
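The "all classes must be set" rule for chained class selectors can be expressed in a few lines of plain Ruby. This is a minimal sketch with a hypothetical helper name; it works on the raw `class` attribute value and does not use Oga.

```ruby
# Returns true if every class named in the selector (e.g. ".foo.bar")
# occurs in the whitespace separated class attribute value.
def matches_classes?(class_attribute, selector)
  wanted  = selector.split('.').reject(&:empty?)
  present = class_attribute.to_s.split

  wanted.all? { |name| present.include?(name) }
end
```

For example, `matches_classes?('first second', '.first.second')` is true, while an element with only `class="first"` does not match `.first.second`.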
218
+ ## ID Selector
219
+
220
+ The ID selector can be used to match elements where the value of the `id`
221
+ attribute matches whatever is specified in the selector. ID selectors start with
222
+ a hash sign (`#`) followed by an identifier.
223
+
224
+ While technically multiple ID selectors _can_ be chained together, HTML only
225
+ allows elements to have a single ID. As a result doing so is fairly useless.
226
+ Unlike classes IDs are globally unique, no two elements can have the same ID.
227
+
228
+ Example XML:
229
+
230
+ <root>
231
+ <a id="first" />
232
+ <b id="second" />
233
+ </root>
234
+
235
+ Using the CSS selector `#first` would return a set containing only the `<a>`
236
+ node.
237
+
238
+ ### Syntax
239
+
240
+ identifier = '*' | [a-zA-Z]+ [a-zA-Z\-_0-9]*;
241
+
242
+ # #foo, #foo#bar, #foo#bar#baz, etc
243
+ id = ('#' identifier)+;
244
+
245
+ ## Attribute Selector
246
+
247
+ W3 chapter: <http://www.w3.org/TR/css3-selectors/#attribute-selectors>
248
+
249
+ Attribute selectors can be used to further narrow down a set of elements based
250
+ on their attribute list. In XPath these selectors are known as "predicates". For
251
+ example, the selector `foo[bar]` matches all `foo` elements that have a `bar`
252
+ attribute, regardless of the value of said attribute.
253
+
254
+ Example XML:
255
+
256
+ <root>
257
+ <foo number="1" />
258
+ <bar />
259
+ </root>
260
+
261
+ CSS:
262
+
263
+ root foo[number]
264
+
265
+ This would return a set containing only the `<foo>` element since the `<bar>`
266
+ element has no attributes.
267
+
268
+ For the CSS expression `foo[number]` the corresponding XPath expression is the
269
+ following:
270
+
271
+ foo[@number]
272
+
273
+ When specifying an attribute you MAY include an operator and a value to match.
274
+ In this case you MUST include an attribute value surrounded by either single or
275
+ double quotes (but not a combination of the two).
276
+
277
+ There are 6 operators available:
278
+
279
+ * `=`: equals operator
280
+ * `~=`: whitespace-in operator
281
+ * `^=`: starts-with operator
282
+ * `$=`: ends-with operator
283
+ * `*=`: contains operator
284
+ * `|=`: hyphen-starts-with operator
285
+
286
+ ### Equals Operator
287
+
288
+ The equals operator matches an element if a given attribute value equals the
289
+ value specified. For example, `foo[number="1"]` matches all `foo` elements that
290
+ have a `number` attribute whose value is _exactly_ "1".
291
+
292
+ Example XML:
293
+
294
+ <root>
295
+ <foo number="1" />
296
+ <foo number="2" />
297
+ </root>
298
+
299
+ CSS:
300
+
301
+ root foo[number="1"]
302
+
303
+ This would return a set containing only the first `<foo>` element.
304
+
305
+ The corresponding XPath expression is quite similar. For `foo[number="1"]` this
306
+ would be:
307
+
308
+ foo[@number="1"]
309
+
310
+ ### Whitespace-in Operator
311
+
312
+ This operator matches an element if the given attribute value consists of
313
+ space separated values of which one is exactly the given value. For example,
314
+ `foo[numbers~="1"]` matches all `foo` elements that have the value `"1"` in the
315
+ `numbers` attribute.
316
+
317
+ Example XML:
318
+
319
+ <root>
320
+ <foo numbers="1 2 3" />
321
+ <foo numbers="4 bar 6" />
322
+ </root>
323
+
324
+ CSS:
325
+
326
+ root foo[numbers~="1"]
327
+
328
+ This would return a set containing only the first `foo` element. On the other
329
+ hand, if one were to use the expression `root foo[numbers~="bar"]` instead then
330
+ only the second `<foo>` element would be matched.
331
+
332
+ The corresponding XPath expression is quite complex, `foo[numbers~="1"]` is
333
+ translated into the following XPath expression:
334
+
335
+ foo[contains(concat(" ", @numbers, " "), concat(" ", "1", " "))]
336
+
337
+ The `concat` calls are used to ensure the expression doesn't match the substring
338
+ of an attribute value and that the expression matches elements of which the
339
+ attribute only has a single value. If `foo[contains(@numbers, ' 1 ')]` were to
340
+ be used then elements such as `<foo numbers="1" />` would not be matched.
341
+
342
+ Software implementing this selector is free to decide how to concatenate
343
+ spaces around the value to match. Both Oga and Nokogiri use an extra call to
344
+ `concat` but the following would be perfectly valid too:
345
+
346
+ foo[contains(concat(" ", @numbers, " "), " 1 ")]
347
+
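Why the attribute value has to be padded with spaces is easy to see when the same check is written in plain Ruby. The helper names below are hypothetical; the logic mirrors the `contains(concat(...))` expressions above.

```ruby
# Mirrors contains(concat(" ", @attr, " "), concat(" ", value, " ")).
def whitespace_in?(attribute, value)
  " #{attribute} ".include?(" #{value} ")
end

# Mirrors contains(@attr, " value "), which lacks the padding.
def naive_contains?(attribute, value)
  attribute.include?(" #{value} ")
end
```

Without the padding a value at the very start or end of the attribute, or an attribute holding a single value, is never matched.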
348
+ ### Starts-with Operator
349
+
350
+ This operator matches elements of which the attribute value starts _exactly_
351
+ with the given value. For example, `foo[numbers^="1"]` would match the element
352
+ `<foo numbers="1 2 3" />` but _not_ the element `<foo numbers="2 3 1" />`.
353
+
354
+ For `foo[numbers^="1"]` the corresponding XPath expression is as following:
355
+
356
+ foo[starts-with(@numbers, "1")]
357
+
358
+ ### Ends-with Operator
359
+
360
+ This operator matches elements of which the attribute value ends _exactly_ with
361
+ the given value. For example, `foo[numbers$="3"]` would match the element `<foo
362
+ numbers="1 2 3" />` but _not_ the element `<foo numbers="2 3 1" />`.
363
+
364
+ The corresponding XPath expression is quite complex due to a lack of a
365
+ `ends-with` function in XPath. Instead one has to resort to using the
366
+ `substring()` function. As such the corresponding XPath expression for
367
+ `foo[bar$="baz"]` is as following:
368
+
369
+ foo[substring(@bar, string-length(@bar) - string-length("baz") + 1, string-length("baz")) = "baz"]
370
+
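The arithmetic in the `substring()` call can be followed step by step in Ruby. This is a sketch under the assumption that emulating XPath's 1-based `substring(string, start, length)` for integer arguments is enough; `xpath_substring` and `ends_with?` are hypothetical helpers.

```ruby
# Emulates XPath substring() for integer arguments: keeps the characters
# whose 1-based position p satisfies start <= p < start + length.
def xpath_substring(string, start, length)
  chars = string.chars

  chars.each_index
       .select { |i| i + 1 >= start && i + 1 < start + length }
       .map { |i| chars[i] }
       .join
end

# Mirrors substring(@attr, string-length(@attr) - string-length(suffix) + 1,
# string-length(suffix)) = suffix.
def ends_with?(attribute, suffix)
  start = attribute.length - suffix.length + 1

  xpath_substring(attribute, start, suffix.length) == suffix
end
```

The `+ 1` is needed because XPath string positions start at 1 rather than 0.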
371
+ ### Contains Operator
372
+
373
+ This operator matches elements of which the attribute value contains the given
374
+ value. For example, `foo[bar*="baz"]` would match both `<foo bar="bazzzz" />`
375
+ and `<foo bar="hello baz" />`.
376
+
377
+ For `foo[bar*="baz"]` the corresponding XPath expression is as following:
378
+
379
+ foo[contains(@bar, "baz")]
380
+
381
+ ### Hyphen-starts-with Operator
382
+
383
+ This operator matches elements of which the attribute value is a hyphen
384
+ separated list of values that starts _exactly_ with the given value. For
385
+ example, `foo[numbers|="1"]` matches `<foo numbers="1-2-3" />` but not
386
+ `<foo numbers="2-1-3" />`.
387
+
388
+ For `foo[numbers|="1"]` the corresponding XPath expression is as following:
389
+
390
+ foo[@numbers = "1" or starts-with(@numbers, concat("1", "-"))]
391
+
392
+ Note that this selector will also match elements such as
393
+ `<foo numbers="1- foo bar" />`.
394
+
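The same test written in plain Ruby, as a small sketch with a hypothetical helper name, shows both branches of the XPath expression:

```ruby
# Mirrors @attr = value or starts-with(@attr, concat(value, "-")).
def hyphen_starts_with?(attribute, value)
  attribute == value || attribute.start_with?("#{value}-")
end
```

Only the leading `value-` is checked, which is why a value such as `"1- foo bar"` still matches.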
395
+ ### Syntax
396
+
397
+ The syntax of the various attribute selectors can be described as following:
398
+
399
+ # Strings are used for the attribute values
400
+
401
+ dquote = '"';
402
+ squote = "'";
403
+
404
+ string_dquote = dquote ^dquote* dquote;
405
+ string_squote = squote ^squote* squote;
406
+
407
+ string = string_dquote | string_squote;
408
+
409
+ # The `identifier` rule is the same as the one used for matching element
410
+ # names.
411
+ attr_test = identifier '[' space* identifier (space* '=' space* string)* space* ']';
412
+
413
+ Whitespace inside the brackets does not affect the behaviour of the selector.
414
+
415
+ ## Pseudo Classes
416
+
417
+ W3 chapter: <http://www.w3.org/TR/css3-selectors/#structural-pseudos>
418
+
419
+ Pseudo classes can be used to further narrow down elements besides just their
420
+ names and attribute values. In essence they are a combination of XPath function
421
+ calls and axes. Some pseudo classes can take an argument to alter their
422
+ behaviour.
423
+
424
+ Pseudo classes are often applied to element selectors. For example:
425
+
426
+ foo:bar
427
+
428
+ Here `:bar` would be a pseudo class applied to the `foo` element. Some pseudo
429
+ classes (e.g. the `:root` pseudo class) can also be used on their own, for
430
+ example:
431
+
432
+ :root
433
+
434
+ ### :root
435
+
436
+ The `:root` pseudo class selects an element only if it's the top-level element
437
+ in a document.
438
+
439
+ Example XML:
440
+
441
+ <root>
442
+ <foo />
443
+ </root>
444
+
445
+ Using the CSS expression `root foo:root` we'd get an empty set as the `<foo>`
446
+ element is not the root element. On the other hand, `root:root` would return a
447
+ set containing only the `<root>` element.
448
+
449
+ This selector can both be applied to an element selector as well as being used
450
+ on its own.
451
+
452
+ For the selector `foo:root` the corresponding XPath expression is as following:
453
+
454
+ foo[not(parent::*)]
455
+
456
+ For `:root` the XPath expression is:
457
+
458
+ *[not(parent::*)]
459
+
460
+ ### :nth-child(n)
461
+
462
+ The `:nth-child(n)` pseudo class can be used to select a set of elements based
463
+ on their position or an interval, skipping elements that occur in a set before
464
+ the given position or interval.
465
+
466
+ In the form `:nth-child(n)` the identifier `n` is an argument that can be used
467
+ to specify one of the following:
468
+
469
+ 1. A literal node set index
470
+ 2. A node interval used to match every N nodes
471
+ 3. A node interval plus an initial offset
472
+
473
+ The first element in a node set for `:nth-child()` is located at position 1,
474
+ _not_ position 0 (unlike most programming languages). As a result
475
+ `:nth-child(1)` matches the _first_ element, _not_ the second. This can be
476
+ visualized as following:
477
+
478
+ :nth-child(2)
479
+
480
+ 1 2 3 4 5 6
481
+ +---+ +---+ +---+ +---+ +---+ +---+
482
+ | | | X | | | | | | | | |
483
+ +---+ +---+ +---+ +---+ +---+ +---+
484
+
485
+ Besides using a literal index argument you can also use an interval, optionally
486
+ with an offset. This can be used, for example, to match every 2nd element, or
487
+ every 2nd element starting at element number 4.
488
+
489
+ The syntax of this argument is as following:
490
+
491
+ integer = ('+' | '-')* [0-9]+;
492
+ interval = ('n' | '-n' | integer 'n') integer?;
493
+
494
+ Here `interval` would match any of the following:
495
+
496
+ n
497
+ -n
498
+ 2n
499
+ 2n+5
500
+ 2n-5
501
+ -2n+5
502
+ -2n-5
503
+
504
+ Due to `integer` also matching the `+` and `-` it will be part of the same
505
+ token. If this is not desired the following grammar can be used instead:
506
+
507
+ integer = [0-9]+;
508
+ modifier = '+' | '-';
509
+ interval = ('n' | '-n' | modifier* integer 'n') (modifier integer)?;
510
+
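To get a feel for the argument forms, here is a rough Ruby sketch that turns an argument into a pair `[a, b]` meaning "positions `a * n + b`". The regular expression, the method name and the pair representation are assumptions made for this example; Oga's own lexer is generated with Ragel and works differently.

```ruby
NTH_ARGUMENT =
  /\A(?:(?<a>[+-]?\d*)n(?<b>[+-]\d+)?|(?<index>[+-]?\d+))\z/

# Returns [a, b] for arguments such as "2n+5", "-n", "5", "even" and "odd",
# or nil when the argument is not recognised.
def parse_nth(argument)
  return [2, 0] if argument == 'even'
  return [2, 1] if argument == 'odd'

  match = NTH_ARGUMENT.match(argument)

  return nil unless match
  return [0, match[:index].to_i] if match[:index]

  a = case match[:a]
      when '', '+' then 1
      when '-'     then -1
      else match[:a].to_i
      end

  # A missing offset yields nil, and nil.to_i is 0.
  [a, match[:b].to_i]
end
```

A literal index such as `5` is represented as `[0, 5]`, so the same pair can describe all three argument kinds.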
511
+ To match every 2nd element you'd use the following:
512
+
513
+ :nth-child(2n)
514
+
515
+ 1 2 3 4 5 6
516
+ +---+ +---+ +---+ +---+ +---+ +---+
517
+ | | | X | | | | X | | | | X |
518
+ +---+ +---+ +---+ +---+ +---+ +---+
519
+
520
+ To match every 2nd element starting at element 1 you'd instead use this:
521
+
522
+ :nth-child(2n+1)
523
+
524
+ 1 2 3 4 5 6
525
+ +---+ +---+ +---+ +---+ +---+ +---+
526
+ | X | | | | X | | | | X | | |
527
+ +---+ +---+ +---+ +---+ +---+ +---+
528
+
529
+ As mentioned the `+1` in the above example is the initial offset. This is
530
+ however _only_ the case if the second number is positive. That means that for
531
+ `:nth-child(2n-2)` the offset is _not_ `-2`. When using a negative offset the
532
+ actual offset first has to be calculated. When using an argument in the form of
533
+ `An-B` we can calculate the actual offset as following:
534
+
535
+ offset = A - (B % A)
536
+
537
+ For example, for the selector `:nth-child(2n-2)` the formula would be:
538
+
539
+ offset = 2 - (2 % 2) # => 2
540
+
541
+ This would result in the selector `:nth-child(2n+2)`.
542
+
543
+ As another example, for the selector `:nth-child(2n-5)` the formula would be:
544
+
545
+ offset = 2 - (5 % 2) # => 1
546
+
547
+ Which would result in the selector `:nth-child(2n+1)`.
548
+
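The offset calculation can be checked with a few lines of Ruby. This is a sketch that takes `B` as the magnitude of the negative offset in `An-B`; both method names are hypothetical, and `first_position` exists only to cross-check the formula by brute force.

```ruby
# Rewrites the B of `An-B` into the positive offset of `An+offset`.
# `a` is the interval, `b` the magnitude of the negative offset.
def positive_offset(a, b)
  a - (b.abs % a)
end

# Brute-force cross-check: the first position >= 1 produced by a*n - b.
def first_position(a, b)
  n = 0
  n += 1 while (a * n) - b.abs < 1

  (a * n) - b.abs
end
```

For `:nth-child(2n-2)` this gives `positive_offset(2, 2) # => 2`, and for `:nth-child(2n-5)` it gives `positive_offset(2, 5) # => 1`.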
549
+ To ease the process of selecting even and uneven elements you can also use
550
+ `even` and `odd` as an argument. Using `:nth-child(even)` is the same as
551
+ `:nth-child(2n)` while using `:nth-child(odd)` in turn is the same as
552
+ `:nth-child(2n+1)`.
553
+
554
+ Using `:nth-child(n)` simply matches all elements in the set. Using
555
+ `:nth-child(-n)` doesn't match any elements, though Oga treats it the same as
556
+ `:nth-child(n)`.
557
+
558
+ Expressions such as `:nth-child(-n-5)` are invalid as both parts of the interval
559
+ (`-n` and `-5`) are negative. However, `:nth-child(-n+5)` is
560
+ perfectly valid and would match the first 5 elements in a set:
561
+
562
+ :nth-child(-n+5)
563
+
564
+ 1 2 3 4 5 6
565
+ +---+ +---+ +---+ +---+ +---+ +---+
566
+ | X | | X | | X | | X | | X | | |
567
+ +---+ +---+ +---+ +---+ +---+ +---+
568
+
569
+
570
+ Using `:nth-child(n+5)` would match all elements starting at element 5:
571
+
572
+ :nth-child(n+5)
573
+
574
+ 1 2 3 4 5 6 7 8 9 10
575
+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
576
+ | | | | | | | | | X | | X | | X | | X | | X | | X |
577
+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
578
+
579
+ To summarize:
580
+
581
+ :nth-child(n) => matches all elements
582
+ :nth-child(-n) => matches nothing, though Oga treats it the same as "n"
583
+ :nth-child(5) => matches element #5
584
+ :nth-child(2n) => matches every 2 elements
585
+ :nth-child(2n+2) => matches every 2 elements, starting at element 2
586
+ :nth-child(2n-2) => matches every 2 elements, starting at element 2
587
+ :nth-child(n+5) => matches all elements, starting at element 5
588
+ :nth-child(-n+5) => matches the first 5 elements
589
+ :nth-child(even) => matches every 2nd element, starting at element 2
590
+ :nth-child(odd) => matches every 2nd element, starting at element 1
591
+
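The summary can be reproduced with a small Ruby sketch that lists the 1-based positions matched by `an+b` in a set of a given size. It follows the CSS rule that `n` only takes values of 0 and up, so `-n` matches nothing here (Oga, as noted above, treats it like `n`). The method name and the `(a, b, count)` signature are assumptions for this example.

```ruby
# Returns the 1-based positions p in 1..count for which p = a * n + b
# holds for some integer n >= 0. A literal index is written as a = 0.
def nth_child_positions(a, b, count)
  (1..count).select do |position|
    if a.zero?
      position == b
    else
      n, remainder = (position - b).divmod(a)

      remainder.zero? && n >= 0
    end
  end
end
```

For example, `nth_child_positions(2, 1, 6)` corresponds to `:nth-child(2n+1)` on six elements and `nth_child_positions(-1, 5, 6)` to `:nth-child(-n+5)`.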
592
+ The corresponding XPath expressions are quite complex and differ based on the
593
+ interval argument used. For the various forms the corresponding XPath
594
+ expressions are as following:
595
+
596
+ :nth-child(n) => *[((count(preceding-sibling::*) + 1) mod 1) = 0]
597
+ :nth-child(-n) => *[((count(preceding-sibling::*) + 1) mod 1) = 0]
598
+ :nth-child(5) => *[count(preceding-sibling::*) = 4]
599
+ :nth-child(2n) => *[((count(preceding-sibling::*) + 1) mod 2) = 0]
600
+ :nth-child(2n+2) => *[(count(preceding-sibling::*) + 1) >= 2 and (((count(preceding-sibling::*) + 1) - 2) mod 2) = 0]
601
+ :nth-child(2n-6) => *[(count(preceding-sibling::*) + 1) >= 2 and (((count(preceding-sibling::*) + 1) - 2) mod 2) = 0]
602
+ :nth-child(n+5) => *[(count(preceding-sibling::*) + 1) >= 5 and (((count(preceding-sibling::*) + 1) - 5) mod 1) = 0]
603
+ :nth-child(-n+6) => *[((count(preceding-sibling::*) + 1) <= 6) and (((count(preceding-sibling::*) + 1) - 6) mod 1) = 0]
604
+ :nth-child(even) => *[((count(preceding-sibling::*) + 1) mod 2) = 0]
605
+ :nth-child(odd) => *[(count(preceding-sibling::*) + 1) >= 1 and (((count(preceding-sibling::*) + 1) - 1) mod 2) = 0]
606
+
607
+ ### :nth-last-child(n)
608
+
609
+ The `:nth-last-child(n)` pseudo class can be used to select a set of elements
610
+ based on their position or an interval, skipping elements that occur in a set
611
+ after the given position or interval.
612
+
613
+ The arguments that can be used by this selector are the same as those mentioned
614
+ in [:nth-child(n)][nth-childn].
615
+
616
+ Because this selector matches in reverse (compared to
617
+ [:nth-child(n)][nth-childn]) using an index such as "1" will match the _last_
618
+ element in a set, not the first one:
619
+
620
+ :nth-last-child(1)
621
+
622
+ 1 2 3 4 5 6
623
+ +---+ +---+ +---+ +---+ +---+ +---+
624
+ | | | | | | | | | | | X | <- matching direction
625
+ +---+ +---+ +---+ +---+ +---+ +---+
626
+
627
+ When using an interval (with or without an offset) the nodes are also matched in
628
+ reverse order. However, matched nodes should be returned in the order they
629
+ appear in the document.
630
+
631
+ For example, the selector `:nth-last-child(2n)` would match as follows:
632
+
633
+ :nth-last-child(2n)
634
+
635
+ 1 2 3 4 5 6
636
+ +---+ +---+ +---+ +---+ +---+ +---+
637
+ | X | | | | X | | | | X | | | <- matching direction
638
+ +---+ +---+ +---+ +---+ +---+ +---+
639
+
640
+ The resulting set however would contain the nodes in the order `[1, 3, 5]`
641
+ instead of `[5, 3, 1]`.
642
+
643
+ When using an interval with an initial offset the offset is also applied in
644
+ reverse order. For example, the selector `:nth-last-child(2n+1)` would match as
645
+ follows:
646
+
647
+ :nth-last-child(2n+1)
648
+
649
+ 1 2 3 4 5 6
650
+ +---+ +---+ +---+ +---+ +---+ +---+
651
+ | | | X | | | | X | | | | X | <- matching direction
652
+ +---+ +---+ +---+ +---+ +---+ +---+
653
+
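The reverse matching described above can be sketched in plain Ruby: compute each element's position counted from the end, test that position against the interval, and keep the results in document order. The helper names are hypothetical and the `an+b` test follows the Selectors Level 3 definition. This is an illustration, not Oga's evaluator.

```ruby
# Hypothetical helper: true when +position+ equals a*n + b for some n >= 0.
def nth_match?(a, b, position)
  return position == b if a.zero?

  diff = position - b
  (diff % a).zero? && (diff / a) >= 0
end

# Returns the 1-based document positions matched by :nth-last-child(an+b)
# in a set of +count+ siblings. Iterating 1..count keeps document order.
def nth_last_child_positions(a, b, count)
  (1..count).select do |index|
    position_from_end = count - index + 1
    nth_match?(a, b, position_from_end)
  end
end

p nth_last_child_positions(0, 1, 6) # :nth-last-child(1)    => [6]
p nth_last_child_positions(2, 0, 6) # :nth-last-child(2n)   => [1, 3, 5]
p nth_last_child_positions(2, 1, 6) # :nth-last-child(2n+1) => [2, 4, 6]
```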
654
+ The corresponding XPath expressions are similar to those used for
655
+ [:nth-child(n)][nth-childn]:
656
+
657
+ :nth-last-child(n) => *[count(following-sibling::*) = -1]
658
+ :nth-last-child(-n) => *[count(following-sibling::*) = -1]
659
+ :nth-last-child(5) => *[count(following-sibling::*) = 4]
660
+ :nth-last-child(2n) => *[((count(following-sibling::*) + 1) mod 2) = 0]
661
+ :nth-last-child(2n+2) => *[((count(following-sibling::*) + 1) >= 2) and ((((count(following-sibling::*) + 1) - 2) mod 2) = 0)]
662
+ :nth-last-child(2n-6) => *[((count(following-sibling::*) + 1) >= 2) and ((((count(following-sibling::*) + 1) - 2) mod 2) = 0)]
663
+ :nth-last-child(n+5) => *[((count(following-sibling::*) + 1) >= 5) and ((((count(following-sibling::*) + 1) - 5) mod 1) = 0)]
664
+ :nth-last-child(-n+6) => *[((count(following-sibling::*) + 1) <= 6) and ((((count(following-sibling::*) + 1) - 6) mod 1) = 0)]
665
+ :nth-last-child(even) => *[((count(following-sibling::*) + 1) mod 2) = 0]
666
+ :nth-last-child(odd) => *[((count(following-sibling::*) + 1) >= 1) and ((((count(following-sibling::*) + 1) - 1) mod 2) = 0)]
667
+
668
+ ### :nth-of-type(n)
669
+
670
+ The `:nth-of-type(n)` pseudo class can be used to select a set of elements that
671
+ has a set of preceding siblings with the same name. The arguments that can be
672
+ used by this selector are the same as those mentioned in
673
+ [:nth-child(n)][nth-childn].
674
+
675
+ The matching order of this selector is the same as [:nth-child(n)][nth-childn].
676
+
677
+ Example XML:
678
+
679
+ <root>
680
+ <foo />
681
+ <foo />
682
+ <foo />
683
+ <foo />
684
+ <bar />
685
+ </root>
686
+
687
+ Using the CSS expression `root foo:nth-of-type(even)` would return a set
688
+ containing the 2nd and 4th `<foo>` nodes.
689
+
690
+ The corresponding XPath expressions for the various forms of this pseudo class
691
+ are as follows:
692
+
693
+ :nth-of-type(n) => *[position() = n]
694
+ :nth-of-type(-n) => *[position() = -n]
695
+ :nth-of-type(5) => *[position() = 5]
696
+ :nth-of-type(2n) => *[(position() mod 2) = 0]
697
+ :nth-of-type(2n+2) => *[(position() >= 2) and (((position() - 2) mod 2) = 0)]
698
+ :nth-of-type(2n-6) => *[(position() >= 2) and (((position() - 2) mod 2) = 0)]
699
+ :nth-of-type(n+5) => *[(position() >= 5) and (((position() - 5) mod 1) = 0)]
700
+ :nth-of-type(-n+6) => *[(position() <= 6) and (((position() - 6) mod 1) = 0)]
701
+ :nth-of-type(even) => *[(position() mod 2) = 0]
702
+ :nth-of-type(odd) => *[(position() >= 1) and (((position() - 1) mod 2) = 0)]
703
+
704
+ ### :nth-last-of-type(n)
705
+
706
+ The `:nth-last-of-type(n)` pseudo class behaves the same as
707
+ [:nth-of-type(n)][nth-of-typen] except that it matches nodes in reverse order
708
+ similar to [:nth-last-child(n)][nth-last-childn]. To clarify, this means
709
+ matching occurs as follows:
710
+
711
+
712
+ :nth-last-of-type(1)
713
+
714
+ 1 2 3 4 5 6
715
+ +---+ +---+ +---+ +---+ +---+ +---+
716
+ | | | | | | | | | | | X | <- matching direction
717
+ +---+ +---+ +---+ +---+ +---+ +---+
718
+
719
+ Example XML:
720
+
721
+ <root>
722
+ <foo />
723
+ <foo />
724
+ <foo />
725
+ <foo />
726
+ <bar />
727
+ </root>
728
+
729
+ Using the CSS expression `root foo:nth-last-of-type(even)` would return a set
730
+ containing the 1st and 3rd `<foo>` nodes.
731
+
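To contrast this result with the `:nth-of-type(even)` example, here is a minimal Ruby sketch that models the children of `<root>` as a list of element names. `nth_match?` and `nth_of_type` are hypothetical helpers, not Oga APIs. Positions are counted only among siblings with the same name, from the start or from the end.

```ruby
# Hypothetical helper: true when +position+ equals a*n + b for some n >= 0.
def nth_match?(a, b, position)
  return position == b if a.zero?

  diff = position - b
  (diff % a).zero? && (diff / a) >= 0
end

# Returns, in document order, which occurrences of +name+ (1st, 2nd, ...)
# match the interval. With from_end: true the position is counted from the
# last sibling of that name, as :nth-last-of-type does.
def nth_of_type(names, name, a, b, from_end: false)
  total = names.count(name)

  (1..total).select do |occurrence|
    position = from_end ? total - occurrence + 1 : occurrence
    nth_match?(a, b, position)
  end
end

children = %w[foo foo foo foo bar]

p nth_of_type(children, 'foo', 2, 0)                 # even           => [2, 4]
p nth_of_type(children, 'foo', 2, 0, from_end: true) # even, reversed => [1, 3]
```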
732
+ The corresponding XPath expressions for the various forms of this pseudo class
733
+ are as follows:
734
+
735
+ :nth-last-of-type(n) => *[position() = last() - -1]
736
+ :nth-last-of-type(-n) => *[position() = last() - -1]
737
+ :nth-last-of-type(5) => *[position() = last() - 4]
738
+ :nth-last-of-type(2n) => *[((last() - position()+1) mod 2) = 0]
739
+ :nth-last-of-type(2n+2) => *[((last() - position()+1) >= 2) and ((((last() - position() + 1) - 2) mod 2) = 0)]
740
+ :nth-last-of-type(2n-6) => *[((last() - position()+1) >= 2) and ((((last() - position() + 1) - 2) mod 2) = 0)]
741
+ :nth-last-of-type(n+5) => *[((last() - position()+1) >= 5) and ((((last() - position() + 1) - 5) mod 1) = 0)]
742
+ :nth-last-of-type(-n+6) => *[((last() - position()+1) <= 6) and ((((last() - position() + 1) - 6) mod 1) = 0)]
743
+ :nth-last-of-type(even) => *[((last() - position()+1) mod 2) = 0]
744
+ :nth-last-of-type(odd) => *[((last() - position()+1) >= 1) and ((((last() - position() + 1) - 1) mod 2) = 0)]
745
+
746
+ ### :first-child
747
+
748
+ The `:first-child` pseudo class can be used to match a node that is the first
749
+ child node of another node (= a node without any preceding nodes).
750
+
751
+ Example XML:
752
+
753
+ <root>
754
+ <foo />
755
+ <bar />
756
+ </root>
757
+
758
+ Using the CSS selector `root :first-child` would return a set containing only
759
+ the `<foo>` node.
760
+
761
+ The corresponding XPath expression for this pseudo class is as follows:
762
+
763
+ :first-child => *[count(preceding-sibling::*) = 0]
764
+
765
+ ### :last-child
766
+
767
+ The `:last-child` pseudo class can be used to match a node that is the last
768
+ child node of another node (= a node without any following nodes).
769
+
770
+ Example XML:
771
+
772
+ <root>
773
+ <foo />
774
+ <bar />
775
+ </root>
776
+
777
+ Using the CSS selector `root :last-child` would return a set containing only
778
+ the `<bar>` node.
779
+
780
+ The corresponding XPath expression for this pseudo class is as follows:
781
+
782
+ :last-child => *[count(following-sibling::*) = 0]
783
+
784
+ ### :first-of-type
785
+
786
+ The `:first-of-type` pseudo class matches elements that are the first sibling of
787
+ their type in the list of elements of their parent element. This selector is the
788
+ same as [:nth-of-type(1)][nth-of-typen].
789
+
790
+ Example XML:
791
+
792
+ <root>
793
+ <a id="1" />
794
+ <a id="2">
795
+ <a id="3" />
796
+ <a id="4" />
797
+ </a>
798
+ </root>
799
+
800
+ Using the CSS selector `root a:first-of-type` would return a node set containing
801
+ nodes `<a id="1">` and `<a id="3">` as both nodes are the first siblings of
802
+ their type.
803
+
804
+ The corresponding XPath for this pseudo class is as follows:
805
+
806
+ a:first-of-type => a[count(preceding-sibling::a) = 0]
807
+
808
+ An alternative way is to use the following XPath:
809
+
810
+ a:first-of-type => //a[position() = 1]
811
+
812
+ This however relies on the less efficient `descendant-or-self::node()` selector.
813
+ For querying larger documents it's recommended to use the first form instead.
814
+
815
+ ### :last-of-type
816
+
817
+ The `:last-of-type` pseudo class can be used to match elements that are the last
818
+ sibling of their type in the list of elements of their parent. This selector is the
819
+ same as [:nth-last-of-type(1)][nth-last-of-typen].
820
+
821
+ Example XML:
822
+
823
+ <root>
824
+ <a id="1" />
825
+ <a id="2">
826
+ <a id="3" />
827
+ <a id="4" />
828
+ </a>
829
+ </root>
830
+
831
+ Using the CSS selector `root a:last-of-type` would return a set containing nodes
832
+ `<a id="2">` and `<a id="4">` as both nodes are the last siblings of their type.
833
+
834
+ The corresponding XPath for this pseudo class is as follows:
835
+
836
+ a:last-of-type => a[count(following-sibling::a) = 0]
837
+
838
+ Similar to [:first-of-type][first-of-typen] this XPath can alternatively be
839
+ written as follows:
840
+
841
+ a:last-of-type => //a[position() = last()]
842
+
843
+ ### :only-child
844
+
845
+ The `:only-child` pseudo class can be used to match elements that are the only
846
+ child element of their parent.
847
+
848
+ Example XML:
849
+
850
+ <root>
851
+ <a id="1" />
852
+ <a id="2">
853
+ <a id="3" />
854
+ </a>
855
+ </root>
856
+
857
+ Using the CSS selector `root a:only-child` would return a set containing only
858
+ the `<a id="3">` node.
859
+
860
+ The corresponding XPath for this pseudo class is as follows:
861
+
862
+ a:only-child => a[count(preceding-sibling::*) = 0 and count(following-sibling::*) = 0]
863
+
864
+ ### :only-of-type
865
+
866
+ The `:only-of-type` pseudo class can be used to match elements that are the only
867
+ child elements of their type within their parent.
868
+
869
+ Example XML:
870
+
871
+ <root>
872
+ <a id="1" />
873
+ <a id="2">
874
+ <a id="3" />
875
+ <b id="4" />
876
+ </a>
877
+ </root>
878
+
879
+ Using the CSS selector `root a:only-of-type` would return a set containing
880
+ only the `<a id="3">` node due to it being the only `<a>` node in the list of
881
+ elements of its parent.
882
+
883
+ The corresponding XPath for this pseudo class is as follows:
884
+
885
+ a:only-of-type => a[count(preceding-sibling::a) = 0 and count(following-sibling::a) = 0]
886
+
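The sibling-counting predicates behind `:first-of-type`, `:last-of-type`, `:only-child` and `:only-of-type` can be illustrated without an XML library. The sketch below models the example document with a hypothetical `Element` struct and a hypothetical `select_elements` walker. Each predicate receives an element together with its preceding and following siblings, mirroring the `preceding-sibling` and `following-sibling` axes in the XPath expressions above.

```ruby
# Hypothetical minimal element model, for illustration only.
Element = Struct.new(:name, :id, :children)

# Walks the tree below +root+ in document order and returns every element
# for which the block returns true. The block receives the element, its
# preceding siblings and its following siblings.
def select_elements(root, &predicate)
  matches = []

  root.children.each_with_index do |child, index|
    before = root.children.take(index)
    after  = root.children.drop(index + 1)

    matches << child if predicate.call(child, before, after)
    matches.concat(select_elements(child, &predicate))
  end

  matches
end

# <root><a id="1" /><a id="2"><a id="3" /><b id="4" /></a></root>
root = Element.new('root', nil, [
  Element.new('a', '1', []),
  Element.new('a', '2', [
    Element.new('a', '3', []),
    Element.new('b', '4', [])
  ])
])

same_name = ->(node, others) { others.any? { |other| other.name == node.name } }

first_of_type = select_elements(root) { |n, before, _| n.name == 'a' && !same_name.call(n, before) }
last_of_type  = select_elements(root) { |n, _, after| n.name == 'a' && !same_name.call(n, after) }
only_of_type  = select_elements(root) { |n, before, after| n.name == 'a' && !same_name.call(n, before + after) }
only_child    = select_elements(root) { |n, before, after| n.name == 'a' && before.empty? && after.empty? }

p first_of_type.map(&:id) # => ["1", "3"]
p last_of_type.map(&:id)  # => ["2", "3"]
p only_of_type.map(&:id)  # => ["3"]
p only_child.map(&:id)    # => []
```

In this document `a:only-child` matches nothing because `<a id="3">` has a `<b>` sibling, while `a:only-of-type` still matches it.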
887
+ ### :empty
888
+
889
+ The `:empty` pseudo class can be used to match elements that have no child nodes
890
+ at all.
891
+
892
+ Example XML:
893
+
894
+ <root>
895
+ <a />
896
+ <b>10</b>
897
+ </root>
898
+
899
+ Using the CSS selector `root :empty` would return a set containing only the
900
+ `<a>` node.
901
+
902
+ ### Syntax
903
+
904
+ The syntax of the various pseudo classes is as follows:
905
+
906
+ integer = ('+' | '-')* [0-9]+;
907
+
908
+ odd = 'odd';
909
+ even = 'even';
910
+ nth = 'n';
911
+
912
+ pseudo_arg_interval = '-'* integer* nth;
913
+ pseudo_arg_offset = ('+' | '-')* integer;
914
+
915
+ pseudo_arg = odd
916
+ | even
917
+ | '-'* nth
918
+ | integer
919
+ | pseudo_arg_interval
920
+ | pseudo_arg_interval pseudo_arg_offset;
921
+
922
+ # The `identifier` rule is the same as the one used for element names.
923
+ pseudo = ':' identifier ('(' space* pseudo_arg space* ')')*;
924
+
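As a rough illustration of the `pseudo_arg` rule, the following Ruby sketch turns an argument string into a `[step, offset]` pair using regular expressions. It is a simplification and not the grammar above: `parse_pseudo_arg` is a hypothetical helper, it accepts at most one leading sign (the grammar permits repeated signs), and it makes no claim about how Oga's own lexer and parser represent these values.

```ruby
# Hypothetical helper: parses a pseudo class argument into [step, offset],
# where the argument stands for step*n + offset.
def parse_pseudo_arg(arg)
  text = arg.strip

  case text
  when 'odd'  then [2, 1]
  when 'even' then [2, 0]
  when /\A[+-]?\d+\z/
    # A plain integer such as "5": no step, only an offset.
    [0, text.to_i]
  when /\A([+-]?)(\d*)n(?:([+-])(\d+))?\z/
    sign, digits, offset_sign, offset_digits = Regexp.last_match.captures

    step = digits.empty? ? 1 : digits.to_i
    step = -step if sign == '-'

    offset = offset_digits.to_i # nil.to_i is 0 when no offset is given
    offset = -offset if offset_sign == '-'

    [step, offset]
  else
    raise ArgumentError, "unsupported pseudo class argument: #{arg.inspect}"
  end
end

p parse_pseudo_arg('2n+1') # => [2, 1]
p parse_pseudo_arg('-n+5') # => [-1, 5]
p parse_pseudo_arg('5')    # => [0, 5]
```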
925
+ [w3spec]: http://www.w3.org/TR/css3-selectors/
926
+ [rfc-2119]: https://www.ietf.org/rfc/rfc2119.txt
927
+ [kramdown]: http://kramdown.gettalong.org/
928
+ [universal-selector]: #universal-selector
929
+ [ragel]: http://www.colm.net/open-source/ragel/
930
+ [nth-childn]: #nth-childn
931
+ [nth-last-childn]: #nth-last-childn
932
+ [nth-last-of-typen]: #nth-last-of-typen
933
+ [nth-of-typen]: #nth-of-typen
934
+ [nth-last-of-typen]: #nth-last-of-typen
935
+ [first-of-typen]: #first-of-type