gammo 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: '009d6d5682151d83fe688e67ba57541bcccf5542b2865d8b85be77fae8156178'
4
- data.tar.gz: 31a2f1d37e01a3c9e47db2b034965c75b0bc7ddd4d2f86826ae08fd37d199788
3
+ metadata.gz: f88ade2267388af8d29137f6ff0db934d324d77f5bface53cd6ae9fea3f1e466
4
+ data.tar.gz: 92642a6208ede13520e29c20529f29fd21bb040eac39ec4af97935990b94b0eb
5
5
  SHA512:
6
- metadata.gz: c77bcb2f3cc9b25ac7400eff41819289980b1ff9a53481cca51b1444bca24dbdd114ead1c90f6cec219b0c844b0e63775338a7a7f74910b24c0ab6b0a00e2d54
7
- data.tar.gz: a907e000dd8d4c01bcdb3f834ec17fbc20bdddcc91c38b77add44d3b56116eef8b7a64acdbd967fff65b1076df65871899816493ca012b98947266cc1ba51d8c
6
+ metadata.gz: f15b94b27a1738234662a2f196c440b97e4cd2b83a9c9bb468d7c4cfd4058ad6482b0ac06559dc4cd50a0dc8d41ba6114db1a09d51177bca511d1ce4196ef9e6
7
+ data.tar.gz: 5377c2d17574abcc5e0069e7506ad5e47da211c970a79718218c88d9099723c04a7edba0b58da61eb9898e24fc300c04b3f8e2daba49ce41994ec990c27ff5fe
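The digests recorded in `checksums.yaml` can be recomputed locally from the `metadata.gz` and `data.tar.gz` files inside the `.gem` archive. The following is a minimal sketch using only Ruby's standard `digest` library. The helper names `digests_for` and `checksum_mismatches` are hypothetical, and the sketch assumes the archive has already been unpacked into a directory.

```ruby
require 'digest'

# Hypothetical helper: recompute the SHA256 and SHA512 hex digests of one file.
def digests_for(path)
  {
    'SHA256' => Digest::SHA256.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: compare the files in `dir` against a hash shaped like
# checksums.yaml, e.g. { 'SHA256' => { 'metadata.gz' => '...' } }.
# Returns an array of [algorithm, file name] pairs whose digests do not match.
def checksum_mismatches(dir, checksums)
  mismatches = []
  checksums.each do |algorithm, files|
    files.each do |name, expected|
      actual = digests_for(File.join(dir, name))[algorithm]
      mismatches << [algorithm, name] unless actual == expected
    end
  end
  mismatches
end
```

Loading `checksums.yaml` with `YAML.load_file` should yield a hash of this shape, so an empty result from `checksum_mismatches` would indicate that both files match the published digests.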
data/Gemfile CHANGED
@@ -7,3 +7,6 @@ gem 'yard'
7
7
  gem 'rake', '~> 12.0'
8
8
  gem 'test-unit', '~> 3.3.5'
9
9
  gem 'erubi'
10
+
11
+ gem 'racc'
12
+ gem 'simplecov'
@@ -6,9 +6,15 @@ PATH
6
6
  GEM
7
7
  remote: https://rubygems.org/
8
8
  specs:
9
+ docile (1.3.2)
9
10
  erubi (1.9.0)
10
11
  power_assert (1.1.5)
12
+ racc (1.5.0)
11
13
  rake (12.3.3)
14
+ simplecov (0.18.5)
15
+ docile (~> 1.1)
16
+ simplecov-html (~> 0.11)
17
+ simplecov-html (0.12.2)
12
18
  test-unit (3.3.5)
13
19
  power_assert
14
20
  yard (0.9.20)
@@ -19,9 +25,11 @@ PLATFORMS
19
25
  DEPENDENCIES
20
26
  erubi
21
27
  gammo!
28
+ racc
22
29
  rake (~> 12.0)
30
+ simplecov
23
31
  test-unit (~> 3.3.5)
24
32
  yard
25
33
 
26
34
  BUNDLED WITH
27
- 2.0.2
35
+ 2.1.4
data/README.md CHANGED
@@ -1,6 +1,11 @@
1
1
  # Gammo - A pure-Ruby HTML5 parser
2
2
 
3
3
  [![Build Status](https://travis-ci.org/namusyaka/gammo.svg?branch=master)](https://travis-ci.org/namusyaka/gammo)
4
+ [![GitHub issues](https://img.shields.io/github/issues/namusyaka/gammo)](https://github.com/namusyaka/gammo/issues)
5
+ [![GitHub forks](https://img.shields.io/github/forks/namusyaka/gammo?color=brightgreen)](https://github.com/namusyaka/gammo/network)
6
+ [![GitHub stars](https://img.shields.io/github/stars/namusyaka/gammo?color=brightgreen)](https://github.com/namusyaka/gammo/stargazers)
7
+ [![GitHub license](https://img.shields.io/github/license/namusyaka/gammo?color=brightgreen)](https://github.com/namusyaka/gammo/blob/master/LICENSE.txt)
8
+ [![Documentation](http://img.shields.io/:yard-docs-38c800.svg)](http://www.rubydoc.info/gems/gammo/frames)
4
9
 
5
10
  Gammo is an implementation of the HTML5 parsing algorithm which conforms to [the WHATWG specification](https://html.spec.whatwg.org/multipage/parsing.html), without any dependencies. Given an HTML string, Gammo parses it and builds a DOM tree based on the tokenization and tree-construction algorithms defined in the WHATWG parsing algorithm.
6
11
 
@@ -10,7 +15,7 @@ Gammo, its naming is inspired by [Gumbo](https://github.com/google/gumbo-parser)
10
15
  require 'gammo'
11
16
  require 'open-uri'
12
17
 
13
- parser = Gammo.new(open('https://google.com'))
18
+ parser = URI.open('https://google.com') { |f| Gammo.new(f.read) }
14
19
  parser.parse #=> #<Gammo::Node::Document>
15
20
  ```
16
21
 
@@ -21,6 +26,7 @@ parser.parse #=> #<Gammo::Node::Document>
21
26
  - [Tokenization](#tokenization): Gammo has a tokenizer for implementing [the tokenization algorithm](https://html.spec.whatwg.org/multipage/parsing.html#tokenization).
22
27
  - [Parsing](#parsing): Gammo provides a parser which implements the parsing algorithm by the above tokenization and [the tree-construction algorithm](https://html.spec.whatwg.org/multipage/parsing.html#tree-construction).
23
28
  - [Node](#node): Gammo provides the nodes which implement [WHATWG DOM specification](https://dom.spec.whatwg.org/) partially.
29
+ - [DOM Tree Traversal](#dom-tree-traversal): Gammo provides a way to traverse the DOM tree.
24
30
  - [Performance](#performance): Gammo does not prioritize performance, and there are a few potential performance notes.
25
31
 
26
32
  ## Tokenization
@@ -155,7 +161,394 @@ The nodes generated by the parser will be categorized into one of the following
155
161
  </tbody>
156
162
  </table>
157
163
 
158
- For some nodes such as `Gammo::Node::Element` and `Gammo::Node::Document`, they contains pointers to nodes that can be referenced by itself, such as `Gammo::Node#next_sibling` or `Gammo::Node#first_child`. In addition, APIs such as `Gammo::Node#append_child` and `Gammo::Node#remove_child` that perform operations defined in DOM living standard are also provided.
164
+ Some nodes, such as `Gammo::Node::Element` and `Gammo::Node::Document`, hold pointers to related nodes, which are exposed through methods such as `Gammo::Node#next_sibling` and `Gammo::Node#first_child`. In addition, APIs such as `Gammo::Node#append_child` and `Gammo::Node#remove_child`, which perform operations defined in the DOM living standard, are also provided.
165
+
166
+ ## DOM Tree Traversal
167
+
168
+ Currently, XPath 1.0 is the only way to traverse the DOM tree built by Gammo.
169
+ CSS selector support is also planned, but there is no ETA yet.
170
+
171
+ ### XPath 1.0 (experimental)
172
+
173
+ Gammo has its own lexer and parser for XPath 1.0, provided as a helper on the DOM tree built by Gammo.
174
+ Here is a simple example:
175
+
176
+ ```ruby
177
+ document = Gammo.new('<!doctype html><input type="button">').parse
178
+ node_set = document.xpath('//input[@type="button"]') #=> "<Gammo::XPath::NodeSet>"
179
+
180
+ node_set.length #=> 1
181
+ node_set.first #=> "<Gammo::Node::Element>"
182
+ ```
183
+
184
+ **Since this is implemented from scratch, Gammo provides this support as a very experimental feature.**
185
+ Please [file an issue](/issues/new) if you find bugs.
186
+
187
+ #### Example
188
+
189
+ Before proceeding to the details of XPath support, let's have a look at a few simple examples.
190
+ Given a sample HTML text and its DOM tree:
191
+
192
+ ```ruby
193
+ document = Gammo.new(<<-EOS).parse
194
+ <!DOCTYPE html>
195
+ <html>
196
+ <head>
197
+ </head>
198
+ <body>
199
+ <h1>namusyaka.com</h1>
200
+ <p class="description">Here is a sample web site.</p>
201
+ <ul>
202
+ <li>hello</li>
203
+ <li>world</li>
204
+ </ul>
205
+ <ul id="links">
206
+ <li>Google <a href="https://google.com/">google.com</a></li>
207
+ <li>GitHub <a href="https://github.com/namusyaka">github.com/namusyaka</a></li>
208
+ </ul>
209
+ </body>
210
+ </html>
211
+ EOS
212
+ ```
213
+
214
+ The following XPath expression gets all `li` elements and prints their text content:
215
+
216
+ ```ruby
217
+ document.xpath('//li').each do |elm|
218
+ puts elm.inner_text
219
+ end
220
+ ```
221
+
222
+ The following XPath expression gets all `li` elements under the `ul` element having the `id=links` attribute:
223
+
224
+ ```ruby
225
+ document.xpath('//ul[@id="links"]/li').each do |elm|
226
+ puts elm.inner_text
227
+ end
228
+ ```
229
+
230
+ The following XPath expression gets each text node for each `li` element under the `ul` element having the `id=links` attribute:
231
+
232
+ ```ruby
233
+ document.xpath('//ul[@id="links"]/li/text()').each do |elm|
234
+ puts elm.data
235
+ end
236
+ ```
237
+
238
+ #### Axis Specifiers
239
+
240
+ In Gammo, the axis specifier indicates the navigation direction within the DOM tree. Here is the list of axes. As you can see, Gammo supports all of them.
241
+
242
+ <table>
243
+ <thead>
244
+ <tr>
245
+ <th>Full Syntax</th>
246
+ <th>Abbreviated Syntax</th>
247
+ <th>Supported</th>
248
+ <th>Notes</th>
249
+ </tr>
250
+ </thead>
251
+ <tbody>
252
+ <tr>
253
+ <td><code>ancestor</code></td>
254
+ <td></td>
255
+ <td>yes</td>
256
+ <td></td>
257
+ </tr>
258
+ <tr>
259
+ <td><code>ancestor-or-self</code></td>
260
+ <td></td>
261
+ <td>yes</td>
262
+ <td></td>
263
+ </tr>
264
+ <tr>
265
+ <td><code>attribute</code></td>
266
+ <td><code>@</code></td>
267
+ <td>yes</td>
268
+ <td><code>@abc</code> is the alias for <code>attribute::abc</code></td>
269
+ </tr>
270
+ <tr>
271
+ <td><code>child</code></td>
272
+ <td><code></code></td>
273
+ <td>yes</td>
274
+ <td><code>abc</code> is short for <code>child::abc</code></td>
275
+ </tr>
276
+ <tr>
277
+ <td><code>descendant</code></td>
278
+ <td></td>
279
+ <td>yes</td>
280
+ <td></td>
281
+ </tr>
282
+ <tr>
283
+ <td><code>descendant-or-self</code></td>
284
+ <td><code>//</code></td>
285
+ <td>yes</td>
286
+ <td><code>//</code> is the alias for <code>/descendant-or-self::node()/</code></td>
287
+ </tr>
288
+ <tr>
289
+ <td><code>following</code></td>
290
+ <td></td>
291
+ <td>yes</td>
292
+ <td></td>
293
+ </tr>
294
+ <tr>
295
+ <td><code>following-sibling</code></td>
296
+ <td></td>
297
+ <td>yes</td>
298
+ <td></td>
299
+ </tr>
300
+ <tr>
301
+ <td><code>namespace</code></td>
302
+ <td></td>
303
+ <td>yes</td>
304
+ <td></td>
305
+ </tr>
306
+ <tr>
307
+ <td><code>parent</code></td>
308
+ <td><code>..</code></td>
309
+ <td>yes</td>
310
+ <td><code>..</code> is the alias for <code>parent::node()</code></td>
311
+ </tr>
312
+ <tr>
313
+ <td><code>preceding</code></td>
314
+ <td></td>
315
+ <td>yes</td>
316
+ <td></td>
317
+ </tr>
318
+ <tr>
319
+ <td><code>preceding-sibling</code></td>
320
+ <td></td>
321
+ <td>yes</td>
322
+ <td></td>
323
+ </tr>
324
+ <tr>
325
+ <td><code>self</code></td>
326
+ <td><code>.</code></td>
327
+ <td>yes</td>
328
+ <td><code>.</code> is the alias for <code>self::node()</code></td>
329
+ </tr>
330
+ </tbody>
331
+ </table>
332
+
333
+ #### Node Test
334
+
335
+ Node tests consist of specific node names or more general expressions. Although XPath allows syntax such as `:` for specifying a namespace prefix, Gammo does not support it yet, as it's [not a core feature in HTML5](https://html.spec.whatwg.org/multipage/introduction.html#html-vs-xhtml).
336
+
337
+ <table>
338
+ <thead>
339
+ <tr>
340
+ <th>Full Syntax</th>
341
+ <th>Supported</th>
342
+ <th>Notes</th>
343
+ </tr>
344
+ </thead>
345
+ <tbody>
346
+ <tr>
347
+ <td><code>text()</code></td>
348
+ <td>yes</td>
349
+ <td>Finds a node of type text, e.g. <code>hello</code> in <code>&lt;p&gt;hello &lt;a href="https://hello"&gt;world&lt;/a&gt;&lt;/p&gt;</code></td>
350
+ </tr>
351
+ <tr>
352
+ <td><code>comment()</code></td>
353
+ <td>yes</td>
354
+ <td>Finds a node of type comment, e.g. <code>&lt;!-- comment --&gt;</code></td>
355
+ </tr>
356
+ <tr>
357
+ <td><code>node()</code></td>
358
+ <td>yes</td>
359
+ <td>Finds any node at all.</td>
360
+ </tr>
361
+ </tbody>
362
+ </table>
363
+
364
+ Also note that the `processing-instruction()` node test is not supported, and there is no plan to support it.
365
+
366
+ #### Operators
367
+
368
+ - `/`, `//` and `[]` are used in path expressions.
369
+ - The union operator `|` forms the union of two node sets.
370
+ - The boolean operators: `and`, `or`
371
+ - The arithmetic operators: `+`, `-`, `*`, `div` and `mod`
372
+ - Comparison operators: `=`, `!=`, `<`, `>`, `<=`, `>=`
373
+
374
+ #### Functions
375
+
376
+ XPath 1.0 defines four data types (node-set, string, number, boolean), and there are various functions based on those types. Gammo supports these functions only partially, so please check the tables below to confirm that a function is supported before using it.
377
+
378
+ ##### Node Set Functions
379
+
380
+ <table>
381
+ <thead>
382
+ <tr>
383
+ <th>Function Name</th>
384
+ <th>Supported</th>
385
+ <th>Specification</th>
386
+ </tr>
387
+ </thead>
388
+ <tbody>
389
+ <tr>
390
+ <td><code>last()</code></td>
391
+ <td>yes</td>
392
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-last</td>
393
+ </tr>
394
+ <tr>
395
+ <td><code>position()</code></td>
396
+ <td>yes</td>
397
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-position</td>
398
+ </tr>
399
+ <tr>
400
+ <td><code>count(node-set)</code></td>
401
+ <td>yes</td>
402
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-count</td>
403
+ </tr>
404
+ </tbody>
405
+ </table>
406
+
407
+ ##### String Functions
408
+
409
+ <table>
410
+ <thead>
411
+ <tr>
412
+ <th>Function Name</th>
413
+ <th>Supported</th>
414
+ <th>Specification</th>
415
+ </tr>
416
+ </thead>
417
+ <tbody>
418
+ <tr>
419
+ <td><code>string(object?)</code></td>
420
+ <td>yes</td>
421
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string</td>
422
+ </tr>
423
+ <tr>
424
+ <td><code>concat(string, string, string*)</code></td>
425
+ <td>yes</td>
426
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-concat</td>
427
+ </tr>
428
+ <tr>
429
+ <td><code>starts-with(string, string)</code></td>
430
+ <td>yes</td>
431
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-starts-with</td>
432
+ </tr>
433
+ <tr>
434
+ <td><code>contains(string, string)</code></td>
435
+ <td>yes</td>
436
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-contains</td>
437
+ </tr>
438
+ <tr>
439
+ <td><code>substring-before(string, string)</code></td>
440
+ <td>yes</td>
441
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-before</td>
442
+ </tr>
443
+ <tr>
444
+ <td><code>substring-after(string, string)</code></td>
445
+ <td>yes</td>
446
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-after</td>
447
+ </tr>
448
+ <tr>
449
+ <td><code>substring(string, number, number?)</code></td>
450
+ <td>no</td>
451
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring</td>
452
+ </tr>
453
+ <tr>
454
+ <td><code>string-length(string?)</code></td>
455
+ <td>no</td>
456
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-length</td>
457
+ </tr>
458
+ <tr>
459
+ <td><code>normalize-space(string?)</code></td>
460
+ <td>no</td>
461
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-normalize-space</td>
462
+ </tr>
463
+ <tr>
464
+ <td><code>translate(string, string, string)</code></td>
465
+ <td>no</td>
466
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-translate</td>
467
+ </tr>
468
+ </tbody>
469
+ </table>
470
+
471
+ ##### Boolean Functions
472
+
473
+ <table>
474
+ <thead>
475
+ <tr>
476
+ <th>Function Name</th>
477
+ <th>Supported</th>
478
+ <th>Specification</th>
479
+ </tr>
480
+ </thead>
481
+ <tbody>
482
+ <tr>
483
+ <td><code>boolean(object)</code></td>
484
+ <td>yes</td>
485
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-boolean</td>
486
+ </tr>
487
+ <tr>
488
+ <td><code>not(object)</code></td>
489
+ <td>yes</td>
490
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-not</td>
491
+ </tr>
492
+ <tr>
493
+ <td><code>true()</code></td>
494
+ <td>yes</td>
495
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-true</td>
496
+ </tr>
497
+ <tr>
498
+ <td><code>false()</code></td>
499
+ <td>yes</td>
500
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-false</td>
501
+ </tr>
502
+ <tr>
503
+ <td><code>lang()</code></td>
504
+ <td>no</td>
505
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-lang</td>
506
+ </tr>
507
+ </tbody>
508
+ </table>
509
+
510
+ ##### Number Functions
511
+
512
+ <table>
513
+ <thead>
514
+ <tr>
515
+ <th>Function Name</th>
516
+ <th>Supported</th>
517
+ <th>Specification</th>
518
+ </tr>
519
+ </thead>
520
+ <tbody>
521
+ <tr>
522
+ <td><code>number(object?)</code></td>
523
+ <td>no</td>
524
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-number</td>
525
+ </tr>
526
+ <tr>
527
+ <td><code>sum(node-set)</code></td>
528
+ <td>no</td>
529
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-sum</td>
530
+ </tr>
531
+ <tr>
532
+ <td><code>floor(number)</code></td>
533
+ <td>no</td>
534
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-floor</td>
535
+ </tr>
536
+ <tr>
537
+ <td><code>ceiling(number)</code></td>
538
+ <td>yes</td>
539
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-ceiling</td>
540
+ </tr>
541
+ <tr>
542
+ <td><code>round(number)</code></td>
543
+ <td>no</td>
544
+ <td>https://www.w3.org/TR/1999/REC-xpath-19991116/#function-round</td>
545
+ </tr>
546
+ </tbody>
547
+ </table>
548
+
549
+ ### CSS Selector
550
+
551
+ TBD.
159
552
 
160
553
  ## Performance
161
554
 
@@ -175,3 +568,10 @@ This was developed with reference to the following softwares.
175
568
  ## License
176
569
 
177
570
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
571
+
572
+ ## Release History
573
+
574
+ - v0.2.0
575
+ - XPath 1.0 support [#4](https://github.com/namusyaka/gammo/pull/4)
576
+ - v0.1.0
577
+ - Initial Release