oga 0.1.1-java

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (47)
  1. checksums.yaml +7 -0
  2. data/.yardopts +13 -0
  3. data/LICENSE +19 -0
  4. data/README.md +179 -0
  5. data/doc/DCO.md +25 -0
  6. data/doc/changelog.md +20 -0
  7. data/doc/css/common.css +76 -0
  8. data/doc/migrating_from_nokogiri.md +169 -0
  9. data/ext/c/extconf.rb +13 -0
  10. data/ext/c/lexer.c +1518 -0
  11. data/ext/c/lexer.h +8 -0
  12. data/ext/c/lexer.rl +121 -0
  13. data/ext/c/liboga.c +6 -0
  14. data/ext/c/liboga.h +11 -0
  15. data/ext/java/Liboga.java +14 -0
  16. data/ext/java/org/liboga/xml/Lexer.java +829 -0
  17. data/ext/java/org/liboga/xml/Lexer.rl +151 -0
  18. data/ext/ragel/base_lexer.rl +323 -0
  19. data/lib/liboga.jar +0 -0
  20. data/lib/oga.rb +43 -0
  21. data/lib/oga/html/parser.rb +25 -0
  22. data/lib/oga/oga.rb +27 -0
  23. data/lib/oga/version.rb +3 -0
  24. data/lib/oga/xml/attribute.rb +111 -0
  25. data/lib/oga/xml/cdata.rb +17 -0
  26. data/lib/oga/xml/character_node.rb +39 -0
  27. data/lib/oga/xml/comment.rb +17 -0
  28. data/lib/oga/xml/doctype.rb +84 -0
  29. data/lib/oga/xml/document.rb +99 -0
  30. data/lib/oga/xml/element.rb +331 -0
  31. data/lib/oga/xml/lexer.rb +399 -0
  32. data/lib/oga/xml/namespace.rb +42 -0
  33. data/lib/oga/xml/node.rb +168 -0
  34. data/lib/oga/xml/node_set.rb +313 -0
  35. data/lib/oga/xml/parser.rb +556 -0
  36. data/lib/oga/xml/processing_instruction.rb +39 -0
  37. data/lib/oga/xml/pull_parser.rb +180 -0
  38. data/lib/oga/xml/querying.rb +32 -0
  39. data/lib/oga/xml/text.rb +11 -0
  40. data/lib/oga/xml/traversal.rb +48 -0
  41. data/lib/oga/xml/xml_declaration.rb +69 -0
  42. data/lib/oga/xpath/evaluator.rb +1748 -0
  43. data/lib/oga/xpath/lexer.rb +2043 -0
  44. data/lib/oga/xpath/node.rb +10 -0
  45. data/lib/oga/xpath/parser.rb +537 -0
  46. data/oga.gemspec +45 -0
  47. metadata +221 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: ef3b6deb6c9b28c8ce05b88c6e49fec481f8051a
4
+ data.tar.gz: 002f8e7f8d02a02e8e77a163c82604cb4ca5fd03
5
+ SHA512:
6
+ metadata.gz: 8f8e1816df586fa18bb776cac96d043d395737abd4397b16ce654333fe13afeb106e94311820872407b6cb6c94e54ea5a6bdcd4698131f0191031364cafddf9a
7
+ data.tar.gz: 04b59105751bb96576f51fd8ae6755f1ae8eacc8c83a234e5befa181b518fca9a713bd3be986965f0661089bf1afc67142b918429deb99dc0b518bae545794c0
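The digests recorded in checksums.yaml can be compared against the archive files they name. The following is a minimal sketch using only Ruby's standard library. The helper name `checksums_match?` and the demo payload are hypothetical, and the sketch assumes the manifest has already been loaded into a Hash shaped like the YAML above (section name, then file name, then hex digest).

```ruby
require 'digest/sha1'
require 'digest/sha2'
require 'tempfile'
require 'yaml'

# Hypothetical helper: true when every recorded digest for `name` matches
# the bytes stored at `path`. Raises KeyError if an entry is missing.
def checksums_match?(path, checksums, name)
  data    = File.binread(path)
  digests = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }

  digests.all? do |section, digest|
    checksums.fetch(section).fetch(name) == digest.hexdigest(data)
  end
end

# Demo with a throwaway file standing in for a real data.tar.gz.
Tempfile.create('data.tar.gz') do |file|
  file.binmode
  file.write('example payload')
  file.flush

  # Build a manifest with the same layout as checksums.yaml.
  manifest = YAML.dump(
    {
      'SHA1'   => { 'data.tar.gz' => Digest::SHA1.hexdigest('example payload') },
      'SHA512' => { 'data.tar.gz' => Digest::SHA512.hexdigest('example payload') }
    }
  )

  checksums = YAML.safe_load(manifest)

  puts checksums_match?(file.path, checksums, 'data.tar.gz')
end
```

Both sections are checked so that a mismatch in either digest is reported as a failure.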
data/.yardopts ADDED
@@ -0,0 +1,13 @@
1
+ ./lib/oga/**/*.rb ./lib/oga.rb
2
+ -m markdown
3
+ -M kramdown
4
+ -o yardoc
5
+ -r ./README.md
6
+ --private
7
+ --protected
8
+ --asset ./doc/css/common.css:css/common.css
9
+ --verbose
10
+ -
11
+ ./doc/*.md
12
+ LICENSE
13
+ CONTRIBUTING.md
data/LICENSE ADDED
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2014, Yorick Peterse
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in
11
+ all copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,179 @@
1
+ # Oga
2
+
3
+ Oga is an XML/HTML parser written in Ruby. It provides an easy to use API for
4
+ parsing, modifying and querying documents (using XPath expressions). Oga does
5
+ not require system libraries such as libxml, making it easier and faster to
6
+ install on various platforms. To achieve better performance Oga uses a small,
7
+ native extension (C for MRI/Rubinius, Java for JRuby).
8
+
9
+ Oga provides an API that allows you to safely parse and query documents in a
10
+ multi-threaded environment, without having to worry about your applications
11
+ blowing up.
12
+
13
+ From [Wikipedia][oga-wikipedia]:
14
+
15
+ > Oga: A large two-person saw used for ripping large boards in the days before
16
+ > power saws. One person stood on a raised platform, with the board below him,
17
+ > and the other person stood underneath them.
18
+
19
+ ## Examples
20
+
21
+ Parsing a simple string of XML:
22
+
23
+ Oga.parse_xml('<people><person>Alice</person></people>')
24
+
25
+ Parsing a simple string of HTML:
26
+
27
+ Oga.parse_html('<link rel="stylesheet" href="foo.css">')
28
+
29
+ Parsing an IO handle pointing to XML (this also works when using
30
+ `Oga.parse_html`):
31
+
32
+ handle = File.open('path/to/file.xml')
33
+
34
+ Oga.parse_xml(handle)
35
+
36
+ Parsing an IO handle using the pull parser:
37
+
38
+ handle = File.open('path/to/file.xml')
39
+ parser = Oga::XML::PullParser.new(handle)
40
+
41
+ parser.parse do |node|
42
+ parser.on(:text) do
43
+ puts node.text
44
+ end
45
+ end
46
+
47
+ Querying a document using XPath:
48
+
49
+ document = Oga.parse_xml('<people><person>Alice</person></people>')
50
+
51
+ document.xpath('string(people/person)') # => "Alice"
52
+
53
+ Modifying a document and serializing it back to XML:
54
+
55
+ document = Oga.parse_xml('<people><person>Alice</person></people>')
56
+ name = document.at_xpath('people/person[1]/text()')
57
+
58
+ name.text = 'Bob'
59
+
60
+ document.to_xml # => "<people><person>Bob</person></people>"
61
+
62
+ Querying a document using a namespace:
63
+
64
+ document = Oga.parse_xml('<root xmlns:x="foo"><x:div></x:div></root>')
65
+ div = document.xpath('root/x:div').first
66
+
67
+ div.namespace # => Namespace(name: "x" uri: "foo")
68
+
69
+ ## Features
70
+
71
+ * Support for parsing XML and HTML(5)
72
+ * DOM parsing
73
+ * Stream/pull parsing
74
+ * Low memory footprint
75
+ * High performance, if something doesn't perform well enough it's a bug
76
+ * Support for XPath 1.0
77
+ * XML namespace support (registering, querying, etc)
78
+
79
+ ## Requirements
80
+
81
+ | Ruby | Required | Recommended |
82
+ |:---------|:--------------|:------------|
83
+ | MRI | >= 1.9.3 | >= 2.1.2 |
84
+ | Rubinius | >= 2.2 | >= 2.2.10 |
85
+ | JRuby | >= 1.7 | >= 1.7.12 |
86
+ | Maglev | Not supported | |
87
+ | Topaz | Not supported | |
88
+ | mruby | Not supported | |
89
+
90
+ Maglev and Topaz are not supported due to the lack of a C API (that I know of)
91
+ and the lack of active development of these Ruby implementations. mruby is not
92
+ supported because it's a very different implementation altogether.
93
+
94
+ To install Oga on MRI or Rubinius you'll need to have a working compiler such as
95
+ gcc or clang. Oga's C extension can be compiled with both. JRuby does not
96
+ require a compiler as the native extension is compiled during the Gem building
97
+ process and bundled inside the Gem itself.
98
+
99
+ ## Thread Safety
100
+
101
+ Documents parsed using Oga are thread-safe as long as they are not modified by
102
+ multiple threads at the same time. Querying documents using XPath can be done by
103
+ multiple threads just fine. Write operations, such as removing attributes, are
104
+ _not_ thread-safe and should not be done by multiple threads at once.
105
+
106
+ It is advised that you do not share parsed documents between threads unless you
107
+ _really_ have to.
108
+
109
+ ## Documentation
110
+
111
+ The documentation is best viewed [on the documentation website][doc-website].
112
+
113
+ * {file:CONTRIBUTING Contributing}
114
+ * {file:changelog Changelog}
115
+ * {file:migrating\_from\_nokogiri Migrating From Nokogiri}
116
+
117
+ ## Native Extension Setup
118
+
119
+ The native extensions can be found in `ext/` and are divided into a C and Java
120
+ extension. These extensions are only used for the XML lexer built using Ragel.
121
+ The grammar for this lexer is shared between C and Java and can be found in
122
+ `ext/ragel/base_lexer.rl`.
123
+
124
+ The extensions delegate most of their work back to Ruby code. As a result,
125
+ maintaining this codebase is much easier. If one wants to change the
126
+ grammar they only have to do so in one place and they don't have to worry about
127
+ C and/or Java specific details.
128
+
129
+ For more details on calling Ruby methods from Ragel see the source
130
+ documentation in `ext/ragel/base_lexer.rl`.
131
+
132
+ ## Why Another HTML/XML parser?
133
+
134
+ Currently there are a few existing parsers out there, the most famous one being
135
+ [Nokogiri][nokogiri]. Another parser that's becoming more popular these days is
136
+ [Ox][ox]. Ruby's standard library also comes with REXML.
137
+
138
+ The sad truth is that these existing libraries are problematic in their own
139
+ ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works
140
+ because of the non-concurrent nature of MRI, on JRuby it works because it's
141
+ implemented in Java. Nokogiri also uses libxml2 which is a massive beast of a
142
+ library, is not thread-safe and problematic to install on certain platforms
143
+ (apparently). I don't want to compile libxml2 every time I install Nokogiri
144
+ either.
145
+
146
+ To give an example about the issues with Nokogiri on Rubinius (or any other
147
+ Ruby implementation that is not MRI or JRuby), take a look at these issues:
148
+
149
+ * <https://github.com/rubinius/rubinius/issues/2957>
150
+ * <https://github.com/rubinius/rubinius/issues/2908>
151
+ * <https://github.com/rubinius/rubinius/issues/2462>
152
+ * <https://github.com/sparklemotion/nokogiri/issues/1047>
153
+ * <https://github.com/sparklemotion/nokogiri/issues/939>
154
+
155
+ Some of these have been fixed, some have not. The core problem remains:
156
+ Nokogiri acts in a way that there can be a large number of places where it
157
+ *might* break due to throwing around void pointers and what not and expecting
158
+ that things magically work. Note that I have nothing against the people running
159
+ these projects, I just heavily, *heavily* dislike the resulting codebase one
160
+ has to deal with today.
161
+
162
+ Ox looks very promising but it lacks a rather crucial feature: parsing HTML
163
+ (without using a SAX API). It's also again a C extension making debugging more
164
+ of a pain (at least for me).
165
+
166
+ I just want an XML/HTML parser that I can rely on stability wise and that is
167
+ written in Ruby so I can actually debug it. In theory it should also make it
168
+ easier for other Ruby developers to contribute.
169
+
170
+ ## License
171
+
172
+ All source code in this repository is licensed under the MIT license unless
173
+ specified otherwise. A copy of this license can be found in the file "LICENSE"
174
+ in the root directory of this repository.
175
+
176
+ [nokogiri]: https://github.com/sparklemotion/nokogiri
177
+ [oga-wikipedia]: https://en.wikipedia.org/wiki/Japanese_saw#Other_Japanese_saws
178
+ [ox]: https://github.com/ohler55/ox
179
+ [doc-website]: http://code.yorickpeterse.com/oga/latest/
data/doc/DCO.md ADDED
@@ -0,0 +1,25 @@
1
+ # Developer's Certificate of Origin 1.0
2
+
3
+ By making a contribution to this project, I certify that:
4
+
5
+ 1. The contribution was created in whole or in part by me and I
6
+ have the right to submit it under the open source license
7
+ indicated in the file LICENSE; or
8
+
9
+ 2. The contribution is based upon previous work that, to the best
10
+ of my knowledge, is covered under an appropriate open source
11
+ license and I have the right under that license to submit that
12
+ work with modifications, whether created in whole or in part
13
+ by me, under the same open source license (unless I am
14
+ permitted to submit under a different license), as indicated
15
+ in the file LICENSE; or
16
+
17
+ 3. The contribution was provided directly to me by some other
18
+ person who certified (1), (2) or (3) and I have not modified
19
+ it.
20
+
21
+ 4. I understand and agree that this project and the contribution
22
+ are public and that a record of the contribution (including all
23
+ personal information I submit with it, including my sign-off) is
24
+ maintained indefinitely and may be redistributed consistent with
25
+ this project or the open source license(s) involved.
data/doc/changelog.md ADDED
@@ -0,0 +1,20 @@
1
+ # Changelog
2
+
3
+ ## 0.2.0 - Unreleased
4
+
5
+ The `node_type` method has been removed and its purpose has been moved into
6
+ the `XML::PullParser` class itself. This method was solely used by the pull
7
+ parser to provide shorthands for node classes. As such it doesn't make sense to
8
+ expose it to the outside world as a public method.
9
+
10
+ ## 0.1.1 - 2014-09-13
11
+
12
+ This release fixes a problem where element attributes were not separated by
13
+ spaces. Thanks to Jonathan Rochkind for reporting it and Bill Dueber for providing
14
+ an initial patch for this problem.
15
+
16
+ ## 0.1.0 - 2014-09-12
17
+
18
+ The first public release of Oga. This release contains support for parsing XML,
19
+ basic support for parsing HTML, support for querying documents using XPath and
20
+ more.
data/doc/css/common.css ADDED
@@ -0,0 +1,76 @@
1
+ body
2
+ {
3
+ font-size: 14px;
4
+ line-height: 1.6;
5
+ margin: 0 auto;
6
+ max-width: 960px;
7
+ }
8
+
9
+ p code
10
+ {
11
+ background: #f2f2f2;
12
+ padding-left: 3px;
13
+ padding-right: 3px;
14
+ }
15
+
16
+ pre.code
17
+ {
18
+ font-size: 13px;
19
+ line-height: 1.4;
20
+ overflow: auto;
21
+ }
22
+
23
+ blockquote
24
+ {
25
+ border-left: 5px solid #eee;
26
+ margin: 0px;
27
+ padding-left: 15px;
28
+ }
29
+
30
+ /**
31
+ * YARD uses generic table styles, using a special class means those tables
32
+ * don't get messed up.
33
+ */
34
+ .table
35
+ {
36
+ border: 1px solid #ccc;
37
+ border-right: none;
38
+ border-collapse: separate;
39
+ border-spacing: 0;
40
+ text-align: left;
41
+ }
42
+
43
+ .table.full
44
+ {
45
+ width: 100%;
46
+ }
47
+
48
+ .table .field_name
49
+ {
50
+ min-width: 160px;
51
+ }
52
+
53
+ .table thead tr th.no_sort:first-child
54
+ {
55
+ width: 25px;
56
+ }
57
+
58
+ .table thead tr th, .table tbody tr td
59
+ {
60
+ border-bottom: 1px solid #ccc;
61
+ border-right: 1px solid #ccc;
62
+ min-width: 20px;
63
+ padding: 8px 5px;
64
+ text-align: left;
65
+ vertical-align: top;
66
+ }
67
+
68
+ .table tbody tr:last-child td
69
+ {
70
+ border-bottom: none;
71
+ }
72
+
73
+ .table tr:nth-child(odd) td
74
+ {
75
+ background: #f9f9f9;
76
+ }
data/doc/migrating_from_nokogiri.md ADDED
@@ -0,0 +1,169 @@
1
+ # Migrating From Nokogiri
2
+
3
+ If you're parsing XML/HTML documents using Ruby, chances are you're using
4
+ [Nokogiri][nokogiri] for this. This guide aims to make it easier to switch from
5
+ Nokogiri to Oga.
6
+
7
+ ## Parsing Documents
8
+
9
+ In Nokogiri there are two de facto ways of parsing documents:
10
+
11
+ * `Nokogiri.XML()` for XML documents
12
+ * `Nokogiri.HTML()` for HTML documents
13
+
14
+ For example, to parse an XML document you'd use the following:
15
+
16
+ Nokogiri::XML('<root>foo</root>')
17
+
18
+ Oga instead uses the following two methods:
19
+
20
+ * `Oga.parse_xml`
21
+ * `Oga.parse_html`
22
+
23
+ Their usage is similar:
24
+
25
+ Oga.parse_xml('<root>foo</root>')
26
+
27
+ Nokogiri returns two distinctive document classes based on what method was used
28
+ to parse a document:
29
+
30
+ * `Nokogiri::XML::Document` for XML documents
31
+ * `Nokogiri::HTML::Document` for HTML documents
32
+
33
+ Oga on the other hand always returns an `Oga::XML::Document` instance. Oga
34
+ currently makes no distinction between XML and HTML documents other than at
35
+ the lexer level. This might change in the future if deemed required.
36
+
37
+ ## Querying Documents
38
+
39
+ Nokogiri allows one to query documents/elements using both XPath expressions and
40
+ CSS selectors. In Nokogiri one queries a document as follows:
41
+
42
+ document = Nokogiri::XML('<root><foo>bar</foo></root>')
43
+
44
+ document.xpath('root/foo')
45
+ document.css('root foo')
46
+
47
+ Oga currently only supports XPath expressions; CSS selectors will be added in
48
+ the near future. Querying documents works similarly to Nokogiri:
49
+
50
+ document = Oga.parse_xml('<root><foo>bar</foo></root>')
51
+
52
+ document.xpath('root/foo')
53
+
54
+ Nokogiri also allows you to query a document and return the first match, as
55
+ opposed to an entire node set, using the method `at`. In Nokogiri this method can be
56
+ used for both XPath expressions and CSS selectors. Oga has no such method,
57
+ instead it provides the following more dedicated methods:
58
+
59
+ * `at_xpath`: returns the first node of an XPath expression
60
+
61
+ For example:
62
+
63
+ document = Oga.parse_xml('<root><foo>bar</foo></root>')
64
+
65
+ document.at_xpath('root/foo')
66
+
67
+ By using a dedicated method Oga doesn't have to try and guess what type of
68
+ expression you're using (XPath or CSS), meaning it can never make any mistakes.
69
+
70
+ ## Retrieving Attribute Values
71
+
72
+ Nokogiri provides two methods for retrieving attributes and attribute values:
73
+
74
+ * `Nokogiri::XML::Node#attribute`
75
+ * `Nokogiri::XML::Node#attr`
76
+
77
+ The first method always returns an instance of `Nokogiri::XML::Attribute`, the
78
+ second method returns the attribute value as a `String`. This behaviour,
79
+ especially due to the names used, is extremely confusing.
80
+
81
+ Oga on the other hand provides the following two methods:
82
+
83
+ * `Oga::XML::Element#attribute` (aliased as `attr`)
84
+ * `Oga::XML::Element#get`
85
+
86
+ The first method always returns a `Oga::XML::Attribute` instance, the second
87
+ returns the attribute value as a `String`. I deliberately chose `get` for
88
+ getting a value to remove the confusion of `attribute` vs `attr`. This also
89
+ allows for `attr` to simply be an alias of `attribute`.
90
+
91
+ As an example, this is how you'd get the value of a `class` attribute in
92
+ Nokogiri:
93
+
94
+ document = Nokogiri::XML('<root class="foo"></root>')
95
+
96
+ document.xpath('root').first.attr('class') # => "foo"
97
+
98
+ This is how you'd get the same value in Oga:
99
+
100
+ document = Oga.parse_xml('<root class="foo"></root>')
101
+
102
+ document.xpath('root').first.get('class') # => "foo"
103
+
104
+ ## Modifying Documents
105
+
106
+ Modifying documents in Nokogiri is not as convenient as it perhaps could be. For
107
+ example, adding an element to a document is done as follows:
108
+
109
+ document = Nokogiri::XML('<root></root>')
110
+ root = document.xpath('root').first
111
+
112
+ name = Nokogiri::XML::Element.new('name', document)
113
+
114
+ name.inner_html = 'Alice'
115
+
116
+ root.add_child(name)
117
+
118
+ The annoying part here is that we have to pass a document into an Element's
119
+ constructor. As such, you can not create elements without first creating a
120
+ document. Another thing is that Nokogiri has no method called `inner_text=`,
121
+ instead you have to use the method `inner_html=`.
122
+
123
+ In Oga you'd use the following:
124
+
125
+ document = Oga.parse_xml('<root></root>')
126
+ root = document.xpath('root').first
127
+
128
+ name = Oga::XML::Element.new(:name => 'name')
129
+
130
+ name.inner_text = 'Alice'
131
+
132
+ root.children << name
133
+
134
+ Adding attributes works similarly for both Nokogiri and Oga. For Nokogiri you'd
135
+ use the following:
136
+
137
+ element.set_attribute('class', 'foo')
138
+
139
+ Alternatively you can do the following:
140
+
141
+ element['class'] = 'foo'
142
+
143
+ In Oga you'd instead use the method `set`:
144
+
145
+ element.set('class', 'foo')
146
+
147
+ This method automatically creates an attribute if it doesn't exist, including
148
+ the namespace if specified:
149
+
150
+ element.set('foo:class', 'foo')
151
+
152
+ ## Serializing Documents
153
+
154
+ Serializing the document back to XML works the same in both libraries, simply
155
+ call `to_xml` on a document or element and you'll get a String back containing
156
+ the XML. There is one key difference here though: Nokogiri does not return the
157
+ exact same output as it was given as input, for example it adds XML declaration
158
+ tags:
159
+
160
+ Nokogiri::XML('<root></root>').to_xml # => "<?xml version=\"1.0\"?>\n<root/>\n"
161
+
162
+ Oga on the other hand does not do this:
163
+
164
+ Oga.parse_xml('<root></root>').to_xml # => "<root></root>"
165
+
166
+ Oga also doesn't insert random newlines or other possibly unexpected (or
167
+ unwanted) data.
168
+
169
+ [nokogiri]: http://nokogiri.org/