oga 0.1.1-java

Files changed (47)
  1. checksums.yaml +7 -0
  2. data/.yardopts +13 -0
  3. data/LICENSE +19 -0
  4. data/README.md +179 -0
  5. data/doc/DCO.md +25 -0
  6. data/doc/changelog.md +20 -0
  7. data/doc/css/common.css +76 -0
  8. data/doc/migrating_from_nokogiri.md +169 -0
  9. data/ext/c/extconf.rb +13 -0
  10. data/ext/c/lexer.c +1518 -0
  11. data/ext/c/lexer.h +8 -0
  12. data/ext/c/lexer.rl +121 -0
  13. data/ext/c/liboga.c +6 -0
  14. data/ext/c/liboga.h +11 -0
  15. data/ext/java/Liboga.java +14 -0
  16. data/ext/java/org/liboga/xml/Lexer.java +829 -0
  17. data/ext/java/org/liboga/xml/Lexer.rl +151 -0
  18. data/ext/ragel/base_lexer.rl +323 -0
  19. data/lib/liboga.jar +0 -0
  20. data/lib/oga.rb +43 -0
  21. data/lib/oga/html/parser.rb +25 -0
  22. data/lib/oga/oga.rb +27 -0
  23. data/lib/oga/version.rb +3 -0
  24. data/lib/oga/xml/attribute.rb +111 -0
  25. data/lib/oga/xml/cdata.rb +17 -0
  26. data/lib/oga/xml/character_node.rb +39 -0
  27. data/lib/oga/xml/comment.rb +17 -0
  28. data/lib/oga/xml/doctype.rb +84 -0
  29. data/lib/oga/xml/document.rb +99 -0
  30. data/lib/oga/xml/element.rb +331 -0
  31. data/lib/oga/xml/lexer.rb +399 -0
  32. data/lib/oga/xml/namespace.rb +42 -0
  33. data/lib/oga/xml/node.rb +168 -0
  34. data/lib/oga/xml/node_set.rb +313 -0
  35. data/lib/oga/xml/parser.rb +556 -0
  36. data/lib/oga/xml/processing_instruction.rb +39 -0
  37. data/lib/oga/xml/pull_parser.rb +180 -0
  38. data/lib/oga/xml/querying.rb +32 -0
  39. data/lib/oga/xml/text.rb +11 -0
  40. data/lib/oga/xml/traversal.rb +48 -0
  41. data/lib/oga/xml/xml_declaration.rb +69 -0
  42. data/lib/oga/xpath/evaluator.rb +1748 -0
  43. data/lib/oga/xpath/lexer.rb +2043 -0
  44. data/lib/oga/xpath/node.rb +10 -0
  45. data/lib/oga/xpath/parser.rb +537 -0
  46. data/oga.gemspec +45 -0
  47. metadata +221 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: ef3b6deb6c9b28c8ce05b88c6e49fec481f8051a
4
+ data.tar.gz: 002f8e7f8d02a02e8e77a163c82604cb4ca5fd03
5
+ SHA512:
6
+ metadata.gz: 8f8e1816df586fa18bb776cac96d043d395737abd4397b16ce654333fe13afeb106e94311820872407b6cb6c94e54ea5a6bdcd4698131f0191031364cafddf9a
7
+ data.tar.gz: 04b59105751bb96576f51fd8ae6755f1ae8eacc8c83a234e5befa181b518fca9a713bd3be986965f0661089bf1afc67142b918429deb99dc0b518bae545794c0
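The `checksums.yaml` entries above pair each packaged file with a SHA1 and a SHA512 digest. As a minimal sketch of how such a file could be checked, assuming the YAML has already been decompressed and the packaged files have been read into memory, the following compares recorded and computed digests. The helper name `checksum_mismatches` and its arguments are hypothetical and are not part of RubyGems or Oga.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: compares the digests recorded in a checksums.yaml
# document against the actual bytes of each packaged file.
#
# checksums_yaml - String holding YAML shaped like the file above
#                  (algorithm => { file name => hex digest }).
# files          - Hash mapping file names to their raw contents (String).
#
# Returns an Array of "ALGORITHM file" labels whose digests did not match.
def checksum_mismatches(checksums_yaml, files)
  digesters  = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }
  mismatches = []

  YAML.safe_load(checksums_yaml).each do |algorithm, entries|
    # Raises KeyError for an algorithm this sketch does not know about.
    digester = digesters.fetch(algorithm)

    entries.each do |name, expected|
      actual = digester.hexdigest(files.fetch(name))
      mismatches << "#{algorithm} #{name}" unless actual == expected
    end
  end

  mismatches
end
```

An empty return value means every listed digest matched. Any returned label names the algorithm and file that differ, which is usually enough to decide whether to reject the package.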
data/.yardopts ADDED
@@ -0,0 +1,13 @@
1
+ ./lib/oga/**/*.rb ./lib/oga.rb
2
+ -m markdown
3
+ -M kramdown
4
+ -o yardoc
5
+ -r ./README.md
6
+ --private
7
+ --protected
8
+ --asset ./doc/css/common.css:css/common.css
9
+ --verbose
10
+ -
11
+ ./doc/*.md
12
+ LICENSE
13
+ CONTRIBUTING.md
data/LICENSE ADDED
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2014, Yorick Peterse
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in
11
+ all copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,179 @@
1
+ # Oga
2
+
3
+ Oga is an XML/HTML parser written in Ruby. It provides an easy to use API for
4
+ parsing, modifying and querying documents (using XPath expressions). Oga does
5
+ not require system libraries such as libxml, making it easier and faster to
6
+ install on various platforms. To achieve better performance Oga uses a small,
7
+ native extension (C for MRI/Rubinius, Java for JRuby).
8
+
9
+ Oga provides an API that allows you to safely parse and query documents in a
10
+ multi-threaded environment, without having to worry about your applications
11
+ blowing up.
12
+
13
+ From [Wikipedia][oga-wikipedia]:
14
+
15
+ > Oga: A large two-person saw used for ripping large boards in the days before
16
+ > power saws. One person stood on a raised platform, with the board below him,
17
+ > and the other person stood underneath them.
18
+
19
+ ## Examples
20
+
21
+ Parsing a simple string of XML:
22
+
23
+ Oga.parse_xml('<people><person>Alice</person></people>')
24
+
25
+ Parsing a simple string of HTML:
26
+
27
+ Oga.parse_html('<link rel="stylesheet" href="foo.css">')
28
+
29
+ Parsing an IO handle pointing to XML (this also works when using
30
+ `Oga.parse_html`):
31
+
32
+ handle = File.open('path/to/file.xml')
33
+
34
+ Oga.parse_xml(handle)
35
+
36
+ Parsing an IO handle using the pull parser:
37
+
38
+ handle = File.open('path/to/file.xml')
39
+ parser = Oga::XML::PullParser.new(handle)
40
+
41
+ parser.parse do |node|
42
+   parser.on(:text) do
43
+     puts node.text
44
+   end
45
+ end
46
+
47
+ Querying a document using XPath:
48
+
49
+ document = Oga.parse_xml('<people><person>Alice</person></people>')
50
+
51
+ document.xpath('string(people/person)') # => "Alice"
52
+
53
+ Modifying a document and serializing it back to XML:
54
+
55
+ document = Oga.parse_xml('<people><person>Alice</person></people>')
56
+ name = document.at_xpath('people/person[1]/text()')
57
+
58
+ name.text = 'Bob'
59
+
60
+ document.to_xml # => "<people><person>Bob</person></people>"
61
+
62
+ Querying a document using a namespace:
63
+
64
+ document = Oga.parse_xml('<root xmlns:x="foo"><x:div></x:div></root>')
65
+ div = document.xpath('root/x:div').first
66
+
67
+ div.namespace # => Namespace(name: "x" uri: "foo")
68
+
69
+ ## Features
70
+
71
+ * Support for parsing XML and HTML(5)
72
+ * DOM parsing
73
+ * Stream/pull parsing
74
+ * Low memory footprint
75
+ * High performance, if something doesn't perform well enough it's a bug
76
+ * Support for XPath 1.0
77
+ * XML namespace support (registering, querying, etc)
78
+
79
+ ## Requirements
80
+
81
+ | Ruby | Required | Recommended |
82
+ |:---------|:--------------|:------------|
83
+ | MRI | >= 1.9.3 | >= 2.1.2 |
84
+ | Rubinius | >= 2.2 | >= 2.2.10 |
85
+ | JRuby | >= 1.7 | >= 1.7.12 |
86
+ | Maglev | Not supported | |
87
+ | Topaz | Not supported | |
88
+ | mruby | Not supported | |
89
+
90
+ Maglev and Topaz are not supported due to the lack of a C API (that I know of)
91
+ and the lack of active development of these Ruby implementations. mruby is not
92
+ supported because it's a very different implementation altogether.
93
+
94
+ To install Oga on MRI or Rubinius you'll need to have a working compiler such as
95
+ gcc or clang. Oga's C extension can be compiled with both. JRuby does not
96
+ require a compiler as the native extension is compiled during the Gem building
97
+ process and bundled inside the Gem itself.
98
+
99
+ ## Thread Safety
100
+
101
+ Documents parsed using Oga are thread-safe as long as they are not modified by
102
+ multiple threads at the same time. Querying documents using XPath can be done by
103
+ multiple threads just fine. Write operations, such as removing attributes, are
104
+ _not_ thread-safe and should not be done by multiple threads at once.
105
+
106
+ It is advised that you do not share parsed documents between threads unless you
107
+ _really_ have to.
108
+
109
+ ## Documentation
110
+
111
+ The documentation is best viewed [on the documentation website][doc-website].
112
+
113
+ * {file:CONTRIBUTING Contributing}
114
+ * {file:changelog Changelog}
115
+ * {file:migrating\_from\_nokogiri Migrating From Nokogiri}
116
+
117
+ ## Native Extension Setup
118
+
119
+ The native extensions can be found in `ext/` and are divided into a C and Java
120
+ extension. These extensions are only used for the XML lexer built using Ragel.
121
+ The grammar for this lexer is shared between C and Java and can be found in
122
+ `ext/ragel/base_lexer.rl`.
123
+
124
+ The extensions delegate most of their work back to Ruby code. As a result,
125
+ maintaining this codebase is much easier. If one wants to change the
126
+ grammar they only have to do so in one place and they don't have to worry about
127
+ C and/or Java specific details.
128
+
129
+ For more details on calling Ruby methods from Ragel see the source
130
+ documentation in `ext/ragel/base_lexer.rl`.
131
+
132
+ ## Why Another HTML/XML parser?
133
+
134
+ Currently there are a few existing parsers out there, the most famous one being
135
+ [Nokogiri][nokogiri]. Another parser that's becoming more popular these days is
136
+ [Ox][ox]. Ruby's standard library also comes with REXML.
137
+
138
+ The sad truth is that these existing libraries are problematic in their own
139
+ ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works
140
+ because of the non-concurrent nature of MRI, on JRuby it works because it's
141
+ implemented as Java. Nokogiri also uses libxml2 which is a massive beast of a
142
+ library, is not thread-safe and problematic to install on certain platforms
143
+ (apparently). I don't want to compile libxml2 every time I install Nokogiri
144
+ either.
145
+
146
+ To give an example about the issues with Nokogiri on Rubinius (or any other
147
+ Ruby implementation that is not MRI or JRuby), take a look at these issues:
148
+
149
+ * <https://github.com/rubinius/rubinius/issues/2957>
150
+ * <https://github.com/rubinius/rubinius/issues/2908>
151
+ * <https://github.com/rubinius/rubinius/issues/2462>
152
+ * <https://github.com/sparklemotion/nokogiri/issues/1047>
153
+ * <https://github.com/sparklemotion/nokogiri/issues/939>
154
+
155
+ Some of these have been fixed, some have not. The core problem remains:
156
+ Nokogiri acts in a way that there can be a large number of places where it
157
+ *might* break due to throwing around void pointers and what not and expecting
158
+ that things magically work. Note that I have nothing against the people running
159
+ these projects, I just heavily, *heavily* dislike the resulting codebase one
160
+ has to deal with today.
161
+
162
+ Ox looks very promising but it lacks a rather crucial feature: parsing HTML
163
+ (without using a SAX API). It's also again a C extension making debugging more
164
+ of a pain (at least for me).
165
+
166
+ I just want an XML/HTML parser that I can rely on stability wise and that is
167
+ written in Ruby so I can actually debug it. In theory it should also make it
168
+ easier for other Ruby developers to contribute.
169
+
170
+ ## License
171
+
172
+ All source code in this repository is licensed under the MIT license unless
173
+ specified otherwise. A copy of this license can be found in the file "LICENSE"
174
+ in the root directory of this repository.
175
+
176
+ [nokogiri]: https://github.com/sparklemotion/nokogiri
177
+ [oga-wikipedia]: https://en.wikipedia.org/wiki/Japanese_saw#Other_Japanese_saws
178
+ [ox]: https://github.com/ohler55/ox
179
+ [doc-website]: http://code.yorickpeterse.com/oga/latest/
data/doc/DCO.md ADDED
@@ -0,0 +1,25 @@
1
+ # Developer's Certificate of Origin 1.0
2
+
3
+ By making a contribution to this project, I certify that:
4
+
5
+ 1. The contribution was created in whole or in part by me and I
6
+ have the right to submit it under the open source license
7
+ indicated in the file LICENSE; or
8
+
9
+ 2. The contribution is based upon previous work that, to the best
10
+ of my knowledge, is covered under an appropriate open source
11
+ license and I have the right under that license to submit that
12
+ work with modifications, whether created in whole or in part
13
+ by me, under the same open source license (unless I am
14
+ permitted to submit under a different license), as indicated
15
+ in the file LICENSE; or
16
+
17
+ 3. The contribution was provided directly to me by some other
18
+ person who certified (1), (2) or (3) and I have not modified
19
+ it.
20
+
21
+ 4. I understand and agree that this project and the contribution
22
+ are public and that a record of the contribution (including all
23
+ personal information I submit with it, including my sign-off) is
24
+ maintained indefinitely and may be redistributed consistent with
25
+ this project or the open source license(s) involved.
data/doc/changelog.md ADDED
@@ -0,0 +1,20 @@
1
+ # Changelog
2
+
3
+ ## 0.2.0 - Unreleased
4
+
5
+ The `node_type` method has been removed and its purpose has been moved into
6
+ the `XML::PullParser` class itself. This method was solely used by the pull
7
+ parser to provide shorthands for node classes. As such it doesn't make sense to
8
+ expose this to the outside world as a public method.
9
+
10
+ ## 0.1.1 - 2014-09-13
11
+
12
+ This release fixes a problem where element attributes were not separated by
13
+ spaces. Thanks to Jonathan Rochkind for reporting it and Bill Dueber providing
14
+ an initial patch for this problem.
15
+
16
+ ## 0.1.0 - 2014-09-12
17
+
18
+ The first public release of Oga. This release contains support for parsing XML,
19
+ basic support for parsing HTML, support for querying documents using XPath and
20
+ more.
data/doc/css/common.css ADDED
@@ -0,0 +1,76 @@
1
+ body
2
+ {
3
+ font-size: 14px;
4
+ line-height: 1.6;
5
+ margin: 0 auto;
6
+ max-width: 960px;
7
+ }
8
+
9
+ p code
10
+ {
11
+ background: #f2f2f2;
12
+ padding-left: 3px;
13
+ padding-right: 3px;
14
+ }
15
+
16
+ pre.code
17
+ {
18
+ font-size: 13px;
19
+ line-height: 1.4;
20
+ overflow: auto;
21
+ }
22
+
23
+ blockquote
24
+ {
25
+ border-left: 5px solid #eee;
26
+ margin: 0px;
27
+ padding-left: 15px;
28
+ }
29
+
30
+ /**
31
+ * YARD uses generic table styles, using a special class means those tables
32
+ * don't get messed up.
33
+ */
34
+ .table
35
+ {
36
+ border: 1px solid #ccc;
37
+ border-right: none;
38
+ border-collapse: separate;
39
+ border-spacing: 0;
40
+ text-align: left;
41
+ }
42
+
43
+ .table.full
44
+ {
45
+ width: 100%;
46
+ }
47
+
48
+ .table .field_name
49
+ {
50
+ min-width: 160px;
51
+ }
52
+
53
+ .table thead tr th.no_sort:first-child
54
+ {
55
+ width: 25px;
56
+ }
57
+
58
+ .table thead tr th, .table tbody tr td
59
+ {
60
+ border-bottom: 1px solid #ccc;
61
+ border-right: 1px solid #ccc;
62
+ min-width: 20px;
63
+ padding: 8px 5px;
64
+ text-align: left;
65
+ vertical-align: top;
66
+ }
67
+
68
+ .table tbody tr:last-child td
69
+ {
70
+ border-bottom: none;
71
+ }
72
+
73
+ .table tr:nth-child(odd) td
74
+ {
75
+ background: #f9f9f9;
76
+ }
data/doc/migrating_from_nokogiri.md ADDED
@@ -0,0 +1,169 @@
1
+ # Migrating From Nokogiri
2
+
3
+ If you're parsing XML/HTML documents using Ruby, chances are you're using
4
+ [Nokogiri][nokogiri] for this. This guide aims to make it easier to switch from
5
+ Nokogiri to Oga.
6
+
7
+ ## Parsing Documents
8
+
9
+ In Nokogiri there are two de facto ways of parsing documents:
10
+
11
+ * `Nokogiri.XML()` for XML documents
12
+ * `Nokogiri.HTML()` for HTML documents
13
+
14
+ For example, to parse an XML document you'd use the following:
15
+
16
+ Nokogiri::XML('<root>foo</root>')
17
+
18
+ Oga instead uses the following two methods:
19
+
20
+ * `Oga.parse_xml`
21
+ * `Oga.parse_html`
22
+
23
+ Their usage is similar:
24
+
25
+ Oga.parse_xml('<root>foo</root>')
26
+
27
+ Nokogiri returns two distinctive document classes based on what method was used
28
+ to parse a document:
29
+
30
+ * `Nokogiri::XML::Document` for XML documents
31
+ * `Nokogiri::HTML::Document` for HTML documents
32
+
33
+ Oga on the other hand always returns an `Oga::XML::Document` instance. Oga
34
+ currently makes no distinction between XML and HTML documents other than at
35
+ the lexer level. This might change in the future if deemed required.
36
+
37
+ ## Querying Documents
38
+
39
+ Nokogiri allows one to query documents/elements using both XPath expressions and
40
+ CSS selectors. In Nokogiri one queries a document as follows:
41
+
42
+ document = Nokogiri::XML('<root><foo>bar</foo></root>')
43
+
44
+ document.xpath('root/foo')
45
+ document.css('root foo')
46
+
47
+ Oga currently only supports XPath expressions; CSS selectors will be added in
48
+ the near future. Querying documents works similarly to Nokogiri:
49
+
50
+ document = Oga.parse_xml('<root><foo>bar</foo></root>')
51
+
52
+ document.xpath('root/foo')
53
+
54
+ Nokogiri also allows you to query a document and return the first match, as
55
+ opposed to an entire node set, using the method `at`. In Nokogiri this method can
56
+ be used for both XPath expressions and CSS selectors. Oga has no such method,
57
+ instead it provides the following more dedicated methods:
58
+
59
+ * `at_xpath`: returns the first node of an XPath expression
60
+
61
+ For example:
62
+
63
+ document = Oga.parse_xml('<root><foo>bar</foo></root>')
64
+
65
+ document.at_xpath('root/foo')
66
+
67
+ By using a dedicated method Oga doesn't have to try and guess what type of
68
+ expression you're using (XPath or CSS), meaning it can never make any mistakes.
69
+
70
+ ## Retrieving Attribute Values
71
+
72
+ Nokogiri provides two methods for retrieving attributes and attribute values:
73
+
74
+ * `Nokogiri::XML::Node#attribute`
75
+ * `Nokogiri::XML::Node#attr`
76
+
77
+ The first method always returns an instance of `Nokogiri::XML::Attribute`, the
78
+ second method returns the attribute value as a `String`. This behaviour,
79
+ especially due to the names used, is extremely confusing.
80
+
81
+ Oga on the other hand provides the following two methods:
82
+
83
+ * `Oga::XML::Element#attribute` (aliased as `attr`)
84
+ * `Oga::XML::Element#get`
85
+
86
+ The first method always returns a `Oga::XML::Attribute` instance, the second
87
+ returns the attribute value as a `String`. I deliberately chose `get` for
88
+ getting a value to remove the confusion of `attribute` vs `attr`. This also
89
+ allows for `attr` to simply be an alias of `attribute`.
90
+
91
+ As an example, this is how you'd get the value of a `class` attribute in
92
+ Nokogiri:
93
+
94
+ document = Nokogiri::XML('<root class="foo"></root>')
95
+
96
+ document.xpath('root').first.attr('class') # => "foo"
97
+
98
+ This is how you'd get the same value in Oga:
99
+
100
+ document = Oga.parse_xml('<root class="foo"></root>')
101
+
102
+ document.xpath('root').first.get('class') # => "foo"
103
+
104
+ ## Modifying Documents
105
+
106
+ Modifying documents in Nokogiri is not as convenient as it perhaps could be. For
107
+ example, adding an element to a document is done as follows:
108
+
109
+ document = Nokogiri::XML('<root></root>')
110
+ root = document.xpath('root').first
111
+
112
+ name = Nokogiri::XML::Element.new('name', document)
113
+
114
+ name.inner_html = 'Alice'
115
+
116
+ root.add_child(name)
117
+
118
+ The annoying part here is that we have to pass a document into an Element's
119
+ constructor. As such, you cannot create elements without first creating a
120
+ document. Another thing is that Nokogiri has no method called `inner_text=`,
121
+ instead you have to use the method `inner_html=`.
122
+
123
+ In Oga you'd use the following:
124
+
125
+ document = Oga.parse_xml('<root></root>')
126
+ root = document.xpath('root').first
127
+
128
+ name = Oga::XML::Element.new(:name => 'name')
129
+
130
+ name.inner_text = 'Alice'
131
+
132
+ root.children << name
133
+
134
+ Adding attributes works similarly for both Nokogiri and Oga. For Nokogiri you'd
135
+ use the following:
136
+
137
+ element.set_attribute('class', 'foo')
138
+
139
+ Alternatively you can do the following:
140
+
141
+ element['class'] = 'foo'
142
+
143
+ In Oga you'd instead use the method `set`:
144
+
145
+ element.set('class', 'foo')
146
+
147
+ This method automatically creates an attribute if it doesn't exist, including
148
+ the namespace if specified:
149
+
150
+ element.set('foo:class', 'foo')
151
+
152
+ ## Serializing Documents
153
+
154
+ Serializing the document back to XML works the same in both libraries, simply
155
+ call `to_xml` on a document or element and you'll get a String back containing
156
+ the XML. There is one key difference here though: Nokogiri does not return the
157
+ exact same output as it was given as input, for example it adds XML declaration
158
+ tags:
159
+
160
+ Nokogiri::XML('<root></root>').to_xml # => "<?xml version=\"1.0\"?>\n<root/>\n"
161
+
162
+ Oga on the other hand does not do this:
163
+
164
+ Oga.parse_xml('<root></root>').to_xml # => "<root></root>"
165
+
166
+ Oga also doesn't insert random newlines or other possibly unexpected (or
167
+ unwanted) data.
168
+
169
+ [nokogiri]: http://nokogiri.org/