RubyGems - oga - Versions diffs - 0.1.0 - Mend

oga 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (46)

checksums.yaml +7 -0
data/.yardopts +13 -0
data/LICENSE +19 -0
data/README.md +171 -0
data/doc/DCO.md +25 -0
data/doc/changelog.md +7 -0
data/doc/css/common.css +76 -0
data/doc/migrating_from_nokogiri.md +169 -0
data/ext/c/extconf.rb +13 -0
data/ext/c/lexer.c +1518 -0
data/ext/c/lexer.h +8 -0
data/ext/c/lexer.rl +121 -0
data/ext/c/liboga.c +6 -0
data/ext/c/liboga.h +11 -0
data/ext/java/Liboga.java +14 -0
data/ext/java/org/liboga/xml/Lexer.java +829 -0
data/ext/java/org/liboga/xml/Lexer.rl +151 -0
data/ext/ragel/base_lexer.rl +323 -0
data/lib/oga.rb +43 -0
data/lib/oga/html/parser.rb +25 -0
data/lib/oga/oga.rb +27 -0
data/lib/oga/version.rb +3 -0
data/lib/oga/xml/attribute.rb +111 -0
data/lib/oga/xml/cdata.rb +24 -0
data/lib/oga/xml/character_node.rb +39 -0
data/lib/oga/xml/comment.rb +24 -0
data/lib/oga/xml/doctype.rb +91 -0
data/lib/oga/xml/document.rb +99 -0
data/lib/oga/xml/element.rb +340 -0
data/lib/oga/xml/lexer.rb +399 -0
data/lib/oga/xml/namespace.rb +42 -0
data/lib/oga/xml/node.rb +175 -0
data/lib/oga/xml/node_set.rb +313 -0
data/lib/oga/xml/parser.rb +556 -0
data/lib/oga/xml/processing_instruction.rb +39 -0
data/lib/oga/xml/pull_parser.rb +166 -0
data/lib/oga/xml/querying.rb +32 -0
data/lib/oga/xml/text.rb +16 -0
data/lib/oga/xml/traversal.rb +48 -0
data/lib/oga/xml/xml_declaration.rb +76 -0
data/lib/oga/xpath/evaluator.rb +1748 -0
data/lib/oga/xpath/lexer.rb +2043 -0
data/lib/oga/xpath/node.rb +10 -0
data/lib/oga/xpath/parser.rb +535 -0
data/oga.gemspec +45 -0
metadata +221 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 9dc9048ba137771f79111a2caf449bcd4cd9d3f3
+  data.tar.gz: c17ab5f26eba683d98ff04a46db205b980c297ee
+SHA512:
+  metadata.gz: 686eadb178bdf853b6da9f520a626aecbf67c154e4dce28deeb27da5ab33c74a18d8dab023bf3c66f9375147a519078815338496cdf96e0a3deafd482b2d994a
+  data.tar.gz: 07076c85b6f44e308d07acb60ccc535ef12b84197a6a07beaf57f756659ffe01cc31906251c508d972c659efe229c253be21919ad13445da7a9078be602c9fd6
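
A `.gem` file is a tar archive whose members include `metadata.gz` and `data.tar.gz`, and the digests above are recorded per member. As a rough sketch of how such values could be checked by hand, the Ruby below recomputes each digest with the standard library and reports the members that differ. The helper name `mismatched_members`, and the idea of pointing it at a directory of already-extracted members, are assumptions made for illustration; this is not a description of how RubyGems itself verifies packages.

```ruby
require 'digest/sha1'
require 'digest/sha2'
require 'yaml'

# Hypothetical helper (not part of RubyGems or Oga): returns the names of
# members whose recomputed digest differs from the one in checksums.yaml.
#
# checksums_yaml - the text of checksums.yaml
# member_dir     - directory holding the extracted members (assumed layout)
def mismatched_members(checksums_yaml, member_dir)
  digest_classes = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }
  checksums      = YAML.safe_load(checksums_yaml)

  checksums.flat_map do |algorithm, members|
    digest_class = digest_classes.fetch(algorithm)

    # Keep only the entries whose digest does NOT match the recorded value.
    members.reject do |name, expected|
      actual = digest_class.file(File.join(member_dir, name)).hexdigest
      actual == expected.to_s
    end.keys
  end.uniq
end
```

Calling `mismatched_members(File.read('checksums.yaml'), '.')` in a directory that holds the extracted members would return an empty array when every digest matches.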

data/.yardopts ADDED

@@ -0,0 +1,13 @@
+./lib/oga/**/*.rb ./lib/oga.rb
+-m markdown
+-M kramdown
+-o yardoc
+-r ./README.md
+--private
+--protected
+--asset ./doc/css/common.css:css/common.css
+--verbose
+-
+./doc/*.md
+LICENSE
+CONTRIBUTING.md

data/LICENSE ADDED

@@ -0,0 +1,19 @@
+Copyright (c) 2014, Yorick Peterse
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,171 @@
+# Oga
+Oga is an XML/HTML parser written in Ruby. It provides an easy to use API for
+parsing, modifying and querying documents (using XPath expressions). Oga does
+not require system libraries such as libxml, making it easier and faster to
+install on various platforms. To achieve better performance Oga uses a small,
+native extension (C for MRI/Rubinius, Java for JRuby).
+Oga provides an API that allows you to safely parse and query documents in a
+multi-threaded environment, without having to worry about your applications
+blowing up.
+From [Wikipedia][oga-wikipedia]:
+> Oga: A large two-person saw used for ripping large boards in the days before
+> power saws. One person stood on a raised platform, with the board below him,
+> and the other person stood underneath them.
+## Examples
+Parsing a simple string of XML:
+    Oga.parse_xml('<people><person>Alice</person></people>')
+Parsing a simple string of HTML:
+    Oga.parse_html('<link rel="stylesheet" href="foo.css">')
+Parsing an IO handle pointing to XML (this also works when using
+`Oga.parse_html`):
+    handle = File.open('path/to/file.xml')
+    Oga.parse_xml(handle)
+Parsing an IO handle using the pull parser:
+    handle = File.open('path/to/file.xml')
+    parser = Oga::XML::PullParser.new(handle)
+    parser.parse do |node|
+      parser.on(:text) do
+        puts node.text
+      end
+    end
+Querying a document using XPath:
+    document = Oga.parse_xml('<people><person>Alice</person></people>')
+    document.xpath('string(people/person)') # => "Alice"
+Modifying a document and serializing it back to XML:
+    document = Oga.parse_xml('<people><person>Alice</person></people>')
+    name     = document.at_xpath('people/person[1]/text()')
+    name.text = 'Bob'
+    document.to_xml # => "<people><person>Bob</person></people>"
+## Features
+* Support for parsing XML and HTML(5)
+  * DOM parsing
+  * Stream/pull parsing
+* Low memory footprint
+* High performance, if something doesn't perform well enough it's a bug
+* Support for XPath 1.0
+## Requirements
+| Ruby     | Required      | Recommended |
+|:---------|:--------------|:------------|
+| MRI      | >= 1.9.3      | >= 2.1.2    |
+| Rubinius | >= 2.2        | >= 2.2.10   |
+| JRuby    | >= 1.7        | >= 1.7.12   |
+| Maglev   | Not supported |             |
+| Topaz    | Not supported |             |
+| mruby    | Not supported |             |
+Maglev and Topaz are not supported due to the lack of a C API (that I know of)
+and the lack of active development of these Ruby implementations. mruby is not
+supported because it's a very different implementation altogether.
+To install Oga on MRI or Rubinius you'll need to have a working compiler such as
+gcc or clang. Oga's C extension can be compiled with both. JRuby does not
+require a compiler as the native extension is compiled during the Gem building
+process and bundled inside the Gem itself.
+## Thread Safety
+Documents parsed using Oga are thread-safe as long as they are not modified by
+multiple threads at the same time. Querying documents using XPath can be done by
+multiple threads just fine. Write operations, such as removing attributes, are
+_not_ thread-safe and should not be done by multiple threads at once.
+It is advised that you do not share parsed documents between threads unless you
+_really_ have to.
+## Documentation
+The documentation is best viewed [on the documentation website][doc-website].
+* {file:CONTRIBUTING Contributing}
+* {file:changelog Changelog}
+* {file:migrating\_from\_nokogiri Migrating From Nokogiri}
+## Native Extension Setup
+The native extensions can be found in `ext/` and are divided into a C and Java
+extension. These extensions are only used for the XML lexer built using Ragel.
+The grammar for this lexer is shared between C and Java and can be found in
+`ext/ragel/base_lexer.rl`.
+The extensions delegate most of their work back to Ruby code. As a result,
+maintaining this codebase is much easier. If one wants to change the
+grammar they only have to do so in one place and they don't have to worry about
+C and/or Java specific details.
+For more details on calling Ruby methods from Ragel see the source
+documentation in `ext/ragel/base_lexer.rl`.
+## Why Another HTML/XML parser?
+Currently there are a few existing parsers out there, the most famous one being
+[Nokogiri][nokogiri]. Another parser that's becoming more popular these days is
+[Ox][ox]. Ruby's standard library also comes with REXML.
+The sad truth is that these existing libraries are problematic in their own
+ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works
+because of the non-concurrent nature of MRI; on JRuby it works because it's
+implemented in Java. Nokogiri also uses libxml2 which is a massive beast of a
+library, is not thread-safe and is problematic to install on certain platforms
+(apparently). I don't want to compile libxml2 every time I install Nokogiri
+either.
+To give an example about the issues with Nokogiri on Rubinius (or any other
+Ruby implementation that is not MRI or JRuby), take a look at these issues:
+* <https://github.com/rubinius/rubinius/issues/2957>
+* <https://github.com/rubinius/rubinius/issues/2908>
+* <https://github.com/rubinius/rubinius/issues/2462>
+* <https://github.com/sparklemotion/nokogiri/issues/1047>
+* <https://github.com/sparklemotion/nokogiri/issues/939>
+Some of these have been fixed, some have not. The core problem remains:
+Nokogiri acts in a way that there can be a large number of places where it
+*might* break due to throwing around void pointers and what not and expecting
+that things magically work. Note that I have nothing against the people running
+these projects, I just heavily, *heavily* dislike the resulting codebase one
+has to deal with today.
+Ox looks very promising but it lacks a rather crucial feature: parsing HTML
+(without using a SAX API). It's also again a C extension making debugging more
+of a pain (at least for me).
+I just want an XML/HTML parser that I can rely on stability wise and that is
+written in Ruby so I can actually debug it. In theory it should also make it
+easier for other Ruby developers to contribute.
+## License
+All source code in this repository is licensed under the MIT license unless
+specified otherwise. A copy of this license can be found in the file "LICENSE"
+in the root directory of this repository.
+[nokogiri]: https://github.com/sparklemotion/nokogiri
+[oga-wikipedia]: https://en.wikipedia.org/wiki/Japanese_saw#Other_Japanese_saws
+[ox]: https://github.com/ohler55/ox
+[doc-website]: http://code.yorickpeterse.com/oga/latest/
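
The Thread Safety section of the README above says that concurrent writes to a parsed document are not safe. One common way to follow that advice when a document really must be shared is to funnel every modification through a lock. The sketch below shows the pattern with Ruby's core `Mutex`. `GuardedDocument` is a hypothetical wrapper and not part of Oga, and a plain `Hash` stands in for the parsed document so that the example runs without the gem installed.

```ruby
# Hypothetical wrapper (not part of Oga): serializes access to a shared object.
class GuardedDocument
  def initialize(document)
    @document = document
    @lock     = Mutex.new
  end

  # Yields the wrapped object while holding the lock, so only one thread at a
  # time can run the block.
  def modify
    @lock.synchronize { yield @document }
  end
end

# A Hash stands in for a parsed document here (assumption for illustration).
guarded = GuardedDocument.new({ 'count' => 0 })

threads = 4.times.map do
  Thread.new do
    100.times { guarded.modify { |doc| doc['count'] += 1 } }
  end
end

threads.each(&:join)

puts guarded.modify { |doc| doc['count'] } # prints 400
```

Reads that may overlap with writes should go through the same lock. Reads that never overlap with a write do not need it, which matches the README's statement that querying from multiple threads is fine.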

data/doc/DCO.md ADDED

@@ -0,0 +1,25 @@
+# Developer's Certificate of Origin 1.0
+By making a contribution to this project, I certify that:
+1. The contribution was created in whole or in part by me and I
+   have the right to submit it under the open source license
+   indicated in the file LICENSE; or
+2. The contribution is based upon previous work that, to the best
+   of my knowledge, is covered under an appropriate open source
+   license and I have the right under that license to submit that
+   work with modifications, whether created in whole or in part
+   by me, under the same open source license (unless I am
+   permitted to submit under a different license), as indicated
+   in the file LICENSE; or
+3. The contribution was provided directly to me by some other
+   person who certified (1), (2) or (3) and I have not modified
+   it.
+4. I understand and agree that this project and the contribution
+   are public and that a record of the contribution (including all
+   personal information I submit with it, including my sign-off) is
+   maintained indefinitely and may be redistributed consistent with
+   this project or the open source license(s) involved.

data/doc/changelog.md ADDED

@@ -0,0 +1,7 @@
+# Changelog
+## 0.1.0 - 2014-09-12
+The first public release of Oga. This release contains support for parsing XML,
+basic support for parsing HTML, support for querying documents using XPath and
+more.

data/doc/css/common.css ADDED

@@ -0,0 +1,76 @@
+body
+{
+    font-size:   14px;
+    line-height: 1.6;
+    margin:      0 auto;
+    max-width:   960px;
+}
+p code
+{
+    background:    #f2f2f2;
+    padding-left:  3px;
+    padding-right: 3px;
+}
+pre.code
+{
+    font-size:   13px;
+    line-height: 1.4;
+    overflow:    auto;
+}
+blockquote
+{
+    border-left: 5px solid #eee;
+    margin: 0px;
+    padding-left: 15px;
+}
+/**
+ * YARD uses generic table styles, using a special class means those tables
+ * don't get messed up.
+ */
+.table
+{
+    border:          1px solid #ccc;
+    border-right:    none;
+    border-collapse: separate;
+    border-spacing:  0;
+    text-align:      left;
+}
+.table.full
+{
+    width: 100%;
+}
+    .table .field_name
+    {
+        min-width: 160px;
+    }
+    .table thead tr th.no_sort:first-child
+    {
+        width: 25px;
+    }
+    .table thead tr th, .table tbody tr td
+    {
+        border-bottom:  1px solid #ccc;
+        border-right:   1px solid #ccc;
+        min-width:      20px;
+        padding:        8px 5px;
+        text-align:     left;
+        vertical-align: top;
+    }
+    .table tbody tr:last-child td
+    {
+        border-bottom: none;
+    }
+    .table tr:nth-child(odd) td
+    {
+        background: #f9f9f9;
+    }

data/doc/migrating_from_nokogiri.md ADDED

@@ -0,0 +1,169 @@
+# Migrating From Nokogiri
+If you're parsing XML/HTML documents using Ruby, chances are you're using
+[Nokogiri][nokogiri] for this. This guide aims to make it easier to switch from
+Nokogiri to Oga.
+## Parsing Documents
+In Nokogiri there are two de facto ways of parsing documents:
+* `Nokogiri.XML()` for XML documents
+* `Nokogiri.HTML()` for HTML documents
+For example, to parse an XML document you'd use the following:
+    Nokogiri::XML('<root>foo</root>')
+Oga instead uses the following two methods:
+* `Oga.parse_xml`
+* `Oga.parse_html`
+Their usage is similar:
+    Oga.parse_xml('<root>foo</root>')
+Nokogiri returns two distinctive document classes based on what method was used
+to parse a document:
+* `Nokogiri::XML::Document` for XML documents
+* `Nokogiri::HTML::Document` for HTML documents
+Oga on the other hand always returns an `Oga::XML::Document` instance. Oga
+currently makes no distinction between XML and HTML documents other than at the
+lexer level. This might change in the future if deemed required.
+## Querying Documents
+Nokogiri allows one to query documents/elements using both XPath expressions and
+CSS selectors. In Nokogiri one queries a document as follows:
+    document = Nokogiri::XML('<root><foo>bar</foo></root>')
+    document.xpath('root/foo')
+    document.css('root foo')
+Oga currently only supports XPath expressions; CSS selectors will be added in
+the near future. Querying documents works similarly to Nokogiri:
+    document = Oga.parse_xml('<root><foo>bar</foo></root>')
+    document.xpath('root/foo')
+Nokogiri also allows you to query a document and return the first match, as
+opposed to an entire node set, using the method `at`. In Nokogiri this method
+can be used for both XPath expressions and CSS selectors. Oga has no such
+method; instead it provides the following more dedicated method:
+* `at_xpath`: returns the first node of an XPath expression
+For example:
+    document = Oga.parse_xml('<root><foo>bar</foo></root>')
+    document.at_xpath('root/foo')
+By using a dedicated method Oga doesn't have to try and guess what type of
+expression you're using (XPath or CSS), so it can never mistake one for the other.
+## Retrieving Attribute Values
+Nokogiri provides two methods for retrieving attributes and attribute values:
+* `Nokogiri::XML::Node#attribute`
+* `Nokogiri::XML::Node#attr`
+The first method always returns an instance of `Nokogiri::XML::Attr`, the
+second method returns the attribute value as a `String`. This behaviour,
+especially due to the names used, is extremely confusing.
+Oga on the other hand provides the following two methods:
+* `Oga::XML::Element#attribute` (aliased as `attr`)
+* `Oga::XML::Element#get`
+The first method always returns an `Oga::XML::Attribute` instance, the second
+returns the attribute value as a `String`. I deliberately chose `get` for
+getting a value to remove the confusion of `attribute` vs `attr`. This also
+allows for `attr` to simply be an alias of `attribute`.
+As an example, this is how you'd get the value of a `class` attribute in
+Nokogiri:
+    document = Nokogiri::XML('<root class="foo"></root>')
+    document.xpath('root').first.attr('class') # => "foo"
+This is how you'd get the same value in Oga:
+    document = Oga.parse_xml('<root class="foo"></root>')
+    document.xpath('root').first.get('class') # => "foo"
+## Modifying Documents
+Modifying documents in Nokogiri is not as convenient as it perhaps could be. For
+example, adding an element to a document is done as follows:
+    document = Nokogiri::XML('<root></root>')
+    root     = document.xpath('root').first
+    name = Nokogiri::XML::Element.new('name', document)
+    name.inner_html = 'Alice'
+    root.add_child(name)
+The annoying part here is that we have to pass a document into an Element's
+constructor. As such, you cannot create elements without first creating a
+document. Another thing is that Nokogiri has no method called `inner_text=`,
+instead you have to use the method `inner_html=`.
+In Oga you'd use the following:
+    document = Oga.parse_xml('<root></root>')
+    root     = document.xpath('root').first
+    name = Oga::XML::Element.new(:name => 'name')
+    name.inner_text = 'Alice'
+    root.children << name
+Adding attributes works similarly for both Nokogiri and Oga. For Nokogiri you'd
+use the following:
+    element.set_attribute('class', 'foo')
+Alternatively you can do the following:
+    element['class'] = 'foo'
+In Oga you'd instead use the method `set`:
+    element.set('class', 'foo')
+This method automatically creates an attribute if it doesn't exist, including
+the namespace if specified:
+    element.set('foo:class', 'foo')
+## Serializing Documents
+Serializing the document back to XML works the same in both libraries: simply
+call `to_xml` on a document or element and you'll get a String back containing
+the XML. There is one key difference here though: Nokogiri does not return the
+exact same output as it was given as input; for example, it adds an XML
+declaration:
+    Nokogiri::XML('<root></root>').to_xml # => "<?xml version=\"1.0\"?>\n<root/>\n"
+Oga on the other hand does not do this:
+    Oga.parse_xml('<root></root>').to_xml # => "<root></root>"
+Oga also doesn't insert random newlines or other possibly unexpected (or
+unwanted) data.
+[nokogiri]: http://nokogiri.org/