nokogumbo 2.0.0.pre.alpha → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: e0d434c0749d7922ba8f084c15ed7219ccbf0e07b715368ae846bc38e64aad17
4
- data.tar.gz: 2770648e3e9e82d0ffb1877f1c06edc537688cf6a8405bc52dbdf5a6bb69bc1a
3
+ metadata.gz: 97aae1603382eb4357f4126f4c36f86841b930a04b46e0eb6a1c02c26c9e77a4
4
+ data.tar.gz: 614f1dc01be03d4ccb48b43d512faf80f556e8ddde4690fa34fbb9e420c36cb9
5
5
  SHA512:
6
- metadata.gz: e6c3de49495bf55ccaa250e2a3275b6796b0f0565da2a930e3333d2a153f2a16312eb77cb28ca3e03c17720127c2ecc27a1f71cfd6acfd15407295c29973e9fb
7
- data.tar.gz: e8ce6c80cb2327d2327f03c7e829156c1f0074ba4d6fce2b0d59305b80112b8fd5edc0932fad1fca13cb5f4bb6f2652fe52a2f090110aa76d06e1afbdebc334f
6
+ metadata.gz: aabd005ec985f1a94b0b82195ce8547fabc3d3f97e672b9b03c174ddd131b534b6c2648803ccd0d1fdfc9224bab132cdbd2757271b5646fc3873649d1f509e26
7
+ data.tar.gz: 4e793d5436de772587f2abcdb54c1060a3ddbde3dfcad05539e823dbde3c67f0792d1589289323fe76d5baebe265a526aba71d5adcbeb8d2ca72559d789e7b14
data/README.md CHANGED
@@ -5,7 +5,8 @@ Nokogumbo provides the ability for a Ruby program to invoke the
5
5
  and to access the result as a
6
6
  [Nokogiri::HTML::Document](http://rdoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document).
7
7
 
8
- [![Build Status](https://travis-ci.org/rubys/nokogumbo.svg)](https://travis-ci.org/rubys/nokogumbo)
8
+ [![Travis-CI Build Status](https://travis-ci.org/rubys/nokogumbo.svg)](https://travis-ci.org/rubys/nokogumbo)
9
+ [![Appveyor Build Status](https://ci.appveyor.com/api/projects/status/github/rubys/nokogumbo)](https://ci.appveyor.com/project/rubys/nokogumbo/branch/master)
9
10
 
10
11
  ## Usage
11
12
 
@@ -14,8 +15,7 @@ require 'nokogumbo'
14
15
  doc = Nokogiri.HTML5(string)
15
16
  ```
16
17
 
17
- An experimental _fragment_ method is also provided. While not HTML5
18
- compliant, it may be useful:
18
+ To parse an HTML fragment, a `fragment` method is provided.
19
19
 
20
20
  ```ruby
21
21
  require 'nokogumbo'
@@ -49,20 +49,26 @@ no parse errors are reported but this can be configured by passing the
49
49
 
50
50
  ```ruby
51
51
  require 'nokogumbo'
52
- doc = Nokogiri::HTML5.parse('Hi there!<body>', max_errors: 10)
52
+ doc = Nokogiri::HTML5.parse('<span/>Hi there!</span foo=bar />', max_errors: 10)
53
53
  doc.errors.each do |err|
54
- puts err
54
+ puts(err)
55
55
  end
56
56
  ```
57
57
 
58
58
  This prints the following.
59
59
  ```
60
- 1:1: ERROR: @1:1: The doctype must be the first token in the document.
61
- Hi there!<body>
60
+ 1:1: ERROR: Expected a doctype token
61
+ <span/>Hi there!</span foo=bar />
62
62
  ^
63
- 1:10: ERROR: @1:10: That tag isn't allowed here Currently open tags: html, body..
64
- Hi there!<body>
65
- ^
63
+ 1:1: ERROR: Start tag of nonvoid HTML element ends with '/>', use '>'.
64
+ <span/>Hi there!</span foo=bar />
65
+ ^
66
+ 1:17: ERROR: End tag ends with '/>', use '>'.
67
+ <span/>Hi there!</span foo=bar />
68
+ ^
69
+ 1:17: ERROR: End tag contains attributes.
70
+ <span/>Hi there!</span foo=bar />
71
+ ^
66
72
  ```
67
73
 
68
74
  Using `max_errors: -1` results in an unlimited number of errors being
@@ -71,6 +77,41 @@ returned.
71
77
  The errors returned by `#errors` are instances of
72
78
  [`Nokogiri::XML::SyntaxError`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SyntaxError).
73
79
 
80
+ The [HTML
81
+ standard](https://html.spec.whatwg.org/multipage/parsing.html#parse-errors)
82
+ defines a number of standard parse error codes. These error codes only cover
83
+ the "tokenization" stage of parsing HTML. The parse errors in the
84
+ "tree construction" stage do not have standardized error codes (yet).
85
+
86
+ As a convenience to Nokogumbo users, the defined error codes are available
87
+ via the
88
+ [`Nokogiri::XML::SyntaxError#str1`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SyntaxError#str1-instance_method)
89
+ method.
90
+
91
+ ```ruby
92
+ require 'nokogumbo'
93
+ doc = Nokogiri::HTML5.parse('<span/>Hi there!</span foo=bar />', max_errors: 10)
94
+ doc.errors.each do |err|
95
+ puts("#{err.line}:#{err.column}: #{err.str1}")
96
+ end
97
+ ```
98
+
99
+ This prints the following.
100
+ ```
101
+ 1:1: generic-parser
102
+ 1:1: non-void-html-element-start-tag-with-trailing-solidus
103
+ 1:17: end-tag-with-trailing-solidus
104
+ 1:17: end-tag-with-attributes
105
+ ```
106
+
107
+ Note that the first error is `generic-parser` because it's an error from the
108
+ tree construction stage and doesn't have a standardized error code.
109
+
110
+ For the purposes of semantic versioning, the error messages, error locations,
111
+ and error codes are not part of Nokogumbo's public API. That is, these are
112
+ subject to change without Nokogumbo's major version number changing. These may
113
+ be stabilized in the future.
114
+
74
115
  ### Maximum tree depth
75
116
  The maximum depth of the DOM tree parsed by the various parsing methods is
76
117
  configurable by the `:max_tree_depth` option. If the depth of the tree would
@@ -201,6 +242,36 @@ rules defined in the HTML5 specification for doing so.
201
242
  * Instead of returning `unknown` as the element name for unknown tags, the
202
243
  original tag name is returned verbatim.
203
244
 
245
+ # Flavors of Nokogumbo
246
+ Nokogumbo uses libxml2, the XML library underlying Nokogiri, to speed up
247
+ parsing. If the libxml2 headers are not available, then Nokogumbo resorts to
248
+ using Nokogiri's Ruby API to construct the DOM tree.
249
+
250
+ Nokogiri can be configured to either use the system library version of libxml2
251
+ or use a bundled version. By default (as of Nokogiri version 1.8.4), Nokogiri
252
+ will use a bundled version.
253
+
254
+ To prevent differences between versions of libxml2, Nokogumbo will only use
255
+ libxml2 if the build process can find the exact same version used by Nokogiri.
256
+ This leads to three possibilities
257
+
258
+ 1. Nokogiri is compiled with the bundled libxml2. In this case, Nokogumbo will
259
+ (by default) use the same version of libxml2.
260
+ 2. Nokogiri is compiled with the system libxml2. In this case, if the libxml2
261
+ headers are available, then Nokogumbo will (by default) use the system
262
+ version and headers.
263
+ 3. Nokogiri is compiled with the system libxml2 but its headers aren't
264
+ available at build time for Nokogumbo. In this case, Nokogumbo will use the
265
+ slower Ruby API.
266
+
267
+ Using libxml2 can be required by passing `-- --with-libxml2` to `bundle exec
268
+ rake` or to `gem install`. Using libxml2 can be prohibited by instead passing
269
+ `-- --without-libxml2`.
270
+
271
+ Functionally, the only difference between using libxml2 or not is in the
272
+ behavior of `Nokogiri::XML::Node#line`. If it is used, then `#line` will
273
+ return the line number of the corresponding node. Otherwise, it will return 0.
274
+
204
275
  # Installation
205
276
 
206
277
  git clone https://github.com/rubys/nokogumbo.git
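
Note on the README hunks above: the new text says that `#str1` carries a standardized code for tokenizer errors, while tree construction errors are reported as `generic-parser`. The following is a minimal sketch of how a caller might separate the two groups. It does not load Nokogumbo: `FakeSyntaxError` is a hypothetical stand-in for `Nokogiri::XML::SyntaxError` that models only the `line`, `column` and `str1` readers, and the sample data is copied from the README output shown in the hunk. Because the README states that error codes are not part of the public API, any such filtering should be treated as best effort.

```ruby
# Hypothetical stand-in for Nokogiri::XML::SyntaxError. Only the three
# readers used in the README example are modeled here.
FakeSyntaxError = Struct.new(:line, :column, :str1)

# The four errors listed in the README output for
# '<span/>Hi there!</span foo=bar />'.
errors = [
  FakeSyntaxError.new(1, 1, 'generic-parser'),
  FakeSyntaxError.new(1, 1, 'non-void-html-element-start-tag-with-trailing-solidus'),
  FakeSyntaxError.new(1, 17, 'end-tag-with-trailing-solidus'),
  FakeSyntaxError.new(1, 17, 'end-tag-with-attributes')
]

# Split errors without a standardized code (tree construction stage)
# from errors that carry a tokenizer error code.
tree_errors, tokenizer_errors = errors.partition do |err|
  err.str1 == 'generic-parser'
end

tokenizer_errors.each do |err|
  puts("#{err.line}:#{err.column}: #{err.str1}")
end
```

With a real document, `errors` would instead come from `doc.errors` as in the README example.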
@@ -108,9 +108,14 @@ gumbo_src = File.join(ext_dir, 'gumbo_src')
108
108
 
109
109
  Dir.chdir(ext_dir) do
110
110
  $srcs = Dir['*.c', '../../gumbo-parser/src/*.c']
111
+ $hdrs = Dir['*.h', '../../gumbo-parser/src/*.h']
111
112
  end
112
113
  $INCFLAGS << ' -I$(srcdir)/../../gumbo-parser/src'
113
114
  $VPATH << '$(srcdir)/../../gumbo-parser/src'
114
115
 
115
- create_makefile('nokogumbo/nokogumbo')
116
+ create_makefile('nokogumbo/nokogumbo') do |conf|
117
+ conf.map! do |chunk|
118
+ chunk.gsub(/^HDRS = .*$/, "HDRS = #{$hdrs.map { |h| File.join('$(srcdir)', h)}.join(' ')}")
119
+ end
120
+ end
116
121
  # vim: set sw=2 sts=2 ts=8 et:
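
Note on the hunk above: the block passed to `create_makefile` rewrites the `HDRS = ...` line of the generated Makefile text so that it lists the collected header files, each prefixed with `$(srcdir)`. The sketch below applies the same substitution to a hand-written stand-in for the Makefile chunks, so it can be run without `mkmf`. The names `hdrs` and `conf` and the sample Makefile text are assumptions made for illustration; the real code uses the global `$hdrs` filled in by `Dir[...]`.

```ruby
# Assumed sample header list; the real list is collected with Dir[...].
hdrs = ['nokogumbo.h', '../../gumbo-parser/src/gumbo.h']

# Stand-in for the Makefile chunks that the block receives.
conf = ["SRCS = nokogumbo.c\nHDRS = placeholder.h\nOBJS = nokogumbo.o\n"]

# Replace the whole HDRS line; ^ and $ match at line boundaries in Ruby,
# so the neighboring SRCS and OBJS lines are left untouched.
conf.map! do |chunk|
  chunk.gsub(/^HDRS = .*$/, "HDRS = #{hdrs.map { |h| File.join('$(srcdir)', h) }.join(' ')}")
end

puts conf.first
```

Listing the headers in `HDRS` presumably lets the generated Makefile rebuild object files when a header changes, although the diff itself does not state the motivation.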
@@ -9,7 +9,7 @@
9
9
  // document tree is then walked:
10
10
  //
11
11
  // * if Nokogiri and libxml2 headers are available at compile time,
12
- // (ifdef NGLIB) then a parallel libxml2 tree is constructed, and the
12
+ // (if NGLIB) then a parallel libxml2 tree is constructed, and the
13
13
  // final document is then wrapped using Nokogiri_wrap_xml_document.
14
14
  // This approach reduces memory and CPU requirements as Ruby objects
15
15
  // are only built when necessary.
@@ -20,74 +20,110 @@
20
20
 
21
21
  #include <assert.h>
22
22
  #include <ruby.h>
23
+ #include <ruby/version.h>
24
+
23
25
  #include "gumbo.h"
24
- #include "error.h"
25
26
 
26
27
  // class constants
27
28
  static VALUE Document;
28
29
 
29
- #ifdef NGLIB
30
+ // Interned symbols
31
+ static ID internal_subset;
32
+ static ID parent;
33
+
34
+ /* Backwards compatibility to Ruby 2.1.0 */
35
+ #if RUBY_API_VERSION_CODE < 20200
36
+ #include <ruby/encoding.h>
37
+
38
+ static VALUE rb_utf8_str_new(const char *str, long length) {
39
+ return rb_enc_str_new(str, length, rb_utf8_encoding());
40
+ }
41
+
42
+ static VALUE rb_utf8_str_new_cstr(const char *str) {
43
+ return rb_enc_str_new_cstr(str, rb_utf8_encoding());
44
+ }
45
+
46
+ static VALUE rb_utf8_str_new_static(const char *str, long length) {
47
+ return rb_enc_str_new(str, length, rb_utf8_encoding());
48
+ }
49
+ #endif
50
+
51
+ #if NGLIB
30
52
  #include <nokogiri.h>
31
- #include <xml_syntax_error.h>
32
53
  #include <libxml/tree.h>
33
54
  #include <libxml/HTMLtree.h>
34
55
 
35
56
  #define NIL NULL
36
- #define CONST_CAST (xmlChar const*)
37
57
  #else
38
58
  #define NIL Qnil
39
- #define CONST_CAST
40
59
 
41
- // more class constants
60
+ // These are defined by nokogiri.h
42
61
  static VALUE cNokogiriXmlSyntaxError;
62
+ static VALUE cNokogiriXmlElement;
63
+ static VALUE cNokogiriXmlText;
64
+ static VALUE cNokogiriXmlCData;
65
+ static VALUE cNokogiriXmlComment;
66
+
67
+ // Interned symbols.
68
+ static ID new;
69
+ static ID node_name_;
70
+
71
+ // Map libxml2 types to Ruby VALUE.
72
+ typedef VALUE xmlNodePtr;
73
+ typedef VALUE xmlDocPtr;
74
+ typedef VALUE xmlNsPtr;
75
+ typedef VALUE xmlDtdPtr;
76
+ typedef char xmlChar;
77
+ #define BAD_CAST
78
+
79
+ // Redefine libxml2 API as Ruby function calls.
80
+ static xmlNodePtr xmlNewDocNode(xmlDocPtr doc, xmlNsPtr ns, const xmlChar *name, const xmlChar *content) {
81
+ assert(ns == NIL && content == NULL);
82
+ return rb_funcall(cNokogiriXmlElement, new, 2, rb_utf8_str_new_cstr(name), doc);
83
+ }
84
+
85
+ static xmlNodePtr xmlNewDocText(xmlDocPtr doc, const xmlChar *content) {
86
+ VALUE str = rb_utf8_str_new_cstr(content);
87
+ return rb_funcall(cNokogiriXmlText, new, 2, str, doc);
88
+ }
89
+
90
+ static xmlNodePtr xmlNewCDataBlock(xmlDocPtr doc, const xmlChar *content, int len) {
91
+ VALUE str = rb_utf8_str_new(content, len);
92
+ // CDATA.new takes arguments in the opposite order from Text.new.
93
+ return rb_funcall(cNokogiriXmlCData, new, 2, doc, str);
94
+ }
95
+
96
+ static xmlNodePtr xmlNewDocComment(xmlDocPtr doc, const xmlChar *content) {
97
+ VALUE str = rb_utf8_str_new_cstr(content);
98
+ return rb_funcall(cNokogiriXmlComment, new, 2, doc, str);
99
+ }
100
+
101
+ static xmlNodePtr xmlAddChild(xmlNodePtr parent, xmlNodePtr cur) {
102
+ ID add_child;
103
+ CONST_ID(add_child, "add_child");
104
+ return rb_funcall(parent, add_child, 1, cur);
105
+ }
106
+
107
+ static void xmlSetNs(xmlNodePtr node, xmlNsPtr ns) {
108
+ ID namespace_;
109
+ CONST_ID(namespace_, "namespace=");
110
+ rb_funcall(node, namespace_, 1, ns);
111
+ }
43
112
 
44
- static VALUE Element;
45
- static VALUE Text;
46
- static VALUE CDATA;
47
- static VALUE Comment;
48
-
49
- // interned symbols
50
- static VALUE new;
51
- static VALUE attribute;
52
- static VALUE set_attribute;
53
- static VALUE remove_attribute;
54
- static VALUE add_child;
55
- static VALUE internal_subset;
56
- static VALUE remove_;
57
- static VALUE create_internal_subset;
58
- static VALUE key_;
59
- static VALUE node_name_;
60
-
61
- // map libxml2 types to Ruby VALUE
62
- #define xmlNodePtr VALUE
63
- #define xmlDocPtr VALUE
64
-
65
- // redefine libxml2 API as Ruby function calls
66
- #define xmlNewDocNode(doc, ns, name, content) \
67
- rb_funcall(Element, new, 2, rb_str_new2(name), doc)
68
- #define xmlNewDocText(doc, text) \
69
- rb_funcall(Text, new, 2, rb_str_new2(text), doc)
70
- #define xmlNewCDataBlock(doc, content, length) \
71
- rb_funcall(CDATA, new, 2, doc, rb_str_new(content, length))
72
- #define xmlNewDocComment(doc, text) \
73
- rb_funcall(Comment, new, 2, doc, rb_str_new2(text))
74
- #define xmlAddChild(element, node) \
75
- rb_funcall(element, add_child, 1, node)
76
- #define xmlDocSetRootElement(doc, root) \
77
- rb_funcall(doc, add_child, 1, root)
78
- #define xmlCreateIntSubset(doc, name, external, system) \
79
- rb_funcall(doc, create_internal_subset, 3, rb_str_new2(name), \
80
- (external ? rb_str_new2(external) : Qnil), \
81
- (system ? rb_str_new2(system) : Qnil));
82
- #define Nokogiri_wrap_xml_document(klass, doc) \
83
- doc
113
+ static void xmlFreeDoc(xmlDocPtr doc) { }
114
+
115
+ static VALUE Nokogiri_wrap_xml_document(VALUE klass, xmlDocPtr doc) {
116
+ return doc;
117
+ }
84
118
 
85
119
  static VALUE find_dummy_key(VALUE collection) {
86
120
  VALUE r_dummy = Qnil;
87
121
  char dummy[5] = "a";
88
122
  size_t len = 1;
123
+ ID key_;
124
+ CONST_ID(key_, "key?");
89
125
  while (len < sizeof dummy) {
90
- r_dummy = rb_str_new(dummy, len);
126
+ r_dummy = rb_utf8_str_new(dummy, len);
91
127
  if (rb_funcall(collection, key_, 1, r_dummy) == Qfalse)
92
128
  return r_dummy;
93
129
  for (size_t i = 0; ; ++i) {
@@ -105,10 +141,42 @@ static VALUE find_dummy_key(VALUE collection) {
105
141
  }
106
142
  }
107
143
  // This collection has 475254 elements?? Give up.
108
- return Qnil;
144
+ rb_raise(rb_eArgError, "Failed to find a dummy key.");
109
145
  }
110
146
 
111
- static xmlNodePtr xmlNewProp(xmlNodePtr node, const char *name, const char *value) {
147
+ // This should return an xmlAttrPtr, but we don't need it and it's easier to
148
+ // not get the result.
149
+ static void xmlNewNsProp (
150
+ xmlNodePtr node,
151
+ xmlNsPtr ns,
152
+ const xmlChar *name,
153
+ const xmlChar *value
154
+ ) {
155
+ ID set_attribute;
156
+ CONST_ID(set_attribute, "set_attribute");
157
+
158
+ VALUE rvalue = rb_utf8_str_new_cstr(value);
159
+
160
+ if (RTEST(ns)) {
161
+ // This is an easy case, we have a namespace so it's enough to do
162
+ // node["#{ns.prefix}:#{name}"] = value
163
+ ID prefix;
164
+ CONST_ID(prefix, "prefix");
165
+ VALUE ns_prefix = rb_funcall(ns, prefix, 0);
166
+ VALUE qname = rb_sprintf("%" PRIsVALUE ":%s", ns_prefix, name);
167
+ rb_funcall(node, set_attribute, 2, qname, rvalue);
168
+ return;
169
+ }
170
+
171
+ size_t len = strlen(name);
172
+ VALUE rname = rb_utf8_str_new(name, len);
173
+ if (memchr(name, ':', len) == NULL) {
174
+ // This is the easiest case. There's no colon so we can do
175
+ // node[name] = value.
176
+ rb_funcall(node, set_attribute, 2, rname, rvalue);
177
+ return;
178
+ }
179
+
112
180
  // Nokogiri::XML::Node#set_attribute calls xmlSetProp(node, name, value)
113
181
  // which behaves roughly as
114
182
  // if name is a QName prefix:local
@@ -118,7 +186,7 @@ static xmlNodePtr xmlNewProp(xmlNodePtr node, const char *name, const char *valu
118
186
  //
119
187
  // If the prefix is "xml", then the namespace lookup will create it.
120
188
  //
121
- // By contrast, xmlNewProp does not do this parsing and creates an attribute
189
+ // By contrast, xmlNewNsProp does not do this parsing and creates an attribute
122
190
  // with the name and value exactly as given. This is the behavior that we
123
191
  // want.
124
192
  //
@@ -129,164 +197,84 @@ static xmlNodePtr xmlNewProp(xmlNodePtr node, const char *name, const char *valu
129
197
  // Work around this by inserting a dummy attribute and then changing the
130
198
  // name, if needed.
131
199
 
132
- // Can't use strchr since it's locale-sensitive.
133
- size_t len = strlen(name);
134
- VALUE r_name = rb_str_new(name, len);
135
- if (memchr(name, ':', len) == NULL) {
136
- // No colon.
137
- return rb_funcall(node, set_attribute, 2, r_name, rb_str_new2(value));
138
- }
139
200
  // Find a dummy attribute string that doesn't already exist.
140
201
  VALUE dummy = find_dummy_key(node);
141
- if (dummy == Qnil)
142
- return Qnil;
143
202
  // Add the dummy attribute.
144
- VALUE r_value = rb_funcall(node, set_attribute, 2, dummy, rb_str_new2(value));
145
- if (r_value == Qnil)
146
- return Qnil;
147
- // Remove thet old attribute, if it exists.
148
- rb_funcall(node, remove_attribute, 1, r_name);
203
+ rb_funcall(node, set_attribute, 2, dummy, rvalue);
204
+
205
+ // Remove the old attribute, if it exists.
206
+ ID remove_attribute;
207
+ CONST_ID(remove_attribute, "remove_attribute");
208
+ rb_funcall(node, remove_attribute, 1, rname);
209
+
149
210
  // Rename the dummy
211
+ ID attribute;
212
+ CONST_ID(attribute, "attribute");
150
213
  VALUE attr = rb_funcall(node, attribute, 1, dummy);
151
- if (attr == Qnil)
152
- return Qnil;
153
- rb_funcall(attr, node_name_, 1, r_name);
154
- return attr;
214
+ rb_funcall(attr, node_name_, 1, rname);
155
215
  }
156
216
  #endif
157
217
 
158
- // Build a xmlNodePtr for a given GumboNode (recursively)
159
- static xmlNodePtr walk_tree(xmlDocPtr document, GumboNode *node);
160
-
161
- // Build a xmlNodePtr for a given GumboElement (recursively)
162
- static xmlNodePtr walk_element(xmlDocPtr document, GumboElement *node) {
163
- // create the given element
164
- xmlNodePtr element = xmlNewDocNode(document, NIL, CONST_CAST node->name, NIL);
165
-
166
- // add in the attributes
167
- GumboVector* attrs = &node->attributes;
168
- char *name = NULL;
169
- size_t namelen = 0;
170
- const char *ns;
171
- for (size_t i=0; i < attrs->length; i++) {
172
- GumboAttribute *attr = attrs->data[i];
173
-
174
- switch (attr->attr_namespace) {
175
- case GUMBO_ATTR_NAMESPACE_XLINK:
176
- ns = "xlink:";
177
- break;
178
-
179
- case GUMBO_ATTR_NAMESPACE_XML:
180
- ns = "xml:";
181
- break;
182
-
183
- case GUMBO_ATTR_NAMESPACE_XMLNS:
184
- ns = "xmlns:";
185
- if (!strcmp(attr->name, "xmlns")) ns = NULL;
186
- break;
187
-
188
- default:
189
- ns = NULL;
190
- }
191
-
192
- if (ns) {
193
- if (strlen(ns) + strlen(attr->name) + 1 > namelen) {
194
- free(name);
195
- name = NULL;
196
- }
197
-
198
- if (!name) {
199
- namelen = strlen(ns) + strlen(attr->name) + 1;
200
- name = malloc(namelen);
201
- }
202
-
203
- strcpy(name, ns);
204
- strcat(name, attr->name);
205
- xmlNewProp(element, CONST_CAST name, CONST_CAST attr->value);
206
- } else {
207
- xmlNewProp(element, CONST_CAST attr->name, CONST_CAST attr->value);
208
- }
209
- }
210
- if (name) free(name);
211
-
212
- // add in the children
213
- GumboVector* children = &node->children;
214
- for (size_t i=0; i < children->length; i++) {
215
- xmlNodePtr node = walk_tree(document, children->data[i]);
216
- if (node) xmlAddChild(element, node);
217
- }
218
-
219
- return element;
220
- }
221
-
222
- static xmlNodePtr walk_tree(xmlDocPtr document, GumboNode *node) {
223
- switch (node->type) {
224
- case GUMBO_NODE_DOCUMENT:
225
- return NIL;
226
- case GUMBO_NODE_ELEMENT:
227
- case GUMBO_NODE_TEMPLATE:
228
- return walk_element(document, &node->v.element);
229
- case GUMBO_NODE_TEXT:
230
- case GUMBO_NODE_WHITESPACE:
231
- return xmlNewDocText(document, CONST_CAST node->v.text.text);
232
- case GUMBO_NODE_CDATA:
233
- return xmlNewCDataBlock(document,
234
- CONST_CAST node->v.text.text,
235
- (int) strlen(node->v.text.text));
236
- case GUMBO_NODE_COMMENT:
237
- return xmlNewDocComment(document, CONST_CAST node->v.text.text);
238
- }
239
- }
240
-
241
218
  // URI = system id
242
219
  // external id = public id
243
- #if NGLIB
244
- static htmlDocPtr new_html_doc(const char *dtd_name, const char *system, const char *public)
220
+ static xmlDocPtr new_html_doc(const char *dtd_name, const char *system, const char *public)
245
221
  {
222
+ #if NGLIB
246
223
  // These two libxml2 functions take the public and system ids in
247
224
  // opposite orders.
248
225
  htmlDocPtr doc = htmlNewDocNoDtD(/* URI */ NULL, /* ExternalID */NULL);
249
226
  assert(doc);
250
227
  if (dtd_name)
251
- xmlCreateIntSubset(doc, CONST_CAST dtd_name, CONST_CAST public, CONST_CAST system);
228
+ xmlCreateIntSubset(doc, BAD_CAST dtd_name, BAD_CAST public, BAD_CAST system);
252
229
  return doc;
253
- }
254
230
  #else
255
- // remove internal subset from newly created documents
256
- static VALUE new_html_doc(const char *dtd_name, const char *system, const char *public) {
231
+ // remove internal subset from newly created documents
257
232
  VALUE doc;
258
233
  // If system and public are both NULL, Document#new is going to set default
259
234
  // values for them so we're going to have to remove the internal subset
260
235
  // which seems to leak memory in Nokogiri, so leak as little as possible.
261
236
  if (system == NULL && public == NULL) {
262
- doc = rb_funcall(Document, new, 2, /* URI */ Qnil, /* external_id */ rb_str_new("", 0));
263
- rb_funcall(rb_funcall(doc, internal_subset, 0), remove_, 0);
237
+ ID remove;
238
+ CONST_ID(remove, "remove");
239
+ doc = rb_funcall(Document, new, 2, /* URI */ Qnil, /* external_id */ rb_utf8_str_new_static("", 0));
240
+ rb_funcall(rb_funcall(doc, internal_subset, 0), remove, 0);
264
241
  if (dtd_name) {
265
242
  // We need to create an internal subset now.
266
- rb_funcall(doc, create_internal_subset, 3, rb_str_new2(dtd_name), Qnil, Qnil);
243
+ ID create_internal_subset;
244
+ CONST_ID(create_internal_subset, "create_internal_subset");
245
+ rb_funcall(doc, create_internal_subset, 3, rb_utf8_str_new_cstr(dtd_name), Qnil, Qnil);
267
246
  }
268
247
  } else {
269
248
  assert(dtd_name);
270
249
  // Rather than removing and creating the internal subset as we did above,
271
250
  // just create and then rename one.
272
- VALUE r_system = system ? rb_str_new2(system) : Qnil;
273
- VALUE r_public = public ? rb_str_new2(public) : Qnil;
251
+ VALUE r_system = system ? rb_utf8_str_new_cstr(system) : Qnil;
252
+ VALUE r_public = public ? rb_utf8_str_new_cstr(public) : Qnil;
274
253
  doc = rb_funcall(Document, new, 2, r_system, r_public);
275
- rb_funcall(rb_funcall(doc, internal_subset, 0), node_name_, 1, rb_str_new2(dtd_name));
254
+ rb_funcall(rb_funcall(doc, internal_subset, 0), node_name_, 1, rb_utf8_str_new_cstr(dtd_name));
276
255
  }
277
256
  return doc;
278
- }
279
257
  #endif
258
+ }
280
259
 
281
- // Parse a string using gumbo_parse into a Nokogiri document
282
- static VALUE parse(VALUE self, VALUE string, VALUE url, VALUE max_errors, VALUE max_depth) {
283
- GumboOptions options = kGumboDefaultOptions;
284
- options.max_errors = NUM2INT(max_errors);
285
- options.max_tree_depth = NUM2INT(max_depth);
260
+ static xmlNodePtr get_parent(xmlNodePtr node) {
261
+ #if NGLIB
262
+ return node->parent;
263
+ #else
264
+ if (!rb_respond_to(node, parent))
265
+ return Qnil;
266
+ return rb_funcall(node, parent, 0);
267
+ #endif
268
+ }
286
269
 
287
- const char *input = RSTRING_PTR(string);
288
- size_t input_len = RSTRING_LEN(string);
289
- GumboOutput *output = gumbo_parse_with_options(&options, input, input_len);
270
+ static GumboOutput *perform_parse(const GumboOptions *options, VALUE input) {
271
+ assert(RTEST(input));
272
+ Check_Type(input, T_STRING);
273
+ GumboOutput *output = gumbo_parse_with_options (
274
+ options,
275
+ RSTRING_PTR(input),
276
+ RSTRING_LEN(input)
277
+ );
290
278
 
291
279
  const char *status_string = gumbo_status_to_string(output->status);
292
280
  switch (output->status) {
@@ -299,100 +287,458 @@ static VALUE parse(VALUE self, VALUE string, VALUE url, VALUE max_errors, VALUE
299
287
  gumbo_destroy_output(output);
300
288
  rb_raise(rb_eNoMemError, "%s", status_string);
301
289
  }
290
+ return output;
291
+ }
302
292
 
303
- xmlDocPtr doc;
304
- if (output->document->v.document.has_doctype) {
305
- const char *name = output->document->v.document.name;
306
- const char *public = output->document->v.document.public_identifier;
307
- const char *system = output->document->v.document.system_identifier;
308
- public = public[0] ? public : NULL;
309
- system = system[0] ? system : NULL;
310
- doc = new_html_doc(name, system, public);
311
- } else {
312
- doc = new_html_doc(NULL, NULL, NULL);
313
- }
293
+ static xmlNsPtr lookup_or_add_ns (
294
+ xmlDocPtr doc,
295
+ xmlNodePtr root,
296
+ const char *href,
297
+ const char *prefix
298
+ ) {
299
+ #if NGLIB
300
+ xmlNsPtr ns = xmlSearchNs(doc, root, BAD_CAST prefix);
301
+ if (ns)
302
+ return ns;
303
+ return xmlNewNs(root, BAD_CAST href, BAD_CAST prefix);
304
+ #else
305
+ ID add_namespace_definition;
306
+ CONST_ID(add_namespace_definition, "add_namespace_definition");
307
+ VALUE rprefix = rb_utf8_str_new_cstr(prefix);
308
+ VALUE rhref = rb_utf8_str_new_cstr(href);
309
+ return rb_funcall(root, add_namespace_definition, 2, rprefix, rhref);
310
+ #endif
311
+ }
312
+
313
+ static void set_line(xmlNodePtr node, size_t line) {
314
+ #if NGLIB
315
+ // libxml2 uses 65535 to mean look elsewhere for the line number on some
316
+ // nodes.
317
+ if (line < 65535)
318
+ node->line = (unsigned short)line;
319
+ #else
320
+ // XXX: If Nokogiri gets a `#line=` method, we'll use that.
321
+ #endif
322
+ }
314
323
 
315
- GumboVector *children = &output->document->v.document.children;
316
- for (size_t i=0; i < children->length; i++) {
317
- GumboNode *child = children->data[i];
318
- xmlNodePtr node = walk_tree(doc, child);
319
- if (node) {
320
- if (child == output->root)
321
- xmlDocSetRootElement(doc, node);
322
- else
323
- xmlAddChild((xmlNodePtr)doc, node);
324
+ // Construct an XML tree rooted at xml_output_node from the Gumbo tree rooted
325
+ // at gumbo_node.
326
+ static void build_tree (
327
+ xmlDocPtr doc,
328
+ xmlNodePtr xml_output_node,
329
+ const GumboNode *gumbo_node
330
+ ) {
331
+ xmlNodePtr xml_root = NIL;
332
+ xmlNodePtr xml_node = xml_output_node;
333
+ size_t child_index = 0;
334
+
335
+ while (true) {
336
+ assert(gumbo_node != NULL);
337
+ const GumboVector *children = gumbo_node->type == GUMBO_NODE_DOCUMENT?
338
+ &gumbo_node->v.document.children : &gumbo_node->v.element.children;
339
+ if (child_index >= children->length) {
340
+ // Move up the tree and to the next child.
341
+ if (xml_node == xml_output_node) {
342
+ // We've built as much of the tree as we can.
343
+ return;
344
+ }
345
+ child_index = gumbo_node->index_within_parent + 1;
346
+ gumbo_node = gumbo_node->parent;
347
+ xml_node = get_parent(xml_node);
348
+ // Children of fragments don't share the same root, so reset it and
349
+ // it'll be set below. In the non-fragment case, this will only happen
350
+ // after the html element has been finished at which point there are no
351
+ // further elements.
352
+ if (xml_node == xml_output_node)
353
+ xml_root = NIL;
354
+ continue;
355
+ }
356
+ const GumboNode *gumbo_child = children->data[child_index++];
357
+ xmlNodePtr xml_child;
358
+
359
+ switch (gumbo_child->type) {
360
+ case GUMBO_NODE_DOCUMENT:
361
+ abort(); // Bug in Gumbo.
362
+
363
+ case GUMBO_NODE_TEXT:
364
+ case GUMBO_NODE_WHITESPACE:
365
+ xml_child = xmlNewDocText(doc, BAD_CAST gumbo_child->v.text.text);
366
+ set_line(xml_child, gumbo_child->v.text.start_pos.line);
367
+ xmlAddChild(xml_node, xml_child);
368
+ break;
369
+
370
+ case GUMBO_NODE_CDATA:
371
+ xml_child = xmlNewCDataBlock(doc, BAD_CAST gumbo_child->v.text.text,
372
+ (int) strlen(gumbo_child->v.text.text));
373
+ set_line(xml_child, gumbo_child->v.text.start_pos.line);
374
+ xmlAddChild(xml_node, xml_child);
375
+ break;
376
+
377
+ case GUMBO_NODE_COMMENT:
378
+ xml_child = xmlNewDocComment(doc, BAD_CAST gumbo_child->v.text.text);
379
+ set_line(xml_child, gumbo_child->v.text.start_pos.line);
380
+ xmlAddChild(xml_node, xml_child);
381
+ break;
382
+
383
+ case GUMBO_NODE_TEMPLATE:
384
+ // XXX: Should create a template element and a new DocumentFragment
385
+ case GUMBO_NODE_ELEMENT:
386
+ {
387
+ xml_child = xmlNewDocNode(doc, NIL, BAD_CAST gumbo_child->v.element.name, NULL);
388
+ set_line(xml_child, gumbo_child->v.text.start_pos.line);
389
+ if (xml_root == NIL)
390
+ xml_root = xml_child;
391
+ xmlNsPtr ns = NIL;
392
+ switch (gumbo_child->v.element.tag_namespace) {
393
+ case GUMBO_NAMESPACE_HTML:
394
+ break;
395
+ case GUMBO_NAMESPACE_SVG:
396
+ ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/2000/svg", "svg");
397
+ break;
398
+ case GUMBO_NAMESPACE_MATHML:
399
+ ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/1998/Math/MathML", "math");
400
+ break;
401
+ }
402
+ if (ns != NIL)
403
+ xmlSetNs(xml_child, ns);
404
+ xmlAddChild(xml_node, xml_child);
405
+
406
+ // Add the attributes.
407
+ const GumboVector* attrs = &gumbo_child->v.element.attributes;
408
+ for (size_t i=0; i < attrs->length; i++) {
409
+ const GumboAttribute *attr = attrs->data[i];
410
+
411
+ switch (attr->attr_namespace) {
412
+ case GUMBO_ATTR_NAMESPACE_XLINK:
413
+ ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/1999/xlink", "xlink");
414
+ break;
415
+
416
+ case GUMBO_ATTR_NAMESPACE_XML:
417
+ ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/XML/1998/namespace", "xml");
418
+ break;
419
+
420
+ case GUMBO_ATTR_NAMESPACE_XMLNS:
421
+ ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/2000/xmlns/", "xmlns");
422
+ break;
423
+
424
+ default:
425
+ ns = NIL;
426
+ }
427
+ xmlNewNsProp(xml_child, ns, BAD_CAST attr->name, BAD_CAST attr->value);
428
+ }
429
+
430
+ // Add children for this element.
431
+ child_index = 0;
432
+ gumbo_node = gumbo_child;
433
+ xml_node = xml_child;
434
+ }
324
435
  }
325
436
  }
437
+ }
326
438
 
327
- VALUE rdoc = Nokogiri_wrap_xml_document(Document, doc);
439
+ static void add_errors(const GumboOutput *output, VALUE rdoc, VALUE input, VALUE url) {
440
+ const char *input_str = RSTRING_PTR(input);
441
+ size_t input_len = RSTRING_LEN(input);
328
442
 
329
443
  // Add parse errors to rdoc.
330
444
  if (output->errors.length) {
331
- GumboVector *errors = &output->errors;
332
- GumboStringBuffer msg;
445
+ const GumboVector *errors = &output->errors;
333
446
  VALUE rerrors = rb_ary_new2(errors->length);
334
447
 
335
- gumbo_string_buffer_init(&msg);
336
448
  for (size_t i=0; i < errors->length; i++) {
337
449
  GumboError *err = errors->data[i];
338
- gumbo_string_buffer_clear(&msg);
339
- gumbo_caret_diagnostic_to_string(err, input, input_len, &msg);
340
- VALUE err_str = rb_str_new(msg.data, msg.length);
450
+ GumboSourcePosition position = gumbo_error_position(err);
451
+ char *msg;
452
+ size_t size = gumbo_caret_diagnostic_to_string(err, input_str, input_len, &msg);
453
+ VALUE err_str = rb_utf8_str_new(msg, size);
454
+ free(msg);
341
455
  VALUE syntax_error = rb_class_new_instance(1, &err_str, cNokogiriXmlSyntaxError);
456
+ const char *error_code = gumbo_error_code(err);
457
+ VALUE str1 = error_code? rb_utf8_str_new_static(error_code, strlen(error_code)) : Qnil;
342
458
  rb_iv_set(syntax_error, "@domain", INT2NUM(1)); // XML_FROM_PARSER
343
459
  rb_iv_set(syntax_error, "@code", INT2NUM(1)); // XML_ERR_INTERNAL_ERROR
344
460
  rb_iv_set(syntax_error, "@level", INT2NUM(2)); // XML_ERR_ERROR
345
461
  rb_iv_set(syntax_error, "@file", url);
346
- rb_iv_set(syntax_error, "@line", INT2NUM(err->position.line));
347
- rb_iv_set(syntax_error, "@str1", Qnil);
462
+ rb_iv_set(syntax_error, "@line", INT2NUM(position.line));
463
+ rb_iv_set(syntax_error, "@str1", str1);
348
464
  rb_iv_set(syntax_error, "@str2", Qnil);
349
465
  rb_iv_set(syntax_error, "@str3", Qnil);
350
- rb_iv_set(syntax_error, "@int1", INT2NUM(err->type));
351
- rb_iv_set(syntax_error, "@column", INT2NUM(err->position.column));
466
+ rb_iv_set(syntax_error, "@int1", INT2NUM(0));
467
+ rb_iv_set(syntax_error, "@column", INT2NUM(position.column));
352
468
  rb_ary_push(rerrors, syntax_error);
353
469
  }
354
470
  rb_iv_set(rdoc, "@errors", rerrors);
355
- gumbo_string_buffer_destroy(&msg);
356
471
  }
472
+ }
473
+
474
+ typedef struct {
475
+ GumboOutput *output;
476
+ VALUE input;
477
+ VALUE url_or_frag;
478
+ xmlDocPtr doc;
479
+ } ParseArgs;
480
+
481
+ static VALUE parse_cleanup(ParseArgs *args) {
482
+ gumbo_destroy_output(args->output);
483
+ if (args->doc != NIL)
484
+ xmlFreeDoc(args->doc);
485
+ return Qnil;
486
+ }
487
+
488
+
489
+ static VALUE parse_continue(ParseArgs *args);
490
+
491
+ // Parse a string using gumbo_parse into a Nokogiri document
492
+ static VALUE parse(VALUE self, VALUE input, VALUE url, VALUE max_errors, VALUE max_depth) {
493
+ GumboOptions options = kGumboDefaultOptions;
494
+ options.max_errors = NUM2INT(max_errors);
495
+ options.max_tree_depth = NUM2INT(max_depth);
357
496
 
358
- gumbo_destroy_output(output);
497
+ GumboOutput *output = perform_parse(&options, input);
498
+ ParseArgs args = {
499
+ .output = output,
500
+ .input = input,
501
+ .url_or_frag = url,
502
+ .doc = NIL,
503
+ };
504
+ return rb_ensure(parse_continue, (VALUE)&args, parse_cleanup, (VALUE)&args);
505
+ }
359
506
 
507
+ static VALUE parse_continue(ParseArgs *args) {
508
+ GumboOutput *output = args->output;
509
+ xmlDocPtr doc;
510
+ if (output->document->v.document.has_doctype) {
511
+ const char *name = output->document->v.document.name;
512
+ const char *public = output->document->v.document.public_identifier;
513
+ const char *system = output->document->v.document.system_identifier;
514
+ public = public[0] ? public : NULL;
515
+ system = system[0] ? system : NULL;
516
+ doc = new_html_doc(name, system, public);
517
+ } else {
518
+ doc = new_html_doc(NULL, NULL, NULL);
519
+ }
520
+ args->doc = doc; // Make sure doc gets cleaned up if an error is thrown.
521
+ build_tree(doc, (xmlNodePtr)doc, output->document);
522
+ VALUE rdoc = Nokogiri_wrap_xml_document(Document, doc);
523
+ args->doc = NIL; // The Ruby runtime now owns doc so don't delete it.
524
+ add_errors(output, rdoc, args->input, args->url_or_frag);
360
525
  return rdoc;
361
526
  }
362
527
 
363
- // Initialize the Nokogumbo class and fetch constants we will use later
528
+ static int lookup_namespace(VALUE node, bool require_known_ns) {
529
+ ID namespace, href;
530
+ CONST_ID(namespace, "namespace");
531
+ CONST_ID(href, "href");
532
+ VALUE ns = rb_funcall(node, namespace, 0);
533
+
534
+ if (NIL_P(ns))
535
+ return GUMBO_NAMESPACE_HTML;
536
+ ns = rb_funcall(ns, href, 0);
537
+ assert(RTEST(ns));
538
+ Check_Type(ns, T_STRING);
539
+
540
+ const char *href_ptr = RSTRING_PTR(ns);
541
+ size_t href_len = RSTRING_LEN(ns);
542
+ #define NAMESPACE_P(uri) (href_len == sizeof uri - 1 && !memcmp(href_ptr, uri, href_len))
543
+ if (NAMESPACE_P("http://www.w3.org/1999/xhtml"))
544
+ return GUMBO_NAMESPACE_HTML;
545
+ if (NAMESPACE_P("http://www.w3.org/1998/Math/MathML"))
546
+ return GUMBO_NAMESPACE_MATHML;
547
+ if (NAMESPACE_P("http://www.w3.org/2000/svg"))
548
+ return GUMBO_NAMESPACE_SVG;
549
+ #undef NAMESPACE_P
550
+ if (require_known_ns)
551
+ rb_raise(rb_eArgError, "Unexpected namespace URI \"%*s\"", (int)href_len, href_ptr);
552
+ return -1;
553
+ }
554
+
555
+ static xmlNodePtr extract_xml_node(VALUE node) {
556
+ #if NGLIB
557
+ xmlNodePtr xml_node;
558
+ Data_Get_Struct(node, xmlNode, xml_node);
559
+ return xml_node;
560
+ #else
561
+ return node;
562
+ #endif
563
+ }
564
+
565
+ static VALUE fragment_continue(ParseArgs *args);
566
+
567
+ static VALUE fragment (
568
+ VALUE self,
569
+ VALUE doc_fragment,
570
+ VALUE tags,
571
+ VALUE ctx,
572
+ VALUE max_errors,
573
+ VALUE max_depth
574
+ ) {
575
+ ID name = rb_intern_const("name");
576
+ const char *ctx_tag;
577
+ GumboNamespaceEnum ctx_ns;
578
+ GumboQuirksModeEnum quirks_mode;
579
+ bool form = false;
580
+ const char *encoding = NULL;
581
+
582
+ if (NIL_P(ctx)) {
583
+ ctx_tag = "body";
584
+ ctx_ns = GUMBO_NAMESPACE_HTML;
585
+ } else if (TYPE(ctx) == T_STRING) {
586
+ ctx_tag = StringValueCStr(ctx);
587
+ ctx_ns = GUMBO_NAMESPACE_HTML;
588
+ size_t len = RSTRING_LEN(ctx);
589
+ const char *colon = memchr(ctx_tag, ':', len);
590
+ if (colon) {
591
+ switch (colon - ctx_tag) {
592
+ case 3:
593
+ if (st_strncasecmp(ctx_tag, "svg", 3) != 0)
594
+ goto error;
595
+ ctx_ns = GUMBO_NAMESPACE_SVG;
596
+ break;
597
+ case 4:
598
+ if (st_strncasecmp(ctx_tag, "html", 4) == 0)
599
+ ctx_ns = GUMBO_NAMESPACE_HTML;
600
+ else if (st_strncasecmp(ctx_tag, "math", 4) == 0)
601
+ ctx_ns = GUMBO_NAMESPACE_MATHML;
602
+ else
603
+ goto error;
604
+ break;
605
+ default:
606
+ error:
607
+ rb_raise(rb_eArgError, "Invalid context namespace '%*s'", (int)(colon - ctx_tag), ctx_tag);
608
+ }
609
+ ctx_tag = colon+1;
610
+ } else {
611
+ // For convenience, put 'svg' and 'math' in their namespaces.
612
+ if (len == 3 && st_strncasecmp(ctx_tag, "svg", 3) == 0)
613
+ ctx_ns = GUMBO_NAMESPACE_SVG;
614
+ else if (len == 4 && st_strncasecmp(ctx_tag, "math", 4) == 0)
615
+ ctx_ns = GUMBO_NAMESPACE_MATHML;
616
+ }
617
+
618
+ // Check if it's a form.
619
+ form = ctx_ns == GUMBO_NAMESPACE_HTML && st_strcasecmp(ctx_tag, "form") == 0;
620
+ } else {
621
+ ID element_ = rb_intern_const("element?");
622
+
623
+ // Context fragment name.
624
+ VALUE tag_name = rb_funcall(ctx, name, 0);
625
+ assert(RTEST(tag_name));
626
+ Check_Type(tag_name, T_STRING);
627
+ ctx_tag = StringValueCStr(tag_name);
628
+
629
+ // Context fragment namespace.
630
+ ctx_ns = lookup_namespace(ctx, true);
631
+
632
+ // Check for a form ancestor, including self.
633
+ for (VALUE node = ctx;
634
+ !NIL_P(node);
635
+ node = rb_respond_to(node, parent) ? rb_funcall(node, parent, 0) : Qnil) {
636
+ if (!RTEST(rb_funcall(node, element_, 0)))
637
+ continue;
638
+ VALUE element_name = rb_funcall(node, name, 0);
639
+ if (RSTRING_LEN(element_name) == 4
640
+ && !st_strcasecmp(RSTRING_PTR(element_name), "form")
641
+ && lookup_namespace(node, false) == GUMBO_NAMESPACE_HTML) {
642
+ form = true;
643
+ break;
644
+ }
645
+ }
646
+
647
+ // Encoding.
648
+ if (RSTRING_LEN(tag_name) == 14
649
+ && !st_strcasecmp(ctx_tag, "annotation-xml")) {
650
+ VALUE enc = rb_funcall(ctx, rb_intern_const("[]"),
651
+ rb_utf8_str_new_static("encoding", 8));
652
+ if (RTEST(enc)) {
653
+ Check_Type(enc, T_STRING);
654
+ encoding = StringValueCStr(enc);
655
+ }
656
+ }
657
+ }
658
+
659
+ // Quirks mode.
660
+ VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
661
+ VALUE dtd = rb_funcall(doc, internal_subset, 0);
662
+ if (NIL_P(dtd)) {
663
+ quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
664
+ } else {
665
+ VALUE dtd_name = rb_funcall(dtd, name, 0);
666
+ VALUE pubid = rb_funcall(dtd, rb_intern_const("external_id"), 0);
667
+ VALUE sysid = rb_funcall(dtd, rb_intern_const("system_id"), 0);
668
+ quirks_mode = gumbo_compute_quirks_mode (
669
+ NIL_P(dtd_name)? NULL:StringValueCStr(dtd_name),
670
+ NIL_P(pubid)? NULL:StringValueCStr(pubid),
671
+ NIL_P(sysid)? NULL:StringValueCStr(sysid)
672
+ );
673
+ }
674
+
675
+ // Perform a fragment parse.
676
+ int depth = NUM2INT(max_depth);
677
+ GumboOptions options = kGumboDefaultOptions;
678
+ options.max_errors = NUM2INT(max_errors);
679
+ // Add one to account for the HTML element.
680
+ options.max_tree_depth = depth < 0 ? -1 : (depth + 1);
681
+ options.fragment_context = ctx_tag;
682
+ options.fragment_namespace = ctx_ns;
683
+ options.fragment_encoding = encoding;
684
+ options.quirks_mode = quirks_mode;
685
+ options.fragment_context_has_form_ancestor = form;
686
+
687
+ GumboOutput *output = perform_parse(&options, tags);
688
+ ParseArgs args = {
689
+ .output = output,
690
+ .input = tags,
691
+ .url_or_frag = doc_fragment,
692
+ .doc = (xmlDocPtr)extract_xml_node(doc),
693
+ };
694
+ rb_ensure(fragment_continue, (VALUE)&args, parse_cleanup, (VALUE)&args);
695
+ return Qnil;
696
+ }
697
+
698
+ static VALUE fragment_continue(ParseArgs *args) {
699
+ GumboOutput *output = args->output;
700
+ VALUE doc_fragment = args->url_or_frag;
701
+ xmlDocPtr xml_doc = args->doc;
702
+
703
+ args->doc = NIL; // The Ruby runtime owns doc so make sure we don't delete it.
704
+ xmlNodePtr xml_frag = extract_xml_node(doc_fragment);
705
+ build_tree(xml_doc, xml_frag, output->root);
706
+ add_errors(output, doc_fragment, args->input, rb_utf8_str_new_static("#fragment", 9));
707
+ return Qnil;
708
+ }
709
+
710
+ // Initialize the Nokogumbo class and fetch constants we will use later.
364
711
  void Init_nokogumbo() {
365
- rb_funcall(rb_mKernel, rb_intern("gem"), 1, rb_str_new2("nokogiri"));
712
+ rb_funcall(rb_mKernel, rb_intern("gem"), 1, rb_utf8_str_new_static("nokogiri", 8));
366
713
  rb_require("nokogiri");
367
714
 
368
- // class constants
369
- VALUE Nokogiri = rb_const_get(rb_cObject, rb_intern("Nokogiri"));
370
- VALUE HTML5 = rb_const_get(Nokogiri, rb_intern("HTML5"));
371
- Document = rb_const_get(HTML5, rb_intern("Document"));
372
-
373
715
  #ifndef NGLIB
374
- // more class constants
375
- VALUE XML = rb_const_get(Nokogiri, rb_intern("XML"));
376
- cNokogiriXmlSyntaxError = rb_const_get(XML, rb_intern("SyntaxError"));
377
- Element = rb_const_get(XML, rb_intern("Element"));
378
- Text = rb_const_get(XML, rb_intern("Text"));
379
- CDATA = rb_const_get(XML, rb_intern("CDATA"));
380
- Comment = rb_const_get(XML, rb_intern("Comment"));
381
-
382
- // interned symbols
383
- new = rb_intern("new");
384
- attribute = rb_intern("attribute");
385
- set_attribute = rb_intern("set_attribute");
386
- remove_attribute = rb_intern("remove_attribute");
387
- add_child = rb_intern("add_child_node_and_reparent_attrs");
388
- internal_subset = rb_intern("internal_subset");
389
- remove_ = rb_intern("remove");
390
- create_internal_subset = rb_intern("create_internal_subset");
391
- key_ = rb_intern("key?");
392
- node_name_ = rb_intern("node_name=");
716
+ // Class constants.
717
+ VALUE mNokogiri = rb_const_get(rb_cObject, rb_intern_const("Nokogiri"));
718
+ VALUE mNokogiriXml = rb_const_get(mNokogiri, rb_intern_const("XML"));
719
+ cNokogiriXmlSyntaxError = rb_const_get(mNokogiriXml, rb_intern_const("SyntaxError"));
720
+ cNokogiriXmlElement = rb_const_get(mNokogiriXml, rb_intern_const("Element"));
721
+ cNokogiriXmlText = rb_const_get(mNokogiriXml, rb_intern_const("Text"));
722
+ cNokogiriXmlCData = rb_const_get(mNokogiriXml, rb_intern_const("CDATA"));
723
+ cNokogiriXmlComment = rb_const_get(mNokogiriXml, rb_intern_const("Comment"));
724
+
725
+ // Interned symbols.
726
+ new = rb_intern_const("new");
727
+ node_name_ = rb_intern_const("node_name=");
393
728
  #endif
394
729
 
395
- // define Nokogumbo module with a parse method
730
+ // Class constants.
731
+ VALUE HTML5 = rb_const_get(mNokogiri, rb_intern_const("HTML5"));
732
+ Document = rb_const_get(HTML5, rb_intern_const("Document"));
733
+
734
+ // Interned symbols.
735
+ internal_subset = rb_intern_const("internal_subset");
736
+ parent = rb_intern_const("parent");
737
+
738
+ // Define Nokogumbo module with parse and fragment methods.
396
739
  VALUE Gumbo = rb_define_module("Nokogumbo");
397
740
  rb_define_singleton_method(Gumbo, "parse", parse, 4);
741
+ rb_define_singleton_method(Gumbo, "fragment", fragment, 5);
398
742
  }
743
+
744
+ // vim: set shiftwidth=2 softtabstop=2 tabstop=8 expandtab: