html2textile 1.0.0.beta1

@@ -0,0 +1,19 @@
1
+ # HTML2Textile #
2
+
3
+ A quick and simple way to convert HTML to Textile.
4
+
5
+ parser = HTMLToTextileParser.new
6
+ parser.feed(your_html)
7
+ puts parser.to_textile
8
+
9
+ ## Introduction From 2007 ##
10
+
11
+ One of the many tricky decisions to be made when building content management tools is how to allow users to control the basic formatting of their input without breaking your carefully crafted layouts or injecting nasty hacks into your pages. One approach has long been to provide your own markup language. Instead of allowing users to write HTML, let them use BBCode, Markdown, or Textile, whose more tightly controlled vocabularies and rules make problems much less likely.
12
+
13
+ Textile in particular has a nice simple syntax and is increasingly popular thanks to its adoption in products like those of 37signals. In Ruby, there’s the RedCloth library, which makes it fast and easy to convert Textile to HTML. The one problem is if you already have a body of user-generated HTML in your legacy system that needs converting. That’s the situation I found myself in this week, and I quickly needed a tool to translate the content so that I could get on with the more interesting parts of the system.
14
+
15
+ Searching for options, I found the ClothRed library, which offers some translation but doesn’t handle important elements like links. I considered patching it to handle the elements I needed, but in the end I decided to take a different approach and used the SGML parsing library found here to port a Python html2textile parser.
16
+
17
+ Porting code from Python to Ruby is a pretty straightforward process as the languages are so similar on a number of levels, but there were several issues to work through, particularly relating to scoping, and quite a few methods to change to make them feel a little more Ruby-ish. I’ve not converted all of the entity handling as I didn’t really need it, but there might be a bit of work to do in making sure character set issues are properly taken care of.
18
+
19
+ The end result is a piece of code that’s now served its purpose and that I’m unlikely to need again for quite a while. It’s not something that I’m particularly proud of, and it could almost certainly be implemented more neatly, but I thought I’d throw it out there in case it could be useful to someone else. Should you be inspired to take it and twist it and turn it into a well-heeled, more robust and properly distributable solution, feel free, but please let me know so that at the very least I can update this entry.
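The introduction above notes that the Python entity handling was never fully ported. As a rough sketch only, numeric and named references could be decoded along these lines. `decode_charref`, `decode_entityref`, and `BASIC_ENTITIES` are hypothetical names that do not exist in the gem, and the code assumes Ruby 1.9 or later, where strings carry an encoding.

```ruby
# Hypothetical helpers, not part of html2textile.
# Assumes Ruby 1.9+ so that pack('U') yields a UTF-8 encoded string.

# The five predefined entities; a fuller port would need a larger table.
BASIC_ENTITIES = {
  'lt' => '<', 'gt' => '>', 'amp' => '&', 'quot' => '"', 'apos' => "'"
}.freeze

# Decode the digits of a numeric reference such as "233" (from "&#233;").
# Returns nil when the input is not a decimal number.
def decode_charref(digits)
  [Integer(digits, 10)].pack('U')
rescue ArgumentError
  nil
end

# Decode a named reference such as "amp" (from "&amp;").
# Returns nil for names missing from the table.
def decode_entityref(name)
  BASIC_ENTITIES[name]
end

puts decode_charref('65')
puts decode_entityref('amp')
```

A parser subclass could pass the decoded string on to its data handler and fall back to ignoring the reference when `nil` comes back.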
@@ -0,0 +1,35 @@
1
+ require 'html2textile'
2
+
3
+ first_block = <<END
4
+ <div class="column span-3">
5
+ <h3 class="storytitle entry-title" id="post-312">
6
+ <a href="http://jystewart.net/process/2007/11/converting-html-to-textile-with-ruby/" rel="bookmark">Converting HTML to Textile with Ruby</a>
7
+ </h3>
8
+
9
+ <p>
10
+ <span>23 November 2007</span>
11
+ (<abbr class="updated" title="2007-11-23T19:51:54+00:00">7:51 pm</abbr>)
12
+ </p>
13
+
14
+ <p>
15
+ By <span class="author vcard fn">James Stewart</span>
16
+ <br />filed under:
17
+ <a href="http://jystewart.net/process/category/snippets/" title="View all posts in Snippets" rel="category tag">Snippets</a>
18
+ <br />tagged: <a href="http://jystewart.net/process/tag/content-management/" rel="tag">content management</a>,
19
+ <a href="http://jystewart.net/process/tag/conversion/" rel="tag">conversion</a>,
20
+ <a href="http://jystewart.net/process/tag/html/" rel="tag">html</a>,
21
+ <a href="http://jystewart.net/process/tag/python/" rel="tag">Python</a>,
22
+ <a href="http://jystewart.net/process/tag/ruby/" rel="tag">ruby</a>,
23
+ <a href="http://jystewart.net/process/tag/textile/" rel="tag">textile</a>
24
+ </p>
25
+
26
+
27
+ <div class="feedback">
28
+ <script src="http://feeds.feedburner.com/~s/jystewart/iLiN?i=http://jystewart.net/process/2007/11/converting-html-to-textile-with-ruby/" type="text/javascript" charset="utf-8"></script>
29
+ </div>
30
+ </div>
31
+ END
32
+
33
+ parser = HTMLToTextileParser.new
34
+ parser.feed(first_block)
35
+ puts parser.to_textile
@@ -0,0 +1,255 @@
1
+ require 'sgml_parser'
2
+
3
+ # A class to convert HTML to textile. Based on the python parser
4
+ # found at http://aftnn.org/content/code/html2textile/
5
+ #
6
+ # Read more at http://jystewart.net/process/2007/11/converting-html-to-textile-with-ruby
7
+ #
8
+ # Author:: James Stewart (mailto:james@ketlai.co.uk)
9
+ # Copyright:: Copyright (c) 2010 James Stewart
10
+ # License:: Distributed under the same terms as Ruby
11
+
12
+ # This class is an implementation of an SgmlParser designed to convert
13
+ # HTML to textile.
14
+ #
15
+ # Example usage:
16
+ # parser = HTMLToTextileParser.new
17
+ # parser.feed(input_html)
18
+ # puts parser.to_textile
19
+ class HTMLToTextileParser < SgmlParser
20
+
21
+ attr_accessor :result
22
+ attr_accessor :in_block
23
+ attr_accessor :data_stack
24
+ attr_accessor :a_href
25
+ attr_accessor :in_ul
26
+ attr_accessor :in_ol
27
+
28
+ @@permitted_tags = []
29
+ @@permitted_attrs = []
30
+
31
+ def initialize(verbose=nil)
32
+ @output = String.new
33
+ self.in_block = false
34
+ self.result = []
35
+ self.data_stack = []
36
+ super(verbose)
37
+ end
38
+
39
+ # Normalise space in the same manner as HTML. Any substring of multiple
40
+ # whitespace characters will be replaced with a single space char.
41
+ def normalise_space(s)
42
+     s.to_s.gsub(/\s+/, ' ')
43
+ end
44
+
45
+ def build_styles_ids_and_classes(attributes)
46
+ idclass = ''
47
+ idclass += attributes['class'] if attributes.has_key?('class')
48
+ idclass += "\##{attributes['id']}" if attributes.has_key?('id')
49
+ idclass = "(#{idclass})" if idclass != ''
50
+
51
+ style = attributes.has_key?('style') ? "{#{attributes['style']}}" : ""
52
+ "#{idclass}#{style}"
53
+ end
54
+
55
+ def make_block_start_pair(tag, attributes)
56
+ attributes = attrs_to_hash(attributes)
57
+ class_style = build_styles_ids_and_classes(attributes)
58
+ write("#{tag}#{class_style}. ")
59
+ start_capture(tag)
60
+ end
61
+
62
+ def make_block_end_pair
63
+ stop_capture_and_write
64
+ write("\n\n")
65
+ end
66
+
67
+ def make_quicktag_start_pair(tag, wrapchar, attributes)
68
+ attributes = attrs_to_hash(attributes)
69
+ class_style = build_styles_ids_and_classes(attributes)
70
+ write([" ", "#{wrapchar}#{class_style}"])
71
+ start_capture(tag)
72
+ end
73
+
74
+ def make_quicktag_end_pair(wrapchar)
75
+ stop_capture_and_write
76
+ write([wrapchar, " "])
77
+ end
78
+
79
+ def write(d)
80
+ if self.data_stack.size < 2
81
+       self.result += Array(d) # Array() also works on Ruby 1.9+, where String#to_a was removed
82
+     else
83
+       self.data_stack[-1] += Array(d)
84
+ end
85
+ end
86
+
87
+ def start_capture(tag)
88
+ self.in_block = tag
89
+ self.data_stack.push([])
90
+ end
91
+
92
+ def stop_capture_and_write
93
+ self.in_block = false
94
+ self.write(self.data_stack.pop)
95
+ end
96
+
97
+ def handle_data(data)
98
+ write(normalise_space(data).strip) unless data.nil? or data == ''
99
+ end
100
+
101
+ %w[1 2 3 4 5 6].each do |num|
102
+ define_method "start_h#{num}" do |attributes|
103
+ make_block_start_pair("h#{num}", attributes)
104
+ end
105
+
106
+ define_method "end_h#{num}" do
107
+ make_block_end_pair
108
+ end
109
+ end
110
+
111
+ PAIRS = { 'blockquote' => 'bq', 'p' => 'p' }
112
+ QUICKTAGS = { 'b' => '*', 'strong' => '*',
113
+ 'i' => '_', 'em' => '_', 'cite' => '??', 's' => '-',
114
+ 'sup' => '^', 'sub' => '~', 'code' => '@', 'span' => '%'}
115
+
116
+ PAIRS.each do |key, value|
117
+ define_method "start_#{key}" do |attributes|
118
+ make_block_start_pair(value, attributes)
119
+ end
120
+
121
+ define_method "end_#{key}" do
122
+ make_block_end_pair
123
+ end
124
+ end
125
+
126
+ QUICKTAGS.each do |key, value|
127
+ define_method "start_#{key}" do |attributes|
128
+ make_quicktag_start_pair(key, value, attributes)
129
+ end
130
+
131
+ define_method "end_#{key}" do
132
+ make_quicktag_end_pair(value)
133
+ end
134
+ end
135
+
136
+ def start_ol(attrs)
137
+ self.in_ol = true
138
+ end
139
+
140
+ def end_ol
141
+ self.in_ol = false
142
+ write("\n")
143
+ end
144
+
145
+ def start_ul(attrs)
146
+ self.in_ul = true
147
+ end
148
+
149
+ def end_ul
150
+ self.in_ul = false
151
+ write("\n")
152
+ end
153
+
154
+ def start_li(attrs)
155
+ if self.in_ol
156
+ write("# ")
157
+ else
158
+ write("* ")
159
+ end
160
+
161
+ start_capture("li")
162
+ end
163
+
164
+ def end_li
165
+ stop_capture_and_write
166
+ write("\n")
167
+ end
168
+
169
+ def start_a(attrs)
170
+ attrs = attrs_to_hash(attrs)
171
+ self.a_href = attrs['href']
172
+
173
+     if self.a_href
174
+ write(" \"")
175
+ start_capture("a")
176
+ end
177
+ end
178
+
179
+ def end_a
180
+     if self.a_href
181
+ stop_capture_and_write
182
+ write(["\":", self.a_href, " "])
183
+ self.a_href = false
184
+ end
185
+ end
186
+
187
+ def attrs_to_hash(array)
188
+ array.inject({}) { |collection, part| collection[part[0].downcase] = part[1]; collection }
189
+ end
190
+
191
+ def start_img(attrs)
192
+ attrs = attrs_to_hash(attrs)
193
+ write([" !", attrs["src"], "! "])
194
+ end
195
+
196
+ def end_img
197
+ end
198
+
199
+ def start_tr(attrs)
200
+ end
201
+
202
+ def end_tr
203
+ write("|\n")
204
+ end
205
+
206
+ def start_td(attrs)
207
+ write("|")
208
+ start_capture("td")
209
+ end
210
+
211
+ def end_td
212
+ stop_capture_and_write
213
+ write("|")
214
+ end
215
+
216
+ def start_br(attrs)
217
+ write("\n")
218
+ end
219
+
220
+ def unknown_starttag(tag, attrs)
221
+ if @@permitted_tags.include?(tag)
222
+ write(["<", tag])
223
+ attrs.each do |key, value|
224
+         # Emit only whitelisted attributes, then close the tag.
225
+         write([" ", key, "=\"", value, "\""]) if @@permitted_attrs.include?(key)
226
+       end
227
+       write(">")
228
+     end
229
+ end
230
+
231
+ def unknown_endtag(tag)
232
+ if @@permitted_tags.include?(tag)
233
+ write(["</", tag, ">"])
234
+ end
235
+ end
236
+
237
+ # Return the textile after processing
238
+ def to_textile
239
+ result.join
240
+ end
241
+
242
+ # UNCONVERTED PYTHON METHODS
243
+ #
244
+ # def handle_charref(self, tag):
245
+ # self._write(unichr(int(tag)))
246
+ #
247
+ # def handle_entityref(self, tag):
248
+ # if self.entitydefs.has_key(tag):
249
+ # self._write(self.entitydefs[tag])
250
+ #
251
+ # def handle_starttag(self, tag, method, attrs):
252
+ # method(dict(attrs))
253
+ #
254
+
255
+ end
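To make the attribute handling in `build_styles_ids_and_classes` easier to follow, here is a standalone restatement of the same logic. It is a sketch for illustration: `textile_modifiers` is a hypothetical name, and it takes a plain hash like the one `attrs_to_hash` returns.

```ruby
# Standalone illustration of how class, id and style attributes become a
# Textile modifier string. textile_modifiers is a hypothetical name.
def textile_modifiers(attributes)
  # class and id share one pair of parentheses: (class#id)
  selector = attributes['class'].to_s
  selector += '#' + attributes['id'] if attributes.key?('id')
  selector = "(#{selector})" unless selector.empty?

  # inline styles go in braces: {color:red}
  style = attributes.key?('style') ? "{#{attributes['style']}}" : ''
  "#{selector}#{style}"
end

# The block signature is the Textile tag, the modifiers, then ". "
puts "p#{textile_modifiers('class' => 'note', 'id' => 'intro')}. Hello"
```

The modifier string sits between the block tag and the trailing `". "`, which is how `make_block_start_pair` assembles its output.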
@@ -0,0 +1,333 @@
1
+ # A parser for SGML, using the derived class as static DTD.
2
+
3
+ class SgmlParser
4
+
5
+ # Regular expressions used for parsing:
6
+ Interesting = /[&<]/
7
+ Incomplete = Regexp.compile('&([a-zA-Z][a-zA-Z0-9]*|#[0-9]*)?|' +
8
+ '<([a-zA-Z][^<>]*|/([a-zA-Z][^<>]*)?|' +
9
+ '![^<>]*)?')
10
+
11
+ Entityref = /&([a-zA-Z][-.a-zA-Z0-9]*)[^-.a-zA-Z0-9]/
12
+ Charref = /&#([0-9]+)[^0-9]/
13
+
14
+ Starttagopen = /<[>a-zA-Z]/
15
+ Endtagopen = /<\/[<>a-zA-Z]/
16
+ Endbracket = /[<>]/
17
+ Special = /<![^<>]*>/
18
+ Commentopen = /<!--/
19
+ Commentclose = /--[ \t\n]*>/
20
+ Tagfind = /[a-zA-Z][a-zA-Z0-9.-]*/
21
+ Attrfind = Regexp.compile('[\s,]*([a-zA-Z_][a-zA-Z_0-9.-]*)' +
22
+ '(\s*=\s*' +
23
+ "('[^']*'" +
24
+ '|"[^"]*"' +
25
+ '|[-~a-zA-Z0-9,./:+*%?!()_#=]*))?')
26
+
27
+ Entitydefs =
28
+ {'lt'=>'<', 'gt'=>'>', 'amp'=>'&', 'quot'=>'"', 'apos'=>'\''}
29
+
30
+ def initialize(verbose=false)
31
+ @verbose = verbose
32
+ reset
33
+ end
34
+
35
+ def reset
36
+ @rawdata = ''
37
+ @stack = []
38
+ @lasttag = '???'
39
+ @nomoretags = false
40
+ @literal = false
41
+ end
42
+
43
+ def has_context(gi)
44
+ @stack.include? gi
45
+ end
46
+
47
+ def setnomoretags
48
+ @nomoretags = true
49
+ @literal = true
50
+ end
51
+
52
+ def setliteral(*args)
53
+ @literal = true
54
+ end
55
+
56
+ def feed(data)
57
+ @rawdata << data
58
+ goahead(false)
59
+ end
60
+
61
+ def close
62
+ goahead(true)
63
+ end
64
+
65
+ def goahead(_end)
66
+ rawdata = @rawdata
67
+ i = 0
68
+ n = rawdata.length
69
+ while i < n
70
+ if @nomoretags
71
+ handle_data(rawdata[i..(n-1)])
72
+ i = n
73
+ break
74
+ end
75
+ j = rawdata.index(Interesting, i)
76
+ j = n unless j
77
+ if i < j
78
+ handle_data(rawdata[i..(j-1)])
79
+ end
80
+ i = j
81
+ break if (i == n)
82
+ if rawdata[i] == ?< #
83
+ if rawdata.index(Starttagopen, i) == i
84
+ if @literal
85
+ handle_data(rawdata[i, 1])
86
+ i += 1
87
+ next
88
+ end
89
+ k = parse_starttag(i)
90
+ break unless k
91
+ i = k
92
+ next
93
+ end
94
+ if rawdata.index(Endtagopen, i) == i
95
+ k = parse_endtag(i)
96
+ break unless k
97
+ i = k
98
+ @literal = false
99
+ next
100
+ end
101
+ if rawdata.index(Commentopen, i) == i
102
+ if @literal
103
+ handle_data(rawdata[i,1])
104
+ i += 1
105
+ next
106
+ end
107
+ k = parse_comment(i)
108
+ break unless k
109
+ i += k
110
+ next
111
+ end
112
+ if rawdata.index(Special, i) == i
113
+ if @literal
114
+ handle_data(rawdata[i, 1])
115
+ i += 1
116
+ next
117
+ end
118
+ k = parse_special(i)
119
+ break unless k
120
+ i += k
121
+ next
122
+ end
123
+ elsif rawdata[i] == ?& #
124
+ if rawdata.index(Charref, i) == i
125
+ i += $&.length
126
+ handle_charref($1)
127
+ i -= 1 unless rawdata[i-1] == ?;
128
+ next
129
+ end
130
+ if rawdata.index(Entityref, i) == i
131
+ i += $&.length
132
+ handle_entityref($1)
133
+ i -= 1 unless rawdata[i-1] == ?;
134
+ next
135
+ end
136
+ else
137
+ raise RuntimeError, 'neither < nor & ??'
138
+ end
139
+ # We get here only if incomplete matches but
140
+ # nothing else
141
+ match = rawdata.index(Incomplete, i)
142
+ unless match == i
143
+ handle_data(rawdata[i, 1])
144
+ i += 1
145
+ next
146
+ end
147
+ j = match + $&.length
148
+ break if j == n # Really incomplete
149
+ handle_data(rawdata[i..(j-1)])
150
+ i = j
151
+ end
152
+ # end while
153
+ if _end and i < n
154
+ handle_data(@rawdata[i..(n-1)])
155
+ i = n
156
+ end
157
+ @rawdata = rawdata[i..-1]
158
+ end
159
+
160
+ def parse_comment(i)
161
+ rawdata = @rawdata
162
+ if rawdata[i, 4] != '<!--'
163
+       raise RuntimeError, 'unexpected call to parse_comment'
164
+ end
165
+ match = rawdata.index(Commentclose, i)
166
+ return nil unless match
167
+ matched_length = $&.length
168
+ j = match
169
+ handle_comment(rawdata[i+4..(j-1)])
170
+ j = match + matched_length
171
+ return j-i
172
+ end
173
+
174
+ def parse_starttag(i)
175
+ rawdata = @rawdata
176
+ j = rawdata.index(Endbracket, i + 1)
177
+ return nil unless j
178
+ attrs = []
179
+ if rawdata[i+1] == ?> #
180
+ # SGML shorthand: <> == <last open tag seen>
181
+ k = j
182
+ tag = @lasttag
183
+ else
184
+ match = rawdata.index(Tagfind, i + 1)
185
+ unless match
186
+ raise RuntimeError, 'unexpected call to parse_starttag'
187
+ end
188
+ k = i + 1 + ($&.length)
189
+ tag = $&.downcase
190
+ @lasttag = tag
191
+ end
192
+ while k < j
193
+ break unless rawdata.index(Attrfind, k)
194
+ matched_length = $&.length
195
+ attrname, rest, attrvalue = $1, $2, $3
196
+ if not rest
197
+ attrvalue = '' # was: = attrname
198
+ elsif (attrvalue[0] == ?' && attrvalue[-1] == ?') or
199
+ (attrvalue[0] == ?" && attrvalue[-1] == ?")
200
+ attrvalue = attrvalue[1..-2]
201
+ end
202
+ attrs << [attrname.downcase, attrvalue]
203
+ k += matched_length
204
+ end
205
+ if rawdata[j] == ?> #
206
+ j += 1
207
+ end
208
+ finish_starttag(tag, attrs)
209
+ return j
210
+ end
211
+
212
+ def parse_endtag(i)
213
+ rawdata = @rawdata
214
+ j = rawdata.index(Endbracket, i + 1)
215
+ return nil unless j
216
+ tag = (rawdata[i+2..j-1].strip).downcase
217
+ if rawdata[j] == ?> #
218
+ j += 1
219
+ end
220
+ finish_endtag(tag)
221
+ return j
222
+ end
223
+
224
+ def finish_starttag(tag, attrs)
225
+ method = 'start_' + tag
226
+ if self.respond_to?(method)
227
+ @stack << tag
228
+ handle_starttag(tag, method, attrs)
229
+ return 1
230
+ else
231
+ method = 'do_' + tag
232
+ if self.respond_to?(method)
233
+ handle_starttag(tag, method, attrs)
234
+ return 0
235
+ else
236
+ unknown_starttag(tag, attrs)
237
+ return -1
238
+ end
239
+ end
240
+ end
241
+
242
+ def finish_endtag(tag)
243
+ if tag == ''
244
+ found = @stack.length - 1
245
+ if found < 0
246
+ unknown_endtag(tag)
247
+ return
248
+ end
249
+ else
250
+ unless @stack.include? tag
251
+ method = 'end_' + tag
252
+ unless self.respond_to?(method)
253
+ unknown_endtag(tag)
254
+ end
255
+ return
256
+ end
257
+ found = @stack.index(tag) #or @stack.length
258
+ end
259
+ while @stack.length > found
260
+ tag = @stack[-1]
261
+ method = 'end_' + tag
262
+ if respond_to?(method)
263
+ handle_endtag(tag, method)
264
+ else
265
+ unknown_endtag(tag)
266
+ end
267
+ @stack.pop
268
+ end
269
+ end
270
+
271
+ def parse_special(i)
272
+ rawdata = @rawdata
273
+ match = rawdata.index(Endbracket, i+1)
274
+ return nil unless match
275
+ matched_length = $&.length
276
+ handle_special(rawdata[i+1..(match-1)])
277
+ return match - i + matched_length
278
+ end
279
+
280
+ def handle_starttag(tag, method, attrs)
281
+ self.send(method, attrs)
282
+ end
283
+
284
+ def handle_endtag(tag, method)
285
+ self.send(method)
286
+ end
287
+
288
+ def report_unbalanced(tag)
289
+ if @verbose
290
+ print '*** Unbalanced </' + tag + '>', "\n"
291
+       print '*** Stack: ', @stack.inspect, "\n"
292
+ end
293
+ end
294
+
295
+ def handle_charref(name)
296
+ n = Integer(name)
297
+ if !(0 <= n && n <= 255)
298
+ unknown_charref(name)
299
+ return
300
+ end
301
+ handle_data(n.chr)
302
+ end
303
+
304
+ def handle_entityref(name)
305
+ table = Entitydefs
306
+ if table.include?(name)
307
+ handle_data(table[name])
308
+ else
309
+ unknown_entityref(name)
310
+ return
311
+ end
312
+ end
313
+
314
+ def handle_data(data)
315
+ end
316
+
317
+ def handle_comment(data)
318
+ end
319
+
320
+ def handle_special(data)
321
+ end
322
+
323
+ def unknown_starttag(tag, attrs)
324
+ end
325
+ def unknown_endtag(tag)
326
+ end
327
+ def unknown_charref(ref)
328
+ end
329
+ def unknown_entityref(ref)
330
+ end
331
+
332
+ end
333
+
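The tag dispatch in `finish_starttag` relies on Ruby's `respond_to?` and `send` to treat the subclass as a lookup table of handlers. The following minimal sketch shows that idea in isolation. `MiniDispatcher`, `open_tag`, and the logged strings are hypothetical and exist only for this illustration.

```ruby
# Minimal sketch of start_<tag> dispatch. MiniDispatcher is a hypothetical
# class, not part of the gem.
class MiniDispatcher
  attr_reader :log

  def initialize
    @log = []
  end

  # Route a start tag to start_<tag> when such a method is defined,
  # otherwise hand it to the unknown_starttag fallback.
  def open_tag(tag, attrs = [])
    handler = "start_#{tag}"
    if respond_to?(handler)
      send(handler, attrs)
    else
      unknown_starttag(tag, attrs)
    end
  end

  # Handler for <p>: records the Textile paragraph signature.
  def start_p(_attrs)
    @log << 'p. '
  end

  # Fallback for tags without a handler.
  def unknown_starttag(tag, _attrs)
    @log << "(ignored #{tag})"
  end
end

dispatcher = MiniDispatcher.new
dispatcher.open_tag('p')
dispatcher.open_tag('video')
puts dispatcher.log.inspect
```

Adding support for another tag then only requires defining one more `start_<tag>` method, which is what `HTMLToTextileParser` does with `define_method`.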
metadata ADDED
@@ -0,0 +1,75 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: html2textile
3
+ version: !ruby/object:Gem::Version
4
+ hash: -1848230051
5
+ prerelease: true
6
+ segments:
7
+ - 1
8
+ - 0
9
+ - 0
10
+ - beta1
11
+ version: 1.0.0.beta1
12
+ platform: ruby
13
+ authors:
14
+ - James Stewart
15
+ autorequire:
16
+ bindir: bin
17
+ cert_chain: []
18
+
19
+ date: 2011-05-05 00:00:00 +10:00
20
+ default_executable:
21
+ dependencies: []
22
+
23
+ description: Provides an SGML parser to convert HTML into the Textile format
24
+ email: james@ketlai.co.uk
25
+ executables: []
26
+
27
+ extensions: []
28
+
29
+ extra_rdoc_files: []
30
+
31
+ files:
32
+ - lib/html2textile.rb
33
+ - lib/sgml_parser.rb
34
+ - example.rb
35
+ - README.mdown
36
+ has_rdoc: true
37
+ homepage: http://jystewart.net/process/2007/11/converting-html-to-textile-with-ruby
38
+ licenses: []
39
+
40
+ post_install_message:
41
+ rdoc_options: []
42
+
43
+ require_paths:
44
+ - lib
45
+ required_ruby_version: !ruby/object:Gem::Requirement
46
+ none: false
47
+ requirements:
48
+ - - ">="
49
+ - !ruby/object:Gem::Version
50
+ hash: 57
51
+ segments:
52
+ - 1
53
+ - 8
54
+ - 7
55
+ version: 1.8.7
56
+ required_rubygems_version: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ hash: 23
62
+ segments:
63
+ - 1
64
+ - 3
65
+ - 6
66
+ version: 1.3.6
67
+ requirements: []
68
+
69
+ rubyforge_project:
70
+ rubygems_version: 1.3.7
71
+ signing_key:
72
+ specification_version: 3
73
+ summary: Converter from HTML to Textile
74
+ test_files: []
75
+