agio 0.5.0

data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --colour
2
+ --format documentation
data/History.rdoc ADDED
@@ -0,0 +1,3 @@
1
+ === 0.5 / 2011-09-20
2
+
3
+ * Initial release for public consumption.
data/License.rdoc ADDED
@@ -0,0 +1,23 @@
1
+ == License
2
+
3
+ This software is available under the terms of the MIT license.
4
+
5
+ * Copyright 2011 Austin Ziegler
6
+
7
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
8
+ this software and associated documentation files (the "Software"), to deal in
9
+ the Software without restriction, including without limitation the rights to
10
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
11
+ of the Software, and to permit persons to whom the Software is furnished to do
12
+ so, subject to the following conditions:
13
+
14
+ The above copyright notice and this permission notice shall be included in all
15
+ copies or substantial portions of the Software.
16
+
17
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
18
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
19
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
20
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
21
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
22
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
23
+ SOFTWARE.
data/Manifest.txt ADDED
@@ -0,0 +1,23 @@
1
+ .rspec
2
+ History.rdoc
3
+ License.rdoc
4
+ Manifest.txt
5
+ README.rdoc
6
+ Rakefile
7
+ agio.gemspec
8
+ bin/agio
9
+ lib/agio.rb
10
+ lib/agio/block.rb
11
+ lib/agio/bourse.rb
12
+ lib/agio/broker.rb
13
+ lib/agio/data.rb
14
+ lib/agio/flags.rb
15
+ lib/agio/html_element_description.rb
16
+ spec/block_spec.rb
17
+ spec/bourse_spec.rb
18
+ spec/broker_spec.rb
19
+ spec/data_spec.rb
20
+ spec/flags_spec.rb
21
+ spec/html_element_description_spec.rb
22
+ spec/pmh_spec.rb
23
+ spec/spec_helper.rb
data/README.rdoc ADDED
@@ -0,0 +1,97 @@
1
+ = Agio
2
+
3
+ == Description
4
+ Agio is a library and a tool for converting HTML to
5
+ {Markdown}[http://daringfireball.net/projects/markdown/].
6
+
7
+ === About the Name
8
+
9
+ The name was chosen because <em>agio</em> is "a premium on money in exchange",
10
+ sort of the opposite of a markdown. It comes from the Italian <em>aggio</em>
11
+ (premium), not from the Italian <em>agio</em> (ease), although the hope is that
12
+ there is an ease in use of this library.
13
+
14
+ == Why Agio?
15
+
16
+ 1. Agio is well tested.
17
+ 2. Agio is pure Ruby.
18
+ 3. Agio is MIT licensed.
19
+
20
+ === Agio is Well Tested
21
+
22
+ With this release, there are 274 unit tests (RSpec examples) covering almost
23
+ everything. The only things not covered are Agio itself (which mostly performs
24
+ coordination duties) and Agio::Bourse (the parsed HTML-to-Markdown translator).
25
+ These tests cover 85%–95% of the code known to be tested, and the modules
26
+ currently missing tests will have those tests completed prior to the release of
27
+ Agio 1.0.
28
+
29
+ In addition to the unit tests, more than a hundred manual tests have been run
30
+ to verify the quality of output for basic HTML. These manual tests have taken
31
+ one of two forms:
32
+
33
+ 1. Markdown input converted to HTML with
34
+ {rdiscount}[https://github.com/rtomayko/rdiscount], and then converted back
35
+ to Markdown with Agio. In all cases, the resulting Markdown is either
36
+ identical to the original or the differences can be attributed to style
37
+ (how Agio writes emphasized text versus the hand-written original, or how
38
+ Agio represents links by default). This tests Agio's round-trip capability.
39
+ 2. HTML input converted to Markdown with either Pandoc or html2text.py,
40
+ converted <em>back</em> to HTML with rdiscount, and then converted once
41
+ again to Markdown with Agio. This tests Agio's ability to output data
42
+ that is syntactically similar to those of better-known and presumably
43
+ better-tested tools.
44
+
45
+ Agio will likely have bugs, especially before version 1.0, and not all features
46
+ are yet implemented or exposed to the user. Syntactic support is also
47
+ incomplete, as the goal is to support many of the syntax extensions found in
48
+ {Markdown Extra}[http://michelf.com/projects/php-markdown/extra/] or other
49
+ popular modules, such as GitHub-flavoured Markdown.
50
+
51
+ == Where
52
+
53
+ * {GitHub}[https://github.com/halostatue/agio]
54
+
55
+ == Synopsis
56
+
57
+ Install Agio with:
58
+
59
+ gem install agio
60
+
61
+ Run Agio against HTML with:
62
+
63
+ agio input.html > output.markdown
64
+
65
+ == History
66
+
67
+ === Why I Wrote Agio
68
+
69
+ Agio is the result of some yak-shaving as I was looking to convert my blog
70
+ content from {WordPress}[http://wordpress.org] to
71
+ {Jekyll}[https://github.com/mojombo/jekyll]. The Jekyll wiki points to Thomas
72
+ Frössman's {Exitwp}[https://github.com/thomasf/exitwp] Python script as a
73
+ reliable conversion mechanism, but I found that it couldn't handle the data in
74
+ my {WordPress export file}[http://codex.wordpress.org/Tools_Export_Screen]. So,
75
+ I ported Exitwp from Python to Ruby as
76
+ {Poole}[https://github.com/halostatue/poole].
77
+
78
+ Like Exitwp, Poole depends on {Pandoc}[http://johnmacfarlane.net/pandoc/].
79
+ While it's an amazing tool, it took the better part of 45 minutes to compile
80
+ the {Haskell Platform}[http://hackage.haskell.org/platform/] with
81
+ {Homebrew}[https://github.com/mxcl/homebrew] and Pandoc with
82
+ {Cabal}[http://www.haskell.org/cabal/]. Looking around the Ruby community,
83
+ there wasn't a single Ruby-based HTML-to-Markdown converter that I felt I could
84
+ trust to get everything right that was also licensed to my liking (I prefer the
85
+ MIT license). While {Kramdown}[http://kramdown.rubyforge.org/] is impressive,
86
+ it's GPL-licensed. I didn't want Poole (which is MIT-licensed) to depend on
87
+ anything that provided any less freedom for any purpose.
88
+
89
+ Armed with this plan, I started the process of deeply understanding how Aaron
90
+ Swartz's {html2text.py}[http://www.aaronsw.com/2002/html2text/] script works.
91
+ This included an early version of Agio that was a more-or-less straightforward
92
+ port, but produced output that was worse because of differences between
93
+ Python's +textwrap+ module and Ruby's
94
+ {Text::Format}[http://rubyforge.org/projects/text-format] that could not be
95
+ cleanly resolved by tweaking Aaron's algorithm.
96
+
97
+ :include: License.rdoc
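The first form of manual test described in the README (Markdown to HTML, then back to Markdown) could be automated along the following lines. This is only a sketch under stated assumptions, not the author's test code: `normalize`, `round_trip`, and the two stub converters are hypothetical names, and the converters are passed in as callables so that nothing beyond core Ruby is required. A real run would presumably pass rdiscount and Agio in their place.

```ruby
# Hypothetical round-trip harness; the converters are injected as callables,
# so no particular Markdown or HTML library is assumed.

# Collapse differences the README treats as purely stylistic: `*text*`
# versus `_text_` emphasis, and trailing whitespace.
def normalize(markdown)
  markdown.gsub(/(?<![\w*])\*(?!\*)([^*\n]+)\*(?![\w*])/, '_\1_')
          .gsub(/[ \t]+$/, '')
          .strip
end

# Returns [ok, result]; ok is true when the regenerated Markdown matches
# the original after normalization.
def round_trip(markdown, to_html, to_markdown)
  result = to_markdown.call(to_html.call(markdown))
  [normalize(result) == normalize(markdown), result]
end

# Stub converters that understand a single paragraph with emphasis only.
to_html = lambda do |md|
  "<p>" + md.gsub(/\*([^*\n]+)\*/, '<em>\1</em>') + "</p>"
end
to_markdown = lambda do |html|
  html.sub(/\A<p>/, "").sub(%r{</p>\z}, "").gsub(%r{<em>(.*?)</em>}, '_\1_')
end

ok, result = round_trip("Some *emphasis* here.", to_html, to_markdown)
puts "#{ok ? 'match' : 'differs'}: #{result}"
```

Normalizing both sides before comparing reflects the README's remark that differences in how emphasis is written count as style rather than as failures.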
data/Rakefile ADDED
@@ -0,0 +1,35 @@
1
+ # -*- ruby encoding: utf-8 -*-
2
+
3
+ require 'rubygems'
4
+ require 'rspec'
5
+ require 'hoe'
6
+
7
+ Hoe.plugin :doofus
8
+ Hoe.plugin :gemspec
9
+ Hoe.plugin :git
10
+
11
+ Hoe.plugins.delete :rcov
12
+
13
+ spec = Hoe.spec 'agio' do
14
+ developer('Austin Ziegler', 'austin@rubyforge.org')
15
+
16
+ self.history_file = 'History.rdoc'
17
+ self.readme_file = 'README.rdoc'
18
+ self.extra_rdoc_files = FileList["*.rdoc"].to_a
19
+
20
+ self.extra_deps << ['nokogiri', '~> 1.5.0']
21
+ self.extra_deps << ['main', '~> 4.7.3']
22
+ self.extra_dev_deps << ['rspec', '~> 2.0']
23
+ self.extra_dev_deps << ['hoe-doofus', '~> 1.0']
24
+ self.extra_dev_deps << ['hoe-gemspec', '~> 1.0']
25
+ self.extra_dev_deps << ['hoe-git', '~> 1.0']
26
+ self.extra_dev_deps << ['hoe-seattlerb', '~> 1.0']
27
+ end
28
+
29
+ RSpec::Core::RakeTask.new(:rcov) do |t|
30
+ t.rcov = true
31
+ t.rcov_opts = %q[--exclude "osx/objc,gems/,spec/,features/"]
32
+ t.verbose = true
33
+ end
34
+
35
+ # vim: syntax=ruby
data/agio.gemspec ADDED
@@ -0,0 +1,53 @@
1
+ # -*- encoding: utf-8 -*-
2
+
3
+ Gem::Specification.new do |s|
4
+ s.name = "agio"
5
+ s.version = "0.5.0"
6
+
7
+ s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
8
+ s.authors = ["Austin Ziegler"]
9
+ s.date = "2011-09-20"
10
+ s.description = "Agio is a library and a tool for converting HTML to\n{Markdown}[http://daringfireball.net/projects/markdown/]."
11
+ s.email = ["austin@rubyforge.org"]
12
+ s.executables = ["agio"]
13
+ s.extra_rdoc_files = ["Manifest.txt", "History.rdoc", "License.rdoc", "README.rdoc"]
14
+ s.files = [".rspec", "History.rdoc", "License.rdoc", "Manifest.txt", "README.rdoc", "Rakefile", "agio.gemspec", "bin/agio", "lib/agio.rb", "lib/agio/block.rb", "lib/agio/bourse.rb", "lib/agio/broker.rb", "lib/agio/data.rb", "lib/agio/flags.rb", "lib/agio/html_element_description.rb", "spec/block_spec.rb", "spec/bourse_spec.rb", "spec/broker_spec.rb", "spec/data_spec.rb", "spec/flags_spec.rb", "spec/html_element_description_spec.rb", "spec/pmh_spec.rb", "spec/spec_helper.rb", ".gemtest"]
15
+ s.rdoc_options = ["--main", "README.rdoc"]
16
+ s.require_paths = ["lib"]
17
+ s.rubyforge_project = "agio"
18
+ s.rubygems_version = "1.8.10"
19
+ s.summary = "Agio is a library and a tool for converting HTML to {Markdown}[http://daringfireball.net/projects/markdown/]."
20
+
21
+ if s.respond_to? :specification_version then
22
+ s.specification_version = 3
23
+
24
+ if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
25
+ s.add_runtime_dependency(%q<nokogiri>, ["~> 1.5.0"])
26
+ s.add_runtime_dependency(%q<main>, ["~> 4.7.3"])
27
+ s.add_development_dependency(%q<rspec>, ["~> 2.0"])
28
+ s.add_development_dependency(%q<hoe-doofus>, ["~> 1.0"])
29
+ s.add_development_dependency(%q<hoe-gemspec>, ["~> 1.0"])
30
+ s.add_development_dependency(%q<hoe-git>, ["~> 1.0"])
31
+ s.add_development_dependency(%q<hoe-seattlerb>, ["~> 1.0"])
32
+ s.add_development_dependency(%q<hoe>, ["~> 2.12"])
33
+ else
34
+ s.add_dependency(%q<nokogiri>, ["~> 1.5.0"])
35
+ s.add_dependency(%q<main>, ["~> 4.7.3"])
36
+ s.add_dependency(%q<rspec>, ["~> 2.0"])
37
+ s.add_dependency(%q<hoe-doofus>, ["~> 1.0"])
38
+ s.add_dependency(%q<hoe-gemspec>, ["~> 1.0"])
39
+ s.add_dependency(%q<hoe-git>, ["~> 1.0"])
40
+ s.add_dependency(%q<hoe-seattlerb>, ["~> 1.0"])
41
+ s.add_dependency(%q<hoe>, ["~> 2.12"])
42
+ end
43
+ else
44
+ s.add_dependency(%q<nokogiri>, ["~> 1.5.0"])
45
+ s.add_dependency(%q<main>, ["~> 4.7.3"])
46
+ s.add_dependency(%q<rspec>, ["~> 2.0"])
47
+ s.add_dependency(%q<hoe-doofus>, ["~> 1.0"])
48
+ s.add_dependency(%q<hoe-gemspec>, ["~> 1.0"])
49
+ s.add_dependency(%q<hoe-git>, ["~> 1.0"])
50
+ s.add_dependency(%q<hoe-seattlerb>, ["~> 1.0"])
51
+ s.add_dependency(%q<hoe>, ["~> 2.12"])
52
+ end
53
+ end
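The gemspec above pins its dependencies with pessimistic (`~>`) constraints. As a quick illustration of what those constraints accept, the sketch below checks a few versions using RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The helper name `satisfies?` and the sample version numbers are made up for the example.

```ruby
require 'rubygems'

# Hypothetical helper: does +version+ satisfy +constraint+?
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# '~> 1.5.0' allows any 1.5.x release at or above 1.5.0, but not 1.6.
puts satisfies?('~> 1.5.0', '1.5.9')  # true
puts satisfies?('~> 1.5.0', '1.6.0')  # false

# '~> 2.0' allows any 2.x release, but not 3.0.
puts satisfies?('~> 2.0', '2.14')     # true
puts satisfies?('~> 2.0', '3.0')      # false

# '~> 4.7.3' also has a lower bound: 4.7.2 is too old.
puts satisfies?('~> 4.7.3', '4.7.2')  # false
```

The number of segments in the constraint decides which segment is allowed to float, which is why `~> 1.5.0` is much tighter than `~> 2.0`.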
data/bin/agio ADDED
@@ -0,0 +1,17 @@
1
+ #!ruby
2
+
3
+ # main 4.7.3 has a warning in main.rb:
4
+ # main-4.7.3/lib/main.rb:28: warning: redefine libdir
5
+ #
6
+ # map 4.3.0 has a warning in map.rb:
7
+ # map-4.3.0/lib/map.rb:551: warning: method redefined; discarding old to_map
8
+ #
9
+ # Therefore, no warnings.
10
+ #!ruby -w
11
+
12
+ require 'agio'
13
+
14
+ ARGV.each do |arg|
15
+ html = File.read(arg)
16
+ puts Agio.to_markdown(html)
17
+ end
data/lib/agio.rb ADDED
@@ -0,0 +1,171 @@
1
+ # -*- ruby encoding: utf-8 -*-
2
+
3
+ require 'nokogiri'
4
+ require 'stringio'
5
+ require 'text/format'
6
+
7
+ ##
8
+ # = Agio
9
+ # Agio converts HTML to Markdown.
10
+ #
11
+ # == About the Name
12
+ # The name was chosen because agio is "a premium on money in exchange",
13
+ # sort of the opposite of a markdown. It comes from the Italian aggio
14
+ # (premium), not from the Italian agio (ease), although the hope is that
15
+ # there is an ease in use of this library.
16
+ #
17
+ # It is structurally based on Aaron Swartz's html2text Python script inasmuch
18
+ # as the SAX parsing he does is also done here and with pretty much the same
19
+ # behaviour.
20
+ #
21
+ # == License
22
+ # This code is licensed under the MIT License.
23
+ class Agio
24
+ VERSION = "0.5.0"
25
+ end
26
+
27
+ require 'agio/flags'
28
+ require 'agio/broker'
29
+ require 'agio/bourse'
30
+
31
+ class Agio
32
+ ##
33
+ # :attr_accessor:
34
+ # The default HTML document to be processed. Because the #parse method can
35
+ # be called with an HTML document, this does not *need* to be set.
36
+ attr_accessor :html
37
+
38
+ ##
39
+ # :attr_reader:
40
+ # The width of the body text for the generated Markdown text outside of
41
+ # +pre+ bodies and other items which do not wrap well in most Markdown
42
+ # parsers.
43
+ def columns
44
+ bourse.formatter.columns
45
+ end
46
+ ##
47
+ # :attr_writer:
48
+ # The width of the body text for the generated Markdown text outside of
49
+ # +pre+ bodies and other items which do not wrap well in most Markdown
50
+ # parsers.
51
+ #
52
+ # If +nil+ is provided, the default value of 78 is set.
53
+ def columns=(value)
54
+ bourse.formatter.columns = value || 78
55
+ bourse.formatter.columns
56
+ end
57
+
58
+ ##
59
+ # :attr_reader:
60
+ # Controls how links are placed in the Markdown document.
61
+ def link_placement
62
+ bourse.link_placement
63
+ end
64
+ # :attr_writer: link_placement
65
+ # Controls how links are placed in the Markdown document.
66
+ #
67
+ # In-Line:: Links appear next to their wrapped text, like "[See the
68
+ # example](http://example.org/)". The default for
69
+ # link_placement, set if the value is +nil+, <tt>:inline</tt>,
70
+ # or any other unrecognized value.
71
+ # Paragraph:: Links appear in the body as linked references, like "[See
72
+ # the example][1]", and the reference "[1]:
73
+ # http://example.org" is placed immediately after the
74
+ # paragraph in which the link first appears. Used if the value
75
+ # of link_placement is <tt>:paragraph</tt>.
76
+ # Endnote:: Links appear in the body as linked references, like "[See
77
+ # the example][1]", and the reference "[1]:
78
+ # http://example.org" is placed at the end of the document.
79
+ # Used if the value of link_placement is <tt>:endnote</tt>.
80
+ def link_placement=(value)
81
+ bourse.link_placement = value
82
+ end
83
+
84
+ ##
85
+ # :attr_reader: base_url
86
+ # The base URL for implicit (or local) link references. If not provided,
87
+ # links will remain implicit. This is a String value.
88
+ def base_url
89
+ bourse.base_url
90
+ end
91
+ ##
92
+ # :attr_writer: base_url
93
+ # The base URL for implicit (or local) link references. If not provided,
94
+ # links will remain implicit. This is a String value.
95
+ def base_url=(value)
96
+ bourse.base_url = value
97
+ end
98
+
99
+ ##
100
+ # :attr_reader: skip_local_fragments
101
+ # Controls whether local link references containing fragments will be
102
+ # output in the final document.
103
+ #
104
+ # A local link reference is either an implicit link reference (one missing
105
+ # the protocol and host, such as '&lt;a href="about.html"&gt;' or '&lt;a
106
+ # href="/about.html"&gt;') or one that points to the #base_url.
107
+ #
108
+ # If this value is +true+, links that refer to fragments on local URIs
109
+ # will be ignored (such as '&lt;a href="about.html#address"&gt;').
110
+ def skip_local_fragments
111
+ bourse.skip_local_fragments
112
+ end
113
+
114
+ ##
115
+ # :attr_writer: skip_local_fragments
116
+ # Controls whether local link references containing fragments will be
117
+ # output in the final document.
118
+ #
119
+ # A local link reference is either an implicit link reference (one missing
120
+ # the protocol and host, such as '&lt;a href="about.html"&gt;' or '&lt;a
121
+ # href="/about.html"&gt;') or one that points to the #base_url.
122
+ #
123
+ # If this value is +true+, links that refer to fragments on local URIs
124
+ # will be ignored (such as '&lt;a href="about.html#address"&gt;').
125
+ def skip_local_fragments=(value)
126
+ bourse.skip_local_fragments = value
127
+ end
128
+
129
+ def initialize(options = {})
130
+ @broker = Agio::Broker.new
131
+ @bourse = Agio::Bourse.new(broker, self)
132
+
133
+ self.html = options[:html]
134
+ self.columns = options[:columns]
135
+ self.link_placement = options[:link_placement]
136
+
137
+ yield self if block_given?
138
+
139
+ @parser = Nokogiri::HTML::SAX::Parser.new(broker)
140
+ end
141
+
142
+ def parse(html)
143
+ @parser.parse(html)
144
+ self
145
+ end
146
+ private :parse
147
+
148
+ def transform(html)
149
+ parse(html)
150
+ bourse.transform
151
+ end
152
+ private :transform
153
+
154
+ def to_s(html = nil)
155
+ transform(html || self.html)
156
+ bourse.output.string
157
+ end
158
+ alias to_markdown to_s
159
+
160
+ def self.to_markdown(html)
161
+ self.new.to_markdown(html)
162
+ end
163
+
164
+ attr_reader :broker
165
+ private :broker
166
+
167
+ attr_reader :bourse
168
+ private :bourse
169
+ end
170
+
171
+ # vim: ft=ruby
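The three `link_placement` modes documented in lib/agio.rb can be illustrated with a small stand-alone renderer. This is a sketch of the documented behaviour only, not Agio::Bourse's implementation: `render_links` and its input shape (paragraphs as arrays of plain strings and `[text, url]` pairs) are assumptions made for the example.

```ruby
# Hypothetical renderer that mimics the documented placement modes.
# Each paragraph is an array of plain strings and [text, url] pairs.
def render_links(paragraphs, placement = :inline)
  # nil or any unrecognized value falls back to :inline, as documented.
  placement = :inline unless [:paragraph, :endnote].include?(placement)
  refs = []  # every [number, url] reference, in order of first use
  out = []

  paragraphs.each do |segments|
    first_used = []  # references that first appear in this paragraph
    text = segments.map do |seg|
      next seg if seg.is_a?(String)
      label, url = seg
      next "[#{label}](#{url})" if placement == :inline

      ref = refs.find { |_, u| u == url }
      unless ref
        ref = [refs.size + 1, url]
        refs << ref
        first_used << ref
      end
      "[#{label}][#{ref[0]}]"
    end.join
    out << text

    # :paragraph puts each reference right after its first paragraph.
    if placement == :paragraph && !first_used.empty?
      out << first_used.map { |n, u| "[#{n}]: #{u}" }.join("\n")
    end
  end

  # :endnote collects every reference at the end of the document.
  if placement == :endnote && !refs.empty?
    out << refs.map { |n, u| "[#{n}]: #{u}" }.join("\n")
  end
  out.join("\n\n")
end

doc = [["See ", ["the example", "http://example.org/"], "."],
       ["Again: ", ["same", "http://example.org/"]]]
puts render_links(doc, :paragraph)
```

Reusing the reference number when a URL repeats is a choice made for this sketch; the source does not say how Agio numbers repeated links.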
data/lib/agio/block.rb ADDED
@@ -0,0 +1,132 @@
1
+ # -*- ruby encoding: utf-8 -*-
2
+
3
+ require 'agio/html_element_description'
4
+
5
+ ##
6
+ # A Block is the fundamental collection for the Broker that is used to
7
+ # then generate the Markdown.
8
+ class Agio::Block
9
+ def inspect
10
+ if options.empty?
11
+ %Q(#<#{self.class} #{name} #{contents.inspect}>)
12
+ else
13
+ %Q(#<#{self.class} #{name}(#{options.inspect}) #{contents.inspect}>)
14
+ end
15
+ end
16
+
17
+ # The name of the element the Block is for.
18
+ attr_reader :name
19
+
20
+ # The options, if provided, for the element.
21
+ attr_reader :options
22
+
23
+ # The contents of the Block.
24
+ attr_reader :contents
25
+
26
+ # The description of the HTML element the Block represents (this will
27
+ # always be a Nokogiri::HTML::ElementDescription or +nil+).
28
+ attr_reader :description
29
+
30
+ # Create the Block from a tag start.
31
+ def initialize(name, options = {})
32
+ @name, @options = name, options
33
+ @description = Agio::HTMLElementDescription[name]
34
+ @contents = []
35
+ end
36
+
37
+ # Append the contents provided to the Block.
38
+ def append(*contents)
39
+ @contents.push(*contents)
40
+ end
41
+
42
+ # Returns +true+ if the Block is a standard HTML element (as understood
43
+ # by Nokogiri).
44
+ def standard?
45
+ !!description
46
+ end
47
+
48
+ ##
49
+ # Returns +true+ if this Block is an HTML inline element.
50
+ def inline?
51
+ description && description.inline?
52
+ end
53
+
54
+ ##
55
+ # Returns +true+ if this Block is an HTML block element.
56
+ def block?
57
+ description && description.block?
58
+ end
59
+
60
+ ##
61
+ # Returns +true+ if this Block can contain the other Block provided.
62
+ def can_contain?(other)
63
+ case other
64
+ when Agio::Block
65
+ description && description.sub_elements.include?(other.name)
66
+ when String
67
+ description && description.sub_elements.include?(other)
68
+ else
69
+ false
70
+ end
71
+ end
72
+
73
+ ##
74
+ # Returns +true+ if the Block is a 'li' (list item) element.
75
+ def li?
76
+ name == "li"
77
+ end
78
+
79
+ ##
80
+ # Returns +true+ if the Block is a 'pre' element.
81
+ def pre?
82
+ name == "pre"
83
+ end
84
+
85
+ ##
86
+ # Returns +true+ if the Block is a 'ul' or 'ol' element.
87
+ def ul_ol?
88
+ name == "ul" or name == "ol"
89
+ end
90
+
91
+ ##
92
+ # Returns +true+ if the Block is a definition item ('dt' or 'dd').
93
+ def definition?
94
+ name == "dt" or name == "dd"
95
+ end
96
+
97
+ ##
98
+ # Returns +true+ if the Block is a definition list ('dl') element or a definition item ('dt' or 'dd').
99
+ def dl?
100
+ definition? or name == "dl"
101
+ end
102
+
103
+ ##
104
+ # Determine whether the +other+ Block is a sibling of this Block.
105
+ # Blocks are siblings if condition 1 holds and either condition 2 or 3 holds:
106
+ #
107
+ # 1. This Block cannot contain the other Block.
108
+ # 2. This Block is a definition ('dt' or 'dd') and so is the other
109
+ # Block.
110
+ # 3. This Block's name and the other Block's name are the same.
111
+ def sibling_of?(other)
112
+ if can_contain? other
113
+ false
114
+ elsif definition? and other.definition?
115
+ true
116
+ elsif name == other.name
117
+ true
118
+ else
119
+ false
120
+ end
121
+ end
122
+
123
+ ##
124
+ # Used mostly for testing.
125
+ def ==(other)
126
+ name == other.name &&
127
+ options == other.options &&
128
+ contents == other.contents
129
+ end
130
+ end
131
+
132
+ # vim: ft=ruby
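The sibling rule in Agio::Block#sibling_of? depends on the order of its checks: containment is tested first and rules out sibling status before names are compared. The sketch below reproduces that order with a stand-in class. `MiniBlock` and the abbreviated `SUB_ELEMENTS` table are hypothetical; the real class relies on Agio::HTMLElementDescription, whose contents are not shown in this diff.

```ruby
# Hypothetical, heavily abbreviated containment table standing in for the
# element descriptions that Agio::Block gets from its description object.
SUB_ELEMENTS = {
  "ul" => %w[li],
  "ol" => %w[li],
  "li" => %w[p ul ol],
  "dl" => %w[dt dd],
  "p"  => %w[em strong a]
}.freeze

MiniBlock = Struct.new(:name) do
  # True if this element may directly contain the other one.
  def can_contain?(other)
    SUB_ELEMENTS.fetch(name, []).include?(other.name)
  end

  # True for definition items ('dt' or 'dd').
  def definition?
    name == "dt" || name == "dd"
  end

  # Same rule order as Agio::Block#sibling_of?: containment is checked first.
  def sibling_of?(other)
    return false if can_contain?(other)
    (definition? && other.definition?) || name == other.name
  end
end

li = MiniBlock.new("li")
ul = MiniBlock.new("ul")
dt = MiniBlock.new("dt")
dd = MiniBlock.new("dd")

puts li.sibling_of?(MiniBlock.new("li"))  # same name
puts dt.sibling_of?(dd)                   # both definition items
puts ul.sibling_of?(li)                   # ul can contain li
```

With this table, two `li` blocks and a `dt`/`dd` pair are siblings, while `ul` and `li` are not, because a `ul` can contain an `li`.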