agio 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --colour
2
+ --format documentation
@@ -0,0 +1,3 @@
1
+ === 0.5 / 2011-09-20
2
+
3
+ * Initial release for public consumption.
@@ -0,0 +1,23 @@
1
+ == License
2
+
3
+ This software is available under the terms of the MIT license.
4
+
5
+ * Copyright 2011 Austin Ziegler
6
+
7
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
8
+ this software and associated documentation files (the "Software"), to deal in
9
+ the Software without restriction, including without limitation the rights to
10
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
11
+ of the Software, and to permit persons to whom the Software is furnished to do
12
+ so, subject to the following conditions:
13
+
14
+ The above copyright notice and this permission notice shall be included in all
15
+ copies or substantial portions of the Software.
16
+
17
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
18
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
19
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
20
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
21
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
22
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
23
+ SOFTWARE.
@@ -0,0 +1,23 @@
1
+ .rspec
2
+ History.rdoc
3
+ License.rdoc
4
+ Manifest.txt
5
+ README.rdoc
6
+ Rakefile
7
+ agio.gemspec
8
+ bin/agio
9
+ lib/agio.rb
10
+ lib/agio/block.rb
11
+ lib/agio/bourse.rb
12
+ lib/agio/broker.rb
13
+ lib/agio/data.rb
14
+ lib/agio/flags.rb
15
+ lib/agio/html_element_description.rb
16
+ spec/block_spec.rb
17
+ spec/bourse_spec.rb
18
+ spec/broker_spec.rb
19
+ spec/data_spec.rb
20
+ spec/flags_spec.rb
21
+ spec/html_element_description_spec.rb
22
+ spec/pmh_spec.rb
23
+ spec/spec_helper.rb
@@ -0,0 +1,97 @@
1
+ = Agio
2
+
3
+ == Description
4
+ Agio is a library and a tool for converting HTML to
5
+ {Markdown}[http://daringfireball.net/projects/markdown/].
6
+
7
+ === About the Name
8
+
9
+ The name was chosen because <em>agio</em> is "a premium on money in exchange",
10
+ sort of the opposite of a markdown. It comes from the Italian <em>aggio</em>
11
+ (premium), not from the Italian <em>agio</em> (ease), although the hope is that
12
+ there is an ease in use of this library.
13
+
14
+ == Why Agio?
15
+
16
+ 1. Agio is well tested.
17
+ 2. Agio is pure Ruby.
18
+ 3. Agio is MIT licensed.
19
+
20
+ === Agio is Well Tested
21
+
22
+ With this release, there are 274 unit tests (RSpec examples) covering almost
23
+ everything. The only things not covered are Agio itself (which mostly performs
24
+ coordination duties) and Agio::Bourse (the parsed HTML-to-Markdown translator).
25
+ These tests cover 85%–95% of the code known to be tested, and the modules
26
+ currently missing tests will have those tests completed prior to the release of
27
+ Agio 1.0.
28
+
29
+ In addition to the unit tests, more than a hundred manual tests have been run
30
+ to verify the quality of output for basic HTML. These manual tests have taken
31
+ one of two forms:
32
+
33
+ 1. Markdown input converted to HTML with
34
+ {rdiscount}[https://github.com/rtomayko/rdiscount], and then converted back
35
+ to Markdown with Agio. In all cases, the resulting Markdown is either
36
+ identical to the original or the differences can be attributed to style
37
+ (how Agio writes emphasized text versus the hand-written original, or how
38
+ Agio represents links by default). This tests Agio's round-trip capability.
39
+ 2. HTML input converted to Markdown with either Pandoc or html2txt.py,
40
+ converted <em>back</em> to HTML with rdiscount, and then converted once
41
+ again to Markdown with Agio. This tests Agio's ability to produce output
42
+ that is syntactically similar to that of better-known and presumably
43
+ better-tested tools.
44
+
45
+ Agio will likely have bugs, especially before version 1.0, and not all features
46
+ are yet implemented or exposed to the user. Syntactic support is also
47
+ incomplete, as the goal is to support many of the syntax extensions found in
48
+ {Markdown Extra}[http://michelf.com/projects/php-markdown/extra/] or other
49
+ popular modules, such as GitHub-flavoured Markdown.
50
+
51
+ == Where
52
+
53
+ * {GitHub}[https://github.com/halostatue/agio]
54
+
55
+ == Synopsis
56
+
57
+ Install Agio with:
58
+
59
+ gem install agio
60
+
61
+ Run Agio against HTML with:
62
+
63
+ agio input.html > output.markdown
64
+
65
+ == History
66
+
67
+ === Why I Wrote Agio
68
+
69
+ Agio is the result of some yak-shaving as I was looking to convert my blog
70
+ content from {WordPress}[http://wordpress.org] to
71
+ {Jekyll}[https://github.com/mojombo/jekyll]. The Jekyll wiki points to Thomas
72
+ Frössman's {Exitwp}[https://github.com/thomasf/exitwp] Python script as a
73
+ reliable conversion mechanism, but I found that it couldn't handle the data in
74
+ my {WordPress export file}[http://codex.wordpress.org/Tools_Export_Screen]. So,
75
+ I ported Exitwp from Python to Ruby as
76
+ {Poole}[https://github.com/halostatue/poole].
77
+
78
+ Like Exitwp, Poole depends on {Pandoc}[http://johnmacfarlane.net/pandoc/].
79
+ While it's an amazing tool, it took the better part of 45 minutes to compile
80
+ the {Haskell Platform}[http://hackage.haskell.org/platform/] with
81
+ {Homebrew}[https://github.com/mxcl/homebrew] and Pandoc with
82
+ {Cabal}[http://www.haskell.org/cabal/]. Looking around the Ruby community,
83
+ there wasn't a single Ruby-based HTML-to-Markdown converter that I felt I could
84
+ trust to get everything right that was also licensed to my liking (I prefer the
85
+ MIT license). While {Kramdown}[http://kramdown.rubyforge.org/] is impressive,
86
+ it's GPL-licensed. I didn't want Poole (which is MIT-licensed) to depend on
87
+ anything that provided any less freedom for any purpose.
88
+
89
+ Armed with this plan, I started the process of deeply understanding how Aaron
90
+ Swartz's {html2txt.py}[http://www.aaronsw.com/2002/html2text/] script works.
91
+ This included an early version of Agio that was a more-or-less straightforward
92
+ port, but produced output that was worse because of differences between
93
+ Python's +textwrap+ module and Ruby's
94
+ {Text::Format}[http://rubyforge.org/projects/text-format] that could not be
95
+ cleanly resolved by tweaking Aaron's algorithm.
96
+
97
+ :include: License.rdoc
@@ -0,0 +1,35 @@
1
+ # -*- ruby encoding: utf-8 -*-
2
+
3
+ require 'rubygems'
4
+ require 'rspec'
5
+ require 'hoe'
6
+
7
+ Hoe.plugin :doofus
8
+ Hoe.plugin :gemspec
9
+ Hoe.plugin :git
10
+
11
+ Hoe.plugins.delete :rcov
12
+
13
+ spec = Hoe.spec 'agio' do
14
+ developer('Austin Ziegler', 'austin@rubyforge.org')
15
+
16
+ self.history_file = 'History.rdoc'
17
+ self.readme_file = 'README.rdoc'
18
+ self.extra_rdoc_files = FileList["*.rdoc"].to_a
19
+
20
+ self.extra_deps << ['nokogiri', '~> 1.5.0']
21
+ self.extra_deps << ['main', '~> 4.7.3']
22
+ self.extra_dev_deps << ['rspec', '~> 2.0']
23
+ self.extra_dev_deps << ['hoe-doofus', '~> 1.0']
24
+ self.extra_dev_deps << ['hoe-gemspec', '~> 1.0']
25
+ self.extra_dev_deps << ['hoe-git', '~> 1.0']
26
+ self.extra_dev_deps << ['hoe-seattlerb', '~> 1.0']
27
+ end
28
+
29
+ RSpec::Core::RakeTask.new(:rcov) do |t|
30
+ t.rcov = true
31
+ t.rcov_opts = %q[--exclude "osx/objc,gems/,spec/,features/"]
32
+ t.verbose = true
33
+ end
34
+
35
+ # vim: syntax=ruby
@@ -0,0 +1,53 @@
1
+ # -*- encoding: utf-8 -*-
2
+
3
+ Gem::Specification.new do |s|
4
+ s.name = "agio"
5
+ s.version = "0.5.0"
6
+
7
+ s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
8
+ s.authors = ["Austin Ziegler"]
9
+ s.date = "2011-09-20"
10
+ s.description = "Agio is a library and a tool for converting HTML to\n{Markdown}[http://daringfireball.net/projects/markdown/]."
11
+ s.email = ["austin@rubyforge.org"]
12
+ s.executables = ["agio"]
13
+ s.extra_rdoc_files = ["Manifest.txt", "History.rdoc", "License.rdoc", "README.rdoc"]
14
+ s.files = [".rspec", "History.rdoc", "License.rdoc", "Manifest.txt", "README.rdoc", "Rakefile", "agio.gemspec", "bin/agio", "lib/agio.rb", "lib/agio/block.rb", "lib/agio/bourse.rb", "lib/agio/broker.rb", "lib/agio/data.rb", "lib/agio/flags.rb", "lib/agio/html_element_description.rb", "spec/block_spec.rb", "spec/bourse_spec.rb", "spec/broker_spec.rb", "spec/data_spec.rb", "spec/flags_spec.rb", "spec/html_element_description_spec.rb", "spec/pmh_spec.rb", "spec/spec_helper.rb", ".gemtest"]
15
+ s.rdoc_options = ["--main", "README.rdoc"]
16
+ s.require_paths = ["lib"]
17
+ s.rubyforge_project = "agio"
18
+ s.rubygems_version = "1.8.10"
19
+ s.summary = "Agio is a library and a tool for converting HTML to {Markdown}[http://daringfireball.net/projects/markdown/]."
20
+
21
+ if s.respond_to? :specification_version then
22
+ s.specification_version = 3
23
+
24
+ if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
25
+ s.add_runtime_dependency(%q<nokogiri>, ["~> 1.5.0"])
26
+ s.add_runtime_dependency(%q<main>, ["~> 4.7.3"])
27
+ s.add_development_dependency(%q<rspec>, ["~> 2.0"])
28
+ s.add_development_dependency(%q<hoe-doofus>, ["~> 1.0"])
29
+ s.add_development_dependency(%q<hoe-gemspec>, ["~> 1.0"])
30
+ s.add_development_dependency(%q<hoe-git>, ["~> 1.0"])
31
+ s.add_development_dependency(%q<hoe-seattlerb>, ["~> 1.0"])
32
+ s.add_development_dependency(%q<hoe>, ["~> 2.12"])
33
+ else
34
+ s.add_dependency(%q<nokogiri>, ["~> 1.5.0"])
35
+ s.add_dependency(%q<main>, ["~> 4.7.3"])
36
+ s.add_dependency(%q<rspec>, ["~> 2.0"])
37
+ s.add_dependency(%q<hoe-doofus>, ["~> 1.0"])
38
+ s.add_dependency(%q<hoe-gemspec>, ["~> 1.0"])
39
+ s.add_dependency(%q<hoe-git>, ["~> 1.0"])
40
+ s.add_dependency(%q<hoe-seattlerb>, ["~> 1.0"])
41
+ s.add_dependency(%q<hoe>, ["~> 2.12"])
42
+ end
43
+ else
44
+ s.add_dependency(%q<nokogiri>, ["~> 1.5.0"])
45
+ s.add_dependency(%q<main>, ["~> 4.7.3"])
46
+ s.add_dependency(%q<rspec>, ["~> 2.0"])
47
+ s.add_dependency(%q<hoe-doofus>, ["~> 1.0"])
48
+ s.add_dependency(%q<hoe-gemspec>, ["~> 1.0"])
49
+ s.add_dependency(%q<hoe-git>, ["~> 1.0"])
50
+ s.add_dependency(%q<hoe-seattlerb>, ["~> 1.0"])
51
+ s.add_dependency(%q<hoe>, ["~> 2.12"])
52
+ end
53
+ end
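The dependency declarations above all use the pessimistic operator `~>`. As a minimal sketch of what those constraints admit, the snippet below checks a few version strings with the standard RubyGems classes. The helper name `satisfies?` and the sample version numbers are illustrative assumptions, not part of the gem.

```ruby
require 'rubygems'

# Hypothetical helper: true when +version+ satisfies the RubyGems
# +requirement+ string.
def satisfies?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

# '~> 1.5.0' (nokogiri above) means >= 1.5.0 and < 1.6: patch releases only.
puts satisfies?('~> 1.5.0', '1.5.9')  # a later patch release
puts satisfies?('~> 1.5.0', '1.6.0')  # the next minor release

# '~> 2.0' (rspec above) means >= 2.0 and < 3.0: any 2.x release.
puts satisfies?('~> 2.0', '2.14.1')
puts satisfies?('~> 2.0', '3.0.0')
```

The three-segment form (`~> 1.5.0`, `~> 4.7.3`) pins the minor version, while the two-segment form (`~> 2.0`, `~> 1.0`) pins only the major version.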
@@ -0,0 +1,17 @@
1
+ #!ruby
2
+
3
+ # main 4.7.3 has a warning in main.rb:
4
+ # main-4.7.3/lib/main.rb:28: warning: redefine libdir
5
+ #
6
+ # map 4.3.0 has a warning in map.rb:
7
+ # map-4.3.0/lib/map.rb:551: warning: method redefined; discarding old to_map
8
+ #
9
+ # Therefore, no warnings.
10
+ #!ruby -w
11
+
12
+ require 'agio'
13
+
14
+ ARGV.each do |arg|
15
+ html = File.open(arg) { |f| f.read }
16
+ puts Agio.to_markdown(html)
17
+ end
@@ -0,0 +1,171 @@
1
+ # -*- ruby encoding: utf-8 -*-
2
+
3
+ require 'nokogiri'
4
+ require 'stringio'
5
+ require 'text/format'
6
+
7
+ ##
8
+ # = Agio
9
+ # Agio converts HTML to Markdown.
10
+ #
11
+ # == About the Name
12
+ # The name was chosen because agio is "a premium on money in exchange",
13
+ # sort of the opposite of a markdown. It comes from the Italian aggio
14
+ # (premium), not from the Italian agio (ease), although the hope is that
15
+ # there is an ease in use of this library.
16
+ #
17
+ # It is structurally based on Aaron Swartz's html2txt Python script inasmuch
18
+ # as the SAX parsing he does is also done here and with pretty much the same
19
+ # behaviour.
20
+ #
21
+ # == License
22
+ # This code is licensed under the MIT License.
23
+ class Agio
24
+ VERSION = "0.5.0"
25
+ end
26
+
27
+ require 'agio/flags'
28
+ require 'agio/broker'
29
+ require 'agio/bourse'
30
+
31
+ class Agio
32
+ ##
33
+ # :attr_accessor:
34
+ # The default HTML document to be processed. Because the #parse method can
35
+ # be called with an HTML document, this does not *need* to be set.
36
+ attr_accessor :html
37
+
38
+ ##
39
+ # :attr_reader:
40
+ # The width of the body text for the generated Markdown text outside of
41
+ # +pre+ bodies and other items which do not wrap well in most Markdown
42
+ # parsers.
43
+ def columns
44
+ bourse.formatter.columns
45
+ end
46
+ ##
47
+ # :attr_writer:
48
+ # The width of the body text for the generated Markdown text outside of
49
+ # +pre+ bodies and other items which do not wrap well in most Markdown
50
+ # parsers.
51
+ #
52
+ # If +nil+ is provided, the default value of 78 is set.
53
+ def columns=(value)
54
+ bourse.formatter.columns = value || 78
55
+ bourse.formatter.columns
56
+ end
57
+
58
+ ##
59
+ # :attr_reader:
60
+ # Controls how links are placed in the Markdown document.
61
+ def link_placement
62
+ bourse.link_placement
63
+ end
64
+ # :attr_writer: link_placement
65
+ # Controls how links are placed in the Markdown document.
66
+ #
67
+ # In-Line:: Links appear next to their wrapped text, like "[See the
68
+ # example](http://example.org/)". The default for
69
+ # link_placement, set if the value is +nil+, <tt>:inline</tt>,
70
+ # or any other unrecognized value.
71
+ # Paragraph:: Links appear in the body as linked references, like "[See
72
+ # the example][1]", and the reference "[1]:
73
+ # http://example.org" is placed immediately after the
74
+ # paragraph in which the link first appears. Used if the value
75
+ # of link_placement is <tt>:paragraph</tt>.
76
+ # Endnote:: Links appear in the body as linked references, like "[See
77
+ # the example][1]", and the reference "[1]:
78
+ # http://example.org" is placed at the end of the document.
79
+ # Used if the value of link_placement is <tt>:endnote</tt>.
80
+ def link_placement=(value)
81
+ bourse.link_placement = value
82
+ end
83
+
84
+ ##
85
+ # :attr_reader: base_url
86
+ # The base URL for implicit (or local) link references. If not provided,
87
+ # links will remain implicit. This is a String value.
88
+ def base_url
89
+ bourse.base_url
90
+ end
91
+ ##
92
+ # :attr_writer: base_url
93
+ # The base URL for implicit (or local) link references. If not provided,
94
+ # links will remain implicit. This is a String value.
95
+ def base_url=(value)
96
+ bourse.base_url = value
97
+ end
98
+
99
+ ##
100
+ # :attr_reader: skip_local_fragments
101
+ # Controls whether local link references containing fragments will be
102
+ # output in the final document.
103
+ #
104
+ # A local link reference is either an implicit link reference (one missing
105
+ # the protocol and host, such as '&lt;a href="about.html"&gt;' or '&lt;a
106
+ # href="/about.html"&gt;') or one that points to the #base_url.
107
+ #
108
+ # If this value is +true+, links that refer to fragments on local URIs
109
+ # will be ignored (such as '&lt;a href="about.html#address"&gt;').
110
+ def skip_local_fragments
111
+ bourse.skip_local_fragments
112
+ end
113
+
114
+ ##
115
+ # :attr_writer: skip_local_fragments
116
+ # Controls whether local link references containing fragments will be
117
+ # output in the final document.
118
+ #
119
+ # A local link reference is either an implicit link reference (one missing
120
+ # the protocol and host, such as '&lt;a href="about.html"&gt;' or '&lt;a
121
+ # href="/about.html"&gt;') or one that points to the #base_url.
122
+ #
123
+ # If this value is +true+, links that refer to fragments on local URIs
124
+ # will be ignored (such as '&lt;a href="about.html#address"&gt;').
125
+ def skip_local_fragments=(value)
126
+ bourse.skip_local_fragments = value
127
+ end
128
+
129
+ def initialize(options = {})
130
+ @broker = Agio::Broker.new
131
+ @bourse = Agio::Bourse.new(broker, self)
132
+
133
+ self.html = options[:html]
134
+ self.columns = options[:columns]
135
+ self.link_placement = options[:link_placement]
136
+
137
+ yield self if block_given?
138
+
139
+ @parser = Nokogiri::HTML::SAX::Parser.new(broker)
140
+ end
141
+
142
+ def parse(html)
143
+ @parser.parse(html)
144
+ self
145
+ end
146
+ private :parse
147
+
148
+ def transform(html)
149
+ parse(html)
150
+ bourse.transform
151
+ end
152
+ private :transform
153
+
154
+ def to_s(html = nil)
155
+ transform(html || self.html)
156
+ bourse.output.string
157
+ end
158
+ alias to_markdown to_s
159
+
160
+ def self.to_markdown(html)
161
+ self.new.to_markdown(html)
162
+ end
163
+
164
+ attr_reader :broker
165
+ private :broker
166
+
167
+ attr_reader :bourse
168
+ private :bourse
169
+ end
170
+
171
+ # vim: ft=ruby
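The three `link_placement` modes documented in lib/agio.rb can be illustrated with a small standalone sketch. This is not Agio's implementation (that lives in Agio::Bourse, which is not shown here). The method name `render_links`, the segment format, and the numbering scheme are all assumptions made for illustration.

```ruby
# Hypothetical sketch. Each paragraph is an Array of segments; a segment is
# either a plain String or a [text, url] pair. Any +placement+ other than
# :paragraph or :endnote falls back to :inline, as the documentation says.
def render_links(paragraphs, placement = :inline)
  placement = :inline unless [:paragraph, :endnote].include?(placement)
  refs = []     # URLs in order of first appearance (assumed numbering)
  pending = []  # reference lines collected but not yet written
  out = []

  paragraphs.each do |segments|
    pieces = segments.map do |seg|
      next seg if seg.is_a?(String)
      label, url = seg
      next "[#{label}](#{url})" if placement == :inline
      unless refs.include?(url)
        refs << url
        pending << "[#{refs.size}]: #{url}"
      end
      "[#{label}][#{refs.index(url) + 1}]"
    end
    out << pieces.join

    # :paragraph flushes new references right after the paragraph in which
    # each link first appears.
    if placement == :paragraph && !pending.empty?
      out << pending.join("\n")
      pending = []
    end
  end

  # :endnote leaves every reference pending until the end of the document.
  out << pending.join("\n") unless pending.empty?
  out.join("\n\n") + "\n"
end

doc = [
  ['See ', ['the example', 'http://example.org/'], '.'],
  ['Again: ', ['same link', 'http://example.org/'], '.']
]
puts render_links(doc, :endnote)
```

In this sketch a repeated URL reuses its first reference number. Whether Agio does the same is not stated in the source.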
@@ -0,0 +1,132 @@
1
+ # -*- ruby encoding: utf-8 -*-
2
+
3
+ require 'agio/html_element_description'
4
+
5
+ ##
6
+ # A Block is the fundamental collection for the Broker that is used to
7
+ # then generate the Markdown.
8
+ class Agio::Block
9
+ def inspect
10
+ if options.empty?
11
+ %Q(#<#{self.class} #{name} #{contents.inspect}>)
12
+ else
13
+ %Q(#<#{self.class} #{name}(#{options.inspect}) #{contents.inspect}>)
14
+ end
15
+ end
16
+
17
+ # The name of the element the Block is for.
18
+ attr_reader :name
19
+
20
+ # The options, if provided, for the element.
21
+ attr_reader :options
22
+
23
+ # The contents of the Block.
24
+ attr_reader :contents
25
+
26
+ # The description of the HTML element the Block represents (this will
27
+ # always be a Nokogiri::HTML::ElementDescription or +nil+).
28
+ attr_reader :description
29
+
30
+ # Create the Block from a tag start.
31
+ def initialize(name, options = {})
32
+ @name, @options = name, options
33
+ @description = Agio::HTMLElementDescription[name]
34
+ @contents = []
35
+ end
36
+
37
+ # Append the contents provided to the Block.
38
+ def append(*contents)
39
+ @contents.push(*contents)
40
+ end
41
+
42
+ # Returns +true+ if the Block is a standard HTML element (as understood
43
+ # by Nokogiri).
44
+ def standard?
45
+ !!description
46
+ end
47
+
48
+ ##
49
+ # Returns +true+ if this Block is an HTML inline element.
50
+ def inline?
51
+ description && description.inline?
52
+ end
53
+
54
+ ##
55
+ # Returns +true+ if this Block is an HTML block element.
56
+ def block?
57
+ description && description.block?
58
+ end
59
+
60
+ ##
61
+ # Returns +true+ if this Block can contain the other Block provided.
62
+ def can_contain?(other)
63
+ case other
64
+ when Agio::Block
65
+ description && description.sub_elements.include?(other.name)
66
+ when String
67
+ description && description.sub_elements.include?(other)
68
+ else
69
+ false
70
+ end
71
+ end
72
+
73
+ ##
74
+ # Returns +true+ if the Block is a 'li' (list item) element.
75
+ def li?
76
+ name == "li"
77
+ end
78
+
79
+ ##
80
+ # Returns +true+ if the Block is a 'pre' element.
81
+ def pre?
82
+ name == "pre"
83
+ end
84
+
85
+ ##
86
+ # Returns +true+ if the Block is a 'ul' or 'ol' element.
87
+ def ul_ol?
88
+ name == "ul" or name == "ol"
89
+ end
90
+
91
+ ##
92
+ # Returns +true+ if the Block is a definition item ('dt' or 'dd').
93
+ def definition?
94
+ name == "dt" or name == "dd"
95
+ end
96
+
97
+ ##
98
+ # Returns +true+ if the Block is a 'dl' element or a definition item ('dt' or 'dd').
99
+ def dl?
100
+ definition? or name == "dl"
101
+ end
102
+
103
+ ##
104
+ # Determine whether the +other+ Block is a sibling of this Block.
105
+ # Blocks are siblings if rule 1 holds and either rule 2 or rule 3 holds:
106
+ #
107
+ # 1. This Block cannot contain the other Block.
108
+ # 2. This Block is a definition ('dt' or 'dd') and so is the other
109
+ # Block.
110
+ # 3. This Block's name and the other Block's name are the same.
111
+ def sibling_of?(other)
112
+ if can_contain? other
113
+ false
114
+ elsif definition? and other.definition?
115
+ true
116
+ elsif name == other.name
117
+ true
118
+ else
119
+ false
120
+ end
121
+ end
122
+
123
+ ##
124
+ # Used mostly for testing.
125
+ def ==(other)
126
+ name == other.name &&
127
+ options == other.options &&
128
+ contents == other.contents
129
+ end
130
+ end
131
+
132
+ # vim: ft=ruby