RubyGems - semantictext - Versions diffs - 0.1.0 - Mend

semantictext 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31)

data/CHANGELOG +0 -0
data/COPYING +32 -0
data/README.rdoc +35 -0
data/TODO.rdoc +23 -0
data/lib/semantictext/bullet.rb +22 -0
data/lib/semantictext/bulleted_list_parser.rb +21 -0
data/lib/semantictext/bulletedlist.rb +36 -0
data/lib/semantictext/date_extractor.rb +27 -0
data/lib/semantictext/default_tag_factory.rb +9 -0
data/lib/semantictext/extraction_failed.rb +4 -0
data/lib/semantictext/heading.rb +15 -0
data/lib/semantictext/keyword_extractor.rb +9 -0
data/lib/semantictext/link.rb +9 -0
data/lib/semantictext/not_header_line.rb +4 -0
data/lib/semantictext/paragraph.rb +24 -0
data/lib/semantictext/parser.rb +124 -0
data/lib/semantictext/rich_text_parser.rb +60 -0
data/lib/semantictext/span.rb +17 -0
data/lib/semantictext/tag.rb +16 -0
data/lib/semantictext.rb +1 -0
data/lib/string.rb +5 -0
data/test/bullet_test.rb +24 -0
data/test/bulleted_list_parser_test.rb +61 -0
data/test/dateextractor_test.rb +19 -0
data/test/export_test.rb +50 -0
data/test/keywordextractor_test.rb +13 -0
data/test/parser_test.rb +292 -0
data/testfiles/complex.art +28 -0
data/testfiles/regression-exportsample.txt +15 -0
data/testfiles/simple.art +10 -0
metadata +92 -0

data/CHANGELOG ADDED

File without changes

data/COPYING ADDED

@@ -0,0 +1,32 @@
+Semantic Text Licence
+COPYRIGHT AND PERMISSION NOTICE
+Copyright (c) 2009  Green Bar Software Limited, UK
+All rights reserved.
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, and/or sell copies of the Software, and to permit persons
+to whom the Software is furnished to do so, provided that the above
+copyright notice(s) and this permission notice appear in all copies of
+the Software and that both the above copyright notice(s) and this
+permission notice appear in supporting documentation.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT
+OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
+HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL
+INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING
+FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
+NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
+WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+Except as contained in this notice, the name of a copyright holder
+shall not be used in advertising or otherwise to promote the sale, use
+or other dealings in this Software without prior written authorization
+of the copyright holder.

data/README.rdoc ADDED

@@ -0,0 +1,35 @@
+= Semantic Text
+Semantic Text is a Domain-Specific text markup parser.
+It takes a file or sequence of lines and returns an object model of the document,
+including document metadata (e.g. doc creation time and title) and a tree of
+interconnected objects describing the document structure.
+== RDOC API
+The rdoc can be found at http://www.greenbarsoft.co.uk/software/semantictext/rdoc/
+= How to use it
+* Parse with SemanticText::Parser.parse_from(file)
+* Generate HTML with SemanticText::Parser.export_html
+= Semantic Markup
+Semantic text supports:
+* document metadata
+* section headers
+* nested bullet points
+* paragraphs that contain markup tags
+* inline hyperlinks for http: mailto: and ftp:
+* markup tags within bullet points
+We intend to support these features in future:
+* custom markup tags e.g. postal code, youtube video embed, ...
+== Compatibility
+This project is being developed on OS X. Automated testing for Linux will be included in future releases.
+== Licence
+This is open source software and comes with no warranty. See COPYING for details.
+http://www.greenbarsoft.co.uk
+Copyright 2009 Green Bar Software Limited, UK

data/TODO.rdoc ADDED

@@ -0,0 +1,23 @@
+==to do
+* support custom structure tags
+* improve testing by mocking out tag factory used in tests - consider how/whether to do this
+* support urls as a special structure tag
+* support wikinames as a special custom tag
+* publish rdoc
+* build gem
+* test gem
+* publish gem
+==maybe
+* think about how to support twitter with special structure tags e.g. #keyword and @user
+* refactor parser into header parser and text parser
+* pull out parsers for different parts and use the state pattern
+==done
+* handle proper HTML escaping
+* remove html and head elements from html output (it's here to be embedded in webpages)
+* removed surrounding square brackets from text in Tag objects
+* bullet lines should be parsed as paragraphs to support tags and inline hyperlinks
+* supported paragraphs that start with links
+* remove absolute paths from tests - use ENV['SANDBOX']

data/lib/semantictext/bullet.rb ADDED

@@ -0,0 +1,22 @@
+require 'semantictext/parser'
+module SemanticText
+  class Bullet < Paragraph
+    attr_reader :depth
+    def initialize(text, depth, rich_text_parser)
+      super()
+      @depth = depth
+      rich_text_parser.parse(text, self)
+    end
+    #export as html
+    def export_html
+      result =  "<li>"
+      content.each {|element| result+=element.export_html }
+      result += "</li>"
+    end
+  end
+end

data/lib/semantictext/bulleted_list_parser.rb ADDED

@@ -0,0 +1,21 @@
+require 'semantictext/bulletedlist'
+require 'semantictext/bullet'
+module SemanticText
+  class BulletedListParser
+    attr_reader :bulleted_list
+    def initialize(rich_text_parser)
+      @rich_text_parser = rich_text_parser
+      @bulleted_list = BulletedList.new(1)
+    end
+    def parse_line(bulleted_line)
+      match = bulleted_line.match(/^(\*+)\s+(.*$)/)
+      depth = match[1].size
+      @bulleted_list << Bullet.new(match[2], depth, @rich_text_parser)
+    end
+  end
+end
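
Review note: `parse_line` above calls `match[1]` without checking the match, so a line such as `*text` (asterisk with no following whitespace) would leave `match` as `nil`. A minimal sketch of a guarded split, assuming a hypothetical standalone helper `split_bullet` that is not part of the gem:

```ruby
# Hypothetical helper (not in semantictext 0.1.0): split a bullet line
# into [depth, text], or return nil when the line is not a bullet.
BULLET_PATTERN = /^(\*+)\s+(.*)$/

def split_bullet(line)
  match = BULLET_PATTERN.match(line)
  return nil if match.nil?     # e.g. "*text" has no whitespace after the asterisk
  [match[1].size, match[2]]    # depth is the number of leading asterisks
end

p split_bullet('*   foogoo')   # [1, "foogoo"]
p split_bullet('** nested')    # [2, "nested"]
p split_bullet('*nospace')     # nil
```

Returning `nil` is only one option; a caller could equally fall back to paragraph parsing for such lines.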

data/lib/semantictext/bulletedlist.rb ADDED

@@ -0,0 +1,36 @@
+module SemanticText
+  class BulletedList
+  	attr_reader :content
+    attr_reader :depth
+  	def initialize(depth)
+  		@content = []
+  		@depth = depth
+  	end
+  	def <<(bullet)
+  	  if bullet.depth>@depth
+        if @content.last.class != BulletedList
+    	    @content << BulletedList.new(depth+1)
+        end
+  	    @content.last << bullet
+      else
+  	    @content << bullet
+  	  end
+    end
+    def size
+      @content.size
+    end
+    #export as html
+    def export_html
+      tabs = "\t"*depth
+      out = "\n#{tabs}<ul>"
+      content.each {|element| out=out+element.export_html}
+      out = out + "\n#{tabs}</ul>"
+    end
+  end
+end
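
Review note: the `<<` method above nests a deeper bullet by opening (or reusing) a child list at the end of `@content`. The same idea can be sketched with plain arrays. This is an illustration only: `nest_bullets` is a hypothetical name, it works on a whole batch of `[depth, text]` pairs rather than incrementally, and it uses nested arrays in place of `BulletedList` objects.

```ruby
# Hypothetical batch version of the nesting rule in BulletedList#<<.
# items is an array of [depth, text] pairs; the result uses nested arrays
# where the gem would use nested BulletedList objects.
def nest_bullets(items, depth = 1)
  result = []
  items.each do |item_depth, text|
    if item_depth > depth
      # Deeper than this level: collect into a child group at the tail.
      result << [] unless result.last.is_a?(Array)
      result.last << [item_depth, text]
    else
      result << text
    end
  end
  # Recurse into each collected child group, one level deeper.
  result.map { |entry| entry.is_a?(Array) ? nest_bullets(entry, depth + 1) : entry }
end

p nest_bullets([[1, 'a'], [2, 'b'], [2, 'c'], [1, 'd']])
# ["a", ["b", "c"], "d"]
```

The shape matches what `test_bulleted_list_nesting` checks: three top-level entries, the second being a nested list of two.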

data/lib/semantictext/date_extractor.rb ADDED

@@ -0,0 +1,27 @@
+module SemanticText
+  class DateExtractor
+  	MONTHS = { 'January' => 1,
+  			'February' => 2,
+  			'March' => 3,
+  			'April' =>4,
+  			'May' =>5,
+  			'June' =>6,
+  			'July' =>7,
+  			'August' =>8,
+  			'September' =>9,
+  			'October' =>10,
+  			'November' =>11,
+  			'December' =>12
+  		}
+  	def extract_from(string)
+  		fields = string.split ' '
+  		day = fields[0]
+  		month = MONTHS[fields[1]]
+  		throw ExtractionFailed.new if month.nil?
+  		year = fields[2]
+  		Time.local(year, month, day)
+  	end
+  end
+end
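
Review note: `extract_from` above signals failure with `throw`, which belongs to Ruby's `catch`/`throw` control flow rather than to exception handling, and the bundled test pins that behaviour with `assert_throws`. A minimal sketch of the conventional alternative, assuming a standalone function `extract_date` and an error class derived from `StandardError` (both are illustrative and differ from the released code):

```ruby
# Illustrative only: names and structure are assumptions, not the gem's API.
class ExtractionFailed < StandardError; end

MONTH_NAMES = %w[January February March April May June July
                 August September October November December].freeze

def extract_date(string)
  day, month_name, year = string.split(' ')
  index = MONTH_NAMES.index(month_name)
  # raise (not throw) so callers can use begin/rescue
  raise ExtractionFailed, "unknown month: #{month_name.inspect}" if index.nil?
  Time.local(year.to_i, index + 1, day.to_i)
end

date = extract_date('5 November 2005')
puts [date.day, date.month, date.year].inspect   # [5, 11, 2005]
```

Deriving from `StandardError` means a bare `rescue` clause catches the error, which is not the case for a direct subclass of `Exception`.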

data/lib/semantictext/default_tag_factory.rb ADDED

@@ -0,0 +1,9 @@
+module SemanticText
+  # I create SemanticText::Tag objects in response to create_tag(name,value) calls
+  # from a SemanticText::Parser
+  class DefaultTagFactory
+    def create_tag(name, value)
+      Tag.new(name,value)
+    end
+  end
+end

data/lib/semantictext/extraction_failed.rb ADDED

@@ -0,0 +1,4 @@
+module SemanticText
+  class ExtractionFailed < Exception
+  end
+end

data/lib/semantictext/heading.rb ADDED

@@ -0,0 +1,15 @@
+module SemanticText
+  class Heading
+  	attr_reader :text
+  	def initialize(aTitle)
+  		@text = aTitle
+  	end
+  	#export as html
+    def export_html
+      "\n<h1>#{ CGI.escapeHTML(@text)}</h1>"
+    end
+  end
+end

data/lib/semantictext/keyword_extractor.rb ADDED

@@ -0,0 +1,9 @@
+module SemanticText
+  class KeywordExtractor
+  	def extract_from(string)
+  		result = []
+  		string.split(',').each  { |keyword| result << keyword.strip }
+  		return result
+  	end
+  end
+end
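
Review note: the accumulate-and-return loop above can be written with `map`. A small sketch with the same behaviour, written as a hypothetical top-level function `extract_keywords` for brevity:

```ruby
# Same behaviour as KeywordExtractor#extract_from, expressed with map.
# extract_keywords is an illustrative name, not part of the gem.
def extract_keywords(string)
  string.split(',').map(&:strip)
end

p extract_keywords(' a , b , c    ,     d ')   # ["a", "b", "c", "d"]
```

As in the original, an input such as `a,,b` still yields an empty keyword between the two commas.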

data/lib/semantictext/link.rb ADDED

@@ -0,0 +1,9 @@
+require 'cgi'
+module SemanticText
+  class Link < Span
+    # export as html
+    def export_html
+      "<a href=\"#{text}\">#{CGI.escapeHTML(text)}</a>"
+    end
+  end
+end
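
Review note: `export_html` above escapes the link text but interpolates the raw URL into the `href` attribute, so an `&` or `"` in the URL reaches the markup unescaped. A minimal sketch that escapes both places, assuming a hypothetical helper `link_html`. Note that this deliberately differs from the released behaviour, which `test_escaping_link` pins with an unescaped `&` in `href`.

```ruby
require 'cgi'

# Hypothetical helper: escape the URL once and use it for both the
# attribute value and the visible text.
def link_html(url)
  escaped = CGI.escapeHTML(url)
  "<a href=\"#{escaped}\">#{escaped}</a>"
end

puts link_html('http://www.example.com/app?name1=value1&name2=value2')
# <a href="http://www.example.com/app?name1=value1&amp;name2=value2">http://www.example.com/app?name1=value1&amp;name2=value2</a>
```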

data/lib/semantictext/not_header_line.rb ADDED

@@ -0,0 +1,4 @@
+module SemanticText
+  class NotHeaderLine < Exception
+  end
+end

data/lib/semantictext/paragraph.rb ADDED

@@ -0,0 +1,24 @@
+require 'semantictext/span'
+module SemanticText
+  class Paragraph
+  	attr_reader :content
+  	def initialize()
+  		@content = []
+  	end
+	  #export as html
+    def export_html
+      out = "\n<p>"
+      content.each {|element| out=out+element.export_html}
+      out = out + "</p>"
+    end
+  	def <<(span)
+  	  @content << span
+    end
+  end
+end

data/lib/semantictext/parser.rb ADDED

@@ -0,0 +1,124 @@
+require 'semantictext/extraction_failed'
+require 'semantictext/heading'
+require 'semantictext/keyword_extractor'
+require 'semantictext/not_header_line'
+require 'semantictext/paragraph'
+require 'semantictext/span'
+require 'semantictext/link'
+require 'semantictext/tag'
+require 'string'
+require 'semantictext/bulletedlist'
+require 'semantictext/bullet'
+require 'semantictext/bulleted_list_parser'
+require 'semantictext/rich_text_parser'
+module SemanticText
+  class Parser
+    # title of the document
+  	attr_reader :title
+  	# date the document was created
+  	attr_reader :createdAt
+  	# keyword list for the current document
+  	attr_reader :keywords
+  	# pathname of the file currently being parsed (if it exists, nil otherwise)
+  	attr_reader :pathname
+  	# the object model of the parsed document
+  	attr_reader :content
+  	def initialize
+  	  @pathname=nil
+  		@headers_completed = false
+  		@content = []
+  		@current_paragraph = nil
+      @bulleted_list_parser = nil
+      @rich_text_parser = RichTextParser.new(DefaultTagFactory.new)
+  	end
+    # export as html
+    def export_html
+      out = ""
+      content.each {|element| out=out+element.export_html}
+      out = out + "\n"
+    end
+    # true iff I have seen the end of the headers section at the top of the document
+    def parameters_complete?
+      @headers_completed
+    end
+    # parse a document into this object from pathname specified by file
+  	def parse_from(file)
+  	  @pathname=file
+  		f = File.new(file)
+  		f.each_line do |line|
+  		  parse(line)
+  		end
+  		f.close
+  	end
+    # parse an individual <i>line</i> of String appending content
+    # into the current document held by this object
+  	def parse(line)
+  	  line.chomp!
+      begin
+        if (!@headers_completed)
+        	process_header_line(line)
+        else
+        	parse_line(line)
+        end
+      rescue  NotHeaderLine
+        @headers_completed = true
+        parse_line(line)
+  		end
+  	end
+  	private
+  	def process_header_line(headerLine)
+  		splitLine = headerLine.split(':',2)
+  		(attributeName, value) = splitLine
+  		raise NotHeaderLine.new() if splitLine.size <2
+  		attributeName.strip!
+  		@title = value if attributeName=='title'
+  		@createdAt = DateExtractor.new.extract_from(value) if attributeName=='createdAt'
+  		@keywords = KeywordExtractor.new.extract_from(value) if attributeName=='keywords'
+  	end
+    def parse_paragraph_line(line)
+  	  if @current_paragraph.nil?
+        @current_paragraph = Paragraph.new
+        @content << @current_paragraph
+      end
+      @rich_text_parser.parse(line, @current_paragraph)
+    end
+  	def parse_line(line)
+  	  @bulleted_list_parser = nil if !line.begins_with '*'
+  	  if (line =='')
+        @current_paragraph = nil
+        @bulleted_list = nil
+  	    return
+      end
+  		if (line.begins_with('!'))
+  			@content << Heading.new(line[1,line.size-1])
+  		else
+  		  if (line.begins_with('*'))
+  		    if @bulleted_list_parser.nil?
+  		      @bulleted_list_parser = BulletedListParser.new(@rich_text_parser)
+  		      @content << @bulleted_list_parser.bulleted_list
+  	      end
+  	      @bulleted_list_parser.parse_line(line)
+  	    else
+  		    parse_paragraph_line(line)
+  	    end
+  		end
+  	end
+  end
+end
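
Review note: once the header section is finished, `parse_line` above dispatches on the first character of each line: empty lines end the current paragraph, `!` starts a heading, `*` starts or continues a bulleted list, and anything else is paragraph text. A compact sketch of that dispatch, assuming a hypothetical function `classify_line` and using the built-in `String#start_with?` in place of the gem's `begins_with` extension:

```ruby
# Illustrative classifier mirroring the branches of Parser#parse_line.
# classify_line and the returned symbols are assumptions, not gem API.
def classify_line(line)
  return :blank   if line.empty?
  return :heading if line.start_with?('!')
  return :bullet  if line.start_with?('*')
  :paragraph
end

['', '!heading', '*   bullet point', 'This is a paragraph'].each do |line|
  puts "#{line.inspect} -> #{classify_line(line)}"
end
```

Using `start_with?` also avoids reopening `String`, at the cost of requiring a Ruby version that provides it.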

data/lib/semantictext/rich_text_parser.rb ADDED

@@ -0,0 +1,60 @@
+require 'semantictext/span'
+require 'semantictext/link'
+require 'semantictext/tag'
+require 'string'
+require 'semantictext/default_tag_factory'
+module SemanticText
+  # I parse chunks of text into a sequence of spans, tags and urls.
+  class RichTextParser
+    # I need a tag_factory on which I can call create_tag(name, content_text) to create my tags.
+    def initialize(tag_factory)
+      @tag_factory = tag_factory
+    end
+  private
+    def parse_text_for_urls(text, enclosing_element)
+  	    link_next = false
+  	    ignore_next_section = false
+  	    sections = text.split /((http|ftp|mailto):[^ ]*)/
+  	    if sections[0]==''
+  	      ignore_next_section = true
+  	      link_next = true
+  	    end
+  	    sections.each do |section|
+  	        if (ignore_next_section)
+  	          ignore_next_section = false
+  	        else
+  	          if (link_next)
+  	            enclosing_element << Link.new(section)
+  	            ignore_next_section = true
+  	          else
+  	            enclosing_element << Span.new(section)
+  	          end
+  	          link_next = !(link_next)
+  	        end
+  	    end
+  	end
+    public
+    def parse(line, enclosing_element)
+      sections = line.split /(\[[^:]+:[^\]]+\])/
+      tag_next = false
+      sections.each do |section|
+        if (tag_next)
+          section  =~ /\[([^:]+):([^\]]*)\]/
+          tag_name = $1
+          tag_value = $2
+          enclosing_element << @tag_factory.create_tag(tag_name, tag_value)
+        else
+          parse_text_for_urls(section, enclosing_element)
+        end
+        tag_next = !(tag_next)
+      end
+    end
+  end
+end
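
Review note: `parse_text_for_urls` above splits on a pattern with two capture groups, so `String#split` returns the scheme (`http`, `ftp`, `mailto`) as an extra element after each URL; the `ignore_next_section` flag exists to skip it. A sketch of the same split with a non-capturing inner group, which leaves a strict text/URL alternation. `split_links` and the `:span`/`:link` symbols are illustrative stand-ins for the gem's `Span` and `Link` objects.

```ruby
# Only the outer group captures, so split yields text, url, text, url, ...
URL_PATTERN = /((?:http|ftp|mailto):[^ ]*)/

# Hypothetical helper: return [kind, text] pairs for one chunk of text.
def split_links(text)
  pairs = text.split(URL_PATTERN).each_with_index.map do |section, index|
    [index.odd? ? :link : :span, section]   # odd positions hold the captured URLs
  end
  pairs.reject { |_, section| section.empty? }   # drop the empty lead when text starts with a URL
end

p split_links('beginning http://www.example.com moretext')
# [[:span, "beginning "], [:link, "http://www.example.com"], [:span, " moretext"]]
```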

data/lib/semantictext/span.rb ADDED

@@ -0,0 +1,17 @@
+require 'cgi'
+module SemanticText
+  class Span
+  	attr_reader :text
+  	def initialize(content)
+  		@text = content
+  	end
+	  #export as html
+    def export_html
+      ' '+ CGI.escapeHTML(text)
+    end
+  end
+end

data/lib/semantictext/tag.rb ADDED

@@ -0,0 +1,16 @@
+module SemanticText
+  class Tag < Span
+    attr_reader :key
+    def initialize(key, value)
+  		@text = value
+  		@key = key
+  	end
+    #export as html
+    def export_html
+      "[#{@key}:#{@text}]"
+    end
+  end
+end

data/lib/semantictext.rb ADDED

@@ -0,0 +1 @@
+require 'semantictext/parser'

data/lib/string.rb ADDED

@@ -0,0 +1,5 @@
+class String
+  def begins_with(substring)
+    self.index(substring)==0
+  end
+end

data/test/bullet_test.rb ADDED

@@ -0,0 +1,24 @@
+require 'test/unit'
+require 'semantictext/default_tag_factory'
+require 'semantictext/rich_text_parser'
+require 'semantictext/bullet'
+class BulletTest < Test::Unit::TestCase
+	def assert_element(element_class, text, actual)
+		assert_equal element_class, actual.class
+		assert_equal text, actual.text
+	end
+	def test_bullet_line_with_links
+	  test_string = "beginning http://www.example.com moretext http://www.dafydd.net ending text"
+	  unit = SemanticText::Bullet.new(test_string, 1, SemanticText::RichTextParser.new(SemanticText::DefaultTagFactory.new))
+	  assert_equal(1, unit.depth)
+	  assert_element SemanticText::Span, "beginning ", unit.content[0]
+	  assert_element SemanticText::Link, "http://www.example.com", unit.content[1]
+	  assert_element SemanticText::Span, " moretext ", unit.content[2]
+	  assert_element SemanticText::Link, "http://www.dafydd.net", unit.content[3]
+	  assert_element SemanticText::Span, " ending text", unit.content[4]
+	end
+end

data/test/bulleted_list_parser_test.rb ADDED

@@ -0,0 +1,61 @@
+require 'test/unit'
+require 'semantictext/default_tag_factory'
+require 'semantictext/rich_text_parser'
+require 'semantictext/bulleted_list_parser'
+class BulletedListParserTest < Test::Unit::TestCase
+  def assert_element(element_class, text, actual)
+  	  assert_equal element_class, actual.class
+  	  assert_equal text, actual.text
+  end
+  def test_creating_simple_bulleted_list
+    unit = SemanticText::BulletedListParser.new(SemanticText::RichTextParser.new(SemanticText::DefaultTagFactory.new))
+    unit.parse_line('*   foogoo')
+    unit.parse_line('* second')
+    bulleted_list = unit.bulleted_list
+    assert_equal SemanticText::Bullet, unit.bulleted_list.content[0].class
+    assert_equal SemanticText::Bullet, unit.bulleted_list.content[1].class
+    span_line1 = unit.bulleted_list.content[0].content[0]
+    span_line2 = unit.bulleted_list.content[1].content[0]
+    assert_equal SemanticText::Span, span_line1.class, '1st elt of 1st bullet point should be a span'
+    assert_equal SemanticText::Span, span_line2.class, '1st elt of 2nd bullet point should be a span'
+    assert_equal 1, unit.bulleted_list.content[0].content.size, 'should only be 1 elt in 1st bullet point'
+    assert_equal 1, unit.bulleted_list.content[1].content.size, 'should only be 1 elt in 2nd bullet point'
+    assert_equal "foogoo", span_line1.text
+    assert_equal "second", span_line2.text
+  end
+  def test_nested_bulleting
+    unit = SemanticText::BulletedListParser.new(SemanticText::RichTextParser.new(SemanticText::DefaultTagFactory.new))
+    unit.parse_line('* top-level')
+    unit.parse_line('** nested')
+    assert_equal SemanticText::BulletedList, unit.bulleted_list.class
+    assert_equal 2, unit.bulleted_list.content.size
+    bullet1 =  unit.bulleted_list.content[0]
+    assert_equal SemanticText::Bullet, bullet1.class
+    assert_equal SemanticText::Span, bullet1.content[0].class
+    assert_equal 'top-level', bullet1.content[0].text
+    assert_equal 1, bullet1.depth
+    nested_bulleted_list = unit.bulleted_list.content[1]
+    assert_equal SemanticText::BulletedList, nested_bulleted_list.class
+    assert_equal 2, nested_bulleted_list.depth
+    assert_equal 1, nested_bulleted_list.content.size
+    nested_bullet_point = nested_bulleted_list.content[0]
+    assert_equal SemanticText::Bullet, nested_bullet_point.class
+    assert_equal 2, nested_bullet_point.depth
+    span = nested_bullet_point.content[0]
+    assert_equal SemanticText::Span, span.class
+    assert_equal 'nested', span.text
+  end
+end

data/test/dateextractor_test.rb ADDED

@@ -0,0 +1,19 @@
+require 'test/unit'
+require 'semantictext/date_extractor'
+class TestDateExtractor< Test::Unit::TestCase
+	def testExtractDateFromHappyString
+		unit = SemanticText::DateExtractor.new
+		result = unit.extract_from('5 November 2005')
+		assert_equal 5, result.day
+		assert_equal 11, result.month
+		assert_equal 2005, result.year
+		assert_equal Time, result.class
+	end
+	def testExtractRejectsInvalidMonth
+		unit = SemanticText::DateExtractor.new
+		assert_throws(:"SemanticText::ExtractionFailed") { unit.extract_from('5 x 2005')}
+	end
+end

data/test/export_test.rb ADDED

@@ -0,0 +1,50 @@
+require 'test/unit'
+require 'semantictext/parser'
+class TestExport < Test::Unit::TestCase
+	def test_end_to_end_loading
+		unit = SemanticText::Parser.new
+		unit.parse_from(ENV['SANDBOX']+'/semantictext/testfiles/complex.art')
+    actual = unit.export_html
+    actual = actual.split /\n/
+    expected_file=File.new(ENV['SANDBOX']+'/semantictext/testfiles/regression-exportsample.txt')
+    expected = expected_file.readlines
+    (0..(expected.size-1)).each {|index| assert_equal expected[index],actual[index]+"\n"}
+  end
+  def test_escaping_paragraphs
+    unit = SemanticText::Parser.new
+    unit.parse ''
+    unit.parse 'escaping test < > &'
+    assert_equal "\n<p> escaping test &lt; &gt; &amp;</p>\n", unit.export_html
+  end
+  def test_escaping_headings
+    unit = SemanticText::Parser.new
+    unit.parse ''
+    unit.parse '!heading < > &'
+    assert_equal "\n<h1>heading &lt; &gt; &amp;</h1>\n", unit.export_html
+  end
+  def test_escaping_bullet_points
+    unit = SemanticText::Parser.new
+    unit.parse ''
+    unit.parse '* < > &'
+    assert_equal "\n\t<ul><li> &lt; &gt; &amp;</li>\n\t</ul>\n", unit.export_html
+  end
+  def test_escaping_link
+    unit = SemanticText::Parser.new
+    unit.parse ''
+    unit.parse 'http://www.example.com/app?name1=value1&name2=value2'
+    assert_equal "\n<p><a href=\"http://www.example.com/app?name1=value1&name2=value2\">http://www.example.com/app?name1=value1&amp;name2=value2</a></p>\n", unit.export_html
+  end
+end

data/test/keywordextractor_test.rb ADDED

@@ -0,0 +1,13 @@
+require 'test/unit'
+require 'semantictext/keyword_extractor'
+class TestKeywordExtractor< Test::Unit::TestCase
+	def test_extract_keywords_happy_case
+		unit = SemanticText::KeywordExtractor.new
+		assert_equal ['a','b','c','d'], unit.extract_from(' a , b , c , d ')
+		assert_equal ['a','b','c','d'], unit.extract_from(' a , b , c    ,     d ')
+		assert_equal ['a','b','c','d'], unit.extract_from('a,b,c,d')
+	end
+end

data/test/parser_test.rb ADDED

@@ -0,0 +1,292 @@
+require 'test/unit'
+require 'semantictext/parser'
+class TestParser < Test::Unit::TestCase
+	def assert_element(element_class, text, actual)
+	  assert_equal element_class, actual.class
+	  assert_equal text, actual.text
+  end
+	def test_end_to_end_loading
+		unit = SemanticText::Parser.new
+		unit.parse_from(ENV['SANDBOX']+'/semantictext/testfiles/simple.art')
+		assert_equal 'test title', unit.title
+		assert_equal 5, unit.createdAt.day
+		assert_equal 11, unit.createdAt.month
+		assert_equal 2005, unit.createdAt.year
+		assert_equal Time, unit.createdAt.class
+		assert_equal ENV['SANDBOX']+'/semantictext/testfiles/simple.art', unit.pathname
+    resultant_heading = unit.content[0]
+    resultant_par0 = unit.content[1]
+    resultant_span0_0 = resultant_par0.content[0]
+    resultant_span0_1 = resultant_par0.content[1]
+    resultant_par1 = unit.content[2]
+    resultant_span1_0 = resultant_par1.content[0]
+    assert_element SemanticText::Heading, "First Big Heading", resultant_heading
+    assert_equal SemanticText::Paragraph, resultant_par0.class
+    assert_element SemanticText::Span, 'This is another', resultant_span0_0
+    assert_element SemanticText::Span, 'paragraph.', resultant_span0_1
+    assert_equal SemanticText::Paragraph, resultant_par1.class
+    assert_element SemanticText::Span, 'Theis is the second paragraph.', resultant_span1_0
+  end
+  	def test_headerless_document_parse
+  		unit = SemanticText::Parser.new
+  		test_lines = <<EOF
+!First Big Heading
+This is some text.
+EOF
+      test_lines.each {|line| unit.parse(line)}
+      assert_element SemanticText::Heading, "First Big Heading", unit.content[0]
+      assert_equal SemanticText::Paragraph, unit.content[1].class
+      assert_element SemanticText::Span, "This is some text.", unit.content[1].content[0]
+      assert_nil unit.title
+      assert_nil unit.createdAt
+      assert_nil unit.keywords
+  	end
+    def test_parse_paragraph_beginning_with_url
+        		unit = SemanticText::Parser.new
+        		test_lines = <<EOF
+http://www.dafydd.net/foogoo?blah see? http://www.example.com
+I wonder if it worked!
+EOF
+            test_lines.each {|line| unit.parse(line)}
+            result = unit.content[0]
+            assert_equal SemanticText::Paragraph, result.class
+            assert_element SemanticText::Link, "http://www.dafydd.net/foogoo?blah", result.content[0]
+            assert_element SemanticText::Span, " see? ", result.content[1]
+            assert_element SemanticText::Link, "http://www.example.com", result.content[2]
+            assert_element SemanticText::Span, "I wonder if it worked!", result.content[3]
+            assert_equal 4, result.content.size
+    end
+  	def test_headerless_document_parse_with_url
+  		unit = SemanticText::Parser.new
+  		test_lines = <<EOF
+Embedded link http://www.dafydd.net/foogoo?blah see?
+I wonder if it worked!
+a mailto:foogoo b ftp://asdfasdfasdf c
+EOF
+      test_lines.each {|line| unit.parse(line)}
+      result = unit.content[0]
+      assert_equal SemanticText::Paragraph, result.class
+      assert_element SemanticText::Span, "Embedded link ", result.content[0]
+      assert_element SemanticText::Link, "http://www.dafydd.net/foogoo?blah", result.content[1]
+      assert_element SemanticText::Span, " see?", result.content[2]
+      assert_element SemanticText::Span, "I wonder if it worked!", result.content[3]
+      assert_element SemanticText::Span, 'a ', result.content[4]
+      assert_element SemanticText::Link, 'mailto:foogoo', result.content[5]
+      assert_element SemanticText::Span, ' b ', result.content[6]
+      assert_element SemanticText::Link, 'ftp://asdfasdfasdf', result.content[7]
+      assert_element SemanticText::Span, ' c', result.content[8]
+      assert_equal 9, result.content.size
+  	end
+	def test_paragraph_parsing
+		unit = SemanticText::Parser.new
+		test_lines = <<EOF
+title:test title
+createdAt:5 November 2005
+keywords: buzz, fuzz, muzz
+!First Big Heading
+This is another
+paragraph.
+Theis is the second paragraph.
+EOF
+    test_lines.each {|line| unit.parse(line)}
+    assert_element SemanticText::Heading, "First Big Heading", unit.content[0]
+    assert_equal SemanticText::Paragraph, unit.content[1].class
+    assert_element SemanticText::Span, "This is another", unit.content[1].content[0]
+    assert_element SemanticText::Span, "paragraph.", unit.content[1].content[1]
+    assert_equal SemanticText::Paragraph, unit.content[2].class
+    assert_element SemanticText::Span, "Theis is the second paragraph.", unit.content[2].content[0]
+    assert_equal 3, unit.content.size
+	end
+	def test_parsing_of_parameters
+		unit = SemanticText::Parser.new
+		unit.parse('title:test title')
+		unit.parse('createdAt:5 November 2005')
+		unit.parse('keywords: buzz, fuzz, muzz')
+		unit.parse('')
+		assert_equal 'test title', unit.title
+		assert_equal 5, unit.createdAt.day
+		assert_equal 11, unit.createdAt.month
+		assert_equal 2005, unit.createdAt.year
+		assert_equal Time, unit.createdAt.class
+		assert unit.keywords == ['buzz', 'fuzz', 'muzz']
+		assert_equal 3, unit.keywords.size
+		assert_nil   unit.pathname
+		assert	     unit.parameters_complete?
+	end
+	def test_presendence_of_url_lower_than_tag
+    		unit = SemanticText::Parser.new
+    		unit.parse('')
+    		unit.parse('Embedded tag [http://www.dafydd.net/foogoo?blah name:here] see?')
+        result = unit.content[0]
+        assert_equal SemanticText::Paragraph, result.class
+        assert_element SemanticText::Span, "Embedded tag ", result.content[0]
+        assert_element SemanticText::Tag, "//www.dafydd.net/foogoo?blah name:here", result.content[1]
+        assert_element SemanticText::Span, " see?", result.content[2]
+        assert_equal 3, result.content.size
+  end
+	def test_headerless_document_parse_with_tags
+    		unit = SemanticText::Parser.new
+    		test_lines = <<EOF
+Embedded tag [rfc:822] see?
+I wonder if it worked!
+a [tags:red balloon] b [c2:RecentChanges] c
+EOF
+        test_lines.each {|line| unit.parse(line)}
+        result = unit.content[0]
+        assert_equal SemanticText::Paragraph, result.class
+        assert_element SemanticText::Span, "Embedded tag ", result.content[0]
+        assert_element SemanticText::Tag, "822", result.content[1]
+        assert_element SemanticText::Span, " see?", result.content[2]
+        assert_element SemanticText::Span, "I wonder if it worked!", result.content[3]
+        assert_element SemanticText::Span, 'a ', result.content[4]
+        assert_element SemanticText::Tag, 'red balloon', result.content[5]
+        assert_element SemanticText::Span, ' b ', result.content[6]
+        assert_element SemanticText::Tag, 'RecentChanges', result.content[7]
+        assert_element SemanticText::Span, ' c', result.content[8]
+        assert_equal 9, result.content.size
+    	end
+	def test_paragraphs_headings_and_bullet_points
+    unit = SemanticText::Parser.new
+    unit.parse('')
+    unit.parse('!heading')
+    unit.parse('This is a paragraph')
+    unit.parse('*   bullet point')
+    unit.parse('** nested bullet point')
+    unit.parse('')
+    unit.parse('* separate list')
+    assert_element SemanticText::Heading, "heading", unit.content[0]
+    assert_equal SemanticText::Paragraph, unit.content[1].class
+    assert_element SemanticText::Span, "This is a paragraph", unit.content[1].content[0]
+    assert_equal SemanticText::BulletedList, unit.content[2].class
+    bullets = unit.content[2]
+    assert_equal 2, bullets.content.size
+    bullet_1 = bullets.content[0]
+    assert_equal SemanticText::Bullet, bullet_1.class
+    assert_equal SemanticText::Span, bullet_1.content[0].class
+    assert_equal 1, bullet_1.content.size
+    assert_equal "bullet point", bullet_1.content[0].text
+    nested = bullets.content[1]
+    assert_equal SemanticText::BulletedList, nested.class
+    assert_equal 1, nested.content.size
+    nested_bullet = nested.content[0]
+    assert_equal SemanticText::Bullet, nested_bullet.class
+    assert_equal "nested bullet point", nested_bullet.content[0].text
+    second_bullets = unit.content[3]
+    assert_equal SemanticText::BulletedList, second_bullets.class
+    assert_equal 1, second_bullets.size
+    assert_equal SemanticText::Bullet, second_bullets.content[0].class
+    assert_equal SemanticText::Span, second_bullets.content[0].content[0].class
+    assert_equal "separate list", second_bullets.content[0].content[0].text
+	end
+	def test_bullet_points_with_urls_and_tags
+    unit = SemanticText::Parser.new
+    unit.parse('')
+    unit.parse('* with url http://www.example.com see?')
+    unit.parse('* with tag [c2:RecentChanges] see?')
+    actual_list = unit.content[0]
+    first_bullet = actual_list.content[0]
+    second_bullet = actual_list.content[1]
+    assert_equal SemanticText::BulletedList, actual_list.class
+    assert_element SemanticText::Span, 'with url ', first_bullet.content[0]
+    assert_element SemanticText::Link, 'http://www.example.com', first_bullet.content[1]
+    assert_element SemanticText::Span, ' see?', first_bullet.content[2]
+    assert_element SemanticText::Span, 'with tag ', second_bullet.content[0]
+    assert_element SemanticText::Tag, 'RecentChanges', second_bullet.content[1]
+    assert_element SemanticText::Span, ' see?', second_bullet.content[2]
+	end
+	def test_bulleted_list_nesting
+    unit = SemanticText::Parser.new
+    unit.parse('')
+    unit.parse('* separate list')
+    unit.parse('** nested bullet point 1')
+    unit.parse('** nested bullet point 2')
+    unit.parse('* top-level bullet point')
+    list = unit.content[0]
+    assert_equal SemanticText::BulletedList, list.class
+    assert_equal 3, list.content.size
+    bullet_1 = list.content[0]
+    assert_equal SemanticText::Bullet, bullet_1.class
+    assert_equal SemanticText::Span, bullet_1.content[0].class
+    assert_equal 'separate list', bullet_1.content[0].text
+    nested = list.content[1]
+    assert_equal SemanticText::BulletedList, nested.class
+    assert_equal 2, nested.content.size
+    bullet_2_1 = nested.content[0]
+    assert_equal SemanticText::Bullet, bullet_2_1.class
+    assert_equal SemanticText::Span, bullet_2_1.content[0].class
+    assert_equal "nested bullet point 1", bullet_2_1.content[0].text
+    bullet_2_2 = nested.content[1]
+    assert_equal SemanticText::Bullet, bullet_2_2.class
+    assert_equal SemanticText::Span, bullet_2_2.content[0].class
+    assert_equal "nested bullet point 2", bullet_2_2.content[0].text
+    bullet_3 = list.content[2]
+    assert_equal SemanticText::Bullet, bullet_3.class
+    assert_equal SemanticText::Span, bullet_3.content[0].class
+    assert_equal "top-level bullet point", bullet_3.content[0].text
+  end
+  def test_bulleted_list_parsing_into_two_separate_lists
+    unit = SemanticText::Parser.new
+    unit.parse('')
+    unit.parse('* first bullet in first list')
+    unit.parse('')
+    unit.parse('* second bullet in second list')
+    assert_equal SemanticText::BulletedList, unit.content[0].class
+    assert_equal SemanticText::BulletedList, unit.content[1].class
+    assert_not_same(unit.content[0], unit.content[1])
+  end
+end

data/testfiles/complex.art ADDED

@@ -0,0 +1,28 @@
+title:test title
+createdAt:5 November 2005
+keywords: buzz, fuzz, muzz
+!First Big Heading
+This is another
+paragraph.
+This paragraph tests escaping < > &
+Theis is the third paragraph.
+Hey dude, check out my website:
+http://www.example.com Cool innit?
+!Second Big Section < > &
+* point 1
+** subpoint 1.1
+** subpoint 1.2
+* point 2
+* < > &
+** subpoint 2.1
+** subpoint 2.2
+This is another paragraph. This is a [c2:RecentChanges] tag.
+http://www.example.com/foo?a=b&c=d

data/testfiles/regression-exportsample.txt ADDED

@@ -0,0 +1,15 @@
+<h1>First Big Heading</h1>
+<p> This is another paragraph.</p>
+<p> This paragraph tests escaping &lt; &gt; &amp;</p>
+<p> Theis is the third paragraph.</p>
+<p> Hey dude, check out my website: <a href="http://www.example.com">http://www.example.com</a>  Cool innit?</p>
+<h1>Second Big Section &lt; &gt; &amp;</h1>
+	<ul><li> point 1</li>
+		<ul><li> subpoint 1.1</li><li> subpoint 1.2</li>
+		</ul><li> point 2</li><li> &lt; &gt; &amp;</li>
+		<ul><li> subpoint 2.1</li><li> subpoint 2.2</li>
+		</ul>
+	</ul>
+<p> This is another paragraph. This is a [c2:RecentChanges]  tag.</p>
+<p><a href="http://www.example.com/foo?a=b&c=d">http://www.example.com/foo?a=b&amp;c=d</a></p>

data/testfiles/simple.art ADDED

@@ -0,0 +1,10 @@
+title:test title
+createdAt:5 November 2005
+keywords: buzz, fuzz, muzz
+!First Big Heading
+This is another
+paragraph.
+Theis is the second paragraph.

metadata ADDED

@@ -0,0 +1,92 @@
+--- !ruby/object:Gem::Specification
+name: semantictext
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Dafydd Rees
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2009-11-29 00:00:00 +00:00
+default_executable:
+dependencies: []
+description: Semantic Text is a system for marking up plain text documents with domain-specific tags.
+email: os@greenbarsoft.co.uk
+executables: []
+extensions: []
+extra_rdoc_files:
+- CHANGELOG
+- COPYING
+- README.rdoc
+- TODO.rdoc
+files:
+- lib/semantictext/bullet.rb
+- lib/semantictext/bulleted_list_parser.rb
+- lib/semantictext/bulletedlist.rb
+- lib/semantictext/date_extractor.rb
+- lib/semantictext/default_tag_factory.rb
+- lib/semantictext/extraction_failed.rb
+- lib/semantictext/heading.rb
+- lib/semantictext/keyword_extractor.rb
+- lib/semantictext/link.rb
+- lib/semantictext/not_header_line.rb
+- lib/semantictext/paragraph.rb
+- lib/semantictext/parser.rb
+- lib/semantictext/rich_text_parser.rb
+- lib/semantictext/span.rb
+- lib/semantictext/tag.rb
+- lib/semantictext.rb
+- lib/string.rb
+- test/bullet_test.rb
+- test/bulleted_list_parser_test.rb
+- test/dateextractor_test.rb
+- test/export_test.rb
+- test/keywordextractor_test.rb
+- test/parser_test.rb
+- testfiles/complex.art
+- testfiles/regression-exportsample.txt
+- testfiles/simple.art
+- CHANGELOG
+- COPYING
+- README.rdoc
+- TODO.rdoc
+has_rdoc: true
+homepage: http://www.greenbarsoft.co.uk/software/semantictext
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+requirements: []
+rubyforge_project:
+rubygems_version: 1.3.5
+signing_key:
+specification_version: 3
+summary: Domain-Specific text markup parser
+test_files:
+- ./test/bullet_test.rb
+- ./test/bulleted_list_parser_test.rb
+- ./test/dateextractor_test.rb
+- ./test/export_test.rb
+- ./test/keywordextractor_test.rb
+- ./test/parser_test.rb