RubyGems - jakal - Versions diffs - 0.1.1 → 0.1.2 - Mend

jakal 0.1.1 → 0.1.2

Files changed (16)

data/{README.rdoc → README.md} +35 -4
data/features/calais.feature +3 -36
data/features/http.feature +4 -10
data/features/sanitize-text.feature +63 -45
data/features/step_definitions/calais_steps.rb +1 -41
data/features/step_definitions/http_steps.rb +1 -18
data/features/step_definitions/sanitize-text_steps.rb +62 -15
data/features/support/env.rb +9 -0
data/lib/jkl.rb +18 -4
data/lib/jkl/calais_client.rb +5 -57
data/lib/jkl/rest_client.rb +6 -2
data/lib/jkl/rss_client.rb +1 -1
data/lib/jkl/text_client.rb +38 -0
metadata +6 -7
data/features/step_definitions/require_steps.rb +0 -12
data/lib/jkl/url_doc_handler.rb +0 -35

@@ -1,12 +1,43 @@
-= jkl
+# jkl
-* http://github.com/sshingler/jkl
+jkl (Jakal) does these things:
-== LICENSE:
+* Connects to URLs.
+* Gets stuff out of RSS feeds.
+* Gets the main content from web pages
+* Gets a set of metadata from a web page (using the calais gem)
+# Sample usage
+For example - if you had a RSS feed:
+  require "jkl"
+		feed = "http://www.topix.net/rss/search/article?x=0&y=0&q=London"
+You could collect some metadata from the links in that feed, thus:
+		tags = []
+		Jkl::links(feed).each do |link|
+			tags << Jkl::tags("my_calais_key",link)
+		end
+A metadata sample might look something like this:
+		{
+				"Person"=>["Barack Obama", "Hillary Clinton"],
+				"Position"=>["Secretary of State"]
+		}
+It is hosted at [gemcutter](http://gemcutter.org/gems/jakal)
+		gem install jakal
+# LICENSE:
 (The MIT License)
-Copyright (c) 2009 FIXME full name
+Copyright (c) 2009 sshingler
 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the

data/features/calais.feature CHANGED

@@ -3,41 +3,8 @@ Feature: Calais-Specific features
   As a developer
   I want to make some requests and inspect some responses
-  @connection_needed
-  Scenario: Post some very simple text to calais, inspect the response
-	  Given I have some simple text
-    When I post to calais
-    Then I should get a response
-	  And I should receive some tags
-  @connection_needed
-  Scenario: Post a mock story to calais, inspect the response
-	  Given I have a sanitized sample BBC story
-    When I post to calais
-    Then I should get a response
-    And I should receive some tags
-  @connection_needed
-	Scenario: Get nested tags from calais
-	  Given I have some simple text
+  @live
+  Scenario: Get nested tags from calais
+    Given I have some text
     When I request the nested entities from calais
     Then I should receive the entities grouped into categories
-  Scenario: Clean up blank items from a calais response
-  	Given I have a mock calais response
-  	When I remove the unwanted items
-  	Then I should receive some tags
-   	And there should no longer be any "instances"
-  	And there should no longer be any "relevance"
-  	And there should no longer be any "blank"
-   	And there should no longer be any "not_available"
-  Scenario: Go through the calais response tags in a bit more detail
-  	Given I have a mock calais response
-  	When I remove the unwanted items
-  	Then I should receive some tags
-  	And there should be some "Organization" tags
-  Scenario: Go through the calais response tags as a single array
-  	Given I have a mock calais response
-  	Then I should be able to see the whole lot of tags as one block

data/features/http.feature CHANGED

@@ -3,24 +3,18 @@ Feature: http features
   As a developer
   I want to make some requests and inspect some responses
-  @connection_needed
+  @live
   Scenario: Make a restful post to yahoo
     When I post some data to yahoo
 	  Then I should get a response
-  @connection_needed
+  @live
   Scenario: Make a restful get
     When I make a restful get request
 	  Then I should get a response
 	  And I should see some text
-  @connection_needed
+  @live
   Scenario: Get some trends
-    When I request some trends
+    When I request some twitter trends
 	  Then I should get a response
-  @connection_needed
-  Scenario: Get some RSS
-    When I request some RSS
-	  Then I should get a response
-    And I should receive some headlines

data/features/sanitize-text.feature CHANGED

@@ -3,51 +3,69 @@ Feature: Processing features
   As a developer
   I want to make some requests and inspect some responses
-	@unit @text
-	Scenario: Sanitize some ok text
-		Given I have a keyphrase 'the cat sat on the mat'
-		When I sanitize this text
-		Then it should be ok
-		And it should say 'the cat sat on the mat'
-	@unit @text
-	Scenario: Sanitize some short text
-		Given I have a keyphrase 'the cat sat'
-		When I sanitize this text
-		Then it should say ''
-	@unit @text @wip
-		Scenario: Sanitize some text with tabs and spaces
-		Given I have a keyphrase 'the cat sat on 						the mat            '
-		When I sanitize this text
-		Then it should say 'the cat sat on the mat'
-	@unit @text @wip
-		Scenario: Sanitize some short text with tabs and spaces
-		Given I have a keyphrase 'the   cat sat on 						           '
-		When I sanitize this text
-		Then it should say ''
+  @unit @text
+  Scenario: No changes needed
+    Given I have the text "the cat sat on the mat"
+    When I sanitize this text
+    Then there should be no script tags
+    And there should be no tags
+    And there should be no blank lines
+    And it should say "the cat sat on the mat"
-	@unit @text
-	Scenario: Sanitize some tagged short text
-		Given I have a keyphrase '<a href="a-link.html>the cat sat</a>'
-		When I sanitize this text
-		Then it should say ''
+  @unit @text
+  Scenario: Remove simple tags
+    Given I have the text "<a href=\"a-link.html\">the cat sat on the mat</a>"
+    When I sanitize this text
+    Then there should be no script tags
+    And there should be no tags
+    And there should be no blank lines
+    Then it should say "the cat sat on the mat"
-	@unit @text
-	Scenario: Sanitize some tagged text
-		Given I have a keyphrase '<a href="a-link.html>the cat sat on the mat</a>'
-		When I sanitize this text
-		Then it should be ok
-		Then it should say 'the cat sat on the mat'
-	@unit @text @wip
-	Scenario: Remove script tags
-	  Given I have some script tag data
-	  When I sanitize this text
-	  Then it should say ' some para stuff here '
+  @unit @text @wip
+  Scenario: Remove script tags
+    Given I have some script tag data
+    When I sanitize this text
+    Then there should be no script tags
+    And there should be no tags
+    And there should be no blank lines
+    Then it should say "the cat sat on the mat"
-	Scenario: Clean a web page
-		Given I have a sample BBC story
-		When I sanitize this text
-		Then it should be ok
+  @mock
+  Scenario: Remove script tags
+    Given I have a sample web page
+    When I remove the script tags
+    Then there should be no script tags
+  @mock
+  Scenario: Remove all tags
+    Given I have a sample web page
+    When I remove the script tags
+    And I strip all the tags
+    Then there should be no script tags
+    And there should be no tags
+  @mock
+  Scenario: Remove empty lines
+    Given a stripped web page
+    When I remove the blank lines
+    Then there should be no blank lines
+  @mock
+  Scenario: Remove a short line
+    Given I have the text "the cat sat on the"
+    When I remove the short lines
+    Then it should say ""
+  @mock
+  Scenario: Don't remove a long line
+    Given I have the text "the cat sat on the mat"
+    When I remove the short lines
+    Then it should say "the cat sat on the mat"
+  @mock
+  Scenario: Sanitize a sample BBC page
+    Given I have a sample BBC story
+    When I sanitize this text
+    Then there should be no script tags
+    And there should be no tags
+    And there should be no blank lines

data/features/step_definitions/calais_steps.rb CHANGED

@@ -1,48 +1,8 @@
-Given /^I have some simple text$/ do
+Given /^I have some text$/ do
   @text = "Barack Obama said today that he expects there to be conflict within his new security team after confirming Hillary Clinton as his choice for US Secretary of State."
 end
-Given /^I have a sanitized sample BBC story$/ do
-  Given "I have a sample BBC story"
-	When "I sanitize this text"
-end
-Given /^I have a mock calais response$/ do
-  @response = File.open('features/mocks/calais.json','r') {|f| f.readlines.to_s}
-end
-When /^I post to calais$/ do
-  key = YAML::load_file('config/keys.yml')['calais']
-  @response = Jkl::Extraction::get_from_calais(key, @text)
-end
-When /^I remove the unwanted items$/ do
-  @processed_json = Jkl::clean_unwanted_items_from_hash(JSON.parse(@response))
-end
-Then /^there should no longer be any "([^\"]*)"$/ do |arg1|
-  @processed_json[arg1].should be_nil
-end
-Then /^I should receive some tags$/ do
-  Jkl::get_tag_from_json(@response) do |tag|
-    tag.should_not be_nil
-  end
-end
-Then /^there should be some "([^\"]*)" tags$/ do |arg1|
-  Jkl::get_tag_from_json(@response) {|tag|
-    #puts tag.inspect
-    tag.each{|k,v| puts "#{k} : #{v}" if k=='_type'}
-  }
-end
-Then /^I should be able to see the whole lot of tags as one block$/ do
-  tags = Jkl::get_tag_from_json(@response)
-  tags.length.should > 0
-end
 When /^I request the nested entities from calais$/ do
   key = YAML::load_file('config/keys.yml')['calais']
   @response = Jkl::Extraction::tags key, @text

data/features/step_definitions/http_steps.rb CHANGED

@@ -6,12 +6,6 @@ When /^I post some data to yahoo$/ do
   @response = Jkl::post_to @url, post_args
 end
-When /^I request some RSS$/ do
-  keyphrase = @keyphrase || "iraq"
-  url = "#{YAML::load_file('config/config.yml')['topix']}#{CGI::escape(keyphrase)}"
-  @response = Jkl::get_xml_from url
-end
 Given /^I have some RSS$/ do
   raw = File.open('features/mocks/topix_rss.xml','r') {|f| f.readlines.to_s}
   @response = Hpricot.XML raw
@@ -22,7 +16,7 @@ When /^I make a restful get request$/ do
   @response = Jkl::get_from url
 end
-When /^I request some trends$/ do
+When /^I request some twitter trends$/ do
   twitter_json_url = YAML::load_file('config/config.yml')['twitter']
   output = JSON.parse Jkl::get_from twitter_json_url
   @response = output['trends']
@@ -30,17 +24,6 @@ end
 Then /^I should get a response$/ do
   @response.should_not == nil
-  #puts @response.inspect
-end
-Then /^I should receive some headlines$/ do
-  @items = Jkl::Rss::items @response
-  @links = []
-  @items.each do |item|
-    @links << Jkl::Rss::attribute_from(item, :link)
-  end
-  @links.should_not == nil
-  @links.length.should > 0
 end
 Then /^I should be able to get the copy from the first headline$/ do

data/features/step_definitions/sanitize-text_steps.rb CHANGED

@@ -1,4 +1,4 @@
-Given "I have a keyphrase '$text'" do |text|
+Given "I have the text \"$text\"" do |text|
   @text = text
 end
@@ -6,22 +6,9 @@ Given /^I have a sample BBC story$/ do
   @text = File.open('features/mocks/bbc_story.html','r') {|f| f.readlines.to_s}
 end
-When /^I sanitize this text$/ do
-  @text = Jkl::sanitize @text
-end
-Then /^it should be ok$/ do
-  @text.should_not be_nil
-  @text.should_not == ""
-end
-Then "it should say '$text'" do |text|
-  @text.should == text
-end
 Given /^I have some script tag data$/ do
   @text = <<-EOF;
-  some start stuff here
+  the cat sat on the mat
   <script type="text/javascript" charset="utf-8">
    function nofunction(){var bob;}
   </script>
@@ -30,3 +17,63 @@ Given /^I have some script tag data$/ do
     EOF
 end
+Given /^I have a sample web page$/ do
+  @text = File.open('features/mocks/sample-web-page.html','r') {|f| f.readlines.to_s}
+end
+Given /^a stripped web page$/ do
+  Given "I have a sample web page"
+  When "I remove the script tags"
+  And "I strip all the tags"
+  Then "there should be no script tags"
+  And "there should be no tags"
+end
+When /^I sanitize this text$/ do
+  @text = Jkl::Text::sanitize @text
+end
+When /^I examine the text$/ do
+  text = Jkl::Text::remove_tabs @text
+end
+Then "it should say \"$text\"" do |text|
+  @text.to_s.should == text
+end
+Then /^I can read it$/ do
+  Jkl::Text::document_from(@response).should_not be_nil
+end
+When /^I remove the script tags$/ do
+  @text = Jkl::Text::remove_script_tags @text
+end
+When /^I remove the blank lines$/ do
+  @text = Jkl::Text::remove_blank_lines @text
+end
+When /^I remove the short lines$/ do
+  @text = Jkl::Text::remove_short_lines @text
+end
+When /^I clean it up$/ do
+  @text = Jkl::Text::remove_short_lines Jkl::Text:: strip_all_tags Jkl::Text::remove_script_tags @text
+end
+When /^I strip all the tags$/ do
+  @text = Jkl::Text::strip_all_tags @text
+end
+Then /^there should be no tags$/ do
+  @text.match(/</).should be_nil
+end
+Then /^there should be no script tags$/ do
+  @text.match(/<script/).should be_nil
+end
+Then /^there should be no blank lines$/ do
+  @text.match(/\r/).should be_nil
+  @text.match(/\n/).should be_nil
+end

data/features/support/env.rb CHANGED

@@ -2,6 +2,15 @@ gem 'rack-test'
 require 'spec/expectations'
 require 'rack/test'
+require 'hpricot'
+require 'json'
+require 'restclient'
+require 'haml'
+require 'cgi'
+require 'lib/jkl.rb'
+include Jkl
 class MyWorld
   include Rack::Test::Methods

data/lib/jkl.rb CHANGED

@@ -1,8 +1,22 @@
-require "jkl/rest_client.rb"
-require "jkl/rss_client.rb"
-require "jkl/calais_client.rb"
-require "jkl/url_doc_handler.rb"
+require "jkl/rest_client"
+require "jkl/rss_client"
+require "jkl/calais_client"
+require "jkl/text_client"
 module Jkl
+  class << self
+    def links(feed)
+      links = Jkl::Rss::links(Jkl::Rss::items(Jkl::get_xml_from(feed)))
+      links.each do |link|
+        yield link if block_given?
+      end
+    end
+    def tags(key, link)
+      text = Jkl::Text::sanitize(Jkl::get_from(link))
+      Jkl::Extraction::tags(key, text)
+    end
+  end
 end

data/lib/jkl/calais_client.rb CHANGED

@@ -1,21 +1,17 @@
-require "json"
 require "calais"
-require "rest_client"
 module Jkl
   module Extraction
     class << self
-      #using the calais gem
-      def calais_response(key, pages)
+      def calais_response(key, text)
         Calais.process_document(
-            :content => pages,
+            :content => text,
             :content_type => :text,
             :license_id => key
         )
       end
       def tags(key, text)
         nested_list = {}
         entities(key,text).each do |a|
@@ -23,58 +19,10 @@ module Jkl
         end
         nested_list
       end
       def entities(key,text)
         calais_response(key, text).entities.map{|e| {e.type => [e.attributes["name"]]}}
       end
-      #not using calais gem, experimenting with json response
-      def get_from_calais(key, content)
-        post_args = {
-            "licenseID" => key,
-            "content" => content,
-            "paramsXML" => paramsXML("application/json")
-        }
-        Jkl::post_to(URI.parse("http://api.opencalais.com/enlighten/rest/"), post_args)
-      end
-      def get_tag_from_json(response)
-        result = JSON.parse response
-        result.delete_if {|key, value| key == "doc" } # ditching the doc
-        cleaned_result = []
-        result.each do |key,tag|
-          tag = Jkl::clean_unwanted_items_from_hash tag
-          cleaned_result << tag
-          yield tag if block_given?
-        end
-        cleaned_result
-      end
-      def clean_unwanted_items_from_hash h
-        h.delete_if {|k, v| k == "relevance" }
-        h.delete_if {|k, v| k == "instances" }
-        h.delete_if {|k, v| v == "N/A"}
-        h.delete_if {|k, v| v == []}
-        h.delete_if {|k, v| v == ""}
-        h.delete_if {|k, v| k == "_typeGroup"}
-        h
-      end
-      private
-      def paramsXML(format)
-       <<-paramsXML;
-        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
-               xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
-               <c:processingDirectives
-               c:contentType="text/txt"
-               c:outputFormat="#{format}">
-               </c:processingDirectives>
-               <c:userDirectives />
-               <c:externalMetadata />
-               </c:params>
-        paramsXML
-      end
     end
   end
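
Note on `tags`: the hunk above elides the body of the `entities(key,text).each` loop, so the exact merge logic is not visible in this diff. The sketch below is only one plausible way to fold the single-key hashes returned by `entities` into the grouped shape shown in the README sample. The method name `group_entities` and the sample data are assumptions for illustration, not part of the gem.

```ruby
# Hypothetical helper (not in the gem): folds a list of single-key hashes,
# shaped like the output of Jkl::Extraction.entities, into one hash that
# maps each entity type to all of its names.
def group_entities(entities)
  entities.each_with_object({}) do |entity, grouped|
    entity.each do |type, names|
      # Create the list for this type on first sight, then append the names.
      (grouped[type] ||= []).concat(names)
    end
  end
end

# Sample data in the shape produced by the entities method in the diff.
sample = [
  { "Person"   => ["Barack Obama"] },
  { "Person"   => ["Hillary Clinton"] },
  { "Position" => ["Secretary of State"] }
]

grouped = group_entities(sample)
puts grouped.inspect
```

With the sample above, both `Person` entries end up under one key, which matches the metadata sample in the README.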

data/lib/jkl/rest_client.rb CHANGED

@@ -19,8 +19,8 @@ module Jkl
     def get_from(uri)
       begin
-        res = Net::HTTP.get_response(URI.parse(uri))
-        res.body
+        response = Net::HTTP.get_response(URI.parse(uri))
+        response.body
       rescue  URI::InvalidURIError => e
         puts("WARN: Invalid URI: #{e}")
       rescue SocketError => e
@@ -33,6 +33,10 @@ module Jkl
     def get_xml_from(uri)
       Hpricot.XML get_from uri
     end
+    def document_from(text)
+      Hpricot(text)
+    end
   end
 end

data/lib/jkl/rss_client.rb CHANGED

@@ -7,7 +7,7 @@ module Jkl
       def items(rss_doc)
         (rss_doc/:item)
       end
       def links(items)
         items.map{|item| attribute_from(item,:link)}
       end

data/lib/jkl/text_client.rb ADDED

@@ -0,0 +1,38 @@
+module Jkl
+  module Text
+    class << self
+      def sanitize(text)
+        remove_short_lines strip_all_tags remove_script_tags text
+      end
+      def strip_all_tags(text)
+        text.gsub(/<\/?[^>]*>/, "")
+      end
+      def remove_blank_lines(text)
+        text.gsub(/\n\r|\r\n|\n|\r/, "")
+      end
+      def remove_html_comments(text)
+        text.gsub(/<!--(.|\s)*?-->/, "")
+      end
+      def remove_script_tags(text)
+        text = remove_html_comments(text)
+        text.gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i, "")
+      end
+      def remove_short_lines(text)
+        text = text.gsub(/\s\s/, "\n")
+        str = ""
+        # remove short lines - usually just navigation
+        text.split("\n").each do |l|
+          str << l unless l.count(" ") < 5
+        end
+        str
+      end
+    end
+  end
+end
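
Note on the new `Jkl::Text` pipeline: `sanitize` composes three steps (remove script tags, strip all remaining tags, drop short lines). The sketch below reproduces those steps with the same regular expressions in a standalone module so the behaviour can be tried without installing the gem. The module name `TextSketch` is an assumption, and `remove_short_lines` is written with `reject`/`join` rather than the string-append loop in the diff, which should behave the same way.

```ruby
# Standalone sketch of the sanitize pipeline added in text_client.rb.
# TextSketch is a hypothetical name; the regexes are copied from the diff.
module TextSketch
  module_function

  def remove_html_comments(text)
    text.gsub(/<!--(.|\s)*?-->/, "")
  end

  def remove_script_tags(text)
    text = remove_html_comments(text)
    text.gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i, "")
  end

  def strip_all_tags(text)
    text.gsub(/<\/?[^>]*>/, "")
  end

  def remove_short_lines(text)
    # Any two consecutive whitespace characters become a line break, then
    # lines with fewer than five spaces are dropped and the rest are joined.
    text.gsub(/\s\s/, "\n").split("\n").reject { |line| line.count(" ") < 5 }.join
  end

  def sanitize(text)
    remove_short_lines(strip_all_tags(remove_script_tags(text)))
  end
end

page = "the cat sat on the mat\n" \
       "<script type=\"text/javascript\" charset=\"utf-8\">function nofunction(){var bob;}</script>\n"

puts TextSketch.sanitize(page)
puts TextSketch.sanitize('<a href="a-link.html">the cat sat on the mat</a>')
```

Two things worth noting when reviewing this change: the script regex uses `[^>]*` for the script body, so a script containing a `>` character (for example a comparison) would not be matched by that step, and `remove_short_lines` joins the surviving lines with no separator between them.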

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: jakal
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.1.2
 platform: ruby
 authors:
 - sshingler
@@ -13,21 +13,21 @@ date: 2009-08-27 00:00:00 +01:00
 default_executable:
 dependencies: []
-description: Jakal is a Ruby library which contains some utilies for connecting to internet based APIs.
+description: Jakal is a Ruby library which contains some utilities for connecting to internet based APIs.
 email: "'shingler@gmail.com'"
 executables: []
 extensions: []
 extra_rdoc_files:
-- README.rdoc
+- README.md
 - License.txt
 files:
 - lib/jkl.rb
 - lib/jkl/calais_client.rb
 - lib/jkl/rest_client.rb
 - lib/jkl/rss_client.rb
-- lib/jkl/url_doc_handler.rb
+- lib/jkl/text_client.rb
 - features/calais.feature
 - features/http.feature
 - features/sanitize-text.feature
@@ -37,11 +37,10 @@ files:
 - features/mocks/twitter.json
 - features/step_definitions/calais_steps.rb
 - features/step_definitions/http_steps.rb
-- features/step_definitions/require_steps.rb
 - features/step_definitions/sanitize-text_steps.rb
 - features/step_definitions/twitter_steps.rb
 - features/support/env.rb
-- README.rdoc
+- README.md
 - License.txt
 has_rdoc: true
 homepage: http://github.com/sshingler/jkl
@@ -71,6 +70,6 @@ rubyforge_project:
 rubygems_version: 1.3.5
 signing_key:
 specification_version: 3
-summary: Jakal is a Ruby library which contains some utilies for connecting to internet based APIs.
+summary: Jakal is a Ruby library which contains some utilities for connecting to internet based APIs.
 test_files: []

data/features/step_definitions/require_steps.rb DELETED

@@ -1,12 +0,0 @@
-require 'hpricot'
-require 'json'
-require 'restclient'
-require 'haml'
-require 'cgi'
-require 'lib/jkl.rb'
-require 'lib/jkl/calais_client.rb'
-require 'lib/jkl/rest_client.rb'
-require 'lib/jkl/rss_client.rb'
-require 'lib/jkl/url_doc_handler.rb'
-include Jkl

data/lib/jkl/url_doc_handler.rb DELETED

@@ -1,35 +0,0 @@
-require 'hpricot'
-require 'rest_client'
-module Jkl
-  class << self
-    def sanitize(text)
-      str = ""
-      text = text.to_s.gsub(/((<[\s\/]*script\b[^>]*>)([^>]*)(<\/script>))/i,"") #remove script tags - with contents
-      text.to_s.gsub(/<\/?[^>]*>/, "").split("\r").each do |l| # remove all tags
-        l = l.gsub(/^[ \t]/,"") #remove tabs
-        l = l.gsub(/^[ \s]/,"")
-        l.split("\n").each do |l|
-          str << l unless l.count(" ") < 5 # remove short lines - ususally just navigation
-        end
-      end
-      str
-    end
-    def from_doc(response)
-      begin
-        Hpricot(response)
-      rescue  URI::InvalidURIError => e
-        puts("WARN: Problem with getting a connection: #{e}")
-      rescue SocketError => e
-        puts("WARN: Could not connect to feed: #{e}")
-      rescue Errno::ECONNREFUSED  => e
-        puts("WARN: Connection refused: #{e}")
-      end
-    end
-  end
-end