RubyGems - muddyit_fu - Versions diffs - 0.2.11 → 0.2.12 - Mend

muddyit_fu 0.2.11 → 0.2.12

Files changed (7)

data/README.rdoc CHANGED

@@ -2,9 +2,7 @@
 Muddy is an information extraction platform.  For further
 details see the '{Getting Started with Muddy}[http://blog.muddy.it/2009/11/getting-started-with-muddy]'
-article.  This gem provides access to the Muddy platform via it's API :
-{Muddy Developer Guide}[http://muddy.it/developers/]
+article.  This gem provides access to the Muddy platform via it's API (see {Muddy Developer Guide}[http://muddy.it/developers/]).
 == Installation
@@ -16,7 +14,7 @@ article.  This gem provides access to the Muddy platform via it's API :
 Muddy supports OAuth and HTTP Basic auth for authentication and authorisation.
 We recommend you use OAuth wherever possible when accessing Muddy.  An example
-of using OAuth with the muddy platform is descibed in the
+of using OAuth with the Muddy platform is described in the
 {Building with Muddy and OAuth}[http://blog.muddy.it/2010/01/building-with-muddy-and-oauth]
 article.
@@ -59,7 +57,7 @@ URL rather than text, just specify a URL instead :
 Muddy allows you to store the entity extraction results so aggregate operations
 can be performed over a collection of content (a 'collection' has many analysed 'pages').
-A basic muddy account provides a single 'collection' where extraction results
+A basic Muddy account provides a single 'collection' where extraction results
 can be stored.
 To store a page against a collection, the collection must first be found :
@@ -70,7 +68,16 @@ Once a collection has been found, entity extraction results can be stored in it:
   collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:minium_confidence => 0.2})
-== Viewing all analysed pages in a collection
+== Working with a collection
+A collection allows aggregate operations to be perfomed on itself and on it's
+members.  A collection is identified by it's 'collection token'.  This is an
+alphanumeric six character string (e.g. 'a0ret4').  A collection can be found if
+it's token is known :
+  collection = muddyit.collections.find('a0ret4')
+=== Viewing all analysed pages
 You can iterate through all the analysed pages in a collection, be aware that
 the Muddy API provides the pages as paginated sets, so it may take some time to
@@ -87,25 +94,42 @@ for each new paginated set of results).
     end
   end
-== Working with a collection
+=== Finding a particular page or pages
-A collection allows aggregate operations to be perfomed on itself and on it's
-members.  A collection is identified by it's 'collection token'.  This is an
-alphanumeric six character string (e.g. 'a0ret4').  A collection can be found if
-it's token is known :
+Each page in a collection is assigned a unique alphanumeric identifier.  Whilst
+this can be used to find a given page in a collection, it is possible to search
+for the page using other attributes :
-  collection = muddyit.collections.find('a0ret4')
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/business/8186840.stm').first
+  page = collection.pages.find(:all, :title => 'BBC NEWS | Business | ITV in 25m Friends Reunited sale').first
+=== Rereshing a page's results
+A page can be 'refereshed' (the entity extraction is run again) by calling the
+refresh method on a page object :
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  updated_page = page.update
+=== Deleting a page from a collection
-=== View all pages containing 'Gordon Brown'
+A page can be removed from a collection by calling the 'destroy' method on a
+page object :
-If we want to find all references to the grounded entity for 'Gordon Brown 'then
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  page.destroy
+=== View all pages containing entity 'Gordon Brown'
+If we want to find all pages that reference the grounded entity for 'Gordon Brown' then
 it can be searched for using it's DBpedia URI :
   require 'muddyit_fu'
   muddyit = Muddyit.new('./config.yml')
   collection = muddyit.collections.find('a0ret4')
   collection.pages.find_by_entity('http://dbpedia.org/resource/Gordon_Brown') do |page|
-    puts page.identifier
+    puts "#{page.identifier} - #{page.title}"
   end
 === Find related entities for 'Gordon Brown'
@@ -118,7 +142,7 @@ collection :
   collection = muddyit.collections.find('a0ret4')
   puts "Related entity\tOccurance
   collection.entities.find_related('http://dbpedia.org/resource/Gordon_Brown').each do |entry|
-    puts "#{entry[:enity].uri}\t#{entry[:count]}"
+    puts "#{entry[:entity].uri}\t#{entry[:count]}"
   end
 === Find related content for : http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm
@@ -135,6 +159,17 @@ analysed page that has a uri 'http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm
     puts "#{results[:page].title} #{results[:count]}"
   end
+== Batch processing content and the Muddy queue
+The Muddy platform runs a background job queue that allows many requests to be
+made in quick succession (rather than waiting for the full extraction request to
+complete), with analysis of the pages happening asynchronously via the queue
+and being stored in the collection at a later time.  This can be useful when trying
+to analyse large content collections.  To send a request to the queue use :
+  collection = muddyit.collections.find('a0ret4')
+  collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:realtime => false})
 == Contact
   Author: Rob Lee

data/VERSION CHANGED

@@ -1 +1 @@
-0.2.11
+0.2.12

data/lib/muddyit/base.rb CHANGED

@@ -1,9 +1,19 @@
 module Muddyit
+  class_attr_accessor :REST_ENDPOINT
+  @@REST_ENDPOINT = 'http://muddy.it'
   def self.new(*params)
     Muddyit::Base.new(*params)
   end
+  # Shortcut class method for extract
+  def self.extract(doc, options={})
+    @muddyit = Muddyit.new()
+    @muddyit.extract(doc, options)
+  end
   class Base
     class_attr_accessor :http_open_timeout
     class_attr_accessor :http_read_timeout
@@ -13,8 +23,6 @@ module Muddyit
     @@http_open_timeout = 120
     @@http_read_timeout = 120
-    REST_ENDPOINT = 'http://www.muddy.it'
     # Set the request signing method
     @@digest1   = OpenSSL::Digest::Digest.new("sha1")
     @@digest256 = nil
@@ -47,7 +55,8 @@ module Muddyit
     # access_token: CCC
     # access_token_secret: DDD
     #
-    def initialize(config_hash_or_file)
+    def initialize(config_hash_or_file = {})
       if config_hash_or_file.is_a? Hash
         config_hash_or_file.nested_symbolize_keys!
         @username = config_hash_or_file[:username]
@@ -56,7 +65,7 @@ module Muddyit
         @consumer_secret = config_hash_or_file[:consumer_secret]
         @access_token = config_hash_or_file[:access_token]
         @access_token_secret = config_hash_or_file[:access_token_secret]
-        @rest_endpoint = config_hash_or_file.has_key?(:rest_endpoint) ? config_hash_or_file[:rest_endpoint] : REST_ENDPOINT
+        @rest_endpoint = config_hash_or_file.key?(:rest_endpoint) ? config_hash_or_file[:rest_endpoint] : Muddyit.REST_ENDPOINT
       else
         config = YAML.load_file(config_hash_or_file)
         config.nested_symbolize_keys!
@@ -66,7 +75,7 @@ module Muddyit
         @consumer_secret = config[:consumer_secret]
         @access_token = config[:access_token]
         @access_token_secret = config[:access_token_secret]
-        @rest_endpoint = config.has_key?(:rest_endpoint) ? config[:rest_endpoint] : REST_ENDPOINT
+        @rest_endpoint = config.key?(:rest_endpoint) ? config[:rest_endpoint] : Muddyit.REST_ENDPOINT
       end
       if !@consumer_key.nil?
@@ -75,10 +84,7 @@ module Muddyit
         @accesstoken = ::OAuth::AccessToken.new(@consumer, @access_token, @access_token_secret)
       elsif !@username.nil?
         @auth_type = :basic
-      else
-        raise "unable to find authentication credentials"
       end
     end
     # sends a request to the muddyit REST api
@@ -99,7 +105,7 @@ module Muddyit
       case @auth_type
       when :oauth
         res = oauth_request_over_http(api_url, http_method, opts, body)
-      when :basic
+      when :basic, nil
         res = basic_request_over_http(api_url, http_method, opts, body)
       end
@@ -149,7 +155,7 @@ module Muddyit
       response = self.send_request(api_url, :post, {}, body.to_json)
       return Muddyit::Collections::Collection::Pages::Page.new(self, response)
     end
     protected
     # For easier testing. You can mock this method with a XML file you re expecting to receive
@@ -175,6 +181,12 @@ module Muddyit
     def basic_request_over_http(path, http_method, opts, data)
+      # We only allow access to /extract as an unauthenticated user
+      # all other paths should raise an error
+      if @auth_type == nil && path != '/extract'
+        raise "invalid authentication credentials supplied, are the details correct ?"
+      end
       http_opts = { "Accept" => "application/json", "Content-Type" => "application/json", "User-Agent" => "muddyit_fu" }
       query_string = opts.to_a.map {|x| x.join("=")}.join("&")
@@ -196,14 +208,12 @@ module Muddyit
         request.basic_auth @username, @password
         request["Content-Length"] = 0 # Default to 0
       when :get
-        request = Net::HTTP::Get.new(path,headers)
+        path_with_query_string = opts.empty? ? path : "#{path}?#{query_string}"
+        request = Net::HTTP::Get.new(path_with_query_string, headers)
         request.basic_auth @username, @password
       when :delete
         request =  Net::HTTP::Delete.new(path,headers)
         request.basic_auth @username, @password
-      when :head
-        request = Net::HTTP::Head.new(path,headers)
-        request.basic_auth @username, @password
       else
         raise ArgumentError, "Don't know how to handle http_method: :#{http_method.to_s}"
       end
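
The hunks above change two things in basic_request_over_http: a request made without credentials is only allowed for '/extract', and a GET request now carries its options as a query string. A minimal standalone sketch of both steps follows. It does not use the gem: ensure_path_allowed! and path_with_query are hypothetical helper names, the collection path in the last line is made up, and URI.encode_www_form is swapped in for the diff's x.join("=") so that option values such as URLs are percent-encoded (the code in the diff does no encoding).

```ruby
require 'uri'

# Hypothetical helper mirroring the guard added to basic_request_over_http:
# with no auth type configured, only '/extract' may be requested.
def ensure_path_allowed!(auth_type, path)
  if auth_type.nil? && path != '/extract'
    raise "authentication required for #{path}"
  end
  path
end

# Hypothetical helper mirroring the :get branch: append the options as a
# query string only when there are any. Unlike the diff, values are encoded.
def path_with_query(path, opts = {})
  return path if opts.empty?
  "#{path}?#{URI.encode_www_form(opts)}"
end

# Unauthenticated access to /extract passes the guard unchanged.
puts path_with_query(ensure_path_allowed!(nil, '/extract'))

# Made-up collection path with one option.
puts path_with_query('/collections/a0ret4/pages', :page => 2)
```

Whether the Muddy API expects encoded values is an assumption here; if it does not, the diff's plain join would be the form to keep.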

data/muddyit_fu.gemspec CHANGED

@@ -2,11 +2,11 @@
 Gem::Specification.new do |s|
   s.name = %q{muddyit_fu}
-  s.version = "0.2.11"
+  s.version = "0.2.12"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["rattle"]
-  s.date = %q{2010-01-11}
+  s.date = %q{2010-01-18}
   s.email = %q{support[at]muddy.it}
   s.extra_rdoc_files = [
     "LICENSE",
@@ -45,7 +45,8 @@ Gem::Specification.new do |s|
   s.rubygems_version = %q{1.3.5}
   s.summary = %q{Provides a ruby interface to muddy.it}
   s.test_files = [
-    "test/test_muddyit_fu.rb",
+    "test/thing.rb",
+     "test/test_muddyit_fu.rb",
      "test/test_helper.rb",
      "examples/newsindexer.rb",
      "examples/oauth.rb"

data/test/test_muddyit_fu.rb CHANGED

@@ -6,143 +6,157 @@ class TestMuddyitFu < Test::Unit::TestCase
   @@COLLECTION_LABEL = Time.now.to_s
   @@STORY = 'http://news.bbc.co.uk/1/hi/business/8186840.stm'
-  context 'A muddy account' do
+  context 'A user without a muddy account' do
     setup do
       c = load_config
-      begin
-      @muddyit = Muddyit.new(:consumer_key => c['consumer_key'],
-                             :consumer_secret => c['consumer_secret'],
-                             :access_token => c['access_token'],
-                             :access_token_secret => c['access_token_secret'],
-                             :rest_endpoint => c['rest_endpoint'],
-                             :username => c['username'],
-                             :password => c['password'])
-      rescue
-        puts "Failed to connect to muddy, are the details correct ?"
-      end
+      Muddyit.REST_ENDPOINT = c['rest_endpoint'] if c.key?('rest_endpoint')
     end
-    should "analyse a page without a collection" do
-      page = @muddyit.extract(@@STORY)
+    should "be able to analyse a page without a collection" do
+      page = Muddyit.extract(@@STORY)
       assert page.entities.length > 0
     end
-    should 'be able to create a collection' do
-      collection = @muddyit.collections.create(@@COLLECTION_LABEL, 'http://www.test.com')
-      assert !collection.token.nil?
-    end
-    should 'be able to find a collection' do
-      # This is a bit rubbish
-      @muddyit.collections.find(:all).each do |collection|
-        if collection.label == @@COLLECTION_LABEL
-          assert true
-        end
-      end
-    end
-    should 'be able to destroy a collection' do
-      # This is also a bit rubbish
-      collections = @muddyit.collections.find(:all)
-      collections.each do |collection|
-        if collection.label == @@COLLECTION_LABEL
-           res = collection.destroy
-           assert_equal res.code, "200"
-        end
-      end
-    end
-    context "with a collection" do
+  end
+    context 'A user with a muddy account' do
       setup do
-        @collection = @muddyit.collections.create(@@COLLECTION_LABEL, 'http://www.test.com')
+        c = load_config
+        begin
+        @muddyit = Muddyit.new(:consumer_key => c['consumer_key'],
+                               :consumer_secret => c['consumer_secret'],
+                               :access_token => c['access_token'],
+                               :access_token_secret => c['access_token_secret'],
+                               :rest_endpoint => c['rest_endpoint'],
+                               :username => c['username'],
+                               :password => c['password'])
+        rescue
+          puts "Failed to connect to muddy, are the details correct ?"
+        end
       end
-      should "categorise a page in realtime and not store it" do
-        page = @collection.pages.create({:uri => @@STORY}, :realtime => true, :store => false)
+      should "be able to analyse a page without a collection" do
+        page = @muddyit.extract(@@STORY)
         assert page.entities.length > 0
-        pages = @collection.pages.find(:all)
-        assert pages[:pages].length == 0
       end
-      should "categorise a page in realtime and store it" do
-        page = @collection.pages.create({:uri => @@STORY}, :realtime => true, :store => true)
-        assert page.entities.length > 0
-        pages = @collection.pages.find(:all)
-        assert_equal pages[:pages].length, 1
+      should 'be able to create a collection' do
+        collection = @muddyit.collections.create(@@COLLECTION_LABEL, 'http://www.test.com')
+        assert !collection.token.nil?
       end
-      context "with a page" do
-        setup do
-          @page = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+      should 'be able to find a collection' do
+        # This is a bit rubbish
+        @muddyit.collections.find(:all).each do |collection|
+          if collection.label == @@COLLECTION_LABEL
+            assert true
+          end
         end
-        should "find a page" do
-          assert_equal @collection.pages.find(@page.identifier).identifier, @page.identifier
-        end
-        should "have page attributes" do
-          assert !@page.identifier.nil?
-          assert !@page.title.nil?
-          assert !@page.created_at.nil?
-          assert !@page.content.nil?
-          assert !@page.uri.nil?
-          #assert !@page.token.nil?
-          # More attributes here ?
+      end
+      should 'be able to destroy a collection' do
+        # This is also a bit rubbish
+        collections = @muddyit.collections.find(:all)
+        collections.each do |collection|
+          if collection.label == @@COLLECTION_LABEL
+             res = collection.destroy
+             assert_equal res.code, "200"
+          end
         end
-        should "have many entities" do
-          assert @page.entities.length > 0
+      end
+      context "with a collection" do
+        setup do
+          @collection = @muddyit.collections.create(@@COLLECTION_LABEL, 'http://www.test.com')
         end
-        should "have an entity with a term and label" do
-          entity = @page.entities.first
-          assert !entity.term.nil?
-          assert !entity.uri.nil?
+        should "categorise a page in realtime and not store it" do
+          page = @collection.pages.create({:uri => @@STORY}, :realtime => true, :store => false)
+          assert page.entities.length > 0
+          pages = @collection.pages.find(:all)
+          assert pages[:pages].length == 0
         end
-        should "have extracted content" do
-          assert !@page.extracted_content.content.nil?
-          assert @page.extracted_content.terms.length > 0
-          assert @page.extracted_content.start_position > 0
-          assert @page.extracted_content.end_position > 0
+        should "categorise a page in realtime and store it" do
+          page = @collection.pages.create({:uri => @@STORY}, :realtime => true, :store => true)
+          assert page.entities.length > 0
+          pages = @collection.pages.find(:all)
+          assert_equal pages[:pages].length, 1
         end
-        should "delete a page" do
-          assert @page.destroy, "200"
+        context "with a page" do
+          setup do
+            @page = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+          end
+          should "find a page" do
+            assert_equal @collection.pages.find(@page.identifier).identifier, @page.identifier
+          end
+          should "have page attributes" do
+            assert !@page.identifier.nil?
+            assert !@page.title.nil?
+            assert !@page.created_at.nil?
+            assert !@page.content.nil?
+            assert !@page.uri.nil?
+            #assert !@page.token.nil?
+            # More attributes here ?
+          end
+          should "have many entities" do
+            assert @page.entities.length > 0
+          end
+          should "have an entity with a term and label" do
+            entity = @page.entities.first
+            assert !entity.term.nil?
+            assert !entity.uri.nil?
+          end
+          should "have extracted content" do
+            assert !@page.extracted_content.content.nil?
+            assert @page.extracted_content.terms.length > 0
+            assert @page.extracted_content.start_position > 0
+            assert @page.extracted_content.end_position > 0
+          end
+          should "delete a page" do
+            assert @page.destroy, "200"
+          end
         end
-      end
-      context "with two pages" do
-        setup do
-          @page1 = @collection.pages.create({:uri => @@STORY}, :realtime => true)
-          @page2 = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+        context "with two pages" do
+          setup do
+            @page1 = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+            @page2 = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+          end
+          should "find all pages" do
+            assert_equal @collection.pages.find(:all).length, 2
+          end
+          should "find related pages" do
+            assert_equal @page1.related_content.length, 1
+          end
         end
-        should "find all pages" do
-          assert_equal @collection.pages.find(:all).length, 2
+        teardown do
+          #token = @collection.token
+          @collection.destroy
+          #res = @muddyit.collections.find(token)
+          # This should be a 404 (!)
+          #assert_equal res.code, "404"
         end
-        should "find related pages" do
-          assert_equal @page1.related_content.length, 1
-        end
       end
-      teardown do
-        #token = @collection.token
-        @collection.destroy
-        #res = @muddyit.collections.find(token)
-        # This should be a 404 (!)
-        #assert_equal res.code, "404"
-      end
     end
-  end
 end
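
The 'be able to find a collection' test above (marked "a bit rubbish" in its own comment) only asserts inside the if, so it also passes when no collection carries the label. One possible tightening, sketched here without the gem or a test framework, is to reduce the search to a single boolean with Enumerable#any? and assert on that. FakeCollection and collection_with_label? are hypothetical stand-ins, not part of muddyit_fu.

```ruby
# Stand-in for the objects returned by collections.find(:all); hypothetical.
FakeCollection = Struct.new(:label, :token)

# True when at least one collection carries the given label.
def collection_with_label?(collections, label)
  collections.any? { |collection| collection.label == label }
end

collections = [FakeCollection.new('news', 'a0ret4'),
               FakeCollection.new('sport', 'b1xyz2')]

puts collection_with_label?(collections, 'news')     # prints "true"
puts collection_with_label?(collections, 'missing')  # prints "false"
```

In the test itself this would amount to a single assert on the helper's result, assuming collections.find(:all) returns an Enumerable (the existing test already calls each on it).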

data/test/thing.rb ADDED

@@ -0,0 +1,13 @@
+#!/usr/bin/ruby
+require 'rubygems'
+$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
+require 'muddyit_fu'
+  Muddyit.REST_ENDPOINT = 'http://staging.muddy.it'
+  #muddyit =  Muddyit.new('./config.yml')
+  page = Muddyit.extract(ARGV[0], :disambiguate => true, :include_unclassified => true, :include_content => true)
+  pp page.extracted_content.terms
+  page.entities.each do |entity|
+    puts "\t#{entity.term}, #{entity.uri}, #{entity.classification}"
+  end
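
The script above overrides the module-level default endpoint before calling Muddyit.extract. The gem does this through its own class_attr_accessor helper, whose definition is not part of this diff. A plain-Ruby sketch of the same idea, assuming nothing about that helper, could look like the following. EndpointDefaults and resolve are hypothetical names. The sketch also falls back to the default when :rest_endpoint is present but nil, which differs from the key? check in base.rb; that case can arise in the test setup, where :rest_endpoint => c['rest_endpoint'] is passed even if the config has no such entry.

```ruby
# Hypothetical module, not the gem's code: a module-level default endpoint
# that callers may override, plus a per-call override via a config hash.
module EndpointDefaults
  @rest_endpoint = 'http://muddy.it'

  class << self
    # Reader and writer for the module-level default.
    attr_accessor :rest_endpoint
  end

  # Prefer an explicit, non-nil :rest_endpoint; otherwise use the default.
  def self.resolve(config = {})
    config[:rest_endpoint] || rest_endpoint
  end
end

puts EndpointDefaults.resolve
EndpointDefaults.rest_endpoint = 'http://staging.muddy.it'
puts EndpointDefaults.resolve
puts EndpointDefaults.resolve(:rest_endpoint => 'http://localhost:3000')
```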

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: muddyit_fu
 version: !ruby/object:Gem::Version
-  version: 0.2.11
+  version: 0.2.12
 platform: ruby
 authors:
 - rattle
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2010-01-11 00:00:00 +00:00
+date: 2010-01-18 00:00:00 +00:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -96,6 +96,7 @@ signing_key:
 specification_version: 3
 summary: Provides a ruby interface to muddy.it
 test_files:
+- test/thing.rb
 - test/test_muddyit_fu.rb
 - test/test_helper.rb
 - examples/newsindexer.rb