RubyGems - muddyit_fu - Versions diffs - 0.2.11 → 0.2.12 - Mend

muddyit_fu 0.2.11 → 0.2.12

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

data/README.rdoc CHANGED

@@ -2,9 +2,7 @@
 Muddy is an information extraction platform.  For further
 details see the '{Getting Started with Muddy}[http://blog.muddy.it/2009/11/getting-started-with-muddy]'
-article.  This gem provides access to the Muddy platform via it's API :
-{Muddy Developer Guide}[http://muddy.it/developers/]
+article.  This gem provides access to the Muddy platform via it's API (see {Muddy Developer Guide}[http://muddy.it/developers/]).
 == Installation
@@ -16,7 +14,7 @@ article.  This gem provides access to the Muddy platform via it's API :
 Muddy supports OAuth and HTTP Basic auth for authentication and authorisation.
 We recommend you use OAuth wherever possible when accessing Muddy.  An example
-of using OAuth with the muddy platform is descibed in the
+of using OAuth with the Muddy platform is described in the
 {Building with Muddy and OAuth}[http://blog.muddy.it/2010/01/building-with-muddy-and-oauth]
 article.
@@ -59,7 +57,7 @@ URL rather than text, just specify a URL instead :
 Muddy allows you to store the entity extraction results so aggregate operations
 can be performed over a collection of content (a 'collection' has many analysed 'pages').
-A basic muddy account provides a single 'collection' where extraction results
+A basic Muddy account provides a single 'collection' where extraction results
 can be stored.
 To store a page against a collection, the collection must first be found :
@@ -70,7 +68,16 @@ Once a collection has been found, entity extraction results can be stored in it:
   collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:minium_confidence => 0.2})
-== Viewing all analysed pages in a collection
+== Working with a collection
+A collection allows aggregate operations to be perfomed on itself and on it's
+members.  A collection is identified by it's 'collection token'.  This is an
+alphanumeric six character string (e.g. 'a0ret4').  A collection can be found if
+it's token is known :
+  collection = muddyit.collections.find('a0ret4')
+=== Viewing all analysed pages
 You can iterate through all the analysed pages in a collection, be aware that
 the Muddy API provides the pages as paginated sets, so it may take some time to
@@ -87,25 +94,42 @@ for each new paginated set of results).
     end
   end
-== Working with a collection
+=== Finding a particular page or pages
-A collection allows aggregate operations to be perfomed on itself and on it's
-members.  A collection is identified by it's 'collection token'.  This is an
-alphanumeric six character string (e.g. 'a0ret4').  A collection can be found if
-it's token is known :
+Each page in a collection is assigned a unique alphanumeric identifier.  Whilst
+this can be used to find a given page in a collection, it is possible to search
+for the page using other attributes :
-  collection = muddyit.collections.find('a0ret4')
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/business/8186840.stm').first
+  page = collection.pages.find(:all, :title => 'BBC NEWS | Business | ITV in 25m Friends Reunited sale').first
+=== Rereshing a page's results
+A page can be 'refereshed' (the entity extraction is run again) by calling the
+refresh method on a page object :
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  updated_page = page.update
+=== Deleting a page from a collection
-=== View all pages containing 'Gordon Brown'
+A page can be removed from a collection by calling the 'destroy' method on a
+page object :
-If we want to find all references to the grounded entity for 'Gordon Brown 'then
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  page.destroy
+=== View all pages containing entity 'Gordon Brown'
+If we want to find all pages that reference the grounded entity for 'Gordon Brown' then
 it can be searched for using it's DBpedia URI :
   require 'muddyit_fu'
   muddyit = Muddyit.new('./config.yml')
   collection = muddyit.collections.find('a0ret4')
   collection.pages.find_by_entity('http://dbpedia.org/resource/Gordon_Brown') do |page|
-    puts page.identifier
+    puts "#{page.identifier} - #{page.title}"
   end
 === Find related entities for 'Gordon Brown'
@@ -118,7 +142,7 @@ collection :
   collection = muddyit.collections.find('a0ret4')
   puts "Related entity\tOccurance
   collection.entities.find_related('http://dbpedia.org/resource/Gordon_Brown').each do |entry|
-    puts "#{entry[:enity].uri}\t#{entry[:count]}"
+    puts "#{entry[:entity].uri}\t#{entry[:count]}"
   end
 === Find related content for : http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm
@@ -135,6 +159,17 @@ analysed page that has a uri 'http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm
     puts "#{results[:page].title} #{results[:count]}"
   end
+== Batch processing content and the Muddy queue
+The Muddy platform runs a background job queue that allows many requests to be
+made in quick succession (rather than waiting for the full extraction request to
+complete), with analysis of the pages happening asynchronously via the queue
+and being stored in the collection at a later time.  This can be useful when trying
+to analyse large content collections.  To send a request to the queue use :
+  collection = muddyit.collections.find('a0ret4')
+  collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:realtime => false})
 == Contact
   Author: Rob Lee

data/VERSION CHANGED

@@ -1 +1 @@
-0.2.11
+0.2.12

data/lib/muddyit/base.rb CHANGED

@@ -1,9 +1,19 @@
 module Muddyit
+  class_attr_accessor :REST_ENDPOINT
+  @@REST_ENDPOINT = 'http://muddy.it'
   def self.new(*params)
     Muddyit::Base.new(*params)
   end
+  # Shortcut class method for extract
+  def self.extract(doc, options={})
+    @muddyit = Muddyit.new()
+    @muddyit.extract(doc, options)
+  end
   class Base
     class_attr_accessor :http_open_timeout
     class_attr_accessor :http_read_timeout
@@ -13,8 +23,6 @@ module Muddyit
     @@http_open_timeout = 120
     @@http_read_timeout = 120
-    REST_ENDPOINT = 'http://www.muddy.it'
     # Set the request signing method
     @@digest1   = OpenSSL::Digest::Digest.new("sha1")
     @@digest256 = nil
@@ -47,7 +55,8 @@ module Muddyit
     # access_token: CCC
     # access_token_secret: DDD
     #
-    def initialize(config_hash_or_file)
+    def initialize(config_hash_or_file = {})
       if config_hash_or_file.is_a? Hash
         config_hash_or_file.nested_symbolize_keys!
         @username = config_hash_or_file[:username]
@@ -56,7 +65,7 @@ module Muddyit
         @consumer_secret = config_hash_or_file[:consumer_secret]
         @access_token = config_hash_or_file[:access_token]
         @access_token_secret = config_hash_or_file[:access_token_secret]
-        @rest_endpoint = config_hash_or_file.has_key?(:rest_endpoint) ? config_hash_or_file[:rest_endpoint] : REST_ENDPOINT
+        @rest_endpoint = config_hash_or_file.key?(:rest_endpoint) ? config_hash_or_file[:rest_endpoint] : Muddyit.REST_ENDPOINT
       else
         config = YAML.load_file(config_hash_or_file)
         config.nested_symbolize_keys!
@@ -66,7 +75,7 @@ module Muddyit
         @consumer_secret = config[:consumer_secret]
         @access_token = config[:access_token]
         @access_token_secret = config[:access_token_secret]
-        @rest_endpoint = config.has_key?(:rest_endpoint) ? config[:rest_endpoint] : REST_ENDPOINT
+        @rest_endpoint = config.key?(:rest_endpoint) ? config[:rest_endpoint] : Muddyit.REST_ENDPOINT
       end
       if !@consumer_key.nil?
@@ -75,10 +84,7 @@ module Muddyit
         @accesstoken = ::OAuth::AccessToken.new(@consumer, @access_token, @access_token_secret)
       elsif !@username.nil?
         @auth_type = :basic
-      else
-        raise "unable to find authentication credentials"
       end
     end
     # sends a request to the muddyit REST api
@@ -99,7 +105,7 @@ module Muddyit
       case @auth_type
       when :oauth
         res = oauth_request_over_http(api_url, http_method, opts, body)
-      when :basic
+      when :basic, nil
         res = basic_request_over_http(api_url, http_method, opts, body)
       end
@@ -149,7 +155,7 @@ module Muddyit
       response = self.send_request(api_url, :post, {}, body.to_json)
       return Muddyit::Collections::Collection::Pages::Page.new(self, response)
     end
     protected
     # For easier testing. You can mock this method with a XML file you re expecting to receive
@@ -175,6 +181,12 @@ module Muddyit
     def basic_request_over_http(path, http_method, opts, data)
+      # We only allow access to /extract as an unauthenticated user
+      # all other paths should raise an error
+      if @auth_type == nil && path != '/extract'
+        raise "invalid authentication credentials supplied, are the details correct ?"
+      end
       http_opts = { "Accept" => "application/json", "Content-Type" => "application/json", "User-Agent" => "muddyit_fu" }
       query_string = opts.to_a.map {|x| x.join("=")}.join("&")
@@ -196,14 +208,12 @@ module Muddyit
         request.basic_auth @username, @password
         request["Content-Length"] = 0 # Default to 0
       when :get
-        request = Net::HTTP::Get.new(path,headers)
+        path_with_query_string = opts.empty? ? path : "#{path}?#{query_string}"
+        request = Net::HTTP::Get.new(path_with_query_string, headers)
         request.basic_auth @username, @password
       when :delete
         request =  Net::HTTP::Delete.new(path,headers)
         request.basic_auth @username, @password
-      when :head
-        request = Net::HTTP::Head.new(path,headers)
-        request.basic_auth @username, @password
       else
         raise ArgumentError, "Don't know how to handle http_method: :#{http_method.to_s}"
       end
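
Reviewer note on the hunks above: the two behavioural changes in `basic_request_over_http` (the unauthenticated guard and the GET query string) can be exercised in isolation. The following is a minimal sketch, not the gem's code: the helper names `check_access!`, `path_with_query_string` and `escaped_path_with_query_string` and the `/pages` path are hypothetical, introduced only for illustration. The first two helpers mirror the logic shown in the diff; the third is an assumed alternative using the standard library's `URI.encode_www_form`, which escapes values that the plain `join("=")` approach passes through unchanged.

```ruby
require 'uri'

# Hypothetical helper mirroring the guard added in this release:
# with no credentials (auth type nil) only '/extract' is allowed.
def check_access!(auth_type, path)
  if auth_type.nil? && path != '/extract'
    raise "invalid authentication credentials supplied, are the details correct ?"
  end
  true
end

# Hypothetical helper mirroring the new GET branch: the query string is
# appended only when options were supplied. Values are NOT escaped here,
# which matches the join-based construction shown in the diff.
def path_with_query_string(path, opts)
  query_string = opts.to_a.map { |pair| pair.join('=') }.join('&')
  opts.empty? ? path : "#{path}?#{query_string}"
end

# Assumed alternative (not part of the gem): let the standard library
# build and escape the query string.
def escaped_path_with_query_string(path, opts)
  opts.empty? ? path : "#{path}?#{URI.encode_www_form(opts)}"
end

puts path_with_query_string('/pages', {})
puts path_with_query_string('/pages', { :page => 2 })
puts escaped_path_with_query_string('/pages', { :title => 'a b' })
```

If the options can carry URIs or titles, as the README's `find(:all, :uri => ...)` examples suggest, the escaped variant is probably the safer choice, though whether the server expects escaped values is not confirmed by this diff.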

data/muddyit_fu.gemspec CHANGED

@@ -2,11 +2,11 @@
 Gem::Specification.new do |s|
   s.name = %q{muddyit_fu}
-  s.version = "0.2.11"
+  s.version = "0.2.12"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["rattle"]
-  s.date = %q{2010-01-11}
+  s.date = %q{2010-01-18}
   s.email = %q{support[at]muddy.it}
   s.extra_rdoc_files = [
     "LICENSE",
@@ -45,7 +45,8 @@ Gem::Specification.new do |s|
   s.rubygems_version = %q{1.3.5}
   s.summary = %q{Provides a ruby interface to muddy.it}
   s.test_files = [
-    "test/test_muddyit_fu.rb",
+    "test/thing.rb",
+     "test/test_muddyit_fu.rb",
      "test/test_helper.rb",
      "examples/newsindexer.rb",
      "examples/oauth.rb"

data/test/test_muddyit_fu.rb CHANGED

@@ -6,143 +6,157 @@ class TestMuddyitFu < Test::Unit::TestCase
   @@COLLECTION_LABEL = Time.now.to_s
   @@STORY = 'http://news.bbc.co.uk/1/hi/business/8186840.stm'
-  context 'A muddy account' do
+  context 'A user without a muddy account' do
     setup do
       c = load_config
-      begin
-      @muddyit = Muddyit.new(:consumer_key => c['consumer_key'],
-                             :consumer_secret => c['consumer_secret'],
-                             :access_token => c['access_token'],
-                             :access_token_secret => c['access_token_secret'],
-                             :rest_endpoint => c['rest_endpoint'],
-                             :username => c['username'],
-                             :password => c['password'])
-      rescue
-        puts "Failed to connect to muddy, are the details correct ?"
-      end
+      Muddyit.REST_ENDPOINT = c['rest_endpoint'] if c.key?('rest_endpoint')
     end
-    should "analyse a page without a collection" do
-      page = @muddyit.extract(@@STORY)
+    should "be able to analyse a page without a collection" do
+      page = Muddyit.extract(@@STORY)
       assert page.entities.length > 0
     end
-    should 'be able to create a collection' do
-      collection = @muddyit.collections.create(@@COLLECTION_LABEL, 'http://www.test.com')
-      assert !collection.token.nil?
-    end
-    should 'be able to find a collection' do
-      # This is a bit rubbish
-      @muddyit.collections.find(:all).each do |collection|
-        if collection.label == @@COLLECTION_LABEL
-          assert true
-        end
-      end
-    end
-    should 'be able to destroy a collection' do
-      # This is also a bit rubbish
-      collections = @muddyit.collections.find(:all)
-      collections.each do |collection|
-        if collection.label == @@COLLECTION_LABEL
-           res = collection.destroy
-           assert_equal res.code, "200"
-        end
-      end
-    end
-    context "with a collection" do
+  end
+    context 'A user with a muddy account' do
       setup do
-        @collection = @muddyit.collections.create(@@COLLECTION_LABEL, 'http://www.test.com')
+        c = load_config
+        begin
+        @muddyit = Muddyit.new(:consumer_key => c['consumer_key'],
+                               :consumer_secret => c['consumer_secret'],
+                               :access_token => c['access_token'],
+                               :access_token_secret => c['access_token_secret'],
+                               :rest_endpoint => c['rest_endpoint'],
+                               :username => c['username'],
+                               :password => c['password'])
+        rescue
+          puts "Failed to connect to muddy, are the details correct ?"
+        end
       end
-      should "categorise a page in realtime and not store it" do
-        page = @collection.pages.create({:uri => @@STORY}, :realtime => true, :store => false)
+      should "be able to analyse a page without a collection" do
+        page = @muddyit.extract(@@STORY)
         assert page.entities.length > 0
-        pages = @collection.pages.find(:all)
-        assert pages[:pages].length == 0
       end
-      should "categorise a page in realtime and store it" do
-        page = @collection.pages.create({:uri => @@STORY}, :realtime => true, :store => true)
-        assert page.entities.length > 0
-        pages = @collection.pages.find(:all)
-        assert_equal pages[:pages].length, 1
+      should 'be able to create a collection' do
+        collection = @muddyit.collections.create(@@COLLECTION_LABEL, 'http://www.test.com')
+        assert !collection.token.nil?
       end
-      context "with a page" do
-        setup do
-          @page = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+      should 'be able to find a collection' do
+        # This is a bit rubbish
+        @muddyit.collections.find(:all).each do |collection|
+          if collection.label == @@COLLECTION_LABEL
+            assert true
+          end
         end
-        should "find a page" do
-          assert_equal @collection.pages.find(@page.identifier).identifier, @page.identifier
-        end
-        should "have page attributes" do
-          assert !@page.identifier.nil?
-          assert !@page.title.nil?
-          assert !@page.created_at.nil?
-          assert !@page.content.nil?
-          assert !@page.uri.nil?
-          #assert !@page.token.nil?
-          # More attributes here ?
+      end
+      should 'be able to destroy a collection' do
+        # This is also a bit rubbish
+        collections = @muddyit.collections.find(:all)
+        collections.each do |collection|
+          if collection.label == @@COLLECTION_LABEL
+             res = collection.destroy
+             assert_equal res.code, "200"
+          end
         end
-        should "have many entities" do
-          assert @page.entities.length > 0
+      end
+      context "with a collection" do
+        setup do
+          @collection = @muddyit.collections.create(@@COLLECTION_LABEL, 'http://www.test.com')
         end
-        should "have an entity with a term and label" do
-          entity = @page.entities.first
-          assert !entity.term.nil?
-          assert !entity.uri.nil?
+        should "categorise a page in realtime and not store it" do
+          page = @collection.pages.create({:uri => @@STORY}, :realtime => true, :store => false)
+          assert page.entities.length > 0
+          pages = @collection.pages.find(:all)
+          assert pages[:pages].length == 0
         end
-        should "have extracted content" do
-          assert !@page.extracted_content.content.nil?
-          assert @page.extracted_content.terms.length > 0
-          assert @page.extracted_content.start_position > 0
-          assert @page.extracted_content.end_position > 0
+        should "categorise a page in realtime and store it" do
+          page = @collection.pages.create({:uri => @@STORY}, :realtime => true, :store => true)
+          assert page.entities.length > 0
+          pages = @collection.pages.find(:all)
+          assert_equal pages[:pages].length, 1
         end
-        should "delete a page" do
-          assert @page.destroy, "200"
+        context "with a page" do
+          setup do
+            @page = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+          end
+          should "find a page" do
+            assert_equal @collection.pages.find(@page.identifier).identifier, @page.identifier
+          end
+          should "have page attributes" do
+            assert !@page.identifier.nil?
+            assert !@page.title.nil?
+            assert !@page.created_at.nil?
+            assert !@page.content.nil?
+            assert !@page.uri.nil?
+            #assert !@page.token.nil?
+            # More attributes here ?
+          end
+          should "have many entities" do
+            assert @page.entities.length > 0
+          end
+          should "have an entity with a term and label" do
+            entity = @page.entities.first
+            assert !entity.term.nil?
+            assert !entity.uri.nil?
+          end
+          should "have extracted content" do
+            assert !@page.extracted_content.content.nil?
+            assert @page.extracted_content.terms.length > 0
+            assert @page.extracted_content.start_position > 0
+            assert @page.extracted_content.end_position > 0
+          end
+          should "delete a page" do
+            assert @page.destroy, "200"
+          end
         end
-      end
-      context "with two pages" do
-        setup do
-          @page1 = @collection.pages.create({:uri => @@STORY}, :realtime => true)
-          @page2 = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+        context "with two pages" do
+          setup do
+            @page1 = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+            @page2 = @collection.pages.create({:uri => @@STORY}, :realtime => true)
+          end
+          should "find all pages" do
+            assert_equal @collection.pages.find(:all).length, 2
+          end
+          should "find related pages" do
+            assert_equal @page1.related_content.length, 1
+          end
         end
-        should "find all pages" do
-          assert_equal @collection.pages.find(:all).length, 2
+        teardown do
+          #token = @collection.token
+          @collection.destroy
+          #res = @muddyit.collections.find(token)
+          # This should be a 404 (!)
+          #assert_equal res.code, "404"
         end
-        should "find related pages" do
-          assert_equal @page1.related_content.length, 1
-        end
       end
-      teardown do
-        #token = @collection.token
-        @collection.destroy
-        #res = @muddyit.collections.find(token)
-        # This should be a 404 (!)
-        #assert_equal res.code, "404"
-      end
     end
-  end
 end

data/test/thing.rb ADDED

@@ -0,0 +1,13 @@
+#!/usr/bin/ruby
+require 'rubygems'
+$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
+require 'muddyit_fu'
+  Muddyit.REST_ENDPOINT = 'http://staging.muddy.it'
+  #muddyit =  Muddyit.new('./config.yml')
+  page = Muddyit.extract(ARGV[0], :disambiguate => true, :include_unclassified => true, :include_content => true)
+  pp page.extracted_content.terms
+  page.entities.each do |entity|
+    puts "\t#{entity.term}, #{entity.uri}, #{entity.classification}"
+  end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: muddyit_fu
 version: !ruby/object:Gem::Version
-  version: 0.2.11
+  version: 0.2.12
 platform: ruby
 authors:
 - rattle
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2010-01-11 00:00:00 +00:00
+date: 2010-01-18 00:00:00 +00:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -96,6 +96,7 @@ signing_key:
 specification_version: 3
 summary: Provides a ruby interface to muddy.it
 test_files:
+- test/thing.rb
 - test/test_muddyit_fu.rb
 - test/test_helper.rb
 - examples/newsindexer.rb