RubyGems - recommendify - Versions diffs - 0.0.1 → 0.1.0 - Mend

recommendify 0.0.1 → 0.1.0

Files changed (15)

data/README.md +28 -23
data/Rakefile +9 -0
data/doc/example.rb +2 -1
data/lib/recommendify/jaccard_input_matrix.rb +28 -9
data/recommendify.gemspec +24 -0
data/spec/similarity_matrix_spec.rb +2 -0
data/src/cc_item.h +8 -0
data/src/cosine.c +3 -0
data/src/iikey.c +18 -0
data/src/jaccard.c +19 -0
data/src/output.c +22 -0
data/src/recommendify.c +184 -0
data/src/sort.c +23 -0
data/src/version.h +17 -0
metadata +14 -24

data/README.md CHANGED

@@ -1,10 +1,11 @@
 recommendify
 ============
-Incremental and distributed item-based "Collaborative Filtering" for binary ratings with ruby and redis. In a nutshell: You feed in `user -> item` interactions and it spits out similarity vectors between items ("related items").
+Incremental and distributed item-based "Collaborative Filtering" for binary ratings with ruby and redis. In a nutshell: You feed in `user -> item` interactions and it spits out similarity vectors between items ("related items").  __scroll down for a demo...__
 [ ![Build status - Travis-ci](https://secure.travis-ci.org/paulasmuth/recommendify.png) ](http://travis-ci.org/paulasmuth/recommendify)
 ### use cases
 + "Users that bought this product also bought...".
@@ -12,23 +13,20 @@ Incremental and distributed item-based "Collaborative Filtering" for binary rati
 + "Users that follow this person also follow...".
+usage
+-----
-### how it works
-Recommendify keeps an incrementally updated `item x item` matrix, the "co-concurrency matrix". This matrix stores the number of times that a combination of two items has appeared in an interaction/preferrence set. The co-concurrence counts are processed with a similarity measure to retrieve another `item x item` similarity matrix, which is used to find the N most similar items for each item. This approach was described by Miranda, Alipio et al. [1]
-1. Group the input user->item pairs by user-id and store them into interaction sets
-2. For each item<->item combination in the interaction set increment the respective element in the co-concurrence matrix
-3. For each item<->item combination in the co-concurrence matrix calculate the item<->item similarity
-3. For each item store the N most similar items in the respective output set.
-Fnord is not a draft!
+Your data should look something like this:
+```
+# which items are frequently bought togehter?
+[order23] product5 produt42 product17
+[order42] product8 produt16 product32
-usage
------
+# which users are frequently watched/followed together?
+[user4] user9 user11 user12
+[user9] user6 user8 user11
+```
 You can add new interaction-sets to the processor incrementally, but the similarity matrix has to be manually re-processed after new interactions were added to any of the applied processors. However, the processing happens on-line and you can keep track of the changed items so you only have to re-calculate the changed rows of the matrix.
@@ -91,24 +89,32 @@ recommender.for("item23")
 recommender.remove_item!("item23")
 ```
+### how it works
+Recommendify keeps an incrementally updated `item x item` matrix, the "co-concurrency matrix". This matrix stores the number of times that a combination of two items has appeared in an interaction/preferrence set. The co-concurrence counts are processed with a similarity measure to retrieve another `item x item` similarity matrix, which is used to find the N most similar items for each item. This approach was described by Miranda, Alipio et al. [1]
-demo?
------
+1. Group the input user->item pairs by user-id and store them into interaction sets
+2. For each item<->item combination in the interaction set increment the respective element in the co-concurrence matrix
+3. For each item<->item combination in the co-concurrence matrix calculate the item<->item similarity
+3. For each item store the N most similar items in the respective output set.
-[ ![Example Results](https://raw.github.com/paulasmuth/recommendify/master/doc/example.png) ](http://falbala.23loc.com/~paul/recommendify_out_1.html)
-full snippet: http://falbala.23loc.com/~paul/recommendify_out_1.html
+### does it scale?
-These recommendations were calculated from 2,3mb "profile visit"-data (taken from www.talentsuche.de). Initially processing the 120.047 `visitor_id->profile_id` pairs currently takes around half an hour on a single core and creates a 126.64mb hashtable in redis. You can try this for yourself; the complete data and code is in `doc/example.rb` and `doc/example_data.csv`.
+The maximum number of entries in the co-concurrence and similarity matrix is k(n) = (n^2)-(n/2), it grows O(n^2). However, in a real scenario it is very unlikely that all item<->item combinations appear in a interaction set and we use a sparse matrix which will only use memory for elemtens with a value > 0. The size of the similarity grows O(n).
+example
+-------
-### does it scale?
+These recommendations were calculated from 2,3mb "profile visit"-data (taken from www.talentsuche.de) - keep in mind that the recommender uses only visitor->visited data, it __doesn't know the gender__  of a user.
-The maximum number of entries in the co-concurrence and similarity matrix is k(n) = (n^2)-(n/2), it grows O(n^2). However, in a real scenario it is very unlikely that all item<->item combinations appear in a interaction set and we use a sparse matrix which will only use memory for elemtens with a value > 0. The size of the similarity grows O(n).
+[ ![Example Results](https://raw.github.com/paulasmuth/recommendify/master/doc/example.png) ](http://falbala.23loc.com/~paul/recommendify_out_1.html)
+full snippet: http://falbala.23loc.com/~paul/recommendify_out_1.html
+Initially processing the 120.047 `visitor_id->profile_id` pairs currently takes around half an hour on a single core and creates a 126.64mb hashtable in redis. The high memory usage of >100mb for only 5000 items is due to the very long user rows. If you limit the user rows to 100 items (mahout's default) it shrinks to 31mb for the 5k items from example_data.csv. In another real data set with very short user rows (purchase/payment data) it used only 3.4mb for 90k items with very good results. You can try this for yourself; the complete data and code is in `doc/example.rb` and `doc/example_data.csv`.
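
Reviewer note on the "how it works" steps in this README hunk: the four steps can be sketched in plain Ruby without redis. This is only an illustration under stated assumptions. The in-memory hashes, the sample orders, and the helper names `jaccard` and `top_n` are hypothetical and are not part of the gem's API.

```ruby
# Step 1: interaction sets, keyed by user/order id (made-up sample data).
interactions = {
  "order23" => %w[product5 product42 product17],
  "order42" => %w[product5 product42 product8],
  "order99" => %w[product5 product17],
}

# Step 2: count how often each item and each unordered item pair appears.
item_counts = Hash.new(0)
pair_counts = Hash.new(0)
interactions.each_value do |items|
  unique = items.uniq.sort
  unique.each { |item| item_counts[item] += 1 }
  unique.combination(2) { |a, b| pair_counts[[a, b]] += 1 }
end

# Step 3: Jaccard similarity computed from the counts (hypothetical helper).
def jaccard(a, b, item_counts, pair_counts)
  cc = pair_counts[[a, b].sort]
  return 0.0 if cc.zero?
  cc.to_f / (item_counts[a] + item_counts[b] - cc)
end

# Step 4: keep the n most similar items for one item (hypothetical helper).
def top_n(item, n, item_counts, pair_counts)
  others = item_counts.keys - [item]
  scored = others.map { |o| [o, jaccard(item, o, item_counts, pair_counts)] }
  scored.select { |_, s| s > 0 }.max_by(n) { |_, s| s }
end

p top_n("product5", 2, item_counts, pair_counts)
```

In the gem the counts live in redis hashes and are updated incrementally, so only the rows of changed items need to be recalculated.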
@@ -151,4 +157,3 @@ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLI
 + optimize sparsematrix memory usage (somehow)
 + make max_row length configurable
 + option: only add items where co-concurreny/appearnce-count > n

data/Rakefile CHANGED

@@ -10,3 +10,12 @@ task :default => "spec"
 desc "Generate documentation"
 task YARD::Rake::YardocTask.new
+desc "Compile the native client"
+task :build_native do
+  out_dir = ::File.expand_path("../bin", __FILE__)
+  src_dir = ::File.expand_path("../src", __FILE__)
+  %x{mkdir -p #{out_dir}}
+  %x{gcc -Wall #{src_dir}/recommendify.c -lhiredis -o #{out_dir}/recommendify}
+end

data/doc/example.rb CHANGED

@@ -26,7 +26,8 @@ end
 # add the test data to the recommender
 buckets.each do |user_id, items|
-  puts "#{user_id} -> #{items.join(",")}"
+  puts "#{user_id} -> #{items.join(",")}"
+  items = items[0..99] # do not add more than 100 items per user
   recommender.visits.add_set(user_id, items)
 end
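
Reviewer note on the new `items[0..99]` line: a user row with m items touches m(m-1)/2 unordered item pairs, so capping the row length bounds the work and memory per user. The sketch below only illustrates that arithmetic. `pairs_in_row`, `truncate_row` and `MAX_ROW_LENGTH` are hypothetical names, not part of the gem.

```ruby
# Assumption: mirrors the items[0..99] cap added in doc/example.rb.
MAX_ROW_LENGTH = 100

# Number of unordered item pairs that a single row with m items touches.
def pairs_in_row(m)
  m * (m - 1) / 2
end

# Keep at most `max` items of a row; equivalent to items[0..max-1].
def truncate_row(items, max = MAX_ROW_LENGTH)
  items.first(max)
end

row = (1..1000).map { |i| "profile#{i}" }
puts pairs_in_row(row.length)               # 499500
puts pairs_in_row(truncate_row(row).length) # 4950
```

Note that the cap keeps whichever items come first in the row, so the result depends on the input order.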

data/lib/recommendify/jaccard_input_matrix.rb CHANGED

@@ -2,27 +2,28 @@ class Recommendify::JaccardInputMatrix < Recommendify::InputMatrix
   include Recommendify::CCMatrix
-  def initialize(opts={})
-    super(opts)
+  def initialize(opts={})
+    check_native if opts[:native]
+    super(opts)
   end
   def similarity(item1, item2)
     calculate_jaccard_cached(item1, item2)
   end
-  # optimize: get all item-counts and the cc-row with 2 redis hmgets.
-  # optimize: don't return more than sm.max_neighbors items (truncate set while collecting)
   def similarities_for(item1)
-    # todo: optimize native. execute with own redis conn and write top K to stdout
-    # native_ouput = %x{recommendify_native jaccard "#{redis_key}" "#{item1}"}
-    # return native_output.split("\n").map{ |l| l.split(",") }
+    return run_native(item1) if @opts[:native]
+    calculate_similarities(item1)
+  end
+private
+  def calculate_similarities(item1)
     (all_items - [item1]).map do |item2|
       [item2, similarity(item1, item2)]
     end
   end
-private
   def calculate_jaccard_cached(item1, item2)
     val = ccmatrix[item1, item2]
     val.to_f / (item_count(item1)+item_count(item2)-val).to_f
@@ -32,4 +33,22 @@ private
     (set1&set2).length.to_f / (set1 + set2).uniq.length.to_f
   end
+  def run_native(item_id)
+    res = %x{#{native_path} --jaccard "#{redis_key}" "#{item_id}"}
+    res.split("\n").map do |line|
+      sim = line.match(/OUT: \(([^\)]*)\) \(([^\)]*)\)/)
+      raise "error: #{res}" unless sim
+      [sim[1], sim[2].to_f]
+    end
+  end
+  def check_native
+    return true if ::File.exists?(native_path)
+    raise "recommendify_native not found - you need to run rake build_native first"
+  end
+  def native_path
+    ::File.expand_path('../../../bin/recommendify', __FILE__)
+  end
 end
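
Reviewer note on `run_native`: the method interpolates `redis_key` and `item_id` into a `%x{}` shell string, so quotes or other shell metacharacters in an item id would be interpreted by the shell. A possible alternative, sketched below as a suggestion rather than the gem's implementation, passes the arguments as a list via `Open3.capture2` and keeps the parsing in a separate, testable method. `parse_native_output` and `OUT_LINE` are hypothetical names. Separately, `File.exists?` (used in `check_native`) was removed in Ruby 3.2, where `File.exist?` is the spelling to use.

```ruby
require "open3"

# Same line format that src/output.c prints: "OUT: (item_id) (0.1234)".
OUT_LINE = /\AOUT: \(([^)]*)\) \(([^)]*)\)\z/

# Parse the native client's stdout into [item_id, similarity] pairs.
def parse_native_output(res)
  res.each_line.map do |line|
    sim = line.chomp.match(OUT_LINE)
    raise "unexpected native output: #{line.inspect}" unless sim
    [sim[1], sim[2].to_f]
  end
end

# Run the binary with an argument list, so no shell parses the key or item id.
def run_native(native_path, redis_key, item_id)
  out, status = Open3.capture2(native_path, "--jaccard", redis_key, item_id)
  raise "native client failed (#{status.exitstatus})" unless status.success?
  parse_native_output(out)
end

p parse_native_output("OUT: (item42) (0.6667)\nOUT: (item7) (0.2500)\n")
```

As in the original regex, an item id that itself contains `)` would not parse.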

data/recommendify.gemspec ADDED

@@ -0,0 +1,24 @@
+# -*- encoding: utf-8 -*-
+$:.push File.expand_path("../lib", __FILE__)
+Gem::Specification.new do |s|
+  s.name        = "recommendify"
+  s.version     = "0.1.0"
+  s.date        = Date.today.to_s
+  s.platform    = Gem::Platform::RUBY
+  s.authors     = ["Paul Asmuth"]
+  s.email       = ["paul@paulasmuth.com"]
+  s.homepage    = "http://github.com/paulasmuth/recommendify"
+  s.summary     = %q{Distributed item-based "Collaborative Filtering" with ruby and redis}
+  s.description = %q{Distributed item-based "Collaborative Filtering" with ruby and redis}
+  s.licenses    = ["MIT"]
+  s.add_dependency "redis", ">= 2.2.2"
+  s.add_development_dependency "rspec", "~> 2.8.0"
+  s.files         = `git ls-files`.split("\n") - [".gitignore", ".rspec", ".travis.yml"]
+  s.test_files    = `git ls-files -- spec/*`.split("\n")
+  s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+  s.require_paths = ["lib"]
+end

data/spec/similarity_matrix_spec.rb CHANGED

@@ -88,6 +88,8 @@ describe Recommendify::SimilarityMatrix do
       @matrix["item_fnord"].should == {"item_blubb" => 0.6, "item_foo" => 0.4}
     end
+    it "should not call split on nil when retrieving a non-existent item (return an empty array)"
   end
 end

data/src/cc_item.h ADDED

@@ -0,0 +1,8 @@
+#define ITEM_ID_SIZE 64
+struct cc_item {
+  char  item_id[ITEM_ID_SIZE];
+  int   coconcurrency_count;
+  int   total_count;
+  float similarity;
+};

data/src/cosine.c ADDED

@@ -0,0 +1,3 @@
+void calculate_cosine(char *item_id, int itemCount, struct cc_item *cc_items, int cc_items_size){
+  /* here be dragons */
+}

data/src/iikey.c ADDED

@@ -0,0 +1,18 @@
+char* item_item_key(char *item1, char *item2){
+  int keylen = strlen(item1) + strlen(item2) + 2;
+  char *key = (char *)malloc(keylen * sizeof(char));
+  if(!key){
+    printf("cannot allocate\n");
+    return 0;
+  }
+  // FIXPAUL: make shure this does exactly the same as ruby sort
+  if(rb_strcmp(item1, item2) <= 0){
+    snprintf(key, keylen, "%s:%s", item1, item2);
+  } else {
+    snprintf(key, keylen, "%s:%s", item2, item1);
+  }
+  return key;
+}
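
Reviewer note on `item_item_key`: the FIXPAUL comment says the key must match what the Ruby side produces. The sketch below shows what that presumably looks like. It is an assumption that the Ruby sparse matrix builds its field names by sorting the two ids and joining them with `:`, since that code is not part of this diff.

```ruby
# Assumption: field names in the ccmatrix hash are "<smaller id>:<larger id>",
# where "smaller" means Ruby's default String ordering.
def item_item_key(item1, item2)
  [item1, item2].sort.join(":")
end

puts item_item_key("item9", "item10") # item10:item9
puts item_item_key("item10", "item9") # item10:item9
```

Because the ordering is by bytes rather than numeric, `item10` sorts before `item9`. The C version has to reproduce exactly this ordering or its HMGET lookups will miss.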

data/src/jaccard.c ADDED

@@ -0,0 +1,19 @@
+void calculate_jaccard(char *item_id, int itemCount, struct cc_item *cc_items, int cc_items_size){
+  int j, n;
+  for(j = 0; j < cc_items_size; j++){
+    n = cc_items[j].coconcurrency_count;
+    if(n>0){
+      cc_items[j].similarity = (
+        (float)n / (
+          (float)itemCount +
+          (float)cc_items[j].total_count -
+          (float)n
+        )
+      );
+    } else {
+      cc_items[j].similarity = 0.0;
+    }
+  }
+}
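
Reviewer note on `calculate_jaccard`: the count-based expression n / (itemCount + total_count - n) equals the set-based Jaccard index, because the size of the union is |A| + |B| - |A ∩ B|. A small check with made-up user sets (the helper names are hypothetical):

```ruby
# Set-based definition: |A & B| / |A | B|.
def jaccard_from_sets(set1, set2)
  (set1 & set2).length.to_f / (set1 | set2).length
end

# Count-based form, as in src/jaccard.c: cc / (count1 + count2 - cc).
def jaccard_from_counts(cc, count1, count2)
  return 0.0 if cc.zero?
  cc.to_f / (count1 + count2 - cc)
end

users_a = %w[u1 u2 u3]    # users who interacted with item A
users_b = %w[u2 u3 u4 u5] # users who interacted with item B
cc = (users_a & users_b).length

puts jaccard_from_sets(users_a, users_b)                     # 0.4
puts jaccard_from_counts(cc, users_a.length, users_b.length) # 0.4
```

The count-based form only needs three integers per pair, which is what makes the single-item lookup in the native client cheap.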

data/src/output.c ADDED

@@ -0,0 +1,22 @@
+int print_version(){
+  printf(
+    VERSION_STRING,
+    VERSION_MAJOR,
+    VERSION_MINOR,
+    VERSION_MICRO
+  );
+  return 0;
+}
+int print_usage(char *bin){
+  printf(USAGE_STRING, bin);
+  return 1;
+}
+void print_item(struct cc_item item){
+  printf(
+    "OUT: (%s) (%.4f)\n",
+    item.item_id,
+    item.similarity
+  );
+}

data/src/recommendify.c ADDED

@@ -0,0 +1,184 @@
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <hiredis/hiredis.h>
+#include "version.h"
+#include "cc_item.h"
+#include "jaccard.c"
+#include "cosine.c"
+#include "output.c"
+#include "sort.c"
+#include "iikey.c"
+int main(int argc, char **argv){
+  int i, j, n, similarityFunc = 0;
+  int itemCount = 0;
+  char *itemID;
+  char *redisPrefix;
+  redisContext *c;
+  redisReply *all_items;
+  redisReply *reply;
+  int cur_batch_size;
+  char* cur_batch;
+  char *iikey;
+  int batch_size = 200; /* FIXPAUL: make option */
+  int maxItems = 50; /* FIXPAUL: make option */
+  /* option parsing */
+  if(argc < 2)
+    return print_usage(argv[0]);
+  if(!strcmp(argv[1], "--version"))
+    return print_version();
+  if(!strcmp(argv[1], "--jaccard"))
+    similarityFunc = 1;
+  if(!strcmp(argv[1], "--cosine"))
+    similarityFunc = 2;
+  if(!similarityFunc){
+    printf("invalid option: %s\n", argv[1]);
+    return 1;
+  } else if(argc != 4){
+    printf("wrong number of arguments\n");
+    print_usage(argv[0]);
+    return 1;
+  }
+  redisPrefix = argv[2];
+  itemID = argv[3];
+  /* connect to redis */
+  struct timeval timeout = { 1, 500000 };
+  c = redisConnectWithTimeout("127.0.0.2", 6379, timeout);
+  if(c->err){
+    printf("connection to redis failed: %s\n", c->errstr);
+    return 1;
+  }
+  /* get item count */
+  reply = redisCommand(c,"HGET %s:items %s", redisPrefix, itemID);
+  itemCount = atoi(reply->str);
+  freeReplyObject(reply);
+  if(itemCount == 0){
+    printf("item count is zero\n");
+    return 0;
+  }
+  /* get all items_ids and the total counts */
+  all_items = redisCommand(c,"HGETALL %s:items", redisPrefix);
+  if(all_items->type != REDIS_REPLY_ARRAY)
+    return 1;
+  /* populate the cc_items array */
+  int cc_items_size = all_items->elements / 2;
+  int cc_items_mem = cc_items_size * sizeof(struct cc_item);
+  struct cc_item *cc_items = malloc(cc_items_mem);
+  cc_items_size--;
+  if(!cc_items){
+    printf("cannot allocate memory: %i", cc_items_mem);
+    return 1;
+  }
+  i = 0;
+  for (j = 0; j < all_items->elements/2; j++){
+    if(strcmp(itemID, all_items->element[j*2]->str) != 0){
+      strncpy(cc_items[i].item_id, all_items->element[j*2]->str, ITEM_ID_SIZE);
+      cc_items[i].total_count = atoi(all_items->element[j*2+1]->str);
+      i++;
+    }
+  }
+  freeReplyObject(all_items);
+  // batched redis hmgets on the ccmatrix
+  cur_batch = (char *)malloc(((batch_size * (ITEM_ID_SIZE + 4) * 2) + 100) * sizeof(char));
+  if(!cur_batch){
+    printf("cannot allocate memory");
+    return 1;
+  }
+  n = cc_items_size;
+  while(n >= 0){
+    cur_batch_size = ((n-1 < batch_size) ? n-1 : batch_size);
+    sprintf(cur_batch, "HMGET %s:ccmatrix ", redisPrefix);
+    for(i = 0; i < cur_batch_size; i++){
+      iikey = item_item_key(itemID, cc_items[n-i].item_id);
+      strcat(cur_batch, iikey);
+      strcat(cur_batch, " ");
+      if(iikey)
+        free(iikey);
+    }
+    redisAppendCommand(c, cur_batch);
+    redisGetReply(c, (void**)&reply);
+    for(j = 0; j < reply->elements; j++){
+      if(reply->element[j]->str){
+        cc_items[n-j].coconcurrency_count = atoi(reply->element[j]->str);
+      } else {
+        cc_items[n-j].coconcurrency_count = 0;
+      }
+    }
+    freeReplyObject(reply);
+    n -= batch_size;
+  }
+  free(cur_batch);
+  /* calculate similarities */
+  if(similarityFunc == 1)
+    calculate_jaccard(itemID, itemCount, cc_items, cc_items_size);
+  if(similarityFunc == 2)
+    calculate_cosine(itemID, itemCount, cc_items, cc_items_size);
+  /* find the top x items with simple bubble sort */
+  for(i = 0; i < maxItems - 1; ++i){
+    for (j = 0; j < cc_items_size - i - 1; ++j){
+      if (cc_items[j].similarity > cc_items[j + 1].similarity){
+        struct cc_item tmp = cc_items[j];
+        cc_items[j] = cc_items[j + 1];
+        cc_items[j + 1] = tmp;
+      }
+    }
+  }
+  /* print top k items */
+  n = ((cc_items_size < maxItems) ? cc_items_size : maxItems);
+  for(j = 0; j < n; j++){
+    i = cc_items_size-j-1;
+    if(cc_items[i].similarity > 0){
+      print_item(cc_items[i]);
+    }
+  }
+  free(cc_items);
+  return 0;
+}
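
Reviewer note on the batched HMGET loop: the C code counts `n` down and indexes `cc_items[n-i]`, which makes the batch boundaries hard to verify by eye and is worth double-checking. For comparison, here is the same idea in Ruby with `each_slice`, offered as an illustration only. `FakeHash` is a hypothetical in-memory stand-in for the `<prefix>:ccmatrix` redis hash, and the `a:b` field format is assumed to be two sorted ids joined by `:`.

```ruby
# Hypothetical stand-in for one redis hash; only hmget is modelled.
class FakeHash
  def initialize(data)
    @data = data
  end

  # Returns one value per requested field, nil for missing fields.
  def hmget(*fields)
    fields.map { |f| @data[f] }
  end
end

# Fetch co-occurrence counts of item_id with every other id, batch by batch.
def cooccurrence_counts(store, item_id, other_ids, batch_size = 200)
  counts = {}
  other_ids.each_slice(batch_size) do |batch|
    keys = batch.map { |other| [item_id, other].sort.join(":") }
    store.hmget(*keys).each_with_index do |val, i|
      counts[batch[i]] = val.to_i # nil (missing field) becomes 0
    end
  end
  counts
end

store = FakeHash.new("a:b" => "3", "b:d" => "1")
p cooccurrence_counts(store, "b", %w[a c d], 2)
```

Each slice maps one-to-one onto its reply, so no index can run past the end of the item list.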

data/src/sort.c ADDED

@@ -0,0 +1,23 @@
+int lesser(int i1, int i2){
+  if(i1 > i2){
+    return i2;
+  } else {
+    return i1;
+  }
+}
+int rb_strcmp(char *str1, char *str2){
+  long len;
+  int retval;
+  len = lesser(strlen(str1), strlen(str2));
+  retval = memcmp(str1, str2, len);
+  if (retval == 0){
+    if (strlen(str1) == strlen(str2)) {
+      return 0;
+    }
+    if (strlen(str1) > strlen(str2)) return 1;
+    return -1;
+  }
+  if (retval > 0) return 1;
+  return -1;
+}
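
Reviewer note on `rb_strcmp`: it compares the common prefix bytewise and then falls back to the lengths. The sketch below restates that logic in Ruby and compares it with `String#<=>` for a few ASCII ids. It only covers plain single-encoding strings, and `bytewise_cmp` is a hypothetical name.

```ruby
# Compare the common prefix byte by byte, then fall back to the lengths,
# mirroring the structure of rb_strcmp in src/sort.c.
def bytewise_cmp(str1, str2)
  b1 = str1.bytes
  b2 = str2.bytes
  len = [b1.length, b2.length].min
  len.times do |i|
    return (b1[i] > b2[i] ? 1 : -1) unless b1[i] == b2[i]
  end
  b1.length <=> b2.length
end

pairs = [%w[item1 item2], %w[item10 item9], %w[item item1], %w[abc abc]]
pairs.each do |a, b|
  puts "#{a} vs #{b}: #{bytewise_cmp(a, b)} (String#<=> gives #{a <=> b})"
end
```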

data/src/version.h ADDED

@@ -0,0 +1,17 @@
+#ifndef VERSION_H
+#define VERSION_H
+#define VERSION_MAJOR 0
+#define VERSION_MINOR 0
+#define VERSION_MICRO 1
+#define VERSION_STRING "recommendify_native %i.%i.%i\n" \
+ "\n" \
+ "Copyright © 2012\n" \
+ "  Paul Asmuth <paul@paulasmuth.com>\n"
+#define USAGE_STRING "usage: %s " \
+ "{--version|--jaccard|--cosine} " \
+ "[redis_key] [item_id]\n"
+#endif

metadata CHANGED

@@ -1,13 +1,8 @@
 --- !ruby/object:Gem::Specification
 name: recommendify
 version: !ruby/object:Gem::Version
-  hash: 29
   prerelease:
-  segments:
-  - 0
-  - 0
-  - 1
-  version: 0.0.1
+  version: 0.1.0
 platform: ruby
 authors:
 - Paul Asmuth
@@ -15,7 +10,8 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-02-04 00:00:00 Z
+date: 2012-02-12 00:00:00 +01:00
+default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
   name: redis
@@ -25,11 +21,6 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        hash: 3
-        segments:
-        - 2
-        - 2
-        - 2
         version: 2.2.2
   type: :runtime
   version_requirements: *id001
@@ -41,11 +32,6 @@ dependencies:
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
-        hash: 47
-        segments:
-        - 2
-        - 8
-        - 0
         version: 2.8.0
   type: :development
   version_requirements: *id002
@@ -76,6 +62,7 @@ files:
 - lib/recommendify/recommendify.rb
 - lib/recommendify/similarity_matrix.rb
 - lib/recommendify/sparse_matrix.rb
+- recommendify.gemspec
 - spec/base_spec.rb
 - spec/cc_matrix_shared.rb
 - spec/cosine_input_matrix_spec.rb
@@ -86,6 +73,15 @@ files:
 - spec/similarity_matrix_spec.rb
 - spec/sparse_matrix_spec.rb
 - spec/spec_helper.rb
+- src/cc_item.h
+- src/cosine.c
+- src/iikey.c
+- src/jaccard.c
+- src/output.c
+- src/recommendify.c
+- src/sort.c
+- src/version.h
+has_rdoc: true
 homepage: http://github.com/paulasmuth/recommendify
 licenses:
 - MIT
@@ -99,23 +95,17 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      hash: 3
-      segments:
-      - 0
       version: "0"
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      hash: 3
-      segments:
-      - 0
       version: "0"
 requirements: []
 rubyforge_project:
-rubygems_version: 1.8.15
+rubygems_version: 1.6.2
 signing_key:
 specification_version: 3
 summary: Distributed item-based "Collaborative Filtering" with ruby and redis