RubyGems - jekyll-related-posts - Versions diffs - 0.1.1 - Mend

jekyll-related-posts 0.1.1

Files changed (10)

checksums.yaml +7 -0
data/.gitignore +22 -0
data/Gemfile +4 -0
data/LICENSE.txt +22 -0
data/README.md +95 -0
data/jekyll-related-posts.gemspec +38 -0
data/lib/_config.yml +5 -0
data/lib/jekyll-related-posts.rb +220 -0
data/lib/related.html +12 -0
metadata +188 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: d2802d28e9109784b9d9aed5987459d7c085b29e
+  data.tar.gz: b4ec98641cbe32e590b2a6aa428013c5aa50b266
+SHA512:
+  metadata.gz: 70ad01243ce8f17f133d3c56cf2d55bb5316881b2fb50d13c442fb0cb9953528a07a2dde7c304ce4b9ecda543c1bb83257923b96f628697b93804cd3af5320f1
+  data.tar.gz: d0c17c9d025e12c51df2ce7d4ddfe0a6d15baf63454a6e09052906342d4acdf8d5e1d4c2bfb1adf970eedf6eb0621a7f60cfcfb7c3a107da67c0ec0a380016b7

data/.gitignore ADDED

@@ -0,0 +1,22 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+*.bundle
+*.so
+*.o
+*.a
+mkmf.log

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in jekyll-related-posts.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2015 Amadeusz Juskowiak
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,95 @@
+# jekyll-related-posts
+Proper related posts plugin for [Jekyll](http://jekyllrb.com) - uses document correlation matrix on TF-IDF (optionally with Latent Semantic Indexing).
+## Example
+Example is provided at http://jekyll-related-posts.dev.amadeusz.me - posts are
+based on [Reuters-21578](https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection) data set.
+## Introduction
+I am going to try to start blogging again. Anyway, I am studying at the
+Decision Support Systems Group and I have found the document correlation
+problem somewhat interesting.
+For my own purposes I have created a related posts Jekyll plugin based on
+well-known algorithms such as [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf)
+and [LSI](https://en.wikipedia.org/wiki/Latent_semantic_indexing).
+## How to install
+Initially the plugin had to be installed manually; it is now distributed as a
+gem. Follow these instructions to install it:
+1. Install the gem `jekyll-related-posts`:
+  - if you are using bundler add `gem 'jekyll-related-posts'` to your
+    `Gemfile` and run `bundle install`,
+  - or install gem via `gem install jekyll-related-posts`.
+2. Add `gems: ['jekyll-related-posts']` to your `_config.yml`.
+3. Insert `<related-posts />` somewhere in your `_layouts/post.html`
+file.
+4. Run `jekyll build`. Don't forget to blog about the plugin!
+### Customization
+You can customize the default related posts template by creating
+`related.html` in your layouts directory. Plugin behaviour can be
+altered by options in `_config.yml`, under the `related:` section.
+## Basis of operation
+Each document is
+[tokenized](https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis))
+and [stemmed](https://en.wikipedia.org/wiki/Stemming), and every word found
+is treated as a keyword for analysis (except for some [stop
+words](https://en.wikipedia.org/wiki/Stop_words)).
+A TF-IDF matrix for the whole site is calculated (including any extra provided
+weights); then, if the given accuracy is lower than 1.0, the LSI algorithm
+is used to compute a new simplified vector space. The document correlation
+matrix is created using the dot product of the matrix and its transpose.
+For each post, related documents are inserted into a priority queue
+(sorted by score from the document correlation matrix), provided the score
+is at least the minimal required score. The few best related posts
+are then retrieved from the queue.
+The Liquid template for each post is rendered and `<related-posts />` is
+replaced with the outcome of the algorithm.
+## Configuration
+In your `_config.yml` file (under `related:`) you can configure:
+- `max_count: 5` - maximum number of related posts,
+- `min_score: 0.1` - minimal required score to treat post as related,
+- `accuracy: 0.75` - fraction of keywords used as the basis for the document
+    correlation matrix (if 1.0 then no LSI is computed, otherwise LSI is
+    computed and dimensions are reduced to `accuracy * |keywords|`).
+### Weights
+You can configure the weights of words by providing a dictionary of them under
+`weights`. For example, a weight of `2` means that the term frequency algorithm
+treats the word as if it occurred twice as often in the document as it really does.
+## Benchmark
+For casual blogs, performance should not be an issue.
+I did not benchmark the plugin; however, for the dataset given in the
+example (containing ~900 documents, ~7000 keywords) rendering time
+(including Jekyll hoodoo stuff) is more or less 70 seconds (on a Xeon, using
+750MB RAM). Computation related to this plugin takes about 20 seconds.
+It should be noted that I'm using OpenBLAS and the standard LAPACK
+distributed with Ubuntu (performance is similar on OS X using the built-in
+Accelerate framework).
+Unfortunately the plugin is not compatible with Jekyll 3.0's new
+incremental builds, even though it requires at least Jekyll 3.0 (for the
+plugin hooks).
+## Authors
+- Amadeusz Juskowiak - juskowiak[at]amadeusz.me
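
The "Basis of operation" section of the README above can be illustrated with a small sketch. This is not the plugin's code: it uses plain Ruby arrays instead of NMatrix, skips the LSI step, and all method names (`tfidf_vectors`, `dot`, `similarity`) and the sample documents are hypothetical. The score mirrors the normalisation visible in `document_correleation`, where a dot product is divided by the sum of both self-products minus the dot product.

```ruby
# Simplified TF-IDF and document similarity using only the standard library.
# Assumption: documents arrive already tokenized and stemmed.

# docs:    array of token arrays
# weights: optional hash of token => multiplier applied to the raw count
def tfidf_vectors(docs, weights = {})
  vocabulary = docs.flatten.uniq
  # Document frequency: how many documents contain each term.
  df = vocabulary.map { |term| [term, docs.count { |d| d.include?(term) }] }.to_h

  docs.map do |doc|
    counts = vocabulary.map { |term| doc.count(term) }
    # Normalise by the most frequent term; guard against empty documents.
    max = [counts.max.to_f, 1.0].max
    vocabulary.each_with_index.map do |term, i|
      tf = counts[i] * weights.fetch(term, 1.0) / max
      idf = Math.log(docs.size.to_f / df[term])
      tf * idf
    end
  end
end

def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# a.b / (a.a + b.b - a.b); returns 0.0 when both vectors are all zeros.
def similarity(a, b)
  ab = dot(a, b)
  denominator = dot(a, a) + dot(b, b) - ab
  denominator.zero? ? 0.0 : ab / denominator
end

docs = [
  %w[oil price rise],
  %w[oil price fall],
  %w[wheat harvest]
]
vectors = tfidf_vectors(docs)
puts similarity(vectors[0], vectors[1]) # shares "oil" and "price"
puts similarity(vectors[0], vectors[2]) # no shared terms
```

Passing a hash such as `{ 'oil' => 2 }` as the second argument doubles the term frequency of that token, which is the effect the "Weights" section describes.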

data/jekyll-related-posts.gemspec ADDED

@@ -0,0 +1,38 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+Gem::Specification.new do |spec|
+  spec.name          = "jekyll-related-posts"
+  spec.version       = "0.1.1"
+  spec.authors       = ["Amadeusz Juskowiak"]
+  spec.email         = ["juskowiak@amadeusz.me"]
+  spec.summary       = %q{Proper related posts plugin for Jekyll - uses document correlation matrix on TF-IDF (optionally with Latent Semantic Indexing).}
+  spec.description   = %q{Proper related posts plugin for Jekyll - uses document correlation matrix on TF-IDF (optionally with Latent Semantic Indexing).
+Each document is tokenized and stemmed, and every word found is treated as a keyword for analysis (except for some stop words).
+A TF-IDF matrix for the whole site is calculated (including any extra provided weights); then, if the given accuracy is lower than 1.0, the LSI algorithm is used to compute a new simplified vector space. The document correlation matrix is created using the dot product of the matrix and its transpose.
+For each post, related documents are inserted into a priority queue (sorted by score from the document correlation matrix), provided the score is at least the minimal required score. The few best related posts are then retrieved from the queue.
+The Liquid template for each post is rendered and <related-posts /> is replaced with the outcome of the algorithm.}
+  spec.homepage      = "https://github.com/alfanick/jekyll-related-posts"
+  spec.license       = "MIT"
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "bundler", "~> 1.6"
+  spec.add_runtime_dependency "jekyll", "~> 3.0"
+  spec.add_runtime_dependency "liquid", "~> 3.0"
+  spec.add_runtime_dependency "tokenizer", "~> 0.1"
+  spec.add_runtime_dependency "stopwords-filter", "~> 0.3"
+  spec.add_runtime_dependency "fast-stemmer", "~> 1.0"
+  spec.add_runtime_dependency "pqueue", "~> 2.1"
+  spec.add_runtime_dependency "nmatrix", "~> 0.2"
+  spec.add_runtime_dependency "nmatrix-lapacke", "~> 0.2"
+end

data/lib/_config.yml ADDED

@@ -0,0 +1,5 @@
+related:
+  max_count: 5
+  min_score: 0.1
+  accuracy: 0.75
+  weights: {}
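
These defaults are combined with the site's own `related:` section by the `config` method in `lib/jekyll-related-posts.rb`, which calls `Hash#merge` on the two hashes. A minimal sketch of that behaviour follows; the site configuration shown and the helper name `related_settings` are hypothetical, and the YAML is inlined rather than read from files.

```ruby
require 'yaml'

# Defaults as shipped in lib/_config.yml.
defaults = YAML.load(<<~YAML_TEXT)
  related:
    max_count: 5
    min_score: 0.1
    accuracy: 0.75
    weights: {}
YAML_TEXT

# Hypothetical excerpt of a site's _config.yml.
site_config = YAML.load(<<~YAML_TEXT)
  related:
    max_count: 3
    weights:
      jekyll: 2
YAML_TEXT

# Site values win; keys the site omits fall back to the defaults.
def related_settings(defaults, site_config)
  defaults['related'].merge(site_config['related'] || {})
end

settings = related_settings(defaults, site_config)
puts settings.inspect
```

Because the merge is shallow, a site that sets `weights` replaces the default `weights` hash as a whole rather than merging into it.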

data/lib/jekyll-related-posts.rb ADDED

@@ -0,0 +1,220 @@
+require 'rubygems'
+require 'jekyll'
+require 'singleton'
+require 'tokenizer'
+require 'yaml'
+require 'liquid'
+require 'fast_stemmer'
+require 'stopwords'
+require 'pqueue'
+require 'nmatrix'
+require 'nmatrix/lapacke'
+module Amadeusz
+module Jekyll
+  class RelatedPosts
+    include Singleton
+    def initialize
+      @posts = Array.new
+      @keywords = Array.new
+      @tokenizer = Tokenizer::Tokenizer.new(:en)
+      @stopwords_filter = Stopwords::Snowball::Filter.new('en')
+    end
+    def add_post(post)
+      post = {
+        url: post.url,
+        title: post.data['title'].dup,
+        content: (stem(post.content) + stem(post.data['title']))
+      }
+      @posts << post
+      @keywords += post[:content]
+      @keywords.uniq!
+    end
+    def build!(site)
+      conf = config(site)
+      @weights = keywords_weights(conf['weights'])
+      related = find_releated(conf['max_count'], conf['min_score'], conf['accuracy'])
+      template = Liquid::Template.parse(File.read(template_path(site)))
+      @posts.each do |post|
+        filename = File.join(site.config['destination'], post[:url])
+        rendered = File.read(filename)
+        output = template.render('related_posts' => related[post])
+        rendered.gsub! '<related-posts />', output
+        File.write(filename, rendered)
+      end
+    end
+    private
+    def config(site)
+      builtin_file = File.join(File.absolute_path(File.dirname(__FILE__)), '_config.yml')
+      defaults = YAML.load_file(builtin_file)
+      defaults['related'].merge(site.config['related'] || {})
+    end
+    def template_path(site)
+      site_file = File.join(site.config['source'], site.config['layouts_dir'], 'related.html')
+      builtin_file = File.join(File.absolute_path(File.dirname(__FILE__)), 'related.html')
+      if File.exist? site_file
+        site_file
+      else
+        builtin_file
+      end
+    end
+    def find_releated(count = 5, min_score = -10.0, accuracy = 1.0)
+      dc = document_correleation(accuracy)
+      result = Hash.new
+      count = [count, @posts.size].min
+      @posts.each_with_index do |post, index|
+        queue = PQueue.new(dc.row(index).each_with_index.select{|s,_| s>=min_score}) do |a, b|
+          a[0] > b[0]
+        end
+        result[post] = []
+        count.times do
+          score, id = queue.pop
+          break unless score
+          begin
+            result[post] << {
+              'score' => score,
+              'url' => @posts[id][:url],
+              'title' => @posts[id][:title]
+            }
+          rescue
+            break
+          end
+        end
+      end
+      return result
+    end
+    def lsi(matrix, accuracy)
+      degree = (@keywords.size * accuracy - 1).floor
+      u, sigma, vt = matrix.transpose.gesdd
+      u2 = u.slice(0..degree, 0..degree)
+      sigma_d = NMatrix.zeros([degree+1, @posts.size])
+      sigma.each_with_indices do |v, i, j|
+        break if i > degree
+        sigma_d[i, i] = v
+      end
+      return u2.dot(sigma_d).dot(vt).transpose
+    end
+    def document_correleation(accuracy = 1.0)
+      if accuracy == 1.0
+        scores = tfidf
+      else
+        scores = lsi(tfidf, accuracy)
+      end
+      result = scores.dot(scores.transpose)
+      result.each_with_indices do |_, u, v|
+        if u != v
+          result[u, v] /= (result[u, u] + result[v, v] - result[u, v])
+        else
+          result[u, v] = 0.0
+        end
+      end
+      return result
+    end
+    def bag_of_words
+      result = NMatrix.new([@posts.size, @keywords.size], 0.0)
+      @max = NMatrix.new([@posts.size], 0.0)
+      result.each_with_indices do |_, pi, ki|
+        result[pi, ki] = @posts[pi][:content].count(@keywords[ki])
+        if result[pi, ki] > @max[pi]
+          @max[pi] = result[pi, ki]
+        end
+      end
+      @bag_of_words = result.dup
+      return result
+    end
+    def term_frequency
+      result = bag_of_words
+      result.rows.times do |r|
+        result[r, 0..-1] *= @weights
+        result[r, 0..-1] /= @max[r]
+      end
+      return result
+    end
+    def keywords_weights(weights)
+      result = NMatrix.new([1, @keywords.size], 1.0)
+      weights.each do |word, weight|
+        keyword = word.to_s.stem.to_sym
+        next unless @keywords.include? keyword
+        result[0, @keywords.index(keyword)] = weight
+      end
+      return result
+    end
+    def inverse_document_frequency
+      result = NMatrix.new([1, @keywords.size], 0.0)
+      @bag_of_words.each_column do |column|
+        # Start from 0.0 so the first entry is counted as present or absent, not by its raw count
+        occurences = column.reduce(0.0) do |m, c|
+          m + (c > 0 ? 1.0 : 0.0)
+        end
+        result[0, column.offset[1]] = Math.log(column.size / occurences) if occurences > 0
+      end
+      return result
+    end
+    def tfidf
+      result = term_frequency
+      idf = inverse_document_frequency
+      result.rows.times do |r|
+        result[r, 0..-1] *= idf
+      end
+      return result
+    end
+    def stem(data)
+      tokenized = @tokenizer.tokenize(data.gsub(/[^a-z \t'_\-\n.,+]/i, '')).map(&:downcase)
+      filtered = @stopwords_filter.filter(tokenized)
+      stemmed = filtered.map(&:stem).select{|s| not s.empty?}.map(&:to_sym)
+      return stemmed
+    end
+  end
+end
+end
+Jekyll::Hooks.register :posts, :pre_render do |post|
+  Amadeusz::Jekyll::RelatedPosts.instance.add_post(post)
+end
+Jekyll::Hooks.register :site, :post_write do |site|
+  Amadeusz::Jekyll::RelatedPosts.instance.build! site
+end
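
The selection step in `find_releated` (keep scores that reach `min_score`, then take at most `max_count` of the best) can be sketched without the `pqueue` gem. The version below is an illustration under stated assumptions, not the plugin's implementation: it sorts a plain array instead of using a priority queue, and the method name `top_related` and the sample scores are hypothetical.

```ruby
# scores: one row of a document-by-document score matrix (hypothetical data).
# Returns up to max_count [score, index] pairs, best first,
# keeping only scores that are at least min_score.
def top_related(scores, max_count: 5, min_score: 0.1)
  scores.each_with_index
        .select { |score, _| score >= min_score }
        .sort_by { |score, _| -score }
        .first(max_count)
end

# Index 0 plays the role of the post itself; the plugin zeroes that entry.
row = [0.0, 0.42, 0.05, 0.9, 0.1]
p top_related(row, max_count: 2)
```

Sorting the whole row is simpler than a priority queue and gives the same ordering for distinct scores; a queue only pays off when rows are long and `max_count` is small.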

data/lib/related.html ADDED

@@ -0,0 +1,12 @@
+{% if related_posts != empty %}
+<div id="related-posts">
+  <h3>Related posts</h3>
+  <ul>
+    {% for p in related_posts %}
+      <li>
+        <a href="{{ p.url }}" data-score="{{ p.score }}">{{ p.title }}</a>
+      </li>
+    {% endfor %}
+  </ul>
+</div>
+{% endif %}

metadata ADDED

@@ -0,0 +1,188 @@
+--- !ruby/object:Gem::Specification
+name: jekyll-related-posts
+version: !ruby/object:Gem::Version
+  version: 0.1.1
+platform: ruby
+authors:
+- Amadeusz Juskowiak
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2015-11-13 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.6'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.6'
+- !ruby/object:Gem::Dependency
+  name: jekyll
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+- !ruby/object:Gem::Dependency
+  name: liquid
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+- !ruby/object:Gem::Dependency
+  name: tokenizer
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.1'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.1'
+- !ruby/object:Gem::Dependency
+  name: stopwords-filter
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.3'
+- !ruby/object:Gem::Dependency
+  name: fast-stemmer
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+- !ruby/object:Gem::Dependency
+  name: pqueue
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.1'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.1'
+- !ruby/object:Gem::Dependency
+  name: nmatrix
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
+- !ruby/object:Gem::Dependency
+  name: nmatrix-lapacke
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
+description: |-
+  Proper related posts plugin for Jekyll - uses document correlation matrix on TF-IDF (optionally with Latent Semantic Indexing).
+  Each document is tokenized and stemmed, and every word found is treated as a keyword for analysis (except for some stop words).
+  A TF-IDF matrix for the whole site is calculated (including any extra provided weights); then, if the given accuracy is lower than 1.0, the LSI algorithm is used to compute a new simplified vector space. The document correlation matrix is created using the dot product of the matrix and its transpose.
+  For each post, related documents are inserted into a priority queue (sorted by score from the document correlation matrix), provided the score is at least the minimal required score. The few best related posts are then retrieved from the queue.
+  The Liquid template for each post is rendered and <related-posts /> is replaced with the outcome of the algorithm.
+email:
+- juskowiak@amadeusz.me
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- Gemfile
+- LICENSE.txt
+- README.md
+- jekyll-related-posts.gemspec
+- lib/_config.yml
+- lib/jekyll-related-posts.rb
+- lib/related.html
+homepage: https://github.com/alfanick/jekyll-related-posts
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.2.2
+signing_key:
+specification_version: 4
+summary: Proper related posts plugin for Jekyll - uses document correlation matrix
+  on TF-IDF (optionally with Latent Semantic Indexing).
+test_files: []