TJNGram 1.0

This diff shows the contents of publicly released package versions as they appear in their public registries. It is provided for informational purposes only.

Files changed (3)

data/example/example.rb ADDED

@@ -0,0 +1,15 @@
+require '../lib/tjngram'
+text = <<eos
+這是一個範例。
+This is an example.
+これは例です。
+這裡有一個蘋果。
+There is an apple.
+これはリンゴです。
+eos
+puts text, "=========="
+p TJNGram.process(2, text)
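
To make the sample input concrete, here is a minimal sketch of how the pattern from lib/tjngram.rb would split the first two sample lines. The constant `PATTERN` and the helper `tokenize` are hypothetical names used only for this illustration and are not part of the gem. Full stops and newlines match neither alternative of the pattern, so tokens from neighbouring sentences end up adjacent.

```ruby
# Pattern copied from lib/tjngram.rb: an ASCII word, or a single
# character in the range U+3041..U+9FFF (which includes kana and
# CJK ideographs).
PATTERN = /(\w+)|([\u3041-\u9FFF])/

# Hypothetical helper: returns the tokens in order of appearance.
def tokenize(text)
  text.scan(PATTERN).map { |word, char| word || char }
end

tokens = tokenize("這是一個範例。\nThis is an example.")
puts tokens.join("|")
# prints: 這|是|一|個|範|例|This|is|an|example
```

Because the punctuation is dropped, a bigram such as "例 This" would be counted across the sentence boundary when this token list is fed to an n-gram step.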

data/lib/tjngram.rb ADDED

@@ -0,0 +1,31 @@
+module TJNGram
+  def self.process n, content
+    # match Chinese, English, Japanese and Korean words
+    pattern = %r((\w+)|([\u3041-\u9FFF]))
+    # record if the token is an English word
+    is_english = {}
+    # tokenize
+    tokens = []
+    content.scan(pattern){|e,c|
+      is_english[e || c] = e ? true : false
+      tokens << (e || c)
+    }
+    result = Hash.new(0)
+    n.times{|i|
+      tokens.each_slice(n){|slice|
+        key = ""
+        pre_state = nil
+        slice.each{|token|
+          key << " " if !pre_state.nil? && (pre_state || is_english[token])
+          key << token
+          pre_state = is_english[token]
+        }
+        result[key] += 1
+      }
+      tokens.shift
+    }
+    result
+  end
+end
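
The released code builds its windows with `each_slice(n)` at n successive offsets. Since `each_slice` yields a shorter final slice whenever the token count is not a multiple of n, keys with fewer than n tokens can be counted as well. As a sketch of one alternative, not the author's method, the same counting can be written with `Enumerable#each_cons`, which yields only full windows. The module name `NGramSketch` and its helpers `tokenize` and `join` are hypothetical. The spacing rule is taken from the code above: a space is inserted between two neighbours when either of them is an ASCII word. Note also that the range U+3041..U+9FFF does not reach the Hangul Syllables block, which starts at U+AC00.

```ruby
# Hypothetical module, named to avoid clashing with the gem's TJNGram.
module NGramSketch
  # Same pattern as lib/tjngram.rb.
  PATTERN = /(\w+)|([\u3041-\u9FFF])/

  # Returns [token, is_english] pairs in order of appearance.
  def self.tokenize(content)
    content.scan(PATTERN).map { |word, char| [word || char, !word.nil?] }
  end

  # Joins one window of pairs into a key, adding a space between
  # neighbours when either of them is an ASCII word.
  def self.join(pairs)
    key = +""
    pairs.each_with_index do |(token, english), i|
      key << " " if i > 0 && (english || pairs[i - 1][1])
      key << token
    end
    key
  end

  # Counts every run of n consecutive tokens (full windows only).
  def self.process(n, content)
    counts = Hash.new(0)
    tokenize(content).each_cons(n) { |window| counts[join(window)] += 1 }
    counts
  end
end

p NGramSketch.process(2, "This is 蘋果")
```

For the input shown, the counted keys are "This is", "is 蘋" and "蘋果", each with a count of 1.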

metadata ADDED

@@ -0,0 +1,48 @@
+--- !ruby/object:Gem::Specification
+name: TJNGram
+version: !ruby/object:Gem::Version
+  version: '1.0'
+  prerelease:
+platform: ruby
+authors:
+- Tony Jian
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2012-06-06 00:00:00.000000000 Z
+dependencies: []
+description: It's common to see Chinese, Japanese and Korean articles that contain
+  some English, but it's not common to see an n-gram library which can parse this
+  sort of article. TJNGram was made to solve this problem.
+email: tonytonyjan@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- example/example.rb
+- lib/tjngram.rb
+homepage: http://tonytonyjan.github.com
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.23
+signing_key:
+specification_version: 3
+summary: N-Gram generator in Ruby, supporting English, Chinese, Japanese and Korean.
+test_files: []
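
The metadata file is the serialized form of a `Gem::Specification`. The gemspec source that produced it is not part of this diff, so the following is only a hypothetical reconstruction from the fields shown above, with the spelling of the summary normalized. The variable name `spec` is an assumption made for the illustration.

```ruby
require "rubygems"

# Hypothetical gemspec rebuilt from the metadata fields; not the author's file.
spec = Gem::Specification.new do |s|
  s.name          = "TJNGram"
  s.version       = "1.0"
  s.authors       = ["Tony Jian"]
  s.email         = "tonytonyjan@gmail.com"
  s.homepage      = "http://tonytonyjan.github.com"
  s.summary       = "N-Gram generator in Ruby, supporting English, Chinese, Japanese and Korean."
  s.files         = ["example/example.rb", "lib/tjngram.rb"]
  s.require_paths = ["lib"]
end

puts "#{spec.name} #{spec.version}"
# prints: TJNGram 1.0
```

Fields that are empty in the metadata (dependencies, executables, licenses and so on) are simply left unset in this sketch.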