RubyGems - ngrams_search - Versions diffs - 0.0.1 - Mend

ngrams_search 0.0.1

Files changed (5)

checksums.yaml ADDED

@@ -0,0 +1,15 @@
+---
+!binary "U0hBMQ==":
+  metadata.gz: !binary |-
+    ZTAzOTJmOGYyODNjZWNkZDgyNzkzYTAzNTU2MThlNzAxNGI2OTMxNw==
+  data.tar.gz: !binary |-
+    MDAyMzU2NzZkY2YxM2E1NGQ4NGU4YTJiZjdmMTJiZDIyZjBlNTZiZA==
+!binary "U0hBNTEy":
+  metadata.gz: !binary |-
+    ZDU1NzZlMWZmNzc3N2U0MTMzNTU5YzQ4MzZmODVmZmE1MDBiZmU4OGQ2OGI3
+    Y2U2NWUwNzMzZWNlYzY2NTMwMGJlOTg4ZGUxNGQ0NzlkZTlkMDkzMTdiNDI1
+    ZTllZTA2MzdiYTQ2ZDI0NzlhOGUzOWIzY2Y4OTYzYTE1ZDU5Y2E=
+  data.tar.gz: !binary |-
+    N2U4NThhZTgxMDE3Yjg2ZWJhNzY3MWUyNmZjNjc4MTAxMjA1ZWI5NGZiYmM0
+    MGU1YzhmZTEyNzVkNTA4ZDJlMTQ3ZGJkNmMxZTRkMWJjNjcyYjA5MTQ4NjEw
+    NTA0MTJhNzg3YThjNWIwMDI1ZDhhNTMzYzAwYTQ4NDEzYmM1ZDM=

data/README.rdoc ADDED

@@ -0,0 +1,69 @@
+= ngrams_search
+Author:: Elias Hasnat
+= Description
+ngrams_search is an extension of Ruby's core String class.  It provides a String
+object with the capability to produce n-grams.
+From Wikipedia,
+ "In the fields of computational linguistics and probability, an n-gram is a
+ contiguous sequence of n items from a given sequence of text or speech. The
+ items in question can be phonemes, syllables, letters, words or base pairs
+ according to the application. n-grams are collected from a text or speech corpus.
+ An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"
+ (or, less commonly, a "digram"); size 3 is a "trigram"; size 4 is a "four-gram"
+ and size 5 or more is simply called an "n-gram"."
+= Design
+Instead of creating another namespace, this task seemed simple enough to merit
+extending the String class.  A string is a sequence of characters.
+It can be words, binary code, sentences, paragraphs, etc.  In short,
+anything that you can store in a Ruby String object can be parsed into
+n-grams of length n.
+The main method being added to the String class is ngrams().  It produces an
+array of n-grams from a Ruby String object.
+For example, let s be a Ruby String object.
+Then s.ngrams() returns an array of n-grams from s.
+Tokenization of s is set to single characters by default.
+For example, if
+ s = "Hello World!",
+then the tokens of s are
+ ["H","e","l","l","o"," ","W","o","r","l","d","!"].
+By specifying a regular expression, you can tokenize the string s in many
+different and useful ways.
+If you set n = 4, then
+ s.ngrams = [["H", "e", "l", "l"],
+ ["e", "l", "l", "o"],
+ ["l", "l", "o", " "],
+ ["l", "o", " ", "W"],
+ ["o", " ", "W", "o"],
+ [" ", "W", "o", "r"],
+ ["W", "o", "r", "l"],
+ ["o", "r", "l", "d"],
+ ["r", "l", "d", "!"]].
+Each item in the s.ngrams array can be joined, but doesn't need to be.
+Joining is straightforward when the tokens are ordinary text.
+Be careful if you are trying to join n-grams with non-printable characters.
+You can google "n-grams" to get more information about how n-grams are useful.
+= Installation
+ gem install ngrams_search
+= Usage
+You can simply run the executable and provide input via STDIN.
+ ngrams_search
+You can also provide input via one or more filenames.
+ ngrams_search [FILES]
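
The README's "Hello World!" example can be reproduced without installing the gem. The following is a minimal sketch, not the gem's own code: `ngrams_of` is a hypothetical standalone helper introduced here for illustration, and it relies on Ruby's built-in `Enumerable#each_cons` rather than on the `String#ngrams` method this gem adds.

```ruby
# Hypothetical standalone helper (not part of the gem): returns the n-grams
# of +text+ as arrays of tokens, tokenizing with +regex+.
def ngrams_of(text, n: 2, regex: //)
  tokens = text.split(regex)   # the empty regex // splits into single characters
  tokens.each_cons(n).to_a     # every run of n consecutive tokens
end

s = "Hello World!"
four_grams = ngrams_of(s, n: 4)
puts four_grams.length               # 12 tokens give 12 - 4 + 1 = 9 four-grams
puts four_grams.first.inspect        # the first four characters, as an array
puts four_grams.map(&:join).inspect  # joined back into 4-character strings

# Word-level bigrams: tokenize on runs of whitespace instead of characters.
puts ngrams_of("the quick brown fox", n: 2, regex: /\s+/).inspect
```

Changing only the regex switches between character-level and word-level n-grams, which is the flexibility the Design section describes.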

data/bin/ngrams_search ADDED

@@ -0,0 +1,40 @@
+#!/usr/bin/env ruby
+require 'ruby_cli'
+require 'ngrams_search'
+class App
+	include RubyCLI
+	def initialize_command_options() @options = {:regex => //, :n => 2}	end
+	def define_command_option_parsing
+		@opt_parser.on('-n', '--n NUM', Integer, 'set length n for n-grams') do |n|
+			@options[:n] = n
+		end
+		@opt_parser.on('-r', '--regex "REGEX"', Regexp, 'set regex to split string into tokens') do |r|
+			@options[:regex] = r
+		end
+	end
+	def command
+		# If arguments were provided, then they have to be names of files.
+		# These files will be handled using Ruby's ARGF builtin variable.
+		# If arguments are not filenames, then this application will produce
+		# a runtime error informing the user that the given file could not be opened.
+		# ARGF is a stream designed for use in scripts that process files given as
+		# command-line arguments or passed in via STDIN.
+		# The arguments passed to your script are stored in the ARGV Array,
+		# one argument per element. ARGF assumes that any arguments that aren’t
+		# filenames have been removed from ARGV.
+		text = ARGF.read
+		text.ngrams(@options).each { |ngram| puts ngram.inspect }
+	end
+end
+app = App.new(ARGV, __FILE__)
+app.run
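
For readers unfamiliar with the `ruby_cli` dependency, the same two options can be parsed with Ruby's standard `optparse` library alone. This is a rough sketch under stated assumptions, not how the script above is implemented: `parse_ngram_options` is a hypothetical helper, and the regex is taken as a plain string and compiled with `Regexp.new` rather than through an option-type converter.

```ruby
require 'optparse'

# Hypothetical helper: parses -n/--n and -r/--regex from argv and returns the
# options hash plus the remaining arguments (which would be the filenames).
def parse_ngram_options(argv)
  options = { :regex => //, :n => 2 }   # same defaults as the script above
  parser = OptionParser.new do |opts|
    opts.on('-n', '--n NUM', Integer, 'set length n for n-grams') do |n|
      options[:n] = n
    end
    opts.on('-r', '--regex REGEX', String, 'set regex to split string into tokens') do |r|
      options[:regex] = Regexp.new(r)   # compile the pattern text ourselves
    end
  end
  remaining = parser.parse(argv)        # leaves argv untouched, returns non-option args
  [options, remaining]
end

options, files = parse_ngram_options(['-n', '3', '-r', '\s+', 'input.txt'])
puts options.inspect
puts files.inspect
```

In a real script the remaining arguments would be left in `ARGV` so that `ARGF.read` can open them as files, as the comments in the executable describe.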

data/lib/ngrams_search.rb ADDED

@@ -0,0 +1,45 @@
+# This is an extension of Ruby's core String class.
+# It adds methods to extract a set of n-grams from a string.
+# Typically, the most commonly used sets of n-grams are unigrams, bigrams, and
+# trigrams: sets of n-grams of length 1, 2, and 3, respectively.
+class String
+	# An n-gram is a sequence of units of text of length n, where those units are
+	# typically single characters or words delimited by space characters.
+	# However, a token could also be a fixed length character sequence, strings
+	# with embedded spaces, etc. depending on the intended application.
+	# Typically, n-grams are formed of contiguous tokens.
+	#
+	# This function splits the string into a set of n-grams.
+	# The default regex used tokenizes the string into characters.
+	#
+	# Regex Examples:
+	#   //     => splits into characters
+	#   /\s+/  => splits into words delimited by one or more whitespace characters
+	#   /\n+/  => splits into lines delimited by one or more newline characters
+	#
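+	def ngrams(options = {})
+		# Merge over the defaults so a partial hash such as {:n => 3} still works.
+		options = {:regex=>//, :n=>2}.merge(options)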
+		ngrams = []
+		tokens = self.split(options[:regex])
+		max_pos = tokens.length - options[:n]
+		for i in 0..max_pos
+			ngrams.push(tokens[i..i+(options[:n]-1)])
+		end
+		ngrams
+	end
+	# This function splits the string into unigrams,
+	# tokenizes into chars by default
+	def unigrams(regex = //) ngrams({:regex => regex, :n => 1}); end
+	# This function splits the string into bigrams
+	# tokenizes into chars by default
+	def bigrams(regex = //) ngrams({:regex => regex, :n => 2}); end
+	# This function splits the string into trigrams
+	# tokenizes into chars by default
+	def trigrams(regex = //) ngrams({:regex => regex, :n => 3}); end
+end #class String
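
A general note on option hashes with defaults in Ruby: a default hash argument is used only when the caller passes nothing at all, so a partial hash needs to be merged over the defaults explicitly. The sketch below illustrates the difference; both helper names are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper: the default hash is replaced wholesale by whatever
# hash the caller passes, so missing keys are simply absent.
def unmerged_options(options = { :regex => //, :n => 2 })
  options
end

# Hypothetical helper: the caller's keys win, everything else falls back
# to the defaults.
def merged_options(overrides = {})
  { :regex => //, :n => 2 }.merge(overrides)
end

puts unmerged_options(:n => 3).key?(:regex)   # the :regex default is lost
puts merged_options(:n => 3).key?(:regex)     # the :regex default is kept

# With a nil pattern (and $; left unset), String#split falls back to
# splitting on whitespace, so a lost :regex silently changes the tokenization.
puts "a b c".split(nil).inspect
puts "a b c".split(//).inspect
```

Merging keeps the character-level default in place when a caller overrides only `:n`.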

metadata ADDED

@@ -0,0 +1,62 @@
+--- !ruby/object:Gem::Specification
+name: ngrams_search
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Elias Hasnat
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-09-03 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: ruby_cli
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 0.2.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 0.2.0
+description: n-grams string search
+email: android.hasnat@gmail.com
+executables:
+- ngrams_search
+extensions: []
+extra_rdoc_files: []
+files:
+- lib/ngrams_search.rb
+- bin/ngrams_search
+- README.rdoc
+homepage: http://github.com/claymodel/telephony/ngrams_search
+licenses: []
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+- bin
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.0.7
+signing_key:
+specification_version: 4
+summary: Search string using n-grams
+test_files: []
+has_rdoc: