RubyGems - language_filter - Versions diffs - 0.2 - Mend

language_filter 0.2

Files changed (15)

checksums.yaml +7 -0
data/.gitignore +17 -0
data/Gemfile +4 -0
data/LICENSE.txt +22 -0
data/README.md +190 -0
data/Rakefile +1 -0
data/config/filters/hate.txt +6 -0
data/config/filters/profanity.txt +10 -0
data/config/filters/sex.txt +56 -0
data/config/filters/violence.txt +13 -0
data/language_filter.gemspec +23 -0
data/lib/language_filter.rb +172 -0
data/lib/language_filter/error.rb +7 -0
data/lib/language_filter/version.rb +3 -0
metadata +87 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: dee09f095e1774912cf628728f5acc6735b2b85f
+  data.tar.gz: 159c7624452c4d90cd44252edc64187e36b9a644
+SHA512:
+  metadata.gz: d16829605c4b949b8ba758d4ba40aa23e3c164b36a5a8c3d564dce66803e995e9043619efe3036bf3d14161cb351e259ca0a9a72690d3ec9fae67f1d87d2b6f6
+  data.tar.gz: db03b6a6a94847f9076ccc8feca139b9d17ffd1da498a21b78e31df58c9cdfa8ab1e965b70da812b18a760461ca5de23dd1afd1da3b56813979c6238f435c4d2

data/.gitignore ADDED

@@ -0,0 +1,17 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in language_filter.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2013 Chris Fritz
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,190 @@
+# LanguageFilter
+LanguageFilter is a Ruby gem to detect and optionally filter multiple categories of language. It was adapted from Thiago Jackiw's Obscenity gem for [FractalWriting.org](http://fractalwriting.org) and features many improvements, including:
+- The ability to create and independently configure multiple language filters.
+- Comes pre-packaged with multiple matchlists (for hate, profanity, sex, and violence), for more fine-tuned language detection. I think this aligns much better with the real needs of communities that might need language filtering. For example, I probably want to flag and eventually ban users that use hateful language. Then for content featuring sex, profanity, and/or violence, I can let users know exactly what to expect before delving into content, much more so than with a single, all-encompassing "mature" tag.
+- Simpler, more intuitive configuration.
+- More neutral language to accommodate a wider variety of use cases. For example, LanguageFilter uses `matchlist` and `exceptionlist` instead of `blacklist` and `whitelist`, since the gem can be used not only for censorship, but also for content *type* identification (e.g. fantasy, sci-fi, historical, etc. in the context of creative writing).
+- More robust exceptionlist (i.e. whitelist) handling. Given a simple example of a matchlist containing `cock` and an exceptionlist containing `game cock`, the other filtering gems I've seen will flag the `cock` in `game cock`, despite the exceptionlist. LanguageFilter is a little smarter and does what you would expect, so that when sanitizing the string `cock is usually sexual, but a game cock is just an animal`, the returned string will be `**** is usually sexual, but a game cock is just an animal`.
+## Installation
+Add this line to your application's Gemfile:
+``` ruby
+gem 'language_filter'
+```
+And then execute:
+``` bash
+$ bundle
+```
+Or install it yourself as:
+``` bash
+$ gem install language_filter
+```
+## Usage
+Need a new language filter? Here's a quick usage example:
+``` ruby
+sex_filter = LanguageFilter::Filter.new matchlist: :sex, replacement: :stars
+# returns true if any content matched the filter's matchlist, else false
+sex_filter.match?('This is some sexual content.')
+=> true
+# returns a "cleaned up" version of the text, based on the replacement rule
+sex_filter.sanitize('This is some sexual content.')
+=> "This is some ****** content."
+# returns an array of the words and phrases that matched an item in the matchlist
+sex_filter.matched('This is some sexual content.')
+=> ["sexual"]
+```
+Now let's go over this a little more methodically. When you create a new LanguageFilter, you simply call LanguageFilter::Filter.new, with any of the following optional parameters. Below, you can see their defaults.
+``` ruby
+LanguageFilter::Filter.new(
+                            matchlist: :profanity,
+                            exceptionlist: [],
+                            replacement: :stars
+                          )
+```
+Now let's dive a little deeper into each parameter.
+### `:matchlist` and `:exceptionlist`
+Both of these lists can take four different kinds of inputs.
+#### Symbol signifying a pre-packaged list
+By default, LanguageFilter comes with four different matchlists, each screening for a different category of language. These filters are accessible via:
+- `matchlist: :hate` (for hateful language, like `f**k you`, `b***h`, or `f*g`)
+- `matchlist: :profanity` (for swear/cuss words and phrases)
+- `matchlist: :sex` (for content of a sexual nature)
+- `matchlist: :violence` (for language indicating violence, such as `stab`, `gun`, or `murder`)
+There's quite a bit of overlap between these lists, but they can be useful for communities that may want to self-monitor, giving them an idea of the kind of content in a story or article before clicking through.
+#### An array of words and phrases to screen for
+- `matchlist: ['giraffes?','rhino\w*','elephants?'] # a non-exhaustive list of African animals`
+As you may have noticed, you can include regex! However, if you do, keep in mind that the more complicated the regex you include, the slower the matching will be. Also, if you're assigning an array directly to matchlist and want to use regex, be sure to use single quotes (`'like this'`), rather than double quotes (`"like this"`). Otherwise, Ruby will treat your backslashes as string escape sequences (for example, `"\b"` becomes a backspace character), rather than interpreting them literally and passing them into your regex untouched.
+In the actual matching, each item you enter in the list is dumped into the middle of the following regex, through the `list_item` variable.
+``` ruby
+/\b#{list_item}\b/i
+```
+There's not a whole lot going on there, but I'll quickly parse it for any who aren't very familiar with regex.
+- `#{list_item}` just dumps in the item from our list that we want to check.
+- The two `\b` on either side ensure that only text surrounded by non-word characters (anything other than letters, numbers, and the underscore), or the beginning or end of a string, are matched.
+- The two `/` wrapping (almost) the whole statement let Ruby know that this is a regex statement.
+- The `i` right after the regex tells it to match case-insensitively, so that whether someone writes `giraffe`, `GIRAFFE`, or `gIrAffE`, the match won't fail.
+If you'd like to master some regex Rubyfu, I highly recommend stopping at [Rubular.com](http://rubular.com/).
+#### A filepath or string pointing to a filepath
+If you want to use your own lists, there are two ways to do it.
+1) Pass in a filepath:
+``` ruby
+matchlist: File.join(Rails.root,"/config/language_filters/my_custom_list.yml")
+```
+2) Pass in a `Pathname`, like Rails.root. I'm honestly not sure when you'd do this, but it was an option in Obscenity and it's still an option now.
+##### Formatting your lists
+Now when you're actually writing these lists, they both use the same, relatively simple format, which looks something like this:
+``` regex
+giraffes?
+rhino\w*
+elephants?
+```
+It's a pretty simple pattern. Each word, phrase, or regex is on its own line - and that's it.
+### `:replacement`
+If you're not using this gem to filter out potentially offensive content, then you don't have to worry about this part. For the rest of you, the `:replacement` parameter specifies what to replace matches with when sanitizing text.
+Here are the options:
+`replacement: :stars` (this is the default replacement method)
+Example: This is some ****** up ****.
+`replacement: :garbled`
+Example: This is some $@!#% up $@!#%.
+`replacement: :vowels`
+Example: This is some f*ck*d up sh*t.
+`replacement: :nonconsonants` (useful where letters might be replaced with numbers, for example in L3375P34|< - i.e. leetspeak)
+Example: 7|-|1$ 1$ $0/\/\3 Ph*****D UP ******.
+### Methods to modify filters after creation
+If you ever want to change the matchlist, exceptionlist, or replacement type, each parameter is accessible via an assignment method.
+For example:
+``` ruby
+my_filter = LanguageFilter::Filter.new(
+                                        matchlist: ['dogs?'],
+                                        exceptionlist: ['dogs drool'],
+                                        replacement: :garbled
+                                      )
+my_filter.sanitize('Dogs rule, cats drool!')
+=> "$@!#% rule, cats drool!"
+my_filter.sanitize('Cats rule, dogs drool!')
+=> "Cats rule, dogs drool!"
+my_filter.matchlist = ['dogs?','cats drool']
+my_filter.exceptionlist = ['dogs drool','dogs are cruel']
+my_filter.replacement = :stars
+my_filter.sanitize('Dogs rule, cats drool!')
+=> "**** rule, **********!"
+my_filter.sanitize('Cats rule, dogs drool!')
+=> "Cats rule, dogs drool!"
+```
+In the above case though, we just wanted to add items to the existing lists, so there's actually a better solution. They're stored as arrays, so treat them as such. Any array methods are fair game.
+For example:
+``` ruby
+my_filter.matchlist.pop
+my_filter.matchlist << "cats are liars" << "don't listen to( the)? cats" << "why does no one heed my warnings about the cats?! aren't you getting my messages?"
+my_filter.matchlist.uniq!
+# etc...
+```
+## Contributing
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
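The README's advice to use single quotes for regex list items can be checked directly. The sketch below is not part of the release. The `wrap` helper and the two constants are hypothetical names that only mirror the `/\b#{list_item}\b/i` wrapping the README describes.

```ruby
# Hypothetical helper mirroring the README's /\b#{list_item}\b/i wrapping.
def wrap(list_item)
  /\b#{list_item}\b/i
end

# Single quotes keep the backslash, so the regex engine sees \w*.
SINGLE_QUOTED = 'rhino\w*'
# Double quotes process "\w" as a string escape and drop the backslash,
# so the regex engine sees a literal "w" followed by "*".
DOUBLE_QUOTED = "rhino\w*"

p SINGLE_QUOTED.length                          # 8
p DOUBLE_QUOTED.length                          # 7
p wrap(SINGLE_QUOTED).match?('rhinoceros')      # true
p wrap(DOUBLE_QUOTED).match?('rhinoceros')      # false
```

The double-quoted version still matches the bare word `rhino`, which makes this kind of mistake easy to miss in a quick test.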

data/Rakefile ADDED

@@ -0,0 +1 @@
+require "bundler/gem_tasks"

data/config/filters/hate.txt ADDED

@@ -0,0 +1,6 @@
+\w*f[\b_]*[uv][\b_]*c[\b_]*k[\b_][a([e3]r)]s?
+\w*f[\b_]*a[\b_]*g\w*
+\w*c[\b_]*u[\b_]*n[\b_]*t\w*
+\w*a[s\$5z]{2}h[o0]l[e3]\w*
+\w*b[\b_]*i[\b_]*t[\b_]*c[\b_]*h\w*
+fudge ?pack\w*

data/config/filters/profanity.txt ADDED

@@ -0,0 +1,10 @@
+\w*f[\b_]*[uv][\b_]*c[\b_]*k\w*
+\w*f[\b_]*c[\b_]*[uv][\b_]*k\w*
+\w*f[\b_]*[uv][\b_]*k\w*
+\w*[s\$5][\b_]*h[\b_]*[i1][\b_]*t\w*
+a[s\$5z]{2}\w*
+\w*a[s\$5z]{2}(e[s\$5z])?
+b[\b_]*a[\b_]*s[\b_]*t[\b_]*a[\b_]*r[\b_]*d\w*
+b[\b_]*i[\b_]*t[\b_]*c[\b_]*h\w*
+c[\b_]*u[\b_]*n[\b_]*t\w*
+f[\b_]*a[\b_]*g\w*

data/config/filters/sex.txt ADDED

@@ -0,0 +1,56 @@
+\w*sex\w*
+blow ?job\w*
+fellat\w*
+felch\w*
+\w*f[\b_]*[uv][\b_]*c[\b_]*k\w*
+wank\w*
+cock\w*
+cock suck\w*
+poll ?smok\w*
+dick\w*
+dick ?suck\w*
+fudge ?pack\w*
+rim ?job\w*
+knob ?gobbl\w*
+anal\w*
+rectums?
+\w*a[s\$5z]{2}
+\w*a[s\$5z]{2}h[o0]l[e3]\w*
+ballsacks?
+scrotums?
+bollocks
+penis(es)?
+boners?
+pricks?
+knobends?
+manhoods?
+wieners?
+breasts?
+tit\w*
+boob\w*
+honkers?
+cleavages?
+vagina\w*
+puss[y(ies)(ee)]
+muffs?
+cunt\w*
+twats?
+clit\w*
+quims?
+labias?
+buttplugs?
+dildos?
+heteros?
+homos?
+sluts?
+whor\w*
+skank\w*
+g+[\b_]*h?[\b_]*[ae][\b_]*ys?
+dykes?
+\w*f[\b_]*a[\b_]*g\w*
+\w*cum\w*
+jizz\w*
+pubes?
+pubic
+smegma
+boy ?butter

data/config/filters/violence.txt ADDED

@@ -0,0 +1,13 @@
+stab\w*
+kill\w*
+beat ?up
+beat the \w+ out of
+beat the \w+ out of
+fuck ?\w* up
+murder\w*
+genocide
+shoot [(him)(her)(it)(me)(us)(them)]
+shot [(him)(her)(it)(me)(us)(them)]
+gun\w*
+phasers?
+death ?(ray)?
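Several entries in these lists, such as `shoot [(him)(her)(it)(me)(us)(them)]`, put parentheses inside square brackets. In Ruby's regex syntax a bracket expression is a character class that matches a single character, so such entries probably do not behave like the alternation they resemble. The check below is a sketch and not part of the release. `wrap` is a hypothetical helper mirroring the `/\b#{list_item}\b/i` wrapping in `lib/language_filter.rb`, and the `(?:...)` rewrite is an assumption about the intent.

```ruby
# Hypothetical helper mirroring the gem's /\b#{list_item}\b/i wrapping.
def wrap(list_item)
  /\b#{list_item}\b/i
end

# Entry as it appears in violence.txt: the brackets form a character class,
# so only ONE character from the set "(", ")", h, i, m, e, r, t, u, s is
# consumed after "shoot ".
AS_SHIPPED = wrap('shoot [(him)(her)(it)(me)(us)(them)]')

# Assumed intent: a non-capturing group with alternation between whole words.
ALTERNATION = wrap('shoot (?:him|her|it|me|us|them)')

p AS_SHIPPED.match?('They shoot him')    # false
p ALTERNATION.match?('They shoot him')   # true
p ALTERNATION.match?('They shoot hoops') # false
```

A non-capturing group is used in the rewrite so that `String#scan` would still yield whole matches rather than arrays of captured groups.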

data/language_filter.gemspec ADDED

@@ -0,0 +1,23 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'language_filter/version'
+Gem::Specification.new do |spec|
+  spec.name          = "language_filter"
+  spec.version       = LanguageFilter::VERSION
+  spec.authors       = ["Chris Fritz"]
+  spec.email         = ["chrisvfritz@gmail.com"]
+  spec.description   = %q{LanguageFilter is a Ruby gem to detect and optionally filter various categories of language.}
+  spec.summary       = %q{LanguageFilter is a Ruby gem to detect and optionally filter various categories of language.}
+  spec.homepage      = "http://github.com/chrisvfritz/language_filter"
+  spec.license       = "MIT"
+  spec.files         = `git ls-files`.split($/)
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "bundler", "~> 1.3"
+  spec.add_development_dependency "rake"
+end

data/lib/language_filter.rb ADDED

@@ -0,0 +1,172 @@
+require 'pathname'
+require 'yaml'
+require 'language_filter/error'
+require 'language_filter/version'
+module LanguageFilter
+	class Filter
+		attr_accessor :matchlist, :exceptionlist, :replacement
+		DEFAULT_EXCEPTIONLIST = []
+		DEFAULT_MATCHLIST = File.dirname(__FILE__) + "/../config/filters/profanity.txt"
+		DEFAULT_REPLACEMENT = :stars
+		def initialize(options={})
+			@matchlist = if options[:matchlist] then
+				validate_list_content(options[:matchlist])
+				set_list_content(options[:matchlist])
+			else set_list_content(DEFAULT_MATCHLIST) end
+			@exceptionlist = if options[:exceptionlist] then
+				validate_list_content(options[:exceptionlist])
+				set_list_content(options[:exceptionlist])
+			else set_list_content(DEFAULT_EXCEPTIONLIST) end
+			@replacement = options[:replacement] || DEFAULT_REPLACEMENT
+			validate_replacement
+		end
+		# SETTERS
+		def matchlist=(content)
+			validate_list_content(content)
+			@matchlist = case content
+			when :default then set_list_content(DEFAULT_MATCHLIST)
+			else @matchlist = set_list_content(content)
+			end
+		end
+		def exceptionlist=(content)
+			if content == :default then
+				@exceptionlist = set_list_content(DEFAULT_EXCEPTIONLIST)
+			else
+				validate_list_content(content)
+				@exceptionlist = set_list_content(content)
+			end
+		end
+		def replacement=(value)
+			@replacement = case value
+			when :default then :stars
+			else value
+			end
+			validate_replacement
+		end
+		# LANGUAGE
+		def match?(text)
+			return false unless text.to_s.size >= 3
+			@matchlist.each do |list_item|
+				start_at = 0
+				text.scan(/\b#{list_item}\b/i) do |match|
+					match_start = text[start_at,text.size].index(/\b#{list_item}\b/i) unless @exceptionlist.empty?
+					match_end = match_start + match.size unless @exceptionlist.empty?
+					unless match == [nil] then
+						return true if @exceptionlist.empty? or not protected_by_exceptionlist?(match_start,match_end,text,start_at)
+					end
+					start_at = match_end + 1
+				end
+			end
+			false
+		end
+		def matched(text)
+			words = []
+			return words unless text.to_s.size >= 3
+			@matchlist.each do |list_item|
+				start_at = 0
+				text.scan(/\b#{list_item}\b/i) do |match|
+					match_start = text[start_at,text.size].index(/\b#{list_item}\b/i) unless @exceptionlist.empty?
+					match_end = match_start + match.size unless @exceptionlist.empty?
+					unless match == [nil] then
+						words << match if @exceptionlist.empty? or not protected_by_exceptionlist?(match_start,match_end,text,start_at)
+					end
+					start_at = match_end + 1
+				end
+			end
+			words.uniq
+		end
+		def sanitize(text)
+			return text unless text.to_s.size >= 3
+			@matchlist.each do |list_item|
+				start_at = 0
+				text.gsub! /\b#{list_item}\b/i do |match|
+					match_start = text[start_at,text.size].index(/\b#{list_item}\b/i) unless @exceptionlist.empty?
+					match_end = match_start + match.size unless @exceptionlist.empty?
+					unless @exceptionlist.empty? or not protected_by_exceptionlist?(match_start,match_end,text,start_at) then
+						start_at = match_end + 1
+						match
+					else
+						start_at = match_end + 1
+						replace(match)
+					end
+				end
+			end
+			text
+		end
+		private
+		# VALIDATIONS
+		def validate_list_content(content)
+			case content
+			when Array    then content.all? {|c| c.class == String} || raise(LanguageFilter::EmptyContentList.new("List content array is empty."))
+			when String   then File.exist?(content)                 || raise(LanguageFilter::UnkownContentFile.new("List content file \"#{content}\" can't be found."))
+			when Pathname then content.exist?                       || raise(LanguageFilter::UnkownContentFile.new("List content file \"#{content}\" can't be found."))
+			when Symbol   then
+				case content
+				when :default, :hate, :profanity, :sex, :violence then true
+				else raise(LanguageFilter::UnkownContent.new("The only accepted symbols are :default, :hate, :profanity, :sex, and :violence."))
+				end
+			else raise LanguageFilter::UnkownContent.new("The list content can be either an Array, Pathname, or String path to a file.")
+			end
+		end
+		def validate_replacement
+			case @replacement
+			when :default, :garbled, :vowels, :stars, :nonconsonants
+			else raise LanguageFilter::UnknownReplacement.new("This is not a known replacement type.")
+			end
+		end
+		# HELPERS
+		def set_list_content(list)
+			case list
+			when :hate      then load_list File.dirname(__FILE__) + "/../config/filters/hate.txt"
+			when :profanity then load_list File.dirname(__FILE__) + "/../config/filters/profanity.txt"
+			when :sex       then load_list File.dirname(__FILE__) + "/../config/filters/sex.txt"
+			when :violence  then load_list File.dirname(__FILE__) + "/../config/filters/violence.txt"
+			when Array then list
+			when String, Pathname then load_list list.to_s
+			else []
+			end
+		end
+		def load_list(filepath)
+			IO.readlines(filepath).each {|line| line.gsub!(/\n/,'')}
+		end
+		def protected_by_exceptionlist?(match_start,match_end,text,start_at)
+			@exceptionlist.each do |list_item|
+				exception_start = text[start_at,text.size].index(/\b#{list_item}\b/i)
+				if exception_start and exception_start <= match_start then
+					return true if exception_start + text[start_at,text.size][/\b#{list_item}\b/i].size >= match_end
+				end
+			end
+			return false
+		end
+		# This was moved to private because users should just use sanitize for any content
+		def replace(word)
+			case @replacement
+			when :vowels then word.gsub(/[aeiou]/i, '*')
+			when :stars  then '*' * word.size
+			when :nonconsonants then word.gsub(/[^bcdfghjklmnpqrstvwxyz]/i, '*')
+			when :default, :garbled then '$@!#%'
+			else raise LanguageFilter::UnknownReplacement.new("#{@replacement} is not a known replacement type.")
+			end
+		end
+	end
+end
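The README describes the intended exceptionlist behaviour: a match that sits inside an exception phrase is left alone, while the same word elsewhere is replaced. The sketch below is not the gem's implementation. It is one assumed way to get that behaviour, using a hypothetical `sanitize_with_exceptions` helper and only the `:stars` style of replacement. Stars preserve string length, so offsets computed up front stay valid across replacements.

```ruby
# Hypothetical helper, not part of the gem: record the offsets of every
# exception phrase, then star out only the matches outside those spans.
def sanitize_with_exceptions(text, matchlist, exceptionlist)
  # Collect a half-open range of character offsets for each exception hit.
  protected_spans = exceptionlist.flat_map do |item|
    spans = []
    text.scan(/\b#{item}\b/i) do
      m = Regexp.last_match
      spans << (m.begin(0)...m.end(0))
    end
    spans
  end

  # Apply each matchlist entry in turn; gsub returns a new string each time.
  matchlist.reduce(text) do |current, item|
    current.gsub(/\b#{item}\b/i) do |match|
      m = Regexp.last_match
      covered = protected_spans.any? do |span|
        span.begin <= m.begin(0) && m.end(0) <= span.end
      end
      covered ? match : '*' * match.size
    end
  end
end

p sanitize_with_exceptions('Dogs rule, but dogs drool', ['dogs?'], ['dogs drool'])
# "**** rule, but dogs drool"
```

Working with absolute offsets from `Regexp.last_match` avoids tracking a separate `start_at` cursor. A length-changing replacement such as `:garbled` would need the spans recomputed after each substitution.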

data/lib/language_filter/error.rb ADDED

@@ -0,0 +1,7 @@
+module LanguageFilter
+  class Error < RuntimeError; end
+  class UnkownContent     < Error; end
+  class UnkownContentFile < Error; end
+  class EmptyContentList  < Error; end
+end
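`lib/language_filter.rb` raises `LanguageFilter::UnknownReplacement`, but `error.rb` as shown defines only `UnkownContent`, `UnkownContentFile`, and `EmptyContentList`. With this version, an invalid replacement type would therefore likely surface as a `NameError` rather than the intended error. A minimal sketch of the missing class follows. It is an assumed addition, not something present in this release.

```ruby
module LanguageFilter
  class Error < RuntimeError; end
  # Assumed addition: this class does not appear in the 0.2 error.rb.
  class UnknownReplacement < Error; end
end

# With the class defined, callers can rescue the shared base class.
begin
  raise LanguageFilter::UnknownReplacement, 'This is not a known replacement type.'
rescue LanguageFilter::Error => e
  puts "#{e.class}: #{e.message}"
end
```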

data/lib/language_filter/version.rb ADDED

@@ -0,0 +1,3 @@
+module LanguageFilter
+  VERSION = "0.2"
+end

metadata ADDED

@@ -0,0 +1,87 @@
+--- !ruby/object:Gem::Specification
+name: language_filter
+version: !ruby/object:Gem::Version
+  version: '0.2'
+platform: ruby
+authors:
+- Chris Fritz
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-07-04 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: LanguageFilter is a Ruby gem to detect and optionally filter various
+  categories of language.
+email:
+- chrisvfritz@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- config/filters/hate.txt
+- config/filters/profanity.txt
+- config/filters/sex.txt
+- config/filters/violence.txt
+- language_filter.gemspec
+- lib/language_filter.rb
+- lib/language_filter/error.rb
+- lib/language_filter/version.rb
+homepage: http://github.com/chrisvfritz/language_filter
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.0.3
+signing_key:
+specification_version: 4
+summary: LanguageFilter is a Ruby gem to detect and optionally filter various categories
+  of language.
+test_files: []