language_filter 0.2.1 → 0.3.0
- checksums.yaml +4 -4
- data/README.md +98 -0
- data/Rakefile +11 -0
- data/config/exceptionlists/hate.txt +0 -0
- data/config/exceptionlists/mccormick.txt +0 -0
- data/config/exceptionlists/profanity.txt +1 -0
- data/config/exceptionlists/sex.txt +5 -0
- data/config/exceptionlists/violence.txt +5 -0
- data/config/matchlists/hate.txt +7 -0
- data/config/matchlists/mccormick.txt +342 -0
- data/config/matchlists/profanity.txt +10 -0
- data/config/{filters → matchlists}/sex.txt +13 -13
- data/config/{filters → matchlists}/violence.txt +4 -4
- data/lib/language_filter.rb +278 -166
- data/lib/language_filter/version.rb +2 -2
- data/test/lib/language_filter/methods_test.rb +66 -0
- data/test/lib/language_filter/version_test.rb +9 -0
- data/test/lists/simpsons-5000.txt +1 -0
- data/test/lists/wiktionary-50000.txt +1 -0
- data/test/test_helper.rb +111 -0
- metadata +23 -7
- data/config/filters/hate.txt +0 -6
- data/config/filters/profanity.txt +0 -10
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 26f77df57fc50ffb3f1898c5f1a586d54b7af10d
|
4
|
+
data.tar.gz: f966bdf06765fa035c2a0556e8222d0299197a0a
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 2514597d8f670ba7eec79a5768871cd7801e4a41623b36418510e4e5e76a3702dca5ec72b9d3018e4ca8bb70fbfc580e59754c248bcd9efcd2c1a9a2a38f5483
|
7
|
+
data.tar.gz: 1009f14d9849a113472595848b576c5350629eb178035dccf7ab7c5ce8ae6dd819bdaeafa2bb9c35870a3add1a38b9ff3a8572da003419caaa621227ab63dd74
|
data/README.md
CHANGED
@@ -1,5 +1,25 @@
|
|
1
|
+
- [LanguageFilter](#languagefilter)
|
2
|
+
- [About](#about)
|
3
|
+
- [Guiding Principles](#guiding-principles)
|
4
|
+
- [TO-DO](#to-do)
|
5
|
+
- [Installation](#installation)
|
6
|
+
- [Usage](#usage)
|
7
|
+
- [`:matchlist` and `:exceptionlist`](#matchlist-and-exceptionlist)
|
8
|
+
- [Symbol signifying a pre-packaged list](#symbol-signifying-a-pre-packaged-list)
|
9
|
+
- [An array of words and phrases to screen for](#an-array-of-words-and-phrases-to-screen-for)
|
10
|
+
- [A filepath or string pointing to a filepath](#a-filepath-or-string-pointing-to-a-filepath)
|
11
|
+
- [Formatting your lists](#formatting-your-lists)
|
12
|
+
- [`:replacement`](#replacement)
|
13
|
+
- [`:creative_letters`](#creative_letters)
|
14
|
+
- [Methods to modify filters after creation](#methods-to-modify-filters-after-creation)
|
15
|
+
- [ActiveModel integration](#activemodel-integration)
|
16
|
+
- [Contributing](#contributing)
|
17
|
+
|
18
|
+
|
1
19
|
# LanguageFilter
|
2
20
|
|
21
|
+
## About
|
22
|
+
|
3
23
|
LanguageFilter is a Ruby gem to detect and optionally filter multiple categories of language. It was adapted from Thiago Jackiw's Obscenity gem for [FractalWriting.org](http://fractalwriting.org) and features many improvements, including:
|
4
24
|
|
5
25
|
- The ability to create and independently configure multiple language filters.
|
@@ -8,6 +28,30 @@ LanguageFilter is a Ruby gem to detect and optionally filter multiple categories
|
|
8
28
|
- More neutral language to accommodate a wider variety of use cases. For example, LanguageFilter uses `matchlist` and `exceptionlist` instead of `blacklist` and `whitelist`, since the gem can be used not only for censorship, but also for content *type* identification (e.g. fantasy, sci-fi, historical, etc. in the context of creative writing).
|
9
29
|
- More robust exceptionlist (i.e. whitelist) handling. Given a simple example of a matchlist containing `cock` and an exceptionlist containing `game cock`, the other filtering gems I've seen will flag the `cock` in `game cock`, despite the exceptionlist. LanguageFilter is a little smarter and does what you would expect, so that when sanitizing the string `cock is usually sexual, but a game cock is just an animal`, the returned string will be `**** is usually sexual, but a game cock is just an animal`.
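
The exceptionlist behaviour described in the last bullet can be illustrated with a minimal plain-Ruby sketch. This is not the gem's implementation; `sanitize_with_exceptions` is a hypothetical helper that assumes whole-word, case-insensitive matching and simply skips any match that falls inside an exception phrase.

```ruby
# Hypothetical helper, NOT the gem's implementation: replaces matchlist hits
# with asterisks unless the hit starts inside an exceptionlist phrase.
def sanitize_with_exceptions(text, matchlist, exceptionlist)
  # Collect the character ranges covered by exception phrases.
  protected_ranges = exceptionlist.flat_map do |phrase|
    text.to_enum(:scan, /\b#{Regexp.escape(phrase)}\b/i).map do
      match = Regexp.last_match
      (match.begin(0)...match.end(0))
    end
  end

  match_regex = /\b(?:#{matchlist.map { |word| Regexp.escape(word) }.join('|')})\b/i

  # Star out each hit, unless it begins inside a protected range.
  text.gsub(match_regex) do |hit|
    start = Regexp.last_match.begin(0)
    protected_ranges.any? { |range| range.cover?(start) } ? hit : '*' * hit.length
  end
end

puts sanitize_with_exceptions(
  'cock is usually sexual, but a game cock is just an animal',
  ['cock'],
  ['game cock']
)
# prints: **** is usually sexual, but a game cock is just an animal
```

Recording the exception ranges first keeps the matchlist pattern simple; a real filter would probably precompile both patterns rather than rebuild them on every call.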
|
10
30
|
|
31
|
+
Note, however, that you should not use this gem or any other language filtering library to replace human moderation, for [reasons outlined here](http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html). The major takeaway is that content filtering is a very difficult problem and context is everything. You can keep refining your filters, but that can easily become a full-time job, and it is difficult to make these refinements without unintentionally creating more false positives, which is extremely frustrating from a user's point of view. This kind of tool is best used to *guide* users, rather than enforce rules on them. See the guiding principles below for more on this.
|
32
|
+
|
33
|
+
## Guiding Principles
|
34
|
+
|
35
|
+
These are things I've learned from developing this gem that are good to keep in mind when using or contributing to the project.
|
36
|
+
|
37
|
+
**It's better to under-match than over-match.**
|
38
|
+
|
39
|
+
It's extremely frustrating, for example, if someone is prevented from entering a perfectly good username that just happens to contain the word "ass", as many do. It's not nearly as frustrating to be exposed to profanity that you have to strain to make out.
|
40
|
+
|
41
|
+
**Using filters for language detection that aid in self-categorization is a better idea than automatically forcing mature/profane/sexual/etc tags on user-generated content.**
|
42
|
+
|
43
|
+
If someone uses language that could be considered profanity in many contexts, but is not profanity in their particular context, such as "bitch" to describe a female dog or "ass" to describe a donkey, they will be justifiably upset at the automatic categorization. It's better to say, "Your story contains the following words or phrases that we think might be profane: bitch, ass. Click on the `profane` tag if you'd like to add it." Then other users can flag content that still isn't correctly categorized and moderators can edit content tags and educate the user to further prevent miscategorization.
|
44
|
+
|
45
|
+
## TO-DO
|
46
|
+
|
47
|
+
- Expand the pre-packaged matchlists to be more exhaustive
|
48
|
+
- Add some ActiveModel integration, along the lines of:
|
49
|
+
|
50
|
+
``` ruby
|
51
|
+
filter_language :content, matchlist: :hate, replacement: :garbled
|
52
|
+
validate_language :username, matchlist: :profanity
|
53
|
+
```
|
54
|
+
|
11
55
|
## Installation
|
12
56
|
|
13
57
|
Add this line to your application's Gemfile:
|
@@ -142,6 +186,34 @@ Example: This is some f*ck*d up sh*t.
|
|
142
186
|
|
143
187
|
Example: 7|-|1$ 1$ $0/\/\3 Ph*****D UP ******.
|
144
188
|
|
189
|
+
(**note: `creative_letters: true` must be set to match plain words to leetspeak**)
|
190
|
+
|
191
|
+
### `:creative_letters`
|
192
|
+
|
193
|
+
If you want to match leetspeak or other creative lettering, figuring out all the possible variations of each letter in a word can be exhausting. *And* you don't want to go through the whole process for each and every word, creating complicated matchlists that humans will struggle to parse.
|
194
|
+
|
195
|
+
That's why there's a `:creative_letters` option. When set to true, your filter will use a version of your matchlist that will catch common and not-so-common letterings for each word in your matchlist. The downside to this option is a significant hit to performance.
|
196
|
+
|
197
|
+
Here's an example. Let's say you have a matchlist with a single word:
|
198
|
+
|
199
|
+
```
|
200
|
+
hippopotamus
|
201
|
+
```
|
202
|
+
|
203
|
+
But what if some smart aleck types in something like this?
|
204
|
+
|
205
|
+
```
|
206
|
+
}{!|o|o[]|o()+4|\/|v$
|
207
|
+
```
|
208
|
+
|
209
|
+
Well, if you have `:creative_letters` activated, the matchlist that your filtering engine will actually use looks more like this:
|
210
|
+
|
211
|
+
```
|
212
|
+
(?:(?:h|\\#|[\\|\\}\\{\\\\/\\(\\)\\[\\]]\\-?[\\|\\}\\{\\\\/\\(\\)\\[\\]])+)(?:(?:i|l|1|\\!|\\u00a1|\\||\\]|\\[|\\\\|/|[^a-z]eye[^a-z]|\\u00a3|[\\|li1\\!\\u00a1\\[\\]\\(\\)\\{\\}]_|\\u00ac|[^a-z]el+[^a-z]))(?:(?:p|\\u00b6|[\\|li1\\[\\]\\!\\u00a1/\\\\][\\*o\\u00b0\\\"\\>7\\^]|[^a-z]pee+[^a-z])+)(?:(?:p|\\u00b6|[\\|li1\\[\\]\\!\\u00a1/\\\\][\\*o\\u00b0\\\"\\>7\\^]|[^a-z]pee+[^a-z])+)(?:(?:o|0|\\(\\)|\\[\\]|\\u00b0|[^a-z]oh+[^a-z])+)(?:(?:p|\\u00b6|[\\|li1\\[\\]\\!\\u00a1/\\\\][\\*o\\u00b0\\\"\\>7\\^]|[^a-z]pee+[^a-z])+)(?:(?:o|0|\\(\\)|\\[\\]|\\u00b0|[^a-z]oh+[^a-z])+)(?:(?:t|7|\\+|\\u2020|\\-\\|\\-|\\'\\]\\[\\')+)(?:(?:a|@|4|\\^|/\\\\|/\\-\\\\|aye?)+)(?:(?:m|[\\|\\(\\)/](?:\\\\/|v|\\|)[\\|\\(\\)\\\\]|\\^\\^|[^a-z]em+[^a-z])+)(?:(?:u|v|\\u00b5|[\\|\\(\\)\\[\\]\\{\\}]_[\\|\\(\\)\\[\\]\\{\\}]|\\L\\||\\/|[^a-z]you[^a-z]|[^a-z]yoo+[^a-z]|[^a-z]vee+[^a-z]))(?:(?:s|\\$|5|\\u00a7|[^a-z]es+[^a-z]|z|2|7_|\\~/_|\\>_|\\%|[^a-z]zee+[^a-z])+)
|
213
|
+
```
|
214
|
+
|
215
|
+
And that barely legible mess can be made completely illegible by the `sanitize` method. Even *this* crazy string of regex can be beaten though. People *will* have to get quite creative, but people *are* creative. And making it difficult to enter banned content can make it quite an attractive challenge. For this reason and because of the aforementioned performance hit, **this option is not recommended for production systems**.
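
To make the idea behind the expanded pattern more concrete, here is a heavily simplified sketch of per-letter expansion in plain Ruby. The `VARIANTS` table and the `creative_regex` helper are hypothetical and cover only a handful of substitutions; the pattern the gem actually generates is far more extensive.

```ruby
# Hypothetical, heavily abbreviated substitution table. The real pattern
# shown above covers many more variants per letter.
VARIANTS = {
  'a' => ['a', '@', '4'],
  'i' => ['i', '1', '!'],
  'o' => ['o', '0', '()'],
  's' => ['s', '$', '5'],
  't' => ['t', '7', '+']
}.freeze

# Builds a case-insensitive regex in which every letter of the word may be
# written as (and repeated as) any of its known variants.
def creative_regex(word)
  parts = word.downcase.each_char.map do |char|
    alternatives = VARIANTS.fetch(char, [char]).map { |variant| Regexp.escape(variant) }
    "(?:#{alternatives.join('|')})+"
  end
  Regexp.new(parts.join, Regexp::IGNORECASE)
end

regex = creative_regex('hippopotamus')
puts regex.match?('h1pp0p07@mu$') # => true
puts regex.match?('rhinoceros')   # => false
```

Even this toy version turns a twelve-letter word into a pattern with twelve alternation groups, which hints at where the performance cost comes from.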
|
216
|
+
|
145
217
|
### Methods to modify filters after creation
|
146
218
|
|
147
219
|
If you ever want to change the matchlist, exceptionlist, or replacement type, each parameter is accessible via an assignment method.
|
@@ -181,6 +253,32 @@ my_filter.matchlist.uniq!
|
|
181
253
|
# etc...
|
182
254
|
```
|
183
255
|
|
256
|
+
### ActiveModel integration
|
257
|
+
|
258
|
+
There's not yet any built-in ActiveModel integration, but that doesn't mean it isn't a breeze to work with filters in your model. The examples below should help get you started.
|
259
|
+
|
260
|
+
```ruby
|
261
|
+
# garbles any hateful language in the content attribute before any save to the database
|
262
|
+
before_save :remove_hateful_language
|
263
|
+
|
264
|
+
def remove_hateful_language
|
265
|
+
hate_filter = LanguageFilter::Filter.new matchlist: :hate, replacement: :garbled
|
266
|
+
# use self.content so the attribute is assigned, not a new local variable
self.content = hate_filter.sanitize(content)
|
267
|
+
end
|
268
|
+
```
|
269
|
+
|
270
|
+
``` ruby
|
271
|
+
# yells at users if they try to sneak in a dirty username, letting them know exactly why the username they wanted was rejected
|
272
|
+
validate :clean_username
|
273
|
+
|
274
|
+
def clean_username
|
275
|
+
profanity_filter = LanguageFilter::Filter.new matchlist: :profanity
|
276
|
+
if profanity_filter.match? username then
|
277
|
+
errors.add(:username, "The following language is inappropriate in a username: #{profanity_filter.matched(username).join(', ')}")
|
278
|
+
end
|
279
|
+
end
|
280
|
+
```
|
281
|
+
|
184
282
|
## Contributing
|
185
283
|
|
186
284
|
1. Fork it
|
data/Rakefile
CHANGED
File without changes
|
File without changes
|
@@ -0,0 +1 @@
|
|
1
|
+
confucius
|
@@ -0,0 +1,342 @@
|
|
1
|
+
2g1c
|
2
|
+
2 girls 1 cup
|
3
|
+
acrotomophilia
|
4
|
+
anal
|
5
|
+
anilingus
|
6
|
+
anus
|
7
|
+
arsehole
|
8
|
+
ass
|
9
|
+
asshole
|
10
|
+
assmunch
|
11
|
+
auto erotic
|
12
|
+
autoerotic
|
13
|
+
babeland
|
14
|
+
baby batter
|
15
|
+
ball gag
|
16
|
+
ball gravy
|
17
|
+
ball kicking
|
18
|
+
ball licking
|
19
|
+
ball sack
|
20
|
+
ball sucking
|
21
|
+
bangbros
|
22
|
+
bareback
|
23
|
+
barely legal
|
24
|
+
barenaked
|
25
|
+
bastardo
|
26
|
+
bastinado
|
27
|
+
bbw
|
28
|
+
bdsm
|
29
|
+
beaver cleaver
|
30
|
+
beaver lips
|
31
|
+
bestiality
|
32
|
+
bi curious
|
33
|
+
big black
|
34
|
+
big breasts
|
35
|
+
big knockers
|
36
|
+
big tits
|
37
|
+
bimbos
|
38
|
+
birdlock
|
39
|
+
bitch
|
40
|
+
black cock
|
41
|
+
blonde action
|
42
|
+
blonde on blonde action
|
43
|
+
blow j
|
44
|
+
blow your l
|
45
|
+
blue waffle
|
46
|
+
blumpkin
|
47
|
+
bollocks
|
48
|
+
bondage
|
49
|
+
boner
|
50
|
+
boob
|
51
|
+
boobs
|
52
|
+
booty call
|
53
|
+
brown showers
|
54
|
+
brunette action
|
55
|
+
bukkake
|
56
|
+
bulldyke
|
57
|
+
bullet vibe
|
58
|
+
bung hole
|
59
|
+
bunghole
|
60
|
+
busty
|
61
|
+
butt
|
62
|
+
buttcheeks
|
63
|
+
butthole
|
64
|
+
camel toe
|
65
|
+
camgirl
|
66
|
+
camslut
|
67
|
+
camwhore
|
68
|
+
carpet muncher
|
69
|
+
carpetmuncher
|
70
|
+
chocolate rosebuds
|
71
|
+
circlejerk
|
72
|
+
cleveland steamer
|
73
|
+
clit
|
74
|
+
clitoris
|
75
|
+
clover clamps
|
76
|
+
clusterfuck
|
77
|
+
cock
|
78
|
+
cocks
|
79
|
+
coprolagnia
|
80
|
+
coprophilia
|
81
|
+
cornhole
|
82
|
+
cum
|
83
|
+
cumming
|
84
|
+
cunnilingus
|
85
|
+
cunt
|
86
|
+
darkie
|
87
|
+
date rape
|
88
|
+
daterape
|
89
|
+
deep throat
|
90
|
+
deepthroat
|
91
|
+
dick
|
92
|
+
dildo
|
93
|
+
dirty pillows
|
94
|
+
dirty sanchez
|
95
|
+
dog style
|
96
|
+
doggie style
|
97
|
+
doggiestyle
|
98
|
+
doggy style
|
99
|
+
doggystyle
|
100
|
+
dolcett
|
101
|
+
domination
|
102
|
+
dominatrix
|
103
|
+
dommes
|
104
|
+
donkey punch
|
105
|
+
double dong
|
106
|
+
double penetration
|
107
|
+
dp action
|
108
|
+
eat my ass
|
109
|
+
ecchi
|
110
|
+
ejaculation
|
111
|
+
erotic
|
112
|
+
erotism
|
113
|
+
escort
|
114
|
+
ethical slut
|
115
|
+
eunuch
|
116
|
+
faggot
|
117
|
+
fecal
|
118
|
+
felch
|
119
|
+
fellatio
|
120
|
+
feltch
|
121
|
+
female squirting
|
122
|
+
femdom
|
123
|
+
figging
|
124
|
+
fingering
|
125
|
+
fisting
|
126
|
+
foot fetish
|
127
|
+
footjob
|
128
|
+
frotting
|
129
|
+
fuck
|
130
|
+
fuck buttons
|
131
|
+
fudge packer
|
132
|
+
fudgepacker
|
133
|
+
futanari
|
134
|
+
g-spot
|
135
|
+
gang bang
|
136
|
+
gay sex
|
137
|
+
genitals
|
138
|
+
giant cock
|
139
|
+
girl on
|
140
|
+
girl on top
|
141
|
+
girls gone wild
|
142
|
+
goatcx
|
143
|
+
goatse
|
144
|
+
gokkun
|
145
|
+
golden shower
|
146
|
+
goo girl
|
147
|
+
goodpoop
|
148
|
+
goregasm
|
149
|
+
grope
|
150
|
+
group sex
|
151
|
+
guro
|
152
|
+
hand job
|
153
|
+
handjob
|
154
|
+
hard core
|
155
|
+
hardcore
|
156
|
+
hentai
|
157
|
+
homoerotic
|
158
|
+
honkey
|
159
|
+
hooker
|
160
|
+
hot chick
|
161
|
+
how to kill
|
162
|
+
how to murder
|
163
|
+
huge fat
|
164
|
+
humping
|
165
|
+
incest
|
166
|
+
intercourse
|
167
|
+
jack off
|
168
|
+
jail bait
|
169
|
+
jailbait
|
170
|
+
jerk off
|
171
|
+
jigaboo
|
172
|
+
jiggaboo
|
173
|
+
jiggerboo
|
174
|
+
jizz
|
175
|
+
juggs
|
176
|
+
kike
|
177
|
+
kinbaku
|
178
|
+
kinkster
|
179
|
+
kinky
|
180
|
+
knobbing
|
181
|
+
leather restraint
|
182
|
+
leather straight jacket
|
183
|
+
lemon party
|
184
|
+
lolita
|
185
|
+
lovemaking
|
186
|
+
make me come
|
187
|
+
male squirting
|
188
|
+
masturbate
|
189
|
+
menage a trois
|
190
|
+
milf
|
191
|
+
missionary position
|
192
|
+
motherfucker
|
193
|
+
mound of venus
|
194
|
+
mr hands
|
195
|
+
muff diver
|
196
|
+
muffdiving
|
197
|
+
nambla
|
198
|
+
nawashi
|
199
|
+
negro
|
200
|
+
neonazi
|
201
|
+
nig nog
|
202
|
+
nigga
|
203
|
+
nigger
|
204
|
+
nimphomania
|
205
|
+
nipple
|
206
|
+
nipples
|
207
|
+
nsfw images
|
208
|
+
nude
|
209
|
+
nudity
|
210
|
+
nympho
|
211
|
+
nymphomania
|
212
|
+
octopussy
|
213
|
+
omorashi
|
214
|
+
one cup two girls
|
215
|
+
one guy one jar
|
216
|
+
orgasm
|
217
|
+
orgy
|
218
|
+
paedophile
|
219
|
+
panties
|
220
|
+
panty
|
221
|
+
pedobear
|
222
|
+
pedophile
|
223
|
+
pegging
|
224
|
+
penis
|
225
|
+
phone sex
|
226
|
+
piece of shit
|
227
|
+
piss pig
|
228
|
+
pissing
|
229
|
+
pisspig
|
230
|
+
playboy
|
231
|
+
pleasure chest
|
232
|
+
pole smoker
|
233
|
+
ponyplay
|
234
|
+
poof
|
235
|
+
poop chute
|
236
|
+
poopchute
|
237
|
+
porn
|
238
|
+
porno
|
239
|
+
pornography
|
240
|
+
prince albert piercing
|
241
|
+
pthc
|
242
|
+
pubes
|
243
|
+
pussy
|
244
|
+
queaf
|
245
|
+
raghead
|
246
|
+
raging boner
|
247
|
+
rape
|
248
|
+
raping
|
249
|
+
rapist
|
250
|
+
rectum
|
251
|
+
reverse cowgirl
|
252
|
+
rimjob
|
253
|
+
rimming
|
254
|
+
rosy palm
|
255
|
+
rosy palm and her 5 sisters
|
256
|
+
rusty trombone
|
257
|
+
s&m
|
258
|
+
sadism
|
259
|
+
scat
|
260
|
+
schlong
|
261
|
+
scissoring
|
262
|
+
semen
|
263
|
+
sex
|
264
|
+
sexo
|
265
|
+
sexy
|
266
|
+
shaved beaver
|
267
|
+
shaved pussy
|
268
|
+
shemale
|
269
|
+
shibari
|
270
|
+
shit
|
271
|
+
shota
|
272
|
+
shrimping
|
273
|
+
slanteye
|
274
|
+
slut
|
275
|
+
smut
|
276
|
+
snatch
|
277
|
+
snowballing
|
278
|
+
sodomize
|
279
|
+
sodomy
|
280
|
+
spic
|
281
|
+
spooge
|
282
|
+
spread legs
|
283
|
+
strap on
|
284
|
+
strapon
|
285
|
+
strappado
|
286
|
+
strip club
|
287
|
+
style doggy
|
288
|
+
suck
|
289
|
+
sucks
|
290
|
+
suicide girls
|
291
|
+
sultry women
|
292
|
+
swastika
|
293
|
+
swinger
|
294
|
+
tainted love
|
295
|
+
taste my
|
296
|
+
tea bagging
|
297
|
+
threesome
|
298
|
+
throating
|
299
|
+
tied up
|
300
|
+
tight white
|
301
|
+
tit
|
302
|
+
tits
|
303
|
+
titties
|
304
|
+
titty
|
305
|
+
tongue in a
|
306
|
+
topless
|
307
|
+
tosser
|
308
|
+
towelhead
|
309
|
+
tranny
|
310
|
+
tribadism
|
311
|
+
tub girl
|
312
|
+
tubgirl
|
313
|
+
tushy
|
314
|
+
twat
|
315
|
+
twink
|
316
|
+
twinkie
|
317
|
+
two girls one cup
|
318
|
+
undressing
|
319
|
+
upskirt
|
320
|
+
urethra play
|
321
|
+
urophilia
|
322
|
+
vagina
|
323
|
+
venus mound
|
324
|
+
vibrator
|
325
|
+
violet blue
|
326
|
+
violet wand
|
327
|
+
vorarephilia
|
328
|
+
voyeur
|
329
|
+
vulva
|
330
|
+
wank
|
331
|
+
wet dream
|
332
|
+
wetback
|
333
|
+
white power
|
334
|
+
women rapping
|
335
|
+
wrapping men
|
336
|
+
wrinkled starfish
|
337
|
+
xx
|
338
|
+
xxx
|
339
|
+
yaoi
|
340
|
+
yellow showers
|
341
|
+
yiffy
|
342
|
+
zoophilia
|