winnow 0.0.1
- checksums.yaml +7 -0
- data/.gitignore +17 -0
- data/Gemfile +2 -0
- data/LICENSE.txt +22 -0
- data/README.md +150 -0
- data/Rakefile +6 -0
- data/lib/winnow.rb +12 -0
- data/lib/winnow/fingerprinter.rb +60 -0
- data/lib/winnow/matcher.rb +16 -0
- data/lib/winnow/preprocessor.rb +10 -0
- data/lib/winnow/preprocessors/plaintext.rb +9 -0
- data/lib/winnow/preprocessors/source_code.rb +41 -0
- data/lib/winnow/version.rb +3 -0
- data/spec/fingerprinter_spec.rb +56 -0
- data/spec/matcher_spec.rb +61 -0
- data/spec/preprocessor_spec.rb +49 -0
- data/spec/spec_helper.rb +8 -0
- data/winnow.gemspec +27 -0
- metadata +123 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 09e56c471afeb1bf113f07da677460d4c6b4eb9f
+  data.tar.gz: e778e453f5f50fa49d305b90b97813aaccce4cdd
+SHA512:
+  metadata.gz: 941aaca687e3350ce2bafc6c6e7a9d2209bc36ad23078ed659d308cec42f25435195e561ebfa9f3f8e824d21da437a53abc983821440b16deddd04f39cf765b9
+  data.tar.gz: 0e0075ad638eb46ab271ae5d0c0c896db3c3598200af5e89493eab28b117f9660a2e624a09b7080baf64976dfbd18d7e745ab6d2a2ce3e98d9e484c1e4ab700e
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2014 Ulysse Carion
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,150 @@
+# Winnow
+
+A tiny Ruby library for document fingerprinting.
+
+## What is document fingerprinting?
+
+Document fingerprinting converts a document (e.g. a book, a piece of code, or
+any other string) into a much smaller number of hashes called *fingerprints*. If
+two documents share any fingerprints, then this means there is an exact
+substring match between the two documents.
+
+Document fingerprinting has many applications, but the most obvious one is for
+plagiarism detection. By taking fingerprints of two documents, you can detect if
+parts of one document were copied from another.
+
+This library implements a fingerprinting algorithm called *winnowing*, described
+by Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken in a paper titled
+[*Winnowing: Local Algorithms for Document Fingerprinting*][swa_paper].
+
+## Usage
+
+The `Fingerprinter` class takes care of fingerprinting documents. To create a
+fingerprint, you need to provide two parameters, called the *noise threshold*
+and the *guarantee threshold*. When comparing two documents' fingerprints, no
+match shorter than the noise threshold will be detected, but any match at least
+as long as the guarantee threshold is guaranteed to be found.
+
+The proper values for your noise and guarantee thresholds vary by context.
+Experiment with the data you're looking at until you're happy with the results.
+
+Creating a fingerprinter is easy:
+
+```ruby
+fingerprinter = Winnow::Fingerprinter.new(noise_threshold: 10, guarantee_threshold: 18)
+```
+
+Then, use `#fingerprints` to get the fingerprints. Optionally, pass `:source`
+(default is `nil`) so that Winnow can later report which document a match is
+from.
+
+```ruby
+document = File.new('hamlet.txt')
+fingerprints = fingerprinter.fingerprints(document.read, source: document)
+```
+
+`#fingerprints` just returns a plain-old Ruby `Hash`. The keys of the hash are
+generated from substrings of the document being fingerprinted. Finding shared
+substrings between documents is as simple as seeing if they share any of the
+keys in their `#fingerprints` hash.
+
+To keep things easier for you, Winnow comes with a `Matcher` class that will
+find matches between two documents.
+
+Here's an example that puts everything together:
+
+```ruby
+require 'winnow'
+
+str_a = <<-EOF
+'Twas brillig, and the slithy toves
+Did gyre and gimble in the wabe;
+This is copied.
+All mimsy were the borogoves,
+And the mome raths outgrabe.
+EOF
+
+str_b = <<-EOF
+"Beware the Jabberwock, my son!
+The jaws that bite, the claws that catch!
+Beware the Jubjub bird, and shun
+The frumious -- This is copied. -- Bandersnatch!"
+EOF
+
+fprinter = Winnow::Fingerprinter.new(
+  guarantee_threshold: 13, noise_threshold: 9)
+
+f1 = fprinter.fingerprints(str_a, source: "Stanza 1")
+f2 = fprinter.fingerprints(str_b, source: "Stanza 2")
+
+matches = Winnow::Matcher.find_matches(f1, f2)
+
+# Because 'This is copied' is longer than the guarantee threshold, there might
+# be a couple of matches found here. For the sake of brevity, let's only look at
+# the first match found.
+match = matches.first
+
+# It's possible for the same key to appear in a document multiple times (e.g. if
+# 'This is copied' appears more than once). Winnow::Matcher will return all
+# matches from the same key in an array.
+#
+# In this case, we know there's only one match (because 'This is copied' appears
+# only once in each document), so let's only look at the first one.
+match_a = match.matches_from_a.first
+match_b = match.matches_from_b.first
+
+p match_a.index, match_b.index # 71, 125
+
+match_context_a = str_a[match_a.index - 10 .. match_a.index + 20]
+match_context_b = str_b[match_b.index - 10 .. match_b.index + 20]
+
+# Match from Stanza 1: "e wabe;\nThis is copied.\nAll mim"
+puts "Match from #{match_a.source}: #{match_context_a.inspect}"
+
+# Match from Stanza 2: "ious -- This is copied. -- Band"
+puts "Match from #{match_b.source}: #{match_context_b.inspect}"
+```
+
+You may find that `Matcher` doesn't handle your exact use-case. That's not a
+problem. [The built-in matcher.rb file](lib/winnow/matcher.rb)
+is only about 10 lines of code, so you could easily make your own.
+
+## :boom: :bomb: A major caveat with `String#hash` :bomb: :boom:
+
+In order to avoid [algorithmic complexity attacks][wiki_aca], the value returned
+from Ruby's `String#hash` method [changes every time you restart the
+interpreter][hash_stackoverflow]:
+
+```sh
+$ irb
+2.0.0p353 :001 > "hello".hash
+=> 482951767139383391
+2.0.0p353 :002 > exit
+
+$ irb
+2.0.0p353 :001 > "hello".hash
+=> 3216751850140847920
+2.0.0p353 :002 > exit
+```
+
+(This is the case even if you're using JRuby.)
+
+This means that although the winnowing algorithm *should* allow you to
+precalculate a document's fingerprints and store them somewhere, doing so in
+Ruby will not work unless you're careful to make sure you never restart your
+Ruby runtime.
+
+### A workaround
+
+Winnow looks for the presence of a `String#consistent_hash` method. If it finds
+one, it'll call that rather than call `String#hash`. You can therefore define
+your own hash function if you want to precalculate fingerprint data.
+
+I've put together a super-simple (but effective) gem called
+[consistent_hash][consistent_hash] that implements exactly this. It's about a
+dozen lines of MRI C code and it'll probably work for you as well.
+
+[swa_paper]: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
+[wiki_aca]: http://en.wikipedia.org/wiki/Algorithmic_complexity_attack
+[hash_stackoverflow]: http://stackoverflow.com/questions/23331725/why-are-ruby-hash-methods-randomized
+[consistent_hash]: https://github.com/ucarion/consistent_hash
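Reviewer note: the README's noise and guarantee thresholds can be made concrete with a small standalone sketch of winnowing. This is not the gem's code. `winnow_sketch` is a hypothetical helper, and `Zlib.crc32` is assumed here as a stand-in for `String#hash` so the values stay the same across interpreter restarts.

```ruby
require 'zlib'

# Hypothetical helper, not part of the gem: returns { hash_value => [indices] }.
# k is the noise threshold, t is the guarantee threshold.
def winnow_sketch(str, k:, t:)
  # Hash every k-gram and remember the index where it starts.
  k_grams = str.chars.each_cons(k).each_with_index.map do |chars, index|
    { value: Zlib.crc32(chars.join), index: index }
  end

  fingerprints = {}
  # Slide a window of t - k + 1 consecutive k-grams and keep the smallest hash.
  k_grams.each_cons(t - k + 1) do |window|
    least = window.min_by { |gram| gram[:value] }
    locations = (fingerprints[least[:value]] ||= [])
    # Record each position once, even if it wins several windows.
    locations << least[:index] unless locations.include?(least[:index])
  end
  fingerprints
end

a = winnow_sketch("xxxx This is copied. yyyy", k: 5, t: 9)
b = winnow_sketch("zz -- This is copied. -- zz", k: 5, t: 9)
shared = a.keys & b.keys
```

Because the copied passage is longer than `t`, at least one full window falls inside it in both strings, so `shared` cannot be empty. An input shorter than `k` produces no k-grams and therefore no fingerprints.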
data/Rakefile
ADDED
data/lib/winnow.rb
ADDED
@@ -0,0 +1,12 @@
+require 'winnow/version'
+require 'winnow/preprocessor'
+require 'winnow/fingerprinter'
+require 'winnow/matcher'
+
+module Winnow
+  class Location < Struct.new(:source, :index)
+  end
+
+  class MatchDatum < Struct.new(:matches_from_a, :matches_from_b)
+  end
+end
data/lib/winnow/fingerprinter.rb
ADDED
@@ -0,0 +1,60 @@
+module Winnow
+  class Fingerprinter
+    attr_reader :guarantee, :noise, :preprocessor
+    alias_method :guarantee_threshold, :guarantee
+    alias_method :noise_threshold, :noise
+
+    def initialize(params)
+      @guarantee = params[:guarantee_threshold] || params[:t]
+      @noise = params[:noise_threshold] || params[:k]
+      @preprocessor = params[:preprocessor] || Preprocessors::Plaintext.new
+    end
+
+    def fingerprints(str, params = {})
+      source = params[:source]
+
+      fingerprints = {}
+
+      windows(str, source).each do |window|
+        least_fingerprint = window.min_by { |fingerprint| fingerprint[:value] }
+        value = least_fingerprint[:value]
+        location = least_fingerprint[:location]
+
+        (fingerprints[value] ||= []) << location
+      end
+
+      fingerprints
+    end
+
+    private
+
+    def windows(str, source)
+      k_grams(str, source).each_cons(window_size)
+    end
+
+    def window_size
+      guarantee - noise + 1
+    end
+
+    def k_grams(str, source)
+      tokens(str).each_cons(noise).map do |tokens_k_gram|
+        value = hash(tokens_k_gram.map { |(char)| char }.join)
+        location = Location.new(source, tokens_k_gram.first[1])
+
+        {value: value, location: location}
+      end
+    end
+
+    def tokens(str)
+      preprocessor.preprocess(str)
+    end
+
+    def hash(str)
+      if str.respond_to?(:consistent_hash)
+        str.consistent_hash
+      else
+        str.hash
+      end
+    end
+  end
+end
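Reviewer note: `Fingerprinter#hash` prefers `String#consistent_hash` whenever the method exists. As a rough illustration of what such a method could look like in pure Ruby, the sketch below derives an integer from a SHA-1 digest. This is an assumption made for illustration only. It is not how the consistent_hash gem (which is C code) is implemented, and `stable_or_default_hash` is a hypothetical name that only mirrors the dispatch in the hunk.

```ruby
require 'digest/sha1'

# Assumption: an illustrative pure-Ruby definition, NOT the consistent_hash gem.
class String
  def consistent_hash
    # Read the first 8 hex digits of the SHA-1 digest as an integer.
    Digest::SHA1.hexdigest(self)[0, 8].to_i(16)
  end
end

# Same dispatch as Fingerprinter#hash: use consistent_hash when available.
def stable_or_default_hash(str)
  if str.respond_to?(:consistent_hash)
    str.consistent_hash
  else
    str.hash
  end
end

first  = stable_or_default_hash("hello")
second = stable_or_default_hash("hello".dup)
```

Unlike `String#hash`, the value depends only on the string's bytes, so fingerprints computed in one process can be compared with ones computed in another.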
data/lib/winnow/matcher.rb
ADDED
@@ -0,0 +1,16 @@
+module Winnow
+  class Matcher
+    class << self
+      def find_matches(fingerprints_a, fingerprints_b, params = {})
+        whitelist = params[:whitelist] || []
+
+        matched_values = fingerprints_a.keys & fingerprints_b.keys - whitelist
+
+        matched_values.map do |value|
+          matches_a, matches_b = fingerprints_a[value], fingerprints_b[value]
+          MatchDatum.new(matches_a, matches_b)
+        end
+      end
+    end
+  end
+end
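Reviewer note: in Ruby, `-` binds more tightly than `&`, so the `matched_values` expression parses as `fingerprints_a.keys & (fingerprints_b.keys - whitelist)`. That still yields the same set as intersecting first and subtracting afterwards. The sketch below checks this on data shaped like the fixtures in the matcher spec, with plain integers standing in for `Winnow::Location` objects, and then shows one way a custom matcher might pair up locations. The pairing step is a hypothetical variation, not something the gem provides.

```ruby
# Stand-in data: fingerprint hashes mapping hash value => locations.
fingerprints_a = { 0 => [0], 1 => [1, 2] }
fingerprints_b = { 0 => [3], 1 => [4], 3 => [5] }
whitelist = [0]

# As written in matcher.rb: parsed as a.keys & (b.keys - whitelist).
implicit = fingerprints_a.keys & fingerprints_b.keys - whitelist
# With the grouping spelled out the other way round.
explicit = (fingerprints_a.keys & fingerprints_b.keys) - whitelist

# Hypothetical custom matcher: pair every location in A with every one in B.
pairs = explicit.flat_map do |value|
  fingerprints_a[value].product(fingerprints_b[value])
end
```

Both groupings agree because removing whitelisted values from one side of an intersection removes them from the result as well.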
data/lib/winnow/preprocessors/source_code.rb
ADDED
@@ -0,0 +1,41 @@
+require 'rouge'
+
+module Winnow
+  module Preprocessors
+    class SourceCode < Preprocessor
+      attr_reader :lexer
+
+      def initialize(language)
+        @lexer = Rouge::Lexer.find(language)
+      end
+
+      def preprocess(str)
+        current_index = 0
+        processed_chars = []
+
+        lexer.lex(str).to_a.each do |token|
+          type, chunk = token
+
+          processed_chunk = case
+            when type <= Rouge::Token::Tokens::Name
+              'x'
+            when type <= Rouge::Token::Tokens::Comment
+              ''
+            when type <= Rouge::Token::Tokens::Text
+              ''
+            else
+              chunk
+            end
+
+          processed_chars += processed_chunk.chars.map do |c|
+            [c, current_index]
+          end
+
+          current_index += chunk.length
+        end
+
+        processed_chars
+      end
+    end
+  end
+end
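Reviewer note: `Fingerprinter` only ever calls `preprocessor.preprocess(str)` and expects an array of `[char, index]` pairs back, so other preprocessors can follow the same contract. The class below is a hypothetical example that is not shipped with the gem. It drops whitespace and folds case while keeping each character's original index, and it is written standalone here rather than as a subclass of `Winnow::Preprocessor`.

```ruby
# Hypothetical preprocessor, not part of the gem.
class FoldingPreprocessor
  # Returns [char, original_index] pairs, skipping whitespace and downcasing.
  def preprocess(str)
    str.each_char.with_index
       .reject { |char, _index| char =~ /\s/ }
       .map { |char, index| [char.downcase, index] }
  end
end

pairs = FoldingPreprocessor.new.preprocess("A b\nC")
```

Presumably an instance could be passed as the `:preprocessor` option that `Fingerprinter#initialize` reads, though that combination is untested here.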
data/spec/fingerprinter_spec.rb
ADDED
@@ -0,0 +1,56 @@
+require 'spec_helper'
+
+describe Winnow::Fingerprinter do
+  describe '#initialize' do
+    it 'accepts :guarantee_threshold, :noise_threshold' do
+      fprinter = Winnow::Fingerprinter.new(
+        guarantee_threshold: 0, noise_threshold: 1)
+      expect(fprinter.guarantee_threshold).to eq 0
+      expect(fprinter.noise_threshold).to eq 1
+    end
+
+    it 'accepts :t and :k' do
+      fprinter = Winnow::Fingerprinter.new(t: 0, k: 1)
+      expect(fprinter.guarantee_threshold).to eq 0
+      expect(fprinter.noise_threshold).to eq 1
+    end
+  end
+
+  describe '#fingerprints' do
+    it 'hashes strings to get keys' do
+      # if t = k = 1, then each character will become a fingerprint
+      fprinter = Winnow::Fingerprinter.new(t: 1, k: 1)
+      fprints = fprinter.fingerprints("abcdefg")
+
+      hashes = ('a'..'g').map(&:hash)
+
+      expect(fprints.keys).to eq hashes
+    end
+
+    it 'chooses the smallest hash per window' do
+      # window size = t - k + 1 = 2 ; for a two-char string, the sole
+      # fingerprint should just be from the char with the smallest hash value
+      fprinter = Winnow::Fingerprinter.new(t: 2, k: 1)
+      fprints = fprinter.fingerprints("ab")
+
+      expect(fprints.keys.length).to eq 1
+      expect(fprints.keys.first).to eq ["a", "b"].map(&:hash).min
+    end
+
+    it 'correctly reports the location of a fingerprint' do
+      fprinter = Winnow::Fingerprinter.new(t: 1, k: 1)
+      fprints = fprinter.fingerprints("a\nb\ncde\nfg", source: "example")
+
+      fprint_d = fprints['d'.hash].first
+
+      expect(fprint_d.index).to eq 5
+      expect(fprint_d.source).to eq "example"
+    end
+
+    it 'uses #consistent_hash when possible' do
+      String.any_instance.should_receive(:consistent_hash)
+
+      Winnow::Fingerprinter.new(t: 1, k: 1).fingerprints("a")
+    end
+  end
+end
data/spec/matcher_spec.rb
ADDED
@@ -0,0 +1,61 @@
+require 'spec_helper'
+
+describe Winnow::Matcher do
+  describe '#find_matches' do
+    def make_locations(*indices)
+      indices.map { |n| Winnow::Location.new(nil, n) }
+    end
+
+    let(:fprint1) do
+      {
+        0 => make_locations(0),
+        1 => make_locations(1, 2),
+      }
+    end
+
+    let(:fprint2) do
+      {
+        0 => make_locations(3),
+        1 => make_locations(4),
+        3 => make_locations(5)
+      }
+    end
+
+    let(:matches) { Winnow::Matcher.find_matches(fprint1, fprint2) }
+
+    def match_with_loc(index, matches = matches)
+      matches.find do |data|
+        data.matches_from_a.find { |loc| loc.index == index } ||
+          data.matches_from_b.find { |loc| loc.index == index }
+      end
+    end
+
+    it 'returns an array of match data' do
+      expect(matches).to be_an(Array)
+      expect(matches.first).to be_a(Winnow::MatchDatum)
+    end
+
+    it 'reports a match when values are equal' do
+      match = match_with_loc(0)
+      matchloc_b = match.matches_from_b.first
+      expect(matchloc_b.index).to eq 3
+    end
+
+    it 'reports nothing when there is no match' do
+      match = match_with_loc(5)
+      expect(match).to be_nil
+    end
+
+    it 'reports all matches when multi matches' do
+      match = match_with_loc(1)
+      expect(match.matches_from_a.length).to eq 2
+      expect(match.matches_from_b.length).to eq 1
+    end
+
+    it 'ignores whitelisted values' do
+      matches = Winnow::Matcher.find_matches(fprint1, fprint2, whitelist: [0])
+
+      expect(match_with_loc(0, matches)).to be_nil
+    end
+  end
+end
data/spec/preprocessor_spec.rb
ADDED
@@ -0,0 +1,49 @@
+require 'spec_helper'
+
+describe Winnow::Preprocessors::Plaintext do
+  subject { Winnow::Preprocessors::Plaintext.new }
+
+  it 'converts a string to an array of chars and indices' do
+    str = "abcde"
+    char_indices = [['a', 0], ['b', 1], ['c', 2], ['d', 3], ['e', 4]]
+
+    expect(subject.preprocess(str)).to eq char_indices
+  end
+end
+
+describe Winnow::Preprocessors::SourceCode do
+  subject { Winnow::Preprocessors::SourceCode.new(:java) }
+
+  it 'simplifies a string, but remembers correct locations' do
+    str = "i = 5"
+    result = [['x', 0], ['=', 2], ['5', 4]]
+
+    expect(subject.preprocess(str)).to eq result
+  end
+
+  it 'groups up the indices of a single token' do
+    str = "int i"
+    result = [['i', 0], ['n', 0], ['t', 0], ['x', 4]]
+
+    expect(subject.preprocess(str)).to eq result
+  end
+
+  def reconstruct_string(processed)
+    processed.map { |(char, _)| char }.join
+  end
+
+  it 'removes whitespace' do
+    str = '3; '
+    expect(reconstruct_string(subject.preprocess(str))).to eq '3;'
+  end
+
+  it 'removes variable names' do
+    str = 'class MyClass { int myInteger = 5 }'
+    expect(reconstruct_string(subject.preprocess(str))).to eq 'classx{intx=5}'
+  end
+
+  it 'removes comments' do
+    str = 'fooBar();//this is a comment'
+    expect(reconstruct_string(subject.preprocess(str))).to eq 'x();'
+  end
+end
data/spec/spec_helper.rb
ADDED
data/winnow.gemspec
ADDED
@@ -0,0 +1,27 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'winnow/version'
+
+Gem::Specification.new do |spec|
+  spec.name = "winnow"
+  spec.version = Winnow::VERSION
+  spec.authors = ["Ulysse Carion"]
+  spec.email = ["ulyssecarion@gmail.com"]
+  spec.description = %q{A tiny Ruby library that implements Winnowing,
+  an algorithm for document fingerprinting.}
+  spec.summary = %q{Simple document fingerprinting and plagiarism detection.}
+  spec.homepage = "https://github.com/ucarion/winnow"
+  spec.license = "MIT"
+
+  spec.files = `git ls-files`.split($/)
+  spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+
+  spec.add_dependency 'rouge', '~> 1.3'
+
+  spec.add_development_dependency "bundler", "~> 1.3"
+  spec.add_development_dependency "rake"
+  spec.add_development_dependency "rspec", "~> 2.14"
+end
metadata
ADDED
@@ -0,0 +1,123 @@
+--- !ruby/object:Gem::Specification
+name: winnow
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Ulysse Carion
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2014-05-08 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rouge
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.14'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '2.14'
+description: |-
+  A tiny Ruby library that implements Winnowing,
+  an algorithm for document fingerprinting.
+email:
+- ulyssecarion@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- lib/winnow.rb
+- lib/winnow/fingerprinter.rb
+- lib/winnow/matcher.rb
+- lib/winnow/preprocessor.rb
+- lib/winnow/preprocessors/plaintext.rb
+- lib/winnow/preprocessors/source_code.rb
+- lib/winnow/version.rb
+- spec/fingerprinter_spec.rb
+- spec/matcher_spec.rb
+- spec/preprocessor_spec.rb
+- spec/spec_helper.rb
+- winnow.gemspec
+homepage: https://github.com/ucarion/winnow
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.1.11
+signing_key:
+specification_version: 4
+summary: Simple document fingerprinting and plagiarism detection.
+test_files:
+- spec/fingerprinter_spec.rb
+- spec/matcher_spec.rb
+- spec/preprocessor_spec.rb
+- spec/spec_helper.rb