winnow 0.0.1
- checksums.yaml +7 -0
- data/.gitignore +17 -0
- data/Gemfile +2 -0
- data/LICENSE.txt +22 -0
- data/README.md +150 -0
- data/Rakefile +6 -0
- data/lib/winnow.rb +12 -0
- data/lib/winnow/fingerprinter.rb +60 -0
- data/lib/winnow/matcher.rb +16 -0
- data/lib/winnow/preprocessor.rb +10 -0
- data/lib/winnow/preprocessors/plaintext.rb +9 -0
- data/lib/winnow/preprocessors/source_code.rb +41 -0
- data/lib/winnow/version.rb +3 -0
- data/spec/fingerprinter_spec.rb +56 -0
- data/spec/matcher_spec.rb +61 -0
- data/spec/preprocessor_spec.rb +49 -0
- data/spec/spec_helper.rb +8 -0
- data/winnow.gemspec +27 -0
- metadata +123 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 09e56c471afeb1bf113f07da677460d4c6b4eb9f
|
4
|
+
data.tar.gz: e778e453f5f50fa49d305b90b97813aaccce4cdd
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 941aaca687e3350ce2bafc6c6e7a9d2209bc36ad23078ed659d308cec42f25435195e561ebfa9f3f8e824d21da437a53abc983821440b16deddd04f39cf765b9
|
7
|
+
data.tar.gz: 0e0075ad638eb46ab271ae5d0c0c896db3c3598200af5e89493eab28b117f9660a2e624a09b7080baf64976dfbd18d7e745ab6d2a2ce3e98d9e484c1e4ab700e
|
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
Copyright (c) 2014 Ulysse Carion
|
2
|
+
|
3
|
+
MIT License
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
6
|
+
a copy of this software and associated documentation files (the
|
7
|
+
"Software"), to deal in the Software without restriction, including
|
8
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
9
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
10
|
+
permit persons to whom the Software is furnished to do so, subject to
|
11
|
+
the following conditions:
|
12
|
+
|
13
|
+
The above copyright notice and this permission notice shall be
|
14
|
+
included in all copies or substantial portions of the Software.
|
15
|
+
|
16
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
17
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
18
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
19
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
20
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
21
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
22
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,150 @@
|
|
1
|
+
# Winnow
|
2
|
+
|
3
|
+
A tiny Ruby library for document fingerprinting.
|
4
|
+
|
5
|
+
## What is document fingerprinting?
|
6
|
+
|
7
|
+
Document fingerprinting converts a document (e.g. a book, a piece of code, or
|
8
|
+
any other string) into a much smaller number of hashes called *fingerprints*. If
|
9
|
+
two documents share any fingerprints, then this means there is an exact
|
10
|
+
substring match between the two documents (barring a hash collision).
|
11
|
+
|
12
|
+
Document fingerprinting has many applications, but the most obvious one is for
|
13
|
+
plagiarism detection. By taking fingerprints of two documents, you can detect if
|
14
|
+
parts of one document were copied from another.
|
15
|
+
|
16
|
+
This library implements a fingerprinting algorithm called *winnowing*, described
|
17
|
+
by Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken in a paper titled
|
18
|
+
[*Winnowing: Local Algorithms for Document Fingerprinting*][swa_paper].
|
19
|
+
|
20
|
+
## Usage
|
21
|
+
|
22
|
+
The `Fingerprinter` class takes care of fingerprinting documents. To create a
|
23
|
+
fingerprint, you need to provide two parameters, called the *noise threshold*
|
24
|
+
and the *guarantee threshold*. When comparing two documents' fingerprints, no
|
25
|
+
match shorter than the noise threshold will be detected, but any match at least
|
26
|
+
as long as the guarantee threshold is guaranteed to be found.
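
To make the two thresholds concrete, here is a minimal, self-contained sketch of a winnowing pass. It is not Winnow's actual code: `winnow_sketch` is a hypothetical helper, `Zlib.crc32` stands in for the hash function, and repeated locations are dropped for readability. The noise threshold `k` is the k-gram length, and each window spans `t - k + 1` consecutive k-grams.

```ruby
require 'zlib'

# Hypothetical helper (not part of Winnow's API): a bare-bones winnowing pass.
# k is the noise threshold, t is the guarantee threshold.
def winnow_sketch(str, k:, t:)
  window_size = t - k + 1

  # Hash every k-gram (k consecutive characters) and remember where it starts.
  k_grams = str.chars.each_cons(k).each_with_index.map do |chars, index|
    { value: Zlib.crc32(chars.join), index: index }
  end

  # Keep the smallest hash from each window of consecutive k-grams.
  fingerprints = Hash.new { |hash, key| hash[key] = [] }
  k_grams.each_cons(window_size) do |window|
    least = window.min_by { |gram| gram[:value] }
    locations = fingerprints[least[:value]]
    locations << least[:index] unless locations.include?(least[:index])
  end

  fingerprints
end

a = winnow_sketch("the quick brown fox jumps over the lazy dog", k: 5, t: 8)
b = winnow_sketch("a lazy dog sleeps while the quick brown fox runs", k: 5, t: 8)

# Both strings contain "the quick brown fox ", which is longer than t, so
# they must share at least one fingerprint.
shared = a.keys & b.keys
puts "shared fingerprints: #{shared.length}"
```

A shared substring of at least `t` characters contains a full window of k-grams, and both documents pick the same minimum from that window, which is where the guarantee comes from.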
|
27
|
+
|
28
|
+
The proper values for your noise and guarantee thresholds vary by context.
|
29
|
+
Experiment with the data you're looking at until you're happy with the results.
|
30
|
+
|
31
|
+
Creating a fingerprinter is easy:
|
32
|
+
|
33
|
+
```ruby
|
34
|
+
fingerprinter = Winnow::Fingerprinter.new(noise_threshold: 10, guarantee_threshold: 18)
|
35
|
+
```
|
36
|
+
|
37
|
+
Then, use `#fingerprints` to get the fingerprints. Optionally, pass `:source`
|
38
|
+
(default is `nil`) so that Winnow can later report which document a match is
|
39
|
+
from.
|
40
|
+
|
41
|
+
```ruby
|
42
|
+
document = File.new('hamlet.txt')
|
43
|
+
fingerprints = fingerprinter.fingerprints(document.read, source: document)
|
44
|
+
```
|
45
|
+
|
46
|
+
`#fingerprints` just returns a plain-old Ruby `Hash`. The keys of the hash are
|
47
|
+
generated from substrings of the document being fingerprinted. Finding shared
|
48
|
+
substrings between documents is as simple as seeing if they share any of the
|
49
|
+
keys in their `#fingerprints` hash.
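
As a rough illustration of that key comparison, the sketch below intersects the keys of two hand-written hashes. The hashes are stand-ins for real `#fingerprints` output: the keys are made-up numbers, and plain integers are used for locations where Winnow would store `Winnow::Location` objects.

```ruby
# Hand-written stand-ins for two #fingerprints results (hash value => locations).
fingerprints_a = { 101 => [0], 202 => [7, 31] }
fingerprints_b = { 202 => [12], 303 => [40] }

# Any key present in both hashes marks a shared substring.
shared_keys = fingerprints_a.keys & fingerprints_b.keys

shared_keys.each do |key|
  puts "#{key}: a at #{fingerprints_a[key].inspect}, b at #{fingerprints_b[key].inspect}"
end
```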
|
50
|
+
|
51
|
+
To make things easier for you, Winnow comes with a `Matcher` class that will
|
52
|
+
find matches between two documents.
|
53
|
+
|
54
|
+
Here's an example that puts everything together:
|
55
|
+
|
56
|
+
```ruby
|
57
|
+
require 'winnow'
|
58
|
+
|
59
|
+
str_a = <<-EOF
|
60
|
+
'Twas brillig, and the slithy toves
|
61
|
+
Did gyre and gimble in the wabe;
|
62
|
+
This is copied.
|
63
|
+
All mimsy were the borogoves,
|
64
|
+
And the mome raths outgrabe.
|
65
|
+
EOF
|
66
|
+
|
67
|
+
str_b = <<-EOF
|
68
|
+
"Beware the Jabberwock, my son!
|
69
|
+
The jaws that bite, the claws that catch!
|
70
|
+
Beware the Jubjub bird, and shun
|
71
|
+
The frumious -- This is copied. -- Bandersnatch!"
|
72
|
+
EOF
|
73
|
+
|
74
|
+
fprinter = Winnow::Fingerprinter.new(
|
75
|
+
guarantee_threshold: 13, noise_threshold: 9)
|
76
|
+
|
77
|
+
f1 = fprinter.fingerprints(str_a, source: "Stanza 1")
|
78
|
+
f2 = fprinter.fingerprints(str_b, source: "Stanza 2")
|
79
|
+
|
80
|
+
matches = Winnow::Matcher.find_matches(f1, f2)
|
81
|
+
|
82
|
+
# Because 'This is copied' is longer than the guarantee threshold, there might
|
83
|
+
# be a couple of matches found here. For the sake of brevity, let's only look at
|
84
|
+
# the first match found.
|
85
|
+
match = matches.first
|
86
|
+
|
87
|
+
# It's possible for the same key to appear in a document multiple times (e.g. if
|
88
|
+
# 'This is copied' appears more than once). Winnow::Matcher will return all
|
89
|
+
# matches from the same key in an array.
|
90
|
+
#
|
91
|
+
# In this case, we know there's only one match (because 'This is copied' appears
|
92
|
+
# only once in each document), so let's only look at the first one.
|
93
|
+
match_a = match.matches_from_a.first
|
94
|
+
match_b = match.matches_from_b.first
|
95
|
+
|
96
|
+
p match_a.index, match_b.index # 71, 125
|
97
|
+
|
98
|
+
match_context_a = str_a[match_a.index - 10 .. match_a.index + 20]
|
99
|
+
match_context_b = str_b[match_b.index - 10 .. match_b.index + 20]
|
100
|
+
|
101
|
+
# Match from Stanza 1: "e wabe;\nThis is copied.\nAll mim"
|
102
|
+
puts "Match from #{match_a.source}: #{match_context_a.inspect}"
|
103
|
+
|
104
|
+
# Match from Stanza 2: "ious -- This is copied. -- Band"
|
105
|
+
puts "Match from #{match_b.source}: #{match_context_b.inspect}"
|
106
|
+
```
|
107
|
+
|
108
|
+
You may find that `Matcher` doesn't handle your exact use-case. That's not a
|
109
|
+
problem. [The built-in matcher.rb file](lib/winnow/matcher.rb)
|
110
|
+
is only about 10 lines of code, so you could easily make your own.
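
For instance, a matcher that compares more than two documents at once might look roughly like the sketch below. Everything here is an assumption for illustration: `shared_across` is a hypothetical method, and the `Location` struct is a local stand-in with the same fields as `Winnow::Location` so the example runs without the gem.

```ruby
# Stand-in with the same fields as Winnow::Location (source, index).
Location = Struct.new(:source, :index)

# Hypothetical matcher: given any number of fingerprint hashes, return the
# keys found in at least two of them, mapped to every location for that key.
def shared_across(*fingerprint_sets)
  counts = Hash.new(0)
  locations = Hash.new { |hash, key| hash[key] = [] }

  fingerprint_sets.each do |fingerprints|
    fingerprints.each do |key, locs|
      counts[key] += 1
      locations[key].concat(locs)
    end
  end

  locations.select { |key, _locs| counts[key] >= 2 }
end

doc_a = { 1 => [Location.new("a.txt", 0)], 2 => [Location.new("a.txt", 9)] }
doc_b = { 2 => [Location.new("b.txt", 4)] }
doc_c = { 2 => [Location.new("c.txt", 1)], 3 => [Location.new("c.txt", 5)] }

shared_across(doc_a, doc_b, doc_c).each do |key, locs|
  puts "#{key}: #{locs.map { |loc| "#{loc.source}@#{loc.index}" }.join(', ')}"
end
```

Since the fingerprints are plain hashes, a custom matcher like this only needs ordinary `Hash` and `Array` operations.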
|
111
|
+
|
112
|
+
## :boom: :bomb: A major caveat with `String#hash` :bomb: :boom:
|
113
|
+
|
114
|
+
In order to avoid [algorithmic complexity attacks][wiki_aca], the value returned
|
115
|
+
from Ruby's `String#hash` method [changes every time you restart the
|
116
|
+
interpreter][hash_stackoverflow]:
|
117
|
+
|
118
|
+
```sh
|
119
|
+
$ irb
|
120
|
+
2.0.0p353 :001 > "hello".hash
|
121
|
+
=> 482951767139383391
|
122
|
+
2.0.0p353 :002 > exit
|
123
|
+
|
124
|
+
$ irb
|
125
|
+
2.0.0p353 :001 > "hello".hash
|
126
|
+
=> 3216751850140847920
|
127
|
+
2.0.0p353 :002 > exit
|
128
|
+
```
|
129
|
+
|
130
|
+
(This is the case even if you're using JRuby.)
|
131
|
+
|
132
|
+
This means that although the winnowing algorithm *should* allow you to
|
133
|
+
precalculate a document's fingerprints and store them somewhere, doing so in
|
134
|
+
Ruby will not work unless you're careful to make sure you never restart your
|
135
|
+
Ruby runtime.
|
136
|
+
|
137
|
+
### A workaround
|
138
|
+
|
139
|
+
Winnow looks for the presence of a `String#consistent_hash` method. If it finds
|
140
|
+
one, it'll call that rather than call `String#hash`. You can therefore define
|
141
|
+
your own hash function if you want to precalculate fingerprint data.
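
As a minimal sketch of that idea (and not the implementation used by any particular gem), you could reopen `String` and derive an integer from a digest. The choice of MD5 and of keeping the first 64 bits is an assumption made here for illustration; any deterministic method that returns comparable integers should behave the same way.

```ruby
require 'digest'

class String
  # Illustrative only: a hash value that stays the same across interpreter
  # restarts, built from the first 16 hex digits (64 bits) of an MD5 digest.
  def consistent_hash
    Digest::MD5.hexdigest(self)[0, 16].to_i(16)
  end
end

puts "hello".consistent_hash == "hello".dup.consistent_hash # prints "true"
```

Bear in mind that this reopens a core class, so it affects every string in the process.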
|
142
|
+
|
143
|
+
I've put together a super-simple (but effective) gem called
|
144
|
+
[consistent_hash][consistent_hash] that implements exactly this. It's about a
|
145
|
+
dozen lines of MRI C code and it'll probably work for you as well.
|
146
|
+
|
147
|
+
[swa_paper]: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
|
148
|
+
[wiki_aca]: http://en.wikipedia.org/wiki/Algorithmic_complexity_attack
|
149
|
+
[hash_stackoverflow]: http://stackoverflow.com/questions/23331725/why-are-ruby-hash-methods-randomized
|
150
|
+
[consistent_hash]: https://github.com/ucarion/consistent_hash
|
data/Rakefile
ADDED
data/lib/winnow.rb
ADDED
@@ -0,0 +1,12 @@
|
|
1
|
+
require 'winnow/version'
|
2
|
+
require 'winnow/preprocessor'
|
3
|
+
require 'winnow/fingerprinter'
|
4
|
+
require 'winnow/matcher'
|
5
|
+
|
6
|
+
module Winnow
|
7
|
+
class Location < Struct.new(:source, :index)
|
8
|
+
end
|
9
|
+
|
10
|
+
class MatchDatum < Struct.new(:matches_from_a, :matches_from_b)
|
11
|
+
end
|
12
|
+
end
|
data/lib/winnow/fingerprinter.rb
ADDED
@@ -0,0 +1,60 @@
|
|
1
|
+
module Winnow
|
2
|
+
class Fingerprinter
|
3
|
+
attr_reader :guarantee, :noise, :preprocessor
|
4
|
+
alias_method :guarantee_threshold, :guarantee
|
5
|
+
alias_method :noise_threshold, :noise
|
6
|
+
|
7
|
+
def initialize(params)
|
8
|
+
@guarantee = params[:guarantee_threshold] || params[:t]
|
9
|
+
@noise = params[:noise_threshold] || params[:k]
|
10
|
+
@preprocessor = params[:preprocessor] || Preprocessors::Plaintext.new
|
11
|
+
end
|
12
|
+
|
13
|
+
def fingerprints(str, params = {})
|
14
|
+
source = params[:source]
|
15
|
+
|
16
|
+
fingerprints = {}
|
17
|
+
|
18
|
+
windows(str, source).each do |window|
|
19
|
+
least_fingerprint = window.min_by { |fingerprint| fingerprint[:value] }
|
20
|
+
value = least_fingerprint[:value]
|
21
|
+
location = least_fingerprint[:location]
|
22
|
+
|
23
|
+
(fingerprints[value] ||= []) << location
|
24
|
+
end
|
25
|
+
|
26
|
+
fingerprints
|
27
|
+
end
|
28
|
+
|
29
|
+
private
|
30
|
+
|
31
|
+
def windows(str, source)
|
32
|
+
k_grams(str, source).each_cons(window_size)
|
33
|
+
end
|
34
|
+
|
35
|
+
def window_size
|
36
|
+
guarantee - noise + 1
|
37
|
+
end
|
38
|
+
|
39
|
+
def k_grams(str, source)
|
40
|
+
tokens(str).each_cons(noise).map do |tokens_k_gram|
|
41
|
+
value = hash(tokens_k_gram.map { |(char)| char }.join)
|
42
|
+
location = Location.new(source, tokens_k_gram.first[1])
|
43
|
+
|
44
|
+
{value: value, location: location}
|
45
|
+
end
|
46
|
+
end
|
47
|
+
|
48
|
+
def tokens(str)
|
49
|
+
preprocessor.preprocess(str)
|
50
|
+
end
|
51
|
+
|
52
|
+
def hash(str)
|
53
|
+
if str.respond_to?(:consistent_hash)
|
54
|
+
str.consistent_hash
|
55
|
+
else
|
56
|
+
str.hash
|
57
|
+
end
|
58
|
+
end
|
59
|
+
end
|
60
|
+
end
|
data/lib/winnow/matcher.rb
ADDED
@@ -0,0 +1,16 @@
|
|
1
|
+
module Winnow
|
2
|
+
class Matcher
|
3
|
+
class << self
|
4
|
+
def find_matches(fingerprints_a, fingerprints_b, params = {})
|
5
|
+
whitelist = params[:whitelist] || []
|
6
|
+
|
7
|
+
matched_values = fingerprints_a.keys & fingerprints_b.keys - whitelist
|
8
|
+
|
9
|
+
matched_values.map do |value|
|
10
|
+
matches_a, matches_b = fingerprints_a[value], fingerprints_b[value]
|
11
|
+
MatchDatum.new(matches_a, matches_b)
|
12
|
+
end
|
13
|
+
end
|
14
|
+
end
|
15
|
+
end
|
16
|
+
end
|
data/lib/winnow/preprocessors/source_code.rb
ADDED
@@ -0,0 +1,41 @@
|
|
1
|
+
require 'rouge'
|
2
|
+
|
3
|
+
module Winnow
|
4
|
+
module Preprocessors
|
5
|
+
class SourceCode < Preprocessor
|
6
|
+
attr_reader :lexer
|
7
|
+
|
8
|
+
def initialize(language)
|
9
|
+
@lexer = Rouge::Lexer.find(language)
|
10
|
+
end
|
11
|
+
|
12
|
+
def preprocess(str)
|
13
|
+
current_index = 0
|
14
|
+
processed_chars = []
|
15
|
+
|
16
|
+
lexer.lex(str).to_a.each do |token|
|
17
|
+
type, chunk = token
|
18
|
+
|
19
|
+
processed_chunk = case
|
20
|
+
when type <= Rouge::Token::Tokens::Name
|
21
|
+
'x'
|
22
|
+
when type <= Rouge::Token::Tokens::Comment
|
23
|
+
''
|
24
|
+
when type <= Rouge::Token::Tokens::Text
|
25
|
+
''
|
26
|
+
else
|
27
|
+
chunk
|
28
|
+
end
|
29
|
+
|
30
|
+
processed_chars += processed_chunk.chars.map do |c|
|
31
|
+
[c, current_index]
|
32
|
+
end
|
33
|
+
|
34
|
+
current_index += chunk.length
|
35
|
+
end
|
36
|
+
|
37
|
+
processed_chars
|
38
|
+
end
|
39
|
+
end
|
40
|
+
end
|
41
|
+
end
|
data/spec/fingerprinter_spec.rb
ADDED
@@ -0,0 +1,56 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe Winnow::Fingerprinter do
|
4
|
+
describe '#initialize' do
|
5
|
+
it 'accepts :guarantee_threshold, :noise_threshold' do
|
6
|
+
fprinter = Winnow::Fingerprinter.new(
|
7
|
+
guarantee_threshold: 0, noise_threshold: 1)
|
8
|
+
expect(fprinter.guarantee_threshold).to eq 0
|
9
|
+
expect(fprinter.noise_threshold).to eq 1
|
10
|
+
end
|
11
|
+
|
12
|
+
it 'accepts :t and :k' do
|
13
|
+
fprinter = Winnow::Fingerprinter.new(t: 0, k: 1)
|
14
|
+
expect(fprinter.guarantee_threshold).to eq 0
|
15
|
+
expect(fprinter.noise_threshold).to eq 1
|
16
|
+
end
|
17
|
+
end
|
18
|
+
|
19
|
+
describe '#fingerprints' do
|
20
|
+
it 'hashes strings to get keys' do
|
21
|
+
# if t = k = 1, then each character will become a fingerprint
|
22
|
+
fprinter = Winnow::Fingerprinter.new(t: 1, k: 1)
|
23
|
+
fprints = fprinter.fingerprints("abcdefg")
|
24
|
+
|
25
|
+
hashes = ('a'..'g').map(&:hash)
|
26
|
+
|
27
|
+
expect(fprints.keys).to eq hashes
|
28
|
+
end
|
29
|
+
|
30
|
+
it 'chooses the smallest hash per window' do
|
31
|
+
# window size = t - k + 1 = 2 ; for a two-char string, the sole
|
32
|
+
# fingerprint should just be from the char with the smallest hash value
|
33
|
+
fprinter = Winnow::Fingerprinter.new(t: 2, k: 1)
|
34
|
+
fprints = fprinter.fingerprints("ab")
|
35
|
+
|
36
|
+
expect(fprints.keys.length).to eq 1
|
37
|
+
expect(fprints.keys.first).to eq ["a", "b"].map(&:hash).min
|
38
|
+
end
|
39
|
+
|
40
|
+
it 'correctly reports the location of a fingerprint' do
|
41
|
+
fprinter = Winnow::Fingerprinter.new(t: 1, k: 1)
|
42
|
+
fprints = fprinter.fingerprints("a\nb\ncde\nfg", source: "example")
|
43
|
+
|
44
|
+
fprint_d = fprints['d'.hash].first
|
45
|
+
|
46
|
+
expect(fprint_d.index).to eq 5
|
47
|
+
expect(fprint_d.source).to eq "example"
|
48
|
+
end
|
49
|
+
|
50
|
+
it 'uses #consistent_hash when possible' do
|
51
|
+
String.any_instance.should_receive(:consistent_hash)
|
52
|
+
|
53
|
+
Winnow::Fingerprinter.new(t: 1, k: 1).fingerprints("a")
|
54
|
+
end
|
55
|
+
end
|
56
|
+
end
|
data/spec/matcher_spec.rb
ADDED
@@ -0,0 +1,61 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe Winnow::Matcher do
|
4
|
+
describe '.find_matches' do
|
5
|
+
def make_locations(*indices)
|
6
|
+
indices.map { |n| Winnow::Location.new(nil, n) }
|
7
|
+
end
|
8
|
+
|
9
|
+
let(:fprint1) do
|
10
|
+
{
|
11
|
+
0 => make_locations(0),
|
12
|
+
1 => make_locations(1, 2),
|
13
|
+
}
|
14
|
+
end
|
15
|
+
|
16
|
+
let(:fprint2) do
|
17
|
+
{
|
18
|
+
0 => make_locations(3),
|
19
|
+
1 => make_locations(4),
|
20
|
+
3 => make_locations(5)
|
21
|
+
}
|
22
|
+
end
|
23
|
+
|
24
|
+
let(:matches) { Winnow::Matcher.find_matches(fprint1, fprint2) }
|
25
|
+
|
26
|
+
def match_with_loc(index, matches = self.matches)
|
27
|
+
matches.find do |data|
|
28
|
+
data.matches_from_a.find { |loc| loc.index == index } ||
|
29
|
+
data.matches_from_b.find { |loc| loc.index == index }
|
30
|
+
end
|
31
|
+
end
|
32
|
+
|
33
|
+
it 'returns an array of match data' do
|
34
|
+
expect(matches).to be_an(Array)
|
35
|
+
expect(matches.first).to be_a(Winnow::MatchDatum)
|
36
|
+
end
|
37
|
+
|
38
|
+
it 'reports a match when values are equal' do
|
39
|
+
match = match_with_loc(0)
|
40
|
+
matchloc_b = match.matches_from_b.first
|
41
|
+
expect(matchloc_b.index).to eq 3
|
42
|
+
end
|
43
|
+
|
44
|
+
it 'reports nothing when there is no match' do
|
45
|
+
match = match_with_loc(5)
|
46
|
+
expect(match).to be_nil
|
47
|
+
end
|
48
|
+
|
49
|
+
it 'reports all matches when multi matches' do
|
50
|
+
match = match_with_loc(1)
|
51
|
+
expect(match.matches_from_a.length).to eq 2
|
52
|
+
expect(match.matches_from_b.length).to eq 1
|
53
|
+
end
|
54
|
+
|
55
|
+
it 'ignores whitelisted values' do
|
56
|
+
matches = Winnow::Matcher.find_matches(fprint1, fprint2, whitelist: [0])
|
57
|
+
|
58
|
+
expect(match_with_loc(0, matches)).to be_nil
|
59
|
+
end
|
60
|
+
end
|
61
|
+
end
|
data/spec/preprocessor_spec.rb
ADDED
@@ -0,0 +1,49 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe Winnow::Preprocessors::Plaintext do
|
4
|
+
subject { Winnow::Preprocessors::Plaintext.new }
|
5
|
+
|
6
|
+
it 'converts a string to an array of chars and indices' do
|
7
|
+
str = "abcde"
|
8
|
+
char_indices = [['a', 0], ['b', 1], ['c', 2], ['d', 3], ['e', 4]]
|
9
|
+
|
10
|
+
expect(subject.preprocess(str)).to eq char_indices
|
11
|
+
end
|
12
|
+
end
|
13
|
+
|
14
|
+
describe Winnow::Preprocessors::SourceCode do
|
15
|
+
subject { Winnow::Preprocessors::SourceCode.new(:java) }
|
16
|
+
|
17
|
+
it 'simplifies a string, but remembers correct locations' do
|
18
|
+
str = "i = 5"
|
19
|
+
result = [['x', 0], ['=', 2], ['5', 4]]
|
20
|
+
|
21
|
+
expect(subject.preprocess(str)).to eq result
|
22
|
+
end
|
23
|
+
|
24
|
+
it 'groups up the indices of a single token' do
|
25
|
+
str = "int i"
|
26
|
+
result = [['i', 0], ['n', 0], ['t', 0], ['x', 4]]
|
27
|
+
|
28
|
+
expect(subject.preprocess(str)).to eq result
|
29
|
+
end
|
30
|
+
|
31
|
+
def reconstruct_string(processed)
|
32
|
+
processed.map { |(char, _)| char }.join
|
33
|
+
end
|
34
|
+
|
35
|
+
it 'removes whitespace' do
|
36
|
+
str = '3; '
|
37
|
+
expect(reconstruct_string(subject.preprocess(str))).to eq '3;'
|
38
|
+
end
|
39
|
+
|
40
|
+
it 'removes variable names' do
|
41
|
+
str = 'class MyClass { int myInteger = 5 }'
|
42
|
+
expect(reconstruct_string(subject.preprocess(str))).to eq 'classx{intx=5}'
|
43
|
+
end
|
44
|
+
|
45
|
+
it 'removes comments' do
|
46
|
+
str = 'fooBar();//this is a comment'
|
47
|
+
expect(reconstruct_string(subject.preprocess(str))).to eq 'x();'
|
48
|
+
end
|
49
|
+
end
|
data/spec/spec_helper.rb
ADDED
data/winnow.gemspec
ADDED
@@ -0,0 +1,27 @@
|
|
1
|
+
# coding: utf-8
|
2
|
+
lib = File.expand_path('../lib', __FILE__)
|
3
|
+
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
|
4
|
+
require 'winnow/version'
|
5
|
+
|
6
|
+
Gem::Specification.new do |spec|
|
7
|
+
spec.name = "winnow"
|
8
|
+
spec.version = Winnow::VERSION
|
9
|
+
spec.authors = ["Ulysse Carion"]
|
10
|
+
spec.email = ["ulyssecarion@gmail.com"]
|
11
|
+
spec.description = %q{A tiny Ruby library that implements Winnowing,
|
12
|
+
an algorithm for document fingerprinting.}
|
13
|
+
spec.summary = %q{Simple document fingerprinting and plagiarism detection.}
|
14
|
+
spec.homepage = "https://github.com/ucarion/winnow"
|
15
|
+
spec.license = "MIT"
|
16
|
+
|
17
|
+
spec.files = `git ls-files`.split($/)
|
18
|
+
spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
|
19
|
+
spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
|
20
|
+
spec.require_paths = ["lib"]
|
21
|
+
|
22
|
+
spec.add_dependency 'rouge', '~> 1.3'
|
23
|
+
|
24
|
+
spec.add_development_dependency "bundler", "~> 1.3"
|
25
|
+
spec.add_development_dependency "rake"
|
26
|
+
spec.add_development_dependency "rspec", "~> 2.14"
|
27
|
+
end
|
metadata
ADDED
@@ -0,0 +1,123 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: winnow
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.0.1
|
5
|
+
platform: ruby
|
6
|
+
authors:
|
7
|
+
- Ulysse Carion
|
8
|
+
autorequire:
|
9
|
+
bindir: bin
|
10
|
+
cert_chain: []
|
11
|
+
date: 2014-05-08 00:00:00.000000000 Z
|
12
|
+
dependencies:
|
13
|
+
- !ruby/object:Gem::Dependency
|
14
|
+
name: rouge
|
15
|
+
requirement: !ruby/object:Gem::Requirement
|
16
|
+
requirements:
|
17
|
+
- - ~>
|
18
|
+
- !ruby/object:Gem::Version
|
19
|
+
version: '1.3'
|
20
|
+
type: :runtime
|
21
|
+
prerelease: false
|
22
|
+
version_requirements: !ruby/object:Gem::Requirement
|
23
|
+
requirements:
|
24
|
+
- - ~>
|
25
|
+
- !ruby/object:Gem::Version
|
26
|
+
version: '1.3'
|
27
|
+
- !ruby/object:Gem::Dependency
|
28
|
+
name: bundler
|
29
|
+
requirement: !ruby/object:Gem::Requirement
|
30
|
+
requirements:
|
31
|
+
- - ~>
|
32
|
+
- !ruby/object:Gem::Version
|
33
|
+
version: '1.3'
|
34
|
+
type: :development
|
35
|
+
prerelease: false
|
36
|
+
version_requirements: !ruby/object:Gem::Requirement
|
37
|
+
requirements:
|
38
|
+
- - ~>
|
39
|
+
- !ruby/object:Gem::Version
|
40
|
+
version: '1.3'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: rake
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - '>='
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: '0'
|
48
|
+
type: :development
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - '>='
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '0'
|
55
|
+
- !ruby/object:Gem::Dependency
|
56
|
+
name: rspec
|
57
|
+
requirement: !ruby/object:Gem::Requirement
|
58
|
+
requirements:
|
59
|
+
- - ~>
|
60
|
+
- !ruby/object:Gem::Version
|
61
|
+
version: '2.14'
|
62
|
+
type: :development
|
63
|
+
prerelease: false
|
64
|
+
version_requirements: !ruby/object:Gem::Requirement
|
65
|
+
requirements:
|
66
|
+
- - ~>
|
67
|
+
- !ruby/object:Gem::Version
|
68
|
+
version: '2.14'
|
69
|
+
description: |-
|
70
|
+
A tiny Ruby library that implements Winnowing,
|
71
|
+
an algorithm for document fingerprinting.
|
72
|
+
email:
|
73
|
+
- ulyssecarion@gmail.com
|
74
|
+
executables: []
|
75
|
+
extensions: []
|
76
|
+
extra_rdoc_files: []
|
77
|
+
files:
|
78
|
+
- .gitignore
|
79
|
+
- Gemfile
|
80
|
+
- LICENSE.txt
|
81
|
+
- README.md
|
82
|
+
- Rakefile
|
83
|
+
- lib/winnow.rb
|
84
|
+
- lib/winnow/fingerprinter.rb
|
85
|
+
- lib/winnow/matcher.rb
|
86
|
+
- lib/winnow/preprocessor.rb
|
87
|
+
- lib/winnow/preprocessors/plaintext.rb
|
88
|
+
- lib/winnow/preprocessors/source_code.rb
|
89
|
+
- lib/winnow/version.rb
|
90
|
+
- spec/fingerprinter_spec.rb
|
91
|
+
- spec/matcher_spec.rb
|
92
|
+
- spec/preprocessor_spec.rb
|
93
|
+
- spec/spec_helper.rb
|
94
|
+
- winnow.gemspec
|
95
|
+
homepage: https://github.com/ucarion/winnow
|
96
|
+
licenses:
|
97
|
+
- MIT
|
98
|
+
metadata: {}
|
99
|
+
post_install_message:
|
100
|
+
rdoc_options: []
|
101
|
+
require_paths:
|
102
|
+
- lib
|
103
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
104
|
+
requirements:
|
105
|
+
- - '>='
|
106
|
+
- !ruby/object:Gem::Version
|
107
|
+
version: '0'
|
108
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
109
|
+
requirements:
|
110
|
+
- - '>='
|
111
|
+
- !ruby/object:Gem::Version
|
112
|
+
version: '0'
|
113
|
+
requirements: []
|
114
|
+
rubyforge_project:
|
115
|
+
rubygems_version: 2.1.11
|
116
|
+
signing_key:
|
117
|
+
specification_version: 4
|
118
|
+
summary: Simple document fingerprinting and plagiarism detection.
|
119
|
+
test_files:
|
120
|
+
- spec/fingerprinter_spec.rb
|
121
|
+
- spec/matcher_spec.rb
|
122
|
+
- spec/preprocessor_spec.rb
|
123
|
+
- spec/spec_helper.rb
|