lycopodium 0.0.2 → 0.0.3
- checksums.yaml +7 -0
- data/.travis.yml +1 -2
- data/Gemfile +1 -1
- data/LICENSE +1 -1
- data/README.md +84 -16
- data/lib/lycopodium.rb +41 -6
- data/lib/lycopodium/version.rb +2 -2
- data/lycopodium.gemspec +5 -4
- data/spec/lycopodium_spec.rb +64 -0
- data/spec/spec_helper.rb +4 -0
- metadata +37 -33
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 4bc1c70f9f05818e7ba034a54541e4b00a73c467
|
4
|
+
data.tar.gz: 45aa5ecad0f8ea72fbd3e52dd3c955b009d1b5b7
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 980b7eb2f0a907186731a62f81903fb3e9eee5985190c9aed11bb61355437e120acfd8f3169145eff44bad2f3e14a2e7e8efdbc8cc725dc0850d06e5a6217078
|
7
|
+
data.tar.gz: a36b387c15a1ce568f07a8606947178be56380384540ff90a055ac0d97a3c928bc9703a9688ac1a5e2d7a720121d3f24346609c0d24f3b29fc5ca4898dde68e4
|
data/.travis.yml
CHANGED
data/Gemfile
CHANGED
data/LICENSE
CHANGED
data/README.md
CHANGED
@@ -1,13 +1,37 @@
|
|
1
1
|
# Lycopodium Finds Fingerprints
|
2
2
|
|
3
|
-
[![
|
4
|
-
[![
|
5
|
-
[![
|
6
|
-
[![
|
3
|
+
[![Gem Version](https://badge.fury.io/rb/lycopodium.svg)](https://badge.fury.io/rb/lycopodium)
|
4
|
+
[![Build Status](https://secure.travis-ci.org/jpmckinney/lycopodium.png)](https://travis-ci.org/jpmckinney/lycopodium)
|
5
|
+
[![Dependency Status](https://gemnasium.com/jpmckinney/lycopodium.png)](https://gemnasium.com/jpmckinney/lycopodium)
|
6
|
+
[![Coverage Status](https://coveralls.io/repos/jpmckinney/lycopodium/badge.png)](https://coveralls.io/r/jpmckinney/lycopodium)
|
7
|
+
[![Code Climate](https://codeclimate.com/github/jpmckinney/lycopodium.png)](https://codeclimate.com/github/jpmckinney/lycopodium)
|
7
8
|
|
8
|
-
|
9
|
+
Lycopodium does two things:
|
9
10
|
|
10
|
-
|
11
|
+
1. Test what transformations you can make to a set of values without creating collisions.
|
12
|
+
1. Find [unique key](http://en.wikipedia.org/wiki/Unique_key) constraints in a data table.
|
13
|
+
|
14
|
+
> Historically, Lycopodium powder, the spores of Lycopodium and related plants, was used as a fingerprint powder. – [Wikipedia](https://en.wikipedia.org/wiki/Fingerprint_powder#Composition)
|
15
|
+
|
16
|
+
## What it tries to solve
|
17
|
+
|
18
|
+
### Find a key collision method
|
19
|
+
|
20
|
+
Let's say you have an authoritative list of names: for example, a list of organization names from a [company register](https://www.ic.gc.ca/app/scr/cc/CorporationsCanada/fdrlCrpSrch.html?locale=en_CA). You want to match a messy list of names – for example, a list of government contractors published by a city – against this authoritative list.
|
21
|
+
|
22
|
+
For context, [Open Refine](http://openrefine.org/) offers [two methods to solve this problem](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth):
|
23
|
+
|
24
|
+
* Key collision methods group names that transform into the same fingerprint; transformations include lowercasing letters, removing whitespace and punctuation, sorting words, etc.
|
25
|
+
|
26
|
+
* Nearest neighbor methods group names that are close to each other, using distance functions like [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) and [Prediction by Partial Matching](https://en.wikipedia.org/wiki/Prediction_by_Partial_Matching).
|
27
|
+
|
28
|
+
Key collision methods tend to be fast and strict, whereas nearest neighbor methods are more likely to produce false positives, especially when dealing with short strings.
|
29
|
+
|
30
|
+
If you want fast and strict reconciliation, Lycopodium lets you figure out what transformations can be applied to an authoritative list of names without creating collisions between names. Those transformations can then be safely applied to the names on the messy list to match against the authoritative list.
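
To make the key collision idea concrete, here is a minimal sketch in plain Ruby that does not use Lycopodium itself. The `fingerprint` lambda and the sample names are assumptions chosen for illustration; the transformations (lowercasing, removing punctuation, sorting words) are the ones listed above.

```ruby
# Hypothetical fingerprint: lowercase, drop punctuation, then sort unique words.
fingerprint = ->(name) do
  name.downcase.gsub(/[^a-z0-9\s]/, '').split.uniq.sort.join(' ')
end

# Made-up sample names, for illustration only.
names = ['Acme Inc.', 'ACME, Inc', 'Inc Acme', 'Widget Co.']

# Group names by fingerprint; a group with more than one member is a collision.
groups = names.group_by(&fingerprint)
collisions = groups.select { |_, members| members.size > 1 }

puts collisions.inspect
```

If `collisions` is empty for the authoritative list, the transformation should be safe to apply to the messy list as well.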
|
31
|
+
|
32
|
+
### Find a unique key in a data table
|
33
|
+
|
34
|
+
Let's say you have a data table: for example, the City of Toronto publishes [voting records grouped by city councillor](http://app.toronto.ca/tmmis/getAdminReport.do?function=prepareMemberVoteReport). You want to instead group the voting records by the motion being voted on. However, the data table doesn't contain a single column identifying the motion. You instead need to find the combination of columns that identifies the motion. In other words, you are looking for the data table's [unique key](http://en.wikipedia.org/wiki/Unique_key). Lycopodium does this.
|
11
35
|
|
12
36
|
## Usage
|
13
37
|
|
@@ -15,7 +39,23 @@ Test what transformations you can make to a set of values without creating colli
|
|
15
39
|
require 'lycopodium'
|
16
40
|
```
|
17
41
|
|
18
|
-
|
42
|
+
### Find a unique key in a data table
|
43
|
+
|
44
|
+
```ruby
|
45
|
+
table = [
|
46
|
+
['foo', 'bar', 'baz'],
|
47
|
+
['foo', 'bar', 'bzz'],
|
48
|
+
['foo', 'zzz', 'bzz'],
|
49
|
+
]
|
50
|
+
Lycopodium.unique_key(table)
|
51
|
+
# => [1, 2]
|
52
|
+
```
|
53
|
+
|
54
|
+
The returned indices are zero-based, so `[1, 2]` refers to the second and third columns. The values of those two columns, taken together, are unique for each row in the table. In other words, you can uniquely identify a row by taking the values of its second and third columns.
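
Once you have the column indices, you can use them to index or regroup rows. The following is a small sketch in plain Ruby; the `unique` helper is a hypothetical name introduced here and is not part of Lycopodium.

```ruby
table = [
  ['foo', 'bar', 'baz'],
  ['foo', 'bar', 'bzz'],
  ['foo', 'zzz', 'bzz'],
]
key = [1, 2] # column indices, copied from the example output above

# Index each row by the values of its key columns.
rows_by_key = table.group_by { |row| row.values_at(*key) }

# Hypothetical helper: a set of columns is a unique key if every
# combination of values in those columns maps to exactly one row.
unique = ->(rows, columns) do
  rows.group_by { |row| row.values_at(*columns) }.values.all? { |group| group.size == 1 }
end

puts unique.call(table, [1, 2]) # the key found above
puts unique.call(table, [0, 1]) # the first two rows share "foo", "bar"
```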
|
55
|
+
|
56
|
+
### Find a key collision method
|
57
|
+
|
58
|
+
Write a method that transforms a value, for example:
|
19
59
|
|
20
60
|
```ruby
|
21
61
|
meth1 = ->(string) do
|
@@ -23,10 +63,10 @@ meth1 = ->(string) do
|
|
23
63
|
end
|
24
64
|
```
|
25
65
|
|
26
|
-
Then, initialize a `Lycopodium` instance with a set of values and the
|
66
|
+
Then, initialize a `Lycopodium` instance with a set of values and the method:
|
27
67
|
|
28
68
|
```ruby
|
29
|
-
set = Lycopodium.new(["foo", "f o o"], meth1)
|
69
|
+
set = Lycopodium.new(["foo", "f o o", " bar "], meth1)
|
30
70
|
```
|
31
71
|
|
32
72
|
Lastly, test whether the method creates collisions between the members of the set:
|
@@ -47,21 +87,47 @@ meth2 = ->(string) do
|
|
47
87
|
end
|
48
88
|
```
|
49
89
|
|
50
|
-
It will return the mapping from original to transformed string:
|
90
|
+
It will return the mapping from original to transformed string (hence `value_to_fingerprint`):
|
51
91
|
|
52
|
-
|
92
|
+
```ruby
|
93
|
+
set.function = meth2
|
94
|
+
set.value_to_fingerprint
|
95
|
+
# => {"foo"=>"FOO", "f o o"=>"F O O", " bar "=>" BAR "}
|
96
|
+
```
|
53
97
|
|
54
98
|
We thus learn that whitespace disambiguates between members of the set, but letter case does not.
|
55
99
|
|
56
|
-
|
100
|
+
If you can't find a suitable method, you can remove all values that collide after transformation:
|
57
101
|
|
58
102
|
```ruby
|
103
|
+
set.function = meth1
|
59
104
|
set_without_collisions = set.reject_collisions
|
105
|
+
# => [" bar "]
|
106
|
+
set_without_collisions.value_to_fingerprint
|
107
|
+
# => {" bar "=>"bar"}
|
60
108
|
```
|
61
109
|
|
62
110
|
A `Lycopodium` instance otherwise behaves as an array.
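
The collision-rejection step can also be sketched in plain Ruby, which may help when reasoning about what survives. This is an illustration under assumptions, not the gem's implementation: `strip_space` is a hypothetical stand-in for `meth1`, whose body is not shown in this diff.

```ruby
values = ['foo', 'f o o', ' bar ']

# Assumed stand-in for meth1: remove all whitespace.
strip_space = ->(string) { string.gsub(/\s/, '') }

# Keep only the values whose fingerprint is shared with no other value.
groups = values.group_by(&strip_space)
survivors = groups.values.select { |members| members.size == 1 }.map(&:first)

puts survivors.inspect
```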
|
63
111
|
|
64
|
-
|
112
|
+
### Use the key collision method
|
113
|
+
|
114
|
+
You can now apply the method to other values…
|
115
|
+
|
116
|
+
```ruby
|
117
|
+
messy = "\tbar\n"
|
118
|
+
fingerprint = meth1.call(messy)
|
119
|
+
# => "bar"
|
120
|
+
```
|
121
|
+
|
122
|
+
… and match against your original values:
|
123
|
+
|
124
|
+
```ruby
|
125
|
+
fingerprint_to_value = set_without_collisions.value_to_fingerprint.invert
|
126
|
+
fingerprint_to_value.fetch(fingerprint)
|
127
|
+
# => " bar "
|
128
|
+
```
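
To match a whole list of messy values rather than a single one, a sketch along the following lines may be useful. It uses only plain Ruby; `strip_space` and the sample lists are assumptions made for illustration. Note that `Hash#fetch` raises `KeyError` when a fingerprint is missing, so this sketch uses `Hash#[]`, which returns `nil` for values with no match.

```ruby
# Assumed whitespace-removing transformation, standing in for meth1.
strip_space = ->(string) { string.gsub(/\s/, '') }

# Made-up authoritative list, assumed to be free of collisions.
authoritative = [' bar ', 'baz']

# Build the fingerprint-to-value lookup table.
fingerprint_to_value = authoritative.each_with_object({}) do |value, lookup|
  lookup[strip_space.call(value)] = value
end

# Match each messy value; unmatched values map to nil.
messy = ["\tbar\n", 'b a z', 'qux']
matches = messy.map { |value| fingerprint_to_value[strip_space.call(value)] }

puts matches.inspect
```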
|
129
|
+
|
130
|
+
### Method definition
|
65
131
|
|
66
132
|
Besides the `->` syntax above, you can define the same method as:
|
67
133
|
|
@@ -96,8 +162,10 @@ end
|
|
96
162
|
meth = Object.method(:func)
|
97
163
|
```
|
98
164
|
|
99
|
-
##
|
165
|
+
## Related projects
|
100
166
|
|
101
|
-
|
167
|
+
* [Nomenklatura](http://nomenklatura.okfnlabs.org/) is a web service to maintain a canonical list of entities and to match messy input against it, either via the user interface or via Open Refine reconciliation.
|
168
|
+
* [dedupe](https://github.com/open-city/dedupe) is a Python library to determine when two records are about the same thing.
|
169
|
+
* [name-cleaver](https://github.com/sunlightlabs/name-cleaver) is a Python library to parse and standardize the names of people and organizations.
|
102
170
|
|
103
|
-
Copyright (c) 2013
|
171
|
+
Copyright (c) 2013 James McKinney, released under the MIT license
|
data/lib/lycopodium.rb
CHANGED
@@ -3,29 +3,53 @@ require "set"
|
|
3
3
|
class Lycopodium < Array
|
4
4
|
class Error < StandardError; end
|
5
5
|
class Collision < Error; end
|
6
|
+
class RaggedRow < Error; end
|
6
7
|
|
7
8
|
attr_accessor :function
|
8
9
|
|
10
|
+
def self.unique_key(data)
|
11
|
+
columns_size = data.first.size
|
12
|
+
data.each do |row|
|
13
|
+
unless row.size == columns_size
|
14
|
+
raise RaggedRow, row.inspect
|
15
|
+
end
|
16
|
+
end
|
17
|
+
|
18
|
+
columns = (0...columns_size).to_a
|
19
|
+
1.upto(columns_size) do |k|
|
20
|
+
columns.combination(k) do |combination|
|
21
|
+
if unique_key?(data, combination)
|
22
|
+
return combination
|
23
|
+
end
|
24
|
+
end
|
25
|
+
end
|
26
|
+
nil
|
27
|
+
end
|
28
|
+
|
9
29
|
# @param [Array] set a set of values
|
10
30
|
# @param [Proc] function a method that transforms a value
|
11
31
|
def initialize(set, function = lambda{|value| value})
|
12
32
|
replace(set)
|
13
|
-
|
33
|
+
@function = function
|
14
34
|
end
|
15
35
|
|
36
|
+
# Removes all members of the set that collide after transformation.
|
37
|
+
#
|
16
38
|
# @return [Array] the members of the set without collisions
|
17
39
|
def reject_collisions
|
18
40
|
hashes, collisions = hashes_and_collisions
|
19
41
|
|
20
|
-
items =
|
21
|
-
|
22
|
-
|
23
|
-
|
42
|
+
items = []
|
43
|
+
hashes.each do |item,hash|
|
44
|
+
unless collisions.include?(hash)
|
45
|
+
items << item
|
46
|
+
end
|
24
47
|
end
|
25
|
-
|
26
48
|
self.class.new(items, function)
|
27
49
|
end
|
28
50
|
|
51
|
+
# Returns a mapping from the original to the transformed value.
|
52
|
+
#
|
29
53
|
# @return [Hash] a mapping from the original to the transformed value
|
30
54
|
# @raise [Collision] if the method creates collisions between members of the set
|
31
55
|
def value_to_fingerprint
|
@@ -49,6 +73,17 @@ class Lycopodium < Array
|
|
49
73
|
|
50
74
|
private
|
51
75
|
|
76
|
+
def self.unique_key?(data, combination)
|
77
|
+
set = Set.new
|
78
|
+
data.each_with_index do |row,index|
|
79
|
+
set.add(row.values_at(*combination))
|
80
|
+
if set.size <= index
|
81
|
+
return false
|
82
|
+
end
|
83
|
+
end
|
84
|
+
true
|
85
|
+
end
|
86
|
+
|
52
87
|
def hashes_and_collisions
|
53
88
|
collisions = Set.new
|
54
89
|
|
data/lib/lycopodium/version.rb
CHANGED
@@ -1,3 +1,3 @@
|
|
1
|
-
|
2
|
-
VERSION = "0.0.
|
1
|
+
class Lycopodium < Array
|
2
|
+
VERSION = "0.0.3"
|
3
3
|
end
|
data/lycopodium.gemspec
CHANGED
@@ -5,10 +5,10 @@ Gem::Specification.new do |s|
|
|
5
5
|
s.name = "lycopodium"
|
6
6
|
s.version = Lycopodium::VERSION
|
7
7
|
s.platform = Gem::Platform::RUBY
|
8
|
-
s.authors = ["
|
9
|
-
s.
|
10
|
-
s.
|
11
|
-
s.
|
8
|
+
s.authors = ["James McKinney"]
|
9
|
+
s.homepage = "https://github.com/jpmckinney/lycopodium"
|
10
|
+
s.summary = %q{Test what transformations you can make to a set of values without creating collisions}
|
11
|
+
s.license = 'MIT'
|
12
12
|
|
13
13
|
s.files = `git ls-files`.split("\n")
|
14
14
|
s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
|
@@ -17,4 +17,5 @@ Gem::Specification.new do |s|
|
|
17
17
|
|
18
18
|
s.add_development_dependency('rspec', '~> 2.10')
|
19
19
|
s.add_development_dependency('rake')
|
20
|
+
s.add_development_dependency('coveralls')
|
20
21
|
end
|
data/spec/lycopodium_spec.rb
ADDED
@@ -0,0 +1,64 @@
|
|
1
|
+
require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
|
2
|
+
|
3
|
+
describe Lycopodium do
|
4
|
+
describe '#initialize' do
|
5
|
+
it 'should use an identity function by default' do
|
6
|
+
set = Lycopodium.new([])
|
7
|
+
set.function.call(1).should == 1
|
8
|
+
end
|
9
|
+
|
10
|
+
it 'should accept a function as an argument' do
|
11
|
+
set = Lycopodium.new([], lambda{|value| value * 2})
|
12
|
+
set.function.call(1).should == 2
|
13
|
+
end
|
14
|
+
end
|
15
|
+
|
16
|
+
let :collisions do
|
17
|
+
Lycopodium.new(['foo', 'f o o', 'bar'], lambda{|string| string.gsub(' ', '')})
|
18
|
+
end
|
19
|
+
|
20
|
+
let :no_collisions do
|
21
|
+
Lycopodium.new(['foo', 'f o o', 'bar'], lambda{|string| string.upcase})
|
22
|
+
end
|
23
|
+
|
24
|
+
describe '.unique_key' do
|
25
|
+
it 'should return a unique key if a unique key is found' do
|
26
|
+
Lycopodium.unique_key([
|
27
|
+
['foo', 'bar', 'baz'],
|
28
|
+
['foo', 'bar', 'bzz'],
|
29
|
+
['foo', 'zzz', 'bzz'],
|
30
|
+
]).should == [1, 2]
|
31
|
+
end
|
32
|
+
|
33
|
+
it 'should return nil if no unique key is found' do
|
34
|
+
Lycopodium.unique_key([
|
35
|
+
['foo', 'bar'],
|
36
|
+
['foo', 'bar'],
|
37
|
+
['foo', 'bar'],
|
38
|
+
]).should == nil
|
39
|
+
end
|
40
|
+
|
41
|
+
it 'should raise an error if ragged rows' do
|
42
|
+
expect{Lycopodium.unique_key([
|
43
|
+
['foo'],
|
44
|
+
['foo', 'bar'],
|
45
|
+
])}.to raise_error(Lycopodium::RaggedRow, %(["foo", "bar"]))
|
46
|
+
end
|
47
|
+
end
|
48
|
+
|
49
|
+
describe '#reject_collisions' do
|
50
|
+
it 'should remove all members of the set that collide after transformation' do
|
51
|
+
collisions.reject_collisions.should == ['bar']
|
52
|
+
end
|
53
|
+
end
|
54
|
+
|
55
|
+
describe '#value_to_fingerprint' do
|
56
|
+
it 'should return a mapping from the original to the transformed value' do
|
57
|
+
no_collisions.value_to_fingerprint.should == {'foo' => 'FOO', 'f o o' => 'F O O', 'bar' => 'BAR'}
|
58
|
+
end
|
59
|
+
|
60
|
+
it 'should raise an error if the method creates collisions between members of the set' do
|
61
|
+
expect{collisions.value_to_fingerprint}.to raise_error(Lycopodium::Collision, %("foo", "f o o" => "foo"))
|
62
|
+
end
|
63
|
+
end
|
64
|
+
end
|
data/spec/spec_helper.rb
CHANGED
metadata
CHANGED
@@ -1,58 +1,66 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: lycopodium
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
5
|
-
prerelease:
|
4
|
+
version: 0.0.3
|
6
5
|
platform: ruby
|
7
6
|
authors:
|
8
|
-
-
|
7
|
+
- James McKinney
|
9
8
|
autorequire:
|
10
9
|
bindir: bin
|
11
10
|
cert_chain: []
|
12
|
-
date:
|
11
|
+
date: 2015-02-03 00:00:00.000000000 Z
|
13
12
|
dependencies:
|
14
13
|
- !ruby/object:Gem::Dependency
|
15
14
|
name: rspec
|
16
15
|
requirement: !ruby/object:Gem::Requirement
|
17
|
-
none: false
|
18
16
|
requirements:
|
19
|
-
- - ~>
|
17
|
+
- - "~>"
|
20
18
|
- !ruby/object:Gem::Version
|
21
19
|
version: '2.10'
|
22
20
|
type: :development
|
23
21
|
prerelease: false
|
24
22
|
version_requirements: !ruby/object:Gem::Requirement
|
25
|
-
none: false
|
26
23
|
requirements:
|
27
|
-
- - ~>
|
24
|
+
- - "~>"
|
28
25
|
- !ruby/object:Gem::Version
|
29
26
|
version: '2.10'
|
30
27
|
- !ruby/object:Gem::Dependency
|
31
28
|
name: rake
|
32
29
|
requirement: !ruby/object:Gem::Requirement
|
33
|
-
none: false
|
34
30
|
requirements:
|
35
|
-
- -
|
31
|
+
- - ">="
|
36
32
|
- !ruby/object:Gem::Version
|
37
33
|
version: '0'
|
38
34
|
type: :development
|
39
35
|
prerelease: false
|
40
36
|
version_requirements: !ruby/object:Gem::Requirement
|
41
|
-
none: false
|
42
37
|
requirements:
|
43
|
-
- -
|
38
|
+
- - ">="
|
39
|
+
- !ruby/object:Gem::Version
|
40
|
+
version: '0'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: coveralls
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - ">="
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: '0'
|
48
|
+
type: :development
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - ">="
|
44
53
|
- !ruby/object:Gem::Version
|
45
54
|
version: '0'
|
46
55
|
description:
|
47
|
-
email:
|
48
|
-
- info@opennorth.ca
|
56
|
+
email:
|
49
57
|
executables: []
|
50
58
|
extensions: []
|
51
59
|
extra_rdoc_files: []
|
52
60
|
files:
|
53
|
-
- .gitignore
|
54
|
-
- .travis.yml
|
55
|
-
- .yardopts
|
61
|
+
- ".gitignore"
|
62
|
+
- ".travis.yml"
|
63
|
+
- ".yardopts"
|
56
64
|
- Gemfile
|
57
65
|
- LICENSE
|
58
66
|
- README.md
|
@@ -61,37 +69,33 @@ files:
|
|
61
69
|
- lib/lycopodium.rb
|
62
70
|
- lib/lycopodium/version.rb
|
63
71
|
- lycopodium.gemspec
|
72
|
+
- spec/lycopodium_spec.rb
|
64
73
|
- spec/spec_helper.rb
|
65
|
-
homepage:
|
66
|
-
licenses:
|
74
|
+
homepage: https://github.com/jpmckinney/lycopodium
|
75
|
+
licenses:
|
76
|
+
- MIT
|
77
|
+
metadata: {}
|
67
78
|
post_install_message:
|
68
79
|
rdoc_options: []
|
69
80
|
require_paths:
|
70
81
|
- lib
|
71
82
|
required_ruby_version: !ruby/object:Gem::Requirement
|
72
|
-
none: false
|
73
83
|
requirements:
|
74
|
-
- -
|
84
|
+
- - ">="
|
75
85
|
- !ruby/object:Gem::Version
|
76
86
|
version: '0'
|
77
|
-
segments:
|
78
|
-
- 0
|
79
|
-
hash: 750622676358250438
|
80
87
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
81
|
-
none: false
|
82
88
|
requirements:
|
83
|
-
- -
|
89
|
+
- - ">="
|
84
90
|
- !ruby/object:Gem::Version
|
85
91
|
version: '0'
|
86
|
-
segments:
|
87
|
-
- 0
|
88
|
-
hash: 750622676358250438
|
89
92
|
requirements: []
|
90
93
|
rubyforge_project:
|
91
|
-
rubygems_version:
|
94
|
+
rubygems_version: 2.2.2
|
92
95
|
signing_key:
|
93
|
-
specification_version:
|
94
|
-
summary: Test what transformations you can make to a set of
|
95
|
-
|
96
|
+
specification_version: 4
|
97
|
+
summary: Test what transformations you can make to a set of values without creating
|
98
|
+
collisions
|
96
99
|
test_files:
|
100
|
+
- spec/lycopodium_spec.rb
|
97
101
|
- spec/spec_helper.rb
|