lycopodium 0.0.2 → 0.0.3

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 4bc1c70f9f05818e7ba034a54541e4b00a73c467
4
+ data.tar.gz: 45aa5ecad0f8ea72fbd3e52dd3c955b009d1b5b7
5
+ SHA512:
6
+ metadata.gz: 980b7eb2f0a907186731a62f81903fb3e9eee5985190c9aed11bb61355437e120acfd8f3169145eff44bad2f3e14a2e7e8efdbc8cc725dc0850d06e5a6217078
7
+ data.tar.gz: a36b387c15a1ce568f07a8606947178be56380384540ff90a055ac0d97a3c928bc9703a9688ac1a5e2d7a720121d3f24346609c0d24f3b29fc5ca4898dde68e4
@@ -1,7 +1,6 @@
1
1
  language: ruby
2
2
  rvm:
3
- - 1.8.7
4
3
  - 1.9.2
5
4
  - 1.9.3
6
5
  - 2.0.0
7
- - ree
6
+ - 2.1.0
data/Gemfile CHANGED
@@ -1,4 +1,4 @@
1
- source "http://rubygems.org"
1
+ source "https://rubygems.org/"
2
2
 
3
3
  # Specify your gem's dependencies in the gemspec
4
4
  gemspec
data/LICENSE CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2013 Open North Inc.
1
+ Copyright (c) 2013 James McKinney
2
2
 
3
3
  Permission is hereby granted, free of charge, to any person obtaining
4
4
  a copy of this software and associated documentation files (the
data/README.md CHANGED
@@ -1,13 +1,37 @@
1
1
  # Lycopodium Finds Fingerprints
2
2
 
3
- [![Build Status](https://secure.travis-ci.org/opennorth/lycopodium.png)](http://travis-ci.org/opennorth/lycopodium)
4
- [![Dependency Status](https://gemnasium.com/opennorth/lycopodium.png)](https://gemnasium.com/opennorth/lycopodium)
5
- [![Coverage Status](https://coveralls.io/repos/opennorth/lycopodium/badge.png?branch=master)](https://coveralls.io/r/opennorth/lycopodium)
6
- [![Code Climate](https://codeclimate.com/github/opennorth/lycopodium.png)](https://codeclimate.com/github/opennorth/lycopodium)
3
+ [![Gem Version](https://badge.fury.io/rb/lycopodium.svg)](https://badge.fury.io/rb/lycopodium)
4
+ [![Build Status](https://secure.travis-ci.org/jpmckinney/lycopodium.png)](https://travis-ci.org/jpmckinney/lycopodium)
5
+ [![Dependency Status](https://gemnasium.com/jpmckinney/lycopodium.png)](https://gemnasium.com/jpmckinney/lycopodium)
6
+ [![Coverage Status](https://coveralls.io/repos/jpmckinney/lycopodium/badge.png)](https://coveralls.io/r/jpmckinney/lycopodium)
7
+ [![Code Climate](https://codeclimate.com/github/jpmckinney/lycopodium.png)](https://codeclimate.com/github/jpmckinney/lycopodium)
7
8
 
8
- Test what transformations you can make to a set of values without creating collisions.
9
+ Lycopodium does two things:
9
10
 
10
- > Historically, Lycopodium powder, the spores of Lycopodium and related plants, was used as a fingerprint powder. – [Wikipedia](http://en.wikipedia.org/wiki/Fingerprint_powder#Composition)
11
+ 1. Test what transformations you can make to a set of values without creating collisions.
12
+ 1. Find [unique key](http://en.wikipedia.org/wiki/Unique_key) constraints in a data table.
13
+
14
+ > Historically, Lycopodium powder, the spores of Lycopodium and related plants, was used as a fingerprint powder. – [Wikipedia](https://en.wikipedia.org/wiki/Fingerprint_powder#Composition)
15
+
16
+ ## What it tries to solve
17
+
18
+ ### Find a key collision method
19
+
20
+ Let's say you have an authoritative list of names: for example, a list of organization names from a [company register](https://www.ic.gc.ca/app/scr/cc/CorporationsCanada/fdrlCrpSrch.html?locale=en_CA). You want to match a messy list of names – for example, a list of government contractors published by a city – against this authoritative list.
21
+
22
+ For context, [Open Refine](http://openrefine.org/) offers [two methods to solve this problem](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth):
23
+
24
+ * Key collision methods group names that transform into the same fingerprint; transformations include lowercasing letters, removing whitespace and punctuation, sorting words, etc.
25
+
26
+ * Nearest neighbor methods group names that are close to each other, using distance functions like [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) and [Prediction by Partial Matching](https://en.wikipedia.org/wiki/Prediction_by_Partial_Matching).
27
+
28
+ Key collision methods tend to be fast and strict, whereas nearest neighbor methods are more likely to produce false positives, especially when dealing with short strings.
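To see why short strings are risky for nearest neighbor methods, consider a plain-Ruby Levenshtein distance. This is an illustrative sketch only: the `levenshtein` helper and the sample names are hypothetical and are not part of Lycopodium or Open Refine.

```ruby
# Computes the Levenshtein (edit) distance between two strings,
# keeping only the previous row of the dynamic-programming table.
def levenshtein(a, b)
  previous = (0..b.size).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost, # substitution
      ].min
    end
    previous = current
  end
  previous.last
end

# Two different (made-up) short names are only one edit apart.
levenshtein('CBC', 'BBC')
# => 1
```

With a distance threshold of 1, these two distinct names would be grouped together, which is the kind of false positive a strict fingerprint avoids.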
29
+
30
+ If you want fast and strict reconciliation, Lycopodium lets you figure out what transformations can be applied to an authoritative list of names without creating collisions between names. Those transformations can then be safely applied to the names on the messy list to match against the authoritative list.
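As a rough illustration of the idea (not Lycopodium's own implementation), a candidate transformation can be checked in plain Ruby by grouping the authoritative names by fingerprint and looking for groups with more than one member. The `fingerprint` lambda, the `collisions` helper, and the sample names are all hypothetical.

```ruby
# Hypothetical fingerprint: lowercase, drop punctuation, sort the words.
fingerprint = ->(name) do
  name.downcase.gsub(/[^a-z0-9\s]/, '').split.sort.join(' ')
end

# Returns the groups of names that share a fingerprint (empty if none collide).
def collisions(names, function)
  names.group_by { |name| function.call(name) }
       .select { |_, group| group.size > 1 }
end

authoritative = ['Acme Inc.', 'ACME, Inc', 'Widget Co.']
collisions(authoritative, fingerprint)
# => {"acme inc"=>["Acme Inc.", "ACME, Inc"]}
```

A non-empty result means the transformation is too aggressive for this list: it would merge names that the authoritative source treats as distinct.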
31
+
32
+ ### Find a unique key in a data table
33
+
34
+ Let's say you have a data table: for example, the City of Toronto publishes [voting records grouped by city councillor](http://app.toronto.ca/tmmis/getAdminReport.do?function=prepareMemberVoteReport). You want to group the voting records by the motion being voted on instead. However, the data table doesn't contain a single column identifying the motion. You instead need to find the combination of columns that, taken together, identifies the motion. In other words, you are looking for the data table's [unique key](http://en.wikipedia.org/wiki/Unique_key). Lycopodium finds it.
11
35
 
12
36
  ## Usage
13
37
 
@@ -15,7 +39,23 @@ Test what transformations you can make to a set of values without creating colli
15
39
  require 'lycopodium'
16
40
  ```
17
41
 
18
- First, write a method that transforms a value, for example:
42
+ ### Find a unique key in a data table
43
+
44
+ ```ruby
45
+ table = [
46
+ ['foo', 'bar', 'baz'],
47
+ ['foo', 'bar', 'bzz'],
48
+ ['foo', 'zzz', 'bzz'],
49
+ ]
50
+ Lycopodium.unique_key(table)
51
+ # => [1, 2]
52
+ ```
53
+
54
+ The result lists zero-based column indexes: the values of the second and third columns, taken together, are unique for each row in the table. In other words, you can uniquely identify a row by taking the values of its second and third columns.
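The search behind such a result can be sketched in standalone Ruby: try every combination of column indexes, smallest first, and return the first one whose values differ on every row. This is a simplified sketch rather than the gem's code; `find_unique_key` and the `headers` array are hypothetical names, and the sketch assumes a non-empty table whose rows all have the same length.

```ruby
# Returns the smallest combination of zero-based column indexes whose values
# are distinct on every row, or nil if no combination is.
def find_unique_key(rows)
  columns = (0...rows.first.size).to_a
  1.upto(columns.size) do |k|
    columns.combination(k) do |combination|
      values = rows.map { |row| row.values_at(*combination) }
      return combination if values.uniq.size == rows.size
    end
  end
  nil
end

table = [
  ['foo', 'bar', 'baz'],
  ['foo', 'bar', 'bzz'],
  ['foo', 'zzz', 'bzz'],
]
headers = ['councillor', 'date', 'item'] # hypothetical column names

key = find_unique_key(table)
# => [1, 2]
headers.values_at(*key)
# => ["date", "item"]
```

Because every combination may be tried, the worst case grows exponentially with the number of columns, so this brute-force approach suits narrow tables best.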
55
+
56
+ ### Find a key collision method
57
+
58
+ Write a method that transforms a value, for example:
19
59
 
20
60
  ```ruby
21
61
  meth1 = ->(string) do
@@ -23,10 +63,10 @@ meth1 = ->(string) do
23
63
  end
24
64
  ```
25
65
 
26
- Then, initialize a `Lycopodium` instance with a set of values and the transformation method:
66
+ Then, initialize a `Lycopodium` instance with a set of values and the method:
27
67
 
28
68
  ```ruby
29
- set = Lycopodium.new(["foo", "f o o"], meth1)
69
+ set = Lycopodium.new(["foo", "f o o", " bar "], meth1)
30
70
  ```
31
71
 
32
72
  Lastly, test whether the method creates collisions between the members of the set:
@@ -47,21 +87,47 @@ meth2 = ->(string) do
47
87
  end
48
88
  ```
49
89
 
50
- It will return the mapping from original to transformed string:
90
+ It will return the mapping from original to transformed string (hence `value_to_fingerprint`):
51
91
 
52
- {"foo" => "FOO", "f o o" => "F O O"}
92
+ ```ruby
93
+ set.function = meth2
94
+ set.value_to_fingerprint
95
+ # => {"foo"=>"FOO", "f o o"=>"F O O", " bar "=>" BAR "}
96
+ ```
53
97
 
54
98
  We thus learn that whitespace disambiguates between members of the set, but letter case does not.
55
99
 
56
- To remove all members of the set that collide after transformation, run:
100
+ If you can't find a suitable method, you can remove all values that collide after transformation:
57
101
 
58
102
  ```ruby
103
+ set.function = meth1
59
104
  set_without_collisions = set.reject_collisions
105
+ # => [" bar "]
106
+ set_without_collisions.value_to_fingerprint
107
+ # => {" bar "=>"bar"}
60
108
  ```
61
109
 
62
110
  A `Lycopodium` instance otherwise behaves as an array.
63
111
 
64
- ## Method definition
112
+ ### Use the key collision method
113
+
114
+ You can now apply the method to other values…
115
+
116
+ ```ruby
117
+ messy = "\tbar\n"
118
+ fingerprint = meth1.call(messy)
119
+ # => "bar"
120
+ ```
121
+
122
+ … and match against your original values:
123
+
124
+ ```ruby
125
+ fingerprint_to_value = set_without_collisions.value_to_fingerprint.invert
126
+ fingerprint_to_value.fetch(fingerprint)
127
+ # => " bar "
128
+ ```
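To reconcile a whole list rather than a single value, the same lookup can be run in a loop. The sketch below is standalone Ruby with a hypothetical `fingerprint` lambda and made-up sample data, and it assumes the authoritative values do not collide under that fingerprint. It uses `Hash#[]` instead of `fetch` so that unmatched values yield `nil` rather than raising `KeyError`.

```ruby
# Hypothetical fingerprint: strip all whitespace and lowercase.
fingerprint = ->(string) { string.gsub(/\s+/, '').downcase }

authoritative = ['Foo', ' bar ']
messy = ["\tbar\n", 'F O O', 'baz']

# Index the authoritative values by fingerprint.
fingerprint_to_value = {}
authoritative.each { |value| fingerprint_to_value[fingerprint.call(value)] = value }

# Pair each messy value with its authoritative match, or nil if there is none.
matches = messy.map { |value| [value, fingerprint_to_value[fingerprint.call(value)]] }
# => [["\tbar\n", " bar "], ["F O O", "Foo"], ["baz", nil]]
```

Values paired with `nil` are the ones left over for manual review or for a looser matching method.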
129
+
130
+ ### Method definition
65
131
 
66
132
  Besides the `->` syntax above, you can define the same method as:
67
133
 
@@ -96,8 +162,10 @@ end
96
162
  meth = Object.method(:func)
97
163
  ```
98
164
 
99
- ## Bugs? Questions?
165
+ ## Related projects
100
166
 
101
- This project's main repository is on GitHub: [http://github.com/opennorth/lycopodium](http://github.com/opennorth/lycopodium), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
167
+ * [Nomenklatura](http://nomenklatura.okfnlabs.org/) is a web service to maintain a canonical list of entities and to match messy input against it, either via the user interface or via Open Refine reconciliation.
168
+ * [dedupe](https://github.com/open-city/dedupe) is a Python library to determine when two records are about the same thing.
169
+ * [name-cleaver](https://github.com/sunlightlabs/name-cleaver) is a Python library to parse and standardize the names of people and organizations.
102
170
 
103
- Copyright (c) 2013 Open North Inc., released under the MIT license
171
+ Copyright (c) 2013 James McKinney, released under the MIT license
@@ -3,29 +3,53 @@ require "set"
3
3
  class Lycopodium < Array
4
4
  class Error < StandardError; end
5
5
  class Collision < Error; end
6
+ class RaggedRow < Error; end
6
7
 
7
8
  attr_accessor :function
8
9
 
10
+ def self.unique_key(data)
11
+ columns_size = data.first.size
12
+ data.each do |row|
13
+ unless row.size == columns_size
14
+ raise RaggedRow, row.inspect
15
+ end
16
+ end
17
+
18
+ columns = (0...columns_size).to_a
19
+ 1.upto(columns_size) do |k|
20
+ columns.combination(k) do |combination|
21
+ if unique_key?(data, combination)
22
+ return combination
23
+ end
24
+ end
25
+ end
26
+ nil
27
+ end
28
+
9
29
  # @param [Array] set a set of values
10
30
  # @param [Proc] function a method that transforms a value
11
31
  def initialize(set, function = lambda{|value| value})
12
32
  replace(set)
13
- self.function = function
33
+ @function = function
14
34
  end
15
35
 
36
+ # Removes all members of the set that collide after transformation.
37
+ #
16
38
  # @return [Array] the members of the set without collisions
17
39
  def reject_collisions
18
40
  hashes, collisions = hashes_and_collisions
19
41
 
20
- items = hashes.reject do |_,hash|
21
- collisions.include?(hash)
22
- end.map do |item,_|
23
- item
42
+ items = []
43
+ hashes.each do |item,hash|
44
+ unless collisions.include?(hash)
45
+ items << item
46
+ end
24
47
  end
25
-
26
48
  self.class.new(items, function)
27
49
  end
28
50
 
51
+ # Returns a mapping from the original to the transformed value.
52
+ #
29
53
  # @return [Hash] a mapping from the original to the transformed value
30
54
  # @raise [Collision] if the method creates collisions between members of the set
31
55
  def value_to_fingerprint
@@ -49,6 +73,17 @@ class Lycopodium < Array
49
73
 
50
74
  private
51
75
 
76
+ def self.unique_key?(data, combination)
77
+ set = Set.new
78
+ data.each_with_index do |row,index|
79
+ set.add(row.values_at(*combination))
80
+ if set.size <= index
81
+ return false
82
+ end
83
+ end
84
+ true
85
+ end
86
+
52
87
  def hashes_and_collisions
53
88
  collisions = Set.new
54
89
 
@@ -1,3 +1,3 @@
1
- module Lycopodium
2
- VERSION = "0.0.2"
1
+ class Lycopodium < Array
2
+ VERSION = "0.0.3"
3
3
  end
@@ -5,10 +5,10 @@ Gem::Specification.new do |s|
5
5
  s.name = "lycopodium"
6
6
  s.version = Lycopodium::VERSION
7
7
  s.platform = Gem::Platform::RUBY
8
- s.authors = ["Open North"]
9
- s.email = ["info@opennorth.ca"]
10
- s.homepage = "http://github.com/opennorth/lycopodium"
11
- s.summary = %q{Test what transformations you can make to a set of unique strings without creating collisions}
8
+ s.authors = ["James McKinney"]
9
+ s.homepage = "https://github.com/jpmckinney/lycopodium"
10
+ s.summary = %q{Test what transformations you can make to a set of values without creating collisions}
11
+ s.license = 'MIT'
12
12
 
13
13
  s.files = `git ls-files`.split("\n")
14
14
  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
@@ -17,4 +17,5 @@ Gem::Specification.new do |s|
17
17
 
18
18
  s.add_development_dependency('rspec', '~> 2.10')
19
19
  s.add_development_dependency('rake')
20
+ s.add_development_dependency('coveralls')
20
21
  end
@@ -0,0 +1,64 @@
1
+ require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
2
+
3
+ describe Lycopodium do
4
+ describe '#initialize' do
5
+ it 'should use an identity function by default' do
6
+ set = Lycopodium.new([])
7
+ set.function.call(1).should == 1
8
+ end
9
+
10
+ it 'should accept a function as an argument' do
11
+ set = Lycopodium.new([], lambda{|value| value * 2})
12
+ set.function.call(1).should == 2
13
+ end
14
+ end
15
+
16
+ let :collisions do
17
+ Lycopodium.new(['foo', 'f o o', 'bar'], lambda{|string| string.gsub(' ', '')})
18
+ end
19
+
20
+ let :no_collisions do
21
+ Lycopodium.new(['foo', 'f o o', 'bar'], lambda{|string| string.upcase})
22
+ end
23
+
24
+ describe '.unique_key' do
25
+ it 'should return a unique key if a unique key is found' do
26
+ Lycopodium.unique_key([
27
+ ['foo', 'bar', 'baz'],
28
+ ['foo', 'bar', 'bzz'],
29
+ ['foo', 'zzz', 'bzz'],
30
+ ]).should == [1, 2]
31
+ end
32
+
33
+ it 'should return nil if no unique key is found' do
34
+ Lycopodium.unique_key([
35
+ ['foo', 'bar'],
36
+ ['foo', 'bar'],
37
+ ['foo', 'bar'],
38
+ ]).should == nil
39
+ end
40
+
41
+ it 'should raise an error if ragged rows' do
42
+ expect{Lycopodium.unique_key([
43
+ ['foo'],
44
+ ['foo', 'bar'],
45
+ ])}.to raise_error(Lycopodium::RaggedRow, %(["foo", "bar"]))
46
+ end
47
+ end
48
+
49
+ describe '#reject_collisions' do
50
+ it 'should remove all members of the set that collide after transformation' do
51
+ collisions.reject_collisions.should == ['bar']
52
+ end
53
+ end
54
+
55
+ describe '#value_to_fingerprint' do
56
+ it 'should return a mapping from the original to the transformed value' do
57
+ no_collisions.value_to_fingerprint.should == {'foo' => 'FOO', 'f o o' => 'F O O', 'bar' => 'BAR'}
58
+ end
59
+
60
+ it 'should raise an error if the method creates collisions between members of the set' do
61
+ expect{collisions.value_to_fingerprint}.to raise_error(Lycopodium::Collision, %("foo", "f o o" => "foo"))
62
+ end
63
+ end
64
+ end
@@ -1,3 +1,7 @@
1
1
  require 'rubygems'
2
+
3
+ require 'coveralls'
4
+ Coveralls.wear!
5
+
2
6
  require 'rspec'
3
7
  require File.dirname(__FILE__) + '/../lib/lycopodium'
metadata CHANGED
@@ -1,58 +1,66 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: lycopodium
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
5
- prerelease:
4
+ version: 0.0.3
6
5
  platform: ruby
7
6
  authors:
8
- - Open North
7
+ - James McKinney
9
8
  autorequire:
10
9
  bindir: bin
11
10
  cert_chain: []
12
- date: 2013-08-16 00:00:00.000000000 Z
11
+ date: 2015-02-03 00:00:00.000000000 Z
13
12
  dependencies:
14
13
  - !ruby/object:Gem::Dependency
15
14
  name: rspec
16
15
  requirement: !ruby/object:Gem::Requirement
17
- none: false
18
16
  requirements:
19
- - - ~>
17
+ - - "~>"
20
18
  - !ruby/object:Gem::Version
21
19
  version: '2.10'
22
20
  type: :development
23
21
  prerelease: false
24
22
  version_requirements: !ruby/object:Gem::Requirement
25
- none: false
26
23
  requirements:
27
- - - ~>
24
+ - - "~>"
28
25
  - !ruby/object:Gem::Version
29
26
  version: '2.10'
30
27
  - !ruby/object:Gem::Dependency
31
28
  name: rake
32
29
  requirement: !ruby/object:Gem::Requirement
33
- none: false
34
30
  requirements:
35
- - - ! '>='
31
+ - - ">="
36
32
  - !ruby/object:Gem::Version
37
33
  version: '0'
38
34
  type: :development
39
35
  prerelease: false
40
36
  version_requirements: !ruby/object:Gem::Requirement
41
- none: false
42
37
  requirements:
43
- - - ! '>='
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: coveralls
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
44
53
  - !ruby/object:Gem::Version
45
54
  version: '0'
46
55
  description:
47
- email:
48
- - info@opennorth.ca
56
+ email:
49
57
  executables: []
50
58
  extensions: []
51
59
  extra_rdoc_files: []
52
60
  files:
53
- - .gitignore
54
- - .travis.yml
55
- - .yardopts
61
+ - ".gitignore"
62
+ - ".travis.yml"
63
+ - ".yardopts"
56
64
  - Gemfile
57
65
  - LICENSE
58
66
  - README.md
@@ -61,37 +69,33 @@ files:
61
69
  - lib/lycopodium.rb
62
70
  - lib/lycopodium/version.rb
63
71
  - lycopodium.gemspec
72
+ - spec/lycopodium_spec.rb
64
73
  - spec/spec_helper.rb
65
- homepage: http://github.com/opennorth/lycopodium
66
- licenses: []
74
+ homepage: https://github.com/jpmckinney/lycopodium
75
+ licenses:
76
+ - MIT
77
+ metadata: {}
67
78
  post_install_message:
68
79
  rdoc_options: []
69
80
  require_paths:
70
81
  - lib
71
82
  required_ruby_version: !ruby/object:Gem::Requirement
72
- none: false
73
83
  requirements:
74
- - - ! '>='
84
+ - - ">="
75
85
  - !ruby/object:Gem::Version
76
86
  version: '0'
77
- segments:
78
- - 0
79
- hash: 750622676358250438
80
87
  required_rubygems_version: !ruby/object:Gem::Requirement
81
- none: false
82
88
  requirements:
83
- - - ! '>='
89
+ - - ">="
84
90
  - !ruby/object:Gem::Version
85
91
  version: '0'
86
- segments:
87
- - 0
88
- hash: 750622676358250438
89
92
  requirements: []
90
93
  rubyforge_project:
91
- rubygems_version: 1.8.25
94
+ rubygems_version: 2.2.2
92
95
  signing_key:
93
- specification_version: 3
94
- summary: Test what transformations you can make to a set of unique strings without
95
- creating collisions
96
+ specification_version: 4
97
+ summary: Test what transformations you can make to a set of values without creating
98
+ collisions
96
99
  test_files:
100
+ - spec/lycopodium_spec.rb
97
101
  - spec/spec_helper.rb