licensee 5.0.0 → 6.0.0b1

Files changed (43)
  1. checksums.yaml +4 -4
  2. data/README.md +15 -50
  3. data/bin/licensee +7 -8
  4. data/lib/licensee.rb +9 -33
  5. data/lib/licensee/content_helper.rb +7 -8
  6. data/lib/licensee/license.rb +5 -28
  7. data/lib/licensee/matchers/copyright_matcher.rb +17 -16
  8. data/lib/licensee/matchers/dice_matcher.rb +65 -0
  9. data/lib/licensee/matchers/exact_matcher.rb +12 -6
  10. data/lib/licensee/matchers/gemspec_matcher.rb +11 -11
  11. data/lib/licensee/matchers/npm_bower_matcher.rb +10 -10
  12. data/lib/licensee/matchers/package_matcher.rb +11 -10
  13. data/lib/licensee/project.rb +96 -30
  14. data/lib/licensee/project_file.rb +57 -77
  15. data/lib/licensee/version.rb +1 -1
  16. data/licensee.gemspec +26 -0
  17. data/test/fixtures/npm.git/HEAD +1 -0
  18. data/test/fixtures/npm.git/config +4 -0
  19. data/test/fixtures/npm.git/objects/info/packs +2 -0
  20. data/test/fixtures/npm.git/objects/pack/pack-03c0879445cabcc37f91d97c7955465adef26f4a.idx +0 -0
  21. data/test/fixtures/npm.git/objects/pack/pack-03c0879445cabcc37f91d97c7955465adef26f4a.pack +0 -0
  22. data/test/fixtures/npm.git/packed-refs +2 -0
  23. data/test/functions.rb +4 -15
  24. data/test/test_licensee.rb +1 -13
  25. data/test/test_licensee_copyright_matcher.rb +19 -28
  26. data/test/test_licensee_dice_matcher.rb +21 -0
  27. data/test/test_licensee_exact_matcher.rb +4 -6
  28. data/test/test_licensee_gemspec_matcher.rb +3 -11
  29. data/test/test_licensee_license.rb +2 -12
  30. data/test/test_licensee_npm_bower_matcher.rb +10 -16
  31. data/test/test_licensee_project.rb +24 -35
  32. data/test/test_licensee_project_file.rb +5 -10
  33. data/vendor/choosealicense.com/_licenses/afl-3.0.txt +69 -0
  34. data/vendor/choosealicense.com/_licenses/isc.txt +2 -2
  35. metadata +14 -26
  36. data/lib/licensee/filesystem_repository.rb +0 -38
  37. data/lib/licensee/matcher.rb +0 -32
  38. data/lib/licensee/matchers/git_matcher.rb +0 -27
  39. data/lib/licensee/matchers/levenshtein_matcher.rb +0 -75
  40. data/test/test_licensee_content_helper.rb +0 -40
  41. data/test/test_licensee_git_matcher.rb +0 -19
  42. data/test/test_licensee_levenshtein_matcher.rb +0 -34
  43. data/test/test_licensee_matcher.rb +0 -7
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 177e5c94bb43c3603d0581bfaed3f873ab17c94e
-  data.tar.gz: a9218af2d2922cccbddf269df1bd35d710198c79
+  metadata.gz: a9a12c248b281de58f46188030a603b373bfb502
+  data.tar.gz: 384ffeaeef392f999d1b0c6f20efd6651abcf34d
 SHA512:
-  metadata.gz: 6a001f3ec27daba23c2c0a3d5902fd7f926861d3a5071aa4756591cb313884c77f6c022851f94fd440c1bd894ee45c3f6596b5835a062bf2926e1ff6abd99fd1
-  data.tar.gz: bab94b9c213be6caed4ab179cbe7b4b64bd928482a219ab98b63a24c0700a0190998e71dbfc747af597d9b2671863c0bdf4ab359903686be9200610ef7756323
+  metadata.gz: f90e04e86ea6a97828ac39c52ffeb3661082c5df4a4b1c3944a09555d1e39827c052870ea63541eef675639170ff18f3bc661a2b1e84fe0e8451a3fe299aa04c
+  data.tar.gz: 3c2a06a4e5f9c27fa26df6f8fbb0ff62ce730d0af1dcbaed1774c1b38990158a07813306892f47a9386a0661dae48be27a3bbd31d5c33b197f88cd73a8bcc23b
data/README.md CHANGED
@@ -1,33 +1,22 @@
 # Licensee
-
-*A Ruby Gem to detect under what license a project is distributed.*
+_A Ruby Gem to detect under what license a project is distributed._
 
 [![Build Status](https://travis-ci.org/benbalter/licensee.svg?branch=master)](https://travis-ci.org/benbalter/licensee) [![Gem Version](https://badge.fury.io/rb/licensee.svg)](http://badge.fury.io/rb/licensee)
 
 ## The problem
-
-* You've got an open source project. How do you know what you can and can't do with the software?
-* You've got a bunch of open source projects, how do you know what their licenses are?
-* You've got a project with a license file, but which license is it? Has it been modified?
+- You've got an open source project. How do you know what you can and can't do with the software?
+- You've got a bunch of open source projects, how do you know what their licenses are?
+- You've got a project with a license file, but which license is it? Has it been modified?
 
 ## The solution
-
 Licensee automates the process of reading `LICENSE` files and compares their contents to known licenses using a several strategies (which we call "Matchers"). It attempts to determine a project's license in the following order:
+- If the license file has an explicit copyright notice, and nothing more (e.g., `Copyright (c) 2015 Ben Balter`), we'll assume the author intends to retain all rights, and thus the project isn't licensed.
+- If the license is an exact match to a known license. If we strip away whitespace and copyright notice, we might get lucky, and direct string comparison in Ruby is cheap.
+- If we still can't match the license, we use a fancy math thing called the [Sørensen–Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient), which is really good at calculating the similarity between two strings. By calculating the percent changed from the known license to the license file, you can tell, e.g., that a given license is 90% similar to the MIT license, that 10% likely representing the copyright line being properly adapted to the project.
 
-1. If the license file has an explicit copyright notice, and nothing more (e.g., `Copyright (c) 2015 Ben Balter`), we'll assume the author intends to retain all rights, and thus the project isn't licensed.
-
-2. If the license is an exact match to a known license. Licenses like GPL don't have a copyright notice that needs to be changed in the license itself, so if we strip away whitespace, we might get lucky, and direct string comparison in Ruby is cheap.
-
-3. If 90% of the lines match a known license. We use Git's internal change calculation method. To calculate diffs, Git hashes each line of both files, and compares the hashes to tell the percent changed. This method is fast, but is done on a line-by-line basis, so if the license is wrapped differently, or has extra words inserted, it's not going to match the license.
-
-4. If we still can't match the license, we use a fancy math thing called the [Levenshtein distance algorithm](https://en.wikipedia.org/wiki/Levenshtein_distance), which while very slow, is really good at calculating the similarity between two strings. By calculating the percent changed from the known license to the license file, you can tell, e.g., that a given license is 90% similar to the MIT license, that 10% likely representing the copyright line being properly adapted to the project.
-
-Licensee will even diff the distributed license with the original, so you can see exactly what, if anything's been changed.
-
-*Special thanks to [@vmg](https://github.com/vmg) for his Git prowess.*
+_Special thanks to [@vmg](https://github.com/vmg) for his Git and algorithmic prowess._
 
 ## Installation
-
 `gem install licensee` or add `gem 'licensee'` to your project's `Gemfile`.
 
 ## Usage
@@ -52,20 +41,7 @@ license.meta["permitted"]
 => ["commercial-use","modifications","distribution","sublicense","private-use"]
 ```
 
-## Diffing
-
-You can also generate a diff of the known license to the distributed license.
-
-```ruby
-puts Licensee.diff "/path/to/a/project"
--Copyright (c) [year] [fullname]
-+Copyright (c) 2014 Ben Balter
-```
-
-For a full list of diff options (HTML output, color output, etc.) see [Diffy](https://github.com/samg/diffy).
-
 ## Command line usage
-
 1. `cd` into a project directory
 2. execute the `licensee` command
 
@@ -78,54 +54,43 @@ Matcher: Licensee::GitMatcher
 ```
 
 ## What it looks at
-
-* `LICENSE`, `LICENSE.txt`, `COPYING`, etc. files in the root of the project, comparing the body to known licenses
-* Crowdsourced license content and metadata from [`choosealicense.com`](http://choosealicense.com)
+- `LICENSE`, `LICENSE.txt`, `COPYING`, etc. files in the root of the project, comparing the body to known licenses
+- Crowdsourced license content and metadata from [`choosealicense.com`](http://choosealicense.com)
 
 ## What it doesn't look at
-
-* Dependency licensing
-* References to licenses in `README`, `README.md`, etc.
-* Structured license data in package manager configuration files (like Gemfiles)
-* Every single possible license (just the most popular ones)
-* Compliance (e.g., whitelisting certain licenses)
+- Dependency licensing
+- References to licenses in `README`, `README.md`, etc.
+- Every single possible license (just the most popular ones)
+- Compliance (e.g., whitelisting certain licenses)
 
 If you're looking for dependency license checking and compliance, take a look at [LicenseFinder](https://github.com/pivotal/LicenseFinder).
 
 ## Huh? Why don't you look at X?
-
 Because reasons.
 
 ### Why not just look at the "license" field of [insert package manager here]?
-
 Because it's not legally binding. A license is a legal contract. You give up certain rights (e.g., the right to sue the author) in exchange for the right to use the software.
 
-Most popular licenses today *require* that the license itself be distributed along side the software. Simply putting the letters "MIT" or "GPL" in a configuration file doesn't really meet that requirement.
+Most popular licenses today _require_ that the license itself be distributed along side the software. Simply putting the letters "MIT" or "GPL" in a configuration file doesn't really meet that requirement.
 
 Not to mention, it doesn't tell you much about your rights as a user. Is it GPLv2? GPLv2 or later? Those files are designed to be read by computers (who can't enter into contracts), not humans (who can). It's great metadata, but that's about it.
 
 ### What about looking to see if the author said something in the readme?
-
 You could make an argument that, when linked or sufficiently identified, the terms of the license are incorporated by reference, or at least that the author's intent is there. There's a handful of reasons why this isn't ideal. For one, if you're using the MIT or BSD (ISC) license, along with a few others, there's templematic language, like the copyright notice, which would go unfilled.
 
 ### What about checking every single file for a copyright header?
-
 Because that's silly in the context of how software is developed today. You wouldn't put a copyright notice on each page of a book. Besides, it's a lot of work, as there's no standardized, cross-platform way to describe a project's license within a comment.
 
 Checking the actual text into version control is definitive, so that's what this project looks at.
 
 ## Bootstrapping a local development environment
-
 `script/bootstrap`
 
 ## Running tests
-
 `script/cibuild`
 
 ## Updating the licenses
-
 License data is pulled from `choosealicense.com`. To update the license data, simple run `script/vendor-licenses`.
 
 ## Roadmap
-
 See [proposed enhancements](https://github.com/benbalter/licensee/labels/enhancement).
data/bin/licensee CHANGED
@@ -3,19 +3,18 @@ require_relative "../lib/licensee"
 
 path = ARGV[0] || Dir.pwd
 
-Licensee.package_manager_files = true
-project = Licensee::Project.new(path)
-license = project.matched_file
+project = Licensee::GitProject.new(path, detect_packages: true)
+file = project.matched_file
 
 if project.license_file
-  puts "License file: #{project.license_file.path}"
+  puts "License file: #{project.license_file.filename}"
   puts "Attribution: #{project.license_file.attribution}" if project.license_file.attribution
 end
 
-if license
-  puts "License: #{license.match ? license.match.meta['title'] : 'no license'}"
-  puts "Confidence: #{license.confidence}%"
-  puts "Method: #{license.matcher.class}"
+if file
+  puts "License: #{file.license ? file.license.meta['title'] : 'no license'}"
+  puts "Confidence: #{file.confidence}%"
+  puts "Method: #{file.matcher.class}"
 else
   puts "Unknown"
 end
data/lib/licensee.rb CHANGED
@@ -1,25 +1,17 @@
-require 'uri'
-require 'yaml'
-require 'rugged'
-require 'levenshtein'
-
 require_relative "licensee/version"
 require_relative "licensee/content_helper"
 require_relative "licensee/license"
 require_relative "licensee/project"
 require_relative "licensee/project_file"
-require_relative "licensee/filesystem_repository"
-require_relative "licensee/matcher"
+
 require_relative "licensee/matchers/exact_matcher"
 require_relative "licensee/matchers/copyright_matcher"
-require_relative "licensee/matchers/git_matcher"
-require_relative "licensee/matchers/levenshtein_matcher"
+require_relative "licensee/matchers/dice_matcher"
 require_relative "licensee/matchers/package_matcher"
 require_relative "licensee/matchers/gemspec_matcher"
 require_relative "licensee/matchers/npm_bower_matcher"
 
 class Licensee
-
   # Over which percent is a match considered a match by default
   CONFIDENCE_THRESHOLD = 90
 
@@ -27,40 +19,24 @@ class Licensee
   DOMAIN = "http://choosealicense.com"
 
   class << self
-
-    attr_writer :confidence_threshold, :package_manager_files
+    attr_writer :confidence_threshold
 
     # Returns an array of Licensee::License instances
     def licenses(options={})
       Licensee::License.all(options)
     end
 
-    # Returns the license for a given git repo
+    # Returns the license for a given path
     def license(path)
-      Licensee::Project.new(path).license
-    end
-
-    # Array of matchers to use, in order of preference
-    # The order should be decending order of anticipated speed to match
-    def matchers
-      matchers = [
-        Licensee::CopyrightMatcher,
-        Licensee::ExactMatcher,
-        Licensee::GitMatcher,
-        Licensee::LevenshteinMatcher,
-        Licensee::GemspecMatcher,
-        Licensee::NpmBowerMatcher
-      ]
-      matchers.reject! { |m| m.package_manager? } unless package_manager_files?
-      matchers
+      begin
+        Licensee::GitProject.new(path).license
+      rescue Licensee::GitProject::InvalidRepository
+        Licensee::FSProject.new(path).license
+      end
     end
 
     def confidence_threshold
       @confidence_threshold ||= CONFIDENCE_THRESHOLD
     end
-
-    def package_manager_files?
-      @package_manager_files || false
-    end
   end
 end
data/lib/licensee/content_helper.rb CHANGED
@@ -1,14 +1,13 @@
+require 'set'
+
 class Licensee
   module ContentHelper
-    def normalize_content(content)
+    def create_word_set(content)
       return unless content
-      content = content.downcase
-      content = content.gsub(/\A[[:space:]]+/, '')
-      content = content.gsub(/[[:space:]]+\z/, '')
-      content = content.gsub(/^#{CopyrightMatcher::REGEX}$/i, '')
-      content = content.gsub(/[[:space:]]+/, ' ')
-      content = content.gsub("\u0000", '') # Remove null byte which breaks Levenshtein
-      content.squeeze(' ').strip
+      content = content.dup
+      content.downcase!
+      content.gsub!(/^#{Matchers::Copyright::REGEX}$/i, '')
+      content.scan(/[\w']+/).to_set
     end
   end
 end
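The move from `normalize_content` to `create_word_set` in the hunk above replaces a normalized string with a set of unique lowercase words. A minimal standalone sketch of that idea follows; `COPYRIGHT_LINE` is a simplified stand-in for `Matchers::Copyright::REGEX` (an assumption: the `\xC2\xA9` byte-escape alternative is omitted), and the sample text is invented for illustration.

```ruby
require 'set'

# Simplified stand-in for Matchers::Copyright::REGEX (assumption: the
# "\xC2\xA9" byte-escape alternative from the real pattern is left out).
COPYRIGHT_LINE = /\s*Copyright (©|\(c\))? ?(\d{4}|\[year\])(.*)?\s*/i

# Lowercase the text, drop copyright lines, and collect the unique words.
# Whitespace, wrapping, and punctuation other than apostrophes are ignored.
def create_word_set(content)
  return unless content
  content = content.downcase
  content = content.gsub(/^#{COPYRIGHT_LINE}$/i, '')
  content.scan(/[\w']+/).to_set
end

# Illustrative input only, not a real license body.
words = create_word_set("Copyright (c) 2015 Ben Balter\n\nPermission is hereby granted, free of charge")
```

Because the result is a set, word order and repeated words no longer affect comparisons, which is what lets the matchers work on rewrapped license files.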
data/lib/licensee/license.rb CHANGED
@@ -1,3 +1,6 @@
+require 'uri'
+require 'yaml'
+
 class Licensee
   class InvalidLicense < ArgumentError; end
   class License
@@ -39,10 +42,6 @@ class Licensee
 
     HIDDEN_LICENSES = %w[other no-license]
 
-    # Licenses that technically contain the license name or nickname
-    # But we are so short that GitMatcher may not catch if rewrapped
-    BODY_INCLUDES_WHITELIST = %w[mit]
-
     include Licensee::ContentHelper
 
     def initialize(key)
@@ -107,17 +106,8 @@ class Licensee
     alias_method :to_s, :body
     alias_method :text, :body
 
-    # License body with all whitespace replaced with a single space
-    def body_normalized
-      @body_normalized ||= normalize_content(body)
-    end
-
-    # Git-computed hash signature for the license file
-    def hashsig
-      @hashsig ||= Rugged::Blob::HashSignature.new(
-        body, Rugged::Blob::HashSignature::WHITESPACE_SMART) unless body.nil?
-    rescue Rugged::InvalidError
-      nil
+    def wordset
+      @wordset ||= create_word_set(body)
     end
 
     def inspect
@@ -132,20 +122,7 @@ class Licensee
       other != nil && key == other.key
     end
 
-    def body_includes_name?
-      return false if BODY_INCLUDES_WHITELIST.include?(key)
-      return @body_includes_name if defined? @body_includes_name
-      @body_includes_name = body_normalized.include?(name_without_version.downcase)
-    end
-
-    def body_includes_nickname?
-      return false if BODY_INCLUDES_WHITELIST.include?(key)
-      return @body_includes_nickname if defined? @body_includes_nickname
-      @body_includes_nickname = !!(nickname && body_normalized.include?(nickname.downcase))
-    end
-
     private
-
     def parts
       @parts ||= content.match(/\A(---\n.*\n---\n+)?(.*)/m).to_a if content
     end
data/lib/licensee/matchers/copyright_matcher.rb CHANGED
@@ -1,24 +1,25 @@
 # encoding=utf-8
 class Licensee
-  class CopyrightMatcher < Matcher
+  module Matchers
+    class Copyright
+      REGEX = /\s*Copyright (©|\(c\)|\xC2\xA9)? ?(\d{4}|\[year\])(.*)?\s*/i
 
-    REGEX = /\s*Copyright (©|\(c\)|\xC2\xA9)? ?(\d{4}|\[year\])(.*)?\s*/i
+      def initialize(file)
+        @file = file
+      end
 
-    def match
-      # Note: must use content, and not content_normalized here
-      no_license if file.content.strip =~ /\A#{REGEX}\z/i
-    rescue
-      nil
-    end
-
-    def confidence
-      100
-    end
-
-    private
+      def match
+        # Note: must use content, and not content_normalized here
+        if @file.content.strip =~ /\A#{REGEX}\z/i
+          Licensee::License.find("no-license")
+        end
+      rescue
+        nil
+      end
 
-    def no_license
-      @no_license ||= Licensee::License.find("no-license")
+      def confidence
+        100
+      end
     end
   end
 end
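The copyright matcher above only fires when the entire file is a single copyright notice, because the pattern is anchored with `\A` and `\z`. A small sketch of that check, assuming a simplified pattern (`COPYRIGHT_NOTICE` drops the `\xC2\xA9` alternative) and a hypothetical helper name `copyright_only?` that is not part of the gem:

```ruby
# Simplified copy of the pattern in the diff (assumption: the "\xC2\xA9"
# byte-escape alternative is left out to keep the sketch encoding-neutral).
COPYRIGHT_NOTICE = /\s*Copyright (©|\(c\))? ?(\d{4}|\[year\])(.*)?\s*/i

# Hypothetical helper: true when the whole file is one copyright notice
# and nothing else. "(.*)" cannot cross a newline, so any following
# license text prevents the anchored match.
def copyright_only?(content)
  !!(content.strip =~ /\A#{COPYRIGHT_NOTICE}\z/i)
end
```

A file that contains a copyright line followed by license text does not match here and falls through to the later matchers.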
data/lib/licensee/matchers/dice_matcher.rb ADDED
@@ -0,0 +1,65 @@
+class Licensee
+  module Matchers
+    class Dice
+      def initialize(file)
+        @file = file
+      end
+
+      # Return the first potential license that is more similar
+      # than the confidence threshold
+      def match
+        return @match if defined? @match
+        matches = potential_licenses.map do |license|
+          if (sim = similarity(license)) >= Licensee.confidence_threshold
+            [license, sim]
+          end
+        end
+        matches.compact!
+        @match = if matches.empty?
+          nil
+        else
+          matches.max_by { |l, sim| sim }.first
+        end
+      end
+
+      # Sort all licenses, in decending order, by difference in
+      # length to the file
+      # Difference in lengths cannot exceed the file's length *
+      # the confidence threshold / 100
+      def potential_licenses
+        @potential_licenses ||= begin
+          licenses = Licensee.licenses(:hidden => true)
+          licenses = licenses.select do |license|
+            license.wordset && length_delta(license) <= max_delta
+          end
+          licenses.sort_by { |l| length_delta(l) }
+        end
+      end
+
+      # Calculate the difference between the file length and a given
+      # license's length
+      def length_delta(license)
+        (@file.wordset.size - license.wordset.size).abs
+      end
+
+      # Maximum possible difference between file length and license length
+      # for a license to be a potential license to be matched
+      def max_delta
+        @max_delta ||= (@file.wordset.size * (Licensee.confidence_threshold/100.0))
+      end
+
+      # Confidence that the matched license is a match
+      def confidence
+        @confidence ||= match ? similarity(match) : 0
+      end
+
+      private
+      # Calculate percent changed between file and potential license
+      def similarity(license)
+        overlap = (@file.wordset & license.wordset).size
+        total = @file.wordset.size + license.wordset.size
+        100.0 * (overlap * 2.0 / total)
+      end
+    end
+  end
+end
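The `similarity` and `length_delta` methods above can be tried outside the gem. The sketch below computes the Sørensen–Dice coefficient over two word sets, 2·|A ∩ B| / (|A| + |B|), scaled to a percentage, plus the word-count pre-filter. The method names `dice_similarity` and `within_length_delta?`, the empty-set guard, and the sample word lists are assumptions made for illustration.

```ruby
require 'set'

# Sørensen–Dice coefficient of two word sets, as a percentage (0.0 to 100.0).
def dice_similarity(a, b)
  total = a.size + b.size
  return 0.0 if total.zero? # assumption: two empty sets count as no match
  100.0 * (2.0 * (a & b).size / total)
end

# Cheap pre-filter mirroring length_delta <= max_delta in the diff:
# skip candidates whose word count differs too much from the file's.
def within_length_delta?(file_words, license_words, threshold)
  delta = (file_words.size - license_words.size).abs
  delta <= file_words.size * (threshold / 100.0)
end

# Illustrative word sets only: ten words each, nine in common.
file    = Set.new(%w[permission is hereby granted free of charge to any person])
license = Set.new(%w[permission is hereby granted free of charge to any individual])
score   = dice_similarity(file, license)
```

With nine shared words out of ten on each side, the score works out to roughly 90, which sits right at the default `CONFIDENCE_THRESHOLD` shown in `licensee.rb`.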
data/lib/licensee/matchers/exact_matcher.rb CHANGED
@@ -1,11 +1,17 @@
 class Licensee
-  class ExactMatcher < Matcher
-    def match
-      Licensee.licenses(:hidden => true).find { |l| l.body_normalized == file.content_normalized }
-    end
+  module Matchers
+    class Exact
+      def initialize(file)
+        @file = file
+      end
+
+      def match
+        Licensee.licenses(:hidden => true).find { |l| l.wordset == @file.wordset }
+      end
 
-    def confidence
-      100
+      def confidence
+        100
+      end
     end
   end
 end
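The exact matcher now compares word sets instead of normalized strings, so a rewrapped or re-cased file still counts as an exact match. A rough sketch under stated assumptions: `words_of` and `exact_match` are hypothetical helpers, and the `known` hash stands in for `Licensee.licenses`.

```ruby
require 'set'

# Hypothetical helper: the lowercase words of a text, as a set.
def words_of(text)
  text.downcase.scan(/[\w']+/).to_set
end

# Return the key of the first known license whose word set equals the
# file's word set, or nil when none does.
def exact_match(file_text, known_licenses)
  file_words = words_of(file_text)
  pair = known_licenses.find { |_key, body| words_of(body) == file_words }
  pair && pair.first
end

# Illustrative data only, not a real license body.
known     = { "example" => "Permission is hereby granted, free of charge" }
rewrapped = "permission is\nhereby granted,\n  FREE of charge"
```

One consequence of set equality is that word order and repeated words are ignored, so "exact" here is looser than a byte-for-byte comparison.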