license_matcher 0.1.0.pre.alpha → 0.1.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 3b4af3ec68010e357923af4f36968b826573648e
4
- data.tar.gz: c36ba2921d655acc1a073881bc5a8d901ae93d81
3
+ metadata.gz: 6b710a76802bc5254b2d0d9dd3c36376c613b7e5
4
+ data.tar.gz: fef8fe16e4028ed1c49710956dceb4ac13acee9b
5
5
  SHA512:
6
- metadata.gz: 5f0013742ef8d324c38a7fae607de3f7c7bff7bf71a5b0c9533eef7fe5736b57e1dc5fafe1ce5813b260a89e228f9c24bbb8d89920c5186bc601ccb9988580c3
7
- data.tar.gz: e325854f90a54dcebb6d9b5e6b4c5166399ccfa1815aa7e4e3f208141b082e09dadf498b081331ab823671fd255c65cdf25a941f6b942344feaf441b82fe0746
6
+ metadata.gz: ee54a1ae1b3258f9bc474a5a05c7221281be666976f6e1d547bb1673c4fa57507d0006a21b236d9ce02a55fa6e9e1ac9fe646a6c96d53c2d7255ab59d5e2c821
7
+ data.tar.gz: adb94097dea0e79e19ac4b2cf9285d42e190902f2d8f798e5eb7239d094e20ca12193efb7342ad9ee4065b2e5f2bc48ccf3cacc318419699d7a7f2b528fb6fab
data/Gemfile.lock CHANGED
@@ -1,8 +1,11 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- license_matcher (0.1.0)
4
+ license_matcher (0.1.0.pre.alpha)
5
5
  helix_runtime (~> 0.6.0)
6
+ narray (~> 0.6.1.2)
7
+ nokogiri (~> 1.8.0)
8
+ tf-idf-similarity (~> 0.1.6)
6
9
 
7
10
  GEM
8
11
  remote: https://rubygems.org/
@@ -13,6 +16,11 @@ GEM
13
16
  rake (>= 10.0)
14
17
  thor (~> 0.19.4)
15
18
  toml (~> 0.1.2)
19
+ mini_portile2 (2.2.0)
20
+ msgpack (1.1.0)
21
+ narray (0.6.1.2)
22
+ nokogiri (1.8.0)
23
+ mini_portile2 (~> 2.2.0)
16
24
  parslet (1.5.0)
17
25
  blankslate (~> 2.0)
18
26
  rake (10.5.0)
@@ -29,9 +37,12 @@ GEM
29
37
  diff-lcs (>= 1.2.0, < 2.0)
30
38
  rspec-support (~> 3.6.0)
31
39
  rspec-support (3.6.0)
40
+ tf-idf-similarity (0.1.6)
41
+ unicode_utils (~> 1.4)
32
42
  thor (0.19.4)
33
43
  toml (0.1.2)
34
44
  parslet (~> 1.5.0)
45
+ unicode_utils (1.4.0)
35
46
 
36
47
  PLATFORMS
37
48
  ruby
@@ -39,6 +50,7 @@ PLATFORMS
39
50
  DEPENDENCIES
40
51
  bundler (~> 1.15)
41
52
  license_matcher!
53
+ msgpack (~> 1.1.0)
42
54
  rake (~> 10.0)
43
55
  rspec (~> 3.4)
44
56
 
data/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # LicenseMatcher
2
2
 
3
- LicenseMatcher is a rubygem that can match a fulltext of Opensource License Text with SPDX id; So you dont have to guess is it **BSD** or **MIT** license, let the `LicenseMatcher` does the heavy lifting for you;
3
+ LicenseMatcher is a rubygem that matches a fulltext of Opensource License Text with the SPDX id; So you dont have to guess is it **BSD** or **MIT** license, let the `LicenseMatcher` does the heavy lifting for you;
4
4
 
5
5
 
6
6
  It uses [Fosslim](https://github.com/Fosslim/fosslim/) library underneath, which gives remarkable performance with lower memory cost than pure Ruby implementation;
@@ -34,16 +34,87 @@ run `bundle exec irb` on your commandline to fire up Ruby REPL;
34
34
  ```
35
35
  require 'license_matcher'
36
36
 
37
- # build index from SPDX data
38
- LicenseMatcher.build_index( "data/licenses", "data/index.msgpack")
37
+ # download pre-build index
38
+ curl -O https://github.com/Fosslim/license_matcher/blob/master/data/index.msgpack
39
+
40
+ # or build index from the SPDX data
41
+ LicenseMatcher::TFRustMatcher.build_index( "data/licenses", "data/index.msgpack")
39
42
 
40
43
  # match license text
41
44
  txt = File.read("fixtures/files/mit.txt");
42
- lm = LicenseMatcher.new("data/index.msgpack")
43
- lm.match(txt, 0.9)
45
+
46
+ lm = LicenseMatcher::TFRubyMatcher.new("data/index.msgpack")
47
+ lm.match_text(txt, 0.9)
48
+
49
+
50
+ ```
51
+
52
+
53
+ ## Matchers
54
+
55
+ It currently supports 4 different models:
56
+
57
+ * **UrlMatcher.match_url** - finds matching SPDX license by comparing URL with urls in the `licenses.json`
58
+
59
+ ```ruby
60
+ lm = LicenseMatcher::UrlMatcher.new
61
+ lm.match_url "https://opensource.org/licenses/AAL"
62
+
63
+ => "AAL"
64
+ ```
65
+
66
+ * **RuleMatcher.match_rule** - scans a text and returns the SPDX id, which rule matches longest substring in the license text
67
+
68
+ ```ruby
69
+ lm = LicenseMatcher::RuleMatcher.new
70
+ lm.match_rules "It is license under Apache 2.0 License."
71
+
72
+ => "Apache-2.0"
73
+ ```
74
+
75
+ * **TFRubyMatcher** - original Ruby implementation, uses TF/IDF and Cosine similarity;
44
76
 
45
77
  ```
78
+ lm = LicenseMatcher::TFRubyMatcher.new
46
79
 
80
+ txt = File.read "fixtures/files/mit.html"
81
+ clean_txt = lm.preprocess_html txt # NB! it may help to increase accuracy
82
+ lm.match_txt clean_txt
83
+ ```
84
+
85
+ * **TFRustMatcher** - uses simple Jaccard similarity;
86
+
87
+ ```
88
+ lm2 = LicenseMatcher::TFRustMatcher.new
89
+
90
+ txt = File.read "fixtures/files/mit.txt"
91
+ lm2.match_text txt
92
+ ```
93
+
94
+ ## Benchmarks
95
+
96
+ * initialization, Ruby version 1times, Rust version 1000x
97
+
98
+ ```
99
+ user system total real
100
+ TFRubyMatcher: 12.850000 0.180000 13.030000 ( 13.210955)
101
+ TFRustMatcher: 26.260000 9.400000 35.660000 ( 38.264632)
102
+ ```
103
+ * matching preprocessed short [MIT](https://raw.githubusercontent.com/Fosslim/license_matcher/master/data/spdx_licenses/plain/MIT) text 1000x times
104
+
105
+ ```
106
+ user system total real
107
+ TFRubyMatcher:102.410000 12.180000 114.590000 (116.308119)
108
+ TFRustMatcher: 7.170000 0.040000 7.210000 ( 7.266000)
109
+ ```
110
+
111
+ * matching preprocessed long [AGPL-3.0](https://raw.githubusercontent.com/Fosslim/license_matcher/master/data/spdx_licenses/plain/AGPL-3.0) text 1000x times
112
+
113
+ ```
114
+ user system total real
115
+ TFRubyMatcher:242.450000 21.960000 264.410000 (276.417704)
116
+ TFRustMatcher: 9.340000 0.070000 9.410000 ( 9.478597)
117
+ ```
47
118
 
48
119
  ## Development
49
120
 
Binary file
@@ -0,0 +1,75 @@
1
+ require 'nokogiri'
2
+
3
+ module Preprocess
4
+ def preprocess_text(text)
5
+ text = safe_encode(text)
6
+
7
+ #remove markdown url tags
8
+ text = text.gsub(/\[.+?\]\(.+?\)/, ' ')
9
+
10
+ #remove spam words
11
+ text.gsub!(/\bTHE\b/i, '')
12
+
13
+ #remove some XML grabage
14
+ text = text.gsub(/\<\!\[CDATA.*?\]\]\>/, ' ').to_s
15
+ text = text.gsub(/\<\!--.+?--\>/, ' ').to_s
16
+ text = text.gsub(/<\!\[CDATA.+?\]>/, ' ').to_s
17
+
18
+ return text.to_s.strip.gsub(/\s+/, ' ')
19
+ end
20
+
21
+ def preprocess_html(html_text)
22
+ # if text is HTML doc, then
23
+ # extract text only from visible html tags
24
+ text = ""
25
+
26
+ html_doc = parse_html(html_text)
27
+ if html_doc
28
+ text = clean_html(html_doc)
29
+ else
30
+ p "match_html: failed to parse html document\n#{html_text}"
31
+ end
32
+
33
+ return text
34
+ end
35
+
36
+ def clean_html(html_doc)
37
+ body_text = ""
38
+ body_elements = html_doc.xpath(
39
+ '//p | //h1 | //h2 | //h3 | //h4 | //h5 | //h6 | //em | //strong | //b | //td | //pre
40
+ | //li[not(@id) and not(@class) and not(a)] | //section//section[@class="project-info"]
41
+ | //blockquote | //textarea'
42
+ ).to_a
43
+
44
+ #extract text from html tag and separate them by space
45
+ body_elements.each {|el| body_text += ' ' + el.text.to_s}
46
+
47
+ #REMOVE XML CDATA like opensource.org pages has
48
+ body_text = body_text.to_s.strip
49
+ body_text.gsub!(/\<\!\[CDATA.+?\]\]\>/i, ' ')
50
+
51
+ if body_text.empty?
52
+ p "match_html: document didnt pass noise filter, will use whole body content"
53
+ body_text = html_doc.xpath('//body').text.to_s.strip
54
+ end
55
+
56
+ return body_text
57
+ end
58
+
59
+ def parse_html(html_text)
60
+ begin
61
+ return Nokogiri.HTML(safe_encode(html_text))
62
+ rescue Exception => e
63
+ log.error "failed to parse html doc: \n #{html_text}"
64
+ return nil
65
+ end
66
+ end
67
+
68
+ def safe_encode(txt)
69
+ txt.to_s.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
70
+ rescue
71
+ p "Failed to encode text:\n #{txt}i"
72
+ return ""
73
+ end
74
+
75
+ end
@@ -0,0 +1,601 @@
1
+ require 'json'
2
+
3
+ module LicenseMatcher
4
+
5
+ class RuleMatcher
6
+ include Preprocess
7
+
8
+ attr_reader :licenses, :rules, :id_spdx_idx
9
+
10
+ DEFAULT_LICENSE_JSON = 'data/spdx_licenses/licenses.json'
11
+
12
+
13
+ def initialize(license_json_file = DEFAULT_LICENSE_JSON)
14
+
15
+ licenses_json_doc = read_json_file license_json_file
16
+ raise("Failed to read licenses.json") if licenses_json_doc.nil?
17
+
18
+ @rules = init_rules(licenses_json_doc)
19
+ @id_spdx_idx = init_id_idx(licenses_json_doc) #reverse index from downcased licenseID to case sensitive spdx id
20
+ true
21
+
22
+ end
23
+
24
+ def init_id_idx(licenses_json_doc)
25
+ idx = {}
26
+ licenses_json_doc.to_a.each do |spdx_item|
27
+ lic_id = spdx_item[:id].to_s.downcase
28
+ idx[lic_id] = spdx_item[:id]
29
+ end
30
+
31
+ idx
32
+ end
33
+
34
+ # finds matching regex rules in the text and sorts matches by length of match
35
+ # ps: not very efficient, but good enough to handle special cases;
36
+ # @args:
37
+ # text - string, a name of license,
38
+ # @returns:
39
+ # [[spdx_id, score, matching_rule, matching_length],...]
40
+ def match_rules(text, early_exit = false)
41
+ matches = []
42
+ text = preprocess_text(text)
43
+
44
+ #if text is already spdx_id, then shortcut matching
45
+ if @rules.has_key?(text.downcase)
46
+ return [[text.downcase, 1.0]]
47
+ end
48
+
49
+ text += ' ' # required to make wordborder matcher to work with 1word texts
50
+ @rules.each do |spdx_id, rules|
51
+ match_res = matches_any_rule?(rules, text)
52
+ unless match_res.nil?
53
+ matches << ([spdx_id, 1.0] + match_res)
54
+ break if early_exit == true
55
+ end
56
+ end
57
+
58
+ matches.sort do |a, b|
59
+ if (a.size == b.size and a.size == 4)
60
+ -1 * (a[3] <=> b[3])
61
+ else
62
+ 0
63
+ end
64
+ end
65
+ end
66
+
67
+ # if testable license text is in the ignore set return true
68
+ def ignore?(lic_text)
69
+ ignore_rules = get_ignore_rules
70
+ m = matches_any_rule?(ignore_rules, lic_text.to_s)
71
+ not m.nil?
72
+ end
73
+
74
+ def matches_any_rule?(rules, license_name)
75
+ res = nil
76
+ rules.each do |rule|
77
+ m = rule.match(license_name.to_s)
78
+ if m
79
+ res = [rule, m[0].size]
80
+ break
81
+ end
82
+ end
83
+
84
+ res
85
+ end
86
+
87
+
88
+ #-- helpers
89
+
90
+ def read_json_file(file_path)
91
+ JSON.parse(File.read(file_path), {symbolize_names: true})
92
+ rescue
93
+ log.info "Failed to read json file `#{file_path}`"
94
+ nil
95
+ end
96
+
97
+ # combines SPDX rules with custom handwritten rules
98
+ def init_rules(license_json_doc)
99
+ rules = {}
100
+ rules = build_rules_from_spdx_json(license_json_doc)
101
+
102
+ get_custom_rules.each do |spdx_id, custom_rules_array|
103
+ spdx_id = spdx_id.to_s.strip.downcase
104
+
105
+ if rules.has_key?(spdx_id)
106
+ rules[spdx_id].concat custom_rules_array
107
+ else
108
+ rules[spdx_id] = custom_rules_array
109
+ end
110
+ end
111
+
112
+ rules
113
+ end
114
+
115
+
116
+ # builds regex rules based on the LicenseJSON file
117
+ # rules are using urls, IDS, names and alternative names to build full string matching regexes
118
+ def build_rules_from_spdx_json(spdx_json)
119
+ spdx_rules = {}
120
+
121
+ sorted_spdx_json = spdx_json.sort_by {|x| x[:id]}
122
+ sorted_spdx_json.each do |spdx_item|
123
+ spdx_id = spdx_item[:id].to_s.downcase.strip
124
+ spdx_rules[spdx_id] = build_spdx_item_rules(spdx_item)
125
+ end
126
+
127
+ spdx_rules
128
+ end
129
+
130
+ def build_spdx_item_rules(spdx_item)
131
+ rules = []
132
+
133
+ #links are first, because it has highest confidence that they are talking about the license
134
+ spdx_item[:links].to_a.each do |link|
135
+ lic_url = link[:url].to_s.strip.gsub(/https?:\/\//i, '').gsub(/www\./, '').gsub(/\./, '\\.') #normalizes urls in the file
136
+
137
+ rules << Regexp.new("\\b[\\(]?https?:\/\/(www\.)?#{lic_url}[\/]?[\\)]?\\b".gsub(/\s+/, ''), Regexp::IGNORECASE)
138
+ end
139
+
140
+ #include also links to license texts
141
+ spdx_item[:text].to_a.each do |link|
142
+ lic_url = link[:url].to_s.strip.gsub(/https?:\/\//i, '').gsub(/www\./, '').gsub(/\./, '\\.') #normalizes urls in the file
143
+ rules << Regexp.new("\\b[\\(]?https?:\/\/(www\.)?#{lic_url}[\/]?[\\)]?\\b".gsub(/\s+/, ''), Regexp::IGNORECASE)
144
+ end
145
+
146
+
147
+ spdx_name = preprocess_text(spdx_item[:name])
148
+ spdx_name.gsub!(/\(.+?\)/, '') #remove SPDX ids in the license names
149
+ spdx_name.gsub!(/\./, '\\.') #mark version dots as not regex selector
150
+ spdx_name.gsub!(/[\*|\?|\+]/, '.') #replace regex selector with whatever mark ~> WTFPL name
151
+ spdx_name.gsub!(/\,/, '[\\,]?') #make comma optional
152
+ spdx_name.strip!
153
+ spdx_name.gsub!(/\s+/, '\\s\\+') #replace spaces with space selector
154
+
155
+ rules << Regexp.new("\\b#{spdx_name}\\b", Regexp::IGNORECASE)
156
+
157
+ #use spdx_id in full-text match if it's uniq and doest ambiquity like MIT, Fair, Glide
158
+ spdx_id = spdx_item[:id].to_s.strip.downcase
159
+ if spdx_id.match /[\d|-]|ware\z/
160
+ rules << Regexp.new("\\b[\\(]?#{spdx_id}[\\)]?\\b".gsub(/\s+/, '\\s').gsub(/\./, '\\.'), Regexp::IGNORECASE)
161
+ else
162
+ rules << Regexp.new("\\A[\\(]?#{spdx_id}[\\)]?\\b", Regexp::IGNORECASE)
163
+ end
164
+
165
+ spdx_item[:identifiers].to_a.each do |id|
166
+ rules << Regexp.new("\\b#{id[:identifier]}\\s".gsub(/\s+/, '\\s').gsub(/\./, '\\.'), Regexp::IGNORECASE)
167
+ end
168
+
169
+ spdx_item[:other_names].to_a.each do |alt|
170
+ rules << Regexp.new("\\b#{alt[:name]}\\b".gsub(/\s+/, '\\s'), Regexp::IGNORECASE)
171
+ end
172
+
173
+ rules
174
+ end
175
+
176
+
177
+ def get_ignore_rules
178
+ [
179
+ /\bProprietary\b/i, /\bOther\/Proprietary\b/i, /\ALICEN[C|S]E\.\w{2,8}\b/i,
180
+ /^LICEN[C|S]ING\.\w{2,8}\b/i, /^COPYING\.\w{2,8}/i,
181
+ /\ADFSG\s+APPROVED\b/i, /\ASee\slicense\sin\spackage\b/i,
182
+ /\ASee LICENSE\b/i,
183
+ /\AFree\s+for\s+non[-]?commercial\b/i, /\AFree\s+To\s+Use\b/i,
184
+ /\AFree\sFor\sHome\sUse\b/i, /\AFree\s+For\s+Educational\b/i,
185
+ /^Freely\s+Distributable\s*$/i, /^COPYRIGHT\s+\d{2,4}/i,
186
+ /^Copyright\s+\(c\)\s+\d{2,4}\b/i, /^COPYRIGHT\s*$/i, /^COPYRIGHT\.\w{2,8}\b/i,
187
+ /^\(c\)\s+\d{2,4}\d/,
188
+ /^LICENSE\s*$/i, /^FREE\s*$/i, /\ASee\sLicense\s*\b/i, /^TODO\s*$/i, /^FREEWARE\s*$/i,
189
+ /^All\srights\sreserved\s*$/i, /^COPYING\s*$/i, /^OTHER\s*$/i, /^NONE\s*$/i, /^DUAL\s*$/i,
190
+ /^KEEP\s+IT\s+REAL\s*\b/i, /\ABE\s+REAL\s*\z/i, /\APrivate\s*\z/i, /\ACommercial\s*\z/i,
191
+ /\ASee\s+LICENSE\s+file\b/i, /\ASee\sthe\sLICENSE\b/i, /\ALICEN[C|S]E\s*\z/i,
192
+ /^PUBLIC\s*$/i, /^see file LICENSE\s*$/i, /^__license__\s*$/i,
193
+ /\bLIEULA\b/i, /\AEULA\s*\z/i, /^qQuickLicen[c|s]e\b/i, /^For\sfun\b/i, /\AVarious\s*\z/i,
194
+ /^GNU\s*$/i, /^GNU[-|\s]?v3\s*$/i, /^OSI\s+Approved\s*$/i, /^OSI\s*$/i,
195
+ /\AOPEN\sSOURCE\sLICENSE\s?\z/i, /\AOPEN\s*\z/i, /\Aunknown\s*\z/i,
196
+ /^https?:\/\/github.com/i, /^https?:\/\/gitlab\.com/i
197
+ ]
198
+ end
199
+
200
+
201
+ def get_custom_rules
202
+ {
203
+ "AAL" => [/\bAAL\b/i, /\bAAL\s+License\b/i,
204
+ /\bAttribution\s+Assurance\s+License\b/i],
205
+ "AFL-1.1" => [/\bAFL[-|v]?1[^\.]\b/i, /\bafl[-|v]?1\.1\b/i],
206
+ "AFL-1.2" => [/\bAFL[-|v]?1\.2\b/i],
207
+ "AFL-2.0" => [/\bAFL[-|v]?2[^\.]\b/i, /\bAFL[-|v]?2\.0\b/i],
208
+ "AFL-2.1" => [/\bAFL[-|v]?2\.1\b/i],
209
+ "AFL-3.0" => [
210
+ /\bAFL[-|\s|\_]?v?3\.0\b/i, /\bAFL[-|\s|\_]?v?3/i,
211
+ /\AAcademic\s+Free\s+License\s*\z/i, /^AFL\s?\z/i,
212
+ /\bhttps?:\/\/opensource\.org\/licenses\/academic\.php\b/i,
213
+ /\AAcademic[-|\s]Free[-|\s]License[-]?\s*\z/i,
214
+ /\bAcademic.Free.License.\(AFL\)/i
215
+ ],
216
+ "AGPL-1.0" => [
217
+ /\bAGPL[-|v|_|\s]?1\.0\b/i,
218
+ /\bAGPL[-|v|_|\s]1\b/i , /\bAGPL[_|-|v]?2\b/i,
219
+ /\bAffero\s+General\s+Public\s+License\s+[v]?1\b/i,
220
+ /\bAGPL\s(?!(v|\d))/i #Matches only AGPL, but not AGPL v1, AGPL 1.0 etc
221
+ ],
222
+ "AGPL-3.0" => [
223
+ /\bAGPL[-|_|\s]?v?3\.0\b/i, /\bAGPL[-|\s|_]?v?3[\+]?\b/i,
224
+ /\bAPGLv?3[\+]?\b/i, #some packages has typos
225
+ /\bGNU\s+Affero\s+General\s+Public\s+License\s+[v]?3/i,
226
+ /\bAFFERO\sGNU\sPUBLIC\sLICENSE\sv3\b/i,
227
+ /\bGnu\sAffero\sPublic\sLicense\sv3+?\b/i,
228
+ /\bAffero\sGeneral\sPublic\sLicen[s|c]e[\,]?\sversion\s+3[\.0]?\b/i,
229
+ /\bAffero\sGeneral\sPublic\sLicense\sv?3\b/i,
230
+ /\bAGPL\sversion\s3[\.]?\b/i,
231
+ /\bGNU\sAGPL\sv?3[\.0]?\b/i,
232
+ /\bGNU\sAFFERO\sv?3\b/i,
233
+ /\bhttps?:\/\/gnu\.org\/licenses\/agpl\.html\b/i,
234
+ /\AAFFERO\sGENERAL\sPUBLIC\sLICENSE\s*\z/i,
235
+ /^AFFERO\s*\z/i
236
+ ],
237
+ "Aladdin" => [/\b[\(]?AFPL[\)]?\b/i, /\bAladdin\sFree\sPublic\sLicense\b/i],
238
+ "Amazon" => [/\bAmazon\sSoftware\sLicense\b/i],
239
+ "Apache-1.0" => [/\bAPACHE[-|_|\s]?v?1[^\.]/i, /\bAPACHE[-|\s]?v?1\.0\b/i],
240
+ "Apache-1.1" => [/\bAPACHE[-|_|\s]?v?1\.1\b/i],
241
+ "Apache-2.0" => [
242
+ /\bAPACHE\s+2\.0\b/i, /\bAPACHE[-|_|\s]?v?2\b/i,
243
+ /\bApache\sOpen\sSource\sLicense\s2\.0\b/i,
244
+ /\bAPACH[A|E]\s+Licen[c|s]e\s+[\(]?v?2\.0[\)]?\b/i,
245
+ /\bAPACHE\s+LICENSE\,?\s+VERSION\s+2\.0\b/i,
246
+ /\bApache\s+License\s+v?2\b/i,
247
+ /\bApache\s+Software\sLicense\b/i,
248
+ /\bApapche[-|\s|\_]?v?2\.0\b/i, /\bAL[-|\s|\_]2\.0\b/i,
249
+ /\bAPL\s+2\.0\b/i, /\bAPL[\.|-|v]?2\b/i, /\bASL\s+2\.0\b/i,
250
+ /\bASL[-|v|\s]?2\b/i, /\bALv2\b/i, /\bASF[-|\s]?2\.0\b/i,
251
+ /\AAPACHE\s*\z/i, /\AASL\s*\z/i, /\bASL\s+v?\.2\.0\b/i, /\AASF\s*\z/i,
252
+ /\AApache\s+license\s*\z/i,
253
+ ],
254
+ "APL-1.0" => [/\bapl[-|_|\s]?v?1\b/i, /\bAPL[-|_|\s]?v?1\.0\b/i, /^APL$/i],
255
+ "APSL-1.0" => [/\bAPSL[-|_|\s]?v?1\.0\b/i, /\bAPSL[-|_|\s]?v?1(?!\.)\b/i, /\AAPPLE\s+PUBLIC\s+SOURCE\s*\z/i],
256
+ "APSL-1.1" => [/\bAPSL[-|_|\s]?v?1\.1\b/i],
257
+ "APSL-1.2" => [/\bAPSL[-|_|\s]?v?1\.2\b/i],
258
+ "APSL-2.0" => [/\bAPSL[-|_|\s]?v?2\.0\b/i, /\bAPSL[-|_|\s]?v?2\b/i],
259
+
260
+ "Artistic-1.0-Perl" => [/\bArtistic[-|_|\s]?v?1\.0\-Perl\b/i, /\bPerlArtistic\b/i],
261
+ "Artistic-1.0" => [/\bartistic[-|_|\s]?v?1\.0(?!\-)\b/i, /\bartistic[-|_|\s]?v?1(?!\.)\b/i],
262
+ "Artistic-2.0" => [/\bARTISTIC[-|_|\s]?v?2\.0\b/i, /\bartistic[-|_|\s]?v?2\b/i,
263
+ /\bArtistic.2.0\b/i,
264
+ /\bARTISTIC\s+LICENSE\b/i, /\AARTISTIC\s*\z/i],
265
+ "Beerware" => [
266
+ /\bBEERWARE\b/i, /\bBEER\s+LICEN[c|s]E\b/i,
267
+ /\bBEER[-|\s]WARE\b/i, /^BEER\b/i,
268
+ /\bBuy\ssnare\sa\sbeer\b/i,
269
+ /\bFree\sas\sin\sbeer\b/i
270
+ ],
271
+ 'BitTorrent-1.1' => [/\bBitTorrent\sOpen\sSource\sLicense\b/i],
272
+ "0BSD" => [/\A0BSD\s*\z/i],
273
+ "BSD-2-CLAUSE" => [
274
+ /\bBSD[-|_|\s]?v?2\b/i, /^FREEBSD\b/i, /^OPENBSD\b/i,
275
+ /\bBSDLv2\b/i
276
+ ],
277
+ "BSD-3-CLAUSE" => [/\bBSD[-|_|\s]?v?3\b/i, /\bBSD[-|\s]3[-\s]CLAUSE\b/i,
278
+ /\bBDS[-|_|\s]3[-|\s]CLAUSE\b/i,
279
+ /\bthree-clause\sBSD\slicen[s|c]e\b/i,
280
+ /\ABDS\s*\z/i, /^various\/BSDish\s*$/],
281
+ "BSD-4-CLAUSE" => [
282
+ /\bBSD[-|_|\s]?v?4/i, /\ABSD\s*\z/i, /\ABSD\s+LI[s|c]EN[S|C]E\s*\z/i,
283
+ /\bBSD-4-CLAUSE\b/i,
284
+ /\bhttps?:\/\/en\.wikipedia\.org\/wiki\/BSD_licenses\b/i
285
+ ],
286
+ "BSL-1.0" => [
287
+ /\bBSL[-|_|\s]?v?1\.0\b/i, /\bbsl[-|_|\s]?v?1\b/i, /^BOOST\b/i,
288
+ /\bBOOST\s+SOFTWARE\s+LICENSE\b/i,
289
+ /\bBoost\sLicense\s1\.0\b/i
290
+ ],
291
+ "CC0-1.0" => [
292
+ /\bCC0[-|_|\s]?v?1\.0\b/i, /\bCC0[-|_|\s]?v?1\b/i,
293
+ /\bCC[-|\s]?[0|o]\b/i, /\bCreative\s+Commons\s+0\b/i,
294
+ /\bhttps?:\/\/creativecommons\.org\/publicdomain\/zero\/1\.0[\/]?\b/i,
295
+ /\bcc[-|\_]zero\b/i
296
+ ],
297
+ "CC-BY-1.0" => [/\bCC.BY.v?1\.0\b/i, /\bCC.BY.v?1\b/i, /^CC[-|_|\s]?BY$/i],
298
+ "CC-BY-2.0" => [/\bCC.BY.v?2\.0\b/i, /\bCC.BY.v?2(?!\.)\b/i],
299
+ "CC-BY-2.5" => [/\bCC.BY.v?2\.5\b/i],
300
+ "CC-BY-3.0" => [
301
+ /\bCC.BY.v?3\.0\b/i, /\b[\(]?CC.BY[\)]?.v?3\b/i,
302
+ /\bCreative\sCommons\sBY\s3\.0\b/i,
303
+ /\bhttps?:\/\/.+\/licenses\/by\/3\.0[\/]?/i
304
+ ],
305
+ "CC-BY-4.0" => [
306
+ /^CC[-|\s]?BY[-|\s]?v?4\.0$/i, /\bCC.BY.v?4\b/i, /\bCC.BY.4\.0\b/i,
307
+ /\bCREATIVE\s+COMMONS\s+ATTRIBUTION\s+[v]?4\.0\b/i,
308
+ /\ACREATIVE\s+COMMONS\s+ATTRIBUTION\s*\z\b/i
309
+ ],
310
+ "CC-BY-SA-1.0" => [/\bCC[-|\s]BY.SA.v?1\.0\b/i, /\bCC[-|\s]BY.SA.v?1\b/i],
311
+ "CC-BY-SA-2.0" => [/\bCC[-|\s]BY.SA.v?2\.0\b/i, /\bCC[-|\s]BY.SA.v?2(?!\.)\b/i],
312
+ "CC-BY-SA-2.5" => [/\bCC[-|\s]BY.SA.v?2\.5\b/i],
313
+ "CC-BY-SA-3.0" => [/\bCC[-|\s]BY.SA.v?3\.0\b/i, /\bCC[-|\s]BY.SA.v?3\b/i,
314
+ /\bCC3\.0[-|_|\s]BY.SA\b/i,
315
+ /\bhttps?:\/\/(www\.)?.+\/by.sa\/3\.0[\/]?/i],
316
+ "CC-BY-SA-4.0" => [
317
+ /CC[-|\s]BY.SA.v?4\.0$/i, /\bCC[-|\s]BY.SA.v?4\b/i,
318
+ /CCSA-4\.0/i
319
+ ],
320
+ "CC-BY-NC-1.0" => [/\bCC[-|\s]BY.NC[-|\s]?v?1\.0\b/i, /\bCC[-|\s]BY.NC[-|\s]?v?1\b/i],
321
+ "CC-BY-NC-2.0" => [/\bCC[-|\s]BY.NC[-|\s]?v?2\.0\b/i],
322
+ "CC-BY-NC-2.5" => [/\bCC[-|\s]BY.NC[-|\s]?v?2\.5\b/i],
323
+ "CC-BY-NC-3.0" => [/\bCC[-|\s]BY.NC[-|\s]?v?3\.0\b/i, /\bCC.BY.NC[-|\s]?v?3\b/i,
324
+ /\bCreative\s+Commons\s+Non[-]?Commercial[,]?\s+3\.0\b/i],
325
+ "CC-BY-NC-4.0" => [
326
+ /\bCC[-|\s]BY.NC[-|\s|_]?v?4\.0\b/i, /\bCC.BY.NC[-|\s|_]?v?4\b/i,
327
+ /\bhttps?:\/\/creativecommons\.org\/licenses\/by-nc\/3\.0[\/]?\b/i
328
+ ],
329
+ "CC-BY-NC-SA-1.0" => [ /\bCC[-|\s+]BY.NC.SA[-|\s+]v?1\.0\b/i,
330
+ /\bCC[-|\s+]BY.NC.SA[-|\s+]v?1\b/i
331
+ ],
332
+ "CC-BY-NC-SA-2.0" => [/\bCC[-|\s]?BY.NC.SA[-|\s]?v?2\.0\b/i],
333
+ "CC-BY-NC-SA-2.5" => [/\bCC[-|\s]?BY.NC.SA[-|\s]?v?2\.5\b/i],
334
+ "CC-BY-NC-SA-3.0" => [
335
+ /\bCC[-|\s]?BY.NC.SA[-|\s]?v?3\.0\b/i,
336
+ /\bCC[-|\s]?BY.NC.SA[-|\s]?v?3(?!\.)\b/i,
337
+ /\bBY[-|\s]NC[-|\s]SA\sv?3\.0\b/i,
338
+ /\bhttp:\/\/creativecommons.org\/licenses\/by-nc-sa\/3.0\/us[\/]?\b/i
339
+ ],
340
+ "CC-BY-NC-SA-4.0" => [/\bCC[-|\s]?BY.NC.SA[-|\s]?v?4\.0\b/i,
341
+ /\bCC[-|_|\s]BY.NC.SA[-|\s]?v?4(?!\.)\b/i,
342
+ /\bBY.NC.SA[-|\s|\_]v?4\.0\b/i],
343
+
344
+ "CC-BY-ND-1.0" => [/\bCC[-|\s]BY.ND[-|\s]?v?1\.0\b/i],
345
+ "CC-BY-ND-2.0" => [/\bCC[-|\s]BY.ND[-|\s]?v?2\.0\b/i],
346
+ "CC-BY-ND-2.5" => [/\bCC[-|\s]BY.ND[-|\s]?v?2\.5\b/i],
347
+ "CC-BY-ND-3.0" => [/\bCC[-|\s]BY.ND[-|\s]?v?3\.0\b/i],
348
+ "CC-BY-ND-4.0" => [
349
+ /\bCC[-|\s]BY.ND[-|\s]?v?4\.0\b/i,
350
+ /\bCC\sBY.NC.ND\s4\.0/i
351
+ ],
352
+
353
+ "CC-BY-NC-ND-3.0" => [/\bCC.BY.NC.ND.3\.0\b/i],
354
+ "CC-BY-NC-ND-4.0" => [/\bCC.BY.NC.ND.4\.0\b/i],
355
+ "CDDL-1.0" => [/\bCDDL[-|_|\s]?v?1\.0\b/i, /\bCDDL[-|_|\s]?v?1\b/i, /^CDDL$/i,
356
+ /\bCDDL\s+LICEN[C|S]E\b/i,
357
+ /\bCOMMON\sDEVELOPMENT\sAND\sDISTRIBUTION\sLICENSE\b/i
358
+ ],
359
+ "CECILL-B" => [/\bCECILL[-|_|\s]?B\b/i],
360
+ "CECILL-C" => [/\bCECILL[-|_|\s]?C\b/i],
361
+ "CECILL-1.0" => [
362
+ /\bCECILL[-|\s|_]?v?1\.0\b/i, /\bCECILL[-|\s|_]?v?1\b/i,
363
+ /\ACECILL\s?\z/i, /\bCECILL\s+v?1\.2\b/i,
364
+ /^http:\/\/www\.cecill\.info\/licences\/Licence_CeCILL-C_V1-en.html$/i,
365
+ /\bhttp:\/\/www\.cecill\.info\b/i
366
+ ],
367
+ "CECILL-2.1" => [
368
+ /\bCECILL[-|_|\s]?2\.1\b/i, /\bCECILL[\s|_|-]?v?2\b/i,
369
+ /\bCECILL\sVERSION\s2\.1\b/i
370
+ ],
371
+ "CPL-1.0" => [
372
+ /\bCPL[-|\s|_]?v?1\.0\b/i, /\bCPL[-|\s|_]?v?1\b/i,
373
+ /\bCommon\s+Public\s+License\b/i, /\ACPL\s*\z/i
374
+ ],
375
+ "CPAL-1.0" => [
376
+ /\bCommon\sPublic\sAttribution\sLicense\s1\.0\b/i,
377
+ /[\(]?\bCPAL\b[\)]?/i
378
+ ],
379
+ "CUSTOM" => [ /\bCUSTOM\s+LICENSE\b/i ],
380
+ "DBAD" => [
381
+ /\bDONT\sBE\sA\sDICK\b/i, /\ADBAD\s*\z/i,
382
+ /\bdbad[-|\s|\_]license\b/i, /\ADBAD-1\s*\z/i,
383
+ /\ADBAP\b/i,
384
+ /\bhttps?:\/\/www\.dbad-license\.org[\/]?\b/i
385
+ ],
386
+ "D-FSL-1.0" => [
387
+ /\bD-?FSL[-|_|\s]?v?1\.0\b/i, /\bD-?FSL[-|\s|_]?v?1\b/,
388
+ /\bGerman\sFREE\sSOFTWARE\b/i,
389
+ /\bDeutsche\sFreie\sSoftware\sLizenz\b/i
390
+ ],
391
+ "ECL-1.0" => [ /\bECL[-|\s|_]?v?1\.0\b/i, /\bECL[-|\s|_]?v?1\b/i ],
392
+ "ECL-2.0" => [
393
+ /\bECL[-|\s|_]?v?2\.0\b/i, /\bECL[-|\s|_]?v?2\b/i,
394
+ /\bEDUCATIONAL\s+COMMUNITY\s+LICENSE[,]?\sVERSION\s2\.0\b/i
395
+ ],
396
+ "EFL-1.0" => [/\bEFL[-|\s|_]?v?1\.0\b/i, /\bEFL[-|\s|_]?v?1\b/i ],
397
+ "EFL-2.0" => [
398
+ /\bEFL[-|\s|_]?v?2\.0\b/i, /\bEFL[-|\s|_]?v?2\b/i,
399
+ /\bEiffel\sForum\sLicense,?\sversion\s2/i,
400
+ /\bEiffel\sForum\sLicense\s2(?!\.)\b/i,
401
+ /\bEiffel\sForum\sLicense\b/i
402
+ ],
403
+ "EPL-1.0" => [
404
+ /\bEPL[-|\s|_]?v?1\.0\b/i, /\bEPL[-|\s|_]?v?1\b/i,
405
+ /\bECLIPSE\s+PUBLIC\s+LICENSE\s+[v]?1\.0\b/i,
406
+ /\bECLIPSE\s+PUBLIC\s+LICENSE\b/i,
407
+ /^ECLIPSE$/i, /\AEPL\s*\z/
408
+ ],
409
+ "ESA-1.0" => [
410
+ /\bESCL\s+[-|_]?\sType\s?1\b/,
411
+ /\bESA\sSOFTWARE\sCommunity\sLICENSE.+TYPE\s?1\b/i
412
+ ],
413
+ "EUPL-1.0" => [/\b[\(]?EUPL[-|\s]?v?1\.0[\)]?\b/i],
414
+ "EUPL-1.1" => [
415
+ /\b[\(]?EUPL[-|\s]?v?1\.1[\)]?\b/i,
416
+ /\bEUROPEAN\s+UNION\s+PUBLIC\s+LICENSE\s+1\.1\b/i,
417
+ /\bEuropean\sUnion\sPublic\sLicense\b/i,
418
+ /\bEUPL\s+V?\.?1\.1\b/i, /\AEUPL\s*\z/i
419
+ ],
420
+ "Fair" => [ /\bFAIR\s+LICENSE\b/i, /\AFair\s*\z/i],
421
+ "FreeType" => [ /\bFreeType\s+LICENSE\b/i],
422
+ "GFDL-1.0" => [
423
+ /\bGNU\sFree\sDocumentation\sLicense\b/i,
424
+ /\b[\(]?FDL[\)]?\b/
425
+ ],
426
+ "GPL-1.0" => [
427
+ /\bGPL[-|\s|_]?v?1\.0\b/i, /\bGPL[-|\s|_]?v?1\b/i,
428
+ /\bGNU\sPUBLIC\sLICEN[S|C]E\sv?1\b/i
429
+ ],
430
+ "GPL-2.0" => [
431
+ /\bGPL[-|\s|_]?v?2\.0/i, /\bGPL[-|\s|_]?v?2\b/i, /\bGPL\s+[v]?2\b/i,
432
+ /\bGNU\s+PUBLIC\s+LICENSE\s+v?2\.0\b/i,
433
+ /\bGNU\s+PUBLIC\s+License\sV?2\b/i,
434
+ /\bGNU\spublic\slicense\sversion\s2\b/i,
435
+ /\bGNU\sGeneral\sPublic\sLicense\sv?2\.0\b/i,
436
+ /\bGNU\sPublic\sLicense\s>=2\b/i,
437
+ /\bGNU\s+GPL\s+v2\b/i, /^GNUv?2\b/i, /^GLPv2\b/,
438
+ /\bWhatever\slicense\sPlone\sis\b/i
439
+ ],
440
+ "GPL-3.0" => [
441
+ /\bGNU\s+GENERAL\s+PUBLIC\s+License\s+[v]?3\b/i,
442
+ /\bGNU\s+General\s+Public\s+License[\,]?\sVersion\s3[\.0]?\b/i,
443
+ /\bGNU\sPublic\sLicense\sv?3\.0\b/i,
444
+ /\bGNU\s+PUBLIC\s+LICENSE\s+v?3\b/i,
445
+ /\bGnu\sPublic\sLicense\sversion\s3\b/i,
446
+ /\bGNU\sGeneral\sPublic\sLicense\sversion\s?3\b/i,
447
+ /\bGPL[-|\s|_]?v?3\.0\b/i, /\bGPL[-|\s|_]?v?[\.]?3\b/i, /\bGPL\s+3\b/i,
448
+ /\bGNU\s+PUBLIC\s+v3\+?\b/i,
449
+ /\bGNUGPL[-|\s|\_]?v?3\b/i, /\bGNU\s+PL\s+[v]?3\b/i,
450
+ /\bGLPv3\b/i, /\bGNU3\b/i, /GPvL3/i, /\bGNU\sGLP\sv?3\b/i,
451
+ /\AGNU\sGENERAL\sPUBLIC\sLICENSE\s*\z/i, /\A[\(]?GPL[\)]?\s*\z/i
452
+ ],
453
+
454
+ "IDPL-1.0" => [
455
+ /\bIDPL[-|\s|\_]?v?1\.0\b/,
456
+ /\bhttps?:\/\/www\.firebirdsql\.org\/index\.php\?op=doc\&id=idpl\b/i
457
+ ],
458
+ "IPL-1.0" => [/\bIBM\sOpen\sSource\sLicense\b/i, /\bIBM\sPublic\sLicen[s|c]e\b/i],
459
+ "ISC" => [/\bISC\s+LICENSE\b/i, /\b[\(]?ISCL[\)]?\b/i, /\bISC\b/i,
460
+ /\AICS\s*\z/i],
461
+ "JSON" => [/\bJSON\s+LICENSE\b/i],
462
+ "KINDLY" => [/\bKINDLY\s+License\b/i],
463
+ "LGPL-2.0" => [
464
+ /\bLGPL[-|\s|_]?v?2\.0\b/i, /\bLGPL[-|\s|_]?v?2(?!\.)\b/i,
465
+ /\bLesser\sGeneral\sPublic\sLicense\sv?2(?!\.)\b/i,
466
+ /\bLPGL[-|\s|\_]?v?2(?!\.)\b/i
467
+ ],
468
+ "LGPL-2.1" => [
469
+ /\bLGPL[-|\s|_]?v?2\.1\b/i,
470
+ /\bLesser\sGeneral\sPublic\sLicense\s+\(LGPL\)\s+Version\s+2\.1\b/i,
471
+ /\bLESSER\sGENERAL\sPUBLIC\sLICENSE[\,]?\sVersion\s2\.1[\,]?\b/i,
472
+ /\bLESSER\sGENERAL\sPUBLIC\sLICENSE[\,]?\sv?2\.1\b/i
473
+ ],
474
+ "LGPL-3.0" => [/\bLGPL[-|\s|_]?v?3\.0\b/i, /\bLGPL[-|\s|_]?v?3[\+]?\b/i,
475
+ /\bLGLP[\s|-|v]?3\.0\b/i, /^LPLv3\s*$/, /\bLPGL[-|\s|_]?v?3[\+]?\b/i,
476
+ /\bLESSER\s+GENERAL\s+PUBLIC\s+License\s+[v]?3\b/i,
477
+ /\bLesser\sGeneral\sPublic\sLicense\sv?\.?\s+3\.0\b/i,
478
+ /\bhttps?:\/\/www\.gnu\.org\/copyleft\/lesser.html\b/i,
479
+ /\bLESSER\sGENERAL\sPUBLIC\sLICENSE\sVersion\s3\b/i,
480
+ /\bLesser\sGeneral\sPublic\sLicense[\,]?\sversion\s3\.0\b/i,
481
+ /\bLESSER\sGENERAL\sPUBLIC\sLICENSE.+?version\s?3/i,
482
+ /\A[\(]?LGPL[\)]?\s*\z/i
483
+ ],
484
+ "MirOS" => [/\bMirOS\b/i],
485
+ "MIT" => [
486
+ /\bMIT\s+LICEN[S|C]E\b/i, /\AMITL?\s*\z/i, /\bEXPAT\b/i,
487
+ /\bMIT[-|\_]LICENSE\.\w{2,8}\b/i, /^MTI\b/i,
488
+ /\bMIT[-|\s|\_]?v?2\.0\b/i, /\AM\.I\.T[\.]?\s*\z/,
489
+ /\bMassachusetts-Institute-of-Technology-License/i
490
+ ],
491
+ "MITNFA" => [/\bMIT\s\+no\-false\-attribs\slicense\b/i],
492
+ "MPL-1.0" => [
493
+ /\bMPL[-|\s|\_]?v?1\.0\b/i, /\bMPL[-|\s|\_]?v?1(?!\.)\b/i,
494
+ /\bMozilla\sPublic\sLicense\sv?1\.0\b/i,
495
+ ],
496
+ "MPL-1.1" => [
497
+ /\bMozilla.Public.License\s+v?1\.1\b/i,
498
+ /\bMPL[-|\s|\_]?v?1\.1\b/i,
499
+ ],
500
+ "MPL-2.0" => [
501
+ /\bMPL[-|\s|\_]?v?2\.0\b/i, /\bMPL[-|\s|\_]?v?2\b/i,
502
+ /\bMOZILLA\s+PUBLIC\s+LICENSE\s+2\.0\b/i,
503
+ /\bMozilla\sPublic\sLicense[\,]?\s+v?[\.]?\s*2\.0\b/i,
504
+ /\bMOZILLA\s+PUBLIC\s+LICENSE[,]?\s+version\s+2\.0\b/i,
505
+ /\bMozilla\s+v?2\.0\b/i,
506
+ /\b[\(]?MPL\s+2\.0[\)]?\b/, /\bMPL\b/i,
507
+ /\AMozilla\sPublic\sLicense\s*\z/i
508
+ ],
509
+ "MS-PL" => [/\bMS-?PL\b/i],
510
+ "MS-RL" => [/\bMS-?RL\b/i, /\bMSR\-LA\b/i],
511
+ "ms_dotnet" => [/\bMICROSOFT\sSOFTWARE\sLICENSE\sTERMS\b/i],
512
+ "NASA-1.3" => [/\bNASA[-|\_|\s]?v?1\.3\b/i,
513
+ /\bNASA\sOpen\sSource\sAgreement\sversion\s1\.3\b/i],
514
+ "NCSA" => [/\bNCSA\s+License\b/i, /\bIllinois\/NCSA\sOpen\sSource\b/i, /\bNCSA\b/i ],
515
+ "NGPL" => [/\bNGPL\b/i],
516
+ "NOKIA" => [/\bNokia\sOpen\sSource\sLicense\b/i],
517
+ "NPL-1.1" => [/\bNetscape\sPublic\sLicense\b/i, /\b[(]?NPL[\)]?\b/i],
518
+
519
+ "NPOSL-3.0" => [/\bNPOSL[-|\s|\_]?v?3\.0\b/i, /\bNPOSL[-|\s|\_]?v?3\b/],
520
+ "OFL-1.0" => [/\bOFL[-|\s|\_]?v?1\.0\b/i, /\bOFL[-|\s|\_]?v?1(?!\.)\b/i,
521
+ /\bSIL\s+OFL\s+1\.0\b/i, /\ASIL\sOFL\s*\z/i ],
522
+ "OFL-1.1" => [
523
+ /\bOFL[-|\s|\_]?v?1\.1\b/i, /\bSIL\s+OFL\s+1\.1\b/i,
524
+ /\bSIL\sOpen\sFont\sLicense\b/i, /\bSIL\sOFL\s1\.1\b/i,
525
+ /\bOpen\sFont\sLicense\b/i ],
526
+
527
+ "OSL-1.0" => [/\bOSL[-|\s|\_]?v?1\.0\b/i, /\b\OSL[-|\s|\_]?v?1(?!\.)\b/i],
528
+ "OSL-2.0" => [/\bOSL[-|\s|\_]?v?2\.0\b/i, /\bOSL[-|\s|\_]?v?2(?!\.)\b/i],
529
+ "OSL-2.1" => [/\bOSL[-|\s|\_]?v?2\.1\b/i],
530
+ "OSL-3.0" => [
531
+ /\bOSL[-|\s|\_]?v?3\.0\b/i, /\bOSL[-|\s|\_]?v?3(?!\.)\b/i,
532
+ /\bOpen\sSoftware\sLicen[c|s]e\sv?3\.0\b/i,
533
+ /\bOSL[-|\s|\_]?v?\.?3\.[0|O]\b/i,
534
+ /\bOpen\sSoftware\sLicense\sversion\s3\.0\b/i,
535
+ /\AOSL\s*\z/i, /\bOpen-Software-License/i,
536
+ /\b[\(]?OSL[\)]?\s+v\s+3\.0\b/i
537
+ ],
538
+
539
+ "PHP-3.0" => [/^PHP\s?\z/i, /\bPHP\sLicense\s3\.0\b/i, /\APHP[-|\s]LICEN[S|C]E\s*\z/i],
540
+ "PHP-3.01" => [/\bPHP\sLicense\sversion\s3\.0\d\b/i],
541
+ "PIL" => [/\bStandard\sPIL\sLicense\b/i, /\APIL\s*\z/i],
542
+ "PostgreSQL" => [/\bPostgreSQL\b/i],
543
+ "Public Domain" => [/\bPublic\s+Domain\b/i],
544
+ "Python-2.0" => [
545
+ /\bPython[-|\s|\_]?v?2\.0\b/i, /\bPython[-|\s|\_]?v?2(?!\.)\b/i,
546
+ /\bPSF[-|\s|\_]?v?2\b/i, /\bPSFL\b/i, /\bPSF\b/i,
547
+ /\bPython\s+Software\s+Foundation\b/i,
548
+ /\APython\b/i, /\bPSL\b/i, /\bSAME\sAS\spython2\.3\b/i,
549
+ /\bhttps?:\/\/www\.opensource\.org\/licenses\/PythonSoftFoundation\.php\b/i,
550
+ /\bhttps?:\/\/opensource\.org\/licenses\/PythonSoftFoundation\.php\b/i
551
+ ],
552
+ "Repoze" => [/\bRepoze\sPublic\sLicense\b/i],
553
+ "RPL-1.1" => [/\bRPL[-|\s|_]?v?1\.1\b/i, /\bRPL[-|\s|_]?v?1(?!\.)\b/i],
554
+ "RPL-1.5" => [
555
+ /\bRPL[-|\s|_]?v?1\.5\b/i, /\ARPL\s*\z/i,
556
+ /\bhttps?:\/\/www\.opensource\.org\/licenses\/rpl\.php\b/i
557
+ ],
558
+ "Ruby" => [/\bRUBY\sLICEN[S|C]E\b/i, /\ARUBY\b/i, /\bRUBY\'s\b/i],
559
+ "QPL-1.0" => [/\bQPL[-|\s|_]?v?1\.0\b/i,
560
+ /\bQT\sPublic\sLicen[c|s]e\b/i,
561
+ /\bPyQ\sGeneral\sLicense\b/i],
562
+ "Sleepycat" => [/\bSleepyCat\b/i],
563
+ "SPL-1.0" => [
564
+ /\bSPL[-|\_|\s]?v?1\.0\b/i, /\bSun\sPublic\sLicense\b/i
565
+ ],
566
+ "W3C" => [/\bW3C\b/i],
567
+ "OpenSSL" => [/\bOPENSSL\b/i],
568
+ "Unicode-TOU" => [/\AUnicode-TOU[\s|\/|-]/i],
569
+ "UPL-1.0" => [/\bUniversal\sPermissive\sLicense\b/i],
570
+ "Unlicense" => [
571
+ /\bUNLI[C|S]EN[S|C]E\b/i, /\AUnlicen[s|c]ed\s*\z/i, /^go\sfor\sit\b/i,
572
+ /^Undecided\b/i,
573
+ /\bNO\s+LICEN[C|S]E\b/i, /\bNON[\s|-|\_]?LICENSE\b/i
574
+ ],
575
+ "Whiskeyware" => [/\bWH?ISKEY[-|\s|\_]?WARE\b/i],
576
+ "WTFPL" => [
577
+ /\bWTF[P|G]?L\b/i, /\bWTFPL[-|v]?2\b/i, /^WTF\b/i, /\AWTFP\s*\z/i,
578
+ /\bDo\s+whatever\s+you\s+want\b/i, /\bDWTFYW\b/i, /\AWTPFL\s*\z/i,
579
+ /\bDo\s+What\s+the\s+Fuck\s+You\s+Want\b/i, /\ADWTFYWT\s*\z/i,
580
+ /\ADo\sWHATEVER\b/i, /\ADWYW\b/i, /\bDWTFYWTP\b/i,
581
+ /\ADWHTFYWTPL\s*\z/i, /\AWhatever\s*\z/i,
582
+ /\bDO\s(THE\s)?FUCK\sWHAT\sYOU\sWANT\b/i
583
+ ],
584
+ "WXwindows" => [/\bwxWINDOWS\s+LIBRARY\sLICEN[C|S]E\b/i, /\AWXwindows\s*\z/i],
585
+ "X11" => [/\bX11\b/i],
586
+ "Zend-2.0" => [/\bZend\sFramework\b/i],
587
+ "ZPL-1.1" => [/\bZPL[-|\s|\_]?v?1\.1\b/i, /\bZPL[-|\s|\_]?v?1(?!\.)\b/i,
588
+ /\bZPL[-|\s|\_]?1\.0\b/i],
589
+ "ZPL-2.1" => [
590
+ /\bZPL[-|\s|\/|_|]?v?2\.1\b/i, /\bZPL[-|\s|_]?v?2(?!\.)\b/i,
591
+ /\bZPL\s+2\.\d\b/i, /\bZOPE\s+PUBLIC\s+LICENSE\b/i,
592
+ /\bZPL\s?$/i
593
+ ],
594
+ "zlib-acknowledgement" => [/\bZLIB[\/|-|\s]LIBPNG\b/i],
595
+ "ZLIB" => [/\bZLIB(?!\-|\/)\b/i]
596
+
597
+ }
598
+ end
599
+
600
+ end
601
+ end
@@ -0,0 +1,100 @@
1
+ require 'narray'
2
+ require 'tf-idf-similarity'
3
+ require 'msgpack'
4
+
5
+ module LicenseMatcher
6
+
7
+ class TFRubyMatcher
8
+ include Preprocess
9
+
10
+ attr_reader :corpus, :model, :spdx_ids
11
+
12
+ DEFAULT_INDEX_PATH = 'data/index.msgpack'
13
+ DEFAULT_MIN_CONFIDENCE = 0.9
14
+ A_DOC_ROW = 3 # a array index to find the rows of indexed documents
15
+
16
+ def initialize(index_path = DEFAULT_INDEX_PATH)
17
+ spdx_ids, spdx_docs = read_corpus(index_path)
18
+
19
+ @spdx_ids = spdx_ids
20
+ @corpus = spdx_docs
21
+ @model = TfIdfSimilarity::BM25Model.new(@corpus, :library => :narray)
22
+
23
+ true
24
+ end
25
+
26
+ def match_text(text, min_confidence = DEFAULT_MIN_CONFIDENCE, is_processed_text = false)
27
+ return [] if text.to_s.empty?
28
+
29
+ text = preprocess_text(text) if is_processed_text == false
30
+ test_doc = TfIdfSimilarity::Document.new(text, {:id => "test"})
31
+
32
+ mat1 = @model.instance_variable_get(:@matrix)
33
+ mat2 = doc_tfidf_matrix(test_doc)
34
+
35
+ n_docs = @model.documents.size
36
+ dists = []
37
+ n_docs.times do |i|
38
+ dists << [i, cos_sim(mat1[i, true], mat2)]
39
+ end
40
+
41
+ doc_id, best_score = dists.sort {|a,b| b[1] <=> a[1]}.first
42
+ best_match = @model.documents[doc_id].id
43
+
44
+ if best_score.to_f > min_confidence
45
+ best_match
46
+ else
47
+ ""
48
+ end
49
+ end
50
+
51
+ def match_html(html_text, min_confidence = DEFAULT_MIN_CONFIDENCE)
52
+ match_text(preprocess_html(html_text), min_confidence)
53
+ end
54
+
55
+ #-- helpers
56
+ # Transforms document into TF-IDF matrix used for comparition
57
+ def doc_tfidf_matrix(doc)
58
+ arr = Array.new(@model.terms.size) do |i|
59
+ the_term = @model.terms[i]
60
+ if doc.term_count(the_term) > 0
61
+ #calc score only for words that exists in the test doc and the corpus of licenses
62
+ model.idf(the_term) * model.tf(doc, the_term)
63
+ else
64
+ 0.0
65
+ end
66
+ end
67
+
68
+ NArray[*arr]
69
+ end
70
+
71
+
72
+ # Calculates cosine similarity between 2 TF-IDF vector
73
+ def cos_sim(mat1, mat2)
74
+ length = (mat1 * mat2).sum
75
+ norm = Math::sqrt((mat1 ** 2).sum) * Math::sqrt((mat2 ** 2).sum)
76
+
77
+ ( norm > 0 ? length / norm : 0.0)
78
+ end
79
+
80
+ # Reads the content of licenses from the pre-built index
81
+ # NB! it is sensitive to the changes in the Fosslim/Index serialization
82
+ def read_corpus(index_path)
83
+ idx = MessagePack.unpack File.read index_path
84
+ spdx_ids = []
85
+ docs = []
86
+
87
+ idx[A_DOC_ROW].to_a.each do |doc_row|
88
+ _, spdx_id, content, _ = doc_row
89
+ txt = preprocess_text content
90
+ if txt
91
+ spdx_ids << spdx_id
92
+ docs << TfIdfSimilarity::Document.new(txt, :id => spdx_id)
93
+ end
94
+ end
95
+
96
+ [spdx_ids, docs]
97
+ end
98
+
99
+ end
100
+ end
@@ -0,0 +1,102 @@
1
+
2
+ module LicenseMatcher
3
+
4
+ class UrlMatcher
5
+ attr_reader :url_index
6
+
7
+ DEFAULT_LICENSE_JSON = 'data/spdx_licenses/licenses.json'
8
+
9
+ def initialize(license_json_file = DEFAULT_LICENSE_JSON)
10
+ licenses_json_doc = read_json_file license_json_file
11
+ raise("Failed to read licenses.json") if licenses_json_doc.nil?
12
+
13
+ @url_index = read_license_url_index(licenses_json_doc)
14
+ end
15
+
16
+ # Matches License.url with urls in Licenses.json and returns tuple [spdx_id, score]
17
+ def match_url(the_url)
18
+ the_url = the_url.to_s.strip
19
+ spdx_id = nil
20
+
21
+ case the_url
22
+ when 'http://jquery.org/license'
23
+ return ['mit', 1.0] #Jquery license page doesnt include any license text
24
+ when 'https://www.mozilla.org/en-US/MPL/'
25
+ return ['mpl-2.0', 1.0]
26
+ when 'http://fairlicense.org'
27
+ return ['fair', 1.0]
28
+ when 'http://www.aforgenet.com/framework/license.html'
29
+ return ['lgpl-3.0', 1.0]
30
+ when 'http://www.apache.org/licenses/'
31
+ return ['apache-2.0', 1.0]
32
+ when 'http://aws.amazon.com/apache2.0/'
33
+ return ['apache-2.0', 1.0]
34
+ when 'http://aws.amazon.com/asl/'
35
+ return ['amazon', 1.0]
36
+ when 'https://choosealicense.com/no-license/'
37
+ return ['no-license', 1.0]
38
+ when 'http://www.gzip.org/zlib/zlib_license.html'
39
+ return ['zlib', 1.0]
40
+ when 'http://zlib.net/zlib-license.html'
41
+ return ['zlib', 1.0]
42
+ when 'http://www.wtfpl.net/about/'
43
+ return ['wtfpl', 1.0]
44
+ end
45
+
46
+ #does url match with choosealicense.com
47
+ match = the_url.match(/\bhttps?:\/\/(www\.)?choosealicense\.com\/licenses\/([\S|^\/]+)[\/]?\b/i)
48
+ if match
49
+ return [match[2].to_s.downcase, 1.0]
50
+ end
51
+
52
+ match = the_url.match(/\bhttps?:\/\/(www\.)?creativecommons\.org\/licenses\/([\S|^\/]+)[\/]?\b/i)
53
+ if match
54
+ return ["cc-#{match[2].to_s.gsub(/\//, '-')}", 1.0]
55
+ end
56
+
57
+ #check through SPDX urls
58
+ @url_index.each do |lic_url, lic_id|
59
+ lic_url = lic_url.to_s.strip.gsub(/https?:\/\//i, '').gsub(/www\./, '') #normalizes urls in the file
60
+ matcher = Regexp.new("https?:\/\/(www\.)?#{lic_url}", Regexp::IGNORECASE)
61
+
62
+ if matcher.match(the_url)
63
+ spdx_id = lic_id.to_s.downcase
64
+ break
65
+ end
66
+ end
67
+
68
+ return [] if spdx_id.nil?
69
+
70
+ [spdx_id, 1.0]
71
+ end
72
+
73
+ # Reads license urls from the license.json and builds a map {url : spdx_id}
74
+ def read_license_url_index(spdx_licenses)
75
+ url_index = {}
76
+ spdx_licenses.each {|lic| url_index.merge! process_spdx_item(lic) }
77
+ url_index
78
+ end
79
+
80
+
81
+ def process_spdx_item(lic)
82
+ url_index = {}
83
+ lic_id = lic[:id].to_s.strip.downcase
84
+
85
+ return url_index if lic_id.empty?
86
+
87
+ lic[:links].to_a.each {|x| url_index[x[:url]] = lic_id }
88
+ lic[:text].to_a.each {|x| url_index[x[:url]] = lic_id }
89
+
90
+ url_index
91
+ end
92
+
93
+ def read_json_file(file_path)
94
+ JSON.parse(File.read(file_path), {symbolize_names: true})
95
+ rescue
96
+ log.info "Failed to read json file `#{file_path}`"
97
+ nil
98
+ end
99
+
100
+
101
+ end
102
+ end
@@ -1,7 +1,21 @@
1
1
  require "helix_runtime"
2
2
 
3
3
  begin
4
- require "license_matcher/native"
4
+ require "license_matcher/native"
5
5
  rescue LoadError
6
- warn "Unable to load license_matcher/native. Please run `rake build`"
6
+ warn "Unable to load license_matcher/native. Please run `rake build`"
7
+ end
8
+
9
+ require 'license_matcher/preprocess'
10
+ require 'license_matcher/url_matcher'
11
+ require 'license_matcher/rule_matcher'
12
+ require 'license_matcher/tf_ruby_matcher'
13
+
14
+ module LicenseMatcher
15
+
16
+ # if class is missing from the module,
17
+ # then look from global ns
18
+ def self.const_missing(c)
19
+ Object.const_get(c)
20
+ end
7
21
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: license_matcher
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0.pre.alpha
4
+ version: 0.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Timo Sulg
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2017-09-13 00:00:00.000000000 Z
12
+ date: 2017-09-19 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: helix_runtime
@@ -25,6 +25,48 @@ dependencies:
25
25
  - - "~>"
26
26
  - !ruby/object:Gem::Version
27
27
  version: 0.6.0
28
+ - !ruby/object:Gem::Dependency
29
+ name: narray
30
+ requirement: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - "~>"
33
+ - !ruby/object:Gem::Version
34
+ version: 0.6.1.2
35
+ type: :runtime
36
+ prerelease: false
37
+ version_requirements: !ruby/object:Gem::Requirement
38
+ requirements:
39
+ - - "~>"
40
+ - !ruby/object:Gem::Version
41
+ version: 0.6.1.2
42
+ - !ruby/object:Gem::Dependency
43
+ name: tf-idf-similarity
44
+ requirement: !ruby/object:Gem::Requirement
45
+ requirements:
46
+ - - "~>"
47
+ - !ruby/object:Gem::Version
48
+ version: 0.1.6
49
+ type: :runtime
50
+ prerelease: false
51
+ version_requirements: !ruby/object:Gem::Requirement
52
+ requirements:
53
+ - - "~>"
54
+ - !ruby/object:Gem::Version
55
+ version: 0.1.6
56
+ - !ruby/object:Gem::Dependency
57
+ name: nokogiri
58
+ requirement: !ruby/object:Gem::Requirement
59
+ requirements:
60
+ - - "~>"
61
+ - !ruby/object:Gem::Version
62
+ version: 1.8.0
63
+ type: :runtime
64
+ prerelease: false
65
+ version_requirements: !ruby/object:Gem::Requirement
66
+ requirements:
67
+ - - "~>"
68
+ - !ruby/object:Gem::Version
69
+ version: 1.8.0
28
70
  - !ruby/object:Gem::Dependency
29
71
  name: bundler
30
72
  requirement: !ruby/object:Gem::Requirement
@@ -67,6 +109,20 @@ dependencies:
67
109
  - - "~>"
68
110
  - !ruby/object:Gem::Version
69
111
  version: '3.4'
112
+ - !ruby/object:Gem::Dependency
113
+ name: msgpack
114
+ requirement: !ruby/object:Gem::Requirement
115
+ requirements:
116
+ - - "~>"
117
+ - !ruby/object:Gem::Version
118
+ version: 1.1.0
119
+ type: :development
120
+ prerelease: false
121
+ version_requirements: !ruby/object:Gem::Requirement
122
+ requirements:
123
+ - - "~>"
124
+ - !ruby/object:Gem::Version
125
+ version: 1.1.0
70
126
  description: "\n LicenseMatcher is rubygem, which uses Fosslim to match various
71
127
  OSS license\n with correct SPDX-id or EULA label.\n "
72
128
  email:
@@ -86,6 +142,10 @@ files:
86
142
  - Rakefile
87
143
  - lib/license_matcher.rb
88
144
  - lib/license_matcher/native.bundle
145
+ - lib/license_matcher/preprocess.rb
146
+ - lib/license_matcher/rule_matcher.rb
147
+ - lib/license_matcher/tf_ruby_matcher.rb
148
+ - lib/license_matcher/url_matcher.rb
89
149
  - lib/tasks/helix_runtime.rake
90
150
  homepage: https://www.github.com/fosslim
91
151
  licenses: []
@@ -101,9 +161,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
101
161
  version: '0'
102
162
  required_rubygems_version: !ruby/object:Gem::Requirement
103
163
  requirements:
104
- - - ">"
164
+ - - ">="
105
165
  - !ruby/object:Gem::Version
106
- version: 1.3.1
166
+ version: '0'
107
167
  requirements: []
108
168
  rubyforge_project:
109
169
  rubygems_version: 2.5.2