inci_score 1.2.1 → 2.0.1
- checksums.yaml +4 -4
- data/.travis.yml +1 -2
- data/README.md +44 -10
- data/bin/inci_score +4 -4
- data/inci_score.gemspec +1 -1
- data/lib/inci_score/api/app.rb +1 -1
- data/lib/inci_score/cli.rb +44 -0
- data/lib/inci_score/computer.rb +9 -9
- data/lib/inci_score/{parser.rb → fetcher.rb} +1 -1
- data/lib/inci_score/normalizer.rb +4 -4
- data/lib/inci_score/normalizer_rules.rb +51 -34
- data/lib/inci_score/recognizer.rb +8 -12
- data/lib/inci_score/recognizer_rules.rb +6 -6
- data/lib/inci_score/version.rb +1 -1
- data/lib/inci_score.rb +3 -2
- metadata +6 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5970cfdecac8492dbfd510dce7a24488e543233c
+  data.tar.gz: fb5b1171f1fcab479e24b33dc7d7f37582b93741
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 42624a99c66bc3fcfb53cff14ebe6a153b220901df9b9e5f49f3d8ec2c9378436cd7090446bb449fefc7320c8ff61fdd5375b551b015312858fac8b8bfa8b66c
+  data.tar.gz: 2f6bcc48dd8727a6b2b9665882cd3a809b849a7876a7a5303b7a8c3438c373fb3e25c28545889f70fcc14aa1724ea5febfbb5a78a555456ec99e44b5ca9de329
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -11,9 +11,13 @@
 * [Starting Puma](#starting-puma)
 * [Triggering a request](#triggering-a-request)
 * [CLI API](#cli-api)
-* [
+* [Refresh catalog](#refresh-catalog)
+* [Benchmark](#benchmark)
 * [Levenshtein in C](#levenshtein-in-c)
-* [
+* [Platform](#platform)
+* [Wrk](#wrk)
+* [Results](#results)
+* [Ruby 2.4](#ruby-2.4)

 ## Scope
 This gem computes the score of cosmetic components basing on the information provided by the [Biodizionario site](http://www.biodizionario.it/) by Fabrizio Zago.
@@ -75,8 +79,8 @@ The Web API exposes the *InciScore* library over HTTP via the [Puma](http://puma

 ### Starting Puma
 Simply start Puma via the *config.ru* file included in the repository by spawning how many workers as your current workstation supports:
-```
-bundle exec puma -w
+```shell
+bundle exec puma -w 8 -t 0:2 --preload
 ```

 ### Triggering a request
@@ -84,7 +88,7 @@ The Web API responds with a JSON object representing the original *InciScore::Re

 You can pass the source string directly as a HTTP parameter:

-```
+```shell
 curl http://127.0.0.1:9292?src=aqua,dimethicone
 => {"components":{"aqua":0,"dimethicone":4},"unrecognized":[],"score":53.762874945799766,"valid":true}
 ```
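As a rough illustration of consuming the response shown in the hunk above, the sketch below parses that JSON payload with Ruby's standard `json` library. The payload is copied from the README example; the variable names are illustrative and are not part of the gem.

```ruby
require "json"

# Payload copied verbatim from the README example in the hunk above.
payload = '{"components":{"aqua":0,"dimethicone":4},"unrecognized":[],"score":53.762874945799766,"valid":true}'

response = JSON.parse(payload)

# Each recognized component maps to a numeric value; unrecognized
# ingredients are reported in a separate array.
response["components"].each do |name, value|
  puts "#{name}: #{value}"
end
puts "unrecognized: #{response["unrecognized"].size}"
puts "score: #{response["score"].round(2)} (valid: #{response["valid"]})"
```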
@@ -92,8 +96,8 @@ curl http://127.0.0.1:9292?src=aqua,dimethicone
 ## CLI API
 You can collect INCI data by using the available binary:

-```
-inci_score "aqua,dimethicone,pej-10,noent"
+```shell
+inci_score --src="aqua,dimethicone,pej-10,noent"

 TOTAL SCORE:
 47.18034913243358
@@ -107,11 +111,41 @@ UNRECOGNIZED:
 noent
 ```

-
+### Refresh catalog
+When using CLI you have the option to fetch a fresh catalog from remote by specifyng a flag:
+```shell
+inci_score --fresh --src="aqua,dimethicone,pej-10,noent"
+```
+
+## Benchmark
+
+### Levenshtein in C
 I noticed the APIs slows down dramatically when dealing with unrecognized components to fuzzy match on.
 I profiled the code by using the [benchmark-ips](https://github.com/evanphx/benchmark-ips) gem, finding the bottleneck was the pure Ruby implementation of the Levenshtein distance algorithm.
 After some pointless optimization, i replaced this routine with a C implementation: i opted for the straightforward [Ruby Inline](https://github.com/seattlerb/rubyinline) library to call the C code straight from Ruby.
 As a result i've got a 10x increment of the throughput, all without scarifying code readability.

-###
-I
+### Platform
+I registered these benchmarks with a MacBook PRO 15 mid 2015 having these specs:
+* OSX El Captain
+* 2,2 GHz Intel Core i7 (4 cores)
+* 16 GB 1600 MHz DDR3
+
+### Wrk
+As always i used [wrk](https://github.com/wg/wrk) as the loading tool.
+I measured each library three times, picking the best lap.
+The following script command is used:
+
+```
+wrk -t 4 -c 100 -d 30s --timeout 2000 http://127.0.0.1:9292/?src=<list_of_ingredients>
+```
+
+### Results
+| Type | Ingredients | Throughput (req/s) | Latency in ms (avg/stdev/max) |
+| :----------------- | :----------------------- | -----------------: | ----------------------------: |
+| exact matching | aqua,parfum,zeolite | 48863.58 | 0.31/0.55/10.82 |
+
+## Ruby 2.4
+After upgrading to Ruby 2.4 i doubled the throughput of the matcher: i assume Ruby optimization to the [Hash access](#https://blog.heroku.com/ruby-2-4-features-hashes-integers-rounding) is the driving reason.
+I also adopted the new #match? method to avoid creating a MatchData object when i am just checking for predicate.
+In the end Ruby upgrade is a big deal for my gem, give it a try!
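The README note on `#match?` can be illustrated with a small sketch. It compares `String#match` and `String#match?` (available since Ruby 2.4) on a word-boundary pattern of the kind used in the recognizer rules; the sample strings are made up for the example.

```ruby
# Sample data, chosen only for illustration.
component = "dimethicone copolyol"
token     = "dimethicone"
pattern   = /\b#{Regexp.escape(token)}\b/

# String#match returns a MatchData object (or nil) and sets $~.
with_match_data = component.match(pattern)

# String#match? returns a plain boolean and allocates no MatchData,
# which is enough when the result is only used as a predicate.
predicate_only = component.match?(pattern)

puts with_match_data.class
puts predicate_only
```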
data/bin/inci_score
CHANGED
@@ -1,7 +1,7 @@
 #!/usr/bin/env ruby
+lib = File.expand_path("../../lib", __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)

-require
-require 'inci_score'
+require "inci_score"

-
-puts InciScore::Computer.new(ARGV[0], InciScore::Catalog.fetch).call
+InciScore::CLI.new(args: ARGV.clone).call
data/inci_score.gemspec
CHANGED
@@ -14,7 +14,7 @@ Gem::Specification.new do |s|
   s.executables << "inci_score"
   s.require_paths = ["lib"]
   s.license = "MIT"
-  s.required_ruby_version = ">= 2.
+  s.required_ruby_version = ">= 2.4"

   s.add_runtime_dependency "nokogiri", "~> 1.6"
   s.add_runtime_dependency "puma", "~> 3"
data/lib/inci_score/api/app.rb
CHANGED
@@ -13,7 +13,7 @@ module InciScore
       def call(env)
         req = Rack::Request.new(env)
         src = req.params["src"]
-        json = src ? Computer.new(src, catalog).call.to_json : %q({"error": "no valid source"})
+        json = src ? Computer.new(src: src, catalog: catalog).call.to_json : %q({"error": "no valid source"})
         ['200', {'Content-Type' => 'application/json'}, [json]]
       end
     end
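The request handling in this hunk follows the usual Rack contract: take an `env` hash, return a `[status, headers, body]` triple. The sketch below mimics that flow using only the standard library. It is not the gem's code: `SCORE_STUB` is a hypothetical stand-in for `Computer#call`, and the query string is parsed with `URI.decode_www_form` instead of `Rack::Request`.

```ruby
require "json"
require "uri"

# Hypothetical stand-in for Computer.new(src: ..., catalog: ...).call
SCORE_STUB = ->(src) { { "src" => src } }

# A Rack-style callable: env hash in, [status, headers, body] out.
APP = lambda do |env|
  params = URI.decode_www_form(env.fetch("QUERY_STRING", "")).to_h
  src = params["src"]
  json = src ? SCORE_STUB.call(src).to_json : %q({"error": "no valid source"})
  ["200", { "Content-Type" => "application/json" }, [json]]
end

status, headers, body = APP.call("QUERY_STRING" => "src=aqua,dimethicone")
puts status
puts headers["Content-Type"]
puts body.first
```

Note that, as in the hunk, a missing `src` parameter still answers with status 200 and an error payload.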
data/lib/inci_score/cli.rb
ADDED
@@ -0,0 +1,44 @@
+require "optparse"
+require "inci_score/computer"
+
+module InciScore
+  class CLI
+    def initialize(args:, io: STDOUT, catalog: InciScore::Catalog.fetch)
+      @args = args
+      @io = io
+      @catalog = catalog
+      @src = nil
+      @fresh = nil
+    end
+
+    def call(computer_klass = Computer, fetcher = Fetcher.new)
+      parser.parse!(@args)
+      return @io.puts("Specify inci list as: --src='aqua, parfum, etc'") unless @src
+      @io.puts computer_klass.new(src: @src, catalog: catalog(fetcher)).call
+    end
+
+    private def parser
+      OptionParser.new do |opts|
+        opts.banner = %q{Usage: ./bin/inci_score --src='aqua, parfum, etc' --fresh}
+
+        opts.on("-sSRC", "--src=SRC", "The INCI list: 'aqua, parfum, etc'") do |src|
+          @src = src
+        end
+
+        opts.on("-f", "--fresh", "Fetch a fresh catalog from remote") do |fresh|
+          @fresh = fresh
+        end
+
+        opts.on("-h", "--help", "Prints this help") do
+          @io.puts opts
+          exit
+        end
+      end
+    end
+
+    private def catalog(fetcher)
+      return @catalog unless @fresh
+      fetcher.call
+    end
+  end
+end
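To show how the two options declared in the new CLI class behave, here is a minimal, self-contained sketch built on the standard `optparse` library. The `parse_options` helper and its returned hash are hypothetical; the gem stores the values in instance variables instead.

```ruby
require "optparse"

# Hypothetical helper: parses the same two flags the CLI declares
# and returns them as a hash instead of setting instance variables.
def parse_options(args)
  options = { src: nil, fresh: false }
  OptionParser.new do |opts|
    opts.on("-sSRC", "--src=SRC", "The INCI list") { |src| options[:src] = src }
    opts.on("-f", "--fresh", "Fetch a fresh catalog") { |fresh| options[:fresh] = fresh }
  end.parse!(args)
  options
end

p parse_options(["--fresh", "--src=aqua,dimethicone"])
p parse_options([])
```

`parse!` removes the recognized switches from the array it receives, which is presumably why the binary passes `ARGV.clone` rather than `ARGV` itself.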
data/lib/inci_score/computer.rb
CHANGED
@@ -7,9 +7,11 @@ module InciScore
   class Computer
     TOLERANCE = 30.0

-    def initialize(src,
+    def initialize(src:, catalog:, tolerance: TOLERANCE, rules: Normalizer::DEFAULT_RULES)
       @src = src
       @catalog = catalog
+      @tolerance = Float(tolerance)
+      @rules = rules
       @unrecognized = []
     end

@@ -20,17 +22,15 @@ module InciScore
                   valid: valid?)
     end

-    private
-
-    def score
+    private def score
       Scorer.new(components.map(&:last)).call
     end

-    def ingredients
-      @ingredients ||= Normalizer.new(src: @src).call
+    private def ingredients
+      @ingredients ||= Normalizer.new(src: @src, rules: @rules).call
     end

-    def components
+    private def components
       @components ||= ingredients.map do |ingredient|
         Recognizer.new(ingredient, @catalog).call.tap do |component|
           @unrecognized << ingredient unless component
@@ -38,8 +38,8 @@ module InciScore
       end.compact
     end

-    def valid?
-      @unrecognized.size / (ingredients.size / 100.0) <=
+    private def valid?
+      @unrecognized.size / (ingredients.size / 100.0) <= @tolerance
     end
   end
 end
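The `valid?` check above compares the percentage of unrecognized ingredients against the tolerance. A small worked sketch of that arithmetic follows; the `valid_ratio?` helper is hypothetical and takes plain counts rather than the instance state used by `Computer`.

```ruby
TOLERANCE = 30.0

# Hypothetical helper mirroring the expression in Computer#valid?:
# unrecognized / (total / 100.0) is the unrecognized share in percent.
def valid_ratio?(unrecognized_count, ingredients_count, tolerance = TOLERANCE)
  unrecognized_count / (ingredients_count / 100.0) <= Float(tolerance)
end

# 1 unrecognized out of 4 ingredients is 25%, within the 30% default.
puts valid_ratio?(1, 4)
# 2 unrecognized out of 4 is 50%, above the default tolerance.
puts valid_ratio?(2, 4)
# A custom tolerance can be given as a string, since it goes through Float().
puts valid_ratio?(2, 4, "50")
```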
data/lib/inci_score/normalizer.rb
CHANGED
@@ -2,7 +2,7 @@ require 'inci_score/normalizer_rules'

 module InciScore
   class Normalizer
-    DEFAULT_RULES = Rules
+    DEFAULT_RULES = [Rules::Replacer, Rules::Downcaser, Rules::Beheader, Rules::Separator, Rules::Tokenizer, Rules::Sanitizer, Rules::Desynonymizer]

     attr_reader :src

@@ -12,9 +12,9 @@ module InciScore
     end

     def call
-      @rules
-
-      src = rule.call
+      yield(@rules) if block_given?
+      @rules.reduce(@src) do |src, rule|
+        @src = rule.call(src)
       end
     end
   end
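The new `Normalizer#call` threads the source through each rule with `reduce`, so any object responding to `call` can act as a rule. The sketch below shows the same idea with lambdas; the three lambdas and the `normalize` helper are illustrative stand-ins, not the gem's rule modules.

```ruby
SEPARATOR = ","

# Illustrative rules: each one takes the previous result and returns the next.
downcase   = ->(src) { src.downcase }
tokenize   = ->(src) { src.split(SEPARATOR).map(&:strip) }
drop_empty = ->(tokens) { tokens.reject(&:empty?) }

# Hypothetical helper: the output of one rule becomes the input of the next.
def normalize(src, rules)
  rules.reduce(src) { |acc, rule| rule.call(acc) }
end

tokens = normalize("Aqua, Dimethicone,, PARFUM ", [downcase, tokenize, drop_empty])
p tokens
```

Order matters in such a pipeline: rules that expect a string have to run before the tokenizing step, and rules that expect an array after it.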
data/lib/inci_score/normalizer_rules.rb
CHANGED
@@ -1,73 +1,90 @@
 module InciScore
   class Normalizer
     module Rules
-
-        SEPARATOR = ','
+      SEPARATOR = ','

-
-
-        end
+      module Replacer
+        extend self

-        def call
-          fail NotImplementedError
-        end
-      end
-
-      class Replacer < Base
         REPLACEMENTS = [
           [/\n+|\t+/, ' '],
           ['‘', "'"],
           ['—', '-'],
-          ['(', 'C'],
           ['_', ' '],
           ['~', '-'],
           ['|', 'l'],
           [' I ', '/']
         ]

-        def call
-          REPLACEMENTS.reduce(
+        def call(src)
+          REPLACEMENTS.reduce(src) do |_src, replacement|
             invalid, valid = *replacement
-
+            _src.index(invalid) ? _src.gsub(invalid, valid) : _src
           end
         end
       end

-
-
-
+      module Downcaser
+        extend self
+
+        def call(src)
+          src.downcase
         end
       end

-
+      module Beheader
+        extend self
+
         TITLE_SEP = ':'
         MAX_INDEX = 50

-        def call
-          sep_index =
-          return
-
+        def call(src)
+          sep_index = src.index(TITLE_SEP)
+          return src if !sep_index || sep_index > MAX_INDEX
+          src[sep_index+1, src.size]
         end
       end

-
+      module Separator
+        extend self
+
         SEPARATORS = ["; ", ". ", " ' ", " - ", " : "]

-        def call
-          SEPARATORS.reduce(
-
+        def call(src)
+          SEPARATORS.reduce(src) do |_src, separator|
+            _src = _src.gsub(separator, SEPARATOR)
           end
         end
       end

-
-
+      module Tokenizer
+        extend self
+
+        def call(src)
+          src.split(SEPARATOR).map(&:strip)
+        end
+      end
+
+      module Sanitizer
+        extend self
+
+        INVALID_CHARS = /[^\/\(\)\w\s-]/
+
+        def call(src)
+          Array(src).map do |token|
+            token.gsub(INVALID_CHARS, '')
+          end.reject(&:empty?)
+        end
+      end
+
+      module Desynonymizer
+        extend self
+
+        SYNONYM = /\/.*/

-        def call
-
-            token
-            token = token.gsub(INVALID_CHARS, '')
-            token = token.strip
+        def call(src)
+          Array(src).map do |token|
+            token.sub(SYNONYM, '').strip
           end.reject(&:empty?)
         end
       end
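To make the effect of the individual rules concrete, the sketch below copies three of the new modules out of their `InciScore::Normalizer::Rules` namespace and runs them on sample input. The module bodies follow the added lines of this hunk; the input strings are invented for the example.

```ruby
# Drops a leading title such as "ingredients:" when the colon is near the start.
module Beheader
  extend self

  TITLE_SEP = ':'
  MAX_INDEX = 50

  def call(src)
    sep_index = src.index(TITLE_SEP)
    return src if !sep_index || sep_index > MAX_INDEX
    src[sep_index + 1, src.size]
  end
end

# Strips characters outside word chars, whitespace, "/", "(", ")" and "-".
module Sanitizer
  extend self

  INVALID_CHARS = /[^\/\(\)\w\s-]/

  def call(src)
    Array(src).map { |token| token.gsub(INVALID_CHARS, '') }.reject(&:empty?)
  end
end

# Keeps only the part of a token before the first "/".
module Desynonymizer
  extend self

  SYNONYM = /\/.*/

  def call(src)
    Array(src).map { |token| token.sub(SYNONYM, '').strip }.reject(&:empty?)
  end
end

p Beheader.call("ingredients: aqua, parfum")
p Sanitizer.call(["aqua*", "c.i. 77891", "!"])
p Desynonymizer.call(["aqua/water", "parfum"])
```

Because each module uses `extend self`, its `call` method is invoked on the module itself with no instance to build, which matches how the normalizer passes modules around as rules.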
data/lib/inci_score/recognizer.rb
CHANGED
@@ -2,7 +2,7 @@ require 'inci_score/recognizer_rules'

 module InciScore
   class Recognizer
-    DEFAULT_RULES = Rules
+    DEFAULT_RULES = [Rules::Key, Rules::Levenshtein, Rules::Digits, Rules::Tokens]

     def initialize(src, catalog, rules = DEFAULT_RULES)
       @src = src
@@ -11,17 +11,13 @@ module InciScore
     end

     def call
-      @component =
-
-
-
-
-
-    def apply_rules
-      @rules.reduce(nil) do |component, name|
-        rule = Rules.const_get(name).new(@src, @catalog)
-        component || rule.call
+      @component = @rules.reduce(nil) do |component, rule|
+        break(component) if component
+        _rule = rule.new(@src, @catalog)
+        yield(rule) if block_given?
+        _rule.call
       end
-
+      [@component, @catalog[@component]] if @component
+    end
   end
 end
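The new `Recognizer#call` tries the rules in order and stops at the first one that yields a component, using `break` inside `reduce`. The sketch below reproduces that control flow with two hypothetical lambda rules and a toy catalog; the real rules are classes instantiated with the source and the catalog.

```ruby
# Toy catalog: component name => value (sample data only).
CATALOG = { "aqua" => 0, "dimethicone" => 4 }

# Hypothetical rules: exact key lookup first, then a crude prefix match.
RULES = [
  ->(src, catalog) { src if catalog.key?(src) },
  ->(src, catalog) { catalog.keys.find { |name| name.start_with?(src[0, 3]) } }
]

def recognize(src, catalog = CATALOG, rules = RULES)
  component = rules.reduce(nil) do |found, rule|
    # Stop as soon as an earlier rule produced a component.
    break(found) if found
    rule.call(src, catalog)
  end
  [component, catalog[component]] if component
end

p recognize("aqua")
p recognize("dimethycone")
p recognize("zzz")
```

Compared with the removed `component || rule.call` form, breaking early also avoids instantiating the remaining rules once a match is found.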
data/lib/inci_score/recognizer_rules.rb
CHANGED
@@ -28,12 +28,12 @@ module InciScore
         def call
           size = @src.size
           initial = @src[0]
-          component, distance = @catalog.reduce([nil, size]) do |min, (
-            next min unless
-            match = (n =
+          component, distance = @catalog.reduce([nil, size]) do |min, (_component, _)|
+            next min unless _component.start_with?(initial)
+            match = (n = _component.index(ALTERNATE_SEP)) ? _component[0, n] : _component
             next min if match.size > (size + TOLERANCE)
             dist = @src.distance(match)
-            min = [
+            min = [_component, dist] if dist < min[1]
             min
           end
           component unless distance > TOLERANCE || distance >= (size-1)
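The fuzzy rule above scans the catalog for the entry with the smallest edit distance to the source, skipping entries that start with a different letter. The sketch below shows that search with a pure-Ruby Levenshtein function. It is an assumption-laden stand-in: the gem calls a C implementation through `@src.distance`, the `TOLERANCE` value used here is hypothetical, and the `ALTERNATE_SEP` handling is left out.

```ruby
# Hypothetical threshold; the gem's actual TOLERANCE is not shown in this hunk.
TOLERANCE = 3

# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Returns the closest catalog key sharing the first letter, or nil if too far.
def closest(src, catalog)
  initial = src[0]
  component, distance = catalog.reduce([nil, src.size]) do |min, (name, _)|
    next min unless name.start_with?(initial)
    dist = levenshtein(src, name)
    dist < min[1] ? [name, dist] : min
  end
  component unless distance > TOLERANCE
end

catalog = { "aqua" => 0, "dimethicone" => 4, "dimethiconol" => 3 }
p closest("dimethycone", catalog)
p closest("xyz", catalog)
```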
@@ -47,7 +47,7 @@ module InciScore
           return if @src.size < TOLERANCE
           digits = @src[0, MIN_MEANINGFUL]
           @catalog.detect do |component, _|
-            component.match(/^#{Regexp::escape(digits)}/)
+            component.match?(/^#{Regexp::escape(digits)}/)
           end.to_a.first
         end
       end
@@ -58,7 +58,7 @@ module InciScore
         def call
           tokens.each do |token|
             @catalog.each do |component, _|
-              return component if component.match(/\b#{Regexp.escape(token)}\b/)
+              return component if component.match?(/\b#{Regexp.escape(token)}\b/)
             end
           end
           nil
data/lib/inci_score/version.rb
CHANGED
data/lib/inci_score.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: inci_score
 version: !ruby/object:Gem::Version
-  version:
+  version: 2.0.1
 platform: ruby
 authors:
 - costajob
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2017-01-03 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -145,11 +145,12 @@ files:
 - lib/inci_score.rb
 - lib/inci_score/api/app.rb
 - lib/inci_score/catalog.rb
+- lib/inci_score/cli.rb
 - lib/inci_score/computer.rb
+- lib/inci_score/fetcher.rb
 - lib/inci_score/levenshtein.rb
 - lib/inci_score/normalizer.rb
 - lib/inci_score/normalizer_rules.rb
-- lib/inci_score/parser.rb
 - lib/inci_score/recognizer.rb
 - lib/inci_score/recognizer_rules.rb
 - lib/inci_score/response.rb
@@ -169,7 +170,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: 2.
+      version: '2.4'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
@@ -177,7 +178,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.
+rubygems_version: 2.6.8
 signing_key:
 specification_version: 4
 summary: A library that computes the hazard of cosmetic products components, based