inci_score 1.2.1 → 2.0.1
- checksums.yaml +4 -4
- data/.travis.yml +1 -2
- data/README.md +44 -10
- data/bin/inci_score +4 -4
- data/inci_score.gemspec +1 -1
- data/lib/inci_score/api/app.rb +1 -1
- data/lib/inci_score/cli.rb +44 -0
- data/lib/inci_score/computer.rb +9 -9
- data/lib/inci_score/{parser.rb → fetcher.rb} +1 -1
- data/lib/inci_score/normalizer.rb +4 -4
- data/lib/inci_score/normalizer_rules.rb +51 -34
- data/lib/inci_score/recognizer.rb +8 -12
- data/lib/inci_score/recognizer_rules.rb +6 -6
- data/lib/inci_score/version.rb +1 -1
- data/lib/inci_score.rb +3 -2
- metadata +6 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 5970cfdecac8492dbfd510dce7a24488e543233c
|
4
|
+
data.tar.gz: fb5b1171f1fcab479e24b33dc7d7f37582b93741
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 42624a99c66bc3fcfb53cff14ebe6a153b220901df9b9e5f49f3d8ec2c9378436cd7090446bb449fefc7320c8ff61fdd5375b551b015312858fac8b8bfa8b66c
|
7
|
+
data.tar.gz: 2f6bcc48dd8727a6b2b9665882cd3a809b849a7876a7a5303b7a8c3438c373fb3e25c28545889f70fcc14aa1724ea5febfbb5a78a555456ec99e44b5ca9de329
|
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -11,9 +11,13 @@
|
|
11
11
|
* [Starting Puma](#starting-puma)
|
12
12
|
* [Triggering a request](#triggering-a-request)
|
13
13
|
* [CLI API](#cli-api)
|
14
|
-
* [
|
14
|
+
* [Refresh catalog](#refresh-catalog)
|
15
|
+
* [Benchmark](#benchmark)
|
15
16
|
* [Levenshtein in C](#levenshtein-in-c)
|
16
|
-
* [
|
17
|
+
* [Platform](#platform)
|
18
|
+
* [Wrk](#wrk)
|
19
|
+
* [Results](#results)
|
20
|
+
* [Ruby 2.4](#ruby-2.4)
|
17
21
|
|
18
22
|
## Scope
|
19
23
|
This gem computes the score of cosmetic components basing on the information provided by the [Biodizionario site](http://www.biodizionario.it/) by Fabrizio Zago.
|
@@ -75,8 +79,8 @@ The Web API exposes the *InciScore* library over HTTP via the [Puma](http://puma
|
|
75
79
|
|
76
80
|
### Starting Puma
|
77
81
|
Simply start Puma via the *config.ru* file included in the repository by spawning how many workers as your current workstation supports:
|
78
|
-
```
|
79
|
-
bundle exec puma -w
|
82
|
+
```shell
|
83
|
+
bundle exec puma -w 8 -t 0:2 --preload
|
80
84
|
```
|
81
85
|
|
82
86
|
### Triggering a request
|
@@ -84,7 +88,7 @@ The Web API responds with a JSON object representing the original *InciScore::Re
|
|
84
88
|
|
85
89
|
You can pass the source string directly as a HTTP parameter:
|
86
90
|
|
87
|
-
```
|
91
|
+
```shell
|
88
92
|
curl http://127.0.0.1:9292?src=aqua,dimethicone
|
89
93
|
=> {"components":{"aqua":0,"dimethicone":4},"unrecognized":[],"score":53.762874945799766,"valid":true}
|
90
94
|
```
|
@@ -92,8 +96,8 @@ curl http://127.0.0.1:9292?src=aqua,dimethicone
|
|
92
96
|
## CLI API
|
93
97
|
You can collect INCI data by using the available binary:
|
94
98
|
|
95
|
-
```
|
96
|
-
inci_score "aqua,dimethicone,pej-10,noent"
|
99
|
+
```shell
|
100
|
+
inci_score --src="aqua,dimethicone,pej-10,noent"
|
97
101
|
|
98
102
|
TOTAL SCORE:
|
99
103
|
47.18034913243358
|
@@ -107,11 +111,41 @@ UNRECOGNIZED:
|
|
107
111
|
noent
|
108
112
|
```
|
109
113
|
|
110
|
-
|
114
|
+
### Refresh catalog
|
115
|
+
When using the CLI you have the option to fetch a fresh catalog from remote by specifying a flag:
|
116
|
+
```shell
|
117
|
+
inci_score --fresh --src="aqua,dimethicone,pej-10,noent"
|
118
|
+
```
|
119
|
+
|
120
|
+
## Benchmark
|
121
|
+
|
122
|
+
### Levenshtein in C
|
111
123
|
I noticed the APIs slows down dramatically when dealing with unrecognized components to fuzzy match on.
|
112
124
|
I profiled the code by using the [benchmark-ips](https://github.com/evanphx/benchmark-ips) gem, finding the bottleneck was the pure Ruby implementation of the Levenshtein distance algorithm.
|
113
125
|
After some pointless optimization, i replaced this routine with a C implementation: i opted for the straightforward [Ruby Inline](https://github.com/seattlerb/rubyinline) library to call the C code straight from Ruby.
|
114
126
|
As a result I got a 10x increase in throughput, all without sacrificing code readability.
|
115
127
|
|
116
|
-
###
|
117
|
-
I
|
128
|
+
### Platform
|
129
|
+
I registered these benchmarks with a MacBook PRO 15 mid 2015 having these specs:
|
130
|
+
* OS X El Capitan
|
131
|
+
* 2,2 GHz Intel Core i7 (4 cores)
|
132
|
+
* 16 GB 1600 MHz DDR3
|
133
|
+
|
134
|
+
### Wrk
|
135
|
+
As always i used [wrk](https://github.com/wg/wrk) as the loading tool.
|
136
|
+
I measured each library three times, picking the best lap.
|
137
|
+
The following script command is used:
|
138
|
+
|
139
|
+
```
|
140
|
+
wrk -t 4 -c 100 -d 30s --timeout 2000 http://127.0.0.1:9292/?src=<list_of_ingredients>
|
141
|
+
```
|
142
|
+
|
143
|
+
### Results
|
144
|
+
| Type | Ingredients | Throughput (req/s) | Latency in ms (avg/stdev/max) |
|
145
|
+
| :----------------- | :----------------------- | -----------------: | ----------------------------: |
|
146
|
+
| exact matching | aqua,parfum,zeolite | 48863.58 | 0.31/0.55/10.82 |
|
147
|
+
|
148
|
+
## Ruby 2.4
|
149
|
+
After upgrading to Ruby 2.4 i doubled the throughput of the matcher: i assume Ruby optimization to the [Hash access](#https://blog.heroku.com/ruby-2-4-features-hashes-integers-rounding) is the driving reason.
|
150
|
+
I also adopted the new #match? method to avoid creating a MatchData object when i am just checking for predicate.
|
151
|
+
In the end Ruby upgrade is a big deal for my gem, give it a try!
|
data/bin/inci_score
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
#!/usr/bin/env ruby
|
2
|
+
lib = File.expand_path("../../lib", __FILE__)
|
3
|
+
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
|
2
4
|
|
3
|
-
require
|
4
|
-
require 'inci_score'
|
5
|
+
require "inci_score"
|
5
6
|
|
6
|
-
|
7
|
-
puts InciScore::Computer.new(ARGV[0], InciScore::Catalog.fetch).call
|
7
|
+
InciScore::CLI.new(args: ARGV.clone).call
|
data/inci_score.gemspec
CHANGED
@@ -14,7 +14,7 @@ Gem::Specification.new do |s|
|
|
14
14
|
s.executables << "inci_score"
|
15
15
|
s.require_paths = ["lib"]
|
16
16
|
s.license = "MIT"
|
17
|
-
s.required_ruby_version = ">= 2.
|
17
|
+
s.required_ruby_version = ">= 2.4"
|
18
18
|
|
19
19
|
s.add_runtime_dependency "nokogiri", "~> 1.6"
|
20
20
|
s.add_runtime_dependency "puma", "~> 3"
|
data/lib/inci_score/api/app.rb
CHANGED
@@ -13,7 +13,7 @@ module InciScore
|
|
13
13
|
def call(env)
|
14
14
|
req = Rack::Request.new(env)
|
15
15
|
src = req.params["src"]
|
16
|
-
json = src ? Computer.new(src, catalog).call.to_json : %q({"error": "no valid source"})
|
16
|
+
json = src ? Computer.new(src: src, catalog: catalog).call.to_json : %q({"error": "no valid source"})
|
17
17
|
['200', {'Content-Type' => 'application/json'}, [json]]
|
18
18
|
end
|
19
19
|
end
|
data/lib/inci_score/cli.rb
ADDED
@@ -0,0 +1,44 @@
|
|
1
|
+
require "optparse"
|
2
|
+
require "inci_score/computer"
|
3
|
+
|
4
|
+
module InciScore
|
5
|
+
class CLI
|
6
|
+
def initialize(args:, io: STDOUT, catalog: InciScore::Catalog.fetch)
|
7
|
+
@args = args
|
8
|
+
@io = io
|
9
|
+
@catalog = catalog
|
10
|
+
@src = nil
|
11
|
+
@fresh = nil
|
12
|
+
end
|
13
|
+
|
14
|
+
def call(computer_klass = Computer, fetcher = Fetcher.new)
|
15
|
+
parser.parse!(@args)
|
16
|
+
return @io.puts("Specify inci list as: --src='aqua, parfum, etc'") unless @src
|
17
|
+
@io.puts computer_klass.new(src: @src, catalog: catalog(fetcher)).call
|
18
|
+
end
|
19
|
+
|
20
|
+
private def parser
|
21
|
+
OptionParser.new do |opts|
|
22
|
+
opts.banner = %q{Usage: ./bin/inci_score --src='aqua, parfum, etc' --fresh}
|
23
|
+
|
24
|
+
opts.on("-sSRC", "--src=SRC", "The INCI list: 'aqua, parfum, etc'") do |src|
|
25
|
+
@src = src
|
26
|
+
end
|
27
|
+
|
28
|
+
opts.on("-f", "--fresh", "Fetch a fresh catalog from remote") do |fresh|
|
29
|
+
@fresh = fresh
|
30
|
+
end
|
31
|
+
|
32
|
+
opts.on("-h", "--help", "Prints this help") do
|
33
|
+
@io.puts opts
|
34
|
+
exit
|
35
|
+
end
|
36
|
+
end
|
37
|
+
end
|
38
|
+
|
39
|
+
private def catalog(fetcher)
|
40
|
+
return @catalog unless @fresh
|
41
|
+
fetcher.call
|
42
|
+
end
|
43
|
+
end
|
44
|
+
end
|
data/lib/inci_score/computer.rb
CHANGED
@@ -7,9 +7,11 @@ module InciScore
|
|
7
7
|
class Computer
|
8
8
|
TOLERANCE = 30.0
|
9
9
|
|
10
|
-
def initialize(src,
|
10
|
+
def initialize(src:, catalog:, tolerance: TOLERANCE, rules: Normalizer::DEFAULT_RULES)
|
11
11
|
@src = src
|
12
12
|
@catalog = catalog
|
13
|
+
@tolerance = Float(tolerance)
|
14
|
+
@rules = rules
|
13
15
|
@unrecognized = []
|
14
16
|
end
|
15
17
|
|
@@ -20,17 +22,15 @@ module InciScore
|
|
20
22
|
valid: valid?)
|
21
23
|
end
|
22
24
|
|
23
|
-
private
|
24
|
-
|
25
|
-
def score
|
25
|
+
private def score
|
26
26
|
Scorer.new(components.map(&:last)).call
|
27
27
|
end
|
28
28
|
|
29
|
-
def ingredients
|
30
|
-
@ingredients ||= Normalizer.new(src: @src).call
|
29
|
+
private def ingredients
|
30
|
+
@ingredients ||= Normalizer.new(src: @src, rules: @rules).call
|
31
31
|
end
|
32
32
|
|
33
|
-
def components
|
33
|
+
private def components
|
34
34
|
@components ||= ingredients.map do |ingredient|
|
35
35
|
Recognizer.new(ingredient, @catalog).call.tap do |component|
|
36
36
|
@unrecognized << ingredient unless component
|
@@ -38,8 +38,8 @@ module InciScore
|
|
38
38
|
end.compact
|
39
39
|
end
|
40
40
|
|
41
|
-
def valid?
|
42
|
-
@unrecognized.size / (ingredients.size / 100.0) <=
|
41
|
+
private def valid?
|
42
|
+
@unrecognized.size / (ingredients.size / 100.0) <= @tolerance
|
43
43
|
end
|
44
44
|
end
|
45
45
|
end
|
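The validity check in `Computer#valid?` above compares the percentage of unrecognized ingredients against the tolerance (30.0 by default). A minimal sketch of that arithmetic follows; the helper names `unrecognized_percentage` and `valid_list?` are hypothetical and not part of the gem.

```ruby
# Hypothetical helpers mirroring the arithmetic of Computer#valid? in the diff.
TOLERANCE = 30.0 # default threshold, expressed as a percentage

# Percentage of ingredients that could not be recognized.
def unrecognized_percentage(unrecognized_count, ingredients_count)
  unrecognized_count / (ingredients_count / 100.0)
end

# A list is considered valid while the unrecognized share stays within tolerance.
def valid_list?(unrecognized_count, ingredients_count, tolerance: TOLERANCE)
  unrecognized_percentage(unrecognized_count, ingredients_count) <= Float(tolerance)
end

puts valid_list?(1, 4) # one unknown out of four, as in "aqua,dimethicone,pej-10,noent"
puts valid_list?(2, 4) # half of the list unknown
```

With the README's CLI example, one ingredient out of four is unrecognized, which is 25% and therefore within the default tolerance. Passing a different `tolerance:` corresponds to the new keyword argument on `Computer#initialize`.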
data/lib/inci_score/normalizer.rb
CHANGED
@@ -2,7 +2,7 @@ require 'inci_score/normalizer_rules'
|
|
2
2
|
|
3
3
|
module InciScore
|
4
4
|
class Normalizer
|
5
|
-
DEFAULT_RULES = Rules
|
5
|
+
DEFAULT_RULES = [Rules::Replacer, Rules::Downcaser, Rules::Beheader, Rules::Separator, Rules::Tokenizer, Rules::Sanitizer, Rules::Desynonymizer]
|
6
6
|
|
7
7
|
attr_reader :src
|
8
8
|
|
@@ -12,9 +12,9 @@ module InciScore
|
|
12
12
|
end
|
13
13
|
|
14
14
|
def call
|
15
|
-
@rules
|
16
|
-
|
17
|
-
src = rule.call
|
15
|
+
yield(@rules) if block_given?
|
16
|
+
@rules.reduce(@src) do |src, rule|
|
17
|
+
@src = rule.call(src)
|
18
18
|
end
|
19
19
|
end
|
20
20
|
end
|
data/lib/inci_score/normalizer_rules.rb
CHANGED
@@ -1,73 +1,90 @@
|
|
1
1
|
module InciScore
|
2
2
|
class Normalizer
|
3
3
|
module Rules
|
4
|
-
|
5
|
-
SEPARATOR = ','
|
4
|
+
SEPARATOR = ','
|
6
5
|
|
7
|
-
|
8
|
-
|
9
|
-
end
|
6
|
+
module Replacer
|
7
|
+
extend self
|
10
8
|
|
11
|
-
def call
|
12
|
-
fail NotImplementedError
|
13
|
-
end
|
14
|
-
end
|
15
|
-
|
16
|
-
class Replacer < Base
|
17
9
|
REPLACEMENTS = [
|
18
10
|
[/\n+|\t+/, ' '],
|
19
11
|
['‘', "'"],
|
20
12
|
['—', '-'],
|
21
|
-
['(', 'C'],
|
22
13
|
['_', ' '],
|
23
14
|
['~', '-'],
|
24
15
|
['|', 'l'],
|
25
16
|
[' I ', '/']
|
26
17
|
]
|
27
18
|
|
28
|
-
def call
|
29
|
-
REPLACEMENTS.reduce(
|
19
|
+
def call(src)
|
20
|
+
REPLACEMENTS.reduce(src) do |_src, replacement|
|
30
21
|
invalid, valid = *replacement
|
31
|
-
|
22
|
+
_src.index(invalid) ? _src.gsub(invalid, valid) : _src
|
32
23
|
end
|
33
24
|
end
|
34
25
|
end
|
35
26
|
|
36
|
-
|
37
|
-
|
38
|
-
|
27
|
+
module Downcaser
|
28
|
+
extend self
|
29
|
+
|
30
|
+
def call(src)
|
31
|
+
src.downcase
|
39
32
|
end
|
40
33
|
end
|
41
34
|
|
42
|
-
|
35
|
+
module Beheader
|
36
|
+
extend self
|
37
|
+
|
43
38
|
TITLE_SEP = ':'
|
44
39
|
MAX_INDEX = 50
|
45
40
|
|
46
|
-
def call
|
47
|
-
sep_index =
|
48
|
-
return
|
49
|
-
|
41
|
+
def call(src)
|
42
|
+
sep_index = src.index(TITLE_SEP)
|
43
|
+
return src if !sep_index || sep_index > MAX_INDEX
|
44
|
+
src[sep_index+1, src.size]
|
50
45
|
end
|
51
46
|
end
|
52
47
|
|
53
|
-
|
48
|
+
module Separator
|
49
|
+
extend self
|
50
|
+
|
54
51
|
SEPARATORS = ["; ", ". ", " ' ", " - ", " : "]
|
55
52
|
|
56
|
-
def call
|
57
|
-
SEPARATORS.reduce(
|
58
|
-
|
53
|
+
def call(src)
|
54
|
+
SEPARATORS.reduce(src) do |_src, separator|
|
55
|
+
_src = _src.gsub(separator, SEPARATOR)
|
59
56
|
end
|
60
57
|
end
|
61
58
|
end
|
62
59
|
|
63
|
-
|
64
|
-
|
60
|
+
module Tokenizer
|
61
|
+
extend self
|
62
|
+
|
63
|
+
def call(src)
|
64
|
+
src.split(SEPARATOR).map(&:strip)
|
65
|
+
end
|
66
|
+
end
|
67
|
+
|
68
|
+
module Sanitizer
|
69
|
+
extend self
|
70
|
+
|
71
|
+
INVALID_CHARS = /[^\/\(\)\w\s-]/
|
72
|
+
|
73
|
+
def call(src)
|
74
|
+
Array(src).map do |token|
|
75
|
+
token.gsub(INVALID_CHARS, '')
|
76
|
+
end.reject(&:empty?)
|
77
|
+
end
|
78
|
+
end
|
79
|
+
|
80
|
+
module Desynonymizer
|
81
|
+
extend self
|
82
|
+
|
83
|
+
SYNONYM = /\/.*/
|
65
84
|
|
66
|
-
def call
|
67
|
-
|
68
|
-
token
|
69
|
-
token = token.gsub(INVALID_CHARS, '')
|
70
|
-
token = token.strip
|
85
|
+
def call(src)
|
86
|
+
Array(src).map do |token|
|
87
|
+
token.sub(SYNONYM, '').strip
|
71
88
|
end.reject(&:empty?)
|
72
89
|
end
|
73
90
|
end
|
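The rewritten rules are stateless modules exposing `call(src)`, so the normalizer can thread the source through them with `reduce`. The sketch below illustrates that pattern with three simplified stand-ins; the top-level `normalize` method and the `RULES` constant are assumptions made for illustration, not the gem's API.

```ruby
# Simplified stand-ins for the rule modules in the diff; not the gem's full rule set.
module Downcaser
  extend self

  def call(src)
    src.downcase
  end
end

module Tokenizer
  extend self

  SEPARATOR = ','

  # Turns the source string into an array of stripped tokens.
  def call(src)
    src.split(SEPARATOR).map(&:strip)
  end
end

module Sanitizer
  extend self

  INVALID_CHARS = /[^\/\(\)\w\s-]/

  # Drops invalid characters from each token and discards empty tokens.
  def call(src)
    Array(src).map { |token| token.gsub(INVALID_CHARS, '') }.reject(&:empty?)
  end
end

RULES = [Downcaser, Tokenizer, Sanitizer] # hypothetical, shorter than DEFAULT_RULES

# Each rule receives the output of the previous one.
def normalize(src, rules = RULES)
  rules.reduce(src) { |acc, rule| rule.call(acc) }
end

p normalize("Aqua, Dimethicone*, !")
```

Because every rule is just an object responding to `call`, a caller could presumably pass a different list (or a lambda) without subclassing anything, which appears to be the point of replacing the old `Base` class hierarchy.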
data/lib/inci_score/recognizer.rb
CHANGED
@@ -2,7 +2,7 @@ require 'inci_score/recognizer_rules'
|
|
2
2
|
|
3
3
|
module InciScore
|
4
4
|
class Recognizer
|
5
|
-
DEFAULT_RULES = Rules
|
5
|
+
DEFAULT_RULES = [Rules::Key, Rules::Levenshtein, Rules::Digits, Rules::Tokens]
|
6
6
|
|
7
7
|
def initialize(src, catalog, rules = DEFAULT_RULES)
|
8
8
|
@src = src
|
@@ -11,17 +11,13 @@ module InciScore
|
|
11
11
|
end
|
12
12
|
|
13
13
|
def call
|
14
|
-
@component =
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
def apply_rules
|
21
|
-
@rules.reduce(nil) do |component, name|
|
22
|
-
rule = Rules.const_get(name).new(@src, @catalog)
|
23
|
-
component || rule.call
|
14
|
+
@component = @rules.reduce(nil) do |component, rule|
|
15
|
+
break(component) if component
|
16
|
+
_rule = rule.new(@src, @catalog)
|
17
|
+
yield(rule) if block_given?
|
18
|
+
_rule.call
|
24
19
|
end
|
25
|
-
|
20
|
+
[@component, @catalog[@component]] if @component
|
21
|
+
end
|
26
22
|
end
|
27
23
|
end
|
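The new `Recognizer#call` walks the rule classes in order and stops at the first one that yields a component, using `break` inside `reduce`. A minimal sketch of that control flow, assuming two hypothetical rules (`ExactKey` and `Prefix`) and a hash catalog in place of the gem's real rules and data:

```ruby
# Hypothetical rules: each is built with (src, catalog) and returns a catalog key or nil.
class ExactKey
  def initialize(src, catalog)
    @src = src
    @catalog = catalog
  end

  def call
    @src if @catalog.key?(@src)
  end
end

class Prefix
  def initialize(src, catalog)
    @src = src
    @catalog = catalog
  end

  def call
    @catalog.keys.detect { |key| key.start_with?(@src) }
  end
end

def recognize(src, catalog, rules = [ExactKey, Prefix])
  component = rules.reduce(nil) do |found, rule|
    break(found) if found # a previous rule matched: skip the remaining rules
    rule.new(src, catalog).call
  end
  # Return the matched key together with its catalog value, or nil.
  [component, catalog[component]] if component
end

CATALOG = { "aqua" => 0, "dimethicone" => 4 } # illustrative data only

p recognize("aqua", CATALOG)
p recognize("dimeth", CATALOG)
p recognize("noent", CATALOG)
```

Compared with the previous `component || rule.call` form, breaking early also avoids instantiating the later (and more expensive) rules once a match is found.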
data/lib/inci_score/recognizer_rules.rb
CHANGED
@@ -28,12 +28,12 @@ module InciScore
|
|
28
28
|
def call
|
29
29
|
size = @src.size
|
30
30
|
initial = @src[0]
|
31
|
-
component, distance = @catalog.reduce([nil, size]) do |min, (
|
32
|
-
next min unless
|
33
|
-
match = (n =
|
31
|
+
component, distance = @catalog.reduce([nil, size]) do |min, (_component, _)|
|
32
|
+
next min unless _component.start_with?(initial)
|
33
|
+
match = (n = _component.index(ALTERNATE_SEP)) ? _component[0, n] : _component
|
34
34
|
next min if match.size > (size + TOLERANCE)
|
35
35
|
dist = @src.distance(match)
|
36
|
-
min = [
|
36
|
+
min = [_component, dist] if dist < min[1]
|
37
37
|
min
|
38
38
|
end
|
39
39
|
component unless distance > TOLERANCE || distance >= (size-1)
|
@@ -47,7 +47,7 @@ module InciScore
|
|
47
47
|
return if @src.size < TOLERANCE
|
48
48
|
digits = @src[0, MIN_MEANINGFUL]
|
49
49
|
@catalog.detect do |component, _|
|
50
|
-
component.match(/^#{Regexp::escape(digits)}/)
|
50
|
+
component.match?(/^#{Regexp::escape(digits)}/)
|
51
51
|
end.to_a.first
|
52
52
|
end
|
53
53
|
end
|
@@ -58,7 +58,7 @@ module InciScore
|
|
58
58
|
def call
|
59
59
|
tokens.each do |token|
|
60
60
|
@catalog.each do |component, _|
|
61
|
-
return component if component.match(/\b#{Regexp.escape(token)}\b/)
|
61
|
+
return component if component.match?(/\b#{Regexp.escape(token)}\b/)
|
62
62
|
end
|
63
63
|
end
|
64
64
|
nil
|
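The switch from `match` to `match?` in the hunks above relies on Ruby 2.4's `String#match?`, which returns a plain boolean and does not build a `MatchData` object. A small sketch of the difference; the `token_in_component?` helper is hypothetical and only mirrors the whole-word test used by the token rule:

```ruby
# Hypothetical helper mirroring the whole-word check from the diff.
def token_in_component?(component, token)
  component.match?(/\b#{Regexp.escape(token)}\b/)
end

component = "sodium laureth sulfate"

# match allocates a MatchData object (or returns nil).
result = component.match(/\blaureth\b/)
p result.class

# match? only answers the yes/no question.
p token_in_component?(component, "laureth")
p token_in_component?(component, "laur") # not a whole word, so no match
```

Since the rules only need a predicate while scanning the whole catalog, skipping the `MatchData` allocation on every comparison is a plausible source of the throughput gain mentioned in the README.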
data/lib/inci_score/version.rb
CHANGED
data/lib/inci_score.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: inci_score
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version:
|
4
|
+
version: 2.0.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- costajob
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2017-01-03 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: nokogiri
|
@@ -145,11 +145,12 @@ files:
|
|
145
145
|
- lib/inci_score.rb
|
146
146
|
- lib/inci_score/api/app.rb
|
147
147
|
- lib/inci_score/catalog.rb
|
148
|
+
- lib/inci_score/cli.rb
|
148
149
|
- lib/inci_score/computer.rb
|
150
|
+
- lib/inci_score/fetcher.rb
|
149
151
|
- lib/inci_score/levenshtein.rb
|
150
152
|
- lib/inci_score/normalizer.rb
|
151
153
|
- lib/inci_score/normalizer_rules.rb
|
152
|
-
- lib/inci_score/parser.rb
|
153
154
|
- lib/inci_score/recognizer.rb
|
154
155
|
- lib/inci_score/recognizer_rules.rb
|
155
156
|
- lib/inci_score/response.rb
|
@@ -169,7 +170,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
169
170
|
requirements:
|
170
171
|
- - ">="
|
171
172
|
- !ruby/object:Gem::Version
|
172
|
-
version: 2.
|
173
|
+
version: '2.4'
|
173
174
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
174
175
|
requirements:
|
175
176
|
- - ">="
|
@@ -177,7 +178,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
177
178
|
version: '0'
|
178
179
|
requirements: []
|
179
180
|
rubyforge_project:
|
180
|
-
rubygems_version: 2.
|
181
|
+
rubygems_version: 2.6.8
|
181
182
|
signing_key:
|
182
183
|
specification_version: 4
|
183
184
|
summary: A library that computes the hazard of cosmetic products components, based
|