inci_score 1.2.1 → 2.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: ffacec4022ea855b06a3eee2eae7813894ae11c8
- data.tar.gz: c53b431b2a74d2f0156d7079f8f866a4326b4c7e
+ metadata.gz: 5970cfdecac8492dbfd510dce7a24488e543233c
+ data.tar.gz: fb5b1171f1fcab479e24b33dc7d7f37582b93741
  SHA512:
- metadata.gz: 26538bc66e1aa5f845130e446a37d674b538173dcff2bbce5b6bc28c7e168979b44dba6b42ab8d80750b16bf92277d09c2eb9f31789a844485ab48601b13762a
- data.tar.gz: ebc439f00b724c3a51bef74e4fac90c838c604a34150d9fd3324a90ed24b618f0092ecc9cb7ed0cb74cf2f20adefc872ef26d5c245800388c568bab003cced68
+ metadata.gz: 42624a99c66bc3fcfb53cff14ebe6a153b220901df9b9e5f49f3d8ec2c9378436cd7090446bb449fefc7320c8ff61fdd5375b551b015312858fac8b8bfa8b66c
+ data.tar.gz: 2f6bcc48dd8727a6b2b9665882cd3a809b849a7876a7a5303b7a8c3438c373fb3e25c28545889f70fcc14aa1724ea5febfbb5a78a555456ec99e44b5ca9de329
data/.travis.yml CHANGED
@@ -1,5 +1,4 @@
  language: ruby
  rvm:
- - 2.2.2
- - 2.3.0
+ - 2.4.0
  before_install: gem install bundler -v 1.11.2
data/README.md CHANGED
@@ -11,9 +11,13 @@
  * [Starting Puma](#starting-puma)
  * [Triggering a request](#triggering-a-request)
  * [CLI API](#cli-api)
- * [Performance](#performance)
+ * [Refresh catalog](#refresh-catalog)
+ * [Benchmark](#benchmark)
  * [Levenshtein in C](#levenshtein-in-c)
- * [Records](#records)
+ * [Platform](#platform)
+ * [Wrk](#wrk)
+ * [Results](#results)
+ * [Ruby 2.4](#ruby-2.4)

  ## Scope
  This gem computes the score of cosmetic components based on the information provided by the [Biodizionario site](http://www.biodizionario.it/) by Fabrizio Zago.
@@ -75,8 +79,8 @@ The Web API exposes the *InciScore* library over HTTP via the [Puma](http://puma

  ### Starting Puma
  Simply start Puma via the *config.ru* file included in the repository by spawning as many workers as your current workstation supports:
- ```
- bundle exec puma -w 7 -t 16:32 --preload
+ ```shell
+ bundle exec puma -w 8 -t 0:2 --preload
  ```
81
85
 
82
86
  ### Triggering a request
@@ -84,7 +88,7 @@ The Web API responds with a JSON object representing the original *InciScore::Re

  You can pass the source string directly as an HTTP parameter:

- ```
+ ```shell
  curl http://127.0.0.1:9292?src=aqua,dimethicone
  => {"components":{"aqua":0,"dimethicone":4},"unrecognized":[],"score":53.762874945799766,"valid":true}
  ```
@@ -92,8 +96,8 @@ curl http://127.0.0.1:9292?src=aqua,dimethicone
  ## CLI API
  You can collect INCI data by using the available binary:

- ```
- inci_score "aqua,dimethicone,pej-10,noent"
+ ```shell
+ inci_score --src="aqua,dimethicone,pej-10,noent"

  TOTAL SCORE:
  47.18034913243358
@@ -107,11 +111,41 @@ UNRECOGNIZED:
  noent
  ```

- ## Performance
+ ### Refresh catalog
+ When using the CLI you have the option to fetch a fresh catalog from remote by specifying a flag:
+ ```shell
+ inci_score --fresh --src="aqua,dimethicone,pej-10,noent"
+ ```
+
+ ## Benchmark
+
+ ### Levenshtein in C
  I noticed the API slows down dramatically when dealing with unrecognized components to fuzzy match on.
  I profiled the code by using the [benchmark-ips](https://github.com/evanphx/benchmark-ips) gem, finding the bottleneck was the pure Ruby implementation of the Levenshtein distance algorithm.
  After some pointless optimization, I replaced this routine with a C implementation: I opted for the straightforward [Ruby Inline](https://github.com/seattlerb/rubyinline) library to call the C code straight from Ruby.
  As a result I got a 10x increase in throughput, all without sacrificing code readability.

- ### Numbers
- I moved the benchmark numbers to the [Crystal porting](https://github.com/costajob/inci_score.cr) of the InciScore library, please look there.
+ ### Platform
+ I recorded these benchmarks on a MacBook Pro 15 (mid 2015) with these specs:
+ * OS X El Capitan
+ * 2.2 GHz Intel Core i7 (4 cores)
+ * 16 GB 1600 MHz DDR3
+
+ ### Wrk
+ As always I used [wrk](https://github.com/wg/wrk) as the load-testing tool.
+ I measured each library three times, picking the best lap.
+ The following command is used:
+
+ ```
+ wrk -t 4 -c 100 -d 30s --timeout 2000 http://127.0.0.1:9292/?src=<list_of_ingredients>
+ ```
+
+ ### Results
+ | Type | Ingredients | Throughput (req/s) | Latency in ms (avg/stdev/max) |
+ | :----------------- | :----------------------- | -----------------: | ----------------------------: |
+ | exact matching | aqua,parfum,zeolite | 48863.58 | 0.31/0.55/10.82 |
+
+ ## Ruby 2.4
+ After upgrading to Ruby 2.4 I doubled the throughput of the matcher: I assume the Ruby optimization of [Hash access](https://blog.heroku.com/ruby-2-4-features-hashes-integers-rounding) is the driving reason.
+ I also adopted the new `#match?` method to avoid creating a MatchData object when I am just checking a predicate.
+ In the end the Ruby upgrade is a big deal for my gem, give it a try!
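
The `#match?` remark in the README can be illustrated with a minimal sketch. The helper names and the sample string are made up for illustration and are not part of the gem; the only behavior relied upon is that `String#match?` (Ruby 2.4 and later) returns a boolean and leaves `$~` unset, while `String#match` returns a MatchData object.

```ruby
# Returns the class of the object produced by String#match
# (MatchData when the pattern matches).
def match_result_class(text, pattern)
  text.match(pattern).class
end

# Returns [boolean result, value of $~ afterwards] for String#match?.
def predicate_check(text, pattern)
  $~ = nil                      # clear any previous match in this frame
  result = text.match?(pattern) # boolean only, $~ is not updated
  [result, $~]
end

puts match_result_class("dimethicone", /\Adime/) # MatchData
p predicate_check("dimethicone", /\Adime/)       # [true, nil]
```

This is why the swap from `match` to `match?` in the recognizer rules is safe there: those call sites only use the result as a truth value.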
data/bin/inci_score CHANGED
@@ -1,7 +1,7 @@
  #!/usr/bin/env ruby
+ lib = File.expand_path("../../lib", __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)

- require 'bundler/setup'
- require 'inci_score'
+ require "inci_score"

- fail ArgumentError, "please specify at least a src argument" if ARGV.empty?
- puts InciScore::Computer.new(ARGV[0], InciScore::Catalog.fetch).call
+ InciScore::CLI.new(args: ARGV.clone).call
data/inci_score.gemspec CHANGED
@@ -14,7 +14,7 @@ Gem::Specification.new do |s|
  s.executables << "inci_score"
  s.require_paths = ["lib"]
  s.license = "MIT"
- s.required_ruby_version = ">= 2.2.2"
+ s.required_ruby_version = ">= 2.4"

  s.add_runtime_dependency "nokogiri", "~> 1.6"
  s.add_runtime_dependency "puma", "~> 3"
@@ -13,7 +13,7 @@ module InciScore
  def call(env)
  req = Rack::Request.new(env)
  src = req.params["src"]
- json = src ? Computer.new(src, catalog).call.to_json : %q({"error": "no valid source"})
+ json = src ? Computer.new(src: src, catalog: catalog).call.to_json : %q({"error": "no valid source"})
  ['200', {'Content-Type' => 'application/json'}, [json]]
  end
  end
@@ -0,0 +1,44 @@
+ require "optparse"
+ require "inci_score/computer"
+
+ module InciScore
+ class CLI
+ def initialize(args:, io: STDOUT, catalog: InciScore::Catalog.fetch)
+ @args = args
+ @io = io
+ @catalog = catalog
+ @src = nil
+ @fresh = nil
+ end
+
+ def call(computer_klass = Computer, fetcher = Fetcher.new)
+ parser.parse!(@args)
+ return @io.puts("Specify inci list as: --src='aqua, parfum, etc'") unless @src
+ @io.puts computer_klass.new(src: @src, catalog: catalog(fetcher)).call
+ end
+
+ private def parser
+ OptionParser.new do |opts|
+ opts.banner = %q{Usage: ./bin/inci_score --src='aqua, parfum, etc' --fresh}
+
+ opts.on("-sSRC", "--src=SRC", "The INCI list: 'aqua, parfum, etc'") do |src|
+ @src = src
+ end
+
+ opts.on("-f", "--fresh", "Fetch a fresh catalog from remote") do |fresh|
+ @fresh = fresh
+ end
+
+ opts.on("-h", "--help", "Prints this help") do
+ @io.puts opts
+ exit
+ end
+ end
+ end
+
+ private def catalog(fetcher)
+ return @catalog unless @fresh
+ fetcher.call
+ end
+ end
+ end
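
To see how the two switches declared in the new CLI class behave, here is a self-contained sketch built on the standard `optparse` library. `parse_cli_args` is a hypothetical helper, not part of the gem; it only mirrors the `--src` and `--fresh` declarations from the diff and returns what was parsed.

```ruby
require "optparse"

# Hypothetical helper (not part of the gem): parses the same two flags
# the CLI class declares and returns them as a Hash.
def parse_cli_args(args)
  options = { src: nil, fresh: false }
  parser = OptionParser.new do |opts|
    opts.on("-sSRC", "--src=SRC", "The INCI list") { |src| options[:src] = src }
    opts.on("-f", "--fresh", "Fetch a fresh catalog") { |fresh| options[:fresh] = fresh }
  end
  # parse! removes recognized switches from the array it receives,
  # so work on a copy, much as bin/inci_score passes ARGV.clone.
  parser.parse!(args.dup)
  options
end

p parse_cli_args(["--src=aqua,parfum", "--fresh"])
p parse_cli_args(["-s", "aqua"])
```

When `--src` is missing the parsed value stays `nil`, which is the case the CLI class answers with its usage hint instead of computing a score.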
@@ -7,9 +7,11 @@ module InciScore
  class Computer
  TOLERANCE = 30.0

- def initialize(src, catalog)
+ def initialize(src:, catalog:, tolerance: TOLERANCE, rules: Normalizer::DEFAULT_RULES)
  @src = src
  @catalog = catalog
+ @tolerance = Float(tolerance)
+ @rules = rules
  @unrecognized = []
  end

@@ -20,17 +22,15 @@ module InciScore
  valid: valid?)
  end

- private
-
- def score
+ private def score
  Scorer.new(components.map(&:last)).call
  end

- def ingredients
- @ingredients ||= Normalizer.new(src: @src).call
+ private def ingredients
+ @ingredients ||= Normalizer.new(src: @src, rules: @rules).call
  end

- def components
+ private def components
  @components ||= ingredients.map do |ingredient|
  Recognizer.new(ingredient, @catalog).call.tap do |component|
  @unrecognized << ingredient unless component
@@ -38,8 +38,8 @@ module InciScore
  end.compact
  end

- def valid?
- @unrecognized.size / (ingredients.size / 100.0) <= TOLERANCE
+ private def valid?
+ @unrecognized.size / (ingredients.size / 100.0) <= @tolerance
  end
  end
  end
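
The `valid?` check above compares the share of unrecognized ingredients, expressed as a percentage, against the tolerance (30.0 unless overridden). A standalone sketch of the same arithmetic follows; `within_tolerance?` and `DEFAULT_TOLERANCE` are hypothetical names used only for illustration.

```ruby
# Hypothetical standalone version of the percentage check in Computer#valid?.
DEFAULT_TOLERANCE = 30.0 # mirrors Computer::TOLERANCE in the diff

def within_tolerance?(unrecognized_count, ingredients_count, tolerance = DEFAULT_TOLERANCE)
  # unrecognized / (total / 100.0) is the unrecognized share as a percentage
  unrecognized_count / (ingredients_count / 100.0) <= Float(tolerance)
end

puts within_tolerance?(1, 4)     # 25% unrecognized, default tolerance
puts within_tolerance?(2, 4)     # 50% unrecognized, default tolerance
puts within_tolerance?(2, 4, 60) # 50% unrecognized, custom tolerance
```

With the new `tolerance:` keyword a caller can loosen or tighten this threshold per computation instead of relying on the constant.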
@@ -1,7 +1,7 @@
  require 'nokogiri'

  module InciScore
- class Parser
+ class Fetcher
  BIODIZIO_URI = 'http://www.biodizionario.it/biodizio.php'
  SEMAPHORES = %w[vv v g r rr]
  CSS_QUERY = 'table[width="751"] > tr > td img'
@@ -2,7 +2,7 @@ require 'inci_score/normalizer_rules'

  module InciScore
  class Normalizer
- DEFAULT_RULES = Rules.constants - [:Base]
+ DEFAULT_RULES = [Rules::Replacer, Rules::Downcaser, Rules::Beheader, Rules::Separator, Rules::Tokenizer, Rules::Sanitizer, Rules::Desynonymizer]

  attr_reader :src

@@ -12,9 +12,9 @@ module InciScore
  end

  def call
- @rules.reduce(@src) do |src, name|
- rule = Rules.const_get(name).new(src)
- src = rule.call
+ yield(@rules) if block_given?
+ @rules.reduce(@src) do |src, rule|
+ @src = rule.call(src)
  end
  end
  end
@@ -1,73 +1,90 @@
  module InciScore
  class Normalizer
  module Rules
- class Base
- SEPARATOR = ','
+ SEPARATOR = ','

- def initialize(src)
- @src = src
- end
+ module Replacer
+ extend self

- def call
- fail NotImplementedError
- end
- end
-
- class Replacer < Base
  REPLACEMENTS = [
  [/\n+|\t+/, ' '],
  ['‘', "'"],
  ['—', '-'],
- ['(', 'C'],
  ['_', ' '],
  ['~', '-'],
  ['|', 'l'],
  [' I ', '/']
  ]

- def call
- REPLACEMENTS.reduce(@src) do |src, replacement|
+ def call(src)
+ REPLACEMENTS.reduce(src) do |_src, replacement|
  invalid, valid = *replacement
- src.index(invalid) ? src.gsub(invalid, valid) : src
+ _src.index(invalid) ? _src.gsub(invalid, valid) : _src
  end
  end
  end

- class Downcaser < Base
- def call
- @src.downcase
+ module Downcaser
+ extend self
+
+ def call(src)
+ src.downcase
  end
  end

- class Beheader < Base
+ module Beheader
+ extend self
+
  TITLE_SEP = ':'
  MAX_INDEX = 50

- def call
- sep_index = @src.index(TITLE_SEP)
- return @src if !sep_index || sep_index > MAX_INDEX
- @src[sep_index+1, @src.size]
+ def call(src)
+ sep_index = src.index(TITLE_SEP)
+ return src if !sep_index || sep_index > MAX_INDEX
+ src[sep_index+1, src.size]
  end
  end

- class Separator < Base
+ module Separator
+ extend self
+
  SEPARATORS = ["; ", ". ", " ' ", " - ", " : "]

- def call
- SEPARATORS.reduce(@src) do |src, separator|
- src = src.gsub(separator, SEPARATOR)
+ def call(src)
+ SEPARATORS.reduce(src) do |_src, separator|
+ _src = _src.gsub(separator, SEPARATOR)
  end
  end
  end

- class Tokenizer < Base
- INVALID_CHARS = /[^\w\s-]/
+ module Tokenizer
+ extend self
+
+ def call(src)
+ src.split(SEPARATOR).map(&:strip)
+ end
+ end
+
+ module Sanitizer
+ extend self
+
+ INVALID_CHARS = /[^\/\(\)\w\s-]/
+
+ def call(src)
+ Array(src).map do |token|
+ token.gsub(INVALID_CHARS, '')
+ end.reject(&:empty?)
+ end
+ end
+
+ module Desynonymizer
+ extend self
+
+ SYNONYM = /\/.*/

- def call
- @src.split(SEPARATOR).map do |token|
- token = token.sub(/\/.*/, '')
- token = token.gsub(INVALID_CHARS, '')
- token = token.strip
+ def call(src)
+ Array(src).map do |token|
+ token.sub(SYNONYM, '').strip
  end.reject(&:empty?)
  end
  end
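
The move from `Base` subclasses to stateless modules means a rule is now anything that responds to `call(src)`, and the normalizer simply threads the source through the list. A minimal self-contained sketch of that pattern follows; the `SketchRules` namespace, the `normalize` helper and the simplified `INVALID_CHARS` pattern are stand-ins for illustration, not the gem's own code.

```ruby
# Illustrative rules in the style of the diff: stateless modules that
# respond to call(src). The module names are stand-ins, not the gem's own.
module SketchRules
  SEPARATOR = ","

  module Downcaser
    extend self # exposes call as a module-level method

    def call(src)
      src.downcase
    end
  end

  module Tokenizer
    extend self

    def call(src)
      src.split(SEPARATOR).map(&:strip)
    end
  end

  module Sanitizer
    extend self

    INVALID_CHARS = /[^\w\s-]/ # simplified pattern, assumed for this sketch

    def call(tokens)
      Array(tokens).map { |token| token.gsub(INVALID_CHARS, "") }.reject(&:empty?)
    end
  end
end

# Feed the output of each rule into the next one, in order.
def normalize(src, rules)
  rules.reduce(src) { |acc, rule| rule.call(acc) }
end

rules = [SketchRules::Downcaser, SketchRules::Tokenizer, SketchRules::Sanitizer]
p normalize("Aqua, Dimethicone*, PEG-10", rules)
# => ["aqua", "dimethicone", "peg-10"]
```

Because order matters (tokenizing turns the string into an array that later rules must accept), an explicit list such as the new `DEFAULT_RULES` is easier to reason about than one derived from `Rules.constants`.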
@@ -2,7 +2,7 @@ require 'inci_score/recognizer_rules'

  module InciScore
  class Recognizer
- DEFAULT_RULES = Rules.constants - [:Base]
+ DEFAULT_RULES = [Rules::Key, Rules::Levenshtein, Rules::Digits, Rules::Tokens]

  def initialize(src, catalog, rules = DEFAULT_RULES)
  @src = src
@@ -11,17 +11,13 @@ module InciScore
  end

  def call
- @component = apply_rules
- return [@component, @catalog[@component]] if @component
- end
-
- private
-
- def apply_rules
- @rules.reduce(nil) do |component, name|
- rule = Rules.const_get(name).new(@src, @catalog)
- component || rule.call
+ @component = @rules.reduce(nil) do |component, rule|
+ break(component) if component
+ _rule = rule.new(@src, @catalog)
+ yield(rule) if block_given?
+ _rule.call
  end
- end
+ [@component, @catalog[@component]] if @component
+ end
  end
  end
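
The rewritten `Recognizer#call` tries each rule in order and leaves the `reduce` early through `break` once a rule has produced a component, so later (and costlier) rules are not instantiated. Here is a self-contained sketch of that control flow; `ExactKey`, `PrefixKey`, `recognize` and the tiny catalog are hypothetical stand-ins, not the gem's rules.

```ruby
# Stand-in rule classes (hypothetical): each takes the source string and a
# catalog Hash, and returns a matching key or nil.
class ExactKey
  def initialize(src, catalog)
    @src = src
    @catalog = catalog
  end

  def call
    @src if @catalog.key?(@src)
  end
end

class PrefixKey
  def initialize(src, catalog)
    @src = src
    @catalog = catalog
  end

  def call
    @catalog.keys.find { |key| key.start_with?(@src[0, 4]) }
  end
end

# Try each rule in order; break out of reduce with the first hit.
def recognize(src, catalog, rules)
  component = rules.reduce(nil) do |found, rule|
    break(found) if found
    rule.new(src, catalog).call
  end
  [component, catalog[component]] if component
end

CATALOG = { "aqua" => 0, "dimethicone" => 4 }.freeze
RULES = [ExactKey, PrefixKey].freeze

p recognize("aqua", CATALOG, RULES)       # => ["aqua", 0]
p recognize("dimeticone", CATALOG, RULES) # => ["dimethicone", 4]
p recognize("noent", CATALOG, RULES)      # => nil
```

Compared with the previous `component || rule.call`, the `break` form also skips building the remaining rule objects after a match.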
@@ -28,12 +28,12 @@ module InciScore
  def call
  size = @src.size
  initial = @src[0]
- component, distance = @catalog.reduce([nil, size]) do |min, (component, _)|
- next min unless component.start_with?(initial)
- match = (n = component.index(ALTERNATE_SEP)) ? component[0, n] : component
+ component, distance = @catalog.reduce([nil, size]) do |min, (_component, _)|
+ next min unless _component.start_with?(initial)
+ match = (n = _component.index(ALTERNATE_SEP)) ? _component[0, n] : _component
  next min if match.size > (size + TOLERANCE)
  dist = @src.distance(match)
- min = [component, dist] if dist < min[1]
+ min = [_component, dist] if dist < min[1]
  min
  end
  component unless distance > TOLERANCE || distance >= (size-1)
@@ -47,7 +47,7 @@ module InciScore
  return if @src.size < TOLERANCE
  digits = @src[0, MIN_MEANINGFUL]
  @catalog.detect do |component, _|
- component.match(/^#{Regexp::escape(digits)}/)
+ component.match?(/^#{Regexp::escape(digits)}/)
  end.to_a.first
  end
  end
@@ -58,7 +58,7 @@ module InciScore
  def call
  tokens.each do |token|
  @catalog.each do |component, _|
- return component if component.match(/\b#{Regexp.escape(token)}\b/)
+ return component if component.match?(/\b#{Regexp.escape(token)}\b/)
  end
  end
  nil
@@ -1,3 +1,3 @@
  module InciScore
- VERSION = "1.2.1"
+ VERSION = "2.0.1"
  end
data/lib/inci_score.rb CHANGED
@@ -1,4 +1,5 @@
+ require 'open-uri'
  require 'inci_score/version'
- require 'inci_score/parser'
+ require 'inci_score/fetcher'
  require 'inci_score/catalog'
- require 'inci_score/computer'
+ require 'inci_score/cli'
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: inci_score
  version: !ruby/object:Gem::Version
- version: 1.2.1
+ version: 2.0.1
  platform: ruby
  authors:
  - costajob
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-09-28 00:00:00.000000000 Z
+ date: 2017-01-03 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: nokogiri
@@ -145,11 +145,12 @@ files:
  - lib/inci_score.rb
  - lib/inci_score/api/app.rb
  - lib/inci_score/catalog.rb
+ - lib/inci_score/cli.rb
  - lib/inci_score/computer.rb
+ - lib/inci_score/fetcher.rb
  - lib/inci_score/levenshtein.rb
  - lib/inci_score/normalizer.rb
  - lib/inci_score/normalizer_rules.rb
- - lib/inci_score/parser.rb
  - lib/inci_score/recognizer.rb
  - lib/inci_score/recognizer_rules.rb
  - lib/inci_score/response.rb
@@ -169,7 +170,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: 2.2.2
+ version: '2.4'
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
@@ -177,7 +178,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.5.1
+ rubygems_version: 2.6.8
  signing_key:
  specification_version: 4
  summary: A library that computes the hazard of cosmetic products components, based