inci_score 1.1.0
- checksums.yaml +7 -0
- data/.gitignore +11 -0
- data/.travis.yml +7 -0
- data/Gemfile +4 -0
- data/README.md +117 -0
- data/Rakefile +18 -0
- data/bin/console +7 -0
- data/bin/inci_score +7 -0
- data/bin/setup +6 -0
- data/config/catalog.yml +5014 -0
- data/config.ru +3 -0
- data/ext/levenshtein.c +43 -0
- data/inci_score.gemspec +28 -0
- data/lib/inci_score/api/app.rb +21 -0
- data/lib/inci_score/catalog.rb +13 -0
- data/lib/inci_score/computer.rb +45 -0
- data/lib/inci_score/levenshtein.rb +55 -0
- data/lib/inci_score/normalizer.rb +21 -0
- data/lib/inci_score/normalizer_rules.rb +76 -0
- data/lib/inci_score/parser.rb +43 -0
- data/lib/inci_score/recognizer.rb +27 -0
- data/lib/inci_score/recognizer_rules.rb +75 -0
- data/lib/inci_score/response.rb +31 -0
- data/lib/inci_score/score.rb +19 -0
- data/lib/inci_score/scorer.rb +45 -0
- data/lib/inci_score/version.rb +3 -0
- data/lib/inci_score.rb +4 -0
- data/log/.gitignore +4 -0
- metadata +170 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 7e4bae1a80f16b3684a4d38be2455c737d0fa58a
  data.tar.gz: e1b1bc6d078e85953e14c071bdd82f678a87cf99
SHA512:
  metadata.gz: a1234e75bf300c6de9de417f133d858876cebcaf90abac76b803ac242b5a88f8a36eea1e99badbb56ca7ef317d269de2dadaaa6502703cce05afa66ff829bdbc
  data.tar.gz: e566482ae9a45829b5bbbe1c2431f35cfb72a68ba6e9eafbb2a32a75e9559a74c8d25819c520d05708f6c82691de1f139bf8628acf665f0b30a52730c4e7552a
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,117 @@
## Table of Contents

* [Scope](#scope)
* [INCI catalog](#inci-catalog)
* [Computation](#computation)
  * [Component matching](#component-matching)
  * [Sources](#sources)
* [API](#api)
  * [Unrecognized components](#unrecognized-components)
* [Web API](#web-api)
  * [Starting Puma](#starting-puma)
  * [Triggering a request](#triggering-a-request)
* [CLI API](#cli-api)
* [Performance](#performance)
  * [Numbers](#numbers)

## Scope
This gem computes the score of cosmetic components based on the information provided by the [Biodizionario site](http://www.biodizionario.it/) by Fabrizio Zago.

## INCI catalog
The [INCI](https://en.wikipedia.org/wiki/International_Nomenclature_of_Cosmetic_Ingredients) catalog is fetched directly from the Biodizionario site and kept in memory.
Currently there are more than 5000 components, each with a hazard score that ranges from 0 (safe) to 4 (dangerous).

## Computation
The computation scores each component of the cosmetic based on:
* its hazard, according to the Biodizionario score
* its position in the list of ingredients

The total score is then calculated on a percent basis.
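
To make the idea of a position-weighted, percent-based score concrete, here is a minimal sketch. It is not the gem's formula: the linear weighting, the `position_weighted_score` name and the `MAX_HAZARD` constant are assumptions made for illustration, so it will not reproduce the scores shown elsewhere in this README.

```ruby
# Hypothetical sketch, NOT the formula used by InciScore.
MAX_HAZARD = 4.0 # highest hazard value in the catalog (0 safe .. 4 dangerous)

# hazards: hazard scores in the order the ingredients are listed.
def position_weighted_score(hazards)
  return 0.0 if hazards.empty? # assumed behaviour for an empty list

  size = hazards.size
  # Assumed linear weights: the first ingredient counts 1.0, the last 1/size.
  weights = Array.new(size) { |i| (size - i) / size.to_f }
  weighted_avg = hazards.zip(weights).sum { |h, w| h * w } / weights.sum

  # Map the 0..4 weighted hazard onto a 0..100 scale, where higher is safer.
  100.0 * (1 - weighted_avg / MAX_HAZARD)
end

puts position_weighted_score([0, 4]) # hazardous component listed last
puts position_weighted_score([4, 0]) # hazardous component listed first
```

With these assumed weights, the same hazardous component lowers the score more when it appears earlier in the list.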

### Component matching
Since the ingredients list could come from an unreliable source (e.g. data scanned from a captured image), the gem tries to fuzzy match the ingredients by using different algorithms:
* exact matching
* [edit distance](https://en.wikipedia.org/wiki/Levenshtein_distance) within a specified tolerance
* matching on the first relevant digits
* matching split tokens
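
The cascade above can be sketched as follows. This is an illustration under assumptions, not the gem's recognizer: the three-entry `CATALOG`, the `TOLERANCE` and `PREFIX_SIZE` values and the `recognize` and `levenshtein` helpers are all hypothetical.

```ruby
# Hypothetical sketch of the matching cascade, not the gem's implementation.
# Tiny stand-in catalog; hazard values are the ones shown in this README.
CATALOG = { 'aqua' => 0, 'dimethicone' => 4, 'peg-10' => 3 }.freeze
TOLERANCE = 2   # assumed maximum edit distance
PREFIX_SIZE = 5 # assumed number of leading characters to compare

# Two-row dynamic programming Levenshtein distance.
def levenshtein(a, b)
  return b.size if a.empty?
  return a.size if b.empty?

  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Returns the matching catalog key, or nil when nothing matches.
def recognize(ingredient, catalog = CATALOG)
  name = ingredient.strip.downcase
  return name if catalog.key?(name) # 1. exact matching

  nearest = catalog.keys.min_by { |key| levenshtein(name, key) }
  return nearest if nearest && levenshtein(name, nearest) <= TOLERANCE # 2. edit distance

  if name.size >= PREFIX_SIZE # 3. leading characters
    by_prefix = catalog.keys.find { |key| key.start_with?(name[0, PREFIX_SIZE]) }
    return by_prefix if by_prefix
  end

  # 4. split tokens: accept the first token that is itself a catalog entry
  name.split(/[^a-z0-9-]+/).find { |token| catalog.key?(token) }
end

p recognize('pej-10')       # one substitution away from "peg-10"
p recognize('aqua (water)') # matched through its first token
```

Ordering the strategies from strictest to loosest keeps clean input cheap and leaves the expensive edit-distance scan for names that fail the exact lookup.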

### Sources
The library accepts the list of ingredients as a single string of text. Since this source could come from an OCR program, the library normalizes it by stripping invalid characters and removing the unimportant parts.
The ingredients are typically separated by commas, although the normalizer will detect the most appropriate separator:

```
"Ingredients: Aqua, Disodium Laureth Sulfosuccinate, Cocamidopropiyl\nBetaine"
```
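
As a rough illustration of that normalization, the sketch below drops the leading label, flattens line breaks, strips unexpected characters and splits on the most frequent separator. The `normalize` helper, the `CANDIDATE_SEPARATORS` list and the set of characters treated as valid are assumptions, not the gem's actual rules.

```ruby
# Hypothetical sketch of source normalization, not the gem's normalizer rules.
CANDIDATE_SEPARATORS = [',', ';', '|'].freeze # assumed candidates

def normalize(src)
  text = src.downcase
            .sub(/\A\s*ingredients\s*:/, '') # drop the leading label
            .gsub(/\s+/, ' ')                # OCR line breaks become spaces
            .gsub(%r{[^a-z0-9 ,;|()/-]}, '') # strip characters assumed invalid
  # Pick the candidate separator that occurs most often in the text.
  separator = CANDIDATE_SEPARATORS.max_by { |sep| text.count(sep) }
  text.split(separator).map(&:strip).reject(&:empty?)
end

p normalize("Ingredients: Aqua, Disodium Laureth Sulfosuccinate, Cocamidopropiyl\nBetaine")
```

The misspelled "Cocamidopropiyl" survives normalization on purpose: correcting it is left to the fuzzy matching step.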

## API
The API of the gem is pretty simple: open irb by running *bundle console* and start computing the INCI score:

```ruby
inci = InciScore::Computer.new(src: 'aqua, dimethicone').call
=> #<InciScore::Response:0x000000029f8100 @components={"aqua"=>0, "dimethicone"=>4}, @score=53.762874945799766, @unrecognized=[], @valid=true>
inci.score
=> 53.762874945799766
```

As you can see, the results are wrapped in an *InciScore::Response* object, which is useful when dealing with the Web API (see below) and when printing them to standard output.

### Unrecognized components
The API treats unrecognized components as a common case: it marks the object as not valid and raises a warning when more than 30% of the ingredients are not found.
In that case the score is still computed, considering only the recognized components.
It is still possible to query the object for its state:

```ruby
inci = InciScore::Computer.new(src: 'ingredients:aqua,noent1,noent2').call
=> #<InciScore::Response:0x000000030c16d0 @components={"aqua"=>0}, @score=100.0, @unrecognized=["noent1", "noent2"], @valid=false>
inci.valid
=> false
inci.unrecognized
=> ["noent1", "noent2"]
```
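
The 30% rule can be expressed in a few lines. This is a sketch consistent with the examples in this README, not the gem's code: the `valid?` helper, the `UNRECOGNIZED_TOLERANCE` constant and the handling of an empty list are assumptions.

```ruby
# Hypothetical sketch of the validity check, not the gem's implementation.
UNRECOGNIZED_TOLERANCE = 30.0 # percent of ingredients allowed to be unknown

# components: recognized names mapped to hazards; unrecognized: list of names.
def valid?(components, unrecognized)
  total = components.size + unrecognized.size
  return false if total.zero? # assumed: an empty list is never valid

  unrecognized.size * 100.0 / total <= UNRECOGNIZED_TOLERANCE
end

p valid?({ 'aqua' => 0 }, %w[noent1 noent2])                            # 2 of 3 unknown
p valid?({ 'aqua' => 0, 'dimethicone' => 4, 'peg-10' => 3 }, %w[noent]) # 1 of 4 unknown
```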

## Web API
The Web API exposes the *InciScore* library over HTTP via the [Puma](http://puma.io/) application server.

### Starting Puma
Simply start Puma via the *config.ru* file included in the repository, spawning as many workers as your workstation supports:
```
bundle exec puma -w 8 -t 16:32 --preload
```

### Triggering a request
The Web API responds with a JSON object representing the original *InciScore::Response* one.

You can pass the source string directly as an HTTP parameter:

```
curl "http://127.0.0.1:9292?src=aqua,dimethicone"
=> {"components":{"aqua":0,"dimethicone":4},"unrecognized":[],"score":53.762874945799766,"valid":true}
```
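
The same request can be issued from Ruby with the standard library. The sketch below assumes the server from the curl example is listening on `127.0.0.1:9292`; the `score_uri`, `parse_response` and `fetch_score` helpers are hypothetical names, not part of the gem.

```ruby
# Hypothetical client sketch built on the Ruby standard library only.
require 'json'
require 'net/http'
require 'uri'

BASE_URL = 'http://127.0.0.1:9292' # host and port taken from the curl example

# Builds the request URI, URL-encoding the comma-separated ingredients.
def score_uri(ingredients, base = BASE_URL)
  uri = URI(base)
  uri.query = URI.encode_www_form(src: ingredients.join(','))
  uri
end

# Turns the JSON body into a plain Hash with string keys.
def parse_response(body)
  JSON.parse(body)
end

# Performs the GET request; requires the Puma server to be running.
def fetch_score(ingredients)
  parse_response(Net::HTTP.get(score_uri(ingredients)))
end

puts score_uri(%w[aqua dimethicone])
```

Encoding the query through `URI.encode_www_form` keeps the request valid when ingredient names contain spaces or parentheses.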

## CLI API
You can collect INCI data by using the available binary:

```
bin/inci_score "aqua,dimethicone,pej-10,noent"

TOTAL SCORE:
47.18034913243358
VALID STATE:
true
COMPONENTS (hazard - name):
0 - aqua
4 - dimethicone
3 - peg-10
UNRECOGNIZED:
noent
```

## Performance
I noticed the APIs slow down dramatically when they have to fuzzy match unrecognized components.
I profiled the code by using the [benchmark-ips](https://github.com/evanphx/benchmark-ips) gem and found that the bottleneck was the pure Ruby implementation of the Levenshtein distance algorithm.
After some pointless optimization, I replaced this routine with a C implementation: I opted for the straightforward [Ruby Inline](https://github.com/seattlerb/rubyinline) library to call the C code straight from Ruby.
As a result I got a 10x increase in throughput, all without sacrificing code readability.

### Numbers
I moved the benchmark numbers to the [Crystal port](https://github.com/costajob/inci_score.cr) of the InciScore library; please look there.
data/Rakefile
ADDED
@@ -0,0 +1,18 @@
require 'bundler/gem_tasks'
require 'rake/testtask'

namespace :spec do
  Rake::TestTask.new(:unit) do |t|
    t.libs << 'spec'
    t.libs << 'lib'
    t.test_files = FileList['spec/unit/*_spec.rb']
  end

  Rake::TestTask.new(:integration) do |t|
    t.libs << 'spec'
    t.libs << 'lib'
    t.test_files = FileList['spec/integration/*_spec.rb']
  end
end

task :default => :"spec:unit"
data/bin/console
ADDED
data/bin/inci_score
ADDED