inci_score 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +11 -0
- data/.travis.yml +7 -0
- data/Gemfile +4 -0
- data/README.md +117 -0
- data/Rakefile +18 -0
- data/bin/console +7 -0
- data/bin/inci_score +7 -0
- data/bin/setup +6 -0
- data/config/catalog.yml +5014 -0
- data/config.ru +3 -0
- data/ext/levenshtein.c +43 -0
- data/inci_score.gemspec +28 -0
- data/lib/inci_score/api/app.rb +21 -0
- data/lib/inci_score/catalog.rb +13 -0
- data/lib/inci_score/computer.rb +45 -0
- data/lib/inci_score/levenshtein.rb +55 -0
- data/lib/inci_score/normalizer.rb +21 -0
- data/lib/inci_score/normalizer_rules.rb +76 -0
- data/lib/inci_score/parser.rb +43 -0
- data/lib/inci_score/recognizer.rb +27 -0
- data/lib/inci_score/recognizer_rules.rb +75 -0
- data/lib/inci_score/response.rb +31 -0
- data/lib/inci_score/score.rb +19 -0
- data/lib/inci_score/scorer.rb +45 -0
- data/lib/inci_score/version.rb +3 -0
- data/lib/inci_score.rb +4 -0
- data/log/.gitignore +4 -0
- metadata +170 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 7e4bae1a80f16b3684a4d38be2455c737d0fa58a
+  data.tar.gz: e1b1bc6d078e85953e14c071bdd82f678a87cf99
+SHA512:
+  metadata.gz: a1234e75bf300c6de9de417f133d858876cebcaf90abac76b803ac242b5a88f8a36eea1e99badbb56ca7ef317d269de2dadaaa6502703cce05afa66ff829bdbc
+  data.tar.gz: e566482ae9a45829b5bbbe1c2431f35cfb72a68ba6e9eafbb2a32a75e9559a74c8d25819c520d05708f6c82691de1f139bf8628acf665f0b30a52730c4e7552a
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,117 @@
+## Table of Contents
+
+* [Scope](#scope)
+* [INCI catalog](#inci-catalog)
+* [Computation](#computation)
+  * [Component matching](#component-matching)
+  * [Sources](#sources)
+* [API](#api)
+  * [Unrecognized components](#unrecognized-components)
+* [Web API](#web-api)
+  * [Starting Puma](#starting-puma)
+  * [Triggering a request](#triggering-a-request)
+* [CLI API](#cli-api)
+* [Performance](#performance)
+  * [Levenshtein in C](#levenshtein-in-c)
+  * [Numbers](#numbers)
+
+## Scope
+This gem computes the score of cosmetic components based on the information provided by the [Biodizionario site](http://www.biodizionario.it/) by Fabrizio Zago.
+
+## INCI catalog
+The [INCI](https://en.wikipedia.org/wiki/International_Nomenclature_of_Cosmetic_Ingredients) catalog is fetched directly from the Biodizionario site and kept in memory.
+Currently there are more than 5000 components, each with a hazard score that ranges from 0 (safe) to 4 (dangerous).
+
+## Computation
+The computation scores each component of the cosmetic based on:
+* its hazard, according to the Biodizionario score
+* its position in the list of ingredients
+
+The total score is then calculated on a percent basis.
+
+### Component matching
+Since the ingredients list could come from an unreliable source (e.g. data scanned from a captured image), the gem tries to fuzzy match the ingredients by using different algorithms:
+* exact matching
+* [edit distance](https://en.wikipedia.org/wiki/Levenshtein_distance) within a specified tolerance
+* first relevant matching digits
+* matching split tokens
+
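The first two strategies in the list above (exact matching, then edit distance within a tolerance) can be sketched in plain Ruby as follows. This is only an illustration, not the gem's own code: the method names `levenshtein` and `recognize`, the default tolerance of 1, and the three-entry catalog are assumptions made for the example; the hazard values are the ones shown in the README output.

```ruby
# Two-row dynamic-programming Levenshtein distance (illustrative, pure Ruby).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical matcher: exact lookup first, then the closest catalog name
# whose edit distance stays within the tolerance. Returns nil when nothing fits.
def recognize(name, catalog, tolerance: 1)
  return name if catalog.key?(name)
  closest = catalog.keys.min_by { |known| levenshtein(name, known) }
  closest if closest && levenshtein(name, closest) <= tolerance
end

# Tiny stand-in for the real catalog (assumption for the example).
catalog = { "aqua" => 0, "dimethicone" => 4, "peg-10" => 3 }
matched = recognize("pej-10", catalog)
missing = recognize("noent", catalog)
```

With this sketch, the misspelled `pej-10` resolves to `peg-10`, while `noent` stays unmatched, mirroring the CLI example in the README.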
+### Sources
+The library accepts the list of ingredients as a single string of text. Since this source could come from an OCR program, the library normalizes it by stripping invalid characters and removing the unimportant parts.
+The ingredients are typically separated by commas, although the normalizer will detect the most appropriate separator:
+
+```
+"Ingredients: Aqua, Disodium Laureth Sulfosuccinate, Cocamidopropiyl\nBetaine"
+```
+
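A rough sketch of what such a normalization step could look like is given below. It is not the gem's normalizer: the method name `normalize`, the candidate separators, and the set of characters treated as valid are all assumptions chosen for illustration. The idea is to drop the leading label, collapse OCR line breaks, pick the most frequent candidate separator, and clean each token.

```ruby
# Hypothetical normalizer: the separators and the allowed character set
# are assumptions, not taken from the gem.
def normalize(src, separators: [",", ";", "|"])
  text = src.downcase
            .sub(/\A\s*ingredients\s*:/, "") # drop the leading label
            .gsub(/\s+/, " ")                 # line breaks and tabs become single spaces
  # The most frequent candidate is assumed to be the separator.
  separator = separators.max_by { |candidate| text.count(candidate) }
  text.split(separator)
      .map { |token| token.gsub(/[^a-z0-9\- ]/, "").strip }
      .reject(&:empty?)
end

tokens = normalize("Ingredients: Aqua, Disodium Laureth Sulfosuccinate, Cocamidopropiyl\nBetaine")
```

Applied to the sample string from the README, this yields three lowercase tokens, with the line break inside the last ingredient turned into a space.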
+## API
+The API of the gem is pretty simple: open irb via *bundle console* and start computing the INCI score:
+
+```ruby
+inci = InciScore::Computer.new(src: 'aqua, dimethicone').call
+=> #<InciScore::Response:0x000000029f8100 @components={"aqua"=>0, "dimethicone"=>4}, @score=53.762874945799766, @unrecognized=[], @valid=true>
+inci.score
+=> 53.762874945799766
+```
+
+As you can see, the results are wrapped in an *InciScore::Response* object. This is useful when dealing with the Web API (read below) and when printing them to standard output.
+
+### Unrecognized components
+The API treats unrecognized components as a common case by just marking the object as not valid and raising a warning when more than 30% of the ingredients are not found.
+In that case the score is computed anyway by considering only the recognized components.
+It is still possible to query the object for its state:
+
+```ruby
+inci = InciScore::Computer.new(src: 'ingredients:aqua,noent1,noent2').call
+=> #<InciScore::Response:0x000000030c16d0 @components={"aqua"=>0}, @score=100.0, @unrecognized=["noent1", "noent2"], @valid=false>
+inci.valid
+=> false
+inci.unrecognized
+=> ["noent1", "noent2"]
+```
+
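One way to read the 30% rule is as a threshold on the share of unrecognized ingredients. The sketch below assumes that reading; the method name `valid_state?` and its keyword argument are hypothetical and do not come from the gem. It agrees with the two outputs shown in the README: two of three ingredients missing gives a non valid result, one of four missing stays valid.

```ruby
# Hypothetical validity check: valid as long as the percentage of
# unrecognized ingredients does not exceed the threshold (assumed 30%).
def valid_state?(components, unrecognized, max_unrecognized_percent: 30.0)
  total = components.size + unrecognized.size
  return false if total.zero?
  (unrecognized.size * 100.0 / total) <= max_unrecognized_percent
end

mostly_unknown = valid_state?({ "aqua" => 0 }, ["noent1", "noent2"])
mostly_known = valid_state?({ "aqua" => 0, "dimethicone" => 4, "peg-10" => 3 }, ["noent"])
```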
+## Web API
+The Web API exposes the *InciScore* library over HTTP via the [Puma](http://puma.io/) application server.
+
+### Starting Puma
+Simply start Puma via the *config.ru* file included in the repository, spawning as many workers as your workstation supports:
+```
+bundle exec puma -w 8 -t 16:32 --preload
+```
+
+### Triggering a request
+The Web API responds with a JSON object representing the original *InciScore::Response*.
+
+You can pass the source string directly as an HTTP parameter:
+
+```
+curl http://127.0.0.1:9292?src=aqua,dimethicone
+=> {"components":{"aqua":0,"dimethicone":4},"unrecognized":[],"score":53.762874945799766,"valid":true}
+```
+
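The same request can be issued from Ruby with the standard library. The sketch below assumes the server is listening on 127.0.0.1:9292 as in the curl example and reads the `src` parameter at the root path; the helper names `inci_uri` and `fetch_inci` are made up for this example. Percent-encoding the parameter keeps commas and spaces in the ingredient list from being misread.

```ruby
require "json"
require "net/http"
require "uri"

# Build the request URI, percent-encoding the ingredient list.
def inci_uri(src, host: "127.0.0.1", port: 9292)
  URI::HTTP.build(host: host, port: port, path: "/",
                  query: URI.encode_www_form(src: src))
end

# Perform the GET request and parse the JSON body into a Hash.
# Requires a running server, so it is defined here but not called.
def fetch_inci(src)
  JSON.parse(Net::HTTP.get(inci_uri(src)))
end

uri = inci_uri("aqua,dimethicone")

# Parsing the sample body from the README shows the shape of the result.
sample = JSON.parse('{"components":{"aqua":0,"dimethicone":4},"unrecognized":[],"score":53.762874945799766,"valid":true}')
```

Once the server is up, `fetch_inci("aqua,dimethicone")` should return a Hash with the same keys as `sample`.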
+## CLI API
+You can collect INCI data by using the available binary:
+
+```
+bin/inci_score "aqua,dimethicone,pej-10,noent"
+
+TOTAL SCORE:
+47.18034913243358
+VALID STATE:
+true
+COMPONENTS (hazard - name):
+0 - aqua
+4 - dimethicone
+3 - peg-10
+UNRECOGNIZED:
+noent
+```
+
+## Performance
+I noticed the API slows down dramatically when it has to fuzzy match unrecognized components.
+I profiled the code by using the [benchmark-ips](https://github.com/evanphx/benchmark-ips) gem, finding the bottleneck was the pure Ruby implementation of the Levenshtein distance algorithm.
+After some pointless optimization, I replaced this routine with a C implementation: I opted for the straightforward [Ruby Inline](https://github.com/seattlerb/rubyinline) library to call the C code straight from Ruby.
+As a result I got a 10x increase in throughput, all without sacrificing code readability.
+
+### Numbers
+I moved the benchmark numbers to the [Crystal port](https://github.com/costajob/inci_score.cr) of the InciScore library; please look there.
data/Rakefile
ADDED
@@ -0,0 +1,18 @@
+require 'bundler/gem_tasks'
+require 'rake/testtask'
+
+namespace :spec do
+  Rake::TestTask.new(:unit) do |t|
+    t.libs << 'spec'
+    t.libs << 'lib'
+    t.test_files = FileList['spec/unit/*_spec.rb']
+  end
+
+  Rake::TestTask.new(:integration) do |t|
+    t.libs << 'spec'
+    t.libs << 'lib'
+    t.test_files = FileList['spec/integration/*_spec.rb']
+  end
+end
+
+task :default => :"spec:unit"
data/bin/console
ADDED
data/bin/inci_score
ADDED