food_ingredient_parser 1.0.0.pre.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 7d20d204e60d22f23f7aa3a039450629c8f4c729
4
+ data.tar.gz: 127801cb5a00b81ca6abf244d92bfebb39ddde28
5
+ SHA512:
6
+ metadata.gz: 3a79c07b4f193fcfbbd08f62fdd18407fc034ea036d5b4b5eb3278e449e337e1ee5e0ad5f468bab5f8e948746e99a6bcf5d686e13277b39dafb9c3b11d3ca531
7
+ data.tar.gz: 95c224f63934e3c601b62d0db9a7d8dc6070fe008816298e3e81f435751012d1bbf66d6270259c4f6e9252a9df6e6d0cf15081e8020e99853c0abb7af55b1a96
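The checksums above can be compared against the archive members by hand. A `.gem` file is a tar archive that holds `metadata.gz` and `data.tar.gz`; the following is a minimal sketch, assuming both members have already been extracted to the working directory. The helper name `checksum_matches?` is made up for this illustration and is not part of RubyGems.

```ruby
require 'digest'

# Hypothetical helper (not a RubyGems API): compare the hex digest of some
# bytes with an expected value, ignoring letter case in the expected string.
def checksum_matches?(bytes, expected_hex, algorithm: Digest::SHA512)
  algorithm.hexdigest(bytes) == expected_hex.downcase
end

# Usage, assuming data.tar.gz was extracted from the .gem archive first and
# `expected` holds the SHA512 value listed for it in checksums.yaml:
#   checksum_matches?(File.binread('data.tar.gz'), expected)
```

Passing `algorithm: Digest::SHA1` checks the shorter SHA1 values in the same way.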
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2018 Questionmark
4
+ Copyright (c) 2018 wvengen
5
+
6
+ Permission is hereby granted, free of charge, to any person obtaining a copy
7
+ of this software and associated documentation files (the "Software"), to deal
8
+ in the Software without restriction, including without limitation the rights
9
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10
+ copies of the Software, and to permit persons to whom the Software is
11
+ furnished to do so, subject to the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be included in all
14
+ copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,115 @@
1
+ # Food ingredient parser
2
+
3
+ Ingredients are listed on food products in various ways. This [Ruby](https://www.ruby-lang.org/)
4
+ gem and program parses the ingredient text and returns a structured representation.
5
+
6
+ ## Installation
7
+
8
+ ```
9
+ gem install food_ingredient_parser
10
+ ```
11
+
12
+ This will also install the dependency [treetop](http://cjheath.github.io/treetop).
13
+ If you want colored output for the test program, also install [pry](http://pryrepl.org/): `gem install pry`.
14
+
15
+ ## Example
16
+
17
+ ```ruby
18
+ require 'food_ingredient_parser'
19
+
20
+ s = "Water* 60%, suiker 30%, voedingszuren: citroenzuur, appelzuur, zuurteregelaar: E576/E577, " \
21
+ + "natuurlijke citroen-limoen aroma's 0,2%, zoetstof: steviolglycosiden, * = Biologisch. " \
22
+ + "E = door de E.U. goedgekeurde toevoeging."
23
+ parser = FoodIngredientParser::Parser.new
24
+ puts parser.parse(s).to_h.inspect
25
+ ```
26
+ Results in
27
+ ```
28
+ {
29
+ :contains=>[
30
+ {:name=>"Water", :amount=>"60%", :mark=>"*"},
31
+ {:name=>"suiker", :amount=>"30%"},
32
+ {:name=>"voedingszuren", :contains=>[
33
+ {:name=>"citroenzuur"}
34
+ ]},
35
+ {:name=>"appelzuur"},
36
+ {:name=>"zuurteregelaar", :contains=>[
37
+ {:name=>"E576"},
38
+ {:name=>"E577"}
39
+ ]},
40
+ {:name=>"natuurlijke citroen-limoen aroma's", :amount=>"0,2%"},
41
+ {:name=>"zoetstof", :contains=>[
42
+ {:name=>"steviolglycosiden"}
43
+ ]}
44
+ ],
45
+ :notes=>[
46
+ "* = Biologisch",
47
+ "E = door de E.U. goedgekeurde toevoeging"
48
+ ]
49
+ }
50
+ ```
51
+
52
+ ## Test tool
53
+
54
+ The executable `food_ingredient_parser` is available after installing the gem. If you're
55
+ running this from the source tree, use `bin/food_ingredient_parser` instead.
56
+
57
+ ```
58
+ $ food_ingredient_parser -h
59
+ Usage: food_ingredient_parser [options] --file|-f <filename>
60
+ food_ingredient_parser [options] --string|-s <ingredients>
61
+
62
+ -f, --file FILE Parse all lines of the file as ingredient lists.
63
+ -s, --string INGREDIENTS Parse specified ingredient list.
64
+ -q, --[no-]quiet Only show summary.
65
+ -p, --parsed Only show lines that were successfully parsed.
66
+ -e, --escape Escape newlines
67
+ -c, --[no-]color Use color
68
+ -n, --noresult Only show lines that had no result.
69
+ -v, --[no-]verbose Show more data (parsed tree).
70
+ --version Show program version.
71
+ -h, --help Show this help
72
+
73
+ $ food_ingredient_parser -v -s "tomato"
74
+ "tomato"
75
+ RootNode+Root3 offset=0, "tomato" (contains,notes):
76
+ SyntaxNode offset=0, ""
77
+ SyntaxNode offset=0, ""
78
+ SyntaxNode offset=0, ""
79
+ ListNode+List13 offset=0, "tomato" (contains):
80
+ SyntaxNode+List12 offset=0, "tomato" (ingredient):
81
+ SyntaxNode+Ingredient0 offset=0, "tomato":
82
+ SyntaxNode offset=0, ""
83
+ IngredientNode+IngredientSimpleWithAmount3 offset=0, "tomato" (ing):
84
+ IngredientNode+IngredientSimple5 offset=0, "tomato" (name):
85
+ SyntaxNode+IngredientSimple4 offset=0, "tomato" (word):
86
+ SyntaxNode offset=0, "tomato":
87
+ SyntaxNode offset=0, "t"
88
+ SyntaxNode offset=1, "o"
89
+ SyntaxNode offset=2, "m"
90
+ SyntaxNode offset=3, "a"
91
+ SyntaxNode offset=4, "t"
92
+ SyntaxNode offset=5, "o"
93
+ SyntaxNode offset=6, ""
94
+ SyntaxNode offset=6, ""
95
+ SyntaxNode offset=6, ""
96
+ SyntaxNode+Root2 offset=6, "":
97
+ SyntaxNode offset=6, ""
98
+ SyntaxNode offset=6, ""
99
+ SyntaxNode offset=6, ""
100
+ SyntaxNode offset=6, ""
101
+ {:contains=>[{:name=>"tomato"}]}
102
+
103
+ $ food_ingredient_parser -q -f data/test-cases
104
+ parsed 35 (100.0%), no result 0 (0.0%)
105
+ ```
106
+
107
+ If you want to use the output in (shell)scripts, the options `-e -c` may be quite useful.
108
+
109
+ ## Test data
110
+
111
+ [`data/ingredient-samples-nl`](data/ingredient-samples-nl) contains about 150k
112
+ real-world ingredient lists found on the Dutch market. Each line contains one ingredient
113
+ list (newlines are encoded as `\n`, empty lines and those starting with `#` are ignored).
114
+ Currently almost three quarters are recognized and parsed. We aim to reach at least 90%.
115
+
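The hash shown in the README example nests ingredients under `:contains`. As a sketch of how such a result might be consumed, the hypothetical helper below (not part of the gem) flattens it into a list of names. The input is a hand-copied subset of the README output, so no parser is needed to run it.

```ruby
# Hypothetical helper, not part of the gem: collect every ingredient name,
# depth first, from a hash shaped like the result of `to_h`.
def ingredient_names(node)
  Array(node[:contains]).flat_map do |ingredient|
    [ingredient[:name], *ingredient_names(ingredient)]
  end
end

# Subset of the README output, written out by hand.
result = {
  contains: [
    { name: 'Water', amount: '60%', mark: '*' },
    { name: 'voedingszuren', contains: [{ name: 'citroenzuur' }] },
    { name: 'appelzuur' }
  ],
  notes: ['* = Biologisch']
}

p ingredient_names(result)
# => ["Water", "voedingszuren", "citroenzuur", "appelzuur"]
```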
data/bin/food_ingredient_parser ADDED
@@ -0,0 +1,107 @@
1
+ #!/usr/bin/env ruby
2
+ #
3
+ # Parser for food ingredient lists.
4
+ #
5
+ require 'optparse'
6
+
7
+ $:.push(File.expand_path(File.dirname(__FILE__) + "/../lib"))
8
+ require 'food_ingredient_parser'
9
+
10
+ begin
11
+ require 'pry'
12
+ def pp(o, color: true)
13
+ if color
14
+ Pry::ColorPrinter.pp(o)
15
+ else
16
+ puts(o.inspect)
17
+ end
18
+ end
19
+ rescue LoadError
20
+ # fallback without color printing
21
+ def pp(o, color: nil)
22
+ puts(o.inspect)
23
+ end
24
+ end
25
+
26
+ def colorize(color, s)
27
+ if color
28
+ "\e[#{color}m#{s}\e[0;22m"
29
+ else
30
+ s
31
+ end
32
+ end
33
+
34
+ def parse_single(s, parsed=nil, parser: nil, verbosity: 1, print: nil, escape: false, color: false)
35
+ parser ||= FoodIngredientParser::Parser.new
36
+ parsed ||= parser.parse(s)
37
+
38
+ return unless print.nil? || (parsed && print == :parsed) || (!parsed && print == :noresult)
39
+
40
+ puts colorize(color && "0;32", escape ? s.gsub("\n", "\\n") : s) if verbosity > 0
41
+
42
+ if parsed
43
+ puts(parsed.inspect) if verbosity > 1
44
+ pp(parsed.to_h, color: color) if verbosity > 0
45
+ else
46
+ puts "(no result: #{parser.failure_reason})" if verbosity > 0
47
+ end
48
+ end
49
+
50
+ def parse_file(path, parser: nil, verbosity: 1, print: nil, escape: false, color: false)
51
+ count_parsed = count_noresult = 0
52
+ File.foreach(path) do |line|
53
+ next if line =~ /^#/ # comment
54
+ next if line =~ /^\s*$/ # empty line
55
+
56
+ line = line.gsub('\\n', "\n").strip
57
+ parsed = parser.parse(line)
58
+ count_parsed += 1 if parsed
59
+ count_noresult += 1 unless parsed
60
+
61
+ parse_single(line, parsed, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color)
62
+ end
63
+
64
+ pct_parsed = 100.0 * count_parsed / (count_parsed + count_noresult)
65
+ pct_noresult = 100.0 * count_noresult / (count_parsed + count_noresult)
66
+ puts "parsed #{colorize(color && "1;32", count_parsed)} (#{pct_parsed.round(1)}%), no result #{colorize(color && "1;31", count_noresult)} (#{pct_noresult.round(1)}%)"
67
+ end
68
+
69
+ verbosity = 1
70
+ files = []
71
+ strings = []
72
+ print = nil
73
+ escape = false
74
+ color = true
75
+ OptionParser.new do |opts|
76
+ opts.banner = <<-EOF.gsub(/^ /, '')
77
+ Usage: #{$0} [options] --file|-f <filename>
78
+ #{$0} [options] --string|-s <ingredients>
79
+
80
+ EOF
81
+
82
+ opts.on("-f", "--file FILE", "Parse all lines of the file as ingredient lists.") {|f| files << f }
83
+ opts.on("-s", "--string INGREDIENTS", "Parse specified ingredient list.") {|s| strings << s }
84
+
85
+ opts.on("-q", "--[no-]quiet", "Only show summary.") {|q| verbosity = q ? 0 : 1 }
86
+ opts.on("-p", "--parsed", "Only show lines that were successfully parsed.") {|p| print = :parsed }
87
+ opts.on("-e", "--escape", "Escape newlines") {|e| escape = true }
88
+ opts.on("-c", "--[no-]color", "Use color") {|e| color = !!e }
89
+ opts.on("-n", "--noresult", "Only show lines that had no result.") {|p| print = :noresult }
90
+ opts.on("-v", "--[no-]verbose", "Show more data (parsed tree).") {|v| verbosity = v ? 2 : 1 }
91
+ opts.on( "--version", "Show program version.") do
92
+ puts("food_ingredient_parser v#{FoodIngredientParser::VERSION}")
93
+ exit
94
+ end
95
+ opts.on("-h", "--help", "Show this help") do
96
+ puts(opts)
97
+ exit
98
+ end
99
+ end.parse!
100
+
101
+ if strings.any? || files.any?
102
+ parser = FoodIngredientParser::Parser.new
103
+ strings.each {|s| parse_single(s, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
104
+ files.each {|f| parse_file(f, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
105
+ else
106
+ STDERR.puts("Please specify one or more --file or --string arguments (see --help).")
107
+ end
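The file handling in `parse_file` boils down to two steps: skip comments and blank lines, then turn the literal `\n` sequences back into newlines. The sketch below isolates those steps as pure functions so they can be tried without the gem. `decode_sample_lines` and `summary_line` are illustrative names, and the guard for an empty input is an addition that the script itself does not have.

```ruby
# Illustrative helpers, not part of the gem.

# Drop comment and blank lines, then decode the escaped newlines.
def decode_sample_lines(lines)
  lines
    .reject { |line| line.start_with?('#') || line.strip.empty? }
    .map { |line| line.gsub('\\n', "\n").strip }
end

# Build the summary text; returns a fixed message when nothing was counted
# (an assumption of this sketch, to avoid dividing by zero).
def summary_line(parsed, noresult)
  total = parsed + noresult
  return 'nothing to parse' if total.zero?

  format('parsed %d (%.1f%%), no result %d (%.1f%%)',
         parsed, 100.0 * parsed / total,
         noresult, 100.0 * noresult / total)
end

puts summary_line(35, 0)
# prints: parsed 35 (100.0%), no result 0 (0.0%)
```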
data/food_ingredient_parser.gemspec ADDED
@@ -0,0 +1,29 @@
1
+ $:.unshift(File.expand_path(File.dirname(__FILE__) + '/lib'))
2
+ require 'food_ingredient_parser/version'
3
+
4
+ Gem::Specification.new do |s|
5
+ s.name = 'food_ingredient_parser'
6
+ s.version = FoodIngredientParser::VERSION
7
+ s.date = '2018-08-09'
8
+ s.summary = 'Parser for ingredient lists found on food products.'
9
+ s.authors = ['wvengen']
10
+ s.email = ['dev-ruby@willem.engen.nl']
11
+ s.homepage = 'https://github.com/q-m/food-ingredient-parser-ruby'
12
+ s.license = 'MIT'
13
+ s.description = <<-EOD
14
+ Food products often contain an ingredient list of some sort. This parser
15
+ tries to recognize the syntax and returns a structured representation of the
16
+ food ingredients.
17
+ EOD
18
+ s.metadata = {
19
+ 'bug_tracker_uri' => 'https://github.com/q-m/food-ingredient-parser-ruby/issues',
20
+ 'source_code_uri' => 'https://github.com/q-m/food-ingredient-parser-ruby',
21
+ }
22
+
23
+ s.files = `git ls-files *.gemspec lib`.split("\n")
24
+ s.executables = `git ls-files bin`.split("\n").map(&File.method(:basename))
25
+ s.extra_rdoc_files = ['README.md', 'LICENSE']
26
+ s.require_paths = ['lib']
27
+
28
+ s.add_runtime_dependency 'treetop', '~> 1.6'
29
+ end
data/lib/food_ingredient_parser/grammar/amount.treetop ADDED
@@ -0,0 +1,30 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar Amount
3
+ include Common
4
+
5
+ rule amount
6
+ '(' ws* amount:simple_amount ws* ')' <AmountNode> /
7
+ '[' ws* amount:simple_amount ws* ']' <AmountNode> /
8
+ '{' ws* amount:simple_amount ws* '}' <AmountNode> /
9
+ amount:simple_amount <AmountNode>
10
+ end
11
+
12
+ rule simple_amount
13
+ ( (
14
+ 'of which' / 'at least' /
15
+ 'waarvan' / 'ten minste' / 'tenminste' / 'minimaal'
16
+ ) ws* )?
17
+ [<>]? ws*
18
+ simple_amount_quantity
19
+ ( ws+ (
20
+ 'minimum' /
21
+ 'minimaal' / 'minimum'
22
+ ) )?
23
+ end
24
+
25
+ rule simple_amount_quantity
26
+ number ( ws* '-' ws* number )? ws* ( '%' / 'gram' / 'g' / 'mg' / 'ml' )
27
+ end
28
+
29
+ end
30
+ end
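Treetop grammars are parsing expression grammars, so `/` is an ordered choice: the first alternative that matches wins and later ones are not tried at that position. For literals that share a prefix, such as `'g'` and `'gram'`, the order of the alternatives therefore matters. The sketch below imitates this with an atomic group in a Ruby regular expression. It is an analogy rather than Treetop itself, and the constant names are invented for the illustration.

```ruby
# Analogy only: an atomic group (?>...) commits to the first alternative
# that matches, much like ordered choice in a PEG.
SHORT_FIRST = /\A\d+\s*(?>%|g|mg|gram|ml)\z/
LONG_FIRST  = /\A\d+\s*(?>%|gram|g|mg|ml)\z/

['30 g', '30 gram', '5 mg'].each do |text|
  puts format('%-8s short-first: %-5s long-first: %s',
              text, SHORT_FIRST.match?(text), LONG_FIRST.match?(text))
end
```

With the short literal first, `30 gram` is rejected because `g` is consumed and the remaining `ram` cannot be matched; with the long literal first, all three inputs are accepted.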
data/lib/food_ingredient_parser/grammar/common.treetop ADDED
@@ -0,0 +1,150 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar Common
3
+
4
+ rule ws
5
+ " " / "\t"
6
+ end
7
+
8
+ rule newline
9
+ "\n"
10
+ end
11
+
12
+ rule char
13
+ [[:alnum:]] /
14
+ fraction /
15
+ [-/\'’+=_{}&] /
16
+ [®] /
17
+ [¿?] / # weird characters turning up in names (e.g. encoding issues)
18
+ [₁₂₃₄₅₆₇₈₉] # can occur with vitamins
19
+ #[A-Za-z0-9] / [-\'+/=?;!*#@$_%] / [#xC0-#xD6] / [#xD8-#xF6] / [#xF8-#x2FF] / [#x370-#x37D] / [#x37F-#x1FFF] / [#x200C-#x200D] / [#x2070-#x218F] / [#x2C00-#x2FEF] / [#x3001-#xD7FF] / [#xF900-#xFDCF] / [#xFDF0-#xFFFD] / [#x10000-#xEFFFF]
20
+ end
21
+
22
+ rule mark
23
+ # mark referencing a footnote
24
+ [¹²³⁴⁵ᵃᵇᶜᵈᵉᶠᵍªº] '⁾'? / '⁽' [¹²³⁴⁵ᵃᵇᶜᵈᵉᶠᵍªº] '⁾' / [†‡•°#^] / '*'+
25
+ end
26
+
27
+ rule digit
28
+ [0-9]
29
+ end
30
+
31
+ rule fraction
32
+ [½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]
33
+ end
34
+
35
+ rule number
36
+ digit+ [,.] digit+ / digit+ ws* fraction / fraction / digit+
37
+ end
38
+
39
+ rule word
40
+ abbrev / char+
41
+ end
42
+
43
+ rule word_nas
44
+ # word, but don't include the trailing '-' that may occur before an 'and'
45
+ abbrev / ( !andsep char )+
46
+ end
47
+
48
+ rule and
49
+ ( 'and' / 'en' / 'und' / '&' ) !char
50
+ end
51
+
52
+ # we want to match "a and b" but not "a- and bthing"; this rule lets us avoid the latter
53
+ rule andsep
54
+ '-' ws+ and
55
+ end
56
+
57
+ rule abbrev
58
+ # These are listed explicitly to avoid incorrect interpretations, and allow missing trailing dots.
59
+ # To get an idea of what occurs (second one omits trailing dots):
60
+ # cat data/ingredient-samples-nl | perl -ne '$_=lc($_); /\b(([a-z]\.)+[a-z]\.?)\W/ && print "$1\n"' | sort | uniq -c | sort -rn
61
+ # cat data/ingredient-samples-nl | perl -ne '$_=lc($_); /\b(([a-z]\.)+[a-z])\W/ && print "$1\n"' | sort | uniq -c | sort -rn
62
+ # Finally, you can generate the full list using this command:
63
+ # cat data/ingredient-samples-nl | perl -ne '$_=lc($_); /\b(([a-z]\.)+[a-z])\W/ && print "$1\n"' | sort | uniq | sed "s/^/'/;s/$/'i \//"
64
+ (
65
+ 'a.o.p'i /
66
+ 'b.g.a'i /
67
+ 'b.o.b'i /
68
+ 'c.a'i /
69
+ 'c.i'i /
70
+ 'd.e'i /
71
+ 'd.m.v'i /
72
+ 'd.o.c'i /
73
+ 'd.o.p'i /
74
+ 'd.s'i /
75
+ 'e.a'i /
76
+ 'e.g'i /
77
+ 'e.u'i /
78
+ 'f.i.l'i /
79
+ 'f.o.s'i /
80
+ 'i.a'i /
81
+ 'i.d'i /
82
+ 'i.e'i /
83
+ 'i.g.m.e'i /
84
+ 'i.g.p'i /
85
+ 'i.m.v'i /
86
+ 'i.o'i /
87
+ 'i.v.m'i /
88
+ 'l.s.l'i /
89
+ 'n.a'i /
90
+ 'n.b'i /
91
+ 'n.o'i /
92
+ 'n.v.t'i /
93
+ 'o.a'i /
94
+ 'o.b.v'i /
95
+ 'p.d.o'i /
96
+ 'p.g.i'i /
97
+ 'q.s'i /
98
+ 's.l'i /
99
+ 's.s'i /
100
+ 't.o.v'i /
101
+ 'u.h.t'i /
102
+ 'v.g'i /
103
+ 'v.s'i /
104
+ 'w.a'i /
105
+ 'w.o'i /
106
+ 'w.v'i /
107
+ # special words and abbreviations (not auto-generated)
108
+ 'vit.'i /
109
+ 'denat.'i /
110
+ 'N°'i /
111
+ '°C'i /
112
+ # word combinations that should not be split (not auto-generated)
113
+ # @todo this really would benefit from matching known ingredients instead of hardcoding
114
+ ( 'oliën'i / 'olien'i / 'olië'i / 'olie'i ) ws+ and ws+ ( 'vetten'i / 'vet'i ) /
115
+ 'palm'i ws+ and ws+ 'kokosvet'i /
116
+ color ( ws+ and ws+ color )+ /
117
+ color2 ( ws+ and ws+ color2 )+ /
118
+ 'kruiden'i ws+ and ws+ 'specerijen'i /
119
+ 'kruiden'i ws+ and ws+ 'specerij'i /
120
+ 'specerijen'i ws+ and ws+ 'kruiden'i /
121
+ 'vitamine'i 'n'i? ws+ and ws+ 'mineralen'i /
122
+ 'lactose'i ws+ and ws+ 'melk'i ( 'eiwit'i 'en'i? )? /
123
+ 'granen'i ws+ and ws+ 'zaden'i /
124
+ 'gekookt'i [eE]? ws+ and ws+ 'gemarineerd'i [eE]? /
125
+ 'mono'i ws+ and ws+ 'diglyceriden'i /
126
+ 'guarpitmeel'i ws+ and ws+ 'natriumalginaat'i /
127
+ 'vlees'i ws+ and ws+ 'dierlijke bijproducten'i /
128
+ 'vis'i ws+ and ws+ 'visbijproducten'i /
129
+ 'glucose'i ws+ and ws+ 'fructosestroop'i /
130
+ 'ijzeroxiden'i ws+ and ws+ 'hydroxiden'i /
131
+ char+ 'sap'i ws+ and ws+ 'overige vruchtensappen'i /
132
+ char* 'sap'i ( ws+ 'uit concentraat'i / ws+ 'uit sapconcentraat'i )? ws+ and ws+ 'vruchten'i? 'puree'i /
133
+ ( 'vit.'i / 'vitamine'i / 'vitamin'i ) ws+ [a-zA-Z] [0-9]* ws+ and ws+ [a-zA-Z] [0-9]*
134
+ )
135
+ '.'? ![[:alpha:]]
136
+ end
137
+
138
+ rule color
139
+ # used for paprika, honey ("yellow and white honey") (nouns)
140
+ 'red'i / 'green'i / 'yellow'i / 'white'i / 'black'i /
141
+ 'rood'i / 'groen'i / 'geel'i / 'wit'i / 'zwart'i
142
+ end
143
+
144
+ rule color2
145
+ # adjective colors (can not occur together with noun colors in a list)
146
+ 'rode'i / 'groene'i / 'gele'i / 'witte'i / 'zwarte'i
147
+ end
148
+
149
+ end
150
+ end
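The `number` rule accepts both a comma and a dot as decimal separator, which is why amounts such as `0,2%` come back as text. A consumer that needs numeric values has to normalise them, and the following is a rough sketch of that step. `parse_amount` is a hypothetical helper, it covers plain decimals only (no fractions, ranges, or qualifiers like `at least`), and the unit list is copied from the amount grammar.

```ruby
# Hypothetical helper, not part of the gem: split an amount string into a
# Float and its unit. Returns nil for anything it does not understand.
AMOUNT_PATTERN = /\A\s*(\d+(?:[.,]\d+)?)\s*(%|gram|mg|ml|g)\s*\z/

def parse_amount(text)
  match = AMOUNT_PATTERN.match(text)
  return nil unless match

  [match[1].tr(',', '.').to_f, match[2]]
end

p parse_amount('0,2%')   # => [0.2, "%"]
p parse_amount('30 g')   # => [30.0, "g"]
p parse_amount('a dash') # => nil
```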
data/lib/food_ingredient_parser/grammar/ingredient.treetop ADDED
@@ -0,0 +1,12 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar Ingredient
3
+ include IngredientSimple
4
+ include IngredientNested
5
+ include IngredientColoned
6
+
7
+ rule ingredient
8
+ ws* ( ingredient_nested / ingredient_coloned / ingredient_simple_with_amount )
9
+ end
10
+
11
+ end
12
+ end
data/lib/food_ingredient_parser/grammar/ingredient_coloned.treetop ADDED
@@ -0,0 +1,41 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar IngredientColoned
3
+ include Common
4
+ include Amount
5
+ include IngredientSimple
6
+
7
+ rule ingredient_coloned
8
+ ing:ingredient_simple ws* ':' ws* amount:amount post:( ws* '}' )? !( ws* word ) <IngredientNode> /
9
+ ing:ingredient_simple ws* ':' post:( ws* '}' )? ws* contains:( ingredient_coloned_inner_list ) <NestedIngredientNode>
10
+ end
11
+
12
+ rule ingredient_coloned_inner_list
13
+ contains:( ingredient_coloned_simple_with_amount_and_nest ( ws+ and ws+ ingredient_coloned_simple_with_amount_and_nest )+ ) <ListNode> /
14
+ contains:( ingredient_coloned_simple_with_amount_and_nest ws* ( '/'+ ws* ingredient_coloned_simple_with_amount_and_nest )* ) <ListNode>
15
+ end
16
+
17
+ # @see IngredientSimple#ingredient_simple
18
+ rule ingredient_coloned_simple
19
+ name:( ingredient_coloned_word_nas ( andsep? ws+ !amount !and ingredient_coloned_word_nas )* ) ws? mark:mark <IngredientNode> /
20
+ name:( ingredient_coloned_word_nas ( andsep? ws+ !amount !and ingredient_coloned_word_nas )* ) <IngredientNode>
21
+ end
22
+
23
+ # @see IngredientSimple#ingredient_simple_with_amount
24
+ rule ingredient_coloned_simple_with_amount
25
+ pre:( '{' ws* )? amount:amount ws+ ing:ingredient_coloned_simple <IngredientNode> /
26
+ ing:ingredient_coloned_simple ws* amount:amount post:( ws* '}' )? (ws? mark:mark)? <IngredientNode> /
27
+ ing:ingredient_coloned_simple <IngredientNode>
28
+ end
29
+
30
+ rule ingredient_coloned_simple_with_amount_and_nest
31
+ ing:ingredient_coloned_simple_with_amount ws* '(' ws* contains:ingredient_coloned_simple_with_amount ws* ')' ( ws* '}' )? <NestedIngredientNode> /
32
+ ingredient_coloned_simple_with_amount
33
+ end
34
+
35
+ # @see Common#word
36
+ rule ingredient_coloned_word_nas
37
+ abbrev / ( !'/' !andsep char )+
38
+ end
39
+
40
+ end
41
+ end
data/lib/food_ingredient_parser/grammar/ingredient_nested.treetop ADDED
@@ -0,0 +1,28 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar IngredientNested
3
+ include Common
4
+ include Amount
5
+ include IngredientSimple
6
+
7
+ rule ingredient_nested
8
+ ( ing:ingredient_simple ws* '(' contains:ingredient_nested_in ws* ')' ws? mark:mark ws* amount:amount <NestedIngredientNode> ) /
9
+ ( ing:ingredient_simple ws* '(' contains:ingredient_nested_in ws* ')' ws* amount:amount <NestedIngredientNode> ) /
10
+ ( ing:ingredient_simple_with_amount ws* '(' contains:ingredient_nested_in ws* ')' ws? mark:mark <NestedIngredientNode> ) /
11
+ ( ing:ingredient_simple_with_amount ws* '(' contains:ingredient_nested_in ws* ')' <NestedIngredientNode> ) /
12
+ ( ing:ingredient_simple ws* '[' contains:ingredient_nested_in ws* ']' ws? mark:mark ws* amount:amount <NestedIngredientNode> ) /
13
+ ( ing:ingredient_simple ws* '[' contains:ingredient_nested_in ws* ']' ws* amount:amount <NestedIngredientNode> ) /
14
+ ( ing:ingredient_simple_with_amount ws* '[' contains:ingredient_nested_in ws* ']' ws? mark:mark <NestedIngredientNode> ) /
15
+ ( ing:ingredient_simple_with_amount ws* '[' contains:ingredient_nested_in ws* ']' <NestedIngredientNode> )
16
+ end
17
+
18
+ rule ingredient_nested_in
19
+ ( ingredient_nested_contains (ws* ':')? ws+ )? ws* contains:list ws* '.'?
20
+ end
21
+
22
+ rule ingredient_nested_contains
23
+ 'contains'i /
24
+ 'bevat'i
25
+ end
26
+
27
+ end
28
+ end
data/lib/food_ingredient_parser/grammar/ingredient_simple.treetop ADDED
@@ -0,0 +1,21 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar IngredientSimple
3
+ include Common
4
+ include Amount
5
+
6
+ rule ingredient_simple
7
+ name:( word_nas ( andsep? ws+ !amount !and word_nas )* ) ws? mark:mark <IngredientNode> /
8
+ name:( word_nas ( andsep? ws+ !amount !and word_nas )* ) <IngredientNode> /
9
+ # We've tried to omit 'and' from the ingredient, but if it doesn't work out, do it anyway.
10
+ name:( word_nas ( andsep? ws+ !amount word_nas )* ) ws? mark:mark <IngredientNode> /
11
+ name:( word_nas ( andsep? ws+ !amount word_nas )* ) <IngredientNode>
12
+ end
13
+
14
+ rule ingredient_simple_with_amount
15
+ pre:( '{' ws* )? amount:amount ws+ ing:ingredient_simple <IngredientNode> /
16
+ ing:ingredient_simple ws* amount:amount post:( ws* '}' )? (ws? mark:mark)? <IngredientNode> /
17
+ ing:ingredient_simple <IngredientNode>
18
+ end
19
+
20
+ end
21
+ end
data/lib/food_ingredient_parser/grammar/list.treetop ADDED
@@ -0,0 +1,14 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar List
3
+ include Common
4
+ include Ingredient
5
+
6
+ rule list
7
+ contains:(ingredient ( ws* '|' ws* ingredient )+ ( ws+ and ws+ ingredient )? ) <ListNode> /
8
+ contains:(ingredient ( ws* ';' ws* ingredient )+ ( ws+ and ws+ ingredient )? ) <ListNode> /
9
+ contains:(ingredient ( ws* ',' ws* ingredient )+ ( ws+ and ws+ ingredient )? ) <ListNode> /
10
+ contains:(ingredient ( ws* '.' ws* ingredient )+ ( ws+ and ws+ ingredient )? ) <ListNode> /
11
+ contains:(ingredient ( ws+ and ws+ ingredient )? ) <ListNode>
12
+ end
13
+ end
14
+ end
data/lib/food_ingredient_parser/grammar/list_coloned.treetop ADDED
@@ -0,0 +1,23 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar ListColoned
3
+ include Common
4
+ include IngredientSimple
5
+ include Ingredient
6
+
7
+ rule list_coloned
8
+ contains:( ( list_coloned_ingredient ws* '.' ws* )+ list_coloned_ingredient? ) <ListNode> /
9
+ contains:( ( list_coloned_ingredient ws* ';' ws* )+ list_coloned_ingredient? ) <ListNode> /
10
+ contains:( list_coloned_ingredient ) <ListNode>
11
+ end
12
+
13
+ rule list_coloned_inner_list
14
+ contains:( ingredient ( ws* ',' ws* ingredient )* ) <ListNode>
15
+ end
16
+
17
+ rule list_coloned_ingredient
18
+ ing:ingredient_simple_with_amount ws* ':' ws* amount:amount post:( ws* '}' )? <IngredientNode> /
19
+ ing:ingredient_simple_with_amount ws* ':' post:( ws* '}' )? ws* contains:list_coloned_inner_list <NestedIngredientNode>
20
+ end
21
+
22
+ end
23
+ end
data/lib/food_ingredient_parser/grammar/list_newlined.treetop ADDED
@@ -0,0 +1,16 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar ListNewlined
3
+ include Common
4
+ include IngredientSimple
5
+ include List
6
+
7
+ rule list_newlined
8
+ contains:( ( list_newlined_ingredient_nested ws* newline newline )* list_newlined_ingredient_nested ) <ListNode>
9
+ end
10
+
11
+ rule list_newlined_ingredient_nested
12
+ ws* ing:ingredient_simple ws* ':'? ws* newline contains:list ( ws* '.' )? <NestedIngredientNode>
13
+ end
14
+
15
+ end
16
+ end
data/lib/food_ingredient_parser/grammar/root.treetop ADDED
@@ -0,0 +1,51 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar Root
3
+ include Common
4
+ include List
5
+ include ListColoned
6
+ include ListNewlined
7
+
8
+ rule root
9
+ '"'?
10
+ root_prefix? ws*
11
+ contains:( list_newlined / list_coloned / list )
12
+ notes:(
13
+ root_mark_sentences_in_list? ws*
14
+ ( ( '.' ws* newline* / '.'? ws* newline+ ) ws* root_sentences? ws* )?
15
+ )
16
+ '"'?
17
+ <RootNode>
18
+ end
19
+
20
+ rule root_prefix
21
+ (
22
+ 'ingredients'i / 'contains'i /
23
+ ('ingred'i [IÏiï] [EËeë] 'n'i ( 't'i 'en'i? 'declaratie'i? )? ) / 'bevat'i / 'dit zit er in'i / 'samenstelling'i /
24
+ 'zutaten'i
25
+ )
26
+ ( ws* [:;.] ( ws* newline )? / ws* newline / ws ) ws* # optional colon or other separator
27
+ "'"? ws* # stray quote occurs sometimes
28
+ end
29
+
30
+ rule root_sentences
31
+ ( root_sentence ws* )+ root_sentence_open? / root_sentence_open
32
+ end
33
+
34
+ rule root_sentence
35
+ root_sentence_open ( '.' / ';' / newline+ )
36
+ end
37
+
38
+ rule root_sentence_open
39
+ ( word / ws / [,:()%] / '[' / ']' / mark )+ <NoteNode>
40
+ end
41
+
42
+ rule root_mark_sentences_in_list
43
+ ( ( ws* [,.] / ws ) ws* root_mark_sentence_in_list )+
44
+ end
45
+
46
+ rule root_mark_sentence_in_list
47
+ mark ws* root_sentence_open <NoteNode>
48
+ end
49
+
50
+ end
51
+ end
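The `root_prefix` rule lets a list start with a label such as `Ingrediënten:` or `Contains`. As a rough illustration of which labels it accepts, the regular expression below mirrors the alternatives. It is a simplified approximation written for this note (it ignores the newline and stray-quote handling), and `strip_prefix` is not a function of the gem.

```ruby
# Approximation of root_prefix as a regular expression (illustration only).
LABELS = [
  'ingredients', 'contains',
  'ingred[iï][eë]n(?:t(?:en)?(?:declaratie)?)?',
  'bevat', 'dit zit er in', 'samenstelling', 'zutaten'
].freeze

# A label, followed by an optional ':', ';' or '.' separator or plain whitespace.
PREFIX = /\A(?:#{LABELS.join('|')})(?:\s*[:;.]\s*|\s+)/i

# Hypothetical helper: remove a leading label, leave other text untouched.
def strip_prefix(text)
  text.sub(PREFIX, '')
end

p strip_prefix('Ingrediënten: water, suiker') # => "water, suiker"
p strip_prefix('Contains wheat')              # => "wheat"
p strip_prefix('water, suiker')               # => "water, suiker"
```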
data/lib/food_ingredient_parser/grammar.rb ADDED
@@ -0,0 +1,16 @@
1
+ require 'treetop'
2
+ require_relative 'nodes'
3
+
4
+ # @todo find a way to auto-generate Ruby from Treetop files when building gem,
5
+ # see https://stackoverflow.com/q/37794587/2866660
6
+
7
+ Treetop.load File.dirname(__FILE__) + '/grammar/common'
8
+ Treetop.load File.dirname(__FILE__) + '/grammar/amount'
9
+ Treetop.load File.dirname(__FILE__) + '/grammar/ingredient_simple'
10
+ Treetop.load File.dirname(__FILE__) + '/grammar/ingredient_nested'
11
+ Treetop.load File.dirname(__FILE__) + '/grammar/ingredient_coloned'
12
+ Treetop.load File.dirname(__FILE__) + '/grammar/ingredient'
13
+ Treetop.load File.dirname(__FILE__) + '/grammar/list'
14
+ Treetop.load File.dirname(__FILE__) + '/grammar/list_coloned'
15
+ Treetop.load File.dirname(__FILE__) + '/grammar/list_newlined'
16
+ Treetop.load File.dirname(__FILE__) + '/grammar/root'
data/lib/food_ingredient_parser/nodes.rb ADDED
@@ -0,0 +1,69 @@
1
+ require 'treetop/runtime'
2
+
3
+ # Needs to be in grammar namespace so Treetop can find the nodes.
4
+ module FoodIngredientParser::Grammar
5
+
6
+ # Treetop syntax node with our additions, use this as parent for all our own nodes.
7
+ class SyntaxNode < Treetop::Runtime::SyntaxNode
8
+ private
9
+
10
+ def to_a_deep(n, cls)
11
+ if n.is_a?(cls)
12
+ [n]
13
+ elsif n.nonterminal?
14
+ n.elements.map {|m| to_a_deep(m, cls) }.flatten(1).compact
15
+ end
16
+ end
17
+ end
18
+
19
+ # Root object, contains everything else.
20
+ class RootNode < SyntaxNode
21
+ def to_h
22
+ h = { contains: contains.to_a }
23
+ if notes && notes_ary = to_a_deep(notes, NoteNode)&.map(&:text_value)
24
+ h[:notes] = notes_ary if notes_ary.length > 0
25
+ end
26
+ h
27
+ end
28
+ end
29
+
30
+ # List of ingredients.
31
+ class ListNode < SyntaxNode
32
+ def to_a
33
+ to_a_deep(contains, IngredientNode).map(&:to_h)
34
+ end
35
+ end
36
+
37
+ # Ingredient
38
+ class IngredientNode < SyntaxNode
39
+ def to_h
40
+ h = {}
41
+ h.merge!(to_a_deep(ing, IngredientNode)&.first&.to_h || {}) if respond_to?(:ing)
42
+ h.merge!(to_a_deep(amount, AmountNode)&.first&.to_h || {}) if respond_to?(:amount)
43
+ h[:name] = name.text_value if respond_to?(:name)
44
+ h[:name] = pre.text_value + h[:name] if respond_to?(:pre)
45
+ h[:name] = h[:name] + post.text_value if respond_to?(:post)
46
+ h[:mark] = mark.text_value if respond_to?(:mark) && mark.text_value != ''
47
+ h
48
+ end
49
+ end
50
+
51
+ # Ingredient with containing ingredients.
52
+ class NestedIngredientNode < IngredientNode
53
+ def to_h
54
+ super.merge({ contains: to_a_deep(contains, IngredientNode).map(&:to_h) })
55
+ end
56
+ end
57
+
58
+ # Amount, specifying an ingredient.
59
+ class AmountNode < SyntaxNode
60
+ def to_h
61
+ { amount: amount.text_value }
62
+ end
63
+ end
64
+
65
+ # Note at the end of the ingredient list.
66
+ class NoteNode < SyntaxNode
67
+ end
68
+
69
+ end
@@ -0,0 +1,34 @@
1
+ require_relative 'grammar'
2
+
3
+ module FoodIngredientParser
4
+ class Parser
5
+
6
+ # Create a new food ingredient parser
7
+ # @return [FoodIngredientParser::Parser]
8
+ def initialize
9
+ @parser = Grammar::RootParser.new
10
+ end
11
+
12
+ # Parse food ingredient list text into a structured representation.
13
+ # @param clean [Boolean] pass +false+ to disable correcting frequently occurring issues
14
+ # @return [FoodIngredientParser::Grammar::RootNode, nil] parse tree (call +to_h+ for a Hash), or +nil+ when the text could not be parsed
15
+ def parse(s, clean: true)
16
+ s = clean(s) if clean
17
+ @parser.parse(s)
18
+ end
19
+
20
+ private
21
+
22
+ def clean(s)
23
+ s.gsub!("\u00ad", "") # strip soft hyphen
24
+ s.gsub!("\u0092", "'") # windows-1252 apostrophe - https://stackoverflow.com/a/15564279/2866660
25
+ s.gsub!("aÄs", "aïs") # encoding issue for maïs
26
+ s.gsub!("ï", "ï") # encoding issue
27
+ s.gsub!("ë", "ë") # encoding issue
28
+ s.gsub!(/\A\s*"(.*)"\s*\z/, '\1') # enclosing double quotation marks
29
+ s.gsub!(/\A\s*'(.*)'\s*\z/, '\1') # enclosing single quotation marks
30
+ s
31
+ end
32
+
33
+ end
34
+ end
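`clean` uses `gsub!`, so it edits the string that the caller passed in, and it can raise when that string is frozen. Where that matters, a non-mutating variant is possible. This is a sketch under that assumption, not the gem's implementation; `clean_copy` is a made-up name and only some of the substitutions are repeated.

```ruby
# Hypothetical non-mutating variant of the cleanup step (not part of the gem).
def clean_copy(text)
  text
    .gsub("\u00ad", '')                 # strip soft hyphen
    .gsub("\u0092", "'")                # windows-1252 apostrophe
    .sub(/\A\s*"(.*)"\s*\z/, '\1')      # enclosing double quotation marks
    .sub(/\A\s*'(.*)'\s*\z/, '\1')      # enclosing single quotation marks
end

input = "\"wa\u00adter, aroma\u0092s\"".freeze
p clean_copy(input) # => "water, aroma's"
p input.frozen?     # => true
```

Because every step returns a new string, the frozen input is left as it was.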
data/lib/food_ingredient_parser/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module FoodIngredientParser
2
+ VERSION = '1.0.0.pre.1'
3
+ end
data/lib/food_ingredient_parser.rb ADDED
@@ -0,0 +1,2 @@
1
+ require_relative 'food_ingredient_parser/version'
2
+ require_relative 'food_ingredient_parser/parser'
metadata ADDED
@@ -0,0 +1,85 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: food_ingredient_parser
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0.pre.1
5
+ platform: ruby
6
+ authors:
7
+ - wvengen
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2018-08-09 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: treetop
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.6'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.6'
27
+ description: |2
28
+ Food products often contain an ingredient list of some sort. This parser
29
+ tries to recognize the syntax and returns a structured representation of the
30
+ food ingredients.
31
+ email:
32
+ - dev-ruby@willem.engen.nl
33
+ executables:
34
+ - food_ingredient_parser
35
+ extensions: []
36
+ extra_rdoc_files:
37
+ - README.md
38
+ - LICENSE
39
+ files:
40
+ - LICENSE
41
+ - README.md
42
+ - bin/food_ingredient_parser
43
+ - food_ingredient_parser.gemspec
44
+ - lib/food_ingredient_parser.rb
45
+ - lib/food_ingredient_parser/grammar.rb
46
+ - lib/food_ingredient_parser/grammar/amount.treetop
47
+ - lib/food_ingredient_parser/grammar/common.treetop
48
+ - lib/food_ingredient_parser/grammar/ingredient.treetop
49
+ - lib/food_ingredient_parser/grammar/ingredient_coloned.treetop
50
+ - lib/food_ingredient_parser/grammar/ingredient_nested.treetop
51
+ - lib/food_ingredient_parser/grammar/ingredient_simple.treetop
52
+ - lib/food_ingredient_parser/grammar/list.treetop
53
+ - lib/food_ingredient_parser/grammar/list_coloned.treetop
54
+ - lib/food_ingredient_parser/grammar/list_newlined.treetop
55
+ - lib/food_ingredient_parser/grammar/root.treetop
56
+ - lib/food_ingredient_parser/nodes.rb
57
+ - lib/food_ingredient_parser/parser.rb
58
+ - lib/food_ingredient_parser/version.rb
59
+ homepage: https://github.com/q-m/food-ingredient-parser-ruby
60
+ licenses:
61
+ - MIT
62
+ metadata:
63
+ bug_tracker_uri: https://github.com/q-m/food-ingredient-parser-ruby/issues
64
+ source_code_uri: https://github.com/q-m/food-ingredient-parser-ruby
65
+ post_install_message:
66
+ rdoc_options: []
67
+ require_paths:
68
+ - lib
69
+ required_ruby_version: !ruby/object:Gem::Requirement
70
+ requirements:
71
+ - - ">="
72
+ - !ruby/object:Gem::Version
73
+ version: '0'
74
+ required_rubygems_version: !ruby/object:Gem::Requirement
75
+ requirements:
76
+ - - ">"
77
+ - !ruby/object:Gem::Version
78
+ version: 1.3.1
79
+ requirements: []
80
+ rubyforge_project:
81
+ rubygems_version: 2.6.13
82
+ signing_key:
83
+ specification_version: 4
84
+ summary: Parser for ingredient lists found on food products.
85
+ test_files: []