food_ingredient_parser 1.0.0.pre.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 7d20d204e60d22f23f7aa3a039450629c8f4c729
4
+ data.tar.gz: 127801cb5a00b81ca6abf244d92bfebb39ddde28
5
+ SHA512:
6
+ metadata.gz: 3a79c07b4f193fcfbbd08f62fdd18407fc034ea036d5b4b5eb3278e449e337e1ee5e0ad5f468bab5f8e948746e99a6bcf5d686e13277b39dafb9c3b11d3ca531
7
+ data.tar.gz: 95c224f63934e3c601b62d0db9a7d8dc6070fe008816298e3e81f435751012d1bbf66d6270259c4f6e9252a9df6e6d0cf15081e8020e99853c0abb7af55b1a96
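Note on the checksums: a digest listed above can be compared against a downloaded archive with Ruby's standard `digest` library. The following is a minimal sketch; the helper name `checksum_matches?` and the temporary file standing in for `data.tar.gz` are assumptions made for illustration and are not part of the gem.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: compare a file's SHA512 digest with an expected hex string.
def checksum_matches?(path, expected_hex)
  Digest::SHA512.file(path).hexdigest == expected_hex.downcase
end

# Stand-in for a real archive such as data.tar.gz (assumption for this sketch).
sample = Tempfile.new('archive')
sample.write('example archive contents')
sample.flush

expected = Digest::SHA512.hexdigest('example archive contents')
puts checksum_matches?(sample.path, expected)   # digest of the same bytes
puts checksum_matches?(sample.path, 'deadbeef') # unrelated value
```

For a real check, the path would point at the extracted `data.tar.gz` and the expected value would be the SHA512 entry from `checksums.yaml`.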
data/LICENSE ADDED
@@ -0,0 +1,22 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2018 Questionmark
4
+ Copyright (c) 2018 wvengen
5
+
6
+ Permission is hereby granted, free of charge, to any person obtaining a copy
7
+ of this software and associated documentation files (the "Software"), to deal
8
+ in the Software without restriction, including without limitation the rights
9
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10
+ copies of the Software, and to permit persons to whom the Software is
11
+ furnished to do so, subject to the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be included in all
14
+ copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,115 @@
1
+ # Food ingredient parser
2
+
3
+ Ingredients are listed on food products in various ways. This [Ruby](https://www.ruby-lang.org/)
4
+ gem and program parse the ingredient text and return a structured representation.
5
+
6
+ ## Installation
7
+
8
+ ```
9
+ gem install food_ingredient_parser
10
+ ```
11
+
12
+ This will also install the dependency [treetop](http://cjheath.github.io/treetop).
13
+ If you want colored output for the test program, also install [pry](http://pryrepl.org/): `gem install pry`.
14
+
15
+ ## Example
16
+
17
+ ```ruby
18
+ require 'food_ingredient_parser'
19
+
20
+ s = "Water* 60%, suiker 30%, voedingszuren: citroenzuur, appelzuur, zuurteregelaar: E576/E577, " \
21
+ + "natuurlijke citroen-limoen aroma's 0,2%, zoetstof: steviolglycosiden, * = Biologisch. " \
22
+ + "E = door de E.U. goedgekeurde toevoeging."
23
+ parser = FoodIngredientParser::Parser.new
24
+ puts parser.parse(s).to_h.inspect
25
+ ```
26
+ Results in
27
+ ```
28
+ {
29
+ :contains=>[
30
+ {:name=>"Water", :amount=>"60%", :mark=>"*"},
31
+ {:name=>"suiker", :amount=>"30%"},
32
+ {:name=>"voedingszuren", :contains=>[
33
+ {:name=>"citroenzuur"}
34
+ ]},
35
+ {:name=>"appelzuur"},
36
+ {:name=>"zuurteregelaar", :contains=>[
37
+ {:name=>"E576"},
38
+ {:name=>"E577"}
39
+ ]},
40
+ {:name=>"natuurlijke citroen-limoen aroma's", :amount=>"0,2%"},
41
+ {:name=>"zoetstof", :contains=>[
42
+ {:name=>"steviolglycosiden"}
43
+ ]}
44
+ ],
45
+ :notes=>[
46
+ "* = Biologisch",
47
+ "E = door de E.U. goedgekeurde toevoeging"
48
+ ]
49
+ }
50
+ ```
51
+
52
+ ## Test tool
53
+
54
+ The executable `food_ingredient_parser` is available after installing the gem. If you're
55
+ running this from the source tree, use `bin/food_ingredient_parser` instead.
56
+
57
+ ```
58
+ $ food_ingredient_parser -h
59
+ Usage: food_ingredient_parser [options] --file|-f <filename>
60
+ food_ingredient_parser [options] --string|-s <ingredients>
61
+
62
+ -f, --file FILE Parse all lines of the file as ingredient lists.
63
+ -s, --string INGREDIENTS Parse specified ingredient list.
64
+ -q, --[no-]quiet Only show summary.
65
+ -p, --parsed Only show lines that were successfully parsed.
66
+ -e, --escape Escape newlines
67
+ -c, --[no-]color Use color
68
+ -n, --noresult Only show lines that had no result.
69
+ -v, --[no-]verbose Show more data (parsed tree).
70
+ --version Show program version.
71
+ -h, --help Show this help
72
+
73
+ $ food_ingredient_parser -v -s "tomato"
74
+ "tomato"
75
+ RootNode+Root3 offset=0, "tomato" (contains,notes):
76
+ SyntaxNode offset=0, ""
77
+ SyntaxNode offset=0, ""
78
+ SyntaxNode offset=0, ""
79
+ ListNode+List13 offset=0, "tomato" (contains):
80
+ SyntaxNode+List12 offset=0, "tomato" (ingredient):
81
+ SyntaxNode+Ingredient0 offset=0, "tomato":
82
+ SyntaxNode offset=0, ""
83
+ IngredientNode+IngredientSimpleWithAmount3 offset=0, "tomato" (ing):
84
+ IngredientNode+IngredientSimple5 offset=0, "tomato" (name):
85
+ SyntaxNode+IngredientSimple4 offset=0, "tomato" (word):
86
+ SyntaxNode offset=0, "tomato":
87
+ SyntaxNode offset=0, "t"
88
+ SyntaxNode offset=1, "o"
89
+ SyntaxNode offset=2, "m"
90
+ SyntaxNode offset=3, "a"
91
+ SyntaxNode offset=4, "t"
92
+ SyntaxNode offset=5, "o"
93
+ SyntaxNode offset=6, ""
94
+ SyntaxNode offset=6, ""
95
+ SyntaxNode offset=6, ""
96
+ SyntaxNode+Root2 offset=6, "":
97
+ SyntaxNode offset=6, ""
98
+ SyntaxNode offset=6, ""
99
+ SyntaxNode offset=6, ""
100
+ SyntaxNode offset=6, ""
101
+ {:contains=>[{:name=>"tomato"}]}
102
+
103
+ $ food_ingredient_parser -q -f data/test-cases
104
+ parsed 35 (100.0%), no result 0 (0.0%)
105
+ ```
106
+
107
+ If you want to use the output in (shell)scripts, the options `-e -c` may be quite useful.
108
+
109
+ ## Test data
110
+
111
+ [`data/ingredient-samples-nl`](data/ingredient-samples-nl) contains about 150k
112
+ real-world ingredient lists found on the Dutch market. Each line contains one ingredient
113
+ list (newlines are encoded as `\n`, empty lines and those starting with `#` are ignored).
114
+ Currently almost three quarters are recognized and parsed. We aim to reach at least 90%.
115
+
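Note on the README example: the hash returned by `to_h` nests ingredients under `:contains`, so consumers typically walk it recursively. A minimal sketch, assuming the result shape printed in the README; the helper `ingredient_names` is hypothetical, and the hash is written out by hand here so the snippet runs without the gem installed.

```ruby
# Result hash in the shape shown in the README example (abridged).
result = {
  contains: [
    { name: 'Water', amount: '60%', mark: '*' },
    { name: 'zuurteregelaar', contains: [{ name: 'E576' }, { name: 'E577' }] },
    { name: 'zoetstof', contains: [{ name: 'steviolglycosiden' }] }
  ],
  notes: ['* = Biologisch']
}

# Hypothetical helper: collect every ingredient name, descending into :contains.
def ingredient_names(node)
  Array(node[:contains]).flat_map do |ing|
    [ing[:name], *ingredient_names(ing)]
  end
end

puts ingredient_names(result).inspect
```

Parents are listed before their children, which keeps the order in which the ingredients appear on the label.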
data/bin/food_ingredient_parser ADDED
@@ -0,0 +1,107 @@
1
+ #!/usr/bin/env ruby
2
+ #
3
+ # Parser for food ingredient lists.
4
+ #
5
+ require 'optparse'
6
+
7
+ $:.push(File.expand_path(File.dirname(__FILE__) + "/../lib"))
8
+ require 'food_ingredient_parser'
9
+
10
+ begin
11
+ require 'pry'
12
+ def pp(o, color: true)
13
+ if color
14
+ Pry::ColorPrinter.pp(o)
15
+ else
16
+ puts(o.inspect)
17
+ end
18
+ end
19
+ rescue LoadError
20
+ # fallback without color printing
21
+ def pp(o, color: nil)
22
+ puts(o.inspect)
23
+ end
24
+ end
25
+
26
+ def colorize(color, s)
27
+ if color
28
+ "\e[#{color}m#{s}\e[0;22m"
29
+ else
30
+ s
31
+ end
32
+ end
33
+
34
+ def parse_single(s, parsed=nil, parser: nil, verbosity: 1, print: nil, escape: false, color: false)
35
+ parser ||= FoodIngredientParser::Parser.new
36
+ parsed ||= parser.parse(s)
37
+
38
+ return unless print.nil? || (parsed && print == :parsed) || (!parsed && print == :noresult)
39
+
40
+ puts colorize(color && "0;32", escape ? s.gsub("\n", "\\n") : s) if verbosity > 0
41
+
42
+ if parsed
43
+ puts(parsed.inspect) if verbosity > 1
44
+ pp(parsed.to_h, color: color) if verbosity > 0
45
+ else
46
+ puts "(no result#{parser.respond_to?(:failure_reason) ? ": #{parser.failure_reason}" : ''})" if verbosity > 0
47
+ end
48
+ end
49
+
50
+ def parse_file(path, parser: nil, verbosity: 1, print: nil, escape: false, color: false)
51
+ count_parsed = count_noresult = 0
52
+ File.foreach(path) do |line|
53
+ next if line =~ /^#/ # comment
54
+ next if line =~ /^\s*$/ # empty line
55
+
56
+ line = line.gsub('\\n', "\n").strip
57
+ parsed = parser.parse(line)
58
+ count_parsed += 1 if parsed
59
+ count_noresult += 1 unless parsed
60
+
61
+ parse_single(line, parsed, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color)
62
+ end
63
+
64
+ pct_parsed = 100.0 * count_parsed / (count_parsed + count_noresult)
65
+ pct_noresult = 100.0 * count_noresult / (count_parsed + count_noresult)
66
+ puts "parsed #{colorize(color && "1;32", count_parsed)} (#{pct_parsed.round(1)}%), no result #{colorize(color && "1;31", count_noresult)} (#{pct_noresult.round(1)}%)"
67
+ end
68
+
69
+ verbosity = 1
70
+ files = []
71
+ strings = []
72
+ print = nil
73
+ escape = false
74
+ color = true
75
+ OptionParser.new do |opts|
76
+ opts.banner = <<-EOF.gsub(/^ /, '')
77
+ Usage: #{$0} [options] --file|-f <filename>
78
+ #{$0} [options] --string|-s <ingredients>
79
+
80
+ EOF
81
+
82
+ opts.on("-f", "--file FILE", "Parse all lines of the file as ingredient lists.") {|f| files << f }
83
+ opts.on("-s", "--string INGREDIENTS", "Parse specified ingredient list.") {|s| strings << s }
84
+
85
+ opts.on("-q", "--[no-]quiet", "Only show summary.") {|q| verbosity = q ? 0 : 1 }
86
+ opts.on("-p", "--parsed", "Only show lines that were successfully parsed.") {|p| print = :parsed }
87
+ opts.on("-e", "--escape", "Escape newlines") {|e| escape = true }
88
+ opts.on("-c", "--[no-]color", "Use color") {|e| color = !!e }
89
+ opts.on("-n", "--noresult", "Only show lines that had no result.") {|p| print = :noresult }
90
+ opts.on("-v", "--[no-]verbose", "Show more data (parsed tree).") {|v| verbosity = v ? 2 : 1 }
91
+ opts.on( "--version", "Show program version.") do
92
+ puts("food_ingredient_parser v#{FoodIngredientParser::VERSION}")
93
+ exit
94
+ end
95
+ opts.on("-h", "--help", "Show this help") do
96
+ puts(opts)
97
+ exit
98
+ end
99
+ end.parse!
100
+
101
+ if strings.any? || files.any?
102
+ parser = FoodIngredientParser::Parser.new
103
+ strings.each {|s| parse_single(s, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
104
+ files.each {|f| parse_file(f, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
105
+ else
106
+ STDERR.puts("Please specify one or more --file or --string arguments (see --help).")
107
+ end
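Note on `parse_file`: each input line is filtered (comments and blank lines are skipped), literal `\n` sequences are decoded, and a percentage summary is printed at the end. The sketch below isolates those two steps. The helpers `ingredient_lines` and `percentage` are hypothetical names, and the zero-total guard is an addition for illustration rather than something the script does.

```ruby
# Hypothetical helper mirroring the line handling in parse_file:
# skip comments and blank lines, then decode literal "\n" sequences.
def ingredient_lines(lines)
  lines.reject { |l| l.start_with?('#') || l.strip.empty? }
       .map { |l| l.gsub('\\n', "\n").strip }
end

# Hypothetical helper: percentage that avoids dividing by zero for empty input.
def percentage(count, total)
  total.zero? ? 0.0 : (100.0 * count / total).round(1)
end

lines = ["# comment\n", "\n", "water, sugar\n", "ingredients:\\nsalt\n"]
puts ingredient_lines(lines).inspect
puts percentage(3, 4)
puts percentage(0, 0)
```

Without such a guard, a file that contains only comments or blank lines leads to a division of `0.0` by `0` in the summary.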
data/food_ingredient_parser.gemspec ADDED
@@ -0,0 +1,29 @@
1
+ $:.unshift(File.expand_path(File.dirname(__FILE__) + '/lib'))
2
+ require 'food_ingredient_parser/version'
3
+
4
+ Gem::Specification.new do |s|
5
+ s.name = 'food_ingredient_parser'
6
+ s.version = FoodIngredientParser::VERSION
7
+ s.date = '2018-08-09'
8
+ s.summary = 'Parser for ingredient lists found on food products.'
9
+ s.authors = ['wvengen']
10
+ s.email = ['dev-ruby@willem.engen.nl']
11
+ s.homepage = 'https://github.com/q-m/food-ingredient-parser-ruby'
12
+ s.license = 'MIT'
13
+ s.description = <<-EOD
14
+ Food products often contain an ingredient list of some sort. This parser
15
+ tries to recognize the syntax and returns a structured representation of the
16
+ food ingredients.
17
+ EOD
18
+ s.metadata = {
19
+ 'bug_tracker_uri' => 'https://github.com/q-m/food-ingredient-parser-ruby/issues',
20
+ 'source_code_uri' => 'https://github.com/q-m/food-ingredient-parser-ruby',
21
+ }
22
+
23
+ s.files = `git ls-files *.gemspec lib`.split("\n")
24
+ s.executables = `git ls-files bin`.split("\n").map(&File.method(:basename))
25
+ s.extra_rdoc_files = ['README.md', 'LICENSE']
26
+ s.require_paths = ['lib']
27
+
28
+ s.add_runtime_dependency 'treetop', '~> 1.6'
29
+ end
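Note on the version constraints: the `~> 1.6` requirement on treetop and the prerelease version string of this gem can both be explored with the RubyGems classes that ship with Ruby. A small sketch; the version numbers in the list are arbitrary samples chosen for illustration.

```ruby
require 'rubygems'

# The pessimistic constraint used for the treetop dependency.
requirement = Gem::Requirement.new('~> 1.6')

# Arbitrary sample versions (assumption: chosen only to show the boundaries).
%w[1.5.9 1.6.0 1.8.16 2.0.0].each do |v|
  puts "#{v}: #{requirement.satisfied_by?(Gem::Version.new(v))}"
end

# The gem's own version string is a prerelease and sorts before 1.0.0.
pre = Gem::Version.new('1.0.0.pre.1')
puts pre.prerelease?
puts pre < Gem::Version.new('1.0.0')
```

Because the version is a prerelease, `gem install food_ingredient_parser` would normally need the `--pre` flag to select it.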
data/lib/food_ingredient_parser/grammar/amount.treetop ADDED
@@ -0,0 +1,30 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar Amount
3
+ include Common
4
+
5
+ rule amount
6
+ '(' ws* amount:simple_amount ws* ')' <AmountNode> /
7
+ '[' ws* amount:simple_amount ws* ']' <AmountNode> /
8
+ '{' ws* amount:simple_amount ws* '}' <AmountNode> /
9
+ amount:simple_amount <AmountNode>
10
+ end
11
+
12
+ rule simple_amount
13
+ ( (
14
+ 'of which' / 'at least' /
15
+ 'waarvan' / 'ten minste' / 'tenminste' / 'minimaal'
16
+ ) ws* )?
17
+ [<>]? ws*
18
+ simple_amount_quantity
19
+ ( ws+ (
20
+ 'minimum' /
21
+ 'minimaal' / 'minimum'
22
+ ) )?
23
+ end
24
+
25
+ rule simple_amount_quantity
26
+ number ( ws* '-' ws* number )? ws* ( '%' / 'gram' / 'g' / 'mg' / 'ml' )
27
+ end
28
+
29
+ end
30
+ end
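Note on alternative order: Treetop grammars are parsing expression grammars, so `/` is an ordered choice and the first alternative that matches is taken. This matters for unit literals that share a prefix, such as `g` and `gram`. The sketch below emulates that behaviour in plain Ruby; `first_matching_literal` is a hypothetical helper, not a Treetop API.

```ruby
# Hypothetical helper emulating PEG ordered choice over string literals:
# the first alternative that matches at the start of the input wins.
def first_matching_literal(input, alternatives)
  alternatives.find { |lit| input.start_with?(lit) }
end

short_first = ['%', 'g', 'mg', 'gram', 'ml'] # 'g' is tried before 'gram'
long_first  = ['%', 'gram', 'g', 'mg', 'ml'] # longer literal listed first

puts first_matching_literal('gram', short_first).inspect
puts first_matching_literal('gram', long_first).inspect
```

With the shorter literal listed first, only the `g` of `gram` is consumed and the rest of the word is left for the surrounding rules to handle.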
data/lib/food_ingredient_parser/grammar/common.treetop ADDED
@@ -0,0 +1,150 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar Common
3
+
4
+ rule ws
5
+ " " / "\t"
6
+ end
7
+
8
+ rule newline
9
+ "\n"
10
+ end
11
+
12
+ rule char
13
+ [[:alnum:]] /
14
+ fraction /
15
+ [-/\'’+=_{}&] /
16
+ [®] /
17
+ [¿?] / # weird characters turning up in names (e.g. encoding issues)
18
+ [₁₂₃₄₅₆₇₈₉] # can occur with vitamins
19
+ #[A-Za-z0-9] / [-\'+/=?;!*#@$_%] / [#xC0-#xD6] / [#xD8-#xF6] / [#xF8-#x2FF] / [#x370-#x37D] / [#x37F-#x1FFF] / [#x200C-#x200D] / [#x2070-#x218F] / [#x2C00-#x2FEF] / [#x3001-#xD7FF] / [#xF900-#xFDCF] / [#xFDF0-#xFFFD] / [#x10000-#xEFFFF]
20
+ end
21
+
22
+ rule mark
23
+ # mark referencing a footnote
24
+ [¹²³⁴⁵ᵃᵇᶜᵈᵉᶠᵍªº] '⁾'? / '⁽' [¹²³⁴⁵ᵃᵇᶜᵈᵉᶠᵍªº] '⁾' / [†‡•°#^] / '*'+
25
+ end
26
+
27
+ rule digit
28
+ [0-9]
29
+ end
30
+
31
+ rule fraction
32
+ [½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]
33
+ end
34
+
35
+ rule number
36
+ digit+ [,.] digit+ / digit+ ws* fraction / fraction / digit+
37
+ end
38
+
39
+ rule word
40
+ abbrev / char+
41
+ end
42
+
43
+ rule word_nas
44
+ # word, but don't include the trailing '-' that may occur before an 'and'
45
+ abbrev / ( !andsep char )+
46
+ end
47
+
48
+ rule and
49
+ ( 'and' / 'en' / 'und' / '&' ) !char
50
+ end
51
+
52
+ # we want to match "a and b" but not "a- and bthing"; this rule lets us avoid the latter
53
+ rule andsep
54
+ '-' ws+ and
55
+ end
56
+
57
+ rule abbrev
58
+ # These are listed explicitly to avoid incorrect interpretations, and allow missing trailing dots.
59
+ # To get an idea of what occurs (second one omits trailing dots):
60
+ # cat data/ingredient-samples-nl | perl -ne '$_=lc($_); /\b(([a-z]\.)+[a-z]\.?)\W/ && print "$1\n"' | sort | uniq -c | sort -rn
61
+ # cat data/ingredient-samples-nl | perl -ne '$_=lc($_); /\b(([a-z]\.)+[a-z])\W/ && print "$1\n"' | sort | uniq -c | sort -rn
62
+ # Finally, you can generate the full list using this command:
63
+ # cat data/ingredient-samples-nl | perl -ne '$_=lc($_); /\b(([a-z]\.)+[a-z])\W/ && print "$1\n"' | sort | uniq | sed "s/^/'/;s/$/'i \//"
64
+ (
65
+ 'a.o.p'i /
66
+ 'b.g.a'i /
67
+ 'b.o.b'i /
68
+ 'c.a'i /
69
+ 'c.i'i /
70
+ 'd.e'i /
71
+ 'd.m.v'i /
72
+ 'd.o.c'i /
73
+ 'd.o.p'i /
74
+ 'd.s'i /
75
+ 'e.a'i /
76
+ 'e.g'i /
77
+ 'e.u'i /
78
+ 'f.i.l'i /
79
+ 'f.o.s'i /
80
+ 'i.a'i /
81
+ 'i.d'i /
82
+ 'i.e'i /
83
+ 'i.g.m.e'i /
84
+ 'i.g.p'i /
85
+ 'i.m.v'i /
86
+ 'i.o'i /
87
+ 'i.v.m'i /
88
+ 'l.s.l'i /
89
+ 'n.a'i /
90
+ 'n.b'i /
91
+ 'n.o'i /
92
+ 'n.v.t'i /
93
+ 'o.a'i /
94
+ 'o.b.v'i /
95
+ 'p.d.o'i /
96
+ 'p.g.i'i /
97
+ 'q.s'i /
98
+ 's.l'i /
99
+ 's.s'i /
100
+ 't.o.v'i /
101
+ 'u.h.t'i /
102
+ 'v.g'i /
103
+ 'v.s'i /
104
+ 'w.a'i /
105
+ 'w.o'i /
106
+ 'w.v'i /
107
+ # special words and abbreviations (not auto-generated)
108
+ 'vit.'i /
109
+ 'denat.'i /
110
+ 'N°'i /
111
+ '°C'i /
112
+ # word combinations that should not be split (not auto-generated)
113
+ # @todo this really would benefit from matching known ingredients instead of hardcoding
114
+ ( 'oliën'i / 'olien'i / 'olië'i / 'olie'i ) ws+ and ws+ ( 'vetten'i / 'vet'i ) /
115
+ 'palm'i ws+ and ws+ 'kokosvet'i /
116
+ color ( ws+ and ws+ color )+ /
117
+ color2 ( ws+ and ws+ color2 )+ /
118
+ 'kruiden'i ws+ and ws+ 'specerijen'i /
119
+ 'kruiden'i ws+ and ws+ 'specerij'i /
120
+ 'specerijen'i ws+ and ws+ 'kruiden'i /
121
+ 'vitamine'i 'n'i? ws+ and ws+ 'mineralen'i /
122
+ 'lactose'i ws+ and ws+ 'melk'i ( 'eiwit'i 'en'i? )? /
123
+ 'granen'i ws+ and ws+ 'zaden'i /
124
+ 'gekookt'i [eE]? ws+ and ws+ 'gemarineerd'i [eE]? /
125
+ 'mono'i ws+ and ws+ 'diglyceriden'i /
126
+ 'guarpitmeel'i ws+ and ws+ 'natriumalginaat'i /
127
+ 'vlees'i ws+ and ws+ 'dierlijke bijproducten'i /
128
+ 'vis'i ws+ and ws+ 'visbijproducten'i /
129
+ 'glucose'i ws+ and ws+ 'fructosestroop'i /
130
+ 'ijzeroxiden'i ws+ and ws+ 'hydroxiden'i /
131
+ char+ 'sap'i ws+ and ws+ 'overige vruchtensappen'i /
132
+ char* 'sap'i ( ws+ 'uit concentraat'i / ws+ 'uit sapconcentraat'i )? ws+ and ws+ 'vruchten'i? 'puree'i /
133
+ ( 'vit.'i / 'vitamine'i / 'vitamin' ) ws+ [a-zA-Z] [0-9]* ws+ and ws+ [a-zA-Z] [0-9]*
134
+ )
135
+ '.'? ![[:alpha:]]
136
+ end
137
+
138
+ rule color
139
+ # used for paprika, honey ("yellow and white honey") (nouns)
140
+ 'red'i / 'green'i / 'yellow'i / 'white'i / 'black'i /
141
+ 'rood'i / 'groen'i / 'geel'i / 'wit'i / 'zwart'i
142
+ end
143
+
144
+ rule color2
145
+ # adjective colors (can not occur together with noun colors in a list)
146
+ 'rode'i / 'groene'i / 'gele'i / 'witte'i / 'zwarte'i
147
+ end
148
+
149
+ end
150
+ end
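Note on the `number` rule: it accepts a decimal with comma or point, an integer followed by a vulgar fraction, a lone fraction, or a plain integer. The regular expression below is only an approximation written for illustration (the gem uses the Treetop rule, and the constants `FRACTION` and `NUMBER` are hypothetical names), but it can help when checking which inputs the rule is meant to cover.

```ruby
# Approximation of the `fraction` and `number` rules as Ruby regular
# expressions (assumption for illustration; not used by the gem).
FRACTION = /[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒]/
NUMBER = /\A(?:\d+[,.]\d+|\d+[ \t]*#{FRACTION}|#{FRACTION}|\d+)\z/

['0,2', '1.5', '1 ½', '½', '12', '1,', 'abc'].each do |candidate|
  puts "#{candidate.inspect}: #{NUMBER.match?(candidate)}"
end
```

The alternatives are listed in the same order as in the grammar, with the decimal form first so that `0,2` is read as one number rather than as `0` followed by a separator.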
data/lib/food_ingredient_parser/grammar/ingredient.treetop ADDED
@@ -0,0 +1,12 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar Ingredient
3
+ include IngredientSimple
4
+ include IngredientNested
5
+ include IngredientColoned
6
+
7
+ rule ingredient
8
+ ws* ( ingredient_nested / ingredient_coloned / ingredient_simple_with_amount )
9
+ end
10
+
11
+ end
12
+ end
data/lib/food_ingredient_parser/grammar/ingredient_coloned.treetop ADDED
@@ -0,0 +1,41 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar IngredientColoned
3
+ include Common
4
+ include Amount
5
+ include IngredientSimple
6
+
7
+ rule ingredient_coloned
8
+ ing:ingredient_simple ws* ':' ws* amount:amount post:( ws* '}' )? !( ws* word ) <IngredientNode> /
9
+ ing:ingredient_simple ws* ':' post:( ws* '}' )? ws* contains:( ingredient_coloned_inner_list ) <NestedIngredientNode>
10
+ end
11
+
12
+ rule ingredient_coloned_inner_list
13
+ contains:( ingredient_coloned_simple_with_amount_and_nest ( ws+ and ws+ ingredient_coloned_simple_with_amount_and_nest )+ ) <ListNode> /
14
+ contains:( ingredient_coloned_simple_with_amount_and_nest ws* ( '/'+ ws* ingredient_coloned_simple_with_amount_and_nest )* ) <ListNode>
15
+ end
16
+
17
+ # @see IngredientSimple#ingredient_simple
18
+ rule ingredient_coloned_simple
19
+ name:( ingredient_coloned_word_nas ( andsep? ws+ !amount !and ingredient_coloned_word_nas )* ) ws? mark:mark <IngredientNode> /
20
+ name:( ingredient_coloned_word_nas ( andsep? ws+ !amount !and ingredient_coloned_word_nas )* ) <IngredientNode>
21
+ end
22
+
23
+ # @see IngredientSimple#ingredient_simple_with_amount
24
+ rule ingredient_coloned_simple_with_amount
25
+ pre:( '{' ws* )? amount:amount ws+ ing:ingredient_coloned_simple <IngredientNode> /
26
+ ing:ingredient_coloned_simple ws* amount:amount post:( ws* '}' )? (ws? mark:mark)? <IngredientNode> /
27
+ ing:ingredient_coloned_simple <IngredientNode>
28
+ end
29
+
30
+ rule ingredient_coloned_simple_with_amount_and_nest
31
+ ing:ingredient_coloned_simple_with_amount ws* '(' ws* contains:ingredient_coloned_simple_with_amount ws* ')' ( ws* '}' )? <NestedIngredientNode> /
32
+ ingredient_coloned_simple_with_amount
33
+ end
34
+
35
+ # @see Common#word
36
+ rule ingredient_coloned_word_nas
37
+ abbrev / ( !'/' !andsep char )+
38
+ end
39
+
40
+ end
41
+ end
data/lib/food_ingredient_parser/grammar/ingredient_nested.treetop ADDED
@@ -0,0 +1,28 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar IngredientNested
3
+ include Common
4
+ include Amount
5
+ include IngredientSimple
6
+
7
+ rule ingredient_nested
8
+ ( ing:ingredient_simple ws* '(' contains:ingredient_nested_in ws* ')' ws? mark:mark ws* amount:amount <NestedIngredientNode> ) /
9
+ ( ing:ingredient_simple ws* '(' contains:ingredient_nested_in ws* ')' ws* amount:amount <NestedIngredientNode> ) /
10
+ ( ing:ingredient_simple_with_amount ws* '(' contains:ingredient_nested_in ws* ')' ws? mark:mark <NestedIngredientNode> ) /
11
+ ( ing:ingredient_simple_with_amount ws* '(' contains:ingredient_nested_in ws* ')' <NestedIngredientNode> ) /
12
+ ( ing:ingredient_simple ws* '[' contains:ingredient_nested_in ws* ']' ws? mark:mark ws* amount:amount <NestedIngredientNode> ) /
13
+ ( ing:ingredient_simple ws* '[' contains:ingredient_nested_in ws* ']' ws* amount:amount <NestedIngredientNode> ) /
14
+ ( ing:ingredient_simple_with_amount ws* '[' contains:ingredient_nested_in ws* ']' ws? mark:mark <NestedIngredientNode> ) /
15
+ ( ing:ingredient_simple_with_amount ws* '[' contains:ingredient_nested_in ws* ']' <NestedIngredientNode> )
16
+ end
17
+
18
+ rule ingredient_nested_in
19
+ ( ingredient_nested_contains (ws* ':')? ws+ )? ws* contains:list ws* '.'?
20
+ end
21
+
22
+ rule ingredient_nested_contains
23
+ 'contains'i /
24
+ 'bevat'i
25
+ end
26
+
27
+ end
28
+ end
data/lib/food_ingredient_parser/grammar/ingredient_simple.treetop ADDED
@@ -0,0 +1,21 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar IngredientSimple
3
+ include Common
4
+ include Amount
5
+
6
+ rule ingredient_simple
7
+ name:( word_nas ( andsep? ws+ !amount !and word_nas )* ) ws? mark:mark <IngredientNode> /
8
+ name:( word_nas ( andsep? ws+ !amount !and word_nas )* ) <IngredientNode> /
9
+ # We've tried to omit 'and' from the ingredient, but if it doesn't work out, do it anyway.
10
+ name:( word_nas ( andsep? ws+ !amount word_nas )* ) ws? mark:mark <IngredientNode> /
11
+ name:( word_nas ( andsep? ws+ !amount word_nas )* ) <IngredientNode>
12
+ end
13
+
14
+ rule ingredient_simple_with_amount
15
+ pre:( '{' ws* )? amount:amount ws+ ing:ingredient_simple <IngredientNode> /
16
+ ing:ingredient_simple ws* amount:amount post:( ws* '}' )? (ws? mark:mark)? <IngredientNode> /
17
+ ing:ingredient_simple <IngredientNode>
18
+ end
19
+
20
+ end
21
+ end
data/lib/food_ingredient_parser/grammar/list.treetop ADDED
@@ -0,0 +1,14 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar List
3
+ include Common
4
+ include Ingredient
5
+
6
+ rule list
7
+ contains:(ingredient ( ws* '|' ws* ingredient )+ ( ws+ and ws+ ingredient )? ) <ListNode> /
8
+ contains:(ingredient ( ws* ';' ws* ingredient )+ ( ws+ and ws+ ingredient )? ) <ListNode> /
9
+ contains:(ingredient ( ws* ',' ws* ingredient )+ ( ws+ and ws+ ingredient )? ) <ListNode> /
10
+ contains:(ingredient ( ws* '.' ws* ingredient )+ ( ws+ and ws+ ingredient )? ) <ListNode> /
11
+ contains:(ingredient ( ws+ and ws+ ingredient )? ) <ListNode>
12
+ end
13
+ end
14
+ end
data/lib/food_ingredient_parser/grammar/list_coloned.treetop ADDED
@@ -0,0 +1,23 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar ListColoned
3
+ include Common
4
+ include IngredientSimple
5
+ include Ingredient
6
+
7
+ rule list_coloned
8
+ contains:( ( list_coloned_ingredient ws* '.' ws* )+ list_coloned_ingredient? ) <ListNode> /
9
+ contains:( ( list_coloned_ingredient ws* ';' ws* )+ list_coloned_ingredient? ) <ListNode> /
10
+ contains:( list_coloned_ingredient ) <ListNode>
11
+ end
12
+
13
+ rule list_coloned_inner_list
14
+ contains:( ingredient ( ws* ',' ws* ingredient )* ) <ListNode>
15
+ end
16
+
17
+ rule list_coloned_ingredient
18
+ ing:ingredient_simple_with_amount ws* ':' ws* amount:amount post:( ws* '}' )? <IngredientNode> /
19
+ ing:ingredient_simple_with_amount ws* ':' post:( ws* '}' )? ws* contains:list_coloned_inner_list <NestedIngredientNode>
20
+ end
21
+
22
+ end
23
+ end
data/lib/food_ingredient_parser/grammar/list_newlined.treetop ADDED
@@ -0,0 +1,16 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar ListNewlined
3
+ include Common
4
+ include IngredientSimple
5
+ include List
6
+
7
+ rule list_newlined
8
+ contains:( ( list_newlined_ingredient_nested ws* newline newline )* list_newlined_ingredient_nested ) <ListNode>
9
+ end
10
+
11
+ rule list_newlined_ingredient_nested
12
+ ws* ing:ingredient_simple ws* ':'? ws* newline contains:list ( ws* '.' )? <NestedIngredientNode>
13
+ end
14
+
15
+ end
16
+ end
data/lib/food_ingredient_parser/grammar/root.treetop ADDED
@@ -0,0 +1,51 @@
1
+ module FoodIngredientParser::Grammar
2
+ grammar Root
3
+ include Common
4
+ include List
5
+ include ListColoned
6
+ include ListNewlined
7
+
8
+ rule root
9
+ '"'?
10
+ root_prefix? ws*
11
+ contains:( list_newlined / list_coloned / list )
12
+ notes:(
13
+ root_mark_sentences_in_list? ws*
14
+ ( ( '.' ws* newline* / '.'? ws* newline+ ) ws* root_sentences? ws* )?
15
+ )
16
+ '"'?
17
+ <RootNode>
18
+ end
19
+
20
+ rule root_prefix
21
+ (
22
+ 'ingredients'i / 'contains'i /
23
+ ('ingred'i [IÏiï] [EËeë] 'n'i ( 't'i 'en'i? 'declaratie'i? )? ) / 'bevat'i / 'dit zit er in'i / 'samenstelling'i /
24
+ 'zutaten'i
25
+ )
26
+ ( ws* [:;.] ( ws* newline )? / ws* newline / ws ) ws* # optional colon or other separator
27
+ "'"? ws* # stray quote occurs sometimes
28
+ end
29
+
30
+ rule root_sentences
31
+ ( root_sentence ws* )+ root_sentence_open? / root_sentence_open
32
+ end
33
+
34
+ rule root_sentence
35
+ root_sentence_open ( '.' / ';' / newline+ )
36
+ end
37
+
38
+ rule root_sentence_open
39
+ ( word / ws / [,:()%] / '[' / ']' / mark )+ <NoteNode>
40
+ end
41
+
42
+ rule root_mark_sentences_in_list
43
+ ( ( ws* [,.] / ws ) ws* root_mark_sentence_in_list )+
44
+ end
45
+
46
+ rule root_mark_sentence_in_list
47
+ mark ws* root_sentence_open <NoteNode>
48
+ end
49
+
50
+ end
51
+ end
data/lib/food_ingredient_parser/grammar.rb ADDED
@@ -0,0 +1,16 @@
1
+ require 'treetop'
2
+ require_relative 'nodes'
3
+
4
+ # @todo find a way to auto-generate Ruby from Treetop files when building gem,
5
+ # see https://stackoverflow.com/q/37794587/2866660
6
+
7
+ Treetop.load File.dirname(__FILE__) + '/grammar/common'
8
+ Treetop.load File.dirname(__FILE__) + '/grammar/amount'
9
+ Treetop.load File.dirname(__FILE__) + '/grammar/ingredient_simple'
10
+ Treetop.load File.dirname(__FILE__) + '/grammar/ingredient_nested'
11
+ Treetop.load File.dirname(__FILE__) + '/grammar/ingredient_coloned'
12
+ Treetop.load File.dirname(__FILE__) + '/grammar/ingredient'
13
+ Treetop.load File.dirname(__FILE__) + '/grammar/list'
14
+ Treetop.load File.dirname(__FILE__) + '/grammar/list_coloned'
15
+ Treetop.load File.dirname(__FILE__) + '/grammar/list_newlined'
16
+ Treetop.load File.dirname(__FILE__) + '/grammar/root'
data/lib/food_ingredient_parser/nodes.rb ADDED
@@ -0,0 +1,69 @@
1
+ require 'treetop/runtime'
2
+
3
+ # Needs to be in grammar namespace so Treetop can find the nodes.
4
+ module FoodIngredientParser::Grammar
5
+
6
+ # Treetop syntax node with our additions, use this as parent for all our own nodes.
7
+ class SyntaxNode < Treetop::Runtime::SyntaxNode
8
+ private
9
+
10
+ def to_a_deep(n, cls)
11
+ if n.is_a?(cls)
12
+ [n]
13
+ elsif n.nonterminal?
14
+ n.elements.map {|m| to_a_deep(m, cls) }.flatten(1).compact
15
+ end
16
+ end
17
+ end
18
+
19
+ # Root object, contains everything else.
20
+ class RootNode < SyntaxNode
21
+ def to_h
22
+ h = { contains: contains.to_a }
23
+ if notes && notes_ary = to_a_deep(notes, NoteNode)&.map(&:text_value)
24
+ h[:notes] = notes_ary if notes_ary.length > 0
25
+ end
26
+ h
27
+ end
28
+ end
29
+
30
+ # List of ingredients.
31
+ class ListNode < SyntaxNode
32
+ def to_a
33
+ to_a_deep(contains, IngredientNode).map(&:to_h)
34
+ end
35
+ end
36
+
37
+ # Ingredient
38
+ class IngredientNode < SyntaxNode
39
+ def to_h
40
+ h = {}
41
+ h.merge!(to_a_deep(ing, IngredientNode)&.first&.to_h || {}) if respond_to?(:ing)
42
+ h.merge!(to_a_deep(amount, AmountNode)&.first&.to_h || {}) if respond_to?(:amount)
43
+ h[:name] = name.text_value if respond_to?(:name)
44
+ h[:name] = pre.text_value + h[:name] if respond_to?(:pre)
45
+ h[:name] = h[:name] + post.text_value if respond_to?(:post)
46
+ h[:mark] = mark.text_value if respond_to?(:mark) && mark.text_value != ''
47
+ h
48
+ end
49
+ end
50
+
51
+ # Ingredient with containing ingredients.
52
+ class NestedIngredientNode < IngredientNode
53
+ def to_h
54
+ super.merge({ contains: to_a_deep(contains, IngredientNode).map(&:to_h) })
55
+ end
56
+ end
57
+
58
+ # Amount, specifying an ingredient.
59
+ class AmountNode < SyntaxNode
60
+ def to_h
61
+ { amount: amount.text_value }
62
+ end
63
+ end
64
+
65
+ # Note at the end of the ingredient list.
66
+ class NoteNode < SyntaxNode
67
+ end
68
+
69
+ end
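Note on `to_a_deep`: the helper walks the syntax tree depth-first and returns the outermost nodes of a given class, without descending into a node once it matches. The sketch below reproduces that traversal on plain structs so it runs without Treetop. `Node`, `Ingredient` and `collect_outermost` are hypothetical stand-ins, not classes from the gem.

```ruby
# Minimal stand-ins for syntax nodes (assumption: real nodes come from Treetop).
Node = Struct.new(:label, :elements)
class Ingredient < Node; end

# Depth-first collection of the outermost nodes of a given class. Traversal
# stops at the first match on each branch, as to_a_deep does.
def collect_outermost(node, cls)
  return [node] if node.is_a?(cls)
  Array(node.elements).flat_map { |child| collect_outermost(child, cls) }
end

tree = Node.new('list', [
  Ingredient.new('water', [Ingredient.new('inner', [])]),
  Node.new('separator', nil),
  Node.new('wrapper', [Ingredient.new('suiker', [])])
])

puts collect_outermost(tree, Ingredient).map(&:label).inspect
```

Stopping at the first match is what lets a nested ingredient serialize its own children through `to_h` instead of having them flattened into the parent list.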
data/lib/food_ingredient_parser/parser.rb ADDED
@@ -0,0 +1,34 @@
1
+ require_relative 'grammar'
2
+
3
+ module FoodIngredientParser
4
+ class Parser
5
+
6
+ # Create a new food ingredient parser
7
+ # @return [FoodIngredientParser::Parser]
8
+ def initialize
9
+ @parser = Grammar::RootParser.new
10
+ end
11
+
12
+ # Parse food ingredient list text into a structured representation.
13
+ # @option clean [Boolean] pass +false+ to disable correcting frequently occurring issues
14
+ # @return [FoodIngredientParser::Grammar::RootNode, nil] parse tree (call +to_h+ for a Hash), or +nil+ when parsing fails
15
+ def parse(s, clean: true)
16
+ s = clean(s) if clean
17
+ @parser.parse(s)
18
+ end
19
+
20
+ private
21
+
22
+ def clean(s)
23
+ s.gsub!("\u00ad", "") # strip soft hyphen
24
+ s.gsub!("\u0092", "'") # windows-1252 apostrophe - https://stackoverflow.com/a/15564279/2866660
25
+ s.gsub!("aÄs", "aïs") # encoding issue for maïs
26
+ s.gsub!("ï", "ï") # encoding issue
27
+ s.gsub!("ë", "ë") # encoding issue
28
+ s.gsub!(/\A\s*"(.*)"\s*\z/, '\1') # enclosing double quotation marks
29
+ s.gsub!(/\A\s*'(.*)'\s*\z/, '\1') # enclosing single quotation marks
30
+ s
31
+ end
32
+
33
+ end
34
+ end
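Note on `clean`: it uses `gsub!`, which modifies its receiver, so the string passed to `parse` is changed in place and a frozen string cannot be cleaned this way. A possible alternative is to work on a copy. The sketch below shows the idea with two of the substitutions; `clean_copy` is a hypothetical name, and this is not how the released gem behaves.

```ruby
# Hypothetical variant of the cleaning step that works on a copy, so the
# caller's string (possibly frozen) is left untouched.
def clean_copy(s)
  s = s.dup
  s.gsub!("\u00ad", '')               # strip soft hyphen
  s.gsub!(/\A\s*"(.*)"\s*\z/, '\1')   # enclosing double quotation marks
  s
end

input = "\"to\u00admato\"".freeze
puts clean_copy(input)
puts input.frozen?
```

The trade-off is one extra string allocation per call, which is small compared to the cost of parsing.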
data/lib/food_ingredient_parser/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module FoodIngredientParser
2
+ VERSION = '1.0.0.pre.1'
3
+ end
data/lib/food_ingredient_parser.rb ADDED
@@ -0,0 +1,2 @@
1
+ require_relative 'food_ingredient_parser/version'
2
+ require_relative 'food_ingredient_parser/parser'
metadata ADDED
@@ -0,0 +1,85 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: food_ingredient_parser
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0.pre.1
5
+ platform: ruby
6
+ authors:
7
+ - wvengen
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2018-08-09 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: treetop
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.6'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.6'
27
+ description: |2
28
+ Food products often contain an ingredient list of some sort. This parser
29
+ tries to recognize the syntax and returns a structured representation of the
30
+ food ingredients.
31
+ email:
32
+ - dev-ruby@willem.engen.nl
33
+ executables:
34
+ - food_ingredient_parser
35
+ extensions: []
36
+ extra_rdoc_files:
37
+ - README.md
38
+ - LICENSE
39
+ files:
40
+ - LICENSE
41
+ - README.md
42
+ - bin/food_ingredient_parser
43
+ - food_ingredient_parser.gemspec
44
+ - lib/food_ingredient_parser.rb
45
+ - lib/food_ingredient_parser/grammar.rb
46
+ - lib/food_ingredient_parser/grammar/amount.treetop
47
+ - lib/food_ingredient_parser/grammar/common.treetop
48
+ - lib/food_ingredient_parser/grammar/ingredient.treetop
49
+ - lib/food_ingredient_parser/grammar/ingredient_coloned.treetop
50
+ - lib/food_ingredient_parser/grammar/ingredient_nested.treetop
51
+ - lib/food_ingredient_parser/grammar/ingredient_simple.treetop
52
+ - lib/food_ingredient_parser/grammar/list.treetop
53
+ - lib/food_ingredient_parser/grammar/list_coloned.treetop
54
+ - lib/food_ingredient_parser/grammar/list_newlined.treetop
55
+ - lib/food_ingredient_parser/grammar/root.treetop
56
+ - lib/food_ingredient_parser/nodes.rb
57
+ - lib/food_ingredient_parser/parser.rb
58
+ - lib/food_ingredient_parser/version.rb
59
+ homepage: https://github.com/q-m/food-ingredient-parser-ruby
60
+ licenses:
61
+ - MIT
62
+ metadata:
63
+ bug_tracker_uri: https://github.com/q-m/food-ingredient-parser-ruby/issues
64
+ source_code_uri: https://github.com/q-m/food-ingredient-parser-ruby
65
+ post_install_message:
66
+ rdoc_options: []
67
+ require_paths:
68
+ - lib
69
+ required_ruby_version: !ruby/object:Gem::Requirement
70
+ requirements:
71
+ - - ">="
72
+ - !ruby/object:Gem::Version
73
+ version: '0'
74
+ required_rubygems_version: !ruby/object:Gem::Requirement
75
+ requirements:
76
+ - - ">"
77
+ - !ruby/object:Gem::Version
78
+ version: 1.3.1
79
+ requirements: []
80
+ rubyforge_project:
81
+ rubygems_version: 2.6.13
82
+ signing_key:
83
+ specification_version: 4
84
+ summary: Parser for ingredient lists found on food products.
85
+ test_files: []