food_ingredient_parser 1.0.0.pre.5 → 1.0.0.pre.6

Files changed (28)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 8725e4ed3763020de6b46cad6709ce05aca0b77f
-  data.tar.gz: fedb82af99346e8db38a3beaca5914ab44be197a
+  metadata.gz: 54bdb9187f9a2dfbec67737ddc2a3ad90f4ca058
+  data.tar.gz: fcfc99674e0f58801ca3a375acebe91ba3f80c84
 SHA512:
-  metadata.gz: 114feb403f87140f2eccc21860c9a75aff4cff3583f3046f9815299f90dc6374c73b3b9ee60d682c3a9f91b2803ab3fc6c934f3255716986a494fbb51c9e5564
-  data.tar.gz: 8f94825d627ab8068fb3dfd6c20f61a20fc8612f06ec6a760d33e46def103ca11497e364d9e3a7c4720647794e644a7253ba6e226d984801a55e089a5961b753
+  metadata.gz: 773526e862a74f04614486f3542de0c89f8663e10f959978a7d3a2e1ba8703e8a9ae93bde6b043a62326be68956733a30e6c01d85b6578dba0abc9064590fb18
+  data.tar.gz: e3296ae3222745f20727eed70bd1a9dac84a1714f1bdc64c2cd524bfbd6589a470afea33eab1e4ce717dbec9f4b3b3d8d9350025d13a4046e78cea3d0aea9b6d

data/README.md CHANGED

@@ -22,11 +22,11 @@ require 'food_ingredient_parser'
 s = "Water* 60%, suiker 30%, voedingszuren: citroenzuur, appelzuur, zuurteregelaar: E576/E577, " \
     + "natuurlijke citroen-limoen aroma's 0,2%, zoetstof: steviolglycosiden, * = Biologisch. " \
     + "E = door de E.U. goedgekeurde toevoeging."
-parser = FoodIngredientParser::Parser.new
+parser = FoodIngredientParser::Strict::Parser.new
 puts parser.parse(s).to_h.inspect
 ```
 Results in
-```
+```ruby
 {
   :contains=>[
     {:name=>"Water", :amount=>"60%", :mark=>"*"},
@@ -58,14 +58,15 @@ running this from the source tree, use `bin/food_ingredient_parser` instead.
 ```
 $ food_ingredient_parser -h
-Usage: food_ingredient_parser [options] --file|-f <filename>
-       food_ingredient_parser [options] --string|-s <ingredients>
+Usage: bin/food_ingredient_parser [options] --file|-f <filename>
+       bin/food_ingredient_parser [options] --string|-s <ingredients>
     -f, --file FILE                  Parse all lines of the file as ingredient lists.
     -s, --string INGREDIENTS         Parse specified ingredient list.
     -q, --[no-]quiet                 Only show summary.
     -p, --parsed                     Only show lines that were successfully parsed.
-    -e, --escape                     Escape newlines
+    -r, --parser PARSER              Use specific parser (strict, loose).
+    -e, --[no-]escape                Escape newlines
     -c, --[no-]color                 Use color
     -n, --noresult                   Only show lines that had no result.
     -v, --[no-]verbose               Show more data (parsed tree).
@@ -102,6 +103,12 @@ RootNode+Root3 offset=0, "tomato" (contains,notes):
   SyntaxNode offset=6, ""
 {:contains=>[{:name=>"tomato"}]}
+$ food_ingredient_parser -v -r loose -s "tomato"
+"tomato"
+Node interval=0..5
+  Node interval=0..5, name="tomato"
+{:contains=>[{:name=>"tomato"}]}
 $ food_ingredient_parser -q -f data/test-cases
 parsed 35 (100.0%), no result 0 (0.0%)
 ```
@@ -114,12 +121,12 @@ When ingredient lists are entered manually, it can be very useful to show how th
 recognized. This can help understanding why a certain ingredients list cannot be parsed.
 For this you can use the `to_html` method on the parsed output, which returns the original
-text, augmented with CSS classes for different parts.
+text, augmented with CSS classes for different parts. (Available for strict parser only.)
 ```ruby
 require 'food_ingredient_parser'
-parsed = FoodIngredientParser::Parser.new.parse("Saus (10% tomaat*, zout). * = bio")
+parsed = FoodIngredientParser::Strict::Parser.new.parse("Saus (10% tomaat*, zout). * = bio")
 puts parsed.to_html
 ```
@@ -138,9 +145,38 @@ For an example of an interactive editor, see [examples/editor.rb](examples/edito
 ![editor example screenshot](examples/editor-screenshot.png)
+## Loose parser
+The strict parser only parses ingredient lists that conform to one of the many different
+formats expected. If you'd like to return a result always, even if that is not necessarily
+completely correct, you can use the _loose_ parser. This does not use Treetop, but looks
+at the input character for character and tries to make the best of it. Nevertheless, if you
+just want to have _some_ result, this can still be very useful.
+```ruby
+require 'food_ingredient_parser'
+parsed = FoodIngredientParser::Loose::Parser.new.parse("Saus [10% tomaat*, (zout); peper.")
+puts parsed.to_h
+```
+Even though the strict parser would not give a result, the loose parser returns:
+```ruby
+{
+  :contains=>[
+    {:name=>"Saus", :contains=>[
+      {:name=>"tomaat", :mark=>"*", :amount=>"10%"},
+      {:contains=>[{:name=>"zout"}]},
+      {:name=>"peper"}
+    ]}
+  ]
+}
+```
 ## Test data
 [`data/ingredient-samples-nl`](data/ingredient-samples-nl) contains about 150k
 real-world ingredient lists found on the Dutch market. Each line contains one ingredient
 list (newlines are encoded as `\n`, empty lines and those starting with `#` are ignored).
-Currently almost three quarter is recognized and parsed. We aim to reach at least 90%.
+The strict parser currently parses about three quarter, while the loose parser returns
+something for all of them.
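
A minimal sketch of how the two parsers in this release could be combined: try the strict parser first and fall back to the loose one. It assumes, as the README text above suggests, that `parse` returns a falsy value when a parser has no result. `parse_with_fallback` is a hypothetical helper, not part of the gem. Each parser gets a copy of the text because the shared cleaner edits its argument in place.

```ruby
# Try each parser in order and return the first result as a plain Hash.
# `parsers` maps a label to any object responding to #parse(text).
# (Hypothetical helper for illustration, not part of the gem.)
def parse_with_fallback(text, parsers)
  parsers.each do |label, parser|
    parsed = parser.parse(text.dup) # dup: the cleaner edits its argument in place
    return { parser: label, result: parsed.to_h } if parsed
  end
  nil
end

# With the gem installed, this could be called as (not run here):
#   require 'food_ingredient_parser'
#   parse_with_fallback("Saus [10% tomaat*, (zout); peper.",
#     strict: FoodIngredientParser::Strict::Parser.new,
#     loose:  FoodIngredientParser::Loose::Parser.new)
```

The returned label makes it possible to tell afterwards whether a result came from the strict grammar or from the best-effort scanner.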

data/bin/food_ingredient_parser CHANGED

@@ -31,8 +31,7 @@ def colorize(color, s)
   end
 end
-def parse_single(s, parsed=nil, parser: nil, verbosity: 1, print: nil, escape: false, color: false)
-  parser ||= FoodIngredientParser::Parser.new
+def parse_single(s, parsed=nil, parser:, verbosity: 1, print: nil, escape: false, color: false)
   parsed ||= parser.parse(s)
   return unless print.nil? || (parsed && print == :parsed) || (!parsed && print == :noresult)
@@ -47,7 +46,7 @@ def parse_single(s, parsed=nil, parser: nil, verbosity: 1, print: nil, escape: f
   end
 end
-def parse_file(path, parser: nil, verbosity: 1, print: nil, escape: false, color: false)
+def parse_file(path, parser:, verbosity: 1, print: nil, escape: false, color: false)
   count_parsed = count_noresult = 0
   File.foreach(path) do |line|
     next if line =~ /^#/    # comment
@@ -70,8 +69,13 @@ verbosity = 1
 files = []
 strings = []
 print = nil
+parser_name = :strict
 escape = false
 color = true
+PARSERS = {
+  strict: FoodIngredientParser::Strict::Parser,
+  loose:  FoodIngredientParser::Loose::Parser
+}
 OptionParser.new do |opts|
   opts.banner = <<-EOF.gsub(/^    /, '')
     Usage: #{$0} [options] --file|-f <filename>
@@ -84,7 +88,8 @@ OptionParser.new do |opts|
   opts.on("-q", "--[no-]quiet", "Only show summary.") {|q| verbosity = q ? 0 : 1 }
   opts.on("-p", "--parsed", "Only show lines that were successfully parsed.") {|p| print = :parsed }
-  opts.on("-e", "--escape", "Escape newlines") {|e| escape = true }
+  opts.on("-r", "--parser PARSER", "Use specific parser (#{PARSERS.keys.join(", ")}).") {|p| parser_name = p&.downcase&.to_sym }
+  opts.on("-e", "--[no-]escape", "Escape newlines") {|e| escape = !!e }
   opts.on("-c", "--[no-]color", "Use color") {|e| color = !!e }
   opts.on("-n", "--noresult", "Only show lines that had no result.") {|p| print = :noresult }
   opts.on("-v", "--[no-]verbose", "Show more data (parsed tree).") {|v| verbosity = v ? 2 : 1 }
@@ -99,7 +104,10 @@ OptionParser.new do |opts|
 end.parse!
 if strings.any? || files.any?
-  parser = FoodIngredientParser::Parser.new
+  unless parser = PARSERS[parser_name]&.new
+    STDERR.puts("Please specify one of the known parsers: #{PARSERS.keys.join(", ")}.")
+    exit(1)
+  end
   strings.each {|s| parse_single(s, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
   files.each   {|f| parse_file(f, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
 else
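
The option handling added in this file can be tried in isolation. The sketch below uses only the standard `optparse` library and mirrors the two changed options: `--parser` takes a name that is downcased and turned into a symbol, and `--[no-]escape` yields `true` or `false`. `parse_cli` and `KNOWN_PARSERS` are hypothetical names. The real script maps the names to parser classes and exits with status 1 instead of raising.

```ruby
require 'optparse'

# Labels only; the real script maps these to parser classes.
KNOWN_PARSERS = %i[strict loose].freeze

# Parse a list of command-line arguments into a settings Hash.
def parse_cli(argv)
  settings = { parser: :strict, escape: false }
  OptionParser.new do |opts|
    opts.on("-r", "--parser PARSER", "Use specific parser (#{KNOWN_PARSERS.join(', ')}).") do |name|
      settings[:parser] = name.downcase.to_sym
    end
    # A --[no-]flag option yields true for --escape and false for --no-escape.
    opts.on("-e", "--[no-]escape", "Escape newlines") { |flag| settings[:escape] = flag }
  end.parse!(argv.dup)
  unless KNOWN_PARSERS.include?(settings[:parser])
    raise ArgumentError, "Please specify one of the known parsers: #{KNOWN_PARSERS.join(', ')}."
  end
  settings
end
```

Raising rather than exiting keeps the sketch testable; a command-line script would normally print the message and exit, as the diff above does.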

data/lib/food_ingredient_parser/cleaner.rb ADDED

@@ -0,0 +1,16 @@
+module FoodIngredientParser
+  module Cleaner
+    def self.clean(s)
+      s.gsub!("\u00ad", "")             # strip soft hyphen
+      s.gsub!("\u0092", "'")            # windows-1252 apostrophe - https://stackoverflow.com/a/15564279/2866660
+      s.gsub!("aÄs", "aïs")             # encoding issue for maïs
+      s.gsub!("Ã¯", "ï")                # encoding issue
+      s.gsub!("Ã«", "ë")                # encoding issue
+      s.gsub!(/\A\s*"(.*)"\s*\z/, '\1') # enclosing double quotation marks
+      s.gsub!(/\A\s*'(.*)'\s*\z/, '\1') # enclosing single quotation marks
+      s
+    end
+  end
+end
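
Because `Cleaner.clean` uses `gsub!`, it modifies the string it is given. Callers who need the original text afterwards may want to pass a copy. The following self-contained sketch shows a non-mutating variant with a subset of the substitutions; `clean_copy` is a hypothetical name and not part of the gem.

```ruby
# Non-mutating variant: work on a copy so the caller's string is untouched.
# Only a subset of the gem's substitutions is reproduced here.
def clean_copy(text)
  s = text.dup
  s.gsub!("\u00ad", "")               # strip soft hyphen
  s.gsub!(/\A\s*"(.*)"\s*\z/, '\1')   # enclosing double quotation marks
  s.gsub!(/\A\s*'(.*)'\s*\z/, '\1')   # enclosing single quotation marks
  s
end

original = "\"zout, pe\u00adper\""
cleaned  = clean_copy(original)
```

With the gem itself, the same effect can be had by calling `parse(text.dup)`, or by passing `clean: false` when the input is already clean.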

data/lib/food_ingredient_parser/loose/node.rb ADDED

@@ -0,0 +1,60 @@
+module FoodIngredientParser::Loose
+  # Parsing result.
+  class Node
+    attr_accessor :name, :mark, :amount, :contains, :notes
+    attr_reader :input, :interval, :auto_close
+    def initialize(input, interval, auto_close: false)
+      @input = input
+      @interval = interval.is_a?(Range) ? interval : ( interval .. interval )
+      @auto_close = auto_close
+      @contains = []
+      @notes = []
+      @name = @mark = @amount = nil
+    end
+    def ends(index)
+      @interval = @interval.first .. index
+    end
+    def <<(child)
+      @contains << child
+    end
+    def text_value
+      @input[@interval]
+    end
+    def to_h
+      r = {}
+      r[:name] = name.text_value.strip if name && name.text_value.strip != ''
+      r[:mark] = mark.text_value.strip if mark
+      r[:amount] = amount.text_value.strip if amount
+      r[:contains] = contains.map(&:to_h).reject {|c| c == {} } if contains.any?
+      r[:notes] = notes.map{|n| n.text_value.strip }.reject {|c| c == '' } if notes.any?
+      r
+    end
+    def inspect(indent="", variant="")
+      inspect_self(indent, variant) +
+      inspect_children(indent)
+    end
+    def inspect_self(indent="", variant="")
+      [
+        indent + "Node#{variant} interval=#{@interval}",
+        name ? "name=#{name.text_value.strip.inspect}" : nil,
+        mark ? "mark=#{mark.text_value.strip.inspect}" : nil,
+        amount ? "amount=#{amount.text_value.strip.inspect}" : nil,
+        auto_close ? "auto_close" : nil
+      ].compact.join(", ")
+    end
+    def inspect_children(indent="")
+      [
+        *contains.map {|child| "\n" + child.inspect(indent + "  ") },
+        *notes.map    {|note|  "\n" + note.inspect(indent + "  ", "(note)") }
+      ].join("")
+    end
+  end
+end

data/lib/food_ingredient_parser/loose/parser.rb ADDED

@@ -0,0 +1,24 @@
+require_relative '../cleaner'
+require_relative 'scanner'
+require_relative 'transform/amount'
+module FoodIngredientParser::Loose
+  class Parser
+    # Create a new food ingredient stream parser
+    # @return [FoodIngredientParser::StreamParser]
+    def initialize
+    end
+    # Parse food ingredient list text into a structured representation.
+    #
+    # @option clean [Boolean] pass +false+ to disable correcting frequently occuring issues
+    # @return [FoodIngredientParser::Loose::Node] structured representation of food ingredients
+    def parse(s, clean: true, **options)
+      s = FoodIngredientParser::Cleaner.clean(s) if clean
+      n = Scanner.new(s).scan
+      n = Transform::Amount.transform!(n) if n
+      n
+    end
+  end
+end

data/lib/food_ingredient_parser/loose/scanner.rb ADDED

@@ -0,0 +1,191 @@
+require_relative 'node'
+module FoodIngredientParser::Loose
+  class Scanner
+    SEP_CHARS  = "|;,.".freeze
+    MARK_CHARS = "¹²³⁴⁵ᵃᵇᶜᵈᵉᶠᵍªº⁽⁾†‡•°#^*".freeze
+    PREFIX_RE  = /\A\s*(ingredients|contains|ingred[iï][eë]nt(en)?(declaratie)?|bevat|dit zit er\s?in|samenstelling|zutaten)\s*[:;.]\s*/i.freeze
+    NOTE_RE    = /\A\b(dit product kan\b|kan sporen\b.*?\bbevatten\b|voor allergenen\b|allergenen\b|E\s*=|gemaakt in\b|geproduceerd in\b|bevat mogelijk\b|kijk voor meer\b|allergie-info|in de fabriek\b|in dit bedrijf\b)/i.freeze
+    def initialize(s, index: 0)
+      @s = s                           # input string
+      @i = index                       # current index in string
+      @cur = nil                       # current node we're populating
+      @ancestors = [Node.new(@s, @i)]  # nesting hierarchy
+      @iterator = :beginning           # scan_iteration_<iterator> to use for parsing
+      @dest = :contains                # append current node to this attribute on parent
+    end
+    def scan
+      loop do
+        method(:"scan_iteration_#{@iterator}").call
+      end
+      close_all_ancestors
+      @ancestors.first.ends(@i-1)
+      @ancestors.first
+    end
+    private
+    def loop
+      while @i < @s.length
+        @i += 1 if yield != false
+      end
+    end
+    def scan_iteration_beginning
+      # skip over some common prefixes
+      m = @s[@i .. -1].match(PREFIX_RE)
+      @i += m.offset(0).last if m
+      # now continue with the standard parsing
+      @iterator = :standard
+      false
+    end
+    def scan_iteration_standard
+      if "([".include?(c)         # open nesting
+        open_parent
+      elsif ")]".include?(c)      # close nesting
+        add_child
+        close_parent
+      elsif is_notes_start?       # usually a dot marks the start of notes
+        close_all_ancestors
+        @iterator = :notes
+        @dest = :notes
+      elsif is_sep?               # separator
+        add_child
+      elsif ":".include?(c)       # another open nesting
+        add_child
+        open_parent(auto_close: true)
+        @iterator = :colon
+      elsif is_mark? && !cur.mark # mark after ingredient
+        name_until_here
+        len = mark_len
+        cur.mark = Node.new(@s, @i .. @i+len-1)
+        @i += len - 1
+      else
+        cur # reference to record starting position
+      end
+    end
+    def scan_iteration_colon
+      if "/".include?(c)        # slash separator in colon nesting only
+        add_child
+      elsif is_sep?             # regular separator indicates end of colon nesting
+        add_child
+        close_parent
+        # revert to standard parsing from here on
+        @iterator = :standard
+        scan_iteration_standard
+      elsif "([]):".include?(c) # continue with deeper nesting level
+        # revert to standard parsing from here on
+        @iterator = :standard
+        scan_iteration_standard
+      else
+        # normal handling for this character
+        scan_iteration_standard
+      end
+    end
+    def scan_iteration_notes
+      if is_sep?(chars: ".")    # dot means new note
+        add_child
+      else
+        cur
+      end
+    end
+    def c
+      @s[@i]
+    end
+    def parent
+      @ancestors.last
+    end
+    def cur
+      @cur ||= Node.new(@s, @i)
+    end
+    def is_mark?
+      mark_len > 0 && @s[@i..@i+1] !~ /\A°[CF]/
+    end
+    def is_sep?(chars: SEP_CHARS)
+      chars.include?(c) && @s[@i-1..@i+1] !~ /\A\d.\d\z/
+    end
+    def mark_len
+      i = @i
+      while @s[i] && MARK_CHARS.include?(@s[i])
+        i += 1
+      end
+      i - @i
+    end
+    def is_notes_start?
+      # @todo use more heuristics: don't assume dot is notes when separator is a dot, and only toplevel?
+      if ( is_mark? && @s[@i+mark_len..-1] =~ /\A\s*=/ ) ||     # "* = Biologisch"
+         ( is_mark? && @s[@i-2..@i-1] =~ /\A\s\s/ ) ||          # "  **Biologisch"
+         ( @s[@i..-1] =~ NOTE_RE )                              # "E=", "Kan sporen van", ...
+        @i -= 1 # we want to include the mark in the note
+        true
+      # End of sentence
+      elsif dot_is_not_sep? && is_sep?(chars: ".")
+        true
+      else
+        false
+      end
+    end
+    def add_child
+      cur.ends(@i-1)
+      cur.name ||= Node.new(@s, cur.interval)
+      parent.send(@dest) << cur
+      @cur = nil
+    end
+    def open_parent(**options)
+      name_until_here
+      @ancestors << cur
+      @cur = Node.new(@s, @i + 1, **options)
+    end
+    def close_parent
+      return unless @ancestors.count > 1
+      @cur = @ancestors.pop
+      while @cur.auto_close
+        add_child
+        @cur = @ancestors.pop
+      end
+    end
+    def close_all_ancestors
+      while @ancestors.count > 1
+        add_child
+        close_parent
+      end
+      add_child
+    end
+    def name_until_here
+      cur.name ||= Node.new(@s, cur.interval.first .. @i-1)
+    end
+    def dot_is_not_sep?
+      # if separator is dot ".", don't use it for note detection
+      if @dot_is_not_sep.nil?
+        @dot_is_not_sep = begin
+          # @todo if another separator is found more often, dot is not a separator
+          num_words = @s.split(/\s+/).count
+          num_dots = @s.count(".")
+          # heuristic: 1/4+ of the words has a dot, with at least five words
+          num_words < 5 || 4 * num_dots < num_words
+        end
+      end
+      @dot_is_not_sep
+    end
+  end
+end
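
The heuristic in `dot_is_not_sep?` decides whether a dot should start the notes section or is just an ingredient separator. It can be read on its own as the standalone sketch below, which uses the same arithmetic as the scanner but a hypothetical method name. The method returns true when dots are rare (fewer than one per four words) or the text is very short; in that case a dot is treated as the end of a sentence.

```ruby
# True when a dot is more likely an end-of-sentence marker than an
# ingredient separator (same arithmetic as the scanner's heuristic).
def dot_ends_sentence?(text)
  num_words = text.split(/\s+/).count
  num_dots  = text.count(".")
  num_words < 5 || 4 * num_dots < num_words
end

comma_list = "water, suiker, zout, peper, kaneel. Bevat noten"
dot_list   = "water. suiker. zout. peper. kaneel."
```

For `comma_list` there are 7 words and 1 dot, so the dot is taken to end the ingredient list. For `dot_list` there are 5 words and 5 dots, so the dots are kept as separators.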

data/lib/food_ingredient_parser/loose/transform/amount.rb ADDED

@@ -0,0 +1,70 @@
+require 'treetop'
+require_relative '../../strict/nodes'
+Treetop.load File.dirname(__FILE__) + '/../../strict/grammar/common'
+Treetop.load File.dirname(__FILE__) + '/../../strict/grammar/amount'
+Treetop.load File.dirname(__FILE__) + '/amount_from_name'
+require_relative '../node'
+module FoodIngredientParser::Loose
+  module Transform
+    # Transforms node tree to extract amount into its own attribute.
+    class Amount
+      def self.transform!(node)
+        new(node).transform!
+      end
+      def initialize(node)
+        @node = node
+        @parser = FoodIngredientParser::Loose::Transform::AmountFromNameParser.new
+      end
+      def transform!
+        transform_name
+        transform_contains
+        @node
+      end
+      private
+      # Extract amount from name, if any.
+      def transform_name(node = @node)
+        if !node.amount && parsed = parse_amount(node.name&.text_value)
+          offset = node.name.interval.first
+          amount = parsed.amount.amount
+          node.amount = Node.new(node.input, offset + amount.interval.first .. offset + amount.interval.last - 1)
+          name = parsed.respond_to?(:name) && parsed.name
+          if name && name.interval.count > 0
+            node.name = Node.new(node.input, offset + name.interval.first .. offset + name.interval.last - 1)
+          else
+            node.name = nil
+          end
+        end
+        # recursively transform contained nodes
+        node.contains&.each(&method(:transform_name))
+      end
+      # If first or last child is an amount, it's this node's amount.
+      # Assumes all names already have extracted their amounts with {{#transform_name}}.
+      def transform_contains(node = @node)
+        if !node.amount && node.contains.any?
+          if node.contains.first.name.nil? && node.contains.first.amount
+            node.amount = node.contains.shift.amount
+          elsif node.contains.count > 1 && node.contains.last.name.nil? && node.contains.last.amount
+            node.amount = node.contains.pop.amount
+          end
+        end
+        # recursively transform contained nodes
+        node.contains.each(&method(:transform_contains))
+      end
+      def parse_amount(s)
+        @parser.parse(s) if s && s.strip != ''
+      end
+    end
+  end
+end
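
The second pass, `transform_contains`, moves an amount-only first or last child up to its parent. A self-contained sketch of that rule on a plain `Struct` follows; `Item` and `hoist_amount!` are hypothetical stand-ins for the gem's `Node` and its private method.

```ruby
# Stand-in for the loose parser's Node: a name, an amount and child items.
Item = Struct.new(:name, :amount, :contains)

# If the first or last child has an amount but no name, treat it as the
# parent's amount and remove it from the children; then recurse.
def hoist_amount!(node)
  if node.amount.nil? && node.contains.any?
    first = node.contains.first
    last  = node.contains.last
    if first.name.nil? && first.amount
      node.amount = node.contains.shift.amount
    elsif node.contains.count > 1 && last.name.nil? && last.amount
      node.amount = node.contains.pop.amount
    end
  end
  node.contains.each { |child| hoist_amount!(child) }
  node
end

# "Saus (10%, tomaat)": the nameless "10%" child belongs to "Saus".
sauce = Item.new("Saus", nil, [Item.new(nil, "10%", []), Item.new("tomaat", nil, [])])
hoist_amount!(sauce)
```

Checking the first child before the last follows the order of the branches in the diff above. The last child is only considered when there is more than one child.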

data/lib/food_ingredient_parser/loose/transform/amount_from_name.treetop ADDED

@@ -0,0 +1,13 @@
+module FoodIngredientParser::Loose::Transform
+  grammar AmountFromName
+    include FoodIngredientParser::Strict::Grammar::Common
+    include FoodIngredientParser::Strict::Grammar::Amount
+    rule amount_from_name
+      # just amount, amount in front or at the end
+      ws* amount:amount ws+ name:(.*) /
+      ws* amount:amount ws* /
+      ws* name:( !amount word ( ws+ !amount word )* )+ ws* amount:amount ws*
+    end
+  end
+end

data/lib/food_ingredient_parser/{grammar → strict/grammar}/amount.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar Amount
     include Common
@@ -12,18 +12,19 @@ module FoodIngredientParser::Grammar
     rule simple_amount
       ( (
         'of which'i / 'at least'i / 'minimal'i / 'maximal'i / 'less than'i / 'more than'i /
-        'waarvan'i / 'ten minste'i / 'tenminste'i / 'minimaal'i / 'maximaal'i / 'minder dan'i / 'meer dan'i
+        'waarvan'i / 'ten minste'i / 'tenminste'i / 'minimaal'i / 'maximaal'i / 'minder dan'i / 'meer dan'i /
+        'min.'i / 'min'i / 'max.'i / 'max'i
       ) ws* )?
       [±∓~∼∽≂≃≈≲≤<>≥≳]? ws*
       simple_amount_quantity
       ( ws+ (
-        'minimum'i /
-        'minimaal'i / 'minimum'i
+        'minimaal'i / 'minimum'i / 'van het uitlekgewicht'i / 'van het geheel'i /
+        'min.'i / 'min'i / 'max.'i / 'max'i
       ) )?
     end
     rule simple_amount_quantity
-      number ( ws* '-' ws* number )? ws* ( '%' / 'g'i / 'mg'i / 'gram'i / 'ml'i )
+      number ( ws* '-' ws* number )? ws* ( [%٪⁒％﹪] / ( ( 'procent' / 'percent' / 'gram'i / 'ml'i / 'mg'i / 'g'i ) !char ) )
     end
   end
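
The `!char` lookahead added to `simple_amount_quantity` stops a unit such as `g` from matching the first letter of a longer word. The sketch below is a rough regular-expression analogue, not the Treetop grammar itself. It assumes that `char` matches a letter and that `number` is a plain integer or decimal; neither rule is shown in this diff. It also matches case-insensitively throughout, which differs slightly from the grammar. `AMOUNT_RE` and `leading_amount` are hypothetical names.

```ruby
# Amount at the start of a string: a number, then a percent sign or a unit
# that is not directly followed by another letter.
AMOUNT_RE = /\A\s*\d+(?:[.,]\d+)?\s*(?:[%٪⁒％﹪]|(?:procent|percent|gram|ml|mg|g)(?![[:alpha:]]))/i

# Return the leading amount as a String, or nil when there is none.
def leading_amount(text)
  m = AMOUNT_RE.match(text)
  m && m[0].strip
end
```

With this pattern, `leading_amount("10 g suiker")` gives `"10 g"`, while `leading_amount("10 groenten")` gives `nil` because the `g` is followed by another letter.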

data/lib/food_ingredient_parser/{grammar → strict/grammar}/common.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar Common
     rule ws

data/lib/food_ingredient_parser/{grammar → strict/grammar}/ingredient.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar Ingredient
     include IngredientSimple
     include IngredientNested

data/lib/food_ingredient_parser/{grammar → strict/grammar}/ingredient_coloned.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar IngredientColoned
     include Common
     include Amount

data/lib/food_ingredient_parser/{grammar → strict/grammar}/ingredient_nested.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar IngredientNested
     include Common
     include Amount

data/lib/food_ingredient_parser/{grammar → strict/grammar}/ingredient_simple.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar IngredientSimple
     include Common
     include Amount

data/lib/food_ingredient_parser/{grammar → strict/grammar}/list.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar List
     include Common
     include Ingredient

data/lib/food_ingredient_parser/{grammar → strict/grammar}/list_coloned.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar ListColoned
     include Common
     include IngredientSimple

data/lib/food_ingredient_parser/{grammar → strict/grammar}/list_newlined.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar ListNewlined
     include Common
     include IngredientSimple

data/lib/food_ingredient_parser/{grammar → strict/grammar}/root.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar Root
     include Common
     include List

data/lib/food_ingredient_parser/strict/nodes.rb ADDED

@@ -0,0 +1,74 @@
+require 'treetop/runtime'
+require_relative 'to_html'
+# Needs to be in grammar namespace so Treetop can find the nodes.
+module FoodIngredientParser::Strict
+  module Grammar
+    # Treetop syntax node with our additions, use this as parent for all our own nodes.
+    class SyntaxNode < Treetop::Runtime::SyntaxNode
+      private
+      def to_a_deep(n, cls)
+        if n.is_a?(cls)
+          [n]
+        elsif n.nonterminal?
+          n.elements.map {|m| to_a_deep(m, cls) }.flatten(1).compact
+        end
+      end
+    end
+    # Root object, contains everything else.
+    class RootNode < SyntaxNode
+      include FoodIngredientParser::Strict::ToHtml
+      def to_h
+        h = { contains: contains.to_a }
+        if notes && notes_ary = to_a_deep(notes, NoteNode)&.map(&:text_value)
+          h[:notes] = notes_ary if notes_ary.length > 0
+        end
+        h
+      end
+    end
+    # List of ingredients.
+    class ListNode < SyntaxNode
+      def to_a
+        to_a_deep(contains, IngredientNode).map(&:to_h)
+      end
+    end
+    # Ingredient
+    class IngredientNode < SyntaxNode
+      def to_h
+        h = {}
+        h.merge!(to_a_deep(ing, IngredientNode)&.first&.to_h || {}) if respond_to?(:ing)
+        h.merge!(to_a_deep(amount, AmountNode)&.first&.to_h || {}) if respond_to?(:amount)
+        h[:name] = name.text_value if respond_to?(:name)
+        h[:name] = pre.text_value + h[:name] if respond_to?(:pre)
+        h[:name] = h[:name] + post.text_value if respond_to?(:post)
+        h[:mark] = mark.text_value if respond_to?(:mark) && mark.text_value != ''
+        h
+      end
+    end
+    # Ingredient with containing ingredients.
+    class NestedIngredientNode < IngredientNode
+      def to_h
+        super.merge({ contains: to_a_deep(contains, IngredientNode).map(&:to_h) })
+      end
+    end
+    # Amount, specifying an ingredient.
+    class AmountNode < SyntaxNode
+      def to_h
+        { amount: amount.text_value }
+      end
+    end
+    # Note at the end of the ingredient list.
+    class NoteNode < SyntaxNode
+    end
+  end
+end

data/lib/food_ingredient_parser/{parser.rb → strict/parser.rb} RENAMED

@@ -1,6 +1,7 @@
 require_relative 'grammar'
+require_relative '../cleaner'
-module FoodIngredientParser
+module FoodIngredientParser::Strict
   class Parser
     # @!attribute [r] parser
@@ -20,22 +21,9 @@ module FoodIngredientParser
     # @return [FoodIngredientParser::Grammar::RootNode] structured representation of food ingredients
     # @note Unrecognized options are passed to Treetop, but this is not guarenteed to remain so forever.
     def parse(s, clean: true, **options)
-      s = clean(s) if clean
+      s = FoodIngredientParser::Cleaner.clean(s) if clean
       @parser.parse(s, **options)
     end
-    private
-    def clean(s)
-      s.gsub!("\u00ad", "")             # strip soft hyphen
-      s.gsub!("\u0092", "'")            # windows-1252 apostrophe - https://stackoverflow.com/a/15564279/2866660
-      s.gsub!("aÄs", "aïs")             # encoding issue for maïs
-      s.gsub!("Ã¯", "ï")                # encoding issue
-      s.gsub!("Ã«", "ë")                # encoding issue
-      s.gsub!(/\A\s*"(.*)"\s*\z/, '\1') # enclosing double quotation marks
-      s.gsub!(/\A\s*'(.*)'\s*\z/, '\1') # enclosing single quotation marks
-      s
-    end
   end
 end

data/lib/food_ingredient_parser/strict/to_html.rb ADDED

@@ -0,0 +1,54 @@
+require 'cgi'
+# Adds HTML output functionality to a Treetop Node.
+#
+# The node needs to provide a {#to_h} method (for {#to_html_h}).
+#
+module FoodIngredientParser::Strict
+  module ToHtml
+    # Markup original ingredients list text in HTML.
+    #
+    # The input text is returned as HTML, augmented with CSS classes
+    # on +span+s for +name+, +amount+, +mark+ and +note+.
+    #
+    # @return [String] HTML representation of ingredient list.
+    def to_html
+      node_to_html(self)
+    end
+    private
+    def node_to_html(node, cls=nil, depth=0)
+      el_cls = {}               # map of node instances to class names for contained elements
+      terminal = node.terminal? # whether to look at children elements or not
+      if node.is_a?(FoodIngredientParser::Strict::Grammar::AmountNode)
+        cls ||= "amount"
+      elsif node.is_a?(FoodIngredientParser::Strict::Grammar::NoteNode)
+        cls ||= "note"
+        terminal = true # NoteNodes may contain other NoteNodes, we want it flat.
+      elsif node.is_a?(FoodIngredientParser::Strict::Grammar::IngredientNode)
+        el_cls[node.name] = "name" if node.respond_to?(:name)
+        el_cls[node.mark] = "mark" if node.respond_to?(:mark)
+        if node.respond_to?(:contains)
+          el_cls[node.contains] = "contains depth#{depth}"
+          depth += 1
+        end
+      elsif node.is_a?(FoodIngredientParser::Strict::Grammar::RootNode)
+        if node.respond_to?(:contains)
+          el_cls[node.contains] = "depth#{depth}"
+          depth += 1
+        end
+      end
+      val = if terminal
+        CGI.escapeHTML(node.text_value)
+      else
+        node.elements.map {|el| node_to_html(el, el_cls[el], depth) }.join("")
+      end
+      cls ? "<span class='#{cls}'>#{val}</span>" : val
+    end
+  end
+end

data/lib/food_ingredient_parser/version.rb CHANGED

@@ -1,4 +1,4 @@
 module FoodIngredientParser
-  VERSION      = '1.0.0.pre.5'
-  VERSION_DATE = '2018-09-07'
+  VERSION      = '1.0.0.pre.6'
+  VERSION_DATE = '2018-09-17'
 end

data/lib/food_ingredient_parser.rb CHANGED

@@ -1,2 +1,3 @@
 require_relative 'food_ingredient_parser/version'
-require_relative 'food_ingredient_parser/parser'
+require_relative 'food_ingredient_parser/strict/parser'
+require_relative 'food_ingredient_parser/loose/parser'

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: food_ingredient_parser
 version: !ruby/object:Gem::Version
-  version: 1.0.0.pre.5
+  version: 1.0.0.pre.6
 platform: ruby
 authors:
 - wvengen
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-09-07 00:00:00.000000000 Z
+date: 2018-09-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: treetop
@@ -42,20 +42,26 @@ files:
 - bin/food_ingredient_parser
 - food_ingredient_parser.gemspec
 - lib/food_ingredient_parser.rb
-- lib/food_ingredient_parser/grammar.rb
-- lib/food_ingredient_parser/grammar/amount.treetop
-- lib/food_ingredient_parser/grammar/common.treetop
-- lib/food_ingredient_parser/grammar/ingredient.treetop
-- lib/food_ingredient_parser/grammar/ingredient_coloned.treetop
-- lib/food_ingredient_parser/grammar/ingredient_nested.treetop
-- lib/food_ingredient_parser/grammar/ingredient_simple.treetop
-- lib/food_ingredient_parser/grammar/list.treetop
-- lib/food_ingredient_parser/grammar/list_coloned.treetop
-- lib/food_ingredient_parser/grammar/list_newlined.treetop
-- lib/food_ingredient_parser/grammar/root.treetop
-- lib/food_ingredient_parser/nodes.rb
-- lib/food_ingredient_parser/parser.rb
-- lib/food_ingredient_parser/to_html.rb
+- lib/food_ingredient_parser/cleaner.rb
+- lib/food_ingredient_parser/loose/node.rb
+- lib/food_ingredient_parser/loose/parser.rb
+- lib/food_ingredient_parser/loose/scanner.rb
+- lib/food_ingredient_parser/loose/transform/amount.rb
+- lib/food_ingredient_parser/loose/transform/amount_from_name.treetop
+- lib/food_ingredient_parser/strict/grammar.rb
+- lib/food_ingredient_parser/strict/grammar/amount.treetop
+- lib/food_ingredient_parser/strict/grammar/common.treetop
+- lib/food_ingredient_parser/strict/grammar/ingredient.treetop
+- lib/food_ingredient_parser/strict/grammar/ingredient_coloned.treetop
+- lib/food_ingredient_parser/strict/grammar/ingredient_nested.treetop
+- lib/food_ingredient_parser/strict/grammar/ingredient_simple.treetop
+- lib/food_ingredient_parser/strict/grammar/list.treetop
+- lib/food_ingredient_parser/strict/grammar/list_coloned.treetop
+- lib/food_ingredient_parser/strict/grammar/list_newlined.treetop
+- lib/food_ingredient_parser/strict/grammar/root.treetop
+- lib/food_ingredient_parser/strict/nodes.rb
+- lib/food_ingredient_parser/strict/parser.rb
+- lib/food_ingredient_parser/strict/to_html.rb
 - lib/food_ingredient_parser/version.rb
 homepage: https://github.com/q-m/food-ingredient-parser-ruby
 licenses:

data/lib/food_ingredient_parser/nodes.rb DELETED

@@ -1,72 +0,0 @@
-require 'treetop/runtime'
-require_relative 'to_html'
-# Needs to be in grammar namespace so Treetop can find the nodes.
-module FoodIngredientParser::Grammar
-  # Treetop syntax node with our additions, use this as parent for all our own nodes.
-  class SyntaxNode < Treetop::Runtime::SyntaxNode
-    private
-    def to_a_deep(n, cls)
-      if n.is_a?(cls)
-        [n]
-      elsif n.nonterminal?
-        n.elements.map {|m| to_a_deep(m, cls) }.flatten(1).compact
-      end
-    end
-  end
-  # Root object, contains everything else.
-  class RootNode < SyntaxNode
-    include FoodIngredientParser::ToHtml
-    def to_h
-      h = { contains: contains.to_a }
-      if notes && notes_ary = to_a_deep(notes, NoteNode)&.map(&:text_value)
-        h[:notes] = notes_ary if notes_ary.length > 0
-      end
-      h
-    end
-  end
-  # List of ingredients.
-  class ListNode < SyntaxNode
-    def to_a
-      to_a_deep(contains, IngredientNode).map(&:to_h)
-    end
-  end
-  # Ingredient
-  class IngredientNode < SyntaxNode
-    def to_h
-      h = {}
-      h.merge!(to_a_deep(ing, IngredientNode)&.first&.to_h || {}) if respond_to?(:ing)
-      h.merge!(to_a_deep(amount, AmountNode)&.first&.to_h || {}) if respond_to?(:amount)
-      h[:name] = name.text_value if respond_to?(:name)
-      h[:name] = pre.text_value + h[:name] if respond_to?(:pre)
-      h[:name] = h[:name] + post.text_value if respond_to?(:post)
-      h[:mark] = mark.text_value if respond_to?(:mark) && mark.text_value != ''
-      h
-    end
-  end
-  # Ingredient with containing ingredients.
-  class NestedIngredientNode < IngredientNode
-    def to_h
-      super.merge({ contains: to_a_deep(contains, IngredientNode).map(&:to_h) })
-    end
-  end
-  # Amount, specifying an ingredient.
-  class AmountNode < SyntaxNode
-    def to_h
-      { amount: amount.text_value }
-    end
-  end
-  # Note at the end of the ingredient list.
-  class NoteNode < SyntaxNode
-  end
-end

data/lib/food_ingredient_parser/to_html.rb DELETED

@@ -1,52 +0,0 @@
-require 'cgi'
-# Adds HTML output functionality to a Treetop Node.
-#
-# The node needs to provide a {#to_h} method (for {#to_html_h}).
-#
-module FoodIngredientParser::ToHtml
-  # Markup original ingredients list text in HTML.
-  #
-  # The input text is returned as HTML, augmented with CSS classes
-  # on +span+s for +name+, +amount+, +mark+ and +note+.
-  #
-  # @return [String] HTML representation of ingredient list.
-  def to_html
-    node_to_html(self)
-  end
-  private
-  def node_to_html(node, cls=nil, depth=0)
-    el_cls = {}               # map of node instances to class names for contained elements
-    terminal = node.terminal? # whether to look at children elements or not
-    if node.is_a?(FoodIngredientParser::Grammar::AmountNode)
-      cls ||= "amount"
-    elsif node.is_a?(FoodIngredientParser::Grammar::NoteNode)
-      cls ||= "note"
-      terminal = true # NoteNodes may contain other NoteNodes, we want it flat.
-    elsif node.is_a?(FoodIngredientParser::Grammar::IngredientNode)
-      el_cls[node.name] = "name" if node.respond_to?(:name)
-      el_cls[node.mark] = "mark" if node.respond_to?(:mark)
-      if node.respond_to?(:contains)
-        el_cls[node.contains] = "contains depth#{depth}"
-        depth += 1
-      end
-    elsif node.is_a?(FoodIngredientParser::Grammar::RootNode)
-      if node.respond_to?(:contains)
-        el_cls[node.contains] = "depth#{depth}"
-        depth += 1
-      end
-    end
-    val = if terminal
-      CGI.escapeHTML(node.text_value)
-    else
-      node.elements.map {|el| node_to_html(el, el_cls[el], depth) }.join("")
-    end
-    cls ? "<span class='#{cls}'>#{val}</span>" : val
-  end
-end

data/lib/food_ingredient_parser/{grammar.rb → strict/grammar.rb} RENAMED

File without changes