food_ingredient_parser 1.0.0.pre.5 → 1.0.0.pre.6

Files changed (28)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 8725e4ed3763020de6b46cad6709ce05aca0b77f
-  data.tar.gz: fedb82af99346e8db38a3beaca5914ab44be197a
+  metadata.gz: 54bdb9187f9a2dfbec67737ddc2a3ad90f4ca058
+  data.tar.gz: fcfc99674e0f58801ca3a375acebe91ba3f80c84
 SHA512:
-  metadata.gz: 114feb403f87140f2eccc21860c9a75aff4cff3583f3046f9815299f90dc6374c73b3b9ee60d682c3a9f91b2803ab3fc6c934f3255716986a494fbb51c9e5564
-  data.tar.gz: 8f94825d627ab8068fb3dfd6c20f61a20fc8612f06ec6a760d33e46def103ca11497e364d9e3a7c4720647794e644a7253ba6e226d984801a55e089a5961b753
+  metadata.gz: 773526e862a74f04614486f3542de0c89f8663e10f959978a7d3a2e1ba8703e8a9ae93bde6b043a62326be68956733a30e6c01d85b6578dba0abc9064590fb18
+  data.tar.gz: e3296ae3222745f20727eed70bd1a9dac84a1714f1bdc64c2cd524bfbd6589a470afea33eab1e4ce717dbec9f4b3b3d8d9350025d13a4046e78cea3d0aea9b6d

data/README.md CHANGED

@@ -22,11 +22,11 @@ require 'food_ingredient_parser'
 s = "Water* 60%, suiker 30%, voedingszuren: citroenzuur, appelzuur, zuurteregelaar: E576/E577, " \
     + "natuurlijke citroen-limoen aroma's 0,2%, zoetstof: steviolglycosiden, * = Biologisch. " \
     + "E = door de E.U. goedgekeurde toevoeging."
-parser = FoodIngredientParser::Parser.new
+parser = FoodIngredientParser::Strict::Parser.new
 puts parser.parse(s).to_h.inspect
 ```
 Results in
-```
+```ruby
 {
   :contains=>[
     {:name=>"Water", :amount=>"60%", :mark=>"*"},
@@ -58,14 +58,15 @@ running this from the source tree, use `bin/food_ingredient_parser` instead.
 ```
 $ food_ingredient_parser -h
-Usage: food_ingredient_parser [options] --file|-f <filename>
-       food_ingredient_parser [options] --string|-s <ingredients>
+Usage: bin/food_ingredient_parser [options] --file|-f <filename>
+       bin/food_ingredient_parser [options] --string|-s <ingredients>
     -f, --file FILE                  Parse all lines of the file as ingredient lists.
     -s, --string INGREDIENTS         Parse specified ingredient list.
     -q, --[no-]quiet                 Only show summary.
     -p, --parsed                     Only show lines that were successfully parsed.
-    -e, --escape                     Escape newlines
+    -r, --parser PARSER              Use specific parser (strict, loose).
+    -e, --[no-]escape                Escape newlines
     -c, --[no-]color                 Use color
     -n, --noresult                   Only show lines that had no result.
     -v, --[no-]verbose               Show more data (parsed tree).
@@ -102,6 +103,12 @@ RootNode+Root3 offset=0, "tomato" (contains,notes):
   SyntaxNode offset=6, ""
 {:contains=>[{:name=>"tomato"}]}
+$ food_ingredient_parser -v -r loose -s "tomato"
+"tomato"
+Node interval=0..5
+  Node interval=0..5, name="tomato"
+{:contains=>[{:name=>"tomato"}]}
 $ food_ingredient_parser -q -f data/test-cases
 parsed 35 (100.0%), no result 0 (0.0%)
 ```
@@ -114,12 +121,12 @@ When ingredient lists are entered manually, it can be very useful to show how th
 recognized. This can help understanding why a certain ingredients list cannot be parsed.
 For this you can use the `to_html` method on the parsed output, which returns the original
-text, augmented with CSS classes for different parts.
+text, augmented with CSS classes for different parts. (Available for strict parser only.)
 ```ruby
 require 'food_ingredient_parser'
-parsed = FoodIngredientParser::Parser.new.parse("Saus (10% tomaat*, zout). * = bio")
+parsed = FoodIngredientParser::Strict::Parser.new.parse("Saus (10% tomaat*, zout). * = bio")
 puts parsed.to_html
 ```
@@ -138,9 +145,38 @@ For an example of an interactive editor, see [examples/editor.rb](examples/edito
 ![editor example screenshot](examples/editor-screenshot.png)
+## Loose parser
+The strict parser only parses ingredient lists that conform to one of the many different
+formats expected. If you'd like to return a result always, even if that is not necessarily
+completely correct, you can use the _loose_ parser. This does not use Treetop, but looks
+at the input character for character and tries to make the best of it. Nevertheless, if you
+just want to have _some_ result, this can still be very useful.
+```ruby
+require 'food_ingredient_parser'
+parsed = FoodIngredientParser::Loose::Parser.new.parse("Saus [10% tomaat*, (zout); peper.")
+puts parsed.to_h
+```
+Even though the strict parser would not give a result, the loose parser returns:
+```ruby
+{
+  :contains=>[
+    {:name=>"Saus", :contains=>[
+      {:name=>"tomaat", :mark=>"*", :amount=>"10%"},
+      {:contains=>[{:name=>"zout"}]},
+      {:name=>"peper"}
+    ]}
+  ]
+}
+```
 ## Test data
 [`data/ingredient-samples-nl`](data/ingredient-samples-nl) contains about 150k
 real-world ingredient lists found on the Dutch market. Each line contains one ingredient
 list (newlines are encoded as `\n`, empty lines and those starting with `#` are ignored).
-Currently almost three quarter is recognized and parsed. We aim to reach at least 90%.
+The strict parser currently parses about three quarters, while the loose parser returns
+something for all of them.
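
Note on the README change: the new text suggests a natural usage pattern, which is to try the strict parser first and fall back to the loose parser only when there is no result. A minimal sketch of that pattern follows. The helper `parse_with_fallback` and the two stand-in classes are hypothetical and not part of the gem; they only mimic the behaviour described above (strict may return nothing, loose always returns something) so that the sketch runs without the gem installed.

```ruby
# Duck-typed stand-ins so the sketch runs without the gem. With the gem
# installed, FoodIngredientParser::Strict::Parser.new and
# FoodIngredientParser::Loose::Parser.new would take their place.
class NeverParses
  def parse(_text)
    nil # mimics the strict parser giving no result
  end
end

class AlwaysParses
  def parse(text)
    { contains: [{ name: text }] } # mimics the loose parser's best effort
  end
end

# Hypothetical helper: prefer the strict result, fall back to the loose one.
# Returns which parser produced the result, so callers can flag loose output.
def parse_with_fallback(text, strict:, loose:)
  result = strict.parse(text)
  return [:strict, result] if result
  [:loose, loose.parse(text)]
end

kind, result = parse_with_fallback("Saus [10% tomaat",
                                   strict: NeverParses.new,
                                   loose: AlwaysParses.new)
puts "#{kind}: #{result.inspect}"
```

Returning the parser kind alongside the result is a design choice of this sketch: loose output is not necessarily completely correct, so a caller may want to mark it for review.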

data/bin/food_ingredient_parser CHANGED

@@ -31,8 +31,7 @@ def colorize(color, s)
   end
 end
-def parse_single(s, parsed=nil, parser: nil, verbosity: 1, print: nil, escape: false, color: false)
-  parser ||= FoodIngredientParser::Parser.new
+def parse_single(s, parsed=nil, parser:, verbosity: 1, print: nil, escape: false, color: false)
   parsed ||= parser.parse(s)
   return unless print.nil? || (parsed && print == :parsed) || (!parsed && print == :noresult)
@@ -47,7 +46,7 @@ def parse_single(s, parsed=nil, parser: nil, verbosity: 1, print: nil, escape: f
   end
 end
-def parse_file(path, parser: nil, verbosity: 1, print: nil, escape: false, color: false)
+def parse_file(path, parser:, verbosity: 1, print: nil, escape: false, color: false)
   count_parsed = count_noresult = 0
   File.foreach(path) do |line|
     next if line =~ /^#/    # comment
@@ -70,8 +69,13 @@ verbosity = 1
 files = []
 strings = []
 print = nil
+parser_name = :strict
 escape = false
 color = true
+PARSERS = {
+  strict: FoodIngredientParser::Strict::Parser,
+  loose:  FoodIngredientParser::Loose::Parser
+}
 OptionParser.new do |opts|
   opts.banner = <<-EOF.gsub(/^    /, '')
     Usage: #{$0} [options] --file|-f <filename>
@@ -84,7 +88,8 @@ OptionParser.new do |opts|
   opts.on("-q", "--[no-]quiet", "Only show summary.") {|q| verbosity = q ? 0 : 1 }
   opts.on("-p", "--parsed", "Only show lines that were successfully parsed.") {|p| print = :parsed }
-  opts.on("-e", "--escape", "Escape newlines") {|e| escape = true }
+  opts.on("-r", "--parser PARSER", "Use specific parser (#{PARSERS.keys.join(", ")}).") {|p| parser_name = p&.downcase&.to_sym }
+  opts.on("-e", "--[no-]escape", "Escape newlines") {|e| escape = !!e }
   opts.on("-c", "--[no-]color", "Use color") {|e| color = !!e }
   opts.on("-n", "--noresult", "Only show lines that had no result.") {|p| print = :noresult }
   opts.on("-v", "--[no-]verbose", "Show more data (parsed tree).") {|v| verbosity = v ? 2 : 1 }
@@ -99,7 +104,10 @@ OptionParser.new do |opts|
 end.parse!
 if strings.any? || files.any?
-  parser = FoodIngredientParser::Parser.new
+  unless parser = PARSERS[parser_name]&.new
+    STDERR.puts("Please specify one of the known parsers: #{PARSERS.keys.join(", ")}.")
+    exit(1)
+  end
   strings.each {|s| parse_single(s, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
   files.each   {|f| parse_file(f, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
 else
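
Note on the `--parser` option: the script maps a symbol to a parser class through a lookup hash and treats an unknown name as an error. The same idea can be sketched in isolation as below. `StrictStub`, `LooseStub`, `PARSER_CLASSES` and `select_parser` are names made up for this sketch; the real script uses `PARSERS` and the gem's own classes.

```ruby
require 'optparse'

# Stand-ins for the real parser classes, so the sketch runs without the gem.
StrictStub = Class.new
LooseStub  = Class.new

PARSER_CLASSES = { strict: StrictStub, loose: LooseStub }.freeze

# Returns a parser instance for the name given with -r/--parser,
# or nil when the name is not one of the known keys.
def select_parser(argv)
  name = :strict # default, as in the script
  OptionParser.new do |opts|
    opts.on("-r", "--parser PARSER") { |p| name = p.downcase.to_sym }
  end.parse!(argv)
  PARSER_CLASSES[name]&.new
end

parser = select_parser(%w[-r loose])
puts parser.class
```

A `nil` return corresponds to the branch in the script that prints the list of known parsers and exits with status 1.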

data/lib/food_ingredient_parser/cleaner.rb ADDED

@@ -0,0 +1,16 @@
+module FoodIngredientParser
+  module Cleaner
+    def self.clean(s)
+      s.gsub!("\u00ad", "")             # strip soft hyphen
+      s.gsub!("\u0092", "'")            # windows-1252 apostrophe - https://stackoverflow.com/a/15564279/2866660
+      s.gsub!("aÄs", "aïs")             # encoding issue for maïs
+      s.gsub!("Ã¯", "ï")                # encoding issue
+      s.gsub!("Ã«", "ë")                # encoding issue
+      s.gsub!(/\A\s*"(.*)"\s*\z/, '\1') # enclosing double quotation marks
+      s.gsub!(/\A\s*'(.*)'\s*\z/, '\1') # enclosing single quotation marks
+      s
+    end
+  end
+end
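
Note on `Cleaner.clean`: it uses `gsub!`, so it modifies the caller's string in place and would raise on a frozen string. If that matters to a caller, a non-mutating variant could look like the sketch below. `CleanerSketch` is a hypothetical name, and only a subset of the substitutions is reproduced.

```ruby
# Hypothetical, non-mutating variant of Cleaner.clean (not part of the gem).
module CleanerSketch
  def self.clean(s)
    s = s.dup                          # work on a copy, so frozen input is fine
    s.gsub!("\u00ad", "")              # strip soft hyphen
    s.gsub!(/\A\s*"(.*)"\s*\z/, '\1')  # enclosing double quotation marks
    s.gsub!(/\A\s*'(.*)'\s*\z/, '\1')  # enclosing single quotation marks
    s
  end
end

input = "\"ma\u00adis, zout\"".freeze
puts CleanerSketch.clean(input) # prints: mais, zout
```

Callers of the gem as released can get the same effect by passing `s.dup` to `parse`, or by passing `clean: false` when the input is known to be clean.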

data/lib/food_ingredient_parser/loose/node.rb ADDED

@@ -0,0 +1,60 @@
+module FoodIngredientParser::Loose
+  # Parsing result.
+  class Node
+    attr_accessor :name, :mark, :amount, :contains, :notes
+    attr_reader :input, :interval, :auto_close
+    def initialize(input, interval, auto_close: false)
+      @input = input
+      @interval = interval.is_a?(Range) ? interval : ( interval .. interval )
+      @auto_close = auto_close
+      @contains = []
+      @notes = []
+      @name = @mark = @amount = nil
+    end
+    def ends(index)
+      @interval = @interval.first .. index
+    end
+    def <<(child)
+      @contains << child
+    end
+    def text_value
+      @input[@interval]
+    end
+    def to_h
+      r = {}
+      r[:name] = name.text_value.strip if name && name.text_value.strip != ''
+      r[:mark] = mark.text_value.strip if mark
+      r[:amount] = amount.text_value.strip if amount
+      r[:contains] = contains.map(&:to_h).reject {|c| c == {} } if contains.any?
+      r[:notes] = notes.map{|n| n.text_value.strip }.reject {|c| c == '' } if notes.any?
+      r
+    end
+    def inspect(indent="", variant="")
+      inspect_self(indent, variant) +
+      inspect_children(indent)
+    end
+    def inspect_self(indent="", variant="")
+      [
+        indent + "Node#{variant} interval=#{@interval}",
+        name ? "name=#{name.text_value.strip.inspect}" : nil,
+        mark ? "mark=#{mark.text_value.strip.inspect}" : nil,
+        amount ? "amount=#{amount.text_value.strip.inspect}" : nil,
+        auto_close ? "auto_close" : nil
+      ].compact.join(", ")
+    end
+    def inspect_children(indent="")
+      [
+        *contains.map {|child| "\n" + child.inspect(indent + "  ") },
+        *notes.map    {|note|  "\n" + note.inspect(indent + "  ", "(note)") }
+      ].join("")
+    end
+  end
+end
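
Note on `Loose::Node`: a node does not copy text. It keeps a reference to the whole input plus an index range, and `text_value` slices on demand. The reduced stand-in below (`SpanNode`, a name made up for this sketch) shows how a node starts as a one-character span and is widened with `ends` once the scanner knows where it stops.

```ruby
# Reduced stand-in for Loose::Node: stores the input and an index range.
class SpanNode
  attr_reader :input, :interval

  def initialize(input, interval)
    @input = input
    # A bare index becomes a one-element range, as in the original.
    @interval = interval.is_a?(Range) ? interval : (interval..interval)
  end

  # Move the end of the span, keeping its start.
  def ends(index)
    @interval = @interval.first..index
  end

  def text_value
    @input[@interval]
  end
end

s = "Saus (tomaat)"
node = SpanNode.new(s, 6) # starts at the "t" of "tomaat"
node.ends(11)             # widen once the closing position is known
puts node.text_value      # prints: tomaat
```

Keeping offsets rather than substrings is what lets `Transform::Amount` later create new nodes for the amount and name by adding offsets to the start of an existing span.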

data/lib/food_ingredient_parser/loose/parser.rb ADDED

@@ -0,0 +1,24 @@
+require_relative '../cleaner'
+require_relative 'scanner'
+require_relative 'transform/amount'
+module FoodIngredientParser::Loose
+  class Parser
+    # Create a new food ingredient stream parser
+    # @return [FoodIngredientParser::Loose::Parser]
+    def initialize
+    end
+    # Parse food ingredient list text into a structured representation.
+    #
+    # @option clean [Boolean] pass +false+ to disable correcting frequently occurring issues
+    # @return [FoodIngredientParser::Loose::Node] structured representation of food ingredients
+    def parse(s, clean: true, **options)
+      s = FoodIngredientParser::Cleaner.clean(s) if clean
+      n = Scanner.new(s).scan
+      n = Transform::Amount.transform!(n) if n
+      n
+    end
+  end
+end

data/lib/food_ingredient_parser/loose/scanner.rb ADDED

@@ -0,0 +1,191 @@
+require_relative 'node'
+module FoodIngredientParser::Loose
+  class Scanner
+    SEP_CHARS  = "|;,.".freeze
+    MARK_CHARS = "¹²³⁴⁵ᵃᵇᶜᵈᵉᶠᵍªº⁽⁾†‡•°#^*".freeze
+    PREFIX_RE  = /\A\s*(ingredients|contains|ingred[iï][eë]nt(en)?(declaratie)?|bevat|dit zit er\s?in|samenstelling|zutaten)\s*[:;.]\s*/i.freeze
+    NOTE_RE    = /\A\b(dit product kan\b|kan sporen\b.*?\bbevatten\b|voor allergenen\b|allergenen\b|E\s*=|gemaakt in\b|geproduceerd in\b|bevat mogelijk\b|kijk voor meer\b|allergie-info|in de fabriek\b|in dit bedrijf\b)/i.freeze
+    def initialize(s, index: 0)
+      @s = s                           # input string
+      @i = index                       # current index in string
+      @cur = nil                       # current node we're populating
+      @ancestors = [Node.new(@s, @i)]  # nesting hierarchy
+      @iterator = :beginning           # scan_iteration_<iterator> to use for parsing
+      @dest = :contains                # append current node to this attribute on parent
+    end
+    def scan
+      loop do
+        method(:"scan_iteration_#{@iterator}").call
+      end
+      close_all_ancestors
+      @ancestors.first.ends(@i-1)
+      @ancestors.first
+    end
+    private
+    def loop
+      while @i < @s.length
+        @i += 1 if yield != false
+      end
+    end
+    def scan_iteration_beginning
+      # skip over some common prefixes
+      m = @s[@i .. -1].match(PREFIX_RE)
+      @i += m.offset(0).last if m
+      # now continue with the standard parsing
+      @iterator = :standard
+      false
+    end
+    def scan_iteration_standard
+      if "([".include?(c)         # open nesting
+        open_parent
+      elsif ")]".include?(c)      # close nesting
+        add_child
+        close_parent
+      elsif is_notes_start?       # usually a dot marks the start of notes
+        close_all_ancestors
+        @iterator = :notes
+        @dest = :notes
+      elsif is_sep?               # separator
+        add_child
+      elsif ":".include?(c)       # another open nesting
+        add_child
+        open_parent(auto_close: true)
+        @iterator = :colon
+      elsif is_mark? && !cur.mark # mark after ingredient
+        name_until_here
+        len = mark_len
+        cur.mark = Node.new(@s, @i .. @i+len-1)
+        @i += len - 1
+      else
+        cur # reference to record starting position
+      end
+    end
+    def scan_iteration_colon
+      if "/".include?(c)        # slash separator in colon nesting only
+        add_child
+      elsif is_sep?             # regular separator indicates end of colon nesting
+        add_child
+        close_parent
+        # revert to standard parsing from here on
+        @iterator = :standard
+        scan_iteration_standard
+      elsif "([]):".include?(c) # continue with deeper nesting level
+        # revert to standard parsing from here on
+        @iterator = :standard
+        scan_iteration_standard
+      else
+        # normal handling for this character
+        scan_iteration_standard
+      end
+    end
+    def scan_iteration_notes
+      if is_sep?(chars: ".")    # dot means new note
+        add_child
+      else
+        cur
+      end
+    end
+    def c
+      @s[@i]
+    end
+    def parent
+      @ancestors.last
+    end
+    def cur
+      @cur ||= Node.new(@s, @i)
+    end
+    def is_mark?
+      mark_len > 0 && @s[@i..@i+1] !~ /\A°[CF]/
+    end
+    def is_sep?(chars: SEP_CHARS)
+      chars.include?(c) && @s[@i-1..@i+1] !~ /\A\d.\d\z/
+    end
+    def mark_len
+      i = @i
+      while @s[i] && MARK_CHARS.include?(@s[i])
+        i += 1
+      end
+      i - @i
+    end
+    def is_notes_start?
+      # @todo use more heuristics: don't assume dot is notes when separator is a dot, and only toplevel?
+      if ( is_mark? && @s[@i+mark_len..-1] =~ /\A\s*=/ ) ||     # "* = Biologisch"
+         ( is_mark? && @s[@i-2..@i-1] =~ /\A\s\s/ ) ||          # "  **Biologisch"
+         ( @s[@i..-1] =~ NOTE_RE )                              # "E=", "Kan sporen van", ...
+        @i -= 1 # we want to include the mark in the note
+        true
+      # End of sentence
+      elsif dot_is_not_sep? && is_sep?(chars: ".")
+        true
+      else
+        false
+      end
+    end
+    def add_child
+      cur.ends(@i-1)
+      cur.name ||= Node.new(@s, cur.interval)
+      parent.send(@dest) << cur
+      @cur = nil
+    end
+    def open_parent(**options)
+      name_until_here
+      @ancestors << cur
+      @cur = Node.new(@s, @i + 1, **options)
+    end
+    def close_parent
+      return unless @ancestors.count > 1
+      @cur = @ancestors.pop
+      while @cur.auto_close
+        add_child
+        @cur = @ancestors.pop
+      end
+    end
+    def close_all_ancestors
+      while @ancestors.count > 1
+        add_child
+        close_parent
+      end
+      add_child
+    end
+    def name_until_here
+      cur.name ||= Node.new(@s, cur.interval.first .. @i-1)
+    end
+    def dot_is_not_sep?
+      # if separator is dot ".", don't use it for note detection
+      if @dot_is_not_sep.nil?
+        @dot_is_not_sep = begin
+          # @todo if another separator is found more often, dot is not a separator
+          num_words = @s.split(/\s+/).count
+          num_dots = @s.count(".")
+          # heuristic: 1/4+ of the words has a dot, with at least five words
+          num_words < 5 || 4 * num_dots < num_words
+        end
+      end
+      @dot_is_not_sep
+    end
+  end
+end
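
Note on the scanner heuristics: two small checks do much of the work. `is_sep?` refuses to treat a separator character as a separator when it sits between two digits, and `dot_is_not_sep?` decides once per input whether a dot ends the ingredient list or merely separates items. The standalone functions below restate both checks outside the class. `separator_at?` and `dot_ends_list?` are names chosen for this sketch, and the explicit bounds handling is an addition that the original does not have.

```ruby
SEP_CHARS = "|;,."

# True when s[i] is a separator character that is not sitting between two
# digits, so the comma in "0,2%" stays part of the amount.
def separator_at?(s, i, chars: SEP_CHARS)
  return false if i >= s.length || !chars.include?(s[i])
  return true if i.zero? # nothing before it, so no digit pair is possible
  s[i - 1..i + 1] !~ /\A\d.\d\z/
end

# Mirrors dot_is_not_sep?: true when dots are rare enough that a dot is read
# as the end of the ingredient list rather than as the item separator.
def dot_ends_list?(s)
  num_words = s.split(/\s+/).count
  num_dots  = s.count(".")
  num_words < 5 || 4 * num_dots < num_words
end

list = "aroma 0,2%, zout"
puts separator_at?(list, 7)  # comma inside "0,2": false
puts separator_at?(list, 10) # comma after "%": true
```

With five words and five dots, as in `"water. suiker. zout. peper. azijn."`, `dot_ends_list?` returns false, so the dot is treated as an ordinary separator for that input.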

data/lib/food_ingredient_parser/loose/transform/amount.rb ADDED

@@ -0,0 +1,70 @@
+require 'treetop'
+require_relative '../../strict/nodes'
+Treetop.load File.dirname(__FILE__) + '/../../strict/grammar/common'
+Treetop.load File.dirname(__FILE__) + '/../../strict/grammar/amount'
+Treetop.load File.dirname(__FILE__) + '/amount_from_name'
+require_relative '../node'
+module FoodIngredientParser::Loose
+  module Transform
+    # Transforms node tree to extract amount into its own attribute.
+    class Amount
+      def self.transform!(node)
+        new(node).transform!
+      end
+      def initialize(node)
+        @node = node
+        @parser = FoodIngredientParser::Loose::Transform::AmountFromNameParser.new
+      end
+      def transform!
+        transform_name
+        transform_contains
+        @node
+      end
+      private
+      # Extract amount from name, if any.
+      def transform_name(node = @node)
+        if !node.amount && parsed = parse_amount(node.name&.text_value)
+          offset = node.name.interval.first
+          amount = parsed.amount.amount
+          node.amount = Node.new(node.input, offset + amount.interval.first .. offset + amount.interval.last - 1)
+          name = parsed.respond_to?(:name) && parsed.name
+          if name && name.interval.count > 0
+            node.name = Node.new(node.input, offset + name.interval.first .. offset + name.interval.last - 1)
+          else
+            node.name = nil
+          end
+        end
+        # recursively transform contained nodes
+        node.contains&.each(&method(:transform_name))
+      end
+      # If first or last child is an amount, it's this node's amount.
+      # Assumes all names already have extracted their amounts with {{#transform_name}}.
+      def transform_contains(node = @node)
+        if !node.amount && node.contains.any?
+          if node.contains.first.name.nil? && node.contains.first.amount
+            node.amount = node.contains.shift.amount
+          elsif node.contains.count > 1 && node.contains.last.name.nil? && node.contains.last.amount
+            node.amount = node.contains.pop.amount
+          end
+        end
+        # recursively transform contained nodes
+        node.contains.each(&method(:transform_contains))
+      end
+      def parse_amount(s)
+        @parser.parse(s) if s && s.strip != ''
+      end
+    end
+  end
+end
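
Note on `transform_contains`: when the first or last child of a node consists of nothing but an amount, that amount is moved up to the parent and the child is dropped. This is how `Saus [10% tomaat, ...` style input ends up with the amount attached to the right node. The sketch below replays that rule on plain structs. `Item` and `hoist_amount!` are hypothetical names, and amounts are plain strings here rather than nodes.

```ruby
Item = Struct.new(:name, :amount, :contains)

# If the first (or, failing that, the last) child is a nameless amount,
# make it the amount of the parent and remove that child.
def hoist_amount!(node)
  kids = node.contains
  if node.amount.nil? && kids.any?
    if kids.first.name.nil? && kids.first.amount
      node.amount = kids.shift.amount
    elsif kids.count > 1 && kids.last.name.nil? && kids.last.amount
      node.amount = kids.pop.amount
    end
  end
  kids.each { |kid| hoist_amount!(kid) } # recurse into contained items
  node
end

sauce = Item.new("Saus", nil, [
  Item.new(nil, "10%", []),
  Item.new("tomaat", nil, []),
  Item.new("zout", nil, [])
])
hoist_amount!(sauce)
puts sauce.amount                     # prints: 10%
puts sauce.contains.map(&:name).inspect # prints: ["tomaat", "zout"]
```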

data/lib/food_ingredient_parser/loose/transform/amount_from_name.treetop ADDED

@@ -0,0 +1,13 @@
+module FoodIngredientParser::Loose::Transform
+  grammar AmountFromName
+    include FoodIngredientParser::Strict::Grammar::Common
+    include FoodIngredientParser::Strict::Grammar::Amount
+    rule amount_from_name
+      # just amount, amount in front or at the end
+      ws* amount:amount ws+ name:(.*) /
+      ws* amount:amount ws* /
+      ws* name:( !amount word ( ws+ !amount word )* )+ ws* amount:amount ws*
+    end
+  end
+end

data/lib/food_ingredient_parser/{grammar → strict/grammar}/amount.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar Amount
     include Common
@@ -12,18 +12,19 @@ module FoodIngredientParser::Grammar
     rule simple_amount
       ( (
         'of which'i / 'at least'i / 'minimal'i / 'maximal'i / 'less than'i / 'more than'i /
-        'waarvan'i / 'ten minste'i / 'tenminste'i / 'minimaal'i / 'maximaal'i / 'minder dan'i / 'meer dan'i
+        'waarvan'i / 'ten minste'i / 'tenminste'i / 'minimaal'i / 'maximaal'i / 'minder dan'i / 'meer dan'i /
+        'min.'i / 'min'i / 'max.'i / 'max'i
       ) ws* )?
       [±∓~∼∽≂≃≈≲≤<>≥≳]? ws*
       simple_amount_quantity
       ( ws+ (
-        'minimum'i /
-        'minimaal'i / 'minimum'i
+        'minimaal'i / 'minimum'i / 'van het uitlekgewicht'i / 'van het geheel'i /
+        'min.'i / 'min'i / 'max.'i / 'max'i
       ) )?
     end
     rule simple_amount_quantity
-      number ( ws* '-' ws* number )? ws* ( '%' / 'g'i / 'mg'i / 'gram'i / 'ml'i )
+      number ( ws* '-' ws* number )? ws* ( [%٪⁒％﹪] / ( ( 'procent' / 'percent' / 'gram'i / 'ml'i / 'mg'i / 'g'i ) !char ) )
     end
   end
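
Note on the new `simple_amount_quantity` rule: unit words are now followed by `!char`, a negative lookahead, so that a unit is only accepted when it is not the start of a longer word. The regular expression below is a rough Ruby analogue, written under two assumptions: that `char` (from the Common grammar, not shown in this diff) matches a letter, and that a plain integer is enough to stand in for `number`. It is an illustration of the lookahead idea, not a translation of the grammar; in particular it makes every unit case-insensitive and only accepts the ASCII percent sign.

```ruby
# Simplified stand-in for simple_amount_quantity: an integer, then either a
# percent sign or a unit word that is not followed by another letter.
QUANTITY_RE = /\A\d+\s*(?:%|(?:procent|percent|gram|ml|mg|g)(?![[:alpha:]]))/i

# Returns the leading quantity of the text, or nil when there is none.
def quantity_prefix(text)
  text[QUANTITY_RE]
end

puts quantity_prefix("10 g suiker").inspect # prints: "10 g"
puts quantity_prefix("10 groente").inspect  # prints: nil
```

In the grammar the longer alternatives (`'gram'i`, `'mg'i`) come before `'g'i`. Treetop uses ordered choice, so once an alternative inside the group matches, a later failure of `!char` fails the whole sequence instead of trying the next alternative; a backtracking regular expression is more forgiving on that point.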

data/lib/food_ingredient_parser/{grammar → strict/grammar}/common.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar Common
     rule ws

data/lib/food_ingredient_parser/{grammar → strict/grammar}/ingredient.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar Ingredient
     include IngredientSimple
     include IngredientNested

data/lib/food_ingredient_parser/{grammar → strict/grammar}/ingredient_coloned.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar IngredientColoned
     include Common
     include Amount

data/lib/food_ingredient_parser/{grammar → strict/grammar}/ingredient_nested.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar IngredientNested
     include Common
     include Amount

data/lib/food_ingredient_parser/{grammar → strict/grammar}/ingredient_simple.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar IngredientSimple
     include Common
     include Amount

data/lib/food_ingredient_parser/{grammar → strict/grammar}/list.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar List
     include Common
     include Ingredient

data/lib/food_ingredient_parser/{grammar → strict/grammar}/list_coloned.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar ListColoned
     include Common
     include IngredientSimple

data/lib/food_ingredient_parser/{grammar → strict/grammar}/list_newlined.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar ListNewlined
     include Common
     include IngredientSimple

data/lib/food_ingredient_parser/{grammar → strict/grammar}/root.treetop RENAMED

@@ -1,4 +1,4 @@
-module FoodIngredientParser::Grammar
+module FoodIngredientParser::Strict::Grammar
   grammar Root
     include Common
     include List

data/lib/food_ingredient_parser/strict/nodes.rb ADDED

@@ -0,0 +1,74 @@
+require 'treetop/runtime'
+require_relative 'to_html'
+# Needs to be in grammar namespace so Treetop can find the nodes.
+module FoodIngredientParser::Strict
+  module Grammar
+    # Treetop syntax node with our additions, use this as parent for all our own nodes.
+    class SyntaxNode < Treetop::Runtime::SyntaxNode
+      private
+      def to_a_deep(n, cls)
+        if n.is_a?(cls)
+          [n]
+        elsif n.nonterminal?
+          n.elements.map {|m| to_a_deep(m, cls) }.flatten(1).compact
+        end
+      end
+    end
+    # Root object, contains everything else.
+    class RootNode < SyntaxNode
+      include FoodIngredientParser::Strict::ToHtml
+      def to_h
+        h = { contains: contains.to_a }
+        if notes && notes_ary = to_a_deep(notes, NoteNode)&.map(&:text_value)
+          h[:notes] = notes_ary if notes_ary.length > 0
+        end
+        h
+      end
+    end
+    # List of ingredients.
+    class ListNode < SyntaxNode
+      def to_a
+        to_a_deep(contains, IngredientNode).map(&:to_h)
+      end
+    end
+    # Ingredient
+    class IngredientNode < SyntaxNode
+      def to_h
+        h = {}
+        h.merge!(to_a_deep(ing, IngredientNode)&.first&.to_h || {}) if respond_to?(:ing)
+        h.merge!(to_a_deep(amount, AmountNode)&.first&.to_h || {}) if respond_to?(:amount)
+        h[:name] = name.text_value if respond_to?(:name)
+        h[:name] = pre.text_value + h[:name] if respond_to?(:pre)
+        h[:name] = h[:name] + post.text_value if respond_to?(:post)
+        h[:mark] = mark.text_value if respond_to?(:mark) && mark.text_value != ''
+        h
+      end
+    end
+    # Ingredient with containing ingredients.
+    class NestedIngredientNode < IngredientNode
+      def to_h
+        super.merge({ contains: to_a_deep(contains, IngredientNode).map(&:to_h) })
+      end
+    end
+    # Amount, specifying an ingredient.
+    class AmountNode < SyntaxNode
+      def to_h
+        { amount: amount.text_value }
+      end
+    end
+    # Note at the end of the ingredient list.
+    class NoteNode < SyntaxNode
+    end
+  end
+end

data/lib/food_ingredient_parser/{parser.rb → strict/parser.rb} RENAMED

@@ -1,6 +1,7 @@
 require_relative 'grammar'
+require_relative '../cleaner'
-module FoodIngredientParser
+module FoodIngredientParser::Strict
   class Parser
     # @!attribute [r] parser
@@ -20,22 +21,9 @@ module FoodIngredientParser
     # @return [FoodIngredientParser::Grammar::RootNode] structured representation of food ingredients
     # @note Unrecognized options are passed to Treetop, but this is not guaranteed to remain so forever.
     def parse(s, clean: true, **options)
-      s = clean(s) if clean
+      s = FoodIngredientParser::Cleaner.clean(s) if clean
       @parser.parse(s, **options)
     end
-    private
-    def clean(s)
-      s.gsub!("\u00ad", "")             # strip soft hyphen
-      s.gsub!("\u0092", "'")            # windows-1252 apostrophe - https://stackoverflow.com/a/15564279/2866660
-      s.gsub!("aÄs", "aïs")             # encoding issue for maïs
-      s.gsub!("Ã¯", "ï")                # encoding issue
-      s.gsub!("Ã«", "ë")                # encoding issue
-      s.gsub!(/\A\s*"(.*)"\s*\z/, '\1') # enclosing double quotation marks
-      s.gsub!(/\A\s*'(.*)'\s*\z/, '\1') # enclosing single quotation marks
-      s
-    end
   end
 end

data/lib/food_ingredient_parser/strict/to_html.rb ADDED

@@ -0,0 +1,54 @@
+require 'cgi'
+# Adds HTML output functionality to a Treetop Node.
+#
+# The node needs to provide a {#to_h} method (for {#to_html_h}).
+#
+module FoodIngredientParser::Strict
+  module ToHtml
+    # Markup original ingredients list text in HTML.
+    #
+    # The input text is returned as HTML, augmented with CSS classes
+    # on +span+s for +name+, +amount+, +mark+ and +note+.
+    #
+    # @return [String] HTML representation of ingredient list.
+    def to_html
+      node_to_html(self)
+    end
+    private
+    def node_to_html(node, cls=nil, depth=0)
+      el_cls = {}               # map of node instances to class names for contained elements
+      terminal = node.terminal? # whether to look at children elements or not
+      if node.is_a?(FoodIngredientParser::Strict::Grammar::AmountNode)
+        cls ||= "amount"
+      elsif node.is_a?(FoodIngredientParser::Strict::Grammar::NoteNode)
+        cls ||= "note"
+        terminal = true # NoteNodes may contain other NoteNodes, we want it flat.
+      elsif node.is_a?(FoodIngredientParser::Strict::Grammar::IngredientNode)
+        el_cls[node.name] = "name" if node.respond_to?(:name)
+        el_cls[node.mark] = "mark" if node.respond_to?(:mark)
+        if node.respond_to?(:contains)
+          el_cls[node.contains] = "contains depth#{depth}"
+          depth += 1
+        end
+      elsif node.is_a?(FoodIngredientParser::Strict::Grammar::RootNode)
+        if node.respond_to?(:contains)
+          el_cls[node.contains] = "depth#{depth}"
+          depth += 1
+        end
+      end
+      val = if terminal
+        CGI.escapeHTML(node.text_value)
+      else
+        node.elements.map {|el| node_to_html(el, el_cls[el], depth) }.join("")
+      end
+      cls ? "<span class='#{cls}'>#{val}</span>" : val
+    end
+  end
+end
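
Note on `node_to_html`: the final step escapes the node's text and wraps it in a `span` only when a CSS class was assigned. That last step can be tried on its own with the sketch below. The helper name `span` is made up for this sketch, which always treats its input as terminal text; the real method recurses through child elements first.

```ruby
require 'cgi'

# Escape the text and wrap it in a span when a class name is given.
def span(cls, text)
  escaped = CGI.escapeHTML(text)
  cls ? "<span class='#{cls}'>#{escaped}</span>" : escaped
end

puts span("name", "zout & peper") # prints: <span class='name'>zout &amp; peper</span>
puts span(nil, "a < b")           # prints: a &lt; b
```

Escaping before wrapping matters because the text comes straight from the ingredient list as entered, which may contain `<`, `>` or `&`.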

data/lib/food_ingredient_parser/version.rb CHANGED

@@ -1,4 +1,4 @@
 module FoodIngredientParser
-  VERSION      = '1.0.0.pre.5'
-  VERSION_DATE = '2018-09-07'
+  VERSION      = '1.0.0.pre.6'
+  VERSION_DATE = '2018-09-17'
 end

data/lib/food_ingredient_parser.rb CHANGED

@@ -1,2 +1,3 @@
 require_relative 'food_ingredient_parser/version'
-require_relative 'food_ingredient_parser/parser'
+require_relative 'food_ingredient_parser/strict/parser'
+require_relative 'food_ingredient_parser/loose/parser'

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: food_ingredient_parser
 version: !ruby/object:Gem::Version
-  version: 1.0.0.pre.5
+  version: 1.0.0.pre.6
 platform: ruby
 authors:
 - wvengen
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-09-07 00:00:00.000000000 Z
+date: 2018-09-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: treetop
@@ -42,20 +42,26 @@ files:
 - bin/food_ingredient_parser
 - food_ingredient_parser.gemspec
 - lib/food_ingredient_parser.rb
-- lib/food_ingredient_parser/grammar.rb
-- lib/food_ingredient_parser/grammar/amount.treetop
-- lib/food_ingredient_parser/grammar/common.treetop
-- lib/food_ingredient_parser/grammar/ingredient.treetop
-- lib/food_ingredient_parser/grammar/ingredient_coloned.treetop
-- lib/food_ingredient_parser/grammar/ingredient_nested.treetop
-- lib/food_ingredient_parser/grammar/ingredient_simple.treetop
-- lib/food_ingredient_parser/grammar/list.treetop
-- lib/food_ingredient_parser/grammar/list_coloned.treetop
-- lib/food_ingredient_parser/grammar/list_newlined.treetop
-- lib/food_ingredient_parser/grammar/root.treetop
-- lib/food_ingredient_parser/nodes.rb
-- lib/food_ingredient_parser/parser.rb
-- lib/food_ingredient_parser/to_html.rb
+- lib/food_ingredient_parser/cleaner.rb
+- lib/food_ingredient_parser/loose/node.rb
+- lib/food_ingredient_parser/loose/parser.rb
+- lib/food_ingredient_parser/loose/scanner.rb
+- lib/food_ingredient_parser/loose/transform/amount.rb
+- lib/food_ingredient_parser/loose/transform/amount_from_name.treetop
+- lib/food_ingredient_parser/strict/grammar.rb
+- lib/food_ingredient_parser/strict/grammar/amount.treetop
+- lib/food_ingredient_parser/strict/grammar/common.treetop
+- lib/food_ingredient_parser/strict/grammar/ingredient.treetop
+- lib/food_ingredient_parser/strict/grammar/ingredient_coloned.treetop
+- lib/food_ingredient_parser/strict/grammar/ingredient_nested.treetop
+- lib/food_ingredient_parser/strict/grammar/ingredient_simple.treetop
+- lib/food_ingredient_parser/strict/grammar/list.treetop
+- lib/food_ingredient_parser/strict/grammar/list_coloned.treetop
+- lib/food_ingredient_parser/strict/grammar/list_newlined.treetop
+- lib/food_ingredient_parser/strict/grammar/root.treetop
+- lib/food_ingredient_parser/strict/nodes.rb
+- lib/food_ingredient_parser/strict/parser.rb
+- lib/food_ingredient_parser/strict/to_html.rb
 - lib/food_ingredient_parser/version.rb
 homepage: https://github.com/q-m/food-ingredient-parser-ruby
 licenses:

data/lib/food_ingredient_parser/nodes.rb DELETED

@@ -1,72 +0,0 @@
-require 'treetop/runtime'
-require_relative 'to_html'
-# Needs to be in grammar namespace so Treetop can find the nodes.
-module FoodIngredientParser::Grammar
-  # Treetop syntax node with our additions, use this as parent for all our own nodes.
-  class SyntaxNode < Treetop::Runtime::SyntaxNode
-    private
-    def to_a_deep(n, cls)
-      if n.is_a?(cls)
-        [n]
-      elsif n.nonterminal?
-        n.elements.map {|m| to_a_deep(m, cls) }.flatten(1).compact
-      end
-    end
-  end
-  # Root object, contains everything else.
-  class RootNode < SyntaxNode
-    include FoodIngredientParser::ToHtml
-    def to_h
-      h = { contains: contains.to_a }
-      if notes && notes_ary = to_a_deep(notes, NoteNode)&.map(&:text_value)
-        h[:notes] = notes_ary if notes_ary.length > 0
-      end
-      h
-    end
-  end
-  # List of ingredients.
-  class ListNode < SyntaxNode
-    def to_a
-      to_a_deep(contains, IngredientNode).map(&:to_h)
-    end
-  end
-  # Ingredient
-  class IngredientNode < SyntaxNode
-    def to_h
-      h = {}
-      h.merge!(to_a_deep(ing, IngredientNode)&.first&.to_h || {}) if respond_to?(:ing)
-      h.merge!(to_a_deep(amount, AmountNode)&.first&.to_h || {}) if respond_to?(:amount)
-      h[:name] = name.text_value if respond_to?(:name)
-      h[:name] = pre.text_value + h[:name] if respond_to?(:pre)
-      h[:name] = h[:name] + post.text_value if respond_to?(:post)
-      h[:mark] = mark.text_value if respond_to?(:mark) && mark.text_value != ''
-      h
-    end
-  end
-  # Ingredient with containing ingredients.
-  class NestedIngredientNode < IngredientNode
-    def to_h
-      super.merge({ contains: to_a_deep(contains, IngredientNode).map(&:to_h) })
-    end
-  end
-  # Amount, specifying an ingredient.
-  class AmountNode < SyntaxNode
-    def to_h
-      { amount: amount.text_value }
-    end
-  end
-  # Note at the end of the ingredient list.
-  class NoteNode < SyntaxNode
-  end
-end

data/lib/food_ingredient_parser/to_html.rb DELETED

@@ -1,52 +0,0 @@
-require 'cgi'
-# Adds HTML output functionality to a Treetop Node.
-#
-# The node needs to provide a {#to_h} method (for {#to_html_h}).
-#
-module FoodIngredientParser::ToHtml
-  # Markup original ingredients list text in HTML.
-  #
-  # The input text is returned as HTML, augmented with CSS classes
-  # on +span+s for +name+, +amount+, +mark+ and +note+.
-  #
-  # @return [String] HTML representation of ingredient list.
-  def to_html
-    node_to_html(self)
-  end
-  private
-  def node_to_html(node, cls=nil, depth=0)
-    el_cls = {}               # map of node instances to class names for contained elements
-    terminal = node.terminal? # whether to look at children elements or not
-    if node.is_a?(FoodIngredientParser::Grammar::AmountNode)
-      cls ||= "amount"
-    elsif node.is_a?(FoodIngredientParser::Grammar::NoteNode)
-      cls ||= "note"
-      terminal = true # NoteNodes may contain other NoteNodes, we want it flat.
-    elsif node.is_a?(FoodIngredientParser::Grammar::IngredientNode)
-      el_cls[node.name] = "name" if node.respond_to?(:name)
-      el_cls[node.mark] = "mark" if node.respond_to?(:mark)
-      if node.respond_to?(:contains)
-        el_cls[node.contains] = "contains depth#{depth}"
-        depth += 1
-      end
-    elsif node.is_a?(FoodIngredientParser::Grammar::RootNode)
-      if node.respond_to?(:contains)
-        el_cls[node.contains] = "depth#{depth}"
-        depth += 1
-      end
-    end
-    val = if terminal
-      CGI.escapeHTML(node.text_value)
-    else
-      node.elements.map {|el| node_to_html(el, el_cls[el], depth) }.join("")
-    end
-    cls ? "<span class='#{cls}'>#{val}</span>" : val
-  end
-end

data/lib/food_ingredient_parser/{grammar.rb → strict/grammar.rb} RENAMED

File without changes