food_fish_parser 0.1.0 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +90 -2
- data/bin/food_fish_parser +8 -6
- data/food_fish_parser.gemspec +1 -1
- data/lib/food_fish_parser/grammar/aquac_area.treetop +6 -1
- data/lib/food_fish_parser/grammar/common.treetop +1 -1
- data/lib/food_fish_parser/grammar/fish_name_nl.treetop +2 -1
- data/lib/food_fish_parser/grammar/root.treetop +45 -12
- data/lib/food_fish_parser/grammar/words.treetop +1 -0
- data/lib/food_fish_parser/nodes.rb +1 -1
- data/lib/food_fish_parser/parser.rb +7 -1
- data/lib/food_fish_parser/version.rb +2 -2
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 83e6a16aa71bcfb8c10737828059764fb46bb855bedef1a682695e3aed9a4b31
|
4
|
+
data.tar.gz: 866d70e749ca95dcd449ac64b2617e6076fac00e2361c5f252272d27a677fa14
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: c4dfdf7a85fadc219b423e7d8a95b7d3d0fd4f699657dbcbd4bf43e48b1f89cd4e5975fc48f25c145d56182483a5914fb148475dc50c9d2359f4e3540f7b2ec7
|
7
|
+
data.tar.gz: 65f0b596fa65870a130bbfdd4663086fdd6c46c1d9349a2cb3b3561c43b2d0bd541df924af7df976cb946f98c79e716129c005f671a191766a80075d8aae07f1
|
data/README.md
CHANGED
@@ -31,7 +31,9 @@ EOT
|
|
31
31
|
parser = FoodFishParser::Parser.new
|
32
32
|
puts parser.parse(s).to_a.inspect
|
33
33
|
```
|
34
|
+
|
34
35
|
Results in a list of detected fishes
|
36
|
+
|
35
37
|
```ruby
|
36
38
|
[
|
37
39
|
{
|
@@ -72,12 +74,98 @@ Results in a list of detected fishes
|
|
72
74
|
]
|
73
75
|
```
|
74
76
|
|
77
|
+
When you have a piece of text and don't know where (or if) any fish details are
|
78
|
+
present, you can use the `anywhere` option.
|
79
|
+
|
80
|
+
```ruby
|
81
|
+
require 'food_fish_parser'
|
82
|
+
|
83
|
+
parser = FoodFishParser::Parser.new
|
84
|
+
s = "tomaat, vis (zalm (salmo salar) gevangen in Noorwegen), zout"
|
85
|
+
puts parser.parse(s, anywhere: true).to_a.inspect
|
86
|
+
```
|
87
|
+
|
88
|
+
This will find as many occurences as possible. It is assumed that all fish details
|
89
|
+
in the text have the same amount of information (so fish name plus catch or aquaculture
|
90
|
+
information, or only fish names, or only catch or aquaculture information).
|
91
|
+
While the parser would normally return nothing, with `anywhere` it returns:
|
92
|
+
|
93
|
+
```ruby
|
94
|
+
[
|
95
|
+
{
|
96
|
+
:names => [{ :common=>"zalm", :latin=>"salmo salar" }],
|
97
|
+
:catch_areas => [{ :text=>"Noorwegen", :fao_codes=>[] }],
|
98
|
+
:catch_methods => [],
|
99
|
+
:aquaculture_areas => [],
|
100
|
+
:aquaculture_methods => []
|
101
|
+
}
|
102
|
+
]
|
103
|
+
```
|
104
|
+
|
105
|
+
Please note that the `anywhere` option can make the parser much slower.
|
106
|
+
|
107
|
+
|
108
|
+
## Test tool
|
109
|
+
|
110
|
+
The executable `food_fish_parser` is available after installing the gem. If you're
|
111
|
+
running from the source tree, use `bin/food_fish_parser` instead.
|
112
|
+
|
113
|
+
```
|
114
|
+
$ food_fish_parser -h
|
115
|
+
Usage: bin/food_fish_parser [options] --file|-f <filename>
|
116
|
+
bin/food_fish_parser [options] --string|-s <text>
|
117
|
+
|
118
|
+
-f, --file FILE Parse all lines of the file as fish detail text.
|
119
|
+
-s, --string TEXT Parse specified fish detail text.
|
120
|
+
-q, --[no-]quiet Only show summary.
|
121
|
+
-p, --parsed Only show lines that were successfully parsed.
|
122
|
+
-n, --noresult Only show lines that had no result.
|
123
|
+
-a, --[no-]anywhere Search for fish details anywhere in the text.
|
124
|
+
-e, --[no-]escape Escape newlines
|
125
|
+
-c, --[no-]color Use color
|
126
|
+
-v, --[no-]verbose Show more data (parsed tree).
|
127
|
+
--version Show program version.
|
128
|
+
-h, --help Show this help
|
129
|
+
|
130
|
+
$ food_fish_parser -v -s "salmo salar"
|
131
|
+
"salmo salar"
|
132
|
+
SyntaxNode+Root6+RootNode+SyntaxNodeAdditions offset=0, "salmo salar" (to_a,to_a_deep):
|
133
|
+
SyntaxNode+Root3 offset=0, "salmo salar" (fish_only_names):
|
134
|
+
SyntaxNode+FishNode+SyntaxNodeAdditions+FishNameList1 offset=0, "salmo salar" (to_h,to_a_deep,fish_name):
|
135
|
+
SyntaxNode+FishNameNode+SyntaxNodeAdditions+FishNameLatin1+FishNameLatinNode offset=0, "salmo salar" (to_h,to_a_deep,fish_name_latin_first):
|
136
|
+
SyntaxNode offset=0, "salmo"
|
137
|
+
SyntaxNode+FishNameLatin0 offset=5, " salar" (fish_name_latin_second):
|
138
|
+
SyntaxNode offset=5, " ":
|
139
|
+
SyntaxNode offset=5, " "
|
140
|
+
SyntaxNode offset=6, "salar"
|
141
|
+
SyntaxNode offset=11, ""
|
142
|
+
SyntaxNode offset=11, ""
|
143
|
+
SyntaxNode offset=11, ""
|
144
|
+
SyntaxNode offset=11, ""
|
145
|
+
SyntaxNode offset=11, ""
|
146
|
+
[
|
147
|
+
{
|
148
|
+
:names=>[{:common=>nil, :latin=>"salmo salar"}],
|
149
|
+
:catch_areas=>[],
|
150
|
+
:catch_methods=>[],
|
151
|
+
:aquaculture_areas=>[],
|
152
|
+
:aquaculture_methods=>[]
|
153
|
+
}
|
154
|
+
]
|
155
|
+
|
156
|
+
$ food_fish_parser -q -f data/test-cases
|
157
|
+
parsed 51 (100.0%), no result 0 (0.0%)
|
158
|
+
```
|
159
|
+
|
160
|
+
If you want to use the output in (shell)scripts, the options `-e -c` may be quite useful.
|
161
|
+
|
75
162
|
|
76
163
|
## Test data
|
77
164
|
|
78
|
-
[`data/fish-ingredient-samples-qm-nl`](data/fish-ingredient-samples-qm-nl) contains about 2k
|
165
|
+
[`data/fish-ingredient-samples-qm-nl`](data/fish-ingredient-samples-qm-nl) contains about 2k
|
79
166
|
real-world ingredient lists with fish found on the Dutch market. Each line contains one ingredient
|
80
|
-
list (newlines are encoded as `\n`, empty lines and those starting with `#` are ignored).
|
167
|
+
list (newlines are encoded as `\n`, empty lines and those starting with `#` are ignored). Of those,
|
168
|
+
something is returned for 99.8% of them (with the `anywhere` option), but quality varies greatly.
|
81
169
|
|
82
170
|
|
83
171
|
## Species
|
data/bin/food_fish_parser
CHANGED
@@ -31,8 +31,8 @@ def colorize(color, s)
|
|
31
31
|
end
|
32
32
|
end
|
33
33
|
|
34
|
-
def parse_single(s, parsed=nil, parser:, verbosity: 1, print: nil, escape: false, color: false)
|
35
|
-
parsed ||= parser.parse(s)
|
34
|
+
def parse_single(s, parsed=nil, parser:, verbosity: 1, print: nil, escape: false, color: false, anywhere: false)
|
35
|
+
parsed ||= parser.parse(s, anywhere: anywhere)
|
36
36
|
|
37
37
|
return unless print.nil? || (parsed && print == :parsed) || (!parsed && print == :noresult)
|
38
38
|
|
@@ -48,14 +48,14 @@ def parse_single(s, parsed=nil, parser:, verbosity: 1, print: nil, escape: false
|
|
48
48
|
end
|
49
49
|
end
|
50
50
|
|
51
|
-
def parse_file(path, parser:, verbosity: 1, print: nil, escape: false, color: false)
|
51
|
+
def parse_file(path, parser:, verbosity: 1, print: nil, escape: false, color: false, anywhere: false)
|
52
52
|
count_parsed = count_noresult = 0
|
53
53
|
File.foreach(path) do |line|
|
54
54
|
next if line =~ /^#/ # comment
|
55
55
|
next if line =~ /^\s*$/ # empty line
|
56
56
|
|
57
57
|
line = line.gsub('\\n', "\n").strip
|
58
|
-
parsed = parser.parse(line)
|
58
|
+
parsed = parser.parse(line, anywhere: anywhere)
|
59
59
|
count_parsed += 1 if parsed
|
60
60
|
count_noresult += 1 unless parsed
|
61
61
|
|
@@ -72,6 +72,7 @@ verbosity = 1
|
|
72
72
|
files = []
|
73
73
|
strings = []
|
74
74
|
print = nil
|
75
|
+
anywhere = false
|
75
76
|
escape = false
|
76
77
|
color = true
|
77
78
|
OptionParser.new do |opts|
|
@@ -87,6 +88,7 @@ OptionParser.new do |opts|
|
|
87
88
|
opts.on("-q", "--[no-]quiet", "Only show summary.") {|q| verbosity = q ? 0 : 1 }
|
88
89
|
opts.on("-p", "--parsed", "Only show lines that were successfully parsed.") {|p| print = :parsed }
|
89
90
|
opts.on("-n", "--noresult", "Only show lines that had no result.") {|p| print = :noresult }
|
91
|
+
opts.on("-a", "--[no-]anywhere", "Search for fish details anywhere in the text.") {|a| anywhere = !!a }
|
90
92
|
opts.on("-e", "--[no-]escape", "Escape newlines") {|e| escape = !!e }
|
91
93
|
opts.on("-c", "--[no-]color", "Use color") {|e| color = !!e }
|
92
94
|
opts.on("-v", "--[no-]verbose", "Show more data (parsed tree).") {|v| verbosity = v ? 2 : 1 }
|
@@ -103,8 +105,8 @@ end.parse!
|
|
103
105
|
if strings.any? || files.any?
|
104
106
|
parser = FoodFishParser::Parser.new
|
105
107
|
success = true
|
106
|
-
strings.each {|s| success &= parse_single(s, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
|
107
|
-
files.each {|f| success &= parse_file(f, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color) }
|
108
|
+
strings.each {|s| success &= parse_single(s, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color, anywhere: anywhere) }
|
109
|
+
files.each {|f| success &= parse_file(f, parser: parser, verbosity: verbosity, print: print, escape: escape, color: color, anywhere: anywhere) }
|
108
110
|
success or exit(1)
|
109
111
|
else
|
110
112
|
STDERR.puts("Please specify one or more --file or --string arguments (see --help).")
|
data/food_fish_parser.gemspec
CHANGED
@@ -12,7 +12,7 @@ Gem::Specification.new do |s|
|
|
12
12
|
s.license = 'MIT'
|
13
13
|
s.description = <<-EOD
|
14
14
|
Food products that contain fish sometimes indicate details like fishing
|
15
|
-
area, method or aquaculture country. This parser
|
15
|
+
area, method or aquaculture country. This parser knows about various ways
|
16
16
|
this is found on a product package, and returns a structured representation
|
17
17
|
of the fish ingredient details.
|
18
18
|
EOD
|
@@ -7,13 +7,18 @@ module FoodFishParser::Grammar
|
|
7
7
|
|
8
8
|
rule aquac_area_indicator
|
9
9
|
(
|
10
|
-
'
|
10
|
+
'gekweekt in'i / 'gekweekt op'i /
|
11
11
|
'aquacultuurproduct uit'i / 'aquacultuur product uit'i
|
12
12
|
)
|
13
13
|
!char
|
14
14
|
( ws* ( ':' / '>' ) )?
|
15
15
|
end
|
16
16
|
|
17
|
+
rule aquac_area_indicator_short
|
18
|
+
aquac_area_indicator /
|
19
|
+
( 'uit'i ) !char ( ws* ( ':' / '>' ) )?
|
20
|
+
end
|
21
|
+
|
17
22
|
rule aquac_area_content
|
18
23
|
(
|
19
24
|
( area:( words ) ( ws* comma? ws* fao_area_list_enclosures )? ) /
|
@@ -4,12 +4,13 @@ module FoodFishParser::Grammar
|
|
4
4
|
include Common
|
5
5
|
|
6
6
|
rule fish_name_nl
|
7
|
+
( 'verse'i ws+ )?
|
7
8
|
( fish_name_nl_area ws+ )? ( fish_name_nl_attr ws* )? fish_name_nl_name fish_name_nl_suffix?
|
8
9
|
<FishNameCommonNode>
|
9
10
|
end
|
10
11
|
|
11
12
|
rule fish_name_nl_area
|
12
|
-
'pacifische'i / 'indische'i / 'groenlandse'i / 'atlantische'i / 'argentijnse'i / 'alaska'i
|
13
|
+
'pacifische'i / 'noorse'i / 'indische'i / 'groenlandse'i / 'atlantische'i / 'argentijnse'i / 'alaska'i
|
13
14
|
end
|
14
15
|
|
15
16
|
rule fish_name_nl_attr
|
@@ -7,27 +7,61 @@ module FoodFishParser::Grammar
|
|
7
7
|
include AquacArea
|
8
8
|
include AquacMethod
|
9
9
|
|
10
|
+
# Regular root node that requires all text to match fish details.
|
11
|
+
#
|
12
|
+
# Note that here we prefer fish with catch or aquaculture info,
|
13
|
+
# then try fish names, and finally only catch or aquaculture info.
|
14
|
+
#
|
15
|
+
# The assumption is that all declared fish would have the same amount
|
16
|
+
# of information. Hence we first try for all info, then resort to partial
|
17
|
+
# info.
|
10
18
|
rule root
|
11
|
-
|
12
|
-
(
|
13
|
-
(
|
14
|
-
|
19
|
+
(
|
20
|
+
( fish_with_info ( ws* fish_sep ws* fish_with_info )* ) /
|
21
|
+
( fish_only_names ( ws* fish_sep ws* fish_only_names )* ) /
|
22
|
+
( fish_only_info ( ws* fish_sep ws* fish_only_info )* )
|
15
23
|
)
|
16
|
-
|
24
|
+
ws* '.'? ws*
|
17
25
|
<RootNode>
|
18
26
|
end
|
19
27
|
|
20
|
-
|
28
|
+
# Alternate root node that allows fish details to be interspaced with other text.
|
29
|
+
#
|
30
|
+
# Note that this can be a much more expensive operation.
|
31
|
+
rule root_anywhere
|
32
|
+
(
|
33
|
+
( !fish_with_info . )* ( fish_with_info ( ( !fish_with_info . )* fish_with_info )* ) ( !fish_with_info . )* /
|
34
|
+
( !fish_only_names . )* ( fish_only_names ( ( !fish_only_names . )+ fish_only_names )* ) ( !fish_only_names . )* /
|
35
|
+
( !fish_only_info . )* ( fish_only_info ( ( !fish_only_info . )+ fish_only_info )* ) ( !fish_only_info . )*
|
36
|
+
)
|
37
|
+
<RootNode>
|
38
|
+
end
|
39
|
+
|
40
|
+
# separator between fish declarations
|
41
|
+
rule fish_sep
|
42
|
+
and_or / '.' / comma
|
43
|
+
end
|
44
|
+
|
45
|
+
# fish with catch or aquaculture info
|
46
|
+
rule fish_with_info
|
21
47
|
(
|
22
48
|
( fish_name_list ( ws* ( comma / ':' ) )? ws+ fish_catch_info ) /
|
23
|
-
( fish_name_list ( ws* ( comma / ':' ) )? ws+ fish_aquac_info )
|
24
|
-
fish_name_list /
|
25
|
-
fish_catch_info /
|
26
|
-
fish_aquac_info
|
49
|
+
( fish_name_list ( ws* ( comma / ':' ) )? ws+ fish_aquac_info )
|
27
50
|
)
|
28
51
|
<FishNode>
|
29
52
|
end
|
30
53
|
|
54
|
+
# fish names only
|
55
|
+
rule fish_only_names
|
56
|
+
fish_name_list <FishNode>
|
57
|
+
end
|
58
|
+
|
59
|
+
# catch or aquaculture info only (no names)
|
60
|
+
rule fish_only_info
|
61
|
+
( fish_catch_info / fish_aquac_info ) <FishNode>
|
62
|
+
end
|
63
|
+
|
64
|
+
|
31
65
|
rule fish_catch_info
|
32
66
|
(
|
33
67
|
catch_method_indicator ws* catch_method_content
|
@@ -47,9 +81,8 @@ module FoodFishParser::Grammar
|
|
47
81
|
( ( ws* comma )? ws+ aquac_method_indicator ws* aquac_method_content )?
|
48
82
|
) / (
|
49
83
|
aquac_method_indicator ws* aquac_method_content
|
50
|
-
( ( ws* comma )? ws+
|
84
|
+
( ( ws* comma )? ws+ aquac_area_indicator_short ws* aquac_area_content )?
|
51
85
|
)
|
52
86
|
end
|
53
|
-
|
54
87
|
end
|
55
88
|
end
|
@@ -16,9 +16,15 @@ module FoodFishParser
|
|
16
16
|
|
17
17
|
# Parse food fish text into a structured representation.
|
18
18
|
#
|
19
|
+
# @param s [String] text to parse
|
20
|
+
# @param anywhere [Bool] +false+ assume the text is only fish details, +true+ to search for fish details in the text
|
19
21
|
# @return [FoodFishParser::Grammar::RootNode] structured representation of fish details
|
20
22
|
# @note Unrecognized options are passed to Treetop, but this is not guarenteed to remain so forever.
|
21
|
-
def parse(s, **options)
|
23
|
+
def parse(s, anywhere: false, **options)
|
24
|
+
if anywhere
|
25
|
+
options = options.merge(root: :root_anywhere, consume_all_input: false)
|
26
|
+
end
|
27
|
+
|
22
28
|
@parser.parse(s, **options)
|
23
29
|
end
|
24
30
|
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: food_fish_parser
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.2.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- wvengen
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2020-03-
|
11
|
+
date: 2020-03-18 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: treetop
|
@@ -26,7 +26,7 @@ dependencies:
|
|
26
26
|
version: '1.6'
|
27
27
|
description: |2
|
28
28
|
Food products that contain fish sometimes indicate details like fishing
|
29
|
-
area, method or aquaculture country. This parser
|
29
|
+
area, method or aquaculture country. This parser knows about various ways
|
30
30
|
this is found on a product package, and returns a structured representation
|
31
31
|
of the fish ingredient details.
|
32
32
|
email:
|