extended_email_reply_parser 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/.codeclimate.yml +17 -0
- data/.rubocop.yml +1156 -0
- data/CHANGELOG.md +22 -0
- data/Gemfile +2 -0
- data/README.md +103 -2
- data/lib/extended_email_reply_parser/email_reply_parser/email.rb +68 -0
- data/lib/extended_email_reply_parser/email_reply_parser/fragment.rb +9 -0
- data/lib/extended_email_reply_parser/parsers/base.rb +156 -0
- data/lib/extended_email_reply_parser/parsers/github.rb +1 -0
- data/lib/extended_email_reply_parser/parsers/i18n_de.rb +14 -0
- data/lib/extended_email_reply_parser/parsers/i18n_en.rb +11 -0
- data/lib/extended_email_reply_parser/version.rb +1 -1
- data/lib/extended_email_reply_parser.rb +4 -0
- metadata +7 -1
data/CHANGELOG.md
CHANGED
@@ -10,6 +10,28 @@ This project adheres to [Semantic Versioning](http://semver.org/).
|
|
10
10
|
### Removed
|
11
11
|
### Fixed
|
12
12
|
|
13
|
+
## ExtendedEmailReplyParser 0.2.0 (2016-07-22)
|
14
|
+
### Added
|
15
|
+
- `Parsers::Base#hide_everything_after(expressions)` is useful when email clients do not quote the previous conversation. This parser method hides everything from the point where the given series of expressions occurs, e.g. `hide_everything_after %w(From: Sent: To:)`.
|
16
|
+
- `Parsers::Base#except_in_visible_block_quotes`. Within this block, `hide_everything_after` is not applied to visible block quotes. This is useful when a quote is already marked as visible.
|
17
|
+
- German parser `Parsers::I18nDe`, which removes previous conversation by searching for the phrases "Gesendet: Von: An:" and "Am ... schrieb ...:".
|
18
|
+
- Support for i18n-ed quote header lines. By default, the github parser only knows "On ... wrote:". Because the patterns are needed as soon as the github parser runs, specify additional regexes in the class head of a parser using `add_quote_header_regex`, for example: `add_quote_header_regex '^Am .* schrieb.*$'`.
|
19
|
+
- The German parser adds the regex for quote headers like "Am ... schrieb ...:".
|
20
|
+
- Remove empty lines between quote lines:
|
21
|
+
|
22
|
+
> Hi,
|
23
|
+
> how are you doing?
|
24
|
+
> Cheers
|
25
|
+
|
26
|
+
rather than
|
27
|
+
|
28
|
+
> Hi,
|
29
|
+
|
30
|
+
> how are you doing?
|
31
|
+
|
32
|
+
> Cheers
|
33
|
+
- English parser `Parsers::I18nEn`, which removes previous conversation by searching for the phrases "From: Sent: To:".
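
The blank-line removal listed above can be tried in isolation. The following is a minimal sketch that applies the same substitution pattern the release uses in `remove_empty_lines_between_block_quote_lines` to a plain string; the sample text and the variable names `quoted` and `compact` are made up for illustration, and the real method works on quoted fragments rather than on a bare string.

```ruby
# Sample quote with an empty line and a spaces-only line between quote lines.
quoted = "> Hi,\n\n> how are you doing?\n \n> Cheers"

# Drop a blank (or spaces-only) line that sits directly before a quote line.
compact = quoted.gsub(/\n *?\n>/m, "\n>")

puts compact
```

Judging from the pattern alone, each match removes a single blank line, so a run of several blank lines between two quote lines would be shortened rather than removed completely in one pass.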
|
34
|
+
|
13
35
|
## ExtendedEmailReplyParser 0.1.0 (2016-07-22)
|
14
36
|
### Added
|
15
37
|
- `ExtendedEmailReplyParser.read "/path/to/email.eml"` returns the corresponding `Mail::Message` object.
|
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -1,6 +1,6 @@
|
|
1
1
|
# ExtendedEmailReplyParser
|
2
2
|
|
3
|
-
[![Build Status](https://travis-ci.org/fiedl/extended_email_reply_parser.svg?branch=master)](https://travis-ci.org/fiedl/extended_email_reply_parser)
|
3
|
+
[![Join the chat at https://gitter.im/fiedl/extended_email_reply_parser](https://badges.gitter.im/fiedl/extended_email_reply_parser.svg)](https://gitter.im/fiedl/extended_email_reply_parser?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) [![Build Status](https://travis-ci.org/fiedl/extended_email_reply_parser.svg?branch=master)](https://travis-ci.org/fiedl/extended_email_reply_parser) [![Code Climate](https://codeclimate.com/github/fiedl/extended_email_reply_parser/badges/gpa.svg)](https://codeclimate.com/github/fiedl/extended_email_reply_parser) [![Test Coverage](https://codeclimate.com/github/fiedl/extended_email_reply_parser/badges/coverage.svg)](https://codeclimate.com/github/fiedl/extended_email_reply_parser/coverage) [![Gem Version](https://badge.fury.io/rb/extended_email_reply_parser.svg)](https://badge.fury.io/rb/extended_email_reply_parser) [![Documentation](https://img.shields.io/badge/documentation-rubydoc.info-blue.svg)](http://www.rubydoc.info/github/fiedl/extended_email_reply_parser/)
|
4
4
|
|
5
5
|
When implementing a "reply or comment by email" feature, it's necessary to filter out signatures and the previous conversation. One needs to extract just the relevant parts for the conversation or comment section of the application. This is what this [ruby](https://www.ruby-lang.org) gem helps to do.
|
6
6
|
|
@@ -81,7 +81,6 @@ EmailParsers::ShoutParser.parse \
|
|
81
81
|
```
|
82
82
|
|
83
83
|
|
84
|
-
|
85
84
|
## Installation
|
86
85
|
|
87
86
|
Add this line to your application's Gemfile:
|
@@ -105,6 +104,108 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
|
|
105
104
|
|
106
105
|
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
|
107
106
|
|
107
|
+
### Helper methods for writing parsers
|
108
|
+
|
109
|
+
To accomplish the most common parsing operations, there are a couple of helper methods.
|
110
|
+
|
111
|
+
This, for example, is the [English parser](lib/extended_email_reply_parser/parsers/i18n_en.rb).
|
112
|
+
|
113
|
+
```ruby
|
114
|
+
module ExtendedEmailReplyParser
|
115
|
+
class Parsers::I18nEn < Parsers::Base
|
116
|
+
|
117
|
+
def parse
|
118
|
+
except_in_visible_block_quotes do
|
119
|
+
hide_everything_after ["From: ", "Sent: ", "To: "]
|
120
|
+
end
|
121
|
+
end
|
122
|
+
|
123
|
+
end
|
124
|
+
end
|
125
|
+
```
|
126
|
+
|
127
|
+
#### `add_quote_header_regex`
|
128
|
+
|
129
|
+
The [github parser](https://github.com/github/email_reply_parser) needs to know how to identify the header line of quotes, for example "On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote":
|
130
|
+
|
131
|
+
Hi,
|
132
|
+
|
133
|
+
On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote:
|
134
|
+
> Hi folks
|
135
|
+
>
|
136
|
+
> What is the best way to clear a Riak bucket of all key, values after
|
137
|
+
> running a test?
|
138
|
+
> I am currently using the Java HTTP API.
|
139
|
+
|
140
|
+
You can list the keys for the bucket and call delete for each. Or if you
|
141
|
+
put the keys (and kept track of them in your test) you can delete them
|
142
|
+
one at a time (without incurring the cost of calling list first.)
|
143
|
+
|
144
|
+
By default, it uses the regex `/^On .* wrote:$/` for that. To make it recognize other header lines, specify their patterns using `add_quote_header_regex`.
|
145
|
+
|
146
|
+
Since this is needed by the github parser, i.e. possibly before the `parse` method of your custom parser is run, make sure to add the quote header regex in the class head:
|
147
|
+
|
148
|
+
```ruby
|
149
|
+
module ExtendedEmailReplyParser
|
150
|
+
class Parsers::I18nDe < Parsers::Base
|
151
|
+
add_quote_header_regex '^Am .* schrieb.*$'
|
152
|
+
# ...
|
153
|
+
end
|
154
|
+
end
|
155
|
+
```
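
Why the registration belongs in the class head can be illustrated with a small stand-alone sketch. `SketchBase` and `SketchDe` are hypothetical stand-ins for `Parsers::Base` and `Parsers::I18nDe`, and the sample header lines are invented; the sketch only assumes what this release shows, namely that registered strings are collected in a class-level list and joined with `|` before a line is matched.

```ruby
# Simplified stand-in for Parsers::Base (hypothetical name).
class SketchBase
  # Assumed default pattern, as described for the github parser.
  @@quote_header_regexes = ['^On .* wrote:$']

  def self.quote_header_regexes
    @@quote_header_regexes
  end

  def self.add_quote_header_regex(regex_string)
    @@quote_header_regexes << regex_string
  end
end

# Simplified stand-in for Parsers::I18nDe (hypothetical name).
class SketchDe < SketchBase
  # Runs while the class body is evaluated, i.e. as soon as the file is
  # loaded and before any `parse` method is called.
  add_quote_header_regex '^Am .* schrieb.*$'
end

# Join all registered strings into one alternation.
combined = Regexp.new(SketchBase.quote_header_regexes.join("|"))

english = "On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote:"
german  = "Am 09.07.2016 um 15:27 schrieb Chris <chris@example.com>:"
other   = "this is great, thanks!"

[english, german, other].each do |line|
  puts "#{line.inspect} header? #{!!(line =~ combined)}"
end
```

Because the call sits in the class body, the German pattern is already part of the list when the first email is read, regardless of which parser runs first.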
|
156
|
+
|
157
|
+
#### `hide_everything_after`
|
158
|
+
|
159
|
+
Some email clients do not quote the previous conversation.
|
160
|
+
|
161
|
+
Hi Chris,
|
162
|
+
this is great, thanks!
|
163
|
+
Cheers, John
|
164
|
+
|
165
|
+
|
166
|
+
From: Chris <chris@example.com>
|
167
|
+
Sent: Saturday, July 09, 2016 3:27 PM
|
168
|
+
To: John <john@example.com>
|
169
|
+
Subject: The solution!
|
170
|
+
|
171
|
+
Hi John,
|
172
|
+
I've just found a solution to our big problem!
|
173
|
+
...
|
174
|
+
|
175
|
+
To remove the previous conversation, give the parser a list of expressions that mark where the previous conversation starts:
|
176
|
+
|
177
|
+
```ruby
|
178
|
+
module ExtendedEmailReplyParser
|
179
|
+
class Parsers::I18nEn < Parsers::Base
|
180
|
+
def parse
|
181
|
+
except_in_visible_block_quotes do
|
182
|
+
hide_everything_after ["From: ", "Sent: ", "To: "]
|
183
|
+
end
|
184
|
+
# ...
|
185
|
+
end
|
186
|
+
end
|
187
|
+
end
|
188
|
+
```
|
189
|
+
|
190
|
+
(The parser will combine the expressions into a regex: `/(#{expressions.join(".*?")}.*?\n)/m`, for example: `/(From: .*?Sent: .*?To: .*?\n)/m`.)
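
The effect of that combined regex can be checked outside the gem. The sketch below builds the regex exactly as described and splits the sample email from above with plain `String#split`; the variable names (`email`, `visible`, `rest`) are only for illustration, and the real parser applies the split to its fragments rather than to the raw text.

```ruby
expressions = ["From: ", "Sent: ", "To: "]

# Same construction as described above; the `m` flag lets `.` match newlines,
# so the expressions may be spread over several header lines.
split_regex = /(#{expressions.join(".*?")}.*?\n)/m

email = <<~EMAIL
  Hi Chris,
  this is great, thanks!
  Cheers, John

  From: Chris <chris@example.com>
  Sent: Saturday, July 09, 2016 3:27 PM
  To: John <john@example.com>
  Subject: The solution!

  Hi John,
  I've just found a solution to our big problem!
EMAIL

# Because the regex has a capture group, `split` also returns the matched
# header block; everything in `rest` is what would end up hidden.
visible, *rest = email.split(split_regex)

puts visible.strip
```

Here `visible` holds the reply itself, while `rest` holds the matched "From: ... To: ..." block followed by the remainder of the previous conversation.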
|
191
|
+
|
192
|
+
To avoid cutting off the email within a visible quote, wrap the `hide_everything_after` call in an `except_in_visible_block_quotes` block as shown above.
|
193
|
+
|
194
|
+
Hi Chris,
|
195
|
+
|
196
|
+
> From: Chris <chris@example.com>
|
197
|
+
> Sent: Saturday, July 09, 2016 3:27 PM
|
198
|
+
> To: John <john@example.com>
|
199
|
+
> Subject: The solution!
|
200
|
+
>
|
201
|
+
> Hi John,
|
202
|
+
> I've just found a solution to our big problem!
|
203
|
+
|
204
|
+
this is great, thanks!
|
205
|
+
Cheers, John
|
206
|
+
|
207
|
+
If not wrapped in `except_in_visible_block_quotes`, the parsed email would just be "Hi Chris,", because everything after "From: Sent: To:" would be cut off.
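
The rule behind this behaviour can be written down as a small truth function. This is a sketch of the condition used in this release when a fragment is split, not the gem's API: `hide_remainder?` and its keyword arguments are hypothetical names, where `quoted` and `hidden` describe the fragment in which the expressions were found.

```ruby
# Hypothetical helper: should the text after the matched expressions be hidden?
def hide_remainder?(quoted:, hidden:, except_in_visible_block_quotes:)
  if except_in_visible_block_quotes
    # Leave the remainder visible only inside a quote that is itself visible.
    !(quoted && !hidden)
  else
    # Outside the block, the remainder is always hidden.
    true
  end
end

# Match inside a visible quote, wrapped in the block: nothing is cut off.
puts hide_remainder?(quoted: true, hidden: false, except_in_visible_block_quotes: true)

# Same match without the block: the remainder is hidden.
puts hide_remainder?(quoted: true, hidden: false, except_in_visible_block_quotes: false)

# Match in unquoted text, wrapped in the block: the remainder is still hidden.
puts hide_remainder?(quoted: false, hidden: false, except_in_visible_block_quotes: true)
```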
|
208
|
+
|
108
209
|
## Contributing
|
109
210
|
|
110
211
|
Bug reports and pull requests are welcome on GitHub at https://github.com/fiedl/extended_email_reply_parser.
|
data/lib/extended_email_reply_parser/email_reply_parser/email.rb
ADDED
@@ -0,0 +1,68 @@
|
|
1
|
+
class EmailReplyParser
|
2
|
+
class Email
|
3
|
+
def hide_everything_after(expressions)
|
4
|
+
split_regex = /(#{expressions.join(".*?")}.*?\n)/m
|
5
|
+
split_fragments_at split_regex
|
6
|
+
end
|
7
|
+
|
8
|
+
def remove_empty_lines_between_block_quote_lines
|
9
|
+
@fragments = @fragments.collect do |fragment|
|
10
|
+
if fragment.quoted?
|
11
|
+
fragment.content = fragment.content.gsub(/\n *?\n>/m, "\n>")
|
12
|
+
end
|
13
|
+
fragment
|
14
|
+
end
|
15
|
+
end
|
16
|
+
|
17
|
+
def split_fragments_at(regex)
|
18
|
+
@fragments = @fragments.collect do |fragment|
|
19
|
+
if fragment.to_s
|
20
|
+
first_text, *rest = fragment.to_s.split(regex)
|
21
|
+
|
22
|
+
first_fragment = Fragment.new(false, first_text)
|
23
|
+
first_fragment.quoted = fragment.quoted
|
24
|
+
first_fragment.hidden = fragment.hidden
|
25
|
+
first_fragment.signature = fragment.signature
|
26
|
+
first_fragment.content = first_text
|
27
|
+
|
28
|
+
hidden_fragment = Fragment.new(true, rest.join("\n"))
|
29
|
+
hidden_fragment.content = rest.join("\n")
|
30
|
+
|
31
|
+
hidden_fragment.quoted = true
|
32
|
+
if @except_in_visible_block_quotes
|
33
|
+
hidden_fragment.hidden = true unless fragment.quoted? and not fragment.hidden?
|
34
|
+
else
|
35
|
+
hidden_fragment.hidden = true
|
36
|
+
end
|
37
|
+
|
38
|
+
[first_fragment, hidden_fragment]
|
39
|
+
end
|
40
|
+
end.flatten - [nil]
|
41
|
+
@fragments = @fragments.select { |fragment| fragment.to_s && fragment.to_s != "" }
|
42
|
+
end
|
43
|
+
|
44
|
+
def except_in_visible_block_quotes
|
45
|
+
@except_in_visible_block_quotes = true
|
46
|
+
yield
|
47
|
+
@except_in_visible_block_quotes = false
|
48
|
+
end
|
49
|
+
|
50
|
+
private
|
51
|
+
|
52
|
+
# Detects if a given line is a header above a quoted area. It is only
|
53
|
+
# checked for lines preceding quoted regions.
|
54
|
+
#
|
55
|
+
# line - A String line of text from the email.
|
56
|
+
#
|
57
|
+
# Returns true if the line is a valid header, or false.
|
58
|
+
#
|
59
|
+
# This method overrides the original in order to include the different
|
60
|
+
# regex defined in the different ExtendedEmailReplyParser::Parsers.
|
61
|
+
#
|
62
|
+
def quote_header?(line)
|
63
|
+
regex = ExtendedEmailReplyParser::Parsers::Base.quote_header_regexes.join("|")
|
64
|
+
line.reverse =~ /#{regex}/
|
65
|
+
end
|
66
|
+
|
67
|
+
end
|
68
|
+
end
|
data/lib/extended_email_reply_parser/parsers/base.rb
CHANGED
@@ -1,16 +1,172 @@
|
|
1
1
|
module ExtendedEmailReplyParser
|
2
2
|
class Parsers::Base
|
3
|
+
@@quote_header_regexes ||= []
|
3
4
|
|
4
5
|
attr_accessor :text
|
5
6
|
|
6
7
|
def initialize(text_before_parsing)
|
7
8
|
self.text = text_before_parsing
|
9
|
+
|
10
|
+
# The `EmailReplyParser::Email` is extended in this gem.
|
11
|
+
# Have a look at:
|
12
|
+
#
|
13
|
+
# lib/extended_email_reply_parser/email_reply_parser/email.rb
|
14
|
+
#
|
15
|
+
@email = EmailReplyParser::Email.new.read(text)
|
8
16
|
end
|
9
17
|
|
18
|
+
# This `parse` method of the `Parsers::Base` will be overridden
|
19
|
+
# by the individual parsers.
|
20
|
+
#
|
21
|
+
# The text before parsing is accessed with `text`.
|
22
|
+
# The method `parse` is expected to return the parsed text.
|
23
|
+
#
|
10
24
|
def parse
|
11
25
|
return text
|
12
26
|
end
|
13
27
|
|
28
|
+
# To avoid cutting off the email within a visible quote, wrap the
|
29
|
+
# `hide_everything_after` calls within an `except_in_visible_block_quotes`
|
30
|
+
# block:
|
31
|
+
#
|
32
|
+
# module ExtendedEmailReplyParser
|
33
|
+
# class Parsers::I18nEn < Parsers::Base
|
34
|
+
# def parse
|
35
|
+
# except_in_visible_block_quotes do
|
36
|
+
# hide_everything_after ["From: ", "Sent: ", "To: "]
|
37
|
+
# end
|
38
|
+
# # ...
|
39
|
+
# end
|
40
|
+
# end
|
41
|
+
# end
|
42
|
+
#
|
43
|
+
# Otherwise, the following email would be completely cut off after
|
44
|
+
# "Hi Chris,".
|
45
|
+
#
|
46
|
+
# Hi Chris,
|
47
|
+
#
|
48
|
+
# > From: Chris <chris@example.com>
|
49
|
+
# > Sent: Saturday, July 09, 2016 3:27 PM
|
50
|
+
# > To: John <john@example.com>
|
51
|
+
# > Subject: The solution!
|
52
|
+
# >
|
53
|
+
# > Hi John,
|
54
|
+
# > I've just found a solution to our big problem!
|
55
|
+
#
|
56
|
+
# this is great, thanks!
|
57
|
+
# Cheers, John
|
58
|
+
#
|
59
|
+
def except_in_visible_block_quotes(&block)
|
60
|
+
@email.except_in_visible_block_quotes(&block)
|
61
|
+
return @email.visible_text
|
62
|
+
end
|
63
|
+
|
64
|
+
# Boil quotes like these
|
65
|
+
#
|
66
|
+
# > Hi,
|
67
|
+
#
|
68
|
+
# > how are you doing?
|
69
|
+
#
|
70
|
+
# > Cheers
|
71
|
+
#
|
72
|
+
# down to
|
73
|
+
#
|
74
|
+
# > Hi,
|
75
|
+
# > how are you doing?
|
76
|
+
# > Cheers
|
77
|
+
#
|
78
|
+
#
|
79
|
+
def remove_empty_lines_between_block_quote_lines
|
80
|
+
@email.remove_empty_lines_between_block_quote_lines
|
81
|
+
return @email.visible_text
|
82
|
+
end
|
83
|
+
|
84
|
+
# Some email clients do not quote the previous conversation.
|
85
|
+
#
|
86
|
+
# Hi Chris,
|
87
|
+
# this is great, thanks!
|
88
|
+
# Cheers, John
|
89
|
+
#
|
90
|
+
#
|
91
|
+
# From: Chris <chris@example.com>
|
92
|
+
# Sent: Saturday, July 09, 2016 3:27 PM
|
93
|
+
# To: John <john@example.com>
|
94
|
+
# Subject: The solution!
|
95
|
+
#
|
96
|
+
# Hi John,
|
97
|
+
# I've just found a solution to our big problem!
|
98
|
+
# ...
|
99
|
+
#
|
100
|
+
# To remove the previous conversation, give the parser the expressions
|
101
|
+
# that mark where the previous conversation starts:
|
102
|
+
#
|
103
|
+
# module ExtendedEmailReplyParser
|
104
|
+
# class Parsers::I18nEn < Parsers::Base
|
105
|
+
# def parse
|
106
|
+
# except_in_visible_block_quotes do
|
107
|
+
# hide_everything_after ["From: ", "Sent: ", "To: "]
|
108
|
+
# end
|
109
|
+
# # ...
|
110
|
+
# end
|
111
|
+
# end
|
112
|
+
# end
|
113
|
+
#
|
114
|
+
# The parser will combine the expressions into a regex:
|
115
|
+
# /(#{expressions.join(".*?")}.*?\n)/m
|
116
|
+
# for example:
|
117
|
+
# /(From: .*?Sent: .*?To: .*?\n)/m
|
118
|
+
#
|
119
|
+
def hide_everything_after(expressions)
|
120
|
+
@email.hide_everything_after(expressions)
|
121
|
+
return @email.visible_text
|
122
|
+
end
|
123
|
+
|
124
|
+
# "On ... wrote:" (English)
|
125
|
+
# "Am ... schrieb ...:" (German)
|
126
|
+
# ...
|
127
|
+
#
|
128
|
+
def self.quote_header_regexes
|
129
|
+
@@quote_header_regexes
|
130
|
+
end
|
131
|
+
|
132
|
+
# The github parser (https://github.com/github/email_reply_parser) needs to
|
133
|
+
# know how to identify the header line of quotes, for example
|
134
|
+
#
|
135
|
+
# "On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote"
|
136
|
+
#
|
137
|
+
# Example email:
|
138
|
+
#
|
139
|
+
# Hi,
|
140
|
+
#
|
141
|
+
# On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote:
|
142
|
+
# > Hi folks
|
143
|
+
# >
|
144
|
+
# > What is the best way to clear a Riak bucket of all key, values after
|
145
|
+
# > running a test?
|
146
|
+
# > I am currently using the Java HTTP API.
|
147
|
+
#
|
148
|
+
# You can list the keys for the bucket and call delete for each. Or if you
|
149
|
+
# put the keys (and kept track of them in your test) you can delete them
|
150
|
+
# one at a time (without incurring the cost of calling list first.)
|
151
|
+
#
|
152
|
+
# By default, the github parser uses the regex `/^On .* wrote:$/` for that.
|
153
|
+
# To make it recognize other header lines, specify their patterns using
|
154
|
+
# `add_quote_header_regex`.
|
155
|
+
#
|
156
|
+
# Since this is needed by the github parser, i.e. possibly before the `parse`
|
157
|
+
# method of your custom parser is run, make sure to add the quote header
|
158
|
+
# regex in the class head:
|
159
|
+
#
|
160
|
+
# module ExtendedEmailReplyParser
|
161
|
+
# class Parsers::I18nDe < Parsers::Base
|
162
|
+
# add_quote_header_regex '^Am .* schrieb.*$'
|
163
|
+
# # ...
|
164
|
+
# end
|
165
|
+
# end
|
166
|
+
#
|
167
|
+
def self.add_quote_header_regex(regex_string)
|
168
|
+
@@quote_header_regexes << regex_string
|
169
|
+
end
|
14
170
|
|
15
171
|
def self.subclasses
|
16
172
|
ObjectSpace.each_object(Class).select { |klass| klass < self }
|
data/lib/extended_email_reply_parser/parsers/i18n_de.rb
ADDED
@@ -0,0 +1,14 @@
|
|
1
|
+
module ExtendedEmailReplyParser
|
2
|
+
class Parsers::I18nDe < Parsers::Base
|
3
|
+
add_quote_header_regex '^Am .* schrieb.*$'
|
4
|
+
|
5
|
+
def parse
|
6
|
+
remove_empty_lines_between_block_quote_lines
|
7
|
+
except_in_visible_block_quotes do
|
8
|
+
hide_everything_after ["Von: ", "Gesendet: ", "An: "]
|
9
|
+
hide_everything_after ["Am ", "schrieb "]
|
10
|
+
end
|
11
|
+
end
|
12
|
+
|
13
|
+
end
|
14
|
+
end
|