uts58 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +57 -8
- data/lib/uts58/extractor.rb +82 -7
- data/lib/uts58.rb +42 -2
- data/uts58.gemspec +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 219330b829e83d0c24c85005951597734773bb0380be136a5b60dd948e913c53
|
|
4
|
+
data.tar.gz: ffe60d5db9448fea9b657de9fe6a7db279a8f693195c58be91618ae16be6effc
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: b26f6967c9a977fb2f6bcd3cd53a7ed1341f60736e386401e9e26409e675945bfb114ee8224edee98c33aef4ae2d1abae1f155c731116f15d57df9ef6b2f0311
|
|
7
|
+
data.tar.gz: df1d287454d53695032f5ae5c28d1588f9b01f35639f6d9a2dee191a22f56b22760d1c08430e4a96e8cfa52902406a2ef68b66dbe352def8184079d7f5d250eb
|
data/README.md
CHANGED
|
@@ -2,11 +2,11 @@
|
|
|
2
2
|
|
|
3
3
|
A Ruby implementation of [UTS #58](https://www.unicode.org/reports/tr58/),
|
|
4
4
|
the Unicode spec for finding links in running text. Given a chunk of text,
|
|
5
|
-
it returns the URLs in it along with their character
|
|
5
|
+
it returns the URLs and email addresses in it along with their character
|
|
6
|
+
offsets.
|
|
6
7
|
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
desirable on generally visible web pages.
|
|
8
|
+
Both halves of UTS #58 are covered: **web links** and **email addresses**.
|
|
9
|
+
The two are detected independently and can be combined.
|
|
10
10
|
|
|
11
11
|
Tested extensively on relevant OSes: [](https://github.com/arnt/uts58/actions/workflows/ci.yml)
|
|
12
12
|
|
|
@@ -56,13 +56,62 @@ can read.)
|
|
|
56
56
|
Trailing punctuation, balanced brackets, ports, paths, queries and fragments
|
|
57
57
|
are handled per the spec.
|
|
58
58
|
|
|
59
|
+
## Email addresses
|
|
60
|
+
|
|
61
|
+
Email detection mirrors the URL methods. Each result carries the address
|
|
62
|
+
twice — as a bare `:email` and as a `mailto:` `:url` — so it drops straight
|
|
63
|
+
into anything that already renders a `:url` entity:
|
|
64
|
+
|
|
65
|
+
```ruby
|
|
66
|
+
Uts58.extract_email_addresses_with_indices("write to info@grå.org today")
|
|
67
|
+
# => [{ email: "info@grå.org",
|
|
68
|
+
# url: "mailto:info@grå.org",
|
|
69
|
+
# indices: [9, 21] }]
|
|
70
|
+
|
|
71
|
+
Uts58.extract_email_addresses("write to info@grå.org today")
|
|
72
|
+
# => ["info@grå.org"]
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
UTS #58 allows Unicode local-parts, so `阿Q@例子.中国` and `उदाहरण@उदाहरण.भारत`
|
|
76
|
+
are recognised; the domain is IDN-decoded just like a URL host. A leading
|
|
77
|
+
`mailto:` in the input is folded into the matched span.
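
To make the offsets concrete, here is a minimal sketch that uses only plain Ruby string slicing on the result hash shown above; it does not call the gem. `link_entity` is a hypothetical helper, not part of uts58, and it skips HTML escaping for brevity.

```ruby
# Result hash copied by hand from the README example above (no gem call here).
text   = "write to info@grå.org today"
entity = { email: "info@grå.org", url: "mailto:info@grå.org", indices: [9, 21] }

# Ruby indexes strings by character, so codepoint offsets slice directly.
# The end offset is exclusive, hence the three-dot range.
first, last = entity[:indices]
matched = text[first...last]

# Hypothetical helper (not part of uts58): wrap the matched span in a link.
# Illustration only: a real renderer would HTML-escape the pieces.
def link_entity(text, entity)
  first, last = entity[:indices]
  span = text[first...last]
  "#{text[0...first]}<a href=\"#{entity[:url]}\">#{span}</a>#{text[last..]}"
end

linked = link_entity(text, entity)
```

Because the span covers a leading `mailto:` when the input has one, replacing the text by `indices` swaps out the whole match rather than only the bare address.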
|
|
78
|
+
|
|
79
|
+
## Combined extraction
|
|
80
|
+
|
|
81
|
+
`extract_entities_with_indices` runs both detectors, sorts by offset, and
|
|
82
|
+
strips overlaps — mirroring `Twitter::TwitterText::Extractor#extract_entities_with_indices`.
|
|
83
|
+
The result is a mixed list of `:url` and email (`:email` + `:url`) hashes:
|
|
84
|
+
|
|
85
|
+
```ruby
|
|
86
|
+
Uts58.extract_entities_with_indices("mail arnt@grå.org or see blogspot.com")
|
|
87
|
+
# => [{ email: "arnt@grå.org", url: "mailto:arnt@grå.org", indices: [5, 17] },
|
|
88
|
+
# { url: "https://blogspot.com", indices: [25, 37] }]
|
|
89
|
+
|
|
90
|
+
Uts58.extract_entities("mail arnt@grå.org or see blogspot.com")
|
|
91
|
+
# => ["mailto:arnt@grå.org", "https://blogspot.com"]
|
|
92
|
+
```
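
The overlap step can be sketched roughly as follows. This is an illustration under stated assumptions, not the gem's actual code: `strip_overlaps` is a hypothetical stand-in for the overlap removal, and the `https://grå.org` candidate is an assumed URL-scan hit for the domain after the `@`.

```ruby
# Hypothetical stand-in for the overlap step; not the gem's implementation.
# Sorts candidates by start offset and keeps an entity only if it begins at
# or after the end of the last kept one, so the earlier-starting one wins.
def strip_overlaps(entities)
  kept = []
  entities.sort_by { |e| e[:indices].first }.each do |entity|
    kept << entity if kept.empty? || entity[:indices].first >= kept.last[:indices].last
  end
  kept
end

# Hand-written candidates for "mail arnt@grå.org or see blogspot.com".
candidates = [
  { url: "https://grå.org", indices: [10, 17] }, # assumed URL-scan hit
  { email: "arnt@grå.org", url: "mailto:arnt@grå.org", indices: [5, 17] },
  { url: "https://blogspot.com", indices: [25, 37] }
]

survivors = strip_overlaps(candidates)
urls = survivors.map { |e| e[:url] }
```

The email starts at offset 5 and the assumed bare-domain hit at offset 10, so the email survives and the domain is dropped, which matches the combined output shown above.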
|
|
93
|
+
|
|
94
|
+
### Not wanting `mailto:` links
|
|
95
|
+
|
|
96
|
+
`info@example.com` overlaps the bare domain `example.com` that the URL scan
|
|
97
|
+
finds after the `@`. If you'd rather not turn addresses into `mailto:` links,
|
|
98
|
+
you have two options, with different results for `contact info@example.com for pricing`:
|
|
99
|
+
|
|
100
|
+
1. **Extract both, then drop emails.** Take `extract_entities_with_indices`
|
|
101
|
+
(already overlap-stripped) and reject the hashes that have an `:email`
|
|
102
|
+
key. The address wins the overlap, so dropping it leaves that span
|
|
103
|
+
*unlinked* — `info@example.com` stays plain text. Choose this if an
|
|
104
|
+
address shouldn't silently become a website link.
|
|
105
|
+
2. **Extract only URLs.** Call `extract_urls_with_indices` and skip email
|
|
106
|
+
detection entirely. The URL scan finds the domain after the `@`, so the
|
|
107
|
+
same input links to `https://example.com`. Choose this if you'd
|
|
108
|
+
rather fall back to the domain.
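
A minimal sketch of the two options, run on hand-written data rather than on live calls into the gem. The hashes below are assumptions about what `extract_entities_with_indices` and `extract_urls_with_indices` would return for this input, based on the shapes documented above.

```ruby
text = "contact info@example.com for pricing"

# Assumed result of Uts58.extract_entities_with_indices(text), written by hand:
# the address has already won the overlap against the bare domain.
combined = [
  { email: "info@example.com", url: "mailto:info@example.com", indices: [8, 24] }
]

# Option 1: drop every hash that carries an :email key.
# Nothing is left, so the address stays plain text.
option1 = combined.reject { |e| e.key?(:email) }

# Option 2: assumed result of Uts58.extract_urls_with_indices(text), written
# by hand: the URL scan sees only the domain after the "@".
option2 = [{ url: "https://example.com", indices: [13, 24] }]
linked_spans = option2.map { |e| text[e[:indices][0]...e[:indices][1]] }
```

With option 1 no part of the sentence is linked; with option 2 only the `example.com` span is linked.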
|
|
109
|
+
|
|
59
110
|
## What's not here
|
|
60
111
|
|
|
61
|
-
- **Email addresses.** UTS #58 covers them; this gem doesn't. If you
|
|
62
|
-
need that, send me mail and explain what you need.
|
|
63
112
|
- **Link validation.** Recognised URLs are not fetched, normalised beyond
|
|
64
|
-
IDN decoding, or their hostnames checked in the DNS.
|
|
65
|
-
|
|
113
|
+
IDN decoding, or their hostnames checked in the DNS. If you need this,
|
|
114
|
+
send me mail.
|
|
66
115
|
|
|
67
116
|
## Roadmap
|
|
68
117
|
|
data/lib/uts58/extractor.rb
CHANGED
|
@@ -4,15 +4,23 @@ require 'public_suffix'
|
|
|
4
4
|
require_relative 'constants'
|
|
5
5
|
|
|
6
6
|
module Uts58
|
|
7
|
-
# Finds
|
|
8
|
-
#
|
|
9
|
-
#
|
|
10
|
-
# other.
|
|
7
|
+
# Finds links in arbitrary text per UTS #58. The public API mirrors
|
|
8
|
+
# Twitter::TwitterText::Extractor closely enough that twitter-text
|
|
9
|
+
# consumers (notably Mastodon) can easily swap one for the other.
|
|
11
10
|
#
|
|
12
11
|
# Instances carry only optional configuration (see #max_length=); if
|
|
13
|
-
# you don't need to set anything, the module-level
|
|
14
|
-
# Uts58.extract_urls and Uts58.extract_urls_with_indices shortcuts
|
|
12
|
+
# you don't need to set anything, the module-level shortcuts
|
|
15
13
|
# are simpler.
|
|
14
|
+
#
|
|
15
|
+
# Note that this may find overlapping link candidates,
|
|
16
|
+
# e.g. "contact example@example.com for details" may find a mailto
|
|
17
|
+
# link and also a link to +https://example.com+. You'll almost
|
|
18
|
+
# certainly want to #remove_overlapping_entities after extracting
|
|
19
|
+
# the kinds of entities you want and merging the lists.
|
|
20
|
+
#
|
|
21
|
+
# (Bluesky handles vs. web sites are another example of common
|
|
22
|
+
# overlap, Fediverse vs. email a third, Tibetan domains
|
|
23
|
+
# vs. themselves a less common fourth; the list goes on.)
|
|
16
24
|
class Extractor
|
|
17
25
|
PATH_CLOSERS = [35, 47, 63]
|
|
18
26
|
QUERY_CLOSERS = [35] # how about &?
|
|
@@ -20,7 +28,7 @@ module Uts58
|
|
|
20
28
|
|
|
21
29
|
# Maximum allowed length of the matched text, in input codepoints.
|
|
22
30
|
# Matches whose input span exceeds this are dropped from the result
|
|
23
|
-
# of #extract_urls_with_indices.
|
|
31
|
+
# of #extract_urls_with_indices and the other extraction methods.
|
|
24
32
|
#
|
|
25
33
|
# "Matched text" means the substring that came out of +text+ — for
|
|
26
34
|
# example 11 for <tt>"example.com"</tt>. The returned +:url+ can
|
|
@@ -127,6 +135,73 @@ module Uts58
|
|
|
127
135
|
extract_urls_with_indices(text, options).map { |r| r[:url] }
|
|
128
136
|
end
|
|
129
137
|
|
|
138
|
+
# Returns every email address found in +text+ as a list of hashes:
|
|
139
|
+
#
|
|
140
|
+
# { email: String, url: String, indices: [start, end] }
|
|
141
|
+
#
|
|
142
|
+
# +email+ is the bare address ( <tt>"info@example.com"</tt> ); +url+ is
|
|
143
|
+
# the same thing as a +mailto:+ URL ( <tt>"mailto:info@example.com"</tt> ), so
|
|
144
|
+
# that the result drops straight into anything that already knows how
|
|
145
|
+
# to render a <tt>:url</tt> entity. Both carry the IDN-decoded domain
|
|
146
|
+
# (A-labels become U-labels, as in #extract_urls_with_indices).
|
|
147
|
+
# +indices+ are codepoint offsets, +end+ exclusive; they cover a
|
|
148
|
+
# leading +mailto:+ in the input if there was one, per UTS #58 §5.2.
|
|
149
|
+
#
|
|
150
|
+
# A plain address such as "info@example.com" overlaps the bare domain
|
|
151
|
+
# "example.com" that #extract_urls_with_indices would find after the
|
|
152
|
+
# <tt>@</tt>. If you'd rather *not* turn addresses into +mailto:+ links, you
|
|
153
|
+
# have two choices, with different outcomes for
|
|
154
|
+
# "blah info@example.com blah":
|
|
155
|
+
#
|
|
156
|
+
# 1. Extract both kinds, merge with #remove_overlapping_entities, then
|
|
157
|
+
# drop the survivors that have an +:email+ key. The address wins the
|
|
158
|
+
# overlap, so dropping it leaves that span unlinked — "info@example.com"
|
|
159
|
+
# becomes plain text.
|
|
160
|
+
# 2. Extract only URLs (skip this method). The URL scan still sees the
|
|
161
|
+
# domain after the <tt>@</tt>, so the same input links to
|
|
162
|
+
# +https://example.com+.
|
|
163
|
+
#
|
|
164
|
+
# Returns an empty array if +text+ contains no addresses. +options+ is
|
|
165
|
+
# accepted for twitter-text compatibility and currently ignored.
|
|
166
|
+
def extract_email_addresses_with_indices(text, options = {})
|
|
167
|
+
result = []
|
|
168
|
+
text.to_enum(:scan, /@/).map{Regexp.last_match}.each do |match|
|
|
169
|
+
at_pos = match.begin(0)
|
|
170
|
+
pre = text[0...at_pos]
|
|
171
|
+
lp_match = /[\p{XID_Continue}.!#$%&'*+\-\/=?^_`{|}~]+\z/.match(pre)
|
|
172
|
+
next unless lp_match
|
|
173
|
+
local = lp_match[0]
|
|
174
|
+
next if local.start_with?('.') || local.end_with?('.') || local.include?('..')
|
|
175
|
+
s = match.post_match
|
|
176
|
+
prefix = /\A([-\p{L}\p{N}\p{M}ßς۽۾་〇]+[\.。]){1,4}[-\p{L}\p{N}\p{M}]+(?![-\p{L}\p{N}\p{M}])/.match(s)
|
|
177
|
+
next unless prefix && prefix[0].length < 254
|
|
178
|
+
hn = SimpleIDN.to_unicode(prefix.match(0).gsub(/。/, "."))
|
|
179
|
+
begin
|
|
180
|
+
about = PublicSuffix.parse(hn, ignore_private: true, default_rule: nil)
|
|
181
|
+
next unless about && about.tld != "invalid"
|
|
182
|
+
rescue PublicSuffix::DomainInvalid, PublicSuffix::DomainNotAllowed
|
|
183
|
+
next
|
|
184
|
+
end
|
|
185
|
+
local_start = at_pos - local.length
|
|
186
|
+
end_pos = at_pos + 1 + prefix[0].length
|
|
187
|
+
# UTS #58 §5.2 step 6: absorb a leading "mailto:" into the span.
|
|
188
|
+
if local_start >= 7 && text[(local_start - 7)...local_start].downcase == "mailto:"
|
|
189
|
+
local_start -= 7
|
|
190
|
+
end
|
|
191
|
+
next if @max_length && (end_pos - local_start) > @max_length
|
|
192
|
+
result << {
|
|
193
|
+
email: "#{local}@#{hn}",
|
|
194
|
+
url: "mailto:#{local}@#{hn}",
|
|
195
|
+
indices: [local_start, end_pos]
|
|
196
|
+
}
|
|
197
|
+
end
|
|
198
|
+
result
|
|
199
|
+
end
|
|
200
|
+
|
|
201
|
+
def extract_email_addresses(text, options = {})
|
|
202
|
+
extract_email_addresses_with_indices(text, options).map { |r| r[:email] }
|
|
203
|
+
end
|
|
204
|
+
|
|
130
205
|
# Given a list of entities (hashes with an +:indices+ key of the
|
|
131
206
|
# shape <tt>[start, end]</tt>, as produced by
|
|
132
207
|
# #extract_urls_with_indices) drops every entity that overlaps an
|
data/lib/uts58.rb
CHANGED
|
@@ -9,7 +9,7 @@
|
|
|
9
9
|
# any characters in the input, the wrappers keep the earlier one and
|
|
10
10
|
# drop the rest. Use Uts58::Extractor directly if you want the
|
|
11
11
|
# raw, possibly-overlapping list (e.g. to merge with hashtag/mention
|
|
12
|
-
# extractors before resolving overlap
|
|
12
|
+
# extractors before resolving overlap).
|
|
13
13
|
#
|
|
14
14
|
# Uts58.extract_urls("see example.com here")
|
|
15
15
|
# # => ["https://example.com"]
|
|
@@ -17,7 +17,7 @@
|
|
|
17
17
|
# Uts58.extract_urls_with_indices("see example.com here")
|
|
18
18
|
# # => [{ url: "https://example.com", indices: [4, 15] }]
|
|
19
19
|
module Uts58
|
|
20
|
-
VERSION = "0.
|
|
20
|
+
VERSION = "0.2.0"
|
|
21
21
|
|
|
22
22
|
class << self
|
|
23
23
|
# Like Uts58::Extractor#extract_urls_with_indices, but with
|
|
@@ -35,6 +35,46 @@ module Uts58
|
|
|
35
35
|
extract_urls_with_indices(text, options).map { |r| r[:url] }
|
|
36
36
|
end
|
|
37
37
|
|
|
38
|
+
# Like Uts58::Extractor#extract_email_addresses_with_indices, but
|
|
39
|
+
# with overlapping results removed.
|
|
40
|
+
def extract_email_addresses_with_indices(text, options = {})
|
|
41
|
+
extractor.remove_overlapping_entities(
|
|
42
|
+
extractor.extract_email_addresses_with_indices(text, options)
|
|
43
|
+
)
|
|
44
|
+
end
|
|
45
|
+
|
|
46
|
+
# Like Uts58::Extractor#extract_email_addresses, but with
|
|
47
|
+
# overlapping results removed.
|
|
48
|
+
def extract_email_addresses(text, options = {})
|
|
49
|
+
extract_email_addresses_with_indices(text, options).map { |r| r[:email] }
|
|
50
|
+
end
|
|
51
|
+
|
|
52
|
+
# Both the URLs and email addresses in +text+, as one list of
|
|
53
|
+
# mixed-shape hashes (<tt>{ url:, indices: }</tt> for links and
|
|
54
|
+
# <tt>{ email:, url:, indices: }</tt> for addresses), sorted by start
|
|
55
|
+
# offset with overlaps removed. The name and mixed-shape return
|
|
56
|
+
# follow Twitter::TwitterText::Extractor#extract_entities_with_indices.
|
|
57
|
+
#
|
|
58
|
+
# Overlap is the point of going through here rather than calling the
|
|
59
|
+
# two extractors yourself: "contact info@grå.org today" yields both
|
|
60
|
+
# an email and the bare domain grå.org, and only one of those should
|
|
61
|
+
# survive. The earlier-starting candidate (the email) wins.
|
|
62
|
+
def extract_entities_with_indices(text, options = {})
|
|
63
|
+
extractor.remove_overlapping_entities(
|
|
64
|
+
extractor.extract_urls_with_indices(text, options) +
|
|
65
|
+
extractor.extract_email_addresses_with_indices(text, options)
|
|
66
|
+
)
|
|
67
|
+
end
|
|
68
|
+
|
|
69
|
+
# Like ::extract_entities_with_indices, but flattened to the bare
|
|
70
|
+
# URL strings, in the order they occur. Email addresses appear in
|
|
71
|
+
# their +mailto:+ form, e.g. "contact info@example.com or look at
|
|
72
|
+
# example.com" returns [<tt>"mailto:info@example.com"</tt>,
|
|
73
|
+
# <tt>"https://example.com"</tt>].
|
|
74
|
+
def extract_entities(text, options = {})
|
|
75
|
+
extract_entities_with_indices(text, options).map { |e| e[:url] }
|
|
76
|
+
end
|
|
77
|
+
|
|
38
78
|
private
|
|
39
79
|
|
|
40
80
|
def extractor
|
data/uts58.gemspec
CHANGED