uts58 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 94cf52fc3b4ea1f23cdba5bebe093e9f9c1c2aa06abd30f24fc27385983b1597
4
- data.tar.gz: 5f9dc85d33fe996c80d876197fd7affec34a0d99901b8daffb58458ce75e8eb5
3
+ metadata.gz: 219330b829e83d0c24c85005951597734773bb0380be136a5b60dd948e913c53
4
+ data.tar.gz: ffe60d5db9448fea9b657de9fe6a7db279a8f693195c58be91618ae16be6effc
5
5
  SHA512:
6
- metadata.gz: 35edf0e1c464fd6b9bf46efbe581d040ba37378e0861ae245fbfac4dc3d3132295157fa4c5123d92c4b780eeeecba87a48b59d44c74e2f268d4b057e763564a0
7
- data.tar.gz: f510e5e6e0e22e302e8fbc4f62489a90ba7a742249003aded0db14664b3acfbc411a18952dfd5007b9516456f8be12ef4d092618c97b5a1517be0978397af120
6
+ metadata.gz: b26f6967c9a977fb2f6bcd3cd53a7ed1341f60736e386401e9e26409e675945bfb114ee8224edee98c33aef4ae2d1abae1f155c731116f15d57df9ef6b2f0311
7
+ data.tar.gz: df1d287454d53695032f5ae5c28d1588f9b01f35639f6d9a2dee191a22f56b22760d1c08430e4a96e8cfa52902406a2ef68b66dbe352def8184079d7f5d250eb
data/README.md CHANGED
@@ -2,11 +2,11 @@
2
2
 
3
3
  A Ruby implementation of [UTS #58](https://www.unicode.org/reports/tr58/),
4
4
  the Unicode spec for finding links in running text. Given a chunk of text,
5
- it returns the URLs in it along with their character offsets.
5
+ it returns the URLs and email addresses in it along with their character
6
+ offsets.
6
7
 
7
- This covers the **web link** half of UTS #58 only. Email address recognition
8
- is not implemented here since at the moment it's unclear whether that's
9
- desirable on generally visible web pages.
8
+ Both halves of UTS #58 are covered: **web links** and **email addresses**.
9
+ The two are detected independently and can be combined.
10
10
 
11
11
  Tested extensively on relevant OSes: [![CI](https://github.com/arnt/uts58/actions/workflows/ci.yml/badge.svg)](https://github.com/arnt/uts58/actions/workflows/ci.yml)
12
12
 
@@ -56,13 +56,62 @@ can read.)
56
56
  Trailing punctuation, balanced brackets, ports, paths, queries and fragments
57
57
  are handled per the spec.
58
58
 
59
+ ## Email addresses
60
+
61
+ Email detection mirrors the URL methods. Each result carries the address
62
+ twice — as a bare `:email` and as a `mailto:` `:url` — so it drops straight
63
+ into anything that already renders a `:url` entity:
64
+
65
+ ```ruby
66
+ Uts58.extract_email_addresses_with_indices("write to info@grå.org today")
67
+ # => [{ email: "info@grå.org",
68
+ # url: "mailto:info@grå.org",
69
+ # indices: [9, 21] }]
70
+
71
+ Uts58.extract_email_addresses("write to info@grå.org today")
72
+ # => ["info@grå.org"]
73
+ ```
74
+
75
+ UTS #58 allows Unicode local-parts, so `阿Q@例子.中国` and `उदाहरण@उदाहरण.भारत`
76
+ are recognised; the domain is IDN-decoded just like a URL host. A leading
77
+ `mailto:` in the input is folded into the matched span.
78
+
79
+ ## Combined extraction
80
+
81
+ `extract_entities_with_indices` runs both detectors, sorts by offset, and
82
+ strips overlaps — mirroring `Twitter::TwitterText::Extractor#extract_entities_with_indices`.
83
+ The result is a mixed list of `:url` and email (`:email` + `:url`) hashes:
84
+
85
+ ```ruby
86
+ Uts58.extract_entities_with_indices("mail arnt@grå.org or see blogspot.com")
87
+ # => [{ email: "arnt@grå.org", url: "mailto:arnt@grå.org", indices: [5, 17] },
88
+ # { url: "https://blogspot.com", indices: [25, 37] }]
89
+
90
+ Uts58.extract_entities("mail arnt@grå.org or see blogspot.com")
91
+ # => ["mailto:arnt@grå.org", "https://blogspot.com"]
92
+ ```
93
+
94
+ ### Not wanting `mailto:` links
95
+
96
+ `info@example.com` overlaps the bare domain `example.com` that the URL scan
97
+ finds after the `@`. If you'd rather not turn addresses into `mailto:` links,
98
+ you have two options, with different results for `contact info@example.com for pricing`:
99
+
100
+ 1. **Extract both, then drop emails.** Take `extract_entities_with_indices`
101
+ (already overlap-stripped) and reject the hashes that have an `:email`
102
+ key. The address wins the overlap, so dropping it leaves that span
103
+ *unlinked* — `info@example.com` stays plain text. Choose this if an
104
+ address shouldn't silently become a website link.
105
+ 2. **Extract only URLs.** Call `extract_urls_with_indices` and skip email
106
+ detection entirely. The URL scan finds the domain after the `@`, so the
107
+ same input links to `https://example.com`. Choose this if you'd
108
+ rather fall back to the domain.
109
+
59
110
  ## What's not here
60
111
 
61
- - **Email addresses.** UTS #58 covers them; this gem doesn't. If you
62
- need that, send me mail and explain what you need.
63
112
  - **Link validation.** Recognised URLs are not fetched, normalised beyond
64
- IDN decoding, or their hostnames checked in the DNS. Again, if you
65
- need this, send me mail.
113
+ IDN decoding, or their hostnames checked in the DNS. If you need this,
114
+ send me mail.
66
115
 
67
116
  ## Roadmap
68
117
 
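The two options under "Not wanting `mailto:` links" can be sketched without calling the gem at all. The hashes below are hand-written stand-ins, shaped like the results the README describes for `contact info@example.com for pricing`. The variable names are illustrative assumptions and not part of the gem's API.

```ruby
# Hand-written stand-ins for the results the README describes; nothing
# here calls the gem. Offsets are codepoints, end exclusive.
text = "contact info@example.com for pricing"

# Assumed shape of an overlap-stripped combined extraction: the address
# wins the overlap with the bare domain, so only the email entity remains.
combined = [
  { email: "info@example.com", url: "mailto:info@example.com", indices: [8, 24] }
]

# Assumed shape of a URL-only extraction: the domain after the "@".
urls_only = [
  { url: "https://example.com", indices: [13, 24] }
]

# Option 1: extract both, then drop emails. Nothing is left to link,
# so "info@example.com" stays plain text.
option1 = combined.reject { |entity| entity.key?(:email) }

# Option 2: extract only URLs. The same input links to the domain.
option2 = urls_only

puts option1.inspect
option2.each do |entity|
  first, last = entity[:indices]
  puts "#{text[first...last]} -> #{entity[:url]}"
end
```

Which option fits depends on whether an address silently becoming a website link is acceptable in the rendering context.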
@@ -4,15 +4,23 @@ require 'public_suffix'
4
4
  require_relative 'constants'
5
5
 
6
6
  module Uts58
7
- # Finds web links in arbitrary text per UTS #58. The public API
8
- # mirrors Twitter::TwitterText::Extractor closely enough that
9
- # twitter-text consumers (notably Mastodon) can swap one for the
10
- # other.
7
+ # Finds links in arbitrary text per UTS #58. The public API mirrors
8
+ # Twitter::TwitterText::Extractor closely enough that twitter-text
9
+ # consumers (notably Mastodon) can easily swap one for the other.
11
10
  #
12
11
  # Instances carry only optional configuration (see #max_length=); if
13
- # you don't need to set anything, the module-level
14
- # Uts58.extract_urls and Uts58.extract_urls_with_indices shortcuts
12
+ # you don't need to set anything, the module-level shortcuts
15
13
  # are simpler.
14
+ #
15
+ # Note that this may often find overlapping link candidates,
16
+ # e.g. "contact example@example.com for details" may find a mailto
17
+ # link and also a link to +https://example.com+. You'll almost
18
+ # certainly want to call #remove_overlapping_entities after extracting
19
+ # the kinds of entities you want and merging the lists.
20
+ #
21
+ # (Other common overlaps include Bluesky handles vs. web sites
22
+ # and Fediverse vs. email; Tibetan domains vs. themselves are a
23
+ # less common case. The list goes on.)
16
24
  class Extractor
17
25
  PATH_CLOSERS = [35, 47, 63]
18
26
  QUERY_CLOSERS = [35] # how about &?
@@ -20,7 +28,7 @@ module Uts58
20
28
 
21
29
  # Maximum allowed length of the matched text, in input codepoints.
22
30
  # Matches whose input span exceeds this are dropped from the result
23
- # of #extract_urls_with_indices.
31
+ # of #extract_urls_with_indices and the other extraction methods.
24
32
  #
25
33
  # "Matched text" means the substring that came out of +text+ — for
26
34
  # example 11 for <tt>"example.com"</tt>. The returned +:url+ can
@@ -127,6 +135,73 @@ module Uts58
127
135
  extract_urls_with_indices(text, options).map { |r| r[:url] }
128
136
  end
129
137
 
138
+ # Returns every email address found in +text+ as a list of hashes:
139
+ #
140
+ # { email: String, url: String, indices: [start, end] }
141
+ #
142
+ # +email+ is the bare address ( <tt>"info@example.com"</tt> ); +url+ is
143
+ # the same thing as a +mailto:+ URL ( <tt>"mailto:info@example.com"</tt> ), so
144
+ # that the result drops straight into anything that already knows how
145
+ # to render a <tt>:url</tt> entity. Both carry the IDN-decoded domain
146
+ # (A-labels become U-labels, as in #extract_urls_with_indices).
147
+ # +indices+ are codepoint offsets, +end+ exclusive; they cover a
148
+ # leading +mailto:+ in the input if there was one, per UTS #58 §5.2.
149
+ #
150
+ # A plain address such as "info@example.com" overlaps the bare domain
151
+ # "example.com" that #extract_urls_with_indices would find after the
152
+ # <tt>@</tt>. If you'd rather *not* turn addresses into +mailto:+ links, you
153
+ # have two choices, with different outcomes for
154
+ # "blah info@example.com blah":
155
+ #
156
+ # 1. Extract both kinds, merge with #remove_overlapping_entities, then
157
+ # drop the survivors that have an +:email+ key. The address wins the
158
+ # overlap, so dropping it leaves that span unlinked — "info@example.com"
159
+ # becomes plain text.
160
+ # 2. Extract only URLs (skip this method). The URL scan still sees the
161
+ # domain after the <tt>@</tt>, so the same input links to
162
+ # +https://example.com+.
163
+ #
164
+ # Returns an empty array if +text+ contains no addresses. +options+ is
165
+ # accepted for twitter-text compatibility and currently ignored.
166
+ def extract_email_addresses_with_indices(text, options = {})
167
+ result = []
168
+ text.to_enum(:scan, /@/).map{Regexp.last_match}.each do |match|
169
+ at_pos = match.begin(0)
170
+ pre = text[0...at_pos]
171
+ lp_match = /[\p{XID_Continue}.!#$%&'*+\-\/=?^_`{|}~]+\z/.match(pre)
172
+ next unless lp_match
173
+ local = lp_match[0]
174
+ next if local.start_with?('.') || local.end_with?('.') || local.include?('..')
175
+ s = match.post_match
176
+ prefix = /^([-\p{L}\p{N}\p{M}ßς۽۾་〇]+[\.。]){1,4}[-\p{L}\p{N}\p{M}]+(?![-\p{L}\p{N}\p{M}])/.match(s)
177
+ next unless prefix && prefix[0].length < 254
178
+ hn = SimpleIDN.to_unicode(prefix.match(0).gsub(/。/, "."))
179
+ begin
180
+ about = PublicSuffix.parse(hn, ignore_private: true, default_rule: nil)
181
+ next unless about && about.tld != "invalid"
182
+ rescue PublicSuffix::DomainInvalid, PublicSuffix::DomainNotAllowed
183
+ next
184
+ end
185
+ local_start = at_pos - local.length
186
+ end_pos = at_pos + 1 + prefix[0].length
187
+ # UTS #58 §5.2 step 6: absorb a leading "mailto:" into the span.
188
+ if local_start >= 7 && text[(local_start - 7)...local_start].downcase == "mailto:"
189
+ local_start -= 7
190
+ end
191
+ next if @max_length && (end_pos - local_start) > @max_length
192
+ result << {
193
+ email: "#{local}@#{hn}",
194
+ url: "mailto:#{local}@#{hn}",
195
+ indices: [local_start, end_pos]
196
+ }
197
+ end
198
+ result
199
+ end
200
+
201
+ def extract_email_addresses(text, options = {})
202
+ extract_email_addresses_with_indices(text, options).map { |r| r[:email] }
203
+ end
204
+
130
205
  # Given a list of entities (hashes with an +:indices+ key of the
131
206
  # shape <tt>[start, end]</tt>, as produced by
132
207
  # #extract_urls_with_indices) drops every entity that overlaps an
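The span arithmetic in `extract_email_addresses_with_indices` (codepoint offsets, exclusive end, and a leading `mailto:` folded into the span) can be sketched in isolation. `absorb_mailto` is a hypothetical helper written for this illustration, not part of the gem, and it assumes the caller already knows where the local-part starts and where the domain ends.

```ruby
# Hypothetical helper, not part of the gem: widen a span to include a
# leading "mailto:" (any letter case) that sits directly before it.
MAILTO = "mailto:"

def absorb_mailto(text, local_start, end_pos)
  before = local_start - MAILTO.length
  if before >= 0 && text[before...local_start].downcase == MAILTO
    local_start = before
  end
  [local_start, end_pos]
end

address = "info@grå.org"

# No prefix: the span is just the address. String#index and String#length
# count characters, so "å" contributes one position, not two bytes.
plain = "write to info@grå.org today"
start = plain.index(address)
plain_span = absorb_mailto(plain, start, start + address.length)

# With a prefix: the span grows by seven codepoints to the left.
prefixed = "write to MAILTO:info@grå.org today"
start = prefixed.index(address)
prefixed_span = absorb_mailto(prefixed, start, start + address.length)

p plain_span
p prefixed_span
puts prefixed[prefixed_span[0]...prefixed_span[1]]
```

Only the span widens. The sketch leaves the address itself untouched, in line with the description of `:email` as the bare address.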
data/lib/uts58.rb CHANGED
@@ -9,7 +9,7 @@
9
9
  # any characters in the input, the wrappers keep the earlier one and
10
10
  # drop the rest. Use Uts58::Extractor directly if you want the
11
11
  # raw, possibly-overlapping list (e.g. to merge with hashtag/mention
12
- # extractors before resolving overlap yourself).
12
+ # extractors before resolving overlap).
13
13
  #
14
14
  # Uts58.extract_urls("see example.com here")
15
15
  # # => ["https://example.com"]
@@ -17,7 +17,7 @@
17
17
  # Uts58.extract_urls_with_indices("see example.com here")
18
18
  # # => [{ url: "https://example.com", indices: [4, 15] }]
19
19
  module Uts58
20
- VERSION = "0.1.1"
20
+ VERSION = "0.2.0"
21
21
 
22
22
  class << self
23
23
  # Like Uts58::Extractor#extract_urls_with_indices, but with
@@ -35,6 +35,46 @@ module Uts58
35
35
  extract_urls_with_indices(text, options).map { |r| r[:url] }
36
36
  end
37
37
 
38
+ # Like Uts58::Extractor#extract_email_addresses_with_indices, but
39
+ # with overlapping results removed.
40
+ def extract_email_addresses_with_indices(text, options = {})
41
+ extractor.remove_overlapping_entities(
42
+ extractor.extract_email_addresses_with_indices(text, options)
43
+ )
44
+ end
45
+
46
+ # Like Uts58::Extractor#extract_email_addresses, but with
47
+ # overlapping results removed.
48
+ def extract_email_addresses(text, options = {})
49
+ extract_email_addresses_with_indices(text, options).map { |r| r[:email] }
50
+ end
51
+
52
+ # Both the URLs and email addresses in +text+, as one list of
53
+ # mixed-shape hashes: <tt>{ url:, indices: }</tt> for links and
54
+ # <tt>{ email:, url:, indices: }</tt> for addresses, sorted by start
55
+ # offset with overlaps removed. The name and mixed-shape return
56
+ # follow Twitter::TwitterText::Extractor#extract_entities_with_indices.
57
+ #
58
+ # Overlap is the point of going through here rather than calling the
59
+ # two extractors yourself: "contact info@grå.org today" yields both
60
+ # an email and the bare domain grå.org, and only one of those should
61
+ # survive. The earlier-starting candidate (the email) wins.
62
+ def extract_entities_with_indices(text, options = {})
63
+ extractor.remove_overlapping_entities(
64
+ extractor.extract_urls_with_indices(text, options) +
65
+ extractor.extract_email_addresses_with_indices(text, options)
66
+ )
67
+ end
68
+
69
+ # Like ::extract_entities_with_indices, but flattened to the bare
70
+ # URL strings, in the order they occur. Email addresses appear in
71
+ # their +mailto:+ form, e.g. "contact info@example.com or look at
72
+ # example.com" returns [<tt>"mailto:info@example.com"</tt>,
73
+ # <tt>"https://example.com"</tt>].
74
+ def extract_entities(text, options = {})
75
+ extract_entities_with_indices(text, options).map { |e| e[:url] }
76
+ end
77
+
38
78
  private
39
79
 
40
80
  def extractor
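As a rough illustration of the "sorted by start offset, earlier candidate wins" behaviour described for `extract_entities_with_indices`, the standalone sketch below resolves overlaps over hand-written candidate hashes. `drop_overlaps` is a hypothetical stand-in, not the gem's `remove_overlapping_entities`, whose exact tie-breaking may differ. The candidate list is an assumption about what the two detectors might return for the README's example sentence.

```ruby
# Hypothetical stand-in for overlap removal: sort by start offset and keep
# a candidate only if it starts at or after the end of the last kept one.
# Ends are exclusive, so start == previous end does not overlap.
def drop_overlaps(entities)
  kept = []
  entities.sort_by { |entity| entity[:indices][0] }.each do |entity|
    kept << entity if kept.empty? || entity[:indices][0] >= kept.last[:indices][1]
  end
  kept
end

# Assumed raw candidates for "mail arnt@grå.org or see blogspot.com":
# URL candidates first, then email candidates, as in the wrapper above.
candidates = [
  { url: "https://grå.org", indices: [10, 17] },
  { url: "https://blogspot.com", indices: [25, 37] },
  { email: "arnt@grå.org", url: "mailto:arnt@grå.org", indices: [5, 17] }
]

entities = drop_overlaps(candidates)

# Flattening to :url works for both shapes because email entities carry
# their mailto: form under the same key.
urls = entities.map { |entity| entity[:url] }
puts urls
```

The bare-domain candidate starts inside the address span, so it is dropped, and the two survivors come out in the order they occur in the text.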
data/uts58.gemspec CHANGED
@@ -2,7 +2,7 @@
2
2
 
3
3
  Gem::Specification.new do |spec|
4
4
  spec.name = "uts58"
5
- spec.version = "0.1.1"
5
+ spec.version = "0.2.0"
6
6
  spec.authors = ["Arnt Gulbrandsen"]
7
7
  spec.email = ["arnt@gulbrandsen.priv.no"]
8
8
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: uts58
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Arnt Gulbrandsen