RubyGems - uts58 - Versions diffs - 0.1.1 → 0.2.0 - Mend

uts58 0.1.1 → 0.2.0

Files changed (6)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 94cf52fc3b4ea1f23cdba5bebe093e9f9c1c2aa06abd30f24fc27385983b1597
-  data.tar.gz: 5f9dc85d33fe996c80d876197fd7affec34a0d99901b8daffb58458ce75e8eb5
+  metadata.gz: 219330b829e83d0c24c85005951597734773bb0380be136a5b60dd948e913c53
+  data.tar.gz: ffe60d5db9448fea9b657de9fe6a7db279a8f693195c58be91618ae16be6effc
 SHA512:
-  metadata.gz: 35edf0e1c464fd6b9bf46efbe581d040ba37378e0861ae245fbfac4dc3d3132295157fa4c5123d92c4b780eeeecba87a48b59d44c74e2f268d4b057e763564a0
-  data.tar.gz: f510e5e6e0e22e302e8fbc4f62489a90ba7a742249003aded0db14664b3acfbc411a18952dfd5007b9516456f8be12ef4d092618c97b5a1517be0978397af120
+  metadata.gz: b26f6967c9a977fb2f6bcd3cd53a7ed1341f60736e386401e9e26409e675945bfb114ee8224edee98c33aef4ae2d1abae1f155c731116f15d57df9ef6b2f0311
+  data.tar.gz: df1d287454d53695032f5ae5c28d1588f9b01f35639f6d9a2dee191a22f56b22760d1c08430e4a96e8cfa52902406a2ef68b66dbe352def8184079d7f5d250eb

data/README.md CHANGED

@@ -2,11 +2,11 @@
 A Ruby implementation of [UTS #58](https://www.unicode.org/reports/tr58/),
 the Unicode spec for finding links in running text. Given a chunk of text,
-it returns the URLs in it along with their character offsets.
+it returns the URLs and email addresses in it along with their character
+offsets.
-This covers the **web link** half of UTS #58 only. Email address recognition
-is not implemented here since at the moment it's unclear whether that's
-desirable on generally visible web pages.
+Both halves of UTS #58 are covered: **web links** and **email addresses**.
+The two are detected independently and can be combined.
 Tested extensively on relevant OSes: [![CI](https://github.com/arnt/uts58/actions/workflows/ci.yml/badge.svg)](https://github.com/arnt/uts58/actions/workflows/ci.yml)
@@ -56,13 +56,62 @@ can read.)
 Trailing punctuation, balanced brackets, ports, paths, queries and fragments
 are handled per the spec.
+## Email addresses
+Email detection mirrors the URL methods. Each result carries the address
+twice — as a bare `:email` and as a `mailto:` `:url` — so it drops straight
+into anything that already renders a `:url` entity:
+```ruby
+Uts58.extract_email_addresses_with_indices("write to info@grå.org today")
+# => [{ email: "info@grå.org",
+#       url: "mailto:info@grå.org",
+#       indices: [9, 21] }]
+Uts58.extract_email_addresses("write to info@grå.org today")
+# => ["info@grå.org"]
+```
+UTS #58 allows Unicode local-parts, so `阿Q@例子.中国` and `उदाहरण@उदाहरण.भारत`
+are recognised; the domain is IDN-decoded just like a URL host. A leading
+`mailto:` in the input is folded into the matched span.
+## Combined extraction
+`extract_entities_with_indices` runs both detectors, sorts by offset, and
+strips overlaps — mirroring `Twitter::TwitterText::Extractor#extract_entities_with_indices`.
+The result is a mixed list of `:url` and email (`:email` + `:url`) hashes:
+```ruby
+Uts58.extract_entities_with_indices("mail arnt@grå.org or see blogspot.com")
+# => [{ email: "arnt@grå.org", url: "mailto:arnt@grå.org", indices: [5, 17] },
+#     { url: "https://blogspot.com", indices: [25, 37] }]
+Uts58.extract_entities("mail arnt@grå.org or see blogspot.com")
+# => ["mailto:arnt@grå.org", "https://blogspot.com"]
+```
+### Not wanting `mailto:` links
+`info@example.com` overlaps the bare domain `example.com` that the URL scan
+finds after the `@`. If you'd rather not turn addresses into `mailto:` links,
+you have two options, with different results for `contact info@example.com for pricing`:
+1. **Extract both, then drop emails.** Take `extract_entities_with_indices`
+   (already overlap-stripped) and reject the hashes that have an `:email`
+   key. The address wins the overlap, so dropping it leaves that span
+   *unlinked* — `info@example.com` stays plain text. Choose this if an
+   address shouldn't silently become a website link.
+2. **Extract only URLs.** Call `extract_urls_with_indices` and skip email
+   detection entirely. The URL scan finds the domain after the `@`, so the
+   same input links to `https://example.com`. Choose this if you'd
+   rather fall back to the domain.
 ## What's not here
-- **Email addresses.** UTS #58 covers them; this gem doesn't. If you
-  need that, send me mail and explain what you need.
 - **Link validation.** Recognised URLs are not fetched, normalised beyond
-  IDN decoding, or their hostnames checked in the DNS. Again, if you
-  need this, send me mail.
+  IDN decoding, or their hostnames checked in the DNS. If you need this,
+  send me mail.
 ## Roadmap
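
Note on the README change: option 1 in "Not wanting `mailto:` links" comes down to filtering on the `:email` key. The sketch below is a minimal illustration of that filter and of how the half-open codepoint `indices` map back onto the input. It does not call the gem; the hashes are copied by hand from the README's own example output, and the variable names `links_only` and `matched` are made up for this note.

```ruby
# Entities in the shape the README documents for
# Uts58.extract_entities_with_indices. Copied by hand from the README
# example, not captured from a run of the gem.
# "\u00E5" is the precomposed letter a-with-ring (one codepoint).
text = "mail arnt@gr\u00E5.org or see blogspot.com"
entities = [
  { email: "arnt@gr\u00E5.org", url: "mailto:arnt@gr\u00E5.org", indices: [5, 17] },
  { url: "https://blogspot.com", indices: [25, 37] }
]

# Option 1: drop every entity that carries an :email key, so the
# address span stays plain text instead of becoming a mailto: link.
links_only = entities.reject { |e| e.key?(:email) }

# indices are codepoint offsets with an exclusive end, so a three-dot
# range recovers the matched substring from the original text.
matched = entities.map { |e| text[e[:indices][0]...e[:indices][1]] }

p links_only
p matched
```

Because the filter runs after overlap removal, the bare domain inside the address has already been discarded and does not reappear as a website link.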

data/lib/uts58/extractor.rb CHANGED

@@ -4,15 +4,23 @@ require 'public_suffix'
 require_relative 'constants'
 module Uts58
-  # Finds web links in arbitrary text per UTS #58. The public API
-  # mirrors Twitter::TwitterText::Extractor closely enough that
-  # twitter-text consumers (notably Mastodon) can swap one for the
-  # other.
+  # Finds links in arbitrary text per UTS #58. The public API mirrors
+  # Twitter::TwitterText::Extractor closely enough that twitter-text
+  # consumers (notably Mastodon) can easily swap one for the other.
   #
   # Instances carry only optional configuration (see #max_length=); if
-  # you don't need to set anything, the module-level
-  # Uts58.extract_urls and Uts58.extract_urls_with_indices shortcuts
+  # you don't need to set anything, the module-level shortcuts
   # are simpler.
+  #
+  # Note that this may often find overlapping link candidates,
+  # e.g. "contact example@example.com for details" may find a mailto
+  # link and also a link to +https://example.com+. You'll almost
+  # certainly want to #remove_overlapping_entities after extracting
+  # the kinds of entities you want and merging the lists.
+  #
+  # (Bluesky handles vs. web sites are another example of common
+  # overlap, Fediverse vs. email a third, and Tibetan domains
+  # vs. themselves a less common fourth; the list goes on.)
   class Extractor
     PATH_CLOSERS = [35, 47, 63]
     QUERY_CLOSERS = [35] # how about &?
@@ -20,7 +28,7 @@ module Uts58
     # Maximum allowed length of the matched text, in input codepoints.
     # Matches whose input span exceeds this are dropped from the result
-    # of #extract_urls_with_indices.
+    # of #extract_urls_with_indices and the other extraction methods.
     #
     # "Matched text" means the substring that came out of +text+ — for
     # example 11 for <tt>"example.com"</tt>. The returned +:url+ can
@@ -127,6 +135,73 @@ module Uts58
       extract_urls_with_indices(text, options).map { |r| r[:url] }
     end
+    # Returns every email address found in +text+ as a list of hashes:
+    #
+    #   { email: String, url: String, indices: [start, end] }
+    #
+    # +email+ is the bare address ( <tt>"info@example.com"</tt> ); +url+ is
+    # the same thing as a +mailto:+ URL ( <tt>"mailto:info@example.com"</tt> ), so
+    # that the result drops straight into anything that already knows how
+    # to render a <tt>:url</tt> entity. Both carry the IDN-decoded domain
+    # (A-labels become U-labels, as in #extract_urls_with_indices).
+    # +indices+ are codepoint offsets, +end+ exclusive; they cover a
+    # leading +mailto:+ in the input if there was one, per UTS #58 §5.2.
+    #
+    # A plain address such as "info@example.com" overlaps the bare domain
+    # "example.com" that #extract_urls_with_indices would find after the
+    # <tt>@</tt>. If you'd rather *not* turn addresses into +mailto:+ links, you
+    # have two choices, with different outcomes for
+    # "blah info@example.com blah":
+    #
+    # 1. Extract both kinds, merge with #remove_overlapping_entities, then
+    #    drop the survivors that have an +:email+ key. The address wins the
+    #    overlap, so dropping it leaves that span unlinked — "info@example.com"
+    #    becomes plain text.
+    # 2. Extract only URLs (skip this method). The URL scan still sees the
+    #    domain after the <tt>@</tt>, so the same input links to
+    #    +https://example.com+.
+    #
+    # Returns an empty array if +text+ contains no addresses. +options+ is
+    # accepted for twitter-text compatibility and currently ignored.
+    def extract_email_addresses_with_indices(text, options = {})
+      result = []
+      text.to_enum(:scan, /@/).map{Regexp.last_match}.each do |match|
+        at_pos = match.begin(0)
+        pre = text[0...at_pos]
+        lp_match = /[\p{XID_Continue}.!#$%&'*+\-\/=?^_`{|}~]+\z/.match(pre)
+        next unless lp_match
+        local = lp_match[0]
+        next if local.start_with?('.') || local.end_with?('.') || local.include?('..')
+        s = match.post_match
+        prefix = /^([-\p{L}\p{N}\p{M}ßς۽۾་〇]+[\.。]){1,4}[-\p{L}\p{N}\p{M}]+(?![-\p{L}\p{N}\p{M}])/.match(s)
+        next unless prefix && prefix[0].length < 254
+        hn = SimpleIDN.to_unicode(prefix.match(0).gsub(/。/, "."))
+        begin
+          about = PublicSuffix.parse(hn, ignore_private: true, default_rule: nil)
+          next unless about && about.tld != "invalid"
+        rescue PublicSuffix::DomainInvalid, PublicSuffix::DomainNotAllowed
+          next
+        end
+        local_start = at_pos - local.length
+        end_pos = at_pos + 1 + prefix[0].length
+        # UTS #58 §5.2 step 6: absorb a leading "mailto:" into the span.
+        if local_start >= 7 && text[(local_start - 7)...local_start].downcase == "mailto:"
+          local_start -= 7
+        end
+        next if @max_length && (end_pos - local_start) > @max_length
+        result << {
+          email: "#{local}@#{hn}",
+          url: "mailto:#{local}@#{hn}",
+          indices: [local_start, end_pos]
+        }
+      end
+      result
+    end
+    def extract_email_addresses(text, options = {})
+      extract_email_addresses_with_indices(text, options).map { |r| r[:email] }
+    end
     # Given a list of entities (hashes with an +:indices+ key of the
     # shape <tt>[start, end]</tt>, as produced by
     # #extract_urls_with_indices) drops every entity that overlaps an
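
Note on the `mailto:` handling in the hunk above: the span start is moved back by seven codepoints when the address is immediately preceded by `mailto:` in any letter case. The following is a standalone sketch of just that adjustment, assuming the start offset of the local-part is already known. `absorb_mailto` is a hypothetical helper written for this note and is not part of the gem.

```ruby
# Hypothetical helper (not in the gem): given the input text and the
# codepoint offset where the local-part starts, return the offset the
# reported span should start at.
def absorb_mailto(text, local_start)
  scheme = "mailto:"
  prefix_start = local_start - scheme.length
  # Not enough room before the address for a scheme prefix.
  return local_start if prefix_start.negative?

  # Case-insensitive comparison, as in the diff's use of downcase.
  text[prefix_start...local_start].downcase == scheme ? prefix_start : local_start
end

text = "see MAILTO:info@example.com"
span_start = absorb_mailto(text, 11) # "info" begins at offset 11

p span_start
p text[span_start...text.length]
```

Only the reported span grows; in the diff, the `:email` and `:url` values are still built from the local-part and the decoded host, so the prefix is not duplicated in the result.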

data/lib/uts58.rb CHANGED

@@ -9,7 +9,7 @@
 # any characters in the input, the wrappers keep the earlier one and
 # drop the rest. Use Uts58::Extractor directly if you want the
 # raw, possibly-overlapping list (e.g. to merge with hashtag/mention
-# extractors before resolving overlap yourself).
+# extractors before resolving overlap).
 #
 #   Uts58.extract_urls("see example.com here")
 #   # => ["https://example.com"]
@@ -17,7 +17,7 @@
 #   Uts58.extract_urls_with_indices("see example.com here")
 #   # => [{ url: "https://example.com", indices: [4, 15] }]
 module Uts58
-  VERSION = "0.1.1"
+  VERSION = "0.2.0"
   class << self
     # Like Uts58::Extractor#extract_urls_with_indices, but with
@@ -35,6 +35,46 @@ module Uts58
       extract_urls_with_indices(text, options).map { |r| r[:url] }
     end
+    # Like Uts58::Extractor#extract_email_addresses_with_indices, but
+    # with overlapping results removed.
+    def extract_email_addresses_with_indices(text, options = {})
+      extractor.remove_overlapping_entities(
+        extractor.extract_email_addresses_with_indices(text, options)
+      )
+    end
+    # Like Uts58::Extractor#extract_email_addresses, but with
+    # overlapping results removed.
+    def extract_email_addresses(text, options = {})
+      extract_email_addresses_with_indices(text, options).map { |r| r[:email] }
+    end
+    # Both the URLs and email addresses in +text+, as one list of
+    # mixed-shape hashes (<tt>{ url:, indices: }</tt> for links and
+    # <tt>{ email:, url:, indices: }</tt> for addresses), sorted by start
+    # offset with overlaps removed. The name and mixed-shape return
+    # follow Twitter::TwitterText::Extractor#extract_entities_with_indices.
+    #
+    # Overlap is the point of going through here rather than calling the
+    # two extractors yourself: "contact info@grå.org today" yields both
+    # an email and the bare domain grå.org, and only one of those should
+    # survive. The earlier-starting candidate (the email) wins.
+    def extract_entities_with_indices(text, options = {})
+      extractor.remove_overlapping_entities(
+        extractor.extract_urls_with_indices(text, options) +
+        extractor.extract_email_addresses_with_indices(text, options)
+      )
+    end
+    # Like ::extract_entities_with_indices, but flattened to the bare
+    # URL strings, in the order they occur. Email addresses appear in
+    # their +mailto:+ form, e.g. "contact info@example.com or look at
+    # example.com" returns [<tt>"mailto:info@example.com"</tt>,
+    # <tt>"https://example.com"</tt>].
+    def extract_entities(text, options = {})
+      extract_entities_with_indices(text, options).map { |e| e[:url] }
+    end
     private
     def extractor
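
Note on overlap resolution: the module comments describe keeping the earlier-starting candidate and dropping the rest. The body of `remove_overlapping_entities` is not part of this diff, so the following is only a sketch of that rule under stated assumptions. `drop_overlaps` is a hypothetical name, treating touching spans as non-overlapping is an assumption, and the offsets were worked out by hand for the README's sample sentence `contact info@example.com for pricing`.

```ruby
# Hypothetical sketch of "earliest start wins"; not the gem's
# remove_overlapping_entities, whose body is not shown in this diff.
def drop_overlaps(entities)
  kept = []
  entities.sort_by { |e| e[:indices].first }.each do |entity|
    start = entity[:indices].first
    # Spans are half-open, so an entity may begin exactly where the
    # previously kept one ended (assumption for this sketch).
    kept << entity if kept.empty? || start >= kept.last[:indices].last
  end
  kept
end

# Hand-computed candidates for "contact info@example.com for pricing":
# the address covers [8, 24) and the bare domain covers [13, 24).
candidates = [
  { url: "https://example.com", indices: [13, 24] },
  { email: "info@example.com", url: "mailto:info@example.com", indices: [8, 24] }
]
survivors = drop_overlaps(candidates)

p survivors
```

The address starts earlier than the bare domain, so it is the one that survives, which matches the behaviour the comment on `extract_entities_with_indices` describes.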

data/uts58.gemspec CHANGED

@@ -2,7 +2,7 @@
 Gem::Specification.new do |spec|
   spec.name          = "uts58"
-  spec.version       = "0.1.1"
+  spec.version       = "0.2.0"
   spec.authors       = ["Arnt Gulbrandsen"]
   spec.email         = ["arnt@gulbrandsen.priv.no"]

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: uts58
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.2.0
 platform: ruby
 authors:
 - Arnt Gulbrandsen