uts58 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 94090cf157a1aad3bbaad182f23ac8041f24ef2636f5899cab3059729d675bcc
+   data.tar.gz: 5c0147a99c2a7ed929e17756a1c16db2d32080c8a19ea9792aeabc40041b4e89
+ SHA512:
+   metadata.gz: 8bd0c865a3222e5129c364aa95d5f6efe03a0860457f07095a083a1e203bf13dde564601aaf8fa2e861f106a6df54cb61728f2ba37a5603a819b7c59ef4d4d66
+   data.tar.gz: 61ad2bd233f2ac2266c02a5acce5caf66d2f07b8d23e37a8a75ea5ebe27388b8149a6e77822db63ed9606b6bf75c845ef1efc7a9389313fdfb0525fd6f954de3
@@ -0,0 +1,21 @@
+ name: CI
+
+ on:
+   push:
+     branches: [main]
+   pull_request:
+
+ jobs:
+   test:
+     runs-on: ubuntu-latest
+     strategy:
+       fail-fast: false
+       matrix:
+         ruby: ["3.1", "3.2", "3.3", "3.4"]
+     steps:
+       - uses: actions/checkout@v4
+       - uses: ruby/setup-ruby@v1
+         with:
+           ruby-version: ${{ matrix.ruby }}
+           bundler-cache: true
+       - run: bundle exec rspec
data/Gemfile ADDED
@@ -0,0 +1,5 @@
+ # frozen_string_literal: true
+
+ source "https://rubygems.org"
+
+ gemspec
data/README.md ADDED
@@ -0,0 +1,76 @@
+ # uts58
+
+ A Ruby implementation of [UTS #58](https://www.unicode.org/reports/tr58/),
+ the Unicode spec for finding links in running text. Given a chunk of text,
+ it returns the URLs in it along with their character offsets.
+
+ This covers the **web link** half of UTS #58 only. Email address recognition
+ is not implemented, since it is currently unclear whether that is desirable
+ on publicly visible web pages.
+
+ Tested in CI on Ubuntu with Ruby 3.1 through 3.4: [![CI](https://github.com/arnt/uts58/actions/workflows/ci.yml/badge.svg)](https://github.com/arnt/uts58/actions/workflows/ci.yml)
+
+ ## Install
+
+ ```ruby
+ gem "uts58"
+ ```
+
+ ## Usage
+
+ ```ruby
+ require "uts58"
+
+ Uts58.extract_urls_with_indices("see https://example.com/ for details")
+ # => [{ url: "https://example.com/", indices: [4, 24] }]
+
+ Uts58.extract_urls("see https://example.com/ for details")
+ # => ["https://example.com/"]
+ ```
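Editor's note on the offsets: judging by the README's own examples, `:indices` appears to be a half-open `[start, end)` pair of character positions, so slicing the input with an exclusive range recovers the matched text. The sketch below does not load the gem; the hashes are copied from the README's sample output, and the half-open reading is an assumption inferred from those samples.

```ruby
# Sample results copied from the README; the gem itself is not required here.
text   = "see https://example.com/ for details"
entity = { url: "https://example.com/", indices: [4, 24] }

# Assumption: indices are [start, end) in characters, end exclusive.
first, last = entity[:indices]
matched = text[first...last]

# For scheme-less input the README shows "https://" being prepended, so the
# returned :url can differ from the text actually found at the indices.
bare_text   = "blogspot.com is still a thing"
bare_entity = { url: "https://blogspot.com", indices: [0, 12] }
bare_first, bare_last = bare_entity[:indices]
bare_matched = bare_text[bare_first...bare_last]

puts matched
puts bare_matched
```

When rewriting text into markup, the slice at the indices is the part to replace, and `:url` is the link target.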
+
+ The API mirrors `Twitter::TwitterText::Extractor#extract_urls_with_indices`
+ closely; it was written to provide what Mastodon uses. The two module-level
+ methods above also strip partly overlapping matches; you can use
+ `Uts58::Extractor` directly if you'd rather merge with other extractors
+ (mentions, hashtags, …) and resolve overlap across all of them yourself.
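Editor's note: resolving overlap across several extractors could look roughly like the sketch below. The README does not document the methods of `Uts58::Extractor`, so the sketch works on plain hashes shaped like the README's sample output. `remove_overlapping` is a hypothetical helper, not part of the gem, and the hashtag entities are made up for illustration. It keeps the earliest-starting entity and drops any later one that overlaps it, which is one possible policy among several.

```ruby
# Hypothetical helper, not part of uts58. Entities are hashes with an
# :indices pair, assumed to be [start, end) with the end exclusive.
def remove_overlapping(entities)
  kept = []
  # Sorting by the [start, end] pair orders entities by start, then by end.
  entities.sort_by { |e| e[:indices] }.each do |entity|
    prev = kept.last
    # Skip an entity that starts before the previously kept one has ended.
    next if prev && entity[:indices][0] < prev[:indices][1]
    kept << entity
  end
  kept
end

urls = [{ url: "https://example.com/", indices: [4, 24] }]
# Made-up results standing in for some other extractor.
hashtags = [
  { hashtag: "details", indices: [20, 28] },
  { hashtag: "ruby", indices: [30, 35] }
]

merged = remove_overlapping(urls + hashtags)
merged.each { |e| p e }
```

Here the first hashtag overlaps the URL and is dropped, while the second survives.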
+
+ Input without a scheme is recognised, and `https://` is prepended in the
+ returned `:url`:
+
+ ```ruby
+ Uts58.extract_urls_with_indices("blogspot.com is still a thing")
+ # => [{ url: "https://blogspot.com", indices: [0, 12] }]
+ ```
+
+ Punycode-encoded IDNs are decoded to Unicode in the output, for better readability:
+
+ ```ruby
+ Uts58.extract_urls("xn-----ctdbabcfhu9c2b9l1acccr4c.xn--mgbah1a3hjkrd").first
+ # => "https://تجربة-القبول-الشامل.موريتانيا"
+ ```
+
+ (Admittedly that output isn't very readable if you can't read Arabic.
+ But the input wasn't readable to anyone, no matter what languages they
+ can read.)
+
+ Trailing punctuation, balanced brackets, ports, paths, queries and fragments
+ are handled per the spec.
+
+ ## What's not here
+
+ - **Email addresses.** UTS #58 covers them; this gem doesn't. If you
+ need that, send me mail and explain what you need.
+ - **Link validation.** Recognised URLs are not fetched or normalised beyond
+ IDN decoding, and their hostnames are not checked in the DNS. Again, if
+ you need this, send me mail.
+
+ ## Roadmap
+
+ My immediate need is UTS #58-conformant link detection suitable for
+ public web pages. If you need something more, I rather think that an
+ item can be added to this roadmap, so long as the description in the
+ rdoc remains short and simple. Send mail to arnt@gulbrandsen.priv.no.
+
+ ## License
+
+ BSD-2-Clause. See `LICENSE`.