uts58 0.1.0
- checksums.yaml +7 -0
- data/.github/workflows/ci.yml +21 -0
- data/Gemfile +5 -0
- data/README.md +76 -0
- data/lib/uts58/constants.rb +138233 -0
- data/lib/uts58/extractor.rb +201 -0
- data/lib/uts58.rb +46 -0
- data/uts58.gemspec +27 -0
- metadata +117 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 94090cf157a1aad3bbaad182f23ac8041f24ef2636f5899cab3059729d675bcc
+  data.tar.gz: 5c0147a99c2a7ed929e17756a1c16db2d32080c8a19ea9792aeabc40041b4e89
+SHA512:
+  metadata.gz: 8bd0c865a3222e5129c364aa95d5f6efe03a0860457f07095a083a1e203bf13dde564601aaf8fa2e861f106a6df54cb61728f2ba37a5603a819b7c59ef4d4d66
+  data.tar.gz: 61ad2bd233f2ac2266c02a5acce5caf66d2f07b8d23e37a8a75ea5ebe27388b8149a6e77822db63ed9606b6bf75c845ef1efc7a9389313fdfb0525fd6f954de3
data/.github/workflows/ci.yml
ADDED
@@ -0,0 +1,21 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        ruby: ["3.1", "3.2", "3.3", "3.4"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: ruby/setup-ruby@v1
+        with:
+          ruby-version: ${{ matrix.ruby }}
+          bundler-cache: true
+      - run: bundle exec rspec
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,76 @@
+# uts58
+
+A Ruby implementation of [UTS #58](https://www.unicode.org/reports/tr58/),
+the Unicode spec for finding links in running text. Given a chunk of text,
+it returns the URLs in it along with their character offsets.
+
+This covers the **web link** half of UTS #58 only. Email address recognition
+is not implemented here since at the moment it's unclear whether that's
+desirable on generally visible web pages.
+
+Tested extensively on relevant OSes: [CI status](https://github.com/arnt/uts58/actions/workflows/ci.yml)
+
+## Install
+
+```ruby
+gem "uts58"
+```
+
+## Usage
+
+```ruby
+require "uts58"
+
+Uts58.extract_urls_with_indices("see https://example.com/ for details")
+# => [{ url: "https://example.com/", indices: [4, 24] }]
+
+Uts58.extract_urls("see https://example.com/ for details")
+# => ["https://example.com/"]
+```
+
+The API mirrors `Twitter::TwitterText::Extractor#extract_urls_with_indices`
+closely; it was written to provide what Mastodon uses. The two module-level
+methods above also strip partly overlapping matches; you can use
+`Uts58::Extractor` directly if you'd rather merge with other extractors
+(mentions, hashtags, …) and resolve overlap across all of them yourself.
+
+Input without a scheme is recognised, and `https://` is prepended in the
+returned `:url`:
+
+```ruby
+Uts58.extract_urls_with_indices("blogspot.com is still a thing")
+# => [{ url: "https://blogspot.com", indices: [0, 12] }]
+```
+
+IDNs are decoded to UTF-8 in the output, for better readability:
+
+```ruby
+Uts58.extract_urls("xn-----ctdbabcfhu9c2b9l1acccr4c.xn--mgbah1a3hjkrd").first
+# => "https://تجربة-القبول-الشامل.موريتانيا"
+```
+
+(Admittedly that output isn't very readable if you can't read Arabic.
+But the input wasn't readable to anyone, no matter what languages they
+can read.)
+
+Trailing punctuation, balanced brackets, ports, paths, queries and fragments
+are handled per the spec.
+
+## What's not here
+
+- **Email addresses.** UTS #58 covers them; this gem doesn't. If you
+  need that, send me mail and explain what you need.
+- **Link validation.** Recognised URLs are not fetched, normalised beyond
+  IDN decoding, or their hostnames checked in the DNS. Again, if you
+  need this, send me mail.
+
+## Roadmap
+
+My immediate need is UTS #58-conformant link detection suitable for
+public web pages. If you need something more, I rather think that an
+item can be added to this roadmap, so long as the description in the
+rdoc remains short and simple. Send mail to arnt@gulbrandsen.priv.no.
+
+## License
+
+BSD-2-Clause. See `LICENSE`.
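The README mentions merging URL matches with the output of other extractors and resolving overlap yourself, but does not show what that might look like. The following is a minimal sketch of one way to do it, not the gem's own algorithm. It works on plain hashes in the `{ url:, indices: [start, end] }` shape from the README examples, treats the end index as exclusive (which is what `[4, 24]` and `[0, 12]` in those examples suggest), and does not call the gem at all. The helper names `remove_overlapping` and `covered_text`, and the `:mention` entity, are hypothetical and exist only for illustration.

```ruby
# Sketch only: entity hashes follow the { url:, indices: [start, end] } shape
# shown in the README. The :mention entity and both helper names are
# hypothetical; nothing here is part of the uts58 gem.

# Keep the earliest-starting entity and drop any later one that overlaps it.
def remove_overlapping(entities)
  kept = []
  # Sort by [start, end] so entities are visited in text order.
  entities.sort_by { |entity| entity[:indices] }.each do |entity|
    # Ranges are treated as half-open, so an entity that starts exactly
    # where the previous one ends does not overlap it.
    if kept.empty? || entity[:indices][0] >= kept.last[:indices][1]
      kept << entity
    end
  end
  kept
end

# Return the substring of text that an entity covers (end index exclusive).
def covered_text(text, entity)
  first, last = entity[:indices]
  text[first...last]
end

text = "see https://example.com/ for details"
urls = [{ url: "https://example.com/", indices: [4, 24] }]
# Hypothetical output of some other extractor, overlapping the URL.
others = [{ mention: "example", indices: [12, 19] }]

merged = remove_overlapping(urls + others)
puts covered_text(text, merged.first) # prints https://example.com/
```

Keeping the earliest-starting match is only one possible policy; preferring the longest match, or ranking by entity type, would need a different comparison in `remove_overlapping`.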